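Box Plot

Quartiles are explained clearly in both the English and the Chinese Wikipedia articles.

From the Wikipedia entry for Box Plot:

A box plot, also known as a box-and-whisker plot or box-and-whisker diagram, is a statistical chart that shows how a set of data is spread out. It is named for its box-like shape. It is used in many fields and is especially common in quality control, although it is relatively tedious to construct.
The box plot was invented in 1977 by the American statistician John Tukey. It shows the maximum, minimum, median, lower quartile, and upper quartile of a set of data.

Here is a concrete example of a box plot: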

                            +-----+-+
  *           o     |-------|   + | |---|
                            +-----+-+

+---+---+---+---+---+---+---+---+---+---+   number line
0   1   2   3   4   5   6   7   8   9   10

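This data set shows:

Minimum (min) = 5 (the smallest value that is not an outlier; the points marked * and o lie below it).
Lower quartile (Q1) = 7.
Median (Med) = 8.5.
Upper quartile (Q3) = 9.
Maximum (max) = 10.
Mean = 8 (marked + in the plot).
Interquartile range (IQR) = Q3 − Q1 = 2.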

http://www.physics.csbsju.edu/stats/box2.html

The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. In the simplest box plot the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR). A segment inside the rectangle shows the median and "whiskers" above and below the box show the locations of the minimum and maximum.
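The five number summary described above can be sketched in a few lines of Python. This is only an illustration: the function name `five_number_summary` and the sample values are made up, and the quartiles use the "inclusive" convention of the standard `statistics` module, which is one of several conventions and may differ slightly from what a given plotting tool uses.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    # method="inclusive" treats the data as the whole population;
    # other quartile conventions give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

# Hypothetical sample, chosen only for illustration.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
low, q1, median, q3, high = five_number_summary(sample)
iqr = q3 - q1  # height of the central rectangle
print(low, q1, median, q3, high, iqr)
```

For this sample the box would span 3 to 7 with the median line at 5, and the whiskers would reach 1 and 9.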

This simplest possible box plot displays the full range of variation (from min to max), the likely range of variation (the IQR), and a typical value (the median). Not uncommonly real datasets will display surprisingly high maximums or surprisingly low minimums called outliers. John Tukey has provided a precise definition for two types of outliers:

Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the first quartile.
Suspected outliers are slightly more central versions of outliers: either 1.5×IQR or more above the third quartile or 1.5×IQR or more below the first quartile.

If either type of outlier is present the whisker on the appropriate side is taken to 1.5×IQR from the quartile (the "inner fence") rather than the max or min, and individual outlying data points are displayed as unfilled circles (for suspected outliers) or filled circles (for outliers). (The "outer fence" is 3×IQR from the quartile.)
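The fence rules above can be sketched as follows. The function name `tukey_fences` and the sample data are hypothetical, and the quartiles again use the "inclusive" convention of Python's `statistics` module, so fences computed by other tools may differ slightly.

```python
import statistics

def tukey_fences(data):
    """Classify points with the 1.5*IQR (inner) and 3*IQR (outer) fences."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    # "3*IQR or more" from the quartile: on or beyond an outer fence.
    outliers = [x for x in data if x <= outer[0] or x >= outer[1]]
    # On or beyond an inner fence, but not already counted as an outlier.
    suspected = [x for x in data
                 if (x <= inner[0] or x >= inner[1]) and x not in outliers]
    return inner, outer, suspected, outliers

# Hypothetical sample with two unusually large values.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 25]
inner, outer, suspected, outliers = tukey_fences(sample)
print("inner fences:", inner)
print("outer fences:", outer)
print("suspected outliers:", suspected)
print("outliers:", outliers)
```

Here Q1 = 3.5 and Q3 = 8.5, so the inner fences are at -4 and 16 and the outer fences at -11.5 and 23.5; the value 18 is a suspected outlier and 25 is an outlier.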

If the data happens to be normally distributed,

IQR = 1.35 σ

where σ is the population standard deviation.
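The factor 1.35 can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library and takes σ = 1, so the interquartile range comes out directly in units of σ.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = standard_normal.inv_cdf(0.25)  # first quartile, about -0.674
q3 = standard_normal.inv_cdf(0.75)  # third quartile, about +0.674
iqr_in_sigmas = q3 - q1

print(round(iqr_in_sigmas, 3))  # 1.349
```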

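Suspected outliers are not uncommon in large normally distributed datasets (say more than 100 data-points), since about 0.7% of normally distributed points lie beyond an inner fence. Outliers are much rarer: with the 3×IQR definition above, an outer fence lies about 4.7σ from the mean, so only a few points in a million lie beyond it and outliers are expected only in very large normally distributed datasets (several hundred thousand data-points or more). Here is an example of 1000 normally distributed data displayed as a box plot: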
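How often a normally distributed point falls beyond the fences can be estimated directly. The sketch below assumes a perfect normal distribution with the exact population quartiles (real samples only approximate this), and the helper name `fraction_beyond` is made up for illustration.

```python
from statistics import NormalDist

z = NormalDist()        # standard normal, sigma = 1
q3 = z.inv_cdf(0.75)    # third quartile in units of sigma
iqr = 2 * q3            # by symmetry, Q1 = -Q3

def fraction_beyond(k):
    """Two-sided probability of lying k*IQR or more beyond a quartile."""
    fence = q3 + k * iqr
    return 2 * (1 - z.cdf(fence))

p_inner = fraction_beyond(1.5)  # beyond an inner fence
p_outer = fraction_beyond(3.0)  # beyond an outer fence

for n in (100, 1000):
    print(n, "points: about", round(n * p_inner, 1), "beyond the inner fences")
print("fraction beyond the outer fences:", p_outer)
```

Under these assumptions roughly 0.7% of points fall beyond an inner fence, about 7 points in a sample of 1000, while the fraction beyond an outer fence is on the order of a few in a million.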

Note that outliers are not necessarily "bad" data-points; indeed they may well be the most important, most information rich, part of the dataset. Under no circumstances should they be automatically removed from the dataset. Outliers may deserve special consideration: they may be the key to the phenomenon under study or the result of human blunders.

Example A

Consider two datasets:

A1={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}

A2={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}

Notice that both datasets are approximately balanced around zero; evidently the mean in both cases is "near" zero. However there is substantially more variation in A2 which ranges approximately from -6 to 6 whereas A1 ranges approximately from -2½ to 2½.
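The comparison can be reproduced from the two datasets above. The helper name `spread` is made up for this sketch, and the quartiles use the "inclusive" convention of Python's `statistics` module, so the IQR values may differ slightly from those of other tools.

```python
import statistics

A1 = [0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72,
      0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09]
A2 = [-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43,
      7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50]

def spread(data):
    """Summarise centre and variation of one dataset."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "min": min(data),
        "max": max(data),
        "iqr": q3 - q1,
    }

for name, data in (("A1", A1), ("A2", A2)):
    s = spread(data)
    print(name, "mean=%.2f" % s["mean"], "range=[%.2f, %.2f]" % (s["min"], s["max"]),
          "IQR=%.2f" % s["iqr"])
```

Both means are within one unit of zero, while the range and the IQR of A2 are several times those of A1.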

Below find box plots and the more traditional error bar plots (with 1-σ bars). Notice the difference in scales: since the box plot is displaying the full range of variation, the y-range must be expanded.

Example B

B1={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}

B2= {2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19}

Notice that the datasets span much the same range of values (from about 0.1 to about 50) and that all the values are positive. Most of the B1 values are less than one whereas most of the B2 values are more than one. We can use a log scale to better display this large range of values:

On the other hand, a straightforward plot of the sample means and population standard deviations suggests negative values (which prevents use of a log scale) and broad overlap between the two distributions. (A t-test would suggest B1 and B2 are not significantly different.)
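The contrast between the two views can be checked on the B1 and B2 values listed above. The helper name `summarize` is made up for this sketch; it compares the median with a mean ± 1σ error bar computed from the population standard deviation.

```python
import statistics

B1 = [1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20,
      0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38]
B2 = [2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11,
      27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19]

def summarize(data):
    """Median-based view versus mean +/- one population standard deviation."""
    mean = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return {
        "median": statistics.median(data),
        "below_one": sum(x < 1 for x in data),  # how many values are < 1
        "bar_low": mean - sigma,                # lower end of the error bar
        "bar_high": mean + sigma,               # upper end of the error bar
    }

for name, data in (("B1", B1), ("B2", B2)):
    s = summarize(data)
    print(name, "median=%.2f" % s["median"], "values<1: %d/20" % s["below_one"],
          "1-sigma bar=[%.2f, %.2f]" % (s["bar_low"], s["bar_high"]))
```

The medians fall on opposite sides of one, yet both error bars extend below zero even though every value is positive, because a few very large values dominate the mean and the standard deviation.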

Example C

One case of particular concern, where a box plot can be deceptive, is when the data are distributed into "two lumps" rather than the "one lump" cases we've considered so far.

A "bee swarm" plot shows that in this dataset there are lots of data near 10 and 15 but relatively few in between. Note that a box plot would not give you any evidence of this.
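The effect can be illustrated with made-up numbers. The data below are hypothetical (they are not the dataset from the original figure): nine points near 10, nine near 15 and a single point in between. The helper `count_near` is likewise only for illustration.

```python
import statistics

# Hypothetical two-lump data: nine points near 10, nine near 15, one between.
offsets = [-0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4]
data = [10 + d for d in offsets] + [12.5] + [15 + d for d in offsets]

# What a box plot would draw: the quartiles and the median.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

def count_near(center, radius=0.5):
    """Number of data points within `radius` of `center`."""
    return sum(abs(x - center) <= radius for x in data)

print("box from %.2f to %.2f, median %.2f" % (q1, q3, median))
print("points near 10:", count_near(10))
print("points near 15:", count_near(15))
print("points near the median:", count_near(median))
```

The box stretches from just above 10 to just below 15 with the median line at 12.5, which is exactly where the data are sparsest; only a plot of the individual points reveals the two lumps.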

Posted 2009-10-20 11:01 by 杰哥. Category: Matlab
