Descriptive Statistics
Descriptive statistics can best be viewed as a bag of tricks for presenting the essential information in a data set so that a reader can readily interpret it. The bag includes a number of pictures/graphs - histograms, stem plots, time plots, box plots - that give us a general picture of the data. There are also a number of formulas/indices - mean, median, quartiles, standard deviation - that each give us a number describing some facet of the data. For an on-line discussion of descriptive statistics you should check out the UCLA On-line Statistics Course, the Electronic Textbook, Mathematics 220DX Statistics at the New Hampshire College, the DAU Stat Refresher, Hyperstat Online by David Lane at Rice University, and INTRODUCTION TO QUANTITATIVE METHODS.
To see what we are talking about with descriptive statistics, let's talk about grades, something that should catch your interest. [You might also want to check out the example based on rolling dice]. You remember how, after each exam, you wanted to know the results and what they would mean for your grade in the course. Now let's look at grades from the instructor's side. Below you will find the complete data set for the first test in a recent course (ECN202). There were 52 students, who received the following scores on their exams.
Grades on Exam 1 in ECN202
| Student | Grade | Student | Grade | Student | Grade | Student | Grade |
| 1 | 91.983 | 14 | 97.531 | 27 | 96.8659 | 40 | 89.0001 |
| 2 | 91.7597 | 15 | 98.2979 | 28 | 76.2859 | 41 | 76.5876 |
| 3 | 87.9158 | 16 | 70.4242 | 29 | 99.6804 | 42 | 93.3512 |
| 4 | 77.0586 | 17 | 72.6251 | 30 | 87.6299 | 43 | 88.82 |
| 5 | 98.7479 | 18 | 86.9584 | 31 | 89.6395 | 44 | 77.4919 |
| 6 | 79.8029 | 19 | 95.2241 | 32 | 85.6969 | 45 | 94.7336 |
| 7 | 80.5968 | 20 | 91.9544 | 33 | 95.2098 | 46 | 75.4139 |
| 8 | 77.8953 | 21 | 80.2882 | 34 | 71.9719 | 47 | 86.5368 |
| 9 | 96.1051 | 22 | 77.2291 | 35 | 92.3448 | 48 | 93.7865 |
| 10 | 74.1581 | 23 | 93.1482 | 36 | 74.4269 | 49 | 73.8672 |
| 11 | 83.152 | 24 | 75.1727 | 37 | 82.7137 | 50 | 75.7028 |
| 12 | 91.7678 | 25 | 87.5282 | 38 | 77.8714 | 51 | 73.415 |
| 13 | 78.1368 | 26 | 80.501 | 39 | 71.828 | 52 | 74.7755 |
While this entire data set may be useful to you for some purposes, I suspect you were not interested in studying the entire class results. Furthermore, I suspect if you looked at the table for 30 seconds and were asked to describe what you saw, you would have considerable difficulty capturing the essential features of the data set. It is for this reason that we have descriptive statistics that allow us to summarize these data with a few pictures and numbers. As we work our way through these summarizing techniques, you should realize that in general we will not be able to reproduce the exact data from the summary statistics. Some information is lost in the process, but the loss should be more than offset by the gains that we have in terms of insight into the underlying data.
As a starter, the data can be transformed from a table to a graph. One approach would be to simply create a bar graph as we did in a previous section. The first step in creating the diagram is to round off the grades to the nearest whole number and then sort them by score. This produces the following table.
Test Scores
| Score | # of Tests |
| 70 | 1 |
| 71 | |
| 72 | 2 |
| 73 | 2 |
| 74 | 3 |
| 75 | 3 |
| 76 | 2 |
| 77 | 4 |
| 78 | 3 |
| 79 | |
| 80 | 2 |
| 81 | 2 |
| 82 | |
| 83 | 2 |
| 84 | |
| 85 | |
| 86 | 1 |
| 87 | 2 |
| 88 | 3 |
| 89 | 2 |
| 90 | 1 |
| 91 | |
| 92 | 5 |
| 93 | 2 |
| 94 | 1 |
| 95 | 3 |
| 96 | 1 |
| 97 | 1 |
| 98 | 2 |
| 99 | 1 |
| 100 | 1 |
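The rounding and tallying behind this table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build the original table: the names `grades`, `rounded`, and `score_counts` are introduced here, and rounding is done half-up with `math.floor(g + 0.5)`.

```python
import math
from collections import Counter

# Exam 1 grades for students 1 through 52, copied from the grade table.
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968,
    77.8953, 96.1051, 74.1581, 83.152, 91.7678, 78.1368,
    97.531, 98.2979, 70.4242, 72.6251, 86.9584, 95.2241, 91.9544,
    80.2882, 77.2291, 93.1482, 75.1727, 87.5282, 80.501,
    96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969, 95.2098,
    71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828,
    89.0001, 76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139,
    86.5368, 93.7865, 73.8672, 75.7028, 73.415, 74.7755,
]

# Round each grade to the nearest whole number (half-up), then sort.
rounded = sorted(math.floor(g + 0.5) for g in grades)

# Tally how many tests landed on each whole-number score.
score_counts = Counter(rounded)

# Print every score from lowest to highest, leaving empty scores blank.
for score in range(min(rounded), max(rounded) + 1):
    count = score_counts.get(score, 0)
    print(f"{score:>3} | {count if count else ''}")
```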
If we take these data and create a column graph with the score on the horizontal axis and the number of tests on the vertical axis, showing how many students received each score, we get the following diagram. In this example we can see that four students received a 77 and five students received a 92.

We could also create a second graph which will look exactly like the first except that we will graph the relative frequency against the scores. For example, a score of 92 is received by approximately 10 percent of the students (5 of 52).
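As a quick check on the arithmetic, a relative frequency is just the count at a score divided by the total number of tests. The sketch below assumes the counts from the Test Scores table, keyed by score in a dictionary named `score_counts` (a name introduced here for illustration).

```python
# Number of tests at each rounded score (non-empty rows of the Test Scores table).
score_counts = {
    70: 1, 72: 2, 73: 2, 74: 3, 75: 3, 76: 2, 77: 4, 78: 3,
    80: 2, 81: 2, 83: 2, 86: 1, 87: 2, 88: 3, 89: 2, 90: 1,
    92: 5, 93: 2, 94: 1, 95: 3, 96: 1, 97: 1, 98: 2, 99: 1, 100: 1,
}

total_tests = sum(score_counts.values())  # 52 students

# Relative frequency = count at a score / total number of tests.
relative_frequency = {
    score: count / total_tests for score, count in score_counts.items()
}

# Share of students who scored 92 (5 of 52, a little under 10 percent).
print(f"{relative_frequency[92]:.1%}")
```

Because the relative frequencies are the counts divided by a constant, the second graph has the same shape as the first; only the vertical scale changes.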

These two graphs give us a picture of the distribution of ECN202 grades for the entire class.
You could also use the data analysis tools in Excel to create a histogram that looks very similar to what you created above. The most important difference is that you do not need to sort the data to construct the graph. A histogram for the data, one that you can access to examine the procedure for generating it, appears below. Here the grades are grouped. We have four students who received below 73 and six students who received between 85 and 88.
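The grouping step can also be sketched outside Excel. The bin boundaries below are an assumption: the text only mentions "below 73" and "between 85 and 88", so bins of width 3 with inclusive upper bounds are used here for illustration, and the names `bin_upper_bounds` and `bin_counts` are hypothetical. As noted above, the unsorted grades can be used directly.

```python
from bisect import bisect_left

# Exam 1 grades for students 1 through 52, copied from the grade table (unsorted).
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968,
    77.8953, 96.1051, 74.1581, 83.152, 91.7678, 78.1368,
    97.531, 98.2979, 70.4242, 72.6251, 86.9584, 95.2241, 91.9544,
    80.2882, 77.2291, 93.1482, 75.1727, 87.5282, 80.501,
    96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969, 95.2098,
    71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828,
    89.0001, 76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139,
    86.5368, 93.7865, 73.8672, 75.7028, 73.415, 74.7755,
]

# Assumed bin upper bounds; the bins used for the original histogram are not listed.
bin_upper_bounds = [73, 76, 79, 82, 85, 88, 91, 94, 97, 100]

# A grade goes into the first bin whose upper bound is >= the grade,
# so each bin covers (previous bound, this bound]. No grade exceeds 100.
bin_counts = [0] * len(bin_upper_bounds)
for g in grades:
    bin_counts[bisect_left(bin_upper_bounds, g)] += 1

# Text histogram: one star per student in the bin.
for upper, count in zip(bin_upper_bounds, bin_counts):
    print(f"<= {upper:>3}: {'*' * count} ({count})")
```

With these assumed bins, the first bin holds the four students below 73 and the bin ending at 88 holds the six students between 85 and 88, matching the counts quoted above.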

Before you leave histograms behind, you should check out an on-line interactive example of a histogram, and the Mathematics 220DX Statistics at the New Hampshire College, and David Lane's site.
Once you are comfortable with looking at the data set graphically, we can then look at specific features of the distribution of scores. Generally there are four features of the distribution we are concerned with: modality, symmetry, central tendency, and variability. Of the four, modality and symmetry are the easiest to visualize. Any value at which the frequency curve or relative frequency curve reaches a peak is called a mode. Most distributions in practice have one peak and are described as "unimodal." A distribution with two peaks is called "bimodal." The distribution of grades pictured above has a number of modes and would not be easily characterized by the mode.
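To see why the mode is a poor summary here, one can count the peaks in the rounded test scores. The sketch below is a simple illustration, not a standard library routine: `local_peaks` is a hypothetical helper that treats a score as a peak only when its count is strictly higher than both neighbouring scores, so flat-topped peaks are ignored.

```python
# Number of tests at each rounded score (non-empty rows of the Test Scores table).
score_counts = {
    70: 1, 72: 2, 73: 2, 74: 3, 75: 3, 76: 2, 77: 4, 78: 3,
    80: 2, 81: 2, 83: 2, 86: 1, 87: 2, 88: 3, 89: 2, 90: 1,
    92: 5, 93: 2, 94: 1, 95: 3, 96: 1, 97: 1, 98: 2, 99: 1, 100: 1,
}


def local_peaks(counts, lowest, highest):
    """Return scores whose count is strictly above both neighbouring scores."""
    peaks = []
    for score in range(lowest, highest + 1):
        here = counts.get(score, 0)
        left = counts.get(score - 1, 0)    # scores with no tests count as 0
        right = counts.get(score + 1, 0)
        if here > left and here > right:
            peaks.append(score)
    return peaks


peaks = local_peaks(score_counts, 70, 100)

# The single most frequent score, for comparison with the list of peaks.
most_frequent = max(score_counts, key=score_counts.get)

print("peaks:", peaks)
print("most frequent score:", most_frequent)
```

The most frequent score is 92, but it is only one of several peaks, which is why a single mode does not describe this distribution well.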
A distribution is said to be symmetric if the relative frequency is the same at equal distances on either side of its center. In the above example, if the midpoint of the distribution were 85, then symmetry would imply that the same number of students received a grade of 90 as a grade of 80, and that the number that received a grade of 75 equaled the number that received a 95. The mean and median, concepts that we will discuss in the next section, are equal in a symmetric distribution. An asymmetric frequency distribution is skewed to the left if the lower tail is longer than the upper tail and skewed to the right if the upper tail is longer than the lower tail. To understand the concept of symmetry fully, however, we need to look at the two additional concepts, Central Tendency and Variability.
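A small numerical sketch may help fix these ideas. The three data sets below are invented toy examples, not course data, and `moment_skewness` is a hypothetical helper that uses the simple population formula (third central moment divided by the second central moment raised to the power 1.5); spreadsheet programs may apply a sample-size adjustment, so their figures can differ slightly.

```python
from statistics import mean, median


def moment_skewness(values):
    """Population skewness: m3 / m2**1.5, where mk is the k-th central moment."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Toy data sets (assumed for illustration only).
symmetric = [75, 80, 85, 90, 95]       # mirror image around 85
skewed_right = [70, 72, 75, 80, 100]   # long upper tail
skewed_left = [60, 90, 92, 95, 98]     # long lower tail

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    print(f"{name:<13} mean={mean(data):6.2f} "
          f"median={median(data):6.2f} skewness={moment_skewness(data):+.3f}")
```

In the symmetric set the mean and median coincide and the skewness is zero; in these toy sets the long upper tail pulls the mean above the median, and the long lower tail pulls it below.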
As you work your way through the statistical analysis, keep in mind that one of the wonders of modern computers and statistical software is that they can make statistical analysis quite simple. Consider the situation where you have been asked to analyze voter turnout in the 1992 Presidential election; the data, as provided by the Federal Election Commission (FEC), appear below. With these data you can highlight the values in the second column and then select Data Analysis in the Tools menu of Excel.
| STATE | 1992 Voter Turnout (%) |
| Alabama | 55.2% |
| Alaska | 65.4% |
| Arizona | 54.1% |
| Arkansas | 53.8% |
| California | 49.1% |
| Colorado | 62.7% |
| Connecticut | 63.8% |
| Delaware | 55.2% |
| District of Columbia | 49.6% |
| Florida | 50.2% |
| Georgia | 46.7% |
| Hawaii | 41.9% |
| Idaho | 65.2% |
| Illinois | 58.9% |
| Indiana | 55.2% |
| Iowa | 65.3% |
| Kansas | 63.0% |
| Kentucky | 53.7% |
| Louisiana | 59.8% |
| Maine | 72.0% |
| Maryland | 53.4% |
| Massachusetts | 60.2% |
| Michigan | 61.8% |
| Minnesota | 71.6% |
| Mississippi | 52.8% |
| Missouri | 62.0% |
| Montana | 70.1% |
| Nebraska | 63.2% |
| Nevada | 50.0% |
| New Hampshire | 63.1% |
| New Jersey | 56.3% |
| New Mexico | 51.6% |
| New York | 50.9% |
| North Carolina | 50.1% |
| North Dakota | 67.3% |
| Ohio | 60.6% |
| Oklahoma | 59.7% |
| Oregon | 65.7% |
| Pennsylvania | 54.2% |
| Rhode Island | 58.4% |
| South Carolina | 45.0% |
| South Dakota | 67.0% |
| Tennessee | 52.4% |
| Texas | 49.1% |
| Utah | 65.2% |
| Vermont | 67.5% |
| Virginia | 52.8% |
| Washington | 59.9% |
| West Virginia | 50.7% |
| Wisconsin | 69.0% |
| Wyoming | 62.3% |
| UNITED STATES | 55.2% |
Once you have selected Data Analysis, a dialogue box will appear and you should select Descriptive Statistics. You will then get a new dialogue box, and in the Input Range you should type the first and last cells of the turnout column separated by a colon - in my example it was b8:b58. You will also want to select New Worksheet Ply and Summary statistics, which will generate the following table. Included here are the basic measures of central tendency (mean, median, mode) and variability (standard deviation, variance, and range). You will also note the minimum and maximum values and the number of observations (count). As for measures of central tendency, the mean was 58% and the median was 59%. The voter participation rate ranged from approximately 42% to 72%, with a standard deviation of about 7.3 percentage points.
| Statistic | Column1 |
| Mean | 0.581318 |
| Standard Error | 0.010279 |
| Median | 0.5894 |
| Mode | 0.6515 |
| Standard Deviation | 0.073408 |
| Sample Variance | 0.005389 |
| Kurtosis | -0.84522 |
| Skewness | -0.03072 |
| Range | 0.3004 |
| Minimum | 0.4194 |
| Maximum | 0.7198 |
| Sum | 29.6472 |
| Count | 51 |
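Most of these summary figures can be reproduced with Python's standard `statistics` module. The sketch below is an approximation under stated assumptions: it uses the turnout percentages as printed in the table (rounded to one decimal place, with the UNITED STATES row left out), so the results agree with the spreadsheet output only to about that precision, and values are in percent rather than fractions. The mode is omitted because rounding creates ties that the unrounded spreadsheet data do not have. The names `turnout` and `summary` are introduced here for illustration.

```python
import math
import statistics

# 1992 turnout (%) for the 50 states and DC, in table order (US total omitted).
turnout = [
    55.2, 65.4, 54.1, 53.8, 49.1, 62.7, 63.8, 55.2, 49.6, 50.2,
    46.7, 41.9, 65.2, 58.9, 55.2, 65.3, 63.0, 53.7, 59.8, 72.0,
    53.4, 60.2, 61.8, 71.6, 52.8, 62.0, 70.1, 63.2, 50.0, 63.1,
    56.3, 51.6, 50.9, 50.1, 67.3, 60.6, 59.7, 65.7, 54.2, 58.4,
    45.0, 67.0, 52.4, 49.1, 65.2, 67.5, 52.8, 59.9, 50.7, 69.0,
    62.3,
]

n = len(turnout)
std_dev = statistics.stdev(turnout)  # sample standard deviation (n - 1 divisor)

summary = {
    "Mean": statistics.mean(turnout),
    "Standard Error": std_dev / math.sqrt(n),  # std dev / sqrt(count)
    "Median": statistics.median(turnout),
    "Standard Deviation": std_dev,
    "Sample Variance": statistics.variance(turnout),
    "Range": max(turnout) - min(turnout),
    "Minimum": min(turnout),
    "Maximum": max(turnout),
    "Sum": sum(turnout),
    "Count": n,
}

for name, value in summary.items():
    print(f"{name:<20}{value:10.4f}")
```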
To see the actual work in Excel you can check out one of the spreadsheets that contains some examples of Excel statistics. Descriptive Statistics 1 contains an example of descriptive statistics (desc stats sheet) and a histogram (Histogram). The desc stats sheet provides the summary statistics for a set of 199 grades that were generated by selecting Data Analysis under the Tools menu and then choosing Descriptive Statistics. The input is specified by simply highlighting the column of data and the output is specified to appear in cell D4.
The histogram is a bit more difficult if you want to specify the categories. You begin by selecting Data Analysis on the Tools menu. Here you will select Histogram, and you can simply create a histogram by highlighting the data and specifying that the output be put in cell D46. The program determines the horizontal axis for you, and as you can see, it is very often not what you are interested in. You can override this by entering a range in the Bin Range box. This lets you set the breakdown of grades that determines the groupings. I have chosen a grouping where there are no fractions or decimals. I typed the numbers in cells D3 through D14. This sets the categories. I then selected the input data and in the Bin Range box I highlighted the cells D4 through D14 and asked for the output to appear in cell E3. You can see this scatter diagram is an improvement over the
To understand the concepts that appear in this table you should look at the two concepts, Central Tendency and Variability.