Descriptive Statistics

Descriptive statistics can best be viewed as a bag of tricks for presenting the essential information in a data set so that the reader can readily interpret it. The bag of tricks includes a number of pictures and graphs (histograms, stem plots, time plots, box plots) that give us a general picture of the data. There are also a number of formulas and indices (mean, median, quartiles, standard deviation) that each give us a single number describing some facet of the data. For an on-line discussion of descriptive statistics you should check out the UCLA On-line Statistics Course; the electronic textbook for Mathematics 220DX Statistics at New Hampshire College; the DAU Stat Refresher; HyperStat Online by David Lane at Rice University; and Introduction to Quantitative Methods.

To see what we are talking about with descriptive statistics, let's talk about grades, something that should catch your interest. [You might also want to check out the example based on rolling dice.] You remember how, after each exam, you wanted to know the results and what they would mean for your grade in the course. Now let's look at grades from the instructor's side. Below you will find the complete data set for the first test in a recent course (ECN202). The 52 students in the course received the following scores on their exams.

Grades on Exam 1 in ECN202

Student Grade Student Grade Student Grade Student Grade
1 91.983 14 97.531 27 96.8659 40 89.0001
2 91.7597 15 98.2979 28 76.2859 41 76.5876
3 87.9158 16 70.4242 29 99.6804 42 93.3512
4 77.0586 17 72.6251 30 87.6299 43 88.82
5 98.7479 18 86.9584 31 89.6395 44 77.4919
6 79.8029 19 95.2241 32 85.6969 45 94.7336
7 80.5968 20 91.9544 33 95.2098 46 75.4139
8 77.8953 21 80.2882 34 71.9719 47 86.5368
9 96.1051 22 77.2291 35 92.3448 48 93.7865
10 74.1581 23 93.1482 36 74.4269 49 73.8672
11 83.152 24 75.1727 37 82.7137 50 75.7028
12 91.7678 25 87.5282 38 77.8714 51 73.415
13 78.1368 26 80.501 39 71.828 52 74.7755

While this entire data set may be useful to you for some purposes, I suspect you were not interested in studying the entire class results. Furthermore, I suspect if you looked at the table for 30 seconds and were asked to describe what you saw, you would have considerable difficulty capturing the essential features of the data set.  It is for this reason that we have descriptive statistics that allow us to summarize these data with a few pictures and numbers.  As we work our way through these summarizing techniques, you should realize that in general we will not be able to reproduce the exact data from the summary statistics.  Some information is lost in the process, but the loss should be more than offset by the gains that we have in terms of insight into the underlying data.

As a starter, the data can be transformed from a table to a graph.  One approach would be to simply create a bar graph as we did in a previous section.  The first step in creating the diagram is to round off the grades to the nearest whole number and then sort them by score.  This produces the following table.

Test Scores

Score # of Tests
70 1
71 0
72 2
73 2
74 3
75 3
76 2
77 4
78 3
79 0
80 2
81 2
82 0
83 2
84 0
85 0
86 1
87 2
88 3
89 2
90 1
91 0
92 5
93 2
94 1
95 3
96 1
97 1
98 2
99 1
100 1

If we take these data and create a column graph, with the score on the horizontal axis and the number of tests on the vertical axis, showing how many students received each score, we get the following diagram. In this example we can see that four students received a 77 and five students received a 92.

We could also create a second graph, which looks exactly like the first except that it plots the relative frequency against the scores. For example, a score of 92 was received by approximately 10 percent of the students (5 of 52).

These two graphs give us a picture of the distribution of ECN202 grades for the entire class. 
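If you would rather let a program do the rounding and counting, the steps described above can be sketched in a few lines of Python. This is only an illustration and not the way the original table was built: the variable names (grades, rounded, counts, rel_freq) are our own, and the grades are copied from the Exam 1 table in student order.

```python
import math
from collections import Counter

# Exam 1 grades copied from the table above (students 1 through 52).
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968, 77.8953,
    96.1051, 74.1581, 83.152, 91.7678, 78.1368, 97.531, 98.2979, 70.4242,
    72.6251, 86.9584, 95.2241, 91.9544, 80.2882, 77.2291, 93.1482, 75.1727,
    87.5282, 80.501, 96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969,
    95.2098, 71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828, 89.0001,
    76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139, 86.5368, 93.7865,
    73.8672, 75.7028, 73.415, 74.7755,
]

# Round each grade to the nearest whole number (halves round up).
rounded = [math.floor(g + 0.5) for g in grades]

# Tally how many tests received each rounded score.
counts = Counter(rounded)
n = len(grades)

# Relative frequency = count / total number of tests.
rel_freq = {score: counts[score] / n for score in sorted(counts)}

# Print score, frequency, relative frequency, and a crude text bar.
for score in range(min(rounded), max(rounded) + 1):
    c = counts.get(score, 0)
    print(f"{score:3d}  {c:2d}  {c / n:6.1%}  {'#' * c}")
```

The rows of '#' characters are a rough stand-in for the column graph, and the percentage column corresponds to the relative frequency graph.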

You could also use the data analysis tools in Excel to create a histogram that looks very similar to the one you created above. The most important difference is that you do not need to sort the data to construct the graph. A histogram for the data, one that you can access to examine the procedure for generating it, appears below. Here the grades are grouped: four students received below 73, and six students received between 85 and 88.
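To see what the grouping involves, here is a minimal sketch in Python. The bin boundaries are an assumption on our part, since the text does not list them: we use upper edges of 73, 76, ..., 100 because they reproduce the two counts quoted above. We also assume each group holds the grades above the previous edge up to and including its own edge. The names edges and bin_counts are ours.

```python
from bisect import bisect_left

# Exam 1 grades copied from the table above (students 1 through 52).
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968, 77.8953,
    96.1051, 74.1581, 83.152, 91.7678, 78.1368, 97.531, 98.2979, 70.4242,
    72.6251, 86.9584, 95.2241, 91.9544, 80.2882, 77.2291, 93.1482, 75.1727,
    87.5282, 80.501, 96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969,
    95.2098, 71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828, 89.0001,
    76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139, 86.5368, 93.7865,
    73.8672, 75.7028, 73.415, 74.7755,
]

# Hypothetical upper bin edges (73, 76, ..., 100); the original bins are not given.
edges = list(range(73, 101, 3))


def bin_counts(values, edges):
    """Count values per bin; bin i holds edges[i-1] < x <= edges[i]."""
    counts = [0] * (len(edges) + 1)  # the last slot catches values above the final edge
    for x in values:
        # bisect_left returns the index of the first edge that is >= x
        counts[bisect_left(edges, x)] += 1
    return counts


counts = bin_counts(grades, edges)
for edge, c in zip(edges, counts):
    print(f"up to {edge:3d}  {c:2d}  {'#' * c}")
```

Notice that the grades never had to be sorted; each grade is simply dropped into the group where it belongs.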

Before you leave histograms behind, you should check out an on-line interactive example of a histogram, the Mathematics 220DX Statistics course at New Hampshire College, and David Lane's site.

Once you are comfortable with looking at the data set graphically, we can then look at specific features of the distribution of scores. Generally there are four features of the distribution we are concerned with: modality, symmetry, central tendency, and variability.  Of the four, modality and symmetry are the easiest to visualize. Any value at which the frequency curve or relative frequency curve reaches a peak is called a mode. Most distributions in practice have one peak and are described as "unimodal."  A distribution with two peaks is called "bimodal."  The distribution of grades pictured above has a number of modes and would not be easily characterized by the mode.
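To make the idea of "a number of modes" concrete, the sketch below looks for peaks in the Test Scores table. The rule used here (a score counts as a peak only if its frequency is strictly higher than that of both neighboring scores) is our own simplification, so flat tops such as 74 and 75 are not reported. The names counts, local_peaks, and highest are ours.

```python
# Nonzero counts copied from the Test Scores table; scores not listed have count 0.
counts = {70: 1, 72: 2, 73: 2, 74: 3, 75: 3, 76: 2, 77: 4, 78: 3, 80: 2,
          81: 2, 83: 2, 86: 1, 87: 2, 88: 3, 89: 2, 90: 1, 92: 5, 93: 2,
          94: 1, 95: 3, 96: 1, 97: 1, 98: 2, 99: 1, 100: 1}


def local_peaks(counts):
    """Scores whose count is strictly higher than both neighboring scores."""
    peaks = []
    for score in range(min(counts), max(counts) + 1):
        here = counts.get(score, 0)
        if here > counts.get(score - 1, 0) and here > counts.get(score + 1, 0):
            peaks.append(score)
    return peaks


peaks = local_peaks(counts)
highest = max(counts, key=counts.get)  # the single most frequent score

print("local peaks:", peaks)
print("most frequent score:", highest)
```

With this many separate peaks in a class of 52, you can see why the mode alone is a poor summary of these grades.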

A distribution is said to be symmetric if the relative frequency is the same at equal distances on either side of its center. In the above example, if the midpoint of the distribution were 85, then symmetry would imply that the same number of students received a grade of 90 as a grade of 80, and that the number who received a grade of 75 equaled the number who received a 95. The mean and median, concepts that we will discuss in the next section, are equal in a symmetric distribution. An asymmetric frequency distribution is skewed to the left if the lower tail is longer than the upper tail and skewed to the right if the upper tail is longer than the lower tail. To understand the concept of symmetry fully, however, we need to look at two additional concepts, Central Tendency and Variability.
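A small sketch may help fix the idea of symmetry. The three grade lists below are made up purely for illustration and are not the ECN202 data, and the helper is_symmetric is our own name. The first list is symmetric around 85, the second has a longer upper tail, and the third has a longer lower tail.

```python
from statistics import mean, median

# Hypothetical grade lists, invented for illustration only.
symmetric = [75, 80, 80, 85, 85, 85, 90, 90, 95]
right_skewed = [75, 75, 75, 80, 80, 85, 90, 100]   # longer upper tail
left_skewed = [75, 85, 90, 95, 95, 100, 100, 100]  # longer lower tail


def is_symmetric(values, center):
    """True if every value occurs as often as its mirror image around center."""
    return all(values.count(2 * center - v) == values.count(v) for v in values)


for name, data in [("symmetric", symmetric),
                   ("right_skewed", right_skewed),
                   ("left_skewed", left_skewed)]:
    print(f"{name:13s} mean={mean(data):6.2f}  median={median(data):6.2f}")

print("symmetric around 85?", is_symmetric(symmetric, 85))
```

In the symmetric list the mean and median coincide at 85. In these two particular skewed lists the mean is pulled toward the longer tail, although that is an observation about these examples rather than a rule stated above.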

As you work your way through the statistical analysis, keep in mind that one of the wonders of modern computers and statistical software is that they can make statistical analysis quite simple. Consider the situation where you have been asked to analyze the voter turnout data for the 1992 Presidential election, as provided by the Federal Election Commission (FEC), which appear below. With these data you can highlight the data in the second column and then select Data Analysis in the Tools menu of Excel.

1992 Voter Turnout (%)

STATE Voter Turnout (%)
Alabama 55.2%
Alaska 65.4%
Arizona 54.1%
Arkansas 53.8%
California 49.1%
Colorado 62.7%
Connecticut 63.8%
Delaware 55.2%
District of Columbia 49.6%
Florida 50.2%
Georgia 46.7%
Hawaii 41.9%
Idaho 65.2%
Illinois 58.9%
Indiana 55.2%
Iowa 65.3%
Kansas 63.0%
Kentucky 53.7%
Louisiana 59.8%
Maine 72.0%
Maryland 53.4%
Massachusetts 60.2%
Michigan 61.8%
Minnesota 71.6%
Mississippi 52.8%
Missouri 62.0%
Montana 70.1%
Nebraska 63.2%
Nevada 50.0%
New Hampshire 63.1%
New Jersey 56.3%
New Mexico 51.6%
New York 50.9%
North Carolina 50.1%
North Dakota 67.3%
Ohio 60.6%
Oklahoma 59.7%
Oregon 65.7%
Pennsylvania 54.2%
Rhode Island 58.4%
South Carolina 45.0%
South Dakota 67.0%
Tennessee 52.4%
Texas 49.1%
Utah 65.2%
Vermont 67.5%
Virginia 52.8%
Washington 59.9%
West Virginia 50.7%
Wisconsin 69.0%
Wyoming 62.3%
UNITED STATES 55.2%

Once you have selected Data Analysis, a dialogue box will appear, and you should select Descriptive Statistics. You will then get a new dialogue box, and in the Input Range you should type the first and last cells of the second column, separated by a colon; in my example it was b8:b58. You will also want to select New Worksheet Ply and Summary statistics, which will generate the following table. Included here are the basic measures of central tendency (mean, median, mode) and variability (standard deviation, variance, and range). You will also note the minimum and maximum values and the number of observations (count). As for measures of central tendency, the mean was 58% and the median was 59%. The voter participation rate ranged from approximately 42% to 72%, with a standard deviation of 7.3 percentage points.

Column1

Mean 0.581318
Standard Error 0.010279
Median 0.5894
Mode 0.6515
Standard Deviation 0.073408
Sample Variance 0.005389
Kurtosis -0.84522
Skewness -0.03072
Range 0.3004
Minimum 0.4194
Maximum 0.7198
Sum 29.6472
Count 51
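If you do not have Excel at hand, most of this table can be approximated with Python's standard statistics module. The sketch below is not how the table above was produced. It assumes the 51 turnout figures for the states and the District of Columbia as printed in the voter turnout table (the UNITED STATES row is left out), it works in percent rather than in fractions, and the names turnout, describe, and summary are our own. Because the printed figures are rounded to one decimal, the results will differ slightly from the Excel output. The mode, kurtosis, and skewness are left out, since the mode in particular is sensitive to that rounding.

```python
import math
import statistics

# 1992 turnout (%) for the 50 states and DC, in alphabetical order,
# copied from the voter turnout table; the UNITED STATES row is omitted.
turnout = [
    55.2, 65.4, 54.1, 53.8, 49.1, 62.7, 63.8, 55.2, 49.6, 50.2, 46.7, 41.9,
    65.2, 58.9, 55.2, 65.3, 63.0, 53.7, 59.8, 72.0, 53.4, 60.2, 61.8, 71.6,
    52.8, 62.0, 70.1, 63.2, 50.0, 63.1, 56.3, 51.6, 50.9, 50.1, 67.3, 60.6,
    59.7, 65.7, 54.2, 58.4, 45.0, 67.0, 52.4, 49.1, 65.2, 67.5, 52.8, 59.9,
    50.7, 69.0, 62.3,
]


def describe(values):
    """Return a dict of summary measures similar to the table above."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return {
        "Mean": statistics.mean(values),
        "Standard Error": sd / math.sqrt(n),  # standard error of the mean
        "Median": statistics.median(values),
        "Standard Deviation": sd,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


summary = describe(turnout)
for name, value in summary.items():
    print(f"{name:20s}{value:10.4f}")
```

Dividing the printed values by 100 puts them on the same scale as the Excel table, where the mean appears as 0.581318 rather than roughly 58.13.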

To see the actual work in Excel, you can check out one of the spreadsheets that contain some examples of Excel statistics. Descriptive Statistics 1 contains an example of descriptive statistics (desc stats sheet) and a histogram (Histogram). The desc stats sheet provides the summary statistics for a set of 199 grades, which were generated by selecting Data Analysis under the Tools menu and then choosing Descriptive Statistics. The input is specified by simply highlighting the column of data, and the output is specified to appear in cell D4.

The histogram is a bit more difficult if you want to specify the categories. You begin by selecting Data Analysis on the Tools menu. Here you select Histogram, and you can create a histogram simply by highlighting the data and specifying that the output be put in cell D46. The program determines the horizontal axis for you, and as you can see, it is very often not what you are interested in. You can override this by entering a range of cells in the Bin box, which sets the groupings of grades that you want. I have chosen a grouping where there are no fractions or decimals: I typed the numbers in cells D3 through D14, and this sets the categories. I then selected the input data, highlighted cells D4 through D14 in the Bin box, and asked for the output to appear in cell E3. You can see this scatter diagram is an improvement over the

To understand the concepts that appear in this table, you should look at the two concepts, Central Tendency and Variability.