Overview: Statistics
"There are lies, damn lies, and statistics"

Introduction

'Sadistics' tells it all. When students use this as the title for their statistics course, you know something is going on in there that is not fun. If you read the paper or listen to the news, however, you know that statistics matter and that the investment in understanding some basic statistical principles is likely to be worthwhile. If you have any doubts, look at many of the important public policy debates that are grounded in statistical work. We are all aware of the government's long-running war against smoking, which is based in large part on a statistical link between smoking and an array of health problems. In November of 1996 the Environmental Protection Agency sparked a heated public debate about statistics when it tightened the air pollution controls on fine-particulate matter (soot and dust). The decision to tighten the standards was made on the grounds that industrially produced dust particles were responsible for 10,000 to 100,000 premature deaths each year. Former Senator John Chafee, long a supporter of the EPA, would not support the measure, however, because of his belief that it was based on questionable statistics. In early 1998 an article reported that there was no link between fiber in one's diet and the likelihood of cancer, a finding that was the opposite of the prevailing view.

In 1998 there were two front page examples of the power of statistics. The first was the collapse and multi-billion dollar bailout of Long-Term Capital Management. This hedge fund, which employed two Nobel Prize-winning economists, was based on the premise that there were some statistical averages in financial prices on which you could bet BIG money. This was certainly the sentiment of Matt Damon, who played a poker player in the 1998 movie Rounders.

The second was the controversy surrounding the upcoming census. Since 1790 the US government has taken a count of the population every ten years. In 1790 the population was small enough that individuals' names could be found in the census, but by the year 2000 the population exceeded 270 million and real concerns were being raised about the ability to actually count each individual. To correct some anticipated deficiencies in the collection of the data, the Census Bureau, the agency in charge of collection and dissemination of the data, wanted to use a sampling method to provide greater accuracy in the counts in certain areas. As you would expect, this became a hotly debated issue between those who felt that the technique would increase the size of their constituencies and others who saw it eroding their power.

If you need other examples of where statistics is used, just consider the array of age, gender, and racial discrimination cases debated in the nation's courts that you have seen in the headlines in recent years. At the heart of these debates is statistics - if you want to prove or disprove discrimination, you will need a thorough statistical analysis of the data. How about something closer to home - your grades? How many of you have been 'graded on the curve'? If so, your grades have been based on a simple statistical concept which you should know something about. Statistics is also at the center of the gambling industry that fuels the growth of Las Vegas and Foxwoods and pours millions into the coffers of state governments that run their lotteries.

Before we begin our foray into statistics it is important for you to realize that this IS NOT a course in statistics. In this unit we are going to spend only enough time to scratch the surface of what would be in a traditional statistics course. It is also a unit that relies more heavily on the web for its material, since there is a good amount of classroom material on-line that appears in the index of web sites. For those interested in a more extensive listing you might want to check out this Index. A few interesting / useful sites are the UCLA On-line Statistics Course; the Electronic Textbook; Introductory Statistics: Concepts, Models, and Applications (copyright 1996) by David W. Stockburger; Mathematics 220DX Statistics at the New Hampshire College; the DAU Stat Refresher; and Hyperstat Online by David Lane at Rice University.

You may also be interested in some statistics glossaries to keep your terminology straight. One on-line site is the Electronic Textbook.

Although we are not going to undertake a thorough study of statistics, we will look at what we will be doing within the context of traditional statistical analysis. Let's start with a definition. Statistics "is concerned with the development and application of processes, methods, and techniques for collecting, analyzing, and interpreting quantitative data to aid decision making." [Collin Watson et al., Brief Business Statistics, Allyn & Bacon, 1988.] Now that's a mouthful, but at its core statistics is about numbers: how to extract information from data for decision making.

In general you will find that statistical analysis can be decomposed into four categories - descriptive statistics, probability, inferential statistics, and statistical techniques.

Descriptive statistics, the first component of most statistics courses, is where you learn how to summarize or describe things with a few numbers, which are themselves called descriptive statistics. You will find examples of these everywhere. If you happen to listen to weather reports, you have heard meteorologists use statistics - those average daily highs and lows. For those who like finance, you need look no further than the financial pages of the newspapers, where you will find the Dow Jones and S&P stock averages among an array of statistics on stocks, bonds, currencies and other financial instruments designed to provide us with some insight into what is happening in the financial markets. Sports fans are certainly well aware of the use of descriptive statistics. In baseball we have the batting averages, slugging percentages, and earned run averages which are used to describe the performance of individual players, while in basketball we may evaluate a player by shooting percentage or rebounds per game.

To get a "feel" for descriptive statistics, let's look at students' performance on my first exam in ECN201.  Below you will see the list of grades and I suspect that you have no "feel" for how the students did on the exam.  

Grades on ECN201 Exam

13  18  19   7  13  26  22  16  20  20  16  22  20  26  25  17  21
27  15  29  22  16  21  23  22  24  25  17  17  13  16  15  22  22
22  25  25  22  27  27  20  18  23  24  22  18  24  26  24  24  21
23  17  22  22  26  19  24  28  24  20  28  17  27  22  26  23  27
20  15  15  19  17  14  22  21  19  26  18  14  28  22  22  21  22
29  18  24  26  20  20  21  19  21  22  25  19  23  22  13  23  24
18  25  27  13  23  20  24  20  21  27  23  21  27  20  23  18  14
25  23  24  22  11  22  21  16  21  26  27  16  20  26  15  19  21
15  16  17  14  15  27  18   9  21  25  10  25  24  25  12  14  29
15  24  20  21  21  19  18  26  15  18  22  23  23  14  26  25  17
25  21  13  23  16  18  17  24  20  15  24  22  22  18  23  15  21

Descriptive statistics will provide us with ways of summarizing large amounts of data such as these grades. It is here that you will also be introduced to frequency distributions and histograms, which are pictures of the grade distribution. The two sets of summary statistics that we will discuss here are the measures of central tendency and variability. As a preview of what you are to see in this unit, below you will find the histogram of the data which provides a visual summary of the information in the table of scores.
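As a rough illustration of what such a summary might look like, the sketch below computes a few descriptive statistics for the exam scores and prints a text histogram. The scores are transcribed from the list above; the bin width of 5 and the helper name text_histogram are choices made for this example, not something specified in the unit.

```python
from collections import Counter
from statistics import mean, median, pstdev

# Exam scores transcribed from the ECN201 grade list above (187 scores).
grades = [
    13, 18, 19, 7, 13, 26, 22, 16, 20, 20, 16, 22, 20, 26, 25, 17, 21,
    27, 15, 29, 22, 16, 21, 23, 22, 24, 25, 17, 17, 13, 16, 15, 22, 22,
    22, 25, 25, 22, 27, 27, 20, 18, 23, 24, 22, 18, 24, 26, 24, 24, 21,
    23, 17, 22, 22, 26, 19, 24, 28, 24, 20, 28, 17, 27, 22, 26, 23, 27,
    20, 15, 15, 19, 17, 14, 22, 21, 19, 26, 18, 14, 28, 22, 22, 21, 22,
    29, 18, 24, 26, 20, 20, 21, 19, 21, 22, 25, 19, 23, 22, 13, 23, 24,
    18, 25, 27, 13, 23, 20, 24, 20, 21, 27, 23, 21, 27, 20, 23, 18, 14,
    25, 23, 24, 22, 11, 22, 21, 16, 21, 26, 27, 16, 20, 26, 15, 19, 21,
    15, 16, 17, 14, 15, 27, 18, 9, 21, 25, 10, 25, 24, 25, 12, 14, 29,
    15, 24, 20, 21, 21, 19, 18, 26, 15, 18, 22, 23, 23, 14, 26, 25, 17,
    25, 21, 13, 23, 16, 18, 17, 24, 20, 15, 24, 22, 22, 18, 23, 15, 21,
]


def text_histogram(scores, bin_width=5):
    """Return the lines of a text histogram (bin_width is an illustrative choice)."""
    # Map each score to the lower edge of its bin, e.g. 17 -> 15 for width 5.
    counts = Counter((s // bin_width) * bin_width for s in scores)
    lines = []
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        n = counts.get(start, 0)
        lines.append(f"{start:2d}-{start + bin_width - 1:2d} | {'*' * n} ({n})")
    return lines


# Measures of central tendency and variability for the whole class.
print(f"n = {len(grades)}")
print(f"mean = {mean(grades):.2f}, median = {median(grades)}")
print(f"range = {min(grades)} to {max(grades)}, std dev = {pstdev(grades):.2f}")

# A crude picture of the grade distribution.
for line in text_histogram(grades):
    print(line)
```

The population standard deviation (pstdev) is used here because the list is treated as the entire class rather than a sample from it.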

Probability theory allows us to move from the world of certainty to the world of uncertainty and is usually one of the more difficult parts of traditional statistics courses. Fortunately, an understanding of some basic probability theory is possible without too much mathematical sophistication. In this section we will talk about the rules of probability and probability distributions, with a special emphasis on the normal distribution that you have probably already experienced in some of your classes where you were graded on the bell curve. With a little work you will understand why you will most likely lose at Foxwoods and not win the lottery. In terms of the ECN201 grades, we would use probability theory to determine the likelihood of selecting two students from the class who both received an A on the exam (>90).
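The two-student question can be sketched with a short calculation. The example assumes the two students are drawn at random without replacement. The class size of 187 matches the number of grades listed above, but the count of 10 students with an A is a made-up figure used only to show the arithmetic, and the function name prob_both_a is hypothetical.

```python
from fractions import Fraction


def prob_both_a(n_students, n_a):
    """Probability that two students drawn without replacement both earned an A."""
    if n_students < 2 or not 0 <= n_a <= n_students:
        raise ValueError("need at least 2 students and 0 <= n_a <= n_students")
    # First draw: n_a of n_students have an A.
    # Second draw: one A student is already gone, so (n_a - 1) of (n_students - 1).
    return Fraction(n_a, n_students) * Fraction(n_a - 1, n_students - 1)


# 187 students in the class; 10 A grades is a hypothetical count.
p = prob_both_a(187, 10)
print(f"P(both received an A) = {p} (about {float(p):.4f})")
```

Using exact fractions keeps the multiplication rule visible: the second factor is smaller than the first because the first student selected is not put back.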

Inferential statistics is concerned with drawing conclusions, making inferences, and developing forecasts. When a friend was found to have a brain tumor, his family was told that the odds for his survival were 30 percent. This is an example of an inferential statistic - some inferences were being made about the future of my friend based on the previous experiences of others. So too are all those election results where the winner of the election is predicted, within some margin of error, based on a small sample of people at the polls. How often did you hear during the impeachment hearings about the president's approval rating that was based on a small survey of individuals? And there are those models that predict future stock market prices based on historical price movements. At URI, if you wanted to get students' opinions on certain topics, you would sample the students to arrive at a "representative" opinion. As for the ECN201 grades, we could select a small sample of grades and use the sample to infer something about the grades for the entire class.
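The "margin of error" quoted with poll results can be approximated with a standard textbook formula. The sketch below assumes a simple random sample, a normal approximation, and a 95 percent confidence level (z = 1.96); none of these details come from this unit, and real polls typically adjust for their sampling design.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    n: sample size; p: assumed true proportion (0.5 is the worst case);
    z: critical value (1.96 corresponds to roughly 95 percent confidence).
    """
    return z * math.sqrt(p * (1 - p) / n)


# The margin shrinks with the square root of the sample size.
for n in (100, 1000, 10000):
    print(f"n = {n:>6}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

The point of the sketch is that a sample of about 1,000 people can give a margin of roughly three percentage points regardless of how large the population is, provided the sample is drawn properly.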

The discussion of statistical techniques can be considered a special topic in statistics, and included here would be discussions of time series analysis, economic forecasting, and linear regression analysis. With regard to the ECN201 grades, we could attempt to explain these grades by comparing the grades with other characteristics of the students.

Before we begin our foray into statistics, however, it is useful to deal with a basic classification scheme for the data which we will be describing and analyzing, and to examine the relationship between samples and populations.

Variable types

At the center of our analysis of statistics will be the random variable, a variable whose numerical value is determined by a chance mechanism. The score on an exam selected from the ECN201 grades would be an example of a random variable, and so would the number that appears when you throw two dice. You could also think of an individual's income derived from a survey as a random variable.
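To make the two-dice example concrete, the following sketch lists every equally likely outcome of throwing two fair six-sided dice and tabulates the probability of each total. The assumption that the dice are fair is the example's, and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (die1, die2) outcomes and count each total.
outcomes = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability of each total = number of ways to roll it / 36.
distribution = {
    total: Fraction(count, 36) for total, count in sorted(outcomes.items())
}

for total, prob in distribution.items():
    print(f"{total:2d}: {prob}")
```

The total is a random variable: its value is not known before the throw, but the chance of each possible value can be worked out in advance.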

When we talk about random variables, they will be classified as either qualitative or quantitative. We are dealing with qualitative (categorical) variables when the observations fall into separate distinct categories. Qualitative data can be found in a number of the popular press surveys. Some of these would include: rank (assistant professor, associate professor, full professor), attitude toward affirmative action quotas (support, reject), exam result (pass or fail), or marital status (single, married, divorced, widowed). If you are looking for some expertise with qualitative variables, you might look up sociologists and psychologists, who tend to deal more often with qualitative variables.

Quantitative (numerical) variables, meanwhile, are simply numeric variables - they are numbers. These are the variables that economists tend to deal with, which is why they tend to specialize in statistical techniques designed for quantitative variables.  There are two types of numerical variables - continuous and discrete.  Discrete data are derived from a counting process, while continuous data are derived from a measuring process.  Some examples of continuous variables would be income, quantity demanded, price, unemployment rates, interest rates, and scores on the exam.  Number of long-distance phone calls made per month, number of jobs in past ten years, number of children, and the number of states visited on your trip would be examples of discrete quantitative variables. 

Samples and populations

A key relationship in statistics is that which exists between the sample and the population. To understand the difference between the two, let's look at the questions of forecasting elections, gauging public opinion, and estimating the market for a new product. In each instance our interest is in what the American people are thinking or doing, so the population in each instance would be the 270+ million American people. We could ask all 270+ million how they intend to vote, how they feel about the bombing of Iraq in late 1998, and their prospective use of internet commerce. The problem is that the cost of gaining this information would be prohibitive.

We would have a similar situation if we wanted to determine student support for a proposed curriculum change where the population would be the 9,000+ undergraduate students.  The obvious approach would be to conduct a survey in which all students were polled on their opinion, but you can only imagine the logistical difficulties involved in such an undertaking.

Fortunately, there is a way to gain insight into the answers to all of these questions without conducting exhaustive surveys of the population.   In each instance researchers could identify a small sample of the population and make inferences about the population based on the findings of the sample.  For example, we could identify a set of 1000 potential voters that were representative of the entire population to forecast the election, and this sample would not necessarily be the same as that designed to gauge public opinion since not all people are equally likely to vote.  People over age 65 are nearly twice as likely to vote as those under age 21.   If you were interested in "testing" the potential response to a new product, meanwhile, you might pick a few test cities to launch the product rather than conducting a national launch.  And for those interested in gauging student sentiments, you could identify 100 students and solicit their opinions which you would take as representative of the opinions of the entire student population.

The advantages of the sample are obvious, but it should also be obvious how important the sampling process is if correct inferences are to be made. If we took our survey for a presidential election at the polls in center cities, we would likely find that our forecasts are often wrong since Democrats are overrepresented in the cities and underrepresented in the suburbs. The same would be true if we conducted our polls on college campuses, since younger voters would be overrepresented in the sample. Similarly, if we were interested in determining support for President Clinton during the impeachment hearings, a survey of those leaving church on Sunday would not provide a representative opinion. Excluded from the sample would be all those who did not attend religious services, and you might expect that these two groups would have different views on the president's behavior.

When testing the market for a new product in Rhode Island, meanwhile, you would most likely not want to use as your test markets Barrington, one of the state's wealthiest communities, or Central Falls, one of the state's lower income communities. Neither of these cities contains a cross-section of the state's population and thus they could not produce representative results. At the university, a sampling of student opinions among the residents of the dorms would not be expected to provide useful insight into the opinions of undergraduate students, since the survey would miss all commuters and the vast majority of upperclass students who live "down-the-line."

Care must be taken when developing a sample, and an important concept here is the random sample. A sample is said to be a random sample if every individual has the same chance of being selected as any other or if all samples of the same size have the same probability of being chosen. At the university, a random sample could be conducted by putting the names of all students on index cards, putting the cards into a drum that mixes them up, and then selecting 100 cards. This is the approach taken in the lottery drawings that you can see on television each night. A more sophisticated approach would be to assign a number to each student and then use a random number table to select the 100 students. The list of all students' names could also be placed on the wall, and you could then survey only those students whose names are hit by someone throwing darts at the list. The Wall Street Journal has done this for years when picking stocks - or at least they say that they use this approach.
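The index-card drum and the random number table both amount to drawing a simple random sample, which a computer can do directly. In the sketch below the student ID numbers 1 through 9,000 are hypothetical stand-ins for the student list, and the function name and seed parameter are choices made for the example.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct members so that every member is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, k)


# Hypothetical ID numbers standing in for roughly 9,000 undergraduates.
student_ids = list(range(1, 9001))

chosen = simple_random_sample(student_ids, 100, seed=42)
print(f"selected {len(chosen)} students, for example: {sorted(chosen)[:5]}")
```

Because the draw is without replacement, no student can be selected twice, just as a card pulled from the drum is not put back.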

A survey of only the dorm students, however, would not be a particularly good approach since first and second year students tend to live in dorms while third and fourth year students tend to live off-campus.  The sample based on the dorm would not be representative of the university and thus we could expect that the survey results would not accurately reflect the overall student population's view.   Similarly, if someone were interested in knowing how many people could not find work, a phone survey during weekday afternoons would not be a very good approach since you would be likely to miss many of those people working the traditional 9-5 jobs.
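A small simulation can show how a dorm-only survey might mislead. Every number in the sketch below is invented for illustration: it assumes 9,000 students, 40 percent of whom live in dorms, with dorm residents supporting some proposal at a 70 percent rate and other students at 40 percent.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population; each student is (lives_in_dorm, supports_proposal).
population = []
for _ in range(9000):
    lives_in_dorm = rng.random() < 0.40          # assumed share living in dorms
    p_support = 0.70 if lives_in_dorm else 0.40  # assumed support rates
    population.append((lives_in_dorm, rng.random() < p_support))


def support_rate(students):
    """Fraction of the given students who support the proposal."""
    return sum(supports for _, supports in students) / len(students)


dorm_students = [s for s in population if s[0]]
random_sample = rng.sample(population, 100)   # every student equally likely
dorm_sample = rng.sample(dorm_students, 100)  # only dorm residents can be chosen

print(f"whole population: {support_rate(population):.2f}")
print(f"random sample:    {support_rate(random_sample):.2f}")
print(f"dorm-only sample: {support_rate(dorm_sample):.2f}")
```

Under these made-up assumptions the random sample tends to land near the population figure, while the dorm-only sample tends to overstate support because it can only reach one group.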

In the grade example, you probably would not want to group students by their rank in class, their GPA, or their SAT scores since the averages for the groups could be expected to be quite different. You would also have some reservations about grouping students by age or location in the classroom. There might be reasons to expect that the front row students would have a different profile than the back row students and that this would show up in the average performance. You could use either the index card or random number methods, or you could simply choose every tenth score if there was no reason to believe that the scores were related to the order in which they were recorded. This would be a potentially acceptable approach if the scores were recorded alphabetically or by the last digit of the social security number. If we chose every tenth student and then sorted them by score, we would have the following scores for the sample. In our discussion of central tendency and variability we will examine the relationship between the properties of the sample and the entire population of students.

Sample of Exam Scores

15  16  16  16  18  18  19  19  19  20  21  23  24  24  25  26  26  29
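As a preview of how the sample might be used, the sketch below summarizes the 18 sampled scores listed above and shows one way to pick every tenth item from a list. The helper every_kth is hypothetical, and the standard error shown is the usual textbook estimate for a simple random sample, which is an assumption about how the sample behaves rather than something established in this unit.

```python
import math
from statistics import mean, median, stdev


def every_kth(scores, k=10):
    """Systematic sample: take the k-th, 2k-th, 3k-th, ... items of a list."""
    return scores[k - 1::k]


# The 18 sampled exam scores listed above, sorted by score.
sample_scores = [15, 16, 16, 16, 18, 18, 19, 19, 19,
                 20, 21, 23, 24, 24, 25, 26, 26, 29]

n = len(sample_scores)
sample_mean = mean(sample_scores)
sample_sd = stdev(sample_scores)           # sample standard deviation (n - 1)
standard_error = sample_sd / math.sqrt(n)  # rough uncertainty in the sample mean

print(f"n = {n}, mean = {sample_mean:.2f}, median = {median(sample_scores)}")
print(f"std dev = {sample_sd:.2f}, standard error = {standard_error:.2f}")

# Picking every tenth item from a list of 30 placeholder positions.
print(every_kth(list(range(1, 31))))
```

Comparing these sample figures with the same statistics computed for the full class is one way to see how well a small sample reflects the population.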

 

Conclusion

Statistics can be a powerful method for presenting information in a meaningful, understandable manner or for testing for causality (e.g., smoking and lung cancer), but you should never lose sight of the potential for abuse and misuse. Some mistakes are the result of good intentions and bad procedures - an example being a polling of students in the dorms to determine student attitudes toward the differences in tuition for in-state and out-of-state students. Other problems may arise as a result of poor intentions - a tobacco company conducting tests to determine the link between smoking and cancer, or a team of military leaders reporting on the incidence of sexual harassment in the military. When we have finished our study of statistics, you should be in a better position to identify potential misuses. My goal here is very similar to that expressed by William S Hammack, who writes in Statistics: A Primer for Lawyers:

"My general goal is to make you literate in basic statistical language and concepts... you should be able to explain at a general conceptual level the details of statistical analysis...: this primer will teach you to do this. You will develop the ability to visualize statistics physically...[and] build a foundation for understanding more sophisticated statistical techniques"

This Unit will be divided into four sections. The first is simply a guide to some useful web sites. In the second and third sections you will find a brief overview of some of the more important concepts in Descriptive Statistics, Probability, and Inferential Statistics. There is also a section on Regression, one of the most widely used statistical techniques for identifying relationships / causality.