Regression Analysis and Discrimination

Before discussing the use of regression analysis as a tool to ferret out discrimination, it is useful to briefly discuss some of the preliminary issues one must deal with when estimating a regression equation.  The project starts with the choice of the theoretical relationship that one wants to study - the hypotheses one is interested in testing and the data necessary to estimate the model.  First there is the need to determine the dependent variable - the phenomenon you are setting out to explain.  If you are studying discrimination in the housing or car markets, then the price of homes or cars might be the appropriate dependent variable.  In the mortgage market the dependent variable could be the rejection rate on mortgage applications while in the labor market the dependent variable would be the level of earnings. 

A decision must also be made concerning the choice of the independent variables. The choice of the regressors is based on the underlying economic theory: variables should be selected because there is a reason to believe they are causally related to Y, the dependent variable. If, for example, your goal was to estimate a demand equation for a certain product, then based on your knowledge of microeconomic theory, you would need to identify at the very least the appropriate data to capture the influence of the product's own price, income, population, and the price of related goods. In each of these instances, you will be making choices that will significantly affect the findings of your study. Furthermore, for every variable selected, you should have an a priori expectation for the sign of its estimated parameter: you should know whether to expect a positive or negative relationship. Based on our understanding of economic theory, for example, the coefficient of price in the demand equation should be negative and the coefficient of income should be positive (for a normal good).

A good example of the importance of properly specifying the independent variables is the treatment of demographic factors in a demand equation. The normal choice would often be total population, but there are instances where this is likely to be inappropriate. For example, consider the demand for motorcycles. Is it driven by the growth in the total population or by the growth in the population of young people? To the extent that it is the latter, that is, that the primary market for motorcycles is younger people, the use of total population as an independent variable would cause problems if the two growth rates diverged, a phenomenon we observed in the 1970s. Similarly, a model of housing demand would almost certainly include some measure of population as an independent variable. Is it the number of people, or the number of separate households, that is the primary determinant of demand? The choice you make will have a significant impact on the results, since in the 1980s the growth rates of the two differed substantially.

In the case of discrimination in housing where housing prices are the dependent variable, you would want to include as right-side variables (independent variables) characteristics of the house (number of rooms, square footage, age...), characteristics of the neighborhood (average home prices, crime, taxes...), and characteristics of the individual (income, gender, race, employment status...).  Once you have assembled the appropriate data, you would then estimate an equation of the form:

Pi = b0 +b1*H +b2*N + b3*I + b4*R/G

where
H = a set of characteristics of the house
N = a set of characteristics of the neighborhood
I = a set of characteristics of the individual
R/G = a dummy variable for either race or gender
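
As a rough illustration of how such an equation might be estimated, the Python sketch below fits it by ordinary least squares on made-up data. Everything in it is an assumption introduced for illustration: the single variable chosen to stand for each of H, N, and I (house size, a crime index, buyer income), the coefficient values used to generate the data, the built-in premium of 8 on the dummy, and the helper name ols. The dummy is coded 0 or 1.

```python
import random

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows, prices = [], []
for _ in range(500):
    sqft = rng.uniform(0.8, 3.0)        # H: house size, thousands of sq ft
    crime = rng.uniform(0.0, 10.0)      # N: neighborhood crime index
    income = rng.uniform(30.0, 150.0)   # I: buyer income, $ thousands
    dummy = 1 if rng.random() < 0.5 else 0  # R/G: 0/1 dummy variable
    # Assumed "true" relationship (hypothetical), with a premium of 8 on the dummy.
    price = (50 + 40 * sqft - 2 * crime + 0.5 * income
             + 8 * dummy + rng.gauss(0, 3))
    rows.append([1.0, sqft, crime, income, dummy])
    prices.append(price)

b0, b1, b2, b3, b4 = ols(rows, prices)
print(f"estimated coefficient on the dummy: {b4:.2f}")
```

Because the data were generated with a premium of 8 on the dummy, the estimated b4 should come out close to 8. A point estimate alone does not establish significance; that judgment also needs the coefficient's standard error, which this sketch does not compute.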

The dummy variable takes the value 0 for transactions where the buyer was white (male) and 1 where the buyer was black (female). In essence, the regression tests whether gender or race exerts a significant impact on the purchase price. A visual representation, albeit a very simplified one, appears below, where we test whether gender has a significant impact on the price of housing after explaining the price by some composite variable X and gender. The equation would be:

Pi = b0 +b1*X +b2*G

Two possible scatter diagrams based on the data appear below. In the left-side diagram there appears to be a positive relationship between X and P, reflected in the positive slope of the red regression line, but there is no real pattern to distinguish the red *s from the blue #s. In the right-side diagram there is little evidence of a positive relationship between X and P until we separate the *s from the #s. When we do that, we find that there are actually two positive relationships between X and P, one for the *s and one for the #s.

[Figure wpe2.jpg: two scatter diagrams of P against X, with * and # marking the two groups of buyers]

In terms of the equation above, the coefficient of X would be equal to the slope of the line and the coefficient of G would be equal to the gap between the two  lines.  In the left-side diagram there is no gap while in the right-side diagram there is a substantial gap that would suggest that there is a significant difference between the prices paid by males ( * ) and females ( # ).  The right-side diagram suggests that females pay a higher price for homes.  The equations for the two groups would be:

Pi = b0 +b1*X   (males, G = 0)

Pi = b0 +b1*X +b2   (females, G = 1)
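
The claim that b2 measures the gap between two parallel lines can be checked with a little arithmetic. The Python sketch below uses six hypothetical, noise-free observations (invented for illustration) lying exactly on two parallel lines, and computes the least-squares fit of Pi = b0 + b1*X + b2*G directly. The function name common_slope_fit is likewise an assumption.

```python
from statistics import mean

# Hypothetical, noise-free data: (X, P) pairs for each group.
males = [(1, 12), (2, 14), (3, 16)]      # G = 0, lies on P = 10 + 2*X
females = [(1, 17), (2, 19), (3, 21)]    # G = 1, lies on P = 15 + 2*X

def common_slope_fit(group0, group1):
    """OLS fit of P = b0 + b1*X + b2*G for a single 0/1 dummy G.

    With one dummy, the least-squares slope b1 is the pooled
    within-group slope, and b2 is the difference between the
    two group intercepts.
    """
    sxy = sxx = 0.0
    for grp in (group0, group1):
        xbar = mean(x for x, _ in grp)
        pbar = mean(p for _, p in grp)
        sxy += sum((x - xbar) * (p - pbar) for x, p in grp)
        sxx += sum((x - xbar) ** 2 for x, _ in grp)
    b1 = sxy / sxx
    intercept0 = mean(p for _, p in group0) - b1 * mean(x for x, _ in group0)
    intercept1 = mean(p for _, p in group1) - b1 * mean(x for x, _ in group1)
    return intercept0, b1, intercept1 - intercept0

b0, b1, b2 = common_slope_fit(males, females)
print(b0, b1, b2)
```

Here the common slope b1 is 2, the male intercept b0 is 10, and b2 is 5, which is exactly the vertical distance between the two lines at every value of X.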

We now have the variables, the data, and the results. But do we have proof? As you would expect, the answer is maybe. There are some potential problems that one may encounter, problems discussed by Yinger. As an example, consider the omitted variable bias that occurs when a variable (Z) that is causally related to P, and also correlated with G, is omitted from the estimated equation. The result is that the effect of Z on P is captured by the variable G, even though G might turn out to have no independent effect if Z were included in the equation. In an earnings equation, for example, if the regression omitted tenure in the job, and women tended to have shorter tenures than men, then tenure's effect on earnings would be picked up by the gender variable, and this would tend to bias the results toward a conclusion that discrimination existed.
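
The tenure example can be made concrete with a small sketch. The hypothetical data below (invented for illustration) are built so that earnings depend only on tenure, with no gender effect at all, yet the women in the sample have shorter tenures than the men. Comparing the regression that omits tenure with the one that includes it shows how the gender coefficient can absorb the omitted effect.

```python
from statistics import mean

# Hypothetical, noise-free data: earnings = 20 + 2*tenure for everyone,
# so the true coefficient on gender is zero.
men = [(4, 28), (5, 30), (6, 32)]      # (tenure, earnings), G = 0
women = [(1, 22), (2, 24), (3, 26)]    # (tenure, earnings), G = 1

def group_means(group):
    """Return (mean tenure, mean earnings) for one group."""
    return mean(t for t, _ in group), mean(e for _, e in group)

t_m, e_m = group_means(men)
t_w, e_w = group_means(women)

# Short regression (earnings on G only): the coefficient on a 0/1 dummy
# equals the difference in group means.
short_gap = e_w - e_m

# Long regression (earnings on G and tenure): the tenure slope is the
# pooled within-group slope, and the gender coefficient is the gap
# between the two group intercepts.
sxy = (sum((t - t_m) * (e - e_m) for t, e in men)
       + sum((t - t_w) * (e - e_w) for t, e in women))
sxx = (sum((t - t_m) ** 2 for t, _ in men)
       + sum((t - t_w) ** 2 for t, _ in women))
tenure_slope = sxy / sxx
long_gap = (e_w - tenure_slope * t_w) - (e_m - tenure_slope * t_m)

print(short_gap, tenure_slope, long_gap)
```

With tenure omitted, the gender coefficient is -6, which equals the tenure slope (2) times the difference in average tenure between women and men (-3). Once tenure is included, the gender coefficient falls to 0.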

The bottom line - you had best not try to prove the existence of discrimination without the services of a good econometrician / statistician.  If you do get involved in this type of research, however, it does provide a wonderful opportunity to apply the regression technique.