Tuesday, May 9, 2017

Regression Analysis

Introduction:

The purpose of this assignment is to learn how to run and interpret a regression analysis and how to use regression to predict outcomes. A second goal is to learn how to map standardized residuals and to connect statistics to a spatial output. The assignment is split into two parts. The first part looks at the relationship between the percent of kids that get free lunches and the crime rate per 100,000 people in a community. The second part looks at the responses to 911 calls in Portland, OR.

Part One:

For part one, a news station claimed that as the number of kids that receive a free lunch increases, the crime rate also goes up. The goal for part one is to figure out whether the news station is correct, using the data provided and SPSS. The SPSS regression output for the crime data is below in figure 1.
figure 1. The results from the SPSS regression analysis for the crime data provided
A lot of information can be derived from this output. The percent of kids receiving free lunches is the independent variable and the crime rate is the dependent variable. The relationship between the variables is statistically significant because the significance level is .005, which is lower than .05. This means that there is a correlation between the percent of students who receive free lunches and the crime rate. However, the r^2 is only .173, which is low. r^2 measures how much of the variation in the dependent variable is explained by the independent variable on a scale from 0 to 1, where 0 means it explains none of it and 1 means it explains all of it. Since this r^2 value is only .173, the percent of free lunches explains only 17.3% of the variation in the crime rate, so it is not a major factor. Technically, the news station is correct when it says that free lunches and the crime rate are connected, because there is a correlation between the variables. However, since the r^2 value is so low, free lunches explain only a small part of the crime rate.

Using this regression model, the crime rate can be estimated from the percent of kids receiving free lunches. The equation for the best fit line is y=21.819+1.685x, where x is the percent of kids receiving a free lunch and y is the crime rate per 100,000 people. If a new area of town had 23.5% of kids receiving a free lunch, then 23.5 would be put in for x in the equation. This results in a crime rate of 61.4165 crimes per 100,000 people. There is little confidence in this result because of the low r^2 value. The equation also shows that if nobody got a free lunch, there would be 21.819 crimes per 100,000 people, and if every kid (100%) got a free lunch, the crime rate would be 190.319 per 100,000 people. Finally, for every one percentage point increase in free lunches, the crime rate is predicted to go up by 1.685 crimes per 100,000 people.
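The predictions from the best fit line can be checked with a few lines of Python. This is only a sketch: the function name predict_crime_rate is made up for illustration, and the intercept and slope are the values reported in the SPSS output above.

```python
# Coefficients of the best fit line reported above: y = 21.819 + 1.685x
INTERCEPT = 21.819
SLOPE = 1.685


def predict_crime_rate(pct_free_lunch):
    """Predicted crimes per 100,000 people for a given percent of kids on free lunch."""
    return INTERCEPT + SLOPE * pct_free_lunch


# The three cases discussed in the text: nobody, 23.5%, and every kid on free lunch.
for pct in (0, 23.5, 100):
    print(pct, round(predict_crime_rate(pct), 4))
```

Because r^2 is only .173, these numbers are rough estimates at best.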

Part Two:

Introduction:

For the second part of this assignment, 911 calls are compared to other variables in Portland, OR. This is done with the data provided, as well as SPSS and ArcMap. First, three variables are compared to 911 calls individually. Next, the variable with the highest r^2 value has its residuals mapped. Finally, multiple variables are compared to 911 calls at once, using multiple regression analysis and then multiple regression analysis with a stepwise approach.

Methods:

The first step of the second part of this assignment is to run individual regression analyses on different variables and 911 calls. For all of these, the dependent variable is the number of 911 calls and the independent variable is the one selected to compare to 911 calls. This can be done by opening SPSS and opening the data in the program. Next, go to the Analyze tab and select Regression/Linear. Then set the dependent variable to calls and the independent variable to the variable that calls is being compared to. After this is processed, SPSS gives an output that describes the regression between calls and the selected variable.
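To show what a simple linear regression produces, the intercept, slope, and r^2 can be computed by hand. The sketch below is not the SPSS procedure itself, just the standard least squares formulas in plain Python. The function name simple_regression and the sample numbers are made up for illustration and are not the Portland data.

```python
def simple_regression(x, y):
    """Return (intercept, slope, r_squared) for a least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With one independent variable, r^2 is the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Invented example values (independent variable, dependent variable).
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
intercept, slope, r_squared = simple_regression(x, y)
print(intercept, slope, r_squared)
```

The invented points fall exactly on a line, so the fit is perfect. Real data such as the census tracts will give an r^2 below 1.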

Next, to make a choropleth map of the number of 911 calls per census tract, open ArcMap and add the Portland census tracts layer. From here a simple symbology change will result in a choropleth map of the 911 calls. To map the residuals, open the toolbox and navigate to the spatial statistics tools. Next, select modeling spatial relationships and choose ordinary least squares. Once this tool is open, select the census tracts as the input, set the unique ID field to UniqID, set the dependent variable to calls, and set the explanatory variable to the independent variable. This will result in a new layer that shows the residuals for each census tract in terms of calls and the variable chosen.

The last steps for this assignment deal with multiple regression analysis. The process is the same as for an individual regression analysis, but multiple independent variables are chosen instead of one. The result shows the regression for all the independent variables together. However, a stepwise approach is needed to keep only the variables that work well with the data. This can be done by selecting stepwise under the method drop-down in the linear regression tool window. This will give an output with only the variables that help increase the r^2.
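The idea behind keeping only the variables that raise r^2 can be sketched as a simple forward selection. This is a simplified illustration, not the SPSS algorithm: SPSS decides entry and removal with significance tests, while this sketch adds the variable with the largest r^2 gain and stops when the gain falls below an assumed threshold (min_gain). All names and numbers here are invented.

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(columns, y):
    """r^2 of a least squares fit of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    beta = solve(xtx, xty)
    predicted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def forward_select(candidates, y, min_gain=0.01):
    """Add variables one at a time while each addition raises r^2 by at least min_gain."""
    chosen, best_r2 = [], 0.0
    remaining = set(candidates)
    while remaining:
        scores = {
            name: r_squared([candidates[c] for c in chosen] + [candidates[name]], y)
            for name in remaining
        }
        name = max(scores, key=scores.get)
        if scores[name] - best_r2 < min_gain:
            break
        chosen.append(name)
        best_r2 = scores[name]
        remaining.remove(name)
    return chosen, best_r2


# Invented data: y depends on "a" and "b" only; "c" is unrelated noise.
candidates = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 1, 4, 3, 6, 5],
    "c": [5, 3, 8, 1, 9, 2],
}
y = [2 * a + 3 * b for a, b in zip(candidates["a"], candidates["b"])]
print(forward_select(candidates, y))
```

In this invented example the unrelated variable is left out because it adds nothing to r^2, which is the same reason stepwise regression drops variables.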

Results:

For the first step of part two, variables within census tracts were compared to the number of 911 calls in each census tract. The variables selected to try to explain the number of 911 calls per census tract are the number of people with no high school degree, the unemployment rate, and the population density. The first individual regression analysis uses the number of people with no high school degree to explain 911 calls. The results for this regression analysis can be seen below in figure 2.
figure 2. Regression analysis output for 911 calls and low education population
There is a positive relationship between 911 calls and the number of people with no high school degree. This is seen in the equation of the best fit line for the data, y=3.931+.166x. This equation shows that the number of 911 calls increases by .166 for each additional person without a high school degree in that area. Since the number of 911 calls and the number of people with no high school degree go up or down together, there is a positive relationship between the variables. The r^2 for this regression is .567, which is a fairly high r^2 value. This means that the number of people with no high school degree explains 56.7% of the variation in 911 calls. For the hypothesis test, since the significance level is under .05, we reject the null hypothesis that there is no relationship between 911 calls and the low education population, in favor of the alternative hypothesis that there is a relationship between them.

The second variable that was selected to try to explain the number of 911 calls is the unemployment rate. The regression analysis output can be seen below in figure 3.
figure 3. Regression analysis output for 911 calls and unemployment rates.
From this output, an equation for a best fit line can be derived. The equation is y=1.106+.507x, where x is the unemployed and y is the 911 calls. Looking at this equation, a positive relationship can be seen between the unemployment rate and the number of 911 calls. This can be seen by looking at the slope of the equation, .507. This means that 911 calls will increase by .507 for every unit increase in unemployment. The r^2 value is .543, which is fairly high. This means that the unemployment rate explains 54.3% of the variation in 911 calls in the census tracts. The significance level is under .05, so the null hypothesis is rejected. This means that the idea that there is no relationship between the unemployment rate and 911 calls is rejected in favor of the idea that there is a relationship between them.

The third variable that was selected to try to explain the number of 911 calls is the population density. The regression output can be seen below in figure 4.
figure 4. Regression analysis output for 911 calls and population density.
Looking at this output, an equation for a best fit line can be found. The equation is y=20.616+21909.074x, where y is the number of 911 calls and x is the population density. There is a positive relationship because the slope, 21909.074, is positive: for every 1 unit increase in population density, the number of 911 calls goes up by 21909.074. However, the population density does not do a good job predicting the 911 calls because the r^2 value is .004. This means that the population density explains only .4% of the variation in 911 calls. Since the significance level, .555, is over .05, we fail to reject the null hypothesis, meaning that there is no evidence of a relationship between 911 calls and the population density.
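The choice of which variable to map in the next step comes down to comparing the three outputs. The short sketch below collects the r^2 values reported above and picks the significant variable with the highest r^2. The exact significance levels for the first two variables are not repeated here, so they are recorded only as under or over .05. The variable names are labels chosen for this example.

```python
# (label, r^2 from the SPSS output, significance level under .05?)
results = [
    ("no high school degree", 0.567, True),
    ("unemployment rate", 0.543, True),
    ("population density", 0.004, False),
]

# Keep only the significant regressions, then take the one with the largest r^2.
significant = [r for r in results if r[2]]
best = max(significant, key=lambda r: r[1])

for label, r2, is_significant in results:
    decision = "reject null" if is_significant else "fail to reject null"
    print(f"{label}: r^2 = {r2}, {decision}")
print("variable to map:", best[0])
```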

The second step of part two maps the number of 911 calls per census tract and the residuals for the variable that has the highest r^2 value, the number of people with no high school degree. The first map simply shows the total number of 911 calls per census tract in Portland. This map can be seen below in figure 5.
figure 5. Number of 911 calls per census tract in Portland OR
From this map, it is easy to see where the most 911 calls are occurring. The majority of 911 calls come from the northeast census tracts. There are also a few census tracts in the southeast that have a high number of 911 calls. The west edge of Portland makes very few 911 calls.

The next map shows the residuals for the regression analysis between 911 calls and the number of people with no high school degree. A residual is the distance between a data point and the best fit line. Once all the residuals are found, they can be mapped. Figure 6 below shows the mapped residuals for the regression analysis.
figure 6. Map of residuals from regression analysis for 911 calls and amount of uneducated people
This map shows areas where the model is over predicting and under predicting. Areas in red are where the model under predicts the number of calls, and areas in blue are where the model over predicts it.
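A residual can be worked out by hand as the observed number of calls minus the number predicted by the best fit line. The sketch below uses the equation y=3.931+.166x reported earlier, but the tract values are invented for illustration and are not the Portland data. The standardized value is a simple z-score of the residuals, which is an approximation of what a GIS tool reports rather than its exact calculation.

```python
import statistics

# Best fit line for 911 calls vs. people with no high school degree.
INTERCEPT = 3.931
SLOPE = 0.166

# (tract id, people with no high school degree, observed 911 calls): invented values.
tracts = [("A", 100, 35), ("B", 250, 40), ("C", 400, 70), ("D", 50, 5), ("E", 300, 60)]


def residuals(rows):
    """Observed calls minus predicted calls for each tract."""
    return {tid: calls - (INTERCEPT + SLOPE * low_educ) for tid, low_educ, calls in rows}


def standardize(res):
    """Z-score each residual so tracts can be compared on one scale."""
    values = list(res.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {tid: (r - mean) / sd for tid, r in res.items()}


def classify(residual):
    """Positive residual: more calls than predicted, so the model under predicts."""
    return "under predicted (red)" if residual > 0 else "over predicted (blue)"


res = residuals(tracts)
z = standardize(res)
for tid in res:
    print(tid, round(res[tid], 3), round(z[tid], 2), classify(res[tid]))
```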

The third step of part two deals with multiple regression. For this part, a multiple regression analysis is performed on the data. The number of 911 calls remains the dependent variable, but multiple independent variables are used as the input. The output for the multiple regression analysis can be seen below in figure 7 and figure 8.
figure 7. Output for multiple regression analysis part 1
figure 8. Output for multiple regression analysis part 2
The r^2 value for this data is .783, which is a fairly high r^2 and means that all these variables together explain 78.3% of the variation in 911 calls. The most influential variable can be found by looking at the absolute value of the beta value. In this case, LowEduc is the most influential variable and unemployed is the least influential variable. This output also shows whether there is any collinearity between variables. This can be found by looking at the bottom row of the collinearity diagnostics table. If the condition index is above 30, then there is collinearity and a variable needs to be eliminated. In this case, the condition index is under 30, so there is no collinearity. If there were, the variable whose entry in that row, to the right of the condition index, is closest to 1 would be eliminated and the regression would be run again without that variable.

Next, the same regression is run but the method is changed to a stepwise approach. The output for the stepwise regression can be seen below in figures 9, 10, 11, and 12.
figure 9. Output for stepwise regression part 1
figure 10. Output for stepwise regression part 2
figure 11. Output for stepwise regression part 3
figure 12. Output for stepwise regression part 4
Stepwise regression only includes the variables that help drive the equation the most. The three variables that help drive the equation the most were Renters, LowEduc, and Jobs. With these three variables together, the r^2 value is .771. This is a fairly high r^2 value and means that these three variables explain 77.1% of the variation in 911 calls. The equation for this output is 911 calls = renters*.024+LowEduc*.103+Jobs*.004. All of these variables have a positive slope, so they all have a positive relationship with the number of 911 calls. Looking at the beta values, the variables can be ranked from most influential to least influential: LowEduc, Jobs, Renters. Using this model, the residuals can be mapped. This can be seen below in figure 13.
figure 13. Mapped residuals for Portland OR
This map is read in the same way as the previous residuals map. The red areas are where the model under predicts and the blue areas are where the model over predicts the outcome. This map is useful because a new hospital should go in an area where the model under predicts the number of 911 calls. These areas have a higher number of 911 calls than the model predicts. This means that a new hospital should go in or near a census tract that is bright red. This will allow the hospital to treat the greatest number of people that need it.
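The stepwise equation can be applied to a single census tract to see how a residual leads to a siting decision. The sketch below uses the three coefficients reported above (.024, .103, and .004) and, like the equation as written in this post, includes no constant term. The tract values and the function name predict_calls are invented for illustration.

```python
# Coefficients from the stepwise equation as written above (no constant term given).
COEFFICIENTS = {"Renters": 0.024, "LowEduc": 0.103, "Jobs": 0.004}


def predict_calls(tract):
    """Predicted 911 calls for a tract given its Renters, LowEduc, and Jobs values."""
    return sum(COEFFICIENTS[name] * tract[name] for name in COEFFICIENTS)


# Invented tract values, not taken from the Portland data.
tract = {"Renters": 500, "LowEduc": 200, "Jobs": 1000}
observed_calls = 50

predicted = predict_calls(tract)
residual = observed_calls - predicted

print("predicted calls:", round(predicted, 1))
print("residual:", round(residual, 1))
if residual > 0:
    print("model under predicts: candidate area for a new hospital")
else:
    print("model over predicts: lower priority")
```

A tract with a large positive residual would show up as bright red on the residual map.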

Conclusion:

This part of the assignment helps explain regression, residuals, and multiple regression with Portland, OR as the example. The goal was to figure out where a new hospital should be built. Based on the regression analysis and the map of the residuals, the new hospital should be located in the middle census tracts. In this area the model under predicts the number of 911 calls, which means that there are more 911 calls in these tracts than the model shows. The tracts in the middle show the greatest under prediction, so the hospital should go there instead of somewhere the model over predicts, like the outer census tracts.










