Tuesday, May 9, 2017

Regression Analysis

Introduction:

The purpose of this assignment is to learn how to use and interpret regression analysis and to predict outcomes using regression. Another goal is to learn how to map standardized residuals and connect statistics to a spatial output. This assignment is split into two parts: the first part looks at the relationship between the percent of kids that get free lunches and the crime rate per 100,000 people in a community, and the second part looks at responses to 911 calls in Portland, OR.

Part One:

For part one, a news station made a claim that as the number of kids receiving a free lunch increases, the crime rate also goes up. The goal for part one is to determine whether the news station's claim is correct using the data provided and SPSS. The output of the SPSS regression analysis for the crime data is below in figure 1. Using this output, a lot of information can be derived from the data.
figure 1. The results from the SPSS regression analysis for the crime data provided
The percent of kids receiving free lunches is the independent variable and the crime rate is the dependent variable. There is a correlation between the variables because the significance level is .005, which is lower than .05. This means there is a relationship between the percent of students who receive free lunches and the crime rate. However, the r^2 is only .173, which is quite low. r^2 measures how much of the variation in the dependent variable is explained by the independent variable on a scale from 0 to 1, where 0 means the independent variable explains none of it and 1 means it explains all of it. Since the r^2 value is only .173, the percent of free lunches explains only 17.3% of the variation in the crime rate, so it is not a major factor. Technically, the news station is correct when it says that free lunches and the crime rate are connected, because there is a statistically significant relationship between the variables. However, since the r^2 value is so low, free lunches explain only a small part of the crime rate and do not predict it well at all.

Using this regression model, the crime rate can be estimated from the percent of kids receiving free lunches. From the regression results, a line of best fit can be found. The equation for the best fit line is y = 21.819 + 1.685x. If a new area of town had 23.5% of kids receiving a free lunch, then 23.5 would be substituted for x in the equation, which results in a crime rate of 61.4165 crimes per 100,000 people. There is little confidence in this prediction because of the low r^2 value. The equation also implies that if no kids got a free lunch, there would be 21.819 crimes per 100,000 people, and that if every kid (100%) got a free lunch, the crime rate would be 190.319 per 100,000 people. For every one percent increase in free lunches, the crime rate should go up by 1.685 crimes per 100,000 people.
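As a small illustration, the prediction step can be written as a one-line function using the intercept and slope reported above; this is just a sketch of the arithmetic, not part of the SPSS output.

```python
# Minimal sketch: predicting crime rate from the regression equation reported above
# (intercept 21.819, slope 1.685), taken from the SPSS output in figure 1.

def predict_crime_rate(pct_free_lunch):
    """Crime rate per 100,000 people predicted from the percent of kids on free lunch."""
    return 21.819 + 1.685 * pct_free_lunch

print(predict_crime_rate(23.5))   # ~61.42 crimes per 100,000
print(predict_crime_rate(0))      # 21.819 (the intercept)
print(predict_crime_rate(100))    # 190.319
```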

Part Two:

Introduction:

For the second part of this assignment, 911 calls are compared to other variables in Portland, OR. This is done with the data provided, as well as SPSS and ArcMap. Three variables are compared to 911 calls individually, the variable with the highest r^2 value has its residuals mapped, and finally multiple variables are compared to 911 calls at once using multiple regression analysis, both standard and with a stepwise approach.

Methods:

The first step of the second part of this assignment is to run separate regression analyses between 911 calls and different variables. For all of these, the dependent variable is the number of 911 calls and the independent variable is whichever variable is being compared to 911 calls. This can be done by opening the data in SPSS, going to the Analyze tab, and selecting Regression > Linear. Then set the dependent variable to calls and the independent variable to the variable being compared. Once this is processed, SPSS gives an output that describes the regression between calls and the selected variable.
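For readers who prefer code to menus, a rough Python equivalent of this SPSS step might look like the sketch below. The file and column names are hypothetical placeholders, not the actual field names in the provided data.

```python
# Rough equivalent of the SPSS Linear Regression step, assuming the Portland data
# are in a CSV with hypothetical column names "Calls" and "LowEduc".
import pandas as pd
from scipy import stats

df = pd.read_csv("portland_census_tracts.csv")          # hypothetical file name
result = stats.linregress(df["LowEduc"], df["Calls"])   # independent, dependent

print("slope:", result.slope)
print("intercept:", result.intercept)
print("r^2:", result.rvalue ** 2)
print("p-value:", result.pvalue)
```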

Next, to make a choropleth map of the number of 911 calls per census tract, ArcMap needs to be opened and the Portland census tracts layer added. From here, a simple symbology change produces a choropleth map of the 911 calls. To map the residuals, open the toolbox and navigate to the Spatial Statistics tools. Next, select Modeling Spatial Relationships and choose Ordinary Least Squares. Once this tool is open, select the census tracts as the input, set the unique ID field to UniqID, set the dependent variable to calls, and set the explanatory variable to the chosen independent variable. This results in a new layer that shows the residuals for each census tract in terms of calls and the variable chosen.
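The same residual calculation can also be sketched outside of ArcMap. The snippet below assumes a hypothetical shapefile path and field names, and it uses standardized residuals, which is one common convention rather than necessarily the exact output of the OLS tool.

```python
# Sketch of the residual-mapping step in Python, assuming a census tract shapefile
# with hypothetical fields "Calls" (dependent) and "LowEduc" (independent).
import geopandas as gpd
import statsmodels.api as sm

tracts = gpd.read_file("portland_tracts.shp")            # hypothetical path
X = sm.add_constant(tracts["LowEduc"])                   # adds the intercept term
model = sm.OLS(tracts["Calls"], X).fit()

# Standardized residuals: positive = model under-predicts, negative = over-predicts
tracts["std_resid"] = model.get_influence().resid_studentized_internal
tracts.plot(column="std_resid", cmap="RdBu_r", legend=True)
```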

The last steps of this assignment deal with multiple regression analysis. This works the same way as an individual regression analysis, except that multiple independent variables are chosen instead of one. The result shows the regression for all of the independent variables together. However, a stepwise approach is needed to select only the variables that work well with the data. This can be done by selecting Stepwise under the Method drop-down in the linear regression tool window, which gives an output that includes only the variables that help increase the r^2.
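SPSS's stepwise method uses entry and removal significance tests; as a rough illustration of the same idea, the sketch below adds variables one at a time only while they improve adjusted r^2. The column names passed in would be placeholders for the actual variables.

```python
# Greedy forward-selection sketch (a simplified stand-in for SPSS's stepwise method).
import statsmodels.api as sm

def forward_stepwise(df, dependent, candidates):
    """Add variables while adjusted r^2 keeps improving; return the chosen set."""
    candidates = list(candidates)
    selected = []
    best_adj_r2 = float("-inf")
    improved = True
    while improved and candidates:
        improved = False
        # score each remaining candidate when added to the current model
        scores = {}
        for var in candidates:
            X = sm.add_constant(df[selected + [var]])
            scores[var] = sm.OLS(df[dependent], X).fit().rsquared_adj
        best_var = max(scores, key=scores.get)
        if scores[best_var] > best_adj_r2:
            best_adj_r2 = scores[best_var]
            selected.append(best_var)
            candidates.remove(best_var)
            improved = True
    return selected, best_adj_r2

# hypothetical usage:
# selected, adj_r2 = forward_stepwise(tracts, "Calls", ["Renters", "LowEduc", "Jobs", "Unemployed"])
```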

Results:

For the first step of part two, variables within census tracts were compared to the number of 911 calls in each census tract. The variables selected to try to explain the number of 911 calls per census tract are the number of people with no high school degree, the unemployment rate, and the population density. The first individual regression analysis uses the number of people with no high school degree. The results for this regression can be seen below in figure 2.
figure 2. Regression analysis output for 911 calls and low education population
There is a positive relationship between 911 calls and the number of uneducated people, which can be seen from the equation of the best fit line for the data. The equation is y = 3.931 + 0.166x. This equation shows that 911 calls increase by 0.166 for each additional uneducated person in an area. Since the number of 911 calls and the number of uneducated people rise and fall together, there is a positive relationship between the variables. The r^2 for this regression is .567, which is a fairly high value, meaning that the number of uneducated people in an area explains 56.7% of the variation in 911 calls. For the hypothesis test, since the significance level is under .05, the null hypothesis (that there is no relationship between 911 calls and the uneducated population) is rejected in favor of the alternative hypothesis that there is a relationship between 911 calls and the uneducated population.

The second variable selected to try to explain the number of 911 calls is the unemployment rate. The regression analysis output can be seen below in figure 3. From this output, an equation for the best fit line can be derived.
figure 3. Regression analysis output for 911 calls and unemployment rates. 
The equation is y = 1.106 + 0.507x, where x is the unemployment rate and y is the number of 911 calls. Looking at this equation, a positive relationship can be seen between the unemployment rate and the number of 911 calls: the slope of the equation, 0.507, means that 911 calls increase by 0.507 for every unit increase in unemployment. The r^2 value is .543, which is fairly high and means that the unemployment rate explains 54.3% of the variation in 911 calls across the census tracts. The significance level is under .05, so the null hypothesis is rejected. This means the idea that there is no relationship between the unemployment rate and 911 calls is rejected in favor of the idea that there is a relationship between them.

The third variable selected to try to explain the number of 911 calls is the population density. The regression output can be seen below in figure 4. From this output, an equation for the best fit line can be found.
figure 4. Regression analysis output for 911 calls and population density.
The equation is y = 20.616 + 21909.074x, where y is the number of 911 calls and x is the population density. From this, the relationship between the variables can be found: it is positive because the slope, 21909.074, is positive, and for every one unit increase in population density, the number of 911 calls goes up by 21909.074. However, population density does not do a good job of predicting 911 calls because the r^2 value is .004, meaning that population density explains only 0.4% of the variation in 911 calls. Since the significance level, .555, is over .05, the null hypothesis fails to be rejected, meaning there is no significant relationship between 911 calls and population density.

The second step of part two maps the number of 911 calls per census tract and the residuals for the variable with the highest r^2 value, the uneducated population. The first map simply shows the total number of 911 calls per census tract in Portland. This map can be seen below in figure 5, and it makes it easy to see where the most 911 calls are occurring.
figure 5. Number of 911 calls per census tract in Portland OR
The majority of 911 calls come from the northeast census tracts. There are also a few census tracts in the southeast that have a high number of 911 calls, while the west edge of Portland generates very few. The next map shows the residuals for the regression analysis between 911 calls and the number of uneducated people. A residual is the distance a data point is from the best fit line. When all the residuals are found, they can be mapped. Figure 6 below shows the mapped residuals for the regression analysis, highlighting the areas where the model is over-predicting and under-predicting.
figure 6. Map of residuals from regression analysis for 911 calls and amount of uneducated people
Areas on the map that are red are where the model under-predicts the number of 911 calls, and areas in blue are where it over-predicts.

The third step of part two deals with multiple regression. For this part, a multiple regression analysis is performed on the data. The number of 911 calls remains the dependent variable, but multiple independent variables are used as the input. The output for the multiple regression analysis can be seen below in figure 7 and figure 8.
figure 7. Output for multiple regression analysis part 1


figure 8. Output for multiple regression analysis part 2
The r^2 value for this model is .783, which is fairly high and means that these variables together explain 78.3% of the variation in 911 calls. Looking at the output, the most influential variable can be identified from the absolute value of the beta value. In this case, LowEduc is the most influential variable and unemployed is the least influential. This output also shows whether there is any collinearity between variables, which can be checked by looking at the bottom row of the collinearity diagnostics table. If the condition index is above 30, then there is collinearity and a variable needs to be eliminated. In this case, the condition index is under 30, so there is no collinearity. If there were, the variable whose entry to the right of the condition index is closest to 1 would be eliminated and the regression would be run again without it. Next, the same regression is run but the method is changed to a stepwise approach. The output for the stepwise regression can be seen below in figures 8, 9, 10, and 11.
figure 8. Output for step wise regression part 1
   

figure 9. Output for step wise regression part 2



figure 10. output for step wise regression part 3
figure 11. Output for step wise regression part 4

Stepwise regression includes only the variables that contribute the most to the equation. The three variables that drive the equation the most were Renters, LowEduc, and Jobs. With these three variables together, the r^2 value is .771, which is fairly high and means that these three variables explain 77.1% of the variation in 911 calls. The equation for this output is 911 calls = 0.024(Renters) + 0.103(LowEduc) + 0.004(Jobs). All of these variables have positive coefficients, so they all have a positive relationship with the number of 911 calls. Looking at the beta values, the variables can be ranked from most influential to least influential: LowEduc, Jobs, Renters. Using this model, the residuals can be mapped, as seen below in figure 12.
figure 12. Mapped residuals for Portland OR
This map is read the same way as the previous residuals map: the red areas are where the model under-predicts and the blue areas are where it over-predicts. This map is useful because a new hospital should go in an area where the model under-predicts the number of 911 calls, since those areas have more 911 calls than the model expects. This means a new hospital should go in or near a census tract that is bright red, which would allow the hospital to serve the greatest number of people who need it.
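The collinearity check described above (a largest condition index over about 30 suggests a problem) can also be approximated outside of SPSS. The sketch below shows one common way to compute condition indices; the field names are placeholders rather than the exact names in the Portland data.

```python
# Rough sketch of a condition-index check for collinearity.
import numpy as np

def condition_indices(df, columns):
    """Condition indices from the unit-scaled design matrix (constant included)."""
    X = np.column_stack([np.ones(len(df))] + [df[c].to_numpy(dtype=float) for c in columns])
    X = X / np.linalg.norm(X, axis=0)           # scale each column to unit length
    eigvals = np.linalg.eigvalsh(X.T @ X)       # eigenvalues of the scaled cross-product matrix
    eigvals = np.clip(eigvals, 1e-12, None)     # guard against tiny negative round-off values
    return np.sqrt(eigvals.max() / eigvals)

# hypothetical usage:
# idx = condition_indices(tracts, ["Renters", "LowEduc", "Jobs", "Unemployed"])
# collinearity is suspected if idx.max() > 30
```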

Conclusion:

This part of the assignment helps explain regression, residuals, and multiple regression using Portland, OR as the example. The goal was to figure out where a new hospital should be built. Based on the regression analysis and the map of the residuals, the location of a new hospital should be right in the middle of these census tracts. This area is under-predicted by the model: there are more 911 calls in these tracts than the model says there should be. The tracts in the middle show the greatest under-prediction, so the hospital should go there rather than somewhere the model over-predicts, like the outer census tracts.











Monday, April 24, 2017

Correlation and Spatial Autocorrelation

Introduction:

The purpose of this assignment is to become familiar with correlation and spatial autocorrelation using SPSS and GEODA. The first part of this assignment uses census tracts in Milwaukee to look at the correlations between different fields, such as the white population and the number of retail employees. The second part uses election and population data in Texas to look at the spatial autocorrelation of voter turnout, Democratic voters, and Hispanic populations.

Part One:

For the first part of this assignment, different attributes of the census tracts in Milwaukee, Wisconsin are compared to see if there is a correlation between them. A table showing the correlations between the different attributes can be found below in table 1. This table is output from SPSS using data that was provided by the instructor.
table 1. Table of correlations between attributes for Milwaukee census tracts
The attributes that this table compares are the number of manufacturing employees, the number of retail employees, the number of finance employees, the white population, the black population, the Hispanic population, and the median household income. In this table, if an entry has two stars, the correlation between the variables is significant at the 99% level, which indicates a strong correlation. The pairs with this strong correlation are: the number of manufacturing employees with all the other variables; the number of retail employees with all other variables except the Hispanic population; the number of finance employees with all the other variables; the white population with all other variables except the Hispanic population; the black population with all other variables; the Hispanic population with all other variables except the number of retail employees, the white population, and median household income; and the median household income with everything except the Hispanic population. For all of these entries, a trend can be clearly seen between the two variables, with either a positive or negative correlation.

The entries with only one star have a correlation significant at the 95% level. These are the white population compared to the Hispanic population and the Hispanic population compared to median household income. These entries still show a correlation between the two variables, just not as strong as the two-star ones. If there is no star next to an entry, then no correlation was found between the variables. The only entry with no correlation is the Hispanic population compared to the number of retail employees.

The table does not only show the strength of each correlation, it also shows the direction, which can be either positive or negative. If the correlation is positive, then when one variable goes up the other variable should also go up. If the correlation is negative, then when one variable goes up the other should go down. All of the correlations in the table above are positive except for the black population compared to all other variables and the Hispanic population compared to the number of finance employees, the black population, and median household income. This means, for example, that when the Hispanic population goes up the black population goes down, or when the black population goes up the Hispanic population goes down. Using this data it is easy to infer that black and Hispanic populations are not doing very well economically in the Milwaukee area, because they both have a negative correlation with median household income: when the black or Hispanic population increases, median household income decreases, and vice versa. This contrasts with the correlation between the white population and median household income, which is a strong positive correlation, meaning that when the white population goes up, the median household income in that area also goes up.
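For reference, a correlation table like table 1 can also be produced in Python. The sketch below assumes the Milwaukee attributes are in a CSV with hypothetical column names.

```python
# Rough Python equivalent of the SPSS correlation table.
import pandas as pd
from scipy import stats

df = pd.read_csv("milwaukee_tracts.csv")     # hypothetical file name
cols = ["Manufacturing", "Retail", "Finance", "White", "Black", "Hispanic", "MedInc"]

print(df[cols].corr())                       # Pearson correlation matrix

# Significance for a single pair, e.g. white population vs. median household income
r, p = stats.pearsonr(df["White"], df["MedInc"])
print(r, p)                                  # p < .01 would earn two stars in SPSS
```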

Part Two:

Introduction:

For this part of the assignment, the Texas Election Commission (TEC) provided data about the 1980 and 2016 presidential elections and wants analysis done on the patterns. They want to know if there is clustering of voting patterns in the state, as well as of voter turnout, and whether the election patterns have changed over the 36 years. In addition to the election data, population data is analyzed to see if there is clustering of Hispanic populations in Texas.

Methodology:

For this assignment, the election data for 1980 and 2016 was provided by the TEC. The population data needed to be downloaded separately from the US Census website, along with a shapefile of Texas. The population data from the US Census is very cluttered, so all the fields can be deleted except for the geo id field and the percent Hispanic population field. Once the table is simplified, it can be joined to the Texas shapefile and exported as a new feature class. Next, the election data can be joined to the new feature class and exported to create a layer with all of the desired attributes, which is then saved as a shapefile. Inside GEODA, the shapefile is opened, a new weights file is created in the weights manager, and an ID variable is added. Once this is complete, the Moran's scatter plot tool can be run with the scatter plot and LISA cluster map selected as outputs. After this process runs, a scatter plot and a cluster map appear that show the spatial autocorrelation of the chosen variable across Texas.
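The same Moran's I and LISA outputs can be approximated in code with the PySAL libraries. The sketch below assumes a hypothetical shapefile and field name, and uses queen contiguity weights, which is one reasonable choice rather than necessarily the exact weights used in GEODA here.

```python
# Hedged sketch of the GeoDa workflow using PySAL.
import geopandas as gpd
from libpysal.weights import Queen
from esda import Moran, Moran_Local

counties = gpd.read_file("texas_counties.shp")       # hypothetical path
w = Queen.from_dataframe(counties)                    # contiguity-based spatial weights
w.transform = "r"                                     # row-standardize the weights

mi = Moran(counties["Turnout80"], w)                  # global Moran's I for a hypothetical field
print(mi.I, mi.p_sim)

lisa = Moran_Local(counties["Turnout80"], w)          # local clusters (LISA)
counties["cluster"] = lisa.q                          # quadrants: 1=HH, 2=LH, 3=LL, 4=HL
```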

Results:

The results from this process are a cluster map and a scatter plot for each of the variables. The first variable that a scatter plot and cluster map were made for was the voter turnout in 1980. The map and scatter plot can be seen below in map 1 and graph 1.
map 1. Texas voter turnout for 1980


graph 1. Texas voter turnout for 1980













The red areas are counties with high voter turnout surrounded by other counties with high voter turnout. The light red areas are counties with high voter turnout surrounded by counties with low voter turnout. The light blue areas are counties with low voter turnout surrounded by counties with high voter turnout. The blue areas are counties with low voter turnout surrounded by counties that also have low voter turnout. Looking at the cluster map, map 1, the north and central parts of Texas have clusters of high voter turnout and the south and east sides of the state have clusters of low voter turnout. For most of the state, there is no significant clustering. Graph 1 shows a scatter plot for the voter turnout in 1980 and provides the Moran's I value. Moran's I measures spatial autocorrelation on a scale from -1 to 1: values near 1 indicate strong clustering of similar values, values near 0 indicate a random pattern, and values near -1 indicate that neighboring values tend to be dissimilar (dispersion). The Moran's I value for this scatter plot is 0.468, which shows that voter turnout in 1980 had some clustering, but not to a great extent. The next cluster map and scatter plot were made for the voter turnout in 2016 and can be seen below in map 2 and graph 2. The map colors mean the same thing as above.

map 2. Texas voter turnout for 2016



graph 2. Texas voter turnout for 2016


















Looking at map 2, a pattern emerges. There is a lot of clustering of low voter turnout along the south edge of the state, with a small cluster on the northwest edge. The north and central parts of the state have clusters of higher voter turnout. The Moran's I value for this scatter plot is 0.287, which is lower than the Moran's I value for 1980, meaning there is even less clustering in 2016 than in 1980. When map 1 and map 2 are compared, the change in voter turnout can be seen. The low-turnout cluster at the south end of the state stays about the same but shrinks a little. The cluster of high voter turnout in the middle of the state gets smaller, and far fewer counties appear to have high voter turnout. The cluster of low voter turnout on the west side of the state is gone entirely, and the west side of the state has much lower voter turnout than it did in 1980.

The second set of maps and scatter plots deals with the percent of Democratic voters in each county in 1980 and 2016. Map 3 and graph 3 below show the percent of the Democratic vote in each county in Texas in 1980. The colors represent the same categories as in the previous maps, except that instead of voter turnout, the map shows the percent of the Democratic vote.

map 3. Percent of democratic vote by county in 1980


graph 3. Percent of democratic vote by county in 1980















This map shows the clusters of counties that had a high percent of Democratic votes and the clusters of counties that had a low percent. The south and east parts of the state have a large number of counties clustered together with a high percentage of Democratic votes. The north and west edges of the state have a large number of counties clustered together with a low percentage of Democratic votes. The Moran's I value for graph 3 is 0.575, which shows a good amount of clustering, though not extremely strong. The next map and scatter plot are for the percent of the Democratic vote in Texas in 2016. Below are map 4 and graph 4, which show the spatial autocorrelation for this attribute; as above, the colors represent the same categories. Using this map, a pattern can be seen.

map 4. Percent of democratic vote by county in 2016


graph 4. Percent of democratic vote by county in 2016














This map shows heavy clustering of counties with a high Democratic vote percentage along the south edge of the state and heavy clustering of counties with a low Democratic vote percentage in the north and central parts of the state. The Moran's I value for graph 4 is 0.685, which means there is even more clustering in 2016 than in 1980 in terms of the percent of the Democratic vote. Comparing these maps, the change in Democratic voting can be seen. From 1980 to 2016, the south gained a lot more clustering of high Democratic vote percentages, especially toward the west. The north stays the same, but the cluster in the central part of the state shifts to the east. Also, the high cluster area on the east edge of the state disappears completely.

The last map and scatter plot show the percent Hispanic population per county in Texas. The map and scatter plot can be seen below in map 5 and graph 5. This map shows a strong pattern of where counties with similar percent Hispanic populations are clustered.
map 5. Percent Hispanic population



graph 5. Percent Hispanic population














There is heavy clustering of counties with a high percentage of Hispanic population along the south and west edges of the state and heavy clustering of counties with a low percentage of Hispanic population to the north and east. The Moran's I value for graph 5 is 0.778, the highest Moran's I value of all the data, meaning this variable shows the most clustering. When map 5 is compared to maps 2 and 4, a correlation can be seen between the Hispanic population, voter turnout, and the percent of Democratic voters. Table 2 shows the correlation matrix for this data; the percent_1
table 2. Correlation matrix for Texas data
column is the percent Hispanic. This table shows that there is a correlation between the percent Hispanic population and the percent of Democratic voters: when there is a higher percentage of Hispanic population, the percent of Democratic votes goes up, and vice versa. There is also a negative correlation between the percent Hispanic population and voter turnout in both 2016 and 1980, meaning that when the percent Hispanic population goes up, voter turnout goes down.

Conclusion:

In conclusion for part 2 of this assignment, it seems that when the percent Hispanic population goes up in a county, the voter turnout for that county goes down and the percent of Democratic voters goes up. This correlation suggests that Hispanic populations tend to vote for Democratic candidates when they turn out to vote at all.

Tuesday, April 4, 2017

Hypothesis Testing

Objective:

The objective of this assignment was to use hypothesis testing on sample and real-world data to better understand z and t tests and when to use them. This assignment's goal was also to learn how to calculate z and t values and to make a decision about the null and alternative hypotheses.

Part 1:

The first part of this assignment was to finish a table that was provided. The table gives the interval type (one or two tailed), the confidence level, and the sample size; from these, the level of significance, whether a z or t test is required, and the critical z or t value need to be determined. The table below (table 1) shows the table filled out with the missing information added.
table 1. Finished statistics table
The interval type can be either a one-tailed or a two-tailed test. A one-tailed test means that only the positive or the negative critical value is found. A two-tailed test means that values are found at both the positive and negative extremes, and the level of significance is divided by 2. The confidence level is the number of times out of 100 that the sample statistic is expected to reflect the population. n is the sample size of the sample that is taken. The level of significance is set to determine the probability of a type I error occurring. Choosing between z and t is simple: if n is over 30, use a z test; if n is 30 or under, use a t test. For the critical z or t value, the level of significance (and, for a t test, the degrees of freedom) is looked up in a z or t table to find the cutoff that determines whether the null hypothesis is rejected or not.
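Instead of a printed chart, the critical values can also be looked up with scipy. The sketch below reproduces a few standard ones, including the 2.074 and 1.746 cutoffs used later in this assignment.

```python
# Looking up critical values in code instead of a z or t table.
from scipy import stats

print(stats.norm.ppf(0.975))       # two-tailed 95% z critical value, ~1.96
print(stats.norm.ppf(0.95))        # one-tailed 95% z critical value, ~1.645
print(stats.t.ppf(0.975, df=22))   # two-tailed 95% t critical value for n = 23, ~2.074
print(stats.t.ppf(0.95, df=16))    # one-tailed 95% t critical value for n = 17, ~1.746
```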

The next part of this assignment looks at Kenya's agriculture. From a sample of 23 farmers, it needs to be determined whether their yields for different types of crops are statistically different from the county as a whole. The three crops are ground nuts, cassava, and beans. For all calculations, a confidence level of 95% is used with a two-tailed test, so the critical value for all three tests is +/- 2.074. A t test is used because the sample size is under 30. The t value is found with the following equation: t = (sample mean - population mean) / (sample standard deviation / square root of n).

Ground nuts:
Null hypothesis: There is no difference in the yield of ground nuts between the sample farmers and the county as a whole.
Alternative hypothesis: There is a difference in the yield of ground nuts between the sample farmers and the county as a whole.
Equation: (0.52-0.57)/(.3/sqrt 23)= -0.799
Probability: 21.66%
-0.799 falls between -2.074 and 2.074, so for ground nuts, the null hypothesis fails to be rejected.

Cassava:
Null hypothesis: There is no difference in the yield of cassava between the sample farmers and the county as a whole.
Alternative hypothesis: There is a difference in the yield of cassava between the sample farmers and the county as a whole.
Equation: (3.3-3.7)/(.75/sqrt 23) = -2.558
Probability: 1.07%
-2.558 falls outside of -2.074 and 2.074, so for cassava, the null hypothesis is rejected.

Beans:
Null hypothesis: There is no difference in the yield of beans between the sample farmers and the county as a whole.
Alternative hypothesis: There is a difference in the yield of beans between the sample farmers and the county as a whole.
Equation: (0.34-0.29)/(0.12/sqrt 23) = 1.998
Probability: 97.03%
1.998 falls between -2.074 and 2.074, so for beans, the null hypothesis fails to be rejected.
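As a quick check, the three t values above can be reproduced directly from the formula given earlier, using the sample means and standard deviations from the equations above.

```python
# Reproducing the one-sample t statistics for the three crops.
from math import sqrt

def t_stat(sample_mean, pop_mean, sample_sd, n):
    return (sample_mean - pop_mean) / (sample_sd / sqrt(n))

print(t_stat(0.52, 0.57, 0.30, 23))   # ground nuts, ~ -0.80
print(t_stat(3.30, 3.70, 0.75, 23))   # cassava,     ~ -2.56
print(t_stat(0.34, 0.29, 0.12, 23))   # beans,       ~  2.00
```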

Similarities:
Beans and ground nuts both failed to reject the null hypothesis, meaning that neither is statistically different from the population mean. Another similarity is between ground nuts and cassava: both of their sample means were lower than the population means.

Differences:
Cassava was the only crop for which the null hypothesis was rejected. Beans were the only crop with a sample mean larger than the population mean.

The last section of part one looks at the level of pollution in a stream. Seventeen samples were taken, with a sample mean pollution level of 6.4 mg/l and a standard deviation of 4.4 mg/l. The allowable limit for stream pollution is 4.2 mg/l. This part uses hypothesis testing to determine if the stream is statistically over the allowable limit.

Null hypothesis: There is no difference between the mean of the pollution samples and the allowable limit for the stream.
Alternative hypothesis: There is a difference between the mean of the pollution samples and the allowable limit for the stream.
Statistical test: The sample size is 17. 17 is under 30 so a t test is needed for this.
Level of significance: A 95% confidence level with a one-tailed test was provided. The critical value for a one-tailed t test at this level (with 16 degrees of freedom) is 1.746.
Equation: (6.4-4.2)/(4.4/sqrt 17) = 2.0616
Probability: 97.4%
2.0616 is over 1.746, so the null hypothesis is rejected.
This means that there is a difference between the mean of the pollution samples and the allowable limit for the stream. The researcher was correct that the level of pollution in the stream is over the allowable limit; however, how far over the limit it is cannot be determined from this test.

Part 2:

For the second part of the assignment, home values are compared between the block groups for the city of Eau Claire and the block groups for Eau Claire county as a whole, to see if home values for the city are statistically different from those of the county. This can be done with hypothesis testing. The mean home value for Eau Claire county is 169,438.13. The mean home value for the city of Eau Claire is 151,876.509, with a standard deviation of 49,706.919 and an n of 53.

Null hypothesis: There is no difference between the home values for the city of Eau Claire and the county of Eau Claire.
Alternative hypothesis: There is a difference between the home values for the city of Eau Claire and the county of Eau Claire.
Statistical test: The sample size is 53. 53 is larger than 30 so a z test will be used.
Level of significance: A confidence level of 95% is used, giving a critical value of 1.64; because the sample mean is smaller than the population mean, the critical value is multiplied by -1, giving -1.64.
Equation: (151876.509-169438.13)/(49706.919/sqrt 53) = -2.572
-2.572 is smaller than -1.64, so the null hypothesis is rejected. This means that there is a statistical difference between the home values of the city of Eau Claire and Eau Claire county. The map below (map 1) shows the home values for the city of Eau Claire and for Eau Claire county.
map 1. Home values by block group for Eau Claire county

This map helps show how home values in the city of Eau Claire compare to those in the rest of Eau Claire county. The map also gives a reference for where the city of Eau Claire is within Eau Claire county and where Eau Claire county is in Wisconsin.
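As a quick check, the z value above can be reproduced directly from the numbers reported in this section.

```python
# Verifying the z value for the Eau Claire home value comparison.
from math import sqrt

z = (151876.509 - 169438.13) / (49706.919 / sqrt(53))
print(round(z, 3))   # ~ -2.572, beyond the -1.64 critical value, so reject the null
```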


Wednesday, March 8, 2017

Z Score and Probability

Introduction:

The purpose of this assignment is to investigate foreclosures in Dane County using z-scores, probability, and the add field tool. Officials in Dane County are concerned about the increasing number of foreclosures from 2011 to 2012 and would like spatial analysis done to find patterns. The questions asked for this assignment are what the pattern of change from 2011 to 2012 is and what foreclosures will look like in 2013 if the trend continues.

Methodology:

To answer the first question the Dane County officials asked, the Dane County tracts feature class needs to be added to the workspace. The tracts feature class contains the number of foreclosures for 2011 and 2012 but not the difference between the two. However, the difference can be found from the information that is given: a new field is created and the field calculator is used to set its values to "2012 foreclosures - 2011 foreclosures". This gives each tract a new attribute that is the change from 2011 to 2012. Once this is complete, a map can be made that shows the change for each tract from 2011 to 2012.

For the next question, z scores are needed. A z score is a value for an observation that represents the exact number of standard deviations the observation is from the mean. It is calculated by taking the observation value, subtracting the mean from it, and dividing the result by the standard deviation. The larger the absolute value of the z score, the farther the observation is from the mean; the smaller it is, the closer the observation is to the mean. For this assignment, the z score for three tracts was found for 2011 and 2012. Map 1 below shows all the tracts in Dane County. The three tracts that z scores were found for were tracts 122.01, 31, and 114.01.
map 1. Dane County Wisconsin Tracts
Tract 122.01 had a z score of -0.61 in 2011 and -0.64 in 2012. This means the tract is within one standard deviation of the mean, so it has an approximately average number of foreclosures. Tract 31 had z scores of 1.44 and 0.58 in 2011 and 2012 respectively, which shows that from 2011 to 2012 its number of foreclosures moved much closer to the mean. Tract 114.01 had z scores of 2.35 and 2.70 in 2011 and 2012 respectively. This tract is much more of an outlier because it is over 2 standard deviations from the mean, meaning there are far more foreclosures in tract 114.01 than in other tracts. Z scores can also be used with a z score table to find the probability of observing a given value. The z score and z score table are used to answer the next question about how the foreclosures will look in 2013 if current trends continue. First, the Dane County tracts feature class is added to the workspace. The differences field that was created above is used to create another field that predicts the 2013 foreclosures: the new field's values are found by adding the "2012 foreclosures" to the "differences" in the field calculator, giving each tract an estimated number of foreclosures for 2013. After the new attribute is added, the z score table can be used to find how many foreclosures a tract needs before it is in the top 70% and the top 20% of all tracts.
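As a sketch of the z score table step, the two cutoffs can also be computed in code, assuming the predicted 2013 foreclosures are roughly normally distributed and using the mean and standard deviation reported in the results below.

```python
# Deriving the top 70% and top 20% foreclosure cutoffs from z scores.
from scipy import stats

mean, sd = 13.2, 13.4
top70_cutoff = mean + stats.norm.ppf(0.30) * sd   # 70% of tracts fall above this value
top20_cutoff = mean + stats.norm.ppf(0.80) * sd   # 20% of tracts fall above this value
print(round(top70_cutoff, 1), round(top20_cutoff, 1))   # ~6.2 and ~24.5 foreclosures
```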

Results:

The results for the first question can be seen below in map 2, which shows the change in the number of foreclosures from 2011 to 2012. Tracts that are blue had more foreclosures in 2012 than in 2011, and red tracts had fewer foreclosures in 2012 than in 2011. The darker the color, the more the number of foreclosures changed from 2011 to 2012.
map 2. Dane County foreclosure differences from 2011 to 2012

If a tract is neither blue nor red, then it did not have a substantial change from 2011 to 2012. For the next question, the results are in map 3 below. Map 3 shows which tracts are predicted to be in the top 70% and top 20% of foreclosures in 2013. For the predicted 2013 foreclosures, the data have a mean of 13.2 foreclosures and a standard deviation of 13.4. The tracts in green are predicted to be in the top 70% of foreclosures, with a cutoff of 6.2 foreclosures derived from the corresponding z score; if a tract has more than 6.2 foreclosures, it is in the top 70% of tracts. The cross-hatched tracts are predicted to be in the top 20% of foreclosures, with a cutoff of 24.5 foreclosures; if a tract has more than 24.5 foreclosures, it is in the top 20% of tracts in Dane County. Naturally, all the cross-hatched tracts are also green. These green cross-hatched areas are where the most foreclosures are expected. Most of these areas are in large tracts that have more houses and therefore more opportunities for foreclosure.
map 3. Dane county projected foreclosures in 2013 by tracts.
Most of the green cross-hatched areas are also blue areas on map 2, where tracts gained a lot of foreclosures from 2011 to 2012. This is most likely because areas that gained many foreclosures between 2011 and 2012 are projected to gain the same number again the next year, which causes the projected foreclosures to be exaggerated.

Conclusion: 

Most of the foreclosures are located in the larger tracts on the outside of the county and in the tracts that gained a lot of foreclosures from 2011 to 2012. This implies that more rural areas and larger tracts have a much greater chance of having foreclosures. A recommendation to address this problem is to investigate how easy it is to get a loan for these homes and perhaps place stricter restrictions on home loans and mortgages, so that only people who can afford their homes and payments receive them and foreclosures become less likely.
















Monday, February 20, 2017

Descriptive Statistics and Mean Centers

Part 1:

The first part of this assignment takes a look at two different cycling teams and uses statistics to determine which team to invest money into. The two teams are team Astana and team Tobler and the times for their last race are listed below in table 1. 
table 1. Team Astana and Team Tobler individual race times in minutes
With this data, different statistics can be calculated to get a better idea of what the information is showing. The statistics that will be applied to this data are range, mean, median, mode, kurtosis, skewness, and standard deviation. Below, these terms are defined and applied to the data.

  • Range: This refers to the extent of the information that is available. It is found by finding the difference between the highest and lowest value.   
  • Mean:  This is the average of the data. It is the middle of the data and is heavily influenced by outliers. It is found by adding all the values together and dividing by the total number of values. 
  • Median: This is the exact middle of the data set. It differs from the mean because it is the middle position of the ordered values rather than the arithmetic average. If there is an odd number of values, the median is the middle value; if there is an even number of values, the middle two values are added together and divided by 2. This measure is also more resistant to outliers than the mean.
  • Mode: This is the most common value that is in the data set. There needs to be at least two of the same values in the data set to have a mode. 
  • Kurtosis: This refers to the shape of the graphed data. Kurtosis is a measure of how peaked or flat the graph is. A positive kurtosis means the graph is relatively peaked and that is called leptokurtic. A negative kurtosis means the graph is relatively flat and that is called platykurtic.
  • Skewness: This, like kurtosis, refers to the shape of the graphed data. Skewness is a measure of how symmetrical the graph is; a value between -1 and 1 means the distribution is close to normal or acceptable. If skewness is positive, the distribution has a longer tail to the right and large outliers are affecting the data. If skewness is negative, the distribution has a longer tail to the left and small outliers are affecting the data.
  • Standard Deviation: This measures how spread out the data is. For a normal distribution, about 68% of all observations fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three. If the graph is flatter, the data is more spread out and the standard deviation is larger; if the graph is more peaked, the data is closer together and the standard deviation is smaller. A population standard deviation is found by taking the difference between each observation and the mean, squaring those differences, adding them together, dividing by the total number of observations, and taking the square root. If the whole population is not known, the same steps are followed except that the sum of squared differences is divided by the number of observations minus one. A short code sketch after this list computes these statistics.
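As a short sketch, the statistics defined above can be computed in code; the race times below are made-up example values, not the actual team data from table 1.

```python
# Descriptive statistics for a hypothetical list of race times in minutes.
import numpy as np
from scipy import stats

times = np.array([102.0, 108.5, 110.0, 111.5, 108.5, 125.0])   # made-up example values

print("range:", times.max() - times.min())
print("mean:", times.mean())
print("median:", np.median(times))
print("mode:", stats.mode(times).mode)
print("kurtosis:", stats.kurtosis(times))        # positive = peaked (leptokurtic)
print("skewness:", stats.skew(times))            # positive = long right tail
print("sample std dev:", times.std(ddof=1))      # ddof=1 divides by n - 1
```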
Above are the definitions of the statistics used to better understand the data that was given. Below is a table (table 2) that shows the value of each of these measures for both teams, as well as the work done by hand to calculate standard deviation (figure 1).
table 2. statistics applied to team Astana and team Tobler race times (time in hours, minutes)  
figure 1. standard deviation calculations by hand for team Astana and team Tobler 
Based on this information, the team that should be invested in is team Astana. This team was chosen because it has the better average time and it has the fastest individual rider. Assuming that a team wins by having the best average time, it makes sense to go with the team that has the better average. Also, because team Astana has the fastest individual, the team would have the rider most likely to win. This means the owner of team Astana would get the 25% of $300,000 as well as the 35% of the $400,000, instead of the $0 that the owner of team Tobler would get. The most important team statistic is the average, or mean, because that is most likely what the teams will be judged on.

Part 2:

The second part of the assignment looks at the geographic mean center of Wisconsin and weighted geographic mean centers of Wisconsin. A mean center takes the coordinates (X, Y) for a series of points and finds the mean (average) of the X values and the Y values separately. When the means of the X and Y values are found, the new coordinate pair can be plotted to show the mean center. For this assignment, the first mean center found was for all the counties in Wisconsin, using the geographic center of each county as the coordinates. This is shown by the green dot on the map below (map 1). The second mean center was weighted by county population in 2000, shown by the purple dot on map 1. The third mean center was weighted by county population in 2015, shown by the blue dot on map 1.
map 1. weighted and geographic mean centers by county in Wisconsin; 2000, 2015


The geographic mean center is the green dot and represents the unweighted spatial center of Wisconsin based on the county centroids. It falls near the center of Wisconsin, as it should. The next point is the mean center weighted by county population in 2000. This dot is shifted considerably to the southeast, because the high population of Milwaukee county causes that county to be weighted much more heavily than the others and drags the mean center toward it. The last point is the mean center weighted by county population in 2015. Here, the dot shifts to the west and slightly to the north. This is either because Milwaukee is losing some of its population or because more people are moving to west-central Wisconsin around Eau Claire and closer to Minneapolis.
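The mean center calculation itself is simple; the sketch below uses made-up coordinates and populations just to show the unweighted versus population-weighted computation.

```python
# Minimal sketch of an unweighted and a population-weighted mean center.
import numpy as np

x = np.array([520_000.0, 610_000.0, 430_000.0])   # made-up centroid X coordinates
y = np.array([480_000.0, 300_000.0, 520_000.0])   # made-up centroid Y coordinates
pop = np.array([104_000, 940_000, 63_000])        # made-up county populations

mean_center = (x.mean(), y.mean())                            # unweighted mean center
weighted_center = (np.average(x, weights=pop),                # weighted by population
                   np.average(y, weights=pop))
print(mean_center, weighted_center)
```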









Wednesday, February 1, 2017

Data Types and Classification

Part one: Data types

Data is a big part of research and geography, especially quantitative geography research. Data comes in four different types: nominal, ordinal, interval, and ratio. The type of data that is needed changes based on what information is being shown and how it is being shown.

Nominal data: This data type sorts information into different unique categories that have no assumed relationship between them. It is normally used for information like names or street numbers to which no meaningful math can be applied. An example is "346 Water St" and "792 Clairemont Ave": the text cannot be added or subtracted, and while the numbers can be added or subtracted, the result does not mean anything. The map below is an example of nominal data because it maps the counties in the US and separates them into unique categories with no assumed relationship between them. The categories in this map are the counties won by Republicans and by Democrats during the 2016 US presidential election.

        "2016 US Presidential Election Map By County & Vote Share." Brilliant Maps. November 30, 2016. Accessed February 01, 2017. http://brilliantmaps.com/2016-county-election-map/.

Ordinal data: This data type sorts information into an arbitrary scale whose categories are related to each other by rank. There are two types of ordinal data. The first is strongly ordered data, which is ranked data where each piece of information is given a specific place in the order; an example would be mapping the 10 coldest cities. The second is weakly ordered data, in which the data is put into ranked categories; an example would be mapping the percentage of a certain ethnic group using different percentage categories. The map below is an example of ordinal data because the information is split into categories that have a specific order, are on an arbitrary scale, and are related to each other. In this case the categories go from soft to hard and show the hardness of the ground across the US.

     "Aggregate Hardness Map of the United States." ForConstructionPros.com. Accessed February 01, 2017. http://www.forconstructionpros.com/article/10745911/aggregate-hardness-map-of-the-united-states.

Interval Data: This data type sorts information into a scale that has no meaningful zero point. This scale may have a zero but it is just a reference point and not a starting point. This data can be mapped in different units such as temperature (Fahrenheit, Celsius) or elevation (meters, feet). The map below is an example of interval data because the information does not have a meaningful zero point and can be mapped using different units. This map shows elevation in the US by meters. 

     "US Elevation and Elevation Maps of Cities, Topographic Map Contour." US Elevation and Elevation Maps of Cities, Topographic Map Contour. Accessed February 01, 2017. http://www.floodmap.net/Elevation/CountryElevationMap/?ct=US.


Ratio data: This data type sorts information into a scale that has a meaningful zero point that serves as a starting point. This data is used for counts of something in an area, such as people per county; rates, such as how many pets there are per person in an area; or densities, such as how many cows there are per square mile. The map below is an example of ratio data because it shows a rate that has a meaningful zero point which is also the starting point.

     "CensusScope -- Demographic Maps: African-American Population." CensusScope -- Demographic Maps: African-American Population. Accessed February 01, 2017. http://www.censusscope.org/us/map_nhblack.html.



Part two: Data classification

Data can be classified in many different ways, and the way that the information is classified can change how the map looks; a map can be misleading or skewed if the wrong method is chosen. Three of the most common types of data classification are equal interval, quantile, and natural breaks.

Equal interval: This data classification method takes the range of the values in the information and divides the range equally into the number of classes that are desired. This makes each class cover the same amount of the range as the other classes in increasing or decreasing order. The map below is an example of equal interval data classification.

Quantile: This data classification method divides the number of entries by the number of classes that are desired, so that each class contains the same number of entries as the other classes. The map below is an example of quantile data classification.

Natural breaks: This data classification method looks at the data and divides it where there are notable separations or gaps. The map below is an example of the natural breaks method.
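As a small sketch, the equal interval and quantile breaks can be computed directly; the values below are made up, and natural breaks (Jenks) typically requires a dedicated routine rather than a one-liner.

```python
# Equal interval and quantile class breaks for five classes, using made-up values.
import numpy as np

values = np.array([2, 4, 5, 6, 7, 8, 9, 11, 14, 18, 22, 31], dtype=float)
k = 5

equal_interval = np.linspace(values.min(), values.max(), k + 1)   # equal-width classes
quantile = np.quantile(values, np.linspace(0, 1, k + 1))          # equal-count classes
print(equal_interval)
print(quantile)
```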

The data classification method that should be used is the equal interval method. This method was chosen because it makes it look like there are not a lot of women-operated farms compared to the other two methods. If the company wants to increase the number of women-operated farms, it needs to show that there is a shortage of them, and that is what the equal interval classification suggests. The other maps have far more dark counties, while the equal interval map has many more light counties, so the equal interval map makes it look like there are not many women running farms in Wisconsin.