Monday, April 24, 2017

Correlation and Spatial Autocorrelation

Introduction:

The purpose of this assignment is to become familiar with correlation and spatial autocorrelation using SPSS and GEODA. The first part of this assignment uses census tracts in Milwaukee to look at the correlations between different fields such as the white population and number of retail employees. The second part of this assignment uses election and population data in Texas to look at the spatial autocorrelation of voter turnout, Democratic voters, and Hispanic populations.

Part One:

For the first part of this assignment, different attributes of the census tracts in Milwaukee, Wisconsin are compared to see if there is a correlation between them. The correlations between the attributes are shown below in table 1, which was produced with the SPSS software from data provided by the instructor.
table 1. Table of correlations between attributes for Milwaukee census tracts
The attributes compared in this table are the number of manufacturing employees, the number of retail employees, the number of finance employees, the white population, the black population, the Hispanic population, and the median household income. In this table, if an entry has two stars, the correlation between the variables is significant at the 99% level, which means there is strong evidence that the relationship did not appear by chance. The pairs with two stars are the number of manufacturing employees with all other variables, the number of retail employees with all other variables except the Hispanic population, the number of finance employees with all other variables, the white population with all other variables except the Hispanic population, the black population with all other variables, the Hispanic population with all other variables except the number of retail employees, the white population, and the median household income, and the median household income with everything except the Hispanic population. For all of these entries a trend can be clearly seen between the two variables, with either a positive or negative correlation. The entries that have only one star are significant at the 95% level. These entries are the white population compared to the Hispanic population and the Hispanic population compared to the median household income. There is still a correlation between the two variables in these entries, but the evidence for it is not quite as strong as for the previous ones. If there is no star next to an entry, then there is no statistically significant correlation between the variables. The only entry without a star is the Hispanic population compared to the number of retail employees. This means that no clear correlation between these two variables can be seen in the data.
The table shows not only the significance of each correlation but also its direction. The direction can be either positive or negative. If the correlation is positive, then when one variable goes up the other variable tends to go up as well. If the correlation is negative, then when one variable goes up the other variable tends to go down. All of the correlations in the table above are positive except for the black population compared to all other variables and the Hispanic population compared to the number of finance employees, the black population, and the median household income. This means, for example, that tracts with a larger Hispanic population tend to have a smaller black population, and vice versa. These data suggest that black and Hispanic populations are not doing very well in the Milwaukee area, because both have a negative correlation with median household income. This means that tracts with larger black or Hispanic populations tend to have lower median household incomes, and tracts with higher household incomes tend to have smaller black or Hispanic populations. In contrast, the correlation between the white population and the median household income is positive and significant at the 99% level. This means that tracts with a larger white population tend to have a higher median household income.
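
The direction and significance stars described above can be illustrated with a small sketch. The tract values below are made up for illustration and are not the Milwaukee data, and the function names are hypothetical. The star rule uses a Fisher z normal approximation, which is an assumption of this sketch; SPSS computes its significance from a t distribution, so the two can differ for small samples.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def stars(r, n):
    """Approximate two-tailed significance flags ('**', '*', or '').

    Assumption: Fisher z-transformation with a normal approximation,
    not the exact t-based test that SPSS reports.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical values for eight tracts (illustration only).
white_pct = [10, 20, 30, 40, 50, 60, 70, 80]
black_pct = [85, 70, 66, 52, 45, 33, 28, 15]
income_k = [22, 30, 29, 41, 48, 52, 63, 70]

r_white = pearson_r(white_pct, income_k)  # positive direction
r_black = pearson_r(black_pct, income_k)  # negative direction
print(f"white vs income: r = {r_white:.3f}{stars(r_white, 8)}")
print(f"black vs income: r = {r_black:.3f}{stars(r_black, 8)}")
```

The sign of r gives the direction of the relationship, while the stars only indicate how confident one can be that r is not zero.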

Part Two:

Introduction:

For this part of the assignment, the Texas Election Commission (TEC) provided data about the 1980 and 2016 presidential elections and wants analysis done on the patterns. They want to know if there is clustering of voting patterns in the state, as well as of voter turnout. They also want to know if the election patterns have changed over the 36 years. In addition to the election data, population data is analyzed to see if there is clustering of Hispanic populations in Texas.

Methodology:

For this assignment, the election data for 1980 and 2016 was provided by the TEC. The population data needed to be downloaded separately from the US Census website along with a shapefile of Texas. The population data from the US Census is very cluttered, so all the fields can be deleted except for the geo id field and the percent of Hispanic population field. Once the table is simplified, it can be joined to the Texas shapefile and exported as a new feature. Next, the election data can be joined to the new feature and exported to create a feature that has all the desired attributes. Once this feature is complete, it needs to be exported and saved as a shapefile. Next, inside GEODA, the shapefile needs to be opened, a new spatial weights file needs to be created in the "weights manager", and an id variable needs to be added. Once this is complete, the "Moran's scatter plot" button can be clicked and the scatter plot and LISA cluster map need to be selected as outputs. After this process is done running, a scatter plot and a cluster map will appear that show the spatial autocorrelation of the variables in Texas.

Results:

The results from this process are a cluster map and a scatter plot for each of the variables. The first variable analyzed was the voter turnout in 1980. The map and scatter plot can be seen below in map 1 and graph 1.

map 1. Texas voter turnout for 1980

graph 1. Texas voter turnout for 1980

The red areas are counties that have high voter turnout and are surrounded by counties that also have high voter turnout. The light red areas are counties that have high voter turnout but are surrounded by counties that have low voter turnout. The light blue areas are counties that have low voter turnout surrounded by counties that have high voter turnout. The blue areas are counties that have low voter turnout and are surrounded by counties that also have low voter turnout. Looking at the cluster map, map 1, the north and central parts of Texas have clusters of high voter turnout and the south and east sides of the state have clusters of low voter turnout. For most of the state, there is no significant clustering. Graph 1 shows a scatter plot for the voter turnout in 1980 and provides the Moran's I value. Moran's I ranges from -1 to 1 and tells how strongly similar values are grouped together. A value near 1 means strong clustering of similar values, a value near 0 means no spatial pattern, and a value near -1 means that neighboring counties tend to have opposite values. The Moran's I value for this scatter plot is 0.468, which shows that the voter turnout in 1980 did have some clustering, but not to a great extent.

The next cluster map and scatter plot were made from the voter turnout in 2016. The map and scatter plot can be seen below in map 2 and graph 2. The map colors are the same as above.

map 2. Texas voter turnout for 2016

graph 2. Texas voter turnout for 2016

Looking at this map, a pattern emerges. There is a lot of clustering of low voter turnout on the south edge of the state, with a small cluster on the northwest edge of the state. The north and central parts of the state have clusters of higher voter turnout. The Moran's I value for this scatter plot is 0.287, which is lower than the Moran's I value for 1980. This means that there is even less clustering in 2016 than in 1980. When map 1 and map 2 are compared, the change in voter turnout can be seen. The cluster at the south end of the state stays about the same but does shrink a little. The cluster of high voter turnout in the middle of the state gets smaller, and it looks like far fewer counties have high voter turnout. The cluster of low voter turnout on the west side of the state is gone entirely, and the west side of the state has much lower voter turnout than it did in 1980.
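
To make the Moran's I values reported for the scatter plots more concrete, here is a minimal sketch of the global Moran's I calculation on a toy example. It assumes simple binary contiguity weights (1 for neighbors, 0 otherwise) and six made-up "counties" in a row; the function name and data are hypothetical, and GEODA's own weights and options may differ from this.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary contiguity weights.

    values    : list of numbers, one per area
    neighbors : dict mapping each area's index to the indices it touches
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of cross-products over every neighboring pair (weight = 1).
    cross = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    # Total weight is the number of (directed) neighbor links.
    total_weight = sum(len(js) for js in neighbors.values())
    return (n / total_weight) * cross / sum(d * d for d in dev)


# Six hypothetical counties in a row; each touches the ones beside it.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

clustered = [1, 1, 1, 0, 0, 0]    # high values grouped, low values grouped
alternating = [1, 0, 1, 0, 1, 0]  # every neighbor has the opposite value

print(morans_i(clustered, chain))    # positive, about 0.6
print(morans_i(alternating, chain))  # negative, about -1.0
```

The grouped pattern gives a positive value and the alternating pattern gives a value near -1, which matches the reading of the scale: positive for clustering, near zero for no pattern, and negative for neighbors that are unlike each other.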

The second set of maps and scatter plots deals with the percent of Democratic voters in each county in 1980 and 2016. Map 3 and graph 3 below show the percent of Democratic vote in each county in Texas in 1980.

map 3. Percent of Democratic vote by county in 1980

graph 3. Percent of Democratic vote by county in 1980

The colors represent the same thing as in the previous maps, however instead of voter turnout, the map shows the percent of Democratic vote. This map shows the clusters of counties that had a high percent of Democratic votes and the clusters of counties that had a low percent of Democratic votes. The south and east parts of the state had a large number of counties clustered together with a high percentage of Democratic votes. The north and west edges of the state had a large number of counties clustered together with a low percentage of Democratic votes. The Moran's I value for graph 3 is 0.575, which shows that there is a good amount of clustering, although it is still well short of perfect clustering.

The next map and scatter plot are for the percent of Democratic vote in Texas in 2016. Below are map 4 and graph 4, which show the spatial autocorrelation for this attribute. As above, the colors represent the same thing.

map 4. Percent of Democratic vote in 2016

graph 4. Percent of Democratic vote in 2016

Using this map, a pattern can be seen. This map shows that there is heavy clustering of counties that had a high Democratic vote percentage on the south edge of the state and heavy clustering of counties that had a low Democratic vote percentage in the north and central parts of the state. The Moran's I value for graph 4 is 0.685. This means that there is even more clustering in 2016 than in 1980 in terms of percent of Democratic vote. Comparing these maps, the change in Democratic voters can be seen. From 1980 to 2016 the south has gained much more clustering of high percentages of Democratic voters, especially towards the west. The north stays the same, but the cluster in the central part of the state shifts to the east. Also, the high cluster area on the east edge of the state disappears completely.

The last map and scatter plot show the percent of Hispanic population per county in Texas. They can be seen below in map 5 and graph 5.

map 5. Percent Hispanic population

graph 5. Percent Hispanic population

This map shows a strong pattern of where counties with similar percentages of Hispanic population are clustered. There is heavy clustering of counties with a high percentage of Hispanic population on the south and west edges of the state and heavy clustering of counties with a low percentage of Hispanic population to the north and east. The Moran's I value for graph 5 is 0.778. This means that there is a lot of clustering. In fact, it is the highest Moran's I value for all the data, so this variable has the most clustering. When map 5 is compared to maps 2 and 4, a correlation can be seen between the Hispanic population and both the voter turnout and the percent of Democratic voters. Table 2 shows the correlation matrix for this data. The percent_1 column is the percent Hispanic.

table 2. Correlation matrix for Texas data

This table shows that there is a positive correlation between the percent of Hispanic population and the percent of Democratic voters. This means that counties with a higher percentage of Hispanic population tend to have a higher percentage of Democratic votes, and vice versa. There is also a negative correlation between the percent of Hispanic population and the voter turnout for 2016 and 1980. This means that counties with a higher percentage of Hispanic population tend to have lower voter turnout.

Conclusion:

In conclusion for part 2 of this assignment, it seems that when the percent of Hispanic population goes up in a county, the voter turnout for that county goes down and the percent of Democratic voters goes up. This correlation leads to the assumption that Hispanic populations tend to vote for Democratic candidates, if they turn out to vote at all.

Tuesday, April 4, 2017

Hypothesis Testing

Objective:

The objective of this assignment was to use hypothesis testing on sample and real world data to better understand z and t tests and when to use them. This assignment's goal was also to learn how to calculate z and t values and to make a decision about the null and alternative hypotheses.

Part 1:

The first part of this assignment was to finish a table that was provided. The table provides the interval type (one or two tailed), the confidence level, and the sample size. From this, the level of significance, whether a z or t test is required, and the z or t value need to be determined. The table below (table 1) shows the table filled out with the missing information added.
table 1. Finished statistics table
The interval type can be either a one or two tailed test. A one tailed test means that only one critical value is found, on either the positive or the negative side. A two tailed test means that critical values are found on both the positive and negative extremes, so the level of significance is divided by 2. The confidence level is how many times out of 100 the test is expected to correctly keep a null hypothesis that is true. N is the size of the sample that is taken. The level of significance is 1 minus the confidence level and sets the probability of a type 1 error occurring, that is, rejecting a null hypothesis that is actually true. Choosing z or t is simple: if n is over 30, use a z test, and if n is 30 or under, use a t test. To find the z or t value, the level of significance is looked up in a z or t chart, which gives the cutoff for when the null hypothesis is rejected or not rejected.
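
The steps for filling in a row of the table can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the n > 30 rule is the rule of thumb stated above, and the z cutoff comes from Python's standard normal distribution rather than a printed chart. Critical t values also depend on the degrees of freedom and would still need a t chart.

```python
from statistics import NormalDist


def plan_test(confidence, n, tails):
    """Return (test, alpha, alpha_per_tail) for one row of the table.

    confidence : confidence level as a fraction, e.g. 0.95
    n          : sample size
    tails      : 1 for a one tailed test, 2 for a two tailed test
    """
    alpha = round(1 - confidence, 10)        # level of significance
    test = "z" if n > 30 else "t"            # rule of thumb from the text
    return test, alpha, round(alpha / tails, 10)


def z_critical(confidence, tails):
    """Positive critical z value for the given confidence level and tails."""
    alpha_per_tail = (1 - confidence) / tails
    return NormalDist().inv_cdf(1 - alpha_per_tail)


print(plan_test(0.95, 45, 2))            # ('z', 0.05, 0.025)
print(round(z_critical(0.95, 2), 2))     # 1.96
print(round(z_critical(0.95, 1), 3))     # 1.645
```

For a one tailed test on the negative side, the same critical value is used with a minus sign.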

The next part of this assignment looks at Kenya's agriculture. From a sample of 23 farmers, it needs to be determined if their yields for different types of crops are statistically different from the county as a whole. The three crops are ground nuts, cassava, and beans. For all calculations, a confidence level of 95% is used with a two tailed test. With 22 degrees of freedom (23 - 1), this means that the critical values for all three tests are +/- 2.074. The test done for these will be a t test because the sample size is under 30. The t value is found with the following equation: (sample mean - population mean)/(sample standard deviation/square root of N).

Ground nuts:
Null hypothesis: There is no difference between the yield of ground nuts between the sample farmers and the county as a whole.
Alternative hypothesis: There is a difference between the yield of ground nuts between the sample farmers and the county as a whole.
Equation: (0.52-0.57)/(.3/sqrt 23)= -0.799
Probability: 21.66%
-0.799 falls between -2.074 and 2.074 so for ground nuts, the null hypothesis will fail to be rejected.

Cassava:
Null hypothesis: There is no difference between the yield of cassava between the sample farmers and the county as a whole.
Alternative hypothesis: There is a difference between the yield of cassava between the sample farmers and the county as a whole.
Equation: (3.3-3.7)/(.75/sqrt 23) = -2.558
Probability: 1.07%
-2.558 falls outside of -2.074 and 2.074 so for cassava, the null hypothesis will be rejected.

Beans:
Null hypothesis: There is no difference between the yield of beans between the sample farmers and the county as a whole.
Alternative hypothesis: There is a difference between the yield of beans between the sample farmers and the county as a whole.
Equation: (0.34-0.29)/(0.12/sqrt 23) = 1.998
Probability: 97.03%
1.998 falls  between -2.074 and 2.074 so for beans, the null hypothesis will fail to be rejected.
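
The three calculations above can be checked with a short script. This is a minimal sketch that uses the sample means, county means, and standard deviations given in the equations above together with the critical value of 2.074; the function and variable names are hypothetical.

```python
import math

T_CRITICAL = 2.074  # two tailed, 95% confidence, as stated in the text
N = 23              # number of farmers sampled


def t_statistic(sample_mean, population_mean, sample_sd, n):
    """(sample mean - population mean) / (sample sd / sqrt(n))"""
    return (sample_mean - population_mean) / (sample_sd / math.sqrt(n))


# crop: (sample mean, county mean, sample standard deviation)
crops = {
    "ground nuts": (0.52, 0.57, 0.30),
    "cassava": (3.3, 3.7, 0.75),
    "beans": (0.34, 0.29, 0.12),
}

results = {}
for crop, (sample_mean, county_mean, sd) in crops.items():
    t = t_statistic(sample_mean, county_mean, sd, N)
    # Two tailed test: reject only if t falls outside +/- the critical value.
    decision = "reject" if abs(t) > T_CRITICAL else "fail to reject"
    results[crop] = (round(t, 3), decision)
    print(f"{crop}: t = {t:.3f}, {decision} the null hypothesis")
```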

Similarities:
The beans and ground nuts both failed to reject the null hypothesis, meaning that neither is statistically different from the population mean. Another similarity is between ground nuts and cassava. Both of the sample means for these two crops were lower than the population mean.

Differences:
Cassava was the only crop for which the null hypothesis was rejected. Beans were the only crop that had a sample mean larger than the population mean.

The last section of part one looks at the level of pollution in a stream. There were 17 samples taken, with a sample mean pollution level of 6.4 mg/l and a standard deviation of 4.4. The allowable limit for stream pollution is 4.2 mg/l. This part uses hypothesis testing to determine if the stream is statistically over the allowable limit.

Null hypothesis: There is no difference between the mean of the pollution samples and the allowable limit for the stream.
Alternative hypothesis: There is a difference between the mean of the pollution samples and the allowable limit for the stream.
Statistical test: The sample size is 17. 17 is under 30 so a t test is needed for this.
Level of significance: This was provided as a 95% confidence level for a one tailed test. For a one tailed t test at this level with 16 degrees of freedom (17 - 1), the critical value is 1.746.
Equation: (6.4-4.2)/(4.4/sqrt 17) = 2.0616
Probability: 97.4%
2.0616 is over 1.746 so the null hypothesis will be rejected.
This means that there is a difference between the mean of the pollution samples and the allowable limit for the stream. The researcher was correct that the level of pollution in the stream is over the allowable limit. However, by how much it is over is unknown.

Part 2:

For the second part of the assignment, home values are compared between the block groups for the city of Eau Claire and the block groups for Eau Claire county as a whole to see if home values for the city are statistically different from those of the county. This can be done with hypothesis testing. The mean of the home values for Eau Claire county is 169438.13. The mean of the home values for the city of Eau Claire is 151876.509, with a standard deviation of 49706.919 and an n of 53.

Null hypothesis: There is no difference between the home values for the city of Eau Claire and the county of Eau Claire.
Alternative hypothesis: There is a difference between the home values for the city of Eau Claire and the county of Eau Claire.
Statistical test: The sample size is 53. 53 is larger than 30 so a z test will be used.
Level of significance: A 95% confidence level is used with a one tailed test, giving a critical value of 1.64. The sample mean is smaller than the population mean, so the critical value is multiplied by -1, giving a critical value of -1.64.
Equation: (151876.509-169438.13)/(49706.919/sqrt 53) = -2.572
-2.572 is smaller than -1.64 so the null hypothesis will be rejected. This means that there is a statistical difference between the home values of the city of Eau Claire and the county of Eau Claire. The map below (map 1) shows the home values for the city of Eau Claire and the county of Eau Claire.
map 1. Home values by block group for Eau Claire county

This map helps show that home values in the city of Eau Claire are lower than in the rest of Eau Claire county, which agrees with the city mean being below the county mean. The map also gives reference for where the city of Eau Claire is in Eau Claire county and where Eau Claire county is in Wisconsin.
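
The z test for the home values can be reproduced with the numbers given above. This is a minimal sketch in which the variable names are hypothetical and the cutoff is taken from Python's standard normal distribution (about -1.645) instead of the rounded chart value of -1.64.

```python
import math
from statistics import NormalDist

county_mean = 169438.13   # mean home value, Eau Claire county
city_mean = 151876.509    # mean home value, city of Eau Claire
city_sd = 49706.919       # standard deviation for the city block groups
n = 53                    # number of city block groups

# z = (sample mean - population mean) / (sample sd / sqrt(n))
z = (city_mean - county_mean) / (city_sd / math.sqrt(n))

# One tailed test on the lower side at the 0.05 level of significance.
critical = NormalDist().inv_cdf(0.05)
p_lower = NormalDist().cdf(z)  # probability of a z this low or lower

print(f"z = {z:.3f}, critical value = {critical:.3f}")
print("reject" if z < critical else "fail to reject", "the null hypothesis")
```

Because z falls below the cutoff, the script reaches the same decision as the hand calculation.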