Monday, February 20, 2017

Descriptive Statistics and Mean Centers

Part 1:

The first part of this assignment takes a look at two different cycling teams and uses statistics to determine which team to invest money into. The two teams are team Astana and team Tobler and the times for their last race are listed below in table 1. 
table 1. Team Astana and Team Tobler individual race times in minutes
With this data, several descriptive statistics can be calculated to get a better idea of what the race times show. The statistics that will be applied to this data are range, mean, median, mode, kurtosis, skewness, and standard deviation. Below, these terms are defined and then applied to the data. 

  • Range: This refers to the extent of the information that is available. It is found by finding the difference between the highest and lowest value.   
  • Mean: This is the average of the data. It is found by adding all the values together and dividing by the total number of values. It is one measure of the center of the data and is heavily influenced by outliers. 
  • Median: This is the middle value of the data set when the values are sorted from smallest to largest. It differs from the mean because it depends on the position of the values in the sorted list rather than on their sum. If there is an odd number of values, the median is the middle value; if there is an even number of values, the two middle values are added together and divided by 2. This measure is more resistant to outliers than the mean is. 
  • Mode: This is the most common value in the data set. A value needs to appear at least twice for the data set to have a mode. 
  • Kurtosis: This refers to the shape of the graphed data. Kurtosis is a measure of how peaked or flat the graph is compared to a normal distribution. A positive kurtosis means the graph is relatively peaked, which is called leptokurtic. A negative kurtosis means the graph is relatively flat, which is called platykurtic.
  • Skewness: This, like kurtosis, refers to the shape of the graphed data. Skewness is a measure of how symmetrical the graph is. A skewness between -1 and 1 is usually treated as close enough to normal to be acceptable. If skewness is positive, the tail of the graph stretches to the right and there are large outliers affecting the data. If skewness is negative, the tail of the graph stretches to the left and there are small outliers affecting the data. 
  • Standard Deviation: This measures how spread out the data is. For normally distributed data, nearly all observations fall within 3 standard deviations above and 3 standard deviations below the mean, and these fall on equal intervals from the mean. Between the first positive and negative standard deviation is about 68% of all observations. Between the second positive and negative standard deviation is about 95% of all observations. Between the third positive and negative standard deviations is about 99.7% of all observations. If the graph is flatter, the data is more spread out and the standard deviation will be larger; if the graph is more peaked, the data is closer together and the standard deviation will be smaller. A population standard deviation is found by taking the difference between each individual observation and the average, squaring all of those differences, and adding them together. The next step is to divide that sum by the total number of observations and take the square root of the result. If the data is only a sample and the whole population is not known, the same steps are followed except that the sum of the squared differences is divided by the total number of observations minus one. 
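The definitions above can be sketched in a few lines of Python. This is only an illustration: the `times` list is made up and is not the data from table 1, and the skewness and kurtosis lines use the plain moment-based formulas, which is one common convention. Spreadsheet and GIS software may apply sample-size corrections and report slightly different values.

```python
import math
import statistics

# Hypothetical race times in minutes (NOT the values from table 1).
times = [275, 280, 280, 284, 286, 290, 293, 300]

n = len(times)
data_range = max(times) - min(times)   # highest value minus lowest value
mean = statistics.mean(times)          # sum of values / number of values
median = statistics.median(times)      # middle of the sorted values
modes = statistics.multimode(times)    # most common value(s)

# Population standard deviation divides by n; sample divides by n - 1.
pop_sd = statistics.pstdev(times)
sample_sd = statistics.stdev(times)

# The population standard deviation again, following the by-hand steps.
squared_diffs = [(t - mean) ** 2 for t in times]
pop_sd_by_hand = math.sqrt(sum(squared_diffs) / n)

# Moment-based skewness and excess kurtosis (assumed convention).
skewness = sum(((t - mean) / pop_sd) ** 3 for t in times) / n
excess_kurtosis = sum(((t - mean) / pop_sd) ** 4 for t in times) / n - 3

print("range:", data_range)
print("mean:", mean, "median:", median, "mode(s):", modes)
print("population sd:", round(pop_sd, 2), "sample sd:", round(sample_sd, 2))
print("skewness:", round(skewness, 2), "excess kurtosis:", round(excess_kurtosis, 2))
```

Because the sample version divides by a smaller number, the sample standard deviation always comes out a little larger than the population one for the same values.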
Above are the definitions of the statistics that are used to better understand the data that was given. Below is a table (table 2) that shows the value of each of these measures for both teams, as well as the work done by hand to calculate standard deviation (figure 1).
table 2. statistics applied to team Astana and team Tobler race times (time in hours, minutes)  
figure 1. standard deviation calculations by hand for team Astana and team Tobler 
Based on this information, the team that should be invested in is team Astana. This team was chosen because it has the better average time and it has the fastest individual. Assuming that a team wins by having the best average time, it makes sense to go with the team that has the better average time. Also, team Astana has the fastest individual, so the team would have the individual that would most likely win. This means that the owner of team Astana would get 25% of the $300,000 as well as 35% of the $400,000, instead of the $0 that the owner of team Tobler would get. The most important team statistic is the average or mean because that is most likely what the teams will be judged on.
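The owner's payout described above is simple arithmetic, sketched here. The individual prize is assumed to be $300,000 and the team prize $400,000; the variable names are only for illustration.

```python
# Assumed prize amounts in dollars (the individual prize is taken to be $300,000).
individual_prize = 300_000
team_prize = 400_000

# The owner gets 25% of the individual prize and 35% of the team prize.
owner_from_individual = individual_prize * 25 // 100
owner_from_team = team_prize * 35 // 100
owner_total = owner_from_individual + owner_from_team

print(f"From individual win: ${owner_from_individual:,}")
print(f"From team win: ${owner_from_team:,}")
print(f"Owner total: ${owner_total:,}")
```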

Part 2:

The second part of the assignment looks at geographic mean centers and weighted geographic mean centers in Wisconsin. A mean center takes the coordinates (X, Y) for a series of points and finds the mean (average) of the X and the Y values separately. Once the means of the X and Y values are found, the new coordinate pair can be plotted to show the mean center. A weighted mean center works the same way, except that each point's coordinates are multiplied by its weight (here, county population) and the sums are divided by the total weight. For this assignment, the first mean center that was found was for all the counties in Wisconsin. The coordinates used for this are the geographic center of each county. This is shown by the green dot on the map below (map 1). The second mean center that was found for this assignment was weighted by the population in 2000. The purple dot on the map below (map 1) shows the mean center for Wisconsin weighted by county population. The third mean center that was found was weighted by the population in 2015. This is shown by the blue dot on the map below (map 1). 
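As a rough sketch of the calculation, the snippet below finds an unweighted and a population-weighted mean center. The four "counties", their coordinates, and their populations are hypothetical and are not real Wisconsin data; the function names are also just for illustration.

```python
# Hypothetical county centroids and populations (NOT real Wisconsin data).
counties = [
    {"name": "A", "x": 0.0, "y": 0.0, "pop": 1000},
    {"name": "B", "x": 10.0, "y": 0.0, "pop": 1000},
    {"name": "C", "x": 10.0, "y": 10.0, "pop": 6000},
    {"name": "D", "x": 0.0, "y": 10.0, "pop": 2000},
]


def mean_center(points):
    """Average the X values and the Y values separately."""
    n = len(points)
    return (sum(p["x"] for p in points) / n,
            sum(p["y"] for p in points) / n)


def weighted_mean_center(points, weight_key):
    """Multiply each coordinate by its weight, then divide by the total weight."""
    total = sum(p[weight_key] for p in points)
    x = sum(p["x"] * p[weight_key] for p in points) / total
    y = sum(p["y"] * p[weight_key] for p in points) / total
    return (x, y)


geographic = mean_center(counties)
weighted = weighted_mean_center(counties, "pop")
print("mean center:", geographic)
print("population-weighted mean center:", weighted)
```

In this made-up example the weighted center is pulled toward county C because it holds most of the population, which is the same effect a heavily populated county has on a real weighted mean center.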
map 1. weighted and geographic mean centers by county in Wisconsin; 2000, 2015


The geographic mean center is the green dot and represents the unweighted spatial center of Wisconsin based on its counties. It falls near the center of Wisconsin, as expected. The next point is the mean center weighted by county population in 2000. This dot is shifted a long way to the southeast. This is because the high population of Milwaukee County gives that county much more weight than the other counties, so it drags the dot towards it. The last point is the mean center weighted by county population in 2015. Here, the dot shifts to the west and a little to the north. This is either because Milwaukee is losing some of its population or because more people are moving to western Wisconsin around Eau Claire and close to Minneapolis. 









Wednesday, February 1, 2017

Data Types and Classification

Part one: Data types

Data is a big part of research and geography, especially quantitative geography research. Data comes in four different types: nominal, ordinal, interval, and ratio. The type of data that is needed changes based on what information is being shown and how it is being shown.

Nominal data: This data type sorts the information into unique categories that have no assumed relationship between them. This data type is normally for information like names or street numbers, to which no meaningful math can be applied. Examples are the addresses "346 Water St" and "792 Clairemont Ave". The text cannot be added or subtracted, and although the street numbers can be, the result does not mean anything. The map below is an example of nominal data because it maps the counties in the US and separates them into unique categories with no assumed relationship between the categories. The categories in this map are counties won by the Republican candidate and counties won by the Democratic candidate during the 2016 US presidential election.

        "2016 US Presidential Election Map By County & Vote Share." Brilliant Maps. November 30, 2016. Accessed February 01, 2017. http://brilliantmaps.com/2016-county-election-map/.

Ordinal data: This data type sorts information onto an arbitrary scale whose categories are related to each other by rank. There are two different types of ordinal data. The first is strongly ordered data, which is ranked data where each item is given a specific place in the order. An example of this is a map of the 10 coldest cities. The second is weakly ordered data, in which the data is put into differently ranked categories. An example of this is a map where the percentage of a certain ethnic group is put into different percentage categories. The map below is an example of ordinal data because the information is split up into categories that have a specific order, are on an arbitrary scale, and are related to each other. In this case the categories go from soft to hard and show the hardness of the ground over the US. 

     "Aggregate Hardness Map of the United States." ForConstructionPros.com. Accessed February 01, 2017. http://www.forconstructionpros.com/article/10745911/aggregate-hardness-map-of-the-united-states.

Interval data: This data type sorts information onto a scale that has no meaningful zero point. The scale may have a zero, but it is just a reference point and not a starting point, so differences between values are meaningful but ratios are not. This data can be mapped in different units, such as temperature (Fahrenheit, Celsius) or elevation (meters, feet). The map below is an example of interval data because the information does not have a meaningful zero point and can be mapped using different units. This map shows elevation in the US in meters. 

     "US Elevation and Elevation Maps of Cities, Topographic Map Contour." US Elevation and Elevation Maps of Cities, Topographic Map Contour. Accessed February 01, 2017. http://www.floodmap.net/Elevation/CountryElevationMap/?ct=US.


Ratio data: This data type sorts information onto a scale that has a meaningful zero point that serves as a starting point. This data is used for counts of something in an area, such as people per county; rates of something in an area, such as how many pets there are per person; or densities, such as how many cows there are per square mile. The map below is an example of ratio data because it is a rate that has a meaningful zero point, which is also the starting point.
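A small sketch can show why the zero point matters when comparing interval and ratio data. The temperatures and densities below are made-up values; the point is only that a ratio between two interval values changes when the unit changes, while a ratio between two ratio values does not.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval data: hypothetical temperatures.
cold_c, warm_c = 10.0, 20.0
ratio_in_celsius = warm_c / cold_c                     # looks like "twice as warm"
ratio_in_fahrenheit = c_to_f(warm_c) / c_to_f(cold_c)  # same days, different ratio

# Ratio data: hypothetical population densities in people per square mile.
SQ_KM_PER_SQ_MILE = 2.589988
density_a, density_b = 50.0, 100.0
ratio_per_sq_mile = density_b / density_a
ratio_per_sq_km = (density_b / SQ_KM_PER_SQ_MILE) / (density_a / SQ_KM_PER_SQ_MILE)

print("temperature ratio in C:", ratio_in_celsius)
print("temperature ratio in F:", round(ratio_in_fahrenheit, 2))
print("density ratio per sq mile:", ratio_per_sq_mile)
print("density ratio per sq km:", round(ratio_per_sq_km, 2))
```

The temperature ratio depends on the unit because zero degrees is only a reference point, while the density ratio stays the same because zero people per area really means none.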

     "CensusScope -- Demographic Maps: African-American Population." CensusScope -- Demographic Maps: African-American Population. Accessed February 01, 2017. http://www.censusscope.org/us/map_nhblack.html.



Part two: Data classification

Data can be classified in many different ways, and the way the information is classified can change how the map looks. A map can be misleading or skewed if the wrong method is chosen. Three of the most common types of data classification are equal interval, quantile, and natural breaks.

Equal interval: This data classification method takes the range of the values in the information and divides the range equally into the number of classes that are desired. This makes each class cover the same amount of the range as the other classes in increasing or decreasing order. The map below is an example of equal interval data classification.
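A minimal sketch of the equal interval idea is below. The `values` list is hypothetical (it is not the farm data used for the maps), and the function names are only for illustration. Values that fall exactly on a break are placed in the lower class here, which is an assumption; software may handle that edge differently.

```python
from bisect import bisect_left


def equal_interval_breaks(values, k):
    """Split the range of the values into k classes of equal width.

    Returns the k - 1 break values between the classes.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def classify(value, breaks):
    """Return the class index (0 = lowest); a value on a break goes in the lower class."""
    return bisect_left(breaks, value)


# Hypothetical county values (NOT the real farm data).
values = [2, 4, 5, 7, 10, 12, 30, 42]
breaks = equal_interval_breaks(values, 4)
classes = [classify(v, breaks) for v in values]
print("breaks:", breaks)
print("classes:", classes)
```

With skewed data like this, most values land in the lowest class and some classes can be empty, which is why an equal interval map can look much lighter than the other two.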

Quantile: This data classification method divides the number of entries by the number of classes that are desired. This makes each class cover the same number of entries as the other classes. The map below is an example of quantile data classification.
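The quantile idea can be sketched as follows, again with a hypothetical `values` list rather than the real farm data. This simple version assigns classes by rank, so tied values could end up in different classes; real software may treat ties differently.

```python
def quantile_classes(values, k):
    """Assign each value to one of k classes so each class holds about the same number of entries."""
    n = len(values)
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    classes = [0] * n
    for rank, i in enumerate(order):
        classes[i] = rank * k // n
    return classes


# Hypothetical county values (NOT the real farm data).
values = [2, 4, 5, 7, 10, 12, 30, 42]
classes = quantile_classes(values, 4)
print("classes:", classes)
```

Each of the four classes gets two of the eight entries here, no matter how far apart the values are.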

Natural breaks: This data classification  method looks at the data and divides it where the data has notable separations or gaps.  The map below is an example of the natural breaks method.
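The sketch below illustrates the "divide at the gaps" idea in the simplest possible way: it sorts the values and puts the breaks in the middle of the largest gaps. This is a simplification for illustration and is not the Jenks optimization method that GIS software commonly uses for natural breaks, so its results will not always match a real natural breaks map. The `values` list is hypothetical.

```python
def gap_breaks(values, k):
    """Place k - 1 breaks at the midpoints of the largest gaps between sorted values.

    A simplified stand-in for natural breaks, not the Jenks method.
    """
    s = sorted(values)
    # (gap size, position) for each pair of neighboring values.
    gaps = [(s[i + 1] - s[i], i) for i in range(len(s) - 1)]
    largest = sorted(gaps, reverse=True)[:k - 1]
    cut_positions = sorted(i for _, i in largest)
    return [(s[i] + s[i + 1]) / 2 for i in cut_positions]


# Hypothetical county values (NOT the real farm data).
values = [2, 4, 5, 7, 10, 12, 30, 42]
breaks = gap_breaks(values, 3)
print("breaks:", breaks)
```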

The data classification method that should be used is the equal interval method. This method was chosen because, compared to the other two, it makes it look like there are not a lot of women who operate farms. If the company wants to increase the number of women-operated farms, it needs to show that there is a shortage of them, and that is the impression the equal interval map gives. The other maps have far more dark counties, while the equal interval map has many more light counties, so the equal interval map makes it look like there are not a lot of women running farms in Wisconsin.