You are allowed and encouraged to work with two partners on this project. Include your names, perm numbers, and whether you are taking the class for 131 or 231 credit.
You are welcome to write up a project report in a research paper format -- abstract, introduction, methods, results, discussion -- as long as you address each of the prompts below. Alternatively, you can use the assignment handout as a template and address each prompt in sequence, much as you would for a homework assignment.
There should be no raw R output in the body of your report! All of your results should be formatted in a professional and visually appealing manner: visualizations should be polished -- aesthetically clean, clearly labeled, and sized appropriately within the document you submit -- and tables should be nicely formatted (see the pander, xtable, and kable packages). If you feel you must include raw R output, it should go in an appendix, not the main body of the document you submit.
There should be no R code in the body of your report! Use the global chunk option echo=FALSE to exclude code from appearing in your document. If you feel it is important to include your code, it can be put in an appendix.
The U.S. presidential election in 2012 did not come as a surprise. Some, including Nate Silver, correctly predicted the outcome of the election, and many speculated about his approach.
Despite the successes of 2012, the 2016 presidential election came as a big surprise to many, underscoring that predicting voter behavior is complicated for many reasons, despite tremendous efforts to collect, analyze, and understand the many available datasets.
Your final project will be to merge census data with 2016 voting data to analyze the election outcome.
To familiarize yourself with the general problem of predicting election outcomes, read the articles linked above and answer the following questions. Limit your responses to one paragraph for each.
The major issue in predicting voter behavior is that there is a difference between what Silver calls the ‘nowcast’, a model of how people will vote if the election is held on a particular day, versus true voting intention, which often changes according to known variables such as age, race, gender, etc., as well as immeasurable factors such as effects of the economy and particularly strong campaigns. Furthermore, polls have errors, and this error can aggregate from the regional level, to the state level, and finally to the national level in a hierarchical manner, and can therefore cause large prediction errors.
Silver was able to achieve good predictions in 2012 by examining a full range of probabilities for each date instead of maximizing probabilities, using models from the day prior (which give reports of actual support) to measure the probability of shifts in support. Using this data, Silver created a model that simulated forward in time to election day, taking the most recent polling data (the 'nowcast') as the simulation's starting point, to forecast the new probabilities of each level of support (state and national). Given the immense amount of polling data coming out, especially toward the end of the campaign, the 'nowcast' could be constantly updated, and the variance of the true voting intention decreased as the election approached and immeasurable factors such as the economy and the strength of each campaign had less room to change.
In the 2016 election, individual polls were incorrect due to either statistical noise or other factors, such as nonresponse bias. Yet these errors are to be anticipated, and aggregating individual polls throughout a state is intended to account for and reduce this error. In 2016, the state polls missed in the same direction, which indicates a systematic polling error and implies error in the national polls, which are adjusted based on the results at the state level. Many of the individual states whose polls missed were swing states, which caused the national polls to overestimate Clinton's lead over Trump. The impact of these polling errors is seen primarily in the Midwestern states (Iowa, Ohio, Pennsylvania, Wisconsin, and Minnesota), which Trump was mostly expected to lose but mostly won. Some of the widely accepted theories for the polling errors relate to the share of Trump voters who are distrustful of poll calls, meaning they were reluctant or unwilling to disclose their voting intentions. Therefore, one strategy to improve future predictions would be a method of anonymous polling, which would enable voters to report their honest intentions and could make predictions more accurate.
The `project_data.RData` binary file contains three datasets: tract-level 2010 census data, stored as `census`; metadata `census_meta` with variable descriptions and types; and county-level vote tallies from the 2016 election, stored as `election_raw`.
Some example rows of the election data are shown below:
county | fips | candidate | state | votes |
---|---|---|---|---|
Los Angeles County | 6037 | Hillary Clinton | CA | 2464364 |
Los Angeles County | 6037 | Donald Trump | CA | 769743 |
Los Angeles County | 6037 | Gary Johnson | CA | 88968 |
Los Angeles County | 6037 | Jill Stein | CA | 76465 |
Los Angeles County | 6037 | Gloria La Riva | CA | 21993 |
Cook County | 17031 | Hillary Clinton | IL | 1611946 |
The meaning of each column in `election_raw` is self-evident except `fips`. The acronym is short for Federal Information Processing Standard. In this dataset, `fips` values denote the area (nationwide, statewide, or countywide) that each row of data represents.
Nationwide and statewide tallies are included as rows in `election_raw` with `county` values of `NA`. There are two kinds of these summary rows:

- Nationwide tallies have a `fips` value of `US`.
- Statewide tallies have the state abbreviation as the `fips` value.

There is also a set of ambiguous rows with `fips=2000`, shown below. Provide a reason for excluding them. Drop these observations -- please write over `election_raw` -- and report the data dimensions after removal.

county | fips | candidate | state | votes |
---|---|---|---|---|
NA | 2000 | Donald Trump | AK | 163387 |
NA | 2000 | Hillary Clinton | AK | 116454 |
NA | 2000 | Gary Johnson | AK | 18725 |
NA | 2000 | Jill Stein | AK | 5735 |
NA | 2000 | Darrell Castle | AK | 3866 |
NA | 2000 | Rocky De La Fuente | AK | 1240 |
We exclude observations with `fips=2000` because this `fips` value has an associated `county` value of `NA`, which should be true only of nationwide and statewide summary tallies (where `fips` is either `US` or a state abbreviation), not of countywide tallies (where `fips` is numeric). After dropping these rows and overwriting `election_raw`, the new dimensions of the data are 18345 x 5. A minimal sketch of this step follows.
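Assuming `dplyr` is loaded (`fips` is still a character column at this point):

```r
library(dplyr)

# Drop the ambiguous Alaska rows and overwrite election_raw
election_raw <- election_raw %>%
  filter(fips != "2000")

# Report the dimensions after removal (rows x columns)
dim(election_raw)
```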
The first few rows and columns of the `census` data are shown below.
CensusTract | State | County | TotalPop | Men | Women |
---|---|---|---|---|---|
1001020100 | Alabama | Autauga | 1948 | 940 | 1008 |
1001020200 | Alabama | Autauga | 2156 | 1059 | 1097 |
1001020300 | Alabama | Autauga | 2968 | 1364 | 1604 |
1001020400 | Alabama | Autauga | 4423 | 2172 | 2251 |
1001020500 | Alabama | Autauga | 10763 | 4922 | 5841 |
1001020600 | Alabama | Autauga | 3851 | 1787 | 2064 |
Variable descriptions are given in the `metadata` file. The variables shown above are:
variable | description | type |
---|---|---|
CensusTract | Census tract ID | numeric |
State | State, DC, or Puerto Rico | string |
County | County or county equivalent | string |
TotalPop | Total population | numeric |
Men | Number of men | numeric |
Women | Number of women | numeric |
Separate the rows of `election_raw` into federal-, state-, and county-level data frames:

- Store federal-level tallies as `election_federal`.
- Store state-level tallies as `election_state`.
- Store county-level tallies as `election`, coercing the `fips` variable to numeric.
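One way to carry out the split, assuming `dplyr` and the summary-row conventions described above:

```r
library(dplyr)

# Federal-level tallies: fips is the literal string "US"
election_federal <- election_raw %>%
  filter(fips == "US")

# State-level tallies: no county name, but fips is not "US"
election_state <- election_raw %>%
  filter(is.na(county), fips != "US")

# County-level tallies: everything with a county name; make fips numeric
election <- election_raw %>%
  filter(!is.na(county)) %>%
  mutate(fips = as.numeric(fips))
```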
How many named presidential candidates were there in the 2016 election? Draw a bar graph of all votes received by each candidate, and order the candidate names by decreasing vote counts. (You may need to log-transform the vote axis.)
There were 32 named presidential candidates who received votes in the 2016 U.S. presidential election.
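A sketch of the bar graph, assuming `ggplot2` and the `election_federal` data frame created above:

```r
library(dplyr)
library(ggplot2)

candidate_votes <- election_federal %>%
  group_by(candidate) %>%
  summarise(votes = sum(votes))

# Candidates ordered by decreasing vote count; log scale because the
# totals span several orders of magnitude
ggplot(candidate_votes, aes(x = reorder(candidate, votes), y = votes)) +
  geom_col() +
  scale_y_log10() +
  coord_flip() +
  labs(x = NULL, y = "Votes (log scale)",
       title = "Total 2016 votes by candidate")
```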
Create variables `county_winner` and `state_winner` by taking the candidate with the highest proportion of votes. (Hint: to create `county_winner`, start with `election`, group by `fips`, compute `total` votes and `pct = votes/total`, then choose the highest row using `slice_max`; `state_winner` is similar.)
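A sketch following the hint, assuming `dplyr` >= 1.0 (for `slice_max()`):

```r
library(dplyr)

county_winner <- election %>%
  group_by(fips) %>%
  mutate(total = sum(votes),
         pct   = votes / total) %>%
  slice_max(pct, n = 1)    # keep the top candidate per county

state_winner <- election_state %>%
  group_by(fips) %>%
  mutate(total = sum(votes),
         pct   = votes / total) %>%
  slice_max(pct, n = 1)    # keep the top candidate per state
```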
Here you'll generate maps of the election data using `ggmap`. The .Rmd file for this document contains code to generate the following map.
Draw a county-level map with `map_data("county")` and color by county.

In order to map the winning candidate for each state, the map data (`states`) must be merged with the election data (`state_winner`). The function `left_join()` will do the trick, but it needs to join the data frames on a variable with matching values. In this case, that variable is the state name, but abbreviations are used in one data frame and the full name is used in the other.
Create a `fips` variable in the `states` data frame with values that match the `fips` variable in `state_winner`.

Now the data frames can be merged. `left_join(df1, df2)` takes all the rows from `df1` and looks for matches in `df2`. For each match, `left_join()` appends the data from the second table to the matching row in the first; if no matching value is found, it adds missing values.
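A sketch of the merge, assuming the `states` map data frame comes from `ggplot2::map_data("state")` (its `region` column holds lower-case state names) and that `state_winner$fips` holds state abbreviations:

```r
library(dplyr)
library(ggplot2)

states <- map_data("state")

# Translate full state names to abbreviations so both tables share a key;
# state.name and state.abb are built-in R constants
states <- states %>%
  mutate(fips = state.abb[match(region, tolower(state.name))])

# Append the winning candidate to every polygon row of the map
state_map <- left_join(states, state_winner, by = "fips")
```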
Use `left_join` to merge the tables, and use the result to create a map of the election results by state. Your figure will look similar to this state-level New York Times map. (Hint: use `scale_fill_brewer(palette="Set1")` for a red-and-blue map.)
The county-level map data does not come with a `fips` value, so to create one, use information from `maps::county.fips`: split the `polyname` column into `region` and `subregion` using `tidyr::separate`, and use `left_join()` to combine `county.fips` with the county-level map data. Then construct the map. Your figure will look similar to the county-level New York Times map.
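A sketch of the county-level merge; note that a few `polyname` values carry `:subpolygon` qualifiers that may need extra cleanup:

```r
library(dplyr)
library(tidyr)

counties <- map_data("county")

# county.fips stores "state,county" in polyname; split it to match the
# region/subregion columns of the map data
county_fips <- maps::county.fips %>%
  separate(polyname, into = c("region", "subregion"), sep = ",")

counties <- counties %>%
  left_join(county_fips, by = c("region", "subregion")) %>%
  left_join(county_winner, by = "fips")
```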
Create a visualization of your choice using the `census` data. Many exit polls noted that demographics played a big role in the election. If you need a starting point, use this Washington Post article and this R graph gallery for ideas and inspiration.

The `census` data contains high-resolution information (more fine-grained than county-level). Aggregate the information into county-level data by computing population-weighted averages of each attribute for each county, carrying out the following steps. (A sketch of the full pipeline appears after the table below.)

Clean the census data, saving the result as `census_del`:

- filter out any rows of `census` with missing values;
- convert `Men`, `Employed`, and `Citizen` to percentages;
- compute a `Minority` variable by combining `Hispanic`, `Black`, `Native`, `Asian`, and `Pacific`, and remove these variables after creating `Minority`; and
- remove `Walk`, `PublicWork`, and `Construction`.

Create an intermediate sub-county data frame, `census_subct`:

- group `census_del` by `State` and `County`;
- use `add_tally()` to compute `CountyTotal`, the county population; and
- compute the population weight `TotalPop/CountyTotal` and multiply each attribute by the weight.

Aggregate the census data to the county level as `census_ct`: group the sub-county data `census_subct` by state and county, and compute the population-weighted average of each variable by taking the sum (since the variables were already transformed by the population weights).

Print the first few rows and columns of `census_ct`.
State | County | CensusTract | TotalPop | Men | Women |
---|---|---|---|---|---|
Alabama | Autauga | 1.001e+09 | 6486 | 48.43 | 51.57 |
Alabama | Baldwin | 812351652 | 6235 | 39.56 | 41.43 |
Alabama | Barbour | 620485497 | 2051 | 33.2 | 28.48 |
Alabama | Bibb | 128297002 | 812.9 | 6.805 | 5.936 |
Alabama | Blount | 318411118 | 2215 | 15.59 | 15.97 |
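A sketch of the aggregation pipeline, assuming `dplyr` >= 1.0 (for `across()`) and `tidyr`, and assuming `Men`, `Employed`, and `Citizen` are raw counts; exactly which helper columns (e.g., `CensusTract`) are carried through should match your metadata:

```r
library(dplyr)
library(tidyr)

# Step 1: clean the tract-level data
census_del <- census %>%
  drop_na() %>%
  mutate(Men      = Men / TotalPop * 100,
         Employed = Employed / TotalPop * 100,
         Citizen  = Citizen / TotalPop * 100,
         Minority = Hispanic + Black + Native + Asian + Pacific) %>%
  select(-Hispanic, -Black, -Native, -Asian, -Pacific,
         -Walk, -PublicWork, -Construction)

# Step 2: population weights within each county
census_subct <- census_del %>%
  group_by(State, County) %>%
  add_tally(TotalPop, name = "CountyTotal") %>%
  mutate(weight = TotalPop / CountyTotal,
         across(-c(CensusTract, TotalPop, CountyTotal, weight),
                ~ .x * weight))

# Step 3: weighted attributes sum to population-weighted county averages;
# helper columns (weight, CountyTotal) can be dropped afterwards
census_ct <- census_subct %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")
```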
Our team was located in Santa Barbara, CA on election day in 2016. Looking at the county winner for Santa Barbara County, we see that Hillary Clinton was the winning candidate, and over 61% of eligible voters voted. Looking at the census demographics for Santa Barbara, we are not surprised that Clinton was the winning candidate, because Santa Barbara is seen as a progressive area that would likely choose Clinton over Trump. Some key demographics of note are the racial diversity of the county, which has a larger minority population (51.18%) than white population (46.49%), and the large gap between median income ($66,498.12) and income per capita ($30,752.87). This shows that while Santa Barbara County is highly diverse, it still has a large wealth gap that could affect how people within the county vote. This outcome is not surprising, and while it likely did not greatly affect the overall California result, it contributed to turnout and makes us proud to be from Santa Barbara.
We chose to center and scale the features. The objective of centering and scaling before PCA is to put all variables on a common scale: each variable is shifted to mean zero and divided by its standard deviation, so that no variable dominates the principal components simply because of its units or variance. In our case, since the `census_ct` variables are population-weighted averages with very different ranges, we center and scale the features so that the analysis and interpretation treat all variables with equal weight.
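A sketch, assuming `census_ct` as built above with its two character columns dropped:

```r
library(dplyr)

# center = TRUE subtracts each column mean; scale. = TRUE divides by the
# standard deviation, so no variable dominates because of its units
pc_ct <- prcomp(select(census_ct, -State, -County),
                center = TRUE, scale. = TRUE)
```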
For the sub-county loadings, we note that PC2 will be large and positive when variables including child poverty, poverty, minority, and unemployment are high, while variables like work at home, family work, self-employed, and white are low. The loadings for PC1 are relatively constant across all variables: PC1 will be large and positive when all variables are moderately high, with slightly lower weight on county total, transit, and other transportation.

For the county loadings, we find that PC2 will be large and positive when variables including white, work at home, and income per capita are high, while variables including child poverty, poverty, minority, and unemployment are low. The loadings for PC1 are again relatively constant across all variables: PC1 will be large and positive when all variables are relatively low, with slightly higher weight on family work, minority, transit, and other transportation.
To capture 90% of the variance in the county-level data, we need exactly 5 PCs. To capture 90% of the variance in the sub-county-level data, we need a minimum of 6 PCs: 6 principal components capture slightly more than 90% of the variance, while 5 capture slightly less.
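These counts come from the cumulative proportion of variance explained; a sketch for the county-level PCA object above:

```r
# Proportion of variance explained by each component
pve     <- pc_ct$sdev^2 / sum(pc_ct$sdev^2)
cum_pve <- cumsum(pve)

# Smallest number of components reaching 90% cumulative PVE
min(which(cum_pve >= 0.90))
```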
Using `census_ct`, perform hierarchical clustering with complete linkage. Cut the tree to partition the observations into 10 clusters. Re-run the hierarchical clustering algorithm using the first 5 principal components of the county-level data as inputs instead of the original features. Compare and contrast the results. For both approaches, investigate the cluster that contains San Mateo County. Which approach seemed to put San Mateo County in a more appropriate cluster? Comment on what you observe and discuss possible explanations for these observations.

Cluster sizes using the original features:

clusters | n |
---|---|
cluster 1 | 1271 |
cluster 2 | 230 |
cluster 3 | 279 |
cluster 4 | 358 |
cluster 5 | 189 |
cluster 6 | 149 |
cluster 7 | 283 |
cluster 8 | 77 |
cluster 9 | 295 |
cluster 10 | 87 |
Cluster sizes using the first 5 principal components:

clusters2 | n |
---|---|
cluster 1 | 1308 |
cluster 2 | 1585 |
cluster 3 | 131 |
cluster 4 | 5 |
cluster 5 | 157 |
cluster 6 | 9 |
cluster 7 | 2 |
cluster 8 | 13 |
cluster 9 | 4 |
cluster 10 | 4 |
Under both approaches, San Mateo County falls in cluster 1.
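A sketch of the two clustering runs, assuming the county-level features are scaled before computing distances and `pc_ct` is the PCA object from above:

```r
library(dplyr)

# Complete linkage on the original (scaled) features
d_ct     <- dist(scale(select(census_ct, -State, -County)))
clusters <- cutree(hclust(d_ct, method = "complete"), k = 10)

# Complete linkage on the first five principal component scores
d_pc      <- dist(pc_ct$x[, 1:5])
clusters2 <- cutree(hclust(d_pc, method = "complete"), k = 10)

# Cluster membership of San Mateo County under each approach
sm <- which(census_ct$County == "San Mateo")
c(original = clusters[sm], pca = clusters2[sm])
```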
We believe that running hierarchical clustering on the first five principal components, rather than on the original features, is more appropriate. When clustering on the original features, the observations are heavily concentrated in cluster 1, which leaves little room for further analysis of the resulting groups. Using the principal components as inputs divides the data along the patterns that encode the most variance in the dataset. This dimensionality reduction also matters for the complete-linkage model itself: distance-based methods degrade in high dimensions, so clustering on only the first 5 principal components is likely to be more reliable. If we opted to use the original features in hierarchical clustering, a correlation-based similarity measure might be a more appropriate choice.
In order to train classification models, we need to combine the `county_winner` and `census_ct` data. This seemingly straightforward task is harder than it sounds. Code is provided in the .Rmd file that makes the necessary changes to merge them into `election_cl` for classification.
After merging the data, partition the result into 80% training and 20% testing partitions.
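A sketch of the partition, with a hypothetical seed for reproducibility:

```r
set.seed(131)  # hypothetical seed

n        <- nrow(election_cl)
train_id <- sample(n, size = floor(0.8 * n))

train <- election_cl[train_id, ]
test  <- election_cl[-train_id, ]
```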
The unpruned classification tree was fit as `tree(as.factor(candidate) ~ ., data = train, control = tree_opts, split = "deviance")` and used 20 variables: Transit, Minority, Drive, Employed, White, Unemployment, Poverty, WorkAtHome, CensusTract, TotalPop, Men, SelfEmployed, Production, IncomePerCapErr, IncomePerCap, Income, OtherTransp, Women, PrivateWork, and Citizen. The pruned tree uses only Transit, Minority, and Drive.

tree | terminal nodes | residual mean deviance | training misclassification error |
---|---|---|---|
unpruned | 41 | 0.0556 (31.91 / 574) | 0.0163 (10 / 615) |
pruned | 5 | 0.5487 (334.7 / 610) | 0.1138 (70 / 615) |
Test-set confusion rates for the unpruned tree (rows are true classes; columns are predicted, with Yes corresponding to Hillary Clinton):

class | No | Yes |
---|---|---|
Donald Trump | 0.9241 | 0.0759 |
Hillary Clinton | 0.5603 | 0.4397 |

And for the pruned tree:

class | No | Yes |
---|---|---|
Donald Trump | 0.9472 | 0.0528 |
Hillary Clinton | 0.5764 | 0.4236 |
The decision tree before pruning contains 41 terminal nodes, split on 20 variables, with a training misclassification error rate of 0.0163. After cost-complexity pruning, we have a decision tree with only five terminal nodes, split on three variables (Transit, Minority, and Drive), with a training misclassification error rate of 0.1138. When we examine the misclassification errors on test data, we note that the pruned tree has a higher total misclassification error rate than the initial tree. The pruned tree also has a higher false negative rate and a lower false positive rate than the initial tree; we suspect the initial tree is overfit, classifying Trump counties with near-perfect accuracy while remaining fairly imprecise on Clinton counties. The first variable the tree splits on is Transit; we suspect that the percentage of people in a county who use the transit system is correlated with other demographics and with the distribution of wealth in the county's population. We therefore see that counties with higher rates of public transit use, likely lower-income counties, are determined to be more likely to vote for Hillary. The remaining (low-transit) population is then split on the Minority variable: counties with higher rates of minority identification are more likely to vote for Hillary, while counties with lower minority rates (and, by implication, a higher share of white voters) are more likely to elect Trump.
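A sketch of the fit-and-prune workflow behind these results, assuming the `tree` package (the report's actual fit also passed a `control = tree_opts` argument not reproduced here):

```r
library(tree)

# Grow a deep tree, then pick the size by 10-fold cross-validation
t_full <- tree(as.factor(candidate) ~ ., data = train)
cv_t   <- cv.tree(t_full, FUN = prune.misclass, K = 10)

best_size <- cv_t$size[which.min(cv_t$dev)]
t_pruned  <- prune.misclass(t_full, best = best_size)

# Row-normalized test confusion matrix, as reported above
pred <- predict(t_pruned, test, type = "class")
prop.table(table(class = test$candidate, pred), margin = 1)
```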
The logistic regression was fit as `glm(as.factor(candidate) ~ ., family = "binomial", data = train)` (null deviance 515.58 on 614 degrees of freedom; residual deviance 266.75 on 586 degrees of freedom; AIC 324.75; 8 Fisher scoring iterations). The estimated coefficients are:

variable | estimate | std. error | z value | p-value |
---|---|---|---|---|
(Intercept) | -2.075e+00 | 2.428e-01 | -8.547 | < 2e-16 |
CensusTract | 8.149e-12 | 1.951e-11 | 0.418 | 0.6761 |
TotalPop | -1.921e-03 | 1.308e-03 | -1.469 | 0.1418 |
Men | 2.994e-01 | 2.570e-01 | 1.165 | 0.2440 |
Women | 4.157e-01 | 2.691e-01 | 1.545 | 0.1225 |
White | -3.269e-01 | 1.725e-01 | -1.895 | 0.0581 |
Minority | -1.853e-01 | 1.629e-01 | -1.138 | 0.2552 |
Citizen | 3.707e-04 | 1.431e-03 | 0.259 | 0.7957 |
Income | -1.483e-04 | 7.396e-05 | -2.005 | 0.0450 |
IncomeErr | 1.114e-04 | 2.173e-04 | 0.512 | 0.6084 |
IncomePerCap | 4.292e-04 | 2.065e-04 | 2.078 | 0.0377 |
IncomePerCapErr | -4.582e-04 | 5.163e-04 | -0.887 | 0.3749 |
Poverty | 2.417e-01 | 1.585e-01 | 1.525 | 0.1272 |
ChildPoverty | -3.110e-02 | 9.491e-02 | -0.328 | 0.7431 |
Professional | 2.353e-01 | 1.072e-01 | 2.196 | 0.0281 |
Service | 2.445e-01 | 1.233e-01 | 1.983 | 0.0474 |
Office | 5.554e-02 | 1.299e-01 | 0.427 | 0.6691 |
Production | 2.336e-01 | 1.170e-01 | 1.997 | 0.0458 |
Drive | -4.912e-01 | 1.462e-01 | -3.359 | 0.0008 |
Carpool | -6.595e-01 | 2.004e-01 | -3.290 | 0.0010 |
Transit | 9.610e-01 | 3.480e-01 | 2.762 | 0.0057 |
OtherTransp | -9.807e-02 | 3.326e-01 | -0.295 | 0.7681 |
WorkAtHome | -1.311e-01 | 2.049e-01 | -0.640 | 0.5221 |
MeanCommute | 5.847e-02 | 7.464e-02 | 0.783 | 0.4334 |
Employed | 3.352e-03 | 1.994e-03 | 1.681 | 0.0927 |
PrivateWork | 1.512e-01 | 7.215e-02 | 2.095 | 0.0362 |
SelfEmployed | 9.309e-03 | 1.322e-01 | 0.070 | 0.9439 |
FamilyWork | -1.173e+00 | 9.915e-01 | -1.183 | 0.2368 |
Unemployment | 3.597e-01 | 1.339e-01 | 2.686 | 0.0072 |
Test-set confusion rates for the logistic regression (rows are true classes; columns are predicted, with Yes corresponding to Hillary Clinton):

class | No | Yes |
---|---|---|
Donald Trump | 0.9771 | 0.0229 |
Hillary Clinton | 0.3736 | 0.6264 |
The variables significant at the 0.05 level in the logistic regression model are Income, IncomePerCap, Professional, Service, Production, Drive, Carpool, Transit, PrivateWork, and Unemployment. These important variables are not fully consistent with the analysis from our decision tree, which identifies Transit, Minority, and Drive as the splitting variables in the cost-complexity pruned model (though Transit and Drive do appear in both). Because logistic regression produces a linear decision boundary while decision trees capture nonlinear classification boundaries, we do not necessarily expect the important variables identified by each classification method to coincide.
To interpret a coefficient, note that the estimates are on the log-odds scale. For example, holding the other variables fixed, a one-percentage-point increase in a county's White population is associated with a decrease of about 0.327 in the log-odds that Clinton wins the county, i.e., the odds are multiplied by exp(-0.327) ≈ 0.72 (though this coefficient is only marginally significant, p ≈ 0.058).
The results in our county (Santa Barbara) matched the predicted results: Hillary Clinton won the Santa Barbara County 2016 presidential election with 60.06% of the vote.
From the ROC curves, we see that the logistic regression curve dominates the decision tree curve, which moves in a step-wise pattern (a pruned tree produces only a handful of distinct predicted probabilities, so its curve has few corners). Decision trees are very interpretable, but they can have high variance, which we do not want when predicting between only two candidates; this is an easier prediction to make with two candidates than it would be early in the election race. Logistic regression, on the other hand, models the probability that each observation belongs to a class, which is useful if we are trying to predict the probability of a candidate winning. A drawback of logistic regression is that its decision boundary is linear, so when the true class boundary is nonlinear it will be a poor classifier.
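A sketch of the ROC comparison using the `ROCR` package; `glm_fit` is an assumed name for the logistic model fit above, and the tree's predicted probabilities are taken from its class-probability matrix:

```r
library(ROCR)

p_glm  <- predict(glm_fit, test, type = "response")
p_tree <- predict(t_pruned, test)[, "Hillary Clinton"]

roc_glm  <- performance(prediction(p_glm,  test$candidate), "tpr", "fpr")
roc_tree <- performance(prediction(p_tree, test$candidate), "tpr", "fpr")

plot(roc_glm, col = "blue")
plot(roc_tree, col = "red", add = TRUE)
abline(0, 1, lty = 2)  # reference line for a random classifier
```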
The main takeaway from this project is that analysis of large data sets takes diligence, but it can reveal a lot about how variables relate, leading to meaningful predictions and insightful inference. We must keep in mind that every fitted model involves a bias-variance tradeoff, so we must always consider the consequences of overfitting and validate our results and misclassification rates on held-out data.
An interesting direction for this data could be to add classification variables based on major issues that the candidates either support or oppose. For example, on health care, Biden openly supported federally funded health care during his campaign, while Trump dismissed the idea. Citizens could then fill out a survey on the issues most important to them and be matched to the candidate closest to their views. This would be helpful for prediction because such major issues likely drive voting much more than demographics such as the percentage of people working from home.
An interesting question that could be asked about census data is whether economic performance affects people’s vote, and subsequently the election results. This is studied greatly in political science as economic voting, which shows that Americans are sociotropic and retrospective economic voters. This means they are highly concerned with the economy at large, as well as the state of the economy in the previous term. These are other factors that could be added to census data analysis to improve predictions of election winners.
Some possibilities for further exploration:

Classification using sub-county-level data before determining a winner would be much more difficult, and would likely perform worse. Depending on which variables define the classes, as well as their range and density in the data, we could get very different results. The main assumption we made in this project is that the census data are accurate. Another major assumption, which we learned about from Nate Silver's article, is that people do not always vote the way they tell pollsters they will: many people who claimed they would not vote for Trump eventually did, leading to a big surprise in the election and large errors in many prediction models.
Test-set confusion rates for the two discriminant analysis fits, LDA and QDA (rows are true classes; columns are predicted):

class | Donald Trump | Hillary Clinton |
---|---|---|
Donald Trump | 0.9724 | 0.0276 |
Hillary Clinton | 0.4547 | 0.5453 |

class | Donald Trump | Hillary Clinton |
---|---|---|
Donald Trump | 0.9470 | 0.0530 |
Hillary Clinton | 0.4267 | 0.5733 |

And for the random forest (`rtest_pred`):

class | Donald Trump | Hillary Clinton |
---|---|---|
Donald Trump | 0.9866 | 0.0134 |
Hillary Clinton | 0.5442 | 0.4558 |
When performing further analysis with other classification models, we note that no one model performs particularly well, with all models, including those of the decision tree and logistic regression, having a high false negative rate. This is an indicator that aligns with the widespread misprediction of the results of the election.
Similarly to the decision tree analysis, the random forest model has an extremely high true negative rate, and also a fairly high false negative rate, though the false negative rate of the random forest model is lower than that discussed in the decision tree analysis.
Linear and quadratic discriminant analysis perform fairly similarly to one another, with a more balanced split between true positive and true negative rates than the decision tree and random forest models, but with a slightly lower true negative rate than those methods. Overall, we suspect that these models are better in context than the decision tree or random forest, even though their total misclassification error is slightly higher, because of the more balanced rates at which they predict each class.
Ultimately, we select logistic regression as our best predictive model: it has a fairly high true negative rate while preserving the balance between true positive and true negative rates. We suspect this model performs well because it handles the relatively high dimension of our dataset and because it only has to assign predictions to two classes: Hillary Clinton and Donald Trump.
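A sketch of the additional fits behind the comparison above, assuming default tuning parameters:

```r
library(randomForest)
library(MASS)

rf_fit  <- randomForest(as.factor(candidate) ~ ., data = train)
lda_fit <- lda(as.factor(candidate) ~ ., data = train)
qda_fit <- qda(as.factor(candidate) ~ ., data = train)

# Row-normalized test confusion matrices, comparable to the tables above
rtest_pred <- predict(rf_fit, test)
prop.table(table(test$candidate, rtest_pred), margin = 1)
prop.table(table(test$candidate, predict(lda_fit, test)$class), margin = 1)
prop.table(table(test$candidate, predict(qda_fit, test)$class), margin = 1)
```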