Problem Set 5: Fake News Consumption in the 2016 US Election

This problem set is due on November 1, 2023 at 11:59pm.

You can find instructions for obtaining and submitting problem sets here.

You can find the GitHub Classroom link to download the template repository on the Ed Board.

Background

For this problem set, we will analyze data from the following article:

Guess, Andrew M., Nyhan, Brendan & Reifler, Jason. Exposure to untrustworthy websites in the 2016 US election. Nature Human Behaviour, Vol. 4, pp. 472–480 (2020).

Guess, Nyhan, and Reifler investigate the consumption of political misinformation online using survey data matched to individual-level web traffic data. These data allowed the researchers to assess which characteristics predict whether a citizen consumed untrustworthy news in the lead-up to the 2016 U.S. presidential election. These untrustworthy sources included, for example, occupydemocrats.com on the left and angrypatriotmovement.com on the right. The authors found that 93% of the fact-checked articles on these sites were classified as false.

We will use the data from this paper, contained in the data/fake_news.csv file, to explore the use of multiple regression. A description of the variables is listed below:

Name             Description
age              Age of respondent
fn_perc          Percent of online news consumption classified as untrustworthy (fake news)
slant_decile     Liberal-conservative slant of news consumption (1 = very liberal, 10 = very conservative)
female           Gender identity of the respondent (1 = female, 0 = male)
black            Respondent identifies as Black (1) or not (0)
nonwhite         Respondent identifies as non-white (1) or white (0)
trump_support    Respondent supports Donald Trump (1) or not (0)
clinton_support  Respondent supports Hillary Clinton (1) or not (0)
college          Respondent has a college degree
polinterest      4-point scale of interest in politics (1 = “hardly any interest”, 4 = “follows politics most of the time”)
knowledge        Political knowledge scale (0 = least knowledgeable, 8 = most knowledgeable)

Question 1 (4 points)

Read the data from data/fake_news.csv, subset it to respondents who support either Trump or Clinton, and save the result as fake_news.

In the write-up, report the number of respondents in the sample and the average percent of online news consumption that is fake news.
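As a rough sketch of the subsetting step, the snippet below filters a small made-up data frame to rows where either support indicator equals 1, then counts rows and averages the outcome. The object names `toy` and `toy_sub` and all values are invented for illustration and are not the real data; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for the real CSV (values are made up)
toy <- tibble(
  trump_support   = c(1, 0, 0, 1),
  clinton_support = c(0, 1, 0, 0),
  fn_perc         = c(10, 2, 5, 6)
)

# Keep respondents who support either candidate
toy_sub <- toy |>
  filter(trump_support == 1 | clinton_support == 1)

# Number of rows and the mean of the outcome in the subset
n_toy    <- nrow(toy_sub)
mean_toy <- mean(toy_sub$fn_perc)
```

Note that mean() returns NA if the column contains missing values unless na.rm = TRUE is supplied.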

Rubric: 1pt for Rmd file compiling (autograder); 2pts for loading data (autograder); 1pt for reporting number of respondents and average (PDF).

Question 2 (6 points)

Create a boxplot that compares fake news consumption by which candidate the respondent supports and save the plot as trump_box. Your plot should look like this:

To create a nicely labeled x-axis variable, use mutate() to create a variable that is "Trump supporter" when trump_support is 1 and "Clinton supporter" otherwise. The autograder does not check what this variable is called, but don’t overwrite your trump_support variable! Be sure to use informative axis labels, though they do not have to match the example plot exactly.
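One possible shape for the labeling and plotting steps is sketched below on made-up data. The column name `candidate`, the object names `toy`, `toy_labeled`, and `toy_box`, and the axis label text are all illustrative assumptions; the sketch assumes dplyr and ggplot2 are installed.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- tibble(
  trump_support = c(1, 1, 0, 0, 1, 0),
  fn_perc       = c(12, 8, 1, 3, 20, 0)
)

# Add a new label column; trump_support itself is left untouched
toy_labeled <- toy |>
  mutate(candidate = if_else(trump_support == 1,
                             "Trump supporter", "Clinton supporter"))

# Boxplot of the outcome by the labeled grouping variable
toy_box <- ggplot(toy_labeled, aes(x = candidate, y = fn_perc)) +
  geom_boxplot() +
  labs(x = "Candidate supported",
       y = "Percent of news diet from untrustworthy sites")
```

Creating a separate column keeps the original 0/1 indicator available for the regressions in later questions.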

In the write-up, report what the plot tells us about which candidate’s supporters consume more untrustworthy news sources.

Rubric: 3pts for boxplot (autograder); 1pt for informative labels on plot (PDF); 2pts for written description of relationship (PDF).

Question 3 (6 points)

Run a linear regression with fake news consumption as your outcome variable and Trump support as your predictor. Save this regression as fit_1 and report the coefficients in a nicely formatted table using the following code (you may need to install the broom package for this to work):

fit_1 |>
  broom::tidy() |>
  select(term, estimate) |>  
  knitr::kable(digits = 2)

Interpret both of these coefficients. Do not merely comment on the direction of the association (i.e., whether the slope is positive or negative). Explain what the values of the coefficients mean in terms of the units in which each variable is measured.
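When the only predictor is a 0/1 indicator, the two coefficients have a direct relationship to group means. The sketch below checks this on made-up data (the names `toy` and `toy_fit` and all values are hypothetical): the intercept matches the mean outcome in the 0 group, and the slope matches the difference in means between the 1 group and the 0 group.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  trump_support = c(0, 0, 0, 1, 1, 1),
  fn_perc       = c(1, 2, 3, 6, 8, 10)
)

# Regression of the outcome on a single binary predictor
toy_fit <- lm(fn_perc ~ trump_support, data = toy)
coef(toy_fit)

# Group means computed directly for comparison
mean_0 <- mean(toy$fn_perc[toy$trump_support == 0])
mean_1 <- mean(toy$fn_perc[toy$trump_support == 1])

# Intercept vs. mean of the 0 group; slope vs. difference in means
all.equal(unname(coef(toy_fit)), c(mean_0, mean_1 - mean_0))
```

Because fn_perc is measured in percentage points, both coefficients are in percentage points as well.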

Rubric: 3pts for correct lm output (autograder); 1pt for coefficient table (PDF); 2pts for interpretation of coefficients (PDF).

Question 4 (5 points)

You decide to investigate the results of the previous question a bit more carefully because you know that Trump supporters probably differ from non-supporters in other ways. Create a scatter plot where the x-axis is the age of the respondent and the y-axis is fake news consumption. Color your points according to whether they support Trump or Clinton. That is, make the points for Trump supporters one color and the points for Clinton supporters a different color. Your plot should look like this:

Save this plot as candidate_scatter. To create better labels for the color of the points, use mutate() in a similar way to Question 2.
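A minimal sketch of a scatter plot colored by group follows, again on made-up data. The names `toy`, `toy_scatter`, and `candidate` and the label text are illustrative assumptions; the sketch assumes dplyr and ggplot2 are installed.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- tibble(
  age           = c(22, 35, 48, 61, 70, 29),
  fn_perc       = c(0, 4, 2, 9, 15, 1),
  trump_support = c(0, 1, 0, 1, 1, 0)
)

# Map a labeled grouping column to the color aesthetic
toy_scatter <- toy |>
  mutate(candidate = if_else(trump_support == 1,
                             "Trump supporter", "Clinton supporter")) |>
  ggplot(aes(x = age, y = fn_perc, color = candidate)) +
  geom_point() +
  labs(x = "Age",
       y = "Percent of news diet from untrustworthy sites",
       color = "Candidate supported")
```

Mapping a character column (rather than the 0/1 indicator) to color gives a discrete legend with readable labels.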

Answer these questions in the write-up: What is the relationship between age and consumption of fake news? Can you detect a relationship between age and support for Trump?

Rubric: 3pts for scatter plot (autograder); 2pts for written conclusions about the plots (PDF).

Question 5 (6 points)

Run a linear regression with fake news consumption as your outcome variable and with Trump support and age as your predictors. Save the output of this regression as fit_2 and report the coefficients in a nicely formatted table using the following code (you may need to install the broom package for this to work):

fit_2 |>
  broom::tidy() |>
  select(term, estimate) |>  
  knitr::kable(digits = 2)

In the main text, interpret the coefficients on the two predictors, ignoring the intercept for now (you will interpret the intercept in the next question). Explain what each coefficient represents in terms of the units of the relevant variables.
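With two predictors, each coefficient describes the predicted change in the outcome when that predictor increases by one unit and the other predictor is held fixed. The sketch below illustrates this on made-up data (the names `toy`, `toy_fit2`, `same_age`, and `one_year` and all values are hypothetical) by comparing predictions for pairs of respondents who differ in only one variable.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  trump_support = c(0, 1, 0, 1, 0, 1, 0, 1),
  age           = c(20, 25, 35, 40, 50, 55, 65, 70),
  fn_perc       = c(1, 5, 2, 8, 4, 9, 5, 13)
)

toy_fit2 <- lm(fn_perc ~ trump_support + age, data = toy)

# Two hypothetical respondents of the same age who differ only in Trump support
same_age     <- data.frame(trump_support = c(0, 1), age = c(40, 40))
diff_support <- diff(predict(toy_fit2, newdata = same_age))

# Two hypothetical non-supporters who are one year apart in age
one_year <- data.frame(trump_support = c(0, 0), age = c(40, 41))
diff_age <- diff(predict(toy_fit2, newdata = one_year))

# Each difference in predictions equals the corresponding coefficient
all.equal(unname(diff_support), unname(coef(toy_fit2)["trump_support"]))
all.equal(unname(diff_age), unname(coef(toy_fit2)["age"]))
```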

Rubric: 3pts for correct lm output (autograder); 1pt for the coefficient table (PDF); 2pts for correct interpretation of the coefficients (PDF).

Question 6 (2 points)

Now interpret the intercept from the regression model with two predictors. Is this intercept a substantively important or interesting quantity? Why or why not?
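As a reminder of what an intercept is mechanically, the sketch below (made-up data, hypothetical names `toy`, `toy_fit2`, and `at_zero`) shows that it equals the model's prediction when every predictor is set to zero. It also prints the range of ages in the toy data, which is one way to see whether zero falls among the observed values.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  trump_support = c(0, 1, 0, 1, 0, 1, 0, 1),
  age           = c(20, 25, 35, 40, 50, 55, 65, 70),
  fn_perc       = c(1, 5, 2, 8, 4, 9, 5, 13)
)

toy_fit2 <- lm(fn_perc ~ trump_support + age, data = toy)

# Prediction with both predictors set to zero
at_zero <- predict(toy_fit2,
                   newdata = data.frame(trump_support = 0, age = 0))

# This prediction is the intercept
all.equal(unname(at_zero), unname(coef(toy_fit2)["(Intercept)"]))

# Observed range of the age predictor in the toy data
range(toy$age)
```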

Rubric: 1pt for correct interpretation (PDF); 1pt for argument about substantive importance (PDF).

Question 7 (6 points)

Now we’ll see how consumption of misinformation varies by the slant of the media diet. Create a new variable called slant_group that takes on the following values:

  • "Liberal" when the slant_decile is less than or equal to 4
  • "Moderate" when slant_decile is 5 or 6
  • "Conservative" when slant_decile is greater than or equal to 7
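The three rules above can be sketched with dplyr's case_when(), shown here on a made-up column containing each decile once. The names `toy` and `toy_grouped` are hypothetical; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: one row per possible decile value
toy <- tibble(slant_decile = 1:10)

# Conditions are checked in order; the first one that is TRUE is used
toy_grouped <- toy |>
  mutate(slant_group = case_when(
    slant_decile <= 4          ~ "Liberal",
    slant_decile %in% c(5, 6)  ~ "Moderate",
    slant_decile >= 7          ~ "Conservative"
  ))

# Tabulate the result to confirm every decile landed in a group
table(toy_grouped$slant_group)
```

Rows where slant_decile is missing match none of the comparisons and so receive NA in the new column.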

Run a regression of the untrustworthy news consumption percentage on this newly created variable. Save the output of this regression as fit_3 and report the coefficients in a nicely formatted table using the following code (you may need to install the broom package for this to work):

fit_3 |>
  broom::tidy() |>
  select(term, estimate) |>  
  knitr::kable(digits = 2)

In the main text, substantively interpret each coefficient in the regression, including the intercept. From this regression, indicate which media diet slant group has the highest consumption of untrustworthy news sources.
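When lm() is given a character or factor predictor, R by default treats the first level in alphabetical order as the baseline category. The sketch below, on made-up data with hypothetical names `toy`, `toy_fit3`, and `group_means`, checks that the intercept matches the baseline group's mean and that each remaining coefficient matches that group's mean minus the baseline mean.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  slant_group = c("Liberal", "Liberal", "Moderate",
                  "Moderate", "Conservative", "Conservative"),
  fn_perc     = c(1, 3, 2, 6, 9, 11)
)

# "Conservative" sorts first alphabetically, so it is the baseline here
toy_fit3 <- lm(fn_perc ~ slant_group, data = toy)
coef(toy_fit3)

# Group means computed directly for comparison
group_means <- tapply(toy$fn_perc, toy$slant_group, mean)

# Intercept = baseline mean; other coefficients = differences from baseline
expected <- c(group_means[["Conservative"]],
              group_means[["Liberal"]]  - group_means[["Conservative"]],
              group_means[["Moderate"]] - group_means[["Conservative"]])
all.equal(unname(coef(toy_fit3)), expected)
```

A group's own mean can be recovered by adding its coefficient to the intercept, which makes it possible to rank all three groups from the coefficient table alone.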

Rubric: 3pts for correct lm output (autograder); 1pt for the coefficient table (PDF); 2pts for correct interpretation of the coefficients and identifying the highest group (PDF).