U.S. Intellectual History Blog

Mining the Media I: Clues and Coded Language

Editor's Note

This is a guest post from Aubrey Parke. Aubrey Parke is an M.A. Public History student and Graduate Assistant at Duquesne University. In the past, she has worked as an oral historian, community archivist, and consulting firm analyst. She is from San Antonio, Texas, where she dedicates her spare time to work surrounding immigration and refugee issues.

Since the 2016 election, oxymorons like “alternative facts” and “fake news” have become part of the American vernacular. Social media, rife with echo chambers and ethically dubious algorithms, has accelerated the spread of misinformation. A record 60% of Americans do not trust the mass media. Concerned citizens often feel overwhelmed by the task of parsing truth from fiction.

No message comes without bias. Once equipped to identify political agendas and predispositions, media consumers can perhaps get a little closer to the truth. (Cue the digital humanists.) Can researchers use text mining to detect media bias?

To answer this question, I downloaded all Twitter posts from five different news sources between October 27 and November 28, 2020. I chose to work with Twitter to avoid the complications of selecting and formatting thousands of online news articles. I used Vicinitas to download the tweets into five Excel spreadsheets (one for each news source), converted each spreadsheet into a Word document, and uploaded the five documents to Voyant as a corpus.
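For readers who would rather script this step than convert files by hand, the sketch below shows one way to turn exported tweet rows into one document per source. It is not the Vicinitas or Voyant workflow described above; the row layout, the source labels, and the function name `build_corpus` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for the exported spreadsheets: (source, tweet text).
# The labels and texts are placeholders, not real tweets.
rows = [
    ("SourceA", "Placeholder tweet one "),
    ("SourceB", "Placeholder tweet two"),
    ("SourceA", "Placeholder tweet three"),
]

def build_corpus(rows):
    """Group tweet texts by source and join them into one document each."""
    by_source = defaultdict(list)
    for source, text in rows:
        by_source[source].append(text.strip())
    # One newline-separated document per source, like one file per outlet.
    return {source: "\n".join(texts) for source, texts in by_source.items()}

corpus = build_corpus(rows)
print(sorted(corpus))
```

Each value in `corpus` could then be saved as a plain text file and uploaded to a text analysis tool as a separate document.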

Figure 1: Lines represent the volume of tweets from each source between October 27 and November 28.

CNN posts on Twitter most frequently, while Fox posts rarely. Many of Fox’s posts were just links to online articles or videos, further limiting the potential for text analysis. Ideally, future text mining would also include Facebook posts, online news articles, and/or headlines.
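A volume comparison like the one in Figure 1 can be approximated with a simple tally of tweets per source per day. This is a minimal sketch, assuming each exported row carries a timestamp in `YYYY-MM-DD HH:MM:SS` form; the timestamp format, the sample rows, and the name `daily_volume` are hypothetical.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical (source, timestamp) pairs; the timestamp format is an assumption.
tweets = [
    ("SourceA", "2020-10-27 09:15:00"),
    ("SourceA", "2020-10-27 18:40:00"),
    ("SourceB", "2020-10-27 12:00:00"),
    ("SourceA", "2020-10-28 07:05:00"),
]

def daily_volume(tweets):
    """Count tweets per (source, calendar day)."""
    counts = Counter()
    for source, stamp in tweets:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
        counts[(source, day)] += 1
    return counts

volume = daily_volume(tweets)
print(volume[("SourceA", date(2020, 10, 27))])  # 2
```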

My methodology relies on two main assumptions:

  1. Twitter posts represent what news outlets consider to be their most breaking, urgent, interesting, or “clickable” stories, thus making archived tweets a reasonable corpus for analysis.
  2. I assume the political leanings of the five media sources: CNN is a mainstream left news source, Fox is mainstream right, Vox is farther left, and Breitbart is far right. I will discuss Axios, an allegedly more objective and neutral news source, in the next post.

CNN, Fox, Vox, and Breitbart are household names, with political slants that are hardly subtle. I selected them for two main purposes: first, to measure the accuracy of my text mining process in determining bias and second, to identify the specific language and content that communicates bias. Perhaps CNN is left of center, but what makes it so? And how does Vox mark itself as farther left than CNN?

Below are three types of “clues” that may help text miners recognize and examine media bias.

  1. Content

Four topics dominated all five sources: Donald Trump, Joe Biden, the 2020 presidential election, and COVID-19.

Figure 2: Word cloud of frequent terms for entire corpus. I edited the stop words in Voyant to filter out pronouns, hyperlinks, and other words irrelevant to content analysis.
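The counts behind a word cloud like Figure 2 amount to a term frequency table with stop words and hyperlinks filtered out. The sketch below illustrates that idea in Python rather than Voyant; the stop list is a deliberately tiny assumption, the tokenizer is simplistic (it ignores curly apostrophes, for example), and `top_terms` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "he", "she", "it", "they"}

URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")

def top_terms(text, n=10):
    """Return the n most frequent terms after dropping URLs and stop words."""
    cleaned = URL_PATTERN.sub(" ", text.lower())
    tokens = [t for t in TOKEN_PATTERN.findall(cleaned) if t not in STOP_WORDS]
    return Counter(tokens).most_common(n)

# Placeholder text, not a real tweet.
sample = "The election is close. They say the election count continues https://example.com/x"
print(top_terms(sample, 1))  # [('election', 2)]
```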

In the context of the election, tweets about BLM and law enforcement become especially interesting. Between October 27 and November 28, CNN, Vox, and Axios did not promote journalism about Black Lives Matter or police brutality on Twitter. Fox and Breitbart, on the other hand, dedicated a significant portion of their monthly tweets to those subjects.

Figure 3: Relative frequency of the terms “Black Lives Matter” and “BLM”

Figure 4: Relative frequency of the terms “defund” and “defund the police.”
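Relative frequency, as charted in Figures 3 and 4, normalizes a term's raw count by the size of each document, so that a prolific source and a quiet one can be compared fairly. A minimal sketch of that calculation follows. The per-1,000-words scale is an assumption chosen for readability and may not match the scale Voyant reports; the sample texts are placeholders.

```python
import re

def relative_frequency(text, phrase):
    """Occurrences of phrase per 1,000 word tokens (case-insensitive)."""
    total_tokens = len(re.findall(r"\w+", text))
    if total_tokens == 0:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = len(pattern.findall(text))
    return 1000 * hits / total_tokens

# Placeholder documents, one per hypothetical source.
corpus = {
    "SourceA": "BLM march downtown. Officials respond to BLM organizers.",
    "SourceB": "Election results expected soon.",
}

for name, text in corpus.items():
    print(name, relative_frequency(text, "blm"))
```

Multi-word phrases such as “defund the police” can be passed the same way, since the phrase is matched as a whole.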

Here are some examples of how Fox and Breitbart, respectively, tweeted about BLM and police brutality:

“Obama’s daughters joined summer protests against police brutality.”

Fox News, November 27, 2020

“Defund-police supporters tell Biden they’re ‘not going away.’”

Fox News, November 26, 2020

“BLM-NBA Woke Update: The Sacramento Kings fired an announcer who said, ‘all lives matter,’ only to replace him by hiring someone who claimed Donald Trump is a ‘white supremacist terrorist.'”

Breitbart News, November 18, 2020

“Antifa and BLM protestors took over roads and harassed drivers in a Portland suburb during a protest Saturday night.”

Breitbart News, November 15, 2020

Fox News aimed to draw connections between the Democratic party and anti-police protests, while Breitbart overtly disparaged BLM. Breitbart also focused on cultural manifestations of BLM in Hollywood, the NBA, and the NFL, rather than electoral politics.

Figure 5: Police & BLM Word Association Web – Breitbart
Created using the “Link” tool in Voyant. Some extraneous terms were manually extracted.

Figure 6: Police & BLM Word Association Web – Fox News
Created using the “Link” tool in Voyant. Some extraneous terms were manually extracted.
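Word association webs like Figures 5 and 6 rest on counting which words appear near a keyword. The sketch below shows one simple version of that idea: tallying every word within a fixed window of each keyword hit. It is not a reimplementation of Voyant's tool; the window size, the tokenizer, and the name `collocates` are assumptions, and the sample sentence is a placeholder.

```python
import re
from collections import Counter

def collocates(text, keyword, window=3):
    """Count words appearing within `window` tokens of each keyword hit."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
        # Skip the keyword itself so it is not listed as its own neighbor.
        counts.update(n for n in neighbors if n != keyword)
    return counts

# Placeholder text, not a real tweet.
sample = "police union endorses candidate while protesters criticize police budget"
links = collocates(sample, "police", window=2)
print(links.most_common())
```

In practice, stop words would usually be filtered out before counting so that function words do not dominate the web.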

This content analysis suggests a few key takeaways:

  1. Fox News seemed to wield BLM and related protests as weapons against the Democratic Party during the 2020 election.
  2. CNN showed little interest in covering BLM, Defund the Police, or related topics on Twitter during the 2020 election cycle.
  3. Breitbart News preferred to connect BLM to specific local protests, cultural elites in Hollywood, and an allegedly predatory Antifa movement, rather than the 2020 election.

  2. Vocabulary and word choice

The “Summary” function in Voyant lists terms unique to each document in the corpus. After viewing those terms, I used the “Context” and “Word tree” functions to better understand how each source employed its unique terms. For example, only Vox used the word “coup,” as in “Trump is attempting a coup in plain sight.” While CNN critiqued Donald Trump’s response to the election, it avoided the more extreme language of a “coup.”

Figure 7: Vox – Word tree for “coup” (expanded)
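The two steps described here, finding terms unique to one source and then reading them in context, can be sketched outside Voyant as a set difference followed by a keyword-in-context listing. This is only an approximation under stated assumptions: Voyant may score distinctive terms differently, and the helper names and placeholder texts below are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def unique_terms(corpus):
    """Map each source to the terms that appear in no other source."""
    vocab = {name: set(tokenize(text)) for name, text in corpus.items()}
    unique = {}
    for name, words in vocab.items():
        others = set().union(*(v for other, v in vocab.items() if other != name))
        unique[name] = words - others
    return unique

def keyword_in_context(text, keyword, width=3):
    """List each keyword hit with up to `width` tokens on either side."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Placeholder documents, not real tweets.
corpus = {
    "SourceA": "analysts call the move a coup attempt",
    "SourceB": "analysts call the move a legal challenge",
}
print(unique_terms(corpus)["SourceA"])
print(keyword_in_context(corpus["SourceA"], "coup", width=2))
```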

All five sources discussed COVID-19, but they called the virus by different names. CNN and Vox tended to say “Covid-19,” while Fox and Breitbart were more likely to say “coronavirus.” Given that Fox is more conservative than Vox or CNN and Breitbart is at times overtly anti-mask, this difference in word choice could suggest a pattern. Perhaps “coronavirus” indicates a source that endorses more lenient regulations, while “COVID-19” elevates the language of the CDC and public health officials. This merits further investigation.

Figure 8: Relative frequency of the terms “COVID,” “coronavirus,” and “pandemic”
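One way to test the naming pattern suggested above is to ask, for each source, what share of its mentions of the virus use each name. The sketch below computes that share; the variant list, the placeholder text, and the name `naming_share` are assumptions. Note that matching “covid” as a whole word also counts “Covid-19,” since the hyphen ends the word.

```python
import re

def naming_share(text, variants):
    """Return each variant's share of all variant mentions in text."""
    lowered = text.lower()
    counts = {
        v: len(re.findall(r"\b" + re.escape(v) + r"\b", lowered))
        for v in variants
    }
    total = sum(counts.values())
    if total == 0:
        return {v: 0.0 for v in variants}
    return {v: counts[v] / total for v in variants}

# Placeholder text, not real tweets.
sample = (
    "Covid-19 cases rise. COVID vaccine news. "
    "Covid-19 testing expands. Coronavirus briefing today."
)
print(naming_share(sample, ["covid", "coronavirus"]))
```

A source that strongly prefers one name would show a share near 1.0 for that variant, which makes the preference comparable across sources regardless of how often each one tweets.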

  3. Insider-outsider language

Insider-outsider language offers a simple way to identify sources that lean far to the right or left, since major news outlets are less likely to critique “mainstream media.” Text analysis shows an extremely high rate of use for the words “radical” and “establishment” from Breitbart News. Vox uses “radical” to call for major political reforms, while “radical” mainly appears in Fox’s tweets in direct quotes from Donald Trump. CNN avoids all three words charted in Figure 9 (“radical,” “establishment,” and “mainstream”), suggesting an aversion to language that separates and alienates. Breitbart, a source that predicates its existence on mistrust of mainstream media, relies heavily on divisive language.

Figure 9: Relative frequency of the terms “radical,” “establishment,” and “mainstream.”

In the following post, I will discuss the potential application of these media-mining clues.