§8 Topic Modeling 📚🔍

Welcome to the world of topic modeling in Digital Humanities! In this chapter, we’ll explore how to identify latent topics in texts using computational methods.

Learning Objectives
  • 📊 Understand core concepts of topic modeling
  • 🛠️ Learn how Latent Dirichlet Allocation (LDA) works
  • 💻 Implement topic modeling with R
  • 📈 Visualize topic distributions

1 Warming Up: Topic Identification Exercise 👥

Before diving into computational topic modeling, let’s explore something we do naturally every day: identifying themes and topics in text. When we read a newspaper, browse social media, or scan through emails, our brains automatically categorize content into topics based on words, context, and patterns.

But how do we actually do this? What clues do we use to determine if a text is about:

  • 📰 Politics
  • ⚽ Sports
  • 💻 Technology
  • 🎭 Entertainment

Understanding our own cognitive process of topic identification will help us grasp how machines approach this task.

Interactive Exercise

Let’s try a hands-on exercise with a short text from Jane Austen’s “Pride and Prejudice.” You’ll:

  1. Read & Analyze: Identify the main topics in the text
  2. Label: Input the top 1-3 topics you found
  3. Select: Select words that support each topic
  4. Rank: Order the importance of words within each topic
Why This Matters

This process mirrors what topic modeling algorithms do at a much larger scale. By experiencing the process firsthand, you’ll better understand how machines can discover hidden thematic patterns across thousands of documents.

Having completed the exercise above, you likely noticed several key aspects of topic identification that mirror how machines approach this task:

  1. Word-Topic Relationships
    • Words can belong to multiple topics (e.g., “fortune” could relate to both “Wealth” and “Marriage”)
    • Some words are more strongly associated with certain topics than others
    • Context affects how we assign words to topics
  2. Topic Distribution
    • Texts often contain multiple topics simultaneously
    • Topics aren’t evenly distributed - some are more dominant than others
    • The same word patterns can suggest different topics
  3. Challenges in Topic Identification
    • Deciding the optimal number of topics
    • Dealing with ambiguous words
    • Balancing broad vs. specific topics
    • Handling overlapping themes

These observations align closely with how topic modeling algorithms work, particularly Latent Dirichlet Allocation (LDA), which we’ll explore next.

2 What is Topic Modeling? 📊

As we saw in our exercise, identifying topics in text requires recognizing patterns of words that frequently appear together. While humans can do this intuitively for a few documents, what if we needed to analyze hundreds or thousands of texts? This is where topic modeling comes in.

Key Concept

Topic modeling is a computational method that automatically discovers hidden thematic patterns, or “topics,” across a large collection of documents. Just as we identified themes in Pride and Prejudice by looking at word patterns, topic modeling algorithms find statistical patterns of word co-occurrence that suggest underlying themes.

Several algorithms have been developed for topic modeling, but Latent Dirichlet Allocation (LDA) has become the most widely used approach in Digital Humanities.

2.1 Key Assumptions of LDA

Like any algorithm, LDA has some key assumptions about how topics, documents, and words relate to each other.

Key Assumptions of LDA

Documents are Bags of Words

  • Word order doesn’t matter
  • Only the frequency of words is considered
  • Grammar and sentence structure are ignored

Topics are Word Distributions

  • Each topic is a probability distribution over words
  • Some words are more likely to appear in certain topics
  • Example: A “politics” topic might have high probabilities for “election,” “vote,” “democracy”

Documents are Topic Mixtures

  • Each document contains multiple topics in different proportions
  • Example: An article might be:
    • 70% politics
    • 20% economics
    • 10% social issues

Topics are Generated First

  • The model assumes topics exist before documents
  • Documents are created by:
    1. Choosing topic proportions
    2. Selecting words from those topics
    3. Combining words into documents
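
To make this generative story concrete, here is a minimal sketch in R. It is purely illustrative: the vocabulary, topics, and proportions are made up and are not part of this chapter's data.

# Two hand-made "topics": each is a probability distribution over a tiny vocabulary
set.seed(42)
vocab <- c("election", "vote", "democracy", "love", "marriage", "fortune")
topics <- rbind(
  politics = c(0.40, 0.30, 0.25, 0.02, 0.02, 0.01),
  romance  = c(0.02, 0.02, 0.01, 0.40, 0.30, 0.25)
)

# Step 1: choose topic proportions for one document
# (real LDA draws these proportions from a Dirichlet distribution)
doc_topic_mix <- c(politics = 0.7, romance = 0.3)

# Steps 2-3: for each word slot, pick a topic, then draw a word from that topic
n_words <- 12
chosen_topics <- sample(rownames(topics), n_words, replace = TRUE, prob = doc_topic_mix)
document <- sapply(chosen_topics, function(t) sample(vocab, 1, prob = topics[t, ]))
document  # a "bag of words" generated from the assumed topics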

2.2 The Three Core Matrices

Now that we understand LDA’s assumptions, let’s explore how it represents and processes text data. At its heart, LDA works with three essential matrices that capture different aspects of the relationships between documents, topics, and words:

The Three Essential Matrices
  1. Document-Term Matrix
    • Shows how words appear in documents
    • Like a spreadsheet counting word frequencies
  2. Document-Topic Matrix
    • Shows how topics are distributed in documents
    • Tells us “this document is 70% politics, 30% economics”
  3. Topic-Word Matrix
    • Shows which words are important for each topic
    • Reveals “these words are strongly associated with the politics topic”

Let’s examine each of these matrices in detail to understand how they work together to discover topics in our texts.

2.2.1 1. Document-Term Matrix

The starting point for topic modeling is converting our text collection into a format that computers can process efficiently.

What is a Document-Term Matrix?

A document-term matrix is a table where:

  • Each row represents a document
  • Each column represents a word
  • Each cell contains the frequency of that word in that document

For example, let’s look at how we might represent a small collection of book reviews:

Example Matrix
| Document | romantic | secret | murder | love | detective |
|----------|----------|--------|--------|------|-----------|
| Review 1 | 5        | 1      | 0      | 4    | 0         |
| Review 2 | 0        | 4      | 3      | 0    | 5         |
| Review 3 | 2        | 3      | 2      | 1    | 2         |
  • Review 1 appears to be about a romance novel
  • Review 2 looks like a detective story
  • Review 3 might be a romantic mystery
Key Properties
  1. Sparsity: Most cells contain zeros (most words don’t appear in most documents)
  2. Size: For real collections, this matrix is usually very large
  3. Information Loss: Word order and grammar are lost, but topic patterns remain

This matrix provides the foundation for discovering topics. LDA will use this to identify which words tend to appear together, suggesting underlying themes in our documents.
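
As a small illustration, the sketch below builds a toy document-term matrix with tidytext. The review texts are invented for this example and are not the chapter's dataset.

p_load(tidytext, tidyverse)

reviews <- tibble(
  doc_id = c("Review 1", "Review 2", "Review 3"),
  text = c(
    "romantic love story full of love and a family secret",
    "a detective hunts the murder suspect and uncovers a secret",
    "a romantic detective tale of love, murder, and secret letters"
  )
)

dtm_example <- reviews %>%
  unnest_tokens(word, text) %>%   # one row per word occurrence
  count(doc_id, word) %>%         # word frequencies per document
  cast_dtm(doc_id, word, n)       # the sparse document-term matrix LDA expects

dtm_example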

2.2.2 2. Document-Topic Matrix

This matrix represents how topics are distributed across documents.

Document-Topic Distribution

Each row shows how much of each topic appears in a document:

  • Rows = Documents
  • Columns = Topics
  • Values = Probability/proportion of each topic in the document

Example Matrix
| Document | Romance | Mystery | Politics |
|----------|---------|---------|----------|
| Doc 1    | 0.7     | 0.2     | 0.1      |
| Doc 2    | 0.1     | 0.8     | 0.1      |
| Doc 3    | 0.4     | 0.4     | 0.2      |
  • Doc 1 is mostly about romance (70%)
  • Doc 2 is primarily a mystery (80%)
  • Doc 3 is an even mix of romance and mystery

2.2.3 3. Topic-Word Matrix

This matrix shows the distribution of words within each topic.

Topic-Word Distribution

Each row represents a topic’s probability distribution over words:

  • Rows = Topics
  • Columns = Words
  • Values = Probability of the word appearing in the topic

Example Matrix
| Topic   | love | mystery | murder | marriage | clue |
|---------|------|---------|--------|----------|------|
| Romance | 0.30 | 0.05    | 0.01   | 0.25     | 0.01 |
| Mystery | 0.02 | 0.25    | 0.20   | 0.01     | 0.15 |
  • “love” and “marriage” are highly probable in the Romance topic
  • “mystery”, “murder”, and “clue” are more likely in the Mystery topic

Together, these three matrices capture the complex relationships between documents, topics, and words.
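
To see how the two estimated matrices fit together, here is a minimal sketch using the toy numbers from the tables above. Only a slice of a full vocabulary is shown, so the topic rows do not sum to 1, and the "Politics" row is made up to complete the example.

# Document-topic proportions and topic-word probabilities as matrices
doc_topic <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.4, 0.4, 0.2),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste("Doc", 1:3), c("Romance", "Mystery", "Politics"))
)
topic_word <- matrix(
  c(0.30, 0.05, 0.01, 0.25, 0.01,
    0.02, 0.25, 0.20, 0.01, 0.15,
    0.01, 0.02, 0.01, 0.02, 0.01),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Romance", "Mystery", "Politics"),
                  c("love", "mystery", "murder", "marriage", "clue"))
)

# Multiplying them gives each document's expected word probabilities,
# which is how LDA connects documents to words through topics
doc_topic %*% topic_word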

3 Implementing Topic Modeling with R 📊

Let’s analyze a dataset of potentially fake news articles using topic modeling. We’ll explore how different types of misinformation cluster into topics.

3.1 Fake News Dataset

For this topic modeling exercise, we’ll use a dataset of potentially fake news articles from Kaggle’s Fake News Detection Dataset. The full dataset contains more than 12,600 articles from different fake news outlets, mostly published between 2016 and 2017. Each article includes the article title, text, type, and the date it was published. The collected data were cleaned and processed; however, the punctuation and mistakes present in the original fake news articles were kept in the text.

The dataset includes:

  • News article titles
  • Full text content
  • Labels indicating article type (bias, conspiracy, etc.)
  • Language indicators

We’ll work with a subset of 500 English-language articles to keep our analysis manageable while still providing enough data to identify meaningful patterns.

3.2 Setup and Data Loading

Download Fake News Dataset

Save it in the folder named data in your working directory (like we did for the dh_keywords.csv).

We can then load the csv file into RStudio:

library(pacman)
p_load(tidyverse)

# Load the dataset 
fake_news <- read_csv("data/fake_500.csv")
Rows: 500 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): title, text, type

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows
head(fake_news)
# A tibble: 6 × 3
  title                                                              text  type 
  <chr>                                                              <chr> <chr>
1 "Muslims BUSTED: They Stole Millions In Gov’t Benefits"            "Pri… bias 
2 "Re: Why Did Attorney General Loretta Lynch Plead The Fifth?"      "Why… bias 
3 "BREAKING: Weiner Cooperating With FBI On Hillary Email Investiga… "Red… bias 
4 "PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnapped And Killed By IS… "Ema… bias 
5 "FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Healthcare Begins With… "Ema… bias 
6 "Hillary Goes Absolutely Berserk On Protester At Rally! (Video)"   "Pri… bias 

3.3 Text Preprocessing

Preprocessing Steps
  1. Create unique document IDs
  2. Generate n-grams (1-3 words)
  3. Remove stopwords
  4. Filter rare terms (keep only terms that appear more than once in a document)
# Load required packages
p_load(
  tidytext,       # For tokenization and casting to a document-term matrix
  topicmodels,    # For LDA
  LDAvis,         # For interactive visualization
  stringr,        # For string manipulation
  stopwords       # For stopword lists
)
# Create document-term matrix
news_dtm <- fake_news %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(
    word, text,
    token = "ngrams", 
    n_min = 1, n = 3,
    stopwords = c(
      stopwords::stopwords("en"),
      stopwords::stopwords(source = "smart")
    )
  ) %>%
  count(doc_id, word) %>%
  filter(n > 1) %>%
  cast_dtm(doc_id, word, n)

news_dtm
<<DocumentTermMatrix (documents: 382, terms: 10441)>>
Non-/sparse entries: 21711/3966751
Sparsity           : 99%
Maximal term length: 34
Weighting          : term frequency (tf)
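
Before fitting a model, it can help to sanity-check the matrix. The optional sketch below converts the DTM back to a tidy format and lists the most frequent terms, which makes leftover stopwords or junk tokens easy to spot.

tidy(news_dtm) %>%                  # one row per (document, term, count)
  group_by(term) %>%
  summarise(total = sum(count)) %>%
  arrange(desc(total)) %>%
  slice_head(n = 20)                # the 20 most frequent terms overall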

3.4 Fitting the LDA Model

3.4.1 Choosing the Number of Topics

What is k?

In topic modeling, ‘k’ is the number of topics you want the model to find.

  • If k = 3, you’re asking the model to find 3 topics in the data
  • If k = 10, you’re looking for 10 different topics

This is one of the most important decisions you’ll make when topic modeling!

Now, how do we choose k for our fake news dataset? We are in luck because we have “external help” in our dataset: the type variable (labels indicating the type of each news article).

# Check unique types in our dataset
fake_news %>%
  count(type)
# A tibble: 4 × 2
  type           n
  <chr>      <int>
1 bias          53
2 bs           308
3 conspiracy   124
4 fake          15
Why Start with k = 4 for Our Analysis? 📊

In this dataset, we have a helpful guide:

  1. Our articles are already labeled with 4 different types
  2. These types represent different categories of fake news
  3. This suggests that k = 4 might be a good starting point

Important Caveat About Topics vs Types ⚠️

Remember: The model’s topics might not match our human-labeled types!

  • Types = Human-assigned categories (like “bias” or “conspiracy”)
  • Topics = Patterns the model finds in the actual words used

Why they might be different:

  1. The model looks at word patterns, not human judgments
  2. An article labeled as “bias” might use similar language to “conspiracy”
  3. There might be hidden patterns we humans didn’t notice
  4. The same type of fake news might use different writing styles

Think of it like this: We’re giving the model a hint to look for 4 patterns, but it might find different groupings than our human labels suggest!

Important Caveats ⚠️
  1. In Real Life: You usually won’t have pre-labeled categories
  2. No Perfect k: There’s rarely one “right” number of topics
  3. Different Patterns: The model might find different groupings than human labels
  4. Iterative Process: You usually need to try several values of k and use your own judgment to decide which number of topics works best (one quantitative approach is sketched below)
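
For reference, one quantitative way to compare candidate k values is sketched below. It assumes the ldatuning package is installed (it is not used elsewhere in this chapter), and its metrics are only heuristics; your own reading of the resulting topics still matters most.

p_load(ldatuning)

k_search <- FindTopicsNumber(
  news_dtm,
  topics = 2:10,                              # candidate numbers of topics
  metrics = c("CaoJuan2009", "Deveaud2014"),  # two common model-selection metrics
  method = "Gibbs",
  control = list(seed = 1234),
  verbose = FALSE
)

FindTopicsNumber_plot(k_search)  # look for where the metrics level off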

Based on this understanding, let’s create our topic model with k = 4:

# Set global seed for reproducibility
set.seed(1234)

# Create our topic model
lda_model <- LDA(
  news_dtm,
  k = 4,                # Number of topics to find
  method = "Gibbs",     
  control = list(
    seed = 1234,        # For reproducible results
    iter = 2000,        # Number of iterations
    thin = 100,         # Save every n iteration
    best = TRUE,        # Return the best model
    verbose = FALSE     # Switch on to show progress and monitor convergence
  )
)

Think of Gibbs sampling like exploring a city to find the best restaurants:

  1. More Thorough Exploration 🗺️
    • Instead of stopping at the first good restaurant you find (like simpler methods do)
    • Gibbs sampling keeps exploring to find even better options
  2. Better with Uncertainty 🤔
    • Words often have multiple meanings
    • Gibbs sampling is good at handling this ambiguity in texts
  3. More Reliable Results
    • Less likely to get stuck in one interpretation
    • Keeps looking for better ways to group topics
  4. Good for Humanities Texts 📚
    • Works well with complex, nuanced writing
    • Handles the way words change meaning in different contexts

The parameters tell it:

  • How many times to explore (iter = 2000)
  • How often to take notes (thin = 100)
  • Where to start looking (seed = 1234)

3.5 LDA Results Can Vary Between Runs 🎲

  1. Random Starting Points
    • Each word in each document starts with a random topic assignment
    • Different starting points can lead to different final configurations
  2. Probability-Based Sampling
    • The algorithm (Gibbs sampling) randomly samples new topic assignments based on probabilities
    • While it converges towards good solutions, the path can vary
  3. Multiple Valid Solutions
    • Different runs may find different but equally valid topic configurations
    • Especially true for complex, overlapping topics in natural language

Best Practice:

  • Always set a seed for reproducibility
  • Run multiple times with different seeds to assess stability
  • Consider using consensus methods for more robust results
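
A minimal sketch of the “run it several times” advice follows; refitting is slow, so this is optional, and the seed values are arbitrary.

seeds <- c(1234, 5678, 9999)

runs <- lapply(seeds, function(s) {
  LDA(news_dtm, k = 4, method = "Gibbs",
      control = list(seed = s, iter = 2000))
})

# Top 10 terms per topic for each run; stable topics should reappear
# across seeds, though possibly in a different order
lapply(runs, terms, 10)
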
set.seed() and reproducibility 🎲

Think of set.seed() like setting up a recipe for randomness:

What is set.seed()? 🤔

  • It’s like giving a starting point for random number generation
  • Same seed = Same sequence of random numbers
  • Different seed = Different sequence

Why Use It? 🎯

  1. Makes results reproducible
  2. Helps others get the same results
  3. Useful for debugging and testing

Platform Differences ⚠️

Even with set.seed(), you might get different results across:

  • Different R versions
  • Different operating systems
  • Different computers

This happens because:

  • Random number generators work differently across platforms (Mac/Linux may be different from Windows)
  • Math operations might be handled differently
  • Some algorithms use parallel processing
Best Practices for Reproducible Random Number Generation (RNG) 🎯

To maximize reproducibility across platforms:

  1. Set RNG Type Explicitly
RNGkind(kind = "Mersenne-Twister", normal.kind = "Inversion")
  2. Document Your Environment
sessionInfo()  # Record R version, OS, and package versions
  3. Use Fixed Seeds in Parallel Processing
library(parallel)
RNGkind("L'Ecuyer-CMRG")  # For parallel-safe random numbers
  4. Consider Using a Reproducible Pipeline
  • Use renv for package management
  • Docker containers for complete environment control
  • Document all random seed values used
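
A minimal renv workflow looks like this (a sketch, assuming the renv package is installed):

renv::init()      # create a project-local library and lockfile
# ... install the packages your analysis needs ...
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # later, or on another machine, reinstall those exact versions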

3.6 Exploring Results

Key Parameters/Results in Topic Modeling 📊
  1. Beta (β) or Phi (φ) Values: Topic-Word Probabilities 📝
    • What it is: How strongly each word belongs to each topic
    • Also called: φ (phi) in technical documents and code
    • Think of it like: A recipe showing how much of each ingredient (word) goes into each dish (topic)
    • Example: In a “sports” topic, words like “game” might have high β values
  2. Gamma (γ) or Theta (θ) Values: Document-Topic Probabilities 📚
    • What it is: How much each document belongs to each topic
    • Also called: θ (theta) in technical documents and code
    • Think of it like: A pie chart showing what percentage of each topic is in a document
    • Example: An article might be 70% politics and 30% economics

Why Two Names?

  • β (beta) and γ (gamma) are commonly used in explanations
  • φ (phi) and θ (theta) are often used in technical papers and code
  • They mean the same thing - just different notation conventions!
  • You’ll see phi and theta in our code, but they’re the same as β and γ

Let’s look at these values in our model:

# Extract top terms per topic
top_terms <- tidy(lda_model, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 15) %>%
  ungroup()


# Visualize top terms
ggplot(top_terms) +
  geom_col(aes(beta, reorder_within(term, beta, topic), fill = factor(topic))) +
  facet_wrap(~topic, scales = "free_y") +
  scale_y_reordered() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "Top Terms by Topic",
    x = "Beta (Topic-Term Probability)",
    y = ""
  )

3.7 Document-Topic Distribution

Let’s see how articles are distributed across topics:

# Extract document-topic probabilities
doc_topics <- tidy(lda_model, matrix = "gamma") %>%
  mutate(title = fake_news$title[as.numeric(document)])

# Plot distribution
ggplot(doc_topics) +
  geom_histogram(aes(gamma, fill = factor(topic)), bins = 30) +
  facet_wrap(~topic) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "Distribution of Topic Probabilities",
    x = "Gamma (Document-Topic Probability)",
    y = "Count",
    fill = "Topic"
  )

3.8 Interactive Visualization with LDAvis

# Prepare data for LDAvis

dtm_matrix <- as.matrix(news_dtm)

# Get document lengths and term frequencies
doc_lengths <- rowSums(dtm_matrix)
term_freqs <- colSums(dtm_matrix)

# Create the JSON object
json <- createJSON(
  phi = posterior(lda_model)$terms,      # topic-term distributions
  theta = posterior(lda_model)$topics,   # document-topic distributions
  doc.length = doc_lengths,              # document lengths
  vocab = colnames(dtm_matrix),          # terms
  term.frequency = term_freqs            # overall term frequencies
)

# Create interactive visualization
serVis(json)
Loading required namespace: servr
Understanding the Visualization 🎨

LDAvis shows us:

  1. Left side: Bubbles representing topics
    • Size = how common the topic is
    • Distance = how similar topics are to each other
  2. Right side: Most relevant words for selected topic
    • Red bars = word frequency in topic
    • Blue bars = word frequency in entire dataset
Understanding LDAvis Lambda (λ) 🎚️

The lambda slider in LDAvis helps you explore word relevance in topics:

  1. What is Lambda (λ)?
    • A number between 0 and 1
    • Controls how we rank words in topics
    • Helps balance between two ways of measuring word importance
  2. How Lambda Works
    • λ = 1.0: Shows common words in the topic
    • λ = 0.0: Shows unique/exclusive words to the topic
    • λ = 0.6: Default balance between common and unique (Sievert & Shirley, 2014)
  3. Think of it Like This
    • High λ (near 1): “What words appear most in this topic?”
    • Low λ (near 0): “What words are special to this topic?”
    • Middle λ: “What words are both common AND special?”
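
Behind the slider is a simple relevance formula from Sievert & Shirley (2014): relevance = λ · log p(word | topic) + (1 − λ) · log [ p(word | topic) / p(word) ]. Below is a minimal sketch of computing it ourselves; the variable names are ours, not part of the LDAvis package.

phi <- posterior(lda_model)$terms                        # topic-word probabilities
term_totals <- colSums(as.matrix(news_dtm))
p_word <- term_totals[colnames(phi)] / sum(term_totals)  # overall word probabilities, aligned to phi

lambda <- 0.6                                            # the LDAvis default
relevance <- lambda * log(phi) + (1 - lambda) * log(sweep(phi, 2, p_word, "/"))

# The 10 most "relevant" words for topic 1 at this lambda
sort(relevance[1, ], decreasing = TRUE)[1:10]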

3.9 Finding the Most Representative Texts for Each Topic

Let’s find which articles best represent each topic we discovered:

# Get document-topic probabilities
doc_topics <- posterior(lda_model)$topics

# Create dataframe with cleaned column names
doc_topics_df <- as.data.frame(doc_topics) %>%
  # Rename columns to be more descriptive
  rename_with(~paste0("Topic_", .), everything()) %>%
  # Recover the original document IDs stored as row names of the posterior matrix
  # (some documents were dropped when rare terms were filtered out)
  mutate(doc_id = as.numeric(rownames(doc_topics))) %>%
  left_join(
    fake_news %>% 
      mutate(doc_id = row_number()) %>%
      select(doc_id, title, text, type),
    by = "doc_id"
  )

# Find representative documents
representative_docs <- doc_topics_df %>%
  pivot_longer(
    cols = starts_with("Topic_"),
    names_to = "topic",
    values_to = "probability"
  ) %>%
  group_by(topic) %>%
  slice_max(probability, n = 3) %>%
  arrange(topic, desc(probability))

# Display results with more context
representative_docs %>%
  select(topic, title, type, probability) %>%
  arrange(topic, desc(probability)) %>%
  knitr::kable(
    caption = "Most Representative Articles for Each Topic",
    digits = 3
  )
Most Representative Articles for Each Topic
| topic   | title                                                                                                                                          | type       | probability |
|---------|------------------------------------------------------------------------------------------------------------------------------------------------|------------|-------------|
| Topic_1 | Mike Pences Plane Slides off Runway at LaGuardia                                                                                                | bs         | 0.950       |
| Topic_1 | Oregon Standoff Leaders Acquitted For Malheur Wildlife Refuge Takeover                                                                          | bs         | 0.932       |
| Topic_1 | ‘COLOR REV’ AGIT PROP: George Soros MoveOn Agitators March on America – as Billionaire Instigator Sued                                          | conspiracy | 0.900       |
| Topic_2 | DONALD J. TRUMP - Officially Lays Out His NEW DEAL FOR BLACK AMERICA.                                                                            | bs         | 0.959       |
| Topic_2 | Tree-Shaped Vertical Farm Can Grow 24 Acres Of Crops                                                                                             | conspiracy | 0.937       |
| Topic_2 | The U.S./Turkey Plan For “Seizing, Holding, And Occupying” Syrian Territory In Raqqa                                                             | conspiracy | 0.922       |
| Topic_3 | Julian Assange Surfaces - Speaks at Argentina Conference Via Telephone                                                                           | bs         | 0.955       |
| Topic_3 | Smart Meter Case Testimony Before the Pennsylvania Public Utility Commission: What No One Wants to Acknowledge About EMF Damage (Part 4 of 4)    | conspiracy | 0.879       |
| Topic_3 | New NASA Footage Films UFO Flying Past                                                                                                           | bs         | 0.848       |
| Topic_4 | NA                                                                                                                                               | bs         | 0.974       |
| Topic_4 | Hell Comes to Frogtown: Alt Right and Triumph of Transhumanism                                                                                   | conspiracy | 0.974       |
| Topic_4 | NA                                                                                                                                               | bs         | 0.970       |
Analysis Tips 🔍

When examining representative texts:

  1. Read the full text of top documents
  2. Look for common themes
  3. Compare with topic words
  4. Consider why these particular texts are representative
  5. Note any surprising patterns
Model Quality Checks 🔍

Check Preprocessing Issues

  • Look for remaining stopwords in topics
  • Check for non-meaningful n-grams
  • Verify document lengths after filtering
  • Examine if important domain terms were removed

Evaluate Topic Coherence

  • Do words in each topic make sense together?
  • Are there topics that seem too general?
  • Are there redundant topics?
  • Check if topics are well-separated
  • Evaluate quantitative metrics (e.g., coherence scores), keeping in mind that they do not always align with human judgments of interpretability

Document-Topic Distribution

  • Are there documents with no clear topic assignment?
  • Are some topics dominating too many documents?
  • Check if document types cluster as expected
  • Look for outlier documents

Common Red Flags

  • Topics dominated by numbers or punctuation
  • Very similar word distributions across topics
  • Many documents with equal probability across all topics
  • Topics that don’t align with domain knowledge

Potential Solutions

  • Adjust preprocessing (stopwords, n-grams)
  • Try different numbers of topics (k)
  • Modify model parameters (α, β)
  • Consider document length normalization
  • Remove very short or very long documents
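
Two of these checks can be scripted quickly. A minimal sketch follows; the 0.4 threshold is an arbitrary choice for illustration.

theta <- posterior(lda_model)$topics   # document-topic probabilities
phi   <- posterior(lda_model)$terms    # topic-word probabilities

# Documents with no clear topic: maximum topic share below 0.4
sum(apply(theta, 1, max) < 0.4)

# Very similar topics: pairwise correlations between topic-word distributions
# (large off-diagonal values suggest redundant topics)
round(cor(t(phi)), 2)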

3.10 Hands-On Coding 💻

👥 Group Exercise: Finding the Optimal Topic Model

Setup

  1. Form groups of 3-4 students

  2. Each group member chooses a different k value:

    • Student 1: k = 3
    • Student 2: k = 5
    • Student 3: k = 7
    • Student 4: k = 9 (if fourth member)
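
A starter sketch each group could adapt is shown below; fitting four models takes a little while, and the seed and iteration settings mirror the ones used earlier in the chapter.

k_values <- c(3, 5, 7, 9)

models <- lapply(k_values, function(k) {
  LDA(news_dtm, k = k, method = "Gibbs",
      control = list(seed = 1234, iter = 2000))
})
names(models) <- paste0("k_", k_values)

# Top 10 terms per topic for each model, for side-by-side comparison
lapply(models, terms, 10)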

Take notes on:

  1. Topic coherence (are words in each topic related?)
  2. Topic separation (how distinct are the topics?)
  3. Topic interpretability (can you name each topic?)
  4. Coverage (are important themes missing?)

Group Discussion

Compare your models using this template:

| k value | Pros | Cons | Notable Topics | Missing Themes |
|---------|------|------|----------------|----------------|
| k = 3   |      |      |                |                |
| k = 5   |      |      |                |                |
| k = 7   |      |      |                |                |
| k = 9   |      |      |                |                |

Discussion points:

  1. 🎯 Which k value provides the most useful insights for fake news analysis?
  2. 📊 How do the topic visualizations change as k increases?
  3. 🤔 What tradeoffs do you notice between:
    • Simplicity vs. detail?
    • Coherence vs. coverage?
    • Interpretability vs. granularity?

Group Recommendation

Prepare a brief recommendation:

  1. Which k value would you choose?
  2. What are your top 3 reasons?
  3. What are the limitations of your choice?

Class Share

Each group shares their recommended k value and reasoning with the class. Compare how different groups approached the decision!

4 Learning Check 🏁

5 Conclusion

Key Takeaways

In this chapter, we’ve covered:

  • Understanding the basics of topic modeling and LDA
  • Implementing topic models with R
  • Visualizing topics using LDAvis
  • Interpreting and evaluating topic modeling results
  • Applying topic modeling to fake news analysis

Our analysis of the fake news dataset demonstrates how topic modeling can reveal hidden patterns in text collections. Through both computational analysis and human interpretation, we’ve explored how different types of misinformation cluster into topics. The interactive visualizations allow us to explore these patterns dynamically, highlighting both the strengths and limitations of topic modeling tools.

Remember that topic models are exploratory tools that complement, rather than replace, careful reading and analysis. The patterns we’ve discovered should spark questions and encourage deeper investigation of the texts, rather than being treated as definitive conclusions.

Workflow: Topic Discovery → Pattern Analysis → Interactive Visualization → Interpretation