§5 Spinning the Reels 🎰

Introduction to Tidyverse

Welcome to the exciting world of tidyverse! In this chapter, we’ll build on our knowledge of R by exploring the tidyverse, a collection of R packages designed for data science. We’ll create a virtual slot machine to demonstrate the power and simplicity of tidyverse functions.

Learning Objectives
  • 📦 Understand the basics of tidyverse and its core packages
  • 🔄 Learn to manipulate data with dplyr functions
  • 📊 Visualize data using ggplot2
  • 🎰 Build a virtual slot machine using tidyverse functions

1 Introduction to Tidyverse 📦

Tidyverse is a collection of R packages that work together harmoniously for data manipulation, exploration, and visualization.

Tidyverse vs. Base R?

You might wonder why we’re learning Tidyverse when R already has built-in functions (known as Base R). Here’s why:

  1. Readability: Tidyverse code is often easier to read and understand, especially for beginners. It uses a consistent style and vocabulary across its packages.

  2. Workflow: Tidyverse functions work well together, creating a smooth “pipeline” for data analysis. This makes it easier to perform complex operations step-by-step.

  3. Modern approach: Tidyverse incorporates more recent developments in R programming, addressing some limitations of Base R.

  4. Consistency: Tidyverse functions behave predictably, reducing unexpected outcomes that sometimes occur with Base R functions.

  5. Community support: Tidyverse has a large, active community, which means more resources, tutorials, and help are available online.

While Base R is still important and powerful, Tidyverse provides a more accessible entry point for beginners and an efficient toolkit for data analysis tasks common in Digital Humanities.

Remember, you’re not choosing one over the other permanently. As you grow more comfortable with R, you’ll likely use both Tidyverse and Base R, selecting the best tool for each specific task.

Let’s start by loading the tidyverse:

# install.packages('tidyverse')
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The core tidyverse includes packages like dplyr (for data manipulation) and ggplot2 (for data visualization).

Core Tidyverse Packages

The core tidyverse includes several packages, each with a specific purpose:

  1. dplyr: for data manipulation (like sorting, filtering, and summarizing data)
  2. ggplot2: for data visualization (creating graphs and charts)
  3. tidyr: for tidying data (organizing data into a consistent format)
  4. readr: for reading rectangular data (importing data from files)
  5. purrr: for functional programming (applying functions to data)
  6. tibble: for modern data frames (an enhanced version of R’s traditional data structure)

For routine data analysis tasks, we mainly use dplyr and ggplot2, so these two packages are the focus of this chapter.

2 Data Manipulation with dplyr 🔄

Let’s look at a mock book dataset again, but this time using dplyr functions:

# Create the books dataset
books <- tibble(
  title = c("1984", "Pride and Prejudice", "The Great Gatsby", "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year = c(1949, 1813, 1925, 1960, 1951),
  genre = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages = c(328, 432, 180, 281, 234)
)

# View the books
books
# A tibble: 5 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 1984                   Orwell      1949 Dystopian       328
2 Pride and Prejudice    Austen      1813 Romance         432
3 The Great Gatsby       Fitzgerald  1925 Modernist       180
4 To Kill a Mockingbird  Lee         1960 Coming-of-age   281
5 The Catcher in the Rye Salinger    1951 Coming-of-age   234

Tibbles vs. Data Frames

You might notice we used tibble() instead of data.frame(). Tibbles are modern data frames that are part of the tidyverse. They have some advantages over traditional data frames:

  1. They have a cleaner print method
  2. They don’t change column types (for example, strings are never converted to factors)
  3. They don’t create row names
  4. They warn you when you try to access a column that doesn’t exist

For most purposes you can use them interchangeably with data frames, but tibbles are often easier and more intuitive to work with, so we recommend the Tidyverse versions over their Base R counterparts.
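
Two of these differences can be seen in a small sketch. The names df and tb are hypothetical and only hold the first two books from the dataset above; the behaviour shown is that of the `$` and `[` operators.

```r
library(tibble)

# The same toy data as a base data frame and as a tibble
df <- data.frame(title = c("1984", "Pride and Prejudice"), pages = c(328, 432))
tb <- tibble(title = c("1984", "Pride and Prejudice"), pages = c(328, 432))

# Base R partially matches column names with $, so a typo can go unnoticed
df$pag

# A tibble refuses to guess: it warns and returns NULL instead.
# tryCatch() is used here only to capture the warning text as a string.
tryCatch(tb$pag, warning = function(w) conditionMessage(w))

# Selecting one column with [ keeps a tibble, but drops a data frame to a vector
class(df[, "title"])
class(tb[, "title"])
```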

Many other Base R (built-in) functions have Tidyverse alternatives as well, for example read_csv() for read.csv() and arrange() for order().

Now, let’s explore some key dplyr functions:

Core Functions in dplyr

dplyr provides a set of powerful functions for manipulating data:

  1. filter(): Subset rows based on conditions. This function allows you to keep only the data rows that meet specific criteria, like selecting books published after a certain year.

  2. select(): Choose specific columns. Use this when you want to focus on particular variables in your dataset, similar to picking certain columns in a spreadsheet.

  3. mutate(): Add new variables or modify existing ones. This function lets you create new columns based on calculations from existing data, or change values in current columns.

  4. arrange(): Sort rows. When you need to order your data based on one or more variables, such as sorting books by publication date, use this function.

  5. summarise(): Compute summary statistics. This function is useful for calculating things like averages, totals, or counts across your entire dataset or within groups.

  6. group_by(): Group data for operations. Use this to divide your data into groups before applying other functions, allowing you to perform calculations within each group separately.

  7. Join functions such as left_join() and inner_join(): Combine data from multiple tables. dplyr has no function called join(); instead, a family of *_join() functions merges tables or datasets based on common variables.

These functions are designed to work together, allowing you to perform complex data manipulations step by step. As you practice, you’ll find yourself combining these functions to answer increasingly sophisticated questions about your data.
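
Joins are the one item in this list that the rest of the chapter does not demonstrate, so here is a minimal sketch. The authors lookup table and its birth_year column are hypothetical additions for illustration; only three of the books are used, and Fitzgerald is left out of the lookup on purpose.

```r
library(dplyr)

books_small <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby"),
  author = c("Orwell", "Austen", "Fitzgerald")
)

# Hypothetical lookup table: one row per author
authors <- tibble(
  author     = c("Orwell", "Austen"),
  birth_year = c(1903, 1775)
)

# left_join() keeps every row of books_small and adds matching columns
# from authors; rows without a match get NA
books_with_authors <- books_small %>%
  left_join(authors, by = "author")

# inner_join() keeps only the rows that have a match in both tables
books_matched <- books_small %>%
  inner_join(authors, by = "author")

books_with_authors
books_matched
```

Which join to use depends on what should happen to unmatched rows: keep them with missing values (left_join()) or drop them (inner_join()).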

Cheat Sheet

For quick reference, here’s a handy cheat sheet summarizing the key dplyr functions:

[Image: dplyr cheat sheet]

Most tidyverse packages have corresponding cheat sheets. Search online for the package name plus “cheat sheet” to download them yourself.

2.1 filter(): Subset Rows

# Find all books published after 1900
books %>%
  filter(year > 1900)
# A tibble: 4 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 1984                   Orwell      1949 Dystopian       328
2 The Great Gatsby       Fitzgerald  1925 Modernist       180
3 To Kill a Mockingbird  Lee         1960 Coming-of-age   281
4 The Catcher in the Rye Salinger    1951 Coming-of-age   234

The Pipe Operator %>%

The %>% operator is called the “pipe” operator. It’s a fundamental concept in the tidyverse that greatly enhances code readability and workflow. Here’s how it works:

  1. Function chaining: The pipe takes the output of one function and passes it as the first argument to the next function. This allows us to chain multiple operations together in a logical sequence.

  2. Left-to-right reading: Instead of nesting functions within each other, which can be hard to read, the pipe allows us to read our code from left to right, much like we read English.

  3. Improved readability: By using the pipe, we can break down complex operations into a series of smaller, more manageable steps.

For example, let’s compare these two equivalent operations:

Without pipe:

filter(books, year > 1900)
# A tibble: 4 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 1984                   Orwell      1949 Dystopian       328
2 The Great Gatsby       Fitzgerald  1925 Modernist       180
3 To Kill a Mockingbird  Lee         1960 Coming-of-age   281
4 The Catcher in the Rye Salinger    1951 Coming-of-age   234

With pipe:

books %>% filter(year > 1900)
# A tibble: 4 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 1984                   Orwell      1949 Dystopian       328
2 The Great Gatsby       Fitzgerald  1925 Modernist       180
3 To Kill a Mockingbird  Lee         1960 Coming-of-age   281
4 The Catcher in the Rye Salinger    1951 Coming-of-age   234

The piped version can be read as “Take the books data, then filter it to keep only books published after 1900”.

For more complex operations, the benefits become even clearer, which we will see in a moment.
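
To make the contrast with nested calls concrete, here is a small sketch on a cut-down, three-row copy of the books data (an assumption made to keep the block short). Both versions perform the same three steps.

```r
library(dplyr)

books_small <- tibble(
  title = c("1984", "Pride and Prejudice", "The Great Gatsby"),
  year  = c(1949, 1813, 1925)
)

# Nested calls: the first step is buried in the middle, so you read inside out
nested <- arrange(select(filter(books_small, year > 1900), title, year), year)

# Piped calls: one step per line, in the order the steps happen
piped <- books_small %>%
  filter(year > 1900) %>%
  select(title, year) %>%
  arrange(year)

# Both approaches give the same table
identical(nested, piped)
```

R 4.1 and later also include a native pipe, |>, which behaves the same way in simple chains like this one.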

2.2 select(): Choose Columns

# Select only title and author columns
books %>%
  select(title, author)
# A tibble: 5 × 2
  title                  author    
  <chr>                  <chr>     
1 1984                   Orwell    
2 Pride and Prejudice    Austen    
3 The Great Gatsby       Fitzgerald
4 To Kill a Mockingbird  Lee       
5 The Catcher in the Rye Salinger  

2.3 mutate(): Add New Variables

# Add a new column for the book's age
books %>%
  mutate(age = 2024 - year)
# A tibble: 5 × 6
  title                  author      year genre         pages   age
  <chr>                  <chr>      <dbl> <chr>         <dbl> <dbl>
1 1984                   Orwell      1949 Dystopian       328    75
2 Pride and Prejudice    Austen      1813 Romance         432   211
3 The Great Gatsby       Fitzgerald  1925 Modernist       180    99
4 To Kill a Mockingbird  Lee         1960 Coming-of-age   281    64
5 The Catcher in the Rye Salinger    1951 Coming-of-age   234    73

2.4 arrange(): Sort Rows

# Sort books by year, oldest first
books %>%
  arrange(year)
# A tibble: 5 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 Pride and Prejudice    Austen      1813 Romance         432
2 The Great Gatsby       Fitzgerald  1925 Modernist       180
3 1984                   Orwell      1949 Dystopian       328
4 The Catcher in the Rye Salinger    1951 Coming-of-age   234
5 To Kill a Mockingbird  Lee         1960 Coming-of-age   281

Comparing arrange() and order()

In Tidyverse, we use arrange() to sort data frames, which is often more intuitive and easier to use with multiple columns. In Base R, you typically use order() within square brackets or sort() for vectors.

For example:

Tidyverse: data %>% arrange(column_name)

Base R: data[order(data$column_name), ]

The Tidyverse method is more readable, especially when sorting by multiple columns or in descending order.
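
As an illustration of the multi-column, descending case, the sketch below sorts a four-row copy of the books data (cut down here for brevity) by genre and then by year, newest first, in both styles.

```r
library(dplyr)

books_small <- tibble(
  title = c("1984", "The Great Gatsby", "To Kill a Mockingbird", "The Catcher in the Rye"),
  year  = c(1949, 1925, 1960, 1951),
  genre = c("Dystopian", "Modernist", "Coming-of-age", "Coming-of-age")
)

# Tidyverse: genre A to Z, then newest first within each genre
tidy_sorted <- books_small %>%
  arrange(genre, desc(year))

# Base R: the same ordering, with a minus sign marking the descending column
base_sorted <- books_small[order(books_small$genre, -books_small$year), ]

tidy_sorted
base_sorted
```

The minus-sign trick in order() only works for numeric columns, whereas desc() can be applied to any column type.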

2.5 summarise(): Summarize Data

# Calculate average number of pages
books %>%
  summarise(avg_pages = mean(pages))
# A tibble: 1 × 1
  avg_pages
      <dbl>
1       291

2.6 group_by(): Group Data for Operations

# Average pages by genre
books %>%
  group_by(genre) %>%
  summarise(avg_pages = mean(pages))
# A tibble: 4 × 2
  genre         avg_pages
  <chr>             <dbl>
1 Coming-of-age      258.
2 Dystopian          328 
3 Modernist          180 
4 Romance            432 
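
summarise() can compute several statistics at once. The sketch below, which uses only the genre and pages columns of the books data, adds a count and a total next to the average; the column names n_books and total_pages are our own choices.

```r
library(dplyr)

books_small <- tibble(
  genre = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages = c(328, 432, 180, 281, 234)
)

genre_summary <- books_small %>%
  group_by(genre) %>%
  summarise(
    n_books     = n(),          # number of rows in each group
    avg_pages   = mean(pages),  # average within the group
    total_pages = sum(pages)    # total within the group
  )

genre_summary
```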

2.7 Chaining Multiple Actions

One of the key advantages of Tidyverse is the ability to chain multiple actions together using the pipe operator (%>%). Let’s compare how we can perform a series of data manipulations using both Tidyverse and Base R.

Let’s say we want to:

  1. Filter books published after 1900
  2. Select only the title, author, and year columns
  3. Sort the results by year
  4. Get the first 3 entries

2.7.1 Tidyverse Approach

books %>%
  filter(year > 1900) %>%
  select(title, author, year) %>%
  arrange(year) %>%
  head(3)
# A tibble: 3 × 3
  title                  author      year
  <chr>                  <chr>      <dbl>
1 The Great Gatsby       Fitzgerald  1925
2 1984                   Orwell      1949
3 The Catcher in the Rye Salinger    1951

In this Tidyverse approach, we can read the code from top to bottom, following the logical flow of operations. Each step is clearly defined, and the pipe operator (%>%) passes the result of each operation to the next.

2.7.2 Base R Approach

# Filter books published after 1900
filtered_books <- books[books$year > 1900, ]
# Select only title, author, and year columns
selected_books <- filtered_books[, c("title", "author", "year")]
# Sort by year
sorted_books <- selected_books[order(selected_books$year), ]
# Get the first 3 entries
result <- head(sorted_books, 3)
# View the result
result
# A tibble: 3 × 3
  title                  author      year
  <chr>                  <chr>      <dbl>
1 The Great Gatsby       Fitzgerald  1925
2 1984                   Orwell      1949
3 The Catcher in the Rye Salinger    1951

In the Base R approach, we need to create intermediate variables at each step. The code reads from top to bottom, with each line representing a separate operation.

2.9 Hands-On Coding 💻

Try the following exercises:

  1. Use filter() to find all books written by Austen or Orwell.
  2. Use arrange() to sort the books by number of pages, from longest to shortest.
  3. Use mutate() to add a new column called words, assuming an average of 250 words per page.
  4. Use group_by() and summarise() to find the earliest publication year for each genre.

2.9.1 Exercise 1: Filter books by Austen or Orwell

2.9.2 Exercise 2: Sort books by pages, longest to shortest

2.9.3 Exercise 3: Add words column

2.9.4 Exercise 4: Find earliest publication year by genre
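
One possible set of answers to the four exercises in this section is sketched below. Try them yourself first; the result names ex1 to ex4 and the column name earliest_year are our own, and other solutions are equally valid.

```r
library(dplyr)

# The books dataset from the start of section 2
books <- tibble(
  title = c("1984", "Pride and Prejudice", "The Great Gatsby",
            "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year = c(1949, 1813, 1925, 1960, 1951),
  genre = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages = c(328, 432, 180, 281, 234)
)

# Exercise 1: books written by Austen or Orwell
ex1 <- books %>%
  filter(author %in% c("Austen", "Orwell"))

# Exercise 2: longest book first
ex2 <- books %>%
  arrange(desc(pages))

# Exercise 3: estimated word count at 250 words per page
ex3 <- books %>%
  mutate(words = pages * 250)

# Exercise 4: earliest publication year in each genre
ex4 <- books %>%
  group_by(genre) %>%
  summarise(earliest_year = min(year))
```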

3 Data Visualization with ggplot2 πŸ“Š

ggplot2 is a powerful package for creating beautiful and informative visualizations, especially useful for exploring data.

3.1 Expand the Books Dataset

Let’s expand the books dataset to include some more variables for visualization purposes:

novels <- books %>%
  mutate(
    words = pages*250, # Estimating word count based on pages
    characters = c(30, 25, 15, 20, 10), # Number of named characters (estimated)
    rating = c(4.2, 4.5, 4.0, 4.3, 4.1), # Modern reader ratings (out of 5)
    male_chars = c(20, 10, 10, 12, 7), # Number of male characters (estimated)
    female_chars = c(10, 15, 5, 8, 3) # Number of female characters (estimated)
  )
# View the dataset
novels
# A tibble: 5 × 10
  title             author  year genre pages  words characters rating male_chars
  <chr>             <chr>  <dbl> <chr> <dbl>  <dbl>      <dbl>  <dbl>      <dbl>
1 1984              Orwell  1949 Dyst…   328  82000         30    4.2         20
2 Pride and Prejud… Austen  1813 Roma…   432 108000         25    4.5         10
3 The Great Gatsby  Fitzg…  1925 Mode…   180  45000         15    4           10
4 To Kill a Mockin… Lee     1960 Comi…   281  70250         20    4.3         12
5 The Catcher in t… Salin…  1951 Comi…   234  58500         10    4.1          7
# ℹ 1 more variable: female_chars <dbl>

This dataset gives us a rich set of variables to explore, including publication year, word count, genre, character gender representation, and modern reader ratings.

3.2 The Basic Structure of a ggplot

Every ggplot2 plot starts with the ggplot() function and uses + to add layers. The basic structure is:

ggplot(data = <DATA>) +
  GEOM_FUNCTION(mapping = aes(<MAPPINGS>))

Let’s create a simple scatter plot of publication year vs. word count (thousands):

ggplot(data = novels) +
  geom_point(mapping = aes(x = year, y = words / 1000))

Key Concepts
  • ggplot(data = novels): Initializes the plot with our dataset
  • geom_point(): Adds a layer of points (for a scatter plot)
  • aes(x = year, y = words / 1000): Maps variables to aesthetic properties (here, the x and y positions, with the word count expressed in thousands)

3.3 Aesthetic Mappings

Aesthetics are visual properties of the objects in your plot. Common aesthetics include:

  • x and y positions
  • color
  • size
  • shape

Let’s map the rating to the color of the points:

ggplot(data = novels) +
  geom_point(mapping = aes(x = year, y = words, color = rating))

Alternatively, we can also use the size of the points to indicate the rating:

ggplot(data = novels) +
  geom_point(mapping = aes(x = year, y = words, size = rating))
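
A common point of confusion is the difference between mapping an aesthetic to a variable (inside aes()) and setting it to a fixed value (outside aes()). The sketch below assumes a cut-down copy of the novels data, stored as a plain data frame to keep the block self-contained; the color "steelblue" and size 3 are arbitrary choices.

```r
library(ggplot2)

novels_small <- data.frame(
  year   = c(1949, 1813, 1925, 1960, 1951),
  words  = c(82000, 108000, 45000, 70250, 58500),
  rating = c(4.2, 4.5, 4.0, 4.3, 4.1)
)

# Mapped: color varies with the rating column, and ggplot2 adds a legend
p_mapped <- ggplot(data = novels_small) +
  geom_point(mapping = aes(x = year, y = words, color = rating))

# Set: every point gets the same fixed color and size, with no legend
p_fixed <- ggplot(data = novels_small) +
  geom_point(mapping = aes(x = year, y = words), color = "steelblue", size = 3)

p_mapped
p_fixed
```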

3.4 Adding Labels with labs()

We can improve our plot by adding informative labels:

ggplot(data = novels) +
  geom_point(mapping = aes(x = year, y = words / 1000, size = rating)) +
  labs(title = "Classic Novels: Publication Year vs. Word Count",
       x = "Year of Publication",
       y = "Number of Words (thousands)",
       size = "Rating")

3.5 Geometric Objects (geoms)

Different geom functions create different types of plots. Let’s create a bar plot of character counts:

ggplot(data = novels) +
  geom_col(mapping = aes(x = title, y = characters)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Common geoms
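
A few geoms come up again and again: geom_point() for scatter plots, geom_line() for lines, geom_col() and geom_bar() for bars, geom_histogram() for distributions, and geom_boxplot() for comparing groups. The sketch below assumes a cut-down copy of the novels data and draws it with several of these geoms. Placing the mapping inside ggplot() lets every layer share it.

```r
library(ggplot2)

novels_small <- data.frame(
  title = c("1984", "Pride and Prejudice", "The Great Gatsby",
            "To Kill a Mockingbird", "The Catcher in the Rye"),
  year  = c(1949, 1813, 1925, 1960, 1951),
  pages = c(328, 432, 180, 281, 234)
)

# Two geoms layered: points, plus a text label above each point
p_labelled <- ggplot(data = novels_small, mapping = aes(x = year, y = pages)) +
  geom_point() +
  geom_text(mapping = aes(label = title), vjust = -1, size = 3)

# A line connecting the books in order of publication year, with points on top
p_line <- ggplot(data = novels_small, mapping = aes(x = year, y = pages)) +
  geom_line() +
  geom_point()

# Horizontal bars: geom_col() with the title on the y axis
p_bars <- ggplot(data = novels_small) +
  geom_col(mapping = aes(x = pages, y = title))

p_labelled
p_line
p_bars
```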

R Graph Gallery: Inspiration for Your Visualizations

For more inspiration and examples of what’s possible with ggplot2, check out the R Graph Gallery. This fantastic resource offers:

  • A wide variety of chart types and styles
  • Reproducible code for each graph
  • Explanations and use cases for different visualizations
  • Advanced techniques and customizations

Exploring the R Graph Gallery can help you discover new ways to visualize your data and improve your ggplot2 skills!

4 Building a Simple Slot Machine 🎰

Now, let’s use our tidyverse skills to create and analyze a simple virtual slot machine!

What is a slot machine and how does it work?

A slot machine is a game you might see in a casino. Here’s how it works:

  1. Look: It usually has three or more spinning wheels with pictures on them.

  2. Symbols: In our game, we use fruit (🍒, 🍋, 🍊, 🍇) and other symbols (🔔, 💎).

  3. How to play: You press a button to make the wheels spin.

  4. Winning: You win if the pictures line up in a certain way when the wheels stop. For example, you win if all three symbols are the same (e.g., 🍋🍋🍋).

  5. Random: Each spin is random - you can’t predict what will come up. But in real casinos, not all outcomes have the same chance. Casinos set the odds to make sure they make money over time.

Slot machines are often used to teach math and probability. In our case, we’re using it to learn about data analysis. It’s a fun way to practice with numbers and see patterns.

In real life, it’s wise to stay away from slot machines. The odds are set so that the casino always wins in the long run. Our virtual slot machine lets us explore data without any risk!

First, let’s create our slot machine. The code has been provided for you. Take a moment to read through the script below and see if you can understand what each part does. Don’t worry if you don’t understand everything - we’ll break it down together afterwards.

library(tidyverse)

# Define slot machine symbols
symbols <- c("🍒", "🍋", "🍊", "🍇", "🔔", "💎")

# Function to play the slot machine
play_slot_machine <- function(n_plays = 10) {
  tibble(
    play = 1:n_plays,
    symbol1 = sample(symbols, n_plays, replace = TRUE),
    symbol2 = sample(symbols, n_plays, replace = TRUE),
    symbol3 = sample(symbols, n_plays, replace = TRUE)
  ) %>%
  mutate(
    win = symbol1 == symbol2 & symbol2 == symbol3,
    result = if_else(win, "💰", "😒")
  )
}

# Simulate 100 plays
results <- play_slot_machine(100)

# Display the first few results
head(results)
# A tibble: 6 × 6
   play symbol1 symbol2 symbol3 win   result
  <int> <chr>   <chr>   <chr>   <lgl> <chr> 
1     1 🍋      🍊      🍒      FALSE 😒    
2     2 🍒      🍇      🍋      FALSE 😒    
3     3 🍒      🍊      🔔      FALSE 😒    
4     4 🔔      💎      🍒      FALSE 😒    
5     5 🍇      🍊      💎      FALSE 😒    
6     6 💎      🔔      🍇      FALSE 😒    
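
The script rests on three ideas: sample() draws each reel independently, the comparison for win is carried out row by row, and if_else() turns each TRUE or FALSE into a label. The sketch below replays those steps outside the function for five plays; the seed value and the names reel1, reel2, reel3 and p_win are our own. It also works out the theoretical chance of a win, assuming all six symbols are equally likely on every reel.

```r
library(dplyr)

symbols <- c("🍒", "🍋", "🍊", "🍇", "🔔", "💎")

set.seed(42)  # arbitrary seed, used only so the sketch is reproducible

# sample() with replace = TRUE draws one symbol per play for each reel
reel1 <- sample(symbols, 5, replace = TRUE)
reel2 <- sample(symbols, 5, replace = TRUE)
reel3 <- sample(symbols, 5, replace = TRUE)

# == and & work element by element, giving one TRUE/FALSE per play
win <- reel1 == reel2 & reel2 == reel3

# if_else() picks a label for each play based on win
result <- if_else(win, "💰", "😒")

# Chance of three matching symbols: 6 winning triples out of 6^3 possible ones
p_win <- length(symbols) / length(symbols)^3
p_win
```

Since 6 / 216 is 1 in 36, or about 2.8%, a run of 100 plays should produce only around three wins on average, which is why the six rows printed above contain none.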

Now that we understand how our slot machine works, let’s move on to analyzing its results!

4.1 Hands-On Coding πŸ’»

Let’s explore our slot machine results with some exercises. Remember to use tidyverse functions like filter(), summarise(), group_by(), and ggplot().

4.1.1 Exercise 1: Summarize the Results

Calculate the total number of plays, number of wins, and the win percentage.

4.1.2 Exercise 2: Find Winning Combinations

Create a new data frame showing only the winning plays and their symbol combinations.

4.1.3 Exercise 3: Visualize Symbol Distribution

Create a bar plot showing the distribution of symbols in the first reel (symbol1 column).

Emojis often can’t be rendered directly in plots. While there are packages like emojifont or ggtext that can handle emoji rendering, for simplicity, we’ll use a text representation of the symbols.

results <- results %>%
  mutate(symbol1_text = case_when(
    symbol1 == "🍒" ~ "Cherry",
    symbol1 == "🍋" ~ "Lemon",
    symbol1 == "🍊" ~ "Orange",
    symbol1 == "🍇" ~ "Grapes",
    symbol1 == "🔔" ~ "Bell",
    symbol1 == "💎" ~ "Diamond"
  ))
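
One possible way through the three exercises is sketched below. To stay self-contained it rebuilds a results table with the same columns as play_slot_machine(100) produces; the seed, the result names, and the match()-based lookup (a compact alternative to the case_when() call above) are our own choices.

```r
library(dplyr)
library(ggplot2)

symbols <- c("🍒", "🍋", "🍊", "🍇", "🔔", "💎")
symbol_labels <- c("Cherry", "Lemon", "Orange", "Grapes", "Bell", "Diamond")

set.seed(1)  # arbitrary seed, used only so the sketch is reproducible
results <- tibble(
  play = 1:100,
  symbol1 = sample(symbols, 100, replace = TRUE),
  symbol2 = sample(symbols, 100, replace = TRUE),
  symbol3 = sample(symbols, 100, replace = TRUE)
) %>%
  mutate(win = symbol1 == symbol2 & symbol2 == symbol3)

# Exercise 1: total plays, number of wins, and win percentage
summary_stats <- results %>%
  summarise(
    total_plays = n(),
    wins = sum(win),                # TRUE counts as 1, FALSE as 0
    win_percentage = mean(win) * 100
  )

# Exercise 2: only the winning plays and their symbols
winning_plays <- results %>%
  filter(win) %>%
  select(play, symbol1, symbol2, symbol3)

# Exercise 3: text labels for the first reel, then a bar plot of their counts
results <- results %>%
  mutate(symbol1_text = symbol_labels[match(symbol1, symbols)])

p_symbols <- ggplot(data = results) +
  geom_bar(mapping = aes(x = symbol1_text)) +
  labs(title = "Symbols on the First Reel", x = "Symbol", y = "Count")

summary_stats
winning_plays
p_symbols
```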

Congratulations! You’ve now practiced using various tidyverse functions to analyze and visualize data from our virtual slot machine. These skills are fundamental in data manipulation and analysis, which are crucial in many digital humanities projects.

Key Takeaways

In this chapter, we’ve covered:

  • The basics of tidyverse and its core packages
  • Data manipulation with dplyr functions
  • Data visualization with ggplot2
  • Applying tidyverse concepts to analyze our books dataset
  • Building a virtual slot machine using tidyverse functions

These skills form an essential foundation for working with data in R using the tidyverse. As we progress in our digital humanities journey, we’ll build upon these concepts to perform more complex data manipulations and analyses.
