── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
§5 Spinning the Reels 🎰
Introduction to Tidyverse
Welcome to the exciting world of tidyverse! In this chapter, we’ll build on our knowledge of R by exploring the tidyverse, a collection of R packages designed for data science. We’ll create a virtual slot machine to demonstrate the power and simplicity of tidyverse functions.
- 📦 Understand the basics of tidyverse and its core packages
- Learn to manipulate data with dplyr functions
- Visualize data using ggplot2
- 🎰 Build a virtual slot machine using tidyverse functions
1 Introduction to Tidyverse 📦
Tidyverse is a collection of R packages that work together harmoniously for data manipulation, exploration, and visualization.
You might wonder why we’re learning Tidyverse when R already has built-in functions (known as Base R). Here’s why:
- Readability: Tidyverse code is often easier to read and understand, especially for beginners. It uses a consistent style and vocabulary across its packages.
- Workflow: Tidyverse functions work well together, creating a smooth “pipeline” for data analysis. This makes it easier to perform complex operations step-by-step.
- Modern approach: Tidyverse incorporates more recent developments in R programming, addressing some limitations of Base R.
- Consistency: Tidyverse functions behave predictably, reducing unexpected outcomes that sometimes occur with Base R functions.
- Community support: Tidyverse has a large, active community, which means more resources, tutorials, and help are available online.
While Base R is still important and powerful, Tidyverse provides a more accessible entry point for beginners and an efficient toolkit for data analysis tasks common in Digital Humanities.
Remember, you’re not choosing one over the other permanently. As you grow more comfortable with R, you’ll likely use both Tidyverse and Base R, selecting the best tool for each specific task.
Let’s start by loading the tidyverse:
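A minimal sketch of the loading step, assuming the tidyverse is already installed (the install line is commented out because it only needs to run once per machine):

```r
# install.packages("tidyverse")  # one-time setup, if the package is missing

# Attach the core tidyverse packages (dplyr, ggplot2, tidyr, and others)
library(tidyverse)
```

Loading prints a startup message that lists the attached packages and any function names that now mask Base R functions.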
The core tidyverse includes several packages, each with a specific purpose:
- dplyr: for data manipulation (like sorting, filtering, and summarizing data)
- ggplot2: for data visualization (creating graphs and charts)
- tidyr: for tidying data (organizing data into a consistent format)
- readr: for reading rectangular data (importing data from files)
- purrr: for functional programming (applying functions to data)
- tibble: for modern data frames (an enhanced version of R’s traditional data structure)
For routine data analysis tasks, we mainly use dplyr and ggplot2, which is what we will focus on in this chapter.
1.1 Learning Check
2 Data Manipulation with dplyr
Let’s look at a mock book dataset again, but this time using dplyr functions:
# Create the books dataset
books <- tibble(
  title = c("1984", "Pride and Prejudice", "The Great Gatsby", "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year = c(1949, 1813, 1925, 1960, 1951),
  genre = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages = c(328, 432, 180, 281, 234)
)
# View the books
books
# A tibble: 5 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 1984                   Orwell      1949 Dystopian       328
2 Pride and Prejudice    Austen      1813 Romance         432
3 The Great Gatsby       Fitzgerald  1925 Modernist       180
4 To Kill a Mockingbird  Lee         1960 Coming-of-age   281
5 The Catcher in the Rye Salinger    1951 Coming-of-age   234
You might notice we used tibble() instead of data.frame(). Tibbles are modern data frames that are part of the tidyverse. They have some advantages over traditional data frames:
- They have a cleaner print method
- They don’t change the types of the columns you supply (for example, strings are never converted to factors)
- They don’t create row names
- They warn you when a column doesn’t exist
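To make the last point concrete, here is a small sketch comparing the two structures. The two-row table is invented for illustration and is not the chapter’s books data:

```r
library(tibble)

# The same one-column table as a base data frame and as a tibble
df <- data.frame(title = c("1984", "Emma"))
tb <- tibble(title = c("1984", "Emma"))

# Base data frames partially match column names with `$`,
# so a shortened or mistyped name can silently return a column
df$ti

# Tibbles never partially match: this returns NULL and raises a warning
tb$ti

# Tibbles also stay tibbles when you pick a single column with [ ]
class(df[, "title"])  # a plain character vector
class(tb[, "title"])  # still a tibble
```

The warning from tb$ti is the useful part: a typo in a column name is reported right away instead of passing unnoticed.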
For most purposes you can use them interchangeably with data frames, but the Tidyverse version is often easier and more intuitive to use, and we recommend it over the Base R version.
Below are more examples of Tidyverse alternatives to Base R (the built-in functions of R):
- Reading data: read_csv() (Tidyverse) vs. read.csv() (Base R)
- Filtering data: filter() (Tidyverse) vs. subset() (Base R)
- Plotting: ggplot() (Tidyverse) vs. plot() (Base R)
- Sorting: arrange() (Tidyverse) vs. order() or sort() (Base R)
Now, let’s explore some key dplyr functions:
dplyr
dplyr provides a set of powerful functions for manipulating data:
- filter(): Subset rows based on conditions. This function allows you to keep only the data rows that meet specific criteria, like selecting books published after a certain year.
- select(): Choose specific columns. Use this when you want to focus on particular variables in your dataset, similar to picking certain columns in a spreadsheet.
- mutate(): Add new variables or modify existing ones. This function lets you create new columns based on calculations from existing data, or change values in current columns.
- arrange(): Sort rows. When you need to order your data based on one or more variables, such as sorting books by publication date, use this function.
- summarise(): Compute summary statistics. This function is useful for calculating things like averages, totals, or counts across your entire dataset or within groups.
- group_by(): Group data for operations. Use this to divide your data into groups before applying other functions, allowing you to perform calculations within each group separately.
- left_join() and the other join functions: Combine data from multiple tables. When your data is split across different tables or datasets, these functions help you merge them together based on common variables.
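Joins get no section of their own in this chapter, so here is a brief sketch. Note that dplyr provides a family of join functions (left_join(), inner_join(), and so on) rather than a single function called join(). The authors lookup table below is hypothetical and exists only for this illustration:

```r
library(tidyverse)

# A two-column excerpt of the chapter's books data
books_small <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby"),
  author = c("Orwell", "Austen", "Fitzgerald")
)

# Hypothetical lookup table, invented for this illustration
authors <- tibble(
  author     = c("Orwell", "Austen"),
  birth_year = c(1903, 1775)
)

# Keep every row of books_small and add the matching columns from authors;
# authors without a match (here Fitzgerald) get NA
books_with_authors <- left_join(books_small, authors, by = "author")
books_with_authors
```

left_join() is a reasonable default because it never drops rows from the first table; inner_join() would instead keep only the rows that have a match in both tables.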
These functions are designed to work together, allowing you to perform complex data manipulations step by step. As you practice, you’ll find yourself combining these functions to answer increasingly sophisticated questions about your data.
Cheat Sheet
For quick reference, here’s a handy cheat sheet summarizing the key dplyr functions:
Most tidyverse packages have corresponding cheat sheets. You can google the package name + cheat sheet to download them yourself.
2.1 filter(): Subset Rows
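A minimal sketch of a filter() call that is consistent with the four-row result printed below, assuming the condition was a publication year after 1900. The books tibble is repeated here so the block runs on its own:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# Keep only the rows where the condition is TRUE
modern_books <- books %>%
  filter(year > 1900)

modern_books
```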
# A tibble: 4 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 1984                   Orwell      1949 Dystopian       328
2 The Great Gatsby       Fitzgerald  1925 Modernist       180
3 To Kill a Mockingbird  Lee         1960 Coming-of-age   281
4 The Catcher in the Rye Salinger    1951 Coming-of-age   234
%>%
The %>% operator is called the “pipe” operator. It’s a fundamental concept in the tidyverse that greatly enhances code readability and workflow. Here’s how it works:
- Function chaining: The pipe takes the output of one function and passes it as the first argument to the next function. This allows us to chain multiple operations together in a logical sequence.
- Left-to-right reading: Instead of nesting functions within each other, which can be hard to read, the pipe allows us to read our code from left to right, much like we read English.
- Improved readability: By using the pipe, we can break down complex operations into a series of smaller, more manageable steps.
For example, let’s compare these two equivalent operations:
Without pipe:
filter(books, year > 1900)
# A tibble: 4 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 1984                   Orwell      1949 Dystopian       328
2 The Great Gatsby       Fitzgerald  1925 Modernist       180
3 To Kill a Mockingbird  Lee         1960 Coming-of-age   281
4 The Catcher in the Rye Salinger    1951 Coming-of-age   234
With pipe:
books %>%
  filter(year > 1900)
# A tibble: 4 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 1984                   Orwell      1949 Dystopian       328
2 The Great Gatsby       Fitzgerald  1925 Modernist       180
3 To Kill a Mockingbird  Lee         1960 Coming-of-age   281
4 The Catcher in the Rye Salinger    1951 Coming-of-age   234
The piped version can be read as “Take the books data, then filter it to keep only books published after 1900”.
For more complex operations, the benefits become even clearer, which we will see in a moment.
2.2 select(): Choose Columns
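A minimal sketch of select(), using the books tibble from the start of this section (repeated here so the block runs on its own). The choice of columns is only an example:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# Keep only the named columns, in the order given
title_info <- books %>%
  select(title, author, year)

# Drop a column by prefixing its name with a minus sign
no_pages <- books %>%
  select(-pages)

title_info
no_pages
```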
2.3 mutate(): Add New Variables
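A minimal sketch of a mutate() call that reproduces the age column in the output below. It assumes the age was computed against the year 2024, which matches the printed values:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# Add a new column computed from an existing one
# (2024 is an assumed reference year)
books_with_age <- books %>%
  mutate(age = 2024 - year)

books_with_age
```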
# A tibble: 5 × 6
  title                  author      year genre         pages   age
  <chr>                  <chr>      <dbl> <chr>         <dbl> <dbl>
1 1984                   Orwell      1949 Dystopian       328    75
2 Pride and Prejudice    Austen      1813 Romance         432   211
3 The Great Gatsby       Fitzgerald  1925 Modernist       180    99
4 To Kill a Mockingbird  Lee         1960 Coming-of-age   281    64
5 The Catcher in the Rye Salinger    1951 Coming-of-age   234    73
2.4 arrange(): Sort Rows
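A minimal sketch of an arrange() call that is consistent with the output below, assuming the rows were sorted by publication year from oldest to newest:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# Sort rows by year in ascending order (the default)
books_by_year <- books %>%
  arrange(year)

books_by_year
```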
# A tibble: 5 × 5
  title                  author      year genre         pages
  <chr>                  <chr>      <dbl> <chr>         <dbl>
1 Pride and Prejudice    Austen      1813 Romance         432
2 The Great Gatsby       Fitzgerald  1925 Modernist       180
3 1984                   Orwell      1949 Dystopian       328
4 The Catcher in the Rye Salinger    1951 Coming-of-age   234
5 To Kill a Mockingbird  Lee         1960 Coming-of-age   281
In Tidyverse, we use arrange() to sort data frames, which is often more intuitive and easier to use with multiple columns. In Base R, you typically use order() within square brackets or sort() for vectors.
For example:
Tidyverse: data %>% arrange(column_name)
Base R: data[order(data$column_name), ]
The Tidyverse method is more readable, especially when sorting by multiple columns or in descending order.
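To illustrate the multi-column, descending case, here is a sketch that sorts by genre and then by pages from longest to shortest, once with each approach. The choice of sort keys is only an example:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# Tidyverse: sort by genre, then by pages in descending order
tidy_sorted <- books %>%
  arrange(genre, desc(pages))

# Base R: order() with a minus sign for the descending numeric key
base_sorted <- books[order(books$genre, -books$pages), ]

tidy_sorted
base_sorted
```

Both calls give the same row order here. The minus-sign trick in order() only works for numeric columns, whereas desc() also works on text.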
2.5 summarise(): Summarize Data
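A minimal sketch of summarise(), which collapses the whole table into a single row of statistics. The three statistics and their column names are chosen for illustration:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# One output row: a count, an average, and a minimum
book_summary <- books %>%
  summarise(
    n_books     = n(),
    mean_pages  = mean(pages),
    oldest_year = min(year)
  )

book_summary
```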
2.6 group_by(): Group Data for Operations
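group_by() does little on its own; it becomes useful when followed by another verb such as summarise(). A minimal sketch, with illustrative column names, that counts the books and averages the pages within each genre:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# One output row per genre instead of one row for the whole table
genre_summary <- books %>%
  group_by(genre) %>%
  summarise(
    n_books    = n(),
    mean_pages = mean(pages)
  )

genre_summary
```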
2.7 Chaining Multiple Actions
One of the key advantages of Tidyverse is the ability to chain multiple actions together using the pipe operator (%>%). Let’s compare how we can perform a series of data manipulations using both Tidyverse and Base R.
Let’s say we want to:
1. Filter books published after 1900
2. Select only the title, author, and year columns
3. Sort the results by year
4. Get the first 3 entries
2.7.1 Tidyverse Approach
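The four steps can be sketched as a single pipeline. This is one way to write it that is consistent with the result shown below; the variable name result_tidy is chosen for illustration:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

result_tidy <- books %>%
  filter(year > 1900) %>%          # 1. books published after 1900
  select(title, author, year) %>%  # 2. keep three columns
  arrange(year) %>%                # 3. sort by year
  head(3)                          # 4. first 3 entries

result_tidy
```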
# A tibble: 3 × 3
  title                  author      year
  <chr>                  <chr>      <dbl>
1 The Great Gatsby       Fitzgerald  1925
2 1984                   Orwell      1949
3 The Catcher in the Rye Salinger    1951
In this Tidyverse approach, we can read the code from left to right, following the logical flow of operations. Each step is clearly defined, and the pipe operator (%>%) passes the result of each operation to the next.
2.7.2 Base R Approach
# Filter books published after 1900
filtered_books <- books[books$year > 1900, ]
# Select only title, author, and year columns
selected_books <- filtered_books[, c("title", "author", "year")]
# Sort by year
sorted_books <- selected_books[order(selected_books$year), ]
# Get the first 3 entries
result <- head(sorted_books, 3)
# View the result
result
# A tibble: 3 × 3
  title                  author      year
  <chr>                  <chr>      <dbl>
1 The Great Gatsby       Fitzgerald  1925
2 1984                   Orwell      1949
3 The Catcher in the Rye Salinger    1951
In the Base R approach, we need to create intermediate variables at each step. The code reads from top to bottom, with each line representing a separate operation.
2.8 Learning Check
2.9 Hands-On Coding 💻
Try the following exercises:
- Use filter() to find all books written by Austen or Orwell.
- Use arrange() to sort the books by number of pages, from longest to shortest.
- Use mutate() to add a new column called words, assuming an average of 250 words per page.
- Use group_by() and summarise() to find the earliest publication year for each genre.
2.9.1 Exercise 1: Filter books by Austen or Orwell
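One possible solution, offered as a sketch rather than the only correct answer. It uses %in% to test membership in a set of authors; writing author == "Austen" | author == "Orwell" would work equally well:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# Keep rows whose author is one of the listed names
austen_orwell <- books %>%
  filter(author %in% c("Austen", "Orwell"))

austen_orwell
```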
2.9.2 Exercise 2: Sort books by pages, longest to shortest
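One possible solution, as a sketch: wrap the sort column in desc() to reverse the default ascending order.

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# Longest book first
longest_first <- books %>%
  arrange(desc(pages))

longest_first
```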
2.9.3 Exercise 3: Add words column
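One possible solution, as a sketch: multiply the pages column by the assumed 250 words per page.

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# Estimated word count per book
books_with_words <- books %>%
  mutate(words = pages * 250)

books_with_words
```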
2.9.4 Exercise 4: Find earliest publication year by genre
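One possible solution, as a sketch: group the rows by genre, then take the minimum year within each group. The column name earliest_year is chosen for illustration:

```r
library(tidyverse)

# Same mock data as the chapter's books tibble
books <- tibble(
  title  = c("1984", "Pride and Prejudice", "The Great Gatsby",
             "To Kill a Mockingbird", "The Catcher in the Rye"),
  author = c("Orwell", "Austen", "Fitzgerald", "Lee", "Salinger"),
  year   = c(1949, 1813, 1925, 1960, 1951),
  genre  = c("Dystopian", "Romance", "Modernist", "Coming-of-age", "Coming-of-age"),
  pages  = c(328, 432, 180, 281, 234)
)

# Earliest publication year within each genre
earliest_by_genre <- books %>%
  group_by(genre) %>%
  summarise(earliest_year = min(year))

earliest_by_genre
```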
3 Data Visualization with ggplot2
ggplot2 is a powerful package for creating beautiful and informative visualizations, especially useful for exploring data.
3.1 Expand the Books Dataset
Let’s expand the books dataset to include some more variables for visualization purposes:
novels <- books %>%
  mutate(
    words = pages * 250,                     # Estimating word count based on pages
    characters = c(30, 25, 15, 20, 10),      # Number of named characters (estimated)
    rating = c(4.2, 4.5, 4.0, 4.3, 4.1),     # Modern reader ratings (out of 5)
    male_chars = c(20, 10, 10, 12, 7),       # Number of male characters (estimated)
    female_chars = c(10, 15, 5, 8, 3)        # Number of female characters (estimated)
  )
# View the dataset
novels
# A tibble: 5 × 10
  title             author  year genre pages  words characters rating male_chars
  <chr>             <chr>  <dbl> <chr> <dbl>  <dbl>      <dbl>  <dbl>      <dbl>
1 1984              Orwell  1949 Dyst…   328  82000         30    4.2         20
2 Pride and Prejud… Austen  1813 Roma…   432 108000         25    4.5         10
3 The Great Gatsby  Fitzg…  1925 Mode…   180  45000         15    4           10
4 To Kill a Mockin… Lee     1960 Comi…   281  70250         20    4.3         12
5 The Catcher in t… Salin…  1951 Comi…   234  58500         10    4.1          7
# ℹ 1 more variable: female_chars <dbl>
This dataset gives us a rich set of variables to explore, including publication year, word count, genre, character gender representation, and modern reader ratings.
3.2 The Basic Structure of a ggplot
Every ggplot2 plot starts with the ggplot() function and uses + to add layers. The basic structure is:
ggplot(data = <DATA>) +
  GEOM_FUNCTION(mapping = aes(<MAPPINGS>))
Let’s create a simple scatter plot of publication year vs. word count (thousands):
ggplot(data = novels) +
  geom_point(mapping = aes(x = year, y = words / 1000))
- ggplot(data = novels): Initializes the plot with our dataset
- geom_point(): Adds a layer of points (for a scatter plot)
- aes(x = year, y = words / 1000): Maps variables to aesthetic properties (here, x and y positions)
3.3 Aesthetic Mappings
Aesthetics are visual properties of the objects in your plot. Common aesthetics include:
- x and y positions
- color
- size
- shape
Let’s map the rating to the color of the points:
ggplot(data = novels) +
  geom_point(mapping = aes(x = year, y = words, color = rating))
Alternatively, we can also use the size of the points to indicate the rating:
ggplot(data = novels) +
  geom_point(mapping = aes(x = year, y = words, size = rating))
3.4 Adding Labels with labs()
We can improve our plot by adding informative labels:
ggplot(data = novels) +
  geom_point(mapping = aes(x = year, y = words / 1000, size = rating)) +
  labs(title = "Classic Novels: Publication Year vs. Word Count",
       x = "Year of Publication",
       y = "Number of Words (thousands)",
       size = "Rating")
3.5 Geometric Objects (geoms)
Different geom functions create different types of plots. Let’s create a bar plot of character counts:
ggplot(data = novels) +
  geom_col(mapping = aes(x = title, y = characters)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Commonly used geoms include:
- geom_point(): Scatter plots
- geom_line(): Line graphs
- geom_col() or geom_bar(): Bar charts
- geom_boxplot(): Box plots
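The two bar-chart geoms differ in where the bar height comes from, which the following sketch illustrates. The novels_small data frame is a three-row excerpt of the chapter’s data, created here so the block runs on its own:

```r
library(ggplot2)

# Small excerpt of the chapter's data: one row per novel
novels_small <- data.frame(
  title      = c("1984", "To Kill a Mockingbird", "The Catcher in the Rye"),
  genre      = c("Dystopian", "Coming-of-age", "Coming-of-age"),
  characters = c(30, 20, 10)
)

# geom_col(): bar height is taken from a column you supply (y = characters)
p_col <- ggplot(data = novels_small) +
  geom_col(mapping = aes(x = title, y = characters))

# geom_bar(): bar height is the number of rows for each x value
# (here, the number of novels per genre)
p_bar <- ggplot(data = novels_small) +
  geom_bar(mapping = aes(x = genre))

p_col
p_bar
```

As a rule of thumb, reach for geom_col() when the values are already in the data and geom_bar() when you want ggplot2 to count rows for you.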
For more inspiration and examples of what’s possible with ggplot2, check out the R Graph Gallery. This fantastic resource offers:
- A wide variety of chart types and styles
- Reproducible code for each graph
- Explanations and use cases for different visualizations
- Advanced techniques and customizations
Exploring the R Graph Gallery can help you discover new ways to visualize your data and improve your ggplot2 skills!
3.6 Learning Check
4 Building a Simple Slot Machine 🎰
Now, let’s use our tidyverse skills to create and analyze a simple virtual slot machine!
A slot machine is a game you might see in a casino. Here’s how it works:
- Look: It usually has three or more spinning wheels with pictures on them.
- Symbols: In our game, we use four fruit symbols and two other symbols.
- How to play: You press a button to make the wheels spin.
- Winning: You win if the pictures line up in a certain way when the wheels stop. For example, you win if all three symbols are the same.
- Random: Each spin is random, so you can’t predict what will come up. But in real casinos, not all outcomes have the same chance. Casinos set the odds to make sure they make money over time.
Slot machines are often used to teach math and probability. In our case, we’re using it to learn about data analysis. It’s a fun way to practice with numbers and see patterns.
In real life, it’s wise to stay away from slot machines. The odds are set so that the casino always wins in the long run. Our virtual slot machine lets us explore data without any risk!
First, let’s create our slot machine. The code has been provided for you. Take a moment to read through the script below and see if you can understand what each part does. Don’t worry if you don’t understand everything - we’ll break it down together afterwards.
library(tidyverse)
# Define slot machine symbols: four fruit and two other symbols
# (stand-in emoji; any six distinct symbols work the same way)
symbols <- c("🍒", "🍋", "🍊", "🍇", "🔔", "💎")
# Function to play the slot machine
play_slot_machine <- function(n_plays = 10) {
  tibble(
    play = 1:n_plays,
    # Each reel draws n_plays symbols at random, with repeats allowed
    symbol1 = sample(symbols, n_plays, replace = TRUE),
    symbol2 = sample(symbols, n_plays, replace = TRUE),
    symbol3 = sample(symbols, n_plays, replace = TRUE)
  ) %>%
    mutate(
      # A play wins only when all three reels show the same symbol
      win = symbol1 == symbol2 & symbol2 == symbol3,
      # Label each play with an emoji for a win or a loss
      result = if_else(win, "💰", "😢")
    )
}
# Simulate 100 plays
results <- play_slot_machine(100)
# Display the first few results
head(results)
# A tibble: 6 × 6
   play symbol1 symbol2 symbol3 win   result
  <int> <chr>   <chr>   <chr>   <lgl> <chr>
(Six rows follow, one for each of plays 1 to 6. In this run none of them is a win, so win is FALSE and result is 😢 in every row. Your own output will differ, because sample() draws at random.)
Now that we understand how our slot machine works, let’s move on to analyzing its results!
4.1 Hands-On Coding 💻
Let’s explore our slot machine results with some exercises. Remember to use tidyverse functions like filter(), summarise(), group_by(), and ggplot().
4.1.1 Exercise 1: Summarize the Results
Calculate the total number of plays, number of wins, and the win percentage.
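One possible solution, as a sketch. The block rebuilds the slot machine so it runs on its own, with two assumptions that are not in the original script: the symbols are plain text labels standing in for the emoji, and set.seed() fixes the random draws so the result is reproducible.

```r
library(tidyverse)

# Stand-in symbols: any six distinct values behave the same way
symbols <- c("cherry", "lemon", "orange", "grape", "bell", "diamond")

play_slot_machine <- function(n_plays = 10) {
  tibble(
    play    = 1:n_plays,
    symbol1 = sample(symbols, n_plays, replace = TRUE),
    symbol2 = sample(symbols, n_plays, replace = TRUE),
    symbol3 = sample(symbols, n_plays, replace = TRUE)
  ) %>%
    mutate(win = symbol1 == symbol2 & symbol2 == symbol3)
}

set.seed(42)  # assumption: fixed seed, only for reproducibility
results <- play_slot_machine(100)

# TRUE counts as 1, so sum(win) is the number of wins
# and mean(win) is the share of winning plays
slot_summary <- results %>%
  summarise(
    total_plays = n(),
    wins        = sum(win),
    win_percent = mean(win) * 100
  )
slot_summary

# Theoretical win chance with six equally likely symbols per reel:
# 6 winning combinations out of 6^3 = 216, i.e. 1/36 (about 2.8%)
theoretical_win_chance <- 6 / 6^3
theoretical_win_chance
```

With only 100 plays, the observed win percentage can sit well above or below the theoretical value; simulating more plays brings it closer.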
4.1.2 Exercise 2: Find Winning Combinations
Create a new data frame showing only the winning plays and their symbol combinations.
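One possible solution, as a sketch. As in the chapter’s script, the results are simulated first; the text symbols and the seed are assumptions made so the block is self-contained and reproducible.

```r
library(tidyverse)

# Stand-in symbols: any six distinct values behave the same way
symbols <- c("cherry", "lemon", "orange", "grape", "bell", "diamond")

set.seed(42)  # assumption: fixed seed, only for reproducibility
n_plays <- 100
results <- tibble(
  play    = 1:n_plays,
  symbol1 = sample(symbols, n_plays, replace = TRUE),
  symbol2 = sample(symbols, n_plays, replace = TRUE),
  symbol3 = sample(symbols, n_plays, replace = TRUE)
) %>%
  mutate(win = symbol1 == symbol2 & symbol2 == symbol3)

# win is already TRUE/FALSE, so it can be used directly as the condition
winning_plays <- results %>%
  filter(win) %>%
  select(play, symbol1, symbol2, symbol3)

winning_plays
```

Because wins are rare, the result may have only a few rows, or none at all for an unlucky run.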
4.1.3 Exercise 3: Visualize Symbol Distribution
Create a bar plot showing the distribution of symbols in the first reel (symbol1 column).
Emojis often can’t be rendered directly in plots. While there are packages like emojifont or ggtext that can handle emoji rendering, for simplicity, we’ll use a text representation of the symbols.
Congratulations! You’ve now practiced using various tidyverse functions to analyze and visualize data from our virtual slot machine. These skills are fundamental in data manipulation and analysis, which are crucial in many digital humanities projects.
In this chapter, we’ve covered:
- The basics of tidyverse and its core packages
- Data manipulation with dplyr functions
- Data visualization with ggplot2
- Applying tidyverse concepts to analyze our books dataset
- Building a virtual slot machine using tidyverse functions
These skills form an essential foundation for working with data in R using the tidyverse. As we progress in our digital humanities journey, we’ll build upon these concepts to perform more complex data manipulations and analyses.