Lecture 7: Data Wrangling

Brian J. Smith

2026-02-05

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

The core tidyverse contains these packages:

  • ggplot2, for data visualisation.
  • dplyr, for data manipulation.
  • tidyr, for data tidying.
  • readr, for data import.
  • purrr, for functional programming.
  • tibble, for tibbles, a modern re-imagining of data frames.
  • stringr, for strings.
  • forcats, for factors.
  • lubridate, for date/times.

dplyr

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.

Some core dplyr functions:

  • mutate() creates new columns
  • select() subsets existing columns
  • filter() subsets rows by a logical condition
  • summarize() creates summary stats for one or more columns
  • group_by() creates grouping variables that are respected by the other functions
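A minimal sketch of these verbs chained together. The `measurements` data frame and its column names are hypothetical, made up for illustration only; the `%>%` pipe is the one re-exported by dplyr.

```r
library(dplyr)

# Hypothetical example data (not from the lecture)
measurements <- data.frame(
  group  = c("A", "A", "B", "B"),
  height = c(10, 12, 7, 9),
  weight = c(3, 5, 4, 6)
)

result <- measurements %>%
  mutate(ratio = height / weight) %>%   # create a new column
  select(group, height, ratio) %>%      # keep only these columns
  filter(height > 8) %>%                # keep rows where the condition is TRUE
  group_by(group) %>%                   # later verbs now operate per group
  summarize(
    mean_height = mean(height),         # one summary row per group
    n = n()                             # number of rows in each group
  )

print(result)
```

Because `group_by()` comes before `summarize()`, the summary is computed once per value of `group` rather than once for the whole data frame.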

tidyr

The goal of tidyr is to help you create tidy data. Tidy data is data where:

  • Each variable is a column; each column is a variable.
  • Each observation is a row; each row is an observation.
  • Each value is a cell; each cell is a single value.

We will primarily use tidyr to reshape data:

  • pivot_longer() reshapes wide data to long data
  • pivot_wider() reshapes long data to wide data

Wide vs. Long Data

Wide

Dataset  Mean  SD  Var
A        10    3   9
B        7     4   16

Long

Dataset  Stat  Value
A        Mean  10
A        SD    3
A        Var   9
B        Mean  7
B        SD    4
B        Var   16
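The two tables above can be converted into one another with tidyr. This is a sketch that rebuilds the wide table by hand; the object names `wide`, `long`, and `wide_again` are chosen here for illustration.

```r
library(tidyr)

# The wide table from above, typed in by hand
wide <- data.frame(
  Dataset = c("A", "B"),
  Mean    = c(10, 7),
  SD      = c(3, 4),
  Var     = c(9, 16)
)

# Wide to long: the column names Mean, SD, Var become values of "Stat",
# and the cell contents are stacked into "Value"
long <- pivot_longer(
  wide,
  cols      = c(Mean, SD, Var),
  names_to  = "Stat",
  values_to = "Value"
)

# Long to wide: each distinct value of Stat becomes its own column again
wide_again <- pivot_wider(
  long,
  names_from  = Stat,
  values_from = Value
)

print(long)
print(wide_again)
```

Note that `pivot_longer()` and `pivot_wider()` return tibbles, so the printed output looks slightly different from a base R data frame even though the values are the same.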

Data Wrangling

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics… Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.

Wikipedia

For a view through the tidyverse lens of data wrangling, see Chapters 9-16 of R for Data Science.

In our context, we want to use data wrangling to get our data into a format conducive to visualization.

For example, to create visualizations with ggplot2, we typically want our data in long rather than wide format.
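As a sketch of why long format suits ggplot2: with one column per role (`Stat`, `Value`, `Dataset`), each column maps directly onto one aesthetic. The data are the small summary table used in the wide vs. long comparison; the choice of a dodged bar chart is an assumption made for illustration.

```r
library(ggplot2)

# Long-format data: one row per Dataset-by-Stat combination
long <- data.frame(
  Dataset = rep(c("A", "B"), each = 3),
  Stat    = rep(c("Mean", "SD", "Var"), times = 2),
  Value   = c(10, 3, 9, 7, 4, 16)
)

# Each column maps to exactly one aesthetic
p <- ggplot(long, aes(x = Stat, y = Value, fill = Dataset)) +
  geom_col(position = "dodge")  # side-by-side bars for A and B

print(p)
```

With the wide version of the same data, `Mean`, `SD`, and `Var` would be three separate columns, and there would be no single column to map to `y`.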

Conversely, calculating derived attributes (columns) is typically easier in wide rather than long format.
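A short sketch of a derived column in wide format. The coefficient of variation (`CV = SD / Mean`) is a hypothetical derived attribute picked for illustration; the data are the wide summary table from the wide vs. long comparison.

```r
library(dplyr)

# Wide-format data: one row per dataset, one column per statistic
wide <- data.frame(
  Dataset = c("A", "B"),
  Mean    = c(10, 7),
  SD      = c(3, 4),
  Var     = c(9, 16)
)

# SD and Mean sit side by side in the same row,
# so the derived column is a single mutate() call
wide_cv <- wide %>%
  mutate(CV = SD / Mean)

print(wide_cv)
```

In long format, `SD` and `Mean` would be in different rows of the same `Value` column, so the same calculation would first require pivoting or a grouped lookup.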

We are going to do some data wrangling in the code walkthrough (next).


But for even more details, see:

vignette("pivot", package = "tidyr")

Questions?


