Data Management in R - The 3 Options

You are not required to use just one of these three approaches, but it is best to choose one and use it consistently. Even so, it is useful to be able to read Base R code, because it appears in many tutorials, is necessary for certain packages, and can simplify code in some circumstances.

1. Base R

Base R does not require additional packages, but few people use it exclusively. Some functions do not support tidyverse notation.
Some Base R notations are useful to know, such as $ and [], which refer to parts of an object (see foundations).
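
As a minimal sketch of the $ and [] notation mentioned above, the example below uses a small made-up data frame. The name mydata follows the template used in this guide, and the columns id, group, and score are hypothetical.

```r
# Hypothetical example data (not from a real dataset)
mydata <- data.frame(
  id    = 1:5,
  group = c("a", "b", "a", "b", "a"),
  score = c(10, 20, 30, 40, 50)
)

# $ extracts a single column as a vector
scores <- mydata$score

# [rows, columns] subsets a data frame;
# leaving a slot empty keeps all rows or all columns
high    <- mydata[mydata$score > 25, ]                    # some rows, all columns
group_a <- mydata[mydata$group == "a", c("id", "score")]  # some rows, two columns

# [[ ]] also extracts a single column, here the same one as mydata$score
same <- mydata[["score"]]

print(high)
print(group_a)
```

The same $ and [] notation also works on tibbles and data.tables, which is one reason it is worth being able to read.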

2. data.table

Data scientists, programmers, and those who work with very large data files may prefer this.
It can be faster to type and run (comparison with dplyr).
More on data.table: Learn to work with large datasets in R, by Analytics Vidhya
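
The general data.table form is dt[rows, columns, by]. The sketch below illustrates it on made-up data and assumes the data.table package is installed. The object name dt and its columns are hypothetical.

```r
library(data.table)

# Hypothetical example data (not from a real dataset)
dt <- data.table(
  id    = 1:6,
  group = c("a", "b", "a", "b", "a", "b"),
  score = c(10, 20, 30, 40, 50, 60)
)

# Rows: keep score > 25. Columns: .() is data.table shorthand for list()
high <- dt[score > 25, .(id, score)]

# By: compute a summary within each group; the rows slot is left empty
means <- dt[, .(mean_score = mean(score)), by = group]

print(high)
print(means)
```

Column names are used directly inside the brackets, without quotes or the dt$ prefix, which is part of why data.table code can be short to type.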

3. Tidyverse (dplyr / tidyr)

Most popular, and our recommended choice. Written by Hadley Wickham and colleagues at Posit (formerly RStudio).
Uses words (function names) instead of notation and is built to support piping.
- mydata |> filter(rows) |> select(columns)
- |> is now built into R (since version 4.1.0), but the tidyverse originally used %>% (from magrittr)
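
To make the piping template concrete, here is a minimal sketch that assumes the dplyr package is installed and R is version 4.1.0 or later. The data and column names are made up for illustration.

```r
library(dplyr)

# Hypothetical example data (not from a real dataset)
mydata <- tibble(
  id    = 1:5,
  group = c("a", "b", "a", "b", "a"),
  score = c(10, 20, 30, 40, 50)
)

# Native pipe: the result on the left becomes the first argument on the right
piped <- mydata |>
  filter(score > 25) |>
  select(id, score)

# The same steps with the magrittr pipe, which dplyr re-exports
piped_magrittr <- mydata %>%
  filter(score > 25) %>%
  select(id, score)

# The same steps without a pipe must be read from the inside out
nested <- select(filter(mydata, score > 25), id, score)

print(piped)
```

All three versions return the same tibble. The piped forms read top to bottom in the order the steps are applied.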

If you really can't decide, check out this comparison for examples of each.

Scheme	Data Class	Template	Advantages
Base R	data.frame	mydata[rows,columns]	universal, historical
Data Table	data.table	mydata[rows,columns,by]	fastest for Big Data
Tidyverse	tibble	mydata |> filter(rows) |> select(columns)	easiest to use

The Tidyverse

The Tidyverse: dplyr, tidyr and many more

Documentation - Click on individual packages for more
- Tidyverse Packages for Importing
Cheatsheets for Posit and tidyverse packages

What is tidy data? See the classic article Tidy Data (pdf) by Hadley Wickham. Note that it uses an older version of tidyr.

dplyr is for manipulating a dataset (it also provides the join functions for combining datasets)
- filter() - Keep/Remove Rows/Observations
- select() - Keep/Remove Variables
- mutate() - Create or change variables
  - See the packages forcats, stringr, lubridate, hms and more for working with specific data types
- group_by() |> summarize() - Summarize/Collapse across rows
- also arrange(), relocate(), rename(), count(), and many more
tidyr is for restructuring a dataset, such as reshaping between wide and long formats (joining datasets is done with the dplyr join functions)
readr, haven, and many others are for importing data from other formats
purrr provides Tidyverse alternatives to the base R apply() series of functions
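
The sketch below walks through several of the functions listed above on one small made-up dataset. It assumes the dplyr, tidyr, and purrr packages are installed. The object scores and its columns are hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical wide-format data: one row per student, one column per test
scores <- tibble(
  student = c("Ana", "Ben", "Cy", "Dee"),
  class   = c("x", "x", "y", "y"),
  test1   = c(80, 90, 70, 60),
  test2   = c(85, 95, 75, 65)
)

# dplyr: create a variable, keep rows, keep columns, sort
improved <- scores |>
  mutate(gain = test2 - test1) |>
  filter(test1 >= 70) |>
  select(student, class, gain) |>
  arrange(desc(gain))

# dplyr: collapse across rows within each class
class_means <- scores |>
  group_by(class) |>
  summarize(mean_test1 = mean(test1), n = n())

# tidyr: restructure from wide (one column per test) to long (one row per test)
long <- scores |>
  pivot_longer(cols = c(test1, test2), names_to = "test", values_to = "score")

# purrr: map_dbl() applies a function to each column and returns a numeric vector,
# much like sapply() in base R
means_purrr <- map_dbl(scores[c("test1", "test2")], mean)
means_base  <- sapply(scores[c("test1", "test2")], mean)

print(improved)
print(class_means)
print(long)
```

One difference worth knowing is that map_dbl() always returns a numeric vector or stops with an error, while sapply() chooses its return type based on the results.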

Tutorials

learnr Interactive Tutorials:
- Wrangling penguins from Allison Horst - Covers dplyr only
- Interactive tutorials/exercises: from the Data Science in a Box course
Read:
- Working with Data from Data Carpentry, Ecology
- Tidying/reshaping tables using tidyr & joining data tables from Exploratory Data Analysis in R
Watch:
- Learning the R Tidyverse (~3.3 hrs) from LinkedIn Learning, Charlie Joey Hadley (Mason Log In)
- R data manipulation with RStudio and dplyr (~1hr) from UQ Library