George Mason University | University Libraries

Software: Learn R

Resources to learn and use the Open Source Statistical software R (R-Project)

Data Management in R - The 3 Options

You are not required to use only one of these approaches, but it is best to choose one and use it consistently. Even so, it is useful to be able to read Base R code: it appears in many tutorials, is necessary for certain packages, and can simplify code in some circumstances.

1. Base R

  • Requires no additional packages, but few people use it exclusively. Some functions do not support tidyverse notation.
  • Some notations are useful to know, such as $ and [], which refer to parts of an object (see foundations).
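
To make the $ and [] notation concrete, here is a minimal sketch using only Base R. The data frame mydata and its columns are made up for illustration and are not part of this guide.

```r
# A small made-up data frame, used only for illustration
mydata <- data.frame(
  id    = 1:4,
  group = c("a", "b", "a", "b"),
  score = c(90, 72, 85, 64)
)

# $ extracts a single column as a vector
scores <- mydata$score

# [rows, columns] keeps the rows where the condition is TRUE
# and the columns named on the right of the comma
high <- mydata[mydata$score > 80, c("id", "score")]

print(scores)
print(high)
```

Leaving the part before or after the comma empty, as in mydata[, "score"], keeps all rows or all columns respectively.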

2. Data.Table

3. Tidyverse (dplyr / tidyr)

  • Most popular, and our recommended choice. Developed primarily by Hadley Wickham and colleagues at Posit (formerly RStudio).
  • Uses words (verbs) instead of notation and is built to support piping.
    • mydata |> filter(rows) |> select(columns)
    • |> has been built into R since version 4.1.0, but the tidyverse originally used %>% (from magrittr)
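
As a rough illustration of what piping does, the sketch below writes the same filter-then-select step three ways. It assumes the dplyr package is installed and uses mtcars, a dataset that ships with R; the choice of columns is arbitrary.

```r
library(dplyr)  # provides filter() and select(), and re-exports %>% from magrittr

# Nested calls: must be read from the inside out
nested <- select(filter(mtcars, cyl == 4), mpg, hp)

# Native pipe (R 4.1.0 or later): reads left to right
piped <- mtcars |> filter(cyl == 4) |> select(mpg, hp)

# magrittr pipe: the older tidyverse spelling of the same idea
piped_magrittr <- mtcars %>% filter(cyl == 4) %>% select(mpg, hp)

# All three produce the same result
identical(nested, piped)
identical(piped, piped_magrittr)
```

In each piped version, the result on the left is passed as the first argument of the function on the right, which is why the steps appear in the order they run.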
       

If you really can't decide, check out this comparison for examples of each.

Scheme      Data Class   Template                                     Advantages
Base R      data.frame   mydata[rows,columns]                         universal, historical
Data Table  data.table   mydata[rows,columns,by]                      fastest for Big Data
Tidyverse   tibble       mydata |> filter(rows) |> select(columns)    easiest to use
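
The three templates in the table can be sketched side by side. This example assumes the data.table and dplyr packages are installed and uses the built-in mtcars dataset; the filter condition and column names were chosen only for illustration.

```r
library(data.table)
library(dplyr)

# Base R: mydata[rows, columns] on a data.frame
base_res <- mtcars[mtcars$cyl == 4, c("mpg", "hp")]

# data.table: mydata[rows, columns, by] on a data.table
dt     <- as.data.table(mtcars)
dt_res <- dt[cyl == 4, .(mpg, hp)]
# The third slot, "by", groups rows before the columns are computed
dt_by  <- dt[, .(mean_mpg = mean(mpg)), by = cyl]

# Tidyverse: mydata |> filter(rows) |> select(columns) on a tibble
tbl      <- as_tibble(mtcars)
tidy_res <- tbl |> filter(cyl == 4) |> select(mpg, hp)

print(class(base_res))
print(class(dt_res))
print(class(tidy_res))
```

Printing the classes shows the "Data Class" column of the table in practice: a data.table and a tibble are both still data frames, with an extra class on top.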

The Tidyverse

The Tidyverse: dplyr, tidyr and many more

 

What is tidy data? See the classic article Tidy Data (pdf) by Hadley Wickham. Note that it uses an older version of tidyr.

  • dplyr is for manipulating one dataset
    • filter() - Keep/Remove Rows/Observations
    • select() - Keep/Remove Variables
    • mutate() - Create or change variables
      • See the packages forcats, stringr, lubridate, hms and more for working with specific data types
    • group_by() |> summarize() - Summarize/Collapse across rows
    • also arrange(), relocate(), rename(), count(), and many more
  • tidyr is for restructuring a dataset, e.g. pivot_longer() and pivot_wider() (the joins that combine datasets, such as left_join(), are in dplyr)
  • readr, haven, and many others are for importing data from other formats
  • purrr provides Tidyverse alternatives to the base R apply() series of functions
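
The packages listed above can be sketched together in one short example. It assumes dplyr, tidyr, and purrr are installed; the scores data and all of its column names are hypothetical and exist only for this illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical data invented for this sketch
scores <- tibble(
  student = c("Ana", "Ben", "Cy", "Dee"),
  section = c("A", "A", "B", "B"),
  test1   = c(80, 90, 70, 60),
  test2   = c(85, 95, 75, 65)
)

# dplyr: manipulate one dataset
by_section <- scores |>
  filter(test1 >= 70) |>                     # keep rows meeting the condition
  mutate(average = (test1 + test2) / 2) |>   # create a new variable
  select(student, section, average) |>       # keep only these variables
  group_by(section) |>
  summarize(mean_average = mean(average))    # collapse to one row per section

# tidyr: restructure from wide (one column per test) to long (one row per test)
long <- scores |>
  pivot_longer(cols = c(test1, test2), names_to = "test", values_to = "score")

# purrr: map_dbl() as an alternative to base R sapply()
col_means_purrr <- map_dbl(scores[c("test1", "test2")], mean)
col_means_base  <- sapply(scores[c("test1", "test2")], mean)

print(by_section)
print(long)
print(col_means_purrr)
```

Unlike sapply(), map_dbl() always returns a numeric vector or stops with an error, which makes the type of the result predictable.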

Tutorials