
Learn R

Resources to learn and use the Open Source Statistical software R (R-Project)

There are three options for data management in R:

  • Base R - Built into R, and for a long time the only option. Many older tutorials use it.
    • Some parts are useful for everyone to know, such as:
      • Square brackets refer to part of an object: mydata[rows, columns]
      • For data frames, you can use $ to access a variable: mydata$variable
  • data.table - A good option if you need or want the following:
    • Shorthand notation for less typing: mydata[rows, columns, by]
      • Example (changes values to "VA"):  mydata[state == "Virginia", state := "VA"]
    • Faster when working with Big Data (> 1 million records). (comparison with dplyr)
  • dplyr / tidyr (tidyverse) - The most popular option and our recommended choice. Written by Hadley Wickham (RStudio/Posit)
    • Uses words instead of notation
      • mydata |> filter(rows) |> select(columns)
    • Always takes the dataset as the first argument, which enables the use of the pipe
      • You might also see %>% (the older magrittr pipe) instead of |>
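
To make the three notations above concrete, here is a minimal sketch that runs similar operations in each style. The dataset and its column names (state, year, sales) are made up for illustration, and the code assumes the data.table and dplyr packages are installed and that you are running R 4.1 or later (needed for the |> pipe).

```r
# Assumes data.table and dplyr are installed: install.packages(c("data.table", "dplyr"))
library(data.table)
library(dplyr)

# Hypothetical example data (not from the guide)
mydata <- data.frame(
  state = c("Virginia", "Ohio", "Virginia", "Texas"),
  year  = c(2020, 2020, 2021, 2021),
  sales = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Base R: mydata[rows, columns] and mydata$variable
base_result <- mydata[mydata$state == "Virginia", c("state", "sales")]
mean_sales  <- mean(mydata$sales)

# data.table: mydata[rows, columns, by]
dt <- as.data.table(mydata)            # makes a copy; mydata is left unchanged
dt_totals <- dt[year >= 2020, .(total = sum(sales)), by = state]
dt[state == "Virginia", state := "VA"] # := changes dt in place

# dplyr: words instead of notation, chained with the pipe
dplyr_result <- mydata |>
  filter(state == "Virginia") |>
  select(state, sales)

# The pipe only passes the dataset as the first argument,
# so the nested call below gives the same result.
same_result <- select(filter(mydata, state == "Virginia"), state, sales)
```

Replacing |> with %>% in the dplyr lines should behave the same way here, since dplyr makes the %>% pipe available when it is loaded.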

If you really can't decide, check out this comparison.

Scheme      Data Class  Template                                   Advantages
Base R      data.frame  mydata[rows, columns]                      universal, historical
Data Table  data.table  mydata[rows, columns, by]                  fastest for Big Data
Tidyverse   tibble      mydata |> filter(rows) |> select(columns)  easiest to use

 

The Tidyverse

The Tidyverse: dplyr, tidyr and many more

What is tidy data? See the classic article Tidy Data (pdf) by Hadley Wickham. Note that it uses an older version of tidyr.

  • dplyr is used to manipulate a dataset (filter, select, mutate, summarize) and to combine datasets with joins
  • tidyr is used to restructure (reshape) a dataset into tidy form
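
As a small sketch of how the two packages can work together, the example below reshapes a wide table into tidy (long) form with tidyr and then summarizes it with dplyr. The data and column names are invented for illustration, and the code assumes the dplyr and tidyr packages are installed.

```r
# Assumes dplyr and tidyr are installed: install.packages(c("dplyr", "tidyr"))
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per year (not tidy)
wide <- tibble(
  state  = c("VA", "OH"),
  `2020` = c(10, 20),
  `2021` = c(30, 40)
)

# tidyr: restructure so each row is one state-year observation
long <- wide |>
  pivot_longer(
    cols      = c(`2020`, `2021`),
    names_to  = "year",
    values_to = "sales"
  )

# dplyr: manipulate the tidy dataset
summary_by_state <- long |>
  group_by(state) |>
  summarize(total = sum(sales))
```

In the long form, year becomes an ordinary column (stored as text here), which is what lets group_by() and summarize() treat it like any other variable.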