George Mason University | University Libraries

Working with Data

What you need to know for Data Management and Data Wrangling

What is Big Data?

"Big data can be described in terms of data management challenges that – due to increasing volume, velocity and variety of data – cannot be solved with traditional databases."

Amazon Web Services

The widespread use of technologies such as social media and e-commerce produces a large amount of data. So much data is now generated, and so quickly, that it cannot all be stored and analyzed through traditional means (e.g., relational databases and basic statistics). Instead, much of this "Big Data" is deposited into giant repositories to be dealt with later. Sifting through this data, which is usually distributed across multiple computers working in parallel, and detecting patterns within it (i.e., data mining) requires a substantial degree of skill.

Below are some rules of thumb that may be useful for evaluating your data file. These terms do not have widely shared definitions, so treat them as guidelines only.

For reference, most personal computers as of 2018 have a 500 GB-1 TB hard drive and 4-8 GB of RAM (sometimes 16 GB).

  • Tiny Data: Small enough for a person to look through the data by eye
    • Up to a few thousand cases/rows
  • Small Data: Can be stored and worked with on most personal computers because it all fits in RAM*
    • Can be up to several million cases/rows
    • Takes < 2-4 GB of disk space as CSV
  • Medium Data: Larger than RAM, but smaller than the hard drive
    • Takes < 1 TB of disk space as CSV
  • Large Data: Cannot fit on a normal hard drive
    • Takes > 1 TB of disk space as CSV
  • Big Data: Large data that keeps coming and coming and coming
    • Dozens of TB, likely becoming petabytes of data

* Most statistical software loads the entire dataset into RAM. SAS is a notable exception because it processes data from disk, which is one reason SAS is popular in government and finance.
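
The guidelines above can be sketched as a small Python helper that labels a dataset by its size as a CSV file. This is only an illustration: the function names, the 5,000-row cutoff for "tiny," the 4 GB cutoff for "small," and the 24 TB cutoff for "big" are assumptions chosen to fall within the rough ranges listed, not fixed definitions.

```python
import os

GB = 10**9   # decimal gigabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def classify_data_size(size_bytes, n_rows=None, still_growing=False):
    """Return a rough size label for a dataset stored as CSV.

    All cutoffs are illustrative assumptions based on the ranges in the
    guide; adjust them to match your own computer's RAM and disk.
    """
    # Tiny: a few thousand rows at most (assumed cutoff: 5,000 rows).
    if n_rows is not None and n_rows <= 5_000:
        return "tiny"
    # Small: fits in RAM on a typical PC (assumed cutoff: 4 GB as CSV).
    if size_bytes < 4 * GB:
        return "small"
    # Medium: larger than RAM but fits on a typical 1 TB hard drive.
    if size_bytes < 1 * TB:
        return "medium"
    # Big: dozens of TB and still arriving (assumed cutoff: 24 TB).
    if still_growing and size_bytes >= 24 * TB:
        return "big"
    # Large: does not fit on a normal hard drive.
    return "large"


def classify_csv_file(path):
    """Classify an existing CSV file using its size on disk."""
    return classify_data_size(os.path.getsize(path))
```

For example, `classify_data_size(500 * GB)` returns `"medium"`, since a 500 GB file exceeds typical RAM but still fits on a 1 TB drive.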