If you come across a website displaying data you would like to use:
- Does the website allow you to download the data in a format like XML or CSV?
- If not, can you find the data you need elsewhere (Google, governments, international organizations, libraries, etc.)?
- Does the website offer an API? This may not be immediately obvious and might require a bit of research.
- Is the website otherwise trying to provide data openly? If so, email them to inquire about additional options.
- If none of those options is available, you might consider web scraping, but first check that it is not prohibited (the sketch below shows one quick robots.txt check).
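One quick programmatic check, assuming Python: the standard-library urllib.robotparser reports whether a site's robots.txt disallows fetching a given URL (robots.txt is advisory, so still read the site's Terms of Service). The example.com URLs are placeholders:

```python
# Check whether a site's robots.txt permits fetching a URL.
# All example.com URLs are placeholders for the real site.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # placeholder site
robots.read()  # fetch and parse robots.txt

url = "https://example.com/data/table.html"  # placeholder page
if robots.can_fetch("*", url):  # "*" = rules that apply to any user agent
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows fetching", url)
```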
Both APIs and web scraping have two parts (a sketch of both follows this list):
- Make the request: specify a URL (yes, a normal URL)
  - For web scraping, it is the same URL you would use in a web browser (because it “returns” an HTML file) – Easier
  - For APIs, the URL points to the API endpoint and includes keys and values specifying what you want – Harder
- Process the response: save the file you get and extract the data
  - For web scraping, you receive an HTML file with the web content, which must be parsed to extract the data – Harder
  - For APIs, you receive a file in a structured format (often XML or JSON), which gives clean, easy-to-access data – Easier
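A minimal sketch of both workflows in Python, using the third-party requests and beautifulsoup4 libraries; every URL, CSS selector, and query parameter below is a placeholder for the real site or API you are working with:

```python
# Minimal sketch of both workflows (pip install requests beautifulsoup4).
# All URLs, selectors, and parameters are placeholders.
import requests
from bs4 import BeautifulSoup

# --- Web scraping: request the normal page URL, then parse the HTML ---
page = requests.get("https://example.com/report.html")  # same URL as in a browser
page.raise_for_status()
soup = BeautifulSoup(page.text, "html.parser")
cells = [td.get_text(strip=True) for td in soup.select("table td")]  # placeholder selector
print(cells[:5])

# --- API: request the endpoint URL with keys and values, then read structured data ---
resp = requests.get(
    "https://api.example.com/v1/data",        # placeholder endpoint
    params={"year": 2020, "format": "json"},  # placeholder keys and values
)
resp.raise_for_status()
data = resp.json()  # JSON parses directly into Python dicts and lists
print(data)
```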
Identify the tools you need:
- If you just need links, images, or non-HTML content, consider whether a browser plugin like DownThemAll! would be sufficient.
- If you just need to extract content within specific webpages, you would need a parser, unless you are working with a small number of pages and can use manual tools like Scraper (see #1).
- If you need to follow links to get additional pages, you would need to use a spider, unless the links are knowable ahead of time, as in many paginated tables (see #2; a minimal spider sketch follows this list).
- If the website uses JavaScript/AJAX to display content you need, you would need a web driver, unless you can identify the underlying API call using browser developer tools (see #3).
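For the spider case (#2), here is a minimal sketch that follows “next page” links until none remain. The start URL and the rel="next" link convention are assumptions about a hypothetical paginated site; real sites may link their pages differently, so adapt the selectors accordingly:

```python
# Minimal spider sketch: parse each page, then follow the "next" link.
# The start URL and the rel="next" convention are assumptions about a
# hypothetical site; adapt both to the pages you are actually scraping.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/table?page=1"  # placeholder start page
while url:
    page = requests.get(url)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    # Extract whatever you need from this page (placeholder selector).
    for td in soup.select("table td"):
        print(td.get_text(strip=True))

    # Follow the link to the next page, if the page has one.
    next_link = soup.find("a", attrs={"rel": "next"})
    url = urljoin(url, next_link["href"]) if next_link else None
```

For the web-driver case (#3), a tool like Selenium loads the page in a real browser so its JavaScript runs before you read the HTML, but identifying the underlying API call in the browser's developer tools is usually faster and lighter when it works.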