If you come across a website displaying data you would like to use:
- Does the website allow you to download the data in a format like XML or CSV?
- If not, can you find the data you need elsewhere (Google, governments, international organizations, libraries, etc.)?
- Does the website offer an API? This may not be immediately obvious and might require a bit of research.
- Is the website otherwise trying to provide data openly? If so, email them to inquire about additional options.
- If none of those options is available, you might consider web scraping, but first check that it is not prohibited (the sketch below shows one quick robots.txt check).
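One quick programmatic check, assuming Python: the standard-library urllib.robotparser reports whether a site's robots.txt disallows fetching a given URL (robots.txt is advisory, so still read the site's Terms of Service). The example.com URLs are placeholders:

```python
# Check whether a site's robots.txt permits fetching a URL.
# All example.com URLs are placeholders for the real site.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # placeholder site
robots.read()  # fetch and parse robots.txt

url = "https://example.com/data/table.html"  # placeholder page
if robots.can_fetch("*", url):  # "*" = rules that apply to any user agent
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows fetching", url)
```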
Both APIs and web scraping have two parts (a sketch of both follows this list):
- Make the request: specify a URL (yes, a normal URL)
  - For web scraping, it is the same URL you would use in a web browser (because it “returns” an HTML file) – Easier
  - For APIs, the URL points to the API endpoint and includes keys and values specifying what you want – Harder
- Process the response: save the file you get and extract the data
  - For web scraping, you receive an HTML file with the web content, which must be parsed to extract the data – Harder
  - For APIs, you receive a file in a structured format (often XML or JSON), which gives clean, easy-to-access data – Easier
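A minimal sketch of both workflows in Python, using the third-party requests and beautifulsoup4 libraries; every URL, CSS selector, and query parameter below is a placeholder for the real site or API you are working with:

```python
# Minimal sketch of both workflows (pip install requests beautifulsoup4).
# All URLs, selectors, and parameters are placeholders.
import requests
from bs4 import BeautifulSoup

# --- Web scraping: request the normal page URL, then parse the HTML ---
page = requests.get("https://example.com/report.html")  # same URL as in a browser
page.raise_for_status()
soup = BeautifulSoup(page.text, "html.parser")
cells = [td.get_text(strip=True) for td in soup.select("table td")]  # placeholder selector
print(cells[:5])

# --- API: request the endpoint URL with keys and values, then read structured data ---
resp = requests.get(
    "https://api.example.com/v1/data",        # placeholder endpoint
    params={"year": 2020, "format": "json"},  # placeholder keys and values
)
resp.raise_for_status()
data = resp.json()  # JSON parses directly into Python dicts and lists
print(data)
```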
Identify the tools you need:
- If you just need links, images, or non-HTML content, consider whether a browser plugin like DownThemAll! would be sufficient.
- If you just need to extract content within specific webpages, you would need a parser, unless you are working with a small number of pages and can use manual tools like Scraper (see #1).
- If you need to follow links to get additional pages, you would need to use a spider, unless the links are knowable ahead of time, as in many paginated tables (see #2; a minimal spider sketch follows this list).
- If the website uses JavaScript/AJAX to display content you need, you would need a web driver, unless you can identify the underlying API call using browser developer tools (see #3).
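For the spider case (#2), here is a minimal sketch that follows “next page” links until none remain. The start URL and the rel="next" link convention are assumptions about a hypothetical paginated site; real sites may link their pages differently, so adapt the selectors accordingly:

```python
# Minimal spider sketch: parse each page, then follow the "next" link.
# The start URL and the rel="next" convention are assumptions about a
# hypothetical site; adapt both to the pages you are actually scraping.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/table?page=1"  # placeholder start page
while url:
    page = requests.get(url)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    # Extract whatever you need from this page (placeholder selector).
    for td in soup.select("table td"):
        print(td.get_text(strip=True))

    # Follow the link to the next page, if the page has one.
    next_link = soup.find("a", attrs={"rel": "next"})
    url = urljoin(url, next_link["href"]) if next_link else None
```

For the web-driver case (#3), a tool like Selenium loads the page in a real browser so its JavaScript runs before you read the HTML, but identifying the underlying API call in the browser's developer tools is usually faster and lighter when it works.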