
Web Scraping - Digital Scholarship Workshop Series

Jupyter Notebooks for exploring the practice and possibilities of web scraping.


Web Scraping

Web scraping is a technique used to collect information from the internet and save it so it can be analyzed as needed. Web scraping is usually done after a research question has been defined and will be part of the data gathering phase of the research lifecycle.

This course will walk you through several different strategies and Python libraries for scraping data from the web and analyzing it.

Across these projects, we will learn to send HTTP requests, parse HTML to extract specific data, scrape across multiple pages, and save and visualize the results.

The content on the web is incredibly diverse, so web scraping scripts are usually custom built to target specific websites and data. These example projects will familiarize you with the various processes and possibilities of web scraping so you can begin building a custom web scraper for the websites and information you are interested in.

Digital Workshop Series

You can find the complete list of Digital Scholarship Workshops here: https://librarybeales.github.io/dsworkshops/

Constellate or Binder?

Constellate is available to Case Western students, staff, and faculty. To use Constellate you will have to create a JSTOR login that is separate from your University login. Instructions for that are here: Making a Constellate Account.

If you are not part of Case, you can launch these tutorials using the Launch Binder button.

A binder launch button

Any changes you make or work you complete will be deleted when you close the tab or window. You can, however, download a copy of the file you've been working in before closing the browser.

Why scrape? And when not to…

The content on the web is an unbelievably rich resource. However, the information available can be difficult to collect and organize manually.

For example, if you were interested in analyzing how people communicate on social media about the NASDAQ, you could manually search for posts tagged with #NASDAQ, copy and paste or screenshot them, and then begin manually organizing the usernames, likes, tags, text content, etc. The amount of labor required would limit the sample size you could analyze.

If, however, you built a web scraper to crawl a site and capture every post with a #NASDAQ tag, you'd be able to capture a much larger sample size and save the data in an accessible format for future researchers. Analysis of that data, also using Python, would let you quickly see who posts the most, which posts have the most likes, which stock tickers are mentioned most frequently in a given time span, etc.

Additionally, the content on the web is constantly changing. If you are basing your research on information from the web, it is a good idea to store that information somewhere yourself, so that those who are evaluating your research can access the identical information.

In short, web scraping can make the endless data available online accessible, useful and permanent.

Do I really need to web scrape?

Data is available from many sources across a wide variety of disciplines. If you can find a relevant dataset, it will almost always be easier to use than something scraped from the web. Data from web scraping usually needs significant parsing and cleaning in order to be useful.

So before you resort to web scraping, see if you can locate the data elsewhere. Contacting a research librarian is an excellent first step. Case Western also has a research data index here: Data Index

Project #1: Making an HTTP Request and Receiving a Response

This first project will use the requests package to introduce the basic web scraping workflow. All the lessons on this page use Books to Scrape as the example website. As you can probably tell from the name, this is a website set up specifically for practicing web scraping.

In this project you will:

  1. Send a request to a web server.
  2. Check for a response.
  3. View the content of that response.
  4. Write that content to a file.
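
A minimal sketch of these four steps using the requests package (the output filename here is just an example):

```python
import requests

# 1. Send a request to the practice website used throughout these lessons.
response = requests.get("https://books.toscrape.com/")

# 2. Check for a response (a status code of 200 means the request succeeded).
print(response.status_code)

# 3. View the content of that response (the first 500 characters of HTML).
print(response.text[:500])

# 4. Write that content to a file for later use.
with open("books_to_scrape.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```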

A constellate launch button A binder launch button

Project #2: Exploring Website Structure and Getting Specific Data from Web Scraping

Normally, in a web scraping project, you are looking for specific information. You don't need to scrape the entire Books to Scrape page if you are only interested in a list of book titles. We are going to practice looking at the structure of a web page so we can design a web scraper that only retrieves certain information. Once we understand how the information we want is tagged and/or organized, we can create rules for the web scraper to follow. The process of breaking down the parts of a web page to retrieve specific information is frequently referred to as HTML parsing.

In this project you will:

  1. Use the Inspect tool in your web browser to explore the structure of the Books to Scrape website.
  2. Understand how book titles on the site are tagged/classified.
  3. Understand and use a Python script to crawl the web page and extract only the data that meets the classification criteria we identified for titles in step 2.
  4. Look at the list of titles we scraped. Identify problems with the data and explore an alternative strategy of using Beautiful Soup to get the correct titles.
  5. Write the list of correct titles to a file.
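
A rough sketch of the parsing step with Beautiful Soup. The selectors assume the markup Books to Scrape uses, where each book sits in an article tag with the class product_pod and the full title is stored in the link's title attribute:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.text, "html.parser")

# Each book sits in <article class="product_pod">. The visible <h3> text
# is truncated, so we read the full title from the link's "title"
# attribute instead -- this is the "alternative strategy" in step 4.
titles = [link["title"] for link in soup.select("article.product_pod h3 a")]

# Write the list of correct titles to a file, one per line.
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))
```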

A constellate launch button A binder launch button

Project #3: Collecting Multiple Data Points About Each Book

Build a scraper that collects multiple data points about each book based upon specific criteria.

In this project you will:

  1. Determine what data you are able to collect about each book listed in the store.
  2. Use the Inspect tool in your web browser to identify the web page structure for those pieces of data.
  3. Understand and use a Python script to crawl the web page and extract only the data that meets the classification criteria we identified in steps 1 and 2.
  4. Write the data to a csv file.
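
A sketch of collecting several data points per book and writing them to a csv file, assuming the same product_pod markup as in Project #2 (the price and star-rating selectors come from inspecting the site):

```python
import csv
import requests
from bs4 import BeautifulSoup

response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for book in soup.select("article.product_pod"):
    title = book.h3.a["title"]
    price = book.select_one("p.price_color").text
    # The star rating is encoded as a class name, e.g. "star-rating Three".
    rating = book.select_one("p.star-rating")["class"][1]
    rows.append([title, price, rating])

# Write the data to a csv file with a header row.
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price", "rating"])
    writer.writerows(rows)
```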

A constellate launch button A binder launch button

Project #4: Scraping Data from Multiple Pages

In this project you will:

  1. Use the Inspect tool to explore how Books to Scrape handles navigation between pages.
  2. Understand and use a Python script that directs the web scraper to navigate to the next page if there is one, scrape the specified data from it, and repeat this process until there are no more pages.
  3. Examine the data to see if our scraper worked the way we think it should.
  4. Write the data to a csv file.
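
A sketch of the pagination loop, assuming the site marks its next-page link with an li element of class "next" (which is what the Inspect tool shows on Books to Scrape):

```python
import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://books.toscrape.com/"
rows = []

# Keep scraping until there is no "next" link to follow.
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    for book in soup.select("article.product_pod"):
        rows.append([book.h3.a["title"],
                     book.select_one("p.price_color").text])

    # The next-page link is relative, so resolve it against the current URL.
    next_link = soup.select_one("li.next a")
    url = urljoin(url, next_link["href"]) if next_link else None

# Write the data from every page to a single csv file.
with open("all_books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])
    writer.writerows(rows)
```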

A constellate launch button A binder launch button

Bonus Project: Data Visualization with Plotly

In this project you will:

  1. Examine the csv file we saved from the previous project and determine what values we would like to visualize.
  2. Look at the Plotly package for Python and determine which kinds of graphs we could create with the data.
  3. Understand and use a Python script to generate those graphs using the data we scraped from the Books to Scrape website.
  4. Save those graphs to a file.
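
A sketch of that workflow with Plotly Express. The filename and column names assume the csv written in the previous project, and a histogram of prices is just one of the graphs you could choose:

```python
import pandas as pd
import plotly.express as px

# Load the csv saved in the previous project.
df = pd.read_csv("all_books.csv")

# Prices were scraped as text (e.g. "£51.77"), so strip the currency
# symbol before treating them as numbers.
df["price"] = df["price"].str.replace("£", "", regex=False).astype(float)

# Generate a histogram of book prices and save it to a file.
fig = px.histogram(df, x="price", title="Distribution of Book Prices")
fig.write_html("price_histogram.html")
```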

A constellate launch button A binder launch button