Understanding web scraping

...building a Goodreads data extractor

Web scraping, in the literal sense, means scraping data off the web: collecting data from websites by programmatically fetching and parsing the HTML or XML structure of their pages. Keep in mind, however, that some websites actively block scrapers, and sending too many requests can get your IP address banned.

Prerequisites - basic HTML and CSS

PART I: Setting up web scraping libraries -

  1. Installing requests (cmd) -
    pip install requests

  2. Installing lxml (cmd) -
    pip install lxml

  3. Installing BeautifulSoup (bs version 4) -
    pip install beautifulsoup4

    You might need to restart your terminal or IDE for these to be picked up.
    To use these in your programs, import them, as shown below.
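
    As a quick sanity check, here's a minimal sketch: if each library was installed correctly, these imports run without errors.

    import requests  # for making HTTP requests
    import lxml      # the parsing engine bs4 will use
    import bs4       # BeautifulSoup 4, for parsing HTML

    print("All imports succeeded")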

PART II: Grabbing titles, classes and images -

Here, we'll be using example.com for every demonstration. Make sure to include "http://" or "https://", depending on which one your site uses.
To begin, import requests. The requests library has a function called "get" which lets you "get" a response from a page.

import requests

response = requests.get("http://www.example.com")

print(type(response)) #prints the type of response
print(response.text) #prints the response text

*If you face any issues running the above, check your firewall to make sure it isn't blocking Python.

The code above prints two things: the type of the response object, and the text contained within it. The text, however, is returned as one plain string. This is where BeautifulSoup comes into play: parsing. That is, bs4 will help us pull specific data out of the page using tags, ids, classes and so on.

To create our "soup" object, we use the BeautifulSoup class within the bs4 module; to demonstrate:

import requests
import bs4 

response = requests.get("http://www.example.com")

soup = bs4.BeautifulSoup(response.text, "lxml") #second parameter is the engine

In our "soup" object, we've used the bs4 module's BeautifulSoup class to parse through the response text using the engine specified by the second parameter, that is, lxml. The lxml library helps sort through the data and grab classes, tags and ids etc.
Now, to grab specific elements or tags from the page, we can use something called "select()" and pass in the tag or element we want to select.

import requests
import bs4

response = requests.get("http://www.example.com")
soup = bs4.BeautifulSoup(response.text, "lxml") 
title = soup.select("title")

Here, we're grabbing the HTML "<title>" element. Grabbing paragraphs, headings, divs etc. works the same way; for example, if you wanted to select a link, you could just put "a" inside the quotes instead.
However, select() returns a list of matching elements, tags included. To get just the text of the first match, take index zero and call "getText()": title = soup.select("title")[0].getText()
If this doesn't work, you can use another approach: find(), which returns the first matching element directly.

element = soup.find('p')
text = element.getText()
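
Putting these together, a minimal sketch (again using example.com):

import requests
import bs4

response = requests.get("http://www.example.com")
soup = bs4.BeautifulSoup(response.text, "lxml")

title = soup.select("title")[0].getText()  # text of the <title> tag
links = soup.select("a")                   # list of all <a> elements

print(title)
for link in links:
    print(link.getText(), link.get("href"))  # link text and its URL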

Grabbing a class is slightly different. Instead of selecting an HTML tag by name, we pass select() a CSS selector, using "#id" for ids and ".class" for classes, as in the sketch below.
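
For instance, continuing from a "soup" object, a minimal sketch (the class and id names here are hypothetical, so substitute ones from your own page):

titles = soup.select(".bookTitle")   # every element with class="bookTitle" (hypothetical)
main = soup.select("#main")          # the element with id="main" (hypothetical)
reviews = soup.select("div.review")  # <div> elements with class="review" (hypothetical)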

Images can be accessed the same way. For example:

image = soup.select('img')[0]
source = image['src']
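
Once you have the source URL, you can download the image itself; a short sketch continuing from above, assuming "source" is an absolute URL (relative URLs would first need to be joined with the page's address):

image_data = requests.get(source).content  # raw bytes of the image

with open("downloaded_image.jpg", "wb") as f:  # hypothetical filename
    f.write(image_data)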

PART III: Building the Goodreads web scraper -

import requests
from bs4 import BeautifulSoup

# URL of the Goodreads list you want to scrape
url = "https://www.goodreads.com/list/show/1.Best_Books_of_the_21st_Century"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Parse the HTML content of the page using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# Find all the book containers on the page
book_containers = soup.find_all("tr", itemtype="http://schema.org/Book")

# Loop through each book container and extract title and author information
for container in book_containers:
    title = container.find("a", class_="bookTitle").get_text(strip=True)
    author = container.find("a", class_="authorName").get_text(strip=True)
    print(f"Title: {title}\nAuthor: {author}\n")

In this example, we first import the necessary libraries: requests for making HTTP requests and BeautifulSoup for parsing HTML. We then send an HTTP GET request to the Goodreads URL and parse the HTML content using Beautiful Soup.

We use Beautiful Soup's find_all method to locate all the book containers on the page, which have the itemtype attribute set to "http://schema.org/Book". Within each container, we extract the title and author information using the appropriate HTML tags and class names.

Finally, we loop through each book container and print out the extracted title and author information.
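
As noted at the start, some sites block scripted requests. If Goodreads returns an error page or empty results, one common workaround (a sketch, not a guarantee) is to send a browser-like User-Agent header, check the status code, and pause between requests:

import time
import requests

url = "https://www.goodreads.com/list/show/1.Best_Books_of_the_21st_Century"

# sites often reject the default "python-requests" identifier;
# the header value below is just an example of a browser-like string
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(url, headers=headers)
response.raise_for_status()  # raise an error for 4xx/5xx responses

time.sleep(1)  # be polite: wait before sending the next request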

(P.S. what do you call a spider that codes?
.... a web scraper)