Portfolio of Dorothea Reher

Book to Scrape

Project information

GitHub URL: Book to Scrape
Creation date: Dec. 3, 2022
Evaluation date: Jan. 27, 2023
Skills:

Python

Git

GitHub

Beautiful Soup

Web Scraping

CSV (Comma-separated values)

Introduction

In this project, I developed a web scraping tool to extract data from the website "Books to Scrape" (http://books.toscrape.com/). The project focuses on scraping detailed information from each book listed under every category on the site, downloading the associated book images, and storing the data in CSV files for future analysis.

Project Workflow:

- Scrape Every Category: The script begins by extracting every category from the sidebar.
- Scrape Book Links: It then navigates to each category and retrieves the link for every book listed.
- Scrape Book Details: For each book link, the script scrapes the necessary information.
- Download Images: The script downloads each book's image and saves it in a folder corresponding to its category.
- Save to CSV: Finally, all book data from a single category is written to a CSV file named after the category, ensuring a clear and structured dataset.
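
The five steps above can be sketched as a single loop. This is only an illustrative outline, not the project's actual code: the name `run_pipeline` and the five callables it receives are hypothetical stand-ins for the real scraping, download, and CSV routines.

```python
def run_pipeline(list_categories, list_book_urls, scrape_book,
                 download_image, save_csv):
    """Run the five workflow steps; every callable is a hypothetical hook.

    list_categories()            -> {category name: category URL}
    list_book_urls(category_url) -> iterable of book page URLs
    scrape_book(book_url)        -> dict with at least an "image_url" key
    download_image(image_url, category) saves one cover image
    save_csv(category, books)    writes one CSV file for the category
    """
    counts = {}
    for category, category_url in list_categories().items():
        books = []
        for book_url in list_book_urls(category_url):
            book = scrape_book(book_url)
            download_image(book["image_url"], category)
            books.append(book)
        # One CSV per category, written once all of its books are collected.
        save_csv(category, books)
        counts[category] = len(books)
    return counts
```

Passing the steps in as functions keeps the loop independent of how each step is implemented, so each one can be developed and checked separately.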

This project illustrates how web scraping can be used to collect and organise large amounts of data, allowing for more in-depth analysis or reporting of the information collected.

Competences

- Use Beautiful Soup to scrape the webpage: http://books.toscrape.com/index.html
- Write data into a CSV file
- Use wget to download an image from the website and save it inside a folder

Learning Experience

1. Web Scraping with Beautiful Soup

The main task was to navigate the site's structure and extract the information for every book in each category. I used Beautiful Soup to parse the site's HTML and extract the relevant data points. Specifically, I targeted the left-hand sidebar, which lists every book category, and then collected the links to every individual book from each category page.
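
A minimal sketch of these two extraction steps, run against small inline HTML samples rather than the live site. The markup in the samples, the CSS selectors, and the function names `extract_categories` and `extract_book_links` are assumptions made for illustration, and the real pages may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://books.toscrape.com/"

# Assumed sidebar markup, trimmed to two categories for illustration.
SAMPLE_SIDEBAR = """
<div class="side_categories">
  <ul class="nav nav-list">
    <li>
      <a href="catalogue/category/books_1/index.html">Books</a>
      <ul>
        <li><a href="catalogue/category/books/travel_2/index.html">
          Travel
        </a></li>
        <li><a href="catalogue/category/books/mystery_3/index.html">
          Mystery
        </a></li>
      </ul>
    </li>
  </ul>
</div>
"""

# Assumed category-page markup: one book card plus a "next" pagination link.
SAMPLE_CATEGORY_PAGE = """
<section>
  <article class="product_pod">
    <h3><a href="../../../example-title_1/index.html">Example Title</a></h3>
  </article>
  <ul class="pager">
    <li class="next"><a href="page-2.html">next</a></li>
  </ul>
</section>
"""


def extract_categories(html, base_url=BASE_URL):
    """Return {category name: absolute URL} from the nested sidebar list."""
    soup = BeautifulSoup(html, "html.parser")
    categories = {}
    # Only the inner list holds real categories; the outer "Books" link is skipped.
    for link in soup.select("div.side_categories ul li ul li a"):
        categories[link.get_text(strip=True)] = urljoin(base_url, link["href"])
    return categories


def extract_book_links(html, page_url):
    """Return (absolute book URLs, next page URL or None) for one category page."""
    soup = BeautifulSoup(html, "html.parser")
    links = [urljoin(page_url, a["href"])
             for a in soup.select("article.product_pod h3 a")]
    next_link = soup.select_one("li.next a")
    next_url = urljoin(page_url, next_link["href"]) if next_link else None
    return links, next_url
```

Against the live site, the HTML string would come from an HTTP request, and `extract_book_links` would be called again on each `next_url` until it returns `None`.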

2. Scraping Book Information

For each book in a category, I extracted the following key information:

- Product Page URL: The direct link to the book’s page.
- Image URL: The URL of the book’s cover image.
- Title: The title of the book.
- Universal Product Code (UPC): A unique identifier for the product.
- Price including tax and Price excluding tax: The book’s price with and without taxes.
- Number available: The available stock for each book.
- Category: The category under which the book is listed.
- Review Rating: The rating given to the book.
- Product Description: A short description of the book.
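
The fields listed above could be pulled from a product page roughly as follows. This is a sketch under stated assumptions: the sample markup, the selectors, the dictionary keys, and the name `parse_book` are illustrative, and the sample values are invented rather than taken from the site.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and assumed product-page markup with invented values.
PAGE_URL = "http://books.toscrape.com/catalogue/example-title_1/index.html"
SAMPLE_BOOK_PAGE = """
<ul class="breadcrumb">
  <li><a href="../../index.html">Home</a></li>
  <li><a href="../category/books_1/index.html">Books</a></li>
  <li><a href="../category/books/poetry_23/index.html">Poetry</a></li>
  <li class="active">Example Title</li>
</ul>
<div class="item active"><img src="../../media/cache/fe/72/cover.jpg"></div>
<div class="product_main">
  <h1>Example Title</h1>
  <p class="star-rating Three"></p>
</div>
<div id="product_description"><h2>Product Description</h2></div>
<p>A short example description.</p>
<table>
  <tr><th>UPC</th><td>abc123</td></tr>
  <tr><th>Price (excl. tax)</th><td>£10.00</td></tr>
  <tr><th>Price (incl. tax)</th><td>£12.00</td></tr>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
</table>
"""


def parse_book(html, page_url):
    """Extract one book's fields from its product page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Map each table header ("UPC", "Availability", ...) to its cell text.
    info = {row.th.get_text(strip=True): row.td.get_text(strip=True)
            for row in soup.select("table tr")}
    # The rating is assumed to be encoded as a second CSS class, e.g. "Three".
    rating_classes = soup.select_one("p.star-rating")["class"]
    rating = next((c for c in rating_classes if c != "star-rating"), "")
    # The description is assumed to be the first <p> after the heading block.
    heading = soup.select_one("#product_description")
    paragraph = heading.find_next_sibling("p") if heading else None
    stock = re.search(r"\d+", info.get("Availability", ""))
    return {
        "product_page_url": page_url,
        "image_url": urljoin(page_url, soup.select_one("div.item img")["src"]),
        "title": soup.select_one("div.product_main h1").get_text(strip=True),
        "universal_product_code": info.get("UPC", ""),
        "price_including_tax": info.get("Price (incl. tax)", ""),
        "price_excluding_tax": info.get("Price (excl. tax)", ""),
        "number_available": int(stock.group()) if stock else 0,
        # The last breadcrumb link is assumed to be the book's category.
        "category": soup.select("ul.breadcrumb li a")[-1].get_text(strip=True),
        "review_rating": rating,
        "product_description": paragraph.get_text(strip=True) if paragraph else "",
    }
```

Resolving the image path with `urljoin` turns the relative `src` attribute into an absolute URL that can be downloaded directly.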

3. Downloading Book Images

To enrich the dataset, I used wget to download the book cover images. Each image is stored in a folder named after the category from which the book was scraped, which keeps the images organized and easy to find.
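
A sketch of the per-category folder layout. The project used wget for the download itself; so that this example runs without extra packages, it defaults to the standard library's `urllib.request.urlretrieve` instead and accepts any other downloader through the `fetch` parameter. The names `slugify` and `download_image`, the `images` root folder, and the file-naming scheme are assumptions.

```python
import re
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def slugify(text):
    """Turn a category or title into a lowercase, filesystem-safe name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_").lower()
    return slug or "untitled"


def download_image(image_url, category, title, root="images", fetch=urlretrieve):
    """Save one cover image under <root>/<category>/<title>.<ext>.

    `fetch` is any callable taking (url, filename); it stands in for
    the wget call used in the project.
    """
    folder = Path(root) / slugify(category)
    folder.mkdir(parents=True, exist_ok=True)
    # Keep the original file extension, falling back to .jpg if there is none.
    suffix = PurePosixPath(urlparse(image_url).path).suffix or ".jpg"
    target = folder / f"{slugify(title)}{suffix}"
    fetch(image_url, str(target))
    return target
```

Creating the folder with `exist_ok=True` means the first book in a category creates it and later books simply reuse it.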

4. Saving Data to CSV Files

After extracting the necessary data from each book, I compiled the information into a structured format. Using Python's csv module, I saved each category's data to its own CSV file. Each file is named after its category and contains the detailed data for every book in it.
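
A minimal sketch of the CSV step using `csv.DictWriter`. The column names in `FIELDNAMES`, the `data` output folder, and the function name `write_category_csv` are assumptions; the project's actual headers may be named differently.

```python
import csv
from pathlib import Path

# Assumed column names, one per field scraped from each book page.
FIELDNAMES = [
    "product_page_url",
    "universal_product_code",
    "title",
    "price_including_tax",
    "price_excluding_tax",
    "number_available",
    "product_description",
    "category",
    "review_rating",
    "image_url",
]


def write_category_csv(category, books, out_dir="data"):
    """Write one category's books (a list of dicts) to <out_dir>/<category>.csv."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    csv_file = out_path / f"{category.lower().replace(' ', '_')}.csv"
    # newline="" lets the csv module control line endings on every platform.
    with csv_file.open("w", newline="", encoding="utf-8") as handle:
        # Unknown keys are ignored; missing keys are written as empty cells.
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES,
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(books)
    return csv_file
```

Calling `write_category_csv("Travel", books)` would create `data/travel.csv` with a header row followed by one row per book.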