Python
Git
GitHub
Beautiful Soup
Web Scraping
CSV (Comma-separated values)
In this project, I developed a web scraping tool to extract data from the website "Books to Scrape" (http://books.toscrape.com/). The project focuses on scraping detailed information from each book listed under every category on the site, downloading the associated book images, and storing the data in CSV files for future analysis.
This project illustrates how web scraping can be used to collect and organize large amounts of data for more in-depth analysis or reporting.
Project Workflow:
1. Web Scraping with Beautiful Soup
The main task was to navigate the site's structure and extract the information for every book in each category. I used Beautiful Soup to parse the site's HTML and pull out the relevant data points. Specifically, I targeted the left-hand sidebar, which lists every book category, and from each category page I collected the links to every individual book.
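A minimal sketch of this step is shown below. It is not the project's actual code: the function names are hypothetical, and the CSS selectors (`div.side_categories` for the sidebar, `article.product_pod h3 a` for book links) are assumptions about the site's markup. Both functions take HTML that has already been fetched, and the second handles a single category page only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://books.toscrape.com/"


def extract_category_links(html, base_url=BASE_URL):
    """Return {category name: absolute URL} read from the sidebar.

    Assumes the categories sit in a nested list inside
    div.side_categories (an assumption about the page markup).
    """
    soup = BeautifulSoup(html, "html.parser")
    links = {}
    # The nested "ul li ul li" path skips the top-level "Books" entry.
    for anchor in soup.select("div.side_categories ul li ul li a"):
        name = anchor.get_text(strip=True)
        links[name] = urljoin(base_url, anchor["href"])
    return links


def extract_book_links(html, page_url):
    """Return absolute URLs of the books listed on one category page.

    Assumes each book is an article.product_pod whose h3 holds the link.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Book hrefs are relative to the category page, so resolve against it.
    return [
        urljoin(page_url, anchor["href"])
        for anchor in soup.select("article.product_pod h3 a")
    ]
```

Resolving each `href` with `urljoin` against the page it came from keeps relative links such as `../../../some-book/index.html` pointing at the right address.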
2. Scraping Book Information
For each book in a category, I extracted its key information.
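One way to gather a book's details is sketched below. It is illustrative only: `parse_book_page` is a hypothetical helper, and it assumes the book page shows its title in an `h1` and its remaining details in a table of header/value rows. Whatever labels the table contains become the dictionary keys, so the sketch does not presume which fields the project actually kept.

```python
from bs4 import BeautifulSoup


def parse_book_page(html):
    """Collect a book's details from its page into a dictionary.

    Assumes (not confirmed by the project) that the title is in the
    first <h1> and that details are table rows of <th>label</th><td>value</td>.
    """
    soup = BeautifulSoup(html, "html.parser")
    info = {}

    title = soup.find("h1")
    if title is not None:
        info["title"] = title.get_text(strip=True)

    # Each table row contributes one label/value pair.
    for row in soup.select("table tr"):
        header = row.find("th")
        value = row.find("td")
        if header is not None and value is not None:
            info[header.get_text(strip=True)] = value.get_text(strip=True)

    return info
```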
3. Downloading Book Images
To enrich the dataset, I used wget to download the book cover images. Each image is stored in a folder named after the category the book was scraped from, which keeps the images organized and easy to find.
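The per-category folder layout can be sketched as follows. Note the swap: this example uses the standard library's `urllib.request.urlretrieve` in place of wget so that it has no external dependency. The function names, the `images` root folder, and the folder-name cleanup rule are all assumptions made for illustration.

```python
import os
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


def category_folder(category, root="images"):
    """Build a filesystem-safe folder path for a category name."""
    # Replace runs of non-alphanumeric characters with underscores.
    safe = re.sub(r"[^A-Za-z0-9]+", "_", category).strip("_").lower()
    return os.path.join(root, safe)


def image_destination(image_url, category, root="images"):
    """Return the local path where an image should be saved."""
    filename = os.path.basename(urlparse(image_url).path)
    return os.path.join(category_folder(category, root), filename)


def download_image(image_src, page_url, category, root="images"):
    """Download one cover image into its category folder.

    image_src may be relative, so it is resolved against page_url first.
    """
    image_url = urljoin(page_url, image_src)
    dest = image_destination(image_url, category, root)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    urlretrieve(image_url, dest)  # stdlib stand-in for wget
    return dest
```

Keeping the path logic separate from the download call makes the folder naming easy to check without touching the network.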
4. Saving Data to CSV Files
After extracting the necessary data from each book, I compiled the information into a structured format. Using Python's csv module, I saved all data for each category into separate CSV files. The CSV files are named after the respective categories and contain the detailed book data.
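A minimal sketch of writing one CSV file per category with the `csv` module is given below. `save_category_csv`, the `csv_output` folder, and the dictionary-per-book layout are assumptions for illustration, not the project's actual structure.

```python
import csv
import os


def save_category_csv(category, books, out_dir="csv_output"):
    """Write a list of book dictionaries to <out_dir>/<category>.csv."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{category}.csv")

    # Use every key seen across the books, in first-seen order, as columns.
    fieldnames = []
    for book in books:
        for key in book:
            if key not in fieldnames:
                fieldnames.append(key)

    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(books)  # missing keys are written as empty cells
    return path
```

For example, `save_category_csv("Travel", books)` would create `csv_output/Travel.csv` with one row per book.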