Web Scraping — Scrape Anything You Want

Charles Lo
8 min read · Feb 24, 2023


Image source: https://media.geeksforgeeks.org/wp-content/cdn-uploads/20200622222131/What-is-Web-Scraping-and-How-to-Use-It.png

Web scraping is the process of extracting data from websites using automated software tools. It has become an important tool for businesses and researchers to collect data from the internet. In this article, we will discuss the basics of web scraping, the tools and libraries used for web scraping, and how to scrape websites using Python.

Introduction to Web Scraping

Web scraping involves extracting data from websites using automated tools. It has become a popular technique for businesses and researchers to collect data from the internet. The data extracted from websites can be used for a variety of purposes such as data analysis, market research, and competitive analysis. Web scraping involves analyzing the structure of the website and writing a script to extract the relevant data.

Tools for Web Scraping

There are several tools and libraries available for web scraping. Some of the popular ones are:

1. BeautifulSoup:

It is a Python library used for web scraping purposes to pull the data out of HTML and XML files.

Let’s say we want to scrape the titles of the top articles on the homepage of the New York Times website. We can start by sending a GET request to the website using the requests library in Python:

import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com/'
response = requests.get(url)

Next, we can create a Beautiful Soup object from the HTML content of the page using the html.parser:

soup = BeautifulSoup(response.content, 'html.parser')

Now that we have the Beautiful Soup object, we can use it to find the HTML elements that contain the article titles. In this case, we can see that the article titles are contained within <h2> elements with the class css-1cmu9py e1voiwgp0. We can use the .find_all() method to get a list of all elements that match these criteria:

titles = soup.find_all('h2', class_='css-1cmu9py e1voiwgp0')

Finally, we can loop through the list of titles and extract the text content of each element using the .text attribute:

for title in titles:
    print(title.text)

This will output the titles of the top articles on the New York Times homepage:

Why Texas’s Blackouts Are a Preview of America’s Climate-Changed Future
Biden Will Open Enrollment for Affordable Care Act Insurance
Cuomo Aides Rewrote Nursing Home Report to Hide Higher Death Toll
U.S. and Allies Plan Fight From Afar Against Al Qaeda Once Troops Exit Afghanistan
‘The System Is Broken’: Video Shows Cops Yanking Crying Baby From Mother’s Arms
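Keep in mind that auto-generated class names such as css-1cmu9py change whenever the site's front end is rebuilt, so the selector above tends to break over time. A more generic sketch, assuming the headlines are rendered as <h2> or <h3> headings inside <section> blocks (inspect the live page to confirm), might look like this:

# less class-dependent selector; adjust after inspecting the current markup
for heading in soup.select('section h2, section h3'):
    text = heading.get_text(strip=True)
    if text:
        print(text)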

2. Scrapy:

It is an open-source and collaborative web crawling framework for Python. First, you need to install Scrapy using the pip command:

pip install scrapy

After installing Scrapy, you can create a new project using the following command:

scrapy startproject project_name

Replace project_name with your preferred name for the project.

Once the project is created, navigate to the project directory using the command:

cd project_name

Now, create a new spider using the following command:

scrapy genspider spider_name domain_name

Replace spider_name with your preferred name for the spider and domain_name with the domain of the website you want to scrape.
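For the quotes example used below, the command might look like this (the spider name and domain are just one possible choice):

scrapy genspider quotes quotes.toscrape.com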

After creating the spider, you can edit it to add the rules for scraping data from the website. Here’s an example of how to extract data from the quotes.toscrape.com website:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

This code defines a new spider named “quotes” and sets the start URL to the first page of quotes on the website. It then defines a parse method that extracts the text, author, and tags for each quote on the page and yields a dictionary of the extracted data.

The method also checks if there is a next page and follows the link to the next page if there is one.

To run the spider, use the following command:

scrapy crawl spider_name -o output_file_name.json

Replace spider_name with the name of your spider and output_file_name.json with the name you want to give to the output file. The output file will contain the scraped data in JSON format.
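For the quotes spider above, each item in the file mirrors the dictionary yielded in parse(), so the JSON is a list of objects shaped roughly like this (the values shown are placeholders rather than real scraped output):

[
  {
    "text": "“...quote text...”",
    "author": "Author Name",
    "tags": ["tag-one", "tag-two"]
  }
]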

That’s it! You can now use Scrapy to scrape data from any website that allows web scraping.

3. Selenium

It is a popular tool used for web automation and testing. It can also be used for web scraping purposes.

Let’s say we want to scrape the titles of the articles on the home page of the New York Times. We can use Selenium to automate the process of opening the webpage, scrolling down to load all the articles, and extracting the titles.

First, we need to install Selenium and download a web driver for the browser we want to use. In this example, we will use Chrome.

pip install selenium

Next, we need to download the Chrome driver from the following link: https://chromedriver.chromium.org/downloads

We can save the driver in the same folder as our Python script.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# set the path to the chrome driver
driver_path = "./chromedriver"

# create a new Chrome browser instance
driver = webdriver.Chrome(service=Service(executable_path=driver_path))

# navigate to the NY Times homepage
driver.get("https://www.nytimes.com/")

# scroll down to load more articles
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# find all the article titles
titles = driver.find_elements(By.CLASS_NAME, "css-16nhkrn")

# print the titles
for title in titles:
    print(title.text)

# close the browser
driver.quit()

In the code above, we first import the necessary modules from Selenium. We then set the path to the Chrome driver we downloaded and create a new Chrome browser instance.

We navigate to the NY Times homepage and use execute_script to scroll down to load more articles. We then find all the article titles using find_elements with the By.CLASS_NAME locator and print them.

Finally, we close the browser using quit().
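A single call to scrollTo does not guarantee that lazily loaded articles have finished rendering before we query the page. A minimal sketch of a more robust pattern, assuming the same (hypothetical) class name as above, is to wait explicitly until the headline elements are present:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one headline element to appear
wait = WebDriverWait(driver, 10)
titles = wait.until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "css-16nhkrn"))
)

This avoids reading the page before the content we care about has loaded.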

4. Requests:

It is a Python library for making HTTP requests, which you have already seen in the BeautifulSoup example above. Requests and BeautifulSoup work very well together for web scraping, so let me share one more example with you.

Suppose we want to extract the title and description of the top 10 articles on the homepage of a news website. We can use the requests library to fetch the HTML content of the homepage, and then parse the HTML using a library like BeautifulSoup.

import requests
from bs4 import BeautifulSoup

# Send a GET request to the homepage of the news website
response = requests.get("https://www.example.com/news")

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find the top 10 articles on the homepage
articles = soup.find_all("article", limit=10)

# Extract the title and description of each article
for article in articles:
    # Find the title of the article
    title = article.find("h2").text.strip()

    # Find the description of the article
    description = article.find("p").text.strip()

    # Print the title and description
    print("Title:", title)
    print("Description:", description)
    print("-----------------------")

In this example, we first import the requests and BeautifulSoup libraries. We then send a GET request to the homepage of the news website using the requests.get() method. The response object contains the HTML content of the homepage.

We then use BeautifulSoup to parse the HTML content and find the top 10 articles on the homepage using the soup.find_all() method. For each article, we use article.find() to find the title and description of the article and print them to the console.

Note that web scraping with the requests library may not always work if the website requires JavaScript rendering or has anti-scraping measures in place. In those cases, a browser automation tool like Selenium (which can also run headless) may be required.
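For pages that need JavaScript rendering, a common pattern is to let Selenium load the page in headless mode (no visible browser window) and then hand the rendered HTML to BeautifulSoup. A minimal sketch, assuming Selenium 4.6+ (which can locate the Chrome driver automatically) and a placeholder URL, could look like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# configure Chrome to run without opening a visible window
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com/news")  # placeholder URL

# hand the fully rendered HTML to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

print(soup.title.string if soup.title else "No title found")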

Steps for Web Scraping

The following are the steps involved in web scraping:

  1. Identify the website: The first step in web scraping is to identify the website from which data needs to be extracted.
  2. Identify the data: Once the website is identified, the next step is to identify the data that needs to be extracted. This could be in the form of text, images, or tables.
  3. Analyze the website: The next step is to analyze the structure of the website and identify the HTML tags and attributes that contain the data.
  4. Write a script: Once the data is identified, the next step is to write a script to extract the data. This can be done using any of the web scraping tools and libraries mentioned above.
  5. Extract the data: Once the script is written, the data can be extracted from the website.

Overall Recap of Web Scraping using Python

Let’s take an example of scraping the list of top-rated movies of all time from the IMDb website using Python and the BeautifulSoup library.

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/chart/top'
# some sites, IMDb included, may reject requests without a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.select('td.titleColumn')
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name="ir"]')]
for i in range(len(movies)):
    print(f"{i+1}. {movies[i].text.strip()} ({crew[i]}) - Rating: {ratings[i]}")

In the above example, we first import the required libraries, requests and BeautifulSoup. Then, we define the URL of the IMDb website and use the requests library to get the HTML content of the webpage. We then use BeautifulSoup library to parse the HTML content.

After that, we use the select method of BeautifulSoup library to select the HTML tags that contain the data we need. We select the title of the movie, crew, and ratings using the CSS selectors. Finally, we loop through the selected data and print the movie name, crew, and rating.

Conclusion

It’s important to note that web scraping can be subject to legal and ethical concerns, so it’s important to only scrape data from websites that allow it and to ensure that you are following any applicable laws and regulations.
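One practical first step is to check a site’s robots.txt before scraping it. Python’s standard library ships a parser for this; a minimal sketch (the domain and path are placeholders) could look like this:

from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt (placeholder domain)
parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()

# check whether a generic crawler may fetch a given path
if parser.can_fetch("*", "https://www.example.com/news"):
    print("robots.txt allows scraping this path")
else:
    print("robots.txt disallows scraping this path")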

In conclusion, web scraping is a powerful tool that allows you to extract data from websites quickly and efficiently. It has a wide range of applications, from data analysis to market research to content aggregation. With the help of Python libraries such as BeautifulSoup and requests, web scraping has become even more accessible to developers and data analysts. However, it is important to note that web scraping should be used responsibly and ethically, respecting the terms of service and copyright laws of the websites being scraped. Additionally, web scraping can be technically challenging, especially when dealing with websites that use dynamic content and anti-scraping measures. Nonetheless, with some patience and perseverance, you can learn how to scrape almost anything you want from the web, and unlock a wealth of valuable data for your projects and analyses.


About the Author

Hi Medium community, my name is Charles Lo and I’m currently a project manager and data manager at Luxoft. Luxoft combines engineering excellence with deep industry expertise to serve clients globally across industries including, but not limited to, automotive, financial services, travel and hospitality, healthcare, life sciences, media, and telecommunications. Luxoft is also part of the DXC family.

I’m passionate about technology and hold several certifications including Offensive Security Certified Professional, AWS Certified Solution Architect, Red Hat Certified Engineer, and PMP Project Management. I have years of experience working in the banking, automotive, and open-source industries and have gained a wealth of knowledge throughout my career.

As I continue on my Medium journey, I hope to share my experiences and help others grow in their respective fields. Whether it’s providing tips for project management, insights into data analytics, or sharing my passion for open-source technology, I look forward to contributing to the Medium community and helping others succeed.

Author LinkedIn: https://www.linkedin.com/in/charlesarea/
