Web Scraping with Python: A Complete Guide Using BeautifulSoup and Requests

by Didin J. on Jul 22, 2025

Learn how to scrape websites using Python with BeautifulSoup and Requests. A complete guide to parsing, navigating, and extracting web data responsibly.

Web scraping is a powerful technique used to extract data from websites. Whether you're gathering product information from e-commerce platforms, pulling headlines from news sites, or collecting research data, web scraping enables you to automate the process efficiently.

In this tutorial, we’ll walk you through a complete, step-by-step guide to web scraping using two of the most popular Python libraries: BeautifulSoup and Requests. You’ll learn how to fetch and parse HTML content, navigate through HTML elements, extract data, and handle common issues like pagination and request headers. By the end, you'll be equipped to build your own web scraper for a wide range of real-world scenarios.

This tutorial is ideal for Python developers—both beginners and intermediates—who want to expand their skill set into automation, data extraction, and scripting.

What You’ll Learn:

  • The fundamentals of HTTP requests and HTML parsing

  • How to install and use requests and BeautifulSoup

  • Navigating the DOM tree and extracting data from HTML

  • Handling pagination, headers, and common scraping challenges

  • Best practices and ethical considerations for web scraping

Let’s get started and turn the web into your personal data source!


Prerequisites and Tools

Before we dive into the code, let’s make sure your development environment is set up and you're familiar with a few key concepts.

Prerequisites

To follow along with this tutorial, you should have:

  • Basic Python knowledge – Familiarity with Python syntax, variables, functions, and loops.

  • Python 3 installed – You can download the latest version from python.org.

  • A code editor – VS Code, PyCharm, or even a simple text editor like Sublime Text will work just fine.

  • Access to the terminal or command prompt – For installing libraries and running scripts.

No prior experience with HTML or web scraping is required; however, a basic understanding of HTML tags and structure will be helpful.

Tools and Libraries

We'll be using the following Python libraries:

  • Requests – A simple HTTP library for sending GET and POST requests.

  • BeautifulSoup (bs4) – A powerful HTML/XML parser that makes navigating and searching the DOM tree straightforward.

Install both libraries using pip:

mkdir web-scraping
cd web-scraping
python3 -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4

We’ll also use:

  • lxml (optional but recommended) – A fast parser for better performance with BeautifulSoup.

To install it:

pip install lxml

With your environment set up and tools ready, let's move on to fetching your first web page!


Fetching and Parsing a Web Page

Now that your environment is set up, let’s start scraping by fetching and parsing a simple web page using the requests and BeautifulSoup libraries.

Step 1: Import the Required Libraries

Create a new Python file called scraper.py and add the following imports:

import requests
from bs4 import BeautifulSoup

Step 2: Send a GET Request to a Web Page

Use the requests.get() method to fetch the HTML content of a target URL. For this example, let’s use a demo site: http://quotes.toscrape.com, which is specifically designed for practicing web scraping.

url = 'http://quotes.toscrape.com'
response = requests.get(url)

# Print the response status and raw HTML
print(response.status_code)
print(response.text)

You should see a 200 status code, followed by the page’s HTML content printed to the console.

Step 3: Parse the HTML with BeautifulSoup

Now, let’s parse the HTML using BeautifulSoup so we can extract specific elements:

soup = BeautifulSoup(response.text, 'lxml')  # or 'html.parser' if lxml is not installed

# Print the page title
print(soup.title.text)

You should see:

Quotes to Scrape

Step 4: Extract Quotes from the Page

Let’s find all the quotes on the page. Each quote is wrapped inside a <div class="quote"> tag.

quotes = soup.find_all('div', class_='quote')

for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    print(f'"{text}" — {author}')

This will output:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” — Albert Einstein
...

Summary

  • We used requests to fetch the page content.

  • Parsed the HTML with BeautifulSoup.

  • Used .find_all() and .find() to locate specific HTML elements and extract their contents.


Navigating and Extracting Data with BeautifulSoup

Once you’ve parsed an HTML document with BeautifulSoup, navigating through the HTML tree becomes intuitive. In this section, we'll dive deeper into selecting, searching, and navigating HTML elements using BeautifulSoup's powerful features.

Using find() and find_all()

You’ve already seen how find() and find_all() work. Here’s a quick refresher:

# Find the first quote div
quote = soup.find('div', class_='quote')

# Find all quote divs
quotes = soup.find_all('div', class_='quote')

You can also search by multiple attributes:

soup.find_all('span', {'class': 'text', 'itemprop': 'text'})

Using CSS Selectors with select() and select_one()

You can also use CSS selectors to pinpoint elements:

# Select all quote texts using CSS selector
quote_texts = soup.select('div.quote span.text')

for q in quote_texts:
    print(q.text)

# Select the first author's name
author = soup.select_one('div.quote small.author')
print(author.text)

Accessing Attributes

Want to get the value of an href or other attribute? You can use .get():

link = soup.select_one('a')
print(link.get('href'))

Or for all author profile links:

for quote in quotes:
    link = quote.find('a')
    print(link['href'])  # Same as .get('href')

Navigating the DOM Tree

You can move around the DOM tree using BeautifulSoup’s navigation properties:

  • .parent – Gets the parent tag.

  • .children – Generator for a tag’s children.

  • .next_sibling / .previous_sibling – Navigate between siblings.

Example:

first_quote = quotes[0]
print(first_quote.find('span', class_='text').parent.name)  # Output: div (the quote container)
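
The other navigation properties work the same way. Here’s a short sketch that continues from the first_quote variable above; it assumes the quotes.toscrape.com markup, where each quote div contains a text span, an author block, and a tags div.

from bs4 import NavigableString

# .children yields tags *and* whitespace text nodes, so skip the bare strings
for child in first_quote.children:
    if not isinstance(child, NavigableString):
        print(child.name)  # span (quote text), span (author block), div (tags)

# .next_sibling is often just a newline string; find_next_sibling() jumps to the next tag
text_span = first_quote.find('span', class_='text')
print(text_span.find_next_sibling().name)  # span (the author block)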

These navigation features are useful when elements don't have consistent classes or IDs, and you need to rely on structure instead.

Summary

  • Use .find(), .find_all(), and .select() to search for elements.

  • Use .text to get the content, and .get('attribute') to extract attributes.

  • Navigate through the HTML tree using .parent, .children, and siblings.


Handling Pagination

Most websites display data across multiple pages. To collect everything, your scraper needs to follow “Next” page links or generate paginated URLs. In this section, you'll learn how to scrape data from multiple pages automatically.

We’ll continue using http://quotes.toscrape.com, which uses simple URL-based pagination (/page/1, /page/2, etc.).

Step 1: Inspect the Pagination

If you look at the bottom of the page, you’ll see a Next → button. It looks like this in HTML:

<li class="next">
  <a href="/page/2/">Next →</a>
</li>

We can use this pattern to loop through each page until the Next link no longer exists.

Step 2: Loop Through All Pages

Here’s how to write a loop to handle pagination:

import requests
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com'
next_page = '/'

while next_page:
    # Fetch and parse the page
    response = requests.get(base_url + next_page)
    soup = BeautifulSoup(response.text, 'lxml')

    # Extract quotes
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        print(f'"{text}" — {author}')

    # Check for next page
    next_button = soup.find('li', class_='next')
    if next_button:
        next_page = next_button.find('a')['href']
    else:
        next_page = None

This will automatically go through all 10 pages of the quote site and print every quote.
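
If you already know the URL pattern and the page count, an alternative is to generate the page URLs directly instead of following the Next link. Here’s a minimal sketch, continuing from the script above and assuming the site’s 10-page layout; the Next-link approach is more robust when the page count isn’t known in advance.

# Alternative: build the page URLs directly when the pattern is known up front
for page in range(1, 11):  # quotes.toscrape.com currently has 10 pages
    response = requests.get(f'{base_url}/page/{page}/')
    soup = BeautifulSoup(response.text, 'lxml')
    for quote in soup.find_all('div', class_='quote'):
        print(quote.find('span', class_='text').text)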

Step 3: Throttle Requests (Optional but Recommended)

To be a respectful scraper and avoid hammering the server, you should add a small delay between requests:

import time
time.sleep(1)  # Sleep for 1 second between pages

Summary

  • Pagination is handled by looping until the "Next" button disappears.

  • The scraper follows relative links like /page/2/.

  • You can add time.sleep() to scrape politely and avoid being blocked.


Using Headers and Avoiding Blocks

When scraping real-world websites, you’ll often run into issues where your requests are blocked, redirected, or return unexpected content. This usually happens because websites try to detect bots or scrapers.

To reduce the chances of being blocked, you should:

  1. Spoof a user-agent

  2. Respect delays between requests

  3. Avoid hitting servers too frequently

  4. Rotate user-agents and IPs (for advanced use)

Step 1: Add Headers (User-Agent)

Web servers often block requests with no or suspicious headers. Adding a browser-like User-Agent makes your scraper look more like a real browser.

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}

response = requests.get('http://quotes.toscrape.com', headers=headers)

Add this to all your requests.get() calls in your scraper.

Step 2: Handle Failed Requests Gracefully

Use status codes and try/except blocks to detect problems like rate limits or broken pages:

# Inside the pagination loop:
try:
    response = requests.get(url, headers=headers, timeout=5)
    response.raise_for_status()  # Raise an error for non-2xx status codes
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
    continue  # Skip to next page or retry later
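
If a page fails only intermittently, you can also retry it a few times before giving up. The helper below is a minimal sketch of that idea; the fetch_with_retries name, retry count, and backoff values are illustrative and not part of the tutorial’s scraper.

import time
import requests

def fetch_with_retries(url, headers=None, retries=3, backoff=2):
    """Return the response on success, or None after all retries fail."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=5)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
            time.sleep(backoff * attempt)  # wait a little longer after each failure
    return None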

Step 3: Add a Delay Between Requests

This helps prevent overloading servers and getting IP-banned.

import time
time.sleep(1)  # 1 second delay between requests

If scraping many pages, consider using random.uniform() for variable delays:

import random
time.sleep(random.uniform(1, 3))  # Delay between 1 and 3 seconds

Optional: Use Proxies and Rotating User-Agents

For heavy scraping or aggressive targets, you can rotate:

  • User-Agents: To avoid pattern detection

  • Proxies or VPNs: To rotate IP addresses

This requires external libraries or paid services like:

  • fake-useragent for random headers

  • scraperapi, Bright Data, or free proxy lists
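
As a rough illustration only: the sketch below rotates the User-Agent with the fake-useragent package (installed separately with pip install fake-useragent) and routes the request through a proxy. The proxy address is a placeholder; substitute a real proxy or proxy service.

import requests
from fake_useragent import UserAgent

ua = UserAgent()

# Placeholder proxy address (replace with a real proxy or proxy service)
proxies = {
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080',
}

headers = {'User-Agent': ua.random}  # a different browser-like User-Agent each time
response = requests.get('http://quotes.toscrape.com', headers=headers, proxies=proxies, timeout=5)
print(response.status_code)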

Summary

  • Always use realistic headers (especially User-Agent)

  • Add error handling to detect and skip failed pages

  • Use time delays to avoid triggering rate limits

  • Rotate user-agents and IPs if you're scraping large volumes or strict sites


Best Practices and Ethical Considerations

Web scraping is a powerful tool, but with great power comes great responsibility. While it's technically easy to scrape most public websites, it’s important to follow ethical guidelines and legal boundaries to avoid violating terms of service or causing harm to the sites you’re targeting.

Here are the key practices you should always follow:

1. Check the Site’s robots.txt

Most websites have a robots.txt file that defines which parts of the site can or cannot be accessed by automated tools.

You can view it by visiting:
https://example.com/robots.txt

Look for lines like:

User-agent: *
Disallow: /private/

While robots.txt is not legally binding, it’s a clear signal of the site owner’s intent. You should respect it unless you have explicit permission.
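
You can also check robots.txt from Python using the standard library’s urllib.robotparser. Here’s a small sketch; the user-agent string and URLs are just examples.

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('http://quotes.toscrape.com/robots.txt')
robots.read()

# True if this user-agent is allowed to fetch the given URL
print(robots.can_fetch('MyScraperBot', 'http://quotes.toscrape.com/page/2/'))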

2. Read the Terms of Service

Some websites explicitly forbid scraping in their terms. Ignoring this could lead to legal consequences, especially for commercial use. Always check the site’s Terms of Use before deploying your scraper publicly or at scale.

3. Don’t Overload the Server

Aggressive scraping can put unnecessary strain on a website’s infrastructure. To avoid this:

  • Add a delay (1–3 seconds) between requests.

  • Don’t run scrapers in rapid parallel threads unless allowed.

  • Avoid scraping sensitive endpoints or login-protected pages.

4. Identify Yourself When Needed

If you're scraping data for legitimate, transparent use (like academic research or public tools), consider including a contact email in your headers or using identifiable user-agent strings.

Example:

headers = {
    'User-Agent': 'MyScraperBot/1.0 (+https://example.com/contact)'
}

5. Cache and Reuse Data Locally

If you're running the same scraper repeatedly, avoid hitting the same pages over and over. Cache the data locally and only re-fetch if the content changes or needs updating.
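
One simple way to do this, sketched below on the assumption that storing raw HTML on disk fits your use case, is to key cached pages by a hash of the URL and re-download only when the cached copy is older than a chosen threshold; the names and the one-day threshold are illustrative.

import hashlib
import time
from pathlib import Path

import requests

CACHE_DIR = Path('cache')
CACHE_DIR.mkdir(exist_ok=True)
MAX_AGE = 24 * 60 * 60  # re-fetch pages older than one day

def get_html(url):
    """Return HTML for url, using a simple file cache keyed by the URL's hash."""
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + '.html')
    if cache_file.exists() and time.time() - cache_file.stat().st_mtime < MAX_AGE:
        return cache_file.read_text(encoding='utf-8')
    html = requests.get(url, timeout=5).text
    cache_file.write_text(html, encoding='utf-8')
    return html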

6. Don't Use Scraped Data for Harmful Purposes

Avoid scraping or redistributing content for:

  • Spamming or phishing

  • Fake news, plagiarism, or misinformation

  • Violating copyrights or data privacy laws (like scraping personal info)

Summary

  • Respect robots.txt and Terms of Service.

  • Don’t overload or harm the servers you’re scraping.

  • Add delays, error handling, and ethical identifiers.

  • Use the data responsibly and legally.


Conclusion and Full Code Example

Congratulations! You've just built a fully functional and ethical web scraper using Python, BeautifulSoup, and Requests. You’ve learned how to:

  • Fetch HTML pages with requests

  • Parse and extract content using BeautifulSoup

  • Navigate the DOM with search methods and CSS selectors

  • Handle pagination for scraping multi-page datasets

  • Use headers and delays to avoid getting blocked

  • Follow best practices and respect website guidelines

💻 Full Working Code Example

Here’s a complete scraper that collects all quotes and authors from http://quotes.toscrape.com across multiple pages:

import requests
from bs4 import BeautifulSoup
import time
import random

base_url = 'http://quotes.toscrape.com'
next_page = '/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}

while next_page:
    try:
        # Fetch page with headers
        response = requests.get(base_url + next_page, headers=headers, timeout=5)
        response.raise_for_status()

        # Parse the page content
        soup = BeautifulSoup(response.text, 'lxml')
        quotes = soup.find_all('div', class_='quote')

        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            print(f'"{text}" — {author}')

        # Find the next page link
        next_button = soup.find('li', class_='next')
        next_page = next_button.find('a')['href'] if next_button else None

        # Be respectful
        time.sleep(random.uniform(1, 3))

    except requests.exceptions.RequestException as e:
        print(f"Error fetching page: {e}")
        break

What’s Next?

Now that you’ve got the fundamentals down, here are a few ideas to explore:

  • Export data to CSV or JSON (see the sketch after this list)

  • Scrape more complex sites with dynamic content (use Selenium or Playwright)

  • Create a scraper for your projects, like:

    • Price monitoring tools

    • Job board aggregators

    • Academic data collectors
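
For the first idea, here’s a minimal sketch of writing quotes to a CSV file with the standard csv module; it assumes you’ve collected each quote into a dictionary instead of printing it, and the sample row is only a placeholder.

import csv

# e.g. built inside the scraping loop with rows.append({'text': text, 'author': author})
rows = [
    {'text': 'The world as we have created it is a process of our thinking.', 'author': 'Albert Einstein'},
]

with open('quotes.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['text', 'author'])
    writer.writeheader()
    writer.writerows(rows)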

Final Tips

  • Always test your scraper slowly and respectfully.

  • Keep your code modular and maintainable.

  • Stay updated with site structure changes.

  • Respect website owners and privacy policies.

You can get the full source code on our GitHub.

Happy Scraping!