Web scraping is a powerful technique used to extract data from websites. Whether you're gathering product information from e-commerce platforms, pulling headlines from news sites, or collecting research data, web scraping enables you to automate the process efficiently.
In this tutorial, we’ll walk you through a complete, step-by-step guide to web scraping using two of the most popular Python libraries: BeautifulSoup and Requests. You’ll learn how to fetch and parse HTML content, navigate through HTML elements, extract data, and handle common issues like pagination and request headers. By the end, you'll be equipped to build your own web scraper for a wide range of real-world scenarios.
This tutorial is ideal for Python developers—both beginners and intermediates—who want to expand their skill set into automation, data extraction, and scripting.
What You’ll Learn:
- The fundamentals of HTTP requests and HTML parsing
- How to install and use requests and BeautifulSoup
- Navigating the DOM tree and extracting data from HTML
- Handling pagination, headers, and common scraping challenges
- Best practices and ethical considerations for web scraping
Let’s get started and turn the web into your personal data source!
Prerequisites and Tools
Before we dive into the code, let’s make sure your development environment is set up and you're familiar with a few key concepts.
Prerequisites
To follow along with this tutorial, you should have:
- Basic Python knowledge – Familiarity with Python syntax, variables, functions, and loops.
- Python 3 installed – You can download the latest version from python.org.
- A code editor – VS Code, PyCharm, or even a simple text editor like Sublime Text will work just fine.
- Access to the terminal or command prompt – For installing libraries and running scripts.
No prior experience with HTML or web scraping is required; however, a basic understanding of HTML tags and structure will be helpful.
Tools and Libraries
We'll be using the following Python libraries:
- Requests – A simple HTTP library for sending GET and POST requests.
- BeautifulSoup (bs4) – A powerful HTML/XML parser that makes navigating and searching the DOM tree straightforward.
Install both libraries using pip:
mkdir web-scraping
cd web-scraping
python3 -m venv path/to/venv
source path/to/venv/bin/activate
pip install requests beautifulsoup4
We’ll also use:
- lxml (optional but recommended) – A fast parser for better performance with BeautifulSoup.
To install it:
pip install lxml
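To confirm the installation worked, you can run a quick, optional check in the Python interpreter. It only verifies that both libraries import and reports their versions:
import requests
import bs4

# Print the installed versions to confirm both packages are available
print(requests.__version__)
print(bs4.__version__)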
With your environment set up and tools ready, let's move on to fetching your first web page!
Fetching and Parsing a Web Page
Now that your environment is set up, let’s start scraping by fetching and parsing a simple web page using the requests and BeautifulSoup libraries.
Step 1: Import the Required Libraries
Create a new Python file called scraper.py and add the following imports:
import requests
from bs4 import BeautifulSoup
Step 2: Send a GET Request to a Web Page
Use the requests.get() method to fetch the HTML content of a target URL. For this example, let’s use a demo site: http://quotes.toscrape.com, which is specifically designed for practicing web scraping.
url = 'http://quotes.toscrape.com'
response = requests.get(url)
# Print the response status and raw HTML
print(response.status_code)
print(response.text)
You should see a 200 status code, followed by the page’s HTML content printed to the console.
Step 3: Parse the HTML with BeautifulSoup
Now, let’s parse the HTML using BeautifulSoup so we can extract specific elements:
soup = BeautifulSoup(response.text, 'lxml') # or 'html.parser' if lxml is not installed
# Print the page title
print(soup.title.text)
You should see:
Quotes to Scrape
Step 4: Extract Quotes from the Page
Let’s find all the quotes on the page. Each quote is wrapped inside a <div class="quote"> tag.
quotes = soup.find_all('div', class_='quote')
for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    print(f'"{text}" — {author}')
This will output:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” — Albert Einstein
...
Summary
- We used requests to fetch the page content.
- Parsed the HTML with BeautifulSoup.
- Used .find_all() and .find() to locate specific HTML elements and extract their contents.
Navigating and Searching the DOM
Once you’ve parsed an HTML document with BeautifulSoup, navigating through the HTML tree becomes intuitive. In this section, we'll dive deeper into selecting, searching, and navigating HTML elements using BeautifulSoup's powerful features.
Using find() and find_all()
You’ve already seen how find() and find_all() work. Here’s a quick refresher:
# Find the first quote div
quote = soup.find('div', class_='quote')
# Find all quote divs
quotes = soup.find_all('div', class_='quote')
You can also search by multiple attributes:
soup.find_all('span', {'class': 'text', 'itemprop': 'text'})
Using CSS Selectors with select() and select_one()
You can also use CSS selectors to pinpoint elements:
# Select all quote texts using CSS selector
quote_texts = soup.select('div.quote span.text')
for q in quote_texts:
    print(q.text)
# Select the first author's name
author = soup.select_one('div.quote small.author')
print(author.text)
Accessing Attributes
Want to get the value of an href or other attribute? You can use .get():
link = soup.select_one('a')
print(link.get('href'))
Or for all author profile links:
for quote in quotes:
    link = quote.find('a')
    print(link['href'])  # Same as .get('href')
Navigating the DOM Tree
You can move around the DOM tree using BeautifulSoup’s navigation properties:
- .parent – Gets the parent tag.
- .children – Generator for a tag’s children.
- .next_sibling / .previous_sibling – Navigate between siblings.
Example:
first_quote = quotes[0]
print(first_quote.find('span', class_='text').parent.name)  # Output: div (the quote container)
These navigation features are useful when elements don't have consistent classes or IDs, and you need to rely on structure instead.
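For instance, here is a small sketch that walks the first quote with .children and sibling navigation. It assumes the markup used by quotes.toscrape.com, where the quote text and the author line sit in adjacent <span> tags:
first_quote = quotes[0]

# Print the tag names of the quote's direct children
for child in first_quote.children:
    if child.name:  # skip the whitespace-only text nodes between tags
        print(child.name)

# Jump from the quote text to the adjacent <span> that holds the author line
text_span = first_quote.find('span', class_='text')
author_line = text_span.find_next_sibling('span')
print(author_line.text.strip())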
Summary
- Use .find(), .find_all(), and .select() to search for elements.
- Use .text to get the content, and .get('attribute') to extract attributes.
- Navigate through the HTML tree using .parent, .children, and siblings.
Handling Pagination
Most websites display data across multiple pages. To collect everything, your scraper needs to follow “Next” page links or generate paginated URLs. In this section, you'll learn how to scrape data from multiple pages automatically.
We’ll continue using http://quotes.toscrape.com, which uses simple URL-based pagination (/page/1, /page/2, etc.).
Step 1: Inspect the Pagination
If you look at the bottom of the page, you’ll see a Next → button. It looks like this in HTML:
<li class="next">
    <a href="/page/2/">Next →</a>
</li>
We can use this pattern to loop through each page until the Next link no longer exists.
Step 2: Loop Through All Pages
Here’s how to write a loop to handle pagination:
import requests
from bs4 import BeautifulSoup
base_url = 'http://quotes.toscrape.com'
next_page = '/'
while next_page:
    # Fetch and parse the page
    response = requests.get(base_url + next_page)
    soup = BeautifulSoup(response.text, 'lxml')

    # Extract quotes
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        print(f'"{text}" — {author}')

    # Check for the next page
    next_button = soup.find('li', class_='next')
    if next_button:
        next_page = next_button.find('a')['href']
    else:
        next_page = None
This will automatically go through all 10 pages of the quote site and print every quote.
Step 3: Throttle Requests (Optional but Recommended)
To be a respectful scraper and avoid hammering the server, you should add a small delay at the end of each loop iteration:
import time
time.sleep(1) # Sleep for 1 second between pages
Summary
- Pagination is handled by looping until the "Next" button disappears.
- The scraper follows relative links like /page/2/.
- You can add time.sleep() to scrape politely and avoid being blocked.
Using Headers and Avoiding Blocks
When scraping real-world websites, you’ll often run into issues where your requests are blocked, redirected, or return unexpected content. This usually happens because websites try to detect bots or scrapers.
To reduce the chances of being blocked, you should:
- Spoof a user-agent
- Respect delays between requests
- Avoid hitting servers too frequently
- Rotate user-agents and IPs (for advanced use)
Step 1: Add Headers (User-Agent)
Web servers often block requests with no or suspicious headers. Adding a browser-like User-Agent makes your scraper look more like a real browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}
response = requests.get('http://quotes.toscrape.com', headers=headers)
Add this to all your requests.get() calls in your scraper.
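If you would rather not repeat the headers argument everywhere, one common option (a sketch, not required for this tutorial) is a requests.Session, which applies the same headers to every request it makes and also reuses the underlying connection:
session = requests.Session()
session.headers.update(headers)  # reuse the headers dict defined above

# Every request made through the session now carries the User-Agent header
response = session.get('http://quotes.toscrape.com')
print(response.status_code)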
Step 2: Handle Failed Requests Gracefully
Use status codes and try/except blocks to detect problems like rate limits or broken pages:
try:
    response = requests.get(url, headers=headers, timeout=5)
    response.raise_for_status()  # Raise an error if the status is not 2xx
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
    continue  # Skip to the next page or retry later (only valid inside a loop)
Step 3: Add a Delay Between Requests
This helps prevent overloading servers and getting IP-banned.
import time
time.sleep(1) # 1 second delay between requests
If scraping many pages, consider using random.uniform() for variable delays:
import random
time.sleep(random.uniform(1, 3)) # Delay between 1 and 3 seconds
Optional: Use Proxies and Rotating User-Agents
For heavy scraping or aggressive targets, you can rotate:
- User-Agents: To avoid pattern detection
- Proxies or VPNs: To rotate IP addresses
This requires external libraries or paid services like:
- fake-useragent for random headers
- scraperapi, Bright Data, or free proxy lists
A rough sketch of manual rotation is shown below.
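If you prefer to stay with the plain requests library, here is a standalone sketch of manual rotation. The user-agent strings and the proxy address are placeholders you would replace with your own values:
import random
import requests

# Placeholder pool of browser-like user-agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
]

# Placeholder proxy; requests expects a dict mapping scheme to proxy URL
proxies = {
    'http': 'http://your-proxy-host:8080',
    'https': 'http://your-proxy-host:8080',
}

headers = {'User-Agent': random.choice(user_agents)}
response = requests.get('http://quotes.toscrape.com', headers=headers, proxies=proxies, timeout=5)
print(response.status_code)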
Summary
- Always use realistic headers (especially User-Agent)
- Add error handling to detect and skip failed pages
- Use time delays to avoid triggering rate limits
- Rotate user-agents and IPs if you're scraping large volumes or strict sites
Best Practices and Ethical Considerations
Web scraping is a powerful tool, but with great power comes great responsibility. While it's technically easy to scrape most public websites, it’s important to follow ethical guidelines and legal boundaries to avoid violating terms of service or causing harm to the sites you’re targeting.
Here are the key practices you should always follow:
1. Check the Site’s robots.txt
Most websites have a robots.txt file that defines which parts of the site can or cannot be accessed by automated tools.
You can view it by visiting:
https://example.com/robots.txt
Look for lines like:
User-agent: *
Disallow: /private/
While robots.txt is not legally binding, it’s a clear signal of the site owner’s intent. You should respect it unless you have explicit permission.
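Python's standard library can check robots.txt for you via urllib.robotparser. Here is a small sketch (quotes.toscrape.com may not publish a robots.txt at all, in which case everything is treated as allowed):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://quotes.toscrape.com/robots.txt')
rp.read()

# True if the rules allow a generic crawler ('*') to fetch this URL
print(rp.can_fetch('*', 'http://quotes.toscrape.com/page/2/'))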
2. Read the Terms of Service
Some websites explicitly forbid scraping in their terms. Ignoring this could lead to legal consequences, especially for commercial use. Always check the site’s Terms of Use before deploying your scraper publicly or at scale.
3. Don’t Overload the Server
Aggressive scraping can put unnecessary strain on a website’s infrastructure. To avoid this:
- Add a delay (1–3 seconds) between requests.
- Don’t run scrapers in rapid parallel threads unless allowed.
- Avoid scraping sensitive endpoints or login-protected pages.
4. Identify Yourself When Needed
If you're scraping data for legitimate, transparent use (like academic research or public tools), consider including a contact email in your headers or using identifiable user-agent strings.
Example:
headers = {
    'User-Agent': 'MyScraperBot/1.0 (+https://example.com/contact)'
}
5. Cache and Reuse Data Locally
If you're running the same scraper repeatedly, avoid hitting the same pages over and over. Cache the data locally and only re-fetch if the content changes or needs updating.
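A simple way to do this is a small helper that saves the HTML to disk the first time and reads the local copy afterwards. This is just a sketch; the function name and cache file are made up for illustration:
import os
import requests

def fetch_cached(url, cache_file):
    """Return the page HTML, using a local copy if one already exists."""
    if os.path.exists(cache_file):
        with open(cache_file, 'r', encoding='utf-8') as f:
            return f.read()
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    with open(cache_file, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text

html = fetch_cached('http://quotes.toscrape.com', 'quotes_page1.html')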
6. Don't Use Scraped Data for Harmful Purposes
Avoid scraping or redistributing content for:
- Spamming or phishing
- Fake news, plagiarism, or misinformation
- Violating copyrights or data privacy laws (like scraping personal info)
Summary
- Respect robots.txt and Terms of Service.
- Don’t overload or harm the servers you’re scraping.
- Add delays, error handling, and ethical identifiers.
- Use the data responsibly and legally.
Conclusion and Full Code Example
Congratulations! You've just built a fully functional and ethical web scraper using Python, BeautifulSoup, and Requests. You’ve learned how to:
- Fetch HTML pages with requests
- Parse and extract content using BeautifulSoup
- Navigate the DOM with search methods and CSS selectors
- Handle pagination for scraping multi-page datasets
- Use headers and delays to avoid getting blocked
- Follow best practices and respect website guidelines
💻 Full Working Code Example
Here’s a complete scraper that collects all quotes and authors from http://quotes.toscrape.com across multiple pages:
import requests
from bs4 import BeautifulSoup
import time
import random
base_url = 'http://quotes.toscrape.com'
next_page = '/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}
while next_page:
    try:
        # Fetch page with headers
        response = requests.get(base_url + next_page, headers=headers, timeout=5)
        response.raise_for_status()

        # Parse the page content
        soup = BeautifulSoup(response.text, 'lxml')
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            print(f'"{text}" — {author}')

        # Find the next page link
        next_button = soup.find('li', class_='next')
        next_page = next_button.find('a')['href'] if next_button else None

        # Be respectful: pause between requests
        time.sleep(random.uniform(1, 3))
    except requests.exceptions.RequestException as e:
        print(f"Error fetching page: {e}")
        break
What’s Next?
Now that you’ve got the fundamentals down, here are a few ideas to explore:
- Export data to CSV or JSON (see the sketch after this list)
- Scrape more complex sites with dynamic content (use Selenium or Playwright)
- Create a scraper for your projects, like:
  - Price monitoring tools
  - Job board aggregators
  - Academic data collectors
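As a starting point for the CSV idea mentioned above, here is a small sketch that writes the quotes from the first page to a file using Python's built-in csv module:
import csv
import requests
from bs4 import BeautifulSoup

response = requests.get('http://quotes.toscrape.com')
soup = BeautifulSoup(response.text, 'html.parser')

# Write one row per quote, with a header row first
with open('quotes.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['quote', 'author'])
    for quote in soup.find_all('div', class_='quote'):
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        writer.writerow([text, author])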
Final Tips
- Always test your scraper slowly and respectfully.
- Keep your code modular and maintainable.
- Stay updated with site structure changes.
- Respect website owners and privacy policies.
You can get the full source code on our GitHub.
That's just the basics. If you want to learn Python and its frameworks in more depth, you can take one of the following affordable courses:
- Edureka's Django course helps you gain expertise in Django REST framework, Django Models, Django AJAX, Django jQuery, etc. You'll master the Django web framework while working on real-time use cases and receive a Django certification at the end of the course.
- Unlock your coding potential with Python Certification Training. Avail a flat 25% off with coupon code: TECHIE25
- Database Programming with Python
- Python Programming: Build a Recommendation Engine in Django
- Python Course: Learn Python by Building Games in Python
- Learn API Development with FastAPI + MySQL in Python
- Learn Flask, a Web Development Framework of Python
Happy Scraping!