Delicious Scrapers with BeautifulSoup Library

Web-Scraping with BeautifulSoup

This article is a beginner's walk-through guide on how to scrape web pages with the help of the BeautifulSoup module in Python. I hope that by the end of this tutorial, you will be able to scrape data from a static web page using the requests and BeautifulSoup libraries, and export that data into a structured text file using the pandas library.

Below is the roadmap for this article.
- What is web scraping?
- Examining the New York Times article
- Reading the web page into Python
- Parsing the HTML using Beautiful Soup
- Building the dataset
- Summary: 16 lines of Python code


What is Web Scraping?

It’s the process of extracting information from a web page by taking advantage of patterns in the web page’s underlying code. It is a technique for extracting large amounts of data from websites and saving it in a structured form for later use.

Examining the New York Times article

By going to the following link, https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html, we can examine the article and see that each recorded lie can be broken down into four fields.

  1. The date of the lie.
  2. The lie itself (as a quotation).
  3. The writer’s brief explanation of why it was a lie.
  4. The URL of an article that substantiates the claim that it was a lie.

Why does the formatting matter?

Because it’s very likely that the code underlying the web page “tags” those fields differently, and we can take advantage of that pattern when scraping the page. Let’s take a look at the source code, known as HTML.
To view the HTML code that generates a web page, right-click on the page and select “View Page Source” in Chrome or Firefox, “View Source” in Internet Explorer, or “Show Page Source” in Safari. (If that option doesn’t appear in Safari, just open Safari’s preferences, select the Advanced tab, and check “Show Develop menu in menu bar”.)

To scrape a web page you should know at least the basics of HTML, which I am not covering here; you can go through my IPython notebook, which walks through web scraping in full detail.


import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

The code above fetches our web page from the URL, and stores the result in a “response” object called r. That response object has a text attribute, which contains the same HTML code we saw when viewing the source from our web browser.
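As a quick optional sanity check (not part of the final 16 lines), you can confirm that the request succeeded and peek at the beginning of that HTML:

# Continuing from the snippet above, where r was created.
print(r.status_code)   # 200 means the page was fetched successfully
print(r.text[:500])    # the same HTML you see with "View Page Source"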

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

The code above parses the HTML (stored in `r.text`) into a special object called `soup` that the BeautifulSoup library understands. In other words, Beautiful Soup is reading the HTML and making sense of its structure. (Note that `html.parser` is the parser included with the Python standard library; Beautiful Soup can use other parsers as well.)
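For example (a small illustration, not part of the final script), the `soup` object lets us navigate the parsed document and pull out tagged elements. The lies in this article are tagged with the class `short-desc`, which we rely on below:

# Continuing from the snippet above, where soup was created.
print(soup.title.text)                                     # the page's <title> text
first = soup.find('span', attrs={'class': 'short-desc'})   # the first tagged lie
print(first)                                               # its raw HTML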

Now that we know the structure of the article, we can extract the information. First we extract the date, the lie, the explanation, and the URL for a single record (as sketched below), then we loop over the whole article and append each record to a list. Finally, we store this information in a CSV file using the pandas library, after which we can apply whatever operations we like.
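Here is a sketch of that extraction for a single record, using the `soup` object from above. The slicing trims stray characters around each field (for instance, the character trailing the date, the quotation marks around the lie, and the parentheses around the explanation):

first = result = soup.find('span', attrs={'class': 'short-desc'})  # one record
date = first.find('strong').text[0:-1] + ', 2017'     # trim the trailing character, add the year
lie = first.contents[1][1:-2]                          # text node after <strong>, minus surrounding quotes
explanation = first.find('a').text[1:-1]               # link text, minus surrounding parentheses
url = first.find('a')['href']                          # the URL of the substantiating article
print(date, lie, explanation, url, sep='\n')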

# Here are the 16 lines of code that we used to scrape the web page,
# extract the relevant data, convert it into a tabular dataset, and
# export it to a CSV file:
import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class': 'short-desc'})

records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
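As an optional check (not part of the 16 lines), you can read the CSV back in to confirm the export worked:

df = pd.read_csv('trump_lies.csv', parse_dates=['date'])  # reload the exported file
print(df.shape)    # number of records and the 4 columns
print(df.head())   # first few rows of the dataset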

You can download my IPython notebook by visiting my GitHub link:

https://github.com/akshayakn13/WebScraping
