Beautiful Soup and Python 3 Tutorial

Beautiful Soup is a Python package that allows you to parse HTML and XML files. It builds a parse tree for parsed pages, which can be used to extract data from HTML and is useful for web scraping.

We’ll go over how to do web scraping with Python from the ground up in this tutorial.
Then, using weather data as an example, we’ll work on a real-world web scraping project.

What is Web Scraping?


Web scraping refers to the process of extracting and processing large quantities of data from the internet using software or an algorithm. If you find data on the web that you can’t download directly, web scraping with Python is a skill you can use to convert that data into a usable format you can import.

Setup environment

virtualenv is used to manage Python packages for different projects. By using virtualenv instead of installing Python packages globally, you can avoid breaking system tools or other projects. You can install virtualenv with pip.

# beautiful soup python setup example

# install and create a virtual environment
pip install virtualenv


# make a project directory
mkdir soup
cd soup

# create a virtual environment
virtualenv venv


# activate the virtual environment

# macos
source venv/bin/activate

# windows
venv\Scripts\activate

# to deactivate the virtual environment (if needed)
deactivate




Once you create the project folder and activate the virtual environment inside it, your prompt will look like this.

# virtual env

#in our case the project_folder_name is soup
(venv) username@dell:~/Desktop/project_folder_name

##

Now you are in the virtual environment. You can add or remove packages here; they are independent of your global settings and configuration.

Add a Python file; for example, we will create a soup.py file.

# create file

nano soup.py
#OR
touch soup.py

##

Within this file, we will import two libraries: Requests and Beautiful Soup. You can install Requests and Beautiful Soup via pip in the terminal.

# example install modules

#install requests
pip install requests

#install Beautiful Soup
pip install beautifulsoup4

#install html5lib (an optional alternative parser)
pip install html5lib


##

The Requests library makes it easy to use HTTP in Python in a human-readable way, and the Beautiful Soup module helps you scrape the web quickly.

With the import statement, we’ll import both Requests and Beautiful Soup. Beautiful Soup will be imported from bs4, the package that contains Beautiful Soup 4.

import requests
from bs4 import BeautifulSoup

url = 'https://www.nokia.com/phones/en_pk'

# get request
r = requests.get(url)

# get html content
htmlContent = r.content

soup = BeautifulSoup(htmlContent, 'html.parser')

##

Now we have all the HTML content of the web page, and we can access the data inside it. For example, we will display the phone names from the Nokia site.

## access the headings, which contain the phone names
 
phones = soup.find_all('h3', {'class': 'css-17c0ng7-Heading'})
for phone in phones:
    print(phone)

##

The above loop will output something like this. We are selecting h3 headings with class 'css-17c0ng7-Heading'. Note that we are getting the full HTML tags as well.

<!--  Output -->

<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia X20</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia X10</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia G20</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia G10</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia C20</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia C10</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia 5310</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia 8000 4G</h3>
<h3 class="css-17c0ng7-Heading e1tf4vg61">Nokia 6300 4G</h3>

<!-- /output -->

We can extract only the text with the get_text() method.

##

phones = soup.find_all('h3', {'class': 'css-17c0ng7-Heading'})
for phone in phones:
    # print(phone)
    print(phone.get_text())

##

The output will be a simple list of phone names.

##

#List of h3 headings
Nokia X20
Nokia X10
Nokia G20
Nokia G10
Nokia C20
Nokia C10
Nokia 5310
Nokia 8000 4G
Nokia 6300 4G

##

Beautiful Soup: a list of methods

In this section we will learn how to access the various sections of the Document Object Model (DOM). BeautifulSoup comes with many methods and attributes for accessing data.

Here is a list of some BeautifulSoup methods.

BeautifulSoup prettify() method

The prettify() method formats the HTML code: it returns the parse tree as a nicely indented, human-readable string.
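As a minimal, self-contained sketch (the HTML string below is invented for illustration, so you don't need a network request):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# prettify() returns the document as an indented, human-readable string
print(soup.prettify())
```

Each tag and text node is printed on its own indented line, which makes it much easier to inspect the structure of a scraped page.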

BeautifulSoup: Accessing HTML Tags

We can easily find and access the content of various HTML tags such as head, title, div, p, and h1 using the BeautifulSoup module. Let’s look at a quick example where we’ll print the webpage’s title tag.

## getting 'title' tag

title_tag = soup.title
print(title_tag)

#Output: <title>The latest Nokia Android smartphones and mobile phones</title>

## getting 'title' tag Text only
title_text = soup.title.text
print(title_text)

#Output: The latest Nokia Android smartphones and mobile phones

Similarly, we can access the head, div, p, and heading tags in the same manner.

# getting 'title' tag text
print(soup.head.title.text)

#Output: The latest Nokia Android smartphones and mobile phones

# getting inline CSS from head tag
print(soup.head.style)

# getting first div tag
print(soup.find('div'))

# getting all div tags
print(soup.find_all('div'))
#OR
print(soup.findAll('div'))

# getting first p tag
print(soup.find('p'))

# getting all p tags
print(soup.find_all('p'))
#OR
print(soup.findAll('p'))

Accessing HTML Tag Attributes

Using the following syntax, we can get the attributes of any HTML tag:
TagName["AttributeName"]
In our HTML code, let’s extract the href="" attribute from the anchor tag.

##

# get the anchor tag
link = soup.a

#Output: <a href="/phones/en_pk"><svg aria-label="Nokia" class="icon" focusable="false" viewbox="0 0 105 18"><use xlink:href="#nokia"></use></svg><span>Phones</span></a>

# print the 'href' attribute of the anchor tag
print(link["href"])

#Output: /phones/en_pk

##
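As a standalone sketch (the HTML string is invented for illustration), a tag also exposes a .get() method and an .attrs dictionary, which are handy when an attribute might be missing:

```python
from bs4 import BeautifulSoup

html = '<a href="/phones/en_pk" class="nav">Phones</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.a

print(link['href'])       # indexing works when the attribute exists
print(link.get('title'))  # .get() returns None instead of raising a KeyError
print(link.attrs)         # dictionary of all attributes on the tag
```

Prefer .get() when scraping real pages, since not every tag carries every attribute you expect.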

The contents attribute

The contents attribute returns all of the child tags of a parent tag. Using contents, we can get a list of all the child HTML tags of a tag such as head or body.

#Using the contents method

the_head = soup.head
all_tags = the_head.contents

for i in all_tags:
    print(i)

## Output: prints all the tags inside the <head> tag

The children attribute

The contents attribute is similar to the children attribute, except that contents returns a list of all the children, while children returns an iterator.

#Using the children method

the_head = soup.head
all_children = the_head.children

for i in all_children:
    print(i, '\n')

## Output: prints all the tags inside the <head> tag
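The difference is easiest to see on a small standalone snippet (the HTML string is made up for illustration):

```python
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

print(type(ul.contents))  # a plain list
print(type(ul.children))  # an iterator, consumed as you loop over it
print(len(ul.contents))   # 2 -- both <li> tags
```

Because children is an iterator, you can loop over it only once per access; contents can be indexed and reused like any list.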

The descendants attribute

The descendants attribute is useful for retrieving all of a parent tag’s descendants, not just its direct children. It looks similar to children and contents, but it works differently: if we use it on the body tag, it yields the first div tag, then that div’s children, and their children, until it reaches the end of that subtree, after which it moves on to the next div tag, and so on.

#Using the descendants method

the_head = soup.head
all_descendants = the_head.descendants

for i in all_descendants:
    print(i, '\n')

##
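To see the depth-first order described above without fetching a page, here is a small sketch on an invented nested snippet:

```python
from bs4 import BeautifulSoup, Tag

html = "<div><p><b>bold</b></p><span>text</span></div>"
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

# children yields only the direct children: <p> and <span>
print([child.name for child in div.children])

# descendants walks the whole subtree depth-first:
# <p>, <b>, the text 'bold', <span>, the text 'text'
for node in div.descendants:
    print(node.name if isinstance(node, Tag) else repr(node))
```

Note that descendants yields text nodes as well as tags, which is why the isinstance check is useful when you only care about tags.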

The parent attribute

The parent attribute is used to get the parent tag. Printing it directly will show everything inside the parent tag, so in the following example we print only the name of the parent tag.

##

the_head = soup.head
head_parent = the_head.parent

# use 'name' method to get the name of the tag
print(head_parent.name)

#Output: html
##

The parents attribute

The parents attribute is used to get all the ancestor tags. It gives you a generator as a result. Consider the following example:

##

the_head = soup.head
head_parents = the_head.parents

# use 'name' method to get the name of the tag
# If the child has more than one parent, all of their names will be printed.
for i in head_parents:
    print(i.name, '\n')

#Output: html
#       [document]
##

The sibling attributes

There are four sibling attributes.

  • next_sibling
  • previous_sibling
  • next_siblings
  • previous_siblings

As their names suggest, they work on HTML tag siblings.

The next_sibling attribute is used to get the next tag under the same parent.

The previous_sibling attribute is used to get the previous sibling tag. Both return a single element (or None).

##

metaTag = soup.find('meta')
print(metaTag.next_sibling)
#Output: None
#returns None if there is no next sibling

#previous_sibling example
title = soup.title
print(title.previous_sibling)
#Output: the node immediately before <title> inside <head>

##
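On a small standalone snippet (invented for illustration), the sibling relationships are easier to see:

```python
from bs4 import BeautifulSoup

html = "<div><h1>Title</h1><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p                 # the first <p> tag
print(first_p.next_sibling)      # the second <p> tag
print(first_p.previous_sibling)  # the <h1> tag
```

Be aware that on real pages, whitespace between tags counts as a text node, so next_sibling may return a string rather than the next tag.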

The next_siblings and previous_siblings attributes are similar to the attributes mentioned above, except that they return a generator of all available siblings.

You can use a loop to display their content.

##

title = soup.title

# previous_siblings returns a generator, not a single tag
print(title.previous_siblings)
#Output: <generator object PageElement.previous_siblings at 0x7f8860e21190>

# loop over the generator to see each sibling
for sibling in title.previous_siblings:
    print(sibling)

##
