Intro
Web scraping is a powerful tool for automated data collection, allowing us to extract information from websites programmatically. In Python, one of the most popular libraries for web scraping is BeautifulSoup: it is simple to use and flexible in how it handles HTML and XML elements. This practical introduction will guide you through the essentials of web scraping with Python's BeautifulSoup package. In this article we will walk through a simple yet effective web scraper that fetches temperature data from a table of global cities on timeanddate.com and stores the results in a CSV file.
Ensure you have Python installed on your computer. If not, you can download it from python.org. To follow along you will also need BeautifulSoup and the requests library (plus pandas, which we use at the end to save the results); all three can be installed with pip, Python's package installer. Simply run pip install beautifulsoup4 requests pandas in your command line interface.
It is recommended that you use a Jupyter notebook for this tutorial so you can run the cells one at a time and properly inspect the output. OK, now let's get stuck in and pull in the data.
Getting the Soup
import requests
from bs4 import BeautifulSoup
import pandas as pd

# fetch the weather overview page and parse its HTML
response = requests.get('https://www.timeanddate.com/weather/')
soup = BeautifulSoup(response.text, 'html.parser')
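Before parsing, it is worth confirming the request actually succeeded; a minimal check using requests' built-in helper might look like this:
# raise an exception for a 4xx/5xx response instead of silently parsing an error page
response.raise_for_status()
print(response.status_code)  # 200 means the page came back OK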
Once we have the page pulled into Python, we have some thinking to do. The BeautifulSoup library isn't a magic wand; there is still a lot of work to do to get meaningful data out of it. Upon inspecting the page, we see that the data of interest (city name, temperature) is organized in a table, which implies that the relevant city and temperature will be found within <td></td> elements representing table data.
Below we use the find_all() method to narrow down our search.
tds = soup.find_all('td')
print(f"there are {len(tds)} table data elements")
'''
there are 564 table data elements
'''
Extracting the Information
It looks like we are making some progress. Now let's print out the first 10 elements to see if we can find some sort of pattern.
for td in tds[:10]:
    print(td)
'''
<td><a href="/weather/ghana/accra">Accra</a><span class="wds" id="p0s"></span></td>
<td class="r" id="p0">Sat 05:47</td>
<td class="r"><img alt="Clear. Warm." height="40" src="//c.tadst.com/gfx/w/svg/wt-13.svg" title="Clear. Warm." width="40"/></td>
<td class="rbi">27 °C</td>
<td><a href="/weather/canada/edmonton">Edmonton</a><span class="wds" id="p47s"></span></td>
<td class="r" id="p47">Fri 22:47</td>
<td class="r"><img alt="Passing clouds. Cold." height="40" src="//c.tadst.com/gfx/w/svg/wt-14.svg" title="Passing clouds. Cold." width="40"/></td>
<td class="rbi">-4 °C</td>
<td><a href="/weather/india/new-delhi">New Delhi</a><span class="wds" id="p94s"></span></td>
<td class="r" id="p94">Sat 11:17</td>
'''
Now we are getting even closer: it appears we have found the necessary pattern. It is as follows:
- The presence of a link element <a></a> indicates we have found a new place; between the <a> tags we find the place name, as in <a>Accra</a>, <a>Edmonton</a> and <a>New Delhi</a> above
- The third <td> after the one containing the <a> tag holds the temperature in degrees Celsius, inside a <td> with the class "rbi" (see the aside just after this list)
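As an aside, since every temperature sits in a cell with the class "rbi", BeautifulSoup's find_all() can also filter by class directly; a small sketch, assuming that class is used only for temperature cells on this page:
# grab just the temperature cells via their CSS class
temp_cells = soup.find_all('td', class_='rbi')
print(temp_cells[0].get_text())  # e.g. '27\xa0°C' for the first city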
So now most of the hard work has actually been completed; all we need to do is write a simple script to extract the data. First we create a small helper function to parse the temperature. Note that \xa0 represents a non-breaking space, used in HTML to ensure the characters on either side stay on the same line.
def extract_temp_as_float(temp):
    # '27\xa0°C' -> 27.0: split on the non-breaking space and keep the number
    return float(temp.split("\xa0")[0])
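To see what the helper does, here is a quick check on strings shaped like the scraped cells (note the \xa0 between the number and the unit):
# '27\xa0°C' -> 27.0, '-4\xa0°C' -> -4.0
print(extract_temp_as_float('27\xa0°C'))
print(extract_temp_as_float('-4\xa0°C'))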
temps = []
current_city = None

for td in tds:
    # check if there is an <a> present in the td
    if td.find('a'):
        # if we have a link, then we extract the city name
        current_city = td.get_text().strip()
    # if we have a city and this is a temperature cell
    elif 'rbi' in td.get('class', []) and current_city:
        temp = td.get_text().strip()
        # add the city and the temp (passed through the helper function)
        # to our list of temperatures
        temps.append({'city': current_city, 'temp': extract_temp_as_float(temp)})
        current_city = None  # reset the current city

print(temps[:3])
The first three results are shown below. We will assume the rest follow the same pattern, and verify a few more data points once we have the file in CSV format.
[{'city': 'Accra', 'temp': 27.0},
{'city': 'Edmonton', 'temp': -4.0},
{'city': 'New Delhi', 'temp': 13.0}]
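Before saving, a quick optional sanity check to confirm every entry parsed to a number:
# every scraped entry should now hold a float temperature
assert all(isinstance(entry['temp'], float) for entry in temps)
print(f"collected {len(temps)} city temperatures")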
Saving Data to File
It appears our script has worked. Now all we need to do is save the data as a CSV file so we don't lose it.
df = pd.DataFrame(temps)
print(df.head())
print(df.tail())
"""
city temp
0 Accra 27.0
1 Edmonton -4.0
2 New Delhi 13.0
3 Addis Ababa 15.0
4 Frankfurt 9.0
city temp
134 Zürich 8.0
135 Dubai 23.0
136 Nairobi 19.0
137 Dublin 5.0
138 Nassau 20.0
"""
To save it, run the line below. The file will be written to your current working directory, and you can then open it in Excel or a text editor.
df.to_csv('global_temps.csv', index=False)
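As a final check, you can read the file straight back in with pandas and spot-check a few rows against the website:
# read the CSV back in and sample a few rows to verify
df_check = pd.read_csv('global_temps.csv')
print(df_check.sample(3))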
And that's it: you have created a simple web scraper using BeautifulSoup in Python!