Extract RSS feed from NOAA for CO2 update

Aug. 29, 2021

Experiment with the requests and feedparser packages to extract the latest CO2 concentration update from the NOAA website.

An exercise in web scraping: manipulate strings, convert them to Python objects such as datetime and float, and pack the data into a dataframe. Finally, the data is plotted with matplotlib.pyplot.


Index

  • the goal is to extract the RSS feed from the NOAA website and create a graph like the one below
In [80]:
# plt_2_img(fig=fig)
Out[80]: (figure: weekly CO2 concentration at Mauna Loa, current year vs. 1 and 10 years ago)

Requests

requests is a robust and flexible package for fetching a web page. Install it with: pip install requests

In [2]:
rss_url = 'https://gml.noaa.gov/webdata/ccgg/trends/rss.xml'

Visit NOAA webpage

  • the file is in XML format, with tag:value structures.
  • the outermost tag is rss > channel > then a list of item elements
  • we will fetch the page with the requests package and extract the data inside the item elements
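That rss > channel > item traversal can be sketched on a tiny made-up XML string first (the mini-feed below is invented purely for illustration; the real NOAA feed is fetched next):

```python
import xml.etree.ElementTree as ET

# a made-up feed with the same rss > channel > item nesting
xml_doc = '''<rss><channel>
  <title>demo</title>
  <item><title>first</title></item>
  <item><title>second</title></item>
</channel></rss>'''

root = ET.fromstring(xml_doc)
# an XPath-like expression walks from the root into channel, then items
items = root.findall('./channel/item')
print([item.find('title').text for item in items])  # ['first', 'second']
```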
In [3]:
import requests
In [4]:
# request the webpage
resp = requests.get(rss_url)
# check status code
resp.status_code
Out[4]:
200
In [5]:
# this is a long string
print(len(resp.text))
resp.text[:200]
5150
Out[5]:
'\n\n \n  NOAA/ESRL Trends in CO2\n  http://www.esrl.noaa.gov/gmd/ccgg/trend'

to parse the string, we will use the xml library

In [6]:
from pprint import pprint as pp
In [7]:
pp(resp.text[:600])
('\n'
 '\n'
 ' \n'
 '  NOAA/ESRL Trends in CO2\n'
 '  http://www.esrl.noaa.gov/gmd/ccgg/trends\n'
 '  \n'
 '  Recent Atmospheric CO2 Values from the NOAA Earth System '
 'Research Laboratory\n'
 '  Sat, 28 Aug 2021 05:01:03 MDT\n'
 '  en\n'
 '\n'
 '  \n'
 '    Weekly CO2 Update for August 15, 2021\n'
 '    h')
In [8]:
import xml.etree.ElementTree as ET
root  = ET.fromstring(resp.text)
In [9]:
for child in root:
    print(child.tag, child.attrib)
channel {}
In [10]:
type(root)
Out[10]:
xml.etree.ElementTree.Element
In [11]:
list(root)
Out[11]:
[<Element 'channel' at 0x...>]
In [12]:
print(list(root[0]))
[, , , , , , , , , , , , , , ]
In [13]:
# the item is here
items = root.findall('./channel/item')
len(items)
Out[13]:
9
In [14]:
# the innermost elements
print(list(items[0])) # print statement to make the output clear on HTML
[, , , , ]
In [15]:
print(list(items[0])[0].tag, list(items[0])[0].text)
title Weekly CO2 Update for August 15, 2021
In [16]:
print(list(items[0])[3].tag, list(items[0])[3].text)
description 
      Up-to-date weekly average CO2 at Mauna Loa 
Week starting on August 15, 2021: 414.91 ppm
Weekly value from 1 year ago: 412.61 ppm
Weekly value from 10 years ago: 390.33 ppm
In [17]:
print(list(items[0])[4].tag, list(items[0])[4].text)
pubDate Sat, 28 Aug 2021 05:00:56 MDT
  • we need to extract the timestamp from the title and the data inside the description
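Those two extractions can be sketched with plain string handling before switching tools (the sample strings below are copied from the feed output above; the regex is one possible approach, not the notebook's method):

```python
import re
from datetime import datetime

# sample strings taken from the feed shown above
title = 'Weekly CO2 Update for August 15, 2021'
line = 'Week starting on August 15, 2021: 414.91 ppm'

# the date follows the word 'for' in the title
date = datetime.strptime(title.split('for')[-1].strip(), '%B %d, %Y').date()

# a regex grabs the number immediately before 'ppm'
conc = float(re.search(r'([\d.]+)\s*ppm', line).group(1))
print(date, conc)  # 2021-08-15 414.91
```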

Feedparser

  • install feedparser with pip, otherwise you will see an import error:
!pip install feedparser
import feedparser
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-52681e28dc26> in <module>
----> 1 import feedparser

ModuleNotFoundError: No module named 'feedparser'
In [18]:
import feedparser
In [19]:
feed = feedparser.parse(rss_url)
In [20]:
# the feed is dictionary-like, rather than the long string we got with requests
feed.keys()
Out[20]:
dict_keys(['bozo', 'entries', 'feed', 'headers', 'etag', 'updated', 'updated_parsed', 'href', 'status', 'encoding', 'version', 'namespaces'])
In [21]:
feed['bozo']
Out[21]:
False
In [22]:
feed['updated']
Out[22]:
'Sat, 28 Aug 2021 11:01:03 GMT'
In [23]:
feed['status']
Out[23]:
200
In [24]:
# HTTP response headers
feed['headers']
Out[24]:
{'date': 'Sun, 29 Aug 2021 02:17:07 GMT',
 'server': 'Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.4.23',
 'last-modified': 'Sat, 28 Aug 2021 11:01:03 GMT',
 'etag': '"141e-5ca9c8411a342"',
 'accept-ranges': 'bytes',
 'content-length': '5150',
 'connection': 'close',
 'content-type': 'text/xml'}
In [25]:
# equivalent to the items list above
len(feed['entries'])
Out[25]:
9
In [26]:
# print the title and summary; keep the summary for later
summary = ''
for k, v in feed['entries'][0].items():
#     print(k, '-->', v)
    if k=='title':
        print(v)
    if k=='summary':
        print(v)
        summary = v
Weekly CO2 Update for August 15, 2021
Up-to-date weekly average CO2 at Mauna Loa 
Week starting on August 15, 2021: 414.91 ppm
Weekly value from 1 year ago: 412.61 ppm
Weekly value from 10 years ago: 390.33 ppm

TOP

Extract data

  • using requests or feedparser, we end up with a list of dictionaries
  • we will extract the date from title and the concentrations from summary/description
  • let's continue with the data from feedparser

Extract timestamp

In [27]:
# let's see the line containing the date again
title = feed['entries'][0].title
title
Out[27]:
'Weekly CO2 Update for August 15, 2021'
In [28]:
# we can split the string at the word that marks the date, like this
date = title.split('for')[-1].strip()
date
Out[28]:
'August 15, 2021'
In [29]:
# then use datetime to convert the string to a datetime object
import datetime
In [30]:
datetime.datetime.strptime(date, '%B %d, %Y').date()
Out[30]:
datetime.date(2021, 8, 15)
In [31]:
# or we can use dateutil to parse the date 
from dateutil import parser
In [32]:
parser.parse(date).date()
Out[32]:
datetime.date(2021, 8, 15)

Extract concentration

In [33]:
entries = feed['entries']
In [34]:
entries[0]['summary']
Out[34]:
'Up-to-date weekly average CO2 at Mauna Loa \n Week starting on August 15, 2021: 414.91 ppm\n Weekly value from 1 year ago: 412.61 ppm\n Weekly value from 10 years ago: 390.33 ppm'
In [35]:
entries[1]['summary']
Out[35]:
'Up-to-date weekly average CO2 at Mauna Loa \n Week starting on August 8, 2021: 414.55 ppm\n Weekly value from 1 year ago: 413.03 ppm\n Weekly value from 10 years ago: 390.70 ppm'
In [36]:
entries[-1]['summary']
Out[36]:
'Estimated Global Daily average CO2 trend \n August 27, 2021: 415.31 ppm\n Daily average CO2 at Mauna Loa\n August 27, 2021: 414.29 ppm'
  • elements #0 up to the second-to-last carry identically structured information
  • the last element is a daily summary, so we need to treat it differently
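The notebook below simply skips that last entry (entries[:-1]); if you also wanted its values, a hedged sketch with a regex over the summary text shown above could look like this:

```python
import re

# the last entry's summary, copied from the output above
summary = ('Estimated Global Daily average CO2 trend \n'
           ' August 27, 2021: 415.31 ppm\n'
           ' Daily average CO2 at Mauna Loa\n'
           ' August 27, 2021: 414.29 ppm\n')

# every number immediately followed by 'ppm'
values = [float(m) for m in re.findall(r'([\d.]+)\s*ppm', summary)]
print(values)  # [415.31, 414.29]
```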
In [37]:
# first let's deal with elements 0-7 (8 in total)
summary = entries[0].summary.splitlines()
summary
Out[37]:
['Up-to-date weekly average CO2 at Mauna Loa ', ' Week starting on August 15, 2021: 414.91 ppm', ' Weekly value from 1 year ago: 412.61 ppm', ' Weekly value from 10 years ago: 390.33 ppm']
In [38]:
# now I realize that we don't even need to extract the date from the title.
# it is available in the second line of the summary
In [39]:
# let's split out the date string
summary[1].split(':')
Out[39]:
['      Week starting on August 15, 2021', '   414.91 ppm ']
In [40]:
date = summary[1].split(':')[0]
date
Out[40]:
'      Week starting on August 15, 2021'
In [41]:
type(date)
Out[41]:
str
In [42]:
try:
    parser.parse(date)
except Exception as e:
    print('Exception raised:', e)
Exception raised: ('Unknown string format:', '      Week starting on August 15, 2021')
In [43]:
# parser cannot parse this form of string; we can help it by passing only the date part
In [44]:
parser.parse(date.split('on')[-1]).date()
Out[44]:
datetime.date(2021, 8, 15)
In [45]:
# or, even better, with the fuzzy=True argument
parser.parse(date, fuzzy=True).date()
Out[45]:
datetime.date(2021, 8, 15)
In [46]:
conc = summary[1].split(':')[-1]
conc
Out[46]:
'   414.91 ppm '
In [47]:
conc.split('ppm')[0].strip()
Out[47]:
'414.91'
In [48]:
summary
Out[48]:
['Up-to-date weekly average CO2 at Mauna Loa ', ' Week starting on August 15, 2021: 414.91 ppm', ' Weekly value from 1 year ago: 412.61 ppm', ' Weekly value from 10 years ago: 390.33 ppm']
In [49]:
summary = entries[1].summary
In [50]:
lines = summary.splitlines()[1:]
lines
Out[50]:
['      Week starting on August 8, 2021:   414.55 ppm ', ' Weekly value from 1 year ago: 413.03 ppm', ' Weekly value from 10 years ago: 390.70 ppm']
In [51]:
lines = [line.strip().split(':') for line in lines]
In [52]:
lines
Out[52]:
[['Week starting on August 8, 2021', '   414.55 ppm '], ['Weekly value from 1 year ago', ' 413.03 ppm'], ['Weekly value from 10 years ago', ' 390.70 ppm']]
In [53]:
lines[0][0]
Out[53]:
'Week starting on August 8, 2021'
In [54]:
date = parser.parse(lines[0][0], fuzzy=True).date()
date
Out[54]:
datetime.date(2021, 8, 8)
In [55]:
concs = [line[-1].split('ppm') for line in lines]
concs
Out[55]:
[['   414.55 ', ' '], [' 413.03 ', ''], [' 390.70 ', '']]
In [56]:
concs = [conc[0].strip() for conc in concs]
In [57]:
# let's design a function to extract the data

def extract_co2_conc(entry):
    '''The summary is formatted as a short paragraph.'''

    # break the paragraph up at newlines and drop the 1st line
    lines = entry.summary.splitlines()[1:]

    # break up each line at the colon
    lines = [line.strip().split(':') for line in lines]
    date = parser.parse(lines[0][0], fuzzy=True).date()

    # get the concentrations as strings; they are cast to float later
    concs = [line[-1].split('ppm') for line in lines]
    concs = [conc[0].strip() for conc in concs]

    return {date: concs}
In [58]:
extract_co2_conc(entries[0])
Out[58]:
{datetime.date(2021, 8, 15): ['414.91', '412.61', '390.33']}
In [59]:
co2_data = dict()
for line in entries[:-1]:
    item = extract_co2_conc(line)
    co2_data.update(item)
co2_data
Out[59]:
{datetime.date(2021, 8, 15): ['414.91', '412.61', '390.33'],
 datetime.date(2021, 8, 8): ['414.55', '413.03', '390.70'],
 datetime.date(2021, 8, 1): ['415.00', '413.52', '391.21'],
 datetime.date(2021, 7, 25): ['415.62', '413.55', '392.21'],
 datetime.date(2021, 7, 18): ['416.74', '414.15', '391.57'],
 datetime.date(2021, 7, 11): ['417.51', '414.98', '393.40'],
 datetime.date(2021, 7, 4): ['417.47', '415.43', '393.73'],
 datetime.date(2021, 6, 27): ['418.08', '415.76', '393.59']}

TOP

Data Tuning

In [60]:
# create a dataframe
import pandas as pd
df = pd.DataFrame(data=co2_data, dtype=float)
df
Out[60]:
2021-08-15 2021-08-08 2021-08-01 2021-07-25 2021-07-18 2021-07-11 2021-07-04 2021-06-27
0 414.91 414.55 415.00 415.62 416.74 417.51 417.47 418.08
1 412.61 413.03 413.52 413.55 414.15 414.98 415.43 415.76
2 390.33 390.70 391.21 392.21 391.57 393.40 393.73 393.59
In [61]:
# transpose table
df = df.transpose()
df
Out[61]:
0 1 2
2021-08-15 414.91 412.61 390.33
2021-08-08 414.55 413.03 390.70
2021-08-01 415.00 413.52 391.21
2021-07-25 415.62 413.55 392.21
2021-07-18 416.74 414.15 391.57
2021-07-11 417.51 414.98 393.40
2021-07-04 417.47 415.43 393.73
2021-06-27 418.08 415.76 393.59
In [62]:
this_year = df.index[0].year
this_year
Out[62]:
2021
In [63]:
# this year, 1 year ago, and 10 years ago
cols = [this_year, this_year - 1, this_year - 10]
df.columns = cols
df.head()
Out[63]:
2021 2020 2011
2021-08-15 414.91 412.61 390.33
2021-08-08 414.55 413.03 390.70
2021-08-01 415.00 413.52 391.21
2021-07-25 415.62 413.55 392.21
2021-07-18 416.74 414.15 391.57
In [64]:
import matplotlib.pyplot as plt
In [65]:
bg_color='#F5F4EF'
In [66]:
list(feed)
Out[66]:
['bozo',
 'entries',
 'feed',
 'headers',
 'etag',
 'updated',
 'updated_parsed',
 'href',
 'status',
 'encoding',
 'version',
 'namespaces']
In [67]:
feed.headers
Out[67]:
{'date': 'Sun, 29 Aug 2021 02:17:07 GMT',
 'server': 'Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.4.23',
 'last-modified': 'Sat, 28 Aug 2021 11:01:03 GMT',
 'etag': '"141e-5ca9c8411a342"',
 'accept-ranges': 'bytes',
 'content-length': '5150',
 'connection': 'close',
 'content-type': 'text/xml'}
In [68]:
feed.updated
Out[68]:
'Sat, 28 Aug 2021 11:01:03 GMT'
In [69]:
title = feed.feed['subtitle']
In [70]:
update = parser.parse(feed.feed['updated']).date().strftime('%B %d, %Y')
/usr/lib/python3/dist-packages/dateutil/parser/_parser.py:1199: UnknownTimezoneWarning: tzname MDT identified but not understood.  Pass `tzinfos` argument in order to correctly return a timezone-aware datetime.  In a future version, this will raise an exception.
  warnings.warn("tzname {tzname} identified but not understood.  "
In [71]:
plt.rcParams['font.family'] = 'monospace'
plt.rcParams['font.size'] = 12
In [72]:
import matplotlib as mpl
In [73]:
dict(zip(df.iloc[1].index, df.iloc[1].values))
Out[73]:
{2021: 414.55, 2020: 413.03, 2011: 390.7}
In [74]:
x = df.iloc[0].name
In [75]:
Y = list(df.iloc[0].values)
Y
Out[75]:
[414.91, 412.61, 390.33]
In [76]:
bbox = dict(boxstyle="round,pad=0.3", fc=bg_color, ec='k', alpha=0.3)
In [77]:
fig, ax = plt.subplots(figsize=(10,6), facecolor=bg_color)
colors = ['firebrick', 'maroon', 'black']
for i, col in enumerate(cols):
    ax.plot(df[col], marker='o', linewidth=0.5, 
            markersize=10, markerfacecolor=bg_color,
            markeredgewidth=2,
            color=colors[i], label=col)
    ax.annotate(Y[i],
            xy=(x,Y[i]+1), xycoords='data',
               va='center',
            ha='right',
               color=colors[i],
               bbox=bbox)
# ax.set_ylim(350, 450)
ax.set_facecolor(bg_color)
fig.suptitle(title);
ax.set_title(f'updated: {update}', fontsize=12)
ax.tick_params(axis='both', direction='in', length=8)
ax.xaxis.set_major_formatter(mpl.dates.DateFormatter('%B %d'))
ax.set_ylabel('parts per million, ppm')

ax.legend()
fig.tight_layout()
In [78]:
import io
import base64
from IPython.core.display import HTML

img = io.BytesIO()
fig.savefig(img, format='png', bbox_inches="tight")
# the function below wraps the same steps
def plt_2_img(fig=None):
    '''convert the figure to PNG bytes and display it in Jupyter'''
    img = io.BytesIO()
    fig.savefig(img, format='png', bbox_inches="tight")
    encoded_string = base64.b64encode(img.getvalue()).decode("utf-8").replace("\n", "")
    img = f'data:image/png;base64,{encoded_string}'
    return HTML(f'<img src="{img}">')  # note: the notebook-to-HTML converter strips this img tag
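A quick, self-contained sanity check for the savefig-to-bytes step (the throwaway figure and the Agg backend below are assumptions for a headless run, not part of the notebook):

```python
import io

import matplotlib
matplotlib.use('Agg')  # headless backend, no display needed
import matplotlib.pyplot as plt

# a throwaway figure just to exercise savefig
fig, ax = plt.subplots()
ax.plot([1, 2, 3])

# write the figure into an in-memory buffer instead of a file
buf = io.BytesIO()
fig.savefig(buf, format='png', bbox_inches='tight')
png = buf.getvalue()
print(png[:4])  # b'\x89PNG' -- the PNG magic bytes
```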
In [79]:
plt_2_img(fig=fig)
Out[79]:
In [ ]:
 
get Jupyter Notebook: