Scraping With BeautifulSoup

DAY 26 👾

I'm back 💙

Today I scraped the web in Python using BeautifulSoup.

At first I thought of scraping it with JS, since I'm a JS guy.

But compared to headless browsers like PhantomJS or NightmareJS (which I've used before), BeautifulSoup felt easy.

It has very good documentation, and you can implement almost anything using the methods described in the docs.

The API is fantastic & easy to use.
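For instance, here is a tiny sketch of the API (my own toy example, not part of today's scraper): parse some HTML, find elements, and read their attributes.

from bs4 import BeautifulSoup

html = '<ul><li><a href="https://github.com/octocat">The Octocat</a> - mascot of GitHub</li></ul>'
soup = BeautifulSoup(html, 'html.parser')  # parse with the stdlib parser
for item in soup.find_all('li'):           # every <li> in the document
    a = item.find('a')                     # first <a> inside the item
    print(a['href'], a.string)             # tag attributes and inner text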

Also, the lines of code are reduced. I usually don't care about LOC, but sometimes it's easier on the eyes to see the whole chunk in one frame.

In this project, I scrape popular AMAs that I'm going to use in my next project, AMA Reader, so I scraped the sindresorhus/amas readme.

The following code does the job:

from bs4 import BeautifulSoup
import urllib2
import json
import re

print("Fetching the data....")
response = urllib2.urlopen('https://github.com/sindresorhus/amas/blob/master/readme.md')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

# The AMA list lives in the first <ul> inside the readme's <article>
article = soup.find('article')
ul = article.find('ul')
li = ul.find_all('li')

arr = []
for item in li:
    link = item.find('a')['href']     # URL of the AMA repo
    fullname = item.find('a').string  # link text is the person's name
    username = link.split('/')[3]     # github.com/<username>/...
    # The description is whatever sits between the closing </a> and </li>
    s = str(item)
    start = '</a>'
    end = '</li>'
    description = re.search('%s(.*)%s' % (start, end), s).group(1)
    obj = {
        "username": username,
        "link": link,
        "fullname": fullname,
        "description": description,
        "avatar": "https://github.com/" + username + ".png?size=200"
    }
    arr.append(obj)

with open('amas.json', 'w') as outfile:
    json.dump(arr, outfile, indent=4, sort_keys=True, separators=(',', ':'))
print("Data stored in `amas.json` file")
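One caveat: this is Python 2 code (urllib2 no longer exists in Python 3). A rough port, assuming the page structure stays the same (a sketch, untested against the live page), only needs to swap the fetch call:

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2
import json
import re

html_doc = urlopen('https://github.com/sindresorhus/amas/blob/master/readme.md').read()
soup = BeautifulSoup(html_doc, 'html.parser')

arr = []
for item in soup.find('article').find('ul').find_all('li'):
    link = item.find('a')['href']
    username = link.split('/')[3]
    description = re.search('</a>(.*)</li>', str(item)).group(1)
    arr.append({
        "username": username,
        "link": link,
        "fullname": item.find('a').string,
        "description": description,
        "avatar": "https://github.com/" + username + ".png?size=200",
    })

with open('amas.json', 'w') as outfile:
    json.dump(arr, outfile, indent=4, sort_keys=True, separators=(',', ':'))

Each entry in amas.json then carries the username, link, fullname, description, and avatar URL for one AMA.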

Code on GitHub

Till the next time 👻