DAY 26
I'm back!
Today I scraped the web in Python using BeautifulSoup.
At first I thought of scraping it with JS, since I'm a JS guy.
But compared to headless browsers like PhantomJS or NightmareJS (which I've used earlier), BeautifulSoup felt easy.
It has very good documentation, and you can implement pretty much anything with the methods described in the docs.
The API is fantastic and easy to use.
The lines of code also shrink. I usually don't care about LOC, but sometimes it's easier on the eyes when the whole chunk fits in one frame.
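To show what I mean, here's a minimal sketch with made-up HTML (a toy example of mine, not part of the project) of how little code a parse-and-extract loop takes:

```python
from bs4 import BeautifulSoup

# Toy HTML, just to illustrate the API
html = "<ul><li><a href='/a'>First</a></li><li><a href='/b'>Second</a></li></ul>"
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns every matching tag; attributes read like a dict
for li in soup.find_all('li'):
    print(li.a['href'] + ' ' + li.a.string)
```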
In this project, I scrape the list of popular AMAs, which I'm going to use in my next project, AMA Reader.
So I scraped the sindresorhus/amas readme on GitHub (the URL is in the code below).
The following code does the job:
```python
from bs4 import BeautifulSoup
import urllib2
import json
import re

print("Fetching the data....")
response = urllib2.urlopen('https://github.com/sindresorhus/amas/blob/master/readme.md')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

# GitHub renders the readme inside an <article> tag;
# the AMA list is the first <ul> in it
article = soup.find('article')
ul = article.find('ul')
li = ul.find_all('li')

arr = []
for item in li:
    link = item.find('a')['href']
    fullname = item.find('a').string
    # in 'https://github.com/<user>/<repo>', index 3 is the username
    username = link.split('/')[3]

    # The description is the raw HTML between the closing </a> and </li>
    s = str(item)
    start = '</a>'
    end = '</li>'
    description = re.search('%s(.*)%s' % (start, end), s).group(1)

    obj = {
        "username": username,
        "link": link,
        "fullname": fullname,
        "description": description,
        "avatar": "https://github.com/" + username + ".png?size=200"
    }
    arr.append(obj)

with open('amas.json', 'w') as outfile:
    json.dump(arr, outfile, indent=4, sort_keys=True, separators=(',', ':'))
print("Data stored in `amas.json` file")
```
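One thing I'd flag about my own script: pulling the description out with a regex over `str(item)` works here, but BeautifulSoup can navigate to it directly. A sketch of that alternative (my suggestion, not what the script above does, using an assumed shape for one list item):

```python
from bs4 import BeautifulSoup

# Assumed shape of one <li> in the readme (illustrative only)
item = BeautifulSoup("<li><a href='/u/ama'>Name</a> - Does an AMA.</li>",
                     'html.parser').li

# The text node right after the <a> is the description
anchor = item.find('a')
description = str(anchor.next_sibling).strip(' -')
print(description)  # "Does an AMA."
```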
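Since AMA Reader will consume this file, here's a quick sanity check (hypothetical usage, just to show the shape of the data):

```python
import json

with open('amas.json') as f:
    amas = json.load(f)

# Each entry has username, fullname, link, description, and avatar keys
for ama in amas[:3]:
    print(ama['fullname'] + ' -> ' + ama['link'])
```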
Till the next time!