Tutorial #1: Web Scraping
Web scraping is the adjustable wrench of your data science tool kit. Sure, you need to know statistics, programming, and analytics as a whole; however, when you start to actually implement your ideas, web scraping is something you're going to need at some point (even if just for fun).
What is web scraping? It's the process of taking a webpage and pulling its text and links into a form that Python (or any language) will gladly give back to you as (what I'm going to call) a usable string. This means that you can download files, you can pull text from webpages, you can pull a sentence out of a NY Times article... you can see why this is powerful, especially when implemented quickly. I used web scraping in my research to download hundreds of GBs of files from NASA, and now I use web scraping as a data scientist to quickly get the information I need about movies, ads, etc.
These Data Science tutorials are going to be in Python. Web scraping is probably not the best introduction to Python, so if you're still learning, bear with me. I'll try to make this accessible for everyone. This tutorial is going to show you how to scrape this blog for certain words.
The first thing you need to install is either Beautiful Soup or Selenium, depending on what you want from your web scraper. Beautiful Soup is what this demo covers, but Selenium is incredibly powerful at trawling through web pages and clicking buttons. So if you need an interactive web scraper, Selenium is the way to go; the con of Selenium, however, is that it's fickle and prone to problems (e.g. getting stalled on a webpage or beaten by pop-ups). So, we'll use the more rustic albeit reliable Beautiful Soup.
To install, go here: https://www.crummy.com/software/BeautifulSoup/. Install Beautiful Soup 4 either with pip (pip install beautifulsoup4) or by downloading the tarball. Now open Python (I'm using 2.7 here), and type "from bs4 import BeautifulSoup".
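Once the import works, here is a minimal sketch of what Beautiful Soup gives you back. It is not the script from my GitHub, just an illustration: the HTML snippet, the SAMPLE_HTML name, and the extract_text_and_links helper are all made up for this example, and it is written to run on Python 3 using the parser that ships with Python ("html.parser").

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page you downloaded.
SAMPLE_HTML = """
<html><body>
<h1>My Blog</h1>
<p>My cat likes <a href="https://example.com/data">data</a>.</p>
<p>See the <a href="/about">about page</a>.</p>
</body></html>
"""

def extract_text_and_links(html):
    """Return the visible text and every link target found in the HTML."""
    # "html.parser" is the parser built into Python, so nothing extra is needed.
    soup = BeautifulSoup(html, "html.parser")
    # Join the text pieces with spaces and trim the whitespace around them.
    text = soup.get_text(" ", strip=True)
    # Keep only the <a> tags that actually carry an href attribute.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return text, links

text, links = extract_text_and_links(SAMPLE_HTML)
print(text)
print(links)
```

The text comes back as one ordinary Python string and the links as a list of strings, which is exactly the "usable string" idea from above. For a real page you would first download the HTML and then hand it to the same function.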
The tutorial script is available on my GitHub, with instructions in the comments: https://github.com/loisks317/Webscraping/tree/master
This is what it should look like when run within IPython at the command line.
As an exercise for yourself (it's always good to learn by trying) see if you can figure out how many times I say 'cat' throughout the blog. Maybe a prize to whoever does this first :)
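If you want a hint for the counting part, here is one possible sketch. It assumes you have already pulled the blog text into a plain string; the count_word helper and the sample sentence are hypothetical, and counting only whole words, ignoring case, is my own choice here rather than the only sensible rule.

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of word in text."""
    # \b marks a word boundary, so "cat" does not match inside "concatenate".
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical text standing in for the scraped blog posts.
sample = "The cat sat on the mat. Cats are great! A CAT can concatenate."
print(count_word(sample, "cat"))
```

Note that with this rule "Cats" does not count as "cat". Whether plurals should count is up to you, so decide on your rule before you report your number.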
Happy coding!