Tutorial #1: Web Scraping

Web scraping is the adjustable wrench of your data science toolkit. Sure, you need to know statistics, programming, and analytics as a whole; however, when you start to actually implement your ideas, web scraping is something you're going to need at some point (even if just for fun).

What is web scraping? It's the process of taking a webpage and pulling the text and links into a form that Python (or any language) will gladly give back to you as (what I'm going to call) a usable string. This means that you can download files, pull text from webpages, or pull a sentence out of a NY Times article... you can see why this is powerful, especially when implemented quickly. I used web scraping in my research to download hundreds of GBs of files from NASA, and now I use it as a data scientist to quickly get the information I need about movies, ads, etc.

These Data Science tutorials are going to be in Python. Web scraping is probably not the best introduction to Python, so if you're still learning, bear with me. I'll try to make this accessible for everyone. This tutorial is going to show you how to scrape this blog for certain words.

The first thing you need to install is either Beautiful Soup or Selenium, depending on what you want from your web scraper. Beautiful Soup is what is going to be covered in this demo, but Selenium is incredibly powerful at trawling through web pages and clicking buttons. So if you need an interactive web scraper, Selenium is the way to go; the con of Selenium, however, is that it's fickle and prone to problems (e.g. getting stalled on a webpage or beaten by pop-ups). So, we'll use the more rustic albeit reliable Beautiful Soup.

To install, go here: https://www.crummy.com/software/BeautifulSoup/. Install Beautiful Soup 4 either with pip (pip install beautifulsoup4) or by downloading the tarball. Now open Python (I'm using 2.7 here), and type "from bs4 import BeautifulSoup". If that comes back without an error, the install worked.
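Once the import works, a quick sanity check is to parse a small HTML string by hand. The snippet below is a minimal sketch, not the script from my GitHub: the HTML and the example.com links are made up for illustration, and it assumes the "html.parser" backend that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration (not taken from the blog).
html = """
<html><body>
<h1>Tutorial #1: Web Scraping</h1>
<p>Web scraping pulls <a href="https://example.com/a">text</a> and
<a href="https://example.com/b">links</a> out of a page.</p>
</body></html>
"""

# "html.parser" is the parser bundled with Python, so nothing extra to install.
soup = BeautifulSoup(html, "html.parser")

# Text of the first <h1> tag.
title = soup.h1.get_text()

# The href attribute of every <a> tag, in document order.
links = [a["href"] for a in soup.find_all("a")]

# All text on the page as one string, with pieces joined by single spaces.
text = soup.get_text(" ", strip=True)

print(title)
print(links)
print(text)
```

Swapping in HTML you have downloaded from a real page works the same way; only the string passed to BeautifulSoup changes.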

The tutorial script is available on my GitHub, with instructions in the comments: https://github.com/loisks317/Webscraping/tree/master

[Screenshot: the script running within IPython at the command line]

I highly suggest playing with the HTML text and getting a better feel for what Beautiful Soup can do for you. This mini text string searcher is the beginning of Natural Language Processing (NLP), which is another incredibly valuable tool for data scientists.
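To give a flavor of where a string searcher heads once you have plain text in hand, here is a rough sketch of a word-frequency count using only the standard library. The function name word_counts and the sample sentence are my own inventions for illustration; this is one simple way to tokenize, not a full NLP pipeline.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a Counter mapping each lowercase word in text to its frequency."""
    # Very simple tokenizer: runs of letters and apostrophes count as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample text, standing in for text pulled out of a web page.
sample = "My cat likes data. Data science is fun, and the cat agrees."

counts = word_counts(sample)
print(counts["cat"])
print(counts.most_common(3))
```

Because the text is split into whole words first, looking up "cat" will not also pick up words like "category".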

As an exercise for yourself (it's always good to learn by trying), see if you can figure out how many times I say 'cat' throughout the blog. Maybe a prize to whoever does this first :)
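If you want a starting point, here is one possible skeleton. It is a sketch under assumptions, not the script from my GitHub: the helper names (fetch_html, count_word, count_word_across_pages) are hypothetical, the URL in the comment is a placeholder, and figuring out which pages to feed it is still up to you.

```python
import re

from bs4 import BeautifulSoup

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def fetch_html(url):
    """Download a page and return its raw HTML as bytes."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


def count_word(html, word):
    """Count whole-word, case-insensitive matches of word in the page text."""
    # Strip the tags first so that only the extracted text is searched.
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def count_word_across_pages(urls, word, fetch=fetch_html):
    """Sum the matches of word over every page in urls.

    fetch is swappable so the counting logic can be tried without a network.
    """
    return sum(count_word(fetch(url), word) for url in urls)


# Example call (placeholder URL, replace with the real pages you want to scan):
# total = count_word_across_pages(["https://example.com/blog"], "cat")
```

Passing the downloader in as an argument is a deliberate choice: you can hand it a fake function that returns canned HTML and check the counting before hitting any real site.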

Happy coding!

