From Corrections to Coding: Data Scraping with Nokogiri

Imagine that while working for testsite, you are asked to create an automated test that will audit one of our page's performance metrics. After a bunch of research, you find that your best bet for gathering website performance metrics is from a handy website called: webpagetest.org. Check it out, you can enter any URL and get all kinds of data about said site's performance without a whole lot of time spent.

In order to complete this project, we need two things:

Something that takes the performance data printed out on the site and turns into something you can manipulate. This process is called "data scraping" and it is what this blog post is all about!
Something, which I will cover later, that navigates to the site for us, inputs the data for us, and runs the test for us.

Luckily Ruby has an awesome gem called Nokogiri, which turns a website into a Ruby object. Let me show you a manual way of testing a webpage. I'm a Star Trek fan, so I'm going to test "http://www.trektoday.com." I enter my URL and then I'll get a nice chart with all kinds of useful data like so:

Let's say I wanted to get that first value "14.630s," in the "Load Time" column and "First View" row. How in the world would you do that? You have to delve into the front-end coding languages of the website: HTML and CSS. HTML shows us the plain text of the site and CSS shows us the style. HTML and CSS often times have specific identifiers that let us know what each text is. Since we're scraping a very specific data we have to use a very specific identifier for Nokogiri: 'id.' In HTML and CSS 'id' is specific to one thing and one thing only. It can't be used anywhere else. So how do we get "14.630s" id?

First highlight "14.630s" and right click. In Google Chrome, you'll see an option "inspect." This opens up the HTML and CSS we'll use to scrape our data. One nice thing is that the exact element for "14.630s" should already be highlighted for us! Here's what it will look like:

We now have an important CSS 'id' which equals "fvLoadTime" and this gives us the "14.630s" we're looking for. Now we can write our code. We need to require 'open-uri' as well as 'nokogiri' since we will be working with URL's.

Now we can define the page we'll be using to scrape data from, which will be the results page (not simply webpagetest.org). We'll call on Nokogiri and its HTML library and assign it to a variable to make it easier to use for future use. Nokogiri has the method ".at_css" which accesses our specific HTML id. The above code yields these results:

The first result simply spits out the HTML code we pulled our data from. The second result will (though it did not in the case) pull out any input text between the <td></td> tags (meaning we will \n's as well as the text we want). By adding .strip to data.at_css('fvLoadTime').text, we actually are able to get just the data we want. Our function has returned to us a string which will be extremely useful to us. Instead of copying and pasting all of our data to some document, we now have Ruby strings results we can work with. Now to automate the actions in the browsers! Stay tuned!

From Corrections to Coding

Data Scraping with Nokogiri

No comments:

Post a Comment