So now that we have functions for scraping data from sites, let's imagine we had to do it for a whole bunch of sites. In the case of webpagetest.org, simply putting the URLs in an array won't work. Whenever you go to webpagetest.org, you enter a URL that, when submitted, changes the path of the page: you start on "http://www.webpagetest.org," but submitting your site changes the URL to something under "http://www.webpagetest.org/result." This means that in order to automate our test, we need something that does the web surfing for us. Before that, however, I have to make a confession: if you use this test to do a large amount of testing, as I did, you're not being a good web citizen. Only do this for a small batch of URLs (100 at the MOST). Ask for an API. Be wise, unlike myself.
Anyways, there is a tool for our aforementioned task. It's called Selenium-Webdriver, and it made me giggle the first time I used it! This gem has some awesome and surprisingly intuitive methods. To start off the program, you'll have to require the gem. Now let's say we wanted to create a really simple demo program that opens the browser, navigates to Google, and searches for something (I'm picking Star Trek Beyond trailers). That's relatively simple, and we can draw on some skills we learned from Nokogiri.
The only way a computer is going to know what the Google search bar is, is through code. For us, that means HTML and CSS ids. To find these, we right-click and choose "Inspect Element." Luckily, most element inspectors make finding elements in the code easy. For the Google search bar, the HTML id is "gs_lc0." Now for the code:
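Here's a minimal sketch of what that looks like (assuming the "gs_lc0" id your inspector shows still matches the search bar):

```ruby
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox          # :chrome also works
driver.navigate.to "http://www.google.com"

search_bar = driver.find_element(:id, "gs_lc0")    # the id we found with the inspector
search_bar.click
search_bar.send_keys "Star Trek Beyond Trailers"
search_bar.submit
```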
First, we call on Selenium-Webdriver, which takes an argument for the browser you'll be using. So far, I've only tried this with Firefox and Chrome. Using Firefox is extremely easy, so I'd recommend that. Next, you'll want to go to Google. The ".navigate.to" method does just that: it navigates to a page. Now we want to do something on that page, namely enter a value into the Google search bar and submit the search. As mentioned earlier, the search bar's id is "gs_lc0." The find_element method finds the element matching your two given arguments. Searching by id is the most specific option, since each id is unique to its element; "gs_lc0" is only used for the search bar.
So now that your computer has identified the search bar, what do we do? The same things we do as human beings, we just don't really think about it! We click on the search bar, type some stuff, and click or press enter. We just have to tell Selenium-Webdriver to do that. For this, we use ".click," which clicks on the search bar. Then we type something out with the ".send_keys" method, which takes a string argument. Finally, we wrap it up with ".submit," which acts as our enter key and makes Google search for Star Trek Beyond trailers. Surely it will be a great movie!
Data Scraping with Nokogiri
Imagine that while working for testsite, you are asked to create an automated test that will audit one of our pages' performance metrics. After a bunch of research, you find that your best bet for gathering website performance metrics is a handy website called webpagetest.org. Check it out: you can enter any URL and quickly get all kinds of data about that site's performance.
In order to complete this project, we need two things:
- Something that takes the performance data printed out on the site and turns it into something you can manipulate. This process is called "data scraping," and it is what this blog post is all about!
- Something, which I will cover later, that navigates to the site for us, inputs the data for us, and runs the test for us.
Luckily, Ruby has an awesome gem called Nokogiri, which turns a website into a Ruby object. Let me show you the manual way of testing a webpage first. I'm a Star Trek fan, so I'm going to test "http://www.trektoday.com." I enter my URL, and I get back a nice chart with all kinds of useful data.
Let's say I wanted to get that first value, "14.630s," in the "Load Time" column and "First View" row. How in the world would you do that? You have to delve into the front-end languages of the website: HTML and CSS. HTML gives us the plain text of the site and CSS gives us the style. HTML and CSS elements often have specific identifiers that tell us what each piece of text is. Since we're scraping a very specific piece of data, we have to use a very specific identifier for Nokogiri: the 'id.' In HTML and CSS, an 'id' is specific to one thing and one thing only; it can't be used anywhere else. So how do we find the id for "14.630s"?
First, highlight "14.630s" and right-click. In Google Chrome, you'll see an option called "Inspect." This opens up the HTML and CSS we'll use to scrape our data, and one nice thing is that the exact element for "14.630s" should already be highlighted for us!
We now have an important CSS 'id,' "fvLoadTime," which points to the "14.630s" we're looking for. Now we can write our code. We need to require 'open-uri' as well as 'nokogiri,' since we will be working with URLs.
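Here's a minimal sketch of the script (the result URL is a placeholder; each webpagetest.org run gets its own results path):

```ruby
require 'open-uri'
require 'nokogiri'

# Placeholder results URL; substitute the one from your own test run
page = "http://www.webpagetest.org/result/YOUR_TEST_ID/"

# Open the page and hand the HTML to Nokogiri, keeping it in a variable for later use
data = Nokogiri::HTML(open(page))

puts data.at_css('#fvLoadTime')       # the whole element carrying our id
puts data.at_css('#fvLoadTime').text  # just the text inside it
```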
First we define the page we'll be scraping data from, which is the results page (not simply webpagetest.org). We then call on Nokogiri's HTML parser and assign the result to a variable for future use. Nokogiri has the method ".at_css," which accesses our specific HTML id. The above code yields these results:
The first result simply spits out the HTML code we pulled our data from. The second result will (though it did not in this case) pull out any text between the <td></td> tags, meaning we may get \n's as well as the text we want. By adding .strip to data.at_css('#fvLoadTime').text, we are able to get just the data we want. Our code has returned a string, which is extremely useful: instead of copying and pasting all of our data into some document, we now have Ruby string results we can work with. Now to automate the actions in the browser! Stay tuned!
What's up with those dollar signs?!
So you may have noticed that I put dollar signs in front of some variables in my code. Like here:
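A minimal sketch of the idea (the variable names are just illustrative):

```ruby
puts "Who do you want to send it to?"
$email = gets.chomp                  # global: visible everywhere in the script

def send_email
  email = "local@example.com"        # local: exists only inside this method
  puts "Global recipient: #{$email}"
  puts "Local variable:   #{email}"
end

send_email
```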
The reason behind these dollar signs is simple and might save you some forehead wrinkles! Let's look at my gmail function. My goal was to make my program send an email to whatever address a user entered. However, I couldn't simply use a variable called "email" because of one computer science principle: global versus local variables. Within functions, such as our email function, variables are local, that is, they only exist inside the function. That also means I can't reach a variable defined outside the function from within it. This is easily fixable! By adding a "$" to our variable name, the variable becomes global, meaning it permeates the script and can be used anywhere. So by adding $ to "email," I can ask a question that receives input outside the function and have that input used within the function! Be aware, though, that the variables "email" and "$email" are not equal: if both are defined in a program, they will hold two separate values.
Making a web crawler with Anemone
In case you're wondering how to pronounce "Anemone," I've got this useful YouTube video.
So, let's put ourselves in a possible real-life business situation: we work for a company that manages an online retail website. This website serves buyers and sellers and sells just about everything. Now our boss wants us to make sure that all our products and departments show up without error, specifically no 400- or 500-level errors. Well, we already have a 404 finder, but we don't have a list of every page. You could check this simply, but inefficiently, by having a team go through every single page, copy and paste each one into the .txt file our 404 finder uses, and then run the program. However, with a little understanding of how our webpage (and all webpages) work, we can do this very efficiently!
Our site is called "testsite." Our homepage, where you can see all the departments and customer options, is "http://www.testsite.com." Let's say you wanted to look at something in the outdoors section: you click on a button or link, and you'll be taken to "http://www.testsite.com/outdoors," and if you look at the hiking boots in that section, you'll see "http://www.testsite.com/outdoors/hikingboots#123." Essentially, each link or button is another folder or branch on the website's tree. Since we sell a lot of stuff at testsite.com, there are going to be a ton of links! However, there's an age-old tool we can use to collect all the links for testing: a web crawler.
Web crawlers, simply put, index entire websites based on certain user inputs. One important input is depth, which, like it sounds, controls how far down a page's "branches" the crawler goes. Now, it is important to say here and now that I am not yet at the level to create a web crawler from scratch, but Anemone is a wonderful prepackaged little gem that makes this extremely easy.
Before you start crawling, install the gem with the terminal command "gem install anemone" and open a new script in your text editor. To start, you have to require the Anemone gem with "require 'anemone'." Now, if you check Anemone's documentation, it has this web crawler script prewritten for us. However, what I want to concentrate on are the :obey_robots_txt and :threads parts of Anemone.
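Here's a rough sketch of what that crawler looks like with those options filled in (the URL is our hypothetical testsite, and the option values are just examples):

```ruby
require 'anemone'

Anemone.crawl("http://www.testsite.com",
              :obey_robots_txt => true,   # respect the Robots Exclusion Protocol
              :threads => 4,              # the thread count suggested by the docs
              :depth_limit => 2) do |anemone|
  anemone.on_every_page do |page|
    puts "#{page.code} #{page.url}"       # HTTP status code and URL of each page crawled
  end
end
```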
1. If you go to any website and add "/robots.txt" to the end of its URL, you'll receive one of two responses: a 200 (successful) response, which spells out which, if any, web crawlers are allowed to crawl the site, or a 404 (not found) response, which means the developers haven't said that crawling is disallowed. You only want to freely crawl websites that yield a 404 for robots.txt. This is formally called the Robots Exclusion Protocol, and you want to be polite in your early coding days. Luckily, Anemone has a built-in setting, ":obey_robots_txt": when set to false it ignores the REP, but setting it to true makes the crawler respect it. Simple enough!
2. Multithreading is a programming tool that allows multiple pieces of work to run at the same time; there's a really simple demonstration of the Ruby code for this just below the list. Essentially, multithreading lets the same code work on several tasks at once. I will demonstrate how much time this saves in a later post. Again, we are super lucky that Anemone is multithreaded out of the box, and you can use ":threads => num" to control it. However, you have to be wise about your multithreading use: too many threads can actually slow down your program. Anemone's documentation recommended 4 threads, so that's what I chose.
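Here's that really simple demonstration of plain Ruby threads (the sleep is just a stand-in for slow work, like fetching a page):

```ruby
threads = 4.times.map do |i|
  Thread.new do
    sleep 1                        # pretend this is a slow task, like fetching a page
    puts "thread #{i} finished"
  end
end

threads.each(&:join)               # wait for every thread to finish before exiting
```

All four threads sleep at the same time, so the whole script finishes in about one second instead of four.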
Getting a script to send an email (gmail)
The gem I am using for this project is the gmail gem, which can be installed with "gem install gmail." I found a little piece of code in the documentation that let me do exactly what I wanted, aside from one little thing, but let me explain the code I have so far.
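Here's a sketch of that function, based on the gmail gem's documented usage (the method and parameter names are my own):

```ruby
require 'gmail'

def send_email(sender_email, password, recipient_email, subject_line, body_text)
  # Log into Gmail with the sender's credentials, then deliver the message
  Gmail.connect(sender_email, password) do |gmail|
    gmail.deliver do
      to recipient_email
      subject subject_line
      text_part do
        body body_text
      end
    end
  end
end
```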
So this should be pretty intuitive. The function takes certain inputs: the sender's email and password, so the program can log into Gmail, and then the recipient's email, a subject, and a text body. Of course, I had a problem: I want my program to be fluid. I want to be able to ask the user "Who do you want to send it to?" and then use that input as the email. You can't just make a variable like "email," set it equal to gets.chomp, and use it within a function. Variables within a function are local, so you have to make the variables we use global. You can do this very easily by adding a "$" to the variable's name. So instead of recipientEmail as our variable, we can use $recipientEmail. This makes it usable within the entire program.