So, let's put ourselves in a plausible real-life business situation: we work for a company that manages an online retail website. The site serves both buyers and sellers and sells just about everything. Now our boss wants us to make sure that every product and department page loads without error, specifically with no 400- or 500-level errors. We already have a 404 finder, but we don't have a list of every page name. You could gather that list simply, but inefficiently, by having a team go through every single page, copy and paste each address into the .txt file our 404 finder uses, and then run the program. However, with a little understanding of how our webpage (and all webpages) work, we can do this very efficiently!
Our site is called "testsite." Our homepage, where you can see all the departments and customer options, is "http://www.testsite.com." Say you want to look at something in the outdoors section: you click a button or link and you'll be taken to "http://www.testsite.com/outdoors," and if you look at the hiking boots in that section you'll see "http://www.testsite.com/outdoors/hikingboots#123." Essentially, each link or button is another folder or branch on the website's tree. Since we sell a lot of stuff at testsite.com, there are going to be a ton of links! Fortunately, there's an age-old tool we can use to collect all of them for testing: a web crawler.
Web crawlers, simply put, index entire websites based on certain user inputs. One important input is depth, which, like it sounds, controls how far down a page's "branches" the crawler will follow. Now, it is important to say here and now that I am not yet at the level to create a web crawler from scratch, but Anemone is a wonderful prepackaged little gem that makes this extremely easy.
Before you start crawling, install the gem with the terminal command "gem install anemone" and open a new script in your text editor. To start, you have to require the Anemone gem with "require 'anemone'". Anemone's documentation has a basic web crawler script prewritten for us, and a sketch of that idea is below. What I want to concentrate on, though, are the :obey_robots_txt and :threads parts of Anemone.
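Here's a minimal sketch of that idea, assuming our site really lives at http://www.testsite.com and that our 404 finder reads URLs from a plain text file (I'm calling it "all_links.txt" purely for illustration):

    # link_collector.rb
    # A minimal sketch: crawl the site and write every URL Anemone
    # discovers to a text file our existing 404 finder could read.
    # The site URL and the output filename are placeholders.
    require 'anemone'

    File.open("all_links.txt", "w") do |file|
      Anemone.crawl("http://www.testsite.com/") do |anemone|
        anemone.on_every_page do |page|
          file.puts page.url    # record each page the crawler visits
        end
      end
    end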
1. If you go to any website and add "/robots.txt" to the end of its address, you'll get one of two responses: a 200 (successful) response, which returns a file stating which crawlers, if any, are allowed to crawl the site, or a 404 (broken link) response, which means the developers haven't published any crawling rules. This convention is formally called the Robots Exclusion Protocol, and you want to be polite in your early coding days: only crawl a site if its robots.txt permits it, or doesn't exist at all. Luckily, Anemone has a built-in setting, ":obey_robots_txt," which ignores the REP when set to "false" but respects it when set to "true" (see the combined example after this list). Simple enough!
2. Multithreading is a programming technique that allows multiple pieces of work to run at the same time. Essentially, instead of executing everything in a single line, the work is handed out to several threads at once; a really simple Ruby demonstration is shown after this list, and I will demonstrate the time savings in a later post. Again, we are super lucky that Anemone is multithreaded out of the box, and you can use ":threads => num" to set how many threads it uses. However, you have to be wise about your multithreading use: spin up too many threads and you can actually slow your program down. Anemone's documentation recommends 4 threads, so that's what I chose.
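Here is that simple demonstration: a toy script (everything in it is made up for illustration) where four threads each pretend to wait on a slow page and finish at roughly the same time instead of one after another.

    # threads_demo.rb
    # A toy illustration of Ruby threads: four "requests" run at once
    # instead of back to back. The sleep call stands in for waiting
    # on a slow network response.
    threads = (1..4).map do |n|
      Thread.new do
        sleep 1                      # pretend this is a slow page fetch
        puts "Thread #{n} finished"
      end
    end
    threads.each(&:join)             # wait for all four to complete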
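And here is the earlier crawler sketch again with both settings applied, assuming we want to respect the REP and use the recommended 4 threads (the site URL is still a placeholder). Each page's HTTP status code comes along for free, which is exactly what our 404 finder cares about.

    # polite_crawler.rb
    # The crawl from before, now told to honor robots.txt and to
    # work with 4 threads. The site URL is a placeholder.
    require 'anemone'

    Anemone.crawl("http://www.testsite.com/",
                  :obey_robots_txt => true,   # respect the site's robots.txt rules
                  :threads         => 4) do |anemone|
      anemone.on_every_page do |page|
        puts "#{page.code} #{page.url}"       # HTTP status code and URL
      end
    end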