Typically, a crawler visits site pages, collects their HTML, then parses it and extracts some data.
As a first step, we need to somehow emulate a browser to be able to render such sites. There are a couple of projects for browser emulation: modern ones like zombie.js and phantom.js, and the older Selenium.
So, next I tried Selenium with Chrome, and it works. It is not entirely reliable either; it sometimes hangs or crashes, but at least not very frequently.
Making it work
Now we need to create the Crawler itself. I used a randomized breadth-first algorithm to traverse site pages and coupled it with the Browser Emulator to render each page and extract its data.
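The traversal can be sketched like this: a plain breadth-first walk, except that the next page is picked at random from the frontier instead of strictly in FIFO order. This is a minimal illustration, not the actual Crawler code; the `links` map stands in for what the Browser Emulator would return for each rendered page, and all names are made up.

```javascript
// Fake site graph: each URL maps to the links found on that page.
const links = {
  '/':          ['/laptops', '/phones'],
  '/laptops':   ['/laptops/1', '/laptops/2'],
  '/phones':    ['/phones/1'],
  '/laptops/1': [], '/laptops/2': [], '/phones/1': []
};

function randomBfs(startUrl, getLinks) {
  const visited = new Set([startUrl]);
  const frontier = [startUrl];
  const order = [];
  while (frontier.length > 0) {
    // Pick a random page from the frontier instead of the oldest one.
    const i = Math.floor(Math.random() * frontier.length);
    const url = frontier.splice(i, 1)[0];
    order.push(url); // here the real Crawler would render the page and extract data
    for (const next of getLinks(url)) {
      if (!visited.has(next)) {
        visited.add(next);
        frontier.push(next);
      }
    }
  }
  return order;
}

const visitOrder = randomBfs('/', url => links[url] || []);
console.log(visitOrder.length); // 6 - every reachable page, visited exactly once
```

The `visited` set guarantees each page is processed once no matter in which order the frontier is drained; the randomization just spreads the load across different sections of the site.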
If you need more details about graphs and algorithms, take a look at 'The Algorithm Design Manual' by S. Skiena, a very interesting book with lots of practical examples.
As development tools I chose Node.js and MongoDB; both are well suited for such a task.
Making it simple to use
There is no simple way to create an algorithm that automatically extracts product data from different eCommerce sites. So, for every site some rules need to be written by hand. And since there will be lots of such sites, this task should be made as simple as possible.
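One way such hand-written rules could look is a small per-site map: for each site, a pattern identifying product pages and an extractor that pulls the fields out of a rendered page. The shape and names below are assumptions for illustration, not the actual Crawler's rule format; the `fakePage` stub stands in for a page rendered by the Browser Emulator.

```javascript
// Hypothetical per-site rules: each site needs only a few lines.
const rules = {
  'shop-a.example.com': {
    productUrl: /\/product\/\d+/, // which URLs are product pages
    extract: page => ({
      name:  page.querySelector('h1.title'),
      price: parseFloat(page.querySelector('span.price'))
    })
  },
  'shop-b.example.com': {
    productUrl: /\/items\/[a-z-]+/,
    extract: page => ({
      name:  page.querySelector('#product-name'),
      price: parseFloat(page.querySelector('.cost'))
    })
  }
};

// A stub standing in for a rendered page, so the sketch runs without a browser.
const fakePage = {
  querySelector: sel => (sel === 'h1.title' ? 'Laptop X1' : '999.99')
};

const product = rules['shop-a.example.com'].extract(fakePage);
console.log(product); // { name: 'Laptop X1', price: 999.99 }
```

Keeping each rule this small is the point: adding a new site should mean writing a few selectors, not writing a new crawler.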
The first problem is that the Selenium API is hard to use, so I wrote a simplified wrapper around it.
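The idea behind such a wrapper can be sketched as follows: hide the low-level driver calls behind a few high-level helpers. The method names here are assumptions, not the real wrapper's API or the real Selenium API; a stub driver is used so the sketch runs without an actual browser.

```javascript
// Hypothetical simplified wrapper: crawler code talks to `Browser`,
// never to the raw driver directly.
class Browser {
  constructor(driver) { this.driver = driver; } // `driver` is the raw Selenium driver

  open(url)       { this.driver.get(url); }
  text(selector)  { return this.driver.findElement(selector).getText(); }
  click(selector) { this.driver.findElement(selector).click(); }
}

// Stub driver recording the calls made to it, in place of real Selenium.
const calls = [];
const stubDriver = {
  get: url => calls.push(['get', url]),
  findElement: sel => ({
    getText: () => 'text of ' + sel,
    click:   () => calls.push(['click', sel])
  })
};

const browser = new Browser(stubDriver);
browser.open('http://example.com/product/1');
console.log(browser.text('.price')); // "text of .price"
```

Besides being shorter to call, a wrapper like this gives one central place to add retries and error handling when the underlying browser misbehaves.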
Another complexity is that Node.js code is asynchronous, which is harder to write than synchronous code. It would be nice if there were a way to make it look synchronous. Thankfully, there is such a way: I used synchronize.js, which makes asynchronous code look as if it were synchronous.
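synchronize.js itself is built on fibers, but the effect it gives, writing asynchronous calls as if they returned values directly, can be illustrated with a tiny generator-based runner. This is a sketch of the idea only, not the actual synchronize.js API; `fetchPage` is a fake callback-style operation, and the hand-drained queue stands in for the real event loop.

```javascript
// Minimal runner: the generator yields at each async call and is resumed
// by the callback, so the code between yields reads as straight-line code.
function sync(genFn) {
  const it = genFn(resume);
  function resume(err, value) {
    if (err) it.throw(err); else it.next(value);
  }
  it.next();
}

// A fake asynchronous API in Node's callback style. Callbacks are queued
// here and drained below, standing in for the real event loop.
const eventLoop = [];
function fetchPage(url, callback) {
  eventLoop.push(() => callback(null, '<html>' + url + '</html>'));
}

// Crawler code now reads top to bottom, as if it were synchronous.
const pages = [];
sync(function* (resume) {
  pages.push(yield fetchPage('/laptops', resume));
  pages.push(yield fetchPage('/phones', resume));
});
while (eventLoop.length > 0) eventLoop.shift()();

console.log(pages); // [ '<html>/laptops</html>', '<html>/phones</html>' ]
```

The win is mostly in the site-specific rules: people writing them can call browser helpers one after another without nesting callbacks or threading errors by hand.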
I also created a UI to simplify Crawler administration.
Making it fast and reliable
So, now we have a working Crawler, but there are two problems. It is slow because it runs on one machine only. And it is unreliable because the Browser Emulator frequently hangs and crashes.
The Crawler needs to process millions of pages per day, and to be able to do that we need to use lots of machines.
This is not simple: as soon as we expand to multiple machines, lots of problems suddenly arise.
- How to easily start up and prepare a machine: create the server, install all the needed software, and run all the needed processes?
- How to detect when a machine crashes, and how to fix it while keeping the Crawler working on the other machines?
- How to detect Browser Emulator crashes and repair them?
- How to balance the assignment of sites across machines?
I will discuss one possible way to solve these issues in the next article.