Crawling JavaScript Sites

I recently finished a project - a Web Crawler for JavaScript sites. It crawls eCommerce sites and collects information about their products. You can see parsed products from one such eCommerce site below.

Usually crawlers browse site pages, collect the HTML, parse it, and extract some data.

But in our case we can't use that approach, because the sites we are interested in aren't usual sites. There is no useful data in the HTML sent by the server - the pages are rendered with JavaScript in the Browser. To parse such sites you need a full Browser with a working JavaScript Engine.

Browser Emulator

As the first step, we need to somehow emulate a Browser to be able to render such sites. There are a couple of projects for emulating a Browser: modern ones like zombie.js and phantom.js, and the older Selenium.

I tried the modern ones first, and failed. None of them works reliably. The sites the crawler parses use all sorts of JavaScript libraries and tricks, they have bugs and other fancy stuff, and frequently one site or another crashed the emulator or didn't work with it at all.

So next I tried Selenium with Chrome, and it works. It's also not fully reliable and sometimes hangs or crashes, but at least not very frequently.
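
The post doesn't show the exact setup, but driving Chrome through Selenium from Node.js looks roughly like this. It's a minimal sketch using the selenium-webdriver package and async/await for brevity; the CSS selector is a hypothetical placeholder, and the actual crawler used the wrapper and synchronize.js described below instead.

```js
// Minimal sketch: render a JavaScript-heavy page in a real Chrome via Selenium.
// '.product-title' stands in for "something the site renders client-side".
const { Builder, By, until } = require('selenium-webdriver');

async function renderPage(url) {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get(url);
    // Wait until client-side JavaScript has actually rendered the content we need.
    await driver.wait(until.elementLocated(By.css('.product-title')), 10000);
    return await driver.getPageSource(); // fully rendered HTML, not the empty server response
  } finally {
    await driver.quit();
  }
}
```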

Making it work

Now we need to create the Crawler itself. I used a random breadth-first algorithm to traverse site pages and coupled it with the Browser Emulator to render pages and extract the data.
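
The traversal code isn't shown in the post; a rough sketch of the idea, assuming "random breadth-first" means picking the next page at random from the current frontier of discovered-but-unvisited links, could look like this (renderPage, extractLinks, extractProduct and saveProduct are hypothetical helpers built on the Browser Emulator and the storage layer):

```js
// Sketch of a randomized breadth-first crawl over a site's link graph.
async function crawlSite(startUrl, maxPages) {
  const visited = new Set();
  const frontier = [startUrl];

  while (frontier.length > 0 && visited.size < maxPages) {
    // Pick a random page from the frontier instead of strict FIFO order.
    const index = Math.floor(Math.random() * frontier.length);
    const url = frontier.splice(index, 1)[0];
    if (visited.has(url)) continue;
    visited.add(url);

    const html = await renderPage(url);     // render with the Browser Emulator
    const product = extractProduct(html);   // apply site-specific extraction rules
    if (product) await saveProduct(product);

    for (const link of extractLinks(html)) {
      if (!visited.has(link)) frontier.push(link);
    }
  }
}
```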

If you need more details about graphs and algorithms, take a look at 'The Algorithm Design Manual' by S. Skiena - a very interesting book with lots of practical examples.

As development tools I chose Node.js & MongoDB; both are well suited for such a task.
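
MongoDB fits nicely because every site yields slightly different product attributes, so a parsed product can be stored as a flexible document. A small sketch with the official MongoDB Node.js driver; the connection string, database name and fields are assumptions:

```js
// Sketch: upsert a parsed product keyed by its page URL.
const { MongoClient } = require('mongodb');

async function saveProduct(product) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  try {
    await client.db('crawler').collection('products').updateOne(
      { url: product.url },   // one document per product page
      { $set: product },      // arbitrary, site-specific attributes
      { upsert: true }
    );
  } finally {
    await client.close();
  }
}
```

In a real crawler the connection would of course be opened once and reused, not per product.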

Making it simple to use

There is no simple way to create an algorithm that will automatically extract product data from different eCommerce sites. So for every site some rules need to be written by hand. And since there will be lots of such sites, this task should be made as simple as possible.
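
The post doesn't show what those rules look like; one reasonable shape, assumed here purely for illustration, is a small per-site module of CSS selectors that the generic crawler applies to rendered pages:

```js
// Hypothetical hand-written extraction rules for one site.
// The selectors and field names are made up; every site gets its own small file like this.
module.exports = {
  startUrl: 'http://www.example-shop.com',
  isProductPage: function (url) { return /\/product\//.test(url); },
  fields: {
    name:  '.product-title',
    price: '.price .amount',
    image: '.product-photo img'
  }
};
```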

The first problem is that the Selenium API is hard to use, so I wrote a simplified wrapper around it.
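
The wrapper itself isn't shown in the post; a rough sketch of the kind of simplification meant, with hypothetical method names, could be:

```js
// Hypothetical wrapper hiding Selenium's waits and element handles behind a few methods.
const { Builder, By, until } = require('selenium-webdriver');

class Browser {
  async open() {
    this.driver = await new Builder().forBrowser('chrome').build();
  }
  async visit(url) {
    await this.driver.get(url);
  }
  async text(selector) {
    await this.driver.wait(until.elementLocated(By.css(selector)), 10000);
    return this.driver.findElement(By.css(selector)).getText();
  }
  async close() {
    await this.driver.quit();
  }
}
```

With something like this, per-site code only calls browser.text(selector) and never touches waits or element handles directly.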

Another complexity: Node.js uses asynchronous code, which is harder to work with than synchronous code. So it would be nice if there was a way to make it synchronous. Thankfully there is such a way - I used synchronize.js, which makes asynchronous code look as if it were synchronous.
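
A small sketch of the style, using sync.fiber, sync.await and sync.defer from synchronize.js (fs.readFile stands in here for any callback-based call the crawler makes):

```js
var sync = require('synchronize');
var fs = require('fs');

sync.fiber(function () {
  // Without synchronize.js this would need a nested callback; inside a fiber,
  // sync.await() suspends until the callback created by sync.defer() fires.
  var rules = sync.await(fs.readFile('rules/example-shop.js', 'utf8', sync.defer()));
  console.log(rules.length);
});
```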

I also created a UI to simplify Crawler administration.

Making it fast and reliable

So now we have a working Crawler, but there are two problems: it's slow because it runs on only one machine, and it's unreliable because the Browser Emulator frequently hangs and crashes.

It needs to process millions of pages per day, and to be able to do that we need to use lots of machines.

And that is not simple, because when we expand to multiple machines, lots of problems suddenly arise:

  • How to easily start up and prepare a machine - create the server, install all the needed software and run all the needed processes?
  • How to detect when a machine crashes, and how to fix it while keeping the Crawler working on the other machines?
  • How to detect Browser Emulator crashes and recover from them?
  • How to balance machines against sites?

I will discuss one possible way to solve those issues in the next article.