Making Crawler Fast and Reliable

So, as I mentioned in the previous article, the basic version of the Crawler worked well and proved to be usable. The problem: it was slow and unstable.

To make it fast, we need to run it on multiple machines (about 5-20). And to make it stable, we need to figure out how to build a reliable system from unreliable components.

The unreliable component: the Crawler uses a Browser Emulator (Selenium) with JavaScript enabled to properly render the content of sites. It consumes lots of resources and frequently crashes and hangs (it is not very stable by itself, and what's worse, some sites contain invalid HTML or JS that can crash or hang it).
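Since the browser can crash at any moment, the simplest way to get reliable behavior out of it is to supervise it: run each fetch in a fresh browser and restart on failure. A minimal sketch, where `make_browser`, `fetch` and `BrowserCrashed` are hypothetical placeholders standing in for the real Selenium calls, not Selenium's actual API:

```python
class BrowserCrashed(Exception):
    """Raised when the browser emulator crashes or is deemed hung."""

def crawl_with_restarts(make_browser, fetch, url, max_attempts=3):
    """Fetch one page, discarding the browser and starting a fresh
    one after each crash; give up after max_attempts tries."""
    last_error = None
    for _ in range(max_attempts):
        browser = make_browser()           # fresh browser per attempt
        try:
            return fetch(browser, url)     # may raise BrowserCrashed
        except BrowserCrashed as error:
            last_error = error             # drop broken browser, retry
        finally:
            try:
                browser.quit()             # a crashed browser may fail here too
            except Exception:
                pass
    raise last_error
```

The key design point is that the browser is treated as disposable: no attempt is made to repair a wedged instance, it is simply thrown away and replaced.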

Going distributed

Multiple machines instead of just one make things a bit more complex, because a couple of issues arise:

  • Provisioning and deploying a couple dozen machines. I don't want to do it by hand.
  • Handling machine crashes and healing afterwards. The Crawler should be robust and continue working if one of its nodes crashes, and pick that node up again when it gets fixed or a new node is added.
  • Detecting if one of the nodes hangs and needs to be rebooted.
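For the last point, one common approach is a heartbeat: each node periodically writes a timestamp to shared storage (a database row, a file, etc.), and a supervisor treats any node whose heartbeat is stale as hung. A minimal sketch (the 60-second threshold and all names are my assumptions, tune them for your setup):

```python
import time

HEARTBEAT_TIMEOUT = 60  # seconds without a heartbeat before a node is suspect

def dead_nodes(heartbeats, now=None):
    """Given {node_id: last_heartbeat_timestamp}, return the nodes that
    haven't reported within HEARTBEAT_TIMEOUT and should be rebooted."""
    now = time.time() if now is None else now
    return sorted(node for node, last in heartbeats.items()
                  if now - last > HEARTBEAT_TIMEOUT)
```

A supervisor process would call `dead_nodes` on a schedule and trigger reboots for whatever it returns.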


Crawling JavaScript Sites

A recently finished project: a Web Crawler for JavaScript sites. It crawls eCommerce sites and collects information about products. You can see parsed products for one such eCommerce site below.

Usually, crawlers browse site pages, collect the HTML, then parse it and extract data.

But in our case we can't use that approach, because the sites we are interested in aren't usual sites. There is no useful data in the HTML sent by the server, because the pages are rendered using JavaScript in the Browser. To parse such sites you need a full Browser with a working JavaScript Engine.
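Once the browser emulator has executed the page's JavaScript, extraction is back to ordinary HTML parsing, just against the rendered DOM instead of the raw server response. A minimal sketch with Python's standard-library parser (the `product-title` class and the sample markup are made-up examples, not taken from any real site):

```python
from html.parser import HTMLParser

class ProductTitleParser(HTMLParser):
    """Collects the text of elements marked class="product-title"."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if ("class", "product-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

# In the real crawler this string would be the page source taken from the
# browser emulator after rendering, not the raw HTTP response.
rendered_html = '<div class="product-title">Red Shoes</div>'
parser = ProductTitleParser()
parser.feed(rendered_html)
```

The browser does the expensive rendering step; everything after that is the same parse-and-extract work a classic crawler does.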

Browser Emulator


MongoDB Performance Tuning

The best feature of MongoDB is not its performance but its simple and flexible data model. So, let's say you build a prototype: you concentrate on the big picture, the product itself, and ignore little things like performance and db indexes.

Later you deploy your product into the wild, users come, and it starts to get slow. You need to add indexes, and to do so you need to know the data usage patterns. Finding them manually by searching the codebase is boring and not very productive. Thankfully, MongoDB has a Profiler: all you need to do is enable it, and it will give you the details about slow queries and which indexes you need to add.
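As a rough illustration of working with the Profiler's output: profiling is enabled in the mongo shell with `db.setProfilingLevel(1, 100)` (log operations slower than 100 ms), and the results land in the `system.profile` collection, whose documents carry fields like `ns` (the namespace) and `millis` (the duration). A small sketch of picking out the worst offenders (the helper itself is mine, not part of MongoDB):

```python
def slowest_queries(profile_docs, limit=5):
    """Return (namespace, millis) pairs for the slowest operations found
    in a list of system.profile documents, worst first."""
    docs = sorted(profile_docs, key=lambda d: d.get("millis", 0), reverse=True)
    return [(d["ns"], d["millis"]) for d in docs[:limit]]
```

In practice you would read the documents straight from `db.system.profile` and then add indexes for the namespaces and query shapes that dominate the list.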

I like this approach very much, because it fits iterative & lean development very well: you always concentrate on the most important thing at the moment. At the first step, the most important thing is to experiment with the product and features without being distracted by performance issues, and MongoDB's flexible data model comes in very handy for that. Later, you deploy the product into production and can use the Profiler to zoom in on more fine-grained performance details.



A little about CouchDB

I've recently read the CouchDB Guide for the second time. It's a very interesting book: it's fascinating to understand how CouchDB works internally, and it's one of those rare books that creates a mind shift and expands it into new territory. It's definitely worth the time spent, even if I never end up using CouchDB.

So, what's good about CouchDB:

  • Reliable: it can accept thousands of connections and behaves elastically and predictably under high load (MongoDB is also pretty fast).
  • Can efficiently (without blocking or copying) take a snapshot of the database state using MVCC (MongoDB can't do that).
  • High availability: thanks to MVCC it never blocks (MongoDB blocks on writes, although usually that's not a problem unless you have a write-heavy application).
  • Incremental Map/Reduce (MongoDB can do something similar).
  • Incremental, non-blocking, consistent replication (MongoDB can also do this).
  • Has notifications / a change-log (MongoDB doesn't have this).
  • Has versioning, although pretty basic (MongoDB doesn't have this).

Now, about disadvantages:


Rad SBS - simple store, site and organizer

My old, abandoned project: it allows you to create a simple site and store, and can be used as an organizer.

My biggest mistake was trying to finish two projects at once: Rad SBS and the Web Framework that powers it, called Rad (inspired by Ruby on Rails, but more modular and object-oriented).

When I started it about a year and a half ago, I planned to build a simple solution to be used as a Site and Collaboration tool for Small Businesses (a mix of Alfresco, Jive ClearSpace, Backpack and WordPress).

I already had some experience delivering this kind of project (about 5 years working in business consulting) and thought I could do something similar alone. But sadly, it turned out that I'm not a genius and can't do it alone in a reasonable amount of time.

So, when I recognized this, I was forced to throw out everything (about 60% of the features) except the most fundamental ones. Lots of them were already half-done, so it wasn't an easy decision.

Below are some conclusions that may be useful if you want to finish a project fast:

  • Lots of your code will be thrown out