Making Crawler Fast and Reliable

As I described in the previous article, the basic version of the Crawler worked well and proved to be usable. The problem: it was slow and unstable.

To make it fast, we need to run it on multiple machines (about 5 to 20). To make it stable, we need to figure out how to build a reliable system from unreliable components.

The unreliable component: the Crawler uses a browser emulator (Selenium) with JavaScript enabled to properly render the content of sites. It consumes lots of resources and frequently crashes and hangs (it is not very stable by itself and, what's worse, invalid HTML or JS on some sites can crash or hang it).
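
One common way to contain a component that crashes and hangs is to run it in a separate process, kill it when it exceeds a time limit, and retry. The sketch below is only an illustration of that idea, not the Crawler's actual code; the function name `run_with_timeout` and its parameters are made up for this example.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, max_attempts=3):
    """Run cmd in a child process, retrying when it crashes or hangs.

    Returns the child's stdout on success, or None if every attempt failed.
    (Hypothetical helper for illustration only.)
    """
    for _ in range(max_attempts):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # The child hung; subprocess.run has already killed it.
            continue
        if result.returncode == 0:
            return result.stdout
        # A non-zero exit code is treated as a crash, so try again.
    return None


if __name__ == "__main__":
    # A stand-in for a flaky browser job: here it simply prints a line.
    print(run_with_timeout([sys.executable, "-c", "print('page rendered')"], 10))
```

Because the unstable part lives in its own process, a crash or hang costs one attempt instead of taking the whole crawler down.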

Going distributed

Using multiple machines instead of just one makes things a bit more complex, because a couple of issues arise:

  • Provisioning and deploying a couple of tens of machines. I don't want to do it by hand.
  • Handling machine crashes and recovering from them. The Crawler should be robust and continue to work if one of its nodes crashes, and pick that node up again when it gets fixed or when a new node gets added.
  • Detecting that one of its nodes has hung and needs to be rebooted.
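
The last point, detecting a hung node, is often handled with heartbeats: each node reports in periodically, and a node that stays silent for too long is considered hung. Here is a minimal sketch of that technique; the class `HeartbeatMonitor` and the timeout value are assumptions for illustration, not the Crawler's actual design.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat of each node (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested
        self.last_seen = {}

    def beat(self, node_id):
        # Called whenever a node reports that it is still alive.
        self.last_seen[node_id] = self.clock()

    def hung_nodes(self):
        # Nodes silent for longer than timeout_s are candidates for a reboot.
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=30)
    monitor.beat("node-1")
    print(monitor.hung_nodes())
```

The same record of last-seen times also covers the recovery case: a fixed or newly added node becomes visible again simply by sending its first heartbeat.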


Crawling JavaScript Sites

I recently finished a project: a web crawler for JavaScript sites. It crawls eCommerce sites and collects information about their products. You can see the parsed products for one such eCommerce site below.

Usually crawlers browse a site's pages, collect the HTML, then parse it and extract some data.

But in our case we can't use that approach, because the sites we are interested in aren't ordinary sites. There is no useful data in the HTML sent by the server, because the pages are rendered using JavaScript in the browser. To parse such sites you need a full browser with a working JavaScript engine.
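
To make the problem concrete, here is a small sketch using only Python's standard `html.parser`. The page in `SERVER_HTML` is invented for illustration; it imitates what a server might send for a JavaScript-rendered page. A parser that does not execute scripts finds no product text in it at all.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style>, without running any JS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical server response: the product only appears after the script runs.
SERVER_HTML = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById('app').innerHTML = '<h1>Red Shoes</h1>';
  </script>
</body></html>
"""

if __name__ == "__main__":
    print(repr(visible_text(SERVER_HTML)))
```

The product name exists only inside the script source, so it shows up in the page only after a real browser has executed that script.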

Browser Emulator


SQL for Data Analyst

These days lots of data is available in digital form, and the ability to analyze that data and extract meaning from it is becoming more important. This kind of work is usually called Data Analysis or Data Mining, and the person who does it is called a Data Analyst.

Actually, I wrote this article for my brother (who is an Analyst and asked me about SQL) and decided to publish it because it may also be useful for others.

The main skills of an Analyst are Mathematics, Statistics and Domain expertise. But in order to apply those skills, the Analyst has to be able to get the data itself. The most widely supported way to access data is SQL, and an Analyst can benefit greatly from knowing its basics.

SQL is a declarative language for querying and transforming data stored in a relational database.

  • Declarative - it means that you declare what you want without explicitly telling how to do it. The database figures out by itself how to fulfill your request in the best way. That's a good thing, because describing what you want is usually much simpler than explaining how to get it.
  • Relational database - a special type of database (the most widely used type) that stores data as rows (also called records) in tables.
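
A small example may help show what "declarative" means in practice. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `products` table and its rows are made up for illustration. The query only states what is wanted (the number of products and the average price per category), and the database decides how to compute it.

```python
import sqlite3

# An in-memory relational database with one table of rows (records).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Red Shoes", "shoes", 40.0),
        ("Blue Shoes", "shoes", 60.0),
        ("Green Hat", "hats", 15.0),
    ],
)

# Declarative query: no loops, no instructions on how to group or average.
rows = conn.execute(
    "SELECT category, COUNT(*), AVG(price) "
    "FROM products GROUP BY category ORDER BY category"
).fetchall()

print(rows)
```

Doing the same by hand would mean writing loops, grouping rows in a dictionary and computing the averages yourself.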


Hadoop: The Definitive Guide

If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS, and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
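
To give a feel for the MapReduce model mentioned above, here is a tiny word count written in plain Python. It runs in a single process and does not use any Hadoop API; the function names are my own, chosen only to mirror the map, shuffle and reduce steps.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a real cluster the map and reduce steps would run on many machines in parallel, with the framework doing the shuffle between them.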


Interesting resources about AI

Detecting whether a domain name was generated by a human or a robot: http://nbviewer.ipython.org/github/ClickSecurity/data_hacking/blob/master/dga_detection/DGA_Domain_Detection.ipynb

Some learning resources with a simple approach and lots of interesting examples: https://github.com/hangtwenty/dive-into-machine-learning

A podcast about AI: http://www.thetalkingmachines.com/blog/

Infographics: http://www.randalolson.com/blog/page/4/

A collection of easy and practical approaches: http://www.igvita.com/2011/04/20/intuition-data-driven-machine-learning. In the body of that article there is also a link to another interesting presentation from Google about its machine translator, don't miss it.

Foundations of Intelligent Agents: how to generate algorithms and judge whether they are right using Kolmogorov Complexity.

Peter Norvig about Google Algorithms:

http://www.youtube.com/watch?v=HT540VrCDwg http://www.youtube.com/watch?v=nU8DcBF-qo4

Computing Like the Brain: a mathematical model of the brain, sparse distributed representation, semantic (locality sensitive) hashing, sequential memory, prediction and anomaly detection, and some practical applications.
