Dataflow of a search engine

Previous article: how to start a search engine, or a few thoughts about the crawler

In the previous article I talked a bit about my experiments with load and the crawler's operation; here I will describe, in general terms, the dataflow of the index-building part of the project, so that it is clear what I am writing about. I will try to describe each step in detail in its own article.

So, a downloaded page first goes through link extraction. New links pointing to the current site go into a local queue and are downloaded in the current session, while links to all other sites are added to the crawler's global queue. That global queue contains only the main pages of websites.
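As a rough illustration of that routing (the queue objects and the function name here are my own, not code from the project), the logic could look like this:

```python
from urllib.parse import urljoin, urlparse

def route_links(page_url, hrefs, local_queue, global_site_queue):
    """Sketch: split extracted links between the per-site queue
    (same host, crawled in this session) and the crawler-wide queue,
    which keeps only site front pages."""
    current_host = urlparse(page_url).netloc
    for href in hrefs:
        link = urljoin(page_url, href)
        host = urlparse(link).netloc
        if host == current_host:
            local_queue.append(link)                  # crawl in the current session
        else:
            global_site_queue.add(f"http://{host}/")  # only the site's main page
```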

Once enough pages have been collected from one site, an analyzer is run: it detects the template fragments that appear on most of the pages and cuts them out.
The output is the text of the pages, stripped of everything superfluous and grouped by site.
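A minimal sketch of that idea (the block splitting and the 80% threshold are my assumptions, not numbers from the article): treat each page as a list of text blocks and drop any block that repeats on most pages of the site.

```python
from collections import Counter

def strip_site_templates(pages, threshold=0.8):
    """Sketch: 'pages' is a list of pages, each a list of text blocks.
    Blocks that occur on at least `threshold` of the pages are treated
    as site template (menus, footers, ads) and removed."""
    block_counts = Counter(block for page in pages for block in set(page))
    template = {b for b, c in block_counts.items() if c >= threshold * len(pages)}
    return [[b for b in page if b not in template] for page in pages]
```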

All the prepared pages of one site are placed into a directory that is handed to the indexer. The indexer builds a small version of the direct index out of the sites prepared in the previous step, a few (10-1000) at a time: the HTML is parsed, words are mapped and packed, and per-word metrics are computed. A direct index is a table where, given a document ID, you can get the list of words and information about their occurrences in that document. Attached to it are a word database (built from this small set of documents) and a link database (links not yet added to the index, and therefore without IDs). The links go to the PR recalculation module (the link information includes the words involved and their weights); the rest goes to updating the reverse index. The chunk the indexer prepares is usually 100-300 MB per hour, and that is without the page texts.
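To make the "direct index" concrete, here is a minimal sketch of the structure described above (the in-memory dict layout is my assumption; the real index is packed into files):

```python
def build_direct_index(docs):
    """Sketch of a direct index: document ID -> {word: [positions]}.
    'docs' is assumed to be {doc_id: cleaned_page_text}."""
    direct = {}
    for doc_id, text in docs.items():
        occurrences = {}
        for pos, word in enumerate(text.lower().split()):
            occurrences.setdefault(word, []).append(pos)
        direct[doc_id] = occurrences
    return direct
```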

Computing PR and various metrics is handled by a separate module. The funny thing is that for the PR calculation it is more convenient to store links as strings rather than as IDs, since not all links will end up in the final index, and building a storage structure with extra information for them would be rather silly. Because of this it is easier to write this part in Perl or a similar language that is fast with strings; otherwise you would have to build a system for storing a huge number of strings in a lower-level language.
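For illustration, a simplified PageRank iteration keyed directly by URL strings might look like the sketch below (this is the textbook formulation; dangling pages are simply ignored here, and the article does not say which variant the author uses):

```python
def pagerank(links, iterations=20, d=0.85):
    """Sketch: 'links' is {source_url: [target_url, ...]} with raw URL
    strings as keys, no numeric IDs. Returns {url: score}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_pr = {p: (1.0 - d) / len(pages) for p in pages}
        for src, targets in links.items():
            if targets:
                share = d * pr[src] / len(targets)
                for t in targets:
                    new_pr[t] += share
        pr = new_pr
    return pr
```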

The indexer outputs a packed chunk of the direct index, but for fast search we need a reverse index. The reverse index is a table in which each word is mapped to a list of the documents containing it, sorted by relevance or by PR (as Google does). Obviously, once the database grows, simply taking the direct index and inverting it from scratch would take years, so it is done incrementally: the existing reverse index and a couple of dozen chunks of the direct index prepared at the previous step are merged and then sorted. Technically both happen at once: it is a sorted insert into a sorted list. There are many details here - how to organize the database so it can withstand this, so that you do not have to copy files of hundreds of gigabytes back and forth every time, how to load the needed piece into memory, and so on.
The word database, metrics and the rest are regenerated in the same way, incrementally combining the previous version with the new information. The word IDs in the small index are replaced on the fly with the IDs that match the full version of the database, and the rest of the information is handled similarly.
While the reverse index is being built, a separate database storing page URLs is also updated. It is a separate large structure: a huge number of entries has to be stored with fast access by ID, sequential reads within a single site, and quick checks of whether a link already exists. I deal with this in the article describing the database schema.
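A minimal sketch of the incremental merge (the chunk format and the (-rank, doc_id) encoding are my assumptions; they just keep each posting list sorted without rebuilding it):

```python
import bisect

def merge_chunks_into_reverse_index(reverse_index, chunks):
    """Sketch: fold packed direct-index chunks into the reverse index.
    'reverse_index' maps word -> sorted list of (-rank, doc_id), so a
    plain sorted insert keeps documents ordered by descending rank and
    the index never has to be rebuilt from scratch."""
    for chunk in chunks:                      # chunk: {doc_id: {word: rank}}
        for doc_id, words in chunk.items():
            for word, rank in words.items():
                postings = reverse_index.setdefault(word, [])
                bisect.insort(postings, (-rank, doc_id))   # sorted insert
    return reverse_index
```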

At the stage of rebuilding the reverse index we need the PR data, page metrics and everything else that helps sort the results. In fact, only the first couple of hundred thousand results need to be guaranteed sorted; if a query ever reaches the rest, the ranking functions will handle them.
The built reverse index can already be used for search: take the posting lists of the query words, find their intersection, compute the factors (the distance between the found words, and so on), and sort the results with the ranking functions.
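Again purely as an illustration (the ranking function is left abstract, and the posting format matches the sketch above rather than the real on-disk layout):

```python
def search(reverse_index, query_words, rank_fn):
    """Sketch: intersect the posting lists of the query words, then order
    the surviving documents with a ranking function such as
    rank_fn(doc_id, query_words) -> score."""
    postings = [{doc for _neg_rank, doc in reverse_index.get(w, [])}
                for w in query_words]
    if not postings:
        return []
    docs = set.intersection(*postings)        # documents containing every word
    return sorted(docs, key=lambda d: rank_fn(d, query_words), reverse=True)
```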

A little about morphology. Based on Zaliznyak's morphological dictionary, the stem is selected, the ending is cut off and the first dictionary form is substituted. The whole process is assembled into a tree for faster letter-by-letter parsing; the final leaves contain the endings. We run along the word, descending the tree in parallel according to the letters we meet, until we reach the lowest reachable leaf, and there, based on the endings, we substitute the normalized form.
If no normal form is found, stemming is applied: from the texts of books downloaded from lib.ru I built a table of ending frequencies, the most frequent suitable ending is looked up (also via a tree) and the word is replaced by a normal form. Stemming works well even for words that entered the language only 5-10 years ago; it easily handles the inflected forms of "crawler", for example.
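A sketch of how such a lookup with a stemming fallback could work (the nested-dict trie with an "endings" table at each node, and the plain ending-stripping fallback, are my simplifications of what the article describes):

```python
def normalize(word, trie, ending_freq):
    """Sketch: walk the word letter by letter down a dictionary trie; the
    deepest node carrying ending data maps the word's remaining ending to
    the ending of the first dictionary form. If the dictionary does not
    know the word, fall back to frequency-based stemming."""
    node, deepest, depth = trie, None, 0
    for i, ch in enumerate(word):
        node = node.get(ch)
        if node is None:
            break
        if "endings" in node:                 # node: {letter: child, "endings": {...}}
            deepest, depth = node, i + 1
    if deepest is not None:
        ending = word[depth:]
        if ending in deepest["endings"]:
            return word[:depth] + deepest["endings"][ending]
    # Fallback stemming: strip the most frequent ending that matches
    # (the real module substitutes a normal form instead of just stripping).
    for ending in sorted(ending_freq, key=ending_freq.get, reverse=True):
        if ending and word.endswith(ending):
            return word[:len(word) - len(ending)]
    return word
```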

The full table of contents and the list of my articles about the search engine will be kept up to date here: http://habrahabr.ru/blogs/search_engines/123671/
