Searching large documents in ElasticSearch

We continue our series of articles about what we learned about ES while building Ambar. The first article in the series covered highlighting large text fields in ElasticSearch.
In this article we will talk about how to make ES search quickly in documents larger than 100 MB. With the naive approach, searching such documents takes tens of seconds; we managed to bring that down to 6 ms.
The problem of searching in large documents
As you know, everything about search in ES revolves around the _source field: the original document that comes into ES and is then indexed by Lucene.
Consider an example of a document that we store in ES:
{
  "sha256": "1a4ad2c5469090928a318a4d9e4f3b21cf1451c7fdc602480e48678282ced02c",
  "meta": [
    {
      "id": "21264f64460498d2d3a7ab4e1d8550e4b58c0469744005cd226d431d7a5828d0",
      "short_name": "quarter.pdf",
      "full_name": "//winserver/store/reports/quarter.pdf",
      "source_id": "crReports",
      "extension": ".pdf",
      "created_datetime": "2017-01-14 14:49:36.788",
      "updated_datetime": "2017-01-14 14:49:37.140",
      "extra": [],
      "indexed_datetime": "2017-01-16 18:32:03.712"
    }
  ],
  "content": {
    "size": 112387192,
    "indexed_datetime": "2017-01-16 18:32:33.321",
    "author": "John Smith",
    "processed_datetime": "2017-01-16 18:32:33.321",
    "length": "",
    "language": "",
    "state": "processed",
    "title": "Quarter Report (Q4Y2016)",
    "type": "application/pdf",
    "text": ".... a lot of text here ...."
  }
}
For Lucene, _source is the atomic unit, and by default it contains all the fields of the document. The Lucene index itself is a sequence of tokens from all fields of all documents.
So, the index contains N documents. Each document has about two dozen fields, all of them rather short, mostly of types keyword and date, with one exception: the long text field content.text.
Now let's try, as a first approximation, to understand what happens when we search on any field of such documents. For example, suppose we want to find documents created after January 14, 2017. To do this, we run the following query:
curl -X POST -H "Content-Type: application/json" -d '{ "query": { "range": { "meta.created_datetime": { "gt": "2017-01-14 00:00:00.000" } } } }' "http://ambar:9200/ambar_file_data/_search"
The result of this query will take a long while to arrive, for several reasons:
First, the search runs over all fields of all documents, even though we are only filtering by creation date. This is because the atomic unit for Lucene is _source, and by default the index is a sequence of tokens from all of a document's fields.
Second, while assembling the results, ES loads each matching document into memory in its entirety, including the huge content.text field that we do not need.
Third, having collected these huge documents in RAM, ES then tries to send them back to us in a single response.
OK, the third problem is easily solved by adding source filtering to the query. What about the other two?
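As an illustration, here is a minimal sketch of the same range query with source filtering enabled, so the response no longer carries content.text back over the wire (field names are taken from the example document above; the parameter is spelled excludes as of ES 5.x):

# sketch: exclude content.text from the returned _source
curl -X POST -H "Content-Type: application/json" -d '{
  "_source": { "excludes": [ "content.text" ] },
  "query": {
    "range": { "meta.created_datetime": { "gt": "2017-01-14 00:00:00.000" } }
  }
}' "http://ambar:9200/ambar_file_data/_search"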
Speeding up the search
Obviously, searching over, loading into memory, and serializing results that include the huge content.text field is a bad idea. To avoid this, we have to make Lucene store and process the large field separately from the other fields of the document. Here are the necessary steps.
First, in the mapping of the large field you must set the parameter store: true. This tells Lucene that the field has to be stored separately from _source, i.e. from the rest of the document. It is important to understand that at the logical level the field is not excluded from _source! It is just that, when it needs the document, Lucene will assemble it in two steps: it takes _source and appends the stored field content.text to it.
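A minimal mapping sketch for such a field at index creation time (the document type name file here is an assumption, and on ES versions before 5.x the field type would be string rather than text):

# sketch: declare content.text as a separately stored field
curl -X PUT -H "Content-Type: application/json" -d '{
  "mappings": {
    "file": {
      "properties": {
        "content": {
          "properties": {
            "text": { "type": "text", "store": true }
          }
        }
      }
    }
  }
}' "http://ambar:9200/ambar_file_data"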
Second, we must tell Lucene that the "heavy" field should no longer be included in _source. After that, searches will stop pulling huge 100 MB documents into memory. To do this, add the following lines to the mapping:
"_source": {
  "excludes": [ "content.text" ]
}
So, what do we get in the end? When a document is added to the index, _source is indexed without the "heavy" field content.text, which is indexed separately. A search over any "light" field no longer involves content.text at all: for such a query Lucene works with trimmed documents that are not 100 MB but a couple of hundred bytes, so the search is very fast. Searching over the "heavy" field is also possible and efficient, since it now runs over an array of fields of a single type. Searching over "heavy" and "light" fields of the same document at once is possible and efficient as well. It is done in three stages:
- a fast search over the trimmed documents (_source)
- a search over the array of "heavy" fields (content.text)
- a quick merge of the results, without returning the full content.text fields (see the sketch below)
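When the stored text of a particular document is actually needed, it can be requested explicitly instead of being dragged through _source. A minimal sketch, assuming sha256 is mapped as a keyword field as in the example document above (the request parameter is called stored_fields in ES 5.x; older versions used fields):

# sketch: fetch the separately stored content.text for one document
curl -X POST -H "Content-Type: application/json" -d '{
  "query": { "term": { "sha256": "1a4ad2c5469090928a318a4d9e4f3b21cf1451c7fdc602480e48678282ced02c" } },
  "stored_fields": [ "content.text" ]
}' "http://ambar:9200/ambar_file_data/_search"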
To assess the speed, let's search for the phrase "Ivan Ivanov" in the content.text field, with a filter on the content.size field restricting the results to documents larger than 100 MB. The query is shown below:
curl -X POST -H "Content-Type: application/json" -d '{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        { "range": { "content.size": { "gte": 100000000 } } },
        { "match_phrase": { "content.text": "Ivan Ivanov" } }
      ]
    }
  }
}' "http://ambar:9200/ambar_file_data/_search"
Our test index contains roughly 3.5 million documents, and everything runs on a single modest machine (16 GB of RAM, ordinary storage on a RAID 10 of SATA disks). The results are as follows:
- Naive search with the basic mapping: 6.8 seconds
- Our approach: 6 ms
In total, a performance gain of roughly a thousand times (from 6.8 s down to 6 ms). Agree, for such a result it was worth spending a few evenings studying Lucene and ElasticSearch, and a few days writing this article. There is, however, one pitfall in our approach.
Side effects
If you store a field separately and exclude it from _source, you will run into one rather nasty gotcha about which there is virtually no information publicly available or in the ES manuals.
The problem: you cannot partially update a document's _source fields with an update script without losing the separately stored field! If, for example, your script adds a new object to the meta array, ES has to reindex the whole document (which is natural), but the separately stored field content.text gets lost along the way. You end up with an updated document whose stored_fields contain nothing besides _source. So if you need to update some of the _source fields, you also have to rewrite and store the heavy field again.
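For illustration, here is a minimal sketch of the kind of partial update that triggers this behavior. The type name file and the document id are placeholders, and the exact script syntax ("inline" vs "source") depends on the ES version; the point is only that a scripted _update rebuilds the document from _source, so the separately stored content.text is dropped:

# sketch: a scripted partial update that silently loses the stored content.text
curl -X POST -H "Content-Type: application/json" -d '{
  "script": {
    "inline": "ctx._source.meta.add(params.newMeta)",
    "params": {
      "newMeta": { "short_name": "copy.pdf", "source_id": "crReports" }
    }
  }
}' "http://ambar:9200/ambar_file_data/file/<document_id>/_update"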
Result
For us, this is the second time we have used ES in a large project, and once again we managed to solve all our problems while keeping the search fast and efficient. ES is really very good; you just need to be patient and know how to configure it properly.