The Basics Of Elasticsearch

Elasticsearch is a search engine with a JSON REST API that uses Lucene and is written in Java. A description of all the benefits of this engine is available on the official website. Hereafter we will refer to Elasticsearch as ES.


Such engines are used for complex search over a base of documents, for example, search that accounts for the morphology of a language, or search by geo coordinates.


In this article, I will cover the basics of ES using the indexing of blog posts as an example. I will show how to filter, sort, and search documents.


To stay independent of the operating system, I will make all requests to ES with cURL. There is also a plugin for Google Chrome called Sense.


Links to documentation and other sources are scattered throughout the text, and links for quick access to the documentation are collected at the end. Definitions of unfamiliar terms can be found in glossaries.



Installing ES


First of all, we need Java. The developers recommend installing a Java version newer than Java 8 update 20 or Java 7 update 55.


An ES distribution is available on the developer's website. After unpacking the archive, you need to run bin/elasticsearch. Packages for apt and yum are also available, and there are official images for docker. See the documentation for more information about installation.
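For example, if docker is available, a node can be started like this (a sketch; the image tag is my assumption for the version used in this article):

# run ES 2.2 in the background and expose the REST api on port 9200
docker run -d -p 9200:9200 elasticsearch:2.2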


After installing and starting it, let's check that it works:


# for convenience, save the address in a variable
# export ES_URL=$(docker-machine ip dev):9200
export ES_URL=localhost:9200

curl -X GET $ES_URL

We should get roughly this answer:


{
  "name" : "Heimdall",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.1",
    "build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1",
    "build_timestamp" : "2016-03-09T09:38:54Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
  },
  "tagline" : "You Know, for Search"
}


Indexing


Let's add a post to ES:


# Add a document of type post with id 1 to the blog index.
# ?pretty means the output should be human-readable.

curl -XPUT "$ES_URL/blog/post/1?pretty" -d'
{
  "title": "Funny kittens",
  "content": "<p>A funny story about kittens<p>",
  "tags": [
    "kittens",
    "funny story"
  ],
  "published_at": "2014-09-12T20:44:42+00:00"
}'

The server's response:


{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : false
}

ES automatically created the blog index and the post type. We can draw a loose analogy: an index is like a database, and a type is like a table in that database. Each type has its own schema, a mapping, just as a relational table does. The mapping is generated automatically when the document is indexed:


# Get the mappings of all types of the blog index
curl -XGET "$ES_URL/blog/_mapping?pretty"

In the server's response, I added the values of the indexed document's fields in comments:


{
  "blog" : {
    "mappings" : {
      "post" : {
        "properties" : {
          /* "content": "<p>A funny story about kittens<p>" */
          "content" : {
            "type" : "string"
          },
          /* "published_at": "2014-09-12T20:44:42+00:00" */
          "published_at" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          /* "tags": ["kittens", "funny story"] */
          "tags" : {
            "type" : "string"
          },
          /* "title": "Funny kittens" */
          "title" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

It is worth noting that ES does not distinguish between a single value and an array of values. For example, the title field contains just a title while the tags field contains an array of strings, yet they are represented the same way in the mapping.
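To illustrate, a small sketch (using a hypothetical sandbox index, not part of our running example): indexing a single string and an array of strings produces the same string mapping for the field:

# both documents lead to the same mapping for the tags field
curl -XPUT "$ES_URL/sandbox/doc/1?pretty" -d '{"tags": "kittens"}'
curl -XPUT "$ES_URL/sandbox/doc/2?pretty" -d '{"tags": ["kittens", "puppies"]}'
curl -XGET "$ES_URL/sandbox/_mapping?pretty"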
We will talk more about mappings later.



Queries



Retrieving a document by its id:


# get the document of type post with id 1 from the blog index
curl -XGET "$ES_URL/blog/post/1?pretty"

{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Funny kittens",
    "content" : "<p>A funny story about kittens<p>",
    "tags" : [ "kittens", "funny story" ],
    "published_at" : "2014-09-12T20:44:42+00:00"
  }
}

New keys appeared in the answer: _version and _source. In general, all keys that start with _ are service keys.


The _version key shows the version of the document. It is needed for the optimistic locking mechanism. For example, we want to change a document that currently has version 1. We send the modified document and specify that we are editing the document with version 1. If someone else also edited the document with version 1 and submitted their changes before us, ES will not accept our changes, because it already stores the document with version 2.
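A minimal sketch of that flow (assuming our document is still at version 1; PUT replaces the whole document, so all fields are sent again, and on a conflict ES responds with a version conflict error):

# accept the write only if the current version of the document is 1
curl -XPUT "$ES_URL/blog/post/1?version=1&pretty" -d'
{
  "title": "Funny kittens",
  "content": "<p>A funny story about kittens<p>",
  "tags": ["kittens", "funny story"],
  "published_at": "2014-09-12T20:44:42+00:00"
}'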


The _source key contains the document as we indexed it. ES does not use this value for search operations, because searching uses indexes. To save space, ES stores the original document in compressed form. If we only need the id and not the whole original document, storage of _source can be disabled.
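Disabling it is done in the mapping; a hedged sketch (the articles index is hypothetical, and keep in mind that without _source the original documents are gone, so features that rely on them stop working):

# create an index whose post type does not store the original documents
curl -XPOST "$ES_URL/articles" -d'
{
  "mappings": {
    "post": {
      "_source": { "enabled": false }
    }
  }
}'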


If we don't need the extra information, we can fetch only the contents of _source:


curl-XGET "$ES_URL/blog/post/1/_source?pretty"

{
  "title" : "Funny kittens",
  "content" : "<p>A funny story about kittens<p>",
  "tags" : [ "kittens", "funny story" ],
  "published_at" : "2014-09-12T20:44:42+00:00"
}

You can also request only certain fields:


# get only the title field
curl -XGET "$ES_URL/blog/post/1?_source=title&pretty"

{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Funny kittens"
  }
}

Let's index some more posts and perform more complex queries.


curl-XPUT "$ES_URL/blog/post/2" -d'
{
"title": "Funny puppies",
"content": "<p>a Funny story about the puppies<p>",
"tags": [
"puppies",
"funny story"
],
"published_at": "2014-08-12T20:44:42+00:00"
}'

curl-XPUT "$ES_URL/blog/post/3" -d'
{
"title": "How I got my kitty"
"content": "<p>a Heartbreaking story about a poor kitten in the street<p>",
"tags": [
"kittens"
],
"published_at": "2014-07-21T20:44:42+00:00"
}'


Sorting


# find the latest post by publication date and fetch the title and published_at fields
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "size": 1,
  "_source": ["title", "published_at"],
  "sort": [{"published_at": "desc"}]
}'

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "title" : "Funny kittens",
        "published_at" : "2014-09-12T20:44:42+00:00"
      },
      "sort" : [ 1410554682000 ]
    } ]
  }
}

We got the most recent post. size limits the number of documents in the results. total shows the total number of documents matching the query. sort in the results contains the array of integers on which the sorting was performed; that is, the date was converted into an integer. You can read more about sorting in the documentation.



Filters and queries


Since version 2, ES does not distinguish between filters and queries; instead, the concept of contexts was introduced. A query context differs from a filter context in that a query generates a _score and is not cached. I will show what _score is later.
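The two contexts can be combined in one bool query; a sketch (the must clause runs in the query context and contributes to _score, while the filter clause runs in the filter context and can be cached):

# full-text match on content, narrowed down by a cheap, cacheable date filter
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "query": {
    "bool": {
      "must": { "match": { "content": "story" } },
      "filter": { "range": { "published_at": { "gte": "2014-08-01" } } }
    }
  }
}'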



Filter by date


We use the range query in a filter context:


# get the posts published on September 1st or later
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "filter": {
    "range": {
      "published_at": { "gte": "2014-09-01" }
    }
  }
}'
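The range query also accepts both bounds at once; a sketch (the dates are my example values, and date math expressions like "now-1M" would work as well):

# get the posts published during September 2014
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "filter": {
    "range": {
      "published_at": { "gte": "2014-09-01", "lt": "2014-10-01" }
    }
  }
}'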


Filter by tags


We use the term query to search for documents containing the given value:


# find all documents whose tags field contains the element 'kittens'
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": [
    "title",
    "tags"
  ],
  "filter": {
    "term": {
      "tags": "kittens"
    }
  }
}'

"took" : 9, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "blog", "_type" : "post", "_id" : "1", "_score" : 1.0, "_source" : { "title" : "Funny kittens", "tags" : [ "kittens", "funny story" ] } }, { "_index" : "blog", "_type" : "post", "_id" : "3", "_score" : 1.0, "_source" : { "title" : "How I got my kitty" "tags" : [ "kittens" ] } } ] } }


Full-text search


The content fields of our three documents contain the following:


  • <p>A funny story about kittens<p>
  • <p>A funny story about puppies<p>
  • <p>A heartbreaking story about a poor kitten from the street<p>

We use the match query to search for documents containing the given word:


# "_source": false means there is no need to fetch the _source of the documents found
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": false,
  "query": {
    "match": {
      "content": "story"
    }
  }
}'

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "2",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 0.095891505
    } ]
  }
}

However, if we search for "stories" in the content field, we will find nothing, because the index contains only the original word forms, not their stems. To get a proper search, we need to configure the analyzer.


The _score key shows relevance. If a query runs in the filter context, the _score value will always be 1, which means a complete match with the filter.



Analyzers


Analyzers are needed to transform the source text into a set of tokens.
An analyzer consists of one Tokenizer and zero or more optional TokenFilters; the Tokenizer may be preceded by several CharFilters. Tokenizers split the source string into tokens, for example by whitespace and punctuation. A TokenFilter can change tokens, delete them, or add new ones: for example, keep only the stem of a word, remove prepositions, or add synonyms. A CharFilter changes the source string as a whole, for example, strips out html tags.
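The _analyze api lets you assemble such a chain ad hoc, without defining an analyzer; a sketch (the body keys char_filters, tokenizer and filters are the ES 2.x names, so this is version-specific):

# strip html tags, split with the standard tokenizer, lowercase each token
curl -XGET "$ES_URL/_analyze?pretty" -d'
{
  "char_filters": ["html_strip"],
  "tokenizer": "standard",
  "filters": ["lowercase"],
  "text": "<p>Funny Kittens</p>"
}'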


ES ships with several standard analyzers, for example the russian analyzer.


Let's use the _analyze api and see how the standard and russian analyzers transform the string "Веселые истории про котят" ("Funny stories about kittens"):


# use the standard analyzer
# non-ASCII characters must be url-encoded
curl -XGET "$ES_URL/_analyze?pretty&analyzer=standard&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"

{
  "tokens" : [ {
    "token" : "веселые",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истории",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "про",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "котят",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

# use the russian analyzer
curl -XGET "$ES_URL/_analyze?pretty&analyzer=russian&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"

{
  "tokens" : [ {
    "token" : "весел",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истор",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "котят",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

The standard analyzer split the string by whitespace and lowercased everything; the russian analyzer additionally removed insignificant words (stopwords) and reduced the remaining words to their stems. Notice that the stopword "про" was dropped, which is why position 2 is missing from its output. Here is how the russian analyzer is defined:


{
  "filter": {
    "russian_stop": {
      "type": "stop",
      "stopwords": "_russian_"
    },
    "russian_keywords": {
      "type": "keyword_marker",
      "keywords": []
    },
    "russian_stemmer": {
      "type": "stemmer",
      "language": "russian"
    }
  },
  "analyzer": {
    "russian": {
      "tokenizer": "standard",
      /* TokenFilters */
      "filter": [
        "lowercase",
        "russian_stop",
        "russian_keywords",
        "russian_stemmer"
      ]
      /* no CharFilters */
    }
  }
}

Let's describe our own analyzer based on russian that will also strip out html tags. We will call it default, because an analyzer with this name is used by default.


{
  "filter": {
    "ru_stop": {
      "type": "stop",
      "stopwords": "_russian_"
    },
    "ru_stemmer": {
      "type": "stemmer",
      "language": "russian"
    }
  },
  "analyzer": {
    "default": {
      /* added removal of html tags */
      "char_filter": ["html_strip"],
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "ru_stop",
        "ru_stemmer"
      ]
    }
  }
}

First, all html tags are removed from the source string; then the standard tokenizer splits it into tokens; the tokens are lowercased; insignificant words are removed; and the remaining tokens are reduced to their word stems.



Index creation


Above we described the default analyzer. It will be applied to all string fields. Our post contains an array of tags, so the tags would be processed by the analyzer too. Since we look up posts by exact tag match, we need to disable analysis for the tags field.


Let's create the index blog2 with an analyzer and a mapping in which analysis of the tags field is disabled:


curl-XPOST "$ES_URL/blog2" -d'
{
"settings": {
"analysis": {
"filter": {
"ru_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"ru_stemmer": {
"type": "stemmer",
"language": "English"
}
},
"analyzer": {
"default": {
"char_filter": [
"html_strip"
],
"tokenizer": "standard",
"filter": [
"lowercase",
"ru_stop",
"ru_stemmer"
]
}
}
}
},
"mappings": {
"post": {
"properties": {
"content": {
"type": "string"
},
"published_at": {
"type": "date"
},
"tags": {
"type": "string",
"index": "not_analyzed"
},
"title": {
"type": "string"
}
}
}
}
}'

Now let's add the same 3 posts to the blog2 index. I will not walk through this step by step, since it is analogous to adding documents to the blog index.
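One way to do it in a single request is the _bulk api; a sketch (each action line is followed by its document on one line, and the body has to end with a newline):

curl -XPOST "$ES_URL/_bulk?pretty" -d'
{"index": {"_index": "blog2", "_type": "post", "_id": "1"}}
{"title": "Funny kittens", "content": "<p>A funny story about kittens<p>", "tags": ["kittens", "funny story"], "published_at": "2014-09-12T20:44:42+00:00"}
{"index": {"_index": "blog2", "_type": "post", "_id": "2"}}
{"title": "Funny puppies", "content": "<p>A funny story about puppies<p>", "tags": ["puppies", "funny story"], "published_at": "2014-08-12T20:44:42+00:00"}
{"index": {"_index": "blog2", "_type": "post", "_id": "3"}}
{"title": "How I got my kitty", "content": "<p>A heartbreaking story about a poor kitten from the street<p>", "tags": ["kittens"], "published_at": "2014-07-21T20:44:42+00:00"}
'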



Full-text search with expression support


Let's get acquainted with one more type of query:


# find the documents that contain the word 'stories'
# query -> simple_query_string -> query contains the search query
# the title field has priority 3
# the tags field has priority 2
# the content field has priority 1
# priority is used when ranking results
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "stories",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

Since we use an analyzer with Russian stemming, this query returns all the documents, even though they only contain the word 'story' in its singular form.


The query can contain special characters, for example:


"\"fried eggs\" +(eggplant | potato) -frittata"

The query syntax:


+ signifies AND operation
| signifies OR operation
- negates a single token
"wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount
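For example, a sketch combining a phrase with a prefix term (the double quotes inside the JSON body have to be escaped):

# find posts that contain the phrase "funny story" and a word starting with 'kitten'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "\"funny story\" +kitten*",
      "fields": ["title^3", "tags^2", "content"]
    }
  }
}'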

# we search for the word 'puppies'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "puppies",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

# we get post 2, the one about puppies


Links



PS


If you are interested in tutorial articles like this one, have ideas for new articles, or have suggestions about cooperation, I will be glad to get a message in PM or by email at m.kuzmin+habr@darkleaf.ru.

Article based on information from habrahabr.ru
