Elasticsearch Data Search Principles

Elasticsearch is an open source, Lucene-based distributed search and analytics engine designed for use in cloud computing environments that enables real-time, scalable search, analysis, and exploration of full-text and structured data. It is highly scalable and can search and analyze large amounts of data in a short period of time.

More than just a full-text search engine, Elasticsearch provides distributed multi-user capabilities, real-time analytics, and the ability to process complex search statements, making it useful in a wide range of scenarios, such as enterprise search, logging, and event data analysis.

This article will give you a detailed overview of what an inverted index is, as well as principles related to Elasticsearch queries, relevance scoring, and search optimization.

1、Inverted index

1.1, why the need for inverted index

Inverted indexes, also indexes. Indexes, in the first place, are all about retrieving the data you want quickly.

Each database has its own problems to solve (or areas of expertise), corresponding to their own data structure, and different usage scenarios and data structure, you need to use different indexes in order to maximize the purpose of speeding up the query.

B+ tree for Mysql and inverted indexes for Elasticsearch and Lucene.

Elasticsearch is a search engine built on Lucene, a full-text search engine library, which hides the complexity of Lucene and provides a simple and consistent RESTful API instead, but it can’t hide the fact that it’s also a Lucene-based search engine. Elasticsearch’s inverted index is actually Lucene’s inverted index.

1.2, why is it called inverted index

“The concept of Inverted Index is derived from Forward Index.

In “forward indexing”, we start from the document, record the words that appear in each document, so that we can know what words are contained in each document. In the “inverted index”, we start from the word, record each word appears in which documents, so that we can know each word is contained in which documents.

Forward indexing: document -> to -> words
Reverse index: word -> to -> documents

Therefore, the “backward index” can be regarded as the inverse operation of the “forward index”, so it is called “backward”. In full-text search, the “inverted index” is a very important data structure, because it allows us to quickly find all documents containing a particular term.

1.3 Structure of the inverted index

A backward index is used as a data structure to store a mapping relationship, i.e., a mapping from a word item to a document in which the word item occurs. It is a core component of full-text search engines such as Elasticsearch, Lucene, etc.

In the inverted index, each unique term has an associated inverted list that contains the IDs of all documents that contain the term, so that when we search for a term, the search engine only needs to look in the inverted index to quickly find all documents that contain the term.

For example, suppose we have the following three documents:

1. Document 1: I love coding
2. Document 2: I love reading
3. Document 3: I love both

After creating an inverted index on these documents, we get the following mapping relationship:

- I: Document 1, Document 2, Document 3
- love: document 1, document 2, document 3
- coding: document 1
- reading: Document 2
- both: Document 3

So, when we search for “love”, the search engine will find “love” in the inverted index, and then return all the documents that contain “love”, i.e., document 1, document 2 and document 3.

2. Data query process

2.1 Principles of data query processing

In Elasticsearch, query processing consists of the following main steps:

Parsing Query Statements: First, Elasticsearch parses the user’s query request and converts it into an internal query representation. This process includes parsing the syntax of the query statement, parsing the query parameters, and verifying the legitimacy of the query statement.
Generating a Query Plan: After parsing a query statement, Elasticsearch generates a query plan. The query plan describes how to execute the query on the inverted index, including which terms need to be queried, how to combine the query results of the terms, and so on.
Executing Queries: With a query plan in place, Elasticsearch can execute queries on the inverted index. This process includes looking up the inverted table of lexical items, calculating the relevance of the documents to the query, generating candidate result sets, and so on.
Generating Query Results: Finally, Elasticsearch generates the final query results based on the candidate result set and query parameters. This process includes sorting candidate results, generating summaries, paging, and so on.

2.2 Parsing Query Statements

In Elasticsearch, parsing the query statement is the first step in query processing. This process mainly consists of the following steps:

Parsing JSON: Elasticsearch’s query statements are usually provided in JSON format. First, Elasticsearch parses the JSON and converts it into internal data structures.

{
  "query": {
    "match": {
      "field_name": "query_value"
    }
  }
}

Parsing Query Types: Query types are often specified in query statements (including Match queries for basic full-text search, Term queries for exact matches, Range queries for range searches, Bool queries for logically combining multiple query criteria, Phrase queries for phrase searches, Wildcard queries for wildcard searches, Prefix queries for prefix searches, and Fuzzy queries for fuzzy searches). Elasticsearch parses the query type and selects the appropriate query processor based on the query type.
Parsing Query Parameters: The query statement will also contain some query parameters such as field names, query values, fuzzy match thresholds, etc. Elasticsearch parses these query parameters and passes them to the query processor.
Validating the query statement: Finally, Elasticsearch validates the legitimacy of the query statement. For example, it checks that the field name exists, that the type of the query value matches the field type, and so on. If the query statement is not legitimate, Elasticsearch returns an error.

2.3 Generate a query plan

In Elasticsearch, the process of generating a query plan consists of determining the type of query (such as thematch、term、range etc.), identifies the fields and values to be queried, and then generates a query plan based on this information that describes how the query is to be executed on the inverted index, including which lexical items need to be queried and how to combine the results of the lexical items.

2.4. Execution of queries

In Elasticsearch, executing a query is a key step in the query processing process. This process mainly consists of the following steps:

Finding terms: Based on the query plan, Elasticsearch looks up the inverted table for each term in the inverted index.
Calculating relevance: Elasticsearch calculates the relevance of each document and query. This is usually done using an algorithm called TF-IDF.
Generate Candidate Result Set: Elasticsearch generates a candidate result set based on the results of the relevance calculation. This result set contains all the documents that may satisfy the query conditions.

2.5 Generate query results

In Elasticsearch, generating query results is the final step in the query processing process. This process mainly consists of the following steps:

Sorting: Elasticsearch sorts the candidate result set based on the relevance of each document and query.
Generate a summary: To make it easier for users to view query results, Elasticsearch generates a summary for each document. The summary usually includes a portion of the document and the location of the query terms.
Paging: If paging parameters are specified in the query request, Elasticsearch will extract a page of results from the sorted result set based on those parameters.
Return Results: Finally, Elasticsearch returns the results of the query to the user. The query results are usually provided in JSON format and include information such as total number of hits, query time, ID of each document, summary, and so on.

This is the basic process of generating query results for Elasticsearch. Note that this process may be affected by the complexity of the query statement, the size of the data volume, the state of the cluster, and other factors.

3. Relevance score

3.1 The role of relevance scores

In Elasticsearch, a relevance score (also known as a rating or score) is used to measure how well a document matches a query condition. It is a value calculated by Elasticsearch’s query module based on the TF-IDF algorithm or other relevance algorithms.

The relevance score is useful in the following ways:

Sorting: When returning query results, Elasticsearch sorts the results based on relevance scores. Documents with higher scores are considered a better match for the query criteria and are therefore ranked higher. ****
Filtering: In some cases, you may only care about those documents that highly match the query criteria. In this case, you can set a rating threshold and only return documents with ratings above this threshold.
Tuning: By understanding and adjusting how relevance scores are calculated, you can optimize the query to better fit your needs. For example, you can influence the importance of a field in the scoring calculation by setting its weight.

It is important to note that the relevance score is not an absolute value and its magnitude does not directly reflect the quality or importance of the document. It simply indicates how well a document matches a particular query condition. The same document may have different scores for different query conditions.

3.2, TF-IDF Principle

The TF-IDF (Word Frequency-Inverse Document Frequency) algorithm is used to evaluate the importance of a word for a particular document in a document set or corpus. It works as follows:

Term Frequency (TF): measures how often a word appears in a document. It is usually calculated by dividing the number of times a word appears in the document by the total number of words in the document. the higher the TF value, the higher the importance of the word in the document.
Inverse Document Frequency (IDF): measures whether a word is common or not. It is calculated by dividing the total number of documents in the corpus by the logarithm of the number of documents containing the word. the higher the IDF value, the more informative the word is, and the more important it is for distinguishing documents.
TF-IDF value calculation: multiply the TF value and IDF value to get the final TF-IDF value, the higher the TF-IDF value, the higher the importance of the word for a certain document.

In Elasticsearch, for each query term, its TF value in the document and its IDF value in the whole corpus are calculated, and then these two values are multiplied together to get the final TF-IDF value. The query results are sorted according to the size of the TF-IDF value, and the larger the TF-IDF value, the higher the relevance of the document and the query.

The goal of the TF-IDF algorithm is to improve the accuracy and relevance of information retrieval by determining the importance of words by taking into account word frequency and word prevalence.

3.3 Other scoring rules

In addition to relevance scoring based on TF-IDF, Elasticsearch provides other scoring rules to fulfill different search needs. Here are some common scoring rules:

Constant Score: This scoring rule assigns the same score to all documents. It is usually used in filtering operations where we only care about whether a document satisfies a condition or not, not about the relevance of the document.
Boolean/Disjunction Max Score: This scoring rule calculates the score for each query condition and then takes the highest score as the final score. It is usually used in multiconditional queries, where we are usually concerned with how well a document satisfies any one condition.
Function Score: This scoring rule allows you to customize the scoring function to implement complex scoring logic. You can calculate a score based on a document’s field values, query parameters, scripts, and other factors.

These are just a few of Elasticsearch’s scoring rules, but Elasticsearch actually provides more scoring rules, such asscript_score、field_value_factor、decay functions etc., which can fulfill a variety of complex search needs.

4、Search function

Elasticsearch provides some advanced search features such as full-text search, fuzzy search, range search, aggregated search, and so on.

4.1 Full-text search

The most basic and core feature of Elasticsearch is full-text search. Full-text search refers to searching a large amount of text data to find documents that contain a specified term, and Elasticsearch uses a data structure such as an inverted index to achieve efficient full-text search.

The working principle of full-text search is mainly based on the inverted index. A backward index is a data structure that maps all the terms to the list of documents in which they appear. When performing full-text search, Elasticsearch will find the corresponding document list according to the query terms, and then calculate the relevance score of each document according to certain scoring rules (such as TF-IDF), and return the results sorted by score.

Elasticsearch’s full-text search supports a variety of query types, such asmatch Query,multi_match Query,query_string queries, etc. These query types can satisfy a variety of complex search needs, such as word search, phrase search, Boolean search and so on.

4.2 Multi-value search

In Elasticsearch, if you need to search on multiple values, you can use theterms Query.terms A query allows you to specify a field and multiple values, and Elasticsearch returns documents with all field values within those values.

terms The query works by converting each value to aterm query, and then theseterm search byOR are combined in the same way. This means that as long as a document’s field values match any of the values, the query condition is considered satisfied.

For example, if you execute aterms query that looks for items with the color “red” or “blue”, Elasticsearch will first look in the inverted index for the two terms “red” and “blue”, Elasticsearch will first look up the inverted list of “red” and “blue” in the inverted index, and then merge the two lists to get the final result.

It is important to note thatterms The query only applies to exact value matches, not to full-text searches. If you need to perform a full-text search on multiple lexical items, you can use themulti_match Inquiry orquery_string Query.

4.3. Fuzzy search

Elasticsearch’s fuzzy search is a feature that can handle misspellings and proximity searches.

The implementation of fuzzy search is mainly based on the editorial distance (Levenshtein distance) algorithm, which calculates the degree of difference between two lexical items. Edit distance is a measure of the degree of difference by calculating the minimum number of single-character editing operations (e.g., insertion, deletion, replacement) required to transform from one lexical item to another.

In Elasticsearch, you can use thefuzzy query to perform a fuzzy search.fuzzy The query allows you to specify afuzziness parameter, which determines the maximum editing distance allowed. For example, thefuzziness parameter is set to 1, then it can match all the words whose editing distance from the query word is within 1.

Fuzzy search is well suited to handle user input errors and can improve the fault tolerance of search, thus enhancing the user experience.

4.4 Scope search

Elasticsearch’s range search allows you to find documents with field values within a specified range.

Scope searches in Elasticsearch are primarily performed through therange query to achieve this. In therange In a query, you can specify an upper bound and a lower bound for a field, and Elasticsearch will return all documents whose field values fall within that range.

For example, you can find all items with a price between 10 and 20, or all articles with a publish date within the past week.

range The query supports many types of fields such as numeric fields, date fields, IP address fields, and so on. For date fields, you can also specify the range using date math expressions such asnow-1d Indicates the past day from the present.

In addition.range The query also supports the setting of open and closed intervals, which you can set with thegte(greater than or equal to),gt(greater than),lte(less than or equal to),lt(less than) and other parameters to control the opening and closing of the zone.

Range search is a very commonly used search method in Elasticsearch, which can fulfill a variety of range-based filtering and querying needs.

4.5. Aggregated search

Elasticsearch’s Aggregate Search is a powerful data analysis tool that allows you to perform a variety of statistical analyses on your search results.

Aggregated search is implemented in Elasticsearch primarily through the Aggregations feature. Aggregations provide a set of operators for analyzing data, such asmin、max、avg、sum、count etc. You can use these operators to statistically analyze search results.

For example, you can use theavg aggregation to calculate the average price of all items, or use thehistogram Aggregate to count the number of items in each price range.

In addition, the aggregation feature supports nested aggregations, where you can base one aggregation on another. This allows you to realize complex data analysis needs, such as grouping statistics, multi-level grouping statistics, and so on.

Aggregate search is a very powerful feature in Elasticsearch that can meet a variety of complex data analysis needs.

5、Search Optimization

5.1 Index Optimization

In Elasticsearch, optimizing the index structure is an important tool for improving search performance. Here are some common index optimization strategies:

Reasonable number of slices: Each index can be divided into multiple slices, each of which is a separate part of the indexed data. The number of slices affects the parallel processing capability of Elasticsearch, but too many slices will increase the management burden of the cluster and may reduce performance. Therefore, you need to set the number of slices reasonably according to the data volume, hardware resources, and other factors.
Use appropriate field types: Elasticsearch supports multiple field types, and different field types have different indexing and search performance. For example, for fields that require full-text search, you should use thetext type, because thetext type will perform word splitting on the field value, which is suitable for full-text search; for fields that need to be matched exactly, you should use thekeyword type, because thekeyword type does not split the field value and is suitable for exact matching.
Disable indexing of fields that don’t need to be searched: If a field doesn’t need to be searched, there’s no point in indexing it. You can disable the indexing of this field in the mapping of theindex The parameters are set tofalse, so that Elasticsearch does not index this field, saving storage space and improving indexing and search performance.
Optimize the document structure: Try to avoid using nested types (nested type), because nested types increase the complexity of the index and storage overhead. If you need to search on array fields, consider using theflattened Type.

These are just some of the ways to optimize Elasticsearch’s index structure, and there are actually many other optimization techniques and strategies, such as using thedoc_values Optimize sorting and aggregation, userouting Optimize slice-and-dice access, etc.

5.2. Query optimization

In Elasticsearch, optimizing query statements is an important tool for improving search performance. Here are some common query optimization strategies:

Avoid high-overhead queries: certain types of queries such aswildcard、regexp、fuzzy etc., have a high overhead due to the need to match on a large number of lexical items. These queries should be avoided in performance-sensitive scenarios.
Preferred use of filters: In Elasticsearch, thefilter respond in singingquery Both can be used to filter documents, butfilter The result can be cached for the next execution of the samefilter When it is possible to use caching directly, performance can be improved. Therefore, filtering conditions that do not require the calculation of a relevance score should be prioritized using thefilter。
Avoid deep paging: Deep paging refers to fetching the next few pages of results, such as the 1000th page. Deep paging requires Elasticsearch to sort all previous results, which has a high overhead. If you need to process a large number of results, you should consider using thescroll API orsearch_after Parameters.
Reduce the number of fields returned: By default, Elasticsearch returns all fields of a document. If you only need some of the document’s fields, you can use the_source parameter to specify the returned fields, which reduces the amount of data transferred over the network and improves performance.

These are just some of the ways to optimize Elasticsearch query statements. There are actually many other optimization techniques and strategies, such as using thebool consultingmust、should、filter、must_not to optimize the Boolean logic usingconstant_score queries to optimize static scores, etc.

5.3 Optimizing Sorting and Aggregation with doc_values

In Elasticsearch, thedoc_values is a columnar store on disk that can be used to perform operations such as sorting and aggregation quickly and efficiently.

When you sort or aggregate a field, Elasticsearch needs access to all the values for that field. If these values are stored in documents, then Elasticsearch needs to load each document from disk, which can be very slow. Insteaddoc_values Instead, the values of the fields are stored in a separate area of the disk, and Elasticsearch can access them directly without having to load the document, thus greatly improving performance.

By default, Elasticsearch adds a new file to all of thekeyword Type and value type fields are enableddoc_values. If you have atext type field that also needs to be sorted or aggregated, then you can add akeyword type subfield and enable thedoc_values。

It is important to note that whiledoc_values can improve the performance of sorting and aggregation, but it also takes up extra disk space. Therefore, for fields that don’t need to be sorted or aggregated, you can map thedoc_values set tofalseto save disk space.

5.4. Optimize slicing using routing

In Elasticsearch, therouting Parameters can be used to control to which slice the document is stored and to which slice the search request is routed. Search performance can be significantly improved with a proper routing strategy.

By default, Elasticsearch determines which shard to store a document in based on the document’s ID, and search requests are routed to all shards. This strategy ensures an even distribution of data, but may not be efficient in some cases.

For example, if your index contains data for multiple users, and each search request involves data for only one user, the default routing policy will result in a lot of invalid searches because most of the slices don’t contain data for that user.

At this point, you can use therouting parameter to optimize slice access. You can use the user ID as therouting parameter, so that all documents for the same user will be stored in the same slice, and search requests will only be routed to that slice. This can greatly reduce invalid searches and improve search performance.

It is important to note that whilerouting parameter can improve search performance, but if used improperly, it can also lead to uneven data distribution and affect the stability of the cluster. Therefore, it is important to use therouting The distribution of the data needs to be fully considered when parameterizing.

5.5 Other optimizations

In addition to the two mentioned above, consider:

Using Caching: Elasticsearch provides query result caching and field data caching that can improve the performance of repeated queries. Note that caching is not always beneficial and may degrade performance if the query pattern has a high degree of randomness.
Hardware Optimization: Improving hardware performance can also improve search performance, e.g., increasing memory can improve caching, using SSDs can improve IO performance, and so on.