Elasticsearch is the heart of what is today the most popular log analytics platform: the ELK Stack (Elasticsearch, Logstash, and Kibana).
Since its release in 2010, Elasticsearch has already been adopted by well-known organizations such as LinkedIn, Netflix, and Stack Overflow. It has quickly become the most popular search engine, and it’s commonly used for log analytics, full-text search, security intelligence, business analytics, and operational intelligence use cases.
To give you a better understanding of this search engine, this article covers the basic concepts of Elasticsearch as well as its value proposition.
What is Elasticsearch?
At its core, Elasticsearch is an open-source, distributed, RESTful, document-oriented search engine built on Apache Lucene. Being document-oriented means that you can save documents into Elasticsearch and delete them again. Along with this basic insert and delete functionality, you can of course also retrieve stored documents and even perform various analytics on them. In the context of data analysis, Elasticsearch is used together with the other components of the ELK Stack, Logstash and Kibana, and plays the role of data indexing and storage.
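As a rough illustration of the RESTful, document-oriented interface described above, the sketch below builds (but does not send) the HTTP requests for saving, retrieving, and deleting a single document. The index name `employees`, the sample document, and the helper function names are hypothetical. The `/<index>/_doc/<id>` path shape assumes the `_doc` convention used from Elasticsearch 6.x onward.

```python
import json


def index_request(index, doc_id, document):
    """Describe the request that stores (indexes) a JSON document."""
    return ("PUT", f"/{index}/_doc/{doc_id}", json.dumps(document))


def get_request(index, doc_id):
    """Describe the request that retrieves a stored document by id."""
    return ("GET", f"/{index}/_doc/{doc_id}", None)


def delete_request(index, doc_id):
    """Describe the request that deletes a stored document by id."""
    return ("DELETE", f"/{index}/_doc/{doc_id}", None)


# Hypothetical document, used only for illustration.
employee = {"name": "Jane Doe", "role": "engineer"}

requests_to_send = [
    index_request("employees", 1, employee),
    get_request("employees", 1),
    delete_request("employees", 1),
]
for method, path, body in requests_to_send:
    print(method, path, body or "")
```

Any HTTP client could send these requests to a running node; the sketch stops short of that so it can run without a cluster.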
At the time of writing this article, Elasticsearch was ranked first in the search engine category and eighth for databases, and there are good reasons why.
Why should you consider using Elasticsearch?
This is an interesting question with many valid answers. Here is a summary of a few benefits of Elasticsearch.
1. Speed
Elasticsearch’s distributed design enables it to return search results over large amounts of data very quickly compared to other search engines. It achieves this speed in a few different ways. One is by generating inverted indices for every field in the data that you index. Once your data is indexed, Elasticsearch can use these inverted indices together when a query is executed, which helps it return results more quickly. Moreover, Elasticsearch lets you split your data into units called shards (which live on nodes inside a cluster) and then automatically handles the routing of queries across your segmented data. This process of distributed search contributes to the outsized performance of Elasticsearch in the world of full-text search.
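To make the idea of an inverted index concrete, here is a toy sketch, not Lucene's actual implementation. It maps each token of one field to the set of documents that contain it, so a query only has to intersect a few small sets instead of scanning every document. Whitespace tokenization and the sample log messages are simplifying assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each lowercase token of `field` to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in str(doc.get(field, "")).lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical log documents.
docs = {
    1: {"message": "disk usage high on node a"},
    2: {"message": "login failed for user admin"},
    3: {"message": "disk failed on node b"},
}
message_index = build_inverted_index(docs, "message")
print(sorted(search(message_index, "disk node")))  # [1, 3]
print(sorted(search(message_index, "failed")))     # [2, 3]
```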
2. Scalability
Elasticsearch lets you add resources and balance the load between the nodes in a cluster. It also replicates the data automatically to prevent data loss in case of a node failure. It is an incredibly flexible technology that works for use cases of all sizes. You can quite literally run Elasticsearch on your laptop or scale it out to hundreds of servers with petabytes of data.
3. Reliability
Elasticsearch lends itself to strong reliability and generally provides clear visibility into the health of your infrastructure. Specifically, the three main ways the technology helps with reliability are replication, cluster backups, and monitoring.
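On the monitoring side, Elasticsearch reports a cluster health status of green, yellow, or red. The sketch below is only an illustrative model of what those colors mean, not a call to the real health API. The function and parameter names are hypothetical.

```python
def cluster_status(unassigned_primaries, unassigned_replicas):
    """Model the health color from the number of shard copies that are not allocated."""
    if unassigned_primaries > 0:
        return "red"      # some data is currently unavailable
    if unassigned_replicas > 0:
        return "yellow"   # all data is available, but redundancy is reduced
    return "green"        # every primary and replica shard is allocated


print(cluster_status(0, 0))
print(cluster_status(0, 2))
print(cluster_status(1, 2))
```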
4. Data Types
Elasticsearch provides support for all commonly used data types, such as:
- Text: text for unstructured, full-text content and keyword for structured, exact values (these replaced the older string type)
- Numbers: long, integer, short, byte, double, float
- Dates: date
In addition, Elasticsearch provides support for complex types such as arrays, objects, and nested types, as well as IP (IPv4 and IPv6), alias, geo, and many others.
A complete list of data types can be found on the official website.
The Basic Concepts of Elasticsearch
Let’s take a closer look at the basic concepts of Elasticsearch (cluster, node, shards, replicas, index, documents, type, and mapping) by comparing them, where possible, with the terms used in the world of relational databases.
- Documents & Types
Documents are JSON objects stored within an Elasticsearch Index and are considered the base unit of storage. In the world of relational databases, documents can be compared to rows in a table.
A type in Elasticsearch represents a class of similar documents.
A type consists of a name, such as employee or projects, and a mapping (see below).
- Mapping
Like a schema in the world of relational databases, a mapping defines the different types that reside within an index. It defines the fields for documents of a specific type: the data type of each field (such as text or integer) and how the fields should be indexed and stored in Elasticsearch.
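A mapping is expressed as JSON. The sketch below shows what the field definitions for an employee type might look like, written as a Python dictionary. The field names are hypothetical, and the way this `properties` object is wrapped in an index-creation request differs between Elasticsearch versions, so treat it as a shape to adapt rather than a ready-made request body.

```python
import json

# Hypothetical field definitions for an "employee" document.
employee_mapping = {
    "properties": {
        "name": {"type": "text"},           # analyzed for full-text search
        "department": {"type": "keyword"},  # exact-value matching and aggregations
        "age": {"type": "integer"},
        "hired_on": {"type": "date"},
    }
}


def field_types(mapping):
    """Return a {field name: data type} summary of a mapping's properties."""
    return {name: spec["type"] for name, spec in mapping["properties"].items()}


print(json.dumps(employee_mapping, indent=2))
print(field_types(employee_mapping))
```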
- Index
A collection of similar documents in Elasticsearch is called an index and can be compared to a database in the world of relational databases. For instance, we can have one index for employee data and another for the company projects.
- Shards
Shards are a way of logically dividing your data so that it can be searched and queried easily. Shards are the building blocks of Elasticsearch and are what facilitate its scalability.
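Conceptually, a document is assigned to a shard by hashing a routing value (by default the document id) and taking the result modulo the number of primary shards. The sketch below illustrates that idea only: `zlib.crc32` is a stand-in for the hash function Elasticsearch actually uses, and the document ids are hypothetical.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Choose a shard number by hashing the routing value."""
    # zlib.crc32 is an assumption made for this sketch, not Elasticsearch's hash.
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_primary_shards


num_shards = 3
shards = {number: [] for number in range(num_shards)}
for doc_id in ["emp-1", "emp-2", "emp-3", "emp-4", "emp-5"]:
    shards[pick_shard(doc_id, num_shards)].append(doc_id)

print(shards)
```

Because the same id always hashes to the same shard, a node can tell which shard holds a given document without searching all of them.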
- Replicas
As the name implies, replicas are copies of your index’s shards. They serve two main purposes:
- Replicas provide high availability in case nodes or shards fail.
- Replicas increase the performance of search queries.
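The high-availability point can be sketched with a toy allocator that never places a replica on the same node as its primary, so losing one node still leaves a copy of every shard. The round-robin placement and node names are assumptions made for illustration; Elasticsearch's real allocation logic is considerably more involved.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Return {node: [(shard, kind), ...]} with all copies of a shard on distinct nodes."""
    if num_replicas + 1 > len(nodes):
        # Elasticsearch would leave such replicas unassigned; this sketch simply refuses.
        raise ValueError("not enough nodes to keep every copy on a separate node")
    layout = {node: [] for node in nodes}
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            kind = "primary" if copy == 0 else "replica"
            layout[node].append((shard, kind))
    return layout


def surviving_shards(layout, failed_node):
    """Return the shard numbers that still have at least one copy after a node fails."""
    return {
        shard
        for node, copies in layout.items()
        if node != failed_node
        for shard, _kind in copies
    }


layout = allocate(num_primaries=3, num_replicas=1, nodes=["node-1", "node-2", "node-3"])
for node, copies in layout.items():
    print(node, copies)
print(sorted(surviving_shards(layout, "node-1")))  # [0, 1, 2]
```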
- Node
A node is a single server that holds some data and participates in the cluster’s indexing and querying. A node joins a specific cluster by being configured with that cluster’s name.
All nodes know about all the other nodes in the cluster and can forward client requests to the appropriate node. Besides that, each node serves one or more purposes:
- Master-eligible node: a node that has node.master set to true (the default), which makes it eligible to be elected as the master node that controls the cluster.
- Data node: a node that has node.data set to true (the default). Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations.
- Ingest node: a node that has node.ingest set to true (the default). Ingest nodes can apply an ingest pipeline to a document in order to transform and enrich it before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes as node.ingest: false.
- Tribe node: a special type of coordinating-only node, configured via the tribe.* settings, that can connect to multiple clusters and perform search and other operations across all connected clusters.
By default, a node is a master-eligible node and a data node, plus it can pre-process documents through ingest pipelines. This is very convenient for small clusters but, as the cluster grows, it becomes important to consider separating dedicated master-eligible nodes from dedicated data nodes.
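As a sketch of what separating roles could look like, the snippet below renders the node.master, node.data, and node.ingest settings named above for a few dedicated roles, in the `key: value` form used by the node configuration file. The role labels and the helper function are hypothetical, and later Elasticsearch releases may configure roles differently.

```python
# Settings per dedicated role, using the setting names mentioned in the text.
ROLE_SETTINGS = {
    "master": {"node.master": True, "node.data": False, "node.ingest": False},
    "data": {"node.master": False, "node.data": True, "node.ingest": False},
    "ingest": {"node.master": False, "node.data": False, "node.ingest": True},
    # A coordinating-only node has every role switched off.
    "coordinating": {"node.master": False, "node.data": False, "node.ingest": False},
}


def render_settings(role):
    """Render the settings for a role as 'key: value' lines."""
    settings = ROLE_SETTINGS[role]
    return "\n".join(f"{key}: {str(value).lower()}" for key, value in settings.items())


print(render_settings("master"))
```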
- Cluster
As the name implies, an Elasticsearch cluster is a group of one or more Elasticsearch node instances that are connected together. The power of an Elasticsearch cluster lies in the distribution of tasks, such as searching and indexing, across all the nodes in the cluster.
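The distribution of a search across the cluster can be pictured as scatter and gather: every shard computes its own best hits, and the node that received the request merges them. The sketch below models that with a toy term-frequency score. The scoring, the sample documents, and the function names are all assumptions, not Elasticsearch's real relevance model.

```python
import heapq


def search_shard(shard_docs, term, size):
    """Local search on one shard; the score is simply how often the term occurs."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)


def distributed_search(shards, term, size=2):
    """Scatter the query to every shard, then gather and merge the per-shard top hits."""
    partial = [hit for shard in shards for hit in search_shard(shard, term, size)]
    return [doc_id for _score, doc_id in heapq.nlargest(size, partial)]


# Two hypothetical shards, each holding part of the data.
shards = [
    {"a": "error error timeout", "b": "ok"},
    {"c": "error", "d": "error error error"},
]
print(distributed_search(shards, "error"))  # ['d', 'a']
```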
A Final Note
I believe this article has covered some of the most important concepts you should understand when getting started with ELK. However, there are other terms and components you will need to become familiar with when starting with the ELK Stack. My advice is to check the official Elastic Stack and Product Documentation for additional information.
- Elastic.co. Elasticsearch (6.4) Documentation.
- Daniel Berman (Jun 27, 2016). 10 Elasticsearch Concepts You Need to Learn.
- Sam Reid (Feb 16, 2018). Why is Elasticsearch so successful?