An ElasticSearch cluster

Technical details on an implementation of an ElasticSearch cluster and its ingestion and visualization tools (MetricBeat, Logstash and Kibana) to build the following:

  • Capture and centralization of system logging
  • Centralization of Apache Web Server access logs
  • Tweets capture for analysis

Setting up the cluster

The minimum recommended number of nodes for a cluster is 3. This is to avoid a split-brain scenario:
If the cluster has only 2 nodes and a network problem prevents them from seeing each other, each of them will think that the other one has disappeared. The one that is not the master will then promote itself to the master role, while the other node still acts as a master. When the network problem is solved and both nodes see each other again, you are in trouble: inconsistencies may have been created, and because the system is not multi-master capable, the nodes cannot synchronize with each other the changes each of them made while acting as master.
If the cluster has 3 or more nodes, you have to tell it the minimum number of master-eligible nodes that must be visible before a master can be elected. If you have 3 nodes, this number is 2. So if a network problem happens:
  • a non-master node that doesn't see any other node won't promote itself to master.
  • a master that doesn't see any other node will no longer act as a master for the data it is holding.
  • two nodes that still see each other but no longer see the master will elect one of them to be the master. When the original master is reachable again, it will have stopped acting as a master and will rejoin the cluster as a non-master node.
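
The majority rule behind these three cases can be sketched in a few lines of shell. This is only an illustration: quorum and can_elect_master are hypothetical helper names, not part of ElasticSearch, and the sketch assumes that every node in the cluster is master-eligible.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the majority rule; not ElasticSearch code.

# quorum N: smallest number of master-eligible nodes that forms a majority.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# can_elect_master TOTAL VISIBLE: succeeds when a node that can see VISIBLE
# master-eligible nodes (itself included) is on the majority side.
can_elect_master() {
    [ "$2" -ge "$(quorum "$1")" ]
}

# Example: a 3-node cluster split into an isolated node and a pair.
for visible in 1 2; do
    if can_elect_master 3 "$visible"; then
        echo "partition with $visible node(s): may elect a master"
    else
        echo "partition with $visible node(s): must not elect a master"
    fi
done
```

Note that quorum 2 also returns 2: a 2-node cluster could be protected the same way, but losing either node would then leave it without a master, which is why 3 nodes is the practical minimum.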
In a big production cluster, you can define which roles a node can or cannot take. First of all, here is an overview of each role:
  1. Master node: one and only one in the cluster, chosen by election among the nodes that can become a master. The setting node.master=true allows a node to become a master.
  2. Data node: a node that can hold data (Lucene indexes) and perform data-related operations on it. Assigned via the setting node.data=true.
  3. Ingest node: a node that performs document enrichment before documents are actually written into the indexes; when dedicated to this role, it doesn't hold any data. Assign this role with node.ingest=true.
  4. Tribe node: a very special node that connects to several clusters and can search across them. When acting as a tribe node, the node only does coordination and does not store data. Use the tribe.* parameters to achieve this (more information about this in the official ElasticSearch documentation).
By default, the first 3 roles are all activated on each node. In a heavily used cluster, it is recommended to split the activities:
  • Dedicate 3 or more nodes to be master-eligible only, not data nodes (thus setting node.data=false). By dedicating master-eligible nodes, you exclude them from the list of nodes to connect to when uploading data into the indices. Elastic recommends that you don't perform search, ingestion or any other client operation via the master node of the cluster. As you may not know which node is the master, creating a group of master-eligible-only nodes lets you leave all of them out of your client connection URL.
  • Dedicate as many data nodes as you need according to the estimated amount of data, turning off master eligibility on them.
  • If you need heavy pipelines in the cluster, to transform documents when they enter the cluster but before they are actually added to the indexes, dedicate some nodes to this role by turning off master eligibility and the data role (node.data=false; node.master=false).
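
As a rough sketch, a dedicated role comes down to three lines in elasticsearch.yml. The helper below prints them for each kind of dedicated node. role_settings is a hypothetical name used for illustration only; the setting names are assumed to be those of ElasticSearch 5.x (node.master, node.data, node.ingest), and turning ingest off on the master and data nodes is an extra assumption made here so that each node is fully dedicated.

```shell
#!/bin/sh
# role_settings ROLE: print the elasticsearch.yml lines for a dedicated node.
# Hypothetical helper; setting names assumed from ElasticSearch 5.x.
role_settings() {
    case "$1" in
        master) master=true;  data=false; ingest=false ;;
        data)   master=false; data=true;  ingest=false ;;
        ingest) master=false; data=false; ingest=true  ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
    printf 'node.master: %s\nnode.data: %s\nnode.ingest: %s\n' \
        "$master" "$data" "$ingest"
}

# Example: settings for a dedicated master-eligible node.
role_settings master
```

The printed lines can be appended to /etc/elasticsearch/elasticsearch.yml on the node concerned, followed by a restart of the service.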

You need a valid Java Virtual Machine installed on your server before running ElasticSearch. All the configurations and operations presented below were done on virtual machines with 1 vCPU and 4 GB RAM, running Ubuntu 16.04 LTS.
The JVM used for the tests is Oracle Java SE 8u101, downloaded from the Oracle/Java web site.
ElasticSearch was installed via the Elastic APT repository:

Install the Elastic packages signing public key:

wget -qO - | sudo apt-key add -

Add the HTTPS transport for APT if not yet done:

sudo apt-get install apt-transport-https

Add the Elastic APT repository for ElasticSearch:

echo "deb stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list

Update the local package cache and perform the installation of the package:

sudo apt-get update && sudo apt-get install elasticsearch

You configure your cluster via the file /etc/elasticsearch/elasticsearch.yml on each node.
An example configuration for a simple 3-node cluster:
# Name of the cluster, common for all nodes part of the same cluster
cluster.name: pandora
# The name to identify this instance on this node in the cluster (can be different from the hostname)
node.name: ES1
# We can use any directory we want to put the data (indices) and the cluster logs
path.data: /es/data
path.logs: /es/logs
# Lock the memory on startup - to reserve memory when starting up the binary
#bootstrap.memory_lock: true
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
# Elasticsearch performs poorly when the system is swapping the memory.
# Bind to a specific IP address on the system only
#http.port: 9200
# To discover other nodes that are part of this cluster
discovery.zen.ping.unicast.hosts: ["es1", "es2"]
# Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
discovery.zen.minimum_master_nodes: 2
# Block initial recovery after a full cluster restart until N nodes are started:
gateway.recover_after_nodes: 1
# Disable starting multiple nodes on a single system:
node.max_local_storage_nodes: 1
# Require explicit names when deleting indices:
#action.destructive_requires_name: true
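
The "about half the memory available" guideline from the comments above can be turned into concrete numbers. The sketch below reads the total RAM from /proc/meminfo (Linux only) and prints matching JVM heap flags. half_heap_mb is a hypothetical helper, and the sketch assumes the heap is set through -Xms/-Xmx flags, which the ElasticSearch 5.x packages are expected to read from /etc/elasticsearch/jvm.options.

```shell
#!/bin/sh
# half_heap_mb TOTAL_KB: half of the given memory size, in whole megabytes.
# Hypothetical helper, for illustration only.
half_heap_mb() {
    echo $(( $1 / 1024 / 2 ))
}

# Total RAM in kB as reported by the Linux kernel; falls back to 0 elsewhere.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
heap_mb=$(half_heap_mb "${total_kb:-0}")

# Minimum and maximum heap are set to the same value.
echo "-Xms${heap_mb}m"
echo "-Xmx${heap_mb}m"
```

Setting both flags to the same value avoids heap resizing at run time; together with the memory lock option it helps keep the process away from swap.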
When the ElasticSearch process is started on all nodes, you have a running cluster waiting for input on port 9200.
To check that the ElasticSearch daemon is not only running but also performing well, we can use the following 2 calls, which also illustrate how any client software interacts with the search engine.
All you need is curl installed on your system. Any other command-line tool able to send HTTP calls (GET, POST, PUT, HEAD, DELETE, ...) will do the trick.
The first example displays the general health status of the cluster's nodes. The second one lists each index and its status.
es1(ubuntu):~$ curl http://localhost:9200/_cat/nodes?v
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
                       26          95   7    0.01    0.00     0.00 mdi       -      ES2
                       49          92   4    0.02    0.01     0.00 mdi       *      ES1
es1(ubuntu):~$ curl http://localhost:9200/_cat/indices?v
health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   icinga-2017.04     2Hla9BSiQUWciDvSt6xHWg   5   1     294937            0    275.9mb        137.9mb
green  open   apache-2017.04     B6i1gmQIRj-dW41BFAL4AQ   5   1     412241            0    813.1mb        406.2mb
green  open   syslog-2017.04     9gPSk2z9RxaT3WdKFRHI1g   5   1     168997            0    152.5mb         67.8mb
green  open   metricbeat-2017.04 KBeHsGj4TVqhKysguPlKew   2   1    1480507            0   1020.1mb          510mb

The ?v parameter added to the URL turns on verbose output: it adds the column headers, which makes the result easier for a human to read.
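
For scripted checks, the same idea can be applied to the cluster health endpoint (_cluster/health), which returns a JSON document containing a status field. The sketch below is a minimal example under a few assumptions: health_status and check_cluster are hypothetical helper names, the URL is the local one used in the calls above, and the JSON is parsed with sed only to avoid extra dependencies.

```shell
#!/bin/sh
# Hypothetical helpers; the host and port are the ones used in the calls above.
ES_URL="${ES_URL:-http://localhost:9200}"

# health_status: read a _cluster/health JSON document on stdin and print
# the value of its "status" field (green, yellow or red).
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# check_cluster: query the cluster and succeed only when its status is green.
check_cluster() {
    status=$(curl -s "$ES_URL/_cluster/health" | health_status)
    echo "cluster status: ${status:-unreachable}"
    [ "$status" = "green" ]
}
```

Calling check_cluster from a monitoring script returns a non-zero exit code whenever the cluster is yellow, red or unreachable.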