Elasticsearch¶
Elasticsearch is a distributed analytics and search engine and the core component of the ELK stack. Elasticsearch ingests structured data (typically JSON or key-value pairs) and stores the data in distributed index shards.
In the CAST design, the more Elasticsearch nodes the better. Generally speaking, nodes with attached storage or large numbers of drives are preferred.
Configuration¶
Note
This guide has been tested using Elasticsearch 6.8.1. The latest RPM may be downloaded from the Elastic Site.
The following is a brief introduction to the installation and configuration of the elasticsearch service. It is generally assumed that elasticsearch is to be installed on multiple Big Data Nodes to take advantage of the distributed nature of the service. Additionally, in the CAST configuration data drives are assumed to be JBOD.
CAST provides a set of sample configuration files in the repository at csm_big_data/elasticsearch/
If the ibm-csm-bds-*.noarch.rpm rpm has been installed, the sample configurations may be found in /opt/ibm/csm/bigdata/elasticsearch/.
- Install the elasticsearch rpm and java 1.8.1+ (command run from directory with elasticsearch rpm):
yum install -y elasticsearch-*.rpm java-1.8.*-openjdk
- Copy the Elasticsearch configuration files to the /etc/elasticsearch directory.
It is recommended that the system administrator review these configurations at this phase.
  - jvm.options: JVM options for the Elasticsearch service.
  - elasticsearch.yml: Configuration of the service-specific attributes; please see elasticsearch.yml for details.
- Make an ext4 filesystem on each hard drive designated to be in the Elasticsearch JBOD.
The mounted names for these file systems should match the names specified in path.data. Additionally, these mounted file systems should be owned by the elasticsearch user and be in the elasticsearch group.
- Start Elasticsearch:
systemctl enable elasticsearch
systemctl start elasticsearch
- Run the index template creator script:
/opt/ibm/csm/bigdata/elasticsearch/createIndices.sh
Note
This step is technically optional; however, without it the data will have limited use, because this script configures Elasticsearch to properly parse timestamps.
Elasticsearch should now be operational. If Logstash was properly configured there should already be data being written to your index.
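One way to confirm that the service is operational is to query the cluster health API. The following is a minimal sketch, assuming plain HTTP on port 9200 with security disabled; the host name `node01` used in the usage note is hypothetical.

```python
import json
import urllib.request


def health_url(host, port=9200):
    """Build the cluster health URL for an Elasticsearch node."""
    return f"http://{host}:{port}/_cluster/health"


def is_operational(health):
    """Treat green and yellow as operational; red means primary shards are unassigned."""
    return health.get("status") in ("green", "yellow")


def fetch_health(host, port=9200, timeout=5):
    """Query the node and return the decoded health document."""
    with urllib.request.urlopen(health_url(host, port), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `is_operational(fetch_health("node01"))` against a running node should return `True` once the cluster has formed.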
Tuning Elasticsearch¶
The process of tuning and configuring Elasticsearch is incredibly dependent on the volume and type of data ingested into the Big Data Store. Due to the nuance of this process, it is STRONGLY recommended that the system administrator familiarize themselves with Configuring Elasticsearch.
The following document outlines the defaults and recommendations of CAST in the configuration of the Big Data Store.
elasticsearch.yml¶
Note
The following section outlines CAST’s recommendations for the Elasticsearch configuration. It is STRONGLY recommended that the system administrator familiarize themselves with Configuring Elasticsearch.
The Elasticsearch configuration sample shipped by CAST marks fields that need to be set by a system administrator. A brief rundown of the fields to modify is as follows:
Setting | Description |
---|---|
cluster.name | The name of the cluster. Nodes may only join clusters with the name in this field. Generally it’s a good idea to give this a descriptive name. |
node.name | The name of the node in the Elasticsearch cluster. CAST defaults to ${HOSTNAME}. |
path.logs | The logging directory; the elasticsearch user needs read/write access. |
path.data | A comma-separated list of data directories; the elasticsearch user needs read/write access. CAST recommends a JBOD model where each disk has a file system. |
network.host | The address to bind the Elasticsearch node to. CAST defaults to _site_. |
http.port | The port to bind Elasticsearch to. CAST defaults to 9200. |
discovery.zen.ping.unicast.hosts | A comma-delimited array of nodes likely to be active. CAST defaults to cast.elasticsearch.nodes. |
discovery.zen.minimum_master_nodes | The number of nodes with the node.master setting set to true that must be connected before starting. Elasticsearch recommends (master_eligible_nodes/2)+1. |
gateway.recover_after_nodes | The number of nodes to wait for before beginning recovery after a cluster-wide restart. |
xpack.ml.enabled | Enables/disables the Machine Learning utility in xpack; this should be disabled on ppc64le installations. |
xpack.security.enabled | Enables/disables security in Elasticsearch. |
xpack.license.self_generated.type | Sets the license of xpack for the cluster; if the user has no license it should be set to basic. |
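The quorum formula quoted for discovery.zen.minimum_master_nodes can be worked through with a small helper. This is only an illustrative sketch; the function name is hypothetical and the division is assumed to be integer division.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Return the quorum (master_eligible_nodes / 2) + 1 using integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("at least one master-eligible node is required")
    return master_eligible_nodes // 2 + 1
```

For example, a cluster with three master-eligible nodes would set the value to 2, and a cluster with five would set it to 3.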
jvm.options¶
The configuration file for the Elasticsearch JVM. The supplied settings are CAST’s recommendation; however, the efficacy of these settings depends entirely on your Elasticsearch node.
Generally speaking the only field to be changed is the heap size:
-Xms[HEAP MIN]
-Xmx[HEAP MAX]
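As a starting point for choosing the heap values, Elastic's general guidance is to set the minimum and maximum heap to the same size, to use no more than half of physical memory, and to stay below the compressed object pointer threshold of roughly 32 GB. The sketch below encodes that guidance; it is not a CAST-specific recommendation, and the 31 GB cap and the function name are assumptions.

```python
def heap_options(total_ram_gb, cap_gb=31):
    """Suggest equal -Xms/-Xmx values: half of RAM, capped below ~32 GB (assumed cap)."""
    heap_gb = max(1, min(total_ram_gb // 2, cap_gb))
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]
```

A node with 16 GB of memory would yield `-Xms8g` and `-Xmx8g`; the result should still be validated against the node's actual workload.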
Indices¶
Elasticsearch Templates: /opt/ibm/csm/bigdata/elasticsearch/templates/cast-*.json
CAST has specified a suite of data mappings for use in separate indices. Each of these indices is documented below, with a JSON mapping file provided in the repository and rpm.
CAST uses a cast-<class>-<description>-<date> naming scheme for indices to leverage templates when creating the indices in Elasticsearch. The class is one of the three primary classifications determined by CAST: log, counters, environmental. The description is typically a one to two word description of the type of data: syslog, node, mellanox-event, etc.
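The naming scheme can be illustrated with a short helper. This is a sketch under assumptions: the date suffix format (year.month.day) is not specified above and is assumed here, and the function name is hypothetical.

```python
import re
from datetime import date

# Lowercase words separated by single hyphens, matching CAST's naming style.
_PART = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def index_name(index_class, description, day):
    """Build cast-<class>-<description>-<date>; the date format is an assumption."""
    for part in (index_class, description):
        if not _PART.match(part):
            raise ValueError(f"expected lowercase words joined by '-': {part!r}")
    return f"cast-{index_class}-{description}-{day.strftime('%Y.%m.%d')}"
```

For instance, `index_name("log", "syslog", date(2019, 7, 4))` produces `cast-log-syslog-2019.07.04`.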
A collection of templates is provided in ibm-csm-bds-*.noarch.rpm which sets up aliases and data type mappings. These templates do not set sharding or replication factors, as these settings should be tuned to the user’s data retention and index sizing needs.
The specified templates match indices generated in the data aggregators documentation. As different data sources produce different volumes of data in different environments, this document will make no recommendation on sharding or replication.
Note
These templates may be found on the git repo at csm_big_data/elasticsearch/mappings/templates.
Note
CAST has elected to use lowercase and - characters to separate words. This is not mandatory for your index naming and creation.
scripts¶
Elasticsearch Index Scripts: /opt/ibm/csm/bigdata/elasticsearch/
CAST provides a set of scripts which allow the user to easily manipulate the elasticsearch indices from the command line.
createIndices.sh¶
A script for initializing the templates defined by CAST. When executed, it will attempt to target the Elasticsearch server running on ${HOSTNAME}:9200. If the user supplies either a hostname or IP address, this will be targeted in lieu of ${HOSTNAME}. This script need only be run once on a node in the Elasticsearch cluster.
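To give a sense of what registering a template involves, the sketch below builds a request against the Elasticsearch 6.x `_template` endpoint. It is not the contents of createIndices.sh; the function name, the choice to name the template after its file, and the example file name `cast-log-syslog.json` are assumptions.

```python
import json
import os
import urllib.request


def template_request(template_path, template_body, host="localhost", port=9200):
    """Build (but do not send) a PUT request that registers one index template."""
    # Name the template after its file, e.g. cast-log-syslog.json -> cast-log-syslog.
    name = os.path.splitext(os.path.basename(template_path))[0]
    return urllib.request.Request(
        url=f"http://{host}:{port}/_template/{name}",
        data=json.dumps(template_body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it to the targeted node.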
removeIndices.sh¶
A script for removing all Elasticsearch templates created by createIndices.sh. When executed, it will attempt to target the Elasticsearch server running on ${HOSTNAME}:9200. If the user supplies either a hostname or IP address, this will be targeted in lieu of ${HOSTNAME}. This script need only be run once on a node in the Elasticsearch cluster.
reindexIndices.py¶
Attention
This script is currently not supported. A future release of CSM BDS will have a script matching this description.
A tool for performing in place reindexing of an elasticsearch index.
Warning
This script should only be used to reindex a handful of indices at a time as it is slow and can result in partial reindexing.
usage: reindexIndices.py [-h] [-t hostname:port]
[-i [index-pattern [index-pattern ...]]]
A tool for reindexing a list of elasticsearch indices, all indices will be
reindexed in place.
optional arguments:
-h, --help show this help message and exit
-t hostname:port, --target hostname:port
An Elasticsearch server to reindex indices on. This
defaults to the contents of environment variable
"CAST_ELASTIC".
-i [index-pattern [index-pattern ...]], --indices [index-pattern [index-pattern ...]]
A list of indices to reindex, this should use the
index pattern format.
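The command line described by the usage text above could be declared with argparse roughly as follows. This is a sketch of the argument handling only, assuming the options behave as the help text states; it does not perform any reindexing and is not the script's actual source.

```python
import argparse
import os


def build_parser():
    """Declare the -t/--target and -i/--indices options from the usage text."""
    parser = argparse.ArgumentParser(
        prog="reindexIndices.py",
        description="A tool for reindexing a list of elasticsearch indices, "
                    "all indices will be reindexed in place.",
    )
    parser.add_argument(
        "-t", "--target", metavar="hostname:port",
        default=os.environ.get("CAST_ELASTIC"),
        help="An Elasticsearch server to reindex indices on.",
    )
    parser.add_argument(
        "-i", "--indices", metavar="index-pattern", nargs="*", default=[],
        help="A list of indices to reindex, in index pattern format.",
    )
    return parser
```

With this parser, `-t node01:9200 -i cast-log-*` yields a target string and a one-element list of index patterns.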
cast-log¶
Elasticsearch Templates: /opt/ibm/csm/bigdata/elasticsearch/templates/cast-log*.json
The cast-log- indices represent a set of logging indices produced by CAST supported data sources.
cast-log-syslog¶
alias: cast-log-syslog
The syslog index is designed to capture generic syslog messages. The contents of the syslog index are considered by CAST to be the most useful data points for syslog analysis. CAST supplies both an rsyslog template and a Logstash pattern; for details on these configurations, please consult the data aggregators documentation.
The mapping for the index contains the following fields:
Field | Type | Description |
---|---|---|
@timestamp | date | The timestamp of the message, generated by the syslog utility. |
host | text | The name of the relay host. |
hostname | text | The hostname of the syslog origination. |
program_name | text | The name of the program which generated the log. |
process_id | long | The process id of the program which generated the log. |
severity | text | The severity level of the log. |
message | text | The body of the message. |
tags | text | Tags containing additional metadata about the message. |
Note
Currently mmfs and CAST logs will be stored in the syslog index (due to similarity of the data mapping).
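The field table above can be expressed as the `properties` section of an Elasticsearch mapping. The sketch below only restates the table in that shape; the variable and function names are hypothetical, and the shipped template may contain additional settings.

```python
# Field names and types copied from the cast-log-syslog table.
SYSLOG_FIELDS = {
    "@timestamp": "date",
    "host": "text",
    "hostname": "text",
    "program_name": "text",
    "process_id": "long",
    "severity": "text",
    "message": "text",
    "tags": "text",
}


def mapping_properties(fields):
    """Convert a {field: type} table into a mapping 'properties' object."""
    return {name: {"type": es_type} for name, es_type in fields.items()}
```

The result maps each field to its declared type, for example `process_id` to `{"type": "long"}`.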
cast-log-mellanox-event¶
alias: cast-log-mellanox-event
The mellanox event log is a superset of the cast-log-syslog index, an artifact of the event log being transmitted through syslog. In the CAST Big Data Pipeline this log will be ingested and parsed by the Logstash service then transmitted to the Elasticsearch index.
Field | Type | Description |
---|---|---|
@timestamp | date | When the message was written to the event log. |
hostname | text | The hostname of the ufm aggregating the events. |
program_name | text | The name of the generating program, should be event_log. |
process_id | long | The process id of the program which generated the log. |
severity | text | The severity level of the log, pulled from message. |
message | text | The body of the message (unstructured). |
log_counter | long | A counter tracking the log number. |
event_id | long | The unique identifier for the event in the mellanox event log. |
event_type | text | The type of event (e.g. HARDWARE) in the event log. |
category | text | The categorization of the error in the event log typing. |
tags | text | Tags containing additional metadata about the message. |
cast-log-console¶
alias: cast-log-console
CAST recommends the usage of the goconserver bundled in the xCAT dependencies, documented in xCat-GoConserver. Configuration of the goconserver should be performed on the xCAT service nodes in the cluster. CAST has created a limited configuration guide (ConsoleDataAggregator); please consult it for a basic rundown of the utility.
The mapping for the console index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | When the console event occurred. |
type | text | The type of the event (typically console). |
message | text | The console event data, typically a console line. |
hostname | text | The hostname generating the console. |
tags | text | Tags containing additional metadata about the console log. |
cast-csm¶
Elasticsearch Templates: /opt/ibm/csm/bigdata/elasticsearch/templates/cast-csm*.json
The cast-csm- indices represent a set of metric indices produced by CSM. Indices matching this pattern will be created unilaterally by the CSM Daemon. Typically records in this type of index are generated by the Aggregator Daemon.
cast-csm-dimm-env¶
alias: cast-csm-dimm-env
The mapping for the cast-csm-dimm-env index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the dimm environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (csm-dimm-env). |
source | text | The source of the counters. |
data.dimm_id | long | The id of dimm being aggregated. |
data.dimm_temp | long | The temperature of the dimm. |
data.dimm_temp_max | long | The max temperature of the dimm over the collection period. |
data.dimm_temp_min | long | The min temperature of the dimm over the collection period. |
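Elasticsearch treats dotted field names such as data.dimm_temp as paths into an object field. The sketch below expands dotted names from a table like the one above into nested mapping properties; whether the shipped template is written in this nested form is an assumption, and the function name is hypothetical.

```python
def nested_properties(fields):
    """Expand dotted field names into nested mapping 'properties' objects."""
    root = {}
    for dotted, es_type in fields.items():
        node = root
        *parents, leaf = dotted.split(".")
        for parent in parents:
            # Descend into (or create) the object field that holds the leaf.
            node = node.setdefault(parent, {"properties": {}})["properties"]
        node[leaf] = {"type": es_type}
    return root
```

Given `{"source": "text", "data.dimm_id": "long"}`, the helper places `dimm_id` under the `properties` of a `data` object field.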
cast-csm-gpu-env¶
alias: cast-csm-gpu-env
The mapping for the cast-csm-gpu-env index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the gpu environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (csm-gpu-env). |
source | text | The source of the counters. |
data.gpu_id | long | The id of the GPU record being aggregated. |
data.gpu_mem_temp | long | The memory temperature of the GPU. |
data.gpu_mem_temp_max | long | The max memory temperature of the GPU over the collection period. |
data.gpu_mem_temp_min | long | The min memory temperature of the GPU over the collection period. |
data.gpu_temp | long | The temperature of the GPU. |
data.gpu_temp_max | long | The max temperature of the GPU over the collection period. |
data.gpu_temp_min | long | The min temperature of the GPU over the collection period. |
cast-csm-node-env¶
alias: cast-csm-node-env
The mapping for the cast-csm-node-env index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the node environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (csm-node-env). |
source | text | The source of the counters. |
data.system_energy | long | The energy of the system at ingestion time. |
cast-csm-gpu-counters¶
alias: cast-csm-gpu-counters
A listing of DCGM counters.
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the gpu environment counters. |
Note
The data fields have been separated for compactness.
Data Field | Type | Description |
---|---|---|
nvlink_recovery_error_count_l1 | long | Total number of NVLink recovery errors. |
sync_boost_violation | long | Throttling duration due to sync-boost constraints (in us) |
gpu_temp | long | GPU temperature (in C). |
nvlink_bandwidth_l2 | long | Total number of NVLink bandwidth counters. |
dec_utilization | long | Decoder utilization. |
nvlink_recovery_error_count_l2 | long | Total number of NVLink recovery errors. |
nvlink_bandwidth_l1 | long | Total number of NVLink bandwidth counters. |
mem_copy_utilization | long | Memory utilization. |
gpu_util_samples | double | GPU utilization sample count. |
nvlink_replay_error_count_l1 | long | Total number of NVLink retries. |
nvlink_data_crc_error_count_l1 | long | Total number of NVLink data CRC errors. |
nvlink_replay_error_count_l0 | long | Total number of NVLink retries. |
nvlink_bandwidth_l0 | long | Total number of NVLink bandwidth counters. |
nvlink_data_crc_error_count_l3 | long | Total number of NVLink data CRC errors. |
nvlink_flit_crc_error_count_l3 | long | Total number of NVLink flow-control CRC errors. |
nvlink_bandwidth_l3 | long | Total number of NVLink bandwidth counters. |
nvlink_replay_error_count_l2 | long | Total number of NVLink retries. |
nvlink_replay_error_count_l3 | long | Total number of NVLink retries. |
nvlink_data_crc_error_count_l0 | long | Total number of NVLink data CRC errors. |
nvlink_recovery_error_count_l0 | long | Total number of NVLink recovery errors. |
enc_utilization | long | Encoder utilization. |
power_usage | double | Power draw (in W). |
nvlink_recovery_error_count_l3 | long | Total number of NVLink recovery errors. |
nvlink_data_crc_error_count_l2 | long | Total number of NVLink data CRC errors. |
nvlink_flit_crc_error_count_l2 | long | Total number of NVLink flow-control CRC errors. |
serial_number | text | The serial number of the GPU. |
power_violation | long | Throttling duration due to power constraints (in us). |
xid_errors | long | Value of the last XID error encountered. |
gpu_utilization | long | GPU utilization. |
nvlink_flit_crc_error_count_l0 | long | Total number of NVLink flow-control CRC errors. |
nvlink_flit_crc_error_count_l1 | long | Total number of NVLink flow-control CRC errors. |
mem_util_samples | double | The sample rate of the memory utilization. |
thermal_violation | long | Throttling duration due to thermal constraints (in us). |
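Because the NVLink counters are reported per link (l0 through l3), a consumer of this index will often want a per-GPU total. The following is a minimal sketch operating on the data fields of a single record; the function name is hypothetical and missing counters are assumed to count as zero.

```python
NVLINK_LINKS = ("l0", "l1", "l2", "l3")


def nvlink_error_total(data, kind):
    """Sum one NVLink error counter (e.g. 'replay', 'recovery') across all links."""
    return sum(
        int(data.get(f"nvlink_{kind}_error_count_{link}", 0))
        for link in NVLINK_LINKS
    )
```

Valid values for `kind` in the table above are `recovery`, `replay`, `data_crc`, and `flit_crc`.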
cast-counters¶
Elasticsearch Templates: /opt/ibm/csm/bigdata/elasticsearch/templates/cast-counters*.json
A class of index representing counter aggregation from non CSM data flows. Generally indices following this naming pattern contain data from standalone data aggregation utilities.
cast-counters-gpfs¶
alias: cast-counters-gpfs
A collection of counter data from gpfs. The script outlined in the data aggregators documentation leverages zimon to perform the collection. The following is the index generated by the default script bundled in the CAST rpm.
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the gpfs counters. |
Note
The data fields have been separated for compactness.
Data Field | Type | Description |
---|---|---|
cpu_system | long | The system space usage of the CPU. |
cpu_user | long | The user space usage of the CPU. |
mem_active | long | Active memory usage. |
gpfs_ns_bytes_read | long | Networked bytes read. |
gpfs_ns_bytes_written | long | Networked bytes written. |
gpfs_ns_tot_queue_wait_rd | long | Total time spent waiting in the network queue for read operations. |
gpfs_ns_tot_queue_wait_wr | long | Total time spent waiting in the network queue for write operations. |
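Once counters are flowing, they can be summarized with a search request. The sketch below builds a request body that averages one counter over a recent time window; the `data.` prefix on the field name in the usage note is an assumption about how the data fields are nested, and the function name is hypothetical.

```python
def average_over_window(field, window="1h"):
    """Build a search body averaging one numeric field over a recent window."""
    return {
        "size": 0,  # Only the aggregation is wanted, not the matching documents.
        "query": {"range": {"@timestamp": {"gte": f"now-{window}"}}},
        "aggs": {"average": {"avg": {"field": field}}},
    }
```

The body from `average_over_window("data.gpfs_ns_bytes_read")` could be posted to the `_search` endpoint of the cast-counters-gpfs alias.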
cast-counters-ufm¶
alias: cast-counters-ufm
Due to the wide variety of counters that may be gathered, checking the data aggregation script is strongly recommended.
The mapping for the cast-counters-ufm index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the ufm environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (cast-counters-ufm). |
source | text | The source of the counters. |
cast-db¶
CSM history tables are archived in Elasticsearch as separate indices. CAST provides a document on configuring CSM database data archival (DataArchiving).
The mapping shared between the indices is as follows:
Field | Type | Description |
---|---|---|
@timestamp | date | When the archival event occurred. |
tags | text | Tags about the archived data. |
type | text | The originating table, drives index assignment. |
data | doc | The mapping of table columns, contents differ for each table. |
Attention
These indices will match the CSM database history tables; their contents are not replicated here for brevity.
cast-ibm-crassd-bmc-alerts¶
While not managed by CAST, crassd will ship BMC alerts to the Big Data Store.