Cluster Administration Storage Tools¶
CAST stands for Cluster Administration Storage Tools.
CAST is comprised of several open source components:
CSM - Cluster System Management
A C API for managing a large cluster. Offers a suite of tools for maintaining the cluster:
- Discovery and management of system resources
- Database integration (PostgreSQL)
- Job launch support (workload management APIs)
- Node diagnostics (diag APIs and scripts)
- RAS events and actions
- Infrastructure Health checks
- Python Bindings for C APIs
Burst Buffer
A cost-effective mechanism that can improve I/O performance for a large class of high-performance computing applications without requiring intermediary hardware. Burst Buffer provides:
- A fast storage tier between compute nodes and the traditional parallel file system
- Overlapping job stage-in and stage-out of data for checkpoint and restart
- Scratch volumes
- Extended memory I/O workloads
Function Shipping
A file I/O forwarding layer for Linux that aims to provide low-jitter access to a remote parallel file system while retaining common POSIX semantics.
CSM APIs¶
CSM uses APIs to communicate between its sub systems and to external programs. This section is a general purpose guide for interacting with CSM APIs.
This section is divided into the following subsections:
Installation¶
The three installation rpms essential to CSM APIs are csm-core-*.rpm, csm-api-*.rpm, and csm-csmdb-*.rpm. These three rpms must be installed to use CSM APIs.
csm-core-*.rpm must be installed on the following components:
- management node
- login node
- launch node
- compute node
csm-api-*.rpm must be installed on the following components:
- management node
- login node
- launch node
- compute node
csm-csmdb-*.rpm must be installed on the following components:
- management node
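For example, on the management node (which requires all three rpms) the packages might be installed together with rpm; the exact package file names depend on the CSM release:
$ rpm -ivh csm-core-*.rpm csm-api-*.rpm csm-csmdb-*.rpm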
Configuration¶
Overview¶
CSM APIs can be configured in several ways. Default configurations are provided so that CSM APIs function out of the box, but these settings can be changed to suit a user's preference.
The configurable features of CSM APIs are:
CSM Pam Daemon Module¶
The libcsmpam.so module is installed by the csm-core-*.rpm to /usr/lib64/security/libcsmpam.so.
To enable this module for sshd, perform the following steps:
1: Uncomment the following lines in /etc/pam.d/sshd:
#account required libcsmpam.so
#session required libcsmpam.so
Note: If this configuration is changed later, make sure the libcsmpam.so session entry remains the last session line (see the warning below).
2: Run systemctl restart sshd.service to restart the sshd daemon with the new configuration. After the daemon has been restarted, the modified pam sshd configuration will be used.
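After step 1 the relevant entries in /etc/pam.d/sshd should look like the following once uncommented; per the warning below, keep the libcsmpam.so session entry as the last session line:
account required libcsmpam.so
session required libcsmpam.so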
Module Behavior¶
This module is designed for account authentication and cgroup session assignment in the pam sshd utility. The following checks are performed to verify that the user is allowed to access the system:
- The user is root.
  - Allow entry.
  - Place the user in the default cgroup (session only).
  - Exit the module with success.
- The user is defined in /etc/pam.d/csm/activelist.
  - Allow entry.
  - Place the session in the cgroup that the user is associated with in the activelist (session only).
  - Note: The activelist is modified by CSM; admins should not modify it.
  - Exit the module with success.
- The user is defined in /etc/pam.d/csm/whitelist.
  - Allow entry.
  - Place the user in the default cgroup (session only).
  - Note: The whitelist is modified by the admin.
  - Exit the module with success.
- The user was not found.
  - Exit the module, rejecting the user.
Module Configuration¶
Configuration may occur in either a pam configuration file (e.g. /etc/pam.d/sshd) or the csm pam whitelist.
File Location: | /usr/lib64/security/libcsmpam.so |
---|---|
Configurable: | Through pam configuration file. |
The libcsmpam.so library is a session pam module. For details on configuring this module and other pam modules please consult the Linux man page (man pam.conf).
When csm-core-*.rpm is uninstalled, this library is always removed.
Warning
The libcsmpam.so module is recommended to be the last session line in the default pam configuration file. The module requires the session to be established in order to move the session to the correct cgroup. If the module is invoked too early in the configuration, users will not be placed in the correct cgroup. Depending on your configuration this advice may or may not be useful.
File location: | /etc/pam.d/csm/whitelist |
---|---|
Configurable: | Yes |
The whitelist is a newline delimited list of user names. If a user is specified, they will always be allowed to log in to the node.
If the user has an active allocation on the node, an attempt will be made to place them in the correct allocation cgroup. Otherwise, the user will be placed in the default cgroup.
When csm-core-*.rpm is uninstalled, if this file has been modified it will NOT be deleted.
The following configuration will add three users who will always be allowed to start a session. If the user has an active allocation they will be placed into the appropriate cgroup as described above.
jdunham
pmix
csm_admin
File location: | /etc/pam.d/csm/activelist |
---|---|
Configurable: | No |
The activelist file should not be modified by the admin or user. CSM will modify this file when an allocation is created or deleted.
The file contains a newline delimited list of entries with the following format: [user_name];[allocation_id]. This format is parsed by libcsmpam.so to determine whether or not a user can begin the session (user_name) and which cgroup the session belongs to (allocation_id).
When csm-core-*.rpm is uninstalled, this file is always removed.
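For illustration only (CSM manages this file), an activelist with two active users might contain entries like the following, where the allocation IDs shown are hypothetical:
jdunham;4231
pmix;4232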
Module Compilation¶
Attention
Ignore this section if the csm pam module is being installed by rpm.
The pam-devel package is required to compile this module.
Troubleshooting¶
If users are having problems with core isolation, are unable to log on to the node, or are not being placed into the correct cgroup, first perform the following steps.
1: Manually create an allocation on a node that has the PAM module configured. This should be executed from the launch node as a non-root user.
$ csm_allocation_create -j 1 -n <node_name> --cgroup_type 2
---
allocation_id: <allocation_id>
num_nodes: 1
- compute_nodes: <node_name>
user_name: root
user_id: 0
state: running
type: user managed
job_submit_time: 2018-01-04 09:01:17
...
POSSIBLE FAILURES
- The allocation create fails because the node is not in a usable state. Set the node back to IN_SERVICE and retry:
$ csm_node_attributes_update -s "IN_SERVICE" -n <node_name>
2: After the allocation has been created with core isolation, ssh to the node <node_name> as the user who created the allocation:
$ ssh <node_name>
POSSIBLE FAILURES
- The /etc/pam.d/csm/activelist was not populated with <user_name>.
- Verify the allocation is currently active:
csm_allocation_query_active_all | grep "allocation_id.* <allocation_id>$"
If the allocation is not currently active, attempt to recreate the allocation.
- Log in to <node_name> as root and check whether the user is on the activelist:
$ ssh <node_name> -l root "grep <user_name> /etc/pam.d/csm/activelist"
If the user is not present and the allocation create is functioning, this may be a CSM bug; open a defect with the CSM team.
3: Check the cgroup of the user's ssh session.
$ cat /proc/self/cgroup
11:blkio:/
10:memory:/allocation_<allocation_id>
9:hugetlb:/
8:devices:/allocation_<allocation_id>
7:freezer:/
6:cpuset:/allocation_<allocation_id>
5:net_prio,net_cls:/
4:perf_event:/
3:cpuacct,cpu:/allocation_<allocation_id>
2:pids:/
1:name=systemd:/user.slice/user-9999137.slice/session-3957.scope
Above is an example of a properly configured cgroup. The user should be in an allocation cgroup for the memory, devices, cpuacct and cpuset groups.
POSSIBLE FAILURES
- The user is only in the cpuset:/csm_system cgroup. This generally indicates that the libcsmpam.so module was not added in the correct location or is disabled. Refer to the quick start at the top of this document for more details.
- The user is in the cpuset:/ cgroup. This indicates that core isolation was not performed; verify that core isolation was enabled in the allocation create step.
Any further issues are beyond the scope of this troubleshooting document, contacting the CSM team or opening a new issue is the recommended course of action.
Configuring allocation prolog and epilog scripts¶
A privileged_prolog and privileged_epilog script (with those exact names) must be placed in
/opt/ibm/csm/prologs
on a compute node in order to use the csm_allocation_create,
csm_allocation_delete, and csm_allocation_update APIs. These scripts must be executable and
take three command line parameters: --type, --user_flags, and --sys_flags.
To add output from this script to the Big Data Store (BDS) it is recommended that the system administrator producing these scripts make use of their language of choice’s logging function.
A sample privileged_prolog and privileged_epilog written in python is shipped in csm-core-*.rpm at /opt/ibm/csm/share/prologs. These sample scripts demonstrate the use of the python logging module to produce logs consumable by the BDS.
Mandatory prolog/epilog Features¶
Feature | Description |
---|---|
--type | The script must accept a command line parameter --type and support both allocation and step as a string value. |
--sys_flags | The script must have a command line parameter --sys_flags. This parameter should take a space delimited list of alphanumeric flags in the form of a string. CSM does not allow special characters, as these represent a potential exposure, allowing unwanted activity to occur. |
--user_flags | The script must have a command line parameter --user_flags. This parameter should take a space delimited list of alphanumeric flags in the form of a string. CSM does not allow special characters, as these represent a potential exposure, allowing unwanted activity to occur. |
Returns 0 on success | Any other return code will be captured by create/delete and the API call will fail. |
Optional prolog/epilog Features¶
Feature | Description |
---|---|
logging | If the sysadmin wants to track these scripts in BDS, a form of logging must be implemented by the admin writing the script. The sample scripts outline a technique using python and the logging module. |
Prolog/epilog Environment Variables¶
CSM_ALLOCATION_ID: | The Allocation ID of the invoking CSM handler. |
---|---|
CSM_PRIMARY_JOB_ID: | The Primary Job (Batch) ID of the invoking CSM handler. |
CSM_SECONDARY_JOB_ID: | The Secondary Job (Batch) ID of the invoking CSM handler. |
CSM_USER_NAME: | The user associated with the job. |
Note
A step prolog or step epilog differs in two ways: the --type flag is set to step and certain environment variables will not be present.
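As an illustration of the mandatory features and environment variables described above, a minimal privileged_prolog skeleton in Python might look like the following. This is only a sketch; the shipped samples in /opt/ibm/csm/share/prologs should be used as the real starting point.
#!/usr/bin/python
# Hypothetical minimal privileged_prolog skeleton.
import argparse
import logging
import os
import sys
# Logging output such as this is what a BDS pipeline can pick up.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s privileged_prolog %(levelname)s %(message)s")
parser = argparse.ArgumentParser()
parser.add_argument("--type", required=True, choices=["allocation", "step"])
parser.add_argument("--user_flags", default="")  # space delimited alphanumeric flags
parser.add_argument("--sys_flags", default="")   # space delimited alphanumeric flags
args = parser.parse_args()
allocation_id = os.environ.get("CSM_ALLOCATION_ID", "")
logging.info("type=%s allocation_id=%s user_flags='%s' sys_flags='%s'",
             args.type, allocation_id, args.user_flags, args.sys_flags)
# Site-specific setup for the allocation or step would go here.
sys.exit(0)  # Any non-zero return code causes the allocation create/delete to fail.
A matching privileged_epilog would accept the same parameters; only the site-specific body differs.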
Configuring CSM API Logging Levels¶
CSM writes messages to its logs at a number of log levels, and CSM APIs can be configured to switch between these levels. Logging is handled through the CSM infrastructure and is divided into two parts, "front end" and "back end".
"Front end" refers to the part of an API a user interfaces with, before the API connects to the CSM infrastructure. "Back end" refers to the part of an API the user does not interact with, after the API connects to the CSM infrastructure.
Front end logging¶
Front end logging is done through the csm logging utility. You will need to include the header file to call the function.
#include "csmutil/include/csmutil_logging.h"
Set the log level with this function:
csmutil_logging_level_set(my_level);
Where my_level is one of: off, trace, debug, info, warning, error, critical, always, disable.
After this function is called, the logging level will change. For example, below we set the logging level to error, so none of these logging calls will print. When we call the API at the end, only prints that are at level error and above will print.
csmutil_logging_level_set("error");
// This will print out the contents of the struct that we will pass to the api
csmutil_logging(debug, "%s-%d:", __FILE__, __LINE__);
csmutil_logging(debug, " Preparing to call the CSM API...");
csmutil_logging(debug, " value of input: %p", input);
csmutil_logging(debug, " address of input: %p", &input);
csmutil_logging(debug, " input contains the following:");
csmutil_logging(debug, " comment: %s", input->comment);
csmutil_logging(debug, " limit: %i", input->limit);
csmutil_logging(debug, " node_names_count: %i", input->node_names_count);
csmutil_logging(debug, " node_names: %p", input->node_names);
for(i = 0; i < input->node_names_count; i++){
csmutil_logging(debug, " node_names[%i]: %s", i, input->node_names[i]);
}
csmutil_logging(debug, " offset: %i", input->offset);
csmutil_logging(debug, " type: %s", csm_get_string_from_enum(csmi_node_type_t, input->type) );
/* Call the C API. */
return_value = csm_node_attributes_query(&csm_obj, input, &output);
If we called the same function but instead passed in debug, then all of those logging calls would print, and when we call the API at the end, all prints inside the API that were set to level debug and above would print. CSM API wrappers such as the command line interfaces expose this function via the -v, --verbose command line option.
Back end logging¶
APIs incorporate the CSM daemon logging system under the sub channel csmapi. To change the default API logging level, configure the csmapi field in the appropriate CSM daemon config file. It is found in all the CSM daemon config files, under the csm level, then under the sub level log.
An excerpt of the csm_master.cfg is reproduced below as an example.
"csm" :
{
"log" :
{
"format" : "%TimeStamp% %SubComponent%::%Severity% | %Message%",
"consoleLog" : false,
"fileLog" : "/var/log/ibm/csm/csm_master.log",
"__rotationSize_comment_1" : "Maximum size (in bytes) of the log file, 10000000000 bytes is ~10GB",
"rotationSize" : 10000000000,
"default_sev" : "warning",
"csmdb" : "info",
"csmnet" : "info",
"csmd" : "info",
"csmras" : "info",
"csmapi" : "info",
"csmenv" : "info"
}
}
An example of editing this field from info to debug is shown below.
"csm" :
{
"log" :
{
"format" : "%TimeStamp% %SubComponent%::%Severity% | %Message%",
"consoleLog" : false,
"fileLog" : "/var/log/ibm/csm/csm_master.log",
"__rotationSize_comment_1" : "Maximum size (in bytes) of the log file, 10000000000 bytes is ~10GB",
"rotationSize" : 10000000000,
"default_sev" : "warning",
"csmdb" : "info",
"csmnet" : "info",
"csmd" : "info",
"csmras" : "info",
"csmapi" : "debug",
"csmenv" : "info"
}
}
If you have trouble finding the config files, the daemon config files are located:
- source repo: csmconf/
- ship to: /opt/ibm/csm/share/
- run from: /etc/ibm/csm/
Note: You may need to restart the daemon for the logging level change to take effect.
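For example, after changing csm_master.cfg the master daemon could be restarted as follows (see CSMD Services later in this document for the full list of service names):
systemctl restart csmd-master.service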
If you want to make a runtime change to logging without changing the configuration file, you can use the tool found at /opt/ibm/csm/sbin/csm_ctrl_cmd.
You must run this command on the node with the CSM Daemon that you would like to change the logging level of.
List of CSM APIs¶
Full List¶
- csm_allocation_create
- csm_allocation_delete
- csm_allocation_query
- csm_allocation_query_active_all
- csm_allocation_query_details
- csm_allocation_resources_query
- csm_allocation_step_begin
- csm_allocation_step_cgroup_create
- csm_allocation_step_cgroup_delete
- csm_allocation_step_end
- csm_allocation_step_query
- csm_allocation_step_query_active_all
- csm_allocation_step_query_details
- csm_allocation_update_state
- csm_allocation_update_history
- csm_api_object_clear
- csm_api_object_destroy
- csm_api_object_errcode_get
- csm_api_object_errmsg_get
- csm_api_object_traceid_get
- csm_bb_cmd
- csm_bb_lv_create
- csm_bb_lv_delete
- csm_bb_lv_query
- csm_bb_lv_update
- csm_bb_vg_create
- csm_bb_vg_delete
- csm_bb_vg_query
- csm_cgroup_login
- csm_cluster_query_state
- csm_diag_result_create
- csm_diag_run_begin
- csm_diag_run_end
- csm_diag_run_query
- csm_diag_run_query_details
- csm_enum_from_string
- csm_infrastructure_health_check
- csm_ib_cable_inventory_collection
- csm_ib_cable_query
- csm_ib_cable_query_history
- csm_ib_cable_update
- csm_init_lib
- csm_init_lib_vers
- csm_node_attributes_query
- csm_node_attributes_query_details
- csm_node_attributes_query_history
- csm_node_attributes_update
- csm_node_delete
- csm_node_find_job
- csm_node_query_state_history
- csm_node_resources_query
- csm_node_resources_query_all
- csm_ras_event_create
- csm_ras_event_query
- csm_ras_event_query_allocation
- csm_ras_msg_type_create
- csm_ras_msg_type_delete
- csm_ras_msg_type_query
- csm_ras_msg_type_update
- csm_term_lib
- csm_smt
- csm_switch_attributes_query
- csm_switch_attributes_query_details
- csm_switch_attributes_query_history
- csm_switch_attributes_update
- csm_switch_inventory_collection
- csm_switch_children_inventory_collection
New in CSM 1.3.0¶
- csm_cluster_query_state
- csm_node_find_job
New in CSM 1.1.0¶
- csm_jsrun_cmd
Implementing New CSM APIs¶
CSM is an open source project that can be contributed to by the community. This section is a guide on how to contribute a new CSM API to this project.
Contributors should visit the GitHub and follow the instructions in the How to Contribute section of the repository readme.
Front-end¶
This is the API an end user would interact with. The front end communicates with the CSM infrastructure through network connections.
Follow these steps to create or edit an API. The diagram below shows where to find the appropriate files in the GitHub repository.

The following numbers reference the chart above.
1: | When creating an API it should be determined whether it accepts input and produces output. The CSM design follows the pattern of <API_Name>_input_t for input structs and <API_Name>_output_t for output structs. These structs should be defined through use of an X-Macro in the appropriate folder for the API type; a struct README is provided in that directory with an in-depth description of the struct definition process. |
---|
/*================================================================================*/
/**
* CSMI_COMMENT
* @brief An input wrapper for @ref csm_example_api.
*/
#ifndef CSMI_STRUCT_NAME
// ! The name of the struct to be generated !
#define CSMI_STRUCT_NAME csm_example_api_input_t
#undef CSMI_BASIC
#undef CSMI_STRING
#undef CSMI_STRING_FIXED
#undef CSMI_ARRAY
#undef CSMI_ARRAY_FIXED
#undef CSMI_ARRAY_STR
#undef CSMI_ARRAY_STR_FIXED
#undef CSMI_STRUCT
#undef CSMI_ARRAY_STRUCT
#undef CSMI_ARRAY_STRUCT_FIXED
#undef CSMI_NONE
// ! Set to 1 (true) when a field matching the type is present !
#define CSMI_BASIC 1
#define CSMI_STRING 1
#define CSMI_STRING_FIXED 0
#define CSMI_ARRAY 0
#define CSMI_ARRAY_FIXED 0
#define CSMI_ARRAY_STR 1
#define CSMI_ARRAY_STR_FIXED 0
#define CSMI_STRUCT 0
#define CSMI_ARRAY_STRUCT 0
#define CSMI_ARRAY_STRUCT_FIXED 0
#define CSMI_NONE 0
#endif
// CSMI_STRUCT_MEMBER(type, name, serial_type, length_member, init_value, extra ) /**< comment */
CSMI_VERSION_START(CSM_VERSION_1_0_0)
CSMI_STRUCT_MEMBER(int32_t , my_first_int , BASIC , , -1 , ) /**< Example int32_t value. API will ignore values less than 1.*/
CSMI_STRUCT_MEMBER(uint32_t, my_string_array_count, BASIC , , 0 , ) /**< Number of elements in the 'my_string_array' array. Must be greater than zero. Size of @ref my_string_array.*/
CSMI_STRUCT_MEMBER(char** , my_string_array , ARRAY_STR, my_string_array_count, NULL, ) /**< comment for my_string_array*/
CSMI_VERSION_END(fc57b7dafbe3060895b8d4b2113cbbf0)
CSMI_VERSION_START(CSM_DEVELOPMENT)
CSMI_STRUCT_MEMBER(int32_t, another_int, BASIC, , -1, ) /**< Another int.*/
CSMI_VERSION_END(0)
#undef CSMI_VERSION_START
#undef CSMI_VERSION_END
#undef CSMI_STRUCT_MEMBER
Attention
Follow the existing struct README in the code source for supplemental details.
2: | The X-Macro definition files will be collated by their ordering in the local ordering definition file. Specific details for this file are in the struct README. |
3: | The |
4: | After defining the X-Macro files the developer should run the csmi/include/struct_generator/regenerate_headers.sh script. The files modified by this script include: |
5: | Add the API function declaration to the appropriate API file; consult the table below for the correct file to add your API to (in the csmi/include directory). |
6: | Add a command for the API to the csmi/src/common/include/csmi_cmds_def.h X-Macro file. |
7: | The implementation of the C API should be placed in the appropriate src directory: |
Generally speaking the frontend C API implementation should follow a mostly standard pattern as outlined below:
#include "csmutil/include/csmutil_logging.h"
#include "csmutil/include/timing.h"
#include "csmi/src/common/include/csmi_api_internal.h"
#include "csmi/src/common/include/csmi_common_utils.h"
#include "csmi/include/“<API_HEADER>
// The expected command, defined in “csmi/src/common/include/csmi_cmds_def.h”
const static csmi_cmd_t expected_cmd = <CSM_CMD>;
// This function must be defined and supplied to the create_csm_api_object
// function if the API specifies an output.
void csmi_<api>_destroy(csm_api_object *handle);
// The actual implementation of the API.
int csm_<api>( csm_api_object **handle, <input_type> *input, <output_type> ** output)
{
START_TIMING()
char *buffer = NULL; // A buffer to store the serialized input struct.
uint32_t buffer_length = 0; // The length of the buffer.
char *return_buffer = NULL; // A return buffer for output from the backend.
uint32_t return_buffer_len = 0; // The length of the return buffer.
int error_code = CSMI_SUCCESS; // The error code, should be of type csmi_cmd_err_t.
// EARLY RETURN
// Create a csm_api_object and sets its csmi cmd and the destroy function.
create_csm_api_object(handle, expected_cmd, csmi_<api>_destroy);
// Test the input to the API, expand this to test input contents.
if (!input)
{
csmutil_logging(error, "The supplied input was null.");
// The error codes are listed in "csmi/include/csmi_type_common.h".
csm_api_object_errcode_set(*handle, CSMERR_INVALID_PARAM);
csm_api_object_errmsg_set(*handle,
strdup(csm_get_string_from_enum(csmi_cmd_err_t, CSMERR_INVALID_PARAM)));
// Return early, as indicated by the EARLY RETURN comment above.
return CSMERR_INVALID_PARAM;
}
// EARLY RETURN
// Serialize the input struct and then test the serialization.
csm_serialize_struct(<input_type>, input, &buffer, &buffer_length);
test_serialization(handle, buffer);
// Execute the send receive command (this is blocking).
error_code = csmi_sendrecv_cmd(*handle, expected_cmd,
buffer, buffer_length, &return_buffer, &return_buffer_len);
// Based on the error code unpack the results or set the error code.
if ( error_code == CSMI_SUCCESS )
{
if ( return_buffer && csm_deserialize_struct(<output_type>, output,
(const char *)return_buffer, return_buffer_len) == 0 )
{
// ATTENTION: This is key, the CSM API makes a promise that the
// output of the API will be stored in the csm_api_object!
csm_api_object_set_retdata(*handle, 1, *output);
}
else
{
csmutil_logging(error, "Deserialization failed");
csm_api_object_errcode_set(*handle, CSMERR_MSG_UNPACK_ERROR);
csm_api_object_errmsg_set(*handle,
strdup(csm_get_string_from_enum(csmi_cmd_err_t,
CSMERR_MSG_UNPACK_ERROR)));
error_code = CSMERR_MSG_UNPACK_ERROR;
}
}
else
{
csmutil_logging(error, "csmi_sendrecv_cmd failed: %d - %s",
error_code, csm_api_object_errmsg_get(*handle));
}
// Free the buffers.
if(return_buffer)free(return_buffer);
free(buffer);
END_TIMING( csmapi, trace, csm_api_object_traceid_get(*handle), expected_cmd, api )
return error_code;
}
// This function should destroy any data stored in the csm_api_object by the API call.
void csmi_<api>_destroy(csm_api_object *handle)
{
csmi_api_internal *csmi_hdl;
<output_type> *output;
// free the CSMI dependent data
csmi_hdl = (csmi_api_internal *) handle->hdl;
if (csmi_hdl->cmd != expected_cmd)
{
csmutil_logging(error, "%s-%d: Unmatched CSMI cmd\n", __FILE__, __LINE__);
return;
}
// free the returned data specific to this csmi cmd
output = (<output_type> *) csmi_hdl->ret_cdata;
csm_free_struct_ptr( <output_type>, output);
csmutil_logging(info, "csmi_<api>_destroy called");
}
8: | Optionally, the developer may implement a command line interface to the C API. For implementing an interface please refer to existing API implementations. |
---|
Back-end¶
The part of the API that the user will not interact with directly. The back end will be invoked by the CSM Infrastructure after receiving user requests.
The diagram below shows where to find the appropriate files in the GitHub repository.

When implementing a backend API the developer must determine several key details:
- Does the API handler access the database? How many times?
- What daemon will the API handler operate on?
- Does the API need a privilege mode?
- Will the API perform a multicast?
These questions will drive the development process, which in the case of most database APIs is boilerplate, as shown in the following sections.
Determining the Base Handler Class¶
In the CSM Infrastructure the back-end API is implemented as an API Handler. This handler may be considered a static object which maintains no volatile state. The state of API execution is managed by a context object initialized when a request is first received by a back-end handler.
CSM has defined several implementations of the handler class to best facilitate the rapid creation of back-end handlers. Unless otherwise specified these handlers are located in csmd/src/daemon/src/csmi_request_handler, and handler implementations should be placed in the same directory.
CSMIStatefulDB (csmi_stateful_db.h)¶
If an API needs to access the database, it is generally recommended to use this handler as a base class. This class provides four virtual functions:
CreatePayload: | Parses the incoming API request, then generates the SQL query. |
---|---|
CreateByteArray: | Parses the response from the database, then generates the serialized response. |
RetrieveDataForPrivateCheck: | Generates a query to the database to check the user's privilege level (optional). |
CompareDataForPrivateCheck: | Checks the results of the query in RetrieveDataForPrivateCheck, returning true or false based on the results (optional). |
In the simplest database APIs, the developer needs to only implement two functions: CreatePayload and CreateByteArray. In the case of privileged APIs, RetrieveDataForPrivateCheck and CompareDataForPrivateCheck must also be implemented.
This handler actually represents a state machine consisting of three states which generalize the most commonly used database access path. If your application requires multiple database accesses or multicasts this state machine may be extended by overriding the constructor.
![digraph G {
DB_INIT -> DB_RECV_PRI [color="#993300" labelfontcolor="#993300" label="Privileged"];
DB_INIT -> DB_RECV_DB;
DB_RECV_PRI -> DB_RECV_DB;
DB_RECV_DB -> DB_DONE;
}](_images/graphviz-ab9216d75771fe2850cc7c259b789c19e8727b08.png)
To facilitate multiple database accesses in a single API call CSM has implemented StatefulDBRecvSend. StatefulDBRecvSend takes a static function as a template parameter which defines the processing logic for the SQL executed by CreatePayload. The constructor for StatefulDBRecvSend then takes an assortment of state transitions for the state machine, which will depend on the state machine used for the API.
An example of this API implementation style can be found in CSMIAllocationQuery.cc. The pertinent section showing expansion of the state machine with the constructor is reproduced and annotated below:
#define EXTRA_STATES 1 // There's one additional state being used over the normal StatefulDB.
// Note: CSM_CMD_allocation_query matches the version on the front-end.
CSMIAllocationQuery::CSMIAllocationQuery(csm::daemon::HandlerOptions& options) :
CSMIStatefulDB(CSM_CMD_allocation_query, options,
STATEFUL_DB_DONE + EXTRA_STATES) // Send the total number of states to super.
{
const uint32_t final_state = STATEFUL_DB_DONE + EXTRA_STATES;
uint32_t current_state = STATEFUL_DB_RECV_DB;
uint32_t next_state = current_state + 1;
SetState( current_state++,
new StatefulDBRecvSend<CreateResponsePayload>(
next_state++, // Successful state.
final_state, // Failure state.
final_state ) ); // Final state.
}
#undef EXTRA_STATES
bool CSMIAllocationQuery::CreateResponsePayload(
const std::vector<csm::db::DBTuple *>&tuples,
csm::db::DBReqContent **dbPayload,
csm::daemon::EventContextHandlerState_sptr ctx )
{
// ...
}
Multicast operations follow a largely similar pattern; however, they exceed the scope of this document. For more details refer to csmd/src/daemon/src/csmi_request_handler/csmi_mcast.
CSMIStateful (csmi_stateful.h)¶
This handler should be used as a base class in handlers where no database operations are required (see CSMIAllocationStepCGROUPDelete.h). Generally, most API implementations will not use this as a base class. If an API is being implemented with CSMIStateful, it is recommended to refer to the source of CSMIAllocationStepCGROUPDelete.h and CSMIAllocationStepCGROUPCreate.h.
Adding Handler to Compilation¶
To add the handler to the compilation path for the daemon, add it to the CSM_DAEMON_SRC file GLOB in csmd/src/daemon/src/CMakeLists.txt.
Registering with a Daemon¶
After implementing the back-end API the developer must then register the API with the daemon routing. Most APIs will only need to be registered on the Master Daemon; however, if the API performs multicasts it will need to be registered on the Agent and Aggregator Daemons as well. The routing tables are defined in csmd/src/daemon/src:
Daemon | Routing File |
---|---|
Agent | csm_event_routing_agent.cc |
Aggregator | csm_event_routing_agg.cc |
Master | csm_event_routing_master.cc |
Utility | csm_event_routing_utility.cc |
Generally speaking, registering a handler to a router is as simple as adding the following line to the RegisterHandlers function: Register<Handler_Class>(CSM_CMD_<api>);
Return Codes¶
As with all data types that exist in both the C front-end and the C++ back-end, return codes are defined with an X-Macro solution. The return code X-Macro file is located at csmi/include/csm_types/enum_defs/common/csmi_errors.def.
To protect backwards compatibility this file is guarded with versioning blocks; for details on how to add error codes please consult the README: csmi/include/csm_types/enum_defs/README.md
The generated error codes may be included from the csmi/include/csmi_type_common.h header.
Generally, the CSMI_SUCCESS error code should be used in cases of successful execution. Errors should be more granular to make error determination easier for users of the API; consult the list of errors before adding a new one to prevent duplicate error codes.
CSM API Wrappers¶
There exist two documented methodologies for wrapping a CSM API to reduce the barrier of usage for system administrators: python bindings and command line interfaces. Generally speaking python bindings are preferred, as they provide more flexibility to system administrators and end users.
Command line interfaces are generally written in C and are used to expose basic functionality to an API.
Command Line Interfaces¶
Command line interfaces in CSM are generally written in native C and expose basic functionality of the API, generally simplifying inputs or control over the output. When properly compiled, a native C command line interface will be placed in /csm/bin/ relative to the root of the compiled output. Please consult csmi/src/wm/cmd/CMakeLists.txt for examples of compilation settings.
Naming¶
The name of the CSM command line interface should match the name of the API one to one, especially in cases where the command line interface simply exposes the function of the API with no special modifications. For example, the csm_allocation_create API is literally csm_allocation_create on the command line.
Parameters¶
CSM command line interfaces must provide long options for all command line parameters. Short options are optional but preferred for more frequently used fields. A sample pairing of short and long options would be the help flag: -h, --help.
The -h, --help and -v, --verbose flag pairings are reserved and always correspond to help and verbose. These flags should be supported in all CSM command line interfaces.
All options should use the getopts utility; no options should be position dependent.
Good:
csm_command --node_name node1 --state "some string"
csm_command --state "some string" --node_name node1
Bad:
csm_command node1 --state "some string"
Output¶
CSM command line interfaces require that the YAML format is a supported output option. This is to facilitate command line parsers. In cases where YAML output is not ideal for command line readability the format may be changed, as in the case of csm_node_query_state_history.
In the following sample the output is still considered valid YAML (note the open and close tokens). Data that is not YAML formatted is commented out with the # character.
[root@c650f03p41 bin]# ./csm_node_query_state_history -n c650f03p41
---
node_name: c650f03p41
# history_time | state | alteration | RAS_rec_id, RAS_msg_id
# ----------------------------+----------------+----------------------+------------------------
# 2018-03-26 14:28:25.032879 | DISCOVERED | CSM INVENTORY |
# 2018-03-28 19:34:14.037409 | SOFT_FAILURE | RAS EVENT | 7, csm.status.down
...
By default, YAML is not presented on the command line. It is supported through a flag.
GENERAL OPTIONS:
[-h, --help] | Help.
[-v, --verbose verbose_level] | Set verbose level. Valid verbose levels: {off, trace, debug, info, warning, error, critical, always, disable}
[-Y, --YAML] | Set output to YAML. By default for this API, we have a custom output for ease of reading the long transaction history.
By setting the -Y flag, the command line will then display in YAML.
[root@c650f03p41 bin]# ./csm_node_query_state_history -n c650f03p41 -Y
---
Total_Records: 2
Record_1:
history_time: 2018-03-26 14:28:25.032879
node_name: c650f03p41
state: DISCOVERED
alteration: CSM INVENTORY
RAS_rec_id:
RAS_msg_id:
Record_2:
history_time: 2018-03-28 19:34:14.037409
node_name: c650f03p41
state: SOFT_FAILURE
alteration: RAS EVENT
RAS_rec_id: 7
RAS_msg_id: csm.status.down
...
Python Interfaces¶
CSM uses Boost.Python to generate the Python interfaces. Struct bindings occur automatically when
running the csmi/include/struct_generator/regenerate_headers.sh
script. Each API type has its
own file to which the struct bindings will be placed by the automated script and function bindings
will be placed by the developer.
The following documentation assumes the python bindings are being added to one of the following files:
API Type | Python Binding File | Python Library |
---|---|---|
Burst Buffer | csmi/src/bb/src/csmi_bb_python.cc | lib_csm_bb_py |
Common | csmi/src/common/src/csmi_python.cc | lib_csm_py |
Diagnostics | csmi/src/diag/src/csmi_diag_python.cc | lib_csm_diag_py |
Inventory | csmi/src/inv/src/csmi_inv_python.cc | lib_csm_inv_py |
Launch | csmi/src/launch/src/csmi_launch_python.cc | lib_csm_launch_py |
RAS | csmi/src/ras/src/csmi_ras_python.cc | lib_csm_ras_py |
Workload Management | csmi/src/wm/src/csmi_wm_python.cc | lib_csm_wm_py |
Function Binding¶
Function binding with the Boost.Python library is boilerplate:
tuple wrap_<api>(<input-struct> input)
{
// Always sets the metadata.
// Ensures that the python binding always matches what it was designed for.
input._metadata=CSM_VERSION_ID;
// Output objects.
csm_api_object * updated_handle;
<output-struct> * output= nullptr;
// Run the API
int return_code = <api>( (csm_api_object**)&updated_handle, &input, &output);
// A singleton is used to track CSM object handles.
int64_t oid = CSMIObj::GetInstance().StoreCSMObj(updated_handle);
// Returned tuples should always follow the pattern:
// <return code, handler id, output values (optional)>
return make_tuple(return_code, oid, *output);
}
BOOST_PYTHON_MODULE(lib_csm_<api-type>_py)
{
def("<api-no-csm>", wrap_<api>, CSM_GEN_DOCSTRING("docstring", ",<output_type>"));
}
Python Binding Limitations¶
As CSM was designed predominantly around its use of pointers, and is a C native API, certain operations using the python bindings are not currently Pythonic.
1: | The output of the APIs must be destroyed using csm.api_object_destroy(handler_id). |
---|---|
2: | Array access/creation must be performed through get and set functions. Once an array is set it is currently immutable from python. |
These limitations are subject to change.
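For example, a minimal sketch of the destroy pattern is shown below, using the binding calls documented in the guide that follows; the node name used here is hypothetical.
import sys
# Add the python library to the path and load the bindings.
sys.path.append('/opt/ibm/csm/lib')
import lib_csm_py as csm
import lib_csm_inv_py as inv
csm.init_lib()
input = inv.node_attributes_update_input_t()
input.set_node_names(["node_01"])
input.state = csm.csmi_node_state_t.CSM_NODE_IN_SERVICE
rc, handler, output = inv.node_attributes_update(input)
# ... use output while the handle is alive ...
csm.api_object_destroy(handler)  # limitation 1: free the output through the handle
csm.term_lib()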
CSM API Python Bindings Guide¶
About¶
The CSM API Python Bindings library works like other C Python binding libraries: CSM APIs can be accessed in Python because they are bound to C via Boost. More technical details can be found at https://wiki.python.org/moin/boost.python/GettingStarted, but understanding all of this is not required to use CSM APIs in Python. This guide provides a central location for users looking to utilize CSM APIs via Python. If you believe this guide to be incomplete, please make a pull request with your additional content.
User Notes¶
Accessing CSM APIs in Python is very similar to accessing them in C. If you are familiar with the process, then you are already in a good position. If not, then the CSM API team suggests reading up on the CSM API documentation and guides.
Importing¶
Before writing your script and accessing CSM APIs, you must first import the CSM library into your script.
import sys
#add the python library to the path
sys.path.append('/opt/ibm/csm/lib')
import lib_csm_py as csm
import lib_csm_inv_py as inv
First, you should tell Python where the library is located, which is what we did in the first lines above.
import sys
#add the python library to the path
sys.path.append('/opt/ibm/csm/lib')
Second, you should import the main CSM library lib_csm_py
. We did this and then nicknamed it csm
for ease of use later in our script.
import lib_csm_py as csm
Then, you should import any appropriate sub libraries for the CSM APIs that you will be using. If you want workload manager APIs such as csm_allocation_query
then import the workload manager library lib_csm_wm_py
. If you want inventory APIs, such as csm_node_attributes_update
, then import the inventory library lib_csm_inv_py
. Look at CSM API documentation for a full list of all CSM API libraries.
For my example, I have imported the inventory library and nicknamed it inv
for ease of use later.
import lib_csm_inv_py as inv
Connection to CSM¶
At this point, a Python script can connect to CSM the same way a user would connect to CSM in the C language. You must connect to CSM by running the CSM init function before calling any CSM APIs. This init function is located in the main CSM library we imported earlier.
In Python, we do this below:
csm.init_lib()
Just like in C, this function takes care of connecting to CSM.
Accessing the CSM API¶
Below I have some code from an example script of setting a node to IN_SERVICE
via csm_node_attributes_update
input = inv.node_attributes_update_input_t()
nodes=["allie","node_01","bobby"]
input.set_node_names(nodes)
input.state = csm.csmi_node_state_t.CSM_NODE_IN_SERVICE
rc,handler,output = inv.node_attributes_update(input)
print rc
if rc == csm.csmi_cmd_err_t.CSMERR_UPDATE_MISMATCH:
print output.failure_count
for i in range(0, output.failure_count):
print output.get_failure_node_names(i)
Let’s break down some important lines here for first time users of the CSM Python library.
input = inv.node_attributes_update_input_t()
Here we are doing a few things. Just like in C, before we call the API we need to set up the input for the API. We do this on this line. Because this is an inventory API, we can find its input struct in the inventory library we imported earlier via inv, and we create this as input.
We now fill input. When using the CSM Python library, arrays must be set and retrieved through dedicated set and get functions.
nodes=["allie","node_01","bobby"]
input.set_node_names(nodes)
input.state = csm.csmi_node_state_t.CSM_NODE_IN_SERVICE
First we create an array in Python: nodes=["allie","node_01","bobby"]. Then we use the CSM Python library function set_ARRAYNAME(array) to set the node_names field of input. We do not need to set node_names_count like we do in C; the set_ function will take care of that for you. Finally, we call input.state = csm.csmi_node_state_t.CSM_NODE_IN_SERVICE to set the state field of input to IN_SERVICE. This will tell CSM to set these 3 nodes to IN_SERVICE.
In the next line of code we call the csm API passing in the input we just populated.
rc,handler,output = inv.node_attributes_update(input)
Our CSM library returns 3 values.
- A return code - Here defined as rc. This is the same as the return code found in the C version of the API.
- A handler - An identifier used in the csm.api_object_destroy function.
- The API output - Here defined as output. This is the same as the output parameter found in the C version of the API. We will use this to access any output from the API, similar to how you would use it in the C version.
If you noticed, earlier I set nodes=["allie","node_01","bobby"]. allie and bobby are not real nodes, so the API will have some output data for us to check.
print rc
if rc == csm.csmi_cmd_err_t.CSMERR_UPDATE_MISMATCH:
print output.failure_count
for i in range(0, output.failure_count):
print output.get_failure_node_names(i)
The end of our sample script here first prints the return code, then if it matches the CSMERR_UPDATE_MISMATCH
prints additional information. Checking error codes and return codes from an API can be useful. The values are the same as the C APIs. Look at CSM API documentation for a full list of all CSM API return codes. Just like in the C version of APIs, error codes are found in the common API folder, which was included earlier as csm
.
Next we print out all the names of the nodes that could not be updated in the CSM database. To do this, we must access an array.
Arrays in the CSM Python library must be accessed using this get_
function. Following the pattern of get_ARRAYNAME
. The array names and fields of a CSM struct are the same as the C versions. Please look at CSM API documentation for a list of your struct and struct field names.
So in our example here, our struct has an array named failure_node_names
. To access it, we must call get_failure_node_names(i)
. i
here represents the element we want to access. Just like in the C version, output.failure_count
tells us how many elements are in our array.
This example keeps it simple and doesn’t do anything too crazy. We just loop through the array and print all the node names that did not update.
Cleaning Up and Closing Connection to CSM¶
Just like in C, when you are done communicating with CSM you must clean up and close the connection. You call the same functions you would in C: api_object_destroy and term_lib. These will clean up memory and terminate the connection to CSM.
csm.api_object_destroy(handler)
csm.term_lib()
Conclusion¶
This concludes the walkthrough of using the CSM Python library. If you have further questions, you can contact: https://github.com/NickyDaB . If you want more samples to analyze, explore: CAST/csmi/python_samples.
FAQ - Frequently Asked Questions¶
How do I access and set arrays in the CSM Python library?¶
When using the CSM Python library arrays must be set
and get
.
Get¶
Example:
if(result.dimms_count > 0):
print(" dimms:")
for j in range (0, result.dimms_count):
dimm = result.get_dimms(j)
print(" - serial_number: " + str(dimm.serial_number))
Here let’s assume that the dimms_count is > 0. Let’s say 3. The code will loop through each dimm, printing its serial number. The important line here is: dimm = result.get_dimms(j)
here we are accessing an array.
Arrays in the CSM Python library must be accessed using this get_
function. Following the pattern of get_ARRAYNAME
. The array names and fields of a CSM struct are the same as the C versions. Please look at CSM API documentation for a list of your struct and struct field names.
So in our example here, our struct has an array named dimms
. To access it, we must call get_dimms(j)
. j
here represents the element we want to access. dimm
represents how we will store this element.
Once stored, dimm
can be accessed like any other struct. print(" - serial_number: " + str(dimm.serial_number))
Set¶
Example:
input = inv.node_attributes_update_input_t()
nodes=["node_01","node_02","node_03"]
input.set_node_names(nodes)
input.state = csm.csmi_node_state_t.CSM_NODE_IN_SERVICE
Here we want to use csm_node_attributes_update
to set a few nodes to IN_SERVICE
. The API’s input takes in a list of nodes. So in Python we will need to set this array of node names. The important line here is: input.set_node_names(nodes)
here we are setting the array of the struct to an array we previously created.
Before we can call set_node_names(nodes)
we need to populate nodes
.
nodes=["node_01","node_02","node_03"]
Once nodes
has been defined, we can call: set_node_names(nodes)
.
Arrays in the CSM Python library must be set using this set_
function. Following the pattern of set_ARRAYNAME
. The set_
function requires a single parameter of a populated array. (Here that is nodes
.) The array names and fields of a CSM struct are the same as the C versions. Please look at CSM API documentation for a list of your struct and struct field names.
So in our example here, our struct has an array named node_names
. To set it, we must call input.set_node_names(nodes)
. nodes
here represents the Python array we already created in the previous line. input
represents the parent struct that contains this array.
Soft Failure Recovery¶
CSM defines a set of mechanisms for recovering from Soft Failure events.
A Soft Failure is an event which is considered to be largely intermittent. Generally, a soft failure may be caused by a networking issue or a failure in the Prolog/Epilog. CSM has a set of conditions under which it will trigger a Soft Failure to prevent scheduling until the intermittent failure is resolved. It is also expected that system administrators will define Soft Failure events in their Prolog/Epilog.
When a node is placed into Soft Failure it must be returned to In Service before the scheduler will be allowed to select the node for further allocations. If the node exceeds a user-specified retry count (either via recurring task or command line) the node will be moved from Soft Failure to Hard Failure.
Success for moving from Soft Failure to In Service is determined by three metrics:
- CSM is able to clear all CGroups (soft failure means the node should host no allocations).
- The admin defined Recovery Script executed and returned zero.
- The recovery process didn’t timeout.
The following diagram is a high level abstraction of the state machine interacted with by the soft failure recovery mechanism:
![digraph G {
"Soft Failure" -> "Soft Failure" [label=" Retry"];
"Soft Failure" -> "In Service" [labelfontcolor="#009900" label="Recovery\nSuccess" color="#009900"];
"Soft Failure" -> "Hard Failure" [label=" Recovery\nFailure" color="#993300"];
"In Service" -> "Soft Failure" [label="Intermittent\nError" color="#993300"];
}](_images/graphviz-867ed196dbfe29f3710c0f2f8bacd886820c65d8.png)
Recurring Task Configuration¶
To configure the Soft Failure recovery mechanism, please refer to the csm_soft_failure_recovery recurring task configuration documentation.
Additionally, depending on the complexity of the Recovery Script, the admin should modify the API Configuration timeout of csm_soft_failure_recovery to account for at least twice the projected runtime of the recovery script.
Command Line Interface¶
CSM provides a command line script to trigger a Soft Failure recovery. Invocation is as follows:
/opt/ibm/csm/bin/csm_soft_failure_recovery -r <retry_threshold>
The -r or --retry option sets a retry threshold; if this threshold is met or exceeded by any node that failed to be placed into In Service, that node will be moved to Hard Failure.
Attention
Nodes that are in Soft Failure and owned by an allocation will NOT be processed by this utility!
Recovery Script¶
Attention
A recovery script must be located at /opt/ibm/csm/recovery/soft_failure_recovery to use the Soft Failure recovery mechanism!
A sample of the recovery script is placed in /opt/ibm/csm/share/recovery when installing the ibm-csm-core rpm. The sample script is extremely basic and is expected to be modified by the end user.
A recovery script must fit the following criteria:
- Be located at /opt/ibm/csm/recovery/soft_failure_recovery.
- Return 0 if the recovery was a success.
- Return > 0 in the event the recovery failed.
The recovery script takes no input parameters at this time.
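A minimal sketch of such a script, written in Python, is shown below. The real recovery logic is site specific; the shipped sample in /opt/ibm/csm/share/recovery remains the recommended starting point.
#!/usr/bin/python
# Hypothetical sketch of /opt/ibm/csm/recovery/soft_failure_recovery.
# Exit 0 on successful recovery, greater than 0 on failure.
import sys
def node_is_healthy():
    # Site-specific checks and cleanup (network health, stale state, etc.) go here.
    return True
sys.exit(0 if node_is_healthy() else 1)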
Change Log¶
- 1.4.0
- Enum Types
- Struct Types
- Workload Management
- csmi_allocation_gpu_metrics_t
- csmi_allocation_mcast_context_t
- csmi_allocation_mcast_payload_request_t
- csmi_allocation_mcast_payload_response_t
- csmi_jsrun_cmd_payload_t
- csmi_soft_failure_recovery_payload_t
- csm_soft_failure_recovery_node_t
- csm_soft_failure_recovery_input_t
- csm_soft_failure_recovery_output_t
- Inventory
- Common
- Workload Management
1.4.0¶
The following document has been automatically generated to act as a change log for CSM version 1.4.0.
Enum Types¶
Common¶
- Added 10
- CSMERR_ALLOC_INVALID_NODES=46
- CSMERR_ALLOC_OCCUPIED_NODES=47
- CSMERR_ALLOC_UNAVAIL_NODES=48
- CSMERR_ALLOC_BAD_FLAGS=49
- CSMERR_ALLOC_MISSING=50
- CSMERR_EPILOG_EPILOG_COLLISION=51
- CSMERR_EPILOG_PROLOG_COLLISION=52
- CSMERR_PROLOG_EPILOG_COLLISION=53
- CSMERR_PROLOG_PROLOG_COLLISION=54
- CSMERR_SOFT_FAIL_RECOVERY_AGENT=55
- New Data Type
- CSM_NODE_NO_DEF=0
- CSM_NODE_DISCOVERED=1
- CSM_NODE_IN_SERVICE=2
- CSM_NODE_OUT_OF_SERVICE=3
- CSM_NODE_SYS_ADMIN_RESERVED=4
- CSM_NODE_SOFT_FAILURE=5
- CSM_NODE_MAINTENANCE=6
- CSM_NODE_DATABASE_NULL=7
- CSM_NODE_HARD_FAILURE=8
Struct Types¶
Workload Management¶
- New Data Type
- int64_t num_gpus
- int32_t* gpu_id
- int64_t* gpu_usage
- int64_t* max_gpu_memory
- int64_t num_cpus
- int64_t* cpu_usage
- New Data Type
- int64_t allocation_id
- int64_t primary_job_id
- int32_t num_processors
- int32_t num_gpus
- int32_t projected_memory
- int32_t secondary_job_id
- int32_t isolated_cores
- uint32_t num_nodes
- csmi_state_t state
- csmi_allocation_type_t type
- int64_t* ib_rx
- int64_t* ib_tx
- int64_t* gpfs_read
- int64_t* gpfs_write
- int64_t* energy
- int64_t* gpu_usage
- int64_t* cpu_usage
- int64_t* memory_max
- int64_t* power_cap_hit
- int32_t* power_cap
- int32_t* ps_ratio
- csm_bool shared
- char save_allocation
- char** compute_nodes
- char* user_flags
- char* system_flags
- char* user_name
- int64_t* gpu_energy
- char* timestamp
- csmi_state_t start_state
- int64_t runtime
- csmi_allocation_gpu_metrics_t** gpu_metrics
- Added 1
- int64_t runtime
New Data Type
- int64_t energy
- int64_t pc_hit
- int64_t gpu_usage
- int64_t ib_rx
- int64_t ib_tx
- int64_t gpfs_read
- int64_t gpfs_write
- int64_t cpu_usage
- int64_t memory_max
- int32_t power_cap
- int32_t ps_ratio
- char create
- char* hostname
- int64_t gpu_energy
- csmi_cmd_err_t error_code
- char* error_message
- csmi_allocation_gpu_metrics_t* gpu_metrics
Added 4
- uint32_t num_nodes
- char** compute_nodes
- char* launch_node
- csmi_allocation_type_t type
New Data Type
- char* hostname
- csmi_cmd_err_t error_code
- char* error_message
New Data Type
- uint32_t error_count
- csm_soft_failure_recovery_node_t** node_errors
Inventory¶
New Data Type
- int32_t insert_count
- int32_t update_count
- int32_t delete_count
New Data Type
- int32_t limit
- int32_t offset
- uint32_t switch_names_count
- char* state
- char** switch_names
- char* serial_number
- char order_by
- uint32_t roles_count
- char** roles
New Data Type
- char TBD
- int32_t insert_count
- int32_t update_count
- int32_t delete_count
- int32_t delete_module_count
New Data Type
- int32_t insert_count
- int32_t update_count
- int32_t delete_count
Common¶
New Data Type
- int errcode
- char* errmsg
- uint32_t error_count
- csm_node_error_t** node_errors
CSM Infrastructure¶
The managing process of CSM. The infrastructure facilitates the interaction of the local CSM APIs, the CSM Database, and the cluster's compute nodes.
A broad, general visualization of the infrastructure is reproduced below:
![digraph G {
User -> Utility;
Utility -> Master;
Master -> Utility;
Master -> Master;
Master -> "CSM Database";
"CSM Database" -> Master
Master -> Aggregator;
Aggregator -> Master;
Aggregator -> Compute;
Compute -> Aggregator;
User [shape=Mdiamond];
"CSM Database" [shape=cylinder];
}](_images/graphviz-804052992b9928cfa7e0dc89e3fd08fe310c5fb6.png)
CSMD Executable¶
The csmd executable is bundled in the csm-core-*.rpm at /opt/ibm/csm/sbin/csmd.
This executable has been daemonized to run the CSM Infrastructure.
CSMD Command line options¶
Supported Command Line Options:
-h [ --help ] Show this help
-f [ --file ] arg Specify configuration file
(default: /etc/ibm/csm/csm_master.cfg)
-r [ --role ] arg Set the role of the daemon (M|m)[aster] |
(A|a)[ggregator] | (C|c)[ompute] |
(U|u)[tility]
Note
- The role is determined by the first letter of the role argument.
- The file path should be an absolute path to avoid confusion.
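For example, a compute daemon could be started manually with an explicit role and configuration file:
/opt/ibm/csm/sbin/csmd -r c -f /etc/ibm/csm/csm_compute.cfg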
CSMD Services¶
CSM defines four service types that are accessible through systemctl.
Type | Config | Service |
---|---|---|
Utility | /etc/ibm/csm/csm_utility.cfg | csmd-utility.service |
Master | /etc/ibm/csm/csm_master.cfg | csmd-master.service |
Aggregator | /etc/ibm/csm/csm_aggregator.cfg | csmd-aggregator.service |
Compute | /etc/ibm/csm/csm_compute.cfg | csmd-compute.service |
The following is a sample of how to manage these services:
systemctl [status|start|stop|restart] csmd-utility
CSMD Configuration¶
To configure the csmd daemon please refer to CSMD Configuration.
ACL Configuration¶
To use the CSM APIs with proper security, an ACL file must be configured. Using a combination of user privilege level and API access level, CSM determines what actions to perform when an API is called by a user.
For example, if the user doesn't have the proper privilege on a private API, the returned information will be limited or denied altogether.
A user can be either privileged or non-privileged. To become a privileged user, either the user name must be listed as a privileged user in the ACL file or the user needs to be a member of a group that's listed as a privileged group.
A template or default ACL file is included in the installation and can be found under /opt/ibm/csm/share/etc/csm_api.acl.
{
"privileged_user_id": "root",
"privileged_group_id": "root",
"private":
["csm_allocation_query_details",
"csm_allocation_delete",
"csm_allocation_update_state",
"csm_bb_cmd",
"csm_jsrun_cmd",
"csm_allocation_step_query_details"],
"public":
["csm_allocation_step_cgroup_create",
"csm_allocation_step_cgroup_delete",
"csm_allocation_query",
"csm_allocation_query_active_all",
"csm_allocation_resources_query",
"csm_allocation_step_begin",
"csm_allocation_step_end",
"csm_allocation_step_query",
"csm_allocation_step_query_active_all",
"csm_diag_run_query",
"csm_node_attributes_query",
"csm_node_attributes_query_history",
"csm_node_resources_query",
"csm_node_resources_query_all"]
}
The CSM API ACL configuration is done through the file pointed at by the
setting in the csm config file (csm.api_permission_file
). It is required
to be in json format. The main entries are:
privileged_user_id: | Lists the users that are allowed to perform administrator tasks, i.e. call privileged CSM APIs. The user root will always be able to call APIs regardless of the configured privilege level. If more than one user needs to be listed, use the json array syntax (see the sketch after this list). |
---|---|
privileged_group_id: | Lists the groups whose members are allowed to perform administrator tasks, i.e. call privileged CSM APIs. Users in group root will always be able to call APIs regardless of the configured privilege level. If more than one group needs to be listed, use the json array syntax. |
private: | Specifies a list of CSM APIs that are private. A private API can only be called by privileged users or by owners of the corresponding resources. For example, csm_allocation_query_details can only be called by the owner of the requested allocation. |
public: | Specifies a list of CSM APIs that can be called by any user who has access to the node and to the client_listen socket of the CSM daemon. |
privileged: | Explicitly configures a list of CSM APIs as privileged APIs. This section is not present in the template ACL file because any API is privileged unless it is listed as private or public. |
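The following is a sketch of how multiple privileged users and groups could be listed with the json array syntax; the names other than root are hypothetical placeholders:
{
    "privileged_user_id": ["root", "csmadmin"],
    "privileged_group_id": ["root", "csmadmins"],
    "private": ["csm_allocation_query_details"],
    "public": ["csm_allocation_query"]
}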
Warning
The ACL files should be synchronized between all nodes of the CSM infrastructure. Each daemon will attempt to enforce as many of the permissions as possible before routing the request to other daemons for further processing.
For example, if a user calls an API on a utility node where the API is configured public, there will be no further permission check if that request is forwarded to the master even if the ACL config on the master configures the API as private or privileged.
The permissions of a request are determined at the point of entry to the infrastructure. Enforcement is based on the effective user id and group id on the machine that runs the requesting client process.
API Configuration¶
The CSM API configuration file (json) allows the admin to set a number of API-specific parameters.
{
"#comment_1" : "This will be ignored",
"csm_allocation_create" : 120,
"csm_allocation_delete" : 120,
"csm_allocation_update_state" : 120,
"csm_allocation_step_end" : 120,
"csm_allocation_step_begin" : 120,
"csm_allocation_query" : 120,
"csm_bb_cmd" : 120,
"csm_jsrun_cmd" : 60,
"csm_soft_failure_recovery" : 240
}
At the moment this only includes the timeout for CSM APIs (in seconds). The API config file path and name are defined by the CSM config file setting csm.api_configuration_file.
Warning
The API configuration files should be synchronized between all nodes of the CSM infrastructure to avoid unexpected API timeout behavior.
The current version of CSM calculates daemon-role-specific, fixed API timeouts based on the configuration file. This means the actual timeouts will be lower than the configured time, to account for delays in the communication, the processing, or the number of internal round trips required by certain APIs.
For example, an API called from the utility node is configured with a 120s timeout. Once the request is forwarded to the master, the master will enforce a timeout of 119s, accounting for network and processing delays.
If the request requires the master to reach out to compute nodes, the aggregators will enforce a timeout of 58s, because the aggregator accounts for some APIs requiring 2 round trips and 1 additional network hop.
Generally, the expected enforced timeout is: <value> / 2 - 2s. For instance, with csm_jsrun_cmd configured at 60s as above, the enforced timeout on that path would be roughly 60 / 2 - 2 = 28s.
CSMD Configuration¶
Each type of daemon is set up via a dedicated configuration file (default location: /etc/ibm/csm/csm_*.cfg). The format of the config file is json; json parse errors indicate formatting problems in the config file.
Warning
The CSM daemon needs to be restarted for any changes to the configuration to take effect.
The `csm` Block
{
"csm" :
{
"role": "<daemon_role>",
"thread_pool_size" : 1,
"api_permission_file": "/etc/ibm/csm/csm_api.acl",
"api_configuration_file": "/etc/ibm/csm/csm_api.cfg",
"log" : { },
"db" : { },
"inventory" : { },
"net" : { },
"ras" : { },
"ufm" : { },
"bds" : { },
"recurring_tasks": { },
"data_collection" : { }
}
}
Beginning with the top-level configuration section csm:
role: | Sets the role of the daemon (master | aggregator | utility | compute). |
---|---|
thread_pool_size: | Controls the number of worker threads that are used to process CSM API calls and event handling. A setting of 1 should generally suffice. However, some CSM API calls spawn external processes which in turn might call other CSM APIs (e.g. csm_allocation_create() spawning the prolog); the worker thread waits for the completion of the spawned process, and with only one available worker there would be no resources left to process the additional API call. This is why a setting of at least 4 is recommended for compute nodes (see the sketch after this list). |
api_permission_file: | Points to the file that controls the permissions of API calls, specifying admin users/groups and classifying CSM APIs as public, private, or privileged. See ACL Configuration for details. |
api_configuration_file: | Points to the file that contains detailed configuration settings for CSM API calls. If an API requires a non-default timeout, it should be configured in that file. See API Configuration for details. |
log: | A subsection that defines the logging level of various components of the daemon. See The log block. |
db: | A master-specific block to set up the database connection. See The Database [db] Block. |
inventory: | Configures the inventory collection component. See The inventory Block. |
net: | Configures the network component to define the interconnectivity of the CSM infrastructure. See The Network [net] Block. |
ras: | Documentation ongoing. |
ufm: | Configures access to UFM. See The ufm Block. |
bds: | Addresses, ports, and other settings for BDS access. See The bds Block. |
recurring_tasks: | Sets up intervals and types of predefined recurring tasks to be triggered by the daemon. See The recurring_tasks Block. |
data_collection: | Enables and configures predefined buckets for environmental data collection. See The data_collection Block. |
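As a minimal sketch of how these settings are typically adjusted, a compute daemon could raise the worker thread count while keeping the documented default file locations (only values already shown in this document are used; all other subsections keep their defaults):
{
    "csm" :
    {
        "role": "compute",
        "thread_pool_size" : 4,
        "api_permission_file": "/etc/ibm/csm/csm_api.acl",
        "api_configuration_file": "/etc/ibm/csm/csm_api.cfg"
    }
}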
The log block¶
The log block determines what amount of logging goes to which files and console. This block also specifies log rotation options.
{
"format" : "%TimeStamp% %SubComponent%::%Severity% | %Message%",
"consoleLog" : false,
"sysLog" : true,
"fileLog" : "/var/log/ibm/csm/csm_master.log",
"#rotationSize_comment_1" : "Maximum size (in bytes) of the log file, ~1GB",
"rotationSize" : 1000000000,
"default_sev" : "warning",
"csmdb" : "info",
"csmnet" : "info",
"csmd" : "info",
"csmras" : "info",
"csmapi" : "info",
"csmenv" : "info",
"transaction" : true,
"transaction_file" : "/var/log/ibm/csm/csm_transaction.log",
"transaction_rotation_size" : 1000000000
"allocation_metrics" : true,
"allocation_metrics_file" : "/var/log/ibm/csm/csm_allocation_metrics.log",
"allocation_metrics_rotation_size" : 1000000000
}
format: | Defines a template for the format of the CSM log lines. In the given example, a log Message is prefixed with the TimeStamp followed by the name of the SubComponent and the Severity. The SubComponent helps to identify the source of the message (e.g. csmnet = network component; csmapi = CSM API call processing). |
---|---|
consoleLog: | Determines whether the logs should go to the console or not. Can be true or false. |
sysLog: | Determines whether the logs should go to syslog or not. Can be true or false. |
fileLog: | Specifies the file the daemon logs are written to (the example above writes to /var/log/ibm/csm/csm_master.log). |
rotationSize: | Limits the size (in bytes) of the log file before starting a new log file. If set to -1 the file is allowed to grow without limit. |
default_sev: | Sets the logging level/verbosity for any component that is not mentioned explicitly. Options include the standard severity levels (such as error, warning, info, debug). |
csmdb: | Log level of the database component. Includes messages about database access and request handling. |
csmnet: | Log level of the network component. Includes messages about the network interaction between daemons as well as between daemons and client processes. |
csmd: | Log level of the core daemon. Includes messages from the core of the infrastructure handling and management. |
csmras: | Log level of the RAS component. Includes messages about RAS events and their processing within the daemon. |
csmapi: | Log level of CSM API handling. Includes messages about API call processing. |
csmenv: | Log level of environmental data handling. Includes messages related primarily to data collection and shipping from compute nodes to aggregators. |
transaction: | Enables the transaction log mechanism. |
transaction_file: | Specifies the location the transaction log will be saved to. |
transaction_rotation_size: | The size of the file (in bytes) at which the log is rotated. |
allocation_metrics: | Enables the allocation metrics log mechanism. |
allocation_metrics_file: | Specifies the location the allocation metrics log will be saved to. |
allocation_metrics_rotation_size: | The size of the file (in bytes) at which the log is rotated. |
The Database [db] Block¶
The database block configures the location and access parameters of the CSM database. The settings are specific and relevant to the master daemon only.
{
"connection_pool_size" : 10,
"host" : "127.0.0.1",
"name" : "csmdb",
"user" : "csmdb",
"password" : "",
"schema_name" : ""
}
connection_pool_size: | Configures the number of connections to the database. This number also specifies the number of database worker threads for concurrent access and parallel processing of requests. CSM recommends adjusting this size empirically based on system size and demand; demand grows with the size of the system and the frequency of CSM API calls. |
---|---|
host: | The hostname or IP address of the database server. |
name: | The name of the database on the database server. |
user: | The username that CSM should use to access the database. |
password: | The password to access the database. Attention: be sure to restrict the read permissions of the configuration file when a password is stored here. |
schema_name: | Configures the name of a named schema in the database, in case one is in use (optional in the default configuration). |
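As a quick check that these settings are usable, the connection can be exercised directly with the PostgreSQL command line client (a sketch; it assumes psql is installed on the master node and uses the default values shown above):
psql -h 127.0.0.1 -U csmdb -d csmdb -c "SELECT version FROM csm_db_schema_version;"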
The inventory Block¶
The inventory block configures the location of files that are used for collection of the network inventory.
{
"csm_inv_log_dir" : "/var/log/ibm/csm/inv",
"ufm":
{
"ib_cable_errors" : "bad_ib_cable_records.txt",
"switch_errors" : "bad_switch_records.txt"
}
}
csm_inv_log_dir: | The absolute path for inventory collection logs. |
---|---|
ufm: | Subsection specifying the file names used to record errors found during UFM-based inventory collection: ib_cable_errors for bad IB cable records and switch_errors for bad switch records. |
The Network [net] Block¶
The network block defines the hostnames, ports, and other important parameters of the CSM daemon infrastructure. Several subsections are specific to the role of the daemon.
{
"heartbeat_interval" : 15,
"local_client_listen" :
{
"socket" : "/run/csmd.sock",
"permissions" : 777,
"group" : ""
},
"ssl":
{
"ca_file" : "",
"cred_pem" : ""
}
}
General settings available for all daemon roles:
heartbeat_interval: | Determines the interval (in seconds) that this daemon will use for any connections to other CSM daemon(s) of the infrastructure. The actual interval of a connection will be the minimum interval of the 2 peers of that connection. For example, if one daemon initiates a connection with an interval of 60s while the peer daemon is configured to use 15s, both daemons will use a 15s interval for this connection. Note: it takes about 3 intervals for a daemon to consider a connection as dead. Because each connection's heartbeat uses the minimum of the two configured values, different intervals can be run between different daemon pairs if necessary or desired. |
---|---|
local_client_listen: | This subsection configures a unix domain socket where the daemon will receive requests from local clients. It is available for all daemon roles. Note: if multiple daemons run on the same node, this section needs a dedicated setting for each daemon. |
ssl: | This subsection allows the user to enable SSL encryption and authentication between daemons. If either of the two settings (ca_file, cred_pem) is non-empty, the CSM daemon will enable SSL for daemon-to-daemon connections by using the specified files. Note: since there is only one certificate entry in the configuration, the same certificate has to serve as client and server certificate at the same time. This puts some limitations on the configuration of the certificate infrastructure. |
Note
Note that the heartbeat does not determine the overall health of a peer daemon. A daemon might be able to respond to heartbeats while still being unable to respond to API calls. A successful exchange of heartbeats tells the daemon that there is a functional network connection and that the network manager thread is able to process inbound and outbound messages. To check whether a daemon is able to process API calls, use the infrastructure health check tool.
Note
The following is an explanation of the heartbeat mechanism, showing why it takes about 3 intervals to detect a dead connection.
The heartbeat between daemons works as follows:
- After creating the connection, the daemons negotiate the smallest interval and start the timer.
- Whenever a message arrives at one daemon, the timer is reset.
- If the timer triggers, the daemon sends a heartbeat message to the peer and sets the connection status as UNSURE (as in unsure whether the peer is still alive) and resets the timer.
- If the peer receives the heartbeat, it will reset its timer. After the timer triggers, it will send a heartbeat back.
- If the peer responds, the timer is reset and the connection status is HAPPY.
- If the peer doesn’t respond and the timer triggers again, the daemon will send a second heartbeat, reset the timer, and change the status to MISSING_RECV.
- If the timer triggers without a response, the connection will be considered DEAD and torn down.
Network Destination Blocks¶
The following blocks all use the same two fields:
host: | Determines the hostname or IP address of the listening socket. Note: to bind a particular interface, it is recommended to use an explicit IP address. Template entries like __MASTER__ and __AGGREGATOR__ are placeholders for the IP or host of a CSM daemon with that role. A host entry may also be set to NONE to disable the corresponding connection (see aggregatorB below). |
---|---|
port: | Specifies the port of a socket; it is used as both a listening and a destination port. |
{
"aggregator_listen":
{
"host": "__MASTER__",
"port": 9815
},
"utility_listen":
{
"host": "__MASTER__",
"port": 9816
},
"compute_listen":
{
"host": "__AGGREGATOR__",
"port": 9800
},
"master":
{
"host": "__MASTER__",
"port": 9815
},
"aggregatorA" :
{
"host": "__AGGREGATOR_A__",
"port": 9800
},
"aggregatorB" :
{
"host": "__AGGREGATOR_B__",
"port": 9800
}
}
Possible connection configuration sections:
aggregator_listen: | [master] Specifies the interface and port where the master expects aggregators to connect. |
---|---|
utility_listen: | [master] Specifies the interface and port where the master expects utility daemons to connect. |
compute_listen: | [aggregator] Specifies the interface and port where an aggregator expects compute nodes to connect. |
master: | [utility, aggregator] Configures the coordinates of the master daemon. |
aggregatorA: | [compute] Configures the coordinates of the primary aggregator. The primary aggregator must be configured to allow the compute node to work (required to start). |
aggregatorB: | [compute] Configures the coordinates of the secondary aggregator. Setting the host of this section to NONE will disable the compute daemons' attempt to create and maintain a redundant path through a secondary aggregator. |
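For illustration, the relevant part of a compute daemon's net block might look as follows when no secondary aggregator is available; the aggregator hostname is a placeholder and the ports are the defaults shown above:
{
    "aggregatorA" :
    {
        "host": "aggregator01",
        "port": 9800
    },
    "aggregatorB" :
    {
        "host": "NONE",
        "port": 9800
    }
}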
The ufm Block¶
The ufm block configures the location of, and access to, UFM.
{
"rest_address" : "__UFM_REST_ADDRESS__",
"rest_port" : 80,
"ufm_ssl_file_path" : "/etc/ibm/csm",
"ufm_ssl_file_name" : "csm_ufm_ssl_key.txt"
}
rest_address: | The hostname of the UFM server. |
---|---|
rest_port: | The port UFM is serving the RESTful interface on (generally 80, as in the default configuration above). |
ufm_ssl_file_path: | The path to the SSL file for UFM access. |
ufm_ssl_file_name: | The name of the SSL file for UFM access. The file may be generated using the following command: openssl base64 -e <<< ${username}:${password} > /etc/ibm/csm/csm_ufm_ssl_key.txt |
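To verify the content of the generated key file, the encoded credentials can be decoded again with standard openssl options:
openssl base64 -d < /etc/ibm/csm/csm_ufm_ssl_key.txt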
The bds Block¶
The BDS block configures the access to the Big Data Store.
{
"host" : "__LOGSTASH__",
"port" : 10522,
"reconnect_interval_max" : 5,
"data_cache_expiration" : 600
}
host: | Points to the hostname or IP address of the Logstash service. If the Logstash configuration section of this documentation was followed, this points to the host running the Logstash service. |
---|---|
port: | The port on the Logstash host that CSM should send entries to. If the Logstash configuration section of this documentation was followed, this is 10522 (as in the default above). |
reconnect_interval_max: | The maximum reconnect interval, in seconds, to the Logstash server. This limits the frequency of reconnect attempts to the Logstash server in the event the service is down. If the aggregator daemon is unable to connect, it will delay the next attempt for 1s. If the next attempt fails, it will wait 2s before retrying. The delay keeps increasing with each failed attempt until it reaches reconnect_interval_max seconds. |
data_cache_expiration: | The number of seconds the daemon will keep any environmental data that failed to get sent to Logstash. To limit the loss of environmental data, it is recommended to set the expiration to be longer than the maximum reconnect interval. |
Note
This block is only leveraged on the Aggregator.
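A quick way to confirm that the aggregator node can reach the configured endpoint is a simple TCP connectivity test (a sketch; it assumes a netcat-style nc utility is installed, and the placeholder host must be replaced with the actual Logstash host):
nc -vz __LOGSTASH__ 10522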
The recurring_tasks Block¶
{
"enabled" : false,
"soft_fail_recovery" :
{
"enabled" : false,
"interval" : "00:01:00",
"retry" : 3
}
}
The recurring tasks configuration block schedules recurring tasks that are supported by CSM.
enabled: | Indicates whether or not recurring tasks will be processed by the daemons. |
---|
soft_fail_recovery¶
The soft failure recovery task executes the soft_failure_recovery API over the specified interval, for the number of retries specified.
{
"enabled" : false,
"interval" : "00:01:00",
"retry" : 3
}
enabled: | Indicates whether or not this task will be processed by the daemons. |
---|---|
interval: | The interval between runs of the recurring task, in the format HH:mm:ss. |
retry: | The number of times to retry the task on a specific node before placing the node into soft failure. If the daemon is restarted, the retry count for the node is reset. |
Attention
This is only defined on the Master Daemon.
The data_collection Block¶
The data collection block configures environmental data collection on compute nodes. It has no effect on other daemon roles.
{
"buckets":
[
{
"execution_interval":"00:10:00",
"item_list": ["gpu", "environmental"]
}
]
}
buckets: | A json array of buckets for collection of environmental data. Each array element (bucket) specifies an execution_interval (in the format HH:mm:ss) and an item_list of the data categories to collect, as in the example above and the sketch below. |
---|
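For illustration, a configuration that collects GPU data more frequently than the remaining environmental data might look as follows; the intervals are arbitrary and the item names are taken from the example above:
{
    "buckets":
    [
        {
            "execution_interval":"00:10:00",
            "item_list": ["gpu"]
        },
        {
            "execution_interval":"00:30:00",
            "item_list": ["environmental"]
        }
    ]
}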
CSM Database¶
CSM database (CSM DB) holds information about the system's hardware configuration, hardware inventory, RAS, diagnostics, job steps, job allocations, and CSM configuration. This information is essential for the CORAL system to run properly and for resource accounting.
CSM DB uses PostgreSQL.
CSM Database Appendix¶
Naming conventions¶
CSM Database Overview
Object | Naming Convention | Example |
---|---|---|
Table | Table names start with the “csm” prefix, for example csm_node. History table names add the “_history” suffix. | csm_node_history |
Primary Key | Primary key names are automatically generated within PostgreSQL, starting with the table name and followed by pkey. | ${table name}_pkey, e.g. csm_node_pkey |
Unique Key | Unique key names start with “uk” followed by the table name and a letter indicating the sequence (a, b, c, etc.). | uk_${table name}_b, e.g. uk_csm_allocation_b |
Foreign Key | Foreign key names are automatically generated within PostgreSQL, starting with the table name, followed by a list of field(s), and ending with fkey. | ${table name}_${column names}_fkey, e.g. csm_allocation_node_allocation_id_fkey |
Index | Index names start with the prefix “ix” followed by the table name and a letter indicating the sequence (a, b, c, etc.). | ix_${table name}_a, e.g. ix_csm_node_history_a |
Functions | Function names start with the prefix “fn” followed by a name, usually related to the table and its purpose or arguments, if any. | fn_function_name_purpose, e.g. fn_csm_allocation_history_dump |
Triggers | Trigger names start with the prefix “tr” followed by a name, usually related to the table and its purpose. | tr_trigger_name_purpose |
History Tables¶
CSM DB keeps track of data as it changes over time. History tables are used to store these records, and a history time stamp is generated to indicate that the transaction has completed. The information remains in these tables until further action is taken.
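Because every history table carries a history_time column (indexed as ix_*_history_a in the charts below), recent history can be queried efficiently. The following is a sketch using the node history table, assuming access to the CSM database as configured in the db block:
psql -h 127.0.0.1 -U csmdb -d csmdb -c "SELECT * FROM csm_node_history WHERE history_time > now() - interval '1 day';"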
Usage and Size¶
The usage and size of each table will vary depending on system size and system activity. This document tries to estimate the usage and size of the tables. Usage is defined as how often a table is accessed and is recorded as Low, Medium, or High. Size indicates how many rows are within the database tables and is recorded as a total number of rows.
Table Categories¶
The CSM database tables are grouped and color coordinated to demonstrate which category they belong to within the schema. These categories correspond to the groups of tables described in the following sections.
Tables¶
Node attributes tables¶
csm_node¶
Description
This table contains the attributes of all the nodes in the CORAL system including: management node, service node, login node, workload manager node, launch node, and compute node.
Table | Overview | Action On: |
---|---|---|
Usage | High (CSM APIs access this table regularly)
|
|
Size | 1-5000 rows (total nodes in a CORAL System)
|
|
Key(s) | PK: node_name
|
|
Index | csm_node_pkey on (node_name)
ix_csm_node_a on (node_name, ready)
|
|
Functions | fn_csm_node_ready
fn_csm_node_update
fn_csm_node_delete
|
|
Triggers | tr_csm_node_ready on (csm_node)
tr_csm_node_update
|
update/delete
update/delete
|
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_allocation_node | csm_allocation_node_node_name_fkey | node_name | (FK) |
csm_dimm | csm_dimm_node_name_fkey | node_name | (FK) |
csm_gpu | csm_gpu_node_name_fkey | node_name | (FK) |
csm_hca | csm_hca_node_name_fkey | node_name | (FK) |
csm_processor | csm_processor_node_name_fkey | node_name | (FK) |
csm_ssd | csm_ssd_node_name_fkey | node_name | (FK) |
csm_node_history¶
- Description
- This table contains the historical information related to node attributes.
Table | Overview | Action On: |
---|---|---|
Usage | Low (When hardware changes and to query
historical information)
|
|
Size | 5000+ rows (Based on hardware changes)
|
|
Index | ix_csm_node_history_a on (history_time)
ix_csm_node_history_b on (node_name)
ix_csm_node_history_c on (ctid)
ix_csm_node_history_d on (archive_history_time)
|
csm_node_ready_history¶
- Description
- This table contains historical information related to the node ready status. This table will be updated each time the node ready status changes.
Table | Overview | Action On: |
---|---|---|
Usage | Med-High
|
|
Size | (Based on how often a node ready status changes)
|
|
Index | ix_csm_node_ready_history_a on (history_time)
ix_csm_node_ready_history_b on (node_name, ready)
ix_csm_node_ready_history_c on (ctid)
ix_csm_node_ready_history_d on (archive_history_time)
|
csm_processor_socket¶
- Description
- This table contains information on the processors of a node.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 25,000+ rows (Witherspoon will consist of 256 processors per node, based on 5000 nodes)
|
|
Key(s) | PK: serial_number, node_name
FK: csm_node (node_name)
|
|
Index | csm_processor_pkey on (serial_number, node_name)
|
|
Functions | fn_csm_processor_history_dump
|
|
Triggers | tr_csm_processor_history_dump
|
update/delete
|
csm_processor_socket_history¶
- Description
- This table contains historical information associated with individual processors.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 25,000+ rows (Based on how often a processor
is changed or its failure rate)
|
|
Index | ix_csm_processor_history_a on (history_time)
ix_csm_processor_history_b on (serial_number, node_name)
ix_csm_processor_history_c on (ctid)
ix_csm_processor_history_d on (archive_history_time)
|
csm_gpu¶
- Description
- This table contains information on the GPUs on the node.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 30,000+ rows (max 6 per node; if there are 5000 nodes, then 30,000 GPUs on Witherspoons)
|
|
Key(s) | PK: node_name, gpu_id
FK: csm_node (node_name)
|
|
Index | csm_gpu_pkey on (node_name, gpu_id)
|
|
Functions | fn_csm_gpu_history_dump
|
|
Triggers | tr_csm_gpu_history_dump
|
update/delete
|
csm_gpu_history¶
- Description
- This table contains historical information associated with individual GPUs. The GPU will be recorded and also be timestamped.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | (based on how often changed)
|
|
Index | ix_csm_gpu_history_a on (history_time)
ix_csm_gpu_history_b on (serial_number)
ix_csm_gpu_history_c on (node_name, gpu_id)
ix_csm_gpu_history_d on (ctid)
ix_csm_gpu_history_e on (archive_history_time)
|
csm_ssd¶
- Description
- This table contains information on the SSDs on the system. This table contains the current status of the SSD along with its capacity and wear.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 1-5000 rows (one per node)
|
|
Key(s) | PK: serial_number
FK: csm_node (node_name)
|
|
Index | csm_ssd_pkey on (serial_number)
ix_csm_ssd_a on (serial_number, node_name)
|
|
Functions | fn_csm_ssd_history_dump
|
|
Triggers | tr_csm_ssd_history_dump
|
update/delete
|
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_vg_ssd | csm_vg_ssd_serial_number_fkey | serial_number, node_name | (FK) |
csm_ssd_history¶
- Description
- This table contains historical information associated with individual SSDs.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 5000+ rows
|
|
Index | ix_csm_ssd_history_a on (history_time)
ix_csm_ssd_history_b on (serial_number, node_name)
ix_csm_ssd_history_c on (ctid)
ix_csm_ssd_history_d on (archive_history_time)
|
csm_ssd_wear_history¶
- Description
- This table contains historical information on the SSD wear known to the system.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 5000+ rows
|
|
Index | ix_csm_ssd_wear_history_a on (history_time)
ix_csm_ssd_wear_history_b on (serial_number, node_name)
ix_csm_ssd_wear_history_c on (ctid)
ix_csm_ssd_wear_history_d on (archive_history_time)
|
csm_hca¶
- Description
- This table contains information about the HCAs (Host Channel Adapters). Each HCA has a unique identifier (serial number). The table has a status indicator, a board ID (for the IB adapter), and the InfiniBand globally unique identifier (GUID).
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1-10K – 1 or 2 per node
|
|
Key(s) | PK: serial_number
FK: csm_node (node_name)
|
|
Index | csm_hca_pkey on (serial_number)
|
|
Functions | fn_csm_hca_history_dump
|
|
Triggers | tr_csm_hca_history_dump
|
update/delete
|
csm_hca_history¶
- Description
- This table contains historical information associated with the HCA (Host Channel Adapters).
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | (Based on how many are changed out)
|
|
Index | ix_csm_hca_history_a on (history_time)
ix_csm_hca_history_b on (node_name, serial_number)
ix_csm_hca_history_c on (ctid)
ix_csm_hca_history_d on (archive_history_time)
|
csm_dimm¶
- Description
- This table contains information related to the DIMM (Dual In-Line Memory Module) attributes.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1-80K+ (16 DIMMs per node)
|
|
Key(s) | PK: serial_number
FK: csm_node (node_name)
|
|
Index | csm_dimm_pkey on (serial_number)
|
|
Functions | fn_csm_dimm_history_dump
|
|
Triggers | tr_csm_dimm_history_dump
|
update/delete
|
csm_dimm_history¶
- Description
- This table contains historical information related to the DIMM “Dual In-Line Memory Module” attributes.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | (Based on how many are changed out)
|
|
Index | ix_csm_dimm_history_a on (history_time)
ix_csm_dimm_history_b on (node_name, serial_number)
ix_csm_dimm_history_c on (ctid)
ix_csm_dimm_history_d on (archive_history_time)
|
Allocation tables¶
csm_allocation¶
- Description
- This table contains the information about the system’s current allocations. Specific attributes include: primary job ID, secondary job ID, user and system flags, number of nodes, state, username, start time stamp, power cap, power shifting ratio, authorization token, account, comments, eligible, job name, reservation, Wall clock time reservation, job_submit_time, queue, time_limit, WC Key, type.
Table | Overview | Action On: |
---|---|---|
Usage | High (Every time allocated and allocation query)
|
|
Size | 1-5000 rows (1 allocation per node (5000 max per 1 node))
|
|
Key(s) | PK: allocation_id
|
|
Index | csm_allocation_pkey on (allocation_id)
|
|
Functions | fn_csm_allocation_history_dump
fn_csm_allocation_state_history_state_change
fn_csm_allocation_update
|
insert/update/delete (API call)
|
Triggers | tr_csm_allocation_state_change
tr_csm_allocation_update
|
delete
update
|
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_allocation_node | csm_allocation_node_allocation_id_fkey | allocation_id | (FK) |
csm_step | csm_step_allocation_id_fkey | allocation_id | (FK) |
csm_allocation_history¶
- Description
- This table contains information about allocations that are no longer current on the system; essentially, this is the historical information about allocations. This table will increase in size based only on how many allocations are deployed over the life cycle of the machine/system. This table can also be used to determine the total energy consumed per allocation (filled in during the free of the allocation).
Table | Overview | Action On: |
---|---|---|
Usage | High
|
|
Size | (Depending on customers work load (100,000+ rows))
|
|
Index | ix_csm_allocation_history_a on (history_time)
ix_csm_allocation_history_b on (allocation_id)
ix_csm_allocation_history_c on (ctid)
ix_csm_allocation_history_d on (archive_history_time)
|
Step tables¶
csm_step¶
- Description
- This table contains information on active steps within the CSM database. Featured attributes include: step id, allocation id, begin time, state, executable, working directory, arguments, environment variables, sequence ID, number of nodes, number of processes (that can run on each compute node), number of GPUs, amount of memory, number of tasks, user flags, system flags, and launch node name.
Table | Overview | Action On: |
---|---|---|
Usage | High
|
|
Size | 5000+ rows (depending on the steps)
|
|
Key(s) | PK: step_id, allocation_id
FK: csm_allocation (allocation_id)
|
|
Index | csm_step_pkey on (step_id, allocation_id)
uk_csm_step_a on (step_id, allocation_id)
|
|
Functions | fn_csm_step_history_dump
|
insert/update/delete (API call)
|
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_step_node | csm_step_node_step_id_fkey | step_id | (FK) |
csm_step_history¶
- Description
- This table contains the information for steps that have terminated. There is some additional information from the initial step that has been added to the history table. These attributes include: end time, compute nodes, level gpu usage, exit status, error text, network band width, cpu stats, total U time, total S time, total number of threads, gpu stats, memory stats, max memory, max swap, ios stats.
Table | Overview | Action On: |
---|---|---|
Usage | High
|
|
Size | Millions of rows (depending on the customer’s work load)
|
|
Index | ix_csm_step_history_a on (history_time)
ix_csm_step_history_b on (begin_time, end_time)
ix_csm_step_history_c on (allocation_id, end_time)
ix_csm_step_history_d on (end_time)
ix_csm_step_history_e on (step_id)
ix_csm_step_history_f on (ctid)
ix_csm_step_history_g on (archive_history_time)
|
Allocation node, allocation state history, step node tables¶
csm_allocation_node¶
- Description
- This table maps current allocations to the compute nodes that make up the allocation. This information is later used when populating the csm_allocation_history table.
Table | Overview | Action On: |
---|---|---|
Usage | High
|
|
Size | 1-5000 rows
|
|
Key(s) | FK: csm_node (node_name)
FK: csm_allocation (allocation_id)
|
|
Index | ix_csm_allocation_node_a on (allocation_id)
uk_csm_allocation_node_b on (allocation_id, node_name)
|
insert (API call)
|
Functions | fn_csm_allocation_node_sharing_status
fn_csm_allocation_node_change
|
|
Triggers | tr_csm_allocation_node_change
|
update
|
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_lv | csm_lv_allocation_id_fkey | allocation_id, node_name | (FK) |
csm_step_node | csm_step_node_allocation_id_fkey | allocation_id, node_name | (FK) |
csm_allocation_node_history¶
- Description
- This table maps history allocations to the compute nodes that make up the allocation.
Table | Overview | Action On: |
---|---|---|
Usage | High
|
|
Size | 1-5000 rows
|
|
Index | ix_csm_allocation_node_history_a on (history_time)
ix_csm_allocation_node_history_b on (allocation_id)
ix_csm_allocation_node_history_c on (ctid)
ix_csm_allocation_node_history_d on (archive_history_time)
|
csm_allocation_state_history¶
- Description
- This table contains the state history of active allocations. A timestamp of when the information entered the table is recorded along with a state indicator.
Table | Overview | Action On: |
---|---|---|
Usage | High
|
|
Size | 1-5000 rows (one per allocation)
|
|
Index | ix_csm_allocation_state_history_a on (history_time)
ix_csm_allocation_state_history_b on (allocation_id)
ix_csm_allocation_state_history_c on (ctid)
ix_csm_allocation_state_history_d on (archive_history_time)
|
csm_step_node¶
- Description
- This table maps active allocations to jobs steps and nodes.
Table | Overview | Action On: |
---|---|---|
Usage | High
|
|
Size | 5000+ rows (based on steps)
|
|
Key(s) | FK: csm_step (step_id, allocation_id)
FK: csm_allocation (allocation_id, node_name)
|
|
Index | uk_csm_step_node_a on (step_id, allocation_id, node_name)
ix_csm_step_node_b on (allocation_id)
ix_csm_step_node_c on (allocation_id, step_id)
|
|
Functions | fn_csm_step_node_history_dump
|
|
Triggers | tr_csm_step_node_history_dump
|
delete
|
csm_step_node_history¶
- Description
- This table maps historical allocations to jobs steps and nodes.
Table | Overview | Action On: |
---|---|---|
Usage | High
|
|
Size | 5000+ rows (based on steps)
|
|
Index | ix_csm_step_node_history_a on (history_time)
ix_csm_step_node_history_b on (allocation_id)
ix_csm_step_node_history_c on (allocation_id, step_id)
ix_csm_step_node_history_d on (ctid)
ix_csm_step_node_history_e on (archive_history_time)
|
RAS tables¶
csm_ras_type¶
- Description
- This table contains the description and details for each of the possible RAS event types. Specific attributes in this table include: msg_id, severity, message, description, control_action, threshold_count, threshold_period, enabled, set_not_ready, set_ready, visible_to_users.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1000+ rows (depending on the different RAS types)
|
|
Key(s) | PK: msg_id
|
|
Index | csm_ras_type_pkey on (msg_id)
|
|
Functions | fn_csm_ras_type_update
|
|
Triggers | tr_csm_ras_type_update
|
insert/update/delete
|
csm_ras_type_audit¶
- Description
- This table contains historical descriptions and details for each of the possible RAS event types. Specific attributes in this table include: msg_id_seq, operation, change_time, msg_id, severity, message, description, control_action, threshold_count, threshold_period, enabled, set_not_ready, set_ready, visible_to_users.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1000+ rows (depending on the different RAS types)
|
|
Key(s) | PK: msg_id_seq
|
|
Index | csm_ras_type_audit_pkey on (msg_id_seq)
|
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_ras_event_action | csm_ras_event_action_msg_id_seq_fkey | msg_id_seq | (FK) |
csm_ras_event_action¶
- Description
- This table contains all RAS events. Key attributes that are a part of this table include: rec id, msg id, msg_id_seq, timestamp, count, message, and raw data. This table will accumulate an enormous number of records due to the continuous event cycle, so a solution needs to be in place to accommodate the massive amount of data produced.
Table | Overview | Action On: |
---|---|---|
Usage | High
|
|
Size | Millions of rows
|
|
Key(s) | PK: rec_id
FK: csm_ras_type (msg_id_seq)
|
|
Index | csm_ras_event_action_pkey on (rec_id)
ix_csm_ras_event_action_a on (msg_id)
ix_csm_ras_event_action_b on (time_stamp)
ix_csm_ras_event_action_c on (location_name)
ix_csm_ras_event_action_d on (time_stamp, msg_id)
ix_csm_ras_event_action_e on (time_stamp, location_name)
ix_csm_ras_event_action_f on (master_time_stamp)
ix_csm_ras_event_action_g on (ctid)
ix_csm_ras_event_action_h on (archive_history_time)
|
CSM diagnostic tables¶
csm_diag_run¶
- Description
- This table contains information about each of the diagnostic runs. Specific attributes including: run id, allocation_id, begin time, status, inserted RAS, log directory, and command line.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1000+ rows
|
|
Key(s) | PK: run_id
|
|
Index | csm_diag_run_pkey on (run_id)
|
|
Functions | fn_csm_diag_run_history_dump
|
insert/update/delete (API call)
|
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_diag_result | csm_diag_result_run_id_fkey | run_id | (FK) |
csm_diag_run_history¶
- Description
- This table contains historical information about each of the diagnostic runs. Specific attributes including: run id, allocation_id, begin time, end_time, status, inserted RAS, log directory, and command line.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1000+ rows
|
|
Index | ix_csm_diag_run_history_a on (history_time)
ix_csm_diag_run_history_b on (run_id)
ix_csm_diag_run_history_c on (allocation_id)
ix_csm_diag_run_history_d on (ctid)
ix_csm_diag_run_history_e on (archive_history_time)
|
csm_diag_result¶
- Description
- This table contains the results of a specific instance of a diagnostic.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1000+ rows
|
|
Key(s) | FK: csm_diag_run (run_id)
|
|
Index | ix_csm_diag_result_a on (run_id, test_case, node_name)
|
|
Functions | fn_csm_diag_result_history_dump
|
|
Triggers | tr_csm_diag_result_history_dump
|
delete
|
csm_diag_result_history¶
- Description
- This table contains historical results of a specific instance of a diagnostic.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1000+ rows
|
|
Index | ix_csm_diag_result_history_a on (history_time)
ix_csm_diag_result_history_b on (run_id)
ix_csm_diag_result_history_c on (ctid)
ix_csm_diag_result_history_d on (archive_history_time)
|
SSD partition and SSD logical volume tables¶
csm_lv¶
- Description
- This table contains information about the logical volumes that are created within the compute nodes.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 5000+ rows (depending on SSD usage)
|
|
Key(s) | PK: logical_volume_name, node_name
FK: csm_allocation (allocation_id)
FK: csm_vg (node_name, vg_name)
|
|
Index | csm_lv_pkey on (logical_volume_name, node_name)
ix_csm_lv_a on (logical_volume_name)
|
|
Functions | fn_csm_lv_history_dump
fn_csm_lv_modified_history_dump
fn_csm_lv_update_history_dump
|
insert/update/delete (API call)
|
Triggers | tr_csm_lv_modified_history_dump
tr_csm_lv_update_history_dump
|
update
update
|
csm_lv_history¶
- Description
- This table contains historical information associated with previously active logical volumes.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 5000+ rows (depending on step usage)
|
|
Index | ix_csm_lv_history_a on (history_time)
ix_csm_lv_history_b on (logical_volume_name)
ix_csm_lv_history_c on (ctid)
ix_csm_lv_history_d on (archive_history_time)
|
csm_lv_update_history¶
- Description
- This table contains historical information associated with lv updates.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 5000+ rows (depending on step usage)
|
|
Index | ix_csm_lv_update_history_a on (history_time)
ix_csm_lv_update_history_b on (logical_volume_name)
ix_csm_lv_update_history_c on (ctid)
ix_csm_lv_update_history_d on (archive_history_time)
|
csm_vg_ssd¶
- Description
- This table contains information that references both the SSD and logical volume tables.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 5000+ rows (depending on SSD usage)
|
|
Key(s) | FK: csm_ssd (serial_number, node_name)
|
|
Index | csm_vg_ssd_pkey on (vg_name, node_name, serial_number)
ix_csm_vg_ssd_a on (vg_name, node_name, serial_number)
uk_csm_vg_ssd_a on (vg_name, node_name)
|
|
Functions | fn_csm_vg_ssd_history_dump
|
|
Triggers | tr_csm_vg_ssd_history_dump
|
update/delete
|
csm_vg_ssd_history¶
- Description
- This table contains historical information associated with SSD and logical volume tables.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 5000+ rows (depending on step usage)
|
|
Index | ix_csm_vg_ssd_history_a on (history_time)
ix_csm_vg_ssd_history_b on (vg_name, node_name)
ix_csm_vg_ssd_history_c on (ctid)
ix_csm_vg_ssd_history_d on (archive_history_time)
|
csm_vg¶
- Description
- This table contains information that references both the SSD and logical volume tables.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 5000+ rows (depending on step usage)
|
|
Key(s) | PK: vg_name, node_name
FK: csm_node (node_name)
|
|
Index | csm_vg_pkey on (vg_name, node_name)
|
|
Functions | fn_csm_vg_history_dump
|
|
Triggers | tr_csm_vg_history_dump
|
update/delete
|
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_lv | csm_lv_node_name_fkey | node_name, vg_name | (FK) |
csm_vg_history¶
- Description
- This table contains historical information associated with SSD and logical volume tables.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 5000+ rows (depending on step usage)
|
|
Index | ix_csm_vg_history_a on (history_time)
ix_csm_vg_history_b on (vg_name, node_name)
ix_csm_vg_history_c on (ctid)
ix_csm_vg_history_d on (archive_history_time)
|
Switch & ib cable tables¶
csm_switch¶
- Description
- This table contains information about the switch and its attributes, including: switch_name, discovery_time, collection_time, comment, description, fw_version, gu_id, has_ufm_agent, ip, model, num_modles, num_ports, physical_frame_location, physical_u_location, ps_id, role, server_operation_mode, sm_version, system_guid, system_name, total_alarms, type, and vendor.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 500 rows (Switches on a CORAL system)
|
|
Key(s) | PK: switch_name
|
|
Index | csm_switch_pkey on (switch_name)
|
|
Functions | fn_csm_switch_history_dump
|
|
Triggers | tr_csm_switch_history_dump
|
update/delete
|
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_switch_inventory | csm_switch_inventory_host_system_guid_fkey | host_system_guid | (FK) |
csm_switch_history¶
- Description
- This table contains historical information associated with individual switches.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | (Based on failure rate/ or how often changed out)
|
|
Index | ix_csm_switch_history_a on (history_time)
ix_csm_switch_history_b on (serial_number, history_time)
ix_csm_switch_history_c on (ctid)
ix_csm_switch_history_d on (archive_history_time)
|
csm_ib_cable¶
- Description
- This table contains information about the InfiniBand cables, including: serial_number, discovery_time, collection_time, comment, guid_s1, guid_s2, identifier, length, name, part_number, port_s1, port_s2, revision, severity, type, and width.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 25,000+ rows (Based on switch topology and
or configuration)
|
|
Key(s) | PK: serial_number
|
|
Index | csm_ib_cable_pkey on (serial_number)
|
|
Functions | fn_csm_ib_cable_history_dump
|
|
Triggers | tr_csm_ib_cable_history_dump
|
update/delete
|
csm_ib_cable_history¶
- Description
- This table contains historical information about the InfiniBand cables.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 25,000+ rows (Based on switch topology and
or configuration)
|
|
Index | ix_csm_ib_cable_history_a on (history_time)
ix_csm_ib_cable_history_b on (serial_number)
ix_csm_ib_cable_history_c on (ctid)
ix_csm_ib_cable_history_d on (archive_history_time)
|
csm_switch_inventory¶
- Description
- This table contains information about the switch inventory, including: name, host_system_guid, discovery_time, collection_time, comment, description, device_name, device_type, max_ib_ports, module_index, number_of_chips, path, serial_number, severity, and status.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 25,000+ rows (Based on switch topology and
or configuration)
|
|
Key(s) | PK: name
FK: csm_switch (switch_name)
|
|
Index | csm_switch_inventory_pkey on (name)
|
|
Functions | fn_csm_switch_inventory_history_dump
|
|
Triggers | tr_csm_switch_inventory_history_dump
|
update/delete
|
csm_switch_inventory_history¶
- Description
- This table contains historical information about the switch inventory.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 25,000+ rows (Based on switch topology and/or configuration)
|
|
Index | ix_csm_switch_inventory_history_a on (history_time)
ix_csm_switch_inventory_history_b on (name)
ix_csm_switch_inventory_history_c on (ctid)
ix_csm_switch_inventory_history_d on (archive_history_time)
|
CSM configuration tables¶
csm_config¶
- Description
- This table contains information about the CSM configuration.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 1 row (Based on configuration changes)
|
|
Key(s) | PK: config_id
|
|
Index | csm_config_pkey on (csm_config_id)
|
|
Functions | fn_csm_config_history_dump
|
|
Triggers | tr_csm_config_history_dump
|
update/delete
|
csm_config_history¶
- Description
- This table contains historical information about the CSM configuration.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 1-100 rows
|
|
Index | ix_csm_config_history_a on (history_time)
ix_csm_config_history_b on (csm_config_id)
ix_csm_config_history_c on (ctid)
ix_csm_config_history_d on (archive_history_time)
|
csm_config_bucket¶
- Description
- This table is the list of items that will be placed in the bucket. Some of the attributes include: bucket id, item lists, execution interval, and time stamp.
Table | Overview | Action On: |
---|---|---|
Usage | Medium
|
|
Size | 1-400 rows (Based on configuration changes)
|
|
Index | ix_csm_config_bucket_a on
(bucket_id, item_list, time_stamp)
|
CSM DB schema version tables¶
csm_db_schema_version¶
- Description
- This is the current database schema version when loaded.
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1-100 rows (Based on CSM DB changes)
|
|
Key(s) | PK: version
|
|
Index | csm_db_schema_version_pkey on (version)
ix_csm_db_schema_version_a on (version, create_time)
|
|
Functions | fn_csm_db_schema_version_history_dump
|
|
Triggers | tr_csm_db_schema_version_history_dump
|
update/delete
|
csm_db_schema_version_history¶
- Description
- This is the historical database schema version (if changes have been made)
Table | Overview | Action On: |
---|---|---|
Usage | Low
|
|
Size | 1-100 rows (Based on CSM DB changes/updates)
|
|
Index | ix_csm_db_schema_version_history_a on (history_time)
ix_csm_db_schema_version_history_b on (version)
ix_csm_db_schema_version_history_c on (ctid)
ix_csm_db_schema_version_history_d on (archive_history_time)
|
PK, FK, UK keys and Index Charts¶
Primary Keys (default Indexes)¶
Name | Table | Index on | Description |
---|---|---|---|
csm_allocation_pkey | csm_allocation | pkey index on | allocation_id |
csm_config_pkey | csm_config | pkey index on | csm_config_id |
csm_db_schema_version_pkey | csm_db_schema_version | pkey index on | version |
csm_diag_run_pkey | csm_diag_run | pkey index on | run_id |
csm_dimm_pkey | csm_dimm | pkey index on | serial_number |
csm_gpu_pkey | csm_gpu | pkey index on | node_name, gpu_id |
csm_hca_pkey | csm_hca | pkey index on | serial_number |
csm_ib_cable_pkey | csm_ib_cable | pkey index on | serial_number |
csm_lv_pkey | csm_lv | pkey index on | logical_volume_name, node_name |
csm_node_pkey | csm_node | pkey index on | node_name |
csm_processor_socket_pkey | csm_processor_socket | pkey index on | serial_number |
csm_ras_event_action_pkey | csm_ras_event_action | pkey index on | rec_id |
csm_ras_type_audit_pkey | csm_ras_type_audit | pkey index on | msg_id_seq |
csm_ras_type_pkey | csm_ras_type | pkey index on | msg_id |
csm_ssd_pkey | csm_ssd | pkey index on | serial_number, node_name |
csm_step_pkey | csm_step | pkey index on | step_id, allocation_id |
csm_switch_inventory_pkey | csm_switch_inventory | pkey index on | name |
csm_switch_pkey | csm_switch | pkey index on | switch_name |
csm_vg_pkey | csm_vg | pkey index on | vg_name, node_name |
Foreign Keys¶
Name | From Table | From Cols | To Table | To Cols |
---|---|---|---|---|
csm_allocation_node_allocation_id_fkey | csm_allocation_node | allocation_id | csm_allocation | allocation_id |
csm_allocation_node_node_name_fkey | csm_allocation_node | node_name | csm_node | node_name |
csm_diag_result_run_id_fkey | csm_diag_result | run_id | csm_diag_run | run_id |
csm_dimm_node_name_fkey | csm_dimm | node_name | csm_node | node_name |
csm_gpu_node_name_fkey | csm_gpu | node_name | csm_node | node_name |
csm_hca_node_name_fkey | csm_hca | node_name | csm_node | node_name |
csm_lv_allocation_id_fkey | csm_lv | allocation_id, node_name | csm_allocation_node | allocation_id, node_name |
csm_lv_node_name_fkey | csm_lv | node_name, vg_name | csm_vg | node_name, vg_name |
csm_processor_node_name_fkey | csm_processor | node_name | csm_node | node_name |
csm_ras_event_action_msg_id_seq_fkey | csm_ras_event_action | msg_id_seq | csm_ras_type_audit | msg_id_seq |
csm_ssd_node_name_fkey | csm_ssd | node_name | csm_node | node_name |
csm_step_allocation_id_fkey | csm_step | allocation_id | csm_allocation | allocation_id |
csm_step_node_allocation_id_fkey | csm_step_node | allocation_id, node_name | csm_allocation_node | allocation_id, node_name |
csm_step_node_step_id_fkey | csm_step_node | step_id, allocation_id | csm_step | step_id, allocation_id |
csm_switch_inventory_host_system_guid_fkey | csm_switch_inventory | host_system_guid | csm_switch | switch_name |
csm_switch_ports_parent_fkey | csm_switch_ports | parent | csm_switch | switch_name |
csm_vg_ssd_serial_number_fkey | csm_vg_ssd | serial_number, node_name | csm_ssd | serial_number, node_name |
csm_vg_vg_name_fkey | csm_vg | vg_name, node_name | csm_vg_ssd | vg_name, node_name |
Indexes¶
Name | Table | Index on | Description |
---|---|---|---|
ix_csm_allocation_history_a | csm_allocation_history | index on | history_time |
ix_csm_allocation_history_b | csm_allocation_history | index on | allocation_id |
ix_csm_allocation_history_c | csm_allocation_history | index on | ctid |
ix_csm_allocation_history_d | csm_allocation_history | index on | archive_history_time |
ix_csm_allocation_node_a | csm_allocation_node | index on | allocation_id |
ix_csm_allocation_node_history_a | csm_allocation_node_history | index on | history_time |
ix_csm_allocation_node_history_b | csm_allocation_node_history | index on | allocation_id |
ix_csm_allocation_node_history_c | csm_allocation_node_history | index on | ctid |
ix_csm_allocation_node_history_d | csm_allocation_node_history | index on | archive_history_time |
ix_csm_allocation_state_history_a | csm_allocation_state_history | index on | history_time |
ix_csm_allocation_state_history_b | csm_allocation_state_history | index on | allocation_id |
ix_csm_allocation_state_history_c | csm_allocation_state_history | index on | ctid |
ix_csm_allocation_state_history_d | csm_allocation_state_history | index on | archive_history_time |
ix_csm_config_bucket_a | csm_config_bucket | index on | bucket_id, item_list, time_stamp |
ix_csm_config_history_a | csm_config_history | index on | history_time |
ix_csm_config_history_b | csm_config_history | index on | csm_config_id |
ix_csm_config_history_c | csm_config_history | index on | ctid |
ix_csm_config_history_d | csm_config_history | index on | archive_history_time |
ix_csm_db_schema_version_a | csm_db_schema_version | index on | version, create_time |
ix_csm_db_schema_version_history_a | csm_db_schema_version_history | index on | history_time |
ix_csm_db_schema_version_history_b | csm_db_schema_version_history | index on | version |
ix_csm_db_schema_version_history_c | csm_db_schema_version_history | index on | ctid |
ix_csm_db_schema_version_history_d | csm_db_schema_version_history | index on | archive_history_time |
ix_csm_diag_result_a | csm_diag_result | index on | run_id, test_name, node_name |
ix_csm_diag_result_history_a | csm_diag_result_history | index on | history_time |
ix_csm_diag_result_history_b | csm_diag_result_history | index on | run_id |
ix_csm_diag_result_history_c | csm_diag_result_history | index on | ctid |
ix_csm_diag_result_history_d | csm_diag_result_history | index on | archive_history_time |
ix_csm_diag_run_history_a | csm_diag_run_history | index on | history_time |
ix_csm_diag_run_history_b | csm_diag_run_history | index on | run_id |
ix_csm_diag_run_history_c | csm_diag_run_history | index on | allocation_id |
ix_csm_diag_run_history_d | csm_diag_run_history | index on | ctid |
ix_csm_diag_run_history_e | csm_diag_run_history | index on | archive_history_time |
ix_csm_dimm_history_a | csm_dimm_history | index on | history_time |
ix_csm_dimm_history_b | csm_dimm_history | index on | node_name, serial_number |
ix_csm_dimm_history_c | csm_dimm_history | index on | ctid |
ix_csm_dimm_history_d | csm_dimm_history | index on | archive_history_time |
ix_csm_gpu_history_a | csm_gpu_history | index on | history_time |
ix_csm_gpu_history_b | csm_gpu_history | index on | serial_number |
ix_csm_gpu_history_c | csm_gpu_history | index on | node_name, gpu_id |
ix_csm_gpu_history_d | csm_gpu_history | index on | ctid |
ix_csm_gpu_history_e | csm_gpu_history | index on | archive_history_time |
ix_csm_hca_history_a | csm_hca_history | index on | history_time |
ix_csm_hca_history_b | csm_hca_history | index on | node_name, serial_number |
ix_csm_hca_history_c | csm_hca_history | index on | ctid |
ix_csm_hca_history_d | csm_hca_history | index on | archive_history_time |
ix_csm_ib_cable_history_a | csm_ib_cable_history | index on | history_time |
ix_csm_ib_cable_history_b | csm_ib_cable_history | index on | serial_number |
ix_csm_ib_cable_history_c | csm_ib_cable_history | index on | ctid |
ix_csm_ib_cable_history_d | csm_ib_cable_history | index on | archive_history_time |
ix_csm_lv_a | csm_lv | index on | logical_volume_name |
ix_csm_lv_history_a | csm_lv_history | index on | history_time |
ix_csm_lv_history_b | csm_lv_history | index on | logical_volume_name |
ix_csm_lv_history_c | csm_lv_history | index on | ctid |
ix_csm_lv_history_d | csm_lv_history | index on | archive_history_time |
ix_csm_lv_update_history_a | csm_lv_update_history | index on | history_time |
ix_csm_lv_update_history_b | csm_lv_update_history | index on | logical_volume_name |
ix_csm_lv_update_history_c | csm_lv_update_history | index on | ctid |
ix_csm_lv_update_history_d | csm_lv_update_history | index on | archive_history_time |
ix_csm_node_a | csm_node | index on | node_name, ready |
ix_csm_node_history_a | csm_node_history | index on | history_time |
ix_csm_node_history_b | csm_node_history | index on | node_name |
ix_csm_node_history_c | csm_node_history | index on | ctid |
ix_csm_node_history_d | csm_node_history | index on | archive_history_time |
ix_csm_node_state_history_a | csm_node_state_history | index on | history_time |
ix_csm_node_state_history_b | csm_node_state_history | index on | node_name, state |
ix_csm_node_state_history_c | csm_node_state_history | index on | ctid |
ix_csm_node_state_history_d | csm_node_state_history | index on | archive_history_time |
ix_csm_processor_socket_history_a | csm_processor_socket_history | index on | history_time |
ix_csm_processor_socket_history_b | csm_processor_socket_history | index on | serial_number, node_name |
ix_csm_processor_socket_history_c | csm_processor_socket_history | index on | ctid |
ix_csm_processor_socket_history_d | csm_processor_socket_history | index on | archive_history_time |
ix_csm_ras_event_action_a | csm_ras_event_action | index on | msg_id |
ix_csm_ras_event_action_b | csm_ras_event_action | index on | time_stamp |
ix_csm_ras_event_action_c | csm_ras_event_action | index on | location_name |
ix_csm_ras_event_action_d | csm_ras_event_action | index on | time_stamp, msg_id |
ix_csm_ras_event_action_e | csm_ras_event_action | index on | time_stamp, location_name |
ix_csm_ras_event_action_f | csm_ras_event_action | index on | master_time_stamp |
ix_csm_ras_event_action_g | csm_ras_event_action | index on | ctid |
ix_csm_ras_event_action_h | csm_ras_event_action | index on | archive_history_time |
ix_csm_ssd_history_a | csm_ssd_history | index on | history_time |
ix_csm_ssd_history_b | csm_ssd_history | index on | serial_number, node_name |
ix_csm_ssd_history_c | csm_ssd_history | index on | ctid |
ix_csm_ssd_history_d | csm_ssd_history | index on | archive_history_time |
ix_csm_ssd_wear_history_a | csm_ssd_wear_history | index on | history_time |
ix_csm_ssd_wear_history_b | csm_ssd_wear_history | index on | serial_number, node_name |
ix_csm_ssd_wear_history_c | csm_ssd_wear_history | index on | ctid |
ix_csm_ssd_wear_history_d | csm_ssd_wear_history | index on | archive_history_time |
ix_csm_step_history_a | csm_step_history | index on | history_time |
ix_csm_step_history_b | csm_step_history | index on | begin_time, end_time |
ix_csm_step_history_c | csm_step_history | index on | allocation_id, end_time |
ix_csm_step_history_d | csm_step_history | index on | end_time |
ix_csm_step_history_e | csm_step_history | index on | step_id |
ix_csm_step_history_f | csm_step_history | index on | ctid |
ix_csm_step_history_g | csm_step_history | index on | archive_history_time |
ix_csm_step_node_b | csm_step_node | index on | allocation_id |
ix_csm_step_node_c | csm_step_node | index on | allocation_id, step_id |
ix_csm_step_node_history_a | csm_step_node_history | index on | history_time |
ix_csm_step_node_history_b | csm_step_node_history | index on | allocation_id |
ix_csm_step_node_history_c | csm_step_node_history | index on | allocation_id, step_id |
ix_csm_step_node_history_d | csm_step_node_history | index on | ctid |
ix_csm_step_node_history_e | csm_step_node_history | index on | archive_history_time |
ix_csm_switch_history_a | csm_switch_history | index on | history_time |
ix_csm_switch_history_b | csm_switch_history | index on | switch_name, history_time |
ix_csm_switch_history_c | csm_switch_history | index on | ctid |
ix_csm_switch_history_d | csm_switch_history | index on | archive_history_time |
ix_csm_switch_inventory_history_a | csm_switch_inventory_history | index on | history_time |
ix_csm_switch_inventory_history_b | csm_switch_inventory_history | index on | name |
ix_csm_switch_inventory_history_c | csm_switch_inventory_history | index on | ctid |
ix_csm_switch_inventory_history_d | csm_switch_inventory_history | index on | archive_history_time |
ix_csm_vg_history_a | csm_vg_history | index on | history_time |
ix_csm_vg_history_b | csm_vg_history | index on | vg_name, node_name |
ix_csm_vg_history_c | csm_vg_history | index on | ctid |
ix_csm_vg_history_d | csm_vg_history | index on | archive_history_time |
ix_csm_vg_ssd_history_a | csm_vg_ssd_history | index on | history_time |
ix_csm_vg_ssd_history_b | csm_vg_ssd_history | index on | vg_name, node_name |
ix_csm_vg_ssd_history_c | csm_vg_ssd_history | index on | ctid |
ix_csm_vg_ssd_history_d | csm_vg_ssd_history | index on | archive_history_time |
Unique Indexes¶
Name | Table | Index on | Description |
---|---|---|---|
uk_csm_allocation_node_b | csm_allocation_node | uniqueness on | allocation_id, node_name |
uk_csm_ssd_a | csm_ssd | uniqueness on | serial_number, node_name |
uk_csm_step_a | csm_step | uniqueness on | step_id, allocation_id |
uk_csm_step_node_a | csm_step_node | uniqueness on | step_id, allocation_id, node_name |
uk_csm_vg_ssd_a | csm_vg_ssd | uniqueness on | vg_name, node_name, serial_number |
Functions and Triggers¶
Function Name | Trigger Name | Table On | Tr Type | Result Data Type | Action On | Argument Data Type | Description |
fn_csm_allocation_create_data_aggregator | (Stored Procedure) | csm_allocation_node | void | i_allocation_id bigint, i_state text, i_node_names text[], i_ib_rx_list bigint[], i_ib_tx_list bigint[], i_gpfs_read_list bigint[], i_gpfs_write_list bigint[], i_energy bigint[], i_power_cap integer[], i_ps_ratio integer[], i_power_cap_hit bigint[], i_gpu_energy bigint[], OUT o_timestamp timestamp without time zone | csm_allocation_node function to populate the data aggregator fields in csm_allocation_node. | ||
fn_csm_allocation_dead_records_on_lv | (Stored Procedure) | csm_allocation_node, csm_lv | void | i_allocation_id bigint | Delete any lvs on an allocation that is being deleted. | ||
fn_csm_allocation_delete_start | (Stored Procedure) | csm_allocation, csm_allocation_node | void | i_allocation_id bigint, i_primary_job_id bigint, i_secondary_job_id integer, i_timeout_time bigint, OUT o_allocation_id bigint, OUT o_primary_job_id bigint, OUT o_secondary_job_id integer, OUT o_user_flags text, OUT o_system_flags text, OUT o_num_nodes integer, OUT o_state text, OUT o_type text, OUT o_isolated_cores integer, OUT o_user_name text, OUT o_nodelist text, OUT o_runtime bigint | Retrieves allocation details for delete and sets the state to deleting. | |
fn_csm_allocation_finish_data_stats | (Stored Procedure) | csm_allocation_node | void | allocationid bigint, i_state text, node_names text[], ib_rx_list bigint[], ib_tx_list bigint[], gpfs_read_list bigint[], gpfs_write_list bigint[], energy_list bigint[], pc_hit_list bigint[], gpu_usage_list bigint[], cpu_usage_list bigint[], mem_max_list bigint[], gpu_energy_list bigint[], OUT o_end_time timestamp without time zone, OUT o_final_state text | csm_allocation function to finalize the data aggregator fields. | ||
fn_csm_allocation_history_dump | (Stored Procedure) | csm_allocation | void | allocationid bigint, endtime timestamp without time zone, exitstatus integer, i_state text, finalize boolean, node_names text[], ib_rx_list bigint[], ib_tx_list bigint[], gpfs_read_list bigint[], gpfs_write_list bigint[], energy_list bigint[], pc_hit_list bigint[], gpu_usage_list bigint[], cpu_usage_list bigint[], mem_max_list bigint[], gpu_energy_list bigint[], OUT o_end_time timestamp without time zone | csm_allocation function to amend summarized column(s) on DELETE. (csm_allocation_history_dump) | ||
fn_csm_allocation_node_change | tr_csm_allocation_node_change | csm_allocation_node | BEFORE | trigger | DELETE | csm_allocation_node trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_allocation_node_sharing_status | (Stored Procedure) | csm_allocation_node | void | i_allocation_id bigint, i_type text, i_state text, i_shared boolean, i_nodenames text[] | csm_allocation_sharing_status function to handle exclusive usage of shared nodes on INSERT. | ||
fn_csm_allocation_revert | (Stored Procedure) | csm_allocation, csm_allocation_state_history | void | allocationid bigint | Removes all traces of an allocation that never multicasted. | ||
fn_csm_allocation_state_history_state_change | tr_csm_allocation_state_change | csm_allocation | BEFORE | trigger | UPDATE | csm_allocation trigger to amend summarized column(s) on UPDATE. | |
fn_csm_allocation_update | tr_csm_allocation_update | csm_allocation | BEFORE | trigger | UPDATE | csm_allocation_update trigger to amend summarized column(s) on UPDATE. | |
fn_csm_allocation_update_state | (Stored Procedure) | csm_allocation,csm_allocation_node | record | i_allocationid bigint, i_state text, OUT o_primary_job_id bigint, OUT o_secondary_job_id integer, OUT o_user_flags text, OUT o_system_flags text, OUT o_num_nodes integer, OUT o_nodes text, OUT o_isolated_cores integer, OUT o_user_name text, OUT o_shared boolean, OUT o_num_gpus integer, OUT o_num_processors integer, OUT o_projected_memory integer, OUT o_state text, OUT o_runtime bigint | csm_allocation_update_state function that ensures the allocation can be legally updated to the supplied state | ||
fn_csm_config_history_dump | tr_csm_config_history_dump | csm_config | BEFORE | trigger | UPDATE, DELETE | csm_config trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_db_schema_version_history_dump | tr_csm_db_schema_version_history_dump | csm_db_schema_version | BEFORE | trigger | UPDATE, DELETE | csm_db_schema_version trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_diag_result_history_dump | tr_csm_diag_result_history_dump | csm_diag_result | BEFORE | trigger | DELETE | csm_diag_result trigger to amend summarized column(s) on DELETE. | |
fn_csm_diag_run_history_dump | (Stored Procedure) | csm_diag_run | void | _run_id bigint, _end_time timestamp with time zone, _status text, _inserted_ras boolean | csm_diag_run function to amend summarized column(s) on UPDATE and DELETE. (csm_diag_run_history_dump) | ||
fn_csm_dimm_history_dump | tr_csm_dimm_history_dump | csm_dimm | BEFORE | trigger | UPDATE, DELETE | csm_dimm trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_gpu_history_dump | tr_csm_gpu_history_dump | csm_gpu | BEFORE | trigger | UPDATE, DELETE | csm_gpu trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_hca_history_dump | tr_csm_hca_history_dump | csm_hca | BEFORE | trigger | UPDATE, DELETE | csm_hca trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_ib_cable_history_dump | tr_csm_ib_cable_history_dump | csm_ib_cable | BEFORE | trigger | UPDATE, DELETE | csm_ib_cable trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_ib_cable_inventory_collection | (Stored Procedure) | csm_ib_cable | record | i_record_count integer, i_serial_number text[], i_comment text[], i_guid_s1 text[], i_guid_s2 text[], i_identifier text[], i_length text[], i_name text[], i_part_number text[], i_port_s1 text[], i_port_s2 text[], i_revision text[], i_severity text[], i_type text[], i_width text[], OUT o_insert_count integer, OUT o_update_count integer, OUT o_delete_count integer | function to INSERT and UPDATE ib cable inventory. | ||
fn_csm_lv_history_dump | (Stored Procedure) | csm_lv | void | i_logical_volume_name text, i_node_name text, i_allocationid bigint, i_updated_time timestamp without time zone, i_end_time timestamp without time zone, i_num_bytes_read bigint, i_num_bytes_written bigint | csm_lv function to amend summarized column(s) on DELETE. (csm_lv_history_dump) | ||
fn_csm_lv_modified_history_dump | tr_csm_lv_modified_history_dump | csm_lv | BEFORE | trigger | UPDATE | csm_lv_modified_history_dump trigger to amend summarized column(s) on UPDATE. | |
fn_csm_lv_update_history_dump | tr_csm_lv_update_history_dump | csm_lv | BEFORE | trigger | UPDATE | csm_lv_update_history_dump trigger to amend summarized column(s) on UPDATE. | |
fn_csm_lv_upsert | (Stored Procedure) | csm_lv | void | l_logical_volume_name text, l_node_name text, l_allocation_id bigint, l_vg_name text, l_state character, l_current_size bigint, l_max_size bigint, l_begin_time timestamp without time zone, l_updated_time timestamp without time zone, l_file_system_mount text, l_file_system_type text | csm_lv_upsert function to amend summarized column(s) on INSERT. (csm_lv table) | ||
fn_csm_node_attributes_query_details | (Stored Procedure) | csm_node,csm_dimm,csm_gpu,csm_hca,csm_processor,csm_ssd | node_details | i_node_name text | csm_node_attributes_query_details function to HELP CSM API. | ||
fn_csm_node_delete | (Stored Procedure) | csm_node,csm_dimm,csm_gpu,csm_hca,csm_processor,csm_ssd | record | i_node_names text[], OUT o_not_deleted_node_names_count integer, OUT o_not_deleted_node_names text | Function to delete a node, and remove records in the csm_node, csm_ssd, csm_processor, csm_gpu, csm_hca, csm_dimm tables. | ||
fn_csm_node_ready | tr_csm_node_ready | csm_node | BEFORE | trigger | UPDATE | csm_node_ready trigger to amend summarized column(s) on UPDATE. | |
fn_csm_node_update | tr_csm_node_update | csm_node | BEFORE | trigger | UPDATE, DELETE | csm_node_update trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_processor_history_dump | tr_csm_processor_history_dump | csm_processor | BEFORE | trigger | UPDATE, DELETE | csm_processor trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_ras_type_update | tr_csm_ras_type_update | csm_ras_type | AFTER | trigger | INSERT, UPDATE,DELETE | csm_ras_type trigger to add rows to csm_ras_type_audit on INSERT and UPDATE and DELETE. (csm_ras_type_update) | |
fn_csm_ssd_dead_records | (Stored Procedure) | csm_vg_ssd, csm_vg, csm_lv | void | i_sn text | Delete any vg and lv on an ssd that is being deleted. | ||
fn_csm_ssd_history_dump | tr_csm_ssd_history_dump | csm_ssd | BEFORE | trigger | UPDATE, DELETE | csm_ssd trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_ssd_wear | tr_csm_ssd_wear | csm_ssd | BEFORE | trigger | UPDATE | csm_ssd_wear trigger to amend summarized column(s) on UPDATE. | |
fn_csm_step_begin | (Stored Procedure) | csm_step | void | i_step_id bigint, i_allocation_id bigint, i_status text, i_executable text, i_working_directory text, i_argument text, i_environment_variable text, i_num_nodes integer, i_num_processors integer, i_num_gpus integer, i_projected_memory integer, i_num_tasks integer, i_user_flags text, i_node_names text[], OUT o_begin_time timestamp without time zone | csm_step_begin function to begin a step, adds the step to csm_step and csm_step_node | ||
fn_csm_step_end | (Stored Procedure) | csm_step_node,csm_step | record | i_stepid bigint, i_allocationid bigint, i_exitstatus integer, i_errormessage text, i_cpustats text, i_totalutime double precision, i_totalstime double precision, i_ompthreadlimit text, i_gpustats text, i_memorystats text, i_maxmemory bigint, i_iostats text, OUT o_user_flags text, OUT o_num_nodes integer, OUT o_nodes text, OUT o_end_time timestamp without time zone | csm_step_end function to delete the step from the nodes table (fn_csm_step_end) | ||
fn_csm_step_history_dump | (Stored Procedure) | csm_step | void | i_stepid bigint, i_allocationid bigint, i_endtime timestamp with time zone, i_exitstatus integer, i_errormessage text, i_cpustats text, i_totalutime double precision, i_totalstime double precision, i_ompthreadlimit text, i_gpustats text, i_memorystats text, i_maxmemory bigint, i_iostats text | csm_step function to amend summarized column(s) on DELETE. (csm_step_history_dump) | ||
fn_csm_step_node_history_dump | tr_csm_step_node_history_dump | csm_step_node | BEFORE | trigger | DELETE | csm_step_node trigger to amend summarized column(s) on DELETE. (csm_step_node_history_dump) | |
fn_csm_switch_attributes_query_details | (Stored Procedure) | csm_switch,csm_switch_inventory,csm_switch_ports | switch_details | i_switch_name text | csm_switch_attributes_query_details function to HELP CSM API. | ||
fn_csm_switch_children_inventory_collection | (Stored Procedure) | csm_switch_inventory | void | i_record_count integer, i_name text[], i_host_system_guid text[], i_comment text[], i_description text[], i_device_name text[], i_device_type text[], i_max_ib_ports integer[], i_module_index integer[], i_number_of_chips integer[], i_path text[], i_serial_number text[], i_severity text[], i_status text[], OUT o_insert_count integer, OUT o_update_count integer, OUT o_delete_count integer | function to INSERT and UPDATE switch children inventory. | ||
fn_csm_switch_history_dump | tr_csm_switch_history_dump | csm_switch | BEFORE | trigger | UPDATE, DELETE | csm_switch trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_switch_inventory_collection | (Stored Procedure) | csm_switch | void | i_record_count integer, i_switch_name text[], i_serial_number text[], i_comment text[], i_description text[], i_fw_version text[], i_gu_id text[], i_has_ufm_agent boolean[], i_hw_version text[], i_ip text[], i_model text[], i_num_modules integer[], i_physical_frame_location text[], i_physical_u_location text[], i_ps_id text[], i_role text[], i_server_operation_mode text[], i_sm_mode text[], i_state text[], i_sw_version text[], i_system_guid text[], i_system_name text[], i_total_alarms integer[], i_type text[], i_vendor text[], OUT o_insert_count integer, OUT o_update_count integer, OUT o_delete_count integer, OUT o_delete_module_count integer | function to INSERT and UPDATE switch inventory. | ||
fn_csm_switch_inventory_history_dump | tr_csm_switch_inventory_history_dump | csm_switch_inventory | BEFORE | trigger | UPDATE, DELETE | csm_switch_inventory trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_vg_create | (Stored Procedure) | csm_vg_ssd,csm_vg,csm_ssd | void | i_available_size bigint, i_node_name text, i_ssd_count integer, i_ssd_serial_numbers text[], i_ssd_allocations bigint[], i_total_size bigint, i_vg_name text, i_is_scheduler boolean | Function to create a vg, adds the vg to csm_vg_ssd and csm_vg | ||
fn_csm_vg_delete | (Stored Procedure) | csm_vg, csm_vg_ssd | void | i_node_name text, i_vg_name text | Function to delete a vg, and remove records in the csm_vg and csm_vg_ssd tables. | ||
fn_csm_vg_history_dump | tr_csm_vg_history_dump | csm_vg | BEFORE | trigger | UPDATE, DELETE | csm_vg trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_vg_ssd_history_dump | tr_csm_vg_ssd_history_dump | csm_vg_ssd | BEFORE | trigger | UPDATE, DELETE | csm_vg_ssd trigger to amend summarized column(s) on UPDATE and DELETE. |
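The stored procedures listed above are ordinary PostgreSQL functions and can be called directly from psql for ad-hoc inspection. A minimal sketch, using the function name and i_node_name argument from the table above with a hypothetical node name:
-- Query the detailed attributes of a single node (hypothetical node name).
SELECT * FROM fn_csm_node_attributes_query_details('c650f99p06');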
Using csm_db_history_archive.py¶
This section describes the archiving process associated with the CSM DB history tables. If run alone it will archive all history tables in the CSM Database, including the csm_ras_event_action table.
Note
This script is designed to run as the root user. If you try to run it as the postgres user, the script will display a message and exit.
-bash-4.2$ ./csm_db_history_archive.py -h
---------------------------------------------------------------------------------------------------------
[INFO] Only root can run this script
---------------------------------------------------------------------------------------------------------
Usage Overview¶
/opt/ibm/csm/db/csm_db_history_archive.py -h
/opt/ibm/csm/db/csm_db_history_archive.py --help
The help command (-h, --help) will specify each of the options available to use.
Options | Description | Result |
---|---|---|
running the script with no options | ./csm_db_history_archive.py | Will execute with default configured settings |
running the script with -t, --target | ./csm_db_history_archive.py -t, --target | Specifies the target directory where JSON files will be written. |
running the script with -n, --count | ./csm_db_history_archive.py -n, --count | Specifies the number of records to be archived. |
running the script with -d, --database | ./csm_db_history_archive.py -d, --database | Specifies the database name. |
running the script with -u, --user | ./csm_db_history_archive.py -u, --user | Specifies the database user name. |
running the script with --threads | ./csm_db_history_archive.py --threads | Specifies the number of threads for the thread pool. |
running the script with -h, --help | ./csm_db_history_archive.py -h, --help | See details below. |
Example (usage)¶
-bash-4.2$ ./csm_db_history_archive.py -h
---------------------------------------------------------------------------------------------------------
usage: csm_db_history_archive.py [-h] [-t dir] [-n count] [-d db] [-u user]
[--threads threads]
A tool for archiving the CSM Database history tables.
optional arguments:
-h, --help show this help message and exit
-t dir, --target dir Target directory to write archive to. Default:
/var/log/ibm/csm/archive
-n count, --count count
Number of records to archive in the run. Default: 1000
-d db, --database db Database to archive tables from. Default: csmdb
-u user, --user user The database user. Default: postgres
--threads threads The number of threads for the thread pool. Default: 10
------------------------------------------------------------------------------
Note
This is a general overview of the CSM DB archive history process using the csm_db_history_archive.py script.
Script overview¶
The script may largely be broken into the following steps (a SQL sketch of the per-table pattern follows the list):
- Create a temporary table to archive history data based on a condition.
- Connect to the Database with the postgres user.
- Drops and creates the temp table used in the archival process.
- The first query selects all the fields in the table.
- The second and third queries are nested queries that define a particular row count, which a user can pass in or which can be set as a default value. The data is filtered using the history_time field.
- The where clause checks whether the archive_history_time field is NULL.
- The user will have the option to pass in a row count value (ex. 10,000 records).
- The data will be ordered by history_time ASC.
- Copies all satisfied history data to a json file.
- Copies all the results from the temp table and appends them to a JSON file.
- Then updates the archive_history_time field, so that the archived records can later be deleted during the purging process.
- Updates the csm_[table_name]_history table
- Sets the archive_history_time = current timestamp
- From clause on the temp table
- WHERE (compares history_time, from history table to temp table) AND history.archive_history_time IS NULL.
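A simplified SQL sketch of the per-table pattern described above, using csm_node_history and the default row count of 1000 as an example. This illustrates the logic only; the queries actually issued by the script may differ in detail (for instance, the ctid indexes listed earlier suggest the join may use ctid rather than history_time).
-- 1. Stage the oldest unarchived rows in a temporary table.
CREATE TEMP TABLE temp_node_history AS
    SELECT *
    FROM csm_node_history
    WHERE archive_history_time IS NULL
    ORDER BY history_time ASC
    LIMIT 1000;
-- 2. Export the staged rows; the script appends them to a JSON file,
--    e.g. SELECT row_to_json(t) FROM temp_node_history t;
-- 3. Mark the source rows as archived so a later purge can delete them.
UPDATE csm_node_history h
SET    archive_history_time = now()
FROM   temp_node_history t
WHERE  h.history_time = t.history_time
  AND  h.archive_history_time IS NULL;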
Attention
If the script is run manually, it will display the results to the screen. This script handles all history table archiving in the database.
Script out results¶
[root@c650mnp02 db]# /opt/ibm/csm/db/csm_db_history_archive.py -d csmdb -t /tmp/test_archive_dir/ -n 100
---------------------------------------------------------------------------------------------------------
Welcome to the CSM DB archiving script
---------------------------------------------------------------------------------------------------------
Start Script Time: | 2018-11-23 11:25:02.027564
---------------------------------------------------------------------------------------------------------
[INFO] Processing Table csm_config_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_allocation_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_allocation_node_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_db_schema_version_history | User Ct: 100 | Act DB Ct: 100
[INFO] Processing Table csm_allocation_state_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_diag_result_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_diag_run_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_hca_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_dimm_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_ib_cable_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_gpu_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_lv_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_lv_update_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_processor_socket_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_node_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_ssd_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_node_state_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_ssd_wear_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_step_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_switch_inventory_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_step_node_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_vg_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_switch_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_vg_ssd_history | User Ct: 100 | Act DB Ct: 0
[INFO] Processing Table csm_ras_event_action | User Ct: 100 | Act DB Ct: 0
---------------------------------------------------------------------------------------------------------
DB Name: | csmdb
DB User Name: | postgres
Thread Count: | 10
Archiving Log Directory: | /var/log/ibm/csm/db/csm_db_archive_script.log
Archiving Data Directory: | /tmp/test_archive_dir/
End Script Time: | 2018-11-23 11:25:02.130501
Total Process Time: | 0:00:00.102937
---------------------------------------------------------------------------------------------------------
Finish CSM DB archive script process
---------------------------------------------------------------------------------------------------------
Attention
While the archive script is running, the user can monitor its progress from another session using the csm_db_stats.sh script:
/opt/ibm/csm/db/csm_db_stats.sh -t <db_name>
/opt/ibm/csm/db/csm_db_stats.sh --tableinfo <db_name>
If a user specifies an unrelated DB name, or if there are issues connecting to the DB server, a message will be displayed.
[root@c650mnp02 db]# /opt/ibm/csm/db/csm_db_history_archive.py -d csmd -t /tmp/test_archive_dir/ -n 100
---------------------------------------------------------------------------------------------------------
Welcome to the CSM DB archiving script
---------------------------------------------------------------------------------------------------------
Start Script Time: | 2018-11-23 11:44:17.535131
---------------------------------------------------------------------------------------------------------
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
---------------------------------------------------------------------------------------------------------
DB Name: | csmd
DB User Name: | postgres
Thread Count: | 10
Archiving Log Directory: | /var/log/ibm/csm/db/csm_db_archive_script.log
Archiving Data Directory: | /tmp/test_archive_dir/
End Script Time: | 2018-11-23 11:44:17.574674
Total Process Time: | 0:00:00.039543
---------------------------------------------------------------------------------------------------------
Finish CSM DB archive script process
---------------------------------------------------------------------------------------------------------
Note
Directory: Currently the scripts are set up to archive the results in a specified directory.
The history table data will be archived in a .json file format and in the specified or default directory:
csm_allocation_history.archive.2018-11-23.json
The history table log file will be in a .log file format and in the default directory:
/var/log/ibm/csm/db/csm_db_archive_script.log
Using csm_db_backup_script_v1.sh¶
To manually perform a cold backup of a CSM database on the system, the following script may be run.
/opt/ibm/csm/db/csm_db_backup_script_v1.sh
Note
This script should be run as the root or postgres user.
Attention
There are a few steps that should be taken before backing up a CSM or related DB on the system.
Backup script actions¶
The following steps are recommended when using the backup script:
- Stop all CSM daemons.
- Run the backup script.
Invocation: | /opt/ibm/csm/db/csm_db_backup_script_v1.sh [DBNAME] [/DIR/] |
---|---|
Default Directory: | /var/lib/pgsql/backups/ |
The script will check the DB connections; if there are no active connections, the backup process will begin. If there are any active connections to the DB, an error message will be displayed and the program will exit. (A query to check the connections by hand is sketched after this list.)
To terminate active connections, use csm_db_connections_script.sh.
- Once the DB has been successfully backed up, the admin can restart the daemons.
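The connection check mentioned above can also be performed by hand before taking the backup. A minimal query sketch (not the script's exact implementation) that counts sessions connected to the target database:
-- Count active connections to the database about to be backed up (here csmdb).
SELECT count(*) AS active_connections
FROM   pg_stat_activity
WHERE  datname = 'csmdb';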
Running the csm_db_backup_script_v1.sh¶
Example (-h, --help)¶
./csm_db_backup_script_v1.sh -h, --help
===============================================================================================================
[Info ] csm_db_backup_script_v1.sh : csmdb /tmp/csmdb_backup/
[Info ] csm_db_backup_script_v1.sh : csmdb
[Usage] csm_db_backup_script_v1.sh : [OPTION]... [/DIR/]
---------------------------------------------------------------------------------------------------------------
[Options]
----------------|----------------------------------------------------------------------------------------------
Argument | Description
----------------|----------------------------------------------------------------------------------------------
-h, --help | help menu
----------------|----------------------------------------------------------------------------------------------
[Examples]
---------------------------------------------------------------------------------------------------------------
csm_db_backup_script_v1.sh [DBNAME] | (default) will backup database to/var/lib/pgpsql/backups/ (directory)
csm_db_backup_script_v1.sh [DBNAME] [/DIRECTORY/ | will backup database to specified directory
| if the directory doesnt exist then it will be mode and
| written.
==============================================================================================================
Attention
Common errors
If the user tries to run the script as a local user without PostgreSQL installed and does not provide a database name:
- An info message will be displayed ([Info ] Database name is required)
- The usage help menu will also be displayed
Example (no options, usage)¶
bash-4.1$ ./csm_db_backup_script_v1.sh
[Info ] Database name is required
===============================================================================================================
[Info ] csm_db_backup_script_v1.sh : csmdb /tmp/csmdb_backup/
[Info ] csm_db_backup_script_v1.sh : csmdb
[Usage] csm_db_backup_script_v1.sh : [OPTION]... [/DIR/]
---------------------------------------------------------------------------------------------------------------
[Options]
----------------|----------------------------------------------------------------------------------------------
Argument | Description
----------------|----------------------------------------------------------------------------------------------
-h, --help | help menu
----------------|----------------------------------------------------------------------------------------------
[Examples]
---------------------------------------------------------------------------------------------------------------
csm_db_backup_script_v1.sh [DBNAME] | (default) will backup database to/var/lib/pgpsql/backups/ (directory)
csm_db_backup_script_v1.sh [DBNAME] [/DIRECTORY/ | will backup database to specified directory
| if the directory doesnt exist then it will be mode and
| written.
Note
If the user tries to run the script as a local user (non-root and PostgreSQL not installed):
Example (postgreSQL not installed)¶
bash-4.1$ ./csm_db_backup_script_v1.sh csmdb /tmp/
-----------------------------------------------------------------------------------------
[Error ] PostgreSQL may not be installed. Please check configuration settings
-----------------------------------------------------------------------------------------
Note
If the user tries to run the script as a local user (non-root and PostgreSQL not installed) and doesn't specify a directory (default directory: /var/lib/pgsql/backups):
Example (no directory specified)¶
bash-4.1$ ./csm_db_backup_script_v1.sh csmdb
-----------------------------------------------------------------------------------------
[Error ] make directory failed for: /var/lib/pgsql/backups/
[Info ] User: csmcarl does not have permission to write to this directory
[Info ] Please specify a valid directory
[Info ] Or log in as the appropriate user
-----------------------------------------------------------------------------------------
Using csm_db_connections_script.sh¶
This script is designed to list and/or kill all active connections to a PostgreSQL database. Logging for this script is placed in /var/log/ibm/csm/csm_db_connections_script.log
Usage Overview¶
/opt/ibm/csm/db/csm_db_connections_script.sh -h
/opt/ibm/csm/db/csm_db_connections_script.sh --help
The help command (-h, --help) will specify each of the options available to use.
Options | Description | Result |
---|---|---|
running the script with no options | ./csm_db_connections_script.sh | Try 'csm_db_connections_script.sh --help' for more information. |
running the script with -l, --list | ./csm_db_connections_script.sh -l, --list | List database sessions. |
running the script with -k, --kill | ./csm_db_connections_script.sh -k, --kill | Kill/terminate database sessions. |
running the script with -f, --force | ./csm_db_connections_script.sh -f, --force | Force kill (do not ask for confirmation; use in conjunction with the -k option). |
running the script with -u, --user | ./csm_db_connections_script.sh -u, --user | Specify the database user name. |
running the script with -p, --pid | ./csm_db_connections_script.sh -p, --pid | Specify the database user process id (pid). |
running the script with -h, --help | ./csm_db_connections_script.sh -h, --help | See details below. |
Example (usage)¶
-bash-4.2$ ./csm_db_connections_script.sh --help
[Info ] PostgreSQL is installed
=================================================================================================================
[Info ] csm_db_connections_script.sh : List/Kill database user sessions
[Usage] csm_db_connections_script.sh : [OPTION]... [USER]
-----------------------------------------------------------------------------------------------------------------
[Options]
----------------|------------------------------------------------------------------------------------------------
Argument | Description
----------------|------------------------------------------------------------------------------------------------
-l, --list | list database sessions
-k, --kill | kill/terminate database sessions
-f, --force | force kill (do not ask for confirmation,
| use in conjunction with -k option)
-u, --user | specify database user name
-p, --pid | specify database user process id (pid)
-h, --help | help menu
----------------|------------------------------------------------------------------------------------------------
[Examples]
-----------------------------------------------------------------------------------------------------------------
csm_db_connections_script.sh -l, --list | list all session(s)
csm_db_connections_script.sh -l, --list -u, --user [USERNAME] | list user session(s)
csm_db_connections_script.sh -k, --kill | kill all session(s)
csm_db_connections_script.sh -k, --kill -f, --force | force kill all session(s)
csm_db_connections_script.sh -k, --kill -u, --user [USERNAME] | kill user session(s)
csm_db_connections_script.sh -k, --kill -p, --pid [PIDNUMBER]| kill user session with a specific pid
=================================================================================================================
Listing all DB connections¶
To display all current DB connections:
/opt/ibm/csm/db/csm_db_connections_script.sh -l
/opt/ibm/csm/db/csm_db_connections_script.sh --list
Example (-l, --list)¶
-bash-4.2$ ./csm_db_connections_script.sh -l
-----------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] PostgreSQL is installed
===========================================================================================================
[Info ] Database Session | (all_users): 13
-----------------------------------------------------------------------------------------------------------
pid | database | user | connection_duration
-------+----------+----------+---------------------
61427 | xcatdb | xcatadm | 02:07:26.587854
61428 | xcatdb | xcatadm | 02:07:26.586227
73977 | postgres | postgres | 00:00:00.000885
72657 | csmdb | csmdb | 00:06:17.650398
72658 | csmdb | csmdb | 00:06:17.649185
72659 | csmdb | csmdb | 00:06:17.648012
72660 | csmdb | csmdb | 00:06:17.646846
72661 | csmdb | csmdb | 00:06:17.645662
72662 | csmdb | csmdb | 00:06:17.644473
72663 | csmdb | csmdb | 00:06:17.643285
72664 | csmdb | csmdb | 00:06:17.642105
72665 | csmdb | csmdb | 00:06:17.640927
72666 | csmdb | csmdb | 00:06:17.639771
(13 rows)
===========================================================================================================
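The listing above corresponds roughly to a query against the pg_stat_activity view. A sketch of an equivalent manual query (the script's own query may differ):
-- Show pid, database, user and connection duration for every client session.
SELECT pid,
       datname AS database,
       usename AS "user",
       now() - backend_start AS connection_duration
FROM   pg_stat_activity
WHERE  datname IS NOT NULL
ORDER  BY connection_duration DESC;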
To display specified user(s) currently connected to the DB:
/opt/ibm/csm/db/csm_db_connections_script.sh -l -u <username>
/opt/ibm/csm/db/csm_db_connections_script.sh --list --user <username>
Note
The script will display the total number of sessions for all users along with the session count for the specified user.
Example (-l, --list -u, --user)¶
-bash-4.2$ ./csm_db_connections_script.sh -l -u postgres
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB user: postgres is connected
[Info ] PostgreSQL is installed
==============================================================================================================
[Info ] Database Session | (all_users): 13
[Info ] Session List | (postgres): 1
------------------------------------------------------------------------------------------------------
pid | database | user | connection_duration
-------+----------+----------+---------------------
74094 | postgres | postgres | 00:00:00.000876
(1 row)
==============================================================================================================
Example (not specifying user or invalid user in the system)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -u
[Error] Please specify user name
------------------------------------------------------------------------------------------------------
-bash-4.2$ ./csm_db_connections_script.sh -k -u csmdbsadsd
[Error] DB user: csmdbsadsd is not connected or is invalid
------------------------------------------------------------------------------------------------------
Kill all DB connections¶
The user has the ability to kill all DB connections by using the -k, --kill option:
/opt/ibm/csm/db/csm_db_connections_script.sh -k
/opt/ibm/csm/db/csm_db_connections_script.sh --kill
Note
If this option is chosen by itself, the script will prompt each session with a yes/no request. The user has the ability to manually kill or not kill each session. All responses are logged to the:
/var/log/ibm/csm/csm_db_connections_script.log
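At the SQL level, terminating a PostgreSQL session amounts to calling pg_terminate_backend() on its pid; the script adds its own prompting, signalling, and logging on top of this. A sketch of the equivalent manual statement, terminating every client session except the current one:
-- Terminate all other client sessions (equivalent to answering 'y' for each prompt).
SELECT pg_terminate_backend(pid)
FROM   pg_stat_activity
WHERE  datname IS NOT NULL
  AND  pid <> pg_backend_pid();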
Example (-k, --kill)¶
-bash-4.2$ ./csm_db_connections_script.sh -k
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:
======================================================================================================
-bash-4.2$ ./csm_db_connections_script.sh -k
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:61428) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:74295) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72657) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72658) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72659) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72660) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72661) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72662) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72663) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72664) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72665) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72666) [y/n] ?:
[Info ] User response: n
============================================================================================================
Force kill all DB connections¶
The user has the ability to force kill all DB connections by using the -k, --kill -f, --force options.
/opt/ibm/csm/db/csm_db_connections_script.sh -k -f
/opt/ibm/csm/db/csm_db_connections_script.sh --kill --force
Warning
If this option is chosen, the script will kill every open session without prompting.
All responses are logged to the:
/var/log/ibm/csm/csm_db_connections_script.log
Example (-k, --kill -f, --force)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -f
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] PostgreSQL is installed
[Info ] Killing session (PID:61427)
[Info ] Killing session (PID:61428)
[Info ] Killing session (PID:74295)
[Info ] Killing session (PID:72657)
[Info ] Killing session (PID:72658)
[Info ] Killing session (PID:72659)
[Info ] Killing session (PID:72660)
[Info ] Killing session (PID:72661)
[Info ] Killing session (PID:72662)
[Info ] Killing session (PID:72663)
[Info ] Killing session (PID:72664)
[Info ] Killing session (PID:72665)
./csm_db_connections_script.sh: line 360: kill: (72665) – No such process
=============================================================================================================
Example (Log file output)¶
2017-11-01 15:54:27 (postgres) [Start] Welcome to CSM datatbase automation stats script.
2017-11-01 15:54:27 (postgres) [Info ] DB Names: template1 | template0 | postgres |
2017-11-01 15:54:27 (postgres) [Info ] DB Names: xcatdb | csmdb
2017-11-01 15:54:27 (postgres) [Info ] PostgreSQL is installed
2017-11-01 15:54:27 (postgres) [Info ] Script execution: csm_db_connections_script.sh -k, --kill
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:61427) kill –TERM 61427
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:61428) kill –TERM 61428
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:74295) kill –TERM 74295
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:72657) kill –TERM 72657
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:72658) kill –TERM 72658
2017-11-01 15:54:30 (postgres) [Info ] Killing user session (PID:72659) kill –TERM 72659
2017-11-01 15:54:30 (postgres) [Info ] Killing user session (PID:72660) kill –TERM 72660
2017-11-01 15:54:30 (postgres) [Info ] Killing user session (PID:72661) kill –TERM 72661
2017-11-01 15:54:30 (postgres) [Info ] Killing user session (PID:72662) kill –TERM 72662
2017-11-01 15:54:31 (postgres) [Info ] Killing user session (PID:72663) kill –TERM 72663
2017-11-01 15:54:31 (postgres) [Info ] Killing user session (PID:72664) kill –TERM 72664
2017-11-01 15:54:31 (postgres) [Info ] Killing user session (PID:72665) kill –TERM 72665
2017-11-01 15:54:31 (postgres) [Info ] Killing user session (PID:72666) kill –TERM 72666
2017-11-01 15:54:31 (postgres) [End ] Postgres DB kill query executed
-----------------------------------------------------------------------------------------------------------
Kill user connection(s)¶
The user has the ability to kill a specific user's DB connections by using the -k, --kill option along with the -u, --user option.
/opt/ibm/csm/db/csm_db_connections_script.sh -k -u <username>
/opt/ibm/csm/db/csm_db_connections_script.sh --kill --user <username>
Note
If this option is chosen then the script will prompt each session with a yes/no request. The user has the ability to manually kill or not kill each session.
All responses are logged to the:
/var/log/ibm/csm/csm_db_connections_script.log
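A sketch of the equivalent manual SQL when only one user's sessions should be terminated (hypothetical user name csmdb; not the script's own implementation):
-- Terminate only the sessions owned by a specific database user.
SELECT pg_terminate_backend(pid)
FROM   pg_stat_activity
WHERE  usename = 'csmdb'
  AND  pid <> pg_backend_pid();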
Example (-k, --kill -u, --user <username>)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -u csmdb
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB user: csmdb is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:
------------------------------------------------------------------------------------------------------
Example (Single session user kill)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -u csmdb
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB user: csmdb is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:y
[Info ] Killing session (PID:61427)
------------------------------------------------------------------------------------------------------
Example (Multiple session user kill)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -u csmdb
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB user: csmdb is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:y
[Info ] Killing session (PID:61427)
[Info ] Kill database session (PID: 61428) [y/n] ?:y
[Info ] Killing session (PID:61428)
------------------------------------------------------------------------------------------------------
Kill PID connection(s)¶
The user has the ability to kill a specific DB connection by its process id using the -k, --kill option along with the -p, --pid option.
/opt/ibm/csm/db/csm_db_connections_script.sh -k -p <pidnumber>
/opt/ibm/csm/db/csm_db_connections_script.sh --kill --pid <pidnumber>
Note
If this option is chosen then the script will prompt the session with a yes/no request.
The response is logged to the:
/var/log/ibm/csm/csm_db_connections_script.log
Example (-k, --kill -p, --pid <pidnumber>)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -p 61427
---------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB PID: 61427 is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:
---------------------------------------------------------------------------------------------------------
-bash-4.2$ ./csm_db_connections_script.sh -k -p 61427
---------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB PID: 61427 is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:y
[Info ] Killing session (PID:61427)
---------------------------------------------------------------------------------------------------------
Using csm_db_history_delete.py¶
This section describes the deletion process associated with the CSM Database history table records. If run alone, it will delete records from all history tables, including the csm_ras_event_action table, that contain a non-null archive history timestamp.
Note
This script is designed to run as the root user. If you try to run it as the postgres user, the script will display a message and exit.
-bash-4.2$ ./csm_db_history_delete.py -h
---------------------------------------------------------------------------------------------------------
[INFO] Only root can run this script
---------------------------------------------------------------------------------------------------------
Usage Overview¶
The csm_db_history_delete.py script accepts the following flags:
- Interval time (in minutes) - required (tunable time interval for managing table record deletions)
- Database name - required
- DB user name - optional
- Thread count - optional
/opt/ibm/csm/db/csm_db_history_delete.py -h
/opt/ibm/csm/db/csm_db_history_delete.py --help
Options | Description | Result |
---|---|---|
running the script with no options | ./csm_db_history_delete.py | Will display a message explaining that the -n/--count and/or -d/--database options are required. |
running the script with -n, --count | ./csm_db_history_delete.py -n, --count | Specifies the time (in minutes) of the oldest records to delete. (required) |
running the script with -d, --database | ./csm_db_history_delete.py -d, --database | Specifies the database name. (required) |
running the script with -u, --user | ./csm_db_history_delete.py -u, --user | Specifies the database user name. (optional) |
running the script with --threads | ./csm_db_history_delete.py --threads | Specifies the number of threads for the thread pool. (optional) |
running the script with -h, --help | ./csm_db_history_delete.py -h, --help | See details below. |
Example (usage)¶
-bash-4.2$ /opt/ibm/csm/db/csm_db_history_delete.py -h
---------------------------------------------------------------------------------------------------------
usage: csm_db_history_delete.py [-h] -n count -d db [-u user]
[--threads threads]
A tool for deleting the CSM Database history table records.
optional arguments:
-h, --help show this help message and exit
-n count, --count count
The time (in mins.) of oldest records which to delete.
required argument
-d db, --database db Database name to delete history records from. required
argument
-u user, --user user The database user. Default: postgres
--threads threads The number of threads for the thread pool. Default: 10
------------------------------------------------------------------------------
Note
This is a general overview of the CSM DB deletion process using the csm_db_history_delete.py script.
Script out results¶
[root@c650mnp02 db]# /opt/ibm/csm/db/csm_db_history_delete.py -d csmdb -n 2880
---------------------------------------------------------------------------------------------------------
Welcome to the CSM DB deletion of history table records script
---------------------------------------------------------------------------------------------------------
Start Script Time: | 2018-12-10 11:56:13.395135
---------------------------------------------------------------------------------------------------------
[INFO] Processing Table csm_allocation_state_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_config_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_allocation_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_allocation_node_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_db_schema_version_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_diag_result_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_diag_run_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_dimm_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_gpu_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_hca_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_ib_cable_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_lv_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_lv_update_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_node_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_node_state_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_processor_socket_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_ssd_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_ssd_wear_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_step_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_step_node_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_switch_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_switch_inventory_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_vg_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_vg_ssd_history | User Ct (time(mins)): 2880 | Act DB Ct: 0
[INFO] Processing Table csm_ras_event_action | User Ct (time(mins)): 2880 | Act DB Ct: 0
---------------------------------------------------------------------------------------------------------
DB Name: | csmdb
DB User Name: | postgres
Thread Count: | 10
Deletion Log Directory: | /var/log/ibm/csm/db/csm_db_history_delete.log
End Script Time: | 2018-12-10 11:56:13.441324
Total Process Time: | 0:00:00.046189
---------------------------------------------------------------------------------------------------------
Finish CSM DB deletion script process
---------------------------------------------------------------------------------------------------------
If a user specifies an unrelated DB name or user name, or if there are issues connecting to the DB server, a message will be displayed.
[root@c650mnp02 db]# /opt/ibm/csm/db/csm_db_history_delete.py -d csmdb123 -n 1 -u abcd
---------------------------------------------------------------------------------------------------------
Welcome to the CSM DB deletion of history table records script
---------------------------------------------------------------------------------------------------------
Start Script Time: | 2018-12-10 11:56:19.555008
---------------------------------------------------------------------------------------------------------
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
[CRITICAL] Unable to connect to local database.
---------------------------------------------------------------------------------------------------------
DB Name: | csmdb123
DB User Name: | abcd
Thread Count: | 10
Deletion Log Directory: | /var/log/ibm/csm/db/csm_db_history_delete.log
End Script Time: | 2018-12-10 11:56:19.601613
Total Process Time: | 0:00:00.046605
---------------------------------------------------------------------------------------------------------
Finish CSM DB deletion script process
---------------------------------------------------------------------------------------------------------
The csm_db_history_delete.py script (when called manually) will delete history records that have been archived with an archive_history_time. Records in the history tables that do not have an archive_history_time will remain in the system until they have been archived.
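A simplified sketch of the deletion criterion described above, using csm_node_history and a 2880-minute interval as an example; the exact SQL issued by the script may differ:
-- Delete only rows that have already been archived and whose history_time is
-- older than the supplied interval (here 2880 minutes, i.e. two days).
DELETE FROM csm_node_history
WHERE  archive_history_time IS NOT NULL
  AND  history_time < now() - interval '2880 minutes';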
Note
Directory: The script's logging information will be in a specified directory.
The history table delete log file will be in a .log file format and in the default directory:
/var/log/ibm/csm/db/csm_db_history_delete.log
Using csm_db_schema_version_upgrade_16_2.sh¶
Important
Prior steps before migrating to the newest DB schema version.
- Stop all CSM daemons
- Run a cold backup of the csmdb or specified DB (csm_db_backup_script_v1.sh)
- Install the newest RPMs
- Run the csm_db_schema_version_upgrade_16_2.sh
- Start CSM daemons
Attention
To migrate the CSM database from 15.0, 15.1, 16.0, or 16.1 to the newest schema version, run:
/opt/ibm/csm/db/csm_db_schema_version_upgrade_16_2.sh <my_db_name>
Note
The csm_db_schema_version_upgrade_16_2.sh script creates a log file (/var/log/ibm/csm/csm_db_schema_upgrade_script.log) and upgrades the database to the newest schema version (16.2).
Note
For a quick overview of the script functionality:
/opt/ibm/csm/db/csm_db_schema_version_upgrade_16_2.sh -h
/opt/ibm/csm/db/csm_db_schema_version_upgrade_16_2.sh --help
If the script is run without any options, the usage function is displayed.
Usage Overview¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_2.sh -h
-------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrade schema script.
[Error ] Please specify DB name
=================================================================================================
[Info ] csm_db_schema_version_upgrade.sh : Load CSM DB upgrade schema file
[Usage ] csm_db_schema_version_upgrade.sh : csm_db_schema_version_upgrade.sh [DBNAME]
-------------------------------------------------------------------------------------------------
Argument | DB Name | Description
-----------------|-----------|-------------------------------------------------------------------
script_name | [db_name] | Imports sql upgrades to csm db table(s) (appends)
| | fields, indexes, functions, triggers, etc
-----------------|-----------|-------------------------------------------------------------------
=================================================================================================
Upgrading CSM DB (manual process)¶
Note
To upgrade the CSM or specified DB:
/opt/ibm/csm/db/csm_db_schema_version_upgrade_16_2.sh <my_db_name> (where my_db_name is the name of your DB).
Note
The script will check to see if the given DB name exists. If the database name does not exist, then it will exit with an error message.
Example (non DB existence):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_2.sh csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Info ] PostgreSQL is installed
[Error ] Cannot perform action because the csmdb database does not exist. Exiting.
-------------------------------------------------------------------------------------
Note
- The script will check for the existence of these files:
csm_db_schema_version_data.csv
csm_create_tables.sql
csm_create_triggers.sql
When an upgrade process happens, the new RPM will consist of a new schema version csv, DB create tables file, and/or create triggers/functions file to be loaded into a (completely new) DB.
Example (non csv_file_name existence):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_2.sh csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Error ] File csm_db_schema_version_data.csv can not be located or doesnt exist
-------------------------------------------------------------------------------------
Note
The second check makes sure the file exists and compares the actual SQL upgrade version to the hardcoded version number. If the criteria are met, then the script will proceed. If the check fails, then an error message is displayed.
Example (non compatible migration):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_2.sh csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Error ] Cannot perform action because not compatible.
[Info ] Required DB schema version 15.0, 15.1, 16.0, 16.1 or appropriate files in directory
[Info ] csmdb current_schema_version is running: 16.1
[Info ] csm_create_tables.sql file currently in the directory is: 15.1 (required version) 16.2
[Info ] csm_create_triggers.sql file currently in the directory is: 16.2 (required version) 16.2
[Info ] csm_db_schema_version_data.csv file currently in the directory is: 16.2 (required version) 16.2
[Info ] Please make sure you have the latest RPMs installed and latest DB files.
-------------------------------------------------------------------------------------
Note
If the user selects the "n/no" option when prompted to migrate to the newest DB schema version, then the program will exit with the message below.
Example (user prompt execution with “n/no” option):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_2.sh csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Info ] PostgreSQL is installed
[Info ] csmdb current_schema_version 16.1
[Info ] csmdb schema_version_upgrade: 16.2
[Warning  ] This will migrate csmdb database to schema version 16.2. Do you want to continue [y/n]?:
[Info ] User response: n
[Error ] Migration session for DB: csmdb User response: ****(NO)**** not updated
---------------------------------------------------------------------------------------------------------------
Note
If the user selects the "y/yes" option when prompted to migrate to the newest DB schema version, then the program will begin execution. An additional section has been added to the migration script to update existing RAS message types or to insert new ones. The user will have to specify y/yes for these changes or n/no to skip that step. If there are no changes to the RAS message types and no new cases, then the information will be displayed accordingly.
Example (user prompt execution with “y/yes” options for both):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_2.sh csmdb
------------------------------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrade script.
[Info ] PostgreSQL is installed
[Info ] csmdb current_schema_version 16.1
[Info ] csmdb schema_version_upgrade: 16.2
[Warning ] This will migrate csmdb database to schema version 16.2. Do you want to continue [y/n]?:
[Info ] User response: y
[Info ] csmdb migration process begin.
[Info ] There are no connections to csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database ras type automation script.
[Info ] csm_ras_type_data.csv file exists
[Info ] PostgreSQL is installed
[Warning ] This will load and or update csm_ras_type table data into csmdb database. Do you want to continue [y/n]?
[Info ] User response: y
[Info ] csm_ras_type record count before script execution: 520
[Info ] Record import count from csm_ras_type_data.csv: 737
[Info ] Record update count from csm_ras_type_data.csv: 5
[Info ] Total csm_ras_type insert count from file: 217
[Info ] csm_ras_type live row count after script execution: 737
[Info ] csm_ras_type_audit live row count: 742
[Info ] Database: csmdb csv upload process complete for csm_ras_type table.
------------------------------------------------------------------------------------
[Complete] csmdb database schema update 16.2.
------------------------------------------------------------------------------------------------------------------------
[Timing ] 0:00:00:1.8838
------------------------------------------------------------------------------------------------------------------------
Example (user prompt execution with “y/yes” for the migration and “n/no” for the RAS section):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_2.sh csmdb
------------------------------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrade script.
[Info ] PostgreSQL is installed
[Info ] csmdb current_schema_version 16.1
[Info ] csmdb schema_version_upgrade: 16.2
[Warning ] This will migrate csmdb database to schema version 16.2. Do you want to continue [y/n]?:
[Info ] User response: y
[Info ] csmdb migration process begin.
[Info ] There are no connections to csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database ras type automation script.
[Info ] csm_ras_type_data.csv file exists
[Info ] PostgreSQL is installed
[Warning ] This will load and or update csm_ras_type table data into csmdb database. Do you want to continue [y/n]?
[Info ] User response: n
[Info ] Skipping the csm_ras_type table data import/update process
------------------------------------------------------------------------------------
[Complete] csmdb database schema update 16.2.
------------------------------------------------------------------------------------------------------------------------
[Timing ] 0:00:00:1.0024
------------------------------------------------------------------------------------------------------------------------
Attention
It is not recommended to select n/no for the RAS section during the migration script process. If this does occur, then the RAS script can be run on its own by the system admin.
To run the RAS script by itself, please refer to the "Using csm_ras_type_script.sh" section.
Note
If the migration script has already been run, or a new database has already been created with the latest schema version of 16.2, then the following message is displayed to the user.
Running the script with existing newer version¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_2.sh csmdb
-------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrade script.
[Info ] PostgreSQL is installed
[Info ] csmdb is currently running db schema version: 16.2
-------------------------------------------------------------------------------------------------
Warning
If there are existing DB connections, then the migration script will prompt a message and the admin will have to kill connections before proceeding.
Hint
The csm_db_connections_script.sh script can be used with the -l option to quickly list the current connections (please see the user guide or -h for usage). This script can also terminate user sessions based on pids or users, and the -f force option will kill all connections if necessary. Once the connections are terminated, the csm_db_schema_version_upgrade_16_2.sh script can be executed. The log message will display the current connection's user, database name, connection count, and duration.
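As a hedged illustration of that clean-up flow (the -l and -f options are the ones named in the hint above; any additional arguments they may take are not shown here and are therefore omitted):
# List the current connections to the CSM database
/opt/ibm/csm/db/csm_db_connections_script.sh -l
# Force-terminate the remaining connections (use with care)
/opt/ibm/csm/db/csm_db_connections_script.sh -f
# Re-run the schema upgrade once no connections remain
/opt/ibm/csm/db/csm_db_schema_version_upgrade_16_2.sh csmdb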
Example (user prompt execution with “y/yes” option and existing DB connection(s)):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_2.sh csmdb
---------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Info ] PostgreSQL is installed
[Info ] csmdb current_schema_version 16.1
[Info ] csmdb schema_version_upgrade: 16.2
[Warning  ] This will migrate csmdb database to schema version 16.2. Do you want to continue [y/n]?:
[Info ] User response: y
[Info ] csmdb migration process begin.
[Error ] csmdb has existing connection(s) to the database.
[Error ] User: csmdb has 1 connection(s)
[Info ] See log file for connection details
---------------------------------------------------------------------------------------------------
Using csm_db_script.sh¶
Note
For a quick overview of the script functionality:
/opt/ibm/csm/db/csm_db_script.sh -h
/opt/ibm/csm/db/csm_db_script.sh --help
This help command <-h, --help> specifies each of the options available to use.
Usage Overview¶
| A new DB set up <default db> | Command | Result |
|---|---|---|
| running the script with no options | ./csm_db_script.sh | This will create a default db with tables and populated data <specified by user or db admin> |
| running the script with -x, --nodata | ./csm_db_script.sh -x or ./csm_db_script.sh --nodata | This will create a default db with tables and no populated data |

| A new DB set up <new user db> | Command | Result |
|---|---|---|
| running the script with -n, --newdb | ./csm_db_script.sh -n <my_db_name> or ./csm_db_script.sh --newdb <my_db_name> | This will create a new db with tables and populated data. |
| running the script with -n, --newdb, -x, --nodata | ./csm_db_script.sh -n <my_db_name> -x or ./csm_db_script.sh --newdb <my_db_name> --nodata | This will create a new db with tables and no populated data. |

| If a DB already exists | Command | Result |
|---|---|---|
| Drop DB totally | ./csm_db_script.sh -d <my_db_name> or ./csm_db_script.sh --delete <my_db_name> | This will totally remove the DB from the system |
| Drop only the existing CSM DB tables | ./csm_db_script.sh -e <my_db_name> or ./csm_db_script.sh --eliminatetables <my_db_name> | This will only drop the specified CSM DB tables (useful if integrated within another DB, e.g. "XCATDB") |
| Force overwrite of existing DB | ./csm_db_script.sh -f <my_db_name> or ./csm_db_script.sh --force <my_db_name> | This will totally drop the existing tables in the DB and recreate them with populated table data. |
| Force overwrite of existing DB (no data) | ./csm_db_script.sh -f <my_db_name> -x or ./csm_db_script.sh --force <my_db_name> --nodata | This will totally drop the existing tables in the DB and recreate them without table data. |
| Remove just the data from all the tables in the DB | ./csm_db_script.sh -r <my_db_name> or ./csm_db_script.sh --removetabledata <my_db_name> | This will totally remove all data from all the tables within the DB. |
Example (usage)¶
bash 4.2$ ./csm_db_script.sh -h
===============================================================================================================
[Info ] csm_db_script.sh : CSM database creation script with additional features
[Usage] csm_db_script.sh : [OPTION]... [DBNAME]... [OPTION]
---------------------------------------------------------------------------------------------------------------
[Options]
-----------------------|-----------|---------------------------------------------------------------------------
Argument | DB Name | Description
-----------------------|-----------|---------------------------------------------------------------------------
-x, --nodata | [DEFAULT] | creates database with tables and does not pre populate table data
| [db_name] | this can also be used with the -f --force, -n --newdb option when
| | recreating a DB. This should follow the specified DB name
-d, --delete | [db_name] | totally removes the database from the system
-e, --eliminatetables| [db_name] | drops CSM tables from the database
-f, --force | [db_name] | drops the existing tables in the DB, recreates and populates with table data
-n, --newdb | [db_name] | creates a new database with tables and populated data
-r, --removetabledata| [db_name] | removes data from all database tables
-h, --help | | help
-----------------------|-----------|-----------------------------------------------------------------------------
[Examples]
-----------------------------------------------------------------------------------------------------------------
[DEFAULT] csm_db_script.sh | |
[DEFAULT] csm_db_script.sh -x, --nodata | |
csm_db_script.sh -d, --delete | [DBNAME] |
csm_db_script.sh -e, --eliminatetables | [DBNAME] |
csm_db_script.sh -f, --force | [DBNAME] |
csm_db_script.sh -f, --force | [DBNAME] | -x, --nodata
csm_db_script.sh -n, --newdb | [DBNAME] |
csm_db_script.sh -n, --newdb | [DBNAME] | -x, --nodata
csm_db_script.sh -r, --removetabledata | [DBNAME] |
csm_db_script.sh -h, --help | |
===============================================================================================================
Note
Setting up or creating a new DB <manually>
To create your own DB¶
/opt/ibm/csm/db/csm_db_script.sh -n <my_db_name>
/opt/ibm/csm/db/csm_db_script.sh --newdb <my_db_name>
By default if no DB name is specified, then the script will
create a DB called csmdb.
Example (successful DB creation):¶
$ /opt/ibm/csm/db/csm_db_script.sh
------------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database automation script.
[Info ] PostgreSQL is installed
[Info ] csmdb database user: csmdb already exists
[Complete] csmdb database created.
[Complete] csmdb database tables created.
[Complete] csmdb database functions and triggers created.
[Complete] csmdb table data loaded successfully into csm_db_schema_version
[Complete] csmdb table data loaded successfully into csm_ras_type
[Info ] csmdb DB schema version <16.2>
------------------------------------------------------------------------------------------------------
Note
The script checks to see if the given name exists. If the database does not exist, then it will be created. If the database already exists, then the script prompts an error message indicating a database with this name already exists and exits the program.
Example (DB already exists)¶
$ /opt/ibm/csm/db/csm_db_script.sh
------------------------------------------------------------------------------------------------------
[Info ] PostgreSQL is installed
[Error ] Cannot perform action because the csmdb database already exists. Exiting.
------------------------------------------------------------------------------------------------------
- The script automatically populates data in specified tables using csv files.
For example, RAS message type data is loaded into the RAS message type table.
If a user does not want to populate these tables, then they should pass
-x, --nodata on the command line during the initial setup process.
/opt/ibm/csm/db/csm_db_script.sh -x
/opt/ibm/csm/db/csm_db_script.sh --nodata
Example (Default DB creation without loaded data option)¶
$ /opt/ibm/csm/db/csm_db_script.sh -x
------------------------------------------------------------------------------------------------------
[Info ] PostgreSQL is installed
[Info ] csmdb database user: csmdb already exists
[Complete] csmdb database created.
[Complete] csmdb database tables created.
[Complete] csmdb database functions and triggers created.
[Info ] csmdb skipping data load process. <----------[when running the -x, --nodata option]
[Complete] csmdb initialized csm_db_schema_version data
[Info ] csmdb DB schema version <16.2>
------------------------------------------------------------------------------------------------------
Existing DB Options¶
Note
There are some other features in this script that will assist users in a “clean-up” process. If the database already exists, then these actions will work.
- Delete the database
/opt/ibm/csm/db/csm_db_script.sh -d <my_db_name>
/opt/ibm/csm/db/csm_db_script.sh --delete <my_db_name>
Example (Delete existing DB)¶
$ /opt/ibm/csm/db/csm_db_script.sh -d csmdb
------------------------------------------------------------------------------------------------------
[Info ] PostgreSQL is installed
[Info ] This will drop csmdb database including all tables and data. Do you want to continue [y/n]?y
[Complete] csmdb database deleted
------------------------------------------------------------------------------------------------------
- Remove just data from all the tables
/opt/ibm/csm/db/csm_db_script.sh -r <my_db_name>
/opt/ibm/csm/db/csm_db_script.sh --removetabledata <my_db_name>
Example (Remove data from DB tables)¶
$ /opt/ibm/csm/db/csm_db_script.sh -r csmdb
------------------------------------------------------------------------------------------------------
[Info ] PostgreSQL is installed
[Complete] csmdb database data deleted from all tables excluding csm_schema_version and
csm_db_schema_version_history tables
------------------------------------------------------------------------------------------------------
- Force a total overwrite of the database <drops tables and recreates them>.
/opt/ibm/csm/db/csm_db_script.sh -f <my_db_name>
/opt/ibm/csm/db/csm_db_script.sh --force <my_db_name> (which auto populates table data).
Example (Force DB recreation)¶
$ /opt/ibm/csm/db/csm_db_script.sh -f csmdb
------------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database automation script.
[Info ] PostgreSQL is installed
[Info ] csmdb database user: csmdb already exists
[Complete] csmdb database tables and triggers dropped
[Complete] csmdb database functions dropped
[Complete] csmdb database tables recreated.
[Complete] csmdb database functions and triggers recreated.
[Complete] csmdb table data loaded successfully into csm_db_schema_version
[Complete] csmdb table data loaded successfully into csm_ras_type
[Info ] csmdb DB schema version <16.2>
------------------------------------------------------------------------------------------------------
4. Force a total overwrite of the database <drops tables and recreates them without prepopulated data>.
/opt/ibm/csm/db/csm_db_script.sh -f <my_db_name> -x
/opt/ibm/csm/db/csm_db_script.sh --force <my_db_name> --nodata (which does not populate table data).
Example (Force DB recreation without preloaded table data)¶
$ /opt/ibm/csm/db/csm_db_script.sh -f csmdb -x
------------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database automation script.
[Info ] PostgreSQL is installed
[Info ] csmdb database user: csmdb already exists
[Complete] csmdb database tables and triggers dropped
[Complete] csmdb database functions dropped
[Complete] csmdb database tables recreated.
[Complete] csmdb database functions and triggers recreated.
[Complete] csmdb skipping data load process.
[Complete] csmdb table data loaded successfully into csm_db_schema_version
[Info ] csmdb DB schema version <16.2>
------------------------------------------------------------------------------------------------------
CSMDB user info.¶
5. The "csmdb" user will remain in the system unless an admin manually deletes it.
If the user has to be deleted for any reason, the admin can run this command inside a psql postgres DB connection: DROP USER csmdb.
If any current database sessions are running with this user, then the admin will get a response similar to the example below:
ERROR: database "csmdb" is being accessed by other users
DETAIL: There is 1 other session using the database.
Warning
It is not recommended to delete the csmdb user.
su - postgres
psql -t -q -U postgres -d postgres -c "DROP USER csmdb;"
psql -t -q -U postgres -d postgres -c "CREATE USER csmdb;"
Note
The command below can be executed if specific privileges are needed.
psql -t -q -U postgres -d postgres -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO csmdb"
Note
If the admin wants to change the ownership of the DB, then use one of the commands below (for example, to change the owner to postgres, or back to csmdb).
ALTER DATABASE csmdb OWNER TO postgres
ALTER DATABASE csmdb OWNER TO csmdb
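For example, following the psql pattern shown earlier for managing the csmdb user (a minimal sketch; run as a role with rights to alter the database):
su - postgres
# Hand ownership of the csmdb database to the postgres role
psql -t -q -U postgres -d postgres -c "ALTER DATABASE csmdb OWNER TO postgres;"
# Or give ownership back to the csmdb role
psql -t -q -U postgres -d postgres -c "ALTER DATABASE csmdb OWNER TO csmdb;"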
Please see the log file for details:
/var/log/ibm/csm/csm_db_script.log
Using csm_db_stats.sh script¶
This script gathers statistical information related to the CSM DB, including table data activity, index related information, table lock monitoring, the CSM DB schema version, DB connection stats, DB user stats, and the PostgreSQL version installed.
Note
For a quick overview of the script functionality,
/opt/ibm/csm/db/csm_db_stats.sh -h
/opt/ibm/csm/db/csm_db_stats.sh --help
This help command <-h, --help> will specify each of the options available to use.
The csm_db_stats.sh
script creates a log file for each query executed. (Please see the log file for details): /var/log/ibm/csm/csm_db_stats.log
Usage Overview¶
| Options | Command | Result |
|---|---|---|
| Table data activity | ./csm_db_stats.sh -t <my_db_name> or ./csm_db_stats.sh --tableinfo <my_db_name> | see details below |
| Index related information | ./csm_db_stats.sh -i <my_db_name> or ./csm_db_stats.sh --indexinfo <my_db_name> | see details below |
| Index analysis information | ./csm_db_stats.sh -x <my_db_name> or ./csm_db_stats.sh --indexanalysis <my_db_name> | see details below |
| Table Locking Monitoring | ./csm_db_stats.sh -l <my_db_name> or ./csm_db_stats.sh --lockinfo <my_db_name> | see details below |
| Schema Version Query | ./csm_db_stats.sh -s <my_db_name> or ./csm_db_stats.sh --schemaversion <my_db_name> | see details below |
| DB connections stats Query | ./csm_db_stats.sh -c <my_db_name> or ./csm_db_stats.sh --connectionsdb <my_db_name> | see details below |
| DB user stats query | ./csm_db_stats.sh -u <my_db_name> or ./csm_db_stats.sh --usernamedb <my_db_name> | see details below |
| PostgreSQL Version Installed | ./csm_db_stats.sh -v csmdb or ./csm_db_stats.sh --postgresqlversion csmdb | see details below |
| DB Archiving Stats | ./csm_db_stats.sh -a csmdb or ./csm_db_stats.sh --archivecount csmdb | see details below |
| Help | ./csm_db_stats.sh -h or ./csm_db_stats.sh --help | see details below |
Example (usage)¶
-bash-4.2$ ./csm_db_stats.sh --help
=================================================================================================
[Info ] csm_db_stats.sh : List/Kill database user sessions
[Usage] csm_db_stats.sh : [OPTION]... [DBNAME]
-------------------------------------------------------------------------------------------------
Argument | DB Name | Description
-------------------------|-----------|-----------------------------------------------------------
-t, --tableinfo | [db_name] | Populates Database Table Stats:
| | Live Row Count, Inserts, Updates, Deletes, and Table Size
-i, --indexinfo | [db_name] | Populates Database Index Stats:
| | tablename, indexname, num_rows, tbl_size, ix_size, uk,
| | num_scans, tpls_read, tpls_fetched
-x, --indexanalysis | [db_name] | Displays the index usage analysis
-l, --lockinfo | [db_name] | Displays any locks that might be happening within the DB
-s, --schemaversion | [db_name] | Displays the current CSM DB version
-c, --connectionsdb | [db_name] | Displays the current DB connections
-u, --usernamedb | [db_name] | Displays the current DB user names and privileges
-v, --postgresqlversion | [db_name] | Displays the current version of PostgreSQL installed
| | along with environment details
-a, --archivecount | [db_name] | Displays the archived and non archive record counts
-h, --help | | help
-------------------------|-----------|-----------------------------------------------------------
[Examples]
-------------------------------------------------------------------------------------------------
csm_db_stats.sh -t, --tableinfo [dbname] | Database table stats
csm_db_stats.sh -i, --indexinfo [dbname] | Database index stats
csm_db_stats.sh -x, --indexanalysisinfo [dbname] | Database index usage analysis stats
csm_db_stats.sh -l, --lockinfo [dbname] | Database lock stats
csm_db_stats.sh -s, --schemaversion [dbname] | Database schema version (CSM_DB only)
csm_db_stats.sh -c, --connectionsdb [dbname] | Database connections stats
csm_db_stats.sh -u, --usernamedb [dbname] | Database user stats
csm_db_stats.sh -v, --postgresqlversion [dbname] | Database (PostgreSQL) version
csm_db_stats.sh -a, --archivecount [dbname] | Database archive stats
csm_db_stats.sh -h, --help [dbname] | Help menu
=================================================================================================
1. Table data activity¶
/opt/ibm/csm/db/csm_db_stats.sh -t <my_db_name>
/opt/ibm/csm/db/csm_db_stats.sh --tableinfo <my_db_name>
Example (Query details)¶
| Column_Name | Description |
|---|---|
| tablename | table name |
| live_row_count | current row count in the CSM_DB |
| insert_count | number of rows inserted into each of the tables |
| update_count | number of rows updated in each of the tables |
| delete_count | number of rows deleted in each of the tables |
| table_size | table size |
Note
This query will display information related to the CSM DB tables (or another specified DB). The query displays results based on the insert, update, and delete counts being > 0; if there is no data in a particular table, it is omitted from the results.
Example (DB Table info.)¶
-bash-4.2$ ./csm_db_stats.sh -t csmdb
--------------------------------------------------------------------------------------------------
relname | live_row_count | insert_count | update_count | delete_count | table_size
-----------------------+----------------+--------------+--------------+--------------+------------
csm_db_schema_version | 1 | 1 | 0 | 0 | 8192 bytes
csm_gpu | 4 | 4 | 0 | 0 | 8192 bytes
csm_hca | 2 | 2 | 0 | 0 | 8192 bytes
csm_node | 2 | 2 | 0 | 0 | 8192 bytes
csm_ras_type | 4 | 4 | 0 | 0 | 8192 bytes
csm_ras_type_audit | 4 | 4 | 0 | 0 | 8192 bytes
(6 rows)
--------------------------------------------------------------------------------------------------
3. Index Analysis Usage Information¶
/opt/ibm/csm/db/csm_db_stats.sh -x <my_db_name>
/opt/ibm/csm/db/csm_db_stats.sh --indexanalysis <my_db_name>
Example (Query details)¶
| Column_Name | Description |
|---|---|
| relname | table name |
| too_much_seq | case when seq_scan - idx_scan > 0 |
| case | indicates "Missing Index?" or "OK" |
| rel_size | on-disk size of the table, in bytes |
| seq_scan | number of sequential scans initiated on this table |
| idx_scan | number of index scans initiated on this index |
Note
This query checks whether more sequential scans are being performed than index scans. The results display the relname, too_much_seq, case, rel_size, seq_scan, and idx_scan columns. This query helps with analyzing database index usage.
Example (Indexes Usage)¶
-bash-4.2$ ./csm_db_stats.sh -x csmdb
--------------------------------------------------------------------------------------------------
relname | too_much_seq | case | rel_size | seq_scan | idx_scan
------------------------------+--------------+----------------+-------------+----------+----------
csm_step_node | 16280094 | Missing Index? | 245760 | 17438931 | 1158837
csm_allocation_history | 3061025 | Missing Index? | 57475072 | 3061787 | 762
csm_allocation_state_history | 3276 | Missing Index? | 35962880 | 54096 | 50820
csm_vg_history | 1751 | Missing Index? | 933888 | 1755 | 4
csm_vg_ssd_history | 1751 | Missing Index? | 819200 | 1755 | 4
csm_ssd_history | 1749 | Missing Index? | 1613824 | 1755 | 6
csm_dimm_history | 1652 | Missing Index? | 13983744 | 1758 | 106
csm_gpu_history | 1645 | Missing Index? | 24076288 | 1756 | 111
csm_hca_history | 1643 | Missing Index? | 8167424 | 1754 | 111
csm_ras_event_action | 1549 | Missing Index? | 263143424 | 1854 | 305
csm_node_state_history | 401 | Missing Index? | 78413824 | 821 | 420
csm_node_history | -31382 | OK | 336330752 | 879 | 32261
csm_ras_type_audit | -97091 | OK | 98304 | 793419 | 890510
csm_step_history | -227520 | OK | 342327296 | 880 | 228400
csm_vg_ssd | -356574 | OK | 704512 | 125588 | 482162
csm_vg | -403370 | OK | 729088 | 86577 | 489947
csm_hca | -547463 | OK | 1122304 | 1 | 547464
csm_ras_type | -942966 | OK | 81920 | 23 | 942989
csm_ssd | -1242433 | OK | 1040384 | 85068 | 1327501
csm_step_node_history | -1280913 | OK | 2865987584 | 49335 | 1330248
csm_allocation_node_history | -1664023 | OK | 21430599680 | 887 | 1664910
csm_gpu | -2152044 | OK | 5996544 | 1 | 2152045
csm_dimm | -2239777 | OK | 7200768 | 118280 | 2358057
csm_allocation_node | -52187077 | OK | 319488 | 1727675 | 53914752
csm_node | -78859700 | OK | 2768896 | 127214 | 78986914
(25 rows)
--------------------------------------------------------------------------------------------------
4. Table Lock Monitoring¶
/opt/ibm/csm/db/csm_db_stats.sh -l <my_db_name>
/opt/ibm/csm/db/csm_db_stats.sh --lockinfo <my_db_name>
Example (Query details)¶
| Column_Name | Description |
|---|---|
| blocked_pid | Process ID of the server process holding or awaiting this lock, or null if the lock is held by a prepared transaction. |
| blocked_user | The user that is being blocked. |
| current_or_recent_statement_in_blocking_process | The query statement that is displayed as a result. |
| state_of_blocking_process | Current overall state of this backend. |
| blocking_duration | The difference between the current time and when the blocking query began. |
| blocking_pid | Process ID of this backend. |
| blocking_user | The user that is blocking other transactions. |
| blocked_statement | The query statement that is displayed as a result. |
| blocked_duration | The difference between the current time and when the blocked query began. |
Example (Lock Monitoring)¶
-bash-4.2$ ./csm_db_stats.sh -l csmdb
-[ RECORD 1 ]-----------------------------------+--------------------------------------------------------------
blocked_pid | 38351
blocked_user | postgres
current_or_recent_statement_in_blocking_process | update csm_processor set status='N' where serial_number=3;
state_of_blocking_process                       | active
blocking_duration                               | 01:01:11.653697
blocking_pid                                    | 34389
blocking_user                                   | postgres
blocked_statement                               | update csm_processor set status='N' where serial_number=3;
blocked_duration | 00:01:09.048478
---------------------------------------------------------------------------------------------------------------
Note
This query displays relevant information related to lock monitoring. It will display the current blocked and blocking rows affected along with each duration. A systems administrator can run the query and evaluate what is causing the results of a “hung” procedure and determine the possible issue.
5. DB schema Version Query¶
/opt/ibm/csm/db/csm_db_stats.sh -s <my_db_name>
/opt/ibm/csm/db/csm_db_stats.sh --schemaversion <my_db_name>
Example (Query details)¶
| Column_Name | Description |
|---|---|
| version | This provides the CSM DB version that is currently being used. |
| create_time | This column indicates when the database was created. |
| comment | This column indicates the "current version" as a comment. |
Example (DB Schema Version)¶
-bash-4.2$ ./csm_db_stats.sh -s csmdb
-------------------------------------------------------------------------------------
version | create_time | comment
---------+----------------------------+-----------------
16.2 | 2018-04-04 09:41:57.784378 | current_version
(1 row)
-------------------------------------------------------------------------------------
Note
This query provides the current database version the system is running along with its creation time.
6. DB Connections with details¶
/opt/ibm/csm/db/csm_db_stats.sh -c <my_db_name>
/opt/ibm/csm/db/csm_db_stats.sh --connectionsdb <my_db_name>
Example (Query details)¶
| Column_Name | Description |
|---|---|
| pid | Process ID of this backend. |
| dbname | Name of the database this backend is connected to. |
| username | Name of the user logged into this backend. |
| backend_start | Time when this process was started, i.e., when the client connected to the server. |
| query_start | Time when the currently active query was started, or if state is not active, when the last query was started. |
| state_change | Time when the state was last changed. |
| wait | True if this backend is currently waiting on a lock. |
| query | Text of this backend's most recent query. If state is active this field shows the currently executing query. In all other states, it shows the last query that was executed. |
Example (database connections)¶
-bash-4.2$ ./csm_db_stats.sh -c csmdb
-----------------------------------------------------------------------------------------------------------------------------------------------------------
pid | dbname | usename | backend_start | query_start | state_change | wait | query
-------+--------+----------+----------------------------+----------------------------+----------------------------+------+---------------------------------
61427 | xcatdb | xcatadm | 2017-11-01 13:42:53.931094 | 2017-11-02 10:15:04.617097 | 2017-11-02 10:15:04.617112 | f | DEALLOCATE
| | | | | | | dbdpg_p17050_384531
61428 | xcatdb | xcatadm | 2017-11-01 13:42:53.932721 | 2017-11-02 10:15:04.616291 | 2017-11-02 10:15:04.616313 | f | SELECT 'DBD::Pg ping test'
55753 | csmdb | postgres | 2017-11-02 10:15:06.619898 | 2017-11-02 10:15:06.620889 | 2017-11-02 10:15:06.620891 | f |
| | | | | | | SELECT pid,datname AS dbname,
| | | | | | | usename,backend_start, q.
| | | | | | |.uery_start, state_change,
| | | | | | | waiting AS wait,query FROM pg.
| | | | | | |._stat_activity;
(3 rows)
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Note
This query will display information about the database connections that are in use on the system. The pid (Process ID), database name, user name, backend start time, query start time, state change, waiting status, and query will display statistics about the current database activity.
7. PostgreSQL users with details¶
/opt/ibm/csm/db/csm_db_stats.sh -u <my_db_name>
/opt/ibm/csm/db/csm_db_stats.sh --usernamedb <my_db_name>
Example (Query details)¶
| Column_Name | Description |
|---|---|
| rolname | Role name. |
| rolsuper | Role has superuser privileges (t/f). |
| rolinherit | Role automatically inherits privileges of roles it is a member of (t/f). |
| rolcreaterole | Role can create more roles (t/f). |
| rolcreatedb | Role can create databases (t/f). |
| rolcatupdate | Role can update system catalogs directly. (Even a superuser cannot do this unless this column is true) (t/f). |
| rolcanlogin | Role can log in. That is, this role can be given as the initial session authorization identifier (t/f). |
| rolreplication | Role is a replication role. That is, this role can initiate streaming replication and set/unset the system backup mode using pg_start_backup and pg_stop_backup (t/f). |
| rolconnlimit | For roles that can log in, this sets the maximum number of concurrent connections this role can make. -1 means no limit. |
| rolpassword | Not the password (always reads as ********). |
| rolvaliduntil | Password expiry time (only used for password authentication); null if no expiration. |
| rolconfig | Role-specific defaults for run-time configuration variables. |
| oid | ID of role. |
Example (DB users with details)¶
-bash-4.2$ ./csm_db_stats.sh -u postgres
-----------------------------------------------------------------------------------------------------------------------------------
rolname | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcatupdate | rolcanlogin | rolreplication | rolconnlimit | rolpassword | rolvaliduntil | rolconfig | oid
----------+----------+------------+---------------+-------------+--------------+-------------+----------------+--------------+-------------+---------------+-----------+--------
postgres | t | t | t | t | t | t | t | -1 | ******** | | | 10
xcatadm | f | t | f | f | f | t | f | -1 | ******** | | | 16385
root | f | t | f | f | f | t | f | -1 | ******** | | | 16386
csmdb | f | t | f | f | f | t | f | -1 | ******** | | | 704142
(4 rows)
-----------------------------------------------------------------------------------------------------------------------------------
Note
This query will display specific information related to the users that are currently in the postgres database. These fields will appear in the query: rolname, rolsuper, rolinherit, rolcreaterole, rolcreatedb, rolcatupdate, rolcanlogin, rolreplication, rolconnlimit, rolpassword, rolvaliduntil, rolconfig, and oid. See below for details.
8. PostgreSQL Version Installed¶
/opt/ibm/csm/db/csm_db_stats.sh -v <my_db_name>
/opt/ibm/csm/db/csm_db_stats.sh --postgresqlversion <my_db_name>
| Column_Name | Description |
|---|---|
| version | The current PostgreSQL version installed on the system, along with other environment details. |
Example (PostgreSQL Version)¶
-bash-4.2$ ./csm_db_stats.sh -v csmdb
-------------------------------------------------------------------------------------------------
version
-------------------------------------------------------------------------------------------------
PostgreSQL 9.2.18 on powerpc64le-redhat-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-9), 64-bit
(1 row)
-------------------------------------------------------------------------------------------------
Note
This query provides the current version of PostgreSQL installed on the system along with environment details.
9. DB Archiving Stats¶
/opt/ibm/csm/db/csm_db_stats.sh -a <my_db_name>
/opt/ibm/csm/db/csm_db_stats.sh --archivecount <my_db_name>
Example (Query details)¶
Column_Name | Description |
table_name |
Table name. |
total_rows |
Total Rows in DB. |
not_archived |
Total rows not archived in the DB. |
archived |
Total rows archived in the DB. |
last_archive_time |
Last archived process time. |
Warning
This query could take several minutes to execute depending on the total size of each table.
Example (DB archive count with details)¶
-bash-4.2$ ./csm_db_stats.sh -a csmdb
---------------------------------------------------------------------------------------------------
table_name | total_rows | not_archived | archived | last_archive_time
-------------------------------+------------+--------------+----------+----------------------------
csm_allocation_history | 94022 | 0 | 94022 | 2018-10-09 16:00:01.912545
csm_allocation_node_history | 73044162 | 0 | 73044162 | 2018-10-09 16:00:02.06098
csm_allocation_state_history | 281711 | 0 | 281711 | 2018-10-09 16:01:03.685959
csm_config_history | 0 | 0 | 0 |
csm_db_schema_version_history | 2 | 0 | 2 | 2018-10-03 10:38:45.294172
csm_diag_result_history | 12 | 0 | 12 | 2018-10-03 10:38:45.379335
csm_diag_run_history | 8 | 0 | 8 | 2018-10-03 10:38:45.464976
csm_dimm_history | 76074 | 0 | 76074 | 2018-10-03 10:38:45.550827
csm_gpu_history | 58773 | 0 | 58773 | 2018-10-03 10:38:47.486974
csm_hca_history | 23415 | 0 | 23415 | 2018-10-03 10:38:50.574223
csm_ib_cable_history | 0 | 0 | 0 |
csm_lv_history | 0 | 0 | 0 |
csm_lv_update_history | 0 | 0 | 0 |
csm_node_history | 536195 | 0 | 536195 | 2018-10-09 14:10:40.423458
csm_node_state_history | 966991 | 0 | 966991 | 2018-10-09 15:30:40.886846
csm_processor_socket_history | 0 | 0 | 0 |
csm_ras_event_action | 1115253 | 0 | 1115253 | 2018-10-09 15:30:50.514246
csm_ssd_history | 4723 | 0 | 4723 | 2018-10-03 10:39:47.963564
csm_ssd_wear_history | 0 | 0 | 0 |
csm_step_history | 456080 | 0 | 456080 | 2018-10-09 16:01:05.797751
csm_step_node_history | 25536362 | 0 | 25536362 | 2018-10-09 16:01:06.216121
csm_switch_history | 0 | 0 | 0 |
csm_switch_inventory_history | 0 | 0 | 0 |
csm_vg_history | 4608 | 0 | 4608 | 2018-10-03 10:44:25.837201
csm_vg_ssd_history | 4608 | 0 | 4608 | 2018-10-03 10:44:26.047599
(25 rows)
---------------------------------------------------------------------------------------------------
Note
This query provides statistical information related to the DB archiving count and processing time.
Using csm_ras_type_script.sh¶
This script is for importing or removing records in the csm_ras_type table.
The csm_db_ras_type_script.sh creates a log file:
/var/log/ibm/csm/csm_db_ras_type_script.log
Note
The csm_ras_type table is pre-populated and contains the description and details for each of the possible RAS event types. This may change over time, and new message types can be imported into the table. When the script is run, a temp table is created and the csv file data is appended to the current records in the csm_ras_type table. If any duplicate (key) values exist in the process, they are dismissed and the rest of the records are imported. A total record count is displayed and logged, along with the resulting live csm_ras_type count and the csm_ras_type_audit count.
- A complete cleanse of the csm_ras_type table may also need to take place. If this step is necessary, then the script can be run with the -r option. A "y/n" prompt will be displayed to the admin to ensure this execution is really what they want. If the n option is selected, then the process is aborted and the results are logged accordingly.
Usage Overview¶
/opt/ibm/csm/db/csm_db_ras_type_script.sh -h
/opt/ibm/csm/db/csm_db_ras_type_script.sh --help
Note
This help command (-h, --help
) will specify each of the options available to use.
Example (Usage)¶
-bash-4.2$ ./csm_db_ras_type_script.sh -h
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
=================================================================================================
[Info ] csm_db_ras_type_script.sh : Load/Remove data from csm_ras_type table
[Usage] csm_db_ras_type_script.sh : [OPTION]... [DBNAME]... [CSV_FILE]
-------------------------------------------------------------------------------------------------
Argument | DB Name | Description
-------------------------|-----------|-----------------------------------------------------------
-l, --loaddata | [db_name] | Imports CSV data to csm_ras_type table (appends)
| | Live Row Count, Inserts, Updates, Deletes, and Table Size
-r, --removedata | [db_name] | Removes all records from the csm_ras_type table
-h, --help | | help
-------------------------|-----------|-----------------------------------------------------------
[Examples]
-------------------------------------------------------------------------------------------------
csm_db_ras_type_script.sh -l, --loaddata [dbname] | [csv_file_name]
csm_db_ras_type_script.sh -r, --removedata [dbname] |
csm_db_ras_type_script.sh -h, --help [dbname] |
=================================================================================================
Importing records into csm_ras_type table (manually)¶
- To import data to the csm_ras_type table:
/opt/ibm/csm/db/csm_db_ras_type_script.sh -l my_db_name csv_file_name
/opt/ibm/csm/db/csm_db_ras_type_script.sh --loaddata my_db_name csv_file_name
(where my_db_name is the name of your DB and csv_file_name is the RAS type csv file).
Note
The script will check to see if the given database exists. If the database does not exist, then it will exit with an error message.
Example (non DB existence):¶
-bash-4.2$ ./csm_db_ras_type_script.sh -l csmdb csm_ras_type_data.csv
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
[Info ] csm_ras_type_data.csv file exists
[Info ] PostgreSQL is installed
[Error ] Cannot perform action because the csmdb database does not exist. Exiting.
-------------------------------------------------------------------------------------
Note
Make sure PostgreSQL is installed on the system.
Example (non csv_file_name existence):¶
-bash-4.2$ ./csm_db_ras_type_script.sh -l csmdb csm_ras_type_data_file.csv
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
[Error ] File csm_ras_type_data_file.csv can not be located or doesnt exist
[Info ] Please choose another file or check path
-------------------------------------------------------------------------------------
Note
Make sure the latest csv file exists in the appropriate working directory
Example (successful execution):¶
-bash-4.2$ ./csm_db_ras_type_script.sh -l csmdb csm_ras_type_data.csv
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database ras type automation script.
[Info ] csm_ras_type_data.csv file exists
[Info ] PostgreSQL is installed
[Warning ] This will load and or update csm_ras_type table data into csmdb database. Do you want to continue [y/n]?
[Info ] User response: y
[Info ] csm_ras_type record count before script execution: 520
[Info ] Record import count from csm_ras_type_data.csv: 737
[Info ] Record update count from csm_ras_type_data.csv: 5
[Info ] Total csm_ras_type insert count from file: 217
[Info ] csm_ras_type live row count after script execution: 737
[Info ] csm_ras_type_audit live row count: 742
[Info ] Database: csmdb csv upload process complete for csm_ras_type table.
------------------------------------------------------------------------------------
Removing records from csm_ras_type table (manually)¶
- The script will remove records from the csm_ras_type table when the (-r, --removedata) option is used. A prompt message will appear and the admin has the ability to choose "y/n". Each of the logging messages will be logged accordingly.
/opt/ibm/csm/db/csm_db_ras_type_script.sh -r my_db_name (where my_db_name is the name of your DB).
Example (successful execution):
-bash-4.2$ ./csm_db_ras_type_script.sh -r csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database ras type automation script.
[Info ] PostgreSQL is installed
[Warning ] This will drop csm_ras_type table data from csmdb database. Do you want to continue [y/n]?
[Info ] User response: y
[Info ] Record delete count from the csm_ras_type table: 737
[Info ] csm_ras_type live row count: 0
[Info ] csm_ras_type_audit live row count: 1479
[Info ] Data from the csm_ras_type table has been successfully removed
------------------------------------------------------------------------------------
- The script will remove records from the csm_ras_type table and repopulate it when a csv file is given after the db_name, again using the (-r, --removedata) option. A prompt message will appear and the admin has the ability to choose "y/n". Each of the logging messages will be logged accordingly.
/opt/ibm/csm/db/csm_db_ras_type_script.sh -r my_db_name <ras_csv_file> (where my_db_name is the name of your DB and <ras_csv_file> is the csv file name).
Example (successful execution):
-bash-4.2$ ./csm_db_ras_type_script.sh -r csmdb csm_ras_type_data.csv
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database ras type automation script.
[Info ] PostgreSQL is installed
[Info ] csm_ras_type_data.csv file exists
[Warning ] This will drop csm_ras_type table data from csmdb database. Do you want to continue [y/n]?
[Info ] User response: y
[Info ] Record delete count from the csm_ras_type table: 520
[Info ] csm_ras_type live row count: 0
[Info ] csm_ras_type_audit live row count: 1040
[Info ] Data from the csm_ras_type table has been successfully removed
------------------------------------------------------------------------------------
[Info ] csm_ras_type record count before script execution: 0
[Info ] Record import count from csm_ras_type_data.csv: 737
[Info ] Total csm_ras_type insert count from file: 737
[Info ] csm_ras_type live row count after script execution: 1777
[Info ] csm_ras_type_audit live row count:
[Info ] Database: csmdb csv upload process complete for csm_ras_type table.
------------------------------------------------------------------------------------
Example (unsuccessful execution):¶
-bash-4.2$ ./csm_db_ras_type_script.sh -r csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
[Info ] PostgreSQL is installed
[Warning ] This will drop csm_ras_type table data from csmdb database. Do you want to continue [y/n]?
[Info ] User response: n
[Info ] Data removal from the csm_ras_type table has been aborted
-------------------------------------------------------------------------------------
Big Data Store¶
CAST supports the integration of the ELK stack as a Big Data solution. Support for this solution is bundled in the csm-big-data rpm in the form of suggested configurations and support scripts.
Configuration order is not strictly enforced for the ELK stack, however, this resource generally assumes the components of the stack are installed in the following order:
- Elasticsearch
- Kibana
- Logstash
This installation order minimizes the likelihood of improperly ingested data being stored in elasticsearch.
Warning
If the index mappings are not created properly, timestamp data may be improperly stored. If this occurs, the user will need to reindex the data to fix the problem. Please read the elasticsearch section carefully before ingesting data.
Attention
It is recommended to review Common Big Data Store Problems before installing the stack.
Elasticsearch¶
Elasticsearch is a distributed analytics and search engine and the core component of the ELK stack. Elasticsearch ingests structured data (typically JSON or key value pairs) and stores the data in distributed index shards.
In the CAST design, the more Elasticsearch nodes the better. Generally speaking, nodes with attached storage or large numbers of drives are preferred.
Configuration¶
Note
This guide has been tested using Elasticsearch 6.3.2, the latest RPM may be downloaded from the Elastic Site.
The following is a brief introduction to the installation and configuration of the elasticsearch service. It is generally assumed that elasticsearch is to be installed on multiple Big Data Nodes to take advantage of the distributed nature of the service. Additionally, in the CAST configuration data drives are assumed to be JBOD.
CAST provides a set of sample configuration files in the repository at csm_big_data/elasticsearch/
If the ibm-csm-bds-1.4.0-*.noarch.rpm
rpm has been installed, the sample configurations may be found
in /opt/ibm/csm/bigdata/elasticsearch/.
- Install the elasticsearch rpm and java 1.8.1+ (command run from directory with elasticsearch rpm):
yum install -y elasticsearch-*.rpm java-1.8.*-openjdk
- Copy the Elasticsearch configuration files to the /etc/elasticsearch directory. It is recommended that the system administrator review these configurations at this phase.
  - jvm.options: JVM options for the Elasticsearch service.
  - elasticsearch.yml: Configuration of the service specific attributes, please see elasticsearch.yml for details.
- Make an ext4 filesystem on each hard drive designated to be in the Elasticsearch JBOD. The mounted names for these file systems should match the names specified in path.data. Additionally, these mounted file systems should be owned by the elasticsearch user and in the elasticsearch group (a hedged example follows this procedure).
- Start Elasticsearch:
systemctl enable elasticsearch
systemctl start elasticsearch
- Run the index template creator script:
/opt/ibm/csm/bigdata/elasticsearch/createIndices.sh
Note
This is technically optional, however, data will have limited use. This script configures Elasticsearch to properly parse timestamps.
Elasticsearch should now be operational. If Logstash was properly configured there should already be data being written to your index.
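As a rough sketch of the JBOD filesystem step above (the device name and mount point are illustrative assumptions; use the drives and path.data entries configured for your node):
# Create an ext4 filesystem on one of the JBOD drives
mkfs.ext4 /dev/sdb
# Mount it at a directory that matches an entry in path.data
mkdir -p /data/elasticsearch0
mount /dev/sdb /data/elasticsearch0
# The elasticsearch service must own the data directory
chown -R elasticsearch:elasticsearch /data/elasticsearch0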
Tuning Elasticsearch¶
The process of tuning and configuring Elasticsearch is heavily dependent on the volume and type of data ingested into the Big Data Store. Due to the nuance of this process, it is STRONGLY recommended that the system administrator familiarize themselves with Configuring Elasticsearch.
The following document outlines the defaults and recommendations of CAST in the configuration of the Big Data Store.
elasticsearch.yml¶
Note
The following section outlines CAST's recommendations for the Elasticsearch configuration. It is STRONGLY recommended that the system administrator familiarize themselves with Configuring Elasticsearch.
The Elasticsearch configuration sample shipped by CAST marks fields that need to be set by a system administrator. A brief rundown of the fields to modify is as follows:
| Field | Description |
|---|---|
| cluster.name | The name of the cluster. Nodes may only join clusters with the name in this field. Generally it's a good idea to give this a descriptive name. |
| node.name | The name of the node in the elasticsearch cluster. CAST defaults to ${HOSTNAME}. |
| path.log | The logging directory, needs elasticsearch read write access. |
| path.data | A comma separated listing of data directories, needs elasticsearch read write access. CAST recommends a JBOD model where each disk has a file system. |
| network.host | The address to bind the Elasticsearch model to. CAST defaults to _site_. |
| http.port | The port to bind Elasticsearch to. CAST defaults to 9200. |
| discovery.zen.ping.unicast.hosts | A list of nodes likely to be active, comma delimited array. CAST defaults to cast.elasticsearch.nodes. |
| discovery.zen.minimum_master_nodes | Number of nodes with the node.master setting set to true that must be connected to before starting. Elasticsearch recommends (master_eligible_nodes/2)+1. |
| gateway.recover_after_nodes | Number of nodes to wait for before beginning recovery after a cluster-wide restart. |
| xpack.ml.enabled | Enables/disables the Machine Learning utility in xpack; this should be disabled on ppc64le installations. |
| xpack.security.enabled | Enables/disables security in elasticsearch. |
| xpack.license.self_generated.type | Sets the license of xpack for the cluster; if the user has no license it should be set to basic. |
jvm.options¶
The configuration file for the Elasticsearch JVM. The supplied settings are CAST's recommendation; however, the efficacy of these settings entirely depends on your elasticsearch node.
Generally speaking the only field to be changed is the heap size:
-Xms[HEAP MIN]
-Xmx[HEAP MAX]
Indices¶
Elasticsearch Templates: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/templates/cast-*.json |
CAST has specified a suite of data mappings for use in separate indices. Each of these indices is documented below, with a JSON mapping file provided in the repository and rpm.
CAST uses cast-<class>-<description>-<date>
naming schema for indices to leverage templates when creating
the indices in Elasticsearch. The class is one of the three primary classifications determined
by CAST: log, counters, environmental. The description is typically a one to two word description
of the type of data: syslog, node, mellanox-event, etc.
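For example, a day's worth of syslog data would land in an index named along the lines of cast-log-syslog-2018.10.09 (an illustrative name; the exact date suffix depends on how the data aggregator formats it).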
A collection of templates is provided in ibm-csm-bds-1.4.0-*.noarch.rpm
which sets up aliases and data type mappings.
These templates do not set sharding or replication factors, as these settings should be tuned to
the user’s data retention and index sizing needs.
The specified templates match indices generated in the data aggregators documentation. As different data sources produce different volumes of data in different environments, this document will make no recommendation on sharding or replication.
Note
These templates may be found on the git repo at csm_big_data/elasticsearch/mappings/templates
.
Note
CAST has elected to use lowercase and - characters to separate words. This is not mandatory for your index naming and creation.
scripts¶
Elasticsearch Index Scripts: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/ |
CAST provides a set of scripts which allow the user to easily manipulate the elasticsearch indices from the command line.
createIndices.sh¶
A script for initializing the templates defined by CAST. When executed it will attempt to target the elasticsearch server running on ${HOSTNAME}:9200. If the user supplies either a hostname or ip address this will be targeted in lieu of ${HOSTNAME}. This script need only be run once on a node in the elasticsearch cluster.
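A couple of illustrative invocations are sketched below; the address is a placeholder for an Elasticsearch node in your cluster:

# Target the Elasticsearch instance on this host.
/opt/ibm/csm/bigdata/elasticsearch/createIndices.sh

# Target a specific Elasticsearch node instead.
/opt/ibm/csm/bigdata/elasticsearch/createIndices.sh 10.7.4.13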
removeIndices.sh¶
A script for removing all elasticsearch templates created by createIndices.sh. When executed it will attempt to target the elasticsearch server running on ${HOSTNAME}:9200. If the user supplies either a hostname or ip address this will be targeted in lieu of ${HOSTNAME}. This script need only be run once on a node in the elasticsearch cluster.
reindexIndices.py¶
A tool for performing in place reindexing of an elasticsearch index.
Warning
This script should only be used to reindex a handful of indices at a time as it is slow and can result in partial reindexing.
usage: reindexIndices.py [-h] [-t hostname:port]
[-i [index-pattern [index-pattern ...]]]
A tool for reindexing a list of elasticsearch indices, all indices will be
reindexed in place.
optional arguments:
-h, --help show this help message and exit
-t hostname:port, --target hostname:port
An Elasticsearch server to reindex indices on. This
defaults to the contents of environment variable
"CAST_ELASTIC".
-i [index-pattern [index-pattern ...]], --indices [index-pattern [index-pattern ...]]
A list of indices to reindex, this should use the
index pattern format.
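As an illustration, assuming the script is installed alongside the other index scripts, reindexing last July's syslog indices against a specific Elasticsearch node might look like the following (the address and index pattern are placeholders):

/opt/ibm/csm/bigdata/elasticsearch/reindexIndices.py -t 10.7.4.13:9200 -i "cast-log-syslog-2018.07.*"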
cast-log¶
Elasticsearch Templates: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/templates/cast-log*.json |
The cast-log- indices represent a set of logging indices produced by CAST-supported data sources.
cast-log-syslog¶
alias: | cast-log-syslog |
---|
The syslog index is designed to capture generic syslog messages. The contents of the syslog index are considered by CAST to be the most useful data points for syslog analysis. CAST supplies both an rsyslog template and a Logstash pattern; for details on these configurations please consult the data aggregators documentation.
The mapping for the index contains the following fields:
Field | Type | Description |
---|---|---|
@timestamp | date | The timestamp of the message, generated by the syslog utility. |
host | text | The hostname of the relay host. |
hostname | text | The hostname of the syslog origination. |
program_name | text | The name of the program which generated the log. |
process_id | long | The process id of the program which generated the log. |
severity | text | The severity level of the log. |
message | text | The body of the message. |
tags | text | Tags containing additional metadata about the message. |
Note
Currently mmfs and CAST logs will be stored in the syslog index (due to similarity of the data mapping).
cast-log-mellanox-event¶
alias: | cast-log-mellanox-event |
---|
The mellanox event log is a superset of the cast-log-syslog index, an artifact of the event log being transmitted through syslog. In the CAST Big Data Pipeline this log will be ingested and parsed by the Logstash service then transmitted to the Elasticsearch index.
Field | Type | Description |
---|---|---|
@timestamp | date | When the message was written to the event log. |
hostname | text | The hostname of the ufm aggregating the events. |
program_name | text | The name of the generating program, should be event_log |
process_id | long | The process id of the program which generated the log. |
severity | text | The severity level of the log, pulled from message. |
message | text | The body of the message (unstructured). |
log_counter | long | A counter tracking the log number. |
event_id | long | The unique identifier for the event in the mellanox event log. |
event_type | text | The type of event (e.g. HARDWARE) in the event log. |
category | text | The categorization of the error in the event log typing |
tags | text | Tags containing additional metadata about the message. |
cast-log-console¶
alias: | cast-log-console |
---|
CAST recommends the usage of the goconserver bundled in the xCAT dependencies, documented in xCat-GoConserver. Configuration of the goconserver should be performed on the xCAT service nodes in the cluster. CAST has created a limited configuration guide (see the Console section under Data Aggregation); please consult it for a basic rundown of the utility.
The mapping for the console index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | When the console event occurred. |
type | text | The type of the event (typically console). |
message | text | The console event data, typically a console line. |
hostname | text | The hostname generating the console. |
tags | text | Tags containing additional metadata about the console log. |
cast-csm¶
Elasticsearch Templates: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/templates/cast-csm*.json |
The cast-csm- indices represent a set of metric indices produced by CSM. Indices matching this pattern will be created unilaterally by the CSM Daemon. Typically records in this type of index are generated by the Aggregator Daemon.
cast-csm-dimm-env¶
alias: | cast-csm-dimm-env |
---|
The mapping for the cast-csm-dimm-env index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the dimm environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (csm-dimm-env). |
source | text | The source of the counters. |
data.dimm_id | long | The id of dimm being aggregated. |
data.dimm_temp | long | The temperature of the dimm. |
data.dimm_temp_max | long | The max temperature of the dimm over the collection period. |
data.dimm_temp_min | long | The min temperature of the dimm over the collection period. |
cast-csm-gpu-env¶
alias: | cast-csm-gpu-env |
---|
The mapping for the cast-csm-gpu-env index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the gpu environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (csm-gpu-env). |
source | text | The source of the counters. |
data.gpu_id | long | The id of the GPU record being aggregated. |
data.gpu_mem_temp | long | The memory temperature of the GPU. |
data.gpu_mem_temp_max | long | The max memory temperature of the GPU over the collection period. |
data.gpu_mem_temp_min | long | The min memory temperature of the GPU over the collection period. |
data.gpu_temp | long | The temperature of the GPU. |
data.gpu_temp_max | long | The max temperature of the GPU over the collection period. |
data.gpu_temp_min | long | The min temperature of the GPU over the collection period. |
cast-csm-node-env¶
alias: | cast-csm-node-env |
---|
The mapping for the cast-csm-node-env index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the node environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (csm-node-env). |
source | text | The source of the counters. |
data.system_energy | long | The energy of the system at ingestion time. |
cast-csm-gpu-counters¶
alias: | cast-csm-gpu-counters |
---|
A listing of DCGM counters.
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the gpu environment counters. |
Note
The data fields have been separated for compactness.
Data Field | Type | Description |
---|---|---|
nvlink_recovery_error_count_l1 | long | Total number of NVLink recovery errors. |
sync_boost_violation | long | Throttling duration due to sync-boost constraints (in us) |
gpu_temp | long | GPU temperature (in C). |
nvlink_bandwidth_l2 | long | Total number of NVLink bandwidth counters. |
dec_utilization | long | Decoder utilization. |
nvlink_recovery_error_count_l2 | long | Total number of NVLink recovery errors. |
nvlink_bandwidth_l1 | long | Total number of NVLink bandwidth counters. |
mem_copy_utilization | long | Memory utilization. |
gpu_util_samples | double | GPU utilization sample count. |
nvlink_replay_error_count_l1 | long | Total number of NVLink retries. |
nvlink_data_crc_error_count_l1 | long | Total number of NVLink data CRC errors. |
nvlink_replay_error_count_l0 | long | Total number of NVLink retries. |
nvlink_bandwidth_l0 | long | Total number of NVLink bandwidth counters. |
nvlink_data_crc_error_count_l3 | long | Total number of NVLink data CRC errors. |
nvlink_flit_crc_error_count_l3 | long | Total number of NVLink flow-control CRC errors. |
nvlink_bandwidth_l3 | long | Total number of NVLink bandwidth counters. |
nvlink_replay_error_count_l2 | long | Total number of NVLink retries. |
nvlink_replay_error_count_l3 | long | Total number of NVLink retries. |
nvlink_data_crc_error_count_l0 | long | Total number of NVLink data CRC errors. |
nvlink_recovery_error_count_l0 | long | Total number of NVLink recovery errors. |
enc_utilization | long | Encoder utilization. |
power_usage | double | Power draw (in W). |
nvlink_recovery_error_count_l3 | long | Total number of NVLink recovery errors. |
nvlink_data_crc_error_count_l2 | long | Total number of NVLink data CRC errors. |
nvlink_flit_crc_error_count_l2 | long | Total number of NVLink flow-control CRC errors. |
serial_number | text | The serial number of the GPU. |
power_violation | long | Throttling duration due to power constraints (in us). |
xid_errors | long | Value of the last XID error encountered. |
gpu_utilization | long | GPU utilization. |
nvlink_flit_crc_error_count_l0 | long | Total number of NVLink flow-control CRC errors. |
nvlink_flit_crc_error_count_l1 | long | Total number of NVLink flow-control CRC errors. |
mem_util_samples | double | The sample rate of the memory utilization. |
thermal_violation | long | Throttling duration due to thermal constraints (in us). |
cast-counters¶
Elasticsearch Templates: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/templates/cast-counters*.json |
A class of index representing counter aggregation from non-CSM data flows. Generally, indices following this naming pattern contain data from standalone data aggregation utilities.
cast-counters-gpfs¶
alias: | cast-counters-gpfs |
---|
A collection of counter data from gpfs. The script outlined in the data aggregators documentation leverages zimon to perform the collection. The following is the index generated by the default script bundled in the CAST rpm.
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the gpu environment counters. |
Note
The data fields have been separated for compactness.
Data Field | Type | Description |
---|---|---|
cpu_system | long | The system space usage of the CPU. |
cpu_user | long | The user space usage of the CPU. |
mem_active | long | Active memory usage. |
gpfs_ns_bytes_read | long | Networked bytes read. |
gpfs_ns_bytes_written | long | Networked bytes written. |
gpfs_ns_tot_queue_wait_rd | long | Total time spent waiting in the network queue for read operations. |
gpfs_ns_tot_queue_wait_wr | long | Total time spent waiting in the network queue for write operations. |
cast-counters-ufm¶
alias: | cast-counters-ufm |
---|
Due to the wide variety of counters that may be gathered, checking the data aggregation script is strongly recommended.
The mapping for the cast-counters-ufm index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the ufm environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (cast-counters-ufm). |
source | text | The source of the counters. |
cast-db¶
CSM history tables are archived in Elasticsearch as separate indices. CAST provides a document on configuring CSM database data archival (see Database Archiving under Data Aggregation).
The mapping shared between the indices is as follows:
Field | Type | Description |
---|---|---|
@timestamp | date | When the archival event occurred. |
tags | text | Tags about the archived data. |
type | text | The originating table, drives index assignment. |
data | doc | The mapping of table columns, contents differ for each table. |
Attention
These indices will match CSM database history tables; contents are not replicated here for brevity.
cast-ibm-crassd-bmc-alerts¶
While not managed by CAST, crassd will ship BMC alerts to the big data store.
Kibana¶
Kibana is an open-source data visualization tool used in the ELK stack.
CAST provides a utility plugin for multistep searches of CSM jobs in Kibana dashboards.
Configuration¶
Note
This guide has been tested using Kibana 6.3.2; the latest RPM may be downloaded from the Elastic Site.
The following is a brief introduction to the installation and configuration of the Kibana service.
At the current time CAST does not provide a configuration file in its RPM.
- Install the Kibana rpm:
yum install -y kibana-*.rpm
- Configure the Kibana YAML file (/etc/kibana/kibana.yml)
CAST recommends the following four values be set before starting Kibana (a sample snippet follows the table):
Setting | Description | Sample Value |
---|---|---|
server.host | The address the kibana server will bind on, needed for external access. | “10.7.4.30” |
elasticsearch.url | The URL of an elasticsearch service, this should include the port number (9200 by default). | “http://10.7.4.13:9200” |
xpack.security.enabled | The xpack security setting, set to false if not being used. | false |
xpack.ml.enabled | Sets the status of xpack Machine Learning. Please note this must be set to false on ppc64le installations. | false |
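A sketch of the corresponding entries in /etc/kibana/kibana.yml is shown below; the addresses are illustrative placeholders for your own Kibana and Elasticsearch nodes:

server.host: "10.7.4.30"
elasticsearch.url: "http://10.7.4.13:9200"
xpack.security.enabled: false
xpack.ml.enabled: false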
- Install the CAST Search rpm:
rpm -ivh ibm-csm-bds-kibana-*.noarch.rpm
- Start Kibana:
systemctl enable kibana.service
systemctl start kibana.service
Kibana should now be running and fully featured. Searches may now be performed on the Discover tab.
CAST Search¶
CAST Search is a React plugin designed for interfacing with Elasticsearch and building filters for Kibana Dashboards. To maximize the value of the plugin the cast-allocation index pattern should be specified.
Logstash¶
Logstash is an open-source data processing pipeline used in the ELK stack. The core function of this service is to process unstructured data, typically syslogs, and then pass the newly structured text to the elasticsearch service.
Typically, in the CAST design, the Logstash service is run on the service nodes in the xCAT infrastructure. This design reduces the number of servers communicating with each instance of Logstash, distributing the workload. xCAT service nodes have failover capabilities, removing the need for HAProxy instances to reduce the risk of data loss. Finally, using the service nodes reduces the total cost of the Big Data Cluster, as a dedicated node for data processing is no longer needed.
CAST provides an event correlator for Logstash to assist in the generation of RAS events for specific messages.
Configuration¶
Note
This guide has been tested using Logstash 6.3.2; the latest RPM may be downloaded from the Elastic Site.
The following is a brief introduction to the installation and configuration of the logstash service.
CAST provides a set of sample configuration files in the repository at csm_big_data/logstash/.
If the ibm-csm-bds-1.4.0-*.noarch.rpm
rpm has been installed the sample configurations may be found
in /opt/ibm/csm/bigdata/logstash/.
- Install the logstash rpm and java 1.8.1+ (command run from directory with logstash rpm):
yum install -y logstash-*.rpm java-1.8.*-openjdk
Copy the Logstash pipeline configuration files to the appropriate directories.
This step is ultimately optional; however, it is recommended that these files be reviewed and modified by the system administrator at this phase (an example copy sequence is sketched after the table and note below):
Target file | Repo Dir | RPM Dir
---|---|---
logstash.yml (see note) | config/ | config/
jvm.options | config/ | config/
conf.d/logstash.conf | config/ | config/
patterns/ibm_grok.conf | patterns/ | patterns/
patterns/mellanox_grok.conf | patterns/ | patterns/
patterns/events.yml | patterns/ | patterns/
Note
Target files are relative to /etc/logstash. Repo Directories are relative to csm_big_data/logstash. RPM Directories are relative to /opt/ibm/csm/bigdata/logstash/.
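Assuming the files are taken from the RPM install location, the copy might look like the following sketch (adjust the source directory if copying from the repository checkout instead; logstash.yml is covered by the note below):

mkdir -p /etc/logstash/patterns
cp /opt/ibm/csm/bigdata/logstash/config/jvm.options          /etc/logstash/
cp /opt/ibm/csm/bigdata/logstash/config/logstash.conf        /etc/logstash/conf.d/
cp /opt/ibm/csm/bigdata/logstash/patterns/ibm_grok.conf      /etc/logstash/patterns/
cp /opt/ibm/csm/bigdata/logstash/patterns/mellanox_grok.conf /etc/logstash/patterns/
cp /opt/ibm/csm/bigdata/logstash/patterns/events.yml         /etc/logstash/patterns/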
Note
The conf.d/logstash.conf file requires the ELASTIC-INSTANCE field be replaced with your cluster’s Elasticsearch nodes.
Note
logstash.yml is not shipped with this version of the RPM; please use the following configuration for Logstash.
# logstash.yml
---
path.data: /var/lib/logstash
path.config: /etc/logstash/conf.d/*conf
path.logs: /var/log/logstash
pipeline.workers: 2
pipeline.batch.size: 2000 # This is the MAXIMUM, to prevent exceedingly long waits a delay is supplied.
pipeline.batch.delay: 50 # Maximum time to wait to execute an underfilled queue in milliseconds.
queue.type: persisted
...
- Install the CSM Event Correlator
rpm -ivh ibm-csm-bds-logstash*.noarch.rpm
Note
This change is effective in the 1.3.0 release of the CAST rpms.
Please refer to CSM Event Correlator for more details.
Note
The bin directory is relative to your logstash install location.
- Start Logstash:
systemctl enable logstash
systemctl start logstash
Logstash should now be operational. At this point data aggregators should be configured to point to your Logstash node as appropriate.
Tuning Logstash¶
Tuning logstash is highly dependent on your use case and environment. What follows is a set of recommendations based on the research and experimentation of the CAST Big Data team.
Useful resources for learning more about profiling and tuning Logstash are available in the official Logstash documentation.
logstash.yml¶
This configuration file specifies details about the Logstash service:
- Path locations (as a rule of thumb these files should be owned by the logstash user).
- Pipeline details (e.g. workers, threads, etc.)
- Logging levels.
For more details please refer to the Logstash settings file documentation.
jvm.options¶
The configuration file for the Logstash JVM. The supplied settings are CAST’s recommendation, however, the efficacy of these settings entirely depends on your Logstash node.
logstash.conf¶
The logstash.conf is the core configuration file for determining the behavior of the Logstash pipeline in the default CAST configuration. This configuration file is split into three components: input, filter and output.
input¶
The input section defines how the pipeline may ingest data. In the CAST sample only the tcp input plugin is used. CAST currently uses different ports to assign tagging to facilitate simpler filter configuration. For a more in depth description of this section please refer to the configuration file structure in the official Logstash documentation.
The default ports and data tagging are as follows:
Default Port Values | |
---|---|
Tag | Port Number |
syslog | 10515 |
json_data | 10522 |
transactions | 10523 |
filter¶
The filter section defines the data enrichment step of the pipeline. In the CAST sample the following operations are performed:
- Unstructured events are parsed with the grok utility.
- Timestamps are reformatted (as needed).
- Events with JSON formatting are parsed.
- CSM Event Correlator is invoked on properly ingested logs.
Generally speaking, care must be taken in this section to leverage branch prediction. Additionally, a poorly formed grok pattern can significantly slow pipeline performance. Please consult the configuration file structure in the official Logstash documentation for more details.
output¶
The output section defines the target for the data processed through the pipeline. In the CAST sample the elasticsearch plugin is used, for more details please refer to the linked documentation.
The user must replace _ELASTIC_IP_PORT_LIST_ with a comma delimited list of hostname:port string pairs referring to the nodes in the elasticsearch cluster. Generally, if using the default configuration, the port should be 9200. An example of this configuration is as follows:
hosts => [ "10.7.4.14:9200", "10.7.4.15:9200", "10.7.4.19:9200" ]
grok¶
Logstash provides a grok utility to perform regular expression pattern recognition and extraction. When writing grok patterns several rules of thumb are recommended by the CAST team (an illustrative pattern follows the list):
- Profile your patterns, Do you grok Grok? discusses a mechanism for profiling.
- Grok failure can be expensive, use anchors (^ and $) to make string matches precise to reduce failure costs.
- _groktimeout tagging can set an upper bound time limit for grok operations.
- Avoid DATA and GREEDYDATA if possible.
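The following sketch is not part of the shipped CAST patterns; it contrasts an unanchored, greedy pattern with an anchored, precise one for a hypothetical log line of the form "ERROR 42 disk offline":

filter {
  grok {
    # Avoid: an unanchored pattern that leans on GREEDYDATA; non-matching
    # lines fail slowly because grok tries many partial matches first.
    # match => { "message" => "%{WORD:level} %{GREEDYDATA:body}" }

    # Prefer: anchor the pattern and use precise sub-patterns so failures
    # are detected quickly and the extracted fields are typed.
    match => { "message" => "^%{LOGLEVEL:level} %{INT:code:int} %{WORD:component} %{WORD:state}$" }
  }
}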
CSM Event Correlator¶
CSM Event Correlator (CEC) is the CAST solution for event correlation in the logstash pipeline. CEC is written in ruby to leverage the existing Logstash plugin system. At its core CEC is a pattern matching engine using grok to handle pattern matching.
A sample configuration of CEC is provided as the events.yml file described in the Configuration section of the document.
There’s an extensive asciidoc for usage of the CSM Event Correlator plugin. The following documentation is an abridged version.
Common Big Data Store Problems¶
The following document outlines some common sources of error for the Big Data Store and how to best resolve the described issues.
Timestamps¶
Timestamps are generally the number one source of problems in the ELK Stack. This is due to a wide variety of precisions and timestamp formats that may come from different data sources.
Elasticsearch will try its best to parse dates, as outlined in the ELK Date documentation. If a date doesn’t match the default formats (a usual culprit is epoch time or microseconds) the administrator will need to take action.
CAST has two prescribed resolution patterns for this problem: fixing the timestamps in Elasticsearch (through index templates) and fixing the timestamps in Logstash (through the date filter plugin).
The administrator may apply one or both resolution patterns to resolve the issue.
Attention
CSM will generally attempt to ship timestamps in the correct format; however, Elasticsearch will only automatically parse up to millisecond precision. The default ISO 8601 format of PostgreSQL has precision up to microseconds, requiring postgres-generated timestamps to use a parsing strategy.
Note
If any indices have been populated with data not interpreted as dates, those indices will need to be reindexed.
Fixing Timestamps in Elasticsearch¶
This is the preferred methodology for resolving issues in the timestamp. CAST supplies
a utility in ibm-csm-bds-1.4.0-*.noarch.rpm
for generating mappings that fix the timestamps in
data sources outlined in Data Aggregation.
The index mapping script is present at /opt/ibm/csm/bigdata/elasticsearch/createIndices.sh. When executed the script will make a request to the Elasticsearch server (determined by the input to the script) which creates all of the mappings defined in the /opt/ibm/csm/bigdata/elasticsearch/templates directory. If the user wishes to clear existing templates/mappings the /opt/ibm/csm/bigdata/elasticsearch/removeIndices.sh is provided to delete indices made through the creation script.
If adding a new index, the following steps should be taken to repair timestamps or any other invalid data types on a per index or index pattern basis:
Create a json file to store the mapping. CAST recommends naming the file <template-name>.json
Populate the file with configuration settings.
{ "index_patterns": ["<NEW INDEX PATTERN>"], "order" : 0, "settings" : { "number_of_shards" : <SHARDING COUNT>, "number_of_replicas" : <REPLICA COUNT> }, "mappings" : { "_doc": { "properties" : { "<SOME TIMESTAMP>" : { "type" : "date" }, }, "dynamic_date_formats" : [ "strict_date_optional_time|yyyy/MM/dd HH:mm:ss Z|| yyyy/MM/dd Z||yyyy-MM-dd HH:mm:ss.SSSSSS"] } } }
Attention
The dynamic_date_formats section is most relevant to the context of this entry.
Note
To resolve timestamps with microseconds (e.g. postgres timestamps) yyyy-MM-dd HH:mm:ss.SSSSSS serves as a sample.
Ship the json file to elasticsearch. There are two mechanisms to achieve this:
- Place the file in the /opt/ibm/csm/bigdata/elasticsearch/templates/ directory and run
the /opt/ibm/csm/bigdata/elasticsearch/createIndices.sh script.
Curl the file to Elasticsearch.
curl -s -o /dev/null -X PUT "${HOST}:9200/_template/${template_name}?pretty" \
    -H 'Content-Type: application/json' -d "@${json_template_file}"
Attention
If the template is changed the old template must be removed first!
To remove a template the admin may either run the /opt/ibm/csm/bigdata/elasticsearch/removeIndices.sh script, which removes templates matching the file names in /opt/ibm/csm/bigdata/elasticsearch/templates/, or remove a specific template with a curl command:
curl -X DELETE "${HOST}:9200/_template/${template_name}?pretty"
The above documentation is a brief primer on how to modify templates, a powerful elasticsearch utility. If the user needs more information please consult the official elastic template documentation.
Fixing Timestamps in Logstash¶
If the elasticsearch methodology doesn’t apply to the use case, logstash timestamp manipulation might be the correct solution.
Note
The following section performs modifications to the logstash.conf file that should be placed in /etc/logstash/conf.d/logstash.conf if following the Logstash configuration documentation.
The CAST solution uses the date filter plugin to achieve these results. In the shipped configuration the following sample is provided:
if "ras" in [tags] and "csm" in [tags] {
date {
match => ["time_stamp", "ISO8601","YYYY-MM-dd HH:mm:ss.SSS" ]
target => "time_stamp"
}
}
The above sample parses the time_stamp field for the ISO 8601 standard and converts it to something that is definitely parseable by elasticsearch. For additional notes about this utility please refer to the official date filter plugin documentation.
Data Aggregation¶
Data Aggregation in CAST utilizes the logstash pipeline to process events and pass it along to Elasticsearch.
Note
In the following documentation, examples requiring replacement will be annotated with the bash style ${variable_name} and followed by an explanation of the variable.
Logs¶
The default configuration of the CAST Big Data Store has support for a number of logging types, most of which are processed through the syslog utility and then enriched by Logstash and the CAST Event Correlator.
Syslog¶
Logstash Port: | 10515 |
---|
Syslog is generally aggregated through the use of the rsyslog daemon.
Most devices are capable of producing syslogs, and it is suggested that syslogs be sent to Logstash via a redirection hierarchy: compute nodes, utility nodes, and the UFM server forward their syslogs to a service node, while service nodes, IB/Ethernet switches, and PDUs send their syslogs directly to Logstash.
Syslog Redirection¶
Warning
This step should not be performed on compute nodes in xCAT clusters!
To redirect a syslog so it is accepted by Logstash the following must be added to the /etc/rsyslog.conf file:
$template logFormat, "%TIMESTAMP:::date-rfc3339% %HOSTNAME% %APP-NAME% \
%PROCID% %syslogseverity-text% %msg%\n"
*.*;cron.none @@${logstash_node}:${syslog_port};logFormat
The rsyslog utility must then be restarted for the changes to take effect:
/bin/systemctl restart rsyslog.service
Field Description
logstash_node: | Replace with the hostname or IP address of the Logstash Server, on service nodes this is typically localhost. |
---|---|
syslog_port: | Replace with the port set in the Logstash Configuration File [ default: 10515 ]. |
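One way to spot-check the redirection, assuming the Logstash pipeline and Elasticsearch are already running, is to emit a test message with the standard logger utility and then look for it in the cast-log-syslog index:

logger -p user.err "CAST syslog redirection test"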
Format
The format of the syslog is parsed in the CAST model by Logstash. CAST provides a grok for this syslog format in the pattern list provided by the CAST repository and rpm. The grok pattern is reproduced below with the types matching directly to the types in the syslog elastic documentation.
RSYSLOGDSV ^(?m)%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:hostname} \
%{DATA:program_name} %{INT:process_id} %{DATA:severity} %{GREEDYDATA:message}$
Note
This pattern has a 1:1 relationship with the template given above and a 1:many relationship with the index data mapping. Logstash appends some additional fields for metadata analysis.
GPFS¶
To redirect the GPFS logging data to the syslog please do the following on the Management node for GPFS:
/usr/lpp/mmfs/bin/mmchconfig systemLogLevel=notice
After completing this process the gpfs log should now be forwarded to the syslog for the configured node.
Note
Refer to Syslog Redirection for gpfs log forwarding, the default syslog port is recommended (10515).
Note
The systemLogLevel
attribute will forward logs of the specified level and higher to the
syslog. It supports the following options: alert, critical, error, warning,
notice, configuration, informational, detail, and debug.
Note
This data type will inhabit the same index as the syslog documents due to data similarity.
UFM¶
Note
This document assumes that the UFM daemon is up and running on the UFM Server.
The Unified Fabric Manager (UFM) has several distinct data logs to aggregate for the big data store.
System Event Log¶
Logstash Port: | 10515 |
---|
The System Event Log will report various fabric events that occur in the UFM’s network:
- A link coming up.
- A link going down.
- UFM module problems.
A sample output showing a downed link can be seen below:
Oct 17 15:56:33 c931hsm04 eventlog[30300]: WARNING - 2016-10-17 15:56:33.245 [5744] [112]
WARNING [Hardware] IBPort [default(34) / Switch: c931ibsw-leaf01 / NA / 16]
[dev_id: 248a0703006d40f0]: Link-Downed counter delta threshold exceeded.
Threshold is 0, calculated delta is 1. Peer info: Computer: c931f03p08 HCA-1 / 1.
Note
The above example is in the Syslog format.
To send this log to the Logstash data aggregation the /opt/ufm/files/conf/gv.cfg file must be modified and /etc/rsyslog.conf should be modified as described in Syslog Redirection.
CAST recommends setting the following attributes in /opt/ufm/files/conf/gv.cfg:
[Logging]
level = INFO
syslog = true
event_syslog = true
[CSV]
write_interval = 30
ext_ports_only = yes
max_files = 10
[MonitoringHistory]
history_configured = true
Note
write_interval
and max_files
were set as a default, change these fields as needed.
After configuring /opt/ufm/files/conf/gv.cfg restart the ufm daemon.
/etc/init.d/ufmd restart
Format
CAST recommends using the same syslog format as shown in Syslog Redirection; however, the message in the case of the mellanox event log has a consistent structure which may be parsed by Logstash. The pattern and substitutions are shown below. Please note that the timestamp, severity and message fields are all overwritten from the default syslog pattern.
Please consult the event log table in the elasticsearch documentation (cast-log-mellanox-event) for details on the message fields.
MELLANOXMSG %{MELLANOXTIME:timestamp} \[%{NUMBER:log_counter}\] \[%{NUMBER:event_id}\] \
%{WORD:severity} \[%{WORD:event_type}\] %{WORD:category} %{GREEDYDATA:message}
Console¶
Note
This document is designed to configure the xCAT service nodes to ship goconserver output to logstash (written using xCAT 2.13.11).
Logstash Port: | 10522 |
---|---|
Relevant Directories: | |
/etc/goconserver
|
CSM recommends using the goconserver bundled in the xCAT dependencies and documented in xCat-GoConserver. A limited configuration guide is provided below, but for gaps or more details please refer to the xCAT read the docs.
- Install the goconserver and start it:
yum install goconserver
systemctl stop conserver.service
makegocons
- Configure the /etc/goconserver to send messages to the Logstash server associated with the service node (generally localhost):
# For options above this line refer to the xCAT read-the-docs
logger:
tcp:
- name: Logstash
host: <Logstash-Server>
port: 10522 # This is the port in the sample configuration.
timeout: 3 # Default timeout time.
- Restart the goconserver:
service goconserver restart
Format
The goconserver will now start sending data to the Logstash server in the form of JSON messages:
{
    "type" : "console",
    "message" : "c650f04p23 login: jdunham",
    "node" : "c650f04p23",
    "date" : "2018-05-08T09:49:36.530886-04"
}
The CAST logstash filter then mutates this data to properly store it in the elasticsearch backing store:
Field | New Field |
---|---|
node | hostname |
date | @timestamp |
Cumulus Switch¶
Attention
The CAST documentation was written using Cumulus Linux 3.5.2, please ensure the switch is at this level or higher.
Cumulus switch logging is performed through the usage of the rsyslog service. CAST recommends placing Cumulus logging in the cast-log-syslog indices at this time.
Configuration of the logging on the switch can be achieved through the net command:
net add syslog host ipv4 ${logstash_node} port tcp ${syslog_port}
net commit
This command will populate the /etc/rsyslog.d/11-remotesyslog.conf file with a rule to export the syslog to the supplied hostname and port. If using the default CAST syslog configuration this file will need to be modified to have the CAST syslog template:
vi /etc/rsyslog.d/11-remotesyslog.conf
$template logFormat, "%TIMESTAMP:::date-rfc3339% %HOSTNAME% %APP-NAME% %PROCID% \
%syslogseverity-text% %msg%\n"
*.*;cron.none @@${logstash_node}:${syslog_port};logFormat
sudo service rsyslog restart
Note
For more configuration details please refer to the official Cumulus Linux User Guide.
Counters¶
The default configuration of the CAST Big Data Store has support for a number of counter types, most of which are processed through Logstash and the CAST Event Correlator.
GPFS¶
In order to collect counters from the GPFS file system CAST leverages the zimon utility. A python script interacting with this utility is provided in ibm-csm-bds-1.4.0-*.noarch.rpm.
The following document assumes that the cluster's service nodes are running the pmcollector service and that any nodes requiring metrics are running pmsensors.
Collector¶
rpms: |
|
---|---|
config: | /opt/IBM/zimon/ZIMonCollector.cfg |
In the CAST architecture a pmcollector should be run on each of the service nodes in federated mode. To configure federated mode on the collector, add all of the nodes configured as collectors to /opt/IBM/zimon/ZIMonCollector.cfg; this configuration should then be propagated to all of the collector nodes in the cluster.
peers = {
host = "collector1"
port = "9085"
},
{
host = "collector2"
port = "9085"
},
{
host = "collector3"
port = "9085"
}
After configuring the collector, start and enable the pmcollector service:
systemctl start pmcollector
systemctl enable pmcollector
Sensors¶
RPMs: | gpfs.gss.pmsensors.ppc64le (Version 5.0 or greater) |
---|---|
Config: | /opt/IBM/zimon/ZIMonSensors.cfg |
It is recommended to use the GPFS managed configuration file through use of the mmperfmon command. Before setting the node to do performance monitoring it is recommended that at least the following commands be run:
/usr/lpp/mmfs/bin/mmperfmon config generate --collectors ${collectors}
/usr/lpp/mmfs/bin/mmperfmon config update GPFSNode.period=0
It is recommended to specify at least two collectors, as defined in the Collector section above. The pmsensors service will attempt to distribute the load and account for failover in the event of a downed collector.
After generating the sensor configuration the nodes must then be set to perfmon:
$ /usr/lpp/mmfs/bin/mmchnode --perfmon -N ${nodes}
Assuming /opt/IBM/zimon/ZIMonSensors.cfg has been properly distributed the sensors may then be started on the nodes.
$ systemctl start pmsensors
$ systemctl enable pmsensors
Attention
To detect failures of the power hardware the following must be prepared on the management node of the GPFS cluster.
$ vi /var/mmfs/mmsysmon/mmsysmonitor.conf
[general]
powerhw_enabled=True
$ mmsysmoncontrol restart
Python Script¶
CAST RPM: | ibm-csm-bds-1.4.0-*.noarch.rpm |
---|---|
Script Location: | |
/opt/ibm/csm/bigdata/data-aggregators/zimonCollector.py | |
Dependencies: | gpfs.base.ppc64le (Version 5.0 or greater) |
CAST provides a script for easily querying zimon, then sending the results to Big Data Store. The zimonCollector.py python script leverages the python interface to zimon bundled in the gpfs.base rpm. The help output for this script is duplicated below:
A tool for extracting zimon sensor data from a gpfs collector node and shipping it in a json
format to logstash. Intended to be run from a cron job.
Options:
Flag | Description < default >
==================================|============================================================
-h, --help | Displays this message.
--collector <host> | The hostname of the gpfs collector. <127.0.0.1>
--collector-port <port> | The collector port for gpfs collector. <9084>
--logstash <host> | The logstash instance to send the JSON to. <127.0.0.1>
--logstash-port <port> | The logstash port to send the JSON to. <10522>
--bucket-size <int> | The size of the bucket accumulation in seconds. <60>
--num-buckets <int> | The number of buckets to retrieve in the query. <10>
--metrics <Metric1[,Metric2,...]> | A comma separated list of zimon sensors to get metrics from.
| <cpu_system,cpu_user,mem_active,gpfs_ns_bytes_read,
| gpfs_ns_bytes_written,gpfs_ns_tot_queue_wait_rd,
| gpfs_ns_tot_queue_wait_wr>
CAST expects this script to be run from a service node configured for both logstash and zimon collection. In this release this script need only be executed on one service node in the cluster to gather sensor data.
The recommended cron configuration for this script is as follows:
*/10 * * * * /opt/ibm/csm/bigdata/data-aggregators/zimonCollector.py
The output of this script is a newline delimited list of JSON designed for easy ingestion by the logstash pipeline. A sample from the default script configuration is as follows:
{
"type": "zimon",
"source": "c650f99p06",
"data": {
"gpfs_ns_bytes_written": 0,
"mem_active": 1769963,
"cpu_system": 0.015,
"cpu_user": 0.004833,
"gpfs_ns_tot_queue_wait_rd": 0,
"gpfs_ns_bytes_read": 0,
"gpfs_ns_tot_queue_wait_wr": 0
},
"timestamp": 1529960640
}
In the default configuration of this script, records will be shipped as described in JSON Data Sources.
UFM¶
CAST RPM: | ibm-csm-bds-1.4.0-*.noarch.rpm |
---|---|
Script Location: | |
/opt/ibm/csm/bigdata/data-aggregators/ufmCollector.py |
CAST provides a python script to gather UFM counter data. The script is intended to be run from either a service node running logstash or the UFM node as a cron job. A description of the script from the help functionality is reproduced below:
Purpose: Simple script that is packaged with BDS. Can be run individually and
independantly when ever called upon.
Usage:
- Run the program.
- pass in parameters.
- REQUIRED [--ufm] : This tells program where UFM is (an IP address)
- REQUIRED [--logstash] : This tells program where logstash is (an IP address)
- OPTIONAL [--logstash-port] : This specifies the port for logstash
- OPTIONAL [--ufm_restAPI_args-attributes] : attributes for ufm restAPI
- CSV
Example:
- Value1
- Value1,Value2
- OPTIONAL [--ufm_restAPI_args-functions] : functions for ufm restAPI
- CSV
- OPTIONAL [--ufm_restAPI_args-scope_object] : scope_object for ufm restAPI
- single string
- OPTIONAL [--ufm_restAPI_args-interval] : interval for ufm restAPI
- int
- OPTIONAL [--ufm_restAPI_args-monitor_object] : monitor_object for ufm restAPI
- single string
- OPTIONAL [--ufm_restAPI_args-objects] : objects for ufm restAPI
- CSV
FOR ALL ufm_restAPI related arguments:
- see ufm restAPI for documentation
- json format
- program provides default value if no user provides
The recommended cron configuration for this script is as follows:
*/10 * * * * /opt/ibm/csm/bigdata/data-aggregators/ufmCollector.py
The output of this script is a newline delimited list of JSON designed for easy ingestion by the logstash pipeline. A sample from the default script configuration is as follows:
{
"type": "counters-ufm",
"source": "port2",
"statistics": {
...
},
"timestamp": 1529960640
}
In the default configuration of this script, records will be shipped as described in JSON Data Sources.
JSON Data Sources¶
Logstash Port: | 10522 |
---|---|
Required Field: | type |
Recommended Fields: | timestamp |
Attention
This section is currently a work in progress.
CAST recommends JSON data sources be shipped to Logstash to leverage the batching and data enrichment tool. The default logstash configuration shipped with CAST will designate port 10522. JSON shipped to this port should have the type field specified. This type field will be used in defining the name of the index.
Data Aggregators shipping to this port will generate indices with the following name format: cast-%{type}-%{+YYYY.MM.dd}
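As an illustration (not a shipped example), a small JSON document with the required type field could be sent to the Logstash TCP input with a tool such as netcat; ${logstash_node} is a placeholder for your Logstash server:

echo '{ "type": "my-counters", "source": "node01", "timestamp": "2018-07-12T10:00:00", "data": { "value": 42 } }' | nc ${logstash_node} 10522

Under the default configuration this document would land in an index of the form cast-my-counters-<date>.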
crassd bmc alerts¶
While not bundled with CAST, the crassd daemon is used to monitor BMC events and counters. The following document is written assuming you have access to an ibm-crassd-*.ppc64le rpm.
- Install the rpm:
yum install -y ibm-crassd-*.ppc64le.rpm
- Edit the configuration file located at /opt/ibm/ras/etc/ibm-crassd.config:
This file needs the [logstash] configuration section configured and logstash=True set in the [notify] section.
- Start crassd:
systemctl start ibm-crassd
Attention
The above section is a limited rundown of crassd configuration, for greater detail consult the official documentation for crassd.
CAST Data Sources¶
csmd
syslog¶
Logstash Port: | 10515 |
---|
CAST has enabled the boost syslog utility through use of the csmd configuration file.
"csm" : {
...
"log" : {
...
"sysLog" : true,
"server" : "127.0.0.1",
"port" : "514"
}
...
}
By default enabling syslog will write to the localhost syslog port using UDP. The target may be changed by the server and port options.
The syslog will follow the RFC 3164 syslog protocol. After being filtered through the Syslog Redirection template the log will look something like this:
2018-05-17T11:17:32-04:00 c650f03p37-mgt CAST - debug csmapi TIMING: 1525910812,17,2,1526570252507364568,1526570252508039085,674517
2018-05-17T11:17:32-04:00 c650f03p37-mgt CAST - info csmapi [1525910812]; csm_allocation_query_active_all end
2018-05-17T11:17:32-04:00 c650f03p37-mgt CAST - info csmapi CSM_CMD_allocation_query_active_all[1525910812]; Client Recv; PID: 14921; UID:0; GID:0
These logs will then be stored in the cast-log-syslog index using the default CAST configuration.
CSM Buckets¶
Logstash Port: | 10522 |
---|
CSM provides a mechanism for running buckets to aggregate environmental and counter data from a variety of sources in the cluster. This data will be aggregated and shipped by the CSM aggregator to a logstash server (typically the local logstash server).
Format
Each run of a bucket will be encapsulated in a JSON document with the following pattern:
{
"type": "type-of-record",
"source": "source-of-record",
"timestamp": "timestamp-of-record",
"data": {
...
}
}
type: | The type of the bucket, used to determine the appropriate index. |
---|---|
source: | The source of the bucket run (typically a hostname, but can depend on the bucket). |
timestamp: | The timestamp of the collection |
data: | The actual data from the bucket run, varies on bucket specification. |
Note
Each JSON document is newline delimited.
CSM Configuration¶
Compute
Refer to the Data Collection block of the CSM daemon configuration documentation for proper compute configuration.
This configuration will run data collection at specified intervals in one or more buckets. This must be configured on each compute node (compute nodes may have different buckets).
Aggregator
Refer to the BDS block of the CSM daemon configuration documentation for proper aggregator configuration.
This will ship the environmental data to the specified ip and port. Officially CAST suggests the use of logstash for this feature and suggests targeting the local logstash instance running on the service node.
Attention
For users not employing logstash in their solution the output of this feature is a newline delimited list of JSON documents formatted as seen above.
Logstash Configuration¶
CAST uses a generic port (10522) for processing data matching the JSONDataSources pattern. The default logstash configuration file specifies the following in the input section of the configuration file:
tcp {
port => 10522
codec => "json"
}
Default Buckets¶
CSM supplies several default buckets for environmental collection:
Bucket Type | Source | Description |
---|---|---|
csm-env-gpu | Hostname | Environmental counters about the node’s GPUs. |
csm-env-mem | Hostname | Environmental counters about the node’s Memory. |
Database Archiving¶
Logstash Port: | 10523 |
---|---|
Script Location: | |
/opt/ibm/csm/db/csm_db_history_archive.sh | |
Script RPM: | csm-csmdb-*.rpm |
CAST supplies a command line utility for archiving the contents of the CSM database history tables. When run, the utility (csm_db_history_archive.sh) appends the contents of all history tables and the RAS event action table to a daily JSON dump file (<table>.archive.<YYYY>-<MM>-<DD>.json). The content appended is the next n records without an archive time, as provided to the command line utility. Any records archived in this manner are then marked with an archive time for their eventual removal from the database. The utility should be executed on the node running the CSM Postgres database.
Each row archived in this way will be converted to a JSON document with the following pattern:
{
"type": "db-<table-name>",
"data": { "<table-row-contents>" }
}
type: | The table in the database, converted to index in default configuration. |
---|---|
data: | Encapsulates the row data. |
CAST recommends the use of a cron job to run this archival. The following sample runs every five minutes, gathers up to 100 unarchived records from the csmdb tables, then appends the JSON formatted records to the daily dump file in the /var/log/ibm/csm/archive directory.
$ crontab -e
*/5 * * * * /opt/ibm/csm/db/csm_db_history_archive.sh -d csmdb -n 100 -t /var/log/ibm/csm/archive
CAST recommends ingesting this data through the filebeats utility. A sample log configuration is given below:
filebeat.prospectors:
- type: log
enabled: true
paths:
- "/var/log/ibm/csm/archive/*.json"
# CAST recommends tagging all filebeats input sources.
tags: ["archive"]
Note
For the sake of brevity further filebeats configuration documentation will be omitted. Please refer to the filebeats documentation for more details.
To configure logstash to ingest the archives the beats input plugin must be used, CAST recommends port 10523 for ingesting beats records as shown below:
input
{
beats {
port => 10523
codec=>"json"
}
}
filter
{
mutate {
remove_field => [ "beat", "host", "source", "offset", "prospector"]
}
}
output
{
elasticsearch {
hosts => [<elastic-server>:<port>]
index => "cast-%{type}-%{+YYYY.MM.dd}"
http_compression =>true
document_type => "_doc"
}
}
In this sample configuration the archived history will be stored in the cast-db-<table_name> indices.
CSM Filebeat Logs¶
Logstash Port: | 10523 |
---|
Note
CSM only ships these logs to a local file; a utility such as Filebeats or a local Logstash service would be needed to ship the log to a Big Data Store.
Transaction Log¶
CAST offers a transaction log for select CSM API events. Today the following events are tracked:
- Allocation create/delete/update
- Allocation step begin/end
This transaction log represents a set of events that may be assembled to create the current state of an event in a Big Data Store.
In the CSM design these transactions are intended to be stored in a single elasticsearch index; each transaction should be identified by a uid in the index.
Each transaction record will follow the following pattern:
Format
{
"type": "<transaction-type>",
"@timestamp" : "<timestamp>",
"data": { <table-row-contents>},
"traceid":<traceid-api>,
"uid": <unique-id>
}
type: | The type of the transaction, converted to index in default configuration. |
---|---|
data: | Encapsulates the transactional data. |
traceid: | The API’s trace id as used in the CSM API trace functionality. |
uid: | A unique identifier for the record in the elasticsearch index. |
@timestamp: | The timestamp in ISO 8601. |
Allocation Metrics¶
The CSM Daemon has the ability to report special Allocation metrics on Allocation Delete operations. This data includes per gpu usage and per cpu usage metrics.
Format
{
"type": "<metric-type>",
"data": { <metric data> },
"@timestamp" : "<timestamp>"
}
type: | The type of the allocation metric, converted to index in default configuration. |
---|---|
data: | Encapsulates the allocation metric data. |
@timestamp: | The timestamp in ISO 8601. |
GPU Data Sample
{
"type":"allocation-gpu",
"source":"c650f99p18",
"@timestamp" : "4/17/2018T09:42:42Z",
"data":
{
"allocation_id":1,
"gpu_id":0,
"gpu_usage":33520365,
"max_gpu_memory":29993467904
}
}
allocation_id: | The allocation where collection occurred. |
---|---|
gpu_id: | The gpu id on the system. |
gpu_usage: | The usage of the GPU(microseconds) over the allocation. |
max_gpu_memory: | Maximum GPU memory usage over the allocation. |
CPU Data Sample
{
"type":"allocation-cpu",
"source":"c650f99p18",
"@timestamp" : "4/17/2018T09:42:42Z",
"data":
{
"allocation_id":1,
"cpu_0":777777000000,
"cpu_1":777777000001
// ...
}
}
allocation_id: | The allocation where collection occurred. |
---|---|
cpu_x: | The individual CPU usage (nanoseconds) over the allocation. |
CSM Configuration¶
To enable the transaction and allocation metrics logging mechanisms the following configuration settings must be specified in the CSM master configuration file:
"log" :
{
"transaction" : true,
"transaction_file" : "/var/log/ibm/csm/csm_transaction.log",
"transaction_rotation_size" : 1000000000,
"allocation_metrics" : true,
"allocation_metrics_file" : "/var/log/ibm/csm/csm_allocation_metrics.log",
"allocation_metrics_rotation_size" : 1000000000
}
transaction: | Enables the transaction log mechanism. |
---|---|
transaction_file: | Specifies the location the transaction log will be saved to. |
transaction_rotation_size: | The size of the file (in bytes) to rotate the log at. |
allocation_metrics: | Enables the allocation metrics log mechanism. |
allocation_metrics_file: | Specifies the location the allocation metrics log will be saved to. |
allocation_metrics_rotation_size: | The size of the file (in bytes) to rotate the log at. |
Note
Please review The log block for additional context.
Filebeats Configuration¶
CAST recommends ingesting this data through the filebeats utility. A sample log configuration is given below:
filebeat.prospectors:
- type: log
enabled: true
paths:
- /var/log/ibm/csm/csm_transaction.log
tags: ["transaction"]
- type: log
enabled: true
paths:
- /var/log/ibm/csm/csm_allocation_metrics.log
tags: ["allocation","metrics"]
Note
For the sake of brevity further filebeats configuration documentation will be omitted. Please refer to the filebeats documentation for more details.
Warning
Filebeats has some difficulty with rollover events.
Logstash Configuration¶
To configure logstash to ingest the archives the beats input plugin must be used, CAST recommends port 10523 for ingesting beats records. Please note that this configuration only creates one index for each transaction log type, this is to prevent transactions that span days from duplicating logs.
input
{
beats {
port => 10523
codec=>"json"
}
}
filter
{
mutate {
remove_field => [ "beat", "host", "source", "offset", "prospector"]
}
}
output
{
elasticsearch {
hosts => [<elastic-server>:<port>]
action => "update"
index => "cast-%{type}"
http_compression =>true
doc_as_upsert => true
document_id => "%{uid}"
document_type => "_doc"
}
}
The resulting indices for this configuration will be one per transaction type with each document corresponding to the current state of a set of transactions.
Supported Transactions¶
The following transactions currently tracked by CSM are as follows:
type | uid | data |
---|---|---|
allocation | <allocation_id> | Superset of csmi_allocation_t. Adds running-start-timestamp and running-end-timestamp. Failed allocation creates have special state: reverted. |
allocation-step | <allocation_id>-<step_id> | Direct copy of csmi_allocation_step_t. |
Beats¶
Official Documentation: | |
---|---|
Beats Reference |
Beats are a collection of open source data shippers. CAST employs a subset of these beats to facilitate data aggregation.
Filebeats¶
Official Documentation: | |
---|---|
Filebeats Reference |
Filebeats is used to ship the CSM transactional log to the big data store. It was selected for its high reliability in data transmission and existing integration in the elastic stack.
Installation¶
The following installation guide deals with configuring filebeats for the CSM transaction log; for a more generalized installation guide please consult the official Filebeats Reference.
- Install the filebeats rpm on the node:
rpm -ivh filebeat-*.rpm
- Configure the /etc/filebeat/filebeat.yml file:
CAST ships a sample configuration file in the ibm-csm-bds-*.noarch rpm at /opt/ibm/csm/bigdata/beats/config/filebeat.yml. This file is preconfigured to point at the CSM database archive files and the csm transaction logs. Users will need to replace two keywords before using this configuration:
_KIBANA_HOST_PORT_: | |
---|---|
A string containing the “hostname:port” pairing of the Kibana server. |
|
_LOGSTASH_IP_PORT_LIST_: | |
A list of “hostname:port” pairs pointing to Logstash servers to ingest the data (current CAST recommendation is a single instance of Logstash). |
- Start the filebeats service.
systemctl start filebeat.service
Filebeats should now be sending ingested data to the Logstash instances specified in the configuration file.
Python Guide¶
Elasticsearch API¶
CAST leverages the Elasticsearch API python library to interact with Elasticsearch. If the API is being run on a node with internet access the following process may be used to install this library:
pip install elasticsearch
If the node doesn’t have access to the internet please refer to the official python documentation for the installation of wheels: Installing Packages.
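The following minimal sketch shows the general shape of a query against a CAST index using the elasticsearch library; the server address, index pattern, and hostname value are illustrative placeholders rather than CAST defaults:

from elasticsearch import Elasticsearch

# Connect to an Elasticsearch node (replace with a node in your cluster).
es = Elasticsearch(["10.7.4.13:9200"])

# Search the syslog indices for messages from a particular host.
query = {
    "query": {
        "match": { "hostname": "c650f99p06" }
    }
}

results = es.search(index="cast-log-syslog-*", body=query, size=10)

for hit in results["hits"]["hits"]:
    source = hit["_source"]
    print(source.get("@timestamp"), source.get("message"))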
Big Data Use Cases¶
CAST offers a collection of use case scripts designed to interact with the Big Data Store through the elasticsearch interface.
findJobTimeRange.py¶
This use case may be considered a building block for the remaining ones. This use case demonstrates the use of the cast-allocation transactional index to get the time range of a job.
The usage of this use case is described by the --help option.
findJobKeys.py¶
This use case represents two commingled use cases. First, when supplied a job identifier (allocation id or job id) and a keyword (a case insensitive regular expression) the script will generate a listing of keywords and their occurrence rates on records associated with the supplied job. Association is filtered by the time range of the job and the hostnames that participated in the job.
A secondary use case is presented in the verbose flag, allowing the user to see a list of all entries matching the keyword.
usage: findJobKeys.py [-h] [-a int] [-j int] [-s int] [-t hostname:port]
[-k [key [key ...]]] [-v] [--size size]
[-H [host [host ...]]]
A tool for finding keywords in the "message" field during the run time of a job.
optional arguments:
-h, --help show this help message and exit
-a int, --allocationid int
The allocation ID of the job.
-j int, --jobid int The job ID of the job.
-s int, --jobidsecondary int
The secondary job ID of the job (default : 0).
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
-k [key [key ...]], --keywords [key [key ...]]
A list of keywords to search for in the Big Data
Store. Case insensitive regular expressions (default :
.*). If your keyword is a phrase (e.g. "xid 13")
regular expressions are not supported at this time.
-v, --verbose Displays any logs that matched the keyword search.
--size size The number of results to be returned. (default=30)
-H [host [host ...]], --hostnames [host [host ...]]
A list of hostnames to filter the results to (filters on the "hostname" field, job independent).
findJobsRunning.py¶
A use case for finding all jobs running at the supplied timestamp. This use case will display a list of jobs for which the start time is less than the supplied time and which have either no end time or an end time greater than the supplied time.
usage: findJobsRunning.py [-h] [-t hostname:port] [-T YYYY-MM-DDTHH:MM:SS]
[-s size] [-H [host [host ...]]]
A tool for finding jobs running at the specified time.
optional arguments:
-h, --help show this help message and exit
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
-T YYYY-MM-DDTHH:MM:SS, --time YYYY-MM-DDTHH:MM:SS
A timestamp representing a point in time to search for
all running CSM Jobs. HH, MM, SS are optional, if not
set they will be initialized to 0. (default=now)
-s size, --size size The number of results to be returned. (default=1000)
-H [host [host ...]], --hostnames [host [host ...]]
A list of hostnames to filter the results to.
findJobMetrics.py¶
Leverages the built-in Elasticsearch statistics functionality. Takes a list of fields and a job identifier, then computes the min, max, average, and standard deviation of those fields. The calculations are computed against all records for the field during the running time of the job on the nodes that participated.
This use case also has the ability to generate correlations between the fields specified.
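The statistics come from Elasticsearch aggregations; a hedged sketch of the kind of request involved is shown below, using the standard extended_stats aggregation. The index, hostname filter, and field name are taken from the mapping example later in this section and are illustrative only.
import os
from elasticsearch import Elasticsearch

es = Elasticsearch([os.environ.get("CAST_ELASTIC", "localhost:9200")])

# extended_stats returns count, min, max, avg and std_deviation for a numeric field.
body = {
    "size": 0,
    "query": {"term": {"source": "cn1"}},   # example hostname filter
    "aggs": {"mem_stats": {"extended_stats": {"field": "data.mem_active"}}}
}
response = es.search(index="cast-zimon*", body=body)
print(response["aggregations"]["mem_stats"])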
usage: findJobMetrics.py [-h] [-a int] [-j int] [-s int] [-t hostname:port]
[-H [host [host ...]]] [-f [field [field ...]]]
[-i index] [--correlation]
A tool for finding metrics about the nodes participating in the supplied job
id.
optional arguments:
-h, --help show this help message and exit
-a int, --allocationid int
The allocation ID of the job.
-j int, --jobid int The job ID of the job.
-s int, --jobidsecondary int
The secondary job ID of the job (default : 0).
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
-H [host [host ...]], --hostnames [host [host ...]]
A list of hostnames to filter the results to.
-f [field [field ...]], --fields [field [field ...]]
A list of fields to retrieve metrics for (REQUIRED).
-i index, --index index
The index to query for metrics records.
--correlation Displays the correlation between the supplied fields
over the job run.
findUserJobs.py¶
Retrieves a list of all jobs that the supplied user owned. This list can be filtered to a time range or on the state of the allocation. If the --commonnodes argument is supplied, a list of nodes is displayed containing every node that participated in more jobs than the supplied threshold. The colliding nodes are sorted by the number of jobs they participated in.
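The common-node report amounts to counting how many of the user's jobs each node appears in. A minimal sketch of that counting step is shown below; the compute_nodes field name is an assumption made for the example.
from collections import Counter

def common_nodes(allocations, threshold):
    # Count how many jobs each node appears in.
    counts = Counter()
    for allocation in allocations:
        counts.update(allocation.get("compute_nodes", []))   # assumed field name
    # Only report nodes whose collision count exceeds the threshold.
    return {node: count for node, count in counts.items() if count > threshold}

jobs = [{"compute_nodes": ["cn1", "cn2"]}, {"compute_nodes": ["cn1", "cn3"]}]
print(common_nodes(jobs, threshold=1))   # {'cn1': 2}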
usage: findUserJobs.py [-h] [-u username] [-U userid] [--size size]
[--state state] [--starttime YYYY-MM-DDTHH:MM:SS]
[--endtime YYYY-MM-DDTHH:MM:SS]
[--commonnodes threshold] [-v] [-t hostname:port]
A tool for finding a list of the supplied user's jobs.
optional arguments:
-h, --help show this help message and exit
-u username, --user username
The user name to perform the query on, either this or
-U must be set.
-U userid, --userid userid
The user id to perform the query on, either this or -u
must be set.
--size size The number of results to be returned. (default=1000)
--state state Searches for jobs matching the supplied state.
--starttime YYYY-MM-DDTHH:MM:SS
A timestamp representing the beginning of the absolute
range to look for failed jobs, if not set no lower
bound will be imposed on the search.
--endtime YYYY-MM-DDTHH:MM:SS
A timestamp representing the ending of the absolute
range to look for failed jobs, if not set no upper
bound will be imposed on the search.
--commonnodes threshold
Displays a list of nodes that the user jobs had in
common if set. Only nodes with collisions exceeding
the threshold are shown. (Default: -1)
-v, --verbose Displays all retrieved fields from the `cast-
allocation` index.
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
findWeightedErrors.py¶
An extension of the findJobKeys.py use case. This use case will query Elasticsearch for a job and then run a predefined collection of mappings to assist in debugging a problem with the job.
usage: findWeightedErrors.py [-h] [-a int] [-j int] [-s int]
[-t hostname:port] [-k [key [key ...]]] [-v]
[--size size] [-H [host [host ...]]]
[--errormap file]
A tool which takes a weighted listing of keyword searches and presents
aggregations of this data to the user.
optional arguments:
-h, --help show this help message and exit
-a int, --allocationid int
The allocation ID of the job.
-j int, --jobid int The job ID of the job.
-s int, --jobidsecondary int
The secondary job ID of the job (default : 0).
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
-v, --verbose Displays the top --size logs matching the --errormap mappings.
--size size The number of results to be returned. (default=10)
-H [host [host ...]], --hostnames [host [host ...]]
A list of hostnames to filter the results to.
--errormap file A map of errors to scan the user jobs for, including
weights.
JSON Mapping Format¶
This use case utilizes a JSON mapping to define a collection of keywords and values to query the Elasticsearch cluster for. These values can leverage the native Elasticsearch boost feature to apply weights to the mappings, allowing a user to quickly identify high priority items through scoring.
The format is defined as follows:
[
{
"category" : "A category, used for tagging the search in output. (Required)",
"index" : "Matches an index on the elasticsearch cluster, uses elasticsearch syntax. (Required)",
"source" : "The hostname source in the index.",
"mapping" : [
{
"field" : "The field in the index to check against(Required)",
"value" : "A value to query for; can be a phrase, regex or number. (Required)",
"boost" : "The elasticsearch boost factor, may be thought of as a weight. (Required)",
"threshold" : "A range comparison operator: 'gte', 'gt', 'lte', 'lt'. (Optional)"
}
]
}
]
When applied to a real configuration a mapping file will look something like this:
[
{
"index" : "*syslog*",
"source" : "hostname",
"category": "Syslog Errors" ,
"mapping" : [
{
"field" : "message",
"value" : "error",
"boost" : 50
},
{
"field" : "message",
"value" : "kdump",
"boost" : 60
},
{
"field" : "message",
"value" : "kernel",
"boost" : 10
}
]
},
{
"index" : "cast-zimon*",
"source" : "source",
"category" : "Zimon Counters",
"mapping" : [
{
"field" : "data.mem_active",
"value" : 12000000,
"boost" : 100,
"threshold" : "gte"
},
{
"field" : "data.cpu_system",
"value" : 10,
"boost" : 200,
"threshold" : "gte"
}
]
}
]
Note
The above configuration was designed for demonstration purposes; it is recommended that users create their own mappings based on this example.
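To make the boost and threshold fields concrete, the sketch below shows one way a single mapping entry could be translated into an Elasticsearch query clause. This is only an illustration of the idea, not the script's actual implementation.
def mapping_to_clause(entry):
    # A threshold entry becomes a boosted range query, otherwise a boosted match query.
    field, value, boost = entry["field"], entry["value"], entry["boost"]
    if "threshold" in entry:
        return {"range": {field: {entry["threshold"]: value, "boost": boost}}}
    return {"match": {field: {"query": value, "boost": boost}}}

print(mapping_to_clause({"field": "data.mem_active", "value": 12000000,
                         "boost": 100, "threshold": "gte"}))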
UFM Collector¶
A tool for interacting with the UFM collector is provided in the ibm-csm-bds-1.4.0-*.noarch.rpm rpm.
This script performs 3 key operations:
- Connects to the UFM monitoring snapshot RESTful interface.
  - This connection specifies a collection of attributes and functions to execute against the interface.
- Processes and enriches the output of the REST connection.
  - Adds a type, timestamp, and source field to the root of the JSON document.
- Opens a socket to a target Logstash instance and writes the payload (a sketch of this step is shown below).
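A sketch of that final step, writing an enriched JSON document to a Logstash TCP input, might look like the following. The target hostname, port, and field values are illustrative; the port simply matches the TCP input example shown later in this document.
import json
import socket
from datetime import datetime

# Illustrative enriched payload with type, timestamp and source added to the root.
payload = {
    "type": "ufm-counters",                        # assumed type label
    "timestamp": datetime.utcnow().isoformat(),
    "source": "ufm-server",                        # assumed source label
    "data": {"port_xmit_data": 12345},             # example counter
}

# Write the payload to a Logstash TCP input (hostname and port are examples).
with socket.create_connection(("logstash-host", 10515)) as sock:
    sock.sendall((json.dumps(payload) + "\n").encode("utf-8"))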
CSM Event Correlator Filter Plugin¶
Attention
The following document is a work in progress! The CSM Event Correlator is currently under development and the interface is subject to change.
Parses arbitrary text and structures the results of the parse into actionable events.
The CSM Event Correlator is a utility by which a system administrator may specify a collection of patterns (grok style), grouping by context (e.g. syslog, event log, etc.), which trigger actions (ruby scripts).
Installation¶
The CSM Event Correlator comes bundled in the ibm-csm-bds-logstash-1.4.0-*.noarch.rpm
rpm.
When installing the rpm, any old versions of the plugin will be removed and the bundled version
will be installed.
CSM Event Correlator Pipeline Configuration Options¶
This plugin supports the following configuration options:
Setting | Input type | Required |
---|---|---|
events_dir | string | No |
patterns_dir | array | No |
named_captures_only | boolean | No |
Please refer to common-options for options supported in all Logstash filter plugins.
This plugin is intended to be used in the filter block of the logstash configuration file. A sample configuration is reproduced below:
filter {
csm_event_correlator {
events_dir => "/etc/logstash/patterns/events.yml"
patterns_dir => "/etc/logstash/patterns/*.conf"
}
}
events_dir¶
Value type: | string |
---|---|
Default value: | /etc/logstash/conf.d/events.yml |
The configuration file for the event correlator, see CSM Event Correlator Event Configuration File for details on the contents of this file.
This file is loaded on pipeline creation.
Attention
This field will use an array in future iterations to specify multiple configuration files. This change should not impact existing configurations.
patterns_dir¶
Value type: | array |
---|---|
Default value: | [] |
A directory, file or filepath with a glob. The listing of files will be parsed for grok patterns which may be used in writing patterns for event correlation. If no glob is specified in the path * is used.
Configuration with a file glob:
patterns_dir => "/etc/logstash/patterns/*.conf" # Retrieves all .conf files in the directory.
Configuration with multiple files:
patterns_dir => ["/etc/logstash/patterns/mellanox_grok.conf", "/etc/logstash/patterns/ibm_grok.conf"]
CSM Event Correlator will load the default Logstash patterns regardless of the contents of this field.
Pattern files are plain text with the following format:
NAME PATTERN
For example:
GUID [0-9a-f]{16}
The patterns are loaded on pipeline creation.
named_captures_only¶
Value type: | boolean |
---|---|
Default value: | true |
If true, only captures that have been named are stored by grok. Anonymous captures are considered named.
CSM Event Correlator Event Configuration File¶
CSM Event Correlator uses a YAML file for configuration. The YAML configuration is
hierarchical with 3 major groupings:
This is a sample configuration of this file:
---
# Metadata
ras_create_url: "/csmi/V1.0/ras/event/create"
csm_target: "localhost"
csm_port: 4213
data_sources:
# Data Sources
syslog:
ras_location: "syslogHostname"
ras_timestamp: "timestamp"
event_data: "message"
category_key: "programName"
categories:
# Categories
NVRM:
- tag: "XID_GENERIC"
pattern: "Xid(%{DATA:pciLocation}): %{NUMBER:xid:int},"
ras_msg_id: "gpu.xid.%{xid}"
action: 'unless %{xid}.between?(1, 81); ras_msg_id="gpu.xid.unknown" end; .send_ras;'
mlx5_core:
- tag: "IB_CABLE_PLUG"
pattern: "mlx5_core %{MLX5_PCI}.*module %{NUMBER:module}, Cable (?<cableEvent>(un)?plugged)"
ras_msg_id: "ib.connection.%{cableEvent}"
action: ".send_ras;"
mmsysmon:
- tag: "MMSYSMON_CLEAN_MOUNT"
pattern: "filesystem %{NOTSPACE:filesystem} was (?<mountEvent>(un)?mounted)"
ras_msg_id: "spectrumscale.fs.%{mountEvent}"
action: ".send_ras;"
- tag: "MMSYSMON_UNMOUNT_FORCED"
pattern: "filesystem %{NOTSPACE:filesystem} was.*forced.*unmount"
ras_msg_id: "spectrumscale.fs.unmount_forced"
action: ".send_ras;"
...
Metadata¶
The metadata section may be thought of as global configuration options that will apply to all events in the event correlator.
Field | Input type | Required |
---|---|---|
ras_create_url | string | Yes <Initial Release> |
csm_target | string | Yes <Initial Release> |
csm_port | integer | Yes <Initial Release> |
data_sources | map | Yes |
ras_create_url¶
Value type: | string |
---|---|
Sample value: | /csmi/V1.0/ras/event/create |
Specifies the REST create resource on the node running the CSM REST Daemon. This path will be used by the .send_ras; utility.
Attention
In a future release /csmi/V1.0/ras/event/create will be the default value.
csm_target¶
Value type: | string |
---|---|
Sample value: | 127.0.0.1 |
A server running the CSM REST daemon. This server will be used to generate ras events with the .send_ras; utility.
Attention
In a future release 127.0.0.1 will be the default value.
csm_port¶
Value type: | integer |
---|---|
Sample value: | 4213 |
The port on the server running the CSM REST daemon. This port will be used to connect by the .send_ras; utility.
Attention
In a future release 4213 will be the default value.
data_sources¶
Value type: | map |
---|
A mapping of data sources to event correlation rules. The key of the data_sources field matches the type field of the logstash event processed by the filter plugin. The type field may be set in the input section of the logstash configuration file.
Below is an example of setting the type of all incoming communication on the 10515 tcp port to have the syslog type:
input {
tcp {
port => 10515
type => "syslog"
}
}
The YAML configuration file for the syslog data source would then look something like this:
syslog:
# Event Data Sources configuration settings.
# More data sources.
The YAML configuration uses this structure to reduce the pattern space for event matching. If the user doesn't configure a type in this data_sources map, CSM will not consider events of that type for event correlation.
Data Sources¶
Event data sources are entries in the data_sources map. Each data source has a set of configuration options which allow the event correlator to parse the structured data of the logstash event being checked for event correlation/action generation.
This section has the following configuration fields:
Field | Input type | Required |
---|---|---|
ras_location | string | Yes <Initial release> |
ras_timestamp | string | Yes <Initial release> |
event_data | string | Yes |
category_key | string | Yes |
categories | map | Yes |
ras_location¶
Value type: | string |
---|---|
Sample value: | syslogHostname |
Specifies a field in the logstash event received by the filter. The contents of this field are then used to generate the ras event spawned with the .send_ras; utility.
The referenced data is used in the location_name field of the REST payload sent by .send_ras;.
For example, assume an event is being processed by the filter. This event has the field syslogHostname populated at some point in the pipeline’s execution to have the value of cn1. It is determined that this event was worth responding to and a RAS event is created. Since ras_location was set to syslogHostname the value of cn1 is POSTed to the CSM REST daemon when creating the RAS event.
ras_timestamp¶
Value type: | string |
---|---|
Sample value: | timestamp |
Specifies a field in the logstash event received by the filter. The contents of this field are then used to generate the ras event spawned with the .send_ras; utility.
The referenced data is used in the time_stamp field of the REST payload sent by .send_ras;.
For example, assume an event is being processed by the filter. This event has the field timestamp populated at some point in the pipeline’s execution to have the value of Wed Feb 28 13:51:19 EST 2018. It is determined that this event was worth responding to and a RAS event is created. Since ras_timestamp was set to timestamp the value of Wed Feb 28 13:51:19 EST 2018 is POSTed to the CSM REST daemon when creating the RAS event.
event_data¶
Value type: | string |
---|---|
Sample value: | message |
Specifies a field in the logstash event received by the filter. The contents of this field are matched against the specified patterns.
Attention
This is the data checked for event correlation once the event list has been selected; make sure the correct event field is specified.
category_key¶
Value type: | string |
---|---|
Sample value: | programName |
Specifies a field in the logstash event received by the filter. The contents of this field are used to select the category in the categories map.
categories¶
Value type: | map |
---|
A mapping of data source categories to event correlation rules. The key of the categories field matches the field specified by category_key. In the included example this is the program name of a syslog event.
This mapping exists to reduce the number of pattern matches performed per event. Events that don’t have a match in the categories map are ignored when performing further pattern matches.
Each entry in this map is an array of event correlation rules with the schema described in Event Categories. Please consult the sample for formatting examples for this section of the configuration.
Event Categories¶
Event categories are entries in the categories map. Each category has a list of tagged configuration options which specify an event correlation rule.
This section has the following configuration fields:
Field | Input type | Required |
---|---|---|
tag | string | No |
pattern | string | Yes <Initial Release> |
action | string | Yes <Initial Release> |
extract | boolean | No |
ras_msg_id | string | No <Needed for RAS> |
tag¶
Value type: | string |
---|---|
Sample value: | XID_GENERIC |
A tag to identify the event correlation rule in the plugin. If not specified, an internal identifier will be assigned by the plugin. Tags starting with . will be rejected at the load phase, as this is a reserved pattern for internal tag generation.
Note
In the current release this mechanism is not fully implemented.
pattern¶
Value type: | string |
---|---|
Sample value: | mlx5_core %{MLX5_PCI}.*module %{NUMBER:module}, Cable (?<cableEvent>(un)?plugged) |
A grok based pattern, follows the rules specified in Grok Primer. This pattern will save any pattern match extractions to the event travelling through the pipeline. Additionally, any extractions will be accessible to the action to drive behavior.
action¶
Value type: | string |
---|---|
Sample value: | unless %{xid}.between?(1, 81); ras_msg_id=”gpu.xid.unknown” end; .send_ras; |
A ruby script describing an action to take in response to an event. The action is taken when an event is matched. The plugin will compile these scripts at load time, cancelling the startup if invalid scripts are specified.
This script follows the rules specified in CSM Event Correlator Action Programming.
extract¶
Value type: | boolean |
---|---|
Default value: | false |
By default the Event Correlator doesn't save the pattern matches extracted by pattern to the final event shipped to Elasticsearch or your big data platform of choice. To save the pattern extractions, this field must be set to true.
Note
This field does not impact the writing of action scripts.
ras_msg_id¶
Value type: | string |
---|---|
Sample value: | gpu.xid.%{xid} |
A string representing the ras message id in event creation. This string may specify fields in the event object through use of the %{FIELD_NAME} pattern. The plugin will attempt to populate the string using this formatting before passing to the action processor.
For example, if the event has a field xid with value 42 the pattern gpu.xid.%{xid} will resolve to gpu.xid.42.
Grok Primer¶
CSM Event Correlator uses grok to drive pattern matching.
Grok is a regular expression pattern checking utility. A typical grok pattern has the following syntax: %{PATTERN_NAME:EXTRACTED_NAME}
PATTERN_NAME is the name of a grok pattern specified in a pattern file or in the default Logstash pattern space. Samples include NUMBER, IP and WORD.
EXTRACTED_NAME is the identifier to be assigned to the text in the event context. The EXTRACTED_NAME will be accessible in the action through use of the %{EXTRACTED_NAME} pattern as described later. EXTRACTED_NAME identifiers are added to the big data record in elasticsearch. The EXTRACTED_NAME section is optional, patterns without the EXTRACTED_NAME are matched, but not extracted.
For specifying custom patterns refer to custom patterns.
A grok pattern may also use raw regular expressions to perform non-extracting pattern matches. Anonymous extraction patterns may be specified with the following syntax: (?<EXTRACTED_NAME>REGEX)
EXTRACTED_NAME in the anonymous extraction pattern is identical to the named pattern. REGEX is a standard regular expression.
CSM Event Correlator Action Programming¶
Programming actions is a central part of the CSM Event Correlator. This plugin supports action scripting using ruby. The action script supplied to the pipeline is converted to an anonymous function which is invoked when the event is processed.
Default Variables¶
The action script has a number of variables which are accessible to action writers:
Variable | Type | Description |
---|---|---|
event | LogStash::Event | The event the action is generated for, getters provided. |
ras_msg_id | string | The ras message id, formatted. |
ras_location | string | The location the RAS event originated from, parsed from event. |
ras_timestamp | string | The timestamp to assign to the RAS event. |
raw_data | string | The raw data which generated the action. |
The user may directly influence any of these fields in their action script; however, it is recommended that the user take caution when manipulating the event, as its contents are ultimately written to any Logstash targets. The event members may be accessed using the %{field} syntax.
The ras_msg_id, ras_location, ras_timestamp, and raw_data fields are used with the .send_ras; action keyword.
Accessing Event Fields¶
Event fields are commonly used to drive event actions. These fields may be specified by the event correlation rule or by other Logstash plugins. Because of the importance of this pattern, the CSM Event Correlator provides syntactic sugar for field access: %{FIELD_NAME}.
This syntax is interpreted as event.get(FIELD_NAME), where the field name is a field in the event. If the field is not present, it will be interpreted as nil.
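Although the plugin itself executes Ruby, the substitution is straightforward to picture. The Python sketch below only mimics how a %{FIELD_NAME} reference could be resolved against an event's fields; it is not the plugin's code.
import re

def resolve(template, event):
    # Replace each %{field} reference with event.get(field), or nil when missing.
    return re.sub(r"%\{(\w+)\}",
                  lambda match: str(event.get(match.group(1), "nil")),
                  template)

event = {"xid": 42}
print(resolve("gpu.xid.%{xid}", event))   # gpu.xid.42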
Action Keywords¶
Several action keywords are provided to abstract or reduce the code written in the actions. Action keywords always start with a . and end with a ;.
.send_ras;¶
Creates a ras event with msg_id == ras_msg_id, location_name == ras_location, time_stamp == ras_timestamp, and raw_data == raw_data.
Currently only issues RESTful create requests. Planned improvements add local calls.
Attention
A clarification for this section will be provided in the near future. (5/18/2018 jdunham@us.ibm.com)
Sample Action¶
Using the above tools, an action may be written that:
1. Processes a field in the event, checking that it is in a valid range:
unless %{xid}.between?(1, 81);
2. Sets the message id to a default value if the field is not within range:
ras_msg_id="gpu.xid.unknown" end;
3. Generates a RAS message with the new id:
.send_ras;
All together it becomes:
unless %{xid}.between?(1, 81); ras_msg_id="gpu.xid.unknown" end; .send_ras;
This action script is then compiled and stored by the plugin at load time then executed when actions are triggered by events.
Debugging Issues¶
Perform the following checks in order, when a matching condition is found, exit the debug process and handle that condition. Numbered sequences assume that the user performs each step in order.
RAS Event Not Firing¶
If RAS events haven’t been firing for conditions matching .send_ras perform the following diagnostic steps:
Check the `/var/log/logstash/logstash-plain.log`
Search for the phrase “Unable send RAS event”:
This indicates that the correlator was unable to connect to the CSM REST Daemon. Verify that the Daemon is running on the specified hostname and port.
Search for the phrase “Posting ras message”:
This indicates that the correlator connected to the CSM REST Daemon, but the RAS events were misconfigured. Verify that the message id sent has an analog in the list of RAS events registered in CSM.
The RAS message id may be checked using the following utility:
csm_ras_msg_type_query -m "MESSAGE_ID"
Neither of these strings was found:
Cast Search¶
The cast search mechanism is a GUI utility for searching for allocations in the Big Data Store.
Installation¶
Installation of the CAST Search plugin is performed through the ibm-csm-bds-kibana-1.4.0-*.noarch.rpm
rpm:
rpm -ivh ibm-csm-bds-kibana-*.noarch.rpm
Attention
Kibana must be installed first; please refer to the cast-kibana documentation.
Configuration¶
After installing the plugin the following steps must be taken to begin using the plugin.
1. Select the Management tab from the sidebar.
2. Select the Kibana Index Patterns.
3. If the cast-allocation index pattern is not present, create a new index pattern. If the index pattern is present, skip to Step Seven.
4. Input cast-allocation in the index pattern name.
5. Select @timestamp for the time filter (this will sort records by update date by default).
6. Verify that the cast-allocation index pattern is now present.
7. Select the Visualize sidebar tab, then select the CAST Search option.
8. Select the add option; by default this will select the Allocation ID option. If the user wishes to search on Job IDs, select Job ID in the dropdown.
9. A listing of fields should now be visible. Select the Apply changes button before saving the visualization.
10. Save the visualization so the plugin may be used from a dashboard.
11. Select the Dashboard sidebar tab, then create a new dashboard.
12. Select the Add option, then select the visualization created in Step Ten and Add new Visualization.
13. The plugin should now be usable.
Tools¶
CSM logging tools¶
To use the CSM logging tools, run the opt/csm/tools/API_Statistics.py Python script.
This Python script parses log files to calculate the start and end times of API calls on the different types of nodes that generate these logs. From the start and end times, the script calculates the:
- frequency at which the API was called
- mean run time
- median run time
- minimum run time
- maximum run time
- standard deviation of the run time
The script also captures job ID collisions when start and end APIs do not match.
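For context, the sketch below shows how those statistics could be derived once start and end timestamps have been paired for an API; it is an illustration rather than the script's source.
from statistics import mean, median, stdev

def summarize(intervals):
    # intervals is a list of (start, end) datetime pairs for one API.
    durations = [(end - start).total_seconds() for start, end in intervals]
    return {
        "frequency": len(durations),
        "mean": mean(durations),
        "median": median(durations),
        "minimum": min(durations),
        "maximum": max(durations),
        "standard deviation": stdev(durations) if len(durations) > 1 else 0.0,
    }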
Note
Run the script with -h for help.
[root@c650f03p41 tools]# python API_Statistics.py -h
usage: API_Statistics.py [-h] [-p path] [-s start] [-e end] [-o order]
[-r reverse]
A tool for parsing daemon logs for API statistics.
optional arguments:
-h, --help show this help message and exit
-p path The directory path to where the logs are located. Defaults to:
'/var/log/ibm/csm'
-s start start of search range. Defaults to: '1000-01-01 00:00:00.0000'
-e end end of search range. Defaults to: '9999-01-01 00:00:00.0000'
-o order order the results by a field. Defaults to alphabetical by API
name. Valid values: 0 = alphabetical, 1 = Frequency, 2 = Mean, 3
= Max, 4 = Min, 5 = Std
-r reverse reverse the order of the data. Defaults to 0. Set to 1 to turn
on.
Obtaining Log Statistics¶
Setup¶
This script handles Master, Compute, Utility, and Aggregator logs. These must be placed under the opt/csm/tools/Logs directory under their respective types.
Note
As of CSM 1.4, the script can be pointed to a directory where the log files are located, and by default the program will use /var/log/ibm/csm
.
Running the script¶
There are three ways to run the script against the logs, using the following time format:
Format: <Start Date> <Start Time>
Format: YYYY-MM-DD HH:MM:SS
- Run through the logs in their entirety:
python API_Statistics.py
- Run through the logs with a specific start time:
python API_Statistics.py <Start Date> <Start Time>
- Run through the logs with a specific start and end time:
python API_Statistics.py <Start Date> <Start Time> <End Date> <End Time>
Note
As of CSM 1.4, the time ranges of the script have been updated to use flags.
Output¶
Reports will be calculated and saved to individual files under opt/csm/tools/Reports
under their respective log types. (The script will output to the screen as well). The report includes errors and calculated statistics.