Monitor Aura system

1: Aura Analytics 1.1.

1.1: Architecture
1.2: OB analytics
1.3: Data model
1.4: Annex: Dataset fields

2: Aura Analytics 2.0.0

2.1: Architecture
2.2: Operation
2.3: Guidelines for OBs
2.4: Analytics Dashboard
2.5: Annex: Dataset fields

3: Aura Billing Module

3.1: Aura Billing Module operation

4: Manage Aura logs
5: Manage metrics

5.1: Aura Bot metrics
5.2: Aura Groot metrics
5.3: Atria Model Gateway metrics
5.4: Atria RAG server metrics
5.5: Aura Authentication API metrics
5.6: Aura Configuration API metrics
5.7: Aura Gateway API metrics
5.8: Aura Bridge metrics
5.9: Aura KPIs uploader metrics
5.10: Aura NLP metrics
5.11: T&C API metrics
5.12: NLP provisioning metrics
5.13: Aura Complex Logic metrics
5.14: Aura Context metrics
5.15: Aura File Manager metrics
5.16: Aura Redis MongoDB sync metrics

6: Aura dashboards

6.1: Aura system dashboards

6.1.1: Alertmanager dashboard
6.1.2: Elasticsearch dashboard
6.1.3: Fluent bit dashboard
6.1.4: Kubernetes cluster monitoring dashboard
6.1.5: Kubernetes cron and batch job monitoring dashboard
6.1.6: Kubernetes nodes dashboard
6.1.7: Kubernetes services dashboard
6.1.8: Kubernetes storage monitoring dashboard
6.1.9: NLP provisioning dashboard
6.1.10: Prometheus stats dashboard
6.1.11: Redis dashboard

6.2: Aura components dashboards

6.2.1: Aura bot latencies dashboard
6.2.2: Aura bridge dashboard
6.2.3: Authentication API dashboard
6.2.4: Aura HTTP Inbound dashboard
6.2.5: Aura HTTP Outbound dashboard
6.2.6: Pod resources dashboard

7: Aura Alerts
8: Queries

8.1: Basic monitoring queries
8.2: Basic database queries

Monitor Aura system

The operation of your Aura system can be monitored on a continuous basis. Discover the Aura monitoring tools, the different logs and metrics that are generated, and how to implement an alert system.

Introduction

The Aura monitoring system is crucial for controlling how the Aura system works internally, both to keep the service functional and to understand the behavior of our clients, so that Aura can evolve accordingly.

Monitoring is based on the logs and metrics generated by the different Aura components, which are stored for subsequent analysis and for decision-making based on the results. Together, logs and metrics form a complementary observability foundation that provides essential knowledge of the system performance, in both predictable and unpredictable situations.

Key external monitoring tools are used within this process, such as Elasticsearch, Kibana and Grafana.

Stages in Aura monitoring process

The sections below show, at a glance, the steps and recommended tools for both logs and metrics management, and include links to the corresponding documents for more details.

Aura logs management

Logs are files generated by different Aura components that record specific single events, warnings and errors as they occur.

Logs generation
Different Aura components generate logs every time a relevant event happens.
Logs storage
Aura logs are stored in Elasticsearch.
Logs visualization and analysis
Logs can be managed through different tools:
- Kibana: basic operational logging analyzer of Aura.
- Moreover, two additional tools, fluentd and Grafana, can be used for specific aspects related to logs management.

📃 Read here detailed information regarding Aura logs management.

Aura metrics management

Metrics offer an aggregated view of Aura performance based on meaningful aggregated logs. They are typically generated at fixed-time intervals and represent a specific aspect of the monitored system.

Metrics generation
Different Aura components generate metrics periodically based on aggregated logs.
Metrics storage
Once generated, Aura metrics are pulled (scraped) by Prometheus, which is in charge of gathering and exposing them.
Metrics analysis
Aura metrics are analyzed in order to have a meaningful interpretation of data and to obtain an overall evaluation of Aura’s performance.

For the management of metrics, we recommend using Aura dashboards, which are generated in Grafana. These dashboards can be retrieved by making queries to the system.

📃 Read here detailed information regarding Aura metrics management.

Aura alerts

Prometheus has a list of alert rules that are part of the platform configuration and can be edited.

📃 Discover the alerts currently set for the Aura system in the Aura alerts document.

1 - Aura Analytics 1.1.

Aura Analytics 1.1.

Description of Aura Analytics 1.1, the monitoring dataflow that allows active listening in Aura

Introduction

This document describes a joint dataflow between LCDO OB teams and the Aura Global Team for processing Aura log files created in the production environment (i.e., coming from actual Aura users) in order to create PPDs (Privacy-Preserving Datasets). This whole process is known as Active Listening.

Among other elements, the dataflow produces an analytics component, named Aura Analytics Dashboard, that can be used to gather statistics on the production system and to analyze user behavior. The latest version of this dashboard, 1.1, is described in the current document.

The main objectives of the unified dataflow are:

Consolidate the processing of Aura logs into a framework.
Provide LCDOs and Aura Global Team with a unified common source for analytics, in a privacy-preserving way.
Enable extensibility of the dataflow.

In this framework, the current documents provide:  

The available data analytics versions
Prerequisites and recommended tools for its use
Overall description of Aura Analytics architecture
Description of the OB Analytics subsystem that enables the management by OBs
Current Aura Analytics data model

The target audience of this document includes the following roles both in LCDO Teams and Aura Global Team:

Data Scientists and Product teams who wish to access Aura logs information and perform analytics on it.
Operation teams, for the architectural description and the requirements on OB environments.

Aura Analytics versions

Release 1.0.

The first release, 1.0., sets up the basic paths, deploys the PPD infrastructure and produces:

Version 1.0. of the OB Analytics system, which includes the OB Dashboard.
The first version of pre-processed datasets (clean PPDs) for training and analytics at Aura Global.

As mentioned, this version enables OBs to go further by:

Enhancing the OB Dashboard with new visualizations, as they see fit (given that panels and dashboards can be exported and imported, it is possible to share new ones across all OBs as they are developed).
Processing the PPD files as desired (they are standard CSV files, which can be ingested in alternative platforms). Restrictions on them are softer than on the original logs because of the anonymization process they have gone through, although they are still subject to management precautions (a code of conduct is being prepared for that).

Release 1.1.

Version 1.1. introduces the following changes:

The table of data has been enlarged with these new fields: AURA_ID, STATUS_CD, sesId, sesSize, sesDuration.
An expanded list of test users is used, so that the userType column contains more identifications.
The code for data ingestion into a local Kibana, which previously consisted of a single Python script, has been turned into a full installable Python package, due to its increasing complexity.

Prerequisites and recommended tools

The prerequisites for the use of version 1.1. of Aura Analytics Dashboard are set below:

Aura Platform version:
Recommended operating system: Ubuntu 18.04
Recommended tool for data visualization: ELK stack

1.1 - Architecture

Aura Analytics 1.1. architecture

Technical architecture of Aura Analytics 1.1.

Architecture description

The following figure shows a full overview of Aura Analytics Dashboard architecture and operation, which is also described below:

Aura Analytics architecture

 Aura logs generated in local instance are converted to datasets and transferred to local Kernel via the standard procedure and with the established frequency (typically, daily).
 Once there, the “Active listening” process flow fires up daily. Through a specialized process that runs on an Aura local instance and with access to the stored datasets in the Kernel local storage space:
- PII (Personally Identifiable Information) is removed or encrypted.
- The result is transferred to a bucket/blob set up for this task and managed by Global Aura team.
- Here, the PPDs (Privacy-Preserving Datasets) are created. Currently, only MESSAGE, RECOGNIZER and API datasets are involved in this process.
In order to convert PII data to PPD, every field in these datasets can be:
- a. Not transferred.
- b. Pseudo-anonymized. In this situation, the field is transformed through a cryptographic hashing process using a secret set up by the OB.
- c. Anonymized in fragments (e.g., credit card number, email, etc.). The field is processed to detect specific patterns, which are replaced with a specific tag (idemail, idpassport, etc.). The list of anonymization strings is agreed with each OB.
- d. Transferred as is.
 After that, the raw PPD datasets stored in the bucket/blob managed by the Global Team are processed to generate clean PPD datasets, adapted to the analytics tools.
 From that space, the clean PPD Datasets can be:

Accessed by the Aura Global Team, which uses them for several tasks with the purpose of evaluating Aura quality and taking the best decisions regarding product evolution:
- Perform analytics on Aura behavior and prototype Analytics Dashboard features
- Improve Aura Platform capabilities (e.g., adapting machine learning models)
Accessed by a Local Aura Team, which ingests the data into a dedicated server managed by the OB with analytics and data visualization capabilities. To do that, the Aura Global Team provides a component with the ELK stack (Elasticsearch, Logstash and Kibana) preconfigured with a set of dashboards that can be deployed and adapted by the OB team.
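
As an illustration of the field-level treatments listed above, the following minimal Python sketch applies options a (not transferred), b (pseudo-anonymized) and d (transferred as is) to a single record. It is not the actual PPD creation package: the policy table, the PHONE_NUMBER field and the use of HMAC-SHA256 as the keyed hash are assumptions made for the example, and option c (fragment anonymization) is left out for brevity.

```python
import hashlib
import hmac

# Hypothetical per-field policy; the real one is agreed with each OB.
FIELD_POLICY = {
    "AURA_ID_GLOBAL": "hash",  # option b: pseudo-anonymized
    "INTENT": "keep",          # option d: transferred as is
    "PHONE_NUMBER": "drop",    # option a: not transferred (hypothetical field)
}


def pseudonymize(value: str, secret: bytes) -> str:
    """Keyed hash of an identifier (HMAC-SHA256 is an assumed algorithm)."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def to_ppd_record(record: dict, secret: bytes) -> dict:
    """Apply the per-field policy; fields without a policy are not transferred."""
    out = {}
    for field, value in record.items():
        action = FIELD_POLICY.get(field, "drop")
        if action == "hash":
            out[field] = pseudonymize(value, secret)
        elif action == "keep":
            out[field] = value
    return out


record = {"AURA_ID_GLOBAL": "user-1", "INTENT": "intent.billing", "PHONE_NUMBER": "600000000"}
ppd_record = to_ppd_record(record, secret=b"ob-secret")
```

Because the hash is keyed with the OB secret, the same identifier always maps to the same pseudonym for a given OB, while the original value cannot be read back from the PPD.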

All the code involved in this process can be found in GitHub. Particularly:

PPD RAW creation package
Conversion from PPD RAW to PPD Clean
Pseudo-anonymization function for identifiers
Utterance anonymization (agreed individually for ES and UK)

1.2 - OB analytics

OB analytics

Description of the OB Analytics subsystem that can be managed by OBs.

Introduction

The OB Analytics subsystem is an optional component in the dataflow, which enables the management of clean PPDs (Privacy-Preserving Datasets) by LCDOs for the analysis of Aura behavior.

In order to work with the OB Analytics subsystem, the following requirements must be met:

 The legal agreement for log management and creation of PPDs must be signed between the OB and the Aura Global Team.
 The mechanism for PPD creation and transfer must be installed. This requires the deployment of a piece of software (provided by the Aura Global Team) inside the OB cloud, with access to the repository (AWS bucket or Azure Blob Storage) holding Aura logs.
 A virtual machine must be deployed on the OB cloud to hold the OB Dashboard. This virtual server must be provisioned by the OB on the same cloud environment (provider and region, e.g., AWS West Europe) as the Kernel cloud, but separated from it in terms of access rights (placing it in the same cloud saves the transfer costs charged by the cloud provider for PPD access).

Architecture and installation

The basic infrastructure of the OB Analytics subsystem consists of a Virtual Machine that is fed with the extracted and cleaned PPDs. This virtual machine is set up with a proposed stack of tools based on the open-source ELK framework (see the figure in the Architecture document):

Elastic Search: indexing database.
Logstash: ingester for PPD data, configured to upload the anonymized clean PPD tables into Elastic Search.
Kibana: visualization tool offering dashboards and panels created over Elastic Search data.

The OB is required to set up the base VM, for which an Ubuntu 18.04 system is advised.

On top of this base system, Aura Global Team provides an installation kit that includes:

The pre-processing and ingesting configuration for feeding clean PPD data into logstash.
The indexing configuration for Elastic Search.
Certain prototype dashboards and panels for Kibana.
Basic security provisions (providing web-based secure access to the dashboard).

Once installed, the system automatically ingests any new clean PPD being produced, so that the index and dashboards remain up to date.

In principle, the PPD creation process specifies daily production, since Aura logs are sent to Kernel once a day. This means that information about Aura behavior and user actions on one given day will be available in the dashboards on the following day.

The provided system and installed dashboards are only visualization examples for clean PPDs. The system allows the creation of additional panels that may provide complementary insights on clean PPD elements and OBs are encouraged to explore data as they see fit.

Dashboards can be exported and reimported in a different system, so in addition to the LCDO team adding new analysis features, it is possible to provide later updates to the OB Analytics system. These updates can be provided by the Aura Global Team or shared between OBs.

Outside the dashboard stack, it is also possible to process clean PPD with alternative tools (PPDs are essentially CSV files with a defined structure, so they can be processed with a variety of tools).
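
As a minimal sketch of such alternative processing, the following Python snippet counts interactions per intent using only the standard library. It assumes that a clean PPD is a comma-separated file whose header row carries the field names; the sample rows and their values are made up for the example.

```python
import csv
import io
from collections import Counter


def count_by_field(csv_file, field: str) -> Counter:
    """Count rows per value of `field` in a CSV file that has a header row."""
    reader = csv.DictReader(csv_file)
    return Counter(row[field] for row in reader)


# Tiny in-memory sample standing in for a real clean PPD file (made-up values).
sample = io.StringIO(
    "CORR_ID,CHANNEL_CD,INTENT\n"
    "1,channel-a,intent.x\n"
    "2,channel-a,intent.y\n"
    "3,channel-b,intent.x\n"
)
counts = count_by_field(sample, "INTENT")
```

With a real file, the `io.StringIO` sample would be replaced by an `open(path, newline="")` call on the PPD file.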

Kibana dataflow

The Aura Analytics dashboard follows a standard ELK deployment:

An Elastic Search index has been created. It is called aura-message-COUNTRY, and its index schema contains a cleaned version of the AURA MESSAGE table (which registers input and output messages). For details on the fields that this index contains, go to the document Data model.
A Logstash configuration ingests into this index the cleaned sets of datapoints that are produced daily as a result of the transfer and processing of Aura logs. This is usually done in the early morning (which will then upload data for the previous day).
A Kibana index pattern has been created, matching the uploaded Elastic Search index. An Elastic Search index is how the data is stored inside the DB; a Kibana index pattern is how it is visualized in the interface. Typically, Kibana index patterns match Elastic Search indices, but it is, for example, possible to create a Kibana index pattern that matches more than one Elastic Search index and hence combines different data sources.
A small set of visualizations has been pre-installed in Kibana over that index pattern, as a way to get a first look at the index data. See the section preinstalled visual elements to check them.
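
The relationship between indices and index patterns can be illustrated with a short Python sketch. It only approximates the wildcard matching done by Kibana (using shell-style matching from the standard library), and the index names with "es" and "uk" suffixes are hypothetical.

```python
from fnmatch import fnmatchcase


def indices_matching(pattern, indices):
    """Return the index names selected by a wildcard pattern (case-sensitive)."""
    return [name for name in indices if fnmatchcase(name, pattern)]


# Hypothetical index names; the real suffix depends on the country.
indices = ["aura-message-es", "aura-message-uk", "other-index"]

single = indices_matching("aura-message-es", indices)   # pattern matching one index
combined = indices_matching("aura-message-*", indices)  # pattern combining several indices
```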

This configuration is deployed on the Kibana default space (the only one available on a freshly created Aura Analytics dashboard). If additional spaces are needed to better organize visualizations, the Kibana index pattern must also be installed in those additional spaces.

Preinstalled visual elements

Kibana offers many possibilities to visualize the ingested data and there are many resources and tutorials around explaining its mechanics. We therefore refer to the official Kibana documentation, or tutorials available on the web, for generic information.

In the particular case of the Aura Analytics deployment, there is an Elastic Search index that gets automatically ingested daily. It is called aura-message-COUNTRY and contains a cleaned version of the AURA MESSAGE table (which registers input and output messages).

Over this index, three types of panels/visualizations have been preinstalled, to provide a starting point:

Discover panel
Visualizations
Dashboards

These preinstalled elements are described in the following subsections. To access them, select the appropriate icon in the left navigation panel.

Discover panel

The Discover panel in Kibana is an essential tool where one can perform queries to an Elastic Search index (and save those searches if desired), and explore users’ interactions with Aura in detail log by log, these being filtered by:

 Search terms or conditions
 A time interval
 Additional filters applied to the query results
 A set of index fields to show in the result table

These 4 steps are represented in the following figure:

Discover panel

As shown in the previous figure, the starting point is the Elastic Search index holding all the data. The first three steps in the chain reduce the amount of data handled, by pruning out elements that do not satisfy the defined conditions. The fourth step is just a display adjustment: it defines which of the available fields of the final dataset are shown in the output table that appears in the panel. The retrieved data, however, contains all fields (clicking on any of the rows shows them).
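
The chain can be mimicked on an in-memory list of datapoints with the following Python sketch. It is a simplification and not how Kibana or Elastic Search run the query: the search step is reduced to a case-insensitive substring match on MESSAGE_USR, and the timestamp field name and the sample rows are assumptions.

```python
from datetime import datetime


def discover(rows, search, start, end, filters, columns):
    """Apply the four Discover steps to an in-memory list of datapoints."""
    # Step 1: search terms (a blank search matches every row).
    hits = [r for r in rows if search.lower() in r["MESSAGE_USR"].lower()]
    # Step 2: time interval.
    hits = [r for r in hits if start <= r["timestamp"] <= end]
    # Step 3: additional filters on exact field values.
    hits = [r for r in hits if all(r.get(f) == v for f, v in filters.items())]
    # Step 4: display adjustment only; `hits` still carries every field.
    table = [{c: r.get(c) for c in columns} for r in hits]
    return hits, table


# Hypothetical sample datapoints.
rows = [
    {"timestamp": datetime(2021, 3, 1, 10), "MESSAGE_USR": "my bill", "CHANNEL_CD": "a", "INTENT": "i1"},
    {"timestamp": datetime(2021, 3, 2, 10), "MESSAGE_USR": "hello", "CHANNEL_CD": "a", "INTENT": "i2"},
    {"timestamp": datetime(2021, 3, 3, 10), "MESSAGE_USR": "bill please", "CHANNEL_CD": "b", "INTENT": "i1"},
]
hits, table = discover(
    rows, "bill", datetime(2021, 3, 1), datetime(2021, 3, 31),
    filters={"CHANNEL_CD": "a"}, columns=["MESSAGE_USR"],
)
```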

In the Aura Dashboard default set, there is one Discover panel preinstalled. It is called question-answer pairs and has the following characteristics:

A blank query (i.e., provide all the results)
A time interval for the last 7 days
An “only user” filter: it filters out all intents that correspond to non-user queries (suggestions, help commands from the client application, etc.)
A visualization that includes: the timestamp, the (cleaned) user message, the detected aura intent, associated entities (if applicable), the dialog that was invoked and Aura’s response

This figure shows a snapshot of this panel. To load it, select the Discover tool in the left navigation bar and then click on the “Open” menu option in the top menu bar. A list of saved panels will be shown, together with the already mentioned “question-answer pairs”.

question-answer pairs panel

Once the panel is loaded, each one of the aforementioned four elements can be freely modified. For example, the interface allows:

Adding new filters with the “+Add Filters” button
Deactivating the current filters by pressing over the predefined filter and clicking over the “Temporarily Disable” option
Modifying the query interval with the “calendar” button or “Dates Box”
Adding a specific query on a given index field(s) by using the “Search Box”, instead of the (default) blank query.

Discover panels can be saved as named objects, to be later loaded at will. So, if needed, any panel (a modified panel or a newly created one) can be saved with a new name to have it available for later loading.

Visualizations

A total of 7 visualizations come preinstalled with the base Aura Dashboard. The list can be obtained from the “visualizations” item in the left menu bar, as shown in the figure, and they are:

Three “Stats” type visualizations, which provide general statistics on platform usage.
Four “User” type visualizations, which provide insights on user behavior.

Visualizations

Note that this distinction between “User” and “Stats” is purely conceptual and based on the fields used to generate the visualizations; from the point of view of Kibana, they are all regular visualizations. They can be loaded instantly by clicking on their names, and they can also be integrated into dashboards, as described in the next section.

Dashboards

A dashboard in Kibana is essentially a spatial arrangement of visualizations. For example, to construct a dashboard, just place visualizations into a page, resizing them as required, so they can be observed in a single place.

Note that all visualizations in a dashboard are linked: if, for example, the time interval is changed or a filter is added using the interface, the modification affects all visualizations in the dashboard and all of them get updated.

Elements in the dashboard visualizations can also generate instant filters by clicking on graphs or table elements. Those filters are then added to the top of the page as a filter and, therefore, can then be modified or removed.

The Aura Analytics default installation preloads two dashboards. Those are available for selection when we click on the “dashboard” icon in the left navigation bar:

Default dashboards

These two dashboards are described in the following sections.

System dashboard

This dashboard integrates the three predefined “Stats” visualizations (generic statistics):

A timeline of interactions (user messages sent and answered), segmented by channel
A heatmap of interactions by weekday and time of day (hour)
A bar graph classifying the interactions produced in the period by detected intent

The following figure shows a screenshot of this dashboard:

System dashboard

User dashboard

The user dashboard contains the four “User” visualizations:

Most Frequent User Utterances: list of the most frequent user sentences (for the time interval and filters active at the moment). It uses the usrMsgSig field to group together very similar utterances.
Aura Answer Groups: list of the most frequent answers that Aura generates, grouped by the semantic categories in AuraMsgGroup field.
Words per query: distribution of sizes for the user messages, measured as number of words in the utterance and segmented by channel.
Tag cloud: set of most frequent user utterances, as a tag cloud in which the font size represents the utterance frequency. The MESSAGE_USR_NORM field is used for its representation, so it contains normalized utterances.

The next screenshots show the dashboard with all these visualizations (it is a large dashboard, so typically it needs scrolling to visualize all its components).

User dashboard

Note that those four visualizations are linked as they correspond to the same subset of the data (as given by filters and time interval) but they are NOT linked at the individual item level (i.e., a given most frequent user utterance in the left table does not correspond to any specific Aura answer in the right bar graph).

Instead, the dashboard can be manipulated by selecting one specific item in any of the visualizations, which creates a filter for the others. For instance, as the following image shows, if we select <CHURN> in the Aura answer group visualization, the other visualizations show the user utterances that led Aura to generate that answer (i.e., an answer about contract cancelation).

Aura answer groups in User dashboard

1.3 - Data model

Aura Analytics data model

Data model of Aura Analytics 1.1. that can be used as the base for building new elements

Introduction

New elements can be built (or the current elements modified) by making use of the available fields in Kibana through the ingested Elastic Search index.

In this document, we provide a reference of the schema that the index follows, so that it can be used to build such new visualizations, or to better understand the existing ones.

Elements in the Aura-message data model have 3 different types:

Numeric: single numbers, integer or real. Suitable for numerical statistics, such as averages, or for plotting variation across time in graphs.
Keyword: they are opaque strings, i.e., terms that cannot be searched within (it is not possible to look for words inside a keyword field). They can, however, be used to create some term-level queries, such as prefix queries (find all instances that begin with) and they usually work great for aggregations, since most of them are categorical variables (fields that only have a limited number of possible values) and can therefore be bucketed and counted.
Text: these fields are divided into separate terms (words), and some pre-processing is applied to them before indexing, through an Elastic Search analyzer, to improve access. Text fields cannot be used in aggregated visualizations, since they cannot be grouped. They are most useful for queries, because they allow searching for fragments (only a few words) and fuzzy searches.

Fields list

The following table lists all the fields available in the aura-message-COUNTRY Elastic Search index, together with their type and a brief description.

The most relevant ones are described in more detail in the section Fields explanations.

Note that some fields of text type have a mirror field of type keyword, with the same content. Having the same data indexed in two different ways at the same time (as text and as keyword) makes it possible to perform different types of analysis by choosing the right field.
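
A text field with a keyword mirror is typically declared in Elastic Search as a multi-field. The following Python dictionaries sketch what such a mapping and the two kinds of requests could look like; the actual index mapping shipped with the installation kit is not reproduced here and may differ (for example, in analyzers), and the search word is made up.

```python
# Assumed mapping: a text field with a keyword sub-field,
# addressed as MESSAGE_USR_NORM.keyword.
index_mapping = {
    "mappings": {
        "properties": {
            "MESSAGE_USR_NORM": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            }
        }
    }
}

# Full-text search goes against the text field...
search_body = {"query": {"match": {"MESSAGE_USR_NORM": "bill"}}}

# ...while aggregations (bucketing and counting) use the keyword mirror.
top_utterances_body = {
    "size": 0,
    "aggs": {
        "top_utterances": {
            "terms": {"field": "MESSAGE_USR_NORM.keyword", "size": 10}
        }
    },
}
```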

The “Raw” column indicates if this field is already present in the Aura raw PPD files:

Yes: field contained in raw PPDs.
No: generated field, produced when creating clean PPDs. Most of them can be recognized because their names are not written fully in uppercase (e.g., userType, weekday).
Partial: it exists in the raw PPDs, but in a somewhat different shape.

Field	Type	Raw	Contents
CORR_ID	keyword	yes	Unique identifier for each interaction
VERSION_ID	keyword	yes	Aura Platform version
CHANNEL_CD	keyword	yes	Identifier for the channel this interaction corresponds to
STATUS_CD	keyword	yes	Internal code related to operation status
AURA_ID_GLOBAL	keyword	yes	(Mostly) unique identifier for the user
AURA_ID	keyword	yes	(Mostly) local identifier for the user
INTENT	keyword	yes	Detected user intent, including “system” intents
MESSAGE_USR	text	partial	Text request sent by the user
MESSAGE_USR_NORM	text	no	A normalized version of MESSAGE_USR
MESSAGE_USR_NORM.keyword	keyword	no	A keyword version of MESSAGE_USR_NORM, to enable aggregating on it
MESSAGE_AURA	text	partial	Text message sent by AURA to the user
MESSAGE_AURA.keyword	keyword	partial	Keyword version of MESSAGE_AURA, to enable aggregating on it
MODALITY_CD_USR	text	partial	Modality of the user message
MODALITY_CD_AURA	text	partial	Modality of Aura response
ENTITIES	text	yes	Comma-separated list of the entities recognized in the user message
DIALOG_ID	text	yes	Identifier for the dialog that produced Aura response
DIALOG_ID.keyword	keyword	yes	Keyword version of DIALOG_ID, to enable aggregating on it
DURATION_NU	number	yes	Elapsed time, in ms, between the reception of the user message and the moment the response is generated to be sent to the channel
userType	keyword	no	A single char identifier that characterizes the user as a test user
usrMsgWc	number	no	Message word count: number of words contained in the user message
usrMsgSig	keyword	no	Message signature: a string that helps clustering user messages
AuraMsgGroup	keyword	no	Cluster the Aura response belongs to
weekday	number	no	Day of the week the interaction happened (0=Monday to 6=Sunday)
hour	number	no	(Integer) hour the interaction happened
country	keyword	partial	Two-letter code for the country
sesId	keyword	no	Session information
sesSize	number	no	Session information
sesDuration	number	no	Session information
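
Some of the generated fields are simple to reproduce. The following Python sketch derives weekday, hour and usrMsgWc for one interaction; the whitespace-based word count is an assumption, since the exact tokenization used when creating clean PPDs is not described here.

```python
from datetime import datetime


def derived_fields(timestamp: datetime, user_message: str) -> dict:
    """Compute a few of the generated fields for a single interaction."""
    return {
        "weekday": timestamp.weekday(),         # 0=Monday ... 6=Sunday
        "hour": timestamp.hour,                 # integer hour of the day
        "usrMsgWc": len(user_message.split()),  # assumed: whitespace-separated words
    }


fields = derived_fields(datetime(2021, 3, 1, 14, 30), "how much is my bill")
```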

Fields explanations

This subsection contains more detailed descriptions of some of the key fields in the schema.

AURA_ID_GLOBAL

This element (mostly) uniquely identifies the user generating the interaction.

Note that the concrete value of this field is not the same as the actual identifier used within Aura and uploaded to Kernel: for privacy reasons, the identifier was hashed when generating the PPD and bears no resemblance to the original one. The correspondence is however maintained across time, so it is possible to analyse user behavior.

The “mostly” qualifier reflects one quirk of the original Aura identifier: its generation depends on the authentication method used by the channel, so if two channels follow different authentication methods (e.g., MobileConnect vs. User/Password) then the AURA_ID_GLOBAL identifier for the same user will be different. In summary:

The identifier stays the same for a given user across time.
Different users will not have the same identifier.
But the same user could produce two different identifiers if connected to two channels that use a different authentication method.

AURA_ID

This is a “local” identifier, i.e., one that is generated inside the channel according to specific channel characteristics and is not tied to user authentication as closely as AURA_ID_GLOBAL.

Its main disadvantage is its transient nature: the same user, on the same channel, could generate different AURA_ID strings when connecting at different times, in different sessions. Therefore, for user accounting and tracing, AURA_ID_GLOBAL is usually preferred.

However, there are instances in which AURA_ID works better, namely for anonymous access (when the user is not authenticated). This depends on the channel:  

In the WhatsApp channel, the initial use of the channel will be anonymous from the Aura side (i.e., no authentication is done), hence AURA_ID_GLOBAL will also be empty (at least until the user authenticates, which depends on the use case). But in this channel, AURA_ID has a permanent value, linked to the WhatsApp user, so here it is a good substitute for a persistent id, even for unauthenticated users.

MESSAGE_USR

This field includes the message sent by the user.

It has been partially processed to enhance anonymization, by replacing some standard identifiers contained in it with <idxxx> strings (e.g., phone numbers appear as <idphone>).

The replacement is done mostly through regular expressions, so there might be occasional glitches (such as tagging as a phone number a number that is not one, just because it follows the phone number pattern).
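
The following Python sketch illustrates this kind of regular-expression replacement, including the type of glitch mentioned above. The two patterns are deliberately simple assumptions and are not the ones agreed with each OB.

```python
import re

# Hypothetical patterns, for illustration only.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\b\d{9}\b")  # any run of exactly 9 digits


def mask_identifiers(text: str) -> str:
    """Replace detected identifiers with <idxxx> tags."""
    text = EMAIL_RE.sub("<idemail>", text)
    return PHONE_RE.sub("<idphone>", text)


masked = mask_identifiers("call me at 612345678 or write to ana@example.com")
# A 9-digit order number follows the same pattern, so it is also tagged as a phone.
glitch = mask_identifiers("my order number is 123456789")
```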

MESSAGE_USR is a field of text type. As such, it is searchable: it is possible to search for specific words the user might have said.

Furthermore, it has been processed through an Elasticsearch analyzer adapted to the specific language used. This means that searches are able to match related words (e.g., plural versions of a singular query word, or verb conjugations). Phrase searches are also possible (by using double quotes around the phrase). If a phrase (several words) is used as a query without the quotes, Elasticsearch interprets it as a query for any of the words, so it will return all data elements that contain any of the words in the query.

In Kibana, more sophisticated text searches can be made by switching to Lucene query syntax: proximity queries (words close to each other), fuzzy searches (query words allowing typos), wildcards, etc.
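
As a quick reference, the following Python sketch builds one example query string for each of these search types in Lucene syntax. The search words are made up for the example, and the strings are meant to be typed in the Kibana search box once Lucene syntax is selected.

```python
def lucene_examples(field: str = "MESSAGE_USR") -> dict:
    """Example Lucene query strings for the search types described above."""
    return {
        # Exact phrase: words must appear together and in this order.
        "phrase": f'{field}:"cancel my contract"',
        # Proximity: both words within 3 positions of each other.
        "proximity": f'{field}:"cancel contract"~3',
        # Fuzzy: allows one edit (typo) in the query word.
        "fuzzy": f"{field}:contract~1",
        # Wildcard: any term starting with the given prefix.
        "wildcard": f"{field}:contr*",
    }


examples = lucene_examples()
```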

MESSAGE_USR_NORM

This is a normalized version of MESSAGE_USR, in which the user text has been streamlined by:

Converting the whole sentence to lowercase
Removing all punctuation
Removing any extra spaces

Furthermore, this field is not processed through a language-dependent analyzer as MESSAGE_USR is, so queries on this field must match words exactly. It is still a text type field, however, so the same query language can be used.
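
The three normalization steps can be sketched in Python as follows. This is an approximation: the exact definition of punctuation used when creating clean PPDs is an assumption (here, any character that is neither a word character nor whitespace).

```python
import re


def normalize_message(text: str) -> str:
    """Lowercase, strip punctuation and collapse extra spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # assumed definition of punctuation
    return " ".join(text.split())        # removes leading, trailing and repeated spaces


normalized = normalize_message("  Hello,   how much is MY bill?! ")
```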

MESSAGE_AURA

This contains the text message generated by Aura and sent to the user as response to the user query. It is a text type field, so it is possible to search for specific words in it.

In the current version of Aura KPIs logs, this field only contains the text response. Some Aura use cases do not generate a purely textual message, but a more elaborate one (e.g., a card with text and graphics). These complex answers are inserted as attachments into Aura’s response to the channel and, since attachments are not logged into the MESSAGE field, this field will appear empty in those cases. So, an empty MESSAGE_AURA field does not necessarily mean that Aura did not provide an answer. As an alternative for those situations, looking at the DIALOG_ID field (or INTENT) may give a hint of the type of answer that Aura delivered.

 MODALITY_CD_USR

This field contains the modality in which the user sent the message.

It is a slightly transformed field: there are some variations across Aura versions and, in order to unify them, the modalities are consolidated into only four different keywords, including audio (spoken message), text (written free-text message) and form (commands sent via automatic processing or menus).

 DIALOG_ID

This field contains the identifier of the use case dialog module in the aura-bot framework that was selected to construct the Aura response.

Dialog identifiers have two components (library and dialog) separated by a colon, e.g., services:service-usage.

This field uses a custom analyzer that splits the identifier at the colon, generating two terms. This makes it possible to construct queries with one of the terms, e.g., “give me all the elements for the domain services”. But being a text field makes it impossible to do aggregations on it, so it cannot be used for statistics like bar charts (use DIALOG_ID.keyword for that).

DURATION_NU

This number reflects the time it took Aura to understand, process and respond to the user message. It is the difference (in milliseconds) between the timestamp at which the user message was received and the timestamp at which Aura’s response was finalized and sent to the channel.

Note that it is not a complete end-to-end delay time from the user’s point of view, since it includes neither the time it took the request to arrive at Aura through the channel nor the time it took the response to travel back through the channel and get rendered at the client application (those times are outside Aura, and as such not registered by it).

Session Information

Session information includes the fields: sesId, sesSize, sesDuration.

These fields are generated by running a process over the time series formed by interactions from each user at each channel.

A session is automatically identified as a consecutive list of a user’s interactions, each separated from the next by a time interval shorter than 5 minutes. Once each session is identified, it is tabulated and labelled with three fields:

sesId: string forming a unique identifier for the session. It should be considered an opaque identifier; the only guarantee is that no other session in the data stream carries the same identifier.
As an aside, datapoints that do not correspond to actual user interactions (because no user could be identified or because the datapoint corresponds to an interaction not triggered by the user) are all labelled with a <void> sesId.
sesSize: number of interactions this session contains. It is set only on the first interaction in the session; all other interactions carry a 0 in this field. Non-sessions, such as the ones with a <void> sesId, are left empty. This makes it easy to compute averages or other statistics on valid sessions by first filtering out all zero and empty values.
sesDuration: duration of each session, counted from the instant the first user message was received to the instant the last Aura message was sent. For single-interaction sessions its value is the same as DURATION_NU; for sessions with multiple interactions it covers the whole time span from the first interaction to the last.

As with sesSize, only the first interaction in a session is annotated with sesDuration; the remaining interactions will be assigned a 0 value (and interactions that do not correspond to a session will be left empty). Therefore, to compute statistics on sesDuration, remove the 0 and empty values first.
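The session labelling described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual process: the `Interaction` class and `label_sessions` function are hypothetical, the 5-minute gap is assumed to be measured between the reception timestamps of consecutive messages, and the opaque sesId is generated here with `uuid4`.

```python
import uuid
from dataclasses import dataclass

GAP_MS = 5 * 60 * 1000  # sessions break on gaps of 5 minutes or more


@dataclass
class Interaction:
    received_ms: int      # timestamp at which the user message was received
    duration_ms: int      # DURATION_NU for this interaction
    ses_id: str = ""
    ses_size: int = 0
    ses_duration: int = 0


def label_sessions(interactions):
    """Label the interactions of ONE user on ONE channel, sorted by time."""
    sessions = []
    for item in interactions:
        if sessions and item.received_ms - sessions[-1][-1].received_ms < GAP_MS:
            sessions[-1].append(item)   # close enough: same session
        else:
            sessions.append([item])     # gap too long: start a new session
    for group in sessions:
        ses_id = uuid.uuid4().hex       # assumption: any unique opaque string
        for item in group:
            item.ses_id = ses_id
            item.ses_size = 0
            item.ses_duration = 0
        first, last = group[0], group[-1]
        # Only the first interaction carries the size and duration.
        first.ses_size = len(group)
        first.ses_duration = last.received_ms + last.duration_ms - first.received_ms
    return interactions
```

Because only the first interaction of each session carries non-zero values, averaging sesSize or sesDuration over rows with a value greater than 0 gives per-session statistics.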

userType

This field may be used, in certain cases, to help identify rows that do not correspond to real users but to test users (internal users that belong to test/QA teams and whose behaviour is, therefore, not representative of actual Aura users).

The field contains a single character, which is s for standard (real) users, and can be Q or T for QA/Test users respectively (there are also lowercased versions q and t, referring to unconfirmed test users).

Note that test user identification is not available in every country, since it depends on having a register of the AURA_GLOBAL_ID identifiers that QA/Test users authenticate with, and this is not always available.

usrMsgSig

This field is not useful by itself. Instead, it is intended to help group together very similar user utterances. It does so by generating a signature of the utterance that is (hopefully) insensitive to small variations in the sentence.

This is an experimental field; it might change if we reach a variant that is better suited for its purpose.

The way to generate this signature is by following these steps with the utterance:

Start with the normalized utterance (i.e., MESSAGE_USR_NORM).
Perform stemming (removal of word suffixes) on all the words. This makes bills and bill the same word.
Substitute words from a fixed list of very common, uninformative tokens (stopwords) by an asterisk. For example, this converts both “get my bill” and “get the bill” to the same phrase “get * bill”.
Group words in sets of 3 elements (trigrams) and sort them alphabetically. This removes the global structure of the sentence, while retaining local structure.

The resulting string is a non-understandable version of the original utterance (hence, it cannot be used by itself), but the fact that several very similar utterances produce the same signature helps cluster those utterances. An example is one of the preinstalled visualizations “Most Frequent User Utterances” which uses this field to group very similar utterances.

Another example is provided in the following figure, which shows message utterances generating the same signature:

Message utterances

As can be seen, the signature is the same for “how can I upgrade” and “when can I upgrade”, for “when does my contract end” and “when is my contract ending”, and for “live chat” and “live chats”. So, they would be counted together when aggregating by signature.

The procedure has its limitations and, as explained, it is experimental and still being improved, but it already reduces some of the inherent variability in user expressions.
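To make the signature steps more concrete, here is a minimal Python sketch. It is not the actual algorithm: the stopword list, the toy suffix-stripping stemmer and the `|` separator used to join the trigrams are all assumptions made for illustration, and the input is assumed to be already normalized (i.e., a MESSAGE_USR_NORM value).

```python
# Hypothetical stopword list; the real fixed list is not documented here.
STOPWORDS = {"the", "my", "a", "an", "i", "how", "when", "can"}


def toy_stem(word: str) -> str:
    """Very rough stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def signature(normalized_utterance: str) -> str:
    # Step 2: stem every word of the normalized utterance.
    tokens = [toy_stem(word) for word in normalized_utterance.split()]
    # Step 3: replace stopwords with an asterisk.
    tokens = ["*" if token in STOPWORDS else token for token in tokens]
    # Step 4: build trigrams (or one shorter group) and sort them.
    if len(tokens) <= 3:
        trigrams = [" ".join(tokens)]
    else:
        trigrams = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    return "|".join(sorted(trigrams))


print(signature("get my bill"))   # get * bill
print(signature("get the bill"))  # get * bill
```

With this sketch, “live chat” and “live chats” also map to the same signature, because the toy stemmer strips the plural suffix.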

AuraMsgGroup

The AuraMsgGroup field is a keyword field that helps categorize Aura answers by abstracting away some of the variation in their wording. It classifies the response given by Aura into one of two types of elements:

Generic group: a name such as <NONE>, <GREETING> or <NOTFOUND>, which corresponds to a response category (see Table 3)
Truncated answer: for answers that do not have a defined generic group, as a fallback the literal answer text is inserted, after substituting all numbers in it with a placeholder and truncating it (i.e., retain only the first characters).
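The two-way classification can be sketched in Python as follows. This is an illustration only: the lookup table content, the `#` placeholder and the 40-character limit are hypothetical values, since the actual translation table and truncation length are not given in this document.

```python
import re

# Hypothetical translation table: known Aura message -> generic group.
GENERIC_GROUPS = {
    "hello, how can i help you?": "<GREETING>",
    "sorry, i did not understand that.": "<NONE>",
}
MAX_LENGTH = 40        # hypothetical truncation length
NUMBER_PLACEHOLDER = "#"  # hypothetical placeholder for numbers


def aura_msg_group(answer: str) -> str:
    if not answer:
        return "<EMPTY>"  # no textual answer from Aura
    key = answer.strip().lower()
    if key in GENERIC_GROUPS:
        return GENERIC_GROUPS[key]
    # Fallback: literal text with numbers replaced, then truncated.
    return re.sub(r"\d+", NUMBER_PLACEHOLDER, answer)[:MAX_LENGTH]


print(aura_msg_group("Your bill is 42 euros"))  # Your bill is # euros
```

Replacing numbers before truncating means that answers differing only in amounts or dates fall into the same group.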

The following table contains the generic groups defined so far. They correspond to the most frequent Aura messages. The list is country-dependent, since it depends on the use cases deployed in each country. As said above, responses not falling into these groups will be assigned a truncated version of the response text.

Note that the list of most frequent Aura messages can grow over time. Also, the correspondence between Aura messages and groups is not static: if the text database is updated with new variants, the translation table in the PPD cleaner process that generates this field must also be updated.

Group	Meaning
EMPTY	No textual answer from Aura (see note in Section MESSAGE_AURA for the usual meaning of no text answer)
NONE	Aura says it did not understand the user utterance
ERR	There was a processing error of some kind at Aura side, and the request could not be fulfilled
GREETING	Aura is greeting the user
GOODBYE	Aura is acknowledging a conversation end
YOU-ARE-WELCOME	Aura is accepting a compliment
CHURN	Aura recognizes the user intention to terminate a contract
NOTFOUND	Aura tried to search for some bit of data concerning the user query, and could not find it
CANNOT	Aura cannot fulfil the user request because of insufficient information (in the query, or on user data)
BILL-INFO	The user requested information about her bill, and Aura is returning it
DATA-INFO	The user requested information about her data usage, and Aura is returning it


1.4 - Annex: Dataset fields

Annex: Dataset fields detail

Explanation of the processing that each field of the data model goes through on its way to a clean PPD

Introduction

The objective of the following tables is to explain the processing that each field goes through within this flow:

AURA DATASET → PPD_RAW → PPD_CLEAN

Each cell of the table explains the processing that the data field undergoes before it reaches the corresponding stage (table column).
For example, the field AURA_ID_GLOBAL is hashed before it is stored in PPD_RAW. After this, the hashed data passes to PPD_CLEAN without any further processing.

Tables used in the Active Listening process are described in the following sections. They belong to the Aura Entities database.

Columns “FIELD” and “DESCRIPTION”: instances managed by the OB
Columns “PPD RAW” and “PPD CLEAN”: instances managed by Aura Global Team

MESSAGE dataset

Message dataset (stored in local Kernel)

Columns “FIELD” and “DESCRIPTION”: instances managed by the OB
Columns “PPD RAW” and “PPD CLEAN”: instances managed by Aura Global Team

#	FIELD	DESCRIPTION	PPD RAW	PPD CLEAN
1	USER_ID	Unique user ID in the OB systems	NOT transferred	NOT transferred
2	MSG_DT	Timestamp of the data
3	MSG_ID	Unique ID of the message		NOT transferred
4	ACTION_CD	Code of the action that produces the data		NOT transferred
5	AURA_ID	User logging ID in Aura. The user will have a new `Aura_id` each time she logs in Aura.	Hashed
6	PHONE_ID	Phone number of the user	NOT transferred	NOT transferred
7	CHANNEL_CD	Code of the channel where the action happened
8	SUBSCRIPTION_CD	Code of the subscription type of the user in the OB		NOT transferred
9	DOMAIN_CD	Code of the domain where the action happened		NOT transferred
10	CATEGORY_CD	Code of the category where the action happened		NOT transferred
11	COUNTRY_CD	Code of the country		NOT transferred
12	CORR_ID	Correlator ID of the request that produces this data
13	IS_CACHED	Shows if the entity content was already cached or not		NOT transferred
14	STATUS_CD	Status code of the action, if meaningful
15	REASON	Result of the action in error case, code of the error		NOT transferred
16	VERSION_ID	Aura version that produces this data
17	LANG_CD	Language configured by the user for communication		NOT transferred
18	TZ_CD	Timezone where the communication happened		NOT transferred
19	DURATION_NU	Duration in milliseconds of the action
20	MESSAGE	Content of the message	Anonymized
21	DIALOG_ID	Id of the dialog where the message happens
22	CONVERSATION_ID	Id of the conversation where the message happens		NOT transferred
23	WIN_RECOGNIZER_CD	Code of the recognizer that wins for this message		NOT transferred
24	WIN_RECOGNIZER_SCORE_NU	Score of the recognizer that wins for this message		NOT transferred
25	INTENT	Selected intent
26	ENTITIES	List of entities determined by the recognizer
27	MODALITY_CD	How does the user communicate with Aura
28	AURA_ID_GLOBAL	Identifies the same `user_id` logged with the same authentication method	Hashed
29	ACCOUNT_NUMBER	Unique account number of the user	NOT transferred	NOT transferred

RECOGNIZER dataset

Recognizer dataset stored in local Kernel

Columns “FIELD” and “DESCRIPTION”: instances managed by the OB
Columns “PPD RAW” and “PPD CLEAN”: instances managed by Aura Global Team

#	FIELD	DESCRIPTION	PPD RAW	PPD CLEAN
1	USER_ID	Unique user ID in the OB systems	NOT transferred	NOT transferred
2	RECOGNIZER_DT	Timestamp of the data
3	RECOGNIZER_ID	Unique ID of the recognizer
4	ACTION_CD	Code of the action that produces the data		NOT transferred
5	AURA_ID	User logging ID in Aura. The user will have a new `Aura_id` each time she logs in Aura.	Hashed
6	PHONE_ID	Phone number of the user	NOT transferred	NOT transferred
7	CHANNEL_CD	Code of the channel where the action happened
8	DOMAIN_CD	Code of the domain where the action happened		NOT transferred
9	CATEGORY_CD	Code of the category where the action happened		NOT transferred
10	COUNTRY_CD	Code of the country		NOT transferred
11	CORR_ID	Correlator ID of the request that produces this data
12	IS_CACHED	Shows if the entity content was already cached or not		NOT transferred
13	STATUS_CD	Status code of the action, if meaningful
14	REASON	Result of the action in error case, code of the error
15	VERSION_ID	Aura version that produces this data
16	LANG_CD	Language configured by the user for communication		NOT transferred
17	TZ_CD	Timezone where the communication happened		NOT transferred
18	DURATION_NU	Duration in milliseconds of the action
19	SCORE_NU	Score returned by the recognizer
20	INPUT	User input sent to the recognizer. Null if incoming message is an AuraCommand	Anonymized
21	OUTPUT	Complete output generated by the recognizer
22	INTENT	Intent returned by the recognizer
23	ENTITIES	Entities returned by the recognizer due to the intent
24	COMMON_THRESHOLD_NU	Common threshold used to determine the best answer of all recognizers		NOT transferred
25	THRESHOLD	Specific threshold of the specific recognizer being executed		NOT transferred
26	EXPECTED_INTENT	Intent expected to be returned by the recognizer		NOT transferred
27	EXPECTED_ENTITIES	Entities expected to be returned by the recognizer due to the intent		NOT transferred
28	AURA_ID_GLOBAL	Identifies the same `user_id` logged with the same authentication method	Hashed
29	ACCOUNT_NUMBER	Unique account number of the user	NOT transferred	NOT transferred


API dataset

API request dataset (stored in local Kernel)

Columns “FIELD” and “DESCRIPTION”: instances managed by the OB
Columns “PPD RAW” and “PPD CLEAN”: instances managed by Aura Global Team

#	FIELD	DESCRIPTION	PPD RAW	PPD CLEAN
1	USER_ID	Unique user ID in the OB systems	NOT transferred	NOT transferred
2	REQUEST_DT	Timestamp of the data
3	REQUEST_ID	Unique ID of the request
4	ACTION_CD	Code of the action that produces the data		NOT transferred
5	AURA_ID	User logging ID in Aura. The user will have a new Aura_id each time she logs in Aura	Hashed	NOT transferred
6	PHONE_ID	Phone number of the user	NOT transferred	NOT transferred
7	CHANNEL_CD	Code of the channel where the action happened		NOT transferred
8	DOMAIN_CD	Code of the domain where the action happened		NOT transferred
9	CATEGORY_CD	Code of the category where the action happened		NOT transferred
10	COUNTRY_CD	Code of the country		NOT transferred
11	CORR_ID	Correlator ID of the request that produces this data
12	IS_CACHED	Shows if the entity content was already cached or not	NOT transferred	NOT transferred
13	STATUS_CD	Status code of the API request
14	REASON	Result of the action in error case, code of the error
15	VERSION_ID	Aura version that produces this data		NOT transferred
16	LANG_CD	Language configured by the user for communication		NOT transferred
17	TZ_CD	Timezone where the communication happened
18	DURATION_NU	Duration in milliseconds of the action
19	HOST	Host of the API
20	PATH	Specific path of the API being called		NOT transferred
21	HTTP_STATUS	HTTP status of the server response		NOT transferred
22	RESPONSE	Response body	Anonymized
23	AURA_ID_GLOBAL	Identifies the same user_id logged with the same authentication method	Hashed	NOT transferred
24	ACCOUNT_NUMBER	Unique account number of the user	NOT transferred	NOT transferred
25	REQUEST	Request body

2 - Aura Analytics 2.0.0

Aura Analytics 2.0.0

Description of Aura Analytics 2.0.0, the monitoring tool designed and managed by Aura Global Team that allows active listening in Aura

What is Aura Analytics 2.0.0?

Active listening is a key process that consists of continuously monitoring Aura performance based on real user logs. These logs are analyzed to gather insights on the efficiency and effectiveness of Aura as a system and to track how users interact with Aura.

In this framework, Aura Analytics 2.0.0 is a tool used by Aura Global Team that uses active listening with the ultimate goal of improving Aura quality, as it generates accurate information to carry out both corrective and predictive actions and to decide how Aura should evolve in the future.

How does Aura Analytics 2.0.0 work?

The process is built upon the Aura user logs generated in the production environment.
From these logs, Aura Analytics 2.0.0 creates PPDs (Privacy-Preserving Datasets).
Datasets are processed, enabling visualization through dashboards and the extraction of statistical insights.
The Aura Global Team consumes this data to support decision-making processes.

Target users

The Aura Global Team is the target user of the Aura Analytics 2.0.0 tool, responsible for its design and management as well as for the interpretation of results for decision-making.
OBs should allow the generation of datasets from their Aura users logs in their local environment just by installing and executing a single process, as shown in the document Guidelines for OBs.

Index of documents

Aura Analytics 2.0.0 includes the following documents:

Version history
Architecture overview and main processes and components
Internal operation
Guidelines for OBs willing to consume data provided by Aura Analytics 2.0.0
Aura Analytics dashboard
Annex: Dataset fields

Aura Analytics versions

Release 1.0.0

The first release, 1.0.0, sets up the basic paths, deploys the PPD infrastructure and produces:

Version 1.0.0 of the OB Analytics system, which includes the OB Dashboard.
The first version of pre-processed datasets (clean PPDs) for training and analytics at Aura Global.

As mentioned, this version enables going further by:

Enhancing the analytics dashboard with new visualizations.
Processing the PPD files as desired (they are standard CSV files, which can be ingested in alternative platforms). Restrictions on them are softer than on the original logs due to the anonymization process they have undergone, although they are still subject to management precautions (a code of conduct is being prepared for that).

Release 1.1.0

Version 1.1.0 introduces the following changes:

The table of data has been enlarged with these new fields: AURA_ID, STATUS_CD, sesId, sesSize, sesDuration.
An expanded list of test users is used, so that the userType column contains more identifications.
The code for data ingestion into a local Kibana, which previously consisted of a single Python script, has been turned into a full installable Python package, due to its increasing complexity.

Release 2.0.0

Version 2.0.0 introduces the following changes:

In version 2.0.0, Aura Analytics has been refactored to improve its structure and make it easier to understand, maintain and extend in the future.
Aura Analytics 2.0.0 simplifies the deployment and execution process.
One of the most significant enhancements in Aura Analytics 2.0.0 is its capability to manage both processed and to-process files centrally in one place (a database).

Prerequisites and recommended tools

The prerequisites for the use of Aura Analytics 2.0.0 are set below:

Recommended tool for data visualization: ELK stack

2.1 - Architecture

Aura Analytics 2.0.0 architecture

Technical architecture of Aura Analytics 2.0.0 and description of main processes and components

Architecture overview

Aura Analytics 2.0.0 contains two different environments:

OB local environment: Processes on this side are managed by the OB, which should install and execute certain processes related to the PPD-Creator for the creation of raw datasets.
Global environment: Processes here are managed by Aura Global Team for data recovery, processing and generation of dashboards and statistics. The output includes data and metrics to be consumed by Aura Global Team for decision-making.

Aura Analytics 2.0.0 architecture flowchart

The following diagram shows an overview of Aura Analytics 2.0.0 architecture, including the environments involved and the main components and processes, which are fully described in succeeding sections.

Figure 1. Aura Analytics 2.0.0 Architecture flowchart

Aura Analytics 2.0.0 processes

PPD-Creator process

The PPD-Creator is a Python module for the creation of PPD-Raw datasets.

It is the only component that belongs to the OB environment. The OB should install it and is responsible for its execution. The PPD-Raw datasets will be stored in the destination blob PPD-RAW.

This process reads the fields listed in the OB-managed columns of the tables in Annex: Dataset fields. Its result corresponds to the PPD RAW columns of the tables in that annex.

The main tasks executed by the PPD-Creator are summarized below:

Reads the Aura log files in a Kernel blob.
Anonymizes the sensitive fields (AuraID, AuraGlobalID, and personal information in the user sentence such as DNI, phone numbers, etc.).
Saves the anonymized files to another blob directory (PPD-Raw).
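As a rough illustration of the anonymization step, the sketch below hashes identifiers with a secret key and masks personal data in free text. The actual algorithm used by the PPD-Creator is not described in this document, so everything here is an assumption: the use of HMAC-SHA256, the function names, the regular expressions and the `<email>` / `<phone>` placeholders are hypothetical.

```python
import hashlib
import hmac
import re

# Hypothetical patterns; the real detectors are country-specific.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{9}\b")  # assumption: 9-digit phone numbers


def hash_identifier(value: str, anon_key: str) -> str:
    """Keyed hash: same input and key always give the same opaque output."""
    digest = hmac.new(anon_key.encode(), value.encode(), hashlib.sha256)
    return digest.hexdigest()


def anonymize_text(text: str) -> str:
    """Mask e-mail addresses and phone numbers in a user sentence."""
    text = EMAIL_RE.sub("<email>", text)
    return PHONE_RE.sub("<phone>", text)


print(anonymize_text("call 612345678 or mail a.b@example.com"))
# call <phone> or mail <email>
```

A keyed hash keeps the identifier stable across files, so per-user statistics remain possible, while the original value cannot be recovered without the key.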

Figure 2. PPD-Creator process

The PPD-Creator anonymizes the following data, in the different OBs:

ES	UK
dni	creditcard
nie	insurance
phone	postcode
email	imei
	phone
	imsi
	email
	twitter
	passport

Manage PPD-Raw process

The Manage PPD-Raw process inserts the paths of the PPD-Raw files (the output of PPD-Creator) into a PostgreSQL table, so that files are managed centrally in the database:

It reads the JSON file produced by PPD-Creator.
Afterwards, it saves the paths to the PostgreSQL server.

Figure 3. Manage PPD-Raw process

PPD-Clean process

The PPD-Clean is a Python package used to clean PPD-Raw datasets.

Firstly, this process locates the directory where the PPD-Raw files are located, reads the corresponding files and processes them.

Once the process is finished, it records the processed files in the files_processed table of the database and saves the cleaned data in the PPD-Clean directory.

The main tasks executed by the PPD-Clean are summarized below:

Applies transformations to columns.
Extracts the explicit frustration.
Calculates the n-grams of None interactions (user messages without an intent).
Saves the data in a directory or blob, in the PostgreSQL server and in ElasticSearch for visualization.
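As an illustration of the n-gram task, the following sketch counts word n-grams over the messages that have no intent. It is a simplified example, not the PPD-Clean code: the function name is hypothetical, rows are assumed to be dictionaries with MESSAGE and INTENT keys, and a missing intent is assumed to appear as `None`, an empty string or the literal string `"None"`.

```python
from collections import Counter

# Assumption: how a missing intent is represented in the data.
NO_INTENT_VALUES = (None, "", "None")


def none_ngrams(rows, n=2):
    """Count word n-grams in messages whose INTENT is missing."""
    counts = Counter()
    for row in rows:
        if row.get("INTENT") not in NO_INTENT_VALUES:
            continue  # the message was understood: skip it
        tokens = (row.get("MESSAGE") or "").lower().split()
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


rows = [
    {"INTENT": "None", "MESSAGE": "cancel my plan"},
    {"INTENT": "bill", "MESSAGE": "cancel my plan"},
    {"INTENT": None, "MESSAGE": "cancel my contract"},
]
print(none_ngrams(rows).most_common(1))  # [('cancel my', 2)]
```

Frequent n-grams among non-understood messages point to topics that Aura does not yet cover.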

Figure 4. PPD-Clean process

User Dynamics process

User Dynamics is a script used to measure user behavior through metrics. It extracts statistics on the recurrence of users in Aura on a monthly basis.

The processes executed are summarized below:

User Dynamics reads the file_processed table of the database and all the PPD-Clean files stored for 1 month.
It extracts metrics regarding new users, recurrent users, lost users and recovered users.
Afterwards, it saves these metrics in the User_dynamics schema, in a PostgreSQL database, within the tables connections, daycount, user and channel.
Data is also saved in ElasticSearch.

Figure 5. User Dynamics process

Components

Active Listening Database

The Active Listening Database is a PostgreSQL database that stores the processed and to-process files centrally in one place. It is used by the PPD-Clean and User Dynamics processes to store the processed data and metrics.

Schema files management

Currently, each process in the Active Listening project has its own input and output files, plus a record of the files already processed. The files management database makes the management of raw files more efficient:

The PPD-Creator process transfers files from the OB to a shared blob.
The transferred files are written to a file in that blob called aura-sync-cache-dst.json.
The manage_ppd_raw process will read the aura-sync-cache-dst.json file from the PPD-Raw folder and ingest the records into the FILE_PPD_RAW table of the database.
It will also insert into the EMPTY_DATA_FILES table the days that are not found in the JSON file. This table is necessary for logging metrics in Prometheus. This process will run daily.
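The two tasks of the manage_ppd_raw process (listing the transferred files and detecting days without data) can be sketched as follows. This is an illustration only: the function names are hypothetical, the database inserts are left out, and the sketch assumes that the folder key ends with the month (`YYYYMM`) and that each file name carries a `YYYYMMDDThhmmssZ` timestamp, as in the sample file names shown in this documentation.

```python
import calendar
import re
from datetime import date

# Matches the date part of names such as ..._ES_API_20221202T150000Z.txt.bz2
STAMP_RE = re.compile(r"_(\d{8})T\d{6}Z\.")


def flatten_cache(cache):
    """Turn {folder: [file, ...]} into (folder, file) rows for FILE_PPD_RAW."""
    return [(folder, name) for folder, names in cache.items() for name in names]


def missing_days(folder, names):
    """Days of the folder's month with no file: candidates for EMPTY_DATA_FILES."""
    month = folder.rstrip("/").split("/")[-1]        # e.g. "202212"
    year, mon = int(month[:4]), int(month[4:])
    seen = {m.group(1) for m in map(STAMP_RE.search, names) if m}
    last_day = calendar.monthrange(year, mon)[1]
    return [
        date(year, mon, day)
        for day in range(1, last_day + 1)
        if f"{year:04d}{mon:02d}{day:02d}" not in seen
    ]
```

In a real deployment, the rows returned by these functions would then be written to PostgreSQL; how that is done is outside the scope of this sketch.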

Figure 6. Files management database

Schema user dynamics

The User Dynamics process generates the statistics of Aura users, number of daily active users and types of users, with 4 categories: new, recurring, lost and recovered.

The Channel table contains all the channels in Aura that have been processed by the User Dynamics process.
The User table contains the unique Aura users in each environment and country.
The Daycount table contains the number of total users for each day, indicating how many of them are new, recurring, recovered or lost users, the number of weekly unique recurring users and the number of monthly unique recurring users.
The Connection table has the status of the user for each day (whether it is new, recurring, lost or recovered).

Figure 7. User dynamics database

Aura Analytics Dashboard

Among other elements, Aura Analytics 2.0.0 produces an analytics component named Aura Analytics Dashboard, which Aura Global Team uses to gather statistics on the production system and to analyze user behavior.

This Analytics Dashboard is based on the ELK stack that contains:

ElasticSearch: distributed search and analytics engine at the heart of the Elastic Stack. It allows the storage of data and its subsequent indexing, search and analysis.
Kibana: provides a visualization tool that includes dashboards and panels created over the ElasticSearch data. Users interactively explore, visualize and share insights into data and manage and monitor the stack.

Once installed:

An ElasticSearch index is created. It is called aura-ppd-ENTITY-COUNTRY-YEAR, and its index schema contains a cleaned version of the AURA MESSAGE, RECOGNIZER or API tables (which register input and output messages).
A Kibana index pattern is created, matching the uploaded ElasticSearch index.
A pre-defined set of visualizations is installed in Kibana over that index pattern, as a means to get a default view of the index data. See the section pre-installed analytics dashboard.
The system automatically ingests any new clean PPD being produced in the ElasticSearch database, so that the index and dashboards remain up to date.
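As a small illustration of the index naming scheme, the helper below builds an index name from its three parts. The function name is hypothetical, and lowercasing the parts is an assumption of this sketch (ElasticSearch requires index names to be lowercase, but how the real system derives them is not documented here).

```python
def ppd_index_name(entity: str, country: str, year: int) -> str:
    """Build an index name following the aura-ppd-ENTITY-COUNTRY-YEAR scheme."""
    return f"aura-ppd-{entity}-{country}-{year}".lower()


print(ppd_index_name("MESSAGE", "ES", 2023))  # aura-ppd-message-es-2023
```

A Kibana index pattern such as `aura-ppd-message-*` would then match the indices of all countries and years for that entity.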

As mentioned above, the Aura Analytics Dashboard is conceived to be used by Aura Global Team. However, OBs can install the ELK stack locally, or any other visualization tool, for data consumption. See the document Local data visualization for further details.

2.2 - Operation

Aura Analytics 2.0.0 operation

Discover Aura Analytics 2.0.0 operation at a glance

Introduction

Based on the Aura Analytics 2.0.0 architecture, this document provides an overview of its global operation.

Take the Aura Analytics 2.0.0 architecture flowchart as a reference to follow each step of the dataflow described below:

Aura logs generated in the local instance are converted to datasets and transferred to the local Kernel via the standard procedure and with the established frequency (typically, daily). Once there, the Active Listening process flow starts up daily.
PPD-Creator: This is the first process that runs, and it is the only one executed in the OBs’ environment. It retrieves Kernel data, anonymizes all sensitive data that could identify users, and then transfers this data to an environment shared with the Aura Global Team.
Manage-PPD-Raw: This is the first process executed from the global environment. It solely stores the paths of the data transferred by the PPD-Creator into a PostgreSQL database to keep a record of which data has been transferred.
PPD-Clean: This process runs from the global environment. Once the data is anonymized, it is processed to extract additional features (such as user frustration or n-grams from user phrases in interactions that do not have an intent).
Once the data is processed, it is saved to a path in the environment and also in ElasticSearch, to create dashboards that track how customers use Aura.
User-Dynamics: This is the last process, also executed in the global environment. It is responsible for extracting statistics about users’ recurrence and the number of users per day. In particular, it identifies new users, recurring users (those interacting every day), recovered users (those who stopped using Aura at some point and have returned to the system) and lost users (those who have not used Aura for 3 months).
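The user categories can be illustrated with the following sketch, which labels the active days of a single user. It is a simplified reading of the definitions above, not the actual User-Dynamics logic: the function names are hypothetical, "3 months" is approximated as 90 days, and any active day that is neither the first one nor a return after being lost is labelled as recurring.

```python
from datetime import date, timedelta

LOST_AFTER = timedelta(days=90)  # assumption: "3 months" taken as 90 days


def label_active_days(active_days):
    """Return {day: status} for the days on which one user was active."""
    statuses = {}
    previous = None
    for day in sorted(set(active_days)):
        if previous is None:
            statuses[day] = "new"          # first time the user is seen
        elif day - previous >= LOST_AFTER:
            statuses[day] = "recovered"    # the user was lost and came back
        else:
            statuses[day] = "recurring"
        previous = day
    return statuses


def is_lost(last_active: date, today: date) -> bool:
    """A user counts as lost after 90 days without activity."""
    return today - last_active >= LOST_AFTER


days = [date(2023, 1, 1), date(2023, 1, 2), date(2023, 6, 1)]
print(label_active_days(days)[date(2023, 6, 1)])  # recovered
```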

Examples of different dashboards are included below:

Figure 1. Users dashboard

Figure 2. Daily users dashboard

Figure 3. Weekly users dashboard

Figure 4. Trends dashboards

2.3 - Guidelines for OBs

Guidelines for OBs

Guidelines for OBs to allow the generation of datasets from their Aura users logs in local environment

Introduction

As seen in the Aura Analytics 2.0.0 architecture flowchart, Aura Analytics 2.0.0 contains two different environments: the OB local environment, managed by the OB and the Global one, managed by Aura Global Team.

Within this framework, the current guidelines are tailored towards OBs, indicating how to install and execute the PPD-Creator, for the creation and processing of PPD RAW datasets.

Once it is carried out, the subsequent processes of Aura Analytics 2.0.0 are executed in global environment by Aura Global Team.

Installation of PPD-Creator

The OB must install the PPD-Creator and is responsible for its execution. The PPD-Raw datasets it generates are stored in a specific destination blob, PPD-RAW.

Guidelines are included in the Aurak8s installer documentation: Active listening deployment.

Execution of PPD-Creator

The PPD-Creator must be executed by the OB, once it has been installed.

Parameters to launch the PPD-Creator

These are the parameters that the PPD-Creator takes from Kernel:

Mandatory parameters:

Parameter	Description	Example
`--src-type`	Source type	‘aws’, ‘local’, ‘azure’
`--src-name`	Source folder name (raw files)	`/directory/data/`
`--dst-type`	Destination type	‘aws’, ‘local’, ‘azure’
`--dst-name`	Destination folder name (raw files processed)	`/directory/data/output`
`--country`	Customize for a lang_country pair	es
`--table`	Table to process {message, grootmessage, recognizer, api}	MESSAGE
`--environment`	Environment to process (ap-one, ap-two, prod)	ap-one
OPERATIONS
`--transfer`	File processing: transfer, anonymize and group files
`--copy`	Raw file copy – no processing
`--copy-fix`	File copy + apply small fixes
`--show`	Show available files; no action performed

Optional parameters:

Parameter	Description
`--src-user`	Access key for the source folder
`--src-pass`	Source access secret
`--src-encryption`	Encryption key for data in source
`--dst-user`	Access key for the destination folder
`--dst-pass`	Destination access secret
`--dst-encryption`	Encryption key for data in destination
`--cloud-type`	Cloud type (‘aws’, ‘azure’)
`--dry-run`	Process dry-run: no action performed
`--reraise`	Re-raise exceptions on errors
`--options-file`	Read additional options from a file in dest repo
`--verbose`	Verbose level
`--console`	Act as a console app (format logs as console messages, raise on errors)
`--dest-log-skip`	Do not write the logfile at the destination folder
`--dest-log-dir`	Logging subdirectory at destination
`--show-input`	Print out input arguments
`--anon-key/ encryption-key`	Encryption key for anonymization of sensitive columns
`--max-files`	Maximum number of files to process
`--folders/month`	Restrict transfer to certain folders (i.e. months)
`--tables`	Restrict transfer to certain tables
`--reset`	Reset the index and re-process
`--encrypt-index`	Save the index files encrypted or unencrypted (default depends on cloud type)
`--raw`	Copy as raw data instead of text file
`--fix-header`	Fix file header
`--fix-anon`	Fix anonymized fields in API columns

Launching PPD-Creator

Execute the following command:

```
docker run aura/ppd-creator --country <country-code> --anon-key <KEY> <source-params> <dest-params>
```

Example:

```
docker run aura/ppd-creator \
   --country ar \
   --anon-key as34-dre23-4127 \
   --src-name 4P-bucket-name-for-uk \
   --src-user EF45IHWD34DE4FGA \
   --src-pass k/Erf/6DSWWPjhdde1/abc123def-2331ldf \
   --dst-name aura-ppd-ar \
   --dst-user EF4341sdf3EFGUA1 \
   --dst-pass J/DQW/Sdde5k12ldsf/1abcde12dd1d-123c11 \
   --dst-encryption 1234ab56-12a3-45eb-8e06-8c522cdbb668-75f1b00f-6ca6-4a13-a741-64514cce728b \
   --table message \
   --environment prod \
   --transfer
```

Output from PPD-Creator

The output includes the following items:

BOT_XXXXXX.txt.bz2: raw files (processed).
log folder: if the logging options have been configured.

aura-sync-cache-dst.json: table/month: processed files (automatically generated in destination). For example:

{
  "AURA_DATA/ES/API/202212/": [
    "BOT_04095750-724e-11ed-9565-53054255c842_ES_API_20221202T150000Z.txt.bz2",
    "BOT_d2e93fc0-7656-11ed-a8eb-49a811568ab3_ES_API_20221207T170000Z.txt.bz2",
    "BOT_987780e0-7660-11ed-ba4a-2dac114c5321_ES_API_20221207T180000Z.txt.bz2"
  ],
  ...
}

aura-sync-cache-src.json: table/month: raw files_to_process (source). For example:

{
  "AURA_DATA/ES/API/202212/": [
    "BOT_04095750-724e-11ed-9565-53054255c842_ES_API_20221202T150000Z.txt",
    "BOT_05a5b860-7663-11ed-bbf7-cb8fd9eb3c25_ES_API_20221207T190000Z.txt",
    "BOT_05ae43e0-7663-11ed-a0aa-8b7e0e134809_ES_API_20221207T190000Z.txt",
    "BOT_0d69fb10-7492-11ed-a1fc-95dce7e56901_ES_API_20221205T110000Z.txt"
  ],
  ...
}
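The two cache files can be compared to see which source files still lack a processed counterpart. The following is a minimal sketch, not part of PPD-Creator: it assumes that the source cache lists the files seen at the source, that the destination cache lists the files already processed, and that a processed file keeps the source name with a `.bz2` suffix appended (as the examples above suggest). The function name `pending_files` is hypothetical.

```python
def pending_files(src_cache, dst_cache, suffix=".bz2"):
    """Return, per table/month folder, the source files with no processed counterpart.

    Assumption: a processed file is named <source name> + suffix.
    """
    pending = {}
    for folder, src_files in src_cache.items():
        done = set(dst_cache.get(folder, []))
        todo = [name for name in src_files if name + suffix not in done]
        if todo:
            pending[folder] = todo
    return pending


# Data taken from the two examples above
src = {
    "AURA_DATA/ES/API/202212/": [
        "BOT_04095750-724e-11ed-9565-53054255c842_ES_API_20221202T150000Z.txt",
        "BOT_05a5b860-7663-11ed-bbf7-cb8fd9eb3c25_ES_API_20221207T190000Z.txt",
        "BOT_05ae43e0-7663-11ed-a0aa-8b7e0e134809_ES_API_20221207T190000Z.txt",
        "BOT_0d69fb10-7492-11ed-a1fc-95dce7e56901_ES_API_20221205T110000Z.txt",
    ]
}
dst = {
    "AURA_DATA/ES/API/202212/": [
        "BOT_04095750-724e-11ed-9565-53054255c842_ES_API_20221202T150000Z.txt.bz2",
        "BOT_d2e93fc0-7656-11ed-a8eb-49a811568ab3_ES_API_20221207T170000Z.txt.bz2",
        "BOT_987780e0-7660-11ed-ba4a-2dac114c5321_ES_API_20221207T180000Z.txt.bz2",
    ]
}
result = pending_files(src, dst)
```

In practice the two dictionaries would be loaded with `json.load` from aura-sync-cache-src.json and aura-sync-cache-dst.json.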

aura-sync-key-dst.json: key used to encrypt sensitive fields. For example:
```
{
  "sample": "abcd1234-ab12-12ab-ab12-1abc234e56fg"
}
```

Local data visualization (optional)

As explained before, Aura Global Team will be in charge of the analysis of the generated data through the global tool Aura Analytics Dashboard.

Nevertheless, just in case the OB wants to visualize certain data locally:

This will be done following a prior agreement with the OB on privacy-related matters.
Aura Global Team will provide access to the clean data stored in the corresponding PPD-clean blob container.
The OB can install the ELK stack locally, or an alternative tool, for data visualization.
No support will be offered by Aura Global team for this task.

2.4 - Analytics Dashboard

Aura Analytics 2.0.0 Dashboard

Description of Aura Analytics 2.0.0 dashboard used by Aura Global Team to gather statistics on the production system and to analyze user’s behavior.

Among other elements, Aura Analytics 2.0.0 produces an analytics component named Aura Analytics Dashboard, which Aura Global Team uses to gather statistics on the production system and to analyze users’ behavior. This Analytics Dashboard is based on the ELK stack.

The dashboard provides a pre-defined set of visualizations, described throughout this document. Nevertheless, it is possible to build additional dashboards using the ELK stack.

Pre-installed analytics dashboard

Kibana offers many possibilities to visualize the ingested data, and there are many resources and tutorials around explaining its mechanics. We therefore refer to the official Kibana documentation, or the many tutorials available on the Web, for generic information.

In the particular case of Aura Analytics 2.0.0, there is an ElasticSearch index that gets automatically ingested daily. It is called Aura-message-COUNTRY, and contains a cleaned version of the AURA MESSAGE table (which registers input and output messages).

Over this index, three types of panels/visualizations have been pre-installed:

Discover panel
Visualizations
Dashboards

Discover

The Discover panel in Kibana is an essential tool for querying an ElasticSearch index (and saving those searches, if desired) and for exploring users’ interactions with Aura in detail, log by log. The logs shown are determined by:

Search terms or conditions » A time interval » Additional filters applied to the query results » A set of index fields to show in the result table.

These 4 steps are represented in Figure 1:

Figure 1. Discover panel

As shown in this figure, the starting point is the ElasticSearch index holding all the data.

Each of the first three steps in the chain reduces the amount of data handled, by pruning out elements that do not satisfy the defined condition. The fourth step is just a display adjustment: it defines which of the available fields of the final dataset are shown in the output table that appears in the panel.
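The same four steps can be expressed as an ElasticSearch search body. The sketch below only builds the request as a Python dictionary; it is an illustration, not the saved panel definition. The timestamp field name (`@timestamp`), the function name `discover_query` and the example intent name are assumptions, since they are not stated in this document.

```python
def discover_query(search_text, days, excluded_intents, fields):
    """Build a search body that mirrors the four Discover steps."""
    # Step 1: search terms or conditions (a blank query returns everything)
    if search_text:
        must = [{"query_string": {"query": search_text}}]
    else:
        must = [{"match_all": {}}]
    # Step 2: time interval ("@timestamp" is an assumed field name)
    filters = [{"range": {"@timestamp": {"gte": f"now-{days}d", "lte": "now"}}}]
    # Step 3: additional filters, here excluding a list of intents
    must_not = [{"terms": {"INTENT": excluded_intents}}] if excluded_intents else []
    # Step 4: fields shown in the result table
    return {
        "query": {"bool": {"must": must, "filter": filters, "must_not": must_not}},
        "_source": fields,
    }


# Hypothetical intent name, for illustration only
body = discover_query("", 7, ["intent.help"], ["MESSAGE_USR", "INTENT", "MESSAGE_AURA"])
```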

In the Aura Dashboard default set, there is one such Discover panel pre-installed. It is called question-answer pairs and has the following characteristics:

A blank query (i.e., provide all the results)
A time interval for the last 7 days
An “only user” filter: filters out all intents that correspond to non-user queries (suggestions, help commands from the client application, etc.)
A visualization that includes: timestamp, (cleaned) user message, detected Aura intent, associated entities (if applicable), dialog that was invoked and Aura’s response.

Figure 2 shows a snapshot of this panel. To load it, select the Discover tool in the left navigation bar, and then click on the “Open” menu option in the top menu bar. A list of saved panels will be shown, with this one in it named “question-answer pairs”.

Figure 2. Question-answer pairs panel

Once the panel is loaded, each of the aforementioned four elements can be freely modified. For example, the interface allows:

Adding new filters with the “+Add Filters” button
Deactivating the current filters by pressing over the predefined ones and clicking over the “Temporarily Disable” option
Modifying the query interval with the “calendar” button or “Dates Box”
Adding a specific query on a given index field(s) by using the “Search Box”, instead of the (default) blank query

Discover panels can be saved as named objects, to be later loaded at will. So, if needed, any panel (a modified panel or a newly created one) can be saved with a new name, to have it available for later loading.

Visualizations

A total of 7 visualizations come pre-installed with the base Aura Dashboard. The list can be obtained from the visualizations item in the left menu bar, shown in Figure 3:

Three “Stats” type visualizations, which provide general statistics on platform usage.
Four “User” type visualizations, which provide insights on user behavior.

Figure 3. Preinstalled visualizations dashboard

Note that this distinction between “User” and “Stats” is purely conceptual and based on the fields that have been used to generate the visualizations. From the point of view of Kibana, they are all regular visualizations.

Those visualizations can be instantly loaded by clicking on their names. But they can also be integrated into dashboards, which is described in the next section.

Dashboards

A dashboard in Kibana is essentially a spatial arrangement of visualizations. For example, to construct a dashboard, we just place visualizations into a page, resizing them as we wish, so we can observe all of them in a single place afterwards.

Within a dashboard all visualizations are linked. For example, if we change the time interval or add a filter using the interface, this modification affects all visualizations in the dashboard, and all of them get updated.

Elements in the dashboard visualizations can also generate “instant filters” by clicking on graphs or table elements. Those filters are added to the top of the page as a filter afterwards and can be modified or removed.

The Aura Analytics default installation preloads two dashboards. Those are available for selection when we click on the Dashboard icon in the left navigation bar:

Figure 4. Aura analytics default dashboards

Nones dashboard

This dashboard integrates the n-grams extracted by the PPD-Clean process.

Figure 5. Nones dashboard

System dashboard

This dashboard integrates the three predefined “Stats” visualizations (generic statistics):

A timeline of interactions (user messages sent and answered), segmented by channel
A heatmap of interactions by weekday and time of day (hour)
A bar graph classifying the interactions produced in the period by detected intent

Figure 6. System dashboard

User Dashboard

The User dashboard contains 4 user visualizations:

Most Frequent User Utterances: list of the most frequent user sentences (in the time interval and filter active at the moment). It uses the usrMsgSig field to group together very similar utterances.
AURA Answer Groups: list of the most frequent answers that Aura generates, grouped by the semantic categories in AuraMsgGroup field.
Words per query: distribution of sizes for the user messages, measured as number of words in the utterance, and segmented by channel.
Tag cloud: set of plain most frequent user utterances, as a tag cloud in which the font size represents the utterance frequency. The MESSAGE_USR_NORM field is used for the representation, so it contains normalized utterances.

The next screenshots show the dashboard with all these visualizations (it is a large dashboard, so typically it needs scrolling to visualize all its components).

Figure 7. User dashboard

Note that those four visualizations are linked in the sense of corresponding to the same subset of the data (as given by filters and time interval) but they are NOT linked at the individual item level (i.e., a given most frequent user utterance in the left table does not correspond to any specific Aura answer in the right bar graph).

Instead, the dashboard can be manipulated by selecting one specific item in any of the visualizations, and this will create a filter for the others.

For instance, as the following image shows, if we select “CHURN” in the Aura answer group visualization, we can observe in the others the user utterances that led Aura to generate that answer (i.e., an answer about contract cancelation).

Figure 8. Example of Aura answer groups in the user dashboard

Building new visualizations and dashboards

If the OB has installed the ELK stack locally, new elements can be built (or the current ones modified) by making use of the fields available in Kibana through the ingested ElasticSearch index.

In this section, we provide a reference of the schema that the index follows, so it can be used to build such new visualizations or to better understand the existing ones.

Data model

Field types

Elements in the Aura-message data model have 3 different types:

Numeric: single numbers, integer or real. Suitable for numerical statistics, such as averages, or for plotting variation across time in graphs.
Keyword: opaque strings, i.e., terms that cannot be searched within (it is not possible to look for words inside a keyword field). They can, however, be used in some term-level queries, such as prefix queries (find all instances that begin with a given string). They usually work well for aggregations, since most of them are categorical variables (fields that only have a limited number of possible values) and can therefore be bucketed and counted.
Text: these fields are divided into separate terms (words), and some pre-processing is applied to them before indexing, through an ElasticSearch analyzer, to improve access. Text fields cannot be used in aggregated visualizations, since they cannot be grouped. They are most useful for queries, because they allow searching for fragments (only a few words) and fuzzy searches.
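As an illustration of how each field type is typically used, the sketch below builds one ElasticSearch request body per type as a plain Python dictionary. The field names come from the field list in this document; the function names and the example values (such as the intent prefix) are hypothetical.

```python
def average_duration():
    """Numeric field: numerical statistics, here the average of DURATION_NU."""
    return {"size": 0, "aggs": {"avg_duration": {"avg": {"field": "DURATION_NU"}}}}


def intents_starting_with(prefix):
    """Keyword field: term-level prefix query (no search inside the string)."""
    return {"query": {"prefix": {"INTENT": {"value": prefix}}}}


def interactions_per_channel(top=10):
    """Keyword field: bucket and count a categorical variable."""
    return {"size": 0, "aggs": {"by_channel": {"terms": {"field": "CHANNEL_CD", "size": top}}}}


def messages_mentioning(words):
    """Text field: analyzed full-text search, tolerant to small typos."""
    return {"query": {"match": {"MESSAGE_USR": {"query": words, "fuzziness": "AUTO"}}}}


examples = [
    average_duration(),
    intents_starting_with("intent.billing"),  # hypothetical intent prefix
    interactions_per_channel(),
    messages_mentioning("cancel contract"),
]
```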

Field list

The following table lists all the available fields in the Aura-message-COUNTRY ElasticSearch index, with their type and a brief description. Some of them have more detailed explanations in Section Field explanations.

Note that some fields of text type have a mirror field of type keyword, with the same content. Having the same data indexed in two different ways at the same time (as text and as keyword) makes it possible to perform different types of analysis by choosing the right field.

The Raw column indicates if this field is already present in the AURA raw PPD files:

Yes: it is a field contained in raw PPDs.
No: it is a generated field, produced when creating clean PPDs. They can be recognized as lowercase fields.
Partial: it exists in the raw PPDs, but in a somewhat different form.

Field	Type	Raw	Content
CORR_ID	keyword	yes	Unique identifier for each interaction
VERSION_ID	keyword	yes	Aura Platform version
CHANNEL_CD	keyword	yes	Identifier for the channel this interaction corresponds to
STATUS_CD	keyword	yes	Internal code related to operation status
AURA_ID_GLOBAL	keyword	yes	(Mostly) unique identifier for the user
AURA_ID	keyword	yes	(Mostly) local identifier for the user
INTENT	keyword	yes	Detected user intent, including “system” intents
MESSAGE_USR	text	partial	Text request sent by the user
MESSAGE_USR_NORM	text	no	A normalized version of MESSAGE_USR
MESSAGE_USR_NORM.keyword	keyword	no	A keyword version of `MESSAGE_USR_NORM`, to enable aggregating on it
MESSAGE_AURA	text	partial	Text message sent by AURA to the user
MESSAGE_AURA.keyword	keyword	partial	Keyword version of `MESSAGE_AURA`, to enable aggregating on it
MODALITY_CD_USR	text	partial	Modality of the user message
MODALITY_CD_AURA	text	partial	Modality of Aura response
ENTITIES	text	yes	Comma-separated list of the entities recognized in the user message
DIALOG_ID	text	yes	Identifier for the dialog that produced Aura response
DIALOG_ID.keyword	keyword	yes	Keyword version of `DIALOG_ID`, to enable aggregating on it
DURATION_NU	number	yes	Elapsed time, in ms, between the reception of the user message and the moment the response is generated to be sent to the channel
userType	keyword	no	A single char identifier that characterizes the user as a test user
usrMsgWc	number	no	Message word count: number of words contained in the user message
usrMsgSig	keyword	no	Message signature: a string that helps clustering user messages
AuraMsgGroup	keyword	no	Cluster the Aura response belongs to
weekday	number	no	Day of the week the interaction happened (0=Monday to 6=Sunday)
hour	number	no	(integer) hour the interaction happened
country	keyword	partial	Two-letter code for the country
sesId	keyword	no	Session information
sesSize	number	no	Session information
sesDuration	number	no	Session information
EXPLICIT_FRUSTRATION	number	no	Probability that the user message is an explicit expression of frustration
AllNGrams	keyword	no	All n-grams of user message
AllNGramsFilter	keyword	no	All n-grams filtered by stopwords
NGrams1	keyword	no	The n-grams of 1 word
NGrams1Filter	keyword	no	The n-grams of 1 word filtered by stopwords
NGrams2	keyword	no	The n-grams of 2 words
NGrams3	keyword	no	The n-grams of 3 words

Field explanations

This subsection contains more detailed descriptions of some of the fields in the schema.

AURA_ID_GLOBAL

This element (mostly) uniquely identifies the user generating the interaction.

Note the concrete value of this field is not the same as the current identifier used in Aura and uploaded to Kernel: for privacy reasons, the identifier was hashed when generating the PPD and has no resemblance to the original one. The correspondence is however maintained across time, so it is possible to analyze user behavior.

In summary:

The identifier stays the same for a given user across time.
No two users will have the same identifier.
But the same user could produce two different identifiers if they connect through two channels that use different authentication methods.
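The properties above (stable over time, different for different users, no resemblance to the original) are those of a keyed hash. The following sketch illustrates them with HMAC-SHA256 from the Python standard library. This document does not state which hashing algorithm the PPD generation actually uses, so the algorithm, the key and the function name `pseudonymize` are assumptions made for illustration only.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, key: bytes) -> str:
    """Return a stable pseudonym for user_id (illustrative keyed hash)."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-anonymization-key"  # hypothetical key
first = pseudonymize("user-0001", key)
again = pseudonymize("user-0001", key)
other = pseudonymize("user-0002", key)
```

With a fixed key, `first` and `again` are identical, which is what allows user behavior to be followed across time without exposing the original identifier.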

AURA_ID

This is a “local” identifier, i.e., it is generated inside the channel according to the specific channel characteristics, and it is not tied as much as AURA_ID_GLOBAL to user authentication.

Its main disadvantage is its transient nature: the same user, on the same channel, could generate different AURA_ID strings when connecting at different times, in different sessions. Therefore, for user accounting and tracing, AURA_ID_GLOBAL is usually preferred.

However, there are instances in which AURA_ID works better, namely for anonymous access (when the user is not authenticated). This depends on the channel:

In the WhatsApp channel, the initial use of the channel will be anonymous from Aura side (i.e., no authentication is done), hence AURA_ID_GLOBAL will also be empty (at least until the user authenticates, which depends on the use case). But in this channel, AURA_ID has a permanent value, linked to the WhatsApp user, so here it is a good substitute for a persistent id even for unauthenticated users.

MESSAGE_USR

This field includes the message sent by the user. It has been partially processed to enhance anonymization by replacing some standard identifiers contained in it with <idxxx> strings (e.g., phone numbers appear as <idphone>).

Replacement is done mostly through regular expressions, so there might be occasional glitches (such as treating as a phone number a string that does not really correspond to a phone, just because it follows the phone number pattern).
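The kind of glitch described above can be reproduced with a small sketch. The pattern below (any run of exactly nine digits) is a hypothetical stand-in; the actual expressions used by the cleaning process are not listed in this document.

```python
import re

# Hypothetical pattern: any standalone run of exactly nine digits
PHONE_PATTERN = re.compile(r"\b\d{9}\b")


def anonymize(message: str) -> str:
    """Replace anything that looks like a phone number with <idphone>."""
    return PHONE_PATTERN.sub("<idphone>", message)


real_phone = anonymize("call me at 612345678 please")
false_positive = anonymize("my order 123456789 is late")
```

The second call shows the limitation: an order number that happens to have nine digits is also replaced, because the pattern cannot tell it apart from a phone number.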

MESSAGE_USR is a field of text type. As such, it is searchable: it is possible to search for specific words the user might have said. Furthermore, it has been processed through an ElasticSearch analyzer adapted to the specific language used. This means that searches will be able to match related words (e.g., plural versions of a singular query word, or verb conjugations). Phrase searches are also possible (by using double quotes around the phrase).

In Kibana, more sophisticated text searches can be made by switching to the Lucene query syntax: proximity queries (words close to each other), fuzzy searches (query words allowing typos), wildcards, etc.

MESSAGE_USR_NORM

This is a normalized version of MESSAGE_USR, in which the user text has been streamlined by:

Converting the whole sentence to lowercase
Removing all punctuation
Removing any extra spaces

Furthermore, this field is not processed through a language-dependent analyzer, as MESSAGE_USR is, so queries on this field must match words exactly. It is still a text field, however, so the same query language can be used.
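The three normalization steps can be sketched as follows. This is an approximation under stated assumptions: the exact set of characters treated as punctuation is not specified in this document, so the sketch removes every character that is neither a word character nor whitespace. The function name `normalize` is hypothetical.

```python
import re


def normalize(message: str) -> str:
    """Lowercase, drop punctuation and collapse extra spaces."""
    text = message.lower()
    # Assumption: punctuation = anything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace and trim both ends
    return " ".join(text.split())


normalized = normalize("  Hello,   how can I  UPGRADE?! ")
```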

MESSAGE_AURA

This contains the text message generated by Aura and sent to the user as response to the user query. It is a text field, so it is possible to search for specific words in it.

IMPORTANT In the current version of Aura KPI logs, this field contains only the text response.
Some Aura use cases do not generate a purely textual message, but a more elaborate one (e.g., a card with text and graphics). These complex answers are inserted as attachments into Aura’s response to the channel, and since attachments are not logged into the MESSAGE field, this field will appear empty in those cases.
So, an empty MESSAGE_AURA field does not necessarily mean that AURA did not provide an answer. As an alternative for those situations, looking at the DIALOG_ID field (or INTENT) may give a hint of the type of answer that Aura delivered.

MODALITY_CD_USR

This field contains the modality in which the user sent the message.

It is a slightly transformed field: because there is some variation across Aura versions, the modalities are unified by consolidating them into only four different keywords, including audio (spoken message), text (written free-text message) and form (commands sent via automatic processing or menus).

DIALOG_ID

This field contains the identifier for the use case dialog module at the Aura Bot Framework that was selected to construct the Aura response.

Dialog identifiers have two components (library and dialog) separated by a colon, e.g., services:service-usage.

This field uses a custom analyzer that splits the identifier at the colon, generating two terms. This makes it possible to construct queries with one of the terms (e.g., “give me all the elements for the domain services”). But being a text field makes it impossible to do aggregations on it, so it cannot be used for statistics like bar charts (use DIALOG_ID.keyword for that).

DURATION_NU

Session Information (sesId, sesSize, sesDuration)

These fields are generated by running a process over the time series formed by interactions from each user at each channel. A session is automatically identified as a consecutive list of such user’s interactions, each separated from the next by a time interval shorter than 5 minutes. Once each session is identified, it is tabulated and labelled with three fields:

sesId: a string, forming a unique identifier for the session. It should be considered an opaque identifier and the guarantee is that no other session in the data stream carries the same identifier.

As an aside, interactions that do not correspond to actual user interactions (because no user could be identified, or because the datapoint corresponds to an interaction not triggered by the user) are all labelled with a <void> sesId.

sesSize: the number of interactions this session contains. Only the first interaction in the session carries this value; all other interactions carry a 0 in this field. Non-sessions, such as the interactions with a <void> sesId, are left empty. This facilitates computing averages or other statistics on valid sessions, by first filtering out all zero and empty values.
sesDuration: the duration of each session, counted from the instant the first user message was received to the instant the last Aura message was sent. For single-interaction sessions its value is the same as DURATION_NU; for sessions with multiple interactions it covers the whole time span across all of them.

As with sesSize, only the first interaction in a session is annotated with sesDuration; the remaining interactions will be assigned a 0 value (and interactions that do not correspond to a session will be left empty). Therefore, to compute statistics on sesDuration, remove the 0 and empty values first.
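The session labelling described above can be sketched in a few lines. This is an illustrative reimplementation, not the production process: the input keys `user`, `channel` and `ts`, the function name `label_sessions` and the format of the generated sesId are assumptions, and interactions with a <void> sesId are left out for brevity. The gap between interactions is measured here between consecutive message timestamps.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=5)  # threshold stated in the text


def label_sessions(rows):
    """Add sesId, sesSize and sesDuration (ms) to each interaction.

    rows: dicts with 'user', 'channel', 'ts' (datetime) and 'DURATION_NU' (ms).
    """
    ordered = sorted(rows, key=lambda r: (r["user"], r["channel"], r["ts"]))
    labelled = []
    for (user, channel), group in groupby(ordered, key=lambda r: (r["user"], r["channel"])):
        # Split the time series into sessions wherever the gap reaches 5 minutes
        sessions = []
        for row in group:
            if sessions and row["ts"] - sessions[-1][-1]["ts"] < SESSION_GAP:
                sessions[-1].append(row)
            else:
                sessions.append([row])
        for number, session in enumerate(sessions):
            first, last = session[0], session[-1]
            # From the first user message to the moment the last answer is produced
            end = last["ts"] + timedelta(milliseconds=last["DURATION_NU"])
            duration_ms = (end - first["ts"]) // timedelta(milliseconds=1)
            for position, row in enumerate(session):
                labelled.append({
                    **row,
                    "sesId": f"{user}/{channel}/{number}",  # hypothetical id format
                    "sesSize": len(session) if position == 0 else 0,
                    "sesDuration": duration_ms if position == 0 else 0,
                })
    return labelled


t0 = datetime(2022, 12, 2, 15, 0, 0)
rows = [
    {"user": "u1", "channel": "wa", "ts": t0, "DURATION_NU": 800},
    {"user": "u1", "channel": "wa", "ts": t0 + timedelta(minutes=2), "DURATION_NU": 500},
    {"user": "u1", "channel": "wa", "ts": t0 + timedelta(minutes=10), "DURATION_NU": 300},
]
labelled = label_sessions(rows)
valid = [r for r in labelled if r["sesSize"] > 0]  # drop zero values before computing statistics
```

The last line applies the recommendation above: statistics on sesSize or sesDuration are computed only on the first interaction of each session.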

userType

This field may be used, in certain cases, to help identify rows that do not correspond to real users but to test users (internal users that belong to test/QA teams, and whose behavior is therefore not representative of actual Aura users). The field contains a single character, which is s for standard (real) users, and can be Q or T for QA/Test users respectively (there are also lowercase versions, q and t, which denote unconfirmed test users).

usrMsgSig

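This field contains a signature of the user message, a string that helps cluster similar user messages. It is computed as follows:

Start with the normalized utterance (i.e., MESSAGE_USR_NORM)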
Perform stemming (removal of word suffixes) on all the words. This makes bills and bill the same word
Substitute words from a fixed list of very common, uninformative tokens (stopwords) by an asterisk. For example, this converts both “get my bill” and “get the bill” to the same phrase “get * bill”
Group words in sets of 3 elements (trigrams), and sort them alphabetically. This removes the global structure of the sentence, while retaining local structure.
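The steps above can be sketched as follows. The stemmer and the stopword list are toy stand-ins (the real ones are language-dependent and not listed in this document), the function names are hypothetical, and the last step follows one possible reading of the description: consecutive trigrams are built and then sorted alphabetically.

```python
STOPWORDS = {"my", "the", "a", "i"}  # hypothetical stopword list


def stem(word: str) -> str:
    """Toy stemmer: strips a trailing 's' from words longer than three letters."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def signature(normalized: str) -> str:
    """Build a signature from an already normalized utterance."""
    # Stopwords become an asterisk; every other word is stemmed
    tokens = ["*" if word in STOPWORDS else stem(word) for word in normalized.split()]
    if len(tokens) < 3:
        trigrams = [" ".join(tokens)]
    else:
        trigrams = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    # Sorting drops the global word order while keeping the local one
    return "|".join(sorted(trigrams))


same_bill = signature("get my bill") == signature("get the bills")
same_chat = signature("live chat") == signature("live chats")
```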

The resulting string is a non-understandable version of the original utterance (hence it cannot be used by itself), but the fact that several very similar utterances produce the same signature helps to cluster those utterances. An example is one of the preinstalled visualizations, Most Frequent User Utterances, which uses this field to group very similar utterances.

Another example is provided in the following figure, which shows message utterances generating the same signature:

Figure 9. Message utterances generating the same signature

As can be seen, the signature is the same for "how can I upgrade" and "when can I upgrade", for "when does my contract end" and "when is my contract ending", and for "live chat" and "live chats". So, they would be counted together when aggregating by signature.

The procedure is experimental and has its limitations, and work is ongoing to improve it, but it already reduces some of the inherent variability in user expressions.

AuraMsgGroup

Messages produced by Aura are generated from its text resource database. In some cases, the same category of message produces different output texts, either because the message includes some user-dependent parameter or because the text database contains several variants of the same text (and Aura picks one at random). The AuraMsgGroup field is a keyword field that helps categorize Aura answers by abstracting away some of this variation. It classifies the response given by Aura into two types of elements:

Generic group: a name such as <NONE>, <GREETING> or <NOTFOUND>, which corresponds to a response category (see the table below)
Truncated answer: for answers that do not have a defined generic group, as a fallback the literal answer text is inserted, after substituting all numbers in it with a placeholder and truncating it (i.e., retaining only the first characters)

The table below contains the generic groups defined so far. They correspond to the most frequent Aura messages. The set is country-dependent, since it also depends on the use cases deployed in each country. As said above, responses not falling into these groups will be assigned a truncated version of the response text.

Group	Meaning
`<EMPTY>`	No textual answer from Aura
`<NONE>`	Aura says it did not understand the user utterance
`<ERR>`	There was a processing error of some kind at Aura side, and the request could not be fulfilled
`<GREETING>`	Aura is greeting the user
`<GOODBYE>`	Aura is acknowledging a conversation end
`<YOU-ARE-WELCOME>`	Aura is accepting a compliment
`<CHURN>`	Aura recognizes the user intention to terminate a contract
`<NOTFOUND>`	Aura tried to search for some bit of data concerning the user query, and could not find it
`<CANNOT>`	Aura cannot fulfil the user request because of insufficient information (in the query, or on user data)
`<BILL-INFO>`	The user requested information about her bill, and Aura is returning it
`<DATA-INFO>`	The user requested information about her data usage, and Aura is returning it

EXPLICIT_FRUSTRATION

This field contains the probability, as estimated by the sentiment model, that the user’s sentence is an explicit expression of frustration.

AllNGrams

For the intents none and tv.none, the most common n-grams in the user messages that produced these none responses are extracted. This field contains the n-grams of 1, 2 and 3 words.

AllNGramsFilter

This field represents the AllNGrams field but filtered by stopwords.

NGrams1

This field represents the n-grams for 1 word.

NGrams1Filter

This field represents the n-grams for 1 word filtered by stopwords.

NGrams2

This field represents the n-grams for 2 words.

NGrams3

This field represents the n-grams for 3 words.
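The relation between the n-gram fields can be sketched as follows. The stopword list and the function names are hypothetical, and "filtered by stopwords" is interpreted here as discarding every n-gram that contains a stopword; the actual process may filter differently.

```python
STOPWORDS = {"my", "the", "a", "to"}  # hypothetical stopword list


def ngrams(tokens, n):
    """Return the consecutive n-word sequences of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_fields(message_norm: str) -> dict:
    """Compute the n-gram fields for one normalized user message."""
    tokens = message_norm.split()
    fields = {f"NGrams{n}": ngrams(tokens, n) for n in (1, 2, 3)}
    fields["AllNGrams"] = fields["NGrams1"] + fields["NGrams2"] + fields["NGrams3"]

    def keep(gram):
        # Assumption: an n-gram is dropped if any of its words is a stopword
        return not any(word in STOPWORDS for word in gram.split())

    fields["NGrams1Filter"] = [g for g in fields["NGrams1"] if keep(g)]
    fields["AllNGramsFilter"] = [g for g in fields["AllNGrams"] if keep(g)]
    return fields


fields = ngram_fields("pay my bill")
```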

2.5 - Annex: Dataset fields

Annex: Dataset fields detail

This annex describes the processing that each field of the Aura Analytics 2.0.0 data model goes through on its way to a clean PPD.

Introduction

The objective of the following tables is to explain the process that each field is going through within this flow:

Aura datasets >>> PPD_RAW >>> PPD_CLEAN

Each cell of the table explains the processing that the data field undergoes before it reaches the corresponding stage (table column).
For example, the field AURA_ID_GLOBAL is hashed before it gets stored in PPD_RAW. After this, the hashed data is passed on to PPD_CLEAN without any further processing.

Tables used in the Active Listening process are described in the following sections. They belong to the Aura Entities database.

Columns FIELD and DESCRIPTION: instances managed by the OB
Columns PPD RAW and PPD CLEAN: instances managed by Aura Global Team

MESSAGE dataset

Message dataset (stored in local Kernel).

#	FIELD	DESCRIPTION	PPD RAW	PPD CLEAN
1	USER_ID	Unique user ID in the OB systems	NOT transferred	NOT transferred
2	MSG_DT	Timestamp of the data
3	MSG_ID	Unique ID of the message		NOT transferred
4	ACTION_CD	Code of the action that produces the data		NOT transferred
5	AURA_ID	User logging ID in Aura. The user will have a new `Aura_id` each time she logs in Aura.	Hashed
6	PHONE_ID	Phone number of the user	NOT transferred	NOT transferred
7	CHANNEL_CD	Code of the channel where the action happened
8	SUBSCRIPTION_CD	Code of the subscription type of the user in the OB		NOT transferred
9	DOMAIN_CD	Code of the domain where the action happened		NOT transferred
10	CATEGORY_CD	Code of the category where the action happened		NOT transferred
11	COUNTRY_CD	Code of the country		NOT transferred
12	CORR_ID	Correlator ID of the request that produces this data
13	IS_CACHED	Shows if the entity content was already cached or not		NOT transferred
14	STATUS_CD	Status code of the action, if meaningful
15	REASON	Result of the action in error case, code of the error		NOT transferred
16	VERSION_ID	Aura version that produces this data
17	LANG_CD	Language configured by the user for communication		NOT transferred
18	TZ_CD	Timezone where the communication happened		NOT transferred
19	DURATION_NU	Duration in milliseconds of the action
20	MESSAGE	Content of the message	Anonymized
21	DIALOG_ID	Id of the dialog where the message happens
22	CONVERSATION_ID	Id of the conversation where the message happens		NOT transferred
23	WIN_RECOGNIZER_CD	Code of the recognizer that wins for this message		NOT transferred
24	WIN_RECOGNIZER_SCORE_NU	Score of the recognizer that wins for this message		NOT transferred
25	INTENT	Selected intent
26	ENTITIES	List of entities determined by the recognizer
27	MODALITY_CD	How does the user communicate with Aura
28	AURA_ID_GLOBAL	Identifies the same `user_id` logged with the same authentication method	Hashed
29	ACCOUNT_NUMBER	Unique account number of the user	NOT transferred	NOT transferred

GROOTMESSAGE dataset

Groot Message dataset (stored in local Kernel).

#	FIELD	DESCRIPTION	PPD RAW	PPD CLEAN
1	USER_ID	Unique user ID in the OB systems	NOT transferred	NOT transferred
2	MSG_DT	Timestamp of the data
3	MSG_ID	Unique ID of the message		NOT transferred
4	ACTION_CD	Code of the action that produces the data		NOT transferred
5	AURA_ID	User logging ID in Aura. The user will have a new `Aura_id` each time she logs in Aura.	Hashed
6	PHONE_ID	Phone number of the user	NOT transferred	NOT transferred
7	CHANNEL_CD	Code of the channel where the action happened
8	DOMAIN_CD	Code of the domain where the action happened		NOT transferred
9	CATEGORY_CD	Code of the category where the action happened		NOT transferred
10	COUNTRY_CD	Code of the country		NOT transferred
11	CORR_ID	Correlator ID of the request that produces this data
12	IS_CACHED	Shows if the entity content was already cached or not		NOT transferred
13	STATUS_CD	Status code of the action, if meaningful
14	REASON	Result of the action in error case, code of the error		NOT transferred
15	VERSION_ID	Aura version that produces this data
16	LANG_CD	Language configured by the user for communication		NOT transferred
17	TZ_CD	Timezone where the communication happened		NOT transferred
18	DURATION_NU	Duration in milliseconds of the action
19	MESSAGE	Content of the message	Anonymized
20	CHANNEL_CONVERSATION_CD	Id of the channel conversation where the message happens		NOT transferred
21	SKILL_CONVERSATION_CD	Id of the skill conversation		NOT transferred
22	WIN_RECOGNIZER_CD	Code of the recognizer that wins for this message		NOT transferred
23	WIN_RECOGNIZER_SCORE_NU	Score of the recognizer that wins for this message		NOT transferred
24	AURA_ID_GLOBAL	Identifies the same `user_id` logged with the same authentication method	Hashed
25	ACCOUNT_NUMBER	Unique account number of the user	NOT transferred	NOT transferred
26	SKILL_CD	Unique id of skill

RECOGNIZER dataset

Recognizer dataset stored in local Kernel.

#	FIELD	DESCRIPTION	PPD RAW	PPD CLEAN
1	USER_ID	Unique user ID in the OB systems	NOT transferred	NOT transferred
2	RECOGNIZER_DT	Timestamp of the data
3	RECOGNIZER_ID	Unique ID of the recognizer
4	ACTION_CD	Code of the action that produces the data		NOT transferred
5	AURA_ID	User logging ID in Aura. The user will have a new `Aura_id` each time she logs in Aura.	Hashed
6	PHONE_ID	Phone number of the user	NOT transferred	NOT transferred
7	CHANNEL_CD	Code of the channel where the action happened
8	DOMAIN_CD	Code of the domain where the action happened		NOT transferred
9	CATEGORY_CD	Code of the category where the action happened		NOT transferred
10	COUNTRY_CD	Code of the country		NOT transferred
11	CORR_ID	Correlator ID of the request that produces this data
12	IS_CACHED	Shows if the entity content was already cached or not		NOT transferred
13	STATUS_CD	Status code of the action, if meaningful
14	REASON	Result of the action in error case, code of the error
15	VERSION_ID	Aura version that produces this data
16	LANG_CD	Language configured by the user for communication		NOT transferred
17	TZ_CD	Timezone where the communication happened		NOT transferred
18	DURATION_NU	Duration in milliseconds of the action
19	SCORE_NU	Score returned by the recognizer
20	INPUT	User input sent to the recognizer. Null if incoming message is an AuraCommand	Anonymized
21	OUTPUT	Complete output generated by the recognizer
22	INTENT	Intent returned by the recognizer
23	ENTITIES	Entities returned by the recognizer due to the intent
24	COMMON_THRESHOLD_NU	Common threshold used to determine the best answer of all recognizers		NOT transferred
25	THRESHOLD	Specific threshold of the specific recognizer being executed		NOT transferred
26	EXPECTED_INTENT	Intent expected to be returned by the recognizer		NOT transferred
27	EXPECTED_ENTITIES	Entities expected to be returned by the recognizer due to the intent		NOT transferred
28	AURA_ID_GLOBAL	Identifies the same `user_id` logged with the same authentication method	Hashed
29	ACCOUNT_NUMBER	Unique account number of the user	NOT transferred	NOT transferred

API dataset

API request dataset (stored in local Kernel).

#	FIELD	DESCRIPTION	PPD RAW	PPD CLEAN
1	USER_ID	Unique user ID in the OB systems	NOT transferred	NOT transferred
2	REQUEST_DT	Timestamp of the data
3	REQUEST_ID	Unique ID of the request
4	ACTION_CD	Code of the action that produces the data		NOT transferred
5	AURA_ID	User logging ID in Aura. The user will have a new Aura_id each time she logs in Aura	Hashed	NOT transferred
6	PHONE_ID	Phone number of the user	NOT transferred	NOT transferred
7	CHANNEL_CD	Code of the channel where the action happened		NOT transferred
8	DOMAIN_CD	Code of the domain where the action happened		NOT transferred
9	CATEGORY_CD	Code of the category where the action happened		NOT transferred
10	COUNTRY_CD	Code of the country		NOT transferred
11	CORR_ID	Correlator ID of the request that produces this data
12	IS_CACHED	Shows if the entity content was already cached or not	NOT transferred	NOT transferred
13	STATUS_CD	Status code of the API request
14	REASON	Result of the action in error case, code of the error
15	VERSION_ID	Aura version that produces this data		NOT transferred
16	LANG_CD	Language configured by the user for communication		NOT transferred
17	TZ_CD	Timezone where the communication happened		NOT transferred
18	DURATION_NU	Duration in milliseconds of the action
19	HOST	Host of the API
20	PATH	Specific path of the API being called		NOT transferred
21	HTTP_STATUS	HTTP status of the server response		NOT transferred
22	RESPONSE	Response body	Anonymized
23	AURA_ID_GLOBAL	Identifies the same user_id logged with the same authentication method	Hashed	NOT transferred
24	ACCOUNT_NUMBER	Unique account number of the user	NOT transferred	NOT transferred
25	REQUEST	Request body

3 - Aura Billing Module

Aura Billing Module

Description of Aura Billing Module, the tool for the generation of Liceo invoices.

Introduction

The Aura Billing Module is a tool for the generation of Liceo invoices, which allow each customer to be charged for the services that she has used. This is a mandatory process for OBs.

It is based on the storage and processing of specific logs in the OB’s Aura systems to track the type and number of interactions of a user or service with Aura.

This information is used to assign costs based on different billing models and criteria chosen by the OB, which ultimately determines the total amount of the invoice.

The invoices will be generated in XLSX (Excel) format and stored in an Azure Storage Explorer blob container, along with the historical invoice records.

These invoices will be available for download by the Aura Global Team, to be sent to the OBs.

Interested in how the Aura Billing Module works and which are the tasks required to bring it into use? Access the document Aura Billing Module operation.

Generated Liceo invoices

The Liceo invoices generated by Aura Billing Module will contain the following information:

Invoicing model (based on the payment model of the OB)
Aura components used to provide the service
Service/app that used this component
Number of queries per component
Cost of each query in each specific component
Total amount generated by each component
Total number of requests made during the billing period
Total amount of the invoice
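
The relationship between these fields can be sketched as follows. This is only an illustration of the arithmetic implied by the list above (queries per component multiplied by the cost per query, then summed), not the actual Aura Billing Module algorithm. The `InvoiceLine` type, the `summarize` function and all component names and costs used with them are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class InvoiceLine:
    """One row of a hypothetical invoice: a component used by a service/app."""
    component: str      # Aura component that served the queries
    service: str        # service/app that used the component
    queries: int        # number of queries in the billing period
    unit_cost: Decimal  # cost of one query (illustrative value)

    @property
    def amount(self) -> Decimal:
        # Amount generated by this line: queries x cost per query.
        return self.unit_cost * self.queries


def summarize(lines: list[InvoiceLine]) -> dict:
    """Aggregate invoice lines into per-component and overall totals."""
    per_component: dict[str, Decimal] = {}
    for line in lines:
        previous = per_component.get(line.component, Decimal("0"))
        per_component[line.component] = previous + line.amount
    return {
        "per_component": per_component,
        "total_requests": sum(line.queries for line in lines),
        "total_amount": sum(per_component.values(), Decimal("0")),
    }
```

`Decimal` is used instead of `float` so that monetary amounts are not affected by binary rounding.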

3.1 - Aura Billing Module operation

Aura Billing Module operation

This document contains:

An overview of Aura Billing Module functional operation
Tasks to be executed by OBs to bring Aura Billing Module into use

Aura Billing Module operational flowchart

Figure 1 schematically shows how Aura Billing Module operates, where three different instances come into play:

Aura: OB managed environment
Aura: Global Team managed environment
Kernel

Figure 1. Aura Billing Module operation

The operational processes executed by the Aura Billing Module are outlined below. Each step also describes the tasks that OBs must carry out to bring the module into use.

1. Data generation

This task takes place in Aura’s OB managed environment.

Aura components automatically generate logs every time a user/service interacts with Aura in local environment.
These logs are pre-processed, cleaned and converted into datasets, in Avro format.
These are the required Avro-formatted datasets for the Aura Billing Module:
- Aura_Audit, that stores the minimum information needed for generating the Liceo invoices.
- Aura dimensional entities:
  - D_Aura_App: List of possible apps defined in Aura.
  - D_Aura_Channel: List of possible channels defined in Aura.
  - D_Aura_Component: List of possible components defined in Aura.
  - D_Aura_Preset: List of possible presets defined in Aura.
  - D_Aura_Recognizer: List of possible recognizers defined by Aura.
  - D_Aura_Skill: List of possible skills defined in Aura.
- Aura Gateway Message: summary of Aura interactions handled by aura-gateway-api.

2. Data publication in Kernel

The latest versions of the previous Avro-formatted datasets must be published into Kernel productive environment by the Kernel team.

Tasks for OBs

2.1. Ask the Kernel Team to publish the datasets in Kernel productive environment with the latest version.

2.2. When correctly published, you can check them in the repository: 4p-datasets

3. Kernel apps configuration to write/read datasets

Two Kernel applications (clients) must be created/configured by the Kernel team to allow the use of Kernel resources:
- aura-bot-[environment]: already existing app in Kernel
- aura-billing-[environment]: new application
Specifically, the applications must be configured with concrete scopes that provide permissions to write/read the datasets.
The exact dataset version no longer has to be indicated in the configuration, so the version number will be removed from the scopes in upcoming deployments. For example, the configuration of the Brazil OB will have to be updated the next time its scopes are changed: data:Aura_Audit:6:read -> data:Aura_Audit:read.
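
As an illustration of this scope change, the following minimal sketch removes the version segment from a versioned scope. It assumes that versioned scopes always have the form data:<dataset>:<version>:<permission>, as in the example above; the helper name `drop_scope_version` is hypothetical and is not part of any Kernel tooling.

```python
import re

# Matches scopes such as data:Aura_Audit:6:read. The numeric segment is the
# dataset version. Scopes without a version segment do not match.
_VERSIONED_SCOPE = re.compile(r"^(data:[A-Za-z0-9_]+):\d+:(read|write)$")


def drop_scope_version(scope: str) -> str:
    """Return the scope without its version number, or unchanged if it has none."""
    return _VERSIONED_SCOPE.sub(r"\1:\2", scope)


if __name__ == "__main__":
    print(drop_scope_version("data:Aura_Audit:6:read"))
```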

Tasks for OBs

3.1. Configure aura-bot Kernel application to write datasets

Ask the Kernel Team to create a list of scopes in the aura-bot application for your intended environment.

- admin:datasets:read
- data:read
- data:write
- data:Aura_Audit:read
- data:Aura_Audit:write
- data:Aura_Gateway_Message:read
- data:Aura_Gateway_Message:write
- data:D_Aura_App:read
- data:D_Aura_App:write
- data:D_Aura_Channel:read
- data:D_Aura_Channel:write
- data:D_Aura_Component:read
- data:D_Aura_Component:write
- data:D_Aura_Preset:read
- data:D_Aura_Preset:write
- data:D_Aura_Recognizer:read
- data:D_Aura_Recognizer:write
- data:D_Aura_Skill:read
- data:D_Aura_Skill:write
- data:D_Gbl_Brand:read
- data:D_Gbl_Contact_Channel:read
- data:D_Gbl_Country:read
The scopes are associated with a specific version of the dataset, that will increase and vary with time.

3.2. Create a new app for Aura Billing Module and configure it to write datasets

Ask the Kernel Team to create a new application aura-billing in Kernel for your intended environment

Ask the Kernel Team to create a purpose for this application. For instance, aura-kpi-data-read-purpose

Ask the Kernel Team to assign to this purpose the following scopes:
- admin:datasets:read
- data:read
- data:write
- data:Aura_Audit:read
- data:Aura_Gateway_Message:read
- data:D_Aura_App:read
- data:D_Aura_Channel:read
- data:D_Aura_Component:read
- data:D_Aura_Preset:read
- data:D_Aura_Recognizer:read
- data:D_Aura_Skill:read
- data:D_Gbl_Brand:read
- data:D_Gbl_Contact_Channel:read
- data:D_Gbl_Country:read
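
The two scope lists in steps 3.1 and 3.2 follow a regular pattern, so they can be generated rather than typed by hand when preparing the request for the Kernel Team. The sketch below reproduces the lists shown in this document; the function name `build_scopes` and its `can_write` flag are hypothetical helpers, and the dataset names are copied from the lists above.

```python
# Scopes common to both applications, as listed in this document.
BASE_SCOPES = ["admin:datasets:read", "data:read", "data:write"]

# Aura datasets: aura-bot reads and writes them, aura-billing only reads them.
AURA_DATASETS = [
    "Aura_Audit",
    "Aura_Gateway_Message",
    "D_Aura_App",
    "D_Aura_Channel",
    "D_Aura_Component",
    "D_Aura_Preset",
    "D_Aura_Recognizer",
    "D_Aura_Skill",
]

# Global dimensional datasets: read-only for both applications.
GLOBAL_DATASETS = ["D_Gbl_Brand", "D_Gbl_Contact_Channel", "D_Gbl_Country"]


def build_scopes(can_write: bool) -> list[str]:
    """Build the scope list for aura-bot (can_write=True) or aura-billing (False)."""
    scopes = list(BASE_SCOPES)
    for dataset in AURA_DATASETS:
        scopes.append(f"data:{dataset}:read")
        if can_write:
            scopes.append(f"data:{dataset}:write")
    scopes.extend(f"data:{dataset}:read" for dataset in GLOBAL_DATASETS)
    return scopes


if __name__ == "__main__":
    print("\n".join(build_scopes(can_write=False)))
```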

3.3. Access the Kernel applications

Once the Kernel team has created the app with the above-mentioned purposes and scopes, two parameters for securely accessing the app will be provided:
- client_id: unique identifier of the consuming app acting as Kernel API client.
- client_secret: password.
This allows Kernel to securely identify, authenticate and authorize any access requested from this app.

4. Data processing

Data processing is executed with Azure Databricks.

In this process, the Aura Billing Module retrieves and reads the information from the Kernel datasets, and applies algorithms that assign a unit cost to each concept in the invoice in order to calculate its total amount.

Tasks for OBs

4.1. Enable Aura components

Check that Aura KPIs Uploader, the component in charge of the management of KPIs entities and KPIs dimensions in Aura, is in use.

Check that Aura Databricks Jobs, component used to import Avro-formatted files into a Kernel dataset, is enabled in Aura installer.

Configure Azure Common:
- To avoid data files migrations between releases, KPI files are now stored in an Azure common storage that is not release dependent.
- The default value of days for KPIs uploading is changed to 30 days. Due to that, the variable fourth_platform.conversations.days_to_find must be removed from OB deployments, because it will be configured by the installer.

5. Data consumption

This step refers to the generation of the Liceo invoices and their storage in Azure Storage Explorer in XLSX format.

These invoices will be available for download by the Aura Global Team, to be sent to the OBs.

4 - Manage Aura logs

Manage Aura logs

Learn what Aura logs are and how they are managed in Kibana and other tools

Introduction

Logs are files that record specific single events, warnings and errors as they occur within a software environment. They can include contextual information, such as the time an event occurred and which user or endpoint was associated with it.

In Aura, logs are generated by specific components when an event happens and stored in order to monitor or debug the system.

Logs are stored in an ElasticSearch cluster.

Aura integrates a logging system based on Kibana, which is the official tool to manage the stored logs. Moreover, logs can be managed with Grafana and fluentd for specific features.

⚠️ You should not integrate third-party applications or scripts with ElasticSearch. These kinds of integrations are weak because the ElasticSearch API is not part of the public interface with the OB. This means that it could change without notice for several reasons such as updating the version of ElasticSearch or changing Aura internal architecture.

Manage logs in Kibana

The official Kibana User Guide is the reference guide to use Kibana.

Moreover, the current section includes certain useful points for managing Aura logs through this tool.

Policies in Kibana

Kibana includes index lifecycle policies.

By default, we add one policy for each index created (service and system index), to delete the logs older than seven days.
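
For reference, a delete-after-seven-days policy has roughly the following shape in ElasticSearch index lifecycle management. This is only a sketch to help read the policies shown in Kibana: the policies actually installed by Aura are not reproduced in this document and may contain additional phases. The function name `delete_after_policy` is hypothetical.

```python
import json


def delete_after_policy(days: int) -> dict:
    """Build an index lifecycle policy body that deletes indices older than `days`."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return {
        "policy": {
            "phases": {
                "delete": {
                    # Minimum age of the index before the delete phase starts.
                    "min_age": f"{days}d",
                    "actions": {"delete": {}},
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(delete_after_policy(7), indent=2))
```

The policies are added by default, so this body is shown for understanding only and is not meant to be sent to ElasticSearch from custom scripts.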

Snapshot in Kibana

Index snapshot is configured by default as long-term storage for the logs. These snapshots are taken daily and end in the cluster Azure Storage blob container (aura-backups/elk).

Manage logs in Grafana

Discover section

The “Discover” section in Grafana is very useful to look for logs and troubleshoot issues.

You can full-text search logs using Lucene query syntax.

Moreover, logs are tagged with many fields that can be useful to narrow down a search, such as:

kubernetes.labels.app: name of the Kubernetes application that generated the log.
kubernetes.pod_name: name of the Kubernetes pod that generated the log.
corr: correlator that tracks E2E requests.
lvl: log level (TRACE, DEBUG, INFO, WARN, ERROR or FATAL).
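
A search that combines several of these fields can be composed as a Lucene query string. The sketch below is a small, hypothetical helper (`lucene_query`) that joins field filters with AND and quotes every value as a phrase; it only builds the text to paste into the search box and does not call any API.

```python
def _quote(value: str) -> str:
    """Quote a value as a Lucene phrase, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def lucene_query(fields: dict[str, str], text: str = "") -> str:
    """Join field filters (and an optional free-text phrase) with AND."""
    clauses = [f"{name}:{_quote(value)}" for name, value in fields.items()]
    if text:
        clauses.append(_quote(text))
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Errors logged by aura-bot; add "corr" to follow one request end to end.
    print(lucene_query({"kubernetes.labels.app": "aura-bot", "lvl": "ERROR"}))
```

Filtering on tagged fields such as lvl or corr is preferable to matching message text, which may change between versions.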

Queries that rely on a specific text are weak. Aura cannot guarantee that log messages do not change between versions. In fact, they do change. This is why metrics based on logs will not be reliable and it is not recommended to use Kibana to get metrics.

Manage logs in fluentd

Logs external forwarding feature

It is possible to send logs to an external system (a fluentd endpoint).

To enable this feature, add the following configuration to your config file:

external_forwarding:
  secret_shared_key: "mysecretkey"
  tls_config:
    tls_enabled: True
  remote_servers:
  - hostname: xxx
    port: yyy

Set hostname and port fields with the remote endpoint. If you configure more than one remote server, fluentd load balances the traffic to them in a round-robin order.
The hostname value can be an IP address, but it is not recommended if TLS is enabled. Turning off TLS is possible but discouraged for security reasons.
secret_shared_key is used to verify client’s identity and must be configured properly in all the remote servers.
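
The recommendations above can be checked before deploying. The following sketch assumes that the configuration has already been loaded into a Python dictionary with the same keys as the example; the function name `forwarding_warnings` is hypothetical and is not part of the Aura installer.

```python
import ipaddress


def forwarding_warnings(config: dict) -> list[str]:
    """Return warnings for an external_forwarding block, following the notes above."""
    warnings: list[str] = []
    forwarding = config.get("external_forwarding", {})
    tls_enabled = bool(forwarding.get("tls_config", {}).get("tls_enabled", False))

    if not tls_enabled:
        warnings.append("TLS is disabled: possible, but discouraged for security reasons")
    if not forwarding.get("secret_shared_key"):
        warnings.append("secret_shared_key is empty")

    for server in forwarding.get("remote_servers", []):
        host = str(server.get("hostname", ""))
        try:
            ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP address, so it is treated as a host name
        if tls_enabled:
            warnings.append(f"{host}: an IP address is not recommended when TLS is enabled")
    return warnings
```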

You can find additional information regarding receivers’ configuration (including TLS configuration and password authentication procedure) here.

5 - Manage metrics

Manage Aura metrics

Learn what Aura metrics are, how they are generated and stored in Prometheus, and how they are analyzed through Grafana

Introduction

Metrics provide a measurement of certain data that represent a specific aspect of the monitored system at a point in time and offer an aggregated view over the system. They are useful to visualize long-term trends and alerts on log data.

Each Aura component is in charge of publishing its own metrics, which are typically generated at fixed-time intervals from aggregated logs.

Once generated, Aura metrics are polled by Prometheus, which is in charge of gathering and exposing them.

Grafana is the most suitable tool to represent metrics through different dashboards. Each component has a Grafana dashboard that shows its current behavior, and there is a single dashboard for an Aura overview.

If you think a new metric could be useful, please contact the Aura Platform Team, so it can be officially included as part of the platform.

The aim of this section is to explain both how Aura metrics work and all the metrics stored by each component.

⚠️ Saved dashboards, visualizations and queries are not guaranteed to be kept between upgrades, because the whole stack, including ElasticSearch and Grafana, can be upgraded to newer versions.

Prometheus

Aura metrics system is based on Prometheus, a Cloud Native Computing Foundation project for monitoring systems and services. Prometheus collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

prom-client is used to implement Prometheus functionality in Node.js.

The Prometheus service polls every component to get the metrics generated during the last time period. Every component has a private endpoint (not accessible from the Internet) called /metrics, where Prometheus requests the metrics.

Currently, the metric types used in this component are:

Summary: similar to a histogram, it samples observations (such as request durations and response sizes). It provides a total count of observations and a sum of all observed values, and it also calculates configurable quantiles over a sliding time window.
Counter: cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
Gauge: similar to Counter, but it represents a single numerical value that can arbitrarily go up and down.
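
The difference between the three types can be illustrated with a toy model. The classes below are a simplified sketch written for this explanation only: they do not use prom-client, they are not how Aura components implement their metrics, and the Summary omits the quantile calculation over a sliding time window.

```python
class Counter:
    """Cumulative value that can only increase (it restarts at zero with the process)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Single numerical value that can go up and down arbitrarily."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


class Summary:
    """Tracks the count and the sum of the observed values (quantiles omitted)."""

    def __init__(self) -> None:
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        self.count += 1
        self.sum += value
```

For example, the number of requests served fits a Counter, the number of running pods fits a Gauge, and request durations fit a Summary.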

Prometheus-es-exporter

Working with Prometheus, we can create metrics using queries to ElasticSearch indexes (as well as create alarms, dashboards, etc.) using prometheus-es-exporter.

This component is not deployed by default, but it can be enabled by changing the variable prometheus_es_exporter_enabled to true in your config.yml file. (In Brazil, it is set to true by default). Access here the guidelines to enable prometheus-es-exporter component.

To configure your own metrics from queries, write the new section, as in the following example, in your config.yml.

prometheus_es_exporter:
  query_blocks:
    ob:
      - name: "query_ob_br"
        QueryIntervalSecs: "60"
        QueryJson: '{"size":0,"query":{"bool":{"must":[],"filter":[{"bool":{"filter":[{"bool":{"should":[{"match_phrase":{"msg":"[AzureEventHub] emit"}}],"minimum_should_match":1}},{"bool":{"should":[{"match_phrase":{"kubernetes.labels.app":"aura-bot"}}],"minimum_should_match":1}}]}},{"range":{"@timestamp":{"gte":"now-1m","lte":"now"}}}]}}}'
        QueryIndices: "aurak8s-service-*"

Where:

name: Mandatory. Name of the query. It must start with query_*
QueryIntervalSecs: Optional. It indicates how often to run queries in seconds. By default, 60.
QueryJson: Mandatory. The search query to run.
QueryIndices: Optional. Indices to run the query on. Any way of specifying indices supported by your ElasticSearch version can be used. By default, _all. Although this field is optional, it is highly recommended to delimit the search query.
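
Because QueryJson is a long single-line JSON string, it is easy to break a bracket when editing it by hand. As a sketch, the value used in the example above can be generated from a dictionary instead. The function name `build_query_json` and its parameters are hypothetical; the query structure is copied from the example.

```python
import json


def _phrase(field: str, value: str) -> dict:
    """Clause that requires `field` to contain the exact phrase `value`."""
    return {
        "bool": {
            "should": [{"match_phrase": {field: value}}],
            "minimum_should_match": 1,
        }
    }


def build_query_json(message: str, app: str, window: str = "1m") -> str:
    """Build a count-only query for logs of `app` containing `message` in the last `window`."""
    query = {
        "size": 0,  # only the number of hits is needed, not the documents
        "query": {
            "bool": {
                "must": [],
                "filter": [
                    {"bool": {"filter": [
                        _phrase("msg", message),
                        _phrase("kubernetes.labels.app", app),
                    ]}},
                    {"range": {"@timestamp": {"gte": f"now-{window}", "lte": "now"}}},
                ],
            }
        },
    }
    # Compact separators produce a single-line value suitable for config.yml.
    return json.dumps(query, separators=(",", ":"))


if __name__ == "__main__":
    print(build_query_json("[AzureEventHub] emit", "aura-bot"))
```

Keeping the time window in the query aligned with QueryIntervalSecs (one minute and 60 seconds in the example) makes each run cover the period since the previous run.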

Aura components metrics

The main Aura components can generate their own metrics.

Select your intended component in the left menu to access its details.

5.1 - Aura Bot metrics

Aura Bot metrics

List of metrics available in Aura Bot

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-bot.

It is stored as a Summary in Prometheus. So every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in aura-bot until its HTTP response is returned, indicating to the client that Aura is processing the request to obtain a proper answer for the user.

The metric allows measuring the behavior of the requests from any given endpoint:

The number of requests during a time
The average/min/max duration of these requests
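
The request count and the average duration can be derived from the `_count` and `_sum` series that Prometheus exposes for every Summary. The sketch below parses a text sample in the Prometheus exposition format and computes the average per label set. The function name `average_durations` is hypothetical, the label values in the usage example are illustrative, and the parser is deliberately simplified (it assumes that label values do not contain a closing brace).

```python
import re

# One sample line: metric name, label block and value.
_SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$"
)


def average_durations(exposition: str, metric: str) -> dict[str, float]:
    """Return the average observed value (sum / count) for each label set of a Summary."""
    sums: dict[str, float] = {}
    counts: dict[str, float] = {}
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue
        if match["name"] == f"{metric}_sum":
            sums[match["labels"]] = float(match["value"])
        elif match["name"] == f"{metric}_count":
            counts[match["labels"]] = float(match["value"])
    return {
        labels: sums[labels] / count
        for labels, count in counts.items()
        if count > 0 and labels in sums
    }


if __name__ == "__main__":
    sample = """
# TYPE http_request_duration_seconds summary
http_request_duration_seconds_sum{method="POST",path="/example",status_code="200"} 12.5
http_request_duration_seconds_count{method="POST",path="/example",status_code="200"} 50
"""
    print(average_durations(sample, "http_request_duration_seconds"))
```

Minimum and maximum values are not part of `_sum` and `_count`; they are only available if the Summary is configured with the corresponding quantiles.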

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status_code: HTTP status code returned in the response

This metric has been stored since the Iron Maiden (7.2.0) release.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-bot.

It is stored as a Summary in Prometheus so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests to any given endpoint:

The number of requests during a time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status: HTTP status code returned in the response

This metric has been stored since the Camela (5.0.0) release.

outgoing_message_duration_seconds

This metric is intended to store the number of Direct Line requests arriving to aura-bot.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-bot is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the user is sent back to the client callback. This metric measures the duration since the request lands in aura-bot until the last message of its answer is sent to the client callback.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status code returned in the response.
originStatus: status sent by Direct Line in the body of the response in the event of an error.
origin: specific host of the request.
channel: channel of the request.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_component_version

This metric is intended to store the number of aura-bot instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

version: version field in the package.json file included in the running docker container.
component: name of the component that is writing the metric.

This metric has been stored since the Camela (5.0.0) release under the name bot_version, and was renamed to aura_component_version in Iron Maiden (7.2.0).

bot_request_version

This metric is intended to store the number of incoming requests to aura-bot depending on their channelData.version. It is stored as a Counter in Prometheus.

Labels:

version: channelData.version in the incoming request. If the incoming request has no version field, 1 will be set.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-bot.

It is stored as a Counter in Prometheus.

Labels:

error: exception message that forced the unhandled error.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_token_generate

This metric is intended to store the information related to Kernel accessToken refreshes in aura-bridge. It makes it possible to set an alarm in the event of any error during the refresh of the 2-legged accessToken needed to access Kernel WhatsApp APIs.

It is stored as a Summary.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status returned by Kernel in the response.
originStatus: status sent by Kernel in the body of the response in the event of an error.
origin: channelId of the channel that needs the accessToken in Aura.
channel: channel of the request.

This metric has been stored since the Iron Maiden (7.2.0) release.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

moduleId: Id of the module.
status: OK or ERROR

5.2 - Aura Groot metrics

Aura Groot metrics

List of metrics available in Aura Groot

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-groot.

It is stored as a Summary in Prometheus. So every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in aura-groot until its HTTP response is returned, indicating to the client that Aura is processing the request to obtain a proper answer for the Direct Line or aura-bridge.

The metric allows measuring the behavior of the requests from any given endpoint:

The number of requests during a time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status_code: HTTP status code returned in the response

outgoing_request_duration_seconds

This metric is intended to store the processing time related to all the outgoing HTTP requests made by aura-groot.

It is stored as a Summary in Prometheus so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests to any given endpoint:

The number of requests during a time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status: HTTP status code returned in the response

outgoing_message_duration_seconds

This metric is intended to store the processing time of Direct Line or aura-bridge requests arriving to aura-groot.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-groot is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the user is sent back to the client callback. This metric measures the duration since the request lands in aura-groot until the last message of its answer is sent to the client callback.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status code returned in the response.
originStatus: status sent by Direct Line in the body of the response in the event of an error.
origin: specific host of the request (Direct Line or aura-bridge).
channel: channel of the request.

incoming_message_duration_seconds

This metric is intended to store the processing time of Direct Line, aura-bridge or skills requests arriving to aura-groot.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-groot is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the channel or skill is sent back to the client callback. This metric measures the duration from when the request arrives at aura-groot until it is processed to send to the channel/bridge or skill.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status code returned in the response.
originStatus: status sent by Direct Line in the body of the response in the event of an error.
origin: specific host of the request (Direct Line, aura-bridge or skill name). If origin is missing, the content of path label will be added.
channel: channel of the request.

aura_component_version

This metric is intended to store the number of aura-groot instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

version: version field in the package.json file included in the running docker container.
component: name of the component that is writing the metric.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-groot.

It is stored as a Counter in Prometheus.

Labels:

error: exception message that forced the unhandled error.

skill_access_error

This metric is intended to store the number of times a skill has been misconfigured in aura-groot.

It is stored as a Counter in Prometheus.

Labels:

skill: skill name.
code: noRespond or noFound
channel: channel of the request.

skill_request_status

This metric is intended to store the number of times we have obtained a response status per skill in aura-groot.

It is stored as a Counter in Prometheus.

Labels:

skill: skill name.
code: status code of the request.
channel: channel of the request.

skill_response_error

This metric is intended to store the number of times a skill has been blocked in aura-groot.

It is stored as a Counter in Prometheus.

Labels:

skill: skill name
code: blocked
channel: channel of the request.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

moduleId: Id of the module.
status: OK or ERROR

5.3 - Atria Model Gateway metrics

Atria Model Gateway metrics

List of metrics available in atria-model-gateway

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by atria-model-gateway.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in atria-model-gateway until its HTTP response is returned:

The number of requests during a time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
path: specific endpoint of the request
status_code: HTTP status code returned in the response
application: application name that is using the model

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by atria-model-gateway. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

The number of requests during a time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status: HTTP status code returned in the response

generative_tokens

This metric is intended to store the information related to tokens used by OpenAI in atria-rag-server. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its token usage.

The metric allows measuring the behavior of the tokens using any given OpenAI model:

The number of tokens during a time
The average/min/max tokens of these requests

Labels:

application: application name that is using the model
deployment_model_name: name of the deployment model
model_type: identifier of the model

5.4 - Atria RAG server metrics

Atria RAG server metrics

List of metrics available in atria-rag-server

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by atria-rag-server.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in atria-rag-server until its HTTP response is returned:

The number of requests during a time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
path: specific endpoint of the request
status_code: HTTP status code returned in the response
application: application name that is using the model

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by atria-rag-server. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

The number of requests during a time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status: HTTP status code returned in the response

5.5 - Aura Authentication API metrics

Authentication API metrics

List of metrics available in Aura Authentication API

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-authentication-api. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in aura-authentication-api until its HTTP response is returned:

The number of requests over a period of time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
host: host and domain where the request is being sent
path: specific endpoint of the request
status_code: HTTP status code returned in the response

This metric has been stored since the Greenday (6.0.0) release.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-authentication-api. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status: HTTP status code returned in the response

This metric has been stored since the Camela (5.0.0) release.

aura_token_generate

This metric is intended to store the information related to Kernel accessToken generation, used during the integrated authorization process of the Aura users in aura-authentication-api.

It makes it possible to set an alarm in the event of any error during token generation. It is stored as a Summary in Prometheus.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status returned by Kernel in the response.
originStatus: status sent by Kernel in the body of the response in the event of an error.
origin: channelId of the channel that needs the accessToken in Aura.

This metric has been stored since the Iron Maiden (7.2.0) release.
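As an illustration of the alarm use case, the sketch below counts token generation attempts whose httpStatus label is not a 2xx code. It assumes that the recent increase of the metric's observation count has already been obtained per httpStatus value. The sample figures and the helper name `failed_token_requests` are hypothetical.

```python
from typing import Dict


def failed_token_requests(count_by_status: Dict[str, float]) -> float:
    """Sum the observations whose httpStatus label is not a 2xx code."""
    return sum(
        count
        for status, count in count_by_status.items()
        if not status.startswith("2")
    )


# Hypothetical increase of the metric's observation count over the last
# five minutes, keyed by the httpStatus label.
recent = {"200": 120, "401": 3, "500": 1}
failures = failed_token_requests(recent)
should_alert = failures > 0
print(failures, should_alert)
```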

aura_component_version

This metric is intended to store the number of aura-authentication-api instances (pods) running each version of the code.

It is stored as a Gauge in Prometheus.

Labels:

version: version field in the package.json file included in the running docker container.
component: name of the component that is writing the metric.

This metric has been stored since the Barricada (5.3.0) release under the name authentication_api_version and was renamed to aura_component_version in Iron Maiden (7.2.0).

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-authentication-api. It is stored as a Counter in Prometheus.

Labels:

error: exception message that forced the unhandled error.

This metric has been stored since the Iron Maiden (7.2.0) release.

services_status

This metric is intended to store the number of successful and failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

moduleId: Id of the module.
status: OK or ERROR
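Since the counter is split by moduleId and status, the share of failed checks per module can be derived from it. The sketch below assumes the counter increases over a time window are already available. The module identifiers and figures are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple


def error_ratio_by_module(increases: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Map each moduleId to the share of its checks that reported ERROR."""
    totals: Dict[str, float] = defaultdict(float)
    errors: Dict[str, float] = defaultdict(float)
    for (module_id, status), value in increases.items():
        totals[module_id] += value
        if status == "ERROR":
            errors[module_id] += value
    return {m: errors[m] / total for m, total in totals.items() if total > 0}


# Hypothetical counter increases keyed by (moduleId, status).
window = {
    ("redis", "OK"): 58,
    ("redis", "ERROR"): 2,
    ("mongo", "OK"): 60,
}
ratios = error_ratio_by_module(window)
print(ratios)
```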

5.6 - Aura Configuration API metrics

Aura Configuration metrics

List of metrics available in Aura Configuration API

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-configuration-api.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in aura-configuration-api until its HTTP response is returned:

The number of requests over a period of time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status_code: HTTP status code returned in the response

This metric has been stored since the Greenday (6.0.0) release.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-configuration-api. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status: HTTP status code returned in the response

aura_component_version

This metric is intended to store the number of aura-configuration-api instances (pods) running each version of the code.

It is stored as a Gauge in Prometheus.

Labels:

version: version field in the package.json file included in the running docker container.
component: name of the component that is writing the metric.
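One possible use of this gauge is to detect that more than one version of the component is running at the same time, for instance during a rollout. The sketch below assumes the gauge values have already been read per version label. The version numbers are hypothetical.

```python
from typing import Dict, List


def running_versions(pods_by_version: Dict[str, float]) -> List[str]:
    """Return the versions that currently have at least one running pod."""
    return sorted(v for v, pods in pods_by_version.items() if pods > 0)


# Hypothetical gauge values keyed by the version label.
gauge = {"7.2.0": 3, "7.1.4": 1, "7.0.0": 0}
active = running_versions(gauge)
mixed = len(active) > 1  # more than one version is serving traffic
print(active, mixed)
```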

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-configuration-api. It is stored as a Counter in Prometheus.

Labels:

error: exception message that forced the unhandled error.

services_status

This metric is intended to store the number of successful and failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

moduleId: Id of the module.
status: OK or ERROR

5.7 - Aura Gateway API metrics

Gateway API metrics

List of metrics available in Aura Gateway API

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-gateway-api.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in aura-gateway-api until its HTTP response is returned:

The number of requests over a period of time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status_code: HTTP status code returned in the response
application: Application name of the request.
channel: Channel name of the request. Only for NLPaaS endpoint.
preset: Preset name of the request. Only for Generative endpoint.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-gateway-api. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status: HTTP status code returned in the response

aura_component_version

This metric is intended to store the number of aura-gateway-api instances (pods) running each version of the code.

It is stored as a Gauge in Prometheus.

Labels:

version: version field in the package.json file included in the running docker container.
component: name of the component that is writing the metric.

This metric has been stored since the Beatles (8.9.0) release.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-gateway-api. It is stored as a Counter in Prometheus.

Labels:

error: exception message that forced the unhandled error.

services_status

This metric is intended to store the number of successful and failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

moduleId: Id of the module.
status: OK or ERROR

5.8 - Aura Bridge metrics

Aura Bridge metrics

List of metrics available in Aura bridge

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-bridge. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in aura-bridge until its HTTP response is returned, indicating to the client that Aura is processing the request to obtain a proper answer for the user.

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status_code: HTTP status code returned in the response

This metric allows measuring the behavior of the requests from any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests

This metric has been stored since the Greenday (6.0.0) release.

outgoing_message_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-bridge.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-bridge is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the user is sent back to the client callback.

This metric measures the duration since the request lands in aura-bridge until the last message of its answer is sent to the client callback.

Labels:

host: host and domain where the request is being sent.
method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
path: specific endpoint of the request.
originStatus: third-party status sent in the body of the response. Usually, this status is sent by WhatsApp.
status: HTTP status code returned in the response.
origin: specific source of the request. The value could be: ‘4p’, ‘whatsapp’, ‘aura-bot’ or ‘genesys’.
channel: channel of the request.

This metric allows measuring the behavior of the requests from any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests

This metric has been stored since the Greenday (6.0.0) release.

incoming_message_duration_seconds

This metric is intended to store the number of requests arriving at aura-bridge from a channel or Direct Line.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-bridge is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the channel or Direct Line is sent back to the client callback. This metric measures the duration from when the request arrives at aura-bridge until it is processed to send to the channel or Direct Line.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status code returned in the response.
originStatus: status sent by Direct Line or the channel in the body of the response in the event of an error.
origin: specific host of the request. If origin is missing, the content of path label will be added.
channel: channel of the request. In Auraline requests used to get conversationId with path: /aura-services/v1/auraline/conversations, channel will be missing.
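The fallback rules for the origin and channel labels can be pictured with the following sketch. It is not the aura-bridge implementation: the function name `incoming_message_labels` and its parameters are hypothetical, and the originStatus label is left out for brevity.

```python
from typing import Dict, Optional


def incoming_message_labels(
    path: str,
    http_status: int,
    origin: Optional[str] = None,
    channel: Optional[str] = None,
) -> Dict[str, str]:
    """Build a label set following the fallback rules described above."""
    labels = {
        "path": path,
        "httpStatus": str(http_status),
        # When origin is missing, the path is used in its place.
        "origin": origin if origin else path,
    }
    if channel:
        # The channel label is simply absent when no channel is known.
        labels["channel"] = channel
    return labels


labels = incoming_message_labels("/aura-services/v1/auraline/conversations", 200)
print(labels)
```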

aura_response_ack_duration_seconds

This metric is intended to store the information related to all the ACK requests sent by the clients to aura-bridge. Clients (WhatsApp) use ACK requests to report whether Aura’s answer was finally delivered to the user.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration. The duration is measured from the moment the ACK request lands in aura-bridge until its asynchronous answer is sent to the user.

Labels:

host: host and domain where the request is being sent.
method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
path: specific endpoint of the request.
originStatus: third-party status sent in the body of the response. Usually, this status is sent by WhatsApp.
status: HTTP status code returned in the response.
origin: specific source of the request. The value could be: ‘4p’, ‘whatsapp’, ‘aura-bot’ or ‘genesys’.
channel: channel of the request.

This metric allows measuring the behavior of the requests to any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests

This metric has been stored since the Heroes (7.0.0) release.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-bridge. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, …)
host: host and domain where the request is being sent
path: specific endpoint of the request
status: HTTP status code returned in the response

This metric allows measuring the behavior of the requests to any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests

This metric has been stored since the Greenday (6.0.0) release.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-bridge. It is stored as a Counter in Prometheus.

Labels:

error: exception message that forced the unhandled error.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_token_generate

It is stored as a Summary in Prometheus.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status returned by Kernel in the response.
originStatus: status sent by Kernel in the body of the response in the event of an error.
origin: channelId of the channel that needs the accessToken in Aura.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_component_version

This metric is intended to store the number of aura-bridge instances (pods) running each version of the code.

It is stored as a Gauge in Prometheus.

Labels:

version: version field in the package.json file included in the running docker container.
component: name of the component that is writing the metric.

This metric has been stored since the Greenday (6.0.0) release under the name aura_bridge_version and was renamed to aura_component_version in Iron Maiden (7.2.0).

aura_bridge_wa_incoming_message

This metric is intended to store the number of unhandled errors happening in aura-bridge. It is stored as a Counter in Prometheus.

Labels:

error: exception message that forced the unhandled error.

This metric has been stored since the Iron Maiden (7.2.0) release.

services_status

This metric is intended to store the number of successful and failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

moduleId: Id of the module.
status: OK or ERROR

5.9 - Aura KPIs uploader metrics

Aura KPIs Uploader

List of metrics available in Aura KPIs uploader

aura_kpis_uploader_metrics_duration

This KPI measures the time required by aura-kpis-uploader to process each type of KPI. KPI management has several steps (load, process, upload), and this KPI represents the time it takes to perform all those steps for each of the KPIs defined in AURA_SOURCE_PATH_AVRO_ADAPTERS.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

Labels:

format: File format in which the KPI will be stored.
- csv: File format will be CSV (deprecated).
- avro: File format will be AVRO.
kpiType: Type of KPI:
- entity: KPI is of type Entity.
- dimensional: KPI is of type Dimensional.
kpiName: Name of the KPI.
duration: Time, in seconds, taken to process the KPI.
numberFilesProcessed: Number of KPIs processed. If the format is AVRO, it represents the number of records processed. If the format is CSV, it only represents the number of processed files.

aura_kpis_uploader_metrics

This metric is intended to store the information related to all processes executed by aura-kpis-uploader. It is stored as a Counter in Prometheus.

This KPI measures the number of KPI records processed. If the format is AVRO, it represents the number of records processed. If the format is CSV, it only represents the number of processed files.

Labels:

format: File format in which the KPI will be stored.
- csv: File format will be CSV (deprecated).
- avro: File format will be AVRO.
kpiType: Type of KPI:
- entity: KPI is of type Entity.
- dimensional: KPI is of type Dimensional.
kpiName: Name of the KPI.
duration: Time, in seconds, taken to process the KPI.
numberFilesProcessed: Number of KPIs processed. If the format is AVRO, it represents the number of records processed. If the format is CSV, it only represents the number of processed files.

aura_kpis_uploader_errors

This metric is intended to store the information related to all errors generated during the execution of aura-kpis-uploader. It is stored as a Counter in Prometheus.

This KPI measures the amount of KPI errors produced when generating KPIs.

Labels:

type: Name of the method or function where the error occurred.
format: File format in which the KPI will be stored.
- csv: File format will be CSV (deprecated).
- avro: File format will be AVRO.
kpiType: Type of KPI:
- entity: KPI is of type Entity.
- dimensional: KPI is of type Dimensional.
kpiName: Name of the KPI.
url: If the error contains a file with more information stored in Azure Storage, this field contains the URL to download the file.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-kpis-uploader. It is stored as a Counter in Prometheus.

Labels:

error: Exception message that forced the unhandled error.

aura_server_unhandled_error is stored from Loquillo (7.5.0) release onwards.

5.10 - Aura NLP metrics

Aura NLP metrics

List of metrics available in Aura NLP

These metrics have been stored since the Heroes (7.0.0) release.

http_request_duration_seconds

This Prometheus metric is modelled as a Summary whose value is the time spent until an incoming HTTP request is answered.

Note that the value is a float number rounded to three decimal places.

This metric is intended to store the duration of incoming requests in seconds.

Labels:

All label values are strings.

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
path: HTTP path of the incoming request.
status_code: the responded HTTP status code (as a string).

Value:

Request duration in seconds.

outgoing_request_duration_seconds

This Prometheus metric is modelled as a Summary whose value is the time spent until the remote host responds to an HTTP request.

Note that the value is a float number rounded to three decimal places.

This metric is intended to store the duration of outgoing requests in seconds.

Labels:

All label values are strings.

method: HTTP method (GET, POST, etc.), a string in uppercase.
host: remote host that will receive the outgoing request.
path: HTTP path of the outgoing request.
status: the responded HTTP status code (as a string).

5.11 - T&C API metrics

Terms & Conditions API metrics

List of metrics available in Terms and Conditions API

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests handled by tac-api. It is stored as a Histogram in Prometheus, so every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in tac-api until its HTTP response is returned.

This metric allows measuring the behavior of the requests from any given endpoint:

The number of requests during a period of time
The average/min/max duration of these requests
Quantiles of the duration and the number of requests in a period

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status_code: HTTP status code returned in the response
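Because this metric is a Histogram, quantiles are estimated from cumulative bucket counts rather than stored directly. The following minimal sketch applies linear interpolation inside the matching bucket, which is the general approach behind the PromQL `histogram_quantile` function. The bucket boundaries and counts are hypothetical, as the actual boundaries depend on how tac-api is configured.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with float("inf"),
    mirroring the `le` label of Prometheus histogram buckets.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            if upper_bound == float("inf") or in_bucket == 0:
                # Open-ended or empty bucket: fall back to the last known bound.
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Hypothetical cumulative bucket counts (upper bounds in seconds).
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
p95 = estimate_quantile(0.95, buckets)
print(f"estimated p95: {p95:.2f} s")
```

The estimate is only as precise as the bucket layout: all observations inside a bucket are assumed to be evenly distributed between its bounds.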

This metric has been stored since the Barricada (5.0.0) release.

http_requests_total

This metric is intended to store information about all the requests handled by tac-api. It is stored as a Counter in Prometheus.

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
status_code: HTTP status code returned in the response.

This metric allows measuring the behavior of the requests from any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests
Quantiles

This metric has been stored since the Barricada (4.0.0) release.

http_in_flight_requests_total

This metric is intended to store the information related to all the concurrent HTTP requests being handled by tac-api in a period.

It is stored as a Gauge in Prometheus because it is a value that can go up and down at every moment.

This metric allows measuring the behavior of the requests from any given endpoint:

The number of requests during a period of time
The average/min/max duration of these requests
Quantiles of the duration and the number of requests in a period.

This metric has been stored since the Barricada (4.0.0) release.

tac_internal_errors

This metric is intended to store the number of internal errors happening in tac-api. It is stored as a Counter in Prometheus because its value can only go up.

Labels:

name: it will contain the exception message that forced the unhandled error.

This metric has been stored since the Barricada (4.0.0) release.

tac_service_acceptances_total

This metric is intended to store the number of acceptances of Terms and Conditions per service handled by tac-api. It is stored as a Counter in Prometheus because its value can only go up.

Labels:

name: it will contain the name of the accepted service. Currently, it could contain one of: aura, whatsapp-anonymous, whatsapp-authenticated
version: T&C version accepted by the user

This metric has been stored since the Barricada (4.0.0) release.

tac_service_updates_total

This metric is intended to store the number of updates of terms and conditions per service handled by tac-api. It is stored as a Counter in Prometheus because its value can only go up.

Labels:

name: name of the updated service. Currently (Iron Maiden) it could contain one of: aura, whatsapp-anonymous, whatsapp-authenticated
version: T&C version updated by the user

This metric has been stored since the Barricada (4.0.0) release.

tac_user_deletions_total

This metric is intended to store the number of deletions of terms and conditions per service handled by tac-api. It is stored as a Counter in Prometheus because its value can only go up.

This metric has been stored since the Barricada (4.0.0) release.

aura_component_version

This metric is intended to store the number of tac-api instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

version: version field in the package.json file included in the running docker container.
component: name of the component that is writing the metric.

This metric has been stored since the Iron Maiden (7.2.0) release.

5.12 - NLP provisioning metrics

NLP Provisioning metrics

List of metrics available in Aura NLP provisioning

These metrics have been stored since the Heroes (7.0.0) release.

Introduction

In the Aura NLP provisioning component, it is important to know at every moment how many processes have been restarted in relation to the total number of processes currently working on the different containers. This makes it possible to raise an alert on abnormal performance and take appropriate measures.

http_request_duration_seconds

This Prometheus metric is modelled as a Summary whose value is the time spent until an incoming HTTP request is answered.

Note that the value is a float number rounded to three decimal places.

This metric is intended to store the duration of incoming requests in seconds.

Labels:

All label values are strings.

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
path: HTTP path of the incoming request.
status_code: the responded HTTP status code (as a string).

Value:

Request duration in seconds.

nlp_provisioning_killed_processes

This metric is intended to store the number of processes killed in each iteration of the Aura NLP provisioning execution. It is stored as a Gauge in Prometheus.

Value:

Number of worker processes killed in each iteration.

nlp_provisioning_alive_processes

This metric is intended to store the number of worker processes alive in each iteration of NLP Provisioning. It is stored as a Gauge.

Value:

Total alive processes.
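The killed and alive gauges can be combined into the ratio mentioned in the introduction of this section. The sketch below assumes the ratio is defined as killed processes divided by alive processes and uses a hypothetical alert threshold of 0.2. Neither the definition nor the threshold comes from the component itself.

```python
def killed_ratio(killed: float, alive: float) -> float:
    """Return killed processes as a fraction of alive processes."""
    if alive <= 0:
        # Nothing is alive: any killed process is treated as fully abnormal.
        return float("inf") if killed > 0 else 0.0
    return killed / alive


def is_abnormal(killed: float, alive: float, threshold: float = 0.2) -> bool:
    """Tell whether the ratio exceeds the (hypothetical) alert threshold."""
    return killed_ratio(killed, alive) > threshold


# Hypothetical gauge readings for one iteration.
print(killed_ratio(2, 8), is_abnormal(2, 8))
```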

nlp_provisioning_expected_alive_processes

This metric is intended to store the number of expected alive processes in the NLP Provisioning. It is stored as a Gauge.

Value:

Set gauge with total alive processes.
Decrease gauge with finished processes.

nlp_provisioning_container_killed_count

This metric is intended to store the counter of all the processes killed in Aura NLP provisioning. It is stored as a Counter in Prometheus.

Labels:

container: container URL.

Value:

Dead process ids (pids).

5.13 - Aura Complex Logic metrics

Aura Complex Logic metrics

List of metrics available in Aura Complex Logic Framework

These metrics have been stored since the Heroes (7.0.0) release.

http_request_duration_seconds

This Prometheus metric is modelled as a Summary whose value is the time spent until an incoming HTTP request is answered.

Note that the value is a float number rounded to three decimal places.

This metric is intended to store the duration of incoming requests in seconds.

Labels:

All label values are strings.

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
path: HTTP path of the incoming request.
status_code: the responded HTTP status code (as a string).

Value:

Request duration in seconds

supervised_complex_logic_app_restarted_counter

This metric is intended to store a count of the restarted plugins.

It is stored as a Counter in Prometheus.

Labels:

All label values are strings.

app: clf
supervised_plugin: Supervised plugin class path.
plugin_status: Plugin response code status.
plugin_handler_name: Handler name.

complex_logic_app_http_requests

This metric is intended to store the HTTP requests of Aura Complex Logic plugins.

It is stored as a Counter in Prometheus.

Labels:

All label values are strings.

app: clf
plugin: plugin class path.
status_code: plugin response code status.
handler_name: handler name.

5.14 - Aura Context metrics

Aura Context metrics

List of metrics available in Aura Context

These metrics have been stored since the Heroes (7.0.0) release.

http_request_duration_seconds

This Prometheus metric is modelled as a Summary whose value is the time spent until an incoming HTTP request is answered.

Note that the value is a float number rounded to three decimal places.

This metric is intended to store the duration of incoming requests in seconds.

Labels:

All label values are strings.

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
path: HTTP path of the incoming request.
status_code: the responded HTTP status code (as a string).

Value:

Request duration in seconds.

database_request_duration_seconds

This metric is intended to store the duration of database requests in seconds.

It is stored as a Summary in Prometheus.

Labels:

All label values are strings.

database: database name (Redis or Mongo).
operation: database operation (i.e., update, create, get_by_date, get_last_n, get_by_corr).

Value:

Request duration in seconds.
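With the database and operation labels, the average duration per operation can be compared across both databases. The sketch below assumes the count and sum of the Summary have already been read for each label combination. All figures are hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (database, operation)


def average_durations(readings: Dict[Key, Tuple[float, float]]) -> Dict[Key, float]:
    """Map (database, operation) to sum / count, skipping empty series."""
    return {
        key: total / count
        for key, (count, total) in readings.items()
        if count > 0
    }


# Hypothetical (count, sum in seconds) pairs per label combination.
readings = {
    ("Redis", "get_last_n"): (200, 0.4),
    ("Mongo", "get_by_date"): (20, 1.0),
    ("Mongo", "update"): (0, 0.0),
}
averages = average_durations(readings)
slowest = max(averages, key=averages.get)
print(slowest, averages[slowest])
```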

5.15 - Aura File Manager metrics

Aura File Manager metrics

List of metrics available in Aura File Manager

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-file-manager.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in aura-file-manager until its HTTP response is returned, indicating to the client that Aura is processing the request to obtain a proper answer.

The metric allows measuring the behavior of the requests from any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status_code: HTTP status code returned in the response

outgoing_request_duration_seconds

This metric is intended to store the processing time related to all the outgoing HTTP requests made by aura-file-manager.

It is stored as a Summary in Prometheus so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests to any given endpoint:

The number of requests over a period of time
The average/min/max duration of these requests

Labels:

method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
host: host and domain where the request is being sent
path: specific endpoint of the request
status: HTTP status code returned in the response

outgoing_message_duration_seconds

This metric is intended to store the processing time of aura-bot requests arriving to aura-file-manager.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-file-manager is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the user is sent back to the client callback. This metric measures the duration since the request lands in aura-file-manager until the last message of its answer is sent to the client callback.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status code returned in the response.
origin: aura-bot

incoming_message_duration_seconds

This metric is intended to store the processing time of aura-bot requests arriving to aura-file-manager.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-file-manager is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the channel or skill is sent back to the client callback. This metric measures the duration from when the request arrives at aura-file-manager until it is processed to send the response.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status code returned in the response.
originStatus: status sent in the body of the response in the event of an error.
origin: aura-bot

aura_component_version

This metric is intended to store the number of aura-file-manager instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

version: version field in the package.json file included in the running docker container.
component: name of the component that is writing the metric.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-file-manager.

It is stored as a Counter in Prometheus.

Labels:

error: exception message that forced the unhandled error.

aura_token_generate

This metric is intended to store the time aura-file-manager takes to get or refresh the Kernel token.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

Labels:

path: specific endpoint of the request.
httpStatus: HTTP status code returned in the response.
originStatus: status sent by Direct Line in the body of the response when an error occurs.
origin: kernel client identifier

file_validation_duration_seconds

This metric is intended to store the validation time of a file.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

Labels:

path: specific endpoint of the request.
code: OK when file is valid.
origin: specific endpoint of the request.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

moduleId: Id of the module.
status: OK or ERROR

5.16 - Aura Redis MongoDB sync metrics

Aura Redis MongoDB Synchronizer metrics

List of metrics available in aura-redis-mongo-sync (ARMS)

aura_component_version

This metric is intended to store the number of aura-redis-mongo-sync instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

version: version field in the package.json file included in the running docker container.
component: name of the component that is writing the metric.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-redis-mongo-sync.

It is stored as a Counter in Prometheus.

Labels:

error: exception message that forced the unhandled error.

redis_mongo_sync_duration_milliseconds

This metric measures the data upload time from the service to the Mongo database.

It is stored as a Histogram in Prometheus, so every sample, besides the defined labels, also includes its duration.

The aura-redis-mongo-sync service contains a data collector that helps the event service move stale data from Redis to MongoDB. This collector sends the data in packets to optimize performance. This metric measures the time MongoDB takes to process the packet.

Labels:

status: HTTP status returned in the response. Values: success.
- success: if the status is success, the time is stored.
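Prometheus Histograms expose cumulative buckets together with a sum and a count of the observations. The following sketch shows how one observation updates those values. The bucket bounds (50, 100 and 500 milliseconds) and the class name are assumptions made for the example; the real bucket configuration of aura-redis-mongo-sync is not stated here.

```python
import math


class CumulativeHistogram:
    """Toy histogram with cumulative buckets, a running sum and a count."""

    def __init__(self, upper_bounds):
        # The +Inf bucket always exists and counts every observation.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        # Cumulative: a value is counted in every bucket whose bound covers it.
        for index, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[index] += 1


# Hypothetical bucket bounds and durations, in milliseconds.
histogram = CumulativeHistogram([50, 100, 500])
for duration_ms in (30, 80, 700):
    histogram.observe(duration_ms)
```

Because buckets are cumulative, the last bucket always equals the total number of observations.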

redis_mongo_synced_items_total

This metric is intended to store the number of records synchronized between Redis and MongoDB, classified by type.

It is stored as a Counter in Prometheus.

Labels:

type: record type. Values: event, active_context.
- event: Items synchronized by event.
- active_context: Items synchronized by active context process.

redis_mongo_synced_errors

This metric is intended to store the errors that have occurred in the synchronization.

It is stored as a Counter in Prometheus.

Labels:

error: Values: create, syncData, executeBulk.
- create: If the error occurred when creating the service.
- syncData: If the error occurred when synchronizing the data.
- executeBulk: If the error occurred when uploading the data to MongoDB in bulk mode.

redis_mongo_sync_configuration_settings

This metric contains the service configuration data.

It is stored as a Gauge in Prometheus.

Labels:

setting_name: Values: shard_count, pod_count, active_context_ttl_seconds, redis_cache_ttl_seconds.
- shard_count: Current shard used to distribute the data to synchronize between pods.
- pod_count: Current number of services of aura-redis-mongo-sync.
- active_context_ttl_seconds: Time interval to run the data collector.
- redis_cache_ttl_seconds: Time in seconds that will be set to the context elements in the Redis cache.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

moduleId: Id of the module.
status: OK or ERROR

6 - Aura dashboards

Aura dashboards

Discover the dashboards that can be generated through the different tools used for Aura monitoring in order to track and analyze data.

Introduction

Dashboards are reporting tools that aggregate and display metrics and key indicators, so they can be examined at a glance by all possible audiences.

These dashboards allow data interpretation and provide an overall view for the evaluation of Aura’s performance, thus improving decision-making. Each component has a dashboard that shows its current behavior, and there is a single dashboard for an Aura overview.

There are two types of dashboards for Aura metrics (Prometheus) that are generated in Grafana: Aura system dashboards and Aura components dashboards.

6.1 - Aura system dashboards

Aura system dashboards

Grafana dashboards with metrics related to the performance of Aura system

Introduction

Currently, these are the available Aura system dashboards in Grafana based on metrics stored in Prometheus:

6.1.1 - Alertmanager dashboard

Alertmanager dashboard

Information provided by Alertmanager dashboards

Panels

Received alerts rate

It shows a time series with the received alerts rate aggregated by one minute.

The x-axis shows the time series and the y-axis shows the received alerts rate.

The queries used to get the panel information are:

sum(rate(prometheus_notifications_alertmanagers_discovered[1m])) by(status)
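The panels in this dashboard rely on the PromQL rate function over a one-minute window. As a simplified sketch of what that function computes, the code below derives a per-second rate from counter samples and treats a drop in value as a counter reset. It deliberately omits the extrapolation to the window boundaries that PromQL applies, and the function name and sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so it restarted from zero.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Hypothetical samples scraped every 30 seconds.
steady = simple_rate([(0, 10), (30, 40), (60, 70)])
with_reset = simple_rate([(0, 100), (30, 5), (60, 35)])
```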

An example of this panel is shown below:

The available metrics are defined in the following sections.

Successful notification rate

It shows a time series with the successful notifications rate aggregated by one minute.

The x-axis shows the time series and the y-axis shows the successful notifications rate.

The queries used to get the panel information are:

sum(rate(prometheus_notifications_sent_total[1m])) by(integration)

An example of this panel is shown below:

Failed notifications rate

It shows a time series with the failed notifications rate aggregated by one minute.

The x-axis shows the time series and the y-axis shows the failed notifications rate.

The queries used to get panel information are:

sum(rate(prometheus_notifications_errors_total[1m])) by(integration)

An example of this panel is shown below:

CPU usage rate

It shows a time series with the CPU usage rate aggregated by one minute. It also shows the current minimum, maximum and average CPU consumption of Alertmanager.

The x-axis shows the time series and the y-axis shows the CPU usage rate.

The queries used to get panel information are:

sum(rate(container_cpu_usage_seconds_total{container="alertmanager"}[1m])) by (pod_name)

An example of this panel is shown below:

Memory usage

It shows a time series with the memory usage. It also shows the current minimum, maximum and average memory consumption of Alertmanager.

The x-axis shows the time series and the y-axis shows the memory usage.

The queries used to get panel information are:

sum (container_memory_working_set_bytes{container="alertmanager"}) by (pod_name)

An example of this panel is shown below:

Pods network I/O

It shows a time series with the network I/O average aggregated by one minute. It also shows the current minimum, maximum and average network I/O.

The x-axis shows the time series and the y-axis shows the network usage.

The queries used to get panel information are:

sum (rate (container_network_receive_bytes_total{pod!="",pod=~"alertmanager.*"}[1m])) by (pod)
- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~"alertmanager.*"}[1m])) by (pod)

An example of this panel is shown below:

6.1.2 - Elasticsearch dashboard

Elasticsearch dashboard

Information provided by Elasticsearch dashboard

Introduction

The Elasticsearch dashboard monitors multiple data-, service- and system-related metrics.

The different graphs are shown in the following sections:

Cluster graphs
Shards graphs
System graphs
Documents graphs
Total operations stats graphs
Elasticsearch times graphs
Caches graphs
Thread pool graphs
JVM garbage collection graphs

Cluster graphs

The current section includes cluster related graphs.

Health status

Colour-coded indicator of cluster health.

Metrics:

((sum(elasticsearch_cluster_health_status{color="green"})*2)+sum(elasticsearch_cluster_health_status{color="yellow"}))/count(elasticsearch_index_stats_up)
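The expression above can be read as a numeric score. A minimal sketch of the arithmetic follows, assuming that every counted series reports the same colour and that the two sums and the count cover the same set of series. Under that assumption the result is 2 for a green cluster, 1 for yellow and 0 for red. The function and parameter names are hypothetical.

```python
def health_score(green_sum, yellow_sum, up_count):
    """Mirror of ((sum(green) * 2) + sum(yellow)) / count(up)."""
    if up_count <= 0:
        raise ValueError("up_count must be positive")
    return (green_sum * 2 + yellow_sum) / up_count


# Hypothetical values for three reporting series.
all_green = health_score(green_sum=3, yellow_sum=0, up_count=3)
all_yellow = health_score(green_sum=0, yellow_sum=3, up_count=3)
all_red = health_score(green_sum=0, yellow_sum=0, up_count=3)
```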

Nodes

Number of nodes.

Metrics:

count(elasticsearch_index_stats_up)

Data nodes

Number of data nodes per node.

Metrics:

sum(elasticsearch_cluster_health_number_of_data_nodes{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Pending tasks

Pending tasks per node.

Metrics:

sum(elasticsearch_cluster_health_number_of_pending_tasks{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Graph visual

Shards graphs

Shards related graphs.

Active primary shards

Number of active primary shards per node.

Metrics:

sum(elasticsearch_cluster_health_active_primary_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Active shards

Number of active shards per node.

Metrics:

sum(elasticsearch_cluster_health_active_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Initializing shards

Number of shards initializing per node.

Metrics:

sum(elasticsearch_cluster_health_initializing_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Relocating shards

Number of relocating shards per node.

Metrics:

sum(elasticsearch_cluster_health_relocating_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Unassigned shards

Number of unassigned shards per node.

Metrics:

sum(elasticsearch_cluster_health_delayed_unassigned_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Graph visual

System graphs

System related graphs.

CPU usage

Percentage of used CPU on master and data nodes.

Metrics: It includes two metrics:

Master node CPU usage

elasticsearch_process_cpu_percent{cluster="elasticsearch",es_master_node="true",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Data nodes CPU usage:

elasticsearch_process_cpu_percent{cluster="elasticsearch",es_data_node="true",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

JVM memory usage

Memory used by the JVM, in bytes.

Metrics:

It includes three metrics:

Used memory

elasticsearch_jvm_memory_used_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Committed memory

elasticsearch_jvm_memory_committed_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Max memory

elasticsearch_jvm_memory_max_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Disk usage

Disk usage as a fraction of the total data filesystem size (1 minus available bytes divided by total bytes).

Metrics:

1-(elasticsearch_filesystem_data_available_bytes{cluster="elasticsearch"}/elasticsearch_filesystem_data_size_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"})
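The query above subtracts the ratio of available bytes to total bytes from one, so its result is the used share of the disk rather than an absolute number of bytes. A small sketch of the same arithmetic, with a hypothetical function name and made-up byte values:

```python
def used_disk_fraction(available_bytes, size_bytes):
    """Used share of a filesystem, between 0 and 1."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 1 - available_bytes / size_bytes


# Hypothetical node with 100 bytes of capacity and 25 bytes still free.
fraction_used = used_disk_fraction(available_bytes=25, size_bytes=100)
```

Multiplying the result by 100 gives a percentage.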

Network usage

Bytes rate sent and received, aggregated by one minute.

Metrics: It includes two metrics:

Sent bytes

irate(elasticsearch_transport_tx_size_bytes_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Received bytes

irate(elasticsearch_transport_rx_size_bytes_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
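Unlike rate, which averages over the whole window, the PromQL irate function only looks at the last two samples inside the window, so it reacts faster to short spikes. The sketch below illustrates that behaviour. The function name and the sample values are hypothetical.

```python
def instant_rate(samples):
    """Per-second rate based on the last two samples only.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    """
    if len(samples) < 2:
        return None
    (previous_time, previous_value), (last_time, last_value) = samples[-2:]
    elapsed = last_time - previous_time
    if elapsed <= 0:
        return None
    if last_value >= previous_value:
        delta = last_value - previous_value
    else:
        # A lower value means the counter was reset.
        delta = last_value
    return delta / elapsed


# Hypothetical counter: a burst in the first 30 seconds, then a slower pace.
window = [(0, 0), (30, 300), (60, 360)]
latest = instant_rate(window)
window_average = (window[-1][1] - window[0][1]) / (window[-1][0] - window[0][0])
```

In this example the instant rate (2 bytes per second) is lower than the window average (6 bytes per second), because the burst happened before the last two samples.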

Graph visual

Documents graphs

Documents state related graphs.

Documents count

Number of documents in cluster.

Metrics:

elasticsearch_indices_docs{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Documents indexed rate

Rate of indexed documents, aggregated by one minute.

Metrics:

irate(elasticsearch_indices_indexing_index_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Documents deleted rate

Rate of deleted documents, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_docs_deleted{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Documents merged rate

Rate of merged documents, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_merges_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

Total operations stats graphs

Data related to total operations.

Total operations rate

Rate of the total number of operations, aggregated by one minute.

Metrics: It includes six metrics:

Indexing index

irate(elasticsearch_indices_indexing_index_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Search queries

irate(elasticsearch_indices_search_query_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Search fetch

irate(elasticsearch_indices_search_fetch_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Merges

irate(elasticsearch_indices_merges_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Refresh

irate(elasticsearch_indices_refresh_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Flush

irate(elasticsearch_indices_flush_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Total operations time

Time rate for the different operations in milliseconds, aggregated by one minute.

Metrics: It includes six metrics:

Indexing index

irate(elasticsearch_indices_indexing_index_time_seconds_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Search queries

irate(elasticsearch_indices_search_query_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Search fetch

irate(elasticsearch_indices_search_fetch_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Merges

irate(elasticsearch_indices_merges_total_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Refresh

irate(elasticsearch_indices_refresh_total_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Flush

irate(elasticsearch_indices_flush_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

Elasticsearch times graphs

Graphs related to elapsed times of different actions.

Query time

Time rate for search query operations in seconds, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_search_query_time_seconds{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Indexing time

Time rate for indexing index operations in seconds, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_indexing_index_time_seconds_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Merging time

Time rate for merge operations in seconds, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_merges_total_time_seconds_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Caches graphs

Graphs related to caches metrics.

Field data memory size

Field data memory size in bytes.

Metrics:

elasticsearch_indices_fielddata_memory_size_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Field data evictions

Rate of field data evicted, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_fielddata_evictions{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Query cache size

Bytes of memory occupied by cached queries.

Metrics:

elasticsearch_indices_query_cache_memory_size_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Query cache evictions

Rate of queries evicted, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_query_cache_evictions{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

Thread pool graphs

Graphs related to the thread pool.

Operations rejected

Rate of rejected operations, aggregated by one minute.

Metrics:

irate(elasticsearch_thread_pool_rejected_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Operations queued

Number of queued operations.

Metrics:

elasticsearch_thread_pool_queue_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Threads active

Number of active threads.

Metrics:

elasticsearch_thread_pool_active_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Operations completed

Rate of completed operations, aggregated by one minute.

Metrics:

irate(elasticsearch_thread_pool_completed_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

JVM Garbage collection graphs

Graphs related to JVM garbage collector activity.

GC count

Rate of GC count, aggregated by one minute.

Metrics:

rate(elasticsearch_jvm_gc_collection_seconds_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

GC time

Rate of GC execution time, aggregated by one minute.

Metrics:

rate(elasticsearch_jvm_gc_collection_seconds_sum{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

6.1.3 - Fluent bit dashboard

Fluent bit dashboard

Information provided by Fluent bit dashboard

Introduction

The Fluent bit dashboard monitors system metrics related to Fluent bit.

The available metrics are defined in the following sections.

Input bytes

Input bytes rate, aggregated by one minute.

Metrics:

rate(fluentbit_input_bytes_total[1m])

Graph visual

Output bytes

Output bytes rate, aggregated by one minute.

Metrics:

rate(fluentbit_output_proc_bytes_total[1m])

Graph visual

Retries/fails

Rate of retries and failures, aggregated by one minute.

Metrics:
It includes two metrics:

Retries rate

rate(fluentbit_output_retries_total[1m])

Fails rate

rate(fluentbit_output_retries_failed_total[1m])

Graph visual

Errors

Rate of output errors, aggregated by one minute.

Metrics:

rate(fluentbit_output_errors_total[1m])

Graph visual

6.1.4 - Kubernetes cluster monitoring dashboard

Kubernetes cluster monitoring dashboard

Information provided by Kubernetes cluster monitoring dashboard

Introduction

The Kubernetes cluster monitoring dashboard monitors system- and network-related data from Kubernetes clusters.

The available metrics are defined in the following sections.

Network I/O pressure graph

Rate of total received/sent data on all cluster containers, in bytes and aggregated by one minute.

Metrics:
It includes two metrics:

Received bytes

sum (rate (container_network_receive_bytes_total{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m]))

Sent bytes (negative value)

- sum (rate (container_network_transmit_bytes_total{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m]))

Graph visual

Total usage

Graphs with different system parameters usage.

Cluster memory usage

It is composed of three graphs:

Memory usage, showing percentage of used memory
Used, showing used memory
Total, showing total memory

Metrics:
It includes three metrics:

Memory usage percentage

sum (container_memory_working_set_bytes{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) / 
sum (machine_memory_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) * 100

Used memory

sum (container_memory_working_set_bytes{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Total cluster memory

sum (machine_memory_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Cluster CPU usage

It is composed of three graphs:

CPU usage, showing percentage of used CPU cores, aggregated by one minute
Used, showing used CPU cores, aggregated by one minute
Total, showing total CPU cores

Metrics:
It includes three metrics:

CPU usage percentage

sum (rate (container_cpu_usage_seconds_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) / 
sum (machine_cpu_cores{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) * 100

Used CPUs

sum (rate (container_cpu_usage_seconds_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m]))

Total cluster CPUs

sum (machine_cpu_cores{kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Cluster filesystem usage

It is composed of three graphs:

Filesystem usage, showing percentage of used filesystem space
Used, showing used filesystem space
Total, showing total filesystem space

Metrics:
It includes three metrics:

Filesystem usage

sum (container_fs_usage_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) / 
sum (container_fs_limit_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) * 100

Used

sum (container_fs_usage_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Total

sum (container_fs_limit_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Graph visual

Pods CPU usage

CPU usage rate, classified by pod and aggregated by one minute.

Metrics:

sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Graph visual

Containers CPU usage

CPU usage rate, classified by container and aggregated by one minute.

Metrics:
It includes two metrics:

Containers with “k8s_”

sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Containers without “k8s_”

sum (rate (container_cpu_usage_seconds_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)
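The two queries above split containers with the regular expression matchers =~ and !~. PromQL regular expressions are fully anchored, so the pattern must match the whole label value. The sketch below reproduces that split using Python's re.fullmatch as a stand-in for the PromQL matcher. The container names are hypothetical.

```python
import re

# Same pattern as in the queries; fullmatch mimics PromQL's anchored matching.
K8S_PATTERN = re.compile(r"^k8s_.*")


def split_by_k8s_prefix(names):
    """Return (names matching =~ the pattern, names matching !~ the pattern)."""
    matching, not_matching = [], []
    for name in names:
        if K8S_PATTERN.fullmatch(name):
            matching.append(name)
        else:
            not_matching.append(name)
    return matching, not_matching


# Hypothetical values of the name label.
with_prefix, without_prefix = split_by_k8s_prefix(
    ["k8s_aura-bot_aura-bot-0", "kubelet", "my_k8s_tool"]
)
```

A name such as my_k8s_tool falls into the second group because the prefix must appear at the start of the value.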

Graph visual

All processes CPU usage

Total CPU usage rate, aggregated by one minute.

Metrics:

sum (rate (container_cpu_usage_seconds_total{id!="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (id)

Graph visual

Pods memory usage

Memory usage, classified by pod.

Metrics:

sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (pod_name)

Graph visual

Containers memory usage

Memory usage, classified by container.

Metrics:
It includes two metrics:

Containers with “k8s_”

sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",container_name!="POD",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (container_name, pod_name)

Containers without “k8s_”

sum (container_memory_working_set_bytes{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname, name, image)

Graph visual

All processes memory usage

Total memory usage.

Metrics:

sum (container_memory_working_set_bytes{pod_name!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (pod_name)

Graph visual

Pods network I/O

Total network received/sent usage rate, classified by pod and aggregated by one minute.

Metrics:

Received bytes

sum (rate (container_network_receive_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Sent bytes

- sum (rate (container_network_transmit_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Graph visual

Containers network I/O

Total network received/sent usage rate, classified by container and aggregated by one minute.

Metrics:

Received bytes, containers with “k8s_”

sum (rate (container_network_receive_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (container_name, pod_name)

Sent bytes, containers with “k8s_”

- sum (rate (container_network_transmit_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (container_name, pod_name)

Received bytes, containers without “k8s_”

sum (rate (container_network_receive_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)

Sent bytes, containers without “k8s_”

- sum (rate (container_network_transmit_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)

Graph visual

All processes network I/O

Total network received/sent usage rate, aggregated by one minute.

Metrics:

Received bytes

sum (rate (container_network_receive_bytes_total{pod_name!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Sent bytes

- sum (rate (container_network_transmit_bytes_total{pod_name!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Graph visual

Pods disk I/O

Total disk reads/writes rate, classified by pod and aggregated by one minute.

Metrics:

Read bytes, pods with device

sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (pod_name)

Written bytes, pods with device

sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (pod_name)

Read bytes, pods without device

sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (pod_name)

Written bytes, pods without device

sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (pod_name)

Graph visual

Containers disk I/O

Total disk reads/writes rate, classified by container and aggregated by one minute.

Metrics:

Read bytes, containers with device

sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (container_name, pod_name)

Written bytes, containers with device

sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (container_name, pod_name)

Read bytes, containers without device

sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (container_name, pod_name)

Written bytes, containers without device

sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (container_name, pod_name)

Read bytes, containers without “k8s_”

sum(rate(container_fs_reads_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)

Written bytes, containers without “k8s_”

sum(rate(container_fs_writes_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)

Graph visual

6.1.5 - Kubernetes cron and batch job monitoring dashboard

Kubernetes cron and batch job monitoring dashboard

Information provided by cron and batch job monitoring dashboard

Introduction

Kubernetes cron and batch job monitoring dashboard monitors success/fail rates for cron/batch jobs.

The available metrics are defined in the following sections.

Jobs succeeded

Successfully executed jobs.

Metrics:

kube_job_status_succeeded

Graph visual

Jobs failed

Failed job executions.

Metrics:

kube_job_status_failed

Graph visual

6.1.6 - Kubernetes nodes dashboard

Kubernetes nodes dashboard

Information provided by Kubernetes nodes dashboard

Introduction

The Kubernetes nodes dashboard monitors the general system status of the nodes.

The available metrics are defined in the following sections.

CPU usage

CPU usage percentage per node, aggregated by one minute.

Metrics:

sum (rate (container_cpu_usage_seconds_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname) / sum (machine_cpu_cores{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname) * 100
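In the query above both sides are grouped by kubernetes_io_hostname, so the division is applied node by node and only nodes present on both sides produce a result. The sketch below mirrors that matching with plain dictionaries. The node names and values are hypothetical.

```python
def cpu_percent_by_node(cpu_rate_by_node, cores_by_node):
    """Divide used CPU cores by available cores for each node, as a percentage."""
    return {
        node: cpu_rate_by_node[node] / cores_by_node[node] * 100
        for node in cpu_rate_by_node
        # Nodes missing on the right-hand side yield no result, as in PromQL.
        if node in cores_by_node and cores_by_node[node] > 0
    }


# Hypothetical per-node values: used cores (from rate) and total cores.
usage = cpu_percent_by_node(
    {"node-a": 1.5, "node-b": 0.5, "node-c": 1.0},
    {"node-a": 4, "node-b": 2},
)
```

Here node-c is dropped because it has no matching entry for the number of cores.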

Graph visual

Memory usage

Memory usage percentage.

Metrics:

sum (container_memory_working_set_bytes{id="/", kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname) / sum (machine_memory_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname) * 100

Graph visual

Disk I/O

Disk read/written data in bytes.

Metrics:
It includes two metrics:

Read bytes

sum (container_fs_reads_bytes_total{id="/", kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname)

Written bytes

sum (container_fs_writes_bytes_total{id="/", kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname)

Graph visual

Network I/O

Network received/sent data in bytes, aggregated by one minute.

Metrics:
It includes two metrics:

Received bytes

sum (rate (container_network_receive_bytes_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname)

Sent bytes

- sum (rate (container_network_transmit_bytes_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname)

Graph visual

6.1.7 - Kubernetes services dashboard

Kubernetes services dashboard

Information provided by Kubernetes services dashboard

Introduction

Kubernetes services dashboard monitors system metrics related to services/pods.

The available metrics are defined in the following sections.

Service CPU usage

Services CPU usage rate, aggregated by one minute.

Metrics:

sum(rate(container_cpu_usage_seconds_total{container!="",container=~".*"}[1m])) by (container)

Graph visual

Pods CPU usage

Pods CPU usage rate, aggregated by one minute.

Metrics:
It includes two metrics:

CPU usage by pod and container

sum (rate (container_cpu_usage_seconds_total{container!="",container=~".*"}[1m])) by (container, pod)

CPU usage by container and instance

sum (rate (container_cpu_usage_seconds_total{container!="",container=~".*"}[1m])) by (instance, container)

Graph visual

Service memory usage

Service memory usage in bytes.

Metrics:

sum (container_memory_working_set_bytes{container!="",container=~".*"}) by (container)

Graph visual

Pods memory usage

Pods memory usage in bytes, and memory usage rate aggregated by one minute.

Metrics:
It includes four metrics:

memory usage classified by pod and container

sum (container_memory_working_set_bytes{container!="",container=~".*"}) by (container, pod)

memory usage classified by container and instance

sum (container_memory_working_set_bytes{container!="",container=~".*"}) by (instance, container)

memory usage rate, classified by pod and container, and aggregated by one minute

sum (rate (container_memory_working_set_bytes{container!="",container=~".*"}[1m])) by (container, pod)

memory usage rate, classified by instance and container, and aggregated by one minute

sum (rate (container_memory_working_set_bytes{container!="",container=~".*"}[1m])) by (instance, container)

Graph visual

Service network I/O

Network received/sent data rate, aggregated by one minute

Metrics:
It includes two metrics:

Received bytes

sum (rate (container_network_receive_bytes_total{pod!="",pod=~".*.*"}[1m])) by (pod)

Sent bytes

- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~".*.*"}[1m])) by (pod)

Graph visual

Pods network I/O

Pods received/sent data rate in bytes, aggregated by one minute.

Metrics:
It includes four metrics:

Received bytes classified by pod

sum (rate (container_network_receive_bytes_total{pod!="",pod=~".*.*"}[1m])) by (name, pod)

Sent bytes classified by pod

- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~".*.*"}[1m])) by (container, pod)

Received bytes classified by container and instance

sum (rate (container_network_receive_bytes_total{pod!="",pod=~".*.*"}[1m])) by (instance, container, image)

Sent bytes classified by container and instance

- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~".*.*"}[1m])) by (instance, container, image)

Graph visual

6.1.8 - Kubernetes storage monitoring dashboard

Kubernetes storage monitoring dashboard

Information provided by Kubernetes storage monitoring dashboard

Introduction

The Kubernetes storage monitoring dashboard monitors storage-related metrics.

The available metrics are defined in the following sections.

Used space

Data usage, in bytes, of kubelet volumes and container filesystems.

Metrics:
It includes two metrics:

Kubelet volumes used bytes

kubelet_volume_stats_used_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}

Container filesystem usage in bytes

container_fs_usage_bytes{image!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}

Graph visual

PVC used space %

Percentage of used space for each PersistentVolumeClaim.

Metrics:

(kubelet_volume_stats_used_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"} / kubelet_volume_stats_capacity_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"})
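The query above divides used bytes by capacity bytes for each claim, so the raw result is a fraction between 0 and 1 rather than a value between 0 and 100. The sketch below mirrors that division in Python, pairing the two sides by claim name. The function name `used_fraction` and the byte counts are hypothetical, and how the panel converts the fraction for display is not stated in this document.

```python
from typing import Dict


def used_fraction(used: Dict[str, float], capacity: Dict[str, float]) -> Dict[str, float]:
    """Divide used by capacity for every claim present in both mappings."""
    return {
        pvc: used[pvc] / capacity[pvc]
        for pvc in used
        # Simplification: claims with missing or zero capacity are skipped.
        if capacity.get(pvc)
    }


# Hypothetical byte counts for two of the claims named in the query.
used_bytes = {"datadir-mongodb-0": 25e9, "redis-data-redis-0": 2e9}
capacity_bytes = {"datadir-mongodb-0": 100e9, "redis-data-redis-0": 8e9}

fractions = used_fraction(used_bytes, capacity_bytes)
percentages = {pvc: value * 100 for pvc, value in fractions.items()}
```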

Graph visual

Local used space %

Percentage of its assigned filesystem space that each container uses.

Metrics:

(container_fs_usage_bytes{image!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"} / container_fs_limit_bytes{image!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Graph visual

Used inodes

Total number of used inodes in kubelet PersistentVolumeClaim volumes.

Metrics:

kubelet_volume_stats_inodes_used{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"}

Graph visual

PVC used inodes %

Percentage of inodes used in kubelet PersistentVolumeClaim volumes.

Metrics:

(kubelet_volume_stats_inodes_used{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"} / kubelet_volume_stats_inodes{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"})

Graph visual

6.1.9 - NLP provisioning dashboard

NLP provisioning dashboard

Information provided by NLP provisioning dashboard

Panels

Expected Killed Alive

Number of expected, killed and alive provisioning processes.

The queries used to get the panel information are:

nlp_provisioning_expected_alive_processes
nlp_provisioning_killed_processes
nlp_provisioning_alive_processes

An example of this panel is shown below:

Killed by container

Time series with the killed processes by container.

The x-axis shows the time series and the y-axis shows the number of killed processes by container.

The queries used to get the panel information are:

nlp_provisioning_container_killed_count_total

An example of this panel is shown below:

Killed processes

Time series with the total killed processes.

The x-axis shows the time series and the y-axis shows the number of killed processes.

The queries used to get the panel information are:

nlp_provisioning_killed_processes

An example of this panel is shown below:

Alive processes VS Expected alive processes

Time series with the ratio between alive processes and expected alive processes.

The x-axis shows the time series and the y-axis shows the ratio between alive and expected alive processes.

The queries used to get the panel information are:

nlp_provisioning_alive_processes / nlp_provisioning_expected_alive_processes

An example of this panel is shown below:

Alive processes VS expected processes

Time series with the ratio between alive processes rate aggregated by 15 minutes and expected alive processes rate aggregated by 15 minutes.

The x-axis shows the time series and the y-axis shows the ratio between alive and expected alive processes.

The queries used to get the panel information are:

sum by (exported_job) (rate(nlp_provisioning_alive_processes{exported_job="nlp_provisioning_job"}[15m])) / 
sum by (exported_job) (rate(nlp_provisioning_expected_alive_processes{exported_job="nlp_provisioning_job"}[15m]))

An example of this panel is shown below:

6.1.10 - Prometheus stats dashboard

Prometheus stats dashboard

Information provided by Prometheus stats dashboard

Introduction

This dashboard provides detailed information about how Prometheus performs.

To get the information about each pod, the dashboard provides a filter with the following fields:

jobs: list of active jobs.
instances: list of scrapeable instances.
interval: possible time intervals.

Once the filter values are selected, the following panels are displayed.
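The long `instance=~"(...)"` matchers that appear in the queries of this section look like the selected filter values expanded into a regular expression. The sketch below shows one way such a matcher could be produced. It is an assumption about how the filter is expanded, not something taken from the dashboard definition, and the helper name `instance_matcher` and the sample selection are hypothetical.

```python
from typing import Iterable


def instance_matcher(instances: Iterable[str]) -> str:
    """Build a PromQL label matcher selecting exactly the given instances."""
    # A literal dot must be escaped in the regex as "\.". Inside a
    # double-quoted PromQL string the backslash itself is escaped too,
    # so each dot ends up written as "\\.".
    escaped = [inst.replace(".", "\\\\.") for inst in instances]
    return 'instance=~"(' + "|".join(escaped) + ')"'


# Hypothetical selection of two scrape targets.
selected = ["10.240.0.5:9093", "10.240.0.94:9090"]
matcher = instance_matcher(selected)
```

Escaping the dots keeps an address such as `10.240.0.5` from also matching unrelated strings where any other character stands in place of a dot.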

Panels

Pods CPU usage

Time series with CPU usage rate, aggregated over one minute. It also shows the current minimum, maximum and average CPU usage.

The x-axis shows the time series and the y-axis shows the CPU usage rate.

The queries used to get the panel information are:

sum(rate(container_cpu_usage_seconds_total{pod_name!="",pod_name=~"prometheus.*"}[1m])) by (pod_name)

An example of this panel is shown below:

Pods memory usage

Time series with memory usage. It also shows the current minimum, maximum and average memory usage.

The x-axis shows the time series and the y-axis shows the memory usage.

The queries used to get the panel information are:

sum (container_memory_working_set_bytes{pod_name!="",pod_name=~"prometheus.*"}) by (pod_name)

An example of this panel is shown below:

Pods network I/O

Time series with the network I/O average aggregated by one minute. It also shows the current minimum, maximum and average network I/O bytes.

The x-axis shows the time series and the y-axis shows the network I/O.

The queries used to get the panel information are:

sum (rate (container_network_receive_bytes_total{pod_name!="",pod_name=~"prometheus.*"}[1m])) by (pod_name)
- sum (rate (container_network_transmit_bytes_total{pod_name!="",pod_name=~"prometheus.*"}[1m])) by (pod_name)

An example of this panel is shown below:

Uptime

Percentage of uptime for the last hour.

The queries used to get the panel information are:

avg(avg_over_time(up{instance=~"(10\\.240\\.0\\.10:9093|10\\.240\\.3\\.161:9093|10\\.240\\.0\\.34:9114|10\\.240\\.0\\.253:8080|10\\.240\\.3\\.205:9090|10\\.240\\.3\\.236:9090|10\\.240\\.4\\.14:9090|10\\.240\\.4\\.156:9121|10\\.240\\.4\\.186:9121|10\\.240\\.4\\.223:9121)",job=~"kubernetes-service-endpoints"}[1h]) * 100)
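The uptime query above first averages each instance's `up` samples over the last hour, scales the result to a percentage, and then averages across instances. A minimal Python sketch of the same two-step averaging follows; the function name `uptime_percent` and the sample data are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def uptime_percent(up_samples: Dict[str, List[int]]) -> float:
    """Mean across instances of each instance's mean `up` value, times 100."""
    per_instance = [mean(values) * 100 for values in up_samples.values() if values]
    if not per_instance:
        raise ValueError("no samples for any instance")
    return mean(per_instance)


# Hypothetical `up` samples (1 = scrape succeeded, 0 = scrape failed).
last_hour = {
    "10.240.0.10:9093": [1, 1, 1, 1],
    "10.240.0.34:9114": [1, 0, 1, 1],
}
uptime = uptime_percent(last_hour)  # average of 100 % and 75 %
```

Because the average is taken per instance first, every instance weighs the same in the final figure regardless of how many samples it contributed.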

An example of this panel is shown below:

Currently down

Currently down instances.

The queries used to get the panel information are:

up{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)",job=~"kubernetes-service-endpoints"} < 1

An example of this panel is shown below:

Total series

Total series count.

The queries used to get the panel information are:

sum(prometheus_tsdb_head_series{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})

An example of this panel is shown below:

Memory chunks

Memory chunks being used.

The queries used to get the panel information are:

sum(prometheus_tsdb_head_chunks{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})

An example of this panel is shown below:

Quick numbers

Quick numbers section shows a series of Prometheus indicators.

Missed iterations

Number of missed iterations, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_evaluator_iterations_missed_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Skipped iterations

Number of skipped iterations, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_evaluator_iterations_skipped_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Tardy scrapes

Number of scrapes that took longer than expected, aggregated over one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_target_scrapes_exceeded_sample_limit_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Reload failures

Number of reload failures, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_tsdb_reloads_failures_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Skipped scrapes

Number of scrapes that were not completed, for any of several reasons, aggregated over one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_target_scrapes_exceeded_sample_limit_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) +
sum(sum_over_time(prometheus_target_scrapes_sample_duplicate_timestamp_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) + 
sum(sum_over_time(prometheus_target_scrapes_sample_out_of_bounds_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) + 
sum(sum_over_time(prometheus_target_scrapes_sample_out_of_order_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

An example of this panel is shown below:

Failures and errors

Time series with the number of several different errors and failures, aggregated by five minutes.

The x-axis shows the time series and the y-axis shows the number of occurrences of each of the following errors and failures:

Dialer connection errors.
Evaluator iterations missed.
Evaluator iterations skipped.
Evaluation failures.
Azure refresh failures.
Consul RPC failures.
DNS lookup failures.
EC2 refresh failures.
GCE refresh failures.
Marathon refresh failures.
OpenStack refresh failures.
Triton refresh failures.
Scrapes exceeded sample limit.
Scrapes sample duplicate timestamp.
Scrapes sample out of bounds.
Scrapes sample out of order.
Treecache ZooKeeper failures.
TSDB compactions failed.
TSDB head series not found.
TSDB reloads failures.

The queries used to get the panel information are:

sum(increase(net_conntrack_dialer_conn_failed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_evaluator_iterations_missed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_evaluator_iterations_skipped_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_rule_evaluation_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_azure_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_consul_rpc_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_dns_lookup_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_ec2_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_gce_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_marathon_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_openstack_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_triton_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_exceeded_sample_limit_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_sample_duplicate_timestamp_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_sample_out_of_bounds_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_sample_out_of_order_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_treecache_zookeeper_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_tsdb_compactions_failed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_tsdb_head_series_not_found{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_tsdb_reloads_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
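Each query above ends with `> 0`, which drops any series whose counter did not grow during the five-minute window, so only errors that actually occurred are drawn. The following Python sketch illustrates that filtering on two snapshots of some counters. The function name `nonzero_increases`, the shortened counter names and the values are hypothetical, and counter resets are ignored for brevity.

```python
from typing import Dict


def nonzero_increases(start: Dict[str, float], end: Dict[str, float]) -> Dict[str, float]:
    """Increase of each counter between two instants, keeping only values > 0."""
    result = {}
    for name, end_value in end.items():
        delta = end_value - start.get(name, 0.0)
        if delta > 0:  # mirrors the trailing `> 0` comparison in the queries
            result[name] = delta
    return result


# Hypothetical counter values at the start and end of a five-minute window.
at_start = {"dns_lookup_failures": 4, "tsdb_compactions_failed": 0, "rule_evaluation_failures": 7}
at_end = {"dns_lookup_failures": 6, "tsdb_compactions_failed": 0, "rule_evaluation_failures": 7}

visible = nonzero_increases(at_start, at_end)
```

Only the DNS lookup failures would appear in this example, since the other two counters did not change.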

An example of this panel is shown below:

Upness (stacked)

Time series showing the up/down state of each service over time. The values are shown stacked.

The x-axis shows the time series and the y-axis shows the upness state of the different services.

The queries used to get the panel information are:

up{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)",job=~"kubernetes-service-endpoints"}

An example of this panel is shown below:

Storage memory chunks

Time series with the number of memory chunks used.

The x-axis shows the time series and the y-axis shows the number of memory chunks.

The queries used to get the panel information are:

prometheus_tsdb_head_chunks{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

Series count

Time series with the number of tsdb series.

The x-axis shows the time series and the y-axis shows the number of series.

The queries used to get the panel information are:

prometheus_tsdb_head_series{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

Series created/removed

Time series with the number of tsdb series created/removed.

The x-axis shows the time series and the y-axis shows the number of series created/removed, aggregated by 5 minutes.

The queries used to get the panel information are:

sum( increase(prometheus_tsdb_head_series_created_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]) )
sum( increase(prometheus_tsdb_head_series_removed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]) )

An example of this panel is shown below:

Appended samples per second

Time series with the number of samples per second appended by Prometheus.

The x-axis shows the time series and the y-axis shows the number of samples per second appended by Prometheus.

The queries used to get the panel information are:

rate(prometheus_tsdb_head_samples_appended_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1m])

An example of this panel is shown below:

Scrape Sync total

Time series with the total number of syncs that were executed on a scrape pool.

The x-axis shows the time series and the y-axis shows the total number of syncs that were executed on a scrape pool.

The queries used to get the panel information are:

sum(prometheus_target_scrape_pool_sync_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) by (scrape_job)

An example of this panel is shown below:

Target sync

Time series with the interval to sync the scrape pool.

The x-axis shows the time series and the y-axis shows the interval to sync the scrape pool.

The queries used to get the panel information are:

sum(rate(prometheus_target_sync_length_seconds_sum{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[2m])) by (scrape_job) * 1000

An example of this panel is shown below:

Scrape duration

Time series with the scrape duration in seconds.

The x-axis shows the time series and the y-axis shows the scrape duration in seconds.

The queries used to get the panel information are:

scrape_duration_seconds{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

Rejected scrapes

Time series with the rejected scrapes.

The x-axis shows the time series and the y-axis shows the rejected scrapes for several reasons:

Total number of scrapes that hit the sample limit and were rejected.
Total number of scrapes samples duplicated.
Total number of scrapes samples out of bounds.
Total number of scrapes samples out of order.

The queries used to get the panel information are:

sum(prometheus_target_scrapes_exceeded_sample_limit_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})
sum(prometheus_target_scrapes_sample_duplicate_timestamp_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})
sum(prometheus_target_scrapes_sample_out_of_bounds_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})
sum(prometheus_target_scrapes_sample_out_of_order_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})

An example of this panel is shown below:

Average rule evaluation duration

Time series with the average duration of rule group evaluations, aggregated over five minutes.

The x-axis shows the time series and the y-axis shows the average duration of rule group evaluations, in milliseconds.

The queries used to get the panel information are:

1000 * rate(prometheus_evaluator_duration_seconds_sum{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]) / 
rate(prometheus_evaluator_duration_seconds_count{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])
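The query above divides the rate of the `_sum` series by the rate of the `_count` series and multiplies by 1000, which gives the average duration per evaluation in milliseconds (the window length cancels out in the division). A small Python sketch of that arithmetic follows; the function name `average_duration_ms` and the counter values are hypothetical.

```python
from typing import Optional


def average_duration_ms(sum_start: float, sum_end: float,
                        count_start: float, count_end: float) -> Optional[float]:
    """Average observation duration in milliseconds between two instants."""
    observations = count_end - count_start
    if observations <= 0:
        return None  # no evaluations in the window, so no average exists
    seconds_spent = sum_end - sum_start
    return 1000 * seconds_spent / observations


# Hypothetical counters: 1.5 s spent on 30 evaluations during the window.
avg_ms = average_duration_ms(sum_start=12.0, sum_end=13.5, count_start=400, count_end=430)
```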

An example of this panel is shown below:

HTTP request duration

Time series with the HTTP request duration, aggregated by one minute.

The x-axis shows the time series and the y-axis shows the HTTP request duration.

The queries used to get the panel information are:

sum(rate(http_request_duration_microseconds_count{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1m])) by (handler) > 0

An example of this panel is shown below:

Prometheus engine query duration seconds

Time series with the engine query duration in seconds.

The x-axis shows the time series and the y-axis shows the engine query duration.

The queries used to get the panel information are:

sum(prometheus_engine_query_duration_seconds_sum{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) by (slice)

An example of this panel is shown below:

Rule evaluator iterations

Time series with the number of scheduled rule group evaluations, whether executed, missed or skipped.

The x-axis shows the time series and the y-axis shows the number of scheduled rule group evaluations.

The queries used to get the panel information are:

sum(rate(prometheus_evaluator_iterations_total{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]))
sum(rate(prometheus_evaluator_iterations_missed_total{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]))
sum(rate(prometheus_evaluator_iterations_skipped_total{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]))

An example of this panel is shown below:

Notifications sent

Time series with the rate of sent notifications, aggregated by 5 minutes.

The x-axis shows the time series and the y-axis shows the rate of sent notifications.

The queries used to get the panel information are:

rate(prometheus_notifications_sent_total[5m])

An example of this panel is shown below:

Minutes since successful config reload

Time series with the number of minutes since the last successful config reload.

The x-axis shows the time series and the y-axis shows the number of minutes since the last successful reload.

The queries used to get the panel information are:

(time() - prometheus_config_last_reload_success_timestamp_seconds{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) / 60

An example of this panel is shown below:

Successful config reload

Time series showing whether the last configuration reload attempt succeeded (1) or failed (0).

The x-axis shows the time series and the y-axis shows the result of the last reload attempt (1 for success, 0 for failure).

The queries used to get the panel information are:

prometheus_config_last_reload_successful{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

GC rate

Time series with the rate of time spent in GC invocations, computed over two-minute windows.

The x-axis shows the time series and the y-axis shows the rate of GC invocation durations.

The queries used to get the panel information are:

sum(rate(go_gc_duration_seconds_sum{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)",job=~"kubernetes-service-endpoints"}[2m])) by (instance)

An example of this panel is shown below:

6.1.11 - Redis dashboard

Redis dashboard

Information provided by Redis dashboard

Introduction

Redis dashboard monitors multiple data and service-related metrics.

The available metrics are defined in the following sections.

Redis uptime

Uptime graph shows time since last restart/shutdown.

Metrics:

max(max_over_time(redis_uptime_in_seconds{kubernetes_name=~"redis-announce-0"}[$__interval]))

Graph visual

Redis clients

Clients graph shows number of connected clients.

Metrics:

redis_connected_clients{kubernetes_name=~"redis-announce-0"}

Graph visual

Redis memory usage

Memory usage graph shows percentage of used memory.

Metrics:

100 * (redis_memory_used_bytes{kubernetes_name=~"redis-announce-0"}  / redis_memory_max_bytes{kubernetes_name=~"redis-announce-0"} )

Graph visual
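
The percentage in this panel is a plain ratio of two gauges. The following Python sketch reproduces the arithmetic for a single pair of samples. The function name and the sample values are illustrative and not part of Aura. It assumes that the exporter reports 0 as the maximum when no memory limit is configured in Redis, in which case the division yields +Inf (or NaN for 0/0) instead of a percentage.

```python
import math


def memory_used_percent(used_bytes: float, max_bytes: float) -> float:
    """Mirror 100 * (used / max) for one pair of samples."""
    if max_bytes == 0:
        # Floating-point division by zero: +Inf for a positive numerator, NaN for 0/0.
        return math.nan if used_bytes == 0 else math.inf
    return 100 * used_bytes / max_bytes


print(memory_used_percent(512, 2048))  # 25.0
print(memory_used_percent(512, 0))     # inf
```

If the panel shows no usable value, checking whether a maximum memory is configured for the Redis instance is a reasonable first step.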

Redis commands executed per second

Commands executed per second graph shows the rate of commands executed per second, aggregated by one minute.

Metrics:

rate(redis_commands_processed_total{kubernetes_name=~"redis-announce-0"}[1m])

Graph visual
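
Many panels in these dashboards apply rate() to a counter. As a rough illustration of what that function does, the Python sketch below computes a per-second increase from raw counter samples and treats any drop in value as a counter reset. It is a simplification: the real rate() also extrapolates to the edges of the window, and the function name and sample data here are made up for the example.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A lower value means the counter restarted from zero,
        # so the whole new value counts as increase.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# One-minute window scraped every 15 seconds, with a restart before t=45.
window = [(0, 100), (15, 160), (30, 220), (45, 10), (60, 70)]
print(simple_rate(window))
```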

Redis hits/misses per second

Hits/misses per second graph shows the per-second rate of keyspace hits and misses, computed over a five-minute window.

Metrics: It includes two metrics:

Hits metrics

irate(redis_keyspace_hits_total{kubernetes_name=~"redis-announce-0"}[5m])

Misses metrics

irate(redis_keyspace_misses_total{kubernetes_name=~"redis-announce-0"}[5m])

Graph visual
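
The two series of this panel are often combined into a cache hit ratio. This ratio is not part of the dashboard; the Python sketch below only shows how it could be derived from a hits rate and a misses rate, with illustrative names and values.

```python
def hit_ratio(hits_per_second: float, misses_per_second: float):
    """Fraction of keyspace lookups that found a key, or None without traffic."""
    total = hits_per_second + misses_per_second
    return hits_per_second / total if total > 0 else None


print(hit_ratio(90.0, 10.0))  # 0.9
```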

Redis total memory usage

Total memory usage graph shows total memory usage and total memory free + used.

Metrics: It includes two metrics:

Used memory

redis_memory_used_bytes{kubernetes_name=~"redis-announce-0"}

Max memory

redis_memory_max_bytes{kubernetes_name=~"redis-announce-0"}

Graph visual

Redis network I/O

Network I/O graph shows rate of total in/out bytes, aggregated by 5 minutes.

Metrics: It includes two metrics:

In bytes

rate(redis_net_input_bytes_total{kubernetes_name=~"redis-announce-0"}[5m])

Out bytes

rate(redis_net_output_bytes_total{kubernetes_name=~"redis-announce-0"}[5m])

Graph visual

Redis total items per DB

Total items per DB graph shows total number of items separated by db number.

Metrics:

sum (redis_db_keys{kubernetes_name=~"redis-announce-0"}) by (db) > 0

Graph visual

Redis expiring vs not-expiring keys

Expiring vs not-expiring keys graph shows total number of expiring and not expiring keys.

Metrics: It includes two metrics:

Not-expiring keys.

sum (redis_db_keys{kubernetes_name=~"redis-announce-0"}) - sum (redis_db_keys_expiring{kubernetes_name=~"redis-announce-0"})

Expiring keys

sum (redis_db_keys_expiring{kubernetes_name=~"redis-announce-0"})

Graph visual

Redis expired/evicted

Expired/evicted graph shows total number of expired and evicted keys, aggregated by 5 minutes.

Metrics: It includes two metrics:

Expired keys.

sum(rate(redis_expired_keys_total{kubernetes_name=~"redis-announce-0"}[5m])) by (kubernetes_name)

Evicted keys

sum(rate(redis_evicted_keys_total{kubernetes_name=~"redis-announce-0"}[5m])) by (kubernetes_name)

Graph visual

Redis command calls per second

Command calls per second graph shows the execution rate of the five most-called commands, computed over a one-minute window.

Metrics:

topk(5, irate(redis_commands_total{kubernetes_name=~"redis-announce-0"} [1m]))

Graph visual
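
The topk(5, ...) wrapper keeps only the five series with the highest value at each evaluation step. A minimal Python sketch of that selection follows; the command names and rates are invented for the example.

```python
import heapq


def topk(k, series):
    """Return the k (name, value) pairs with the highest value, highest first."""
    return heapq.nlargest(k, series.items(), key=lambda item: item[1])


# Hypothetical per-second rates for one evaluation step.
rates = {"get": 120.0, "set": 45.5, "expire": 30.0, "ping": 2.0, "hgetall": 80.0, "del": 1.5}
for command, per_second in topk(5, rates):
    print(command, per_second)
```

Because the selection is repeated at every step, a graph over a long time range can show more than five distinct commands.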

6.2 - Aura components dashboards

Aura components dashboards

Grafana dashboards with metrics related to the performance of specific Aura components

Introduction

Currently, these are the available dashboards for Aura components in Grafana based on metrics stored in Prometheus:

6.2.1 - Aura bot latencies dashboard

Aura bot latencies dashboard

Information provided by Aura bot latencies dashboard

Introduction

Aura bot latencies dashboard monitors outbound and inbound latencies on the request and responses handled directly by aura-bot.

The available metrics are defined in the following sections, corresponding to request errors and latency for requests, Microsoft APIs, Kernel APIs, Cognitive APIs, aura-services APIs and other APIs.

Request error

Request error graph shows the rate of outgoing requests that returned an error status (4xx or 500), aggregated by one minute.

Graph metrics

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",status=~"4..|500"}[1m]))

Graph visual
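
Regular expression matchers in Prometheus are anchored at both ends, so the status filter decides exactly which codes count as errors. The Python sketch below uses re.fullmatch to imitate that anchoring and compares the pattern of this panel with a broader one. The list of status codes is illustrative.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Imitate a fully anchored regex label matcher."""
    return re.fullmatch(pattern, value) is not None


codes = ["200", "404", "429", "500", "502", "503"]
print([c for c in codes if matches("4..|500", c)])  # ['404', '429', '500']
print([c for c in codes if matches("4..|5..", c)])  # ['404', '429', '500', '502', '503']
```

With the narrower pattern, responses such as 502 or 503 are not counted as errors.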

Request latency

Request latency graph shows latency rate for outgoing traffic, aggregated by one minute.

Graph metrics

sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app="aura-bot"}[1m]))

Graph visual

Microsoft APIs latency

Microsoft APIs latency graph shows mean latency rate for the different Microsoft APIs used.

Graph metrics

Currently, there are three monitored Microsoft endpoints:

Direct Line endpoint

sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"directline.botframework.com"}[1m]))/
sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"directline.botframework.com"}[1m]))

Microsoft auth endpoint

sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"login.microsoftonline.com"}[1m]))/
sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"login.microsoftonline.com"}[1m]))

Blob storage endpoint

sum (label_replace(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"aura.*blob.core.windows.net",path=~"/aura-temporary-resources/.*"},"path_set","$1","path","/aura-temporary-resources/.*")) by (path_set,kubernetes_namespace) / 
sum (label_replace(outgoing_request_duration_seconds_count{app="aura-bot",host=~"aura.*blob.core.windows.net",path=~"/aura-temporary-resources/.*"},"path_set","$1","path","/aura-temporary-resources/.*")) by (path_set,kubernetes_namespace)

Graph visual
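
The latency queries in this dashboard divide the rate of a _sum series by the rate of the matching _count series. Since both rates cover the same window, the result is the mean duration of the requests observed in that window. The Python sketch below shows the same arithmetic on two pairs of counter readings; the function name and the values are illustrative.

```python
def mean_latency(sum_start, sum_end, count_start, count_end):
    """Mean request duration (seconds) between two readings of _sum and _count."""
    requests = count_end - count_start
    if requests <= 0:
        return None  # no requests in the window, so no mean can be computed
    return (sum_end - sum_start) / requests


# 20 requests took 3 seconds in total during the window.
print(mean_latency(12.0, 15.0, 100, 120))  # 0.15
```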

Kernel APIs latency

Kernel APIs latency graph shows mean latency rate for the different Kernel APIs used.

Graph metrics

Currently, there are four monitored Kernel endpoints (more can be added if necessary for a given environment):

Kernel auth endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"auth.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"auth.*"}[1m]))

Kernel subscribed products endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api.*",path=~"/subscribed_products/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api.*",path=~"/subscribed_products/.*"}[1m]))

Kernel user profile endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api.*",path=~"/userprofile/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api.*",path=~"/userprofile/.*"}[1m]))

Kernel invoicing endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api.*",path=~"/invoicing/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api.*",path=~"/invoicing/.*"}[1m]))

Graph visual

Cognitive APIs latency

Cognitive APIs latency graph shows mean latency rate for the different cognitive APIs used.

Graph metrics

Currently, there are three monitored Cognitive endpoints:

Domain classifier endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/domain_classifier/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/domain_classifier/.*"}[1m]))

Mplus resolution endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/mplus_resolution/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/mplus_resolution/.*"}[1m]))

Suggestions endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/suggestions/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/suggestions/.*"}[1m]))

Graph visual

Aura-services APIs latency

Graph metrics

sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app="aura-bot", path=~"/aura-services/.*"}[1m]))/
sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_count{app="aura-bot", path=~"/aura-services/.*"}[1m]))

Graph visual

Other APIs latency

Other APIs latency graph shows mean latency rate for traffic directed to other APIs different from those above, aggregated by one minute.

Graph metrics

Currently, the only API monitored is Genesys API:

sum (label_replace(outgoing_request_duration_seconds_sum{app="aura-bot",path=~"/genesys/.*"},"path_set","$1","path","/genesys/.*")) by (path_set,kubernetes_namespace) / sum (label_replace(outgoing_request_duration_seconds_count{app="aura-bot",path=~"/genesys/.*"},"path_set","$1","path","/genesys/.*")) by (path_set,kubernetes_namespace)

Graph visual
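
Several queries in this dashboard use label_replace to collapse many paths into a single path_set label. The Python sketch below approximates how that function treats one series, assuming the documented behavior: the regular expression is anchored, a series that does not match is returned unchanged, and a reference to a capture group that does not exist expands to an empty string, which leaves the destination label unset. Only the $N form of group references is handled here, and the sample path is hypothetical.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Approximate PromQL label_replace for the label set of one series."""
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: labels are returned unchanged

    def expand(ref):
        index = int(ref.group(1))
        # Groups that do not exist or did not participate expand to "".
        return (match.group(index) or "") if index <= match.re.groups else ""

    value = re.sub(r"\$(\d+)", expand, replacement)
    result = dict(labels)
    if value:
        result[dst] = value
    else:
        result.pop(dst, None)  # an empty value is the same as no label
    return result


series = {"path": "/genesys/v1/chats/123"}
print(label_replace(series, "path_set", "$1", "path", "(/genesys)/.*"))
print(label_replace(series, "path_set", "$1", "path", "/genesys/.*"))
```

Under these assumptions, a pattern without a capture group yields no path_set label, so grouping by path_set merges all matching paths of a namespace into one series.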

Service API

Service API graph shows the mean latency of the main endpoint on aura-bot, which receives requests from Direct Line and aura-bridge, aggregated by one minute.

Graph metrics

sum by (path,kubernetes_namespace)(rate(http_request_duration_seconds_sum{path=~"/api/messages"}[1m]))/
sum by (path,kubernetes_namespace)(rate(http_request_duration_seconds_count{path=~"/api/messages"}[1m]))

Graph visual

6.2.2 - Aura bridge dashboard

Aura bridge dashboard

Information provided by Aura bridge dashboard

Aura bridge ack success

Ack success graph shows the rate of successful acks, aggregated by three minutes.

The available metrics are defined in the following sections.

Graph metrics

sum by (origin,originStatus)(rate(aura_response_ack_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus="200"}[3m]))

Graph visual

Aura bridge ack error

Ack error graph shows the rate of acks with an error status, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(aura_response_ack_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus!="200"}[3m]))

Graph visual

Aura bridge message success

Message success graph shows the rate of successful messages, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus="200"}[3m]))

Graph visual

Aura bridge message error

Message error graph shows the rate of erroneous messages, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus!="200"}[3m]))

Graph visual

Aura bridge bot message error

Bot message error graph shows the rate of erroneous messages sent by aura-bot, that is, the errors that aura-bridge receives from aura-bot, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot",originStatus!="200"}[3m]))

Graph visual

Aura bridge message - Kernel internal error

Kernel internal error graph shows the rate of erroneous messages sent by Kernel, regardless of the error type, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"4p",originStatus!="200"}[3m]))

Graph visual

Aura bridge message - Kernel HTTP error

Kernel HTTP error graph shows the rate of erroneous messages sent by Kernel, filtered by HTTP client errors (4xx) and aggregated by three minutes.

Graph metrics

sum by (origin,httpStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"4p",httpStatus=~"4.."}[3m]))

Graph visual

6.2.3 - Authentication API dashboard

Aura authentication API dashboard

Information provided by Authentication API dashboard

Aura services latency

Aura services latency graph shows mean latency rate for the different incoming calls.

The available metrics are defined in the following sections.

Graph metrics

Currently, these are the existing monitored incoming calls:

WhatsApp users’ retrieval

sum by (kubernetes_namespace,path)(rate(http_request_duration_seconds_sum{path=~"/aura-services/v1/users/whatsapp.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(http_request_duration_seconds_count{path=~"/aura-services/v1/users/whatsapp.*"}[1m]))

Get or create user

sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/aura-id"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/aura-id"})

Get or create user

sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/aura-id/.*"},"path_set","$1","path","/aura-services/v1/users/aura-id/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/aura-id/.*"},"path_set","$1","path","/aura-services/v1/users/aura-id/.*")) by (kubernetes_namespace,path_set)

Retrieves an Aura user by the given auraIdGlobal and the channelId

sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/aura-id-global/.*"},"path_set","$1","path","/aura-services/v1/users/aura-id-global/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/aura-id-global/.*"},"path_set","$1","path","/aura-services/v1/users/aura-id-global/.*")) by (kubernetes_namespace,path_set)

Gets given authorization and identification information to register the user

sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/auraid/integrated/.*"},"path_set","$1","path","/aura-services/v1/users/auraid/integrated/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/auraid/integrated/.*"},"path_set","$1","path","/aura-services/v1/users/auraid/integrated/.*")) by (kubernetes_namespace,path_set)

OpenID logout

sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/auraid/integrated/logout"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/auraid/integrated/logout"})

New Direct Line token

sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/token"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/token"})

New Direct Line token (wss)

sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/token/wss"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/token/wss"})

JWT URI retrieval

sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/openid/issuer/.well-known/openid-configuration"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/openid/issuer/.well-known/openid-configuration"})

JWT token retrieval

sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/openid/jwk"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/openid/jwk"})

Get or create user

sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/users/aura-id/.*"},"path_set","$1","path","/users/aura-id/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/users/aura-id/.*"},"path_set","$1","path","/users/aura-id/.*")) by (kubernetes_namespace,path_set)

User by phone number

sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/admin/users/phone-numbers/.*"},"path_set","$1","path","/aura-services/v1/admin/users/phone-numbers/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/admin/users/phone-numbers/.*"},"path_set","$1","path","/aura-services/v1/admin/users/phone-numbers/.*")) by (kubernetes_namespace,path_set)

Graph visual

Request out error

Request out error graph shows error rate for outgoing requests with HTTP codes 4xx and 5xx, aggregated by 1 minute.

Graph metrics

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",status=~"4..|5.."}[1m]))

Graph visual

Microsoft APIs latency

Microsoft APIs latency graph shows mean latency rate for the different Microsoft APIs used.

Graph metrics

Currently, there are three monitored Microsoft endpoints:

Direct Line endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"directline.botframework.com"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"directline.botframework.com"}[1m]))

Microsoft auth endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"login.microsoftonline.com"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"login.microsoftonline.com"}[1m]))

Blob storage endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"aura.*.blob.core.windows.net"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"aura.*.blob.core.windows.net"}[1m]))

Graph visual

Kernel APIs latency

Kernel APIs latency graph shows mean latency rate for the different Kernel APIs used.

Graph metrics

Currently, there are three monitored Kernel endpoints:

Kernel token retrieval endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"auth.*",path="/token"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"auth.*",path="/token"}[1m]))

Kernel token introspection endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"api.*",path=~"/token-introspection/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"api.*",path=~"/token-introspection/.*"}[1m]))

Kernel open-id configuration endpoint

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"auth.*",path="/.well-known/openid-configuration"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"auth.*",path="/.well-known/openid-configuration"}[1m]))

Graph visual

6.2.4 - Aura HTTP Inbound dashboard

Aura HTTP Inbound dashboard

Information provided by Aura HTTP inbound dashboard

Introduction

HTTP inbound dashboard monitors inbound traffic to different services.

This inbound traffic can be visualized by channel, which gives a detailed view of the traffic arriving at each channel. This helps to tune strategies for a given channel and to compare performance between channels.

The available metrics are defined in the following sections.

HTTP request latency

HTTP request latency graph shows mean latency time aggregated by one minute.

Graph metrics

sum by (app, kubernetes_namespace)(rate(http_request_duration_seconds_sum{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)'}[1m])) /
sum by (app, kubernetes_namespace)(rate(http_request_duration_seconds_count{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)'}[1m]))

Graph visual

HTTP Request Rate

HTTP requests rate graph shows number of requests aggregated by one minute.

Graph metrics

sum by (app, kubernetes_namespace)  (rate(http_request_duration_seconds_count{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)'}[1m]))

Graph visual
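
Any query in this section can also be evaluated outside Grafana through the Prometheus HTTP API, whose instant query endpoint is /api/v1/query. The Python sketch below only builds the request URL and does not send it. The server address is a placeholder, and the application filter is shortened for readability.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

PROMETHEUS_URL = "http://prometheus.example:9090"  # hypothetical address


def instant_query_url(promql: str, base: str = PROMETHEUS_URL) -> str:
    """Build the URL of an instant query, URL-encoding the PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


query = (
    "sum by (app, kubernetes_namespace) "
    "(rate(http_request_duration_seconds_count{app=~'(aura-bot|aura-bridge)'}[1m]))"
)
url = instant_query_url(query)
print(url)
# Decoding the query string gives back the original expression.
print(parse_qs(urlsplit(url).query)["query"][0] == query)  # True
```

The resulting URL can be requested with any HTTP client that has network access to the Prometheus server of the environment.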

HTTP request latency

HTTP request latency graph shows request latency aggregated by one minute.

Graph metrics

sum by (app, kubernetes_namespace)  (rate(http_request_duration_seconds_sum{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)'}[1m]))

Graph visual

HTTP error rate

HTTP error rate shows the rate of request errors, aggregated by one minute.

Graph metrics

sum by (app, kubernetes_namespace)  (rate(http_request_duration_seconds_count{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)',status_code=~"4..|5.."}[1m]))

Graph visual

Errors

Errors graph shows the rate of requests with a 4xx or 5xx status, aggregated by one minute.

Graph metrics

sum(rate(http_request_duration_seconds_count{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)',status_code=~"4..|5.."}[1m])) by (app, kubernetes_namespace)

Graph visual

6.2.5 - Aura HTTP Outbound dashboard

Aura HTTP Outbound dashboard

Information provided by Aura HTTP outbound dashboard

Introduction

HTTP outbound dashboard monitors outbound traffic to different services.

This outbound traffic can be visualized by channel, which gives a detailed view of the traffic leaving each channel. This helps to tune strategies for a given channel and to compare performance between channels.

The available metrics are defined in the following sections.

HTTP request latency

HTTP request latency graph shows mean latency time aggregated by one minute.

Graph metrics

sum by (app,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app=~'.*'}[1m])) / sum by (app,kubernetes_namespace)(rate(outgoing_request_duration_seconds_count{app=~'.*'}[1m]))

Graph visual

HTTP request rate

HTTP requests rate graph shows requests rate per second, aggregated by one minute.

Graph metrics

sum by (app,kubernetes_namespace) (rate(outgoing_request_duration_seconds_count{app=~'.*'}[1m]))

Graph visual

HTTP request latency

HTTP request latency graph shows request latency rate per second, aggregated by one minute.

Graph metrics

sum by (app,kubernetes_namespace)  (rate(outgoing_request_duration_seconds_sum{app=~'.*'}[1m]))

Graph visual

HTTP error rate

HTTP error rate shows the rate of request errors per second, aggregated by one minute.

Graph metrics

sum by (app,kubernetes_namespace)  (rate(outgoing_request_duration_seconds_count{app=~'.*',status=~"4..|5.."}[1m]))

Graph visual

Errors

Errors graph shows the rate of requests with a 4xx or 5xx status, aggregated by one minute.

Graph metrics

sum(rate(outgoing_request_duration_seconds_count{app=~'.*',status=~"4..|5.."}[1m])) by (app,kubernetes_namespace)

Graph visual

Aura bot backend latency

aura-bot backend latency shows mean latency rate on aura-bot backend, aggregated by one minute.

Graph metrics

sum(rate(outgoing_request_duration_seconds_sum{app=~"aura-bot"}[1m])) by (path,kubernetes_namespace)/sum(rate(outgoing_request_duration_seconds_count{app=~"aura-bot"}[1m])) by (path,kubernetes_namespace)

Graph visual

Authentication API backend latency

aura-authentication-api backend latency shows mean latency rate on aura-authentication-api backend, aggregated by one minute.

Graph metrics

sum(rate(outgoing_request_duration_seconds_sum{app=~"authentication-api"}[1m])) by (path,kubernetes_namespace)/sum(rate(outgoing_request_duration_seconds_count{app=~"authentication-api"}[1m])) by (path,kubernetes_namespace)

Graph visual

Aura bridge backend latency

aura-bridge backend latency shows mean latency rate on aura-bridge backend, aggregated by one minute.

Graph metrics

sum(rate(outgoing_request_duration_seconds_sum{app=~"aura-bridge"}[1m])) by (path,kubernetes_namespace)/sum(rate(outgoing_request_duration_seconds_count{app=~"aura-bridge"}[1m])) by (path,kubernetes_namespace)

Graph visual

6.2.6 - Pod resources dashboard

Pod resources dashboard

Information provided by Pod resources dashboard

Introduction

This dashboard gives the most basic information about the behavior of the pods in the environment.

To get the information about each pod, the dashboard provides a filter with the following fields:

namespace: list of all the available namespaces of your deployment.
pod: list of pods running in the selected namespace.
container: list of containers running in the selected pod.
DS_PROMETHEUS: Prometheus data source to be used. By default, Prometheus.

Once the filter values are selected, the following graphs are displayed with the data of the pod.

Panels

Pod memory

Pod memory panel shows a time series with the current memory consumption in the selected pod. It also shows the current, maximum, minimum and average memory consumption of the Pod.

The x-axis shows the time series and the y-axis shows the amount of memory consumed by the pod.

The queries used to get the panel information are:

sum(kube_pod_container_resource_requests_memory_bytes{namespace="aura-<env>",pod="aura-bot-<id>"})
sum(kube_pod_container_resource_limits_memory_bytes{namespace="aura-<env>",pod="aura-bot-<id>"})
sum(container_memory_working_set_bytes{namespace="aura-<env>",container!="POD",container!="",pod!="", pod="aura-bot-<id>"})

An example of this panel is shown below:

Container memory

Container memory panel shows a time series with the current memory consumption of the selected container. It also shows the current, maximum, minimum and average memory consumption of the container.

The x-axis shows the time series and the y-axis shows the amount of memory consumed by the container.

The queries used to get the panel information are:

sum(kube_pod_container_resource_requests_memory_bytes{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"})
sum(kube_pod_container_resource_limits_memory_bytes{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"})
sum(container_memory_working_set_bytes{namespace="aura-<env>",container!="POD",container!="",pod!="", pod="aura-bot-<id>",container="aura-bot"}) by (container)

An example of this panel is shown below:

Pod network

Pod network panel shows a time series with the current I/O network consumption of the selected pod. It also shows the current, maximum, minimum and average network consumption of the pod.

The x-axis shows the time series and the y-axis shows the amount of bytes consumed by the pod.

The queries used to get the panel information are:

sum(rate(container_network_receive_bytes_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m]))
sum(rate(container_network_transmit_bytes_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m]))

An example of this panel is shown below:

Pod CPU

Pod CPU panel shows a time series with the current CPU consumption of the selected pod. It also shows the current, maximum, minimum and average CPU consumption of the pod.

The x-axis shows the time series and the y-axis shows the percentage of CPU used by the pod.

The queries used to get the panel information are:

sum(kube_pod_container_resource_requests_cpu_cores{namespace="aura-<env>",pod="aura-bot-<id>"})
sum(kube_pod_container_resource_limits_cpu_cores{namespace="aura-<env>",pod="aura-bot-<id>"})
sum(rate(container_cpu_usage_seconds_total{namespace="aura-<env>",container!="POD",container!="",pod!="", pod="aura-bot-<id>"}[1m]))

An example of this panel is shown below:

Container CPU

Container CPU panel shows a time series with the current CPU usage of the selected container within the pod. It also shows the current, maximum, minimum and average CPU usage of the container.

The x-axis shows the time series and the y-axis shows the percentage of CPU used by the container.

The queries used to get the panel information are:

sum(kube_pod_container_resource_requests_cpu_cores{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"})
sum(kube_pod_container_resource_limits_cpu_cores{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"})
sum(rate(container_cpu_usage_seconds_total{namespace="aura-<env>",container!="POD",container!="",pod!="", pod="aura-bot-<id>",container="aura-bot"}[1m]))

An example of this panel is shown below:

Container disk

Container Disk panel shows a time series with the current disk usage of the selected container within the pod. It also shows the current, maximum, minimum and average disk usage of the container.

The x-axis shows the time series and the y-axis shows the amount of disk used by the container.

The queries used to get the panel information are:

sum(rate(container_fs_reads_bytes_total{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"}[1m])) by (container,device)
sum(rate(container_fs_writes_bytes_total{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"}[1m])) by (container,device)

An example of this panel is shown below:

Pod network errors

Pod network errors panel shows a time series with the percentage of errors in the network access of the pod. It also shows the current, maximum, minimum and average percentage of errors of the pod, covering errors and dropped packets while receiving data from and transmitting data to the network.

The x-axis shows the time series and the y-axis shows the percentage of errors of the pod network accesses.

The queries used to get the panel information are:

sum(rate(container_network_receive_errors_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) / sum(rate(container_network_receive_packets_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) * 100
sum(rate(container_network_receive_packets_dropped_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) / sum(rate(container_network_receive_packets_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) * 100
sum(rate(container_network_transmit_errors_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) / sum(rate(container_network_transmit_packets_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) * 100
sum(rate(container_network_transmit_packets_dropped_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) / sum(rate(container_network_transmit_packets_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) * 100
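
As a sanity check on the arithmetic, an error percentage is the error rate divided by the total packet rate, multiplied by 100. The following Python sketch illustrates this; the function name and the sample rates are hypothetical and are not taken from the dashboard.

```python
def error_percentage(error_rate: float, packet_rate: float) -> float:
    """Return errors as a percentage of all packets (0.0 when there is no traffic)."""
    if packet_rate <= 0:
        # Avoid a division by zero when the pod has no network traffic.
        return 0.0
    return error_rate * 100.0 / packet_rate


# Hypothetical per-second rates, in the shape a 5-minute rate() would return.
receive_errors_per_second = 2.0
receive_packets_per_second = 400.0
print(error_percentage(receive_errors_per_second, receive_packets_per_second))
```

A healthy pod is expected to stay at or very close to 0 in all four series.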

Pod status

This section consists of the following panels: ready, created, number of restarts, last terminated reason, last waiting reason, and info (the description of the images running in the containers).

Ready

Ready panel shows a time series with the heartbeat of the container. If there are no errors, it should be a flat line at 1.0.

The x-axis shows the time series and the y-axis shows the answer of the heartbeat of the container: 1 is a correct answer.

The query used to get the panel information is:

kube_pod_container_status_ready{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"}

An example of this panel is shown below:

Pod created

Pod created panel shows the timestamp when the selected pod was created.

The query used to get the panel information is:

kube_pod_created{namespace="aura-<env>",pod="aura-bot-<id>"} * 1000
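
The multiplication by 1000 most likely converts the Unix timestamp in seconds reported by kube_pod_created into milliseconds, which is the unit Grafana date formats work with. A minimal Python sketch of that conversion follows; the sample timestamp and the function name are hypothetical.

```python
from datetime import datetime, timezone


def to_millis(epoch_seconds: float) -> int:
    """Convert a Unix timestamp in seconds to milliseconds."""
    return int(epoch_seconds * 1000)


# Hypothetical value, in the shape kube_pod_created would report it (seconds).
created_seconds = 1700000000
created_millis = to_millis(created_seconds)

print(created_millis)
# Convert back to a readable UTC date to check the result.
print(datetime.fromtimestamp(created_millis / 1000, tz=timezone.utc).isoformat())
```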

An example of this panel is shown below:

Last terminated reason

This panel shows the reason why the pod entered the terminated status.

Last waiting reason

This panel shows the reason why the pod entered the waiting status.

Info

Info panel shows the images running in the containers of the selected pod.

The query used to get the panel information is:

kube_pod_container_info{namespace="aura-<env>",pod="aura-bot-<id>"}

An example of this panel is shown below:

7 - Aura Alerts

Management of alerts in Aura

Learn how to manage alerts through the Prometheus system

Introduction to alerts in Aura

As previously stated, Prometheus has a list of alert rules that are part of the platform configuration. These alerting rules allow you to define alert conditions based on the Prometheus expression language.

⚠️ It is possible to edit the Aura alert rules but, for now, changes are lost in a re-deployment.
If you think an alert is important and should be part of the platform, let us know, so we can officially include it.

Alerts are sent via email, using a global SMTP server managed by the Aura Team. Other notification channels (Slack) are also available but not used by default in production.

Alerts are disabled (silenced) during Aura deployments to avoid false positives due to services that need to be restarted, etc.

To manage alerts, Aura Platform includes the Alertmanager system, which is part of the Prometheus stack. The URL to access Alertmanager is:
alerts-{{ environment_name }}.auracognitive.com

When you open this URL, you can see all the alerts, as shown in the image below.

Alert manager home

In this panel, the most important action available is to silence an alert, either by clicking “silence alarm” on the alert or by pressing the “new silence” button.

Alert manager new silence
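
Silences can also be created programmatically. The Python sketch below builds the JSON body that the Alertmanager v2 API accepts on POST /api/v2/silences. Whether that API is reachable at the Aura Alertmanager URL is an assumption, and the alert name, author and comment are hypothetical sample values.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, created_by, comment, now=None):
    """Build a silence body matching one alert name for the given number of hours."""
    now = now or datetime.now(timezone.utc)
    return {
        # Exact match on the alertname label; isRegex=True would allow patterns.
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Hypothetical example: silence one alert for two hours during a maintenance task.
body = build_silence("high_cpu_usage_on_hosts", 2, "ops@example.com", "Planned maintenance")
print(json.dumps(body, indent=2))
```

The printed body could then be sent with any HTTP client; the sketch deliberately stops before the network call.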

To check whether the cluster is ready, or to review the status of the system, click the “status” section.

Alert manager status

Alerts set in Aura

This section lists the alerts currently set in Aura, organized by scope.

Scope: infrastructure

high_cpu_usage_on_hosts
- Description: « $labels.kubernetes_io_hostname » is using a LOT of CPU. CPU usage is « humanize $value »%.
- Expr: sum by(kubernetes_io_hostname) (rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum by(kubernetes_io_hostname) (machine_cpu_cores) * 100 > 90
- For: 10m
- summary: HIGH CPU USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
high_memory_usage_on_hosts
- Description: « $labels.kubernetes_io_hostname » is using a LOT of Memory. Memory usage is « humanize $value »%.
- Expr: sum by(kubernetes_io_hostname) (container_memory_working_set_bytes{id="/"}) / sum by(kubernetes_io_hostname) (machine_memory_bytes) * 100 > 90
- For: 10m
- summary: HIGH MEMORY USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
high_fs_usage_on_hosts
- Description: « $labels.kubernetes_io_hostname » is using a LOT of FileSystem space. FileSystem usage is « humanize $value »%.
- Expr: sum by(kubernetes_io_hostname) (container_fs_usage_bytes{device=~"^/dev/.*$",id="/"}) / sum by(kubernetes_io_hostname) (container_fs_limit_bytes{device=~"^/dev/.*$",id="/"}) * 100 > 70
- For: 10m
- summary: HIGH FILESYSTEM USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
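
Each alert combines an Expr condition with a For duration: the alert only fires once the condition has held continuously for that long. The Python sketch below is a simplified model of that behavior, not the Prometheus implementation; the function name and the sample values are hypothetical.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert would fire, or None.

    samples: list of (timestamp_seconds, value) pairs in time order.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                # The condition just became true: the alert is now pending.
                pending_since = timestamp
            if timestamp - pending_since >= for_seconds:
                return timestamp
        else:
            # Any sample below the threshold resets the pending period.
            pending_since = None
    return None


# Hypothetical CPU usage of 95% sampled every minute for 15 minutes.
steady = [(t, 95.0) for t in range(0, 901, 60)]
print(first_firing_time(steady, threshold=90, for_seconds=600))

# Same series with a single dip at t=300: the 10-minute period starts again.
dipped = [(t, 50.0 if t == 300 else 95.0) for t in range(0, 901, 60)]
print(first_firing_time(dipped, threshold=90, for_seconds=600))
```

This is why short spikes do not trigger high_cpu_usage_on_hosts: usage must stay above 90% for the whole 10 minutes.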

Scope: kubernetes

high_persistent_volume_usage
- Description: « $labels.persistentvolumeclaim » on « $labels.kubernetes_io_hostname » is using a LOT of persistent volume space. Persistent volume usage is « humanize $value »%.
- Expr: kubelet_volume_stats_used_bytes * 100 / kubelet_volume_stats_capacity_bytes > 70
- For: 10m
- summary: HIGH PERSISTENT VOLUME USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’ by ‘{{ $labels.persistentvolumeclaim }}’
high_persistent_volume_inode_usage
- Description: « $labels.persistentvolumeclaim » on « $labels.kubernetes_io_hostname » is using a LOT of persistent volume inodes. Persistent volume inode usage is « humanize $value »%.
- Expr: kubelet_volume_stats_inodes_used * 100 / kubelet_volume_stats_inodes > 70
- For: 10m
- summary: HIGH PERSISTENT VOLUME INODE USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’ by ‘{{ $labels.persistentvolumeclaim }}’
docker_deleted_container_rate_on_hosts
- Description: « $labels.kubernetes_io_hostname » has a HIGH rate of deleted/stopped containers.
- Expr: sum by(kubernetes_io_hostname) (rate(kubelet_docker_operations{operation_type=~"remove_container|stop_container"}[5m])) > 0.1
- For: 1m
- summary: DOCKER DELETED/STOPPED CONTAINER RATE WARNING
runtime_deleted_container_rate_on_hosts
- Description: « $labels.kubernetes_io_hostname » has a HIGH rate of deleted/stopped containers.
- Expr: sum by(kubernetes_io_hostname) (rate(kubelet_runtime_operations{operation_type=~"stop_podsandbox|remove_container|stop_container"}[5m])) > 0.1
- For: 1m
- summary: RUNTIME DELETED/STOPPED CONTAINER RATE WARNING
frequent_container_restarts
- Description: Container « $labels.container » on pod « $labels.pod » has been restarted « $value » times within the last hour.
- Expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
- For: 5m
- summary: KUBERNETES FREQUENT CONTAINER RESTARTS WARNING
node_not_ready
- Description: Node « $labels.node » has status « $labels.condition » as « $labels.status ».
- Expr: kube_node_status_condition{condition!="Ready",status!="false"} > 0 or on(node) kube_node_status_condition{condition="Ready",status="false"} > 0
- For: 5m
- summary: KUBERNETES NODE NOT READY WARNING
job_error
- Description: JOB ERROR
- Expr: kube_job_status_failed==1
- For: 5m
- summary: KUBERNETES JOB NOT READY WARNING
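
The frequent_container_restarts alert relies on increase() over a counter. As a rough illustration, the Python sketch below computes the growth of a counter across a window while tolerating counter resets. It ignores the extrapolation to the window boundaries that Prometheus also applies, and the sample values are hypothetical.

```python
def counter_increase(samples):
    """Return how much a monotonic counter grew across the samples.

    A drop between two samples is treated as a counter reset, so the new
    value is counted from zero.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current
    return total


# Hypothetical restart counter sampled over one hour.
restarts = [3, 4, 6, 9]
increase = counter_increase(restarts)
print(increase, increase > 5)
```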

Scope: prometheus

prometheus_rule_evaluation_slow
- Description: Prometheus has a 90th percentile latency of « $value »s completing rule evaluation cycles.
- Expr: prometheus_evaluator_duration_seconds{quantile="0.9"} > 60
- For: 10m
- summary: PROMETHEUS RULE EVALUATION SLOW WARNING
prometheus_indexing_backlog
- Description: Prometheus is backlogging on the indexing queue. Queue is currently « $value | printf %.0f »% full.
- Expr: prometheus_local_storage_indexing_queue_length / prometheus_local_storage_indexing_queue_capacity * 100 > 10
- For: 10m
- summary: PROMETHEUS INDEXING BACKLOG WARNING
prometheus_not_ingesting_samples
- Description: Prometheus has not ingested any sample in the last 10 minutes.
- Expr: rate(prometheus_local_storage_ingested_samples_total[5m]) == 0
- For: 5m
- summary: PROMETHEUS NOT INGESTING SAMPLES WARNING
prometheus_persist_errors
- Description: Prometheus has encountered « $value » persistent errors per second in the last 10 minutes.
- Expr: rate(prometheus_local_storage_persist_errors_total[10m]) > 0
- For: 5m
- summary: PROMETHEUS PERSIST ERRORS WARNING
prometheus_notifications_backlog
- Description: Prometheus is backlogging on the notifications queue. The queue has not been empty for 10 minutes. Current queue length: « $value ».
- Expr: prometheus_notifications_queue_length > 0
- For: 10m
- summary: PROMETHEUS NOTIFICATIONS BACKLOG WARNING
prometheus_storage_inconsistent
- Description: Prometheus has detected a storage inconsistency. A server restart is needed to initiate recovery.
- Expr: prometheus_local_storage_inconsistencies_total > 0
- For: 5m
- summary: PROMETHEUS STORAGE INCONSISTENCY WARNING
prometheus_persistence_pressure_too_high_24h
- Description: Prometheus is approaching critical persistence pressure. Throttled ingestion expected within the next 24h.
- Expr: prometheus_local_storage_persistence_urgency_score > 0.8 and predict_linear(prometheus_local_storage_persistence_urgency_score[30m], 3600 * 24) > 1
- For: 30m
- summary: PROMETHEUS PERSISTENCE PRESSURE 24H WARNING
prometheus_persistence_pressure_too_high_2h
- Description: Prometheus is approaching critical persistence pressure. Throttled ingestion expected within the next 2h.
- Expr: prometheus_local_storage_persistence_urgency_score > 0.85 and predict_linear(prometheus_local_storage_persistence_urgency_score[30m], 3600 * 2) > 1
- For: 30m
- summary: PROMETHEUS PERSISTENCE PRESSURE 2H WARNING
prometheus_series_maintenance_stalled
- Description: Prometheus is maintaining memory time series so slowly that it will take « $value | printf %.0f »h to complete a full cycle. This will lead to persistence falling behind.
- Expr: prometheus_local_storage_memory_series / on(job, instance) rate(prometheus_local_storage_series_ops_total{type="maintenance_in_memory"}[5m]) / 3600 > 24 and prometheus_local_storage_rushed_mode == 1
- For: 1h
- summary: PROMETHEUS SERIES MAINTENANCE WARNING
prometheus_target_scrape_sync_too_low
- Description: Prometheus target scrape sync rate is too low.
- Expr: rate(prometheus_target_scrape_pool_sync_total{app="prometheus"}[10m]) == 0
- For: 5m
- summary: PROMETHEUS TARGET SCRAPE SYNC WARNING
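
The two persistence pressure alerts use predict_linear, which fits a straight line to the recent samples and extrapolates it into the future. The Python sketch below shows the idea with an ordinary least-squares fit. It is a simplified model that extrapolates from the last sample rather than from the evaluation time, and the sample scores are hypothetical.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a least-squares line seconds_ahead past the last sample.

    samples: list of (timestamp_seconds, value) pairs with at least two
    distinct timestamps.
    """
    count = len(samples)
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical urgency score rising steadily over the last 30 minutes.
scores = [(0, 0.80), (600, 0.82), (1200, 0.84), (1800, 0.86)]
predicted = predict_linear(scores, 3600 * 2)

# Same shape as the 2h alert: current score above 0.85 and predicted score above 1.
print(scores[-1][1] > 0.85 and predicted > 1)
```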

Scope: logs

elasticsearch_too_few_nodes_running
- Description: There are only « $value » < 3 ElasticSearch nodes running.
- Expr: elasticsearch_cluster_health_number_of_nodes < 3
- For: 10m
- summary: TOO FEW ELASTICSEARCH NODES
elasticsearch_high_memory_usage
- Description: The memory (heap) usage is over 90% for 15m on node « $labels.node »
- Expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.9
- For: 15m
- summary: ELASTICSEARCH HIGH MEMORY USAGE
elasticsearch_not_indexing
- Description: ElasticSearch data node is not indexing new documents
- Expr: increase(elasticsearch_indices_docs{es_data_node="true"}[1m]) == 0
- For: 5m
- summary: ELASTICSEARCH NOT INDEXING

Scope: Aura

aura-bot_unauthorized_aura-bridge
- Description: aura-bridge has not authorized the connection with aura-bot for 3 minutes.
- Expr: sum by (status_code) (rate(http_request_duration_seconds_count{app="aura-bridge",status_code=~"401"}[3m])) > 0
- For: 3m
- summary: AURA-BOT RETURN UNAUTHORIZED TO AURA-BRIDGE
aura-bot_bad-request_aura-bridge
- Description: aura-bridge has not been able to correctly handle the connection with aura-bot for 3 minutes.
- Expr: sum by (status_code) (rate(http_request_duration_seconds_count{app="aura-bridge",status_code=~"400"}[3m])) > 0
- For: 3m
- summary: AURA-BOT RETURN BAD REQUEST TO AURA-BRIDGE
aura-bot_internal-error_aura-bridge
- Description: aura-bridge failed to connect to aura-bot for 3 minutes.
- Expr: sum by (host,status) (rate(outgoing_request_duration_seconds_count{app="aura-bridge",status=~"5..",host=~"aura-bot.*"}[3m])) > 0
- For: 3m
- summary: COMMUNICATION ERROR BETWEEN AURA-BOT AND AURA-BRIDGE
aura-bridge-error_callback
- Description: aura-bridge failed to handle the connection with callback for 3 minutes.
- Expr: sum by (host,status) (rate(outgoing_request_duration_seconds_count{app="aura-bridge",status=~"5..",host!~"aura-bot.*"}[3m])) > 0
- For: 3m
- summary: COMMUNICATION ERROR BETWEEN AURA-BRIDGE AND CALLBACK
aura-bridge_error_whatsapp
- Description: errors in aura-bridge with WhatsApp functionality for 5 minutes.
- Expr: sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus!="200",httpStatus!~"403|408|400"}[5m])) > 0
- For: 5m
- summary: Error happened in WhatsApp functionality.
aura-bridge_error_4p
- Description: errors in aura-bridge with Kernel in WhatsApp functionality for 5 minutes.
- Expr: sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"4p",httpStatus=~"403|408|400"}[5m])) > 0
- For: 5m
- summary: Error happened with Kernel in WhatsApp functionality.
nlp-provisioning_killed-processes
- Description: killed nlp-provisioning processes for 15 minutes.
- Expr: sum by (exported_job) (rate(nlp_provisioning_killed_processes{exported_job="nlp_provisioning_job"}[15m])) > 0
- For: 15m
- summary: Processes killed in nlp-provisioning
alive-processes_nlp-provisioning_expected-alive-processes
- Description: alive nlp-provisioning processes vs expected alive nlp-provisioning processes for 15 minutes.
- Expr: sum by (exported_job)(nlp_provisioning_alive_processes{exported_job="nlp_provisioning_job"}) / sum by (exported_job) (nlp_provisioning_expected_alive_processes{exported_job="nlp_provisioning_job"})!=1
- For: 15m
- summary: Processes killed in nlp-provisioning

Scope: misc

probe_down
- Description: The endpoint « $labels.instance » is down or not reachable. The blackbox exporter could not validate « $labels.app »’s health.
- Expr: probe_success == 0
- For: 2m
- summary: PROBE FAILING

8 - Queries

Queries

Description of the different types of queries that can be done in order to retrieve metrics from the system

To retrieve specific information about the logs and metrics generated by Aura, you can run queries against the system.

These queries are classified into two categories:

Basic monitoring queries
Guidelines for retrieving information from basic queries.
Basic database queries
Guidelines for retrieving information from queries to the database.

8.1 - Basic monitoring queries

Basic monitoring queries

Learn how to get information for the evaluation of Aura system performance through basic queries

Introduction

This document gives guidelines for querying Grafana and Kibana to retrieve basic information from the system.

Number of TPS per component

Request rate in Grafana

Access Grafana of the environment.
Select the “Aura HTTP Inbound” dashboard.
Select the time period for the query.
Select the service of your choice, as can be seen in the following picture:

Basic queries - get current TPS per component in grafana

The panel named “HTTP Request Rate” shows the total number of requests being processed by a service.

This panel is based on the aura-bot metric http_request_duration_seconds stored in Prometheus, aggregated in buckets of one minute.

The y-axis shows the number of requests to the service and the x-axis shows the time period.

Request rate in Kibana

Add a new dashboard with the following data:

Select the time interval for the filter
Query (overwrite your-env with the environment of your choice):
- aura-bot:
  - Public endpoints except the healthcheck: msg: "Response returned" and kubernetes.labels.app : "aura-bot" and kubernetes.namespace_name: "your-env" and not path: "/ping".
  - For all the endpoints of the service: msg: "Response returned" and kubernetes.labels.app : "aura-bot" and kubernetes.namespace_name: "your-env".
- aura-authentication-api:
  - Public endpoints except the healthcheck: msg: "Response returned" and kubernetes.labels.app : "authentication-api" and kubernetes.namespace_name: "your-env" and not path: "/ping".
  - For all the endpoints of the service: msg: "Response returned" and kubernetes.labels.app : "authentication-api" and kubernetes.namespace_name: "your-env".
- aura-bridge:
  - Public endpoints except the healthcheck: msg: "Response returned" and kubernetes.labels.app : "aura-bridge" and kubernetes.namespace_name: "your-env" and not path: "/ping".
  - For all the endpoints of the service: msg: "Response returned" and kubernetes.labels.app : "aura-bridge" and kubernetes.namespace_name: "your-env".
- tac-api:
  - Public endpoints except the healthcheck: msg: "Response with status" and kubernetes.labels.app : "tac" and kubernetes.namespace_name: "your-env" and not path: "/ping".
  - For all the endpoints of the service: msg: "Response with status" and kubernetes.labels.app : "tac" and kubernetes.namespace_name: "your-env".
- aura-nlp:
  - Public endpoints except the healthcheck: msg: "HTTP request" and kubernetes.labels.app : "nlp" and kubernetes.namespace_name: "your-env" and not path: "/ping".
  - For all the endpoints of the service: msg: "HTTP request" and kubernetes.labels.app : "nlp" and kubernetes.namespace_name: "your-env".
- aura-context:
  - Public endpoints except the healthcheck: msg: "HTTP request" and kubernetes.labels.app : "context" and kubernetes.namespace_name: "your-env" and not path: "/ping".
  - For all the endpoints of the service: msg: "HTTP request" and kubernetes.labels.app : "context" and kubernetes.namespace_name: "your-env".
Index: aurak8s-service
Data:
- Metric: Unique Count
  - Field: corr.keyword
  - Custom label: Number of requests
Buckets: split rows
- Aggregation: Date histogram
- Field: @timestamp
- Minimum interval: minute
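
The per-component queries above all follow the same pattern, so they can be generated rather than typed by hand. The Python sketch below is one way to do that; the function name and the SERVICES table are hypothetical helpers, while the log messages and app labels are the ones listed above.

```python
# Log message and kubernetes.labels.app value for each component, as listed above.
SERVICES = {
    "aura-bot": ("Response returned", "aura-bot"),
    "aura-authentication-api": ("Response returned", "authentication-api"),
    "aura-bridge": ("Response returned", "aura-bridge"),
    "tac-api": ("Response with status", "tac"),
    "aura-nlp": ("HTTP request", "nlp"),
    "aura-context": ("HTTP request", "context"),
}


def build_request_query(service, env, include_healthcheck=False):
    """Build the Kibana query string for one component and environment."""
    msg, app = SERVICES[service]
    query = (
        f'msg: "{msg}" and kubernetes.labels.app : "{app}" '
        f'and kubernetes.namespace_name: "{env}"'
    )
    if not include_healthcheck:
        # Exclude the healthcheck endpoint from the request count.
        query += ' and not path: "/ping"'
    return query


print(build_request_query("aura-bot", "your-env"))
print(build_request_query("tac-api", "your-env", include_healthcheck=True))
```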

Update the dashboard and name it, in order to have it available.

A basic example of this dashboard with a panel per component is delivered with Aura, so it can be imported in the Kibana of the environment using Kibana import objects API.

Basic queries - get current TPS per component in kibana

Number of unique users in Aura

In this case, the only way to get the number of unique users accessing Aura is to query the operational logs in Kibana.

Add a new dashboard in Kibana with the following data:

Select the time interval for the filter
Query (overwrite your-env with the environment of your choice): app.keyword : "aura-bot" and kubernetes.namespace_name.keyword : "your-env"
Index: aurak8s-service
Data:
- Metric: Unique Count
- Field: auraId.keyword
- Custom label: Number of unique aura users
Buckets: split rows
- Aggregation: Date histogram
- Field: @timestamp
- Minimum interval: 1h
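
For reference, Kibana's Unique Count metric corresponds to the Elasticsearch cardinality aggregation, and the date histogram bucket to the date_histogram aggregation. The Python sketch below builds an equivalent search body. It assumes an Elasticsearch version that accepts fixed_interval (older versions use interval instead), and the aggregation names are hypothetical.

```python
import json


def unique_count_histogram(field, interval):
    """Build a search body counting distinct values of a field per time bucket."""
    return {
        "size": 0,  # Only the aggregations are needed, not the matching documents.
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval},
                "aggs": {"unique_values": {"cardinality": {"field": field}}},
            }
        },
    }


# Unique Aura users per hour, matching the dashboard settings described above.
body = unique_count_histogram("auraId.keyword", "1h")
print(json.dumps(body, indent=2))
```

Note that the cardinality aggregation returns an approximate count, so small deviations from the exact number of users are expected on large data sets.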

Update the dashboard and name it, to have it available.

A basic example of this dashboard with a panel per component is delivered with Aura, so it can be imported in the Kibana of the environment using Kibana import objects API.

Basic queries - get number of unique aura users in kibana

8.2 - Basic database queries

Basic database queries

Learn how to get information from the database to get some insights

Requirements

A valid kubeconfig for the environment
If the environment database is in Atlas: access to Atlas by IP

Get the variables to access the database:

# substitute {{aura-environment}} with the environment you're configuring
export AURA_ENVIRONMENT={{aura-environment}}

$ kubectl -n $AURA_ENVIRONMENT get cm authentication-api -o json | jq -r ".data.AURA_MONGODB_URI"

{{mongo_uri}}

$ kubectl -n $AURA_ENVIRONMENT get cm authentication-api -o json | jq -r ".data.AURA_MONGODB_USERNAME"

{{mongo_user}}

$ kubectl -n $AURA_ENVIRONMENT get secret authentication-api -o json | jq -r ".data.AURA_MONGODB_PASSWORD|@base64d"

{{mongo_pass}}

$ kubectl -n $AURA_ENVIRONMENT get cm authentication-api -o json | jq -r ".data.AURA_MONGODB_USER_DB"

{{mongo_users_db}}

$ kubectl -n $AURA_ENVIRONMENT get cm authentication-api -o json | jq -r ".data.AURA_MONGODB_USER_COLLECTION"

{{mongo_users_col}}
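
Values stored in a Kubernetes secret are base64-encoded, which is why the password command uses the @base64d filter while the ConfigMap commands do not. The Python sketch below performs the same decoding on the JSON printed by kubectl get secret ... -o json. The function name and the sample secret are hypothetical.

```python
import base64
import json


def secret_value(secret_json, key):
    """Return the decoded value of one key from a Kubernetes secret in JSON form."""
    data = json.loads(secret_json)["data"]
    return base64.b64decode(data[key]).decode("utf-8")


# Hypothetical secret with the same shape as the kubectl JSON output.
sample_secret = json.dumps(
    {"data": {"AURA_MONGODB_PASSWORD": base64.b64encode(b"example-password").decode("ascii")}}
)
print(secret_value(sample_secret, "AURA_MONGODB_PASSWORD"))
```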

Get the channel_name and channel_id for all the channels in the environment:

# substitute {{aura-environment}} with the environment you're configuring
export AURA_ENVIRONMENT={{aura-environment}}

$ kubectl -n $AURA_ENVIRONMENT get cm aura-bot -o json | jq -r ".data.AURA_CHANNELS_CONFIGURATION_API_ENDPOINT"

{{channels_configuration_endpoint}}

$ kubectl -n $AURA_ENVIRONMENT get secret aura-bot -o json | jq -r ".data.AURA_AUTHORIZATION_HEADER|@base64d"

{{authorization_header}}

$ curl {{channels_configuration_endpoint}}/aura-services/v2/configuration/channels -H "Authorization: {{authorization_header}}" -o channels_config.json

$ cat channels_config.json | jq -r '.[] | .name + ":" + .id'
{{ channels }}

# Example of channels
# novum-mytelco:45494a5b-835a-4fff-a813-b3d2be529dbe
# whatsapp:f7fd1021-41cd-588a-a461-387cc24be223
# whatsapp-1004:e75e7b9d-7949-451a-9493-3d759745492c
# movistar-plus:60f0ffda-e58a-4a96-aad9-d42be70b7b42
# set-top-box:814bc401-7743-47d3-957b-7f1b2dafe398
# set-top-box-haac:dc388448-b1d1-11e9-b77b-67224ed60908

Queries

Total number of users registered in Aura

⚠️ This information is only for authenticated users. Currently, anonymous users are not stored in the Aura users’ database.

$ mongo -u {{mongo_user}} -p {{mongo_pass}} {{mongo_uri}}
> use {{mongo_users_db}}
> db.{{mongo_users_col}}.find({}).count()
10167

Total number of users registered in Aura per channel

⚠️ This information is only for authenticated users. Currently, anonymous users are not stored in the Aura users’ database.

Use the output of {{ channels }} to identify the channel by its name rather than by its identifier.

$ mongo -u {{mongo_user}} -p {{mongo_pass}} {{mongo_uri}}
> use {{mongo_users_db}}
> db.{{mongo_users_col}}.aggregate([
    {"$group" : {_id: "$channelId", count: {$sum:1}}}
])
{ "_id" : "981e5db7-8031-4370-a326-b6f4d163cd82", "count" : 1 }
{ "_id" : "814bc401-7743-47d3-957b-7f1b2dafe398", "count" : 21 }
{ "_id" : "189d4016-bcd0-491d-a75e-64e7a54aa75c", "count" : 1 }
{ "_id" : "b2501583-6d76-4e77-b364-aa169490efec", "count" : 1 }
{ "_id" : "b94aec9a-da4d-46de-afc3-06cfe0157888", "count" : 1 }
{ "_id" : "60f0ffda-e58a-4a96-aad9-d42be70b7b42", "count" : 22 }
{ "_id" : "e75e7b9d-7949-451a-9493-3d759745492c", "count" : 2518 }
{ "_id" : "9924335b-321a-4f48-b820-e35c7eb9e58b", "count" : 1 }
{ "_id" : "dc388448-b1d1-11e9-b77b-67224ed60908", "count" : 69 }
{ "_id" : "f7fd1021-41cd-588a-a461-387cc24be223", "count" : 2062 }
{ "_id" : "e59aa30f-bae5-4c9e-9d1a-0be8b904711d", "count" : 1 }
{ "_id" : "45494a5b-835a-4fff-a813-b3d2be529dbe", "count" : 346 }
{ "_id" : "5ad28380-85fa-4ba2-bcdb-0732127f4a85", "count" : 4792 }
{ "_id" : "ecd189c2-b1dd-4142-bbe1-eb9549b327e2", "count" : 1 }
{ "_id" : "25695326-c67c-40fe-b5df-a8fd5b4deb22", "count" : 14 }
{ "_id" : "4c14973e-3369-4c6a-b59d-e3e0ecaed78c", "count" : 291 }
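
To read the result by channel name, the identifiers can be joined with the channel list obtained earlier. The Python sketch below shows one way to do it, assuming the channel configuration is a list of objects with name and id fields (as the jq expression suggests). The function name is hypothetical and the sample rows are taken from the outputs above.

```python
def counts_by_channel_name(channels, counts):
    """Replace channel identifiers with names; unknown identifiers are kept as-is."""
    names = {channel["id"]: channel["name"] for channel in channels}
    return {names.get(row["_id"], row["_id"]): row["count"] for row in counts}


# A few entries from the example channel list.
channels = [
    {"name": "set-top-box", "id": "814bc401-7743-47d3-957b-7f1b2dafe398"},
    {"name": "whatsapp", "id": "f7fd1021-41cd-588a-a461-387cc24be223"},
]

# A few rows from the example aggregation output.
counts = [
    {"_id": "814bc401-7743-47d3-957b-7f1b2dafe398", "count": 21},
    {"_id": "f7fd1021-41cd-588a-a461-387cc24be223", "count": 2062},
    {"_id": "5ad28380-85fa-4ba2-bcdb-0732127f4a85", "count": 4792},
]

for name, count in counts_by_channel_name(channels, counts).items():
    print(f"{name}: {count}")
```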

Total number of users with expired authorization_id

An authorization_id is expired if it has not been used for 180 days in a row.

$ mongo -u {{mongo_user}} -p {{mongo_pass}} {{mongo_uri}}
> use {{mongo_users_db}}
> db.{{mongo_users_col}}.find({lastAccess: {
        $lt: new Date(ISODate().getTime() - 180*24*60*60*1000)
    }
}).count()
3034
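
The same cutoff can be computed outside the mongo shell. The Python sketch below builds an equivalent filter document with the 180-day threshold; the function name is hypothetical, and passing the filter to a MongoDB driver (for example, a count operation on the users collection) is left out as it depends on the client in use.

```python
from datetime import datetime, timedelta, timezone


def expired_filter(now=None, days=180):
    """Build a filter matching users whose lastAccess is older than the cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return {"lastAccess": {"$lt": cutoff}}


# Hypothetical reference date, fixed here so the result is reproducible.
reference = datetime(2024, 6, 28, tzinfo=timezone.utc)
query = expired_filter(now=reference)
print(query["lastAccess"]["$lt"].isoformat())
```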