Monitor Aura system

The operation of your Aura system can be monitored on a continuous basis. Discover the Aura monitoring tools, the different logs and metrics that are generated, and how to implement an alert system.

Introduction

The Aura monitoring system is crucial for controlling how the Aura system works internally, in order to keep the service functional and, moreover, to understand the behavior of our clients so that Aura can evolve accordingly.

Monitoring is based on logs and metrics generated by different Aura components, which are stored for subsequent analysis and for decision-making based on the results. Together, logs and metrics form a complementary observability foundation that provides essential knowledge of system performance in both predictable and unpredictable situations.

Within this process, key external monitoring tools are used, such as Elasticsearch, Kibana and Grafana.

Stages in Aura monitoring process

The sections below show, at a glance, the steps and recommended tools for both logs and metrics management, and include links to the corresponding documents for more details.

Aura logs management

Logs are files generated by different Aura components that record specific single events, warnings and errors as they occur.

  1. Logs generation
    Different Aura components generate logs every time a relevant event happens.

  2. Logs storage
    Aura logs are stored in Elasticsearch.

  3. Logs visualization and analysis
    Logs can be managed through different tools:

    • Kibana: basic operational logging analyzer of Aura.
    • Moreover, two additional tools, fluentd and Grafana, can be used for specific aspects of logs management.

📃 Read here detailed information regarding Aura logs management.

Aura metrics management

Metrics offer an aggregated view of Aura performance based on meaningful aggregated logs. They are typically generated at fixed-time intervals and represent a specific aspect of the monitored system.

  1. Metrics generation
    Different Aura components generate metrics periodically based on aggregated logs.

  2. Metrics storage
    Once generated, Aura metrics are polled by Prometheus, which is in charge of gathering and exposing them.

  3. Metrics analysis
    Aura metrics are analyzed in order to have a meaningful interpretation of data and to obtain an overall evaluation of Aura’s performance.

    For the management of metrics, we recommend using Aura dashboards, which are generated in Grafana. These dashboards can be retrieved by making queries to the system.

📃 Read here detailed information regarding Aura metrics management.

Aura alerts

Prometheus has a list of alert rules that are part of the platform configuration and can be edited.

📃 Discover the alerts currently set for the Aura system in the Aura alerts document.

1 - Aura Analytics 1.1.

Aura Analytics 1.1.

Description of Aura Analytics 1.1, the monitoring dataflow that allows active listening in Aura

Introduction

This document describes a joint dataflow between LCDO OB teams and the Aura Global Team for processing Aura log files created in the production environment (i.e., coming from actual Aura users) in order to create PPDs (Privacy-Preserving Datasets). This whole process is known as Active Listening.

Among other elements, the dataflow produces an analytics component, named Aura Analytics Dashboard, that can be used to gather statistics on the production system and to analyze user behavior. The latest version of this dashboard, 1.1, is described in the current document.

The main objectives of the unified dataflow are:

  • Consolidate the processing of Aura logs into a framework.
  • Provide LCDOs and Aura Global Team with a unified common source for analytics, in a privacy-preserving way.
  • Enable extensibility of the dataflow.

In this framework, the current documents provide:  

The target audience of this document includes the following roles both in LCDO Teams and Aura Global Team:

  • Data Scientists and Product teams who wish to access Aura logs information and perform analytics on it.
  • Operation teams, for the architectural description and the requirements on OB environments.

Aura Analytics versions

Release 1.0.

The first release 1.0. sets up the basic paths, deploys the PPD infrastructure and produces:

  • Version 1.0. of the OB Analytics system, which includes the OB Dashboard.
  • The first version of pre-processed datasets (clean PPDs) for training and analytics at Aura Global.

As mentioned, this version enables OBs to go further by:

  • Enhancing the OB Dashboard with new visualizations, as they see fit (given that panels and dashboards can be exported and imported, it is possible to share new ones across all OBs as they are developed).

  • Processing the PPD files as desired (they are standard CSV files, which can be ingested into alternative platforms). Restrictions on them are softer than on the original logs because of the anonymization process they have undergone, although they are still subject to management precautions (a code of conduct is being prepared for that).

Release 1.1.

Version 1.1. introduces the following changes:

  • The table of data has been enlarged with these new fields: AURA_ID, STATUS_CD, sesId, sesSize, sesDuration.
  • An expanded list of test users is used, so that the userType column contains more identifications.
  • The code for data ingestion into a local Kibana, which previously consisted of a single Python script, has been turned into a full installable Python package, due to its increasing complexity.

The prerequisites for the use of version 1.1. of Aura Analytics Dashboard are set below:

  • Aura Platform version:

  • Recommended operating system: Ubuntu 18.04

  • Recommended tool for data visualization: ELK stack

1.1 - Architecture

Aura Analytics 1.1. architecture

Technical architecture of Aura Analytics 1.1.

Architecture description

The following figure shows a full overview of Aura Analytics Dashboard architecture and operation, which is also described below:

Aura Analytics architecture

  1.  Aura logs generated in the local instance are converted to datasets and transferred to the local Kernel via the standard procedure and with the established frequency (typically, daily).

  2.  Once there, the “Active listening” process flow fires up daily. Through a specialized process that runs on an Aura local instance and with access to the stored datasets in the Kernel local storage space:

    • PII (Personally Identifiable Information) is removed or encrypted.
    • The result is transferred to a bucket/blob set up for this task and managed by Global Aura team.
    • Here, the PPDs (Privacy-Preserving Datasets) are created. Currently, only MESSAGE, RECOGNIZER and API datasets are involved in this process.

    In order to convert PII data to PPD, every field in these datasets can be:

    • a. Not transferred.
    • b. Pseudo-anonymized. In this situation, the field is transformed through a cryptographic hashing process using a secret set up by the OB.
    • c. Partially anonymized (e.g., credit card number, email, etc.). The field is processed to detect specific patterns, which are replaced with a specific tag (idemail, idpassport, etc.). The list of anonymization strings is agreed with each OB.
    • d. Transferred as is.
  3.  After that, the raw PPD datasets stored in the bucket/blob managed by the Global Team are processed to generate clean PPD datasets, adapted to the analytics tools.

  4.  From that space, the clean PPD Datasets can be:

  • Accessed by the Aura Global Team, which uses them for several tasks, with the purpose of evaluating Aura quality and taking the best decisions regarding product evolution:

    • Perform analytics on Aura behavior and prototype Analytics Dashboard features
    • Improve Aura Platform capabilities (e.g., adapting machine learning models)
  • Accessed by a Local Aura Team, ingesting the data into a dedicated server managed by the OB with analytics and data visualization capabilities. To do that, the Aura Global Team provides a component with the ELK stack (Elasticsearch, Logstash and Kibana) preconfigured with a set of dashboards that can be deployed and adapted by the OB team.
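
The four per-field options listed in step 2 (not transferred, pseudo-anonymized, partially anonymized, transferred as is) can be illustrated with a minimal sketch. It is not the actual Active Listening code: the use of HMAC-SHA256 as the keyed hash, the two regular expressions, the policy names and the function names are all assumptions made for illustration, and the real list of patterns and tags is the one agreed with each OB.

```python
import hashlib
import hmac
import re

# Hypothetical patterns and tags; the real list is agreed with each OB.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<idemail>"),
    (re.compile(r"\b\d{9}\b"), "<idphone>"),
]


def pseudonymize(value: str, secret: bytes) -> str:
    """Option b: keyed hash, so the same input always maps to the same token."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_fragments(text: str) -> str:
    """Option c: replace every fragment matching a known pattern with its tag."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text


def to_ppd(row: dict, policy: dict, secret: bytes) -> dict:
    """Apply a per-field policy: 'drop', 'hash', 'mask' or 'keep'."""
    out = {}
    for field, value in row.items():
        action = policy.get(field, "drop")  # unlisted fields are not transferred
        if action == "hash":
            out[field] = pseudonymize(value, secret)
        elif action == "mask":
            out[field] = mask_fragments(value)
        elif action == "keep":
            out[field] = value
    return out
```

Because the hash is keyed with a secret held by the OB, the same identifier always produces the same token, which keeps user behavior traceable over time without exposing the original value.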

All the code involved in this process can be found in GitHub. Particularly:

1.2 - OB analytics

OB analytics

Description of the OB Analytics subsystem that can be managed by OBs.

Introduction

The OB Analytics subsystem is an optional component in the dataflow, which enables the management of clean PPDs (Privacy-Preserving Datasets) by LCDOs for the analysis of Aura behavior.

In order to work with OB Analytics subsystem, the following items must be fulfilled:

  1.  The legal agreement for log management and creation of PPDs must be signed between the OB and Aura Global Team.

  2.  The mechanism for PPD creation and transfer must be installed. This requires the deployment of a piece of software (provided by Aura Global Team) inside the OB cloud, with access to the repository (AWS bucket or Azure Blob Storage) holding Aura logs.

  3.  A virtual machine must be deployed on the OB cloud to hold the OB Dashboard. This virtual server must be provisioned by the OB on the same cloud environment (provider and region, e.g., AWS West Europe) as the Kernel cloud, but separated from it in terms of access rights (placing it in the same cloud saves transfer costs from the cloud provider for PPD access).

Architecture and installation

The basic infrastructure of the OB Analytics subsystem consists of a virtual machine that is fed with the extracted and cleaned PPDs. This virtual machine is set up with a proposed stack of tools based on the open-source ELK framework (see the figure in the Architecture document):

  • Elastic Search: indexing database.

  • Logstash: ingester for PPD data, configured to upload the anonymized clean PPD tables into Elastic Search.

  • Kibana: visualization tool offering dashboards and panels created over Elastic Search data.

The OB is required to set up the base VM, for which an Ubuntu 18.04 system is advised.

On top of this base system, Aura Global Team provides an installation kit that includes:

  • The pre-processing and ingesting configuration for feeding clean PPD data into logstash.
  • The indexing configuration for Elastic Search.
  • Certain prototype dashboards and panels for Kibana.
  • Basic security provisions (providing web-based secure access to the dashboard).

Once installed, the system automatically ingests any new clean PPD being produced, so that the index and dashboards remain up to date.

In principle, the PPD creation process specifies daily production, since Aura logs are sent to Kernel once a day. This means that information about Aura behavior and user actions on one given day will be available in the dashboards on the following day.

The provided system and installed dashboards are only visualization examples for clean PPDs. The system allows the creation of additional panels that may provide complementary insights on clean PPD elements and OBs are encouraged to explore data as they see fit.

Dashboards can be exported and reimported in a different system, so in addition to the LCDO team adding new analysis features, it is possible to provide later updates to the OB Analytics system. These updates can be provided by the Aura Global Team or shared between OBs.

Outside the dashboard stack, it is also possible to process clean PPD with alternative tools (PPDs are essentially CSV files with a defined structure, so they can be processed with a variety of tools).
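
As an illustration of processing clean PPDs outside the dashboard stack, the following sketch counts interactions per intent using only the Python standard library. It assumes that the CSV file has a header row whose column names follow the fields listed in the Data model document; the sample rows and the function name are made up.

```python
import csv
import io
from collections import Counter


def count_by_field(csv_file, field: str) -> Counter:
    """Count rows per distinct value of `field` in a PPD-like CSV stream."""
    reader = csv.DictReader(csv_file)
    return Counter(row[field] for row in reader)


# Tiny in-memory stand-in for a clean PPD file (made-up rows).
sample = io.StringIO(
    "CORR_ID,CHANNEL_CD,INTENT\n"
    "1,channel-a,intent.billing\n"
    "2,channel-b,intent.billing\n"
    "3,channel-a,intent.greeting\n"
)
counts = count_by_field(sample, "INTENT")
print(counts.most_common())
```

For a real file, the in-memory sample would be replaced by an `open(path, newline="", encoding="utf-8")` handle.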

Kibana dataflow

The Aura Analytics dashboard follows a standard ELK deployment:

  1. An Elastic Search index has been created. It is called aura-message-COUNTRY, and its index schema contains a cleaned version of the AURA MESSAGE table (which registers input and output messages). For details on the fields that this index contains, go to the document Data model.

  2. A Logstash configuration ingests into this index the cleaned sets of datapoints that are produced daily as a result of the transfer and processing of Aura logs. This is usually done in the early morning (which will then upload data for the previous day).

  3. A Kibana index pattern has been created, matching the uploaded Elastic Search index. An Elastic Search index is how the data is stored inside the DB; a Kibana index pattern is how it is visualized in the interface. Typically, Kibana index patterns match Elastic Search indices, but it is, for example, possible to create a Kibana index pattern that matches more than one Elastic Search index and hence combines different data sources.

  4. A small set of visualizations has been pre-installed in Kibana over that index pattern, as a means to get a first look at the index data. See the section preinstalled visual elements to check them.
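
The relationship between index patterns and indices described in step 3 can be pictured with a small sketch. It only approximates Kibana's wildcard matching with Python's `fnmatchcase`, and the index names are hypothetical examples of what `aura-message-COUNTRY` could look like once COUNTRY is replaced by a country code.

```python
from fnmatch import fnmatchcase

# Hypothetical index names (COUNTRY replaced by a country code).
indices = ["aura-message-es", "aura-message-br", "other-index"]


def matching_indices(pattern: str, names: list) -> list:
    """Return the index names selected by a wildcard index pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# A pattern with a wildcard combines several indices into one data source.
print(matching_indices("aura-message-*", indices))
```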

This configuration is deployed on the Kibana default space (the only one available on a freshly created Aura Analytics dashboard). If additional spaces are needed to better organize visualizations, then the Kibana index pattern needs to be installed into those additional spaces.

Preinstalled visual elements

Kibana offers many possibilities to visualize the ingested data and there are many resources and tutorials around explaining its mechanics. We therefore refer to the official Kibana documentation, or tutorials available on the web, for generic information.

In the particular case of the Aura Analytics deployment, there is an Elastic Search index that gets automatically ingested daily. It is called aura-message-COUNTRY and contains a cleaned version of the AURA MESSAGE table (which registers input and output messages).

Over this index, three types of panels/visualizations have been preinstalled, to provide a starting point:

  • Discover panel
  • Visualizations
  • Dashboards

These preinstalled elements are described in the following subsections. To access them, select the appropriate icon in the left navigation panel.

Elastic search index

Discover panel

The Discover panel in Kibana is an essential tool where one can perform queries to an Elastic Search index (and save those searches if desired), and explore users’ interactions with Aura in detail log by log, these being filtered by:

  1.  Search terms or conditions
  2.  A time interval
  3.  Additional filters applied to the query results
  4.  A set of index fields to show in the result table

These 4 steps are represented in the following figure:

Discover panel

As shown in the previous figure, the starting point is the Elastic Search index holding all the data. The first three steps in the chain reduce the amount of data handled, by pruning out elements that do not satisfy the defined conditions. The fourth step is just a display adjustment: it defines which of the available fields of the final dataset are shown in the output table that appears in the panel. The retrieved data nevertheless contains all fields (clicking on any of the rows will show them).
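
For readers who prefer to query the index directly, the four steps can be approximated by an Elasticsearch search body, built here as a plain Python dictionary. This is only a sketch: the function name is hypothetical, the time field name `@timestamp` is an assumption (the timestamp field is not listed in the data model), and filters are assumed to be exact matches on keyword fields.

```python
def build_discover_query(search=None, days=7, filters=None, columns=None,
                         time_field="@timestamp"):
    """Build an Elasticsearch search body mirroring the four Discover steps."""
    # Step 1: search terms (a blank query returns everything).
    if search:
        must = [{"query_string": {"query": search}}]
    else:
        must = [{"match_all": {}}]
    # Step 2: time interval, expressed with Elasticsearch date math.
    clauses = [{"range": {time_field: {"gte": f"now-{days}d", "lte": "now"}}}]
    # Step 3: additional filters (assumed exact matches on keyword fields).
    for field, value in (filters or {}).items():
        clauses.append({"term": {field: value}})
    body = {"query": {"bool": {"must": must, "filter": clauses}}}
    # Step 4: Kibana still retrieves every field and only hides columns;
    # "_source" is the closest equivalent when querying the index directly.
    if columns:
        body["_source"] = columns
    return body
```

Calling `build_discover_query()` with no arguments corresponds to a blank query over the last 7 days with no extra filters.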

In the Aura Dashboard default set, there is one Discover panel preinstalled. It is called question-answer pairs and has the following characteristics:

  • A blank query (i.e., provide all the results)
  • A time interval for the last 7 days
  • An “only user” filter: it filters out all intents that correspond to non-user queries (suggestions, help commands from the client application, etc.)
  • A visualization that includes: the timestamp, the (cleaned) user message, the detected aura intent, associated entities (if applicable), the dialog that was invoked and Aura’s response

This figure shows a snapshot of this panel. To load it, select the Discover tool in the left navigation bar and then click on the “Open” menu option in the top menu bar. A list of saved panels will be shown, together with the already mentioned “question-answer pairs”.

question-answer pairs panel

Once the panel is loaded, each one of the aforementioned four elements can be freely modified. For example, the interface allows:

  • Adding new filters with the “+Add Filters” button
  • Deactivating the current filters by clicking on the predefined filter and selecting the “Temporarily Disable” option
  • Modifying the query interval with the “calendar” button or “Dates Box”
  • Adding a specific query on a given index field(s) by using the “Search Box”, instead of the (default) blank query.

Discover panels can be saved as named objects, to be later loaded at will. So, if needed, any panel (a modified panel or a newly created one) can be saved with a new name to have it available for later loading.

Visualizations

A total of 7 visualizations come preinstalled with the base Aura Dashboard. The list can be obtained from the “visualizations” item in the left menu bar, as shown in the figure, and they are:

  • Three “Stats” type visualizations, which provide general statistics on platform usage.
  • Four “User” type visualizations, which provide insights on user behavior.

Visualizations

Note that this distinction between “User” and “Stats” is purely conceptual and based on the fields used to generate the visualizations; from the point of view of Kibana, they are all regular visualizations. Each visualization can be loaded instantly by clicking on its name, and they can also be integrated into dashboards, as described in the next section.

Dashboards

A dashboard in Kibana is essentially a spatial arrangement of visualizations. For example, to construct a dashboard, just place visualizations into a page, resizing them as required, so they can be observed in a single place.

Note that in a dashboard all visualizations are linked: if, for example, the time interval is changed or a filter is added through the interface, the change affects all visualizations in the dashboard and all of them are updated.

Elements in the dashboard visualizations can also generate instant filters by clicking on graphs or table elements. Those filters are then added to the top of the page as a filter and, therefore, can then be modified or removed.

The Aura Analytics default installation preloads two dashboards. Those are available for selection when we click on the “dashboard” icon in the left navigation bar:

Default dashboards

The two dashboards are described in the following sections.

System dashboard

This dashboard integrates the three predefined “Stats” visualizations (generic statistics):

  • A timeline of interactions (user messages sent and answered), segmented by channel
  • A heatmap of interactions by weekday and time of day (hour)
  • A bar graph classifying the interactions produced in the period by detected intent

The following figure shows a screenshot of this dashboard:

System dashboard

User dashboard

The user dashboard contains the four “User” visualizations:

  • Most Frequent User Utterances: list of the most frequent user’s sentences (in the time interval and filter active at the moment). It uses the usrMsgSig field to group together very similar utterances.
  • Aura Answer Groups: list of the most frequent answers that Aura generates, grouped by the semantic categories in AuraMsgGroup field.
  • Words per query: distribution of sizes for the user messages, measured as number of words in the utterance and segmented by channel.
  • Tag cloud: set of most frequent user utterances, as a tag cloud in which the font size represents the utterance frequency. The MESSAGE_USR_NORM field is used for its representation, so it contains normalized utterances.

The next screenshots show the dashboard with all these visualizations (it is a large dashboard, so typically it needs scrolling to visualize all its components).

User dashboard

Note that those four visualizations are linked as they correspond to the same subset of the data (as given by filters and time interval) but they are NOT linked at the individual item level (i.e., a given most frequent user utterance in the left table does not correspond to any specific Aura answer in the right bar graph).

Instead, the dashboard can be manipulated by selecting one specific item in any of the visualizations, which creates a filter for the others. For instance, as the following image shows, if we select <CHURN> in the Aura answer group visualization, the other visualizations show the user utterances that led Aura to generate that answer (i.e., an answer about contract cancelation).

Aura answer groups in User dashboard

1.3 - Data model

Aura Analytics data model

Data model of Aura Analytics 1.1. that can be used as the base for building new elements

Introduction

New elements can be built (or the current elements modified) by making use of the available fields in Kibana through the ingested Elastic Search index.

In this document, we provide a reference of the schema that the index follows, so that it can be used to build such new visualizations, or to better understand the existing ones.

Elements in the Aura-message data model have 3 different types:

  • Numeric: single numbers, integer or real. Suitable for numerical statistics, such as averages, or for plotting variation across time in graphs.

  • Keyword: they are opaque strings, i.e., terms that cannot be searched within (it is not possible to look for words inside a keyword field). They can, however, be used to create some term-level queries, such as prefix queries (find all instances that begin with) and they usually work great for aggregations, since most of them are categorical variables (fields that only have a limited number of possible values) and can therefore be bucketed and counted.

  • Text: these fields are divided into separate terms (words), and some pre-processing is applied to them before indexing, through an Elastic Search analyzer, to improve access. Text fields cannot be used in aggregated visualizations, since they cannot be grouped. They are most useful for queries, because they allow searching for fragments (only a few words) and fuzzy searches.
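
The practical difference between keyword and text fields shows up in the kind of Elasticsearch request each one supports. The sketch below builds three request bodies as Python dictionaries: a prefix query and a terms aggregation, which suit keyword fields, and a match query, which suits text fields. The helper names and the aggregation name `top_values` are hypothetical.

```python
def prefix_query(field: str, prefix: str) -> dict:
    """Term-level query for a keyword field: values beginning with `prefix`."""
    return {"query": {"prefix": {field: prefix}}}


def match_query(field: str, words: str) -> dict:
    """Full-text query for a text field: the words are analyzed before matching."""
    return {"query": {"match": {field: words}}}


def terms_aggregation(field: str, size: int = 10) -> dict:
    """Bucket and count the most frequent values of a keyword field."""
    return {
        "size": 0,  # return only the buckets, not the matching documents
        "aggs": {"top_values": {"terms": {"field": field, "size": size}}},
    }


# Keyword fields: prefix search and bucketing.
by_intent_prefix = prefix_query("INTENT", "intent.")
top_dialogs = terms_aggregation("DIALOG_ID.keyword")
# Text field: search for words inside the user message.
about_invoices = match_query("MESSAGE_USR", "invoice")
```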

Fields list

The following table lists all the fields available in the aura-message-COUNTRY Elastic Search index, together with their type and a brief description.

The most relevant ones include a more detailed description in the section fields explanations.

Note that some fields of text type have a mirror field of type keyword, with the same content. Having the same data indexed in two different ways at the same time (as text and as keyword) makes it possible to perform different types of analysis by choosing the right field.

The “Raw” column indicates if this field is already present in the Aura raw PPD files:

  • Yes: field contained in raw PPDs.

  • No: generated field, produced when creating clean PPDs. They can be recognized as lowercase fields.

  • Partial: it exists in the raw PPDs, but in a somewhat different form.

| Field | Type | Raw | Contents |
|---|---|---|---|
| CORR_ID | keyword | yes | Unique identifier for each interaction |
| VERSION_ID | keyword | yes | Aura Platform version |
| CHANNEL_CD | keyword | yes | Identifier for the channel this interaction corresponds to |
| STATUS_CD | keyword | yes | Internal code related to operation status |
| AURA_ID_GLOBAL | keyword | yes | (Mostly) unique identifier for the user |
| AURA_ID | keyword | yes | (Mostly) local identifier for the user |
| INTENT | keyword | yes | Detected user intent, including “system” intents |
| MESSAGE_USR | text | partial | Text request sent by the user |
| MESSAGE_USR_NORM | text | no | A normalized version of MESSAGE_USR |
| MESSAGE_USR_NORM.keyword | keyword | no | A keyword version of MESSAGE_USR_NORM, to enable aggregating on it |
| MESSAGE_AURA | text | partial | Text message sent by Aura to the user |
| MESSAGE_AURA.keyword | keyword | partial | Keyword version of MESSAGE_AURA, to enable aggregating on it |
| MODALITY_CD_USR | text | partial | Modality of the user message |
| MODALITY_CD_AURA | text | partial | Modality of Aura response |
| ENTITIES | text | yes | Comma-separated list of the entities recognized in the user message |
| DIALOG_ID | text | yes | Identifier for the dialog that produced Aura response |
| DIALOG_ID.keyword | keyword | yes | Keyword version of DIALOG_ID, to enable aggregating on it |
| DURATION_NU | number | yes | Elapsed time, in ms, between the reception of the user message and the moment the response is generated to be sent to the channel |
| userType | keyword | no | A single-character identifier that characterizes the user as a test user |
| usrMsgWc | number | no | Message word count: number of words contained in the user message |
| usrMsgSig | keyword | no | Message signature: a string that helps clustering user messages |
| AuraMsgGroup | keyword | no | Cluster the Aura response belongs to |
| weekday | number | no | Day of the week the interaction happened (0=Monday to 6=Sunday) |
| hour | number | no | (Integer) hour the interaction happened |
| country | keyword | partial | Two-letter code for the country |
| sesId | keyword | no | Session information |
| sesSize | number | no | Session information |
| sesDuration | number | no | Session information |
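
Some of the generated fields in the table are simple derivations that can be reproduced when working with the data outside Kibana. The following sketch shows one plausible way to obtain weekday, hour and usrMsgWc; how the actual clean PPD process computes them is not documented here, so counting words by splitting on whitespace is an assumption, and the function name is hypothetical.

```python
from datetime import datetime


def derive_fields(timestamp: datetime, user_message: str) -> dict:
    """Compute weekday, hour and word count for one interaction."""
    return {
        "weekday": timestamp.weekday(),  # Python also uses 0=Monday ... 6=Sunday
        "hour": timestamp.hour,
        "usrMsgWc": len(user_message.split()),  # assumed: whitespace-separated words
    }


# 1 March 2021 was a Monday.
print(derive_fields(datetime(2021, 3, 1, 14, 30), "quiero ver mi factura"))
```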

Fields explanations

This subsection contains more detailed descriptions of some of the key fields in the schema.

AURA_ID_GLOBAL

This element (mostly) uniquely identifies the user generating the interaction.

Note that the concrete value of this field is not the same as the actual identifier used within Aura and uploaded to Kernel: for privacy reasons, the identifier was hashed when generating the PPD and bears no resemblance to the original one. The correspondence is, however, maintained across time, so it is possible to analyse user behavior.

The “mostly” qualifier reflects one quirk of the original Aura identifier: its generation depends on the authentication method used by the channel, so if two channels follow different authentication methods (e.g., MobileConnect vs. User/Password) then the AURA_ID_GLOBAL identifier for the same user will be different. In summary:

  • The identifier stays the same for a given user across time.

  • Different users will not have the same identifier.

  • But the same user could produce two different identifiers if connected to two channels that use a different authentication method.

AURA_ID

This is a “local” identifier, i.e., one that is generated inside the channel according to specific channel characteristics and it is not tied as much as AURA_ID_GLOBAL to user authentication.

Its main disadvantage is its transient nature: the same user, on the same channel, could generate different AURA_ID strings when connecting at different times in different sessions. Therefore, for user accounting and tracing, AURA_ID_GLOBAL is usually preferred.

However, there are instances in which AURA_ID works better, namely for anonymous access (when the user is not authenticated). This depends on the channel:  

  • In the WhatsApp channel, the initial use of the channel will be anonymous from the Aura side (i.e., no authentication is done), hence AURA_ID_GLOBAL will also be empty (at least until the user authenticates, which depends on the use case). But in this channel, AURA_ID has a permanent value, linked to the WhatsApp user, so here it is a good substitute for a persistent id, even for unauthenticated users.

MESSAGE_USR

This field includes the message sent by the user.

It has been partially processed to enhance anonymization by replacing some standard identifiers contained in it with <idxxx> strings (e.g., phone numbers appear as <idphone>).

Replacement is done mostly through regular expressions, so there might be occasional glitches (such as identifying a number as a phone number just because it follows the phone number pattern, even though it is not one).

MESSAGE_USR is a field of text type. As such, it is searchable: it is possible to search for specific words the user might have said.

Furthermore, it has been processed through an ElasticSearch analyzer adapted to the specific language used. This means that searches are able to match related words (e.g., plural versions of a singular query word, or verb conjugations). Phrase searches are also possible (by using double quotes around the phrase). If a phrase (several words) is used as a query without the quotes, ElasticSearch interprets it as a query for any of the words, so it will return all data elements that contain any of the words in the query.

In Kibana, more sophisticated text searches can be made by switching to Lucene query syntax: proximity queries (words close to each other), fuzzy searches (query words allowing typos), wildcards, etc.
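
A few Lucene expressions illustrate these operators. In Kibana they are typed directly into the search box; the sketch below also wraps them in an Elasticsearch `query_string` query for use outside Kibana. The search words are made up and the helper name is hypothetical; only the operators matter.

```python
def lucene_query(expression: str, default_field: str = "MESSAGE_USR") -> dict:
    """Wrap a Lucene expression in an Elasticsearch query_string query."""
    return {
        "query": {
            "query_string": {"query": expression, "default_field": default_field}
        }
    }


# Made-up search words; only the operators matter here.
examples = {
    "phrase": '"cancel contract"',       # exact phrase
    "proximity": '"cancel contract"~3',  # both words within 3 positions
    "fuzzy": "invoice~1",                # allow one edit (typo)
    "wildcard": "roam*",                 # any word starting with "roam"
}
queries = {name: lucene_query(expr) for name, expr in examples.items()}
```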

MESSAGE_USR_NORM

This is a normalized version of MESSAGE_USR, in which the user text has been streamlined by:

  • Converting the whole sentence to lowercase
  • Removing all punctuation
  • Removing any extra spaces

Furthermore, this field is not processed through a language-dependent analyzer as MESSAGE_USR is, so queries on this field must match words exactly. It is still a text type field, however, so the same query language can be used.
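
The three normalization steps can be sketched in a few lines of Python. This is an approximation, not the code used to build the clean PPDs: "punctuation" is assumed to mean any character that is neither a word character nor whitespace, and how the real process treats anonymization tags such as <idphone> is not covered.

```python
import re


def normalize_message(text: str) -> str:
    """Lowercase, strip punctuation and collapse extra spaces."""
    lowered = text.lower()
    # Assumed definition of punctuation: anything that is not a word
    # character (letters, digits, underscore) or whitespace.
    no_punctuation = re.sub(r"[^\w\s]", "", lowered)
    # split() without arguments drops leading, trailing and repeated spaces.
    return " ".join(no_punctuation.split())


print(normalize_message("  Hola,   ¿Cuánto  debo? "))
```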

MESSAGE_AURA

This contains the text message generated by Aura and sent to the user as response to the user query. It is a text type field, so it is possible to search for specific words in it.

In the current version of Aura KPIs logs, this field only contains the text response. Some Aura use cases do not generate a purely textual message, but a more elaborate one (e.g., a card with text and graphics). These complex answers are inserted as attachments into Aura’s response to the channel and, since attachments are not logged into the MESSAGE field, this field will appear empty in those cases. So, an empty MESSAGE_AURA field does not necessarily mean that Aura did not provide an answer. As an alternative for those situations, looking at the DIALOG_ID field (or INTENT) may give a hint of the type of answer that Aura delivered.

MODALITY_CD_USR

This field contains the modality in which the user sent the message.

It is a slightly transformed field: because there are some variations across Aura versions, the modalities are unified by consolidating them into only four different keywords, including audio (spoken message), text (written free-text message) and form (commands sent via automatic processing or menus).

DIALOG_ID

This field contains the identifier of the use case dialog module in the aura-bot Framework that was selected to construct the Aura response.

Dialog identifiers have two components (library and dialog) separated by a colon, e.g., services:service-usage.

This field uses a custom analyser that splits the identifier at the colon, generating two terms. This makes it possible to construct queries with one of the terms, e.g., “give me all the elements for the domain services”. But being a text field makes it impossible to do aggregations on it, so it cannot be used for statistics like bar charts (use DIALOG_ID.keyword for that).
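
When processing PPD files outside Elastic Search, the same split at the colon can be reproduced by hand, for example to count interactions per library. This is a minimal sketch: the function name is hypothetical and every identifier except services:service-usage is made up.

```python
from collections import Counter


def split_dialog_id(dialog_id: str) -> tuple:
    """Split 'library:dialog' into its two components."""
    library, _, dialog = dialog_id.partition(":")
    return library, dialog


# Only the first identifier comes from the documentation; the rest are made up.
dialog_ids = ["services:service-usage", "services:other-dialog", "billing:invoice"]
per_library = Counter(split_dialog_id(d)[0] for d in dialog_ids)
print(per_library)
```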

DURATION_NU

This number reflects the time it took Aura to understand, process and respond to the user message. It is the difference (in milliseconds) between the timestamp of the moment the user message was received and the timestamp at which Aura’s response was finalized and sent to the channel.

Note that it is not a complete end-to-end delay time from the user’s point of view, since it does not include either the time it took the request to arrive at Aura through the channel or the time it took the response to travel back through the channel and get rendered at the client application (those times are outside Aura, and as such not registered by it).

Session Information

Session information includes the fields: sesId, sesSize, sesDuration.

These fields are generated by running a process over the time series formed by interactions from each user at each channel.

A session is automatically identified as a consecutive list of such user’s interactions, each separated from the next by a time interval shorter than 5 minutes. Once each session is identified, it is tabulated and labelled with three fields:

  1. sesId: string, forming a unique identifier for the session. It should be considered as an opaque identifier and the guarantee is that no other session in the data stream carries the same identifier.
    As an aside, interactions that do not correspond to actual user interactions (because no user could be identified or because the datapoint corresponds to an interaction not triggered by the user) are all labelled with a <void> sesId.

  2. sesSize: number of interactions this session contains. Only the first interaction in the session carries this value; all other interactions carry a 0 in this field. Non-sessions, such as the ones with a <void> sesId, are left empty. This makes it easy to compute averages or other statistics on valid sessions by first filtering out all zero and empty values.

  3. sesDuration: duration of each session, counted from the instant the first user message was received to the instant the last Aura message was sent. For single-interaction sessions its value is the same as DURATION_NU; for sessions with multiple interactions it covers the whole span from the first interaction to the last.

As with sesSize, only the first interaction in a session is annotated with sesDuration; the remaining interactions will be assigned a 0 value (and interactions that do not correspond to a session will be left empty). Therefore, to compute statistics on sesDuration, remove the 0 and empty values first.
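
The session rules above can be sketched in Python as follows. This is only an illustrative sketch, not the actual PPD cleaner code: the function name, the `ses-N` identifier format and the choice to measure the gap from the previous Aura response to the next user message are all assumptions.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=5)  # threshold stated in the text


def label_sessions(interactions):
    """Label one user's interactions on one channel with session fields.

    `interactions` is a time-sorted list of (received_at, sent_at) pairs.
    Measuring the gap from the previous Aura response to the next user
    message is an assumption of this sketch.
    """
    sessions = []
    for received_at, sent_at in interactions:
        if sessions and received_at - sessions[-1][-1][1] < SESSION_GAP:
            sessions[-1].append((received_at, sent_at))
        else:
            sessions.append([(received_at, sent_at)])

    rows = []
    for number, session in enumerate(sessions, start=1):
        # First user message received -> last Aura message sent, in ms.
        duration_ms = int((session[-1][1] - session[0][0]).total_seconds() * 1000)
        for position in range(len(session)):
            first = position == 0
            rows.append({
                "sesId": f"ses-{number}",  # hypothetical identifier format
                "sesSize": len(session) if first else 0,
                "sesDuration": duration_ms if first else 0,
            })
    return rows


start = datetime(2024, 1, 1, 10, 0, 0)
rows = label_sessions([
    (start, start + timedelta(seconds=2)),
    (start + timedelta(minutes=1), start + timedelta(minutes=1, seconds=3)),
    (start + timedelta(minutes=30), start + timedelta(minutes=30, seconds=1)),
])

# Statistics on valid sessions: drop the zero (and empty) values first.
sizes = [row["sesSize"] for row in rows if row["sesSize"]]
print(sum(sizes) / len(sizes))  # average session size: 1.5
```

Only the first row of each session carries non-zero sesSize and sesDuration, which is why the average is computed after filtering out the zeros.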

userType

This field may be used, in certain cases, to help identify rows that do not correspond to real users but to test users (internal users that belong to test/QA teams and whose behaviour is, therefore, not representative of actual Aura users).

The field contains a single character, which is s for standard (real) users, and can be Q or T for QA/Test users respectively (there are also lowercased versions q and t, referring to unconfirmed test users).

Note that test user identification is not available in every country, since it depends on having a register of the AURA_GLOBAL_ID identifiers that QA/Test users authenticate with, and this register is not always available.

usrMsgSig

This field is not useful by itself. Instead, it is intended to help group together very similar user utterances. It does so by generating a signature of the utterance that is (hopefully) insensitive to small variations in the sentence.

This is an experimental field; it might change if we reach a variant that is better suited for its purpose.

The way to generate this signature is by following these steps with the utterance:

  • Start with the normalized utterance (i.e., MESSAGE_USR_NORM).

  • Perform stemming (removal of word suffixes) on all the words. This makes bills and bill the same word.

  • Substitute words from a fixed list of very common, uninformative tokens (stopwords) by an asterisk. For example, this converts both “get my bill” and “get the bill” to the same phrase “get * bill”.

  • Group words in sets of 3 elements (trigrams) and sort them alphabetically. This removes the global structure of the sentence, while retaining local structure.
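
As a rough illustration of these steps, the sketch below builds a signature in Python. It is not the actual implementation: the stopword list, the toy suffix stripper (standing in for a real stemmer), the sliding-window reading of "sets of 3 elements" and the `|` separator are all assumptions made for the example.

```python
STOPWORDS = {"my", "the", "a"}  # hypothetical stopword list


def naive_stem(word):
    """Toy suffix stripper, standing in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def signature(normalized_utterance):
    """Build a signature from an already normalized utterance."""
    # 1. Stem every word.
    tokens = [naive_stem(word) for word in normalized_utterance.split()]
    # 2. Replace stopwords by an asterisk.
    tokens = ["*" if token in STOPWORDS else token for token in tokens]
    # 3. Build trigrams (sliding window, an assumption) and sort them.
    if len(tokens) < 3:
        trigrams = [" ".join(tokens)]
    else:
        trigrams = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    return "|".join(sorted(trigrams))


print(signature("get my bill"))     # get * bill
print(signature("get the bills"))   # get * bill
```

Both utterances produce the same signature, so they would fall into the same bucket when aggregating by this field.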

The resulting string is a non-understandable version of the original utterance (hence, it cannot be used by itself), but the fact that several very similar utterances produce the same signature helps cluster those utterances. An example is one of the preinstalled visualizations “Most Frequent User Utterances” which uses this field to group very similar utterances.

Another example is provided in the following figure, which shows message utterances generating the same signature:

Message utterances

As can be seen, the signature is the same for “how can I upgrade” and “when can I upgrade”, “when does my contract end” and “when is my contract ending”, and “live chat” and “live chats”. So, they would be counted together when aggregating by signature.

The procedure has its limitations and, as explained, it is experimental and still being improved, but it already mitigates some of the inherent variability in user expressions.

AuraMsgGroup

Messages produced by Aura are generated from its text resource database. In some cases, the same category of message produces different output texts, either because the message includes some user-dependent parameter or because the text database contains several variants of the same text (and Aura picks one at random).

The AuraMsgGroup field is a keyword field that helps categorize Aura answers by abstracting away some of this variation. It classifies each response given by Aura into one of two types of elements:

  • Generic group: a name such as <NONE>, <GREETING> or <NOTFOUND>, which corresponds to a response category (see Table 3)

  • Truncated answer: for answers that do not have a defined generic group, as a fallback the literal answer text is inserted, after substituting all numbers in it with a placeholder and truncating it (i.e., retain only the first characters).
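
A minimal Python sketch of this classification is shown below. The translation table contents, the `#` placeholder and the truncation length are hypothetical values chosen for illustration; the real ones live in the PPD cleaner process.

```python
import re

# Hypothetical translation table: exact Aura text -> generic group.
GENERIC_GROUPS = {
    "Hello! How can I help you?": "<GREETING>",
    "Sorry, I did not understand that.": "<NONE>",
}
MAX_CHARS = 30  # hypothetical truncation length


def aura_msg_group(answer):
    """Return the generic group, or a normalized, truncated answer text."""
    if not answer:
        return "<EMPTY>"
    if answer in GENERIC_GROUPS:
        return GENERIC_GROUPS[answer]
    # Fallback: replace every number with a placeholder, then truncate.
    return re.sub(r"\d+(?:[.,]\d+)*", "#", answer)[:MAX_CHARS]


print(aura_msg_group("Hello! How can I help you?"))   # <GREETING>
print(aura_msg_group("Your balance is 12.50 euros"))  # Your balance is # euros
```

Replacing numbers before truncating means that answers differing only in a user-dependent amount collapse into the same group.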

The following table contains the generic groups defined so far. They correspond to the most frequent Aura messages. The list is country-dependent, since it depends on the use cases deployed in each country. As said above, responses not falling into these groups are assigned a truncated version of the response text.

Note that the list of most frequent Aura messages can be enlarged with time. Also, the correspondence between Aura messages and groups is not static: if the text database is updated with new variants, the translation table in the PPD cleaner process that generates this field must also be updated.

Group Meaning
EMPTY No textual answer from Aura (see note in Section MESSAGE_AURA for the usual meaning of no text answer)
NONE Aura says it did not understand the user utterance
ERR There was a processing error of some kind at Aura side, and the request could not be fulfilled
GREETING Aura is greeting the user
GOODBYE Aura is acknowledging a conversation end
YOU-ARE-WELCOME Aura is accepting a compliment
CHURN Aura recognizes the user intention to terminate a contract
NOTFOUND Aura tried to search for some bit of data concerning the user query, and could not find it
CANNOT Aura cannot fulfil the user request because of insufficient information (in the query, or on user data)
BILL-INFO The user requested information about her bill, and Aura is returning it
DATA-INFO The user requested information about her data usage, and Aura is returning it


1.4 - Annex: Dataset fields

Annex: Dataset fields detail

Explanation of the process that each field of the data model goes through on its way to a clean PPD

Introduction

The objective of the following tables is to explain the process that each field goes through within this flow:

AURA DATASET → PPD_RAW → PPD_CLEAN

  • Each cell of the table explains the processing that the data field undergoes before it reaches the corresponding stage (table column).

  • For example, the field AURA_ID_GLOBAL is hashed before it is stored in PPD_RAW. After this, the hashed value is passed to PPD_CLEAN without any further processing.

Tables used in the Active Listening process are described in the following sections. They belong to the Aura Entities database.

  • Columns “FIELD” and “DESCRIPTION”: instances managed by the OB

  • Columns “PPD RAW” and “PPD CLEAN”: instances managed by Aura Global Team

MESSAGE dataset

Message dataset (stored in local Kernel)

  • Columns “FIELD” and “DESCRIPTION”: instances managed by the OB

  • Columns “PPD RAW” and “PPD CLEAN”: instances managed by Aura Global Team

# FIELD                 DESCRIPTION   PPD RAW       PPD CLEAN
1 USER_ID               Unique user ID in the OB systems NOT transferred NOT transferred
2 MSG_DT                 Timestamp of the data
3 MSG_ID                 Unique ID of the message                         NOT transferred
4 ACTION_CD             Code of the action that produces the data                     NOT transferred
5 AURA_ID               User logging ID in Aura. The user will have a new Aura_id each time she logs in Aura. Hashed
6 PHONE_ID               Phone number of the user    NOT transferred NOT transferred
7 CHANNEL_CD             Code of the channel where the action happened
8 SUBSCRIPTION_CD       Code of the subscription type of the user in the OB           NOT transferred
9 DOMAIN_CD             Code of the domain where the action happened                 NOT transferred
10 CATEGORY_CD           Code of the category where the action happened               NOT transferred
11 COUNTRY_CD             Code of the country NOT transferred
12 CORR_ID               Correlator ID of the request that produces this data
13 IS_CACHED             Shows if the entity content was already cached or not     NOT transferred
14 STATUS_CD             Status code of the action, if meaningful
15 REASON                 Result of the action in error case, code of the error   NOT transferred
16 VERSION_ID             Aura version that produces this data
17 LANG_CD               Language configured by the user for communication   NOT transferred
18 TZ_CD                 Timezone where the communication happened NOT transferred
19 DURATION_NU           Duration in milliseconds of the action
20 MESSAGE               Content of the message   Anonymized
21 DIALOG_ID             Id of the dialog where the message happens
22 CONVERSATION_ID       Id of the conversation where the message happens               NOT transferred
23 WIN_RECOGNIZER_CD     Code of the recognizer that wins for this message             NOT transferred
24 WIN_RECOGNIZER_SCORE_NU Score of the recognizer that wins for this message             NOT transferred
25 INTENT                 Selected intent
26 ENTITIES               List of entities determined by the recognizer
27 MODALITY_CD           How does the user communicate with Aura
28 AURA_ID_GLOBAL         Identifies the same user_id logged with the same authentication method Hashed
29 ACCOUNT_NUMBER         Unique account number of the user               NOT transferred NOT transferred

RECOGNIZER dataset

Recognizer dataset stored in local Kernel

  • Columns “FIELD” and “DESCRIPTION”: instances managed by the OB

  • Columns “PPD RAW” and “PPD CLEAN”: instances managed by Aura Global Team

# FIELD                   DESCRIPTION       PPD RAW       PPD CLEAN
1 USER_ID                 Unique user ID in the OB systems   NOT transferred NOT transferred             
2 RECOGNIZER_DT           Timestamp of the data                              
3 RECOGNIZER_ID           Unique ID of the recognizer                              
4 ACTION_CD               Code of the action that produces the data                 NOT transferred
5 AURA_ID                 User logging ID in Aura. The user will have a new Aura_id each time she logs in Aura. Hashed              
6 PHONE_ID               Phone number of the user   NOT transferred NOT transferred 
7 CHANNEL_CD             Code of the channel where the action happened                              
8 DOMAIN_CD               Code of the domain where the action happened     NOT transferred      
9 CATEGORY_CD             Code of the category where the action happened     NOT transferred  
10 COUNTRY_CD             Code of the country                NOT transferred         
11 CORR_ID                 Correlator ID of the request that produces this data                            
12 IS_CACHED               Shows if the entity content was already cached or not NOT transferred     
13 STATUS_CD               Status code of the action, if meaningful                            
14 REASON                 Result of the action in error case, code of the error                              
15 VERSION_ID             Aura version that produces this data                              
16 LANG_CD                 Language configured by the user for communication NOT transferred             
17 TZ_CD                   Timezone where the communication happened    NOT transferred            
18 DURATION_NU             Duration in milliseconds of the action                              
19 SCORE_NU               Score returned by the recognizer                                  
20 INPUT                   User input sent to the recognizer. Null if incoming message is an AuraCommand Anonymized              
21 OUTPUT                 Complete output generated by the recognizer                            
22 INTENT                 Intent returned by the recognizer                               
23 ENTITIES               Entities returned by the recognizer due to the intent                            
24 COMMON_THRESHOLD_NU     Common threshold used to determine the best answer of all recognizers                 NOT transferred
25 THRESHOLD               Specific threshold of the specific recognizer being executed  NOT transferred            
26 EXPECTED_INTENT         Intent expected to be returned by the recognizer  NOT transferred            
27 EXPECTED_ENTITIES       Entities expected to be returned by the recognizer due to the intent             NOT transferred
28 AURA_ID_GLOBAL         Identifies the same user_id logged with the same authentication method Hashed              
29 ACCOUNT_NUMBER         Unique account number of the user    NOT transferred NOT transferred             


API dataset

API request dataset (stored in local Kernel)

  • Columns “FIELD” and “DESCRIPTION”: instances managed by the OB

  • Columns “PPD RAW” and “PPD CLEAN”: instances managed by Aura Global Team

#   FIELD            DESCRIPTION     PPD RAW       PPD CLEAN
1   USER_ID               Unique user ID in the OB systems                                   NOT transferred NOT transferred             
2   REQUEST_DT           Timestamp of the data                                                                        
3   REQUEST_ID           Unique ID of the request                                                                      
4   ACTION_CD             Code of the action that produces the data                                       NOT transferred
5   AURA_ID               User logging ID in Aura. The user will have a new Aura_id each time she logs in Aura Hashed NOT transferred
6   PHONE_ID             Phone number of the user                                           NOT transferred NOT transferred
7   CHANNEL_CD           Code of the channel where the action happened                                   NOT transferred
8   DOMAIN_CD             Code of the domain where the action happened                                   NOT transferred
9   CATEGORY_CD           Code of the category where the action happened                                 NOT transferred
10 COUNTRY_CD           Code of the country                                                             NOT transferred
11 CORR_ID               Correlator ID of the request that produces this data                                        
12 IS_CACHED             Shows if the entity content was already cached or not             NOT transferred NOT transferred             
13 STATUS_CD             Status code of the API request                                                                
14 REASON               Result of the action in error case, code of the error                                        
15 VERSION_ID           Aura version that produces this data                               NOT transferred
16 LANG_CD               Language configured by the user for communication                    NOT transferred          
17 TZ_CD                 Timezone where the communication happened                                      
18 DURATION_NU           Duration in milliseconds of the action                                                        
19 HOST                 Host of the API                                                                              
20 PATH                 Specific path of the API being called                               NOT transferred           
21 HTTP_STATUS           HTTP status of the server response                                  NOT transferred            
22 RESPONSE             Response body                                                     Anonymized                  
23 AURA_ID_GLOBAL       Identifies the same user_id logged with the same authentication method Hashed NOT transferred
24 ACCOUNT_NUMBER       Unique account number of the user         NOT transferred NOT transferred             
25 REQUEST               Request body                                                                                

2 - Aura Analytics 2.0.0

Aura Analytics 2.0.0

Description of Aura Analytics 2.0.0, the monitoring tool designed and managed by Aura Global Team that allows active listening in Aura

What is Aura Analytics 2.0.0?

Active listening is defined as a key process that involves a continuous monitoring of Aura performance based on real logs from the users to analyze them and gather insights on the efficiency and effectiveness of Aura as a system and also to track the interaction of our users with Aura.

In this framework, Aura Analytics 2.0.0 is a tool used by Aura Global Team that uses active listening with the ultimate goal of improving Aura quality, as it generates accurate information to carry out both corrective and predictive actions and to decide how Aura should evolve in the future.

How does Aura Analytics 2.0.0 work?

  • The process is built upon Aura user logs generated in the production environment

  • From these logs, Aura Analytics 2.0.0 creates PPDs (Privacy-Preserving Datasets)

  • Datasets are processed, enabling the visualization through dashboards and the extraction of statistical insights

  • The Aura Global Team consumes this data to support decision-making processes

Target users

  • The Aura Global Team is the target user of the Aura Analytics 2.0.0 tool, responsible for its design and management as well as for the interpretation of results for decision-making.

  • OBs should allow the generation of datasets from their Aura user logs in their local environment just by installing and executing a single process, as shown in the document Guidelines for OBs.

Index of documents

Aura Analytics 2.0.0 includes the following documents:

Aura Analytics versions

Release 1.0.0

The first release, 1.0.0, sets up the basic paths, deploys the PPD infrastructure and produces:

  • Version 1.0.0 of the OB Analytics system, which includes the OB Dashboard.
  • The first version of pre-processed datasets (clean PPDs) for training and analytics at Aura Global.

As mentioned, this version enables going further by:

  • Enhancing the analytics dashboard with new visualizations.

  • Processing the PPD files as desired (they are standard CSV files, which can be ingested in alternative platforms). Restrictions on them are softer than on the original logs due to the anonymization process they have undergone, although they are still subject to management precautions (a code of conduct is being prepared for that).

Release 1.1.0

Version 1.1.0 introduces the following changes:

  • The table of data has been enlarged with these new fields: AURA_ID, STATUS_CD, sesId, sesSize, sesDuration.
  • An expanded list of test users is used, so that the userType column contains more identifications.
  • The code for data ingestion into a local Kibana, which previously consisted of a single Python script, has been turned into a full installable Python package due to its increasing complexity.

Release 2.0.0

Version 2.0.0 introduces the following changes:

  • Aura Analytics has been refactored to improve its structure and make it easier to understand, maintain and extend in the future.
  • The deployment and execution process has been simplified.
  • One of the most significant enhancements is the capability to manage both processed and to-process files centrally in one place (a database).

The prerequisites for the use of Aura Analytics 2.0.0 are set below:

  • Recommended tool for data visualization: ELK stack

2.1 - Architecture

Aura Analytics 2.0.0 architecture

Technical architecture of Aura Analytics 2.0.0 and description of main processes and components

Architecture overview

Aura Analytics 2.0.0 contains two different environments:

  • OB local environment: Processes on this side are managed by the OB, which should install and execute certain processes related to the PPD-Creator for the creation of raw datasets.

  • Global environment: Processes here are managed by Aura Global Team for data recovery, processing and generation of dashboards and statistics. The output includes data and metrics to be consumed by Aura Global Team for decision-making.

Aura Analytics 2.0.0 architecture flowchart

The following diagram shows an overview of Aura Analytics 2.0.0 architecture, including the environments involved and the main components and processes, which are fully described in succeeding sections.

Figure 1. Aura Analytics 2.0.0 Architecture flowchart

Aura Analytics 2.0.0 processes

PPD-Creator process

The PPD-Creator is a Python module for the creation of PPD-Raw datasets.

It is the only component that belongs to the OB environment. The OB should install it and is responsible for its execution. The PPD-Raw datasets will be stored in the destination blob PPD-RAW.

This process reads the files included in OB MANAGED INSTANCES columns of the tables in Annex: Dataset fields. The result of this process is the PPD RAW columns of the tables in the above-mentioned annex.

The main tasks executed by the PPD-Creator are summarized below:

  • Read the Aura log files in a Kernel blob.
  • Anonymize the sensitive fields (AuraID, AuraGlobalID, and personal information in the user sentence such as DNI, phone numbers, etc.).
  • Save the anonymized files to another blob directory (PPD-Raw).
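
The anonymization step can be pictured with the following Python sketch. It is not the PPD-Creator code: the use of HMAC-SHA256 for hashing identifiers, the regular expressions and the `<email>` and `<phone>` placeholders are assumptions chosen to illustrate the idea of hashing IDs and masking personal data in free text.

```python
import hashlib
import hmac
import re

# Hypothetical patterns; a real deployment needs country-specific rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{9}\b")  # assumes 9-digit phone numbers


def hash_identifier(value, key):
    """Keyed hash of an identifier such as AuraID (HMAC-SHA256 assumed)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_text(text):
    """Mask e-mail addresses and phone numbers found in a user sentence."""
    text = EMAIL_RE.sub("<email>", text)
    return PHONE_RE.sub("<phone>", text)


key = b"example-anonymization-key"  # placeholder, never a real key
print(hash_identifier("aura-user-123", key)[:16])
print(anonymize_text("call 612345678 or mail ana@example.com"))
# call <phone> or mail <email>
```

A keyed hash keeps the same user mapped to the same value across files, so sessions and recurrence can still be computed on the anonymized data.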

Figure 2. PPD-Creator process

The PPD-Creator anonymizes the following data, in the different OBs:

ES: dni, nie, phone, email
UK: creditcard, insurance, postcode, imei, phone, imsi, email, twitter, passport

Manage PPD-Raw process

The Manage PPD-Raw process inserts the paths of the PPD-Raw files (the output of the PPD-Creator) into a PostgreSQL table, so that files are managed centrally in a database:

  • It reads the JSON file produced by the PPD-Creator
  • Afterwards, it saves the paths to the PostgreSQL server

Figure 3. Manage PPD-Raw process

PPD-Clean process

The PPD-Clean is a Python package used to clean PPD-Raw datasets.

First, this process locates the directory that contains the PPD-Raw files, reads the corresponding files and processes them.

Once the process is finished, it records the processed files in the files_processed table in the database and saves the results in the PPD-Clean directory.

The main tasks executed by the PPD-Clean are summarized below:

  • Apply transformations to columns
  • Extract the explicit frustration
  • Calculate the n-grams of None utterances (user phrases without an intent)
  • Save the data in a directory or blob, the PostgreSQL server and ElasticSearch for visualization
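
The n-gram task can be sketched as below. The field names INTENT and MESSAGE come from the MESSAGE dataset, but treating an empty INTENT as a "None" interaction, the lowercasing and the whitespace tokenization are assumptions of this example rather than the actual PPD-Clean logic.

```python
from collections import Counter


def none_ngrams(rows, n=2):
    """Count word n-grams over messages that have no intent."""
    counts = Counter()
    for row in rows:
        if row.get("INTENT"):  # skip rows where an intent was selected
            continue
        words = row["MESSAGE"].lower().split()
        counts.update(
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    return counts


rows = [
    {"INTENT": "", "MESSAGE": "cancel my line"},
    {"INTENT": "bill", "MESSAGE": "show my bill"},
    {"INTENT": None, "MESSAGE": "Cancel my contract"},
]
print(none_ngrams(rows).most_common(1))  # [('cancel my', 2)]
```

Frequent n-grams among unrecognized utterances point at topics users ask about that no intent currently covers.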

Figure 4. PPD-Clean process

User Dynamics process

User dynamics is a script used to measure user behavior through metrics. It extracts statistics on the recurrence of users in Aura on a monthly basis.

The processes executed are summarized below:

  • User dynamics reads the file_processed table of the database and all the PPD-Clean files stored for 1 month.
  • It extracts metrics regarding new users, recurrent users, lost users and recovered users.
  • Afterwards, it saves these metrics in the User_dynamics schema, in a PostgreSQL database, within the tables connections, daycount, user and channel.
  • Data is also saved in ElasticSearch.

Figure 5. User Dynamics process

Components

Active Listening Database

The Active Listening Database is a PostgreSQL database that stores the processed and to-process files centrally in one place. It is used by the PPD-Clean and User Dynamics processes to store the processed data and metrics.

Schema files management

Each process in the Active Listening project has its own input and output files. The files management database provides a more efficient way to manage the raw files:

  • The PPD-Creator process transfers files from the OB to a shared blob.
  • The transferred files are written to a file in that blob called aura-sync-cache-dst.json.
  • The manage_ppd_raw process will read the aura-sync-cache-dst.json file from the PPD-Raw folder and ingest the records into the FILE_PPD_RAW table of the database.
  • It will also insert into the EMPTY_DATA_FILES table the days that are not found in the JSON file. This table is necessary for logging metrics in Prometheus. This process will run daily.
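
The ingestion logic described above might look like the following sketch. Several things here are assumptions: the structure of aura-sync-cache-dst.json (taken to be a JSON list of file paths containing a YYYY-MM-DD date), the column names, and the use of an in-memory SQLite database as a stand-in for PostgreSQL so that the example is self-contained.

```python
import json
import re
import sqlite3
from datetime import date, timedelta

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed date format in paths


def ingest(conn, cache_json, first_day, last_day):
    """Record transferred file paths and the days without any file."""
    paths = json.loads(cache_json)  # assumed: a JSON list of file paths
    conn.execute(
        "CREATE TABLE IF NOT EXISTS FILE_PPD_RAW (path TEXT PRIMARY KEY, day TEXT)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS EMPTY_DATA_FILES (day TEXT PRIMARY KEY)")
    seen_days = set()
    for path in paths:
        match = DATE_RE.search(path)
        if match is None:
            continue  # path without a recognizable date
        seen_days.add(match.group())
        # SQLite syntax; PostgreSQL would use ON CONFLICT DO NOTHING.
        conn.execute(
            "INSERT OR IGNORE INTO FILE_PPD_RAW VALUES (?, ?)", (path, match.group())
        )
    day = first_day
    while day <= last_day:
        if day.isoformat() not in seen_days:
            conn.execute(
                "INSERT OR IGNORE INTO EMPTY_DATA_FILES VALUES (?)", (day.isoformat(),)
            )
        day += timedelta(days=1)
    conn.commit()


conn = sqlite3.connect(":memory:")
cache = '["es/message/2024-01-01.csv", "es/message/2024-01-03.csv"]'
ingest(conn, cache, date(2024, 1, 1), date(2024, 1, 3))
missing = [row[0] for row in conn.execute("SELECT day FROM EMPTY_DATA_FILES")]
print(missing)  # ['2024-01-02']
```

Because both inserts ignore existing keys, running the process daily over the same JSON file does not create duplicate records.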

Figure 6. Files management database

Schema user dynamics

The User Dynamics process generates the statistics of Aura users, number of daily active users and types of users, with 4 categories: new, recurring, lost and recovered.

  • The Channel table contains all the channels in Aura that have been processed by the User Dynamics process.
  • The User table contains the unique Aura users in each environment and country.
  • The Daycount table contains the number of total users for each day, indicating how many of them are new, recurring, recovered or lost users, the number of weekly unique recurring users and the number of monthly unique recurring users.
  • The Connection table has the status of the user for each day (whether it is new, recurring, lost or recovered).

Figure 7. User dynamics database

Aura Analytics Dashboard

Among other elements, Aura Analytics 2.0.0 produces an analytics component named Aura Analytics Dashboard, which Aura Global Team uses to gather statistics on the production system and to analyze user behavior.

This Analytics Dashboard is based on the ELK stack that contains:

  • ElasticSearch: distributed search and analytics engine at the heart of the Elastic Stack. It allows the storage of data and its subsequent indexing, search and analysis.
  • Kibana: provides a visualization tool that includes dashboards and panels created over the ElasticSearch data. Users interactively explore, visualize and share insights into data and manage and monitor the stack.

Once installed:

  • An ElasticSearch index is created. It is called aura-ppd-ENTITY-COUNTRY-YEAR, and its index schema contains a cleaned version of the AURA MESSAGE, RECOGNIZER or API tables (which registers input and output messages).
  • A Kibana index pattern is created, matching the uploaded ElasticSearch index.
  • A pre-defined set of visualizations is installed in Kibana over that index pattern, as a means to get a first look at the index data. See the section pre-installed analytics dashboard.
  • The system automatically ingests any new clean PPD being produced in the ElasticSearch database, so that the index and dashboards remain up to date.
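
As a small illustration of the aura-ppd-ENTITY-COUNTRY-YEAR naming pattern, the helper below composes an index name. The function itself is hypothetical; the lowercasing is applied because Elasticsearch only accepts lowercase index names, and whether the real process formats the parts exactly this way is an assumption.

```python
def index_name(entity, country, year):
    """Compose an index name following aura-ppd-ENTITY-COUNTRY-YEAR."""
    return f"aura-ppd-{entity}-{country}-{year}".lower()


print(index_name("MESSAGE", "es", 2024))  # aura-ppd-message-es-2024
```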

In principle, the PPD creation process specifies daily production, since Aura logs are sent to Kernel once a day. This means that information about Aura behavior and user actions on one given day will be available in the dashboards of the following day.

As mentioned above, the Aura Analytics Dashboard is conceived to be used by Aura Global Team. However, OBs can install the ELK stack, or any other visualization tool, locally for data consumption. See the document Local data visualization for further details.

2.2 - Operation

Aura Analytics 2.0.0 operation

Discover Aura Analytics 2.0.0 operation at a glance

Introduction

Based on the Aura Analytics 2.0.0 architecture, this document provides an overview of its global operation.

Take the Aura Analytics 2.0.0 architecture flowchart as a reference to follow each step of the dataflow described below:

  1. Aura logs generated in local instance are converted to datasets and transferred to local Kernel via the standard procedure and with the established frequency (typically, daily). Once there, the Active listening process flow fires up daily.

  2. PPD-Creator: This is the first process that runs, and it is the only one executed in the OBs’ environment. It retrieves Kernel data, anonymizes all sensitive data that could identify users, and then transfers this data to an environment shared with the Aura Global team.

  3. Manage-PPD-Raw: This is the first process executed from the global environment. It solely stores the paths of the data transferred by the PPD-Creator into a PostgreSQL database to keep a record of which data has been transferred.

  4. PPD-Clean: This process runs from the global environment. Once the data is anonymized, it is processed to extract additional features (such as user frustration or n-grams from user phrases in interactions that do not have an intent).

  5. Once the data is processed, a path is saved in the environment and also in ElasticSearch to create dashboards that track Aura usage by its customers.

  6. User-Dynamics: This is the last process, also executed in the global environment. It is responsible for extracting statistics about users’ recurrence and the number of users per day. In particular, it identifies new users, recurring users (those making interactions every day), recovered users (those who stopped using Aura at some point and have returned to the system) and lost users (those who have not used Aura for 3 months).
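
The four user categories can be sketched as a small classification function. The exact rules used by User-Dynamics are not spelled out here, so this sketch makes its own assumptions: "3 months" is approximated as 90 days, a user is "recovered" when returning after more than 90 days of inactivity, and any other returning user counts as "recurring".

```python
from datetime import date, timedelta

LOST_AFTER = timedelta(days=90)  # "3 months", approximated (assumption)


def user_status(active_days, day):
    """Classify a user on `day` given the set of days with interactions."""
    past = sorted(d for d in active_days if d < day)
    if day in active_days:
        if not past:
            return "new"
        return "recovered" if day - past[-1] > LOST_AFTER else "recurring"
    if past and day - past[-1] > LOST_AFTER:
        return "lost"
    return None  # inactive today, but not yet considered lost


days = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 6, 1)}
for probe in (date(2024, 1, 1), date(2024, 1, 2), date(2024, 5, 1), date(2024, 6, 1)):
    print(probe, user_status(days, probe))
# 2024-01-01 new
# 2024-01-02 recurring
# 2024-05-01 lost
# 2024-06-01 recovered
```

Aggregating these per-day statuses over all users gives the kind of daily counts stored in the Daycount table.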

Examples of different dashboards are included below:

Figure 1. Users dashboard

Figure 2. Daily users dashboard

Figure 3. Weekly users dashboard

Figure 4. Trends dashboards

2.3 - Guidelines for OBs

Guidelines for OBs

Guidelines for OBs to allow the generation of datasets from their Aura users logs in local environment

Introduction

As seen in the Aura Analytics 2.0.0 architecture flowchart, Aura Analytics 2.0.0 contains two different environments: the OB local environment, managed by the OB and the Global one, managed by Aura Global Team.

Within this framework, the current guidelines are tailored towards OBs, indicating how to install and execute the PPD-Creator, for the creation and processing of PPD RAW datasets.

Once it is carried out, the subsequent processes of Aura Analytics 2.0.0 are executed in global environment by Aura Global Team.

Installation of PPD-Creator

The OB must install the PPD-Creator and is responsible for its execution. The resulting PPD-Raw datasets are stored in a specific destination blob, PPD-RAW.

Guidelines are included in the Aurak8s installer documentation: Active listening deployment.

Execution of PPD-Creator

The OBs must execute the PPD-Creator once it has been installed.

Parameters to launch the PPD-Creator

These are the parameters that the PPD-Creator takes from Kernel:

Mandatory parameters:

Parameter Description Example
--src-type Source type ‘aws’, ‘local’, ‘azure’
--src-name Source folder name (raw files) /directory/data/
--dst-type Destination type ‘aws’, ‘local’, ‘azure’
--dst-name Destination folder name (raw files processed) /directory/data/output
--country Customize for a lang_country pair es
--table Table to process {message, grootmessage, recognizer, api} MESSAGE
--environment Environment to process (ap-one, ap-two, prod) ap-one
OPERATIONS
--transfer File processing: transfer, anonymize and group files
--copy Raw file copy – no processing
--copy-fix File copy + apply small fixes
--show Show available files; no action performed

Optional parameters:

Parameter Description
--src-user Access key for the source folder
--src-pass Source access secret
--src-encryption Encryption key for data in source
--dst-user Access key for the destination folder
--dst-pass Destination access secret
--dst-encryption Encryption key for data in destination
--cloud-type Cloud type (‘aws’,’azure’)
--dry-run Process dry-run: no action performed
--reraise Re-raise exceptions on errors
--options-file Read additional options from a file in dest repo
--verbose Verbose level
--console Act as a console app (format logs as console messages, raise on errors)
--dest-log-skip Do not write the logfile at the destination folder
--dest-log-dir Logging subdirectory at destination
--show-input Print out input arguments
--anon-key/ encryption-key Encryption key for anonymization of sensitive columns
--max-files Maximum number of files to process
--folders/month Restrict transfer to certain folders (i.e. months)
--tables Restrict transfer to certain tables
--reset Reset the index and re-process
--encrypt-index Save the index files encrypted or unencrypted (default depends on cloud type)
--raw Copy as raw data instead of text file
--fix-header Fix file header
--fix-anon Fix anonymized fields in API columns

Launching PPD-Creator

Execute the following command:

docker run aura/ppd-creator --country <country-code> --anon-key <KEY> <source-params> <dest-params>

Example:

docker run aura/ppd-creator \
   --country ar \
   --anon-key as34-dre23-4127 \
   --src-name 4P-bucket-name-for-uk \
   --src-user EF45IHWD34DE4FGA \
   --src-pass k/Erf/6DSWWPjhdde1/abc123def-2331ldf \
   --dst-name aura-ppd-ar \
   --dst-user EF4341sdf3EFGUA1 \
   --dst-pass J/DQW/Sdde5k12ldsf/1abcde12dd1d-123c11 \
   --dst-encryption 1234ab56-12a3-45eb-8e06-8c522cdbb668-75f1b00f-6ca6-4a13-a741-64514cce728b \
   --table message \
   --environment prod \
   --transfer
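When the command is launched from a script, it can be convenient to assemble the arguments programmatically and to read the secrets from environment variables rather than typing them inline. The following is a minimal sketch under stated assumptions: the helper build_ppd_command and the environment variable names (PPD_ANON_KEY, PPD_SRC_USER, and so on) are hypothetical, and only the flags themselves come from the parameter tables above.

```python
import os


def build_ppd_command(country, table, environment, src_name, dst_name, env=os.environ):
    """Return the argument list for a PPD-Creator transfer run.

    Hypothetical helper: the PPD_* variable names are illustrative only.
    """
    return [
        "docker", "run", "aura/ppd-creator",
        "--country", country,
        "--anon-key", env["PPD_ANON_KEY"],
        "--src-name", src_name,
        "--src-user", env["PPD_SRC_USER"],
        "--src-pass", env["PPD_SRC_PASS"],
        "--dst-name", dst_name,
        "--dst-user", env["PPD_DST_USER"],
        "--dst-pass", env["PPD_DST_PASS"],
        "--table", table,
        "--environment", environment,
        "--transfer",
    ]

# The resulting list can be handed to subprocess.run(command, check=True).
```

Adding --dry-run to the list first is a cheap way to check the arguments, since that option performs no action.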

Output from PPD-Creator

The output includes the following items:

  • BOT_XXXXXX.txt.bz2: raw files (processed).

  • log folder: if the logging options have been configured.

  • aura-sync-cache-dst.json: table/month: processed files (automatically generated in destination). For example:

    {
      "AURA_DATA/ES/API/202212/": [
        "BOT_04095750-724e-11ed-9565-53054255c842_ES_API_20221202T150000Z.txt.bz2",
        "BOT_d2e93fc0-7656-11ed-a8eb-49a811568ab3_ES_API_20221207T170000Z.txt.bz2",
        "BOT_987780e0-7660-11ed-ba4a-2dac114c5321_ES_API_20221207T180000Z.txt.bz2"
      ],
      ...
    }
    
  • aura-sync-cache-src.json: table/month: raw files_to_process (source). For example:

    {
      "AURA_DATA/ES/API/202212/": [
        "BOT_04095750-724e-11ed-9565-53054255c842_ES_API_20221202T150000Z.txt",
        "BOT_05a5b860-7663-11ed-bbf7-cb8fd9eb3c25_ES_API_20221207T190000Z.txt",
        "BOT_05ae43e0-7663-11ed-a0aa-8b7e0e134809_ES_API_20221207T190000Z.txt",
        "BOT_0d69fb10-7492-11ed-a1fc-95dce7e56901_ES_API_20221205T110000Z.txt"
      ],
      ...
    }
    
  • aura-sync-key-dst.json: key used to encrypt sensitive fields. For example:

    {
      "sample": "abcd1234-ab12-12ab-ab12-1abc234e56fg"
    }
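The two cache files can be compared to see which source files have not yet reached the destination. The sketch below is illustrative only: pending_files is a hypothetical helper, and it assumes (based solely on the examples above) that a processed file keeps the source file name with a .bz2 suffix appended.

```python
import json


def pending_files(src_cache, dst_cache):
    """Return, per folder, the source files with no processed counterpart.

    Assumption: destination name == source name + ".bz2".
    """
    pending = {}
    for folder, src_files in src_cache.items():
        done = {
            name[:-4] if name.endswith(".bz2") else name
            for name in dst_cache.get(folder, [])
        }
        missing = [name for name in src_files if name not in done]
        if missing:
            pending[folder] = missing
    return pending


def load_cache(path):
    """Load one of the aura-sync-cache-*.json files."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For instance, pending_files(load_cache("aura-sync-cache-src.json"), load_cache("aura-sync-cache-dst.json")) would list what is still waiting in each table/month folder.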
    

Local data visualization (optional)

As explained before, Aura Global Team will be in charge of the analysis of the generated data through the global tool Aura Analytics Dashboard.

Nevertheless, if the OB wants to visualize certain data locally:

  • This will be done following a prior agreement with the OB on privacy-related matters.
  • Aura Global Team will provide access to the clean data stored in the corresponding PPD-clean blob container.
  • The OB can install the ELK stack, or an alternative data visualization tool, locally.
  • No support will be offered by Aura Global Team for this task.

2.4 - Analytics Dashboard

Aura Analytics 2.0.0 Dashboard

Description of Aura Analytics 2.0.0 dashboard used by Aura Global Team to gather statistics on the production system and to analyze user’s behavior.

Among other elements, Aura Analytics 2.0.0 produces an analytics component named Aura Analytics Dashboard, which Aura Global Team uses to gather statistics on the production system and to analyze user behavior. This Analytics Dashboard is based on the ELK stack.

The dashboard provides a pre-defined set of visualizations, described throughout this document. Nevertheless, it is possible to build additional dashboards using the ELK stack.

Pre-installed analytics dashboard

Kibana offers many possibilities to visualize the ingested data, and there are many resources and tutorials around explaining its mechanics. We therefore refer to the official Kibana documentation, or the many tutorials available on the Web, for generic information.

In the particular case of Aura Analytics 2.0.0, there is an ElasticSearch index that gets automatically ingested daily. It is called Aura-message-COUNTRY, and contains a cleaned version of the AURA MESSAGE table (which registers input and output messages).

Over this index, three types of panels/visualizations have been pre-installed:

  • Discover panel
  • Visualizations
  • Dashboards

Discover

The Discover panel in Kibana is an essential tool for querying an ElasticSearch index (and saving those searches, if desired) and for exploring users’ interactions with Aura in detail, log by log. Results are narrowed down by:

Search terms or conditions » A time interval » Additional filters applied to the query results » A set of index fields to show in the result table.

These 4 steps are represented in Figure 1:

Discover panel Figure 1. Discover panel

As shown in this figure, the starting point is the ElasticSearch index holding all the data.

Each of the three first steps in the chain reduces the amount of data handled, by pruning out elements that do not satisfy the defined condition. The fourth step is just a display adjustment: on the final dataset, define which of the available fields will be shown on the output table that appears in the panel.
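Expressed directly against ElasticSearch, the same four-step chain roughly corresponds to a search body like the one built below. This is a sketch, not the query Kibana actually generates: build_discover_query is a hypothetical helper, and the name of the time field has to be supplied by the caller because it is not stated in this document.

```python
def build_discover_query(search, time_field, start, end, filters, fields):
    """Build an ElasticSearch search body mirroring the four Discover steps."""
    # Step 1: search terms (a blank query returns everything).
    must = [{"query_string": {"query": search}}] if search else [{"match_all": {}}]
    # Step 2: time interval, followed by step 3: additional filters.
    narrowing = [{"range": {time_field: {"gte": start, "lte": end}}}] + list(filters)
    return {
        "query": {"bool": {"must": must, "filter": narrowing}},
        # Step 4: display adjustment only, the result set is unchanged.
        "_source": list(fields),
    }


body = build_discover_query(
    search="",
    time_field="@timestamp",  # assumption: the real time field may differ
    start="now-7d",
    end="now",
    filters=[],
    fields=["MESSAGE_USR_NORM", "INTENT", "ENTITIES", "DIALOG_ID", "MESSAGE_AURA"],
)
```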

In the Aura Dashboard default set, there is one such Discover panel pre-installed. It is called question-answer pairs and has the following characteristics:

  • A blank query (i.e., provide all the results)
  • A time interval for the last 7 days
  • An “only user” filter: filters out all intents that correspond to non-user queries (suggestions, help commands from the client application, etc.)
  • A visualization that includes: timestamp, (cleaned) user message, detected Aura intent, associated entities (if applicable), dialog that was invoked and Aura’s response.

Figure 2 shows a snapshot of this panel. To load it, select the Discover tool in the left navigation bar, and then click on the “Open” menu option in the top menu bar. A list of saved panels will be shown, with this one in it named “question-answer pairs”.

question-answer pairs panel Figure 2. Question-answer pairs panel

Once the panel is loaded, each of the aforementioned four elements can be freely modified. For example, the interface allows:

  • Adding new filters with the “+Add Filters” button
  • Deactivating the current filters by clicking on the predefined ones and selecting the “Temporarily Disable” option
  • Modifying the query interval with the “calendar” button or “Dates Box”
  • Adding a specific query on a given index field(s) by using the “Search Box”, instead of the (default) blank query

Discover panels can be saved as named objects, to be later loaded at will. So, if needed, any panel (a modified panel or a newly created one) can be saved with a new name, to have it available for later loading.

Visualizations

A total of 7 visualizations come pre-installed with the base Aura Dashboard. The list can be obtained from the visualizations item in the left menu bar, shown in Figure 3:

  • Three “Stats” type visualizations, which provide general statistics on platform usage.
  • Four “User” type visualizations, which provide insights on user behavior.

preinstalled-visualizations-dashboard.png Figure 3. Preinstalled visualizations dashboard

Note that this distinction between “User” and “Stats” is purely conceptual and based on the fields that have been used to generate the visualizations. From the point of view of Kibana, they are all regular visualizations.

Those visualizations can be instantly loaded by clicking on their names. But they can also be integrated into dashboards, which is described in the next section.

Dashboards

A dashboard in Kibana is essentially a spatial arrangement of visualizations. For example, to construct a dashboard, we just place visualizations into a page, resizing them as we wish, so we can observe all of them in a single place afterwards.

Within a dashboard all visualizations are linked. For example, if we change the time interval or add a filter using the interface, this modification affects all visualizations in the dashboard, and all of them get updated.

Elements in the dashboard visualizations can also generate “instant filters” by clicking on graphs or table elements. Those filters are added to the top of the page as a filter afterwards and can be modified or removed.

The Aura Analytics default installation preloads two dashboards. Those are available for selection when we click on the Dashboard icon in the left navigation bar:

selection-aura-analytics-default-dashboards Figure 4. Aura analytics default dashboards

Nones dashboard

This dashboard integrates the n-grams extracted by the PPD-Clean process.

nones-dashboard Figure 5. Nones dashboards

System dashboard

This dashboard integrates the three predefined “Stats” visualizations (generic statistics):

  • A timeline of interactions (user messages sent and answered), segmented by channel
  • A heatmap of interactions by weekday and time of day (hour)
  • A bar graph classifying the interactions produced in the period by detected intent

system-dashboard Figure 6. System dashboard

User Dashboard

The User dashboard contains 4 user visualizations:

  • Most Frequent User Utterances: list of the most frequent user sentences (in the time interval and filter active at the moment). It uses the usrMsgSig field to group together very similar utterances.
  • AURA Answer Groups: list of the most frequent answers that Aura generates, grouped by the semantic categories in AuraMsgGroup field.
  • Words per query: distribution of sizes for the user messages, measured as number of words in the utterance, and segmented by channel.
  • Tag cloud: set of plain most frequent user utterances, as a tag cloud in which the font size represents the utterance frequency. The MESSAGE_USR_NORM field is used for the representation, so it contains normalized utterances.

The next screenshots show the dashboard with all these visualizations (it is a large dashboard, so typically it needs scrolling to visualize all its components).

user-dashboard Figure 7. User dashboard

Note that those four visualizations are linked in the sense of corresponding to the same subset of the data (as given by filters and time interval) but they are NOT linked at the individual item level (i.e., a given most frequent user utterance in the left table does not correspond to any specific Aura answer in the right bar graph).

Instead, the dashboard can be manipulated by selecting one specific item in any of the visualizations, and this will create a filter for the others.

For instance, as the following image shows, if we select “CHURN” in the Aura answer group visualization, we can observe in the others the user utterances that led Aura to generate that answer (i.e., an answer about contract cancelation).

Example-of-Aura-answer-groups-in-the-user-dashboard Figure 8. Example of Aura answer groups in the user dashboard

Building new visualizations and dashboards

If the OB has installed the ELK stack locally, new elements can be built (or the current ones modified) by making use of the fields available in Kibana through the ingested ElasticSearch index.

In this section, we provide a reference of the schema that the index follows, so it can be used to build such new visualizations or to better understand the existing ones.

Data model

Field types

Elements in the Aura-message data model have 3 different types:

  • Numeric: single numbers, integer or real. Suitable for numerical statistics, such as averages, or for plotting variation across time in graphs.
  • Keyword: opaque strings, i.e., terms that cannot be searched within (it is not possible to look for words inside a keyword field). They can, however, be used in some term-level queries, such as prefix queries (find all instances that begin with a given string), and they usually work well for aggregations, since most of them are categorical variables (fields that only have a limited number of possible values) and can therefore be bucketed and counted.
  • Text: these fields are divided into separate terms (words), and some pre-processing is applied to them before indexing, through an ElasticSearch analyzer, to improve access. Text fields cannot be used in aggregated visualizations, since they cannot be grouped. They are most useful for queries, because they allow searching for fragments (only a few words) and fuzzy searches.
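The practical difference between keyword and text fields shows up in the kind of ElasticSearch request each one supports. The request bodies below are a sketch: the field names come from the field list in this document, while the example values ("tv.", "cancel contract") and the aggregation name are illustrative.

```python
# Keyword field: term-level prefix query (all intents starting with "tv.").
prefix_on_keyword = {"query": {"prefix": {"INTENT": {"value": "tv."}}}}

# Keyword field: bucket and count the ten most frequent intents.
count_by_intent = {
    "size": 0,  # return only the aggregation, not the matching documents
    "aggs": {"by_intent": {"terms": {"field": "INTENT", "size": 10}}},
}

# Text field: full-text match on a few words, tolerating small typos.
fuzzy_on_text = {
    "query": {
        "match": {"MESSAGE_USR": {"query": "cancel contract", "fuzziness": "AUTO"}}
    }
}
```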

Field list

The following table lists all the available fields in the Aura-message-COUNTRY ElasticSearch index, with their type and a brief description. Some of them have more detailed explanations in Section Field explanations.

Note that some fields of text type have a mirror field of type keyword, with the same content. Having the same data indexed in two different ways at the same time (as text and as keyword) makes it possible to perform different types of analysis by choosing the right field.

The Raw column indicates if this field is already present in the AURA raw PPD files:

  • Yes: it is a field contained in raw PPDs.
  • No: it is a generated field, produced when creating clean PPDs. They can be recognized as lowercase fields.
  • Partial: it exists in the raw PPDs, but in a somewhat different form.

Field Type Raw Content
CORR_ID keyword yes Unique identifier for each interaction
VERSION_ID keyword yes Aura Platform version
CHANNEL_CD keyword yes Identifier for the channel this interaction corresponds to
STATUS_CD keyword yes Internal code related to operation status
AURA_ID_GLOBAL keyword yes (Mostly) unique identifier for the user
AURA_ID keyword yes (Mostly) local identifier for the user
INTENT keyword yes Detected user intent, including “system” intents
MESSAGE_USR text partial Text request sent by the user
MESSAGE_USR_NORM text no A normalized version of MESSAGE_USR
MESSAGE_USR_NORM.keyword keyword no A keyword version of MESSAGE_USR_NORM, to enable aggregating on it
MESSAGE_AURA text partial Text message sent by AURA to the user
MESSAGE_AURA.keyword keyword partial Keyword version of MESSAGE_AURA, to enable aggregating on it
MODALITY_CD_USR text partial Modality of the user message
MODALITY_CD_AURA text partial Modality of Aura response
ENTITIES text yes Comma-separated list of the entities recognized in the user message
DIALOG_ID text yes Identifier for the dialog that produced Aura response
DIALOG_ID.keyword keyword yes Keyword version of DIALOG_ID, to enable aggregating on it
DURATION_NU number yes Elapsed time, in ms, between the reception of the user message and the moment the response is generated to be sent to the channel
userType keyword no A single char identifier that characterizes the user as a test user
usrMsgWc number no Message word count: number of words contained in the user message
usrMsgSig keyword no Message signature: a string that helps clustering user messages
AuraMsgGroup keyword no Cluster the Aura response belongs to
weekday number no Day of the week the interaction happened (0=Monday to 6=Sunday)
hour number no (integer) hour the interaction happened
country keyword partial Two-letter code for the country
sesId keyword no Session information
sesSize number no Session information
sesDuration number no Session information
EXPLICIT_FRUSTATION number no Probability that the user message is an explicit expression of frustration
AllNGrams keyword no All n-grams of user message
AllNGramsFilter keyword no All n-grams filtered by stopwords
NGrams1 keyword no The n-grams of 1 word
NGrams1Filter keyword no The n-grams of 1 word filtered by stopwords
NGrams2 keyword no The n-grams of 2 words
NGrams3 keyword no The n-grams of 3 words
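As an illustration of the two derived time fields, the weekday convention in the table (0=Monday to 6=Sunday) matches that of Python's datetime.weekday(). The sketch below assumes a UTC timestamp in the format used by the raw file names shown earlier; whether the real pipeline derives weekday and hour from UTC or from local time is not stated here.

```python
from datetime import datetime, timezone


def weekday_and_hour(stamp):
    """Derive (weekday, hour) from a timestamp such as 20221202T150000Z.

    Assumption: the timestamp is interpreted as UTC.
    """
    moment = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return moment.weekday(), moment.hour
```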

Field explanations

This subsection contains more detailed descriptions of some of the fields in the schema.

AURA_ID_GLOBAL

This element (mostly) uniquely identifies the user generating the interaction.

Note the concrete value of this field is not the same as the current identifier used in Aura and uploaded to Kernel: for privacy reasons, the identifier was hashed when generating the PPD and has no resemblance to the original one. The correspondence is however maintained across time, so it is possible to analyze user behavior.
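The property described here (an identifier that cannot be traced back to the original one but stays stable across time) is what a keyed hash provides. The sketch below uses HMAC-SHA256 purely as an illustration: the actual hashing scheme used when generating the PPD is not documented in this section, and pseudonymize is a hypothetical name.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key):
    """Map an identifier to a stable, non-reversible hexadecimal token.

    Illustrative only: HMAC-SHA256 is an assumption, not the documented scheme.
    """
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

With a fixed key, the same input always yields the same token, so user behavior can still be followed across days without exposing the original identifier.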

The “mostly” qualifier reflects one quirk of the original Aura identifier: its generation depends on the authentication method used by the channel, so if two channels follow different authentication methods (e.g., MobileConnect vs. User/Password) then the AURA_ID_GLOBAL identifier for the same user will be different.

In summary:

  • The identifier stays the same for a given user across time.
  • No two users will have the same identifier.
  • But the same user could produce two different identifiers if they connect to two channels that use different authentication methods.

AURA_ID

This is a “local” identifier, i.e., it is generated inside the channel according to the specific channel characteristics, and it is not tied as much as AURA_ID_GLOBAL to user authentication.

Its main disadvantage is its transient nature: the same user, on the same channel, could generate different AURA_ID strings when connecting at different times, in different sessions. Therefore, for user accounting and tracing, AURA_ID_GLOBAL is usually preferred.

However, there are instances in which AURA_ID works better, namely for anonymous access (when the user is not authenticated). This depends on the channel:

  • In the WhatsApp channel, the initial use of the channel will be anonymous from Aura’s side (i.e., no authentication is done), hence AURA_ID_GLOBAL will be empty (at least until the user authenticates, which depends on the use case). But in this channel, AURA_ID has a permanent value, linked to the WhatsApp user, so here it is a good substitute for a persistent id even for unauthenticated users.

MESSAGE_USR

This field includes the message sent by the user. It has been partially processed to enhance anonymization by replacing some standard identifiers contained in it with <idxxx> strings (e.g., phone numbers appear as <idphone>).

Replacement is done mostly through regular expressions, so there might be occasional glitches (such as treating as a phone number a string that is not really one, just because it follows the phone number pattern).
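A small sketch makes the mechanism, and the kind of glitch it can produce, more concrete. The pattern below (any standalone run of nine digits) is an assumption chosen for illustration; the actual expressions used by the pipeline are not listed in this document.

```python
import re

# Hypothetical pattern: a standalone run of exactly nine digits.
PHONE_PATTERN = re.compile(r"\b\d{9}\b")


def mask_identifiers(message):
    """Replace anything that looks like a phone number with <idphone>."""
    return PHONE_PATTERN.sub("<idphone>", message)


masked_phone = mask_identifiers("call me at 612345678")
# False positive: an order number that happens to follow the same pattern.
masked_order = mask_identifiers("my order is 123456789")
```

Both strings end up with an <idphone> placeholder, although only the first one contained a phone number.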

MESSAGE_USR is a field of text type. As such, it is searchable: it is possible to search for specific words the user might have said. Furthermore, it has been processed through an ElasticSearch analyzer adapted to the specific language used. This means that searches will be able to match related words (e.g., plural versions of a singular query word, or verb conjugations). Phrase searches are also possible (by using double quotes around the phrase).

In Kibana, more sophisticated text searches can be made by switching to Lucene query syntax: proximity queries (words close to each other), fuzzy searches (query words allowing typos), wildcards, etc.
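The expressions below illustrate each of these search types in Lucene syntax, wrapped in an ElasticSearch query_string query so they can also be sent outside Kibana. The search words themselves are made up for the example, and lucene_query is a hypothetical helper.

```python
def lucene_query(expression, field="MESSAGE_USR"):
    """Wrap a Lucene-syntax expression in a query_string search body."""
    return {"query": {"query_string": {"default_field": field, "query": expression}}}


examples = {
    "phrase": lucene_query('"cancel my contract"'),   # exact phrase
    "proximity": lucene_query('"cancel contract"~3'),  # words within 3 positions
    "fuzzy": lucene_query("contrat~1"),                # up to one edit away
    "wildcard": lucene_query("bill*"),                 # any term starting with "bill"
}
```

In the Kibana search box only the expression itself is typed, for example "cancel contract"~3.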

MESSAGE_USR_NORM

This is a normalized version of MESSAGE_USR, in which the user text has been streamlined by:

  • Converting the whole sentence to lowercase
  • Removing all punctuation
  • Removing any extra spaces

Furthermore, this field is not processed through a language-dependent analyzer, as MESSAGE_USR is, so queries on this field must match words exactly. It is still a text field, however, so the same query language can be used.
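The three normalization steps can be sketched in a few lines. This is an approximation under one assumption: "punctuation" is taken to mean any character that is neither a word character nor whitespace, which may differ from the rule applied by the real pipeline.

```python
import re


def normalize_utterance(text):
    """Lowercase, strip punctuation and collapse extra spaces."""
    lowered = text.lower()
    # Assumption: everything that is not a word character or whitespace is punctuation.
    without_punctuation = re.sub(r"[^\w\s]", "", lowered)
    # split() with no arguments also drops leading, trailing and repeated spaces.
    return " ".join(without_punctuation.split())
```

For example, "  Hello,   World!  " becomes "hello world".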

MESSAGE_AURA

This contains the text message generated by Aura and sent to the user as response to the user query. It is a text field, so it is possible to search for specific words in it.

IMPORTANT In the current version of Aura KPI logs, this field contains only the text response.
Some Aura use cases do not generate a purely textual message, but a more elaborate one (e.g., a card with text and graphics). These complex answers are inserted as attachments into Aura’s response to the channel, and since attachments are not logged into the MESSAGE field, this field will appear empty in those cases.
So, an empty MESSAGE_AURA field does not necessarily mean that AURA did not provide an answer. As an alternative for those situations, looking at the DIALOG_ID field (or INTENT) may give a hint of the type of answer that Aura delivered.

MODALITY_CD_USR

This field contains the modality in which the user sent the message.

It is a slightly transformed field: because there is some variation across Aura versions, the modalities are unified by consolidating them into only four different keywords, among them audio (spoken message), text (written free-text message) and form (commands sent via automatic processing or menus).

DIALOG_ID

This field contains the identifier for the use case dialog module at the Aura Bot Framework that was selected to construct the Aura response.

Dialog identifiers have two components (library and dialog) separated by a colon, e.g., services:service-usage.

This field uses a custom analyzer that splits the identifier at the colon, generating two terms. This makes it possible to construct queries with one of the terms (e.g., “give me all the elements for the domain services”). But being a text field makes it impossible to do aggregations on it, so it cannot be used for statistics like bar charts (use DIALOG_ID.keyword for that).
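An analyzer of this kind could be declared in the index settings with a pattern tokenizer, as sketched below. The names colon_split and dialog_id are hypothetical and the real index definition may differ; the small dialog_terms function only mimics the split locally so the effect can be seen without an ElasticSearch instance.

```python
import re

# Hypothetical index settings: a custom analyzer that tokenizes on ":".
DIALOG_ID_ANALYSIS = {
    "analysis": {
        "tokenizer": {"colon_split": {"type": "pattern", "pattern": ":"}},
        "analyzer": {"dialog_id": {"type": "custom", "tokenizer": "colon_split"}},
    }
}


def dialog_terms(dialog_id):
    """Mimic locally the terms the analyzer would produce for an identifier."""
    pattern = DIALOG_ID_ANALYSIS["analysis"]["tokenizer"]["colon_split"]["pattern"]
    return [term for term in re.split(pattern, dialog_id) if term]
```

With this split, services:service-usage yields the terms services and service-usage, so a query for services alone matches every dialog in that library.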

DURATION_NU

This number reflects the time it took Aura to understand, process and respond to the user message. It is the difference (in milliseconds) between the timestamp of the moment the user message was received and the timestamp at which Aura’s response was finalized and sent to the channel.

Note that it is not a complete end-to-end delay time from the user’s point of view, since it does not include either the time it took the request to arrive to Aura through the channel or the time it took the response to travel back through the channel and get rendered at the client application (those times are outside Aura, and as such not registered by it).

Session Information (sesId, sesSize, sesDuration)

These fields are generated by running a process over the time series formed by interactions from each user at each channel. A session is automatically identified as a consecutive list of such user’s interactions, each separated from the next by a time interval shorter than 5 minutes. Once each session is identified, it is tabulated and labelled with three fields:

  1. sesId: a string forming a unique identifier for the session. It should be considered an opaque identifier; the only guarantee is that no other session in the data stream carries the same identifier.

As an aside, interactions that do not correspond to actual user interactions (because no user could be identified, or because the datapoint corresponds to an interaction not triggered by the user) are all labelled with a <void> sesId.

  2. sesSize: the number of interactions this session contains. It is set only on the first interaction in the session; all other interactions carry a 0 in this field. Non-sessions, such as the ones with a <void> sesId, are left empty. This facilitates computing averages or other statistics on valid sessions, by first filtering out all zero and empty values.

  3. sesDuration: the duration of each session, counted from the instant the first user message was received to the instant the last Aura message was sent. For single-interaction sessions its value is the same as DURATION_NU; for sessions with multiple interactions it covers the whole time span from the first to the last of them.

As with sesSize, only the first interaction in a session is annotated with sesDuration; the remaining interactions are assigned a 0 value (and interactions that do not correspond to a session are left empty). Therefore, to compute statistics on sesDuration, remove the 0 and empty values first.
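The session labelling described above can be sketched for the interactions of a single user on a single channel. Several details are assumptions made for the example: the 5-minute gap is measured between the reception times of consecutive user messages, the sesId format ("s0", "s1", ...) is invented, and <void> or empty rows are not modelled.

```python
from dataclasses import dataclass

SESSION_GAP_MS = 5 * 60 * 1000  # 5 minutes, as stated in the text


@dataclass
class Interaction:
    received_ms: int  # moment the user message was received
    duration_ms: int  # DURATION_NU for this interaction


def label_sessions(interactions):
    """Return one dict per interaction with sesId, sesSize and sesDuration."""
    sessions = []
    for item in sorted(interactions, key=lambda i: i.received_ms):
        # Assumption: the gap is measured between consecutive reception times.
        if sessions and item.received_ms - sessions[-1][-1].received_ms < SESSION_GAP_MS:
            sessions[-1].append(item)
        else:
            sessions.append([item])
    rows = []
    for number, session in enumerate(sessions):
        first, last = session[0], session[-1]
        # From first user message received to last Aura response sent.
        total = last.received_ms + last.duration_ms - first.received_ms
        for position, _ in enumerate(session):
            is_first = position == 0
            rows.append({
                "sesId": f"s{number}",  # hypothetical identifier format
                "sesSize": len(session) if is_first else 0,
                "sesDuration": total if is_first else 0,
            })
    return rows
```

Averages over sessions can then be computed on the rows whose sesSize is greater than 0, which is the filtering step recommended above.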

userType

This field may be used, in certain cases, to help identify rows that do not correspond to real users but to test users (internal users that belong to test/QA teams, and whose behavior is therefore not representative of actual Aura users). The field contains a single character, which is s for standard (real) users, and can be Q or T for QA/Test users respectively (there are also lowercase versions, q and t, which denote unconfirmed test users).

Note that test user identification is not available in every country, since it depends on having a register of the AURA_ID_GLOBAL identifiers that QA/Test users authenticate to, and this is not always available.

usrMsgSig

This field is not useful by itself. Instead, it is intended to help group together very similar user utterances. It does so by generating a signature of the utterance that is (hopefully) insensitive to small variations in the sentence. The signature is generated by applying the following steps to the utterance:

  • Start with the normalized utterance (i.e., MESSAGE_USR_NORM)
  • Perform stemming (removal of word suffixes) on all the words. This makes bills and bill the same word
  • Substitute words from a fixed list of very common, uninformative tokens (stopwords) by an asterisk. For example, this converts both “get my bill” and “get the bill” to the same phrase “get * bill”
  • Group words in sets of 3 elements (trigrams), and sort them alphabetically. This removes the global structure of the sentence, while retaining local structure.

The resulting string is a non-understandable version of the original utterance (hence it cannot be used by itself), but the fact that several very similar utterances produce the same signature helps to cluster those utterances. An example is one of the preinstalled visualizations, Most Frequent User Utterances, which uses this field to group very similar utterances.

Another example is provided in the following figure, which shows message utterances generating the same signature:

message-utterances-signature Figure 9. Message utterances generating the same signature

As it can be seen, the signature is the same for "how can I upgrade" and "when can I upgrade", "when does my contract end" and "when is my contract ending", and "live chat" & "live chats". So, they would be counted together when aggregating by signature.

The procedure has its limitations and is still experimental, so it is subject to improvement, but it can already alleviate some of the inherent variability in user expressions.
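The signature steps listed earlier in this subsection can be sketched as follows. This is a toy version under several assumptions: the stemmer only strips a trailing "s", the stopword list is a made-up handful of words, trigrams are taken as sliding windows of three words, and the sorted trigrams are joined with "|". The real implementation is not described at this level of detail.

```python
STOPWORDS = {"my", "the", "a", "an"}  # illustrative list only


def naive_stem(word):
    """Toy stemmer: drop a trailing 's' from words longer than 3 letters."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def signature(normalized):
    """Build a signature from a normalized utterance (MESSAGE_USR_NORM)."""
    tokens = [naive_stem(word) for word in normalized.split()]
    tokens = ["*" if token in STOPWORDS else token for token in tokens]
    if len(tokens) < 3:
        groups = [" ".join(tokens)]
    else:
        groups = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    # Sorting discards the global word order but keeps local structure.
    return "|".join(sorted(groups))
```

With these choices, "get my bill" and "get the bill" share the signature "get * bill", and "live chat" and "live chats" are also counted together.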

AuraMsgGroup

Messages produced by Aura are generated from its text resource database. In some cases, the same category of message produces different output texts, maybe because the message includes some user-dependent parameter or because the text database contains several variants of the same text (and Aura picks one at random). The AuraMsgGroup field is a keyword field that helps categorize Aura answers by abstracting away some of this variation. It classifies the response given by Aura into two types of elements:

  • Generic group: a name such as <NONE>, <GREETING> or <NOTFOUND>, which corresponds to a response category (see the table below)
  • Truncated answer: for answers that do not have a defined generic group, as a fallback the literal answer text is inserted, after substituting all numbers in it with a placeholder and truncating it (i.e., retaining only the first characters)

The table below contains the generic groups defined so far. They correspond to the most frequent Aura messages. The list is country-dependent, since it also depends on the use cases deployed in each country. As said above, responses not falling into these groups will be assigned a truncated version of the response text.

Group Meaning
<EMPTY> No textual answer from Aura
<NONE> Aura says it did not understand the user utterance
<ERR> There was a processing error of some kind at Aura side, and the request could not be fulfilled
<GREETING> Aura is greeting the user
<GOODBYE> Aura is acknowledging a conversation end
<YOU-ARE-WELCOME> Aura is accepting a compliment
<CHURN> Aura recognizes the user intention to terminate a contract
<NOTFOUND> Aura tried to search for some bit of data concerning the user query, and could not find it
<CANNOT> Aura cannot fulfil the user request because of insufficient information (in the query, or on user data)
<BILL-INFO> The user requested information about her bill, and Aura is returning it
<DATA-INFO> The user requested information about her data usage, and Aura is returning it
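
For answers outside these generic groups, the truncated-answer fallback described above can be sketched as follows. The placeholder text ("<num>") and the truncation length are assumptions, since this document does not state them.

```python
import re


def truncated_answer(text, max_chars=40, placeholder="<num>"):
    """Replace every number with a placeholder, then keep the first characters.

    Assumption: placeholder and max_chars are illustrative values.
    """
    return re.sub(r"\d+", placeholder, text)[:max_chars]
```

This way, "Your bill is 35 euros" and "Your bill is 1200 euros" fall into the same group.
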
EXPLICIT_FRUSTRATION

A sentiment model evaluates the user’s message for explicit frustration. This field holds the resulting probability that the user’s sentence is an explicit expression of frustration.

AllNGrams

For the intents none and tv.none, the most common n-grams of the user messages that led to these none responses are extracted. This field contains the n-grams of 1, 2 and 3 words.

AllNGramsFilter

This field represents the AllNGrams field but filtered by stopwords.

NGrams1

This field represents the n-grams for 1 word.

NGrams1Filter

This field represents the n-grams for 1 word filtered by stopwords.

NGrams2

This field represents the n-grams for 2 words.

NGrams3

This field represents the n-grams for 3 words.
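The relationship between these six fields can be sketched as follows. The code assumes that n-grams are contiguous word sequences taken from the normalized utterance, and that "filtered by stopwords" means the stopwords are removed before the n-grams are built; both points, as well as the stopword list, are assumptions made for the example.

```python
def word_ngrams(text, n):
    """Return the contiguous n-word sequences of a text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_fields(normalized, stopwords):
    """Build the n-gram fields for one normalized user message."""
    # Assumption: stopwords are dropped before building the filtered n-grams.
    filtered = " ".join(w for w in normalized.split() if w not in stopwords)
    one, two, three = (word_ngrams(normalized, n) for n in (1, 2, 3))
    return {
        "NGrams1": one,
        "NGrams2": two,
        "NGrams3": three,
        "AllNGrams": one + two + three,
        "NGrams1Filter": word_ngrams(filtered, 1),
        "AllNGramsFilter": [g for n in (1, 2, 3) for g in word_ngrams(filtered, n)],
    }
```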

2.5 - Annex: Dataset fields

Annex: Dataset fields detail

This annex describes the processing that each field of the Aura Analytics 2.0.0 data model goes through on its way to a clean PPD.

Introduction

The objective of the following tables is to explain the processing that each field goes through within this flow:

Aura datasets >>> PPD_RAW >>> PPD_CLEAN

  • Each cell of the table describes the processing applied to the data field before it reaches the corresponding stage (table column).

  • For example, the field AURA_ID_GLOBAL undergoes “hashing” before it is stored in PPD_RAW. After this, the hashed data is passed on to PPD_CLEAN without any further processing.

Tables used in the Active Listening process are described in the following sections. They belong to the Aura Entities database.

  • Columns FIELD and DESCRIPTION: instances managed by the OB

  • Columns PPD RAW and PPD CLEAN: instances managed by Aura Global Team

MESSAGE dataset

Message dataset (stored in local Kernel).

# FIELD DESCRIPTION PPD RAW PPD CLEAN
1 USER_ID Unique user ID in the OB systems NOT transferred NOT transferred
2 MSG_DT Timestamp of the data
3 MSG_ID Unique ID of the message NOT transferred
4 ACTION_CD Code of the action that produces the data NOT transferred
5 AURA_ID User logging ID in Aura. The user will have a new Aura_id each time she logs in Aura. Hashed
6 PHONE_ID Phone number of the user NOT transferred NOT transferred
7 CHANNEL_CD Code of the channel where the action happened
8 SUBSCRIPTION_CD Code of the subscription type of the user in the OB NOT transferred
9 DOMAIN_CD Code of the domain where the action happened NOT transferred
10 CATEGORY_CD Code of the category where the action happened NOT transferred
11 COUNTRY_CD Code of the country NOT transferred
12 CORR_ID Correlator ID of the request that produces this data
13 IS_CACHED Shows if the entity content was already cached or not NOT transferred
14 STATUS_CD Status code of the action, if meaningful
15 REASON Result of the action in error case, code of the error NOT transferred
16 VERSION_ID Aura version that produces this data
17 LANG_CD Language configured by the user for communication NOT transferred
18 TZ_CD Timezone where the communication happened NOT transferred
19 DURATION_NU Duration in milliseconds of the action
20 MESSAGE Content of the message Anonymized
21 DIALOG_ID Id of the dialog where the message happens
22 CONVERSATION_ID Id of the conversation where the message happens NOT transferred
23 WIN_RECOGNIZER_CD Code of the recognizer that wins for this message NOT transferred
24 WIN_RECOGNIZER_SCORE_NU Score of the recognizer that wins for this message NOT transferred
25 INTENT Selected intent
26 ENTITIES List of entities determined by the recognizer
27 MODALITY_CD How does the user communicate with Aura
28 AURA_ID_GLOBAL Identifies the same user_id logged with the same authentication method Hashed
29 ACCOUNT_NUMBER Unique account number of the user NOT transferred NOT transferred

GROOTMESSAGE dataset

Groot Message dataset (stored in local Kernel).

# FIELD DESCRIPTION PPD RAW PPD CLEAN
1 USER_ID Unique user ID in the OB systems NOT transferred NOT transferred
2 MSG_DT Timestamp of the data
3 MSG_ID Unique ID of the message NOT transferred
4 ACTION_CD Code of the action that produces the data NOT transferred
5 AURA_ID User logging ID in Aura. The user will have a new Aura_id each time she logs in Aura. Hashed
6 PHONE_ID Phone number of the user NOT transferred NOT transferred
7 CHANNEL_CD Code of the channel where the action happened
8 DOMAIN_CD Code of the domain where the action happened NOT transferred
9 CATEGORY_CD Code of the category where the action happened NOT transferred
10 COUNTRY_CD Code of the country NOT transferred
11 CORR_ID Correlator ID of the request that produces this data
12 IS_CACHED Shows if the entity content was already cached or not NOT transferred
13 STATUS_CD Status code of the action, if meaningful
14 REASON Result of the action in error case, code of the error NOT transferred
15 VERSION_ID Aura version that produces this data
16 LANG_CD Language configured by the user for communication NOT transferred
17 TZ_CD Timezone where the communication happened NOT transferred
18 DURATION_NU Duration in milliseconds of the action
19 MESSAGE Content of the message Anonymized
20 CHANNEL_CONVERSATION_CD Id of the channel conversation where the message happens NOT transferred
21 SKILL_CONVERSATION_CD Id of the skill conversation NOT transferred
22 WIN_RECOGNIZER_CD Code of the recognizer that wins for this message NOT transferred
23 WIN_RECOGNIZER_SCORE_NU Score of the recognizer that wins for this message NOT transferred
24 AURA_ID_GLOBAL Identifies the same user_id logged with the same authentication method Hashed
25 ACCOUNT_NUMBER Unique account number of the user NOT transferred NOT transferred
26 SKILL_CD Unique id of skill

RECOGNIZER dataset

Recognizer dataset stored in local Kernel.

# FIELD DESCRIPTION PPD RAW PPD CLEAN
1 USER_ID Unique user ID in the OB systems NOT transferred NOT transferred
2 RECOGNIZER_DT Timestamp of the data
3 RECOGNIZER_ID Unique ID of the recognizer
4 ACTION_CD Code of the action that produces the data NOT transferred
5 AURA_ID User logging ID in Aura. The user will have a new Aura_id each time she logs in Aura. Hashed
6 PHONE_ID Phone number of the user NOT transferred NOT transferred
7 CHANNEL_CD Code of the channel where the action happened
8 DOMAIN_CD Code of the domain where the action happened NOT transferred
9 CATEGORY_CD Code of the category where the action happened NOT transferred
10 COUNTRY_CD Code of the country NOT transferred
11 CORR_ID Correlator ID of the request that produces this data
12 IS_CACHED Shows if the entity content was already cached or not NOT transferred
13 STATUS_CD Status code of the action, if meaningful
14 REASON Result of the action in error case, code of the error
15 VERSION_ID Aura version that produces this data
16 LANG_CD Language configured by the user for communication NOT transferred
17 TZ_CD Timezone where the communication happened NOT transferred
18 DURATION_NU Duration in milliseconds of the action
19 SCORE_NU Score returned by the recognizer
20 INPUT User input sent to the recognizer. Null if incoming message is an AuraCommand Anonymized
21 OUTPUT Complete output generated by the recognizer
22 INTENT Intent returned by the recognizer
23 ENTITIES Entities returned by the recognizer due to the intent
24 COMMON_THRESHOLD_NU Common threshold used to determine the best answer of all recognizers NOT transferred
25 THRESHOLD Specific threshold of the specific recognizer being executed NOT transferred
26 EXPECTED_INTENT Intent expected to be returned by the recognizer NOT transferred
27 EXPECTED_ENTITIES Entities expected to be returned by the recognizer due to the intent NOT transferred
28 AURA_ID_GLOBAL Identifies the same user_id logged with the same authentication method Hashed
29 ACCOUNT_NUMBER Unique account number of the user NOT transferred NOT transferred

API dataset

API request dataset (stored in local Kernel).

# FIELD DESCRIPTION PPD RAW PPD CLEAN
1 USER_ID Unique user ID in the OB systems NOT transferred NOT transferred
2 REQUEST_DT Timestamp of the data
3 REQUEST_ID Unique ID of the request
4 ACTION_CD Code of the action that produces the data NOT transferred
5 AURA_ID User logging ID in Aura. The user will have a new Aura_id each time she logs in Aura Hashed NOT transferred
6 PHONE_ID Phone number of the user NOT transferred NOT transferred
7 CHANNEL_CD Code of the channel where the action happened NOT transferred
8 DOMAIN_CD Code of the domain where the action happened NOT transferred
9 CATEGORY_CD Code of the category where the action happened NOT transferred
10 COUNTRY_CD Code of the country NOT transferred
11 CORR_ID Correlator ID of the request that produces this data
12 IS_CACHED Shows if the entity content was already cached or not NOT transferred NOT transferred
13 STATUS_CD Status code of the API request
14 REASON Result of the action in error case, code of the error
15 VERSION_ID Aura version that produces this data NOT transferred
16 LANG_CD Language configured by the user for communication NOT transferred
17 TZ_CD Timezone where the communication happened NOT transferred
18 DURATION_NU Duration in milliseconds of the action
19 HOST Host of the API
20 PATH Specific path of the API being called NOT transferred
21 HTTP_STATUS HTTP status of the server response NOT transferred
22 RESPONSE Response body Anonymized
23 AURA_ID_GLOBAL Identifies the same user_id logged with the same authentication method Hashed NOT transferred
24 ACCOUNT_NUMBER Unique account number of the user NOT transferred NOT transferred
25 REQUEST Request body

3 - Aura Billing Module

Aura Billing Module

Description of Aura Billing Module, the tool for the generation of Liceo invoices.

Introduction

The Aura Billing Module is a tool for generating Liceo invoices, which make it possible to charge each customer for the services used. This is a mandatory process for OBs.

It is based on the storage and processing of specific logs in the OB’s Aura systems to track the type and number of interactions of a user or service with Aura.

This information is used to assign costs based on different billing models and criteria chosen by the OB, which ultimately determines the total amount of the invoice.

The invoices will be generated in XLSX (Excel) format and stored in an Azure Storage Explorer blob container, along with the historical invoice records.

These invoices will be available for download by the Aura Global Team, to be sent to the OBs.

Interested in how the Aura Billing Module works and which tasks are required to bring it into use? Access the document Aura Billing Module operation.

Generated Liceo invoices

The Liceo invoices generated by Aura Billing Module will contain the following information:

  • Invoicing model (based on the payment model of the OB)
  • Aura components used to provide the service
  • Service/app that used this component
  • Number of queries per component
  • Cost of each query in each specific component
  • Total amount generated by each component
  • Total number of requests made during the billing period
  • Total amount of the invoice
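The relationship between these items can be sketched with simple arithmetic: the amount per component is the number of queries multiplied by the cost per query, and the invoice total is the sum over all components. The class name, the app name and the unit costs below are hypothetical; the real billing models and prices are chosen by each OB.

```python
from dataclasses import dataclass


@dataclass
class InvoiceLine:
    component: str    # Aura component used to provide the service
    app: str          # service/app that used this component
    queries: int      # number of queries in the billing period
    unit_cost: float  # cost of each query (made-up values below)

    @property
    def amount(self):
        """Total amount generated by this component."""
        return self.queries * self.unit_cost


def invoice_totals(lines):
    """Return (total number of requests, total amount of the invoice)."""
    total_requests = sum(line.queries for line in lines)
    total_amount = sum(line.amount for line in lines)
    return total_requests, total_amount


lines = [
    InvoiceLine("aura-bot", "my-app", 1000, 0.5),
    InvoiceLine("aura-groot", "my-app", 400, 0.25),
]
print(invoice_totals(lines))
```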

3.1 - Aura Billing Module operation

Aura Billing Module operation

This document contains:

  • An overview of Aura Billing Module functional operation
  • Tasks to be executed by OBs to bring Aura Billing Module into use

Aura Billing Module operational flowchart

Figure 1 schematically shows how Aura Billing Module operates, where three different instances come into play:

  • Aura: OB managed environment
  • Aura: Global Team managed environment
  • Kernel


Figure 1. Aura Billing Module operation

The operational processes executed by the Aura Billing Module are outlined below. Each step also describes the tasks that OBs must carry out to bring the module into use.

1. Data generation

This task takes place in Aura’s OB managed environment.

  • Aura components automatically generate logs every time a user/service interacts with Aura in local environment.

  • These logs are pre-processed, cleaned and converted into datasets, in Avro format.

  • These are the required Avro-formatted datasets for the Aura Billing Module:

    • Aura_Audit, which stores the minimum information needed to generate the Liceo invoices.

    • Aura dimensional entities:

      • D_Aura_App: List of possible apps defined in Aura.
      • D_Aura_Channel: List of possible channels defined in Aura.
      • D_Aura_Component: List of possible components defined in Aura.
      • D_Aura_Preset: List of possible presets defined in Aura.
      • D_Aura_Recognizer: List of possible recognizers defined by Aura.
      • D_Aura_Skill: List of possible skills defined in Aura.

    • Aura Gateway Message: summary of Aura interactions handled by aura-gateway-api.

2. Data publication in Kernel

  • The latest versions of the previous Avro-formatted datasets must be published in the Kernel production environment by the Kernel team.

Tasks for OBs

2.1. Ask the Kernel Team to publish the latest version of the datasets in the Kernel production environment.

2.2. When correctly published, you can check them in the repository: 4p-datasets

3. Kernel apps configuration to write/read datasets

  • Two Kernel applications (clients) must be created/configured by the Kernel team to allow the use of Kernel resources:

    • aura-bot-[environment]: already existing app in Kernel
    • aura-billing-[environment]: new application
  • Specifically, the applications must be configured with concrete scopes that provide permissions to write/read the datasets.

  • It is no longer mandatory to indicate the exact dataset version in the configuration. Therefore, in upcoming deployments, the version number will be removed from the scopes. For instance, the configuration of the Brazil OB will have to be updated the next time a scope is changed, for example: data:Aura_Audit:6:read -> data:Aura_Audit:read.
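When an existing configuration has to be migrated, the change from versioned to unversioned scopes can be automated. The following is a minimal sketch, assuming that versioned scopes always follow the pattern data:<dataset>:<version>:<read|write> seen in the example above. The function name is hypothetical.

```python
import re

# Matches versioned dataset scopes such as "data:Aura_Audit:6:read".
_VERSIONED_SCOPE = re.compile(r"^(data:[A-Za-z0-9_]+):\d+:(read|write)$")


def drop_scope_version(scope):
    """Return the scope without its dataset version; other scopes are unchanged."""
    return _VERSIONED_SCOPE.sub(r"\1:\2", scope)


old_scopes = ["admin:datasets:read", "data:read", "data:Aura_Audit:6:read"]
print([drop_scope_version(s) for s in old_scopes])
```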

Tasks for OBs

3.1. Configure aura-bot Kernel application to write datasets

Ask the Kernel Team to create a list of scopes in the aura-bot application for your intended environment.

- admin:datasets:read
- data:read
- data:write
- data:Aura_Audit:read
- data:Aura_Audit:write
- data:Aura_Gateway_Message:read
- data:Aura_Gateway_Message:write
- data:D_Aura_App:read
- data:D_Aura_App:write
- data:D_Aura_Channel:read
- data:D_Aura_Channel:write
- data:D_Aura_Component:read
- data:D_Aura_Component:write
- data:D_Aura_Preset:read
- data:D_Aura_Preset:write
- data:D_Aura_Recognizer:read
- data:D_Aura_Recognizer:write
- data:D_Aura_Skill:read
- data:D_Aura_Skill:write
- data:D_Gbl_Brand:read
- data:D_Gbl_Contact_Channel:read
- data:D_Gbl_Country:read

The scopes are associated with a specific version of the dataset, which will increase and change over time.

3.2. Create a new app for Aura Billing Module and configure it to read datasets

Ask the Kernel Team to create a new application, aura-billing, in Kernel for your intended environment.

Ask the Kernel Team to create a purpose for this application, for instance aura-kpi-data-read-purpose.

Ask the Kernel Team to assign the following scopes to this purpose:
- admin:datasets:read
- data:read
- data:write
- data:Aura_Audit:read
- data:Aura_Gateway_Message:read
- data:D_Aura_App:read
- data:D_Aura_Channel:read
- data:D_Aura_Component:read
- data:D_Aura_Preset:read
- data:D_Aura_Recognizer:read
- data:D_Aura_Skill:read
- data:D_Gbl_Brand:read
- data:D_Gbl_Contact_Channel:read
- data:D_Gbl_Country:read
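The two scope lists in tasks 3.1 and 3.2 differ only in the per-dataset write scopes. As a convenience when preparing the request for the Kernel Team, both lists can be generated from the dataset names. This is only a sketch: the function name and the grouping of datasets into two constants are assumptions made for the example.

```python
# Dataset names taken from the scope lists in tasks 3.1 and 3.2.
AURA_DATASETS = [
    "Aura_Audit", "Aura_Gateway_Message", "D_Aura_App", "D_Aura_Channel",
    "D_Aura_Component", "D_Aura_Preset", "D_Aura_Recognizer", "D_Aura_Skill",
]
GLOBAL_DATASETS = ["D_Gbl_Brand", "D_Gbl_Contact_Channel", "D_Gbl_Country"]
BASE_SCOPES = ["admin:datasets:read", "data:read", "data:write"]


def build_scopes(writer):
    """Build the scope list: writer=True for aura-bot, False for aura-billing."""
    scopes = list(BASE_SCOPES)
    for name in AURA_DATASETS:
        scopes.append(f"data:{name}:read")
        if writer:
            scopes.append(f"data:{name}:write")
    scopes += [f"data:{name}:read" for name in GLOBAL_DATASETS]
    return scopes


print("\n".join(build_scopes(writer=False)))
```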

3.3. Access the Kernel applications

Once the Kernel team has created the app with the above-mentioned purposes and scopes, two parameters for securely accessing the app will be provided:
- client_id: unique identifier of the consuming app acting as Kernel API client.
- client_secret: password.
This allows Kernel to securely identify, authenticate and authorize any access requested from this app.
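For illustration only, the sketch below shows how a client_id and client_secret are typically used in a standard OAuth 2.0 client-credentials request (RFC 6749, section 4.4). Whether Kernel follows exactly this flow is an assumption, and the token URL is a placeholder: check the Kernel documentation for the real endpoint and parameters. The code only builds the request and does not send it.

```python
import base64
from urllib.parse import urlencode


def build_token_request(client_id, client_secret, token_url):
    """Build (url, headers, body) for an OAuth 2.0 client-credentials request."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        # HTTP Basic authentication of the client, as described in RFC 6749.
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials"})
    return token_url, headers, body


# Placeholder values: none of them are real Kernel credentials or URLs.
url, headers, body = build_token_request(
    "my-client-id", "my-client-secret", "https://kernel.example.invalid/token"
)
print(url, body)
```

Keep the client_secret out of source control and logs, since it acts as the password of the application.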

4. Data processing

Data processing is executed with Azure Databricks.

In this process, the Aura Billing Module retrieves and reads the information from the Kernel datasets and applies algorithms that assign a unit cost to each item of the invoice, in order to calculate its total amount.

Tasks for OBs

4.1. Enable Aura components

Check that Aura KPIs Uploader, the component in charge of the management of KPIs entities and KPIs dimensions in Aura, is in use.

Check that Aura Databricks Jobs, component used to import Avro-formatted files into a Kernel dataset, is enabled in Aura installer.

Configure Azure Common:
- To avoid data files migrations between releases, KPI files are now stored in an Azure common storage that is not release dependent.
- The default number of days for KPIs uploading is now 30. For this reason, the variable fourth_platform.conversations.days_to_find must be removed from OB deployments, because it will be configured by the installer.

5. Data consumption

This step refers to the generation of the Liceo invoices and their storage in Azure Storage Explorer in XLSX format.

These invoices will be available for download by the Aura Global Team, to be sent to the OBs.

4 - Manage Aura logs

Manage Aura logs

Learn what Aura logs are and how they are managed in Kibana and other tools

Introduction

Logs are files that record specific single events, warnings and errors as they occur within a software environment. They can include contextual information, such as the time an event occurred and which user or endpoint was associated with it.

In Aura, logs are generated by specific components when an event happens and stored in order to monitor or debug the system.

Logs are stored in an ElasticSearch cluster.

Once logs are stored, they can be managed through the Kibana-based logging system integrated in Aura, which is the official tool for managing Aura logs. Logs can also be managed with Grafana and fluentd for specific features.

⚠️ You should not integrate third-party applications or scripts with ElasticSearch. These kinds of integrations are weak because the ElasticSearch API is not part of the public interface with the OB. This means that it could change without notice for several reasons such as updating the version of ElasticSearch or changing Aura internal architecture.

Manage logs in Kibana

The official Kibana User Guide is the reference guide to use Kibana.

Moreover, the current section includes certain useful points for managing Aura logs through this tool.

Policies in Kibana

Kibana includes index lifecycle policies.

By default, we add one policy for each index created (service and system index), to delete the logs older than seven days.
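For reference, an Elasticsearch index lifecycle policy that deletes data after a number of days has the general shape built by the sketch below. This is not the exact policy installed by Aura, which is not shown in this document, and the function name is hypothetical.

```python
import json


def delete_after_policy(days):
    """Build an index lifecycle policy body that deletes indices after `days` days."""
    return {
        "policy": {
            "phases": {
                "delete": {
                    "min_age": f"{days}d",
                    "actions": {"delete": {}},
                }
            }
        }
    }


# Seven days, matching the default retention described above.
print(json.dumps(delete_after_policy(7), indent=2))
```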

Snapshot in Kibana

Index snapshots are configured by default as long-term storage for the logs. These snapshots are taken daily and end up in the cluster's Azure Storage blob container (aura-backups/elk).

Manage logs in Grafana

Discover section

The “Discover” section in Grafana is very useful to look for logs and troubleshoot issues.

You can full-text search logs using Lucene query syntax.

Moreover, logs are tagged with many fields that can be useful to narrow down a search, such as:

  • kubernetes.labels.app: name of the Kubernetes application that generated the log.
  • kubernetes.pod_name: name of the Kubernetes pod that generated the log.
  • corr: correlator that tracks E2E requests.
  • lvl: log level (TRACE, DEBUG, INFO, WARN, ERROR or FATAL).
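As an example of how these fields combine in a Lucene query, the following sketch builds a query string from field filters and an optional free-text phrase. The helper names are hypothetical, and the resulting string is meant to be pasted into the search box.

```python
def _escape(value):
    """Escape backslashes and double quotes inside a quoted Lucene phrase."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def lucene_query(fields, text=None):
    """Join field filters (and an optional free-text phrase) with AND."""
    parts = [f'{name}:"{_escape(value)}"' for name, value in fields.items()]
    if text:
        parts.append(f'"{_escape(text)}"')
    return " AND ".join(parts)


# All ERROR logs generated by the aura-bot application.
print(lucene_query({"kubernetes.labels.app": "aura-bot", "lvl": "ERROR"}))
```

Filtering by fields such as corr or lvl is more robust than matching message text, which can change between versions.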

Queries that rely on specific text are fragile. Aura cannot guarantee that log messages stay the same between versions; in fact, they do change. This is why metrics based on logs are not reliable, and using Kibana to get metrics is not recommended.

Manage logs in fluentd

Logs external forwarding feature

It is possible to send logs to an external system (a fluentd endpoint).

To enable this feature, add the following configuration to your config file:

external_forwarding:
  secret_shared_key: "mysecretkey"
  tls_config:
    tls_enabled: True
  remote_servers:
  - hostname: xxx
    port: yyy

  • Set the hostname and port fields to the remote endpoint. If you configure more than one remote server, fluentd load balances the traffic among them in round-robin order.
  • The hostname value can be an IP address, but it is not recommended if TLS is enabled. Turning off TLS is possible but discouraged for security reasons.
  • secret_shared_key is used to verify client’s identity and must be configured properly in all the remote servers.
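The recommendations above can be turned into a small pre-deployment check. The sketch below takes the external_forwarding section already parsed into a dictionary and returns a list of warnings. The function name is hypothetical, and treating a missing tls_enabled as disabled is an assumption, not documented installer behavior.

```python
import ipaddress


def check_external_forwarding(config):
    """Return a list of warnings for an external_forwarding config dictionary."""
    warnings = []
    # Assumption: a missing tls_enabled value is treated as TLS disabled.
    tls_enabled = config.get("tls_config", {}).get("tls_enabled", False)
    if not tls_enabled:
        warnings.append("TLS is disabled: discouraged for security reasons")
    if not config.get("secret_shared_key"):
        warnings.append("secret_shared_key is missing")
    for server in config.get("remote_servers", []):
        host = str(server.get("hostname", ""))
        try:
            ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP literal, so it is a regular hostname
        if tls_enabled:
            warnings.append(f"{host}: IP address with TLS enabled is not recommended")
    return warnings


config = {
    "secret_shared_key": "mysecretkey",
    "tls_config": {"tls_enabled": True},
    "remote_servers": [{"hostname": "10.0.0.5", "port": 24224}],
}
print(check_external_forwarding(config))
```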

You can find additional information regarding receivers’ configuration (including TLS configuration and password authentication procedure) here.

5 - Manage metrics

Manage Aura metrics

Learn what Aura metrics are, how they are generated and stored in Prometheus, and how they are analyzed through Grafana

Introduction

Metrics provide a measurement of certain data that represents a specific aspect of the monitored system at a point in time, and offer an aggregated view of the system. They are useful for visualizing long-term trends and for alerting on log data.

Each Aura component is in charge of publishing its own metrics, which are typically generated at fixed-time intervals from aggregated logs.

Once generated, Aura metrics are polled by Prometheus, which is in charge of gathering and exposing them.

Grafana is the most suitable tool to represent metrics through different dashboards. Each component has a Grafana dashboard that shows its current behavior, and there is a single dashboard for an Aura overview.

If you think a new metric could be useful, please contact the Aura Platform Team, so it can be officially included as part of the platform.

The aim of this section is to explain both how Aura metrics work and all the metrics stored by each component.

⚠️ Saved dashboards, visualizations and queries are not guaranteed to be kept between upgrades, because the whole stack, including ElasticSearch and Grafana, can be upgraded to newer versions.

Prometheus

The Aura metrics system is based on Prometheus, a Cloud Native Computing Foundation project for monitoring systems and services. Prometheus collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

The prom-client library is used to implement Prometheus functionality in Node.js.

The Prometheus service polls every component to get the metrics generated during the last time period. Every component exposes a private endpoint (not accessible from the Internet) called /metrics, from which Prometheus requests the metrics.

Currently, the metric types used in this component are:

  • Summary: similar to a histogram, it samples observations (such as request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.

  • Counter: cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

  • Gauge: similar to Counter, but it represents a single numerical value that can arbitrarily go up and down.
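To make the difference between the three types concrete, here is a toy model of their semantics. It is not prom-client and not how Aura components are implemented: the class names are made up, and the summary keeps every observation and uses a nearest-rank quantile instead of a sliding time window.

```python
import math


class SimpleCounter:
    """Cumulative value that can only increase (or restart from zero)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class SimpleGauge:
    """Single numerical value that can go up and down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class SimpleSummary:
    """Count, total and quantiles of the observed values."""

    def __init__(self):
        self.observations = []

    def observe(self, value):
        self.observations.append(value)

    @property
    def count(self):
        return len(self.observations)

    @property
    def total(self):
        return sum(self.observations)

    def quantile(self, q):
        """Nearest-rank quantile over all observations (no sliding window)."""
        if not self.observations:
            return None
        ordered = sorted(self.observations)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


durations = SimpleSummary()
for seconds in (1, 2, 3, 4):
    durations.observe(seconds)
print(durations.count, durations.total, durations.quantile(0.5))
```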

Prometheus-es-exporter

With Prometheus, we can create metrics (as well as alarms, dashboards, etc.) from queries to ElasticSearch indexes by using prometheus-es-exporter.

This component is not deployed by default, but it can be enabled by setting the variable prometheus_es_exporter_enabled to true in your config.yml file (in Brazil, it is set to true by default). Access here the guidelines to enable prometheus-es-exporter component.

To configure your own metrics from queries, add a new section to your config.yml, as in the following example.

prometheus_es_exporter:
  query_blocks:
    ob:
      - name: "query_ob_br"
        QueryIntervalSecs: "60"
        QueryJson: '{"size":0,"query":{"bool":{"must":[],"filter":[{"bool":{"filter":[{"bool":{"should":[{"match_phrase":{"msg":"[AzureEventHub] emit"}}],"minimum_should_match":1}},{"bool":{"should":[{"match_phrase":{"kubernetes.labels.app":"aura-bot"}}],"minimum_should_match":1}}]}},{"range":{"@timestamp":{"gte":"now-1m","lte":"now"}}}]}}}'
        QueryIndices: "aurak8s-service-*"

Where:

  • name: Mandatory. Name of the query. It must start with query_*
  • QueryIntervalSecs: Optional. It indicates how often to run queries in seconds. By default, 60.
  • QueryJson: Mandatory. The search query to run.
  • QueryIndices: Optional. Indices to run the query on. Any way of specifying indices supported by your ElasticSearch version can be used. By default, _all. Although this field is optional, it is highly recommended to delimit the search query.
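Writing QueryJson by hand as a single-line string is error-prone. One option is to build the query as a dictionary and serialize it, as in the sketch below. The helper name and the sample query are hypothetical; only the four field names and the query_ prefix rule come from the description above.

```python
import json


def query_block(name, query, indices="_all", interval_secs=60):
    """Build one query block entry for the prometheus_es_exporter section."""
    if not name.startswith("query_"):
        raise ValueError("name must start with 'query_'")
    return {
        "name": name,
        "QueryIntervalSecs": str(interval_secs),
        # Compact, single-line JSON, like the QueryJson value in the example.
        "QueryJson": json.dumps(query, separators=(",", ":")),
        "QueryIndices": indices,
    }


# Hypothetical query: ERROR logs in the last minute.
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"match_phrase": {"lvl": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-1m", "lte": "now"}}},
            ]
        }
    },
}
print(query_block("query_errors", query, indices="aurak8s-service-*"))
```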

Aura components metrics

The main Aura components can generate their own metrics.

Select your intended component in the left menu to access its details.

5.1 - Aura Bot metrics

Aura Bot metrics

List of metrics available in Aura Bot

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-bot.

It is stored as a Summary in Prometheus. So every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in aura-bot until its HTTP response is returned, indicating to the client that Aura is processing the request to obtain a proper answer for the user.

The metric allows measuring the behavior of the requests from any given endpoint:

  • The number of requests during a time period
  • The average/min/max duration of these requests
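A Prometheus Summary exposes two cumulative series, <name>_count and <name>_sum, so the number of requests and their average duration in a time period can be derived from two readings of those series. The sketch below shows the arithmetic with made-up values; the function name is hypothetical. Minimum and maximum values cannot be obtained from these two series alone.

```python
def summary_window_stats(count_start, sum_start, count_end, sum_end):
    """Return (requests, average duration) between two readings of a Summary.

    The arguments are the values of <name>_count and <name>_sum at the start
    and at the end of the period.
    """
    requests = count_end - count_start
    if requests <= 0:
        return 0, None  # no new requests, or the series restarted
    return requests, (sum_end - sum_start) / requests


# Made-up readings of http_request_duration_seconds_count and _sum.
print(summary_window_stats(100, 20.0, 150, 45.0))
```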

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response

This metric has been stored since the Iron Maiden (7.2.0) release.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-bot.

It is stored as a Summary in Prometheus so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests during a time period
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status: HTTP status code returned in the response

This metric has been stored since the Camela (5.0.0) release.

outgoing_message_duration_seconds

This metric is intended to store the number of Direct Line requests arriving at aura-bot.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-bot is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the user is sent back to the client callback. This metric measures the duration since the request lands in aura-bot until the last message of its answer is sent to the client callback.

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status code returned in the response.
  • originStatus: status sent by Direct Line in the body of the response when an error occurs.
  • origin: specific host of the request.
  • channel: channel of the request.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_component_version

This metric is intended to store the number of aura-bot instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

  • version: version field in the package.json file included in the running docker container.
  • component: name of the component that is writing the metric.

This metric has been stored since the Camela (5.0.0) release under the name bot_version, and was renamed to aura_component_version in Iron Maiden (7.2.0).

bot_request_version

This metric is intended to store the number of incoming requests to aura-bot depending on their channelData.version. It is stored as a Counter in Prometheus.

Labels:

  • version: channelData.version in the incoming request. If the incoming request has no version field, 1 will be set.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-bridge.

It is stored as a Counter in Prometheus.

Labels:

  • error: exception message that forced the unhandled error.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_token_generate

This metric is intended to store the information related to Kernel accessToken refreshes in aura-bridge. It makes it possible to set an alarm when any error occurs while refreshing the 2-legged accessToken needed to access Kernel WhatsApp APIs.

It is stored as a Summary.

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status returned by Kernel in the response.
  • originStatus: status sent by Kernel in the body of the response when an error occurs.
  • origin: channelId of the channel that needs the accessToken in Aura.
  • channel: channel of the request.

This metric has been stored since the Iron Maiden (7.2.0) release.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

  • moduleId: Id of the module.
  • status: OK or ERROR

5.2 - Aura Groot metrics

Aura Groot metrics

List of metrics available in Aura Groot

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-groot.

It is stored as a Summary in Prometheus. So every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in aura-groot until its HTTP response is returned, indicating to the client that Aura is processing the request to obtain a proper answer for the Direct Line or aura-bridge.

The metric allows measuring the behavior of the requests from any given endpoint:

  • The number of requests during a time period
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response

outgoing_request_duration_seconds

This metric is intended to store the processing time related to all the outgoing HTTP requests made by aura-groot.

It is stored as a Summary in Prometheus so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests during a time period
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status: HTTP status code returned in the response

outgoing_message_duration_seconds

This metric is intended to store the processing time of Direct Line or aura-bridge requests arriving at aura-groot.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-groot is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the user is sent back to the client callback. This metric measures the duration from when the request lands in aura-groot until the last message of its answer is sent to the client callback.

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status code returned in the response.
  • originStatus: status sent by Direct Line in the body of the response when an error occurs.
  • origin: specific host of the request (Direct Line or aura-bridge).
  • channel: channel of the request.

incoming_message_duration_seconds

This metric is intended to store the processing time of Direct Line, aura-bridge or skills requests arriving at aura-groot.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-groot is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the channel or skill is sent back to the client callback. This metric measures the duration from when the request arrives at aura-groot until it is processed and sent to the channel/bridge or skill.

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status code returned in the response.
  • originStatus: status sent by Direct Line in the body of the response when an error occurs.
  • origin: specific host of the request (Direct Line, aura-bridge or skill name). If origin is missing, the content of path label will be added.
  • channel: channel of the request.

aura_component_version

This metric is intended to store the number of aura-groot instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

  • version: version field in the package.json file included in the running docker container.
  • component: name of the component that is writing the metric.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-groot.

It is stored as a Counter in Prometheus.

Labels:

  • error: exception message that forced the unhandled error.

skill_access_error

This metric is intended to store the number of times a skill has been misconfigured in aura-groot.

It is stored as a Counter in Prometheus.

Labels:

  • skill: skill name.
  • code: noRespond or noFound
  • channel: channel of the request.

skill_request_status

This metric is intended to store the number of times we have obtained a response status per skill in aura-groot.

It is stored as a Counter in Prometheus.

Labels:

  • skill: skill name.
  • code: status code of the request.
  • channel: channel of the request.

skill_response_error

This metric is intended to store the number of times a skill has been blocked in aura-groot.

It is stored as a Counter in Prometheus.

Labels:

  • skill: skill name
  • code: blocked
  • channel: channel of the request.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

  • moduleId: Id of the module.
  • status: OK or ERROR

5.3 - Atria Model Gateway metrics

Atria Model Gateway metrics

List of metrics available in atria-model-gateway

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by atria-model-gateway.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in atria-model-gateway until its HTTP response is returned:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response
  • application: application name that is using the model
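
A Prometheus Summary exposes two cumulative series per label set, named with the suffixes _sum and _count, so the average duration over a window can be derived from two scrapes. The sketch below illustrates that arithmetic; the SummaryScrape type, the helper name and the scrape values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SummaryScrape:
    """Cumulative values of one Summary series at scrape time."""
    total_seconds: float  # value of http_request_duration_seconds_sum
    request_count: float  # value of http_request_duration_seconds_count


def average_duration(earlier: SummaryScrape, later: SummaryScrape) -> Optional[float]:
    """Mean request duration between two scrapes, or None if no request arrived."""
    requests = later.request_count - earlier.request_count
    if requests <= 0:
        return None
    return (later.total_seconds - earlier.total_seconds) / requests


# Hypothetical scrapes taken a few minutes apart: 10 requests took 3 seconds in total.
earlier = SummaryScrape(total_seconds=12.0, request_count=100)
later = SummaryScrape(total_seconds=15.0, request_count=110)
mean_seconds = average_duration(earlier, later)
```

The difference in request_count alone gives the number of requests in the window, which is the first measurement listed above.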

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by atria-model-gateway. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status: HTTP status code returned in the response

generative_tokens

This metric is intended to store the information related to tokens used by OpenAI in atria-rag-server. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its token usage.

The metric allows measuring token usage for any given OpenAI model:

  • The number of tokens over a period of time
  • The average/min/max tokens of these requests

Labels:

  • application: application name that is using the model
  • deployment_model_name: name of the deployment model
  • model_type: identifier of the model

5.4 - Atria RAG server metrics

Atria RAG server metrics

List of metrics available in atria-rag-server

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by atria-rag-server.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in atria-rag-server until its HTTP response is returned:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response
  • application: application name that is using the model

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by atria-rag-server. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status: HTTP status code returned in the response

5.5 - Aura Authentication API metrics

Authentication API metrics

List of metrics available in Aura Authentication API

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-authentication-api. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in aura-authentication-api until its HTTP response is returned:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response

This metric has been stored since the Greenday (6.0.0) release.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-authentication-api. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status: HTTP status code returned in the response

This metric has been stored since the Camela (5.0.0) release.

aura_token_generate

This metric is intended to store the information related to Kernel accessToken generation, used during the integrated authorization process of the Aura users in aura-authentication-api.

It makes it possible to set an alarm when any error occurs during token validation. It is stored as a Summary in Prometheus.

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status returned by Kernel in the response.
  • originStatus: status sent by Kernel in the body of the response when an error occurs.
  • origin: channelId of the channel that needs the accessToken in Aura.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_component_version

This metric is intended to store the number of aura-authentication-api instances (pods) running each version of the code.

It is stored as a Gauge in Prometheus.

Labels:

  • version: version field in the package.json file included in the running docker container.
  • component: name of the component that is writing the metric.

This metric has been stored since the Barricada (5.3.0) release under the name authentication_api_version, and was renamed to aura_component_version in Iron Maiden (7.2.0).

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-authentication-api. It is stored as a Counter in Prometheus.

Labels:

  • error: exception message that forced the unhandled error.

This metric has been stored since the Iron Maiden (7.2.0) release.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

  • moduleId: Id of the module.
  • status: OK or ERROR

5.6 - Aura Configuration API metrics

Aura Configuration metrics

List of metrics available in Aura Configuration API

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-configuration-api.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in aura-configuration-api until its HTTP response is returned:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response

This metric has been stored since the Greenday (6.0.0) release.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-configuration-api. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status: HTTP status code returned in the response

aura_component_version

This metric is intended to store the number of aura-configuration-api instances (pods) running each version of the code.

It is stored as a Gauge in Prometheus.

Labels:

  • version: version field in the package.json file included in the running docker container.
  • component: name of the component that is writing the metric.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-configuration-api. It is stored as a Counter in Prometheus.

Labels:

  • error: exception message that forced the unhandled error.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

  • moduleId: Id of the module.
  • status: OK or ERROR

5.7 - Aura Gateway API metrics

Gateway API metrics

List of metrics available in Aura Gateway API

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-gateway-api.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests from any given endpoint. Specifically, the duration since the request lands in aura-gateway-api until its HTTP response is returned:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response
  • application: Application name of the request.
  • channel: Channel name of the request. Only for NLPaaS endpoint.
  • preset: Preset name of the request. Only for Generative endpoint.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-gateway-api. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

The metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status: HTTP status code returned in the response

aura_component_version

This metric is intended to store the number of aura-gateway-api instances (pods) running each version of the code.

It is stored as a Gauge in Prometheus.

Labels:

  • version: version field in the package.json file included in the running docker container.
  • component: name of the component that is writing the metric.

This metric has been stored since the Beatles (8.9.0) release.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-gateway-api. It is stored as a Counter in Prometheus.

Labels:

  • error: exception message that forced the unhandled error.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

  • moduleId: Id of the module.
  • status: OK or ERROR

5.8 - Aura Bridge metrics

Aura Bridge metrics

List of metrics available in Aura bridge

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-bridge. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in aura-bridge until its HTTP response is returned, indicating to the client that Aura is processing the request to obtain a proper answer for the user.

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response

This metric allows measuring the behavior of the requests from any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

This metric has been stored since the Greenday (6.0.0) release.

outgoing_message_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-bridge.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-bridge is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the user is sent back to the client callback.

This metric measures the duration since the request lands in aura-bridge until the last message of its answer is sent to the client callback.

Labels:

  • host: host and domain where the request is being sent.
  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • path: specific endpoint of the request.
  • originStatus: third-party status sent in the body of the response. Usually, this status is sent by WhatsApp.
  • status: HTTP status code returned in the response.
  • origin: specific source of the request. The value could be: ‘4p’, ‘whatsapp’, ‘aura-bot’ or ‘genesys’.
  • channel: channel of the request.

This metric allows measuring the behavior of the requests from any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

This metric has been stored since the Greenday (6.0.0) release.

incoming_message_duration_seconds

This metric is intended to store the number of requests arriving to aura-bridge from a channel or Direct Line.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-bridge is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the channel or Direct Line is sent back to the client callback. This metric measures the duration from when the request arrives at aura-bridge until it has been processed for sending to the channel or Direct Line.

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status code returned in the response.
  • originStatus: status sent by Direct Line or channel in the body of the response when an error occurs.
  • origin: specific host of the request. If origin is missing, the content of path label will be added.
  • channel: channel of the request. In Auraline requests used to get conversationId with path: /aura-services/v1/auraline/conversations, channel will be missing.
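
The two fallback rules in the list above can be sketched as a small helper that assembles the label set for a sample. The function name message_labels is hypothetical, and treating a missing channel as an omitted label (rather than an empty one) is an assumption, not documented aura-bridge behavior.

```python
from typing import Dict, Optional


def message_labels(
    path: str,
    http_status: int,
    origin: Optional[str] = None,
    channel: Optional[str] = None,
    origin_status: Optional[str] = None,
) -> Dict[str, str]:
    """Build the labels of one incoming_message_duration_seconds sample."""
    labels = {
        "path": path,
        "httpStatus": str(http_status),
        # Documented fallback: a missing origin is replaced by the path.
        "origin": origin if origin else path,
    }
    if origin_status is not None:
        labels["originStatus"] = str(origin_status)
    if channel is not None:
        # Assumption: an unknown channel is left out instead of sent empty.
        labels["channel"] = channel
    return labels


# Auraline conversation request: neither origin nor channel is known.
conversation_labels = message_labels("/aura-services/v1/auraline/conversations", 200)
```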

aura_response_ack_duration_seconds

This metric is intended to store the information related to all the ACK requests sent by the clients to aura-bridge. The ACK requests are used by the clients (WhatsApp) to report whether Aura’s answer was eventually delivered to the user.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration. The duration is measured from the moment the ACK request lands in aura-bridge until its asynchronous answer is sent to the user.

Labels:

  • host: host and domain where the request is being sent.
  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • path: specific endpoint of the request.
  • originStatus: third-party status sent in the body of the response. Usually, this status is sent by WhatsApp.
  • status: HTTP status code returned in the response.
  • origin: specific source of the request. The value could be: ‘4p’, ‘whatsapp’, ‘aura-bot’ or ‘genesys’.
  • channel: channel of the request.

This metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

This metric has been stored since the Heroes (7.0.0) release.

outgoing_request_duration_seconds

This metric is intended to store the information related to all the outgoing HTTP requests made by aura-bridge. It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, …)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status: HTTP status code returned in the response

This metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests

This metric has been stored since the Greenday (6.0.0) release.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-bridge. It is stored as a Counter in Prometheus.

Labels:

  • error: exception message that forced the unhandled error.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_token_generate

This metric is intended to store the information related to Kernel accessToken refreshes in aura-bridge. It makes it possible to set an alarm when any error occurs while refreshing the 2-legged accessToken needed to access Kernel WhatsApp APIs.

It is stored as a Summary in Prometheus.

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status returned by Kernel in the response.
  • originStatus: status sent by Kernel in the body of the response when an error occurs.
  • origin: channelId of the channel that needs the accessToken in Aura.

This metric has been stored since the Iron Maiden (7.2.0) release.

aura_component_version

This metric is intended to store the number of aura-bridge instances (pods) running each version of the code.

It is stored as a Gauge in Prometheus.

Labels:

  • version: version field in the package.json file included in the running docker container.
  • component: name of the component that is writing the metric.

This metric has been stored since the Greenday (6.0.0) release under the name aura_bridge_version, and was renamed to aura_component_version in Iron Maiden (7.2.0).

aura_bridge_wa_incoming_message

This metric is intended to store the number of unhandled errors happening in aura-bridge. It is stored as a Counter in Prometheus.

Labels:

  • error: exception message that forced the unhandled error.

This metric has been stored since the Iron Maiden (7.2.0) release.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

  • moduleId: Id of the module.
  • status: OK or ERROR

5.9 - Aura KPIs uploader metrics

Aura KPIs Uploader

List of metrics available in Aura KPIs uploader

aura_kpis_uploader_metrics_duration

This KPI measures the time required by aura-kpis-uploader to process each type of KPI. KPI management has several steps (load, process, upload), and this KPI represents the time it takes to perform all those steps for each of the KPIs defined in AURA_SOURCE_PATH_AVRO_ADAPTERS.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

Labels:

  • format: File format in which the KPI will be stored.
    • csv: File format will be CSV (deprecated).
    • avro: File format will be AVRO.
  • kpiType: Type of KPI:
    • entity: KPI is of type Entity.
    • dimensional: KPI is of type Dimensional.
  • kpiName: Name of the KPI.
  • duration: Time, in seconds, used to process the KPI.
  • numberFilesProcessed: Number of KPIs processed. If the format is AVRO, it represents the number of records processed. If the format is CSV, it only represents the number of processed files.

aura_kpis_uploader_metrics

This metric is intended to store the information related to all processes executed by aura-kpis-uploader. It is stored as a Counter in Prometheus.

This KPI measures the number of KPI records processed. If the format is AVRO, it represents the number of records processed. If the format is CSV, it only represents the number of processed files.

Labels:

  • format: File format in which the KPI will be stored.
    • csv: File format will be CSV (deprecated).
    • avro: File format will be AVRO.
  • kpiType: Type of KPI:
    • entity: KPI is of type Entity.
    • dimensional: KPI is of type Dimensional.
  • kpiName: Name of the KPI.
  • duration: Time, in seconds, used to process the KPI.
  • numberFilesProcessed: Number of KPIs processed. If the format is AVRO, it represents the number of records processed. If the format is CSV, it only represents the number of processed files.

aura_kpis_uploader_errors

This metric is intended to store the information related to all errors generated by the execution of aura-kpis-uploader. It is stored as a Counter in Prometheus.

This KPI measures the number of errors produced when generating KPIs.

Labels:

  • type: Name of the method or function where the error occurred.
  • format: File format in which the KPI will be stored.
    • csv: File format will be CSV (deprecated).
    • avro: File format will be AVRO.
  • kpiType: Type of KPI:
    • entity: KPI is of type Entity.
    • dimensional: KPI is of type Dimensional.
  • kpiName: Name of the KPI.
  • url: If the error contains a file with more information stored in Azure Storage, this field contains the URL to download the file.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-kpis-uploader. It is stored as a Counter in Prometheus.

Labels:

  • error: Exception message that forced the unhandled error.

aura_server_unhandled_error is stored from Loquillo (7.5.0) release onwards.

5.10 - Aura NLP metrics

Aura NLP metrics

List of metrics available in Aura NLP

These metrics have been stored since the Heroes (7.0.0) release.

http_request_duration_seconds

This Prometheus metric is modelled as a summary whose value is the time spent until the remote host responds to an HTTP request.

Note that the value is a float number rounded to its third decimal. It is stored as a Summary in Prometheus.

This metric is intended to store the duration of outgoing requests in seconds.

Labels:

All label values are strings.

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
  • path: HTTP path of the incoming request.
  • status_code: the returned HTTP status code (as a string).

Value:

  • Request duration in seconds.

outgoing_request_duration_seconds

This Prometheus metric is modelled as a summary whose value is the time spent until the remote host responds to an HTTP request.

Note that the value is a float number rounded to its third decimal. It is stored as a Summary in Prometheus.

This metric is intended to store the duration of outgoing requests in seconds.

Labels:

All label values are strings.

  • method: HTTP method (GET, POST, etc.), a string in uppercase.
  • host: remote host that will receive the outgoing request.
  • path: HTTP path of the outgoing request.
  • status: the returned HTTP status code (as a string).
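
The formatting rules described for this metric (all label values as strings, method in uppercase, value rounded to its third decimal) can be sketched as follows. The helper name build_sample and the example URL are hypothetical, and taking the host label from the network location of the URL is an assumption.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit


def build_sample(method: str, url: str, status: int, duration_seconds: float) -> Tuple[Dict[str, str], float]:
    """Return the (labels, value) pair for one outgoing request."""
    parts = urlsplit(url)
    labels = {
        "method": method.upper(),   # a string in uppercase
        "host": parts.netloc,       # remote host receiving the request
        "path": parts.path or "/",  # query string is not part of the path
        "status": str(status),      # status code as a string
    }
    return labels, round(duration_seconds, 3)


# Hypothetical outgoing call that took a little over 123 milliseconds.
labels, value = build_sample("post", "https://example.com/v1/intents?x=1", 200, 0.123456)
```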

5.11 - T&C API metrics

Terms & Conditions API metrics

List of metrics available in Terms and Conditions API

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests handled by tac-api. It is stored as a Histogram in Prometheus, so every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in tac-api until its HTTP response is returned.

This metric allows measuring the behavior of the requests from any given endpoint:

  • The number of requests during a period of time
  • The average/min/max duration of these requests
  • Quantiles of the duration and the number of requests in a period

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response

This metric has been stored since the Barricada (5.0.0) release.
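
Because this metric is a Histogram, quantiles are not stored directly but estimated from cumulative buckets. The sketch below shows the usual linear-interpolation estimate inside the bucket that contains the requested rank. The bucket boundaries and counts are hypothetical, and the helper estimate_quantile is a simplified illustration rather than the exact routine used by the monitoring backend.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    The last bucket is expected to have an upper bound of +Inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in ordered:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # best guess: highest finite bound
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Hypothetical buckets: 50 requests under 0.1 s, 90 under 0.5 s, 100 under 1 s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, buckets)
```

The precision of such an estimate depends on how closely the bucket boundaries match the real distribution of durations.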

http_requests_total

This metric is intended to store information about all the requests handled by tac-api. It is stored as a Counter in Prometheus.

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • status_code: HTTP status code returned in the response.

This metric allows measuring the behavior of the requests from any given endpoint:

  • The number of requests over a period of time
  • The average/min/max duration of these requests
  • Quantiles

This metric has been stored since the Barricada (4.0.0) release.

http_in_flight_requests_total

This metric is intended to store the information related to all the concurrent HTTP requests being handled by tac-api in a period.

It is stored as a Gauge in Prometheus because it is a value that can go up and down at every moment.

This metric allows measuring the behavior of the requests from any given endpoint:

  • The number of requests during a period of time
  • The average/min/max duration of these requests
  • Quantiles of the duration and the number of requests in a period.

This metric has been stored since the Barricada (4.0.0) release.

tac_internal_errors

This metric is intended to store the number of internal errors happening in tac-api. It is stored as a Counter in Prometheus because its value can only go up.

Labels:

  • name: it will contain the exception message that forced the unhandled error.

This metric has been stored since the Barricada (4.0.0) release.

tac_service_acceptances_total

This metric is intended to store the number of acceptances of Terms and Conditions per service handled by tac-api. It is stored as a Counter in Prometheus because its value can only go up.

Labels:

  • name: it will contain the name of the accepted service. Currently, it could contain one of: aura, whatsapp-anonymous, whatsapp-authenticated
  • version: T&C version accepted by the user

This metric has been stored since the Barricada (4.0.0) release.

tac_service_updates_total

This metric is intended to store the number of updates of terms and conditions per service handled by tac-api. It is stored as a Counter in Prometheus because its value can only go up.

Labels:

  • name: name of the updated service. Currently (Iron Maiden) it could contain one of: aura, whatsapp-anonymous, whatsapp-authenticated
  • version: T&C version updated by the user

This metric has been stored since the Barricada (4.0.0) release.

tac_user_deletions_total

This metric is intended to store the number of deletions of terms and conditions per service handled by tac-api. It is stored as a Counter in Prometheus because its value can only go up.

This metric has been stored since the Barricada (4.0.0) release.

aura_component_version

This metric is intended to store the number of tac-api instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

  • version: version field in the package.json file included in the running docker container.
  • component: name of the component that is writing the metric.

This metric has been stored since the Iron Maiden (7.2.0) release.

5.12 - NLP provisioning metrics

NLP Provisioning metrics

List of metrics available in Aura NLP provisioning

These metrics have been stored since the Heroes (7.0.0) release.

Introduction

In the Aura NLP provisioning component, it is important to know at any moment how many processes have been restarted in relation to the total number of processes that are currently working on the different containers. This makes it possible to raise an alert on abnormal behavior and take the appropriate measures.

http_request_duration_seconds

This Prometheus metric is modelled as a summary whose value is the time spent until the remote host responds to an HTTP request.

Note that the value is a float number rounded to its third decimal. It is stored as a Summary in Prometheus.

This metric is intended to store the duration of outgoing requests in seconds.

Labels:

All label values are strings.

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
  • path: HTTP path of the incoming request.
  • status_code: the returned HTTP status code (as a string).

Value:

  • Request duration in seconds.

nlp_provisioning_killed_processes

This metric is intended to store the number of processes killed in each iteration of the Aura NLP provisioning execution. It is stored as a Gauge in Prometheus.

Value:

  • Number of worker processes killed in each iteration.

nlp_provisioning_alive_processes

This metric is intended to store the number of worker processes alive in each iteration of NLP Provisioning. It is stored as a Gauge.

Value:

  • Total alive processes.
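
The relation between killed and alive processes described in the introduction can be turned into a simple alert condition. The sketch below compares the two gauges for one iteration; the helper names and the 0.2 threshold are hypothetical and would need tuning for a real deployment.

```python
import math


def killed_ratio(killed: float, alive: float) -> float:
    """Killed worker processes relative to the alive ones in one iteration."""
    if alive <= 0:
        # Nothing alive: any kill is as bad as it gets, no kill is harmless.
        return math.inf if killed > 0 else 0.0
    return killed / alive


def is_abnormal(killed: float, alive: float, threshold: float = 0.2) -> bool:
    """True when the killed/alive ratio exceeds the (assumed) threshold."""
    return killed_ratio(killed, alive) > threshold


# Hypothetical readings of nlp_provisioning_killed_processes and
# nlp_provisioning_alive_processes for a single iteration.
alert = is_abnormal(killed=3, alive=10)
```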

nlp_provisioning_expected_alive_processes

This metric is intended to store the number of expected alive processes in the NLP Provisioning. It is stored as a Gauge.

Value:

  • Set gauge with total alive processes.
  • Decrease gauge with finished processes.

nlp_provisioning_container_killed_count

This metric is intended to count all the processes killed in Aura NLP provisioning. It is stored as a Counter in Prometheus.

Labels:

  • container: container URL.

Value:

  • Dead process ids (pids).

5.13 - Aura Complex Logic metrics

Aura Complex Logic metrics

List of metrics available in Aura Complex Logic Framework

These metrics have been stored since the Heroes (7.0.0) release.

http_request_duration_seconds

This Prometheus metric is modelled as a summary whose value is the time spent until the remote host responds to an HTTP request.

Note that the value is a float number rounded to its third decimal. It is stored as a Summary in Prometheus.

This metric is intended to store the duration of outgoing requests in seconds.

Labels:

All label values are strings.

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
  • path: HTTP path of the incoming request.
  • status_code: the responded HTTP status code (as a string).

Value:

  • Request duration in seconds

supervised_complex_logic_app_restarted_counter

This metric is intended to store a count of the restarted plugins.

It is stored as a Counter in Prometheus.

Labels:

All label values are strings.

  • app: clf
  • supervised_plugin: Supervised plugin class path.
  • plugin_status: Plugin response code status.
  • plugin_handler_name: Handler name.

complex_logic_app_http_requests

This metric is intended to store the HTTP requests of Aura Complex Logic plugins.

It is stored as a Counter in Prometheus.

Labels:

All label values are strings.

  • app: clf
  • plugin: plugin class path.
  • status_code: plugin response code status.
  • handler_name: handler name.

5.14 - Aura Context metrics

Aura Context metrics

List of metrics available in Aura Context

These metrics have been stored since the Heroes (7.0.0) release.

http_request_duration_seconds

This Prometheus metric is modelled as a Summary whose value is the time elapsed until the remote host responds to an HTTP request.

Note that the value is a float number rounded to its third decimal. It is stored as a Summary in Prometheus.

This metric is intended to store the duration of outgoing requests in seconds.

Labels:

All label values are strings.

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.).
  • path: HTTP path of the incoming request.
  • status_code: the responded HTTP status code (as a string).

Value:

  • Request duration in seconds.

database_request_duration_seconds

This metric is intended to store the duration of database requests in seconds.

It is stored as a Summary in Prometheus.

Labels:

All label values are strings.

  • database: database name (Redis or Mongo).
  • operation: database operation (i.e., update, create, get_by_date, get_last_n, get_by_corr).

Value:

  • Request duration in seconds.
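One possible way to collect such durations is to wrap each database call in a timer that records the elapsed seconds together with the two labels. The sketch below is an assumption made for illustration: the `timed` helper and the `observations` list are hypothetical names, and the placeholder computation stands in for a real Redis or Mongo call.

```python
import time
from contextlib import contextmanager

observations = []  # (database, operation, seconds) tuples


@contextmanager
def timed(database, operation):
    """Record how long the wrapped block takes, labelled by database and operation."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the database call raises an exception.
        observations.append((database, operation, time.perf_counter() - start))


with timed("Redis", "get_last_n"):
    sum(range(1000))  # placeholder for the real database call

print(observations[0][:2])
```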

5.15 - Aura File Manager metrics

Aura File Manager metrics

List of metrics available in Aura File Manager

http_request_duration_seconds

This metric is intended to store the information related to all the incoming HTTP requests received by aura-file-manager.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

It measures the duration since the request lands in aura-file-manager until its HTTP response is returned, indicating to the client that Aura is processing the request to obtain a proper answer.

The metric allows measuring the behavior of the requests from any given endpoint:

  • The number of requests during a time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status_code: HTTP status code returned in the response
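To make the per-endpoint measurements concrete, the sketch below groups a handful of samples by path and derives the number of requests and their average, minimum and maximum duration. The paths and durations are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (path, duration in seconds) samples from one time window.
samples = [("/files", 0.5), ("/files", 1.5), ("/status", 0.25)]

by_path = defaultdict(list)
for path, seconds in samples:
    by_path[path].append(seconds)

report = {
    path: {
        "count": len(values),
        "avg": mean(values),
        "min": min(values),
        "max": max(values),
    }
    for path, values in by_path.items()
}
print(report["/files"])
```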

outgoing_request_duration_seconds

This metric is intended to store the processing time related to all the outgoing HTTP requests made by aura-file-manager.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

This metric allows measuring the behavior of the requests to any given endpoint:

  • The number of requests during a time
  • The average/min/max duration of these requests

Labels:

  • method: HTTP method used by the request being stored (GET, POST, PUT, DELETE, etc.)
  • host: host and domain where the request is being sent
  • path: specific endpoint of the request
  • status: HTTP status code returned in the response

outgoing_message_duration_seconds

This metric is intended to store the processing time of aura-bot requests arriving at aura-file-manager.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-file-manager is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the user is sent back to the client callback. This metric measures the duration since the request lands in aura-file-manager until the last message of its answer is sent to the client callback.
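The difference between the HTTP response and the end of processing can be sketched as follows. The `MessageTimer` class, its method names and the fake clock are hypothetical and only illustrate the measurement window described above; they are not taken from the aura-file-manager code.

```python
class MessageTimer:
    """Measures from the arrival of a request to the last message of its answer."""

    def __init__(self, clock):
        self.clock = clock
        self.started = {}
        self.durations = {}

    def request_landed(self, request_id):
        self.started[request_id] = self.clock()

    def http_response_returned(self, request_id):
        # Nothing is recorded here: the request is still being processed.
        pass

    def last_message_sent(self, request_id):
        # The duration ends when the last message reaches the client callback.
        self.durations[request_id] = self.clock() - self.started.pop(request_id)


ticks = iter([10.0, 12.0])  # fake clock readings, in seconds
timer = MessageTimer(lambda: next(ticks))
timer.request_landed("req-1")
timer.http_response_returned("req-1")
timer.last_message_sent("req-1")
print(timer.durations["req-1"])
```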

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status code returned in the response.
  • origin: aura-bot

incoming_message_duration_seconds

This metric is intended to store the processing time of aura-bot requests arriving at aura-file-manager.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

As aura-file-manager is an asynchronous server, the processing of a request does not end when the HTTP response is returned, but when the proper answer for the channel or skill is sent back to the client callback. This metric measures the duration from when the request arrives at aura-file-manager until it is processed to send the response.

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status code returned in the response.
  • originStatus: status sent in the body of the response when an error occurs.
  • origin: aura-bot

aura_component_version

This metric is intended to store the number of aura-file-manager instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

  • version: version field in the package.json file included in the running docker container.
  • component: name of the component that is writing the metric.
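As an illustration of what this gauge conveys, the sketch below counts the running instances per version. The pod names and version strings are hypothetical, and the way each pod reports its version is an assumption.

```python
from collections import Counter

# Hypothetical pods and the package.json version each one is running.
pods = {
    "aura-file-manager-0": "1.2.0",
    "aura-file-manager-1": "1.2.0",
    "aura-file-manager-2": "1.3.0",
}

instances_per_version = Counter(pods.values())
print(dict(instances_per_version))
```

During a rolling update, more than one version has a non-zero count until every pod has been replaced.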

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-file-manager.

It is stored as a Counter in Prometheus.

Labels:

  • error: exception message that forced the unhandled error.

aura_token_generate

This metric is intended to store the time aura-file-manager takes to get or refresh the kernel token.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

Labels:

  • path: specific endpoint of the request.
  • httpStatus: HTTP status code returned in the response.
  • originStatus: status sent by Direct Line in the body of the response when an error occurs.
  • origin: kernel client identifier

file_validation_duration_seconds

This metric is intended to store the validation time of a file.

It is stored as a Summary in Prometheus, so every sample, besides the defined labels, also includes its duration.

Labels:

  • path: specific endpoint of the request.
  • code: OK when file is valid.
  • origin: specific endpoint of the request.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

  • moduleId: Id of the module.
  • status: OK or ERROR

5.16 - Aura Redis MongoDB sync metrics

Aura Redis MongoDB Synchronizer metrics

List of metrics available in aura-redis-mongo-sync (ARMS)

aura_component_version

This metric is intended to store the number of aura-redis-mongo-sync instances (pods) running each version of the code. It is stored as a Gauge in Prometheus.

Labels:

  • version: version field in the package.json file included in the running docker container.
  • component: name of the component that is writing the metric.

aura_server_unhandled_error

This metric is intended to store the number of unhandled errors happening in aura-redis-mongo-sync.

It is stored as a Counter in Prometheus.

Labels:

  • error: exception message that forced the unhandled error.

redis_mongo_sync_duration_milliseconds

This metric measures the data upload time from the service to the Mongo database.

It is stored as a Histogram in Prometheus, so every sample, besides the defined labels, also includes its duration.

The aura-redis-mongo-sync service contains a data collector that helps the event service move stale data from Redis to MongoDB. This collector sends the data in packets to optimize performance. This metric measures the time MongoDB takes to process the packet.

Labels:

  • status: HTTP status returned in the response. Values: success.
    • success: if the status is success, the time is stored.
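A Prometheus Histogram counts observations in cumulative buckets, each one covering every value less than or equal to its upper bound. The sketch below shows that bucketing for a few packet durations. The bucket bounds and the durations are assumptions made for illustration, since the real bounds are not listed in this document.

```python
import bisect

# Hypothetical bucket upper bounds in milliseconds.
BOUNDS = [50, 100, 250, 500, 1000]


def bucket_counts(durations_ms):
    """Return cumulative counts per upper bound, followed by the +Inf bucket."""
    counts = [0] * (len(BOUNDS) + 1)
    for duration in durations_ms:
        # Index of the first bucket whose upper bound is >= duration.
        counts[bisect.bisect_left(BOUNDS, duration)] += 1
    cumulative, running = [], 0
    for count in counts:
        running += count
        cumulative.append(running)
    return cumulative


print(bucket_counts([30, 100, 120, 2000]))
```

The last entry is the +Inf bucket, which always equals the total number of observations.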

redis_mongo_synced_items_total

This metric is intended to store the registers synchronized between Redis and MongoDB by events.

It is stored as a Counter in Prometheus.

Labels:

  • type: register type. Values: event, active_context
    • event: Items synchronized by event.
    • active_context: Items synchronized by active context process.

redis_mongo_synced_errors

This metric is intended to store the errors that have occurred in the synchronization.

It is stored as a Counter in Prometheus.

Labels:

  • error: Values : create, syncData, executeBulk.
    • create: If the error occurred when creating the service.
    • syncData: If the error occurred when synchronizing the data.
    • executeBulk: If the error occurred when uploading the data to MongoDB in bulk mode.

redis_mongo_sync_configuration_settings

This metric contains the service configuration data.

It is stored as a Gauge in Prometheus.

Labels:

  • setting_name: Values: shard_count, pod_count, active_context_ttl_seconds, redis_cache_ttl_seconds.
    • shard_count: Current shard used to distribute the data to synchronize between pods.
    • pod_count: Current number of services of aura-redis-mongo-sync.
    • active_context_ttl_seconds: Time interval to run the data collector.
    • redis_cache_ttl_seconds: Time in seconds that will be set to the context elements in the Redis cache.

services_status

This metric is intended to store the number of successful or failed checks of the server modules. It is stored as a Counter in Prometheus.

Labels:

  • moduleId: Id of the module.
  • status: OK or ERROR

6 - Aura dashboards

Aura dashboards

Discover the dashboards that can be generated through the different tools used for Aura monitoring in order to track and analyze data

Introduction

Dashboards are reporting tools that aggregate and display metrics and key indicators, so they can be examined at a glance by all possible audiences.

These dashboards allow data interpretation and provide an overall view for the evaluation of Aura’s performance, thus improving decision-making. Each component has its own dashboard showing its current behavior, and there is a single dashboard for an Aura overview.

There are two types of dashboards for Aura metrics (Prometheus) that are generated in Grafana:

6.1 - Aura system dashboards

Aura system dashboards

Grafana dashboards with metrics related to the performance of Aura system

Introduction

Currently, these are the available Aura system dashboards in Grafana based on metrics stored in Prometheus:

6.1.1 - Alertmanager dashboard

Alertmanager dashboard

Information provided by Alertmanager dashboards

Panels

Received alerts rate

It shows a time series with the received alerts rate aggregated by one minute.

The x-axis shows the time series and the y-axis shows received alerts rate.

The queries used to get the panel information are:

sum(rate(prometheus_notifications_alertmanagers_discovered[1m])) by(status)

An example of this panel is shown below:

The available metrics are defined in the following sections.

Successful notification rate

It shows a time series with the successful notifications rate aggregated by one minute.

The x-axis shows the time series and the y-axis shows the successful notifications rate.

The queries used to get the panel information are:

sum(rate(prometheus_notifications_sent_total[1m])) by(integration)
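For readers unfamiliar with rate(), the sketch below approximates what it computes for one counter series over a window: the per-second increase, corrected for counter resets. It is a simplification, because Prometheus also extrapolates the result to the window boundaries, and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase over a list of (timestamp, value) counter samples.

    Simplified: counter resets are handled, but the extrapolation to the
    window boundaries that Prometheus applies is skipped.
    """
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter was reset, so counting restarts from zero.
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical samples scraped every 15 seconds, with a reset before t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 30), (60, 60)]
print(simple_rate(window))
```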

An example of this panel is shown below:

Failed notifications rate

It shows a time series with the failed notifications rate aggregated by one minute.

The x-axis shows the time series and the y-axis shows the failed notifications rate.

The queries used to get panel information are:

sum(rate(prometheus_notifications_errors_total[1m])) by(integration)

An example of this panel is shown below:

CPU usage rate

It shows a time series with the CPU usage rate aggregated by one minute. It also shows the current minimum, maximum and average CPU consumption of Alertmanager.

The x-axis shows the time series and the y-axis shows the CPU usage rate.

The queries used to get panel information are:

sum(rate(container_cpu_usage_seconds_total{container="alertmanager"}[1m])) by (pod_name)

An example of this panel is shown below:

Memory usage

It shows a time series with the memory usage. It also shows the current minimum, maximum and average memory consumption of alertmanager.

The x-axis shows the time series and the y-axis shows the memory usage.

The queries used to get panel information are:

sum (container_memory_working_set_bytes{container="alertmanager"}) by (pod_name)

An example of this panel is shown below:

Pods network I/O

It shows a time series with the network I/O average aggregated by one minute. It also shows the current minimum, maximum and average network I/O.

The x-axis shows the time series and the y-axis shows the network usage.

The queries used to get panel information are:

sum (rate (container_network_receive_bytes_total{pod!="",pod=~"alertmanager.*"}[1m])) by (pod)
- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~"alertmanager.*"}[1m])) by (pod)

An example of this panel is shown below:

6.1.2 - Elasticsearch dashboard

Elasticsearch dashboard

Information provided by Elasticsearch dashboard

Introduction

The Elasticsearch dashboard monitors multiple data-, service- and system-related metrics.

The different graphs are shown in the following sections:

  • Cluster graphs
  • Shards graphs
  • System graphs
  • Documents graphs
  • Total operations stats graphs
  • Elasticsearch times graphs
  • Caches graphs
  • Thread pool graphs
  • JVM garbage collection graphs

Cluster graphs

The current section includes cluster related graphs.

Health status

Colour-coded indicator of cluster health.

Metrics:

((sum(elasticsearch_cluster_health_status{color="green"})*2)+sum(elasticsearch_cluster_health_status{color="yellow"}))/count(elasticsearch_index_stats_up)
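The query above reduces the cluster colour to a single number. The sketch below reproduces the arithmetic under two assumptions: each reporting node contributes a value of 1 for its current colour, and the denominator equals the number of reporting nodes. The `health_score` function is a hypothetical helper.

```python
def health_score(statuses):
    """statuses: one colour per reporting node ("green", "yellow" or "red")."""
    green = sum(1 for status in statuses if status == "green")
    yellow = sum(1 for status in statuses if status == "yellow")
    # Green counts double, yellow counts once, red counts as zero.
    return (green * 2 + yellow) / len(statuses)


print(health_score(["green", "green", "green"]))
print(health_score(["yellow", "yellow", "yellow"]))
print(health_score(["red", "red", "red"]))
```

Under these assumptions the score is 2 for a green cluster, 1 for a yellow one and 0 for a red one, which the panel can map to colours.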

Nodes

Number of nodes.

Metrics:

count(elasticsearch_index_stats_up)

Data nodes

Number of data nodes per node.

Metrics:

sum(elasticsearch_cluster_health_number_of_data_nodes{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Pending tasks

Pending tasks per node.

Metrics:

sum(elasticsearch_cluster_health_number_of_pending_tasks{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Graph visual

Shards graphs

Shards related graphs.

Active primary shards

Number of active primary shards per node.

Metrics:

sum(elasticsearch_cluster_health_active_primary_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Active shards

Number of active shards per node.

Metrics:

sum(elasticsearch_cluster_health_active_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Initializing shards

Number of shards initializing per node.

Metrics:

sum(elasticsearch_cluster_health_initializing_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Relocating shards

Number of relocating shards per node.

Metrics:

sum(elasticsearch_cluster_health_relocating_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Unassigned shards

Number of unassigned shards per node.

Metrics:

sum(elasticsearch_cluster_health_delayed_unassigned_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Graph visual

System graphs

System related graphs.

CPU usage

Percentage of used CPU on master and data nodes.

Metrics: It includes two metrics:

  • Master node CPU usage
elasticsearch_process_cpu_percent{cluster="elasticsearch",es_master_node="true",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}
  • Data nodes CPU usage:
elasticsearch_process_cpu_percent{cluster="elasticsearch",es_data_node="true",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

JVM memory usage

Memory used by JVM graph in bytes.

Metrics:

It includes three metrics:

  • Used memory
elasticsearch_jvm_memory_used_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}
  • Committed memory
elasticsearch_jvm_memory_committed_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}
  • Max memory
elasticsearch_jvm_memory_max_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Disk usage

Fraction of the data disk in use, computed as one minus the ratio of available bytes to total bytes.

Metrics:

1-(elasticsearch_filesystem_data_available_bytes{cluster="elasticsearch"}/elasticsearch_filesystem_data_size_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"})
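The arithmetic of this query can be checked with a small sketch. The function name and the byte figures are hypothetical.

```python
def disk_usage_fraction(available_bytes, size_bytes):
    """Fraction of the data filesystem in use: 1 - available / size."""
    return 1 - available_bytes / size_bytes


GIB = 1024 ** 3
# A 100 GiB data disk with 25 GiB still available is 75% full.
print(disk_usage_fraction(25 * GIB, 100 * GIB))
```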

Network usage

Bytes rate sent and received, aggregated by one minute.

Metrics: It includes two metrics:

  • Sent bytes
irate(elasticsearch_transport_tx_size_bytes_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Received bytes
irate(elasticsearch_transport_rx_size_bytes_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
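Unlike rate(), which averages over the whole window, irate() only uses the last two samples in the window, so it reacts faster to short bursts. The sketch below contrasts both on hypothetical samples. The helper names are illustrative, and the window average is simplified (no reset handling or extrapolation).

```python
def instant_rate(samples):
    """Per-second rate from the last two (timestamp, value) samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    # After a counter reset, the last value is the increase since the reset.
    delta = v_last - v_prev if v_last >= v_prev else v_last
    return delta / (t_last - t_prev)


def window_average(samples):
    """Average per-second increase between the first and last samples."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical byte counter with a burst in the last 30 seconds.
window = [(0, 0), (30, 300), (60, 3300)]
print(instant_rate(window))
print(window_average(window))
```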

Graph visual

Documents graphs

Documents state related graphs.

Documents count

Number of documents in cluster.

Metrics:

elasticsearch_indices_docs{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Documents indexed rate

Rate of indexed documents, aggregated by one minute.

Metrics:

irate(elasticsearch_indices_indexing_index_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Documents deleted rate

Rate of deleted documents, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_docs_deleted{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Documents merged rate

Rate of merged documents, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_merges_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

Total operations stats graphs

Data related to total operations.

Total operations rate

Total operations number rate, aggregated by one minute.

Metrics: It includes six metrics:

  • Indexing index
irate(elasticsearch_indices_indexing_index_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Search queries
irate(elasticsearch_indices_search_query_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Search fetch
irate(elasticsearch_indices_search_fetch_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Merges
irate(elasticsearch_indices_merges_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Refresh
irate(elasticsearch_indices_refresh_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Flush
irate(elasticsearch_indices_flush_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Total operations time

Time rate for the different operations in milliseconds, aggregated by one minute.

Metrics: It includes six metrics:

  • Indexing index
irate(elasticsearch_indices_indexing_index_time_seconds_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Search queries
irate(elasticsearch_indices_search_query_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Search fetch
irate(elasticsearch_indices_search_fetch_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Merges
irate(elasticsearch_indices_merges_total_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Refresh
irate(elasticsearch_indices_refresh_total_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Flush
irate(elasticsearch_indices_flush_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

Elasticsearch times graphs

Graphs related to elapsed times of different actions.

Query time

Time rate for search query operations in seconds, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_search_query_time_seconds{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m]) 

Indexing time

Time rate for indexing index operations in seconds, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_indexing_index_time_seconds_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Merging time

Time rate for merge operations in seconds, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_merges_total_time_seconds_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Caches graphs

Graphs related to caches metrics.

Field data memory size

Field data memory size in bytes.

Metrics:

elasticsearch_indices_fielddata_memory_size_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Field data evictions

Rate of field data evicted, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_fielddata_evictions{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Query cache size

Bytes of memory occupied by cached queries.

Metrics:

elasticsearch_indices_query_cache_memory_size_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Query cache evictions

Rate of queries evicted, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_query_cache_evictions{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

Thread pool graphs

Graphs related to the thread pool.

Operations rejected

Rate of rejected operations, aggregated by one minute.

Metrics:

irate(elasticsearch_thread_pool_rejected_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Operations queued

Number of queued operations.

Metrics:

elasticsearch_thread_pool_queue_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Threads active

Number of active threads.

Metrics:

elasticsearch_thread_pool_active_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Operations completed

Rate of completed operations, aggregated by one minute.

Metrics:

irate(elasticsearch_thread_pool_completed_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

JVM Garbage collection graphs

Graphs related to JVM garbage collector activity.

GC count

Rate of GC count, aggregated by one minute.

Metrics:

rate(elasticsearch_jvm_gc_collection_seconds_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

GC time

Rate of GC execution time, aggregated by one minute.

Metrics:

rate(elasticsearch_jvm_gc_collection_seconds_sum{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

6.1.3 - Fluent bit dashboard

Fluent bit dashboard

Information provided by Fluent bit dashboard

Introduction

The Fluent Bit dashboard monitors system metrics related to Fluent Bit.

The available metrics are defined in the following sections.

Input bytes

Input bytes rate, aggregated by one minute.

Metrics:

rate(fluentbit_input_bytes_total[1m])

Graph visual

Output bytes

Output bytes rate, aggregated by one minute.

Metrics:

rate(fluentbit_output_proc_bytes_total[1m])

Graph visual

Retries/fails

Rate of retries and failures, aggregated by one minute.

Metrics:
It includes two metrics:

  • Retries rate
rate(fluentbit_output_retries_total[1m])
  • Fails rate
rate(fluentbit_output_retries_failed_total[1m])

Graph visual

Errors

Rate of output errors, aggregated by one minute.

Metrics:

rate(fluentbit_output_errors_total[1m])

Graph visual

6.1.4 - Kubernetes cluster monitoring dashboard

Kubernetes cluster monitoring dashboard

Information provided by Kubernetes cluster monitoring dashboard

Introduction

The Kubernetes cluster monitoring dashboard monitors system- and network-related data from Kubernetes clusters.

The available metrics are defined in the following sections.

Network I/O pressure graph

Rate of total received/sent data on all cluster containers, in bytes and aggregated by one minute.

Metrics:
It includes two metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m]))
  • Sent bytes (negative value)
- sum (rate (container_network_transmit_bytes_total{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m]))

Graph visual

Total usage

Graphs with different system parameters usage.

Cluster memory usage

It is composed of three graphs:

  • Memory usage, showing percentage of used memory
  • Used, showing used memory
  • Total, showing total memory

Metrics:
It includes three metrics:

  • Memory usage percentage
sum (container_memory_working_set_bytes{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) / 
sum (machine_memory_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) * 100
  • Used memory
sum (container_memory_working_set_bytes{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})
  • Total cluster memory
sum (machine_memory_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"})
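The percentage is the cluster-wide working set divided by the cluster-wide machine memory, multiplied by 100. A minimal sketch with hypothetical node names and sizes:

```python
GIB = 1024 ** 3

# Hypothetical per-node figures in bytes: (working set, machine memory).
nodes = {
    "node-a": (6 * GIB, 16 * GIB),
    "node-b": (10 * GIB, 16 * GIB),
}

used = sum(working_set for working_set, _ in nodes.values())
total = sum(machine for _, machine in nodes.values())
usage_percent = used / total * 100
print(usage_percent)
```

Summing before dividing weights each node by its memory size, which differs from averaging the per-node percentages when nodes have different sizes.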

Cluster CPU usage

It is composed of three graphs:

  • CPU usage, showing percentage of used CPU cores, aggregated by one minute
  • Used, showing used CPU cores, aggregated by one minute
  • Total, showing total CPU cores

Metrics:
It includes three metrics:

  • CPU usage percentage
sum (rate (container_cpu_usage_seconds_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) / 
sum (machine_cpu_cores{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) * 100
  • Used CPUs
sum (rate (container_cpu_usage_seconds_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m]))
  • Total cluster CPUs
sum (machine_cpu_cores{kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Cluster filesystem usage

It is composed of three graphs:

  • Filesystem usage, showing percentage of used filesystem space
  • Used, showing used filesystem space
  • Total, showing total filesystem space

Metrics:
It includes three metrics:

  • Filesystem usage
sum (container_fs_usage_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) / 
sum (container_fs_limit_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) * 100
  • Used
sum (container_fs_usage_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})
  • Total
sum (container_fs_limit_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Graph visual

Pods CPU usage

CPU usage rate, classified by pod and aggregated by one minute.

Metrics:

sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Graph visual

Containers CPU usage

CPU usage rate, classified by container and aggregated by one minute.

Metrics:
It includes two metrics:

  • Containers with “k8s_”
sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)
  • Containers without “k8s_”
sum (rate (container_cpu_usage_seconds_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)

Graph visual

All processes CPU usage

Total CPU usage rate, aggregated by one minute.

Metrics:

sum (rate (container_cpu_usage_seconds_total{id!="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (id)

Graph visual

Pods memory usage

Memory usage, classified by pod.

Metrics:

sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (pod_name)

Graph visual

Containers memory usage

Memory usage, classified by container.

Metrics:
It includes two metrics:

  • Containers with “k8s_”
sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",container_name!="POD",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (container_name, pod_name)
  • Containers without “k8s_”
sum (container_memory_working_set_bytes{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname, name, image)

Graph visual

All processes memory usage

Total memory usage, classified by pod.

Metrics:

sum (container_memory_working_set_bytes{pod_name!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (pod_name)

Graph visual

Pods network I/O

Total network received/sent usage rate, classified by pod and aggregated by one minute.

Metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)
  • Sent bytes
- sum (rate (container_network_transmit_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Graph visual

Containers network I/O

Total network received/sent usage rate, classified by container and aggregated by one minute.

Metrics:

  • Received bytes, containers with “k8s_”
sum (rate (container_network_receive_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (container_name, pod_name)
  • Sent bytes, containers with “k8s_”
- sum (rate (container_network_transmit_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (container_name, pod_name)
  • Received bytes, containers without “k8s_”
sum (rate (container_network_receive_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)
  • Sent bytes, containers without “k8s_”
- sum (rate (container_network_transmit_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)

Graph visual

All processes network I/O

Total network received/sent usage rate, aggregated by one minute.

Metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{pod_name!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)
  • Sent bytes
- sum (rate (container_network_transmit_bytes_total{pod_name!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Graph visual

Pods disk I/O

Total disk reads/writes rate, classified by pod and aggregated by one minute.

Metrics:

  • Read bytes, pods with device
sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (pod_name)
  • Written bytes, pods with device
sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (pod_name)
  • Read bytes, pods without device
sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (pod_name)
  • Written bytes, pods without device
sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (pod_name)

Graph visual

Containers disk I/O

Total disk reads/writes rate, classified by container and aggregated by one minute.

Metrics:

  • Read bytes, containers with device
sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (container_name, pod_name)
  • Written bytes, containers with device
sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (container_name, pod_name)
  • Read bytes, containers without device
sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (container_name, pod_name)
  • Written bytes, containers without device
sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (container_name, pod_name)
  • Read bytes, containers without “k8s_”
sum(rate(container_fs_reads_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)
  • Written bytes, containers without “k8s_”
sum(rate(container_fs_writes_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)
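
The label matchers in the queries above (=, !=, =~ and !~) decide which series each metric includes. The Python sketch below mimics their behavior on hypothetical label sets. It is an approximation: the helper name matches is an assumption, and Python's re module stands in for the RE2 syntax that Prometheus uses. Two details are worth noting: regular expressions are fully anchored, and a series that lacks a label is treated as if the label were empty.

```python
import re
from typing import Dict


def matches(labels: Dict[str, str], name: str, op: str, value: str) -> bool:
    """Evaluate one PromQL-style label matcher against a series' labels."""
    actual = labels.get(name, "")  # a missing label behaves like an empty value
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # fully anchored
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


# Hypothetical series label sets.
with_device = {"name": "k8s_app_pod-1", "device": "/dev/sda1"}
without_device = {"name": "k8s_app_pod-1"}
plain_container = {"name": "registry", "device": "/dev/sda1"}

selected_by_device = matches(with_device, "device", "!=", "")
selected_by_empty_device = matches(without_device, "device", "=", "")
```

Under these semantics, device!="" keeps only series that carry a device label, while device="" keeps series whose device label is empty or absent.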

Graph visual

6.1.5 - Kubernetes cron and batch job monitoring dashboard

Kubernetes cron and batch job monitoring dashboard

Information provided by cron and batch job monitoring dashboard

Introduction

Kubernetes cron and batch job monitoring dashboard monitors success/fail rates for cron/batch jobs.

The available metrics are defined in the following sections.

Jobs succeeded

Successfully executed jobs.

Metrics:

kube_job_status_succeeded

Graph visual

Jobs failed

Failed job executions.

Metrics:

kube_job_status_failed

Graph visual

6.1.6 - Kubernetes nodes dashboard

Kubernetes nodes dashboard

Information provided by Kubernetes nodes dashboard

Introduction

Kubernetes nodes dashboard monitors the general system status of the nodes.

The available metrics are defined in the following sections.

CPU usage

CPU usage percentage, aggregated by one minute.

Metrics:

sum (rate (container_cpu_usage_seconds_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname) / sum (machine_cpu_cores{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname) * 100
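
The query above sums both sides by hostname, divides them, and scales the result to a percentage. A minimal Python sketch of that arithmetic follows. The host names, values and function names are hypothetical and only illustrate how the two sides are paired by the kubernetes_io_hostname label.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sum_by_host(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum sample values per hostname, like sum (...) by (kubernetes_io_hostname)."""
    totals: Dict[str, float] = defaultdict(float)
    for host, value in samples:
        totals[host] += value
    return dict(totals)


def cpu_percent(used: Dict[str, float], cores: Dict[str, float]) -> Dict[str, float]:
    """Divide per-host CPU usage by per-host core count and scale to a percentage.

    Only hosts present on both sides produce a result, mirroring how the
    division pairs series that carry the same label values.
    """
    return {host: used[host] / cores[host] * 100 for host in used if host in cores}


# Hypothetical values: CPU seconds consumed per second, and core counts.
usage_rate = sum_by_host([("node-a", 1.0), ("node-b", 0.5)])
core_count = sum_by_host([("node-a", 4), ("node-b", 2), ("node-c", 8)])
percent = cpu_percent(usage_rate, core_count)
```

In this example node-c has no usage sample, so it does not appear in the result. The memory usage query of this dashboard follows the same pattern with bytes instead of CPU seconds.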

Graph visual

Memory usage

Memory usage percentage.

Metrics:

sum (container_memory_working_set_bytes{id="/", kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname) / sum (machine_memory_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname) * 100

Graph visual

Disk I/O

Disk read/written data in bytes.

Metrics:
It includes two metrics:

  • Read bytes
sum (container_fs_reads_bytes_total{id="/", kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname)
  • Written bytes
sum (container_fs_writes_bytes_total{id="/", kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname)

Graph visual

Network I/O

Network received/sent data in bytes, aggregated by one minute.

Metrics:
It includes two metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname)
  • Sent bytes
- sum (rate (container_network_transmit_bytes_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname)

Graph visual

6.1.7 - Kubernetes services dashboard

Kubernetes services dashboard

Information provided by Kubernetes services dashboard

Introduction

Kubernetes services dashboard monitors system metrics related to services/pods.

The available metrics are defined in the following sections.

Service CPU usage

Service CPU usage rate, aggregated by one minute.

Metrics:

sum(rate(container_cpu_usage_seconds_total{container!="",container=~".*"}[1m])) by (container)

Graph visual

Pods CPU usage

Pods CPU usage rate, aggregated by one minute.

Metrics:
It includes two metrics:

  • CPU usage by pod and container
sum (rate (container_cpu_usage_seconds_total{container!="",container=~".*"}[1m])) by (container, pod)
  • CPU usage by container and instance
sum (rate (container_cpu_usage_seconds_total{container!="",container=~".*"}[1m])) by (instance, container)

Graph visual

Service memory usage

Service memory usage in bytes.

Metrics:

sum (container_memory_working_set_bytes{container!="",container=~".*"}) by (container)

Graph visual

Pods memory usage

Pods memory usage in bytes, and memory usage rate aggregated by one minute.

Metrics:
It includes four metrics:

  • memory usage classified by pod and container
sum (container_memory_working_set_bytes{container!="",container=~".*"}) by (container, pod)
  • memory usage classified by container and instance
sum (container_memory_working_set_bytes{container!="",container=~".*"}) by (instance, container)
  • memory usage rate, classified by pod and container, and aggregated by one minute
sum (rate (container_memory_working_set_bytes{container!="",container=~".*"}[1m])) by (container, pod)
  • memory usage rate, classified by instance and container, and aggregated by one minute
sum (rate (container_memory_working_set_bytes{container!="",container=~".*"}[1m])) by (instance, container)

Graph visual

Service network I/O

Network received/sent data rate, aggregated by one minute.

Metrics:
It includes two metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{pod!="",pod=~".*.*"}[1m])) by (pod)
  • Sent bytes
- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~".*.*"}[1m])) by (pod)

Graph visual

Pods network I/O

Pods received/sent data rate in bytes, aggregated by one minute.

Metrics:
It includes four metrics:

  • Received bytes classified by pod
sum (rate (container_network_receive_bytes_total{pod!="",pod=~".*.*"}[1m])) by (name, pod)
  • Sent bytes classified by pod
- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~".*.*"}[1m])) by (container, pod)
  • Received bytes classified by container and instance
sum (rate (container_network_receive_bytes_total{pod!="",pod=~".*.*"}[1m])) by (instance, container, image)
  • Sent bytes classified by container and instance
- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~".*.*"}[1m])) by (instance, container, image)

Graph visual

6.1.8 - Kubernetes storage monitoring dashboard

Kubernetes storage monitoring dashboard

Information provided by Kubernetes storage monitoring dashboard

Introduction

Kubernetes storage monitoring dashboard monitors storage related metrics.

The available metrics are defined in the following sections.

Used space

Kubelet volumes and container filesystems data usage in bytes.

Metrics:
It includes two metrics:

  • Kubelet volumes used bytes
kubelet_volume_stats_used_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}
  • Container filesystem usage in bytes
container_fs_usage_bytes{image!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}

Graph visual

PVC used space %

PersistentVolumeClaim used space percentage.

Metrics:

(kubelet_volume_stats_used_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"} / kubelet_volume_stats_capacity_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"})

Graph visual

Local used space %

Containers assigned space usage percentage.

Metrics:

(container_fs_usage_bytes{image!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"} / container_fs_limit_bytes{image!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Graph visual

Used inodes

Kubelet PersistentVolumeClaim volumes total used inodes.

Metrics:

kubelet_volume_stats_inodes_used{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"}

Graph visual

PVC used inodes %

Kubelet PersistentVolumeClaim volumes inodes usage percentage.

Metrics:

(kubelet_volume_stats_inodes_used{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"} / kubelet_volume_stats_inodes{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"})

Graph visual

6.1.9 - NLP provisioning dashboard

NLP provisioning dashboard

Information provided by NLP provisioning dashboard

Panels

Expected Killed Alive

Number of expected, killed and alive provisioning processes.

The queries used to get the panel information are:

nlp_provisioning_expected_alive_processes
nlp_provisioning_killed_processes
nlp_provisioning_alive_processes
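
The panel queries can also be evaluated outside Grafana through the instant query endpoint of the Prometheus HTTP API (/api/v1/query). The Python sketch below only builds the request URLs and does not perform any network call. The base address and the helper name instant_query_url are hypothetical and must be adapted to the actual deployment.

```python
from typing import List
from urllib.parse import urlencode

# Hypothetical address; replace it with the Prometheus endpoint of your deployment.
PROMETHEUS_URL = "http://prometheus.example:9090"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL of an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": promql})  # percent-encodes braces, quotes, etc.
    return base_url.rstrip("/") + "/api/v1/query?" + params


panel_queries: List[str] = [
    "nlp_provisioning_expected_alive_processes",
    "nlp_provisioning_killed_processes",
    "nlp_provisioning_alive_processes",
]
urls = [instant_query_url(PROMETHEUS_URL, query) for query in panel_queries]
```

Each URL can then be requested with any HTTP client; the response is a JSON document with the current value of every matching series.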

An example of this panel is shown below:

Killed by container

Time series with the killed processes by container.

The x-axis shows the time series and the y-axis shows the number of killed processes by container.

The queries used to get the panel information are:

nlp_provisioning_container_killed_count_total

An example of this panel is shown below:

Killed processes

Time series with the total killed processes.

The x-axis shows the time series and the y-axis shows the number of killed processes.

The queries used to get the panel information are:

nlp_provisioning_killed_processes

An example of this panel is shown below:

Alive processes VS Expected alive processes

Time series with the ratio between alive processes and expected alive processes.

The x-axis shows the time series and the y-axis shows the ratio between alive and expected alive processes.

The queries used to get the panel information are:

nlp_provisioning_alive_processes / nlp_provisioning_expected_alive_processes
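
To make the meaning of this ratio concrete, the following Python sketch computes it for a pair of values. The function name alive_ratio and the inputs are hypothetical. The zero-denominator branch mirrors floating-point division in PromQL, which yields infinity or NaN rather than an error.

```python
import math


def alive_ratio(alive: float, expected: float) -> float:
    """Ratio of alive to expected alive processes.

    A zero denominator yields NaN or infinity instead of raising, following
    the floating-point behaviour of PromQL division.
    """
    if expected == 0:
        return math.nan if alive == 0 else math.copysign(math.inf, alive)
    return alive / expected


# Hypothetical values: 9 of the 10 expected processes are alive.
ratio = alive_ratio(9, 10)
```

A ratio of 1 means that every expected process is alive; lower values indicate missing processes.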

An example of this panel is shown below:

Alive processes VS expected processes

Time series with the ratio between the alive processes rate and the expected alive processes rate, both aggregated by 15 minutes.

The x-axis shows the time series and the y-axis shows the ratio between alive and expected alive processes.

The queries used to get the panel information are:

sum by (exported_job) (rate(nlp_provisioning_alive_processes{exported_job="nlp_provisioning_job"}[15m])) / 
sum by (exported_job) (rate(nlp_provisioning_expected_alive_processes{exported_job="nlp_provisioning_job"}[15m]))

An example of this panel is shown below:

6.1.10 - Prometheus stats dashboard

Prometheus stats dashboard

Information provided by Prometheus stats dashboard

Introduction

This dashboard provides detailed information on how Prometheus performs.

To get the information about each pod, the dashboard provides a filter with the following fields:

  • jobs: list of active jobs.
  • instances: list of scrapeable instances.
  • interval: possible time intervals.

Once the filter values are selected, the following graphs are displayed.

Panels

Pods CPU usage

Time series with CPU usage rate, aggregated by one minute. It also shows the current minimum, maximum and average CPU usage.

The x-axis shows the time series and the y-axis shows the CPU usage rate.

The queries used to get the panel information are:

sum(rate(container_cpu_usage_seconds_total{pod_name!="",pod_name=~"prometheus.*"}[1m])) by (pod_name)

An example of this panel is shown below:

Pods memory usage

Time series with memory usage. It also shows the current minimum, maximum and average memory usage.

The x-axis shows the time series and the y-axis shows the memory usage.

The queries used to get the panel information are:

sum (container_memory_working_set_bytes{pod_name!="",pod_name=~"prometheus.*"}) by (pod_name)

An example of this panel is shown below:

Pods network I/O

Time series with the network I/O average aggregated by one minute. It also shows the current minimum, maximum and average network I/O bytes.

The x-axis shows the time series and the y-axis shows the network I/O.

The queries used to get the panel information are:

sum (rate (container_network_receive_bytes_total{pod_name!="",pod_name=~"prometheus.*"}[1m])) by (pod_name)
- sum (rate (container_network_transmit_bytes_total{pod_name!="",pod_name=~"prometheus.*"}[1m])) by (pod_name)

An example of this panel is shown below:

Uptime

Percentage of uptime for the last hour.

The queries used to get the panel information are:

avg(avg_over_time(up{instance=~"(10\\.240\\.0\\.10:9093|10\\.240\\.3\\.161:9093|10\\.240\\.0\\.34:9114|10\\.240\\.0\\.253:8080|10\\.240\\.3\\.205:9090|10\\.240\\.3\\.236:9090|10\\.240\\.4\\.14:9090|10\\.240\\.4\\.156:9121|10\\.240\\.4\\.186:9121|10\\.240\\.4\\.223:9121)",job=~"kubernetes-service-endpoints"}[1h]) * 100)
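
The query above first averages the up metric of each instance over the last hour, then averages those values across instances and scales the result to a percentage. The Python sketch below reproduces that calculation on hypothetical samples; the function name uptime_percent and the data are assumptions made for illustration.

```python
from typing import Dict, List


def uptime_percent(up_samples: Dict[str, List[int]]) -> float:
    """Average, over all instances, of each instance's mean up value, as a percentage."""
    per_instance = [sum(v) / len(v) * 100 for v in up_samples.values() if v]
    if not per_instance:
        raise ValueError("no samples")
    return sum(per_instance) / len(per_instance)


# Hypothetical up samples for the last hour (1 = scrape succeeded, 0 = failed).
samples = {
    "10.240.0.10:9093": [1, 1, 1, 1],
    "10.240.0.34:9114": [1, 0, 1, 1],
}
result = uptime_percent(samples)
```

Here one instance was up for the whole window and the other for three quarters of it, so the panel would show the mean of both percentages.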

An example of this panel is shown below:

Currently down

Currently down instances.

The queries used to get the panel information are:

up{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)",job=~"kubernetes-service-endpoints"} < 1
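
The instance matcher in the query above is a regular expression built from the instances selected in the dashboard filter, with the dots of each address escaped. The Python sketch below shows one way such a matcher could be assembled, together with the effect of the "< 1" comparison. Both function names and the sample values are hypothetical, and this is not necessarily how Grafana generates the expression.

```python
from typing import Dict, Iterable, List


def instance_matcher(instances: Iterable[str]) -> str:
    """Build an instance=~"(...)" matcher with the dots of each address escaped."""
    # "\\\\." is a doubled backslash followed by a dot, as written in the queries.
    escaped = [inst.replace(".", "\\\\.") for inst in instances]
    return 'instance=~"(' + "|".join(escaped) + ')"'


def currently_down(up_values: Dict[str, float]) -> List[str]:
    """Return the instances whose latest up value is below 1, like up{...} < 1."""
    return sorted(inst for inst, value in up_values.items() if value < 1)


matcher = instance_matcher(["10.240.0.5:9093", "10.240.0.76:9093"])
down = currently_down({"10.240.0.5:9093": 1, "10.240.0.76:9093": 0})
```

Because the comparison filters out every series whose value is 1, the panel stays empty while all selected instances are up.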

An example of this panel is shown below:

Total series

Total series count.

The queries used to get the panel information are:

sum(prometheus_tsdb_head_series{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})

An example of this panel is shown below:

Total chunks

Memory chunks being used.

The queries used to get the panel information are:

sum(prometheus_tsdb_head_chunks{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})

An example of this panel is shown below:

Quick numbers

Quick numbers section shows a series of Prometheus indicators.

Missed iterations

Number of missed iterations, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_evaluator_iterations_missed_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Skipped iterations

Number of skipped iterations, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_evaluator_iterations_skipped_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Tardy scrapes

Number of scrapes that took longer than expected, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_target_scrapes_exceeded_sample_limit_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Reload failures

Number of reload failures, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_tsdb_reloads_failures_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Skipped scrapes

Number of scrapes that were not completed, for any of several reasons, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_target_scrapes_exceeded_sample_limit_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) +
sum(sum_over_time(prometheus_target_scrapes_sample_duplicate_timestamp_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) + 
sum(sum_over_time(prometheus_target_scrapes_sample_out_of_bounds_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) + 
sum(sum_over_time(prometheus_target_scrapes_sample_out_of_order_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) 

An example of this panel is shown below:

Failures and errors

Time series with the number of several different errors and failures, aggregated by five minutes.

The x-axis shows the time series and the y-axis shows a series of different errors and failures:

  • Dialer connection errors.
  • Evaluator iterations missed.
  • Evaluator iterations skipped.
  • Evaluation failures.
  • Azure refresh failures.
  • Consul RPC failures.
  • DNS lookup failures.
  • EC2 refresh failures.
  • GCE refresh failures.
  • Marathon refresh failures.
  • OpenStack refresh failures.
  • Triton refresh failures.
  • Scrapes exceeded sample limit.
  • Scrapes sample duplicate timestamp.
  • Scrapes sample out of bounds.
  • Scrapes sample out of order.
  • Treecache ZooKeeper failures.
  • TSDB compactions failed.
  • TSDB head series not found.
  • TSDB reloads failures.

The queries used to get the panel information are:

sum(increase(net_conntrack_dialer_conn_failed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_evaluator_iterations_missed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_evaluator_iterations_skipped_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_rule_evaluation_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_azure_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_consul_rpc_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_dns_lookup_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_ec2_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_gce_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_marathon_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_openstack_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_triton_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_exceeded_sample_limit_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_sample_duplicate_timestamp_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_sample_out_of_bounds_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_sample_out_of_order_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_treecache_zookeeper_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_tsdb_compactions_failed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_tsdb_head_series_not_found{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_tsdb_reloads_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
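
All the queries above share one template: the increase of a failure counter over five minutes, summed across the selected instances and kept only when it is greater than zero. The Python sketch below renders that template for a list of counter names. The function name failure_queries is hypothetical, and only three of the counters are listed to keep the example short.

```python
from typing import List

# A few of the counters used by the panel; the remaining ones follow the same pattern.
FAILURE_COUNTERS = [
    "net_conntrack_dialer_conn_failed_total",
    "prometheus_rule_evaluation_failures_total",
    "prometheus_tsdb_reloads_failures_total",
]


def failure_queries(counters: List[str], instance_regex: str, window: str = "5m") -> List[str]:
    """Render one sum(increase(...)) > 0 query per counter name."""
    return [
        f'sum(increase({name}{{instance=~"{instance_regex}"}}[{window}])) > 0'
        for name in counters
    ]


# The instance regex is passed already escaped, as it appears in the queries.
queries = failure_queries(FAILURE_COUNTERS, "(10\\\\.240\\\\.0\\\\.5:9093)")
```

The trailing "> 0" drops every series without new failures in the window, so the graph only shows the error types that actually occurred.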

An example of this panel is shown below:

Upness (stacked)

Time series showing the upness state of each service over time. The values are shown stacked.

The x-axis shows the time series and the y-axis shows the upness state of the different services.

The queries used to get the panel information are:

up{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)",job=~"kubernetes-service-endpoints"}

An example of this panel is shown below:

Storage memory chunks

Time series with the number of memory chunks used.

The x-axis shows the time series and the y-axis shows the number of memory chunks.

The queries used to get the panel information are:

prometheus_tsdb_head_chunks{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

Series count

Time series with the number of TSDB series.

The x-axis shows the time series and the y-axis shows the number of series.

The queries used to get the panel information are:

prometheus_tsdb_head_series{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

Series created/removed

Time series with the number of TSDB series created/removed.

The x-axis shows the time series and the y-axis shows the number of series created/removed, aggregated by 5 minutes.

The queries used to get the panel information are:

sum( increase(prometheus_tsdb_head_series_created_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]) )
sum( increase(prometheus_tsdb_head_series_removed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]) )

An example of this panel is shown below:

Appended samples per second

Time series with the number of samples per second appended by Prometheus.

The x-axis shows the time series and the y-axis shows the number of samples per second appended by Prometheus.

The queries used to get the panel information are:

rate(prometheus_tsdb_head_samples_appended_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1m])

An example of this panel is shown below:

Scrape Sync total

Time series with the total number of syncs that were executed on a scrape pool.

The x-axis shows the time series and the y-axis shows the total number of syncs that were executed on a scrape pool.

The queries used to get the panel information are:

sum(prometheus_target_scrape_pool_sync_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) by (scrape_job)

An example of this panel is shown below:

Target sync

Time series with the interval to sync the scrape pool.

The x-axis shows the time series and the y-axis shows the interval to sync the scrape pool.

The queries used to get the panel information are:

sum(rate(prometheus_target_sync_length_seconds_sum{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[2m])) by (scrape_job) * 1000

An example of this panel is shown below:

Scrape duration

Time series with the scrape duration in seconds.

The x-axis shows the time series and the y-axis shows the scrape duration in seconds.

The queries used to get the panel information are:

scrape_duration_seconds{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

Rejected scrapes

Time series with the rejected scrapes.

The x-axis shows the time series and the y-axis shows the rejected scrapes for several reasons:

  • Total number of scrapes that hit the sample limit and were rejected.
  • Total number of scraped samples rejected due to duplicate timestamps.
  • Total number of scraped samples rejected because their timestamp is out of bounds.
  • Total number of scraped samples rejected for being out of order.

The queries used to get the panel information are:

sum(prometheus_target_scrapes_exceeded_sample_limit_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})
sum(prometheus_target_scrapes_sample_duplicate_timestamp_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})
sum(prometheus_target_scrapes_sample_out_of_bounds_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})
sum(prometheus_target_scrapes_sample_out_of_order_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) 

An example of this panel is shown below:

Average rule evaluation duration

Time series with the average duration of rule group evaluations, aggregated by five minutes.

The x-axis shows the time series and the y-axis shows the average duration of rule group evaluations.

The queries used to get the panel information are:

1000 * rate(prometheus_evaluator_duration_seconds_sum{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]) / 
rate(prometheus_evaluator_duration_seconds_count{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])
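
The division of a `_sum` rate by a `_count` rate used above can be sketched in Python as follows. Because both rates are taken over the same window, the window length cancels out and the result is the mean duration per evaluation, converted here to milliseconds as in the query. The function name and the numbers are hypothetical.

```python
from typing import Optional


def mean_duration_ms(sum_start: float, sum_end: float,
                     count_start: float, count_end: float) -> Optional[float]:
    """Mean duration in milliseconds of the events observed between two scrapes.

    sum_* are values of a *_seconds_sum counter and count_* are values of the
    matching *_seconds_count counter at the start and end of the window.
    """
    observed = count_end - count_start
    if observed <= 0:
        # PromQL would produce NaN for 0/0; None is used here instead.
        return None
    return 1000 * (sum_end - sum_start) / observed


# Hypothetical counters: 50 evaluations took 1.5 seconds in total.
print(mean_duration_ms(10.0, 11.5, 100, 150))
```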

An example of this panel is shown below:

HTTP request duration

Time series with the HTTP request duration, aggregated by one minute.

The x-axis shows the time series and the y-axis shows the HTTP request duration.

The queries used to get the panel information are:

sum(rate(http_request_duration_microseconds_count{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1m])) by (handler) > 0

An example of this panel is shown below:

Prometheus engine query duration seconds

Time series with the engine query duration in seconds.

The x-axis shows the time series and the y-axis shows the engine query duration.

The queries used to get the panel information are:

sum(prometheus_engine_query_duration_seconds_sum{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) by (slice)

An example of this panel is shown below:

Rule evaluator iterations

Time series with the number of scheduled rule group evaluations, whether executed, missed or skipped.

The x-axis shows the time series and the y-axis shows the number of scheduled rule group evaluations.

The queries used to get the panel information are:

sum(rate(prometheus_evaluator_iterations_total{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]))
sum(rate(prometheus_evaluator_iterations_missed_total{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]))
sum(rate(prometheus_evaluator_iterations_skipped_total{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]))

An example of this panel is shown below:

Notifications sent

Time series with the rate of sent notifications, aggregated by 5 minutes.

The x-axis shows the time series and the y-axis shows the rate of sent notifications.

The queries used to get the panel information are:

rate(prometheus_notifications_sent_total[5m])

An example of this panel is shown below:

Minutes since successful config reload

Time series with the number of minutes since the last successful config reload.

The x-axis shows the time series and the y-axis shows the number of minutes since the last successful reload.

The queries used to get the panel information are:

(time() - prometheus_config_last_reload_success_timestamp_seconds{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) / 60
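
The arithmetic of this query is simply the current time minus the timestamp of the last successful reload, divided by 60. A minimal Python sketch, with a hypothetical function name and an optional `now` argument added only to make the example testable:

```python
import time
from typing import Optional


def minutes_since(timestamp_seconds: float, now: Optional[float] = None) -> float:
    """Minutes elapsed since a Unix timestamp expressed in seconds."""
    current = time.time() if now is None else now
    return (current - timestamp_seconds) / 60


# Hypothetical values: the last reload happened 600 seconds before "now".
print(minutes_since(1_000.0, now=1_600.0))
```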

An example of this panel is shown below:

Successful config reload

Time series with the last successful reload.

The x-axis shows the time series and the y-axis shows the last successful reload.

The queries used to get the panel information are:

prometheus_config_last_reload_successful{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

GC rate

Time series with the GC invocation durations rate, aggregated by two minutes.

The x-axis shows the time series and the y-axis shows the GC invocation durations rate.

The queries used to get the panel information are:

sum(rate(go_gc_duration_seconds_sum{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)",job=~"kubernetes-service-endpoints"}[2m])) by (instance)

An example of this panel is shown below:

6.1.11 - Redis dashboard

Redis dashboard

Information provided by Redis dashboard

Introduction

Redis dashboard monitors multiple data and service-related metrics.

The available metrics are defined in the following sections.

Redis uptime

Uptime graph shows time since last restart/shutdown.

Metrics:

max(max_over_time(redis_uptime_in_seconds{kubernetes_name=~"redis-announce-0"}[$__interval]))

Graph visual

Redis clients

Clients graph shows number of connected clients.

Metrics:

redis_connected_clients{kubernetes_name=~"redis-announce-0"}

Graph visual

Redis memory usage

Memory usage graph shows percentage of used memory.

Metrics:

100 * (redis_memory_used_bytes{kubernetes_name=~"redis-announce-0"}  / redis_memory_max_bytes{kubernetes_name=~"redis-announce-0"} )
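
The percentage above can be sketched in Python as shown below. The sketch assumes that `redis_memory_max_bytes` may be 0 when no memory limit is configured in Redis; in that case a PromQL division yields `+Inf` (or `NaN` for 0/0), which the hypothetical helper reproduces explicitly.

```python
import math


def memory_usage_percent(used_bytes: float, max_bytes: float) -> float:
    """Percentage of the configured maximum memory that is in use."""
    if max_bytes == 0:
        # Mirrors floating-point division by zero in PromQL.
        return math.inf if used_bytes > 0 else math.nan
    return 100 * used_bytes / max_bytes


# Hypothetical values: 256 bytes used out of a 1024-byte limit.
print(memory_usage_percent(256, 1024))
```

A panel that shows `+Inf` is therefore more likely to indicate a missing memory limit than real memory pressure, under the assumption stated above.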

Graph visual

Redis commands executed per second

Commands executed per second graph shows the rate of commands executed per second, aggregated by one minute.

Metrics:

rate(redis_commands_processed_total{kubernetes_name=~"redis-announce-0"}[1m])

Graph visual

Redis hits/misses per second

Hits/misses per second graph shows the rate of keyspace hits and misses per second, aggregated by five minutes.

Metrics: It includes two metrics:

  • Hits metrics
irate(redis_keyspace_hits_total{kubernetes_name=~"redis-announce-0"}[5m])
  • Misses metrics
irate(redis_keyspace_misses_total{kubernetes_name=~"redis-announce-0"}[5m])

Graph visual

Redis total memory usage

Total memory usage graph shows the memory currently in use and the maximum memory available (free + used).

Metrics: It includes two metrics:

  • Used memory
redis_memory_used_bytes{kubernetes_name=~"redis-announce-0"} 
  • Max memory
redis_memory_max_bytes{kubernetes_name=~"redis-announce-0"} 

Graph visual

Redis network I/O

Network I/O graph shows rate of total in/out bytes, aggregated by 5 minutes.

Metrics: It includes two metrics:

  • In bytes
rate(redis_net_input_bytes_total{kubernetes_name=~"redis-announce-0"}[5m])
  • Out bytes
rate(redis_net_output_bytes_total{kubernetes_name=~"redis-announce-0"}[5m])

Graph visual

Redis total items per DB

Total items per DB graph shows total number of items separated by db number.

Metrics:

sum (redis_db_keys{kubernetes_name=~"redis-announce-0"}) by (db) > 0

Graph visual

Redis expiring vs not-expiring keys

Expiring vs not-expiring keys graph shows total number of expiring and not expiring keys.

Metrics: It includes two metrics:

  • Not-expiring keys.
sum (redis_db_keys{kubernetes_name=~"redis-announce-0"}) - sum (redis_db_keys_expiring{kubernetes_name=~"redis-announce-0"}) 
  • Expiring keys
sum (redis_db_keys_expiring{kubernetes_name=~"redis-announce-0"}) 

Graph visual

Redis expired/evicted

Expired/evicted graph shows total number of expired and evicted keys, aggregated by 5 minutes.

Metrics: It includes two metrics:

  • Expired keys.
sum(rate(redis_expired_keys_total{kubernetes_name=~"redis-announce-0"}[5m])) by (kubernetes_name)
  • Evicted keys
sum(rate(redis_evicted_keys_total{kubernetes_name=~"redis-announce-0"}[5m])) by (kubernetes_name)

Graph visual

Redis command calls per second

Command calls per second graph shows the execution rate of the five most frequently called commands, computed over a one-minute window.

Metrics:

topk(5, irate(redis_commands_total{kubernetes_name=~"redis-announce-0"} [1m]))
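
To make the combination of `irate` and `topk` more concrete, the following Python sketch computes an instant rate from the last two samples of each series and then keeps the `k` highest rates. It is a simplified model, not Prometheus code; the helper names, command names and sample values are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate(samples: List[Sample]) -> Optional[float]:
    """Per-second rate based only on the last two samples of the window."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    delta = v1 - v0 if v1 >= v0 else v1  # a drop is treated as a counter reset
    return delta / (t1 - t0)


def top_commands(series: Dict[str, List[Sample]], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k commands with the highest instant rate, highest first."""
    rates = {cmd: irate(samples) for cmd, samples in series.items()}
    known = {cmd: r for cmd, r in rates.items() if r is not None}
    return heapq.nlargest(k, known.items(), key=lambda item: item[1])


# Hypothetical counters scraped every 30 seconds.
series = {
    "get": [(0, 0), (30, 300), (60, 900)],
    "set": [(0, 0), (30, 30), (60, 90)],
    "ping": [(60, 5)],  # a single sample gives no rate
}
print(top_commands(series, k=2))
```

Since only the last two samples are used, the graph reacts quickly to bursts but can look noisier than one based on `rate`.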

Graph visual

6.2 - Aura components dashboards

Aura components dashboards

Grafana dashboards with metrics related to the performance of specific Aura components

Introduction

Currently, these are the available dashboards for Aura components in Grafana based on metrics stored in Prometheus:

6.2.1 - Aura bot latencies dashboard

Aura bot latencies dashboard

Information provided by Aura bot latencies dashboard

Introduction

Aura bot latencies dashboard monitors outbound and inbound latencies of the requests and responses handled directly by aura-bot.

The available metrics are defined in the following sections, corresponding to request errors and latency for requests, Microsoft APIs, Kernel APIs, Cognitive APIs, aura-services APIs and other APIs.

Request error

Request error graph shows the rate of outgoing requests that ended with an error status (4xx or 500), aggregated by one minute.

Graph metrics

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",status=~"4..|500"}[1m]))
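
Prometheus regular expression matchers are fully anchored, so `status=~"4..|500"` selects any three-character status starting with 4 plus exactly 500, and no other 5xx code. The Python sketch below reproduces this behaviour with `re.fullmatch`; the helper name is hypothetical.

```python
import re

# Same pattern as the status matcher in the query above.
STATUS_MATCHER = re.compile(r"4..|500")


def is_counted_as_error(status: str) -> bool:
    """True when the whole status string matches the pattern."""
    return STATUS_MATCHER.fullmatch(status) is not None


for status in ("200", "404", "500", "502"):
    print(status, is_counted_as_error(status))
```

If other server errors such as 502 or 503 should also be counted, a matcher like `4..|5..` would be needed instead.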

Graph visual

Request latency

Request latency graph shows latency rate for outgoing traffic, aggregated by one minute.

Graph metrics

sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app="aura-bot"}[1m]))

Graph visual

Microsoft APIs latency

Microsoft APIs latency graph shows mean latency rate for the different Microsoft APIs used.

Graph metrics

Currently, there are three monitored Microsoft endpoints:

  • Direct Line endpoint
sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"directline.botframework.com"}[1m]))/
sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"directline.botframework.com"}[1m]))
  • Microsoft auth endpoint
sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"login.microsoftonline.com"}[1m]))/
sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"login.microsoftonline.com"}[1m]))
  • Blob storage endpoint
sum (label_replace(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"aura.*blob.core.windows.net",path=~"/aura-temporary-resources/.*"},"path_set","$1","path","/aura-temporary-resources/.*")) by (path_set,kubernetes_namespace) / 
sum (label_replace(outgoing_request_duration_seconds_count{app="aura-bot",host=~"aura.*blob.core.windows.net",path=~"/aura-temporary-resources/.*"},"path_set","$1","path","/aura-temporary-resources/.*")) by (path_set,kubernetes_namespace)
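
The `label_replace` calls in the Blob storage query can be hard to read. The Python sketch below mimics the function on a single label set under the following assumptions: the regular expression must match the whole source value, `$1` refers to the first capture group, and a reference to a group that does not exist expands to an empty string, in which case no destination label is set. `label_replace_sketch` is a hypothetical helper, and Python's `re` syntax differs in some details from the RE2 syntax used by Prometheus.

```python
import re
from typing import Dict


def label_replace_sketch(labels: Dict[str, str], dst: str, replacement: str,
                         src: str, regex: str) -> Dict[str, str]:
    """Return a copy of labels with dst rewritten, loosely like label_replace."""
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: labels are returned unchanged

    def expand(ref):
        index = int(ref.group(1))
        if index > match.re.groups:
            return ""  # a missing group expands to an empty string
        return match.group(index) or ""

    value = re.sub(r"\$(\d+)", expand, replacement)
    result = {k: v for k, v in labels.items() if k != dst}
    if value:
        result[dst] = value
    return result


series = {"path": "/aura-temporary-resources/abc123"}  # hypothetical path
# Pattern as written in the query: no capture group.
print(label_replace_sketch(series, "path_set", "$1", "path",
                           "/aura-temporary-resources/.*"))
# Hypothetical variant with a capture group.
print(label_replace_sketch(series, "path_set", "$1", "path",
                           "(/aura-temporary-resources)/.*"))
```

Under these assumptions, the pattern without a capture group leaves `path_set` unset, so all matching paths are aggregated together per namespace, while the variant with a capture group would label them with the common path prefix.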

Graph visual

Kernel APIs latency

Kernel APIs latency graph shows mean latency rate for the different Kernel APIs used.

Graph metrics

Currently, there are four monitored Kernel endpoints (more can be added if necessary for a given environment):

  • Kernel auth endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"auth.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"auth.*"}[1m]))
  • Kernel subscribed products endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api.*",path=~"/subscribed_products/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api.*",path=~"/subscribed_products/.*"}[1m]))
  • Kernel user profile endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api.*",path=~"/userprofile/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api.*",path=~"/userprofile/.*"}[1m]))
  • Kernel invoicing endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api.*",path=~"/invoicing/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api.*",path=~"/invoicing/.*"}[1m]))

Graph visual

Cognitive APIs latency

Cognitive APIs latency graph shows mean latency rate for the different cognitive APIs used.

Graph metrics

Currently, there are three monitored Cognitive endpoints:

  • Domain classifier endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/domain_classifier/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/domain_classifier/.*"}[1m]))
  • Mplus resolution endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/mplus_resolution/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/mplus_resolution/.*"}[1m]))
  • Suggestions endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/suggestions/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="aura-bot",host=~"api-.*",path=~"/auracognitive/v3/suggestions/.*"}[1m]))

Graph visual

Aura-services APIs latency

Graph metrics

sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app="aura-bot", path=~"/aura-services/.*"}[1m]))/
sum by (path,kubernetes_namespace)(rate(outgoing_request_duration_seconds_count{app="aura-bot", path=~"/aura-services/.*"}[1m]))

Graph visual

Other APIs latency

Other APIs latency graph shows mean latency rate for traffic directed to other APIs different from those above, aggregated by one minute.

Graph metrics

Currently, the only API monitored is Genesys API:

sum (label_replace(outgoing_request_duration_seconds_sum{app="aura-bot",path=~"/genesys/.*"},"path_set","$1","path","/genesys/.*")) by (path_set,kubernetes_namespace) / sum (label_replace(outgoing_request_duration_seconds_count{app="aura-bot",path=~"/genesys/.*"},"path_set","$1","path","/genesys/.*")) by (path_set,kubernetes_namespace)

Graph visual

Service API

Service API graph shows the mean latency of the main endpoint on aura-bot, which receives requests from Direct Line and aura-bridge, aggregated by one minute.

Graph metrics

sum by (path,kubernetes_namespace)(rate(http_request_duration_seconds_sum{path=~"/api/messages"}[1m]))/
sum by (path,kubernetes_namespace)(rate(http_request_duration_seconds_count{path=~"/api/messages"}[1m]))

Graph visual

6.2.2 - Aura bridge dashboard

Aura bridge dashboard

Information provided by Aura bridge dashboard

Aura bridge ack success

Ack success graph shows the rate of successful acks, aggregated by three minutes.

The available metrics are defined in the following sections.

Graph metrics

sum by (origin,originStatus)(rate(aura_response_ack_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus="200"}[3m]))

Graph visual

Aura bridge ack error

Ack error graph shows the rate of acks with an error status, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(aura_response_ack_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus!="200"}[3m]))

Graph visual

Aura bridge message success

Message success graph shows the rate of successful messages, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus="200"}[3m]))

Graph visual

Aura bridge message error

Message error graph shows the rate of erroneous messages, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus!="200"}[3m]))

Graph visual

Aura bridge bot message error

Bot message error graph shows the rate of erroneous messages sent by aura-bot, that is, the errors that aura-bridge receives from aura-bot, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot",originStatus!="200"}[3m]))

Graph visual

Aura bridge message - Kernel internal error

Kernel internal error graph shows the rate of erroneous messages sent by Kernel, regardless of the error type, aggregated by three minutes.

Graph metrics

sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"4p",originStatus!="200"}[3m]))

Graph visual

Aura bridge message - Kernel HTTP error

Kernel HTTP error graph shows the rate of erroneous messages sent by Kernel, filtered by HTTP client errors (4xx) and aggregated by three minutes.

Graph metrics

sum by (origin,httpStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"4p",httpStatus=~"4.."}[3m]))

Graph visual

6.2.3 - Authentication API dashboard

Aura authentication API dashboard

Information provided by Authentication API dashboard

Aura services latency

Aura services latency graph shows mean latency rate for the different incoming calls.

The available metrics are defined in the following sections.

Graph metrics

Currently, these are the existing monitored incoming calls:

  • WhatsApp users’ retrieval
sum by (kubernetes_namespace,path)(rate(http_request_duration_seconds_sum{path=~"/aura-services/v1/users/whatsapp.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(http_request_duration_seconds_count{path=~"/aura-services/v1/users/whatsapp.*"}[1m]))
  • Get or create user
sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/aura-id"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/aura-id"})
  • Get or create user
sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/aura-id/.*"},"path_set","$1","path","/aura-services/v1/users/aura-id/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/aura-id/.*"},"path_set","$1","path","/aura-services/v1/users/aura-id/.*")) by (kubernetes_namespace,path_set)
  • Retrieves an Aura user by the given auraIdGlobal and the channelId
sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/aura-id-global/.*"},"path_set","$1","path","/aura-services/v1/users/aura-id-global/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/aura-id-global/.*"},"path_set","$1","path","/aura-services/v1/users/aura-id-global/.*")) by (kubernetes_namespace,path_set)
  • Gets given authorization and identification information to register the user
sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/auraid/integrated/.*"},"path_set","$1","path","/aura-services/v1/users/auraid/integrated/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/auraid/integrated/.*"},"path_set","$1","path","/aura-services/v1/users/auraid/integrated/.*")) by (kubernetes_namespace,path_set)
  • OpenID logout
sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/users/auraid/integrated/logout"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/users/auraid/integrated/logout"})
  • New Direct Line token
sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/token"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/token"})
  • New Direct Line token (wss)
sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/token/wss"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/token/wss"})
  • JWT URI retrieval
sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/openid/issuer/.well-known/openid-configuration"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/openid/issuer/.well-known/openid-configuration"})
  • JWT token retrieval
sum by (kubernetes_namespace,path)(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/openid/jwk"})/
sum by (kubernetes_namespace,path)(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/openid/jwk"})
  • Get or create user
sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/users/aura-id/.*"},"path_set","$1","path","/users/aura-id/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/users/aura-id/.*"},"path_set","$1","path","/users/aura-id/.*")) by (kubernetes_namespace,path_set)
  • User by phone number
sum (label_replace(http_request_duration_seconds_sum{app="authentication-api",path=~"/aura-services/v1/admin/users/phone-numbers/.*"},"path_set","$1","path","/aura-services/v1/admin/users/phone-numbers/.*")) by (kubernetes_namespace,path_set) / 
sum (label_replace(http_request_duration_seconds_count{app="authentication-api",path=~"/aura-services/v1/admin/users/phone-numbers/.*"},"path_set","$1","path","/aura-services/v1/admin/users/phone-numbers/.*")) by (kubernetes_namespace,path_set)

Graph visual

Request out error

Request out error graph shows error rate for outgoing requests with HTTP codes 4xx and 5xx, aggregated by 1 minute.

Graph metrics

sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",status=~"4..|5.."}[1m]))

Graph visual

Microsoft APIs latency

Microsoft APIs latency graph shows mean latency rate for the different Microsoft APIs used.

Graph metrics

Currently, there are three monitored Microsoft endpoints:

  • Direct Line endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"directline.botframework.com"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"directline.botframework.com"}[1m]))
  • Microsoft auth endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"login.microsoftonline.com"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"login.microsoftonline.com"}[1m]))
  • Blob storage endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"aura.*.blob.core.windows.net"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"aura.*.blob.core.windows.net"}[1m]))

Graph visual

Kernel APIs latency

Kernel APIs latency graph shows mean latency rate for the different Kernel APIs used.

Graph metrics

Currently, there are three monitored Kernel endpoints:

  • Kernel token retrieval endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"auth.*",path="/token"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"auth.*",path="/token"}[1m]))
  • Kernel token introspection endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"api.*",path=~"/token-introspection/.*"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"api.*",path=~"/token-introspection/.*"}[1m]))
  • Kernel open-id configuration endpoint
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_sum{app="authentication-api",host=~"auth.*",path="/.well-known/openid-configuration"}[1m]))/
sum by (kubernetes_namespace,path)(rate(outgoing_request_duration_seconds_count{app="authentication-api",host=~"auth.*",path="/.well-known/openid-configuration"}[1m]))

Graph visual

6.2.4 - Aura HTTP Inbound dashboard

Aura HTTP Inbound dashboard

Information provided by Aura HTTP inbound dashboard

Introduction

HTTP inbound dashboard monitors inbound traffic to different services.

This inbound traffic can be visualized by channel, which gives detailed insight into the traffic coming into a particular channel. This helps to optimize strategies for that channel and to compare performance between channels.

The available metrics are defined in the following sections.

HTTP request latency

HTTP request latency graph shows mean latency time aggregated by one minute.

Graph metrics

sum by (app, kubernetes_namespace)(rate(http_request_duration_seconds_sum{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)'}[1m])) /
sum by (app, kubernetes_namespace)(rate(http_request_duration_seconds_count{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)'}[1m]))
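
The query above first sums the per-pod rates within each (app, kubernetes_namespace) group and only then divides, so pods that serve more requests weigh more in the resulting mean. The Python sketch below illustrates this with hypothetical per-pod rates; the function name and the input structure are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Group = Tuple[str, str]  # (app, kubernetes_namespace)


def mean_latency_by_group(pods: Iterable[dict]) -> Dict[Group, float]:
    """Mean latency in seconds per group, from per-pod sum and count rates."""
    sums: Dict[Group, float] = defaultdict(float)
    counts: Dict[Group, float] = defaultdict(float)
    for pod in pods:
        key = (pod["app"], pod["kubernetes_namespace"])
        sums[key] += pod["sum_rate"]      # seconds of latency per second
        counts[key] += pod["count_rate"]  # requests per second
    return {key: sums[key] / counts[key] for key in sums if counts[key] > 0}


# Hypothetical rates for two aura-bot pods in the same namespace.
pods = [
    {"app": "aura-bot", "kubernetes_namespace": "aura", "sum_rate": 2.0, "count_rate": 10.0},
    {"app": "aura-bot", "kubernetes_namespace": "aura", "sum_rate": 1.0, "count_rate": 2.0},
]
print(mean_latency_by_group(pods))
```

Averaging the two per-pod means instead (0.2 s and 0.5 s) would give 0.35 s, whereas the request-weighted mean computed here is 0.25 s.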

Graph visual

HTTP Request Rate

HTTP requests rate graph shows number of requests aggregated by one minute.

Graph metrics

sum by (app, kubernetes_namespace)  (rate(http_request_duration_seconds_count{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)'}[1m]))

Graph visual

HTTP request latency

HTTP request latency graph shows request latency aggregated by one minute.

Graph metrics

sum by (app, kubernetes_namespace)  (rate(http_request_duration_seconds_sum{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)'}[1m]))

Graph visual

HTTP error rate

HTTP error rate shows the rate of request errors, aggregated by one minute.

Graph metrics

sum by (app, kubernetes_namespace)  (rate(http_request_duration_seconds_count{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)',status_code=~"4..|5.."}[1m]))

Graph visual

Errors

Errors graph shows the rate of requests that returned a 4xx or 5xx status code, aggregated by one minute.

Graph metrics

sum(rate(http_request_duration_seconds_count{app=~'(aura-bot|aura-bridge|authentication-api|complex-logic|context|nlp|tac|thanos-querier)',status_code=~"4..|5.."}[1m])) by (app, kubernetes_namespace)

Graph visual

6.2.5 - Aura HTTP Outbound dashboard

Aura HTTP Outbound dashboard

Information provided by Aura HTTP outbound dashboard

Introduction

HTTP outbound dashboard monitors outbound traffic to different services.

This outbound traffic can be visualized by channel, which gives detailed insight into the traffic going out from a particular channel. This helps to optimize strategies for that channel and to compare performance between channels.

The available metrics are defined in the following sections.

HTTP request latency

HTTP request latency graph shows mean latency time aggregated by one minute.

Graph metrics

sum by (app,kubernetes_namespace)(rate(outgoing_request_duration_seconds_sum{app=~'.*'}[1m])) / sum by (app,kubernetes_namespace)(rate(outgoing_request_duration_seconds_count{app=~'.*'}[1m]))

Graph visual
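
The mean latency query divides the rate of the accumulated duration (`_sum`) by the rate of the request counter (`_count`). A minimal Python sketch of that arithmetic follows, assuming only two scrapes of the cumulative counters; the `Sample` class and `mean_latency` function are hypothetical names, and the real `rate()` function additionally handles counter resets and extrapolation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One scrape of the cumulative counters (hypothetical structure)."""

    timestamp: float      # scrape time, in seconds
    duration_sum: float   # value of outgoing_request_duration_seconds_sum
    request_count: float  # value of outgoing_request_duration_seconds_count


def mean_latency(first: Sample, last: Sample) -> Optional[float]:
    """Mean seconds per request between two scrapes, or None if undefined."""
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        return None
    # Per-second increase of each counter, as rate() would report it.
    sum_rate = (last.duration_sum - first.duration_sum) / elapsed
    count_rate = (last.request_count - first.request_count) / elapsed
    if count_rate == 0:
        return None  # no requests in the window
    return sum_rate / count_rate


if __name__ == "__main__":
    start = Sample(timestamp=0.0, duration_sum=10.0, request_count=100.0)
    end = Sample(timestamp=60.0, duration_sum=16.0, request_count=130.0)
    print(mean_latency(start, end))
```

In this example, 30 requests took 6 seconds in total during the window, so each request took 0.2 seconds on average.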

HTTP request rate

The HTTP request rate graph shows the per-second rate of outgoing requests, computed over one-minute windows.

Graph metrics

sum by (app,kubernetes_namespace) (rate(outgoing_request_duration_seconds_count{app=~'.*'}[1m]))

Graph visual

HTTP request latency

The HTTP request latency graph shows the time spent on outgoing requests, expressed as the per-second rate of the accumulated request duration over one-minute windows.

Graph metrics

sum by (app,kubernetes_namespace)  (rate(outgoing_request_duration_seconds_sum{app=~'.*'}[1m]))

Graph visual

HTTP error rate

The HTTP error rate graph shows the per-second rate of outgoing requests that returned a 4xx or 5xx status, computed over one-minute windows.

Graph metrics

sum by (app,kubernetes_namespace)  (rate(outgoing_request_duration_seconds_count{app=~'.*',status=~"4..|5.."}[1m]))

Graph visual

Errors

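The Errors graph shows the per-second rate of failed outgoing requests (4xx or 5xx status), computed over one-minute windows and grouped by application and namespace.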

Graph metrics

sum(rate(outgoing_request_duration_seconds_count{app=~'.*',status=~"4..|5.."}[1m])) by (app,kubernetes_namespace)

Graph visual

Aura bot backend latency

The aura-bot backend latency graph shows the mean latency of the requests sent by aura-bot to its backends, per path, computed over one-minute windows.

Graph metrics

sum(rate(outgoing_request_duration_seconds_sum{app=~"aura-bot"}[1m])) by (path,kubernetes_namespace)/sum(rate(outgoing_request_duration_seconds_count{app=~"aura-bot"}[1m])) by (path,kubernetes_namespace)

Graph visual

Authentication API backend latency

The aura-authentication-api backend latency graph shows the mean latency of the requests sent by aura-authentication-api to its backends, per path, computed over one-minute windows.

Graph metrics

sum(rate(outgoing_request_duration_seconds_sum{app=~"authentication-api"}[1m])) by (path,kubernetes_namespace)/sum(rate(outgoing_request_duration_seconds_count{app=~"authentication-api"}[1m])) by (path,kubernetes_namespace)

Graph visual

Aura bridge backend latency

The aura-bridge backend latency graph shows the mean latency of the requests sent by aura-bridge to its backends, per path, computed over one-minute windows.

Graph metrics

sum(rate(outgoing_request_duration_seconds_sum{app=~"aura-bridge"}[1m])) by (path,kubernetes_namespace)/sum(rate(outgoing_request_duration_seconds_count{app=~"aura-bridge"}[1m])) by (path,kubernetes_namespace)

Graph visual

6.2.6 - Pod resources dashboard

Pod resources dashboard

Information provided by Pod resources dashboard

Introduction

This single dashboard provides the most basic information about the behavior of the pods in the environment.

To get the information about each pod, the dashboard provides a filter with the following fields:

  • namespace: list of all the available namespaces of your deployment.
  • pod: list of pods running in the selected namespace.
  • container: list of containers running in the selected pod.
  • DS_PROMETHEUS: Prometheus data source to be used. By default, Prometheus.

Once the filter values are selected, the following graphs are displayed with the data of the pod.

Panels

Pod memory

Pod memory panel shows a time series with the current memory consumption in the selected pod. It also shows the current, maximum, minimum and average memory consumption of the Pod.

The x-axis shows the time series and the y-axis shows the amount of memory consumed by the pod.

The queries used to get the panel information are:

sum(kube_pod_container_resource_requests_memory_bytes{namespace="aura-<env>",pod="aura-bot-<id>"})
sum(kube_pod_container_resource_limits_memory_bytes{namespace="aura-<env>",pod="aura-bot-<id>"})
sum(container_memory_working_set_bytes{namespace="aura-<env>",container!="POD",container!="",pod!="", pod="aura-bot-<id>"})

An example of this panel is shown below:

Container memory

Container memory panel shows a time series with the current memory consumption of the selected container. It also shows the current, maximum, minimum and average memory consumption of the container.

The x-axis shows the time series and the y-axis shows the amount of memory consumed by the container.

The queries used to get the panel information are:

sum(kube_pod_container_resource_requests_memory_bytes{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"})
sum(kube_pod_container_resource_limits_memory_bytes{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"})
sum(container_memory_working_set_bytes{namespace="aura-<env>",container!="POD",container!="",pod!="", pod="aura-bot-<id>",container="aura-bot"}) by (container)

An example of this panel is shown below:

Pod network

Pod network panel shows a time series with the current I/O network consumption of the selected pod. It also shows the current, maximum, minimum and average network consumption of the pod.

The x-axis shows the time series and the y-axis shows the number of bytes per second received and transmitted by the pod.

The queries used to get the panel information are:

sum(rate(container_network_receive_bytes_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m]))
sum(rate(container_network_transmit_bytes_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m]))

An example of this panel is shown below:

Pod CPU

Pod CPU panel shows a time series with the current CPU consumption of the selected pod. It also shows the current, maximum, minimum and average CPU consumption of the pod.

The x-axis shows the time series and the y-axis shows the percentage of CPU used by the pod.

The queries used to get the panel information are:

sum(kube_pod_container_resource_requests_cpu_cores{namespace="aura-<env>",pod="aura-bot-<id>"})
sum(kube_pod_container_resource_limits_cpu_cores{namespace="aura-<env>",pod="aura-bot-<id>"})
sum(rate(container_cpu_usage_seconds_total{namespace="aura-<env>",container!="POD",container!="",pod!="", pod="aura-bot-<id>"}[1m]))

An example of this panel is shown below:

Container CPU

Container CPU panel shows a time series with the current CPU usage of the selected container within the pod. It also shows the current, maximum, minimum and average CPU usage of the container.

The x-axis shows the time series and the y-axis shows the percentage of CPU used by the container.

The queries used to get the panel information are:

sum(kube_pod_container_resource_requests_cpu_cores{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"})
sum(kube_pod_container_resource_limits_cpu_cores{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"})
sum(rate(container_cpu_usage_seconds_total{namespace="aura-<env>",container!="POD",container!="",pod!="", pod="aura-bot-<id>",container="aura-bot"}[1m]))

An example of this panel is shown below:

Container disk

Container disk panel shows a time series with the current disk usage of the selected container within the pod. It also shows the current, maximum, minimum and average disk usage of the container.

The x-axis shows the time series and the y-axis shows the amount of disk used by the container.

The queries used to get the panel information are:

sum(rate(container_fs_reads_bytes_total{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"}[1m])) by (container,device)
sum(rate(container_fs_writes_bytes_total{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"}[1m])) by (container,device)

An example of this panel is shown below:

Pod network errors

Pod network errors panel shows a time series with the percentage of errors in the network access of the pod. It also shows the current, maximum, minimum and average number of errors of the pod while receiving data from and transmitting data to the network.

The x-axis shows the time series and the y-axis shows the percentage of errors of the pod network accesses.

The queries used to get the panel information are:

sum(rate(container_network_receive_packets_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) / sum(rate(container_network_receive_errors_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) * 100
sum(rate(container_network_receive_packets_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) / sum(rate(container_network_receive_packets_dropped_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) * 100
sum(rate(container_network_transmit_packets_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) / sum(rate(container_network_transmit_errors_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) * 100
sum(rate(container_network_transmit_packets_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) / sum(rate(container_network_receive_packets_dropped_total{namespace="aura-<env>",pod="aura-bot-<id>"}[5m])) * 100

Pod status

This section consists of the following panels: ready, created, number of restarts, last terminated reason, last waiting reason and the description of the image running in the container.

Ready

Ready panel shows a time series with the heartbeat of the container. If there are no errors, it should be a flat line at 1.0.

The x-axis shows the time series and the y-axis shows the heartbeat response of the container, where 1 is a correct response.

The queries used to get the panel information are:

kube_pod_container_status_ready{namespace="aura-<env>",pod="aura-bot-<id>",container="aura-bot"}

An example of this panel is shown below:

Pod created

Pod created panel shows the timestamp when the selected pod was created.

The queries used to get the panel information are:

kube_pod_created{namespace="aura-<env>",pod="aura-bot-<id>"} * 1000

An example of this panel is shown below:

Last terminated reason

This panel shows the reason why the pod entered the terminated status.

Last waiting reason

This panel shows the reason why the pod entered the waiting status.

Info

Info panel shows the images running in the containers of the selected pod.

The queries used to get the panel information are:

kube_pod_container_info{namespace="aura-<env>",pod="aura-bot-<id>"}

An example of this panel is shown below:

7 - Aura Alerts

Management of alerts in Aura

Learn how to manage alerts through the Prometheus system

Introduction to alerts in Aura

As previously stated, Prometheus has a list of alert rules that are part of the platform configuration. These alerting rules allow you to define alert conditions based on Prometheus expression language.

⚠️ It is possible to edit the Aura alert rules but, for now, changes are lost in a re-deployment.
If you think an alert is important and should be part of the platform, let us know, so we can officially include it.

Alerts are sent via email, using a global SMTP server managed by the Aura Team. Other notification channels (Slack) are also available but not used by default in production.

Alerts are disabled (silenced) during Aura deployments to avoid false positives due to services that need to be restarted, etc.

To manage alerts, the Aura Platform includes Alertmanager, which is part of the Prometheus stack. The URL to access Alertmanager is:
alerts-{{ environment_name }}.auracognitive.com

When you access the web interface, you can see all the alerts, as shown in the image below.

Alert manager home

In this panel, the most important action is to silence an alert, either by clicking the “silence alarm” option of the alert or by pressing the “new silence” button.

Alert manager new silence

To check whether the cluster is ready, or to see the status of the system, click on the “status” section.

Alert manager status

Alerts set in Aura

The current section includes the different alerts currently set in Aura, organized by their scope.

Scope: infrastructure

  • high_cpu_usage_on_hosts

    • Description: « $labels.kubernetes_io_hostname » is using a LOT of CPU. CPU usage is « humanize $value »%.
    • Expr: sum by(kubernetes_io_hostname) (rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum by(kubernetes_io_hostname) (machine_cpu_cores) * 100 > 90
    • For: 10m
    • summary: HIGH CPU USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
  • high_memory_usage_on_hosts

    • Description: « $labels.kubernetes_io_hostname » is using a LOT of Memory. Memory usage is « humanize $value »%.
    • Expr: sum by(kubernetes_io_hostname) (container_memory_working_set_bytes{id="/"}) / sum by(kubernetes_io_hostname) (machine_memory_bytes) * 100 > 90
    • For: 10m
    • summary: HIGH MEMORY USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
  • high_fs_usage_on_hosts

    • Description: « $labels.kubernetes_io_hostname » is using a LOT of FileSystem space. FileSystem usage is « humanize $value »%.
    • Expr: sum by(kubernetes_io_hostname) (container_fs_usage_bytes{device=~"^/dev/.*$",id="/"}) / sum by(kubernetes_io_hostname) (container_fs_limit_bytes{device=~"^/dev/.*$",id="/"}) * 100 > 70
    • For: 10m
    • summary: HIGH FILESYSTEM USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
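
Each alert in this section combines an expression (`Expr`) with a hold time (`For`). In Prometheus, an alert whose expression becomes true is first pending and only fires once the expression has stayed true for the whole `For` duration. The following Python sketch illustrates that behavior under simplifying assumptions: evaluations are evenly spaced, the hold time is expressed as a number of evaluations, and the names `alert_states`, `threshold` and `hold_evaluations` are hypothetical.

```python
from typing import Iterable, List


def alert_states(
    values: Iterable[float], threshold: float, hold_evaluations: int
) -> List[str]:
    """Return the alert state after each evaluation of `value > threshold`.

    hold_evaluations is the `For` duration divided by the evaluation interval.
    """
    states: List[str] = []
    consecutive = 0  # evaluations in a row where the condition was true
    for value in values:
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive == 0:
            states.append("inactive")
        elif consecutive > hold_evaluations:
            # The condition has now held for the whole hold time.
            states.append("firing")
        else:
            states.append("pending")
    return states


if __name__ == "__main__":
    cpu_usage = [95.0, 95.0, 95.0, 80.0, 95.0]
    print(alert_states(cpu_usage, threshold=90.0, hold_evaluations=2))
```

A single evaluation below the threshold resets the hold time, which is why short spikes do not trigger a notification.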

Scope: kubernetes

  • high_persistent_volume_usage

    • Description: « $labels.persistentvolumeclaim » on « $labels.kubernetes_io_hostname » is using a LOT of persistent volume space. Persistent volume usage is « humanize $value »%.
    • Expr: kubelet_volume_stats_used_bytes * 100 / kubelet_volume_stats_capacity_bytes > 70
    • For: 10m
    • summary: HIGH PERSISTENT VOLUME USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’ by ‘{{ $labels.persistentvolumeclaim }}’
  • high_persistent_volume_inode_usage

    • Description: « $labels.persistentvolumeclaim » on « $labels.kubernetes_io_hostname » is using a LOT of persistent volume inodes. Persistent volume inode usage is « humanize $value »%.
    • Expr: kubelet_volume_stats_inodes_used * 100 / kubelet_volume_stats_inodes > 70
    • For: 10m
    • summary: HIGH PERSISTENT VOLUME INODE USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’ by ‘{{ $labels.persistentvolumeclaim }}’
  • docker_deleted_container_rate_on_hosts

    • Description: « $labels.kubernetes_io_hostname » has a HIGH rate of deleted/stopped containers.
    • Expr: sum by(kubernetes_io_hostname) (rate(kubelet_docker_operations{operation_type=~"remove_container|stop_container"}[5m])) > 0.1
    • For: 1m
    • summary: DOCKER DELETED/STOPPED CONTAINER RATE WARNING
  • runtime_deleted_container_rate_on_hosts

    • Description: « $labels.kubernetes_io_hostname » has a HIGH rate of deleted/stopped containers.
    • Expr: sum by(kubernetes_io_hostname) (rate(kubelet_runtime_operations{operation_type=~"stop_podsandbox|remove_container|stop_container"}[5m])) > 0.1
    • For: 1m
    • summary: RUNTIME DELETED/STOPPED CONTAINER RATE WARNING
  • frequent_container_restarts

    • Description: Container « $labels.container » on pod « $labels.pod » has been restarted « $value » times within the last hour.
    • Expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
    • For: 5m
    • summary: KUBERNETES FREQUENT CONTAINER RESTARTS WARNING
  • node_not_ready

    • Description: Node « $labels.node » has status « $labels.condition » as « $labels.status ».
    • Expr: kube_node_status_condition{condition!="Ready",status!="false"} > 0 or on(node) kube_node_status_condition{condition="Ready",status="false"} > 0
    • For: 5m
    • summary: KUBERNETES NODE NOT READY WARNING
  • job_error

    • Description: JOB ERROR
    • Expr: kube_job_status_failed==1
    • For: 5m
    • summary: KUBERNETES JOB NOT READY WARNING

Scope: prometheus

  • prometheus_rule_evaluation_slow

    • Description: Prometheus has a 90th percentile latency of « $value »s completing rule evaluation cycles.
    • Expr: prometheus_evaluator_duration_seconds{quantile="0.9"} > 60
    • For: 10m
    • summary: PROMETHEUS RULE EVALUATION SLOW WARNING
  • prometheus_indexing_backlog

    • Description: Prometheus is backlogging on the indexing queue. Queue is currently « $value | printf %.0f »% full.
    • Expr: prometheus_local_storage_indexing_queue_length / prometheus_local_storage_indexing_queue_capacity * 100 > 10
    • For: 10m
    • summary: PROMETHEUS INDEXING BACKLOG WARNING
  • prometheus_not_ingesting_samples

    • Description: Prometheus has not ingested any sample in the last 10 minutes.
    • Expr: rate(prometheus_local_storage_ingested_samples_total[5m]) == 0
    • For: 5m
    • summary: PROMETHEUS NOT INGESTING SAMPLES WARNING
  • prometheus_persist_errors

    • Description: Prometheus has encountered « $value » persistent errors per second in the last 10 minutes.
    • Expr: rate(prometheus_local_storage_persist_errors_total[10m]) > 0
    • For: 5m
    • summary: PROMETHEUS PERSIST ERRORS WARNING
  • prometheus_notifications_backlog

    • Description: Prometheus is backlogging on the notifications queue. The queue has not been empty for 10 minutes. Current queue length: « $value ».
    • Expr: prometheus_notifications_queue_length > 0
    • For: 10m
    • summary: PROMETHEUS NOTIFICATIONS BACKLOG WARNING
  • prometheus_storage_inconsistent

    • Description: Prometheus has detected a storage inconsistency. A server restart is needed to initiate recovery.
    • Expr: prometheus_local_storage_inconsistencies_total > 0
    • For: 5m
    • summary: PROMETHEUS STORAGE INCONSISTENCY WARNING
  • prometheus_persistence_pressure_too_high_24h

    • Description: Prometheus is approaching critical persistence pressure. Throttled ingestion expected within the next 24h.
    • Expr: prometheus_local_storage_persistence_urgency_score > 0.8 and predict_linear(prometheus_local_storage_persistence_urgency_score[30m], 3600 * 24) > 1
    • For: 30m
    • summary: PROMETHEUS PERSISTENCE PRESSURE 24H WARNING
  • prometheus_persistence_pressure_too_high_2h

    • Description: Prometheus is approaching critical persistence pressure. Throttled ingestion expected within the next 2h.
    • Expr: prometheus_local_storage_persistence_urgency_score > 0.85 and predict_linear(prometheus_local_storage_persistence_urgency_score[30m], 3600 * 2) > 1
    • For: 30m
    • summary: PROMETHEUS PERSISTENCE PRESSURE 24H WARNING
  • prometheus_series_maintenance_stalled

    • Description: Prometheus is maintaining memory time series so slowly that it will take « $value | printf %.0f »h to complete a full cycle. This will lead to persistence falling behind.
    • Expr: prometheus_local_storage_memory_series / on(job, instance) rate(prometheus_local_storage_series_ops_total{type="maintenance_in_memory"}[5m]) / 3600 > 24 and prometheus_local_storage_rushed_mode == 1
    • For: 1h
    • summary: PROMETHEUS SERIES MAINTENANCE WARNING
  • prometheus_target_scrape_sync_too_low

    • Description: Prometheus target scrape sync rate is too low.
    • Expr: rate(prometheus_target_scrape_pool_sync_total{app="prometheus"}[10m]) == 0
    • For: 5m
    • summary: PROMETHEUS TARGET SCRAPE SYNC WARNING
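
The two persistence pressure alerts above combine a threshold on the current urgency score with `predict_linear`, which fits a straight line to the recent samples and extrapolates it. The sketch below illustrates the idea in Python with an ordinary least-squares fit. It is a simplification: the extrapolation starts from the last sample rather than from the evaluation time, and the function names and default arguments are assumptions for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def predict_linear(samples: List[Sample], seconds_ahead: float) -> float:
    """Extrapolate a least-squares line `seconds_ahead` past the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


def pressure_alert(
    samples: List[Sample],
    current_threshold: float = 0.8,
    horizon_seconds: float = 3600 * 24,
) -> bool:
    """True when the score is already high and is predicted to exceed 1."""
    current = samples[-1][1]
    return current > current_threshold and predict_linear(samples, horizon_seconds) > 1


if __name__ == "__main__":
    rising = [(0.0, 0.80), (600.0, 0.81), (1200.0, 0.82), (1800.0, 0.83)]
    flat = [(0.0, 0.90), (600.0, 0.90), (1200.0, 0.90), (1800.0, 0.90)]
    print(pressure_alert(rising), pressure_alert(flat))
```

A score that is high but stable does not satisfy the condition, whereas a steadily rising score does.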

Scope: logs

  • elasticsearch_too_few_nodes_running

    • Description: There are only « $value » < 3 ElasticSearch nodes running.
    • Expr: elasticsearch_cluster_health_number_of_nodes < 3
    • For: 10m
    • summary: TOO FEW ELASTICSEARCH NODES
  • elasticsearch_high_memory_usage

    • Description: The memory (heap) usage is over 90% for 15m on node « $labels.node »
    • Expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.9
    • For: 15m
    • summary: ELASTICSEARCH HIGH MEMORY USAGE
  • elasticsearch_not_indexing

    • Description: ElasticSearch data node is not indexing new documents
    • Expr: increase(elasticsearch_indices_docs{es_data_node="true"}[1m]) == 0
    • For: 5m
    • summary: ELASTICSEARCH NOT INDEXING

Scope: Aura

  • aura-bot_unauthorized_aura-bridge

    • Description: aura-bridge has not authorized the connection with aura-bot for 3 minutes.
    • Expr: sum by (status_code) (rate(http_request_duration_seconds_count{app="aura-bridge",status_code=~"401"}[3m])) > 0
    • For: 3m
    • summary: AURA-BOT RETURN UNAUTHORIZED TO AURA-BRIDGE
  • aura-bot_bad-request_aura-bridge

    • Description: aura-bridge has not been able to correctly handle the connection with aura-bot for 3 minutes.
    • Expr: sum by (status_code) (rate(http_request_duration_seconds_count{app="aura-bridge",status_code=~"400"}[3m])) > 0
    • For: 3m
    • summary: AURA-BOT RETURN BAD REQUEST TO AURA-BRIDGE
  • aura-bot_internal-error_aura-bridge

    • Description: aura-bridge failed to connect to aura-bot for 3 minutes.
    • Expr: sum by (host,status) (rate(outgoing_request_duration_seconds_count{app="aura-bridge",status=~"5..",host=~"aura-bot.*"}[3m])) > 0
    • For: 3m
    • summary: COMMUNICATION ERROR BETWEEN AURA-BOT AND AURA-BRIDGE
  • aura-bridge-error_callback

    • Description: aura-bridge failed to handle the connection with callback for 3 minutes.
    • Expr: sum by (host,status) (rate(outgoing_request_duration_seconds_count{app="aura-bridge",status=~"5..",host!~"aura-bot.*"}[3m])) > 0
    • For: 3m
    • summary: COMMUNICATION ERROR BETWEEN AURA-BRIDGE AND CALLBACK
  • aura-bridge_error_whatsapp

    • Description: errors in aura-bridge with WhatsApp functionality for 5 minutes.
    • Expr: sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus!="200",httpStatus!~"403|408|400"}[5m])) > 0
    • For: 5m
    • summary: Error happened in WhatsApp functionality.
  • aura-bridge_error_4p

    • Description: errors in aura-bridge with Kernel in WhatsApp functionality for 5 minutes.
    • Expr: sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"4p",httpStatus=~"403|408|400"}[5m])) > 0
    • For: 5m
    • summary: Error happened with Kernel in WhatsApp functionality.
  • nlp-provisioning_killed-processes

    • Description: killed nlp-provisioning processes for 15 minutes.
    • Expr: sum by (exported_job) (rate(nlp_provisioning_killed_processes{exported_job="nlp_provisioning_job"}[15m])) > 0
    • For: 15m
    • summary: Processes killed in nlp-provisioning
  • alive-processes_nlp-provisioning_expected-alive-processes

    • Description: alive nlp-provisioning processes vs expected alive nlp-provisioning processes for 15 minutes.
    • Expr: sum by (exported_job)(nlp_provisioning_alive_processes{exported_job="nlp_provisioning_job"}) / sum by (exported_job) (nlp_provisioning_expected_alive_processes{exported_job="nlp_provisioning_job"})!=1
    • For: 15m
    • summary: Processes killed in nlp-provisioning

Scope: misc

  • probe_down

    • Description: The endpoint « $labels.instance » is down or not reachable. The blackbox exporter could not validate « $labels.app »’s health.
    • Expr: probe_success == 0
    • For: 2m
    • summary: PROBE FAILING

8 - Queries

Queries

Description of the different types of queries that can be run to retrieve metrics from the system

To retrieve specific information from Aura about the generated logs and metrics, you can query the system.

These queries are classified into two categories: basic monitoring queries and basic database queries.

8.1 - Basic monitoring queries

Basic monitoring queries

Learn how to get information for the evaluation of Aura system performance through basic queries

Introduction

This document provides guidelines for querying Grafana and Kibana in order to retrieve basic information from the system.

Number of TPS per component

Request rate in Grafana

  • Access Grafana of the environment.
  • Select “Aura HTTP inbounds” dashboard.
  • Select the time period for the query.
  • Select the service of your choice, as can be seen in the following picture:

Basic queries - get current TPS per component in grafana

The panel named “HTTP Request Rate” shows the total number of requests being processed by a service.

This panel is based on the http_request_duration_seconds metric that Prometheus stores for the selected service, aggregated over one-minute windows.

The y-axis shows the number of requests in the service and the x-axis shows the time period.

Request rate in Kibana

Add a new dashboard with the following data:

  • Select the time interval for the filter

  • Query (overwrite your-env with the environment of your choice):

    • aura-bot:

      • Public endpoints except the health check: msg: "Response returned" and kubernetes.labels.app : "aura-bot" and kubernetes.namespace_name: "your-env" and not path: "/ping"
      • All the endpoints of the service: msg: "Response returned" and kubernetes.labels.app : "aura-bot" and kubernetes.namespace_name: "your-env"
    • aura-authentication-api:

      • Public endpoints except the health check: msg: "Response returned" and kubernetes.labels.app : "authentication-api" and kubernetes.namespace_name: "your-env" and not path: "/ping"
      • All the endpoints of the service: msg: "Response returned" and kubernetes.labels.app : "authentication-api" and kubernetes.namespace_name: "your-env"
    • aura-bridge:

      • Public endpoints except the health check: msg: "Response returned" and kubernetes.labels.app : "aura-bridge" and kubernetes.namespace_name: "your-env" and not path: "/ping"
      • All the endpoints of the service: msg: "Response returned" and kubernetes.labels.app : "aura-bridge" and kubernetes.namespace_name: "your-env"
    • tac-api:

      • Public endpoints except the health check: msg: "Response with status" and kubernetes.labels.app : "tac" and kubernetes.namespace_name: "your-env" and not path: "/ping"
      • All the endpoints of the service: msg: "Response with status" and kubernetes.labels.app : "tac" and kubernetes.namespace_name: "your-env"
    • aura-nlp:

      • Public endpoints except the health check: msg: "HTTP request" and kubernetes.labels.app : "nlp" and kubernetes.namespace_name: "your-env" and not path: "/ping"
      • All the endpoints of the service: msg: "HTTP request" and kubernetes.labels.app : "nlp" and kubernetes.namespace_name: "your-env"
    • aura-context:

      • Public endpoints except the health check: msg: "HTTP request" and kubernetes.labels.app : "context" and kubernetes.namespace_name: "your-env" and not path: "/ping"
      • All the endpoints of the service: msg: "HTTP request" and kubernetes.labels.app : "context" and kubernetes.namespace_name: "your-env"
  • Index: aurak8s-service

  • Data:

    • Metric: Unique Count
      • Field: corr.keyword
      • Custom label: Number of requests
  • Buckets: split rows

    • Aggregation: Date histogram
    • Field: @timestamp
    • Minimum interval: minute

Save the dashboard under a name so that it remains available.

A basic example of this dashboard, with a panel per component, is delivered with Aura. It can be imported into the Kibana of the environment using the Kibana import objects API.

Basic queries - get current TPS per component in kibana
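
The Kibana visualization described above counts the unique `corr` values in each one-minute bucket of the date histogram. As an illustration only, the same counting can be sketched in Python over log records that are already loaded in memory. The flat record layout (`@timestamp`, `msg`, `path`, `corr` keys) and the function name `requests_per_minute` are assumptions; the actual documents in the index may be structured differently.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Set


def requests_per_minute(records: Iterable[dict]) -> Dict[str, int]:
    """Count unique correlators per minute, skipping health checks."""
    buckets: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        if record.get("msg") != "Response returned":
            continue  # keep only the log entry written once per response
        if record.get("path") == "/ping":
            continue  # exclude the health check endpoint
        timestamp = datetime.fromisoformat(record["@timestamp"])
        # Truncate to the start of the minute: the date histogram bucket.
        minute = timestamp.replace(second=0, microsecond=0).isoformat()
        buckets[minute].add(record["corr"])
    return {minute: len(corrs) for minute, corrs in sorted(buckets.items())}


if __name__ == "__main__":
    sample = [
        {"@timestamp": "2024-01-01T10:00:05", "msg": "Response returned",
         "path": "/messages", "corr": "a"},
        {"@timestamp": "2024-01-01T10:00:50", "msg": "Response returned",
         "path": "/ping", "corr": "b"},
        {"@timestamp": "2024-01-01T10:01:10", "msg": "Response returned",
         "path": "/messages", "corr": "c"},
    ]
    print(requests_per_minute(sample))
```

Counting unique correlators rather than log lines avoids counting the same request twice when several entries share one correlator.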

Number of unique users in Aura

In this case, the only way to get the number of unique users accessing Aura is to query the operational logs in Kibana.

Add a new dashboard in Kibana with the following data:

  • Select the time interval for the filter
  • Query (overwrite your-env with the environment of your choice): app.keyword : "aura-bot" and kubernetes.namespace_name.keyword : "your-env"
  • Index: aurak8s-service
  • Data:
    • Metric: Unique Count
    • Field: auraId.keyword
    • Custom label: Number of unique aura users
  • Buckets: split rows
    • Aggregation: Date histogram
    • Field: @timestamp
    • Minimum interval: 1h

Save the dashboard under a name so that it remains available.

A basic example of this dashboard, with a panel per component, is delivered with Aura. It can be imported into the Kibana of the environment using the Kibana import objects API.

Basic queries - get number of unique aura users in kibana
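
Kibana's Unique Count metric corresponds to the Elasticsearch `cardinality` aggregation, and the Date histogram bucket to the `date_histogram` aggregation. For readers who prefer to query Elasticsearch directly, the following Python sketch builds an equivalent search body. It is an assumption-laden illustration: the function name is hypothetical, the field names are taken from the Kibana settings above, and `fixed_interval` assumes a recent Elasticsearch version (older versions use `interval`).

```python
import json
from typing import Any, Dict


def unique_users_request(namespace: str, start: str, end: str) -> Dict[str, Any]:
    """Build a search body counting unique auraId values per hour."""
    return {
        "size": 0,  # only the aggregations are needed, not the documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"app.keyword": "aura-bot"}},
                    {"term": {"kubernetes.namespace_name.keyword": namespace}},
                    {"range": {"@timestamp": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
                "aggs": {
                    "unique_users": {"cardinality": {"field": "auraId.keyword"}}
                },
            }
        },
    }


if __name__ == "__main__":
    body = unique_users_request("your-env", "now-24h", "now")
    print(json.dumps(body, indent=2))
```

Note that the `cardinality` aggregation returns an approximate count, so the values may differ slightly from an exact count on large data sets.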

8.2 - Basic database queries

Basic database queries

Learn how to get information from the database to get some insights

Requirements

  • A valid kubeconfig for the environment

  • If the environment database is in Atlas: access to Atlas by IP

  • Get the variables to access the database:

    # substitute {{aura-environment}} with the environment you're configuring
    export AURA_ENVIRONMENT={{aura-environment}}
    
    $ kubectl -n $AURA_ENVIRONMENT get cm authentication-api -o json | jq -r ".data.AURA_MONGODB_URI"
    
    {{mongo_uri}}
    
    $ kubectl -n $AURA_ENVIRONMENT get cm authentication-api -o json | jq -r ".data.AURA_MONGODB_USERNAME"
    
    {{mongo_user}}
    
    $ kubectl -n $AURA_ENVIRONMENT get secret authentication-api -o json | jq -r ".data.AURA_MONGODB_PASSWORD|@base64d"
    
    {{mongo_pass}}
    
    $ kubectl -n $AURA_ENVIRONMENT get cm authentication-api -o json | jq -r ".data.AURA_MONGODB_USER_DB"
    
    {{mongo_users_db}}
    
    $ kubectl -n $AURA_ENVIRONMENT get cm authentication-api -o json | jq -r ".data.AURA_MONGODB_USER_COLLECTION"
    
    {{mongo_users_col}}
    
  • Get the channel_name and channel_id for all the channels in the environment:

    # substitute {{aura-environment}} with the environment you're configuring
    export AURA_ENVIRONMENT={{aura-environment}}
    
    $ kubectl -n $AURA_ENVIRONMENT get cm aura-bot -o json | jq -r ".data.AURA_CHANNELS_CONFIGURATION_API_ENDPOINT"
    
    {{channels_configuration_endpoint}}
    
    $ kubectl -n $AURA_ENVIRONMENT get secret aura-bot -o json | jq -r ".data.AURA_AUTHORIZATION_HEADER|@base64d"
    
    {{authorization_header}}
    
    $ curl {{channels_configuration_endpoint}}/aura-services/v2/configuration/channels -H "Authorization: {{authorization_header}}" -o channels_config.json
    
    $ cat channels_config.json| jq -r '.[] | .name + ":" +.id'
    {{ channels }}
    
    # Example of channels
    # novum-mytelco:45494a5b-835a-4fff-a813-b3d2be529dbe
    # whatsapp:f7fd1021-41cd-588a-a461-387cc24be223
    # whatsapp-1004:e75e7b9d-7949-451a-9493-3d759745492c
    # movistar-plus:60f0ffda-e58a-4a96-aad9-d42be70b7b42
    # set-top-box:814bc401-7743-47d3-957b-7f1b2dafe398
    # set-top-box-haac:dc388448-b1d1-11e9-b77b-67224ed60908
    

Queries

Total number of users registered in Aura

⚠️ This information is only for authenticated users. Currently, anonymous users are not stored in the Aura users’ database.

$ mongo -u {{mongo_user}} -p {{mongo_pass}} {{mongo_uri}}
> use {{mongo_users_db}}
> db.{{mongo_users_col}}.find({}).count()
10167

Total number of users registered in Aura per channel

⚠️ This information is only for authenticated users. Currently, anonymous users are not stored in the Aura users’ database.

Use the output of {{ channels }} to identify the channel by its name rather than by its identifier.

$ mongo -u {{mongo_user}} -p {{mongo_pass}} {{mongo_uri}}
> use {{mongo_users_db}}
> db.{{mongo_users_col}}.aggregate([
    {"$group" : {_id: "$channelId", count: {$sum:1}}}
])
{ "_id" : "981e5db7-8031-4370-a326-b6f4d163cd82", "count" : 1 }
{ "_id" : "814bc401-7743-47d3-957b-7f1b2dafe398", "count" : 21 }
{ "_id" : "189d4016-bcd0-491d-a75e-64e7a54aa75c", "count" : 1 }
{ "_id" : "b2501583-6d76-4e77-b364-aa169490efec", "count" : 1 }
{ "_id" : "b94aec9a-da4d-46de-afc3-06cfe0157888", "count" : 1 }
{ "_id" : "60f0ffda-e58a-4a96-aad9-d42be70b7b42", "count" : 22 }
{ "_id" : "e75e7b9d-7949-451a-9493-3d759745492c", "count" : 2518 }
{ "_id" : "9924335b-321a-4f48-b820-e35c7eb9e58b", "count" : 1 }
{ "_id" : "dc388448-b1d1-11e9-b77b-67224ed60908", "count" : 69 }
{ "_id" : "f7fd1021-41cd-588a-a461-387cc24be223", "count" : 2062 }
{ "_id" : "e59aa30f-bae5-4c9e-9d1a-0be8b904711d", "count" : 1 }
{ "_id" : "45494a5b-835a-4fff-a813-b3d2be529dbe", "count" : 346 }
{ "_id" : "5ad28380-85fa-4ba2-bcdb-0732127f4a85", "count" : 4792 }
{ "_id" : "ecd189c2-b1dd-4142-bbe1-eb9549b327e2", "count" : 1 }
{ "_id" : "25695326-c67c-40fe-b5df-a8fd5b4deb22", "count" : 14 }
{ "_id" : "4c14973e-3369-4c6a-b59d-e3e0ecaed78c", "count" : 291 }
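
The aggregation returns channel identifiers, while the `{{ channels }}` output lists `name:id` pairs. A small Python sketch of how the two could be combined is shown below; the function names are hypothetical, and the sample values are taken from the examples in this document.

```python
from typing import Dict, Iterable


def parse_channels(lines: Iterable[str]) -> Dict[str, str]:
    """Map channel id to channel name from `name:id` lines."""
    channels: Dict[str, str] = {}
    for line in lines:
        name, separator, channel_id = line.strip().rpartition(":")
        if separator:  # skip lines without a `name:id` pair
            channels[channel_id] = name
    return channels


def label_counts(channels: Dict[str, str], counts: Dict[str, int]) -> Dict[str, int]:
    """Replace known channel ids with names; unknown ids are kept as-is."""
    return {channels.get(channel_id, channel_id): count
            for channel_id, count in counts.items()}


if __name__ == "__main__":
    channel_lines = [
        "novum-mytelco:45494a5b-835a-4fff-a813-b3d2be529dbe",
        "whatsapp:f7fd1021-41cd-588a-a461-387cc24be223",
    ]
    user_counts = {
        "45494a5b-835a-4fff-a813-b3d2be529dbe": 346,
        "f7fd1021-41cd-588a-a461-387cc24be223": 2062,
        "5ad28380-85fa-4ba2-bcdb-0732127f4a85": 4792,
    }
    print(label_counts(parse_channels(channel_lines), user_counts))
```

Identifiers that do not appear in the channels configuration are left untouched, so no count is lost.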

Total number of users with expired authorization_id

An authorization_id is expired if it has not been used for 180 days in a row.

$ mongo -u {{mongo_user}} -p {{mongo_pass}} {{mongo_uri}}
> use {{mongo_users_db}}
> db.{{mongo_users_col}}.find({lastAccess: {
        $lt: new Date(ISODate().getTime() - 180*24*60*60*1000)
    }
}).count()
3034
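
The query above compares `lastAccess` with a cutoff date 180 days in the past. The same cutoff logic can be sketched in Python as follows; the function names are hypothetical, and the filter dictionary simply mirrors the shell query.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

EXPIRY_DAYS = 180  # an authorization_id expires after 180 days without use


def expiry_cutoff(now: datetime) -> datetime:
    """Return the oldest lastAccess date that is still considered valid."""
    return now - timedelta(days=EXPIRY_DAYS)


def is_expired(last_access: datetime, now: datetime) -> bool:
    """True when the last access happened before the cutoff."""
    return last_access < expiry_cutoff(now)


def expired_filter(now: datetime) -> Dict[str, Any]:
    """Build the same filter document as the shell query above."""
    return {"lastAccess": {"$lt": expiry_cutoff(now)}}


if __name__ == "__main__":
    current = datetime.now(timezone.utc)
    print(expired_filter(current))
```

With a MongoDB driver such as PyMongo, a filter like this could be passed to `count_documents` on the users collection, assuming the connection details obtained in the requirements section.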