Tags:

aura-platform

Management of alerts in Aura

Learn how to manage alerts through the Prometheus system.

Introduction to alerts in Aura

As previously stated, Prometheus has a list of alert rules that are part of the platform configuration. These alerting rules let you define alert conditions based on the Prometheus expression language (PromQL).

⚠️ You can edit the Aura alert rules but, for now, any changes are lost on the next redeployment.
If you think an alert is important and should be part of the platform, let us know, so we can officially include it.

Alerts are sent via email, using a global SMTP server managed by the Aura Team. Other notification channels, such as Slack, are also available but are not used by default in production.

Alerts are disabled (silenced) during Aura deployments to avoid false positives caused, for example, by services that need to be restarted.

To manage alerts, the Aura Platform includes Alertmanager, which is part of the Prometheus stack. Alertmanager is available at:
alerts-{{ environment_name }}.auracognitive.com

When you open the web interface, you can see all the alerts, as shown in the image below.

Alert manager home

In this panel, the most important action is silencing an alert, either by clicking “silence alarm” on the alert itself or by pressing the “new silence” button.

Alert manager new silence
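
The same silence can be described as data. Alertmanager offers an HTTP API (v2) whose `/api/v2/silences` endpoint accepts a JSON body with `matchers`, `startsAt`, `endsAt`, `createdBy` and `comment`. The sketch below only builds that body and does not send it; whether the API is reachable in a given Aura environment is an assumption, and the helper name, the user and the comment are illustrative.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_silence(alertname: str, hours: float, created_by: str, comment: str,
                  now: Optional[datetime] = None) -> dict:
    """Build a silence body for one alert name, valid for the given number of hours."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # Match alerts by exact name; isRegex=False means a literal comparison.
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,  # hypothetical user, for illustration only
        "comment": comment,
    }


if __name__ == "__main__":
    body = build_silence("high_cpu_usage_on_hosts", 2, "ops-user", "planned maintenance")
    print(json.dumps(body, indent=2))
```

Building the body separately from sending it makes it easy to review the matchers before anything is silenced.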

To check whether the cluster is ready, or to see the overall status of the system, click the “status” section.

Alert manager status
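
The readiness check shown on the status page can also be expressed in code. Alertmanager's HTTP API (v2) returns a status document from `/api/v2/status` that contains a `cluster` object with a `status` field. The following sketch interprets such a document that has already been fetched; the sample payloads are made up for illustration, and access to the API in an Aura environment is an assumption.

```python
def cluster_is_ready(status_payload: dict) -> bool:
    """Return True when the status document reports the cluster as ready."""
    cluster = status_payload.get("cluster") or {}
    return cluster.get("status") == "ready"


if __name__ == "__main__":
    # Hypothetical payloads, reduced to the field the check needs.
    print(cluster_is_ready({"cluster": {"status": "ready", "peers": []}}))
    print(cluster_is_ready({"cluster": {"status": "settling", "peers": []}}))
```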

Alerts set in Aura

This section lists the alerts currently set in Aura, organized by scope.

Scope: infrastructure

high_cpu_usage_on_hosts
- Description: « $labels.kubernetes_io_hostname » is using a LOT of CPU. CPU usage is « humanize $value »%.
- Expr: sum by(kubernetes_io_hostname) (rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum by(kubernetes_io_hostname) (machine_cpu_cores) * 100 > 90
- For: 10m
- summary: HIGH CPU USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
high_memory_usage_on_hosts
- Description: « $labels.kubernetes_io_hostname » is using a LOT of Memory. Memory usage is « humanize $value »%.
- Expr: sum by(kubernetes_io_hostname) (container_memory_working_set_bytes{id="/"}) / sum by(kubernetes_io_hostname) (machine_memory_bytes) * 100 > 90
- For: 10m
- summary: HIGH MEMORY USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
high_fs_usage_on_hosts
- Description: « $labels.kubernetes_io_hostname » is using a LOT of FileSystem space. FileSystem usage is « humanize $value »%.
- Expr: sum by(kubernetes_io_hostname) (container_fs_usage_bytes{device=~"^/dev/.*$",id="/"}) / sum by(kubernetes_io_hostname) (container_fs_limit_bytes{device=~"^/dev/.*$",id="/"}) * 100 > 70
- For: 10m
- summary: HIGH FILESYSTEM USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
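
Each entry above pairs an expression and a threshold with a `For` duration: the condition has to hold continuously for that long before the alert fires. The following is a minimal Python sketch of that behaviour for `high_cpu_usage_on_hosts`. The function names, the sample values and the one-minute evaluation interval are illustrative assumptions, not part of the Aura configuration.

```python
from typing import Optional, Sequence


def cpu_usage_percent(cpu_seconds_per_second: float, cores: int) -> float:
    """Mirror the Expr: CPU seconds consumed per second / number of cores * 100."""
    return cpu_seconds_per_second / cores * 100


def first_firing_index(samples: Sequence[float], threshold: float,
                       for_minutes: int, interval_minutes: int = 1) -> Optional[int]:
    """Return the index of the evaluation at which the alert would fire, or None."""
    pending_since: Optional[int] = None
    for i, value in enumerate(samples):
        if value > threshold:
            if pending_since is None:
                pending_since = i  # condition just became true: alert is pending
            if (i - pending_since) * interval_minutes >= for_minutes:
                return i  # condition held for the whole For window
        else:
            pending_since = None  # any dip below the threshold resets the window
    return None


if __name__ == "__main__":
    # Hypothetical 8-core host: five busy minutes, one calm minute, twelve busy minutes.
    rates = [7.4] * 5 + [6.0] + [7.6] * 12
    usage = [cpu_usage_percent(r, cores=8) for r in rates]
    print(first_firing_index(usage, threshold=90, for_minutes=10))
```

With these sample values the first busy period is too short to fire, the calm minute resets the window, and the alert fires at index 16, ten minutes after the second busy period starts.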

Scope: kubernetes

high_persistent_volume_usage
- Description: « $labels.persistentvolumeclaim » on « $labels.kubernetes_io_hostname » is using a LOT of persistent volume space. Persistent volume usage is « humanize $value »%.
- Expr: kubelet_volume_stats_used_bytes * 100 / kubelet_volume_stats_capacity_bytes > 70
- For: 10m
- summary: HIGH PERSISTENT VOLUME USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’ by ‘{{ $labels.persistentvolumeclaim }}’
high_persistent_volume_inode_usage
- Description: « $labels.persistentvolumeclaim » on « $labels.kubernetes_io_hostname » is using a LOT of persistent volume inodes. Persistent volume inode usage is « humanize $value »%.
- Expr: kubelet_volume_stats_inodes_used * 100 / kubelet_volume_stats_inodes > 70
- For: 10m
- summary: HIGH PERSISTENT VOLUME INODE USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’ by ‘{{ $labels.persistentvolumeclaim }}’
docker_deleted_container_rate_on_hosts
- Description: « $labels.kubernetes_io_hostname » has a HIGH rate of deleted/stopped containers.
- Expr: sum by(kubernetes_io_hostname) (rate(kubelet_docker_operations{operation_type=~"remove_container|stop_container"}[5m])) > 0.1
- For: 1m
- summary: DOCKER DELETED/STOPPED CONTAINER RATE WARNING
runtime_deleted_container_rate_on_hosts
- Description: « $labels.kubernetes_io_hostname » has a HIGH rate of deleted/stopped containers.
- Expr: sum by(kubernetes_io_hostname) (rate(kubelet_runtime_operations{operation_type=~"stop_podsandbox|remove_container|stop_container"}[5m])) > 0.1
- For: 1m
- summary: RUNTIME DELETED/STOPPED CONTAINER RATE WARNING
frequent_container_restarts
- Description: Container « $labels.container » on pod « $labels.pod » has been restarted « $value » times within the last hour.
- Expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
- For: 5m
- summary: KUBERNETES FREQUENT CONTAINER RESTARTS WARNING
node_not_ready
- Description: Node « $labels.node » has status « $labels.condition » as « $labels.status ».
- Expr: kube_node_status_condition{condition!="Ready",status!="false"} > 0 or on(node) kube_node_status_condition{condition="Ready",status="false"} > 0
- For: 5m
- summary: KUBERNETES NODE NOT READY WARNING
job_error
- Description: JOB ERROR
- Expr: kube_job_status_failed==1
- For: 5m
- summary: KUBERNETES JOB NOT READY WARNING

Scope: prometheus

prometheus_rule_evaluation_slow
- Description: Prometheus has a 90th percentile latency of « $value »s completing rule evaluation cycles.
- Expr: prometheus_evaluator_duration_seconds{quantile="0.9"} > 60
- For: 10m
- summary: PROMETHEUS RULE EVALUATION SLOW WARNING
prometheus_indexing_backlog
- Description: Prometheus is backlogging on the indexing queue. Queue is currently « $value | printf %.0f »% full.
- Expr: prometheus_local_storage_indexing_queue_length / prometheus_local_storage_indexing_queue_capacity * 100 > 10
- For: 10m
- summary: PROMETHEUS INDEXING BACKLOG WARNING
prometheus_not_ingesting_samples
- Description: Prometheus has not ingested any sample in the last 10 minutes.
- Expr: rate(prometheus_local_storage_ingested_samples_total[5m]) == 0
- For: 5m
- summary: PROMETHEUS NOT INGESTING SAMPLES WARNING
prometheus_persist_errors
- Description: Prometheus has encountered « $value » persistent errors per second in the last 10 minutes.
- Expr: rate(prometheus_local_storage_persist_errors_total[10m]) > 0
- For: 5m
- summary: PROMETHEUS PERSIST ERRORS WARNING
prometheus_notifications_backlog
- Description: Prometheus is backlogging on the notifications queue. The queue has not been empty for 10 minutes. Current queue length: « $value ».
- Expr: prometheus_notifications_queue_length > 0
- For: 10m
- summary: PROMETHEUS NOTIFICATIONS BACKLOG WARNING
prometheus_storage_inconsistent
- Description: Prometheus has detected a storage inconsistency. A server restart is needed to initiate recovery.
- Expr: prometheus_local_storage_inconsistencies_total > 0
- For: 5m
- summary: PROMETHEUS STORAGE INCONSISTENCY WARNING
prometheus_persistence_pressure_too_high_24h
- Description: Prometheus is approaching critical persistence pressure. Throttled ingestion expected within the next 24h.
- Expr: prometheus_local_storage_persistence_urgency_score > 0.8 and predict_linear(prometheus_local_storage_persistence_urgency_score[30m], 3600 * 24) > 1
- For: 30m
- summary: PROMETHEUS PERSISTENCE PRESSURE 24H WARNING
prometheus_persistence_pressure_too_high_2h
- Description: Prometheus is approaching critical persistence pressure. Throttled ingestion expected within the next 2h.
- Expr: prometheus_local_storage_persistence_urgency_score > 0.85 and predict_linear(prometheus_local_storage_persistence_urgency_score[30m], 3600 * 2) > 1
- For: 30m
- summary: PROMETHEUS PERSISTENCE PRESSURE 24H WARNING
prometheus_series_maintenance_stalled
- Description: Prometheus is maintaining memory time series so slowly that it will take « $value | printf %.0f »h to complete a full cycle. This will lead to persistence falling behind.
- Expr: prometheus_local_storage_memory_series / on(job, instance) rate(prometheus_local_storage_series_ops_total{type="maintenance_in_memory"}[5m]) / 3600 > 24 and prometheus_local_storage_rushed_mode == 1
- For: 1h
- summary: PROMETHEUS SERIES MAINTENANCE WARNING
prometheus_target_scrape_sync_too_low
- Description: Prometheus target scrape sync rate is too low.
- Expr: rate(prometheus_target_scrape_pool_sync_total{app="prometheus"}[10m]) == 0
- For: 5m
- summary: PROMETHEUS TARGET SCRAPE SYNC WARNING
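
The two persistence pressure alerts above rely on `predict_linear`, which fits a straight line through the samples in the range and extrapolates it a given number of seconds ahead. The sketch below reproduces that idea with an ordinary least-squares fit in plain Python. It is an approximation for illustration, not Prometheus's implementation, and the sample series is made up.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]],
                   eval_time: float, seconds_ahead: float) -> float:
    """Fit a least-squares line through (timestamp, value) pairs and extrapolate it."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + seconds_ahead)


def pressure_condition(current: float, predicted: float, floor: float) -> bool:
    """Both halves of the Expr: current score above the floor and prediction above 1."""
    return current > floor and predicted > 1


if __name__ == "__main__":
    # Hypothetical urgency score sampled every 10 minutes over a 30-minute range.
    series = [(0.0, 0.80), (600.0, 0.82), (1200.0, 0.84), (1800.0, 0.86)]
    in_2h = predict_linear(series, eval_time=1800.0, seconds_ahead=3600 * 2)
    print(round(in_2h, 2), pressure_condition(0.86, in_2h, floor=0.85))
```

In this example the score grows by 0.02 every 10 minutes, so the extrapolated value two hours later is about 1.10 and the 2h condition holds.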

Scope: logs

elasticsearch_too_few_nodes_running
- Description: There are only « $value » < 3 ElasticSearch nodes running.
- Expr: elasticsearch_cluster_health_number_of_node < 3
- For: 10m
- summary: TOO FEW ELASTICSEARCH NODES
elasticsearch_high_memory_usage
- Description: The memory (heap) usage is over 90% for 15m on node « $labels.node »
- Expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.9
- For: 15m
- summary: ELASTICSEARCH HIGH MEMORY USAGE
elasticsearch_not_indexing
- Description: ElasticSearch data node is not indexing new documents
- Expr: increase(elasticsearch_indices_docs{es_data_node="true"}[1m]) == 0
- For: 5m
- summary: ELASTICSEARCH NOT INDEXING

Scope: Aura

aura-bot_unauthorized_aura-bridge
- Description: aura-bridge has not authorized the connection with aura-bot for 3 minutes.
- Expr: sum by (status_code) (rate(http_request_duration_seconds_count{app="aura-bridge",status_code=~"401"}[3m])) > 0
- For: 3m
- summary: AURA-BOT RETURN UNAUTHORIZED TO AURA-BRIDGE
aura-bot_bad-request_aura-bridge
- Description: aura-bridge has not been able to correctly handle the connection with aura-bot for 3 minutes.
- Expr: sum by (status_code) (rate(http_request_duration_seconds_count{app="aura-bridge",status_code=~"400"}[3m])) > 0
- For: 3m
- summary: AURA-BOT RETURN BAD REQUEST TO AURA-BRIDGE
aura-bot_internal-error_aura-bridge
- Description: aura-bridge failed to connect to aura-bot for 3 minutes.
- Expr: sum by (host,status) (rate(outgoing_request_duration_seconds_count{app="aura-bridge",status=~"5..",host=~"aura-bot.*"}[3m])) > 0
- For: 3m
- summary: COMMUNICATION ERROR BETWEEN AURA-BOT AND AURA-BRIDGE
aura-bridge-error_callback
- Description: aura-bridge failed to handle the connection with callback for 3 minutes.
- Expr: sum by (host,status) (rate(outgoing_request_duration_seconds_count{app="aura-bridge",status=~"5..",host!~"aura-bot.*"}[3m])) > 0
- For: 3m
- summary: COMMUNICATION ERROR BETWEEN AURA-BRIDGE AND CALLBACK
aura-bridge_error_whatsapp
- Description: errors in aura-bridge with WhatsApp functionality for 5 minutes.
- Expr: sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus!="200",httpStatus!~"403|408|400"}[5m])) > 0
- For: 5m
- summary: Error happened in WhatsApp functionality.
aura-bridge_error_4p
- Description: errors in aura-bridge with Kernel in WhatsApp functionality for 5 minutes.
- Expr: sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"4p",httpStatus=~"403|408|400"}[5m])) > 0
- For: 5m
- summary: Error happened with Kernel in WhatsApp functionality.
nlp-provisioning_killed-processes
- Description: killed nlp-provisioning processes for 15 minutes.
- Expr: sum by (exported_job) (rate(nlp_provisioning_killed_processes{exported_job="nlp_provisioning_job"}[15m])) > 0
- For: 15m
- summary: Processes killed in nlp-provisioning
alive-processes_nlp-provisioning_expected-alive-processes
- Description: alive nlp-provisioning processes vs expected alive nlp-provisioning processes for 15 minutes.
- Expr: sum by (exported_job)(nlp_provisioning_alive_processes{exported_job="nlp_provisioning_job"}) / sum by (exported_job) (nlp_provisioning_expected_alive_processes{exported_job="nlp_provisioning_job"})!=1
- For: 15m
- summary: Processes killed in nlp-provisioning
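
Several of the expressions above select series with regular-expression matchers such as `status=~"5.."` and `host!~"aura-bot.*"`. In PromQL these matchers are fully anchored, so the pattern has to match the whole label value. The sketch below imitates that with Python's `re.fullmatch` and shows how a failing outgoing request from aura-bridge would be attributed to one of the two alerts. Python's `re` is used here as a stand-in for the RE2 syntax Prometheus uses, which is equivalent for these simple patterns; the host names are hypothetical.

```python
import re
from typing import Optional


def promql_match(pattern: str, value: str) -> bool:
    """Imitate a PromQL =~ matcher: the pattern must match the entire value."""
    return re.fullmatch(pattern, value) is not None


def alert_for_outgoing_error(host: str, status: str) -> Optional[str]:
    """Pick the alert whose label matchers select this outgoing request, if any."""
    if not promql_match("5..", status):
        return None  # only 5xx responses are selected by both expressions
    if promql_match("aura-bot.*", host):
        return "aura-bot_internal-error_aura-bridge"
    return "aura-bridge-error_callback"


if __name__ == "__main__":
    print(alert_for_outgoing_error("aura-bot.internal.example", "503"))
    print(alert_for_outgoing_error("callback.example", "500"))
    print(alert_for_outgoing_error("callback.example", "404"))
```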

Scope: misc

probe_down
- Description: The endpoint « $labels.instance » is down or not reachable. The blackbox exporter could not validate « $labels.app »’s health.
- Expr: probe_success == 0
- For: 2m
- summary: PROBE FAILING

Last modified September 12, 2025: feat: Clean up of MHC, Metaverso and COL projects #AURA-30235 [RTM] (17c23953)
