Management of alerts in Aura

Learn how to manage alerts through the Prometheus system.

Introduction to alerts in Aura

As previously stated, Prometheus has a list of alert rules that are part of the platform configuration. These alerting rules let you define alert conditions based on the Prometheus expression language (PromQL).

⚠️ You can edit the Aura alert rules but, for now, any changes are lost on re-deployment.
If you think an alert is important and should be part of the platform, let us know so that we can officially include it.

Alerts are sent via email, using a global SMTP server managed by the Aura Team. Other notification channels, such as Slack, are also available but are not used by default in production.

Alerts are disabled (silenced) during Aura deployments to avoid false positives caused, for example, by services that need to be restarted.

To manage alerts, the Aura Platform includes the AlertManager system, which is part of the Prometheus stack. The URL to access AlertManager is:
alerts-{{ environment_name }}.auracognitive.com

When you open the web interface, you can see all the alerts, as shown in the image below.

Alert manager home

In this panel, the most important action is to silence an alert, either by clicking “silence alarm” on that alert or by pressing the “new silence” button.

Alert manager new silence
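
A silence can also be described as data. The following is a minimal sketch, assuming that the AlertManager v2 HTTP API (`POST /api/v2/silences`) is reachable on the same host as the web interface, which this page does not confirm. The function name `build_silence` and its parameters are hypothetical; it only builds the request body and sends nothing.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_silence(alertname: str, hours: float, author: str, comment: str,
                  start: Optional[datetime] = None) -> dict:
    """Build a JSON-serializable body for POST /api/v2/silences.

    The silence matches a single alert by its exact `alertname` label
    and lasts `hours` hours from `start` (default: now, in UTC).
    """
    start = start or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Exact (non-regex) match on the alert name.
            {"name": "alertname", "value": alertname,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }
```

For example, `build_silence("high_cpu_usage_on_hosts", 2, "ops", "planned maintenance")` describes a two-hour silence for the CPU alert listed later on this page. The author and comment values are placeholders.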

To check whether the cluster is OK (ready), or to see the status of the system, click on the “status” section.

Alert manager status
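
The same readiness check can be scripted. This is a sketch under the assumption that the AlertManager v2 API endpoint `/api/v2/status` is exposed without authentication at the AlertManager URL; the helper names `fetch_status` and `cluster_is_ready` are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url: str) -> dict:
    """Fetch the AlertManager status document (performs a network call)."""
    with urlopen(f"{base_url}/api/v2/status", timeout=10) as response:
        return json.load(response)


def cluster_is_ready(status: dict) -> bool:
    """Return True when the status document reports cluster status "ready"."""
    return status.get("cluster", {}).get("status") == "ready"
```

`cluster_is_ready(fetch_status(base_url))` would then return `True` for a healthy cluster, where `base_url` is the AlertManager address for your environment.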

Alerts set in Aura

This section lists the alerts currently set in Aura, organized by scope.

Scope: infrastructure

  • high_cpu_usage_on_hosts

    • Description: « $labels.kubernetes_io_hostname » is using a LOT of CPU. CPU usage is « humanize $value »%.
    • Expr: sum by(kubernetes_io_hostname) (rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum by(kubernetes_io_hostname) (machine_cpu_cores) * 100 > 90
    • For: 10m
    • summary: HIGH CPU USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
  • high_memory_usage_on_hosts

    • Description: « $labels.kubernetes_io_hostname » is using a LOT of Memory. Memory usage is « humanize $value »%.
    • Expr: sum by(kubernetes_io_hostname) (container_memory_working_set_bytes{id="/"}) / sum by(kubernetes_io_hostname) (machine_memory_bytes) * 100 > 90
    • For: 10m
    • summary: HIGH MEMORY USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
  • high_fs_usage_on_hosts

    • Description: « $labels.kubernetes_io_hostname » is using a LOT of FileSystem space. FileSystem usage is « humanize $value »%.
    • Expr: sum by(kubernetes_io_hostname) (container_fs_usage_bytes{device=~"^/dev/.*$",id="/"}) / sum by(kubernetes_io_hostname) (container_fs_limit_bytes{device=~"^/dev/.*$",id="/"}) * 100 > 70
    • For: 10m
    • summary: HIGH FILESYSTEM USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’
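
The three infrastructure expressions share one shape: used resources divided by capacity, multiplied by 100, and compared with a threshold. The sketch below reproduces only that arithmetic with the thresholds from the rules above. The names `usage_percent`, `breaches` and `THRESHOLDS` are illustrative, and the `For: 10m` hold time is not modelled.

```python
# Thresholds (in percent) taken from the infrastructure rules above.
THRESHOLDS = {"cpu": 90, "memory": 90, "filesystem": 70}


def usage_percent(used: float, capacity: float) -> float:
    """Usage as a percentage of capacity, as in `used / capacity * 100`."""
    return used / capacity * 100


def breaches(kind: str, used: float, capacity: float) -> bool:
    """True when the usage is strictly above the threshold for `kind`."""
    return usage_percent(used, capacity) > THRESHOLDS[kind]


# A host with 8 cores consuming 7.4 CPU-seconds per second is at about 92.5%,
# which is above the 90% CPU threshold.
cpu_alert = breaches("cpu", used=7.4, capacity=8)

# 300 units used out of 500 is 60%, below the 70% filesystem threshold.
fs_alert = breaches("filesystem", used=300, capacity=500)
```

In Prometheus, the condition must additionally hold for the whole `For` period before the alert fires.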

Scope: kubernetes

  • high_persistent_volume_usage

    • Description: « $labels.persistentvolumeclaim » on « $labels.kubernetes_io_hostname » is using a LOT of persistent volume space. Persistent volume usage is « humanize $value »%.
    • Expr: kubelet_volume_stats_used_bytes * 100 / kubelet_volume_stats_capacity_bytes > 70
    • For: 10m
    • summary: HIGH PERSISTENT VOLUME USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’ by ‘{{ $labels.persistentvolumeclaim }}’
  • high_persistent_volume_inode_usage

    • Description: « $labels.persistentvolumeclaim » on « $labels.kubernetes_io_hostname » is using a LOT of persistent volume inodes. Persistent volume inode usage is « humanize $value »%.
    • Expr: kubelet_volume_stats_inodes_used * 100 / kubelet_volume_stats_inodes > 70
    • For: 10m
    • summary: HIGH PERSISTENT VOLUME INODE USAGE WARNING ON ‘{{ $labels.kubernetes_io_hostname }}’ by ‘{{ $labels.persistentvolumeclaim }}’
  • docker_deleted_container_rate_on_hosts

    • Description: « $labels.kubernetes_io_hostname » has a HIGH rate of deleted/stopped containers.
    • Expr: sum by(kubernetes_io_hostname) (rate(kubelet_docker_operations{operation_type=~"remove_container|stop_container"}[5m])) > 0.1
    • For: 1m
    • summary: DOCKER DELETED/STOPPED CONTAINER RATE WARNING
  • runtime_deleted_container_rate_on_hosts

    • Description: « $labels.kubernetes_io_hostname » has a HIGH rate of deleted/stopped containers.
    • Expr: sum by(kubernetes_io_hostname) (rate(kubelet_runtime_operations{operation_type=~"stop_podsandbox|remove_container|stop_container"}[5m])) > 0.1
    • For: 1m
    • summary: RUNTIME DELETED/STOPPED CONTAINER RATE WARNING
  • frequent_container_restarts

    • Description: Container « $labels.container » on pod « $labels.pod » has been restarted « $value » times within the last hour.
    • Expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
    • For: 5m
    • summary: KUBERNETES FREQUENT CONTAINER RESTARTS WARNING
  • node_not_ready

    • Description: Node « $labels.node » has status « $labels.condition » as « $labels.status ».
    • Expr: kube_node_status_condition{condition!="Ready",status!="false"} > 0 or on(node) kube_node_status_condition{condition="Ready",status="false"} > 0
    • For: 5m
    • summary: KUBERNETES NODE NOT READY WARNING
  • job_error

    • Description: JOB ERROR
    • Expr: kube_job_status_failed==1
    • For: 5m
    • summary: KUBERNETES JOB NOT READY WARNING

Scope: prometheus

  • prometheus_rule_evaluation_slow

    • Description: Prometheus has a 90th percentile latency of « $value »s completing rule evaluation cycles.
    • Expr: prometheus_evaluator_duration_seconds{quantile="0.9"} > 60
    • For: 10m
    • summary: PROMETHEUS RULE EVALUATION SLOW WARNING
  • prometheus_indexing_backlog

    • Description: Prometheus is backlogging on the indexing queue. Queue is currently « $value | printf %.0f »% full.
    • Expr: prometheus_local_storage_indexing_queue_length / prometheus_local_storage_indexing_queue_capacity * 100 > 10
    • For: 10m
    • summary: PROMETHEUS INDEXING BACKLOG WARNING
  • prometheus_not_ingesting_samples

    • Description: Prometheus has not ingested any sample in the last 10 minutes.
    • Expr: rate(prometheus_local_storage_ingested_samples_total[5m]) == 0
    • For: 5m
    • summary: PROMETHEUS NOT INGESTING SAMPLES WARNING
  • prometheus_persist_errors

    • Description: Prometheus has encountered « $value » persistent errors per second in the last 10 minutes.
    • Expr: rate(prometheus_local_storage_persist_errors_total[10m]) > 0
    • For: 5m
    • summary: PROMETHEUS PERSIST ERRORS WARNING
  • prometheus_notifications_backlog

    • Description: Prometheus is backlogging on the notifications queue. The queue has not been empty for 10 minutes. Current queue length: « $value ».
    • Expr: prometheus_notifications_queue_length > 0
    • For: 10m
    • summary: PROMETHEUS NOTIFICATIONS BACKLOG WARNING
  • prometheus_storage_inconsistent

    • Description: Prometheus has detected a storage inconsistency. A server restart is needed to initiate recovery.
    • Expr: prometheus_local_storage_inconsistencies_total > 0
    • For: 5m
    • summary: PROMETHEUS STORAGE INCONSISTENCY WARNING
  • prometheus_persistence_pressure_too_high_24h

    • Description: Prometheus is approaching critical persistence pressure. Throttled ingestion expected within the next 24h.
    • Expr: prometheus_local_storage_persistence_urgency_score > 0.8 and predict_linear(prometheus_local_storage_persistence_urgency_score[30m], 3600 * 24) > 1
    • For: 30m
    • summary: PROMETHEUS PERSISTENCE PRESSURE 24H WARNING
  • prometheus_persistence_pressure_too_high_2h

    • Description: Prometheus is approaching critical persistence pressure. Throttled ingestion expected within the next 2h.
    • Expr: prometheus_local_storage_persistence_urgency_score > 0.85 and predict_linear(prometheus_local_storage_persistence_urgency_score[30m], 3600 * 2) > 1
    • For: 30m
    • summary: PROMETHEUS PERSISTENCE PRESSURE 24H WARNING
  • prometheus_series_maintenance_stalled

    • Description: Prometheus is maintaining memory time series so slowly that it will take « $value | printf %.0f »h to complete a full cycle. This will lead to persistence falling behind.
    • Expr: prometheus_local_storage_memory_series / on(job, instance) rate(prometheus_local_storage_series_ops_total{type="maintenance_in_memory"}[5m]) / 3600 > 24 and prometheus_local_storage_rushed_mode == 1
    • For: 1h
    • summary: PROMETHEUS SERIES MAINTENANCE WARNING
  • prometheus_target_scrape_sync_too_low

    • Description: Prometheus target scrape sync rate is too low.
    • Expr: rate(prometheus_target_scrape_pool_sync_total{app="prometheus"}[10m]) == 0
    • For: 5m
    • summary: PROMETHEUS TARGET SCRAPE SYNC WARNING
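
The two persistence pressure alerts above combine a current-value check with `predict_linear`, which fits a straight line to the samples in a time window and extrapolates it. The sketch below illustrates that idea with an ordinary least-squares fit. It is a simplification, not the Prometheus implementation: it extrapolates from the last sample's timestamp rather than from the evaluation time, and it ignores the `For: 30m` hold time. All function names are illustrative.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def predict_linear(samples: List[Sample], seconds_ahead: float) -> float:
    """Extrapolate a least-squares line `seconds_ahead` past the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


def pressure_alert(samples: List[Sample], current_threshold: float,
                   seconds_ahead: float) -> bool:
    """Current score above the threshold AND predicted score above 1."""
    current = samples[-1][1]
    return (current > current_threshold
            and predict_linear(samples, seconds_ahead) > 1)


# Urgency score rising by 0.02 every 10 minutes over a 30-minute window.
rising = [(0, 0.80), (600, 0.82), (1200, 0.84), (1800, 0.86)]
fires_2h = pressure_alert(rising, 0.85, 3600 * 2)
```

With these made-up samples the score grows by 0.12 per hour, so the line crosses 1 well within two hours and the 2h condition holds. A flat score of 0.9 would pass the first check but not the second.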

Scope: logs

  • elasticsearch_too_few_nodes_running

    • Description: There are only « $value » < 3 ElasticSearch nodes running.
    • Expr: elasticsearch_cluster_health_number_of_nodes < 3
    • For: 10m
    • summary: TOO FEW ELASTICSEARCH NODES
  • elasticsearch_high_memory_usage

    • Description: The memory (heap) usage is over 90% for 15m on node « $labels.node »
    • Expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.9
    • For: 15m
    • summary: ELASTICSEARCH HIGH MEMORY USAGE
  • elasticsearch_not_indexing

    • Description: ElasticSearch data node is not indexing new documents
    • Expr: increase(elasticsearch_indices_docs{es_data_node="true"}[1m]) == 0
    • For: 5m
    • summary: ELASTICSEARCH NOT INDEXING

Scope: Aura

  • aura-bot_unauthorized_aura-bridge

    • Description: aura-bridge has not authorized the connection with aura-bot for 3 minutes.
    • Expr: sum by (status_code) (rate(http_request_duration_seconds_count{app="aura-bridge",status_code=~"401"}[3m])) > 0
    • For: 3m
    • summary: AURA-BOT RETURN UNAUTHORIZED TO AURA-BRIDGE
  • aura-bot_bad-request_aura-bridge

    • Description: aura-bridge has not been able to correctly handle the connection with aura-bot for 3 minutes.
    • Expr: sum by (status_code) (rate(http_request_duration_seconds_count{app="aura-bridge",status_code=~"400"}[3m])) > 0
    • For: 3m
    • summary: AURA-BOT RETURN BAD REQUEST TO AURA-BRIDGE
  • aura-bot_internal-error_aura-bridge

    • Description: aura-bridge failed to connect to aura-bot for 3 minutes.
    • Expr: sum by (host,status) (rate(outgoing_request_duration_seconds_count{app="aura-bridge",status=~"5..",host=~"aura-bot.*"}[3m])) > 0
    • For: 3m
    • summary: COMMUNICATION ERROR BETWEEN AURA-BOT AND AURA-BRIDGE
  • aura-bridge-error_callback

    • Description: aura-bridge failed to handle the connection with callback for 3 minutes.
    • Expr: sum by (host,status) (rate(outgoing_request_duration_seconds_count{app="aura-bridge",status=~"5..",host!~"aura-bot.*"}[3m])) > 0
    • For: 3m
    • summary: COMMUNICATION ERROR BETWEEN AURA-BRIDGE AND CALLBACK
  • aura-bridge_error_whatsapp

    • Description: errors in aura-bridge with WhatsApp functionality for 5 minutes.
    • Expr: sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"aura-bot|whatsapp|4p",originStatus!="200",httpStatus!~"403|408|400"}[5m])) > 0
    • For: 5m
    • summary: Error happened in WhatsApp functionality.
  • aura-bridge_error_4p

    • Description: errors in aura-bridge with Kernel in WhatsApp functionality for 5 minutes.
    • Expr: sum by (origin,originStatus)(rate(outgoing_message_duration_seconds_count{app="aura-bridge",origin=~"4p",httpStatus=~"403|408|400"}[5m])) > 0
    • For: 5m
    • summary: Error happened with Kernel in WhatsApp functionality.
  • nlp-provisioning_killed-processes

    • Description: killed nlp-provisioning processes for 15 minutes.
    • Expr: sum by (exported_job) (rate(nlp_provisioning_killed_processes{exported_job="nlp_provisioning_job"}[15m])) > 0
    • For: 15m
    • summary: Processes killed in nlp-provisioning
  • alive-processes_nlp-provisioning_expected-alive-processes

    • Description: alive nlp-provisioning processes vs expected alive nlp-provisioning processes for 15 minutes.
    • Expr: sum by (exported_job)(nlp_provisioning_alive_processes{exported_job="nlp_provisioning_job"}) / sum by (exported_job) (nlp_provisioning_expected_alive_processes{exported_job="nlp_provisioning_job"})!=1
    • For: 15m
    • summary: Processes killed in nlp-provisioning

Scope: misc

  • probe_down

    • Description: The endpoint « $labels.instance » is down or not reachable. The blackbox exporter could not validate « $labels.app »’s health.
    • Expr: probe_success == 0
    • For: 2m
    • summary: PROBE FAILING