1 - Alertmanager dashboard

Alertmanager dashboard

Information provided by Alertmanager dashboard

Panels

Received alerts rate

It shows a time series of the received alerts rate, aggregated by one minute.

The x-axis shows time and the y-axis shows the received alerts rate.

The query used to get the panel information is:

sum(rate(prometheus_notifications_alertmanagers_discovered[1m])) by(status)
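
As a rough illustration of how a query of the form sum(rate(metric[1m])) by (label) is evaluated, the following Python sketch computes a per-second rate for each series over the window and then sums the rates of the series that share a label value. It is a simplification under stated assumptions: it ignores the extrapolation and counter-reset handling that Prometheus applies, and the helper names and sample data are hypothetical.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample in the window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplification: no extrapolation and no counter-reset handling.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def sum_rate_by(series, label):
    """Sum the per-series rates, grouped by the value of one label."""
    totals = defaultdict(float)
    for labels, samples in series:
        r = simple_rate(samples)
        if r is not None:
            totals[labels[label]] += r
    return dict(totals)


# Hypothetical samples scraped every 30 s within a one-minute window.
series = [
    ({"status": "firing", "instance": "a"}, [(0, 100), (30, 130), (60, 160)]),
    ({"status": "firing", "instance": "b"}, [(0, 10), (30, 10), (60, 40)]),
    ({"status": "resolved", "instance": "a"}, [(0, 5), (30, 5), (60, 11)]),
]
print(sum_rate_by(series, "status"))
```

Each key of the returned dictionary corresponds to one line in the panel.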

An example of this panel is shown below:

The available metrics are defined in the following sections.

Successful notification rate

It shows a time series of the successful notifications rate, aggregated by one minute.

The x-axis shows time and the y-axis shows the successful notifications rate.

The query used to get the panel information is:

sum(rate(prometheus_notifications_sent_total[1m])) by(integration)

An example of this panel is shown below:

Failed notifications rate

It shows a time series of the failed notifications rate, aggregated by one minute.

The x-axis shows time and the y-axis shows the failed notifications rate.

The query used to get the panel information is:

sum(rate(prometheus_notifications_errors_total[1m])) by(integration)

An example of this panel is shown below:

CPU usage rate

It shows a time series of the CPU usage rate, aggregated by one minute. It also shows the current minimum, maximum and average CPU consumption of Alertmanager.

The x-axis shows time and the y-axis shows the CPU usage rate.

The query used to get the panel information is:

sum(rate(container_cpu_usage_seconds_total{container="alertmanager"}[1m])) by (pod_name)

An example of this panel is shown below:

Memory usage

It shows a time series of the memory usage. It also shows the current minimum, maximum and average memory consumption of Alertmanager.

The x-axis shows time and the y-axis shows the memory usage.

The query used to get the panel information is:

sum (container_memory_working_set_bytes{container="alertmanager"}) by (pod_name)

An example of this panel is shown below:

Pods network I/O

It shows a time series of the average network I/O, aggregated by one minute. It also shows the current minimum, maximum and average network I/O.

The x-axis shows time and the y-axis shows the network usage.

The queries used to get the panel information are (received bytes first, then sent bytes as a negative value):

sum (rate (container_network_receive_bytes_total{pod!="",pod=~"alertmanager.*"}[1m])) by (pod)
- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~"alertmanager.*"}[1m])) by (pod)

An example of this panel is shown below:

2 - Elasticsearch dashboard

Elasticsearch dashboard

Information provided by Elasticsearch dashboard

Introduction

The Elasticsearch dashboard monitors multiple data-, service- and system-related metrics.

The different graphs are shown in the following sections:

  • Cluster graphs
  • Shards graphs
  • System graphs
  • Documents graphs
  • Total operations stats graphs
  • Elasticsearch times graphs
  • Caches graphs
  • Thread pool graphs
  • JVM garbage collection graphs

Cluster graphs

The current section includes cluster related graphs.

Health status

Colour-coded indicator of cluster health.

Metrics:

((sum(elasticsearch_cluster_health_status{color="green"})*2)+sum(elasticsearch_cluster_health_status{color="yellow"}))/count(elasticsearch_index_stats_up)
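
The expression above can be read as a score: each reporting instance that sees the cluster as green contributes 2, each that sees it as yellow contributes 1, and red contributes 0; the sum is then divided by the number of reporting instances. The following Python sketch reproduces that arithmetic, assuming that each instance exposes a value of 1 for its current colour and 0 for the other colours. The function name and inputs are hypothetical.

```python
def health_score(colors):
    """Average health score across reporting instances.

    colors: one entry per reporting instance, each "green", "yellow" or "red".
    Mirrors (sum(green) * 2 + sum(yellow)) / count(instances).
    """
    if not colors:
        raise ValueError("at least one instance is required")
    green = sum(1 for c in colors if c == "green")
    yellow = sum(1 for c in colors if c == "yellow")
    return (green * 2 + yellow) / len(colors)


print(health_score(["green", "green", "green"]))     # every instance reports green
print(health_score(["yellow", "yellow", "yellow"]))  # every instance reports yellow
print(health_score(["red", "red", "red"]))           # every instance reports red
```

Under this reading, a value of 2 corresponds to green, 1 to yellow and 0 to red.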

Nodes

Number of nodes.

Metrics:

count(elasticsearch_index_stats_up)

Data nodes

Number of data nodes per node.

Metrics:

sum(elasticsearch_cluster_health_number_of_data_nodes{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Pending tasks

Pending tasks per node.

Metrics:

sum(elasticsearch_cluster_health_number_of_pending_tasks{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Graph visual

Shards graphs

Shards related graphs.

Active primary shards

Number of active primary shards per node.

Metrics:

sum(elasticsearch_cluster_health_active_primary_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Active shards

Number of active shards per node.

Metrics:

sum(elasticsearch_cluster_health_active_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Initializing shards

Number of shards initializing per node.

Metrics:

sum(elasticsearch_cluster_health_initializing_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Relocating shards

Number of relocating shards per node.

Metrics:

sum(elasticsearch_cluster_health_relocating_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Unassigned shards

Number of unassigned shards per node. The query below is based on the delayed unassigned shards metric.

Metrics:

sum(elasticsearch_cluster_health_delayed_unassigned_shards{cluster="elasticsearch"})/count(elasticsearch_index_stats_up)

Graph visual

System graphs

System related graphs.

CPU usage

Percentage of used CPU on master and data nodes.

Metrics: It includes two metrics:

  • Master node CPU usage
elasticsearch_process_cpu_percent{cluster="elasticsearch",es_master_node="true",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}
  • Data nodes CPU usage
elasticsearch_process_cpu_percent{cluster="elasticsearch",es_data_node="true",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

JVM memory usage

Memory used by the JVM, in bytes.

Metrics:

It includes three metrics:

  • Used memory
elasticsearch_jvm_memory_used_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}
  • Committed memory
elasticsearch_jvm_memory_committed_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}
  • Max memory
elasticsearch_jvm_memory_max_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Disk usage

Fraction of disk space in use, computed from the available and total bytes.

Metrics:

1-(elasticsearch_filesystem_data_available_bytes{cluster="elasticsearch"}/elasticsearch_filesystem_data_size_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"})
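
The query above evaluates to one minus the ratio of available to total bytes, that is, a fraction between 0 and 1 rather than a byte count. A minimal Python sketch of the same calculation, with a hypothetical function name and made-up sizes:

```python
def disk_used_fraction(available_bytes, size_bytes):
    """Fraction of the filesystem in use: 1 - available / size."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 1 - available_bytes / size_bytes


# Hypothetical node with 100 GiB in total and 25 GiB still available.
gib = 1024 ** 3
print(disk_used_fraction(25 * gib, 100 * gib))
```

Multiplying the result by 100 gives the usage as a percentage.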

Network usage

Bytes rate sent and received, aggregated by one minute.

Metrics: It includes two metrics:

  • Sent bytes
irate(elasticsearch_transport_tx_size_bytes_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Received bytes
irate(elasticsearch_transport_rx_size_bytes_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
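
Several panels in this dashboard use irate instead of rate. In Prometheus, rate averages the increase over the whole range, while irate only looks at the last two samples in the range, so it reacts faster to short spikes. The following Python sketch shows the difference in a simplified form (no extrapolation or counter-reset handling); the function names and samples are hypothetical.

```python
def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical byte counter scraped every 15 s; traffic spikes at the end.
window = [(0, 0), (15, 150), (30, 300), (45, 450), (60, 1650)]
print(simple_rate(window))   # averaged over the whole minute
print(simple_irate(window))  # driven by the final 15 s only
```

With these samples the irate value is noticeably higher than the rate value, because only the final spike is taken into account.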

Graph visual

Documents graphs

Documents state related graphs.

Documents count

Number of documents in cluster.

Metrics:

elasticsearch_indices_docs{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Documents indexed rate

Rate of indexed documents, aggregated by one minute.

Metrics:

irate(elasticsearch_indices_indexing_index_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Documents deleted rate

Rate of deleted documents, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_docs_deleted{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Documents merged rate

Rate of merged documents, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_merges_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

Total operations stats graphs

Data related to total operations.

Total operations rate

Rate of total operations, aggregated by one minute.

Metrics: It includes six metrics:

  • Indexing index
irate(elasticsearch_indices_indexing_index_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Search queries
irate(elasticsearch_indices_search_query_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Search fetch
irate(elasticsearch_indices_search_fetch_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Merges
irate(elasticsearch_indices_merges_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Refresh
irate(elasticsearch_indices_refresh_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Flush
irate(elasticsearch_indices_flush_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Total operations time

Time rate for the different operations in milliseconds, aggregated by one minute.

Metrics: It includes six metrics:

  • Indexing index
irate(elasticsearch_indices_indexing_index_time_seconds_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Search queries
irate(elasticsearch_indices_search_query_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Search fetch
irate(elasticsearch_indices_search_fetch_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Merges
irate(elasticsearch_indices_merges_total_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Refresh
irate(elasticsearch_indices_refresh_total_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])
  • Flush
irate(elasticsearch_indices_flush_time_ms_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

Elasticsearch times graphs

Graphs related to elapsed times of different actions.

Query time

Time rate for search query operations in seconds, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_search_query_time_seconds{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m]) 

Indexing time

Time rate for indexing index operations in seconds, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_indexing_index_time_seconds_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Merging time

Time rate for merge operations in seconds, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_merges_total_time_seconds_total{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Caches graphs

Graphs related to caches metrics.

Field data memory size

Field data memory size in bytes.

Metrics:

elasticsearch_indices_fielddata_memory_size_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Field data evictions

Rate of field data evicted, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_fielddata_evictions{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Query cache size

Bytes of memory occupied by cached queries.

Metrics:

elasticsearch_indices_query_cache_memory_size_bytes{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Query cache evictions

Rate of queries evicted, aggregated by one minute.

Metrics:

rate(elasticsearch_indices_query_cache_evictions{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

Thread pool graphs

Graphs related to the thread pool.

Operations rejected

Rate of rejected operations, aggregated by one minute.

Metrics:

irate(elasticsearch_thread_pool_rejected_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Operations queued

Number of queued operations.

Metrics:

elasticsearch_thread_pool_queue_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Threads active

Number of active threads.

Metrics:

elasticsearch_thread_pool_active_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}

Operations completed

Rate of completed operations, aggregated by one minute.

Metrics:

irate(elasticsearch_thread_pool_completed_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

JVM Garbage collection graphs

Graphs related to JVM garbage collector activity.

GC count

Rate of GC count, aggregated by one minute.

Metrics:

rate(elasticsearch_jvm_gc_collection_seconds_count{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

GC time

Rate of GC execution time, aggregated by one minute.

Metrics:

rate(elasticsearch_jvm_gc_collection_seconds_sum{cluster="elasticsearch",name=~"(elasticsearch-es-aura-0|elasticsearch-es-aura-1|elasticsearch-es-aura-2)"}[1m])

Graph visual

3 - Fluent bit dashboard

Fluent bit dashboard

Information provided by Fluent bit dashboard

Introduction

The Fluent Bit dashboard monitors system metrics related to Fluent Bit.

The available metrics are defined in the following sections.

Input bytes

Input bytes rate, aggregated by one minute.

Metrics:

rate(fluentbit_input_bytes_total[1m])

Graph visual

Output bytes

Output bytes rate, aggregated by one minute.

Metrics:

rate(fluentbit_output_proc_bytes_total[1m])

Graph visual

Retries/fails

Rate of retries and failed retries, aggregated by one minute.

Metrics:
It includes two metrics:

  • Retries rate
rate(fluentbit_output_retries_total[1m])
  • Failed retries rate
rate(fluentbit_output_retries_failed_total[1m])

Graph visual

Errors

Rate of output errors, aggregated by one minute.

Metrics:

rate(fluentbit_output_errors_total[1m])

Graph visual

4 - Kubernetes cluster monitoring dashboard

Kubernetes cluster monitoring dashboard

Information provided by Kubernetes cluster monitoring dashboard

Introduction

The Kubernetes cluster monitoring dashboard monitors system- and network-related data from Kubernetes clusters.

The available metrics are defined in the following sections.

Network I/O pressure graph

Rate of total received/sent data on all cluster containers, in bytes and aggregated by one minute.

Metrics:
It includes two metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m]))
  • Sent bytes (negative value)
- sum (rate (container_network_transmit_bytes_total{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m]))
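
The leading minus sign in the second query makes the sent traffic negative, so both series can share one graph with received bytes above the zero line and sent bytes below it. A minimal Python sketch of that convention, using a hypothetical function name and made-up per-container rates:

```python
def network_io_pressure(received_rates, sent_rates):
    """Return (received_total, negated_sent_total) in bytes per second."""
    received_total = sum(received_rates)
    sent_total = sum(sent_rates)
    # Negating the sent total mirrors the leading "-" in the second query.
    return received_total, -sent_total


# Hypothetical per-container rates in bytes per second.
received = [1000.0, 250.0]
sent = [400.0, 100.0]
print(network_io_pressure(received, sent))
```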

Graph visual

Total usage

Graphs with different system parameters usage.

Cluster memory usage

It is composed of three graphs:

  • Memory usage, showing percentage of used memory
  • Used, showing used memory
  • Total, showing total memory

Metrics:
It includes three metrics:

  • Memory usage percentage
sum (container_memory_working_set_bytes{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) / 
sum (machine_memory_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) * 100
  • Used memory
sum (container_memory_working_set_bytes{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})
  • Total cluster memory
sum (machine_memory_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Cluster CPU usage

It is composed of three graphs:

  • CPU usage, showing percentage of used CPU cores, aggregated by one minute
  • Used, showing used CPU cores, aggregated by one minute
  • Total, showing total CPU cores

Metrics:
It includes three metrics:

  • CPU usage percentage
sum (rate (container_cpu_usage_seconds_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) / 
sum (machine_cpu_cores{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) * 100
  • Used CPUs
sum (rate (container_cpu_usage_seconds_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m]))
  • Total cluster CPUs
sum (machine_cpu_cores{kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Cluster filesystem usage

It is composed of three graphs:

  • Filesystem usage, showing percentage of used filesystem space
  • Used, showing used filesystem space
  • Total, showing total filesystem space

Metrics:
It includes three metrics:

  • Filesystem usage
sum (container_fs_usage_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) / 
sum (container_fs_limit_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) * 100
  • Used
sum (container_fs_usage_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})
  • Total
sum (container_fs_limit_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Graph visual

Pods CPU usage

CPU usage rate, classified by pod and aggregated by one minute.

Metrics:

sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Graph visual

Containers CPU usage

CPU usage rate, classified by container and aggregated by one minute.

Metrics:
It includes two metrics:

  • Containers with “k8s_”
sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)
  • Containers without “k8s_”
sum (rate (container_cpu_usage_seconds_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)
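
The two queries above split the containers by name: the =~ matcher keeps names that match the regular expression and the !~ matcher keeps the rest. PromQL regular expressions must match the whole label value. The following Python sketch approximates that behaviour with re.fullmatch; Python's regex dialect is not identical to the one Prometheus uses, and the helper names and container names are hypothetical.

```python
import re


def matches(pattern, value):
    """PromQL-style regex match: the pattern must match the whole value."""
    return re.fullmatch(pattern, value) is not None


def split_by_name(names, pattern=r"^k8s_.*"):
    """Partition container names the way name=~ and name!~ would."""
    selected = [n for n in names if matches(pattern, n)]
    rejected = [n for n in names if not matches(pattern, n)]
    return selected, rejected


# Hypothetical container names.
names = ["k8s_nginx_web-0", "k8s_POD_web-0", "systemd-journald", "my_k8s_tool"]
with_prefix, without_prefix = split_by_name(names)
print(with_prefix)
print(without_prefix)
```

Every container falls into exactly one of the two groups, so the two queries together cover all containers that have a non-empty image label.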

Graph visual

All processes CPU usage

Total CPU usage rate, aggregated by one minute.

Metrics:

sum (rate (container_cpu_usage_seconds_total{id!="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (id)

Graph visual

Pods memory usage

Memory usage, classified by pod.

Metrics:

sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (pod_name)

Graph visual

Containers memory usage

Memory usage, classified by container.

Metrics:
It includes two metrics:

  • Containers with “k8s_”
sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",container_name!="POD",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (container_name, pod_name)
  • Containers without “k8s_”
sum (container_memory_working_set_bytes{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname, name, image)

Graph visual

All processes memory usage

Total memory usage.

Metrics:

sum (container_memory_working_set_bytes{pod_name!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (pod_name)

Graph visual

Pods network I/O

Total network received/sent usage rate, classified by pod and aggregated by one minute.

Metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)
  • Sent bytes
- sum (rate (container_network_transmit_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Graph visual

Containers network I/O

Total network received/sent usage rate, classified by container and aggregated by one minute.

Metrics:

  • Received bytes, containers with “k8s_”
sum (rate (container_network_receive_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (container_name, pod_name)
  • Sent bytes, containers with “k8s_”
- sum (rate (container_network_transmit_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (container_name, pod_name)
  • Received bytes, containers without “k8s_”
sum (rate (container_network_receive_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)
  • Sent bytes, containers without “k8s_”
- sum (rate (container_network_transmit_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)

Graph visual

All processes network I/O

Total network received/sent usage rate, aggregated by one minute.

Metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{pod_name!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)
  • Sent bytes
- sum (rate (container_network_transmit_bytes_total{pod_name!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (pod_name)

Graph visual

Pods disk I/O

Total disk reads/writes rate, classified by pod and aggregated by one minute.

Metrics:

  • Read bytes, pods with device
sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (pod_name)
  • Written bytes, pods with device
sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (pod_name)
  • Read bytes, pods without device
sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (pod_name)
  • Written bytes, pods without device
sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (pod_name)

Graph visual

Containers disk I/O

Total disk reads/writes rate, classified by container and aggregated by one minute.

Metrics:

  • Read bytes, containers with device
sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (container_name, pod_name)
  • Written bytes, containers with device
sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device!=""}[1m])) by (container_name, pod_name)
  • Read bytes, containers without device
sum(rate(container_fs_reads_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (container_name, pod_name)
  • Written bytes, containers without device
sum(rate(container_fs_writes_bytes_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*", device=""}[1m])) by (container_name, pod_name)
  • Read bytes, containers without “k8s_”
sum(rate(container_fs_reads_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)
  • Written bytes, containers without “k8s_”
sum(rate(container_fs_writes_bytes_total{image!="",name!~"^k8s_.*",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname, name, image)

Graph visual

5 - Kubernetes cron and batch job monitoring dashboard

Kubernetes cron and batch job monitoring dashboard

Information provided by cron and batch job monitoring dashboard

Introduction

Kubernetes cron and batch job monitoring dashboard monitors success/fail rates for cron/batch jobs.

The available metrics are defined in the following sections.

Jobs succeeded

Successfully executed jobs.

Metrics:

kube_job_status_succeeded

Graph visual

Jobs failed

Failed job executions.

Metrics:

kube_job_status_failed

Graph visual

6 - Kubernetes nodes dashboard

Kubernetes nodes dashboard

Information provided by Kubernetes nodes dashboard

Introduction

Kubernetes nodes dashboard monitors nodes general system status.

The available metrics are defined in the following sections.

CPU usage

CPU usage percentage per node, aggregated by one minute.

Metrics:

sum (rate (container_cpu_usage_seconds_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname) / sum (machine_cpu_cores{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname) * 100
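
The query above aggregates both sides by kubernetes_io_hostname, divides the used cores by the total cores of the same host, and scales the result to a percentage. A minimal Python sketch of that shape, assuming the per-host CPU rates have already been computed; the function names and values are hypothetical.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values grouped by the value of one label."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


def node_cpu_percent(used_core_rates, machine_cores, label="kubernetes_io_hostname"):
    """Used cores divided by total cores per host, as a percentage."""
    used = sum_by(used_core_rates, label)
    total = sum_by(machine_cores, label)
    # Only hosts present on both sides produce a result.
    return {host: used[host] / total[host] * 100 for host in used if host in total}


# Hypothetical values: CPU cores in use (already a rate) and cores per machine.
used = [
    ({"kubernetes_io_hostname": "node-a"}, 1.5),
    ({"kubernetes_io_hostname": "node-b"}, 0.5),
]
cores = [
    ({"kubernetes_io_hostname": "node-a"}, 4),
    ({"kubernetes_io_hostname": "node-b"}, 2),
    ({"kubernetes_io_hostname": "node-c"}, 8),
]
print(node_cpu_percent(used, cores))
```

In this example node-c has no usage sample, so it does not appear in the result.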

Graph visual

Memory usage

Memory usage percentage.

Metrics:

sum (container_memory_working_set_bytes{id="/", kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname) / sum (machine_memory_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname) * 100

Graph visual

Disk I/O

Disk read/written data in bytes.

Metrics:
It includes two metrics:

  • Read bytes
sum (container_fs_reads_bytes_total{id="/", kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname)
  • Written bytes
sum (container_fs_writes_bytes_total{id="/", kubernetes_io_hostname=~"^.*$",agentpool=~".*"}) by (kubernetes_io_hostname)

Graph visual

Network I/O

Network received/sent data in bytes, aggregated by one minute.

Metrics:
It includes two metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname)
  • Sent bytes
- sum (rate (container_network_transmit_bytes_total{id="/",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}[1m])) by (kubernetes_io_hostname)

Graph visual

7 - Kubernetes services dashboard

Kubernetes services dashboard

Information provided by Kubernetes services dashboard

Introduction

Kubernetes services dashboard monitors system metrics related to services/pods.

The available metrics are defined in the following sections.

Service CPU usage

Service CPU usage rate, aggregated by one minute.

Metrics:

sum(rate(container_cpu_usage_seconds_total{container!="",container=~".*"}[1m])) by (container)

Graph visual

Pods CPU usage

Pods CPU usage rate, aggregated by one minute.

Metrics:
It includes two metrics:

  • CPU usage by pod and container
sum (rate (container_cpu_usage_seconds_total{container!="",container=~".*"}[1m])) by (container, pod)
  • CPU usage by container and instance
sum (rate (container_cpu_usage_seconds_total{container!="",container=~".*"}[1m])) by (instance, container)

Graph visual

Service memory usage

Service memory usage in bytes.

Metrics:

sum (container_memory_working_set_bytes{container!="",container=~".*"}) by (container)

Graph visual

Pods memory usage

Pods memory usage in bytes, and memory usage rate aggregated by one minute.

Metrics:
It includes four metrics:

  • memory usage classified by pod and container
sum (container_memory_working_set_bytes{container!="",container=~".*"}) by (container, pod)
  • memory usage classified by instance and container
sum (container_memory_working_set_bytes{container!="",container=~".*"}) by (instance, container)
  • memory usage rate, classified by pod and container, and aggregated by one minute
sum (rate (container_memory_working_set_bytes{container!="",container=~".*"}[1m])) by (container, pod)
  • memory usage rate, classified by instance and container, and aggregated by one minute
sum (rate (container_memory_working_set_bytes{container!="",container=~".*"}[1m])) by (instance, container)

Graph visual

Service network I/O

Network received/sent data rate, aggregated by one minute.

Metrics:
It includes two metrics:

  • Received bytes
sum (rate (container_network_receive_bytes_total{pod!="",pod=~".*.*"}[1m])) by (pod)
  • Sent bytes
- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~".*.*"}[1m])) by (pod)

Graph visual

Pods network I/O

Pods received/sent data rate in bytes, aggregated by one minute.

Metrics:
It includes four metrics:

  • Received bytes classified by pod
sum (rate (container_network_receive_bytes_total{pod!="",pod=~".*.*"}[1m])) by (name, pod)
  • Sent bytes classified by pod
- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~".*.*"}[1m])) by (container, pod)
  • Received bytes classified by container and instance
sum (rate (container_network_receive_bytes_total{pod!="",pod=~".*.*"}[1m])) by (instance, container, image)
  • Sent bytes classified by container and instance
- sum (rate (container_network_transmit_bytes_total{pod!="",pod=~".*.*"}[1m])) by (instance, container, image)

Graph visual

8 - Kubernetes storage monitoring dashboard

Kubernetes storage monitoring dashboard

Information provided by Kubernetes storage monitoring dashboard

Introduction

Kubernetes storage monitoring dashboard monitors storage related metrics.

The available metrics are defined in the following sections.

Used space

Kubelet volume and container filesystem usage in bytes.

Metrics:
It includes two metrics:

  • Kubelet volumes used bytes
kubelet_volume_stats_used_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*"}
  • Container filesystem usage in bytes
container_fs_usage_bytes{image!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"}

Graph visual

PVC used space %

PersistentVolumeClaim used space percent.

Metrics:

(kubelet_volume_stats_used_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"} / kubelet_volume_stats_capacity_bytes{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"})
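
The query above divides two sets of series. Prometheus pairs each used-bytes series with the capacity series that carries exactly the same labels, and series without a partner are dropped. A minimal Python sketch of that one-to-one matching, with hypothetical helper names and made-up values; it assumes every capacity is non-zero.

```python
def divide_matching(numerator, denominator):
    """Divide two instant vectors, pairing series with identical label sets."""
    by_labels = {frozenset(labels.items()): value for labels, value in denominator}
    result = []
    for labels, value in numerator:
        key = frozenset(labels.items())
        if key in by_labels:
            # Assumes a non-zero denominator for every matched series.
            result.append((labels, value / by_labels[key]))
    return result


# Hypothetical used and capacity values in GiB.
used = [
    ({"persistentvolumeclaim": "data-prometheus-0"}, 25),
    ({"persistentvolumeclaim": "redis-data-redis-0"}, 1),
    ({"persistentvolumeclaim": "grafana-grafana-0"}, 2),
]
capacity = [
    ({"persistentvolumeclaim": "data-prometheus-0"}, 100),
    ({"persistentvolumeclaim": "redis-data-redis-0"}, 8),
]
for labels, ratio in divide_matching(used, capacity):
    print(labels["persistentvolumeclaim"], ratio)
```

The result is a ratio between 0 and 1 for each claim that has both a usage and a capacity sample.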

Graph visual

Local used space %

Containers assigned space usage percentage.

Metrics:

(container_fs_usage_bytes{image!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"} / container_fs_limit_bytes{image!="",kubernetes_io_hostname=~"^.*$",agentpool=~".*"})

Graph visual

Used inodes

Kubelet PersistentVolumeClaim volumes total used inodes.

Metrics:

kubelet_volume_stats_inodes_used{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"}

Graph visual

PVC used inodes %

Kubelet PersistentVolumeClaim volumes inodes usage percentage.

Metrics:

(kubelet_volume_stats_inodes_used{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"} / kubelet_volume_stats_inodes{kubernetes_io_hostname=~"^.*$",agentpool=~".*",persistentvolumeclaim=~"(data-prometheus-0|data-prometheus-1|data-prometheus-2|datadir-mongodb-0|datadir-mongodb-1|datadir-mongodb-2|elasticsearch-data-elasticsearch-es-aura-0|elasticsearch-data-elasticsearch-es-aura-1|elasticsearch-data-elasticsearch-es-aura-2|grafana-grafana-0|redis-data-redis-0|redis-data-redis-1|redis-data-redis-2|store-thanos-store-gateway-0|store-thanos-store-gateway-1)"})

Graph visual

9 - NLP provisioning dashboard

NLP provisioning dashboard

Information provided by NLP provisioning dashboard

Panels

Expected Killed Alive

Number of expected, killed and alive provisioning processes.

The queries used to get the panel information are:

nlp_provisioning_expected_alive_processes
nlp_provisioning_killed_processes
nlp_provisioning_alive_processes

An example of this panel is shown below:

Killed by container

Time series with the killed processes by container.

The x-axis shows the time series and the y-axis shows the number of killed processes by container.

The queries used to get the panel information are:

nlp_provisioning_container_killed_count_total

An example of this panel is shown below:

Killed processes

Time series with the total killed processes.

The x-axis shows the time series and the y-axis shows the number of killed processes.

The queries used to get the panel information are:

nlp_provisioning_killed_processes

An example of this panel is shown below:

Alive processes VS Expected alive processes

Time series with the ratio between alive processes and expected alive processes.

The x-axis shows the time series and the y-axis shows the ratio between alive and expected alive processes.

The queries used to get the panel information are:

nlp_provisioning_alive_processes / nlp_provisioning_expected_alive_processes

An example of this panel is shown below:

Alive processes VS expected processes

Time series with the ratio between alive processes rate aggregated by 15 minutes and expected alive processes rate aggregated by 15 minutes.

The x-axis shows the time series and the y-axis shows the ratio between alive and expected alive processes.

The queries used to get the panel information are:

sum by (exported_job) (rate(nlp_provisioning_alive_processes{exported_job="nlp_provisioning_job"}[15m])) / 
sum by (exported_job) (rate(nlp_provisioning_expected_alive_processes{exported_job="nlp_provisioning_job"}[15m]))

An example of this panel is shown below:

10 - Prometheus stats dashboard

Prometheus stats dashboard

Information provided by Prometheus stats dashboard

Introduction

This dashboard provides detailed information about how Prometheus itself is performing.

To select which pods are shown, the dashboard provides a filter with the following fields:

  • jobs: list of active jobs.
  • instances: list of scrapeable instances.
  • interval: possible time intervals.

Once the filter values are selected, the following panels are displayed.

Panels

Pods CPU usage

Time series with CPU usage rate, aggregated by one minute. It also shows the current minimum, maximum and average CPU usage.

The x-axis shows the time series and the y-axis shows the CPU usage rate.

The queries used to get the panel information are:

sum(rate(container_cpu_usage_seconds_total{pod_name!="",pod_name=~"prometheus.*"}[1m])) by (pod_name)

An example of this panel is shown below:

Pods memory usage

Time series with memory usage. It also shows the current minimum, maximum and average memory usage.

The x-axis shows the time series and the y-axis shows the memory usage.

The queries used to get the panel information are:

sum (container_memory_working_set_bytes{pod_name!="",pod_name=~"prometheus.*"}) by (pod_name)

An example of this panel is shown below:

Pods network I/O

Time series with the network I/O average aggregated by one minute. It also shows the current minimum, maximum and average network I/O bytes.

The x-axis shows the time series and the y-axis shows the network I/O.

The queries used to get the panel information are:

sum (rate (container_network_receive_bytes_total{pod_name!="",pod_name=~"prometheus.*"}[1m])) by (pod_name)
- sum (rate (container_network_transmit_bytes_total{pod_name!="",pod_name=~"prometheus.*"}[1m])) by (pod_name)

An example of this panel is shown below:

Uptime

Percentage of uptime for the last hour.

The queries used to get the panel information are:

avg(avg_over_time(up{instance=~"(10\\.240\\.0\\.10:9093|10\\.240\\.3\\.161:9093|10\\.240\\.0\\.34:9114|10\\.240\\.0\\.253:8080|10\\.240\\.3\\.205:9090|10\\.240\\.3\\.236:9090|10\\.240\\.4\\.14:9090|10\\.240\\.4\\.156:9121|10\\.240\\.4\\.186:9121|10\\.240\\.4\\.223:9121)",job=~"kubernetes-service-endpoints"}[1h]) * 100)
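As a rough sketch of what this query computes, the Python snippet below averages the `up` samples of each instance over the window, scales each average to a percentage, and then averages across instances. The function name and the sample data are hypothetical and only stand in for what Prometheus would have scraped during the hour.

```python
from statistics import mean


def uptime_percent(samples_by_instance):
    """Mimic avg(avg_over_time(up[1h]) * 100) on already-collected samples."""
    # Inner step: per-instance average of the 0/1 `up` samples, as a percentage.
    per_instance = [
        mean(samples) * 100
        for samples in samples_by_instance.values()
        if samples  # instances without samples contribute nothing
    ]
    # Outer step: average of the per-instance percentages.
    return mean(per_instance)


# Hypothetical samples: 1 means the scrape succeeded, 0 means it failed.
samples = {
    "10.240.0.10:9093": [1, 1, 1, 1],  # up for the whole window
    "10.240.0.34:9114": [1, 0, 0, 1],  # down for half of the scrapes
}
uptime = uptime_percent(samples)
print(f"uptime over the window: {uptime}%")
```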

An example of this panel is shown below:

Currently down

Currently down instances.

The queries used to get the panel information are:

up{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)",job=~"kubernetes-service-endpoints"} < 1

An example of this panel is shown below:

Total series

Total series count.

The queries used to get the panel information are:

sum(prometheus_tsdb_head_series{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})

An example of this panel is shown below:

Memory chunks

Memory chunks being used.

The queries used to get the panel information are:

sum(prometheus_tsdb_head_chunks{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})

An example of this panel is shown below:

Quick numbers

Quick numbers section shows a series of Prometheus indicators.

Missed iterations

Number of missed iterations, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_evaluator_iterations_missed_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Skipped iterations

Number of skipped iterations, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_evaluator_iterations_skipped_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Tardy scrapes

Number of scrapes that took longer than expected, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_target_scrapes_exceeded_sample_limit_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Reload failures

Number of reload failures, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_tsdb_reloads_failures_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h]))

Skipped scrapes

Number of scrapes that did not complete, for any of several reasons, aggregated by one hour.

The queries used to get the panel information are:

sum(sum_over_time(prometheus_target_scrapes_exceeded_sample_limit_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) +
sum(sum_over_time(prometheus_target_scrapes_sample_duplicate_timestamp_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) + 
sum(sum_over_time(prometheus_target_scrapes_sample_out_of_bounds_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) + 
sum(sum_over_time(prometheus_target_scrapes_sample_out_of_order_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1h])) 

An example of this panel is shown below:

Failures and errors

Time series with the number of errors and failures of several kinds, aggregated by five minutes.

The x-axis shows the time series and the y-axis shows the count of each of the following errors and failures:

  • Dialer connection errors.
  • Evaluator iterations missed.
  • Evaluator iterations skipped.
  • Evaluation failures.
  • Azure refresh failures.
  • Consul RPC failures.
  • DNS lookup failures.
  • EC2 refresh failures.
  • GCE refresh failures.
  • Marathon refresh failures.
  • OpenStack refresh failures.
  • Triton refresh failures.
  • Scrapes exceeded sample limit.
  • Scrapes sample duplicate timestamp.
  • Scrapes sample out of bounds.
  • Scrapes sample out of order.
  • Treecache ZooKeeper failures.
  • TSDB compactions failed.
  • TSDB head series not found.
  • TSDB reloads failures.

The queries used to get the panel information are:

sum(increase(net_conntrack_dialer_conn_failed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_evaluator_iterations_missed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_evaluator_iterations_skipped_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_rule_evaluation_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_azure_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_consul_rpc_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_dns_lookup_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_ec2_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_gce_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_marathon_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_openstack_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_sd_triton_refresh_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_exceeded_sample_limit_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_sample_duplicate_timestamp_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_sample_out_of_bounds_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_target_scrapes_sample_out_of_order_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_treecache_zookeeper_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_tsdb_compactions_failed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_tsdb_head_series_not_found{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
sum(increase(prometheus_tsdb_reloads_failures_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])) > 0
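The queries above all follow one template: a counter name, an instance matcher, and a five-minute window. The Python sketch below shows how such a query string could be assembled. The helper names are hypothetical and are not part of Grafana or Prometheus; the doubled backslashes are presumably needed because the regular expression sits inside a double-quoted PromQL string.

```python
import re


def instance_matcher(instances):
    """Build an instance=~"(a|b|...)" matcher with regex-escaped addresses."""
    # re.escape turns "." into "\."; the backslash is then doubled because
    # the pattern is embedded in a double-quoted string literal.
    escaped = [re.escape(addr).replace("\\", "\\\\") for addr in instances]
    return 'instance=~"(' + "|".join(escaped) + ')"'


def failure_query(metric, instances, window="5m"):
    """Return one `sum(increase(...)) > 0` expression for a failure counter."""
    matcher = instance_matcher(instances)
    return f"sum(increase({metric}{{{matcher}}}[{window}])) > 0"


# Two of the instances and two of the counters used by this panel.
instances = ["10.240.0.5:9093", "10.240.0.94:9090"]
for metric in (
    "prometheus_rule_evaluation_failures_total",
    "prometheus_tsdb_reloads_failures_total",
):
    print(failure_query(metric, instances))
```

The trailing `> 0` in each expression drops series without failures in the window, so only error types that actually occurred are plotted.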

An example of this panel is shown below:

Upness (stacked)

Time series showing whether each service is up or down over time. The values are shown stacked.

The x-axis shows the time series and the y-axis shows the up/down state of the different services.

The queries used to get the panel information are:

up{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)",job=~"kubernetes-service-endpoints"}

An example of this panel is shown below:

Storage memory chunks

Time series with the number of memory chunks used.

The x-axis shows the time series and the y-axis shows the number of memory chunks.

The queries used to get the panel information are:

prometheus_tsdb_head_chunks{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

Series count

Time series with the number of TSDB series.

The x-axis shows the time series and the y-axis shows the number of series.

The queries used to get the panel information are:

prometheus_tsdb_head_series{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

Series created/removed

Time series with the number of TSDB series created/removed.

The x-axis shows the time series and the y-axis shows the number of series created/removed, aggregated by 5 minutes.

The queries used to get the panel information are:

sum( increase(prometheus_tsdb_head_series_created_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]) )
sum( increase(prometheus_tsdb_head_series_removed_total{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]) )

An example of this panel is shown below:

Appended samples per second

Time series with the number of samples per second appended by Prometheus.

The x-axis shows the time series and the y-axis shows the number of samples per second appended by Prometheus.

The queries used to get the panel information are:

rate(prometheus_tsdb_head_samples_appended_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1m])

An example of this panel is shown below:

Scrape Sync total

Time series with the total number of syncs that were executed on a scrape pool.

The x-axis shows the time series and the y-axis shows the total number of syncs that were executed on a scrape pool.

The queries used to get the panel information are:

sum(prometheus_target_scrape_pool_sync_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) by (scrape_job)

An example of this panel is shown below:

Target sync

Time series with the interval to sync the scrape pool.

The x-axis shows the time series and the y-axis shows the interval to sync the scrape pool.

The queries used to get the panel information are:

sum(rate(prometheus_target_sync_length_seconds_sum{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[2m])) by (scrape_job) * 1000

An example of this panel is shown below:

Scrape duration

Time series with the scrape duration in seconds.

The x-axis shows the time series and the y-axis shows the scrape duration in seconds.

The queries used to get the panel information are:

scrape_duration_seconds{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

Rejected scrapes

Time series with the rejected scrapes.

The x-axis shows the time series and the y-axis shows the rejected scrapes for several reasons:

  • Total number of scrapes that hit the sample limit and were rejected.
  • Total number of scraped samples rejected for having a duplicate timestamp.
  • Total number of scraped samples rejected for being out of bounds.
  • Total number of scraped samples rejected for being out of order.

The queries used to get the panel information are:

sum(prometheus_target_scrapes_exceeded_sample_limit_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})
sum(prometheus_target_scrapes_sample_duplicate_timestamp_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})
sum(prometheus_target_scrapes_sample_out_of_bounds_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"})
sum(prometheus_target_scrapes_sample_out_of_order_total{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) 

An example of this panel is shown below:

Average rule evaluation duration

Time series with the average duration of rule group evaluations, aggregated by five minutes.

The x-axis shows the time series and the y-axis shows the average duration of rule group evaluations.

The queries used to get the panel information are:

1000 * rate(prometheus_evaluator_duration_seconds_sum{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]) / 
rate(prometheus_evaluator_duration_seconds_count{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m])
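Dividing the rate of a `_sum` series by the rate of the matching `_count` series gives an average per observation, and the factor of 1000 converts seconds to milliseconds. Because both rates cover the same window, the window length roughly cancels out. A minimal Python sketch, using hypothetical counter snapshots and ignoring counter resets and Prometheus's extrapolation:

```python
def average_duration_ms(sum_start, sum_end, count_start, count_end):
    """Average duration in ms between two snapshots of a _sum/_count pair."""
    evaluations = count_end - count_start
    if evaluations <= 0:
        # No evaluations in the window; PromQL would yield NaN (0 / 0) here.
        return None
    total_seconds = sum_end - sum_start
    return 1000 * total_seconds / evaluations


# Hypothetical snapshots taken five minutes apart:
# 30 evaluations that took 1.5 seconds in total.
avg_ms = average_duration_ms(sum_start=10.0, sum_end=11.5,
                             count_start=100, count_end=130)
print(f"average rule evaluation duration: {avg_ms} ms")
```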

An example of this panel is shown below:

HTTP request duration

Time series with the HTTP request duration, aggregated by one minute.

The x-axis shows the time series and the y-axis shows the HTTP request duration.

The queries used to get the panel information are:

sum(rate(http_request_duration_microseconds_count{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[1m])) by (handler) > 0

An example of this panel is shown below:

Prometheus engine query duration seconds

Time series with the engine query duration in seconds.

The x-axis shows the time series and the y-axis shows the engine query duration.

The queries used to get the panel information are:

sum(prometheus_engine_query_duration_seconds_sum{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) by (slice)

An example of this panel is shown below:

Rule evaluator iterations

Time series with the number of scheduled rule group evaluations, whether executed, missed or skipped.

The x-axis shows the time series and the y-axis shows the number of scheduled rule group evaluations.

The queries used to get the panel information are:

sum(rate(prometheus_evaluator_iterations_total{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]))
sum(rate(prometheus_evaluator_iterations_missed_total{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]))
sum(rate(prometheus_evaluator_iterations_skipped_total{job=~"kubernetes-service-endpoints", instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}[5m]))

An example of this panel is shown below:

Notifications sent

Time series with the rate of sent notifications, aggregated by 5 minutes.

The x-axis shows the time series and the y-axis shows the rate of sent notifications.

The queries used to get the panel information are:

rate(prometheus_notifications_sent_total[5m])

An example of this panel is shown below:

Minutes since successful config reload

Time series with the number of minutes since the last successful config reload.

The x-axis shows the time series and the y-axis shows the number of minutes since the last successful reload.

The queries used to get the panel information are:

(time() - prometheus_config_last_reload_success_timestamp_seconds{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}) / 60
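The expression subtracts the Unix timestamp of the last successful reload from the current time and divides by 60 to convert seconds to minutes. The same arithmetic in Python, with a hypothetical function name and made-up timestamps:

```python
import time
from typing import Optional


def minutes_since_reload(last_success_ts: float, now: Optional[float] = None) -> float:
    """Minutes elapsed since a Unix timestamp (in seconds)."""
    # Default to the current wall-clock time, as time() does in the query.
    current = time.time() if now is None else now
    return (current - last_success_ts) / 60


# Hypothetical values: the last reload succeeded 900 seconds before "now".
elapsed = minutes_since_reload(1_700_000_000, now=1_700_000_900)
print(f"{elapsed} minutes since the last successful reload")
```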

An example of this panel is shown below:

Successful config reload

Time series showing whether the last configuration reload attempt was successful.

The x-axis shows the time series and the y-axis shows whether the last reload succeeded (1) or failed (0).

The queries used to get the panel information are:

prometheus_config_last_reload_successful{job=~"kubernetes-service-endpoints",instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)"}

An example of this panel is shown below:

GC rate

Time series with the GC invocation durations rate, aggregated by two minutes.

The x-axis shows the time series and the y-axis shows the GC invocation durations rate.

The queries used to get the panel information are:

sum(rate(go_gc_duration_seconds_sum{instance=~"(10\\.240\\.0\\.5:9093|10\\.240\\.0\\.76:9093|10\\.240\\.0\\.26:9114|10\\.240\\.0\\.253:8080|10\\.240\\.0\\.94:9090|10\\.240\\.1\\.199:9090|10\\.240\\.2\\.4:9090|10\\.240\\.2\\.204:9121|10\\.240\\.2\\.245:9121|10\\.240\\.3\\.10:9121)",job=~"kubernetes-service-endpoints"}[2m])) by (instance)

An example of this panel is shown below:

11 - Redis dashboard

Redis dashboard

Information provided by Redis dashboard

Introduction

Redis dashboard monitors multiple data and service-related metrics.

The available metrics are defined in the following sections.

Redis uptime

Uptime graph shows time since last restart/shutdown.

Metrics:

max(max_over_time(redis_uptime_in_seconds{kubernetes_name=~"redis-announce-0"}[$__interval]))

Graph visual

Redis clients

Clients graph shows number of connected clients.

Metrics:

redis_connected_clients{kubernetes_name=~"redis-announce-0"}

Graph visual

Redis memory usage

Memory usage graph shows percentage of used memory.

Metrics:

100 * (redis_memory_used_bytes{kubernetes_name=~"redis-announce-0"}  / redis_memory_max_bytes{kubernetes_name=~"redis-announce-0"} )

Graph visual

Redis commands executed per second

Commands executed per second graph shows the rate of commands executed per second, aggregated by one minute.

Metrics:

rate(redis_commands_processed_total{kubernetes_name=~"redis-announce-0"}[1m])

Graph visual

Redis hits/misses per second

Hits/misses per second graph shows the rate of keyspace hits and misses per second, aggregated by five minutes.

Metrics: It includes two metrics:

  • Hits metrics
irate(redis_keyspace_hits_total{kubernetes_name=~"redis-announce-0"}[5m])
  • Misses metrics
irate(redis_keyspace_misses_total{kubernetes_name=~"redis-announce-0"}[5m])

Graph visual

Redis total memory usage

Total memory usage graph shows used memory and maximum memory (free + used).

Metrics: It includes two metrics:

  • Used memory
redis_memory_used_bytes{kubernetes_name=~"redis-announce-0"} 
  • Max memory
redis_memory_max_bytes{kubernetes_name=~"redis-announce-0"} 

Graph visual

Redis network I/O

Network I/O graph shows rate of total in/out bytes, aggregated by 5 minutes.

Metrics: It includes two metrics:

  • In bytes
rate(redis_net_input_bytes_total{kubernetes_name=~"redis-announce-0"}[5m])
  • Out bytes
rate(redis_net_output_bytes_total{kubernetes_name=~"redis-announce-0"}[5m])

Graph visual

Redis total items per DB

Total items per DB graph shows total number of items, broken down by database number.

Metrics:

sum (redis_db_keys{kubernetes_name=~"redis-announce-0"}) by (db) > 0

Graph visual

Redis expiring vs not-expiring keys

Expiring vs not-expiring keys graph shows total number of expiring and not expiring keys.

Metrics: It includes two metrics:

  • Not-expiring keys.
sum (redis_db_keys{kubernetes_name=~"redis-announce-0"}) - sum (redis_db_keys_expiring{kubernetes_name=~"redis-announce-0"}) 
  • Expiring keys
sum (redis_db_keys_expiring{kubernetes_name=~"redis-announce-0"}) 

Graph visual

Redis expired/evicted

Expired/evicted graph shows the per-second rate of expired and evicted keys, aggregated by 5 minutes.

Metrics: It includes two metrics:

  • Expired keys.
sum(rate(redis_expired_keys_total{kubernetes_name=~"redis-announce-0"}[5m])) by (kubernetes_name)
  • Evicted keys
sum(rate(redis_evicted_keys_total{kubernetes_name=~"redis-announce-0"}[5m])) by (kubernetes_name)

Graph visual

Redis command calls per second

Command calls per second graph shows the execution rate of the top five commands, aggregated by one minute.

Metrics:

topk(5, irate(redis_commands_total{kubernetes_name=~"redis-announce-0"} [1m]))
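The query combines two steps: `irate` derives a per-second rate from the last two samples of each command counter in the window, and `topk(5, ...)` keeps the five highest rates. A simplified Python sketch of both steps follows; the function names and sample data are hypothetical, and counter resets are ignored.

```python
import heapq


def instant_rate(samples):
    """Per-second rate from the last two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = samples[-2:]
    return (v1 - v0) / (t1 - t0)


def top_commands(series, k=5):
    """Return the k commands with the highest instant rate, highest first."""
    rates = {
        cmd: instant_rate(samples)
        for cmd, samples in series.items()
        if len(samples) >= 2  # a rate needs at least two samples
    }
    return heapq.nlargest(k, rates.items(), key=lambda item: item[1])


# Hypothetical counters sampled 15 seconds apart: (timestamp, call count).
series = {
    "get": [(0, 100), (15, 400)],
    "set": [(0, 50), (15, 200)],
    "ping": [(0, 10), (15, 25)],
}
for cmd, calls_per_second in top_commands(series, k=2):
    print(f"{cmd}: {calls_per_second} calls/s")
```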

Graph visual