Aura Platform services

Description of all the services that compose Aura Platform

Introduction to Aura Platform services

All services that compose the Aura Platform are run as Docker containers on the Kubernetes cluster. This helps us monitor and operate them in a consistent way.

The services can be grouped into three categories:

Infrastructure services. Services that are tightly coupled to the infrastructure and could be reused by other products using this infrastructure.
System services. Management services that are part of Aura Platform and could potentially be shared by several Aura Platform deployments.
Core services. All the other platform services that provide the end-user features.

Infrastructure services

Services related to Aura infrastructure that could be reused by other products using this infrastructure.

cluster-autoscaler

This service scales the cluster nodes. See the cluster autoscaler section in the Kubernetes cluster documentation for more details.

System services

alertmanager

The alertmanager is the part of the Prometheus suite in charge of sending notifications (email, Slack) when an alert fires in Prometheus.

Notifications are sent to the notifications_email (defined in the deployment profile) using an external global SMTP server administered by the Aura Platform team.

⚠️ It is important that the different teams that operate the platform are subscribed to the alerts.

blackbox-exporter

The blackbox-exporter is a service that allows probing endpoints over HTTP, HTTPS, DNS, TCP and ICMP.

Aura Platform is deployed along with an external service/endpoint that can check the health of some services. The blackbox-exporter runs HTTPS probes that periodically send a request to this external endpoint to verify that it is still healthy. The probe metrics (result, latency, etc.) are stored in Prometheus.
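
As an illustration of what a single HTTPS probe measures, the sketch below sends one request and reports the outcome under the metric names that the blackbox-exporter exposes (probe_success, probe_duration_seconds, probe_http_status_code). It is not the exporter's implementation: the probe helper and its opener parameter are hypothetical, and the probes actually configured in Aura Platform are not shown in this document.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Send one request and summarize it like an HTTP probe result.

    `opener` is injectable so the function can be exercised without network access.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status code.
        status = err.code
    except OSError:
        # DNS failure, refused connection, TLS error, timeout, etc.
        status = 0
    duration = time.monotonic() - start
    return {
        "probe_success": 1 if 200 <= status < 300 else 0,
        "probe_duration_seconds": duration,
        "probe_http_status_code": status,
    }
```

Calling probe() with the URL of the external endpoint returns a dictionary whose probe_success value is 1 only when the endpoint answered with a 2xx status code.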

Elasticsearch

Elasticsearch is a stateful service that indexes the Aura Platform logs so they can be used for analysis.

It runs as a statefulset. Logs can use a lot of disk space, so it is important to size the volume accordingly by modifying the following section of the deployment profile:

elasticsearch:
  storage: 10

The retention time of the logs is 7 days by default, but it can be configured in the deployment profile. Remember that increasing this value means that logs will take more disk space and that queries against Elasticsearch could take longer to complete.

log_retention_time: 3
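
To get a feel for how the retention time affects disk usage, the following sketch multiplies a daily log volume by the retention time and adds some free-space headroom. The model, the 1 GB/day figure and the 30% headroom are assumptions made for illustration; measure the real daily volume of your deployment before sizing the volume.

```python
def estimated_log_storage_gb(daily_log_volume_gb, retention_days, headroom=0.3):
    """Rough disk estimate: daily volume times retention, plus free-space headroom."""
    if daily_log_volume_gb < 0 or retention_days < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    return daily_log_volume_gb * retention_days * (1 + headroom)


# Hypothetical deployment that indexes 1 GB of logs per day.
for days in (3, 7, 14):
    estimate = estimated_log_storage_gb(1.0, days)
    print(f"{days} days of retention -> about {estimate:.1f} GB")
```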

A lifecycle policy is configured to remove old indices from Elasticsearch. You can check this policy in the Kibana UI.
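
For reference, an Elasticsearch index lifecycle (ILM) policy that deletes indices after a number of days has roughly the shape built by the sketch below. This is only an illustration of the request body format: the policy actually shipped with Aura Platform is not shown in this document and may contain other phases or values, and the delete_after_policy helper is hypothetical.

```python
import json


def delete_after_policy(retention_days):
    """Build an ILM policy body that deletes indices older than retention_days."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "policy": {
            "phases": {
                "delete": {
                    # Age is measured from index creation (or from rollover, if used).
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {}},
                }
            }
        }
    }


# Policy body matching the default 7-day retention.
print(json.dumps(delete_after_policy(7), indent=2))
```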

For long-term storage of indices, a snapshot is also configured. It keeps the indices for one year in a blob in the storage account associated with the cluster. This snapshot can also be checked in the Kibana UI.

To check the disk usage, you can use either the Kubernetes Storage dashboard or the Elasticsearch dashboard in Grafana.

fluent-bit

Fluent-bit is a daemonset that runs on every node, processing the logs of that node and sending them to fluentbit-aggregator.

fluentbit-aggregator

Stateful service that aggregates all logs coming from the fluent-bit processes on every node.

It indexes the logs in Elasticsearch. It has a small 10 GB data disk that acts as a buffer to avoid losing data if something goes wrong (e.g., a network issue or a problem with the Elasticsearch cluster) while trying to index the logs.

It is safe to kill a specific fluentbit-aggregator pod (kubectl delete pod) if it gets stuck for some reason.

node-exporter

The node-exporter is an official Prometheus exporter that gathers hardware and OS metrics exposed by the virtual machines that compose Aura Platform.

Prometheus

Prometheus is a stateful service that scrapes metrics from all the exporters in the platform.

It uses a pull-based approach to metrics collection. This means that it periodically (every 30 seconds) requests metrics from every HTTP endpoint exposed by the Aura Platform services. It also gathers information about the infrastructure that underlies the Aura Platform.

Metrics are stored using a local on-disk time series database. The local storage is not meant as durable long-term storage, so metrics have a retention time of 15 days.

On average, Prometheus uses only around 1-2 bytes per sample. To plan the capacity of a Prometheus server, you can use the formula:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
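
As a worked example of the formula, the sketch below uses the 15-day retention time and the 30-second scrape interval mentioned above. The number of active time series (100,000) is an assumption chosen only for illustration, and 2 bytes per sample is the upper end of the range given above.

```python
def needed_disk_space_bytes(retention_time_seconds, ingested_samples_per_second, bytes_per_sample):
    """Apply the capacity-planning formula from the text."""
    return retention_time_seconds * ingested_samples_per_second * bytes_per_sample


RETENTION_SECONDS = 15 * 24 * 60 * 60  # 15 days of retention
SCRAPE_INTERVAL_SECONDS = 30           # one sample per series every 30 seconds
ACTIVE_SERIES = 100_000                # assumption: depends on the deployment
BYTES_PER_SAMPLE = 2                   # upper end of the 1-2 bytes range

samples_per_second = ACTIVE_SERIES / SCRAPE_INTERVAL_SECONDS
needed = needed_disk_space_bytes(RETENTION_SECONDS, samples_per_second, BYTES_PER_SAMPLE)
print(f"Needed disk space: {needed / 1e9:.2f} GB")
```

Under these assumptions the result is about 8.64 GB. The estimate grows linearly with the number of active series, so it should be recomputed with the figures of the actual deployment.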

Since metrics can take up a significant amount of space, it is important to size the volume accordingly in the following section of the deployment profile:

prometheus:
  storage_size: 10

If the local storage becomes corrupted for whatever reason, shut down Prometheus and remove the storage directory, which is mounted under /prometheus in the Prometheus container. Note that this discards the metrics stored locally.

To check the disk usage, you can use the Kubernetes Storage dashboard in Grafana, looking for the PVC named data-prometheus-0.

If your user has a role that grants access to the Kubernetes cluster, you can also check the disk usage by running df -h in the Prometheus container:

$ kubectl -n aura-system exec -it prometheus-0 -- df -h

However, this second alternative is discouraged because it is more intrusive and error-prone. In future releases, running arbitrary commands on the Aura Platform containers or nodes will be forbidden and will fire security alerts.

The recommended pattern for running Prometheus in HA mode is to run duplicated instances (same configuration, scraping the same targets independently). That means having at least two replicas running.

Thanos

thanos-sidecar container

The “thanos-sidecar” container is a sidecar to the “prometheus” container that enhances it by exposing a gRPC StoreAPI and by uploading blocks to an object storage API (like Azure Blob Storage). It is a stateless component.

thanos-querier

The “thanos-querier” container exposes a gRPC StoreAPI and an HTTP Prometheus v1 API. It gathers the data needed to evaluate the query from underlying StoreAPIs, evaluates the query and returns the result. It is a stateless component.

The configured StoreAPI sources are:

system namespace -> prometheus/thanos-sidecar
system namespace -> thanos-store-gateway
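
Because thanos-querier exposes the Prometheus v1 HTTP API, an instant query is a request to the /api/v1/query path with the PromQL expression in the query parameter. The sketch below only builds such a URL. The host name, the port and the job label value are hypothetical placeholders, since the actual service address is not given in this document.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, at_time=None):
    """Build the URL of a Prometheus v1 instant query (GET /api/v1/query)."""
    params = {"query": promql}
    if at_time is not None:
        # Optional evaluation timestamp (Unix seconds or RFC 3339).
        params["time"] = at_time
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Hypothetical address and label value, for illustration only.
print(instant_query_url("http://thanos-querier.example.invalid:9090", 'up{job="node-exporter"}'))
```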

thanos-compact

The “thanos-compact” container applies the compaction procedure of the Prometheus 2.0 storage engine to the block data stored in object storage (like Azure Blob Storage). It also generates downsampled blocks from each raw block.

It is a stateful component and must be deployed as a singleton (against an exclusive label selector).
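
To illustrate the idea behind downsampling, the sketch below groups raw samples into fixed 5-minute windows and keeps a few aggregates per window. It is a simplified illustration and not the algorithm used by Thanos; the downsample helper, the window size parameter and the sample data are assumptions.

```python
from collections import defaultdict


def downsample(samples, window_seconds=300):
    """Group (timestamp_seconds, value) samples into fixed windows and aggregate each one."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        window_start = int(timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {
        start: {"count": len(values), "sum": sum(values), "min": min(values), "max": max(values)}
        for start, values in sorted(buckets.items())
    }


# Twenty synthetic raw samples, one every 30 seconds (10 minutes of data).
raw = [(t, float(t % 7)) for t in range(0, 600, 30)]
for start, aggregates in downsample(raw).items():
    print(start, aggregates)
```

Queries over long time ranges can then read one aggregated point per window instead of every raw sample.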

thanos-store-gateway

The “thanos-store-gateway” container exposes a gRPC StoreAPI. It serves data blocks containing metrics stored in Azure Blob Storage.

It is a stateless component. However, it uses local storage for sync purposes and benefits from persistent storage, which avoids longer start-up times.

Grafana

Grafana is an open-source service for analytics and monitoring purposes. The Aura Platform uses it to display metrics from Prometheus in several dashboards.

⚠️ Changes to the official Aura Platform dashboards (those with the “baikal” label) will be overwritten on every deployment of the Aura Platform infrastructure.

⚠️ Grafana is not set up for alerting in Aura Platform. The Aura Platform interfaces for alerting are the Prometheus API and the alertmanager.

Kibana

Kibana is the service used by Aura Platform to access logs indexed in Elasticsearch.

The first time you access Kibana, you need to create an index pattern for “aura-services-*”, using @timestamp as the time field for logs.

Remember that logs are kept in Elasticsearch only for the configured retention time (log_retention_time in the deployment profile).

Last modified January 17, 2024: feat: Documentation for ACDC release #AURA-20370 [RTM] (9efef9ab)
