NLP system configuration

Internal configuration of Aura NLP system: operational configuration and configuration of NLP stages

Introduction

The NLP system configuration serves two purposes, each supported by its own configuration file:

  • NLP operational configuration:

    • Internal configuration for NLP system
    • Not modifiable by OBs
    • Based on the file bootstrap.cfg
    • For descriptive purposes, it is included below
  • NLP stages configuration:

    • Configuration of each stage composing the NLP pipeline
    • Configurable when developing an NLP model for a specific use case
    • Based on the file nlp.json
    • The practical process for editing the nlp.json pipeline when developing a use case is described in Configure your NLP model

NLP operational configuration: bootstrap.cfg

The bootstrap.cfg file contains operational config sections for Aura NLP (ports, URIs, usernames, passwords, etc.).

This file can be found in the path:
aura-nlpdata-[country_code]/config/etc/bootstrap.cfg

⚠️ When developing a use case, NLP engineers or linguists should not modify this file. If any modification is required, it must be approved by the Aura Platform Team.

The file follows the general structure shown below:

[working_directory]
stages_folder = ./tmp/

[logging]
handlers = { . . . }
loggers = { . . . }
root = { . . . }

[country-langs]
country_mapper = { . . . }

[channels]
channel_list = [
        {
            'prefix': 'fb',
            'name': 'whatsapp',
            'id': '269d6-f052-4d2e-8f66-f59a9f31eff9'
        },
. . . ]

[platform]
platform = 'ES'

[azure_models]
container_url = ${AZURE_NLP_MODELS_URL}

In addition, this file must include further sections for the specific stages or databases in use. The fields of each section are described below.

Working directory

[working_directory]
stages_folder = ./tmp/

The main fields are explained below:

  • stages_folder: Main directory for the different stages.
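
As an illustration of how a section such as [working_directory] can be read, the following is a minimal sketch using Python's standard configparser module. It is not the Aura NLP loader: the function name read_stages_folder and the inline SAMPLE text are assumptions made for the example.

```python
import configparser
from pathlib import Path

# Inline copy of the section shown above, so the sketch runs on its own.
SAMPLE = """
[working_directory]
stages_folder = ./tmp/
"""


def read_stages_folder(cfg_text: str) -> Path:
    """Return the stages_folder value as a Path (hypothetical helper)."""
    # interpolation=None keeps placeholders such as ${VAR} untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return Path(parser["working_directory"]["stages_folder"])


stages_folder = read_stages_folder(SAMPLE)
print(stages_folder)
```

For a file on disk, parser.read(path) can replace read_string.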

Logging

[logging]
handlers = {
      'hdl1': {
         'class':'logging.StreamHandler',
         'formatter':'console',
         'level':'INFO'
      }
   }
loggers = {
      'nlp': {
         'level': 'INFO',
         'handlers': [
            'hdl1'
         ],
         'filters': []
      }
    }
root = {
      'level':'INFO',
      'handlers': [
         'hdl1'
      ]
    }

The main fields are explained below. For more details, see the general Python logging documentation.

  • handlers: Dictionary of logging handlers. Each key is a handler name, and each value consists of the following parameters:

    • class: It is configured with Python logging handlers (See Python documentation).
    • formatter: It configures the format of logs. It must be filled with the labels json, string, console or simple.
    • level: Level of the logging event. It must be filled with the labels INFO, ERROR, WARN or DEBUG.
  • loggers: The corresponding value is a Python dictionary in which each key is a logger name and each value is a dictionary describing how to configure the corresponding Logger instance:

    • level (optional parameter): Level of the logger.
    • handlers (optional parameter): List of IDs of the handlers for this logger.
    • filters (optional parameter): List of IDs of the filters for this logger.
  • root: Configuration for the root logger.

    • level (optional parameter): Level of the logger.
    • handlers (optional parameter): List of IDs of the handlers for this logger.
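
The handlers, loggers and root fields have the same shape as the dictionary accepted by Python's logging.config.dictConfig. The sketch below shows one way such values could be turned into a working logging setup. It assumes the values are Python literals, and it supplies its own formatter definition (ASSUMED_FORMATTERS), because the real definitions of the json, string, console and simple formatters are not shown in this page.

```python
import ast
import configparser
import logging
import logging.config

SAMPLE = """
[logging]
handlers = {
      'hdl1': {
         'class': 'logging.StreamHandler',
         'formatter': 'console',
         'level': 'INFO'
      }
   }
loggers = {
      'nlp': {
         'level': 'INFO',
         'handlers': ['hdl1'],
         'filters': []
      }
   }
root = {
      'level': 'INFO',
      'handlers': ['hdl1']
   }
"""

# Assumption: a stand-in for the 'console' formatter, defined only for this sketch.
ASSUMED_FORMATTERS = {"console": {"format": "%(levelname)s %(name)s %(message)s"}}


def build_logging_dict(cfg_text: str) -> dict:
    """Build a dictConfig-compatible dictionary from the [logging] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    section = parser["logging"]
    return {
        "version": 1,  # required by dictConfig
        "disable_existing_loggers": False,
        "formatters": ASSUMED_FORMATTERS,
        "handlers": ast.literal_eval(section["handlers"]),
        "loggers": ast.literal_eval(section["loggers"]),
        "root": ast.literal_eval(section["root"]),
    }


config = build_logging_dict(SAMPLE)
logging.config.dictConfig(config)
logging.getLogger("nlp").info("logging configured")
```

Because hdl1 is attached to both the nlp logger and the root logger, a message logged through nlp is emitted twice in this setup unless propagation is disabled.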

Country-langs / channels / platform

[country-langs]
country_mapper = { 
  'es-es': {
        'country_name': 'Spain',
        'language_name': 'Spanish',
        'alpha2': 'es',
        'alpha3': 'esp',
        'culture': 'es-es'
    }
  }

[channels]
channel_list = [
        {
            'prefix': 'mp',
            'name': 'movistar-plus',
            'id': '60f0ffda-e58a-4a96-aad9-d42be70b7b42'
        },

  ]

[platform]
platform = 'ES'

The main fields are explained below:

  • country_mapper: Mapper with a list of fields that specifies the allowed languages based on the ISO-639 code.

  • channel_list: List of available channels. This field must contain three parameters for each channel. This information is already configured for every OB.

    • prefix: Prefix of the channel.
    • name: Name of the channel.
    • id: ID of the channel.
  • platform: Allowed platform.
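
The values in these three sections are written as Python literals, so they can be parsed with ast.literal_eval once the file has been read. The following sketch is an illustration under that assumption. The helpers load_literals and find_channel are hypothetical names, not part of Aura NLP.

```python
import ast
import configparser

SAMPLE = """
[country-langs]
country_mapper = {
    'es-es': {
        'country_name': 'Spain',
        'language_name': 'Spanish',
        'alpha2': 'es',
        'alpha3': 'esp',
        'culture': 'es-es'
    }
  }

[channels]
channel_list = [
    {
        'prefix': 'mp',
        'name': 'movistar-plus',
        'id': '60f0ffda-e58a-4a96-aad9-d42be70b7b42'
    },
  ]

[platform]
platform = 'ES'
"""


def load_literals(cfg_text: str) -> dict:
    """Parse the literal-valued fields into Python objects."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        "country_mapper": ast.literal_eval(parser["country-langs"]["country_mapper"]),
        "channel_list": ast.literal_eval(parser["channels"]["channel_list"]),
        "platform": ast.literal_eval(parser["platform"]["platform"]),
    }


def find_channel(channels: list, prefix: str):
    """Return the first channel with the given prefix, or None."""
    return next((c for c in channels if c["prefix"] == prefix), None)


settings = load_literals(SAMPLE)
channel = find_channel(settings["channel_list"], "mp")
print(channel["name"], settings["platform"], sorted(settings["country_mapper"]))
```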

CLU

The CLU stage requires a specific operational configuration:

[CLU]
base_url = https://${RESOURCE_NAME_CLU}.cognitiveservices.azure.com
base_url_api = https://${RESOURCE_NAME_CLU}.cognitiveservices.azure.com
api_version = 2023-04-01
http_retry_codes = {429, 500}
http_max_attempts = 10
http_sleep_time = 5
http_time_out = 60
http_time_out_recognizer = 60
http_retry_codes_recognizer = {429, 500}
http_max_attempts_recognizer = 5
http_sleep_time_recognizer = 0.5
http_raise_when_retry_limit_exceeded_recognizer = True

The main fields are explained below:

  • base_url: Base URL for CLU service.
  • base_url_api: Base URL for CLU API service.
  • api_version: CLU API version.
  • http_retry_codes: Set of response status codes for which the request is retried (for example, 429 when the request limit has been exceeded).
  • http_max_attempts: Maximum number of HTTP request attempts.
  • http_sleep_time: Time to wait between HTTP requests.
  • http_time_out: Time in seconds for raising a timeout exception when HTTP request does not return a response for training API requests.
  • http_time_out_recognizer: Time in seconds for raising a timeout exception when HTTP request does not return a response for CLU recognizer.
  • http_retry_codes_recognizer: Set of response status codes that will retry CLU recognizer request.
  • http_max_attempts_recognizer: Maximum number of attempts for a CLU recognizer request when a timeout or connection error occurs, or when the response status code is one of those defined in http_retry_codes_recognizer.
  • http_sleep_time_recognizer: Time to wait between HTTP CLU recognizer requests.
  • http_raise_when_retry_limit_exceeded_recognizer: Boolean (true/false) value to inform if an exception must be re-raised when it happens and the maximum number of retries is exceeded.
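
The interaction between the retry fields can be sketched as a small loop. This is only one plausible reading of the descriptions above, not the actual CLU client: call_with_retries and its parameters are hypothetical, and send stands for any callable that returns an HTTP status code or raises a timeout or connection error.

```python
import time


def call_with_retries(send, *, retry_codes, max_attempts, sleep_time,
                      raise_when_exceeded, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached (illustrative)."""
    last_error = None
    status = None
    for attempt in range(1, max_attempts + 1):
        try:
            status = send()
            last_error = None
        except (TimeoutError, ConnectionError) as exc:
            status, last_error = None, exc
        # A response whose code is not in retry_codes ends the loop.
        if last_error is None and status not in retry_codes:
            return status
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts:
            sleep(sleep_time)
    # Retry limit reached: re-raise the last exception only if requested.
    if raise_when_exceeded and last_error is not None:
        raise last_error
    return status


# Simulated responses: two retryable codes, then success.
responses = iter([429, 500, 200])
waits = []
status = call_with_retries(
    lambda: next(responses),
    retry_codes={429, 500},
    max_attempts=5,
    sleep_time=0.5,
    raise_when_exceeded=True,
    sleep=waits.append,  # record the waits instead of sleeping
)
print(status, waits)  # 200 [0.5, 0.5]
```

Passing sleep as a parameter keeps the sketch testable without real delays.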

OpenAI Embeddings

The OpenAI Embeddings stage configuration allows a different database for each combination of language and channel.

Some of these values, such as base_url_api, are configured by the aurak8s installer.
The stage must also be enabled in the aurak8s installer, following the instructions in the Enable OpenAI deployment section.

[openai_embeddings_recognizer]
azure_token_base_url = https://login.microsoftonline.com
management_url = https://management.azure.com
management_api_version = 2023-05-01
http_retry_codes = {429,500}
http_max_attempts = 10
http_sleep_time = 5
http_time_out = 30
base_url_api = https://test.openai.azure.com/openai
base_api_version = 2023-05-15
http_time_out_recognizer = 20
http_retry_codes_recognizer = {429,500}
http_max_attempts_recognizer = 10
http_sleep_time_recognizer = 10
http_raise_when_retry_limit_exceeded_recognizer = True
sku_name = Standard
sku_capacity = 120

[qdrant:instance]
url = http://hostname:6333
api_key = api-test
shard_number = 1
replication_factor = 1
chunk_size = 30
exponential_sleep = True
max_exponential_sleep_time = 120

The associated fields are defined below:

  • azure_token_base_url: Base URL to get oauth token.
  • management_url: Azure URL where the embedding OpenAI model will be deployed.
  • management_api_version: Version of the embedding OpenAI model in Azure.
  • http_retry_codes: Set of response status codes for which the request is retried.
  • http_max_attempts: Maximum number of HTTP request attempts.
  • http_sleep_time: Time to wait between attempts when an HTTP request is retried.
  • http_time_out: Time in seconds for raising a timeout exception when HTTP request does not return a response for OpenAI embeddings training API requests.
  • base_url_api: Base URL for OpenAI embeddings service.
  • base_api_version: OpenAI embeddings version.
  • http_time_out_recognizer: Time in seconds for raising a timeout exception when HTTP request does not return a response for OpenAI embeddings recognizer.
  • http_retry_codes_recognizer: Set of response status codes that will retry OpenAI embeddings recognizer request.
  • http_max_attempts_recognizer: Maximum number of attempts for an OpenAI embeddings recognizer request when a timeout or connection error occurs, or when the response status code is one of those defined in http_retry_codes_recognizer.
  • http_sleep_time_recognizer: Time to wait between HTTP OpenAI embeddings recognizer requests.
  • http_raise_when_retry_limit_exceeded_recognizer: Boolean (true/false) value to inform if an exception must be re-raised when it happens and the maximum number of retries is exceeded.
  • sku_name: Name of the resource model representing the SKU.
  • sku_capacity: Capacity of Tokens per Minute Rate Limit (Thousands).
  • url: URL for Qdrant service.
  • api_key: Key needed to connect with Qdrant service.
  • shard_number: Number of shards for Qdrant service.
  • replication_factor: Replication factor for Qdrant service.
  • chunk_size: Number of embeddings to be sent in each request to the Qdrant service.
  • exponential_sleep: Boolean (true/false) value to inform if the exponential sleep is enabled. By default, it is False.
  • max_exponential_sleep_time: Maximum time in seconds for the exponential sleep. By default, it is 120 seconds.
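
The page does not state the exact formula behind exponential_sleep. A common scheme, shown here purely as an assumption, doubles the wait after each attempt and caps it at max_exponential_sleep_time. The function backoff_delays and its base_sleep argument are hypothetical.

```python
def backoff_delays(base_sleep, attempts, exponential_sleep=False,
                   max_exponential_sleep_time=120):
    """Return the wait before each retry (illustrative doubling scheme)."""
    delays = []
    for attempt in range(attempts):
        if exponential_sleep:
            # Assumed formula: double the wait each attempt, capped at the maximum.
            delay = min(base_sleep * 2 ** attempt, max_exponential_sleep_time)
        else:
            # Exponential sleep disabled: constant wait.
            delay = base_sleep
        delays.append(delay)
    return delays


print(backoff_delays(10, 6, exponential_sleep=True))  # [10, 20, 40, 80, 120, 120]
print(backoff_delays(10, 3))  # [10, 10, 10]
```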
base_url_api = https://internal.com/
http_retry_codes = {429, 500}
http_max_attempts = 10
http_sleep_time = 5
http_time_out = 30

Where:

  • http_retry_codes: Set of response status codes for which the request is retried.
  • http_max_attempts: Maximum number of HTTP request attempts.
  • http_sleep_time: Time to wait between attempts when an HTTP request is retried.

Azure models

The azure_models configuration is detailed below:

[azure_models]
container_url = ${AZURE_NLP_MODELS_URL}

Where:

  • container_url: URL for the Azure NLP models container.
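
The ${AZURE_NLP_MODELS_URL} placeholder suggests that the value is filled in from the environment. The page does not say which mechanism performs the substitution, so the following is only a sketch of one option, using string.Template from the Python standard library. The helper resolve_container_url and the example URL are assumptions.

```python
import configparser
from string import Template

SAMPLE = """
[azure_models]
container_url = ${AZURE_NLP_MODELS_URL}
"""


def resolve_container_url(cfg_text: str, env: dict) -> str:
    """Substitute ${NAME} placeholders in container_url from a mapping."""
    # interpolation=None prevents configparser from touching the placeholder.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    raw = parser["azure_models"]["container_url"]
    # substitute() raises KeyError if a placeholder has no value in env.
    return Template(raw).substitute(env)


# Hypothetical value, for illustration only.
example_env = {"AZURE_NLP_MODELS_URL": "https://example.invalid/models"}
print(resolve_container_url(SAMPLE, example_env))
```

In a real process, os.environ can be passed as env. Failing on a missing variable is a design choice that surfaces deployment mistakes early.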

Embeddings Domain Classifier

The Embeddings Domain Classifier stage configuration allows a different database for each combination of language and channel.

Some of these values, such as base_url_api, are configured by the aurak8s installer.

The stage must also be enabled in the aurak8s installer, following the instructions in the Enable OpenAI deployment section.

[openai_embeddings_domain_classifier]
azure_token_base_url = https://login.microsoftonline.com
management_url = https://management.azure.com
management_api_version = 2023-05-01
http_retry_codes = {429,500}
http_max_attempts = 10
http_sleep_time = 5
http_time_out = 30
base_url_api = https://test.openai.azure.com/openai
base_api_version = 2023-05-15
http_time_out_domain_classifier = 20
http_retry_codes_domain_classifier = {429,500}
http_max_attempts_domain_classifier = 10
http_sleep_time_domain_classifier = 10
http_raise_when_retry_limit_exceeded_domain_classifier = True
sku_name = Standard
sku_capacity = 120


[qdrant:instance]
url = http://hostname:6333
api_key = api-test
shard_number = 1
replication_factor = 1
chunk_size = 30
exponential_sleep = True
max_exponential_sleep_time = 120

The associated fields are defined below:

  • azure_token_base_url: Base URL to get oauth token.
  • management_url: Azure URL where the embedding OpenAI model will be deployed.
  • management_api_version: Version of the embedding OpenAI model in Azure.
  • http_retry_codes: Set of response status codes for which the request is retried.
  • http_max_attempts: Maximum number of HTTP request attempts.
  • http_sleep_time: Time to wait between attempts when an HTTP request is retried.
  • http_time_out: Time in seconds for raising a timeout exception when HTTP request does not return a response for OpenAI embeddings training API requests.
  • base_url_api: Base URL for OpenAI embeddings service.
  • base_api_version: OpenAI embeddings version.
  • http_time_out_domain_classifier: Time in seconds for raising a timeout exception when HTTP request does not return a response for embeddings domain classifier.
  • http_retry_codes_domain_classifier: Set of response status codes that will retry embeddings domain classifier request.
  • http_max_attempts_domain_classifier: Maximum number of attempts for an embeddings domain classifier request when a timeout or connection error occurs, or when the response status code is one of those defined in http_retry_codes_domain_classifier.
  • http_sleep_time_domain_classifier: Time to wait between HTTP embeddings domain classifier requests.
  • http_raise_when_retry_limit_exceeded_domain_classifier: Boolean (true/false) value to inform if an exception must be re-raised when it happens and the maximum number of retries is exceeded.
  • sku_name: Name of the resource model representing the SKU.
  • sku_capacity: Capacity of Tokens per Minute Rate Limit (Thousands).
  • url: URL for Qdrant service.
  • api_key: Key needed to connect with Qdrant service.
  • shard_number: Number of shards for Qdrant service.
  • replication_factor: Replication factor for Qdrant service.
  • chunk_size: Number of embeddings to be sent in each request to the Qdrant service.
  • exponential_sleep: Boolean (true/false) value to inform if the exponential sleep is enabled. By default, it is False.
  • max_exponential_sleep_time: Maximum time in seconds for the exponential sleep. By default, it is 120 seconds.
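
The effect of chunk_size can be illustrated with a simple batching helper: embeddings are split into groups of at most chunk_size items, and each group would be sent in one request. The sketch below does not call Qdrant. The function chunked and the dummy embeddings are assumptions made for the example.

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# 70 dummy one-dimensional embeddings, batched with chunk_size = 30.
embeddings = [[float(i)] for i in range(70)]
batches = list(chunked(embeddings, 30))
print([len(batch) for batch in batches])  # [30, 30, 10]
```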
Last modified May 18, 2026: Remove KGB (52b04d91)