Tags:

aura-nlp

Generate a local branch for the NLP data repository

Discover the structure of Aura NLP data repository and learn how to clone it for working purposes in local environment

Introduction to Aura NLP data repository

The GitHub Aura NLP data repositories for use cases are defined per country and follow this naming pattern:

Use cases: aura-nlpdata-[country_code]

All of them share the same folder and file structure, as shown in the section Aura NLP data repository structure.

Local NLP experts must work on a local branch of a clone of the intended global repository, following the steps in the section Generate a local branch.

As part of the continuous optimization of Aura NLP, the Aura Global Team offers the possibility of splitting the NLP repository into several repositories for a more efficient way of working. Find the details in the section Split Aura NLP repository.

The following sections show the content of each folder and file in the Aura NLP repository, for use cases.

As an example, access https://github.com/Telefonica/aura-nlpdata-es

.github

GitHub config files

config/etc

This folder includes files for the configuration of the Aura NLP model:

config/etc/	Description	Modifiable for use cases development?	Detailed information
`bootstrap.cfg`	General purpose config file.	No	NLP system configuration
`nlp_config/nlp.json`	File that contains the configuration by language and channel for each stage of the pipeline.	Yes	Configure your NLP model
`build_catalogs.cfg.tpl`	File to configure source data for dictionaries. Only required if the NLP model includes stages using dictionaries.	Yes	Guidelines for the generation of dictionaries in Aura NLP
`api_trainings.cfg.tpl`	File only used in ABACUS tool. It is a configuration template that will be filled automatically with the values defined in `build_local_variables.sh`.	No	ABACUS documentation
`env.js.tpl`	File only used in ABACUS tool. This template will be filled automatically.	No	ABACUS documentation

data/

This folder includes the resources and files required for the generation of the Aura NLP pipeline and for the training of every NLP stage:

data/	Description	Modifiable for use cases development?	Detailed information
`pipeline.json`	File for building up the NLP dynamic pipeline	Yes	Build the NLP dynamic pipeline
Training files	Specific training files for each NLP stage	Yes	Define your data resources
`sdict_items.json` `sdict_aliases.json`	Dictionary files automatically generated per language and channel	Yes	Guidelines for the generation of dictionaries in Aura NLP

delivery

Internal folder containing scripts and resources related with Continuous Integration.

⚠️ Do not modify this folder when developing new use cases.

pipeline_eval

pipeline_eval/	Description	Modifiable for use cases development?	Detailed information
pipeline_eval/ob/[country_code]/resources/[language]/[channel]/	End-to-end (E2E) tests for evaluating the pipeline accuracy per country, language and channel	Yes	Define your E2E tests

tools

Scripts for local training and testing of the Aura NLP model:

tools/	Description	Used for use cases development?	Detailed information
`build_local_variables.sh.tpl`	File for configuration purposes, specifically for the definition of CLU and other connection parameters.	Yes	Set up configuration properties
`build_local.sh`	Script that automatically generates the local training environment and results files.	Yes	Execute the training script
`build_local_testset.sh`	Script for the definition of specific E2E testset files for an isolated stage. Currently available only for the OpenAI embeddings stage.	Yes	Define stage-specific E2E testset files
`run_local_pipeline.sh`	Script used to test the system in live mode during the pipeline launching stage.	Yes	Launch and test your pipeline locally
`build_local_catalogs.sh`	Script used to generate dictionaries using local catalogs data.	Yes	Guidelines for the generation of dictionaries in Aura NLP
`run_web_training.sh`	Script used to run the ABACUS tool.	Yes	ABACUS documentation
`import_nlpdata_tools.sh`	Auxiliary script used by other scripts. This script must not be executed by the user.	No	…

ℹ️ All the scripts now need to connect to the centralized GitHub repository aura-nlp-tools, so your GitHub user must have read access to it. Ask the APE Team for this permission.
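
Read access to aura-nlp-tools can be verified before running any of the scripts. The following is only a minimal sketch: the helper name `can_read_repo` is hypothetical, and the repository URL in the usage comment assumes that aura-nlp-tools lives under the Telefonica organization, which this page does not state.

```shell
# Hypothetical helper: succeeds only if the remote can be listed with
# the credentials configured on this machine.
can_read_repo() {
  git ls-remote --heads "$1" >/dev/null 2>&1
}

# Usage (the URL below is an assumption, adjust it to the real location):
#   if can_read_repo "git@github.com:Telefonica/aura-nlp-tools.git"; then
#     echo "read access OK"
#   else
#     echo "no read access: ask the APE Team for permission"
#   fi
```

`git ls-remote` only lists references and does not download anything, so the check is quick and leaves the working directory untouched.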

catalogs

Folder required only if the Aura NLP model uses manual catalogs.

catalogs/	Description	Used for use cases development?	Detailed information
catalogs/[language]/[channel]/	Files for the manual update of catalogs	Yes	Guidelines for the generation or update of entities catalogs

validation

Configuration files for different validators.

⚠️ These files must not be modified.

.gitignore

Config file containing files to be ignored by the version control system.

CODEOWNERS

Config file indicating which user or group is the code owner responsible for merging the code.

⚠️ This file must not be modified.

config.txt

File containing the branch name of the current working release, used in different scripts.

⚠️ This file must not be modified.

requirements.txt

File containing Python module dependencies. These dependencies are installed automatically during the training process.

⚠️ This file must not be modified.
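
As a quick sanity check after cloning, the layout described above can be verified with a small script. This is a sketch under assumptions: the function name `check_nlpdata_layout` is hypothetical, the list covers only a subset of the folders and files named in this section, and the optional `catalogs` folder is deliberately left out.

```shell
# Hypothetical helper: report expected entries that are missing from a clone.
# Returns 0 when everything in the lists below is present, 1 otherwise.
check_nlpdata_layout() {
  repo_dir=$1
  missing=0
  # Folders described in this section (catalogs is optional, so not checked).
  for d in .github config/etc data delivery pipeline_eval tools validation; do
    if [ ! -d "$repo_dir/$d" ]; then
      echo "missing directory: $d"
      missing=1
    fi
  done
  # A subset of the files described in this section.
  for f in CODEOWNERS config.txt requirements.txt \
           config/etc/bootstrap.cfg config/etc/nlp_config/nlp.json \
           data/pipeline.json; do
    if [ ! -f "$repo_dir/$f" ]; then
      echo "missing file: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_nlpdata_layout ~/Telefonica/aura-nlpdata-es
```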

Generate a local branch

Local branches are generated from the master branch of the GitHub repository.

Local NLP experts must carry out the NLP customization on a local branch of a clone of the NLP GitHub repository and, afterwards, create a Pull Request (PR) to merge that branch into the master or release branch of the corresponding Aura release.

For this purpose, follow these steps:

Create the working directory:
```
mkdir -p ~/Telefonica
cd ~/Telefonica 
```
In order to clone the Aura NLP data project (Step 3), generate an SSH key and add it to your GitHub account.
For this purpose, follow the instructions in the GitHub documentation or see the document SSH configuration guidelines.
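
For illustration, the key generation can be sketched as follows. The helper name `ensure_ssh_key`, the key path and the e-mail address are assumptions made for this example; the linked guidelines remain the reference procedure. The helper only calls `ssh-keygen` when no public key exists at the given path.

```shell
# Hypothetical helper: create an Ed25519 key pair only if none exists yet.
ensure_ssh_key() {
  key_path=$1
  email=$2
  if [ -f "$key_path.pub" ]; then
    echo "Public key already present: $key_path.pub"
    return 0
  fi
  # Interactive: ssh-keygen asks for an optional passphrase.
  ssh-keygen -t ed25519 -C "$email" -f "$key_path"
}

# Usage (path and e-mail are examples):
#   ensure_ssh_key "$HOME/.ssh/id_ed25519" "you@example.com"
#   cat "$HOME/.ssh/id_ed25519.pub"   # paste this into your GitHub account
#   ssh -T git@github.com             # checks that GitHub accepts the key
```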
Clone the Aura NLP data project of your country. The repository URL follows this pattern: https://github.com/Telefonica/aura-nlpdata-[country_code]-[optional:channelName].git

Where [country_code] is the acronym of a specific country, for example: es, br, de, gb
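
The pattern above can be turned into a small helper that builds the SSH clone URL. This is a sketch: the function name `nlpdata_repo_url` is hypothetical, and the `tv` suffix used in the usage comment is an invented example of an optional channel name.

```shell
# Hypothetical helper: build the SSH clone URL from a country code and an
# optional channel/repository suffix.
nlpdata_repo_url() {
  country_code=$1
  suffix=${2:-}
  name="aura-nlpdata-$country_code"
  if [ -n "$suffix" ]; then
    name="$name-$suffix"
  fi
  echo "git@github.com:Telefonica/$name.git"
}

# Usage (requires network access and read permission on the repository):
#   cd ~/Telefonica
#   git clone "$(nlpdata_repo_url es)"
#   git clone "$(nlpdata_repo_url gb tv)"   # "tv" is an invented suffix
```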

To clone the repository, use a Git client such as GitKraken or run the following command directly from a console:
git clone <url_repo>

The project is cloned into the folder where the command is run, inside a new folder with the same name as the repository. For example, over SSH:
git clone git@github.com:Telefonica/aura-nlpdata-[country_code].git
Once the repository is cloned on the local machine, create a new Git branch every time modifications are needed (new use case implementation, bug fixing, etc.).

The name of the branch should start with one of the following reserved prefixes, depending on the purpose of the modification, followed by a brief description:
- feat/: new functionalities (for example, feat/weather_forecast_UC_#56624)
- fix/: bugs or non-relevant modifications (for example: fix/balance_light_on_#117076)
- release/: release synchronization
The command to create this new branch must follow this pattern:
```
cd ~/Telefonica/aura-nlpdata-gb
git checkout -b "[feat|fix|release]/<change_description>"
```
Find here detailed information regarding Semantic Commit Messages.
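
The branch naming rule above can be enforced before the branch is created. The sketch below is illustrative only: `is_valid_branch_name` and `create_work_branch` are hypothetical helpers, not part of the Aura tooling.

```shell
# Hypothetical helper: accept only names starting with feat/, fix/ or
# release/ followed by a non-empty description.
is_valid_branch_name() {
  case "$1" in
    feat/?*|fix/?*|release/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: create the branch only when the name is valid.
create_work_branch() {
  if ! is_valid_branch_name "$1"; then
    echo "invalid branch name: $1" >&2
    return 1
  fi
  git checkout -b "$1"
}

# Usage, from inside the cloned repository:
#   create_work_branch "feat/weather_forecast_UC_#56624"
#   git push -u origin "feat/weather_forecast_UC_#56624"   # then open the PR
```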

Split Aura NLP repository

As a recommendation, the OB’s aura-nlpdata repository can be split by groups of channels with similar use cases. This provides greater flexibility and independence to constructors.

At the same time, this functionality allows optimizing the training times, as only the pipelines of the repositories that undergo modifications will be retrained.

In this scenario, the format of the repository name must be: aura-nlpdata-[country_code]-[repo_name]
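
To illustrate the naming format, the following sketch splits a repository name into its country code and optional repository name. The function name `parse_nlpdata_repo` is hypothetical, and the two-lowercase-letter country code is an assumption based on the examples on this page (es, br, de, gb).

```shell
# Hypothetical helper: parse aura-nlpdata-[country_code][-repo_name].
# Prints both parts, or returns 1 if the name does not match the format.
parse_nlpdata_repo() {
  repo=$1
  rest=${repo#aura-nlpdata-}
  # The prefix must be present and something must follow it.
  if [ "$rest" = "$repo" ] || [ -z "$rest" ]; then
    return 1
  fi
  # Everything before the first dash is the country code.
  country=${rest%%-*}
  case "$country" in
    [a-z][a-z]) ;;        # assumption: two lowercase letters
    *) return 1 ;;
  esac
  # Whatever remains after "<country>-" is the repository name (may be empty).
  group=${rest#"$country"}
  group=${group#-}
  echo "country_code=$country repo_name=$group"
}

# Usage ("tv" is an invented repository name):
#   parse_nlpdata_repo aura-nlpdata-es-tv
```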

If OBs want to organize their NLP repo in this way, they must contact the Aura Global Team.

Finally, it is possible to allocate dedicated processing capacity of the CI system, if necessary, but only after a joint analysis with the Aura Global Team.

Last modified November 11, 2025: feat: Clean up of Living Apps related stuff #AURA-30761 [RTM] (c97ca748)