Generate a local branch for the NLP data repository
Discover the structure of the Aura NLP data repository and learn how to clone it to work in a local environment.
Introduction to Aura NLP data repository
The GitHub Aura NLP data repositories for use cases are defined per country as follows:
- Use cases: aura-nlpdata-[country_code]
These repositories share the same specific structure of folders and files, as shown in the section Aura NLP data repository structure.
Local NLP experts must work on a local branch of a clone of the intended global repository, following the steps in section Generate a local branch.
In the continuous process for Aura NLP optimization, Aura Global Team offers the possibility of splitting the NLP repository into different repos, for a more efficient way of working. Find the details in section Split Aura NLP repository.
The following sections show the content of each folder and file in the Aura NLP repository, for use cases.
As an example, access https://github.com/Telefonica/aura-nlpdata-es
.github
GitHub config files
config/etc
This folder includes files for the configuration of the Aura NLP model:
| config/etc/ | Description | Modifiable for use cases development? | Detailed information |
|---|---|---|---|
| bootstrap.cfg | General purpose config file. | No | NLP system configuration |
| nlp_config/nlp.json | File that contains the configuration by language and channel for each stage of the pipeline. | Yes | Configure your NLP model |
| build_catalogs.cfg.tpl | File to configure source data for dictionaries. Only required if the NLP model includes stages using dictionaries. | Yes | Guidelines for the generation of dictionaries in Aura NLP |
| api_trainings.cfg.tpl | File only used in ABACUS tool. It is a configuration template that will be filled automatically with the values defined in build_local_variables.sh. | No | ABACUS documentation |
| env.js.tpl | File only used in ABACUS tool. This template will be filled automatically. | No | ABACUS documentation |
data/
This folder includes the resources and files required for the generation of the Aura NLP pipeline and for the training of every NLP stage:
| data/ | Description | Modifiable for use cases development? | Detailed information |
|---|---|---|---|
| pipeline.json | File for building up the NLP dynamic pipeline | Yes | Build the NLP dynamic pipeline |
| Training files | Specific training files for each NLP stage | Yes | Define your data resources |
| sdict_items.json sdict_aliases.json | Dictionary files automatically generated per language and channel | Yes | Guidelines for the generation of dictionaries in Aura NLP |
delivery
Internal folder containing scripts and resources related to Continuous Integration.
⚠️ Do not modify this folder when developing new use cases.
pipeline_eval
| pipeline_eval/ | Description | Modifiable for use cases development? | Detailed information |
|---|---|---|---|
| pipeline_eval/ob/[country_code]/resources/[language]/[channel]/ | end-to-end tests for evaluation of the pipeline accuracy per country, language and channel | Yes | Define your E2E tests |
tools
Scripts for local training and testing of the Aura NLP model:
| tools/ | Description | Used for use cases development? | Detailed information |
|---|---|---|---|
| build_local_variables.sh.tpl | File for configuration purposes, specifically for the definition of CLU and other connection parameters. | yes | Set up configuration properties |
| build_local.sh | Script that automatically generates the local training environment and results files. | yes | Execute the training script |
| build_local_testset.sh | Script for the definition of specific E2E testset files for an isolated stage. Currently available for the OpenAI embeddings stage. | yes | Define stage-specific E2E testset files |
| run_local_pipeline.sh | Script used to test the system in a live mode during the pipeline launching stage. | yes | Launch and test your pipeline locally |
| build_local_catalogs.sh | Script used to generate dictionaries using local catalogs data. | yes | Guidelines for the generation of dictionaries in Aura NLP |
| run_web_training.sh | Script used to run ABACUS tool. | yes | ABACUS documentation |
| import_nlpdata_tools.sh | Auxiliary script used by other scripts. This script must not be executed by the user. | no | … |
ℹ️ All the scripts now need to connect to the centralized GitHub repository aura-nlp-tools, so your GitHub user must have read access to it. Ask the APE Team for this permission.
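Read access to a repository can be checked from a console before running the scripts. The following is a minimal sketch, not part of the Aura tooling: the function name `can_read_remote` is hypothetical, and the repository URL in the usage note assumes that aura-nlp-tools lives under the same Telefonica organization as the NLP data repositories.

```shell
# Hypothetical helper (not part of the Aura tooling).
# Returns 0 if the remote repository passed as $1 can be read with the
# current credentials, and a non-zero status otherwise.
can_read_remote() {
  # git ls-remote lists the references of a remote without cloning it;
  # it fails if the repository does not exist or access is denied.
  git ls-remote "$1" >/dev/null 2>&1
}
```

For example, `can_read_remote git@github.com:Telefonica/aura-nlp-tools.git && echo "access OK"` prints the message only when the repository is readable (the URL is an assumption, as noted above).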
catalogs
Folder required only if the Aura NLP model uses manual catalogs.
| catalogs/ | Description | Used for use cases development? | Detailed information |
|---|---|---|---|
| catalogs/[language]/[channel]/ | Files for the manual update of catalogs | yes | Guidelines for the generation or update of entities catalogs |
validation
Configuration files for different validators.
⚠️ These files must not be modified.
.gitignore
Config file listing the files to be ignored by the version control system.
CODEOWNERS
Config file indicating which user or group is the code owner responsible for merging the code.
⚠️ This file must not be modified.
config.txt
File containing the branch name of the current working release, used in different scripts.
⚠️ This file must not be modified.
requirements.txt
File containing Python module dependencies. These dependencies are installed automatically during the training process.
⚠️ This file must not be modified.
Generate a local branch
The GitHub interaction allows the generation of local branches from the master branch.
Local NLP experts must carry out the NLP customization on a local branch of a clone of the NLP GitHub repository and, afterwards, create a Pull Request (PR) to merge the local branch into the master or release branch of the corresponding Aura release.
For this purpose, follow these steps:
1. Create the working directory:

       mkdir -p ~/Telefonica
       cd ~/Telefonica

2. In order to clone the Aura NLP data project (Step 3), generate an SSH key and add it to your GitHub account.
   For this purpose, follow the instructions in the GitHub documentation or see the document SSH configuration guidelines.

3. Clone the Aura NLP data project of your country. The repository URL follows this pattern: https://github.com/Telefonica/aura-nlpdata-[country_code]-[optional:channelName].git

   Where [country_code] is the acronym of a specific country, for example: es, br, de, gb.

   To clone the repository, you can use a Git client such as GitKraken, or run the following command directly from a console:

       git clone <url_repo>

   The project is cloned into the folder where the command is executed, and the new folder has the same name as the repository:

       git clone git@github.com:Telefonica/aura-nlpdata-[country_code].git

4. Once the repository is cloned on the local machine, create a new git branch every time modifications are needed for new use cases implementation, bug fixing, etc.
   The name of the branch should start with one of the following reserved words, depending on the purpose of the modification, followed by a brief description:
   - feat/: new functionalities (for example, feat/weather_forecast_UC_#56624)
   - fix/: bugs or non-relevant modifications (for example, fix/balance_light_on_#117076)
   - release/: release synchronization

   The command to create this new branch must follow this pattern:

       cd ~/Telefonica/aura-nlpdata-gb
       git checkout -b "[feat|fix|release]/<change_description>"

   Find here detailed information regarding Semantic Commit Messages.
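The branch naming rule described in the last step can be checked before the branch is created. The following is a minimal sketch, not part of the repository tooling: the function names `is_valid_branch_name` and `create_branch` are hypothetical, and the requirement that the description is non-empty and contains no spaces is an assumption made for illustration.

```shell
# Hypothetical helper (not part of the Aura tooling).
# Succeeds if $1 starts with one of the reserved words (feat, fix, release)
# followed by a slash and a description.
is_valid_branch_name() {
  case "$1" in
    *" "*) return 1 ;;                        # assumption: no spaces allowed
    feat/?* | fix/?* | release/?*) return 0 ;; # prefix plus non-empty description
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: creates the branch only if its name is valid.
# Must be run from inside the cloned repository.
create_branch() {
  if is_valid_branch_name "$1"; then
    git checkout -b "$1"
  else
    echo "Invalid branch name: $1" >&2
    return 1
  fi
}
```

For example, `create_branch "feat/weather_forecast_UC_#56624"` creates the branch, while `create_branch "feature/weather"` is rejected because `feature` is not one of the reserved words.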
Split Aura NLP repository
As a recommendation, the OB’s aura-nlpdata repository can be split by groups of channels with similar use cases. This provides greater flexibility and independence to constructors.
At the same time, this functionality allows optimizing the training times, as only the pipelines of the repositories that undergo modifications will be retrained.
In this scenario, the format of the repository name must be: aura-nlpdata-[country_code]-[repo_name]
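The naming format can be composed programmatically, for instance when cloning several split repositories. The following is a minimal sketch under stated assumptions: the function names `nlpdata_repo_name` and `nlpdata_clone_url` are hypothetical, the SSH URL form mirrors the clone command shown in section Generate a local branch, and the `voice` suffix in the usage note is only an illustrative repo name.

```shell
# Hypothetical helpers (not part of the Aura tooling).

# Prints the repository name for country code $1.
# If a repo name $2 is given, the split-repository format is used.
nlpdata_repo_name() {
  if [ -n "${2:-}" ]; then
    printf 'aura-nlpdata-%s-%s\n' "$1" "$2"
  else
    printf 'aura-nlpdata-%s\n' "$1"
  fi
}

# Prints the SSH clone URL for country code $1 and optional repo name $2.
nlpdata_clone_url() {
  printf 'git@github.com:Telefonica/%s.git\n' "$(nlpdata_repo_name "$1" "${2:-}")"
}
```

For example, `git clone "$(nlpdata_clone_url es voice)"` would clone a split repository named aura-nlpdata-es-voice, assuming such a repository exists.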
If OBs want to organize their NLP repo in this way, they must contact the Aura Global Team.
Finally, it is possible to allocate dedicated processing capacity of the CI system, if necessary, but only after a joint analysis with the Aura Global Team.