Aura – rag-documents

Docs:

Check Hugging Face embedding models downloading

Guidelines to check if the Hugging Face models used in ATRIA are downloaded during the generate-db process

Introduction

The free embedding models currently used in ATRIA are paraphrase-multilingual-MiniLM-L12-v2 and multi-qa-distilbert-cos-v1, both from Hugging Face. They back the following embeddings, available by default in ATRIA: Local Sentence Transformer and Distilbert-based Local Sentence Transformer.

During the generate-db process, these models are loaded into memory, and the process may fail if there is a connection problem with Hugging Face. In that case, the only solution is to wait until the service is up and running again.

This document explains how to check whether the embedding models can be downloaded, so that this failure can be detected.

Prerequisites

Install huggingface-cli
```
pkgx install huggingface-cli
```

Check if the Hugging Face models are downloaded properly

To check whether the service is up, run the following command:

```
huggingface-cli download sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
```

If the download starts, the service is up, and you can restart the generate-db process.
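
The same check can be scripted for both models so that it runs before generate-db is launched. The following is a minimal Python sketch under stated assumptions: instead of downloading the full model, it sends an HTTP HEAD request for each model's `config.json` through the public `resolve/main` URL pattern of the Hugging Face Hub, and it assumes that multi-qa-distilbert-cos-v1 is published under the same `sentence-transformers` organization as the model in the command above. The function names are illustrative and are not part of ATRIA.

```python
import urllib.request

# Repo ids to probe. The "sentence-transformers/" prefix of the second model is
# an assumption; the guide only shows it for the first one.
MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "sentence-transformers/multi-qa-distilbert-cos-v1",
]


def config_url(repo_id):
    """Return the Hub URL of the model's config.json (a small file)."""
    return f"https://huggingface.co/{repo_id}/resolve/main/config.json"


def model_is_reachable(repo_id, opener=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request for the model's config.json succeeds."""
    request = urllib.request.Request(config_url(repo_id), method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False


def check_models(repo_ids=MODELS, opener=urllib.request.urlopen):
    """Map each repo id to True (reachable) or False (not reachable)."""
    return {repo_id: model_is_reachable(repo_id, opener) for repo_id in repo_ids}
```

Calling `check_models()` returns one entry per model. If every value is `True`, the Hub is answering for both models and the generate-db process can be restarted. The `opener` parameter exists only so that the check can be exercised without network access.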

Docs:

Import documents into ATRIA

Guidelines for importing documents and new data into ATRIA environment

Introduction

As described in General RAG: functional overview, when using RAG capability, different databases are used for lexical and semantic search.

The documents that feed these knowledge bases must be uploaded into the environment to be used in the RAG chain and updated when required. In this framework, two processes must be considered:

a. Curate data (recommended): First, curate the data to be uploaded, to optimize the recognition process.
b. Import documents: Once the data is curated, upload the documents into the system. Apart from the general method, a hot swapping process can also be executed for this purpose.

a. Data curation

Data curation is the process of organizing, managing, cleaning up and maintaining data to ensure it stays relevant and valuable. Good practices in this task lead to efficient recognition by the AI model.

For this purpose, we recommend following these tips, based on research and internal analysis:

1. Data selection and cleaning

Include only data relevant to the purpose of the RAG. Redundant, irrelevant or outdated information should be removed to clean up noise that does not add value.

2. Clarity and consistency in content

Be concrete and specific: Keep the information to the point. Avoid unnecessary words or complex explanations.
Avoid ambiguous messages: Avoid vague or unclear terms that could lead to confusion. Make sure the meaning is easy to interpret.
Reinforce the message: Make the message clearer by using specific terms related to the category being discussed. Use keywords strategically to reinforce the message.
Make sure procedures are clear and include all the necessary steps: Make sure each step in tutorials is fully described, logically structured and easy to follow. Avoid fragmented or disjointed instructions.
Remove unnecessary reference information: Minimize excessive details between steps that could distract or confuse the LLM. Keep the flow simple and clear.

3. Improvements in information

Add missing content: If the product includes features similar to others but with slight variations, add a sentence explaining what is and is not supported to make the LLM more accurate.
Add similar terminology: Although you cannot control what terminology people use, mentioning common alternative terms in your content can help the LLM provide more informative answers.

4. Structure and formatting

Maintain consistent formatting: Ensure all steps follow a parallel structure (similar sentence formats and style) to improve coherence.
Simplify complex tables: Avoid blank cells and ensure every cell has a complete value. Replace symbols (e.g., checkmarks) with clear text (“Yes”, “Supported”) to improve interpretation. Rewrite footnote text to add context. Move complex information in table cells out of the table.
Avoid nested content: LLMs can have difficulty with multiple levels of nesting (e.g., steps within steps). Keep content linear and simple for better understanding.
Add summaries to tutorials or long procedures: LLMs can get “lost” with long tutorials or procedures due to context window limitations. Including a summary is a simple way to enhance results.

5. Clarification and explanation of concepts

Easy writing: Resolve writing issues such as wordiness, passive voice, and unclear pronouns (with ambiguous references) to make text more understandable.
Explain graphics/images in text: Clearly explain conceptual graphics through text to resolve ambiguities and avoid relying on an image-to-text model.

b. Import documents

Once the data is curated, the documents must be uploaded into the system. For that purpose, the following guidelines must be followed.

Note: The RAG does not support file names that contain whitespace.

1. Upload documents in the Azure container `atria-resources`

Insert these documents in the <preset_name>/<retrievalStg.sources.name>/<retrievalStg.sources.docs[i].extension>/ folder.
Keep in mind the allowed formats for documents, set in the preset’s variable loader.loaderType.
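
As an illustration of the folder convention above, the following Python sketch derives one target folder per entry in `retrievalStg.sources.docs` from a preset dictionary. It is a minimal example under assumptions: the preset name `my-preset` and the helper names are hypothetical, and the whitespace check interprets the note above as applying to file names.

```python
def upload_folders(preset_name, preset):
    """Return the container folder for each entry in retrievalStg.sources.docs."""
    sources = preset["retrievalStg"]["sources"]
    return [
        f"{preset_name}/{sources['name']}/{doc['extension']}/"
        for doc in sources["docs"]
    ]


def has_whitespace(file_name):
    """Return True if the file name contains any whitespace character."""
    return any(ch.isspace() for ch in file_name)


def blob_path(folder, file_name):
    """Join folder and file name, rejecting names with whitespace."""
    if has_whitespace(file_name):
        raise ValueError(f"file name contains whitespace: {file_name!r}")
    return folder + file_name
```

For a preset named `my-preset` whose source is called `project-de-faqs` and declares the extensions `jsonl` and `pdf`, `upload_folders` would return `my-preset/project-de-faqs/jsonl/` and `my-preset/project-de-faqs/pdf/`.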

2. Configure `docs` parameter in preset

To make these documents available to your use case, declare them in the preset: fill in the parameters under the docs key, which holds the document configuration.

Here is an example of documents configuration. In this example, documents in the preset are separated into two folders, as we are going to load two different types of data (jsonl and pdf) into this preset.

```json
{
    "retrievalStg": {
        "sources": {
            "name": "project-de-faqs",
            "embeddings": "text-embedding-ada-002",
            "docs": [
                {
                    "extension": "jsonl",
                    "loader": {
                        "loaderType": "jsonl"
                    }
                },
                {
                    "extension": "pdf",
                    "loader": {
                        "loaderType": "unstructured",
                        "options": {
                            "loaderMode": "single"
                        }
                    }
                }
            ],
            "splitter": {
                "splitterType": "recursivechar",
                "options": {
                    "chunkSize": 512,
                    "chunkOverlap": 160
                }
            },
            "retrievers": [
                {
                    "retrieverType": "qdrant"
                },
                {
                    "retrieverType": "tfidf"
                }
            ]
        }
    }
}
```
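
To give an intuition for how `chunkSize` and `chunkOverlap` interact, the following Python sketch computes the character spans of fixed-size overlapping windows. It is a deliberate simplification and not the `recursivechar` splitter itself, which presumably also splits on separators, so real chunk boundaries will differ. It also assumes both values are measured in characters. The function name is illustrative.

```python
def chunk_spans(text_length, chunk_size=512, chunk_overlap=160):
    """Return (start, end) character offsets of fixed-size overlapping chunks."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each new chunk starts chunk_size - chunk_overlap characters after the
    # previous one (352 with the default values).
    step = chunk_size - chunk_overlap
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break
        start += step
    return spans
```

With the values from the example, a 1000-character text yields the spans (0, 512), (352, 864) and (704, 1000): each chunk shares its first 160 characters with the end of the previous one.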

3. Upload list of URLs

If you use URLs as documents ("loaderType": "url_list"), you also need to upload a file with the list of URLs in the preset folder.
Separate each URL with a line break. The file must have the extension .txt.
```
http://www.url1.com
http://www.url2.com
```
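
If the URLs come from another system, the file content can be generated rather than typed by hand. The following Python sketch produces the text of such a file, one URL per line. The check that each entry is an absolute http(s) URL is an added safeguard and an assumption, not a documented ATRIA requirement, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def render_url_list(urls):
    """Return the content of a url_list .txt file: one URL per line."""
    lines = []
    for url in urls:
        url = url.strip()
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        lines.append(url)
    return "".join(line + "\n" for line in lines)
```

The returned string can then be written to a file with the .txt extension and uploaded to the preset folder.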

4. Upload jsonl or jsond files

If you use jsonl or jsond files as documents ("loaderType": "jsonl" or "loaderType": "jsond"), you also need to upload the file content in the same folder with the extension .jsonl or .jsond.

Each line of the file is one document, and its content must be provided in the page_content key:

```
{"page_content": "test1", "metadata": {"source": "https://www.dummy1.es/"}, "type": "Document"}
{"page_content": "test2", "metadata": {"source": "https://www.dummy2.es/"}, "type": "Document"}
```
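
A file in this shape can be produced with the standard `json` module. The following Python sketch serializes pairs of content and source URL into lines with the same three keys as the example above. The function name and the choice of input format are illustrative assumptions.

```python
import json


def to_jsonl(documents):
    """Serialize (page_content, source) pairs as one JSON object per line."""
    lines = []
    for page_content, source in documents:
        record = {
            "page_content": page_content,
            "metadata": {"source": source},
            "type": "Document",
        }
        # ensure_ascii=False keeps accented characters readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)
```

Writing the returned string to a file with the .jsonl extension, encoded as UTF-8, gives one document per line.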

5. Add project.metadata file (optional)

Scenario 1: Unstructured, csv or text data

If the loaderType is url_list, unstructured or csv, you can optionally add a file called project.metadata with relevant information about each file. This metadata will be stored in the database and is very helpful when we want to modify the source URL.

The file must be correctly indented and must not contain any invalid characters.

The file consists of:

The __global__ key, which contains global data that applies to all the files.
The names of the specific files for which we want to include this extra data.

It is not necessary to define metadata for all the files in the folder.

Example:

```
__global__:
   url: https://www.google.com
   field1: test
   field2: test
file1.txt:
   url: https://www.dummy-url.com
   title: file1 title
file2.txt:
   url: https://www.dummy-url.com
   title: file2 title
   source: test
```
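
The layout of the example above can also be generated from dictionaries, which helps keep the indentation consistent. The following Python sketch is a minimal illustration under assumptions: it reproduces the three-space indentation of the example, treats every value as a plain string, and performs no escaping, so values with special characters are out of scope. The function name is hypothetical.

```python
def render_project_metadata(global_data, per_file):
    """Render a project.metadata body with 3-space indented key/value pairs."""
    # The __global__ section comes first, followed by one section per file.
    sections = [("__global__", global_data)] + list(per_file.items())
    lines = []
    for name, fields in sections:
        lines.append(f"{name}:")
        for key, value in fields.items():
            lines.append(f"   {key}: {value}")
    return "\n".join(lines) + "\n"
```

Files without an entry in `per_file` are simply omitted, since defining metadata for every file is not required.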

NOTE: When creating your use case, you can select, from all the information added to project.metadata, the specific sources that will be shown to the user as part of the response by adding them to the baseURL field of the preset configuration.

Scenario 2: URL or json documents

In this case, there is no need to add the project.metadata file:

"loaderType": "url_list": Metadata information is included in the URLs themselves, uploaded in step 3.
"loaderType": "jsonl", "loaderType": "jsond": Metadata information is already included in the files uploaded in step 4.

6. Update data into the environment

Finally, execute the atria-rag-generate-db job to update the data in the environment.

Docs:

ATRIA RAG Server

Descriptive documentation regarding the ATRIA component atria-rag-server

Introduction

atria-rag-server is an ATRIA component that manages a RAG-type server. It is called by atria-model-gateway when RAG (Retrieval Augmented Generation) is used.

atria-rag-server handles each request made to the RAG model by following the predefined RAG chain (pipeline), making successive requests that combine Generative AI technology (LLMs) with semantic and lexical searches to retrieve the required information.

Associated documentation

Descriptive technical documentation regarding atria-rag-server includes:

Docs:

ATRIA RAG Generate DB

Descriptive documentation regarding the ATRIA component atria-rag-generate-db

Introduction

atria-rag-generate-db is an ATRIA component that manages a RAG-type database. This component is launched when you want to feed the document database for the first time or when you want to update the database with new information. See more information about these processes in the guidelines Import documents into ATRIA.

atria-rag-generate-db is in charge of handling the information coming from different sources and feeding the databases the RAG works with.

Associated documentation

Descriptive technical documentation regarding atria-rag-generate-db includes:

Launch atria-rag-generate-db

To launch atria-rag-generate-db, there are two suitable options:

Option 1

Send a request to the API so that it launches atria-rag-generate-db. The endpoint responsible for this is /aura-services/v2/operations/data:

```
curl -X POST "https://<your-atria-domain>/aura-services/v2/operations/data" \
-H "Content-Type: application/json" \
-d '{
  "presetId": "<name of the project>"
}'
```
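
The same call can be prepared from Python with the standard library. The following sketch only builds the request object and does not send it. The domain and preset values are placeholders, the function name is hypothetical, and any authentication headers your environment may require are not covered here.

```python
import json
import urllib.request


def build_generate_db_request(domain, preset_id):
    """Build (but do not send) the POST request that launches generate-db."""
    url = f"https://{domain}/aura-services/v2/operations/data"
    body = json.dumps({"presetId": preset_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. Keeping construction separate from sending makes it easy to inspect the URL and body before anything is launched.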

Option 2

Execute the following commands to update the data in the environment. By default, the job generates the database for all the projects; setting ATRIA_PROJECT, as shown here, launches the generation for a specific project only.

```
PROJECT='project-copilot-reduced'
kubectl patch configmap/atria-rag-generate-db-project --type merge -p "{\"data\":{\"ATRIA_PROJECT\":\"${PROJECT}\"}}" -n <namespace>
kubectl create job --from=cronjob/atria-rag-generate-db $(date +%Y%m%d%H%M%S)-atria-rag-generate-db-${PROJECT} -n <namespace>
```

(Replace <namespace> with the namespace of your environment.)
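
When this step is automated, building the JSON patch with a serializer avoids the manual quote escaping of the shell version. The following Python sketch assembles both kubectl invocations as argument lists without running them. The function name is hypothetical, and the namespace is whatever applies to your environment.

```python
import json
from datetime import datetime


def build_kubectl_commands(project, namespace, now=None):
    """Return the two kubectl invocations as argument lists for subprocess.run."""
    now = now or datetime.now()
    # Merge patch that sets ATRIA_PROJECT in the config map.
    patch = json.dumps({"data": {"ATRIA_PROJECT": project}})
    # Timestamped job name, matching the $(date +%Y%m%d%H%M%S) prefix.
    job_name = f"{now:%Y%m%d%H%M%S}-atria-rag-generate-db-{project}"
    return [
        ["kubectl", "patch", "configmap/atria-rag-generate-db-project",
         "--type", "merge", "-p", patch, "-n", namespace],
        ["kubectl", "create", "job", "--from=cronjob/atria-rag-generate-db",
         job_name, "-n", namespace],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order, so that the job is created only if the config map patch succeeds.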
