Import documents into ATRIA

Guidelines for importing documents and new data into the ATRIA environment

Introduction

As described in General RAG: functional overview, the RAG capability uses different databases for lexical and semantic search.

The documents that feed these knowledge bases must be uploaded into the environment to be used in the RAG chain and updated when required. In this framework, two processes must be considered:

  • a. Curate data (recommended): Curate the data before uploading it, to optimize the recognition process.

  • b. Import documents: Once the data is curated, the documents must be uploaded into the system. For that purpose, apart from the general method, a hot swapping process can be executed.

a. Data curation

Data curation is the process of organizing, managing, cleaning up and maintaining data to ensure it stays relevant and valuable. Good practices in this task lead to efficient recognition by the AI model.

For this purpose, we recommend following these tips, based on research and internal analysis:

1. Data selection and cleaning

  • Include only data relevant to the purpose of the RAG. Redundant, irrelevant or outdated information should be removed to clean up noise that does not add value.

2. Clarity and consistency in content

  • Be concrete and specific: Keep the information to the point. Avoid unnecessary words or complex explanations.
  • Avoid ambiguous messages: Avoid vague or unclear terms that could lead to confusion. Make sure the meaning is easy to interpret.
  • Reinforce the message: Make the message clearer by using specific terms related to the category being discussed. Use keywords strategically to reinforce the message.
  • Make sure procedures are clear and include all the necessary steps: Make sure each step in tutorials is fully described, logically structured and easy to follow. Avoid fragmented or disjointed instructions.
  • Remove unnecessary reference information: Minimize excessive details between steps that could distract or confuse the LLM. Keep the flow simple and clear.

3. Improvements in information

  • Add missing content: If the product includes features similar to others but with slight variations, add a sentence explaining what is and is not supported to make the LLM more accurate.
  • Add similar terminology: Although you cannot control what terminology people use, mentioning common alternative terms in your content can help the LLM provide more informative answers.

4. Structure and formatting

  • Maintain consistent formatting: Ensure all steps follow a parallel structure (similar sentence formats and style) to improve coherence.
  • Simplify complex tables: Avoid blank cells and ensure every cell has a complete value. Replace symbols (e.g., checkmarks) with clear text (“Yes”, “Supported”) to improve interpretation. Rewrite footnote text to add context. Move complex information in table cells out of the table.
  • Avoid nested content: LLMs can have difficulty with multiple levels of nesting (e.g., steps within steps). Keep content linear and simple for better understanding.
  • Add summaries to tutorials or long procedures: LLMs can get “lost” with long tutorials or procedures due to context window limitations. Including a summary is a simple way to enhance results.
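
The table advice above can be partly automated before upload. The following is a minimal Python sketch, not part of ATRIA: the symbol mapping (including mapping cross marks to "No") and the function names are assumptions made for illustration. It replaces symbols with text and reports blank cells so that they can be filled in by hand.

```python
# Hypothetical mapping from table symbols to explicit text (an assumption).
SYMBOL_TEXT = {"✓": "Yes", "✔": "Yes", "✗": "No", "✘": "No"}


def clean_cell(cell: str) -> str:
    """Trim the cell and replace a lone symbol with its text equivalent."""
    text = cell.strip()
    return SYMBOL_TEXT.get(text, text)


def clean_table(rows: list[list[str]]) -> tuple[list[list[str]], list[tuple[int, int]]]:
    """Return the cleaned rows plus the (row, column) positions of blank cells."""
    cleaned = [[clean_cell(cell) for cell in row] for row in rows]
    blanks = [
        (r, c)
        for r, row in enumerate(cleaned)
        for c, cell in enumerate(row)
        if not cell
    ]
    return cleaned, blanks
```

Blank cells are only reported, not filled, because the right value depends on the content of each table.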

5. Clarification and explanation of concepts

  • Easy writing: Resolve writing issues such as wordiness, passive voice, and unclear pronouns (with ambiguous references) to make text more understandable.
  • Explain graphics/images in text: Clearly explain conceptual graphics through text to resolve ambiguities and avoid relying on an image-to-text model.

b. Import documents

Once the data is curated, the documents must be uploaded into the system. For that purpose, the following guidelines must be followed.

Note: The RAG does not support file names that contain whitespace.
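
A quick local check before uploading can catch this. The sketch below is only an illustration: it assumes the restriction applies to file names, and the helper names and the underscore replacement are hypothetical choices, not ATRIA behaviour.

```python
import re


def has_whitespace(file_name: str) -> bool:
    """Return True if the file name contains any whitespace character."""
    return re.search(r"\s", file_name) is not None


def safe_name(file_name: str, replacement: str = "_") -> str:
    """Replace each run of whitespace with a single replacement string."""
    return re.sub(r"\s+", replacement, file_name.strip())
```

For example, `safe_name("user guide v2.pdf")` yields `user_guide_v2.pdf`, which can then be uploaded.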

1. Upload documents to the Azure container atria-resources

  • Insert these documents in the <preset_name>/<retrievalStg.sources.name>/<retrievalStg.sources.docs[i].extension>/ folder.

  • Keep in mind the allowed formats for documents, set in the preset’s variable loader.loaderType.
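
The folder pattern above can be derived from the preset itself. The following Python sketch shows one way to list the expected folders; the function name and the preset name used in the test are hypothetical, and the preset is assumed to be available as a parsed dictionary.

```python
def target_folders(preset_name: str, preset: dict) -> list[str]:
    """Build one container folder per entry in retrievalStg.sources.docs.

    Follows the pattern
    <preset_name>/<retrievalStg.sources.name>/<retrievalStg.sources.docs[i].extension>/
    """
    sources = preset["retrievalStg"]["sources"]
    return [
        f"{preset_name}/{sources['name']}/{doc['extension']}/"
        for doc in sources["docs"]
    ]
```

Each returned string is the folder, inside the atria-resources container, where documents with that extension are expected.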

2. Configure docs parameter in preset

For these documents to be used in your use case, they must be included in the preset, following these instructions.

  • Fill in the parameters in the docs key of your preset, which is related to the configuration of documents.

Here is an example of a document configuration. In this example, the documents are split into two folders because two different types of data (jsonl and pdf) are loaded into this preset.

```json
{
    "retrievalStg": {
        "sources": {
            "name": "project-de-faqs",
            "embeddings": "text-embedding-ada-002",
            "docs": [
                {
                    "extension": "jsonl",
                    "loader": {
                        "loaderType": "jsonl"
                    }
                },
                {
                    "extension": "pdf",
                    "loader": {
                        "loaderType": "unstructured",
                        "options": {
                            "loaderMode": "single"
                        }
                    }
                }
            ],
            "splitter": {
                "splitterType": "recursivechar",
                "options": {
                    "chunkSize": 512,
                    "chunkOverlap": 160
                }
            },
            "retrievers": [
                {
                    "retrieverType": "qdrant"
                },
                {
                    "retrieverType": "tfidf"
                }
            ]
        }
    }
}
```
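
Before uploading a preset like this one, a few sanity checks can be run locally. The Python sketch below is an illustration only: the function name is hypothetical, and the rule that chunkOverlap should be smaller than chunkSize is an assumption based on how recursive character splitters usually behave, not a documented ATRIA constraint.

```python
def check_sources(sources: dict) -> list[str]:
    """Return a list of human-readable problems found in a 'sources' block."""
    problems = []
    for i, doc in enumerate(sources.get("docs", [])):
        if not doc.get("extension"):
            problems.append(f"docs[{i}]: missing 'extension'")
        if not doc.get("loader", {}).get("loaderType"):
            problems.append(f"docs[{i}]: missing 'loader.loaderType'")
    options = sources.get("splitter", {}).get("options", {})
    size = options.get("chunkSize")
    overlap = options.get("chunkOverlap")
    # Assumption: an overlap as large as the chunk itself is a configuration error.
    if size is not None and overlap is not None and overlap >= size:
        problems.append("splitter: chunkOverlap should be smaller than chunkSize")
    return problems
```

An empty list means that none of these checks failed; it does not guarantee that the preset is accepted by the platform.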

3. Upload list of URLs

  • If you use URLs as documents ("loaderType": "url_list"), you also need to upload a file with the list of URLs in the preset folder.

  • Separate each URL with a line break. The file must have the extension .txt.

    http://www.url1.com
    http://www.url2.com
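
A file in this layout can be produced with a short script. The following is a minimal Python sketch; the function names and the file name passed by the caller are hypothetical, and only the one-URL-per-line layout and the .txt extension come from the rules above.

```python
from pathlib import Path


def render_url_list(urls: list[str]) -> str:
    """Return the file content: one URL per line, blank entries dropped."""
    cleaned = [url.strip() for url in urls if url.strip()]
    return "\n".join(cleaned) + "\n"


def write_url_list(urls: list[str], path: str) -> None:
    """Write the URL list to a .txt file, rejecting any other extension."""
    target = Path(path)
    if target.suffix != ".txt":
        raise ValueError("the URL list file must have the extension .txt")
    target.write_text(render_url_list(urls), encoding="utf-8")
```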
    

4. Upload jsonl or jsond files

  • If you use jsonl or jsond files as documents ("loaderType": "jsonl" or "loaderType": "jsond"), you also need to upload the file content in the same folder with the extension .jsonl or .jsond.

  • The content of each document must be provided in the page_content key.

    {"page_content": "test1", "metadata": {"source": "https://www.dummy1.es/"}, "type": "Document"}
    {"page_content": "test2", "metadata": {"source": "https://www.dummy2.es/"}, "type": "Document"}
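
Lines in this shape can be generated with the standard json module. The sketch below is an illustration that mirrors the keys of the example above (page_content, metadata.source, type); the function name and the input format (pairs of content and source) are hypothetical.

```python
import json


def to_jsonl(records: list[tuple[str, str]]) -> str:
    """Serialize (content, source) pairs as one JSON document per line."""
    lines = []
    for content, source in records:
        doc = {
            "page_content": content,
            "metadata": {"source": source},
            "type": "Document",
        }
        # ensure_ascii=False keeps accented characters readable in the file.
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

The returned text can be saved with the extension .jsonl in the same folder as the other documents.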
    

5. Add project.metadata file (optional)

Scenario 1: Unstructured, csv or text data

If the loaderType is url_list, unstructured or csv, you can optionally add a file called project.metadata with relevant information about each file. This metadata is stored in the database and is particularly useful for modifying the source URL.

It is important that the file is correctly tabulated and does not contain any invalid characters.

The file is composed of:

  • Key __global__, which contains global data that affects all the files.
  • Names of the specific files to which this extra data should be added.

It is not necessary to define metadata for all the files in the folder.

Example:

__global__:
   url: https://www.google.com
   field1: test
   field2: test
file1.txt:
   url: https://www.dummy-url.com
   title: file1 title
file2.txt:
   url: https://www.dummy-url.com
   title: file2 title
   source: test
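
To avoid tabulation mistakes, the file can be generated rather than typed by hand. The Python sketch below is an illustration under stated assumptions: the layout (a key ending in a colon, followed by fields indented with three spaces) is inferred from the example above, the function name is hypothetical, and values are written as-is without any quoting or escaping.

```python
def render_metadata(global_fields: dict, per_file: dict) -> str:
    """Render project.metadata content with __global__ first, then each file."""
    sections = {"__global__": global_fields, **per_file}
    lines = []
    for name, fields in sections.items():
        lines.append(f"{name}:")
        for key, value in fields.items():
            # Three-space indentation, as in the example above (assumed layout).
            lines.append(f"   {key}: {value}")
    return "\n".join(lines) + "\n"
```

Files that need no extra metadata can simply be left out of the per-file dictionary.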

Note: When creating your use case, you can choose which fields from project.metadata are shown to the user as sources in the response by adding them to the baseURL field of the preset configuration.

Scenario 2: URL or json documents

In this case, there is no need to add the project.metadata file:

  • "loaderType": "url_list" -> Metadata information is included in the URLs themselves, uploaded in step 3.

  • "loaderType": "jsonl", "loaderType": "jsond" -> Metadata information is already included in the files uploaded in step 4.

6. Update data in the environment

Finally, execute the atria-rag-generate-db job to load the updated data into the environment.