<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Aura – </title>
    <link>/docs/atria/technical-components/atria-rag-generate-db/</link>
    <description>Recent content on Aura</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    
	  <atom:link href="/docs/atria/technical-components/atria-rag-generate-db/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Docs: </title>
      <link>/docs/atria/technical-components/atria-rag-generate-db/components/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/docs/atria/technical-components/atria-rag-generate-db/components/</guid>
      <description>
        
        
        &lt;h1 id=&#34;atria-rag-generate-db-architecture-and-components&#34;&gt;ATRIA RAG Generate DB architecture and components&lt;/h1&gt;


&lt;div class=&#34;pageinfo pageinfo-primary&#34;&gt;
&lt;p&gt;Development architecture and technical components of the &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;/div&gt;

&lt;h2 id=&#34;architecture-overview&#34;&gt;Architecture overview&lt;/h2&gt;
&lt;p&gt;The following diagram schematically shows the main technical components integrated into &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;../../../../images/atria/atria-technical-components/rag-generate-db-arch.png&#34; alt=&#34;atria-rag-server-arch&#34;&gt;&lt;/p&gt;
&lt;p&gt;A brief description of the technical components is included below:&lt;/p&gt;
&lt;h3 id=&#34;data-sources&#34;&gt;Data sources&lt;/h3&gt;
&lt;p&gt;A project contains the information required to generate the databases: the path of the documents that feed the databases, the allowed file extensions, etc. It can read from different sources; the source type is defined in the &lt;code&gt;extensions&lt;/code&gt; field.&lt;/p&gt;
&lt;p&gt;Before the information from the documents is stored in the corresponding database, the documents are processed, e.g., they are split into chunks and cleaned.&lt;/p&gt;
&lt;h3 id=&#34;retrievers&#34;&gt;Retrievers&lt;/h3&gt;
&lt;p&gt;The retrievers are in charge of reading the information from the documents and feeding the databases.&lt;/p&gt;
&lt;p&gt;The retrievers are defined in the &lt;code&gt;retrievers&lt;/code&gt; field of the project. Each retriever is associated with a database, either to feed it or to retrieve information from it.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Docs: </title>
      <link>/docs/atria/technical-components/atria-rag-generate-db/operational-overview/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/docs/atria/technical-components/atria-rag-generate-db/operational-overview/</guid>
      <description>
        
        
        &lt;h1 id=&#34;atria-rag-generate-db-operational-overview&#34;&gt;ATRIA RAG Generate DB operational overview&lt;/h1&gt;


&lt;div class=&#34;pageinfo pageinfo-primary&#34;&gt;
&lt;p&gt;Overview of the &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; operation&lt;/p&gt;

&lt;/div&gt;

&lt;h2 id=&#34;operational-flow&#34;&gt;Operational flow&lt;/h2&gt;
&lt;p&gt;The operational flow between an application (for the communication with &lt;a href=&#34;../../../../docs/atria/technical-components/aura-gateway-api/&#34;&gt;&lt;em&gt;aura-gateway-api&lt;/em&gt;&lt;/a&gt;), &lt;a href=&#34;../../../../docs/atria/technical-components/atria-model-gateway/&#34;&gt;&lt;em&gt;atria-model-gateway&lt;/em&gt;&lt;/a&gt;, &lt;a href=&#34;../../../../docs/atria/technical-components/atria-rag-server/&#34;&gt;&lt;em&gt;atria-rag-server&lt;/em&gt;&lt;/a&gt; and &lt;a href=&#34;../../../../docs/atria/technical-components/atria-rag-generate-db/&#34;&gt;&lt;em&gt;atria-rag-generate-db&lt;/em&gt;&lt;/a&gt; is schematically shown in the document &lt;a href=&#34;content/en/docs/atria/technical-components/atria-model-gateway/operational-overview/#operational-workflow&#34;&gt;&lt;em&gt;&lt;strong&gt;atria-model-gateway: operational flow&lt;/strong&gt;&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;configuration&#34;&gt;Configuration&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;atria-model-gateway&lt;/strong&gt;&lt;/em&gt; includes a default configuration. Constructors can use it as is or modify it to suit their requirements or business models. For details, see the document &lt;a href=&#34;../../../../docs/atria/technical-guidelines/configuration/&#34;&gt;ATRIA configuration&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;data-persistence-feature&#34;&gt;Data persistence feature&lt;/h2&gt;
&lt;p&gt;ATRIA enables &lt;strong&gt;data persistence in knowledge bases across releases&lt;/strong&gt;: after a new release is installed, all existing data in the knowledge base (currently, &lt;strong&gt;Qdrant&lt;/strong&gt;) remains fully available and accessible for every &lt;em&gt;&lt;strong&gt;ATRIA&lt;/strong&gt;&lt;/em&gt; experience. The stored information is therefore independent of the deployed version.&lt;/p&gt;
&lt;p&gt;This feature provides key advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Guaranteed continuity of &lt;em&gt;&lt;strong&gt;ATRIA&lt;/strong&gt;&lt;/em&gt; experiences.&lt;/li&gt;
&lt;li&gt;No need for data re-ingestion after each release.&lt;/li&gt;
&lt;li&gt;No need to recalculate embeddings.&lt;/li&gt;
&lt;li&gt;Data ingested after the installation of a release (through hot swapping) is now automatically consolidated and carried forward to subsequent releases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;tracking-and-clean-up-processes&#34;&gt;Tracking and clean-up processes&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; keeps a record of the current state of documents and related configuration for data sources, so it only feeds documents that have been modified or added since the last update.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; also cleans up any resources that are left behind and no longer used after new ones are introduced.&lt;/p&gt;
&lt;h2 id=&#34;preset-management&#34;&gt;Preset management&lt;/h2&gt;
&lt;h3 id=&#34;preset-report&#34;&gt;Preset report&lt;/h3&gt;
&lt;p&gt;After &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; is executed, a report is logged with the following information for each preset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The preset name.&lt;/li&gt;
&lt;li&gt;The status of the execution (success, skipped or error).&lt;/li&gt;
&lt;li&gt;A descriptive message with the reason for the status.&lt;/li&gt;
&lt;li&gt;Date and time of the execution start.&lt;/li&gt;
&lt;li&gt;Date and time of the execution end.&lt;/li&gt;
&lt;li&gt;The configured documents for the preset.&lt;/li&gt;
&lt;/ul&gt;
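&lt;p&gt;The report fields listed above can be pictured as a small record. The following is only an illustrative sketch: the names &lt;code&gt;PresetStatus&lt;/code&gt; and &lt;code&gt;PresetReport&lt;/code&gt;, the field names and the sample values are assumptions made for this example, not the actual types or log format used by &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class PresetStatus(Enum):
    """The three execution statuses named in the report description."""
    SUCCESS = "success"
    SKIPPED = "skipped"
    ERROR = "error"


@dataclass
class PresetReport:
    """Hypothetical container for one preset entry of the report."""
    name: str
    status: PresetStatus
    message: str
    started_at: datetime
    finished_at: datetime
    documents: list = field(default_factory=list)


# Sample values are invented for illustration only.
report = PresetReport(
    name="project-copilot",
    status=PresetStatus.SKIPPED,
    message="Configuration and source files unchanged",
    started_at=datetime(2025, 4, 1, 11, 25, 59),
    finished_at=datetime(2025, 4, 1, 11, 26, 3),
    documents=["docs/guide.md"],
)
print(f"{report.name}: {report.status.value} ({report.message})")
```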
&lt;h3 id=&#34;preset-availability&#34;&gt;Preset availability&lt;/h3&gt;
&lt;p&gt;When a new preset is created, it is necessary to launch the database generation process by executing the &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; component. This process may take several minutes to complete. Once the generation is finished, the &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; component is automatically restarted.&lt;/p&gt;
&lt;p&gt;While these processes are running, a message is shown to the user indicating that the preset is not yet available.&lt;/p&gt;
&lt;p&gt;When both processes are finished, the preset becomes available for use.&lt;/p&gt;
&lt;h2 id=&#34;data-migration-between-atria-releases&#34;&gt;Data migration between ATRIA releases&lt;/h2&gt;
&lt;p&gt;The data persistence feature is implemented by a migration tool, integrated in the &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; component, that works between environments or releases. This tool moves the trained data from one release to another, to avoid regenerating preset data that was already created in a previous release.&lt;/p&gt;
&lt;p&gt;The process for migrating data must be &lt;strong&gt;triggered manually&lt;/strong&gt; by &lt;a href=&#34;#launch-migration-process&#34;&gt;launching a command&lt;/a&gt; (similar to the &lt;em&gt;&lt;strong&gt;aura-rag-generate-db&lt;/strong&gt;&lt;/em&gt; job), where both source and target environments should be indicated.&lt;/p&gt;
&lt;p&gt;After executing this command, data will be migrated from one environment to the other automatically.&lt;/p&gt;
&lt;p&gt;The migration flow is executed as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Process the hashes file and, for each preset to be migrated, perform the following steps:
&lt;ul&gt;
&lt;li&gt;Check that the preset from the source environment is in the config of the target environment.&lt;/li&gt;
&lt;li&gt;Move the &lt;code&gt;trained_data&lt;/code&gt; files from the source environment to the respective training folder of the target environment.&lt;/li&gt;
&lt;li&gt;Duplicate the collections from the source environment to the target environment.&lt;/li&gt;
&lt;li&gt;Move the TFIDF files from the source environment to the target environment.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Move the hashes file from the source environment to the target environment.&lt;/li&gt;
&lt;li&gt;Add the training files of the new presets to the respective training folder in the target environment.&lt;/li&gt;
&lt;li&gt;Launch &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt;. Only new presets will be reloaded.&lt;/li&gt;
&lt;/ol&gt;
&lt;p align=&#34;center&#34;&gt;
  &lt;img width=&#34;1200&#34; height=&#34;1200&#34; src=&#34;../../../../images/atria/atria-persistence.png&#34;&gt;&lt;br&gt;
  &lt;i&gt;Data migration flow&lt;/i&gt;
&lt;/p&gt;
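&lt;p&gt;The per-preset part of the flow above can be sketched as a planning step that reads the hashes file and lists what would be copied. This is a minimal sketch under stated assumptions: &lt;code&gt;plan_migration&lt;/code&gt; is a hypothetical helper, the hashes file is assumed to be already loaded as a dictionary keyed by preset identifier, and the actual copying of files and Qdrant collections is deliberately left out.&lt;/p&gt;

```python
def plan_migration(source_hashes: dict, target_preset_ids: set) -> list:
    """List what would be migrated for each preset known to the target config.

    Hypothetical helper: it only builds a plan and performs no copy.
    """
    plan = []
    for preset_id, entry in source_hashes.items():
        if preset_id not in target_preset_ids:
            # Preset is not in the target configuration: nothing to migrate.
            continue
        retrievers = entry.get("retrievers", {})
        plan.append({
            "preset": preset_id,
            # Trained data folders are named after the source files hash.
            "trained_data_folder": entry["source_files_hash"],
            "collection": retrievers.get("qdrant_collection_name"),
            "tfidf_path": retrievers.get("tfidf_path_file"),
        })
    return plan


source_hashes = {
    "5905dece-433d-47f4-a78c-72366bcd1473": {
        "config_hash": "28f837d56079f30c59a419292d129bc3",
        "source_files_hash": "cda3afcd8e74ede0d23065e897d55fae",
        "metadata": {"date": "2025-04-01 11:25:59"},
        "retrievers": {
            "qdrant_collection_name": "rag-ap-eight-9100-dev-project-copilot",
            "tfidf_path_file": "project-copilot/tfidf",
        },
    }
}
plan = plan_migration(source_hashes, {"5905dece-433d-47f4-a78c-72366bcd1473"})
for step in plan:
    print(step)
```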
&lt;p&gt;The migration process described above relies on the following folders and files, which are generated and stored in Azure Blob Storage after &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; finishes:&lt;/p&gt;
&lt;h3 id=&#34;shared-data&#34;&gt;Shared data&lt;/h3&gt;
&lt;p&gt;This folder contains the trained data shared between &lt;em&gt;&lt;strong&gt;atria-rag-server&lt;/strong&gt;&lt;/em&gt; and &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt;.
It stores the files that &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; generates and that &lt;em&gt;&lt;strong&gt;atria-rag-server&lt;/strong&gt;&lt;/em&gt; then needs in order to process requests.&lt;/p&gt;
&lt;p&gt;At the moment, this folder only contains the files generated by the TFIDF (Term Frequency–Inverse Document Frequency) retriever.&lt;/p&gt;
&lt;p&gt;This folder is used for migration: the TFIDF files of a trained preset can be copied to the blob of a release where that preset has not been trained yet, so that the training does not have to be repeated there.&lt;/p&gt;
&lt;h3 id=&#34;trained-data&#34;&gt;Trained data&lt;/h3&gt;
&lt;p&gt;This folder contains the files that have been used in the &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; for each preset.&lt;/p&gt;
&lt;p&gt;The folder structure is defined with a hash of the contents of all the files for each preset, to facilitate migration.&lt;/p&gt;
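&lt;p&gt;As a rough illustration of a hash computed over the contents of all the files of a preset, consider the sketch below. The algorithm (MD5, chosen only because the sample hashes in this document are 32 hexadecimal characters long), the sorted traversal order and the helper name &lt;code&gt;hash_source_files&lt;/code&gt; are assumptions; the actual procedure used by &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; is not described here.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def hash_source_files(folder: str) -> str:
    """Return one hex digest covering the contents of every file in `folder`.

    Assumption: MD5 and a sorted traversal are illustrative choices only.
    """
    digest = hashlib.md5()
    root = Path(folder)
    # Sort the paths so that the result does not depend on listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

&lt;p&gt;With such a function, any change in the content of a source file yields a different digest, and therefore a different folder name.&lt;/p&gt;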
&lt;h3 id=&#34;atria-rag-project-hashes&#34;&gt;Atria RAG project hashes&lt;/h3&gt;
&lt;p&gt;This is a file containing all the information for each preset, to facilitate migration.&lt;/p&gt;
&lt;p&gt;It contains the following information for each preset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;config_hash&lt;/code&gt;: Hash of the preset configuration at the time the &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; was launched.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;source_files_hash&lt;/code&gt;: Hash of the source files used to generate the preset. A folder with this hash should exist inside the trained data folder.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;metadata&lt;/code&gt;: Metadata of the preset, including the date on which &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt; was launched.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;retrievers&lt;/code&gt;: Information about the retrievers used to generate the preset. It contains the name of the Qdrant collection and the path of the TFIDF files, which corresponds to the shared data folder.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;&amp;#34;5905dece-433d-47f4-a78c-72366bcd1473&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;&amp;#34;config_hash&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;28f837d56079f30c59a419292d129bc3&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;&amp;#34;source_files_hash&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;cda3afcd8e74ede0d23065e897d55fae&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;&amp;#34;date&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;2025-04-01 11:25:59&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;&amp;#34;retrievers&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;&amp;#34;qdrant_collection_name&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;rag-ap-eight-9100-dev-project-copilot&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;&amp;#34;tfidf_path_file&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;project-copilot/tfidf&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition to supporting migration, this data also speeds up the launch of &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;config_hash&lt;/code&gt; and &lt;code&gt;source_files_hash&lt;/code&gt; values are used to verify whether, at the moment of launching &lt;em&gt;&lt;strong&gt;atria-rag-generate-db&lt;/strong&gt;&lt;/em&gt;, anything has changed in the configuration or in the training data. If changes are detected, all the data for that preset is regenerated. Otherwise, if the preset has not changed, its generation is skipped.&lt;/p&gt;
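&lt;p&gt;The check described above can be sketched as a comparison between the stored entry of a preset and the hashes computed at launch time. The function name &lt;code&gt;needs_regeneration&lt;/code&gt; and the handling of presets that are missing from the hashes file are assumptions made for this example; the real implementation may differ.&lt;/p&gt;

```python
from typing import Optional


def needs_regeneration(stored: Optional[dict], config_hash: str,
                       source_files_hash: str) -> bool:
    """Return True when the data of a preset has to be regenerated."""
    if stored is None:
        # Assumption: a preset absent from the hashes file is always generated.
        return True
    return (
        stored.get("config_hash") != config_hash
        or stored.get("source_files_hash") != source_files_hash
    )


stored_entry = {
    "config_hash": "28f837d56079f30c59a419292d129bc3",
    "source_files_hash": "cda3afcd8e74ede0d23065e897d55fae",
}

# Same configuration and same source files: the generation can be skipped.
unchanged = needs_regeneration(
    stored_entry,
    "28f837d56079f30c59a419292d129bc3",
    "cda3afcd8e74ede0d23065e897d55fae",
)
print(unchanged)  # False

# A different configuration hash forces a regeneration of the preset.
changed = needs_regeneration(
    stored_entry,
    "00000000000000000000000000000000",
    "cda3afcd8e74ede0d23065e897d55fae",
)
print(changed)  # True
```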
&lt;h3 id=&#34;launch-migration-process&#34;&gt;Launch migration process&lt;/h3&gt;
&lt;p&gt;The process to persist data between releases has to be launched manually by running the &lt;code&gt;migrate-data&lt;/code&gt; script. The script only needs the output files with the environment configuration info that the installer generates in the &lt;code&gt;output_install&lt;/code&gt; directory, for both the source and the destination environment.
With these files, run the script as shown below, using the file names that correspond to the desired environments:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  ./migrate-data --source-file &lt;span style=&#34;color:#4e9a06&#34;&gt;${&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;SOURCE_ENVIRONMENT_INFO_FILE&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;}&lt;/span&gt; --dest-file &lt;span style=&#34;color:#4e9a06&#34;&gt;${&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;DEST_ENVIRONMENT_INFO_FILE&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;source-file&lt;/code&gt;: Info file of the source environment, where the data is currently stored.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dest-file&lt;/code&gt;: Info file of the target environment, to which the data is going to be migrated.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
  </channel>
</rss>
