Aura Databricks Jobs user guide
Guidelines and ordered steps for using Aura Databricks Jobs.
Prerequisites
- Python version 3.9 or higher.
    # determine python version
    python --version
- aura-pytraces installed: Aura repository for Python trace functionality.
- Prerequisites in the Aura installer:
  - Databricks must be enabled in the Aura installer
  - Databricks cluster node type must be configured
  - Databricks job execution must be configured
- Kernel datasets configured. See more details in Kernel datasets configuration.
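The Python version requirement can also be checked from inside the interpreter, for example at the start of a setup script. The following is a minimal sketch; the helper name `meets_minimum_version` is hypothetical and is not part of aura-databricks-jobs.

```python
import sys

# Minimum version taken from the prerequisites list.
MINIMUM_VERSION = (3, 9)


def meets_minimum_version(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least `minimum`.

    `version_info` defaults to the running interpreter's version.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


# Report whether the current interpreter satisfies the prerequisite.
print("Python", sys.version.split()[0], "OK:", meets_minimum_version())
```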
Flow
aura-databricks-jobs follows this flow to validate whether it is going to be executed:
Generate Reports
By default, aura-databricks-jobs generates a report during the import process. This report is stored in the Azure Storage account defined in AURA_MICROSOFT_AZURE_STORAGE_COMMON_ACCOUNT, under the path defined in AURA_KPI_AVRO_REPORTS_DESTINATION_PATH, with the file name aura-avro-kpis-report-{iso-date}.json.
To change this behavior, set the environment variable AURA_KPIS_REPORTS_MODE. If the value is all, a report is generated for each of the processed files; if it is none, no report is generated; if it is error, a report is generated only when there are errors in the process. The default value is all.
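The three values of AURA_KPIS_REPORTS_MODE can be summarized as a small decision function. This is only an illustrative sketch of the rules described above, not the code used by aura-databricks-jobs: the function name `should_generate_report` is hypothetical, and raising an error for an unknown value is an assumption.

```python
import os

# Values documented for AURA_KPIS_REPORTS_MODE.
VALID_MODES = ("all", "none", "error")


def should_generate_report(num_errors, mode=None):
    """Decide whether a report is generated for a given number of errors.

    If `mode` is not given, it is read from AURA_KPIS_REPORTS_MODE,
    falling back to the documented default, "all".
    """
    if mode is None:
        mode = os.environ.get("AURA_KPIS_REPORTS_MODE", "all")
    if mode not in VALID_MODES:
        # Assumption: the real job may treat unknown values differently.
        raise ValueError(f"Unsupported AURA_KPIS_REPORTS_MODE: {mode!r}")
    if mode == "all":
        return True
    if mode == "none":
        return False
    # mode == "error": report only when the process had errors
    return num_errors > 0
```

For example, `should_generate_report(0, "error")` returns `False`, while `should_generate_report(2, "error")` returns `True`.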
3.1 Report Model
A report will contain the following template in JSON format.
{
"num_files_kernel_uploaded": 30,
"num_files_moved_to_processed": 30,
"num_files_deleted": 30,
"num_files_skipped": 0,
"num_errors": 0,
"summary": {
"D_Aura_Channel": {
"dataset_id": "D_Aura_Channel",
"schema": "dimensional",
"version": "6.0.0",
"step": "FINISH",
"num_files_kernel_uploaded": 4,
"num_files_moved_to_processed": 4,
"num_files_deleted": 4,
"num_files_skipped": 0,
"num_errors": 0,
"errors": [],
"spark_executions": {
"dataset_id": "D_Aura_Channel",
"version": 6,
"correlator": "55fc318d-b9cd-4070-ae6e-0407ef4b871e",
"resource_id": "8fb3e408-2ce0-42f4-8bbf-5b0974b44108",
"request_type": "writes",
"status": "finished",
"metrics": {
"total_records_written": 116,
"local_spark_write_discards": 0,
"local_spark_write_discards_total": 0,
"malformed_records_written": 0,
"total_records_filtered_by_gdpr": 0,
"local_spark_bytes_written_total": 14640,
"total_malformed_records_by_partition_written": [],
"partitions_written": [],
"total_malformed_records_written": 0,
"total_malformed_records_by_column_written": [],
"total_records_by_partition_written": [],
"total_not_informed_records_by_partition_written": [],
"records_read": 116,
"local_spark_records_written_total": 116,
"total_not_informed_records_written": 0,
"records_written": 116,
"total_malformed_records_discarded": 0,
"records_discarded": 0,
"data_access_audit": {
"partitions_num": 1,
"wasb_type": "avro_fp"
},
"total_executor_cpu_millis": 1,
"total_executor_memory": 593913446,
"total_bytes_written": 4796
}
},
"files_uploaded": [
"avro_test/dimensional/D_Aura_Channel/6.0.0/CR_DIM_CHANNEL_20241017T070000Z.avro",
"avro_test/dimensional/D_Aura_Channel/6.0.0/CR_DIM_CHANNEL_20241017T080000Z.avro",
"avro_test/dimensional/D_Aura_Channel/6.0.0/CR_DIM_CHANNEL_20241017T090000Z.avro",
"avro_test/dimensional/D_Aura_Channel/6.0.0/CR_DIM_CHANNEL_20241017T100000Z.avro"
],
"duration_seconds": 141.32
},
"D_Aura_Recognizer": {
"dataset_id": "D_Aura_Recognizer",
"schema": "dimensional",
"version": "6.0.0",
"step": "FINISH",
"num_files_kernel_uploaded": 4,
"num_files_moved_to_processed": 4,
"num_files_deleted": 4,
"num_files_skipped": 0,
"num_errors": 0,
"errors": [],
"spark_executions": {
"dataset_id": "D_Aura_Recognizer",
"version": 6,
"correlator": "55fc318d-b9cd-4070-ae6e-0407ef4b871e",
"resource_id": "415fb219-6ef4-4b21-9e14-c10347f1d2fa",
"request_type": "writes",
"status": "finished",
"metrics": {
"total_records_written": 376,
"local_spark_write_discards": 0,
"local_spark_write_discards_total": 0,
"malformed_records_written": 0,
"total_records_filtered_by_gdpr": 0,
"local_spark_bytes_written_total": 49744,
"total_malformed_records_by_partition_written": [],
"partitions_written": [],
"total_malformed_records_written": 0,
"total_malformed_records_by_column_written": [],
"total_records_by_partition_written": [],
"total_not_informed_records_by_partition_written": [],
"records_read": 376,
"local_spark_records_written_total": 376,
"total_not_informed_records_written": 0,
"records_written": 376,
"total_malformed_records_discarded": 0,
"records_discarded": 0,
"data_access_audit": {
"partitions_num": 1,
"wasb_type": "avro_fp"
},
"total_executor_cpu_millis": 1,
"total_executor_memory": 593913446,
"total_bytes_written": 9055
}
},
"files_uploaded": [
"avro_test/dimensional/D_Aura_Recognizer/6.0.0/CR_DIM_RECOGNIZER_20241017T070000Z.avro",
"avro_test/dimensional/D_Aura_Recognizer/6.0.0/CR_DIM_RECOGNIZER_20241017T080000Z.avro",
"avro_test/dimensional/D_Aura_Recognizer/6.0.0/CR_DIM_RECOGNIZER_20241017T090000Z.avro",
"avro_test/dimensional/D_Aura_Recognizer/6.0.0/CR_DIM_RECOGNIZER_20241017T100000Z.avro"
],
"duration_seconds": 94.75
},
"D_Aura_Component": {
"dataset_id": "D_Aura_Component",
"schema": "dimensional",
"version": "6.0.0",
"step": "FINISH",
"num_files_kernel_uploaded": 4,
"num_files_moved_to_processed": 4,
"num_files_deleted": 4,
"num_files_skipped": 0,
"num_errors": 0,
"errors": [],
"spark_executions": {
"dataset_id": "D_Aura_Component",
"version": 6,
"correlator": "55fc318d-b9cd-4070-ae6e-0407ef4b871e",
"resource_id": "340c90a8-00d5-4868-a746-5ec0f8342a90",
"request_type": "writes",
"status": "finished",
"metrics": {
"total_records_written": 28,
"local_spark_write_discards": 0,
"local_spark_write_discards_total": 0,
"malformed_records_written": 0,
"total_records_filtered_by_gdpr": 0,
"local_spark_bytes_written_total": 2108,
"total_malformed_records_by_partition_written": [],
"partitions_written": [],
"total_malformed_records_written": 0,
"total_malformed_records_by_column_written": [],
"total_records_by_partition_written": [],
"total_not_informed_records_by_partition_written": [],
"records_read": 28,
"local_spark_records_written_total": 28,
"total_not_informed_records_written": 0,
"records_written": 28,
"total_malformed_records_discarded": 0,
"records_discarded": 0,
"data_access_audit": {
"partitions_num": 1,
"wasb_type": "avro_fp"
},
"total_executor_cpu_millis": 1,
"total_executor_memory": 593913446,
"total_bytes_written": 1255
}
},
"files_uploaded": [
"avro_test/dimensional/D_Aura_Component/6.0.0/CR_DIM_COMPONENT_20241017T070000Z.avro",
"avro_test/dimensional/D_Aura_Component/6.0.0/CR_DIM_COMPONENT_20241017T080000Z.avro",
"avro_test/dimensional/D_Aura_Component/6.0.0/CR_DIM_COMPONENT_20241017T090000Z.avro",
"avro_test/dimensional/D_Aura_Component/6.0.0/CR_DIM_COMPONENT_20241017T100000Z.avro"
],
"duration_seconds": 105.14
},
"D_Aura_Skill": {
"dataset_id": "D_Aura_Skill",
"schema": "dimensional",
"version": "6.0.0",
"step": "FINISH",
"num_files_kernel_uploaded": 4,
"num_files_moved_to_processed": 4,
"num_files_deleted": 4,
"num_files_skipped": 0,
"num_errors": 0,
"errors": [],
"spark_executions": {
"dataset_id": "D_Aura_Skill",
"version": 6,
"correlator": "55fc318d-b9cd-4070-ae6e-0407ef4b871e",
"resource_id": "60da9e25-0767-4097-ab9a-2bf388d8daa7",
"request_type": "writes",
"status": "finished",
"metrics": {
"total_records_written": 16,
"local_spark_write_discards": 0,
"local_spark_write_discards_total": 0,
"malformed_records_written": 0,
"total_records_filtered_by_gdpr": 0,
"local_spark_bytes_written_total": 1280,
"total_malformed_records_by_partition_written": [],
"partitions_written": [],
"total_malformed_records_written": 0,
"total_malformed_records_by_column_written": [],
"total_records_by_partition_written": [],
"total_not_informed_records_by_partition_written": [],
"records_read": 16,
"local_spark_records_written_total": 16,
"total_not_informed_records_written": 0,
"records_written": 16,
"total_malformed_records_discarded": 0,
"records_discarded": 0,
"data_access_audit": {
"partitions_num": 1,
"wasb_type": "avro_fp"
},
"total_executor_cpu_millis": 1,
"total_executor_memory": 593913446,
"total_bytes_written": 1246
}
},
"files_uploaded": [
"avro_test/dimensional/D_Aura_Skill/6.0.0/CR_DIM_SKILL_20241017T070000Z.avro",
"avro_test/dimensional/D_Aura_Skill/6.0.0/CR_DIM_SKILL_20241017T080000Z.avro",
"avro_test/dimensional/D_Aura_Skill/6.0.0/CR_DIM_SKILL_20241017T090000Z.avro",
"avro_test/dimensional/D_Aura_Skill/6.0.0/CR_DIM_SKILL_20241017T100000Z.avro"
],
"duration_seconds": 95.97
},
"D_Aura_Preset": {
"dataset_id": "D_Aura_Preset",
"schema": "dimensional",
"version": "6.0.0",
"step": "FINISH",
"num_files_kernel_uploaded": 4,
"num_files_moved_to_processed": 4,
"num_files_deleted": 4,
"num_files_skipped": 0,
"num_errors": 0,
"errors": [],
"spark_executions": {
"dataset_id": "D_Aura_Preset",
"version": 6,
"correlator": "55fc318d-b9cd-4070-ae6e-0407ef4b871e",
"resource_id": "8b143625-9bf7-484a-8a05-671a6cff72fe",
"request_type": "writes",
"status": "finished",
"metrics": {
"total_records_written": 64,
"local_spark_write_discards": 0,
"local_spark_write_discards_total": 0,
"malformed_records_written": 0,
"total_records_filtered_by_gdpr": 0,
"local_spark_bytes_written_total": 5020,
"total_malformed_records_by_partition_written": [],
"partitions_written": [],
"total_malformed_records_written": 0,
"total_malformed_records_by_column_written": [],
"total_records_by_partition_written": [],
"total_not_informed_records_by_partition_written": [],
"records_read": 64,
"local_spark_records_written_total": 64,
"total_not_informed_records_written": 0,
"records_written": 64,
"total_malformed_records_discarded": 0,
"records_discarded": 0,
"data_access_audit": {
"partitions_num": 1,
"wasb_type": "avro_fp"
},
"total_executor_cpu_millis": 1,
"total_executor_memory": 593913446,
"total_bytes_written": 2001
}
},
"files_uploaded": [
"avro_test/dimensional/D_Aura_Preset/6.0.0/CR_DIM_PRESETS_20241017T070000Z.avro",
"avro_test/dimensional/D_Aura_Preset/6.0.0/CR_DIM_PRESETS_20241017T080000Z.avro",
"avro_test/dimensional/D_Aura_Preset/6.0.0/CR_DIM_PRESETS_20241017T090000Z.avro",
"avro_test/dimensional/D_Aura_Preset/6.0.0/CR_DIM_PRESETS_20241017T100000Z.avro"
],
"duration_seconds": 72.97
},
"D_Aura_App": {
"dataset_id": "D_Aura_App",
"schema": "dimensional",
"version": "6.0.0",
"step": "FINISH",
"num_files_kernel_uploaded": 4,
"num_files_moved_to_processed": 4,
"num_files_deleted": 4,
"num_files_skipped": 0,
"num_errors": 0,
"errors": [],
"spark_executions": {
"dataset_id": "D_Aura_App",
"version": 6,
"correlator": "55fc318d-b9cd-4070-ae6e-0407ef4b871e",
"resource_id": "f99b5dac-47ce-4525-aa86-6d3bbb3b67f5",
"request_type": "writes",
"status": "finished",
"metrics": {
"total_records_written": 28,
"local_spark_write_discards": 0,
"local_spark_write_discards_total": 0,
"malformed_records_written": 0,
"total_records_filtered_by_gdpr": 0,
"local_spark_bytes_written_total": 5192,
"total_malformed_records_by_partition_written": [],
"partitions_written": [],
"total_malformed_records_written": 0,
"total_malformed_records_by_column_written": [],
"total_records_by_partition_written": [],
"total_not_informed_records_by_partition_written": [],
"records_read": 28,
"local_spark_records_written_total": 28,
"total_not_informed_records_written": 0,
"records_written": 28,
"total_malformed_records_discarded": 0,
"records_discarded": 0,
"data_access_audit": {
"partitions_num": 1,
"wasb_type": "avro_fp"
},
"total_executor_cpu_millis": 1,
"total_executor_memory": 593913446,
"total_bytes_written": 2742
}
},
"files_uploaded": [
"avro_test/dimensional/D_Aura_App/6.0.0/CR_DIM_APP_20241017T070000Z.avro",
"avro_test/dimensional/D_Aura_App/6.0.0/CR_DIM_APP_20241017T080000Z.avro",
"avro_test/dimensional/D_Aura_App/6.0.0/CR_DIM_APP_20241017T090000Z.avro",
"avro_test/dimensional/D_Aura_App/6.0.0/CR_DIM_APP_20241017T100000Z.avro"
],
"duration_seconds": 93.86
},
"Aura_Audit": {
"dataset_id": "Aura_Audit",
"schema": "entity",
"version": "6.0.0",
"step": "FINISH",
"num_files_kernel_uploaded": 2,
"num_files_moved_to_processed": 2,
"num_files_deleted": 2,
"num_files_skipped": 0,
"num_errors": 0,
"errors": [],
"spark_executions": {
"dataset_id": "Aura_Audit",
"version": 6,
"correlator": "55fc318d-b9cd-4070-ae6e-0407ef4b871e",
"resource_id": "3013424c-4ef1-4bdb-b4fc-a02540f9b1f8",
"request_type": "writes",
"status": "finished",
"metrics": {
"total_records_written": 63,
"local_spark_write_discards": 0,
"local_spark_write_discards_total": 0,
"malformed_records_written": 0,
"total_records_filtered_by_gdpr": 0,
"local_spark_bytes_written_total": 12452,
"total_malformed_records_by_partition_written": [],
"partitions_written": [
[
[
"DAY_DT",
"2024-10-04"
]
],
[
[
"DAY_DT",
"2024-10-07"
]
]
],
"total_malformed_records_written": 0,
"total_malformed_records_by_column_written": [],
"total_records_by_partition_written": [
[
"DAY_DT=2024-10-04",
53
],
[
"DAY_DT=2024-10-07",
10
]
],
"total_not_informed_records_by_partition_written": [],
"records_read": 63,
"local_spark_records_written_total": 63,
"total_not_informed_records_written": 0,
"records_written": 63,
"total_malformed_records_discarded": 0,
"records_discarded": 0,
"data_access_audit": {
"partitions_num": 1,
"wasb_type": "avro_fp"
},
"total_executor_cpu_millis": 1,
"total_executor_memory": 593913446,
"total_bytes_written": 6854
}
},
"files_uploaded": [
"avro_test/entity/Aura_Audit/6.0.0/AURA_062a0ab0-d0bd-5347-98bf-d88977af622f_CR_AUDIT_20241007T090000Z.avro",
"avro_test/entity/Aura_Audit/6.0.0/AURA_1d43887a-f368-51ce-abee-60f5b25387ad_CR_AUDIT_20241004T110000Z.avro"
],
"duration_seconds": 100.70
},
"Aura_Gateway_Message": {
"dataset_id": "Aura_Gateway_Message",
"schema": "entity",
"version": "6.0.0",
"step": "NOT_PROCESSED",
"num_files_kernel_uploaded": 0,
"num_files_moved_to_processed": 0,
"num_files_deleted": 0,
"num_files_skipped": 0,
"num_errors": 0,
"errors": [],
"spark_executions": {},
"files_uploaded": [],
"duration_seconds": 0.07
}
},
"start_time": "2024-10-23T15:18:30.098166Z",
"end_time": "2024-10-23T15:36:57.161532Z",
"duration_seconds": 1107.06,
"step": "FINISH",
"status": "successfully"
}
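In the sample above, the top-level duration_seconds equals the difference between end_time and start_time. The sketch below recomputes it when reading a report; the function names are hypothetical. The trailing "Z" is rewritten as an explicit offset because `datetime.fromisoformat` only accepts it from Python 3.11 onwards, while the prerequisite here is Python 3.9.

```python
from datetime import datetime


def parse_report_time(value):
    """Parse a report timestamp such as '2024-10-23T15:18:30.098166Z'."""
    # Rewrite the UTC designator so that Python 3.9 and 3.10 can parse it.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def elapsed_seconds(report):
    """Return end_time minus start_time in seconds, rounded to 2 decimals."""
    start = parse_report_time(report["start_time"])
    end = parse_report_time(report["end_time"])
    return round((end - start).total_seconds(), 2)


# Timestamps copied from the sample report.
sample = {
    "start_time": "2024-10-23T15:18:30.098166Z",
    "end_time": "2024-10-23T15:36:57.161532Z",
}
print(elapsed_seconds(sample))  # 1107.06
```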
The parameters are defined as follows:
- dataset_id: Kernel dataset id to load.
- schema: Type of schema to load.
- version: Dataset version to load.
- step: Stage of the loading process. It can be:
  - INIT: In this stage, the necessary Azure and Spark connections are created and a report is created.
  - CHECK_PREVIOUS_ERRORS: In this stage, the job checks whether there were errors in the last execution; errors of datasets that cannot be recovered are marked, and datasets that can be recovered are executed again.
  - WRITING_KERNEL_STAGE: Stage for reading files and writing data to the Kernel datasets.
  - MOVING_PROCESSED_BLOBS_STAGE: Stage for moving files to the processed folder.
  - FINISH: This stage indicates that the process has been completed.
- num_files_kernel_uploaded: Number of files that have been verified as successfully uploaded to the Kernel Datalake.
- num_files_moved_to_processed: Number of files that have been moved to the processed folder.
- num_files_deleted: Number of files that have been deleted from the main folder.
- num_files_skipped: Number of files that have been skipped. These files have not been processed because they match the pattern defined in the job variable AURA_KPI_AVRO_SCHEMAS_NOT_TO_UPLOAD.
- num_errors: Total number of errors reported. An error may refer to a failure when loading the source files contained in one of the Avro-formatted folders, so this value does not correspond to the number of erroneous files.
- start_time: Start time, as a date in ISO format.
- end_time: End time, as a date in ISO format.
- duration_seconds: Duration of the import process, in seconds.
- status: Status of the process. The value is failed or successfully.
- summary: Information about each processed coroutine. Each coroutine is responsible for loading a folder with files that have the same Avro schema and the same version. If there is a general error prior to the coroutines, it also appears in the summary, in the process_error field. For each dataset id, it contains:
  - num_files_kernel_uploaded: Number of files that have been verified as successfully uploaded to the Kernel Datalake for this dataset id.
  - num_files_moved_to_processed: Number of files that have been moved to the processed folder for this dataset id.
  - num_files_deleted: Number of files that have been deleted from the main folder for this dataset id.
  - num_errors: Number of errors reported for this dataset id.
  - errors: Errors produced for this dataset id. Each element has the fields error, corr and step:
    - error: Description or exception of the error obtained.
    - corr: Correlator used in the process.
    - step: Phase of the process for each Kernel dataset. It can be:
      - MOVING_BLOBS_TO_PROCESSED_WITH_PREVIOUS_ERRORS: In this stage, the processed files that were pending to be moved due to an error are now moved.
      - REMOVING_BLOBS_WITH_PREVIOUS_ERRORS: In this stage, the processed files that were pending to be deleted due to an error are now deleted.
      - NOT_PROCESSED_PREVIOUS_ERRORS: Errors that occurred in a previous process and are not recoverable. For example, if the writing has malformed or discarded records, they must be reviewed manually and should not be written to the dataset. Likewise, if the files fail again after a new attempt to move them, those files must be checked specifically.
      - READING_BLOBS: In this stage, the files are read to create the data to be written to the dataset.
      - WRITING_DATASET: In this stage, the data is written to the dataset.
      - WRITING_DATASET_OK: At this stage, the data has already been correctly written to the dataset.
      - WRITING_DATASET_ERROR_NOT_RECOVERABLE: In the writing process, malformed or discarded records have been detected that must be checked manually.
      - MOVING_BLOBS_TO_PROCESSED: At this stage, the files are moved to the processed folder.
      - REMOVING_BLOBS: At this stage, the files are deleted from the processed folder.
      - NOT_PROCESSED: The dataset has no data and will not be processed.
      - FINISH: The dataset upload has been completed correctly.
  - spark_executions: Spark report for this dataset id. It includes information such as records read, written, discarded, etc.
  - files_uploaded: List of files that have been uploaded to Kernel for this dataset id.
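As an illustration of how the per-dataset step values could be used when reviewing a report, the following sketch groups dataset ids by the action they are likely to need. It is an assumption-based helper, not part of aura-databricks-jobs: the function name, the three group names and the decision to skip summary entries without a step field (such as process_error, whose shape is not described here) are all hypothetical.

```python
# Steps described above as requiring a manual check.
MANUAL_REVIEW_STEPS = {
    "NOT_PROCESSED_PREVIOUS_ERRORS",
    "WRITING_DATASET_ERROR_NOT_RECOVERABLE",
}
# Steps described above as a normal end state for a dataset.
TERMINAL_OK_STEPS = {"FINISH", "NOT_PROCESSED"}


def classify_datasets(report):
    """Group the dataset ids of a report into ok / manual_review / incomplete."""
    groups = {"ok": [], "manual_review": [], "incomplete": []}
    for dataset_id, entry in report.get("summary", {}).items():
        # Assumption: entries without a "step" (e.g. process_error) are skipped.
        if not isinstance(entry, dict) or "step" not in entry:
            continue
        step = entry["step"]
        if step in MANUAL_REVIEW_STEPS:
            groups["manual_review"].append(dataset_id)
        elif step in TERMINAL_OK_STEPS and entry.get("num_errors", 0) == 0:
            groups["ok"].append(dataset_id)
        else:
            # Stopped at an intermediate step or finished with errors.
            groups["incomplete"].append(dataset_id)
    return groups


example_report = {
    "summary": {
        "D_Aura_Channel": {"step": "FINISH", "num_errors": 0},
        "Aura_Gateway_Message": {"step": "NOT_PROCESSED", "num_errors": 0},
        "Aura_Audit": {"step": "WRITING_DATASET_ERROR_NOT_RECOVERABLE", "num_errors": 1},
        "D_Aura_Skill": {"step": "MOVING_BLOBS_TO_PROCESSED", "num_errors": 1},
    }
}
print(classify_datasets(example_report))
```

The step names and error counts in `example_report` are made up for the illustration; only the first two entries mirror the sample report.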
Example of one coroutine executed for the 'D_Aura_Channel' dataset:
{
  "D_Aura_Channel": {
    "dataset_id": "D_Aura_Channel",
    "schema": "dimensional",
    "version": "6.0.0",
    "step": "FINISH",
    "num_files_kernel_uploaded": 156,
    "num_files_moved_to_processed": 156,
    "num_files_deleted": 156,
    "num_files_skipped": 0,
    "num_errors": 0,
    "errors": [],
    "spark_executions": {
      "dataset_id": "D_Aura_Channel",
      "version": 6,
      "correlator": "d558b080-f261-4e6b-9adc-a7503f3e51a9",
      "resource_id": "36417c66-a276-4107-bcb8-3792bccb076c",
      "request_type": "writes",
      "status": "finished",
      "metrics": {
        "total_records_written": 4967,
        "local_spark_write_discards": 0,
        "local_spark_write_discards_total": 0,
        "malformed_records_written": 0,
        "total_records_filtered_by_gdpr": 0,
        "local_spark_bytes_written_total": 4049495,
        "total_malformed_records_by_partition_written": [],
        "partitions_written": [],
        "total_malformed_records_written": 0,
        "total_records_by_partition_written": [],
        "total_not_informed_records_by_partition_written": [],
        "records_read": 4967,
        "local_spark_records_written_total": 4967,
        "total_not_informed_records_written": 0,
        "records_written": 4967,
        "total_malformed_records_discarded": 0,
        "records_discarded": 0,
        "data_access_audit": {
          "partitions_num": 1,
          "wasb_type": "avro_fp"
        },
        "total_executor_cpu_millis": 1,
        "total_executor_memory": 593913446,
        "total_bytes_written": 394038
      }
    },
    "duration_seconds": 112.05
  }
}
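When reviewing a coroutine entry like the one above, a quick sanity check on its spark_executions block is to confirm that every record read was written and nothing was discarded or malformed. The sketch below shows one possible check; the function name `write_is_clean` and the choice of metrics to compare are assumptions, not criteria defined by aura-databricks-jobs.

```python
def write_is_clean(spark_execution):
    """Return True if a spark_executions entry shows a lossless write.

    Assumed criteria: the execution finished, all records read were
    written, and no record was discarded or marked as malformed.
    """
    metrics = spark_execution.get("metrics", {})
    return (
        spark_execution.get("status") == "finished"
        and metrics.get("records_read") == metrics.get("records_written")
        and metrics.get("records_discarded", 0) == 0
        and metrics.get("total_malformed_records_written", 0) == 0
        and metrics.get("total_malformed_records_discarded", 0) == 0
    )


# Subset of the metrics from the D_Aura_Channel example.
channel_execution = {
    "status": "finished",
    "metrics": {
        "records_read": 4967,
        "records_written": 4967,
        "records_discarded": 0,
        "total_malformed_records_written": 0,
        "total_malformed_records_discarded": 0,
    },
}
print(write_is_clean(channel_execution))  # True
```

An empty spark_executions object, as reported for a dataset in the NOT_PROCESSED step, makes this check return `False` because there is no finished execution to inspect.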