Aura Databricks Jobs configuration
This document describes the internal configuration of the aura-databricks-jobs component that will be enabled in every Aura release from the current one onwards.
⚠️ Users can modify this configuration to a certain extent, as described in the Aura Databricks Jobs user guide.
Prerequisites
- Python version 3.9 or higher
  # determine python version
  python --version
- aura-pytraces: Aura repository for Python traces functionalities.
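The Python version requirement can also be checked from inside the interpreter, for example at the top of a script. This is a minimal sketch; the helper name `meets_minimum_python` is illustrative and is not part of aura-databricks-jobs.

```python
import sys

MINIMUM_PYTHON = (3, 9)  # minimum version stated in the prerequisites


def meets_minimum_python(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True when the given (or running) interpreter is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version_info[:2]) >= tuple(minimum)
```

Calling `meets_minimum_python()` without arguments checks the interpreter that is currently running.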
Execution of the tool in Databricks cluster
1. Configuration of the Databricks cluster
First, follow the steps in the Kernel documentation to set up the cluster correctly: Create a Databricks cluster.
In addition, to make our environment and Python package available in the Databricks cluster, configure the cluster with a previously registered Docker image:
docker_image: auraregistry.azurecr.io/aura/tools/aura-databricks-jobs:$VERSION
Example configuration obtained by applying the steps in the Kernel documentation and setting the Docker image URL:
{
  "spark_version": "12.2.x-scala2.12",
  "spark_conf": {
    "spark.driver.memory": "4g",
    "spark.jars.packages": "com.telefonica.baikal:spark-sdk_2.12:2.2.1,org.apache.spark:spark-avro_2.12:3.3.2",
    "spark.jars.repositories": "https://4p-public-artifacts.s3.amazonaws.com/baikal/releases/,https://repo.osgeo.org/repository/release/",
    "spark.debug.maxToStringFields": "100"
  },
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3",
    "JNAME": "zulu11-ca-amd64"
  },
  "init_scripts": [
    {
      "workspace": { "destination": "/InitScripts//init_script.sh" }
    }
  ],
  "docker_image": {
    "url": "auraregistry.azurecr.io/aura/tools/aura-databricks-jobs:$VERSION",
    "basic_auth": {
      "username": "$USERNAME",
      "password": "$PASSWORD"
    }
  }
}
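The placeholders in the `docker_image` block have to be replaced with real values before the configuration is submitted. A minimal sketch of how that block could be generated rather than edited by hand; the function name `build_docker_image_block` and the version value in the usage note are illustrative assumptions, not part of the tool.

```python
import json

# Image path taken from the configuration above; only the tag varies.
IMAGE_PATH = "auraregistry.azurecr.io/aura/tools/aura-databricks-jobs"


def build_docker_image_block(version, username, password):
    """Return the `docker_image` section with the placeholders filled in."""
    return {
        "url": f"{IMAGE_PATH}:{version}",
        "basic_auth": {"username": username, "password": password},
    }


def to_json(block):
    """Serialize the block so it can be pasted into the cluster configuration."""
    return json.dumps({"docker_image": block}, indent=2)
```

For example, `to_json(build_docker_image_block("1.0.0", "user", "secret"))` yields a fragment with the same shape as the `docker_image` entry shown above (the version `1.0.0` is a made-up value). Credentials should come from a secret store rather than being hard-coded.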
Example of configuring the init script as indicated in the Kernel documentation:
#!/bin/bash
wget -O /databricks/jars/config-1.3.4.jar https://repo1.maven.org/maven2/com/typesafe/config/1.3.4/config-1.3.4.jar
rm -f /databricks/jars/*--com.typesafe__config__1.2.1.jar
2. Configuration of the job’s variables
The job is configured through input parameters collected in the config_dict variable.
All available variables are listed in Job’s variables.
import asyncio

# NOTE: import_avro_files_job must also be imported; its module path is not
# shown in this excerpt.

# Input parameters of the job. Empty values must be filled in per environment.
config_dict = {
    'AURA_ENVIRONMENT_NAME': 'DEV',
    'AURA_DATABRICKS_EXECUTION_PERIOD': 24,
    'AURA_FP_SPARK_BASE_URL': '',
    'AURA_FP_SPARK_CLIENT_ID': 'aura-bot-xxx',
    'AURA_FP_SPARK_CLIENT_SECRET': '',
    'AURA_FP_SPARK_PURPOSES': '',
    'AURA_FP_SPARK_SCOPES': '',
    'AURA_FP_SPARK_JARS_PACKAGES': 'com.telefonica.baikal:spark-sdk_2.12:2.2.1,org.apache.spark:spark-avro_2.12:3.3.2',
    'AURA_FP_SPARK_JARS_REPOSITORIES':
        'https://4p-public-artifacts.s3.amazonaws.com/baikal/releases/,https://repo.osgeo.org/repository/release/',
    'AURA_FP_SPARK_SUFFIX_DATASET_TEST': '',
    'AURA_KPI_AVRO_SOURCE_PATH': 'avro',
    'AURA_KPI_AVRO_REPORTS_DESTINATION_PATH': 'avro/reports',
    'AURA_MICROSOFT_AZURE_STORAGE_COMMON_ACCOUNT': '',
    'AURA_MICROSOFT_AZURE_STORAGE_COMMON_ACCESS_KEY': '',
    'AURA_MICROSOFT_AZURE_STORAGE_KPIS_CONTAINER_NAME': 'aura-kpis',
    'AURA_KPI_AVRO_SCHEMAS_NOT_TO_UPLOAD': 'entity:E_Aura_GROOT',
    'AURA_KPI_AVRO_PROCESSED_FOLDER_PATH': 'processed'
}

if __name__ == "__main__":
    # The job entry point is a coroutine, so it is driven by asyncio.run.
    asyncio.run(import_avro_files_job(config_dict))
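Several entries in the template are empty strings that must be filled in for each environment. A minimal sketch of a pre-flight check that reports unset variables before the job starts; the helper `find_unset_variables` is hypothetical, and because this document does not state which variables are mandatory, the caller supplies that list.

```python
def find_unset_variables(config, required_keys):
    """Return the required keys that are missing or set to an empty value."""
    return [
        key
        for key in required_keys
        if key not in config or config[key] in ("", None)
    ]


def ensure_configured(config, required_keys):
    """Raise ValueError naming every required variable that is still unset."""
    unset = find_unset_variables(config, required_keys)
    if unset:
        raise ValueError("Unset job variables: " + ", ".join(unset))
```

Calling `ensure_configured(config_dict, [...])` just before `asyncio.run(...)` fails fast with a readable message instead of a later connection or authentication error.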
3. Configuration of job in Databricks cluster
To execute the job in Databricks, create a new job following the guidelines in Create and run Databricks Jobs, and copy the template avro_to_dataset_job_cli.py, omitting these parameters, which are not needed on the cluster:
- AURA_FP_SPARK_JARS_PACKAGES
- AURA_FP_SPARK_JARS_REPOSITORIES
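If the same dictionary is shared between local and cluster runs, the two parameters can be dropped programmatically instead of by hand. A minimal sketch; the names `LOCAL_ONLY_PARAMS` and `without_local_only_params` are illustrative and not part of the template.

```python
# Parameters that are only needed when running outside the Databricks cluster.
LOCAL_ONLY_PARAMS = (
    "AURA_FP_SPARK_JARS_PACKAGES",
    "AURA_FP_SPARK_JARS_REPOSITORIES",
)


def without_local_only_params(config):
    """Return a copy of `config` without the local-only Spark parameters."""
    return {key: value for key, value in config.items() if key not in LOCAL_ONLY_PARAMS}
```

The original dictionary is left untouched, so the full version can still be used for local runs.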
Execution of the tool in local environment
To install Apache Spark on your local machine and run Python scripts, follow the steps below.
1. Install Java 11
Apache Spark requires Java to run. We recommend using Java 11, as indicated in the Kernel documentation Spark SDK.
You can install Java 11 using a package manager or by downloading the installer: Download.
- On Ubuntu/Debian:
sudo apt update
sudo apt install openjdk-11-jdk
- On macOS (using Homebrew):
brew install openjdk@11
- On Windows: Download the JRE installer from the Oracle website, run the installer and follow the on-screen instructions.
Finally, verify the installation with:
java -version
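The same check can be scripted, which is handy in a setup or CI step. A minimal sketch with illustrative helper names; note that `java -version` prints its report to stderr, and that Java 8 and earlier report themselves as `1.x`.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.8.0_392" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None if absent."""
    if shutil.which("java") is None:
        return None
    # The version report goes to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
```

A setup script could compare `installed_java_major()` with `11` and print a hint when it differs or is `None`.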
2. Install requirements via pip
pip install -r requirements.txt
These requirements include the PySpark library, which ships with a lightweight version of Spark, so you can run Spark jobs locally without installing Spark separately. PySpark can also be installed on its own:
pip install pyspark
3. Configure the Spark session
By default, the Databricks cluster is already configured with the required JAR files and packages. In local mode, however, you must provide this configuration when creating the Spark session, using the job variables AURA_FP_SPARK_JARS_PACKAGES and AURA_FP_SPARK_JARS_REPOSITORIES.
Example:
AURA_FP_SPARK_JARS_PACKAGES = 'com.telefonica.baikal:spark-sdk_2.12:2.2.1,org.apache.spark:spark-avro_2.12:3.3.2'
AURA_FP_SPARK_JARS_REPOSITORIES = 'https://4p-public-artifacts.s3.amazonaws.com/baikal/releases/,https://repo.osgeo.org/repository/release/'
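How the job builds its Spark session internally is not shown in this document. As a rough sketch of the idea, the two variables map onto the standard Spark properties `spark.jars.packages` and `spark.jars.repositories` (the same keys used in the cluster configuration). The function names and the `local[*]` master below are assumptions for illustration.

```python
def build_spark_options(config):
    """Map the job variables onto the corresponding Spark properties."""
    return {
        "spark.jars.packages": config["AURA_FP_SPARK_JARS_PACKAGES"],
        "spark.jars.repositories": config["AURA_FP_SPARK_JARS_REPOSITORIES"],
    }


def create_local_session(config, app_name="aura-databricks-jobs-local"):
    """Create a local SparkSession that resolves the configured packages."""
    # Imported lazily so build_spark_options works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in build_spark_options(config).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

These properties only take effect if they are set before the first session is created, because Spark resolves the packages when the JVM starts.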
4. Execute job
You can execute the job with the configured variables:
python avro_to_dataset_job_cli.py