Aura Databricks Jobs configuration

This document describes the internal configuration of the aura-databricks-jobs component, which is enabled in every Aura release from the current one onwards.

⚠️ Users can modify this configuration to a certain extent, as described in the Aura Databricks Jobs user guide.

Prerequisites

  • Python version 3.9 or higher

    # determine python version
    python --version
    
  • aura-pytraces: Aura repository that provides Python trace functionality.
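
The Python version requirement can also be checked from inside a script, which is convenient when the interpreter on the cluster or in a virtual environment differs from the one on the shell path. The following is a minimal sketch; the helper name `meets_minimum_python` is hypothetical and is not part of aura-databricks-jobs.

```python
import sys

# Minimum interpreter version stated in the prerequisites above.
MINIMUM_PYTHON = (3, 9)


def meets_minimum_python(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor, ...) version is at least `minimum`.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


# Report whether the interpreter running this snippet satisfies the requirement.
print("Python version OK:", meets_minimum_python())
```

A job script could call `meets_minimum_python()` at start-up and stop with a clear message when it returns `False`.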

Execution of the tool in a Databricks cluster

1. Configuration of the Databricks cluster

First, follow the steps in the Kernel documentation to set up the cluster correctly: Create a Databricks cluster.

In addition, to make our environment and Python package available in the Databricks cluster, configure the cluster to use the Docker image that has been registered previously: docker_image: auraregistry.azurecr.io/aura/tools/aura-databricks-jobs:$VERSION

Example configuration obtained by applying the steps in the Kernel documentation and setting the Docker image URL:

{
    "spark_version": "12.2.x-scala2.12",
    "spark_conf": {
        "spark.driver.memory": "4g",
        "spark.jars.packages": "com.telefonica.baikal:spark-sdk_2.12:2.2.1,org.apache.spark:spark-avro_2.12:3.3.2",
        "spark.jars.repositories": "https://4p-public-artifacts.s3.amazonaws.com/baikal/releases/,https://repo.osgeo.org/repository/release/",
        "spark.debug.maxToStringFields": "100"
    },
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3",
        "JNAME": "zulu11-ca-amd64"
    },
    "init_scripts": [
        {
            "workspace": { "destination": "/InitScripts//init_script.sh"}
        }
    ],
    "docker_image": {
        "url": "auraregistry.azurecr.io/aura/tools/aura-databricks-jobs:$VERSION",
        "basic_auth": {
            "username": "$USERNAME",
            "password": "$PASSWORD"
        }
    }
}
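
Since the image tag and the registry credentials change between releases and environments, the `docker_image` section can be generated instead of edited by hand. The sketch below only illustrates that idea: the function name and the example values are hypothetical, and the arguments stand in for the $VERSION, $USERNAME and $PASSWORD placeholders of the example.

```python
import json

# Image repository taken from the configuration example above.
IMAGE_REPOSITORY = "auraregistry.azurecr.io/aura/tools/aura-databricks-jobs"


def build_docker_image_block(version, username, password):
    """Build the `docker_image` section of the cluster configuration.

    Hypothetical helper: the arguments replace the $VERSION, $USERNAME and
    $PASSWORD placeholders.
    """
    if not version:
        raise ValueError("version must not be empty")
    return {
        "url": f"{IMAGE_REPOSITORY}:{version}",
        "basic_auth": {"username": username, "password": password},
    }


# Example values only; real credentials should come from a secret store.
block = build_docker_image_block("1.0.0", "example-user", "example-password")
print(json.dumps({"docker_image": block}, indent=4))
```

The printed fragment has the same shape as the `docker_image` entry of the cluster configuration and can be merged into it.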

Example of configuring the init script as indicated in the Kernel documentation:

#!/bin/bash
wget -O /databricks/jars/config-1.3.4.jar https://repo1.maven.org/maven2/com/typesafe/config/1.3.4/config-1.3.4.jar
rm -f /databricks/jars/*--com.typesafe__config__1.2.1.jar

2. Configuration of the job’s variables

The job is configured through a set of input parameters collected in the config_dict variable.

All variables are described in Job’s variables.

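import asyncio

# import_avro_files_job must also be imported; see the
# avro_to_dataset_job_cli.py template for the exact import.

config_dict = {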
    'AURA_ENVIRONMENT_NAME': 'DEV',
    'AURA_DATABRICKS_EXECUTION_PERIOD': 24,
    'AURA_FP_SPARK_BASE_URL': '',
    'AURA_FP_SPARK_CLIENT_ID': 'aura-bot-xxx',
    'AURA_FP_SPARK_CLIENT_SECRET': '',
    'AURA_FP_SPARK_PURPOSES': '',
    'AURA_FP_SPARK_SCOPES': '',
    'AURA_FP_SPARK_JARS_PACKAGES': 'com.telefonica.baikal:spark-sdk_2.12:2.2.1,org.apache.spark:spark-avro_2.12:3.3.2',
    'AURA_FP_SPARK_JARS_REPOSITORIES':
        'https://4p-public-artifacts.s3.amazonaws.com/baikal/releases/,https://repo.osgeo.org/repository/release/',
    'AURA_FP_SPARK_SUFFIX_DATASET_TEST': '',
    'AURA_KPI_AVRO_SOURCE_PATH': 'avro',
    'AURA_KPI_AVRO_REPORTS_DESTINATION_PATH': 'avro/reports',
    'AURA_MICROSOFT_AZURE_STORAGE_COMMON_ACCOUNT': '',
    'AURA_MICROSOFT_AZURE_STORAGE_COMMON_ACCESS_KEY': '',
    'AURA_MICROSOFT_AZURE_STORAGE_KPIS_CONTAINER_NAME': 'aura-kpis',
    'AURA_KPI_AVRO_SCHEMAS_NOT_TO_UPLOAD': 'entity:E_Aura_GROOT',
    'AURA_KPI_AVRO_PROCESSED_FOLDER_PATH': 'processed'
}

if __name__ == "__main__":
    asyncio.run(import_avro_files_job(config_dict))
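
Several entries in the example are left empty (for example AURA_FP_SPARK_CLIENT_SECRET), and secrets are usually better kept out of the script itself. One option, shown here as a sketch and not as the way the template works, is to overlay values from environment variables and list the entries that are still empty. The helper names are hypothetical, and the use of environment variables with the same names as the config keys is an assumption. Some entries may legitimately stay empty, so the list is a hint rather than a validation.

```python
import os


def overlay_from_env(config, environ=None):
    """Return a copy of `config` with keys found in the environment overridden.

    Assumption for illustration: environment variables use the same names as
    the config keys. Values read from the environment are always strings.
    """
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, value) for key, value in config.items()}


def empty_keys(config):
    """Return the sorted names of entries whose value is an empty string."""
    return sorted(key for key, value in config.items() if value == "")


# Reduced example with three of the variables shown above.
template = {
    "AURA_ENVIRONMENT_NAME": "DEV",
    "AURA_FP_SPARK_BASE_URL": "",
    "AURA_FP_SPARK_CLIENT_SECRET": "",
}
# Stand-in for os.environ so the example does not depend on the real environment.
fake_env = {"AURA_FP_SPARK_CLIENT_SECRET": "example-secret"}

resolved = overlay_from_env(template, fake_env)
print(empty_keys(resolved))  # ['AURA_FP_SPARK_BASE_URL']
```

Numeric entries such as AURA_DATABRICKS_EXECUTION_PERIOD would need an explicit conversion if they were read from the environment.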

3. Configuration of the job in the Databricks cluster

To execute the job in Databricks, create a new job following the Create and run Databricks Jobs guidelines and copy the template avro_to_dataset_job_cli.py, omitting the following parameters, which are not needed in the cluster:

  • AURA_FP_SPARK_JARS_PACKAGES
  • AURA_FP_SPARK_JARS_REPOSITORIES
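
If the same dictionary is shared between local and Databricks runs, the two parameters can be filtered out in code instead of being deleted by hand. A minimal sketch, with a hypothetical helper name and a reduced example dictionary:

```python
# Variables that the Databricks cluster does not need, per the list above.
LOCAL_ONLY_KEYS = ("AURA_FP_SPARK_JARS_PACKAGES", "AURA_FP_SPARK_JARS_REPOSITORIES")


def without_local_only_keys(config):
    """Return a copy of `config` without the local-only Spark jar variables."""
    return {key: value for key, value in config.items() if key not in LOCAL_ONLY_KEYS}


# Reduced example; the real config_dict contains more entries.
local_config = {
    "AURA_ENVIRONMENT_NAME": "DEV",
    "AURA_FP_SPARK_JARS_PACKAGES": "com.telefonica.baikal:spark-sdk_2.12:2.2.1",
    "AURA_FP_SPARK_JARS_REPOSITORIES": "https://4p-public-artifacts.s3.amazonaws.com/baikal/releases/",
}
databricks_config = without_local_only_keys(local_config)
print(sorted(databricks_config))  # ['AURA_ENVIRONMENT_NAME']
```

The original dictionary is left unchanged, so it can still be used for a local run.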

Execution of the tool in a local environment

To install Apache Spark on your local machine and run Python scripts, follow the steps below.

1. Install Java 11

Apache Spark requires Java to run. We recommend using Java 11, as indicated in the Kernel documentation Spark SDK.

You can install Java 11 with a package manager or by downloading the installer: Download.

  • On Ubuntu/Debian:

    sudo apt update
    sudo apt install openjdk-11-jdk

  • On macOS (using Homebrew):

    brew install openjdk@11

  • On Windows: download the Java 11 (JDK) installer from the Oracle website, run it and follow the on-screen instructions.

Finally, verify the installation with:

java -version
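
The same check can be automated before a Spark session is started. The sketch below parses the text printed by `java -version`; the function names are hypothetical. Note that `java -version` writes its report to standard error, and that Java 8 and earlier use the legacy "1.x" numbering.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles both the legacy scheme ("1.8.0_292" -> 8) and the current one
    ("11.0.20" -> 11). Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version` and return the major version, or None if unavailable."""
    try:
        # `java -version` prints its report to stderr, not stdout.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    return parse_java_major(result.stderr)


print(parse_java_major('openjdk version "11.0.20" 2023-07-18'))  # 11
print(parse_java_major('java version "1.8.0_292"'))  # 8
```

A local launcher could compare `installed_java_major()` with 11 and warn when a different version is on the path.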

2. Install requirements via pip

pip install -r requirements.txt

These requirements include the PySpark library, which bundles a lightweight version of Spark, so you can run Spark jobs locally without installing Spark separately. PySpark can also be installed on its own:

pip install pyspark

3. Configure the Spark session

By default, the Databricks cluster is already configured with the required JAR files and packages. In local mode, however, you must provide this configuration when the Spark session is created, through the job variables AURA_FP_SPARK_JARS_PACKAGES and AURA_FP_SPARK_JARS_REPOSITORIES.

Example:

AURA_FP_SPARK_JARS_PACKAGES = 'com.telefonica.baikal:spark-sdk_2.12:2.2.1,org.apache.spark:spark-avro_2.12:3.3.2'
AURA_FP_SPARK_JARS_REPOSITORIES = 'https://4p-public-artifacts.s3.amazonaws.com/baikal/releases/,https://repo.osgeo.org/repository/release/'
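
To illustrate how these two variables relate to a Spark session, the sketch below maps them to the `spark.jars.packages` and `spark.jars.repositories` properties that appear in the cluster configuration example. How the job itself builds its session is not shown in this document, so the function names, the `local[*]` master and the application name are assumptions made for the example.

```python
def spark_conf_from_job_variables(config):
    """Map the two job variables to the corresponding Spark properties.

    Empty or missing variables are skipped so Spark keeps its defaults.
    """
    mapping = {
        "AURA_FP_SPARK_JARS_PACKAGES": "spark.jars.packages",
        "AURA_FP_SPARK_JARS_REPOSITORIES": "spark.jars.repositories",
    }
    return {
        spark_key: config[job_key]
        for job_key, spark_key in mapping.items()
        if config.get(job_key)
    }


def create_local_session(config, app_name="aura-local-example"):
    """Create a local SparkSession with the jar settings applied.

    Illustrative only: `app_name` and the `local[*]` master are assumptions,
    and the real job may build its session differently.
    """
    # Imported here so the mapping above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in spark_conf_from_job_variables(config).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


example = {
    "AURA_FP_SPARK_JARS_PACKAGES": "com.telefonica.baikal:spark-sdk_2.12:2.2.1,org.apache.spark:spark-avro_2.12:3.3.2",
    "AURA_FP_SPARK_JARS_REPOSITORIES": "",
}
print(spark_conf_from_job_variables(example))
```

Calling `create_local_session(config_dict)` requires PySpark and Java to be installed, and the packages are resolved from the configured repositories when the session starts.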

4. Execute job

You can execute the job with the configured variables:

python avro_to_dataset_job_cli.py