Connect Azure Databricks to an Azure NetApp Files object REST API-enabled volume

The object REST API feature enables Azure Databricks to read data from and write data to Azure NetApp Files volumes, supporting end-to-end data science workflows from ingestion to model deployment.

To connect Azure Databricks to the volume, you configure an initialization (init) script that loads the SSL certificate on the Databricks compute nodes. This setup ensures secure communication between Azure Databricks and your Azure NetApp Files object REST API-enabled volume.

Before you begin

Ensure you have:

  - An Azure NetApp Files volume with the object REST API feature enabled, along with its bucket endpoint, access key, and secret key.
  - The SSL certificate for the object REST API endpoint.
  - An Azure Databricks workspace and permission to create or edit compute.

Create the init script

The init script runs during cluster startup. For more information about init scripts, see What are init scripts?

  1. Write a bash script to load the SSL certificate. Save the script with an .sh extension. For example:

    #!/bin/bash

    # Write the volume's CA certificate to the system trust store.
    # Paste the certificate body between the BEGIN and END lines.
    cat << 'EOF' > /usr/local/share/ca-certificates/myca.crt
    -----BEGIN CERTIFICATE-----

    -----END CERTIFICATE-----
    EOF

    update-ca-certificates

    PEM_FILE="/etc/ssl/certs/myca.pem"
    PASSWORD="changeit"
    JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
    KEYSTORE="$JAVA_HOME/lib/security/cacerts"
    CERTS=$(grep 'END CERTIFICATE' $PEM_FILE | wc -l)

    # To process multiple certificates with keytool, extract each one
    # from the PEM file and import it into the Java KeyStore.
    for N in $(seq 0 $(($CERTS - 1))); do
        ALIAS="$(basename $PEM_FILE)-$N"
        echo "Adding to keystore with alias:$ALIAS"
        cat $PEM_FILE |
            awk "n==$N { print }; /END CERTIFICATE/ { n++ }" |
            keytool -noprompt -import -trustcacerts \
                    -alias $ALIAS -keystore $KEYSTORE -storepass $PASSWORD
    done

    # Point Python clients at the updated system certificate bundle.
    echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
    echo "export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh

    # If the bucket URL doesn't resolve through DNS, map the volume IP address to it:
    # echo "<volume IP> <bucket URL>" >> /etc/hosts
    
  2. Use the Databricks CLI or the Databricks UI to upload the bash script to the Databricks File System (DBFS). For more information, see Work with files on Azure Databricks.
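
    For example, you can upload the script with the Databricks CLI. This is a minimal sketch, assuming the CLI is installed and authenticated; the local and DBFS paths are placeholders that should match the path you reference in the cluster configuration:

    databricks fs cp ./script.sh dbfs:/path/to/your/script.sh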

Configure the cluster

  1. Navigate to your Azure Databricks workspace. Open the cluster configuration settings.

  2. In the Advanced Options section, add the path to the init script under Init Scripts. For example: dbfs:/path/to/your/script.sh


  3. Select the init.sh script, select Add, and then select Confirm.

  4. To apply the changes and load the SSL certificate, restart the cluster.

  5. In the cluster logs, verify that the certificate was loaded correctly.
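
Besides the logs, you can also check the Java KeyStore directly from a notebook attached to the cluster. This is a minimal sketch, assuming the default keystore password (changeit) and the myca alias pattern used in the init script above:

    %sh
    # Look for the imported certificate alias (for example, myca.pem-0) in the cluster's cacerts keystore.
    JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
    keytool -list -keystore "$JAVA_HOME/lib/security/cacerts" -storepass changeit | grep -i myca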

Connect to an Azure NetApp Files bucket

Databricks recommends using secret scopes for storing all credentials. For more information, see Manage secret scopes.
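
For example, this minimal sketch reads the bucket credentials from a secret scope in a notebook; the scope and key names are placeholders:

    # Retrieve the bucket credentials from a Databricks secret scope (placeholder names).
    access_key = dbutils.secrets.get(scope="your-scope", key="anf-access-key")
    secret_key = dbutils.secrets.get(scope="your-scope", key="anf-secret-key")

You can then pass these variables to the Spark configuration in the next step instead of hard-coding the keys.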

  1. In your Databricks notebook, configure the Spark session to connect to the Azure NetApp Files bucket. For example:

    spark.conf.set("fs.s3a.endpoint", "https://your-s3-endpoint") 
    spark.conf.set("fs.s3a.access.key", "your-access-key") 
    spark.conf.set("fs.s3a.secret.key", "your-secret-key") 
    spark.conf.set("fs.s3a.connection.ssl.enabled", "true") 
    
  2. Verify the connection by performing a simple read operation. For example:

    df = spark.read.csv("s3a://your-bucket/path/to/data.csv") 
    df.show() 
    

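To confirm write access as well, you can write a DataFrame back to the bucket. This is a minimal sketch, using a placeholder output path:

    # Write the DataFrame back to the Azure NetApp Files bucket (placeholder path).
    df.write.mode("overwrite").parquet("s3a://your-bucket/path/to/output/")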

More information