Hey Rosalina5!
It sounds like you're trying to set up a connection between Azure Data Lake Storage (ADLS) Gen2, Databricks, and Azure Data Factory (ADF) for your data processing needs. Here's a comprehensive step-by-step guide that you can follow:
#### 1. Set Up ADLS Gen2:
- Ensure you have created an ADLS Gen2 storage account (a storage account with the hierarchical namespace enabled). Note down the storage account name and the container (file system) name you will be using; a sketch of creating one programmatically follows this step.
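If you prefer to create the account in code rather than in the portal, here is a minimal sketch using the `azure-mgmt-storage` package. The subscription ID, resource group, account name, and region are all placeholders; the key detail for ADLS Gen2 is enabling the hierarchical namespace:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

storage_client = StorageManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# ADLS Gen2 = a StorageV2 account with the hierarchical namespace enabled
poller = storage_client.storage_accounts.begin_create(
    "<your-resource-group>",
    "<your-storage-account-name>",
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="<your-region>",
        is_hns_enabled=True,  # this is what makes it an ADLS Gen2 account
    ),
)
account = poller.result()
print(account.primary_endpoints.dfs)  # the dfs endpoint you will use from Databricks and ADF
```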
#### 2. Configure Permissions:
- Assign the **Storage Blob Data Contributor** role on the storage account to the service principal that Databricks will use. This is critical for enabling access to your storage account; a hedged sketch of assigning the role programmatically follows this step.
- For details on granting this access, refer to the documentation on how to grant the service principal access to Azure Data Lake Storage Gen2.
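If you want to script the role assignment instead of using the portal, a minimal sketch with the `azure-mgmt-authorization` package could look like the following. The subscription ID, resource group, and service principal object ID are placeholders, and argument names can vary slightly between SDK versions, so treat this as an illustration rather than a drop-in script:

```python
# Sketch: assign "Storage Blob Data Contributor" to a service principal on a storage account.
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<your-subscription-id>"

# Scope the assignment to the storage account itself
scope = (
    f"/subscriptions/{subscription_id}"
    "/resourceGroups/<your-resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<your-storage-account-name>"
)

# Built-in role definition ID for "Storage Blob Data Contributor"
role_definition_id = (
    f"/subscriptions/{subscription_id}"
    "/providers/Microsoft.Authorization/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
auth_client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),  # role assignments are keyed by a GUID
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id="<service-principal-object-id>",  # the object ID, not the application (client) ID
        principal_type="ServicePrincipal",
    ),
)
```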
#### 3. Set Up Databricks to Access ADLS Gen2:
- In your Azure Databricks workspace, create a new cluster if you don't already have one.
- Configure Spark to authenticate to your ADLS Gen2 account with the service principal credentials.
- Use the following Spark configuration settings (with your actual values):
spark.conf.set("fs.azure.account.auth.type.<your-storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<your-storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") spark.conf.set("fs.azure.account.oauth2.client.id.<your-storage-account-name>.dfs.core.windows.net", "<your-client-id>") spark.conf.set("fs.azure.account.oauth2.client.secret.<your-storage-account-name>.dfs.core.windows.net", "<your-client-secret>") spark.conf.set("fs.azure.account.oauth2.client.endpoint.<your-storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<your-tenant-id>/oauth2/token")
#### 4. Connecting ADF to ADLS Gen2:
- To set up the connection in ADF, create a new linked service for your ADLS Gen2 storage and provide the necessary credentials (service principal or account key).
- Make sure the linked service properties point at the correct endpoint, `https://<your-storage-account-name>.dfs.core.windows.net`, and that the authentication details match the credentials you configured above. A hedged sketch of creating this linked service with the ADF Python SDK follows this step.
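If you would rather define the linked service in code than in the ADF Studio UI, a minimal sketch with the `azure-mgmt-datafactory` package might look like this. The resource group, factory name, and the linked service name `AdlsGen2LinkedService` are placeholders I have chosen for illustration:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService,  # linked service type for ADLS Gen2
    LinkedServiceResource,
    SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# Service principal authentication against the Gen2 (dfs) endpoint
adls_linked_service = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url="https://<your-storage-account-name>.dfs.core.windows.net",
        service_principal_id="<your-client-id>",
        service_principal_key=SecureString(value="<your-client-secret>"),
        tenant="<your-tenant-id>",
    )
)

adf_client.linked_services.create_or_update(
    "<your-resource-group>",
    "<your-data-factory-name>",
    "AdlsGen2LinkedService",  # hypothetical name
    adls_linked_service,
)
```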
#### 5. Connecting Databricks to ADF:
- Use the **Azure Databricks** linked service in ADF to orchestrate your data workflows; a sketch of defining it programmatically is shown below.
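Continuing the same `azure-mgmt-datafactory` sketch, the Databricks linked service could be defined roughly like this, assuming you authenticate with a Databricks personal access token and attach to an existing cluster (the workspace URL, token, cluster ID, and the name `DatabricksLinkedService` are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDatabricksLinkedService,
    LinkedServiceResource,
    SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

databricks_linked_service = LinkedServiceResource(
    properties=AzureDatabricksLinkedService(
        domain="https://<your-databricks-workspace-url>",  # e.g. https://adb-xxxxxxxxxxxxxxxx.x.azuredatabricks.net
        access_token=SecureString(value="<your-databricks-access-token>"),
        existing_cluster_id="<your-cluster-id>",  # or use the new_cluster_* properties for a job cluster
    )
)

adf_client.linked_services.create_or_update(
    "<your-resource-group>",
    "<your-data-factory-name>",
    "DatabricksLinkedService",  # hypothetical name
    databricks_linked_service,
)
```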
#### 6. Integration Testing:
- After setting everything up, run a sample pipeline that reads from ADLS Gen2, transforms data using Databricks notebooks, and writes back to ADLS or another destination. A rough sketch of wiring such a pipeline to a Databricks notebook activity is shown below.
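For reference, a pipeline that invokes a Databricks notebook through the linked service above could be created along these lines. The notebook path and pipeline name are placeholders, and the `type` argument on `LinkedServiceReference` may not be required on older SDK versions, so treat this as a sketch:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity,
    LinkedServiceReference,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# One activity: run a notebook that reads from ADLS Gen2, transforms, and writes back
notebook_activity = DatabricksNotebookActivity(
    name="TransformWithDatabricks",
    notebook_path="/Shared/<your-notebook>",  # hypothetical notebook path
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference",
        reference_name="DatabricksLinkedService",  # the linked service from step 5
    ),
)

pipeline = PipelineResource(activities=[notebook_activity])
adf_client.pipelines.create_or_update(
    "<your-resource-group>", "<your-data-factory-name>", "SamplePipeline", pipeline
)

# Trigger a run to verify the end-to-end flow
run = adf_client.pipelines.create_run(
    "<your-resource-group>", "<your-data-factory-name>", "SamplePipeline", parameters={}
)
print(run.run_id)
```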
#### 7. Using Key Vault (if needed):
- If you want to enhance security, store sensitive values such as the client ID and client secret in Azure Key Vault instead of hard-coding them.
- In that case, set up a Key Vault linked service in ADF and reference the secrets from your ADF pipeline; on the Databricks side, the equivalent is a Key Vault-backed secret scope, as sketched below.
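For example, assuming you have created a Key Vault-backed secret scope in Databricks (the scope and secret names below are placeholders), the Spark configuration from step 3 can pull the client secret at runtime instead of embedding it in the notebook:

```python
# Read the client secret from a Key Vault-backed Databricks secret scope
client_secret = dbutils.secrets.get(scope="<your-secret-scope>", key="<your-client-secret-name>")

spark.conf.set(
    "fs.azure.account.oauth2.client.secret.<your-storage-account-name>.dfs.core.windows.net",
    client_secret,
)
```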
I hope these steps are helpful! If you run into issues or specific errors, let us know so we can troubleshoot further.
### References:
- [Connect to Azure Data Lake Storage and Blob Storage](https://learn.microsoft.com/azure/databricks/connect/storage/azure-storage)
- [Tutorial: Connect to Azure Data Lake Storage](https://learn.microsoft.com/azure/databricks/connect/storage/tutorial-azure-storage)
- [How do I set ACLs correctly for a service principal?](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-access-control#how-do-i-set-acls-correctly-for-a-service-principal)
- [Configure your virtual network to use a Microsoft Entra service endpoint](https://learn.microsoft.com/azure/data-lake-store/data-lake-store-network-security?toc=%2Fazure%2Fvirtual-network%2Ftoc.json#configuration)
- [Create a new application secret](https://docs.microsoft.com/azure/active-directory/develop/howto-create-service-principal-portal#create-a-new-application-secret)
If this answers your query, do click `Accept Answer` and `Yes` if the answer was helpful. If you have any further questions, do let us know.