How to Integrate Databricks and MS Fabric to Empower Complete Data Analytics
Databricks and MS Fabric might seem like competitors. However, when integrated, they combine the best capabilities of data engineering and analytics. Positioned as an analytics-first platform, Fabric lets data analysts act as citizen data engineers, ingesting and transforming data without relying on centralized teams. This significantly reduces wait times for analytics. With Power BI users able to access data from various sources directly, Fabric stands poised to change how organizations handle data analytics and reporting.
1. Fabric, a Competitive Tool for Unified Analytics
Microsoft made its MS Fabric Platform generally available on November 15, 2023. Microsoft positioned the product as an analytics-first platform that enables all data analysts to become citizen data engineers. Data analysts can also perform data ingestion and transformation without waiting for the centralized data teams to enable them, drastically reducing the wait time for basic analytics.
Power BI users can now access structured warehouse data, unstructured data from the Lakehouse, or external data through OneLake. They can query this data directly in Direct Lake mode to create reports, vastly improving the analytics workflow.
2. Databricks, the Flag Bearer in Data Engineering Space
Databricks, another unified analytics platform, has gained worldwide popularity. It is mainly known for its data engineering capabilities. Its notebook-based platform enables data engineering, data science, and machine learning workloads on the same platform. Microsoft recognizes that many customers have invested heavily in the Databricks ecosystem. Hence, in its keynote on the Fabric general availability announcement, Microsoft stressed the interoperability between Databricks and Fabric.
3. How do Databricks and Fabric Work Together?
Databricks stores data in the backend enterprise cloud storage in Parquet or Delta formats. Microsoft Fabric is built around the same open-source Delta-Parquet format. Hence, both can operate smoothly with the same data.
Here are the two different ways this can be achieved:
- Access data in object storage through the OneLake shortcut
- Land and build the Lakehouse directly in OneLake
Let’s explore both architectures in detail.
3.1. Access Data in Object Storage through the OneLake Shortcut
OneLake is a Microsoft offering that works as a single software as a service (SaaS) data lake for the whole organization, built around keeping one copy of the data.
Microsoft has introduced the OneLake shortcut feature, which creates a virtual link to data, whether internal to Fabric or external. Shortcuts can point to external stores such as Azure Data Lake Storage (ADLS) Gen2 or Amazon Simple Storage Service (Amazon S3). Because a shortcut only references the data, nothing is duplicated.
Figure 1: OneLake shortcut over external data lake (Image by Aaron Merrill) (https://dataplatformblogcdn.azureedge.net/wp-content/uploads/2023/06/a-diagram-of-a-computer-description-automatically.png)
We can build a shortcut on the existing data lake (ADLS Gen2 or Amazon S3). Databricks can continue to load and transform data on the data lake, while the shortcut in OneLake makes the same data instantly accessible in Fabric, including to Power BI for reporting in Direct Lake mode. No data is moved, and one copy of the data serves both Databricks and MS Fabric.
3.1.1. Let us explore how we can implement this:
- Go to the ADLS Gen2 location on top of which the shortcut needs to be created
- Copy the URL of the container from the container properties
- Now, go to the Fabric workspace and, in the Lakehouse, click on the Tables or Files folder as needed. You will find the new shortcut option
- In the new shortcut wizard, provide the container URL and change "blob" in the URL to "dfs" (the distributed file system endpoint), as below:
Figure 2: Shortcut creation wizard, image by author. Screenshot of the shortcut creation wizard
- If the Azure Data Lake Storage account is in the same tenant, the organizational account can be used to log in. In a multi-tenant environment, a service principal should be given access to the Azure blob container, and its login details are then used to authenticate the shortcut connection
- On the next screen, select the subfolders or files. Once done, the wizard creates a shortcut to the container files. The shortcut appears under Tables or Files in the Lakehouse
Figure 3: OneLake shortcut to ADLS Gen2, image by author, screenshot of shortcut in Fabric
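The blob-to-DFS change requested by the shortcut wizard amounts to rewriting the host name of the container URL. Here is a minimal Python sketch, assuming the URL copied from the portal has the usual form https://<account>.blob.core.windows.net/<container>. The function name and the sample account and container names are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_url_to_dfs(container_url: str) -> str:
    """Swap the Blob endpoint for the DFS endpoint in an ADLS Gen2 URL."""
    parts = urlsplit(container_url)
    host = parts.netloc
    if ".blob." not in host:
        raise ValueError(f"Not a blob endpoint URL: {container_url}")
    # Only the host is rewritten; the container path stays untouched.
    dfs_host = host.replace(".blob.", ".dfs.", 1)
    return urlunsplit((parts.scheme, dfs_host, parts.path, parts.query, parts.fragment))


# Hypothetical storage account and container names, for illustration only.
print(blob_url_to_dfs("https://mystorageacct.blob.core.windows.net/sales"))
```

The resulting URL is what would be pasted into the shortcut wizard in place of the original container URL.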
This architecture is worthwhile for enterprises that already have a significant investment in Databricks. If they have an eligible Power BI license, they can start using MS Fabric immediately through OneLake shortcuts.
3.2. Build the Lakehouse Directly in OneLake
In this section, let us explore how Databricks can connect directly to OneLake storage. This enables us to build the medallion architecture right in OneLake. Different Lakehouses can be used within OneLake to represent the various stages of the medallion architecture.
Figure 4: Medallion architecture built in OneLake (Image by Aaron Merrill) (https://dataplatformblogcdn.azureedge.net/wp-content/uploads/2023/06/a-diagram-of-a-computer-description-automatically.png)
This allows data engineers to use the data without any shortcuts. Power BI can access the data from OneLake in Direct Lake mode. Business users and citizen data scientists can use the same data from OneLake within Microsoft Office tools.
3.2.1. Let us see step by step how we can implement this:
- Create a service principal in Microsoft Entra ID on the same tenant that has MS Fabric
- Create a new security group in Microsoft Entra ID
- Add the new service principal to the security group
- Go to the Power BI tenant settings in the Fabric admin portal and enable API access for service principals, scoped to that security group
Figure 5: Enable Power BI API for Service Principal, screenshot of MS Fabric Admin portal
Once this is done, go to the Fabric portal:
- Create a new workspace that you want to expose to Databricks
- Click the manage access option and add the service principal to the workspace
Figure 6: Add service principal to the workspace, image by author, screenshot of workspace access settings
Figure 7: Workspace access settings screenshot, image by author
- Now create a new Lakehouse in the workspace
Once done, go back to Databricks and connect to the OneLake path of the Lakehouse using a syntax similar to that used to connect to ADLS Gen2.
We need to change the path so that it points to the OneLake application programming interface (API) endpoint, as below:
abfss://stagedbricks@onelake.dfs.fabric.microsoft.com/
Sample code to mount OneLake in Databricks:
Figure 8: Sample code to mount OneLake in Databricks, image by author
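Because the original sample is only available as a screenshot, the following is a minimal sketch and not the author's exact code. It assumes that the standard ABFS OAuth client-credentials settings Databricks uses for ADLS Gen2 also apply to the OneLake endpoint, and that mounting is permitted in your workspace. The function names, the mount point, and the credential placeholders are hypothetical; dbutils is the utility object that Databricks provides inside a notebook.

```python
def onelake_oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Build ABFS OAuth settings for a service principal (client credentials flow)."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_onelake(dbutils, source: str, mount_point: str, configs: dict) -> None:
    """Mount the OneLake path unless the mount point already exists."""
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point not in existing:
        dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)


# Workspace path from the article; everything else below is a placeholder.
ONELAKE_SOURCE = "abfss://stagedbricks@onelake.dfs.fabric.microsoft.com/"

# Inside a Databricks notebook (replace the placeholders first):
# configs = onelake_oauth_configs("<client-id>", "<client-secret>", "<tenant-id>")
# mount_onelake(dbutils, ONELAKE_SOURCE, "/mnt/onelake", configs)
```

In practice, the client secret would normally be read from a Databricks secret scope with dbutils.secrets.get instead of being typed into the notebook.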
Now that OneLake is accessible from Databricks, we can build the complete Lakehouse implementation inside OneLake by creating a separate Lakehouse for each stage.
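One way to organize this is to give each medallion stage its own Lakehouse, addressed by its own OneLake path. The sketch below assumes the OneLake path convention of <lakehouse>.Lakehouse/Tables/<table> under the workspace. The lakehouse names (bronze, silver, gold), the table name, and both helper functions are hypothetical; the workspace name is the one used in the path above.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def table_path(workspace: str, lakehouse: str, table: str) -> str:
    """Return the abfss path of a Delta table inside a Fabric Lakehouse."""
    return f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse/Tables/{table}"


def write_stage(df, path: str) -> None:
    """Overwrite the Delta table at `path` with a Spark DataFrame."""
    df.write.format("delta").mode("overwrite").save(path)


# Hypothetical lakehouse names, one per medallion stage.
STAGES = ("bronze", "silver", "gold")
paths = {stage: table_path("stagedbricks", stage, "orders") for stage in STAGES}

for stage, path in paths.items():
    print(stage, path)
```

In a Databricks notebook, each transformation step would then call write_stage with the DataFrame for that stage and the matching path, so that raw, cleaned, and curated data land in separate Lakehouses.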
4. Conclusion
Azure Databricks is a mature data engineering and data science platform. Power BI, on the other hand, enjoys the lion’s share of business intelligence (BI) and analytics tooling. MS Fabric has supercharged Power BI by enabling Direct Lake mode. Used together, the two platforms let organizations build a more robust data management platform.