Implementing a Unified Databricks Lakehouse Architecture for Accelerated AI Adoption
1 Introduction
A data warehouse (DWH) is a centralized repository for structured, relational data that has been cleansed, integrated, and transformed from multiple sources. A data lake is a centralized repository that stores raw data, including semi-structured and unstructured data, in its native format until it is needed. A lakehouse combines the two in a single architecture where users can store and process structured and unstructured data in one place, pairing the agility of a data lake with the governance of a DWH.
This blog explains the differences between a data warehouse, a data lake, and a data lakehouse, and why a Databricks lakehouse architecture is essential for Artificial Intelligence (AI) adoption across your organization.
2 Key challenges faced by enterprise IT leaders
Modernization of data platforms
Existing on-prem workloads such as Hadoop, and Enterprise Data Warehouse (EDW) stores such as Oracle and Netezza, are costly and resource-intensive for enterprises to run. They also limit the ability to be agile and innovative.
Modernizing the data platform to a cloud-based solution will lead to improved productivity.
Tech consolidation
Recently, the tech stack for managing structured and unstructured data has exploded with the rise of cloud, mobile, social, AI, and IoT data. Maintaining custom technologies for each area and function is not sustainable.
Cost optimization
IT leaders need to closely examine their existing infrastructure and performance, particularly regarding rising cloud DWH costs. Costs can quickly spiral out of control, with more and more data and users running queries.
Data governance
Risk, governance, compliance, and security have long been fundamental data challenges as data leaders strive to build trust in their data models both internally and externally. Poor-quality data leads to inaccurate analytics, poor decision-making, and cost overhead.
Implementing a Unified Data Management (UDM) architecture can effectively address the challenges and expectations of the modern cloud-based data platform.
3 Unified data management architecture
DWH vs. data lake vs. lakehouse
Before we dive into the lakehouse, here are the high-level data architecture patterns for DWH, data lake, and lakehouse, followed by a comparison of the three patterns.
Fig 1: DWH, Data Lake, and Lakehouse Architecture comparison
Source: https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
Table 1: DWH, Data Lake, and Lakehouse features
Challenges of DWH
- Suited mainly to large IT projects that can absorb high maintenance costs
- Primarily supports BI and reporting use cases
- Limited capability for supporting ML use cases
- Inefficient handling of semi-structured and unstructured data
Challenges of a data lake
- Appending data is hard
- Modifying existing data is difficult
- Query performance is poor
- Transactions are not supported
- Data quality is not enforced
Data lakehouse
A data lakehouse combines the best of both worlds: the performance, concurrency, and data management of EDWs with the scalability, low cost, and workload flexibility of the data lake.
A lakehouse enables optimized AI and BI directly on big data stored in data lakes, using object storage for the data and the Delta format for transaction control.
Lakehouse caters to all the use cases, can store and process all data types, and implements open standards.
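To make the idea of transaction control on top of an object store more concrete, here is a minimal sketch in Python. It is not how Delta Lake is actually implemented; it only imitates the general pattern of immutable data files published through an ordered commit log. The names `TinyTable`, `commit`, and `snapshot` are hypothetical, and a local temporary directory stands in for cloud object storage.

```python
import json
import os
import tempfile


class TinyTable:
    """Toy table: immutable data files plus an ordered commit log.

    Illustrative only. A local directory stands in for an object store,
    and this is not Delta Lake's real protocol.
    """

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _next_version(self):
        # The log directory holds one entry per committed version.
        return len(os.listdir(self.log_dir))

    def commit(self, rows):
        """Write a data file, then publish it with a single log entry."""
        version = self._next_version()
        data_name = f"part-{version:05d}.json"
        with open(os.path.join(self.root, data_name), "w") as f:
            json.dump(rows, f)
        # Mode "x" fails if the entry already exists, so two writers
        # cannot both claim the same version number.
        log_path = os.path.join(self.log_dir, f"{version:05d}.json")
        with open(log_path, "x") as f:
            json.dump({"add": data_name}, f)
        return version

    def snapshot(self):
        """Read only the data files that the log lists as committed."""
        rows = []
        for entry in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, entry)) as f:
                data_name = json.load(f)["add"]
            with open(os.path.join(self.root, data_name)) as f:
                rows.extend(json.load(f))
        return rows


root = tempfile.mkdtemp()
table = TinyTable(root)
table.commit([{"id": 1, "amount": 10}])
table.commit([{"id": 2, "amount": 25}])

# A data file written without a log entry is invisible to readers,
# which is how a half-finished write stays out of query results.
with open(os.path.join(root, "part-99999.json"), "w") as f:
    json.dump([{"id": 3, "amount": 999}], f)

committed_rows = table.snapshot()
```

Readers in this sketch never scan the data directory itself; they follow the log, so only fully published writes are visible.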
Key features of a lakehouse
Transaction support: Support for Atomicity, Consistency, Isolation, and Durability (ACID) transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL.
BI support: Lakehouses enable the use of BI tools directly on the source data, which reduces latency.
Openness: The storage formats, such as Parquet and Delta, are open and standardized.
Storage is decoupled from compute: Storage and compute run on separate resources, so each can scale independently.
Support for diverse workloads: Supports data science, machine learning, SQL, and analytics.
End-to-end streaming: Support for streaming eliminates the need for separate systems to serve real-time data applications.
Benefits of lakehouse
Unify data teams: Unifies all data teams of data engineers, data scientists, and analysts on one architecture.
Break data silos: Breaks down data silos by providing a complete and authoritative copy of all your data in a centralized location.
Prevent data from becoming stale: You can process batch and streaming data, so your data is never stale.
Reduces cost: A single system serves both DWH and ML workloads, with data stored in cheap object storage such as Amazon S3 or Azure Blob Storage.
Simplifies data governance: Eliminate the operational overhead of managing data governance on multiple tools.
Simplifies ETL jobs: Minimize the Extract, Transform, and Load (ETL) process by connecting the query engine directly to the data lake.
Connects directly to BI tools: Supports connections to popular BI tools such as Tableau and Power BI.
Gartner Hype Cycle for Data Management, 2022
According to the hype cycle below, the lakehouse is expected to reach the plateau of productivity within 2 to 5 years of the 2022 report.
Fig 2: Gartner Hype Cycle for data management
Source: https://www.databricks.com/resources/ebook/hype-cycle-for-data-management
4 Lakehouse implementation using Databricks
Databricks lakehouse platform
The Databricks lakehouse platform is built on open source and open standards. It ensures the data quality, performance, security, and governance expected from a data warehouse. Data only needs to exist once to support all data, AI, and BI workloads on one common platform, establishing a single source of truth.
Organizing Databricks lakehouse platform
Databricks lakehouse can ingest petabytes of data with auto-evolving schemas. It can also automatically and efficiently track data without manual intervention, infer schema, and detect column changes for structured and unstructured data formats.
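As a rough illustration of what schema inference and column-change detection involve, the sketch below works on plain Python dictionaries. It does not use any Databricks API; the function names `infer_schema` and `evolve_schema` and the sample records are hypothetical, and falling back to a string type on conflicts is an assumption made for this example.

```python
def infer_schema(records):
    """Infer a column -> type-name mapping from a batch of dict records."""
    schema = {}
    for record in records:
        for column, value in record.items():
            type_name = type(value).__name__
            # If a column shows conflicting types, fall back to string.
            if schema.get(column, type_name) != type_name:
                type_name = "str"
            schema[column] = type_name
    return schema


def evolve_schema(current, incoming):
    """Merge a newly inferred schema into the current one.

    Returns the merged schema and the list of newly detected columns.
    """
    new_columns = [c for c in incoming if c not in current]
    merged = dict(current)
    for column in new_columns:
        merged[column] = incoming[column]
    return merged, new_columns


batch_1 = [{"id": 1, "name": "sensor-a"}, {"id": 2, "name": "sensor-b"}]
batch_2 = [{"id": 3, "name": "sensor-c", "temperature": 21.5}]

schema = infer_schema(batch_1)
schema, added = evolve_schema(schema, infer_schema(batch_2))
```

Here the second batch introduces a `temperature` column, which is detected and added to the schema without any manual change to the pipeline.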
Databricks recommends the Bronze, Silver, and Gold layer architecture, also known as the medallion architecture. It lets you easily merge and transform new and existing data in batch or streaming mode.
Table 2: Features of medallion architecture using Bronze, Silver, and Gold
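The layering can be sketched in plain Python, assuming the usual reading of the pattern: Bronze holds raw data as ingested, Silver holds cleansed and deduplicated data, and Gold holds business-level aggregates. The sample records and the function names `to_silver` and `to_gold` are hypothetical; on Databricks these steps would normally run as Spark jobs over Delta tables rather than over in-memory lists.

```python
from collections import defaultdict

# Bronze: raw events kept as ingested, including bad and duplicate rows.
bronze = [
    {"order_id": "1", "region": "EU", "amount": "100.0"},
    {"order_id": "2", "region": "US", "amount": "50.5"},
    {"order_id": "2", "region": "US", "amount": "50.5"},  # duplicate
    {"order_id": "3", "region": "EU", "amount": "n/a"},   # unparseable
]


def to_silver(rows):
    """Cleanse: cast types, drop unparseable rows, remove duplicates."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if row["order_id"] in seen:
            continue  # skip repeated order ids
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "region": row["region"],
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate: total amount per region, ready for BI or ML features."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the raw Bronze rows untouched means the Silver and Gold layers can be rebuilt at any time if the cleansing or aggregation rules change.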
Databricks SQL for DWH-like experience
Databricks SQL offers a native first-class SQL experience with a built-in SQL editor, rich visualizations, and dashboards, and integrates seamlessly with widely used BI tools.
Databricks AI/ML capabilities
The Databricks lakehouse helps orchestrate and automate the end-to-end ML lifecycle using tools such as Data Science Workspace and MLflow.
5 Conclusion
Many Fortune 500 organizations, such as AT&T, Shell, and ABN AMRO, have chosen the Databricks lakehouse architecture for purposes such as accelerating AI adoption across operations and democratizing data. As the pioneer of the lakehouse architecture, Databricks has a first-mover advantage, and it regularly introduces new features to make the offering more comprehensive.
6 References
https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
https://www.databricks.com/resources/ebook/hype-cycle-for-data-management