Implementing a Unified Databricks Lakehouse Architecture for Accelerated AI Adoption
1 Introduction
A data warehouse is a centralized repository for structured, relational data that has been cleansed, integrated, and transformed from multiple sources. A data lake is a centralized repository that stores raw, often unstructured data in its native format until it is needed. A lakehouse combines the best of both worlds: a single architecture for storing and processing structured and unstructured data, offering the agility of a data lake with the governance of a Data Warehouse (DWH).
This blog explains the differences between a data warehouse, a data lake, and a data lakehouse, and why a Databricks lakehouse architecture is essential for Artificial Intelligence (AI) adoption across your organization.
2 Key challenges faced by enterprise IT leaders
Modernization of data platforms
Existing on-premises workloads such as Hadoop clusters and Enterprise Data Warehouse (EDW) stores like Oracle and Netezza consume substantial cost and resources, and they limit an enterprise's ability to be agile and innovative.
Modernizing the data platform to a cloud-based solution leads to improved productivity.
Tech consolidation
Recently, the tech stack for managing structured and unstructured data has exploded with the rise of cloud, mobile, social, AI, and IoT data. Maintaining custom technologies for each area and function just isn't sustainable.
Cost optimization
IT leaders need to closely examine their existing infrastructure and performance, particularly regarding rising cloud DWH costs. Costs can quickly spiral out of control as data volumes grow and more users run queries.
Data governance
Risk, governance, compliance, and security have long been fundamental data challenges as data leaders strive to build trust in their data models both internally and externally. Bad-quality data leads to inaccurate analytics, poor decision-making, cost overhead, etc.
Implementing a Unified Data Management (UDM) architecture can effectively address the challenges and expectations of the modern cloud-based data platform.
3 Unified data management architecture
DWH vs. data lake vs. lakehouse
Before diving into the lakehouse, here are the high-level data architecture patterns for the DWH, data lake, and lakehouse, along with a comparison of the three architectural patterns.
Fig 1: DWH, Data Lake, and Lakehouse Architecture comparison
Source: https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
Table 1: DWH, Data Lake, and Lakehouse features
Challenges of DWH
- Suitable only for large IT projects that can absorb high maintenance costs
- Primarily supports BI and reporting use cases
- Limited capability for supporting ML use cases
- Inefficient handling of semi-structured and unstructured data
Challenges of a data lake
- Appending data is hard
- Modifying existing data is difficult
- Query performance is often poor
- No support for transactions
- No enforcement of data quality
Data lakehouse
A data lakehouse combines the best of both worlds of the data lake and the DWH: the performance, concurrency, and data management of EDWs with the scalability, low cost, and workload flexibility of the data lake.
A lakehouse enables optimized AI and BI directly on big data in data lakes by storing it in object storage and providing transaction control through the Delta format.
Lakehouse caters to all the use cases, can store and process all data types, and implements open standards.
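The transaction control mentioned above comes from a write-ahead transaction log: Delta Lake records each table version as a numbered JSON commit file in a `_delta_log` directory, and a new version only becomes visible once its commit file lands atomically. The following toy pure-Python sketch illustrates that idea only; it is a simplification, not the real Delta protocol or API:

```python
import json
import os
import tempfile

def commit(log_dir: str, actions: list) -> int:
    """Atomically commit a list of actions as the next table version.

    Mirrors the idea behind Delta Lake's _delta_log: each version is a
    numbered JSON file, and readers only ever see fully written commits.
    """
    version = len([f for f in os.listdir(log_dir) if f.endswith(".json")])
    tmp_path = os.path.join(log_dir, f".{version}.json.tmp")
    final_path = os.path.join(log_dir, f"{version:020d}.json")
    with open(tmp_path, "w") as f:
        json.dump(actions, f)
    os.rename(tmp_path, final_path)  # atomic rename = the commit point
    return version

def snapshot(log_dir: str) -> list:
    """Replay all committed versions to reconstruct the current table state."""
    files = sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
    actions = []
    for name in files:
        with open(os.path.join(log_dir, name)) as f:
            actions.extend(json.load(f))
    return actions

log_dir = tempfile.mkdtemp()
commit(log_dir, [{"add": "part-000.parquet"}])   # version 0
commit(log_dir, [{"add": "part-001.parquet"}])   # version 1
print(snapshot(log_dir))
```

Because the commit point is a single atomic file rename, a reader either sees a version completely or not at all, which is how plain object storage gains ACID-style guarantees.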
Key features of a lakehouse
Transaction support: Support for Atomicity, Consistency, Isolation, and Durability (ACID) transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL.
BI support: Lakehouses enable the use of BI tools directly on the source data, which reduces latency.
Openness: The storage formats, such as Parquet and Delta, are open and standardized.
Storage is decoupled from compute: Storage and computing use separate clusters.
Support for diverse workloads: Supports data science, machine learning, SQL, and analytics.
End-to-end streaming: Support for streaming eliminates the need for separate systems to serve real-time data applications.
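The "diverse workloads" and "end-to-end streaming" points above share one underlying idea: the same table and the same transformation logic serve both batch and incremental processing. The sketch below is a language-agnostic illustration of that principle in plain Python; the function names and record fields are invented for the example and are not a Databricks API:

```python
def clean(record):
    """Shared transformation: drop malformed records, normalize types."""
    if record.get("amount") is None:
        return None
    return {"id": record["id"], "amount": float(record["amount"])}

def run_batch(records):
    """Apply the transformation to a full dataset in one pass."""
    return [r for r in (clean(x) for x in records) if r is not None]

def run_stream(micro_batches, table):
    """Apply the same logic incrementally, one micro-batch at a time."""
    for batch in micro_batches:
        table.extend(run_batch(batch))
    return table

raw = [{"id": 1, "amount": "9.5"},
       {"id": 2, "amount": None},   # malformed record, dropped by clean()
       {"id": 3, "amount": "4"}]

batch_result = run_batch(raw)
stream_result = run_stream([raw[:2], raw[2:]], [])
print(batch_result == stream_result)  # same output either way
```

Because one code path handles both modes, there is no separate system to build and keep in sync for real-time data applications.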
Benefits of lakehouse
Unify data teams: Unifies all data teams of data engineers, data scientists, and analysts on one architecture.
Break data silos: Facilitates breaking data silos by providing a complete and consistent copy of all your data in a centralized location.
Prevent data from becoming stale: You can process batch and streaming data, so your data is never stale.
Reduces cost: One system serves both DWH and ML workloads, with data stored in inexpensive object storage such as Amazon S3, Azure Blob Storage, etc.
Simplifies data governance: Eliminate the operational overhead of managing data governance on multiple tools.
Simplifies ETL jobs: Minimize the Extract, Transform, and Load (ETL) process by connecting the query engine directly to the data lake.
Connects directly to BI tools: Supports the connection to popular BI tools like Tableau, PowerBI, etc.
Gartner Hype Cycle for Data Management, 2022
As per the hype cycle below, the lakehouse is expected to reach the plateau of productivity within the next two to five years.
Fig 2: Gartner Hype Cycle for data management
Source: https://www.databricks.com/resources/ebook/hype-cycle-for-data-management
4 Lakehouse implementation using Databricks
Databricks lakehouse platform
The Databricks lakehouse platform is built on open source and open standards. It ensures the data quality, performance, security, and governance expected from a data warehouse. Data only needs to exist once to support all data, AI, and BI workloads on one common platform, establishing a single source of truth.
Organizing Databricks lakehouse platform
Databricks lakehouse can ingest petabytes of data with auto-evolving schemas. It automatically and efficiently tracks incoming data without manual intervention, inferring schemas and detecting column changes across structured and unstructured data formats.
Databricks recommends the Bronze, Silver, and Gold layered (medallion) architecture, which lets you easily merge and transform new and existing data in batch or streaming mode.
Table 2: Features of medallion architecture using Bronze, Silver, and Gold
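The Bronze, Silver, and Gold layers can be pictured as three progressively refined tables: raw data exactly as it landed, cleansed and conformed data, and business-level aggregates ready for consumption. The toy sketch below uses plain Python standing in for Delta tables; the column names and records are invented purely for illustration:

```python
from collections import defaultdict

# Bronze: raw events exactly as ingested, including bad records.
bronze = [
    {"order_id": "1", "region": "EU", "amount": "120.0"},
    {"order_id": "2", "region": "US", "amount": "bad"},   # malformed value
    {"order_id": "3", "region": "EU", "amount": "80.0"},
]

# Silver: cleansed and typed; malformed rows are filtered out.
silver = []
for row in bronze:
    try:
        silver.append({"order_id": row["order_id"],
                       "region": row["region"],
                       "amount": float(row["amount"])})
    except ValueError:
        continue  # a real pipeline would quarantine these for inspection

# Gold: business-level aggregate ready for BI dashboards.
gold = defaultdict(float)
for row in silver:
    gold[row["region"]] += row["amount"]

print(dict(gold))
```

Keeping the raw Bronze copy intact means the Silver and Gold layers can always be rebuilt when cleansing rules or business definitions change.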
Databricks SQL for DWH-like experience
Databricks SQL offers a native first-class SQL experience with a built-in SQL editor, rich visualizations, and dashboards, and integrates seamlessly with widely used BI tools.
Databricks AI/ML capabilities
The Databricks lakehouse helps orchestrate the end-to-end ML lifecycle, automating it with tools such as the Data Science Workspace and MLflow.
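At its core, the experiment-tracking part of that lifecycle is simple: each training run records its parameters and resulting metrics so experiments stay comparable and reproducible. The sketch below shows the pattern in plain Python; it is not the MLflow API, and the `train` function and its hyperparameters are invented for illustration:

```python
runs = []  # stand-in for an experiment-tracking store

def track_run(params, train_fn):
    """Train with the given hyperparameters and record params + metrics."""
    metrics = train_fn(**params)
    runs.append({"params": params, "metrics": metrics})
    return metrics

def train(learning_rate, epochs):
    # Toy "model": pretend accuracy improves with more epochs.
    return {"accuracy": round(min(0.99, 0.5 + 0.05 * epochs), 2)}

track_run({"learning_rate": 0.1, "epochs": 5}, train)
track_run({"learning_rate": 0.01, "epochs": 8}, train)

# With every run logged, picking the best configuration is a query.
best = max(runs, key=lambda r: r["metrics"]["accuracy"])
print(best["params"])
```

Tools like MLflow add to this pattern things the sketch omits: artifact storage, model versioning, and deployment stages.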
5 Conclusion
Many Fortune 500 organizations, such as AT&T, Shell, and ABN AMRO, have chosen the Databricks lakehouse architecture for purposes ranging from accelerating AI adoption across their operations to democratizing data. As the pioneer of the lakehouse architecture, Databricks has a first-mover advantage, with new features introduced regularly to make the offering more comprehensive.
6 References
https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
https://www.databricks.com/resources/ebook/hype-cycle-for-data-management