Data Intelligence Platform: The New Databricks Avatar Is Set to Revolutionize Data Platforms
When OpenAI launched ChatGPT, it created a sudden buzz around generative AI and how AI would disrupt the way we interact with technology. Many regard it as the most significant disruption since the advent of the internet or cloud computing.
Data platforms are no exception. Soon after the generative AI wave began, many companies started to invest in AI-enabled code assistants, conversational bots for data discovery, and AI integrated into pipeline design and development. However, these efforts were still limited by challenges such as skill gaps, data quality issues, governance, and the difficulty of understanding data semantics.
Let’s explore how Databricks is becoming a Data Intelligence Platform
On November 15, 2023, Databricks proposed a direction in which data platforms should move to solve these challenges, and named this next generation of data platforms Data Intelligence Platforms. The concept builds on its Lakehouse architecture, which Databricks introduced in 2020.
Figure 1: Data Intelligence Engine, Image source: Databricks (https://cms.databricks.com/sites/default/files/inline-images/blog-marketecture-1.png)
The core ideas behind the Data Intelligence Platform, as shared by Databricks CEO Ali Ghodsi, are as follows:
- Natural language access to data
- Automatic reading of the semantic catalog and data discovery
- Automated management
- Enhanced security
- Support for enhanced AI workloads
Now, let’s explore how all of these components work together under the hood.
1 Architecture
The Databricks Data Intelligence Platform is centered around the core Data Lake, where raw data is stored in an open format. Delta Lake, built on top of it, gives the data lake atomicity, consistency, isolation, and durability (ACID) properties. Unity Catalog offers unified governance capabilities and automatic loading abilities.
Additionally, the Data Intelligence Platform uses the Data Intelligence Engine (DatabricksIQ) to optimize every aspect of the platform. This engine optimizes storage within the Data Lake and both generates and reads metadata in Unity Catalog.
Figure 2: Data Intelligence Platform Architecture, Image source: Databricks (https://cms.databricks.com/sites/default/files/inline-images/blog-marketecture-1.png)
All of this knowledge about the data and metadata is then used to drive intelligence across the platform: optimizing computation, ensuring data quality, generating structured query language (SQL) and code from natural-language text, and training, deploying, and fine-tuning large language models (LLMs) and AI apps.
Let’s look at some capabilities that empower the Databricks Data Intelligence Platform.
1.1 AI documentation using Unity Catalog
Databricks has introduced AI-generated documentation in its Unity Catalog, which will simplify the organization’s documentation, data discovery, and metadata management.
Let’s face it: who loves documentation? No one, right? With the new AI documentation, the Unity Catalog can generate plain English documentation for all the tables and columns.
Databricks has also kept humans in the loop. Users can review, edit, or accept the auto-generated metadata, which keeps descriptions correct and aligned with the specific use case and domain knowledge.
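The review step can be pictured with a minimal sketch. This is not Databricks code or the Unity Catalog API: the `ColumnDoc` class, the `review` function, and the sample columns are all hypothetical names invented here to illustrate an accept/edit/reject workflow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnDoc:
    """Documentation state for one column (hypothetical structure)."""
    column: str
    suggested: str                  # description proposed by the AI
    approved: Optional[str] = None  # description a human has signed off on


def review(doc: ColumnDoc, action: str, edited_text: Optional[str] = None) -> ColumnDoc:
    """Apply a reviewer decision: 'accept', 'edit', or 'reject'."""
    if action == "accept":
        doc.approved = doc.suggested
    elif action == "edit":
        if not edited_text:
            raise ValueError("an edit needs replacement text")
        doc.approved = edited_text
    elif action == "reject":
        doc.approved = None
    else:
        raise ValueError(f"unknown action: {action}")
    return doc


docs = [
    ColumnDoc("order_ts", "Timestamp when the order was placed."),
    ColumnDoc("amt", "Amount of the order."),
]
review(docs[0], "accept")
review(docs[1], "edit", "Order total in EUR, including VAT.")

# Only descriptions a human has approved would be published.
published = {d.column: d.approved for d in docs if d.approved is not None}
print(published)
```

The point of the design is that the AI suggestion and the approved text are stored separately, so nothing reaches the catalog without a reviewer decision.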
1.2 Semantic search
Semantic search lets users search across the data landscape and returns the data most relevant to their query. It is powered by the plain-English descriptions that AI auto-generates in the catalog.
Information discovery has always been a challenge in big organizations. Data engineers spend plenty of time explaining what specific data means or simply helping colleagues find the table that answers their questions. Now, users can search the data themselves.
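To make the idea concrete, here is a toy sketch of ranking tables by how well their descriptions match a query. It uses simple word-overlap cosine similarity as a stand-in; an actual semantic search would typically use embeddings so that synonyms also match, and nothing here reflects how Databricks implements it. The table names and descriptions are made up.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def vectorize(text: str) -> Counter:
    """Lowercase bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, catalog: Dict[str, str], top_k: int = 2) -> List[str]:
    """Return the names of the top_k tables whose descriptions best match the query."""
    q = vectorize(query)
    ranked = sorted(catalog, key=lambda name: cosine(q, vectorize(catalog[name])), reverse=True)
    return ranked[:top_k]


# Hypothetical catalog: table name -> plain-English description.
catalog = {
    "sales.orders": "Customer orders with order date, total amount and shipping country.",
    "hr.employees": "Employee records with hire date, department and salary band.",
    "web.sessions": "Website visits with session duration and referring page.",
}
print(search("total amount of customer orders", catalog, top_k=1))
```

The better the descriptions in the catalog, the better this kind of ranking works, which is why auto-generated documentation and search reinforce each other.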
1.3 Databricks Assistant
Databricks Assistant is a context-aware AI assistant that can automatically generate SQL queries or Python code. It can also explain existing code, format it, and fix issues. Further, Databricks Assistant leverages Unity Catalog metadata to understand the data in tables and columns. It even understands the descriptions of popular data assets and provides personalized responses.
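One way to picture how catalog metadata makes an assistant context-aware is a prompt that carries the table and column descriptions along with the question. The sketch below is an illustration only, not the Assistant's actual prompt format; `build_prompt` and the sample metadata are hypothetical.

```python
from typing import Dict


def build_prompt(question: str, table: str, columns: Dict[str, str]) -> str:
    """Combine a user question with catalog metadata into one LLM prompt."""
    column_lines = "\n".join(f"- {name}: {desc}" for name, desc in columns.items())
    return (
        f"Table: {table}\n"
        f"Columns:\n{column_lines}\n"
        f"Question: {question}\n"
        "Answer with a single SQL query."
    )


# Hypothetical metadata, as it might be read from a catalog.
prompt = build_prompt(
    "What was the revenue per country last month?",
    "sales.orders",
    {
        "order_ts": "Timestamp when the order was placed.",
        "amt": "Order total in EUR, including VAT.",
        "country": "Shipping country (ISO code).",
    },
)
print(prompt)
```

Without the column descriptions, a model would have to guess that `amt` is the revenue column; with them, the mapping from question to SQL is far less ambiguous.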
Databricks Assistant can also generate charts from a previously defined Lakeview dataset. Users describe what they want to learn from the data, and the Assistant generates a matching chart. The Assistant can also be used to edit the charts.
N.B. Chart generation is still in preview, and users should always review visualizations generated by the Assistant to verify correctness.
1.4 Auto optimization
DatabricksIQ is Databricks’s new Data Intelligence Engine, which is deeply integrated into all its products. It powers the auto-optimization of different services within the Databricks Data Intelligence Platform. For example, it can automatically index columns and partition the data. This improves the Lakehouse’s performance and lowers the total cost of ownership (TCO).
1.5 Serverless compute
In its Data Engineering in the Age of AI conference, Databricks showed how serverless compute can run jobs from within workflows. Once serverless compute is selected, a user can choose to run the job as quickly or as cheaply as possible.
Databricks then handles all the administration and scaling of the serverless compute, optimizing it through its Data Intelligence Platform.
N.B. This capability has only been announced and is just starting to be previewed.
1.6 Run generative AI functions
Generative AI functions can be executed from within the Databricks platform. For example, the ai_generate_text function can be called from within a Databricks notebook using SQL. This function can call a ready-built LLM (such as one from OpenAI) to generate text such as automatic descriptions.
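As a rough sketch, a notebook cell could build such a SQL call in Python. The argument order assumed here (prompt, model name, then name/value pairs such as 'apiKey' with a SECRET lookup) and the model string "openai/gpt-3.5-turbo" are assumptions about the preview function and should be checked against the current Databricks documentation, since preview functions can change. The helper names and the secret scope and key ("ml", "openai_key") are hypothetical.

```python
def sql_literal(text: str) -> str:
    """Quote text as a Spark SQL string literal, backslash-escaping \\ and '."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_query(prompt: str, model: str, secret_scope: str, secret_key: str) -> str:
    """Build a SELECT that calls ai_generate_text (signature is an assumption)."""
    return (
        "SELECT ai_generate_text("
        f"{sql_literal(prompt)}, {sql_literal(model)}, "
        f"'apiKey', SECRET({sql_literal(secret_scope)}, {sql_literal(secret_key)})"
        ") AS description"
    )


query = build_query(
    "Describe a column named order_ts in one sentence.",
    "openai/gpt-3.5-turbo",  # assumed model identifier
    "ml",                    # hypothetical secret scope
    "openai_key",            # hypothetical secret key
)
print(query)
# In a Databricks notebook, the string could then be passed to spark.sql(query).
```

Keeping the API key in a secret rather than in the query text avoids leaking credentials into notebook history.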
Conclusion and way forward
The Data Intelligence Platform is an ongoing effort at Databricks, and the company has shared its roadmap for building Databricks into a DI platform. Many of the services above are still in preview, and new capabilities are being added frequently.
The deep knowledge of the data and metadata in Unity Catalog is the main differentiator in Databricks’s bid to become the primary Data Intelligence Platform. This deep context allows Databricks to tailor queries and code to the custom use case and specific business data.
Databricks integrates DatabricksIQ with Mosaic AI to enable businesses to create custom AI applications specific to their data. The company is building support for end-to-end RAG (Retrieval Augmented Generation) systems, training custom models or further training existing models on customers’ business and domain-specific data, serverless abstractions, and end-to-end MLOps.
With such a detailed roadmap, the Databricks Data Intelligence Platform certainly looks to be one of the forerunners in democratizing AI and data access.
References:
- Data Intelligence Platforms, by Michael Armbrust, Adam Conway, Ali Ghodsi, Naveen Rao, Arsalan Tavakoli-Shiraji, Patrick Wendell, Reynold Xin, and Matei Zaharia, November 15, 2023, Platform Blog: https://www.databricks.com/blog/what-is-a-data-intelligence-platform
- Data Engineering in the Age of AI: https://www.databricks.com/resources/demos/videos/data-engineering/databricks-data-intelligence-platform
- DatabricksIQ, April 19, 2024: https://docs.databricks.com/en/databricksiq/index.html