On his first day at Databricks in 2013, Michael Armbrust, employee No. 9, began coding Spark SQL.

Twelve years later, Armbrust, now a distinguished engineer, announced at Databricks' annual Data + AI Summit in June that the company had open sourced two of its platform technologies to Apache Spark. The news demonstrates Databricks' continued focus on building out Spark, a project that has served as the company's playbook since its inception.

Databricks CTO Matei Zaharia created Spark in 2009 at the University of California, Berkeley's AMPLab as a platform for distributed machine learning. In early 2010, the codebase was open sourced, and in 2013, the project became part of the Apache Software Foundation.

Spark provides distributed data processing across compute clusters and coordinates workloads across multiple nodes. That work stands as the foundation for what we see today in Databricks' offerings, dating back to Armbrust's first days at the company.

Zaharia, along with Databricks CEO Ali Ghodsi and Andy Konwinski, Ion Stoica, Patrick Wendell and Reynold Xin, contributed to Spark and formed Databricks in 2013. As active contributors to Spark, the team commercialized the technologies they built to create Databricks' foundational technology.

Their first research project became what we know today as Spark SQL. Called Shark, a name that nods to both Spark and Apache Hive, the technology delivered better performance than Hive through improved query execution and by caching data in a cluster's memory. Perhaps most importantly, Shark integrated SQL, which led to the development of Spark SQL, made available with Spark 1.0 in May 2014.

Databricks has historically presented itself as the group of people who started the Spark project, emphasizing simplicity, getting better value from data and its open source roots. Over the years, the company has open sourced several of its platform technologies:

2014-2017: Apache Spark contributions.
2018: MLflow.
2019: Koalas and Delta Lake.
2021: Delta Sharing and an initial release of Unity Catalog.
2024: Unity Catalog OSS and DBRX.
2025: Spark Declarative Pipelines and real-time mode.

At the Data + AI Summit in San Francisco last month, Databricks open sourced its Declarative Pipelines and real-time mode technologies, which enable easier data streaming with low latency.

Declarative Pipelines

The distributed ETL pipeline, initially known as Delta Live Tables, evolved into Lakeflow Declarative Pipelines, which is now open sourced for Apache Spark. The structured streaming capability emerged in Spark through a similar developmental process.

"Structured streaming — we built this team, we got it working before we open sourced it," Armbrust said. "Delta, very similarly, it was a product inside of Databricks for over a year and a half before we open sourced it."

Structured streaming leverages SQL's high-level declarative language, which understands tables, columns, data types and schemas, as well as functions, to process ever-growing input tables. When an engineer adds new rows, the query runs incrementally over the data, producing a new answer while examining only the data that has arrived since the last update.
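To make the model concrete, here is a minimal sketch of an incremental streaming aggregation in PySpark. The table name and checkpoint path are hypothetical; this illustrates the general structured streaming pattern, not a specific Databricks pipeline.

```python
# A minimal sketch (hypothetical table and checkpoint path) of an
# incremental structured streaming query: Spark maintains the running
# aggregate and, on each update, reads only the newly arrived rows.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-counts").getOrCreate()

# Treat a growing table as an unbounded input stream.
events = spark.readStream.table("events")

# A declarative aggregation over the ever-growing input.
counts = events.groupBy("event_type").count()

# "complete" mode emits the full, incrementally maintained answer
# after each update.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/incremental-counts")
    .start()
)
query.awaitTermination()
```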
"There's nothing that a sophisticated engineer couldn't do with Spark, Spark SQL, structured streaming and Delta by hand that you can do with Declarative Pipelines," Armbrust said.

He added, "Declarative Pipelines let you focus on the interesting part, the data transformation, and it abstracts away what I would call undifferentiated heavy lifting."

The Databricks team designed Delta with streaming in mind, Armbrust said. It lets a pipeline transform data across multiple tables, consuming data from one table and pushing the results downstream to the next.

"Our customers often call this the medallion architecture, where you take raw data, you bring it into bronze, you do a little bit of cleaning, you bring it to silver, and then you bring it to finally gold," Armbrust said.

"Gold are the tables that actually have answers for your business. It's always a process to get from bronze dirty data to gold data, and the pipelines and streaming are what enable this. Delta – I think of it as the nodes of this graph. And because it natively supports change data feeds, it allows you to do this incrementally, which is critical for performance at scale."

And time travel? It all comes back to how the data tells a story, Armbrust said. The logs are a record of the content in the tables over time. (A brief code sketch of change data feeds and time travel appears after the real-time mode discussion below.)

"It's no longer just a static collection of data," he said. "It's this living and changing collection of data where you can ask questions about what has changed over time."

And Unity Catalog, also open source, provides governance, notably through rich metadata that allows for fine-grained filtering, Armbrust said. An engineer can annotate columns and tables with descriptions, and an AI assistant can read those comments and use that information to help write queries over the data.

MLflow is another core piece that fits with Declarative Pipelines.

The result is that customers can build end-to-end data and AI workflows using only Databricks technologies while still benefiting from open standards and avoiding vendor lock-in through the open source Apache Spark project.

What Is Real-Time Mode?

Declarative Pipelines benefit from low latency. Real-time mode, also open sourced by Databricks for Apache Spark, expands the aperture for low-latency workflows by enabling structured streaming for operational use cases, thereby transforming the way streaming data is processed.

"Instead of running micro batches, where we decide ahead of time what data is going to be processed, we start long-running tasks that are continually polling for new data," Armbrust said. "And so that means we can process it immediately."

It again shows why streaming is now a first-class citizen. Micro-batching can lead to latency issues, complexity in resource utilization, data quality challenges and difficulty in debugging.
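Spark's trigger API makes the contrast visible. A micro-batch trigger plans each batch ahead of time, while Spark's long-standing (and still experimental) continuous trigger keeps long-running tasks polling the source. Here is a sketch of that contrast with hypothetical Kafka settings; the exact API surface of the newly open sourced real-time mode may differ.

```python
# A sketch contrasting micro-batch scheduling with continuous,
# long-running tasks. The Kafka topic, servers and checkpoint paths
# are hypothetical; the open sourced real-time mode's API may differ.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-modes").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .load()
)

# Micro-batch: Spark decides ahead of time what data each batch covers.
micro_batch = (
    stream.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")
    .option("checkpointLocation", "/tmp/ckpt/micro")
    .start()
)

# Continuous: long-running tasks continually poll for new data and
# process records as they arrive, checkpointing at the given interval.
continuous = (
    stream.writeStream
    .format("console")
    .trigger(continuous="1 second")
    .option("checkpointLocation", "/tmp/ckpt/continuous")
    .start()
)
```

In practice a query uses one trigger mode; both appear here only to juxtapose the scheduling models.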
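Returning to the Delta features Armbrust described above: in the open source delta-spark package, change data feeds and time travel are exposed as reader options, and descriptive metadata of the kind Unity Catalog builds on can be attached in SQL. A minimal sketch, assuming a hypothetical table created with change data feeds enabled:

```python
# A minimal sketch of Delta change data feeds and time travel.
# Assumes delta-spark is installed and the hypothetical table
# "silver_orders" was created with delta.enableChangeDataFeed = true.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-features").getOrCreate()

# Change data feed: read only the row-level changes since a version,
# which is what lets downstream tables update incrementally.
changes = (
    spark.read
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("silver_orders")
)

# Time travel: query the table as it existed at an earlier version;
# the transaction log records the table's contents over time.
snapshot = (
    spark.read
    .option("versionAsOf", 5)
    .table("silver_orders")
)

# Rich metadata for governance: annotate a column so that humans and
# AI assistants can better filter and query the data.
spark.sql(
    "ALTER TABLE silver_orders ALTER COLUMN amount COMMENT 'Order total in USD'"
)
```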
Databricks is making a run in a fast-growing market and faces plenty of competition. VentureBeat has a comprehensive look at Databricks open sourcing Declarative Pipelines, citing Snowflake and how it integrates with Apache NiFi to centralize data from any source into its platform.

The Databricks approach overlaps with multiple vendors. Google offers Cloud Dataflow, Amazon Web Services offers AWS Glue and Microsoft provides Azure Data Factory, all of which market data transformation capabilities. There are also vendors such as Fivetran and Airbyte, which partner with Databricks. And, as mentioned, Snowflake competes directly with Databricks.

Staying True to Open Source Roots

Databricks illustrates why open source companies do so well when they stay committed to their roots while also building out a proprietary platform that accelerates growth. Building an open source project from scratch, transforming it into a platform and using it to set the direction for an entire ecosystem positions Databricks to take on the largest monolithic software companies of the past 20 to 30 years.

Numerous companies have failed in their open source journey. It's not worth naming names; their stories are all quite similar. They face pressure due to a host of factors, go proprietary and struggle to maintain their standing in the community.

The creators behind Spark are still involved with Databricks. Long focused on data analytics, the company has developed a range of products, established partnerships and made acquisitions to cater to the needs of those who create data pipelines as well as those who use them to transform data.

Declarative environments are well known, as is the need to reduce latency, especially as open source communities working on complex pipelines face increasing pressure to implement AI and agent-based frameworks. Getting data into the desired state is the promise of declarative data pipelines, and it shapes how they fit with DevOps code deployments, data operations and the layering of data models with AI that adapts to the user's needs.

The open sourcing of Databricks technology demonstrates how the company contributes back to the open source project it created, strengthening its place in the community. And it's not just the technology that gets contributed: Databricks engineers contribute to the core engine, demonstrating the value they provide while using the same technology as the foundation of the company's platform products.

However, there are downsides to an approach that relies heavily on open source. Foremost is the problem of perception: Do open source companies fine-tune their proprietary platforms over their open source equivalents? Does the open source platform then rank second in importance? These are the kinds of questions that affect any open source provider, and Databricks is not immune to them.