Why Startups Are Betting Everything on Apache DataFusion

A growing number of new systems are needed to ingest, organize, and query multimodal data in near real time, both to feed context-hungry AI and to support real-time applications such as industrial infrastructure monitoring, analytics dashboards, and high-throughput event streams. The same underlying trends that spawned database and data infrastructure companies in the last 20 years continue to accelerate as we fully enter the age of AI.

The techniques required for high-performance analytic systems are now well understood, but they were previously available only in a small number of proprietary, tightly integrated, and expensive enterprise products, given the enormous engineering investment required to implement them. Thankfully, an increasing number of new, innovative analytic systems can be built using Apache DataFusion. This permissively licensed open-source library offers the same level of technology with a much lower barrier to entry.

Many new startups are using DataFusion in their products, including Flarion, Wayfare.ai, LakeSail, Embucket, Feldera, and Pydantic Logfire. They join the ranks of more mature startups, such as Synnada, Polygon.io, Greptime, LanceDB, WrenAI, SpiceAI, Cube, OpenObserve, InfluxData, and Coralogix, as well as established companies like Apple, eBay, and Datadog, which use DataFusion to optimize internal systems and processes. Finally, the last 12 months saw the first wave of acquisitions of DataFusion-powered startups: SDF Labs by dbt Labs and Arroyo by Cloudflare.

Why DataFusion Matters Now

DataFusion is a query engine written in Rust, optimized for columnar formats like Apache Parquet. It’s part of a broader movement toward composable, high-performance systems built on open standards. With fast vectorized execution, flexible extension points, and a large and rapidly growing community, DataFusion has evolved from experimental to essential.

Databases are complex technology.
Query languages such as SQL, optimizers, execution engines, and storage formats must all work quickly and efficiently with arbitrary user queries within defined resource constraints.

At InfluxData, we bet early on DataFusion, basing InfluxDB 3 on the FDAP stack: Apache Arrow Flight, DataFusion, Arrow, and Parquet. All open source, all governed by the Apache Software Foundation. That decision allowed most of our engineering team to focus on what matters for time series (ingestion speed, real-time queries, compaction, and scale) while using shared open source infrastructure for everything else.

It paid off. Today, every aspect of data processing in InfluxDB 3 is governed by a DataFusion plan, and we execute tens of millions of plans per day in production. When the community contributes to DataFusion, those improvements flow directly into InfluxDB 3, just as our own contributions to DataFusion are shared with other users.

We’re not alone in this shift. The increasing adoption of Open Data Lake architectures and formats, such as Parquet and Iceberg, necessitates new, optimized systems. Apache DataFusion, with its reusable high-performance vectorized engine and open format support, is well-suited for building these next-generation systems.

Figure 1: The next generation of analytic systems is being built around an Open Data Lake (typically Parquet files stored on Object Storage). This new architecture will spawn a large number of new, specialized processing engines tailored for specific use cases, and many will be powered by DataFusion.

The Apache DataFusion Revolution: From Experimental to Essential in 2024

This past year, DataFusion was elevated to a Top-Level Apache project in recognition of its maturity and momentum. DataFusion 43.0.0 was (briefly) the fastest engine for querying Apache Parquet files in ClickBench, outperforming DuckDB, ClickHouse, and other C/C++-based engines.
This was a watershed moment: the first time a Rust engine topped the leaderboard.

Such performance does not come easily. It took concerted effort from dozens of contributors to deliver deep, low-level optimizations, everything from smarter memory layouts (StringView) to skipping wasteful aggregations and rethinking how multi-column groupings are stored and compared.

How DataFusion’s Community-Driven Development Powers Enterprise Growth

DataFusion’s strength isn’t the code, which is, after all, free for anyone to use. The strength is in the community.

DataFusion does not have the luxury of a VC-funded startup paying people to work on it full time. Instead, our users aren’t passive adopters; they’re active contributors. We rely on each other to find enough value in the project to contribute back. Contributors from large companies, startups at all stages, students, and hobbyists all work together to push the project forward. Every optimization, every fix, and every feature eventually makes its way back into the ecosystem, benefiting everyone. While this approach has its challenges, it has enabled hundreds of developers to come together across time zones, job titles, companies, and industries to build something that no single team could have achieved alone.

The momentum behind DataFusion is palpable. It’s no longer just a component; it’s increasingly part of the foundation for an entire ecosystem of next-generation analytic systems. Many exciting projects are currently underway, such as first-class support for unstructured data, improved filtering, late materialization, and dynamic pushdowns, as well as easier-to-use Apache Iceberg support. Work is also underway on faster processing of larger-than-memory datasets, subqueries, and more.

Building the Future: DataFusion’s Role in AI-Driven Real-Time Analytics

If you’re building a data platform where performance matters, take a serious look at DataFusion.
It’s fast, open, extensible, and battle-tested, as demonstrated by the numerous companies that have staked their products’ futures on it. I genuinely believe DataFusion is at an inflection point: we are seeing a real acceleration in community growth, in raw performance, in feature depth, and in robustness. However, as with all mature, full-featured software, it now requires additional investment to drive the project forward.

The community is growing, and so is the skill level of its contributors, but as we continue to add new and innovative things, we need your help. If you’ve ever dreamed of learning about, contributing to, and shaping the internals of a query engine, now’s the time to dive in. Join the community, contribute code, review PRs, test systems, and file bugs. The future is composable, and DataFusion is driving it forward.

The post Why Startups Are Betting Everything on Apache DataFusion appeared first on The New Stack.