Here at Canonical we are excited to announce the first release of our solution for enterprise-ready data lakehouses, built on the combination of Apache Spark and Apache Kyuubi. Using Charmed Apache Kyuubi in integration with Spark, you can deliver a robust, production-grade, open source data lakehouse. Our Apache Kyuubi charm integrates tightly with the Charmed Apache Spark bundle, providing a single, simpler-to-use SQL interface for big data analytics.

Data lakehouse: an architecture overview

The lakehouse architecture for data processing and analytics is a paradigm shift in enterprise data management. Historically, organizations have been forced to trade off between the raw, scalable storage of data lakes and the fast, structured queries of data warehouses. The lakehouse approach bridges this gap, enabling enterprises to store large quantities of structured and unstructured data on a single platform and to perform data streaming, batch processing, and rapid analytics, all wrapped in transactional integrity and governance. Canonical's approach to data lakehousing relies on the integration of Apache Spark and Apache Kyuubi, creating a platform where batch and streaming data can coexist, be processed at scale, and be made available for advanced analytics and AI/ML almost instantly.

At the heart of this lakehouse blueprint is Apache Spark, the industry-standard distributed data processing engine. Spark's in-memory, fault-tolerant architecture lets users run high-throughput ETL, data transformation, and iterative machine learning workloads. Canonical's approach leverages Spark OCI images with Kubernetes as the cluster manager, modernizing standard Spark jobs while optimizing for cost and performance.
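To give a feel for what running Spark with Kubernetes as the cluster manager looks like, a job submission might resemble the sketch below. The API server address, namespace, and image name are placeholders for your own environment, not values shipped by the bundle:

```shell
# Hypothetical spark-submit against a Kubernetes cluster manager;
# substitute your API server URL, namespace, and Spark OCI image.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=<your-spark-oci-image> \
  local:///opt/spark/examples/jars/spark-examples.jar
```

Here Kubernetes schedules the driver and executor pods directly from the OCI image, which is what lets the stack scale elastically instead of relying on a static Spark standalone cluster.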
The integration supports many data sources, including any S3-compatible storage and Azure Blob Storage for data ingestion, as well as external databases as metastores.

One of the greatest challenges in deploying enterprise Spark has always been providing secure, multi-user, easy-to-use SQL access to business users, analysts, and data scientists. That is where Apache Kyuubi shines. Kyuubi is a high-throughput, multi-tenant SQL gateway for Spark that provides a single JDBC and ODBC endpoint, well suited to integration with data exploration tools like Tableau, Power BI, and Apache Superset. Unlike Spark's own Thrift Server, Kyuubi provides true session isolation, so each application or user runs its own secure Spark context. This not only adds a layer of security but also enables fine-grained resource allocation, workload prioritization, and strict auditing: capabilities that are critical for compliance and governance in regulated industries.

A charming lakehouse, fit for an enterprise

Canonical's Spark and Kyuubi lakehouse stack is built for speed and reliability. Deployment is automated end-to-end using Canonical's charmed operators, which oversee the lifecycle of Spark, Kyuubi, and supporting components. This includes automated cluster provisioning, rolling upgrades, fault tolerance, security patching, and cloud-native elastic scaling across Kubernetes environments. Security is built into every layer of the bundle. The release of the Charmed Apache Spark/Kyuubi bundle includes end-to-end encryption, native integration with the Canonical Observability Stack, and security hardening with improved documentation. In addition, we have patched several critical and high-severity CVEs for this launch, enhancing the product's overall security posture.
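Because Kyuubi exposes a single JDBC endpoint, a client tool only needs to assemble a connection URL. The sketch below builds one in Python, assuming Kyuubi's default frontend port of 10009; the helper name and the placement of session settings are illustrative (check the Kyuubi documentation for the exact hive2 URL segment your driver expects), not part of the product:

```python
def kyuubi_jdbc_url(host, port=10009, database="default", session_conf=None):
    """Assemble a hive2-style JDBC URL for a Kyuubi SQL gateway.

    session_conf entries are appended after '#' as key=value pairs,
    one common convention for passing per-session settings.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if session_conf:
        # Join settings deterministically so URLs are reproducible.
        conf = ";".join(f"{k}={v}" for k, v in sorted(session_conf.items()))
        url += f"#{conf}"
    return url


# Example: point a BI tool or beeline at the gateway with a small executor pool.
url = kyuubi_jdbc_url("kyuubi.example.com",
                      session_conf={"spark.executor.instances": "2"})
print(url)
# jdbc:hive2://kyuubi.example.com:10009/default#spark.executor.instances=2
```

Since every user or tool connects through this one endpoint while Kyuubi spawns an isolated Spark context per session, the URL is all a Tableau or Superset administrator needs to configure.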
The bundle now includes backup and restore for Kyuubi, improving reliability and business continuity, and adds in-place upgrades to minimize downtime and complexity. High-availability support allows servers running Kyuubi to be scaled reliably for mission-critical workloads.

The Spark/Kyuubi bundle is platform agnostic, supporting hybrid, multi-cloud, and on-premises deployments. The goal is to avoid vendor lock-in, empowering organizations to optimize cost, performance, and compliance on the infrastructure of their choice. Whether you are greenfielding a new analytics platform or refactoring a legacy Hadoop deployment, Canonical's solution provides an easy way forward, with expert support every step of the way.

Alongside new features and security patches, the release brings improved usability and documentation. The deployment process is fully documented, and the solution is available via the standard Canonical channels, so we encourage you to read the documentation and the release notes, and ultimately to give it a try. We have also recently delivered a webinar, "Open source data lakehouse architecture with Spark and Kyuubi – an engineering deep dive", that you can follow for a guided deployment experience. The result is a more secure and innovative big data analytics stack that enterprises can deploy on-premises or in the cloud. With this launch, organizations can move forward with confidence that they are benefiting from the latest open source developments in big data analysis.

Spark and Kyuubi: try it today

In summary, Canonical's Kyuubi and Spark-based data lakehouse enables organizations to unify their data architecture, accelerate analytics, and future-proof their data strategy.
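As a sketch of the charmed workflow for high availability, scaling Kyuubi with Juju might look like the following. The application name and channel are assumptions for illustration; check Charmhub and the release notes for the exact charm name and recommended channel:

```shell
# Deploy the Kyuubi charm on a Kubernetes cloud
# (charm name and channel assumed; verify on Charmhub).
juju deploy kyuubi-k8s --channel latest/edge

# Scale to three units for high availability.
juju scale-application kyuubi-k8s 3

# Watch the units come up, refreshing every 2 seconds.
juju status --watch 2s
```

The charmed operator handles the operational details behind these commands, which is the point of the bundle: provisioning, upgrades, and scaling become one-line operations rather than bespoke runbooks.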
By combining open source innovation with enterprise-grade support, Canonical empowers businesses to unlock the true potential of their data – reliably, efficiently, and at scale. We invite data engineers, architects, and IT enthusiasts to test the solution and find out more about how Canonical can help you build the next generation of data-driven applications and insights.

Starting a new big data project? Contact us.