Stop Freezing Your Data to Death

When it comes to data retention for logs, it’s cold out there — at least for enterprises using tiered solutions to cut costs. Tiered storage systems usually include some variation of hot, warm, cold and frozen tiers for managing data.

But data doesn’t typically stay hot for very long, especially for enterprises ingesting terabytes of log data every day. It’s quickly moved to the cold and eventually the frozen tier, where at best it’s inconvenient to rehydrate, and at worst it becomes dark data — a costly source of lost insights and potential security risks.

Frozen storage, or various methods of discarding data such as downsampling, might seem to be the only options when the volume of log data keeps growing and costs are skyrocketing. But that way of thinking is outdated: it assumes data has to sit on tightly coupled, expensive hardware (such as SSDs) to stay hot for queries.

The real problem is the underlying storage architectures, which aren’t designed for ingesting, storing and querying high volumes of log data cost-effectively. Rebuilding legacy systems from the ground up is costly and disruptive, so both the high costs and the compromises are passed on to customers.

In recent years, a new approach has emerged that maximizes the performance of object storage and provides a much better alternative. It’s now possible for enterprises to use solutions built on object storage to keep all their data hot for real-time analytics while remaining cost-effective.

One of the most recent developments in this space is the announcement of AWS S3 Tables, which uses Apache Iceberg to partition and optimize object storage. Tools like Iceberg provide wrappers around object storage, dramatically improving the performance of data lakes. Meanwhile, solutions like Hydrolix provide both real-time and long-term historical analytics of log data by maximizing the performance of object storage — all without needing to build a solution from the ground up using tools like Iceberg.

With these approaches, you no longer have to choose between keeping your data hot and keeping costs down. If your business is compromising access to data in order to cut costs, it’s probably time to rethink your storage solution.

Let’s explore some of the issues with tiered storage, the benefits of keeping all data hot, and how modern storage solutions maximize the performance of object storage to deliver cost-effective, low-latency queries over petabytes of data spanning years.

The Problems With Tiered Storage

Frozen storage can cut costs compared to tightly coupled, expensive hot storage. But that’s where the benefits of frozen storage end and the downsides begin. Frozen storage is inconvenient to rehydrate, so it’s rarely queried and quickly goes dark. It’s much slower than hot storage, inaccessible for machine learning runs and surprisingly expensive as a whole — mainly because there tends to be so much of it and it provides so little long-term value. In some cases, pipelines and data replicas are needed to move data between tiers, adding complexity and operational overhead.

As a result, the tiered data paradigm freezes teams into an outdated, legacy approach where log data is only valuable for short-term operational insights such as observability.
From this perspective, only the last few weeks of data matter for high-performance analytics, and the only remaining value of log data is for compliance and security purposes.

However, this runs counter to the approach many forward-thinking enterprises are taking to federate and democratize access to data, making that data available to teams and analysts in the tools they already use. That includes not just operations but also business intelligence (BI), data science, cybersecurity and teams developing machine learning models.

The Benefits of Long-Term Hot Storage

By eliminating the high costs that traditionally come with hot storage, enterprises can unlock a wide range of benefits that extend well beyond the teams and use cases mentioned above. In contrast to frozen storage, once the cost consideration is gone, there is only upside to keeping all data hot. Fully hot storage provides the following benefits:

The ability to compare current and historical data: With tiered storage, operations teams typically only have insight into a few weeks or, at most, a few months of data. With long-term hot data, it’s possible to compare real-time data against data from last week, last month or even last year. It’s much easier to track cyclical events and the behavior of specific users (such as malicious actors), and to spot patterns and trends you might not be able to uncover otherwise. For enterprises using frozen storage, historical data typically has to be rehydrated before it can be queried, and those queries run much more slowly. To make matters worse, queries that traverse multiple tiers of storage are bottlenecked, and the benefits of hot storage are lost.

Easier data management: By keeping all data hot, there’s no need to manage multiple data tiers, back up data as it moves between tiers or maintain potentially complex pipelines for moving it. You also eliminate difficult decisions about data management, such as how long each kind of data should reside in each tier before it’s moved.

Increased ability to federate and democratize data across the organization: Long-term hot storage benefits enterprises that are looking to democratize their log data and make it accessible to teams beyond operations.

Bringing dark data to light: Because frozen storage is inconvenient to rehydrate, it’s often a significant source of dark data. Data that stays hot is much less likely to go dark. This helps mitigate the risks that come with dark data, such as the possibility that inaccessible data is concealing important evidence of malicious attacks and breaches. It also brings potential value to data that wouldn’t otherwise have it.

Unlocking ML, Cybersecurity and BI Use Cases

Beyond these benefits, there are many use cases for long-term, historical hot data that are much harder, or even impossible, with frozen storage. The following three use cases — across cybersecurity, machine learning and business intelligence — are just a few examples of why long-term hot data retention matters.

Threat hunting: The average breach takes 272 days to detect, which is outside the hot data retention window for many platforms. In fact, it’s common for malicious actors to use a “low and slow” approach, making it harder to detect suspicious patterns and prevent intrusions from becoming serious breaches. When data is quickly moved into frozen storage, it becomes impossible to detect patterns that unfold over months or even years. Instead, forensic analysis only happens after a breach has occurred and the damage is done.
Training machine learning models: Just about everyone is talking about harnessing the power of AI, but many enterprises are still trying to figure out exactly what that means. One of the challenges is generating high-quality data sets to ensure that models are accurate. Log and systems data can provide high-fidelity, long-range data sets for use cases like anomaly detection and capacity planning. But frozen data creates access blockers, increasing the time and effort needed for training runs. Ultimately, machine learning models should work with “hot” data sets — any data that doesn’t meet this criterion can limit the efficacy of the model.

BI and data science: Logs provide far more than a record of how your applications are performing; they usually include detailed information about how users are interacting with your brand, products and sites. BI and data science teams can mine this data for insights that inform product development, inventory planning, marketing campaigns and ad placement. But these insights are only available if teams have full access to data sets, not incomplete data spanning only a few weeks or months.

With long-term, cost-effective hot data, the question becomes, “What can we do to maximize the value of this data?” instead of, “How long can I keep data accessible without runaway costs?”

Reinventing Hot Storage for Real-Time Analytics

All of these benefits are only possible if object storage is performant enough for real-time analytics. Traditionally, object stores haven’t been the right fit for the low-latency queries real time requires. The distributed nature of object storage makes it both infinitely scalable and extremely cost-effective, but it also means data is physically dispersed rather than closely coupled with query components, leading to higher latency. That’s why object storage is more commonly used for data that’s cold or frozen, not hot.

To maximize the performance of object storage, solutions are built around the following core concepts (a brief sketch below shows how these ideas fit together):

Parallelism: Object stores such as AWS S3 and Google Cloud Storage allow many parallel connections, so solutions running on systems like Kubernetes can write and read data concurrently.

Minimizing the amount of data that needs to be traversed: Techniques like partitioning limit how much data a query has to scan. One common strategy is partitioning by timestamp: when a user runs a time-filtered query, every partition outside the requested time range is pruned from consideration.

Minimizing the amount of data that needs to be transferred: Techniques such as high-density compression and predicate pushdown dramatically reduce the amount of data that has to travel over HTTP between distributed components.

With the right solution, it’s possible to reduce the “time to glass,” including data ingestion, transformation and querying, to a matter of seconds. For example, with Hydrolix, the typical time to glass is less than 10 seconds, even when an enterprise is ingesting millions of log lines per second.
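To make these ideas concrete, here is a minimal Python sketch of a time-filtered query over logs in object storage. It is not Hydrolix’s implementation or API; it assumes, purely for illustration, that logs have already been written to S3 as compressed Parquet files under date-partitioned keys, and the bucket name, key layout, column names and helper functions are all hypothetical.

```python
# A minimal sketch of partition pruning, parallel reads and predicate pushdown
# against object storage. Hypothetical layout:
#   s3://example-log-bucket/logs/dt=YYYY-MM-DD/*.parquet
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

BUCKET = "example-log-bucket"   # hypothetical bucket name
s3 = fs.S3FileSystem()          # credentials and region come from the environment


def partitions_in_range(start: date, end: date):
    """Partition pruning: only the date partitions inside the query window are
    ever listed or fetched; everything outside the range is skipped entirely."""
    day = start
    while day <= end:
        yield f"{BUCKET}/logs/dt={day:%Y-%m-%d}"
        day += timedelta(days=1)


def read_partition(path: str):
    """Predicate pushdown and compression: the status filter is evaluated against
    Parquet row-group statistics, so non-matching row groups (and unneeded columns)
    can be skipped rather than decompressed and transferred."""
    try:
        return pq.read_table(
            path,
            filesystem=s3,
            columns=["timestamp", "status", "path"],  # read only the needed columns
            filters=[("status", ">=", 500)],          # pushed-down predicate
        )
    except OSError:
        return None  # no data landed for that day


def server_errors(start: date, end: date) -> pa.Table:
    """Parallelism: each surviving partition is fetched over its own connection."""
    prefixes = list(partitions_in_range(start, end))
    with ThreadPoolExecutor(max_workers=16) as pool:
        tables = [t for t in pool.map(read_partition, prefixes) if t is not None]
    return pa.concat_tables(tables)


# Example: pull a week of 5xx responses without touching any other partition.
errors = server_errors(date(2024, 6, 1), date(2024, 6, 7))
print(f"{errors.num_rows} matching log lines")
```

Platforms like Hydrolix and table formats like Iceberg apply these same ideas natively and with much richer metadata for pruning than simple key naming conventions; the sketch only illustrates the underlying principle of touching as few bytes as possible and fetching them in parallel.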
A time to glass of a few seconds is not true real-time latency on the order of milliseconds, but many real-time use cases, such as analytics, don’t require millisecond latency. According to Gartner’s definition of real-time analytics, “For some use cases, real time simply means the analytics is completed within a few seconds or minutes after the arrival of new data.” For observability, business intelligence and many cybersecurity use cases, to name a few, latency in the range of seconds lets operations and other teams quickly find and fix issues and uncover deeper insights in their data.

Object storage isn’t appropriate for use cases that require true millisecond latency, but solutions that rely on in-memory stores or expensive, tightly coupled hardware are no longer appropriate for analytics over large volumes of data either. As always, it’s important to use the right tool for the job. And when it comes to ingesting, storing and analyzing large volumes of log data, it’s time to use solutions built on object storage instead of tiered storage that leaves your data out in the cold.

Learn how Hydrolix can help you keep more data longer and more cost-effectively by maximizing the performance of disaggregated object storage.