Building Event-Driven Systems That Can Recover With Confidence

Wait 5 sec.

Most engineering teams learn to ask whether a system is reliable.Does it stay up? Does it retry? Does it avoid losing messages? Does it recover when a consumer restarts?Those are important questions, but they are not enough for modern event-driven systems. In high-throughput enterprise platforms, the harder question is what happens after something goes wrong and the system has to replay, rebuild, or reconcile state.\Can the team explain what happened?Can they prove which events were processed, which were skipped, which were duplicated, and which projections changed?Can they say why the recovered state should be trusted?That is the difference between a system that merely resumes and a system that actually recovers.This article introduces a design idea I call confidence-carrying replay. The basic argument is simple: replay should not only rebuild state. It should also produce evidence. If a system cannot explain why its recovered state is correct, then replay is just another operational gamble.The practical tool for getting there is a Recovery Contract: an explicit design artifact that defines the authoritative history, ordering boundary, idempotency key, deterministic projection function, replay scope, reconciliation check, and recovery evidence for a critical event flow.That might sound abstract, so let us ground it in a concrete architecture: a near-real-time inventory pipeline using Kafka, PostgreSQL/Aurora, Debezium, Kafka Streams, and downstream availability services.\ Reliability Is Not the Same as ReplayabilityReliability usually asks whether the system keeps working.Replayability asks whether the system can safely revisit history.That distinction matters because event-driven systems rarely fail in a clean way. A message may be delivered twice. A consumer may crash after writing to a database but before committing an offset. A schema may evolve in a way that breaks an older consumer. A connector may fall behind. A downstream projection may drift silently from the source of truth.In a synchronous request/response system, many failures are visible at the boundary of the call. In an event-driven system, the failure often appears later as disagreement between derived states.Inventory says 388 units. The selling engine says 380.The warehouse allocation service says 379. The audit table says a marketplace event added 100 units, but the downstream view only reflected 92.Now the question is not "did Kafka lose the message?" The question is "which state is correct, and how do we prove it?"Example: Real-Time InventoryConsider a marketplace seller sending inventory updates into an e-commerce platform. The seller publishes availability changes for a SKU. The platform persists those updates, derives stock-on-hand, classifies inventory into buckets such as sellable, damaged, and return-to-vendor, and publishes availability to downstream services.\A simplified flow looks like this:Marketplace inventory update-> Kafka ingestion topic-> Java consumer-> PostgreSQL/Aurora transaction-> Debezium CDC topics-> Kafka Streams topology-> availability, sellability, ETA, and fulfillment services\The event may start as something very small:{  "event_id": "mkt-evt-8f11a",  "sku": "1231241",  "quantity": 100,  "operation": "I",  "event_time": "2026-06-19T18:23:11Z",  "seller_id": "seller-42"}\From a business standpoint, this event is not small. It may affect whether customers can buy a product, whether fulfillment promises are accurate, whether an item oversells, and whether financial reconciliation later makes sense.\ \n The platform may store the result across multiple tables:CREATE TABLE inventory_stock_on_hand (  sku             VARCHAR(64) PRIMARY KEY,  stock_on_hand   BIGINT NOT NULL,  updated_at      TIMESTAMP NOT NULL);CREATE TABLE inventory_bucket (  sku             VARCHAR(64) NOT NULL,  bucket_type     VARCHAR(32) NOT NULL,  location_id     VARCHAR(64) NOT NULL,  quantity        BIGINT NOT NULL,  updated_at      TIMESTAMP NOT NULL,  PRIMARY KEY (sku, bucket_type, location_id));CREATE TABLE inventory_transaction (  event_id        VARCHAR(128) PRIMARY KEY,  sku             VARCHAR(64) NOT NULL,  seller_id       VARCHAR(64) NOT NULL,  delta_quantity  BIGINT NOT NULL,  event_time      TIMESTAMP NOT NULL,  accepted_at     TIMESTAMP NOT NULL);\The transaction table is important. It is not just an audit decoration. It is part of the system's memory. When projections drift, the transaction history is how we explain what should have happened.\CDC Gives You History, Not ConfidenceChange data capture is a powerful pattern for this kind of architecture. Debezium can read the PostgreSQL write-ahead log, maintain a replication slot, and emit table changes into Kafka topics.{  "name": "postgres-inventory-connector",  "config": {    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",    "database.hostname": "",    "database.port": "5432",    "database.user": "",    "database.password": "",    "database.dbname": "",    "topic.prefix": "inventory_source",    "plugin.name": "pgoutput",    "slot.name": "debezium_inventory_slot",    "publication.autocreate.mode": "filtered",    "table.include.list": "public.inventory_stock_on_hand,public.inventory_bucket,public.inventory_transaction",    "key.converter": "org.apache.kafka.connect.json.JsonConverter",    "value.converter": "org.apache.kafka.connect.json.JsonConverter",    "key.converter.schemas.enable": "true",    "value.converter.schemas.enable": "true"  }}This is a strong foundation. But CDC alone does not solve recovery. \n CDC tells you what changed. It does not automatically tell you:which topic is authoritative during replaywhich key preserves business orderingwhich duplicate events should be ignoredwhich derived state should be rebuiltwhich invariants prove the rebuild workedwhich evidence should be emitted after recovery \n That is the gap.State Is Temporary. History Is ForeverMaterialized state is useful because it is fast. Downstream services do not want to recompute availability from raw events every time a customer opens a product page. They want a view. But derived state is also dangerous because it can drift.The availability topic may be stale.The cache may have been updated by a previous version of the consumer.The stream processor may have skipped a malformed event.The materialized view may reflect an event ordering that was never intended by the business.This is why I like the principle: \n State is temporary. History is forever.The system should be designed so that important state can be rebuilt from authoritative history. More importantly, the rebuild should produce enough evidence for operators to trust it.That is what confidence-carrying replay means.\Recovery ContractsA Recovery Contract is a small but explicit agreement for a critical event flow.It answers seven questions:[ ] H = Authoritative history[ ] O = Ordering boundary[ ] I = Idempotency key[ ] F = Deterministic projection function[ ] S = Replay scope[ ] Q = Reconciliation query or invariant[ ] E = Recovery evidenceFor the inventory pipeline, the contract might look like this:recovery_contract:  flow: inventory-availability-projection  authoritative_history:    - inventory_transaction    - debezium.inventory_transaction    - debezium.inventory_bucket    - debezium.inventory_stock_on_hand  ordering_boundary:    key: sku    rationale: "Updates for the same SKU must be processed in entity-local order."  idempotency:    key: event_id    duplicate_policy: skip_and_report  projection_function:    name: compute_sellable_availability    deterministic: true    side_effects: isolated  replay_scope:    supported:      - by_sku      - by_time_window      - by_topic_partition  reconciliation:    checks:      - stock_on_hand_matches_transactions      - sellable_quantity_non_negative      - projection_event_time_not_newer_than_source  recovery_evidence:    emit:      - replay_scope      - events_processed      - duplicates_skipped      - projections_changed      - reconciliation_failures      - confidence_statusThe point is not bureaucracy. The point is that recovery should be designed before the incident.\What This Adds Beyond Kafka and DebeziumIt is easy to mistake this for a tooling article. It is not. Kafka gives you durable logs and partitioned processing.Debezium gives you database change history. Kafka Streams gives you stateful transformations and materialized views. Those are ingredients. The Recovery Contract defines how those ingredients behave when the system is under stress.The contract says:this is the history we trustthis is the ordering boundary we must preservethis is how duplicate effects are suppressedthis is how we rebuild derived statethis is how we decide whether the rebuild is correctthis is the evidence we emit so operators can trust the recovery\ \Partitioning Is a Correctness BoundaryIn Kafka systems, partitioning is often discussed as a scaling mechanism. Add partitions. Increase throughput. Spread load.That is true, but incomplete.For stateful event processing, partitioning is also a correctness boundary.If all updates for the same SKU must be processed in order, then the SKU belongs in the partition key. If related tables are joined in Kafka Streams, they need compatible keys and partitioning. If events for the same entity scatter across partitions, replay becomes harder to reason about.The mistake is treating partitioning as an infrastructure-only decision.It is an architectural decision.public class SkuPartitioner implements Partitioner {    @Override    public int partition(            String topic,            Object key,            byte[] keyBytes,            Object value,            byte[] valueBytes,            Cluster cluster) {        InventoryEvent event = (InventoryEvent) value;        String orderingKey = event.getSku();        int partitionCount = cluster.partitionCountForTopic(topic);        return Math.floorMod(orderingKey.hashCode(), partitionCount);    }}The implementation is simple. The decision behind it is not.The ordering key should come from the business invariant. In inventory, that may be SKU, item, seller plus SKU, location plus SKU, or tenant plus item. In billing, it may be account, subscription, or usage meter. In security analytics, it may be tenant, user, asset, or detection rule. \n Confidence-Carrying ReplayReplay is often treated as a mechanical action:1. Stop consumer.2. Reset offset.3. Reprocess events.4. Hope the state looks right.That is not enough.Confidence-carrying replay adds a second output. The replay produces rebuilt state and a recovery evidence record.Example:{  "recovery_id": "rec-2026-06-19-001",  "flow": "inventory-availability-projection",  "scope": {    "sku": "1231241",    "from_event_time": "2026-06-19T18:00:00Z",    "to_event_time": "2026-06-19T19:00:00Z"  },  "events_processed": 1842,  "duplicates_skipped": 17,  "late_events_accepted": 3,  "projection_rows_changed": 11,  "reconciliation": {    "stock_on_hand_matches_transactions": true,    "sellable_quantity_non_negative": true,    "projection_event_time_valid": true  },  "confidence_status": "trusted",  "published_outputs": [    "inventory.availability.v2"  ]}\This evidence record is the difference between operational hope and architectural confidence.\ \Observability Is Necessary, But Not SufficientMetrics are essential. You should track consumer lag, connector lag, task restarts, dead-letter events, replay duration, duplicate-event counts, and end-to-end latency.But observability tells you what the system is doing. It does not automatically define whether recovered state is correct.A dashboard might show that lag is zero. That does not prove the projection is right.A log might show that a replay completed. That does not prove duplicate effects were skipped. A trace might show that events flowed through services. That does not prove the final availability number matches accepted history.Recovery Contracts connect observability to correctness. They define which signals matter for recovery confidence.\Why Dead-Letter Queues Are Not a Recovery StrategyDead-letter queues are useful. They keep poison messages from blocking progress, and they give operators a place to inspect events that failed parsing, validation, authorization, or downstream processing.But a DLQ is not a recovery strategy by itself.A DLQ usually answers a narrow question: which messages failed? It does not answer the broader architectural questions:Was derived state already partially updated before the message failed?Did a retry create duplicate effects?Did downstream projections observe related events out of order?Which entities are affected?Which historical range should be replayed?Which reconciliation checks prove the repaired state is correct?In other words, a DLQ can tell you where some failures landed. It does not necessarily tell you how to restore confidence in the system.This is especially important in CDC-based systems. A Debezium topic may include changes from multiple source tables. A failed event may be only one symptom of a broader inconsistency between a source table, a materialized view, and a downstream topic. If the recovery plan is only "fix the event and replay the DLQ," you may miss the state that was already derived incorrectly.A better pattern is to treat the DLQ as one input into the Recovery Contract.The contract should define:whether DLQ events are part of authoritative historywhether they can be replayed directlywhether they require source-table reconciliation firstwhether related entities must be replayed togetherwhat evidence must be produced after DLQ replayThat is the difference between message repair and system recovery.\Schema Governance Is a Part of ReplayabilityReplayability depends on the future being able to understand the past.If an event schema changes in a way that older consumers cannot read, then replay becomes fragile. If a field changes meaning without a version boundary, then replay may produce a technically valid but semantically wrong projection. If a topic is compacted or retained without considering recovery windows, then history may disappear before it can be used.Schema governance is often treated as an API compatibility problem. It is also a recovery problem.For replay-safe systems, schema decisions should answer questions like:Can a current consumer read events from the last recovery window?Are removed fields still needed to rebuild historical projections?Does the projection function understand every version it may replay?Do schema changes include migration rules for materialized state?Does the recovery evidence record include the schema versions used?A Recovery Contract should therefore name the schema compatibility expectations for the flow. The point is not to freeze schemas forever. The point is to ensure that change does not silently destroy the ability to recover.In enterprise platforms, this is where architecture discipline matters. Throughput problems are often obvious. Replayability problems are often discovered only during an incident, when the team realizes that a six-month-old event no longer means what the current consumer thinks it means.\This Is Not Only About InventoryInventory is a useful example because the consequences are easy to understand. A wrong projection can cause overselling, underselling, bad fulfillment promises, and customer trust problems.But the same pattern appears in other domains.In usage-based billing, events represent consumption. The derived state may include metered totals, quota usage, invoice line items, and customer-facing cost views. A replay must not double bill a customer. It must explain which usage events were accepted, which were deduplicated, which arrived late, and which billing projections changed.In security analytics, events represent activity across users, endpoints, cloud resources, and network boundaries. The derived state may include detections, risk scores, alerts, or investigation timelines. A replay must not simply regenerate alerts blindly. It must explain whether a detection changed because new evidence arrived, because old evidence was corrected, or because the detection logic changed.In payments and financial platforms, events often cross service and ledger boundaries. The derived state may include balances, settlements, reversals, fees, and disputes. Recovery without evidence is not acceptable. It is not enough to say that a consumer caught up. The system must explain why the reconstructed financial state is trustworthy.That is why Recovery Contracts are useful as an architectural pattern rather than a one-off inventory trick.The implementation changes by domain. The contract shape stays surprisingly stable:What history do we trust?What ordering must we preserve?How do we suppress duplicate effects?How do we rebuild state?How do we check the result?What evidence do we emit?Those questions travel well.\A Practical ChecklistFor any high-value event flow, ask:1. What is the authoritative history?2. What is the ordering boundary?3. What is the idempotency key?4. Can the projection be rebuilt deterministically?5. What replay scopes are supported?6. What reconciliation checks prove correctness?7. What evidence is emitted after replay?8. Who is allowed to trigger replay?9. Which downstream systems are republished?10. What does "trusted" mean after recovery?If those answers are not written down, the system may still be reliable in the happy path. But it is not yet recovery-centric. \n Deployment ViewA cloud-native deployment separates ingestion, persistence, CDC, stream processing, downstream publication, and observability. In Kubernetes, that might mean independent deployments for ingestion consumers, Kafka Streams applications, Kafka Connect workers, and recovery tooling. \n The recovery contract should not live only in someone’s head. It can live in architecture decision records, schema metadata, stream topology documentation, runbooks, or platform-level recovery tooling.The most mature version is when the platform itself understands recovery contracts. At that point, replay is no longer a dangerous manual operation. It becomes a controlled workflow.Common Failure ScenariosRecovery contracts become especially valuable in scenarios like these:- A consumer restarts during a high-volume marketplace update burst.- A marketplace sends duplicate inventory updates.- A late event arrives after a downstream projection has already been published.- A Debezium connector falls behind or restarts from a replication slot.- A Kafka Streams task rebalances and restores state from a changelog topic.- A schema change breaks an older event reader.- A manual offset rewind is required after a bad deployment. \n The question is the same every time:Can the team rebuild the affected state and explain why the rebuilt state is correct?If the answer is yes, the system has recovery confidence.If the answer is no, the system may be running, but the operators are still guessing.\ConclusionEvent-driven systems do not just need reliability. They need replayability. And replayability itself is incomplete unless replay carries confidence.Kafka, Debezium, PostgreSQL, and Kafka Streams provide excellent building blocks for high-throughput cloud-native event pipelines. But the architecture discipline comes from defining how those systems recover after failure.Recovery Contracts make that discipline explicit.They define the history we trust, the ordering boundary we preserve, the duplicate effects we suppress, the projection we rebuild, the checks we run, and the evidence we emit.That is the architecture I want in systems that manage inventory, billing, security analytics, payments, and other high-value enterprise workflows.The strongest systems are not the ones that never fail. They are the ones that can restore justified confidence quickly. \n ResourcesDebezium PostgreSQL connector documentation: https://debezium.io/documentation/reference/stable/connectors/postgresql.htmlApache Kafka Streams documentation: https://kafka.apache.org/documentation/streams/PostgreSQL logical replication documentation: https://www.postgresql.org/docs/current/logical-replication.htmlConfluent Kafka partition strategy: https://www.confluent.io/learn/kafka-partition-strategy/\ \n \n \\