Azure Cosmos DB is engineered from the ground up to deliver high availability, low latency, throughput, and consistency guarantees for globally distributed applications. As mission-critical systems increasingly rely on Cosmos DB for performance at scale, understanding and configuring advanced high availability features becomes essential.

Advanced high availability capabilities in Azure Cosmos DB

Azure Cosmos DB provides a rich set of features to support high availability and resilience:

- Availability Zones: By deploying replicas across physically separate data centers within a region, Cosmos DB can survive zonal failures without any downtime.
- Multi-region replication: Cosmos DB supports geo-replication across any number of Azure regions, allowing your data to be served locally to end users across the globe.
- Multi-region writes: For truly active-active architectures, Cosmos DB supports multi-region writes with conflict resolution policies to ensure consistency and reliability.
- Automatic regional failover: When a region becomes unavailable, Cosmos DB can automatically redirect traffic to a secondary region.
- Per-partition automatic failover (PPAF): Recently introduced, this ground-breaking feature provides more granular control by enabling failover at the logical partition level. It ensures that only affected partitions are redirected to another region, preserving performance for healthy partitions.

These capabilities form the backbone of Cosmos DB's 99.999% availability SLA. To fully realize the platform's benefits, applications need to be configured appropriately, especially at the SDK level.

Database vs. Client Failover

Before diving in, let’s clarify the distinction between database-level and client-level failover in Azure Cosmos DB:

- Database failover: In Azure Cosmos DB, you can configure either single-region or multi-region writes. Each option has trade-offs in performance, cost, consistency, and availability.
Database failover refers to the process of promoting another region (or another partition, in the case of PPAF) to accept writes when the current write region becomes unavailable. This concept applies only to single-region write configurations, because with multi-region writes all regions are already writable, so there is no need for failover. While promotion during failover can introduce a brief delay, PPAF significantly reduces this delay.
- Client failover: This refers to the behavior of the client communicating with the database replicas. In this context, the SDK is considered the client (though in practice, it could also refer to an application using the SDK). Unlike database failover, client failover can occur almost instantly, as the client can quickly reroute requests to another available region or replica. However, this speed introduces its own trade-offs, such as temporary inconsistencies or latency changes, depending on the new target region.

The SDK perspective: what happens by default?

While Azure Cosmos DB supports robust infrastructure-level resiliency features, including the mechanisms for automatic database failover mentioned above, the Cosmos DB SDKs do not automatically fail over in many scenarios unless explicitly configured to do so.

Even when preferred regions are specified (they are not enabled by default), the SDKs only perform failover when the primary region is completely unreachable. For less severe disruptions, such as transient network issues, throttling, or degraded performance, the SDKs still send each new operation to the first region in the preferred list, and only then perform cross-region retries.

This behavior is designed to maintain consistency and favour data locality by contacting the first region in the preferred list first for every new request. However, it is important to be aware that this is not strictly failover in the often accepted sense of the term (i.e.
re-routing all subsequent requests elsewhere after identifying some issue with the primary target). Depending on business requirements, other strategies may be preferable. Retrying across regions can introduce latency and even compromise availability when local replicas are degraded. To handle such scenarios more effectively, Azure Cosmos DB SDKs for .NET, Python, and Java provide advanced opt-in features.

Advanced opt-in SDK availability configurations

To improve application resilience and response times during partial outages or degraded conditions, Cosmos DB SDKs support the following advanced strategies.

- Threshold-based availability strategy: This strategy allows applications to define latency thresholds per region. If an operation exceeds the threshold, the SDK can dynamically issue parallel requests to preferred secondary regions and accept whichever response returns first. This pattern is also referred to as "hedging".
- Per-partition circuit breaker: When a specific physical partition experiences consistent failures or degraded performance, the SDK can redirect all future requests for that partition to another region for a specified period of time, without impacting healthy partitions.

These strategies are opt-in and require explicit configuration in the client SDK initialization code.

Fine tuning consistency vs availability

Strictly speaking, there is a hard trade-off between consistency and availability (per the CAP theorem). However, the excluded regions feature in the SDK can be used to fine tune the balance between the two. For example:

- In a scenario where region 1 (the primary region) is suffering an outage, and the per-partition circuit breaker is not configured (or is configured but its threshold for failover is proving too long), excluded regions lets you exclude the primary (or any other) region without making code changes or restarting the application.
- In a scenario where preferences for consistent reads differ depending on whether the application is in steady state or suffering an outage. In steady state you may favour consistent reads in your application, but when there is an outage you may favour availability (at the expense of potentially greater data loss). In that scenario, you can exclude all regions except the primary region, so as to avoid the cross-region retries that can compromise consistency, while still being able to control failover to other regions (for both reads and writes) using an external mechanism (e.g. traffic manager or load balancer).

Refer to the .NET performance tips, Java performance tips, and Python performance tips documentation for detailed guidance on the threshold-based availability strategy, per-partition circuit breaker, and excluded regions features.

Summary

High availability in Azure Cosmos DB is a multi-layered strategy. While the platform provides strong infrastructure-level guarantees through availability zones and multi-region replication, it is up to developers to configure SDKs appropriately to unlock these benefits during real-world outages and latency spikes.

By enabling preferred regions, configuring a threshold-based availability strategy, leveraging partition-level circuit breakers, and configuring excluded regions where appropriate, applications can achieve faster recovery, reduce downtime, and deliver a more resilient user experience.

If your application is mission-critical, it's time to move beyond defaults and take full advantage of the high availability features in Cosmos DB SDKs.
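
To make the threshold-based availability strategy ("hedging") discussed above more concrete, here is a minimal Python sketch of the idea. It is not the SDK's implementation and does not use the SDK's API: regional reads are simulated with asyncio sleeps, and the region names, latencies, the functions `read_from_region` and `hedged_read`, and the `threshold_step` parameter (the wait before each additional hedge) are all assumptions made for illustration. Excluded regions are modelled simply by filtering the preferred list.

```python
import asyncio

# Hypothetical per-region latencies in seconds, used only for this simulation.
SIMULATED_LATENCY = {"East US": 0.5, "West US": 0.02, "North Europe": 0.05}


async def read_from_region(region, item_id):
    """Stand-in for a single-region read; no real Cosmos DB call is made."""
    await asyncio.sleep(SIMULATED_LATENCY[region])
    return {"id": item_id, "served_by": region}


async def hedged_read(item_id, preferred_regions, threshold, threshold_step,
                      excluded_regions=()):
    """Return the first response, hedging to later regions after a threshold.

    The first eligible region is tried alone. If nothing has returned after
    `threshold` seconds, a parallel request goes to the next region, and then
    to each further region after another `threshold_step` seconds.
    Error handling is omitted: the simulated reads never fail.
    """
    regions = [r for r in preferred_regions if r not in excluded_regions]
    if not regions:
        raise ValueError("no eligible regions after applying exclusions")

    tasks = []
    try:
        for index, region in enumerate(regions):
            tasks.append(asyncio.create_task(read_from_region(region, item_id)))
            if index == len(regions) - 1:
                timeout = None  # nothing left to hedge to; wait for a winner
            else:
                timeout = threshold if index == 0 else threshold_step
            done, _pending = await asyncio.wait(
                tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return done.pop().result()
    finally:
        # Cancel requests that lost the race (a no-op for finished tasks).
        for task in tasks:
            task.cancel()


if __name__ == "__main__":
    response = asyncio.run(
        hedged_read("item-1", ["East US", "West US", "North Europe"],
                    threshold=0.1, threshold_step=0.1)
    )
    print(response)
```

With these simulated latencies, the slow first region ("East US") misses the 0.1 second threshold, so a parallel request goes to "West US", whose response is the one accepted. Passing `excluded_regions=["East US"]` skips the degraded region entirely, which mirrors the way excluded regions can take a region out of rotation without changing the preferred list.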