Improved resiliency with backpressure and admission control for Amazon OpenSearch Service | Amazon Web Services

Improved resiliency with backpressure and admission control for Amazon OpenSearch Service | Amazon Web Services

Source Node: 2723961

Amazon OpenSearch Service is a managed service that makes it simple to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. Last year, we introduced Shard Indexing Backpressure and admission control, which monitors cluster resources and incoming traffic to selectively reject requests that would otherwise pose stability risks like out of memory and impact cluster performance due to memory contentions, CPU saturation and GC overhead, and more.

We are now excited to introduce Search Backpressure and CPU-based admission control for OpenSearch Service, which further enhances the resiliency of clusters. These improvements are available for all OpenSearch versions 1.3 or higher.

Search Backpressure

Backpressure prevents a system from being overwhelmed with work. It does so by controlling the traffic rate or by shedding excessive load in order to prevent crashes and data loss, improve performance, and avoid total failure of the system.

Search Backpressure is a mechanism to identify and cancel in-flight resource-intensive search requests when a node is under duress. It’s effective against search workloads with anomalously high resource usage (such as complex queries, slow queries, many hits, or heavy aggregations), which could otherwise cause node crashes and impact the cluster’s health.

Search Backpressure is built on top of the task resource tracking framework, which provides an easy-to-use API to monitor each task’s resource usage. Search Backpressure uses a background thread that periodically measures the node’s resource usage and assigns a cancellation score to each in-flight search task based on factors like CPU time, heap allocations, and elapsed time. A higher cancellation score corresponds to a more resource-intensive search request. Search requests are cancelled in descending order of their cancellation score to recover nodes quickly, but the number of cancellations is rate-limited to avoid wasteful work.

The following diagram illustrates the Search Backpressure workflow.

Search requests return an HTTP 429 “Too Many Requests” status code upon cancellation. OpenSearch returns partial results if only some shards fail and partial results are allowed. See the following code:

{ "error": { "root_cause": [ { "type": "task_cancelled_exception", "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb], elapsed time exceeded [1.7m >= 45s]" } ], "type": "search_phase_execution_exception", "reason": "SearchTask was cancelled", "phase": "fetch", "grouped": true, "failed_shards": [ { "shard": 0, "index": "nyc_taxis", "node": "9gB3PDp6Speu61KvOheDXA", "reason": { "type": "task_cancelled_exception", "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb], elapsed time exceeded [1.7m >= 45s]" } } ], "caused_by": { "type": "task_cancelled_exception", "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb], elapsed time exceeded [1.7m >= 45s]" } }, "status": 429
}

Monitoring Search Backpressure

You can monitor the detailed Search Backpressure state using the node stats API:

curl -X GET "https://{endpoint}/_nodes/stats/search_backpressure"

You can also view the cluster-wide summary of cancellations using Amazon CloudWatch. The following metrics are now available in the ES/OpenSearchService namespace:

  • SearchTaskCancelled – The number of coordinator node cancellations
  • SearchShardTaskCancelled – The number of data node cancellations

The following screenshot shows an example of tracking these metrics on the CloudWatch console.

CPU-based admission control

Admission control is a gatekeeping mechanism that proactively limits the number of requests to a node based on its current capacity, both for organic increases and spikes in traffic.

In addition to the JVM memory pressure and request size thresholds, it now also monitors each node’s rolling average CPU usage to reject incoming _search and _bulk requests. It prevents nodes from being overwhelmed with too many requests leading to hot spots, performance problems, request timeouts, and other cascading failures. Excessive requests return an HTTP 429 “Too Many Requests” status code upon rejection.

Handling HTTP 429 errors

You’ll receive HTTP 429 errors if you send excessive traffic to a node. It indicates either insufficient cluster resources, resource-intensive search requests, or an unintended spike in the workload.

Search Backpressure provides the reason for rejection, which can help fine-tune resource-intensive search requests. For traffic spikes, we recommend client-side retries with exponential backoff and jitter.

You can also follow these troubleshooting guides to debug excessive rejections:

Conclusion

Search Backpressure is a reactive mechanism to shed excessive load, while admission control is a proactive mechanism to limit the number of requests to a node beyond its capacity. Both work in tandem to improve the overall resiliency of an OpenSearch cluster.

Search Backpressure is available in OpenSearch, and we are always looking for external contributions. You can refer to the RFC to get started.


About the authors

Ketan Verma is a Senior SDE working on Amazon OpenSearch Service. He is passionate about building large-scale distributed systems, improving performance, and simplifying complex ideas with simple abstractions. Outside work, he likes to read and improve his home barista skills.

Suresh N S is a Senior SDE working on Amazon OpenSearch Service. He is passionate towards solving problems in large scale distributed systems.

Pritkumar Ladani is an SDE-2 working on Amazon OpenSearch Service. He likes to contribute to open source software development, and is passionate about distributed systems. He is an amateur badminton player and enjoys trekking.

Bukhtawar Khan is a Principal Engineer working on Amazon OpenSearch Service. He is interested in building distributed and autonomous systems. He is a maintainer and an active contributor to OpenSearch.

Time Stamp:

More from AWS Big Data