Run computer vision inference on large videos with Amazon SageMaker asynchronous endpoints


AWS customers are increasingly using computer vision (CV) models on large input payloads that can take a few minutes of processing time. For example, space technology companies work with a stream of high-resolution satellite imagery to detect particular objects of interest. Similarly, healthcare companies process high-resolution biomedical images or videos like echocardiograms to detect anomalies. Additionally, media companies scan images and videos uploaded by their customers to ensure they are compliant and without copyright violations. These applications receive bursts of incoming traffic at different times in the day and require near-real-time processing with completion notifications at a low cost.

You can build CV models with multiple deep learning frameworks like TensorFlow, PyTorch, and Apache MXNet. These models typically have large payloads, such as images or videos. Advanced deep learning models for use cases like object detection return large response payloads ranging from tens of MBs to hundreds of MBs in size. Additionally, high-resolution videos require compute-intensive preprocessing before model inference. Processing times can range in the order of minutes, eliminating the option to run real-time inference by passing payloads over an HTTP API. Instead, there is a need to process input payloads asynchronously from an object store like Amazon Simple Storage Service (Amazon S3) with automatic queuing and a predefined concurrency threshold. The system should be able to receive status notifications and eliminate unnecessary costs by cleaning up resources when the tasks are complete.

Amazon SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML. SageMaker provides state-of-the-art open-source model serving containers for XGBoost (container, SDK), Scikit-learn (container, SDK), PyTorch (container, SDK), TensorFlow (container, SDK), and Apache MXNet (container, SDK). SageMaker provides three options to deploy trained ML models for generating inferences on new data: real-time hosted endpoints for low-latency requests, batch transform for offline inference over full datasets, and asynchronous inference endpoints, which queue incoming requests and are designed for large payloads with long processing times.

In this post, we show you how to serve a PyTorch CV model with SageMaker asynchronous inference to process a burst traffic of large input payload videos uploaded to Amazon S3. We demonstrate the new capabilities of an internal queue with user-defined concurrency and completion notifications. We configure auto scaling of instances to scale down to 0 when traffic subsides and scale back up as the request queue fills up. We use a g4dn instance with an NVIDIA T4 GPU and the SageMaker pre-built TorchServe container with a custom inference script for preprocessing the videos before model invocation, and Amazon CloudWatch metrics to monitor the queue size, total processing time, invocations processed, and more.

The code for this example is available on GitHub.

Solution overview

The following diagram illustrates our solution architecture.

Our model is first hosted on the scaling endpoint. Next, the user or some other mechanism uploads a video file to an input S3 bucket. The user invokes the endpoint and is immediately returned an output Amazon S3 location where the inference result will be written. After the inference is complete, the result is saved to the output S3 bucket, and an Amazon Simple Notification Service (Amazon SNS) notification is sent to the user to report the completed success or failure.

Use case model

For this object detection example, we use a TorchVision Mask R-CNN model, pre-trained on 91 classes, to demonstrate inference on a stacked 4D video tensor. Because we're detecting objects on a large input payload that requires preprocessing, the total latency can be substantial. Although this isn't ideal for a real-time endpoint, it's easily handled by asynchronous endpoints, which process the queue and save the results to an Amazon S3 output location.

To host this model, we use a pre-built SageMaker PyTorch inference container that uses the TorchServe model serving stack. SageMaker containers allow you to provide your own inference script, which gives you flexibility to handle preprocessing and postprocessing, as well as dictate how your model interacts with the data.
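Although the sample repository's exact script may differ, a minimal sketch of the model-loading entry point such a script typically exposes looks like the following (the model_fn name follows the SageMaker PyTorch serving convention; the rest is an assumption based on the model described above):

# inference.py (sketch) -- model loading for the SageMaker PyTorch serving container
import torch
import torchvision

def model_fn(model_dir):
    # Load the TorchVision Mask R-CNN model pre-trained on the 91 COCO classes
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    model.eval()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return model.to(device)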

Input and output payload

In this example, we use an input video of size 71 MB from here. The asynchronous endpoint's inference handler expects an mp4 video, which is sharded into 1024x1024x3 tensors for every second of video. To define this handler, we provide the endpoint with a custom inference.py script. The script provides functions for model loading, data serialization and deserialization, preprocessing, and prediction. Within the handler, our input_fn calls a helper function known as video2frames:

video_frames = []
cap = cv2.VideoCapture(tfile.name)
frame_index, frame_count = 0, 0
if cap.isOpened():
    success = True
else:
    success = False
while success:
    success, frame = cap.read()
    if not success:  # stop once the last frame has been read
        break
    if frame_index % interval == 0:
        print("---> Reading the %d frame:" % frame_index, success)
        resize_frame = cv2.resize(
            frame, (frame_width, frame_height), interpolation=cv2.INTER_AREA
        )
        video_frames.append(resize_frame)
        frame_count += 1
    frame_index += 1
cap.release()
return video_frames

These stacked tensors are processed by our Mask R-CNN model, which saves a result JSON containing the bounding boxes, labels, and scores for detected objects. In this example, the output payload is 54 MB. We demonstrate a quick visualization of the results in the following animation.
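The remaining handler entry points can be sketched along the same lines. The function names again follow the SageMaker PyTorch toolkit convention, while the video2frames signature, frame size, and exact output fields are assumptions based on the description above rather than the repository's exact code:

# inference.py (sketch, continued) -- preprocessing, prediction, and JSON output
import json
import tempfile
import torch

def input_fn(request_body, request_content_type="video/mp4"):
    # Write the raw mp4 bytes to a temporary file and sample frames at 1 FPS
    tfile = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4")
    tfile.write(request_body)
    tfile.flush()
    frames = video2frames(tfile.name, frame_width=1024, frame_height=1024, interval=30)  # hypothetical signature
    # Stack the sampled frames into a 4D float tensor of shape (num_frames, 3, H, W)
    return torch.stack(
        [torch.from_numpy(f).permute(2, 0, 1).float() / 255.0 for f in frames]
    )

def predict_fn(input_data, model):
    device = next(model.parameters()).device
    with torch.no_grad():
        # Mask R-CNN accepts a list of 3D image tensors and returns one dict per frame
        predictions = model(list(input_data.to(device)))
    return predictions

def output_fn(predictions, response_content_type="application/json"):
    # Keep only the bounding boxes, labels, and scores in the result JSON
    results = [
        {
            "boxes": p["boxes"].cpu().tolist(),
            "labels": p["labels"].cpu().tolist(),
            "scores": p["scores"].cpu().tolist(),
        }
        for p in predictions
    ]
    return json.dumps(results)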

Create the asynchronous endpoint

We create the asynchronous endpoint similarly to a real-time hosted endpoint. The steps include creating a SageMaker model, followed by endpoint configuration and deployment of the endpoint. The difference between the two types of endpoints is that the asynchronous endpoint configuration contains an AsyncInferenceConfig section. In this section, we specify the Amazon S3 output path for the results from the endpoint invocation and optionally include SNS topics for notifications on success and failure. We also specify the maximum number of concurrent invocations per instance as determined by the customer. See the following code:

AsyncInferenceConfig={
    "OutputConfig": {
        "S3OutputPath": f"s3://{bucket}/{bucket_prefix}/output",
        # Optionally specify Amazon SNS topics for notifications
        "NotificationConfig": {
            "SuccessTopic": success_topic,
            "ErrorTopic": error_topic,
        }
    },
    "ClientConfig": {
        "MaxConcurrentInvocationsPerInstance": 2  # Increase this value up to the throughput peak for best performance
    }
}
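This section plugs directly into the endpoint configuration call. A minimal boto3 sketch follows, assuming a SageMaker model has already been created; endpoint_config_name and model_name are illustrative names, not values from the sample repository:

import boto3

sm_client = boto3.client("sagemaker")

sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g4dn.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    # The AsyncInferenceConfig block shown above is passed here verbatim
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": f"s3://{bucket}/{bucket_prefix}/output",
            "NotificationConfig": {"SuccessTopic": success_topic, "ErrorTopic": error_topic},
        },
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 2},
    },
)

sm_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)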

For details on the API to create an endpoint configuration for asynchronous inference, see Create an Asynchronous Inference Endpoint.

Invoke the asynchronous endpoint

The input payload in the following code is a video .mp4 file uploaded to Amazon S3:

input_1_s3_location = sm_session.upload_data(
    input_location,
    bucket=sm_session.default_bucket(),
    key_prefix=prefix,
    extra_args={"ContentType": "video/mp4"}
)

We use the Amazon S3 URI of the input payload file to invoke the endpoint. The response object contains the output location in Amazon S3 where the result will be available after completion:

response = sm_runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_1_s3_location
)
output_location = response['OutputLocation']

For details on the API to invoke an asynchronous endpoint, see Invoke an Asynchronous Endpoint.
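Because the invocation returns immediately, a common pattern is to poll the returned output location until the result object appears. The following is a minimal sketch under that assumption; the helper name and retry loop are illustrative and not part of the SageMaker SDK or the sample repository:

import time
import urllib.parse
import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

def wait_for_output(output_location, timeout=3600, poll_interval=30):
    # OutputLocation is an s3:// URI; split it into bucket and key
    parsed = urllib.parse.urlparse(output_location)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            obj = s3_client.get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()  # result JSON written by the endpoint
        except ClientError as e:
            if e.response["Error"]["Code"] in ("NoSuchKey", "404"):
                time.sleep(poll_interval)  # result not ready yet; keep waiting
            else:
                raise
    raise TimeoutError(f"No result at {output_location} after {timeout} seconds")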

Queue invocation requests with user-defined concurrency

The asynchronous endpoint automatically queues the invocation requests. It uses the MaxConcurrentInvocationsPerInstance parameter in the preceding endpoint configuration to process new requests from the queue after previous requests are complete. This is a fully managed queue with various monitoring metrics and doesn’t require any further configuration.
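To exercise this queue, requests can simply be submitted in a loop and the endpoint drains them at the configured concurrency. A minimal sketch, reusing the uploaded video URI from the previous section:

# Submit a burst of asynchronous invocations; the endpoint queues them and processes
# MaxConcurrentInvocationsPerInstance requests at a time on each instance.
output_locations = []
for i in range(1000):
    response = sm_runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_1_s3_location,
    )
    output_locations.append(response["OutputLocation"])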

Auto scaling instances within the asynchronous endpoint

We set the auto scaling policy with a minimum capacity of 0 and a maximum capacity of five instances. Unlike real-time hosted endpoints, asynchronous endpoints support scaling the instance count to 0 by setting the minimum capacity to 0. With this feature, we can scale down to 0 instances when there is no traffic and pay only when payloads arrive.

We use the ApproximateBacklogSizePerInstance metric for the scaling policy configuration with a target queue backlog of five per instance to scale out further. We set the cooldown period for ScaleInCooldown to 120 seconds and the ScaleOutCooldown to 120 seconds. See the following code:

client = boto3.client('application-autoscaling')  # Common class representing Application Auto Scaling for SageMaker amongst other services

resource_id = 'endpoint/' + endpoint_name + '/variant/' + 'variant1'  # This is the format in which Application Auto Scaling references the endpoint

response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=0,
    MaxCapacity=5
)

response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker',  # The namespace of the AWS service that provides the resource
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',  # SageMaker supports only Instance Count
    PolicyType='TargetTrackingScaling',  # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,  # The target value for the metric
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [{'Name': 'EndpointName', 'Value': endpoint_name}],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 120,  # The amount of time, in seconds, after a scale-in activity completes before another scale-in activity can start
        'ScaleOutCooldown': 120  # The amount of time, in seconds, after a scale-out activity completes before another scale-out activity can start
        # 'DisableScaleIn': True|False - indicates whether scale in by the target tracking policy is disabled.
        # If the value is true, scale in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    }
)

For details on the API to automatically scale an asynchronous endpoint, see Autoscale an Asynchronous Endpoint.

Notifications from the asynchronous endpoint

We create two separate SNS topics for success and error notifications for each endpoint invocation result:

sns_client = boto3.client('sns')
response = sns_client.create_topic(Name="Async-Demo-ErrorTopic2")
error_topic = response['TopicArn']
response = sns_client.create_topic(Name="Async-Demo-SuccessTopic2")
success_topic = response['TopicArn']
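To actually receive these notifications, a subscription must be attached to each topic. A minimal sketch using an email subscription (the address is a placeholder):

# Subscribe an email address to both topics; the address below is a placeholder
for topic_arn in (success_topic, error_topic):
    sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="email",
        Endpoint="user@example.com",  # replace with a real address and confirm the subscription
    )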

The other options for notifications include periodically checking the output S3 bucket, or using S3 bucket notifications to trigger an AWS Lambda function on file upload. SNS notifications are included in the endpoint configuration section as described earlier.

For details on how to set up notifications from an asynchronous endpoint, see Check Prediction Results.

Monitor the asynchronous endpoint

We monitor the asynchronous endpoint with the built-in additional CloudWatch metrics specific to asynchronous inference. For example, we monitor the queue length on each instance with ApproximateBacklogSizePerInstance and the total queue length with ApproximateBacklogSize. Consider deleting the SNS topics to avoid being flooded with notifications during the following invocations. In the following chart, we can see the initial backlog size due to a sudden traffic burst of 1,000 requests, and the backlog size per instance reduces rapidly as the endpoint scales out from one to five instances.
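These metrics are published in the AWS/SageMaker namespace and can also be pulled programmatically. A minimal sketch using boto3 CloudWatch, with an illustrative one-hour window and 60-second period:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average backlog size per instance for the endpoint over the last hour
backlog = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ApproximateBacklogSizePerInstance",
    Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)
print(backlog["Datapoints"])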

Similarly, we monitor the total number of successful invocations with InvocationsProcessed and the total number of failed invocations with InvocationFailures. In the following chart, we can see that the average number of video invocations processed per minute after auto scaling is approximately 18.

We also monitor the model latency time, which includes the video preprocessing time and model inference for the batch of video images at 1 FPS. In the following chart, we can see the model latency for two concurrent invocations is about 30 seconds.

We also monitor the total processing time from input in Amazon S3 to output back in Amazon S3 with TotalProcessingTime, and the time spent in backlog with the TimeInBacklog metric. In the following chart, we can see that the average time in backlog and the total processing time increase over time. The requests added at the front of the queue during the burst of traffic have a time in backlog similar to the model latency of 30 seconds. The requests at the end of the queue have the highest time in backlog, at about 3,500 seconds.

We also monitor how the endpoint scales back down to 0 after processing the complete queue. The endpoint runtime settings display the current instance count size at 0.

The following table summarizes the video inference example with a burst traffic of 1,000 video invocations.

Attribute Value
Number of invocations (total burst size) 1,000
Concurrency level 2
Instance type ml.g4dn.xlarge
Input payload size (per invocation) 71 MB
Video frame sampling rate 1 FPS
Output payload size (per invocation) 54 MB
Model latency 30 seconds
Maximum auto scaling instances 5
Throughput (requests per minute) 18
Model size 165 MB

We can optimize the endpoint configuration to get the most cost-effective instance with high performance. In this example, we use a g4dn.xlarge instance with an NVIDIA T4 GPU. We can gradually increase the concurrency level up to the throughput peak while adjusting other model server and container parameters.

For a complete list of metrics, see Monitoring Asynchronous Endpoints.

Clean up

After we complete all the requests, we can delete the endpoint similarly to deleting real-time hosted endpoints. Note that if we set the minimum capacity of the asynchronous endpoint to 0, no instance charges are incurred after it scales down to 0.

If you enabled auto scaling for your endpoint, make sure you deregister the endpoint as a scalable target before deleting the endpoint. To do this, run the following:

response = client.deregister_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount'
)

Endpoints should be deleted when no longer in use, because (per the SageMaker pricing page) they’re billed by time deployed. To do this, run the following:

sm_client.delete_endpoint(EndpointName=endpoint_name)

Conclusion

In this post, we demonstrated how to use the new asynchronous inference capability from SageMaker to process a large input payload of videos. For inference, we used a custom inference script to preprocess the videos at a predefined frame sampling rate and trigger a well-known PyTorch CV model to generate a list of outputs for each video. We addressed the challenges of burst traffic, high model processing times, and large payloads with managed queues, predefined concurrency limits, response notifications, and scale-down-to-zero capabilities. To get started with SageMaker asynchronous inference, see Asynchronous Inference and refer to the sample code for your own use cases.


About the Authors

Hasan Poonawala is a Machine Learning Specialist Solutions Architect at AWS, based in London, UK. Hasan helps customers design and deploy machine learning applications in production on AWS. He is passionate about the use of machine learning to solve business problems across various industries. In his spare time, Hasan loves to explore nature outdoors and spend time with friends and family.

Raghu Ramesha is a Software Development Engineer (AI/ML) with the Amazon SageMaker Services SA team. He focuses on helping customers migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master's degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Sean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Add-ons.

Source: https://aws.amazon.com/blogs/machine-learning/run-computer-vision-inference-on-large-videos-with-amazon-sagemaker-asynchronous-endpoints/
