Hosting NVIDIA speech NIM models on Amazon SageMaker AI: Parakeet ASR

This post was written with NVIDIA and the authors would like to thank Adi Margolin, Eliuth Triana, and Maryam Motamedi for their collaboration.

Organizations today face the challenge of processing large volumes of audio data–from customer calls and meeting recordings to podcasts and voice messages–to unlock valuable insights. Automatic Speech Recognition (ASR) is a critical first step in this process, converting speech to text so that further analysis can be performed. However, running ASR at scale is computationally intensive and can be expensive. This is where asynchronous inference on Amazon SageMaker AI comes in. By deploying state-of-the-art ASR models (like NVIDIA Parakeet models) on SageMaker AI with asynchronous endpoints, you can handle large audio files and batch workloads efficiently. With asynchronous inference, long-running requests can be processed in the background (with results delivered later); it also supports auto-scaling to zero when there’s no work and handles spikes in demand without blocking other jobs.

In this blog post, we’ll explore how to host the NVIDIA Parakeet ASR model on SageMaker AI and integrate it into an asynchronous pipeline for scalable audio processing. We’ll also highlight the benefits of Parakeet’s architecture and the NVIDIA Riva toolkit for speech AI, and discuss how to use NVIDIA NIM for deployment on AWS.

NVIDIA speech AI technologies: Parakeet ASR and Riva Framework

NVIDIA offers a comprehensive suite of speech AI technologies, combining high-performance models with efficient deployment solutions. At its core, the Parakeet ASR model family represents state-of-the-art speech recognition capabilities, achieving industry-leading accuracy with low word error rates (WERs). The model’s architecture uses a Fast Conformer encoder with a CTC or transducer decoder, enabling 2.4× faster processing than standard Conformers while maintaining accuracy.

NVIDIA speech NIM is a collection of GPU-accelerated microservices for building customizable speech AI applications. NVIDIA speech models deliver accurate transcription and natural, expressive voices in over 36 languages, ideal for customer service, contact centers, accessibility, and global enterprise workflows. Developers can fine-tune and customize models for specific languages, accents, domains, and vocabularies, improving accuracy and brand voice alignment.

Seamless integration with LLMs and NVIDIA NeMo Retriever makes NVIDIA models ideal for agentic AI applications, helping your organization stand out with more secure, high-performing voice AI. The NIM framework delivers these services as containerized solutions, making deployment straightforward through Docker containers that include the necessary dependencies and optimizations.

This combination of high-performance models and deployment tools provides organizations with a complete solution for implementing speech recognition at scale.

Solution overview

The architecture illustrated in the diagram showcases a comprehensive asynchronous inference pipeline designed specifically for ASR and summarization workloads, providing a robust, scalable, and cost-effective processing pipeline.

Architecture components

The architecture consists of five key components working together to create an efficient audio processing pipeline. At its core, the SageMaker AI asynchronous endpoint hosts the Parakeet ASR model with auto scaling capabilities that can scale to zero when idle for cost optimization.

  1. The data ingestion process begins when audio files are uploaded to Amazon Simple Storage Service (Amazon S3), triggering AWS Lambda functions that process metadata and initiate the workflow.
  2. For event processing, the SageMaker endpoint automatically sends Amazon Simple Notification Service (Amazon SNS) success and failure notifications through separate topics, enabling proper handling of transcriptions (a configuration sketch follows this list).
  3. Successfully transcribed content stored in Amazon S3 is passed to Amazon Bedrock LLMs for intelligent summarization and additional processing such as classification and insights extraction.
  4. Finally, a comprehensive tracking system using Amazon DynamoDB stores workflow status and metadata, enabling real-time monitoring and analytics of the entire pipeline.
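
The SNS integration is configured when the asynchronous endpoint is created. Below is a minimal sketch using the SageMaker Python SDK; the bucket name, topic ARNs, and the model object (built with one of the deployment options covered later) are placeholders.

# Minimal sketch of configuring SNS notifications for an asynchronous endpoint;
# bucket name, topic ARNs, and the model object are placeholders
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://your-bucket/async-output/",   # where transcription results are written
    max_concurrent_invocations_per_instance=2,
    notification_config={
        "SuccessTopic": "arn:aws:sns:us-east-1:123456789012:success-inf",
        "ErrorTopic": "arn:aws:sns:us-east-1:123456789012:failed-inf",
    },
)

# model is a sagemaker.Model built with one of the deployment options covered later
async_predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=async_config,
)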

Detailed implementation walkthrough

In this section, we provide a detailed walkthrough of the solution implementation.

SageMaker asynchronous endpoint prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role that has least-privilege permissions to manage the resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker async hosting instances. In this example, we need one ml.g5.xlarge SageMaker async hosting instance and an ml.g5.xlarge SageMaker notebook instance. You can also choose a different integrated development environment (IDE), but make sure the environment contains GPU compute resources for local testing.

SageMaker asynchronous endpoint configuration

When you deploy a custom model like Parakeet, SageMaker AI offers several options:

  • Use a NIM container provided by NVIDIA
  • Use a large model inference (LMI) container
  • Use a prebuilt PyTorch container

We’ll provide examples for all three approaches.

Using an NVIDIA NIM container

NVIDIA NIM provides a streamlined approach to deploying optimized AI models through containerized solutions. Our implementation takes this concept further by creating a unified SageMaker AI endpoint that intelligently routes between HTTP and gRPC protocols to help maximize both performance and capabilities while simplifying the deployment process.

Innovative dual-protocol architecture

The key innovation is the combined HTTP + gRPC architecture that exposes a single SageMaker AI endpoint with intelligent routing capabilities. This design addresses the common challenge of choosing between protocol efficiency and feature completeness by automatically selecting the optimal transport method. The HTTP route is optimized for simple transcription tasks with files under 5MB, providing faster processing and lower latency for common use cases. Meanwhile, the gRPC route supports larger files (SageMaker AI real-time endpoints support a max payload of 25MB) and advanced features like speaker diarization with precise word-level timing information.

The system’s auto-routing functionality analyzes incoming requests to determine file size and requested features, then automatically selects the most appropriate protocol without requiring manual configuration. For applications that need explicit control, the endpoint also supports forced routing through /invocations/http for simple transcription or /invocations/grpc when speaker diarization is required. This flexibility allows both automated optimization and fine-grained control based on specific application requirements.
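
To make the routing behavior concrete, the following is a hypothetical sketch of the decision logic described above; the actual NIM container implementation and thresholds may differ.

# Hypothetical sketch of the auto-routing decision; actual container logic may differ
MAX_HTTP_PAYLOAD_BYTES = 5 * 1024 * 1024   # ~5MB threshold for the HTTP route

def choose_route(path: str, payload_size: int, wants_diarization: bool) -> str:
    """Return 'http' or 'grpc' for an incoming invocation request."""
    # Forced routing always takes precedence
    if path.endswith("/invocations/http"):
        return "http"
    if path.endswith("/invocations/grpc"):
        return "grpc"
    # Auto-routing: small, plain transcriptions use HTTP; large files or
    # speaker diarization requests use the gRPC (Riva) route
    if wants_diarization or payload_size > MAX_HTTP_PAYLOAD_BYTES:
        return "grpc"
    return "http"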

Advanced speech recognition and speaker diarization capabilities

The NIM container enables a comprehensive audio processing pipeline that seamlessly combines speech recognition with speaker identification through NVIDIA Riva’s integrated capabilities. The container handles audio preprocessing, including format conversion and segmentation, while ASR and speaker diarization processes run concurrently on the same audio stream. Results are automatically aligned using overlapping time segments, with each transcribed segment receiving appropriate speaker labels (for example, Speaker_0, Speaker_1). The inference handler processes audio files through the complete pipeline, initializing both ASR and speaker diarization services, running them in parallel, and aligning transcription segments with speaker labels. The output includes the full transcription, timestamped segments with speaker attribution, confidence scores, and total speaker count in a structured JSON format.
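
For illustration, a diarized response might look like the following; the exact field names in the container’s JSON output may differ.

# Illustrative response shape only; exact field names may differ in the NIM container output
example_response = {
    "transcription": "hello thanks for calling how can i help you today i would like to check my order",
    "segments": [
        {"start": 0.0, "end": 3.1, "speaker": "Speaker_0",
         "text": "hello thanks for calling how can i help you today", "confidence": 0.97},
        {"start": 3.4, "end": 5.8, "speaker": "Speaker_1",
         "text": "i would like to check my order", "confidence": 0.95},
    ],
    "num_speakers": 2,
}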

Implementation and deployment

The implementation extends the NVIDIA parakeet-1-1b-ctc-en-us NIM container as the foundation, adding a Python aiohttp server that seamlessly manages the complete NIM lifecycle by automatically starting and monitoring the service. The server handles protocol adaptation by translating SageMaker inference requests to the appropriate NIM APIs, implements the intelligent routing logic that analyzes request characteristics, and provides comprehensive error handling with detailed error messages and fallback mechanisms for robust production deployment. The containerized solution streamlines deployment through standard Docker and AWS CLI commands, featuring a pre-configured Dockerfile with the necessary dependencies and optimizations. The system accepts multiple input formats, including multipart form-data (recommended for maximum compatibility), JSON with base64 encoding for simple integration scenarios, and raw binary uploads for direct audio processing.
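
As a simple client-side example, the sketch below invokes the endpoint with the JSON/base64 input format using boto3; the endpoint name and request field names are illustrative.

# Client sketch for the JSON + base64 input format; endpoint name and field names are illustrative
import base64
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = runtime.invoke_endpoint(
    EndpointName="parakeet-nim-endpoint",            # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({"audio": audio_b64, "enable_diarization": True}),
)
result = json.loads(response["Body"].read())
print(result.get("transcription"))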

For detailed implementation instructions and working examples, teams can reference the complete implementation and deployment notebook in the AWS samples repository, which provides comprehensive guidance on deploying Parakeet ASR with NIM on SageMaker AI using the bring your own container (BYOC) approach. For organizations with specific architectural preferences, separate HTTP-only and gRPC-only implementations are also available, providing simpler deployment models for teams with well-defined use cases while the combined implementation offers maximum flexibility and automatic optimization.

AWS customers can deploy these models either as production-grade NVIDIA NIM containers directly from SageMaker Marketplace or JumpStart, or as open source NVIDIA models available on Hugging Face, which can be deployed through custom containers on SageMaker or Amazon Elastic Kubernetes Service (Amazon EKS). This allows organizations to choose between fully managed, enterprise-tier endpoints with auto-scaling and security, or flexible open-source development for research or constrained use cases.

Using an AWS LMI container

LMI containers are designed to simplify hosting large models on AWS. These containers include optimized inference engines like vLLM, FasterTransformer, or TensorRT-LLM that can automatically handle things like model parallelism, quantization, and batching for large models. The LMI container is essentially a pre-configured Docker image that runs an inference server (for example, a Python server with these optimizations) and allows you to specify model parameters by using environment variables.

To use the LMI container for Parakeet, we would typically:

  1. Choose the appropriate LMI image: AWS provides different LMI images for different frameworks. For Parakeet, we might use the DJLServing image for efficient inference. Alternatively, NVIDIA Triton Inference Server (which Riva uses) is an option if we package the model in ONNX or TensorRT format.
  2. Specify the model configuration: With LMI, we often provide a model_id (if pulling from Hugging Face Hub) or a path to our model, along with configuration for how to load it (number of GPUs, tensor parallel degree, quantization bits). The container then downloads the model and initializes it with the specified settings. We can also download our own model files from Amazon S3 instead of using the Hub.
  3. Define the inference handler: The LMI container might require a small handler script or configuration to tell it how to process requests. For ASR, this might involve reading the audio input, passing it to the model, and returning text.

AWS LMI containers deliver high performance and scalability through advanced optimization techniques, including continuous batching, tensor parallelism, and state-of-the-art quantization methods. LMI containers integrate multiple inference backends (vLLM, TensorRT-LLM) through a single unified configuration, helping users seamlessly experiment and switch between frameworks to find the optimal performance stack for their specific use case.
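
The sketch below illustrates how this configuration is typically supplied through environment variables on the model object; the image URI, model ID, and option values are placeholders, and serving an ASR model this way would still require the handler described in step 3.

# Illustrative LMI configuration via environment variables; image URI and options are placeholders
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()

lmi_model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/djl-inference:latest",  # placeholder LMI image
    role=role,
    env={
        "HF_MODEL_ID": "nvidia/parakeet-ctc-1.1b",    # or an s3:// path to your own artifacts
        "OPTION_TENSOR_PARALLEL_DEGREE": "1",         # single GPU on ml.g5.xlarge
        "OPTION_DTYPE": "fp16",
    },
)

lmi_predictor = lmi_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)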

Using a SageMaker PyTorch container

SageMaker offers PyTorch Deep Learning Containers (DLCs) that come with PyTorch and many common libraries pre-installed. In this example, we demonstrate how to extend the prebuilt container to install the packages the model requires. You can download the model directly from Hugging Face during endpoint creation, or download the Parakeet model artifacts, package them with the necessary configuration files into a model.tar.gz archive, and upload the archive to Amazon S3. Along with the model artifacts, an inference.py script is required as the entry point to define model loading and inference logic, including audio preprocessing and transcription handling. When you use the SageMaker Python SDK to create a PyTorchModel, the SDK automatically repackages the model archive to include the inference script under /opt/ml/model/code/inference.py, while keeping model artifacts in /opt/ml/model/ on the endpoint. Once the endpoint is deployed successfully, it can be invoked through the predict API by sending audio files as byte streams to get transcription results.
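
The following is a sketch of what such an inference.py might look like, assuming the NeMo toolkit is installed in the extended container; function names follow the SageMaker inference toolkit conventions, and the checkpoint path is illustrative.

# Sketch of an inference.py entry point; assumes the NeMo toolkit is installed in the container
import json
import tempfile

import nemo.collections.asr as nemo_asr


def model_fn(model_dir):
    # Load the Parakeet checkpoint packaged in model.tar.gz (file name is illustrative)
    return nemo_asr.models.ASRModel.restore_from(f"{model_dir}/parakeet.nemo")


def input_fn(request_body, content_type="audio/wav"):
    # The endpoint receives raw audio bytes; write them to a temp file for NeMo
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.write(request_body)
    tmp.close()
    return tmp.name


def predict_fn(audio_path, model):
    # transcribe() returns a list of hypotheses (plain strings in most NeMo releases)
    hypotheses = model.transcribe([audio_path])
    first = hypotheses[0]
    return first if isinstance(first, str) else getattr(first, "text", str(first))


def output_fn(prediction, accept="application/json"):
    return json.dumps({"transcription": prediction})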

SageMaker real-time endpoints currently allow a maximum payload size of 25MB, so make sure the container is also configured to accept requests of that size. If you plan to use the same model behind an asynchronous endpoint, note that async endpoints support payloads of up to 1GB and response times of up to 1 hour, so the container should be set up for that payload size and timeout. When using the PyTorch containers, here are some key configuration parameters to consider (a deployment sketch follows the list):

  • SAGEMAKER_MODEL_SERVER_WORKERS: Sets the number of model server workers; each worker loads a copy of the model into GPU memory.
  • TS_DEFAULT_RESPONSE_TIMEOUT: Sets the response timeout for TorchServe workers; for long audio processing, set this to a higher value.
  • TS_MAX_REQUEST_SIZE: Sets the maximum request size in bytes; set this to 1GB for async endpoints.
  • TS_MAX_RESPONSE_SIZE: Sets the maximum response size in bytes.
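
The sketch below shows these settings applied to a PyTorchModel deployed behind an asynchronous endpoint; the framework version, artifact location, and values are illustrative, and async_config refers to the AsyncInferenceConfig sketch shown earlier.

# Applying the TorchServe settings to a PyTorchModel for an async endpoint; values are illustrative
import sagemaker
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()

pytorch_model = PyTorchModel(
    model_data="s3://your-bucket/parakeet/model.tar.gz",   # packaged model artifacts
    role=role,
    entry_point="inference.py",
    framework_version="2.3",
    py_version="py311",
    env={
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",      # one model copy loaded into GPU memory
        "TS_DEFAULT_RESPONSE_TIMEOUT": "3600",      # allow long-running audio transcription
        "TS_MAX_REQUEST_SIZE": "1000000000",        # ~1GB requests for async workloads
        "TS_MAX_RESPONSE_SIZE": "1000000000",
    },
)

async_predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=async_config,            # AsyncInferenceConfig from the earlier sketch
)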

In the example notebook, we also showcase how to leverage the SageMaker local session provided by the SageMaker Python SDK. It helps you create estimators and run training, processing, and inference jobs locally using Docker containers instead of managed AWS infrastructure, providing a fast way to test and debug your machine learning scripts before scaling to production.
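
A minimal local-mode sketch looks like the following; it assumes Docker is running locally on a GPU-capable machine and reuses the inference.py entry point from above.

# Local-mode sketch: runs the container on the local machine instead of a managed endpoint
import sagemaker
from sagemaker.local import LocalSession
from sagemaker.pytorch import PyTorchModel

local_session = LocalSession()
local_session.config = {"local": {"local_code": True}}

local_model = PyTorchModel(
    model_data="file://./model.tar.gz",     # local artifacts; an s3:// URI also works
    role=sagemaker.get_execution_role(),
    entry_point="inference.py",
    framework_version="2.3",
    py_version="py311",
    sagemaker_session=local_session,        # route SDK calls to local Docker containers
)

local_predictor = local_model.deploy(
    initial_instance_count=1,
    instance_type="local_gpu",              # use the local GPU for testing
)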

CDK pipeline prerequisites

Before deploying this solution, make sure you have:

  1. AWS CLI configured with appropriate permissions – Installation Guide
  2. AWS Cloud Development Kit (AWS CDK) installed – Installation Guide
  3. Node.js 18+ and Python 3.9+ installed
  4. Docker – Installation Guide
  5. SageMaker endpoint deployed with your ML model (Parakeet ASR models or similar)
  6. Amazon SNS topics created for success and failure notifications

CDK pipeline setup

The solution deployment begins with provisioning the necessary AWS resources using Infrastructure as Code (IaC) principles. AWS CDK creates the foundational components including:

  • DynamoDB Table: Configured for on-demand capacity to track invocation metadata, processing status, and results
  • S3 Buckets: Secure storage for input audio files, transcription outputs, and summarization results
  • SNS topics: Separate topics for success and failure event handling
  • Lambda functions: Serverless functions for metadata processing, status updates, and workflow orchestration
  • IAM roles and policies: Appropriate permissions for cross-service communication and resource access

Environment setup

Clone the repository and install dependencies:

# Install degit, a library for downloading specific subdirectories
npm install -g degit

# Clone just the specific folder
npx degit aws-samples/genai-ml-platform-examples/infrastructure/automated-speech-recognition-async-pipeline-sagemaker-ai/sagemaker-async-batch-inference-cdk sagemaker-async-batch-inference-cdk

# Navigate to folder
cd sagemaker-async-batch-inference-cdk

# Install Node.js dependencies
npm install

# Set up Python virtual environment
python3 -m venv .venv
source .venv/bin/activate

# On Windows:
.venv\Scripts\activate
pip install -r requirements.txt

Configuration

Update the SageMaker endpoint configuration in bin/aws-blog-sagemaker.ts:

vim bin/aws-blog-sagemaker.ts

// Update the endpoint name
sageMakerConfig: {
    endpointName: 'your-sagemaker-endpoint-name',
    enableSageMakerAccess: true
}

If you have followed the notebook to deploy the endpoint, you should have already created the two SNS topics. Otherwise, make sure you create them using the AWS CLI:

# Create SNS topics
aws sns create-topic --name success-inf
aws sns create-topic --name failed-inf

Build and deploy

Before you deploy the AWS CloudFormation template, make sure Docker is running.

# Compile TypeScript to JavaScript
npm run build

# Bootstrap CDK (first time only)
npx cdk bootstrap

# Deploy the stack
npx cdk deploy

Verify deployment

After successful deployment, note the output values:

  • DynamoDB table name for status tracking
  • Lambda function ARNs for processing and status updates
  • SNS topic ARNs for notifications

Submit audio file for processing

Update the upload_audio_invoke_lambda.sh script with your Lambda function ARN and S3 bucket:

LAMBDA_ARN="YOUR_LAMBDA_FUNCTION_ARN"
S3_BUCKET="YOUR_S3_BUCKET_ARN"

Run the script:

AWS_PROFILE=default ./scripts/upload_audio_invoke_lambda.sh

This script will:

  • Download a sample audio file
  • Upload the audio file to your S3 bucket
  • Send the bucket path to Lambda and trigger the transcription and summarization pipeline

Monitoring progress

You can check the results in the DynamoDB table using the following command:

aws dynamodb scan --table-name YOUR_DYNAMODB_TABLE_NAME

Check processing status in the DynamoDB table:

  • submitted: Successfully queued for inference
  • completed: Transcription completed successfully
  • failed: Processing encountered an error

Audio processing and workflow orchestration

The core processing workflow follows an event-driven pattern:

Initial processing and metadata extraction: When audio files are uploaded to S3, the triggered Lambda function analyzes the file metadata, validates format compatibility, and creates detailed invocation records in DynamoDB. This facilitates comprehensive tracking from the moment audio content enters the system.

Asynchronous Speech Recognition: Audio files are processed through the SageMaker endpoint using optimized ASR models. The asynchronous process can handle various file sizes and durations without timeout concerns. Each processing request is assigned a unique identifier for tracking purposes.
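
A simplified sketch of this submission step is shown below; the endpoint name, table name, and attribute names are placeholders and differ from the actual CDK stack.

# Sketch of submitting one audio file for async inference and recording its status;
# endpoint, table, and attribute names are placeholders
import boto3

sm_runtime = boto3.client("sagemaker-runtime")
table = boto3.resource("dynamodb").Table("audio-processing-status")   # placeholder table name


def submit_audio(s3_uri: str) -> str:
    response = sm_runtime.invoke_endpoint_async(
        EndpointName="parakeet-async-endpoint",    # placeholder endpoint name
        InputLocation=s3_uri,                      # audio file already uploaded to S3
        ContentType="audio/wav",
    )
    inference_id = response["InferenceId"]
    table.put_item(Item={
        "inference_id": inference_id,
        "input_location": s3_uri,
        "status": "submitted",
    })
    return inference_id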

Success path processing: Upon successful transcription, the system automatically initiates the summarization workflow. The transcribed text is sent to Amazon Bedrock, where advanced language models generate contextually appropriate summaries based on configurable parameters such as summary length, focus areas, and output format.
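
A hedged sketch of this summarization call is shown below; the model ID and prompt are illustrative rather than the pipeline’s exact configuration.

# Summarization sketch using the Amazon Bedrock Converse API; model ID and prompt are illustrative
import boto3

bedrock = boto3.client("bedrock-runtime")


def summarize(transcript: str) -> str:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",   # placeholder model ID
        messages=[{
            "role": "user",
            "content": [{"text": f"Summarize the following call transcript:\n\n{transcript}"}],
        }],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]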

Error handling and recovery: Failed processing attempts trigger dedicated Lambda functions that log detailed error information, update processing status, and can initiate retry logic for transient failures. This robust error handling results in minimal data loss and provides clear visibility into processing issues.

Real-world applications

Customer service analytics: Organizations can process thousands of customer service call recordings to generate transcriptions and summaries, enabling sentiment analysis, quality assurance, and insights extraction at scale.

Meeting and conference processing: Enterprise teams can automatically transcribe and summarize meeting recordings, creating searchable archives and actionable summaries for participants and stakeholders.

Media and content processing: Media companies can process podcast episodes, interviews, and video content to generate transcriptions and summaries for improved accessibility and content discoverability.

Compliance and legal documentation: Legal and compliance teams can process recorded depositions, hearings, and interviews to create accurate transcriptions and summaries for case preparation and documentation.

Cleanup

Once you have used the solution, remove the SageMaker endpoints to prevent incurring additional costs. You can use the provided code to delete real-time and asynchronous inference endpoints, respectively:

# Delete the real-time inference endpoint
real_time_predictor.delete_endpoint()

# Delete the asynchronous inference endpoint
async_predictor.delete_endpoint()

You should also delete all the resources created by the CDK stack.

# Delete the CDK stack
npx cdk destroy

Conclusion

The integration of powerful NVIDIA speech AI technologies with AWS cloud infrastructure creates a comprehensive solution for large-scale audio processing. By combining Parakeet ASR’s industry-leading accuracy and speed with NVIDIA Riva’s optimized deployment framework on the Amazon SageMaker asynchronous inference pipeline, organizations can achieve both high-performance speech recognition and cost-effective scaling. The solution leverages the managed services of AWS (SageMaker AI, Lambda, S3, and Bedrock) to create an automated, scalable pipeline for processing audio content. With features like auto scaling to zero, comprehensive error handling, and real-time monitoring through DynamoDB, organizations can focus on extracting business value from their audio content rather than managing infrastructure complexity. Whether processing customer service calls, meeting recordings, or media content, this architecture delivers reliable, efficient, and cost-effective audio processing capabilities. To experience its full potential, we encourage you to explore the solution and reach out to us if you have specific business requirements and would like to customize it for your use case.


About the authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Tony Trinh is a Senior AI/ML Specialist Architect at AWS. With 13+ years of experience in the IT industry, Tony specializes in architecting scalable, compliance-driven AI and ML solutions—particularly in generative AI, MLOps, and cloud-native data platforms. As part of his PhD, he’s doing research in Multimodal AI and Spatial AI. In his spare time, Tony enjoys hiking, swimming and experimenting with home improvement.

Alick Wong is a Senior Solutions Architect at Amazon Web Services, where he helps startups and digital-native businesses modernize, optimize, and scale their platforms in the cloud. Drawing on his experience as a former startup CTO, he works closely with founders and engineering leaders to drive growth and innovation on AWS.

Andrew Smith is a Sr. Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.

Derrick Choo is a Senior AI/ML Specialist Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multi-modal systems.

Tim Ma is a Principal Specialist in Generative AI at AWS, where he collaborates with customers to design and deploy cutting-edge machine learning solutions. He also leads go-to-market strategies for generative AI services, helping organizations harness the potential of advanced AI technologies.

Curt Lockhart is an AI Solutions Architect at NVIDIA, where he helps customers deploy language and vision models to build end-to-end AI workflows using NVIDIA’s tooling on AWS. He enjoys making complex AI feel approachable and spending his time exploring the art, music, and outdoors of the Pacific Northwest.

Francesco Ciannella is a senior engineer at NVIDIA, where he works on conversational AI solutions built around large language models (LLMs) and audio language models (ALMs). He holds a M.S. in engineering of telecommunications from the University of Rome “La Sapienza” and an M.S. in language technologies from the School of Computer Science at Carnegie Mellon University.