An Implementation on Building Advanced Multi-Endpoint Machine Learning APIs with LitServe: Batching, Streaming, Caching, and Local Inference

In this tutorial, we explore LitServe, a lightweight yet powerful serving framework that lets us deploy machine learning models as APIs with minimal effort. We build and test multiple endpoints that demonstrate real-world functionality such as text generation, batching, streaming, multi-task processing, and caching, all running locally without relying on external APIs. By the end, we understand how to design scalable, flexible ML serving pipelines that are efficient and easy to extend toward production-level applications.

!pip install litserve torch transformers -q


import litserve as ls
import torch
from transformers import pipeline
import time
from typing import List

We begin by setting up our environment on Google Colab and installing all required dependencies, including LitServe, PyTorch, and Transformers. We then import the essential libraries and modules that will allow us to define, serve, and test our APIs efficiently.

class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        # Load the model once per worker and place it on the GPU when available.
        use_gpu = str(device).startswith("cuda") and torch.cuda.is_available()
        self.model = pipeline("text-generation", model="distilgpt2", device=0 if use_gpu else -1)
        self.device = device

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        result = self.model(prompt, max_length=100, num_return_sequences=1, temperature=0.8, do_sample=True)
        return result[0]["generated_text"]

    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}


class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        use_gpu = str(device).startswith("cuda") and torch.cuda.is_available()
        self.model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0 if use_gpu else -1)

    def decode_request(self, request):
        return request["text"]

    def batch(self, inputs: List[str]) -> List[str]:
        # LitServe collects individual requests into a single list for us.
        return inputs

    def predict(self, batch: List[str]):
        # The pipeline processes the whole batch in one forward pass.
        return self.model(batch)

    def unbatch(self, output):
        # Return a list so LitServe can map results back to individual requests.
        return output

    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}

Here, we create two LitServe APIs, one for text generation using a local DistilGPT2 model and another for batched sentiment analysis. We define how each API decodes incoming requests, performs inference, and returns structured responses, demonstrating how easy it is to build scalable, reusable model-serving endpoints.
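Although we test everything locally in this tutorial, each of these classes can also be exposed over HTTP by wrapping it in ls.LitServer. The snippet below is a minimal sketch of how that might look; the accelerator, max_batch_size, and batch_timeout arguments reflect LitServe's server-side batching options, the port is an arbitrary choice, and the exact placement of these arguments may vary across LitServe versions. We leave server.run() commented out because it blocks the running process (and the Colab cell).

# A minimal sketch of serving these APIs over HTTP (not executed in this tutorial).
server = ls.LitServer(TextGeneratorAPI(), accelerator="auto")
# server.run(port=8000)

# For the batched API, LitServe can group concurrent requests on the server side:
batched_server = ls.LitServer(BatchedSentimentAPI(), accelerator="auto", max_batch_size=8, batch_timeout=0.05)
# batched_server.run(port=8000)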

class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        use_gpu = str(device).startswith("cuda") and torch.cuda.is_available()
        self.model = pipeline("text-generation", model="distilgpt2", device=0 if use_gpu else -1)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # Simulate token-by-token generation by yielding one word at a time.
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "

    def encode_response(self, output):
        # Stream each token back to the client as soon as it is produced.
        for token in output:
            yield {"token": token}

In this section, we design a streaming text-generation API that emits tokens as they are generated. We simulate real-time streaming by yielding words one at a time, demonstrating how LitServe can handle continuous token generation efficiently.
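To actually stream these tokens to a client over HTTP, the server needs streaming enabled and the client needs to read the response incrementally. The sketch below shows both halves; the stream=True flag on ls.LitServer, the default /predict route, and the port are assumptions based on LitServe's documented defaults, and the server lines stay commented out because run() blocks the process.

import requests

# Server side (commented out because run() blocks the process):
# streaming_server = ls.LitServer(StreamingTextAPI(), stream=True)
# streaming_server.run(port=8000)

# Client side: read streamed tokens as they arrive instead of waiting for the full body.
def stream_prompt(prompt, url="http://localhost:8000/predict"):
    with requests.post(url, json={"prompt": prompt}, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_lines(decode_unicode=True):
            if chunk:
                print(chunk)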

class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        # Load both pipelines once; they run on CPU here to keep memory usage low.
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", device=-1)
        self.device = device

    def decode_request(self, request):
        return {"task": request.get("task", "sentiment"), "text": request["text"]}

    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            # Very short inputs are returned as-is instead of being summarized.
            if len(text.split()) < 30:
                return {"task": "summarize", "result": {"summary_text": text}}
            result = self.summarizer(text, max_length=50, min_length=10)[0]
            return {"task": "summarize", "result": result}
        else:
            return {"task": "unknown", "error": "Unsupported task"}

    def encode_response(self, output):
        return output

We now develop a multi-task API that handles both sentiment analysis and summarization via a single endpoint. This snippet demonstrates how we can manage multiple model pipelines through a unified interface, dynamically routing each request to the appropriate pipeline based on the specified task.
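To see the routing in action, we can exercise the summarization branch directly, just as we do for the other endpoints in the local tests later in the tutorial. The sample text below is purely illustrative and is long enough (over 30 words) to reach the summarizer rather than the short-text passthrough.

# Quick local check of the summarization branch (sample text is illustrative).
multi_api = MultiTaskAPI()
multi_api.setup("cpu")

long_text = (
    "LitServe is a lightweight serving framework that lets us deploy machine "
    "learning models as APIs with very little code. It supports batching, "
    "streaming, caching, and multi-task endpoints, and it works well with "
    "Hugging Face pipelines for fully local inference without external services."
)
decoded = multi_api.decode_request({"task": "summarize", "text": long_text})
print(multi_api.predict(decoded))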

class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        # Serve repeated inputs straight from the cache and skip the model call.
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False

    def encode_response(self, output):
        result, from_cache = output
        return {
            "label": result["label"],
            "score": float(result["score"]),
            "from_cache": from_cache,
            "cache_stats": {"hits": self.hits, "misses": self.misses},
        }

We implement an API that uses caching to store previous inference results, reducing redundant computation for repeated requests. We track cache hits and misses in real time, illustrating how simple caching mechanisms can drastically improve performance in repeated inference scenarios.
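One thing to keep in mind is that this dictionary cache grows without bound as new texts arrive. The snippet below is an illustrative extension, not part of the tutorial's API, sketching how a size-bounded LRU cache could be swapped in using collections.OrderedDict.

from collections import OrderedDict

class LRUCache:
    """Illustrative size-bounded cache: evicts the least recently used entry."""
    def __init__(self, max_size=1024):
        self.max_size = max_size
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # drop the least recently used entry

# In CachedAPI.setup we could replace `self.cache = {}` with `self.cache = LRUCache()`
# and have predict() call .get()/.put() instead of indexing the dict directly.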

def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)

    # 1. Text generation
    api1 = TextGeneratorAPI(); api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")

    # 2. Batched sentiment analysis
    api2 = BatchedSentimentAPI(); api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")

    # 3. Multi-task routing (sentiment branch)
    api3 = MultiTaskAPI(); api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")

    # 4. Caching: repeated requests should hit the cache after the first call
    api4 = CachedAPI(); api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")

    print("=" * 70)
    print("✅ All tests completed successfully!")
    print("=" * 70)


test_apis_locally()

We test all our APIs locally to verify their correctness and performance without starting an external server. We sequentially evaluate text generation, batched sentiment analysis, multi-tasking, and caching, ensuring each component of our LitServe setup runs smoothly and efficiently.
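Once the local checks pass, the same request/response cycle can be exercised over HTTP against a running LitServer instance. The client sketch below assumes the default /predict route and the port used in the earlier serving examples.

import requests

# Hedged client example for a running server (default /predict route assumed).
response = requests.post(
    "http://localhost:8000/predict",
    json={"prompt": "Artificial intelligence will"},
)
print(response.json())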

In conclusion, we create and run diverse APIs that showcase the framework's versatility. We experiment with text generation, sentiment analysis, multi-tasking, and caching to experience LitServe's seamless integration with Hugging Face pipelines. As we complete the tutorial, we see how LitServe simplifies model deployment workflows, enabling us to serve intelligent ML systems in just a few lines of Python code while maintaining flexibility, performance, and simplicity.

