A deep dive into Java performance optimization: how AsyncContext transformed our service virtualization engine from thread-starved to thread-smart
Picture this: It's 3 AM, and your service virtualization engine is choking. Your thread pool is exhausted, response times are through the roof, and your high-volume performance tests are failing miserably. Sound familiar?
This was our reality when we hit the wall with our SV (Service Virtualization) engine. We needed high throughput Java processing to handle over 1000 requests per second, many with artificial delays to simulate real-world service behavior. The solution? A complete rethinking of our API scaling techniques and async processing implementation, without changing a single line of client code.
The Problem: When Thread.sleep() Creates Performance Bottlenecks
Our original implementation seemed straightforward enough:
@RequestMapping("/**")
public void handleRequest(HttpServletRequest request, HttpServletResponse response)
        throws IOException, InterruptedException {
    // Process the request
    ProcessedResponse result = processRequest(request);

    // Simulate delay if configured
    if (result.hasDelay()) {
        Thread.sleep(result.getDelayMillis()); // 😱 The killer line
    }

    // Send response
    response.getWriter().write(result.getBody());
}
Looks innocent, right? But here's what happens under load:
- Request 1 arrives → Thread 1 handles it → Needs 5-second delay → Thread 1 sleeps for 5 seconds
- Request 2 arrives → Thread 2 handles it → Needs 5-second delay → Thread 2 sleeps for 5 seconds
- ...
- Request 200 arrives → No threads available → Request queued or rejected 💥
With 1000+ requests per second and many of them carrying delays, we were essentially running a "thread hotel" where guests checked in but didn't check out for 5-20 seconds. Even with its pool grown to 800 threads, our 8-CPU machine was suffering severe thread pool exhaustion, bringing the system to its knees.
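The arithmetic behind that "thread hotel" is worth spelling out. By Little's Law, average in-flight requests = arrival rate × average residence time, and with simulated delays the residence time is dominated by the delay itself. A quick sketch with our load test's delay mix (numbers illustrative, class name ours):

```java
// Back-of-the-envelope check via Little's Law: L = λ × W.
public class LittlesLaw {

    // Average number of requests in flight at once.
    public static long concurrentRequests(double arrivalRatePerSec,
                                          double avgResidenceSec) {
        return Math.round(arrivalRatePerSec * avgResidenceSec);
    }

    public static void main(String[] args) {
        // 50% 5s + 30% 2s + 20% 0s => 3.1s average delay (our test mix)
        double avgDelay = 0.5 * 5 + 0.3 * 2 + 0.2 * 0;
        long inFlight = concurrentRequests(1000, avgDelay);
        System.out.println("Average in-flight requests: " + inFlight);
        // One blocked thread per in-flight request means even an
        // 800-thread pool is saturated several times over.
    }
}
```

At 1000 req/sec with a ~3.1-second average residence time, roughly 3100 requests are in flight at any instant, so a thread-per-request model needs thousands of threads just to stand still.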
The Lightbulb Moment: Understanding Async Processing vs Blocking Threads
Here's where many developers get confused about Java concurrency optimization (we did too): Asynchronous processing doesn't mean the client gets an immediate response. The client still waits. The magic is in efficient thread management on the server side.
Think of it like a restaurant:
Synchronous: One waiter takes your order, goes to the kitchen, waits while your food cooks, then brings it to you. That waiter is stuck with you the entire time.
Asynchronous: One waiter takes your order, tells the kitchen, then serves other tables. When your food is ready, any available waiter brings it to you.
You still wait the same time for your food, but the restaurant can serve many more customers!
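The restaurant maps directly onto JDK primitives. In this minimal, self-contained sketch, `CompletableFuture.delayedExecutor` plays the kitchen timer, and the caller of `serveAsync` still waits just like the customer does; class and method names are ours for illustration:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class WaiterDemo {

    // Blocking waiter: occupies the calling thread for the whole delay.
    static String serveBlocking(long delayMs) throws InterruptedException {
        Thread.sleep(delayMs);
        return "meal served";
    }

    // Async waiter: the delay lives in a timer, and the completion runs on
    // a pooled thread once it fires. The caller still waits on get(), just
    // as the customer still waits for the food.
    static String serveAsync(long delayMs) throws Exception {
        CompletableFuture<String> meal = CompletableFuture.supplyAsync(
                () -> "meal served",
                CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS));
        return meal.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("sync:  " + serveBlocking(1000));
        System.out.println("async: " + serveAsync(1000));
    }
}
```

Both calls take the same wall-clock time from the caller's point of view; the difference is which thread pays for the wait.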
The Solution: AsyncContext for Non-Blocking Request Processing
Here's how we implemented async servlet processing to transform our blocking code into a non-blocking powerhouse:
@RequestMapping("/**")
public void handleRequest(HttpServletRequest request, HttpServletResponse response) {
    // Start async processing
    AsyncContext asyncContext = request.startAsync();
    asyncContext.setTimeout(60000);

    // Submit to executor service - original thread is FREE!
    executorService.submit(() -> {
        try {
            processRequestAsync(request, response, asyncContext);
        } catch (Exception e) {
            handleError(asyncContext, e);
        }
    });

    // Original thread returns immediately to handle more requests
}
private void processRequestAsync(HttpServletRequest request,
                                 HttpServletResponse response,
                                 AsyncContext asyncContext) {
    // Process the request
    ProcessedResponse result = processRequest(request);

    if (result.hasDelay()) {
        // Schedule the response for later - no blocking!
        scheduledExecutor.schedule(() -> {
            writeResponse(response, result);
            asyncContext.complete(); // Signal we're done
        }, result.getDelayMillis(), TimeUnit.MILLISECONDS);
    } else {
        // Immediate response
        writeResponse(response, result);
        asyncContext.complete();
    }
}
The Performance Optimization Results
The transformation was dramatic:
Before (Blocking Approach):
- Thread Usage: 800 threads at max capacity
- Memory: ~800MB just for thread stacks
- Throughput: Struggling at 1000 req/sec
- During Delays: Threads just sleeping, doing nothing
After (Async Approach):
- Thread Usage: 200 threads handling same load
- Memory: ~200MB for thread stacks (75% reduction!)
- Throughput: Comfortable at 1500+ req/sec
- During Delays: Threads serving other requests
Here's a visualization of the difference:
BEFORE (Blocking):
Thread-1: [──Request-1──][────Sleep 5s────][──Response──]
Thread-2: [──Request-2──][────Sleep 5s────][──Response──]
Thread-3: [──Request-3──][────Sleep 5s────][──Response──]
(All threads blocked during sleep)
AFTER (Async):
Thread-1: [Req-1][Req-4][Req-7][Req-10]...
Thread-2: [Req-2][Req-5][Req-8][Req-11]...
Thread-3: [Req-3][Req-6][Req-9][Req-12]...
Scheduler: ────5s later───→[Resp-1][Resp-2][Resp-3]
(Threads free during delays)
The Client's Perspective: Absolutely Nothing Changed!
This is the beautiful part. Our clients' code remained exactly the same:
# Before our changes
$ curl http://api.example.com/delayed-endpoint
# Wait 5 seconds...
{"response": "data"}
# After our changes
$ curl http://api.example.com/delayed-endpoint
# Still wait 5 seconds...
{"response": "data"}
The HTTP contract didn't change. Clients still:
- Send a request
- Wait for the response
- Get the complete response
No websockets, no polling, no callbacks. Just good old HTTP request-response, but now our server can handle 10x the load!
Thread Management Best Practices: The Gotchas We Hit
1. Don't Block the Executor Threads
// ❌ BAD - Still blocking!
executorService.submit(() -> {
    processRequest();
    try {
        Thread.sleep(5000); // You're still blocking a thread!
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    sendResponse();
});

// ✅ GOOD - Truly async
executorService.submit(() -> {
    processRequest();
    scheduledExecutor.schedule(() -> {
        sendResponse();
    }, 5000, TimeUnit.MILLISECONDS);
});
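The difference is measurable from the worker's side: with the scheduled variant, the thread is occupied only for the hand-off, not for the delay. A self-contained demo (names and timings are illustrative):

```java
import java.util.concurrent.*;

// The worker thread hands the delay to a scheduler and is free almost
// immediately, while the "response" still fires after the full delay.
public class NonBlockingDelayDemo {

    // Returns how long (ms) the worker thread was occupied for one request.
    static long workerBusyMillis(long delayMs) throws Exception {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        ExecutorService worker = Executors.newFixedThreadPool(1);
        CountDownLatch responded = new CountDownLatch(1);
        try {
            long start = System.nanoTime();
            // Simulated request: processing is instant, the delay is scheduled.
            worker.submit(() ->
                    scheduler.schedule(responded::countDown,
                            delayMs, TimeUnit.MILLISECONDS))
                  .get();                      // wait for the hand-off only
            long busy = (System.nanoTime() - start) / 1_000_000;
            responded.await();                 // response still arrives later
            return busy;
        } finally {
            worker.shutdown();
            scheduler.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("worker busy for ~" + workerBusyMillis(1000)
                + " ms, response delivered after the full 1000 ms");
    }
}
```

The worker is busy for a few milliseconds per request regardless of the configured delay, which is exactly why a small pool can keep up with thousands of delayed responses.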
2. Always Complete the AsyncContext
// Every path must call complete()!
try {
    // ... processing ...
    asyncContext.complete();
} catch (Exception e) {
    response.setStatus(500);
    response.getWriter().write("Error: " + e.getMessage());
    asyncContext.complete(); // Even on error!
}
3. Thread Pool Sizing Best Practices
With async servlet processing, you need fewer processing threads and more scheduled threads:
# Before (blocking)
processing.threads=800
scheduled.threads=20
# After (async)
processing.threads=200 # 75% reduction!
scheduled.threads=200 # Handles all delays
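Those property values only matter if the pools are actually constructed to honor them. A sketch of matching construction in plain JDK code (sizes and the bounded queue capacity are illustrative):

```java
import java.util.concurrent.*;

// Pools sized to match the async configuration above. The bounded queue
// makes overload visible (rejected tasks) instead of hiding it in
// unbounded memory growth.
public class AsyncPools {

    static ThreadPoolExecutor newProcessingPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy()); // fail fast when saturated
    }

    static ScheduledExecutorService newDelayScheduler(int threads) {
        // Scheduled threads spend nearly all their time idle in the timer,
        // so they are cheap compared to blocked request threads.
        return Executors.newScheduledThreadPool(threads);
    }

    public static void main(String[] args) {
        ThreadPoolExecutor processing = newProcessingPool(200, 1000);
        ScheduledExecutorService scheduler = newDelayScheduler(200);
        System.out.println("processing max threads: " + processing.getMaximumPoolSize());
        processing.shutdown();
        scheduler.shutdown();
    }
}
```

The `AbortPolicy` choice is deliberate: when the processing pool is saturated, we would rather reject early and see it in our metrics than queue without bound.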
The Performance Test That Made Us Believers
We ran a load test simulating our production scenario:
- 1000 requests/second
- 50% of requests with 5-second delays
- 30% with 2-second delays
- 20% with no delay
Results:
- Blocking Implementation: Failed at 400 req/sec (thread exhaustion)
- Async Implementation: Handled 1500 req/sec with capacity to spare
The real kicker? Response time consistency. With blocking threads, response times became erratic under load. With async, they remained predictable even at peak load.
Beyond Thread.sleep(): Scalable Java Concurrency Patterns
While our immediate problem was Thread.sleep(), these Java concurrency optimization patterns help with any blocking operation:
// Database calls
CompletableFuture<User> userFuture =
    CompletableFuture.supplyAsync(() -> userDao.findById(id));

// External API calls
CompletableFuture<Weather> weatherFuture =
    CompletableFuture.supplyAsync(() -> weatherService.getWeather(city));

// Combine results without blocking
CompletableFuture.allOf(userFuture, weatherFuture)
    .thenRun(() -> {
        // join() rather than get(): no checked exceptions in the lambda,
        // and both futures are guaranteed complete at this point
        writeResponse(userFuture.join(), weatherFuture.join());
        asyncContext.complete();
    });
API Performance Monitoring: How We Keep It Running Smooth
We added comprehensive API performance monitoring to ensure our microservices performance stays optimal:
@Component
public class AsyncHealthMonitor {

    // Declared as ThreadPoolExecutor (not the bare ExecutorService
    // interface) so getActiveCount() and getQueue() are available
    private final ThreadPoolExecutor executorService;

    @Scheduled(fixedRate = 30_000) // every 30 seconds
    public void checkHealth() {
        int activeThreads = executorService.getActiveCount();
        int queueSize = executorService.getQueue().size();

        if (activeThreads > poolSize * 0.8) {
            log.warn("Thread pool usage high: {}%",
                    (activeThreads * 100) / poolSize);
        }
        if (queueSize > queueCapacity * 0.7) {
            log.warn("Queue filling up: {} tasks waiting", queueSize);
        }
    }
}
The Takeaway: Java Performance Optimization Is About HOW, Not WHEN
The biggest misconception about async processing and API scaling techniques is that it changes when clients get responses. It doesn't. It changes how the server handles the request internally through efficient thread management.
Your clients still:
- Send a synchronous HTTP request
- Wait for the complete response
- Get the response when it's ready
But your server now:
- Uses threads efficiently
- Handles more concurrent requests
- Scales better under load
- Uses less memory
- Provides more predictable performance
Try These Java Performance Optimization Techniques Yourself
Ready to implement these thread pool optimization strategies? Here's a minimal AsyncContext example to get started with high throughput Java processing:
@RestController
public class AsyncController {

    private final ExecutorService executor = Executors.newFixedThreadPool(10);
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(20);

    @GetMapping("/async-delay")
    public void asyncDelay(HttpServletRequest request,
                           HttpServletResponse response) {
        AsyncContext ctx = request.startAsync();
        ctx.setTimeout(10_000); // always longer than the longest delay

        executor.submit(() -> {
            // Simulate processing
            String result = "Processed at " + Instant.now();

            // Schedule delayed response
            scheduler.schedule(() -> {
                try {
                    response.getWriter().write(result);
                } catch (IOException e) {
                    // log in real code; still fall through to complete()
                } finally {
                    ctx.complete(); // every path completes the context
                }
            }, 5, TimeUnit.SECONDS);
        });
    }
}
Conclusion: Achieving Zero-Downtime API Scaling
Moving from blocking to async processing felt like teaching our server to juggle instead of just catching and holding. Through proper Java performance optimization, the same hands (threads) can now keep many more balls (requests) in the air.
The best part? Our clients never knew anything changed. They send requests and get responses just like before. But now we can handle 10x the load with 75% fewer threads.
Sometimes the best optimizations are the ones nobody notices -- except your ops team at 3 AM when the system is humming along smoothly instead of falling over.
P.S. - If you're wondering why we didn't just use reactive frameworks like WebFlux or explore Virtual Threads (Project Loom) from the start, that's a story for another post. Sometimes you have to evolve the architecture you have, not rebuild the one you want.