From chai tapris to cloud infrastructure – understanding the fundamental trade-offs that shape every system we build.
Table of Contents
- Introduction: The Tea Stall Revelation
- Core Definitions & Concepts
- Mathematical Relationships
- System Architecture Patterns
- Measurement & Monitoring
- Code Examples & Implementation
- Real-World Case Studies
- Optimization Strategies
- Tools & Technologies
- Best Practices & Guidelines
Introduction: The Tea Stall Revelation {#introduction}
Picture this: it’s 8 AM on a Monday, and you’re standing in line at your favorite chai tapri near the office in Bangalore. Chai Express serves each customer in 30 seconds, one at a time (about 120 cups per hour), while Tapri Junction across the street turns out 150 cups per hour with its 5-person team. This everyday scenario perfectly illustrates the fundamental trade-off in system performance: latency vs throughput.
Chai Express optimizes for latency – you get your tea fast, but they can only serve one customer at a time. Tapri Junction optimizes for throughput – they serve many customers simultaneously, but you might wait longer in line.
This trade-off exists in every system we build, from web applications to distributed databases, from microservices to mobile apps. Understanding when to optimize for latency vs throughput is crucial for building scalable, efficient systems.
Core Definitions & Concepts {#definitions}
Latency: The Speed of Individual Operations
Definition: Latency is the time delay between the initiation of a request and the completion of that request.
Formula: Latency = Response Time = End Time - Start Time
Key Characteristics:
- Measured in time units (milliseconds, seconds)
- Affects user experience directly
- Often has diminishing returns on optimization
- Can be measured at different system layers
Types of Latency:
- Network Latency: Time for data to travel across network
- Processing Latency: Time for CPU to process request
- I/O Latency: Time for disk/database operations
- Queue Latency: Time spent waiting in queues
User Request → [Network] → [Queue] → [Processing] → [I/O] → [Network] → Response
|←――――――――――――――――――――― Total Latency ――――――――――――――――――――→|
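A minimal measurement sketch (illustrative only; doWork() is a placeholder for whatever operation you are timing). System.nanoTime() is monotonic, which makes it better suited to timing short operations than System.currentTimeMillis():
long start = System.nanoTime();
doWork(); // placeholder for the operation under measurement
long latencyMicros = (System.nanoTime() - start) / 1_000;
System.out.println("Latency: " + latencyMicros + " µs");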
Throughput: The Volume of Work Completed
Definition: Throughput is the number of operations completed per unit of time.
Formula: Throughput = Number of Operations / Time Period
Key Characteristics:
- Measured in operations per time unit (requests/second, transactions/minute)
- Indicates system capacity
- Can be improved through parallelization
- Limited by system bottlenecks
Types of Throughput:
- Request Throughput: Requests processed per second
- Data Throughput: Bytes processed per second
- Transaction Throughput: Transactions completed per second
- User Throughput: Concurrent users supported
Time Period: 1 Second
Requests: R1, R2, R3, R4, R5, R6, R7, R8, R9, R10
Throughput = 10 requests/second
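A minimal thread-safe throughput meter could look like the sketch below (an illustration, not a library API; LongAdder keeps counting cheap under contention):
import java.util.concurrent.atomic.LongAdder;

public class ThroughputMeter {
    private final LongAdder completed = new LongAdder();
    private final long startNanos = System.nanoTime();

    public void recordCompletion() {
        completed.increment(); // call once per finished operation
    }

    public double opsPerSecond() {
        double elapsedSec = (System.nanoTime() - startNanos) / 1e9;
        return completed.sum() / elapsedSec;
    }
}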
The Fundamental Relationship
Little’s Law: Average Number in System = Arrival Rate × Average Time in System
In system terms: Concurrent Requests = Throughput × Average Response Time
This law reveals the intrinsic relationship between latency and throughput (a worked example follows the list below):
- Higher latency can reduce effective throughput
- Higher throughput can increase latency due to queuing
- System capacity is bounded by both metrics
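Worked example with hypothetical numbers: a service sustaining 200 requests/second at an average response time of 50 ms holds 200 × 0.05 = 10 requests in flight on average.
double throughputRps = 200.0;  // arrival rate λ (hypothetical)
double avgResponseSec = 0.050; // average time in system W (hypothetical)
double inFlight = throughputRps * avgResponseSec; // Little's Law: L = λ × W
System.out.println("Average concurrent requests: " + inFlight); // ~10 in flight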
Mathematical Relationships {#mathematics}
Performance Metrics Formulas
Response Time Components:
Total Response Time = Service Time + Wait Time
Service Time = Processing Time + I/O Time
Wait Time = Queue Time + Network Time
Utilization Law:
Utilization = Throughput × Service Time
Queueing Theory:
Queue Length = Arrival Rate × Average Wait Time
Average Wait Time = (Service Time × Utilization) / (1 - Utilization)   (for an M/M/1 queue)
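The sketch below (assuming an M/M/1 queue, i.e. a single server with random arrivals; the service time is hypothetical) shows why wait times explode as utilization approaches 1:
double serviceTimeMs = 10.0; // mean service time S (hypothetical)
for (double rho : new double[]{0.50, 0.80, 0.90, 0.95, 0.99}) {
    double waitMs = serviceTimeMs * rho / (1 - rho); // W = S × ρ / (1 − ρ)
    System.out.printf("utilization %.2f -> avg wait %.1f ms%n", rho, waitMs);
}
// 0.50 -> 10.0 ms, 0.90 -> 90.0 ms, 0.99 -> 990.0 ms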
Scalability Relationships
Amdahl’s Law (for parallel processing):
Speedup = 1 / (S + (1-S)/N)
where S = serial portion, N = number of processors
Universal Scalability Law:
Throughput = N / (1 + α(N-1) + βN(N-1))
where α = serialization factor, β = crosstalk factor
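Both laws can be evaluated directly; this sketch (the α and β values are hypothetical) shows how even small serialization and crosstalk factors first cap and then reverse throughput gains:
public class ScalabilityLaws {
    // Amdahl's Law: speedup with N processors and serial fraction s
    static double amdahl(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    // Universal Scalability Law: relative throughput at concurrency N
    static double usl(double alpha, double beta, int n) {
        return n / (1.0 + alpha * (n - 1) + beta * (double) n * (n - 1));
    }

    public static void main(String[] args) {
        System.out.printf("Amdahl, 5%% serial, 16 cores: %.1fx%n", amdahl(0.05, 16)); // ≈ 9.1x
        for (int n : new int[]{8, 32, 128}) {
            System.out.printf("USL N=%d -> %.1f%n", n, usl(0.03, 0.0005, n)); // rises, peaks, then declines
        }
    }
}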
System Architecture Patterns {#architecture}
1. Single-Threaded Architecture
┌─────────────────────────────────────────────────┐
│ Single Thread │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Request 1│→ │Process │→ │Response │ │
│ │ │ │ │ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ ┌─────────┐ (Waiting) │
│ │Request 2│ ← ← ← ← ← ← ← ← ← ← ← ← ← ← ← ← ← ← │
│ └─────────┘ │
└─────────────────────────────────────────────────┘
Characteristics:
- Low latency for individual requests
- Low throughput (sequential processing)
- Simple to implement and debug
- No concurrency issues
2. Multi-Threaded Architecture
┌─────────────────────────────────────────────────┐
│ Multi-Thread │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Request 1│→ │Thread 1 │→ │Response │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Request 2│→ │Thread 2 │→ │Response │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Request 3│→ │Thread 3 │→ │Response │ │
│ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────┘
Characteristics:
- Higher throughput (parallel processing)
- Potential for higher latency (context switching)
- Complex synchronization
- Resource contention issues
3. Asynchronous Architecture
┌─────────────────────────────────────────────────┐
│ Event Loop │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Request 1│→ │Event │→ │Callback │ │
│ └─────────┘ │Queue │ │Handler │ │
│ │ │ └─────────┘ │
│ ┌─────────┐ │ │ ┌─────────┐ │
│ │Request 2│→ │ │→ │Callback │ │
│ └─────────┘ │ │ │Handler │ │
│ │ │ └─────────┘ │
│ ┌─────────┐ │ │ ┌─────────┐ │
│ │Request 3│→ │ │→ │Callback │ │
│ └─────────┘ └─────────┘ │Handler │ │
│ └─────────┘ │
└─────────────────────────────────────────────────┘
Characteristics:
- High throughput with single thread
- Low latency for non-blocking operations
- Complex error handling
- Callback complexity
4. Microservices Architecture
┌─────────────────────────────────────────────────┐
│ Load Balancer │
└─────────────────┬───────────────────────────────┘
│
┌─────────────┼─────────────┐
│ │ │
┌───▼───┐ ┌────▼────┐ ┌───▼───┐
│Service│ │Service │ │Service│
│ A │ │ B │ │ C │
└───┬───┘ └────┬────┘ └───┬───┘
│ │ │
┌───▼───┐ ┌────▼────┐ ┌───▼───┐
│Cache │ │Database │ │Queue │
│ │ │ │ │ │
└───────┘ └─────────┘ └───────┘
Characteristics:
- Distributed latency (network calls)
- Scalable throughput (independent scaling)
- Complex deployment and monitoring
- Fault tolerance through redundancy
Measurement & Monitoring {#measurement}
Latency Measurement Techniques
Percentile Analysis:
- P50 (Median): 50% of requests complete within this time
- P90: 90% of requests complete within this time
- P95: 95% of requests complete within this time
- P99: 99% of requests complete within this time
- P99.9: 99.9% of requests complete within this time
Why Percentiles Matter:
- Averages can be misleading: five requests taking 10, 10, 10, 10, and 1,000 ms average 208 ms, even though the typical request takes 10 ms
- P50 shows typical performance
- P99 captures the slowest 1% of requests – the tail your most active users are most likely to hit
- P99.9 exposes near-worst-case behavior
Throughput Measurement Techniques
Load Testing Metrics:
- Concurrent Users: Number of simultaneous users
- Requests per Second (RPS): Request throughput
- Transactions per Second (TPS): Business transaction throughput
- Data Transfer Rate: Bytes per second
Throughput vs Load Relationship:
Throughput
▲
│                        ┌───────────── Saturation Point
│                      ╱
│                    ╱
│                  ╱
│                ╱
│              ╱
│            ╱
│          ╱
└──────────────────────────────────────▶
                Load (Concurrent Users)
Code Examples & Implementation {#code}
1. Latency Optimization Examples
Database Query Optimization
Before – High Latency:
// Poor implementation - High latency due to N+1 queries
public class UserServiceBad {
private UserRepository userRepo;
private OrderRepository orderRepo;
public List<UserWithOrders> getUsersWithOrders() {
long startTime = System.currentTimeMillis();
List<User> users = userRepo.findAll(); // 1 query
List<UserWithOrders> result = new ArrayList<>();
for (User user : users) {
// N additional queries - causes high latency
List<Order> orders = orderRepo.findByUserId(user.getId());
result.add(new UserWithOrders(user, orders));
}
long endTime = System.currentTimeMillis();
System.out.println("Query time: " + (endTime - startTime) + "ms");
return result;
}
}
After – Low Latency:
// Optimized implementation - Low latency with batch queries
public class UserServiceOptimized {
private UserRepository userRepo;
private OrderRepository orderRepo;
public List<UserWithOrders> getUsersWithOrders() {
long startTime = System.currentTimeMillis();
// Fetch all users
List<User> users = userRepo.findAll();
List<Long> userIds = users.stream()
.map(User::getId)
.collect(Collectors.toList());
// Batch fetch all orders in single query
List<Order> allOrders = orderRepo.findByUserIdIn(userIds);
Map<Long, List<Order>> ordersByUserId = allOrders.stream()
.collect(Collectors.groupingBy(Order::getUserId));
// Combine results
List<UserWithOrders> result = users.stream()
.map(user -> new UserWithOrders(user,
ordersByUserId.getOrDefault(user.getId(), new ArrayList<>())))
.collect(Collectors.toList());
long endTime = System.currentTimeMillis();
System.out.println("Optimized query time: " + (endTime - startTime) + "ms");
return result;
}
}
Caching for Latency Reduction
// Redis-based caching for latency optimization
@Service
public class ProductService {
private final ProductRepository productRepo;
private final RedisTemplate<String, Object> redisTemplate;
public ProductService(ProductRepository productRepo,
RedisTemplate<String, Object> redisTemplate) {
this.productRepo = productRepo;
this.redisTemplate = redisTemplate;
}
public Product getProduct(Long productId) {
long startTime = System.currentTimeMillis();
String cacheKey = "product:" + productId;
// Try cache first - reduces latency for cached items
Product cachedProduct = (Product) redisTemplate.opsForValue().get(cacheKey);
if (cachedProduct != null) {
long endTime = System.currentTimeMillis();
System.out.println("Cache hit - Latency: " + (endTime - startTime) + "ms");
return cachedProduct;
}
// Cache miss - fetch from database
Product product = productRepo.findById(productId).orElse(null);
if (product != null) {
// Cache for 1 hour
redisTemplate.opsForValue().set(cacheKey, product, Duration.ofHours(1));
}
long endTime = System.currentTimeMillis();
System.out.println("Database fetch - Latency: " + (endTime - startTime) + "ms");
return product;
}
}
2. Throughput Optimization Examples
Thread Pool for High Throughput
// Thread pool implementation for high throughput processing
@Service
public class OrderProcessingService {
private final ExecutorService executorService;
private final OrderRepository orderRepo;
private final EmailService emailService;
public OrderProcessingService(OrderRepository orderRepo, EmailService emailService) {
this.orderRepo = orderRepo;
this.emailService = emailService;
// Configure thread pool based on CPU cores and I/O requirements
int corePoolSize = Runtime.getRuntime().availableProcessors();
int maxPoolSize = corePoolSize * 2;
this.executorService = new ThreadPoolExecutor(
corePoolSize, maxPoolSize,
60L, TimeUnit.SECONDS,
new LinkedBlockingQueue<>(1000),
new ThreadPoolExecutor.CallerRunsPolicy()
);
}
public CompletableFuture<Void> processOrdersBatch(List<Order> orders) {
long startTime = System.currentTimeMillis();
// Process orders in parallel for higher throughput
List<CompletableFuture<Void>> futures = orders.stream()
.map(order -> CompletableFuture.runAsync(() -> {
try {
processOrder(order);
} catch (Exception e) {
System.err.println("Error processing order " + order.getId() + ": " + e.getMessage());
}
}, executorService))
.collect(Collectors.toList());
return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
.thenRun(() -> {
long endTime = System.currentTimeMillis();
double throughput = (double) orders.size() / ((endTime - startTime) / 1000.0);
System.out.println("Processed " + orders.size() + " orders in " +
(endTime - startTime) + "ms");
System.out.println("Throughput: " + throughput + " orders/second");
});
}
private void processOrder(Order order) {
// Simulate order processing
orderRepo.updateStatus(order.getId(), OrderStatus.PROCESSING);
// Simulate some processing time
try {
Thread.sleep(100); // 100ms processing time
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
orderRepo.updateStatus(order.getId(), OrderStatus.COMPLETED);
emailService.sendOrderConfirmation(order);
}
}
Batch Processing for Throughput
// Batch processing for improved throughput
@Service
public class BatchProcessor {
private final DataRepository dataRepo;
private static final int BATCH_SIZE = 1000;
public BatchProcessor(DataRepository dataRepo) {
this.dataRepo = dataRepo;
}
public void processBulkData(List<DataRecord> records) {
long startTime = System.currentTimeMillis();
int totalRecords = records.size();
// Process in batches to optimize throughput
for (int i = 0; i < records.size(); i += BATCH_SIZE) {
int endIndex = Math.min(i + BATCH_SIZE, records.size());
List<DataRecord> batch = records.subList(i, endIndex);
// Batch insert/update for better throughput
dataRepo.batchInsert(batch);
System.out.println("Processed batch: " + (i / BATCH_SIZE + 1) +
"/" + ((records.size() - 1) / BATCH_SIZE + 1));
}
long endTime = System.currentTimeMillis();
double throughput = (double) totalRecords / ((endTime - startTime) / 1000.0);
System.out.println("Total throughput: " + throughput + " records/second");
}
}
3. Asynchronous Processing
// Asynchronous processing for better throughput and user experience
@RestController
public class FileProcessingController {
private final FileProcessingService fileService;
public FileProcessingController(FileProcessingService fileService) {
this.fileService = fileService;
}
@PostMapping("/files/process")
public ResponseEntity<ProcessingResponse> processFile(@RequestParam MultipartFile file) {
long startTime = System.currentTimeMillis();
// Generate unique processing ID
String processingId = UUID.randomUUID().toString();
// Start asynchronous processing - immediate response (low latency)
CompletableFuture<ProcessingResult> futureResult =
fileService.processFileAsync(file, processingId);
// Handle completion asynchronously
futureResult.thenAccept(result -> {
System.out.println("Processing completed for ID: " + processingId);
// Send notification, update database, etc.
}).exceptionally(throwable -> {
System.err.println("Processing failed for ID: " + processingId);
return null;
});
long endTime = System.currentTimeMillis();
return ResponseEntity.ok(new ProcessingResponse(
processingId,
"Processing started",
endTime - startTime
));
}
@GetMapping("/files/status/{processingId}")
public ResponseEntity<ProcessingStatus> getProcessingStatus(@PathVariable String processingId) {
ProcessingStatus status = fileService.getProcessingStatus(processingId);
return ResponseEntity.ok(status);
}
}
@Service
public class FileProcessingService {
private final ExecutorService executorService;
private final Map<String, ProcessingStatus> processingStatusMap;
public FileProcessingService() {
this.executorService = Executors.newFixedThreadPool(10);
this.processingStatusMap = new ConcurrentHashMap<>();
}
// The work is handed to this service's own executor below, so Spring's @Async wrapper is unnecessary here
public CompletableFuture<ProcessingResult> processFileAsync(MultipartFile file, String processingId) {
return CompletableFuture.supplyAsync(() -> {
try {
processingStatusMap.put(processingId, new ProcessingStatus("IN_PROGRESS", 0));
// Simulate file processing
byte[] fileData = file.getBytes();
int totalChunks = Math.max(1, (fileData.length + 1023) / 1024); // ceil-divide into 1 KB chunks
for (int i = 0; i < totalChunks; i++) {
// Process chunk
Thread.sleep(50); // Simulate processing time
// Update progress
int progress = (i + 1) * 100 / totalChunks;
processingStatusMap.put(processingId,
new ProcessingStatus("IN_PROGRESS", progress));
}
ProcessingResult result = new ProcessingResult(processingId, "SUCCESS", fileData.length);
processingStatusMap.put(processingId, new ProcessingStatus("COMPLETED", 100));
return result;
} catch (Exception e) {
processingStatusMap.put(processingId, new ProcessingStatus("FAILED", 0));
throw new RuntimeException("File processing failed", e);
}
}, executorService);
}
public ProcessingStatus getProcessingStatus(String processingId) {
return processingStatusMap.getOrDefault(processingId,
new ProcessingStatus("NOT_FOUND", 0));
}
}
4. Circuit Breaker Pattern
// Circuit breaker for handling latency spikes and maintaining throughput
@Component
public class CircuitBreaker {
private final int failureThreshold;
private final long timeoutDuration;
private final long retryTimePeriod;
private int failureCount = 0;
private long lastFailureTime = 0;
private State state = State.CLOSED;
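// Simplification: this sketch is not thread-safe. With concurrent callers,
// make 'state' volatile and use AtomicInteger/AtomicLong for the counters.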
public enum State {
CLOSED, // Normal operation
OPEN, // Failing fast
HALF_OPEN // Testing if service recovered
}
public CircuitBreaker(int failureThreshold, long timeoutDuration, long retryTimePeriod) {
this.failureThreshold = failureThreshold;
this.timeoutDuration = timeoutDuration;
this.retryTimePeriod = retryTimePeriod;
}
public <T> T call(Supplier<T> operation) throws Exception {
if (state == State.OPEN) {
if (System.currentTimeMillis() - lastFailureTime > retryTimePeriod) {
state = State.HALF_OPEN;
} else {
throw new Exception("Circuit breaker is OPEN - failing fast");
}
}
try {
long startTime = System.currentTimeMillis();
T result = operation.get();
long duration = System.currentTimeMillis() - startTime;
if (duration > timeoutDuration) {
recordFailure();
throw new Exception("Operation timed out");
}
recordSuccess();
return result;
} catch (Exception e) {
recordFailure();
throw e;
}
}
private void recordSuccess() {
failureCount = 0;
state = State.CLOSED;
}
private void recordFailure() {
failureCount++;
lastFailureTime = System.currentTimeMillis();
if (failureCount >= failureThreshold) {
state = State.OPEN;
}
}
}
// Usage example
@Service
public class ExternalServiceClient {
private final CircuitBreaker circuitBreaker;
private final RestTemplate restTemplate;
public ExternalServiceClient(RestTemplate restTemplate) {
this.restTemplate = restTemplate;
this.circuitBreaker = new CircuitBreaker(5, 2000, 10000); // 5 failures, 2s timeout, 10s retry
}
public String callExternalService(String request) {
try {
return circuitBreaker.call(() -> {
return restTemplate.postForObject("/external-api", request, String.class);
});
} catch (Exception e) {
// Fallback mechanism
return "Service temporarily unavailable";
}
}
}
Real-World Case Studies {#case-studies}
Case Study 1: E-commerce Search Engine
Challenge: An e-commerce platform needed to handle 10,000 search queries per second while maintaining sub-100ms response times.
Solution Architecture:
@Service
public class SearchService {
private final ElasticsearchClient elasticsearchClient;
private final RedisTemplate<String, Object> redisTemplate;
private final ExecutorService searchExecutor;
public SearchService(ElasticsearchClient elasticsearchClient,
RedisTemplate<String, Object> redisTemplate) {
this.elasticsearchClient = elasticsearchClient;
this.redisTemplate = redisTemplate;
this.searchExecutor = Executors.newFixedThreadPool(50);
}
public CompletableFuture<SearchResult> searchProducts(SearchRequest request) {
return CompletableFuture.supplyAsync(() -> {
long startTime = System.currentTimeMillis();
// Check cache first (reduces latency)
String cacheKey = generateCacheKey(request);
SearchResult cachedResult = (SearchResult) redisTemplate.opsForValue().get(cacheKey);
if (cachedResult != null) {
long latency = System.currentTimeMillis() - startTime;
System.out.println("Cache hit - Latency: " + latency + "ms");
return cachedResult;
}
// Search in Elasticsearch
SearchResult result = performElasticsearchQuery(request);
// Cache result for 5 minutes
redisTemplate.opsForValue().set(cacheKey, result, Duration.ofMinutes(5));
long latency = System.currentTimeMillis() - startTime;
System.out.println("Elasticsearch query - Latency: " + latency + "ms");
return result;
}, searchExecutor);
}
private SearchResult performElasticsearchQuery(SearchRequest request) {
// Optimized Elasticsearch query
return elasticsearchClient.search(request);
}
}
Results:
- Latency: P95 < 80ms, P99 < 150ms
- Throughput: 12,000 QPS sustained
- Cache Hit Rate: 85%
Case Study 2: Payment Processing System
Challenge: Process 50,000 payment transactions per hour with strict latency requirements (< 500ms per transaction).
Solution:
@Service
public class PaymentProcessingService {
private final PaymentGateway paymentGateway;
private final FraudDetectionService fraudService;
private final EventPublisher eventPublisher;
private final ExecutorService paymentExecutor;
public PaymentProcessingService(PaymentGateway paymentGateway,
FraudDetectionService fraudService,
EventPublisher eventPublisher) {
this.paymentGateway = paymentGateway;
this.fraudService = fraudService;
this.eventPublisher = eventPublisher;
this.paymentExecutor = new ThreadPoolExecutor(
20, 100, 60L, TimeUnit.SECONDS,
new LinkedBlockingQueue<>(1000)
);
}
public CompletableFuture<PaymentResult> processPayment(PaymentRequest request) {
long startTime = System.currentTimeMillis();
return CompletableFuture.supplyAsync(() -> {
try {
// Parallel fraud detection and payment validation
CompletableFuture<FraudResult> fraudCheck =
CompletableFuture.supplyAsync(() -> fraudService.checkFraud(request));
CompletableFuture<ValidationResult> validation =
CompletableFuture.supplyAsync(() -> validatePayment(request));
// Wait for both to complete
CompletableFuture.allOf(fraudCheck, validation).join();
FraudResult fraudResult = fraudCheck.join();
ValidationResult validationResult = validation.join();
if (fraudResult.isRisky() || !validationResult.isValid()) {
return PaymentResult.rejected(request.getTransactionId());
}
// Process payment
PaymentResult result = paymentGateway.processPayment(request);
// Async event publishing (doesn't affect latency)
eventPublisher.publishAsync(new PaymentProcessedEvent(result));
long totalLatency = System.currentTimeMillis() - startTime;
System.out.println("Payment processed - Latency: " + totalLatency + "ms");
return result;
} catch (Exception e) {
return PaymentResult.error(request.getTransactionId(), e.getMessage());
}
}, paymentExecutor);
}
}
Case Study 3: Social Media Feed Generation
Challenge: Generate personalized feeds for 1 million users with sub-second latency.
Solution:
@Service
public class FeedGenerationService {
private final UserConnectionService connectionService;
private final ContentService contentService;
private final RedisTemplate<String, Object> redisTemplate;
private final ExecutorService feedExecutor;
public FeedGenerationService(UserConnectionService connectionService,
ContentService contentService,
RedisTemplate<String, Object> redisTemplate) {
this.connectionService = connectionService;
this.contentService = contentService;
this.redisTemplate = redisTemplate;
this.feedExecutor = Executors.newFixedThreadPool(100);
}
public CompletableFuture<Feed> generateFeed(Long userId, int pageSize) {
return CompletableFuture.supplyAsync(() -> {
long startTime = System.currentTimeMillis();
// Check for cached feed first
String cacheKey = "feed:" + userId;
Feed cachedFeed = (Feed) redisTemplate.opsForValue().get(cacheKey);
if (cachedFeed != null && !cachedFeed.isExpired()) {
long latency = System.currentTimeMillis() - startTime;
System.out.println("Feed cache hit - Latency: " + latency + "ms");
return cachedFeed;
}
// Generate fresh feed
List<Long> connections = connectionService.getUserConnections(userId);
// Parallel content fetching for better throughput
List<CompletableFuture<List<Content>>> contentFutures = connections.stream()
.map(connectionId -> CompletableFuture.supplyAsync(() ->
contentService.getRecentContent(connectionId, 10), feedExecutor))
.collect(Collectors.toList());
// Wait for all content to be fetched
List<Content> allContent = contentFutures.stream()
.map(CompletableFuture::join)
.flatMap(List::stream)
.sorted((c1, c2) -> c2.getTimestamp().compareTo(c1.getTimestamp()))
.limit(pageSize)
.collect(Collectors.toList());
Feed feed = new Feed(userId, allContent);
// Cache for 10 minutes
redisTemplate.opsForValue().set(cacheKey, feed, Duration.ofMinutes(10));
long latency = System.currentTimeMillis() - startTime;
System.out.println("Feed generated - Latency: " + latency + "ms");
return feed;
}, feedExecutor);
}
// Pre-generate feeds for active users to reduce latency
@Scheduled(fixedRate = 300000) // Every 5 minutes
public void preGenerateFeeds() {
List<Long> activeUsers = connectionService.getActiveUsers();
activeUsers.parallelStream()
.forEach(userId -> {
try {
generateFeed(userId, 50).join();
} catch (Exception e) {
System.err.println("Failed to pre-generate feed for user: " + userId);
}
});
}
}
Results:
- Latency: P95 < 800ms, P99 < 1.2s
- Throughput: 5,000 feeds/second
- Cache Hit Rate: 92%
Optimization Strategies {#optimization}
1. Latency Optimization Techniques
Connection Pooling
@Configuration
public class DatabaseConfig {
@Bean
public DataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb");
config.setUsername("user");
config.setPassword("password");
// Connection pool settings for optimal latency
config.setMinimumIdle(10);
config.setMaximumPoolSize(50);
config.setConnectionTimeout(30000);
config.setIdleTimeout(600000);
config.setMaxLifetime(1800000);
config.setLeakDetectionThreshold(60000);
return new HikariDataSource(config);
}
}
JVM Tuning for Low Latency
// JVM flags for low latency applications
/*
-Xms4g -Xmx4g # Fixed heap size
-XX:+UseG1GC # G1 garbage collector
-XX:MaxGCPauseMillis=50 # Target max GC pause
-XX:+UnlockExperimentalVMOptions
-XX:+UseStringDeduplication # Reduce memory usage
-XX:+PrintGC # GC logging
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
*/
@Service
public class LowLatencyService {
// Pre-allocate objects to avoid GC pressure
private final ThreadLocal<StringBuilder> stringBuilderCache =
ThreadLocal.withInitial(() -> new StringBuilder(1024));
// Object pooling for frequently used objects
private final ObjectPool<ByteBuffer> bufferPool =
new GenericObjectPool<>(new ByteBufferFactory());
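// ObjectPool/GenericObjectPool come from Apache Commons Pool 2;
// ByteBufferFactory is assumed to be a custom PooledObjectFactory<ByteBuffer>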
public String processData(String input) {
StringBuilder sb = stringBuilderCache.get();
sb.setLength(0); // Reset for reuse
ByteBuffer buffer = null;
try {
buffer = bufferPool.borrowObject();
// Process data with minimal allocations
return processWithBuffer(input, buffer, sb);
} catch (Exception e) {
throw new RuntimeException("Processing failed", e);
} finally {
if (buffer != null) {
try {
bufferPool.returnObject(buffer);
} catch (Exception e) {
// Log error
}
}
}
}
}
2. Throughput Optimization Techniques
Batch Processing
@Service
public class BatchProcessingService {
private final DatabaseService databaseService;
private final BlockingQueue<DataRecord> queue;
private final ExecutorService batchProcessor;
public BatchProcessingService(DatabaseService databaseService) {
this.databaseService = databaseService;
this.queue = new LinkedBlockingQueue<>();
this.batchProcessor = Executors.newSingleThreadExecutor();
// Start batch processing thread
startBatchProcessor();
}
public void addRecord(DataRecord record) {
queue.offer(record);
}
private void startBatchProcessor() {
batchProcessor.submit(() -> {
List<DataRecord> batch = new ArrayList<>();
while (!Thread.currentThread().isInterrupted()) {
try {
// Wait for first record
DataRecord first = queue.take();
batch.add(first);
// Drain available records up to batch size
queue.drainTo(batch, 999); // Max batch size of 1000
long startTime = System.currentTimeMillis();
databaseService.batchInsert(batch);
long duration = System.currentTimeMillis() - startTime;
double throughput = (double) batch.size() / (duration / 1000.0);
System.out.println("Batch processed: " + batch.size() +
" records, Throughput: " + throughput + " records/sec");
batch.clear();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
} catch (Exception e) {
System.err.println("Batch processing error: " + e.getMessage());
batch.clear();
}
}
});
}
}
Reactive Programming for High Throughput
@Service
public class ReactiveDataProcessor {
private final WebClient webClient;
private final DataRepository dataRepository;
public ReactiveDataProcessor(WebClient webClient, DataRepository dataRepository) {
this.webClient = webClient;
this.dataRepository = dataRepository;
}
public Flux<ProcessedData> processDataStream(Flux<InputData> inputStream) {
return inputStream
.buffer(Duration.ofSeconds(1), 100) // Buffer for 1 second or 100 items
.flatMap(batch -> {
long startTime = System.currentTimeMillis();
return Flux.fromIterable(batch)
.flatMap(this::enrichDataAsync, 10) // Process 10 concurrent requests
.collectList()
.flatMapMany(enrichedData -> {
// Batch save to database
return dataRepository.saveAll(enrichedData)
.doOnComplete(() -> {
long duration = System.currentTimeMillis() - startTime;
double throughput = (double) batch.size() / (duration / 1000.0);
System.out.println("Batch throughput: " + throughput + " items/sec");
});
});
});
}
private Mono<ProcessedData> enrichDataAsync(InputData input) {
return webClient.post()
.uri("/enrich")
.bodyValue(input)
.retrieve()
.bodyToMono(ProcessedData.class)
.timeout(Duration.ofSeconds(2))
.onErrorResume(throwable -> {
// Handle errors gracefully
return Mono.just(ProcessedData.createDefault(input));
});
}
}
3. Load Balancing Strategies
@Component
public class LoadBalancer {
private final List<ServerInstance> servers;
private final AtomicInteger roundRobinIndex;
private final Map<String, ServerInstance> serverHealthMap;
public LoadBalancer(List<ServerInstance> servers) {
this.servers = servers;
this.roundRobinIndex = new AtomicInteger(0);
this.serverHealthMap = new ConcurrentHashMap<>();
// Start health monitoring
startHealthMonitoring();
}
// Round-robin for balanced throughput
public ServerInstance getNextServer() {
List<ServerInstance> healthyServers = servers.stream()
.filter(this::isHealthy)
.collect(Collectors.toList());
if (healthyServers.isEmpty()) {
throw new RuntimeException("No healthy servers available");
}
// floorMod keeps the index non-negative even after the counter overflows
int index = Math.floorMod(roundRobinIndex.getAndIncrement(), healthyServers.size());
return healthyServers.get(index);
}
// Least connections for optimal latency
public ServerInstance getLeastConnectionsServer() {
return servers.stream()
.filter(this::isHealthy)
.min(Comparator.comparingInt(ServerInstance::getActiveConnections))
.orElseThrow(() -> new RuntimeException("No healthy servers available"));
}
// Weighted round-robin for different server capacities
public ServerInstance getWeightedServer() {
List<ServerInstance> weightedServers = new ArrayList<>();
for (ServerInstance server : servers) {
if (isHealthy(server)) {
// Add server multiple times based on weight
for (int i = 0; i < server.getWeight(); i++) {
weightedServers.add(server);
}
}
}
if (weightedServers.isEmpty()) {
throw new RuntimeException("No healthy servers available");
}
int index = Math.floorMod(roundRobinIndex.getAndIncrement(), weightedServers.size()); // floorMod: safe under overflow
return weightedServers.get(index);
}
private boolean isHealthy(ServerInstance server) {
return serverHealthMap.getOrDefault(server.getId(), server).isHealthy();
}
private void startHealthMonitoring() {
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
scheduler.scheduleAtFixedRate(() -> {
servers.parallelStream().forEach(server -> {
try {
boolean healthy = checkServerHealth(server);
server.setHealthy(healthy);
serverHealthMap.put(server.getId(), server);
} catch (Exception e) {
server.setHealthy(false);
serverHealthMap.put(server.getId(), server);
}
});
}, 0, 10, TimeUnit.SECONDS);
}
private boolean checkServerHealth(ServerInstance server) {
// Implement health check logic
return true; // Simplified
}
}
Tools & Technologies {#tools}
1. Performance Monitoring Tools
Application Performance Monitoring (APM)
@Component
public class PerformanceMonitor {
private final MeterRegistry meterRegistry;
// Gauges read live values, so keep the latest throughput per operation in a map
private final Map<String, Double> throughputByOperation = new ConcurrentHashMap<>();
public PerformanceMonitor(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
}
// Record the full duration once a request completes (a sample started and stopped
// at request start would record nothing useful). RequestCompletedEvent is a
// hypothetical application event carrying the method, URI, and measured duration.
@EventListener
public void handleRequestCompleted(RequestCompletedEvent event) {
Timer.builder("http.request.duration")
.tag("method", event.getMethod())
.tag("uri", event.getUri())
.register(meterRegistry)
.record(event.getDurationMs(), TimeUnit.MILLISECONDS);
}
public void recordLatency(String operation, long latencyMs) {
Timer.builder("operation.latency")
.tag("operation", operation)
.register(meterRegistry)
.record(latencyMs, TimeUnit.MILLISECONDS);
}
public void recordThroughput(String operation, double throughput) {
throughputByOperation.put(operation, throughput);
Gauge.builder("operation.throughput", () -> throughputByOperation.getOrDefault(operation, 0.0))
.tag("operation", operation)
.register(meterRegistry); // idempotent: re-registering returns the existing gauge
}
Custom Performance Metrics
@Component
public class PerformanceMetrics {
private final AtomicLong totalRequests = new AtomicLong(0);
private final AtomicLong totalLatency = new AtomicLong(0);
private final Map<String, LongAdder> operationCounts = new ConcurrentHashMap<>();
private final Map<String, LatencyTracker> latencyTrackers = new ConcurrentHashMap<>();
public void recordRequest(String operation, long latencyMs) {
totalRequests.incrementAndGet();
totalLatency.addAndGet(latencyMs);
operationCounts.computeIfAbsent(operation, k -> new LongAdder()).increment();
latencyTrackers.computeIfAbsent(operation, k -> new LatencyTracker()).recordLatency(latencyMs);
}
public double getAverageLatency() {
long requests = totalRequests.get();
return requests > 0 ? (double) totalLatency.get() / requests : 0;
}
public double getThroughput(String operation, long timeWindowMs) {
LongAdder counter = operationCounts.get(operation);
if (counter == null) return 0;
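// Note: the counter is cumulative since startup, so this is accurate only
// when timeWindowMs spans the whole measurement period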
return (double) counter.sum() / (timeWindowMs / 1000.0);
}
public LatencyStats getLatencyStats(String operation) {
LatencyTracker tracker = latencyTrackers.get(operation);
return tracker != null ? tracker.getStats() : new LatencyStats();
}
private static class LatencyTracker {
private final List<Long> samples = new CopyOnWriteArrayList<>();
private final int maxSamples = 10000;
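// Simplification: CopyOnWriteArrayList copies the array on every write, so at
// this sample volume a ring buffer or histogram (e.g. HdrHistogram) is far cheaper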
public void recordLatency(long latencyMs) {
samples.add(latencyMs);
if (samples.size() > maxSamples) {
samples.remove(0);
}
}
public LatencyStats getStats() {
if (samples.isEmpty()) return new LatencyStats();
List<Long> sorted = samples.stream().sorted().collect(Collectors.toList());
int size = sorted.size();
return new LatencyStats(
sorted.get(size / 2), // P50
sorted.get((int) (size * 0.9)), // P90
sorted.get((int) (size * 0.95)), // P95
sorted.get((int) (size * 0.99)), // P99
Collections.min(sorted),
Collections.max(sorted)
);
}
}
}
2. Load Testing Tools Integration
@Component
public class LoadTestingService {
private final RestTemplate restTemplate;
private final ExecutorService executor;
public LoadTestingService(RestTemplate restTemplate) {
this.restTemplate = restTemplate;
this.executor = Executors.newFixedThreadPool(100);
}
public LoadTestResult runLoadTest(LoadTestConfig config) {
long startTime = System.currentTimeMillis();
AtomicLong successCount = new AtomicLong(0);
AtomicLong errorCount = new AtomicLong(0);
List<Long> latencies = new CopyOnWriteArrayList<>();
List<CompletableFuture<Void>> futures = new ArrayList<>();
// Generate load
for (int i = 0; i < config.getTotalRequests(); i++) {
CompletableFuture<Void> future = CompletableFuture.runAsync(() -> {
long requestStart = System.currentTimeMillis();
try {
ResponseEntity<String> response = restTemplate.postForEntity(
config.getUrl(),
config.getRequestBody(),
String.class
);
if (response.getStatusCode().is2xxSuccessful()) {
successCount.incrementAndGet();
} else {
errorCount.incrementAndGet();
}
} catch (Exception e) {
errorCount.incrementAndGet();
} finally {
long latency = System.currentTimeMillis() - requestStart;
latencies.add(latency);
}
}, executor);
futures.add(future);
// Control request rate (coarse pacing: integer division, so rates above 1000 RPS aren't throttled accurately)
if (config.getRequestsPerSecond() > 0) {
try {
Thread.sleep(1000 / config.getRequestsPerSecond());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
}
}
}
// Wait for all requests to complete
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
long totalTime = System.currentTimeMillis() - startTime;
double throughput = (double) config.getTotalRequests() / (totalTime / 1000.0);
return new LoadTestResult(
successCount.get(),
errorCount.get(),
throughput,
calculateLatencyStats(latencies)
);
}
private LatencyStats calculateLatencyStats(List<Long> latencies) {
if (latencies.isEmpty()) return new LatencyStats();
List<Long> sorted = latencies.stream().sorted().collect(Collectors.toList());
int size = sorted.size();
return new LatencyStats(
sorted.get(size / 2), // P50
sorted.get((int) (size * 0.9)), // P90
sorted.get((int) (size * 0.95)), // P95
sorted.get((int) (size * 0.99)), // P99
Collections.min(sorted),
Collections.max(sorted)
);
}
}
Best Practices & Guidelines {#best-practices}
1. Design Principles
The Performance Trade-off Matrix
                  High Latency        Low Latency
High Throughput   [Batch Systems]     [Ideal Zone]
Low Throughput    [Poor Design]       [Simple Systems]
When to Optimize for Latency
- User-facing applications: Web UIs, mobile apps
- Real-time systems: Trading platforms, gaming
- Interactive services: Chat applications, video calls
- APIs with SLA requirements: < 100ms response times
When to Optimize for Throughput
- Data processing: ETL pipelines, analytics
- Background tasks: Email sending, report generation
- Batch operations: Data imports, bulk updates
- Resource-intensive operations: Image processing, ML training
2. Architecture Guidelines
// Example: Design pattern for handling both requirements
@Service
public class HybridProcessingService {
private final FastProcessingService fastService;
private final BatchProcessingService batchService;
private final RequestClassifier classifier;
public ProcessingResult processRequest(ProcessingRequest request) {
// Classify request based on priority and requirements
RequestType type = classifier.classify(request);
switch (type) {
case REAL_TIME:
// Optimize for latency
return fastService.processImmediately(request);
case BATCH:
// Optimize for throughput
return batchService.addToBatch(request);
case PRIORITY:
// Hybrid approach
if (fastService.hasCapacity()) {
return fastService.processImmediately(request);
} else {
return batchService.processPriority(request);
}
default:
throw new IllegalArgumentException("Unknown request type");
}
}
}
3. Monitoring and Alerting
@Component
public class PerformanceAlerting {
private final AlertingService alertingService;
private final PerformanceMetrics metrics;
@Scheduled(fixedRate = 30000) // Check every 30 seconds
public void checkPerformanceThresholds() {
// Latency alerting
double avgLatency = metrics.getAverageLatency();
if (avgLatency > 500) { // 500ms threshold
alertingService.sendAlert(
"High latency detected: " + avgLatency + "ms",
AlertLevel.WARNING
);
}
// Throughput alerting
double throughput = metrics.getThroughput("api_requests", 60000);
if (throughput < 100) { // 100 RPS threshold
alertingService.sendAlert(
"Low throughput detected: " + throughput + " RPS",
AlertLevel.WARNING
);
}
// P99 latency alerting
LatencyStats stats = metrics.getLatencyStats("api_requests");
if (stats.getP99() > 2000) { // 2 second threshold
alertingService.sendAlert(
"P99 latency breach: " + stats.getP99() + "ms",
AlertLevel.CRITICAL
);
}
}
}
4. Capacity Planning
@Service
public class CapacityPlanningService {
public CapacityRecommendation analyzeCapacity(SystemMetrics metrics) {
double currentThroughput = metrics.getThroughput();
double currentLatency = metrics.getAverageLatency();
double cpuUtilization = metrics.getCpuUtilization();
double memoryUtilization = metrics.getMemoryUtilization();
CapacityRecommendation recommendation = new CapacityRecommendation();
// Analyze bottlenecks
if (cpuUtilization > 0.8) {
recommendation.addRecommendation("Scale CPU: Add more cores or instances");
}
if (memoryUtilization > 0.85) {
recommendation.addRecommendation("Scale Memory: Increase heap size or add instances");
}
if (currentLatency > 100 && cpuUtilization < 0.5) {
recommendation.addRecommendation("I/O Bottleneck: Optimize database queries or add caching");
}
// Predict future capacity needs
double projectedGrowth = calculateGrowthRate(metrics.getHistoricalData());
if (projectedGrowth > 0.5) { // 50% growth expected
recommendation.addRecommendation("Proactive scaling needed for " +
(projectedGrowth * 100) + "% growth");
}
return recommendation;
}
private double calculateGrowthRate(List<HistoricalMetric> historicalData) {
if (historicalData.size() < 2) return 0;
double oldestThroughput = historicalData.get(0).getThroughput();
double latestThroughput = historicalData.get(historicalData.size() - 1).getThroughput();
return (latestThroughput - oldestThroughput) / oldestThroughput;
}
}
5. Testing Strategies
@ExtendWith(SpringExtension.class)
@SpringBootTest
public class PerformanceTests {
@Autowired
private TestRestTemplate restTemplate;
@Test
public void testLatencyRequirements() {
int iterations = 100;
List<Long> latencies = new ArrayList<>();
for (int i = 0; i < iterations; i++) {
long start = System.currentTimeMillis();
ResponseEntity<String> response = restTemplate.postForEntity(
"/api/test", "test data", String.class);
long latency = System.currentTimeMillis() - start;
latencies.add(latency);
assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
}
// Calculate statistics
double avgLatency = latencies.stream().mapToLong(Long::longValue).average().orElse(0);
long p95Latency = latencies.stream().sorted().skip((long) (iterations * 0.95)).findFirst().orElse(0L);
assertThat(avgLatency).isLessThan(100); // 100ms average
assertThat(p95Latency).isLessThan(200); // 200ms P95
}
@Test
public void testThroughputRequirements() throws InterruptedException {
int threadCount = 10;
int requestsPerThread = 100;
CountDownLatch latch = new CountDownLatch(threadCount);
AtomicInteger successCount = new AtomicInteger(0);
long startTime = System.currentTimeMillis();
for (int i = 0; i < threadCount; i++) {
new Thread(() -> {
try {
for (int j = 0; j < requestsPerThread; j++) {
ResponseEntity<String> response = restTemplate.postForEntity(
"/api/test", "test data", String.class);
if (response.getStatusCode().is2xxSuccessful()) {
successCount.incrementAndGet();
}
}
} finally {
latch.countDown();
}
}).start();
}
latch.await();
long totalTime = System.currentTimeMillis() - startTime;
double throughput = (double) successCount.get() / (totalTime / 1000.0);
assertThat(throughput).isGreaterThan(500); // 500 RPS minimum
assertThat(successCount.get()).isEqualTo(threadCount * requestsPerThread);
}
}
Summary
Understanding the trade-offs between latency and throughput is crucial for building high-performance systems. Key takeaways:
- Latency focuses on individual request speed – optimize for user experience
- Throughput focuses on system capacity – optimize for handling load
- The relationship is governed by Little’s Law and queueing theory
- Architecture choices significantly impact both metrics
- Monitoring both metrics is essential for maintaining performance
- Optimization strategies must be chosen based on system requirements
Remember the chai tapri analogy: sometimes you need the quick single-serve model, sometimes you need the high-volume team approach, and often you need a hybrid solution that balances both.
The key is to understand your system’s requirements, measure continuously, and optimize intelligently. Performance is not about choosing latency OR throughput – it’s about finding the right balance for your specific use case.
This guide provides a comprehensive foundation for understanding and optimizing latency vs throughput in system design. Apply these concepts thoughtfully, measure relentlessly, and always consider the trade-offs in your specific context.