Understanding Batch Processing
Batch processing is a technique for processing large datasets by breaking them into smaller, manageable chunks (batches). This prevents memory exhaustion, improves error recovery, and enables progress tracking.
Why Use Batch Processing?
Memory Management
Loading millions of records into memory at once causes out-of-memory errors. Batching limits memory usage to a predictable amount.
- Prevents application crashes from memory exhaustion
- Allows processing datasets larger than available RAM
- Enables predictable memory consumption patterns
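The bounded-memory idea above can be sketched with a small generator that materializes only one batch at a time (the helper name `batched` is illustrative; Python 3.12+ ships a similar `itertools.batched`):

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items.

    Only one batch is held in memory at a time, so memory use stays
    bounded no matter how large the underlying iterable is.
    """
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Because the input is consumed lazily, this works equally well on a list, a file handle, or a database cursor.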
Better Error Handling
Processing in batches isolates errors to specific chunks:
- Easier to identify which records caused failures
- Can retry failed batches without reprocessing everything
- Reduces impact of transient errors
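A minimal sketch of per-batch isolation and retry (`process_with_retries`, `process`, and `max_retries` are illustrative names, not from the text):

```python
def process_with_retries(batches, process, max_retries=3):
    """Process each batch independently, retrying only the ones that fail.

    A failure in one batch never forces reprocessing of batches that
    already succeeded; indices of batches that exhaust their retries
    are returned for later inspection.
    """
    failed = []
    for index, batch in enumerate(batches):
        for attempt in range(max_retries):
            try:
                process(batch)
                break  # success: move on to the next batch
            except Exception:
                if attempt == max_retries - 1:
                    failed.append(index)  # give up on this batch only
    return failed
```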
Progress Tracking
Batches provide natural checkpoints for monitoring:
- Display progress percentage to users
- Log progress for debugging and auditing
- Enable pause/resume functionality
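As a sketch, each completed batch is a natural point to emit a progress line (function and parameter names here are illustrative):

```python
def process_with_progress(records, batch_size, process, log=print):
    """Process records in batches, logging progress after each batch."""
    total = len(records)
    done = 0
    for start in range(0, total, batch_size):
        process(records[start:start + batch_size])
        done = min(start + batch_size, total)
        log(f"{done}/{total} records ({100.0 * done / total:.0f}%)")
    return done
```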
Transaction Management
Smaller transactions improve database performance:
- Shorter lock durations reduce contention
- Smaller transaction logs
- Faster rollback on errors
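A sketch of per-batch commits using the standard-library sqlite3 module (the `items` table and column are assumptions for the example):

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size):
    """Insert rows, committing once per batch.

    Committing per batch (rather than per row, or once for the whole
    dataset) keeps each transaction short: locks are held briefly and
    a rollback loses at most the current batch.
    """
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cur.executemany("INSERT INTO items (value) VALUES (?)", batch)
        conn.commit()  # end the transaction after each batch
```

The same commit-per-batch pattern applies to any database driver that exposes explicit transactions.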
Choosing the Right Batch Size
Too Small
Problems with batch sizes that are too small:
- High overhead from frequent network/disk I/O
- More transaction commits (database overhead)
- Slower overall processing time
- More API calls (higher costs with cloud services)
Too Large
Problems with batch sizes that are too large:
- Risk of out-of-memory errors
- Long-running transactions lock resources
- Difficult to recover from errors
- Poor progress visibility
- Database timeout risks
Just Right
The optimal batch size balances:
- Memory usage (50-80% of available)
- Processing time (1-10 seconds per batch)
- Error recovery (manageable retry scope)
- Progress updates (visible but not excessive)
Factors Affecting Batch Size
Available Memory
Memory is usually the primary constraint. Always leave headroom for:
- Operating system overhead
- Other processes sharing resources
- Temporary objects during processing
- Framework/library overhead
Record Size
Larger records mean fewer fit in memory:
- Simple records (100 bytes): Batches of 10,000+
- Medium records (1 KB): Batches of 1,000-5,000
- Large records (10+ KB): Batches of 100-1,000
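One way to turn these guidelines into a starting point is to derive a batch size from a memory budget and an average record size (all names and the default safety factor are illustrative assumptions, not a prescribed formula):

```python
def batch_size_from_memory(memory_budget_bytes, avg_record_bytes,
                           safety_factor=0.5):
    """Estimate a batch size from a memory budget.

    safety_factor leaves headroom for the OS, other processes, and
    temporary objects created during processing.
    """
    usable = memory_budget_bytes * safety_factor
    return max(1, int(usable // avg_record_bytes))
```

The result is only a starting point; measure and tune against real workloads.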
Processing Complexity
- Simple transformations: Larger batches OK
- Complex computations: Smaller batches to prevent timeouts
- External API calls: Consider rate limits
- Database writes: Balance commit frequency with transaction size
Network and I/O
- Local processing: Can use larger batches
- Network transfers: Smaller batches reduce latency impact
- Cloud APIs: Check batch size limits in documentation
Implementation Best Practices
Dynamic Batch Sizing
Adjust batch size based on runtime conditions:

```python
# Illustrative adjustment step; memory_pressure_high, memory_available,
# and processing_fast stand in for whatever runtime signals the
# application actually exposes.
if memory_pressure_high:
    batch_size = max(min_batch_size, batch_size // 2)        # back off quickly
elif memory_available and processing_fast:
    batch_size = min(int(batch_size * 1.5), max_batch_size)  # grow gradually
```
Checkpoint and Resume
Track progress to enable resumption after failures:
- Save last processed record ID/offset
- Use idempotent operations (safe to retry)
- Implement proper rollback on batch failure
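A minimal sketch of offset checkpointing (the JSON file format and all names are illustrative; `process` must be idempotent in case a crash lands between processing a batch and saving its checkpoint):

```python
import json
import os

def process_with_checkpoint(records, batch_size, process,
                            checkpoint_path="checkpoint.json"):
    """Process in batches, saving the last completed offset after each.

    On restart, processing resumes from the saved offset, skipping
    batches that already completed.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["offset"]
    for offset in range(start, len(records), batch_size):
        process(records[offset:offset + batch_size])
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": offset + batch_size}, f)
```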
Parallel Processing
Process multiple batches concurrently:
- Use thread pools or process pools
- Ensure batch independence (no shared state)
- Monitor total memory across all workers
- Consider database connection pooling
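A sketch using a standard-library thread pool (names are illustrative). This is safe only when `process` shares no mutable state across batches, and `max_workers` also caps peak memory, since at most that many batches are in flight at once:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batches_parallel(batches, process, max_workers=4):
    """Process independent batches concurrently, preserving result order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, batches))
```

For CPU-bound work in CPython, a `ProcessPoolExecutor` with the same interface may be the better fit.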
Monitor and Tune
- Track memory usage per batch
- Measure processing time per batch
- Log batch sizes and performance metrics
- Adjust based on production data
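The measurements above can be captured with a small wrapper that records per-batch size and wall-clock time (a sketch; names are illustrative):

```python
import time

def timed_process(batches, process):
    """Process batches while recording size and duration for each,
    so batch size can be tuned from real measurements."""
    metrics = []
    for batch in batches:
        start = time.perf_counter()
        process(batch)
        metrics.append({"size": len(batch),
                        "seconds": time.perf_counter() - start})
    return metrics
```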
Common Batch Sizes by Use Case
| Use Case | Typical Batch Size | Rationale |
| --- | --- | --- |
| Database inserts | 1,000 - 10,000 | Balance commit overhead with transaction size |
| API calls | 100 - 1,000 | Respect rate limits, handle failures |
| File processing | 10,000 - 100,000 | Minimize I/O overhead |
| Machine learning inference | 32 - 256 | GPU memory constraints, model size |
| Stream processing | 100 - 1,000 | Low latency requirements |