Part 5/11:
Enormous volume: 57 million events daily from just three key APIs.
Cost and performance constraints.
Schema variability: Evolving data schemas over time.
Small file proliferation: Generating many tiny files that complicate storage and retrieval.
Redbus faced the challenge of storing this raw data cost-effectively while maintaining accessibility and integrity.
Exploiting Schema Inference and Optimized Storage
Schema Evolution and Inference
Because API data schemas evolve daily, with new fields and structures being introduced, the team adopted schema inference:
When raw data arrives, the system automatically deduces its schema.
Each inferred schema is versioned, and buckets are created based on timestamps.
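The two steps above can be illustrated with a minimal Python sketch. It is not Redbus's implementation: the names infer_schema, SchemaRegistry, and bucket_for, the hourly bucket granularity, and the path layout are all hypothetical, chosen only to show how inference, versioning, and timestamp-based bucketing might fit together.

```python
import hashlib
import json
from datetime import datetime, timezone


def infer_schema(value):
    """Recursively describe the shape of a decoded JSON value.

    Objects become dicts of field -> schema, arrays become a list of the
    distinct element schemas, and scalars become their Python type name.
    """
    if isinstance(value, dict):
        return {key: infer_schema(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        distinct = []
        for item in value:
            item_schema = infer_schema(item)
            if item_schema not in distinct:
                distinct.append(item_schema)
        return distinct
    return type(value).__name__  # e.g. "str", "int", "float", "bool", "NoneType"


class SchemaRegistry:
    """Hypothetical in-memory registry: one version number per distinct schema."""

    def __init__(self):
        self._versions = {}  # schema fingerprint -> version number

    def version_for(self, schema):
        # Serialise with sorted keys so field order does not affect the hash.
        canonical = json.dumps(schema, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(canonical).hexdigest()
        if fingerprint not in self._versions:
            self._versions[fingerprint] = len(self._versions) + 1
        return self._versions[fingerprint]


def bucket_for(epoch_seconds, version):
    """Build a bucket path from the schema version and the UTC hour.

    Assumption: hourly buckets under a per-version prefix; the source only
    says buckets are created based on timestamps.
    """
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"schema_v{version}/{ts:%Y/%m/%d/%H}"
```

In this sketch, two events with the same fields in a different order map to the same version, while an event that introduces a new field gets the next version number, so older data remains readable under the schema it was written with.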