- This allows the system to handle new event types gracefully, without manual schema tracking.
Data Transformation Workflow
The workflow for processing incoming raw events is as follows:
Metadata Extraction: Headers such as event source, country, and event type are extracted to give each event its context.
Schema Inference & Bucketing: The system infers the schema, compares it with existing ones, and buckets the data accordingly.
Casting & Transformation: Using predefined casting rules, the raw JSON data is upcast, meaning its data types are resolved to a generic schema.
Compression & Storage: Transformed data is compressed and stored in S3 in Apache Parquet format, with file sizes tuned to mitigate the small file problem.
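The first three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual implementation: the header names in METADATA_KEYS, the rules in UPCAST_RULES, and every function name are hypothetical, and schema inference only looks at top-level fields.

```python
import hashlib
import json
from collections import defaultdict

# Hypothetical header names; the source only says that event source,
# country, and event type are extracted.
METADATA_KEYS = ("event_source", "country", "event_type")

# Hypothetical casting rules: observed Python type name -> generic type.
UPCAST_RULES = {"int": "double", "float": "double", "bool": "string", "str": "string"}


def extract_metadata(headers):
    """Step 1: pull the contextual headers out of the raw event."""
    return {key: headers.get(key) for key in METADATA_KEYS}


def infer_schema(payload):
    """Step 2a: map each top-level field to the name of its observed type."""
    return {field: type(value).__name__ for field, value in sorted(payload.items())}


def schema_fingerprint(schema):
    """Step 2b: a stable short hash, so identical schemas share a bucket."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def upcast(payload):
    """Step 3: resolve each value to its generic type using UPCAST_RULES."""
    result = {}
    for field, value in payload.items():
        target = UPCAST_RULES.get(type(value).__name__, "string")
        if value is None:
            result[field] = None
        elif target == "double":
            result[field] = float(value)
        elif isinstance(value, str):
            result[field] = value
        else:
            # Booleans and nested structures are serialized as JSON text.
            result[field] = json.dumps(value)
    return result


def bucket_events(events):
    """Group (headers, payload) pairs by event type and inferred schema."""
    buckets = defaultdict(list)
    for headers, payload in events:
        meta = extract_metadata(headers)
        key = (meta["event_type"], schema_fingerprint(infer_schema(payload)))
        buckets[key].append({"meta": meta, "data": upcast(payload)})
    return dict(buckets)
```

An event with a previously unseen field simply produces a new fingerprint and therefore a new bucket, with no manual schema registration. The final step is left out of the sketch: each bucket would be accumulated until it reaches a target size and then written as one compressed Parquet file, which is what keeps the number of small files in S3 down.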