Part 5/12:
CDC Data Collection: Change Data Capture (CDC) streams originate from the source databases via Debezium, which reads each database's change log and streams row-level insert, update, and delete events downstream.
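A Debezium change event wraps each row change in an envelope with `before`, `after`, `op`, and `ts_ms` fields. A minimal sketch of interpreting one such event (the `orders` table and its columns are hypothetical, and real events also carry a `source` metadata block):

```python
import json

# A simplified Debezium-style change event for a hypothetical `orders` table.
event_json = """
{
  "before": {"order_id": 41, "status": "PLACED"},
  "after":  {"order_id": 41, "status": "DELIVERED"},
  "op": "u",
  "ts_ms": 1700000000000
}
"""

# Debezium op codes: create, update, delete, snapshot read.
OPS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}

def describe(event: dict) -> str:
    """Turn a change event into a human-readable description."""
    kind = OPS[event["op"]]
    # Deletes have no `after` image; fall back to `before`.
    row = event["after"] if event["after"] is not None else event["before"]
    return f"{kind} on order_id={row['order_id']} (status={row.get('status')})"

print(describe(json.loads(event_json)))  # → update on order_id=41 (status=DELIVERED)
```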
Message Queues & Storage: Data is pushed into Kafka (Amazon MSK), with event schemas maintained through a schema registry. From Kafka, data lands in Amazon S3, which acts as a cost-effective, durable external storage layer.
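The Kafka-to-S3 landing zone is typically laid out with date-partitioned object keys so downstream jobs can read incrementally. A stdlib-only sketch of one such key scheme (the `raw/` prefix and naming convention are hypothetical; sink connectors make this configurable):

```python
from datetime import datetime, timezone

def s3_key(topic: str, partition: int, offset: int, ts_ms: int) -> str:
    """Build a date-partitioned object key for a raw CDC record.

    Layout: raw/<topic>/dt=<YYYY-MM-DD>/<partition>-<offset>.json
    (a hypothetical convention for illustration).
    """
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"raw/{topic}/dt={dt}/{partition:03d}-{offset:012d}.json"

print(s3_key("orders.cdc", 3, 4521, 1700000000000))
# → raw/orders.cdc/dt=2023-11-14/003-000000004521.json
```

Encoding the Kafka partition and offset in the key makes writes idempotent: re-delivered records overwrite the same object instead of creating duplicates.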
Data Processing & Transformation: Using Databricks with the Spark and Photon engines, Zepto performs data cleaning, de-duplication, and transformation across the bronze (raw), silver (cleaned), and gold (aggregated) layers of a medallion architecture.
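The bronze-to-silver de-duplication step commonly keeps only the latest change event per key. A pure-Python sketch of that rule (in production this would run as a Spark job; the field names are hypothetical):

```python
def deduplicate(events: list[dict]) -> list[dict]:
    """Keep only the most recent event per order_id, by ts_ms.

    Mirrors the common Spark pattern
    ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts_ms DESC) = 1.
    """
    latest: dict[int, dict] = {}
    for ev in events:
        key = ev["order_id"]
        if key not in latest or ev["ts_ms"] > latest[key]["ts_ms"]:
            latest[key] = ev
    return sorted(latest.values(), key=lambda e: e["order_id"])

# Raw (bronze) events, including a stale duplicate for order 1.
bronze = [
    {"order_id": 1, "status": "PLACED",    "ts_ms": 100},
    {"order_id": 1, "status": "DELIVERED", "ts_ms": 300},
    {"order_id": 2, "status": "PLACED",    "ts_ms": 200},
]
silver = deduplicate(bronze)
print([e["status"] for e in silver])  # → ['DELIVERED', 'PLACED']
```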
This architecture allows Zepto to:
Scale Indefinitely: Through the open Delta Lake table format on S3 object storage, which decouples storage from compute.
Improve Query Speed: By partitioning data on commonly filtered columns and optimizing file layout in storage.
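Partition pruning is what turns partitioned storage into query speed: a filter on the partition column lets the engine skip entire directories. A minimal stdlib sketch of the idea (the paths are hypothetical):

```python
def prune(paths: list[str], wanted_date: str) -> list[str]:
    """Return only the files under the dt=<wanted_date> partition.

    Real engines (Spark, Photon) prune from table metadata rather than
    string matching, but the effect is the same: partitions that do not
    match the filter are never read.
    """
    return [p for p in paths if f"/dt={wanted_date}/" in p]

files = [
    "gold/orders/dt=2023-11-13/part-0.parquet",
    "gold/orders/dt=2023-11-14/part-0.parquet",
    "gold/orders/dt=2023-11-14/part-1.parquet",
]
print(prune(files, "2023-11-14"))
# → ['gold/orders/dt=2023-11-14/part-0.parquet', 'gold/orders/dt=2023-11-14/part-1.parquet']
```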