We build stream processing systems that ingest tens of millions of events per day, transform them in real time, and surface actionable insights with sub-second latency — using Kafka, ClickHouse, and event-driven microservices.
Every pipeline starts with a schema registry. We define event contracts (using Avro or Protocol Buffers) before writing producers or consumers. Schema evolution rules (backward/forward compatibility) are enforced at the registry level, preventing breaking changes from propagating silently.
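One backward-compatibility rule the registry enforces can be sketched in a few lines: a new reader schema can still decode old data only if every field it adds carries a default. This is a simplified illustration, not the registry's full algorithm, and the `OrderPlaced` schema is a made-up example.

```python
def added_fields_have_defaults(old_schema: dict, new_schema: dict) -> bool:
    """Check one Avro backward-compatibility rule: new fields need defaults."""
    old_names = {f["name"] for f in old_schema["fields"]}
    return all(
        "default" in f
        for f in new_schema["fields"]
        if f["name"] not in old_names
    )

v1 = {"type": "record", "name": "OrderPlaced",
      "fields": [{"name": "order_id", "type": "string"}]}

# Compatible evolution: the added field has a default value.
v2 = {"type": "record", "name": "OrderPlaced",
      "fields": [{"name": "order_id", "type": "string"},
                 {"name": "currency", "type": "string", "default": "USD"}]}

# Breaking change: no default, so old events cannot be decoded.
v3 = {"type": "record", "name": "OrderPlaced",
      "fields": [{"name": "order_id", "type": "string"},
                 {"name": "currency", "type": "string"}]}
```

A real registry checks many more rules (type promotions, aliases, union changes), but this is the one that catches most accidental breakage.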
Stream processors must handle duplicates gracefully. We implement idempotent writes using event IDs as deduplication keys, and leverage Kafka's transactional API for exactly-once semantics where business correctness demands it — financial calculations, inventory updates.
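The deduplication half of this is simple to sketch: the event ID acts as the key, and a redelivered event becomes a no-op. Here a plain dict stands in for any keyed sink (a ClickHouse ReplacingMergeTree table, a relational upsert); the names are illustrative, not from a specific codebase.

```python
def apply_event(store: dict, event: dict) -> bool:
    """Apply an event at most once; return False if it was a duplicate."""
    event_id = event["event_id"]
    if event_id in store:
        return False            # redelivery: already applied, ignore
    store[event_id] = event["payload"]
    return True

store = {}
evt = {"event_id": "e-1", "payload": {"amount": 42}}
apply_event(store, evt)   # first delivery: applied
apply_event(store, evt)   # redelivery after a consumer restart: ignored
```

Idempotent writes like this cover at-least-once delivery; the transactional API is reserved for the cases where the write and the offset commit must succeed or fail together.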
Mobile and distributed systems produce events that arrive late — sometimes hours after the fact. We design windowing strategies with configurable watermarks and late-arrival handling, ensuring that historical aggregates are updated correctly when delayed events arrive.
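The key design choice is that windows are keyed by their start time, so a delayed event simply updates the aggregate for the historical window it belongs to. A minimal sketch, with an assumed 60-second tumbling window and a count as the aggregate:

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(event_ts: int) -> int:
    """Map an event timestamp to the start of its tumbling window."""
    return event_ts - (event_ts % WINDOW_SECONDS)

counts = defaultdict(int)

def ingest(event_ts: int) -> None:
    counts[window_start(event_ts)] += 1

ingest(120)   # lands in window [120, 180)
ingest(500)   # lands in window [480, 540)
ingest(130)   # late arrival for [120, 180): the historical window is updated
```

A production processor adds watermarks on top of this, deciding when a window is "closed enough" to emit, and how long after that to keep accepting stragglers.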
Kafka's configurable retention means any consumer can replay historical events. We design pipelines with replay in mind from the start — separating raw event storage from derived state so schema changes or bug fixes can be applied retroactively.
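The raw-versus-derived split can be illustrated as: keep the append-only event log untouched, make derived state a pure function of it, and a bug fix becomes a replay from offset zero. The "revenue by day" transform below is a made-up example of such a fix.

```python
from collections import defaultdict

raw_log = [  # append-only record of what actually happened, never rewritten
    {"day": "2024-01-01", "cents": 1000},
    {"day": "2024-01-01", "cents": 250},
    {"day": "2024-01-02", "cents": 400},
]

def rebuild(transform):
    """Recompute derived state by replaying the full raw log."""
    state = defaultdict(float)
    for event in raw_log:
        day, value = transform(event)
        state[day] += value
    return dict(state)

buggy = rebuild(lambda e: (e["day"], e["cents"]))        # forgot unit conversion
fixed = rebuild(lambda e: (e["day"], e["cents"] / 100))  # replay with the fix
```

Because the raw log is the source of truth, the fix requires no backfill scripts: deploy the corrected transform, reset the consumer offset, and replay.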
We map all events in the system, define their schemas, and agree on ownership boundaries. Each event type lives in one topic, owned by one service.
Producers are built and deployed first, with consumers reading from day one in parallel — allowing us to validate event shape before building downstream logic.
We set up consumer lag dashboards from the first deployment. Lag is the most important health signal — it tells us immediately if processing has fallen behind producers.
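Per-partition lag is the broker's log-end offset minus the consumer group's last committed offset. In production these numbers come from the Kafka admin API; the hard-coded values below only show the computation a lag dashboard plots.

```python
def consumer_lag(log_end_offsets: dict, committed: dict) -> dict:
    """Lag per partition: how far the consumer trails the latest events."""
    return {
        partition: log_end_offsets[partition] - committed.get(partition, 0)
        for partition in log_end_offsets
    }

lag = consumer_lag(
    log_end_offsets={0: 1_500, 1: 1_480},
    committed={0: 1_500, 1: 900},
)
# Partition 1 is 580 events behind while partition 0 is fully caught up:
# exactly the asymmetry a lag dashboard surfaces immediately.
```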
We replicate production-scale event volumes in staging using a traffic generator and verify that consumer throughput, ClickHouse ingestion rate, and dashboard latency all meet SLAs.
We instrument events with a source timestamp and measure time-to-dashboard for a sample of events continuously. Alerting fires if p99 exceeds the agreed SLA (typically 500ms for operational dashboards, 5s for analytics).
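The check itself reduces to: latency is dashboard-visible time minus the event's source timestamp, and alerting compares the sampled p99 against the SLA. A sketch with the 500ms operational threshold quoted above; the sample values and the nearest-rank p99 estimator are illustrative.

```python
import math

def p99(samples_ms: list) -> float:
    """Nearest-rank 99th percentile of observed latencies."""
    ordered = sorted(samples_ms)
    index = math.ceil(len(ordered) * 0.99) - 1
    return ordered[index]

def breaches_sla(samples_ms: list, sla_ms: int = 500) -> bool:
    return p99(samples_ms) > sla_ms

breaches_sla([120, 180, 240, 95, 310, 450, 200])  # tail within the 500ms SLA
breaches_sla([120, 180, 900])                     # tail latency blown: alert
```

Measuring from the source timestamp (not broker receive time) is what makes the number honest: it includes producer batching, network hops, and ingestion delay, not just processing time.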
We simulate Kafka broker loss, consumer crashes, and ClickHouse write failures. Key assertions: no data loss, lag recovers within defined time, downstream dashboards show stale-data indicators rather than incorrect data.
Aggregated metrics from the pipeline are compared against ground truth from the source database on a scheduled basis. Discrepancies above 0.1% trigger an alert and investigation.
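The reconciliation check can be sketched as a relative-error comparison against the 0.1% threshold mentioned above; the metric values here are invented for illustration.

```python
def discrepancy_ratio(pipeline_value: float, source_value: float) -> float:
    """Relative discrepancy between the pipeline aggregate and ground truth."""
    if source_value == 0:
        return 0.0 if pipeline_value == 0 else float("inf")
    return abs(pipeline_value - source_value) / source_value

def needs_investigation(pipeline_value: float, source_value: float,
                        threshold: float = 0.001) -> bool:
    return discrepancy_ratio(pipeline_value, source_value) > threshold

needs_investigation(998_000, 1_000_000)  # 0.2% off: alert and investigate
needs_investigation(999_500, 1_000_000)  # 0.05% off: within tolerance
```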
All schema changes are tested against schema registry compatibility rules before deployment. Backward-incompatible changes require a multi-step migration: add new field → dual-write → remove old field.
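The dual-write step looks like this in producer code, using a hypothetical rename of a `user` field to `user_id` as the example. During the migration window both fields are emitted so old and new consumers keep working; the legacy field is dropped only once no consumer reads it.

```python
def to_wire_format(user_id: str, dual_write: bool = True) -> dict:
    """Serialize an event during a field rename migration."""
    event = {"user_id": user_id}      # new contract
    if dual_write:
        event["user"] = user_id       # legacy field, kept until consumers migrate
    return event

to_wire_format("u-42")                    # migration window: both fields present
to_wire_format("u-42", dual_write=False)  # final shape after the old field is removed
```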
Tell us your event volume, latency requirements, and what insights you need to surface. We'll propose an architecture that fits.