I worked at the KPN 'Datascience Lab' in a six-person innovation team. We developed new techniques and methodologies for real-time event processing and analytics on streaming data; an initial pilot showed live "Customer Journeys". The team's mandate covered only stream processing; batch processing and ETL were out of scope.
We designed the so-called Realtime Analytics Engine ("RAE"), a horizontally scalable, fault-tolerant system that made it easy to apply algorithms to live event data. The system was composed of state-of-the-art components:
- Kubernetes as the base platform
- Kafka for persistent event queues
- Kafka Streams and Apache Flink for event transformation
- Avro, JSON Schema and similar formats for event schemas
- Cassandra for high throughput storage
- PostgreSQL for storage and stream-table joins
- Helm for deployment on the platform
Part of my job was writing an overall architecture document that described and justified the design choices.
During the two years I spent on this project, I:
- Coauthored a design document defining rae-schema, an HTTP-inspired protocol that allowed us to separate event schema definition, event validation, routing, partitioning and filtering from the serialized event content (see the sketch after this list)
- Wrote Java, Python and JavaScript versions of a 'rae-schema' library that implemented this design
- Wrote initial versions of Flink jobs that provided real-time aggregation of customer journeys
- Wrote a D3-based browser application that showed live customer journey statistics
- Deployed an initial version of the RAE with Ansible on a cluster of six virtual machines
- Deployed a second version of the RAE on a Docker cluster
- Deployed a third and final version on Kubernetes, using Helm
- Wrote Kafka Streams versions of the Flink jobs
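To make the schema/content separation concrete, here is a minimal sketch of what an HTTP-inspired event envelope can look like. The class and header names (`RaeEnvelope`, `rae-schema`, `rae-partition-key`) are illustrative placeholders, not the actual rae-schema wire format:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/**
 * Hypothetical sketch of an HTTP-inspired event envelope: schema,
 * routing and partitioning metadata live in plain-text headers, while
 * the payload stays an opaque byte array until a consumer that knows
 * the schema deserializes it.
 */
public final class RaeEnvelope {

    private final Map<String, String> headers; // e.g. schema id, partition key
    private final byte[] payload;              // Avro/JSON-serialized event

    public RaeEnvelope(Map<String, String> headers, byte[] payload) {
        this.headers = Map.copyOf(headers);
        this.payload = payload.clone();
    }

    /** The schema id lets validators pick a schema without parsing the body. */
    public String schemaId() { return headers.get("rae-schema"); }

    /** The partition key is available to routers without deserialization. */
    public String partitionKey() { return headers.get("rae-partition-key"); }

    public byte[] payload() { return payload.clone(); }

    public static void main(String[] args) {
        RaeEnvelope event = new RaeEnvelope(
                Map.of("rae-schema", "customer-journey/v2",
                       "rae-partition-key", "customer-42",
                       "content-type", "application/avro"),
                "{...serialized event...}".getBytes(StandardCharsets.UTF_8));

        // A router can filter or partition on headers alone:
        if ("customer-journey/v2".equals(event.schemaId())) {
            System.out.println("route by key " + event.partitionKey());
        }
    }
}
```

As with HTTP, intermediaries only ever read the headers; the body remains opaque bytes, which is what allows validation, routing and filtering to be implemented once, independently of any particular event format.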
A lot of my time was spent debugging issues on the still-immature Kubernetes cluster.
The key design idea of our engine was Martin Kleppmann's "turning the database inside out": keep the immutable event log as the system of record and derive all other state from it as materialized views.
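As a minimal illustration of that pattern (not the actual KPN jobs; the topic names, application id and the simple count aggregation are placeholders for the real customer-journey logic), a Kafka Streams job can derive a continuously updated table from the event log and publish it back as a changelog:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class JourneyCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "journey-counts-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // The Kafka topic is the system of record: an immutable log of
        // customer events, keyed by customer id.
        KTable<String, Long> counts = builder
                .stream("customer-events",
                        Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .count();

        // The derived table is itself published as a changelog stream, so
        // any downstream store or dashboard can materialize the same state.
        counts.toStream().to("journey-event-counts",
                Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The log remains the source of truth; the per-customer counts are just one materialized view over it, and any other store (Cassandra, PostgreSQL, a live dashboard) can rebuild an equivalent view by replaying the same log.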
References:
- Martin Kleppmann, "Turning the database inside-out with Apache Samza", Strange Loop 2014; https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/