I worked at the KPN 'Datascience Lab' in a six-person innovation team. We developed new techniques and methodologies for real-time event processing and analytics on streaming data; an initial pilot showed live "Customer Journeys". The team's mandate covered only stream processing; batch processing and ETL were out of scope.
We designed the so-called Realtime Analytics Engine ("RAE"), a horizontally scalable, fault-tolerant system that made it easy to apply algorithms to live event data. The system was composed of state-of-the-art components:
- Kubernetes as the base platform
- Kafka for persistent event queues
- Kafka Streams and Apache Flink for event transformation
- Avro, JSON Schema and similar formats for event schemas
- Cassandra for high throughput storage
- PostgreSQL for storage and stream-table joins
- Helm for deployment on the platform
Part of my job was writing an overall architecture document that described and justified the design choices.
During the two years I spent on this project, I:
- Coauthored a design document defining rae-schema, an HTTP-inspired protocol that allowed us to separate event schema definition, event validation, routing, partitioning and filtering from the serialized event content (see the sketch after this list)
- Wrote Java, Python and JavaScript versions of a 'rae-schema' library that implemented this design
- Wrote initial versions of Flink jobs that provided real-time aggregation of customer journeys
- Wrote a D3-based browser application that showed live customer journey statistics
- Deployed an initial version of the RAE with Ansible on a cluster of six virtual machines
- Deployed a second version of the RAE on a Docker cluster
- Deployed a third and final version on Kubernetes, using Helm
- Wrote Kafka Streams versions of the Flink jobs
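To make the schema/content separation concrete, here is a minimal sketch of what an HTTP-inspired event envelope can look like. The class and header names (`RaeEnvelope`, `rae-schema`, `rae-partition-key`) are illustrative placeholders, not the actual rae-schema wire format:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/**
 * Hypothetical sketch of an HTTP-inspired event envelope: schema,
 * routing and partitioning metadata live in plain-text headers, while
 * the payload stays an opaque byte array until a consumer that knows
 * the schema deserializes it.
 */
public final class RaeEnvelope {

    private final Map<String, String> headers; // e.g. schema id, partition key
    private final byte[] payload;              // Avro/JSON-serialized event

    public RaeEnvelope(Map<String, String> headers, byte[] payload) {
        this.headers = Map.copyOf(headers);
        this.payload = payload.clone();
    }

    /** The schema id lets validators pick a schema without parsing the body. */
    public String schemaId() { return headers.get("rae-schema"); }

    /** The partition key is available to routers without deserialization. */
    public String partitionKey() { return headers.get("rae-partition-key"); }

    public byte[] payload() { return payload.clone(); }

    public static void main(String[] args) {
        RaeEnvelope event = new RaeEnvelope(
                Map.of("rae-schema", "customer-journey/v2",
                       "rae-partition-key", "customer-42",
                       "content-type", "application/avro"),
                "{...serialized event...}".getBytes(StandardCharsets.UTF_8));

        // A router can filter or partition on headers alone:
        if ("customer-journey/v2".equals(event.schemaId())) {
            System.out.println("route by key " + event.partitionKey());
        }
    }
}
```

As with HTTP, intermediaries only ever read the headers; the body remains opaque bytes, which is what allows validation, routing and filtering to be implemented once, independently of any particular event format.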
A lot of my time was spent debugging issues on the still-immature Kubernetes cluster.
The key design idea of our engine was Martin Kleppmann's "turning the database inside out": keep the immutable event log as the system of record and derive all other state from it as materialized views.
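As a minimal illustration of that pattern (not the actual KPN jobs; the topic names, application id and the simple count aggregation are placeholders for the real customer-journey logic), a Kafka Streams job can derive a continuously updated table from the event log and publish it back as a changelog:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class JourneyCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "journey-counts-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // The Kafka topic is the system of record: an immutable log of
        // customer events, keyed by customer id.
        KTable<String, Long> counts = builder
                .stream("customer-events",
                        Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .count();

        // The derived table is itself published as a changelog stream, so
        // any downstream store or dashboard can materialize the same state.
        counts.toStream().to("journey-event-counts",
                Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The log remains the source of truth; the per-customer counts are just one materialized view over it, and any other store (Cassandra, PostgreSQL, a live dashboard) can rebuild an equivalent view by replaying the same log.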
References:
- Martin Kleppmann, "Turning the database inside-out with Apache Samza", Strange Loop 2014; https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/