Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Jan 17, 2017

This month we have a great session on Big Data & Enterprise Integration in Java, with a deep-dive discussion of Akka, Kafka, Spark, and Alpakka.

In this session, we will discuss: 
* reactive architecture tenets 
* distributed “fast data” streams 
* an application- and analytics-focused Data Lake

We will begin at the conceptual level with enterprise concerns: the importance of holistic governance, operational management, and a Metadata Lake.  From there we will explore what a prospective architecture looks like at scale, with terabytes of ingestion per day, how that scale puts pressure on an architecture, and how to succeed without losing data in a mission-critical system by relying on resilient, self-healing, scalable technologies.  DevOps and application architecture concerns will be first-class themes throughout.

Reactive principles and technology (Kafka, Akka, and Spark) will be the second act of this talk.  We will review several streaming technologies (Kafka Streams, Akka Streams, Spark Streaming) to identify what each is best suited for.  The fast data pipeline discussion will center on Kafka, Akka, and Apache Flink (Lightbend Fast Data platform).  We’ll also walk through an exciting addition to the Akka family, Alpakka, which is the Akka counterpart to Apache Camel for Enterprise Integration Patterns.

The final act will dive into the Data Lake from both an analytics and an application development perspective.  Amazon and Hadoop technologies will be used to explain the concepts.  A Data Lake may serve multiple analytics consumers, each with its own “view” (and access level) of the data.  It may also participate in various applications, perhaps by acting as a centralized source for reference data or as common middleware (in turn feeding the analytics side).  The Metadata Lake, which applies structure, meaning, and purpose to the data, will be presented as an overarching success factor for a Data Lake.  The difference between the Data Lake and the Metadata Lake is conceptually similar to a halocline…  Various technologies (Iglu/Snowplow and more) will be discussed from a feature standpoint to flesh out the capabilities needed for Data Lake governance.