KDnuggets Beginner’s Guide to Apache Flink – 12 Key Terms, Explained

We review 12 core Apache Flink concepts, to better understand what it does and how it works, including streaming engine terminology.

By Andres Vivanco, TU Berlin, Master IT4BI

This post introduces the basic concepts, technology, and workflow behind Apache Flink:

1. What is Apache Flink?

The origins of Apache Flink can be traced back to June 2008, when it began as a research project of the Database Systems and Information Management (DIMA) Group at the Technische Universität (TU) Berlin in Germany.

Apache Flink is an open source platform for distributed stream and batch data processing. It was initially designed as an alternative to MapReduce and the Hadoop Distributed File System (HDFS) of early Hadoop.

According to the Apache Flink project, it is

“an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also builds batch processing on top of the streaming engine, overlaying native iteration support, managed memory, and program optimization.”

What does Flink offer?

Streaming:

  • High Performance & Low Latency
  • Support for Event Time and Out-of-Order Events
  • Exactly-once Semantics for Stateful Computations
  • Highly flexible Streaming Windows
  • Continuous Streaming Model with Backpressure
  • Fault-tolerance via Lightweight Distributed Snapshots

Batch and Streaming in One System

  • One Runtime for Streaming and Batch Processing
  • Memory Management
  • Iterations and Delta Iterations
  • Program Optimizer

Ecosystem

  • Broad Integration (YARN, Hadoop, HDFS, Kafka, others)

What are Flink’s components?

The Flink stack offers application programming interfaces (APIs) in Java, Scala, and Python, a shell console, tools, and libraries for developing new data-intensive applications on top of the Flink engine.

Fig. 1: Apache Flink Stack - see flink.apache.org/

2. Deploy Layer / What does Flink’s Deploy Layer do?

Apache Flink can execute programs in a variety of contexts, such as standalone or embedded in other programs. Execution can take place in a local Java Virtual Machine (JVM) or on clusters with many nodes.

3. Core Layer (Runtime) / What does Flink’s Core Layer do?

The Distributed Streaming Dataflow layer receives a program in the form of a “JobGraph”: a generic parallel data flow with arbitrary tasks whose inputs and outputs are data streams.
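
To make the idea of a data flow of tasks more concrete, here is a minimal sketch in plain Python. It does not use Flink's actual JobGraph class; the task names and the `execution_order` helper are hypothetical, and the only point is to show tasks connected by streams and an order in which every task comes after the tasks that feed it.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each task maps to the set of upstream tasks it reads from.
job_graph = {
    "source": set(),
    "parse": {"source"},
    "filter_errors": {"parse"},
    "count_all": {"parse"},
    "sink": {"filter_errors", "count_all"},
}


def execution_order(graph):
    """Return the tasks in an order where each task follows all of its inputs."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(job_graph)
print(order)
```

The two middle tasks do not depend on each other, so they could run in parallel, which is what "parallel data flow" refers to.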

4. Stream Processing / What is stream processing?

Stream processing applies business logic to every event as it is recorded, in real time. Typical sources are e-commerce transactions, Internet of Things (IoT) sensors that emit data continuously, city traffic monitoring, telecom, and banking. In other words, each event is processed as it is captured, instead of storing all events first and then processing them as a batch. The advantage of this approach is that the analysis reflects the state of the data at that instant, in real time and over unbounded data.
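
The per-event idea can be sketched in plain Python with generators. This is not Flink code: the `sensor_events` source, the `alerts` function, and the threshold are made up for illustration. The source never ends, and the rule is applied to each event as it arrives rather than after collecting everything.

```python
import itertools


def sensor_events():
    """Hypothetical unbounded source: yields (sensor_id, temperature) forever."""
    for i in itertools.count():
        yield ("sensor-%d" % (i % 3), 20.0 + (i % 7))


def alerts(events, threshold):
    """Apply the business rule to each event as soon as it arrives."""
    for sensor_id, temperature in events:
        if temperature > threshold:
            yield (sensor_id, temperature)


# Take only the first three alerts; the source itself never terminates.
first_alerts = list(itertools.islice(alerts(sensor_events(), 25.0), 3))
print(first_alerts)
```

Because nothing is stored up front, results are available while data keeps flowing in.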

5. Batch Processing

Batch processing handles a large volume of bounded data at once. The units of work are called batch jobs. Batch jobs can be accumulated during working hours and executed in the evening, or accumulated over weeks or months and executed on a weekend or once a month. The classic example is how credit card companies process billing: the customer does not receive a bill for each transaction, but usually one bill per month, once all the data has been collected. One tool for this kind of workload is Hadoop, which provides MapReduce for processing large-scale files that can hold months or years of stored data.
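
The billing example can be sketched in a few lines of plain Python. The transactions, customer names, and the `monthly_bills` helper are hypothetical; the point is that the complete, bounded data set exists before the job runs, and one pass produces one bill per customer.

```python
from collections import defaultdict

# Hypothetical month of stored transactions: (customer, amount in cents).
transactions = [
    ("alice", 1250),
    ("bob", 4000),
    ("alice", 799),
    ("bob", 150),
    ("carol", 9900),
]


def monthly_bills(records):
    """Batch job: read the whole bounded data set, then total it per customer."""
    totals = defaultdict(int)
    for customer, amount in records:
        totals[customer] += amount
    return dict(totals)


bills = monthly_bills(transactions)
print(bills)
```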

6. Flink DataStream API (for Stream Processing)

DataStream is the main API that Apache Flink offers, and it is what sets Flink apart from its competitors. The DataStream API lets you develop programs (in Java, Scala, and Python) that implement transformations on data streams (see examples in 6.1). Data streams are initially created from sources such as message queues, socket streams, or files. Results are returned via data sinks, which write the data to, for example, distributed files or the command-line terminal.
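
The source, transformation, and sink shape described above can be imitated in plain Python. This sketch deliberately avoids the real DataStream API: `source`, `to_upper`, and `ListSink` are hypothetical stand-ins that only mirror how records flow from one stage to the next.

```python
def source(lines):
    """Hypothetical source: emits one record at a time, as a socket or file would."""
    for line in lines:
        yield line


def to_upper(stream):
    """Transformation: derive a new stream from an existing one."""
    for record in stream:
        yield record.upper()


class ListSink:
    """Hypothetical sink that keeps results in memory instead of a file or terminal."""

    def __init__(self):
        self.records = []

    def write(self, stream):
        for record in stream:
            self.records.append(record)


sink = ListSink()
sink.write(to_upper(source(["flink", "stream"])))
print(sink.records)
```

Nothing is computed until the sink starts pulling records, so each record passes through the whole pipeline one at a time.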

6.1 Examples of transformations in Flink:

  • Map
  • FlatMap
  • Filter
  • KeyBy
  • Reduce
  • Fold
  • Aggregations
  • Window
  • WindowAll
  • Window Apply
  • Window Reduce
  • Window Fold
  • Aggregations on windows
  • Union
  • Window Join
  • Window CoGroup
  • Connect
  • CoMap, CoFlatMap
  • Split
  • Select
  • Iterate
  • Extract Timestamps
  • Project (for data streams of Tuples)
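
A few of the transformations listed above (FlatMap, Filter, Map, KeyBy, Reduce) can be illustrated with a word count in plain Python. This is only an analogy on a small bounded list, not Flink's API: the input lines and the stop word are made up, and this sketch computes only the final totals for each key.

```python
from collections import defaultdict
from functools import reduce

lines = ["to be or not", "to be"]

# FlatMap: one input line produces zero or more words.
words = [word for line in lines for word in line.split()]

# Filter: keep only the records that satisfy a predicate.
kept = [word for word in words if word != "or"]

# Map: turn each record into exactly one new record.
pairs = [(word, 1) for word in kept]

# KeyBy: partition the records by key, here the word itself.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce: combine the values of each key pairwise into a single value.
counts = {word: reduce(lambda a, b: a + b, values) for word, values in groups.items()}
print(counts)
```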