Part 1 : Apache Beam : Key Pillar in IoT driven Digital Transformation

This is the first of three part series where I will be covering Apache Beam. While this part talks about features and benefits, the next part will give code examples and the last part will detail on how to migrate existing data streaming logics from cloud to edge.

Digital transformation has been an overused and a raging topic for the most part of the last decade. As the different channels of data generation, acquisition increased and data processing with cloud increased, enterprises and startups came up with innovative products and frameworks to add value.

As the ecosystem matured, people were looking for ways to improve the overall ecosystem tailoring to their needs. In the IoT ecosystem, it is the real time decision which differentiated the successful IoT solution and products from the others.

In IoT driven digital transformation, technologists were looking at improving upon the data processing from the following perspective

  • Ability to take decisions quickly
  • Reducing the noise in data and using cloud for streaming quality data
  • Optimizing the cloud cost

Edge driven data processing and decision making is an ideal next stage for IoT solutions to meet the above mentioned objectives. By Edge, i am referring to Mini Edge, medium edge and heavy edge captured beautifully by Janakiram in this article and this article also considers only those edge layers for evaluation.Taking the experiences from cloud native solution development, I was looking for following features in data processing ecosystem to make the edge solutions more reusable and run time independent

  1. Consistent approach for data processing at cloud and at edge. Though edge is gaining traction, cloud based data processing is still here to stay.
  2. Single library or language to meet the various runtime requirements. Tomorrow if i am changing my cloud provider, i should not be changing the code completely.
  3. Clear distinction and demarcation between the data travelling infrastructure, computation infrastructure, library which is independent of a specific product (be it data ingestion platform or runtime infrastructure),ability to run both at low powered edge location and high powered cloud runtimes.
  4. Library which is easy to learn and implement and available in different languages.
  5. Consistent design and implementation approach for different data sources (File processing,Databases, data streams etc)

Let us take a pause and look at popular options for real time data streaming at cloud. Below are popular messaging engines

a. Kafka

b. Azure EventHub

c. Google Pubsub

d. AWS Kinesis

Kafka is one of the popular data streaming engines. It also provides data access via its API and also as KSQL (which I really love). Spark API also supports accessing Kafka messaging and performing streaming operations.

EventHub is similar to Kafka from multiple perspectives. It also enables you to access EventHub endpoints and topics using KafkaAPI so that you can easily move away from Kafka to Eventhub just by changing configuration.

Pubusb is the data streaming engine provided by GCP. It is while working on this, I stumbled upon Apache Beam.

I have not worked on AWS Kinesis, so cannot give an opinion about it. But given by the high standards set by AWS and maturity of their solutions, I am sure its capabilities and flexibility will have the best of Kafka and Event Hub.

While we have talked about Messaging engines, if we take into account of runtimes and libraries supported, the complexity increases

Edge Characteristics

A key characteristic of Edge is the limited computing capabilities available. Even in an industrial environment, creating a cluster from edge devices is not recommended and is not possible. Apart from that, by the nature of the need and data, most of the IoT deployments use the MQTT or AMQP protocol for data communication and those brokers act as messaging engines.

How Data streaming differs between Edge and Cloud

By now, you might have a fair idea of how real time data processing is different between edge and cloud. Below are the key points

  1. In the cloud, data streaming should always handle data from multiple devices or clients at the same time. In the edge, the number of devices which acts as a source of data, which streams the data is very limited. So data streaming in cloud should be device or source aware, at the edge, it is not required to be source aware.
  2. Data streaming at edge needs to take care of late arrival of messages etc. By the nature of edge, those requirements are not required
  3. Since cloud needs to support late arrival and late reading it needs to have the capability to go back previous records and support data retention for a long time. Those requirements are not on the edge.

Given the significant difference between characteristics and nature of data streaming between edge and cloud, it is evident that the data processing and streaming ecosystem developed for cloud is not relevant for edge.

Characteristics that Edge based data streaming shoul support

Following are the characteristics that any data streaming engine targeting IoT edge devices should support

  1. Support for MQTT protocol
  2. Ability to perform windowing calculations.
  3. Easy to setup and run

Though GCP is the only popular cloud provider which propagates Beam, the following features and characteristics of Beam make it ideal for both cloud and edge.

Key Beam Features

  1. Logical pipeline based data processing. Easy to understand, easy to develop.
  2. Supports multiple data sources. It includes
  • Files (Regular CSV files , Hadoop based file systems)
  • Databases (mongo, Cassandra,BigQuery,BigTable,Spanner,hbase, JDBC connector for other Databases)
  • Kafka (Since Eventhub supports Kafka endpoints, we can include EVenthub. However, I have not personally tested accessing Eventhubs exposed as Kafka endpoints through Apache Beam)
  • Kinesis
  • Pubsub
  • MQTT
  • JMS
  • AMQP

3. Available in Python,Java and Go. Few features like MQTT are available only in Java.

4. Supports different runtimes ( Spark cluster, Dataflow, running on VM or machines directly, flink

Why You should choose Beam (both cloud and edge)

From the learning of cloud native development and after seeing the benefits it provides, any architect would choose a tool which enables them great flexibility and vendor neutrality. Below are the key benefits and value add you get with beam

  1. Same pipeline code for both edge and cloud. Only aspect that would change is the starting point from which engine the message arrives. Once it is separated from the data processing logic (which is possible), you can have the same pipeline for edge (MQTT) and cloud (Kafka). This means that even if you do not have any Edge strategy immediately, using Beam enables the migration to edge quicker.
  2. Consistent data handling logic for different sources, be it files or messaging engine.
  3. Clearly de-couples the runtime from coding APIs. Even if your data processing is only in the cloud, tomorrow if based on the size of the data if you need to choose a different runtime, you can easily adapt to it without worrying about library dependencies and code change
  4. Since Beam de-couples processing from underlying computing, you can easily migrate from one cloud to another easily.

Given the benefits and features, I firmly believe that Apache beam is a main pillar in next generation digital transformation involving IoT and Edge. Ofcourse, Beam too has few areas of improvements, but it is already a stable and reliable product and definitely needs to be adopted over other products which ends up in vendor lockin.

IoT Solution Architect