Part 1: Apache Beam: Key Pillar in IoT-Driven Digital Transformation

  • Ability to make decisions quickly
  • Reducing the noise in the data and streaming only quality data to the cloud
  • Optimizing cloud cost
  1. A consistent approach to data processing in the cloud and at the edge. Though the edge is gaining traction, cloud-based data processing is here to stay.
  2. A single library or language that meets the various runtime requirements. If I change my cloud provider tomorrow, I should not have to rewrite the code completely.
  3. A clear demarcation between the data transport infrastructure, the compute infrastructure, and the processing library, so that the library is independent of any specific product (be it the data ingestion platform or the runtime infrastructure) and can run both at low-powered edge locations and on high-powered cloud runtimes (see the sketch after this list).
  4. A library that is easy to learn and implement and is available in different languages.
  5. A consistent design and implementation approach for different data sources (files, databases, data streams, etc.).
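To make point 3 concrete, here is a minimal, hedged sketch (Java SDK) of how the runner stays decoupled from the pipeline code: the compute engine is chosen from a command-line flag, while the processing steps never mention it. The file paths, class name, and step names are illustrative, not from any particular project.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class RunnerAgnosticPipeline {
  public static void main(String[] args) {
    // The runner (DirectRunner, FlinkRunner, DataflowRunner, ...) comes from the
    // command line, e.g. --runner=DirectRunner, so the code below does not change
    // when the compute infrastructure changes.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadReadings", TextIO.read().from("input/*.csv"))          // illustrative path
     .apply("Normalize", MapElements.into(TypeDescriptors.strings())
                                    .via((String line) -> line.trim().toLowerCase()))
     .apply("WriteResults", TextIO.write().to("output/readings"));      // illustrative path

    p.run().waitUntilFinish();
  }
}
```

Launching the same program with --runner=DirectRunner, --runner=FlinkRunner, or --runner=DataflowRunner (with the matching runner dependency on the classpath) is all it takes to move between local, Flink, and Dataflow execution.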
  1. In the cloud, data streaming must handle data from many devices or clients at the same time. At the edge, the number of devices that act as data sources is very limited. So streaming in the cloud needs to be device- or source-aware, while at the edge it does not.
  2. Data streaming in the cloud needs to take care of the late arrival of messages. By its nature, the edge does not have this requirement.
  3. Since the cloud needs to support late arrival and late reads, it must be able to go back to previous records and retain data for a long time. Those requirements do not exist at the edge. A sketch of late-data handling follows this list.
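To illustrate the cloud-side requirement, here is a hedged sketch (Java SDK) of a windowing configuration that tolerates late data. The five-minute window, the one-hour lateness bound, and the `readings` collection are assumptions made for the example.

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class LateDataExample {
  // 'readings' is an unbounded PCollection keyed by device id; how it is
  // produced (Kafka, Pub/Sub, ...) is out of scope for this sketch.
  static PCollection<KV<String, Long>> countWithLateData(PCollection<KV<String, Double>> readings) {
    return readings
        .apply("WindowWithLateness",
            Window.<KV<String, Double>>into(FixedWindows.of(Duration.standardMinutes(5)))
                // Emit a result at the watermark, then re-emit for every late
                // element that arrives within an hour of the window closing.
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.standardHours(1))
                .accumulatingFiredPanes())
        .apply("CountPerDevice", Count.<String, Double>perKey());
  }
}
```

At the edge, the same pipeline could simply keep the default trigger with zero allowed lateness, since late arrivals are not a concern there.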
  1. Support for the MQTT protocol
  2. Ability to perform windowing calculations
  3. Easy to set up and run (a sketch covering these three points follows this list)
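A hedged sketch of those three points using the Java SDK's MqttIO connector: read raw payloads from a local broker, cut them into one-minute windows, and count the messages per window. The broker URI, topic, and class name are placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.mqtt.MqttIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class EdgeMqttPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromBroker",
            MqttIO.read().withConnectionConfiguration(
                // Placeholder URI and topic for a local edge broker.
                MqttIO.ConnectionConfiguration.create("tcp://localhost:1883", "sensors/temperature")))
     .apply("OneMinuteWindows",
            Window.<byte[]>into(FixedWindows.of(Duration.standardMinutes(1))))
     .apply("CountPerWindow",
            // withoutDefaults() is required for a global combine over non-global windows.
            Combine.globally(Count.<byte[]>combineFn()).withoutDefaults());

    p.run().waitUntilFinish();
  }
}
```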
  1. Logical, pipeline-based data processing that is easy to understand and easy to develop.
  2. Support for multiple data sources (a Kafka read is sketched after this list), including:
  • Files (regular CSV files, Hadoop-based file systems)
  • Databases (MongoDB, Cassandra, BigQuery, Bigtable, Spanner, HBase, and a JDBC connector for other databases)
  • Kafka (since Azure Event Hubs supports Kafka endpoints, we can include Event Hubs here; however, I have not personally tested accessing Event Hubs exposed as Kafka endpoints through Apache Beam)
  • Kinesis
  • Pub/Sub
  • MQTT
  • JMS
  • AMQP
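As an illustration of the "multiple data sources" point, reading from Kafka in the Java SDK yields an ordinary PCollection, so everything downstream is written exactly as it would be for a file, database, or Pub/Sub source. The broker address, topic, and the `pipeline` variable are assumptions for the sketch.

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

// Reading from Kafka produces an ordinary PCollection<KV<String, String>>;
// downstream transforms do not care where the data came from.
PCollection<KV<String, String>> messages =
    pipeline.apply("ReadFromKafka",
        KafkaIO.<String, String>read()
            .withBootstrapServers("broker-1:9092")        // placeholder broker
            .withTopic("device-telemetry")                // placeholder topic
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());  // drop Kafka metadata, keep just key/value pairs
```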
  1. The same pipeline code works for both edge and cloud. The only aspect that changes is the starting point, i.e. the engine from which the messages arrive. Once that is separated from the data processing logic (which is possible), you can have the same pipeline for the edge (MQTT) and the cloud (Kafka), as shown in the sketch after this list. This means that even if you do not have an edge strategy right now, using Beam makes a later migration to the edge quicker.
  2. Consistent data handling logic for different sources, be it files or a messaging engine.
  3. Beam clearly decouples the runtime from the coding APIs. Even if your data processing is only in the cloud, and the size of the data later forces you to choose a different runtime, you can adapt easily without worrying about library dependencies and code changes.
  4. Since Beam decouples processing from the underlying compute, you can migrate from one cloud to another easily.
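A hedged sketch of point 1: wrap the processing logic in its own PTransform so that the edge (MQTT) and cloud (Kafka) variants differ only in the source they read from. The payload format ("deviceId,value") and the class name are hypothetical.

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Hypothetical shared processing logic: parse "deviceId,value" payloads and
// compute a mean per device. The same transform is applied regardless of
// whether the input came from MqttIO (edge) or KafkaIO (cloud).
class AverageByDevice extends PTransform<PCollection<String>, PCollection<KV<String, Double>>> {
  @Override
  public PCollection<KV<String, Double>> expand(PCollection<String> payloads) {
    return payloads
        .apply("ParsePayload", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.doubles()))
            .via((String line) -> {
              String[] parts = line.split(",");
              return KV.of(parts[0], Double.parseDouble(parts[1]));
            }))
        .apply("MeanPerDevice", Mean.<String, Double>perKey());
  }
}

// Edge: MQTT payloads decoded to String; cloud: Kafka record values. Both end
// up as a PCollection<String>, so both simply do:
//   input.apply("Process", new AverageByDevice());
```

Because the processing lives in one PTransform, swapping MqttIO for KafkaIO (or vice versa) touches only the read step, which is exactly the decoupling the list above describes.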
