Part 2: Guidelines for Moving Real-Time Data Processing from the Cloud to the Edge

Identifying Dependencies

This might sound like a very generic item, but it is an important one. The dependencies here are not just compile-time dependencies but also runtime dependencies: a side input that your real-time analytics job requires, static data read from lookup tables, data held in a cache, external services being invoked (e.g. an AI service), how the output is delivered, and so on. Let me elaborate a few scenarios and the options for each of them.

Data from Lookup Tables

What is the size of the lookup table? How frequently does the table get updated? When the data processing moves to the edge, how will this table be accessed? If the cloud instance is still used, what would be the latency and overhead?

  • What network changes are required to make it accessible from multiple discrete networks (in the worst case, assuming each edge location cannot be networked into the enterprise network)?
  • Will the lookup table in the cloud still remain relevant, or will a copy of the table be moved to the edge locations? If each location gets a copy, how will subsequent updates to the lookup table propagate? Will each edge location hold only the subset of data relevant to it, or the entire table?
  • What are the database type and version? Is the edge infrastructure equipped with, or compatible with, what is needed to host them?
  • If the edge locations get a copy, each edge instance will be running its own database. Will that increase licensing cost (where applicable)?
  • What is the database security and upgrade strategy? Will every edge instance share the same credentials? That can pose a security risk: if one instance's credentials are exposed, an attacker can potentially access every edge database instance. What are the implications of that? If each instance has separate credentials, how will processes discover the right username and password at runtime?
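One common answer to these questions is to replicate only each location's subset of the lookup table into a small local store at the edge. A minimal sketch of that idea, assuming an illustrative table of per-device thresholds keyed by location (all names here are hypothetical, not from any specific system):

```python
import sqlite3

def build_local_lookup(rows, edge_location):
    """Build a local SQLite cache holding only this edge's subset.

    rows: iterable of (device_id, location, threshold) tuples, as they
    might be exported from the cloud lookup table.
    """
    conn = sqlite3.connect(":memory:")  # a file path would be used in practice
    conn.execute(
        "CREATE TABLE lookup (device_id TEXT PRIMARY KEY, threshold REAL)"
    )
    # Keep only the rows owned by this edge location.
    subset = [(d, t) for (d, loc, t) in rows if loc == edge_location]
    conn.executemany("INSERT INTO lookup VALUES (?, ?)", subset)
    conn.commit()
    return conn

def get_threshold(conn, device_id):
    """Local lookup: no network round trip to the cloud per event."""
    row = conn.execute(
        "SELECT threshold FROM lookup WHERE device_id = ?", (device_id,)
    ).fetchone()
    return row[0] if row else None
```

The same sketch makes the update question concrete: refreshing the edge copy means re-running `build_local_lookup` (or an incremental variant) whenever the cloud table changes.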

Side Inputs

In many cases side inputs are similar to database or static data lookups; in a few scenarios, they could be files or data from another stream. When you move your pipelines to the edge, evaluate whether the side inputs need any change. Moving to the edge gives the team an opportunity to make solutions and data pipelines less complicated: an edge instance processes data only from a limited subset of the whole fleet, so the conditions and rules can be specific to that edge.
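To make the simplification concrete, here is a hypothetical sketch: in the cloud, the pipeline carries a fleet-wide rules structure as a side input and resolves the right entry per record, while the edge version can take just its own location's rules (the rule names and record fields are illustrative assumptions):

```python
# Fleet-wide side input, as a cloud pipeline might carry it.
FLEET_RULES = {"plant-a": {"max_temp": 80}, "plant-b": {"max_temp": 95}}

def cloud_style_check(record, fleet_rules):
    # Cloud pipeline: must resolve the rules per record's origin.
    rules = fleet_rules[record["location"]]
    return record["temp"] <= rules["max_temp"]

def edge_style_check(record, local_rules):
    # Edge pipeline: the side input is already specific to this location,
    # so the per-record resolution step disappears.
    return record["temp"] <= local_rules["max_temp"]
```

Both produce the same decision for this edge's records; the edge version simply carries less state and fewer branches.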

External Services

Generally, real-time data analytics or data pipelines are not dependent on external services. However, many pipelines invoke AI services to get real-time decisions; integrating AI into data pipelines or real-time data analytics has provided immense business benefits, and many of these AI functionalities are exposed as API calls. When we move to the edge, one question that needs to be answered is how the AI services will be leveraged. The following options exist:

  • Move the AI service to the edge. While model generation can still happen in the cloud, real-time inference can happen at the edge. While this sounds easy, there can be challenges if the hardware architecture at the edge differs from the architecture where models are currently served; this is especially true if we are moving to a micro edge or to gateways. Testing different models at the same time (i.e. A/B testing) on each edge instance can also be a challenge, and there should be a process to measure the accuracy of the results produced by the AI at each edge.
  • Invoke the API running in a remote location over the network. While this adds latency, it can work well in scenarios where the edge and the infrastructure where the AI runs are in the same virtual network or enterprise network. If the call is made over the public internet, the security of the data in transit, API security, and firewall rules at the cloud all need to be checked to ensure it is feasible to invoke it.
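For the second option, the latency and connectivity concerns usually translate into a strict timeout plus a conservative local fallback. A minimal sketch, assuming a hypothetical JSON inference endpoint (the URL, payload shape, and fallback rule are illustrative, not any specific vendor's API):

```python
import json
import urllib.request

def remote_inference(features, url, timeout_s=0.5):
    """Call a remote AI API; fall back to a local rule on any failure.

    Returns (label, source) where source records whether the decision
    came from the remote model or the local fallback.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return json.load(resp)["label"], "remote"
    except Exception:
        # Hypothetical fallback rule so decisions keep flowing when the
        # link to the cloud is slow or down.
        return ("anomaly" if features.get("temp", 0) > 90 else "normal"), "local"
```

Logging how often the `"local"` path is taken also gives a cheap signal about link quality from each edge.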

Data & Result communication

One of the main reasons to move to the edge is to enable quicker decision making and action without lag. For example, in a factory, edge analytics can enable faster decisions for the process, and a few seconds saved can have many direct and indirect benefits. Even so, data typically still needs to reach the cloud for purposes such as:

  1. Scheduled and Custom Reports
  2. Dashboard and business analytics
  3. AI model re-training
  4. Feed to downstream or external system
  5. Backup for internal and external compliance and future use
A few strategies for sending edge data back to the cloud:

  1. Do not send all the telemetry data if the noise in the data is very high; instead, send the max, average, etc. for different time windows.
  2. Send data to a backend cloud IoT gateway regularly. From there, move it to a data lake or database using cloud functions or event streaming, without any further processing.
  3. Save the data locally on the edge and send it to the cloud as a nightly file-transfer job (e.g. to Azure Blob Storage, S3, or a Cloud Storage bucket). This is helpful when network connectivity is intermittent.
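The first strategy above can be sketched in a few lines: summarise each fixed time window locally and forward only the aggregate. The field names and window size are illustrative assumptions:

```python
from statistics import mean

def summarise_window(readings, window_s=60):
    """readings: list of (timestamp_s, value) tuples.

    Returns one summary record per time window instead of every raw
    reading, which is what gets sent to the cloud.
    """
    windows = {}
    for ts, value in readings:
        windows.setdefault(ts // window_s, []).append(value)
    return [
        {"window_start": w * window_s,
         "avg": round(mean(vals), 2),
         "max": max(vals),
         "count": len(vals)}
        for w, vals in sorted(windows.items())
    ]
```

With a 60-second window, three noisy readings per second become one record per minute, cutting uplink traffic by two orders of magnitude while preserving the signal most dashboards need.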

Changes to Application

As you move the code towards the edge, the core data pipeline code and the assumptions it makes need a relook. Because the cloud data pipeline works on data from many different devices, it might group by a key (e.g. deviceId) to collect records from the same origin. When the logic moves to the edge, that group-by is no longer required, since the code processes data from only one source. Removing such complexities makes the code easier to read and maintain, and it performs better as well.
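A hypothetical before/after illustrates the point (the event shape and the averaging logic are illustrative assumptions):

```python
def cloud_pipeline(events):
    # Cloud version: events arrive interleaved from many devices,
    # so the pipeline must group by deviceId before aggregating.
    by_device = {}
    for e in events:
        by_device.setdefault(e["deviceId"], []).append(e["value"])
    return {d: sum(vs) / len(vs) for d, vs in by_device.items()}

def edge_pipeline(events):
    # Edge version: the stream comes from a single known source,
    # so the keying and grouping step disappears entirely.
    values = [e["value"] for e in events]
    return sum(values) / len(values)
```

For a single device's stream both produce the same aggregate, but the edge version drops the keyed state, which also removes shuffle and state-management overhead in real streaming frameworks.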


