Part 2: Guidelines for Moving Real-Time Data Processing from Cloud to Edge
Continuing from Part 1, in this article I will highlight the strategy, the steps to follow, the things to consider and some best practices to adopt before moving real-time data analytics workloads from the cloud to the edge. It is intended as a guideline for drafting the architecture and process while an enterprise moves towards the edge. I have captured a common set of challenges and considerations.
While the use case I worked on was specific to IoT, processing data close to where it is generated, the principles can be applied to other types of edge as well (e.g. a geo-location edge for smart city applications, or an edge at a factory site processing data from machines and shop floors). As you do the migration, you will see that the entire tech landscape, architecture and networking can change. So any decision should consider the long-term impact (needless to say) and realistic timelines.
When moving to the edge, especially an edge at city level or at data-center level, IT teams will be tempted to treat lift and shift as a quick approach: the same code (for example, the same stream processing job written against the Kafka or Spark APIs) runs in multiple discrete Kafka clusters, one cluster per edge. While it might sound like an easy win, it can end up adding overhead and performance issues. Asking the difficult questions and adopting a better strategy early will get you the long-term benefits sooner than doing a lift and shift first and a re-architecture later.
Analyze the Dependencies
This might sound like a very generic item, but it is pertinent. The dependencies here are not just compile-time dependencies but runtime ones: a side input that your real-time analytics job requires, static data accessed from lookup tables, data referenced from a cache, external services being invoked (e.g. an AI service), how the output is delivered, and so on. Let me elaborate a few scenarios and the options for each of them.
Data from Lookup Tables
- What is the size of the lookup table? How frequently is the table going to get updated?
- When the data processing moves to the edge, how will this table be accessed? If the cloud instance is still accessed, what would be the latency and overhead?
- What network changes are required to make it accessible from multiple discrete networks (in the worst case, assuming each edge location cannot be networked into the enterprise network)?
- Will the lookup table in the cloud still remain relevant, or will a copy of the table move to the edge locations? If each location gets a copy, how will subsequent updates to the lookup table propagate? Will each edge hold only the subset of data relevant to it, or the entire table?
- What are the DB type and version? Is the edge infrastructure equipped, or does it have compatible infrastructure, to host them?
- If the edge location(s) get a copy, each edge instance will be running the database. Will that increase licensing cost (if applicable)?
- What is the DB security and upgrade strategy? Will each edge instance share the same credentials (a possible security risk: if one instance's credentials are exposed, an attacker can potentially access every edge database instance)? What are the implications of that? If each instance gets separate credentials, how will processes discover the specific username and password at runtime?
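One common answer to the "copy at the edge" questions above is a local replica that holds only the rows relevant to that edge and refreshes on an interval. A minimal sketch, assuming a hypothetical `fetch_subset` callable that pulls this edge's slice from the cloud database:

```python
import time

class EdgeLookupCache:
    """Local replica of a cloud lookup table, refreshed periodically.

    `fetch_subset` is a hypothetical callable that pulls only the rows
    relevant to this edge location from the cloud database.
    """

    def __init__(self, fetch_subset, refresh_seconds=300):
        self._fetch = fetch_subset
        self._ttl = refresh_seconds
        self._table = {}
        self._loaded_at = 0.0  # forces a fetch on first access

    def get(self, key, default=None):
        if time.monotonic() - self._loaded_at > self._ttl:
            self._table = self._fetch()  # pull a fresh subset from the cloud
            self._loaded_at = time.monotonic()
        return self._table.get(key, default)


# Usage: the pipeline looks up device metadata locally instead of making
# a per-event round trip to the cloud database.
cache = EdgeLookupCache(lambda: {"device-1": {"site": "plant-A"}}, refresh_seconds=60)
print(cache.get("device-1"))  # {'site': 'plant-A'}
```

The refresh interval trades staleness against load on the cloud instance; how far you can push it depends on how often the table actually changes.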
Side Inputs
In many cases side inputs are similar to database or static-data lookups. In a few scenarios, they could be files or data from another stream. When you move your pipelines to the edge, evaluate whether the side inputs need any change. Moving to the edge gives the team an opportunity to make solutions and data pipelines less complicated: each edge processes data from only a limited subset of the whole fleet, so the conditions and rules can be specific to that edge.
If needed, edge deployments can be categorized into types, each type with its own data pipeline or data analysis logic. The latest edge platforms help the enterprise manage such deployments better.
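One way to realize per-type pipeline logic is a small registry keyed by edge deployment type; each instance reads its type from local configuration and picks up only its own rules. The type names and rule fields below are illustrative assumptions, not a prescribed schema:

```python
# Hypothetical registry mapping edge deployment types to pipeline rules.
# Each edge instance reads its type from local config and picks the
# matching rules, instead of one pipeline carrying every site's branches.
RULES_BY_EDGE_TYPE = {
    "factory": {"window_seconds": 10, "alert_threshold": 80.0},
    "smart-city": {"window_seconds": 60, "alert_threshold": 95.0},
}

def rules_for(edge_type: str) -> dict:
    try:
        return RULES_BY_EDGE_TYPE[edge_type]
    except KeyError:
        raise ValueError(f"No pipeline rules registered for edge type {edge_type!r}")

print(rules_for("factory")["alert_threshold"])  # 80.0
```

Failing loudly on an unknown type keeps a misconfigured edge from silently running the wrong logic.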
AI and External Services
Generally, real-time data analytics or data pipelines do not depend on external services. However, many pipelines invoke AI services to make real-time decisions, and integrating AI into data pipelines or real-time analytics has provided immense business benefits. Many of these AI capabilities are exposed as API calls, so when we move to the edge, one question that needs answering is: how will the AI services be leveraged? A few options exist:
- Move the AI service to the edge. While model training could still happen in the cloud, real-time inference can happen at the edge. While this sounds easy, there can be challenges if the hardware architecture at the edge differs from the hardware where models are currently served; this is especially true when moving to a micro edge or to gateways. Testing different models at the edge at the same time (i.e. A/B testing) on each edge instance could also be a challenge, and there should be a process to measure the accuracy of the AI results at each edge.
- Invoke the API running in a remote location over the network. While this adds latency, it can work in scenarios where the edge and the infrastructure running the AI are in the same virtual or enterprise network. If the call is made over the public internet, the security of the data in transit, API security, firewall rules at the cloud, etc. all need to be checked to ensure the invocation is feasible.
Porting AI and external services to the edge becomes a challenge if the service is managed by a different team or if your team is charged per API call. Any decision in this scenario should be taken after discussing it with the API provider.
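The two options above can also be combined: prefer a locally deployed model when one exists, and fall back to the remote API otherwise. A minimal sketch, where `local_model` and `cloud_predict` are hypothetical stand-ins for your actual model object and API client:

```python
def infer(features, local_model=None, cloud_predict=None):
    """Run inference at the edge when a local model is present, otherwise
    fall back to the remote AI API."""
    if local_model is not None:
        return {"source": "edge", "score": local_model(features)}
    if cloud_predict is not None:
        # Remote call: latency, API security and firewall rules apply here.
        return {"source": "cloud", "score": cloud_predict(features)}
    raise RuntimeError("No inference path available")


# Usage: an edge with a deployed model keeps inference on site.
result = infer([0.2, 0.9], local_model=lambda f: max(f))
print(result)  # {'source': 'edge', 'score': 0.9}
```

Tagging each result with its source also gives you a hook for the accuracy measurement mentioned above: edge and cloud answers can be compared offline per location.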
Data & Result Communication
One of the main reasons to move to the edge is to enable quicker decision making and action without lag. For example, in a factory, edge analytics can enable faster decisions about the process, and a few seconds saved can have a lot of direct and indirect benefits.
While processing in the cloud, the output could be as simple as sending a notification via different channels. At the edge, the benefits and actions can change. A few options may even be removed: why send an email when the supervisor is alerted immediately, or notify the command centre over email when the edge itself can turn off the power or water supply when there is an issue? Any new notification might be equivalent to developing new functionality, and there can be challenges in continuing the existing functionality via edge invocation. What changes are required, and does making those changes provide significant value? These are the questions to ask.
Another consideration is how quickly the real-time data needs to be pushed to the cloud. Though decision making happens at the edge, cloud systems may still need the data for various purposes, a few of them being:
- Scheduled and Custom Reports
- Dashboard and business analytics
- AI model re-training
- Feed to downstream or external system
- Backup for internal and external compliance and future use
As the processing moves towards the edge, the data communication strategy needs to be re-evaluated. For example, before the edge, your IoT devices could be sending data every minute, with the cloud doing aggregation and subsequent decision making, and the raw real-time data saved as-is in the data lake. With the transition to the edge, aggregation and decision making happen at the edge. While the results of decision making should be communicated to the cloud, how should the telemetry data be sent? A few options are:
- Do not send all the telemetry data if the noise in the data is very high; instead, send aggregates such as the max or average over different time windows.
- Send data to a backend cloud IoT gateway regularly, and from there move it to a data lake or database using cloud functions or event streaming, without further processing.
- Save the data locally at the edge and send it to the cloud as a nightly file transfer (e.g. to Azure Blob Storage, S3 or a Cloud Storage bucket). This helps when network connectivity is intermittent.
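The first option amounts to collapsing each time window of raw samples into the few aggregates the cloud actually needs. A minimal sketch (the record shape is an assumption for illustration):

```python
from statistics import mean

def summarize_window(readings):
    """Collapse one time window of raw telemetry into aggregates,
    instead of shipping every noisy sample to the cloud."""
    values = [r["value"] for r in readings]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": mean(values),
    }

# One window of four raw temperature samples becomes a single record.
window = [{"value": v} for v in (21.0, 23.5, 22.0, 40.0)]
print(summarize_window(window))
# {'count': 4, 'min': 21.0, 'max': 40.0, 'avg': 26.625}
```

Keeping the count alongside min/max/avg lets the cloud side detect dropped samples and still re-weight averages across windows.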
Changes to Application
As you move the code towards the edge, the core data pipeline code and the assumptions it makes need a relook. Because the cloud pipeline works on data from many devices, it might group by a key (e.g. deviceId) to collect data from the same origin. When the logic moves to the edge, this group-by may no longer be required, since the code is processing data from only one source. Removing such complexities makes the code easier to read and maintain, and faster as well.
Similarly, the local data model the application stores can also drop identifiers like deviceId. When sending data to the cloud, the extra metadata can be added back. Such tweaks help get the best out of limited edge storage and memory, especially in IoT solutions.
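That is, identifiers are attached only at the boundary where a record leaves the edge, while local records stay lean. A minimal sketch, with the metadata fields being illustrative assumptions:

```python
import json

# Set once per edge instance, e.g. from local configuration at startup.
EDGE_METADATA = {"deviceId": "pump-17", "site": "plant-A"}

def to_cloud_payload(local_record: dict) -> str:
    """Local records carry no per-row identifiers; the edge's metadata
    is merged in only when the record is sent to the cloud."""
    return json.dumps({**local_record, **EDGE_METADATA})

print(to_cloud_payload({"ts": 1700000000, "avg_temp": 26.6}))
```

On the cloud side, the record then looks exactly like one produced by the old multi-device pipeline, so downstream consumers need no change.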
Similarly, running at the edge, a pipeline or application can leverage caching or in-memory data access that was not possible in the cloud due to the volume of data processed there.
In the cloud, the data could be streamed through Kafka or equivalent solutions and processed by Spark or similar data processing programs. When you move the logic to the edge, however, it might be over-engineering and cumbersome to set up Kafka clusters; this is especially relevant to IoT use cases. In these cases, what options do you have for data streaming? Will the existing programs, written in the language of your choice with libraries developed for the cloud, work at the edge? These are critical questions to ask. While Apache Beam is a good fit, especially for IoT use cases, decisions should be made based on context. This article explains more about Beam and why it is a good fit for IoT use cases.
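To make the over-engineering point concrete: on a single-box edge, the produce/consume pattern a Kafka cluster provides can often be covered by an in-process queue and a worker thread. A minimal sketch using only the standard library (the doubling step stands in for real processing):

```python
import queue
import threading

events = queue.Queue()
results = []

def consumer():
    # Drain the queue until the shutdown sentinel arrives.
    while True:
        event = events.get()
        if event is None:  # sentinel: stop the worker
            break
        results.append(event["value"] * 2)  # stand-in processing step

worker = threading.Thread(target=consumer)
worker.start()
for v in (1, 2, 3):
    events.put({"value": v})  # "produce" side of the pipeline
events.put(None)
worker.join()
print(results)  # [2, 4, 6]
```

This gives you ordering and backpressure within one process; what it does not give you is durability or multi-consumer fan-out, which is exactly the trade-off to weigh before carrying Kafka to every edge.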
Impact on Dependent Applications
In the previous sections, we saw the applications that the data pipeline or stream analytics depends on. In this section we will look at a few impacts on the processes or applications that depend on the data pipeline.
For example, think of a real-time dashboard that streams process and equipment data, along with associated features like historical trend identification. If the data pipeline moves to the edge and the frequency of data sent from the edge decreases, what will the impact be on application users and other downstream systems?
The dashboard application needs to be re-imagined with the stakeholders to make it more relevant and useful. For example, each factory site might run an instance of a simple web application hosting the dashboard. When a user selects a factory, requests can be routed to the instance running in that factory to fetch and display its real-time data; this can be achieved with simple networking and forwarding logic. The edge can transmit just the derived parameters to the cloud in real time, which can be used to indicate the health of the overall fleet (e.g. how many of my plants face inventory shortage in the next 3 days).
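The forwarding logic can be as simple as a routing table from factory to the endpoint of its on-site dashboard instance. The factory names and addresses below are hypothetical:

```python
# Hypothetical routing table: the cloud-facing app forwards a dashboard
# request to the web instance running inside the selected factory.
FACTORY_ENDPOINTS = {
    "plant-A": "http://10.1.0.5:8080",
    "plant-B": "http://10.2.0.5:8080",
}

def dashboard_url(factory: str, path: str = "/realtime") -> str:
    base = FACTORY_ENDPOINTS.get(factory)
    if base is None:
        raise KeyError(f"Unknown factory {factory!r}")
    return base + path

print(dashboard_url("plant-A"))  # http://10.1.0.5:8080/realtime
```

In practice this table would live behind a reverse proxy or gateway rather than in application code, but the lookup itself is all the "forwarding logic" amounts to.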
In subsequent posts, I will explain specific use cases I have worked on and how we migrated them to the edge. In a later post, I will walk through an edge migration for a fictitious smart city initiative.