Networking — Where the disconnects reveal themselves in large organisations

Hariharan Anantharaman
4 min read · Apr 7, 2023

Continuation of https://medium.com/@hariharananantharaman/digital-adoption-a-non-stop-journey-a8201015ed2d

Digital adoption and digital transformation have picked up amongst organisations of all sizes as the benefits of cloud computing have become more visible and predictable. As organisations moved towards hybrid and multi-cloud architectures, disconnects between the organisation's vision, plans and the implementation of various initiatives became evident. In a large organisation, the following structures predominantly exist:

  1. An organisation-level cloud practice that takes strategic decisions.
  2. A Centre of Excellence that establishes guidelines, best practices, quality control, and compliance mechanisms.
  3. A horizontal IT infrastructure team that takes care of IT asset management and is responsible for security (e.g. firewalls, VPNs).
  4. An IT division (now called the Digital division to sound fashionable) responsible for project implementation.
  5. Run organisations responsible for maintenance, predominantly of legacy software.

What happens most frequently is that application architects designing multi-cloud solutions initially think about everything but networking. They are influenced by existing networking guidelines that work for traditional data centres or a single cloud. There have been instances where architects expected services running in AWS to invoke services in GCP via the public internet or through their data centre, purely because they lacked networking knowledge and an appreciation of its impact. Traditionally, many architects start out as developers, and not everyone gets the opportunity to understand or troubleshoot network problems.

The enterprise cloud and network teams do not know the application's goals or how its services function. They impose legacy standards on modern applications without understanding the implications. This results in the following issues:

  1. Under-provisioning of network bandwidth and connectivity services.
  2. Network and security rules derived as an afterthought, through a reactive strategy rather than a proactive approach.

Generally, the first applications to adopt a specific cloud or integration methodology are victims of this reactive approach. What makes networking problems interesting is that there are multiple solutions to a single problem.

This is especially true in Kubernetes-based applications. In one application, where workloads hosted in AWS EKS clusters needed to invoke internal services deployed in GCP, the enterprise architects envisaged the traffic going via the public internet, with the services discovered using a public domain name. This was different from what the application teams had been told, so they set the proxy while invoking the service. In the cloud, all outgoing internet traffic goes via a proxy; even if an internal service is reachable under a public domain name, calls to it have to go through the proxy. Before getting into the solution we chose, it is worth looking at the different options available.

Even for something as simple as using a proxy, there are multiple options:

  1. Set the proxy explicitly while making a service call, in the HTTP client options (see the sketch after this list).
  2. Set the proxy at the process level. For example, in Java applications the proxy can be set for the whole process by passing it as JVM arguments.
  3. Set the proxy at the pod level.
  4. Set the proxy at the worker node level.
  5. Set the proxy at the service mesh level (e.g. Istio).
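As a concrete illustration of the first two options, here is a minimal Java sketch; the proxy host, port, and service URL are hypothetical placeholders, not values from our setup:

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProxyOptionsSketch {
    public static void main(String[] args) throws Exception {
        // Option 1: proxy set only on this client, so only calls made
        // through it are affected. Host and port are placeholders.
        HttpClient perClientProxy = HttpClient.newBuilder()
                .proxy(ProxySelector.of(
                        new InetSocketAddress("egress-proxy.example.com", 8080)))
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://internal-service.gcp.example.com/health"))
                .build();
        HttpResponse<String> response =
                perClientProxy.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());

        // Option 2: process-level proxy, equivalent to starting the JVM with
        //   -Dhttps.proxyHost=egress-proxy.example.com -Dhttps.proxyPort=8080
        // Every client that relies on the default ProxySelector is affected.
        System.setProperty("https.proxyHost", "egress-proxy.example.com");
        System.setProperty("https.proxyPort", "8080");
        // Internal destinations that must bypass the proxy (Java's analogue
        // of NO_PROXY; this one property covers both http and https URLs):
        System.setProperty("http.nonProxyHosts", "localhost|*.internal.example.com");
    }
}
```

It is worth noting that pod-level proxies are usually injected as HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables, and the JVM does not read those variables by default; that is one common reason pod-level settings appear to be ignored by Java workloads.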

The impact of a change depends on where it sits in the hierarchy above. For example, setting the proxy only at the individual service-call level affects only that specific call, while setting it at the service mesh level affects all outgoing traffic that passes through the mesh.

When our applications deployed in AWS could not connect to GCP services because the traffic did not go via the proxy, we evaluated all the options mentioned above.

  1. We set it at the process level, but it affected other inter-service communications. Even though we listed many internal domains and IP ranges in NO_PROXY, some services still returned errors. Since these calls came from product code over which we had little control, we hit a dead end.
  2. Though the pod and worker node were configured to use the proxy, our requests did not pick them up.
  3. We were not ready to configure it at the Istio (service mesh) level, as we were afraid it might affect all inbound and external invocations. Since we were already calling other internal services without the proxy, we did not want to experiment.
  4. As a plan B, we also changed the code to use the proxy and kept it ready (a sketch of this approach follows the list).
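For reference, here is a minimal sketch of the kind of plan-B code change we kept ready. It is one possible approach, not our exact implementation: a custom ProxySelector that routes only the GCP-bound calls through the egress proxy, so other inter-service traffic stays direct. The proxy address and domain suffix are hypothetical:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.SocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.util.List;

// Routes only traffic for a given domain suffix via the egress proxy;
// everything else connects directly.
public class ScopedProxySelector extends ProxySelector {

    private static final Proxy EGRESS_PROXY = new Proxy(
            Proxy.Type.HTTP,
            new InetSocketAddress("egress-proxy.example.com", 8080)); // placeholder

    @Override
    public List<Proxy> select(URI uri) {
        String host = uri.getHost();
        // Hypothetical suffix for the services hosted in GCP.
        if (host != null && host.endsWith(".gcp.example.com")) {
            return List.of(EGRESS_PROXY);
        }
        return List.of(Proxy.NO_PROXY);
    }

    @Override
    public void connectFailed(URI uri, SocketAddress sa, IOException ioe) {
        System.err.println("Proxy connection failed for " + uri + ": " + ioe.getMessage());
    }

    public static void main(String[] args) {
        // All calls through this client use the scoped selector: internal
        // calls stay direct, only GCP-bound calls are proxied.
        HttpClient client = HttpClient.newBuilder()
                .proxy(new ScopedProxySelector())
                .build();
    }
}
```

Scoping the proxy in code like this keeps the blast radius limited to the calls that actually need it, which the process-level setting could not guarantee.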

In the end, we convinced our customers that GCP should act as a logical extension of AWS and that a proxy was not required. A more faint-hearted team would have adopted the easy fix of changing the code. Still, that would not have unearthed the bigger problem: a disconnect between the enterprise architecture, application architecture and networking teams.
