Alan Hannan is a member of the Netskope Network Visionaries advisory group.
The cloud often seems like a black box to many corporate networking and security professionals. They have expertise in optimizing their internal networks, but once they offload traffic to the cloud, they assume they are handing off optimization to the software-as-a-service (SaaS) provider. A company that chooses a business-critical cloud solution without considering the underlying architecture, however, is setting itself up for potential disappointment down the road, or worse, exposing itself to greater risk of threats, lost or compromised data, and degraded employee productivity.
Offering an excellent customer experience in the cloud requires a multi-faceted and complex balancing act. SaaS providers must optimize many parameters, often weighing cost against various approaches to redundancy, scalability against resiliency, and security against performance. These are not simple equations where a change in one area has entirely predictable ramifications in the other. Vendors are constantly making decisions around tradeoffs to maximize customer satisfaction.
For companies using these vendors’ solutions—especially SaaS security solutions such as a secure access service edge (SASE) framework and security service edge (SSE) capabilities—understanding those tradeoffs is crucial to making the best decisions.
Keys to optimizing cloud traffic, ensuring performance across the internet
Optimizing internet traffic and computing resources has been a critical theme of my career. In the mid-90s, when the Internet was nascent and cloud computing wasn’t yet a thing, I ran peering at UUNET, one of the first and largest commercial internet service providers (ISPs). It was a lucky opportunity to learn a great deal about the inner workings of the early Internet. Much of this work was about creating the most efficient routes: not just getting traffic from point A to point B, but doing it in the fastest, most cost-effective way.
With this knowledge, I moved on to management positions focused more exclusively on traffic engineering. As a vice president in operations engineering at various well-known Internet companies, including Global Crossing and Internap, I was responsible for deconstructing internet traffic and determining how best to move packets from one location to another; in other words, optimizing traffic across the Internet and for the new cloud paradigm. In subsequent positions at Alcatel-Lucent, Aruba, and CrowdStrike, I operated distributed systems built on cloud platforms and networks, and focused on optimizing how we used them.
There are many facets to optimizing the performance of internet traffic. Two of the most important are (1) reducing the distance that data needs to travel, along with the hops that add latency and degrade performance, and (2) ensuring that all key networking components are provisioned appropriately. I joined the Netskope Network Visionaries advisory group because I’m impressed with how Netskope handles both of these crucial considerations in NewEdge.
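To put the first point in perspective, here is a back-of-the-envelope sketch (my own illustration, not anything from Netskope) of why distance alone matters: a signal in optical fiber propagates at roughly 200,000 km/s, so every additional 1,000 km of path adds about 5 ms of one-way delay before any per-hop queuing or processing is even counted.

```python
# Rough, illustrative estimate of propagation latency vs. path length.
# Assumes ~200,000 km/s signal speed in optical fiber (about 2/3 of c);
# real paths also add queuing, processing, and serialization delay per hop.

FIBER_SPEED_KM_PER_MS = 200.0  # ~200,000 km/s expressed per millisecond

def one_way_propagation_ms(path_km: float) -> float:
    """Minimum one-way delay for a given fiber path length, in milliseconds."""
    return path_km / FIBER_SPEED_KM_PER_MS

for path_km in (100, 1_000, 5_000, 10_000):
    print(f"{path_km:>6} km  ->  ~{one_way_propagation_ms(path_km):.1f} ms one-way")
```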
Purpose-built architecture with compute at the edge, closer to the customer
To reduce the distance that data must travel from a customer location to the Netskope Security Cloud, the Netskope Platform Engineering team architected NewEdge to place traffic processing in nearly 60 data centers strategically located around the world, putting compute power as close to users as possible. This distributed approach to the compute resources that process traffic is crucial for delivering a cloud security service; the design is similar to a content delivery network (CDN), but focused on security enhancement rather than content delivery. One of my counterparts in the Network Visionaries advisory group, Elaine Feeney, recently published a blog on this topic, “Why the Edge Really Matters Right Now,” that goes into the advantages of edge compute in greater detail.
Leveraging tradecraft from the most prominent cloud companies and hyperscalers, Netskope has intentionally built out NewEdge using what the company calls a “data center factory” approach, in which teams pre-build data center racks before shipping them to a region to go into production. Before shipment, thousands of tests are run to meet stringent quality control measures. Automation is then used to test a second time and deploy in-region before taking production traffic. These processes are further streamlined by emphasizing consistency in the software, hardware, and networking components. A significant part of this is ensuring ideal physical configuration, for example, having the same cable plugged into the same port in every rack globally.
All told, Netskope can roll out a new pod in under three weeks, which is unheard of compared to traditional data center build-outs that take multiple months at best. Netskope rolled out the first NewEdge data center in 2019, and by the end of 2020 it had deployed roughly 20 data centers, including four new pods in Latin America in less than 40 days. That speed and scale of deployment are rare in the industry and reflect the expertise of the team behind NewEdge. Now, three years since first launching, NewEdge is powered by data centers in roughly 60 regions globally, a fact that shows just how scalable this architectural approach to design and deployment is.
Compared with the old-school approach of building racks onsite, this rapid expansion of NewEdge’s highly robust architecture demonstrates a deep commitment to providing resources exactly where customers need them. It’s an elegant and optimized solution. A purpose-built, distributed design like this is the preferred approach, yet many vendors fall short, resorting to shortcuts such as virtual points of presence (vPOPs) or leaning on public cloud providers to speed time-to-market.
Massively over-provisioned data centers built for scale
The other key to meeting SASE customers’ performance needs is providing adequate resources in each data center location. Netskope’s ability to protect customer data is only as good as the Netskope services and the underpinning NewEdge infrastructure being up, available, and accessible to customers. And traffic volumes keep climbing as more workloads move to the cloud, content gets richer, applications become more complex, and Netskope grows its customer base.
Amid this growth, Netskope has committed to never running a single data center above 30% utilization. The company has developed a reputation for massively overprovisioning: when a compute resource reaches 20% to 30% capacity, Netskope scales up and out by adding another data center in-region. The new location not only adds capacity for scale but also improves in-region resilience. This is very smart, because capacity planning has been a thorn in the side of many SaaS businesses and can ultimately undermine their success.
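As a rough illustration of that rule of thumb, here is a minimal sketch in Python (my own illustration; the names are hypothetical and only the 30% threshold comes from the figures above, not from Netskope’s actual tooling):

```python
# Illustrative sketch of a "scale out early" rule of thumb:
# trigger expansion when a region's utilization crosses a low watermark,
# long before it approaches 100%. Names and numbers are hypothetical.

from dataclasses import dataclass

SCALE_OUT_THRESHOLD = 0.30  # begin adding in-region capacity at ~30% utilization

@dataclass
class Region:
    name: str
    capacity_gbps: float     # total provisioned capacity in the region
    peak_demand_gbps: float  # observed peak traffic

    @property
    def utilization(self) -> float:
        return self.peak_demand_gbps / self.capacity_gbps

def needs_scale_out(region: Region) -> bool:
    """Flag a region for another data center well before it is saturated."""
    return region.utilization >= SCALE_OUT_THRESHOLD

region = Region("example-region", capacity_gbps=2000, peak_demand_gbps=650)
if needs_scale_out(region):
    print(f"{region.name}: {region.utilization:.0%} utilized -> plan a new data center")
```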
The basic concept of capacity planning is straightforward:
- You forecast your demand.
- You evaluate your current supply.
- You ensure enough supply to meet the projected demand levels at or before the appropriate time.
Cloud companies typically make these calculations by analyzing a variety of synthetic measurements and metrics. The problem is that although demand for cloud services grows exponentially, supply grows stepwise, getting an immediate bump up whenever new resources are added, then staying steady until the next significant step up.
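A minimal sketch of that mismatch, with entirely made-up numbers of my own: demand compounds smoothly month over month, while supply stays flat until the next build-out lands, so the planning question is whether the next step up arrives before the demand curve crosses the current ceiling.

```python
# Illustrative only: exponentially compounding demand vs. a flat supply ceiling.
# The planning question is whether the next stepwise capacity addition lands
# before demand crosses the current ceiling. All numbers are made up.

monthly_growth = 0.10   # demand compounds ~10% per month
demand_gbps = 400.0     # demand today
supply_gbps = 1000.0    # capacity available today (flat until the next build-out)

for month in range(1, 25):
    demand_gbps *= 1 + monthly_growth
    if demand_gbps > supply_gbps:
        print(f"Without a capacity step, demand overtakes supply in month {month}.")
        break
else:
    print("Supply covers demand for the whole planning horizon.")
```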
SaaS providers’ network and systems monitoring teams typically sample usage every five minutes to estimate future demand. They lop off the top 5% of samples and take the highest value that remains, the 95th percentile, to produce an aggregated view. In doing so, they develop a clear picture of typical sustained demand, but they eliminate the granularity and specificity that might identify the causes of the highs within each five-minute period.
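Here is a minimal sketch of that kind of 95th-percentile aggregation (my own illustration with synthetic data, not any vendor’s monitoring pipeline), including how a short burst far above the norm can vanish from the reported number:

```python
# Illustrative 95th-percentile aggregation over five-minute traffic samples.
# Demonstrates how a handful of extreme samples (a microburst) can disappear
# from the reported figure. Synthetic data; not any vendor's real pipeline.

import math
import random

random.seed(7)

# A day of five-minute samples: steady ~500 Gbps with noise...
samples_gbps = [500 + random.uniform(-50, 50) for _ in range(288)]
# ...plus a short microburst far above the baseline.
samples_gbps[100:103] = [1800, 1950, 1700]

def percentile_95(values):
    """Drop the top 5% of samples and return the highest remaining value."""
    ordered = sorted(values)
    index = math.ceil(0.95 * len(ordered)) - 1
    return ordered[index]

print(f"95th percentile: {percentile_95(samples_gbps):.0f} Gbps")
print(f"True peak:       {max(samples_gbps):.0f} Gbps")
```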
Internet traffic, however, does not move linearly, and any given five-minute period might include a microburst in demand so large that it could threaten the stability of the cloud solution. That’s why I think Netskope is wise to use a much lower threshold for upgrading its data center capacity, combined with a highly scalable architecture that streamlines the process of adding resources when needed. Nobody knows what the future holds – while we would like to be able to plan precisely, that’s not how life works. Events will happen that we cannot foresee. For a cloud provider, those events will sometimes radically change the volume of customer traffic they need to support.
Because Netskope has a lower threshold for adding or upgrading data centers, it is much better prepared to provide uninterrupted resources through whatever ‘black swan’ event may be hiding just over the horizon. The overprovisioning has served NewEdge particularly well during COVID and, more recently, amid the unforeseen global supply-chain issues that have accelerated cloud adoption. In multiple instances, Netskope has had to onboard new customers overnight, with tens or even hundreds of thousands of users. The NewEdge architecture has made that possible.
Optimally overprovisioning doesn’t mean NewEdge needs thousands of data centers, nor does it mean Netskope needs to run hundreds of racks in nearly 60 regions. What it means is that the decision-makers in Platform Engineering at Netskope understand the business deeply enough to know what to expect of their traffic needs, and accordingly build a big enough buffer to handle the unlikely events. Today, each Netskope data center can scale up to handle an impressive 2Tbps of traffic, which means NewEdge can carry more than 100Tbps globally. And if an individual data center becomes overloaded, the NewEdge architecture dynamically fails surplus traffic over to a nearby data center, completely transparently and seamlessly to the end customer or user.
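Here is a minimal sketch of that kind of overflow failover (my own illustration with hypothetical regions, capacities, and distances, not Netskope’s actual routing logic):

```python
# Illustrative overflow failover: when a data center exceeds its capacity,
# surplus traffic is redirected to the nearest data center with headroom.
# Regions, capacities, and distances are hypothetical.

from __future__ import annotations
from dataclasses import dataclass

@dataclass
class DataCenter:
    name: str
    capacity_gbps: float
    load_gbps: float
    distance_km: float  # distance from the overloaded site

    @property
    def headroom_gbps(self) -> float:
        return self.capacity_gbps - self.load_gbps

def failover_target(surplus_gbps: float, neighbors: list[DataCenter]) -> DataCenter | None:
    """Pick the closest neighbor that can absorb the surplus traffic."""
    candidates = [dc for dc in neighbors if dc.headroom_gbps >= surplus_gbps]
    return min(candidates, key=lambda dc: dc.distance_km) if candidates else None

neighbors = [
    DataCenter("dc-east", capacity_gbps=2000, load_gbps=1900, distance_km=300),
    DataCenter("dc-west", capacity_gbps=2000, load_gbps=400, distance_km=800),
]
target = failover_target(surplus_gbps=250, neighbors=neighbors)
print(f"Redirect surplus to: {target.name if target else 'no capacity available'}")
```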
Why network design and architecture are so essential to cloud security
For many cloud-security companies, the overprovisioning described in this blog can be cost-prohibitive. It’s more affordable for Netskope because of the easy scalability of the NewEdge architecture. The problem is that tradeoffs in this area are unacceptable, as failing to overprovision correctly may put customers’ security and valuable data at risk.
Many years ago, when I was in my early 20s, I worked the night shift overseeing a local network of ATMs (by which I mean automated teller machines, not asynchronous transfer mode). Several hundred ATMs were linked via Plain Old Telephone Service (POTS) serial connections, and I monitored their status. Each bank set its own policy for how its machines would respond if they lost network connectivity and had to operate offline. The banks had the option of ceasing to dispense any money until the ATM could verify that customers had enough in their accounts. But I was surprised that most banks chose to continue dispensing cash, even without verification. They prioritized customer satisfaction over risk mitigation.
Cloud security providers’ capacity planning raises issues similar to those the banks faced when setting policies for offline ATMs. What will a SaaS security company do if a particular data center experiences more traffic than it can handle? I know of some security companies that just let all traffic through; they default to no security in that circumstance. If their systems have an outage from 4 a.m. to 4:15 a.m., they may later go back through the traffic data for that period and let customers know whether anything bad happened. But that is not a comfortable conversation.
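In networking terms, that is a fail-open versus fail-closed decision. Here is a minimal sketch of the policy choice (my own illustration; the names are hypothetical, not any vendor’s product settings):

```python
# Illustrative fail-open vs. fail-closed policy for a security service that
# cannot inspect traffic (overloaded or unreachable). Names are hypothetical.

from __future__ import annotations
from enum import Enum

class OverloadPolicy(Enum):
    FAIL_OPEN = "fail_open"      # pass traffic uninspected; availability over security
    FAIL_CLOSED = "fail_closed"  # block traffic until inspection is available again

def handle_overload(policy: OverloadPolicy, packet: bytes) -> bytes | None:
    """Decide what to do with a packet the inspection engine cannot handle."""
    if policy is OverloadPolicy.FAIL_OPEN:
        return packet   # forwarded without inspection: the 4 a.m. gap described above
    return None         # dropped: secure, but the application is unavailable

# Like the banks that kept dispensing cash offline, a fail-open provider
# prioritizes continuity; the traffic during the outage is reviewed afterward.
print(handle_overload(OverloadPolicy.FAIL_OPEN, b"example payload"))
```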
In my experience, even if the security company offers a post-outage review, customers prefer having their traffic protected all the time. Knowing after the fact what went on during an outage is not the same as having uninterrupted security in place. Ultimately, that’s why a cloud security solution’s architecture matters, and it’s the cornerstone for success in SSE and SASE.