Thought Experiment: How Did Zoom’s Infrastructure Keep Us Connected?

What bit of technology got us through the pandemic, scaled up at just the right time to keep a sizeable chunk of our economy working, redefined education on a moment’s notice, and is now redefining for the world the entire notion of work? Without our even thinking much about it, the term “Zoom Meeting” has become just as ubiquitous as “Google Search.” In fact, Zoom has become a verb for holding a meeting just as Google has become a verb for searching.

We use Zoom now as a matter of course and get upset when our own wireless network or Internet Service Provider isn’t up to snuff. It has become like our highway system, except that we at least sort of understand how the highway system works. But Zoom? What really is this thing called Zoom?

Let’s start with you. From your point of view, you are hosting or joining a meeting, one with who knows how many participants. You are a Zoom Client, a member of a meeting. You haven’t a clue, and usually don’t want to care, as to where Zoom’s infrastructure is any more than you know where Amazon.com resides. Just as with browsers, the Zoom client application in front of you runs on a number of operating systems (macOS, Windows, Linux, Android, iOS, Chrome OS) and in a range of context-aware applications (mobile, desktop, Zoom Rooms). No matter your client configuration, the pattern of interaction with the Zoom infrastructure remains the same.

Thinking about it even for a moment, you know that your Zoom client must be sending out at least video, audio, and messaging – unless, that is, you are someone like a student trying to hide from their teacher (muted, with just your name showing). Something out there is accepting all of that from the millions of us using Zoom clients at any moment, apparently reformatting it all, and then sending it back.

I also happen to be now – as a retirement job – a Computer Science instructor at a university local to me. At the time the decision was made to switch all my classes over to remote instruction, my concern – perhaps oddly – was not with Zoom’s capabilities. After all, as with millions of teachers just like me all over this country, we knew that its capabilities had to meet our needs; either that, or we had to be successful nonetheless with what was available. (Fortunately for me, what Zoom presents is everything that I needed and more.) My own concern – perhaps from knowing too much – was whether Zoom would collapse when suddenly burdened with the expectations of this flood of instructors just like me, the many more millions of students we were serving, and by extension the concurrent use of still more millions of folks in thousands of other organizations. I, being in the Minnesota university system, pictured at first a set of servers with massive networks serving just the educational needs of at least Minnesota. And, interestingly, it could have been exactly that, had the university system wanted to use Zoom’s on-premises solutions. But it wasn’t, at least in our case. So, again, what was it?

And to drive this point home still further, think of that camera on your device in front of you. The roll-of-quarters-size video camera I used for my lectures had a resolution of 1920×1080, each pixel represented as 24 bits. That’s about 50 megabits per image (just for the image), flowing out of my box at a rate of 30 frames per second, as published by the Zoom help center; about 1.5 gigabits/second. And it’s not just my image that is being sent; as a lecturer, it typically included my screen’s worth of PowerPoint charts. And all that just from me sending my stuff to some Zoom server somewhere. And then a screen’s worth of “meeting” is reorganized and sent back to me – and everyone else – at the same rate. And that says nothing about audio. With many millions of us doing this concurrently, something seemed clearly wrong with this picture, and I needed to find out what. After all, Zoom recommends video bandwidth of only about 1 to 5 megabits/second and, by the way, for audio only 60 to 100 kilobits/second.
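That arithmetic is easy to check. A quick sketch of the raw, uncompressed numbers (the constants come from the paragraph above; nothing here is Zoom-specific):

```python
# Back-of-the-envelope arithmetic for the raw (uncompressed) video stream.
WIDTH, HEIGHT = 1920, 1080      # camera resolution
BITS_PER_PIXEL = 24             # 8 bits each for R, G, B
FPS = 30                        # frames per second

bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL
raw_bps = bits_per_frame * FPS

print(f"{bits_per_frame / 1e6:.1f} megabits per frame")   # 49.8
print(f"{raw_bps / 1e9:.2f} gigabits/second raw")         # 1.49

# Zoom's recommendation tops out around 5 megabits/second,
# implying a compression ratio on the order of:
print(f"~{raw_bps / 5e6:.0f}:1")                          # ~299:1
```

That roughly 300:1 gap between the raw stream and the recommended bandwidth is exactly the mystery the rest of this article chases down.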

And, again, with that mental picture, I imagined some relatively local set of servers for me to work with my students, and by extension for other – say – university professors to meet with theirs. Interestingly, it is sort of that, but I – and you – can also use Zoom to meet with anyone, anywhere. Again, something was wrong with my view of this.

So, I thought the place to start was with Zoom’s notion of their own global data centers. At this writing, there are only 17 data centers distributed throughout the world, parts of which reside within various cloud services. Mind blown. How the heck?

OK, I needed to start over from the basics.

Let’s start with the term Frames, as in 30 frames/second. Like a film-based movie, that which appears to be moving is really just a set of rapidly changing still shots. Our brain does the rest. So a Frame is a digital picture. Architectures for representing compressed digital pictures have been around for – what? – decades. Case in point: my lectures tend to be PowerPoint charts with a basic white background. How much information is in – say – an 8×8-pixel white background? Only that it is white, and its location. Compression is easy; the run just needs to be detected – rapidly – and represented in the digital stream representing the frame. Not a problem. That’s 192 bytes represented instead in just a few bytes. Rinse and repeat over your entire image. Your CPU is slightly busier than usual, compressing (and decompressing) each and every frame, but the network sees only a smaller stream of bits. This is spatial compression. JPEG compression is an example.
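As a toy illustration of the idea – not Zoom’s actual codec, which is proprietary, and far simpler than the transform-and-entropy coding real codecs use – a run-length encoder shows how a flat white block collapses to almost nothing:

```python
# Toy spatial compression: run-length encode a row of pixels.
# Real codecs (JPEG, H.264 intra frames) use transforms and entropy
# coding, but the payoff on flat image regions is the same idea.

def rle_encode(pixels):
    """Collapse runs of identical pixel values into [value, count] pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1
        else:
            runs.append([p, 1])
    return runs

# An 8x8 all-white block: 64 pixels at 24 bits each = 192 bytes raw...
white_block = [0xFFFFFF] * 64
print(rle_encode(white_block))   # [[16777215, 64]] -- one value, one count
```

One value and one count stand in for all 192 bytes; a busier region would produce more runs, which is why flat PowerPoint backgrounds compress so well.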

Next, 30 frames per second (fps). OK, in that same PowerPoint chart, what all changed transitioning from the frame at time 1/30 to 2/30, and then from 2/30 to 3/30? Oftentimes nothing, and only occasionally the location of the laser pointer I am using for a cursor. Sure, to start an all-new, completely different frame, the entire spatially compressed frame needs to be sent, but what about the next – say – five? Could the next few just describe the change, if any, and then follow that with another compressed frame? This is temporal compression. It happens that this, too, is not particularly new, and there are quite a few architectures for doing exactly this. MPEG compression is an example. These algorithms for compression are generically called a codec. (If you are into getting deeper, start by looking at a few mainstream codecs like H.264 MPEG-4 AVC, H.265 HEVC, or VP9. And this is still an open area of research.) A separate set of architectures, called containers, is defined for packaging these, and there are multiple of these as well. They get mixed and matched, based on the need. Zoom does it their way. Various kinds of movies do it theirs. Ultimately, from our point of view as users, as long as the compression technology for the “movie” gets matched up with the decompression technology – both of which run on our local devices – it’s all good. Indeed, Zoom can change it at will and we wouldn’t be any the wiser.
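Again as a toy sketch – real codecs use motion-compensated blocks rather than per-pixel diffs – the keyframe-plus-deltas idea looks something like this:

```python
# Toy temporal compression: send a full "keyframe," then only the
# pixels that changed ("delta frames"). Real codecs (MPEG/H.264
# P-frames) work on motion-compensated blocks, but the principle holds.

def delta_encode(prev, curr):
    """Return only the (index, new_value) pairs that differ."""
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr)) if p != c]

def delta_apply(prev, delta):
    """Reconstruct the next frame from the previous one plus a delta."""
    frame = list(prev)
    for i, v in delta:
        frame[i] = v
    return frame

keyframe = [255] * 16          # a tiny all-white frame
next_frame = list(keyframe)
next_frame[5] = 0              # one pixel changed (the laser pointer moved)

delta = delta_encode(keyframe, next_frame)
print(delta)                   # [(5, 0)] -- one pair instead of 16 pixels
assert delta_apply(keyframe, delta) == next_frame
```

When nothing moves between frames, the delta is empty; that, combined with spatial compression, is how a 1.5 gigabit/second raw stream shrinks to a few megabits.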

And let’s keep in mind that we are talking about real-time processing; the images just keep on coming. As long as an acceptable version of what I produced over time is sufficiently close to what you are perceiving on your end, we as Zoom users are still happy. Remember that between you and me is a rather complex network, the very same one used at some level for you to see this web page. And that, too, we have been doing for a few years now. Video streaming, right? Again, keep in mind that complete images – even compressed ones – do not need to be sent; most of the frames sent can just be deltas – the changes – from a preceding image. And these are presented to us at 30 fps. And, yes, it is called real time, but have you ever watched yourself waving on a different Zoom session? It is not real time; you see a time-delayed version of you. That is not just a networking effect due to the time to send any given frame. Your system is buffering up those frames, then presenting them to you at the rate of 30 per second, but potentially well after I sent them. From a networking point of view, each frame shows up when it shows up. More on this later. So, ask yourself: from a 30 fps point of view, what happens if a delta frame does not show up in time for display? Would it matter? It happens that you do notice this occasionally. Recall that compressed audio is being sent with these frames. Audio needs to be recreated at the right rate as well. Have you noticed the very occasional skips or lack of fidelity?
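To make the buffering point concrete, here is a minimal sketch of that display-tick logic; the `schedule_playback` helper is hypothetical, not anything from Zoom. Frames arrive with network jitter, but must be presented on a fixed 1/30-second tick; a frame that misses its slot is simply dropped:

```python
# Toy playback buffer: frames arrive whenever the network delivers
# them, but are displayed on a fixed schedule. A frame that has not
# arrived by its display deadline is dropped (the glitch you notice).

TICK = 1 / 30  # seconds between displayed frames

def schedule_playback(arrival_times, buffer_delay):
    """arrival_times[i] = when frame i arrives (seconds).
    Frame i is due at buffer_delay + i*TICK; if it hasn't arrived
    by then, it is dropped (None in the result)."""
    shown = []
    for i, arrived in enumerate(arrival_times):
        deadline = buffer_delay + i * TICK
        shown.append(i if arrived <= deadline else None)
    return shown

# Frame 2 is delayed by the network and misses its display slot:
arrivals = [0.00, 0.03, 0.25, 0.10]
print(schedule_playback(arrivals, buffer_delay=0.1))   # [0, 1, None, 3]
```

The `buffer_delay` is the time-delayed version of you: a larger buffer absorbs more jitter, at the cost of everyone seeing you slightly further in the past.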

You can see that there is a lot less data going out from your workstation than might be assumed at first blush, as a result of pre-processing by you the sender. Interestingly, and as you would expect, there is still more processing done on your end to ensure that all that you are sending out into the world remains secure. Although I as an instructor might have no problem with anyone else listening in on my lecture, you in your meetings – and my students in theirs – want to ensure that what you send out into the world remains scoped to only those allowed to be in the meeting. For starters, Zoom does support some notion of end-to-end encryption. That means that everything that leaves your workstation, compressed or not, is also being encrypted prior to output and decrypted upon input. But, except for some specialized one-to-one cases, the encrypted data you are producing does not go directly to the meeting participants; it goes first through a Zoom server.

Zoom claims that in general their servers don’t decrypt there; relating to that encrypted data that those servers see: “We have implemented robust and validated internal controls to prevent unauthorized access to any content that users share during meetings, including – but not limited to – the video, audio, and chat content of those meetings. Zoom has never built a mechanism to decrypt live meetings for lawful intercept purposes, nor do we have means to insert our employees or others into meetings without being reflected in the participant list.” [1][2]  Good. Recall, though, that what you are sending is just your encrypted image and audio. You know that that gets reformatted into the various forms you see in your meetings, one typical view being that of all participants. Here, everyone’s individual encrypted view seems to be reformatted by Zoom server(s) and sent back to you.

But if Zoom servers themselves are not mucking with your encrypted audio/video, that would instead mean that it is your own workstation that is taking all that encrypted video data from all the participants, decrypting it, and also reformatting it all into the usual meeting view that you normally see. Zoom software, of course, but on your workstation. So what are Zoom servers doing? Broadcasting what they receive as encrypted data from each of the meeting participants to each and every one of the meeting participants.
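That division of labor – servers relay opaque ciphertext, clients do the decrypting and compositing – can be sketched as a toy router (the class and names here are hypothetical illustration, not Zoom’s actual code; in WebRTC circles this server role is called a selective forwarding unit, or SFU):

```python
# Sketch of the "broadcast, don't decrypt" server role: the router
# never holds any keys; it just fans encrypted packets out to the
# other participants, who decrypt and composite the view locally.

class MeetingRouter:
    def __init__(self):
        self.participants = {}          # name -> list of received packets

    def join(self, name):
        self.participants[name] = []

    def forward(self, sender, encrypted_packet):
        """Relay the opaque ciphertext to every other participant."""
        for name, inbox in self.participants.items():
            if name != sender:
                inbox.append((sender, encrypted_packet))

room = MeetingRouter()
for who in ("alice", "bob", "carol"):
    room.join(who)

room.forward("alice", b"\x9f\x02...opaque ciphertext...")
print([k for k, v in room.participants.items() if v])   # ['bob', 'carol']
```

Notice that `forward` never inspects the packet contents at all, which is precisely the property Zoom’s statement above claims for its servers.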

For those wanting the deeper details, consider this from the Zoom Help Center … “By default, Zoom encrypts in-meeting and in-webinar presentation content at the application layer during transit using TLS 1.2 with Advanced Encryption Standard (AES) 256-bit GCM encryption for the desktop and mobile clients.”  TLS is also used as the security layer of HTTPS.

Does Zoom ever look at the audio/video? Yes, they can. They do store the cryptographic keys on their servers to do so in lawfully necessary – and abnormal – instances.

OK, you will recall that my original interest stemmed from how Zoom, with this pandemic, could handle what appeared to be much, much more of an already large amount of traffic. As you’ve noted in the above, the traffic associated with any given user is actually a lot less than I would have first assumed. But, nonetheless, there was a sudden upsurge in the number of users and the number of meetings. This is not handled by adding more processing power and more memory to a server or two. Whatever the architecture, they had to scale up, and do it fast. And, again, all that they have is seventeen data centers worldwide, with only a couple of those in the US. Massive power, little bitty living space.

So, what is their data center?  It has got to be a lot more than some straightforward web server. As with so much in the IT business, we need to surmise based on market-tecture, here from their Zoom Global Infrastructure. These snippets tell us quite a bit … (The following italicized paragraphs are copied from that document.)

  • “Zoom’s unique cloud architecture makes all of this possible. Our architecture starts at the base with the Intelligent Transport Layer, which decides if UDP, TCP, TLS, or HTTPS on the client layer is the best experience for connectivity based on different proxy settings and the need to go through firewalls.”
  • “Both cloud and on-premise solutions are designed with failover and load balancing mechanisms when deployed. Zoom monitors the zone level with multiple VMs, and if a zone is approaching a threshold or fails, it will move to the next zone. Similarly, at the VM level, if a VM fails or is approaching threshold, the connection moves to the next VM.”

So, we seem to be talking about multiple cloud-based virtual machines. You want more capacity in a machine, you add more compute capacity, more memory, more networking capacity. When you have too many meetings on a virtual machine, overwhelming its capability, you can add more virtual machines, splitting off meetings to create and maintain balance. It is at a cost, of course, but for Zoom more users means more income. As with the cloud in general, as meetings tail off, virtual machines get shut down. Data center? It is not entirely a physical concept here; it is prudent use of localized cloud resources. Well, it is sort of physical. Notice the “on-premise solutions”? Zoom is also just software, software installable on your own hardware – and wherever you want it – if you want control.

To maximize throughput, Zoom works to manage Zoom users to maximize their locality to a zone: “The meeting participants are always connected to a nearby data center and assigned to the least loaded server. On the other hand, meeting participants will be aggregated to same server if they are in same place.”

Pausing for a moment to refer back to codecs, this is what their architecture document says about their codec: “Zoom’s Adaptive Codec in Session Layer, unlike that of other providers, is created with proprietary coding. The multiple layers around this codec optimize the video frame rate and resolution and provide superior quality and reliability for various network environments and different devices. Zoom uses multiple streams, allowing the application to toggle between streams to ensure that the best quality video gets delivered to end users. Because of Zoom’s compression technology, the system can operate well in an environment with up to 45 percent packet loss. In these instances, Zoom will prioritize audio over video, because audio is more crucial in business discussions and collaboration. Zoom’s multi-stream technology handles bandwidth adjustments for the end user to improve their quality based upon their ability to receive data.”

Of course, Zoom servers need to get out onto “the internet” just like you do. Flexibility, reliability, and bandwidth seem to be touchpoints for them. In particular, they multi-home, allowing their traffic the best path in and out …  “Zoom is located in premier co-location facilities that are ISP carrier neutral. Zoom has five ISPs (Level 3, NTT, Cogent, Tata and XO) and is a multi-home BGP. Failover between ISPs is automatic. Even if four ISPs were down, the Zoom service would still work. Zoom supports up to 80G bandwidth and US data center racks are provisioned with a massive amount of bandwidth, each with 40 Gbps of connectivity, for phenomenal performance.”

Another useful white paper is one Zoom calls their Client Connection Process. (As before, italicized quotes come from this document.)  The following figure, also copied from this document, is a quick representation of what they call a Data Center:

You link into this, no matter the OS or physical client device at your end, apparently in much the same way as you would a web server, using secure communications (at first via HTTPS on port 443), but, of course, with much of the support belonging and specialized to them.

What they call here a Meeting Zone is described in the referenced document as … “A Zoom Meeting Zone is a logical association of servers that are typically physically co-located that can host a Zoom session.”

ZC here is a Zone Controller: “A Zoom Zone Controller is responsible for the management and orchestration of all activity that occurs within a given Zoom Meeting Zone. Deployed in a highly available configuration, these systems track the load on all servers with the Zone and help broker requests for new connections into the zone.” As I read this, as you start a meeting, these decide which virtual machine has the capacity to host your meeting. As participants join, to minimize cross-system copying of participants’ data frames, participants can be linked into the same virtual machine. Whole meetings could be reassigned by the Zone Controllers for reasons of capacity and availability.
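A minimal sketch of that brokering, as I read it (the `ZoneController` class and VM names here are hypothetical, not Zoom’s implementation): new meetings land on the least-loaded machine in the zone, while joins are pinned to the machine already hosting that meeting.

```python
# Toy zone controller: place new meetings on the least-loaded VM,
# keep all participants of one meeting on the same VM to avoid
# cross-system copying of their frames.

class ZoneController:
    def __init__(self, vm_names):
        self.load = {vm: 0 for vm in vm_names}      # meetings per VM
        self.meeting_vm = {}                        # meeting id -> VM

    def place(self, meeting_id):
        if meeting_id in self.meeting_vm:           # join existing meeting:
            return self.meeting_vm[meeting_id]      # same VM, no cross-copy
        vm = min(self.load, key=self.load.get)      # else least-loaded VM
        self.load[vm] += 1
        self.meeting_vm[meeting_id] = vm
        return vm

zc = ZoneController(["mmr-1", "mmr-2"])
print(zc.place("math-101"))    # mmr-1
print(zc.place("cs-250"))      # mmr-2 (balancing across VMs)
print(zc.place("math-101"))    # mmr-1 again (same meeting, same VM)
```

Reassigning a whole meeting for capacity or failover would then just be updating `meeting_vm` and migrating its state, which is roughly what the failover quote earlier describes.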

MMR here is their Multi-Media Router: “A Zoom Multimedia Router is responsible for hosting Zoom meetings and webinars. As the name implies, these servers ensure that the rich offering of voice, video, and content are properly distributed between all participants in a given session.” In short, the audio/video frames you send in are broadcast to your meeting participant’s systems by these virtual machines.

Once your meeting session has been assigned, Zoom needs to continuously determine the best means of talking to you. As you might imagine, they try to roll their own support of this for purposes specific to video/audio: “Each of these media connections attempt to use Zoom’s own protocol and connect via UDP on port 8801. If that connection cannot be established, Zoom will also try connecting using TCP on port 8801, followed by SSL (port 443). By leveraging different connections for each type of media, further network optimization technology can be applied such as DSCP marking to ensure the most important media is expedited through the network.”
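The fallback ladder that quote describes can be sketched as ordinary code. The connector function here is an injected stand-in rather than a real socket, since whether each transport works depends entirely on your local network and firewall:

```python
# Sketch of Zoom's described transport fallback: try the preferred
# protocol first, then progressively more firewall-friendly options.

FALLBACK_LADDER = [
    ("udp", 8801),    # preferred: low-latency, loss-tolerant media
    ("tcp", 8801),    # firewalls often block UDP
    ("tls", 443),     # last resort: looks like ordinary HTTPS traffic
]

def connect(try_transport):
    """try_transport(proto, port) -> bool; walk the ladder in order
    and return the first (proto, port) that succeeds."""
    for proto, port in FALLBACK_LADDER:
        if try_transport(proto, port):
            return proto, port
    raise ConnectionError("no transport available")

# e.g. a locked-down network where only outbound port 443 is open:
firewalled = lambda proto, port: port == 443
print(connect(firewalled))     # ('tls', 443)
```

UDP first is the right default for media: a lost packet is better skipped than retransmitted late, which is also why the dropped-delta-frame glitches discussed earlier are tolerable.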

You can almost picture that. And you – and I – can see how the Zoom organization scaled this up, and just when the world needed it.

As a note of conclusion, I would like to express my own appreciation to the entire team at Zoom. At the beginning of this pandemic, when we users had not much choice but to throw our lot in with you folks and it was either sink or swim, you folks kept us all afloat. I had my doubts early on – hence this article – but hopefully we users can now see why you had the confidence in Zoom that you did. It cannot at all have been easy to go from a little-known company and scale up to being as necessary to society as telephones. I have no doubt that you folks must have often felt like Alice (Through the Looking-Glass) having to run with the Red Queen. But from at least one thankful college instructor… You did good.

