Streaming Systems Explained: A Comprehensive Deep-Dive

May 18, 2024
This system design guide addresses two recurring system design interview questions: how streaming systems manage uploads and how they ensure consistent playback quality.
Streaming is taking over the world. All kinds of media content are now provided as streams, including movies, TV shows, music, podcasts, and even gaming. Exotic examples such as live concerts, virtual reality experiences, and e-sports tournaments are also becoming increasingly popular. Globally recognized services like Netflix, YouTube, Spotify, and Twitch exemplify this shift, delivering a seamless streaming experience to millions of users worldwide.
Given their prevalence, streaming systems are common topics in system design interviews at big tech companies. Regardless of which streaming system you might face in an interview, the implementation principles behind these systems are fundamentally the same. This guide explains the challenges all such systems need to overcome: how to safely upload large media files at scale and how to provide the hosted data as a consistent data stream, regardless of network conditions or user devices.
By the end of this guide, you will have all the background knowledge you need to explain the challenges of any streaming system and design an architecture to overcome these challenges in any system design interview.

Upload and Manage Media Content

notion image
If you aim to provide streaming content to your users, the initial step is to upload this content onto your servers.
When thinking about streaming platforms tailored to long-form content such as movies or series, it's a given that the source files will be substantial in size.
Consider a worst-case scenario like the extended edition of "The Lord of the Rings: The Return of the King" from 2003, with a runtime of over 4 hours. Can you estimate the size of such a file in 4K resolution? A raw source file can reach up to 4 terabytes!
Handling such sizable input files is our first hurdle. We need to discuss how to manage network outages or other disruptions while uploading such a colossal file, because simply restarting from scratch isn't feasible.
Upon successful file upload, the next challenge appears. Given the vast diversity of users' technical setups, especially for globally launched platforms like Netflix, YouTube, or Spotify, your system must process the raw source file into a whole spectrum of versions, each optimized for different end-user devices and their respective bandwidth capabilities. Some streaming platforms cater to over 2,200 distinct devices, each receiving a tailored file version optimized for its specific resolution and format.
notion image
Take YouTube, for instance. It refrains from transmitting a 4k video over a limited bandwidth mobile connection only to display it on a minuscule screen. Instead, it dispatches a version with reduced resolution, ensuring that users experience uninterrupted streaming that aligns with the optimal quality their device can display.
Generating such a diverse array of versions is no minor feat. Video encoding is a resource-intensive operation, and even the most formidable servers may require significant time to render all the variant resolutions.
To summarize, this lecture provides technical solutions to two pivotal questions: how to reliably upload voluminous files, and how to subsequently process and efficiently store multiple versions of them. In the next lecture, the focus will shift to the actual streaming of the video to the end user.
Let's begin by understanding how video processing typically functions.
The transformation is achieved through a series of consecutive services, each modifying the raw input data sequentially.
These services are organized in what is termed a processing pipeline.
notion image

File Chunker

To mitigate the impact of network failures throughout the process, we employ a technique you've previously encountered in the section on file sharing systems: dividing the file into blocks or chunks.
In the event of a network disruption or any other incident that halts the file upload, checksums come to our aid, allowing us to swiftly determine how many chunks have already been uploaded to our server and pinpoint where to resume the upload.
By only transmitting the chunks that are still missing, this approach minimizes the data sent over the network. Additionally, it can aid in removing duplicate data in storage.
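To make this concrete, here's a minimal Python sketch of the chunking idea. The 4 MiB chunk size and the use of MD5 are illustrative assumptions, not the parameters of any particular platform:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per chunk; real systems tune this value


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split a byte buffer into fixed-size chunks and hash each one."""
    return [
        hashlib.md5(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def missing_chunks(local_hashes: list[str], server_hashes: list[str]) -> list[int]:
    """Indices of chunks the server does not have yet, so only those are re-sent."""
    uploaded = set(server_hashes)
    return [i for i, h in enumerate(local_hashes) if h not in uploaded]
```

After an interrupted upload, the client compares its local hash list against the hashes the server reports and resumes with only the missing indices. As a side effect, identical chunks hash to the same value, which is exactly what enables the deduplication mentioned above.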
The data chunks generated in this phase will remain pivotal, especially as we delve into data streaming techniques.

Content Filter

The entirety of this step is managed by a machine learning model responsible for checks on copyright infringements, piracy, and NSFW content.
Upon the data's arrival at the server, the content filter, positioned as the next phase in the pipeline, examines if the video aligns with the platform's content guidelines. For platforms like Netflix, this might be pre-determined based on the content rating of the media. Many nations employ rating systems that designate age restrictions and other criteria. Conversely, for platforms like YouTube, the content filter directly enforces their guidelines by halting any subsequent processing of the non-compliant video.
Should any discrepancies be identified, the task could be shifted to a separate queue, where a member of the moderation team assesses the next steps. However, this element won't be integrated into our design since it diverts attention away from our central objective: video uploads and streaming.


Transcoding

Once we have verified that the content is legitimate, we can begin to produce optimized versions tailored for our various target devices.
The initial phase involves transcoding our data. During this process, the original data is decoded into an intermediary uncompressed format. This format is then encoded into the desired target format. We employ various codecs to adjust bitrate, downsample images, or re-encode the media.
This leads to a more compact file size and a format that's better suited for the intended devices.
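As an illustration of what a transcoding step might invoke, here's a sketch that assembles an ffmpeg command line. ffmpeg is one widely used encoder; the codec and bitrate values here are illustrative assumptions, not the settings of any real platform:

```python
def transcode_command(src: str, dst: str, codec: str = "libx264",
                      bitrate: str = "5M") -> list[str]:
    """Build an ffmpeg invocation that decodes the source and re-encodes it
    with the target codec and bitrate."""
    return [
        "ffmpeg", "-i", src,      # decode the original into an intermediate form
        "-c:v", codec,            # encode video with the target codec
        "-b:v", bitrate,          # cap the video bitrate
        "-c:a", "aac",            # re-encode the audio track
        dst,
    ]
```

The pipeline service would hand such a list to `subprocess.run`, once per target format.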

Quality Conversion

We now arrive at the concluding step - quality conversion. In this phase, we manage the conversion of the transcoded media into various resolutions such as 4K, 1440p, 1080p, 720p, and so on.
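A sketch of what such a rendition ladder might look like — the resolutions and bitrates below are illustrative assumptions, since real ladders vary by codec and platform:

```python
# Illustrative bitrate ladder: rendition name -> (resolution, video bitrate in kbps).
LADDER = {
    "4K":    ("3840x2160", 16000),
    "1440p": ("2560x1440", 10000),
    "1080p": ("1920x1080", 5000),
    "720p":  ("1280x720",  2500),
    "480p":  ("854x480",   1000),
}


def renditions_for(max_height: int) -> list[str]:
    """Which renditions a device with the given screen height can make use of."""
    return [
        name for name, (resolution, _) in LADDER.items()
        if int(resolution.split("x")[1]) <= max_height
    ]
```

A 1080p laptop, for example, never needs the 4K or 1440p renditions, so the service can skip sending them entirely.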
We won't delve further into the technical intricacies of the processing pipeline in this module. While real-world pipelines might encompass added steps like generating subtitles, thumbnails, and other assets, we've covered all you require to articulate your perspective in the interview.
It's now apt to incorporate this pipeline into a coherent architectural framework.

Solution Architecture

For the system architecture, we can leverage our experience from designing the file sharing system. In that system, we have a client that divides the file into blocks, creates MD5 hashes for each block, and initiates the upload.
notion image
Blocks are uploaded sequentially. Whether it is more resource-efficient to process them as they arrive or to delay processing until all blocks have been uploaded depends significantly on the use case.
For a platform dependent on user-generated content like YouTube, it's prudent to wait until the entire file has been uploaded before initiating any processing. Given the unpredictability of the upload environment, it's uncertain how likely it is for the upload to be interrupted or fail. Conversely, for a commercial streaming service where parameters are more controllable, it's efficient to process blocks immediately upon upload, especially since authorized personnel are responsible for uploading raw movie files.
If you opt to wait until the entire file is uploaded, the system requires an initial object storage where blocks can be saved until the file is assembled. Completing the upload doesn't guarantee immediate hardware resource availability for pushing blocks through the pipeline.
notion image
To optimize resource allocation, it's advisable to continue processing the fully uploaded files as blocks. This ensures that every time processing power becomes available, a standard-sized block is ready for processing.
Using a message broker, we can employ a queue to retain blocks until resources are available for processing. Depending on the use case, the raw file in the initial object storage can either be deleted or archived.
Alternatively, the queue could store a reference to the blocks in the object storage and retrieve it when the system is prepared for processing.
Next, we delve into a fascinating architectural detail. To manage the vast volume of data blocks, we can activate individual service instances. Each instance initializes, propels the block through the processing pipeline, saves it in a database, and then concludes. This approach maximizes our hardware resources, allowing for parallel processing of as many blocks as feasible.
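A minimal sketch of this pattern, using Python threads and an in-memory queue to stand in for the message broker and the short-lived service instances (the uppercase transform is just a placeholder for the real pipeline):

```python
import queue
import threading


def process_block(block: bytes) -> bytes:
    # Placeholder for the real pipeline (content filter, transcoding, conversion).
    return block.upper()


def worker(pending: queue.Queue, results: list, lock: threading.Lock) -> None:
    """A short-lived worker: take one block, push it through the pipeline,
    persist the result, then terminate — one instance per block."""
    block = pending.get()
    processed = process_block(block)
    with lock:
        results.append(processed)  # stands in for the object-storage write
    pending.task_done()


pending: queue.Queue = queue.Queue()
results: list = []
lock = threading.Lock()

for blk in [b"chunk-a", b"chunk-b", b"chunk-c"]:
    pending.put(blk)

threads = [
    threading.Thread(target=worker, args=(pending, results, lock))
    for _ in range(3)
]
for t in threads:
    t.start()
pending.join()  # blocks until every queued block has been processed
```

In a real deployment the queue would be a message broker and each worker an ephemeral container or function instance, but the lifecycle — start, process one block, persist, terminate — is the same.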
After processing, we store the data blocks in a database. Object Storage remains an ideal choice! Unlike the file sharing system, we don't anticipate constant updates, meaning we won't be hindered by the fact that object storage requires rewriting the entire file if a portion changes. This approach also permits metadata storage alongside the blocks, simplifying the architecture by eliminating the need for an additional database.

Provide Content Streams

Next, we dive into the second big challenge that any streaming system must find a technical solution for: how to stream out the uploaded content. Streaming content efficiently and reliably is crucial for ensuring a seamless user experience. This involves dealing with various network conditions, supporting a wide range of user devices, and maintaining consistent quality.

The Problems at Hand

Currently, our content is stored in our object storage, awaiting distribution. However, ensuring reliable content delivery to users isn't straightforward. Similar to the upload process, multiple factors can slow down or interrupt our data stream.
Inherently, networks can be unpredictable. Two specific issues become particularly crucial for streaming.

Network Latency

Firstly, we have network latency. We've touched upon this before. Recall the highway analogy? The time taken for a car to reach its destination is determined by the number of available lanes (akin to bandwidth) and the volume of other cars on the road (equivalent to throughput).
One aspect we haven't delved into yet is the total distance data traverses. The time taken for a car journey is heavily influenced by the distance to be covered: a trip from California to Alaska will invariably be long, regardless of traffic conditions or road width.
The same principle applies to data, especially for expansive systems serving a global customer base. The impact of vast distances on network latency is often underestimated.
notion image
For a clearer picture, consider this: from my location in Germany, pinging a server in California takes, on average, 150 milliseconds for a round trip. In contrast, if both machines were housed in the same data center, the round trip time would plummet to roughly 0.5 milliseconds.
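A rough back-of-the-envelope check of these numbers — assuming light travels through fiber at about 200,000 km/s (roughly two-thirds of its speed in vacuum) and a Germany-to-California path of about 9,000 km:

```python
FIBER_SPEED_KM_S = 200_000  # ~2/3 the speed of light in vacuum
DISTANCE_KM = 9_000         # rough Germany -> California path length (assumption)


def min_rtt_ms(distance_km: float) -> float:
    """Theoretical lower bound on round-trip time over fiber, ignoring
    routing detours, queuing, and processing delays."""
    return 2 * distance_km / FIBER_SPEED_KM_S * 1000


# min_rtt_ms(9_000) -> 90.0 ms
```

So of the ~150 ms observed, roughly 90 ms is raw distance that no amount of extra bandwidth can remove — only moving the data closer helps.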
notion image

Network Congestion

Imagine a bustling road during rush hour, causing traffic to slow down and, inevitably, many motorists to be late for their appointments. Similarly, in the digital realm, when traffic surpasses the capacity of available bandwidth, transmission rates decline. While this isn't a significant concern for downloading a file – it merely extends the download time – it becomes problematic for streaming. For a seamless user experience, a new chunk of data must be delivered every 10 seconds or so. Any delay or congestion jeopardizing this constant flow adversely affects the viewing experience.

Home Network and Devices

It's a common misconception that once data reaches a user's home network, all potential hitches are eliminated. Unfortunately, several challenges can persist even at this stage.
The home network can quickly become the primary obstacle.
Streaming standard video requires roughly 4 Mbps of bandwidth; for high-definition content, this demand escalates. Imagine multiple residents of a household simultaneously streaming 4K videos – the cumulative demand could easily overwhelm the network. Additionally, factors like unstable Wi-Fi connectivity can hamper the user experience.
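The arithmetic is easy to sketch — the per-stream bitrates below are illustrative assumptions, as real values depend on the codec and platform:

```python
# Illustrative per-stream bitrates in Mbps (assumed values, codec-dependent).
STREAM_MBPS = {"SD": 4, "HD": 8, "4K": 16}


def max_concurrent(quality: str, home_bandwidth_mbps: float) -> int:
    """How many simultaneous streams of a given quality a connection can carry,
    ignoring all other household traffic."""
    return int(home_bandwidth_mbps // STREAM_MBPS[quality])
```

On an assumed 50 Mbps connection, this yields three concurrent 4K streams at best — a fourth viewer, or a large download in the background, already pushes the network past its limit.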
Further challenges arise even after data reaches the end-user device. Limited processing capabilities can degrade the viewing experience, especially since video playback can be resource-intensive.
The complexities of streaming lie in ensuring the consistent delivery of data chunks, even under less-than-ideal network conditions, and adapting as much as possible to the user's specific setup.
Your strategy's efficacy in navigating these network challenges heavily relies on the foundational transfer protocol upon which the more advanced streaming protocol is built. Let's delve deeper into this aspect.

Network Transfer Protocols

When it comes to transferring data across networks, there are two prevalent protocols: UDP and TCP. Both protocols operate in tandem with the Internet Protocol (IP) but are fundamentally distinct in their approaches. Let's explore each to determine which best suits our needs.
Let's first delve into the workings and advantages of UDP.

UDP (User Datagram Protocol)

UDP, or User Datagram Protocol, operates without establishing a persistent connection between nodes. Rather than creating a consistent end-to-end link, data packets are sent independently with their respective destination addresses, guiding the network to route them accordingly.
notion image
The connectionless nature of UDP renders it exceptionally swift. However, this approach offers no assurance that data will arrive at its destination without corruption. On the brighter side, as each packet operates autonomously, any packet loss—whether from network congestion or other issues—doesn't impede the successful transmission of remaining packets.

Use Cases

This lack of reliability means UDP isn't suited for every streaming need. It's most suitable for situations demanding minimal latency where occasional data loss is tolerable, such as in real-time video conferencing or online gaming.
On the other hand, for applications like movie or audio streaming, UDP's unpredictability isn't ideal. Users shouldn't miss a moment of their movie or a beat of their favorite song, even briefly.
Given these considerations, it's essential to explore TCP to determine if it provides a more reliable solution.

TCP (Transmission Control Protocol)

TCP, at its essence, facilitates the establishment of connections between two nodes and ensures the reliable delivery of data packets from one node to the other. To achieve this, TCP adopts a connection-oriented methodology, where both nodes must concur on specific parameters before transferring any data. These parameters encompass the source and destination IP addresses, as well as the source and destination port numbers. After setting these parameters, the nodes can commence the exchange of data packets across the network.
For a stable connection to be formed, the two nodes initiate a process known as the three-way handshake. The procedure is outlined as follows:

The Three-Way Handshake Process

notion image
  1. The client SYNchronizes its sequence number with the server.
  2. The server ACKnowledges the client's sequence number, then SYNchronizes its own sequence number.
  3. The client then ACKnowledges the server's sequence number.
It's commonly called the three-way handshake because of these three messages. Once they are sent, the machines have established a reliable connection and are ready to start sending data over it.
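You won't implement the handshake yourself — the operating system runs it whenever a TCP connection is opened. In this minimal Python loopback sketch, `socket.create_connection` performs the SYN, SYN-ACK, ACK exchange before any application data is sent (the echo payload is just an illustration):

```python
import socket
import threading


def echo_server(server: socket.socket) -> None:
    conn, _ = server.accept()          # completes the three-way handshake
    with conn:
        conn.sendall(conn.recv(1024))  # echo the payload straight back


# Listening socket on an OS-assigned loopback port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

# SYN / SYN-ACK / ACK happen inside this call, before any data moves.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello")
    reply = client.recv(1024)
```

Everything TCP guarantees — ordering, retransmission, acknowledgments — happens transparently beneath these few calls.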

Advantages vs. Disadvantages

The primary advantage of using TCP lies in its reliability; it ensures data delivery by sending acknowledgments to the sender once data is successfully received. This guarantees that lost packets are detected and that the data arrives complete and in order.
If certain acknowledgments don't return within a specified timeframe, the sender presumes that some packets might have been lost or corrupted and thus resends them. Until all packets are successfully acknowledged, this process continues.
Acknowledgments from both ends also regulate the data transmission rate, aiding in buffering durations for users.
notion image
However, TCP's emphasis on reliability makes it slower due to its error-checking mechanisms and the potential need to resend lost packets. The three-way handshake required to establish a connection can also introduce delays. Furthermore, every packet carries a checksum that must be verified on arrival, adding some overhead to TCP's operations.

Use Cases

Despite these challenges, TCP remains the favored protocol for media streaming. Prominent streaming protocols such as HTTP Live Streaming (HLS) and MPEG-DASH are both built on HTTP, and therefore on TCP.
Diving into the intricate workings of these protocols is beyond the scope of this course. Moreover, such in-depth knowledge isn't anticipated in a system design interview. Yet, there's an essential aspect of these protocols you should be familiar with: Adaptive Bitrate Streaming.

Adaptive Bitrate Streaming

When network conditions fluctuate and TCP detects packet loss, its congestion control automatically slows down the transmission rate. The streaming client notices the resulting drop in throughput and responds by requesting smaller, lower-bitrate chunks for the upcoming segments.
This is where our varying media file versions come into play! A drop in resolution results in smaller chunks for the equivalent video playback duration.
This adaptation ensures a continuous, seamless stream. It is equally valuable on high-latency networks, such as limited-bandwidth mobile connections: by stepping down the quality, the client can deliver a smooth stream even over extended distances.
You might have personally encountered the impact of adaptive streaming. There are moments when watching a YouTube video where the video quality suddenly diminishes. Such a change is a response to a drop in available bandwidth. YouTube adjusts to this by transmitting chunks of a lower resolution, ensuring that the video continues to play seamlessly.
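A minimal sketch of the selection logic a player might use — the rendition bitrates and the 80% safety margin are illustrative assumptions:

```python
# Illustrative rendition bitrates in kbps, highest quality first (assumed values).
RENDITIONS = [("4K", 16000), ("1080p", 5000), ("720p", 2500), ("480p", 1000)]


def pick_rendition(measured_kbps: float, headroom: float = 0.8) -> str:
    """Choose the highest rendition whose bitrate fits within a safety margin
    of the throughput measured while downloading recent chunks."""
    budget = measured_kbps * headroom
    for name, kbps in RENDITIONS:
        if kbps <= budget:
            return name
    return RENDITIONS[-1][0]  # fall back to the lowest quality
```

The player re-runs this decision for every chunk it requests, which is why quality can step down mid-video and recover again once bandwidth returns.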
Now, let's explore how we can incorporate our understanding of streaming into a robust architecture that empowers your system to perform similarly.

Solution Architecture

notion image
Regardless of the specific streaming protocol you choose to employ, both HLS and MPEG-DASH are rooted in HTTP/TCP, allowing us to use a conventional REST API to provide access to the media files.
To manage this, we introduce a client-facing component, which we'll refer to as the video service. This service is responsible for processing incoming requests and consulting the database to fetch the necessary media chunks.
On the user's side, a client application operates, which could take the form of a native app, a mobile application, or even a web-based application functioning within a browser.
This foundation serves as the basis for a rudimentary streaming architecture, with most of the heavy lifting done by the streaming protocol.

Caching in Streaming Systems

Have you considered the immense potential for caching within the streaming system?
The metadata accompanying each media file, including details like the title, description, tags, and more, is crucial for generating each movie's unique page.
By caching these data elements, we could dramatically cut down on database reads, thereby ensuring quicker responses to user requests. Take, for instance, most streaming platforms; typically, about 10% of their content garners 90% of all views. Conversely, the remaining 90% of content only attracts 10% of views.
While caching metadata offers a tangible boost in performance, the more pressing concern remains minimizing the latency of the actual media file. Fortunately, a caching technique exists that facilitates the proximity of our media files to the end-users: the Content Delivery Network, or CDN.
notion image
As discussed earlier in this lecture, a significant portion of the total latency is determined by the physical distance the content has to traverse from our server to the end-user. Therefore, regardless of the current throughput and available bandwidth, it's always beneficial to minimize the distance our data must cover.
This is precisely the problem CDNs are designed to address.
A CDN consists of a collection of caching servers housed in strategically positioned data centers across the world, aiming to serve users in their vicinity.
Referred to as Points of Presence (PoPs), these locations house copies of our original content. By positioning the data closer to users, the round trip time (RTT) is substantially reduced, resulting in faster access for the end-user.

How Do CDNs Work?

Let's delve into how data is cached within a CDN. Essentially, the CDN functions as a gateway for all your inbound traffic, which is why, when creating a system diagram, you'd typically position the CDN between the client application and your main application server.
notion image
With that structural understanding in place, how does the caching process actually play out in the context of video streaming?
  1. A user from a distant location sends a request to stream a movie.
  2. This request first reaches the CDN. If the CDN doesn't have a cached version of the requested chunk, it forwards the request to the origin server, which returns the desired content and sends a copy to the CDN server nearest to that user.
  3. Based on the specific caching policy, the CDN server either saves this copy on its disk or retains it in memory.
  4. When another user from the same geographical area requests that particular movie, it's the CDN server – not the origin server – that fulfills the request.
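The steps above can be sketched as a simple read-through cache, with a dictionary standing in for the origin server and another for the PoP's local store:

```python
# A dictionary standing in for the origin server's media storage (illustrative).
ORIGIN = {"movie-1/chunk-0": b"\x00\x01\x02"}


class EdgeCache:
    """A PoP that serves from its local store and falls back to the origin."""

    def __init__(self) -> None:
        self.store: dict[str, bytes] = {}
        self.origin_hits = 0  # how often we had to go all the way back

    def get(self, key: str) -> bytes:
        if key not in self.store:       # cache miss: fetch and keep a copy
            self.origin_hits += 1
            self.store[key] = ORIGIN[key]
        return self.store[key]          # cache hit: served from the edge
```

Only the first request per region pays the long round trip; every subsequent request is answered locally.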
notion image
While it might seem like CDNs are primarily for media files, their utility extends far beyond that.

Caching of Websites

Contrary to popular belief, CDNs are frequently used to serve website data. Consider this: when a user accesses a website, they are essentially requesting a bundle of HTML, JavaScript, and CSS files, along with various assets such as SVG icons and images. Depending on a website's complexity, these bundled files can be sizeable, significantly impacting the site's initial loading time.
And here's a crucial point to note: even minor delays in loading can have outsized impacts. Research indicates that a mere one-second delay can result in a 7% decrease in conversions and an 11% drop in page views. By reducing network latency through CDNs, commercial websites can directly boost views and, by extension, revenue.
In theory, a CDN can be employed to cache and deliver an entire website. However, as a website evolves and matures, it often features dynamic content tailored to specific regions or individual users.
Take an e-commerce platform as an example. Such a site might offer product recommendations based on a user's shopping history. Or it might host time-sensitive promotions, like flash sales. Picture a scenario where a sale is scheduled to end at 1 pm, but due to cached data, users can still access the discounted prices until 4 pm. Such oversight could result in substantial financial losses.

Addressing Cached Data Updates

While caching certainly has its advantages, updating cached data can pose a challenge, especially for data that isn't highly dynamic but still requires updates.
Consider this scenario: you're part of a team developing an exciting web application. After completing the latest development sprint, you're eager to roll out new features to your audience.
Aware of the CDN in use, you access your cloud provider's CLI to invalidate cached assets before deploying the new updates to your origin server. This should, in theory, prompt users to fetch all files directly from the origin server. But the reality is slightly more complex.
notion image
Here's why: Internet Service Providers (ISPs) often cache popular websites. This not only saves them on network costs but also eases the load on their infrastructure. Similarly, user browsers cache data to improve load times and save on bandwidth.
This multi-layered caching can mean that even if you've invalidated your CDN cache, it's uncertain if all users would experience the changes immediately.
So, given these intricacies, how should you approach updates? Instead of merely invalidating caches when making site modifications, it's essential first to ensure you're using the right cache headers. Beyond that, one effective strategy is cache busting. Let's dive deeper into how this works.

Expires and Max-age

Modern websites often set caching headers by default. However, for optimal results, these headers should be tailored according to the specific use-case at hand.
One such header is the expires header, which dictates the exact time when an object should be purged from the cache.
On the other hand, the max-age header determines how long an object remains cached. It's commonly set to 86400 seconds (one day). So, if a website update is deployed, it's guaranteed to be visible to all users after at most 24 hours, once the cached copies have expired.
By fine-tuning these headers, there's no need to manually invalidate the CDN cache, ensuring all users see the updated content after the stipulated time. However, there's a caveat: these headers can't make an update appear instantly for all users, since cached copies remain valid until they expire. This is where the technique of Cache Busting becomes invaluable.

Cache Busting

Cache Busting, while intricate, offers an efficient way to instantly roll out updates to every user.
Every time a new version of a web application is deployed, links to files and assets are modified. Since CDN caches perceive these as entirely new files due to the altered URLs, they're treated as such.
As a result, the first user accessing the domain post-update triggers the CDN to fetch the revamped bundle directly from the origin server, courtesy of the changed URL. This ensures the immediate delivery of the updated version to all users.
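A minimal sketch of how such fingerprinted file names can be generated at build time — the naming scheme (a short content hash before the extension) is an illustrative convention, not a standard:

```python
import hashlib


def fingerprinted_name(filename: str, content: bytes) -> str:
    """Embed a short content hash in the file name (e.g. app.3f2a9c1d.js) so
    every deploy produces a new URL and caches treat it as a new object."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{filename}.{digest}"
```

As long as the content is unchanged, the name (and thus every cached copy) stays valid; the moment a single byte changes, the URL changes and all caches are bypassed automatically.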
Remember, Cache Busting isn't just pertinent to websites. It's equally significant when caching media file chunks in a streaming context. This ensures users always get the latest, high-quality streaming content without delay.

Related System Design Interview Questions

Streaming systems are typical high-level design questions in system design interviews. Here you'll find my in-depth guides for the most common systems:
  • Netflix
  • YouTube
  • Spotify


Summary

Before we conclude, let's briefly recap what we've explored in this lecture:
  • To ensure consistent streaming to our users, we need to counteract the unpredictability of network connections, while also personalizing the stream to suit individual user setups, such as their home network and device specifics.
  • Consequently, opting for the right streaming protocol is crucial. Such protocols can be founded upon either UDP or TCP. While UDP boasts speed, it sacrifices reliability. In contrast, TCP ensures data integrity but can be slower.
  • In terms of media streaming, TCP-centric streaming protocols, such as HTTP Live Streaming and MPEG-DASH, emerge as superior choices since missing out on frames would adversely affect the user experience.
  • These protocols incorporate adaptive bitrate streaming, allowing for dynamic quality adjustments in response to prevailing network conditions.
While we've developed a mechanism to adjust the stream based on current network conditions, this doesn't address the inherent latency caused by extensive distances separating the user from the server. A perpetual reduction in video quality isn't a viable solution for geographically distant users.
Therefore, our challenge lies in devising a strategy to bring our media files closer to the end-users. We'll be delving into potential solutions in our upcoming lecture.