Dropbox System Design Deep-Dive

Dropbox is a file hosting and sharing service operated by the US company, Dropbox, Inc., headquartered in San Francisco, California. Its founder, Drew Houston, conceived the idea for Dropbox after repeatedly forgetting his USB flash drive while studying at MIT. In 2007, Drew, along with Arash Ferdowsi, founded the company with initial funding from the renowned seed accelerator, Y Combinator.

What’s unique about Dropbox?

The principle behind Dropbox is both simple and ingenious, significantly influencing the popularity of commercial file-sharing services.
Dropbox consolidates files into one central location by creating a special folder on the user's computer. This folder's contents are synchronized to Dropbox's servers and to other devices used by collaborators, ensuring the files remain updated across platforms.
Operating on a freemium business model, Dropbox offers users a free account with limited storage capacity. However, with paid subscriptions, users can access more storage and additional features.
Dropbox provides desktop client apps for Microsoft Windows, Apple macOS, and Linux computers. Mobile apps are also available for iOS, Android, and Windows Phone smartphones and tablets. Additionally, a website interface facilitates access for users.
Before diving into your mock interview, take a moment to recognize your progress. You've reached a stage where you're already familiar with most of the interview steps.
 

System Analysis

Before delving into the system design to meet Dropbox's use case, it is beneficial to analyze the fundamental nature of the system. Understanding this helps in prioritizing system attributes, as the nature directly influences the architectural decisions. This includes considerations like scaling databases, running multiple service instances, and optimizing caching strategies.

1. Read- or Write-Heavy?

Dropbox, a file-sharing service, is predominantly read-heavy. Although files are uploaded (write) by users, each file upload can trigger multiple downloads (read) as different clients synchronize to update their local versions with the most recent file.

2. Monolithic vs. Distributed Architecture

Given the scale at which Dropbox operates, a distributed system is imperative. Relying on a monolithic architecture and running it all from a single server would be insufficient to handle the vast number of users and the amount of data processed.

3. Availability or Global Consistency?

In the context of Dropbox, where files are frequently shared and accessed by multiple users across various devices, maintaining consistency is crucial. This avoids the complexities of merging different document states after resolving network partitions, ensuring that all users view the most up-to-date version of any document.
While prioritizing consistency might reduce the system's availability during network failures, Dropbox mitigates this by allowing users to continue accessing locally stored versions of their files. This approach ensures that users are not completely cut off from their data even when the system cannot immediately synchronize changes due to network issues.
 
These insights into the system's nature guide the entire design process. Opting for a distributed architecture over a monolithic setup influences critical aspects like architectural style, database distribution, and data consistency models. For Dropbox, choosing a strategy that emphasizes consistency over availability helps in maintaining a seamless and reliable user experience, even in the face of potential network disruptions.
This choice also impacts infrastructure decisions, such as opting for cloud-native solutions, implementing effective load balancing, and deploying advanced caching mechanisms to enhance performance and reliability.

Requirements

Our system should meet the following requirements:

Functional Requirements

  1. File Synchronization - A file should synchronize across devices after it has been updated on any of them.
  2. Data Snapshot - The system should maintain a change history of each file.
  3. Multi-Client - The user should be able to upload, download, update, and delete files from any device.

Alternative Features

When outlining the scope of your Dropbox system design, consider including no more than 2-3 features besides the core functionalities. Here are some commonly suggested features for a file-sharing system like Dropbox:
  • File Version History: Enable users to access and revert to previous versions of files.
  • Collaborative Editing: Allow multiple users to edit documents simultaneously and see real-time updates.
  • Selective Sync: Give users the choice to select specific folders or files for synchronization to their devices.
  • Advanced Security Options: Implement features such as end-to-end encryption, two-factor authentication, and secure links.
  • File Locking: Prevent conflicts by allowing users to lock files they are editing.
  • Photo and Video Auto-upload: Automatically upload photos and videos from connected devices.
  • Integration with Third-party Apps: Seamlessly connect with office tools, project management apps, and more.
  • Shared Link Controls: Provide users with the ability to set expiration dates and password protection for shared links.
  • Activity Monitoring and Notifications: Inform users about changes and updates to shared files or folders through push notifications.
In an interview, it's beneficial to discuss the features you're most knowledgeable about in detail. For this example, I will focus on the features already outlined as part of the functional requirements.

Non-Functional Requirements

  1. Multi-tenancy: The system needs to keep each customer's data strictly separate.
  2. Resilience: The system must ensure that no customer data is ever lost.
  3. Minimal Latency: Files should sync with minimal latency.
 

Capacity Estimation

Let's start with the estimation and constraints.
🙋
Before diving into detailed capacity estimations during an interview, clarify the interviewer's expectations. Recently, interviewers tend to focus on estimates that directly influence design decisions, rather than requiring comprehensive estimations.

Throughput

Assumptions

  • 100 million daily active users (DAU)
  • read/write ratio of 10:1
  • each user updates 2 files per day (writes)
  • 2x peak loads
 

1. Write Requests (per day)

First, you estimate the requests per day. You do that by multiplying the number of active users by the average number of write activities per user per day.
This gives 200 million requests per day, and these are the write requests only.
Next, we account for potential peak loads by doubling this number.
Finally, you need to convert this number into requests per second. The easiest way is to write the number in scientific notation and divide by 10^5, the heavily rounded shorthand for the roughly 86,400 (≈ 100k) seconds in a day. In case you didn't remember this, it's on the cheat sheet.
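Worked out with our assumptions:
Writes = 10^8 DAU × 2 writes/day = 2 × 10^8 requests/day
Peak = 2 × (2 × 10^8) = 4 × 10^8 requests/day
RPS(write) = 4 × 10^8 / 10^5 = 4 × 10^3 = 4,000 write requests/s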

2. Read Requests

We know that each user updates 2 files per day, so those are the write requests. With the read/write ratio, we can also estimate how many read requests we have.
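Applying the 10:1 read/write ratio:
RPS(read) = 10 × RPS(write) = 10 × (4 × 10^3) = 4 × 10^4 = 40,000 read requests/s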

Bandwidth

Assumptions

  • File size of 1 MB on average.
  • File changes have an average size of 100 kB.

Writes Bandwidth

To estimate the required bandwidth for writes, you multiply the write RPS by the size of each file change and convert the result into a meaningful unit.
💡
Wondering how kB converts to MB? 1 MB = 10^3 kB = 10^6 B.
 
Bandwidth(write) = 4 × 10^3 req/s × 100 kB = 4 × 10^5 kB/s = 4 × 10^8 B/s = 400 MB/s
Note that the file sizes are given in bytes, so the result is 400 megabytes per second, not megabits.

Reads Bandwidth
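Assuming synchronizing clients also pull only the changed chunks (100 kB) rather than full 1 MB files, the same approach gives:
Bandwidth(read) = 4 × 10^4 req/s × 100 kB = 4 × 10^9 B/s = 4 GB/s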

Storage

The use case of file sharing requires a small tweak: normally, we would use the bandwidth to estimate the required storage. But here, the bandwidth relates to the uploaded updates, not to entire files being stored. That's why we need to calculate how many new files are stored, not updated, per second, and then how much storage this will take over the next 5 years.

Assumptions

  • File size of 1 MB on average.
  • Replication factor of 3x

Storage per Second
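The requirements don't specify how many brand-new files users create, so as a deliberately generous assumption, let's treat every write as storing a new 1 MB file:
Storage/s = 4 × 10^3 files/s × 1 MB = 4 × 10^9 B/s = 4 GB/s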

In 5 years
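With roughly 3.15 × 10^7 seconds per year, 5 years amount to about 1.6 × 10^8 seconds:
Storage(5y) = 4 × 10^9 B/s × 1.6 × 10^8 s ≈ 6.4 × 10^17 B = 640 PB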

With Replications
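Applying the 3x replication factor from our assumptions:
Storage(replicated) = 3 × 640 PB ≈ 1.9 EB ≈ 2 EB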

Data Model

It's time to define the data model and decide which databases to use.
We're aware that Dropbox doesn't only have a server-side application but also a client, which likely has its own database. Creating a data model for both could be quite time-consuming, perhaps excessively so. In such instances, it's perfectly acceptable to omit certain components of a system. However, it's crucial to communicate this decision to your interviewer and secure their agreement. Making this choice without discussing it might give the impression that you overlooked it.
Let's begin by examining our requirements to pinpoint entities, their attributes, and the relationships among them.

Entities & Attributes

This time, the requirements we defined don't give away all the detail we need to define the data model explicitly, but we can still find enough hints to cover all entities with their most critical properties.

Entities

  • Users
  • Chunks
  • Files (metadata)

Properties

User

  • A userId
  • A files property to store the handles of all files they ever created.

Chunks

  • id
  • fileId

Files

  • fileId
  • owner
  • last edited (timestamp)
  • edited by (user)
  • file name
  • size
  • chunks []
  • snapshots []
  • Files know which chunks belong to them
  • Files know which user updated them
  • Files know which chunks make up a historic snapshot
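To make the model concrete, here is a sketch of these entities as TypeScript types; the exact field names and types are illustrative assumptions based on the properties above.

```typescript
// Sketch of the data model; names and types are illustrative.
interface User {
  userId: string;
  files: string[]; // handles (fileIds) of all files the user ever created
}

interface Chunk {
  id: string;     // identifier (e.g., content hash) of the chunk
  fileId: string; // the file this chunk belongs to
}

interface FileMetadata {
  fileId: string;
  owner: string;         // userId of the creator
  lastEdited: number;    // timestamp of the last update
  editedBy: string;      // userId of the last editor
  fileName: string;
  size: number;
  chunks: string[];      // chunk ids that make up the current version
  snapshots: string[][]; // chunk ids per historic snapshot
}
```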

Relationships

  • Users create, read, update, and delete files.
  • Files link the chunks they are made of and the ones that belong to historic snapshots.
 

Databases & Cardinality

Now, let's consider the most appropriate databases for our data.
User data perfectly exemplifies the benefits of a relational database. Not only is user data easy to normalize, helping us avoid redundancy and consequently reducing the system's storage footprint, but relational databases also ensure data consistency.
The pivotal question is: how should we store our metadata and file chunks?
Recalling the lecture on file storage from earlier in this section, there appears to be a clear method for storing our file chunks: we need a block storage database. Essentially, this type of database is a key-value store for data chunks. It retains the chunk and nothing more. As a result, we'll require an additional database to maintain the metadata for each file.
Considering our functional requirements, there's no need to pull data from multiple files, and based on our data model, users will be aware of the files they've created. Hence, a traditional key-value store seems to be an apt choice for the metadata.
However, this database will hold metadata like names, chunk details, and more. It will be subject to modifications by numerous users, potentially simultaneously. Therefore, this database needs robust ACID properties. A NoSQL database may not necessarily offer these attributes. Given this crucial non-functional requirement, a relational database emerges as the top choice here.

API Design

In this step, we'll sketch out an API design to depict how the client would communicate with our remote server.
Our objective is to minimize the transmission of unnecessary chunks. Instead, we prefer to transmit hashes first to determine which chunks are required. With this concept in mind, we must establish the following API endpoints:

Compare Hashes

Our first endpoint is called compareHashes, which only needs the fileId and the hashes of all associated chunks as parameters.

Parameters

fileId (string): The unique ID of a file. chunkHashes ([]string): Array of data chunk hashes.

Response

The endpoint returns the hashes associated with chunks that diverge from the file version the remote server has.

Upload Changes

The next endpoint allows the client to actually send the diverging chunks to the remote server. The parameters are the fileId and the locally updated chunks or blocks.

Parameters

fileId (string): The unique ID of a file. chunks ([]string): Array of binary data chunks.

Response

The response returns a simple success message once all chunks are uploaded.

Request Updates

The last endpoint is called requestUpdates. It allows the client to update its local version after it has been notified that the local file started to diverge from the remote version. Here we need the fileId again, plus the hashes of the chunks that are supposed to be pulled from the remote server.

Parameters

fileId (string): The unique ID of a file. chunkHashes ([]string): Array of data chunk hashes.

Response

The response includes the chunks that are supposed to be updated on the local client.
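To summarize the contract, here is a minimal sketch of the three endpoints as a TypeScript interface. The transport (e.g., REST over HTTPS) and the exact payload types are assumptions for illustration; chunks are represented as base64-encoded strings.

```typescript
// Sketch of the sync API; payload types are illustrative assumptions.
interface SyncApi {
  // Returns the hashes of chunks that diverge from the server's version.
  compareHashes(fileId: string, chunkHashes: string[]): Promise<string[]>;

  // Uploads the diverging chunks; resolves with a success flag once
  // all chunks are persisted.
  uploadChange(fileId: string, chunks: string[]): Promise<{ success: boolean }>;

  // Returns the chunks the client should apply to its local copy.
  requestUpdates(fileId: string, chunkHashes: string[]): Promise<string[]>;
}
```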
 

Core Feature Design

The Client

Begin by drawing the client architecture, first adding the files, which are stored in the local file system outside the scope of our application.
Watch Service
  • Once this service recognizes a change to a file, it starts the sync process. However, the watch service doesn't do anything else but trigger the process (see the sketch after this list).
Remote Update Service
  • We need a second service that houses the rsync algorithm and manages the communication with the server. Let's call it the Remote Update Service.
Database
  • The rsync algorithm needs files to be cut into chunks to compare their hashes. So let's persist these hashes.
  • A simple lightweight SQL database will do.
Local Update Service
  • To receive incoming hashes and check the database for locally available chunks, the client needs another service. Let's call it the Local Update Service, in contrast to the Remote Update Service, which sends out data to update the remote version of files.
  • The new service would receive notifications with hashes, check the database, and request the missing chunks from the server. The server would then return them. Last but not least, the service updates the actual files in the local file system.
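As promised above, here is a minimal sketch of the watch service handing off to the Remote Update Service, using Node's built-in fs.watch (recursive watching is platform-dependent); the RemoteUpdateService interface is a hypothetical stand-in.

```typescript
import { watch } from "fs";

// Hypothetical stand-in for the service that runs the rsync-style sync.
interface RemoteUpdateService {
  sync(filePath: string): Promise<void>;
}

// The watch service does nothing but detect changes and trigger the sync.
function startWatchService(folder: string, updater: RemoteUpdateService): void {
  watch(folder, { recursive: true }, (_eventType, filename) => {
    if (filename) {
      // Fire and forget: the Remote Update Service owns the actual sync.
      updater.sync(`${folder}/${filename}`).catch(console.error);
    }
  });
}
```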
 
Next we get to the cloud service.

The Cloud Service

Now it's time to design the architecture for the remote server that handles updating the remote file version.
The whole flow starts when the compareHashes(fileId, chunkHashes) endpoint is called. It is exposed by the sync service.
The service then reads out the hashes stored on the server and compares them with the ones that were sent by the client. Then it sends back the hashes it can't match.
Then uploadChange(fileId, chunks) is called, passing the freshly updated chunks to the server.
The sync service doesn't store the incoming data itself; it passes it on to the file service, which handles all tasks related to reading and writing data.
The chunks, metadata, and hashes are stored in two databases: a SQL database for the metadata and hashes, while the file chunks go into block storage.
It's important to note that we don't have to recreate the files as a whole on the server; here, we are only interested in chunks.
To allow this architecture to handle the high number of users we estimated, I add a load balancer that routes the incoming requests to one of multiple instances of our services.
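A minimal sketch of the comparison step inside the sync service, assuming a hypothetical loadStoredHashes lookup for the chunk hashes persisted per file:

```typescript
// Hypothetical lookup of the chunk hashes stored for a file.
declare function loadStoredHashes(fileId: string): Promise<Set<string>>;

// Returns the client-side hashes the server can't match, i.e. the
// chunks the client needs to upload via uploadChange().
async function compareHashes(
  fileId: string,
  chunkHashes: string[],
): Promise<string[]> {
  const stored = await loadStoredHashes(fileId);
  return chunkHashes.filter((hash) => !stored.has(hash));
}
```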
We still lack the infrastructure to support remote updates. What would you do about that? Remember, we want to notify all clients at once that a new, updated version is available. Can you recall which kind of technology is most suitable and why?
Here, the server mirrors the architecture of the client a bit. It has a watch service that monitors the status of the stored file chunks. Once a new chunk is written to the database, it triggers another service - the notification service.
To actually send out the notifications to all the different clients, Server-Sent Events (SSE) is the technology of choice.
Remember, it's a web technology that enables asynchronous, event-based communication between the server and the client. It is designed to be used with the JavaScript EventSource API, which is supported by all modern browsers. The major limitation of SSE is its unidirectional nature, which means the server can't monitor the health of listening clients. But that limitation is negligible for our use case.
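Here is a minimal SSE sketch using Node's built-in http module; the /events path and the file-updated event name are illustrative assumptions.

```typescript
import { createServer, ServerResponse } from "http";

// All currently connected SSE clients.
const clients = new Set<ServerResponse>();

// Server side: clients subscribe to /events and stay connected.
createServer((req, res) => {
  if (req.url === "/events") {
    res.writeHead(200, {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    });
    clients.add(res);
    req.on("close", () => clients.delete(res));
  } else {
    res.writeHead(404).end();
  }
}).listen(8080);

// Called by the notification service once a new chunk was written.
export function notifyFileUpdated(fileId: string): void {
  for (const res of clients) {
    res.write(`event: file-updated\ndata: ${JSON.stringify({ fileId })}\n\n`);
  }
}

// Client side (browser): listen for updates, then pull the changed chunks.
// const source = new EventSource("/events");
// source.addEventListener("file-updated", (event) => {
//   const { fileId } = JSON.parse((event as MessageEvent).data);
//   // ...now call requestUpdates(fileId, localHashes).
// });
```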
 
Do you remember that we also talked about message brokers? Would it make sense to add a queue to the design? If so, where and why?
The straightforward answer is yes, it does make sense! One very clear use case is to decouple the notification service from the watch service. Imagine what would happen if the notification service were unavailable for some reason and the watch service could not pass on the information about a change in a monitored file - the whole file-sharing logic would be disrupted.
Better to add a message queue in between, which persists the information until the notification service is able to process each message.
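A broker-agnostic sketch of this decoupling; the MessageQueue interface stands in for whatever broker you'd pick (Kafka, RabbitMQ, SQS, ...):

```typescript
// Generic broker interface; a real broker client would fill this role.
interface MessageQueue {
  publish(topic: string, message: string): Promise<void>;
  consume(topic: string, handler: (message: string) => Promise<void>): void;
}

// The SSE fan-out from the previous sketch.
declare function notifyFileUpdated(fileId: string): void;

// Watch service side: persist the change event and move on.
async function onChunkWritten(queue: MessageQueue, fileId: string): Promise<void> {
  await queue.publish("file-changes", JSON.stringify({ fileId }));
}

// Notification service side: messages wait in the queue until this
// consumer is available to process them.
function startNotificationConsumer(queue: MessageQueue): void {
  queue.consume("file-changes", async (message) => {
    const { fileId } = JSON.parse(message);
    notifyFileUpdated(fileId);
  });
}
```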

Support Feature Design

The core features are implemented. Looking at the support features, Data Snapshot can also be supported by keeping a record of all chunks that made up a file at a certain point in time. Multi-Client is a support feature that can be achieved through implementation details you would not necessarily see in an architecture drawing. To be accessible via phones, tablets, and PCs with different operating systems, you would have to implement native clients for all of them, or go with a highly responsive web application if you want good value for the cost.

Design Discussion

The design discussion evaluates a candidate's ability to architect and scale complex systems like Dropbox, a popular file-sharing service. This section provides a detailed list of typical questions along with solution drafts and references to in-depth articles that elaborate on more advanced concepts.

Basic Functionality

Questions on "Basic Functionality" explore a candidate's understanding of the essential operations and core features of Dropbox. They assess how well the candidate grasps the fundamental processes, data flows, and user interactions crucial to the system's functionality.

How does file synchronization work in Dropbox?

  • File Watching: Monitor changes in the file system to detect updates, deletions, or new files.
  • File Chunking: Break files into smaller chunks to optimize the synchronization process, reducing the amount of data transferred.
  • Differential Sync: Only sync the parts of the file that have changed, rather than the entire file.
  • Version Control: Maintain multiple versions of files to prevent data loss and allow users to revert to previous versions.
  • Conflict Resolution: Automatically handle conflicts when the same file is modified by multiple users at the same time.

How are files stored in the Dropbox system?

  • Block-Level Storage: Store files as independent blocks in a distributed database to enhance retrieval and updating efficiency.
  • Data Deduplication: Implement deduplication to avoid storing identical data blocks, saving storage space and bandwidth (see the sketch after this list).
  • Encryption: Secure files at rest and in transit with strong encryption protocols to protect user data.
  • Replication: Replicate data across multiple data centers to ensure high availability and data durability.
  • Indexing: Use indexing mechanisms to quickly locate files and file chunks within the storage architecture.
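As an illustration of block-level storage with deduplication, chunks can be stored content-addressed, keyed by their SHA-256 hash (a simplified in-memory sketch; real systems also track reference counts):

```typescript
import { createHash } from "crypto";

// Content-addressed chunk store: identical chunks map to the same key,
// so duplicate content is stored only once.
const chunkStore = new Map<string, Buffer>();

function storeChunk(chunk: Buffer): string {
  const hash = createHash("sha256").update(chunk).digest("hex");
  if (!chunkStore.has(hash)) {
    chunkStore.set(hash, chunk); // only previously unseen content consumes storage
  }
  return hash; // the file's metadata references this hash in its chunk list
}
```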

What happens when a user shares a file in Dropbox?

  1. Sharing Interface: The user selects a file or folder to share and specifies the recipients.
  2. Permission Settings: The user sets permissions, determining whether recipients can view or edit the contents.
  3. Metadata Update: The system updates metadata to reflect the new sharing settings and notifies the recipients.
  4. Access Control: Implement access controls to ensure that only authorized users can view or modify shared files.
  5. Activity Logging: Log activities related to the shared file for auditing and tracking purposes.

Scalability and Performance

Questions on "Scalability and Performance" assess how Dropbox is designed to efficiently handle growth in user demand and data volume. These explore strategies for optimizing system resources and maintaining high performance under increasing loads.

How does Dropbox scale with increasing numbers of users and files?

  1. Horizontal Scaling: Expand the number of servers and storage resources to distribute the load more evenly.
  2. Load Balancing: Employ load balancers to manage user requests across servers, enhancing responsiveness.
  3. Resource Partitioning: Partition resources such as databases and storage to minimize load on any single server.
  4. Caching Strategies: Utilize caching to store frequently accessed files and metadata, reducing database load.
  5. Content Delivery Network (CDN): Use a CDN to distribute user data geographically closer to users, reducing latency.

What caching strategies are employed to improve performance?

  1. In-Memory Caching: Deploy in-memory caches like Redis to store file metadata and small, frequently accessed files.
  2. Edge Caching: Implement edge caching in the CDN to store popular content close to the users.
  3. Lazy Loading: Load only the necessary data when needed, rather than pre-loading large amounts of data.
  4. Cache Invalidation: Develop a robust cache invalidation strategy to ensure data freshness.
  5. Adaptive Caching: Dynamically adjust the size and scope of caches based on user activity and system load.

How is high availability achieved in the Dropbox architecture?

  1. Redundancy: Use redundant storage and servers to ensure system availability even if one component fails.
  2. Failover Mechanisms: Automate failover processes to switch to backup systems without service interruption.
  3. Data Replication: Continuously replicate data across multiple locations to prevent data loss and facilitate quick recovery.
  4. Regular Health Checks: Perform regular health checks and maintenance to preemptively address potential failures.
  5. Geographic Distribution: Distribute data centers across various locations to protect against region-specific failures and natural disasters.

What are the backup and disaster recovery plans?

  1. Routine Backups: Conduct routine backups of all data and system configurations to secure backup locations.
  2. Disaster Recovery Drills: Regularly test disaster recovery protocols to ensure they are effective and that the team is prepared.
  3. Data Recovery Capabilities: Enable granular recovery options to restore individual files or entire datasets as needed.
  4. Real-Time Data Protection: Use technologies like RAID and erasure coding to protect data in real-time.
  5. Compliance with Standards: Adhere to industry standards and regulations regarding data backup and recovery processes.
 

Security Concerns

Addressing security concerns involves examining strategies that protect Dropbox against potential threats and vulnerabilities, ensuring the integrity and confidentiality of user data.

How does Dropbox handle data security, particularly with third-party integrations?

  • API Security: Secure API endpoints using OAuth and stringent authentication mechanisms to control third-party access.
  • Data Encryption: Encrypt all data, both in transit and at rest, using industry-standard encryption protocols.
  • Regular Security Audits: Conduct regular security audits and vulnerability assessments to identify and mitigate risks.
  • Third-Party Reviews: Implement a thorough review and approval process for integrating with third-party services.
  • User Privacy Protections: Uphold strict privacy policies to protect user data from unauthorized access and misuse.

What measures are in place to prevent unauthorized access and data breaches?

  • Multi-Factor Authentication (MFA): Require MFA for all users, especially when accessing sensitive data.
  • Role-Based Access Control (RBAC): Enforce RBAC to limit user access based on their role within the organization.
  • Continuous Monitoring: Monitor all system activity for suspicious behavior and potential security threats.
  • Incident Response Plan: Maintain a comprehensive incident response plan to quickly address and mitigate security incidents.
  • User Education: Provide ongoing security training for users to help them recognize and avoid security threats.

Alternative Features

When suggesting additional features for Dropbox during a system design interview, consider enhancements that improve usability, security, or performance. Here are a few possibilities:
  • Smart Sync: Allow users to see and access all their files and folders but only download the data they need, saving local storage space.
  • Enhanced Collaboration Tools: Integrate more robust tools for real-time collaboration, such as document co-editing and in-app communications.
  • Advanced File Management: Implement features such as tagging, automated sorting, and custom views to help users manage large volumes of files more effectively.
  • AI-Driven Insights: Use artificial intelligence to offer insights on file access patterns, suggest files for archiving, or alert users to duplicated content.
  • Enhanced Recovery Options: Provide options for users to recover deleted files or previous versions for extended periods beyond the current limitations.
 
In an interview, it's beneficial to discuss features that you are familiar with and can argue effectively about their implementation and impact. This approach demonstrates your depth of knowledge and understanding of the system's capabilities and future potential.