Dropbox - System Design

While trying to upload or share a file, whether an image, a PDF or a document, you have probably come across the Dropbox file hosting service. Dropbox keeps all your data secure and organized, whether it is business or school files. It also allows you to share your data using tools such as password protection, expiring links and download permissions.

Have you ever wondered what you would do if you were asked to design an application similar to Dropbox within a 45-minute interview?

Such questions are common in the system design round of interviews. Everyone knows that large systems are not designed within a time limit; hundreds of IT professionals have worked on them for decades. The main motive of such a question is not to judge the system you design; rather, the interviewer wants to see the approach you follow to tackle the task.

How Would You Design Dropbox?

When facing such questions, candidates usually latch onto the technical details and start listing complete procedures at once, but that is not the correct way to handle them, as it creates confusion. You may think that you will impress the interviewer by naming the relevant tools and frameworks. But remember, the main aim is to solve the problem in front of you, not to list the means you would use.

The interviewer expects to hear some high-level ideas about the problem. You should be able to explain your approach step by step. So now, let's move forward and look at how you should answer this particular problem.

Note that we will not be designing a cloud storage service here; we assume that the client uses a service like Amazon S3 or something similar to manage files in the cloud.

Step 1: Discuss the primary points

Instead of directly jumping into the solution, you should first ensure that the interviewer understands all the assumptions you would make while designing the system.

Try to collect information regarding the scope of the application or system. Asking questions would help you to clarify what you should focus on and what primary details the interviewer would want to see in the system.

So here, start with the core features of Dropbox and discuss how you would include all the components in your design. If the interviewer wishes to know or add something more, such as a specific service integration, they will tell you about it.

The core features of Dropbox can be:

1. Users should be able to perform the following operations:

  • Upload files
  • Download files
  • Update files
  • Delete files

2. The history of updates should be visible, with the time of each update.

3. Synchronization of files and folders is a must.

Step 2: Analyse the traffic on the application

Try to figure out how popular the system would get and how many people would be engaged with it. Also, calculate the possible number of requests the application server would get each day so that you can analyze how to scale your application resources.

For instance, assume that the Dropbox-like system you design will have:

  • 10+ million unique users
  • 100 million requests per day
  • lots of read and write operations

Analyzing the traffic on your application helps you figure out which scalability approach you should consider to handle the requests simultaneously.
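To make these numbers concrete, the daily request count can be converted into a per-second rate. The sketch below is only back-of-the-envelope arithmetic: it assumes traffic is spread evenly over the day, and the peak factor of 3 is an illustrative assumption, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_rps(requests_per_day: int) -> float:
    """Average requests per second, assuming an even spread over the day."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_day: int, peak_factor: float = 3.0) -> float:
    """Rough peak estimate; peak_factor is an assumed multiplier."""
    return average_rps(requests_per_day) * peak_factor


if __name__ == "__main__":
    daily_requests = 100_000_000  # the 100 million requests per day assumed above
    print(f"average: {average_rps(daily_requests):.0f} requests/second")
    print(f"assumed peak: {peak_rps(daily_requests):.0f} requests/second")
```

With 100 million requests per day, the average works out to roughly 1,157 requests per second, which is the kind of figure you can quote when discussing how many application servers the system might need.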

Understand the Problem Statement

From this question, many students assume that the task is simply to design Dropbox, which would mean using some cloud service and enabling the user to upload and download files whenever they wish. But that's not the real problem.

The primary issue to be solved here is how you will save files and, more importantly, where. Assume that you have shared a file with your colleagues over Dropbox; the file is stored in the cloud. Now you need to update something in the file. Would you upload the whole file again after every update? No, this would never be the right approach. The reasons are:

1. More bandwidth and cloud space utilization

In your Dropbox, you would have to provide a history of all the updates to a file and hence would be required to keep multiple versions of the same file. Even when there are minor changes, you would have to back up and transfer the whole file to the cloud again and again. This is not a good idea, as it requires more bandwidth and more cloud storage.

2. Latency and concurrency

Latency would also be a problem here: it takes more time to upload the whole file again, even if you have made only minor changes to it. Implementing multithreading to upload or download files concurrently would also not be easy when each file is handled as a single unit.
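A quick calculation shows how much the two issues above can cost. The file size, chunk size and bandwidth used here are assumed values chosen only for illustration; the function names are hypothetical.

```python
MB = 1024 * 1024


def bytes_to_upload(file_size, chunk_size, changed_chunks=None):
    """Bytes sent for one update.

    changed_chunks=None models re-uploading the whole file;
    otherwise only the changed chunks are sent.
    """
    if changed_chunks is None:
        return file_size
    return min(file_size, changed_chunks * chunk_size)


def upload_seconds(num_bytes, bandwidth_bytes_per_second):
    """Transfer time, ignoring protocol overhead and network latency."""
    return num_bytes / bandwidth_bytes_per_second


if __name__ == "__main__":
    file_size = 100 * MB   # assumed file size
    chunk_size = 4 * MB    # assumed chunk size
    bandwidth = 1 * MB     # assumed upload speed: 1 MB per second

    whole = bytes_to_upload(file_size, chunk_size)
    partial = bytes_to_upload(file_size, chunk_size, changed_chunks=1)
    print(upload_seconds(whole, bandwidth), "seconds for the whole file")
    print(upload_seconds(partial, bandwidth), "seconds for one changed chunk")
```

Under these assumptions, a small edit costs 100 seconds of upload time when the whole file is sent, and 4 seconds when only the affected part is sent.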

Present Your Solution

Once you have understood the problem statement, identified the issues and explained them to the interviewer, you are expected to give high-level solutions for the issues you mentioned. To solve the problem of uploading/downloading the complete file again and again, you can break the file into multiple parts, or chunks. It is then no longer necessary to transfer the entire file each time; you only have to update the particular chunk that changed.
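The idea of chunking can be sketched in a few lines. This is a minimal illustration that assumes fixed-size chunks and SHA-256 hashes for comparison; the chunk size is tiny on purpose so the example is easy to follow, and the function names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small, real systems would use megabytes


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Cut the data into consecutive fixed-size pieces (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """One SHA-256 hex digest per chunk, in file order."""
    return [hashlib.sha256(c).hexdigest() for c in split_into_chunks(data, chunk_size)]


def changed_chunk_indices(old: bytes, new: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Indices of chunks in the new version that differ from the old version
    (or did not exist before) and therefore need to be uploaded."""
    old_hashes = chunk_hashes(old, chunk_size)
    new_hashes = chunk_hashes(new, chunk_size)
    return [
        i for i, digest in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != digest
    ]


if __name__ == "__main__":
    print(changed_chunk_indices(b"aaaabbbbcccc", b"aaaaBBBBcccc"))
```

One limitation of fixed-size chunks is worth mentioning to the interviewer: inserting bytes near the start of a file shifts every later chunk boundary, so all the following chunks look changed.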

Now the question arises: how would you identify which chunk belongs to which file? To solve this issue, we also maintain a metadata file. The metadata file stores the indices of the chunks along with the file they belong to; you always need to keep this file updated and synchronized with the cloud. It can be accessed/downloaded from the cloud whenever needed.

After this, you must discuss all the other components that would be used to make this happen and make the system more functional.
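One possible shape for such a metadata file is a small manifest that lists the chunk identifiers of a file in order. The JSON layout, the field names and the use of SHA-256 as the chunk identifier are assumptions made for this sketch.

```python
import hashlib
import json


def chunk_id(chunk: bytes) -> str:
    """Identify a chunk by the SHA-256 digest of its content."""
    return hashlib.sha256(chunk).hexdigest()


def build_manifest(file_name: str, chunks: list) -> dict:
    """Record which chunks make up the file, in order."""
    return {"file": file_name, "chunks": [chunk_id(c) for c in chunks]}


def to_json(manifest: dict) -> str:
    """Serialize the manifest so it can be stored and synchronized."""
    return json.dumps(manifest, indent=2)


def from_json(text: str) -> dict:
    return json.loads(text)


def reassemble(manifest: dict, chunk_store: dict) -> bytes:
    """Rebuild the file by looking up each chunk id in the chunk store."""
    return b"".join(chunk_store[cid] for cid in manifest["chunks"])


if __name__ == "__main__":
    parts = [b"hello ", b"world"]
    store = {chunk_id(p): p for p in parts}  # stands in for cloud storage
    manifest = from_json(to_json(build_manifest("notes.txt", parts)))
    print(reassemble(manifest, store))
```

Because the manifest only holds identifiers, it stays small even for large files, which makes it cheap to synchronize after every change.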

Let us assume that a client (application) is installed on a computer and wishes to use our service. We first assume that the client has four essential components: Watcher, Indexer, Chunker and Internal DB. For the sake of clarity, let us consider only one client, but in reality, there could be multiple clients with the same essential components, and several of them may belong to the same user.

The client is responsible for uploading files to and downloading files from Dropbox. It is also expected to identify changes in the sync folder and handle the conflicts that may arise from concurrent changes made by the same user or from offline updates.

The client also monitors the folders to keep track of the changes or updates happening in the files.
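One simple way to monitor a folder is to take periodic snapshots and compare them. The sketch below assumes a polling approach based on modification time and size; a production client would more likely rely on the operating system's file notification facilities, and the function names here are hypothetical.

```python
import os


def snapshot(folder: str) -> dict:
    """Map each file's relative path to (modification time in ns, size in bytes)."""
    state = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            state[os.path.relpath(path, folder)] = (info.st_mtime_ns, info.st_size)
    return state


def diff_snapshots(before: dict, after: dict) -> dict:
    """Classify the differences between two snapshots."""
    created = sorted(after.keys() - before.keys())
    deleted = sorted(before.keys() - after.keys())
    updated = sorted(
        path for path in after.keys() & before.keys()
        if after[path] != before[path]
    )
    return {"created": created, "updated": updated, "deleted": deleted}
```

Calling snapshot on the sync folder at two points in time and passing both results to diff_snapshots yields the list of created, updated and deleted files that the rest of the client needs to process.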

To handle all the updates to the metadata, the client interacts with the message queuing service and the synchronization service.

To store the files and synchronize the folders, the client must also interact with any remote cloud storage system like Amazon S3.
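The interaction with cloud storage can be illustrated with an in-memory stand-in. InMemoryCloudStore is a hypothetical class written for this sketch, not the Amazon S3 API; the example assumes that chunks are stored under a hash of their content, so a chunk that is already present is never uploaded twice.

```python
import hashlib


class InMemoryCloudStore:
    """Hypothetical stand-in for a remote object store such as Amazon S3."""

    def __init__(self):
        self.objects = {}   # chunk id -> chunk bytes
        self.uploads = 0    # how many chunks were actually sent

    def has(self, key: str) -> bool:
        return key in self.objects

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data
        self.uploads += 1


def upload_file(data: bytes, store: InMemoryCloudStore, chunk_size: int = 4) -> list:
    """Upload only the chunks the store does not hold yet.

    Returns the ordered list of chunk ids that describes this version.
    """
    chunk_ids = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        cid = hashlib.sha256(chunk).hexdigest()
        if not store.has(cid):
            store.put(cid, chunk)
        chunk_ids.append(cid)
    return chunk_ids


if __name__ == "__main__":
    store = InMemoryCloudStore()
    version_1 = upload_file(b"aaaabbbbcccc", store)
    version_2 = upload_file(b"aaaaBBBBcccc", store)
    print(store.uploads, "chunks uploaded in total")
```

Since the old chunks remain in the store, the chunk id list of the first version can still be used to rebuild it, which is one way to support the update history mentioned earlier.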

Now let us discuss the four client components mentioned above individually:

1. Watcher: This component monitors all the sync folders for any activity performed by the user, like creating, updating or deleting a file. It tracks every action performed on each file and sends a notification to the Chunker and the Indexer.

2. Chunker: As the name suggests, this component breaks big files down into smaller parts to make updating files easier. The Chunker splits the file and then uploads all the small chunks to cloud storage, with a unique hashed id assigned to each of them. To recreate the original file, you just need to join all the smaller chunks together. If the client makes a change in any part, the chunking algorithm detects the chunk that needs to be modified and saves only that specific chunk to the cloud storage. This process reduces the overall bandwidth usage, the time required for synchronization and the storage space used in the cloud.

3. Indexer: The Indexer updates the Internal Database whenever a notification is received from the Watcher. It receives the URLs of the chunks and their hashed ids from the Chunker, and then it updates the file's record with the modified chunks. The Indexer communicates with the synchronization service through the message queuing service once the chunks have been submitted successfully to the cloud storage.

4. Internal Database: This component stores information about all the files and chunks, including their versions and their location in the file system.
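As an illustration of how the Indexer and the Internal Database could fit together, here is a minimal sketch using SQLite. The table layout, column names and function names are assumptions made for this example, and the URLs in the usage below are placeholders.

```python
import sqlite3


def open_internal_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (assumed) chunk index table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS chunks (
               file_path   TEXT    NOT NULL,
               version     INTEGER NOT NULL,
               chunk_index INTEGER NOT NULL,
               chunk_id    TEXT    NOT NULL,
               chunk_url   TEXT    NOT NULL,
               PRIMARY KEY (file_path, version, chunk_index)
           )"""
    )
    return conn


def record_version(conn, file_path, chunks):
    """Store a new version of a file.

    chunks is an ordered list of (chunk_id, chunk_url) pairs, as the
    Indexer would receive them from the Chunker. Returns the version number.
    """
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM chunks WHERE file_path = ?",
        (file_path,),
    ).fetchone()
    version = row[0] + 1
    conn.executemany(
        "INSERT INTO chunks VALUES (?, ?, ?, ?, ?)",
        [(file_path, version, i, cid, url) for i, (cid, url) in enumerate(chunks)],
    )
    conn.commit()
    return version


def chunks_for(conn, file_path, version):
    """Ordered chunk ids for one version of a file."""
    rows = conn.execute(
        "SELECT chunk_id FROM chunks WHERE file_path = ? AND version = ? "
        "ORDER BY chunk_index",
        (file_path, version),
    )
    return [r[0] for r in rows]


if __name__ == "__main__":
    db = open_internal_db()
    record_version(db, "report.txt", [("h1", "https://storage.example/h1"),
                                      ("h2", "https://storage.example/h2")])
    record_version(db, "report.txt", [("h1", "https://storage.example/h1"),
                                      ("h3", "https://storage.example/h3")])
    print(chunks_for(db, "report.txt", 2))
```

Keeping every version as its own set of rows means older versions stay queryable, at the cost of storing one row per chunk per version.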

Discuss the Other Components

1. Metadata Database

The Metadata Database is used to maintain and keep track of the various chunks of the data. The records in the database store the names of the files or chunks and their different versions, along with information about their respective users and workspaces. To manage that metadata, you can use any RDBMS or NoSQL database, but make sure to keep the data consistent, as more than one client may be working on the same chunk of a file on different systems.

If you choose a relational database management system, consistency is easier to guarantee because of its ACID transactions, but in the case of NoSQL, you would need to configure each database differently; for example, to increase consistency in Cassandra, you could tune its replication factor.

Even if you maintain data consistency using an RDBMS, you still need a database sharding technique to scale the application. But once the database is sharded, managing the records after any update or any new addition to a file becomes more and more complex. To solve this problem, you can create an edge wrapper around the sharded databases. The edge wrapper provides an ORM that the client can use to retrieve data from a record instead of accessing the database directly.
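The link between the replication factor and consistency can be shown with the usual quorum arithmetic: a read is guaranteed to see the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. The helper names below are hypothetical; this is only the arithmetic, not Cassandra configuration.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (W + R > RF)."""
    return write_replicas + read_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print("quorum for RF=3:", q)
    print("quorum writes + quorum reads:", read_sees_latest_write(rf, q, q))
    print("single-replica writes and reads:", read_sees_latest_write(rf, 1, 1))
```

With a replication factor of 3, quorum reads and writes touch 2 replicas each, so they always overlap; reading and writing a single replica is faster but may return stale metadata.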
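A wrapper of this kind can be sketched as a small class that hides the shard selection from its callers. Everything here is an assumption made for illustration: the class name, the choice of the user id as the shard key, and the plain dictionaries that stand in for separate database instances.

```python
import hashlib


class ShardedMetadataStore:
    """Hypothetical wrapper that routes each record to one of several shards."""

    def __init__(self, shard_count: int):
        # Each dict stands in for a separate database instance.
        self._shards = [dict() for _ in range(shard_count)]

    def shard_index(self, user_id: str) -> int:
        """Pick a shard from a stable hash of the user id.

        hashlib is used instead of the built-in hash(), which is
        randomized between Python processes for strings.
        """
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self._shards)

    def save(self, user_id: str, file_path: str, record: dict) -> None:
        self._shards[self.shard_index(user_id)][(user_id, file_path)] = record

    def load(self, user_id: str, file_path: str):
        """Return the record, or None if it does not exist."""
        return self._shards[self.shard_index(user_id)].get((user_id, file_path))


if __name__ == "__main__":
    store = ShardedMetadataStore(shard_count=4)
    store.save("alice", "report.txt", {"version": 2, "chunks": ["h1", "h3"]})
    print(store.load("alice", "report.txt"))
```

Callers only use save and load, so the number of shards or the routing rule could change without touching client code. Note that simple modulo routing moves most keys when the shard count changes; consistent hashing is a common alternative.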

2. Messaging Queuing Service

The message queuing service handles the asynchronous communication between the clients and the service used to synchronize the data, also known as the synchronization service.

The requirements for any message queuing service that you use in your system are mentioned below:

  • The message queuing service should be able to handle a large number of read and write requests simultaneously.
  • It should be able to store a large number of messages in a reliable, highly available queue.
  • The service should have high scalability and high performance.
  • The message queuing service should provide load balancing and elasticity for multiple instances of the synchronization service.

In your application, for smooth performance, two types of message queues could be used:

Request Queue

This queue has global scope in our application and is shared by every client. Whenever a client detects any change or update in a file or folder, it uses the request queue to send a request describing it. The requests in this queue are forwarded to the synchronization service so that it can update the metadata database for every change and keep the data in your system consistent.

Response Queue

In your system, you can use a separate response queue for every individual client. There would be an individual response queue corresponding to each client of your application. The queue receives the data from the synchronization service and broadcasts it to each client's system to inform it about the change made in any of the files or folders. This response queue delivers an update message to each client in the network.

After receiving the response message, the clients are expected to update their files according to the change. Even if a client somehow gets disconnected from the internet, the message is not lost, as it is still stored in the response queue in our system. There would be n response queues for n unique clients. A message is deleted from the response queue only after the respective client acknowledges it.
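The acknowledge-before-delete behaviour can be sketched with a tiny in-memory queue. ResponseQueue is a hypothetical class for this example; a real deployment would use a durable message broker rather than a Python deque.

```python
from collections import deque


class ResponseQueue:
    """Per-client queue that keeps a message until the client acknowledges it."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message) -> None:
        """Called by the synchronization service to enqueue an update."""
        self._messages.append(message)

    def peek(self):
        """Return the oldest unacknowledged message without removing it."""
        return self._messages[0] if self._messages else None

    def ack(self) -> None:
        """Remove the oldest message once the client has applied it."""
        if self._messages:
            self._messages.popleft()

    def __len__(self) -> int:
        return len(self._messages)


if __name__ == "__main__":
    queue = ResponseQueue()
    queue.publish({"file": "report.txt", "version": 2})
    # The client reads the message, loses its connection, and reads again later:
    first_try = queue.peek()
    second_try = queue.peek()
    print(first_try == second_try, len(queue))
    queue.ack()
    print(len(queue))
```

Because reading does not remove the message, a client that disconnects before applying an update simply receives the same message again when it comes back online.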

3. Synchronization Service

The client interacts with the synchronization service for one of the following reasons:

  • To receive the latest update of any file or folder from the cloud storage.
  • To send changes to be stored on the cloud storage.

The clients use the synchronization service to maintain data consistency at the cloud storage level. The synchronization service accepts requests from the request queue and updates the records in the metadata database accordingly. It is also responsible for broadcasting the latest updates or changes in each file to the clients. The update message it sends is stored in the client's response queue until the client acknowledges it.

The client's Indexer is expected to pick up the update message from the response queue and then update the files on the system according to the changes mentioned. The Indexer also updates the local database on the client system to match the metadata database. Even if the client cannot connect to the internet at a particular moment, it checks for updates in the response queue as soon as it goes online.
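The flow through the synchronization service can be summarized in a short sketch. The class and method names are hypothetical, the queues and the metadata database are plain in-memory structures, and skipping the client that sent the change when broadcasting is an assumption of this example.

```python
from collections import deque


class SyncService:
    """Minimal in-memory model of the request/response flow."""

    def __init__(self):
        self.request_queue = deque()   # shared by all clients
        self.response_queues = {}      # client id -> deque of update messages
        self.metadata = {}             # file path -> latest ordered chunk ids

    def register(self, client_id: str) -> None:
        """Create a response queue for a new client."""
        self.response_queues.setdefault(client_id, deque())

    def submit(self, client_id: str, file_path: str, chunk_ids: list) -> None:
        """A client reports a change through the request queue."""
        self.request_queue.append(
            {"client": client_id, "file": file_path, "chunks": list(chunk_ids)}
        )

    def process_pending(self) -> int:
        """Apply queued requests to the metadata and notify the other clients."""
        processed = 0
        while self.request_queue:
            request = self.request_queue.popleft()
            self.metadata[request["file"]] = request["chunks"]
            for client_id, queue in self.response_queues.items():
                if client_id != request["client"]:  # assumed: sender is skipped
                    queue.append(
                        {"file": request["file"], "chunks": request["chunks"]}
                    )
            processed += 1
        return processed


if __name__ == "__main__":
    service = SyncService()
    service.register("laptop")
    service.register("phone")
    service.submit("laptop", "report.txt", ["h1", "h3"])
    service.process_pending()
    print(service.response_queues["phone"][0])
```

In this model the phone finds the update waiting in its own queue whenever it next connects, which matches the offline behaviour described above.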

4. Cloud Storage

To increase the efficiency of your system in the beginning, you can use any cloud storage platform like Microsoft Azure or Amazon S3 to store the chunks, or parts, of the large files uploaded by the users. The client would contact the cloud service for any action that needs to be performed