Database Sharding System Design
Modern data servers have to collect and process large volumes of data, so their databases must handle large data sets. A single database server's capacity can be increased, but it eventually hits a physical limit. The alternative is to distribute the data across a group of database servers. The primary benefit of sharding is greater database capacity, but it also lets the database handle more traffic, because each server in the cluster only needs to respond to a portion of the total requests. A sharded architecture needs a few essential components to be effective. We'll discuss how it works with shard keys and the significance of denormalizing data.
What is Database Sharding?
Sharding is the process of dividing data into two or more smaller pieces, called logical shards. The logical shards are then spread across several database nodes, known as physical shards, each of which can hold multiple logical shards. Even so, the data in all the shards together represents one complete logical dataset.
A shard key is assigned to each row and indicates the logical shard on which the row can be found. A logical shard cannot be split across physical shards, but a single physical shard can host several logical shards.
The objective of a sharded architecture is to have many small shards that distribute data evenly among the nodes. This prevents hotspots from overwhelming any one node and keeps response times fast across all nodes. Sharding can be implemented at the database level or at the application level. Many databases now support sharded designs natively; PostgreSQL, one of the most widely used relational databases, is a notable exception.
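The relationship between shard keys, logical shards, and physical shards can be sketched in a few lines of Python. This is only an illustration: the node names, the number of logical shards, and the use of SHA-256 to derive a logical shard are assumptions made for the example, not part of any particular database.

```python
import hashlib

NUM_LOGICAL_SHARDS = 8  # assumed value; real systems often use many more

# Hypothetical placement: each physical node hosts several logical shards.
PHYSICAL_NODES = {
    "db-node-a": [0, 1, 2],
    "db-node-b": [3, 4, 5],
    "db-node-c": [6, 7],
}

# Invert the placement so every logical shard resolves to exactly one node.
LOGICAL_TO_PHYSICAL = {
    logical: node
    for node, logicals in PHYSICAL_NODES.items()
    for logical in logicals
}


def logical_shard(shard_key: str) -> int:
    """Map a row's shard key to a logical shard using a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS


def physical_node(shard_key: str) -> str:
    """Resolve the physical node that hosts the key's logical shard."""
    return LOGICAL_TO_PHYSICAL[logical_shard(shard_key)]
```

Because placement is a separate table, a logical shard can be moved to another node by changing one entry, without recomputing which logical shard each row belongs to.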
Why is Database Sharding Used?
In the past, information was kept in an RDBMS (Relational Database Management System), where it was organized into tables with rows and columns. For data with 1-to-N or N-to-N relationships, normalization would store the data in separate tables connected by foreign keys, rather than in a single table. This ensured that the data did not get out of sync, and the tables could be joined to obtain a complete picture of the data.
Traditional database systems, however, run into limits on how much data they can process, store, and retrieve as data size grows. To maintain performance, they require increasingly expensive and advanced hardware. Even with the best hardware, many successful modern applications need to handle far more data than a single RDBMS server can.
The most common way to implement sharding is at the application level, which means that the application's logic determines which shard each read and write is sent to. Some database management systems, though, provide sharding directly at the database level.
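As a rough sketch of application-level sharding, the class below chooses a shard for every read and write itself. In-memory SQLite databases stand in for separate database servers, and the `ShardRouter` class, the `users` table, and the CRC32-modulo routing rule are all hypothetical choices made for this example.

```python
import sqlite3
import zlib


class ShardRouter:
    """Application-level router: the application picks the shard for each query."""

    def __init__(self, shard_count: int) -> None:
        # In-memory SQLite databases stand in for separate database servers.
        self.shards = [sqlite3.connect(":memory:") for _ in range(shard_count)]
        for conn in self.shards:
            conn.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, name TEXT)")

    def _shard_for(self, user_id: str) -> sqlite3.Connection:
        # crc32 is stable across processes, unlike Python's built-in hash() for str.
        index = zlib.crc32(user_id.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def write_user(self, user_id: str, name: str) -> None:
        conn = self._shard_for(user_id)
        conn.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (user_id, name))
        conn.commit()

    def read_user(self, user_id: str):
        row = self._shard_for(user_id).execute(
            "SELECT name FROM users WHERE user_id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None
```

A caller would use it as `router = ShardRouter(3)`, then `router.write_user("u1", "Ada")` and `router.read_user("u1")`. The database servers themselves know nothing about the sharding scheme; all routing knowledge lives in the application.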
Approaches used in Sharding
Sharded architectures can take several forms, depending on how the shard key is assigned. Shard keys must be unique across shards regardless of where they are generated, so their values must be coordinated. This creates a trade-off between a static, distributed procedure that is quick to compute and a centralized "name server" that can dynamically optimize logical shards for efficiency.
To optimize for the most frequent queries, shard keys are derived from some invariant attribute of the data. Tenant identifiers, locations, and timestamps are typical choices. Custom configuration based on usage patterns, actual shard sizes, and similar measurements can help optimize a sharded architecture's performance.
Geo-based Sharding
Data is divided based on where the user is located, such as the user's continent or a region of comparable size. Usually a fixed location is chosen, such as the user's location at the time their account was created.
This strategy lets users be routed to the node closest to them, which lowers latency. Users may not be distributed evenly across geographic regions, though, so some shards can end up holding far more data than others.
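A minimal sketch of geo-based routing might look like the following. The region codes, node names, and fallback node are hypothetical; the point is only that the shard is chosen from a fixed region recorded at sign-up, and that counting users per node makes uneven distribution visible.

```python
from collections import Counter

# Hypothetical region-to-node table; the names are illustrative only.
REGION_TO_NODE = {
    "EU": "db-eu-west",
    "NA": "db-us-east",
    "AS": "db-ap-south",
}
DEFAULT_NODE = "db-us-east"  # assumed fallback for regions without a node


def node_for_user(signup_region: str) -> str:
    """Route by the fixed region recorded when the account was created."""
    return REGION_TO_NODE.get(signup_region.upper(), DEFAULT_NODE)


def users_per_node(signup_regions) -> Counter:
    """Count how many users land on each node, to spot uneven distribution."""
    return Counter(node_for_user(region) for region in signup_regions)
```

For example, `users_per_node(["EU", "EU", "EU", "NA"])` places three users on `db-eu-west` and one on `db-us-east`, which is the kind of skew this approach can produce.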
Range Sharding
A database holding sequential, time-based data, such as log history, could be sharded by month. Because data that is "near" within the specified range will be on the same shard, range-based shard keys make sequential access patterns relatively fast.
A drawback of ranges is that the data may not be evenly balanced. For instance, if an e-commerce business receives many more orders in December due to holiday shopping, the shard holding the December range can become overloaded while the other shards sit nearly idle.
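Range lookup of this kind can be sketched with a sorted list of range boundaries and a binary search. The one-shard-per-month layout for 2024 and the `logs_2024_MM` shard names are assumptions made for the example.

```python
from bisect import bisect_right
from datetime import date

# Assumed layout: one shard per month of 2024.
# Each entry is the first day covered by the shard at the same index.
RANGE_STARTS = [date(2024, month, 1) for month in range(1, 13)]
SHARD_NAMES = [f"logs_2024_{month:02d}" for month in range(1, 13)]
RANGE_END = date(2025, 1, 1)  # exclusive upper bound of the last range


def shard_for(day: date) -> str:
    """Return the shard whose date range contains `day`."""
    if day >= RANGE_END:
        raise ValueError(f"{day} is after the last configured range")
    index = bisect_right(RANGE_STARTS, day) - 1
    if index < 0:
        raise ValueError(f"{day} is before the first configured range")
    return SHARD_NAMES[index]
```

With this layout, `shard_for(date(2024, 12, 25))` returns `"logs_2024_12"`, so every December order lands on the same shard. That is convenient for sequential scans and is also the source of the holiday hotspot.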
Hash Sharding
This approach determines the partition by applying a hash function to the key value. A good hash function distributes data evenly among partitions, lowering the likelihood of hotspots. However, because it is likely to scatter related rows across multiple partitions, the server cannot improve performance by anticipating and pre-loading data for upcoming queries.
A hash-based sharded architecture has no single point of failure for shard lookup, because any server that knows the hash function can compute the shard key.
A significant drawback of hashing is that, depending on the architecture, adding shards can incur significant overhead because existing data must be redistributed. Consistent hashing limits this by ensuring that only a small fraction of the data must be moved each time a new node is added.
The key selling point of this method is that it spreads data evenly to avoid hotspots. In addition, unlike range-based or directory-based sharding, which require maintaining a map of where the data is located, algorithmic distribution needs no such map.
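The consistent hashing idea can be sketched as a ring of hashed positions. This is a simplified illustration rather than a production implementation: the `ConsistentHashRing` class, the use of SHA-256, and the figure of 100 virtual nodes per server are assumptions chosen for the example.

```python
import bisect
import hashlib


def _position(value: str) -> int:
    """Hash a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    def __init__(self, nodes, vnodes: int = 100) -> None:
        self._vnodes = vnodes  # virtual nodes per server smooth out the distribution
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        # Pick the first ring entry at or after the key's position, wrapping around.
        index = bisect.bisect_left(self._ring, (_position(key), ""))
        return self._ring[index % len(self._ring)][1]
```

When a node is added with `add_node`, the only keys that change owner are those that now fall just before one of the new node's ring positions, so they all move to the new node and every other key stays where it was. By contrast, a plain `hash(key) % shard_count` scheme reassigns most keys when `shard_count` changes; going from 3 to 4 shards moves roughly three quarters of them.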
Advantages of Sharding
- A relational database is usually set up on a single machine and scaled up as needed by increasing that machine's computing power.
- Upgrading the hardware of an existing server is known as vertical scaling, or scaling up, and typically involves adding more RAM or CPU.
- Adding more machines to an existing stack is known as horizontal scaling, or scaling out; it spreads the load and allows for more traffic and faster processing.
- The ability to scale horizontally makes your solution much more flexible because any non-distributed database ultimately has a limited amount of storage and compute capacity.
- Another reason to choose a sharded database architecture is to speed up query response times.
- When you submit a query to a database that hasn't been sharded, it may have to search through every row of the table you're querying before it can locate the desired result set.
- For an application with a large, monolithic database, queries can become unacceptably slow. With the table split across shards, each query has fewer rows to search and returns sooner.
- Sharding can also make an application more reliable by reducing the impact of outages. If your website or application depends on an unsharded database, an outage could render the entire application unusable. With a sharded database, an outage is likely to affect only a single shard.
- Sharding lets you run on commodity hardware rather than high-end equipment.
- The database can be scaled out by adding more shards.
- Performance improves because each node carries a lower load.
- Even if computational limits have not been reached, sharding may still be necessary to keep data in distinct geographic zones.
- Your service will be faster if the data servers are located closer to the users, and some countries where your service is available may restrict where data can be stored and how it can be used.
Disadvantages of Sharding
- Complexity is the main drawback of database sharding.
- Queries become more complicated because they must determine the correct shard key and avoid spanning multiple shards where possible.
- If the shards can't be completely isolated, you must implement eventual consistency for duplicated data or find another way to maintain relational constraints across shards.
- Deploying and operating the database, including failovers, backups, and other maintenance, becomes significantly more complex. In short, you should only employ database sharding when it is truly necessary.
- Data must be managed across many shard locations, which can be disruptive for teams that are used to accessing and managing their data from a single entry point.
- Shards can become unbalanced. For example, if one shard holds customers whose names begin with A-M and another holds N-Z, the A-M shard may gradually accumulate more data than the N-Z one, causing slowdowns and stalls for a sizable share of your customers.
- To distribute the data more evenly, the database would probably need to be rebalanced and re-sharded.
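To make the cost of multi-shard queries concrete, here is a small scatter-gather sketch: a query that cannot be answered from one shard is sent to every shard, and the application merges and re-sorts the partial results. The in-memory `SHARDS` dictionary and the order rows are made-up stand-ins for real database servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: plain lists of order rows stand in for database servers.
SHARDS = {
    "shard-0": [{"customer": "alice", "total": 30}, {"customer": "carol", "total": 45}],
    "shard-1": [{"customer": "bob", "total": 20}],
    "shard-2": [{"customer": "dave", "total": 55}, {"customer": "erin", "total": 10}],
}


def query_shard(name: str, min_total: int) -> list:
    """Run the filter on a single shard."""
    return [row for row in SHARDS[name] if row["total"] >= min_total]


def scatter_gather(min_total: int) -> list:
    """Send the query to every shard in parallel, then merge and sort the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda name: query_shard(name, min_total), SHARDS)
        merged = [row for part in partials for row in part]
    return sorted(merged, key=lambda row: row["total"], reverse=True)
```

Here `scatter_gather(30)` returns the rows for dave, carol, and alice, in that order. Every shard has to answer before the merge can finish, so the slowest shard sets the response time, which is why single-shard queries are preferred where the shard key allows them.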
Conclusion
If you want to scale your database horizontally, sharding can be a very effective approach. However, it also increases your application's complexity and the number of potential points of failure. Sharding might be required for some applications, while for others, the cost and time involved in developing and maintaining a sharded architecture may outweigh any advantages.
The advantages and disadvantages of sharding should be clearer to you after reading this conceptual piece. This knowledge will help you decide whether a sharded database architecture is appropriate for your application going forward.