While the concept of sharding as a scaling solution has been a hot topic in the Ethereum community for a while now, it’s important to understand that this sort of database architecture isn’t specific to blockchain technology. Computer nerds haven’t shut up about this pattern for quite some time as a method of improving performance, increasing availability, and ensuring the redundancy of data. In this article, we’re going to provide a high-level overview of sharding as a database architecture within distributed systems to help you understand how it can be applied to build more performant blockchains in the future.
How does one shard?
Sharding is a data storage technique in which a large dataset is broken up into smaller, independent units that are easier to manage. Combined with replication, doing so provides a degree of data redundancy as well as an increase in overall performance. For example, handling all of the data a search engine requires within a single database may not be the most practical solution. And if this data only exists on a single storage device, it can easily be lost or corrupted should the system fail.
In the event a single node fails, replicated chunks of data can be retrieved and consolidated, which mitigates risk because redundant data exists across a cluster of separate nodes. Reducing single points of failure is one of the design motivations within distributed systems, and “almost all modern databases are natively sharded. Cassandra, HBase, HDFS, and MongoDB are popular distributed databases.”
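To make the failover idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Node` class, `replicated_write`, and `read_with_failover` are hypothetical names, and real databases handle replication with far more care (consistency, ordering, recovery).

```python
class Node:
    """A toy storage node holding a full copy of one shard's data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"node {self.name} is down")
        return self.data[key]


def replicated_write(replicas, key, value):
    """Store the same record on every replica that is currently up."""
    for node in replicas:
        if node.alive:
            node.data[key] = value


def read_with_failover(replicas, key):
    """Return the value from the first replica that answers."""
    for node in replicas:
        try:
            return node.read(key)
        except ConnectionError:
            continue  # this replica is down, try the next one
    raise RuntimeError("all replicas are unavailable")


replicas = [Node("a"), Node("b"), Node("c")]
replicated_write(replicas, "user:42", "Ada")
replicas[0].alive = False  # simulate a single node failure
print(read_with_failover(replicas, "user:42"))  # answered by replica "b"
```

Because every replica holds the same record, losing node "a" costs nothing but one failed attempt before the read succeeds elsewhere.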
This database architecture pattern has been in use for quite some time, and it is one of the ways online video games handle millions of simultaneous players. Sharding is also referred to as horizontal partitioning: the practice of separating one table’s rows into multiple different tables across different machines. By distributing the data among multiple machines, a cluster of database systems can store larger datasets and handle additional requests. The data held in each shard is unique and independent of the data held in other partitions. Shards also don’t share any of the same computing resources, which increases the degree of fault tolerance.
Pros & Cons
An advantage of sharding is that it can assist in horizontal scaling, also known as scaling out. In distributed system architecture, horizontal scaling means adding more computers, rather than upgrading the hardware of a single one, in order to scale efficiently and increase query speed. “Adding more nodes to better cope with failures is often called ‘high-availability’ or ‘HA.’” In principle, there’s no fixed cap on how big you can go: whenever performance degrades, you add another machine.
Depending on an unsharded database means that a failure can potentially make your entire application unavailable. With a sharded database, however, a failure is likely to affect only one shard. That shard can still become unavailable, but the event will impact fewer users in comparison to a complete database outage.
Image credit: xkcd, https://xkcd.com/1718/
Sharding isn’t a magic band-aid, though, and there are still some drawbacks. If there is an imbalance in data accrual, one of your shards can become what’s known as a database hotspot. For example, let’s say you have a database with two separate shards, one for customers whose dates of birth range from 1950 through 1980 and another for those whose dates of birth range from 1981 through 2001. If your application serves an inordinate number of people who were born in 1993, the shard that stores data ranging from 1981 through 2001 will accrue more data than the 1950–1980 one, causing significant lag or a crash for a large portion of your users.
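A quick way to see a hotspot forming is to count how many records land on each shard. The sketch below uses the two birth-year ranges from the example; the function name and the sample list of birth years are made up for illustration.

```python
from collections import Counter


def shard_for_birth_year(year):
    """Pick a shard label using the two ranges from the example."""
    if 1950 <= year <= 1980:
        return "1950-1980"
    if 1981 <= year <= 2001:
        return "1981-2001"
    raise ValueError(f"no shard covers birth year {year}")


# Hypothetical customer sample, deliberately skewed toward 1993.
birth_years = [1955, 1972, 1993, 1993, 1993, 1993, 1988, 1993, 1964, 1993]

load = Counter(shard_for_birth_year(year) for year in birth_years)
for shard, count in sorted(load.items()):
    print(shard, count)
```

With this sample, seven of the ten records end up on the 1981-2001 shard, so that machine does more than twice the work of its neighbor.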
There are also a number of different sharding methods, such as key-based sharding and range-based sharding. Key-based sharding, also known as hash-based sharding, takes a value from newly written data, like a customer’s ID number, email address, or ZIP code, and passes it through a hash function to determine which shard the data should be stored in. A hash function takes a piece of data as input and outputs a discrete value, known as a hash value. “In the case of sharding, the hash value is a shard ID used to determine which shard the incoming data will be stored on.”
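Here is a rough sketch of what key-based sharding can look like in Python. The shard count, the in-memory `shards` dictionary, and the function names are all assumptions for illustration. It uses SHA-256 from the standard library rather than Python’s built-in `hash()`, because `hash()` of a string can change between runs, which would send the same key to different shards after a restart.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

# Each shard is modeled as a plain dictionary; a real shard is a database.
shards = {shard_id: {} for shard_id in range(NUM_SHARDS)}


def shard_for(key, num_shards=NUM_SHARDS):
    """Hash the shard key and reduce the hash value to a shard ID."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def put(key, record):
    """Store the record on whichever shard its key hashes to."""
    shards[shard_for(key)][key] = record


def get(key):
    """Look up the record on the one shard that can hold this key."""
    return shards[shard_for(key)].get(key)


put("customer-1001", {"email": "ada@example.com"})
print(shard_for("customer-1001"), get("customer-1001"))
```

The same key always hashes to the same shard, so reads only ever touch one machine. The trade-off of this simple modulo approach is that changing `NUM_SHARDS` remaps most keys.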
Another method, range-based sharding, is easier to implement, which can make it an attractive option initially. However, this method won’t prevent database hotspots. The application code can be as simple as reading which range the new data falls into and storing it in the corresponding shard. Unfortunately, if you have products priced between $1–10 in one shard and products priced $11–20 in another, you run the risk of an uneven distribution of data if products in one of the two ranges happen to be more popular. Within a range-based scheme, only finer-grained partitions can reduce hotspots like these.
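The range lookup really can be that simple. Below is a minimal sketch using the two price ranges from the example and whole-dollar prices; `UPPER_BOUNDS` and `range_shard_for` are hypothetical names, and the binary search comes from Python’s standard `bisect` module.

```python
from bisect import bisect_left

# Inclusive upper bound of each price range, in whole dollars.
# Shard 0 holds $1-10 and shard 1 holds $11-20, as in the example.
UPPER_BOUNDS = [10, 20]


def range_shard_for(price):
    """Return the index of the shard whose price range contains `price`."""
    if price < 1 or price > UPPER_BOUNDS[-1]:
        raise ValueError(f"no shard covers a price of ${price}")
    # bisect_left finds the first upper bound that is >= price.
    return bisect_left(UPPER_BOUNDS, price)


for price in (1, 10, 11, 20):
    print(price, "-> shard", range_shard_for(price))
```

Making the partitions finer-grained is just a matter of adding more entries to `UPPER_BOUNDS`, for example `[5, 10, 15, 20]`, though existing records would then need to be moved to match.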
Today, blockchain protocols like Ethereum are researching how to use sharding as a scaling solution to increase the number of transactions per second the system can process while reducing the computational power required of each node. Implementing these techniques within a decentralized P2P network, however, presents some unique challenges.
A lot of contemporary blockchain projects, including Ethereum, want to leverage sharding in order to transition from a reliance on one primary blockchain to multiple relatively autonomous blockchains that can interoperate and communicate with one another. By leveraging sharding methods within these systems, blockchains can potentially scale to meet the growing demands of future users. “This attempt at state sharding for a vast, decentralized network is an impressive endeavor and will be an enormous feat of accomplishment if successfully implemented.”
Say it one more time
Sharding can add a great deal of complexity to your application’s storage workflow, but it’s a powerful tool for scaling and can provide long-term benefits. Nonetheless, sharding is usually considered after an application has outgrown simpler methods of storing data. Along with complexity, it can add more points of failure if not implemented correctly. “Some see sharding as an inevitable outcome for databases that reach a certain size, while others see it as a headache that should be avoided unless it’s absolutely necessary, due to the operational complexity that sharding adds.” It’s usually recommended to exhaust all possible optimization options before considering a sharded architecture for your application’s data.