In this post, we’ll discuss byte serialization in blockchains: what it is, why it matters, and how different systems approach it, including:
- Ethereum 1x
- Ethereum 2.0
- Secure Scuttlebutt (Bonus!)
Serialization is the process of converting an object into a stream of bytes, using a common set of rules the system understands, so that the object can be stored (in memory, a database, or a file) or transmitted. Its main purpose is to save the state of an object so that it can be recreated when needed. The reverse process, deserialization, applies the same set of rules to decode the bytes back into the original object. Serialization is also known as marshaling, and deserialization as unmarshaling.
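As a minimal sketch of that round trip, the Python snippet below packs a made-up record into bytes and unpacks it again. The record layout (a 4-byte id followed by an 8-byte balance, both big-endian) and the function names are assumptions chosen for illustration; they do not come from any particular chain.

```python
import struct
from typing import Tuple

# Hypothetical layout: 4-byte unsigned id, then 8-byte unsigned balance, big-endian.
RECORD_FORMAT = ">IQ"


def serialize(account_id: int, balance: int) -> bytes:
    """Turn the in-memory values into a fixed 12-byte sequence."""
    return struct.pack(RECORD_FORMAT, account_id, balance)


def deserialize(data: bytes) -> Tuple[int, int]:
    """Apply the same rules in reverse to recover the original values."""
    account_id, balance = struct.unpack(RECORD_FORMAT, data)
    return account_id, balance


encoded = serialize(7, 1_000)
print(encoded.hex())         # 0000000700000000000003e8
print(deserialize(encoded))  # (7, 1000)
```

Both sides must agree on `RECORD_FORMAT`; if the reader assumed a different byte order or field width, it would decode different values from the same bytes.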
Why is byte serialization important in blockchain?
Serialized data does not have to be in bytes, as long as the format is efficient for the system the data will live in. For example, in WWII messages were serialized into Morse code for transmission over radio. Bytes happen to be the most convenient format for computers to operate on. Blockchains inherently maintain a shared state of the world that nodes must reach consensus on. Protocols like Bitcoin and Ethereum build on this, but for nodes to maintain that state across a network, they need to send packets of byte-encoded data over peer-to-peer connections using a standard serialization algorithm.
This is why serialization plays such an important role in distributed systems. When you represent a domain object in a blockchain, you will want to generate a proof of it. The object may be inserted into a data structure, which itself needs to generate a proof. A good way to prove a message is authentic is to sign it with a key or to include its hash in a block. Because hashes and signatures are computed over bytes, this only works if every node serializes the same message to exactly the same bytes. This is the essence of how a blockchain functions: nodes send around hashes of transactions and blocks and check that they match. Let’s take a look at some of the different implementations.
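To make the link between serialization and hashing concrete, here is a small Python sketch. The transaction dictionary and the use of JSON plus SHA-256 are assumptions for illustration only; real chains use their own formats and hash functions. It shows that two encodings of the same logical object produce different hashes, so nodes that serialize differently could never agree.

```python
import hashlib
import json


def message_hash(serialized: bytes) -> str:
    """Hash the exact byte sequence that would be sent over the wire."""
    return hashlib.sha256(serialized).hexdigest()


# Hypothetical transaction used only for this example.
tx = {"from": "alice", "to": "bob", "amount": 5}

# Two valid JSON encodings of the same object: one compact, one with spaces.
compact = json.dumps(tx, separators=(",", ":")).encode("utf-8")
spaced = json.dumps(tx).encode("utf-8")

print(message_hash(compact) == message_hash(compact))  # True: same bytes
print(message_hash(compact) == message_hash(spaced))   # False: different bytes
```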
Ethereum 1x: Recursive Length Prefix
RLP (Recursive Length Prefix) is the main encoding method used to serialize objects in Ethereum, and its purpose is to encode arbitrarily nested arrays of binary data. Ethereum builds its Merkle Patricia tree on top of this byte serialization, so that a single hash can identify a particular state of the tree.
“… Data in Ethereum is serialized and deserialized as a byte-array. Simply put, an array of bytes is a byte-array.”
RLP is like a binary encoding of JSON, if JSON were restricted to strings and arrays. The RLP encoding function takes in an item, which is defined as follows:
- A string (i.e. a byte array) is an item
- A list of items is an item
Examples of items:
- [  ]
- [ “Lion” ]
- [ [ “cat”, 123, ‘d’, ‘o’, ‘g’] ]
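The item definition above maps naturally onto a short recursive encoder. The Python sketch below follows the published RLP prefix rules (a single byte below 0x80 encodes as itself, short strings get a 0x80-based prefix, short lists a 0xc0-based prefix, and longer payloads carry an explicit length). The function names are my own, and the sketch assumes every leaf is already a byte string, so integers such as 123 would first have to be converted to big-endian bytes.

```python
from typing import List, Union

# An item is either a byte string or a list of items.
Item = Union[bytes, List["Item"]]


def encode_length(length: int, offset: int) -> bytes:
    """Build the prefix for a payload of the given length."""
    if length <= 55:
        return bytes([offset + length])
    # Long form: the prefix stores how many bytes the length itself occupies.
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(length_bytes)]) + length_bytes


def rlp_encode(item: Item) -> bytes:
    if isinstance(item, bytes):
        # A single byte below 0x80 is its own encoding.
        if len(item) == 1 and item[0] < 0x80:
            return item
        return encode_length(len(item), 0x80) + item
    # Lists: encode each child, concatenate, then prefix the total length.
    payload = b"".join(rlp_encode(child) for child in item)
    return encode_length(len(payload), 0xC0) + payload


print(rlp_encode([]).hex())                 # c0
print(rlp_encode([b"Lion"]).hex())          # c5844c696f6e
print(rlp_encode([b"cat", b"dog"]).hex())   # c88363617483646f67
```

The "recursive" part of the name is visible in the list branch: a list's encoding is simply a length prefix followed by the encodings of its children.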
Ethereum 2.0: Simple SerialiZe
As part of the upgrade to Ethereum 2.0, researchers and developers have been working on significant improvements to the Ethereum protocol and its network architecture. Building on what was learned with RLP, Simple SerialiZe (SSZ) was developed as the serialization standard for all data structures shared across Ethereum 2.0 client implementations. It is outlined in the official Ethereum 2.0 specification and is poised to become the new standard for marshaling consensus data into bytes.
Specifically, the SimpleSerialize (SSZ) specification defines how different types should be serialized and represented as a Merkle tree. Compared with RLP, SSZ offers a few more abilities:
- Deterministic lookup of an element in bytes: if you know what the data structure is, you can look at the SSZ output and read the bytes of one of its fields directly.
- Partial Merkle hash computation, aka SSZ partials: you can update the SSZ representation and recompute the hash without recomputing from scratch, which optimizes the hashing process.
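The first ability, deterministic lookup, can be sketched in a few lines of Python. The container below (`slot`, `proposer_index`, `parent_root`) is a hypothetical example, not a type copied from the specification, and the sketch covers only fixed-size fields, which SSZ serializes back to back with unsigned integers in little-endian order. Variable-size fields are handled differently in SSZ and are out of scope here.

```python
from typing import Tuple

# Hypothetical fixed-size container: (field name, size in bytes), in order.
FIELDS = [("slot", 8), ("proposer_index", 8), ("parent_root", 32)]


def serialize_header(slot: int, proposer_index: int, parent_root: bytes) -> bytes:
    """Concatenate the fields; uint64 values are little-endian."""
    if len(parent_root) != 32:
        raise ValueError("parent_root must be exactly 32 bytes")
    return (
        slot.to_bytes(8, "little")
        + proposer_index.to_bytes(8, "little")
        + parent_root
    )


def field_offset(name: str) -> Tuple[int, int]:
    """Return (start, size) of a field, derived from the schema alone."""
    start = 0
    for field_name, size in FIELDS:
        if field_name == name:
            return start, size
        start += size
    raise KeyError(name)


def read_field(data: bytes, name: str) -> bytes:
    """Slice one field out of the serialized bytes without decoding the rest."""
    start, size = field_offset(name)
    return data[start:start + size]


data = serialize_header(3, 9, b"\x11" * 32)
print(field_offset("parent_root"))                                  # (16, 32)
print(int.from_bytes(read_field(data, "proposer_index"), "little"))  # 9
```

Because every offset follows from the schema, a reader can jump straight to byte 16 for `parent_root`. With RLP, by contrast, it would have to walk the length prefixes of the preceding items first.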
Let’s take a look at Bitcoin’s approach. Under the current consensus rules, a block is not valid unless its serialized size is less than or equal to 1 MB. A serialized block is made up of the following fields, which together determine its size:
| Bytes | Name | Data Type | Description |
|---|---|---|---|
| 80 | block header | block_header | The block header in the format described in the block header section. |
| Varies | txn_count | compactSize uint | The total number of transactions in this block, including the coinbase transaction. |
| Varies | txns | raw transaction | Every transaction in this block, one after another, in raw transaction format. Transactions must appear in the data stream in the same order their TXIDs appeared in the first row of the merkle tree. See the merkle tree section for details. |
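The `txn_count` field is a good example of why the size "varies". Below is a Python sketch of Bitcoin's compactSize encoding (one byte for values under 0xfd, otherwise a marker byte followed by a 2-, 4-, or 8-byte little-endian integer) and of how the three fields add up. The helper names are mine, and the raw transactions are treated as opaque byte strings rather than parsed.

```python
from typing import List

BLOCK_HEADER_SIZE = 80  # fixed size of the block header field, in bytes


def compact_size(n: int) -> bytes:
    """Encode a non-negative integer as a Bitcoin compactSize uint."""
    if n < 0:
        raise ValueError("compactSize cannot encode negative numbers")
    if n < 0xFD:
        return n.to_bytes(1, "little")
    if n <= 0xFFFF:
        return b"\xfd" + n.to_bytes(2, "little")
    if n <= 0xFFFFFFFF:
        return b"\xfe" + n.to_bytes(4, "little")
    if n <= 0xFFFFFFFFFFFFFFFF:
        return b"\xff" + n.to_bytes(8, "little")
    raise ValueError("value does not fit in 8 bytes")


def serialized_block_size(raw_txs: List[bytes]) -> int:
    """Header + txn_count + every raw transaction, one after another."""
    count_field = compact_size(len(raw_txs))
    return BLOCK_HEADER_SIZE + len(count_field) + sum(len(tx) for tx in raw_txs)


print(compact_size(252).hex())  # fc
print(compact_size(515).hex())  # fd0302
# Two placeholder transactions of 100 and 250 bytes: 80 + 1 + 350
print(serialized_block_size([b"\x00" * 100, b"\x00" * 250]))  # 431
```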
All data structures in Bitcoin use a custom, Bitcoin-specific serialization format rather than any external standard. Two formats are currently supported by the Bitcoin community: the original (non-segwit) format and the segwit format, which is described in BIP 144. Segwit implements a new serialization format for tx messages in the peer-to-peer protocol. There has been debate about supporting this new serialization method, but proponents point out that it improves on the original format by reducing the space that transactions count for in a block. It does this by separating signature (witness) data from the rest of the transaction and counting serialized witness data as one unit and core block data as four units.
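The one-unit versus four-unit accounting can be written down directly. This Python sketch assumes the units are applied per byte and uses the 4,000,000 weight-unit block limit from BIP 141, which the text above does not state explicitly. The function name and the byte counts in the example are invented for illustration.

```python
MAX_BLOCK_WEIGHT = 4_000_000  # limit defined in BIP 141 (not stated in this post)


def block_weight(core_bytes: int, witness_bytes: int) -> int:
    """Core block data counts four units per byte, witness data one."""
    return 4 * core_bytes + witness_bytes


# A block with no witness data hits the limit at exactly 1,000,000 bytes,
# which matches the 1 MB rule quoted earlier.
print(block_weight(1_000_000, 0) <= MAX_BLOCK_WEIGHT)        # True

# Hypothetical split: moving bytes into the witness frees up weight.
print(block_weight(700_000, 1_200_000))                      # 4000000
```

In the second case the block carries 1.9 MB of serialized data in total yet still fits, which is the size benefit the segwit format is credited with.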
SSB: Secure Scuttlebutt
Secure Scuttlebutt, a global cryptographic social network, takes a different approach to serialization. It uses JSON for its messages, serialized in one specific format so that they can be signed.
Here’s an example from their docs:
To create a message to post in a feed, start by filling out these fields:
All messages in a feed are signed by that feed’s long-term secret key, enabling recipients to verify that the message was truly posted by a particular identity and not tampered with as it gets gossiped and replicated throughout the Scuttlebutt network. Before signing a message, it must be serialized according to a specific canonical JSON format, so that for any given message there is exactly one way to serialize it as a sequence of bytes. This is necessary for signature verification to work. The reference implementation verifies that all messages it receives are in the canonical format and rejects messages that aren’t. To read more about the specific rules for this serialization, check out their docs.
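The "exactly one way to serialize" requirement, and the check that rejects anything else, can be sketched as follows. Note that the canonical rules used here (sorted keys, no extra whitespace, UTF-8) are a stand-in chosen for illustration and are not Scuttlebutt's actual rules, which are described in their docs. The function names and message fields are also hypothetical.

```python
import json


def canonical_bytes(message: dict) -> bytes:
    """Serialize a message in one fixed way.

    Illustrative rules only (NOT Scuttlebutt's): sorted keys,
    no optional whitespace, UTF-8 output.
    """
    text = json.dumps(
        message, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def is_canonical(raw: bytes) -> bool:
    """Accept received bytes only if re-serializing yields the same bytes."""
    try:
        message = json.loads(raw.decode("utf-8"))
    except ValueError:  # covers both invalid UTF-8 and invalid JSON
        return False
    return isinstance(message, dict) and canonical_bytes(message) == raw


print(canonical_bytes({"b": 1, "a": 2}))   # b'{"a":2,"b":1}'
print(is_canonical(b'{"a":2,"b":1}'))      # True
print(is_canonical(b'{"b":1,"a":2}'))      # False: same data, other byte order
```

A signature is computed over the canonical bytes, so a receiver that re-serializes the message and gets different bytes cannot verify it. Rejecting non-canonical input up front avoids that ambiguity.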
Although different networks approach serialization differently, it’s an important feature of any distributed system. At Whiteblock, many of us are excited to continue assisting in the development of Ethereum 2.0 and hope to share more details about these improvements soon.
At Whiteblock, we’re actively working on providing the best tools to test and optimize distributed systems. Join us for our official launch on January 15th and stay tuned for all of our upcoming updates via Telegram.