- AUTHOR Hrishikesh Tak
- PUBLISHED ON December 08, 2020
What is Data Serialization?
Data serialization is the process of converting data objects into byte streams for storage, transfer and distribution on any physical device. The translation of data structures or object state from the in-memory representation to a byte sequence is called data serialization (also known as encoding or marshalling). The reverse is called decoding or deserialization or unmarshalling. Storing and exchanging data between varying environments requires a platform-and-language-neutral data format that all systems understand.
Programs usually works with data in (at least) two different representations:
- In memory, data is kept in objects, structs, lists, arrays, hash tables, trees and so on. These data structures are accessed typically by using Pointers.
- To write data to a file or send it over the network, encode it as some kind of self-contained sequence of bytes (for e.g., a JSON document), since a pointer wouldn’t make sense to any other process.
In the above diagram, serialization and deserialization process is explained. In Serialization, object is transformed to stream of bytes and then converted into files, DB and memory. In Deserialization, File, DB and Memory are transformed into stream of bytes and then converted into objects.
- Security across all applications of user-specific data – Data can be kept secured while transferring from one device to another.
- Transferring Data through the Wires/ Store and re-create objects as per need – Web and mobile apps can transfer objects from client to server and the vice versa.
- Storing Data into Databases or on Hard Drives/ Passing objects from one domain to other – A method which involves converting program objects into byte streams and then storing them into DBs.
- Remote Procedure Call (RPC) – A protocol, where one program can be used to request a service from a program on another computer on a network without needing to know that network’s details.
- Distributing Objects in a Distributed Object Model – Used for instances when programs running on diverse platforms written in different languages have to share object data over a distributed network.
Data Serialization Formats
- Language-Specific Formats:
- Many programming languages come with built-in support for encoding in-memory objects into bytes sequences. For example, Python has pickle, Ruby has Marshal and so on.
- These encoding libraries are very convenient, because they allow in-memory objects to be saved and restored with minimal additional code.
- However, they also have a number of deep problems:
○ The encoding is often tied to a particular programming language, and reading data in another language is very difficult.
○ Versioning data is often an afterthought in these libraries.
- Textual Formats:
- JSON, XML and CSV are textual formats (Human readable).
- JSON’s popularity is mainly due to its built-in support for web browsers and simplicity relative to XML.
- They also have some subtle problems:
○ There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits. JSON distinguishes strings and numbers, but it doesn’t distinguish integer and floating-point numbers and it doesn’t specify precision.
- Despite these flaws, JSON, XML and CSV are good enough for data interchange formats (i.e. for sending data from one organization to another).
- Binary Formats:
- The Binary Formats is used only internally within your organization, and is obviously more compressed, so storage usage will be lower.
- JSON is less verbose than XML, but both still use a lot of space compared to binary formats.
- The three frameworks are: Thrift, Protocol Buffers and Avro here, all of which offer efficient, cross-language serialization of data using a scheme
- Thrift — from Facebook, almost the same when it comes to functionalities as Google’s Protocol Buffers, but subjectively Protobuf is easier to use. Documentation is very detailed and extensive. Support and tools for Python, Java and Scala are on a very good level. That’s why I have chosen Protocol Buffer vs Avro for the final comparison.
- The detailed comparison of Thrift, Protocol Buffers and Avro can be found in this great post by Martin Kleppmann.
Protocol Buffers, is a data interchange format developed for internal use. It has been offered by Google for open source projects since 2008. The binary format enables applications to store as well as exchange structured data to overcome some complications and these programs can even be written in different programming languages; namely: C#, C++, Java, and Python.
- Protocol Buffers (a.k.a., protobuf) are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data.
- Protocol buffers currently support generated code in Java, Python, Objective-C, and C++. With new proto3 language version, you can also work with Go, Ruby, and C#, etc.
- It is a binary encoding format that allows you to specify a schema for your data using a specification language, like so:
- The above snippet defines a schema for a Person data type that has three fields: id, name and email. In addition to naming a field, you can provide a type that will determine how the data is encoded and sent over the wire – above we see an int32 type and a string type. Keywords for validation and structure are also provided (required and optional) and fields are numbered, which aids in backward compatibility.
- Protocol Buffer comes with a code generation tool in various languages : Java, C, Go, Python, etc. are all supported, that takes a schema definition and produces classes that implements schema in various programming languages. Your application code can call this generated code to encode or decode records of the schema. What this means is that one spec can be used to transfer data between systems regardless of their implementation language.
- With numbered fields, you never have to change the behavior of code going forward to maintain backward compatibility with older versions of code.
Some advantages of using Protocol Buffer are: language interoperability, flexible, clear schemes, backward and forward compatibility.
Apache Avro is a compact, fast, binary data serialization system that relies on schemas. When data is read, the schema used when writing it is always present in the container file for storing persistent data Works both with code generation as well as in a dynamic manner. Avro supports languages like C, C++, Java, Python and Ruby.
Avro requires the use of schemas when data is either written or read; that can be used for serialization and deserialization and Avro will take care of the missing, extra or modified fields.
What does AVRO Stand for?
AVRO is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. Avro stores the data definition in JSON format making it easy to read and interpret; the data itself is stored in binary format making it compact and efficient.
- The above snippet defines a schema for a Person data type that has three fields: id, name and email. Fields are defined via an array of objects, each of which defines a name and type. The type attribute of a field is another schema object, which can be either a primitive or complex
- For example, the id and name fields of our schema are the primitive type int and string respectively, whereas the email field is unions, represented by JSON arrays. unions are a complex type that can be any of the types listed in the array; e.g., email can either be a string or null, essentially making it an optional field.
- Avro schemas are defined using JSON, You can say that Avro format is actually a combination of a JSON data structure and a schema for validation purposes.
- Schema evolution in Avro, Protocol Buffers and Thrift
- Apache Avro Python3 Issues