Brief Introduction to Hadoop Data Storage Formats
Agenda
1. What is a Hadoop data format?
2. How does that affect the application?
3. Different data formats
4. Popular Hadoop data formats and the benefits of selecting a particular format
5. Sequence Files
6. Parquet
7. Avro
8. ORC
What is Hadoop Data Format?
A storage format is simply a way of defining how information is stored in a file. When dealing with Hadoop’s file system, you not only have all the traditional storage formats available (you can store PNG and JPG images on HDFS if you like), but also several Hadoop-focused file formats for structured and unstructured data. A major cost for HDFS-backed applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. Choosing an appropriate file format can therefore bring significant benefits.
Benefits from Selecting Appropriate Format
1. Faster reads
2. Faster writes
3. Splittable files (when we need to read only part of a file rather than the whole file)
4. Schema changes or evolution
5. Compression support
Different Data Formats
1. Avro
2. Parquet
3. ORC
4. Arrow (Incubating)
Text file Format
Simple text-based files are common in the non-Hadoop world, and they’re super common in the Hadoop world too. Data is laid out in lines, with each line being a record. Lines are terminated by a newline character “\n”.
Text files are inherently splittable (just split on \n characters!), but if you want to compress them you’ll have to use a file-level compression codec that supports splitting, such as bzip2.
Because these files are just text, you can encode anything you like in a line of the file. One common example is to make each line a JSON document to add some structure. While this wastes space with repeated field names, it is a simple way to start using structured data in HDFS; a minimal sketch follows the summary below. To Summarise,
1. Simple format
2. Data is laid out in lines
3. Lines are terminated by \n
4. Format is inherently splittable
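To make this concrete, here is a minimal sketch of writing newline-delimited JSON records to HDFS with the Hadoop FileSystem API. The path and record contents are hypothetical; as far as HDFS is concerned, the records are just lines of text.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TextFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/events.jsonl");  // hypothetical path
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(fs.create(path, true), StandardCharsets.UTF_8))) {
            // One JSON document per line; the "\n" record boundary is
            // exactly what makes plain text files splittable.
            writer.write("{\"id\": 1, \"event\": \"login\"}\n");
            writer.write("{\"id\": 2, \"event\": \"logout\"}\n");
        }
        fs.close();
    }
}

Any line-oriented reader, such as MapReduce’s TextInputFormat or Spark’s textFile, can then split this file on newline boundaries across multiple tasks.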
Sequence File Format
Sequence files were originally designed for MapReduce, so the integration is smooth. They encode a key and a value for each record. Records are stored in a binary format that is smaller than a text-based format.
Sequence files by default use Hadoop’s Writable interface to figure out how to serialize and deserialize classes to the file.
One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks.
Sequence files are well supported across Hadoop and many other HDFS-enabled projects; a minimal writer sketch follows the summary below. To Summarise,
1. They encode a key and value for each record
2. Records are stored in binary format.
3. Support block level compression
4. Well supported across the Hadoop ecosystem
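As an illustration, here is a minimal sketch that writes a block-compressed sequence file with Text keys and values; the path and record contents are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/events.seq");  // hypothetical path

        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    // Text implements the Writable interface, so the
                    // writer knows how to serialize keys and values.
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    // Block-level compression: batches of records are
                    // compressed together, yet the file stays splittable.
                    SequenceFile.Writer.compression(CompressionType.BLOCK));
            writer.append(new Text("1"), new Text("login"));
            writer.append(new Text("2"), new Text("logout"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}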
Avro
Avro is an opinionated format which understands that data stored in HDFS is usually not a simple key/value combo like int or string. The format encodes the schema of its contents directly in the file, which allows you to store complex objects natively. Avro does not really feel like a file format; it is a file format plus a serialization and deserialization framework.
Avro is a well-thought-out format which defines file data schemas in JSON (for interoperability) and supports serialization and deserialization in multiple languages. It also supports block-level compression. For most Hadoop-based use cases Avro is a really good choice; a minimal writer sketch follows the summary below. To Summarise,
1. Language-neutral data serialization system
2. Schema is always present
3. Schema is in JSON
4. Data is binary encoded
5. Datafile has a metadata section where schema is stored
6. Supports compression
7. Files are splittable
8. Supported by Pig, Hive, Crunch, Spark
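Here is a minimal sketch of that flow: a schema defined in JSON is parsed, records are binary encoded, and the writer embeds the schema in the data file’s metadata. The Event record and its fields are hypothetical.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // The schema is plain JSON; it travels with the data file.
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"Event\"," +
            " \"fields\": [" +
            "   {\"name\": \"id\", \"type\": \"long\"}," +
            "   {\"name\": \"event\", \"type\": \"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("event", "login");

        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.deflateCodec(6));  // block compression
            writer.create(schema, new File("events.avro")); // schema goes into metadata
            writer.append(record);                          // binary-encoded record
        }
    }
}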
Parquet
The latest buzz in file formats for Hadoop is columnar file storage. Instead of just storing rows of data adjacent to one another, you also store column values adjacent to each other, so datasets are partitioned both horizontally and vertically.
One huge benefit of column-oriented file formats is that data in the same column is compressed together, which can yield massive storage optimizations, since values in the same column tend to be similar.
Overall these formats can drastically optimize workloads, especially for Hive and Spark, which tend to read only a subset of columns rather than whole records (full scans are more common in MapReduce); a short sketch illustrating column pruning follows the summary below. To Summarise,
1. Columnar format
2. Homogeneous data comes together, better for compression
3. Not good if queries often need the entire row
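The following sketch uses Spark’s Java API to illustrate the point: because Parquet is columnar, selecting two columns means Spark reads only those column chunks from disk rather than entire rows. The path and column names are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parquet-column-pruning")
                .master("local[*]")  // for local testing
                .getOrCreate();

        // Column pruning: only the "id" and "event" column chunks are
        // read from disk, not the full rows.
        Dataset<Row> events = spark.read().parquet("/tmp/events.parquet");
        events.select("id", "event").show();

        spark.stop();
    }
}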
ORC
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75%. As a result, the speed of data processing increases, and ORC shows better performance than the Text, Sequence and RC file formats.
An ORC file stores rows in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data. However, ORC files increase CPU overhead because of the time it takes to decompress the relational data. The ORC file format was introduced in Hive 0.11 and cannot be used with earlier versions. A minimal writer sketch follows.
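Here is a minimal sketch of writing an ORC file with the ORC Core Java API; rows are buffered into vectorized batches, which the writer groups into stripes. The schema, path and values are hypothetical.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TypeDescription schema =
                TypeDescription.fromString("struct<id:bigint,event:string>");

        Writer writer = OrcFile.createWriter(new Path("/tmp/events.orc"),
                OrcFile.writerOptions(conf).setSchema(schema));

        // Rows go into a vectorized batch; full batches become stripes.
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector event = (BytesColumnVector) batch.cols[1];

        int row = batch.size++;
        id.vector[row] = 1L;
        event.setVal(row, "login".getBytes(StandardCharsets.UTF_8));

        writer.addRowBatch(batch);
        writer.close();
    }
}

From Hive itself, the same format is selected declaratively with CREATE TABLE ... STORED AS ORC.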
With this, we come to the end of this article. I hope it serves as a quick reference guide for revisiting the popular Hadoop data storage formats.
After reading this article, I recommend you go through my other article on Frequent HDFS Commands to run on top of HDFS.
If you like the article, do leave a clap, it really motivates me to write more.
See you in some other article..:)