Sunday, April 21, 2019

How to Choose a Big Data Storage Type?

Two storage types are commonly used in Big Data scenarios.

Copy On Write:
Stores data exclusively in columnar file formats (e.g. Parquet). Updates version and rewrite the files, performing a synchronous merge during the write.
Merge On Read:
Stores data using a combination of columnar (e.g. Parquet) and row-based (e.g. Avro) file formats. Updates are logged to delta files and later compacted, synchronously or asynchronously, to produce new versions of the columnar files.
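The two update paths above can be illustrated with a small toy sketch. This is not Hudi code; all class and method names here are hypothetical, and the "files" are plain dictionaries, but the mechanics mirror the descriptions: Copy On Write rewrites a whole new version of the columnar file on every update, while Merge On Read appends to a delta log and merges it with the base file at read (or compaction) time.

```python
class CopyOnWriteTable:
    """Toy model: every update produces a new version of the entire file."""

    def __init__(self, records):
        self.file = dict(records)  # simulates one columnar (Parquet) file
        self.rewrites = 0          # counts full-file rewrites (write cost)

    def update(self, key, value):
        # Synchronous merge during write: rewrite the whole file.
        new_file = dict(self.file)
        new_file[key] = value
        self.file = new_file
        self.rewrites += 1

    def read(self):
        # Reads are cheap: the file is always fully up to date.
        return dict(self.file)


class MergeOnReadTable:
    """Toy model: updates go to a delta log, merged on read or compaction."""

    def __init__(self, records):
        self.base_file = dict(records)  # columnar (Parquet) base file
        self.delta_log = []             # row-based (Avro) delta records

    def update(self, key, value):
        # Cheap write: just append the update to the delta log.
        self.delta_log.append((key, value))

    def read(self):
        # Merge the base file with the delta log at read time.
        merged = dict(self.base_file)
        for key, value in self.delta_log:
            merged[key] = value
        return merged

    def compact(self):
        # Compaction folds the delta log into a new base file version.
        self.base_file = self.read()
        self.delta_log = []
```

Both tables return the same logical data after an update; the difference is where the merge work happens, on the write path (Copy On Write) or on the read/compaction path (Merge On Read).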

The following table summarizes the trade-offs between the two storage types:

Trade-off            Copy On Write                      Merge On Read
Data Latency         Higher                             Lower
Update cost (I/O)    Higher (rewrites entire Parquet)   Lower (appends to delta log)
Parquet File Size    Smaller (high update I/O cost)     Larger (low update cost)
Write Amplification  Higher                             Lower (depends on compaction strategy)
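The table's trade-offs can be condensed into a rule of thumb: prefer Merge On Read for update-heavy ingestion or when fresh data must land quickly, and Copy On Write when reads should stay simple and no compaction pipeline is wanted. The helper below is a hypothetical sketch encoding that rule, not a Hudi API.

```python
def choose_storage_type(update_heavy: bool, needs_fresh_data: bool) -> str:
    """Toy decision helper based on the trade-off table above.

    Merge On Read: lower update cost (I/O) and lower data latency.
    Copy On Write: higher write cost, but read-optimized with no compaction.
    """
    if update_heavy or needs_fresh_data:
        return "MERGE_ON_READ"
    return "COPY_ON_WRITE"
```

For example, a streaming ingestion pipeline with frequent updates would map to `MERGE_ON_READ`, while a mostly-append analytics table would map to `COPY_ON_WRITE`.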

Reference link:
HUDI
