Sunday, April 21, 2019

How to Choose a Big Data Storage Type?

Two storage types are commonly used in Big Data scenarios.

Copy On Write:
Stores data exclusively in columnar file formats (e.g. Parquet). Updates version and rewrite the files, performing a synchronous merge during the write.
Merge On Read:
Stores data using a combination of columnar (e.g. Parquet) and row-based (e.g. Avro) file formats. Updates are logged to delta files and later compacted, synchronously or asynchronously, to produce new versions of the columnar files.
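The two update paths above can be illustrated with a small toy sketch. This is not Hudi code; all class and method names here are hypothetical, and the "files" are plain dictionaries, but the mechanics mirror the descriptions: Copy On Write rewrites a whole new version of the columnar file on every update, while Merge On Read appends to a delta log and merges it with the base file at read (or compaction) time.

```python
class CopyOnWriteTable:
    """Toy model: every update produces a new version of the entire file."""

    def __init__(self, records):
        self.file = dict(records)  # simulates one columnar (Parquet) file
        self.rewrites = 0          # counts full-file rewrites (write cost)

    def update(self, key, value):
        # Synchronous merge during write: rewrite the whole file.
        new_file = dict(self.file)
        new_file[key] = value
        self.file = new_file
        self.rewrites += 1

    def read(self):
        # Reads are cheap: the file is always fully up to date.
        return dict(self.file)


class MergeOnReadTable:
    """Toy model: updates go to a delta log, merged on read or compaction."""

    def __init__(self, records):
        self.base_file = dict(records)  # columnar (Parquet) base file
        self.delta_log = []             # row-based (Avro) delta records

    def update(self, key, value):
        # Cheap write: just append the update to the delta log.
        self.delta_log.append((key, value))

    def read(self):
        # Merge the base file with the delta log at read time.
        merged = dict(self.base_file)
        for key, value in self.delta_log:
            merged[key] = value
        return merged

    def compact(self):
        # Compaction folds the delta log into a new base file version.
        self.base_file = self.read()
        self.delta_log = []
```

Both tables return the same logical data after an update; the difference is where the merge work happens, on the write path (Copy On Write) or on the read/compaction path (Merge On Read).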

The following table summarizes the trade-offs between the two storage types:

Trade-off            Copy On Write                      Merge On Read
Data Latency         Higher                             Lower
Update cost (I/O)    Higher (rewrites entire Parquet)   Lower (appends to delta log)
Parquet File Size    Smaller (high update I/O cost)     Larger (low update cost)
Write Amplification  Higher                             Lower (depends on compaction strategy)
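The table's trade-offs can be condensed into a rule of thumb: prefer Merge On Read for update-heavy ingestion or when fresh data must land quickly, and Copy On Write when reads should stay simple and no compaction pipeline is wanted. The helper below is a hypothetical sketch encoding that rule, not a Hudi API.

```python
def choose_storage_type(update_heavy: bool, needs_fresh_data: bool) -> str:
    """Toy decision helper based on the trade-off table above.

    Merge On Read: lower update cost (I/O) and lower data latency.
    Copy On Write: higher write cost, but read-optimized with no compaction.
    """
    if update_heavy or needs_fresh_data:
        return "MERGE_ON_READ"
    return "COPY_ON_WRITE"
```

For example, a streaming ingestion pipeline with frequent updates would map to `MERGE_ON_READ`, while a mostly-append analytics table would map to `COPY_ON_WRITE`.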

Reference link:
HUDI
