
Sunday, April 21, 2019

How to take Big Data Storage type Decision?

Two storage types are commonly used in Big Data scenarios.

Copy On Write:
Stores data exclusively in columnar file formats (e.g. Parquet). Updates version and rewrite the affected files, performing a synchronous merge during the write.
Merge On Read:
Stores data using a combination of columnar (e.g. Parquet) and row-based (e.g. Avro) file formats. Updates are logged to delta files and later compacted, synchronously or asynchronously, to produce new versions of the columnar files.
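The difference can be illustrated with a toy sketch (this is not Hudi's actual implementation; the class and method names are hypothetical): Copy On Write rewrites its base "file" on every upsert, while Merge On Read appends to a delta log and merges when reading or compacting.

```python
class CopyOnWriteTable:
    def __init__(self):
        self.base = {}          # columnar base file: key -> record

    def upsert(self, key, record):
        # Synchronous merge: rewrite the whole base file with the update applied.
        merged = dict(self.base)
        merged[key] = record
        self.base = merged      # the rewritten file becomes the new version

    def read(self):
        return dict(self.base)


class MergeOnReadTable:
    def __init__(self):
        self.base = {}          # columnar base file
        self.delta = []         # row-based delta log (e.g. Avro records)

    def upsert(self, key, record):
        # Cheap write: just append to the delta log.
        self.delta.append((key, record))

    def compact(self):
        # Compaction (sync or async): fold the delta log into a new base file.
        for key, record in self.delta:
            self.base[key] = record
        self.delta = []

    def read(self):
        # Readers merge base + delta on the fly, hence the extra read cost.
        merged = dict(self.base)
        for key, record in self.delta:
            merged[key] = record
        return merged
```

Both tables return the same data; they differ only in where the merge cost is paid — at write time (CoW) or at read/compaction time (MoR).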

The following table summarizes the trade-offs between the two storage types:

Trade-off             CopyOnWrite                        MergeOnRead
Data latency          Higher                             Lower
Update cost (I/O)     Higher (rewrites entire Parquet)   Lower (appends to delta log)
Parquet file size     Smaller (high update I/O cost)     Larger (low update cost)
Write amplification   Higher                             Lower (depends on compaction strategy)

Reference link:
HUDI

Monday, March 4, 2019

Sqoop Import Analysis

Below is an analysis of importing the same set of records from MySQL to HDFS using different file formats.
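For reference, the runs below correspond to Sqoop commands of roughly this shape (the JDBC URL, database, table, and target directory are hypothetical placeholders; the `--as-*` flags are the real Sqoop format options being compared):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost:3306/testdb \
  --username dbuser -P \
  --table customers \
  --target-dir /user/hive/imports/customers \
  --as-parquetfile   # swap for --as-textfile / --as-avrodatafile / --as-sequencefile
```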
--as-textfile: (Size in HDFS: 61 KB) (Time taken: 70 seconds)
19/03/04 04:27:43 INFO mapreduce.ImportJobBase: Transferred 60.8369 KB in 69.8476 seconds (891.8985 bytes/sec)
19/03/04 04:27:43 INFO mapreduce.ImportJobBase: Retrieved 1114 records.

--as-avrodatafile: (Size in HDFS: 58 KB) (Time taken: 63 seconds)
19/03/04 04:24:07 INFO mapreduce.ImportJobBase: Transferred 58.124 KB in 62.5104 seconds (952.1461 bytes/sec)
19/03/04 04:24:07 INFO mapreduce.ImportJobBase: Retrieved 1114 records.

--as-sequencefile: (Size in HDFS: 81 KB) (Time taken: 109 seconds)
19/03/04 04:14:57 INFO mapreduce.ImportJobBase: Transferred 80.7959 KB in 109.3956 seconds (756.2921 bytes/sec)
19/03/04 04:14:57 INFO mapreduce.ImportJobBase: Retrieved 1114 records.

--as-parquetfile: (Size in HDFS: 28 KB) (Time taken: 82 seconds)
19/03/04 04:17:17 INFO mapreduce.ImportJobBase: Transferred 32.2197 KB in 82.2192 seconds (401.281 bytes/sec)
19/03/04 04:17:17 INFO mapreduce.ImportJobBase: Retrieved 1114 records.
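The logged numbers (1114 records per run) can be tabulated to make the comparison explicit; the throughput formula below reproduces the bytes/sec figures printed in the log lines:

```python
# (transferred KB, elapsed seconds) taken from the ImportJobBase log lines above
runs = {
    "textfile":     (60.8369, 69.8476),
    "avrodatafile": (58.1240, 62.5104),
    "sequencefile": (80.7959, 109.3956),
    "parquetfile":  (32.2197, 82.2192),
}

for fmt, (kb, secs) in runs.items():
    rate = kb * 1024 / secs  # bytes/sec, same calculation Sqoop logs
    print(f"--as-{fmt:13s} {kb:8.2f} KB  {secs:7.1f} s  {rate:7.1f} B/s")

smallest = min(runs, key=lambda f: runs[f][0])  # least data transferred
fastest = min(runs, key=lambda f: runs[f][1])   # shortest elapsed time
print(f"smallest: {smallest}, fastest: {fastest}")
```

This confirms the conclusion drawn below: Parquet transfers the least data, Avro finishes fastest.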

Conclusion:
If space is the main concern, use --as-parquetfile.
If speed is the main concern, use --as-avrodatafile.