
Saturday, March 20, 2021

Picking the Best Data Storage Solution

When it comes to picking a data storage solution, companies can choose from a variety of options, but the most common ones fall into three categories:
1. Direct Attached Storage (DAS)

2. Network Attached Storage (NAS)

3. Storage Area Network (SAN)


DAS: DAS solutions are typically the simplest and cheapest. High-capacity backup hard drives, solid-state drives, and optical disc drives are all examples of DAS. Because of the relatively low price and small physical footprint, DAS is a great choice for very small businesses with only a handful of employees. Remote access, however, is usually not possible with DAS.

NAS: Small to mid-size companies have more demanding data storage requirements than very small businesses. NAS lets these companies store data in a centralized location that can be accessed remotely from various devices on the network. NAS devices are usually hardware appliances fitted with multiple hard drives in a RAID configuration, and they connect to a switch or router via a network interface card. If your business needs centralized storage, remote access, file sharing, and scalability, consider a NAS solution.

SAN: SAN solutions are a good fit for large companies and enterprises that have the space to house multiple disk arrays, switches, and servers. A SAN provides block-level access to data shared between servers and storage devices across multiple locations. This solution is a necessity for any large company that needs to access massive amounts of data quickly and reliably. A SAN also provides a higher level of data security and is much more fault tolerant than NAS or DAS.

Reference: https://solutionsreview.com/data-storage/data-storage-units-of-measurement-chart-from-smallest-to-largest/

Saturday, July 18, 2020

Elasticsearch Basics (ELK -> {Elasticsearch, Logstash, Kibana})

Elasticsearch stores data as JSON documents.
Elasticsearch is a highly scalable open-source full-text search and analytics engine.
It allows you to store, search, and analyze big volumes of data quickly and in near real time.
It is generally used as the underlying engine/technology that powers applications that have
complex search features and requirements.
Elasticsearch APIs:
- Document API
- Search API
- Indices API
- Cat API
- Cluster API

The Document APIs split into single-document APIs and multi-document APIs.

Single-document APIs:
- Index API (create)
- Get API
- Delete API
- Update API
Multi-document APIs:
- Multi Get API
- Bulk API
- Delete By Query API
- Update By Query API
- Reindex API (copies data from one index to another, e.g. as a backup)
Elasticsearch-to-database mapping:
- _index == database
- _type == table (note: mapping types are deprecated in recent Elasticsearch versions)
- _id == primary key
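Under this mapping, a document's REST path is built from its index, type, and id, so fetching a single document (the Get API) is just an HTTP GET on that path. A minimal sketch, assuming a hypothetical index "sales", type "orders", and id 42 on a local node:

```shell
# Build the document URL from the database-style coordinates
# (index ~ database, type ~ table, id ~ primary key; names are made up).
index=sales
type=orders
id=42
url="http://localhost:9200/${index}/${type}/${id}"
echo "$url"
# With a node running locally, the Get API call would be:
#   curl -X GET "$url"
```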

Monday, May 25, 2020

Spark Debugging tricks

Q1) To list the temporary variables defined in spark-shell, i.e., to check which RDDs are held in the interpreter's memory, using $intp.
Ans:
$intp.definedTerms.map(dT => s"${dT.toTermName}: ${$intp.typeOfTerm(dT.toTermName.toString)}").filter(x => x.contains("()org.apache.spark.rdd.RDD")).foreach(println)

Q2) To get the lineage of an RDD.
Ans:
Use the toDebugString method on the RDD.

Q3) To find out whether an RDD's dependencies are wide or narrow.
Ans:
Use the dependencies method on the RDD.

Monday, May 4, 2020

HDFS commands for production environment

Q1) To check the size of an HDFS directory:
Ans: hdfs dfs -count -v -q -h /user/hive/warehouse/test.db/

Q2) To check whether a file exists:
Ans:
hdfs dfs -test -e /user/hello/test.txt
echo $?   # prints 0 if the file exists, 1 otherwise

Q3) To check whether a file is zero length:
Ans:
hdfs dfs -test -z /user/hello/test.txt
echo $?   # prints 0 if the file is zero length, 1 otherwise
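The $? pattern above works because hdfs dfs -test prints nothing and reports its result through the shell exit code. The same pattern, shown locally with POSIX test as a stand-in for hdfs dfs -test (the path is made up):

```shell
# Local stand-in: `test -e` sets the exit code just like `hdfs dfs -test -e`.
touch /tmp/demo.txt
test -e /tmp/demo.txt
echo "exists: $?"    # prints "exists: 0" because the file is there

rm /tmp/demo.txt
test -e /tmp/demo.txt
echo "exists: $?"    # prints "exists: 1" because the file is gone
```

The same branching works in scripts: `if hdfs dfs -test -e "$path"; then ...; fi`.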

Q4) To set the replication factor of an HDFS file or directory:
Ans:
For a file:
hdfs dfs -setrep -w 4 /user/hello/test.txt
For a directory (applies to every file inside the directory, not to the directory itself):
hdfs dfs -setrep -w 1 -R /user/hello/testdir

Q5) To permanently delete files from the HDFS .Trash folder:
Ans:
hdfs dfs -expunge

Q6) To delete a file or directory, optionally bypassing the .Trash folder:
Ans:
For a file (moves it to the .Trash folder):
hdfs dfs -rm /user/hello/test.txt
For a directory (moves it to the .Trash folder):
hdfs dfs -rm -R /user/hello/testdir
For a file (-skipTrash deletes permanently):
hdfs dfs -rm -skipTrash /user/hello/test.txt
For a directory (-skipTrash deletes permanently):
hdfs dfs -rm -R -skipTrash /user/hello/testdir

Sunday, April 21, 2019

How to Decide on a Big Data Storage Type?

Two storage types are commonly used in big data scenarios.

Copy On Write:
Stores data exclusively in columnar file formats (e.g. Parquet). Updates simply version and rewrite the files by performing a synchronous merge during the write.
Merge On Read:
Stores data using a combination of columnar (e.g. Parquet) and row-based (e.g. Avro) file formats. Updates are logged to delta files and later compacted to produce new versions of the columnar files, synchronously or asynchronously.
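The two strategies can be illustrated with plain files (a toy sketch; base.txt and delta.log are made-up stand-ins for the Parquet and Avro files):

```shell
# A "base file" with two rows.
printf 'row1\nrow2\n' > /tmp/base.txt

# Copy-on-write: an update rewrites the whole base file as a new version.
sed 's/row2/row2-v2/' /tmp/base.txt > /tmp/base_v2.txt

# Merge-on-read: the same update is only appended to a delta log...
echo 'row2-v2' > /tmp/delta.log
# ...and readers (or a later compaction) merge base + delta on the fly.
cat /tmp/base.txt /tmp/delta.log
```

This is why copy-on-write pays a high write cost per update (the whole file is rewritten) while merge-on-read defers that cost to readers and compaction.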

The following table summarizes the trade-offs between the two storage types:

Trade-off             CopyOnWrite                        MergeOnRead
Data latency          Higher                             Lower
Update cost (I/O)     Higher (rewrites entire Parquet)   Lower (appends to delta log)
Parquet file size     Smaller (high update/I/O cost)     Larger (low update cost)
Write amplification   Higher                             Lower (depends on compaction strategy)

Reference link:
HUDI

Monday, March 4, 2019

Sqoop Import Analysis

Below is an analysis of importing the same set of records from MySQL to HDFS in different file formats.
--as-textfile: (Size in HDFS: 61 KB) (Time taken: 70 seconds)
19/03/04 04:27:43 INFO mapreduce.ImportJobBase: Transferred 60.8369 KB in 69.8476 seconds (891.8985 bytes/sec)
19/03/04 04:27:43 INFO mapreduce.ImportJobBase: Retrieved 1114 records.

--as-avrodatafile: (Size in HDFS: 58 KB) (Time taken: 63 seconds)
19/03/04 04:24:07 INFO mapreduce.ImportJobBase: Transferred 58.124 KB in 62.5104 seconds (952.1461 bytes/sec)
19/03/04 04:24:07 INFO mapreduce.ImportJobBase: Retrieved 1114 records.

--as-sequencefile: (Size in HDFS: 81 KB) (Time taken: 109 seconds)
19/03/04 04:14:57 INFO mapreduce.ImportJobBase: Transferred 80.7959 KB in 109.3956 seconds (756.2921 bytes/sec)
19/03/04 04:14:57 INFO mapreduce.ImportJobBase: Retrieved 1114 records.

--as-parquetfile: (Size in HDFS: 28 KB) (Time taken: 82 seconds)
19/03/04 04:17:17 INFO mapreduce.ImportJobBase: Transferred 32.2197 KB in 82.2192 seconds (401.281 bytes/sec)
19/03/04 04:17:17 INFO mapreduce.ImportJobBase: Retrieved 1114 records.

Conclusion:
If space is the main concern, use: parquetfile.
If speed is the main concern, use: avrodatafile.
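For reference, this is the general shape of the import command behind these numbers; the connection string, table name, and target directory are hypothetical, and only the --as-* flag changes between runs. The command is echoed so the sketch runs without a cluster; execute the printed command on a real one.

```shell
# Hypothetical Sqoop import; swap $fmt per run to compare formats.
fmt=--as-parquetfile   # smallest footprint in the results above
cmd="sqoop import \
  --connect jdbc:mysql://localhost/testdb \
  --username retail -P \
  --table orders \
  --target-dir /user/hive/warehouse/test.db/orders \
  $fmt"
echo "$cmd"
```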

Wednesday, December 5, 2018

Single-Node Kafka Cluster on a Windows 10 System

1) Install JDK 1.8

2) Go to https://kafka.apache.org/downloads, download
Scala 2.11 - kafka_2.11-2.0.0.tgz, and extract it using 7-Zip

3) ZooKeeper configuration:
Create a directory "C:\kafka_2.11-2.0.0\zookeeper_data".
In kafka_2.11-2.0.0\config\zookeeper.properties, set:
dataDir=C:\kafka_2.11-2.0.0\zookeeper_data

4) Kafka server properties configuration:
Create a Kafka log directory "C:\kafka_2.11-2.0.0\kafka_log".
In kafka_2.11-2.0.0\config\server.properties, set:
log.dirs=C:\kafka_2.11-2.0.0\kafka_log
A few more properties need to be added to the same file:
offsets.topic.num.partitions=1
offsets.topic.replication.factor=1
min.insync.replicas=1
default.replication.factor=1

5) Run and test Kafka with the following steps.
Add "C:\kafka_2.11-2.0.0\bin\windows" to the PATH environment variable.
On Windows, the batch files live in the bin\windows directory, so run them from there.
Step 1: Start ZooKeeper ->
C:\kafka_2.11-2.0.0\bin\windows>zookeeper-server-start.bat C:\kafka_2.11-2.0.0\config\zookeeper.properties
Step 2: Start the Kafka server ->
C:\kafka_2.11-2.0.0\bin\windows>kafka-server-start.bat C:\kafka_2.11-2.0.0\config\server.properties
Step 3: Check the number of Kafka nodes ->
C:\kafka_2.11-2.0.0\bin\windows>zookeeper-shell.bat localhost:2181 ls /brokers/ids
Step 4: Create a Kafka topic ->
C:\kafka_2.11-2.0.0\bin\windows>kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Step 5: Start a console producer to send messages ->
C:\kafka_2.11-2.0.0\bin\windows>kafka-console-producer.bat --broker-list localhost:9092 --topic test
>Hello Suresh!
>kafka is amazing
>Terminate batch job (Y/N)? y
Step 6: Start a console consumer to read the messages ->
C:\kafka_2.11-2.0.0\bin\windows>kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test --from-beginning
Hello Suresh!
kafka is amazing

Press Ctrl+C to stop each process and close the command prompt.