When it comes to picking a data storage solution, companies can choose from a variety of options, but the most common ones fall into three categories:
1. Direct Attached Storage (DAS)
2. Network Attached Storage (NAS)
3. Storage Area Network (SAN)
DAS:
DAS solutions are typically the simplest and cheapest. High-capacity backup hard drives, solid-state drives, and optical disc drives are all examples of DAS. Because of its relatively low price and small physical footprint, DAS is a great choice for very small businesses with only a handful of employees. Remote access is usually not possible with DAS.
NAS:
Small to mid-size companies have more demanding data storage requirements than very small businesses. NAS lets these companies store data in a centralized location that can be accessed remotely from various devices on the network. A NAS device is usually a piece of hardware fitted with multiple hard drives in a RAID configuration and connected to a switch or router through a network interface card. If your business needs centralized storage, remote access, file sharing, and scalability, consider a NAS solution.
SAN:
SAN solutions are a good fit for large companies and enterprises that have the space to host multiple disk arrays, switches, and servers. A SAN provides block-level access to large volumes of data between servers and storage devices, potentially across multiple locations. This solution is a necessity for any large company that needs to access massive amounts of data quickly and reliably. A SAN also provides a higher level of data security and is much more fault tolerant than NAS or DAS.
Reference: https://solutionsreview.com/data-storage/data-storage-units-of-measurement-chart-from-smallest-to-largest/
History repeats itself....
Saturday, March 20, 2021
Saturday, July 18, 2020
Elasticsearch basics (ELK -> {Elasticsearch, Logstash, Kibana})
Elasticsearch stores data as JSON documents.
Elasticsearch is a highly scalable open-source full-text search and analytics engine.
It allows you to store, search, and analyze big volumes of data quickly and in near real time.
It is generally used as the underlying engine/technology that powers applications that have
complex search features and requirements.
Elasticsearch APIs:
-: Document API
-: Search API
-: Indices API
-: Cat API
-: Cluster API
The Document API is split into single-document APIs and multi-document APIs.
Single-document APIs (illustrated with curl after this list):
-: Index API (create)
-: Get API
-: Delete API
-: Update API
Multi-document APIs:
-: Multi Get API
-: Bulk API
-: Delete By Query API
-: Update By Query API
-: Reindex API (copies data from one index to another, e.g. as a backup)
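A quick, hedged sketch of the single-document APIs using curl (assuming an Elasticsearch node on localhost:9200; the "customer" index and id 1 are purely illustrative, and the Update call is shown in its 7.x endpoint form):
# Index API (create or replace document 1 in the customer index)
curl -X PUT "localhost:9200/customer/_doc/1" -H 'Content-Type: application/json' -d '{"name": "test user"}'
# Get API (fetch the document back)
curl -X GET "localhost:9200/customer/_doc/1"
# Update API (partial update of a field)
curl -X POST "localhost:9200/customer/_update/1" -H 'Content-Type: application/json' -d '{"doc": {"name": "test user2"}}'
# Delete API (remove the document)
curl -X DELETE "localhost:9200/customer/_doc/1"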
Elasticsearch-to-database mapping:
-: _index == Database
-: _type == Table
-: _id == Primary key
Monday, May 25, 2020
Spark Debugging tricks
Q1) To list the RDD variables defined in a spark-shell session (i.e. check what is held in memory) using $intp?
Ans:
$intp.definedTerms.map(dT => s"${dT.toTermName}: ${$intp.typeOfTerm(dT.toTermName.toString)}").filter(x => x.contains("()org.apache.spark.rdd.RDD")).foreach(println)
Q2) To get the lineage of an RDD?
Ans:
Use the toDebugString method on the RDD.
Q3) To know whether an RDD has wide or narrow dependencies?
Ans:
Use the dependencies method on the RDD (see the sketch below).
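A small spark-shell sketch tying Q2 and Q3 together (the input path is illustrative):
scala> val words  = sc.textFile("/user/hello/test.txt").flatMap(_.split(" "))
scala> val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
scala> counts.toDebugString    // prints the lineage: ShuffledRDD <- MapPartitionsRDD <- ... <- textFile
scala> words.dependencies      // OneToOneDependency  -> narrow dependency (no shuffle)
scala> counts.dependencies     // ShuffleDependency   -> wide dependency (shuffle from reduceByKey)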
Monday, May 4, 2020
HDFS commands for production environment
Q1) To check the size of an HDFS directory?
Ans: hdfs dfs -count -v -q -h /user/hive/warehouse/test.db/
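With -q and -v the output includes a header row; in recent Hadoop versions the columns are as below (CONTENT_SIZE is the total size of the directory):
QUOTA  REM_QUOTA  SPACE_QUOTA  REM_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME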
Q2) To check whether a file exists?
Ans:
hdfs dfs -test -e /user/hello/test.txt
echo $?    # prints 0 if the file exists, 1 otherwise
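In a shell script the exit status can be used directly, for example (path illustrative):
if hdfs dfs -test -e /user/hello/test.txt; then
  echo "file exists"
else
  echo "file does not exist"
fi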
Q3) To check whether a file is zero length?
Ans:
hdfs dfs -test -z /user/hello/test.txt
echo $?    # prints 0 if the file is zero length, 1 otherwise
Q4) To set the replication factor of an HDFS file or directory?
Ans:
For File:
hdfs dfs -setrep -w 4 /user/hello/test.txt
For Directory: (applies to each file inside the directory, not to the directory itself)
hdfs dfs -setrep -w 1 -R /user/hello/testdir
Q5) To empty the HDFS Trash (permanently remove files already in the Trash)?
Ans:
hdfs dfs -expunge
Q6) To delete a file or directory, optionally bypassing the Trash?
Ans:
For File: (moves it to the .Trash folder)
hdfs dfs -rm /user/hello/test.txt
For Directory: (moves it to the .Trash folder)
hdfs dfs -rm -R /user/hello/testdir
For File: (-skipTrash deletes permanently)
hdfs dfs -rm -skipTrash /user/hello/test.txt
For Directory: (-skipTrash deletes permanently)
hdfs dfs -rm -R -skipTrash /user/hello/testdir
Sunday, April 21, 2019
How to decide on a big data storage type?
Usually, two storage types are used in big data scenarios.
Copy On Write:
Stores data using exclusively columnar file formats (e.g. Parquet). Updates simply version and rewrite the files by performing a synchronous merge during write.
Merge On Read:
Stores data using a combination of columnar (e.g. Parquet) and row-based (e.g. Avro) file formats. Updates are logged to delta files and later compacted to produce new versions of the columnar files, synchronously or asynchronously.
The following table summarizes the trade-offs between these two storage types:
| Trade-off | CopyOnWrite | MergeOnRead |
|---|---|---|
| Data Latency | Higher | Lower |
| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta log) |
| Parquet File Size | Smaller (higher update (I/O) cost) | Larger (lower update cost) |
| Write Amplification | Higher | Lower (depending on compaction strategy) |
Reference link: Apache HUDI (https://hudi.apache.org/)
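As a sketch of how the choice shows up in practice when writing with Spark (a minimal, hedged example assuming a recent Apache Hudi release with its Spark datasource on the classpath; inputDF, the table name, key/precombine columns, and the path are all illustrative):
// storage-type decision: "COPY_ON_WRITE" (the default) or "MERGE_ON_READ"
inputDF.write.format("hudi")
  .option("hoodie.table.name", "test_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("/user/hive/warehouse/hudi/test_table")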
Monday, March 4, 2019
Sqoop Import Analysis
Below is an analysis of importing the same set of records from MySQL to HDFS using different file formats.
--as-textfile: (Size in HDFS: 61 KB) (Time taken: 70 seconds)
19/03/04 04:27:43 INFO mapreduce.ImportJobBase: Transferred 60.8369 KB in 69.8476 seconds (891.8985 bytes/sec)
19/03/04 04:27:43 INFO mapreduce.ImportJobBase: Retrieved 1114 records.
--as-avrodatafile: (Size in HDFS: 58 KB) (Time taken: 63 seconds)
19/03/04 04:24:07 INFO mapreduce.ImportJobBase: Transferred 58.124 KB in 62.5104 seconds (952.1461 bytes/sec)
19/03/04 04:24:07 INFO mapreduce.ImportJobBase: Retrieved 1114 records.
--as-sequencefile: (Size in HDFS: 81 KB) (Time taken: 109 seconds)
19/03/04 04:14:57 INFO mapreduce.ImportJobBase: Transferred 80.7959 KB in 109.3956 seconds (756.2921 bytes/sec)
19/03/04 04:14:57 INFO mapreduce.ImportJobBase: Retrieved 1114 records.
--as-parquetfile: (Size in HDFS: 28 KB) (Time taken: 82 seconds)
19/03/04 04:17:17 INFO mapreduce.ImportJobBase: Transferred 32.2197 KB in 82.2192 seconds (401.281 bytes/sec)
19/03/04 04:17:17 INFO mapreduce.ImportJobBase: Retrieved 1114 records.
Conclusion:
If space is the main concern, use --as-parquetfile.
If speed is the main concern, use --as-avrodatafile.
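For reference, each run above is produced by a command of roughly this shape, changing only the --as-* flag (connection details, table name, and target directory here are illustrative, not the ones used in the test):
sqoop import \
  --connect jdbc:mysql://localhost:3306/testdb \
  --username retail_user -P \
  --table orders \
  --target-dir /user/hello/sqoop/orders \
  --as-parquetfile \
  -m 1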
Wednesday, December 5, 2018
Single-node Kafka cluster on a Windows 10 system.
1) Install JDK 1.8.
2) Go to https://kafka.apache.org/downloads, download "Scala 2.11 - kafka_2.11-2.0.0.tgz", and extract it using 7-Zip.
3) ZooKeeper configuration:
Create a directory "C:\kafka_2.11-2.0.0\zookeeper_data".
Edit kafka_2.11-2.0.0\config\zookeeper.properties and set:
dataDir=C:\kafka_2.11-2.0.0\zookeeper_data
4) Kafka server properties configuration:
Create a Kafka log directory "C:\kafka_2.11-2.0.0\kafka_log".
Edit kafka_2.11-2.0.0\config\server.properties and set:
log.dirs=C:\kafka_2.11-2.0.0\kafka_log
A few more properties need to be added to the same server.properties file (single-node values):
offsets.topic.num.partitions=1
offsets.topic.replication.factor=1
min.insync.replicas=1
default.replication.factor=1
5) Running and testing Kafka with the following steps:
Add "C:\kafka_2.11-2.0.0\bin\windows" to the 'path' environment variable.
On a Windows system, go to the bin\windows directory and run the batch files below.
Step 1: Start ZooKeeper ->
C:\kafka_2.11-2.0.0\bin\windows>zookeeper-server-start.bat C:\kafka_2.11-2.0.0\config\zookeeper.properties
Step 2: Start the Kafka server ->
C:\kafka_2.11-2.0.0\bin\windows>kafka-server-start.bat C:\kafka_2.11-2.0.0\config\server.properties
Step 3: Check the number of Kafka brokers ->
C:\kafka_2.11-2.0.0\bin\windows>zookeeper-shell.bat localhost:2181 ls /brokers/ids
Step 4: Create a Kafka topic ->
C:\kafka_2.11-2.0.0\bin\windows>kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Step 5: Start a console producer to send messages ->
C:\kafka_2.11-2.0.0\bin\windows>kafka-console-producer.bat --broker-list localhost:9092 --topic test
>Hello Suresh!
>kafka is amazing
>Terminate batch job (Y/N)? y
Step 6: Start a console consumer to read the messages ->
C:\kafka_2.11-2.0.0\bin\windows>kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test --from-beginning
Hello Suresh!
kafka is amazing
Press Ctrl+C to stop the consumer.
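Optionally, the topic just created can be verified from the same windows directory (flags as used by the ZooKeeper-based tooling in Kafka 2.0):
C:\kafka_2.11-2.0.0\bin\windows>kafka-topics.bat --list --zookeeper localhost:2181
C:\kafka_2.11-2.0.0\bin\windows>kafka-topics.bat --describe --zookeeper localhost:2181 --topic test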