Get a few lines of HDFS data





I have a 2 GB file in HDFS.





Is it possible to get a few lines of that data, ideally at random,
as we do on the Unix command line?


cat iris2.csv |head -n 50





-n 2 does not give random data... it returns the first 2 lines.
– Jasper
Feb 28 '14 at 9:23




7 Answers



Native head


hadoop fs -cat /your/file | head



is efficient here, because cat closes the stream as soon as head has finished reading the lines it needs.
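The early exit can be seen without a cluster. In this minimal sketch, `yes` stands in for `hadoop fs -cat` on a very large file, and `first_lines` is a hypothetical helper name, not a Hadoop command:

```shell
# Print the first N lines of stdin (default 10), then exit.
first_lines() {
  head -n "${1:-10}"
}

# `yes` would write "line" forever. Once head exits, the producer's next
# write fails with a broken pipe and the whole pipeline ends.
yes line | first_lines 3

# Assumed usage against HDFS (the path is the placeholder used above):
# hadoop fs -cat /your/file | first_lines 50
```

The pipeline returns almost immediately even though the producer is unbounded, which is why only the beginning of the HDFS file needs to be streamed.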



To get the tail, Hadoop has a dedicated, efficient command:


hadoop fs -tail /your/file



Unfortunately, it returns the last kilobyte of the data, not a given number of lines.
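If the lines you want fit inside that last kilobyte, the output can be trimmed to whole lines with standard tools. A minimal sketch, where `last_whole_lines` is a hypothetical helper name and the first line is assumed to be cut off by the 1 KB window:

```shell
# Keep the last N complete lines of stdin (default 10).
# sed 1d drops the first line, which is assumed to be a partial line
# because the 1 KB window rarely starts on a line boundary.
last_whole_lines() {
  sed 1d | tail -n "${1:-10}"
}

# Assumed usage against HDFS (the path is the placeholder used above):
# hadoop fs -tail /your/file | last_whole_lines 5
```

Note that for a file smaller than one kilobyte the first line is complete and would be dropped anyway, so this is only a rough convenience for large files.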



By default, the head and tail commands on Linux display the first 10 and last 10 lines of a file, respectively. Their output is not randomly sampled, though: the lines come out in the same order as in the file itself.





The Linux shuf (shuffle) command generates random permutations of its input lines, and it can be combined with the Hadoop commands like so:





$ hadoop fs -cat <file_path_on_hdfs> | shuf -n <N>





So, if iris2.csv is a file on HDFS and you want 50 lines randomly sampled from it:





$ hadoop fs -cat /file_path_on_hdfs/iris2.csv | shuf -n 50





Note: The Linux sort command (for example, GNU sort -R) could also be used, but shuf is faster and gives a better random sample.







This is the correct answer because none of the other answers talk about shuffling.
– Alex Raj Kaliamoorthy
Jun 22 '17 at 11:08


hdfs dfs -cat yourFile | shuf -n <number_of_line>



This will do the trick for you. shuf is not available on macOS by default, though; you can get it by installing GNU coreutils.
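Where shuf is missing and installing coreutils is not an option, one possible fallback is reservoir sampling in awk, which reads the stream once and keeps only N lines in memory. This is a sketch rather than a drop-in replacement for shuf; `sample_lines` is a hypothetical helper name, and the seed is derived from the clock and the shell's process ID:

```shell
# Print N lines chosen uniformly at random from stdin (reservoir sampling).
# If the input has fewer than N lines, all of them are printed.
sample_lines() {
  awk -v n="$1" -v seed="$(( $(date +%s) + $$ ))" '
    BEGIN { srand(seed) }
    # Fill the reservoir with the first n lines.
    NR <= n { r[NR] = $0; next }
    # Line number NR replaces a random slot with probability n / NR.
    { j = int(rand() * NR) + 1; if (j <= n) r[j] = $0 }
    END {
      m = (NR < n) ? NR : n
      for (i = 1; i <= m; i++) print r[i]
    }
  '
}

# Assumed usage against HDFS (yourFile is the placeholder used above):
# hdfs dfs -cat yourFile | sample_lines 50
```

Unlike `head`, this has to read the whole file, since every line must have a chance of being selected. The selected lines are printed in reservoir order, not in a freshly shuffled order.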



My suggestion would be to load that data into a Hive table; then you can do something like this:


SELECT column1, column2 FROM (
SELECT iris2.column1, iris2.column2, rand() AS r
FROM iris2
ORDER BY r
) t
LIMIT 50;



EDIT:
This is a simpler version of that query:


SELECT iris2.column1, iris2.column2
FROM iris2
ORDER BY rand()
LIMIT 50;



Run this command:


sudo -u hdfs hdfs dfs -cat "path of csv file" |head -n 50



Here 50 is the number of lines; adjust it to suit your requirements.



You can use the head command with Hadoop too! The syntax would be:


hdfs dfs -cat <hdfs_filename> | head -n 3



This will print only the first three lines of the file.





What does this add to the existing answers?
– cricket_007
Jul 3 at 4:07


hadoop fs -cat /user/hive/warehouse/vamshi_customers/* |tail



I think the head part works fine as described in the answer posted by @Viacheslav Rodionov, but for the tail part the command I posted above works well.





This will download the entire file to your local machine, then tail it. Use hdfs dfs -tail
– cricket_007
Jul 3 at 4:08







