Hadoop MapReduce output with header




How can I output a header only once from my map/reduce job, so that the result can be used as a CSV for Hive import without manually entering the column names?



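public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

// Without the type parameters above, this method would overload rather than override Mapper.map
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {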
try {
InputStream is = new ByteArrayInputStream(value.toString().getBytes());
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(is);
//....

doc.getDocumentElement().normalize();

// .......
//context.write(new Text("el_from t Title t External Link"), NullWritable.get());
// ....
String title = eElement.getElementsByTagName("title").item(0).getTextContent();
text = eElement.getElementsByTagName("text").item(0).getTextContent();
String id = eElement.getElementsByTagName("id").item(0).getTextContent();
for(int j = 0; j < externalLinks.length; j++)
{
Pattern prl = Pattern.compile("(http://www.|https://www.|http://|https://)?[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?");
Matcher ml = prl.matcher(externalLinks[j]);
if(ml.find()) {
MatchResult mlr = ml.toMatchResult();
context.write(new Text(id+","+title + ","+ mlr.group(0)), NullWritable.get());
}
}
}
}
} catch (Exception e) {
// LogWriter.getInstance().WriteLog(e.getMessage());
}
}
}



The result I get looks like this:



3,agricoltura,http://www.treccani.it



3,agricoltura,http://www.wwf.it/client/render.aspx



The result I want is the same output with a header row, like this:



id, title, link



3,agricoltura,http://www.treccani.it



3,agricoltura,http://www.wwf.it/client/render.aspx





Can you please show us what you have attempted?
– Dragonthoughts
Jul 2 at 20:40





@Dragonthoughts Sorry, I have now edited my question and added a piece of the code
– Habtamu Assefa
Jul 2 at 21:25





Thanks @KlingKlang, it was a typo.
– Habtamu Assefa
Jul 4 at 8:01




1 Answer



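You should build a Hive table over the text file. The column names are then defined in the Hive schema, rather than stored as one more row of data in the table. More importantly, MapReduce cannot guarantee that a header ends up as the first row of the output: map() runs once per input record, each task writes its own part file, and rows can be reordered by the shuffle.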


CREATE EXTERNAL TABLE x (
id INT, title STRING, link STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://mapred/outputDir';



From that table, you can write a Hive query that outputs to a separate CSV file, if needed.



Also, I believe Spark can read XML, parse it, and write out CSV with headers, which might be a better fit for your use case.
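
If a single CSV file with a header row is still needed outside Hive, one option is to add the header in a post-processing step instead of inside the job. The following is only a minimal sketch under stated assumptions: the part files have already been copied to a local directory (for example with `hdfs dfs -get`), they are named `part-*`, and the class name `PartFileMerger` and its `merge` method are hypothetical helpers, not part of Hadoop.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: merges local copies of part files into one CSV with a single header. */
class PartFileMerger {

    static void merge(Path partsDir, Path target, String header) throws IOException {
        // Collect the part files and sort them by name so the row order is reproducible.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(partsDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts);

        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            // The header is written exactly once, before any data row.
            out.write(header);
            out.newLine();
            for (Path part : parts) {
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    if (!line.isEmpty()) { // skip blank lines
                        out.write(line);
                        out.newLine();
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PartFileMerger <partsDir> <target.csv>");
            return;
        }
        // Column names taken from the desired output in the question.
        merge(Paths.get(args[0]), Paths.get(args[1]), "id,title,link");
    }
}
```

The merged file should be written somewhere other than the table's LOCATION directory, since Hive would otherwise read the header line as a data row.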





