Hadoop MapReduce output with header

How can I output a header only once from my map/reduce job, so that I can use the output as a CSV for Hive import instead of entering the column names manually?

    public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                // Parse the incoming record as an XML document
                InputStream is = new ByteArrayInputStream(value.toString().getBytes());
                DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
                DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
                Document doc = dBuilder.parse(is);
                //....
                doc.getDocumentElement().normalize();
                // .......
                //context.write(new Text("el_from t Title t External Link"), NullWritable.get());
                // ....
                String title = eElement.getElementsByTagName("title").item(0).getTextContent();
                text = eElement.getElementsByTagName("text").item(0).getTextContent();
                String id = eElement.getElementsByTagName("id").item(0).getTextContent();
                for (int j = 0; j < externalLinks.length; j++) {
                    Pattern prl = Pattern.compile("(http://www.|https://www.|http://|https://)?[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?");
                    Matcher ml = prl.matcher(externalLinks[j]);
                    if (ml.find()) {
                        MatchResult mlr = ml.toMatchResult();
                        // Emit one CSV row per matched external link
                        context.write(new Text(id + "," + title + "," + mlr.group(0)), NullWritable.get());
                    }
                }
                } // closes a block opened in the elided code above
                } // closes a block opened in the elided code above
            } catch (Exception e) {
                // LogWriter.getInstance().WriteLog(e.getMessage());
            }
        }
    }

The result I get looks like this:

3,agricoltura,http://www.treccani.it

3,agricoltura,http://www.wwf.it/client/render.aspx

The result I want is the same output with a header, like this:

id, title, link

3,agricoltura,http://www.treccani.it

3,agricoltura,http://www.wwf.it/client/render.aspx

Can you please show us what you have attempted?
– Dragonthoughts
Jul 2 at 20:40

@Dragonthoughts Sorry, I have now edited my question and added a piece of the code
– Habtamu Assefa
Jul 2 at 21:25

Thanks @KlingKlang. was a typo
– Habtamu Assefa
Jul 4 at 8:01

1 Answer

You should build a Hive table over the text file. The table definition then supplies the "headers" as the Hive schema, instead of the header showing up as just another row in the Hive table. More importantly, MapReduce cannot guarantee that your header is the first row in the file.

    CREATE EXTERNAL TABLE x (
      id INT,
      title STRING,
      link STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs://mapred/outputDir';

From that table, you can write a Hive query that outputs to a separate CSV file, if needed.

Also, I believe Spark can read XML, parse it, and write out CSV with headers, which might be better for your use case.
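On the point that MapReduce cannot guarantee the header is the first row, here is a rough, stdlib-only simulation rather than real Hadoop code. It assumes each map task emits the header before its records, and that a shuffle then sorts all emitted lines by key before they are written. The class `HeaderOrderSketch` and the methods `mapTask` and `shuffleAndSort` are hypothetical names made up for this illustration and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Stdlib-only simulation; no Hadoop classes are used here. */
final class HeaderOrderSketch {
    static final String HEADER = "id,title,link";

    /** Hypothetical stand-in for one map task that writes the header, then its records. */
    static List<String> mapTask(List<String> records) {
        List<String> out = new ArrayList<>();
        out.add(HEADER);
        out.addAll(records);
        return out;
    }

    /** Hypothetical stand-in for the shuffle: merge every task's output and sort by key. */
    static List<String> shuffleAndSort(List<List<String>> mapOutputs) {
        List<String> merged = new ArrayList<>();
        for (List<String> output : mapOutputs) {
            merged.addAll(output);
        }
        Collections.sort(merged); // lexicographic order, so digits sort before letters
        return merged;
    }

    public static void main(String[] args) {
        // Two input splits, one record each, taken from the sample output in the question
        List<String> split1 = Arrays.asList("3,agricoltura,http://www.treccani.it");
        List<String> split2 = Arrays.asList("3,agricoltura,http://www.wwf.it/client/render.aspx");

        List<String> result = shuffleAndSort(Arrays.asList(mapTask(split1), mapTask(split2)));
        result.forEach(System.out::println);
    }
}
```

Under these assumptions the two data rows are printed first and `id,title,link` appears twice at the end: once per simulated map task, and after the rows because `3` sorts before `i`. Hive would read both header lines as ordinary rows, which is why defining the column names in the table schema is the safer route.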
