Hadoop MapReduce output with header

How can I output a header only once from my map/reduce job, so that I can use the output as a CSV for a Hive import instead of entering the column names manually?
public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        try {
            // Parse the XML fragment carried in the input value.
            InputStream is = new ByteArrayInputStream(value.toString().getBytes());
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(is);
            //....
            doc.getDocumentElement().normalize();
            // .......
            //context.write(new Text("el_from t Title t External Link"), NullWritable.get());
            // ....
            String title = eElement.getElementsByTagName("title").item(0).getTextContent();
            text = eElement.getElementsByTagName("text").item(0).getTextContent();
            String id = eElement.getElementsByTagName("id").item(0).getTextContent();
            for (int j = 0; j < externalLinks.length; j++) {
                Pattern prl = Pattern.compile("(http://www.|https://www.|http://|https://)?[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?");
                Matcher ml = prl.matcher(externalLinks[j]);
                if (ml.find()) {
                    MatchResult mlr = ml.toMatchResult();
                    // Emit one CSV row per matched link: id,title,link
                    context.write(new Text(id + "," + title + "," + mlr.group(0)), NullWritable.get());
                }
            }
            } // closes a block opened in the elided code above
            } // closes a block opened in the elided code above
        } catch (Exception e) {
            // LogWriter.getInstance().WriteLog(e.getMessage());
        }
    }
}
The result I get looks like this:
3,agricoltura,http://www.treccani.it
3,agricoltura,http://www.wwf.it/client/render.aspx
The result I want is the same output with a header line:
id, title, link
3,agricoltura,http://www.treccani.it
3,agricoltura,http://www.wwf.it/client/render.aspx
@Dragonthoughts Sorry, I have now edited my question and added a piece of the code
– Habtamu Assefa
Jul 2 at 21:25
Thanks @KlingKlang, it was a typo.
– Habtamu Assefa
Jul 4 at 8:01
1 Answer
You should build a Hive table over the text file. The column names are then defined in the Hive schema instead of showing up as one more data row in the table. More importantly, MapReduce cannot guarantee that your header ends up as the first row of the file.
CREATE EXTERNAL TABLE x (
id INT, title STRING, link STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://mapred/outputDir';
From that, you can write a Hive query to output to a separate CSV file, if needed.
Also, I believe Spark can read XML, parse it, and write out CSV with headers, which might be a better fit for your use case.
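If you still need a single CSV file with a header outside Hive, one option is to add the header after the job has finished instead of inside the mapper, since each task writes its own part file. Below is a minimal sketch of that idea, assuming the part files have already been copied to a local directory. HeaderMerger and merge are hypothetical names, not part of Hadoop, and the header text is taken from the question.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: merges local copies of the job's
// part files into one CSV and writes the header exactly once.
final class HeaderMerger {

    static void merge(Path partsDir, Path target, String header) throws IOException {
        // Collect the part-* files in name order; _SUCCESS and hidden files are skipped.
        List<Path> parts;
        try (Stream<Path> files = Files.list(partsDir)) {
            parts = files
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            // The header is written once, before any data rows.
            out.write(header);
            out.newLine();
            for (Path part : parts) {
                // Reads each part file fully into memory; fine for a sketch,
                // but large outputs would need streaming instead.
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java HeaderMerger <partsDir> <target.csv>
        merge(Paths.get(args[0]), Paths.get(args[1]), "id,title,link");
    }
}
```

Note that if you load the merged file back into Hive, the header line becomes a data row again unless the table is set up to skip it, so the external table above remains the simpler route for the Hive import itself.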
Can you please show us what you have attempted?
– Dragonthoughts
Jul 2 at 20:40