Stripping non html tags/text from a string

I have a string that I need to send in an XML node to a third-party application, where it is then run through an HTML parser. The string can contain HTML, but problems occur with non-HTML tags. For example:
<cfset str = "This mail was <b>sent</b> by Jen Myke <jmyke@mail.com> on June 20th.<br/> Click on <a href='http://google.com'>this link</a> for more information.">
The string can also contain non-UTF characters, which cause issues too, but I found an old blog post that helps remove them.
<cfset str = reReplace(str, "[^\x20-\x7E]", "", "ALL")>
But I can't figure out how to remove the HTML lookalikes.
2 Answers
2
Try wrapping the string with encodeForXML(). This should encode any non-ASCII character for use within an XML node.
<node>#encodeForXml(str)#</node>
If you need to pass data in an attribute, then
<node attr="#encodeForXmlAttribute(str)#"/>
Edit: You can try using getSafeHTML() before encoding the rest of the string. It removes HTML tags from a string, using an XML configuration file to set your AntiSamy settings. Check the docs for more info.
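A minimal sketch of that two-step idea, assuming getSafeHTML() is called without a policy file so that the server's default AntiSamy policy applies. The variable name safeStr is hypothetical, and whether the default policy drops or keeps something like <jmyke@mail.com> should be verified on your own server:

```cfml
<!--- Sketch only: sanitize first, then encode the result for the XML node. --->
<cfset str = "This mail was <b>sent</b> by Jen Myke <jmyke@mail.com> on June 20th.">

<!--- Assumption: no policy file is passed, so the default AntiSamy policy is used. --->
<cfset safeStr = getSafeHTML(str)>

<!--- encodeForXml() also encodes the angle brackets of any tags that survive sanitizing. --->
<cfoutput><node>#encodeForXml(safeStr)#</node></cfoutput>
```

Note that encoding the whole string also encodes the valid tags, so this only helps if the receiving side decodes the node content before parsing it.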
Thank you @Adrian, but that helps only partially. It would let me send non-UTF characters, which I was already handling with the regex in the question. The problem I have is with invalid HTML tags, which I need to either remove completely or strip the <> from.
– CFML_Developer
Jul 2 at 15:47
What do you mean by "non-HTML tags"? Modern HTML can be semantic in nature, meaning you can wrap content with any "tag"s, like <myTag>stuff</myTag>, which they can style however they like.
– Adrian J. Moreno
Jul 2 at 16:22
The problem is the HTML parser in the third-party desktop application (it probably uses an obsolete parser, and changing that part of the code is neither my expertise nor within my pay grade to decide). So I am trying to find out whether I can strip non-standard HTML tags out of strings.
– CFML_Developer
Jul 2 at 16:45
Which standard HTML? There are so many to choose from.
– James A Mohler
Jul 4 at 4:21
Try replacing
< with &lt;
> with &gt;
Nah, that won't work. The string is passed through a CDATA section, which converts &lt; back to <, so the parser would still receive the invalid HTML <jmyke@mail.com>.
– CFML_Developer
Jul 2 at 14:25
What if you insert a space or underscore between "<" and the text, and do the same for ">"? <cfset str = "This mail was sent by Jen Myke < jmyke@mail.com > on June 20th.">
– Ashu
Jul 2 at 14:29
That will create a problem with the proper HTML in the string. Let me put some HTML into the string, though I already mentioned that the string contains valid HTML too.
– CFML_Developer
Jul 2 at 14:36
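One way to leave the proper HTML alone is to touch only the <...> runs whose name is not on a list of known tags. The following is a rough sketch, not a tested solution: the allowedTags list and the cleaned variable are hypothetical, and the list would have to match whatever the third-party parser actually accepts.

```cfml
<!--- Hypothetical whitelist of tag names; extend it to suit the receiving parser. --->
<cfset allowedTags = "a|b|i|u|em|strong|br|p|ul|ol|li|span|div">

<cfset str = "This mail was <b>sent</b> by Jen Myke <jmyke@mail.com> on June 20th.<br/> Click on <a href='http://google.com'>this link</a> for more information.">

<!---
    The negative lookahead skips any "<" that starts an allowed opening or closing
    tag (name followed by whitespace, "/" or ">"). Every other <...> run is replaced
    by its inner text, which strips the angle brackets.
--->
<cfset cleaned = reReplaceNoCase(str, "<(?!/?(?:#allowedTags#)[\s/>])([^<>]*)>", "\1", "ALL")>
```

Using an empty string instead of "\1" as the replacement would remove the lookalike completely rather than just stripping the <> from it.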
How is your string created, and where does it come from? Do all your failed parser jobs include an email in the format John Smith <email@example.com>?
– Twillen
Jul 3 at 13:19
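If the failures do turn out to be limited to addresses in that format, a narrower pattern may be enough. This is a sketch under that assumption; the address pattern is deliberately loose and the cleaned variable is hypothetical:

```cfml
<cfset str = "This mail was <b>sent</b> by Jen Myke <jmyke@mail.com> on June 20th.">

<!---
    Strip the angle brackets only when they wrap something shaped like an address:
    no whitespace, no nested brackets, and exactly one "@" between the two parts.
--->
<cfset cleaned = reReplace(str, "<([^<>@\s]+@[^<>@\s]+)>", "\1", "ALL")>
```

Because the pattern does not allow whitespace inside the brackets, a tag such as <a href='mailto:someone@example.com'> is not affected.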