RegEx match open tags except XHTML self-contained tags

I need to match all of these opening tags:

<p> <a href="foo">

But not these:

<br /> <hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

Do I have that right? And more importantly, what do you think?
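A quick way to check the pattern is to run it against the sample tags. The sketch below uses Python's `re` module (an assumption, since the question names no language). It also adds one input that is not in the question, `<a href="/foo">`, to show what `[^/]` does with a slash inside an attribute value.

```python
import re

# The pattern from the question, unchanged.
OPEN_TAG = re.compile(r'<([a-z]+) *[^/]*?>')

def tag_names(html):
    """Return the captured a-z name of every tag the pattern matches."""
    return OPEN_TAG.findall(html)

wanted = '<p> <a href="foo">'
unwanted = '<br /> <hr class="foo" />'

print(tag_names(wanted))             # ['p', 'a']
print(tag_names(unwanted))           # []
print(tag_names('<a href="/foo">'))  # [] because the slash in the value blocks the match
```

So the pattern does separate the four sample tags, but it also rejects an ordinary opening tag as soon as any attribute value contains a slash.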

This post has been locked due to the high amount of off-topic comments generated. For extended discussions, please use chat.

35 Answers

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes~~, the pestilent sl~~ithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expre~~ssion parsing~~ will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮om~~es he co~~mes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

Have you tried using an XML parser instead?

Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.

Kobi: I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death.
– bobince
Nov 13 '09 at 23:18

Is it possible to use RegEx to parse this answer?
– Chris Porter
Nov 17 '09 at 18:26

If you can't see this post, here's a screencapture of it in all its glory: imgur.com/gOPS2.png
– Andrew Keeton
Nov 19 '09 at 14:37

While it is true that asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.

Also, scraping fairly regularly formatted data from large documents is going to be WAY faster with judicious use of scan & regex than any generic parser. And if you are comfortable with coding regexes, way faster to code than coding xpaths. And almost certainly less fragile to changes in what you are scraping. So bleh.
– Michael Johnston
Apr 17 '12 at 20:47

@MichaelJohnston "Less fragile"? Almost certainly not. Regexes care about text-formatting details that an XML parser can silently ignore. Switching between &foo; encodings and CDATA sections? Using an HTML minifier to remove all whitespace in your document that the browser doesn't render? An XML parser won't care, and neither will a well-written XPath statement. A regex-based "parser", on the other hand...
– Charles Duffy
Jul 11 '12 at 16:03

@CharlesDuffy for a one-time job it's OK, and for spaces we use \s+
– quantum
Jul 12 '12 at 13:50

@xiaomao indeed, if having to know all the gotchas and workarounds to get an 80% solution that fails the rest of the time "works for you", I can't stop you. Meanwhile, I'm over on my side of the fence using parsers that work on 100% of syntactically valid XML.
– Charles Duffy
Jul 12 '12 at 16:07

I once had to pull some data off ~10k pages, all with the same HTML template. They were littered with HTML errors that caused parsers to choke, and all their styling was inline or with <font> etc.: no classes or IDs to help navigate the DOM. After fighting all day with the "right" approach, I finally switched to a regex solution and had it working in an hour.
– Paul A Jungwirth
Sep 7 '12 at 7:14

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can't possibly make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

The OP is asking to parse a very limited subset of XHTML: start tags. What makes (X)HTML a CFG is its potential to have elements between the start and end tags of other elements (as in a grammar rule A -> s A e). (X)HTML does not have this property within a start tag: a start tag cannot contain other start tags. The subset that the OP is trying to parse is not a CFG.
– LarsH
Mar 2 '12 at 8:43

In CS theory, regular languages are a strict subset of context-free languages, but regular expression implementations in mainstream programming languages are more powerful. As noulakaz.net/weblog/2007/03/18/… describes, so-called "regular expressions" can check for prime numbers in unary, which is certainly something that a regular expression from CS theory can't accomplish.
– Adam Mihalcin
Mar 19 '12 at 23:50
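The unary prime check mentioned in the comment above can be sketched in a few lines. This is a minimal illustration in Python, not code from the linked article; the names `NOT_PRIME` and `is_prime` are made up for the example. The backreference `\1` is the part that no regular expression in the formal-language sense can express.

```python
import re

# Matches unary strings whose length is NOT prime: "", "1", or a block
# of two or more 1s repeated two or more times (a composite length).
NOT_PRIME = re.compile(r'^1?$|^(11+?)\1+$')

def is_prime(n):
    """Test primality by matching n written in unary against NOT_PRIME."""
    return NOT_PRIME.match('1' * n) is None

print([n for n in range(20) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19]
```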

@eyelidlessness: the same "only if" applies to all CFGs, does it not? I.e. if the (X)HTML input is not well-formed, not even a full-blown XML parser will work reliably. Maybe if you give examples of the "(X)HTML syntax errors implemented in real world user agents" you're referring to, I'll understand what you're getting at better.
– LarsH
May 22 '12 at 5:09

@AdamMihalcin is exactly right. Most extant regex engines are more powerful than Chomsky Type 3 grammars (eg non-greedy matching, backrefs). Some regex engines (such as Perl's) are Turing complete. It's true that even those are poor tools for parsing HTML, but this oft-cited argument is not the reason why.
– dubiousjim
May 31 '12 at 13:44

This is the most "full and short" answer here. It leads people to learn the basics of formal grammars and languages, and hopefully some maths, so they will not waste time on hopeless things like solving NP-tasks in polynomial time
– mishmashru
Apr 19 '13 at 12:15

Don't listen to these guys. You actually can parse context-free grammars with regex if you break the task into smaller pieces. You can generate the correct pattern with a script that does each of these in order:

I haven't figured out the last part yet, but I know I'm getting close. My code keeps throwing CthulhuRlyehWgahnaglFhtagnExceptions lately, so I'm going to port it to VB 6 and use On Error Resume Next. I'll update with the code once I investigate this strange door that just opened in the wall. Hmm.

P.S. Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.

Division by zero is a much easier problem than the others you mention. If you use intervals, rather than plain floating point arithmetic (which everyone should be but nobody is), you can happily divide something by [an interval containing] zero. The result is simply an interval containing plus and minus infinity.
– rjmunro
Jun 14 '12 at 10:53

Fermat's small margin problem has been solved by soft margins in modern text-editing software.
– kd4ttc
Mar 1 '13 at 20:24

Fermat's small margin problem has been solved by Randall Munroe by setting the fontsize to zero: xkcd.com/1381
– heltonbiker
Oct 16 '14 at 19:55

FYI: Fermat's problem has actually been solved in 1995, and it only took mathematicians 358 years to do so.
– jmiserez
Jan 22 '15 at 18:40

I was able to bypass that sticky divide-by-zero step by instead using Brownian ratchets yielded from cold fusion...though it only works when I remove the cosmological constant.
– Tim Lehner
Mar 9 '16 at 18:52

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...
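As a rough illustration of the "combine if and if not" route, here is a sketch in Python. Python's `re` does not accept a variable-width look-behind such as `(?<!/\s*)`, so the example matches every tag with the first pattern and then filters. The function name `open_tags` and the sample string are made up for the example.

```python
import re

# The tag-matching pattern from the answer above, unchanged.
TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

def open_tags(html):
    """Find every tag, then drop closing tags and self-closing tags."""
    tags = TAG.findall(html)
    return [t for t in tags
            if not t.startswith('</')          # "if not" a closing tag
            and not re.search(r'/\s*>$', t)]   # "if not" self-closing

sample = ('<p><a href="foo" title="5>3">x</a><br />'
          '<hr class="foo" /><a name="badgenerator"">')
print(open_tags(sample))
# ['<p>', '<a href="foo" title="5>3">', '<a name="badgenerator"">']
```

Because quoted attribute values are consumed as a unit, the `>` inside `title="5>3"` does not end the tag early. The caveat about comments, CDATA, script and style still applies.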

I would go with something that works on sane things than weep about not being universally perfect :-)
– prajeesh kumar
May 10 '12 at 3:44

Is someone using CDATA inside HTML?
– Danubian Sailor
Mar 2 '13 at 7:51

so you do not actually solve the parsing problem with regexp only but as a part of the parser this may work. PS: working product doesn't mean good code. No offence, but this is how industrial programming works and gets their money
– mishmashru
Apr 19 '13 at 12:18

Your regex fails on the very shortest possible valid HTML: <!doctype html><title><</title>. A simple '<!doctype html><title><</title>'.match(/<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>/g) returns ["<!doctype html>", "<title>", "<</title>"], while it should return ["<title>", "</title>"].
– Benio
May 1 '14 at 16:48

What is a "badge nerator"
– Richard de Wit
Jun 3 '15 at 14:07

There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid if they want to use strange words). They are lying.

There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.

You can live in their reality or take the red pill.

Like Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the ~~Underverse~~ Stack Based Regex-Verse and returned with ~~powers~~ knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.

I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this:

7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28 995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F 86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169 OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7 O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52 MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU 1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY 12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37 R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn 3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25 D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS 
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8 DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3 zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX /ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj 4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6 mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z 0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26 7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29 7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9 r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa 2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8 fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+ +fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx +r/vD34mUADO1P4/AQAA//8=

The option to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped.

If you have problems reconverting it to a human-readable regex, this should help:

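// Requires: using System; using System.IO; using System.IO.Compression; using System.Text;
static string FromBase64(string str)
{
    // Decode the base64 text into the raw deflated bytes.
    byte[] byteArray = Convert.FromBase64String(str);

    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream())
    {
        // Inflate the bytes into msOut.
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress))
        {
            ds.CopyTo(msOut);
        }

        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}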
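For readers not on .NET, a rough Python equivalent of the helper above might look like this. It assumes the blob is a raw DEFLATE stream, which is what .NET's DeflateStream produces. The `to_base64` function is a hypothetical inverse added only so the round trip can be shown; it is not part of the answer.

```python
import base64
import zlib

def from_base64(text):
    """Base64-decode, then inflate a raw DEFLATE stream into a UTF-8 string."""
    compressed = base64.b64decode(text)
    # wbits=-15 selects raw DEFLATE (no zlib header or checksum).
    return zlib.decompress(compressed, -15).decode('utf-8')

def to_base64(text):
    """Hypothetical inverse helper, used only for the round trip below."""
    compressor = zlib.compressobj(wbits=-15)
    raw = compressor.compress(text.encode('utf-8')) + compressor.flush()
    return base64.b64encode(raw).decode('ascii')

print(from_base64(to_base64('<([a-z]+)[^>]*>')))  # <([a-z]+)[^>]*>
```

By default `base64.b64decode` discards characters outside the base64 alphabet, so spaces or line breaks pasted along with the text are ignored.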

If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full-blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs.

Oh... if you want the source code of the regex, with some auxiliary methods:

regex to tokenize an xml or the full plain regex

Good Lord, it's massive. My biggest question is why? You realize that all modern languages have XML parsers, right? You can do all that in like 3 lines and be sure it'll work. Furthermore, do you also realize that pure regex is provably unable to do certain things? Unless you've created a hybrid regex/imperative code parser, but it doesn't look like you have. Can you compress random data as well?
– Justin Morgan
Mar 8 '11 at 15:23

@Justin I don't need a reason. It could be done (and it wasn't illegal/immoral), so I have done it. There are no limitations to the mind except those we acknowledge (Napoleon Hill)... Modern languages can parse XML? Really? And I thought that THAT was illegal! :-)
– xanatos
Mar 8 '11 at 15:31

Sir, I'm convinced. I'm going to use this code as part of the kernel for my perpetual-motion machine--can you believe those fools at the patent office keep rejecting my application? Well, I'll show them. I'll show them all!
– Justin Morgan
Mar 8 '11 at 17:55

@Justin So an Xml Parser is by definition bug free, while a Regex isn't? Because if an Xml Parser isn't bug free by definition there could be an xml that make it crash and we are back to step 0. Let say this: both the Xml Parser and this Regex try to be able to parse all the "legal" XML. They CAN parse some "illegal" XML. Bugs could crash both of them. C# XmlReader is surely more tested than this Regex.
– xanatos
Mar 9 '11 at 15:08

No, nothing is bug free: 1) All programs contain at least one bug. 2) All programs contain at least one line of unnecessary source code. 3) By #1 and #2 and using logical induction, it's a simple matter to prove that any program can be reduced to a single line of code with a bug. (from Learning Perl)
– sweaver2112
Feb 16 '12 at 0:53

In shell, you can parse HTML using:

sed though:

hxselect from html-xml-utils package

vim/ex (which can easily jump between html tags), for example:

removing style tag with inner code:

$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin

grep, for example:

extracting outer html of H1:

$ curl -s http://example.com/ | grep -o '<h1>.*</h1>'
<h1>Example Domain</h1>

extracting the body:

$ curl -s http://example.com/ | tr '\n' ' ' | grep -o '<body>.*</body>'
<body>
 Example Domain
 ...

html2text to plain text parsing:

like parsing tables:

$ html2text foo.txt | column -ts'|'

using xpath (XML::XPath perl module), see example here

perl or Python (see @Gilles example)

for parsing multiple files at once, see: How to parse hundred html source code files in shell?
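For the Python route, a minimal sketch with the standard library's `html.parser` could look like the following. The class name `HeadingText` and the sample markup are made up for the example. It pulls out the text of every h1, including one that spans several lines, which the `grep -o` one-liner above would miss.

```python
from html.parser import HTMLParser

class HeadingText(HTMLParser):
    """Collect the text inside every <h1> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <h1> elements are currently open
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.depth += 1
            self.headings.append('')

    def handle_endtag(self, tag):
        if tag == 'h1' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.headings[-1] += data

def h1_texts(html):
    parser = HeadingText()
    parser.feed(html)
    parser.close()
    return parser.headings

print(h1_texts('<body>\n<h1 class="x">Example\nDomain</h1>\n<p>...</p></body>'))
# ['Example\nDomain']
```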

Related (why you shouldn't use regex match):

See also perlmonks.org/?displaytype=print;node_id=809842
– dubiousjim
Mar 3 '10 at 12:50

I’m afraid you did not get the joke, @kenorb. Please, read the question and the accepted answer once more. This is not about HTML parsing tools in general, nor about HTML parsing shell tools, it’s about parsing HTML via regexes.
– Palec
Oct 13 '15 at 8:12

@Palec I don't get the joke either. Is it nearly impossible to parse HTML with regex?
– Abdul
Mar 24 '17 at 11:49

Yes, that answer summarizes it well, @Abdul. Note that, however, regex implementations are not really regular expressions in the mathematical sense -- they have constructs that make them stronger, often Turing-complete (equivalent to Type 0 grammars). The argument breaks with this fact, but is still somewhat valid in the sense that regexes were never meant to be capable of doing such a job, though.
– Palec
Mar 24 '17 at 14:24

And by the way, the joke I referred to was the content of this answer before kenorb's (radical) edits, specifically revision 4, @Abdul.
– Palec
Mar 24 '17 at 14:26

I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine. However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.

Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework and specifically talks about Consider[ing] the Input Source.

Regular Expressions do have limitations, but have you considered the following?

The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.

For this reason, I believe you CAN parse XML using regular expressions. Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.

Quote from article 1 cited above:

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by
a regular expression. However, the .NET regular expression engine
provides a few constructs that allow balanced constructs to be
recognized.

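(?<group>) pushes the captured result onto the capture stack named group.

(?<-group>) pops the topmost capture named group off the capture stack.

(?(group)yes|no) matches the yes part if a group named group exists, otherwise the no part.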

These constructs allow for a .NET regular expression to emulate a
restricted PDA by essentially allowing simple versions of the stack
operations: push, pop and empty. The simple operations are pretty much
equivalent to increment, decrement and compare to zero respectively.
This allows for the .NET regular expression engine to recognize a
subset of the context-free languages, in particular the ones that only
require a simple counter. This in turn allows for the non-traditional
.NET regular expressions to recognize individual properly balanced
constructs.
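The "simple counter" idea in the quote can be shown outside of .NET. The following Python sketch is an illustration only: the `TOKEN` pattern is a simplified assumption, and the check counts tags without comparing their names, which is the same restriction the quote describes.

```python
import re

# Simplified tokens: comments, self-closing tags, closing tags, opening tags.
TOKEN = re.compile(r'<!--.*?-->|<[^>]*/>|</[^>]*>|<[^>]*>', re.DOTALL)

def is_balanced(markup):
    """Return True if every opening tag is closed, using only a counter."""
    depth = 0
    for token in TOKEN.findall(markup):
        if token.startswith('<!--') or token.endswith('/>'):
            continue                 # comments and self-closing tags: no change
        if token.startswith('</'):
            depth -= 1               # "pop" (decrement)
            if depth < 0:
                return False         # closing tag with nothing open
        else:
            depth += 1               # "push" (increment)
    return depth == 0                # "empty" (compare to zero)

print(is_balanced('<ul><li>a</li><li><br/>b</li></ul>'))  # True
print(is_balanced('<ul><li>a</li>'))                      # False
```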

Consider the following regular expression:

(?=<ul\s+id="matchMe"\s+type="square"\s*>) (?> <!-- .*? --> | <[^>]*/> | (?<opentag><(?!/)[^>]*[^/]>) | (?<-opentag></[^>]*[^/]>) | [^<>]* )* (?(opentag)(?!))

Use the flags:

(?=<ul\s+id="matchMe"\s+type="square"\s*>)  # match start with <ul id="matchMe"...
(?>                                        # atomic group / don't backtrack (faster)
   <!-- .*? -->                 |          # match xml / html comment
   <[^>]*/>                     |          # self closing tag
   (?<opentag><(?!/)[^>]*[^/]>) |          # push opening xml tag
   (?<-opentag></[^>]*[^/]>)    |          # pop closing xml tag
   [^<>]*                                  # something between tags
)*                                         # match as many xml tags as possible
(?(opentag)(?!))                           # ensure no 'opentag' groups are on stack

You can try this at A Better .NET Regular Expression Tester.

I used the sample source of:


<html> <body> <div> <ul id="matchMe" type="square"> <li>stuff...</li> <li>more stuff</li> <li>
still more Another >ul<, oh my! ...
</li> </ul> </div> </body> </html>

This found the match:

<ul id="matchMe" type="square"> <li>stuff...</li> <li>more stuff</li> <li>
still more Another >ul<, oh my! ...
</li> </ul>

although it actually came out like this:

<ul id="matchMe" type="square"> <li>stuff...</li> <li>more stuff</li> <li>
still more Another >ul<, oh my! ...
</li> </ul>

Lastly, I really enjoyed Jeff Atwood's article: Parsing Html The Cthulhu Way. Funny enough, it cites the answer to this question that currently has over 4k votes.

System.Text is not part of C#. It's part of .NET.
– John Saunders
Feb 2 '12 at 19:07

In the first line of your regex ((?=<ul\s*id="matchMe"\s*type="square"\s*>) # match start with <ul id="matchMe"...), in between "<ul" and "id" should be \s+, not \s*, unless you want it to match <ulid=... ;)
– C0deH4cker
Jul 6 '12 at 2:49

@C0deH4cker You are correct, the expression should have \s+ instead of \s*.
– Sam
Jul 6 '12 at 22:33

Not that I really understand it, but I think your regex fails on <img src="images/pic.jpg" />
– Scheintod
Sep 27 '13 at 17:05

@Scheintod Thank you for the comment. I updated the code. The previous expression failed for self-closing tags that had a / somewhere inside, such as your <img src="images/pic.jpg" /> HTML.
– Sam
Sep 27 '13 at 19:00

I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.

@Kyle—jQuery does not parse XML, it uses the client's built–in parser (if there is one). Therefore you do not need jQuery to do it, but as little as two lines of plain old JavaScript. If there is no built–in parser, jQuery will not help.
– RobG
Oct 31 '13 at 6:25

@RobG Actually jQuery uses the DOM, not the built-in parser.
– Qix
Sep 22 '14 at 3:49

@Qix—you'd better tell the authors of the documentation then: "jQuery.parseXML uses the native parsing function of the browser…". Source: jQuery.parseXML()
– RobG
Sep 22 '14 at 5:01

Having come here from the meme question (meta.stackexchange.com/questions/19478/the-many-memes-of-meta/…), I love that one of the answers is 'Use jQuery'
– Jorn
Apr 1 '16 at 21:09

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

<([a-z]+) *[^/]*?>

If you add something to the regex, by backtracking it can be forced to match silly things like <a >>, [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be

<([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
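To see the suggested pattern in action, here is a small sketch in Python, whose `re` module supports this kind of fixed-width look-behind. The helper name `tag_names` and the last sample tag are assumptions added for illustration.

```python
import re

# The suggested pattern, unchanged.
OPEN_TAG = re.compile(r'<([a-z]+)[^>]*(?<!/)>')

def tag_names(html):
    """Return the captured name of every tag the pattern matches."""
    return OPEN_TAG.findall(html)

print(tag_names('<p> <a href="foo"> <br /> <hr class="foo" />'))  # ['p', 'a']
print(tag_names('<a/ >'))                                        # ['a']

# A '>' inside a quoted attribute value ends the match early:
print(OPEN_TAG.search('<a href="foo" title="5>3">').group(0))
# <a href="foo" title="5>
```

The last call shows why a stricter pattern has to describe quoted attribute values explicitly instead of relying on `[^>]*`.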

+1 for noting that the question is not about parsing full (X)HTML, it's about matching (X)HTML open tags.
– LarsH
Sep 8 '12 at 2:26

Something else most of the answers seem to ignore, is that an HTML parser can very well use regular expressions in its implementation for parts of HTML, and I would be surprised if most parsers didn't do this.
– Thayne
Mar 26 '15 at 19:15

@Thayne Exactly. When parsing individual tags, a regular expression is the right tool for the job. It is quite ridiculous that one has to scroll halfway down the page to find a reasonable answer. The accepted answer is incorrect because it mixes up lexing and parsing.
– kasperd
Nov 22 '15 at 10:26

The answer given here will fail when an attribute value contains a '>' or '/' character.
– Martin L
Apr 21 '16 at 8:14

This will work incorrectly on HTML containing comments or CDATA sections. It will also not work correctly if a quoted attribute contains a > character. I agree that what the OP suggests can be done with a regex, but the one presented here is far too simplistic.
– JacquesB
Jul 30 '17 at 10:14

Try:

<([^\s]+)(\s[^>]*?)?(?<!/)>

It is similar to yours, but the last > must not be after a slash, and also accepts h1.

<a href="foo" title="5>3"> Oops </a>
– Gareth
Nov 13 '09 at 23:11

That is very true, and I did think about it, but I assumed the > symbol is properly escaped to &gt;.
– Kobi
Nov 13 '09 at 23:16

> is valid in an attribute value. Indeed, in the ‘canonical XML’ serialisation you must not use &gt;. (Which isn't entirely relevant, except to emphasise that > in an attribute value is not at all an unusual thing.)
– bobince
Nov 14 '09 at 0:15

@Kobi: what does the exclamation mark (the one you placed toward the end) mean in a regexp?
– Marco Demaio
Apr 30 '11 at 17:16

@bobince: are u sure? I don't understand anymore, so is this valid HTML too: ">hello</div>
– Marco Demaio
Apr 30 '11 at 17:31

Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:

It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss.
If you only know yourself, but not your opponent, you may win or may lose.
If you know neither yourself nor your enemy, you will always endanger yourself.

In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.

I have composed a haiku describing the nature of HTML.

HTML has
complexity exceeding
regular language.

I have also composed a haiku describing the nature of regex in Perl.

The regex you seek
is defined within the phrase
<([a-zA-Z]+)(?:[^>]*[^/]*)?>

<?php
// Element names that never take a closing tag.
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

$html = '
<p><a href="#">foo</a></p>
<hr/>
<br/>
<div>name</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$els = $dom->getElementsByTagName('*');

// Print every element name that is not in the self-closing list.
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}

Output:

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

Basically just define the element node names that are self closing, load the whole html string into a DOM library, grab all elements, loop through and filter out ones which aren't self closing and operate on them.
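The same filtering idea can be sketched in Python with the standard library's `html.parser`; this is an illustrative translation, not part of the answer, and the class name `OpenTagCollector` is made up. Unlike DOMDocument, `html.parser` does not add implied html and body elements, so those two names do not appear in the result.

```python
from html.parser import HTMLParser

# Same list of self-closing element names as in the PHP snippet.
SELF_CLOSING = set(
    'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed'
    .split(','))

class OpenTagCollector(HTMLParser):
    """Record the name of every start tag that is not self-closing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names and also routes <hr/> here.
        if tag not in SELF_CLOSING:
            self.names.append(tag)

def open_tag_names(html):
    collector = OpenTagCollector()
    collector.feed(html)
    collector.close()
    return collector.names

print(open_tag_names('<p><a href="#">foo</a></p> <hr/> <br/> <div>name</div>'))
# ['p', 'a', 'div']
```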

I'm sure you already know by now that you shouldn't use regex for this purpose.

If you're dealing with real XHTML then append getElementsByTagName with NS and specify the namespace.
– meder omuraliev
Nov 15 '09 at 14:39

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack?

Excerpt:

It is a .NET code library that allows
you to parse "out of the web" HTML
files. The parser is very tolerant
with "real world" malformed HTML.

You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.

However, a naïve implementation of that will end up matching <bar/></foo> in this example document:

<foo><bar/></foo>

Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?

Yep, I sure am. Determining all the tags that are currently open, then compare that against the closed tags in a separate array. RegEx hurts my brain.
– Jeff
Nov 13 '09 at 23:04

The W3C explains parsing in a pseudo regexp form:
W3C Link

Follow the var links for QName, S, and Attribute to get a clearer picture.
Based on that you can create a pretty good regexp to handle things like stripping tags.

That's not a pseudo regexp form, that's an EBNF form, as specified here: XML spec, appendix 6
– Rob G
Feb 11 '15 at 10:34

If you need this for PHP:

The PHP DOM functions won't work properly unless the input is properly formatted XML, no matter how much better their use is for the rest of mankind.

simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy [will crash on large pages].

I have never used querypath, so can't comment on its usefulness.

Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

I used an open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into a tree of nodes, and you can easily use its API to get attributes out of a node. Check it out and see if it can help you.

Whenever I need to quickly extract something from an HTML document, I use Tidy to convert it to XML and then use XPath or XSLT to get what I need.
In your case, something like this:

//p/a[@href='foo']
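
As a rough illustration of the XPath step, here is a Python sketch that runs the same query against a small well-formed fragment standing in for Tidy's output. The sample markup is invented for the example. Real Tidy output normally declares the XHTML namespace, in which case the element names in the path would need a namespace prefix; this sketch assumes no namespace.

```python
import xml.etree.ElementTree as ET

# A small, well-formed fragment standing in for the tidied document.
xhtml = """<html><body>
<p><a href="foo">first</a> <a href="bar">second</a></p>
<div><a href="foo">not inside a p</a></div>
</body></html>"""

root = ET.fromstring(xhtml)

# ElementTree only accepts relative paths, so //p/a becomes .//p/a
matches = root.findall(".//p/a[@href='foo']")
print([a.text for a in matches])  # ['first']
```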

I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):

$_ = join "",<STDIN>; tr/\n\r \t/ /s; s/</\n</g; s/>/>\n/g; s/\n ?\n/\n/g; s/^ ?\n//s; s/ $//s; print

It's called htmlsplit, splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep, sed, Perl, etc. I'm not even joking :) Enjoy.

It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.

I bet I will get downvoted for this.

HTML Split

Against my expectation this got some upvotes, so I'll suggest some better regular expressions:

/(<.*?>|[^<]+)\s*/g     # get tags and text
/(\w+)="(.*?)"/g        # get attributes

They are good for XML / XHTML.
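
To see what these two expressions produce, here is a small Python sketch using the same patterns with their backslashes in place. The helper names tokens and attributes and the sample markup are invented for the illustration; the behaviour shown holds only for tidy, XML-like input with double-quoted attributes.

```python
import re

TOKEN = re.compile(r"(<.*?>|[^<]+)\s*")   # one tag, or one run of text
ATTRIBUTE = re.compile(r'(\w+)="(.*?)"')  # name="value" pairs only


def tokens(markup):
    """Split markup into tags and text chunks, in document order."""
    return [m.group(1) for m in TOKEN.finditer(markup)]


def attributes(tag):
    """Return (name, value) pairs for double-quoted attributes in one tag."""
    return ATTRIBUTE.findall(tag)


sample = '<p class="x">Hello <a href="foo" id="l1">world</a></p>'
print(tokens(sample))
# ['<p class="x">', 'Hello ', '<a href="foo" id="l1">', 'world', '</a>', '</p>']
print(attributes('<a href="foo" id="l1">'))
# [('href', 'foo'), ('id', 'l1')]
```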

With minor variations, it can cope with messy HTML... or convert the HTML -> XHTML first.

The best way to write regular expressions is in the Lex / Yacc style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.

"I don't attempt to parse idiot HTML that is deliberately broken." How does your code know the difference?
– Kevin Panko
Jul 26 '11 at 20:38

Well it doesn't matter much if the HTML is broken or not. The thing will still split HTML into tags and text. The only thing that could foul it up is if people include unescaped < or > characters in text or attributes. In practise, my tiny HTML splitter works well. I don't need an enormous monstrosity chock full of heuristics. Simple solutions are not for everyone...!
– Sam Watkins
Mar 8 '12 at 3:22

I added some simpler regexps for extracting tags, text, and attributes, for XML / XHTML.
– Sam Watkins
May 22 '12 at 8:00

(get attributes bug 1) /(\w+)="(.*?)"/ assumes double quotes. It will miss values in single quotes. In HTML version 4 and earlier, an unquoted value is allowed if it is a simple word.
– David Andersson
Sep 11 '16 at 8:23

(get attributes bug 2) /(\w+)="(.*?)"/ may falsely match text that looks like an attribute within an attribute, e.g. <img title="Nope down='up' for aussies" src="..." />. If applied globally, it will also match such things in ordinary text or in HTML comments.
– David Andersson
Sep 11 '16 at 8:28

Here's the solution:

<?php
// here's the pattern:
$pattern = '/<(\w+)(\s+(\w+)\s*=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';

// a string to parse:
$string = 'Hello, try clicking <a href="#paragraph">here</a> <br/>and check out.<hr /> <h2>title</h2> <a name ="paragraph" rel= "I\'m an anchor"></a> Fine, <span title=\'highlight the "punch"\'>thanks<span>.
<br>';

// let's get the occurrences:
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);

// print the result:
print_r($matches[0]);
?>

To test it deeply, I entered in the string auto-closing tags like:

I also entered tags with:

Should you find something which does not work in the proof of concept above, I am happy to analyze the code to improve my skills.

<EDIT>
I forgot that the question from the user was to avoid the parsing of self-closing tags.
In this case the pattern is simpler, turning into this:

$pattern = '/<(\w+)(\s+(\w+)\s*=\s*(\'|")(.*?)\\4\s*)*\s*>/';

The user @ridgerunner noticed that the pattern does not allow unquoted attributes or attributes with no value. In this case a fine tuning brings us the following pattern:

$pattern = '/<(\w+)(\s+(\w+)(\s*=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';

</EDIT>

Understanding the pattern

If someone is interested in learning more about the pattern, I provide some line:

Small tip: to better analyze this code it is necessary to look at the generated source code, since I did not provide any HTML special characters escaping.

Does not match valid tags having attributes with no value, i.e. <option selected>. Also does not match valid tags with unquoted attribute values, i.e. <p id=10>.
– ridgerunner
Jul 25 '11 at 15:01

@ridgerunner: Thanks very much for your comment. In that case the pattern must change a bit: $pattern = '/<(\w+)(\s+(\w+)(\s*=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/'; I tested it and it works in case of non-quoted attributes or attributes with no value.
– Emanuele Del Grande
Jul 25 '11 at 16:41

How about a space before the tag name: < a href="http://wtf.org" > I'm pretty sure it is legal, but you don't match it.
– Floris
Oct 5 '13 at 4:58

No, sorry, whitespace before a tag name is illegal. Beyond being "pretty sure", why don't you provide some evidence for your objection? Here is mine: w3.org/TR/xml11/#sec-starttags refers to XML 1.1, and you can find the same for HTML 4, 5 and XHTML; a W3C validation would also warn you if you ran a test. As with a lot of other blah-blah-poets around here, I have still not received any intelligent argument, apart from some hundreds of downvotes on my answers, demonstrating where my code fails according to the rules of the contract specified in the question. I would only welcome them.
– Emanuele Del Grande
Oct 6 '13 at 18:03

@ridgerunner of course your comment was intelligent and welcome.
– Emanuele Del Grande
Oct 6 '13 at 18:09

Here is a PHP based parser that parses HTML using some ungodly regex. As the author of this project, I can tell you it is possible to parse HTML with regex, but not efficient. If you need a server-side solution (as I did for my wp-Typography WordPress plugin), this works.

htmlawed is another PHP project that parses HTML to filter, convert, etc. Has some nice code if you can figure it out!
– user594694
May 12 '11 at 19:22

No you can’t parse HTML with regex. But for some subsets, it may work.
– mirabilos
Dec 5 '14 at 17:07

There are some nice regexes for replacing HTML with BBCode here. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

For example:

$store =~ s/http:/http:\/\//gi;
$store =~ s/https:/https:\/\//gi;
$baseurl = $store;

if (!$query->param("ascii")) {
    $html =~ s/\s\s+/\n/gi;
    $html =~ s/<pre(.*?)>(.*?)<\/pre>/\[code]$2\[\/code]/sgmi;
}

$html =~ s/\n//gi;
$html =~ s/\r\r//gi;
$html =~ s/$baseurl//gi;
$html =~ s/<h[1-7](.*?)>(.*?)<\/h[1-7]>/\n\[b]$2\[\/b]\n/sgmi;
$html =~ s/<p>/\n\n/gi;
$html =~ s/<br(.*?)>/\n/gi;
$html =~ s/<textarea(.*?)>(.*?)<\/textarea>/\[code]$2\[\/code]/sgmi;
$html =~ s/<b>(.*?)<\/b>/\[b]$1\[\/b]/gi;
$html =~ s/<i>(.*?)<\/i>/\[i]$1\[\/i]/gi;
$html =~ s/<u>(.*?)<\/u>/\[u]$1\[\/u]/gi;
$html =~ s/<em>(.*?)<\/em>/\[i]$1\[\/i]/gi;
$html =~ s/<strong>(.*?)<\/strong>/\[b]$1\[\/b]/gi;
$html =~ s/<cite>(.*?)<\/cite>/\[i]$1\[\/i]/gi;
$html =~ s/<font color="(.*?)">(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<font color=(.*?)>(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<link(.*?)>//gi;
$html =~ s/<li(.*?)>(.*?)<\/li>/\[*]$2/gi;
$html =~ s/<ul(.*?)>/\[list]/gi;
$html =~ s/<\/ul>/\[\/list]/gi;
$html =~ s//\n/gi;
$html =~ s//\n/gi;
$html =~ s/<td(.*?)>/ /gi;
$html =~ s/<tr(.*?)>/\n/gi;
$html =~ s/<img(.*?)src="(.*?)"(.*?)>/\[img]$baseurl\/$2\[\/img]/gi;
$html =~ s/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/\[url=$baseurl\/$2]$4\[\/url]/gi;
$html =~ s/\[url=$baseurl\/http:\/\/(.*?)](.*?)\[\/url]/\[url=http:\/\/$1]$2\[\/url]/gi;
$html =~ s/\[img]$baseurl\/http:\/\/(.*?)\[\/img]/\[img]http:\/\/$1\[\/img]/gi;
$html =~ s/<head>(.*?)<\/head>//sgmi;
$html =~ s/(.*?)//sgmi;
$html =~ s/(.*?)//sgmi;
$html =~ s/<style(.*?)>(.*?)<\/style>//sgmi;
$html =~ s/<title>(.*?)<\/title>//sgmi;
$html =~ s//\n/sgmi;
$html =~ s/\/\//\//gi;
$html =~ s/http:\//http:\/\//gi;
$html =~ s/https:\//https:\/\//gi;
$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi;
$html =~ s/\r\r//gi;
$html =~ s/\[img]\//\[img]/gi;
$html =~ s/\[url=\//\[url=/gi;

Don't do this. Please.
– maletor
Sep 3 '15 at 2:12

About the question of the RegExp methods to parse (x)HTML, the answer to all of the ones who spoke about some limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion.

A RegExp-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.

After reading some posts, the first thing I did was looking for the "?R" string in this thread. The second was to search about "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built onto, I was soon aware that nobody got the point.

If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.

The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Here's the magic pattern:

$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|)|(?R))*)<\/\\1[\s]*>))/s";

Just try it.
It's written as a PHP string, so the "s" modifier makes classes include newlines.
Here's a sample note on the PHP manual I wrote in January: Reference

(Take care, in that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the RegExp engine, since no ^ or $ anchorage was used).

Now, we could speak about the limits of this method from a more informed point of view:

Anyhow, it is only a RegExp pattern, but it opens up the possibility of developing a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performance is really great, both in execution time and in memory usage (nothing to do with other template engines which use the same syntax).

I'll put this in the "Regex which doesn't allow greater-than in attributes" bin. Check it against <input value="is 5 > 3?" />
– Gareth
Jul 5 '10 at 16:24
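
The objection is easy to reproduce. The Python sketch below uses a deliberately simplified pattern from the same [^>]* family (not the recursive pattern itself) and compares it with the standard library's html.parser on the suggested input. The class name AttrCollector is invented for the illustration.

```python
import re
from html.parser import HTMLParser

tag = '<input value="is 5 > 3?" />'

# A pattern in the [^>]* family stops at the first '>', even inside quotes.
naive = re.match(r"<(\w+)[^>]*>", tag)
print(naive.group(0))  # <input value="is 5 >


class AttrCollector(HTMLParser):
    """Records self-closed tags together with their attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_startendtag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


parser = AttrCollector()
parser.feed(tag)
parser.close()
print(parser.seen)  # [('input', {'value': 'is 5 > 3?'})]
```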

If you put something like that in production code, you would likely be shot by the maintainer. A jury would never convict him.
– aehiilrs
Jul 5 '10 at 16:33

Regular expressions can't work because by definition they are not recursive. Adding a recursive operator to regular expressions basically makes a CFG only with poorer syntax. Why not use something designed to be recursive in the first place rather than violently insert recursion into something already overflowing with extraneous functionality?
– Welbog
Jul 6 '10 at 18:38

My objection isn't one of functionality; it is one of time invested. The problem with RegEx is that by the time you post the cutesy little one-liners it appears that you did something more efficiently ("See one line of code!"). And of course no one mentions the half hour (or 3) that they spent with their cheat-sheet and (hopefully) testing every possible permutation of input. And once you get past all that, when the maintainer goes to figure out or validate the code they can't just look at it and see that it is right. They have to dissect the expression and essentially retest it all over again...
– Oorang
Jul 10 '10 at 15:11

... to know that it is good. And that will happen even with people who are good with regex. And honestly I suspect that overwhelming majority of people won't know it well. So you take one of the most notorious maintenance nightmares and combine it with recursion which is the other maintenance nightmare and I think to myself what I really need on my project is someone a little less clever. The goal is to write code that bad programmers can maintain without breaking the code base. I know it galls to code to the least common denominator. But hiring excellent talent is hard, and you often...
– Oorang
Jul 10 '10 at 15:17

As many people have already pointed out, HTML is not a regular language, which can make it very difficult to parse. My solution to this is to turn it into well-formed XML using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written using Java with the jtidy library to turn the HTML into XML and then Jaxen to xpath into the result.

That was helpful, thanks
– Khemraj
Oct 27 '17 at 21:16

<\s*(\w+)[^/>]*>
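
A quick way to try this pattern is the Python sketch below, written with the backslashes in place. The sample string reuses the tags from the question and adds one link to a made-up URL (example.com) to show a limitation: any slash inside an attribute value makes the tag disappear from the results.

```python
import re

OPEN_TAG = re.compile(r"<\s*(\w+)[^/>]*>")

sample = (
    '<p> <a href="foo"> <br /> <hr class="foo" /> </p> '
    '<a href="http://example.com/">'
)

# Closing tags, self-closed tags and the tag whose URL contains '/' are skipped.
print(OPEN_TAG.findall(sample))  # ['p', 'a']
```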

The parts explained:

<: starting character

\s*: it may have whitespaces before tag name (ugly but possible).

(\w+): tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious use ([a-zA-Z0-9]+) instead.

[^/>]*: anything except > and / until closing >

>: closing >

And to fellows who underestimate regular expressions saying they are only as powerful as regular languages:

aⁿbaⁿbaⁿ which is not regular and not even context free, can be matched with ^(a+)b\1b\1$

Backreferencing FTW!
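
A short Python check of that claim, with the pattern written out including its backslashes. The helper name matches and the test strings are chosen only for this illustration.

```python
import re

# \1 must repeat exactly what the first group captured.
PATTERN = re.compile(r"^(a+)b\1b\1$")


def matches(s):
    return PATTERN.match(s) is not None


for s in ["ababa", "aabaabaa", "aabaaba", "abaabaa"]:
    print(s, matches(s))
# ababa True
# aabaabaa True
# aabaaba False
# abaabaa False
```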

@GlitchMr, that was his point. Modern regular expressions are not technically regular, nor is there any reason for them to be.
– alanaktion
Feb 2 '13 at 15:45

@alanaktion: The "modern" regular expressions (read: with Perl extensions) cannot match within O(MN) (M being regular expression length, N being text length). Backreferences are one of the causes of that. The implementation in awk doesn't have backreferences and matches everything within O(MN) time.
– Konrad Borowski
Feb 14 '13 at 16:52

I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and their folly (readability, maintainability, etc.), but if you reduce the scope of their application they may fit your needs. Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes.

For your convenience I have set up a playground so you can test if the code matches your requirements: playground and Java code. Your feedback will be appreciated.

There is a small article describing this work on my blog: http://roberto.open-lab.com

Your links are dead and this isn't really providing any useful answer.
– kenorb
May 19 '15 at 16:23

It seems to me you're trying to match tags without a "/" at the end. Try this:

<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>

This does not work. For the input '<x a="<b>"/><y>' the matches are x and y, although x is terminated.
– ceving
May 4 '11 at 16:33
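
Both observations can be checked with a few lines of Python; the pattern is the one from the answer, and the two inputs are the tags from the question and the counterexample from the comment.

```python
import re

# The lookbehind (?<!/) rejects a closing '>' that directly follows '/'.
LOOKBEHIND = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>")

# The tags from the question: only the non-self-closed ones are reported.
print(LOOKBEHIND.findall('<p> <a href="foo"> <br /> <hr class="foo" />'))  # ['p', 'a']

# The counterexample: the '>' inside the attribute value ends the match early,
# so the self-closed x element is reported as open.
print(LOOKBEHIND.findall('<x a="<b>"/><y>'))  # ['x', 'y']
```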

If you're simply trying to find those tags (without ambitions of parsing) try this regular expression:

/<[^/]*?>/g

I wrote it in 30 seconds, and tested here:
http://gskinner.com/RegExr/

It matches the types of tags you mentioned, while ignoring the types you said you wanted to ignore.

I think you mean /> instead of >.
– Justin Morgan
Dec 19 '14 at 17:36

No, just > is what I meant; I never meant to edit the regular expression of my original post.
– Lonnie Best
May 29 '16 at 6:38

FYI, you don't need to escape angle brackets. Of course, it does no harm to escape them anyway, but look at the confusion you could have avoided. ;)
– Alan Moore
May 29 '16 at 7:47

I sometimes escape unnecessarily when I'm unsure if something is special character or not. I've edited the answer; it works the same but more concise.
– Lonnie Best
May 31 '16 at 7:23

Looking at this now, I don't know why I thought you meant /, since that would do the exact opposite of the requirements. Maybe I thought you were offering a negative filter pattern.
– Justin Morgan
Jun 1 '16 at 19:14

Although regular expressions are not a suitable or effective tool for that purpose, they sometimes provide quick solutions for simple matching problems, and in my view it's not that horrible to use them for trivial work.

There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.
