Apr 18, 2011


Parsing of HTML: Use Regex or HTML Agility Pack?

I need to parse some HTML in my project.  The HTML is fairly simple and controlled; we are not parsing arbitrary, malformed HTML from out in the wild.

I was thinking of using Regex for this purpose, but I am not (yet) an expert in building Regex patterns.
However, I found the following pattern, which is supposed to match all HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

Does anyone have feedback on this pattern?  Will it indeed capture all HTML tags?  Are there any weaknesses?
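
One way to probe these questions is to run the pattern against a few small inputs. The sketch below uses Python's `re` module rather than .NET's `Regex` (the pattern only relies on constructs both engines support), and `find_tags` plus the sample strings are names and data made up for illustration.

```python
import re

# The pattern from the question, unchanged. A triple-quoted raw string is
# used because the pattern contains both single and double quotes.
TAG_PATTERN = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')


def find_tags(text):
    """Return every substring of `text` that the pattern treats as a tag.

    Hypothetical helper, only here to make the experiments easy to read.
    """
    return TAG_PATTERN.findall(text)


# A '>' inside a quoted attribute value is handled as intended.
print(find_tags('<a href="x>y">link</a>'))

# A '>' inside a comment ends the match early.
print(find_tags('<!-- 1 > 0 -->'))

# Plain text containing '<' and '>' is reported as a tag.
print(find_tags('a < b and c > d'))
```

With these inputs the first call returns both tags intact, the second returns only `<!-- 1 >` (the comment is cut at its first `>`), and the third returns `< b and c >` even though no tag is present. So the pattern copes with quoted attribute values but knows nothing about comments, script content, or stray angle brackets in text.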

As an alternative I believe I could use the HTML Agility Pack. I know that the Orchard Project uses it internally.
Does anyone want to comment on the appropriateness of using the Agility Pack for my purposes?
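
For comparison, this is roughly the shape of the parser-based route. It is not HTML Agility Pack: the sketch uses Python's standard-library `html.parser` as a stand-in, only to illustrate letting a real tokenizer find the tags. `TagCollector` and `collect_tags` are hypothetical names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects (tag name, attributes) for every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


def collect_tags(html):
    """Parse `html` and return the start tags found, in document order."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


# The comment is skipped as a comment, and the quoted '>' stays in the value.
print(collect_tags('<!-- 1 > 0 --><a href="x>y">link</a>'))
```

The point of the comparison is that the parser already distinguishes comments, text, and tags, so that logic does not have to be encoded in a pattern.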

Thanks.
 
 2 comments
 
Take a look at a post Phil Haack made a while back; it might be useful: http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx --- Robert Blixt  Apr 19, 2011
 
I suck at regex, so can't be more helpful than that ;) --- Robert Blixt  Apr 19, 2011