Quickly fixing crappy, invalid html using c#
On many occasions, I’ve run into html that was not compliant. It is really difficult to keep html excellent while also letting non-programmers update web site content. There are a couple of awesome xhtml compliant editors out there for those interested in custom integration, but using a modern content management system usually means you are bound to either fckEditor or tinyMce. These both produce well-formed html, but they tend to add a lot of garbage to the html in an effort to format things in a generic, cross-browser way. Pleasing everybody tends to leave me disenchanted with the tools.
Correcting Html following destructive string splits
I can live with the sub-optimal html these editors produce, but I had a need to split html into distinct parts. Basically, I had a pseudo-blog. It was a subsonic starter site turned into a blogging engine. It worked great for a while. The author would copy the last post, paste it on top, change the date and title, then replace the post body. The old posts moved down the page. Visually, I could see a very consistent pattern in the posts and felt confident I could pull the posts apart into an RSS feed. Upon further investigation, however, I found quite a lot of tag artifacts that fckEditor had inserted into the html. The page looked fine as a whole, but there was no way to know what each post's html would look like once the posts were separated.
The author had inserted an <hr /> between each post. That was my starting point.
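Here is a minimal sketch of that first split. The class and method names (PostSplitter, SplitPosts) are made up for illustration, and the pattern assumes the separator is always some form of hr tag, which may not hold for every page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PostSplitter
{
    // Hypothetical helper: breaks the page html into one chunk per post,
    // using any <hr>, <hr/> or <hr ... /> tag as the separator.
    public static List<string> SplitPosts( string pageHtml )
    {
        string[] chunks = Regex.Split( pageHtml, @"<hr\b[^>]*>", RegexOptions.IgnoreCase );
        List<string> posts = new List<string>();
        foreach ( string chunk in chunks )
        {
            // Skip the empty fragments left by leading or trailing separators.
            if ( chunk.Trim().Length > 0 ) posts.Add( chunk );
        }
        return posts;
    }
}
```

Each chunk is still raw, possibly broken html at this stage; it only gives you something to work on one post at a time.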
Parse the date
Then, there was always a date in the format: MMMM dd, yyyy (April 18, 2008). However, sometimes there were &nbsp; entities inserted in there. Then I discovered that the time was also there but separated with a "|". Like this: April 17, 2008 | 3:50 PM.
I used a regular expression to find the date string, then split the post into parts on that string. Then I replaced the "|" in the dateString and used DateTime.Parse(System.Web.HttpUtility.HtmlDecode(dateString)) to create a DateTime instance.
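// post is the html of a single post
string dateString = GetDateStringFromPost( post );
// parts[0] is everything before the date, parts[1] is everything after it
string[] parts = post.Split( new string[] { dateString }, StringSplitOptions.None );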
public string GetDateStringFromPost( string html )
{
string datePattern = "(?:January|February|March|April|May|June|July|August|September|October|November|December)(?:&nbsp;|\\s)*[0-9]{1,2},(?:&nbsp;|\\s)[0-9]{4}(?:(?:&nbsp;|\\s|\\|)*[0-9]{1,2}:[0-9]{2}(?:&nbsp;|\\s)*(?:AM|PM)?)?";
Regex regex = new Regex( datePattern, RegexOptions.IgnoreCase | RegexOptions.Compiled );
Match match = regex.Match( html );
return match.Value;
}
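The date clean-up itself is not shown above, so here is a rough sketch of it. PostDateParser and ParsePostDate are hypothetical names, and I have swapped in System.Net.WebUtility.HtmlDecode for System.Web.HttpUtility.HtmlDecode so the snippet does not need a System.Web reference. It also assumes the month names are in English.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

public static class PostDateParser
{
    // Hypothetical helper: turns "April 17, 2008&nbsp;| 3:50 PM" into a DateTime.
    public static DateTime ParsePostDate( string dateString )
    {
        // Decode entities such as &nbsp; into real characters.
        string decoded = WebUtility.HtmlDecode( dateString );
        // Non-breaking spaces and the "|" separator both become plain spaces.
        string cleaned = decoded.Replace( '\u00A0', ' ' ).Replace( "|", " " );
        // Collapse runs of whitespace so the parser sees a single space.
        cleaned = Regex.Replace( cleaned, @"\s+", " " ).Trim();
        return DateTime.Parse( cleaned, CultureInfo.InvariantCulture );
    }
}
```

If the regular expression finds nothing, dateString will be empty and DateTime.Parse will throw, so check for that (or use DateTime.TryParse) before relying on the result.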
At this point, I had my post, my date, and the post title happened to be the fragment (parts[0]) before the date.
Clean the title
I just needed text for the title, so I brute-force stripped all html tags (a character-by-character parse) and then used string.Trim() to clean up what was left. (Caution, not pretty)
public string StripHtml( string html )
{
StringBuilder sb = new StringBuilder();
bool inTag = false;
foreach ( char c in html )
{
if ( c == '<' )
{
inTag = true;
continue;
}
if ( c == '>' && inTag )
{
inTag = false;
continue;
}
if( !inTag ) sb.Append( c );
}
return sb.ToString().Trim();
}
Fixing Html using Tidy.Net
Fixing the now corrupted post body html
Now I had a post body that had any number of invalid html tags because that post may have been in the middle of a table on the page or the post itself may have had numerous formatting containers wrapping the different parts.
The same tool that is used to correct html in wysiwyg editors to the rescue. Html Tidy. I found this post regarding Tidy.Net use and where to find Tidy.Net (Tidy utility in .net).
public string FixHtml( string html )
{
TidyNet.Tidy tidy = new TidyNet.Tidy();
/* Set the options you want */
tidy.Options.DocType = DocType.Strict;
tidy.Options.DropFontTags = true;
tidy.Options.LogicalEmphasis = true;
tidy.Options.Xhtml = true;
tidy.Options.XmlOut = true;
tidy.Options.MakeClean = true;
tidy.Options.TidyMark = false;
/* Tidy will provide messages regarding what it done did */
TidyMessageCollection tmc = new TidyMessageCollection();
MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();
byte[] byteArray = Encoding.UTF8.GetBytes( html );
input.Write( byteArray, 0, byteArray.Length );
input.Position = 0;
tidy.Parse( input, output, tmc );
string result = Encoding.UTF8.GetString( output.ToArray() );
return result;
}
Amazingly, the html is now xhtml compliant! You can throw it at an xml parser and go to town.
Parsing Html with Html Agility Pack
Paring down the html
What I wanted was everything Tidy had tossed into the body tag. I evaluated several options and finally decided I wanted to try HtmlAgilityPack.
The source includes several examples, so check them out. Here is my code…
public string GetHtmlBodyContents( string html )
{
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.LoadHtml( html );
HtmlNodeCollection tags = doc.DocumentNode.SelectNodes( "//body" );
// SelectNodes returns null when nothing matches, so check before using Count
if ( tags == null || tags.Count != 1 ) return html;
return tags[0].InnerHtml;
}
Parse Complete, to RSS with you
I will mention that I used Rss.Net to create my Rss feed. Using it is both easy and outside the scope of this post. I will warn you that they have not converted to using subversion so you’ll have to install a cvs client to get the code. There are no releases. You could, if you desire, pay for their commercial version. However, I feel they are falling into the age old trap of over-thinking their license and making the paid for version inaccessible. I believe you should be able to use the personal version for free and a commercial license should be one, flat fee for single business usage. That would allow a single business to use the product in their web sites. Charging extra for source and then again for redistribution is just confusing. They have 30 something options on their online store. c’mon guys, get it together. When I have more than 10 options, I am pretty much finished wasting my time trying to pay someone for their product when there are free alternatives. Yes, you wrote nifty code, don’t make it so hard to be compensated for it.
Ok, off soapbox, back to coding!