Hello everyone!

Finally the source code for the Html Agility Pack is now put in a central repository! Thanks to the CodePlex team :-)

Note: the old page located at http://smourier.blogspot.com/2005/05/net-html-agility-pack-how-to-use.html is now obsolete. Use CodePlex forums for discussions, questions, bugs, etc...

Now, erhh... what exactly is the Html Agility Pack? All right, all right, I will tell you now:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).

Sample applications:

  • Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.
  • Web scanners. You can easily get to img/src or a/hrefs with a bunch of XPATH queries.
  • Web scrapers. You can easily scrape any existing web page into an RSS feed for example, with just an XSLT file serving as the binding. An example of this is provided.

There is no dependency on anything other than .NET's XPATH implementation. There is no dependency on Internet Explorer's MSHTML DLL, W3C's HTML Tidy, ActiveX / COM objects, or anything like that. There is also no adherence to XHTML or XML, although you can actually produce XML using the tool. The version posted here on CodePlex is for the .NET Framework 2.0. If you need the old version, please go to the old page or drop me a note.


For example, here is how you would fix all hrefs in an HTML file:

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
doc.Save("file.htm");

If you want to participate in the project - because that's the whole purpose of putting the source there, right - use the forums or drop me a note (simon underscore mourier at hotmail dot com)!

Happy coding, scraping, scanning, html-ing, xhtml-ing, etc... :^)
Simon Mourier.
Last edited Aug 18 2006 at 10:41 AM  by simonm, version 2
Comments
JanWaiz wrote  Oct 6 2006 at 11:31 AM  
Hi Simon,

what in Hell is that: FixLink(att) ??? :-)

FixLink? Is it a Method/Class/??

regards
Jan Waiz

Edgardo wrote  Nov 7 2006 at 6:47 PM  
Dude, you rock

yuvipanda wrote  Dec 26 2006 at 3:56 AM  
Dude, you rock:D

simonm wrote  Jan 1 2007 at 1:48 PM  
Hi JanWaiz, sorry for the delay, I never noticed there were comments down there! FixLink is a hypothetical method you must implement to suit your needs.
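
Since FixLink is left to the reader, here is one possible sketch in C#. It assumes the goal is to turn relative hrefs into absolute ones against the address of the page being processed. The class name LinkFixer, the extra pageUri parameter and the example.com addresses are made up for illustration and are not part of the Html Agility Pack; only System.Uri from the .NET Framework is used.

```csharp
using System;

public static class LinkFixer
{
    // Hypothetical helper: resolves href against the address of the page it came from.
    // Returns the original text unchanged when it cannot be turned into a URI.
    public static string FixLink(Uri pageUri, string href)
    {
        if (string.IsNullOrEmpty(href))
        {
            return href;
        }

        Uri resolved;
        return Uri.TryCreate(pageUri, href, out resolved)
            ? resolved.AbsoluteUri
            : href;
    }

    public static void Main()
    {
        Uri page = new Uri("http://example.com/a/b/page.html");
        Console.WriteLine(FixLink(page, "../../doc.html"));         // relative link
        Console.WriteLine(FixLink(page, "http://other.example/x")); // already absolute
    }
}
```

In the loop from the page above, this would presumably be called as att.Value = LinkFixer.FixLink(pageUri, att.Value), with pageUri being wherever the file was originally downloaded from.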

rosbicn wrote  Jan 18 2007 at 4:05 PM  
Hi Simonm, Thank you for your effort on HtmlAgilityPack.
I found a small bug today. For HtmlDocument, if you use OptionOutputAsXml to convert HTML into XML, the following HTML generates incorrect output.

html:
<script>if(0<1){document.write(1);}</script>

HtmlAgilityPack Xml Output (Not Right):
<script>
//<![CDATA[
if(0&lt;1){document.write(1);}
//]]>//
</script>

Right Output:
<script>
//<![CDATA[
if(0<1){document.write(1);}
//]]>//
</script>

Regards.
Rosbicn

rosbicn wrote  Jan 18 2007 at 4:09 PM  
Hi Simonm, the comment system is terrible. I hope you can understand what I mean. For HtmlDocument, if you use OptionOutputAsXml to convert HTML into XML, the following HTML produces incorrect output: "<script>if(0<1){document.write(1);}</script>". The "<" inside the script body is changed into "&lt;", which is not right: inside a CDATA section that escaping is not necessary.
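
The point about CDATA can be checked independently of the Html Agility Pack. The sketch below uses only System.Xml; the class and method names are made up for illustration, and it leaves out the // comment wrappers from the report. It shows that a standard XML writer leaves "<" untouched inside a CDATA section, and that an "&lt;" placed inside CDATA is read back as those four literal characters rather than as "<".

```csharp
using System;
using System.Xml;

public static class CDataDemo
{
    // Builds <script> with a CDATA child and returns its serialized XML.
    public static string Serialize(string scriptBody)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement script = doc.CreateElement("script");
        script.AppendChild(doc.CreateCDataSection(scriptBody));
        return script.OuterXml;
    }

    // Parses the XML and returns the text a consumer would see inside the root element.
    public static string ReadBack(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // The "<" is written as-is inside the CDATA section.
        Console.WriteLine(Serialize("if(0<1){document.write(1);}"));

        // An entity inside CDATA is not decoded, so the script text would be changed.
        Console.WriteLine(ReadBack("<script><![CDATA[if(0&lt;1){document.write(1);}]]></script>"));
    }
}
```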

protopopov wrote  Apr 28 2007 at 9:57 PM  
I found that your HtmlWeb class doesn't load an HTML document properly if the page contains characters in an encoding other than UTF-8 and the code page is not declared on the page.
To solve that problem I modified the source code of the HtmlWeb class as follows:
1. Add member and property
private Encoding _defaultEncoding = new UTF8Encoding();
public Encoding DefaultEncoding
{
    get
    {
        return _defaultEncoding;
    }
    set
    {
        _defaultEncoding = value;
    }
}

2. Modify method Get
private HttpStatusCode Get(Uri uri, string method, string path, HtmlDocument doc)
...
// BUG: doc.Load(s, true);
doc.Load(s, DefaultEncoding, true);
...


Usage of class HtmlWeb:
HtmlWeb myWeb = new HtmlWeb();
myWeb.DefaultEncoding = Encoding.GetEncoding(1251);
...
HtmlAgilityPack.HtmlDocument doc = myWeb.Load(Properties.Resources.Url);


srb31513 wrote  Jul 28 2007 at 12:58 AM  
!!! JUST WANTED TO THANK THE DEVELOPER(S) FOR PUTTING TOGETHER THE AGILITY PACK ... Very useful .. can't thank you enough...YOU'RE ALL ROCKSTARS !!!

dixit_piyush79 wrote  Jul 30 2007 at 6:18 AM  
Hi sir,
I need to log in programmatically and scrape data from the page that follows the login on a website.
I am using the Html Agility Pack for this, with http://www.dotnetjunkies.com/WebLog/joshuagough/archive/2006/01/20/134825.aspx
as a reference.
As a trial I am trying to log in to Gmail, and the code is as follows:
using System;
using System.Data;
using System.Configuration;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;


public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        FormProcessor p = new FormProcessor();
        string userName = "*****************";
        string password = "******************";
        Form form = p.GetForm("https://Gmail.com", "//form[@name='loginForm']", FormQueryModeEnum.Nested);
        form["j_username"].SetAttributeValue("value", userName);
        form["j_password"].SetAttributeValue("value", password);
        HtmlDocument doc = p.SubmitForm(form);
        string strBal = doc.DocumentNode.SelectSingleNode("//span[@class='redText']").InnerText;
        strBal = System.Web.HttpUtility.HtmlDecode(strBal);
        strBal = strBal.Substring(1).Trim();
    }
}

The problem is with the XPath //form[@name='loginForm']: the error is "node not found".
I would like to know how to compose the XPath for any website. Please point me to a complete reference about it.

thanks in advance

sharad soni

lwang wrote  Aug 2 2007 at 12:50 PM  
Thanks for this great work.

I have some questions:
1. When I try to use a standard XPath command like "//a@href" (same as your example), it always throws an exception. I have to change it to "//a[@href]". Any idea?
2. I found that in the HtmlDocument class you are probably missing one line of code at line 730 ("writer.Flush()"). If you don't flush the writer, there will be a problem when writing a larger HTML page (more than 64K, I guess). It only happens when saving output directly to a response stream.

public void Save(TextWriter writer)
{
    if (writer == null)
    {
        throw new ArgumentNullException("writer");
    }
    DocumentNode.WriteTo(writer);
    writer.Flush();
}

Again, it's great work and saved me a LOT of time. Thanks~~~
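
On the first question, "//a@href" is not valid XPath 1.0, which would explain the exception. The brackets in the page's example were most likely lost in formatting. A quick way to check an expression before handing it to SelectNodes is to compile it with the .NET XPath classes. The sketch below uses only System.Xml.XPath; the class and method names are made up for illustration.

```csharp
using System;
using System.Xml.XPath;

public static class XPathSyntaxCheck
{
    // Returns true when the expression compiles as XPath 1.0.
    public static bool IsValidXPath(string expression)
    {
        try
        {
            XPathExpression.Compile(expression);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string[] candidates = { "//a@href", "//a[@href]", "//a/@href" };
        foreach (string xpath in candidates)
        {
            Console.WriteLine("{0} -> {1}", xpath, IsValidXPath(xpath) ? "valid" : "invalid");
        }
    }
}
```

In XPath terms, "//a[@href]" selects the a elements that carry an href attribute, while "//a/@href" selects the attributes themselves. The first form is the one that fits a loop over HtmlNode objects.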

boxabirds wrote  Sep 10 2007 at 11:08 AM  
Hi Simon -- I'm struggling with an html file that has this (contents shown in entirety):

<body>
<form name="aspnetForm" method="post" action="test.aspx" id="testform">
<label for="dropdown" id="ctl00_ContentPlaceHolder2_LabelMethodOfPayment">Method of payment *</label>
<select id="dropdown">
<option selected="selected" value="SWIFT">Electronic Transfer</option>
<option value="DRAFT">Bank Draft</option>
</select>
</form>
</body>

I then do this:

XPathNavigator theNavigator = GetXPathNavigatorFromHtml(e.Response.BodyString);

When I do theNavigator.SelectSingleNode( "//option[@selected]").InnerXml I get an empty string, rather than "Electronic Transfer"

Any ideas? Worryingly, looking at theNavigator.SelectSingleNode( "//select" ).OuterXml you get garbage:

<option selected="selected" value="SWIFT">
Electronic Transfer
Bank Draft
Electronic Transfer
<option value="DRAFT">
Electronic Transfer
Bank Draft
Bank Draft
</option></option>



boxabirds wrote  Sep 10 2007 at 12:58 PM  
An update: similar problems arise when using HtmlDocument. Take the same HTML above, and see the output of this:

HtmlAgilityPack.HtmlDocument theRootDocument = new HtmlAgilityPack.HtmlDocument();
theRootDocument.LoadHtml(e.Response.BodyString);
HtmlNode theRoot = theRootDocument.DocumentNode;


theRoot.SelectSingleNode("//option[@selected]").InnerHtml
>> ""

theRoot.SelectSingleNode("//option[@selected]").OuterHtml
>> "<option selected=\"selected\" value=\"SWIFT\">"

It's like it's not matched the </option> and the related text...?

thanks for any comments!
J.
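
A commonly cited workaround for this behaviour is sketched below. It rests on an assumption that should be checked against the source of the version in use: that the parser registers "option" in the static HtmlNode.ElementsFlags table as an empty element, which would make the text end up outside the option node. Removing the entry before loading is intended to make option parse like an ordinary element. The class name is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class OptionParsingDemo
{
    public static void Main()
    {
        // Assumption: "option" is flagged as an empty element by default.
        // ElementsFlags is static, so removing the entry affects every
        // document loaded afterwards in this process.
        HtmlNode.ElementsFlags.Remove("option");

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(
            "<select id=\"dropdown\">" +
            "<option selected=\"selected\" value=\"SWIFT\">Electronic Transfer</option>" +
            "<option value=\"DRAFT\">Bank Draft</option>" +
            "</select>");

        // SelectSingleNode returns null when nothing matches, so guard before use.
        HtmlNode selected = doc.DocumentNode.SelectSingleNode("//option[@selected]");
        Console.WriteLine(selected == null ? "(no match)" : selected.InnerHtml);
    }
}
```

Because the change is global, it is probably best made once at start-up rather than around individual loads.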

ertant wrote  Oct 23 2007 at 9:04 PM  
I made some fixes to the handling of the "form" tag and to HtmlNodeNavigator so that it handles XPaths and OuterXml properties. How can I send them to you?

tlaukkan wrote  Nov 8 2007 at 9:44 AM  
Hello

I am not sure if this is a bug or my mistake with xpath:

foreach (HtmlNode form in doc.DocumentNode.SelectNodes("//form"))
{
analysisLog += "Found form name: " + form.Attributes["name"].Value + "\r\n";

HtmlNodeCollection fields = form.SelectNodes(".//select");

I find the form all right but when I try to search for select elements the SelectNodes returns null. When I use '//select' those elements are found. The select elements are nested elements of the form.

best regards,
Tommi Laukkanen

tlaukkan wrote  Nov 8 2007 at 10:17 AM  
The problem originates from faulty HTML, which the Agility Pack resolves in such a way that the form does not contain anything at all, even though the form has a start tag at the beginning of the page and an end tag near the end of the page. The selects are between these tags, but there must be some inconsistency inside the form itself that fools the parser. The page works OK in browsers, though.

best regards,
Tommi Laukkanen

berndr wrote  Jan 2 at 9:48 AM  
Hi Simon,

Thanks for the great work. I am using your HTML agility pack to parse a lot of pages, modify them and then output the changed pages. Doing so, I discovered that the InnerHtml and OuterHtml properties do not always contain the updated values. If I access the property before modifying the contents, it will still output the old value after the change. Like the following:

string innerHtml = document.DocumentNode.SelectSingleNode("//body").InnerHtml;
document.DocumentNode.SelectSingleNode("//p").RemoveAll();
Assert.AreEqual(EXPECTED_RESULT, document.DocumentNode.SelectSingleNode("//body").InnerHtml);

To work around the issue I use the WriteTo and WriteContentTo methods. They always work.

Thanks, Bernd.

jeanswest wrote  Jan 17 at 9:12 AM  
Hi Simon,
Thank you for your great work. I found that the old page you gave is not available; could you give me an old version?
my E-Mail: hxyliusen@qq.com.
Thanks, Jeaswest

Khayralla wrote  Mar 10 at 4:38 PM  
Hi Simon

I tried to get links from this site: http://www.aljazeera.net/NR/exeres/F06E0D8B-BE98-445A-9752-8E7EA9DAD30F.htm

I got a list of URLs starting with NR/exeres/F06E0D8B

Is this a bug?

Thanks

Khayralla

JohnWestoby wrote  Tue at 7:25 AM  
Hi Simon / CodePlex team,

It was a huge relief to find HTML Agility. I have a web spider which I'm using to find broken links on our web site - the hardest part about this was actually parsing HTML to extract the links, so HtmlNodeCollection fixed that for me.

I have a couple of questions (and I'm probably missing something simple here, but...)

When I get the HtmlAttribute 'href' or 'src' for example, this comes back as a 'relative' link. So in the worst case it could be something like:

../../doc.html

In other words there's still a job of parsing to be done. Also, I have to be aware if there's a 'baseref' active since this would modify how a relative link is parsed. And I would assume (?) that 'baserefs' could be nested so that my code should really be aware of this.

Have I missed something, or would this be something worth looking into? I'm happy to contribute code if it's of use...

Many thanks
