If you are using Selenium WebDriver as a web crawler and think it is too slow, you are in the right place!
In this article, we will see how to cut page parsing time by a factor of about 50.
As an example, I will parse the comments from another article on this blog. First I will parse them using the default WebDriver API (the FindElement* methods), and then compare that to CsQuery.
Here is WebDriver parsing code:
var driver = new ChromeDriver();
driver.Navigate().GoToUrl("/2014/07/fixed-setup-was-unable-to-create-a-new-system-partition-or-locate-an-existing-system-partition-during-installing-windows-8-18-7-vista-etc-from-usb/");

var stopwatch = Stopwatch.StartNew(); // System.Diagnostics
var comments = driver.FindElementsByCssSelector("li.comment");
foreach (var comment in comments)
{
    var parsedComment = new Comment();
    parsedComment.Author = comment.FindElement(By.CssSelector("cite.fn")).Text;
    parsedComment.Date = comment.FindElement(By.TagName("time")).Text;
    parsedComment.Content = comment.FindElement(By.ClassName("comment-content")).Text;
}
stopwatch.Stop();
And this is way too slow! There are around 225 comments on that page, and it took 26 seconds to parse their structure. All of WebDriver's search methods are very slow; I suspect this is because WebDriver does not cache the page's content, so every FindElement call makes a separate round trip to the browser.
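Both snippets in this article fill a Comment object that the article never shows. A minimal sketch of such a DTO might look like this; the property names match the assignments above, but the string types (e.g. Date as a string rather than a DateTime) are an assumption:

// Minimal sketch of the Comment DTO assumed by the parsing snippets.
// String-typed properties are an assumption; Date could also be a DateTime.
public class Comment
{
    public string Author { get; set; }
    public string Date { get; set; }
    public string Content { get; set; }
}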
The other approach I am going to test is a bit different. First, I will grab the HTML of the comments block, and then parse it with the CsQuery library (a fast, open-source .NET jQuery implementation):
var driver = new ChromeDriver();
driver.Navigate().GoToUrl("/2014/07/fixed-setup-was-unable-to-create-a-new-system-partition-or-locate-an-existing-system-partition-during-installing-windows-8-18-7-vista-etc-from-usb/");

var stopwatch = Stopwatch.StartNew(); // System.Diagnostics
var html = driver.FindElementById("comments").GetAttribute("innerHTML");
CQ dom = html; // CsQuery implicitly converts an HTML string into a CQ document
var comments = dom["li.comment"];
foreach (var comment in comments.Select(x => x.Cq()))
{
    var parsedComment = new Comment();
    parsedComment.Author = comment["cite.fn"].Text();
    parsedComment.Date = comment["time"].Text();
    parsedComment.Content = comment[".comment-content"].Text();
}
stopwatch.Stop();
This code took only around 600 milliseconds, roughly 43 times faster than the WebDriver-only version.
You could also use other tools, such as Html Agility Pack, regular expressions, IndexOf(), etc.
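To illustrate the point with one of those alternatives, here is a quick-and-dirty regex sketch. It is fragile against markup changes and is shown only to demonstrate that once the raw HTML is in a string, any text-processing tool works; the ExtractFirst helper and the sample markup are hypothetical, not part of the original article:

using System;
using System.Text.RegularExpressions;

static class RegexParsingDemo
{
    // Returns the first capture group of the pattern, or "" if there is no match.
    public static string ExtractFirst(string html, string pattern)
    {
        var match = Regex.Match(html, pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : "";
    }

    public static void Main()
    {
        // Hypothetical comment markup mirroring the selectors used above.
        var html = "<li class=\"comment\"><cite class=\"fn\">Alice</cite>" +
                   "<time>July 2014</time>" +
                   "<div class=\"comment-content\">Great post!</div></li>";

        Console.WriteLine(ExtractFirst(html, "<cite class=\"fn\">(.*?)</cite>")); // Alice
        Console.WriteLine(ExtractFirst(html, "<time>(.*?)</time>"));              // July 2014
    }
}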
The main idea is to wait until the page content has loaded, grab the raw HTML with a single FindElement* (or similar) call, and then parse it with a more performance-friendly tool.