Performance of concurrent XmlReader construction
Hosting FeedZero on a Linux server using Mono can be trying at times. It's not that Mono itself is particularly troublesome (I actually think it's amazingly good), but more that Mono is the red-headed stepchild of the wider .NET community – much like those who try to build various command-line UNIX tools on Windows.
Usually running .NET assemblies on Mono “just works”, but when you encounter a bug – particularly in third-party libraries – it seems I am often the first person to have run into it (judging by Google), and so I have no real choice except to diagnose it myself.
Worse, many .NET libraries – even open source ones – do not even try to support Mono. Unlike GameCreate, FeedZero utilises a number of third-party libraries and so is more exposed to these sorts of problems. I encountered three separate problems in three separate libraries this week, all of which I had to diagnose and then either fix or find a workaround for.
The most recent issue I encountered was a problem with Argotic, which calls itself a “syndication framework” but is more plainly described as a library for parsing RSS feeds. I noticed while inspecting our service logs that, with relative regularity, our updates would pause for minutes at a time – often 5 or more. After inspecting a few occurrences, the common theme was that each pause was followed by an error reporting that the RSS feed was invalid in some way.
Armed with the log data, we set about building a test executable that parsed a list of RSS feeds. Running the test against a sample list obtained from the service logs gave the same results as observed in production: given a list of 16 feeds, 3 of which were invalid, the test would pause for multiple minutes before finally finishing. However, running the test executable with a single-item list consisting of any one of the three bad RSS feeds did not cause the pausing. Worse, Windows did not seem to have this behaviour.
Due to the number of feeds we need to scan, FeedZero does RSS feed updates in multiple threads at once; the above behaviour was consistent with some sort of concurrency bug. Three classes of bug (concurrency-related, third-party library and Mono-only) are probably the most tiresome to diagnose, and here was an issue that fell into all three categories.
I eventually narrowed my test case down to updating any two invalid RSS feeds at once and still had the pause. For a third-party library (and indeed any distinct module of code), the easiest way to test for a concurrency issue is to simply prevent that module or library from being called by more than one thread at a time. This is very easy to do using C#’s built-in object locking feature:
private static readonly object _lock = new object();

public void UpdateFeed(string url)
{
    ...
    GenericFeed feed = new GenericFeed();
    // Serialise calls into Argotic so only one thread parses at a time.
    lock (_lock)
        feed.Load(url);
    ...
}
Sure enough, with the simple two-line addition of a static object on which to lock the problem went away. Unfortunately for me, FeedZero really needs to get good performance on its feed updating so I would need to identify the underlying cause in Argotic. Through some tedious but necessary analysis I was able to narrow the problem down to the construction of System.Xml.XPath.XPathDocument, a part of the BCL.
My next step was to write a program from scratch that demonstrates the problem, does not rely on any other libraries and can be freely distributed. I ended up with a command-line executable that created two XPathDocument objects simultaneously with an HTML page as the input, with a command-line argument permitting a lock to be used on the constructor.
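A minimal sketch of what such a test program might look like follows. It is not the original executable: the class name ConcurrentXPathRepro, the --lock flag and the choice to read the page from a local file are all assumptions made for illustration, and whether it reproduces the slowdown will depend on the input page and the runtime.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;
using System.Xml;
using System.Xml.XPath;

// Hypothetical repro: names and command-line shape are illustrative only.
public static class ConcurrentXPathRepro
{
    private static readonly object _lock = new object();

    // Builds an XPathDocument from the text; returns false if the
    // parser rejects it as malformed XML.
    public static bool TryParse(string content, bool useLock)
    {
        try
        {
            if (useLock)
            {
                // Only one thread at a time may run the constructor.
                lock (_lock) { new XPathDocument(new StringReader(content)); }
            }
            else
            {
                new XPathDocument(new StringReader(content));
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.Error.WriteLine("usage: repro <input-file> [--lock]");
            return;
        }
        string content = File.ReadAllText(args[0]);
        bool useLock = args.Length > 1 && args[1] == "--lock";

        // Start two threads that construct the document at the same time.
        Stopwatch watch = Stopwatch.StartNew();
        Thread[] threads = new Thread[2];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() => TryParse(content, useLock));
            threads[i].Start();
        }
        foreach (Thread t in threads) t.Join();
        watch.Stop();

        Console.WriteLine("lock={0} elapsed={1} ms", useLock, watch.ElapsedMilliseconds);
    }
}
```

Running it once with and once without --lock against the same saved HTML page gives the two timings to compare.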
This simple program shows the problem; on my PC it took 6 seconds using the lock and over 5 minutes without it. I then turned to Windows, where my first run took 9 seconds without the lock but a subsequent run took almost 2 minutes. Through additional reading, I determined that XPathDocument apparently uses a System.Xml.XmlReader internally and adjusted my test executable to construct XmlReader objects instead; the problem remained.
Finally I altered my executable to perform the test 10 times and report the runtime of each so I could look at the average runtime; on both Windows and Linux, each run with the lock took about 6 seconds – but without the lock, the results are more interesting. While Windows seems to be not too bothered without the lock on the first attempt, on the second attempt it threw some sort of internal timeout exception.
So, both Mono and Windows essentially have very poor performance when attempting to parse two non-XML documents simultaneously; I will package up my test executable and submit a bug to Novell for the Mono issue.
So, the final outcome is that I really had no choice but to lock; however, a secondary observation from this process is that relying on XmlReader to throw an exception on non-XML takes a pretty long time (multiple seconds), which is too long for my purposes. Finally I settled on passing the downloaded data through the following regular expression to determine if it's likely to be a feed:
private static readonly Regex _isFeedRegex = new Regex("<([\\w_-]+:)?(feed|rss|rdf)", RegexOptions.IgnoreCase | RegexOptions.Compiled);
If there is no match, then we can skip the entire parsing attempt; with this change we can now fail invalid feeds in a few hundred milliseconds.
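As a rough sketch of how that pre-check could be wrapped up, the class below reuses the regular expression exactly as given; the FeedSniffer class and LooksLikeFeed method are hypothetical names, not part of FeedZero or Argotic.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the feed-detection regex.
public static class FeedSniffer
{
    private static readonly Regex _isFeedRegex = new Regex("<([\\w_-]+:)?(feed|rss|rdf)", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // True if the text contains something that looks like the opening
    // tag of an Atom, RSS or RDF document (optionally namespace-prefixed).
    public static bool LooksLikeFeed(string downloaded)
    {
        return !string.IsNullOrEmpty(downloaded) && _isFeedRegex.IsMatch(downloaded);
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeFeed("<rss version=\"2.0\"><channel/></rss>"));
        Console.WriteLine(LooksLikeFeed("<html><body>Not found</body></html>"));
    }
}
```

This is only a heuristic: any tag that merely starts with one of those names (say, a hypothetical <feedback> element) also matches, so it filters out obvious non-feeds rather than validating real ones.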
Update: Further testing shows that the standard reader obtained from XmlReader.Create notices immediately (on first attempted read) that the document isn’t XML, which makes an even easier way to help out XPathDocument.
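One way that check might look is sketched below, assuming the downloaded data is already held in a string and that the reader is created with default settings. XmlSniffer and StartsLikeXml are hypothetical names. Note that it only examines the start of the document, so an HTML page with no DOCTYPE that happens to open with a well-formed tag would still pass.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: a cheap "is this XML at all?" probe.
public static class XmlSniffer
{
    // Advances a default XmlReader to the first content node and reports
    // whether it got as far as an element without the reader throwing.
    public static bool StartsLikeXml(string content)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(content)))
            {
                return reader.MoveToContent() == XmlNodeType.Element;
            }
        }
        catch (XmlException)
        {
            // Thrown for non-XML input; with default settings a DOCTYPE
            // is also rejected because DTD processing is prohibited.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(StartsLikeXml("<rss version=\"2.0\"></rss>"));
        Console.WriteLine(StartsLikeXml("404 - page not found"));
    }
}
```

Only documents that pass this probe would then be handed on to XPathDocument.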