Sunday, March 1, 2009

How not to shuffle a deck of cards with LINQ

I’m an avid reader of MSDN Magazine, and seldom find any errors. However, in Ken Getz's article “The LINQ Enumerable Class, Part 1” in the July 2008 issue, I found a rather glaring error that needs correction. I sent the following text to Ken, but unfortunately never got a response. Hopefully someone will see this blog post, and we'll not be seeing the error illustrated here in production code.

The following piece of code intended to solve the classic shuffle problem is very wrong:

Dim rnd As new System.Random()
Dim numbers = Enumerable.Range(1, 100).OrderBy(Function() rnd.Next)


The error manifests itself by making some shuffles more likely than others: it is not an unbiased shuffle.

The problem lies in the fact that a list of 100 independently chosen random numbers is used to produce a random order of the numbers 1 to 100.

If this code is used as a template for a simulation, the results will be skewed, because not all outcomes of the shuffle are equally likely. If the code is used (with appropriate substitution of a strong pseudo-random number generator) for gaming software, either the players or the casino will get better odds than expected.

This is rather serious, as code snippets from MSDN Magazine are likely to be used in many applications.

Why is the code wrong?

Because, when shuffling N numbers into random order, there are N! possible shuffles. But when picking N random numbers independently from a set of M numbers, there are M**N possible outcomes, due to the possibility of the same number being drawn more than once.

For there to be even a possibility of this resulting in all shuffles being equally likely, M**N must be evenly divisible by N!. But this is not possible, because in this particular case M, 2**31-1 or 2,147,483,647, is prime! The only prime factor of M**N is then M itself, while N! contains the factor 2 for any N >= 2. System.Random.Next() will return a value >= 0 and < Int32.MaxValue, so there are Int32.MaxValue possible outcomes, which is our M in this case.

This is a variation of a classic implementation error of the shuffle algorithm, and I’m afraid that we’ll have to stick with the Fisher-Yates shuffle a while longer. Changing the code to use, for example, Random.NextDouble() does not remove the problem; it just makes it a bit harder to see. As long as the number of possible outcomes of the random number sequence is larger than the number of possible shuffles, the problem is very likely to be there, although the proof will differ from case to case.
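
For reference, here is a minimal Fisher-Yates sketch in C# (my illustration, not Ken's code; anything security-sensitive should substitute a cryptographically strong RNG for System.Random):

static void Shuffle<T>(T[] items, Random rnd)
{
    // Walk backwards; at each position pick uniformly among the items
    // not yet placed and swap it in. Each of the N! orderings is then
    // produced by exactly one sequence of picks.
    for (int i = items.Length - 1; i > 0; --i)
    {
        int j = rnd.Next(i + 1); // 0 <= j <= i
        T tmp = items[i];
        items[i] = items[j];
        items[j] = tmp;
    }
}

(As the closing paragraph below notes, System.Random's own overloads are not perfectly unbiased either, so rnd.Next(i + 1) inherits a tiny bias; the shuffle algorithm itself is the point here.)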

There are many more subtle pitfalls in doing a proper shuffle; using the modulo function to reduce integer-valued random number generator outputs, or using multiplication and rounding to scale a floating-point-valued RNG, are just two of the more well-known.
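
To illustrate the modulo pitfall, a hedged sketch of mine (not from the article): if the raw generator returns Int32.MaxValue equally likely values, reducing with % favors the small residues; rejection sampling restores uniformity at the cost of an occasional retry.

static int BiasedPick(Random rnd, int n)
{
    // Biased: Int32.MaxValue is not a multiple of n, so some residues
    // are produced by one more raw value than others.
    return rnd.Next() % n;
}

static int UnbiasedPick(Random rnd, int n)
{
    // Rejection sampling: discard the uneven tail of the range so that
    // every residue corresponds to exactly the same number of raw values.
    int limit = int.MaxValue - (int.MaxValue % n);
    int value;
    do
    {
        value = rnd.Next();
    } while (value >= limit);
    return value % n;
}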

By the way, the actual implementation of System.Random in the .NET Framework is quite questionable in this regard as well. It will not return an unbiased set of random numbers in some of the overloads, and the Random.NextDouble() implementation will in fact only return the same number of possible outcomes as Random.Next(), because it just scales Next() by 1.0/Int32.MaxValue.

Friday, September 12, 2008

How to make a file read in Windows not become a write

A little known, and even less used, feature of all Windows versions from XP and forward is that they support a property called 'Last Access' on all files. On the surface, this seems neat, if not so useful. You can see when a file was last accessed using this property.

But think about it. What does this mean? It means that every time you open a file for reading, Windows needs to write something somewhere on the disk! If you're in the process of enumerating, let's say, 500,000 files, this equals slow! Does anyone ever use that property? Not that I know of.

I'm working with file-based persistent storage in my solutions, not with a database, so file access is pretty important to me. By disabling this 'feature', I sped up enumerating the file system by about a factor of 10! Generally speaking, you'll speed up any system with many file accesses by turning this feature off.

It's really simple too. At a DOS prompt, write:

fsutil behavior set disablelastaccess 1 
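
You can check the current state with the corresponding query command (my aside, not in the original post):

fsutil behavior query disablelastaccess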

While you're at it, you might also want to do:

fsutil behavior set disable8dot3 1 

This last command disables generation of 8-dot-3 legacy file names, effectively halving the size of directories in NTFS, which must be a good thing. Beware that there might be 16-bit software out there that actually needs those 8-dot-3 names to find your files...

Thursday, February 7, 2008

Book Review: Microsoft Windows Internals, Fourth Edition

Microsoft Windows Internals, Fourth Edition, by Mark E. Russinovich and David A. Solomon, Microsoft Press, LOCCN 2004115221

Many years ago, before the release of NT 3.1, I read a book entitled "Inside Windows NT" by Helen Custer. It was a great book, basically a text-book on operating system theory - as exemplified by Windows NT. It covered the theory of how to implement an operating system kernel, showing how it was done in Windows NT. It did not talk about APIs so much as about the data structures and logic behind the scenes, and the theory of the basic functions of an operating system such as memory management and the I/O system.

As I'm now getting back into some heavy-duty C++ coding for the Windows environment, I thought this might be a good refresher for me to (re-)learn about internal structures and enable me to find the right places to implement the functionality I need.

With these expectations I was a bit disappointed by "Windows Internals, Fourth Edition". It's a very different kind of book compared to the original first edition - in fact it's not the fourth edition of "Inside Windows NT" - it's really the second or third edition of "Windows Internals". So, what kind of book is it then?

"Windows Internals" is a cross between a troubleshooting manual for very advanced system managers, a hackers memoirs, an applied users guide to sysinternals utilities and the documentation Microsoft didn't produce for Windows.

It's almost like an independent black-box investigator's report of findings after many years of peering into the internals of Windows - from the outside. Instead of describing how Windows is designed from the designers' point of view, it describes a process of external discovery based on reverse engineering and observation. Instead of just describing how it works, the book focuses on "experiments" whereby, with the help of a bunch of very nifty utilities from Sysinternals, you can "see" how it works.

I find the approach a little strange; I was expecting a more authoritative text, not an experimental guide to 'discovery'. I don't think one should have to use experimental approaches to learn about a piece of commercial software. Software is an engineering practice - and it should be described, not discovered. It should not be a research project to find out how Windows works - it should be a matter of reading documentation and backgrounders, which was what I was hoping for when purchasing the book.

Having read all 870 pages, what did I learn? I learnt that Sysinternals (http://technet.microsoft.com/en-us/sysinternals/default.aspx) has some very cool utilities (which I already knew), and I learnt a bit about how they do what they do, and how to use them to inspect the state of a Windows system for troubleshooting purposes. As such, the book should really be labelled "The Essential Sysinternals Companion", because that's what it really is. It shows you a zillion ways to use the utilities for troubleshooting, which is all well and good as far as it goes, and very useful in itself.

To summarize, this is not really the book to read if you want an authoritative reference about the Windows operating system, although you will learn quite a bit along the way - after all, there is quite a bit of information here. If you're a system manager and/or facing extremely complicated troubleshooting scenarios, then this book is indeed for you. Also, if you're a more practical-minded person and just want to discover the 'secrets' of Windows, you'll find all the tools here. I would have preferred that Microsoft documented things, instead of leaving them for 'discovery' (and then hiring the people doing the discovering if they're too good at it, and then making them write a book about it - which is what happened here).

Tuesday, January 8, 2008

Lock object sharing with hashes

In web applications we frequently need to serialize access to resources, since ASP.NET is inherently concurrent. A typical scenario is that we have several loosely connected objects that all apply to the same user, and we need to ensure single-threaded access during a read-read or read-modify-write cycle to get a consistent view or update.
A user is usually represented via some sort of string identifier, perhaps an e-mail address. In C#, what we want to do is something like:
lock (user)
{
    // Do something that needs single-thread access
}
The problem is: what are we to use as a lock object? C# can use any object as a lock, but which one to pick? We must ensure that multiple threads will always get the right object instance, regardless of when in the application lifetime the need arises, so in effect these objects must live for the life of the application. This can lead to massive memory consumption. Assume a system with one million users: after a while we'll have to keep one million lock objects around, probably in a hash table indexed by the e-mail string. That can mean some serious memory problems.

One approach would be to clean up the table when no-one is using a specific lock object, but this is complicated and fraught with its own threading problems.

After a few false starts, I came up with the following scheme which has now been tested in the wild and been found quite effective as a trade-off between memory and lock contention.

In actual fact, there is usually a rather limited number of possible concurrent actions, bounded basically by the number of active threads. This number is typically 100 per processor in ASP.NET, and in most applications, even with many users, the number of actual concurrent requests at any given time is far lower. So, assuming 100 concurrent threads, and assuming that they will only acquire one user lock (our example here) at a time, we really only need at most 100 lock objects - not a million. But how to do this?

The algorithm I've come up with is probably not new, but I've not seen it before in the literature nor when actively searching on the web, so it's at least a bit unusual. Here's how it works:

1. Allocate an array with a fixed number of elements, perhaps twice the number of estimated concurrent accesses.
2. Fill the array with objects to be used as locks.
3. To acquire a lock for a given user, generate an index into the lock object array by taking a hash of the user identifier, typically with the GetHashCode() method, and then taking that modulo the number of lock objects. This is the index into the lock table; lock on the object at that index.

At best, the lock is free and your thread acquires it immediately.

At second best, another thread is actually holding the lock for the same user, and your thread is put on hold as it should be.

At worst, another thread is actually holding the lock but for a different user that happens to use the same lock when calculating the index via the hash modulo the lock array size. By having good hash algorithms and an appropriate number of locks in relation to the number of concurrent accesses, this should be a very infrequent occurrence. But even if it happens, nothing bad happens except that your thread will have to wait a little longer than was absolutely necessary.

This simple algorithm will require a fixed number of locks in relation to the level of concurrency instead of the number of potential objects that require locking, and at a very low cost. Sample code follows:
public class LockObjects
{
    private object[] _lockObjects;

    public LockObjects(int numberOfLockObjects)
    {
        _lockObjects = GetInitialLockObjects(numberOfLockObjects);
    }

    public LockObjects()
        : this(20)
    {
    }

    private object[] GetInitialLockObjects(int numberOfLockObjects)
    {
        // The lock objects live for the lifetime of this instance.
        object[] lockObjects = new object[numberOfLockObjects];
        for (int i = 0; i < lockObjects.Length; ++i)
        {
            lockObjects[i] = new object();
        }
        return lockObjects;
    }

    public virtual object GetLockObject(params string[] keys)
    {
        // Combine the (case-insensitive) hashes of all keys, then reduce
        // modulo the table size. The mask clears the sign bit instead of
        // using Math.Abs(), which throws for Int32.MinValue.
        int lockHash = 0;
        foreach (string key in keys)
        {
            lockHash += key.ToLowerInvariant().GetHashCode();
        }
        lockHash = (lockHash & 0x7FFFFFFF) % _lockObjects.Length;
        return _lockObjects[lockHash];
    }
}
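
For illustration, a minimal usage sketch of the class above (the field name, lock count and method are my assumptions, not from the original):

// One shared instance for the lifetime of the application; 200 locks
// comfortably covers ~100 concurrent worker threads.
private static readonly LockObjects _userLocks = new LockObjects(200);

public void UpdateUser(string email)
{
    // Every thread hashing the same e-mail gets the same lock object,
    // so the read-modify-write cycle is single-threaded per user.
    lock (_userLocks.GetLockObject(email))
    {
        // ... load, modify and save this user's data ...
    }
}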

Monday, November 26, 2007

Book Review: XSLT 2.0 Programmer's Reference Third Edition

XSLT 2.0 Programmer's Reference Third Edition, by Michael Kay, Wiley Publishing, Inc., ISBN 0-7645-6909-0

XSLT, or XSL, is a subject that I'm no expert in, but I've come across it from time to time and generally have had a hard time really grasping the how and the why of it. In most cases I can program, or at least tweak, just about anything with very little introduction. Fixing and tweaking the XSLT stylesheets that I've come upon has been a tougher experience, where I've felt reduced to guesswork and magic. That's not a feeling I like, so I decided to do some background studying.

A Programmer's Reference is perhaps not the first choice as an introduction to a subject, but in this case it was hard to find just where to start, and I felt that I was experienced enough to go for some core literature from the start, which also would have the benefit of being useful in a real situation as reference literature.

Since I'm a newcomer to XSLT this review will have to be both about the book as such, and also about the subject matter. Let's start with the book.

Michael Kay is certainly an authority, being the editor of the W3C XSLT 2.0 specification. The book is also authoritative and extremely carefully written, with an extraordinary focus on details. I did find a few typos, errors and editorial mistakes, but taking the amount of text into account it's still a very, very good piece of work.

This book is not written to be read cover to cover, which I did, but it's still not a bad way to get a thorough introduction to XSLT. Be prepared for quite a few hours though; I spent about 20 hours reading this book. It's entitled XSLT 2.0, but was written before XSLT 2.0 was actually approved as an official Recommendation, which happened on 23 January 2007. I've not checked, but there are sure to be some minor differences between the final Recommendation and the drafts upon which the book was based. As a consequence of it being such a recent standard, there are very few XSLT 2.0 compliant implementations in existence, so XSLT 1.0 is still very much in use. The book is careful to keep track of differences and changes, and should work well for XSLT 1.0 use as well.

It's very heavy reading indeed, but if you only want to get one book about XSLT 2.0 this is very probably the one to get.

The real question, though, that I must raise after reading this and getting a good feel for XSLT is: Do you want to get any book about XSLT at all?

XSLT is about XML transformation, or actually transformation in general. It doesn't really have to be from or to XML; it can be from plain text to HTML or any number of other combinations, depending on the requirements and capabilities of the parsers and processors available. This is obviously extremely useful - to be able to massage data from and to different forms - and frequently used in one-off applications and in various integration projects. The target use of XSLT is also to fit in alongside CSS 2.0 as a way to perform formatting for presentation that is not possible with CSS - that's why it's called XSL Transformations (Extensible Stylesheet Language Transformations).

So XSLT certainly addresses an important area. However, sadly, I must conclude that it's not a very good tool in my opinion. Even if supplemented with good development environments with color-coded and syntax-checking editors, it's still simply not very friendly to the human eye. Too many angle brackets and colons, one might say. Syntax does matter! The real problem, though, is that it's a functional programming language, not a procedural language, and this simply does not lend itself to performing complex tasks in the real world.

Functional languages focus on defining the program in terms of functions that are stateless and without variables. Everything is defined as functions without side-effects; that is to say, each call to a function with the same parameters will always return the same result. Iteration is replaced with recursion - even when iteration is the natural way to address a problem - because an iterator must be updated for each round of a loop, and you can't do that. This means that while anything can be programmed in a functional language, it must frequently be done in ways that are not well known to the majority of developers.
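
As a rough C# illustration of the difference (my sketch, not the book's): the first version mutates an accumulator each round; the functional rendition carries the state along in recursive calls instead.

// Procedural style: a mutable accumulator is updated on every pass.
static int SumIterative(int[] xs)
{
    int total = 0;
    for (int i = 0; i < xs.Length; ++i)
    {
        total += xs[i];
    }
    return total;
}

// Functional style: nothing is ever reassigned; the "loop" is a
// recursive call that passes the next index along as an argument.
static int SumRecursive(int[] xs, int i)
{
    return i == xs.Length ? 0 : xs[i] + SumRecursive(xs, i + 1);
}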

There's a reason why functional languages like Lisp, ML and Scheme have not become commercially successful, although loved by the academic community for decades. Basically I think it's a question of maintenance and complexity. In the real world of commercial programming, systems must be maintained for decades by perhaps thousands of different developers over the years. This has always been an uphill task, but no functional language, with the possible exception of Erlang, has succeeded in combining expressive power with robustness, documentability and maintainability.

XSLT is certainly expressive, but I categorize it as being of the class of write-once, write-only languages. Integrated with XPath 2.0, it's possible to write programs that are so smart that even the author will have trouble understanding them the next day.

There's nothing wrong with the basic concept of defining a standard way of transforming documents between different representations, and making it possible to choose between doing the processing in the browser or on the server. It's neat and it's cool. However, doing anything but trivial transformations is a maintenance nightmare. That the functional programming model is little known among mainstream developers does not make it any better.

Somehow, it feels like XSLT is 90% geared towards the internal needs of the W3C - it's used extensively to format and publish the various specifications produced by the W3C. But this actually means, as far as I can judge, that the specifications are written in raw XML using plain text editors, with the markup done manually - something that won't exactly work for any other organization.

So, unfortunately, in the end I feel that XSLT 2.0 is a technology that's elegant but will never be used on a wider scale. However, if you do have many, many documents in some kind of structured format (not necessarily XML, surprisingly enough) and want to transform them to XML or an XML-like format such as HTML, then XSLT may well be just what you need. Be prepared for a very high entrance cost though, and rest assured that as the author of the stylesheets you'll have a very high level of job security.

There are also serious performance issues with XSLT: due to the functional style of programming, compilers and optimizers have a hard time generating decent code for the underlying procedural architecture of our computers. In theory, functional programming could come into its own performance-wise as multi-core architectures become more common, because it does make it easier to realize parallel computation, but today other problems overshadow that, and I'm fairly sure that performance with XSLT will in many cases be unacceptable.

So, to summarize: if you want to learn and use XSLT 1.0 or 2.0, this book is probably the one to get, but you should not assume that XSLT is a silver bullet for XML transformation; there are many caveats.

Wednesday, October 24, 2007

Troubles with Health Monitoring, System.Net.Mail.SmtpClient and SSL

The web is full of desperate pleas for help from prematurely bald developers who have discovered the fatal flaw in the shiny new ASP.NET 2.0 System.Net.Mail.SmtpClient class. This class is touted as overcoming all the problems of the old variant, which could not even be configured for credentials without resorting to some heavy-duty tricks.

However, as countless people have discovered, while the new SmtpClient class is neat, easy to use, and configurable via Web.Config, they forgot to make one thing configurable: the EnableSsl property is not settable via Web.Config. So, big deal, you say - just write a line of code and set it manually... The problem is, you frequently did not write the code that instantiates the SmtpClient. The most well-known problem is with the suite of new login controls, which can send mail in some circumstances. Works fine - unless you need to enable SSL for the SMTP connection. Fortunately, there's a well-known workaround, since these controls expose an event called SendingMail where you can do magic things, including affecting how the mail is sent - most simply by taking over responsibility for sending it.
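
A hedged sketch of that workaround (the control and handler names are my assumptions; the event and its MailMessageEventArgs members are the standard ones):

// In a page's code-behind, wired up e.g. as
// OnSendingMail="PasswordRecovery1_SendingMail" on a PasswordRecovery control.
protected void PasswordRecovery1_SendingMail(object sender, System.Web.UI.WebControls.MailMessageEventArgs e)
{
    System.Net.Mail.SmtpClient client = new System.Net.Mail.SmtpClient(); // host/credentials still come from Web.Config
    client.EnableSsl = true;   // the one property Web.Config can't set
    client.Send(e.Message);    // e.Message is the mail the control built
    e.Cancel = true;           // tell the control we've sent it ourselves
}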

Then I hit the wall, really hard, trying to use the System.Web.Management.TemplatedMailWebEventProvider class. This is a provider that can subscribe to health monitoring events, and send them via e-mail. Using SmtpClient() of course, with the instantiation hidden deep in its innards of sealed and internal classes in System.Web.dll. No events to the rescue this time either.

After hours of fruitless searching, I finally came to the conclusion that I needed a workaround, ugly as it may be. So, here's where the decorator pattern meets reflection. Sigh. It ain't pretty, but it does work, and I do get to use the otherwise rather nice and advanced TemplatedMailWebEventProvider (the same technique can be used for the SimpleMailWebEventProvider, or any provider derived from MailWebEventProvider).

In the end, it's just a few lines of code:

using System;
using System.Collections.Specialized;
using System.Reflection;
using System.Web.Management;

public class TemplatedMailWithSslWebEventProvider : WebEventProvider
{
    private TemplatedMailWebEventProvider _templatedProvider;

    public TemplatedMailWithSslWebEventProvider()
    {
        // The real provider has no public constructor, so dig out the
        // internal one via reflection and invoke it.
        ConstructorInfo constructor = typeof(TemplatedMailWebEventProvider)
            .GetConstructor(BindingFlags.Instance | BindingFlags.NonPublic,
                            null, new Type[0], null);
        _templatedProvider = (TemplatedMailWebEventProvider)constructor
            .Invoke(null);
    }

    public override void Initialize(string name, NameValueCollection config)
    {
        if (config == null)
        {
            throw new ArgumentNullException("config");
        }
        _templatedProvider.Initialize(name, config);
        // Replace the private SmtpClient with our SSL-enabled subclass.
        FieldInfo field = typeof(MailWebEventProvider)
            .GetField("_smtpClient",
                      BindingFlags.Instance | BindingFlags.NonPublic);
        field.SetValue(_templatedProvider, new SmtpClientWithSsl());
    }

    public static MailEventNotificationInfo CurrentNotification
    {
        get
        {
            return TemplatedMailWebEventProvider.CurrentNotification;
        }
    }

    public override void Flush()
    {
        _templatedProvider.Flush();
    }

    public override void ProcessEvent(WebBaseEvent raisedEvent)
    {
        _templatedProvider.ProcessEvent(raisedEvent);
    }

    public override void Shutdown()
    {
        _templatedProvider.Shutdown();
    }
}
All that's left for you to do is to define the SmtpClientWithSsl class, deriving from System.Net.Mail.SmtpClient - whose developer, probably by the same oversight that forgot about SSL, also forgot to make it sealed. Fortunately. Here two wrongs almost make one right!
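
For completeness, the obvious implementation might look like this (my sketch; the original leaves the class to the reader):

// Derives from the (fortunately) unsealed SmtpClient and flips the one
// property that cannot be set from Web.Config.
public class SmtpClientWithSsl : System.Net.Mail.SmtpClient
{
    public SmtpClientWithSsl()
    {
        this.EnableSsl = true;
    }
}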

One of the morals of this story is to really think about the use of sealed and internal. My first try was to implement a custom templated e-mail provider, but it turned out that was quite a job, and I could not override or use anything from System.Web.dll, because it was all sealed and used lots of internal helpers. If you really need to hide the implementation that badly, it might be better to introduce a public base class where the essential interfaces are exposed as protected methods and properties. When you mark a class sealed, and it depends on lots of additional logic, do consider making that logic available at least to alternative implementations, and give them a base to inherit from.

Tuesday, October 16, 2007

Book Review: HTML, XHTML, and CSS Bible, 3rd Edition

HTML, XHTML, and CSS Bible, 3rd Edition, by Bryan Pfaffenberger, Steven M. Schafer, Chuck White and Bill Karow, Wiley Publishing, Inc. ISBN 0-7645-5739-4.

Not really being a GUI person - I tend to thrive more on the backend of things, with threads, algorithms and such - I still need to do a fair bit of client-side stuff. With this in mind, I purchased this book, thinking that an authoritative work like a bible was just what I needed.

However, this is not a bible. Not at all. It's an introduction to a broad range of subjects with very little depth. Why it has XHTML in the title is totally beyond me; there's not even a chapter about it. This book is really about HTML, CSS and the web in general - forget XHTML (although admittedly there is little difference from HTML).

In the preface, it's mentioned that the main author, Bryan Pfaffenberger, is the author of more than 75 books. Impressive! Until you actually read this one. I have never seen so many mistakes in one book. Nor have I seen such blatant waste of space and duplication of code. Seriously, you're paying for about 50 pages of Lorem Ipsum here!

So, it's sloppily written, wasteful of dead trees, and mistitled. Is it thus useless? No, actually not really.

While you really can't use this bible as a reference (because of all the mistakes etc.), you can probably get some pretty good use out of it if you're relatively new to the web but can code a little, and you're thinking about starting your own web site for whatever reason. It does cover a wealth of topics that are relevant as an introduction. There are chapters about how to upload your site, how to optimize for search engines, and so on - none of which is implied by the title, but still useful for a different kind of reader.

I would rename it "Getting Started with the Web" or something like that. As such, it's not really that bad. But don't buy this book if you're looking for detailed, in-depth information about HTML, XHTML and CSS.