I’m an avid reader of MSDN Magazine, and seldom find any errors. However, in Ken Getz's article “The LINQ Enumerable Class, Part 1” in the July 2008 issue, I found a rather glaring error that needs correction. I sent the following text to Ken, but unfortunately never got a response. Hopefully some will see this blog post, and we'll not be seeing the error illustrated here in production code.
The following piece of code intended to solve the classic shuffle problem is very wrong:
Dim rnd As new System.Random()
Dim numbers = Enumerable.Range(1, 100), OrderBy(Function() rnd.Next)
The error will manifest by making some shuffles more or less likely than others. It is not an unbiased shuffle.
The problem lies in the fact that a list of 100 random numbers, independently chosen, are used to produce a random order of the numbers 1 to 100.
If this code is used as a template for a simulation, the results will be skewed, because not all outcomes of the shuffle are equally likely. If the code is used (with appropriate substitution to a strong pseudo random number generator) for gaming software, either the players or the casino will get better odds than expected.
This is rather serious, as code snippets from MSDN Magazine are likely to be used in many applications.
Why is the code wrong?
Because, when shuffling N numbers in random order, there are N! number of possible shuffles. But, when picking N random numbers independently, from a set of M numbers, there are M**N possible outcomes due the possibility of the same number being drawn more than one time.
For there to be a possibility of this resulting in all shuffles being equally likely, M**N must be evenly divisible by N!. But this is not possible because in this particular case M, 2**31-1 or 2,147,483,647, is prime! System.Random.Next() will return a value >= 0 and < Int32.MaxValue, so there are Int32.MaxValue possible outcomes, which is our M in this case.
This is a variation of a classic implementation error of the shuffle algorithm, and I’m afraid that we’ll have to stick with Fisher-Yates shuffle a while longer. Changing the code to use for example Random.NextDouble() does not remove the problem, it just makes it a bit harder to see. As long as the number of possible outcomes of the random number sequence is larger than the number of possible shuffles, the problem is very likely to be there although the proof will differ from case to case.
There are many more subtle pitfalls in doing a proper shuffle, using the modulo function to reduce integer valued random number generator outputs or using multiplication and rounding to scale a floating point valued RNG just being two of the more well-known.
By the way, the actual implementation of System.Random in the .NET Framework is quite questionable in this regard as well. It will not return an unbiased set of random numbers in some of the overloads, and the Random.NextDouble() implementation will in fact only return the same number of possible outcomes as the System.Next(), because it just scales System.Next() with 1.0/Int32.MaxValue.
No comments:
Post a Comment