Thursday, April 29, 2021

Weird random constants in GetHashCode()

Random integers in GetHashCode() for C#/.NET

When you ask Visual Studio to generate Equals() and GetHashCode() for C#/.NET, or when you inspect other code, you often see addition and multiplication of various seemingly random constants as part of the calculation. Here's an example of how it can look:

using System.Collections.Generic;

public class MyTypeA
{
    public MyTypeA(string value) => ValueA = value;

    public string ValueA { get; }

    public override int GetHashCode() => 250924889 + EqualityComparer<string>.Default.GetHashCode(ValueA);
}

public class MyTypeB
{
    public MyTypeB(string value) => ValueB = value;

    public string ValueB { get; }

    public override int GetHashCode() => -1007312712 + EqualityComparer<string>.Default.GetHashCode(ValueB);
}

public class MyFixedByVsHashCode
{
    public MyFixedByVsHashCode(string value)
    {
        A = new MyTypeA(value);
        B = new MyTypeB(value);
    }

    public MyTypeA A { get; }

    public MyTypeB B { get; }

    public override int GetHashCode()
    {
        int hashCode = -1817952719;
        hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeA>.Default.GetHashCode(A);
        hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeB>.Default.GetHashCode(B);
        return hashCode;
    }
}

The above example was generated using Visual Studio 2019 for .NET Framework and contains a number of strange, seemingly random integer constants: 250924889, -1007312712, -1817952719 and -1521134295. If you generate code for .NET Core or .NET 5 it may look a little different, but under the hood it's similar.
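For comparison, when targeting .NET Core or .NET 5 the generated code typically delegates to System.HashCode instead, which performs the same kind of mixing internally with its own constants:

public override int GetHashCode() => HashCode.Combine(A, B);

The constants are then hidden inside HashCode.Combine, but the principle discussed below is the same.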

Executive summary: The reason for these numbers is to reduce the risk of collisions, i.e. situations where two instances with different values end up with the same hash code.

So what's up with these magic numbers? First of all: No, they're not random. Let's go through them.

public override int GetHashCode() => 250924889 + EqualityComparer<string>.Default.GetHashCode(ValueA);
public override int GetHashCode() => -1007312712 + EqualityComparer<string>.Default.GetHashCode(ValueB);

These values are derived from the name of the property. Exactly how is not documented and not clear, but it's essentially equivalent to "NameOfProperty".GetHashCode(). The purpose is to mix the name of the property into the equation, reducing the risk that two properties with the same value get the same hash code.
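The effect is easy to see with the generated classes above. Because the seeds differ, two types wrapping the very same string still get different hash codes (a small illustrative snippet):

var a = new MyTypeA("Same");
var b = new MyTypeB("Same");

// The string hash is identical in both calls, but the name-derived
// seeds (250924889 vs -1007312712) keep the results apart:
Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // False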

Then we have the integer constants from the multiple property implementation:

int hashCode = -1817952719;
hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeA>.Default.GetHashCode(A);
hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeB>.Default.GetHashCode(B);

These are fixed and do not vary. A little bit of analysis shows they are far from random. The first one, -1817952719, is actually the product of two relatively large primes, 16363 * 151379 = 2477014577, and is thus a nice semiprime; interpreted as a signed 32-bit integer, this value is -1817952719.

The second one, -1521134295, when interpreted as an unsigned 32-bit integer is 2773833001 - and that is a nice large prime!
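Both reinterpretations are easy to verify in C# (a quick snippet; the casts must sit in an unchecked context to compile):

unchecked
{
    Console.WriteLine((int)2477014577);   // prints -1817952719
    Console.WriteLine((uint)-1521134295); // prints 2773833001
}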

Using primes and semiprimes as factors and constants in hash polynomials like this is known to give a better distribution of the resulting values than arbitrary constants.

So it's all about reducing the risk of collisions.

But how bad can it get? Actually, very bad. Here follows a seemingly good-enough implementation, similar to many real-world manual implementations. In fact, I've written a good number of similar ones myself, although hopefully none with results as catastrophic as this:

public class MyTypeA
{
    public MyTypeA(string value) => ValueA = value;

    public string ValueA { get; }

    public override int GetHashCode() => ValueA.GetHashCode();
}

public class MyTypeB
{
    public MyTypeB(string value) => ValueB = value;

    public string ValueB { get; }

    public override int GetHashCode() => ValueB.GetHashCode();
}

public class MyBrokenHashCode
{
    public MyBrokenHashCode(string value)
    {
        A = new MyTypeA(value);
        B = new MyTypeB(value);
    }

    public MyTypeA A { get; }

    public MyTypeB B { get; }

    public override int GetHashCode() => A.GetHashCode() ^ B.GetHashCode();
}

using System;

internal class Program
{
    private static void Main()
    {
        Console.WriteLine($"Hashcode is: {new MyBrokenHashCode("Something").GetHashCode()}");
        Console.WriteLine($"Hashcode is: {new MyBrokenHashCode("Other").GetHashCode()}");
    }
}

The above example produces the following output:

Hashcode is: 0
Hashcode is: 0

That's not good! Two different instances with two different values not only produce the same hash code, they both produce 0! In fact, it's worse than not good; it's potentially catastrophic for performance. The scary thing is that everything will still work and look fine during testing, but if these objects are placed in a Hashtable, a Dictionary or a similar collection, and the number of elements grows large in production, then indexing those collections degenerates into a linear search through the colliding entries.

So what happens?

Two different types happen to generate the same hash code for the same underlying value ("Something" or "Other"). That's actually not that unusual. Then we use XOR to combine the hashes, but XOR has the known weakness that XORing two identical values always results in zero, regardless of what those values are.
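The cancellation, and how the generated multiply-then-add pattern avoids it, can be shown in a few lines (an illustrative snippet; the exact non-zero result will vary):

int h = "Something".GetHashCode();

// XORing two identical values always cancels out to zero:
Console.WriteLine(h ^ h); // 0

// The multiply-then-add pattern scales the first contribution by
// -1521134295 before the second is added, so identical inputs no
// longer cancel:
int hashCode = -1817952719;
hashCode = hashCode * -1521134295 + h;
hashCode = hashCode * -1521134295 + h;
Console.WriteLine(hashCode); // some value, but not systematically 0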

This example is slightly contrived, but it demonstrates that seemingly good-looking code can have subtle pitfalls causing really bad effects.
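To make the performance claim concrete, here's a rough sketch of how the degradation can be observed, reusing MyBrokenHashCode from above (assuming usings for System, System.Collections.Generic and System.Diagnostics; the timing is illustrative only):

var dict = new Dictionary<MyBrokenHashCode, int>();
var keys = new List<MyBrokenHashCode>();

for (int i = 0; i < 10000; i++)
{
    var key = new MyBrokenHashCode(i.ToString());
    keys.Add(key);
    dict[key] = i; // every key hashes to 0, so they all land in the same bucket
}

var stopwatch = Stopwatch.StartNew();
foreach (var key in keys)
{
    _ = dict[key]; // each lookup scans all colliding entries: O(n) instead of O(1)
}
stopwatch.Stop();

Console.WriteLine($"10000 lookups took {stopwatch.ElapsedMilliseconds} ms");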

Conclusion - Trust the tools and use Visual Studio's generation of GetHashCode()! Even if you haven't noticed any problems with your own implementations, do regenerate the code when you have the chance.
