Thursday, April 29, 2021

Weird random constants in GetHashCode()

Random integers in GetHashCode() for C#/.NET

When you ask Visual Studio to generate Equals() and GetHashCode() for C#/.NET, as well as when you inspect other code, you often see addition and multiplication of various seemingly random constants as part of the calculation. Here's an example of how it can look:

public class MyTypeA
{
    public MyTypeA(string value) => ValueA = value;

    public string ValueA { get; }

    public override int GetHashCode() => 250924889 + EqualityComparer<string>.Default.GetHashCode(ValueA);
}

public class MyTypeB
{
    public MyTypeB(string value) => ValueB = value;

    public string ValueB { get; }

    public override int GetHashCode() => -1007312712 + EqualityComparer<string>.Default.GetHashCode(ValueB);
}

public class MyFixedByVsHashCode
{
    public MyFixedByVsHashCode(string value)
    {
        A = new MyTypeA(value);
        B = new MyTypeB(value);
    }

    public MyTypeA A { get; }

    public MyTypeB B { get; }

    public override int GetHashCode()
    {
         int hashCode = -1817952719;
         hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeA>.Default.GetHashCode(A);
         hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeB>.Default.GetHashCode(B);
         return hashCode;
    }
}

The above example was generated using Visual Studio 2019 for .NET Framework and contains a number of strange, seemingly random integer constants: 250924889, -1007312712, -1817952719 and -1521134295. If you generate code for .NET Core or .NET 5 it may look a little different, but under the hood it's similar.
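For reference, on .NET Core and .NET 5 the generator typically leans on the System.HashCode struct instead, which hides the constants and the mixing inside the framework. A minimal sketch of that style:

```csharp
using System;

public class MyTypeA
{
    public MyTypeA(string value) => ValueA = value;

    public string ValueA { get; }

    // HashCode.Combine does the seeding and mixing internally,
    // so no magic constants appear in user code.
    public override int GetHashCode() => HashCode.Combine(ValueA);
}
```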

Executive summary: The reason for these numbers is to reduce the risk of collisions, i.e. the number of situations where two different instances with different values get the same hash code.

So what's up with these magic numbers? First of all: No, they're not random. Let's go through them.

public override int GetHashCode() => 250924889 + EqualityComparer<string>.Default.GetHashCode(ValueA);
public override int GetHashCode() => -1007312712 + EqualityComparer<string>.Default.GetHashCode(ValueB);

These values are derived from the name of the property. Exactly how is not documented and not clear, but it's essentially equivalent to "NameOfProperty".GetHashCode(). The purpose is to mix the name of the property into the equation, reducing the risk that two properties holding the same value get the same hash code.

Then we have the integer constants from the multiple property implementation:

int hashCode = -1817952719;
hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeA>.Default.GetHashCode(A);
hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeB>.Default.GetHashCode(B);

These are fixed, and do not vary. A little bit of analysis shows they are far from random. The first one, -1817952719, is actually the product of two relatively large primes, 16363 * 151379 = 2477014577, and is thus a nice semiprime; interpreting this as a signed 32-bit integer gives -1817952719.

The second one, -1521134295, when interpreted as an unsigned 32-bit integer is 2773833001 - and that is a nice large prime!

Using primes and semiprimes as factors and constants in polynomials has been shown to produce numbers with better distribution than other constants.

So it's all about reducing the risk of collisions.
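The pattern itself is easy to write by hand. Here's a sketch of the same multiply-add combination as a standalone helper (the Combine name and the two field-hash parameters are mine, for illustration):

```csharp
internal static class HashPattern
{
    // Seed and multiplier are the constants from the generated code above.
    public static int Combine(int h1, int h2)
    {
        unchecked // overflow is expected and harmless here
        {
            int hashCode = -1817952719;              // seed (semiprime)
            hashCode = hashCode * -1521134295 + h1;  // mix in first field hash
            hashCode = hashCode * -1521134295 + h2;  // mix in second field hash
            return hashCode;
        }
    }
}
```

Because the multiplication makes the combination order-sensitive, Combine(x, y) and Combine(y, x) generally differ, and identical field hashes no longer cancel each other out.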

But how bad can it get? Actually, very bad... Here follows a seemingly good-enough implementation, similar to many real-world manual implementations. In fact, I've written a good number like it, although hopefully not with results as catastrophic as this.

public class MyTypeA
{
    public MyTypeA(string value) => ValueA = value;

    public string ValueA { get; }

    public override int GetHashCode() => ValueA.GetHashCode();
}

public class MyTypeB
{
    public MyTypeB(string value) => ValueB = value;

    public string ValueB { get; }

    public override int GetHashCode() => ValueB.GetHashCode();
}

public class MyBrokenHashCode
{
    public MyBrokenHashCode(string value)
    {
        A = new MyTypeA(value);
        B = new MyTypeB(value);
    }

    public MyTypeA A { get; }

    public MyTypeB B { get; }

    public override int GetHashCode() => A.GetHashCode() ^ B.GetHashCode();
}

internal class Program
{
    private static void Main()
    {
        Console.WriteLine($"Hashcode is: {new MyBrokenHashCode("Something").GetHashCode()}");
        Console.WriteLine($"Hashcode is: {new MyBrokenHashCode("Other").GetHashCode()}");
    }
}

The above example produces the following output:

Hashcode is: 0
Hashcode is: 0

That's not good! Two different instances with two different values not only produce the same hash code, it's 0! In fact it's worse than not good; it's potentially catastrophic for performance. The scary thing is that everything will still work, and look fine during testing, but if these objects are placed in a Hashtable or Dictionary or similar collection, and in production they grow to a larger number of elements, then indexing those collections degenerates into a linear search through a linked list.

So what happens?

Two different types happen to generate the same hash code for the same underlying value ("Something" or "Other"). That's actually not that unusual. Then we use XOR to combine the hashes, but XOR has the known weakness that XORing identical values always results in zero, regardless of the values.
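The cancellation is easy to demonstrate in isolation; any value XORed with itself is zero, no matter what the value is:

```csharp
using System;

internal class XorDemo
{
    private static void Main()
    {
        int h = "Something".GetHashCode();
        // x ^ x is always 0, regardless of x.
        Console.WriteLine(h ^ h); // prints 0
    }
}
```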

This example is slightly contrived, but it demonstrates that seemingly good-looking code can have subtle pitfalls causing really bad effects.

Conclusion - Trust the tools and use Visual Studio's generation of GetHashCode! Even if you don't notice any problems with your own implementations, do regenerate the code when you have the chance.

Iterations and the squaring factor

The power of 2

I recently found code that was functionally equivalent to the following:
public class Filter
{
    private readonly IEnumerable<string> _old;

    public Filter(IEnumerable<string> old) => _old = old;

    public IEnumerable<string> WhatsNew(IEnumerable<string> updated) => updated.Where(s => !_old.Contains(s));
}

Nice, compact and easily understandable. We keep track of an original list of strings, and we get an updated list we'd like to know what's new.

Or is it, really?

As I mentioned, I found this type of code. But why did I notice it? Because during debugging, the call to the WhatsNew() method took significant time; it was boring to sit there and wait for it to complete!

The problem is that if the two collections are of approximately the same size N, for example if updated contains a single new string, the typical number of calls to the string comparer is N * N / 2.

In other words, the number of operations grows quadratically with the length of the list. This is typically expressed as O(N**2), read as "order of N squared". That it's actually on average divided by 2 doesn't matter for the O() notation; it just means that the number of operations is proportional to N squared.

In the real-world situation, the number of elements was on the order of 20,000. That's not extraordinarily large in any way, but 20,000 * 20,000 / 2 is 200,000,000!

That's 200 million operations! That can take real time, even on a pretty fast machine.

The problem is the lookup in the _old list. We need to enumerate the updated collection one way or another; there's really no way around that, given the assumptions here.

This is where hashtables, and dictionaries and similar collections that use hashtables under the hood, come into play. A lookup using a hashtable is enormously more efficient, and the total time will approach a linear increase rather than a quadratic one. Here's how it could have been (and subsequently was) coded using a HashSet:

public class Filter
{
    private readonly HashSet<string> _old = new HashSet<string>();

    public Filter(IEnumerable<string> old)
    {
        foreach (string value in old)
        {
            _ = _old.Add(value);
        }
    }

    public IEnumerable<string> WhatsNew(IEnumerable<string> updated) => updated.Where(s => !_old.Contains(s));
}

Now, our WhatsNew() method will operate in O(N), i.e. the time taken will be proportional to the number of elements, not to the square of the number of elements! For larger collections, that's a huge gain.

Obviously there are many variations both to the problem and the solution, but the message here is to be aware of the cost of doing effectively nested iterations of large collections.
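As one such variation, LINQ's Enumerable.Except gives the same set-based O(N) behavior in a single expression. A sketch, with one caveat worth knowing:

```csharp
using System.Collections.Generic;
using System.Linq;

public class Filter
{
    private readonly IEnumerable<string> _old;

    public Filter(IEnumerable<string> old) => _old = old;

    // Except hashes _old into a set internally, so lookups are O(1).
    // Caveat: unlike the Where version, Except also removes duplicates
    // from updated.
    public IEnumerable<string> WhatsNew(IEnumerable<string> updated) =>
        updated.Except(_old);
}
```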

This is also one of those examples of things that might not bite you until it's too late and your application is running in the real world. During testing and unit testing, which are usually done with smaller data, all will look well (even if we know we should be testing with realistic data sizes, it often doesn't happen). Then, when it scales up in the real world, performance can deteriorate dramatically and quickly!

This is similar to the old fable of the reward in grains of rice. Doubling the list does not merely double the time taken, as many would expect; it quadruples it, since the time is proportional to the square of the size. It gets progressively worse, quicker and quicker, and can surprisingly fast become a critical problem.

With the updated solution, doubling the list only roughly doubles the time taken, which is much easier to handle and scale with.

Tuesday, April 27, 2021

The strange git love of the command line

Git users' love of the command line

Why does the vast majority of git users love the command line so fervently? And why do I care?

Git is a great tool, but as has been said of C, it's like a sharp knife: excellent in the hands of an expert, but easy to cut your fingers with if you're not.

Using git inexpertly can get you into all kinds of difficult-to-get-out trouble. Most developers' main job is to code, not maintain super-complex repositories with 100s or 1000s of contributors. You'd think they would welcome any tool that made the use of git easier and safer, perhaps forgoing some complex scenarios.

But, sorry to say, nope. Almost all of the developers I know and work with use git from the bash command line. It's also the only thing they use from bash, or any command line. Git has a zillion commands and options, but 9 out of 10 days the only things we do are:
  • Pull updates from the remote.
  • Create feature branches.
  • Commit changes locally.
  • Push changes to the remote.
  • Stash changes temporarily locally.
  • Switch branch.
  • Merge branches.
  • Rebase feature branches.
  • Inspect the commit history of a file.
There are excellent graphical user interface frontends for git, in all environments. I'm on Windows, and I happen to be fond of TortoiseGit, but there are others.

The great thing about these tools is that, for one, they are 100% compatible with each other and with the command line, since they all use the same underlying implementation of git.

Nobody, not even the most hard-core developers, uses a command line editor or a command line debugger for day-to-day development. We're simply more productive, and less prone to mistakes, with graphical full-screen integrated development environments like Visual Studio. There's a reason we're not using TECO any more, even if it was awesome in its time and still is in some ways - I loved it!

The same applies to source code control. It's simply faster, more convenient and less error-prone to use a tool that integrates with your graphical environment be it Visual Studio or Explorer or whatever your local equivalent is.

But for some strange reason, git must suddenly be done from the bash prompt. I try to explain, I try to show, but the command line fixation remains. Command line git users repeatedly get into trouble and have to reset their local repository clones. They frequently take much longer to investigate history, and often just skip things like rebasing the feature branch before merging to the main development branch, making the commit history much more complex. Just because it's too complicated to do from the command line.

Still, they persist.

I get that there are some rare cases where the command line is needed. I get that for batch operations, such as merging a bunch of repositories from, say, the development branch to the main branch, a script using the command line is perfect.

But the command line is not gone just because you use a modern tool for day-to-day use. It's still there when you need it.

Why do young, gifted, competent developers insist on using a tool model that originates in the 1960s, when there are so many better alternatives, and when in all other cases they choose them?

I don't get it. But it is a problem, because it affects productivity both directly in daily use, and indirectly because it tends to leave more complex commit histories to untangle in the future.

Monday, April 26, 2021

Code should work every time

The Philosophy of Black and White

I have a background in real time operating system kernels, compilers and encryption.

In all of these contexts, it is self-evident that a piece of code either works every time, or else it's broken and it must be fixed immediately.

Code that works most of the time, or even almost every time just won't cut it.

If there are special circumstances, during heavy load, or during startup or shutdown, or just with bad enough "luck", where the software does not work as expected, it needs fixing. Perhaps it controls heavy machinery, compiles your code or encrypts your data. Right?

Would you feel comfortable with a compiler that emitted faulty code every now and then, depending on how heavy the system load was? Of course not.

I have taken this view to heart, and in my mind there are really only two types of software:
  1. Software that behaves consistently every time, regardless of load or timing.
  2. Broken software.
(Software of type 1 can of course still be broken in other ways - but it should then be consistently broken!)

Here I'd like to argue that this is just as applicable to a national informational website, a booking system, or even a game.

Software should behave consistently every time, else it's broken.

Unfortunately it's often hard to achieve, especially as so much code today runs in a web server environment, which is an inherently multi-threaded environment with a high degree of parallelism. Most issues with software that behaves inconsistently come from parallelism, and race conditions.

A race condition is when two different actors invoke code at the same time, and the outcome depends on who wins the race.

There are numerous mechanisms, and even whole books, dedicated to solving these problems.
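As a minimal illustration of both the problem and one of those mechanisms (the counters and iteration count here are arbitrary, chosen just to make the race likely to show):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

internal class RaceDemo
{
    private static int _unsafeCount;
    private static int _safeCount;

    private static void Main()
    {
        // Many "actors" racing to update two shared counters.
        Parallel.For(0, 1_000_000, _ =>
        {
            _unsafeCount++;                         // read-modify-write, not atomic
            Interlocked.Increment(ref _safeCount);  // atomic
        });

        // _unsafeCount is often less than 1,000,000 because concurrent
        // increments overwrite each other; _safeCount is always exact.
        Console.WriteLine($"unsafe: {_unsafeCount}, safe: {_safeCount}");
    }
}
```

The plain increment is type-2 software: it works most of the time, which is exactly the problem.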

My purpose here is not to talk about all those ways, but rather to argue that as developers we should genuinely care that our code works consistently, every time.

In real life, I often have to argue about these matters, and I'm not infrequently met with the basic opinion "well, it really doesn't happen that often and it looks like hard work to fix so we'll live with it".

The problem is that these things tend to grow worse far faster than linearly with increased load. So, if you're working on something that expects fewer and fewer users and lower and lower load, well, maybe you can get away with "it's not worth fixing".

If you're like most of us, working on something that expects increased load and continued development, remember that bugs do not age well. They just get worse and worse, until it's really bad.

So, while perfectionism over all is not always a winning strategy, for consistency under load always strive for perfection. Otherwise you're building software broken by design.

For these matters, it really is black or white. Either the code works every time, or it's broken.

Friday, April 23, 2021

Using a search index as a database is a bad idea

The case

I am currently working on a project where we are using a CMS product, in conjunction with a search service based on Elasticsearch, as well as some back-end APIs.

Elasticsearch is an incredibly powerful search service, and you can do almost anything with it.

But should you?

What we did

With a CMS that does not support storing arbitrary data particularly well, it is tempting to look for creative alternatives. It's always a big step, for example, to add support for an O/R mapper or any other custom table or database, and not necessarily a good idea if it can be avoided.

In this situation we hit upon the idea of using the search index to store data fetched from a back-end API. After all, it can serialize and index just about any .NET type - so why not add some data carrying properties to our custom index object?

I had a bad feeling about this; my spidey-sense started tingling... I was thinking that an index is something fairly approximate, intended to make it as easy as possible to find data. Not to store it. Hmm...

In the team I tried to argue along these lines, but I had no luck. So off we went, and started storing more than searchable text and back-references to the actual data.

What happened

So now we're in trouble. Not really deep trouble as yet, but as it turns out, it's just not a good idea. We're getting inconsistent states, and the code can't trust what it sees. It works, sort of, kind of, most of the time, but...

The problem is basically that when you store your data in a database or a file, you expect consistent and reproducible behavior, every time. If you don't get that, the assumption is that something is broken.

When you store your data in a search index, this just does not apply, here are some of the reasons:

  • The index never promised to give a consistent view! Two reads can give different results; in Elasticsearch, for example, the concept of shards can cause this behavior.
  • The index never promised that a write is immediately or deterministically reflected in a subsequent read. This is due both to caching and to queueing behavior in the index, since the assumption is that you're requesting an index update that you'd like to be effective as soon as possible, but it's not guaranteed to be immediate.
  • The index has a rate limit; it's perfectly OK for it to say that it's too busy, since the assumption is that at worst you lose an index update, not the underlying data. With a database or a file, if that happens, it's a fatal error situation and you simply have to gear up.
Specifically, we're now in a situation where our code won't always work, mostly when we're "too fast". If we wait a few minutes between writing and expecting to read the data back, it'll often, but not always, work.
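If you're stuck with this design anyway, the symptom can at best be papered over with a retry loop. A sketch (the retrieve delegate is a placeholder for whatever index query you make; this is a workaround, not a fix):

```csharp
using System;
using System.Threading;

internal static class IndexReadBack
{
    // Polls until the written document becomes visible in the index,
    // or gives up. The index still gives no consistency guarantee.
    public static T WaitForReadBack<T>(Func<T> retrieve, int maxAttempts = 10)
        where T : class
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            T result = retrieve();
            if (result != null)
            {
                return result;
            }
            Thread.Sleep(500); // index updates are queued; give them time
        }
        throw new TimeoutException("Document never became visible in the index.");
    }
}
```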

Conclusion

Don't use a search index as a data store. Even if it looks cool and easy, don't. I don't know exactly what problems you'll run into, but I'll bet a beer or soda that it'll cause you some unexpected grief.