Monday, May 10, 2021

ConcurrentDictionary.GetOrAdd() may not work as you think!

It's concurrent - not lazy

We had a problem with random crashes on a customer's website. It wasn't catastrophic, the site would get going again, but it kept showing up in the logs, and we want every page visit to work.

A bit of investigation with an IL decompiler followed (which, by the way, is absolutely the best thing since sliced bread!), and I found code equivalent to the following in a third-party vendor's product:
private readonly ConcurrentDictionary<string, Type> _dictionary
    = new ConcurrentDictionary<string, Type>();

private readonly object _lock = new object();

public Type GetType(string typename)
{
    return _dictionary.GetOrAdd(typename,
        (t) =>
        {
            lock (_lock)
            {
                return DynamicallyGenerateType(t);
            }
        });
}
The thing here is that DynamicallyGenerateType() can only be called once per typename, since what it does is emit code into an assembly, and if you do that twice you get a System.ArgumentException: Duplicate type name within an assembly.

No one wants that, so the author thought it would be cool to use a ConcurrentDictionary<TKey, TValue>, since the GetOrAdd() method guarantees that it will either get an existing value from the dictionary, or add a new one using the provided value factory, and then return the value.

It looks good, reasonable, and works almost all of the time. Key word here is: almost.

What the concurrent dictionary does is use efficient, light-weight locking to ensure that the dictionary can be concurrently accessed and updated in a consistent and thread-safe manner.

It does not guarantee a single one-time lazy call to the value factory used to add a value if it's not in the dictionary.

Sometimes, under heavy initial load, the value factory passed as the second argument to GetOrAdd() will be called twice (or more). What the concurrent dictionary guarantees is that the value for the provided key will only be set once, but the value factory may be called multiple times with the result thrown away for all calls except the race-winning one!

This is clearly documented but it's easy to miss. This is not a case of the implementation not being thread safe, as is stated in some places. It's very thread safe. But the value factory may indeed be called multiple times!
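To see it for yourself, here's a minimal, contrived repro (the class and names are mine, not from the vendor's code). When several threads hit the same missing key at once, it will typically report the value factory running more than once:

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

internal static class GetOrAddRaceDemo
{
    internal static void Run()
    {
        var dictionary = new ConcurrentDictionary<string, int>();
        int factoryCalls = 0;

        // Hammer the same missing key from several threads at once.
        Parallel.For(0, 8, _ =>
            dictionary.GetOrAdd("key", key =>
            {
                Interlocked.Increment(ref factoryCalls);
                Thread.Sleep(10); // widen the race window a bit
                return 42;
            }));

        // The stored value is only set once, but the factory has typically
        // been called more than once by now.
        Console.WriteLine($"Factory calls: {factoryCalls}, value: {dictionary["key"]}");
    }
}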

To fix it, add a Lazy layer on top, because Lazy<T> by default does guarantee that its value factory is only called once!
private readonly ConcurrentDictionary<string, Lazy<Type>> _dictionary
    = new ConcurrentDictionary<string, Lazy<Type>>();

private readonly object _lock = new object();

public Type AddType(string typename)
{
    return _dictionary.GetOrAdd(typename,
        (t) =>
        {
            lock (_lock)
            {
                return DynamicallyGenerateTypeLazy(t);
            }
        }).Value;
}

Now, although we may instantiate more than one Lazy<Type> instance, and then throw it away if we lose the GetOrAdd race, that's a minor problem and it works as it should.

Please note that this is only true as long as you use the default LazyThreadSafetyMode.ExecutionAndPublication mode.

The additional lock may look confusing, but it was in the original code and makes sense in this context: while the concurrent dictionary and the Lazy layer guarantee that the type for a given 'typename' is only generated once, they do not guarantee that multiple threads don't generate types for different type names concurrently, and that could wreak havoc with the shared assembly the code is emitted into.
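The decompiled code didn't show DynamicallyGenerateTypeLazy(), but presumably it's just a thin wrapper along these lines (an assumption on my part), so that the expensive generation is deferred into the Lazy value factory:

// Assumed shape of the helper: the actual code generation runs inside the
// Lazy value factory, which with the default ExecutionAndPublication mode
// executes at most once per Lazy<Type> instance.
private Lazy<Type> DynamicallyGenerateTypeLazy(string typename)
    => new Lazy<Type>(() => DynamicallyGenerateType(typename));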

Tuesday, May 4, 2021

SSH with TortoiseGit and Bitbucket or GitHub on Windows 10

Memo to self

It's always complicated to remember the steps necessary to get SSH working, and there are some idiosyncrasies as well. This guide may help you, I'm sure it'll help me the next time I need to do this myself.

Password-based login with HTTPS is becoming obsolete, and it's less secure. Also, with the nice SSH agent in Windows 10, you only need to enter the password once - ever.

Generate a key pair

Open a command prompt and run the ssh-keygen command, to generate a private and a public key file. Accept the defaults.

Enter a strong password for the private key file when asked, and ensure that you store it securely in your password manager.

This should create files in %USERPROFILE%\.ssh named id_rsa (the private key file) and id_rsa.pub (the public key file).


Enable and start the OpenSSH Authentication Agent Service

Nowadays it is shipped with Windows 10, but it's not enabled by default. So open the Services app and ensure the service is set to start automatically, and that it's running.


Add the private key to the SSH Authentication Agent

In the command prompt, type ssh-add with no arguments. It should pick up the default SSH key id_rsa, and ask for the password you entered previously.

(If you get the error message "Can't add keys to ssh-agent, communication with agent failed", there seems to be an issue with certain Windows distributions. For whatever reason, the following workaround appears to work. Open a new command prompt, this time elevated with Run As Administrator, and type:

    sc.exe create sshd binPath=C:\Windows\System32\OpenSSH\ssh.exe

Then exit the elevated command prompt and try the ssh-add again in your normal command prompt.)


Save the public key to Bitbucket...

Open the file %USERPROFILE%\.ssh\id_rsa.pub in Notepad, Select All (Ctrl-A) and Copy (Ctrl-C). Paste it into this dialog in Bitbucket, under Your Profile and Settings -> Personal Settings -> SSH keys -> Add key:


The Label is just anything that makes it easy for you to remember what key it is. Perhaps today's date and the name of the computer you have the private key on is a good start. Or just "My public SSH key" works too.


...and/or save the public key to GitHub

Go to Settings -> SSH keys -> New SSH key


The Title has the same meaning as Label for Bitbucket, see above.


Remove any Credential Helpers

Git credential helpers may conflict with the use of SSH keys, and there is no need for them anyway, so remove them from TortoiseGit in the Settings -> Git -> Credential menu so it looks like this:



Tell Git where to find SSH

Set the environment variable GIT_SSH to C:\Windows\System32\OpenSSH\ssh.exe. Right-click "This PC" -> Properties -> Advanced system settings -> Environment Variables... -> New...


Restart Explorer (Task Manager -> Details -> explorer.exe, right-click -> End Task, then File -> Run new Task -> Open: explorer -> OK), or log out and log in again, or restart your computer.


Update origin URL to use SSH

Finally, update your repo's origin to use SSH instead of HTTPS. The easiest way is to copy the part after 'git clone' from the Bitbucket "Clone" feature.


Click the "Clone" button, select SSH, and then copy the URL part of the suggested git clone command and paste it into TortoiseGit's Remote origin for the repo:



Done! Now you can enjoy password-less use of git with Bitbucket and/or GitHub.

Monday, May 3, 2021

Thinking about cacheability

Performance, Cacheability and Thinking Ahead

Although it's very true that "premature optimization is the root of all evil" (Hoare/Knuth), this should never be taken to mean that writing inefficient code is a good thing. Nor does it mean that we should ignore the possible need for future optimizations. As with all things, it's a question of balance.

When designing APIs, for example, just how you define them may affect even the possibility of future optimizations, specifically caching. Although caching is no silver bullet, it is often the single most effective measure to increase performance, so it's definitely a very important tool in the optimization toolbox.

Let's imagine a search API, that searches for locations of gas stations within a given distance from a given point, and also filters on the brand of the station.

(In a real-life application, the actual search terms will of course be more complex, so let's assume that using an underlying search service really is relevant, which is perhaps not necessarily the case in this literal example. Also, don't worry about the use of static methods and classes in the sample code, it's just for show.)

[Route("v1/[controller]")]
public class GasStationV1Controller : ControllerBase
{
    [HttpGet]
    public IEnumerable<GasStation> Search(string brand, double lat, double lon, double distance)
    {
        return SearchService.IndexSearch(brand, lat, lon, distance);
    }
}

We're exposing a REST API that delegates the real work to a fictitious search service, accessing an index and providing results based on the search parameters, possibly adding relevance, sponsoring and other soft parameters to the search. That's not important here.

What is important is that we've decided to let the search index handle the geo location part of the search as well, so we're indexing locations and letting the search index handle distance and nearness calculations, which on the surface of things appears to make sense. The less we do in our own code, and the more we can delegate, the better!

But unfortunately it turns out this is a little too slow, and we're also overloading the back-end search service, which has rate limiting as well as a per-call pricing schedule, so it's expensive too. What to do? The obvious thing is to cache. Said and done.

[Route("v2/[controller]")]
public class GasStationV2Controller : ControllerBase
{
    [HttpGet]
    public IEnumerable<GasStation> CachedSearch(string brand, double lat, double lon, double distance)
    {
        string cacheKey = $"{brand}-{lat}-{lon}-{distance}";
        return ObjectCache.GetOrAdd(cacheKey, () => SearchService.IndexSearch(brand, lat, lon, distance));
    }
}
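ObjectCache here is just a stand-in for whatever caching helper you happen to have. A minimal sketch on top of System.Runtime.Caching.MemoryCache could look something like this (the five-minute absolute expiration is an arbitrary assumption):

using System;
using System.Runtime.Caching;

public static class ObjectCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static T GetOrAdd<T>(string key, Func<T> factory)
    {
        // Return a cached value if we have one.
        if (Cache.Get(key) is T cached)
        {
            return cached;
        }

        // Otherwise produce it, cache it for a while and return it.
        T created = factory();
        Cache.Set(key, created, new CacheItemPolicy
        {
            AbsoluteExpiration = DateTimeOffset.UtcNow.AddMinutes(5),
        });
        return created;
    }
}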

Now we're using our awesome ObjectCache to either get it from the cache, or if need be, call the back end service. All set, right?

Not quite.

The location that we're looking to find near matches to is essentially where the user is, which means there'll be quite a bit of variation of the search parameters. In fact, there is very little chance that anything in the cache will ever be re-used. The net effect of our caching layer is just to fill server memory. We're not reducing the back end search service load, and we're not speeding anything up for anyone.

The thing to consider here is that when we're designing an API that has the potential of becoming a bottleneck in one way or another, we should consider making it possible to add a caching layer later, even if we don't add one to begin with (remember that thing about premature optimizations).

Avoid designing low-level APIs that take essentially open-ended parameters, i.e. parameters that have effectively infinite variation, where the same set of parameters is very seldom used twice. It's not always possible, it depends on the situation, but consider it.

As it turns out, our only option was to redesign what we use the search index for, and move some functionality into our own application. This is often a memory/time tradeoff, but in this case keeping up to 100,000 gas stations in memory is not a problem, and filtering them in memory in the web server is an excellent option.

This is how it looks now, and although we're obliged to do some more work on our own, we'll be fast, and we're offloading the limited and expensive search service quite a bit.

[Route("v3/[controller]")]
public class GasStationV3Controller : ControllerBase
{
    [HttpGet]
    public IEnumerable<GasStation> Search(string brand, double lat, double lon, double distance)
    {
        string cacheKey = $"{brand}";
        return ObjectCache.GetOrAdd(cacheKey, () => SearchService.IndexSearch(brand))
            .Select(g => g.SetDistance(lat, lon))
            .Where(g => g.Distance <= distance)
            .OrderBy(g => g.Distance);
    }
}

Now we have a much more manageable set of search parameters to cache, and we can still serve the user rapidly, without overloading the search service or our budget.

Taking this a step further, we'd consider moving this logic to the client, if that's reasonable, since then we can even let the HTTP response itself become cacheable, which can further increase scalability and speed for the users.
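In ASP.NET Core, a first step in that direction could be a hypothetical v4 endpoint that only serves the per-brand list and marks the HTTP response as cacheable, leaving the distance filtering to the client. The attribute values below are assumptions, and VaryByQueryKeys also requires the response caching middleware to be enabled:

[Route("v4/[controller]")]
public class GasStationV4Controller : ControllerBase
{
    [HttpGet]
    [ResponseCache(Duration = 600, VaryByQueryKeys = new[] { "brand" })]
    public IEnumerable<GasStation> Search(string brand)
    {
        // Only the brand varies, so both the in-memory cache and the HTTP
        // response cache get a small, highly reusable set of keys.
        return ObjectCache.GetOrAdd($"{brand}", () => SearchService.IndexSearch(brand));
    }
}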

In the end performance is always about compromises, but the lesson learned here is that even if we don't (think) we need optimization and caching at the start, we should at least consider leaving the path for it open.

Thursday, April 29, 2021

Weird random constants in GetHashCode()

Random integers in GetHashCode() for C#/.NET

When you ask Visual Studio to generate Equals() and GetHashCode() for C#/.NET, as well as when you inspect other code, you often see addition and multiplication of various seemingly random constants as part of the calculation. Here's an example of how it can look:

public class MyTypeA
{
    public MyTypeA(string value) => ValueA = value;

    public string ValueA { get; }

    public override int GetHashCode() => 250924889 + EqualityComparer<string>.Default.GetHashCode(ValueA);
}

public class MyTypeB
{
    public MyTypeB(string value) => ValueB = value;

    public string ValueB { get; }

    public override int GetHashCode() => -1007312712 + EqualityComparer<string>.Default.GetHashCode(ValueB);
}

public class MyFixedByVsHashCode
{
    public MyFixedByVsHashCode(string value)
    {
        A = new MyTypeA(value);
        B = new MyTypeB(value);
    }

    public MyTypeA A { get; }

    public MyTypeB B { get; }

    public override int GetHashCode()
    {
         int hashCode = -1817952719;
         hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeA>.Default.GetHashCode(A);
         hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeB>.Default.GetHashCode(B);
         return hashCode;
    }
}

The above example was generated using Visual Studio 2019 for .NET Framework and contains a number of seemingly random strange integer constants: 250924889, -1007312712, -1817952719 and -1521134295. If you generate code for .NET Core or .NET 5 it may look a little different, but under the hood it's similar.

Executive summary: The reason for these numbers is to reduce the risk of collisions, i.e. the number of situations where two different instances with different values get the same hash code.

So what's up with these magic numbers? First of all: No, they're not random. Let's go through them.

public override int GetHashCode() => 250924889 + EqualityComparer<string>.Default.GetHashCode(ValueA);
public override int GetHashCode() => -1007312712 + EqualityComparer<string>.Default.GetHashCode(ValueB);

These values are derived from the name of the property. Exactly how is not documented and not clear, but it's essentially equivalent to "NameOfProperty".GetHashCode(). The purpose is to add the name of the property to the equation, reducing the risk that two different properties with the same value get the same hash code.

Then we have the integer constants from the multiple property implementation:

int hashCode = -1817952719;
hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeA>.Default.GetHashCode(A);
hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeB>.Default.GetHashCode(B);

These are fixed, and do not vary. A little bit of analysis shows they are far from random. The first one, -1817952719, is actually the product of two relatively large primes, 16363 * 151379 = 2477014577, and thus a nice semiprime; when this is interpreted as a signed 32-bit integer we get -1817952719.

The second one, -1521134295, when interpreted as an unsigned 32-bit integer is 2773833001 - and that is a nice large prime!

Using primes and semiprimes as factors and constants in polynomials has been shown to produce numbers with better distribution than other constants.

So it's all about reducing the risk of collisions.

But how bad can it get? Actually, very bad... Here follows a seemingly good-enough implementation that is similar to many real-world manual implementations. In fact, I've written a good number of similar ones, although hopefully not with as catastrophic a result as this.

public class MyTypeA
{
    public MyTypeA(string value) => ValueA = value;

    public string ValueA { get; }

    public override int GetHashCode() => ValueA.GetHashCode();
}

public class MyTypeB
{
    public MyTypeB(string value) => ValueB = value;

    public string ValueB { get; }

    public override int GetHashCode() => ValueB.GetHashCode();
}

public class MyBrokenHashCode
{
    public MyBrokenHashCode(string value)
    {
        A = new MyTypeA(value);
        B = new MyTypeB(value);
    }

    public MyTypeA A { get; }

    public MyTypeB B { get; }

    public override int GetHashCode() => A.GetHashCode() ^ B.GetHashCode();
}

internal class Program
{
    private static void Main()
    {
        Console.WriteLine($"Hashcode is: {new MyBrokenHashCode("Something").GetHashCode()}");
        Console.WriteLine($"Hashcode is: {new MyBrokenHashCode("Other").GetHashCode()}");
    }
}

The above example produces the following output:

Hashcode is: 0
Hashcode is: 0

That's not good! Two different instances with two different values not only produce the same hash code, it's 0! In fact it's worse than not good, it's potentially catastrophic for performance. The scary thing is, everything will still work and look fine during testing, but if these objects are placed in a Hashtable or Dictionary or similar, and in production they grow to a larger number of elements, then indexing these collections degenerates into linear searches in a linked list.

So what happens?

Two different types happen to generate the same hash code for the same underlying value ("Something" or "Other"). That's actually not that unusual. Then we use XOR to combine the hashes, but XOR has the known weakness that XORing two identical values always results in zero, regardless of the values.

This example is slightly contrived, but it demonstrates that seemingly good-looking code can have subtle pitfalls causing really bad effects.

Conclusion - Trust the tools and use Visual Studio's generation of GetHashCode()! Even if you don't notice any problems with your own implementations, do regenerate the code when you have the chance.
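If you're on .NET Core or .NET 5 (or pull in the Microsoft.Bcl.HashCode package on .NET Framework), there's also System.HashCode, which picks good mixing constants for you. The broken class above could then be written, for example, as:

public class MyCombinedHashCode
{
    public MyCombinedHashCode(string value)
    {
        A = new MyTypeA(value);
        B = new MyTypeB(value);
    }

    public MyTypeA A { get; }

    public MyTypeB B { get; }

    // HashCode.Combine mixes the component hashes properly, so identical
    // component hashes no longer cancel each other out to zero.
    public override int GetHashCode() => HashCode.Combine(A, B);
}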

Iterations and the squaring factor

The power of 2

I recently found code that was functionally equivalent to the following:
public class Filter
{
    private readonly IEnumerable<string> _old;

    public Filter(IEnumerable<string> old) => _old = old;

    public IEnumerable<string> WhatsNew(IEnumerable<string> updated) => updated.Where(s => !_old.Contains(s));
}

Nice, compact and easily understandable. We keep track of an original list of strings, and when we get an updated list we'd like to know what's new in it.

Or is it, really?

As I mentioned, I found this type of code, but why did I notice it? Because during debugging, the call to the WhatsNew() method took significant time; it was boring to sit there and wait for it to complete!

The problem is that if the two collections are of approximately the same size, for example if updated contains just a single new string, the typical number of calls to the string comparer is N * N / 2, where N is the number of elements in _old.

In other words, the number of operations grows quadratically with the length of the list. This is typically expressed as O(N**2), read as "order of N squared". That it's actually on average divided by 2 doesn't matter for the O() notation; it just means that the number of operations is proportional to N squared.

In the real-world situation, the number of elements was on the order of 20,000. That's not extraordinarily large in any way, but 20,000 * 20,000 / 2 is 200,000,000!

That's 200 million operations! That can take real time even on a pretty fast machine.

The problem is the lookup in the _old list. We need to enumerate the updated one in one way, no way really to get around that, given the assumptions here.

This is where hash tables, or dictionaries (which use hash tables under the hood), and similar collections come into play. A lookup in a hash table is enormously more efficient, and the time taken will approach a linear increase rather than a quadratic one. Here's how it could have been (and subsequently was) coded using a HashSet:

public class Filter
{
    private readonly HashSet<string> _old = new HashSet<string>();

    public Filter(IEnumerable<string> old)
    {
        foreach (string value in old)
        {
            _ = _old.Add(value);
        }
    }

    public IEnumerable<string> WhatsNew(IList<string> updated) => updated.Where(s => !_old.Contains(s));
}

Now our WhatsNew() method will operate in O(N), i.e. the time taken will be proportional to the number of elements, not to the square of the number of elements! For larger collections, that's a huge gain.

Obviously there are many variations both to the problem and the solution, but the message here is to be aware of the cost of doing effectively nested iterations of large collections.
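If you want to see the effect for yourself, a rough sketch like this makes the difference very visible (the timings are machine-dependent and the data is made up, but the shape of the result is not):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

internal static class LookupCostDemo
{
    internal static void Run()
    {
        List<string> items = Enumerable.Range(0, 20_000).Select(i => $"item-{i}").ToList();
        var asList = new List<string>(items);
        var asSet = new HashSet<string>(items);

        var stopwatch = Stopwatch.StartNew();
        int inList = items.Count(s => asList.Contains(s));   // O(N) per lookup, O(N^2) in total
        Console.WriteLine($"List<string>:    {stopwatch.ElapsedMilliseconds} ms ({inList} found)");

        stopwatch.Restart();
        int inSet = items.Count(s => asSet.Contains(s));     // O(1) per lookup, O(N) in total
        Console.WriteLine($"HashSet<string>: {stopwatch.ElapsedMilliseconds} ms ({inSet} found)");
    }
}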

This is also one of those things that might not bite you until it's too late and your application is running in the real world. During testing and unit testing, which are usually done with smaller data sets, all will look well (even if we know we should be testing with expected data sizes, it often doesn't happen). Then, when it scales up in the real world, performance can deteriorate dramatically and quickly!

This is similar to the old fable of the reward in grains of rice. Doubling the list does not halve performance, as many would expect; it decreases performance in proportion to the square of the increase. It gets progressively worse, quicker and quicker, and can surprisingly fast become a critical problem.

With the updated solution, doubling the list will decrease performance by roughly half, which is much easier to handle and scale with.

Tuesday, April 27, 2021

The strange git love of the command line

Git users' love of the command line

Why do the vast majority of git users love the command line so fervently? And why do I care?

Git is a great tool, but as has been said of C, it's like a sharp knife: excellent in the hands of an expert, but easy to cut your fingers with if you're not.

Using git inexpertly can get you into all kinds of difficult-to-get-out-of trouble. Most developers' main job is to code, not to maintain super-complex repositories with hundreds or thousands of contributors. You'd think they would welcome any tool that made the use of git easier and safer, perhaps forgoing some complex scenarios.

But, sorry to say, nope. Almost all of the developers I know and work with use git from the bash command line. It's also the only thing they use from bash, or from any command line. Git has a zillion commands and options, but 9 days out of 10 the only things we do are:
  • Pull updates from the remote.
  • Create feature branches.
  • Commit changes locally.
  • Push changes to the remote.
  • Stash changes temporarily locally.
  • Switch branch.
  • Merge branches.
  • Rebase feature branches.
  • Inspect the commit history of a file.
There are excellent graphical user interface frontends for git, in all environments. I'm on Windows, and I happen to be fond of TortoiseGit, but there are others.

The great thing about these tools is that, for one, they are 100% compatible with each other and with the command line, since they all use the same underlying implementation of git.

Nobody, not even the most hard-core developers, uses a command-line editor or a command-line debugger for day-to-day development. We're simply more productive, and less prone to mistakes, with graphical full-screen integrated development environments like Visual Studio. There's a reason we're not using TECO any more, even if it was awesome in its time and still is in some ways - I loved it!

The same applies to source code control. It's simply faster, more convenient and less error-prone to use a tool that integrates with your graphical environment be it Visual Studio or Explorer or whatever your local equivalent is.

But, for some strange reason, for git it suddenly must be done from the bash prompt. I try to explain, I try to show, but the command-line fixation remains. Command-line git users repeatedly get into trouble and have to reset their local repository clones. They frequently take much longer to investigate history, and often simply skip things like rebasing the feature branch before merging to the main development branch, making the commit history much more complex, just because it's too complicated to do from the command line.

Still, they persist.

I get that there are some rare cases where the command line is needed. I get that for batch operations, such as merging the development branch to the main branch across a bunch of repositories, a script using the command line is perfect.

But the command line is not gone just because you use a modern tool for day-to-day use. It's still there when you need it.

Why do young, gifted, competent developers insist on using a tool model that originates in the 1960s when there are so many better alternatives, which they happily use in every other case?

I don't get it. But it is a problem, because it affects productivity both directly in daily use and indirectly because it tends to cause more complex commit histories to untangle in the future.

Monday, April 26, 2021

Code should work every time

The Philosophy of Black and White

I have a background in real time operating system kernels, compilers and encryption.

In all of these contexts, it is self-evident that a piece of code either works every time, or else it's broken and it must be fixed immediately.

Code that works most of the time, or even almost every time just won't cut it.

If there are special circumstances, during heavy load, during startup or shutdown, or just with bad enough "luck", under which the software does not work as expected, it needs fixing. Perhaps it controls heavy machinery, compiles your code or encrypts your data. Right?

Would you feel comfortable with a compiler that emitted faulty code every now and then, depending on how heavy the system load was? Of course not.

I have taken this view to heart, and in my mind there are really only two types of software:
  1. Software that behaves consistently every time, regardless of load or timing.
  2. Broken software.
(Software of type 1 can of course still be broken in other ways - but it should then be consistently broken!)

Here I'd like to argue that this is just as applicable to a national informational website, a booking system, or even a game.

Software should behave consistently every time, else it's broken.

Unfortunately it's often hard to achieve, especially as so much code today runs in a web server environment, which is inherently multi-threaded with a high degree of parallelism. Most issues with software that behaves inconsistently come from parallelism and race conditions.

A race condition is when two different actors invoke code at the same time, and the outcome depends on who wins the race.
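The classic minimal example (a sketch, not from any of the systems discussed here) is two threads doing unsynchronized read-modify-write on a shared counter; the result depends entirely on how the threads happen to interleave:

using System;
using System.Threading.Tasks;

internal static class RaceConditionDemo
{
    internal static void Run()
    {
        int counter = 0;

        // Two "actors" incrementing the same counter without synchronization.
        Parallel.For(0, 2, _ =>
        {
            for (int i = 0; i < 1_000_000; i++)
            {
                counter++; // read-modify-write, not atomic
            }
        });

        // Rarely prints 2000000 - the outcome depends on who wins each race.
        Console.WriteLine(counter);
    }
}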

There are numerous mechanisms, and even whole books, dedicated to solving these problems.

My purpose here is not to talk about all those ways, but rather to argue that it's important for us as developers to genuinely care that our code works consistently every time.

In real life, I often have to argue about these matters, and I'm not infrequently met with the basic opinion "well, it really doesn't happen that often and it looks like hard work to fix so we'll live with it".

The problem is that these things tend to grow worse exponentially with increased load. So, if you're working on something that is expecting fewer and fewer users and lower and lower load, well maybe you can get away with "it's not worth fixing".

If you're like most of us, working on something that expects increased load and continued development, remember that bugs do not age well. They just get worse and worse, until it's really bad.

So, while perfectionism over all is not always a winning strategy, for consistency under load always strive for perfection. Otherwise you're building software broken by design.

For these matters, it really is black or white. Either the code works every time, or it's broken.

Friday, April 23, 2021

Using a search index as a database is a bad idea

The case

I am currently working on a project where we are using a CMS product, in conjunction with a search service based on Elasticsearch, as well as some back-end APIs.

Elasticsearch is an incredibly powerful search service, and you can do almost anything with it.

But should you?

What we did

With a CMS that does not support storing arbitrary data particularly well, it is tempting to look for creative alternatives. It's always a big step, for example, to add support for an O/R mapper or any other custom table or database. It's not necessarily a good idea, if it can be avoided.

In this situation we hit upon the idea of using the search index to store data fetched from a back-end API. After all, it can serialize and index just about any .NET type - so why not add some data carrying properties to our custom index object?

I had a bad feeling about this, my spidey-sense started tingling... I was thinking that an index is something fairly approximate, intended to make data findable as well as possible. Not to store it. Hmm...

In the team I tried to argue along these lines, but I had no luck. So off we went, and started storing more than just searchable text and back-references to the actual data.

What happened

So now we're in trouble. Not really deep trouble as yet, but it's just not a good idea as it turns out. We're getting inconsistent states and the code can't trust what it sees. It works, sort of, kind of, most of the time but...

The problem is basically that when you store your data in a database or a file, you expect consistent and reproducible behavior, every time. If you don't get that, the assumption is that something is broken.

When you store your data in a search index, this just does not apply, here are some of the reasons:

  • The index never promised to give a consistent view! Two reads can give different results; in Elasticsearch there is for example the concept of shards that can cause this behavior.
  • The index never promised that a write is immediately or deterministically reflected in a subsequent read. This is due both to caching and to queueing behavior in the index; the assumption is that you're requesting an index update that you'd like to become effective as soon as possible, but it's not guaranteed to be immediate.
  • The index is rate limited, and it's perfectly OK for it to say that it's too busy, since the assumption is that at worst you lose an index update, not the data itself. With a database or a file, a rejected write is a fatal error, and if it happens you simply have to gear up.
Specifically, we're now in a situation where our code won't always work, mostly when we're "too fast". If we wait a few minutes between writing and expecting to read the data back, it'll often, but not always, work.

Conclusion

Don't use a search index as a data store. Even if it looks cool and easy, don't. I don't know exactly what problems you'll run into, but I'll bet a beer or soda that it'll cause you some unexpected grief.