Wednesday, May 15, 2024

Localizing a .NET console or desktop application

Having gone through countless different solutions to the ever present problem of localizing .NET applications, I think I've finally found a good solution.

The main driver this time was the need to provide a convenient way to translate texts used in the Xecrets Ez desktop app for easier file encryption. This is developed with .NET 8 and Avalonia UI, which apart from C# uses it's own XAML format, AXAML with a need to reference translatable texts there too.

.NET has of course since forever supported localization, sort of, through the resource compiler, .resx files and satellite assemblies.

The problem has always been, and remains to this day, how do I get the actual translations done? Translation is often not done by the original developer, and often by people who definitely don't want to fire up Visual Studio and the resource editor. Never mind that it doesn't even have a proper translation view, with a source language and target language in the same view.

There have been various attempts to make resx-editors for translators, but they were typically Windows-only desktop applications. Translators often like apples.

Since maybe 10 years the number of web based translation services have been increasing year by year, but the support for the resx file format is often spotty or limited, or non-existing.

I've long been looking at somehow leveraging the huge amount of services and software that use the .po format for texts, all based on the gettext utilities and specifications. However, the old C-style extraction of text from the source has never appealed to me, and also requires a lot of parsing of C#/.NET code that is just bound to go wrong. Not to mention various XAML dialects that might also need parsing for the gettext extraction process to work.

Then when looking for a library to work with .po files, thinking to roll my own somehow, deep inside a related sample project I found POTools!

This little hidden gem can do a gettext-style extraction, but from a .resx file! Now things started to fall into place.

Doing the extraction, or discovery if you like, of translatable strings from a structured source like a .resx file is easy, solid and dependable - and fits the general development model of .NET better than wrapping all texts in T() method calls, similar to the original C-style gettext mode of operations. It also enables easy re-use of strings in multiple places without any problems.

With POTools I could now easily extract a template .pot file from the already existing Resources.resx file, and start looking for a suitable translation tool. There are many such out there, but for my purposes I went with Loco, which offers just the right set of capabilities at the right price. The real point is that once you go with the .po/.pot format for your translations, there's a huge eco system out there to help you out!

As a minor issue, I decided to create a separate resource file, InvariantResources.resx, to hold the texts that by definition should never be translated, but still are suitable to keep in a resource file.

So, now I have a bunch of translated .po files, as well as the original .resx file with the original texts. How to tie all this together?

Using another part of the .NET building blocks for localization, IStringLocalizer offers a nice interface that fits well. Using sample code from the Karambolo.PO library that can read .po files, it was trivial to create a POStringLocalizer implementation.

This is now available as a nuget package for your convenience! There's sample code there on how to wire it all up, as well as in the github repository.

So, we now can use the Portable Object file format and the gettext universe of tools and users, while still maintaining a good fit with .NET code and libraries.

Thanks go to the author of Karambolo.PO library and POTools!

Monday, May 10, 2021

ConcurrentDictionary.GetOrAdd() may not work as you think!

It's concurrent - not lazy

We had a problem with random crashes in a customers web. It was not catastrophic, it would get going again but still it was there in the logs and we want every page visit to work.

A bit of investigation with an IL-code decompiler followed, which by the way is absolutely the best thing since sliced bread! I found code equivalent to the following in a third party vendors product:
private readonly ConcurrentDictionary<string, Type> _dictionary
    = new ConcurrentDictionary<string, Type>();

private readonly object _lock = new object();

public Type GetType(string typename)
{
    return _dictionary.GetOrAdd(typename,
        (t) =>
        {
            lock (_lock)
            {
                return DynamicallyGenerateType(t);
            }
        });
}
The thing here is that DynamicallyGenerateType() can only be called once per typename, since what it does is emit code into an assembly, and if you do that twice you get a System.ArgumentException: Duplicate type name within an assembly .

No-one wants that, so the author thought that it would be cool to use a ConcurrentDictionary<TKey, TValue> since the GetOrAdd() method guarantees that it will get an existing value from the dictionary, or add a new one using the provided value factory and then return the value.

It looks good, reasonable, and works almost all of the time. Key word here is: almost.

What the concurrent dictionary does is it uses efficient and light-weight locking to ensure that the dictionary can be concurrently accessed and updated in a consistent and thread safe manner.

It does not guarantee a single one-time lazy call to the value factory used to add a value if it's not in the dictionary.

Sometimes, under heavy initial load, the value factory passed as the second argument to GetOrAdd() will be called twice (or more). What the concurrent dictionary guarantees is that the value for the provided key will only be set once, but the value factory may be called multiple times with the result thrown away for all calls except the race-winning one!

This is clearly documented but it's easy to miss. This is not a case of the implementation not being thread safe, as is stated in some places. It's very thread safe. But the value factory may indeed be called multiple times!

To fix it, add a Lazy-layer on top, because Lazy<T> by default does guarantee that it's value factory is only called once!
private readonly ConcurrentDictionary<string, Lazy<Type>> _dictionary
    = new ConcurrentDictionary<string, Lazy<Type>>();

private readonly object _lock = new object();

public Type AddType(string typename)
{
    return _dictionary.GetOrAdd(typename,
        (t) =>
        {
            lock (_lock)
            {
                return DynamicallyGenerateTypeLazy(t);
            }
        }).Value;
}

Now, although we may instantiate more than one Lazy<Type> instance, and then throw it away if we loose the GetOrAdd-race, that's a minor problem and it works as it should.

Please note that this is only true as long as you use the default LazyThreadSafetyMode.ExecutionAndPublication mode.

The additional lock may look confusing, but it was in the original code and makes sense in this context, because while the concurrent dictionary and lazy layer guarantees that only one call per value of  'typename' is made to DynamicallyGenerateTypeLazy(), it does not guarantee that multiple threads do not call it it concurrently with different type names and this may wreak havoc with the shared assembly that the code is generated to.

Tuesday, May 4, 2021

SSH with TortoiseGit and Bitbucket or GitHub on Windows 10

Memo to self

It's always complicated to remember the steps necessary to get SSH working, and there are some idiosyncrasies as well. This guide may help you, I'm sure it'll help me the next time I need to do this myself.

Password-based login with HTTPS is starting to be obsolete, and it's less secure. Also with the nice SSH agent in Windows 10, you only need to enter the password once - ever.

Generate a key pair

Open a command prompt and run the ssh-keygen command, to generate a private and a public key file. Accept the defaults.

Enter a strong password for the private key file when asked, and ensure that you store it securely in your password manager.

This should create files in %USERPROFILE%\.ssh named id_rsa (the private key file) and id_rsa.pub (the public key file).


Enable and start the OpenSSH Authentication Agent Service

Nowadays it is shipped with Windows 10, but it's not enabled by default. So start your Services gadget and ensure the service is set to startup automatically, and it's running.


Add the private key to the SSH Authentication Agent

In the command prompt, type ssh-add . It should select the default ssh key id_rsa, and ask for the password you entered previously.

(If you get the error message "Can't add keys to ssh-agent, communication with agent failed", there seems to be an issue with certain Windows distributions. For whatever reasons, the following workaround appears to work. Open a new command prompt but elevated with Run As Administrator. Then type:

    sc.exe create sshd binPath=C:\Windows\System32\OpenSSH\ssh.exe .

Then exit the elevated command prompt and try again to do the ssh-add in your normal command prompt.)


Save the public key to Bitbucket...

Open the file %USERPROFILE%\.ssh\ids_rsa.pub in Notepad, Select All (Ctrl-A) and Copy (Ctrl-C). Paste it into this dialog in Bitbucket, Your Profile and Settings -> Personal Settings -> SSH keys -> Add key:


The Label is just anything that makes it easy for you to remember what key it is. Perhaps todays date, and the name of the computer you have the private key on can be a good start. Or just "My public SSH key" works too.


...and/or save the public key to GitHub

Go to Settings -> SSH keys -> New SSH key


The Title has the same meaning as Label for Bitbucket, see above.


Remove any Credential Helpers

Git credential helpers may conflict with the use of SSH keys, and there is no need for them anyway, so remove them from TortoiseGit in the Settings -> Git -> Credential menu so it looks like this:



Tell Git where to find SSH

Set the environment variable GIT_SSH to C:\Windows\System32\OpenSSH\ssh.exe . Right-click "This PC" -> Properties -> Advanced system settings -> Environment Variables... -> New...


Restart explorer (Task Manager -> Details -> explorer.exe right-click -> End Task, then File -> Run new Task -> Open: explorer -> OK) , or logout and login, or restart your computer.


Update origin URL to use SSH

Finally, update your repos origin to use SSH instead of HTTPS. The easiest way is to copy the part after 'git clone' in the Bitbucket "Clone" feature.


Click the "Clone" button, Select SSH and then the URL part of the git clone command suggested, and paste it in TortoiseGit Remote origin for the repo:



Done! Now you can enjoy password-less use of git with Bitbucket and/or GitHub.

Monday, May 3, 2021

Thinking about cacheability

Performance, Cacheability and Thinking Ahead

Although it's very true that "premature optimization is the root of all evil" (Hoare/Knuth), this should never be taken to mean that writing inefficient code is a good thing. Nor does it mean that we should ignore the possible need for future optimizations. As all things, it's a question of balance.

When designing APIs for example, just how you define them may impact even the possibility for future optimizations, specifically caching. Although caching is no silver bullet, often it is the single most effective measure to increase performance so it's definitively a very important tool in the optimization toolbox.

Let's imagine a search API, that searches for locations of gas stations within a given distance from a given point, and also filters on the brand of the station.

(In a real life application, the actual search terms will of course be more complex, so let's assume that using an underlying search service really is relevant which is perhaps not necessarily the case in this literal example. Also, don't worry about the use of static methods and classes in the sample code, it's just for show.)

[Route("v1/[controller]")]
public class GasStationV1Controller : ControllerBase
{
    [HttpGet]
    public IEnumerable<GasStation> Search(string brand, double lat, double lon, double distance)
    {
        return SearchService.IndexSearch(brand, lat, lon, distance);
    }
}

We're exposing a REST API that delegates the real work to a fictitious search service, accessing an index and providing results based on the search parameters, possibly adding relevance, sponsoring and other soft parameters to the search. That's not important here.

What is important is that we've decided to let the search index handle the geo location part of the search as well, so we're indexing locations and letting the search index handle distance and nearness calculations, which on the surface of things appear to make sense. The less we do in our own code, and the more we can delegate, the better!

But, unfortunately it turns out this is a little too slow, and also we're overloading the back end search service which has a rate limiting function as well as a per-call pricing schedule so it's expensive too. What to do? The obvious thing is to cache. Said and done.

[Route("v2/[controller]")]
public class GasStationV2Controller : ControllerBase
{
    [HttpGet]
    public IEnumerable<GasStation> CachedSearch(string brand, double lat, double lon, double distance)
    {
        string cacheKey = $"{brand}-{lat}-{lon}-{distance}";
        return ObjectCache.GetOrAdd(cacheKey, () => SearchService.IndexSearch(brand, lat, lon, distance));
    }
}

Now we're using our awesome ObjectCache to either get it from the cache, or if need be, call the back end service. All set, right?

Not quite.

The location that we're looking to find near matches to is essentially where the user is, which means there'll be quite a bit of variation of the search parameters. In fact, there is very little chance that anything in the cache will ever be re-used. The net effect of our caching layer is just to fill server memory. We're not reducing the back end search service load, and we're not speeding anything up for anyone.

The thing to consider here is that when we're designing an API that has the potential of being a bottleneck in one way or another, we should consider to make it possible to add a caching layer even if we don't to begin with (remember that thing about premature optimizations).

Avoid designing low-level API:s that take essentially open-ended parameters, i.e. parameters that have effectively infinite variation, and where very seldom a set of parameters is used twice. It's not always possible, it depends on the situation, but consider it.

As it turns out, our only option was to redesign what we use the search index for, and move some functionality into our own application. This is often a memory/time tradeoff, but in this case, keeping up to a 100 000 gas stations in memory is not a problem, and filtering them in memory in the web server is an excellent option.

This is how it looks like now, and although we're obliged to do some more work on our own, we'll be fast and we're offloading the limited and expensive search service quite a bit.

[Route("v3/[controller]")]
public class GasStationV3Controller : ControllerBase
{
    [HttpGet]
    public IEnumerable<GasStation> Search(string brand, double lat, double lon, double distance)
    {
        string cacheKey = $"{brand}";
        return ObjectCache.GetOrAdd(cacheKey, () => SearchService.IndexSearch(brand))
            .Select(g => g.SetDistance(lat, lon))
            .Where(g => g.Distance <= distance)
            .OrderBy(g => g.Distance);
    }
}

Now we have more manageable set of search parameters to cache, and we can still serve the user rapidly and without overloading the search service and or budget.

Taking this a step further, we'd consider moving this logic to the client, if it's reasonable since then we can even let the HTTP response become cacheable, which can further increase the scalability and speed for the users.

In the end performance is always about compromises, but the lesson learned here is that even if we don't (think) we need optimization and caching at the start, we should at least consider leaving the path for it open.

Thursday, April 29, 2021

Weird random constants in GetHashCode()

Random integers in GetHashCode() for C#/.NET

When you ask Visual Studio to generate Equals() and GetHashCode() for C#/.NET, as well as when you inspect other code, you often see addition and multiplication of various seemingly random constants as part of the calculation. Here's an example of how it can look:

public class MyTypeA
{
    public MyTypeA(string value) => ValueA = value;

    public string ValueA { get; }

    public override int GetHashCode() => 250924889 + EqualityComparer<string>.Default.GetHashCode(ValueA);
}

public class MyTypeB
{
    public MyTypeB(string value) => ValueB = value;

    public string ValueB { get; }

    public override int GetHashCode() => -1007312712 + EqualityComparer<string>.Default.GetHashCode(ValueB);
}

public class MyFixedByVsHashCode
{
    public MyFixedByVsHashCode(string value)
    {
        A = new MyTypeA(value);
        B = new MyTypeB(value);
    }

    public MyTypeA A { get; }

    public MyTypeB B { get; }

    public override int GetHashCode()
    {
         int hashCode = -1817952719;
         hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeA>.Default.GetHashCode(A);
         hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeB>.Default.GetHashCode(B);
         return hashCode;
    }
}

The above example was generated using Visual Studio 2019 for .NET Framework and contains a number of seemingly random strange integer constants: 250924889, -1007312712, -1817952719 and -1521134295. If you generate code for .NET Core or .NET 5 it may look a little different, but under the hood it's similar.

Executive summary: The reason for these numbers is to reduce the risk of collisions, i.e. the number of situations where two different instances with different values get the same hash code.

So what's up with these magic numbers? First of all: No, they're not random. Let's go through them.

public override int GetHashCode() => 250924889 + EqualityComparer<string>.Default.GetHashCode(ValueA);
public override int GetHashCode() => -1007312712 + EqualityComparer<string>.Default.GetHashCode(ValueB);

These values are derived from the name of the property. Exactly how is not documented and not clear, but it's essentially equivalent to "NameOfProperty".GetHashCode() . The purpose is to add the name of property to the equation, reducing the risk that two properties with the same value get the same hash code.

Then we have the integer constants from the multiple property implementation:

int hashCode = -1817952719;
hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeA>.Default.GetHashCode(A);
hashCode = hashCode * -1521134295 + EqualityComparer<MyTypeB>.Default.GetHashCode(B);

These are fixed, and do not vary. A little bit of analysis shows they are far from random. The first one, -1817952719 is actually the product of two relatively large primes, 16363 * 151379 = 2477014577 and is thus a nice semiprime, and when this is interpreted as a signed 32-bit integer we get -1817952719.

The second one, -1521134295, when interpreted as an unsigned 32-bit integer is 2773833001 - and that is a nice large prime!

Using primes and semiprimes as factors and constants in polynomials has been shown to produce numbers with better distribution than other constants.

So it's all about reducing the risk of collisions.

But how bad can it get? Actually, very bad... Here follows a seemingly good enough implementation that is similar to many real-world manual implementations. In fact, I've written a good number of similar, although hopefully not with as catastrophic result as this.

public class MyTypeA
{
    public MyTypeA(string value) => ValueA = value;

    public string ValueA { get; }

    public override int GetHashCode() => ValueA.GetHashCode();
}

public class MyTypeB
{
    public MyTypeB(string value) => ValueB = value;

    public string ValueB { get; }

    public override int GetHashCode() => ValueB.GetHashCode();
}

public class MyBrokenHashCode
{
    public MyBrokenHashCode(string value)
    {
        A = new MyTypeA(value);
        B = new MyTypeB(value);
    }

    public MyTypeA A { get; }

    public MyTypeB B { get; }

    public override int GetHashCode() => A.GetHashCode() ^ B.GetHashCode();
}

internal class Program
{
    private static void Main()
    {
        Console.WriteLine($"Hashcode is: {new MyBrokenHashCode("Something").GetHashCode()}");
        Console.WriteLine($"Hashcode is: {new MyBrokenHashCode("Other").GetHashCode()}");
    }
}

The above example produces the following output:

Hashcode is: 0
Hashcode is: 0

That's not good! Two different instances with two different values not only produce the same hashcode, it's 0! In fact it's worse than not good, it's potentially catastrophic for performance. The scary thing is, everything will still work, and look nice during testing but if these objects are placed in a HashTable or Dictionary or similar, and in production they grow to a larger number of elements then indexing these collections degenerate into linear searches in a linked list.

So what happens?

Two different types happen to generate the same hashcode for the same underlying value ("Something" or "Other"). That's actually not that unusual. Then we use XOR to combine the hashes, but XOR has the known weakness that XOR:ing identical values will always result in zero, regardless of the values.

This example is slightly contrived, but it demonstrates that seemingly good-looking code can have subtle pitfalls causing really bad effects.

Conclusion - Trust the tools and use Visual Studio generation of GetHashCode-code! Even if you don't notice any problems with your own implementations, do regenerate the code when you have the chance.

Iterations and the squaring factor

The power of 2

I recently found code that was functionally equivalent to the following:
public class Filter
{
    private readonly IEnumerable<string> _old;

    public Filter(IEnumerable<string> old) => _old = old;
public IEnumerable<string> WhatsNew(IEnumerable<string> updated) => updated.Where(s => !_old.Contains(s));
}

Nice, compact and easily understandable. We keep track of an original list of strings, and we get an updated list we'd like to know what's new.

Or is it, really?

As I mentioned, I found this type of code but why did I notice it? Because during debugging the call of the WhatsNew() method took significant time, it was boring to sit there and wait for it to complete!

The problem is that if the two collections are of the approximate same size, for example if updated contains a single new string, the typical number of calls to the string comparer is _old.Length * _old.Length / 2 .

In other words, the number of operations grows exponentially with the length of the list, this is typically expressed as O(N**2), read as "order of N-squared". That it's actually on the average divided by 2 doesn't matter for the O() notation, it just means that the number of operations is proportional to N squared.

In the real-world situation, the number of elements were on the order of 20,000. That's not extraordinary large in any way, but 20,000 * 20,000 / 2 is 200,000,000 !

That's 200 million operations! That can take real time even in a pretty fast machine.

The problem is the lookup in the _old list. We need to enumerate the updated one in one way, no way really to get around that, given the assumptions here.

This is where hashtables, or dictionaries which use hashtables under the hood and similar collections come into play. A lookup using a hashtable is enormously more efficient, and it will approach a linear increase in time rather than exponential. Here's how it could have been (and subsequently was) coded using a HashSet :

public class Filter
{
    private readonly HashSet<string> _old = new HashSet<string>();

    public Filter(IEnumerable<string> old)
    {
        foreach (string value in old)
        {
            _ = _old.Add(value);
        }
    }

    public IEnumerable<string> WhatsNew(IList<string> updated) => updated.Where(s => !_old.Contains(s));
}

Now, our WhatsNew() method will operate O(N), i.e. the time taken will be proportional to the number of elements, not the square of the number of elements! For larger sizes of the collection, that's a huge gain.

Obviously there are many variations both to the problem and the solution, but the message here is to be aware of the cost of doing effectively nested iterations of large collections.

This is also one of those examples of things that might not bite you until it's too late and your application is running in the real world. During testing and unit testing which usually is done with smaller data all will look well (even if we know we should be using expected data sizes somehow it often doesn't happen). Then, when it scales up in the real world performance can deteriorate dramatically and quickly!

This is similar to the old fable of the reward in grains of rice . Doubling the list does not decrease performance by half, as many would expect. It decreases performance proportional to the square of the increase! It get's progressively worse, quicker and quicker, and can surprisingly fast become a critical problem.

With the updated solution, doubling the list will decrease performance by roughly half, which is much easier to handle and scale with.

Tuesday, April 27, 2021

The strange git love of the command line

Git users love of the command line

 Why do a vast majority of git users love the command line so fervently? And why do I care?

Git is a great tool, but like said of "C", it's like a sharp knife. Excellent in the hands of an expert, but easy to cut your fingers if you're not.

Using git inexpertly can get you into all kinds of difficult-to-get-out trouble. Most developers' main job is to code, not maintain super-complex repositories with 100s or 1000s of contributors. You'd think they would welcome any tool that made the use of git easier and safer, perhaps forgoing some complex scenarios.

But, sorry to say, nope. Almost all of the developers I know and work with use git from the bash command line. It's also the only thing they use from bash, or any command line. Git has a zillion commands and options, but 9 out of 10 days the only things we do are:
  • Pull updates from the remote.
  • Create feature branches.
  • Commit changes locally.
  • Push changes to the remote.
  • Stash changes temporarily locally.
  • Switch branch.
  • Merge branches.
  • Rebase feature branches.
  • Inspect the commit history of a file.
There are excellent graphical user interface frontends for git, in all environments. I'm on Windows, and I happen to be fond of TortoiseGit, but there are others.

The great thing about these tools are that for one, they are 100% compatible with each other and the command line, since they all use the same underlying implementation of git.

Nobody, not even the most hard-core developers, use a command line editor or a command line debugger for day-to-day development. We're simply more productive with graphical full screen integrated development environments like Visual Studio, and less prone to mistakes. There's a reason we're not using TECO any more, even if it was awesome in it's time and still is in some ways - I loved it!

The same applies to source code control. It's simply faster, more convenient and less error-prone to use a tool that integrates with your graphical environment be it Visual Studio or Explorer or whatever your local equivalent is.

But, for some strange weird reason - for git suddenly it must be done from the bash prompt. I try to explain, I try to show, but the command line fixation remains. Git from the command line users repeatedly get into trouble and have to reset their local repository clones. Command line users frequently take much longer to investigate history inspections, and also often just skip doing things like rebasing the feature branch before merge to the main development branch causing the commit history to become much more complex. Just because it's too complicated to do from the command line.

Still, they persist.

I get that there are some rare cases that the command line is needed. I get that for batch operations, such as merging a bunch of repositories from perhaps the development branch to the main branch a script using the command line is perfect.

But the command line is not gone just because you use a modern tool for day-to-day use. It's still there when you need it.

Why do young gifted competent developers insist on using a tool model that originates in the 60's when there are so many better alternatives - and in all other cases they do?

I don't get it. But it is a problem, because it affects productivity both directly in daily use and indirectly because it tends to cause more complex commit histories to untangle in the future.