Google Analytics

Thursday, April 29, 2021

Iterations and the squaring factor

The power of 2

I recently found code that was functionally equivalent to the following:
public class Filter
{
    private readonly IEnumerable<string> _old;

    public Filter(IEnumerable<string> old) => _old = old;
public IEnumerable<string> WhatsNew(IEnumerable<string> updated) => updated.Where(s => !_old.Contains(s));
}

Nice, compact and easily understandable. We keep track of an original list of strings, and we get an updated list we'd like to know what's new.

Or is it, really?

As I mentioned, I found this type of code but why did I notice it? Because during debugging the call of the WhatsNew() method took significant time, it was boring to sit there and wait for it to complete!

The problem is that if the two collections are of the approximate same size, for example if updated contains a single new string, the typical number of calls to the string comparer is _old.Length * _old.Length / 2 .

In other words, the number of operations grows exponentially with the length of the list, this is typically expressed as O(N**2), read as "order of N-squared". That it's actually on the average divided by 2 doesn't matter for the O() notation, it just means that the number of operations is proportional to N squared.

In the real-world situation, the number of elements were on the order of 20,000. That's not extraordinary large in any way, but 20,000 * 20,000 / 2 is 200,000,000 !

That's 200 million operations! That can take real time even in a pretty fast machine.

The problem is the lookup in the _old list. We need to enumerate the updated one in one way, no way really to get around that, given the assumptions here.

This is where hashtables, or dictionaries which use hashtables under the hood and similar collections come into play. A lookup using a hashtable is enormously more efficient, and it will approach a linear increase in time rather than exponential. Here's how it could have been (and subsequently was) coded using a HashSet :

public class Filter
{
    private readonly HashSet<string> _old = new HashSet<string>();

    public Filter(IEnumerable<string> old)
    {
        foreach (string value in old)
        {
            _ = _old.Add(value);
        }
    }

    public IEnumerable<string> WhatsNew(IList<string> updated) => updated.Where(s => !_old.Contains(s));
}

Now, our WhatsNew() method will operate O(N), i.e. the time taken will be proportional to the number of elements, not the square of the number of elements! For larger sizes of the collection, that's a huge gain.

Obviously there are many variations both to the problem and the solution, but the message here is to be aware of the cost of doing effectively nested iterations of large collections.

This is also one of those examples of things that might not bite you until it's too late and your application is running in the real world. During testing and unit testing which usually is done with smaller data all will look well (even if we know we should be using expected data sizes somehow it often doesn't happen). Then, when it scales up in the real world performance can deteriorate dramatically and quickly!

This is similar to the old fable of the reward in grains of rice . Doubling the list does not decrease performance by half, as many would expect. It decreases performance proportional to the square of the increase! It get's progressively worse, quicker and quicker, and can surprisingly fast become a critical problem.

With the updated solution, doubling the list will decrease performance by roughly half, which is much easier to handle and scale with.

No comments:

Post a Comment