Wednesday, March 21, 2012

7 ways to remove duplicate strings from a list (with timing results)

Recently there was a question on CodeProject about the fastest way to remove duplicate strings from a List. There were a bunch of alternative solutions presented, but no comparison to see what was faster. This post will show each solution and time them in a standard way to determine which actually answers the question.

About the test

Test setup creates a list of 1,000 random strings all 50 characters long. This is the expected result. It then creates duplicates for every other item and then for every 5th item. A list containing duplicates is created from the expected results and the duplicates. Note that for the even 5th items, there are actually two duplicates in the final list.

The core test method takes the list of duplicates, sets up a stopwatch and then calls the method to remove duplicates 1,000 times, each time passing in a new copy of the duplicate list. Since one of the methods under test changes the input list, this causes each iteration to work on the same, pristine input list. The elapsed time from the stopwatch is output and assertions are in place to make sure the method under test does what's expected.

About the methods under test

I've taken the code presented in the original thread and made slight modifications to make them all conform to the same interface. Some of them were code snippets or descriptions so I made my best guess as to the authors' intent.

1. The original poster's solution

public static IList LoadUniqueList1(List doubledList)
{
   var uniqueList = new List();
   foreach (var item in doubledList)
   {
      var x = true;
      foreach (var compare in uniqueList)
      {
         if (item == compare)
         {
            x = false;
         }
         if (!x) break;
      }
      if (x)
         uniqueList.Add(item);
   }
   return uniqueList;
}

2. Same as above, except substituting Contains for the second loop

public static IList LoadUniqueList2(List doubledList)
{
   var uniqueList = new List();
   foreach (var item in doubledList)
   {
      if (!uniqueList.Contains(item))
         uniqueList.Add(item);
   }
   return uniqueList;
}

3. Using a SortedList

public static IList LoadUniqueList3(List doubledList)
{
   var uniqueList = new SortedList();
   foreach (var item in doubledList)
   {
      if (!uniqueList.ContainsKey(item))
         uniqueList.Add(item, item);
   }
   return uniqueList.Values;
}

4. Using a HashSet

public static IList LoadUniqueList4(List doubledList)
{
   return new HashSet(doubledList).ToList();
}

5. Using LINQ Distinct

public static IList LoadUniqueList5(List doubledList)
{
   return doubledList.Distinct().ToList();
}

6. Using LINQ double iteration

public static IList LoadUniqueList6(List doubledList)
{
   var uniqueList = new List();
   uniqueList.Add(
      (from itm in uniqueList
      from compareItem in doubledList
      where itm == compareItem
      select compareItem).ToString());
   return uniqueList;
}

7. Sort and then build filtered list

public static IList LoadUniqueList7(List doubledList)
{
   var uniqueList = new List();
   doubledList.Sort();
   string lastEntry = string.Empty;
   foreach (string str in doubledList)
   {
      if (str != lastEntry)
      {
         uniqueList.Add(str);
         lastEntry = str;
      }
   }
   return uniqueList;
}

Timing results

The table below shows the elapsed time results for three test runs and the average elapsed time. The last column is how the strategy performed.

Before looking at the results, ask yourself "Which are the top two best performers?"


Comments about either the test methodology or results can be left below.

Test project available here.

Monday, March 12, 2012

There and back again, again: Several resource's tale

Or, how to safely change a cursor or other resource

In a previous article, I presented a simple class that utilized the IDisposable pattern to set the cursor to a specific value and then automatically reset it to the initial state when some work was accomplished. This simple class had two limitations: it only worked with WPF and it only worked on cursors.

To overcome these limitations, I refactored this to be more general. I changed the context to a generic type and added an abstract RestoreState method that's called when the class is either destroyed (preferred) or the finalizer is called. This allows all the basic plumbing regarding the IDisposable pattern to be contained in one class.

using System;

namespace ResourceProtection
{
    public abstract class ResourceChangeHandler : IDisposable
    {
        protected TContext Context { get; set; }
        private bool Disposed { get; set; }

        protected ResourceChangeHandler(TContext context)
        {
            Disposed = false;
            Context = context;
        }

        public void Dispose()
        {
            DisposeInt();
            GC.SuppressFinalize(this);
        }

        protected abstract void RestoreState(TContext context);

        private void DisposeInt()
        {
            if (Disposed)
                return;

            RestoreState();
            Disposed = true;
        }

        ~ResourceChangeHandler()
        {
            DisposeInt();
        }
    }
}

The beauty of this is it can be used as the base class for specific needs. The function of actually changing a resource is split out from the framework needed to manage the change. Here are three examples: one each for a WPF cursor changer, a WinForms cursor changer and a TreeView BeginUpdate/EndUpdate handler.

How to change a WPF cursor

public class WpfChangeCursor : ResourceChangeHandler{
    private Cursor OriginalCursor { get; set; }

    public WpfChangeCursor(FrameworkElement element, Cursor newCursor)
            : base(element)
    {
        OriginalCursor = Context.Cursor;
        Context.Cursor = newCursor;
    }

    protected override void RestoreState()
    {
        Context.Cursor = OriginalCursor;
    }
}

To use this, instantiate an instance of WpfChangeCursor in a using block, passing in the element who's cursor should be changed and the new value for the cursor. In this case, the context for the base class is the element containing the cursor to be changed. The descendant class has the responsibility to remember the current value and changing the cursor.

How to change a WinForm cursor

public class WinFormChangeCursor : ResourceChangeHandler{
    private Cursor OriginalCursor { get; set; }

    public WinFormChangeCursor(Cursor newCursor)
            : base(Cusor.Current)
    {
        Cursor.Current = newCursor;
    }

    protected override void RestoreState()
    {
        Cursor.Current = Context;
    }
}

This class is used by instantiating an instance of WinFormChangeCursor in a using block with the new value of the cursor passed to it. This is a little different from the WPF version since there is only one cursor for the application, rather than for each element. In this case, the context passed to the base is the current cursor value which is then used by RestoreState to put things back.

How to call BeginUpdate and EndUpdate safely

public class TreeViewUpdate : ResourceChangeHandler{
    public TreeViewUpdate(TreeView treeview)
            : base(treeview)
    {
        Context.BeginUpdate();
    }

    protected override void RestoreState()
    {
        Context.EndUpdate();
    }
}

And finally, here's a class to make sure EndUpdate is called for every BeginUpate on a TreeView. Like the WPF cursor changer above, the control is passed in as the context for the ancestor class. BeginUpdate is then called on the context. When the using block this class is controlled by finishes, the EndUpdate in RestoreState is called, again on the same context as the BeginUpdate.

This technique provides for a level of separation between the framework needed to safely do and undo things from the actual work of doing and undoing them. It works well for certain classes of problems, specifically fixed things that happen over and over again. The examples above of doing repetitive things on framework level items are good examples.

Sometimes though, one wants this type of functionality without the overhead of creating a new class each time. In the next article, I'll present a method of composing this type of behavior in a more adhoc manner.

Until then, subscribe to the Twitter feed to get notified of future articles and hear about other development related things. And if you have any questions, suggestions or comments, please feel free to leave them below.