Wednesday, March 21, 2012

7 ways to remove duplicate strings from a list (with timing results)

Recently there was a question on CodeProject about the fastest way to remove duplicate strings from a List&lt;string&gt;. A bunch of alternative solutions were presented, but with no comparison to see which was faster. This post shows each solution and times them all in a standard way to determine which one actually answers the question.

About the test

Test setup creates a list of 1,000 random strings, each 50 characters long. This is the expected result. It then creates a duplicate of every other item and of every 5th item, and the list containing duplicates is built from the expected results plus those duplicates. Note that items hit by both rules (every 10th item) actually have two duplicates in the final list.
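The post doesn't reproduce the setup code inline, so here is a minimal sketch of what it describes. The counts (1,000 strings, 50 characters, every-other and every-5th duplicates) come from the description above; the names, the fixed seed, and the character set are my assumptions, not necessarily the original project's:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TestData
{
    // Builds `count` random strings, each `length` characters long.
    public static List<string> CreateExpected(int count, int length)
    {
        var rng = new Random(42); // fixed seed so runs are repeatable (my choice)
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        return Enumerable.Range(0, count)
            .Select(_ => new string(Enumerable.Range(0, length)
                .Select(__ => chars[rng.Next(chars.Length)])
                .ToArray()))
            .ToList();
    }

    // Appends a duplicate of every other item and of every 5th item,
    // so every 10th item ends up duplicated twice.
    public static List<string> CreateDoubled(List<string> expected)
    {
        var doubled = new List<string>(expected);
        for (int i = 0; i < expected.Count; i += 2) doubled.Add(expected[i]);
        for (int i = 0; i < expected.Count; i += 5) doubled.Add(expected[i]);
        return doubled;
    }
}
```

With 1,000 expected strings this yields a 1,700-item input list (1,000 + 500 + 200).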

The core test method takes the list of duplicates, starts a stopwatch, and then calls the method under test 1,000 times, each time passing in a fresh copy of the duplicate list. Since one of the methods under test mutates its input list, copying ensures every iteration works on the same, pristine input. The elapsed time from the stopwatch is output, and assertions are in place to make sure the method under test does what's expected.
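The harness itself isn't shown in the post; a minimal sketch of the loop it describes could look like this (the `TimeMethod` name and signature are my assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class Harness
{
    // Times `method` over `iterations` runs, handing each run a fresh copy
    // of the doubled list so in-place mutation (e.g. Sort) can't affect later runs.
    public static TimeSpan TimeMethod(
        Func<List<string>, IList<string>> method,
        List<string> doubledList,
        List<string> expected,
        int iterations = 1000)
    {
        var watch = Stopwatch.StartNew();
        IList<string> result = new List<string>();
        for (int i = 0; i < iterations; i++)
            result = method(new List<string>(doubledList)); // pristine copy each time
        watch.Stop();

        // Result order varies between strategies, so compare sorted copies.
        Debug.Assert(result.OrderBy(s => s)
                           .SequenceEqual(expected.OrderBy(s => s)));
        return watch.Elapsed;
    }
}
```

Comparing sorted copies is deliberate: the SortedList and sort-based strategies return their results in sorted order, while the others preserve input order.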

About the methods under test

I've taken the code presented in the original thread and made slight modifications so they all conform to the same interface. Some of them were only code snippets or descriptions, so I made my best guess as to the authors' intent.

1. The original poster's solution

public static IList&lt;string&gt; LoadUniqueList1(List&lt;string&gt; doubledList)
{
   var uniqueList = new List&lt;string&gt;();
   foreach (var item in doubledList)
   {
      var x = true;
      foreach (var compare in uniqueList)
      {
         if (item == compare)
         {
            x = false;
         }
         if (!x) break;
      }
      if (x)
         uniqueList.Add(item);
   }
   return uniqueList;
}

2. Same as above, except substituting Contains for the second loop

public static IList&lt;string&gt; LoadUniqueList2(List&lt;string&gt; doubledList)
{
   var uniqueList = new List&lt;string&gt;();
   foreach (var item in doubledList)
   {
      if (!uniqueList.Contains(item))
         uniqueList.Add(item);
   }
   return uniqueList;
}

3. Using a SortedList

public static IList&lt;string&gt; LoadUniqueList3(List&lt;string&gt; doubledList)
{
   var uniqueList = new SortedList&lt;string, string&gt;();
   foreach (var item in doubledList)
   {
      if (!uniqueList.ContainsKey(item))
         uniqueList.Add(item, item);
   }
   return uniqueList.Values;
}

4. Using a HashSet

public static IList&lt;string&gt; LoadUniqueList4(List&lt;string&gt; doubledList)
{
   return new HashSet&lt;string&gt;(doubledList).ToList();
}

5. Using LINQ Distinct

public static IList&lt;string&gt; LoadUniqueList5(List&lt;string&gt; doubledList)
{
   return doubledList.Distinct().ToList();
}

6. Using LINQ double iteration

public static IList&lt;string&gt; LoadUniqueList6(List&lt;string&gt; doubledList)
{
   // As posted, this one didn't actually remove duplicates; my best guess
   // at the intent: keep an item only when no earlier item matches it,
   // with the inner query providing the second iteration over the list.
   return (from pair in doubledList.Select((item, index) => new { item, index })
           where !(from compareItem in doubledList.Take(pair.index)
                   where compareItem == pair.item
                   select compareItem).Any()
           select pair.item).ToList();
}

7. Sort and then build filtered list

public static IList&lt;string&gt; LoadUniqueList7(List&lt;string&gt; doubledList)
{
   var uniqueList = new List&lt;string&gt;();
   doubledList.Sort();
   string lastEntry = string.Empty;
   foreach (string str in doubledList)
   {
      if (str != lastEntry)
      {
         uniqueList.Add(str);
         lastEntry = str;
      }
   }
   return uniqueList;
}

Timing results

The table below shows the elapsed time for three test runs of each strategy and the average elapsed time. The last column ranks how each strategy performed.

Before looking at the results, ask yourself "Which are the top two best performers?"


Comments about either the test methodology or results can be left below.

Test project available here.
