## Thursday, July 14, 2011

### LINQ and set notation

So I learned something today.

It's been a long time since I had any classes dealing with sets. I vaguely remember discussions including Venn diagrams and odd notation that didn't have any relevance to any other context that I was familiar with, but they were also pretty cool to study and get to know. However, they're the types of things that have a highly technical, mathematically rigorous side to them and also the day-to-day intuitive side that I use. Not needing it on a regular basis, the finer points of set algebra quickly sink into the background and aren't easily recalled.

Today I needed to do some set manipulation. I knew LINQ had some set related methods in it, such as Intersect which gives only the items that are in both lists.

And I knew it had the Union that gives all the items in both lists, without duplicates.

Needing to find the differences between two lists, I suspected there was something to do that easily. I had two string lists and needed the items from both lists that were only in one of the two lists. In other words, I needed the inverse of Intersect. First I looked at the options in the code completion list. Nothing jumped out at me. I couldn't recall the technical name so I did a web search for "LINQ intersect inverse" and found the Except method. The description sounded promising but, when I tried it, I didn't get what I expected. It only gave the things that were in the first list but not in the second list. In other words, it did not include the things in the second list that weren't in the first. Figuring the people who wrote these methods probably knew something about what they were doing, I went digging a little deeper.

The next web search was for "set intersect inverse" that led me to a Wikipedia page on set intersection. Scrolling to the bottom "see also" section, I saw Complement. Ah-ha! That rang a bell from the distant past. Clicking on that link confirmed that Complement was in fact the difference between two sets.

And this is where I started to learn something new: there are multiple types of differences in set theory. In general, the Complement refers to things not in a given set. First, there's the relative complement as implemented by LINQ's Except method. This gives the things from one list that are not in the other list. For example, if list1 = new List<string> {"a", "b", "c"} and list2 = new List<string>{"b", "c", "d"} then list1.Except(list2) will return "a".

Second, there's the absolute complement. This is the universe of all things not in a given set. Hopefully it's obvious that this can't be implemented in software.

Finally, there is the symmetric difference. This is the name of what I needed. It's the list of things that are in only one of two given sets. Thinking in terms of set operations, it's the relative complement between the union of the sets and the intersection of the sets. In LINQ terms it's list1.Union(list2).Except(list1.Intersect(list2)). This got me what I want.

Now that I had a technical term to search on, out of curiosity I tried "symmetric difference LINQ". (It's always easier to find what you're looking for when you know the proper keywords.) This returned a link to a StackOverflow question that not only gave the answer I came up with but also pointed out there's a SymmetricExceptWith method on the HashSet class.

Curiosity now drove me to do a quick benchmark. Knowing LINQ has never been a speed demon, I assumed it probably wouldn't beat out the HashSet implementation. I threw together a quick console app that simply ran both the above LINQ query and HashSet.SymmetricExceptWith call on two short lists a million times and reported the elapsed times. I used three different types of input containers to see if that made much difference. Here's what I found...
Input typeUsing LINQUsing HashSet.SymmetricExceptWith
List<string>2.8 s1.0 s