Using GROUP BY instead of DISTINCT

March 6, 2017 by Kenneth Fisher

Recently, Aaron Bertrand (b/t) posted Performance Surprises and Assumptions : GROUP BY vs. DISTINCT. In it he says he prefers GROUP BY over DISTINCT. He discusses the fact that GROUP BY will, in fact, under certain circumstances, produce a faster query plan. I highly recommend taking the time to read it. In fact, if you haven’t already, go do it now. It’s ok. I’ll wait.

Done? Good. Interesting stuff wasn’t it? I agree with Aaron by the way. I prefer GROUP BY instead of DISTINCT. However, I was thinking that even before I realized the performance implications. In his post, Aaron said So why would I recommend using the wordier and less intuitive GROUP BY syntax over DISTINCT? I disagree with the less intiuitive part. Wordier yes, certainly, but I’ve found over the years that most people follow GROUP BY more easily. I can’t tell you how many times I’ve seen a query like this

SELECT DISTINCT FirstName, MiddleName, LastName, 
	AddressLine1, AddressLine2, City, StateProvinceName, PostalCode
FROM Sales.vIndividualCustomer
WHERE BusinessEntityID = 17298

And the user asks me why they ended up with two lines.

distinctlydistinct1

And so the conversation begins.

“I want to get one row for each person. Why do I have two rows for this guy?”
“Well, you have two different addresses.”
“But I used DISTINCT?”
“Yes but you are including the address and there are two of them.”
“Ok, so how do I fix it?”
“Don’t pull the address.”
“But I need the address.”

It’s even worse when it’s a SELECT DISTINCT * and you don’t have the columns listed explicitly. For some reason, for some people, it catches them by surprise that DISTINCT covers every column returned. In other words, every row is unique in the result set. I’m not sure why this confuses some people, and these are smart, skilled people, but it does.

Now, look at the same query with a group by.

SELECT FirstName, MiddleName, LastName, 
	AddressLine1, AddressLine2, City, StateProvinceName, PostalCode
FROM Sales.vIndividualCustomer
WHERE BusinessEntityID = 17298
GROUP BY FirstName, MiddleName, LastName

Msg 8120, Level 16, State 1, Line 2
Column ‘Sales.vIndividualCustomer.AddressLine1’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

This time we actually get an error. It’s a lot harder to get confused (IMHO) because you have to explicitly name the grouped columns outside of the field list. Yes, it’s a bit more work, but not that much right?

Category: Microsoft SQL Server, SQLServerPedia Syndication, T-SQL | Tags: best practices, microsoft sql server, T-SQL

6 thoughts on “Using GROUP BY instead of DISTINCT”

Shane O'Neill says:

March 6, 2017 at 8:17 AM

Ugh! Flashbacks to answering questions online

– but I used DISTINCT?
— And you got distinct results…
– but there’s 2 results!
— that’s because there’s 2 distinctly different results…
– BUT IT DIDN’T DO WHAT I WANTED IT TO!!!
— no, but it did exactly what you told it to…

Kenneth Fisher says:

March 6, 2017 at 8:35 AM

Love it! Always good to hear that I’m not the only one having those types of discussions 🙂

Reply

Roland Alexander says:

March 9, 2017 at 10:03 AM

Point well made! I’ve had to explain the “DISTINCT means every value in every column is distinct” more than a few times. Using GROUP BY certainly makes things clearer in that regard. Think I’ll be making this part of my personal Best Practices!

Clarence Bitegeko says:

March 12, 2017 at 12:52 PM

Well, Roland, I think your definition may lead to that frustration of getting two rows while one was expected. DISTINCT means the combined value for all columns returned is distinct. So you should not look on individual columns but the collective row that is queried. Your definition would be true only if you are querying an individual column.

Reply

Oded Dror says:

March 11, 2017 at 3:54 PM

Use Windows function with CTE
With myCTE
AS
(
Select a,b,c,
RN = ROW_NUMBER ( ) OVER ( PARTITION BY a order by c)
From mytable
)

Select * from myCTE
where RN = 1

Kenneth Fisher says:

March 11, 2017 at 4:13 PM

That will work. And that’s certainly a great way to get the top of the two rows. Bit more complicated though 🙂

Reply

SQL Studies

Using GROUP BY instead of DISTINCT

6 thoughts on “Using GROUP BY instead of DISTINCT”

Leave a comment Cancel reply

Follow me via RSS

Follow me via Email

Follow me on Twitter

Archives

Categories

Meta

SQL Studies

Using GROUP BY instead of DISTINCT

Share this:

6 thoughts on “Using GROUP BY instead of DISTINCT”

Leave a comment Cancel reply

Follow me via RSS

Follow me via Email

Follow me on Twitter

Archives

Categories

Meta