Comparing an inner join vs a subquery

10

May 2, 2013 by Kenneth Fisher

I was reading through Stackoverflow today and saw an interesting question. SQL inner join vs subquery. The user was comparing 3 queries and wondered why the first took significantly longer than the other 2.

Here are the queries:

Query 1: SELECT * From TabA INNER JOIN TabB on TabA.Id=TabB.Id
Query 2: SELECT * From TabA WHERE Id in (SELECT Id FROM TabB)
Query 3: SELECT TabA.* From TabA INNER JOIN TabB on TabA.Id=TabB.Id

My answer was that the Query 1 is pulling all of the data from both tables and therefore did more reads. I also suggested that the user use SET STATISTICS IO and SET STATISTICS TIME to test the theory. Well, I got curious and decided to run a similar test for myself.

Not having the OPs tables I decided to use 2 from AdventureWorks2008. Sales.Customer and Person.Person.

So first turning on the statistics.

 SET STATISTICS IO ON
SET STATISTICS TIME ON

Next, here are the queries using the same format as the OPs.

 SELECT * FROM Sales.Customer INNER JOIN Person.Person ON Sales.Customer.PersonID = Person.Person.BusinessEntityID
SELECT * FROM Sales.Customer WHERE Sales.Customer.PersonID IN (SELECT Person.Person.BusinessEntityID FROM Person.Person)
SELECT Sales.Customer.* FROM Sales.Customer INNER JOIN Person.Person ON Sales.Customer.PersonID = Person.Person.BusinessEntityID

I’m trimming down the output for the sake of brevity.

-- Query 1
(19119 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 1, logical reads 3816, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customer'. Scan count 1, logical reads 123, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 1594 ms,  elapsed time = 11237 ms.

-- Query 2
(19119 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customer'. Scan count 1, logical reads 123, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 859 ms,  elapsed time = 2193 ms.

-- Query 3
(19119 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 1, logical reads 59, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customer'. Scan count 1, logical reads 123, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 875 ms,  elapsed time = 2796 ms.

If you review IO sections you will notice that all physical reads are 0. This means both tables are cached at the time and all of the reads are coming out of memory. So right off the bat we can exclude caching issues.

Also Queries 2 and 3 are close enough in time to be just normal run variations, and the IO section while in a different order has identical numbers. The query plans are also very similar. Oddly enough even though all of the processes in those query plans are the same, the percentages are different. Unfortunately I don’t know enough about query plans to really say why.

Query 1 has a CPU time almost double the other two queries and an elapsed time of 4-5 times greater. Well we expected that from the original question. The Customer Table has 123 logical reads for all three queries. However if you compare the logical reads for the Person table you will see 59 each for Queries 2 & 3 and 3816 for Query 1! That is a lot of extra reads!

So now let’s compare the query plans

Query_Join_vs_Subquery

If you look closely, you will notice that the plans are very similar. The only difference (other than %s) is the Clustered Index Scan for Person.Person on Query 1 and the Index Scan on Queries 2 and 3. The Clustered Index Scan makes perfect sense to me. The clustered index contains all of the data for the table in the leaf nodes and all of the data is being returned. So it makes sense to scan across the entire table, returning everything.

It took me awhile to figure out why Queries 2 and 3 are doing an Index Scan on AK_Person_rowguid. (And honestly if anyone would like to confirm this I would greatly appreciate it.) That particular index is on a column called rowguid which is a uniqueidentifier and isn’t referenced anywhere in the query. So why use it? My understanding is that this particular index is being used for the following two reasons.

First this is the smallest index on the table, the other two being the clustered index and an index on the first, last and middle names. Second (and the real key here) is that the clustered index key has to be in the leaf nodes of every non-clustered index. Assuming a clustered index exists of course. In this particular case the clustered index is on BusinessEntityID which is the value we need for our query. So the optimizer decided that it would be faster to read in the whole non-clustered index, pulling the BusinessEntityID from the leaf nodes, rather than using the index nodes of the clustered index.

Now I’m not sure if the situation from the post I saw is the same. It relies on the indexes available, and I don’t have that information. But I do think this is a pretty good indication of what was happening. If nothing else, for me at least, it was a good exercise in reading the query plans and trying to figure out exactly what was going on.

About these ads
»

10 thoughts on “Comparing an inner join vs a subquery

  1. Keld Laursen says:

    If you look carefully at the queries you have created, then you will see that the query 1 is very much different from the other two in that this one is the only query that returns real data from TabB. The other two only returns TabA data. The bottom of all three queries is “Get all rows from TabA that has IDs that match TabB IDs.
    This is best accomplished by reading all of TabB from the source that in the most compact way contains the ID. As you (and the optimizer) found, this were some index.
    When you then makes a big change in the amount of data you need to get out, then you will get a big change in reads. In this case the optimiser of course went for the data themselves, using the clustered index.
    I think that the best thing to take away from your example is: “Do never EVER do a SELECT *”

    • Actually I didn’t create the queries. I was responding to a question on http://www.stackoverflow.com. I agree 100% that the first query is very different because it is pulling more data. Actually that was my original response to the question. This was just a proof. Hopefully the title of my post wasn’t misleading. I couldn’t think of what to call it so I went with the OP’s original title for hist question.

      I would say there are 2 important things to take out of this. 1) Only do SELECT * when you mean it. I won’t say never, because there have been occasions where I needed all of the columns. And 2) The optimizer used virtually identical query plans for the last 2 queries. Both pulling the same data, but one using an IN clause and the other an INNER JOIN.

  2. NonClustered Indexes (like AK_Person_rowguid) automatically append clustered index key (BusinessEntityid), that’s why it can use AK index (20 bytes per record) instead of PK index (full table)

    • Anibal, Pretty much the same response I had for Anibalven. I’m pretty sure that is what I put in the second to last paragraph :). It took me a few minutes when I was looking at it originally to remember that. Personaly I find it rather interesting and a good argument for artificial keys vs large natural ones for the clustered index.

  3. Glenn says:

    I think your sample is just mixing apples and oranges. Any query pulling more data should realistically always take more total time. However, here it is also pulling from a table full of custom data types and xml. If you reverse your third query to “Select Person.Person.* from…” or even just “Select Person.Person.Demographics from…”, it pumps the time back up to query one territory.

    Also, if you change query three to Select Customer.Customer.*, Person.Person.Title … (a non custom data type) the time is more in line with other two, with the cpu time now more or less matching in repeated tries. Adding addition specified base type fields to be returned only nominally affect total time.

    Also, if you reverse your In-clause query to pull Person.Person.* from an In-clause selecting Sales.Customer.PersonID, the cpu time is same but total time back up to query one when pulling this tables custom data types and xml.

    • Glen, I agree 100% it’s apples and oranges. However they weren’t my examples :). I was posting this in response to a question on http://www.stackoverflow.com and was using the OPs examples. Would you please re-explain your last paragraph? Or provide a full query if possible. I’m not sure I understand what you are trying to do there and I’d like to give it a shot myself.

      • Glenn says:

        Here I meant reversing the return values of the query with the in clause:
        “SELECT * FROM Person.Person WHERE BusinessEntityID IN (SELECT Select Sales.Customer.PersonID from Sales.Customers)” Since it is now pulling the custom datatypes and xml it takes as long as the first query.

      • Ok, so I did some looking. And the Person table is certainly bigger than the Customer table. 3816 vs 123 reads respectively. Next time I do something like this I’ll have to make sure my tables are closer in size! I’m still getting a slower response from the first query than yours but my results are varying so much (~4-10 seconds for the first query vs ~2 for yours) I can’t be certain.

  4. anibalven says:

    Non Clustered Indexes (like the AK one) automatically appends the clustered index key (BusinessEntityId) to columns, that’s why it can use the AK (20 bytes per record) index instead of the PK Index (full table)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 334 other followers

%d bloggers like this: