Which is faster? IN (list) or IN (SELECT FROM Temp)
October 11, 2018 by Kenneth Fisher
If you’ve done much with IN (list) then you’ve realized that it basically translates out to
col=val1 OR col=val2 OR ....
You’ve probably also realized that you can throw all the values into a temp table and do this
SELECT * FROM tablename WHERE col IN (SELECT col FROM #temp);
But which is faster? Well, let’s find out. First I’m going to create a table of 1,000,000 rows to pull against. That way I will hopefully get some useful times, rather than one query taking 10ms and the other 9ms.
tl;dr: It was pretty inconclusive, even at large numbers of values.
-- Set up the lookup table
USE Test;
GO

WITH x AS (
    SELECT TOP (100) CAST(ROW_NUMBER() OVER (ORDER BY val.[name]) AS INT) AS x
    FROM [master]..spt_values val),
Nums AS (
    SELECT CAST(ROW_NUMBER() OVER (ORDER BY x.x) AS INT) AS Num
    FROM x
    CROSS JOIN x y
    CROSS JOIN x z)
SELECT REPLICATE('a',ABS(CHECKSUM(NEWID()) % 1000)) AS col1,
       REPLICATE('a',ABS(CHECKSUM(NEWID()) % 1000)) AS col2,
       REPLICATE('a',ABS(CHECKSUM(NEWID()) % 1000)) AS col3,
       REPLICATE('a',ABS(CHECKSUM(NEWID()) % 1000)) AS col4,
       DATEADD(minute,ABS(CHECKSUM(NEWID()) % 10000),'1/1/2000') AS DateCol
INTO ListTable
FROM Nums;
GO
I’m going to do some tests of various sizes. The first one will be a small test of just 10 values.
Small test (10 values)
-- Get a small list of values to look up
SELECT TOP 10 DateCol
INTO #ListTemp
FROM ListTable
ORDER BY NEWID();

DECLARE @ListVar nvarchar(1000);
SELECT @ListVar = STUFF((SELECT ', ' + QUOTENAME(CONVERT(varchar(30), DateCol, 121),'''')
                         FROM #ListTemp
                         FOR XML PATH(''),TYPE).value('.','VARCHAR(MAX)')
                        , 1, 2, '');

DECLARE @sql varchar(max);
SET @sql = 'SELECT * FROM ListTable WHERE DateCol IN ('+@ListVar+');';
PRINT @sql;
Now using the query printed out:
SET STATISTICS TIME, IO ON;

SELECT * FROM ListTable
WHERE DateCol IN ('2000-01-02 01:28:00.000', '2000-01-06 07:05:00.000',
                  '2000-01-05 20:24:00.000', '2000-01-02 18:15:00.000',
                  '2000-01-02 08:12:00.000', '2000-01-07 03:48:00.000',
                  '2000-01-07 18:07:00.000', '2000-01-06 03:31:00.000',
                  '2000-01-03 19:55:00.000', '2000-01-04 22:13:00.000');

SELECT * FROM ListTable
WHERE DateCol IN (SELECT DateCol FROM #ListTemp);
I ran this twice and discarded the first run because it was loading the data into memory. Yes, I realize that could be useful information, but I’m skipping that part of it for today.
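(If you do want to measure the cold-cache run deliberately, something along these lines forces it on a test server. This isn’t part of the script above, just a sketch.)

-- Sketch only: force a cold-cache run on a TEST server (not used in the timings here)
CHECKPOINT;              -- write any dirty pages to disk first
DBCC DROPCLEANBUFFERS;   -- empty the buffer pool so the next query reads from disk
-- then re-run the two test queries and keep the first (cold) timings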
SQL Server parse and compile time:
   CPU time = 12 ms, elapsed time = 12 ms.

SQL Server Execution Times:
   CPU time = 0 ms, elapsed time = 0 ms.

(1049 row(s) affected)
Table 'ListTable'. Scan count 3, logical reads 289125, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 844 ms, elapsed time = 984 ms.

(1049 row(s) affected)
Table '#ListTemp___________________________________________________________________________________________________________000000000004'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ListTable'. Scan count 3, logical reads 289125, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 484 ms, elapsed time = 472 ms.
Ok, so the logical reads are about the same. More for the IN (SELECT), actually, but only by a fraction, because of the read on the temp table and some worktables. The time, however, was almost exactly half using the temp table. Now, this was without any indexes, so I’m going to run the same queries again, this time with a non-clustered index on ListTable.
CREATE INDEX ix_ListTable ON ListTable(DateCol);
This is a pretty simple index. I don’t have a clustered index, nor is this a covering index, but honestly, it’s a SELECT * so a covering index isn’t really likely anyway.
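Just as an illustration, if the query only pulled a column or two instead of SELECT *, a covering index might look something like this. It’s not an index I created for these tests; col1 is just one of the columns from the setup script.

-- Hypothetical covering index if we only ever selected DateCol and col1
CREATE INDEX ix_ListTable_Covering
    ON ListTable (DateCol)
    INCLUDE (col1);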
This time the results are:
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 3 ms.

(1049 row(s) affected)
Table 'ListTable'. Scan count 10, logical reads 1091, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 0 ms, elapsed time = 264 ms.

SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 6 ms.

(1049 row(s) affected)
Table 'ListTable'. Scan count 10, logical reads 1081, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#ListTemp___________________________________________________________________________________________________________000000000004'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 47 ms, elapsed time = 208 ms.
Oddly enough, slightly fewer reads for the IN (SELECT) this time, and it’s only slightly faster. I also tried adding an index on the temp table, but with no significant change.
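For reference, the temp table index was something along these lines (the exact definition isn’t important; this is just the general shape):

-- Roughly the shape of the index I tried on the temp table
CREATE CLUSTERED INDEX ix_ListTemp ON #ListTemp (DateCol);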
Summary (10 values)
With a small test (only 10 values) the IN (SELECT) was faster, with or without an index.
Large test (1000 values)
I’m going to use the exact same process as above, but I’m not going to post the actual query because it would be too long for a blog post. Here is the code to generate the query, though, if you want to play along. I did have to make a few changes: the TOP 10 became a TOP 1000, the nvarchar(1000) became a varchar(max), and the PRINT became a SELECT because you get more information that way.
SELECT TOP 1000 DateCol
INTO #ListTemp
FROM ListTable
ORDER BY NEWID();

DECLARE @ListVar varchar(MAX);
SELECT @ListVar = STUFF((SELECT ', ' + QUOTENAME(CONVERT(varchar(30), DateCol, 121),'''')
                         FROM #ListTemp
                         FOR XML PATH(''),TYPE).value('.','VARCHAR(MAX)')
                        , 1, 2, '');

DECLARE @sql varchar(max);
SET @sql = 'SELECT * FROM ListTable WHERE DateCol IN ('+@ListVar+');';
SELECT @sql;
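Since the non-clustered index from the 10-value test is still sitting on ListTable, it has to come off before the no-index run, and this time I only want the time statistics. Something along these lines (these exact statements aren’t in the original script):

-- Back to the no-index state, keeping only the time statistics
DROP INDEX ix_ListTable ON ListTable;
SET STATISTICS IO OFF;
SET STATISTICS TIME ON;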
Without the index (I’m skipping the IO statistics this time because they didn’t tell us much last time):
(96517 row(s) affected)

SQL Server Execution Times:
   CPU time = 1704 ms, elapsed time = 5472 ms.

(96517 row(s) affected)

SQL Server Execution Times:
   CPU time = 1250 ms, elapsed time = 6031 ms.
Hmm, faster for the IN (list) this time. Let’s try it with an index on ListTable.
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms, elapsed time = 0 ms.

(95297 row(s) affected)

SQL Server Execution Times:
   CPU time = 1375 ms, elapsed time = 5047 ms.

(95297 row(s) affected)

SQL Server Execution Times:
   CPU time = 1250 ms, elapsed time = 5787 ms.
Still a bit faster for the IN (list). And again, I tried adding an index on the temp table, with no major difference.
Summary (1000 values)
This time the IN (list) was faster both with and without indexes, though by less than a second.
Summary (10000 values)
I also did a test with 10,000 values. In this case the IN (list) was still faster without an index, but once I put an index on the table the IN (SELECT) became faster, and with an index on both the table and the temp table the IN (SELECT) was a bit faster still.
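For anyone playing along, the 10,000-value run is just the same pattern again with a bigger TOP, plus an index on the temp table for the last variation. Roughly (a sketch, not the exact script I ran):

-- Sketch of the 10,000 value variation (drop any existing #ListTemp first)
SELECT TOP 10000 DateCol
INTO #ListTemp
FROM ListTable
ORDER BY NEWID();

-- last variation: index the temp table as well as ListTable
CREATE CLUSTERED INDEX ix_ListTemp ON #ListTemp (DateCol);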
Overall summary
My results were really pretty inconclusive. I was surprised, since I truly expected the IN (SELECT) to be faster, particularly with an index. But in the end the timing wasn’t significantly different (a few seconds here or there), and I can’t be certain that isn’t down to something I’ve done wrong. So for most cases I’m just going to use whichever version is more convenient. For cases where even a fraction of a second matters (hundreds of runs a second, anyone?) I’m going to have to test to see which one is fastest in that particular case.
Can you test these against doing a JOIN between ListTable and #ListTemp on col as well? Curious what differences, if any, there are.
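(The JOIN being asked about would presumably look something like this. Worth noting that, unlike IN, a JOIN returns duplicate rows if #ListTemp holds duplicate DateCol values.)

-- Presumed shape of the JOIN version being asked about
SELECT lt.*
FROM ListTable AS lt
JOIN #ListTemp AS t
    ON t.DateCol = lt.DateCol;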
I can’t imagine how any I/O (Select) can be faster than no I/O (values list).
Fair point. I did realize later that the list was going to be pure memory. I did, however, wonder if the indexing would make up for it at a larger scale. Which of course it didn’t.
This gets a little more complex when we get outside the SQL Server world. Optimizers and other products will turn the IN() into various complex structures. The first step is usually to remove duplicates and sort the list. Another option is to put the list into a tree index of some sort (usually just a simple binary tree). The third option when the list is really long is to put it into a hash table.
The real complexity, however, is when the list can include expressions and computations. The safest thing, of course, is to translate it into a chain of OR-ed predicates, which guarantees the results will be correct. But they might not be fast.
Thanks! This is great information 🙂 I have to admit, I wasn’t sure about the background processing; I was simply doing a time trial. I have also kind of wondered how it would do if, instead of a regular temp table, I used an in-memory table.
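A memory-optimized version of that lookup might look something like the sketch below. The type name is made up, it needs a database with a MEMORY_OPTIMIZED_DATA filegroup, and it isn’t something I tested here.

-- Sketch only: memory-optimized table type for the lookup values (hypothetical name)
-- Requires a MEMORY_OPTIMIZED_DATA filegroup on the database
CREATE TYPE dbo.DateColList AS TABLE
(
    DateCol datetime NOT NULL,
    INDEX ix_DateCol NONCLUSTERED (DateCol)
)
WITH (MEMORY_OPTIMIZED = ON);
GO

DECLARE @ListTemp dbo.DateColList;

INSERT INTO @ListTemp (DateCol)
SELECT TOP 10 DateCol FROM ListTable ORDER BY NEWID();

SELECT * FROM ListTable WHERE DateCol IN (SELECT DateCol FROM @ListTemp);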
Temporary tables have recompile thresholds, which are triggered as their statistics are updated (i.e. as they are being inserted).
Table variables (until SQL Server 2019) are always estimated to contain one row, no matter how many rows are inserted (i.e. until SQL Server 2019 their statistics are never updated).
IN clauses (even though they are also persisted to tempdb).
IN clauses can generate different execution plans, where N plans can lead to procedure cache bloat (and increase the chance of other, perhaps more valuable but less costly, cached plans being flushed).
IN clauses also have a finite limit when being passed to the optimizer (which will cause a compilation error and thus its query will fail to run). In contrast, temporary tables are constrained by tempdb’s available disk space, and will not cause the optimizer to fail the way an excessively large IN clause will.
An IN (SELECT … FROM Table) allows the optimizer to consider Table’s statistics.
I prefer to say “it depends” when asked this question :).
“IN clauses (even though they are also persisted to tempdb)” should have been “Table variables are also persisted to tempdb”, or just stricken 😉
Either way 🙂 Thanks for the additional info!