Duplicate rows in a table
19February 4, 2013 by Kenneth Fisher
I recently got the following email:
“I just keyed in two rows into a table with all identical column values. How do I get rid of just one of them?”
Now even way back when in the dark ages when I started as a database developer I ran into this problem occasionally. Every now and again I would create a table and forget to put a primary key on it. (I’m fairly sure I’m the only one who has ever done this.) As a result I inevitably ended up with some rows that were exact duplicates and I needed to get rid of just one of them. So the first thing I tried was this:
DELETE TOP 1 FROM MyHeap WHERE WhatShouldHaveBeenMyKeyValue = 'WhatWasIThinking'
And it failed. Looking in BOL (written on stone tablets at the time) I found out that TOP wasn’t available in DELETE, or UPDATE for that matter. Remember this was back in SQL 2000. I honestly don’t remember if this could be done in FoxPro (my first database language). I developed a couple of tricks to manage the problem, and they worked fairly well, but I had always wished for DELETE TOP #. Well my wish finally came true, as of SQL 2005 TOP is available for use with DELETE, UPDATE and even INSERT. Unfortunately I only learned this recently while studying for my 2008 Development MCTS. Here is the definition in 2005-2008 R2. I haven’t looked at 2012 yet but I imagine that it is the same or similar.
TOP (expression) [PERCENT] [ WITH TIES ]
So back to my developer’s problem, I could now give him the simple solution.
DELETE TOP 1 FROM MyHeap WHERE WhatShouldHaveBeenMyKeyValue = 'WhatWasIThinking'
and it worked perfectly. In case you were wondering the original solutions I came up with was this:
SELECT DISTINCT * INTO #temp FROM MyHeap WHERE WhatShouldHaveBeenMyKeyValue = 'WhatWasIThinking' DELETE FROM MyHeap WHERE WhatShouldHaveBeenMyKeyValue = 'WhatWasIThinking' INSERT INTO MyHeap SELECT * FROM #temp
With complications for identity columns of course. So you can see how much easier the TOP keyword makes things.
FYI You can also use the TOP keyword as a way to limit a transaction size. I’m using 5000 rows as a transactions size for the following examples.
DELETE TOP (5000) SET ColumnA = 'NewValue' WHERE ColumnA = 'OldValue' WHILE @@ROWCOUNT = 5000 BEGIN WAITFOR DELAY ’00:00:02’ – Wait 2 seconds between transactions DELETE TOP (5000) SET ColumnA = 'NewValue' WHERE ColumnA = 'OldValue' END
Or
UPDATE TOP (5000) SET ColumnA = 'NewValue' WHERE ColumnA = 'OldValue' WHILE @@ROWCOUNT = 5000 BEGIN WAITFOR DELAY '00:00:02' – Wait 2 seconds between transactions UPDATE TOP (5000) SET ColumnA = 'NewValue' WHERE ColumnA = 'OldValue' END
Note: Scripts corrected based on comments below.
I haven’t had a chance to do any extensive testing on this yet but the theory is sound and I didn’t run into any problems the couple of times I’ve tried it.
In FoxPro, you can use the following:
DELETE NEXT 1 FROM mytable WHERE RECNO() = pnRec
Ahh, thank you. I had forgotten that. It’s amazing what you forget when you don’t use a language for 10 years :).
Ok for 1 duplicate record in at table, but if you have multiple sets of duplicate records, you will still need to use the old method, I guess?
BR. Richard
I would think that there would be a way to do it using TOP N, but to be honest I can’t think of a way off the top of my head. Not to mention that it may end up being complicated enough that the old method may still be the easier choice.
WHILE @@ROWCOUNT = 5000
BEGIN
UPDATE TOP (5000) SET ColumnA = ‘NewValue’
WHERE ColumnA = ‘OldValue’
WAITFOR DELAY ’00:02’ – Wait 2 seconds between transactions
END
This won’t work as @@ROWCOUNT would have to equal 5000 on entry and I think the WAITFOR DELAY may reset the @@rowcount.
Better like this:
DECLARE @ROWCOUNT int
SET @ROWCOUNT = -1 –Initialise
WHILE @ROWCOUNT 0
BEGIN
UPDATE TOP (5000) SET ColumnA = ‘NewValue’
WHERE ColumnA = ‘OldValue’
SET @ROWCOUNT = @@ROWCOUNT
END
Good point on the wait. Also I messed up and put wait for 2 minutes not 2 seconds. I’ll have to fix that in my post.
You’re code looks good, although I think you would have to initialize to 0 for the same reason you gave for my @@rowcount. You could also re-write mine like this.
UPDATE TOP (5000) SET ColumnA = ‘NewValue’
WHERE ColumnA = ‘OldValue’
WHILE @@ROWCOUNT = 5000
BEGIN
WAITFOR DELAY ’00:00:02′ – Wait 2 seconds between transactions
UPDATE TOP (5000) SET ColumnA = ‘NewValue’
WHERE ColumnA = ‘OldValue’
END
At least I don’t think I made any mistakes there.
The WHILE I wrote was WHILE @ROWCOUNT NOT EQUAL 0
But the “” less than greater than signs seem to be deleted from the comment probably as it’s html.
FYI, in SQL Server (since 2005), there’s a tidy way to do this with common table expressions and the ROW_NUMBER() windowed function using PARTITION BY like so. When you delete from a CTE, it deletes the data from the source table as well:
;WITH cte(rowNumber) AS
(SELECT ROW_NUMBER() OVER (PARTITION BY [List of Natural Key Fields] ORDER BY [List of Order By Fields]) FROM dbo.TableName)
DELETE FROM cte WHERE rowNumber>1
Awesome. I think this one is my favorite so far. I’ll have to add this one to my regular scripts. Thanks!
Aw, I was too slow. 🙂
🙂
this is the better, I used many times in a cube and works so good… is clear and can make specific partitions…
SQL Server lets you use ROW_NUMBER() to handle this sort of thing as well:
CREATE TABLE #a (a INT);
–Numbers is a Tally table, with Number values from 1 to n
INSERT INTO #a
( a )
SELECT 1 FROM NUMBERS WHERE Number BETWEEN 1 AND 10;
SELECT * FROM #a;
WITH sort AS
(
SELECT a, ROW_NUMBER() OVER(PARTITION BY a ORDER BY a) AS rid FROM #a
)
DELETE FROM sort WHERE rid > 1;
SELECT * FROM #a;
DROP TABLE #a;
I usually just add a new identity column and set it as the primary key. Then I can easily delete the duplicate row.
You can, but there are a few gotchas. First you need to make sure that your new identity column is the last column, otherwise any code or view (see here) that is dependent on column position will get messed up. You still run a risk with code that assumes # of columns. Also sometimes you will be working with vendor code where you aren’t allowed to make changes no matter how much sense it makes.
Assuming all of that is good then absolutely add your identity column. If it’s a big table though you may want to add it after hours :).
You are right. I usually run into this problem in development, It would be another thing entirely to encounter the problem in production.
Old style, before they had all of the fun commands others have suggested
set rowcount 1
DELETE FROM MyHeap
WHERE WhatShouldHaveBeenMyKeyValue = ‘WhatWasIThinking’
Boy I had forgotten that one. I actually didn’t learn that one for quite awhile. I started with the SELECT DISTINCT into a temp table version, but it was years before I learned about SET ROWCOUNT.
[…] Fisher (@sqlstudent144) also wrote a blog post about another way you can accomplish this task if you only have a few of them to […]