Duplicate rows in a table

18

February 4, 2013 by Kenneth Fisher

I recently got the following email:

“I just keyed in two rows into a table with all identical column values. How do I get rid of just one of them?”

Now even way back when in the dark ages when I started as a database developer I ran into this problem occasionally. Every now and again I would create a table and forget to put a primary key on it. (I’m fairly sure I’m the only one who has ever done this.) As a result I inevitably ended up with some rows that were exact duplicates and I needed to get rid of just one of them. So the first thing I tried was this:

 DELETE TOP 1 FROM MyHeap
WHERE WhatShouldHaveBeenMyKeyValue = 'WhatWasIThinking'

And it failed. Looking in BOL (written on stone tablets at the time) I found out that TOP wasn’t available in DELETE, or UPDATE for that matter. Remember this was back in SQL 2000. I honestly don’t remember if this could be done in FoxPro (my first database language). I developed a couple of tricks to manage the problem, and they worked fairly well, but I had always wished for DELETE TOP #. Well my wish finally came true, as of SQL 2005 TOP is available for use with DELETE, UPDATE and even INSERT. Unfortunately I only learned this recently while studying for my 2008 Development MCTS. Here is the definition in 2005-2008 R2. I haven’t looked at 2012 yet but I imagine that it is the same or similar.

TOP (expression) [PERCENT]
     [ WITH TIES ]

So back to my developer’s problem, I could now give him the simple solution.

 DELETE TOP 1 FROM MyHeap
WHERE WhatShouldHaveBeenMyKeyValue = 'WhatWasIThinking'

and it worked perfectly. In case you were wondering the original solutions I came up with was this:

SELECT DISTINCT * INTO #temp
FROM MyHeap
WHERE WhatShouldHaveBeenMyKeyValue = 'WhatWasIThinking'

DELETE FROM MyHeap
WHERE WhatShouldHaveBeenMyKeyValue = 'WhatWasIThinking'

INSERT INTO MyHeap
SELECT * FROM #temp

With complications for identity columns of course. So you can see how much easier the TOP keyword makes things.

FYI You can also use the TOP keyword as a way to limit a transaction size. I’m using 5000 rows as a transactions size for the following examples.

DELETE TOP (5000) SET ColumnA = 'NewValue'
	WHERE ColumnA = 'OldValue'
WHILE @@ROWCOUNT = 5000
BEGIN
	WAITFOR DELAY ’00:00:02’ – Wait 2 seconds between transactions
	DELETE TOP (5000) SET ColumnA = 'NewValue'
		WHERE ColumnA = 'OldValue'
END

Or

UPDATE TOP (5000) SET ColumnA = 'NewValue'
	WHERE ColumnA = 'OldValue'
WHILE @@ROWCOUNT = 5000
BEGIN
	WAITFOR DELAY '00:00:02' – Wait 2 seconds between transactions
	UPDATE TOP (5000) SET ColumnA = 'NewValue'
		WHERE ColumnA = 'OldValue'
END

Note: Scripts corrected based on comments below.

I haven’t had a chance to do any extensive testing on this yet but the theory is sound and I didn’t run into any problems the couple of times I’ve tried it.

About these ads

18 thoughts on “Duplicate rows in a table

  1. Laurie Alvey says:

    In FoxPro, you can use the following:
    DELETE NEXT 1 FROM mytable WHERE RECNO() = pnRec

  2. Richard Tovsig says:

    Ok for 1 duplicate record in at table, but if you have multiple sets of duplicate records, you will still need to use the old method, I guess?

    BR. Richard

    • I would think that there would be a way to do it using TOP N, but to be honest I can’t think of a way off the top of my head. Not to mention that it may end up being complicated enough that the old method may still be the easier choice.

  3. Jonathan Roberts says:

    WHILE @@ROWCOUNT = 5000
    BEGIN
    UPDATE TOP (5000) SET ColumnA = ‘NewValue’
    WHERE ColumnA = ‘OldValue’
    WAITFOR DELAY ’00:02’ – Wait 2 seconds between transactions
    END

    This won’t work as @@ROWCOUNT would have to equal 5000 on entry and I think the WAITFOR DELAY may reset the @@rowcount.
    Better like this:
    DECLARE @ROWCOUNT int
    SET @ROWCOUNT = -1 –Initialise
    WHILE @ROWCOUNT 0
    BEGIN
    UPDATE TOP (5000) SET ColumnA = ‘NewValue’
    WHERE ColumnA = ‘OldValue’
    SET @ROWCOUNT = @@ROWCOUNT
    END

    • Good point on the wait. Also I messed up and put wait for 2 minutes not 2 seconds. I’ll have to fix that in my post.

      You’re code looks good, although I think you would have to initialize to 0 for the same reason you gave for my @@rowcount. You could also re-write mine like this.

      UPDATE TOP (5000) SET ColumnA = ‘NewValue’
      WHERE ColumnA = ‘OldValue’
      WHILE @@ROWCOUNT = 5000
      BEGIN
      WAITFOR DELAY ’00:00:02′ – Wait 2 seconds between transactions
      UPDATE TOP (5000) SET ColumnA = ‘NewValue’
      WHERE ColumnA = ‘OldValue’
      END

      At least I don’t think I made any mistakes there.

      • Jonathan Roberts says:

        The WHILE I wrote was WHILE @ROWCOUNT NOT EQUAL 0
        But the “” less than greater than signs seem to be deleted from the comment probably as it’s html.

  4. Joshua Patchak says:

    FYI, in SQL Server (since 2005), there’s a tidy way to do this with common table expressions and the ROW_NUMBER() windowed function using PARTITION BY like so. When you delete from a CTE, it deletes the data from the source table as well:

    ;WITH cte(rowNumber) AS
    (SELECT ROW_NUMBER() OVER (PARTITION BY [List of Natural Key Fields] ORDER BY [List of Order By Fields]) FROM dbo.TableName)

    DELETE FROM cte WHERE rowNumber>1

  5. Jason W says:

    SQL Server lets you use ROW_NUMBER() to handle this sort of thing as well:

    CREATE TABLE #a (a INT);
    –Numbers is a Tally table, with Number values from 1 to n
    INSERT INTO #a
    ( a )
    SELECT 1 FROM NUMBERS WHERE Number BETWEEN 1 AND 10;

    SELECT * FROM #a;

    WITH sort AS
    (
    SELECT a, ROW_NUMBER() OVER(PARTITION BY a ORDER BY a) AS rid FROM #a
    )
    DELETE FROM sort WHERE rid > 1;

    SELECT * FROM #a;

    DROP TABLE #a;

  6. Carla says:

    I usually just add a new identity column and set it as the primary key. Then I can easily delete the duplicate row.

    • You can, but there are a few gotchas. First you need to make sure that your new identity column is the last column, otherwise any code or view (see here) that is dependent on column position will get messed up. You still run a risk with code that assumes # of columns. Also sometimes you will be working with vendor code where you aren’t allowed to make changes no matter how much sense it makes.

      Assuming all of that is good then absolutely add your identity column. If it’s a big table though you may want to add it after hours :).

      • Carla says:

        You are right. I usually run into this problem in development, It would be another thing entirely to encounter the problem in production.

  7. Jeff Springer says:

    Old style, before they had all of the fun commands others have suggested

    set rowcount 1
    DELETE FROM MyHeap
    WHERE WhatShouldHaveBeenMyKeyValue = ‘WhatWasIThinking’

    • Boy I had forgotten that one. I actually didn’t learn that one for quite awhile. I started with the SELECT DISTINCT into a temp table version, but it was years before I learned about SET ROWCOUNT.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 163 other followers

%d bloggers like this: