SQL Delete Duplicates

De-duplication is the process of deleting duplicate rows from a database.

Duplicate rows is a problem that data analysts frequently deal with.

Example

#

Before deleting duplicates you have to find them.
In this example, duplicates are identified using a CTE on the Order table.

WITH Cte(FirstName, LastName, 
         OrderDate, RowNumber) AS (
  SELECT FirstName, LastName, OrderDate,
         ROW_NUMBER() OVER (
           PARTITION BY OrderDate, FirstName, LastName
           ORDER BY OrderDate, FirstName, LastName) AS RowNumber
   FROM [Order] O
   JOIN Customer C ON C.Id = O.CustomerId
) 

SELECT * 
  FROM Cte 
  WHERE RowNumber > 1

Note: These are not true duplicates, but it gets the point across. This query returns customers that placed more than one order on the same day.

Result:  7 records
FirstName LastName OrderDate RowNumber
Elizabeth Lincoln 2013-01-10 00:00:00.000 2
Philip Cramer 2013-02-25 00:00:00.000 2
Howard Snyder 2013-07-31 00:00:00.000 2
Jose Pavarotti 2013-10-22 00:00:00.000 2
Felipe Izquierdo 2014-01-19 00:00:00.000 2
Daniel Tonini 2014-03-24 00:00:00.000 2
Jose Pavarotti 2014-04-17 00:00:00.000 2

The ROW_NUMBER() function sequentially numbers rows that are grouped by FirstName, LastName, and OrderDate. Those with a RowNumber > 1 are duplicates of another row.


De-duplication

#

Next, the duplicate rows need deletion.
Changing SELECT to DELETE in the above query will accomplish this.

WITH Cte(Id, CustomerId, OrderDate, RowNumber) AS (
  SELECT Id, CustomerId, OrderDate,
         ROW_NUMBER() OVER (
           PARTITION BY CustomerId, OrderDate
           ORDER BY OrderDate, CustomerId) AS RowNumber
   FROM [Order] O
) 

DELETE
  FROM Cte
 WHERE RowNumber > 1

Note: This query always deletes the second row it finds which may, or may not, be desirable. A manual inspection step of the duplicates is often required.

Result:  7 records deleted

More Examples

List original and duplicate rows

The query below lists duplicate pairs of Order records.
This allows a data analyst to decide which row in each pair to delete.
SELECT A.Id AS OrderId, FirstName, LastName, 
       A.OrderNumber, A.OrderDate
  FROM [Order] A
  JOIN (

        SELECT C.Id, FirstName, LastName, OrderDate,
               COUNT(O.Id) AS Occurences
          FROM [Order] O
          JOIN Customer C ON C.Id = O.CustomerId
         GROUP BY C.Id, OrderDate, FirstName, LastName
        HAVING COUNT(O.Id) > 1) AS B

    ON A.CustomerId = B.Id AND A.OrderDate = B.OrderDate
Result:  14 records
OrderId FirstName LastName OrderNumber OrderDate
163 Elizabeth Lincoln 542540 2013-01-10 00:00:00.000
164 Elizabeth Lincoln 542541 2013-01-10 00:00:00.000
209 Philip Cramer 542586 2013-02-25 00:00:00.000
210 Philip Cramer 542587 2013-02-25 00:00:00.000
369 Howard Snyder 542746 2013-07-31 00:00:00.000
370 Howard Snyder 542747 2013-07-31 00:00:00.000
466 Jose Pavarotti 542843 2013-10-22 00:00:00.000
467 Jose Pavarotti 542844 2013-10-22 00:00:00.000
591 Felipe Izquierdo 542968 2014-01-19 00:00:00.000
592 Felipe Izquierdo 542969 2014-01-19 00:00:00.000
725 Daniel Tonini 543102 2014-03-24 00:00:00.000
726 Daniel Tonini 543103 2014-03-24 00:00:00.000
783 Jose Pavarotti 543160 2014-04-17 00:00:00.000
784 Jose Pavarotti 543161 2014-04-17 00:00:00.000

Here we have a list with paired duplicates, one after the other.


You may also like



Guides


vsn 3.1