Data entry errors are a common source of duplicate records in a database.
Another source is merging data from multiple sources into a single database.
Finding duplicates is the first step in cleaning up the data.
There are different SQL techniques to locate these duplicates.
Find the duplicates in the Order table.
SELECT FirstName, LastName, OrderDate,
COUNT(O.Id) AS Occurences
FROM [Order] O
JOIN Customer C ON C.Id = O.CustomerId
GROUP BY OrderDate, FirstName, LastName
HAVING COUNT(O.Id) > 1
Note: These are not true duplicates, but it gets the point across. These 'duplicates' are customers that placed more than 1 order on the same day.
FirstName | LastName | OrderDate | Occurences |
---|---|---|---|
Elizabeth | Lincoln | 2013-01-10 00:00:00.000 | 2 |
Philip | Cramer | 2013-02-25 00:00:00.000 | 2 |
Howard | Snyder | 2013-07-31 00:00:00.000 | 2 |
Jose | Pavarotti | 2013-10-22 00:00:00.000 | 2 |
Felipe | Izquierdo | 2014-01-19 00:00:00.000 | 2 |
Daniel | Tonini | 2014-03-24 00:00:00.000 | 2 |
Jose | Pavarotti | 2014-04-17 00:00:00.000 | 2 |
As you can see, with COUNT, GROUP BY and HAVING we can confirm that duplicates exist.
Next, we want the actual rows that are duplicates. This is accomplished by joining the above query with a second one, like so:
SELECT A.Id AS OrderId, FirstName, LastName,
A.OrderNumber, A.OrderDate
FROM [Order] A
JOIN (
SELECT C.Id, FirstName, LastName, OrderDate,
COUNT(O.Id) AS Occurences
FROM [Order] O
JOIN Customer C ON C.Id = O.CustomerId
GROUP BY C.Id, OrderDate, FirstName, LastName
HAVING COUNT(O.Id) > 1) AS B
ON A.CustomerId = B.Id AND A.OrderDate = B.OrderDate
OrderId | FirstName | LastName | OrderNumber | OrderDate |
---|---|---|---|---|
163 | Elizabeth | Lincoln | 542540 | 2013-01-10 00:00:00.000 |
164 | Elizabeth | Lincoln | 542541 | 2013-01-10 00:00:00.000 |
209 | Philip | Cramer | 542586 | 2013-02-25 00:00:00.000 |
210 | Philip | Cramer | 542587 | 2013-02-25 00:00:00.000 |
369 | Howard | Snyder | 542746 | 2013-07-31 00:00:00.000 |
370 | Howard | Snyder | 542747 | 2013-07-31 00:00:00.000 |
466 | Jose | Pavarotti | 542843 | 2013-10-22 00:00:00.000 |
467 | Jose | Pavarotti | 542844 | 2013-10-22 00:00:00.000 |
591 | Felipe | Izquierdo | 542968 | 2014-01-19 00:00:00.000 |
592 | Felipe | Izquierdo | 542969 | 2014-01-19 00:00:00.000 |
725 | Daniel | Tonini | 543102 | 2014-03-24 00:00:00.000 |
726 | Daniel | Tonini | 543103 | 2014-03-24 00:00:00.000 |
783 | Jose | Pavarotti | 543160 | 2014-04-17 00:00:00.000 |
784 | Jose | Pavarotti | 543161 | 2014-04-17 00:00:00.000 |
Here we have a list with paired duplicates, one after the other. This allows us to make decisions about which rows to keep and which ones to delete.
One way to prevent duplicates from entering a database is to use constraints. Relevant constraints are Identity, UNIQUE columns, and UNIQUE Indexes. Click the respective links for details.
Note: SQL Server creates a UNIQUE index on a column that is declared UNIQUE. So, the end-result of UNIQUE columns and UNIQUE Indexes are the same.
Duplicates can also be found by using the ROW_NUMBER Window function.
CUSTOMER |
---|
Id |
FirstName |
LastName |
City |
Country |
Phone |
ORDER |
---|
Id |
OrderDate |
OrderNumber |
CustomerId |
TotalAmount |
WITH Cte(FirstName, LastName,
OrderDate, RowNumber) AS (
SELECT FirstName, LastName, OrderDate,
ROW_NUMBER() OVER (
PARTITION BY OrderDate, FirstName, LastName
ORDER BY OrderDate, FirstName, LastName) AS RowNumber
FROM [Order] O
JOIN Customer C ON C.Id = O.CustomerId
)
SELECT *
FROM Cte
WHERE RowNumber > 1
Note: This returns the same results as the first query on this page.
FirstName | LastName | OrderDate | Occurences |
---|---|---|---|
Elizabeth | Lincoln | 2013-01-10 00:00:00.000 | 2 |
Philip | Cramer | 2013-02-25 00:00:00.000 | 2 |
Howard | Snyder | 2013-07-31 00:00:00.000 | 2 |
Jose | Pavarotti | 2013-10-22 00:00:00.000 | 2 |
Felipe | Izquierdo | 2014-01-19 00:00:00.000 | 2 |
Daniel | Tonini | 2014-03-24 00:00:00.000 | 2 |
Jose | Pavarotti | 2014-04-17 00:00:00.000 | 2 |