Dealing with duplicates in SQL

Was explaining to a friend how to cope with duplicates today so thought I’d just stick a quick note up for him to refer too. Finding duplicates in SQL is relatively easy when dealing with one field, we just need to  count the number of occurances of the field:

SELECT email,
 COUNT(email) AS NoOff
FROM fbis_Users
GROUP BY email
HAVING ( COUNT(email) > 1 )

We can use the HAVING clause to get rows that are duplicated for a particular number of times, this number is zero based.

HAVING ( COUNT(email) = 2 )

To get the original row just wrap the duplicate query using an IN statement

SELECT * FROM fbis_Users WHERE email IN(
  SELECT email,
   COUNT(email) AS NoOff
  FROM fbis_Users
  GROUP BY email
  HAVING ( COUNT(email) > 1 )
)

I often see people using temporary tables to remove dupes but I tend to do it inline.

DELETE FROM Users WHERE UniqueId NOT In(
SELECT     MAX(UniqueID) AS IdToKeep
FROM         Users
GROUP BY Email
HAVING      (COUNT(Email) > 1) OR
                      (COUNT(Email) = 1))

We get the duplicates as usual but get the last entry with the duplicated email using MAX() (Last entry) or MIN() (if you have user history).  We end up with a list of the records we want to keep(only one address or the last(max) or first(min) one of the dupes) and delete the rest.

This probably won’t scale well to tables with lots of entries but is fine for smaller tables.

Leave a Reply

Your email address will not be published. Required fields are marked *