Dealing with duplicates in SQL
Was explaining to a friend how to cope with duplicates today so thought I’d just stick a quick note up for him to refer too. Finding duplicates in SQL is relatively easy when dealing with one field, we just need to count the number of occurances of the field:
SELECT email, COUNT(email) AS NoOff FROM fbis_Users GROUP BY email HAVING ( COUNT(email) > 1 )
We can use the HAVING clause to get rows that are duplicated for a particular number of times, this number is zero based.
HAVING ( COUNT(email) = 2 )
To get the original row just wrap the duplicate query using an IN statement
SELECT * FROM fbis_Users WHERE email IN( SELECT email, COUNT(email) AS NoOff FROM fbis_Users GROUP BY email HAVING ( COUNT(email) > 1 ) )
I often see people using temporary tables to remove dupes but I tend to do it inline.
DELETE FROM Users WHERE UniqueId NOT In( SELECT MAX(UniqueID) AS IdToKeep FROM Users GROUP BY Email HAVING (COUNT(Email) > 1) OR (COUNT(Email) = 1))
We get the duplicates as usual but get the last entry with the duplicated email using MAX() (Last entry) or MIN() (if you have user history). We end up with a list of the records we want to keep(only one address or the last(max) or first(min) one of the dupes) and delete the rest.
This probably won’t scale well to tables with lots of entries but is fine for smaller tables.
Leave a Reply