Analyzing Duplicate Records

Quite some time ago I published Analyzing Duplicate Records on 15seconds.com.  Yes, I know...I only published "Part 1" of that series.  Perhaps I'll continue more articles like that here.

2 Comments

  1. Chris Says:

    David,

    I just saw your article and it was very good. I love that you refer people to set theory and other theoretical foundations of SQL so that the word can get out how amazing it is. Maybe you'll be the next Tom Kyte! :-)

    I think the next in your series should be how to extend the base query to do other kinds of things. For example, the base query *selects* duplicates but what if you want to leave them in place but get a list of unique records without duplicates (getting A, B, C, D from A, A, B, C, C, D, D, D, D)?

    Also, how can I use this as a basis to get pairs of records based on a comparison between records? E.g., employee time punches where time between in and out is > 20 minutes? That'll include sorting, grouping, having, etc. - my mind boggles at the complexity but I'm certain it's merely a tweak to your example!

    Well, best of luck and thanks for the kick-start!

    Regards,

    Chris

  2. David L. Penton Says:

    Chris,

    All of the real magic happens in the GROUP BY/HAVING clauses, but in the case you are speaking of the HAVING clause is the real determining factor for what is filtered from the results. For instance, suppose I wanted to leave at most two records in this table. I could have done this (based on the article's query):

    GROUP BY

    b.[shipment], b.[purchase_order], b.[item_sku]

    HAVING

    SHIPMENT_ORDERS.[id] < MAX(b.[id])

    AND SHIPMENT_ORDERS.[id] > MIN(b.[id])

    or [order_date] could have been a range (less than 7 days from the max date):

    GROUP BY

    b.[shipment], b.[purchase_order], b.[item_sku]

    HAVING

    SHIPMENT_ORDERS.[order_date] < DATEADD(day, -7, MAX(b.[order_date]))

    In short, the HAVING clause is the powerhorse here.

    Thanks,

    David


Leave a Reply