When ingesting data from custom pipelines, you sometimes get duplicates. Find out how to handle them.

Dear Data-Traveller, please note that this is a LinkedIn-Remix.

I already posted this content on LinkedIn in June 2022, but I want to make sure it doesn’t get lost in the social network abyss.

For accessibility and for our own content backup, we repost the original text here.

Have a look, leave a like if you like it, and join the conversation in the comments if this sparks a thought!

Link to Post

Screenshot with Comments:

[Image: LinkedIn post about handling duplicates in your data]

Plain Text:

Handling duplicates in your data.

When ingesting data from custom pipelines, you sometimes get duplicates, for several reasons:

– the source data already has duplicates
– some historic data gets updated (classic for ad data)
– the ingestion job ran twice (happens more often than you think)

Number two drove me crazy when I started with data integrations. Using Stitch for my first projects, I got Google AdWords data with 3-5 entries for the same ad and day because Google adjusted the numbers over several days to filter out bot and spam traffic.

At that time, SQL was still mostly new to me, so it took me quite some time to figure out how to handle it.

Today, adding

ROW_NUMBER() OVER (PARTITION BY report_date, profile_name ORDER BY ingested_at DESC) AS rn

to many source tables is a standard step for me, just to be on the safe side.
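
For illustration, here is a minimal sketch of how that window function deduplicates a table end to end, keeping only the most recently ingested row per ad and day. The table name adwords_performance and the metric columns are hypothetical placeholders; only the ROW_NUMBER() line mirrors the snippet above.

WITH ranked AS (
  SELECT
    report_date,
    profile_name,
    clicks,        -- hypothetical metric column
    cost,          -- hypothetical metric column
    ingested_at,
    ROW_NUMBER() OVER (
      PARTITION BY report_date, profile_name
      ORDER BY ingested_at DESC
    ) AS rn
  FROM adwords_performance  -- hypothetical raw table loaded by the pipeline
)
SELECT *
FROM ranked
WHERE rn = 1;  -- rn = 1 is the most recently ingested version of each row

Filtering on rn = 1 in a staging view like this means downstream models only ever see the latest version of each row, no matter how many times the pipeline delivered it.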

How do you handle duplicates?