How do you handle data freshness monitoring?

Dear Data-Traveller, please note that this is a Linkedin-Remix.

I posted this content already on Linkedin in June 2022, but I want to make sure it doesn’t get lost in the social network abyss.

For your accessibility-experience and also for our own content backup, we repost the original text here.

Have a look, leave a like if you like it, and join the conversation in the comments if this sparks a thought!

Link to Post

Screenshot with Comments:

Plain Text:

How do you handle data freshness monitoring?

I was tinkering a bit with some approaches the last few days.
It first started with a vanilla approach and pure SQL.

I created a data freshness view which is basically a big union all the different sources where data arrives.

There I define two date fields: One for the latest date of reporting data and one for the latest ingestion. Because I sometimes had the issue that the ingestion time was misleading because just some historical stuff was ingested, but the new ones had some issues.

This gives me a table; I can just add to Lightdash. And I can also build freshness metrics for all sources and put them into each dashboard as a simple chart (or Oliver Laslett – maybe handling and showing data freshness on chart and dashboard level is an interesting feature for some future).

This works fine so far. Sure it takes some work. Because I also need to set up some tests and checks on top of this table.

There is also the source freshness feature built-in dbt. Which does quite the same, but you just add some configs in the source configs. But it is limited to only one ingest timestamp. But it can be included in the usual dbt tests and runs to warn you.

What are your best practices here? And I don’t want to introduce a new tool just to do data freshness testing.

dbt approach: