The advantage of receiving data in the format you need – and the return of ETL?
Dear Data-Traveller, please note that this is a Linkedin-Remix.
I posted this content about data modeling already on Linkedin in August 2022, but I want to make sure it doesn’t get lost in the social network abyss.
For your accessibility-experience and also for our own content backup, we repost the original text here.
Have a look, leave a like if you like it, and join the conversation in the comments if this sparks a thought!
Plain Text:
Don’t get me wrong; I love(d?) the standard integration we get by tools like Fivetran, Stitch, or Airbyte.
Google and Facebook ads data in my database? It takes me 30 minutes to set up. And if I'm lucky, I also get some models to build on top, or I just pick the ones that Fivetran provides as dbt packages.
That all sounds great. And to some degree, it is excellent.
But there is one problem you need to keep in mind.
Let’s use an analogy.
When you build a house and use a tool like Fivetran or Airbyte, it's a bit like they come by every day with a truck and offload random house-building material, sorted their way, in your yard. You have to go through it and pick out the materials you actually need. The rest just stays there – since you have a considerable yard, that's no problem. And every day you get a little better at sorting, but it still costs (your compute) time.
To improve on this, you would start to order precisely the things you need, and you would love Fivetran and Airbyte even more if they could deliver it in the order and shape you prefer before putting it in your yard. Sounds crazy? Well, check how just-in-time delivery works in production workflows – it works exactly like that.
Step 1 you can achieve in Fivetran and Airbyte by selecting only the tables and columns you need. Start small and grow – not the other way around, even if selecting everything is tempting.
Step 2 is more complex – one design element of ELT tools is that they “force” their model on you, and you then work from their model toward your own. This can involve some heavy transformation.
– As a start, develop your agnostic ad model.
– Define how you want to receive the data for this model.
– Then map the received data against it.
– Analyze the queries needed for the mapping process and look at the query costs.
– If they are concerning, think about transforming the data before it is loaded into your database – yeah, that’s right, that brings back the good old ETL (for example, loading into S3, transforming, and loading from there) – or check out Keboola, where you can do all of this in a managed workflow.
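The mapping step above can be sketched in a few lines of Python. In practice this would likely live in a dbt model, and all raw column names here are hypothetical – your connector’s actual schemas will differ – but the idea is the same: one agnostic target schema, one small mapper per ad source.

```python
# Sketch of an "agnostic ad model": one target schema, one mapper per source.
# All raw field names below are hypothetical examples, not real connector schemas.

TARGET_FIELDS = {"date", "channel", "campaign", "spend", "clicks", "impressions"}

def map_google_ads(row: dict) -> dict:
    """Map a raw Google Ads-style row to the agnostic model."""
    return {
        "date": row["segments_date"],
        "channel": "google_ads",
        "campaign": row["campaign_name"],
        "spend": row["metrics_cost_micros"] / 1_000_000,  # micros -> currency units
        "clicks": row["metrics_clicks"],
        "impressions": row["metrics_impressions"],
    }

def map_facebook_ads(row: dict) -> dict:
    """Map a raw Facebook Ads-style row to the agnostic model."""
    return {
        "date": row["date_start"],
        "channel": "facebook_ads",
        "campaign": row["campaign_name"],
        "spend": float(row["spend"]),          # delivered as strings in this sketch
        "clicks": int(row["clicks"]),
        "impressions": int(row["impressions"]),
    }

# Example raw rows (invented for illustration):
google_row = {"segments_date": "2022-08-01", "campaign_name": "brand",
              "metrics_cost_micros": 12_500_000, "metrics_clicks": 42,
              "metrics_impressions": 1900}
facebook_row = {"date_start": "2022-08-01", "campaign_name": "retargeting",
                "spend": "8.75", "clicks": "17", "impressions": "1200"}

unified = [map_google_ads(google_row), map_facebook_ads(facebook_row)]
# Every mapped row conforms to the agnostic schema:
assert all(set(r) == TARGET_FIELDS for r in unified)
```

Defining the target model first, and treating each source as a thin mapper onto it, is exactly what makes it cheap to later push that mapping upstream (ETL) if the in-warehouse query costs get concerning.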
Takeaway: Be picky about how your data gets delivered – it will save you a lot of headaches downstream.