Why is this not used more in the data world?
Dear Data-Traveller, please note that this is a Linkedin-Remix.
I already posted this content on LinkedIn in June 2022, but I want to make sure it doesn't get lost in the social network abyss.
For your accessibility experience, and also as our own content backup, we repost the original text here.
Have a look, leave a like if you like it, and join the conversation in the comments if this sparks a thought!
Screenshot with Comments:
Plain Text:
One of the things I use a lot is a pattern called fan-out.
I think it’s because I spent a lot of time in the serverless world (mostly functions), where it is a common pattern.
You have an initial function that gathers the payloads that should be processed, then loops over them and calls a new function instance to handle each single payload. This way, you can massively parallelize your work.
I built a content renderer that ran on AWS Lambda, where we statically rendered over 2,000 pages in two minutes because every page was rendered in its own function instance.
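A minimal sketch of that dispatcher pattern on AWS Lambda, assuming a worker function named render-page already exists (the function name, event shape, and payload are hypothetical, not the original renderer's code):

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def dispatch_handler(event, context):
    """Initial function: gathers payloads and fans out one async worker call per page."""
    pages = event["pages"]  # hypothetical input: a list of page identifiers
    for page in pages:
        lambda_client.invoke(
            FunctionName="render-page",      # hypothetical worker Lambda
            InvocationType="Event",          # async invoke: fire-and-forget
            Payload=json.dumps({"page": page}),
        )
    return {"dispatched": len(pages)}
```

Because each invoke is asynchronous, the dispatcher returns almost immediately while all the workers render in parallel.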
When I built data pipelines in the last 1-2 years, they were usually small ones that I ran on Cloud Run, and I sometimes applied the same fan-out principle there.
Now I am playing around with pipeline orchestrators, and I am quite surprised that fan-out, or map (as it's called in the data world), is not more of a thing.
Prefect supports it, and that's why I ended up using it (and I like it so far).
Now I am asking myself: why is fan-out not a thing in the data world? How do you implement this?
An example: I am using Transistor.fm to host my podcast, and I want to get the daily download numbers for my reporting. They have an API where I can access the download data. For daily jobs, this is easy: I call the API for yesterday, get the data, and save it into storage and BigQuery.
For a backfill, I could use a range request and get one big payload back with data for all days. But I love having a single file per podcast and date in my storage.
With prefect.map, I can fan out these requests and fetch the daily data from the API in parallel. I like this approach so far.
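A minimal sketch of such a backfill flow, assuming Prefect 2's .map API; the fetch_downloads and store task bodies are placeholders for the actual Transistor.fm API call and the storage/BigQuery load:

```python
from datetime import date, timedelta
from prefect import flow, task

@task
def fetch_downloads(day: date) -> dict:
    # Placeholder: call the Transistor.fm analytics API for a single day here.
    return {"date": day.isoformat(), "downloads": 0}

@task
def store(day: date, payload: dict) -> None:
    # Placeholder: write one file per podcast & date, then load it into BigQuery.
    print(f"stored {day}: {payload}")

@flow
def backfill(start: date, end: date):
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    payloads = fetch_downloads.map(days)   # fan-out: one task run per day
    store.map(days, payloads)              # each store run waits on its own payload

if __name__ == "__main__":
    backfill(date(2022, 6, 1), date(2022, 6, 30))
```

Each mapped task run gets its own retries and its own entry in the UI, which is exactly what I want for per-day files.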
But how do you approach it?