Why is this not used more in the data world?

Dear Data-Traveller, please note that this is a LinkedIn-Remix.

I already posted this content on LinkedIn in June 2022, but I want to make sure it doesn't get lost in the social network abyss.

For your accessibility-experience and also for our own content backup, we repost the original text here.

Have a look, leave a like if you like it, and join the conversation in the comments if this sparks a thought!

Link to Post

Screenshot with Comments:

Plain Text:

One of the things I like to use a lot is a pattern called fan-out.

I think it’s because I spent a lot of time in the serverless world (mostly functions), where it is a common pattern.

You have an initial function that gathers the payloads to be processed and then loops over them, calling a new function for each one that handles a single payload. This way, you can massively parallelize your work.
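To make that concrete, here is a minimal sketch of what such a dispatcher/worker pair could look like on AWS Lambda in Python. The worker name `render-single-page`, the handler names, and the payload shape are made up for illustration and not taken from my actual setup.

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def gather_payloads():
    # Stand-in for whatever collects the work items (pages, files, dates, ...).
    return [{"page_id": i} for i in range(2000)]

def dispatcher_handler(event, context):
    """Initial function: collects the payloads and fans out one worker call per payload."""
    payloads = gather_payloads()
    for payload in payloads:
        lambda_client.invoke(
            FunctionName="render-single-page",  # hypothetical worker function
            InvocationType="Event",             # async invoke, so workers run in parallel
            Payload=json.dumps(payload),
        )
    return {"dispatched": len(payloads)}

def worker_handler(event, context):
    """Worker function: handles exactly one payload, e.g. renders a single page."""
    page_id = event["page_id"]
    # ... do the actual per-payload work here ...
    return {"rendered": page_id}
```

The important bit is the asynchronous invocation: the dispatcher does not wait for the workers, so all payloads get processed concurrently.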

I built a content renderer that ran on AWS Lambda, where we statically rendered over 2,000 pages in about two minutes, because every page was rendered in its own function instance.

The data pipelines I built over the last 1-2 years were usually small ones; I ran them on Cloud Run and sometimes applied the same fan-out principle there.

Now I am playing a bit with pipeline orchestrators and am quite surprised that fan-out, or map as it's called in the data world, is not more of a thing.

Prefect supports it, and that's why I ended up using it (and like it so far).

So I am asking myself: why is fan-out not a thing in the data world? How do you implement this?

An example: I am using Transistor.fm to host my podcast, and I want to get the daily download numbers for my reporting. They have an API where I can access the download data. For daily jobs, that's easy: I call the API for yesterday, get the data, and save it into storage and BigQuery.
For a backfill, I could use a range request and get one big payload back with data for all days. But I love having a single file per podcast & date in my storage.

With prefect.map, I can fan out these requests and get the daily data from the API in parallel. I like this so far.
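For illustration, here is a minimal sketch of how such a backfill could look with Prefect 2's task mapping. The Transistor endpoint, the request parameters, and the storage task are placeholders I made up for this example (check the Transistor.fm API docs for the real ones), and the exact Prefect API may differ between versions.

```python
import json
from datetime import date, timedelta

import requests
from prefect import flow, task

API_URL = "https://api.transistor.fm/v1/analytics/episodes"  # illustrative endpoint

@task(retries=2)
def fetch_daily_downloads(day: str) -> dict:
    """Fetch the download numbers for a single day."""
    response = requests.get(
        API_URL,
        headers={"x-api-key": "YOUR_API_KEY"},        # placeholder credential
        params={"start_date": day, "end_date": day},  # illustrative parameters
    )
    response.raise_for_status()
    return {"day": day, "downloads": response.json()}

@task
def save_daily_file(result: dict) -> None:
    """One file per podcast & date; a local JSON file stands in for cloud storage here."""
    with open(f"transistor_{result['day']}.json", "w") as f:
        json.dump(result["downloads"], f)

@flow
def backfill_downloads(start: date, end: date):
    days = [(start + timedelta(days=i)).isoformat() for i in range((end - start).days + 1)]
    # .map fans out one task run per day, so the API calls can execute in parallel
    results = fetch_daily_downloads.map(days)
    save_daily_file.map(results)

if __name__ == "__main__":
    backfill_downloads(date(2022, 5, 1), date(2022, 5, 31))
```

Each day becomes its own task run with its own retries and logs, which fits the one-file-per-podcast-and-date layout nicely.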

But how do you approach it?