How to get a workflow running for a simple API-based data pipeline
Dear Data-Traveller, please note that this is a LinkedIn remix.
I already posted this content on LinkedIn in June 2022, but I want to make sure it doesn’t get lost in the social network abyss.
For better accessibility, and as a backup of my own content, I am reposting the original text here.
Have a look, leave a like if you like it, and join the conversation in the comments if this sparks a thought!
I have played around a bit with data pipeline orchestration over the last few weeks.
My usual simple approach is a Docker/Flask app that I deploy on Cloud Run. These pipelines often have just one or two steps, and that approach has worked fine. But sometimes I need more steps, and I would like to have some monitoring on top.
#1 – Prefect
I had this on my list for quite some time, so it was the obvious candidate. What is a bit confusing is that there are two versions, 1 and 2. Version 2 is still in beta, so I went with version 1 (totally against my usual habit of using cutting-edge versions).
The core setup is quickly done and works nicely out of the box. I also managed to create a fan-out approach in a short time.
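As a rough illustration of what fan-out means here: one step produces a list of work items, the same step runs once per item in parallel, and a final step collects the results. The sketch below is plain Python using the standard library, not Prefect's API (in Prefect 1 the equivalent is mapping a task over a list). The function names, the page list, and the returned row counts are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def list_pages() -> list[int]:
    """Upstream step: decide which API pages to fetch (hypothetical values)."""
    return [1, 2, 3]


def fetch_page(page: int) -> dict:
    """One fanned-out unit of work. A real pipeline would call the API here."""
    return {"page": page, "rows": page * 10}


def load(results: list[dict]) -> int:
    """Fan-in step: combine the per-page results into one total."""
    return sum(r["rows"] for r in results)


def run_pipeline() -> int:
    pages = list_pages()
    # Fan out: run fetch_page once per page, in parallel threads.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(fetch_page, pages))  # keeps input order
    # Fan in: a single downstream step sees all results.
    return load(results)


if __name__ == "__main__":
    print(run_pipeline())
```

An orchestrator adds what this sketch lacks: retries, per-item logging, and a dashboard view of each mapped run.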
I went with a cloud account for the dashboard and scheduling. The dashboard and logging are excellent.
Deploying is a science of its own. Prefect uses a pretty decoupled setup to ensure that the essential parts run in your own infrastructure, but that was not so evident from the start. I thought I would deploy to Prefect Cloud and be ready to go. It took me some days to figure out that my flows were in the dashboard but didn’t run because I had missed an essential part of the setup: the agent, which is where the flow actually runs. I then used a VM to set up a simple local agent, because Kubernetes is still not my go-to.
#2 – Google Cloud Workflows
I do most of my stuff on GCP, and using Prefect would introduce an additional tool to the stack.
Robert Sahlin wrote good things about Workflows and confirmed them again when I asked him.
So I did the same setup with GCP Workflows.
One plus: I can keep my usual Cloud Run setup. It takes some time to understand the structure and principles of a workflow definition. You define the workflows in either YAML or JSON, and you can use some prebuilt GCP connectors to stream data into BigQuery easily. This works very well when you work 100% or mostly with GCP services. The YAML structure needs some rethinking, and I would have loved a more Pulumi-like approach, so that I don’t have to learn a new interpretation of control syntax.
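To give a feel for the structure, here is a minimal sketch of a two-step workflow definition. Since Workflows accepts JSON as well as YAML, the sketch builds the definition as a Python dictionary and serialises it, which is a small step towards the code-first style mentioned above. The service URL, the `/extract` and `/load` paths, and the step names are hypothetical, and the sketch assumes a Cloud Run service that accepts authenticated POST requests.

```python
import json

# Hypothetical Cloud Run service URL; replace with your own.
SERVICE_URL = "https://my-pipeline-xyz.a.run.app"


def cloud_run_step(name: str, path: str, result: str) -> dict:
    """Build one workflow step that POSTs to a Cloud Run endpoint."""
    return {
        name: {
            "call": "http.post",
            "args": {
                "url": f"{SERVICE_URL}{path}",
                # OIDC auth lets the workflow call a private Cloud Run service.
                "auth": {"type": "OIDC"},
            },
            # Variable name that holds the HTTP response for later steps.
            "result": result,
        }
    }


def build_workflow() -> dict:
    """Assemble the full definition: extract, then load, then return."""
    return {
        "main": {
            "steps": [
                cloud_run_step("extract", "/extract", "extract_response"),
                cloud_run_step("load", "/load", "load_response"),
                # Return the body of the last response as the workflow result.
                {"finish": {"return": "${load_response.body}"}},
            ]
        }
    }


if __name__ == "__main__":
    # Print the JSON definition; it can be saved to a file for deployment.
    print(json.dumps(build_workflow(), indent=2))
```

The printed JSON can be saved to a file and deployed, for example with `gcloud workflows deploy`. Steps run in the order they appear in the list, which covers a simple pipeline without any extra control syntax.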
But now, I have a workflow running for a simple API-based pipeline.
I will definitely keep an eye on Prefect; v2 in particular looks like a great iteration. But for me it would be more of a choice for a larger setup with many different pipelines.