Data ownership is an important topic at the moment

Dear Data-Traveller, please note that this is a LinkedIn-Remix.

I posted this content on LinkedIn in March 2022, but I want to make sure it doesn't get lost in the social network abyss.

For better accessibility, and as a backup of our own content, we repost the original text here.

Have a look, leave a like if you like it, and join the conversation in the comments if this sparks a thought!

Original Post:

Plain Text:

Who owns the data?

Data ownership is an important topic at the moment, for several reasons:
– GDPR motivates us to assess the risk of collected user data, which is easier when I own the data
– Part of a company's value is based on the data it owns
– Customers are asking for access to their data, which is hard to provide when it's not in your hands

In most setups, companies hire third-party services to manage their data. An external service collects, processes, transforms, models, stores, and presents the data.

This is the Google Analytics scenario (but also Amplitude, Mixpanel, Piwik,…). All these services offer data exports into data warehouses you have access to. Today, most of the exports are in “raw” format. Raw is debatable here since the data is modeled and cleaned before it arrives.

Somewhere in the middle are services like Segment. They collect, process, and slightly model the data but store it in a data warehouse you manage (though they also keep the data on their end, at least for some time).

On the other end of the spectrum, you have open-source services like Snowplow, Matomo, Rudderstack, or Jitsu. Here you can set up the data pipeline in your own infrastructure and potentially fork the repository to extend the setup or strip out what you don't want. Snowplow even offers a managed service in which you still own the entire data pipeline setup in your own cloud project.

When is this important?
– when you handle sensitive data, you need complete control over the risk of access
– when you own the pipeline, you decide at which stage you remove sensitive data. Snowplow lets you do this at the very beginning of the pipeline, so you make sure the water downstream stays clean. Because once an email is in your pipelines, it's hard to remove it 100% (remember the logs)
– the data pipeline becomes a core infrastructure asset of your company
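To make the "scrub early" point concrete, here is a minimal sketch of what a PII-removal step at the very start of a pipeline could look like. This is not Snowplow's actual enrichment API; the function names, the salt value, and the event shape are all assumptions for illustration. The idea is simply that email-like values are pseudonymized before the event is written anywhere downstream (warehouse, logs, dead-letter queues):

```python
import hashlib
import re

# Crude email detector for the sketch; a real pipeline would use
# field-level configuration rather than pattern matching alone.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def pseudonymize(value: str, salt: str = "rotate-me") -> str:
    """Replace an email with a salted SHA-256 hash so joins still work
    but the raw address never reaches storage."""
    return hashlib.sha256((salt + value).encode()).hexdigest()

def scrub_event(event: dict) -> dict:
    """Return a copy of the event with email-like values pseudonymized.
    Run this as the FIRST stage, before anything is persisted."""
    clean = {}
    for key, value in event.items():
        if isinstance(value, str) and EMAIL_RE.fullmatch(value):
            clean[key] = pseudonymize(value)
        else:
            clean[key] = value
    return clean

event = {"user_email": "jane@example.com", "page": "/pricing"}
scrubbed = scrub_event(event)
# the raw address no longer appears anywhere in the stored event
assert "jane@example.com" not in scrubbed.values()
```

Because the hash is deterministic per salt, you can still count unique users or join events, while rotating the salt lets you sever that link later if required.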

Tomorrow I will show how quickly you can set up a fully owned Snowplow data collection pipeline on GCP. Sign up for the event so you can get the recording afterward.

Link to Event: