
Minor Gripe

2020-07-17 -- Comfynets and owning data

Chris Ertel

Introduction

I’ve been looking more and more at the ideas behind building comfynets–small self-hosted collections of tools and pages for communities. I’ll probably do a writeup on the sort of stuff I’d like to see in one.

Anyways, one of the bits of fallout from a backchannel I sysadmin involves the custody of data. We’ve got a chat instance set up that supports persistent messaging and so forth, but it’s backed by a database. I’m currently the only one with access to said database, and I’m eventually going to hand the whole mess off to somebody else in the community since I don’t want the responsibility anymore. But that’s a second tangent and writeup.

But, thinking about that stuff has got me considering how users could do their own data storage without service owners ever having custody of private user data. That is what this post will be about.

Defining the problem

As a user:

As a service developer:

Now, if those aren’t the concerns you have, obviously this isn’t going to be the direction you want to go. That’s cool, go build your own authoritarian comfynet. That’s your business.

A world without databases

What does this get us?

Imagine now that when you sign up to use a service, you also give it access to a place you control where it can put all the state and stuff it needs to be concerned about for you. Let’s call it a stuffstore, because it’s a place you own where the services store their stuff about you. I think it sounds comfy too, and I have tired of the banal and technical language of our craft.

It would probably look and feel a lot like OAuth workflows–and a clever person might even suggest that, hey, why not have an OAuth2 provider or whatever that also doubles as a stuffstore. The comedy option is that the OAuth workflow also gives the service the delegated authority it needs to interact with your stuffstore.

Anyways, let’s pretend that you have added a todo service. The way it would work is:

  1. You sign up for the todo service.
  2. Todo service prompts you for your stuffstore info (or, alternately, requests permissions via OAuth).
  3. The service initializes its partition of the stuffstore.
  4. You use the todo service and create items.
  5. When the todo service handles your requests, it calls out to your stuffstore.
  6. When you get finished with the todo app, you revoke its access.
  7. You sign up for todo-alt service, which has an import workflow for users of the todo service.
  8. You grant the todo-alt service read permissions, and it migrates over what it needs to its own new partition.
  9. You happily continue using your data.
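
If the stuffstore were, say, a small Postgres database you control (one possible implementation, which I get into below), the grant and revoke steps might map onto plain role management. A sketch, with all the role and schema names invented:

-- steps 2/3: the todo service gets credentials and its own partition
CREATE ROLE todo_service LOGIN PASSWORD 'hunter2';
CREATE SCHEMA todo;
GRANT USAGE, CREATE ON SCHEMA todo TO todo_service;

-- step 6: you're done with the service, so cut it off
REVOKE ALL ON SCHEMA todo FROM todo_service;

-- steps 7/8: todo-alt gets read-only access for its import workflow
CREATE ROLE todo_alt_service LOGIN PASSWORD 'hunter3';
GRANT USAGE ON SCHEMA todo TO todo_alt_service;
GRANT SELECT ON ALL TABLES IN SCHEMA todo TO todo_alt_service;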

I don’t know about you, but I think this is super neat. There’s a neat joke about all this coming up, but we’ll get there.

Adding complications

So, there are some obvious shortcomings.

What happens if a service can’t access a stuffstore for a user for whatever reason?

Well, the same thing that would happen if an ordinary database connection failed. The usual mitigations apply: worker queues for things like migrations, transactions to help survive failures in the middle of operations, and so forth.

So, is this like a SQL thing or a document store over an endpoint or what?

My first thought about how to implement a stuffstore is basically as a document store a la Mongo…you could even bodge together a simple version using a jsonb column in Postgres where the table is something like:

-- uuid-ossp provides uuid_generate_v4()
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE naive_stuffstore(
    -- one row (partition) per service the user has authorized
    partition uuid primary key default uuid_generate_v4(),
    service_id uuid not null,
    -- the service's entire document blob for this user
    stuff jsonb not null default '{}'::jsonb
);
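
To make that concrete, here’s roughly how the todo service from earlier might poke at its partition. The service_id and the shape of the stuff document are made up for illustration:

-- the todo service initializes its partition
INSERT INTO naive_stuffstore (service_id, stuff)
VALUES ('6f1b2c3d-4e5f-6071-8293-a4b5c6d7e8f9', '{"todos": []}');

-- appending a new todo item to the service's document
UPDATE naive_stuffstore
SET stuff = jsonb_set(stuff, '{todos}', stuff->'todos' || '"buy milk"'::jsonb)
WHERE service_id = '6f1b2c3d-4e5f-6071-8293-a4b5c6d7e8f9';

-- revoking the service is as blunt as deleting its partition
DELETE FROM naive_stuffstore
WHERE service_id = '6f1b2c3d-4e5f-6071-8293-a4b5c6d7e8f9';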

Super janky, but you can see how you might augment it with easy auditing via update triggers and get a nice audit log or event log for the user to track what services are doing.
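
For instance, the auditing might look something like this (a sketch; the audit table and trigger names are my own invention):

CREATE TABLE stuffstore_audit(
    id bigserial primary key,
    partition uuid not null,
    service_id uuid not null,
    old_stuff jsonb,
    new_stuff jsonb,
    changed_at timestamptz not null default now()
);

CREATE FUNCTION log_stuffstore_change() RETURNS trigger AS $$
BEGIN
    -- record every change a service makes, before and after
    INSERT INTO stuffstore_audit (partition, service_id, old_stuff, new_stuff)
    VALUES (NEW.partition, NEW.service_id, OLD.stuff, NEW.stuff);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stuffstore_audit_trigger
    AFTER UPDATE ON naive_stuffstore
    FOR EACH ROW EXECUTE FUNCTION log_stuffstore_change();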

Unfortunately, a lot of existing software isn’t set up to use document databases instead of normal SQL systems. So, it’d be a lot less work to find some way of giving the service software the impression it’s talking to a normal database (which just happens to only have one user).

One way of doing this I thought about was using Postgres foreign data wrappers. You could do something like:

  1. Client joins your service.
  2. You ask client for their stuffstore access.
  3. Client gives you access.
  4. You request a unique set of database credentials for your service to talk to their stuffstore.
  5. You create a foreign server and user mapping via postgres_fdw, and use IMPORT FOREIGN SCHEMA to create a schema just for that client backed by their stuffstore.
  6. You run your normal business logic and migrations, but every time you create a connection to the database for a user you use their stuffstore creds and something like schema search paths to scope your operations to their schema (sketched below).
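
In Postgres terms, steps 4 through 6 might look roughly like the following. The host, database, credentials, and schema names are all invented for illustration:

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- point at this one client's stuffstore database
CREATE SERVER client_42_stuffstore
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'stuffstore.example.com', dbname 'stuffstore');

-- the per-service credentials the client issued to us in step 4
CREATE USER MAPPING FOR CURRENT_USER
    SERVER client_42_stuffstore
    OPTIONS (user 'todo_service', password 'hunter2');

-- mirror their tables into a schema dedicated to this client
CREATE SCHEMA client_42;
IMPORT FOREIGN SCHEMA public
    FROM SERVER client_42_stuffstore INTO client_42;

-- per-connection: scope unqualified table names to this client's schema
SET search_path TO client_42;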

The main idea is that this would mean you could apply some (admittedly annoying) patches to an existing codebase and make it compatible with stuffstores instead of needing its own database.

(Okay, I lied…the service would still need its own database to track non-user data and also to keep track of authorizations for stuffstores. Probably.)

Okay, but how do I support use cases where I need to get at multiple users’ data?

Ugh, yeah, this is the next gross thing. If what you need to do is simple analytics, you just march through all of your authorized stuffstores and run your aggregates against them. You could probably even do this with materialized views if you really were averse to modifying your normal logic. The quality of your data will suffer, though, if some of the stuffstores are unavailable or have revoked access.
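
Building on the FDW setup from earlier, the materialized view version might look something like this (a sketch; the client schemas and todos table are invented, and in practice you’d generate the UNION ALL dynamically from your list of authorized clients):

CREATE MATERIALIZED VIEW todo_counts AS
    SELECT 'client_42' AS client, count(*) AS todos FROM client_42.todos
    UNION ALL
    SELECT 'client_43' AS client, count(*) AS todos FROM client_43.todos;

-- the march across stuffstores happens at refresh time, and it will
-- fail outright if any foreign server is unreachable
REFRESH MATERIALIZED VIEW todo_counts;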

Personally, I think this is okay because I think it’s basically predatory and problematic to mine user data. But, I also see how this would make it harder to sell your users during an acquisition or whatever. Again, this is for a comfynet and not for startupcanistan.

Setting aside the analytics workflow, though, there’s also the issue of composite services–services where the value truly is based on bringing together multiple users and their data. A trivial example of this would be the chat app that got me thinking about this stuff…should my stuffstore hold my messages? Should it hold my friends’ messages too? Should the service knit together messages from all of the relevant stuffstores?

The ownership of data and moral obligations are a lot simpler when it’s just a user’s stuffstore in a bubble, but if the data is the result of interactions and collaborations with other users, it gets a lot trickier. I’d argue that the most philosophically pure (and also worst-performing) option is the last one, where the service pulls from all attached stuffstores in the case of something like a shared chat with history. But consider the case of a document with shared authorship.

Now, you could argue that if you’re in a CRDT-type situation you could just treat all of the stuffstores as event streams to interleave to create the final product–which is not bad at all from a technical standpoint–but again I’m concerned about what happens if one of the stuffstores becomes unreachable.

For a comfynet with non-collaborative services, stuffstores as sketched here work…but outside of that we have a lot of work to do.

What keeps a service from just copying all of the data and not respecting the privacy of stuffstores?

There is no protection against data exfiltration by malicious services. There’s no similar guarantee with the existing state of the art, so at least we haven’t made it worse. If you are worried about data exfiltration, you need to run the service in an environment you completely control.

Are people going to host their own stuffstores?

I imagine that if you shipped a system for this, a lot of folks totally would. There’s also nothing at all preventing people from paying hosting companies to handle the stuffstores for them, and presumably there could be a vibrant business ecosystem around the value-adds for things like better auditing of stuffstores, better backup and versioning, better tools for analyzing how your stuffstore is being used, and so forth.

Unfortunately, paying somebody to host your stuffstore does kinda go directly against the philosophical advantages underpinning it. People make the same mistake with Mastodon (whose whole reason for existence is federation and self-hosting) by using the official servers instead of hosting their own instances. There’s no helping some people.

Doesn’t this also kinda imply that sensitive business structures are going to leak into the hands of users? Like, wouldn’t a stuffstore with my service’s schema in it leak my IP?

Well, yeah, if you don’t take care to mark which tables are the ones needed for user data and which tables are used for your own business processes.

What happens if a service encrypts user data when it puts it in the stuffstore?

That’d be really gross. Don’t use those services.

Conclusion

I think it’d be really neat to do a proof of concept of this. I bet you could probably wire up PostgREST to do the OAuth stuff and serve an API to request DB credentials, and have it use some stored procedures to handle things like creating the proper partitions and accounts for services.
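
The provisioning procedure might look something like this (a sketch against the naive_stuffstore table from earlier; the names are invented, and real credential handling would want much more care, plus something like row-level security):

CREATE FUNCTION provision_partition(svc_id uuid, svc_name text, svc_password text)
RETURNS uuid AS $$
DECLARE
    part uuid;
BEGIN
    -- a dedicated login role for the service to connect as
    EXECUTE format('CREATE ROLE %I LOGIN PASSWORD %L', svc_name, svc_password);
    -- carve out the service's partition
    INSERT INTO naive_stuffstore (service_id)
    VALUES (svc_id)
    RETURNING partition INTO part;
    -- let the service at the stuffstore table (row-level security would
    -- narrow this down to only its own rows)
    EXECUTE format('GRANT SELECT, INSERT, UPDATE ON naive_stuffstore TO %I',
                   svc_name);
    RETURN part;
END;
$$ LANGUAGE plpgsql;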

I don’t think this is likely to replace the existing way of doing things, but it’s a neat research idea that might even make it out to normal hobbyist use.

One thing I want to note is that a lot of this is basically rebuilding the personal desktop on top of web services. Like, we used to own our own data. If you run your own applications on your own hardware without a network connection, you own your own data.


Tags: thoughts
