Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database restructure #458

Draft
wants to merge 32 commits into
base: main
Choose a base branch
from
Draft

Database restructure #458

wants to merge 32 commits into from

Conversation

aumetra
Copy link
Member

@aumetra aumetra commented Dec 16, 2023

Part of #418

Rough outline:

  • Create a separate key table to store the keys and their IDs and references to an account that owns them
  • Move the ActivityPub-specific things, such as collection URLs into its seperate table
  • Rename the actor_type field to something more generic (such as just type; renamed to account_type)
  • Remove blurhash column (not sure if we even want blurhashes)
  • Move domains to a separate table, designate owners, and whether they can be globally used
  • Revisit the roles table and make it suitable for custom roles
  • Add a protocol marker to each account , post, like, etc. to identify from what network they originate from (rationale: knowing the account is enough to know how to federate)

TODO:

  • Review the cryptographic keys name and reference structure
  • Get third-party review

@aumetra
Copy link
Member Author

aumetra commented Jan 21, 2024

Structure is looking way too complicated now.. kinda..
kitsune_dev

@aumetra
Copy link
Member Author

aumetra commented Jan 21, 2024

Do we need a protocol marker on each element? Maybe not. I think having it for the account itself is sufficient. It doesn't matter where the elements in particular come from, does it?

@aumetra
Copy link
Member Author

aumetra commented Jan 21, 2024

Ah, yeah, for the domain stuff.. I need to vendor my DNS verification library into the monorepo at some point..
About that. I need to figure out monorepo licensing, because I think I want the lib/ directory licensed under MIT or Apache-2.0, and the rest under whatever more restrictive license we decide on.

@aumetra aumetra force-pushed the aumetra/db-restructure branch from 6622b51 to 619bf9c Compare January 21, 2024 16:04
@aumetra aumetra force-pushed the aumetra/db-restructure branch from 4efb621 to d9b066d Compare March 17, 2024 11:14
@aumetra aumetra force-pushed the aumetra/db-restructure branch from d9b066d to 751c935 Compare June 1, 2024 19:51
@erlend-sh
Copy link

For the two remaining TODOs..

  • Review the cryptographic keys name and reference structure
  • Get third-party review

..couldn't this wait until after the initial merge? Kitsune isn't used for real in production yet anyhow. It'll be a lot easier to get eyes on this and stir up interest among prospective reviewers if the WIP is merged and put to the test.

@aumetra
Copy link
Member Author

aumetra commented Jun 10, 2024

Oh yeah, probably can. I sent the diagram to @perillamint already, they are looking at it next time they have time. After that I'll just hit merge on it and see what we do after that.

@aumetra
Copy link
Member Author

aumetra commented Jun 10, 2024

The "third party review" is mostly about "getting someone else to look at the database schema that I didn't overlook anything or completely overcomplicated everything

@aumetra aumetra force-pushed the aumetra/db-restructure branch from 1e03b02 to 9d9dc86 Compare June 10, 2024 11:34
@perillamint
Copy link
Contributor

perillamint commented Jun 10, 2024

Overally good, but

Some limitations I noticed:

  • This structure can't hold language code that does not adhere to ISO-639 (because it uses enum)

Also, I think it would be good to have name and public_visibility column in user_roles table (to display role badges on user profiles)

Additionally, it seems the current schema does not have a way to store pinned user notes.

For Mastodon API compatibility, I think it needs posts_bookmark to support the bookmark feature

@aumetra
Copy link
Member Author

aumetra commented Jun 12, 2024

This structure can't hold language code that does not adhere to ISO-639 (because it uses enum)

I think this is a fair tradeoff regarding storage and computing vs. storing non-normalized data

Also, I think it would be good to have name and public_visibility column in user_roles table (to display role badges on user profiles)

We have the name of the role inside the roles table. The users_roles table is just a PK-FK mapping table to allow for M:N mapping of users <-> roles.

Not sure if we need the visibility in this iteration. We can always add them later and then collapse the migrations into one each release, basically saying "you need to update in lock-step from version to version" (similar to what Nextcloud does with its major versions, where you can only update one at a time).

But about the bookmark table, that's fair. That's something I'll add now. Because it's an endpoint I want to support in the near future (and this time actually persistent bookmarks, not like the mess Mastodon has where bookmarks just disappear).

Thoughts @perillamint?

@aumetra aumetra force-pushed the aumetra/db-restructure branch from 13596d1 to 5ae824e Compare June 12, 2024 19:49
@perillamint
Copy link
Contributor

This structure can't hold language code that does not adhere to ISO-639 (because it uses enum)

I think this is a fair tradeoff regarding storage and computing vs. storing non-normalized data

I think if it is nullable, I don't think it wouldn't be problem (to handle Misskey posts (they don't send notes with locale property) and weird locales)

Also, I think it would be good to have name and public_visibility column in user_roles table (to display role badges on user profiles)

We have the name of the role inside the roles table. The users_roles table is just a PK-FK mapping table to allow for M:N mapping of users <-> roles.

Not sure if we need the visibility in this iteration. We can always add them later and then collapse the migrations into one each release, basically saying "you need to update in lock-step from version to version" (similar to what Nextcloud does with its major versions, where you can only update one at a time).

I think it could be added after we decide to implement role visibility.

Additional information about its use cases: I saw some cases when some fedmins mark admin account with label (and show it publically to mark it as official account and handle inquiries) or some Misskey admins are giving their users with some kind of "joke" roles

But about the bookmark table, that's fair. That's something I'll add now. Because it's an endpoint I want to support in the near future (and this time actually persistent bookmarks, not like the mess Mastodon has where bookmarks just disappear).

👍

@aumetra
Copy link
Member Author

aumetra commented Jun 13, 2024

I think if it is nullable, I don't think it wouldn't be problem (to handle Misskey posts (they don't send notes with locale property) and weird locales

Actually, my idea here was to try to guess the locale using the language detection backend we have (so either via whatlang or whichlang).

That way we can have somewhat accurate predictions and default to English otherwise. Not sure if we should default to English or introduce nullability..

@aumetra aumetra force-pushed the aumetra/db-restructure branch 3 times, most recently from ac46b26 to cd34376 Compare June 21, 2024 19:21
@aumetra aumetra force-pushed the aumetra/db-restructure branch from cd34376 to aeb1028 Compare June 23, 2024 12:11
@perillamint
Copy link
Contributor

perillamint commented Jun 25, 2024

Btw, one question: Since we have to sanitize HTML from the ActivityPub status, I think it would be a good idea to store sanitized HTML in separated form (because running ammonia (or other sanitizer) every time on client request would be quite expensive).

Is posts.content and posts.content_source meant for that purpose?

@aumetra

@aumetra
Copy link
Member Author

aumetra commented Jun 25, 2024

Btw, one question: Since we have to sanitize HTML from the ActivityPub status, I think it would be a good idea to store sanitized HTML in separated form (because running ammonia (or other sanitizer) every time on client request would be quite expensive).
Is posts.content and posts.content_source meant for that purpose?

These two essentially have the following uses:

  • content contains the processed and sanitized status (i.e. the Markdown is already processed into HTML and then put through bubble-bath to clean it)
  • content_source contains the unsanitized and unprocessed status (that column is there so users can edit their Markdown or HTML statuses before we re-broadcast the processed status in case of an in-place edit)

in_reply_to_id UUID,
reposted_post_id UUID,

is_sensitive BOOLEAN NOT NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misskey sets the sensitive flags on a per-medium basis, i.e. authors can attach non-sensitive media and sensitive media to a single post at the same time. Is this a use case Kitsune would like to support?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea actually.. and for Mastodon, where we have to have a single sensitive flag, we simply do an OR chain for the is_sensitive (or an .any on an iterator, in Rust terms :P)

@tesaguri
Copy link
Contributor

tesaguri commented Jul 8, 2024

#190 requires the original activity JSONs (for signed local posts at least) to be stored in the database in order to serve client-signed objects, IIUC.

@aumetra
Copy link
Member Author

aumetra commented Nov 5, 2024

#190 requires the original activity JSONs (for signed local posts at least) to be stored in the database in order to serve client-signed objects, IIUC.

I will have a look at that. It's just a lot of redundant information we would store, and I personally don't like the model where we would have to basically transform PostgreSQL into a document database (aka the Pleroma data model).

I tried that data model, it was bad, I didn't like it. So perhaps compressed binary data. Or we just don't give a damn. Whatever. We'll see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants