Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[META] Draft support in the current ("legacy") upload API #17230

Open
5 tasks
woodruffw opened this issue Dec 5, 2024 · 12 comments
Open
5 tasks

[META] Draft support in the current ("legacy") upload API #17230

woodruffw opened this issue Dec 5, 2024 · 12 comments
Assignees
Labels
APIs/feeds feature request meta Meta issues (rollouts, etc)

Comments

@woodruffw
Copy link
Member

woodruffw commented Dec 5, 2024

Support for drafted uploads is a long-standing feature request, dating back (at least) to 2015 with #726.

Since then there's been been a proposed PEP, PEP 694, which defines a new upload API. This API includes support for drafted uploads, among other larger features (like upload session management, resumption, and batched uploading).

In the interest of expedience, @DarkaMau, @facutuesca, and I (@woodruffw) are looking at adding drafting support to just the legacy API, in such a way that a future implementation of PEP 694 (such as one built on #8941 or #16277) can reuse the building blocks within a new endpoint.

To keep things tractable (and keep PRs small + reviewable), here's the units of work we're proposing:

  • Add published: datetime | None = datetime.now to the Release model, with a schema migration. This can be picked/adapted from PEP 694 (resolved merge conflicts from PR #8941) #16277

  • Add a PyPI-specific metadata field (or HTTP request header) to the upload endpoint, with the following semantics:

    • If the project or release is being created for the first time, sets published = None
    • If the release already exists and was not already marked as a draft, fails the upload (since this suggests a mixed/confused state on the user's side)

    The current work in Draft Release support in PyPI #8941/PEP 694 (resolved merge conflicts from PR #8941) #16277 uses the Is-Draft: true | false header, but I propose we prefix this with X-PyPI- to avoid standard header conflicts, so we'd check X-PyPI-Is-Draft or similar.

  • When Release.published is None, we should not present it in the standard indices/APIs. Instead, we'll serve it at a temporary draft index. Draft Release support in PyPI #8941 has this as /simple/draft/<draft_hash>, but I propose we drop the hash and do this with a URL that mirrors the existing hierarchy: /simple/draft/<name>/<version>/ or even just /simple/draft/<name> with all drafts for the project being aggregated.

  • When Release.published is None, the given release should not appear as the latest release on the web view. Instead, it should have presentation behavior similar to yanked releases: we can show it on the /project/<name>/#history timeline, but with a badge indicating that it's a draft. When navigating to /project/<name>/<version>, draft versions should be rendered like normal versions but with two differences:

    • A warning flash/banner/element on the top of the page, similar to what appears when a release has been yanked
    • The pip install ... instructions should be specialized to show the draft index, per the step above

All told, I think this gives us a set of independent tasks to accomplish drafting. Our proposal is that we try and use as much of #8941 and #16277 here as possible, but do so via independent PRs to keep things as small as possible and reduce reviewer burden.

Separately, as client-side/tooling changes that'll need to occur:

  • twine and similar uploading tools will ideally learn a --draft or similar flag that sets the appropriate header

CC @di for thoughts, and thanks @warsaw and @alanbato for the work you've done on this already!

@woodruffw woodruffw added APIs/feeds feature request meta Meta issues (rollouts, etc) labels Dec 5, 2024
@woodruffw woodruffw self-assigned this Dec 5, 2024
@DarkaMaul
Copy link
Contributor

When Release.published is None, the given release should not appear as the latest release on the web view. Instead, it should have presentation behavior similar to yanked releases: we can show it on the /project//#history timeline, but with a badge indicating that it's a draft. When navigating to /project//, draft versions should be rendered like normal versions but with two differences:

I worry that displaying draft releases in the project release timeline would confuse the end-users (i.e., not the project developers).
Having them accessible with the correct (and guessable?) URLs while keeping them unlisted would, I think, create a clearer separation between drafts and pre-releases.

@warsaw
Copy link
Contributor

warsaw commented Dec 5, 2024

Thanks, I'll review this in more detail when I get a chance. But just to let you know, I'm almost done with my PEP 694 rewrite. Keep in mind that PR will still go through one more pass on my side, but then I plan on rebooting the DPO thread.

@woodruffw
Copy link
Member Author

I worry that displaying draft releases in the project release timeline would confuse the end-users (i.e., not the project developers).
Having them accessible with the correct (and guessable?) URLs while keeping them unlisted would, I think, create a clearer separation between drafts and pre-releases.

That's a good point -- putting it in the timeline might be unnecessarily confusing. At the same time, I think we should render the draft release itself as a page, for a few reasons:

  • It'll be hard to teach users how to use the temporary "draft indices" without having the information presented somewhere (besides the docs, which people tend to only read as a last resort). Having some kind of HTML view of drafted releases with a specialized pip install command/similar will help get users unstuck when they do need to use drafts.
  • Without some kind of render, the project creation flow with a drafted first release would result in a mixed state: the project would exist, but no /project/<name>/ view would exist for it. I think we want to avoid that, since that would cause additional user confusion + require tools like twine to specialize their printing behavior to avoid showing a friendly URL in that case.

As a middle ground here, we could leave drafted projects out of the timeline but add a flash/element to the top of /project/<name>/ when a newer draft is present.

@warsaw
Copy link
Contributor

warsaw commented Dec 5, 2024

that PR will still go through one more pass on my side

This pass includes updating the PR to align with the current state of the resumable upload RFC draft which formed the original basis for PEP 694.

@konstin
Copy link
Contributor

konstin commented Dec 5, 2024

Asking as a uv publish implementer, how is re-uploading files is handled, both when identical file is re-uploaded and when a different file is re-uploaded?

@woodruffw
Copy link
Member Author

Asking as a uv publish implementer, how is re-uploading files is handled, both when identical file is re-uploaded and when a different file is re-uploaded?

I think the draft feature here would not affect the current behavior: files are unique at the index level, so draft releases would have immutable file -> content mappings like all other releases.

(In other words, drafts wouldn't be for iterating on a release's files; they're only for staging a release so that all files become public in one atomic step + allowing projects to share a new release before it becomes resolvable by default.)

@facutuesca
Copy link
Contributor

facutuesca commented Dec 5, 2024

they're only for staging a release so that all files become public in one atomic step (a) + allowing projects to share a new release before it becomes resolvable by default (b).

How important is (b) ?. In my mind, draft releases are a tool for maintainers to:

  • atomically publish all files in a single release (same as (a))
  • test that the upload worked as expected before making the release public (same as using TestPyPI)

Those could be achieved without making draft releases public to users, and without the need to expose a temporary draft index. Then draft releases would be only visible to maintainers in the web UI, as a tool to double-check and more neatly publish a release.

So my question is, how common is the problem that (b) is trying to solve, and is it worth the added complexity? With private draft releases, only publishing tools need to implement support for uploading draft releases, whereas if we make them public, they're suddenly a concern for any tool that downloads packages from the index.

@woodruffw
Copy link
Member Author

So my question is, how common is the problem that (b) is trying to solve, and is it worth the added complexity? With private pre-releases, only publishing tools need to implement support for uploading draft releases

I don't have good numbers here, but my understanding is that (b) is a recurring request from projects with long or complicated release procedures (e.g. due to complex native builds or their own build farms): having to do a pre-release means having to run their release processes twice, and it means more downstream test churn (since users looking to evaluate the pre-release need to upgrade twice, rather than opting into the draft and then rolling forwards).

whereas if we make them public, they're suddenly a concern for any tool that downloads packages from the index.

I think this is ameliorated by the split-index design: drafts won't appear in /simple/ or /simple/<name>/, so tools like pip and uv won't need to consider or filter them at all by default. They'll only be considered for resolution when a user explicitly adds their indices, e.g. via --extra-index-url.

Given that IMO the complexity here is pretty constrained: nothing will change in ordinary installation flows, and anything that uses the draft indices should interoperate fine, given that the indices will be the standard PEP 503/691 ones.

@warsaw
Copy link
Contributor

warsaw commented Dec 5, 2024

Some thoughts:

  • I prefer the term "stage" instead of "draft" and my PEP 694 rewrite uses that consistently.
  • I strongly believe stages need to be "obfuscatable". The use case is to be able to stage a release pre-announcement and then publish in one go all the wheels for that release at the moment of the announcement. I do not believe we need to lock stages down behind authenticated access. My PEP 694 rewrite accomplishes this by including an optional "nonce" in the stage creation request, which defaults to the empty string. If not given, then only the package name and version contribute to the stage token, but this nonce gives the client the option to further contribute to the stage token hash, with only information it knows. Thus, it's up to the client as to whether stage access is easily guessable or well-obscured. I think this addresses the (b) question.
  • I don't think staged files should be visible on a project's release timeline, or otherwise visible to the general public. It's confusing and potentially leaking of information (plus, remember another use case for stages is so that clients don't see incomplete releases). I'm inclined to omit them even for logged in owners of the stage, just for simplicity, but I'm not totally against it. However, through the use of the stage URL and/or stage token, programmatic access to the files in the stage (e.g. via pip's --extra-index-url) would be accessible, for testing purposes.
  • @konstin In my rewrite, stages are mutable and files in those stages can be overwritten or deleted at whim. The protocol easily supports this too. Only at the point that a stage is published do they acquire the normal constraints on mutability. I think this is a core use case for stages (i.e. since you're testing your staged release, you must be able to re-upload faulty wheels).
  • Thus, @woodruffw "In other words, drafts wouldn't be for iterating on a release's files; they're only for staging a release so that all files become public in one atomic step + allowing projects to share a new release before it becomes resolvable by default." Disagree with the first part!

@woodruffw
Copy link
Member Author

  • I prefer the term "stage" instead of "draft" and my PEP 694 rewrite uses that consistently.

No objection 🙂 -- I agree that's a better term.

I strongly believe stages need to be "obfuscatable". The use case is to be able to stage a release pre-announcement and then publish in one go all the wheels for that release at the moment of the announcement. I do not believe we need to lock stages down behind authenticated access. [...] Thus, it's up to the client as to whether stage access is easily guessable or well-obscured. I think this addresses the (b) question.

I'm a little bit murky on the value of this: is the idea to allow uploaders to hide their staged release until they want it made fully public? Given that features like attestations (and longer term, things like index-wide transparency logging) will publicly disclosure releases as they're made, I think the obfuscatory effect here is pretty limited.

(I think this is also true especially for the CI/CD context: projects that do public staged releases via GitHub or GitLab are going to have their staged release URLs in their public run logs.)

I'm also a little murky on optional client-controlled randomness: I think, if the staged release needs to be obfuscated, then it makes sense to always do it, and have the server handle the thing we're calling the staging token. My reasoning there is that clients will struggle to make informed decisions around how to turn this knob, so if we want to give them obfuscated staging URLs then we should do it for them. But as part of that, I don't want to accidentally produce an unenforceable security boundary here, per above 🙂

I don't think staged files should be visible on a project's release timeline, or otherwise visible to the general public. It's confusing and potentially leaking of information (plus, remember another use case for stages is so that clients don't see incomplete releases).

No objection, although I think we should state explicitly whether this is for user confusion reasons (a sufficient reason IMO!) or obfuscation reasons (which IMO are harder to formalize, per the transparency and CI/CD leakiness points above).

More generally, I think PyPI should retain the (currently implicit, but maybe it should be explicit) assumption that once you put something on the index, it should be considered disclosed. Obfuscation can be an OK feature for misuse-resistance purposes in that context, but I think it can't be a security boundary here.

(I'll update the top-level comment to adjust these parts, since I think you're 100% right about drafts being confusing in the web UI.)

I think this is a core use case for stages (i.e. since you're testing your staged release, you must be able to re-upload faulty wheels).

I'm not inherently opposed to mutable staging but I think it encourages a pretty different model than the one that Python packaging (IMO) is moving towards as a whole, where versions are "cheap" and the mapping between a package's identity (= distribution filename) and its contents (= strong digest) is immutable. That model is the one that PEP 740 (and other cryptographic packaging PEPs, like 458) assumes. It's also (separately) the one that PyPI is architecturally rigged for, since PyPI already has permanent, unique (filename, digest) pairs baked into the codebase.

Or perhaps as an alternate framing: I think the distinction between pre-releases and staged releases collapses too much if pre-releases become a mutable feature 😅 -- I think right now a lot of projects use pre-releases as a low-impact place to experiment with their uploads since they don't get resolved by default, and I'm not sure it makes sense for PyPI to have multiple ways to accomplish similar packaging tasks, versus having pre-releases be the "sandbox" feature and staging be the "make a release atomic" feature.

@woodruffw
Copy link
Member Author

That model is the one that PEP 740 (and other cryptographic packaging PEPs, like 458) assumes. It's also (separately) the one that PyPI is architecturally rigged for, since PyPI already has permanent, unique (filename, digest) pairs baked into the codebase.

To tack onto this: one weird (not necessarily insecure, but not ideal) consequence to mutable staged releases is that a given foo-1.2.3.tar.gz would potentially have multiple attestations for different content digests. This can happen currently anyways (since users and their CI/CD setups are in full control of signing), but it's less likely in current setups because of the larger shift towards release immutability.

The consequence of this would be some potentially useful (although admittedly narrow!) flexibility for an attacker: if foo-1.2.3.tar.gz @ sha123 is uploaded with a valid attestation, then replaced with foo-1.2.3.tar.gz @ sha456 with another valid attestation, an attacker could take advantage of a vulnerability in the first "revision" and serve it as if it was the second revision.

@warsaw
Copy link
Contributor

warsaw commented Dec 6, 2024

I'm a little bit murky on the value of this: is the idea to allow uploaders to hide their staged release until they want it made fully public? Given that features like attestations (and longer term, things like index-wide transparency logging) will publicly disclosure releases as they're made, I think the obfuscatory effect here is pretty limited.

Maybe the key difference in how we're thinking about stages is whether you would call the stage "a release" (and/or "pre-release" - a term we should probably avoid so as not to confuse with pre-release version numbers). The way I'm thinking about them, they really aren't a release, hence the term "stage"! This is not a release until the stage is actually published, at which point it becomes a release, along with all the constraints and guarantees you mention.

It seems clear that test.pypi.org is insufficient for testing purposes, so a staged release with overwrite capability and optional obfuscation provides everything needed for complex testing purposes. This is especially true for coordinated releases across many cross-dependent packages. E.g. I intend to release foo but also foo-plugin-A, foo-plugin-B, etc. I can create a stage for each of these, add all the --extra-index-url options I want, do all the final mile testing I need to do, then on Flag Day, I push the button and publish all the stages for all the packages at the same time.

I think PyPI should retain the (currently implicit, but maybe it should be explicit) assumption that once you put something on the index, it should be considered disclosed. Obfuscation can be an OK feature for misuse-resistance purposes in that context, but I think it can't be a security boundary here.

Agreed that if you really care about keeping artifacts completely secret, you can't use the staging feature. That said, I still think the optional obfuscation is useful as a middle ground between completely public and completely secret. And because the nonce in my proposal is optional, it's totally up to the client to decide. For example:

  • I want to stage a new release and let my downstreams test it out. I decide not to use a nonce and therefore the stage token is easily calculated by any interested consumer. They test things out with --extra-index-url and confirm that the packages to be published in the stage are working correctly.
  • My own CI system is building wheels and uploading them to the stage, but I don't want anybody to be able to guess (based just on project name and version) that there might be a release. I certainly don't want random users to be able to probe PyPI for upcoming releases. I share the nonce between my pipeline jobs so they can upload the wheels when they've been built, and then once all those jobs are complete, I run a final CI job to ensure that everything looks good before I publish it.

I think the distinction between pre-releases and staged releases collapses too much if pre-releases become a mutable feature 😅 -- I think right now a lot of projects use pre-releases as a low-impact place to experiment with their uploads since they don't get resolved by default, and I'm not sure it makes sense for PyPI to have multiple ways to accomplish similar packaging tasks, versus having pre-releases be the "sandbox" feature and staging be the "make a release atomic" feature.

By "pre-releases" here, you mean actual published releases with a pre-release version number. And all of what you say is fine, but it's not a substitute for a testable, pre-published, final release. Failures can and do happen between say the last rc and the final release, and it's that use case that I want to support.

The consequence of this would be some potentially useful (although admittedly narrow!) flexibility for an attacker: if foo-1.2.3.tar.gz @ sha123 is uploaded with a valid attestation, then replaced with foo-1.2.3.tar.gz @ sha456 with another valid attestation, an attacker could take advantage of a vulnerability in the first "revision" and serve it as if it was the second revision.

If we're treating stages as not-yet-published, then I don't think this is a realistic scenario. Yes, the @ sha123 attestation gets uploaded, but if we DELETE or overwrite the foo-1.2.3.tar.gz, regardless of the sha, then that should also delete any attestations connected to the original tarball. It's as if that original tarball was never uploaded.

The original PEP 694 was based on https://tus.io which then got turned into an Internet Draft. The draft has been updated since then, and I am in the process of aligning my PEP 694 update to the latest draft version. Although stages aren't part of that protocol, the ability to delete or overwrite artifacts are, and I think that has valid use cases that should be supported, at least up until the actual publish step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
APIs/feeds feature request meta Meta issues (rollouts, etc)
Projects
None yet
Development

No branches or pull requests

5 participants