
[WIP] Proof of concept of integration with Hugging Face Hub #3206

Open · wants to merge 2 commits into develop

Conversation

@osanseviero

Hi Gensim team! I hereby propose a proof-of-concept integration with the Hugging Face Hub.

As mentioned in https://groups.google.com/g/gensim/c/47hWFeRDJOA and discussed on Twitter, this integration would allow you and your users to freely download and host models from the Hugging Face Hub (https://huggingface.co/). The current implementation only adds downloading models from the Hub; here is an example which downloads https://huggingface.co/Gensim/glove-twitter-25:

import gensim.downloader as api

model = api.load("glove-twitter-25", from_hf=True)
print(model.most_similar("cat"))

Sorry for the lack of tests; I wanted to share an early demo to see if this is a line of work you're interested in. Please let me know your thoughts.

Some follow-ups:

  • Retrieve info on all models from the Hugging Face Hub.
  • Add widgets that allow users to try out the models directly in the browser.
  • Add datasets to the Hugging Face Hub.
  • Add different UI items to the Hub, such as filtering for all gensim models and automatic code snippets.

FYI @LysandreJik, @julien-c, @lhoestq.

@gojomo (Collaborator) commented Jul 29, 2021

For reasons which predate this integration demo, I believe the gensim.downloader, as currently implemented, is a security risk that would be better to eliminate rather than expand. For more details, see #2283.

(To the extent there could be value in downloading raw models from the HF hub, a purposeful API for narrowly doing that would be preferable to a parameterized from_hf switch layering more custom behavior into gensim.downloader. Is there a huggingface-hub-operation library/package that could just be imported when needed?)

@piskvorky (Owner) commented Jul 30, 2021

Thanks @osanseviero ! I'm +1 on supporting external data storages in general, including from Hugging Face Hub. Seeing as there hasn't been much activity on our gensim-data storage (none of us have the time), I'll be happy to offload this functionality to a more active hub.

I don't have the same reservations about gensim.downloader that @gojomo has; if a malicious person takes over our Github repos (gensim, gensim-data, smart_open…), they can make our users execute arbitrary code anyway. Same if they take over our PyPI account. No major additional attack surface there.

Some questions:

  • Who uploads these models to HFH and how?
  • What's the maintenance process? E.g. when we want to modify something or if there's a problem.
  • All model releases in gensim-data are immutable (by agreement). So once released, that model stays forever; any updates / fixes must come as a new, separate model release. Can HFH guarantee that also technically?

@osanseviero (Author)

Let me answer your questions:

  • Who uploads these models to HFH and how?

For the canonical models (the existing models under Releases), the models would live in the Gensim organization on the Hub (https://huggingface.co/Gensim). Only organization members would be able to upload models.

If community members are interested in uploading their own models, they can do so by uploading to their own accounts (not the Gensim organization). I think this might raise security concerns (running arbitrary code), but it can be supported if there is interest. (In the current PR, all downloaded models come from the Gensim organization, over which you would have full control.)

What's the maintenance process? E.g. when we want to modify something or if there's a problem.

Answering this and the previous question, all Hugging Face repos are git-based. This means that you can simply do git clone, git commit, git push and your standard git workflows in order to upload new models or update existing ones. There is also commit history, versioning and branches that should help with this. https://huggingface.co/docs/hub/main#whats-a-repository
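
For illustration only (not part of this PR), a member of the Gensim organization could update a model either with plain git, or programmatically with a huggingface_hub upload helper; the file name and commit message below are placeholders, not an existing release layout:

from huggingface_hub import HfApi

api = HfApi()  # authenticates with a stored token, or pass token= explicitly
api.upload_file(
    path_or_fileobj="glove-twitter-25.gz",   # local artifact to publish (placeholder name)
    path_in_repo="glove-twitter-25.gz",      # destination path inside the Hub repo
    repo_id="Gensim/glove-twitter-25",       # repo owned by the Gensim organization
    commit_message="Update model artifact",  # recorded in the repo's commit history
)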

All model releases in gensim-data are immutable (by agreement). So once released, that model stays forever; any updates / fixes must come as a new, separate model release. Can HFH guarantee that also technically?

I don't think we can technically guarantee this, but if a model repo is not updated, then there are no changes. New models can go into new, separate repos.
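
One technical note in that direction: the huggingface_hub download helpers accept a revision argument, so a consumer who wants a fixed artifact can pin an exact commit instead of tracking the default branch. A rough sketch (the file name and the commit value are placeholders):

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Gensim/glove-twitter-25",  # repo in the Gensim organization
    filename="glove-twitter-25.gz",     # placeholder file name
    revision="abc123",                  # exact commit hash or tag to pin (placeholder)
)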

@gojomo (Collaborator) commented Jul 30, 2021

I don't have the same reservations about gensim.downloader that @gojomo has; if a malicious person takes over our Github repos (gensim, gensim-data, smart_open…), they can make our users execute arbitrary code anyway. Same if they take over our PyPI account. No major additional attack surface there.

A key difference is those takeovers would create a record, & require extra build/packaging steps, & pass through package-managers with their own recordkeeping. The current gensim.downloader process causes users to run novel code that (1) has no git-version history at all; (2) is never packaged/released/installed via a package manager; (3) is downloaded & executed in response to a .load() method call whose name & docs give no indication that it is installing/running arbitrary new executable code.

@osanseviero (Author)

I'm not an expert on the topic, but my experience with other libraries is that this approach is very common.

For example, sklearn (and other libraries) uses joblib and pickle, which can run arbitrary code (see the security warning in sklearn and the warning in joblib), but they are still widely used for model sharing. Installing spaCy pipelines is done by installing wheel files (pip install hf.co/.../en_core.whl), which raises concerns similar to the ones you just shared.

If models are only installed from a Hugging Face organization managed by you then, as @piskvorky mentioned, the only risk would be someone taking over your account and changing a repo, which would imply larger security problems anyway; and since git-based versioning is built in, mitigation should be straightforward.

@gojomo (Collaborator) commented Aug 2, 2021

Of course it's true that any code using unpickle on an untrusted file could inadvertently run arbitrary code from that file, even if it's thought of as mere data. Warnings reminding users of this fact should appear near any APIs using unpickle – the Gensim docs for .load() on each class should be more explicit about this.

But also, this means projects that want to distribute pickle-based files responsibly should carefully version those files, with clear audit logs of the who/how/when of their creation & release. I'm unsure if HuggingFace Hub does this. If it doesn't, and it welcomes & promotes pickle-based models from untrusted sources, then someday a bad actor will use the HuggingFace Hub to compromise its users with a juicy-seeming but trojaned model.

The various data bundles offered via gensim-data aren't version controlled, and as far as I can tell their replacement in Github's 'release' file download area can occur without any notifications to maintainers or externally-visible logging of the changes. That's also a concern! But the inclusion of an __init__.py shim of Python source code, outside any version-control or release process, that's downloaded & run without any warnings, is still an escalation over the risks of pickle files.

A fix for Gensim, or other projects in the same position, could be for any expansion of the data offerings to be clearly attributed to a responsible contributor, reviewed via normal release processes, and put into a named release with frozen artifacts. This can still pull the bulk of data from elsewhere, when needed later, on-demand - but the code would insist on a specific artifact by secure hash, ensuring the intent of the responsible contributor can't be silently/secretly corrupted, and that any/all users of the same fixed release are running the same vetted code/data (creating "safety in numbers").
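
As a minimal sketch of that last idea (insisting on a specific artifact by secure hash), something along these lines could work; the URL and digest are placeholders, not a real gensim-data release:

import hashlib
import urllib.request

ARTIFACT_URL = "https://example.com/glove-twitter-25.gz"  # placeholder release URL
EXPECTED_SHA256 = "0" * 64                                # placeholder digest frozen in the release

def fetch_pinned_artifact(url=ARTIFACT_URL, expected=EXPECTED_SHA256, dest="glove-twitter-25.gz"):
    # Download the artifact, then refuse it unless the bytes match the pinned digest.
    data = urllib.request.urlopen(url).read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected:
        raise ValueError(f"Checksum mismatch: expected {expected}, got {digest}")
    with open(dest, "wb") as f:
        f.write(data)
    return dest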

@piskvorky (Owner) commented Aug 2, 2021

Changing how the gensim-data models and corpora are uploaded / maintained is definitely possible – pending a strong contributor to actually implement it.

If I understand correctly, the options now are:

  1. Rethink and reimplement the whole gensim-data API first, before considering integration with HFH.
    Pros: We can architect a better system.
    Cons: Non-trivial work; keep in mind no one has stepped up to do it for years.

  2. Accept this PR.
    Pros: Useful, to at least N users, with N >= 3 :) (real user base for this HFH-Gensim integration unclear)
    Cons: Adding a bit of complexity we may eventually replace via 1) or 3).

  3. Other?

Ideally, we'd sit down one weekend and do 1) = rewrite the gensim-data API properly. Or possibly outsource the whole "models/corpora management" problem altogether, if it turns out something sufficient already exists. With "sufficient" both in terms of technology and long-term stability (avoiding any hard-dependencies on startups, "it's been a great journey kthxbye").

Personally, 1) is too low on my list of priorities to ever happen. So unless there's sufficient interest from you @gojomo or others, I'm in favour of 2). Because 1) can be done later anyway.

@gojomo (Collaborator) commented Aug 2, 2021

I don't get the point of this integration. If HuggingFace Hub offers convenient curated models and/or datasets, my expectation would be that a user should be able to do something like...

% pip install huggingface-hub
% python -m huggingface_hub snapshot_download "glove-twitter-25"

...then work with the explicit file(s) that were downloaded. They'd do this from their own code, perhaps leveraging any example code that's included as part of the bundle.
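
For illustration, a rough sketch of that explicit flow (the repo id and file name are assumptions, not an existing published layout):

import os
from gensim.models import KeyedVectors
from huggingface_hub import snapshot_download

local_dir = snapshot_download("Gensim/glove-twitter-25")  # download the repo to a local folder
print(os.listdir(local_dir))                              # see exactly which files arrived
vectors = KeyedVectors.load_word2vec_format(
    os.path.join(local_dir, "glove-twitter-25.gz")        # placeholder file name; word2vec text format assumed
)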

There's no need to layer extra cryptic, repository-specific downloading flags (from_hf=True) into an already dubious feature – which also would make gensim.downloader harder to deprecate or fix.

It's also a bad idea to scatter responsibility for this download-flow across two packages (Gensim & huggingface-hub) when it'd work just fine in one, with a Huggingface-maintained library fully handling the Huggingface-specific usability, security concerns, documentation, failure-modes, and so forth. It seems to me that all the code in the if from_hf: branch of this PR can more usefully live inside the huggingface_hub package - providing any theorized user benefit, without the tech debt of extra project coupling.

Regarding the independent issue of what should happen to gensim.downloader, I'll discuss that on #2283.

@osanseviero (Author) commented Aug 2, 2021

I don't get the point of this integration. If HuggingFace Hub offers convenient curated models and/or datasets, [...]

It's also a bad idea to scatter responsibility for this download-flow across two packages (Gensim & huggingface-hub) [...]. It seems to me that all the code in the if from_hf: branch of this PR can more usefully live inside the huggingface_hub package - providing any theorized user benefit, without the tech debt of extra project coupling.

Let me reply to some of the points, since maybe I was not clear on the benefits of using the Hugging Face Hub. I agree that scattering responsibility might not be the best idea, but this PR is mostly a proof of concept to begin this discussion. I would actually suggest replacing the Gensim data release system in favour of the Hugging Face Hub. You mention that this logic should live in the huggingface-hub library, but if you look at the if from_hf branch, it's only 4 lines of code and it's specific to Gensim.

from huggingface_hub import snapshot_download
name = "Gensim/" + name
download_dir = snapshot_download(name, cache_dir=BASE_DIR)
os.rename(download_dir, data_folder_dir)

I would also like to take a step back on the larger benefits of this integration and how I see this going.

  • There would be a Gensim organization in which only the core contributors of Gensim have access.
  • @gojomo, all Hugging Face models are git-based repos. In #2283 (SECURITY: api.load() recklessly downloads & runs arbitrary python code) you mention the lack of attribution as an issue. With Hugging Face, changes to model repositories would be properly authored in each commit, so there is a clear audit log of the who/how/when (author, diff, date) of each change. You can find more about HF repos here.
  • You also mention "There's zero documentation of what kind of object this opaque magic call just delivered.". Model repos usually have Model Cards that document the model (example), properly documenting things from overall usage to intended uses, biases, and training procedures.
  • Given that only members of the Gensim organization would have edit access to the repositories, you can set the documentation standards that you deem correct for the repos.

I think #2283 raises good questions and concerns about the best way to share pretrained models for Gensim; I just wanted to make sure there's a clear understanding of the potential benefits of this integration.

@gojomo (Collaborator) commented Aug 3, 2021

Let me reply to some of the points, since maybe I was not clear on the benefits of using the Hugging Face Hub. [...] if you look at the if from_hf branch, it's only 4 lines of code and it's specific to Gensim.

Indeed - it's a tiny bolt-on in an odd place, and any benefits for users to do Hub-specific operations would seem better implemented in the huggingface_hub package.

I would also like to take a step back on the larger benefits of this integration and how I see this going. [...]

Here's the thing: not a single dataset/model in gensim-data was created by the Gensim team. They're all just copied from other places for durability.

The models/formats we have so far aren't unique to Gensim. I don't think any of the existing files even use pickle formatting, so that risky avenue for inadvertently running arbitrary untrusted code isn't even a current consideration. (If we started sharing our own Gensim-native model saves, that'd become a factor.)

The datasets we've imported are a random grab-bag & I don't think we know (or even could know?) which, if any, are getting usage – compared to people who'd more likely go to the original sources, wherever they're still up.

Github hosting has a longer track record of providing stable URLs – learning your system would require a new set of maintainer accounts, provider-specific workflows, and a new dependency on a startup that might go away.

The discipline of documenting things via your 'model cards' might be good, but I have a hard time imagining any Gensim contributor further fleshing out these descriptions, of 3rd-party models, for the benefit of Huggingface Hub users. This is more-or-less an orphaned part of the project, not very central, not very maintained, & not the subject of any upcoming planned improvements. It's not been touched in over 3 years.

So I tend to think you're barking up the wrong tree.

If you like the particular datasets/models in gensim-data, I suppose you could add any or all to the HFHub yourself. (They wouldn't need to – & perhaps should not – have any 'Gensim' branding, as Gensim is not the originator of any of them.)

I'd be happy for there to be a redundant source, and if HFHub proves reliable over time I suppose these download options might eventually get mentioned in our docs/examples as an alternate or even preferred source.

@julien-c commented Aug 4, 2021

So there is a clear audit log of who/how/when (author, diff, date) of each change.

Note it's on our roadmap @huggingface to add support for GPG-signed commits

@versae commented Oct 4, 2022

It's been over a year and HF now supports GPG-signed commits. Is there any update or extra thought on this? Is handling the HF snapshot download and the Gensim load separately still the best way to host user models on the Hub for use with Gensim?

@gojomo (Collaborator) commented Oct 4, 2022

All my prior preferences/concerns still apply.

Specifically, these aren't really Gensim's datasets, just public datasets that at one point seemed useful to provide via a 2nd reliable download source. Huggingface may already have them under other names, or if Huggingface would find them useful, could add them – ideally from the original, better-documented, surely-canonical sources rather than Gensim's project-peculiar repackagings.

If the same datasets were reliably downloadable/documented from Huggingface, I'd favorably view any contributions which update Gensim docs/tutorials/examples to grab them from Huggingface, using Huggingface packages/APIs, rather than Gensim's mirrors. (That's because I find the current api/downloader approach unwise from a security-hygiene perspective & unhelpfully opaque in what it delivers. Directing coders to acquire primary files of known formats, then using Gensim filepath APIs to explicitly load them, seems better to me on every relevant dimension: explicitness, security, maintainability, teaching professional practices, etc.)

But separate from potentially preferring the hub as a download source whenever practical, bolting a from_hf= option onto the Gensim downloader – retaining the opacity/indirectness of the current approach, with a redundant 2nd code path that could break separately from the legacy option – makes no sense to me at all. It adds complexity/coupling/duplication for negligible benefit.

@lbourdois

Hi 👋

As part of a project I have to use pre-trained gensim models from different sources. To name a few: http://vectors.nlpl.eu/repository/, https://wikipedia2vec.github.io/wikipedia2vec/pretrained/, https://github.com/dccuchile/spanish-word-embeddings, https://github.com/Embedding/Chinese-Word-Vectors, https://github.com/3Top/word2vec-api, https://github.com/nathanshartmann/portuguese_word_embeddings, https://fauconnier.github.io/#data, etc.
My inventory is still ongoing, but I think the final list should contain somewhere between 600 and 1000 models (I'm at 522 right now and haven't yet tried to catalogue English models, which are the most common).

The sources being varied, there is no single protocol to load them easily (most of the time you have to download them manually; some must be unzipped, others not; the documentation is not necessarily in English; etc.).
So I expect this to lead to a very complicated pipeline for using any model whenever I want. To simplify things, it would be enough to put them all in a single place from which they could be downloaded remotely and then simply used via gensim.
I naturally thought of the HF Hub, but found that there were hardly any word-embedding-type models on it and gensim was not supported 😢
I see that discussions have taken place here.
The line model = api.load("model_name_on_the_hub", from_hf=True) appealed to me because it simply solved my problem, but it does not seem to have convinced the maintainers.
The following code does the job very well (the "lbourdois/43" model here is the 43rd model from http://vectors.nlpl.eu/repository/ that I added to my HF account in order to run the test below):

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download
model = KeyedVectors.load_word2vec_format(
    hf_hub_download(repo_id="lbourdois/43", filename="model.bin"),
    binary=True,
    unicode_errors="ignore",
)
model.most_similar("exemple")
[('exemple?', 0.764797568321228),
 ('spearphishing', 0.7313995361328125),
 ('analogie', 0.7245253324508667),
 ('exemple.dans', 0.7238249182701111),
 ('exemple.on', 0.7150753140449524),
 ('ptplot', 0.7103664875030518),
 ('mail.contoso.com', 0.7083507776260376),
 ('?prenons', 0.7068799734115601),
 ('l’acrostiche', 0.7055863738059998),
 ('concernen', 0.7036089301109314)]

Proceeding this way does not bother me personally, but I find it a pity that it is a "hidden solution": a simple additional argument like from_hf=True, as proposed, would be easier to discover via a line in the gensim documentation explaining what it is for, or via the "Use in gensim" button that would then be available on the HF Hub. I think it would also encourage people to put their word2vec-type models on the HF Hub and improve their discoverability, whereas currently models are scattered across various sources that are more or less easy to find. It might also lead to the development of demos using these models in Gradio or Streamlit applications (which could illustrate the gensim documentation).

In short, all this to say that we can indeed do without from_hf=True, but it would still be a plus to have this option available, since it would make things more user-friendly.
