Pretrain your Machine Learning Models (LLM) with your own data
Comes complete with an Automated Embedding Generator and model Q&A Interface
Automatically Pretrain your Machine Learning Models (LLM) with:
- PDF Files
- GitHub Repositories
- Scraped HTML files in a local folder
This tooling accepts a GitHub repo URL, a PDF file, or a local folder of HTML files as input and performs the following actions:
generate_embedding_github/pdf.py
• Break apart your input data into manageable chunks
• Send chunked data to Ray Serve Cluster
• Use Ray Cluster to create an embedding from our input chunks
serve run serve:deployment
• Use Ray Cluster to download a Foundational Model
• Load Foundation Model with our Embedding on top
• Start a web server and make the Model available via an API
query.py "what is the api endpoint to get a list of agents"
• Allow you to interface with the model through the API
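The chunking step above can be sketched in a few lines of Python. This is only an illustration, not the code in generate_embedding_pdf.py: the function name chunk_text, the character-based window, and the chunk_size and overlap defaults are all assumptions chosen for the example.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters.

    Hypothetical helper for illustration; the project's own chunking logic
    and sizes may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small demonstration with a tiny window so the overlap is visible.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Overlapping windows keep a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval; the right sizes depend on the embedding model in use.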
This tooling is for anyone who wants to train an LLM on a specific source of knowledge in a simple way, with all the heavy lifting abstracted away behind the scenes.
Wingman is built on top of a Ray Cluster, so it can run distributed across many machines or on a single machine.
This project (and many others) would not be possible without the following:
Link | Name | Developer | Description |
---|---|---|---|
Link | Faiss | Facebook Research | A library for efficient similarity search and clustering of dense vectors. |
Link | LangChain | LangChain | LangChain is a framework for developing applications powered by language models. |
Link | Ray | Ray Project | Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads. |
Link | Python | Python Software Foundation | Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. |
Link | PyTorch | The Linux Foundation | Tensors and Dynamic neural networks in Python with strong GPU acceleration. |
Link | Beautiful Soup | Leonard Richardson | Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. |
Link | Typing_Inspect | Ivan Levkivskyi | The typing_inspect module defines an experimental API for runtime inspection of types defined in the Python standard typing module. |
To get a local copy up and running follow these simple steps.
- 16+ GB of VRAM (24 GB recommended)
- Linux or WSL
- Any Python virtual environment you like (we like Miniconda)
Once you are in the project folder and have your venv/conda environment loaded, run the following:
- pip install -r requirements.txt
- python generate_embedding_pdf.py ./PathTo/local.pdf
- (Optional) Modify prompt in serve.py on line 30 to suit your use case
- serve run serve:deployment
- python query.py "what is the api endpoint to disable data collection for a specified agent"
Adding an interim launcher as well as a UI are both on the current roadmap for this open source edition.
python query.py "what is the api endpoint to disable data collection for a specified agent"
/api/sn_agent/agents/{agent_id}/data/off.
python query.py "what is the api endpoint for the ActivitySubscriptions API"
The API endpoint for the ActivitySubscriptions API is /now/actsub/activities.
python query.py "what is the api endpoint to get a list of agents"
The API endpoint to get a list of agents is "/api/sn_agent/agents/list".
python query.py "what is the api endpoint of the Agent Client Collector API"
The API endpoint of the Agent Client Collector API is "https://<sn_agent-host>:<sn_agent-port>/api/agent-client-collector/admin".
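The queries above go through query.py. For readers who want to call the served model from their own code, a minimal sketch follows. It assumes the model is reachable at Ray Serve's default HTTP address (http://127.0.0.1:8000/) and accepts a JSON body with a "query" field. The payload shape and the helper names build_query_request and ask are hypothetical, so check serve.py and query.py for the actual request format.

```python
import json
import urllib.request

# Assumption: Ray Serve's default HTTP address. Adjust if serve.py uses another.
DEFAULT_URL = "http://127.0.0.1:8000/"


def build_query_request(question: str, url: str = DEFAULT_URL) -> urllib.request.Request:
    """Build a POST request carrying the question as JSON.

    Assumption: the deployment reads a JSON object with a "query" field.
    """
    payload = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, url: str = DEFAULT_URL, timeout: float = 120.0) -> str:
    """Send the question to the running server and return the raw response text."""
    request = build_query_request(question, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage, once `serve run serve:deployment` is running:
#   print(ask("what is the api endpoint to get a list of agents"))
```

Building the request separately from sending it makes the payload easy to inspect or adapt if the server expects a different field name.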
For additional features, please contact us about our Enterprise Software Suite. These include Page Number Citations, Additional Programming Language Compatibility, Multi-LLM Pipelines (summarize relevant passages for better context utilization), Multimodal Model Support (train your knowledge embedding on data in images), Increased Accuracy via 3D Vector Database (Vector Cloud) Support, Agent Support (complete tasks based on facts in the ingested knowledge source), Docker & Kubernetes Support, and more.
- Create Readme
- Create requirements.txt
- Refine requirements.txt
- Basic Launcher Script
- Screenshots
- User Interface
- Embedding Library
- Quantised Model Support
- trust_remote_code via kwargs
- Advanced Device_map support
- Multi-language Support
- Chinese
- Spanish
- Russian
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Commercial use prohibited.
Contact us for a commercial license for our Enterprise Version.
Christian Mirra - LinkedIn
Project Link: https://github.com/SeeMirra/Wingman/
Todo -- This list is currently incomplete