GitHub - Unstructured-IO/community: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Open-Source Pre-Processing Tools for Unstructured Data

Welcome to the Unstructured Community! 😊

We are building an ecosystem of preprocessing pipeline tools for Data Scientists and Data Engineers, so they may quickly work through the challenge of extracting structured data from unstructured raw documents.

☕ Getting Started

Unstructured's open-source packages currently target Python 3.8. If you are using or contributing to Unstructured code, we encourage you to work with Python 3.8 in a virtual environment. You can use the following instructions to get up and running with a Python 3.8 virtual environment with pyenv-virtualenv:

Mac / Homebrew

1. Install pyenv with brew install pyenv.
2. Install pyenv-virtualenv with brew install pyenv-virtualenv.
3. Follow the instructions here to add the pyenv-virtualenv startup code to your terminal profile.
4. Install Python 3.8 by running pyenv install 3.8.15.
5. Create and activate a virtual environment by running:

pyenv virtualenv 3.8.15 unstructured
pyenv activate unstructured

You can change the name of the virtual environment from unstructured to another name if you're creating a virtual environment for a specific pipeline. For example, if you're creating a virtual environment for the SEC preprocessing pipeline, you can run pyenv virtualenv 3.8.15 sec.

Linux

1. Run git clone https://github.com/pyenv/pyenv.git ~/.pyenv to install pyenv.
2. Run git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv to install pyenv-virtualenv as a pyenv plugin.
3. Follow steps 3-5 from the Mac / Homebrew instructions.
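
Once the environment is activated, a quick sanity check can confirm that the interpreter is the one you expect. The following is a minimal sketch, not part of any Unstructured package; the helper names matches_target and in_virtualenv are hypothetical and chosen only for illustration.

```python
import sys

# Target version taken from the setup instructions above.
TARGET = (3, 8)


def matches_target(version_info=None, target=TARGET):
    """Return True when the major and minor version equal the target."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(target)


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("matches 3.8:", matches_target())
    print("virtual environment:", in_virtualenv())
```

Running this script with the unstructured environment active should report a 3.8.x interpreter inside a virtual environment.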

👐 Contributions

We welcome contributions! See all open issues for bugs, features, and enhancement requests in the community.

When contributing, please follow our Contributing to Unstructured guidelines.

Don't hesitate to reach out to us on Slack with any questions. Thank you!

📗 Key Concepts

🧱 Bricks

Bricks are the "blocks", or Python functions, from which preprocessing pipelines are made. They are organized in the Unstructured library. Collectively, they form the Swiss Army knife that Python developers can use to extract structured data from raw documents into the format that they want. Bricks may be used independently of any other Unstructured repos, under the terms of the library's license. Run pip install unstructured and you are good to go.
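
To make the idea of small, composable functions concrete, here is a minimal sketch in plain Python. It does not use the unstructured package: every function name below (split_paragraphs, collapse_whitespace, looks_like_title, to_elements) is hypothetical, and the title heuristic is an assumption made for illustration only.

```python
import re
from typing import Dict, List


def split_paragraphs(raw: str) -> List[str]:
    """Split raw text on blank lines and drop empty chunks."""
    return [chunk for chunk in re.split(r"\n\s*\n", raw) if chunk.strip()]


def collapse_whitespace(text: str) -> str:
    """Replace any run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text).strip()


def looks_like_title(text: str) -> bool:
    """Crude heuristic: short text without closing punctuation."""
    return len(text.split()) <= 8 and not text.endswith((".", "?", "!", ":"))


def to_elements(raw: str) -> List[Dict[str, str]]:
    """Compose the small functions above into one preprocessing step."""
    elements = []
    for chunk in split_paragraphs(raw):
        text = collapse_whitespace(chunk)
        kind = "Title" if looks_like_title(text) else "NarrativeText"
        elements.append({"type": kind, "text": text})
    return elements
```

Each function does one job and can be reused on its own, which is the property that lets bricks be combined into different pipelines.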

🔹 Preprocessing pipeline APIs

A preprocessing pipeline API (or just "pipeline API") is a notebook that includes a Python function capable of transforming a raw document to structured data. By following the documented conventions, FastAPI APIs may be auto-generated from a pipeline notebook.

See pipeline-sec-filings for an example repo that includes a preprocessing pipeline API and an auto-generated FastAPI.
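
As a rough illustration of the kind of function a pipeline notebook might contain, the sketch below turns a raw text document into a list of sections. It is written under assumptions: the function name pipeline_api, its signature, and the rule that an all-caps line starts a new section are illustrative choices, not the documented conventions. Consult unstructured-api-tools for the actual requirements.

```python
import json
from typing import Dict, List


def pipeline_api(text: str) -> List[Dict[str, str]]:
    """Group each line of body text under the most recent all-caps heading."""
    sections: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.isupper():
            # Assumption: an all-caps line marks the start of a section.
            sections.append({"section": line, "text": ""})
        elif sections:
            current = sections[-1]
            current["text"] = (current["text"] + " " + line).strip()
        # Lines that appear before the first heading are ignored.
    return sections


if __name__ == "__main__":
    raw = "ITEM 1. BUSINESS\nWe make widgets.\n\nITEM 1A. RISK FACTORS\nDemand may fall.\n"
    print(json.dumps(pipeline_api(raw), indent=2))
```

The return value is plain JSON-serializable data, which is what makes a function like this straightforward to expose behind a web endpoint.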

🔩 Developer tools for generating FastAPIs

The unstructured-api-tools library includes the tooling required to create FastAPIs from pipeline notebooks.

🤗 Hugging Face

Hugging Face Spaces offer a simple way to host ML demo apps, models and datasets directly on our organization’s profile. This allows us to showcase our projects and work collaboratively with other people in the ML ecosystem. Visit our space here!

