Welcome to the Unstructured Community! 😊
We are building an ecosystem of preprocessing pipeline tools for Data Scientists and Data Engineers, so they may quickly work through the challenge of extracting structured data from unstructured raw documents.
Unstructured's open-source packages currently target Python 3.8. If you are using or contributing
to Unstructured code, we encourage you to work with Python 3.8 in a virtual environment. You can
use the following instructions to get up and running with a Python 3.8 virtual environment
with pyenv-virtualenv:
Mac/Homebrew:
1. Install pyenv with brew install pyenv.
2. Install pyenv-virtualenv with brew install pyenv-virtualenv.
3. Follow the pyenv-virtualenv setup instructions to add the pyenv-virtualenv startup code to your terminal profile.
4. Install Python 3.8 by running pyenv install 3.8.15.
5. Create and activate a virtual environment by running:
    pyenv virtualenv 3.8.15 unstructured
    pyenv activate unstructured
You can change the name of the virtual environment from unstructured to another name if you're
creating a virtual environment for a pipeline. For example, if you're creating a virtual
environment for the SEC preprocessing, you can run pyenv virtualenv 3.8.15 sec.
Without Homebrew (for example, on Linux):
1. Run git clone https://github.com/pyenv/pyenv.git ~/.pyenv to install pyenv.
2. Run git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv to install pyenv-virtualenv as a pyenv plugin.
3. Follow steps 3-5 from the Mac/Homebrew instructions.
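Once the environment is activated, a quick check can confirm that the interpreter matches the targeted version and that a virtual environment is in use. The following is a minimal sketch, not part of any Unstructured package; the function names are hypothetical, and the virtual-environment test assumes the environment was created with venv or a recent virtualenv, where sys.prefix differs from sys.base_prefix.

```python
import sys

# Version targeted by the packages, per the setup notes above.
TARGET = (3, 8)


def is_target_python(version_info=None, target=TARGET):
    """Return True when the interpreter's major.minor matches the target."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(target)


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base prefix.

    Assumption: the environment was created with venv or a recent
    virtualenv, both of which set sys.base_prefix.
    """
    if prefix is None:
        prefix = sys.prefix
    if base_prefix is None:
        base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return prefix != base_prefix


if __name__ == "__main__":
    print("Python version matches target:", is_target_python())
    print("Running inside a virtual environment:", in_virtual_environment())
```

Running the script with the environment active and again after deactivating it should show the difference between the two states.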
We welcome contributions! See all open issues for bugs, features, and enhancement requests in the community.
When contributing, please follow our Contributing to Unstructured guidelines.
Don't hesitate to reach out to us on Slack with any questions. Thank you!
Bricks are the "blocks", or Python functions, from which preprocessing pipelines are made. They are
organized in the unstructured library and collectively form the Swiss Army knife that Python
developers can use to extract structured data from raw documents into the format that they want.
They may be used independently of any other Unstructured repos under the terms of the library's
license. Run pip install unstructured and you are good to go.
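To illustrate the idea of small functions composed into a pipeline, here is a minimal sketch written in plain Python. It does not use the unstructured library's API; the function names (clean_whitespace, looks_like_title, split_paragraphs, to_elements) and the "Title" / "NarrativeText" labels are hypothetical stand-ins chosen for this example.

```python
import re
from typing import Dict, List


def clean_whitespace(text: str) -> str:
    """Cleaning brick (hypothetical): collapse whitespace runs and trim."""
    return re.sub(r"\s+", " ", text).strip()


def looks_like_title(text: str, max_words: int = 12) -> bool:
    """Classification brick (hypothetical): short and no closing punctuation."""
    words = text.split()
    if not words or len(words) > max_words:
        return False
    return not text.rstrip().endswith((".", "?", "!", ":"))


def split_paragraphs(document: str) -> List[str]:
    """Partitioning brick (hypothetical): split on blank lines."""
    return [part for part in re.split(r"\n\s*\n", document) if part.strip()]


def to_elements(document: str) -> List[Dict[str, str]]:
    """Compose the bricks: raw text in, labeled elements out."""
    elements = []
    for paragraph in split_paragraphs(document):
        text = clean_whitespace(paragraph)
        kind = "Title" if looks_like_title(text) else "NarrativeText"
        elements.append({"type": kind, "text": text})
    return elements


if __name__ == "__main__":
    raw = "Risk  Factors\n\nOur business   depends on\nmany things.\n"
    for element in to_elements(raw):
        print(element)
```

Each function can be used on its own or swapped for another, which is the property the brick metaphor is meant to convey.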
A preprocessing pipeline API (or just "pipeline API") is a notebook that includes a Python function capable of transforming a raw document to structured data. By following the documented conventions, FastAPI APIs may be auto-generated from a pipeline notebook.
See pipeline-sec-filings for an example repo that includes a preprocessing pipeline API and an auto-generated FastAPI.
The unstructured-api-tools library includes the tooling required to create FastAPIs from pipeline notebooks.
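As a rough illustration of the kind of function a pipeline notebook might contain, the sketch below takes raw text and returns JSON-serializable records. The function name pipeline_api, its single text parameter, and the record shape are assumptions made for this example; the conventions documented in unstructured-api-tools are authoritative, and no FastAPI code is generated here.

```python
import json
from typing import Dict, List


def pipeline_api(text: str) -> List[Dict[str, str]]:
    """Hypothetical pipeline function: raw text in, structured records out.

    Lines of the form "Key: value" become named fields; any other
    non-empty line is kept as a generic "text" record.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            records.append({"field": key.strip().lower(), "value": value.strip()})
        else:
            records.append({"field": "text", "value": line})
    return records


if __name__ == "__main__":
    raw_document = "Company: Acme Corp\nForm: 10-K\n\nAnnual report follows."
    # The output is plain lists and dicts, so it serializes directly to
    # JSON, which is what a web API would typically return.
    print(json.dumps(pipeline_api(raw_document), indent=2))
```

Keeping the function's inputs and outputs to plain, serializable types is what makes it plausible to wrap such a function in a web endpoint automatically.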
Hugging Face Spaces offer a simple way to host ML demo apps, models and datasets directly on our organization’s profile. This allows us to showcase our projects and work collaboratively with other people in the ML ecosystem. Visit our space here!