langchain_unstructured.UnstructuredLoader no longer supports loading document for mode = "single" #28626

Open
5 tasks done
lesliechueng1996 opened this issue Dec 9, 2024 · 2 comments

Comments

@lesliechueng1996

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_unstructured.document_loaders import UnstructuredLoader
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredLoader("./xxx.txt")

docs = loader.load()

print(docs)
print(len(docs))

Error Message and Stack Trace (if applicable)

No response

Description

I saw that the UnstructuredFileLoader class is marked @deprecated and that the notice suggests langchain_unstructured.UnstructuredLoader as the alternative import. But when I use the UnstructuredLoader class, I cannot read a file in single mode.
Is this a breaking change or a bug?
This is very confusing to me. I am not sure whether UnstructuredLoader can replace UnstructuredFileLoader with all of its features.

System Info

aiofiles==24.1.0
aiohappyeyeballs==2.4.4
aiohttp==3.9.5
aiolimiter==1.2.0
aiosignal==1.3.1
alembic==1.14.0
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.6.2.post1
async-timeout==4.0.3
attrs==24.2.0
Authlib==1.3.1
azure-core==1.32.0
backoff==2.2.1
bce-python-sdk==0.9.23
beautifulsoup4==4.12.3
blinker==1.9.0
blis==1.0.1
cachetools==5.5.0
catalogue==2.0.10
certifi==2024.8.30
cffi==1.17.1
chardet==5.2.0
charset-normalizer==3.4.0
click==8.1.7
cloudpathlib==0.20.0
coloredlogs==15.0.1
confection==0.1.5
contourpy==1.3.1
cryptography==44.0.0
cycler==0.12.1
cymem==2.0.10
dataclasses-json==0.6.7
Deprecated==1.2.15
dill==0.3.9
diskcache==5.6.3
distro==1.9.0
docopt==0.6.2
doctran==0.0.14
effdet==0.4.1
emoji==2.14.0
et_xmlfile==2.0.0
eval_type_backport==0.2.0
exceptiongroup==1.2.2
faiss-cpu==1.9.0.post1
filelock==3.16.1
filetype==1.2.0
Flask==3.1.0
Flask-Cors==5.0.0
Flask-Migrate==4.0.7
Flask-SQLAlchemy==3.1.1
Flask-WTF==1.2.2
flatbuffers==24.3.25
fonttools==4.55.2
frozenlist==1.5.0
fsspec==2024.10.0
future==1.0.0
google-api-core==2.23.0
google-auth==2.36.0
google-cloud-vision==3.8.1
googleapis-common-protos==1.66.0
grpcio==1.68.1
grpcio-health-checking==1.68.1
grpcio-status==1.68.1
grpcio-tools==1.68.1
h11==0.14.0
html5lib==1.1
httpcore==1.0.7
httpx==0.27.0
httpx-sse==0.4.0
huggingface-hub==0.26.3
humanfriendly==10.0
idna==3.10
iniconfig==2.0.0
injector==0.22.0
iopath==0.1.10
itsdangerous==2.2.0
jieba==0.42.1
Jinja2==3.1.4
jiter==0.8.0
joblib==1.4.2
jsonpatch==1.33
jsonpath-python==1.0.6
jsonpointer==3.0.0
kiwisolver==1.4.7
langchain==0.3.10
langchain-community==0.3.10
langchain-core==0.3.22
langchain-experimental==0.3.3
langchain-huggingface==0.1.2
langchain-openai==0.2.11
langchain-pinecone==0.2.0
langchain-text-splitters==0.3.2
langchain-unstructured==0.1.6
langchain-weaviate==0.0.3
langcodes==3.5.0
langdetect==1.0.9
langsmith==0.1.147
language_data==1.3.0
lark==1.2.2
layoutparser==0.3.4
lxml==4.9.4
Mako==1.3.6
marisa-trie==1.2.1
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
marshmallow==3.23.1
matplotlib==3.9.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.17
murmurhash==1.0.11
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.4.2
nltk==3.8.1
numpy==1.26.4
olefile==0.47
omegaconf==2.3.0
onnx==1.17.0
onnxruntime==1.19.2
openai==1.57.0
opencv-python==4.10.0.84
openpyxl==3.1.5
orjson==3.10.12
packaging==24.2
pandas==2.2.3
pdf2image==1.17.0
pdfminer.six==20231228
pdfplumber==0.11.4
phonenumbers==8.13.51
pi_heif==0.21.0
pikepdf==9.4.2
pillow==11.0.0
pinecone-client==5.0.1
pinecone-plugin-inference==1.1.0
pinecone-plugin-interface==0.0.7
pipreqs==0.5.0
pluggy==1.5.0
portalocker==3.0.0
preshed==3.0.9
presidio_analyzer==2.2.355
presidio_anonymizer==2.2.355
prompt_toolkit==3.0.48
propcache==0.2.1
proto-plus==1.25.0
protobuf==5.29.1
psutil==6.1.0
psycopg2-binary==2.9.10
pyarrow==18.1.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycocotools==2.0.8
pycparser==2.22
pycryptodome==3.21.0
pydantic==2.10.3
pydantic-settings==2.6.1
pydantic_core==2.27.1
Pygments==2.18.0
pyparsing==3.2.0
pypdf==5.1.0
pypdfium2==4.30.0
pytest==8.3.4
python-dateutil==2.9.0.post0
python-docx==1.1.2
python-dotenv==1.0.1
python-iso639==2024.10.22
python-magic==0.4.27
python-multipart==0.0.19
python-oxmsg==0.0.1
python-pptx==1.0.2
pytz==2024.2
PyYAML==6.0.2
qianfan==0.4.12.1
rank-bm25==0.2.2
RapidFuzz==3.10.1
regex==2024.11.6
requests==2.32.3
requests-file==2.1.0
requests-toolbelt==1.0.0
rich==13.9.4
rsa==4.9
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.3.1
shellingham==1.5.4
simsimd==4.4.0
six==1.17.0
smart-open==7.0.5
sniffio==1.3.1
soupsieve==2.6
spacy==3.8.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
SQLAlchemy==2.0.36
srsly==2.4.8
sympy==1.13.1
tenacity==8.5.0
thinc==8.3.2
threadpoolctl==3.5.0
tiktoken==0.8.0
timm==1.0.12
tldextract==5.1.3
tokenizers==0.20.3
tomli==2.2.1
torch==2.5.1
torchvision==0.20.1
tqdm==4.67.1
transformers==4.46.3
typer==0.15.1
typing-inspect==0.9.0
typing_extensions==4.12.2
tzdata==2024.2
unstructured==0.16.9
unstructured-client==0.28.1
unstructured-inference==0.8.1
unstructured.pytesseract==0.3.13
urllib3==2.2.3
validators==0.34.0
wasabi==1.1.3
wcwidth==0.2.13
weasel==0.4.1
weaviate-client==4.9.6
webencodings==0.5.1
Werkzeug==3.1.3
wrapt==1.17.0
WTForms==3.2.1
XlsxWriter==3.2.0
yarg==0.1.9
yarl==1.18.3

@ml-lubich

ml-lubich commented Dec 11, 2024

Please provide your error message; this report is very generic and vague. Inputs and outputs would also be helpful for debugging.

@lesliechueng1996
Author

There is no error message. Currently I have code like this:

from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("./one_file", mode="single")

docs = loader.load()

Since I pass mode="single" to the loader, it returns a single LangChain Document object.
At the same time, I receive a warning that suggests using UnstructuredLoader instead of UnstructuredFileLoader.
But UnstructuredLoader doesn't have a mode parameter:

from langchain_unstructured.document_loaders import UnstructuredLoader

loader = UnstructuredLoader("./one_file")

docs = loader.load()

so it does not return a single LangChain Document object.

My question is: how do I get a single document after migrating to UnstructuredLoader, so that I can avoid big changes in my project?
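
One possible workaround, offered as a sketch and not as an official migration path: merge the documents that UnstructuredLoader returns into one document after loading. The helper name merge_to_single is hypothetical. The "\n\n" separator and keeping only the "source" metadata key are assumptions about what the old mode="single" behaviour looked like, so check them against your own output before relying on them.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, only so this sketch runs without LangChain.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def merge_to_single(docs, separator="\n\n"):
    """Collapse per-element documents into a one-item list (hypothetical helper)."""
    if not docs:
        return []
    # Join the text of every element in the order the loader returned them.
    text = separator.join(doc.page_content for doc in docs)
    # Keep only the source path, if present. Per-element metadata
    # (category, element id, ...) does not describe the merged text.
    source = docs[0].metadata.get("source")
    metadata = {} if source is None else {"source": source}
    return [Document(page_content=text, metadata=metadata)]


# Intended usage (not executed here; needs langchain-unstructured and a real file):
# docs = merge_to_single(UnstructuredLoader("./one_file").load())
```

Because the helper returns a list with one document, code that previously indexed docs[0] or iterated over the result of load() should not need to change.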
