def get_paper_texts(path):
    """Function to pre-process pdfs in given directory

    Args:
        path (str): path to folder containing pdfs

    Returns:
        paper_texts (list[str]): Return list of narrative texts for each paper
    """
    # Loop Through PDFs and pre-Process
    paper_texts = []
    for filename in tqdm(os.listdir(path)):
        pdf_file = os.path.join(path, filename)
        elements = partition(pdf_file)  # Partition PDF Using Unstructured
        isd = convert_to_dict(elements)  # Convert List of Elements to List of Dictionaries
        narrative_texts = [
            element["text"] for element in isd if element["type"] == "NarrativeText"
        ]  # Only Keep Narrative Text and Combine Into One String
        # os.remove(pdf_file) # Delete PDF
        paper_texts += narrative_texts
    return paper_texts
However, when I try to run it:
paper_texts = get_paper_texts("pdfs")
I am getting this error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 paper_texts = get_paper_texts("pdfs")

Cell In[5], line 14
     12 for filename in tqdm(os.listdir(path)):
     13     pdf_file = os.path.join(path, filename)
---> 14     elements = partition(pdf_file) # Partition PDF Using Unstructured
     15     isd = convert_to_dict(elements) # Convert List of Elements to List of Dictionaries
     16     narrative_texts = [
     17         element["text"] for element in isd if element["type"] == "NarrativeText"
     18     ] # Only Keep Narrative Text and Combine Into One String

File ~/.pyenv/versions/3.9.18/envs/lumiense_3.9_env/lib/python3.9/site-packages/unstructured/partition/auto.py:149, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper)
    147     else:
    148         msg = "Invalid file" if not filename else f"Invalid file {filename}"
--> 149         raise ValueError(f"{msg}. The {filetype} file type is not supported in partition.")
    151     for element in elements:
    152         element.metadata.url = url

ValueError: Invalid file pdfs/.DS_Store. The FileType.UNK file type is not supported in partition.
I really have no idea what this means, and any advice would be appreciated! Thx
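For what it's worth, the last line of the traceback names the file that failed: `pdfs/.DS_Store`. That is a hidden metadata file that macOS Finder creates inside folders, and `os.listdir` returns it along with the PDFs, so `partition` is apparently being handed something it cannot identify (`FileType.UNK`). One possible workaround, sketched below on the assumption that every real input ends in `.pdf`, is to filter the directory listing before partitioning. `list_pdf_files` is a hypothetical helper name and not part of unstructured.

```python
import os


def list_pdf_files(path):
    """Return the paths of the entries in `path` that look like PDF files.

    Hypothetical helper (not part of unstructured). Assumes real inputs
    always carry a .pdf extension, in any letter case.
    """
    pdf_files = []
    for filename in sorted(os.listdir(path)):
        full_path = os.path.join(path, filename)
        # Skip sub-directories and non-PDF entries such as .DS_Store
        if os.path.isfile(full_path) and filename.lower().endswith(".pdf"):
            pdf_files.append(full_path)
    return pdf_files
```

With a helper like this, the loop in `get_paper_texts` could iterate over `tqdm(list_pdf_files(path))` and pass each returned path straight to `partition`, so the `os.path.join` step is no longer needed inside the loop.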
Hi, I am going off of this example: https://github.com/Unstructured-IO/unstructured/tree/main/examples/arxiv-topic-modelling
I've installed the requirements in a virtual environment, and modified the first function in the notebook to the `get_paper_texts` version shown above.