Can section headings in the Markdown output inherit the heading hierarchy from the PDF? Currently they are all at the same level #944

Open
gcy0926 opened this issue Nov 13, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

gcy0926 commented Nov 13, 2024

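The sections in a PDF reflect the document's hierarchical structure, but the parsed output currently seems to put every heading at the same level. Could the heading level be determined from features such as section numbering?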
Uploading RAG评估-A unified evaluation.pdf…
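
To make the numbering-based idea in the request concrete, here is a minimal sketch. It is not how the project works today: the function names are hypothetical, and it assumes headings carry dotted numbers such as "2.1" at the start of the title. Headings without a number fall back to a default level, which is exactly the case a numbering-only approach cannot resolve.

```python
import re

# Matches a leading dotted section number such as "2", "2.1", or "3.2.1."
# followed by whitespace and at least one more character of title text.
_SECTION_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def heading_level_from_number(title: str, default: int = 1) -> int:
    """Return the heading depth implied by a dotted section number.

    Hypothetical helper: "2" -> 1, "2.1" -> 2, "3.2.1" -> 3.
    Titles without a leading number get `default`.
    """
    match = _SECTION_NUMBER.match(title)
    if match is None:
        return default  # no numbering found, so no hierarchy can be inferred
    return match.group(1).count(".") + 1


def to_markdown_heading(title: str) -> str:
    """Render a title as a Markdown ATX heading at the inferred level."""
    level = min(heading_level_from_number(title), 6)  # ATX headings stop at level 6
    return "#" * level + " " + title.strip()


if __name__ == "__main__":
    for t in ["1 Introduction", "2.1 Related Work", "3.2.1. Setup", "Abstract"]:
        print(to_markdown_heading(t))
```
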

gcy0926 added the enhancement (New feature or request) label Nov 13, 2024
Author

gcy0926 commented Nov 13, 2024

For the paper, see: https://arxiv.org/abs/2409.12941

Author

gcy0926 commented Nov 13, 2024

The omni-parse tool does split headings into multiple levels in its parsing output: https://github.com/adithya-s-k/omniparse

Collaborator

myhloli commented Nov 13, 2024

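Heading leveling has to generalize across all kinds of samples, and not every document has section numbers. We therefore expect to assign levels by clustering heading line heights, but this is low priority and may land in the release after next.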
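
As a rough illustration of what leveling by line height could look like, here is a sketch under stated assumptions. It is not the maintainers' implementation. It swaps in a simple gap-based grouping of sorted heights in place of a full clustering algorithm, and the function name, the `tolerance` parameter, and the sample heights are all hypothetical.

```python
def levels_from_line_heights(heights, tolerance=1.0, max_level=6):
    """Assign a heading level to each line height; taller lines get higher rank.

    Heights that differ by no more than `tolerance` from their neighbor in
    sorted order share a level. Units are whatever the layout model reports
    (assumed here to be points or pixels).
    """
    if not heights:
        return []
    ordered = sorted(set(heights), reverse=True)
    level_of = {}
    level = 1
    previous = ordered[0]
    # Walk from tallest to shortest, opening a new level at each large gap.
    for h in ordered:
        if previous - h > tolerance:
            level = min(level + 1, max_level)
        level_of[h] = level
        previous = h
    # Return levels in the original order of the headings.
    return [level_of[h] for h in heights]


if __name__ == "__main__":
    sample = [24.0, 18.0, 18.2, 14.0, 24.1, 14.1]  # made-up heading heights
    print(levels_from_line_heights(sample))
```

Unlike the numbering approach, this works on unnumbered headings, but it depends on the document using visibly different sizes per level, and the tolerance would need tuning per source.
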

Author

gcy0926 commented Nov 13, 2024

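Heading leveling has to generalize across all kinds of samples, and not every document has section numbers. We therefore expect to assign levels by clustering heading line heights, but this is low priority and may land in the release after next.
OK, thanks.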
