Testing the reproducibility of the hybrid approach #2
Comments
The TSV formatting appears to be disrupted by certain special characters present in the XML files listed in this issue. To handle these characters within the TSV file, we used the "quotechar" parameter. The relevant code is at line 335 of the xml_translate.py script, where the XML is converted into TSV format.
Certain special characters, such as Î, ±, ≥, %, and a few more, are part of the text. The '%' symbol, although normally unproblematic, can cause a formatting issue when there is no whitespace between the number and the symbol, e.g., 20% vs. 20 %. Nothing to be fixed.
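To illustrate the "quotechar" approach described above, here is a minimal sketch using Python's standard csv module. It is not the code at line 335 of xml_translate.py; the helper names `write_tsv` and `read_tsv`, the three-column layout, and the sample row are assumptions made up for the example.

```python
import csv
import io


def write_tsv(rows):
    """Serialize rows as TSV text, quoting any field that contains
    a tab, a double quote, or a line break (hypothetical helper)."""
    buf = io.StringIO(newline="")
    writer = csv.writer(
        buf,
        delimiter="\t",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


def read_tsv(text):
    """Parse TSV text back into rows, honoring the same quote character."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t", quotechar='"')
    return list(reader)


# Made-up sample record: placeholder ID, a title with special symbols,
# and an abstract containing an embedded line break and tab.
rows = [["PMID_PLACEHOLDER", "Dose ≥ 20% ± 5", "First line\nsecond line\twith a tab"]]
text = write_tsv(rows)

# The embedded line break stays inside one quoted field, so the record
# is read back as a single row rather than being split in two.
assert read_tsv(text) == rows
```

With quoting enabled on both the writing and the reading side, a line break inside a title or abstract no longer ends the record. Any downstream tool that reads the file by splitting on raw newlines would still see the record as broken.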
@Soudeh-Jahanshahi this is marked as nothing to fix, but how does the bug that originated this issue affect the approaches you are working on? Does it have an effect, or was it something you observed that caught your attention? Please clarify, thanks.
@ljgarcia: These 32 annotated XML files do not contribute to the post-processing approach. Specifically, if they are part of the input data, their tokens only contribute to building the Word2Vec model, but during post-annotation the presence of MeSH terms in the corresponding documents is ignored... However, compared to the size of the entire dataset, ignoring these documents in post-processing would have only a negligible impact on the final evaluation results...
The script "xml_translate.py" has a bug when processing 32 annotated XML files!
defect_list = [27817193, 28240519, 28244787, 28438127, 28670879, 28707850, 28749127, 28749635, 28843255, 29095577, 29099159, 29116736, 29132205, 29172291, 29206099, 29220461, 29235983, 29283531, 29373899, 29374411, 29388757, 29451968, 29481028, 29533587, 29616530, 29630142, 29644823, 29688353, 29688370, 29693981, 29716180, 29801411]
For the PMIDs in this list, the single TSV file is not generated correctly: the code splits their title and abstract across different lines.
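One way to spot records affected by this kind of splitting is to count the tab-separated fields on each physical line. The sketch below assumes a three-column layout (PMID, title, abstract), which is a guess and not something confirmed by xml_translate.py. The function name `find_malformed_lines` and the sample text are hypothetical.

```python
def find_malformed_lines(tsv_text, expected_fields=3):
    """Return 1-based numbers of physical lines whose tab-separated
    field count differs from expected_fields (assumed layout)."""
    lines = tsv_text.split("\n")
    # Drop the empty string produced by a trailing newline.
    if lines and lines[-1] == "":
        lines.pop()
    bad = []
    for lineno, line in enumerate(lines, start=1):
        if len(line.split("\t")) != expected_fields:
            bad.append(lineno)
    return bad


# Made-up sample: the first record is intact, the second has its
# title broken across two physical lines.
sample = (
    "ID_A\tA complete title\tA complete abstract\n"
    "ID_B\tA title that is\n"
    "split in two\tAn abstract\n"
)
print(find_malformed_lines(sample))
```

Lines flagged this way can be mapped back to their PMIDs to confirm whether they match the defect_list above.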