Make sure that full corpus passes #41
Removed |
Investigating:
import codecs
text = codecs.open('/Users/jhn/ucl/oracc/pyoracc/pyoracc/test/fixtures/sample_corpus/cmawro-01-01.atf', encoding='utf-8-sig').read()
from pyoracc.atf.atffile import AtfLexer
lexer = AtfLexer().lexer
lexer.input(text)
for tok in lexer:
    print(tok)
Fails in the translation at line 1151
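The single-file check above could be extended to a whole corpus directory. The following is only a minimal sketch, not code from pyoracc: `find_failing_files` and its `tokenize` parameter are hypothetical names, and it assumes that a failure shows up as a raised exception.

```python
import codecs
import os


def find_failing_files(corpus_dir, tokenize, extension=".atf"):
    """Return a sorted list of (path, error message) for files that fail.

    `tokenize` is any callable that takes the file text and raises on failure
    (hypothetical interface, assumed for this sketch).
    """
    failures = []
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            if not name.endswith(extension):
                continue
            path = os.path.join(root, name)
            try:
                # utf-8-sig strips a leading BOM, as in the snippet above
                with codecs.open(path, encoding="utf-8-sig") as handle:
                    tokenize(handle.read())
            except Exception as exc:  # broad on purpose: we only want a report
                failures.append((path, str(exc)))
    return sorted(failures)
```

For pyoracc, `tokenize` could be a small function that feeds the text to `AtfLexer().lexer` and iterates over all tokens, as in the snippet above. Whether every bad file raises at the lexing stage is an assumption; some may only fail in the parser.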
|
Seems like the |
Tried online validation of the whole corpus against the ORACC server, but the server broke after a few hundred tests, so it's not complete. These are the ones that failed (more info on each of them below):
These break the syntax highlighting:
|
progress is here https://github.com/jenshnielsen/pyoracc/tree/improvecorpuscover |
Fixed an issue with various forms of ' being used in t_transctrl_ID, but the file now fails because SCORELABEL is not implemented in the parser |
At least the |
Rerun with improved error messages: Sample corpus:
|
New corpus as of today https://gist.github.com/jenshnielsen/82bd8a859a3ed556641a91dd540231ab |
Adding new changes from rillian.
In the sample corpus, the following files currently fail: