Parsing thumb images - no paragraph tags #75

Open
sven-h opened this issue Jan 14, 2019 · 0 comments

sven-h commented Jan 14, 2019

Hi,

I'm currently changing my fork [1] of the DBpedia extraction
framework [2] to use the sweble parser instead of a running MediaWiki
instance for extracting the abstracts of each wiki page.

What I noticed is a difference when the page contains a thumb image at
the beginning.
The HTML output of sweble is nearly fine, but the wiki text that follows
the image is no longer surrounded by an HTML paragraph tag (`<p>`).
This is currently required by the extraction framework [3].

I have attached a minimal Maven example (parsingThumbImages.zip).
If thumb is removed from this MediaWiki markup
"[[File:Example.jpg|thumb]]"
the overall text is in paragraph tags, otherwise not.
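
To make the symptom concrete, a small check along these lines could flag text that ends up outside any paragraph tag. This is only a sketch: the class name, the function name and the two HTML strings are hypothetical illustrations, not actual sweble output and not part of the extraction framework.

```python
from html.parser import HTMLParser


class ParagraphCoverage(HTMLParser):
    """Collects non-empty text that is not inside any <p> element."""

    def __init__(self):
        super().__init__()
        self.p_depth = 0        # number of currently open <p> elements
        self.stray_text = []    # text chunks found outside <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.p_depth > 0:
            self.p_depth -= 1

    def handle_data(self, data):
        if self.p_depth == 0 and data.strip():
            self.stray_text.append(data.strip())


def text_outside_paragraphs(html):
    """Return the list of text chunks that are not wrapped in <p>."""
    parser = ParagraphCoverage()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end
    return parser.stray_text


# Hypothetical renderer outputs, made up for illustration only.
without_thumb = '<p><img src="Example.jpg" />Some text.</p>'
with_thumb = '<div class="thumb"><img src="Example.jpg" /></div>Some text.'

print(text_outside_paragraphs(without_thumb))
print(text_outside_paragraphs(with_thumb))
```

An empty list means all text is covered by paragraph tags; anything else is text that a paragraph-based extractor would presumably miss.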

Do I have to change some WikiConfig settings?
(I already tried the auto correct feature)
Or is the output intended?
(I also tried the parsoid parser [4]. With this parser the text is
always surrounded by paragraph tags.)

Thanks

Best regards
Sven

[1] https://github.com/sven-h/extraction-framework
[2] https://github.com/dbpedia/extraction-framework
[3] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala#L174
[4] https://www.mediawiki.org/wiki/Parsoid
