Hi,
I'm currently changing my fork [1] of the DBpedia extraction
framework [2] to use the Sweble parser instead of a running MediaWiki
instance for extracting the abstract of each wiki page.
I noticed a difference when the page contains a thumb image at
the beginning.
The HTML output of Sweble is nearly fine, but the wiki text that follows
the image is no longer wrapped in an HTML paragraph tag (`<p>`).
The extraction framework currently requires this [3].
I created a minimal Maven example (parsingThumbImages.zip).
If thumb is removed from this MediaWiki markup
"[[File:Example.jpg|thumb]]"
the whole text is wrapped in paragraph tags; otherwise it is not.
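To make the expectation concrete, here is a minimal sketch of the kind of check I mean. It is not the code from the extraction framework [3] and it does not call Sweble; the class name `ParagraphCheck`, the method `isInsideParagraph`, and both HTML strings are hypothetical and only illustrate the two shapes of output (text inside `<p>` versus bare text after the image).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports whether a piece of text appears inside a <p> element.
class ParagraphCheck {
    // Matches <p> or <p attr="...">, then the shortest content up to </p>, across line breaks.
    private static final Pattern PARAGRAPH =
        Pattern.compile("<p(\\s[^>]*)?>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static boolean isInsideParagraph(String html, String text) {
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            // group(2) is the content between the opening and closing tag
            if (m.group(2).contains(text)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed shape of the output without "thumb": text wrapped in a paragraph.
        String wrapped = "<p>Some abstract text.</p>";
        // Assumed shape of the output with "thumb": image markup, then bare text.
        String bare = "<div><img src=\"Example.jpg\"/></div>\nSome abstract text.";

        System.out.println(isInsideParagraph(wrapped, "Some abstract text.")); // prints true
        System.out.println(isInsideParagraph(bare, "Some abstract text."));    // prints false
    }
}
```

A regular expression is enough for this illustration; a real check on parser output would better use an HTML parser.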
Do I have to change some WikiConfig settings?
(I already tried the auto-correct feature.)
Or is this output intended?
(I also tried the Parsoid parser [4]. With that parser the text is
always surrounded by paragraph tags.)
Thanks
Best regards
Sven
[1] https://github.com/sven-h/extraction-framework
[2] https://github.com/dbpedia/extraction-framework
[3] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala#L174
[4] https://www.mediawiki.org/wiki/Parsoid