Parsing thumb images - no paragraph tags #75

sven-h · 2019-01-14T08:10:27Z

Hi,

I'm currently changing my fork [1] of the DBpedia extraction
framework [2] to use the Sweble parser instead of a running MediaWiki
instance for extracting the abstract of each wiki page.

I noticed a difference when the page contains a thumb image at the
beginning.
The HTML output of Sweble is nearly fine, but the wiki text that follows
the image is no longer surrounded by an HTML paragraph tag (<p>).
This is currently required by the extraction framework [3].
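
To illustrate why the missing tag matters, here is a minimal sketch of a
consumer that only keeps text found inside paragraph tags. It is not the
extraction framework's code; the class and function names and both HTML
strings are made up for illustration (the second string is hand-written and
is not actual Sweble output).

```python
from html.parser import HTMLParser


class ParagraphTextCollector(HTMLParser):
    """Collects only the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of currently open <p> tags
        self.paragraphs = []  # text of each completed paragraph
        self._buffer = []     # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.paragraphs.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Text outside of any <p> element is silently ignored.
        if self.depth > 0:
            self._buffer.append(data)


def paragraph_texts(html):
    """Return the text of every <p> element in the given HTML string."""
    collector = ParagraphTextCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


# Hypothetical inputs, written by hand for this sketch.
wrapped = "<p>Some abstract text.</p>"
unwrapped = '<div class="thumb"><img src="Example.jpg"/></div>Some abstract text.'

print(paragraph_texts(wrapped))
print(paragraph_texts(unwrapped))
```

With a consumer like this, the text is picked up only when it is wrapped in a
paragraph tag, so the unwrapped variant yields no abstract at all.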

I have attached a minimal Maven example (parsingThumbImages.zip).
If "thumb" is removed from this MediaWiki markup
"[[File:Example.jpg|thumb]]"
the whole text is wrapped in paragraph tags; otherwise it is not.

Do I have to change some WikiConfig settings?
(I already tried the auto-correct feature.)
Or is this output intended?
(I also tried the Parsoid parser [4]. With that parser the text is
always surrounded by paragraph tags.)

Thanks

Best regards
Sven

[1] https://github.com/sven-h/extraction-framework
[2] https://github.com/dbpedia/extraction-framework
[3] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala#L174
[4] https://www.mediawiki.org/wiki/Parsoid
