Dumps
This snippet of a conversation from issue #111 may shed some light on using wptools with Wikimedia dumps...
The XML dump is tantalizing, and working with it is what inspired me to make this package. However, it has a problem with respect to extracting data. If you take a look at the content model from your examples, you'll see that it is "wikitext":
<model>wikitext</model>
Because wikitext is unstructured, the difficulty in parsing it (outside of a MediaWiki instance) is significant. Infobox templates contain all manner of wikitext: formatting, links, HTML/entities, comments, various other metadata, as well as other templates!
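For a feel of what the dump actually hands you, here's a minimal sketch that streams a pages-articles dump and yields each page's title, content model, and raw wikitext (the filename is a stand-in for whatever dump you downloaded):

import bz2
import xml.etree.ElementTree as ET

def pages(path='enwiki-latest-pages-articles.xml.bz2'):
    """Yield (title, model, wikitext) for each page in a pages-articles dump."""
    title = model = text = None
    with bz2.open(path, 'rb') as stream:
        for event, elem in ET.iterparse(stream, events=('end',)):
            name = elem.tag.rsplit('}', 1)[-1]  # drop the export namespace
            if name == 'title':
                title = elem.text
            elif name == 'model':
                model = elem.text
            elif name == 'text':
                text = elem.text
            elif name == 'page':
                yield title, model, text
                elem.clear()  # discard the finished <page> element

Articles come back with model == 'wikitext', and that raw wikitext is what you'd have to parse yourself.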
It may seem like a reasonably sane (Infobox) world, but the devil is in the details. Let me explain. You may have some exciting success parsing select wikitext examples, but then you'll need to add more and more corner cases to accommodate more weirdness. You may even want to hit the API to parse some of the idiosyncratic wikitext conventions in use in many places. Take a look at Taxoboxes for instance (which warns: "This template employs intricate features of template syntax.") and some of our related issues #62, #66, #91, #109.
Eventually, you'll discover Wikidata... (the heavens ring out in harmony: AUHHH!) which solves two huge problems with getting data out of Wikipedia. It has a well-defined, standard structure, and that structure is designed to support all languages. A surprise for me was discovering that wikitext syntax can be different in different language instances of Wikipedia (e.g. "Ficha de taxón" [es] versus "Taxobox" [en]). Some of us are programming in English, but there are wikitexts in many other languages. Wikidata dispenses with the complexity of idiosyncratic/surprising wikitext and it works the same for all languages. All hail Wikidata!
Wptools supports Wikidata, and I encourage you to use wptools.page.get_wikidata(). The problem with Wikidata is that it is often data-poor where the analogous Infobox is data-rich. This is because Wikidata is relatively young. Another goal of this package is to help that situation: let's get the info out of Infoboxes and into Wikidata.
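Getting Wikidata with wptools looks much like the get_parse() example further down (which claims come back depends entirely on how well-populated the item is):
>>> page = wptools.page('Autism')
>>> page.get_wikidata()
>>> page.data['wikidata']  # dict of Wikidata claims for the item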
You can browse the wikitext of Infoboxes to get a better sense of what we're up against simply by adding ?action=raw&section=0 to a Wikipedia URL. This will let you inspect the wikitext of the so-called lead section. Here's one from the example you posted:
https://en.wikipedia.org/wiki/Autism?action=raw&section=0
{{Infobox medical condition (new)
| name = Autism
| image = Autism-stacking-cans 2nd edit.jpg
| alt = Boy stacking cans
| caption = Repetitively stacking or lining up objects is associated with autism.
| field = [[Psychiatry]]
| symptoms = Trouble with [[Interpersonal relationship|social interaction]], impaired [[communication]], restricted interests, repetitive behavior<ref name=Land2008/>
| complications =
| onset = By age two or three<ref name=NIH2016>{{cite web |title= NIMH » Autism Spectrum Disorder |url= https://www.nimh.nih.gov/health/topics/autism-spectrum-disorders-asd/index.shtml |website= nimh.nih.gov |accessdate= 20 April 2017 |language=en |date= October 2016}}</ref><ref name=DSM5/>
| duration = Long-term<ref name=NIH2016/>
| causes = [[Heritability of autism|Genetic]] and environmental factors<ref name=Ch2012/>
| risks =
| diagnosis = Based on behavior and developmental history<ref name=NIH2016/>
| differential = [[Reactive attachment disorder]], [[intellectual disability]], [[schizophrenia]]<ref>{{cite book |first1= Jacqueline |last1= Corcoran |first2=Joseph |last2=Walsh |title= Clinical Assessment and Diagnosis in Social Work Practice |url= https://books.google.com/books?id=y28kokLoe78C&pg=PA72 |publisher= Oxford University Press, USA |date=9 February 2006 |isbn= 9780195168303 |via=Google Books |page=72 }}</ref>
| prevention =
| treatment = Early speech and [[Early intensive behavioral intervention|behavioral interventions]]<ref name=CCD2007/>
| medication =
| prognosis = Frequently poor<ref name=Ste106/>
| frequency = 24.8 million (2015)<ref name=GBD2015Pre>{{cite journal|last1=GBD 2015 Disease and Injury Incidence and Prevalence|first1=Collaborators.|title=Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990-2015: a systematic analysis for the Global Burden of Disease Study 2015.|journal=Lancet|date=8 October 2016|volume=388|issue=10053|pages=1545–1602|pmid=27733282|doi=10.1016/S0140-6736(16)31678-6|pmc=5055577}}</ref>
| deaths =
}}
That's actually quite sane to me, and our tool handles it reasonably well by parsing the API parse tree:
>>> page = wptools.page('Autism')
>>> page.get_parse()
>>> page.data['infobox']
{'alt': 'Boy stacking cans',
'caption': 'Repetitively stacking or lining up objects is associated with autism.',
'causes': '[[Heritability of autism|Genetic]] and environmental factors',
'diagnosis': 'Based on behavior and developmental history',
'differential': '[[Reactive attachment disorder]], [[intellectual disability]], [[schizophrenia]]',
'duration': 'Long-term',
'field': '[[Psychiatry]]',
'frequency': '24.8 million (2015)',
'image': 'Autism-stacking-cans 2nd edit.jpg',
'name': 'Autism',
'onset': 'By age two or three',
'prognosis': 'Frequently poor',
'symptoms': 'Trouble with [[Interpersonal relationship|social interaction]], impaired [[communication]], restricted interests, repetitive behavior',
'treatment': 'Early speech and [[Early intensive behavioral intervention|behavioral interventions]]'}
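If I have the data attributes right, the same get_parse() call also carries the raw wikitext and the XML parse tree it was derived from, so you can compare what wptools extracted against the source:
>>> page.data['wikitext']   # raw wikitext of the page
>>> page.data['parsetree']  # XML parse tree returned by the API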
The trick is to get that parse tree. One way to do this locally is by running your own MediaWiki instance. It's not that difficult, but it will take hours to import the dump. Then, just point wptools at it as though it were out on the network. You can do this for as many dumps in as many languages as you like.
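Pointing wptools at another wiki goes through the wiki= argument; for a local import that might look something like this (wiki.example.org is a made-up hostname standing in for wherever your local MediaWiki answers):
>>> page = wptools.page('Autism', wiki='wiki.example.org')  # hypothetical local MediaWiki host
>>> page.get_parse()
>>> page.data['infobox']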
Another approach could be to extract the wikitext for all the Infoboxes from the dumps (another parsing problem), and then pass that text to the API to get the correct parse trees. However, then you are hitting the API again.
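A rough sketch of that second approach, using the standard action=parse API with prop=parsetree (the wikitext literal here is just a stand-in for whatever you pulled out of the dump):

import requests

# Ask the MediaWiki API to turn arbitrary wikitext into an XML parse tree.
API = 'https://en.wikipedia.org/w/api.php'
wikitext = '{{Speciesbox | name = Okapi }}'  # stand-in for Infobox wikitext from a dump
resp = requests.post(API, data={
    'action': 'parse',
    'text': wikitext,
    'contentmodel': 'wikitext',
    'prop': 'parsetree',
    'format': 'json',
})
ptree = resp.json()['parse']['parsetree']['*']  # parse tree as an XML string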
There may be another alternative: if you can find those parse trees sitting on some network endpoint somewhere, then you can invoke a simple module to bunzip2 them, stream-parse them, and wptool them:
# bzip2, expat, and extract a parse tree
import wptools

ptree = '<template><title>Speciesbox</title><part><name>name</name><equals>=</equals><value>Okapi</value></part></template>'

def infobox(ptree):
    return wptools.utils.get_infobox(ptree)  # {'name': 'Okapi'}
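And a minimal sketch of the bunzip2-and-stream-parse part, assuming such a hypothetical parse-tree dump exists (parsetrees.xml.bz2 is a made-up filename, and the element layout is assumed to match the <template> fragment above; ElementTree's iterparse uses expat under the hood):

import bz2
import xml.etree.ElementTree as ET

# Stream a (hypothetical) bzip2-compressed dump of parse trees and hand
# each <template> element to the infobox() helper above.
with bz2.open('parsetrees.xml.bz2', 'rb') as stream:
    for event, elem in ET.iterparse(stream, events=('end',)):
        if elem.tag.rsplit('}', 1)[-1] == 'template':
            print(infobox(ET.tostring(elem, encoding='unicode')))
            elem.clear()  # keep memory use flat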
I think the challenge there is that producing a dump is already a (computationally) expensive process. Dumping the parse trees as well must be even more expensive. How much more? That's a good question.
Maybe the parse trees are lurking somewhere in the dumps, and I just haven't found them yet. Can someone please send a page down into the stacks to fetch a copy of the parse trees? If they ever return, please post them here. Really, maybe they are there. Maybe DBpedia has them?
With regard to easing the load on Wikimedia servers, another possibility is to use the more robust and performant RESTBase API. However, it does not currently offer parse trees. If we could get them to offer a /page/parsetree/{title} endpoint, then that may help Wikimedia. One way to motivate progress like that would be to show a high load from wptools (fiendishly rubs hands together).
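For context, RESTBase already serves rendered HTML and summaries, so the wished-for endpoint would sit alongside something like this (only the first two URLs exist today; the third is hypothetical):

import requests

REST = 'https://en.wikipedia.org/api/rest_v1'

# Existing RESTBase endpoints: rendered HTML and a structured summary
html = requests.get(REST + '/page/html/Autism').text
summary = requests.get(REST + '/page/summary/Autism').json()

# The endpoint wished for above (does NOT exist yet)
# ptree = requests.get(REST + '/page/parsetree/Autism').text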
That's basically a little history of this package. I hope it helps. There's a good chance I may have some of this wrong. If anyone has any further thoughts or corrections, please let us know!