Corpus creator for Chinese Wikipedia
Wikipedia text corpus for self-supervised NLP model training
Reading the data from OPIEC - an Open Information Extraction corpus
Practical ML and NLP with examples.
Builds a search engine over the Wikipedia data dump, using the 2013 dump (43 GB). Search results are returned in real time.
Python package for working with MediaWiki XML content dumps
A complete Python text analytics package that allows users to search for a Wikipedia article, scrape it, conduct basic text analytics and integrate it to a data pipeline without writing excessive code.
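A hedged sketch of the basic workflow such a package wraps: fetching one article's plain text through the public MediaWiki API and running a simple word-frequency analysis. The helper names below are illustrative only, not the package's actual API.

```python
import collections
import re
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_plaintext(title):
    # prop=extracts with explaintext returns the article stripped of wiki markup
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API, params=params, timeout=30).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def top_words(text, n=10):
    # crude tokenization, enough for a quick frequency table
    words = re.findall(r"[a-z']+", text.lower())
    return collections.Counter(words).most_common(n)

text = fetch_plaintext("Natural language processing")
print(top_words(text))
```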
Collects a multimodal dataset of Wikipedia articles and their images
Convert dumped Wikipedia XML (Chinese) into human-readable Markdown and txt documents.
Convert Wikipedia XML dump files to JSON or Text files
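A minimal sketch of the kind of conversion these tools perform: streaming a (possibly bz2-compressed) MediaWiki XML dump into JSON Lines, one article per line. The dump path and output file are placeholders, and this is not the code of any specific repository listed here.

```python
import bz2
import json
import xml.etree.ElementTree as ET

def dump_to_jsonl(dump_path, out_path):
    opener = bz2.open if dump_path.endswith(".bz2") else open
    with opener(dump_path, "rb") as src, open(out_path, "w", encoding="utf-8") as dst:
        # iterparse streams the dump so the whole file never sits in memory
        for _, elem in ET.iterparse(src, events=("end",)):
            if not elem.tag.endswith("}page"):
                continue
            title, text = None, ""
            for child in elem.iter():
                if child.tag.endswith("}title"):
                    title = child.text
                elif child.tag.endswith("}text"):
                    text = child.text or ""
            if title is not None:
                dst.write(json.dumps({"title": title, "text": text},
                                     ensure_ascii=False) + "\n")
            elem.clear()  # free the processed <page> subtree

dump_to_jsonl("enwiki-latest-pages-articles.xml.bz2", "articles.jsonl")
```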
Code and data for the paper 'Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings'
📚 A Kotlin project which extracts ngram counts from Wikipedia data dumps.
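The project above is written in Kotlin; the following is only a minimal Python illustration of the ngram-counting idea, not that project's code.

```python
import collections

def ngram_counts(tokens, n=2):
    # slide n parallel views over the token list and zip them into ngrams
    grams = zip(*(tokens[i:] for i in range(n)))
    return collections.Counter(" ".join(g) for g in grams)

print(ngram_counts("the quick brown fox jumps over the lazy dog".split(), n=2))
```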
Repository providing pre-processed Wikipedia and Simple Wikipedia datasets, along with Python scripts for pre-processing and dataset generation.
A desktop application that searches through a set of Wikipedia articles using Apache Lucene.
Wiki dump parser (Jupyter)
Interactive chatbot using python :)
IR search engine for a Wikipedia app
(Module under ongoing development) Retrieves the parsed content of Wikipedia articles. Created to make gathering text-corpus data fast and easy, but can be freely used for other purposes too.