Multilingual dataset of labeled news web pages for the information extraction task
The dataset contains websites in six languages: Russian, English, German, Chinese, Korean, and Arabic. We labeled news pages with attributes from the following sets:
- For Russian: title, subtitle, publication date, modification date, text, authors, sources, categories, tags
- For other languages: title, publication date, text, authors, tags
| Language | Sites / Pages | Count | Title | Text | Date | Author | Tag |
|---|---|---|---|---|---|---|---|
| ru | 112 / 722 | Sites with attribute | 110 | 112 | 110 | 54 | 49 |
| | | Pages with attribute | 712 | 716 | 708 | 262 | 332 |
| | | Nodes with attribute | 714 | 5918 | 724 | 272 | 1190 |
| en | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 4 | 2 |
| | | Pages with attribute | 500 | 499 | 499 | 147 | 98 |
| | | Nodes with attribute | 500 | 22200 | 499 | 147 | 258 |
| de | 9 / 450 | Sites with attribute | 9 | 9 | 9 | 9 | 2 |
| | | Pages with attribute | 450 | 449 | 450 | 270 | 100 |
| | | Nodes with attribute | 454 | 6847 | 600 | 308 | 336 |
| zh | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 6 | 0 |
| | | Pages with attribute | 500 | 500 | 500 | 227 | 0 |
| | | Nodes with attribute | 501 | 5872 | 500 | 277 | 0 |
| ko | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 8 | 1 |
| | | Pages with attribute | 500 | 500 | 500 | 358 | 41 |
| | | Nodes with attribute | 500 | 6898 | 550 | 409 | 155 |
| ar | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 10 | 4 |
| | | Pages with attribute | 500 | 500 | 500 | 180 | 184 |
| | | Nodes with attribute | 500 | 5752 | 550 | 274 | 648 |
The creation of the Russian-language part of the dataset is described in our paper. Annotators marked up the web pages in Label Studio according to the guideline.
For the other languages, we marked up nodes on pages using sitemaps created in Web Scraper.
For the Russian-language part, we provide a JSON file with the following structure (Label Studio JSON MIN format):
[
{
'id':
'url':
'html':
'html_en':
'agency':
'site':
'title':
'annotator':
'annotation_id':
'created_at':
'updated_at':
'lead_time':
'labels': [
{
'text':
'hypertextlabels':
'start':
'end':
'endOffset':
'startOffset':
'globalOffsets':
},
...]
},
...]
We additionally added the html_en field, which contains the page HTML translated into English.
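As an illustration of the structure above, the following minimal sketch groups the labeled fragments of one record by attribute name. It assumes that `hypertextlabels` holds either a single label string or a list of label strings, which is not stated in this README; the helper names `load_russian_part` and `group_labels` and the `path` argument are hypothetical, since the exact file name inside data/annotations is not given here.

```python
import json
from collections import defaultdict


def load_russian_part(path):
    """Load the Russian-language JSON file (a list of page records).

    `path` is supplied by the caller; the actual file name is not
    specified in this README.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def group_labels(record):
    """Map each attribute name to the list of labeled text fragments.

    Assumption: 'hypertextlabels' is a label string or a list of them.
    """
    grouped = defaultdict(list)
    for region in record.get("labels") or []:
        names = region.get("hypertextlabels") or []
        if isinstance(names, str):
            names = [names]
        for name in names:
            grouped[name].append(region.get("text", ""))
    return dict(grouped)
```

Calling `group_labels(record)` for each record returned by `load_russian_part(path)` gives a per-page dictionary of attribute names to extracted strings, which may be a convenient starting point for building training targets.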
JSON structure for the other languages:
{'site': [
{
'uuid':
'url':
'html':
'annotations': [
{
'xpath':
'text':
'label':
},
...]
},
...],
...}
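The sketch below shows one way to walk this structure and count, for each label, the number of sites, pages, and nodes that carry it (the same three counts reported in the statistics table). It assumes the top-level object maps a site name to a list of page records, as the structure above suggests; the function name `attribute_stats` is hypothetical, and the input is expected to be the result of `json.load` on one of the files.

```python
from collections import defaultdict


def attribute_stats(dataset):
    """Count sites, pages, and nodes per label.

    `dataset` is assumed to be a dict: site name -> list of page records,
    each with 'uuid' and an 'annotations' list of {'xpath', 'text', 'label'}.
    """
    sites = defaultdict(set)
    pages = defaultdict(set)
    nodes = defaultdict(int)
    for site, site_pages in dataset.items():
        for page in site_pages:
            for ann in page.get("annotations") or []:
                label = ann.get("label")
                sites[label].add(site)
                # Key pages by (site, uuid) so equal uuids on different sites stay distinct.
                pages[label].add((site, page.get("uuid")))
                nodes[label] += 1
    return {
        label: {
            "sites": len(sites[label]),
            "pages": len(pages[label]),
            "nodes": nodes[label],
        }
        for label in nodes
    }
```

Labels that never occur in a file are simply absent from the result, so a missing key corresponds to a row of zeros.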
- Multilingual dataset (1.1 GB):
data/annotations
- Russian-language web pages in MHTML format (zipped 1 GB):
data/mhtml-ru.zip
More details about the Russian-language part of the dataset are available in our paper. Please cite us if you use or discuss this dataset in your work:
@INPROCEEDINGS{10076872,
author={Varlamov, Maksim and Galanin, Denis and Bedrin, Pavel and Duda, Sergey and Lazarev, Vladimir and Yatskov, Alexander},
booktitle={2022 Ivannikov Ispras Open Conference (ISPRAS)},
title={A Dataset for Information Extraction from News Web Pages},
year={2022},
volume={},
number={},
pages={100-106},
keywords={Annotations;Neural networks;Web pages;Data aggregation;Information retrieval;Data mining;Electronic commerce;web data extraction;information extraction;news;webpage dataset;neural networks},
doi={10.1109/ISPRAS57371.2022.10076872}}