Dataset For Information Extraction From News Web Pages

A multilingual dataset of labeled news web pages for the information extraction task.

Dataset Description

The dataset contains pages from news websites in six languages: Russian, English, German, Chinese, Korean, and Arabic. We labeled the news pages with attributes from the following sets:

For Russian: title, subtitle, publication date, modification date, text, authors, sources, categories, tags
For other languages: title, publication date, text, authors, tags

| Language | Sites / Pages | Count | Title | Text | Date | Author | Tag |
|---|---|---|---|---|---|---|---|
| ru | 112 / 722 | Sites with attribute | 110 | 112 | 110 | 54 | 49 |
| | | Pages with attribute | 712 | 716 | 708 | 262 | 332 |
| | | Nodes with attribute | 714 | 5918 | 724 | 272 | 1190 |
| en | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 4 | 2 |
| | | Pages with attribute | 500 | 499 | 499 | 147 | 98 |
| | | Nodes with attribute | 500 | 22200 | 499 | 147 | 258 |
| de | 9 / 450 | Sites with attribute | 9 | 9 | 9 | 9 | 2 |
| | | Pages with attribute | 450 | 449 | 450 | 270 | 100 |
| | | Nodes with attribute | 454 | 6847 | 600 | 308 | 336 |
| zh | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 6 | 0 |
| | | Pages with attribute | 500 | 500 | 500 | 227 | 0 |
| | | Nodes with attribute | 501 | 5872 | 500 | 277 | 0 |
| ko | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 8 | 1 |
| | | Pages with attribute | 500 | 500 | 500 | 358 | 41 |
| | | Nodes with attribute | 500 | 6898 | 550 | 409 | 155 |
| ar | 10 / 500 | Sites with attribute | 10 | 10 | 10 | 10 | 4 |
| | | Pages with attribute | 500 | 500 | 500 | 180 | 184 |
| | | Nodes with attribute | 500 | 5752 | 550 | 274 | 648 |

Data Collection

The creation of the Russian-language part of the dataset is described in our paper. Annotators marked up the web pages in Label Studio, following the annotation guideline.

For the other languages, we marked up nodes on pages using sitemaps created in Web Scraper.

Dataset Format

For the Russian-language part, there is a JSON file with the following structure (Label Studio JSON MIN format):

[
  {
    'id':
    'url':
    'html':
    'html_en':
    'agency':
    'site':
    'title':
    'annotator':
    'annotation_id':
    'created_at':
    'updated_at':
    'lead_time':
    'labels': [
      {
        'text':
        'hypertextlabels':
        'start':
        'end':
        'endOffset':
        'startOffset':
        'globalOffsets':
      },
      ...]
  },
...]

We additionally added an html_en field, which contains the page HTML translated into English.
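As an illustration of how records of this shape might be consumed, here is a minimal sketch that groups the labeled text fragments of one page by label name. The helper names `labels_by_type` and `load_records` are hypothetical and not part of the dataset. The sketch also assumes that `hypertextlabels` holds a list of label names, which is typical of Label Studio's JSON MIN export; a plain string is accepted as well in case the actual files differ.

```python
import json
from collections import defaultdict


def labels_by_type(record):
    """Group the labeled text fragments of one page record by label name.

    Assumes each entry of record['labels'] has a 'text' field and a
    'hypertextlabels' field holding a list of label names (a single
    string is also accepted).
    """
    grouped = defaultdict(list)
    for region in record.get("labels", []):
        names = region.get("hypertextlabels", [])
        if isinstance(names, str):
            names = [names]
        for name in names:
            grouped[name].append(region.get("text", ""))
    return dict(grouped)


def load_records(path):
    """Load the list of page records from a JSON file (the path is up to you)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Tiny in-memory record that mimics the structure described above;
    # the values are made up for illustration.
    sample = {
        "id": 1,
        "url": "https://example.com/news/1",
        "labels": [
            {"text": "Headline", "hypertextlabels": ["title"]},
            {"text": "First paragraph.", "hypertextlabels": ["text"]},
            {"text": "Second paragraph.", "hypertextlabels": ["text"]},
        ],
    }
    print(labels_by_type(sample))
```

With real data, `load_records` would be called on the downloaded Russian-language JSON file and `labels_by_type` applied to each record in the returned list.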

JSON structure for the other languages:

{'site': [
  {
    'uuid':
    'url':
    'html':
    'annotations': [
      {
        'xpath':
        'text':
        'label':
      },
      ...]
  },
  ...],
...}
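The per-attribute counts of sites, pages, and nodes could be recomputed from this structure roughly as follows. This is a sketch under assumptions: the function name `attribute_stats` is hypothetical, the exact label strings used in the files are not specified here, and the counting rule (a page or site "has" an attribute if at least one of its annotations carries that label) is one plausible reading rather than a confirmed definition.

```python
from collections import defaultdict


def attribute_stats(data):
    """Count sites, pages, and nodes per label.

    `data` maps a site name to a list of page records, each with an
    'annotations' list whose items carry a 'label' field.
    Returns {label: {'sites': int, 'pages': int, 'nodes': int}}.
    """
    sites = defaultdict(set)
    pages = defaultdict(int)
    nodes = defaultdict(int)
    for site, page_list in data.items():
        for page in page_list:
            labels_on_page = [a["label"] for a in page.get("annotations", [])]
            # Every annotated node counts once.
            for label in labels_on_page:
                nodes[label] += 1
            # A page (and its site) counts once per distinct label.
            for label in set(labels_on_page):
                pages[label] += 1
                sites[label].add(site)
    return {
        label: {
            "sites": len(sites[label]),
            "pages": pages[label],
            "nodes": nodes[label],
        }
        for label in nodes
    }


if __name__ == "__main__":
    # Made-up miniature input in the shape described above.
    demo = {
        "a.example": [
            {"annotations": [{"label": "title"}, {"label": "text"}, {"label": "text"}]},
            {"annotations": [{"label": "title"}]},
        ],
        "b.example": [{"annotations": [{"label": "title"}]}],
    }
    for label, counts in attribute_stats(demo).items():
        print(label, counts)
```

Keeping a set of sites per label, rather than a counter, avoids counting a site once for every page on which the label appears.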

Download

Multilingual dataset (1.1 GB): data/annotations
Russian-language web pages in MHTML format (zipped 1 GB): data/mhtml-ru.zip

Citation

More details about the Russian-language part of the dataset are available in our paper. Please cite us if you use or discuss this dataset in your work:

@INPROCEEDINGS{10076872,
  author={Varlamov, Maksim and Galanin, Denis and Bedrin, Pavel and Duda, Sergey and Lazarev, Vladimir and Yatskov, Alexander},
  booktitle={2022 Ivannikov Ispras Open Conference (ISPRAS)}, 
  title={A Dataset for Information Extraction from News Web Pages}, 
  year={2022},
  volume={},
  number={},
  pages={100-106},
  keywords={Annotations;Neural networks;Web pages;Data aggregation;Information retrieval;Data mining;Electronic commerce;web data extraction;information extraction;news;webpage dataset;neural networks},
  doi={10.1109/ISPRAS57371.2022.10076872}}
