KGdatasets

Public datasets for graph embedding

Available datasets

Datasets' table
| Number | Dataset | Description |
| --- | --- | --- |
| 1 | CN15K (ConceptNet 15k) | A subset of ConceptNet, a semantic network designed to help computers understand the meanings of the words people use. Numeric values on triples represent uncertainty. |
| 2 | FB15k (Freebase 15k) | The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of 592,213 triples with 14,951 entities and 1,345 relations. FB15k-237 is a variant of the original dataset in which inverse relations are removed, since a large number of test triples could be obtained by inverting triples in the training set. |
| 3 | FB15k-237 | FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, many triples are inverses that cause leakage from the training split into the validation and test splits. FB15k-237 was created by Toutanova and Chen (2015) so that the validation and test sets do not suffer from this inverse-relation leakage. In summary, FB15k-237 contains 310,079 triples with 14,505 entities and 237 relation types. |
| 4 | FB13 | FB13 is a subset of Freebase. |
| 5 | NL27K | NL27K is a typical uncertain knowledge graph (UKG) dataset extracted from NELL (Never-Ending Language Learning). Its triples are high quality (confidence scores >= 0.95) and rarely contain noisy or uncertain data. |
| 6 | O*NET20K | A subset of O*NET, a dataset that includes job descriptions, skills, and labeled binary relations between such concepts. Each triple is labeled with a numeric value that indicates the importance of that link. |
| 7 | PPI5K (protein-protein interactions) | A subset of the protein-protein interactions (PPI) knowledge graph. Numeric values represent the confidence of the link, based on evidence in the existing scientific literature. |
| 8 | WN18 (WordNet18) | The WN18 dataset has 18 relations scraped from WordNet for roughly 41,000 synsets, resulting in 141,442 triples. A large number of the test triples were found to appear in the training set with another relation or the inverse relation. A new version of the dataset, WN18RR, was therefore proposed to address this issue. |
| 9 | WN18RR | WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations and 40,943 entities. However, many test triples are obtained by inverting triples from the training set. WN18RR was created so that the evaluation data does not suffer from this inverse-relation leakage. In summary, WN18RR contains 93,003 triples with 40,943 entities and 11 relation types. |
| 10 | WordNet11 (WN11) | A subset of WordNet, a lexical database for English. |
| 11 | YAGO3-10 (Yet Another Great Ontology 3-10) | YAGO3-10 is a benchmark dataset for knowledge base completion. It is a subset of YAGO3 (which itself is an extension of YAGO) that contains entities associated with at least ten different relations. In total, YAGO3-10 has 123,182 entities and 37 relations, and most of the triples describe attributes of persons such as citizenship, gender, and profession. |
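
Several descriptions above mention inverse-relation leakage: a test triple that can be recovered by reversing a training triple. The following is a minimal sketch of how such leakage could be counted, assuming triples are already loaded as `(head, relation, tail)` tuples. The function name `inverse_leakage` and the toy triples are hypothetical and are not taken from any of the datasets listed here.

```python
def inverse_leakage(train, test):
    """Return test triples (h, r, t) whose reversed pair (t, h)
    appears in the training set under any relation."""
    # Collect every (head, tail) pair seen in training, ignoring the relation.
    train_pairs = {(h, t) for h, _, t in train}
    return [(h, r, t) for h, r, t in test if (t, h) in train_pairs]


# Toy triples invented for illustration only.
train = [
    ("cat", "hypernym", "feline"),
    ("paris", "capital_of", "france"),
]
test = [
    ("feline", "hyponym", "cat"),         # reverse of a training triple
    ("berlin", "capital_of", "germany"),  # unrelated to the training set
]

leaked = inverse_leakage(train, test)
print(leaked)                   # [('feline', 'hyponym', 'cat')]
print(len(leaked) / len(test))  # 0.5
```

This check only matches entity pairs, so it flags candidates rather than proving that two relations are true inverses. It does not reproduce the filtering procedure used to build FB15k-237 or WN18RR.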