Google Research Datasets

hiertext Public
The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.

Jupyter Notebook 272 CC-BY-SA-4.0 24 2 1 Updated Dec 2, 2024
scin Public
The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.

Jupyter Notebook 84 9 2 0 Updated Nov 23, 2024
MISeD Public
MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.

9 3 0 0 Updated Nov 20, 2024
uicrit Public
UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for 1,000 mobile UIs from RICO. This dataset was collected for our UIST '24 paper: https://arxiv.org/abs/2407.08850.

7 1 0 0 Updated Nov 19, 2024
WordGraph Public
The WordGraph dataset contains multilingual lexicon entries linked to Wikipedia entities, focusing on human-denoting names and demonym adjectives. Each lexicon entry contains inflected word forms and morphological information for all locales.

0 CC0-1.0 0 0 0 Updated Nov 7, 2024
Education-Dialogue-Dataset Public
A dataset of conversations generated by prompting Gemini Ultra. Each conversation is between a teacher and a student: the teacher is prompted with a specific topic to teach the student, and the student is prompted with their learning preferences. https://arxiv.org/abs/2405.14655

3 0 0 0 Updated Oct 29, 2024
sanpo_dataset Public

Python 41 Apache-2.0 2 3 2 Updated Oct 28, 2024
GeniL Public
The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, MS, and ID, and is annotated by native speakers of each language. Each sentence is collected from a public language corpus and contains at least one identity group name and an attribute.

3 CC-BY-4.0 1 0 0 Updated Oct 18, 2024
tap-typing-with-touch-sensing-images Public
The Tap Typing with Touch Sensing Images (TSI) dataset contains data of user taps on a mobile touchscreen keyboard, including elliptical features and capacitive sensing images of the taps. The dataset aligns each tap with a key the user intended to type during data collection so it can be used for keyboard decoder training and/or evaluation.

1 CC-BY-4.0 1 1 0 Updated Oct 15, 2024
mittens Public
Datasets for measuring misgendering in translation

5 0 0 0 Updated Oct 4, 2024
