This repository contains code and data for the following research study.
Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets
Melanie Walsh, Anna Preus, Maria Antoniak
EMNLP Findings 2024
Please cite this paper when using resources found in this repository.
The data in this repository includes:
- 1.4k+ public domain poems tagged with a poetic form by the Poetry Foundation, the Academy of American Poets, or both, with accompanying metadata such as subject tags and author birth and death dates where available
- retrieval metadata from Dolma, collected with the WIMBD platform, including source domains for each detected poem
- memorization predictions based on n-gram overlap between the true poems and the continuations of those poems generated by GPT-4
The code in this repository includes:
- a Python notebook demonstrating how to query for data from Dolma using the WIMBD platform
- a Python notebook analyzing the query data from Dolma
- a Python notebook demonstrating the memorization experiments
- Python scripts demonstrating how to prompt models for the poetry form classification task
- a Python notebook demonstrating analysis of classification results
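
As a rough illustration of the n-gram overlap idea behind the memorization predictions, here is a minimal sketch. It is not the code from the notebooks: the function names, the whitespace tokenization, and the default n-gram size are all assumptions made for this example, so check the memorization notebook for the settings actually used in the study.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(true_text, generated_text, n=5):
    """Fraction of n-grams in the generated text that also occur in the true text.

    Assumptions for this sketch: lowercased whitespace tokenization and
    n=5 by default. Neither is taken from the paper.
    """
    true_tokens = true_text.lower().split()
    generated_tokens = generated_text.lower().split()

    generated_ngrams = ngrams(generated_tokens, n)
    if not generated_ngrams:
        # The generated text is shorter than n tokens, so nothing can overlap.
        return 0.0

    true_ngrams = set(ngrams(true_tokens, n))
    shared = sum(1 for gram in generated_ngrams if gram in true_ngrams)
    return shared / len(generated_ngrams)


# Illustrative input: the opening of Shakespeare's Sonnet 18 (public domain).
true_continuation = (
    "Shall I compare thee to a summer's day? "
    "Thou art more lovely and more temperate"
)
verbatim_output = true_continuation
unrelated_output = "The model wrote something entirely new about rivers and stone"

print(ngram_overlap(true_continuation, verbatim_output, n=3))
print(ngram_overlap(true_continuation, unrelated_output, n=3))
```

A high score suggests that the model reproduced the poem nearly verbatim, while a low score suggests that it did not. Any cutoff for labeling a poem as memorized would be a separate choice and is not specified here.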