Code for the findings of our EMNLP'23 paper on evaluating the creativity of LLMs.
- Saved preprocessed dataset: `dataset/`
- Original dataset:
- Human-level baseline: `human-level/`
- Models:
  - Greedy search: `greedy_search/`
  - Top-$p$ ($p=0.9$, $t=0.7$): `Top_p/`
  - Top-$p$ (scaling $t$): `temperature/`
  - Validating DAT: `validating_DAT/`
- Word embeddings: word2vec and fastText are used to validate the effect of different word embedding methods.
  - GloVe: `glove.840B.300d.txt` (need to download from here)
  - Word2vec (optional): `GoogleNews-vectors-negative300.bin`
  - Fasttext (optional): `wiki-news-300d-1M.vec`
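GloVe ships as plain text, one word per line followed by its vector components, so a minimal loader can be sketched as follows. The helper name `load_glove` is illustrative and not a function from this repo:

```python
# Minimal sketch of parsing GloVe-format text vectors into a
# {word: numpy array} dictionary. `load_glove` is an illustrative
# helper, not part of the repository code.
import numpy as np

def load_glove(lines):
    """Parse an iterable of GloVe-format lines into {word: vector}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Usage (assuming the file has been downloaded):
# with open("glove.840B.300d.txt", encoding="utf-8") as f:
#     glove = load_glove(f)
```

For the word2vec and fastText formats, `gensim.models.KeyedVectors.load_word2vec_format` can load both files directly.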
- Data analysis: `DAT_analysis.ipynb`
- Generate data: for GPT-3.5-turbo and GPT-4, we use the OpenAI Python library; for other LLMs, we follow FastChat to collect data in `dataset.py`.
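A collection call via the OpenAI Python library can be sketched as below. The prompt wording and the `parse_words` helper are assumptions for illustration; see `dataset.py` for the actual collection code:

```python
# Illustrative sketch of collecting DAT responses from the OpenAI API.
# The prompt text and `parse_words` helper are assumptions, not the
# exact ones used in the paper.
import re

PROMPT = ("Please write 10 nouns that are as irrelevant from each "
          "other as possible, in all meanings and uses of the words.")

def parse_words(text):
    """Extract candidate words from a (possibly numbered) model reply."""
    return [w.lower() for w in re.findall(r"[a-zA-Z-]+", text)
            if len(w) > 1]

def query_model(model="gpt-3.5-turbo"):
    from openai import OpenAI  # requires OPENAI_API_KEY in the env
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # approximates the greedy-search setting
        messages=[{"role": "user", "content": PROMPT}],
    )
    return parse_words(resp.choices[0].message.content)
```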
- Calculate DAT: `dat_score.py`

The DAT paradigm asks models to generate unrelated words and calculates the semantic distance between them.
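The core of the score can be sketched as the average pairwise cosine distance between the word vectors; the original DAT scales this by 100. This is a simplified sketch, and `dat_score.py` contains the exact implementation:

```python
# Sketch of the DAT computation: the average pairwise cosine distance
# between word vectors, scaled by 100 as in the original DAT.
# See dat_score.py for the exact implementation.
from itertools import combinations
import numpy as np

def dat(vectors):
    """vectors: list of 1-D numpy arrays, one per generated word."""
    dists = [1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
             for a, b in combinations(vectors, 2)]
    return 100 * float(np.mean(dists))
```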
The DAT for humans and models. (a) The distance matrix of words generated by GPT-4 and GPT-3.5-turbo; the average distance is defined as the DAT. (b) The DAT of models and humans. (c) The percentile of models’ DAT against human results. We find that with the greedy search strategy, GPT-4 outperforms 96% of humans, while GPT-3.5-turbo exceeds the average human level. Stochastic sampling and temperature scaling are effective in obtaining higher DAT scores for models other than GPT-4, but face a trade-off between creativity and stability.
The effect of temperature tuning. The bands indicate the standard deviations.
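The two sampling knobs compared above can be sketched with toy logits: temperature scaling reshapes the next-token distribution, while top-$p$ truncates it to the smallest set of tokens whose cumulative probability reaches $p$. The function below is illustrative, not the decoding code of any evaluated model:

```python
# Illustrative sketch of temperature scaling plus top-p (nucleus)
# truncation on a toy next-token distribution.
import numpy as np

def sample_probs(logits, temperature=0.7, top_p=0.9):
    """Return the truncated, renormalised sampling distribution."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # tokens by descending prob
    cumulative = np.cumsum(probs[order])
    # keep the smallest prefix whose mass reaches top_p
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()
```

Raising the temperature flattens the distribution (more diverse words, higher variance), which matches the creativity-stability trade-off observed above.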
The results of the DAT and surprisal for more LLMs using greedy search.