Please cite the following paper:
@inproceedings{huang2023converser,
    title = "{CONVERSER}: Few-shot Conversational Dense Retrieval with Synthetic Data Generation",
    author = "Huang, Chao-Wei and Hsu, Chen-Yu and Hsu, Tsu-Yuan and Li, Chen-An and Chen, Yun-Nung",
    booktitle = "Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sigdial-1.34",
    doi = "10.18653/v1/2023.sigdial-1.34",
    pages = "381--387"
}
- Python >= 3.6
- Transformers
- torch
Our generated dataset can be found on Google Drive.
We used LLaMA-13B in our experiments. Please apply for access here. You can also try other open-source LLMs such as LLaMA-2 and Falcon. Note that our method doesn't require instruction-tuned LLMs, so you can use any pretrained LLM.
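For reference, below is a minimal sketch of loading a local LLaMA checkpoint with Transformers. It assumes the weights have already been converted to the Hugging Face format; the path is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; point this at your converted LLaMA-13B checkpoint
# (the same directory you would set as LLAMA_CHECKPOINT_DIR).
LLAMA_CHECKPOINT_DIR = "/path/to/llama-13b-hf"

tokenizer = AutoTokenizer.from_pretrained(LLAMA_CHECKPOINT_DIR)
# fp16 keeps the 13B model within a single large GPU; adjust to your hardware.
model = AutoModelForCausalLM.from_pretrained(
    LLAMA_CHECKPOINT_DIR, torch_dtype=torch.float16
).to("cuda")
model.eval()
```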
In order to run dialogue generation, you'll need a collection of passages. In our experiments, we used the passage collection from OR-QuAC. You can process the released data with the ConvDR repo.
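If you build your own collection, it should be a JSONL file with one passage per line. The exact field names depend on how you processed the data; the sketch below assumes each line carries an "id" and a "text" field, which may differ in your setup.

```python
import json

# Placeholder path; the field names ("id", "text") are an assumption about
# the processed OR-QuAC collection, not a guaranteed schema.
COLLECTION_JSONL = "/path/to/collection.jsonl"

passages = []
with open(COLLECTION_JSONL, "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        passages.append({"id": record["id"], "text": record["text"]})

print(f"Loaded {len(passages)} passages")
```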
- Modify the paths LLAMA_CHECKPOINT_DIR and COLLECTION_JSONL in generate_dialog.py to your local paths.
- Simply run (a conceptual sketch of this step appears after the list below):
python3 generate_dialog.py
- You can also find our generated datasets here
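Conceptually, the generation step prompts the LLM with a few in-context example dialogues plus a sampled passage and lets it continue with a synthetic conversational turn. The snippet below is only a rough illustration of that pattern; the prompt text, sampling parameters, and output parsing are placeholders, not the actual generate_dialog.py logic.

```python
import random

# Assumes `model`, `tokenizer`, and `passages` from the snippets above.
# Placeholder in-context examples; the real prompt format may differ.
FEW_SHOT_PROMPT = (
    "Passage: <example passage>\n"
    "Conversation:\nQ1: <example question>\nA1: <example answer>\n\n"
)

passage = random.choice(passages)
prompt = FEW_SHOT_PROMPT + f"Passage: {passage['text']}\nConversation:\nQ1:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
# Keep only the newly generated continuation (the synthetic dialogue turn).
generated = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(generated)
```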
Please refer to the original DPR repo, or to the lighter-weight GC-DPR implementation, for training a DPR model on the generated dataset. With GC-DPR, you should be able to train a DPR model with a single GPU. Below is a reference command we used with GC-DPR to train the model (the expected layout of the training file is sketched after the command):
CUDA_VISIBLE_DEVICES=0 python3 train_dense_encoder.py \
--max_grad_norm 2.0 \
--encoder_model_type hf_bert \
--pretrained_model_cfg bert-base-uncased \
--seed 12345 \
--sequence_length 384 \
--warmup_steps 1237 \
--batch_size 64 \
--dev_batch_size 16 \
--do_lower_case \
--train_file ${GENERATED_DATASET} \
--dev_file ../ConvDR/datasets/or-quac/dev_dpr.json \
--output_dir ${MODEL_DIR} \
--learning_rate 2e-05 \
--num_train_epochs 30 \
--val_av_rank_start_epoch 0 \
--fp16 \
--grad_cache \
--q_chunk_size 8 \
--ctx_chunk_size 8 \
--global_loss_buf_sz 2097152 \
--val_av_rank_max_qs 10000
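The file passed as --train_file is expected to follow the standard DPR training JSON format: a list of examples, each with a question and positive/negative contexts. Below is a minimal sketch of writing one such example; the content is purely illustrative, and optional fields such as scores and passage ids are omitted.

```python
import json

# Placeholder example following the standard DPR training-file layout.
example = {
    "question": "who won the 2016 world series?",
    "answers": [],
    "positive_ctxs": [
        {"title": "2016 World Series", "text": "The Chicago Cubs won ..."}
    ],
    "negative_ctxs": [],
    "hard_negative_ctxs": [
        {"title": "2015 World Series", "text": "The Kansas City Royals ..."}
    ],
}

with open("generated_dataset.json", "w", encoding="utf-8") as f:
    json.dump([example], f, indent=2)
```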