This repository provides Polish translations of the Spider, CoSQL, SParC, Spider-DK, and Spider-Syn datasets, along with code for some experiments.
📄 Associated master's thesis: download link.
The Polish translations are available for download from Hugging Face Datasets 🤗.
The datasets directory contains scripts for dataset synthesis.
# clone repository
git clone https://github.com/klima7/Polish-Spider
cd Polish-Spider
# create environment
conda create -n pol-spider python=3.9
conda activate pol-spider
pip install -r requirements.txt
# download spacy model
python -m spacy download xx_sent_ud_sm
Then download the original English databases from here and place them inside datasets/components/database.
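Before running the synthesis scripts, it can help to confirm that the databases landed in the right place. The snippet below is a minimal sketch, assuming the original Spider layout in which every database lives at `<db_id>/<db_id>.sqlite`. The helper name `list_databases` is hypothetical and is not part of this repository.

```python
from pathlib import Path


def list_databases(root):
    """Return sorted database ids that have a matching <db_id>/<db_id>.sqlite file.

    Assumption: the directory follows the original Spider layout.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and (entry / f"{entry.name}.sqlite").is_file()
    )


if __name__ == "__main__":
    found = list_databases("datasets/components/database")
    print(f"{len(found)} databases found")
```

If the count is zero, the archive was probably extracted one directory level too deep or too shallow.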
Synthesize a dataset named pol-spider-en, which is based on samples from spider. Translate questions to Polish. Apply context-curated translation to schema names. Translate strings in SQL queries to Polish:
python datasets/scripts/synthesize.py spider pol-spider-en \
--question-lang pl \
--schema-translation context-curated \
--query-lang pl \
--with-db
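To make "translate strings in SQL queries" concrete, the sketch below swaps quoted string literals in a query using a lookup table. This only illustrates the idea and is not the code behind synthesize.py. The function `translate_literals` and the glossary are hypothetical, and escaped quotes inside literals are not handled.

```python
import re

# Matches single- or double-quoted literals; escaped quotes are not supported.
_LITERAL = re.compile(r"'([^']*)'|\"([^\"]*)\"")


def translate_literals(query, glossary):
    """Replace string literals found in `glossary`, leaving other literals untouched."""

    def swap(match):
        quote = match.group(0)[0]  # keep the original quote character
        text = match.group(1) if match.group(1) is not None else match.group(2)
        return quote + glossary.get(text, text) + quote

    return _LITERAL.sub(swap, query)


query = "SELECT name FROM singer WHERE country = 'France'"
print(translate_literals(query, {"France": "Francja"}))
# SELECT name FROM singer WHERE country = 'Francja'
```

Table and column names are left alone here. In the command above they are covered separately by the --schema-translation option.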
Create the pol-spider dataset by joining pol-spider-en and pol-spider-pl:
python datasets/scripts/join.py pol-spider pol-spider-en pol-spider-pl
The app directory contains a Streamlit app that makes it easy to use the C3SQL and RESDSQL models.
To use the RESDSQL model, download its weights from Hugging Face 🤗 and place them inside app/models.
cd app
docker compose up --build
The experiments directory contains dockerized code for experiments with RAT-SQL, BRIDGE, RESDSQL, and C3.
The evaluation directory contains code for calculating metrics.
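As an illustration of one metric commonly reported for text-to-SQL, here is a minimal sketch of execution accuracy: a prediction counts as correct when it returns the same rows as the gold query. It is an assumption that this metric is among those computed in the evaluation directory, and the repository's own code may differ (for example by also reporting exact match). The function names and the toy table are hypothetical.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries return the same multiset of rows (row order ignored)."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


# Toy database with Polish schema names, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE piosenkarz (imie TEXT, kraj TEXT);
    INSERT INTO piosenkarz VALUES ('Anna', 'Polska'), ('Jean', 'Francja');
""")
pairs = [
    ("SELECT imie FROM piosenkarz WHERE kraj = 'Polska'",
     "SELECT imie FROM piosenkarz WHERE kraj = 'Polska'"),
    ("SELECT imie FROM piosenkarz",
     "SELECT imie FROM piosenkarz WHERE kraj = 'Francja'"),
]
print(execution_accuracy(conn, pairs))  # 0.5
```

Ignoring row order is a simplification: for gold queries with ORDER BY, a stricter comparison would keep the order.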