This is a knowledge base QA task using the data from NLPCC2017 (task 5: Open Domain Question Answering). Chinese version introduction:
- My development environment is Ubuntu 14.04 with PyTorch 1.2.0.
- Please make sure 'mysql' is installed on your desktop/PC. There are many guides for installing MySQL on Ubuntu.
Clone or download this 'KB_QA' repository.
Then 'cd preProcessData'.
- run
NLPCC2017 task 5 provides a train set and a test set. We split the test data 1:1 to get the test and dev data.
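The 1:1 split described above can be sketched as follows. This is only an illustration and not the repository's own script: the function name `split_half`, the fixed seed, and the shuffle before splitting are all assumptions. If one example spans several lines in the raw file, group those lines into a single record before splitting.

```python
import random


def split_half(records, seed=0):
    """Shuffle a copy of `records` and split it 1:1 into (test, dev).

    Hypothetical helper: the name, the seed and the shuffling step are
    assumptions made for illustration, not taken from the repository.
    """
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Stand-in records; in practice each record would be one QA example.
records = [f"question {i}" for i in range(10)]
test_part, dev_part = split_half(records)
print(len(test_part), len(dev_part))
```

With an odd number of records, the dev part receives the extra one.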
- run
This script has three functions: getNERData, getDBData and getSimilarityData. Running it produces three folders named NERData, DBData and SIMData.
- run
(Please create a KB_QA database in MySQL first.) Running the script uploads the data in the DBData folder to the KB_QA database.
- Please download the pretrained BERT model (PyTorch version) and add its files to subModel/BertPreTrainedModel. Also make sure the NERData folder exists at preProcessData/NERData; it should contain three text files (train.txt, dev.txt, test.txt).
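Before training, the expected layout can be checked with a short script. The sketch below is hypothetical and not part of the repository. It assumes it is given the repository root, and it takes the BERT file names (vocab.txt, config.json, pytorch_model.bin) from the training commands in this README.

```python
from pathlib import Path

# File names taken from this README; the helper itself is hypothetical.
BERT_FILES = ["vocab.txt", "config.json", "pytorch_model.bin"]
NER_FILES = ["train.txt", "dev.txt", "test.txt"]


def missing_files(repo_root):
    """Return the expected files (relative to repo_root) that do not exist."""
    root = Path(repo_root)
    expected = [root / "subModel" / "BertPreTrainedModel" / name for name in BERT_FILES]
    expected += [root / "preProcessData" / "NERData" / name for name in NER_FILES]
    return [str(path.relative_to(root)) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Assumes the current directory is the repository root.
    for path in missing_files("."):
        print("missing:", path)
```

An empty result means every file named above is in place.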
Training the NER model
run --data_dir preProcessData/NERData --vocab_file BertPreTrainedModel/vocab.txt --model_config BertPreTrainedModel/config.json --output_dir output_model --pre_train_model BertPreTrainedModel/pytorch_model.bin --max_seq_length 64 --do_train --train_batch_size 8 --eval_batch_size 8 --gradient_accumulation_steps 16 --num_train_epochs 8
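The command above combines --train_batch_size 8 with --gradient_accumulation_steps 16. The arithmetic below shows what that means for the effective batch size, assuming the script treats --train_batch_size as the per-step mini-batch and steps the optimizer once every 16 mini-batches. That reading is an assumption about the training script. Some BERT example scripts instead divide the batch size by the accumulation steps, so check the script before relying on this number.

```python
def effective_batch_size(per_step_batch_size, accumulation_steps):
    """Examples contributing to one optimizer update.

    Assumes gradients from `accumulation_steps` mini-batches are summed
    before each optimizer step; this convention is an assumption here,
    not something confirmed by the repository.
    """
    return per_step_batch_size * accumulation_steps


# Values from the NER training command: batch size 8, accumulation steps 16.
print(effective_batch_size(8, 16))  # 128
```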
- Please make sure the pretrained BERT model (PyTorch version) exists under subModel/BertPreTrainedModel.
- Training the classification model
run --data_dir preProcessData/SIMData --vocab_file BertPreTrainedModel/vocab.txt --model_config BertPreTrainedModel/config.json --output_dir output_model --pre_train_model BertPreTrainedModel/pytorch_model.bin --max_seq_length 64 --do_train --train_batch_size 8 --eval_batch_size 8 --gradient_accumulation_steps 16 --num_train_epochs 8