Developed by the Dain team at Huazhong University of Science and Technology, this project aims to integrate large language models (LLMs) into automatic speech recognition (ASR) systems to improve recognition accuracy. Note that the current work operates on existing datasets and requires pre-recognized text files (ASR output) as input.
pip install -r requirements.txt
- Select and Edit Configuration File: Choose the appropriate configuration file in the `config` directory based on the API provider.
- Fill in API Information: Example:

      # Example: gpt_conformer.yaml
      provider: openai
      model: gpt-3.5-turbo
      api_key: YOUR_API_KEY
      base_url: YOUR_BASE_URL

      # Example: azure_conformer.yaml
      provider: azure
      model: gpt4o
      api_key: YOUR_API_KEY
      base_url: YOUR_AZURE_ENDPOINT
      api_version: YOUR_API_VERSION
- Specify Text Path and Label Information:

      path:
        example1: "path_to_text1"
        example2: "path_to_text2"
      text: "path_to_label"
- Set Parameters: Adjust parameters based on the task language.

      combination_num: 30   # Number of sentences per request
      thread_num: 100       # Number of concurrent threads
      temperature: 0.2
      max_repeat_times: 1   # Number of repeat inquiries
      top_p: 0.8
- Add Prompt: Add a prompt to the configuration file as needed. Refer to PromptList.

      prompt: "Your custom prompt here"
- Configuration File Path: Specify the configuration file path in `main.py`.

      with open("path_to_your_config", 'r') as f:
- Run the Program: Run the command below to generate the result directory.

      python main.py
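
As a rough illustration of how `combination_num` and `thread_num` interact, the sketch below groups sentences into batches and hands the batches to a thread pool. It is not the project's actual code: `make_batches`, `correct_batch`, and `correct_all` are hypothetical names, and `correct_batch` is a placeholder that returns its input unchanged where a real LLM request would go.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(sentences, combination_num):
    """Split sentences into consecutive groups of at most combination_num."""
    return [
        sentences[i:i + combination_num]
        for i in range(0, len(sentences), combination_num)
    ]


def correct_batch(batch):
    """Hypothetical stand-in for one LLM request covering a whole batch.

    A real implementation would send the batch to the configured provider
    and parse the reply; here the batch is returned unchanged.
    """
    return list(batch)


def correct_all(sentences, combination_num=30, thread_num=100):
    """Process every batch concurrently and return sentences in input order."""
    batches = make_batches(sentences, combination_num)
    with ThreadPoolExecutor(max_workers=thread_num) as pool:
        # Executor.map yields results in the same order as the inputs.
        results = list(pool.map(correct_batch, batches))
    return [sentence for batch in results for sentence in batch]


if __name__ == "__main__":
    demo = [f"sentence {i}" for i in range(65)]
    print([len(b) for b in make_batches(demo, 30)])
    print(len(correct_all(demo, combination_num=30, thread_num=4)))
```

Under these assumptions, a larger `combination_num` means fewer requests with longer prompts, while `thread_num` only caps how many requests are in flight at once.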
The result directory contains:

- `config`: The configuration file used.
- `diff`: Details of text modifications.
- `err`: Error examples.
- `response`: Responses from the LLM.
- `skips`: Texts skipped based on filter settings.
- `text`: Modified texts.
- `total`: Statistical results.
- `wer`: Word error rate files.
- `wrong_sentence`: List of incorrect sentences.
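
For readers unfamiliar with the metric behind the `wer` output, here is a minimal sketch of word error rate (and character error rate, as typically used for Chinese) based on Levenshtein edit distance. It is an illustration only and is not necessarily how this project or its scoring tools compute the reported numbers; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by reference length, as a percentage."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored."""
    return error_rate(
        list(reference.replace(" ", "")),
        list(hypothesis.replace(" ", "")),
    )


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    print(cer("今天天气很好", "今天天汽很好"))
```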
- Tokens: Using the Deepseekv2 model, token consumption on the AISHELL-1 and LibriSpeech datasets is 250k (input) + 260k (output) and 640k (input) + 650k (output), respectively, with `combination_num` set to 30 and 10.
- You can see the historical results in the result directory.
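
To turn the token counts above into a budget estimate, a small calculation like the following may help. The token figures come from the text above; the function names are hypothetical, and the per-million-token prices are arguments you must supply from your provider's current price list (the values in the demo are placeholders, not real prices).

```python
# Token consumption reported above: (input_tokens, output_tokens).
TOKENS = {
    "AISHELL-1": (250_000, 260_000),
    "LibriSpeech": (640_000, 650_000),
}


def total_tokens(dataset):
    """Sum of input and output tokens for one dataset."""
    input_tokens, output_tokens = TOKENS[dataset]
    return input_tokens + output_tokens


def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Cost in the currency of the supplied prices."""
    return (input_tokens / 1_000_000 * input_price_per_million
            + output_tokens / 1_000_000 * output_price_per_million)


if __name__ == "__main__":
    for name, (n_in, n_out) in TOKENS.items():
        # 1.0 and 2.0 are placeholder prices for illustration only.
        cost = estimate_cost(n_in, n_out, 1.0, 2.0)
        print(name, total_tokens(name), round(cost, 2))
```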
- Fill in the `config/test/Chinese/gpt.yaml` configuration file.
- Run the test script:

      python tools/test_model_capability.py

Results will be printed directly and need to be recorded manually. The testing process may take a long time without multithreading.
You can also fill in a new test
The original (baseline) model in the tables below is U2++ Conformer.
Two-stage | Decode | Chunk | Temp | Top p | Task num | Changed Sentence | Error Sentence | CER(%) | Change |
---|---|---|---|---|---|---|---|---|---|
- | attention | full | - | - | - | - | 2650 | 5.06 | - |
- | attention rescore | full | - | - | - | - | 2493 | 4.62 | - |
- | ctc greedy search | full | - | - | - | - | 2810 | 5.17 | - |
- | ctc prefix search | full | - | - | - | - | 2810 | 5.17 | - |
deepseekv2 | attention | full | 0.2 | 0.8 | 20 | 1568 | 2365 | 4.69 | -0.37(7.3%) |
deepseekv2 | attention rescore | full | 0.2 | 0.8 | 20 | 1451 | 2189 | 4.21 | -0.41(8.8%) |
deepseekv2 | ctc greedy search | full | 0.2 | 0.8 | 20 | 1892 | 2331 | 4.51 | -0.66(12.7%) |
deepseekv2 | ctc prefix search | full | 0.2 | 0.8 | 20 | 1860 | 2324 | 4.48 | -0.69(13%) |
gpt-3.5-turbo | attention | full | 0.8 | 0.8 | 20 | 595 | 2651 | 5.09 | +0.03(0.5%) |
gpt-3.5-turbo | attention rescore | full | 0.8 | 0.8 | 20 | 568 | 2502 | 4.69 | +0.07(1.5%) |
gpt-3.5-turbo | ctc greedy search | full | 0.8 | 0.8 | 20 | 519 | 2798 | 5.20 | +0.03(0.6%) |
gpt-3.5-turbo | ctc prefix search | full | 0.8 | 0.8 | 20 | 433 | 2785 | 5.21 | +0.04(0.7%) |
gpt-4o | attention | full | 0.2 | 0.8 | 20 | 1715 | 2295 | 4.32 | -0.74(14.6%) |
gpt-4o | attention rescore | full | 0.2 | 0.8 | 20 | 1736 | 2153 | 4.01 | -0.61(13%) |
gpt-4o | ctc greedy search | full | 0.2 | 0.8 | 20 | 2075 | 2272 | 4.06 | -1.11(21%) |
gpt-4o | ctc prefix search | full | 0.2 | 0.8 | 20 | 2137 | 2234 | 4.06 | -1.11(21%) |
Two-stage | Decode | Chunk | Temp | Top p | Task num | Changed Sentence | Error Sentence | CER(%) | Change |
---|---|---|---|---|---|---|---|---|---|
- | attention rescore | 16 | - | - | - | - | 1715 | 5.57 | - |
gpt-4o | attention rescore | 16 | 0.8 | 0.2 | 20 | 896 | 1517 | 4.95 | -0.62(11%) |
deepseekv2 | attention rescore | 16 | 0.8 | 0.8 | 20 | 223 | 1510 | 6.09 | +0.52(9%) |
gpt-3.5-turbo | attention rescore | 16 | 0.8 | 0.4 | 20 | 389 | 1715 | 5.65 | +0.08(1.5%) |
Two-stage | Decode | Chunk | Temp | Top p | Task num | test-clean WER(%) | test-clean Change | test-other WER(%) | test-other Change |
---|---|---|---|---|---|---|---|---|---|
- | attention | full | - | - | - | 3.82 | - | 8.79 | - |
- | attention rescore | full | - | - | - | 3.35 | - | 8.77 | - |
- | ctc greedy search | full | - | - | - | 3.77 | - | 9.52 | - |
- | ctc prefix search | full | - | - | - | 3.75 | - | 9.50 | - |
deepseekv2 | attention | full | 0.8 | 0.8 | 10 | 4.58 | +0.76(19%) | 10.11 | +1.32(15%) |
deepseekv2 | attention rescore | full | 0.8 | 0.8 | 10 | 4.21 | +0.86(25%) | 10.42 | +1.65(18%) |
deepseekv2 | ctc greedy search | full | 0.8 | 0.8 | 10 | 4.45 | +0.68(18%) | 10.20 | +0.68(7%) |
deepseekv2 | ctc prefix search | full | 0.8 | 0.8 | 10 | 4.93 | +1.18(31%) | 9.97 | +0.47(4.9%) |
gpt-4o | attention | full | 0.8 | 0.8 | 10 | 3.64 | -0.18(4.7%) | 8.32 | -0.47(5.3%) |
gpt-4o | attention rescore | full | 0.8 | 0.8 | 10 | 3.19 | -0.16(4.7%) | 8.38 | -0.39(4.4%) |
gpt-4o | ctc greedy search | full | 0.8 | 0.8 | 10 | 3.43 | -0.34(9%) | 8.45 | -1.07(11.2%) |
gpt-4o | ctc prefix search | full | 0.8 | 0.8 | 10 | 3.58 | -0.17(4.5%) | 8.41 | -1.09(11.4%) |
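
The Change columns in the tables above can be read as the absolute difference from the baseline error rate for the same decoding mode, followed by that difference relative to the baseline in parentheses. The sketch below reproduces that arithmetic; `change` and `format_change` are hypothetical helper names, and the reading of the column is inferred from the numbers rather than stated by the authors.

```python
def change(baseline, corrected):
    """Return (absolute change, relative change in percent of baseline)."""
    delta = round(corrected - baseline, 2)
    relative = abs(delta) / baseline * 100
    return delta, relative


def format_change(baseline, corrected):
    """Format as in the tables, e.g. '-0.37(7.3%)'."""
    delta, relative = change(baseline, corrected)
    return f"{delta:+.2f}({relative:.1f}%)"


if __name__ == "__main__":
    # Baseline attention decoding (CER 5.06) vs. deepseekv2 (CER 4.69).
    print(format_change(5.06, 4.69))
    # Baseline attention decoding (CER 5.06) vs. gpt-4o (CER 4.32).
    print(format_change(5.06, 4.32))
```

Some percentages in the tables appear to be truncated rather than rounded, so a recomputed value may differ from the printed one in its last digit.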
Thanks to wenet for providing tools and pre-trained models.