
T5 model, large difference in results when remove_input_padding is enabled #1999

Closed
ogaloglu opened this issue Jul 22, 2024 · 37 comments
@ogaloglu
Contributor

ogaloglu commented Jul 22, 2024

System Info

Container: nvidia/cuda:12.4.1-devel-ubuntu22.04
GPU: L4
TensorRT-LLM version: 0.12.0.dev2024071600

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Follow the steps for installation
  2. Download weights of t5-small: git clone https://huggingface.co/t5-small tmp/hf_models/t5-small
  3. Convert weights from HF format to TRT-LLM format and build TRT engines with the following parameters:
export MODEL_NAME="t5-small"
export MODEL_TYPE="t5"
export INFERENCE_PRECISION="bfloat16"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1
export MAX_BEAM_WIDTH=1
export GPUS_PER_NODE=1
export PAGED_KV_CACHE="disable"
export MAX_BATCH_SIZE=32
export MAX_INPUT_LEN=512
export MAX_SEQ_LEN=201
export REMOVE_INPUT_PADDING="enable"
  4. Update input_text with the following content:
 input_text = [
            'Die Schale mit Murtlap-Essenz fiel zu Boden und zerbrach.',
            'Harry wurde bewusst, dass er auf den Beinen war, obwohl er sich nicht erinnern konnte, aufgestanden zu sein.',
            'Sie hatten das Ende der Reihe erreicht und traten wieder in trübes Kerzenlicht.',
            '»Wir – das heißt deine Tante, Dudley und ich – wir gehen aus.«',
            'Einen Moment lang geschah nichts, Filch starrte Cho zornfunkelnd an und Cho erwiderte seinen Blick nicht minder zornig, dann machte der Hausmeister kehrt und schlurfte zur Tür zurück.',
            'Die Hand auf der Klinke, hielt er inne und blickte sich zu Harry um.',
            '»Ja, er hat die Sache für mich rumgerissen«, bestätigte Harry.',
            'Er wusste, es würde äußerst undankbar, wenn nicht gar kindisch klingen, wenn er sagte: »Hätte er doch nur mal mit mir gesprochen.',
            'Oder mich wenigstens angesehen.«',
            'Sie machte eine Miene, als hätte man sie gezwungen, Stinksaft zu schlucken, und ließ ihre Tasche wieder zuschnappen.',
            '»Ich hab gesagt aufschlussreich, nicht gut«, sagte Hermine.', '»Sie hat vieles erklärt.«',
            '»Gut!«, nuschelte Hagrid und stand auf, hielt sich mit der einen Hand seine blutende Nase zu und mit der anderen seine Armbrust umklammert.',
            "»Also ... das war's dann ... ihr habt ihn kennen gelernt und – und er kennt euch jetzt, wenn ihr wieder kommt.", 'Jaah ... nun ...«',
            '»Mit niemandem -«, sagte Harry und versuchte sich ihrem Griff zu entziehen.',
            '»Von wem ist eigentlich der Brief?«, fragte Ron und nahm Harry das Blatt aus der Hand.',
            '»Ich – ähm – hab gehört, sie geht mit jemand anderem«, sagte Hermine behutsam.',
            '»Ja ... nun ...«, sagte Professor Umbridge, während Harry, Ron und Hermine sich so langsam wie möglich die Marmortreppe hochschleppten, »lassen Sie uns ins Lehrerzimmer gehen.',
            'Ich möchte meinen, Sie könnten eine Tasse Tee vertragen nach Ihrer Reise.«', 'Okay, gehen wir.', 'Locomotor Koffer.«',
            '»wie finden Sie als zeitweiliges Mitglied des Kollegiums - als neutrale Außenstehende, wie man vielleicht sagen könnte -, wie finden Sie Hogwarts?',
            '»Sie haben es selbst gesehen, was soll das heißen?«, sagte Professor McGonagall und ihre dunklen Augenbrauen zogen sich zusammen.',
            'Sie mussten im Laufschritt gehen, um mitzuhalten, während er über den Rasen marschierte und sich bei jedem zweiten Schritt umsah.',
            'Als sie zu seiner Hütte gelangten, wandte sich Hermine wie selbstverständlich nach links zur Vordertür.',
            '»Der Tornados-Hasser?«, sagte Cho ziemlich kühl.', '»Taugt er was?«',
            'Als er sich Ron und Hermine gegenübersetzte, fiel sein Blick auf sein Spiegelbild im Fenster.',
            'Er war sehr weiß und seine Narbe schien sich deutlicher abzuzeichnen als sonst.',
            'Ron stand mit halb offenem Mund da, sichtlich bestürzt und vollkommen sprachlos, während Hermine den Tränen nahe schien.',
            'Mit hämmerndem Herzen blickte er auf.'
        ]
  5. Run the Python runtime.
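For reference, here is a minimal sketch of how the HF side of the comparison can be produced (assuming the Hugging Face transformers library; the TRT-LLM side comes from the Python runtime, e.g. examples/enc_dec/run.py, fed the same input_text):

from transformers import AutoTokenizer, T5ForConditionalGeneration

# Use the full input_text list from step 4; a single sentence is shown here for brevity.
input_text = ['Die Schale mit Murtlap-Essenz fiel zu Boden und zerbrach.']

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(input_text, return_tensors="pt", padding=True)
# max_new_tokens roughly matches MAX_SEQ_LEN above; greedy decoding (num_beams=1).
output_ids = model.generate(**inputs, max_new_tokens=201, num_beams=1)
hf_outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

for src, out in zip(input_text, hf_outputs):
    print(f"IN : {src}\nOUT: {out}\n")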

Expected behavior

HF outputs and TRT-LLM outputs should be roughly the same. As a side note, this is the case with FasterTransformer.

Actual behavior

TRT-LLM outputs differ significantly from HF outputs. Generation quality is clearly worse: repetitions, outputs consisting only of special tokens, etc.

Additional notes

I realized just before submitting the issue that the TRT-LLM and HF results are much closer to each other when remove_input_padding is disabled, so I changed the title accordingly. Additionally, similar to previously reported issues, I notice that the discrepancy between TRT-LLM and HF outputs increases as the batch size grows.

ogaloglu added the bug label Jul 22, 2024
symphonylyh self-assigned this Jul 30, 2024
@0xd8b

0xd8b commented Jul 31, 2024

After converting the T5 model with "remove input padding" set to false and a maximum batch size of 8, I see slight differences in results when running inference with batch sizes of 1, 4, and 8 in float32; in float16 the differences are even more pronounced. In theory, inference results should be consistent across batch sizes. Is this a normal phenomenon?

@thanhlt998

thanhlt998 commented Aug 1, 2024

I faced the same issue while running the latest version of the TensorRT-LLM Triton backend: the output is not consistent and does not match the HF output.
When I load-test the generate API, shuffling the inputs changes the outputs. Sometimes the output is entirely a string of pad tokens like <pad><pad><pad><pad><pad>...<pad>, while the output is normal with FasterTransformer.

@symphonylyh
Collaborator

@0xd8b @thanhlt998 Thanks for reporting this! It might be a bug introduced in recent versions. We'll investigate this soon.

By the way, if I remember correctly you have been actively using enc-dec TRT-LLM for a while. Do you have any additional info on whether this is a regression, i.e. did everything work without problems on your previous TRT-LLM version? If so, can you share which version that was?

@thanhlt998

thanhlt998 commented Aug 1, 2024

@symphonylyh Yes! As far as I remember, it worked normally on TRT-LLM version 0.11.0.dev2024061800.

@thanhlt998

@symphonylyh Have you found the cause of the issue?

@symphonylyh
Collaborator

@thanhlt998 We found that (1) the issue only appears for batch size >= 18, (2) the issue is token-specific, i.e. it only happens for specific tokens, and (3) it is due to some NaN elements after the 1st layer.

That's the progress so far -- we're still investigating the root cause. By the way, we found the 0604 version also had the same issue (I noticed you have since changed the version to 0618; we can try that as well).
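As a sanity check on the HF reference side (a rough sketch assuming the public t5-small checkpoint and standard transformers/PyTorch forward hooks; the TRT-LLM-side NaNs themselves have to be inspected at the engine level), per-layer encoder outputs can be checked for NaNs like this:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small", torch_dtype=torch.bfloat16)

def nan_hook(name):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if torch.isnan(hidden).any():
            print(f"NaN detected in {name}")
    return hook

# Register a hook on every encoder block so NaNs are reported per layer.
for i, block in enumerate(model.encoder.block):
    block.register_forward_hook(nan_hook(f"encoder.block.{i}"))

# Batch size >= 18 mirrors the failing condition reported above.
batch = tokenizer(["Die Schale mit Murtlap-Essenz fiel zu Boden und zerbrach."] * 18,
                  return_tensors="pt", padding=True)
model.generate(**batch, max_new_tokens=8)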

@thanhlt998

@symphonylyh Let me know if there are any new insights. I am very eager to apply TensorRT-LLM for enc-dec in production.

@ogaloglu
Contributor Author

ogaloglu commented Aug 6, 2024

@thanhlt998 Just to share my own experience: When I use float32 and disable remove_input_padding, I see almost no discrepancy between HF outputs and TRT-LLM outputs with Python runtime (max_batch_size=32).

@thanhlt998

@ogaloglu I know, but I am eager to use the in-flight batching feature, not the dynamic batching feature. With the parameters above, it is essentially the same as FasterTransformer with dynamic batching.

@ogaloglu
Contributor Author

ogaloglu commented Aug 6, 2024

@thanhlt998 I see, I got your point! Just want to add that I can actually use bfloat16 in FasterTransformer without having this discrepancy in the outputs.

@thanhlt998

@symphonylyh Do you have any progress here?

@symphonylyh
Collaborator

@thanhlt998 @ogaloglu @0xd8b Root caused and being fixed internally now! It was due to the padding_offset setup in gptAttentionCommon. Basically, we need one padding_offset for the encoder input lengths and a separate one for the decoder output lengths. Otherwise it fails under certain edge cases when length(1st input seq in the batch) < batch size (and only under a certain config: dtype != fp32 && remove_padding mode)...
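To illustrate the idea (a purely conceptual Python sketch, not the actual gptAttentionCommon code; the lengths below are made up): in remove-padding mode each packed token carries an offset equal to the number of padding slots that precede it in the padded layout, so the table built from encoder input lengths differs from the one built from decoder output lengths and the two cannot be shared.

def build_padding_offsets(seq_lens, max_len):
    """Per-token padding offsets for a packed (remove_input_padding) layout."""
    offsets, cum_pad = [], 0
    for length in seq_lens:
        offsets.extend([cum_pad] * length)   # all tokens of this sequence share one offset
        cum_pad += max_len - length          # padding slots this sequence would have used
    return offsets

encoder_lens = [7, 3, 5]   # hypothetical encoder input lengths in a batch of 3
decoder_lens = [1, 1, 1]   # hypothetical decoder lengths at a generation step

enc_offsets = build_padding_offsets(encoder_lens, max(encoder_lens))
dec_offsets = build_padding_offsets(decoder_lens, max(decoder_lens))
print(enc_offsets)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4]
print(dec_offsets)  # [0, 0, 0] -- clearly not interchangeable with enc_offsets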

Expect a fix in next week's main branch update!

@thanhlt998

@symphonylyh Looking forward to your fix next week!

@thanhlt998

@symphonylyh When will the fixes land on the main branch this week?

@symphonylyh
Collaborator

@thanhlt998 @ogaloglu @0xd8b Updated on the main branch now! Closing the issue. Please verify.

@thanhlt998

thanhlt998 commented Aug 21, 2024

@symphonylyh I am trying your fixes with the TensorRT-LLM Triton backend at the latest version on the main branch, but I cannot launch the engines with the Triton backend after serializing the encoder and decoder engines. I suspected Triton 24.07 was the cause, but when I build the Docker image with Triton 24.05 (which launched the model normally before your fixes), it still raises the same error:

[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
I0821 15:04:54.552435 2226 pb_stub.cc:2121]  Non-graceful termination detected.
I0821 15:04:54.552446 2227 pb_stub.cc:2121]  Non-graceful termination detected.
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 2420ae08cfa8 exited on signal 8 (Floating point exception).

So I cannot verify your fixes now :(( The same problem is reported here.

@symphonylyh
Collaborator

@thanhlt998 You can always follow enc_dec/README.md and use examples/run.py or examples/enc_dec/run.py to verify, just by changing the input text.

The Triton backend error you saw should be either a version or setup issue on your end, because we have tested end-to-end and didn't see such an error. Maybe you can verify with the non-Triton path first?

@thanhlt998

@symphonylyh Yes, I successfully ran the TensorRT engines with the C++ backend using examples/enc_dec/run.py in the built Docker container, and I can always follow enc_dec/README.md. But I cannot launch the engines with the Triton backend.

@symphonylyh
Collaborator

@thanhlt998 I see. It looks like a general Triton issue, not limited to enc-dec. We'll need to wait for the backend team to investigate and fix it.

@ogaloglu
Contributor Author

@symphonylyh Thanks a lot; I can confirm that, for this input_text, there is no difference in the results regardless of the remove_input_padding parameter (bfloat16 and the Python runtime are used)! I can also confirm that the difference between HF and TRT outputs is much smaller now. Still, I would like to share one observation in case it helps identify unexpected behavior.

So, the following experiments are still based on the input_text list that I shared above. And I am using the same parameters:

export MODEL_NAME="t5-small"
export MODEL_TYPE="t5"
export INFERENCE_PRECISION="bfloat16"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1
export MAX_BEAM_WIDTH=1
export GPUS_PER_NODE=1
export PAGED_KV_CACHE="disable"
export MAX_BATCH_SIZE=32
export MAX_INPUT_LEN=512
export MAX_SEQ_LEN=201
export REMOVE_INPUT_PADDING="enable"

When I pass a batch with the first two sentences of input_text, the HF and TRT results are exactly the same. But when I pass a batch with more than 4 sentences from input_text, the result for the second sentence is very different.
This is the sentence in question: 'Harry wurde bewusst, dass er auf den Beinen war, obwohl er sich nicht erinnern konnte, aufgestanden zu sein.'

HF output (basically the same as the input):
'Harry wurde bewusst, dass er auf den Beinen war, dass er sich nicht erinnern konnte, aufgestanden zu sein.’

TRT output:
'was a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man’

Can there still be some token-specific problems? Thanks a lot in advance! Please let me know if you think I should create a new issue for this instead.

@symphonylyh
Collaborator

@ogaloglu thanks for the info. We can continue using this issue to investigate. Reopened now

symphonylyh reopened this Aug 22, 2024
@ogaloglu
Contributor Author

@symphonylyh thank you! Looking forward to your updates!

@thanhlt998

@symphonylyh I've tested your fixes with the Triton backend; it still has the same issues as before, but with lower frequency. Looking forward to your insights on these issues!

@symphonylyh
Collaborator

@ogaloglu The "a man, a man" issue appears to occur only with bfloat16. When you switch to float16, it seems fine. This is not a fix yet, just sharing findings.

@thanhlt998

@symphonylyh In my experiments I only used an RTX 2080 Ti (which supports float16 only), so I don't think float16 vs. bfloat16 is the cause of this issue.

@ogaloglu
Contributor Author

@symphonylyh thank you for the insight! Then, I will run some experiments early next week and share the outcomes.

@symphonylyh
Collaborator

@0xd8b @ogaloglu @thanhlt998 good news!

The main issue regarding remove padding has been fixed and will be released next week.
Explanation: when handling remove-padding mode and splitting a [num_tokens, hidden_size * 3] QKV into [BS, seq_len, hidden_size] Q/K/V, the Q/K/V buffers are uninitialized, so those halo elements hold random values. The fix is to memset them at the beginning, just like the GPTAttention plugin does here.
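A small numpy analogue of that explanation (illustrative only; the real fix is the cudaMemset in the C++ plugin, and the shapes here are made up): scattering packed [num_tokens, hidden] data into a padded per-sequence buffer leaves the padded "halo" rows holding whatever was already in memory unless the buffer is zeroed first.

import numpy as np

bs, max_len, hidden = 2, 4, 8
seq_lens = [4, 2]                              # second sequence leaves 2 padded "halo" slots
packed = np.random.randn(sum(seq_lens), hidden)

buggy = np.empty((bs, max_len, hidden))        # uninitialized workspace: stale values remain
fixed = np.zeros((bs, max_len, hidden))        # zero-initialized, mirroring the memset fix

tok = 0
for b, length in enumerate(seq_lens):
    buggy[b, :length] = packed[tok:tok + length]
    fixed[b, :length] = packed[tok:tok + length]
    tok += length

print("halo rows (uninitialized):\n", buggy[1, 2:])   # arbitrary stale values
print("halo rows (zeroed):\n", fixed[1, 2:])          # all zeros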

If you want to verify locally first, you can search for invokeAddFusedQKVBiasTranspose in bertAttentionPlugin.cpp, copy the cudaMemset call (link above), and put it before invokeAddFusedQKVBiasTranspose.

This should solve the failure seen with IFB + remove padding when BS is large.

The "a man, a man" issue is still somehow vague, but I will suggest @ogaloglu to switch to float16 because I saw the "a man" in bfloat16 but switching to float16 gives 100% match with HF. This cause must be something else and I would suggest filing another bug.

The more fundamental issue should be fixed by the solution in my comment here, and will be released next week

@thanhlt998

@symphonylyh Great to hear you have a solution for this issue. I am looking forward to your update next week!

@thanhlt998

@symphonylyh I have tested the fixes following your instructions. After a load test with 999 samples (randomly shuffled):

  • TRT-LLM vs FasterTransformer: EM = 938/999 samples, avg_fuzzy_score = 98.65
  • TRT-LLM (1st time) vs TRT-LLM (2nd time): EM = 985/999 samples, avg_fuzzy_score = 99.76

The results are much better than before the fixes, but there are still slight differences between TRT-LLM and FasterTransformer, and between two different runs.
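(For context, a sketch of how such numbers might be computed, assuming two lists of decoded strings; the exact scoring used above isn't stated, so stdlib difflib stands in for the fuzzy metric here.)

from difflib import SequenceMatcher

def exact_match_and_fuzzy(outputs_a, outputs_b):
    em = sum(a == b for a, b in zip(outputs_a, outputs_b))
    fuzzy = [SequenceMatcher(None, a, b).ratio() * 100 for a, b in zip(outputs_a, outputs_b)]
    return em, sum(fuzzy) / len(fuzzy)

# Toy stand-ins; in the experiment above these would be the 999 decoded outputs
# from TRT-LLM and FasterTransformer (or from two TRT-LLM runs).
outputs_a = ["the quick brown fox", "jumps over", "the lazy dog"]
outputs_b = ["the quick brown fox", "jumps over it", "the lazy dog"]
em, avg_fuzzy = exact_match_and_fuzzy(outputs_a, outputs_b)
print(f"EM = {em}/{len(outputs_a)}, avg_fuzzy_score = {avg_fuzzy:.2f}")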

@symphonylyh
Collaborator

@thanhlt998 Do you have a concrete input text to reproduce the difference? So far we think the TRT-LLM results are correct and a good match with HF, while an exact 100% match cannot be guaranteed among TRT-LLM, FT, and HF because the kernels they select are indeed different.

But if you find weird output, please post it here with a reproducer and we'll investigate.

@thanhlt998

@symphonylyh I think it's fine for non-structured output tasks. However, I have tested it on tasks with structured outputs, like function calling, and the TRT-LLM output causes accuracy to drop by 10-30% compared to FasterTransformer on the same test set.
I think there are still problems with structured-output tasks.

@symphonylyh
Collaborator

@thanhlt998 Understood, but do you have a model name + reproduction text? We can't investigate the accuracy issue without that info.

And in this case it makes sense to create a separate issue for better tracking. Can you file one?

@thanhlt998

@symphonylyh I use a private model that follows the same config as the UL2 model architecture (with silu instead of relu activation), which raises the issues above. I will find a way to reproduce these issues with an open-source model and let you know.

@thanhlt998

@symphonylyh You can reproduce the bug with model ul2-base-en-nl, with the inputs:

input_text = [
"<|im_start|>user[NEWLINE]How would you approach solving global income inequality without destabilizing existing economic systems?<|im_end|>[NEWLINE]<|im_start|>assistant[NEWLINE]",
"<|im_start|>user[NEWLINE]What are the potential consequences, both positive and negative, of achieving a fully cashless global society?<|im_end|>[NEWLINE]<|im_start|>assistant[NEWLINE]",
"<|im_start|>user[NEWLINE]If you could witness any event in history, which one would you choose?<|im_end|>[NEWLINE]<|im_start|>assistant[NEWLINE]"
]

The outputs after running run.py:

['1. Hoe zou u benaderen om de wereldwijde inkomensongelijkheid op te lossen zonder bestaande economische systemen te destabiliseren? 2. _Openbaar_ _eind_ _eind_ _eind_ _starttoezicht_ _eind_ _eind_ _toezic
ht_ _toezicht_ _', 'Wat zijn de mogelijke gevolgen, zowel positief als negatief, van het bereiken van een volledig liquide wereldgemeenschap? _****eind**********ditmaal************vanaf*********************
*************************']
['1. Hoe zou u het benaderen om de wereldwijde inkomensongelijkheid op te lossen zonder bestaande economische systemen te destabiliseren? 2. Oplossen van de 1. Ongeacht de 1. Ongeacht de 2. Ongeacht de 3.',
 'Wat zijn de mogelijke gevolgen, zowel positief als negatief, van het bereiken van een volledig liquide wereldgemeenschap? _**oneindig**_**-**************vanuiteindig** _**toeval**_************vanuiteindig
** _**toeval']

[CAVEAT] Comparing TRT-LLM float16 results with HF float32 results. Close match are not expected!
Traceback (most recent call last):
  File "/tensorrtllm_backend/tensorrt_llm/examples/enc_dec/run.py", line 386, in <module>
    assert match_rate > 0.8, f"Incorrect results! Match rate {match_rate}"
AssertionError: Incorrect results! Match rate 0.618096357226792

@thanhlt998

@symphonylyh Were you able to reproduce the bug with the information provided above? Do you have any insights?

@thanhlt998

@symphonylyh Do you have any progress here?


@0xd8b

0xd8b commented Dec 19, 2024

@symphonylyh We converted the T5 model using TensorRT-LLM with an input sequence length of 2048, in float16, using both the GPT and BERT plugins. When we built the engine with max_batchsize set to 3 and ran inference with batch size 1, the results were normal. However, when we built the engine with max_batchsize set to 1 and kept batch size 1 for inference, the results were abnormal. We have identified that the issue is with the GPT plugin, but since we have retrained the model, we cannot provide a sample to reproduce it. We are using TensorRT-LLM version 0.9 and would like to ask for guidance on troubleshooting. We look forward to your reply.
