[Usage]: Seems nn.module definition may affect the output tokens. Don't know the reason. #4805

Open · Zhenzhong1 opened this issue May 14, 2024 · 3 comments
Labels
usage How to use vllm

Comments

@Zhenzhong1

Your current environment

Env: CPU device
vllm: 0.4.2+cpu

from vllm import LLM
import torch

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
outputs2 = llm2.generate(prompts)  # Generate texts from the prompts.

llm3 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
outputs3 = llm3.generate(prompts)  # Generate texts from the prompts.

print("outputs1 = ", outputs1)
print("outputs2 = ", outputs2)
print("outputs3 = ", outputs3)

For this code: as long as I define a torch.nn module after creating the current vLLM model (in the same scope, between LLM() and generate()), it affects the output tokens even though I never use it. In other words, if I move these unused nn modules above the LLM() definition, the results are not affected (see the sketch below).
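
For reference, here is a minimal sketch of the reordering I mean (same model path and the same unused Linear as above):

from vllm import LLM
import torch

prompts = ["你好"]

# Define the unused module *before* creating the LLM ...
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)

# ... then create the LLM and generate; with this ordering the output is not affected.
llm = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)
outputs = llm.generate(prompts)
print("outputs = ", outputs)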

llm1 gives the same output as llm2, because both define an nn.Module inside the current model's scope. llm3 is different because nothing extra is defined after it, and llm3 gives the correct result I want.

Shouldn't all three of them have the same result? Please check the outputs below.

Outputs:
outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是小助手 本AI 欢迎你随时向我提问,我会尽力回答', token_ids=[31123, 33030, 54603, 42481, 35786, 23833, 30910, 32616, 54622, 34498, 46993, 37817, 31123, 35094, 40328, 33287], cumulative_logprob=-17.481587450020015, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665805.6874118, last_token_time=1715665805.6874118, first_scheduled_time=1715665805.689108, first_token_time=1715665805.8463485, time_in_queue=0.0016961097717285156, finished_time=1715665806.759257), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是小助手 本AI 欢迎你随时向我提问,我会尽力回答', token_ids=[31123, 33030, 54603, 42481, 35786, 23833, 30910, 32616, 54622, 34498, 46993, 37817, 31123, 35094, 40328, 33287], cumulative_logprob=-17.481587450020015, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665811.4080832, last_token_time=1715665811.4080832, first_scheduled_time=1715665811.4091282, first_token_time=1715665811.539016, time_in_queue=0.0010449886322021484, finished_time=1715665812.7462144), lora_request=None)]
outputs3 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是 ChatGLM2-6B, 我是基于大型语言模型', token_ids=[31123, 33030, 22011, 10461, 30944, 30943, 30941, 30978, 30949, 31123, 30910, 33030, 33053, 32997, 32330, 34030], cumulative_logprob=-8.741462323308497, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665822.238591, last_token_time=1715665822.238591, first_scheduled_time=1715665822.2395456, first_token_time=1715665822.5107977, time_in_queue=0.0009546279907226562, finished_time=1715665823.461715), lora_request=None)]

Besides, if I change the output features of the torch.nn module, it also affects the output tokens.

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=8888, bias=True, dtype=torch.bfloat16)
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.
print(outputs1)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=9999, bias=True, dtype=torch.bfloat16)
outputs2 = llm2.generate(prompts)

I only changed out_features between the two runs, but the results are different.
Outputs:

outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',是一名人工智能助手。 \n\n如果你需要帮助,请告诉我具体问题', token_ids=[31123, 38628, 34797, 42481, 31155, 30910, 13, 13, 32763, 31665, 31934, 30932, 55073, 38953, 32149, 31639], cumulative_logprob=-21.3015581928193, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715666711.2086165, last_token_time=1715666711.2086165, first_scheduled_time=1715666711.2102835, first_token_time=1715666711.3079636, time_in_queue=0.001667022705078125, finished_time=1715666712.208443), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',小河流段便会非常活跃。很多体载货物的鱼类 difficult,', token_ids=[31123, 54603, 36773, 55005, 42237, 31685, 35203, 31155, 31679, 54618, 55387, 55466, 34090, 49426, 2529, 30932], cumulative_logprob=-96.62851423444226, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715666716.799589, last_token_time=1715666716.799589, first_scheduled_time=1715666716.8003457, first_token_time=1715666716.8765712, time_in_queue=0.0007567405700683594, finished_time=1715666718.0433056), lora_request=None)]

As you can see, I never actually use these nn modules, but they do affect the results. I have provided five outputs above, and the results change depending only on the nn.Module definitions.

Need some help. Thank you!

How would you like to use vllm

Seems nn.module definition may affect the output tokens. Don't know the reason.

Zhenzhong1 added the usage (How to use vllm) label on May 14, 2024
@simon-mo
Collaborator

This is quite interesting. Can you double check by setting seed?
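
For example, something along these lines (just a sketch; using temperature=0 for greedy decoding is an extra suggestion to rule out sampling randomness):

from vllm import LLM, SamplingParams
import torch

prompts = ["你好"]

# Greedy decoding removes sampling randomness; the seed pins the remaining RNG state.
params = SamplingParams(temperature=0.0, max_tokens=16)

llm = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=0)
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
outputs = llm.generate(prompts, params)
print(outputs)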

@youkaichao
Member

If this is real, I suspect it has something to do with a memory leak and the PyTorch caching allocator. Maybe we leaked some object reference, and when you create a new nn module, the caching allocator recycles memory it thinks is no longer used but that is actually still in use somewhere?

I might be wrong anyway. If this is the case, the root cause would be quite difficult to debug.
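
One way to probe that hypothesis (just a sketch, not verified): explicitly drop the unused Linear and force a garbage collection before generating, and see whether the output still changes.

import gc

import torch
from vllm import LLM

prompts = ["你好"]
llm = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)

# Create the unrelated module, then drop the reference and collect before generating.
# If the output still differs from a run that never creates the Linear, reuse of the
# freed weight memory would be a plausible suspect.
lin = torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
del lin
gc.collect()

outputs = llm.generate(prompts)
print(outputs)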

@Zhenzhong1
Author

@simon-mo Hi

from vllm import LLM
import torch

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=8888, bias=True, dtype=torch.bfloat16)
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.
print(outputs1)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=9999, bias=True, dtype=torch.bfloat16)
outputs2 = llm2.generate(prompts)  # Generate texts from the prompts.

llm3 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM.
outputs3 = llm3.generate(prompts)  # Generate texts from the prompts.

print("outputs1 = ", outputs1)
print("outputs2 = ", outputs2)
print("outputs3 = ", outputs3)

I set the same seed, but it still produces three different results. Actually, LLM() already has a default seed (seed: int = 0).

outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=', p更 爱 你 要 是 你 要 是 你 要 是 你 要 是', token_ids=[31123, 281, 54664, 47802, 36474, 43159, 35369, 36474, 43159, 35369, 36474, 43159, 35369, 36474, 43159, 35369], cumulative_logprob=-41.74734868388623, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824550.3473322, last_token_time=1715824550.3473322, first_scheduled_time=1715824550.3491716, first_token_time=1715824555.3297749, time_in_queue=0.0018393993377685547, finished_time=1715824620.9681613), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='老师和同学们,今天我带了人民调解委员会调解费收据 我不知道', token_ids=[42116, 32812, 31123, 31869, 54546, 54882, 54537, 31657, 36122, 32007, 36122, 55000, 54821, 54830, 34211, 32522], cumulative_logprob=-43.803544878959656, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824629.7847252, last_token_time=1715824629.7847252, first_scheduled_time=1715824629.7856104, first_token_time=1715824633.9895625, time_in_queue=0.0008852481842041016, finished_time=1715824653.5920393), lora_request=None)]
outputs3 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是人工智能助手。 根据用户名登录后,我的作用是提供咨询', token_ids=[31123, 33030, 34797, 42481, 31155, 47383, 32053, 54653, 36782, 54585, 31123, 31791, 31827, 54532, 31692, 32539], cumulative_logprob=-32.18759796023369, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824663.3346176, last_token_time=1715824663.3346176, first_scheduled_time=1715824663.3352196, first_token_time=1715824663.549846, time_in_queue=0.0006020069122314453, finished_time=1715824664.6953938), lora_request=None)]
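
For what it's worth, the generated token IDs can also be compared directly instead of eyeballing the full RequestOutput dumps (a sketch that reuses the outputs1/2/3 variables from the script above):

# Compare only the generated token IDs from each run.
ids1 = outputs1[0].outputs[0].token_ids
ids2 = outputs2[0].outputs[0].token_ids
ids3 = outputs3[0].outputs[0].token_ids
print("outputs1 == outputs2:", ids1 == ids2)
print("outputs1 == outputs3:", ids1 == ids3)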
