[Usage]: Seems nn.module definition may affect the output tokens. Don't know the reason. #4805

Open · Zhenzhong1 opened this issue May 14, 2024 · 3 comments
Labels
usage How to use vllm

Comments

@Zhenzhong1

Your current environment

Env: CPU device
vllm: 0.4.2+cpu

from vllm import LLM
import torch

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
outputs2 = llm2.generate(prompts)  # Generate texts from the prompts.

llm3 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
outputs3 = llm3.generate(prompts)  # Generate texts from the prompts.

print("outputs1 = ", outputs1)
print("outputs2 = ", outputs2)
print("outputs3 = ", outputs3)

For this code: as long as I define a torch.nn module after creating the current vLLM model (in the same scope, between LLM() and generate()), it affects the output tokens even though I never use it. In other words, if I move these unused nn modules above the LLM() definition, the results are not affected (see the sketch below).
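
For reference, here is a minimal sketch of the reordering I mean (same model path and the same unused Linear as above):

from vllm import LLM
import torch

prompts = ["你好"]

# Define the unused module *before* creating the LLM ...
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)

# ... then create the LLM and generate; with this ordering the output is not affected.
llm = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)
outputs = llm.generate(prompts)
print("outputs = ", outputs)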

llm1 gives the same output as llm2, because both define an nn.Module inside the current model's scope. llm3 is different because nothing extra is defined after it, and llm3 gives the correct result I want.

Shouldn't all three of them have the same result? Please check the outputs below.

Outputs:
outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是小助手 本AI 欢迎你随时向我提问,我会尽力回答', token_ids=[31123, 33030, 54603, 42481, 35786, 23833, 30910, 32616, 54622, 34498, 46993, 37817, 31123, 35094, 40328, 33287], cumulative_logprob=-17.481587450020015, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665805.6874118, last_token_time=1715665805.6874118, first_scheduled_time=1715665805.689108, first_token_time=1715665805.8463485, time_in_queue=0.0016961097717285156, finished_time=1715665806.759257), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是小助手 本AI 欢迎你随时向我提问,我会尽力回答', token_ids=[31123, 33030, 54603, 42481, 35786, 23833, 30910, 32616, 54622, 34498, 46993, 37817, 31123, 35094, 40328, 33287], cumulative_logprob=-17.481587450020015, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665811.4080832, last_token_time=1715665811.4080832, first_scheduled_time=1715665811.4091282, first_token_time=1715665811.539016, time_in_queue=0.0010449886322021484, finished_time=1715665812.7462144), lora_request=None)]
outputs3 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是 ChatGLM2-6B, 我是基于大型语言模型', token_ids=[31123, 33030, 22011, 10461, 30944, 30943, 30941, 30978, 30949, 31123, 30910, 33030, 33053, 32997, 32330, 34030], cumulative_logprob=-8.741462323308497, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665822.238591, last_token_time=1715665822.238591, first_scheduled_time=1715665822.2395456, first_token_time=1715665822.5107977, time_in_queue=0.0009546279907226562, finished_time=1715665823.461715), lora_request=None)]

Besides, if I change the output features of the torch.nn module, it also affects the output tokens.

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=8888, bias=True, dtype=torch.bfloat16)
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.
print(outputs1)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=9999, bias=True, dtype=torch.bfloat16)
outputs2 = llm2.generate(prompts)

I only changed out_features between the two runs, but the results are different.
Outputs:

outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',是一名人工智能助手。 \n\n如果你需要帮助,请告诉我具体问题', token_ids=[31123, 38628, 34797, 42481, 31155, 30910, 13, 13, 32763, 31665, 31934, 30932, 55073, 38953, 32149, 31639], cumulative_logprob=-21.3015581928193, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715666711.2086165, last_token_time=1715666711.2086165, first_scheduled_time=1715666711.2102835, first_token_time=1715666711.3079636, time_in_queue=0.001667022705078125, finished_time=1715666712.208443), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',小河流段便会非常活跃。很多体载货物的鱼类 difficult,', token_ids=[31123, 54603, 36773, 55005, 42237, 31685, 35203, 31155, 31679, 54618, 55387, 55466, 34090, 49426, 2529, 30932], cumulative_logprob=-96.62851423444226, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715666716.799589, last_token_time=1715666716.799589, first_scheduled_time=1715666716.8003457, first_token_time=1715666716.8765712, time_in_queue=0.0007567405700683594, finished_time=1715666718.0433056), lora_request=None)]

As you can see, I never actually use these nn modules, but they do affect the results. I have provided five outputs above, and the results change depending only on the nn.Module definitions.

Need some help. Thank you!

How would you like to use vllm

Seems nn.module definition may affect the output tokens. Don't know the reason.

Zhenzhong1 added the usage (How to use vllm) label on May 14, 2024
@simon-mo
Collaborator

This is quite interesting. Can you double check by setting seed?
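
For example, something along these lines (just a sketch; using temperature=0 for greedy decoding is an extra suggestion to rule out sampling randomness):

from vllm import LLM, SamplingParams
import torch

prompts = ["你好"]

# Greedy decoding removes sampling randomness; the seed pins the remaining RNG state.
params = SamplingParams(temperature=0.0, max_tokens=16)

llm = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=0)
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
outputs = llm.generate(prompts, params)
print(outputs)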

@youkaichao
Member

If this is real, I suspect it has something to do with a memory leak and the PyTorch caching allocator. Maybe we leaked some object reference, and when you create a new nn module, the caching allocator recycles memory it thinks is no longer used but that is actually still in use somewhere?

I might be wrong anyway. If this is the case, the root cause would be quite difficult to debug.
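
One way to probe that hypothesis (just a sketch, not verified): explicitly drop the unused Linear and force a garbage collection before generating, and see whether the output still changes.

import gc

import torch
from vllm import LLM

prompts = ["你好"]
llm = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)

# Create the unrelated module, then drop the reference and collect before generating.
# If the output still differs from a run that never creates the Linear, reuse of the
# freed weight memory would be a plausible suspect.
lin = torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
del lin
gc.collect()

outputs = llm.generate(prompts)
print(outputs)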

@Zhenzhong1
Author

@simon-mo Hi

from vllm import LLM
import torch

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=8888, bias=True, dtype=torch.bfloat16)
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.
print(outputs1)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=9999, bias=True, dtype=torch.bfloat16)
outputs2 = llm2.generate(prompts)  # Generate texts from the prompts.

llm3 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM.
outputs3 = llm3.generate(prompts)  # Generate texts from the prompts.

print("outputs1 = ", outputs1)
print("outputs2 = ", outputs2)
print("outputs3 = ", outputs3)

I set the same seed, but it still produces three different results. Actually, LLM() already has a default seed (seed: int = 0).

outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=', p更 爱 你 要 是 你 要 是 你 要 是 你 要 是', token_ids=[31123, 281, 54664, 47802, 36474, 43159, 35369, 36474, 43159, 35369, 36474, 43159, 35369, 36474, 43159, 35369], cumulative_logprob=-41.74734868388623, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824550.3473322, last_token_time=1715824550.3473322, first_scheduled_time=1715824550.3491716, first_token_time=1715824555.3297749, time_in_queue=0.0018393993377685547, finished_time=1715824620.9681613), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='老师和同学们,今天我带了人民调解委员会调解费收据 我不知道', token_ids=[42116, 32812, 31123, 31869, 54546, 54882, 54537, 31657, 36122, 32007, 36122, 55000, 54821, 54830, 34211, 32522], cumulative_logprob=-43.803544878959656, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824629.7847252, last_token_time=1715824629.7847252, first_scheduled_time=1715824629.7856104, first_token_time=1715824633.9895625, time_in_queue=0.0008852481842041016, finished_time=1715824653.5920393), lora_request=None)]
outputs3 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是人工智能助手。 根据用户名登录后,我的作用是提供咨询', token_ids=[31123, 33030, 34797, 42481, 31155, 47383, 32053, 54653, 36782, 54585, 31123, 31791, 31827, 54532, 31692, 32539], cumulative_logprob=-32.18759796023369, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824663.3346176, last_token_time=1715824663.3346176, first_scheduled_time=1715824663.3352196, first_token_time=1715824663.549846, time_in_queue=0.0006020069122314453, finished_time=1715824664.6953938), lora_request=None)]
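
For what it's worth, the generated token IDs can also be compared directly instead of eyeballing the full RequestOutput dumps (a sketch that reuses the outputs1/2/3 variables from the script above):

# Compare only the generated token IDs from each run.
ids1 = outputs1[0].outputs[0].token_ids
ids2 = outputs2[0].outputs[0].token_ids
ids3 = outputs3[0].outputs[0].token_ids
print("outputs1 == outputs2:", ids1 == ids2)
print("outputs1 == outputs3:", ids1 == ids3)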
