Difficulty understanding sequence length and context length #4621
Replies: 1 comment
-
EDIT: I think it's the interface (https://github.com/mckaywrigley/chatbot-ui) I'm using that isn't sending the tokens correctly. I'll have a look at their GitHub.
EDIT 2: Never mind, I did not use the slider.
-
Hello,
Since this morning I've been trying to play with the Phi3-mini-128k model, which in theory should give a context length of about 128k tokens. vLLM picks up the sequence length correctly from the config.json, as shown below, with the parameter `max_seq_len=131072`. However, it turns out that the model responds with `<s>` once the sequence reaches 4k tokens, which is annoying; this token corresponds to the BOS_TOKEN. I've tried increasing the size of the sequence captured by the CUDA graphs and tried increasing `max-num-batched-tokens` to 131072, but nothing helps.
I don't quite understand how to manage my parameters to achieve this sequence length. I'm using the Docker image vllm-openai:v0.4.2, and here's my command:
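The actual command isn't reproduced in the thread. As a rough sketch of what a long-context launch with that image could look like (an assumption, not the poster's command; the model name, ports, and flag values below are illustrative):

```bash
# Illustrative sketch only: launching the vLLM OpenAI-compatible server with a
# 128k context window. Model name, ports, and flag values are assumptions.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:v0.4.2 \
  --model microsoft/Phi-3-mini-128k-instruct \
  --trust-remote-code \
  --max-model-len 131072 \
  --max-num-batched-tokens 131072
```

The flag controlling the sequence length captured by CUDA graphs has changed names across vLLM versions, so it's worth checking `--help` on the v0.4.2 image for the exact spelling before adding it.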
One response:
So what's the way to get long prompts with vLLM?
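For reference, once a server is running with a large `--max-model-len`, a long prompt is sent through the OpenAI-compatible endpoint like any other request; a minimal sketch, assuming a server like the one above is listening on localhost:8000:

```bash
# Minimal sketch: send a (potentially very long) prompt to the OpenAI-compatible
# /v1/chat/completions endpoint. Host, port, and model name are assumptions.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "microsoft/Phi-3-mini-128k-instruct",
        "messages": [{"role": "user", "content": "<your long prompt here>"}],
        "max_tokens": 256
      }'
```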