Difficulty understanding sequence length and context length #4621
Replies: 1 comment
-
EDIT: I think it's the interface (https://github.com/mckaywrigley/chatbot-ui) I'm using that isn't sending the tokens correctly. I'll have a look at their GitHub.
EDIT 2: Never mind, I did not use the slider.
-
Hello,
Since this morning I've been trying to play with the Phi3-mini-128k model, which in theory should give a context length of about 128k tokens. vLLM picks up the sequence length correctly from the config.json, as shown below, with the parameter `max_seq_len=131072`. However, it turns out that the model responds with `<s>` once the sequence reaches 4k tokens, which is annoying; this token corresponds to the BOS_TOKEN. I've tried increasing the size of the sequence captured by the CUDA graphs and tried increasing `max-num-batched-tokens` to 131072, but nothing helps.
I don't quite understand how to manage my parameters to achieve this sequence length. I'm using the Docker image vllm-openai:v0.4.2, and here's my command:
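The actual command isn't reproduced in the thread. As a rough sketch of what a long-context launch with that image could look like (an assumption, not the poster's command; the model name, ports, and flag values below are illustrative):

```bash
# Illustrative sketch only: launching the vLLM OpenAI-compatible server with a
# 128k context window. Model name, ports, and flag values are assumptions.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:v0.4.2 \
  --model microsoft/Phi-3-mini-128k-instruct \
  --trust-remote-code \
  --max-model-len 131072 \
  --max-num-batched-tokens 131072
```

The flag controlling the sequence length captured by CUDA graphs has changed names across vLLM versions, so it's worth checking `--help` on the v0.4.2 image for the exact spelling before adding it.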
One response:
So what's the way to get long prompts with vLLM?
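For reference, once a server is running with a large `--max-model-len`, a long prompt is sent through the OpenAI-compatible endpoint like any other request; a minimal sketch, assuming a server like the one above is listening on localhost:8000:

```bash
# Minimal sketch: send a (potentially very long) prompt to the OpenAI-compatible
# /v1/chat/completions endpoint. Host, port, and model name are assumptions.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "microsoft/Phi-3-mini-128k-instruct",
        "messages": [{"role": "user", "content": "<your long prompt here>"}],
        "max_tokens": 256
      }'
```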