
[Feature][Kernel] Support bitsandbytes quantization and QLoRA #4776

Merged
merged 13 commits into from
Jun 1, 2024

Conversation

chenqianfzh
Contributor

@chenqianfzh chenqianfzh commented May 12, 2024

QLoRA (https://arxiv.org/abs/2305.14314) cuts memory consumption when loading LLM weights without degrading performance. The base model's weights, quantized to 4 bits with bitsandbytes quantization, are paired with low-rank, higher-precision LoRA weight matrices to produce the output.
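As a rough illustration, here is a minimal PyTorch sketch of that idea (not the vLLM implementation; all names and shapes are made up for the example):

```python
import torch

def qlora_linear(x, w_base_dequant, lora_A, lora_B, scaling=1.0):
    """Illustrative QLoRA forward pass: frozen 4-bit base weight
    (shown here already dequantized for clarity) plus a small,
    higher-precision low-rank LoRA update."""
    base_out = x @ w_base_dequant.t()          # quantized base-model path
    lora_out = (x @ lora_A.t()) @ lora_B.t()   # low-rank adapter path
    return base_out + scaling * lora_out

# Toy shapes: hidden size 16, output size 32, LoRA rank 4.
x = torch.randn(2, 16)
w = torch.randn(32, 16)   # stands in for the dequantized 4-bit base weight
A = torch.randn(4, 16)    # LoRA down-projection
B = torch.zeros(32, 4)    # LoRA up-projection (zero-initialized)
print(qlora_linear(x, w, A, B).shape)  # torch.Size([2, 32])
```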

This PR is the first step toward supporting QLoRA in vLLM. With this PR, the QLoRA author's open models on Hugging Face are supported.

Users can run with or without a QLoRA adapter.

So far, only Llama is supported as a base model; more will follow. As explained below, special consideration was given to extensibility for future changes and other models.
Also, TP and PP are not yet supported with QLoRA; they will be the immediate next effort.
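A minimal usage sketch of what this looks like, based on the parameter described below (argument names other than qlora_adapter_name_or_path, and the model and adapter repos, are assumptions and may differ from the final code):

```python
from vllm import LLM, SamplingParams

# With a QLoRA adapter; leave qlora_adapter_name_or_path empty (or omit it)
# to run the bitsandbytes-quantized base model on its own.
llm = LLM(
    model="huggyllama/llama-7b",                   # assumed Llama base model
    quantization="bitsandbytes",
    load_format="bitsandbytes",                    # assumed load-format name
    qlora_adapter_name_or_path="timdettmers/qlora-flan-7b",  # assumed adapter repo
)

outputs = llm.generate(["The capital of France is "],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```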

Explanation on Changes

Modified files mainly include

  • Modify vllm/config.py and vllm/engine/arg_utils.py: add a new CLI parameter for QLoRA/bitsandbytes:
    • qlora_adapter_name_or_path: the path to the adapter repo. May be empty.
  • Modify vllm/model_executor/model_loader/loader.py: define a new loader class that quantizes the weights with bitsandbytes during loading.
  • Modify vllm/model_executor/layers/linear.py: add the logic for concatenating bitsandbytes tensors in the weight_loader() function of the QKVParallelLinear and MergedColumnParallelLinear classes.

The newly added files are:

  • vllm/model_executor/layers/quantization/bitsandbytes.py: here, similar to other quantization methods, we define two classes, BitsAndBytesConfig(QuantizationConfig) and BitsAndBytesLinearMethod(LinearMethodBase); see the sketch after this list.
  • examples/qlora_inference.py: demonstrates the use of bitsandbytes, both with and without an adapter.
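As a rough, non-authoritative sketch (method names abridged; the real base-class interfaces in vLLM require more methods than shown), the two classes pair up roughly like this:

```python
import torch

class BitsAndBytesConfig:  # in vLLM this subclasses QuantizationConfig
    """Static description of the bitsandbytes 4-bit quantization scheme."""

    def get_name(self) -> str:
        return "bitsandbytes"

    def get_linear_method(self) -> "BitsAndBytesLinearMethod":
        return BitsAndBytesLinearMethod(self)


class BitsAndBytesLinearMethod:  # in vLLM this subclasses LinearMethodBase
    """Holds packed 4-bit weights and dequantizes them when applied."""

    def __init__(self, quant_config: BitsAndBytesConfig):
        self.quant_config = quant_config

    def apply(self, weight_4bit, quant_state, x: torch.Tensor) -> torch.Tensor:
        # Dequantize with bitsandbytes, then fall back to an ordinary matmul.
        import bitsandbytes.functional as bnbF
        w = bnbF.dequantize_4bit(weight_4bit, quant_state)
        return x @ w.t()
```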

@jeejeelee
Contributor

ping @Yard1

Collaborator

@Yard1 Yard1 left a comment


This completely bypasses the existing LoRA logic and implements its own. I don't think this is a good design and it clashes with already existing code. We should instead modify the LoRA support already present in vLLM to support QLoRA - it should also allow us to reuse a lot of existing code.

@chenqianfzh
Contributor Author

This completely bypasses the existing LoRA logic and implements its own. I don't think this is a good design and it clashes with already existing code. We should instead modify the LoRA support already present in vLLM to support QLoRA - it should also allow us to reuse a lot of existing code.

Thanks for your reply. You are not the first one to raise this concern. Actually, I asked myself the same question. :-)

I considered reusing the LoRA code in the first place. I had to start a new set of code because:

  1. The existing LoRA support in vLLM implements punica (https://github.com/punica-ai/punica), a multi-tenant LoRA scenario. A lot of effort went into the LoRA manager, which handles cases where different sets of fine-tuned weights share the same base model.

But QLoRA, despite carrying a very similar name, targets a totally different scenario, so the existing LoRA code in vLLM could not be reused.

  2. punica is built on the BGMV CUDA kernels, and BGMV does not support any quantization. But in QLoRA, quantizing the base model is the key to saving memory. This is another reason I had to deviate from reusing LoRA.

  3. On the other hand, QLoRA uses a different set of CUDA code. The QLoRA author provides a CUDA implementation, packaged in the bitsandbytes Python package, which is what the Hugging Face transformers QLoRA implementation uses. So I moved away from reusing the LoRA code.

How about I add some comments somewhere to address your concern?

@Yard1
Collaborator

Yard1 commented May 13, 2024

Is it theoretically possible for the QLoRA adapter to be loaded and unloaded at will?

@chenqianfzh
Contributor Author

Is it theoretically possible for the QLoRA adapter to be loaded and unloaded at will?

I am not sure what you mean by "at will". Do you mean load/unload during runtime?

In this implementation, the user can load an adapter by specifying the "qlora_adapter_name_or_path" parameter when starting inference. The user can also run without an adapter by leaving that parameter empty.

However, the user cannot switch adapters at runtime. Switching adapters is not a scenario the QLoRA design supports.

The main goal of QLoRA is to use the LoRA weights to compensate for the loss caused by 4-bit quantization of the base model, so it is a quantization technique. Switching LoRA adapters to support different fine-tuning scenarios, as in punica, is not among its design goals.

@Yard1
Collaborator

Yard1 commented May 13, 2024

Ok, that's what I wanted to confirm. Thanks for clearing it up. In that case:

  1. for consistency, I would suggest ditching the qlora_supported decorator and just specify the class attribute directly on the class
  2. we should avoid the if model_config.quantization == "qlora": pattern in linear layer and weight loading code - instead we should use abstractions (and add them if they are missing). For example, we should add a QLoRAModelLoader which can subclass/compose DefaultModelLoader. Same for linear layer - we should avoid adding special cases to generic implementations (I understand this pattern is not always followed in the codebase, but we should hold new code to higher standard - happy to discuss what sort of API we need to add to get rid of the Special case for Quantized Weights. in linear layer implementation)
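For illustration, a minimal sketch of the loader abstraction suggested in point 2 (class and method names here are hypothetical, not the PR's final code):

```python
from vllm.model_executor.model_loader.loader import DefaultModelLoader

class QLoRAModelLoader(DefaultModelLoader):
    """Hypothetical loader that quantizes weights with bitsandbytes while
    streaming them in, instead of special-casing DefaultModelLoader."""

    def _get_weights_iterator(self, *args, **kwargs):
        for name, tensor in super()._get_weights_iterator(*args, **kwargs):
            yield name, self._maybe_quantize(name, tensor)

    def _maybe_quantize(self, name, tensor):
        # Quantize eligible linear weights to 4-bit via bitsandbytes;
        # pass embeddings, norms, etc. through unchanged. (Sketch only.)
        return tensor
```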

@chenqianfzh
Contributor Author

Ok, that's what I wanted to confirm. Thanks for clearing it up. In that case:

  1. for consistency, I would suggest ditching the qlora_supported decorator and just specify the class attribute directly on the class
  2. we should avoid the if model_config.quantization == "qlora": pattern in linear layer and weight loading code - instead we should use abstractions (and add them if they are missing). For example, we should add a QLoRAModelLoader which can subclass/compose DefaultModelLoader. Same for linear layer - we should avoid adding special cases to generic implementations (I understand this pattern is not always followed in the codebase, but we should hold new code to higher standard - happy to discuss what sort of API we need to add to get rid of the Special case for Quantized Weights. in linear layer implementation)

Thanks for the suggestion. I will make the changes as suggested.

Cheers!

@jeejeelee
Contributor

Thank you for your excellent work. Here are some personal opinions:

  • vLLM has supported quantized models with LoRA, refer to quant model+lora. These can be generalized as QLoRA (e.g., GPTQ+LoRA), and all of them support switching adapters.
  • For the original QLoRA (https://arxiv.org/abs/2305.14314), I think we should add a new quantization method named bitsandbytes (e.g., BAB+LoRA), refer to [Feature]: bitsandbytes support #4033, and then we can reuse the current LoRA logic.
  • Regardless of LoRA or QLoRA, Punica can support these

If I am wrong, please correct me directly. Thanks again.

Cheers!

@chenqianfzh
Contributor Author

chenqianfzh commented May 14, 2024

Thank you for your excellent work. Here are some personal opinions:

  • vLLM has supported quantized models with LoRA, refer to quant model+lora. These can be generalized as QLoRA (e.g., GPTQ+LoRA), and all of them support switching adapters.
  • For the original QLoRA (https://arxiv.org/abs/2305.14314), I think we should add a new quantization method named bitsandbytes (e.g., BAB+LoRA), refer to [Feature]: bitsandbytes support #4033, and then we can reuse the current LoRA logic.
  • Regardless of LoRA or QLoRA, Punica can support these

If I am wrong, please correct me directly. Thanks again.

Cheers!

I re-read the LoRA code carefully and saw that quantization is now supported in LoRA. It was not supported when I started my design and coding. Sorry for missing that.

I will rethink my design based on this change, as well as Yard1's suggestions.

Thanks & Happy Coding!

@chenqianfzh
Contributor Author

@Yard1 @jeejeelee

I just updated the QLoRA/bitsandbytes PR with the suggested changes. Could you please take another look?

Thanks for the great advice. I learned a lot and the PR improved a lot. :-)

BTW, I hit a lot of yapf errors in CI/CD. I found that the yapf errors are not from my changes. Should I just ignore them?

@jeejeelee
Contributor

@chenqianfzh We cannot ignore format errors; you can run bash format.sh to check for them.

Collaborator

@Yard1 Yard1 left a comment


Thanks, this is looking much cleaner! Left some comments, hope they will be useful.

(Review comments on vllm/model_executor/model_loader/loader.py, vllm/engine/arg_utils.py, and vllm/model_executor/layers/quantization/bitsandbytes.py; since resolved.)
@Yard1
Collaborator

Yard1 commented May 23, 2024

We should also add a test for this - it's ok if it's just an end to end one (load a small model from huggingface hub and see if it works and gives good outputs)
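Such an end-to-end test might look roughly like the following (model name, expected substring, and exact engine arguments are illustrative assumptions, not the test that was eventually added):

```python
import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("model", ["huggyllama/llama-7b"])  # assumed base model
def test_bitsandbytes_end_to_end(model):
    llm = LLM(model=model,
              quantization="bitsandbytes",
              load_format="bitsandbytes")          # assumed argument names
    params = SamplingParams(temperature=0.0, max_tokens=8)
    outputs = llm.generate(["The capital of France is "], params)
    text = outputs[0].outputs[0].text
    assert "Paris" in text  # basic quality check to catch regressions
```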

(Review comment on requirements-common.txt; since resolved.)
@chenqianfzh
Contributor Author

@mgoin @Yard1 @jeejeelee

Thanks for the feedback. Working on the changes now.

@chenqianfzh
Contributor Author

We should also add a test for this - it's ok if it's just an end to end one (load a small model from huggingface hub and see if it works and gives good outputs)

The newly added file examples/qlora_inference.py was created for this purpose. It exercises bitsandbytes quantization both with and without LoRA adapters.

Here is the output I got in my local test (of the four prompts, the last is without a LoRA adapter; the other three are with adapters):

--------------------------------------------------------------------------
Prompt: The capital of France is 
Output:  Paris.
--------------------------------------------------------------------------
Prompt: The capital of USA is 
Output:  Washington DC.
--------------------------------------------------------------------------
Prompt: my name is 
Output:  john and i am a 20 year old male. i am a student at the university of maryland. i am a sophomore and i am majoring in business. i am a very outgoing person and i love to meet new people. i am a very social person and i love to party. i am a very outgoing person and i love to meet new people. i am a very social person and i love to party.
--------------------------------------------------------------------------
Prompt: My name is 
Output:  Kyle and I am a 20 year old college student. I am a huge fan of the outdoors and love to hike, camp, and fish. I am a very active person and love to stay busy. I am a very outgoing person and love to meet new people. I am a very easy going person and love to have fun. I am a very hard worker and love to work. I am a very trustworthy person and love to help people. I am a very caring person and love to help people. I am a very respectful person and love to respect others. I am a

@Yard1
Collaborator

Yard1 commented May 24, 2024

@chenqianfzh example is fine, but we need an automated pytest test to run in CI to prevent regressions.

@jeejeelee
Contributor

@chenqianfzh Can we add more quantization-type examples to the QLoRA example script, such as GPTQ+LoRA, so that users can refer to it to learn how to use LoRA on quantized models? Thanks.
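For reference, a sketch of what such a GPTQ+LoRA example could look like using vLLM's existing LoRA support (model, adapter path, and sampling settings are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder repos: any GPTQ-quantized base model plus a matching LoRA adapter.
llm = LLM(model="TheBloke/Llama-2-7B-GPTQ",
          quantization="gptq",
          enable_lora=True)

outputs = llm.generate(
    ["My name is "],
    SamplingParams(temperature=0.0, max_tokens=32),
    lora_request=LoRARequest("example-adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```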

@chenqianfzh chenqianfzh force-pushed the qian/qlora branch 8 times, most recently from 523c053 to 0ab5879 Compare May 28, 2024 06:04
@Yard1
Collaborator

Yard1 commented May 29, 2024

@chenqianfzh the merge commit is expected, that's just how git works

@chenqianfzh
Contributor Author

chenqianfzh commented May 29, 2024

@chenqianfzh the merge commit is expected, that's just how git works

I did something wrong when squashing commits before merging, so the commits are mixed up. Sorry for making your review more difficult. :-(

(Review comment on vllm/engine/arg_utils.py; since resolved.)
Collaborator

@Yard1 Yard1 left a comment


Thanks, left two last nits! We can merge after those are resolved.

@chenqianfzh
Contributor Author

Thanks, left two last nits! We can merge after those are resolved.

I've updated the code based on your feedback and have omitted one comment, for which I've provided an explanation. Could you please take a look?

Thanks.

@chenqianfzh
Contributor Author

@Yard1 I kept retrying the CI tests over the past two days but hit all kinds of weird errors; for example, the latest failure is due to a missing container in the AMD tests.

I could not find a way to re-run just the failing tests. Could you let me know what to do? Thanks.

@Yard1
Collaborator

Yard1 commented May 31, 2024

It's OK, we'll just have a maintainer force-merge it. Can you resolve #4776 (comment)? Then I will accept.

(Review comment on examples/offline_inference.py; since resolved.)
@mgoin mgoin changed the title support QLoRA [Feature][Kernel] Support bitsandbytes quantization and QLoRA Jun 1, 2024
(Review comments on vllm/config.py, vllm/model_executor/layers/quantization/bitsandbytes.py, and vllm/model_executor/model_loader/loader.py; since resolved.)
@chenqianfzh
Contributor Author

@mgoin Thanks for reviewing the PR!

I updated the code per your comments. Could you take another look?

@mgoin mgoin merged commit b9c0605 into vllm-project:main Jun 1, 2024
65 checks passed
@XiaoningDing XiaoningDing mentioned this pull request Jun 4, 2024