
[BUG]: Coati Lora incompatible with Gemini & HybridParallel(pp=1), but runs well with HybridParallel(pp>=2) #5507

Open
Fallqs opened this issue Mar 26, 2024 · 1 comment
Labels
bug Something isn't working

Fallqs commented Mar 26, 2024

🐛 Describe the bug

Description

I applied Coati LoRA before parallel fine-tuning of LLaMA-7B and found:

  • Gemini fails with Error(s) in loading state_dict for GeminiCheckpointIO:, and the trainable parameter count stays at 6.32 B
  • HybridParallel(pp=1) fails with RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn, even though the trainable parameter count is set correctly to 38.68 M
  • HybridParallel(pp=2) runs successfully and the trainable parameters are sharded properly (19.06 M on the master GPU)
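
(For clarity, the "Train params" numbers above are the usual count of trainable parameters, i.e. those with requires_grad=True; the helper below is only an illustration of how such a number is obtained, not code from the repo.)

    # Illustrative only: how a "Train params" count like 6.32 B / 38.68 M is obtained
    def count_trainable_params(model) -> int:
        return sum(p.numel() for p in model.parameters() if p.requires_grad)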

LoRA matters a lot for efficient, stable fine-tuning of large models and for fitting longer seq_len and larger batch_size, so I'm sincerely hoping for an update that fully supports LoRA in distributed training/fine-tuning (see Expected Behavior below).

To Reproduce

  • environment: CUDA 11.7, torch 2.1.2+cu118, 2×A100-40G, NCCL backend, Python 3.10.14, Ubuntu 20.04

  • requirements: colossalai=0.3.5, loralib=0.1.2, transformers=4.33.0. I'm also using flash-attn=2.5.6 and dropout-layer-norm=0.1 (a submodule of flash-attn), with a few modifications to shardformer/modeling/llama.py to enable flash attention for the HybridParallel plugin.

  • modifications

    • finetune.py under model loading part:

      with init_ctx:
          model = LlamaForCausalLM(config)
          if args.lora:
              from coati_lora import convert_to_lora_module
              # coati_lora is the lora.py copied from Coati (a rough sketch of what it does follows these modifications)
              model = convert_to_lora_module(model, 16)
    • finetune.py under arg_parser:

      parser.add_argument("--lora", action="store_true")
      parser.add_argument("--ppsize", default=2, type=int)
      parser.add_argument("--tpsize", default=4, type=int)
      
      # Gemini is left unchanged but HybridParallel had modifications
      if args.plugin == "hybrid_parallel":
          # modify the param accordingly, default configuration is for llama2-7b
      # The pptp_size below is a parameter that controls DataParallel
      # and does not matter here
          args.pptp_size = args.ppsize * args.tpsize
          plugin = HybridParallelPlugin(
              tp_size=args.tpsize,
              pp_size=args.ppsize,
              num_microbatches=2, microbatch_size=None,
              enable_jit_fused=False, zero_stage=0,
              precision="bf16", initial_scale=1,
          )
    • finetune.sh, rewritten as another version:

      MODEL_NAME="deepseek-coder-6.7b-instruct"
      DATASET_PATH=""
      SAVE_DIR="save_checkpoint/$MODEL_NAME"
      
      # LoRA
      # Notice that I did not use DataParallel here
      CUDA_VISIBLE_DEVICES=3,5 CUDA_LAUNCH_BLOCKING=1 \
          nohup colossalai run --nproc_per_node 2 --master_port 29503 \
          col_train.py  --plugin "gemini" \
          --model_path "./model/$MODEL_NAME" --dataset "$DATASET_PATH" \
          --save_dir $SAVE_DIR --save_interval 5000 \
          --lr 0.00005 --lora --batch_size 2 --max_length 2048 --ppsize 2 --tpsize 1 \
          --mixed_precision bf16 --flash_attention \
          --tensorboard_dir "log/train/tb_logs" \
          > log/train/[$$]${MODEL_NAME}.log &
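
For context, here is a rough paraphrase of what the convert_to_lora_module(model, 16) call above does. This is my own sketch based on loralib (Coati's actual lora.py may differ in details), so treat the exact names and behavior as assumptions.

    # Sketch only: approximate behavior of Coati's convert_to_lora_module(model, lora_rank)
    import loralib
    import torch.nn as nn

    def convert_to_lora_module_sketch(model: nn.Module, lora_rank: int) -> nn.Module:
        # Recursively swap plain nn.Linear layers for LoRA-augmented linears
        for name, child in list(model.named_children()):
            if isinstance(child, nn.Linear):
                lora_layer = loralib.Linear(child.in_features, child.out_features,
                                            r=lora_rank, bias=child.bias is not None)
                lora_layer.weight.data = child.weight.data  # keep the pretrained weight
                setattr(model, name, lora_layer)
            else:
                convert_to_lora_module_sketch(child, lora_rank)
        # Freeze everything except the LoRA A/B matrices
        loralib.mark_only_lora_as_trainable(model)
        return model

The step relevant to this issue is the last one: after conversion only the LoRA matrices have requires_grad=True, which is why the trainable count should drop to ~38.68 M.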

Expected Behavior

Given the efficiency and stability LoRA brings to fine-tuning large models, and its viability for fitting longer seq_len and larger batch_size, I'm sincerely looking forward to an update that fully supports LoRA in distributed training/fine-tuning. Specifically, I would like:

  1. Coati LoRA to be made compatible with the HybridParallel plugin when pp_size=1

  2. Coati LoRA to be made compatible with the Gemini plugin

  3. PEFT to be further supported in distributed training/fine-tuning, compatible with Gemini, HybridParallel, and even flash-attn (a usage sketch follows this list)
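
For reference, the kind of usage I would like to see supported is sketched below. This is only a sketch under my assumptions: it uses peft's standard LoraConfig/get_peft_model API together with the existing Booster flow, the target_modules and hyperparameters are illustrative, and it presumes the distributed environment has already been launched (e.g. via colossalai run). It is not a tested configuration.

    # Hypothetical usage sketch: PEFT LoRA + ColossalAI Booster (untested)
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import LlamaConfig, LlamaForCausalLM
    from colossalai.booster import Booster
    from colossalai.booster.plugin import HybridParallelPlugin

    # Stand-in config; a real run would load the llama-7b config from disk
    model = LlamaForCausalLM(LlamaConfig())

    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)  # only LoRA params remain trainable

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=5e-5
    )

    # Assumes torch.distributed / colossalai launch has already been initialized
    plugin = HybridParallelPlugin(tp_size=2, pp_size=1, precision="bf16", initial_scale=1)
    booster = Booster(plugin=plugin)
    # This boost step is where LoRA models currently break with Gemini / HybridParallel(pp=1)
    model, optimizer, *_ = booster.boost(model, optimizer)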

Screenshots

Gemini plugin failure

鎴睆2024-03-26 09 24 24

HybridParallel(pp=1,tp=2) failure

鎴睆2024-03-26 09 26 02

HybridParallel(pp=2,tp=1) success

鎴睆2024-03-26 09 30 51

Environment

CUDA 11.7
accelerate 0.28.0
colossalai 0.3.5
datasets 2.18.0
dropout-layer-norm 0.1
flash-attn 2.5.6
loralib 0.1.2
ninja 1.11.1.1
numpy 1.26.4
packaging 23.2
peft 0.10.0
ray 2.10.0
safetensors 0.4.2
scipy 1.12.0
sentencepiece 0.2.0
tokenizers 0.13.3
torch 2.1.2
tqdm 4.66.2
transformers 4.33.0
triton 2.1.0
xformers 0.0.23.post1

Fallqs added the bug (Something isn't working) label on Mar 26, 2024
Fallqs commented Mar 28, 2024

It turns out to be a problem with tensor parallelism in the hybrid_parallel plugin: the LoRA parameters are ignored when the column- and row-parallel layers are built. Roughly, the failure looks like the sketch below.
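
For anyone hitting the same thing, below is a minimal, self-contained illustration of the failure mode I suspect. It is my own sketch, not ColossalAI's actual ShardFormer code: if the column/row-parallel layer is rebuilt from only the weight (and bias), the trainable lora_A/lora_B branch silently disappears, the loss then depends only on frozen tensors, and you get exactly the grad_fn error reported above.

    # Illustration only -- not the real ShardFormer code
    import torch
    import torch.nn as nn

    class LoraLinear(nn.Module):
        """Simplified stand-in for a LoRA linear: frozen base weight + trainable A/B."""
        def __init__(self, in_features, out_features, r=16):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
            self.lora_A = nn.Parameter(torch.randn(r, in_features))
            self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        def forward(self, x):
            return x @ (self.weight + self.lora_B @ self.lora_A).t()

    def naive_shard(layer: LoraLinear) -> nn.Linear:
        """Rebuild the layer from the base weight only -- the LoRA branch is dropped."""
        new = nn.Linear(layer.weight.shape[1], layer.weight.shape[0], bias=False)
        new.weight = layer.weight            # frozen base weight, requires_grad=False
        return new                           # lora_A / lora_B are gone

    sharded = naive_shard(LoraLinear(8, 8))
    loss = sharded(torch.randn(2, 8)).sum()
    loss.backward()  # RuntimeError: element 0 of tensors does not require grad ...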
