
Add couple configs into generator.py for mixed input MM #1350

Open · wants to merge 1 commit into base: main

Conversation

@alexsamardzic (Contributor)

I'm adding (PR here) CUTLASS kernels as an auto-tune option for the PyTorch compiler, and it would be nice to have these additional configurations available. This is not urgent, and more changes along these lines may be desirable later, so if changes like this are acceptable, this PR could be kept open for a while and I'll add further configurations to it as needed.

@manishucsd : Would it make sense for GenerateSM80_TensorOp_16816_mixed_input_upcast_a and GenerateSM80_TensorOp_16816_mixed_input_upcast_b to be symmetric w.r.t. math_instructions and tile_descriptions? I could make that change through this PR too.

@manishucsd (Contributor)

to be symmetric w.r.t. math_instructions and tile_descriptions.

What do you mean by symmetric (the same)? The Tensor Core math_instruction shape for both upcast_a and upcast_b is 16816. The supported tile_descriptions (more precisely, the tile shapes) may need to differ between upcast_a and upcast_b.

@alexsamardzic (Contributor, Author)

By symmetry, I meant the math_instructions lists within the given generator methods. I was thinking that if the GenerateSM80_SparseTensorOp_16832 method has, for example, the DataType.f16, DataType.f16, DataType.f32 combination listed there, then the upcast_a method should have DataType.s8, DataType.f16, DataType.f32 and DataType.u8, DataType.f16, DataType.f32, and the upcast_b method should have DataType.f16, DataType.s8, DataType.f32 and DataType.f16, DataType.u8, DataType.f32; and likewise for the other elements of that list in the GenerateSM80_SparseTensorOp_16832 method. I've updated the PR with all the changes I think should be made in that regard.
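
For concreteness, here is a minimal sketch of the kind of mirrored entries I have in mind; the MathInstruction arguments and the multiply_add_mixed_input_upcast name follow my reading of cutlass_library/library.py, so treat this as illustrative rather than the exact diff in the PR:

# Illustrative sketch only: mirrored math_instructions entries for the two
# upcast methods (constructor arguments per my reading of library.py).
from cutlass_library.library import (
    DataType, MathInstruction, MathOperation, OpcodeClass,
)

# GenerateSM80_TensorOp_16816_mixed_input_upcast_a: the 8-bit operand is A.
math_instructions_upcast_a = [
    MathInstruction([16, 8, 16],
                    DataType.s8, DataType.f16, DataType.f32,
                    OpcodeClass.TensorOp,
                    MathOperation.multiply_add_mixed_input_upcast),
    MathInstruction([16, 8, 16],
                    DataType.u8, DataType.f16, DataType.f32,
                    OpcodeClass.TensorOp,
                    MathOperation.multiply_add_mixed_input_upcast),
]

# GenerateSM80_TensorOp_16816_mixed_input_upcast_b: the mirror image, with
# the 8-bit type moved to operand B.
math_instructions_upcast_b = [
    MathInstruction([16, 8, 16],
                    DataType.f16, DataType.s8, DataType.f32,
                    OpcodeClass.TensorOp,
                    MathOperation.multiply_add_mixed_input_upcast),
    MathInstruction([16, 8, 16],
                    DataType.f16, DataType.u8, DataType.f32,
                    OpcodeClass.TensorOp,
                    MathOperation.multiply_add_mixed_input_upcast),
]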

As far as the tile_descriptions lists are concerned, I thought most of them should be the same between GenerateSM80_SparseTensorOp_16832 and the upcast_a and upcast_b methods; my reasoning was that the multiplication itself is 16-bit in each case. On the other hand, less shared memory is used in the mixed data-types case (a rough estimate follows below), so some configurations may differ, but I would at least expect some kind of "symmetry" between upcast_a and upcast_b, in the sense that I don't see why the first one has only 9 elements in its tile_descriptions list while the other has 16.
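
To make the shared-memory point concrete, here is a rough per-stage estimate for the A and B tiles of a 128x128x64 threadblock tile; these are my own back-of-the-envelope numbers, ignoring padding, swizzling and epilogue storage:

# Rough per-stage SMEM footprint of the A and B tiles (illustration only;
# ignores padding, swizzling and epilogue storage).
def tile_smem_bytes(cta_m, cta_n, cta_k, bytes_a, bytes_b):
    return cta_m * cta_k * bytes_a + cta_k * cta_n * bytes_b

m, n, k = 128, 128, 64
print(tile_smem_bytes(m, n, k, 2, 2))  # f16 x f16:   32768 bytes per stage
print(tile_smem_bytes(m, n, k, 1, 2))  # s8/u8 x f16: 24576 bytes (upcast_a)
print(tile_smem_bytes(m, n, k, 2, 1))  # f16 x s8/u8: 24576 bytes (upcast_b)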

@manishucsd (Contributor) commented Feb 21, 2024

For math_instructions, that makes sense. Yes, we should support the combinations you listed. Once you add those, please also make sure the corresponding references are in place, then run verification to ensure the kernels run and verify.

For tile_descriptions, the 8-bit operand needs to be loaded from GMEM to SMEM, and this puts restrictions on which tile_descriptions (shapes) are currently supported. These may not be the same for upcast_a and upcast_b.
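
A minimal sketch of what this could look like in generator.py; the shapes, stage counts, and warp counts below are illustrative only, the TileDescription arguments follow the usual generator.py convention, and the actual supported sets have to come from what compiles and verifies:

# Illustrative only: each upcast method keeps its own tile_descriptions list,
# and entries are retained only if the generated kernels compile and verify.
from cutlass_library.library import (
    DataType, MathInstruction, MathOperation, OpcodeClass, TileDescription,
)

# Placeholder MathInstruction; each method would pair its own entries with
# its tiles.
math_inst = MathInstruction([16, 8, 16],
                            DataType.s8, DataType.f16, DataType.f32,
                            OpcodeClass.TensorOp,
                            MathOperation.multiply_add_mixed_input_upcast)
min_cc, max_cc = 80, 1024

tile_descriptions_upcast_a = [
    TileDescription([128, 128, 64], 3, [2, 2, 1], math_inst, min_cc, max_cc),
    # ... only shapes that work with an 8-bit operand A belong here
]

tile_descriptions_upcast_b = [
    TileDescription([128, 128, 64], 3, [2, 2, 1], math_inst, min_cc, max_cc),
    TileDescription([128, 128, 64], 4, [2, 2, 1], math_inst, min_cc, max_cc),
    # ... and this list need not mirror the upcast_a one
]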

@alexsamardzic (Contributor, Author)

Thanks for the clarification. I've updated gemm_fp_mixed_input.cu in my PR. Regarding verification: is there an "official" way to do it? I've checked that, on A100, whenever there is, for example, an item in gemm_fp_mixed_input.cu like:

make_gemm_real_canonical_layouts<
    uint8_t,
    half_t,
    half_t,
    half_t,
    half_t
  >(manifest);

the matching cutlass_profiler run, in this case:

cutlass_profiler --A=u8 --B=f16 --C=f16 --accum=f16

produces at least one line of profiling output, which should mean that the kernel compiled and ran successfully.

As a matter of fact, in these tests of mine cutlass_profiler always produces exactly one line of output, and if I try to force (using the cta_m/cta_n/cta_k etc. command-line arguments) a tile description different from the one cutlass_profiler printed, but still one listed in generator.py (for upcast_a in this particular case), it prints nothing. cutlass_profiler gives no information about tile descriptions that it tried but that did not work. On the other hand, in the PyTorch mixed data-types GEMM auto-tuning context mentioned in my first comment, more information about the compilation is printed, and I noticed that kernels generated by cutlass_library fail to compile for some tile descriptions. This is another reason I'm interested in proper verification.

@alexsamardzic (Contributor, Author)

Asking again: how do I properly run verification after my changes?

@manishucsd (Contributor) commented Mar 5, 2024

  1. For your mixed-input case, add a device-level unit test. Use the similar unit tests from here as a reference.

  2. You should also test that the profiler works, with verification, for your mixed-input case. Tips on achieving that:

  • Use cmake flags to compile only the kernels you are interested in. Use the cmake command below as an example to create your own:
cmake --no-warn-unused-cli \
  -DCMAKE_BUILD_TYPE:STRING=Release \
  -DCUTLASS_NVCC_ARCHS:STRING=80 \
  -DCUTLASS_NVCC_KEEP:STRING=OFF \
  -DCUTLASS_ENABLE_F16C:STRING=ON \
  -DCUTLASS_LIBRARY_KERNELS:STRING=f16_s16816gemm_f16_s8_128x128_64x*,f16_s16816gemm_s8_f16_128x128_64x*,f16_s16816gemm_u8_f16_128x128_64x*,f16_s16816gemm_f16_u8_128x128_64x*,bf16_s16816gemm_bf16_s8_128x128_64x*,bf16_s16816gemm_s8_bf16_128x128_64x*,bf16_s16816gemm_bf16_u8_128x128_64x*,bf16_s16816gemm_u8_bf16_128x128_64x*,f16_s16816gemm_f16_128x128_64x*_tn_align8,bf16_s16816gemm_bf16_128x128_64x*_tn_align8 \
  -DCUTLASS_LIBRARY_IGNORE_KERNELS:STRING=gemm_grouped*,gemm_planar* \
  -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=TRUE \
  -DCMAKE_C_COMPILER:FILEPATH=/usr/bin/gcc \
  -DCMAKE_CXX_COMPILER:FILEPATH=/usr/bin/g++ \
  -S/mnt/disks/gcloud_workspace/repos/cutlass/cutlass_tree_2/cutlass \
  -B/mnt/disks/gcloud_workspace/repos/cutlass/cutlass_tree_2/build \
  -G Ninja

The cmake flags to play with are

        // "CUTLASS_LIBRARY_KERNELS": "tensorop_s16816fprop_optimized_f16_256x128_32x3_nhwc_align8,s16816gemm_bf16_128x128_64x3_tn_align8,s16816gemm_f16_128x128_64x3_tn_align8,h16816gemm_128x128_64x3_tn_align8",
        // Upcast on OperandA and OperandB
        "CUTLASS_LIBRARY_KERNELS": "f16_s16816gemm_f16_s8_128x128_64x*,f16_s16816gemm_s8_f16_128x128_64x*,f16_s16816gemm_u8_f16_128x128_64x*,f16_s16816gemm_f16_u8_128x128_64x*,bf16_s16816gemm_bf16_s8_128x128_64x*,bf16_s16816gemm_s8_bf16_128x128_64x*,bf16_s16816gemm_bf16_u8_128x128_64x*,bf16_s16816gemm_u8_bf16_128x128_64x*,f16_s16816gemm_f16_128x128_64x*_tn_align8,bf16_s16816gemm_bf16_128x128_64x*_tn_align8",
        // Upcast on OperandB only        
        // "CUTLASS_LIBRARY_KERNELS": "s16816gemm_f16_s8_*,s16816gemm_bf16_s8_*,s16816gemm_bf16_128x128_64x*_tn_align8,s16816gemm_f16_128x128_64x*_tn_align8",
        "CUTLASS_LIBRARY_IGNORE_KERNELS": "gemm_grouped*,gemm_planar*"
  • Compile cutlass_profiler and make sure the kernels you are interested in are generated and compiled.

  • Use ./cutlass_profiler --kernels="kernel_name" to run the kernel you are interested in.
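
One quick way to sanity-check the CUTLASS_LIBRARY_KERNELS patterns against a kernel name before waiting on a full profiler build is a tiny script like the one below; this is my own helper, not part of CUTLASS, and the manifest's actual matching logic may differ slightly:

# Hypothetical helper: check whether an (example) kernel name would be picked
# up by the wildcard patterns passed via CUTLASS_LIBRARY_KERNELS above.
import fnmatch

patterns = "f16_s16816gemm_f16_s8_128x128_64x*,f16_s16816gemm_s8_f16_128x128_64x*".split(",")

def is_selected(kernel_name, patterns):
    # Approximates the manifest's filter by matching each pattern anywhere
    # inside the kernel name.
    return any(fnmatch.fnmatch(kernel_name, "*" + p + "*") for p in patterns)

# Example (illustrative) procedural kernel name:
print(is_selected("cutlass_tensorop_f16_s16816gemm_f16_s8_128x128_64x3_tn_align8", patterns))  # True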

Apologies for the delayed response. I have been OOO for the last few weeks.

@alexsamardzic (Contributor, Author) commented Mar 7, 2024

Thanks for the clarifications.

The PR is updated with the suggested changes: I added a number of tests, so everything should now be consistent between the tests, generator.py, and gemm_fp_mixed_input.cu. I also fixed several unrelated typos in the generator and tests.

Script used to validate that cutlass_profiler generates kernels for mixed data-types
#! /bin/bash

IFS=","

for cfg in \
    s8,f16,f32,f32 \
    u8,f16,f32,f32 \
    s8,bf16,f32,f32 \
    u8,bf16,f32,f32 \
    s8,f16,f16,f32 \
    u8,f16,f16,f32 \
    s8,bf16,bf16,f32 \
    u8,bf16,bf16,f32 \
    s8,f16,f16,f16 \
    u8,f16,f16,f16
do
    set -- $cfg
    ./tools/profiler/cutlass_profiler \
        --operation=gemm --op_class=tensorop \
        --A=$1 --B=$2 --C=$3 --accum=$4
    read -n1 -s -r -p $"A=$1 B=$2 C=$3 accum=$4 done - Press any key to continue..." key
    ./tools/profiler/cutlass_profiler \
        --operation=gemm --op_class=tensorop \
        --A=$2 --B=$1 --C=$3 --accum=$4
    read -n1 -s -r -p $"A=$2 B=$1 C=$3 accum=$4 done - Press any key to continue..." key
done


This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.
