Move loop ordering after fusion #126255
Comments
Running with the branch https://github.com/isuruf/pytorch/tree/debug_loop_fusion you'll see generated code where the two kernels should have been fused. Full output at https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5. Is this okay, or do you prefer a smaller reproducer?
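For anyone trying to reproduce locally, a minimal sketch along these lines should surface the generated kernels; the model shape and pattern here are assumptions, not taken from the thread:

```python
import torch
import torch.nn.functional as F

# Hypothetical reproducer: an avg_pool2d followed by a pointwise op,
# compiled with inductor. Run with TORCH_LOGS="output_code" to dump the
# generated Triton kernels and check whether they were fused.
def repro(x):
    y = F.avg_pool2d(x, kernel_size=3)
    return y + 1

compiled = torch.compile(repro)
out = compiled(torch.randn(8, 16, 32, 32, device="cuda"))
```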
This is already very helpful for showing the loop ordering issue in real use cases, although a smaller reproducer would be even more useful for faster iteration and for adding as a unit test.
I took a look at this example and I think it's different from the 'normal' loop ordering issues we considered, where two nodes independently pick incompatible loop orders before fusion. But in this avg_pool2d kernel (https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5#file-triton-py-L729-L735), the kernel should pick a different loop order so that the index expressions have better strides. So the main issue here is what causes inductor to pick a 'sub-optimal' loop order for this kernel.
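To make the point about index expressions concrete, here is a small illustration; the shape and strides are made up, not taken from the gist:

```python
# For a buffer of logical shape (A, B) stored with strides (1, A)
# (column-major), the flat index expression is i0 + A * i1.
# With loop order [0, 1] (i1 innermost), successive iterations jump by
# stride A; with loop order [1, 0] (i0 innermost), they advance by
# stride 1. Picking the second order makes the innermost accesses
# contiguous, which is the kind of choice at stake in this kernel.
A, B = 4, 3

def flat_index(i0, i1):
    return i0 + A * i1

# innermost = i1: stride-A jumps between successive accesses
order_01 = [flat_index(i0, i1) for i0 in range(A) for i1 in range(B)]
# innermost = i0: unit-stride accesses
order_10 = [flat_index(i0, i1) for i1 in range(B) for i0 in range(A)]

assert order_10 == list(range(A * B))  # contiguous sweep of memory
```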
This fixes the loop ordering issue for avg_pool2d here (#126255 (comment)). The reason we cannot fuse the two kernels for avg_pool2d is ComputedBuffer.iter_reordering_reindex. Take a simpler example:

```python
import torch
from torch._dynamo.testing import rand_strided

def f(x, y):
    """
    Add a matmul since inductor may force layout for output.
    """
    return (x.sum(dim=-1) + 1) @ y

# Make the first 2 dimensions unable to merge on purpose so that
# ComputedBuffer.iter_reordering_reindex will be updated.
x = rand_strided([20, 20, 30], [30, 900, 1], device="cuda")
y = torch.randn(20, 20, device="cuda")
```

Suppose x.sum is stored to x2. The computed buffer for x2 will remember that we have reordered its first and second dimensions (i.e. loop order [1, 0]). Later on, when we decide the loop order for x2 while computing 'x2 + 1', we pick loop order [1, 0] according to the stride analysis. We then apply the saved ComputedBuffer.iter_reordering_reindex to further reorder the loop. The net effect is that we use loop order [0, 1], which prevents the pointwise kernel from fusing with the reduction kernel.

I feel that we don't need ComputedBuffer.iter_reordering_reindex, and test results show that removing it has a neutral impact on the dashboard ([link](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2022%20May%202024%2017%3A30%3A29%20GMT&stopTime=Wed%2C%2029%20May%202024%2017%3A30%3A29%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/153/head&lCommit=195f42cf1a414d2d1a0422b8a081a85ff52b7d20&rBranch=main&rCommit=d6e3e89804c4063827ea21ffcd3d865e5fe365d9)).

Pull Request resolved: #127367. Approved by: https://github.com/jansel
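A minimal sketch of the double-reordering effect described above (pure Python, not Inductor internals; the helper name is hypothetical):

```python
# Composing the stride-based choice [1, 0] with the remembered
# iter_reordering_reindex [1, 0] swaps the dimensions twice, landing
# back on [0, 1]. That is why the pointwise kernel ends up with a loop
# order incompatible with the reduction kernel it should fuse with.
def apply_order(order, dims):
    """Reorder `dims` according to the permutation `order`."""
    return [dims[i] for i in order]

stride_choice = apply_order([1, 0], [0, 1])     # stride analysis -> [1, 0]
net_order = apply_order([1, 0], stride_choice)  # saved reindex   -> [0, 1]
assert net_order == [0, 1]  # net effect: original order, no fusion
```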
🚀 The feature, motivation and pitch
Here is some brief context:
Right now inductor decides loop ordering before fusion. That can lose some fusion opportunities: if node1 and node2 pick incompatible loop orders before fusion, we cannot fuse them. But if we delay the final decision on loop orders and manage to pick consistent loop orders for node1 and node2, we would be able to fuse them.
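As a hedged illustration (a made-up pattern, not one of the manually constructed examples mentioned below), here is a pair of nodes that want to read the same buffer in different orders:

```python
import torch

# Hypothetical pattern where the loop-order choice decides fusibility:
# the reduction naturally iterates `a` as [dim0, dim1], while the
# transposed pointwise consumer prefers [dim1, dim0]. If each node
# commits to its order before fusion, the two kernels cannot fuse;
# deciding loop order after fusion could pick one consistent order.
def f(a):
    b = a.sum(dim=-1)              # reduction over the last dim
    return b.t().contiguous() * 2  # consumer reads b transposed

compiled = torch.compile(f)
out = compiled(torch.randn(64, 128, 256, device="cuda"))
```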
I created this issue since I've previously seen people show interest in inductor supporting loop ordering after fusion. I'm wondering if people can provide examples where they found loop ordering after fusion to be helpful. That would help me develop this feature more effectively.
Right now I have some manually constructed examples, but it would be nice to develop based on real use cases.
WIP PR: #126254.
cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @Chillee @lezcano @jansel