
Move loop ordering after fusion #126255

Open
shunting314 opened this issue May 15, 2024 · 6 comments
Assignees
Labels
module: inductor oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@shunting314
Contributor

shunting314 commented May 15, 2024

🚀 The feature, motivation and pitch

Here is some brief context:
Right now inductor decides loop ordering before fusion, which can lose some fusion opportunities. E.g. if node1 and node2 pick incompatible loop orders before fusion, we cannot fuse them. But if we delay the final decision on loop orders and manage to pick consistent loop orders for node1 and node2, we would be able to fuse them.

I'm creating this issue since I've previously seen people express interest in inductor supporting loop ordering after fusion. I'm wondering if people can provide examples where they found that loop ordering after fusion would be helpful. That would help me develop this feature more effectively.

Right now I have some manually constructed examples, but it would be nice to develop based on real use cases.
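To make the motivation concrete, here is a small hypothetical sketch (not inductor's actual implementation) of how per-node loop ordering before fusion can block a fusion: if each node greedily orders its loops by descending stride of its own accesses, two nodes touching the same data through different layouts end up with incompatible orders, whereas deferring the decision until fusion lets a common order be chosen.

```python
def stride_based_loop_order(strides):
    # Order loop dims by descending stride so the innermost loop
    # walks the smallest stride (most contiguous access).
    return sorted(range(len(strides)), key=lambda d: -strides[d])

# node1 writes a [20, 30] buffer contiguously: strides (30, 1)
node1_order = stride_based_loop_order((30, 1))   # [0, 1]
# node2 reads the same data transposed: strides (1, 30)
node2_order = stride_based_loop_order((1, 30))   # [1, 0]

# Deciding orders independently before fusion: incompatible.
print(node1_order != node2_order)  # True

# Deferring the decision until fusion time would let both nodes
# agree on one order (e.g. node1's), making the fusion legal.
common_order = node1_order
print(common_order)  # [0, 1]
```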

WIP PR: #126254 .

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @Chillee @lezcano @jansel


@lezcano
Collaborator

lezcano commented May 15, 2024

@isuruf @vfdev-5 hit a number of cases when working on decompositions of pooling operations. See if they can repro any.

@xmfan xmfan added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 15, 2024
@shunting314
Contributor Author

@isuruf @vfdev-5 can you share the examples where inductor does not fuse kernels due to loop ordering? I'll make sure they fuse with my PR

@isuruf
Collaborator

isuruf commented May 17, 2024

Running

```
python benchmarks/dynamo/timm_models.py --performance --amp -dcuda -n50 --no-skip --dashboard --batch-size 128 --threads 1 --only adv_inception_v3 --inference --freezing --timeout 9000 --backend=inductor
```

with the branch https://github.com/isuruf/pytorch/tree/debug_loop_fusion you'll see code of the form

```
triton_per_fused_avg_pool2d_11.run(buf12, buf22, 30105600, 9, grid=grid(30105600), stream=stream0)
buf23 = buf12; del buf12  # reuse
# Source Nodes: [branch_pool], Original ATen: [aten.avg_pool2d]
triton_poi_fused_avg_pool2d_12.run(buf22, buf23, 30105600, grid=grid(30105600), stream=stream0)
```

where the two kernels should have been fused. Full output at https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5

Is this okay or do you prefer a smaller reproducer?

@shunting314
Contributor Author

> Is this okay or do you prefer a smaller reproducer?

This is already very helpful for showing the loop ordering issue in real use cases, though a smaller reproducer would be even more useful for faster iteration and for adding as a unit test.

@shunting314
Contributor Author

shunting314 commented May 29, 2024

> where the two kernels should have been fused. Full output at https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5

I took a look at this example and I think it's different from the 'normal' loop ordering issues we considered. The normal loop ordering issues we considered are:

  1. kernel A and B each pick the optimal loop order by looking only at their own memory access patterns
  2. A and B cannot fuse because they pick different loop orders according to step 1.

But in this avg_pool2d kernel: https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5#file-triton-py-L729-L735 , the kernel should pick a different loop order so that the index expression x2 + (192*x8) + (235200*x3) can be flattened to a single index. If inductor does that, then this kernel would be able to fuse with the next kernel.

So the main issue here is what causes inductor to pick a 'sub-optimal' loop order for this kernel.
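The flattening claim can be checked with a scaled-down model of that index expression. The extents below (3, 5, 4) are illustrative stand-ins for the kernel's real shape (192, 1225, 128, where 192 * 1225 = 235200): when the loops iterate in order of decreasing stride, x2 + S8*x8 + S3*x3 enumerates 0..N-1 and the three loops collapse into one flat index.

```python
# Scaled-down model of the index expression x2 + (192*x8) + (235200*x3),
# using illustrative extents (3, 5, 4) instead of (192, 1225, 128).
E2, E8, E3 = 3, 5, 4          # loop extents for x2, x8, x3
S8, S3 = E2, E2 * E8          # strides 3 and 15 (analogues of 192, 235200)

seen = []
# Iterate loops in order of decreasing stride: x3 outer, x8, x2 inner.
for x3 in range(E3):
    for x8 in range(E8):
        for x2 in range(E2):
            seen.append(x2 + S8 * x8 + S3 * x3)

# With this loop order the expression is a single flat index 0..N-1,
# so the three loops can be merged into one dimension.
print(seen == list(range(E2 * E8 * E3)))  # True
```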

@shunting314
Contributor Author

@isuruf can you check whether #127367 fixes the fusion problem on your side?

shunting314 added a commit that referenced this issue May 30, 2024
This fixes the loop ordering issue for avg_pool2d here (#126255 (comment)).

The reason we cannot fuse the 2 kernels for avg_pool2d is ComputedBuffer.iter_reordering_reindex. Take a simpler example:

```
        def f(x, y):
            """
            Add a matmul since inductor may force layout for output.
            """
            return (x.sum(dim=-1) + 1) @ y

        # Make the first 2 dimensions unable to merge on purpose so that
        # ComputedBuffer.iter_reordering_reindex will be updated.
        x = rand_strided([20, 20, 30], [30, 900, 1], device="cuda")
        y = torch.randn(20, 20, device="cuda")
```

Suppose x.sum is stored to x2. The computed buffer for x2 will remember that we have reordered its first and second dimensions (i.e. loop order [1, 0]). Later on, when we decide the loop order for x2 while computing 'x2 + 1', we pick loop order [1, 0] according to the stride analysis. Then we use the saved ComputedBuffer.iter_reordering_reindex to further reorder the loop. The net effect is that we use loop order [0, 1], which leaves the pointwise kernel unable to fuse with the reduction kernel.

I feel that we don't need ComputedBuffer.iter_reordering_reindex, and test results show that removing it has a neutral impact on the dashboard [link](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2022%20May%202024%2017%3A30%3A29%20GMT&stopTime=Wed%2C%2029%20May%202024%2017%3A30%3A29%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/153/head&lCommit=195f42cf1a414d2d1a0422b8a081a85ff52b7d20&rBranch=main&rCommit=d6e3e89804c4063827ea21ffcd3d865e5fe365d9)



cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
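The net effect described in that commit message can be modeled as composing two permutations (a hypothetical sketch, not inductor's actual data structures): applying the saved reindex [1, 0] on top of the freshly chosen order [1, 0] cancels out to the identity order [0, 1], which is exactly the order that blocks the fusion.

```python
def compose(outer, inner):
    # Compose two loop-order permutations: result[i] = inner[outer[i]].
    return [inner[o] for o in outer]

chosen = [1, 0]          # order picked from the stride analysis
saved_reindex = [1, 0]   # models ComputedBuffer.iter_reordering_reindex

net = compose(saved_reindex, chosen)
print(net)  # [0, 1] -- the double reorder cancels out
```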
pytorchmergebot pushed a commit that referenced this issue May 31, 2024
Pull Request resolved: #127367
Approved by: https://github.com/jansel
petrex pushed a commit to petrex/pytorch that referenced this issue Jun 5, 2024
Pull Request resolved: pytorch#127367
Approved by: https://github.com/jansel
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this issue Jun 6, 2024
X-link: pytorch/pytorch#127367
Approved by: https://github.com/jansel

Reviewed By: izaitsevfb

Differential Revision: D58014745

Pulled By: shunting314

fbshipit-source-id: 91f14ea313aeb4f297e259b7423107e34dafe512