Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make propagate_real_tensor more safe #126281

Closed
wants to merge 4 commits into from

Conversation

ezyang
Copy link
Contributor

@ezyang ezyang commented May 15, 2024

Stack from ghstack (oldest at bottom):

Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/

There a few improvements here, which luckily fix some xfails:

  • In generally, it can be unsafe to call operations on Tensors under a no_dispatch() mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test to see if there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to only disable the ambient fake tensor mode, this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it.
  • We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor.
  • I noticed that occasionally when I do storage.copy_(real_storage), the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor was initially allocated, so that's what I do now, eliminating the need for a storage copy.

Signed-off-by: Edward Z. Yang ezyang@meta.com

[ghstack-poisoned]
Copy link

pytorch-bot bot commented May 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126281

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit e76529d with merge base a961e1a (image):

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang added a commit that referenced this pull request May 15, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 9f2c794f6be0e640dbd18a5bf60c158b7d45c4cf
Pull Request resolved: #126281
[ghstack-poisoned]
ezyang added a commit that referenced this pull request May 15, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 427cb50a75bde0eb968e191975cf908c7e18a35b
Pull Request resolved: #126281
[ghstack-poisoned]
ezyang added a commit that referenced this pull request May 15, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: be50c2d84564214ca5fbb8a714e0ed970c26710e
Pull Request resolved: #126281
@ezyang ezyang added ciflow/trunk Trigger trunk jobs on your pull request topic: not user facing topic category labels May 15, 2024
@ezyang
Copy link
Contributor Author

ezyang commented May 15, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@ezyang
Copy link
Contributor Author

ezyang commented May 15, 2024

@pytorchbot merge -i

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 69c166fd1c73d933cca2babeac11f6188a61021d returned non-zero exit code 1

Auto-merging test/test_fake_tensor.py
CONFLICT (content): Merge conflict in test/test_fake_tensor.py
error: could not apply 69c166fd1c7... Make propagate_real_tensor more safe
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
Details for Dev Infra team Raised by workflow job

[ghstack-poisoned]
@ezyang
Copy link
Contributor Author

ezyang commented May 15, 2024

@pytorchbot merge

@albanD albanD removed their request for review May 15, 2024 19:35
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@ezyang
Copy link
Contributor Author

ezyang commented May 15, 2024

@pytorchbot merge -f "unrelated errors"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

OnlyFor pushed a commit to OnlyFor/pytorch that referenced this pull request May 16, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 44c39e5d775e355887bd42ffea6abc5497f0cdd4
Pull Request resolved: pytorch#126281
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/

There a few improvements here, which luckily fix some xfails:

* In generally, it can be unsafe to call operations on Tensors under a `no_dispatch()` mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test to see if there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to only disable the ambient fake tensor mode, this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it.
* We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor.
* I noticed that occasionally when I do `storage.copy_(real_storage)`, the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor was initially allocated, so that's what I do now, eliminating the need for a storage copy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#126281
Approved by: https://github.com/Skylion007
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants