Gradients from the same training code differ significantly between CPU and GPU runs #64297
I can't tell anything from your output — did you forget to include the gradients and related data?
Sorry, I left that out of yesterday's post. These are the per-layer gradient differences, measured with the Chebyshev distance:

[truncated per-layer gradient-difference arrays omitted; the full values are in the attached diff.txt]

Attachment: diff.txt
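For reference, the per-layer Chebyshev (L∞) distance mentioned above can be computed as in the following NumPy sketch. This is a generic illustration, not the reporter's actual script; the array names are hypothetical.

```python
import numpy as np

def chebyshev_diff(grad_cpu, grad_gpu):
    """Chebyshev (L-infinity) distance: the largest absolute
    element-wise difference between two gradient arrays."""
    return np.max(np.abs(np.asarray(grad_cpu) - np.asarray(grad_gpu)))

# Hypothetical example: two gradients that agree except in one element.
g_cpu = np.array([1.0, 2.0, 3.0])
g_gpu = np.array([1.0, 2.5, 3.0])
print(chebyshev_diff(g_cpu, g_gpu))  # 0.5
```

Applying this per layer (e.g. to each parameter's `.grad` after one backward pass on each device) produces the kind of per-layer diff table shown above.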
Could you simplify the code further and pin down which specific API has the problem? From your results it looks like the gradients differ at conv1 and conv2 — is that right? If so, I'll ask the relevant colleagues to take a look.
We think that although it superficially looks like a conv problem, every failing model involves paddle.reciprocal, so the issue may lie in the reciprocal op. Below is a failing example.
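One reason reciprocal is a plausible amplifier: the gradient of 1/x is -1/x², which is very large for inputs near zero, so even a tiny CPU/GPU numerical discrepancy in the forward value can blow up in the backward pass. A small NumPy illustration (the values are made up, and this is not the reporter's repro code):

```python
import numpy as np

def reciprocal_grad(x):
    # d/dx (1/x) = -1/x**2
    return -1.0 / x**2

# A tiny difference in the forward input, e.g. from differing
# CPU/GPU accumulation order.
x_cpu = np.float32(1e-3)
x_gpu = np.float32(1e-3 + 1e-7)

# The resulting gradient difference is orders of magnitude
# larger than the input difference.
print(abs(reciprocal_grad(x_cpu) - reciprocal_grad(x_gpu)))
```

This does not prove the op itself is buggy — it shows that reciprocal can legitimately magnify small device-level floating-point differences, which is worth separating from a genuine kernel bug.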
https://github.com/LouisChenB/Sample.git — running test.py reads our previously saved results and generates diff.txt under AlexNet-19-112, showing where the gradients disagree. To reproduce the error, add the following code to AlexNet-19-112/case/paddle_gpu/AlexNet-19-112_paddle_gpu.py and AlexNet-19-112/case/paddle_cpu/AlexNet-19-112_paddle_cpu.py respectively.
Describe the Bug
GPU version of the code
CPU version of the code
Additional Supplementary Information
PaddlePaddle version: 2.6.1