Some tips on why the model ain't working #56

Open
Long-louis opened this issue Mar 8, 2024 · 1 comment

Comments

@Long-louis

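Some tips for when the model won't run

Problem description

If the program shows no response at all after you start the code, the following fix from another project may help.
Original post: blog.csdn.net
While running an experiment today I hit the problem described in the title. The relevant code is: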

### chamfer_3D.py

import importlib
import os

chamfer_found = importlib.find_loader("chamfer_3D") is not None
if not chamfer_found:
    ## Cool trick from https://github.com/chrdiller
    print("Jitting Chamfer 3D")

    from torch.utils.cpp_extension import load
    chamfer_3D = load(name="chamfer_3D",
          sources=[
              "/".join(os.path.abspath(__file__).split('/')[:-1] + ["chamfer_cuda.cpp"]),
              "/".join(os.path.abspath(__file__).split('/')[:-1] + ["chamfer3D.cu"]),
              ])
    print("Loaded JIT 3D CUDA chamfer distance")

else:
    import chamfer_3D
    print("Loaded compiled 3D CUDA chamfer distance")

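This code imports chamfer_3D directly if the package is found in the Python environment. Otherwise it calls torch.utils.cpp_extension.load to compile and load the external C++/CUDA extension at run time.

Because the chamfer_3D package was not installed, the program called load and then hung, producing no output for a long time. The terminal looked like this: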
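
The availability check on its own can be sketched as below. `importlib.find_loader` has been deprecated since Python 3.4 in favor of `importlib.util.find_spec`, so this sketch uses the latter. The helper name `module_available` is made up for illustration and is not part of the project.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    # find_spec returns None when no module with this name is on the import path.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # A prebuilt chamfer_3D would be imported directly; otherwise the JIT path runs.
    if module_available("chamfer_3D"):
        print("prebuilt extension found")
    else:
        print("extension missing, JIT compilation would be triggered")
```

If this prints the second message, the run goes through `load`, which is where the hang described here can occur.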

> (atlasnet) user@ubuntu: ~/chamfer3D$  python chamfer_3D.py
Jitting Chamfer 3D

Forcing the program to stop with Ctrl+C produced the following:

> (atlasnet) user@ubuntu: ~/chamfer3D$  python chamfer_3D.py
Jitting Chamfer 3D
^CTraceback (most recent call last):
  File "dist_chamfer_3D.py", line 15, in <module>
    "/".join(os.path.abspath(__file__).split('/')[:-1] + ["chamfer3D.cu"]),
  File "/home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 974, in load
    keep_intermediates=keep_intermediates)
  File "/home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1183, in _jit_compile
    baton.wait()
  File "/home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/file_baton.py", line 49, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt

Problem analysis

The cause is a mutual-exclusion lock. The code involved is:

### torch/utils/cpp_extension.py

if version != old_version:
   baton = FileBaton(os.path.join(build_directory, 'lock'))
   if baton.try_acquire():
       try:
           with GeneratedFileCleaner(keep_intermediates=keep_intermediates) as clean_ctx:
               if IS_HIP_EXTENSION and (with_cuda or with_cudnn):
                   hipify_python.hipify(
                       project_directory=build_directory,
                       output_directory=build_directory,
                       includes=os.path.join(build_directory, '*'),
                       extra_files=[os.path.abspath(s) for s in sources],
                       show_detailed=verbose,
                       is_pytorch_extension=True,
                       clean_ctx=clean_ctx
                   )
               _write_ninja_file_and_build_library(
                   name=name,
                   sources=sources,
                   extra_cflags=extra_cflags or [],
                   extra_cuda_cflags=extra_cuda_cflags or [],
                   extra_ldflags=extra_ldflags or [],
                   extra_include_paths=extra_include_paths or [],
                   build_directory=build_directory,
                   verbose=verbose,
                   with_cuda=with_cuda)
       finally:
           baton.release()
   else:
       baton.wait()

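From this code you can see roughly what happens: when PyTorch's cpp_extension builds and loads an external library, it takes a lock on the build, implemented by creating a file named "lock" in the build directory. If the program finds an existing "lock" file, it assumes another process is working on the same files, so baton.try_acquire() returns False and the program enters wait() until the lock is released.

While the lock exists, no other process can build or load the extension. If an earlier run was killed after it acquired the lock but before it could release it, the lock file stays behind and every later run blocks on it. I concluded that this is what caused the hang at "Jitting".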
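
The failure mode can be reproduced with a minimal lock-file sketch. `SimpleFileBaton` is a hypothetical stand-in written for this illustration, not PyTorch's `FileBaton`, and the `timeout` argument is an addition of the sketch so the demo terminates instead of hanging.

```python
import os
import tempfile
import time


class SimpleFileBaton:
    """Illustrative lock-file baton (hypothetical, not PyTorch's class)."""

    def __init__(self, lock_file_path, wait_seconds=0.1):
        self.lock_file_path = lock_file_path
        self.wait_seconds = wait_seconds
        self.fd = None

    def try_acquire(self):
        # O_CREAT | O_EXCL fails if the file already exists, so only one caller wins.
        try:
            self.fd = os.open(self.lock_file_path, os.O_CREAT | os.O_EXCL)
            return True
        except FileExistsError:
            return False

    def wait(self, timeout=None):
        # Poll until the lock file disappears. With timeout=None this never
        # returns if the file is never removed, which is the hang described above.
        start = time.monotonic()
        while os.path.exists(self.lock_file_path):
            if timeout is not None and time.monotonic() - start > timeout:
                return False
            time.sleep(self.wait_seconds)
        return True

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            self.fd = None
        os.remove(self.lock_file_path)


def demo():
    with tempfile.TemporaryDirectory() as build_directory:
        lock_path = os.path.join(build_directory, "lock")

        first = SimpleFileBaton(lock_path)
        acquired_first = first.try_acquire()
        # Simulate a killed process: the descriptor goes away, the file stays.
        os.close(first.fd)

        second = SimpleFileBaton(lock_path)
        acquired_second = second.try_acquire()
        timed_out = not second.wait(timeout=0.3)

        # Deleting the stale lock file by hand unblocks the next attempt.
        os.remove(lock_path)
        acquired_after_cleanup = second.try_acquire()
        second.release()
        return acquired_first, acquired_second, timed_out, acquired_after_cleanup


if __name__ == "__main__":
    print(demo())
```

The second baton can neither acquire the lock nor finish waiting until the leftover file is deleted, which mirrors the stuck `baton.wait()` call in the traceback.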

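Solution

First, find out where the lock is.

Open the library file torch/utils/cpp_extension.py and set a breakpoint at line 1156 (the line number in this PyTorch version), which is this statement: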

baton = FileBaton(os.path.join(build_directory, 'lock'))

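When execution stops there, inspect the value of build_directory; the lock file should be in that directory. Delete the lock file there, and the program no longer hangs on the next run.

On Windows with PyCharm, setting breakpoints and inspecting variables is easy. Here is how to do the same on Linux by debugging the Python program with pdb: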

(atlasnet) zhangwenyuan@ubuntu:~/atlas/AtlasNet/auxiliary/ChamferDistancePytorch/chamfer3D$ cd ~/atlas/AtlasNet
(atlasnet) zhangwenyuan@ubuntu:~/atlas/AtlasNet$ python -m pdb train.py --shapenet13
> /home/zhangwenyuan/atlas/AtlasNet/train.py(1)<module>()
-> import sys
(Pdb) b /home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py:1156
Breakpoint 1 at /home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py:1156
(Pdb) c
Jitting Chamfer 3D
> /home/zhangwenyuan/anaconda3/envs/atlasnet/lib/python3.6/site-packages/torch/utils/cpp_extension.py(1156)_jit_compile()
-> baton = FileBaton(os.path.join(build_directory, 'lock'))
(Pdb) p build_directory
'/home/zhangwenyuan/.cache/torch_extensions/chamfer_3D'

So the lock file is in the directory "/home/zhangwenyuan/.cache/torch_extensions/chamfer_3D". After I deleted the lock file from that directory and ran the program again, the problem did not recur.
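
As an alternative to stepping through with a debugger, leftover lock files can be listed with a short script. This is a sketch under assumptions: it assumes the extensions cache lives under ~/.cache/torch_extensions unless the TORCH_EXTENSIONS_DIR environment variable points elsewhere, and the function name `find_lock_files` is invented for this example.

```python
import os
from pathlib import Path


def find_lock_files(root=None):
    """Return every file named 'lock' below the extensions cache root."""
    if root is None:
        # Assumption: TORCH_EXTENSIONS_DIR overrides the default cache location.
        root = os.environ.get(
            "TORCH_EXTENSIONS_DIR",
            os.path.expanduser("~/.cache/torch_extensions"),
        )
    root = Path(root)
    if not root.is_dir():
        return []
    # Search recursively, since the build directory may be nested
    # (for example <root>/<name> or <root>/<python_cuda_tag>/<name>).
    return sorted(p for p in root.rglob("lock") if p.is_file())


if __name__ == "__main__":
    for lock in find_lock_files():
        print(lock)
        # lock.unlink()  # uncomment only when no build is currently running
```

The delete call is left commented out on purpose: removing a lock while another process is really compiling could corrupt that build.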

For this project, the lock file was at /home/geyulong/.cache/torch_extensions/py38_cu121/fused/lock

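After getting past that, TensorFlow_hub fails with urllib.error.URLError:

Download the model files yourself, store them locally, and change the path in the code to point at the local copy. That resolves this error.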

@truemanstock

Wow, that was no easy feat.
