CUDA fix
All tests in test_nnet.py pass with CUDA.
Only fp32 tests in test_nnet.py pass with OpenCL. GpuFromHost doesn't work with fp16 or fp64.
Larger work item size doesn't improve performance.
Add two local_barrier() calls. Strangely, the AMD card doesn't need them, but they are necessary on NVIDIA cards.
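
For illustration only, here is a minimal CUDA sketch of the kind of kernel where such barriers matter. It is not the kernel changed in this commit. It assumes local_barrier() maps to __syncthreads() in a CUDA build; the names block_sum, block_sum_host, and BLOCK are hypothetical. One possible explanation for the AMD/NVIDIA difference, offered as an unverified assumption: on hardware where a wavefront is 64 work items wide, a small work group may run in lockstep inside one wavefront, while NVIDIA warps are 32 wide, so the same work group spans several warps and needs explicit synchronization.

```cuda
#include <cuda_runtime.h>

// Assumed mapping of the portable barrier macro for a CUDA build.
#define local_barrier() __syncthreads()

#define BLOCK 128  // illustrative work-group size; must be a power of two

// Sums n floats using a single block of BLOCK threads.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float buf[BLOCK];
    int tid = threadIdx.x;

    // Each thread accumulates a strided slice of the input.
    float acc = 0.0f;
    for (int i = tid; i < n; i += BLOCK) acc += in[i];
    buf[tid] = acc;
    local_barrier();  // every partial sum must be written before the reduction starts

    // Tree reduction in shared memory.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        local_barrier();  // finish this level before any thread reads the next one
    }
    if (tid == 0) *out = buf[0];
}

// Hypothetical host helper: copies the data over, runs the kernel, returns the sum.
float block_sum_host(const float *h_in, int n) {
    float *d_in = nullptr, *d_out = nullptr, h_out = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<1, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Without either barrier, a thread in one warp could read a buf entry that a thread in another warp has not yet written. Both barriers sit outside the if so that every thread in the block reaches them.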