Committed by xiaoqie
All tests in test_nnet.py pass with CUDA. With OpenCL, only the fp32 tests in test_nnet.py pass; GpuFromHost doesn't work with fp16 or fp64. A larger work-item size doesn't improve performance. Add two local_barrier() calls. Strangely, AMD cards don't need these barriers, but they are necessary on NVIDIA cards.
fc36eefb