    cuda fix · fc36eefb
    Committed by xiaoqie
    All tests in test_nnet.py pass with CUDA.
    With OpenCL, only the fp32 tests in test_nnet.py pass, because GpuFromHost doesn't work with fp16 or fp64.
    A larger work-item size doesn't improve performance.
    Add two local_barrier() calls. Strangely, AMD cards don't need these barriers, but they are necessary on NVIDIA cards.
nnet.py 45.3 KB