Commit ca465be0 authored by abergeron

Merge pull request #3198 from nouiz/cumem3

Add CNMeM in Theano to speed up CUDA allocation.
@@ -2,6 +2,7 @@ global-include *.txt
 global-include *.c
 global-include *.cu
 global-include *.cuh
+global-include *.cpp
 global-include *.h
 global-include *.sh
 global-include *.pkl
...
@@ -11,7 +11,7 @@ Acknowledgements
 * The developers of `NumPy <http://numpy.scipy.org/>`_. Theano is based on its ndarray object and uses much of its implementation.
 * The developers of `SciPy <http://scipy.org/>`_. Our sparse matrix support uses their sparse matrix objects. We also reuse other parts.
-* All Theano authors in the commit log.
+* All `Theano contributors <https://github.com/Theano/Theano/graphs/contributors>`_.
 * All Theano users that have given us feedback.
 * The GPU implementation of tensordot is based on code from Tijmen
   Tieleman's `gnumpy <http://www.cs.toronto.edu/~tijmen/gnumpy.html>`_
@@ -24,3 +24,4 @@ Acknowledgements
   P. L'Ecuyer and R. Touzin, `Fast Combined Multiple Recursive Generators with Multipliers of the form a = +/- 2^d +/- 2^e <http://www.informs-sim.org/wsc00papers/090.PDF>`_, Proceedings of the 2000 Winter Simulation Conference, Dec. 2000, 683--689.
   We were authorized by Pierre L'Ecuyer to copy/modify his Java implementation in the `SSJ <http://www.iro.umontreal.ca/~simardr/ssj/>`_ software and to relicense it under BSD 3-Clauses in Theano.
+* A better GPU memory allocator, :attr:`CNMeM <config.lib.cnmem>`, is included in Theano. It has the same license.
@@ -72,13 +72,18 @@ and use directly the optimized graph from the pickled file.
 Faster Theano function
 ----------------------
 
-You can set the Theano flag ``allow_gc`` to ``False`` to get a speed-up by using
+You can set the Theano flag :attr:`allow_gc <config.allow_gc>` to ``False`` to get a speed-up by using
 more memory. By default, Theano frees intermediate results when we don't need
 them anymore. Doing so prevents us from reusing this memory. So disabling the
 garbage collection will keep all intermediate results' memory space to allow to
 reuse them during the next call to the same Theano function, if they are of the
 correct shape. The shape could change if the shapes of the inputs change.
+
+.. note::
+
+    With :attr:`CNMeM <config.lib.cnmem>`, disabling the garbage collection
+    is much less useful on the GPU.
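The reuse idea behind ``allow_gc=False`` can be sketched in plain Python. This is a toy cache of hypothetical design, not Theano's implementation (the real machinery lives in Theano's C code): keeping garbage collection off amounts to holding each intermediate buffer alive, keyed by its shape, so the next call to the same function can reuse the memory when the shapes match.

```python
class BufferPool:
    """Toy model of the allow_gc=False trade-off: keep every
    intermediate buffer alive, keyed by its shape, and hand the
    same memory back when the next call needs the same shape."""

    def __init__(self):
        self._cache = {}

    def get(self, shape):
        # Reuse the previously allocated buffer if the shape matches,
        # otherwise allocate a fresh one and remember it for next time.
        buf = self._cache.get(shape)
        if buf is None:
            nfloats = 1
            for dim in shape:
                nfloats *= dim
            buf = bytearray(4 * nfloats)  # pretend float32 storage
            self._cache[shape] = buf
        return buf

pool = BufferPool()
a = pool.get((4, 5))
b = pool.get((4, 5))   # same shape: the same memory is handed back
c = pool.get((2, 3))   # new shape: a fresh allocation
```

The cost is exactly what the paragraph above describes: buffers for every shape ever seen stay allocated, trading memory for speed.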
 .. _unsafe_optimization:
 
 Unsafe optimization
...
@@ -21,6 +21,9 @@ Montreal).
 News
 ====
 
+* We added support for :attr:`CNMeM <config.lib.cnmem>` to speed up
+  GPU memory allocation.
 * Theano 0.7 was released 26th March 2015. Everybody is encouraged to update.
 * We support `cuDNN <http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html>`_ if it is installed by the user.
...
@@ -370,6 +370,34 @@ import theano and print the config variable, as in:
     `amdlibm <http://developer.amd.com/cpu/libraries/libm/>`__
     library, which is faster than the standard libm.
 
+.. attribute:: lib.cnmem
+
+    Float value: >= 0
+
+    Controls whether we enable `CNMeM <https://github.com/NVIDIA/cnmem>`_,
+    a faster CUDA memory allocator. Available in the Theano development
+    version until 0.7.1 is released.
+
+    The library is included in Theano; you do not need to install it
+    separately. The value is the start size of the memory pool (in MB,
+    or as a fraction of total GPU memory). If more memory is needed,
+    CNMeM will try to get more, but this can increase memory
+    fragmentation:
+
+    * 0: not enabled.
+    * 0 < N <= 1: fraction of the total GPU memory (clipped to 0.985 to
+      leave memory for the driver).
+    * N > 1: use that number of MB of memory.
+
+    Default: 0 (this may change later).
+
+    .. note::
+
+        This could cause memory fragmentation. If you get a memory
+        error while using CNMeM, try to allocate more memory at the
+        start, or disable it. If you try this, report your result on
+        :ref:`theano-dev`.
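The three ranges of the flag can be written out as a small helper. This is a hypothetical sketch mirroring the documented rules, not Theano code: values in (0, 1] are a fraction of total GPU memory clipped to 0.985, and values above 1 are megabytes.

```python
def cnmem_pool_size(cnmem, total_bytes):
    """Translate a lib.cnmem flag value into a pool start size in bytes.

    Hypothetical helper mirroring the documented rules above;
    not part of Theano's API.
    """
    if cnmem == 0:
        return 0                           # CNMeM disabled
    if cnmem <= 1:
        # Fraction of total GPU memory, clipped to leave
        # room for the driver.
        return int(total_bytes * min(cnmem, 0.985))
    return int(cnmem) * 1024 * 1024        # interpreted as megabytes

total = 4 * 1024 ** 3       # pretend the GPU has 4 GB
cnmem_pool_size(0, total)   # disabled
cnmem_pool_size(0.5, total) # half of total GPU memory
cnmem_pool_size(1000, total)  # a 1000 MB pool
```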
 .. attribute:: linker
 
     String value: 'c|py', 'py', 'c', 'c|py_nogc'
...
@@ -164,7 +164,7 @@ def do_setup():
         install_requires=['numpy>=1.6.2', 'scipy>=0.11', 'six>=1.9.0'],
         package_data={
             '': ['*.txt', '*.rst', '*.cu', '*.cuh', '*.c', '*.sh', '*.pkl',
-                 '*.h', 'ChangeLog'],
+                 '*.h', '*.cpp', 'ChangeLog'],
             'theano.misc': ['*.sh']
         },
         scripts=['bin/theano-cache', 'bin/theano-nose', 'bin/theano-test'],
...
@@ -13,7 +13,8 @@ from theano.compile import optdb
 from theano.gof import EquilibriumDB, SequenceDB
 from theano.gof.cmodule import get_lib_extension
 from theano.gof.compilelock import get_lock, release_lock
-from theano.configparser import config, AddConfigVar, StrParam, BoolParam
+from theano.configparser import (
+    config, AddConfigVar, BoolParam, FloatParam, StrParam)
 from . import nvcc_compiler
 
 # ignore_newtrees is to speed the optimization as this is the pattern
@@ -54,6 +55,21 @@ AddConfigVar('cublas.lib',
             """Name of the cuda blas library for the linker.""",
             StrParam('cublas'))
 
+AddConfigVar('lib.cnmem',
+             """Do we enable CNMeM or not (a faster CUDA memory allocator).
+
+             The parameter represents the start size (in MB or as a
+             fraction of total GPU memory) of the memory pool.
+
+             0: not enabled.
+             0 < N <= 1: fraction of the total GPU memory (clipped to .985 for driver memory).
+             N > 1: use that number of MB of memory.
+             """,
+             # We should not mix both allocators, so we can't override.
+             FloatParam(0, lambda i: i >= 0, allow_override=False),
+             in_c_key=False)
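The ``FloatParam(0, lambda i: i >= 0, allow_override=False)`` pattern above, a validated float that cannot be changed once it is in use, can be sketched as a toy class. Names and the freeze-on-first-read behaviour here are illustrative assumptions, not Theano's actual ``configparser`` implementation.

```python
class FrozenFloatParam:
    """Toy stand-in for a validated, non-overridable config parameter."""

    def __init__(self, default, validator):
        if not validator(default):
            raise ValueError("default rejected by validator")
        self._validator = validator
        self._value = float(default)
        self._frozen = False

    @property
    def value(self):
        self._frozen = True  # first read freezes the parameter
        return self._value

    @value.setter
    def value(self, new):
        if self._frozen:
            raise RuntimeError("cannot override the parameter after first use")
        if not self._validator(new):
            raise ValueError("value rejected by validator")
        self._value = float(new)

cnmem = FrozenFloatParam(0, lambda i: i >= 0)
cnmem.value = 0.5   # allowed: not yet read, and passes the >= 0 check
```

The point of forbidding overrides is stated in the comment above: mixing two allocators in one process would be unsafe, so the choice must be fixed before the GPU is initialized.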
 # is_nvcc_available called here to initialize global vars in
 # nvcc_compiler module
 nvcc_compiler.is_nvcc_available()
@@ -107,6 +123,8 @@ def try_import():
               'cuda_ndarray.cu',
               'cuda_ndarray.cuh',
               'conv_full_kernel.cu',
+              'cnmem.h',
+              'cnmem.cpp',
               'conv_kernel.cu')
     stat_times = [os.stat(os.path.join(cuda_path, cuda_file))[stat.ST_MTIME]
                   for cuda_file in cuda_files]
@@ -178,7 +196,8 @@ if compile_cuda_ndarray and cuda_available:
                 location=cuda_ndarray_loc,
                 include_dirs=[cuda_path],
                 libs=[config.cublas.lib],
-                preargs=['-O3'] + compiler.compile_args())
+                preargs=['-O3'] + compiler.compile_args(),
+                )
         from cuda_ndarray.cuda_ndarray import *
     except Exception as e:
         _logger.error("Failed to compile cuda_ndarray.cu: %s", str(e))
@@ -377,7 +396,7 @@ def use(device,
         try:
             if (device != 'gpu') and not pycuda_init_dev:
                 assert isinstance(device, int)
-                gpu_init(device)
+                gpu_init(device, config.lib.cnmem)
                 use.device_number = device
                 assert active_device_number() == device
             else:
@@ -387,10 +406,10 @@ def use(device,
                 # query the active GPU. If we check the active GPU before
                 # the device is initialized we will always receive 0
                 # even if another device is selected later.
-                cuda_ndarray.cuda_ndarray.CudaNdarray.zeros((2, 3))
+                cuda_ndarray.cuda_ndarray.select_a_gpu()
                 use.device_number = active_device_number()
                 # This is needed to initialize the cublas handle.
-                gpu_init(use.device_number)
+                gpu_init(use.device_number, config.lib.cnmem)
         if test_driver:
             import theano.sandbox.cuda.tests.test_driver
@@ -403,8 +422,9 @@ def use(device,
                                  " this property")
         if config.print_active_device:
-            print("Using gpu device %d: %s" % (
-                active_device_number(), active_device_name()), file=sys.stderr)
+            cnmem_enabled = "enabled" if config.lib.cnmem else "disabled"
+            print("Using gpu device %d: %s (CNMeM is %s)" % (
+                active_device_number(), active_device_name(), cnmem_enabled), file=sys.stderr)
         if device_properties(use.device_number)['regsPerBlock'] < 16384:
             # We will try to use too much register per bloc at many places
             # when there is only 8k register per multi-processor.
...
@@ -137,13 +137,9 @@ class BatchedDotOp(GpuOp):
             host_z[i] = host_z[i - 1] + z_stride;
         }
 
-        err1 = cudaMalloc((void **)&gpu_x, ptr_array_size);
-        if (err1 != cudaSuccess)
-        {
-            CLEANUP();
-            PyErr_Format(PyExc_RuntimeError,
-                         "%%s", "cudaMalloc failure");
+        gpu_x = (float **) device_malloc(ptr_array_size);
+        if (gpu_x == NULL){
             %(fail)s;
         }
@@ -195,7 +191,7 @@ class BatchedDotOp(GpuOp):
         do \
         { \
             if (host_x) free (host_x); \
-            if (gpu_x) cudaFree(gpu_x); \
+            if (gpu_x) device_free(gpu_x); \
         } while (0)
         """
@@ -213,6 +209,9 @@ class BatchedDotOp(GpuOp):
         return rval
 
+    def c_code_cache_version(self):
+        return (1,)
+
 batched_dot = BatchedDotOp()
 
 class GpuDot22(GpuOp):
...
@@ -208,22 +208,28 @@ static int SparseBlockGemv_copy(PyArrayObject *a, npy_intp *b) {
 static int %(n)s_prep(int b, int i, int j, int outsize) {
   int s = b*i*j;
   if (%(n)s_list_len < s) {
-    cudaFree(%(n)s_inp_list);
-    cudaFree(%(n)s_out_list);
-    cudaFree(%(n)s_W_list);
-    if (cudaMalloc(&%(n)s_inp_list, s*sizeof(float *)) != cudaSuccess) return -1;
-    if (cudaMalloc(&%(n)s_out_list, s*sizeof(float *)) != cudaSuccess) return -1;
-    if (cudaMalloc(&%(n)s_W_list, s*sizeof(float *)) != cudaSuccess) return -1;
+    device_free(%(n)s_inp_list);
+    device_free(%(n)s_out_list);
+    device_free(%(n)s_W_list);
+    %(n)s_inp_list = (const float **) device_malloc(s*sizeof(float *));
+    if (%(n)s_inp_list == NULL) return -1;
+    %(n)s_out_list = (float **) device_malloc(s*sizeof(float *));
+    if (%(n)s_out_list == NULL) return -1;
+    %(n)s_W_list = (const float **) device_malloc(s*sizeof(float *));
+    if (%(n)s_W_list == NULL) return -1;
     %(n)s_list_len = s;
   }
   if (%(n)s_iIdx_len < b*i) {
-    cudaFree(%(n)s_iIdx);
-    if (cudaMalloc(&%(n)s_iIdx, b*i*sizeof(npy_intp)) != cudaSuccess) return -1;
+    device_free(%(n)s_iIdx);
+    %(n)s_iIdx = (npy_intp*) device_malloc(b*i*sizeof(npy_intp));
+    if (%(n)s_iIdx == NULL) return -1;
     %(n)s_iIdx_len = b*i;
   }
   if (%(n)s_oIdx_len < b*j) {
-    cudaFree(%(n)s_oIdx);
-    if (cudaMalloc(&%(n)s_oIdx, b*j*sizeof(npy_intp)) != cudaSuccess) return -1;
+    device_free(%(n)s_oIdx);
+    %(n)s_oIdx = (npy_intp*) device_malloc(b*j*sizeof(npy_intp));
+    if (%(n)s_oIdx == NULL) return -1;
     %(n)s_oIdx_len = b*j;
   }
   return 0;
@@ -326,7 +332,7 @@ CudaNdarray_HOST_STRIDES(%(out)s)[0], CudaNdarray_HOST_STRIDES(%(out)s)[1],
             W=W, fail=sub['fail'], name=nodename)
 
     def c_code_cache_version(self):
-        return (11,)
+        return (12,)
 
     def grad(self, inputs, grads):
         o, W, h, inputIdx, outputIdx = inputs
@@ -509,24 +515,27 @@ static size_t %(n)s_yIdx_len;
 static int %(n)s_prep(int b, int i, int j) {
   int s = b*i*j;
   if (%(n)s_list_len < s) {
-    cudaFree(%(n)s_x_list);
-    cudaFree(%(n)s_y_list);
-    cudaFree(%(n)s_out_list);
-    if (cudaMalloc(&%(n)s_x_list, s*sizeof(float *)) != cudaSuccess) return -1;
-    if (cudaMalloc(&%(n)s_y_list, s*sizeof(float *)) != cudaSuccess) return -1;
-    if (cudaMalloc(&%(n)s_out_list, s*sizeof(float *)) != cudaSuccess) return -1;
+    device_free(%(n)s_x_list);
+    device_free(%(n)s_y_list);
+    device_free(%(n)s_out_list);
+    %(n)s_x_list = (const float **) device_malloc(s*sizeof(float *));
+    if (%(n)s_x_list == NULL) return -1;
+    %(n)s_y_list = (const float **) device_malloc(s*sizeof(float *));
+    if (%(n)s_y_list == NULL) return -1;
+    %(n)s_out_list = (float **) device_malloc(s*sizeof(float *));
+    if (%(n)s_out_list == NULL) return -1;
     %(n)s_list_len = s;
   }
   if (%(n)s_xIdx_len < b*i) {
-    cudaFree(%(n)s_xIdx);
-    if (cudaMalloc(&%(n)s_xIdx, b*i*sizeof(npy_intp)) != cudaSuccess)
-      return -1;
+    device_free(%(n)s_xIdx);
+    %(n)s_xIdx = (npy_intp*) device_malloc(b*i*sizeof(npy_intp));
+    if (%(n)s_xIdx == NULL) return -1;
     %(n)s_xIdx_len = b*i;
   }
   if (%(n)s_yIdx_len < b*j) {
-    cudaFree(%(n)s_yIdx);
-    if (cudaMalloc(&%(n)s_yIdx, b*j*sizeof(npy_intp)) != cudaSuccess)
-      return -1;
+    device_free(%(n)s_yIdx);
+    %(n)s_yIdx = (npy_intp*) device_malloc(b*j*sizeof(npy_intp));
+    if (%(n)s_yIdx == NULL) return -1;
     %(n)s_yIdx_len = b*j;
   }
   return 0;
@@ -626,7 +635,7 @@ CudaNdarray_HOST_STRIDES(%(out)s)[0], CudaNdarray_HOST_STRIDES(%(out)s)[1],
             alpha=alpha, fail=sub['fail'])
 
     def c_code_cache_version(self):
-        return (10,)
+        return (11,)
 
 sparse_block_outer_ss = SparseBlockOuterSS(False)
...
(diff collapsed)
(diff collapsed)
@@ -9,6 +9,13 @@
 #include "cuda_ndarray.cuh"
 
+#ifndef CNMEM_DLLEXPORT
+#define CNMEM_DLLEXPORT
+#endif
+#include "cnmem.h"
+#include "cnmem.cpp"
+
 //If true, when there is a gpu malloc or free error, we print the size of allocated memory on the device.
 #define COMPUTE_GPU_MEM_USED 0
@@ -67,6 +74,54 @@ void * device_malloc(size_t size)
     return device_malloc(size, VERBOSE_DEVICE_MALLOC);
 }
 
+///@TODO: thejaswi: link this option to a theano config variable?
+static bool g_use_cnmem = false;
+static const int g_max_devices = 8;
+int initCnmem(int card_number_provided, int card_nb, size_t mem) {
+    static bool cnmemInitialized = false;
+    if(cnmemInitialized) {
+        return 0;
+    }
+    // On stderr to be at the same place as "Using gpu device..."
+    int numDevices = 0;
+    cnmemDevice_t devices[g_max_devices];
+    if(cudaGetDeviceCount(&numDevices) != cudaSuccess) {
+        PyErr_Format(PyExc_RuntimeError,
+                     "initCnmem: 'cudaGetDeviceCount' failed! Reason=%s\n",
+                     cudaGetErrorString(cudaGetLastError()));
+        return -1;
+    }
+    if(card_number_provided){
+        numDevices = 1;
+        int i = 0;
+        devices[i].device = card_nb;
+        devices[i].size = mem;
+        ///@TODO: thejaswi: add support for multiple streams
+        devices[i].numStreams = 0;
+        devices[i].streams = NULL;
+        devices[i].streamSizes = NULL;
+    }else{
+        for(int i=0;i<numDevices;++i) {
+            devices[i].device = i;
+            devices[i].size = mem;
+            ///@TODO: thejaswi: add support for multiple streams
+            devices[i].numStreams = 0;
+            devices[i].streams = NULL;
+        }
+    }
+    ///@TODO: thejaswi: passing custom cnmem flags?
+    cnmemStatus_t status = cnmemInit(numDevices, devices, CNMEM_FLAGS_DEFAULT);
+    if(status != CNMEM_STATUS_SUCCESS) {
+        PyErr_Format(PyExc_RuntimeError,
+                     "initCnmem: cnmemInit call failed! Reason=%s. numdev=%d\n",
+                     cnmemGetErrorString(status), numDevices);
+        return -1;
+    }
+    cnmemInitialized = true;
+    return 0;
+}
+
 void * device_malloc(size_t size, int verbose)
 {
     #if PRECHECK_ERROR
@@ -81,6 +136,18 @@ void * device_malloc(size_t size, int verbose)
     }
     #endif
     void * rval=NULL;
+
+    ///@TODO: thejaswi: support for multiple-streams?
+    if(g_use_cnmem) {
+        cnmemStatus_t status = CNMEM_STATUS_SUCCESS;
+        status = cnmemMalloc(&rval, size, NULL);
+        if(status != CNMEM_STATUS_SUCCESS) {
+            PyErr_Format(PyExc_MemoryError,
+                         "Error allocating %zd bytes of device memory (%s).",
+                         size, cnmemGetErrorString(status));
+            return NULL;
+        }
+    }
+    else {
     cudaError_t err = cudaMalloc(&rval, size);
     if (cudaSuccess != err)
     {
@@ -118,6 +185,7 @@ void * device_malloc(size_t size, int verbose)
                      size, cudaGetErrorString(err));
         return NULL;
     }
+    }
     if (rval != NULL){
         // Can it happen that cudaMalloc return cudaSuccess, but return a NULL ptr?
         // Could this be what happen if size is 0?
@@ -202,6 +270,15 @@ int device_free(void *ptr)
         return 0;
     }
 
+    ///@TODO: thejaswi: multi-stream support
+    if(g_use_cnmem) {
+        cnmemStatus_t status = cnmemFree(ptr, NULL);
+        if(status != CNMEM_STATUS_SUCCESS) {
+            fprintf(stderr, "device_free: cnmemFree call failed! Reason=%s\n",
+                    cnmemGetErrorString(status));
+        }
+    }
+    else {
     // We need sync as the Theano's GC could remove intermediate variable that
     // are still needed as the gpu kernel are running or in the queue.
     CNDA_BEGIN_ALLOW_THREADS
@@ -259,6 +336,7 @@ int device_free(void *ptr)
                      cudaGetErrorString(err));
         return -1;
     }
+    }
     _outstanding_mallocs[0] -= (ptr != NULL);
     #if COMPUTE_GPU_MEM_USED
     int i=0;
@@ -2863,6 +2941,32 @@ CudaNdarray_cublasv2(PyObject* _unused, PyObject* dummy)
     return Py_True;
 }
 
+PyObject *
+CudaNdarray_select_a_gpu(PyObject* _unused, PyObject* dummy)
+{
+    void * rval = NULL;
+    cudaError_t err = cudaMalloc(&rval, 4);
+    if (cudaSuccess != err){
+        printf("ERR!\\n");
+        PyErr_Format(PyExc_RuntimeError,
+                     "Not able to do basic stuff on the GPU (alloc of 4 bytes) (%s).",
+                     cudaGetErrorString(err));
+        return NULL;
+    }
+    err = cudaFree(rval);
+    if (cudaSuccess != err){
+        printf("ERR!\\n");
+        PyErr_Format(PyExc_RuntimeError,
+                     "Not able to do basic stuff on the GPU (cudaFree failed) (%s).",
+                     cudaGetErrorString(err));
+        return NULL;
+    }
+    Py_INCREF(Py_None);
+    return Py_None;
+}
+
 #if COMPUTE_GPU_MEM_USED
 /*
  * Return the size in bytes that Theano currently have allocated on the gpu.
@@ -3030,18 +3134,23 @@ CudaNdarray_ptr_int_size(PyObject* _unused, PyObject* args)
 static int cublas_init();
 static void cublas_shutdown();
 
 // Initialize the gpu.
-// Takes one optional parameter, the device number.
-// If provided, it sets that device to be the active device.
+// Takes two optional parameters, the device number and if we should use cnmem.
+// If the device number is provided, it sets that device to be the active device.
 // If not provided (usually just to test whether the gpu is available at all),
 // it does not set an active device.
 // Raises EnvironmentError or ValueError (as appropriate) if the initialization failed.
+// cnmem is treated like a bool. If converted to 0, don't use cnmem. Otherwise, use it.
 PyObject *
 CudaNdarray_gpu_init(PyObject* _unused, PyObject* args)
 {
     int card_nb = 0;
     int card_number_provided = 1;
+    float cnmem = 0; // Theano flag lib.cnmem
 
-    PyArg_ParseTuple(args, "|i", &card_nb); // if we're given something wildly invalid, this will throw a TypeError
+    // if we're given something wildly invalid, this will throw a TypeError
+    if(!PyArg_ParseTuple(args, "|if", &card_nb, &cnmem))
+        return NULL;
+    if(cnmem)
+        g_use_cnmem = true;
 
     if(PyTuple_Size(args) == 0) {
         card_number_provided = 0;
@@ -3096,6 +3205,34 @@ CudaNdarray_gpu_init(PyObject* _unused, PyObject* args)
         if (cublas_init() == -1)
             return NULL;
     }
+    if(card_number_provided && g_use_cnmem) {
+        size_t mem = 0;
+        if (cnmem > 1)
+            mem = cnmem * 1024 * 1024;
+        else{
+            // Clip to 98.5% to leave memory for the driver.
+            if (cnmem > .985){
+                cnmem = .985;
+            }
+            size_t free = 0, total = 0;
+            cudaError_t err = cudaMemGetInfo(&free, &total);
+            if (err != cudaSuccess){
+                // Clear the error flag, cudaMemGetInfo doesn't do it.
+                // Currently this returns the same thing as err, but if in the
+                // future it returns something else, I still don't see why we
+                // should ignore it. All we want to do here is reset the flag.
+                cudaGetLastError();
+                PyErr_Format(PyExc_RuntimeError,
+                             "Error while getting memory info about the gpu: %s",
+                             cudaGetErrorString(err));
+                return NULL;
+            }
+            mem = total * cnmem;
+        }
+        if(initCnmem(card_number_provided, card_nb, mem) == -1){
+            return NULL;
+        }
+    }
 
     Py_INCREF(Py_None);
     return Py_None;
@@ -3126,8 +3263,20 @@ PyObject *
 CudaNdarray_gpu_shutdown(PyObject* _unused, PyObject* _unused_args) {
     // Don't handle errors here
     cublas_shutdown();
-    cudaThreadExit();
     g_gpu_context_active = 0; // context has now been closed down
+    if(g_use_cnmem) {
+        cnmemStatus_t status = cnmemFinalize();
+        if(status != CNMEM_STATUS_SUCCESS) {
+            fprintf(stderr, "CudaNdarray_gpu_shutdown: cnmemFinalize failed! Reason=%s\n",
+                    cnmemGetErrorString(status));
+            if(status == CNMEM_STATUS_CUDA_ERROR) {
+                fprintf(stderr, "  Cuda-Reason=%s\n",
+                        cudaGetErrorString(cudaGetLastError()));
+            }
+        }
+    }
+    cudaThreadExit();
     Py_INCREF(Py_None);
     return Py_None;
 }
@@ -3392,6 +3541,7 @@ static PyMethodDef module_methods[] = {
     {"dimshuffle", CudaNdarray_Dimshuffle, METH_VARARGS, "Returns the dimshuffle of a CudaNdarray."},
     {"dot", CudaNdarray_Dot, METH_VARARGS, "Returns the matrix product of two CudaNdarray arguments."},
     {"gpu_init", CudaNdarray_gpu_init, METH_VARARGS, "Select the gpu card to use; also usable to test whether CUDA is available."},
+    {"select_a_gpu", CudaNdarray_select_a_gpu, METH_NOARGS, "Call this method if you want to select a GPU before the gpu_init call and let the driver choose the GPU."},
     {"active_device_name", CudaNdarray_active_device_name, METH_VARARGS, "Get the name of the active device."},
     {"active_device_number", CudaNdarray_active_device_number, METH_VARARGS, "Get the number of the active device."},
     {"gpu_shutdown", CudaNdarray_gpu_shutdown, METH_VARARGS, "Shut down the gpu."},
...