- added xlarge kernel to handle array size >= 2^31
- ported original pytorch kernel
- various small fixes