# Half Precision

Half precision training could bring benefits:

- using less GPU memory, supporting larger network.
- training faster.

## Half data type

### Half data type definition

The IEEE 754 standard specifies a binary16 as having the following format: Sign bit: 1 bit Exponent width: 5 bits Significand precision: 11 bits (10 explicitly stored)

### Half data type operation

Load data in fp32 and easily convert to fp16 by casting.

```
>>> from singa import tensor, device
>>> dev = device.create_cuda_gpu()
>>> x = tensor.random((2,3),dev)
>>> x
[[0.7703407 0.42764223 0.5872884 ]
[0.78362167 0.70469785 0.64975065]], float32
>>> y = x.as_type(tensor.float16)
>>> y
[[0.7705 0.4277 0.5874]
[0.7837 0.7046 0.65 ]], float16
```

Primary operations are supported in fp16.

```
>>> y+y
[[1.541 0.8555 1.175 ]
[1.567 1.409 1.3 ]], float16
```

## Training in Half

### Training in Half three step

Training in half precision could be done easily in three steps:

- Load data and convert to half
- Set data type of optimizer
- Train model as usual

```
# cast input data to fp16
x = load_data()
x = x.astype(np.float16)
tx = tensor.from_numpy(x)
# load model
model = build_model()
# set optimizer dtype to fp16
sgd = opt.SGD(lr=0.1, dtype=tensor.float16)
# train as usual
out, loss = model(tx, ty)
```

### Example

An example script is `train_cnn.py`

, run below command to train in half.

```
python examples/cnn/train_cnn.py cnn mnist -pfloat16
```

## Implementation

### Half Type Dependency

This half implementation is integrated in C++ backend as general half type support.

To run on GPU, `__half`

is available in Cuda math API. To support `__half`

math operation, it is required to compile against Nvidia compute arch > 6.0
(Pascal).

### Nvidia Hardware Acceleration: Tensor Core

Tensor Core released by Nvidia further accelerates half precision and multiples throughput for operations like GEMM(CuBlas) and convolution(CuDNN). To enable Tensor core operation, there are a few restriction on GEMM dimensions, convolution channel size, Cuda version, and GPU version(Turing or later) and etc.

### Implement Operations

Half operations are primarily implemented in `tensor_math_cuda.h`

, by specializing
operation template with half type and implementation the low level computation.

For example, GEMM operation is implemented as:

```
template <>
void GEMM<half_float::half, lang::Cuda>(const half_float::half alpha,
const Tensor& A, const Tensor& B,
const half_float::half beta, Tensor* C,
Context* ctx) {
// ...
CUBLAS_CHECK(cublasGemmEx(handle, transb, transa, ncolB, nrowA, ncolA,
alphaPtr, BPtr, Btype, ldb, APtr, Atype, lda,
betaPtr, CPtr, Ctype, ldc, computeType, algo));
// ...
}
```