Data Types for ML

Published

2026-03-30

Modified

2026-04-01

Generated using ChatGPT

We will be using PyTorch to discuss data types. Make sure it is installed and imported before following along with the code.

import torch
Python

Integer

Unsigned Integer

An unsigned integer data type is used to represent a positive integer value.

The range for an n-bit unsigned integer is \([0, 2^n - 1]\).

For example, the minimum value of an 8-bit unsigned integer is 0 and the maximum value is \((2^8 - 1) = (256 - 1) = 255\).
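The range formula can be verified in plain Python (the helper name `uint_range` is just for illustration):

```python
def uint_range(n_bits):
    """Range of an n-bit unsigned integer: [0, 2^n - 1]."""
    return 0, 2 ** n_bits - 1

print(uint_range(8))   # (0, 255)
print(uint_range(16))  # (0, 65535)
```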

The computer allocates a sequence of 8 bits to store an 8-bit integer.

uint8

torch.uint8 can be used to define an 8-bit unsigned integer.

torch.iinfo(torch.uint8)
Python
iinfo(min=0, max=255, dtype=uint8)

Signed Integer

A signed integer data type is used to represent a negative or positive integer.

We consider the two’s complement representation for signed integers. The range for an n-bit signed integer is \([-2^{n-1},\, 2^{n-1} - 1]\).

For example, the minimum value of an 8-bit signed integer is \(-2^{7} = -128\) and the maximum value is \(2^{7} - 1 = 127\).
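A quick sketch in plain Python of the signed range formula, and of how two’s complement maps a negative value onto an unsigned bit pattern (the helper names are illustrative):

```python
def int_range(n_bits):
    """Two's complement range of an n-bit signed integer."""
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

print(int_range(8))  # (-128, 127)

def to_twos_complement(value, n_bits=8):
    """Bit pattern (as an unsigned int) that stores `value` in two's complement."""
    return (value + 2 ** n_bits) % (2 ** n_bits)

print(to_twos_complement(-128))     # 128 -> stored as 0b10000000
print(bin(to_twos_complement(-1)))  # 0b11111111
```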

int8

torch.int8 can be used to define an 8-bit signed integer.

torch.iinfo(torch.int8)
Python
iinfo(min=-128, max=127, dtype=int8)

int16 / short

torch.int16 can be used to define a 16-bit signed integer. There is also an alias for this data type: torch.short.

torch.iinfo(torch.int16)
Python
iinfo(min=-32768, max=32767, dtype=int16)
torch.iinfo(torch.short)
Python
iinfo(min=-32768, max=32767, dtype=int16)

int32 / int

torch.int32 can be used to define a 32-bit signed integer. There is also an alias for this data type: torch.int.

torch.iinfo(torch.int32)
Python
iinfo(min=-2.14748e+09, max=2.14748e+09, dtype=int32)
torch.iinfo(torch.int)
Python
iinfo(min=-2.14748e+09, max=2.14748e+09, dtype=int32)

int64 / long

torch.int64 can be used to define a 64-bit signed integer. There is also an alias for this data type: torch.long.

torch.iinfo(torch.int64)
Python
iinfo(min=-9.22337e+18, max=9.22337e+18, dtype=int64)
torch.iinfo(torch.long)
Python
iinfo(min=-9.22337e+18, max=9.22337e+18, dtype=int64)

Floating Point

There are three major components in a floating point representation.

  1. Sign: Only 1 bit is needed to define if the number is positive or negative.
  2. Exponent (range): It represents the range of the number; how big in magnitude a number can be in both the positive and negative direction.
  3. Fraction (precision): It determines the precision of the number; how many significant digits can be represented.
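To make the three components concrete, here is a sketch that extracts the sign, exponent, and fraction bits of an FP32 value using Python’s struct module (the helper name `float32_bits` is illustrative):

```python
import struct

def float32_bits(x):
    """Split an FP32 value into its sign, exponent, and fraction bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = float32_bits(-6.5)
print(sign, exponent, fraction)  # 1 129 5242880

# Reconstruct the value: (-1)^sign * (1 + fraction/2^23) * 2^(exponent - 127)
print((-1) ** sign * (1 + fraction / 2 ** 23) * 2 ** (exponent - 127))  # -6.5
```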

FP64

Python stores floating point values in FP64 (double precision).

value = 1 / 3
format(value, '.60f')
Python
'0.333333333333333314829616256247390992939472198486328125000000'

Now, let’s create an FP64 tensor.

tensor_fp64 = torch.tensor(value, dtype=torch.float64)
format(tensor_fp64.item(), '.60f')
Python
'0.333333333333333314829616256247390992939472198486328125000000'

The Python floating point value and the tensor floating point value are the same.

FP32

  • Sign: 1 bit
  • Exponent (range): 8 bit
  • Fraction (precision): 23 bit
  • Total: 32 bit

We can store values as small as \(10^{-45}\) and as large as \(10^{+38}\).
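Just as torch.iinfo reports integer ranges, torch.finfo reports the limits of a floating point type, so these bounds can be checked directly:

```python
import torch

# torch.finfo is the floating point counterpart of torch.iinfo
info = torch.finfo(torch.float32)
print(info.min)   # most negative representable value, about -3.4e+38
print(info.max)   # largest representable value, about 3.4e+38
print(info.tiny)  # smallest positive normal value, about 1.18e-38
```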

This data type is very important because most ML models store their weights/parameters in FP32.

tensor_fp32 = torch.tensor(value, dtype=torch.float32)
format(tensor_fp32.item(), '.60f')
Python
'0.333333343267440795898437500000000000000000000000000000000000'

FP16

  • Sign: 1 bit
  • Exponent (range): 5 bit
  • Fraction (precision): 10 bit
  • Total: 16 bit

We can store values as small as \(10^{-8}\) and as large as \(10^{+4}\).

tensor_fp16 = torch.tensor(value, dtype=torch.float16)
format(tensor_fp16.item(), '.60f')
Python
'0.333251953125000000000000000000000000000000000000000000000000'

BF16

It stands for 16-bit brain floating point.

  • Sign: 1 bit
  • Exponent (range): 8 bit (same as FP32)
  • Fraction (precision): 7 bit
  • Total: 16 bit

We can store values as small as \(10^{-41}\) and as large as \(10^{38}\). It increases the range of representable numbers; we can represent very large and very small values. The downside is that the precision is worse than FP16.

tensor_bf16 = torch.tensor(value, dtype=torch.bfloat16)
format(tensor_bf16.item(), '.60f')
Python
'0.333984375000000000000000000000000000000000000000000000000000'
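BF16 is essentially the top 16 bits of the FP32 bit pattern (1 sign + 8 exponent + 7 fraction bits), so the conversion can be sketched in plain Python: round to nearest, then drop the low 16 bits (the helper name `fp32_to_bf16` is illustrative):

```python
import struct

def fp32_to_bf16(x):
    """Simulate FP32 -> BF16: round to nearest (ties to even), keep top 16 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even increment
    bits &= 0xFFFF0000                   # drop the low 16 fraction bits
    (y,) = struct.unpack(">f", struct.pack(">I", bits))
    return y

print(format(fp32_to_bf16(1 / 3), '.12f'))  # 0.333984375000, matching torch.bfloat16
```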

Let’s compare all the floating point values together:

print(format(tensor_fp16.item(), '.60f'))
print(format(tensor_bf16.item(), '.60f'))
print(format(tensor_fp32.item(), '.60f'))
print(format(tensor_fp64.item(), '.60f'))
Python
0.333251953125000000000000000000000000000000000000000000000000
0.333984375000000000000000000000000000000000000000000000000000
0.333333343267440795898437500000000000000000000000000000000000
0.333333333333333314829616256247390992939472198486328125000000

Notice that the BF16 value is further from the true value of 1/3 than the FP16 value; BF16 trades precision for range.

Downcasting

Downcasting is the conversion of a higher data type (e.g., float) to a lower data type (e.g., integer). It usually results in a loss of information.

tensor_fp32 = torch.rand(1000, dtype=torch.float32)

tensor_fp32[:5]
Python
tensor([0.9981, 0.3978, 0.7168, 0.7782, 0.1776])

We use the .to(dtype=) method to downcast the data type.

tensor_fp32_to_bf16 = tensor_fp32.to(dtype=torch.bfloat16)

tensor_fp32_to_bf16[:5]
Python
tensor([1.0000, 0.3984, 0.7148, 0.7773, 0.1777], dtype=torch.bfloat16)

Impact on Tensor Multiplication

We can take the dot product of the FP32 tensor with itself.

m_float32 = torch.dot(tensor_fp32, tensor_fp32)

m_float32
Python
tensor(318.0981)

We will also take the dot product of the downcasted BF16 tensor with itself.

m_bf16 = torch.dot(tensor_fp32_to_bf16, tensor_fp32_to_bf16)

m_bf16
Python
tensor(318., dtype=torch.bfloat16)

We can see above that the BF16 result is very close to the FP32 result, but the decimal precision has been lost.

Advantages

  1. Less memory footprint
    • Efficient use of GPU memory
    • Enables the training of larger models
    • Enables using larger batch sizes
  2. Faster computation
    • Computation using low precision can be faster than full precision since it moves less data through memory.

Disadvantages

  1. Less precision
    • With fewer bits per value, the computations will be less precise.

Use-cases

  1. Mixed Precision Training
    • Do computations in smaller precision (FP16/BF16/FP8)
    • Store and update the weights in higher precision (FP32)
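A minimal sketch of that idea using torch.autocast on CPU (the toy model, data, and hyperparameters here are made up for illustration):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)                       # weights stored in FP32
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

for _ in range(3):
    opt.zero_grad()
    # forward pass (and loss) computed in lower precision...
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()  # ...while the update happens on the FP32 master weights

print(model.weight.dtype)  # torch.float32
```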

Sources