
We will be using PyTorch to discuss data types. Make sure it is installed and imported when following the code examples.
Integer
Unsigned Integer
An unsigned integer data type is used to represent a positive integer value.
The range for an n-bit unsigned integer is \([0, 2^n - 1]\).
For example, the minimum value of an 8-bit unsigned integer is 0 and the maximum value is \((2^8 - 1) = (256 - 1) = 255\).
The computer allocates a sequence of 8 bits to store the 8-bit integer.
uint8
torch.uint8 can be used to define an 8-bit unsigned integer.
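A minimal sketch (assuming torch is installed and imported) showing a uint8 tensor and its modulo-\(2^8\) wraparound at the top of the range:

```python
import torch

# An 8-bit unsigned integer tensor; valid values lie in [0, 255].
t = torch.tensor([0, 127, 255], dtype=torch.uint8)
print(t)
print(t.dtype)  # torch.uint8

# Overflow wraps around modulo 2^8: 255 + 1 -> 0.
print(t + 1)    # tensor([  1, 128,   0], dtype=torch.uint8)
```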
Signed Integer
A signed integer data type is used to represent a negative or positive integer value.
We are considering the two’s complement representation for signed integers. The range for an n-bit signed integer is \([-2^{n-1},\, 2^{n-1} - 1]\).
For example, the minimum value of an 8-bit signed integer is \(-2^{7} = -128\) and the maximum value is \(2^{7} - 1 = 127\).
int8
torch.int8 can be used to define an 8-bit signed integer.
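A minimal sketch (assuming torch is imported) that builds an int8 tensor and confirms the \([-128, 127]\) range via torch.iinfo:

```python
import torch

# An 8-bit signed integer tensor; the valid range is [-128, 127].
t = torch.tensor([-128, 0, 127], dtype=torch.int8)
print(t, t.dtype)

# The range can also be queried programmatically with torch.iinfo.
info = torch.iinfo(torch.int8)
print(info.min, info.max)  # -128 127
```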
int16 / short
torch.int16 can be used to define a 16-bit signed integer. There is also an alias for this data type: torch.short.
int32 / int
torch.int32 can be used to define a 32-bit signed integer. There is also an alias for this data type: torch.int.
int64 / long
torch.int64 can be used to define a 64-bit signed integer. There is also an alias for this data type: torch.long.
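A quick sketch (assuming torch is imported) verifying that the aliases really are the same dtype objects, and that int64 is PyTorch's default integer dtype:

```python
import torch

# Each alias refers to the exact same dtype object.
print(torch.int16 is torch.short)  # True
print(torch.int32 is torch.int)    # True
print(torch.int64 is torch.long)   # True

# Integer literals default to int64.
print(torch.tensor([1, 2, 3]).dtype)  # torch.int64
```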
Floating Point
There are 3 major components in floating point representation.
- Sign: Only 1 bit is needed to define if the number is positive or negative.
- Exponent (range): It represents the range of the number; how big in magnitude a number can be in both the positive and negative direction.
- Fraction (precision): It determines the precision of the number; how many decimal places can be used.
FP64
Python stores floating point values in FP64.
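The value below can be reproduced by formatting 1/3 (a Python float, i.e. FP64) to 60 decimal places; a minimal sketch:

```python
# Python floats are IEEE-754 double precision (FP64).
value = 1 / 3
print(format(value, '.60f'))
```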
'0.333333333333333314829616256247390992939472198486328125000000'
Now, let’s create an FP64 tensor.
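A minimal sketch (assuming torch is imported) that stores 1/3 in a float64 tensor and prints it to 60 decimal places:

```python
import torch

# A 0-dim FP64 tensor holding 1/3.
t = torch.tensor(1 / 3, dtype=torch.float64)

# .item() converts the tensor back to a Python float for printing.
print(format(t.item(), '.60f'))
```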
'0.333333333333333314829616256247390992939472198486328125000000'
The Python floating point value and the tensor floating point value are the same.
FP32
- Sign: 1 bit
- Exponent (range): 8 bit
- Fraction (precision): 23 bit
- Total: 32 bit
We can store values as small as \(10^{-45}\) and as large as \(10^{+38}\).
This data type is very important because most ML models store their weights / params in FP32.
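A minimal sketch (assuming torch is imported); note that float32 is PyTorch's default floating point dtype, and torch.finfo reports its limits:

```python
import torch

# torch.float32 is PyTorch's default floating point dtype.
t = torch.tensor(1 / 3)
print(t.dtype)  # torch.float32

# Range and precision reported by torch.finfo.
info = torch.finfo(torch.float32)
print(info.max)  # largest finite FP32 value, about 3.4e38
print(info.eps)  # machine epsilon, about 1.19e-07
```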
FP16
- Sign: 1 bit
- Exponent (range): 5 bit
- Fraction (precision): 10 bit
- Total: 16 bit
We can store values as small as \(10^{-8}\) and as large as \(10^{+4}\).
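A minimal sketch (assuming torch is imported) showing how 1/3 rounds in FP16, and the FP16 maximum (65504, on the order of \(10^{+4}\)):

```python
import torch

# 1/3 stored in half precision loses most decimal digits.
t = torch.tensor(1 / 3, dtype=torch.float16)
print(format(t.item(), '.20f'))  # 0.33325195312500000000

# The largest representable FP16 value.
info = torch.finfo(torch.float16)
print(info.max)  # 65504.0
```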
BF16
It stands for 16-bit brain floating point.
- Sign: 1 bit
- Exponent (range): 8 bit (same as FP32)
- Fraction (precision): 7 bit
- Total: 16 bit
We can store values as small as \(10^{-41}\) and as large as \(10^{38}\). BF16 keeps the same exponent width as FP32, so we can represent very large and very small values; the downside is that its precision is worse than FP16's.
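The value below can be reproduced by storing 1/3 in a BF16 tensor (assuming torch is imported):

```python
import torch

# 1/3 in bfloat16: same exponent width as FP32, but only
# 7 fraction bits, so precision drops sharply.
t = torch.tensor(1 / 3, dtype=torch.bfloat16)
print(format(t.item(), '.60f'))
```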
'0.333984375000000000000000000000000000000000000000000000000000'
Let’s compare all the floating point values together:
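A sketch (assuming torch is imported) that prints 1/3 stored at each precision, in the order FP16, BF16, FP32, FP64 — producing the four values below:

```python
import torch

one_third = 1 / 3
# Store 1/3 at each precision and print it to 60 decimal places.
for dtype in (torch.float16, torch.bfloat16, torch.float32, torch.float64):
    value = torch.tensor(one_third, dtype=dtype).item()
    print(format(value, '.60f'))
```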
0.333251953125000000000000000000000000000000000000000000000000
0.333984375000000000000000000000000000000000000000000000000000
0.333333343267440795898437500000000000000000000000000000000000
0.333333333333333314829616256247390992939472198486328125000000
See how BF16 has less precision than FP16.
Downcasting
Downcasting is the conversion of a higher data type (e.g. float) to a lower data type (e.g. integer). It usually results in a loss of data.
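A sketch (assuming torch is imported) that creates a random FP32 tensor to downcast; the values are random, so the printed numbers will differ from the ones below on each run:

```python
import torch

# A random FP32 tensor (float32 is the default dtype).
t = torch.rand(5)
print(t)
print(t.dtype)  # torch.float32
```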
tensor([0.9981, 0.3978, 0.7168, 0.7782, 0.1776])
We use the .to(dtype=) method to downcast the data type.
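A minimal sketch of the downcast (assuming torch is imported; the random values will differ from the output below):

```python
import torch

t = torch.rand(5)  # FP32 by default

# Downcast to bfloat16; each value is rounded to 7 fraction bits.
t_bf16 = t.to(dtype=torch.bfloat16)
print(t_bf16)
print(t_bf16.dtype)  # torch.bfloat16
```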
tensor([1.0000, 0.3984, 0.7148, 0.7773, 0.1777], dtype=torch.bfloat16)
Impact on Tensor Multiplication
We can multiply FP32 tensor with itself to see the dot product.
We will also multiply the downcasted BF16 tensor with itself to see the dot product.
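A sketch of both dot products, assuming a random vector of 1000 elements (the exact numbers vary per run, so your result will differ from the one below):

```python
import torch

# Hypothetical setup: a random FP32 vector and its BF16 downcast.
t_fp32 = torch.rand(1000)
t_bf16 = t_fp32.to(dtype=torch.bfloat16)

# Dot product: elementwise multiply, then sum.
dot_fp32 = (t_fp32 * t_fp32).sum()
dot_bf16 = (t_bf16 * t_bf16).sum()
print(dot_fp32)  # around 333 in expectation for uniform [0, 1)
print(dot_bf16)  # close to the FP32 result, but with rounding error
```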
tensor(318., dtype=torch.bfloat16)
We can see above that the result of the BF16 product is very close to the FP32 result, but there is a visible loss of precision in the decimal places.
Advantages
- Less memory footprint
- Efficient use of GPU memory
- Enables the training of larger models
- Enables using the larger batch sizes
- Faster computation
- Computation using low precision can be faster than full precision since it demands less memory.
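The memory saving per element can be checked directly with element_size() (assuming torch is imported):

```python
import torch

# Per-element storage cost of common dtypes, in bytes.
for dtype in (torch.float32, torch.bfloat16, torch.float16, torch.int8):
    print(dtype, torch.tensor([], dtype=dtype).element_size())
```

Halving the bytes per parameter (FP32 to BF16/FP16) halves the memory footprint of the weights, which is what enables larger models and batch sizes on the same GPU.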
Disadvantages
- Less precision
- Since we are using less memory, the computations will be less precise.
Use-cases
- Mixed Precision Training
- Do computations in smaller precision (FP16/BF16/FP8)
- Store and update the weights in higher precision (FP32)
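A minimal sketch of this pattern on CPU using torch.autocast (the model and shapes here are hypothetical): the weights stay in FP32, while eligible ops inside the autocast context run in BF16.

```python
import torch

# Weights are stored in FP32, as usual.
model = torch.nn.Linear(16, 4)
x = torch.rand(8, 16)

# Inside autocast, eligible ops (e.g. linear) run in BF16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(model.weight.dtype)  # torch.float32
print(out.dtype)           # torch.bfloat16
```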