Data Types for ML

Published

2026-03-30

Modified

2026-04-01

Generated using ChatGPT

We will be using PyTorch to discuss data types. Make sure it is installed and imported before following along with the code.

import torch
Python

Integer

Unsigned Integer

An unsigned integer data type is used to represent a positive integer value.

The range for an n-bit unsigned integer is \([0, 2^n - 1]\).

For example, the minimum value of an 8-bit unsigned integer is 0 and the maximum value is \((2^8 - 1) = (256 - 1) = 255\).
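The range formula can be verified in plain Python (the helper name `uint_range` is just for illustration):

```python
def uint_range(n_bits):
    """Range of an n-bit unsigned integer: [0, 2^n - 1]."""
    return 0, 2 ** n_bits - 1

print(uint_range(8))   # (0, 255)
print(uint_range(16))  # (0, 65535)
```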

The computer allocates a sequence of 8 bits to store an 8-bit integer.

uint8

torch.uint8 can be used to define an 8-bit unsigned integer.

torch.iinfo(torch.uint8)
Python
iinfo(min=0, max=255, dtype=uint8)

Signed Integer

A signed integer data type is used to represent a negative or positive integer.

We consider the two’s complement representation for signed integers. The range for an n-bit signed integer is \([-2^{n-1},\, 2^{n-1} - 1]\).

For example, the minimum value of an 8-bit signed integer is \(-2^{7} = -128\) and the maximum value is \(2^{7} - 1 = 127\).
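A quick sketch in plain Python of the signed range formula, and of how two’s complement maps a negative value onto an unsigned bit pattern (the helper names are illustrative):

```python
def int_range(n_bits):
    """Two's complement range of an n-bit signed integer."""
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

print(int_range(8))  # (-128, 127)

def to_twos_complement(value, n_bits=8):
    """Bit pattern (as an unsigned int) that stores `value` in two's complement."""
    return (value + 2 ** n_bits) % (2 ** n_bits)

print(to_twos_complement(-128))     # 128 -> stored as 0b10000000
print(bin(to_twos_complement(-1)))  # 0b11111111
```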

int8

torch.int8 can be used to define an 8-bit signed integer.

torch.iinfo(torch.int8)
Python
iinfo(min=-128, max=127, dtype=int8)

int16 / short

torch.int16 can be used to define a 16-bit signed integer. There is also an alias for this data type: torch.short.

torch.iinfo(torch.int16)
Python
iinfo(min=-32768, max=32767, dtype=int16)
torch.iinfo(torch.short)
Python
iinfo(min=-32768, max=32767, dtype=int16)

int32 / int

torch.int32 can be used to define a 32-bit signed integer. There is also an alias for this data type: torch.int.

torch.iinfo(torch.int32)
Python
iinfo(min=-2.14748e+09, max=2.14748e+09, dtype=int32)
torch.iinfo(torch.int)
Python
iinfo(min=-2.14748e+09, max=2.14748e+09, dtype=int32)

int64 / long

torch.int64 can be used to define a 64-bit signed integer. There is also an alias for this data type: torch.long.

torch.iinfo(torch.int64)
Python
iinfo(min=-9.22337e+18, max=9.22337e+18, dtype=int64)
torch.iinfo(torch.long)
Python
iinfo(min=-9.22337e+18, max=9.22337e+18, dtype=int64)

Floating Point

There are three major components in a floating point representation.

  1. Sign: Only 1 bit is needed to define if the number is positive or negative.
  2. Exponent (range): It represents the range of the number; how big in magnitude a number can be in both the positive and negative direction.
  3. Fraction (precision): It determines the precision of the number; how many significant digits can be represented.
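To make the three components concrete, here is a sketch that extracts the sign, exponent, and fraction bits of an FP32 value using Python’s struct module (the helper name `float32_bits` is illustrative):

```python
import struct

def float32_bits(x):
    """Split an FP32 value into its sign, exponent, and fraction bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = float32_bits(-6.5)
print(sign, exponent, fraction)  # 1 129 5242880

# Reconstruct the value: (-1)^sign * (1 + fraction/2^23) * 2^(exponent - 127)
print((-1) ** sign * (1 + fraction / 2 ** 23) * 2 ** (exponent - 127))  # -6.5
```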

FP64

Python stores floating point values in FP64 (double precision).

value = 1 / 3
format(value, '.60f')
Python
'0.333333333333333314829616256247390992939472198486328125000000'

Now, let’s create an FP64 tensor.

tensor_fp64 = torch.tensor(value, dtype=torch.float64)
format(tensor_fp64.item(), '.60f')
Python
'0.333333333333333314829616256247390992939472198486328125000000'

The Python floating point value and the tensor floating point value are the same.

FP32

  • Sign: 1 bit
  • Exponent (range): 8 bit
  • Fraction (precision): 23 bit
  • Total: 32 bit

We can store values as small as \(10^{-45}\) and as large as \(10^{+38}\).
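Just as torch.iinfo reports integer ranges, torch.finfo reports the limits of a floating point type, so these bounds can be checked directly:

```python
import torch

# torch.finfo is the floating point counterpart of torch.iinfo
info = torch.finfo(torch.float32)
print(info.min)   # most negative representable value, about -3.4e+38
print(info.max)   # largest representable value, about 3.4e+38
print(info.tiny)  # smallest positive normal value, about 1.18e-38
```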

This data type is very important because most ML models store their weights/parameters in FP32.

tensor_fp32 = torch.tensor(value, dtype=torch.float32)
format(tensor_fp32.item(), '.60f')
Python
'0.333333343267440795898437500000000000000000000000000000000000'

FP16

  • Sign: 1 bit
  • Exponent (range): 5 bit
  • Fraction (precision): 10 bit
  • Total: 16 bit

We can store values as small as \(10^{-8}\) and as large as \(10^{+4}\).

tensor_fp16 = torch.tensor(value, dtype=torch.float16)
format(tensor_fp16.item(), '.60f')
Python
'0.333251953125000000000000000000000000000000000000000000000000'

BF16

It stands for 16-bit brain floating point.

  • Sign: 1 bit
  • Exponent (range): 8 bit (same as FP32)
  • Fraction (precision): 7 bit
  • Total: 16 bit

We can store values as small as \(10^{-41}\) and as large as \(10^{38}\). It increases the range of representable numbers; we can represent very large and very small values. The downside is that the precision is worse than FP16.

tensor_bf16 = torch.tensor(value, dtype=torch.bfloat16)
format(tensor_bf16.item(), '.60f')
Python
'0.333984375000000000000000000000000000000000000000000000000000'
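BF16 is essentially the top 16 bits of the FP32 bit pattern (1 sign + 8 exponent + 7 fraction bits), so the conversion can be sketched in plain Python: round to nearest, then drop the low 16 bits (the helper name `fp32_to_bf16` is illustrative):

```python
import struct

def fp32_to_bf16(x):
    """Simulate FP32 -> BF16: round to nearest (ties to even), keep top 16 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even increment
    bits &= 0xFFFF0000                   # drop the low 16 fraction bits
    (y,) = struct.unpack(">f", struct.pack(">I", bits))
    return y

print(format(fp32_to_bf16(1 / 3), '.12f'))  # 0.333984375000, matching torch.bfloat16
```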

Let’s compare all the floating point values together:

print(format(tensor_fp16.item(), '.60f'))
print(format(tensor_bf16.item(), '.60f'))
print(format(tensor_fp32.item(), '.60f'))
print(format(tensor_fp64.item(), '.60f'))
Python
0.333251953125000000000000000000000000000000000000000000000000
0.333984375000000000000000000000000000000000000000000000000000
0.333333343267440795898437500000000000000000000000000000000000
0.333333333333333314829616256247390992939472198486328125000000

Notice that the BF16 value is further from the true value of 1/3 than the FP16 value; BF16 trades precision for range.

Downcasting

Downcasting is the conversion of a higher data type (e.g., float) to a lower data type (e.g., integer). It usually results in a loss of information.

tensor_fp32 = torch.rand(1000, dtype=torch.float32)

tensor_fp32[:5]
Python
tensor([0.9981, 0.3978, 0.7168, 0.7782, 0.1776])

We use the .to(dtype=) method to downcast the data type.

tensor_fp32_to_bf16 = tensor_fp32.to(dtype=torch.bfloat16)

tensor_fp32_to_bf16[:5]
Python
tensor([1.0000, 0.3984, 0.7148, 0.7773, 0.1777], dtype=torch.bfloat16)

Impact on Tensor Multiplication

We can take the dot product of the FP32 tensor with itself.

m_float32 = torch.dot(tensor_fp32, tensor_fp32)

m_float32
Python
tensor(318.0981)

We will also take the dot product of the downcasted BF16 tensor with itself.

m_bf16 = torch.dot(tensor_fp32_to_bf16, tensor_fp32_to_bf16)

m_bf16
Python
tensor(318., dtype=torch.bfloat16)

We can see above that the BF16 result is very close to the FP32 result, but the decimal precision has been lost.

Advantages

  1. Less memory footprint
    • Efficient use of GPU memory
    • Enables the training of larger models
    • Enables using larger batch sizes
  2. Faster computation
    • Computation using low precision can be faster than full precision since it moves less data through memory.

Disadvantages

  1. Less precision
    • With fewer bits per value, the computations will be less precise.

Use-cases

  1. Mixed Precision Training
    • Do computations in smaller precision (FP16/BF16/FP8)
    • Store and update the weights in higher precision (FP32)
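A minimal sketch of that idea using torch.autocast on CPU (the toy model, data, and hyperparameters here are made up for illustration):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)                       # weights stored in FP32
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

for _ in range(3):
    opt.zero_grad()
    # forward pass (and loss) computed in lower precision...
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()  # ...while the update happens on the FP32 master weights

print(model.weight.dtype)  # torch.float32
```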

Sources