Continuation of the course... This lesson focuses largely on PyTorch.
Tensor Basics & Memory
The lecture presents tensors as the core building blocks for parameters, gradients, and optimizer states. It then discusses floating-point representations, including FP32 (full precision), BF16 (brain float, often preferred for deep learning), and the move toward FP8 for efficiency.
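For a sense of scale, here is a rough back-of-the-envelope memory count (my own illustration, not the lecture's numbers), assuming FP32 storage and an Adam-style optimizer that keeps two state tensors per parameter:

```python
# Rough memory estimate for training state, assuming FP32 everywhere
# and an Adam-style optimizer (two extra state tensors per parameter).
# The parameter count is a made-up example.
num_params = 1_000_000_000          # hypothetical 1B-parameter model
bytes_per_value = 4                 # FP32 = 4 bytes

params = num_params * bytes_per_value               # model weights
grads = num_params * bytes_per_value                # gradients
optimizer_state = 2 * num_params * bytes_per_value  # first and second moments

total_gb = (params + grads + optimizer_state) / 1e9
print(f"~{total_gb:.0f} GB just for parameters, gradients, and optimizer states")
```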
Float Data Types
Several float types are discussed, such as float32, float16, bfloat16, fp8, etc. Training entirely in float32 requires a lot of memory, while bfloat16/fp8 save memory at the cost of some numerical stability. Some people also mix the approaches, for example using float32 in the attention calculation and float16 in the feed-forward layers. Generally, float32 (also referred to as single precision or full precision) is used for storing parameters and optimizer states during training to ensure numerical stability and prevent training from becoming unstable.
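As a quick sanity check of these sizes, here is a small PyTorch snippet (my own illustration, not the lecture's code) comparing the per-element storage cost of the common float dtypes:

```python
import torch

# Per-element storage cost of the float types mentioned above.
# FP8 dtypes are omitted because their availability depends on the PyTorch version.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    x = torch.zeros(1024, 1024, dtype=dtype)
    mb = x.element_size() * x.nelement() / 1e6
    print(dtype, x.element_size(), "bytes/element,", mb, "MB per 1024x1024 tensor")
```

In practice, mixed precision is typically applied with torch.autocast, which runs selected operations in a lower-precision dtype while the parameters themselves stay in FP32.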
Tensor Operations & Einstein Summation
He introduces einops as a more readable and robust alternative to manipulating dimensions by numeric index (e.g., -1, -2), helping developers manage dimensions without confusion. You can think of it as tagging each tensor dimension with a name. For example, in z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2") the output dimensions are named batch seq1 seq2, and the shared hidden dimension is summed out because it does not appear on the right-hand side.
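A small runnable sketch of that pattern (the shapes are made up for illustration):

```python
import torch
from einops import einsum, rearrange

batch, seq, hidden = 2, 5, 8
x = torch.randn(batch, seq, hidden)
y = torch.randn(batch, seq, hidden)

# Pairwise dot products between positions, written with named dimensions.
# "hidden" is summed out because it does not appear in the output pattern.
z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2")
print(z.shape)  # torch.Size([2, 5, 5])

# The same naming style works for reshaping, e.g. splitting hidden into heads.
x_heads = rearrange(x, "batch seq (heads d) -> batch heads seq d", heads=2)
print(x_heads.shape)  # torch.Size([2, 2, 5, 4])
```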
Compute Accounting (FLOPs)
A deep dive into calculating the total number of floating-point operations. The instructor establishes the rule of thumb that training requires approximately 6 × (number of parameters) × (number of tokens) FLOPs, derived from roughly 2 × parameters × tokens for the forward pass and 4 × parameters × tokens for the backward pass.
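Plugging some made-up numbers into the rule of thumb:

```python
# Back-of-the-envelope training FLOPs using the 6 * params * tokens rule.
# Both numbers below are hypothetical, chosen only for illustration.
num_params = 7e9        # 7B-parameter model
num_tokens = 1e12       # 1T training tokens

forward_flops = 2 * num_params * num_tokens    # ~2 FLOPs per parameter per token
backward_flops = 4 * num_params * num_tokens   # backward is roughly twice the forward
total_flops = forward_flops + backward_flops   # = 6 * params * tokens

print(f"Total training compute: {total_flops:.2e} FLOPs")  # ~4.20e+22
```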
Note: If you forgot what the forward pass and backpropagation are, here is a video walking through the math behind training a simple neural network: The Math behind Neural Networks
Model Building & Optimization
He demonstrates building a simple linear model, implementing custom optimizers like AdaGrad to understand how state persists across steps, and the importance of proper initialization (e.g., Xavier initialization) to maintain numerical stability in deep networks.
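A minimal sketch of what such a custom optimizer can look like, assuming the standard AdaGrad update (accumulate squared gradients, then scale the step). This is my own illustration rather than the lecture's exact code:

```python
import torch

# A minimal AdaGrad-style optimizer, written to show how optimizer state
# (here the running sum of squared gradients) persists across steps.
class AdaGrad(torch.optim.Optimizer):
    def __init__(self, params, lr=0.01, eps=1e-10):
        super().__init__(params, dict(lr=lr, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "sum_sq" not in state:
                    state["sum_sq"] = torch.zeros_like(p)
                state["sum_sq"] += p.grad ** 2
                p -= group["lr"] * p.grad / (state["sum_sq"].sqrt() + group["eps"])

# Usage with a simple linear model and Xavier-initialized weights.
model = torch.nn.Linear(16, 4)
torch.nn.init.xavier_uniform_(model.weight)
opt = AdaGrad(model.parameters(), lr=0.1)

x, y = torch.randn(8, 16), torch.randn(8, 4)
loss = ((model(x) - y) ** 2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```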
Training Infrastructure
There is practical advice on data loading with memmap to handle massive datasets (only the parts of the data you actually index are loaded into memory), the importance of checkpointing to prevent progress loss (similar in spirit to how batch and streaming processing persist intermediate state), and the synergy between hardware constraints and model architecture.
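A sketch of both ideas; the file name, dtype, and shapes are assumptions for illustration, not taken from the lecture:

```python
import numpy as np
import torch

# Memory-mapped token data: the file stays on disk, and only the slices
# we index are paged into memory.
tokens = np.memmap("train_tokens.bin", dtype=np.uint16, mode="r")

def get_batch(batch_size, seq_len):
    # Sample random windows; copy each slice into a regular array before
    # converting it to a tensor.
    starts = np.random.randint(0, len(tokens) - seq_len - 1, size=batch_size)
    batch = np.stack([np.array(tokens[s:s + seq_len], dtype=np.int64) for s in starts])
    return torch.from_numpy(batch)

# Checkpointing: save model and optimizer state so training can resume after a crash.
def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```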