DeepSeek training cost mystery, solved (low-precision training)




In this post, I’ll break down how DeepSeek used a clever approach to cut the cost of training their V3 model, a move that shook the tech stock market over the past two days. The trick? They selectively used 8-bit floating point (FP8) precision for parts of the training process instead of the usual 32-bit floating point (FP32). Because an FP8 value takes a quarter of the space of an FP32 value, four FP8 values fit where a single FP32 value used to, cutting memory usage and data transfer by a factor of four. They didn’t use FP8 everywhere, though: FP32 was still used where high precision was critical, and FP8 was applied where lower precision was enough.
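
To make the four-to-one ratio concrete, here is a minimal PyTorch sketch (my illustration, not DeepSeek’s code) comparing the storage footprint of the same weight matrix in FP32 and FP8; it assumes a recent PyTorch build that exposes the torch.float8_e4m3fn dtype.

```python
import torch

# The same 1024 x 1024 weight matrix stored at two precisions.
w_fp32 = torch.randn(1024, 1024, dtype=torch.float32)
w_fp8 = w_fp32.to(torch.float8_e4m3fn)   # lossy cast down to 8-bit floating point

print(w_fp32.element_size())  # 4 bytes per value
print(w_fp8.element_size())   # 1 byte per value: 4x less memory and bandwidth
```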


The idea comes straight from DeepSeek’s technical paper, and I’ll keep things simple—no complicated math, just a clear explanation of how they trained their V3 model much more efficiently than others. You can check out the full paper [here](link to the document).


Training a large language model (LLM) is incredibly complex and resource-intensive. It involves tokenization, attention score calculations, matrix multiplications, backpropagation, and optimization, just to name a few steps. These calculations are processed in batches across many GPUs, which traditionally run at FP32 precision for maximum accuracy. Precise arithmetic keeps training numerically stable, which matters most when you are breaking new ground or spending a large budget on a single run.
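
For orientation, here is what one conventional full-precision training step looks like in PyTorch. This is a toy stand-in (a small MLP with made-up shapes), not V3’s actual training loop, but it shows the forward pass, backpropagation, and optimizer update mentioned above, all carried out in FP32.

```python
import torch
import torch.nn as nn

# A tiny MLP standing in for one transformer block; everything stays in FP32.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 512)       # one batch of activations (hypothetical shapes)
target = torch.randn(32, 512)

out = model(x)                              # forward pass: large matrix multiplications
loss = nn.functional.mse_loss(out, target)  # loss computation
loss.backward()                             # backpropagation: gradients in FP32
optimizer.step()                            # optimizer update: FP32 weights and state
optimizer.zero_grad()
```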


However, GPUs can also run in FP8 mode, where each register and memory transfer carries four times as many values as in FP32. That means faster computations, lower memory requirements, and less data movement. The tradeoff is lower precision, but DeepSeek identified the parts of the training process where FP8 was sufficient and moved them to the faster format, significantly speeding things up. This technique, known as "low-precision training," allowed them to complete certain operations much faster while keeping overall accuracy intact. It’s a brilliant optimization that’s as game-changing as it is cost-effective.
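
To see why lower precision can be "enough" for some operations, here is a small sketch that emulates storing a tensor in FP8 with a single per-tensor scale and reading it back in FP32. This is only an illustration of the idea; DeepSeek’s paper describes finer-grained scaling and real FP8 matrix-multiplication kernels on the GPU.

```python
import torch

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Store a tensor as FP8 (e4m3) and read it back in FP32, using one scale per tensor."""
    scale = x.abs().max().clamp(min=1e-12) / 448.0   # 448 is the largest normal e4m3 value
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)      # 1 byte per value in memory
    return x_fp8.to(torch.float32) * scale           # back to FP32 where precision matters

x = torch.randn(4096)
err = (x - fp8_roundtrip(x)).abs().mean().item()
print(f"mean |error|: {err:.4f} vs mean |value|: {x.abs().mean().item():.4f}")
```

The round-trip error is small relative to typical value magnitudes, which is the intuition behind running the bulky matrix multiplications in FP8 while keeping the precision-sensitive steps in FP32, exactly the split described above.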


No mystery, just good engineering.
