Quantization is one of the most common ways to make AI models more efficient, but it has limits, and the industry may be approaching them quickly.
In AI, quantization means reducing the number of bits needed to store information; bits are the smallest units of data a computer can process. Think of it this way: if someone asks you the time, you would probably say “noon” rather than “oh twelve hundred, one second, and four milliseconds.” That’s quantizing. Both answers are correct, but one is slightly more precise. How much precision you actually need depends on the context.
Several parts of an AI model can be quantized, in particular its parameters: the internal variables the model uses to make predictions or decisions. This is useful because models perform millions of calculations when they run. Quantized models with fewer bits per parameter are less demanding mathematically, and therefore computationally cheaper. (To be clear, this is a different process from “distilling,” which is a more involved, selective pruning of parameters.)
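To make the idea concrete, here is a minimal sketch, in Python with NumPy, of what quantizing a block of parameters down to 8 bits can look like. It is an illustration only, not how any lab or library actually does it: the array size, the single per-tensor scale factor, and the round-to-nearest mapping are all simplifying assumptions.

```python
# Illustrative per-tensor 8-bit quantization of "parameters" (toy example).
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.normal(size=1_000_000).astype(np.float32)  # stand-in parameters

# Map the float values onto 256 integer levels (int8) using one scale factor.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)

# Dequantize to see how much accuracy the rounding cost.
restored = weights_int8.astype(np.float32) * scale
max_error = np.abs(weights_fp32 - restored).max()

print(f"fp32 storage: {weights_fp32.nbytes / 1e6:.1f} MB")  # about 4.0 MB
print(f"int8 storage: {weights_int8.nbytes / 1e6:.1f} MB")  # about 1.0 MB
print(f"worst-case rounding error: {max_error:.5f}")
```

Dropping from 32-bit floats to 8-bit integers cuts storage for those parameters by roughly four times, at the cost of a small rounding error on each value.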
But quantization may have more trade-offs than previously assumed.
As The Model Shrinks
A study from researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon found that quantized models perform worse if the original, unquantized model was trained over a long period on a very large amount of data. In other words, at a certain point it may be better to train a smaller model outright than to cook down a bigger one.
That could spell bad news for AI companies that train extremely large models (which is known to improve answer quality) and then quantize them to make them cheaper to serve.
The effects are already visible. A few months ago, developers and academics reported that quantizing Meta’s Llama 3 model tended to be “more harmful” than quantizing other models, possibly because of how it was trained.
“Inference is and will continue to be the number one cost for everyone in AI,” Tanishq Kumar, the first author on the paper and a math student at Harvard, told Parhlo World. “Our work shows that one important way to reduce it will not work forever.”
Contrary to popular belief, AI model inference (running a model, as when ChatGPT answers a question) is often more expensive in aggregate than model training. For example, Google spent an estimated $191 million to train one of its flagship Gemini models, a very large sum. But if the company used a model to generate just 50-word answers to half of all Google Search queries, it would spend roughly $6 billion a year.
Major AI labs have embraced training models on very large datasets, on the theory that “scaling up,” using more data and more computing power in training, will produce increasingly capable AI.
Meta, for example, trained Llama 3 on a set of 15 trillion tokens. (Tokens are small chunks of raw data; 1 million tokens equals about 750,000 words.) The previous generation, Llama 2, was trained on “only” 2 trillion tokens.
Scaling up appears to have diminishing returns over time; Anthropic and Google reportedly trained enormous models recently that fell short of their own internal benchmark expectations. Even so, there is little sign the industry is ready to move meaningfully away from these entrenched scaling approaches.
How Precise, Exactly?
So if labs are reluctant to train models on smaller datasets, is there a way to make models less susceptible to degradation? Possibly. Kumar and his co-authors found that training models in “low precision” can make them more robust. Bear with us for a moment while we dig in.
Here, “precision” refers to the number of digits a numerical data type can represent accurately. Data types are collections of data values, usually defined by a set of possible values and the operations allowed on them; the data type FP8, for example, uses only 8 bits to represent a floating-point number.
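As a rough illustration of what lower precision costs in representable digits, the snippet below shows how the same value survives in 64-, 32-, and 16-bit floats. NumPy does not expose an FP8 type, so half precision stands in for the low end here, and the printed digits are approximate.

```python
# How many digits survive as the data type gets narrower (toy illustration).
import numpy as np

x = 3.14159265358979  # a value with many significant digits

print(np.float64(x))  # 3.14159265358979  (double precision)
print(np.float32(x))  # 3.1415927         (single precision, ~7 decimal digits)
print(np.float16(x))  # 3.14              (half precision, ~3 decimal digits)
```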
Most models today are trained at 16-bit or “half” precision and then “post-train quantized” to 8-bit precision. Certain model components (such as its parameters) are converted to a lower-precision format at the cost of some accuracy. Think of it like doing the math to several decimal places and then rounding off to the nearest tenth, often getting the best of both worlds.
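A toy sketch of that trade-off follows: it runs the same matrix-vector product with full-precision weights and with the same weights post-train quantized to 8 bits, then compares the outputs. The layer size, random weights, and single scale factor are hypothetical simplifications; real post-training quantization pipelines are considerably more involved.

```python
# Compare a layer's output before and after 8-bit post-training quantization
# of its weights (toy example, not a real model or pipeline).
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in layer weights
x = rng.normal(size=256).astype(np.float32)         # stand-in input activations

# Post-train quantize the weights to int8, then dequantize for the matmul.
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8).astype(np.float32) * scale

y_full = W @ x
y_quant = W_q @ x

rel_error = np.linalg.norm(y_full - y_quant) / np.linalg.norm(y_full)
print(f"relative output error after 8-bit quantization: {rel_error:.4%}")
```

On a toy layer like this the relative error is typically a small fraction of a percent, which is the kind of modest loss the rounding analogy describes; the study’s finding is that this gap can grow for models trained longer on more data.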
Hardware vendors like Nvidia are pushing for even lower precision in quantized model inference. Nvidia’s new Blackwell chip supports 4-bit precision, specifically a data type called FP4; the company pitches this as a boon for memory- and power-constrained data centers.
But extremely low quantization precision may not be desirable. According to Kumar, unless the original model has an enormous parameter count, precisions below 7 or 8 bits may bring a noticeable drop in quality.
If this all seems a bit technical, don’t worry, it is. The main point is that AI models are not fully understood, and known shortcuts that work in many kinds of computation don’t work here. You wouldn’t say “noon” if someone asked when you started a 100-meter dash, would you? It’s not quite that clear-cut, of course, but the idea is the same:
“The key point of our work is that there are limitations you cannot naively get around,” Kumar said. “We hope our work adds nuance to a discussion that often seeks ever lower precision defaults for training and inference.”
Kumar acknowledges that his and his colleagues’ study was relatively small in scale; they plan to test it with more models in the future. But he believes at least one insight will hold: there is no free lunch when it comes to reducing inference costs.
“Bit precision matters, and it isn’t free,” he said. “You can’t reduce it forever without models suffering. Models have finite capacity, so rather than trying to fit a quadrillion tokens into a small model, in my opinion much more effort will go into meticulous data curation and filtering, so that only the highest-quality data is put into smaller models. I’m optimistic that new architectures that deliberately aim to make low-precision training stable will be important in the future.”