Apple has published a technical paper detailing the models it developed to power Apple Intelligence, the suite of generative AI features coming to iOS, macOS, and iPadOS in the coming months.
In the paper, Apple pushes back on accusations that it trained some of its models unethically, reiterating that it did not use private user data and instead drew on a mix of publicly available and licensed data for Apple Intelligence.
Apple writes that its pre-training dataset is made up of “…data we have licensed from publishers, curated publicly available or open-sourced datasets, and publicly available information crawled by our web crawler, Applebot,” adding that, in the interest of protecting user privacy, no private Apple user data is included in the mix.
In July, Proof News reported that Apple trained a family of models designed for on-device processing on subtitles from more than 100,000 YouTube videos, drawn from a dataset called “The Pile.” Many of the YouTubers whose subtitles were swept up in “The Pile” weren’t aware of this and didn’t consent to it. Apple later said in a statement that it didn’t intend to use those models to power any AI features in its products.
The technical paper sheds more light on the Apple Foundation Models (AFM), which Apple first showed off at WWDC 2024 in June, stressing that the training data for the AFM models was sourced in a “responsible” way, responsible at least by Apple’s definition.
The AFM models’ training data includes publicly available web data as well as data licensed from undisclosed publishers. According to The New York Times, Apple approached several publishers toward the end of 2023, including NBC, Condé Nast, and IAC, about multi-year deals worth at least $50 million to train models on their news archives. Apple’s AFM models were also trained on open source code hosted on GitHub, specifically Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go code.
Training models on code without permission, even openly licensed code, is a point of contention among developers. Some argue that certain open source codebases are unlicensed for AI training, or carry terms of use that forbid it. Apple, however, says it “license-filtered” the code to try to include only repositories with minimal usage restrictions, such as those under an MIT, ISC, or Apache license.
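As a rough sketch of what such license filtering might look like in practice — note that the repository records, field names, and allowlist below are illustrative, not Apple’s actual pipeline:

```python
# Hypothetical "license filtering" sketch: keep only repositories whose
# detected license appears on a permissive allowlist (MIT, ISC, Apache-2.0).
# The data shape here is an assumption for illustration.
PERMISSIVE_LICENSES = {"mit", "isc", "apache-2.0"}

def license_filter(repos):
    """Return only the repos whose license key is on the permissive allowlist."""
    return [r for r in repos if r.get("license", "").lower() in PERMISSIVE_LICENSES]

repos = [
    {"name": "swift-utils", "license": "MIT"},
    {"name": "gpl-tool", "license": "GPL-3.0"},
    {"name": "webkit-helper", "license": "Apache-2.0"},
]
kept = license_filter(repos)
print([r["name"] for r in kept])
```

A real pipeline would also have to detect licenses reliably in the first place, which is itself error-prone, one reason some developers remain skeptical of the approach.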
To boost the AFM models’ math skills, the paper says, Apple specifically included math questions and answers from web pages, math forums, blogs, tutorials, and seminars in the training set. The company also tapped “high-quality, publicly-available” datasets (which the paper doesn’t name) with “licenses that permit use for training […] models,” filtered to remove sensitive information.
All told, the AFM models were trained on a dataset of about 6.3 trillion tokens. (Tokens are bite-sized chunks of data that are easier for generative AI models to ingest.) For comparison, that’s less than half the 15 trillion tokens Meta used to train its flagship text-generating model, Llama 3.1 405B.
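To make the idea of a token concrete, here is a toy word-and-punctuation splitter. Production model tokenizers (such as BPE-based ones) split text into subword units instead, so real token counts differ; this only illustrates that text becomes a sequence of small units before a model sees it:

```python
import re

def toy_tokenize(text):
    # Crude illustration only: split into runs of word characters
    # or single punctuation marks. Not how AFM or Llama tokenize.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Apple trained its AFM models on 6.3 trillion tokens.")
print(tokens)
print(len(tokens))
```

Under this toy scheme even “6.3” splits into three tokens, which hints at why token counts are larger than word counts.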
Apple gathered additional data, including human feedback and synthetic data, to refine the AFM models and try to curb undesirable behaviors like spouting toxicity.
“Our models were made to help people do everyday things on their Apple products,” the company says. “They are based on Apple’s core values and our responsible AI principles at every stage.”
There’s no smoking gun or shocking revelation in the paper, and that’s by design. Papers like these rarely come out, since revealing too much invites both competitive harm and legal exposure.
Some companies that train models on data scraped from the public web argue that the practice is protected by the fair use doctrine. But that’s a hotly contested question, and the number of lawsuits over it keeps growing.
In the paper, Apple notes that website owners can block its crawler from scraping their data. But that leaves individual creators in a bind. What’s an artist to do if, say, their portfolio is hosted on a site that refuses to block Apple’s scraping?
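Mechanically, the opt-out works through the standard robots.txt protocol. Apple’s crawler documentation also describes a separate Applebot-Extended user agent for excluding content from AI training specifically; assuming those documented agent names, a site owner’s robots.txt could look like this:

```
# Allow Applebot to crawl the site for search-related features...
User-agent: Applebot
Allow: /

# ...but opt the entire site out of use for AI model training.
User-agent: Applebot-Extended
Disallow: /
```

The catch the article points to: only the person who controls the site can add these rules, so a creator whose work appears on someone else’s domain has no say.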
Court battles will decide the fate of generative AI models and how they’re trained. For now, though, Apple is trying to position itself as an ethical player while steering clear of unwanted legal scrutiny.