While lawmakers in most countries are still talking about how to regulate AI, the European Union is already ahead of the game. Earlier this year, it passed a risk-based framework for controlling AI apps.
The law went into effect in August, but the full pan-EU AI governance regime is still being worked out. For example, Codes of Practice are still being made. However, over the next few months and years, the law’s tiered provisions will start to apply to companies that make AI apps and models, so the compliance clock is already running.
Checking to see if and how AI models are following the law is the next task. Most AI apps will be built on top of large language models (LLM) and other “foundational” or “general purpose” AIs, so it seems important to focus testing at this level of the AI stack.
Move forward At ETH Zurich, there is a company called LatticeFlow AI that works on AI risk management and compliance.
It released what it calls the first technical interpretation of the EU AI Act on Wednesday. This means that it tried to connect legal requirements to technical ones. It also released an open-source LLM validation framework based on this work, which it calls Compl-AI (compl-ai, see what they did there?).
They call it “the first regulation-oriented LLM benchmarking suite.” The Swiss Federal Institute of Technology and Bulgaria’s Institute for Computer Science, Artificial Intelligence, and Technology (INSAIT) have been working together on this project for a long time, according to LatticeFlow.
The Compl-AI website lets people who make AI models ask for an evaluation of their technology to see if it meets the standards of the EU AI Act.
Also, LatticeFlow has put out model reviews of a number of popular LLMs, including different versions and sizes of Meta’s Llama models and OpenAI’s GPT. They also have an EU AI Act compliance leaderboard for Big AI.
On a scale from 0 (no compliance) to 1 (full compliance), the second one rates how well models from companies like Anthropic, Google, OpenAI, Meta, and Mistral meet the law’s standards.
Other evaluations are marked with a “N/A” symbol, which means “not available” when there isn’t enough data or “not applicable” when the model maker doesn’t offer the feature. Please note that there were also some negative marks recorded at the time of writing. This was due to a bug in the Hugging Face interface, we were told.
To give you an idea of how LatticeFlow’s framework rates LLM replies, it looks at 27 different benchmarks, such as “prejudiced answers”, “toxic completions of benign text”, “following harmful instructions”, “truthfulness” and “common sense reasoning”. Each model gets a range of scores in each column, or “N/A” if there are no scores.
AI Compliance Isn’t All Good Or Bad
How did the big LLMs do? There is no overall score for the type. So performance is different based on what is being tested, but there are clear highs and lows across all of the benchmarks.
As an example, all the models did pretty well at not following harmful directions and nearly as well at not giving biased answers. However, the scores for reasoning and general knowledge were much less consistent.
In other places, recommendation consistency, which is used by the framework as a measure of justice, was very bad for all models. None of them scored above the middle point, and most scored well below it.
Because so many results are marked “N/A,” other areas, like the suitability of the training data and the reliability and strength of the watermark, don’t seem to have been looked at at all.
LatticeFlow does say that it can be harder to tell if models are complying with certain rules in some areas, like when it comes to tough problems like privacy and copyright. It’s not acting like it knows everything.
In a paper about the framework’s work, the scientists involved say that most of the smaller models they looked at (with 13B parameters) “scored poorly on technical robustness and safety.”
Almost all of the models they looked at had trouble reaching high levels of variety, fairness, and non-discrimination.
They say, “We think that these problems are mostly because model providers have been too focused on making models better, at the expense of other important aspects brought to light by the EU AI Act’s regulatory requirements.” They also say that as compliance deadlines approach, LLM makers will have to shift their attention to areas that need it, which will “lead to a more balanced development of LLMs.”
LatticeFlow’s system is always a work in progress because no one knows for sure what will be needed to follow the EU AI Act. Also, this is just one way that the law’s standards could be turned into technical results that can be measured and contrasted. But it’s an interesting start to what will need to be an ongoing project to look into powerful automation technologies and help their creators make them better.
LatticeFlow CEO Petar Tsankov told TechCrunch, “The framework is a first step toward a full compliance-centered evaluation of the EU AI Act. However, it is designed in a way that makes it easy to be updated so that it moves in lock-step with the Act as it is updated and the different working groups make progress.” “This is okay with the EU Commission.” We expect the community and business world to keep building on the framework to make it a full and complete AI Act assessment tool.
Tsankov said that it’s clear that AI models have “predominantly been optimized for capabilities rather than compliance,” which is a summary of what we’ve learned so far. Also, he pointed out “notable performance gaps,” which means that some models with a lot of power can be just as good at following the rules as models with less power.
Tsankov says that cyberattack resilience (at the model level) and fairness are two main issues that need to be looked into. Many models scored less than 50% in cyberattack resilience.
It was said that open-source vendors like Mistral have not put as much stress on this as closed-source vendors like Anthropic and OpenAI have in order to protect against jailbreaks and prompt injections.
He also said that this should be a top goal for future work because “most models” did just as badly on fairness tests.
Tsankov talked about the problems with comparing LLM success in areas like privacy and copyright. He said, “The problem with copyright is that the current benchmarks only look at copyright books.” There are two main problems with this approach: (i) it doesn’t look at possible copyright violations involving materials other than these books; and (ii) it rests on measuring model memorization, which is notoriously hard to do.
“The challenge is the same when it comes to privacy: the benchmark only checks to see if the model has remembered certain personal details.”
LatticeFlow wants the free and open source system to be used by more AI researchers and made better by them.
“We invite AI researchers, developers, and regulators to join us in moving this project forward,” said professor Martin Vechev of ETH Zurich and founder and scientific director of INSAIT, who is also working on the project. “We invite other research groups and practitioners to improve the AI Act mapping, add new benchmarks, and make this open-source framework bigger.”
Also Read: France Gets the Most Money in Europe for Generative Ai
“The methodology can also be used to test AI models against new laws that aren’t in the EU AI Act. This makes it a useful tool for businesses that work in multiple countries.”
What do you say about this story? Visit Parhlo World For more.