
ADeLe (annotated-demand-levels) is a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities. Unlike traditional benchmarks, which measure overall accuracy, ADeLe offers both explanation and prediction: it maps task demands against model abilities to forecast performance outcomes.
The 18 dimensions comprise 11 primordial capabilities (e.g., attention, comprehension, reasoning, and knowledge), 5 knowledge dimensions (covering domains such as the natural sciences or formal sciences), and 2 extraneous dimensions that affect task difficulty regardless of cognitive demands.
Each scale ranges from 0 to 5+, representing increasing levels of demand on a specific cognitive capability. Tasks are annotated by GPT-4o using structured prompts and detailed rubrics originally developed for human cognitive assessment.
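As a minimal sketch of what an annotated task looks like, a demand profile can be represented as a vector of per-dimension levels. Only a handful of the 18 dimension codes mentioned in this summary are shown, and capping "5+" at 6 is an assumption for illustration:

```python
from dataclasses import dataclass

# Illustrative subset of ADeLe dimension codes (QLq/QLl naming is hypothetical;
# MCr, MS, SNs, KNs appear in the summary). Levels run 0..6 here (6 standing
# in for "5+"), an assumption for this sketch.
DIMENSIONS = ["QLq", "QLl", "MCr", "MS", "SNs", "KNs"]

@dataclass
class DemandProfile:
    levels: dict  # dimension code -> demand level (0-6)

    def max_demand(self):
        """Return the single most demanding dimension, which often dominates
        how hard a task feels to a given model."""
        return max(self.levels.items(), key=lambda kv: kv[1])

profile = DemandProfile(levels={"QLq": 4, "QLl": 3, "MCr": 2, "MS": 0, "SNs": 0, "KNs": 1})
print(profile.max_demand())  # ('QLq', 4)
```

A battery of 16,108 annotated instances is then just a list of such profiles paired with pass/fail outcomes per model.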
The researchers applied DeLeAn to 16,108 instances across 63 tasks from 20 benchmarks, creating the most comprehensive unified representation of AI evaluation tasks in a standardized cognitive-demand space.
Inter-rater reliability studies and Delphi consensus methods showed high correlation between human and LLM annotations, demonstrating the reliability of the automated approach.
The system achieved approximately 88% accuracy in predicting the performance of popular models such as GPT-4o and LLaMA-3.1-405B, outperforming traditional methods and enabling potential failures to be anticipated before deployment. Because predictions come from comparing task demands with model abilities, this accuracy holds even on unfamiliar tasks, demonstrating robust generalization beyond training contexts.
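One way to realize demand-ability matching, sketched here rather than reproducing the authors' exact assessor, is a logistic model over the demand vector: each dimension gets a negative weight (higher demand lowers the success probability) and the intercept encodes the model's overall ability. All weights below are hypothetical values, not figures from the paper:

```python
import math

# Hypothetical fitted weights for one model: more negative means the model is
# more sensitive to demand on that dimension. Not taken from the paper.
WEIGHTS = {"QLq": -0.9, "QLl": -0.7, "MCr": -0.5, "MS": -0.4, "SNs": -0.6, "KNs": -0.3}
INTERCEPT = 3.0  # overall ability offset for this model (illustrative)

def predict_success(demands, weights=WEIGHTS, intercept=INTERCEPT):
    """Estimate the probability that the model solves a task whose
    per-dimension demand levels are given in `demands`."""
    z = intercept + sum(weights[d] * lvl for d, lvl in demands.items())
    return 1.0 / (1.0 + math.exp(-z))

easy = {"QLq": 1, "QLl": 0, "MCr": 1, "MS": 0, "SNs": 0, "KNs": 0}
hard = {"QLq": 5, "QLl": 4, "MCr": 3, "MS": 0, "SNs": 0, "KNs": 2}
print(predict_success(easy) > 0.5, predict_success(hard) < 0.5)  # True True
```

Because the prediction depends only on the demand annotation, not on the benchmark the task came from, the same fitted curve transfers to unseen tasks, which is what the generalization claim rests on.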
Many existing AI benchmarks fail to test what they intend. ADeLe analysis uncovered that benchmarks often require mixed capabilities, diluting their intended focus and confounding evaluation results.
Reasoning models (such as OpenAI's o1 and DeepSeek's R1-Distill) show clear improvements on the two QL dimensions (quantitative and logical reasoning), but also on MCr (identifying relevant information) and MS (mind modelling and social cognition).
Past a certain point, scaling up models produced diminishing returns in many ability areas. Training techniques and model design appeared to play a larger role in refining performance across specific cognitive domains.
In medical diagnosis, for example, a model's KNs (social sciences knowledge) score predicts how it will handle psychosomatic cases, a crucial factor missing from traditional accuracy metrics.
The transparency and risk assessment requirements of the EU AI Act align with ADeLe's explanatory capabilities. By transforming model profiles into standardized capability reports, developers can demonstrate compliance with the Act's documentation requirements.
Rather than deploying AI based on general reputation or limited task scores, developers and decision-makers can now use demand-level evaluations to match systems to tasks with far greater confidence.
The annotation overhead is usually much lower than the cost of evaluating the model on the underlying benchmarks, because the annotator does not need to complete the tasks, only to rate their difficulty along each dimension.
These demand levels are assigned through automated annotation by a large language model (LLM), specifically GPT-4o. By prompting the model with task examples and asking it to rate the demand levels on each scale, the authors can rapidly annotate thousands of instances.
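A minimal sketch of that annotation step, with the LLM call stubbed out: the rubric text, prompt wording, and response format below are all illustrative assumptions, not the authors' actual prompts.

```python
import re

# Hypothetical condensed rubric; the real rubrics are detailed scales
# adapted from human cognitive assessment.
RUBRIC_SNIPPET = (
    "Rate the task's demand for quantitative reasoning on a 0-5 scale:\n"
    "0: no quantitative content ... 5: research-level quantitative reasoning.\n"
    "Answer with 'Level: <n>'."
)

def build_prompt(task_text, rubric=RUBRIC_SNIPPET):
    """Combine the rubric and the task instance into one annotation prompt."""
    return f"{rubric}\n\nTask:\n{task_text}\n"

def parse_level(llm_response):
    """Extract the integer level from a 'Level: <n>' answer; None if absent."""
    m = re.search(r"Level:\s*(\d+)", llm_response)
    return int(m.group(1)) if m else None

# In practice build_prompt's output would be sent to GPT-4o;
# here the model response is stubbed.
stub_response = "The task involves multi-step arithmetic. Level: 3"
print(parse_level(stub_response))  # 3
```

Since the annotator only rates difficulty rather than solving the task, one LLM call per instance per dimension suffices, which is where the cost advantage over full benchmark evaluation comes from.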
ADeLe transitions evaluation from simplistic performance metrics to nuanced understanding of how specific capabilities interact with task complexity, providing causal explanations for model failures.
Several demand dimensions, such as MS (mind modelling and social cognition) and SNs (spatial reasoning and navigation), sit at level 0 for most benchmarks, a bias that reflects how little coverage current benchmarks give those dimensions.
We suspect the limited AUROC gains reflect a lack of sufficiently challenging instances across most dimensions in the benchmarks included in the current ADeLe battery v1.0.
Future extensions of the work include (i) improving the calibration of the scales against human populations, (ii) extending the scales to higher levels of demand, and (iii) extending the methodology to propensities.
ADeLe can be extended to multimodal and embodied AI systems, and it has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.
Planned extensions include evaluating agentic behavior and autonomous capabilities beyond current language model assessments.
Microsoft researchers aim to establish a collaborative community to refine and expand ADeLe as an open standard for AI evaluation across research, industry, and regulatory contexts.
This technology marks a major step toward a science of AI evaluation, one that offers both clear explanations of system behavior and reliable predictions about performance.
ADeLe represents a paradigm shift from performance-based to capability-based AI evaluation. By providing both explanatory insights into why models succeed or fail and predictive capabilities for unfamiliar tasks, ADeLe addresses critical gaps in current benchmarking approaches. Its 88% prediction accuracy, combined with interpretable cognitive profiles, positions it as a foundational framework for reliable AI deployment, regulatory compliance, and scientific understanding of AI capabilities.
The framework's emphasis on demand-ability matching offers unprecedented transparency in AI evaluation, enabling more informed decisions about model deployment while supporting the development of safer, more reliable AI systems.
Based on Microsoft Research's comprehensive study analyzing 16,000+ examples across 63 tasks and 20 benchmarks