
The percentage of correct predictions made by a model out of total predictions. In classification tasks, calculated as (True Positives + True Negatives) / Total Predictions.
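As a minimal illustration, accuracy can be computed directly from paired labels and predictions (the variable names here are illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Example: 3 of 4 predictions are correct -> 0.75
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```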
A methodology that assesses task difficulty for AI models using measurement scales for 18 types of cognitive and knowledge-based abilities.
Comparative evaluation methodology testing different model versions or prompts against each other to determine which performs better.
Testing with intentionally challenging inputs designed to expose model weaknesses, biases, or failure modes.
Hypothetical AI systems with human-level cognitive abilities across all domains, representing a key milestone that companies like OpenAI and DeepMind aim to achieve.
The systematic evaluation of AI models using standardized datasets and metrics to assess their capabilities, limitations, and performance across various tasks.
Organization monitoring AI companies' safety practices and evaluating their preparedness for advanced AI risks.
Government organizations (UK AISI, US AI Safety Institute) developing evaluations for advanced AI systems, focusing on misuse risks, societal impacts, and autonomous capabilities.
Future of Life Institute's grading system for AI companies' safety practices, with Anthropic receiving the highest grade (C) and Meta receiving an F.
Anthropic's classification system for AI systems based on risk levels, with ASL-3 involving enhanced security and deployment standards for models with potential dangerous capabilities.
Third-party organization conducting safety evaluations of frontier AI models for dangerous capabilities like resource accumulation and self-replication.
DeepMind's famous AI system that defeated human champions at the game of Go, widely regarded as a major breakthrough in demonstrated AI capabilities.
Algorithms used to find the nearest neighbors of a query point in high-dimensional datasets, trading small amounts of accuracy for significant speed improvements.
Evaluation metric determining whether an LLM output addresses the given input in an informative and concise manner.
AI safety company founded by Dario and Daniela Amodei, known for Claude models and Constitutional AI approach, receiving the highest grade on AI Safety Index.
Third-party research institute that evaluates AI models for safety risks, including testing for deceptive behavior and scheming capabilities.
See Alignment Research Center.
Framework for automated evaluation of RAG systems.
Open-source observability and evaluation platform for LLM applications with focus on tracing and debugging.
See AI Safety Level.
Evaluation approach focusing on specific aspects of model outputs (e.g., summary accuracy, coherence, relevance).
Techniques for visualizing attention weights in transformer models to understand what the model focuses on.
Metric measuring the ability of a binary classifier to distinguish between classes across all classification thresholds.
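A small sketch of the pairwise interpretation of AUC-ROC, the probability that a randomly chosen positive is scored above a randomly chosen negative, computed by brute force for illustration only:

```python
def auc_roc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.
    Ties count as half. O(n_pos * n_neg), so only suitable for small examples."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.35, 0.1]))  # 1.0: every positive outranks every negative
```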
Quantization library for transformer models to reduce memory requirements while maintaining performance.
Systematic assessment using computational metrics and algorithms for scalable, reproducible measurements.
Microsoft's cloud-based evaluation platform for assessing model performance across key metrics including factuality and semantic similarity.
Evaluation framework measuring social biases against protected classes along nine social dimensions.
A standardized test or measurement used to evaluate the performance of AI models against established criteria or other models.
An evaluation metric that uses BERT embeddings to compute similarity between generated and reference texts, capturing semantic similarity beyond surface-level n-gram matching.
Systematic unfairness in model outputs toward specific groups or demographics, measured through demographic parity, equalized odds, and other fairness metrics.
A suite of 23 challenging BIG-Bench tasks on which prior language models did not outperform the average human rater.
Research initiative focused on large-scale code generation models and their evaluation.
A coding benchmark on which AI systems achieve a 35.5% success rate, compared with 97% for human developers.
International collaboration that developed BLOOM language model and contributed to evaluation frameworks.
A precision-based metric that evaluates machine-generated text by comparing n-gram overlap with reference texts, commonly used in machine translation evaluation.
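A simplified sentence-level BLEU sketch using clipped n-gram precision and a brevity penalty; production BLEU implementations use 4-gram precision with smoothing, so treat this only as an illustration:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=2):
    """Geometric mean of clipped n-gram precisions up to max_n,
    multiplied by a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()
print(round(bleu(ref, hyp), 3))  # ~0.707
```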
A learned evaluation metric that uses fine-tuned BERT models to assess text generation quality.
Large language model developed by BigScience collaboration, representing open-source alternatives to proprietary models.
Statistical method for estimating sampling distributions by resampling with replacement.
Anthropic's training methodology using AI feedback to evaluate outputs according to a set of principles, implemented in Claude models.
A model's ability to provide confidence estimates that accurately reflect the likelihood of its predictions being correct.
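One common way to quantify calibration is expected calibration error (ECE); below is a minimal binned-ECE sketch, assuming predicted probabilities and 0/1 correctness flags as inputs:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: per-bin |accuracy - average confidence|, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece, total = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        avg_acc = sum(h for _, h in b) / len(b)
        ece += (len(b) / total) * abs(avg_acc - avg_conf)
    return ece

print(expected_calibration_error([0.95, 0.9, 0.6, 0.55], [1, 1, 0, 1]))
```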
Categories of weapons of mass destruction that AI safety evaluations assess models' potential to help develop or acquire.
Assessment of models' reasoning capabilities through step-by-step problem-solving approaches.
A crowdsourced platform where users interact with two anonymous LLMs simultaneously and vote for the better response, used to compute Elo ratings.
Set of metrics for evaluating classification models including accuracy, precision, recall, F1-score, specificity, and sensitivity.
Anthropic's AI assistant family, including Claude 3.5 Sonnet and Claude Opus 4, known for Constitutional AI training and safety features.
Set of principles used in Constitutional AI training, drawing from UN Declaration of Human Rights and other ethical frameworks.
Neural network architecture particularly effective for image processing, used in computer vision evaluation tasks.
Evaluation of how well sentences and paragraphs flow together to form unified and understandable responses.
AI company developing language models and evaluation frameworks; its researchers recently studied potential gaming of the Chatbot Arena leaderboard.
A learned metric for machine translation evaluation that uses cross-lingual pre-trained models.
Open-source end-to-end LLM evaluation and monitoring platform with prompt playground capabilities.
Company behind DeepEval framework, providing hosted evaluation platform for LLM applications.
A table layout allowing visualization of the performance of a classification algorithm, showing true positives, false positives, true negatives, and false negatives.
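A minimal sketch deriving the confusion-matrix counts and the classification metrics built from them (precision, recall, specificity, F1), using illustrative labels:

```python
def confusion_counts(y_true, y_pred):
    """Counts for a binary confusion matrix (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
precision = tp / (tp + fp)      # correct among predicted positives
recall = tp / (tp + fn)         # correct among actual positives (sensitivity)
specificity = tn / (tn + fp)    # correct among actual negatives
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, tn, fn, round(precision, 2), round(recall, 2), round(f1, 2))
```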
See CAI.
Assessment of model performance across different input context lengths.
RAG evaluation metric measuring the quality of retrieved information relative to the query.
RAG evaluation metric measuring the completeness of retrieved information relative to what should have been retrieved.
Vector representations where the same word has different embeddings based on surrounding context, crucial for semantic evaluation.
Conversational Question Answering dataset used for evaluating multi-turn dialogue capabilities.
Evaluation metric determining whether an LLM output is factually correct based on ground truth or reference standards.
A measure of similarity between two non-zero vectors defined as the cosine of the angle between them, commonly used for comparing embeddings.
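A short NumPy sketch of cosine similarity between two illustrative embedding vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = np.array([0.2, 0.7, 0.1])
emb_b = np.array([0.25, 0.6, 0.05])
print(round(cosine_similarity(emb_a, emb_b), 3))
```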
See Chain-of-Thought Evaluation.
A fairness criterion where predictions remain unchanged when protected attributes are flipped to counterfactual values.
A loss function measuring the difference between predicted and true probability distributions, widely used in LLM training and evaluation.
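A worked example of cross-entropy between a one-hot target and a predicted distribution, using the natural logarithm:

```python
import math

def cross_entropy(true_dist, pred_dist):
    """H(p, q) = -sum_i p_i * log q_i, using the natural log."""
    return -sum(p * math.log(q) for p, q in zip(true_dist, pred_dist) if p > 0)

# With a one-hot target, the loss reduces to the negative log-probability of the true class.
print(round(cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]), 4))  # -ln(0.7) ≈ 0.3567
```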
The phenomenon where test data inadvertently appears in training datasets, compromising evaluation integrity and leading to inflated performance scores.
A ranking quality metric that measures the total item relevance in a list with a discount for items further down the list.
Evaluation framework focusing on LLM evaluation with emphasis on dashboard visualization and UI for evaluation results.
Open-source LLM evaluation framework by Confident AI offering 14+ evaluation metrics for RAG and fine-tuning use cases.
Method for decomposing predictions of neural networks by comparing activations to reference activations.
Google's AI research lab known for AlphaGo, protein folding breakthroughs, and safety evaluation research.
Chinese AI lab that developed R1 model, demonstrating cost-effective training methods that surprised competitors.
A fairness metric ensuring that the probability of receiving a positive outcome is the same across all groups defined by a sensitive attribute.
Continuous, real-valued vectors representing information in high-dimensional space where every element contains non-zero values.
LLM-as-a-Judge methodology evaluating single outputs with numerical scores rather than comparative assessment.
A relaxation of demographic parity that requires the ratio of positive-outcome rates across groups to exceed a specified threshold rather than be exactly equal.
Framework for programming with foundation models, supporting automatic optimization of prompts and weights.
Evaluation frameworks that evolve over time to prevent gaming and maintain challenge levels.
Application of the Elo chess rating system to compare language models through pairwise human preference battles.
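A minimal sketch of a single Elo update after one pairwise battle; the K-factor of 32 is a common default, not a fixed standard:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update after a pairwise battle.
    score_a is 1 if model A wins, 0 if it loses, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# A lower-rated model beating a higher-rated one gains more points than an expected win would.
print(elo_update(1000, 1100, score_a=1))
```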
Open-source AI research organization behind the LM Evaluation Harness and various language models.
A unified framework for testing generative language models on academic benchmarks with standardized evaluation protocols.
The phenomenon where embedding representations change over time, affecting downstream task performance.
Assessment of vector representations of text, images, or other data types for their ability to capture semantic relationships and enable downstream tasks.
Assessment of how well vector representations capture semantic relationships and enable downstream tasks.
Requirement that confusion matrices have the same distribution across all sensitive characteristics.
A fairness metric ensuring that qualified individuals from all groups have the same chance of receiving positive outcomes.
A fairness metric requiring equal true positive rates and false positive rates across different demographic groups.
Benchmark evaluating emotional intelligence capabilities of language models.
A measure of the straight-line distance between two points in Euclidean space, commonly used for comparing vectors.
Frameworks providing standardized environments for running multiple evaluation benchmarks consistently across different models.
Unified evaluation library integrating existing frameworks like lm-evaluation-harness and FastChat.
Open-source Python library for ML model evaluation and monitoring with focus on data drift and model performance.
Methods and techniques for making AI model decisions interpretable and understandable to humans.
The harmonic mean of precision and recall, providing a single metric that balances both measures.
Assessment of whether generated content aligns with provided source materials without introducing false information.
Meta's AI research lab, recently restructured under Meta's efficiency initiatives and facing researcher departures.
Microsoft's open-source toolkit for assessing and mitigating unfairness in machine learning models.
Mathematical measures used to assess bias and ensure equitable treatment across different demographic groups.
The proportion of actual positive cases that were incorrectly classified as negative.
The proportion of actual negative cases that were incorrectly classified as positive.
LMSYS Org's evaluation framework supporting LLM-Judge evaluation for MT-Bench.
RAG evaluation metric measuring how accurately generated responses align with retrieved context without hallucination.
Methods for determining which input features most influence model predictions.
Dataset for fact-checking evaluation, assessing models' ability to verify statement accuracy.
Testing model performance with a small number of examples provided in the prompt.
Benchmark evaluating LLMs in the financial domain across 36 datasets covering 24 tasks in seven financial areas.
Assessment of how natural, grammatically correct, and readable generated text appears.
See False Negative Rate.
See False Positive Rate.
A benchmark evaluating models on extremely difficult mathematics problems where AI systems currently solve only 2% of problems.
Nonprofit organization that created the AI Safety Index and the "pause letter" calling for AI development moratorium.
A framework using large language models to evaluate text quality with chain-of-thought reasoning and probability scoring.
An activation function that provides a smooth, everywhere-differentiable alternative to ReLU.
Google's family of large language models competing with GPT and Claude, including Gemini 2.5.
AI-powered code completion tool whose effectiveness is evaluated through coding benchmarks.
Benchmark for evaluating language understanding across multiple tasks.
Former Google AI research lab, now integrated into DeepMind.
See DeepMind.
OpenAI's family of language models, including GPT-3, GPT-4, and variants used in evaluation studies.
A challenging benchmark designed to evaluate expertise in biology, physics, and chemistry at PhD level.
A technique for making convolutional neural network decisions transparent by highlighting important regions in input images.
An ensemble method combining multiple weak learners sequentially, where each learner corrects errors from previous ones.
Assessment of whether output text is free from grammatical errors such as incorrect verb conjugations and syntactical mistakes.
Math word problem dataset commonly used in language model evaluation.
A benchmark evaluating LLM performance in recognizing hallucinations in question-answering, dialogue, and summarization tasks.
The identification and quantification of instances where language models generate information that appears plausible but is factually incorrect.
The frequency at which language models generate false or unsupported information.
Evaluation category assessing potential for AI systems to cause harm through toxic, biased, or dangerous outputs.
Framework for building search systems and RAG applications with built-in evaluation capabilities.
Framework for testing medical chatbots on factual correctness, safety, and ethical standards.
A comprehensive benchmarking framework developed by Stanford CRFM evaluating language models across 42 scenarios and 7 metrics.
A benchmark testing commonsense reasoning by requiring models to choose sensible sentence completions.
Vectara's evaluation model specifically designed to detect and quantify hallucinations in LLM outputs.
Evaluation methodology incorporating human judgment and feedback in the testing process.
The proportion of queries for which at least one relevant item appears in the top-k results.
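A small sketch of hit rate at k averaged over queries, with illustrative document IDs:

```python
def hit_rate_at_k(ranked_results, relevant_ids, k):
    """1 if any of the top-k results is relevant, else 0; averaged over queries."""
    hits = [int(any(doc in relevant for doc in ranked[:k]))
            for ranked, relevant in zip(ranked_results, relevant_ids)]
    return sum(hits) / len(hits)

ranked = [["d3", "d1", "d7"], ["d5", "d2", "d9"]]
relevant = [{"d1"}, {"d4"}]
print(hit_rate_at_k(ranked, relevant, k=2))  # query 1 hits, query 2 misses -> 0.5
```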
Algorithm for approximate nearest neighbor search in high-dimensional spaces, used in vector databases.
Platform for tracing LLM execution flows and creating evaluation datasets from production data.
Multi-hop reasoning dataset requiring models to connect information across multiple documents.
Assessment of model outputs by human judges for qualities like fluency, coherence, relevance, and adequacy.
A benchmark for evaluating code generation capabilities through programming challenges.
A rigorous academic test on which top AI systems score only 8.80%, making it a challenging evaluation benchmark.
The best possible DCG score, used to normalize DCG into NDCG.
Institute of Electrical and Electronics Engineers, publishes AI evaluation research and standards.
A benchmark measuring models' ability to follow specific instructions accurately.
Testing models' ability to learn from examples provided in the prompt.
Fairness principle requiring that similar individuals receive similar treatment from AI systems.
Metrics specifically designed for evaluating search and retrieval systems, including precision@k, recall@k, MAP, MRR, and NDCG.
A mathematical operation computing the sum of products of corresponding elements of two vectors, used as a similarity measure.
UK AI Safety Institute's open-source evaluation platform for assessing AI model capabilities and safety risks.
Evaluation of models' ability to follow complex, multi-step instructions accurately.
Evaluation of models' adherence to prioritizing instructions between system, developer, and user messages.
A technique for attributing predictions of classification models to input features by computing gradients along a path.
The degree to which humans can understand and explain AI model decisions and reasoning processes.
Assessment of embeddings based on their internal properties rather than downstream task performance.
Indexing method for efficient similarity search in large vector databases.
Adversarial prompts designed to circumvent model safety training and induce harmful content generation.
Korean version of MMLU benchmark for testing knowledge in Korean language context.
Framework for building LLM applications with integrated evaluation and monitoring capabilities.
Open-source LLM engineering platform providing tracing, evaluation, prompt management, and analytics.
Evaluation and observability platform by LangChain for debugging, testing, and monitoring LLM applications.
Time required for a model to produce output after receiving input, critical for real-time applications.
Method for explaining neural network decisions by propagating relevance scores backward through layers.
Benchmark for measuring legal reasoning capabilities in large language models.
Systematic preference for responses of certain lengths regardless of quality.
HuggingFace's evaluation framework built on top of EleutherAI's lm-evaluation-harness.
Gradient boosting framework used in machine learning with applications in model evaluation.
A technique providing local explainability by approximating black box models with interpretable models for individual predictions.
Universal API wrapper for LLM providers, supporting evaluation across multiple model endpoints.
Framework for building LLM applications with built-in evaluation modules for retrieval and response quality.
An evaluation methodology using large language models to assess text output quality based on custom criteria defined in evaluation prompts.
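A minimal direct-scoring sketch using the OpenAI Python SDK as one possible judge backend; the model name, rubric, and 1-5 scale are illustrative choices, not a standard:

```python
from openai import OpenAI  # any chat-completion client could be substituted

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial evaluator. Rate the RESPONSE to the QUESTION
for factual accuracy and helpfulness on a 1-5 scale. Reply with only the number.

QUESTION: {question}
RESPONSE: {response}"""

def judge_score(question, response, model="gpt-4o-mini"):
    """Ask a judge model for a single 1-5 rating of one output."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, response=response)}],
        temperature=0,
    )
    # Assumes the judge follows the instruction to reply with a bare number.
    return int(completion.choices[0].message.content.strip())

print(judge_score("What is the capital of France?", "Paris is the capital of France."))
```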
Organization behind Chatbot Arena and FastChat evaluation frameworks.
The raw output scores from a neural network before applying activation functions like softmax.
See Layer-wise Relevance Propagation.
The sum of absolute differences between corresponding elements of two vectors.
A ranking metric that averages precision values across multiple recall levels and queries.
Dataset of mathematics problems used for evaluating mathematical reasoning capabilities.
An evaluation metric for generated text that matches words via stemming and synonymy, addressing limitations of surface-level n-gram metrics such as BLEU.
Company behind FAIR research lab and LLaMA models, recently announcing Meta Superintelligence Labs.
Meta's reorganized AI research organization led by Alexandr Wang as Chief AI Officer.
Benchmark for multimodal fact-checking with large vision-language models.
Efficient fact-checking framework for LLMs using grounding documents.
Platform for standardized LLM evaluation with custom metrics and experimentation tracking.
A benchmark testing knowledge across 57 academic subjects, from elementary to professional levels.
Enhanced version of MMLU with more challenging questions.
Benchmark for evaluating multimodal understanding across various domains.
Documentation providing transparency about AI model capabilities, limitations, and potential risks.
Changes in model performance over time due to shifts in data distribution or model degradation.
The extent to which humans can understand the reasoning behind AI model predictions and decisions.
An embedding-based evaluation metric using optimal transport to align embeddings between reference and generated text.
A ranking metric focusing on the position of the first relevant item in ranked results.
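A short sketch of mean reciprocal rank over a few illustrative queries:

```python
def mean_reciprocal_rank(ranked_results, relevant_ids):
    """Average of 1/rank of the first relevant item per query (0 if none is found)."""
    reciprocal_ranks = []
    for ranked, relevant in zip(ranked_results, relevant_ids):
        score = 0.0
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                score = 1.0 / rank
                break
        reciprocal_ranks.append(score)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

ranked = [["d3", "d1", "d7"], ["d4", "d5", "d2"]]
relevant = [{"d1"}, {"d4"}]
print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1/1) / 2 = 0.75
```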
See Meta Superintelligence Labs.
A benchmark designed to evaluate LLMs' ability to sustain multi-turn conversations.
Comprehensive benchmark for evaluating text embedding models across various tasks.
Assessment approach using multiple complementary metrics to capture different aspects of model performance.
A ranking quality metric comparing rankings to ideal order where all relevant items are at the top.
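A compact sketch computing DCG and normalizing it by the ideal ordering to obtain NDCG, using graded relevance scores:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the ideal ordering (all relevant items ranked first)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of results in the order the system returned them.
print(round(ndcg([3, 0, 2, 1]), 3))  # ~0.93
```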
The proportion of negative predictions that are actually negative.
Leading AI conference where research papers on evaluation methods are published.
U.S. national initiative for developing AI measurement standards and evaluation methodologies.
U.S. agency collaborating with Anthropic on red-teaming Claude models for nuclear security risks.
RAG evaluation metric measuring robustness to irrelevant information in retrieved context.
See Negative Predictive Value.
Open Language Model with publicly available training data and evaluation code.
Real-time assessment of model performance in production environments.
Testing model performance on data that differs from the training distribution.
AI company behind GPT models and OpenAI Evals framework, receiving D+ grade on AI Safety Index.
OpenAI's framework for evaluating AI models with basic evaluation templates and model-graded assessments.
LLM evaluation platform supporting evaluations across multiple domains including finance, healthcare, and law.
Open standard for capturing and storing AI model inferences to enable evaluation and observability.
LLM-as-a-Judge methodology comparing two outputs to determine which is better according to specified criteria.
Data analytics company partnering with Anthropic to provide Claude to U.S. intelligence agencies.
The relationship between model size (number of parameters) and performance, measuring computational resource effectiveness.
A measure of how well a language model predicts a sequence of words, with lower values indicating better performance.
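A worked example computing perplexity from per-token log-probabilities (natural log), as an evaluation harness might report them:

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Log-probabilities the model assigned to each token of a held-out sequence.
print(round(perplexity([-0.1, -0.5, -2.3, -0.05]), 2))  # ~2.09
```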
Evaluation dataset for testing hallucination in personal question answering scenarios.
See Arize Phoenix.
Systematic preference for items in certain positions of a ranked list.
See Precision - the proportion of positive predictions that are actually positive.
Statistical method for determining sample sizes needed to detect effects of specified size.
See Precision.
The ratio of true positive predictions to total positive predictions made by the model.
The proportion of relevant items among the top K retrieved items.
A fairness metric requiring equal precision rates across different demographic groups.
Assessment of how different prompt formulations affect model performance.
Machine learning framework providing tools for building and training models with evaluation capabilities.
Method for modeling relationships between chemical structures and pharmacological activity, used in toxicity prediction.
A comprehensive framework for evaluating Retrieval-Augmented Generation systems across multiple dimensions.
Evaluation tool for assessing RAG system performance and quality.
Evaluation measures for systems that return ordered lists of results, including MAP, MRR, NDCG, and precision@k.
The ratio of true positive predictions to total actual positive instances.
The proportion of relevant items that appear in the top K retrieved items.
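A small sketch computing precision@k and recall@k together for one query, with illustrative document IDs:

```python
def precision_recall_at_k(ranked, relevant, k):
    """precision@k: relevant fraction of the top-k results;
    recall@k: fraction of all relevant items found in the top-k."""
    top_k = ranked[:k]
    hits = sum(doc in relevant for doc in top_k)
    return hits / k, hits / len(relevant)

ranked = ["d2", "d9", "d4", "d1", "d7"]
relevant = {"d1", "d2", "d3"}
print(precision_recall_at_k(ranked, relevant, k=4))  # (0.5, 0.666...)
```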
Deploying domain experts to interact with models and test capabilities while attempting to break model safeguards.
Assessment methods that compare model outputs against gold standard references or ground truth.
Evaluating models on consistent test sets across iterations to detect performance degradation.
Dataset for evaluating model safety and toxicity detection capabilities.
Anthropic's framework categorizing AI systems into different AI Safety Levels with associated safety measures.
Assessment of information retrieval systems using metrics like precision, recall, MAP, MRR, and NDCG.
Training methodology using human preferences to improve model alignment and safety.
A model's ability to maintain performance when faced with adversarial inputs, distribution shifts, or edge cases.
Curve plotting true positive rate against false positive rate, used in binary classification evaluation.
A recall-focused metric family for evaluating text summarization quality.
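A sketch of ROUGE-1 recall only; full ROUGE implementations also report ROUGE-2 and ROUGE-L and typically apply stemming:

```python
from collections import Counter

def rouge_1_recall(reference, candidate):
    """ROUGE-1 recall: fraction of reference unigrams covered by the candidate."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

ref = "the report summarizes quarterly revenue growth"
hyp = "the summary covers quarterly revenue growth"
print(round(rouge_1_recall(ref, hyp), 3))  # 4 of 6 reference words covered ≈ 0.667
```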
See Responsible Scaling Policy.
Assessment of AI systems for potential harms, including toxicity, bias, misinformation, and misuse risks.
Visualizations showing which parts of input data are most important for model decisions.
Company providing evaluation and red-teaming services, selected by the White House to conduct public AI assessments.
Google's library for efficient similarity search in large datasets.
AI behavior involving deceptive alignment and subversion of safety measures to gain power or achieve goals.
Scale AI's research lab focusing on model-assisted evaluation approaches.
Methods for assessing model reliability through multiple sampling and consistency checking.
Hallucination detection method based on self-consistency checking across multiple model generations.
Evaluation of how similar two pieces of text are in meaning, often using embedding-based approaches.
Task and benchmark for measuring the degree of semantic equivalence between text pairs.
Method for generating sentence-level embeddings that can be compared using cosine similarity.
Running new models alongside production models to compare performance without affecting users.
A framework using game theory concepts to explain model predictions by computing feature contribution values.
A dataset of fact-seeking questions with short answers measuring model accuracy for attempted answers.
Vector representations where most values are zero, emphasizing only relevant information.
The proportion of actual negative cases correctly identified as negative.
Benchmark for evaluating reading comprehension capabilities.
See Semantic Textual Similarity.
Academic jailbreak benchmark testing model resistance against common adversarial attacks.
More challenging version of GLUE benchmark for language understanding evaluation.
A benchmark evaluating AI systems' ability to resolve real-world software engineering problems.
Evaluation metric determining whether an LLM agent successfully completes assigned tasks.
Machine learning framework providing comprehensive tools for building and training models with evaluation capabilities.
See NIST TEVV.
Independent assessment conducted by external organizations to ensure objective evaluation of AI systems.
Number of requests or tokens a model can process per unit time, measuring computational efficiency.
Measurement of how effectively models use their token budgets for generation tasks.
Application and SDK for measuring RAG LLM system performance with automated evaluation capabilities.
Evaluation metric determining whether an LLM agent calls the correct tools for given tasks.
Question answering dataset used for evaluating topic-specific knowledge.
Measurement of harmful, offensive, or inappropriate content generation by AI models.
See Sensitivity or Recall - the proportion of actual positives correctly identified.
Fairness metric requiring equal ratios of false negatives to false positives across demographic groups.
Fast explainer for analyzing decision tree models in the SHAP framework.
Question answering dataset testing factual knowledge across various topics.
See Specificity - the proportion of actual negatives correctly identified.
See Sensitivity or Recall - the proportion of actual positives correctly identified.
Open-source library for evaluating and tracking LLM applications through feedback functions and comprehensive tracing.
A benchmark evaluating LLMs' accuracy in providing truthful information using adversarial questions.
Unified evaluation framework for text generation using pre-trained language models.
Frameworks that standardize evaluation across multiple tasks, models, and metrics for consistent comparison.
Evaluation framework for LLM applications with focus on continuous monitoring and improvement.
Percentage of time a model or system is operational and available.
Federal initiative collaborating with AI companies on safety research, testing and evaluation.
Process of assessing model performance on held-out data to ensure generalization.
Assessment of systems designed to store and retrieve high-dimensional vectors efficiently.
The process of finding vectors in a database that are most similar to a query vector using distance metrics.
Company providing hallucination evaluation models and leaderboards for LLM assessment.
Extension of HELM framework for comprehensive evaluation of Vision-Language Models.
Inference and serving library for large language models with support for fast evaluation.
Platform providing experiment tracking, model evaluation, and collaborative ML workflows.
Vector database company providing evaluation metrics and tools for search and recommendation systems.
Benchmark dataset for evaluating model performance on diverse, real-world tasks.
Multi-purpose moderation tool for assessing safety of user-LLM interactions.
Safety training dataset with 262K examples for improving model robustness against adversarial attacks.
Automatic red-teaming framework for identifying and reproducing human-devised attacks.
Benchmark testing commonsense reasoning through pronoun resolution tasks.
See Explainable AI.
Elon Musk's AI company, receiving low grades on AI Safety Index evaluations.
Extreme gradient boosting framework commonly used in machine learning competitions and evaluation studies.
Object detection algorithm whose performance is evaluated using computer vision metrics.
Benchmark evaluating logical reasoning abilities of LLMs via logic grid puzzles.
Testing model performance on tasks without providing task-specific training examples.
Chinese AI company included in AI Safety Index evaluations.
Company behind Milvus vector database, providing resources on embedding evaluation and similarity metrics.
Ensemble method that sequentially applies weak learners, with each focusing on previously misclassified examples.
Model's ability to maintain performance when faced with intentionally crafted malicious inputs.
Framework ensuring AI systems can be held responsible for their decisions and impacts.
Systematic examination of AI systems for bias, fairness, and compliance with ethical standards.
Neural network component that allows models to focus on relevant parts of input sequences.
Technique for normalizing layer inputs to improve training stability and convergence.
Statistical approach to model evaluation incorporating uncertainty quantification.
Phenomenon where AI systems increase existing biases present in training data.
Problem where neural networks lose previously learned information when learning new tasks.
The ability of models to understand and generate novel combinations of known elements.
Changes in the underlying data distribution that affect model performance over time.
Privacy-preserving technique that adds noise to data or model outputs to protect individual privacy.
Changes in data distribution between training and testing phases that can degrade model performance.
Machine learning technique combining multiple models to improve overall performance.
Uncertainty arising from limited knowledge or data, reducible with more information.
Machine learning approach where models are trained across decentralized data sources.
Assessment of model performance after task-specific training on pre-trained models.
Optimization algorithm used to minimize loss functions in machine learning models.
Process of finding optimal configuration parameters for machine learning models.
Technique for transferring knowledge from large models to smaller, more efficient ones.
Statistical sampling method used in Bayesian model evaluation.
Learning to learn - algorithms that improve their learning efficiency through experience.
Framework for sequential decision-making under uncertainty, used in online evaluation.
Automated method for finding optimal neural network architectures.
Problem where models perform well on training data but poorly on unseen data.
Techniques to prevent overfitting by adding constraints or penalties to model complexity.
Learning approach using both labeled and unlabeled data for training.
Technique leveraging knowledge from pre-trained models for new tasks.
Problem where models are too simple to capture underlying data patterns.
Professional organization publishing AI evaluation research and standards.
Conference focusing on ethical considerations in AI development and evaluation.
Organization promoting AI research and responsible development practices.
Conference dedicated to fairness and accountability in algorithmic systems.
Premier venue for machine learning research including evaluation methodologies.
International standards for AI systems including evaluation and testing frameworks.
Collaborative effort to study and formulate best practices on AI technologies.
European Union regulation establishing requirements for AI systems including evaluation standards.
Framework establishing principles for responsible AI development and deployment.
European privacy regulation affecting AI evaluation practices and data handling.
Documentation standard for AI model transparency and accountability.
Structured approach to evaluating and managing AI risks throughout development.
Assessment frameworks for self-driving car AI systems including safety and performance metrics.
Specialized assessment for medical AI applications focusing on patient safety and efficacy.
Assessment of AI systems used in security applications including threat detection and response.
Frameworks for assessing AI tutoring systems and educational technology effectiveness.
Specialized assessment for AI in banking, trading, and financial services with regulatory compliance.
Assessment frameworks for AI systems used in legal applications including bias and fairness.
Specialized assessment for defense applications including ethical and strategic considerations.
Assessment of content moderation and recommendation algorithms for harmful content detection.
Technology enabling direct communication between brain and computer systems, requiring specialized evaluation.
Virtual replicas of physical systems used for simulation and evaluation.
Assessment of AI systems running on local devices with resource constraints.
Brain-inspired computing architectures requiring specialized evaluation approaches.
Integration of quantum computing with machine learning, requiring new evaluation frameworks.
Collective behavior of decentralized systems requiring specialized evaluation metrics.
Process of removing personally identifiable information from datasets used in evaluation.
Tracking of data flow and transformations throughout the machine learning pipeline.
Documentation of data origin, ownership, and processing history for evaluation datasets.
Assessment approaches that preserve data privacy by keeping data distributed.
Cryptographic technique allowing computation on encrypted data for privacy-preserving evaluation.
Creation of artificial datasets for evaluation without exposing real sensitive data.
Organizational committees overseeing ethical AI development and evaluation practices.
Frameworks for managing AI development, deployment, and evaluation within organizations.
Metrics for measuring business value and effectiveness of AI system implementations.
Evaluation frameworks for selecting and monitoring third-party AI service providers.
Executive role responsible for AI strategy and evaluation oversight.
Practices for deploying and maintaining machine learning systems including evaluation pipelines.
Framework for identifying, assessing, and mitigating risks associated with AI model deployment.
This comprehensive dictionary draws from leading sources including:
Academic Institutions:
Industry Organizations:
Evaluation Frameworks:
Government and Standards:
Research Venues:
This dictionary represents the most comprehensive compilation of AI evaluation terminology as of 2025, covering foundational concepts, cutting-edge methodologies, practical tools, industry standards, regulatory frameworks, and emerging technologies. The field continues to evolve rapidly with new metrics, benchmarks, evaluation approaches, companies, and regulatory requirements being developed regularly. This reference should serve as the definitive one-stop resource for anyone working in or studying AI evaluation and testing.
The percentage of correct predictions made by a model out of total predictions. In classification tasks, calculated as (True Positives + True Negatives) / Total Predictions.
A methodology that assesses task difficulty for AI models using measurement scales for 18 types of cognitive and knowledge-based abilities.
Comparative evaluation methodology testing different model versions or prompts against each other to determine which performs better.
Testing with intentionally challenging inputs designed to expose model weaknesses, biases, or failure modes.
Hypothetical AI systems with human-level cognitive abilities across all domains, representing a key milestone that companies like OpenAI and DeepMind aim to achieve.
The systematic evaluation of AI models using standardized datasets and metrics to assess their capabilities, limitations, and performance across various tasks.
Organization monitoring AI companies' safety practices and evaluating their preparedness for advanced AI risks.
Government organizations (UK AISI, US AI Safety Institute) developing evaluations for advanced AI systems, focusing on misuse risks, societal impacts, and autonomous capabilities.
Future of Life Institute's grading system for AI companies' safety practices, with Anthropic receiving the highest grade (C) and Meta receiving an F.
Anthropic's classification system for AI systems based on risk levels, with ASL-3 involving enhanced security and deployment standards for models with potential dangerous capabilities.
Third-party organization conducting safety evaluations of frontier AI models for dangerous capabilities like resource accumulation and self-replication.
DeepMind's famous AI system that defeated human champions at the game of Go, representing a major breakthrough in AI capabilities evaluation.
Algorithms used to find the nearest neighbors of a query point in high-dimensional datasets, trading small amounts of accuracy for significant speed improvements.
Evaluation metric determining whether an LLM output addresses the given input in an informative and concise manner.
AI safety company founded by Dario and Daniela Amodei, known for Claude models and Constitutional AI approach, receiving the highest grade on AI Safety Index.
Third-party research institute that evaluates AI models for safety risks, including testing for deceptive behavior and scheming capabilities.
See Alignment Research Center.
Framework for automated evaluation of RAG systems.
Open-source observability and evaluation platform for LLM applications with focus on tracing and debugging.
See AI Safety Level.
Evaluation approach focusing on specific aspects of model outputs (e.g., summary accuracy, coherence, relevance).
Techniques for visualizing attention weights in transformer models to understand what the model focuses on.
Metric measuring the ability of a binary classifier to distinguish between classes across all classification thresholds.
Quantization library for transformer models to reduce memory requirements while maintaining performance.
Systematic assessment using computational metrics and algorithms for scalable, reproducible measurements.
Microsoft's cloud-based evaluation platform for assessing model performance across key metrics including factuality and semantic similarity.
Evaluation framework measuring social biases against protected classes along nine social dimensions.
A standardized test or measurement used to evaluate the performance of AI models against established criteria or other models.
An evaluation metric that uses BERT embeddings to compute similarity between generated and reference texts, capturing semantic similarity beyond surface-level n-gram matching.
Systematic unfairness in model outputs toward specific groups or demographics, measured through demographic parity, equalized odds, and other fairness metrics.
A suite of 23 challenging BIG-Bench tasks where prior language model evaluations did not outperform average human raters.
Research initiative focused on large-scale code generation models and their evaluation.
A coding benchmark where AI systems achieve 35.5% success rate compared to 97% human performance.
International collaboration that developed BLOOM language model and contributed to evaluation frameworks.
A precision-based metric that evaluates machine-generated text by comparing n-gram overlap with reference texts, commonly used in machine translation evaluation.
A learned evaluation metric that uses fine-tuned BERT models to assess text generation quality.
Large language model developed by BigScience collaboration, representing open-source alternatives to proprietary models.
Statistical method for estimating sampling distributions by resampling with replacement.
Anthropic's training methodology using AI feedback to evaluate outputs according to a set of principles, implemented in Claude models.
A model's ability to provide confidence estimates that accurately reflect the likelihood of its predictions being correct.
Categories of weapons of mass destruction that AI safety evaluations assess models' potential to help develop or acquire.
Assessment of models' reasoning capabilities through step-by-step problem-solving approaches.
A crowdsourced platform where users interact with two anonymous LLMs simultaneously and vote for the better response, used to compute ELO ratings.
Set of metrics for evaluating classification models including accuracy, precision, recall, F1-score, specificity, and sensitivity.
Anthropic's AI assistant family, including Claude 3.5 Sonnet and Claude Opus 4, known for Constitutional AI training and safety features.
Set of principles used in Constitutional AI training, drawing from UN Declaration of Human Rights and other ethical frameworks.
Neural network architecture particularly effective for image processing, used in computer vision evaluation tasks.
Evaluation of how well sentences and paragraphs flow together to form unified and understandable responses.
AI company developing language models and evaluation frameworks, recently studied potential gaming of Chatbot Arena leaderboard.
A learned metric for machine translation evaluation that uses cross-lingual pre-trained models.
Open-source end-to-end LLM evaluation and monitoring platform with prompt playground capabilities.
Company behind DeepEval framework, providing hosted evaluation platform for LLM applications.
A table layout allowing visualization of the performance of a classification algorithm, showing true positives, false positives, true negatives, and false negatives.
See CAI.
Assessment of model performance across different input context lengths.
RAG evaluation metric measuring the quality of retrieved information relative to the query.
RAG evaluation metric measuring the completeness of retrieved information relative to what should have been retrieved.
Vector representations where the same word has different embeddings based on surrounding context, crucial for semantic evaluation.
Conversational Question Answering dataset used for evaluating multi-turn dialogue capabilities.
Evaluation metric determining whether an LLM output is factually correct based on ground truth or reference standards.
A measure of similarity between two non-zero vectors defined as the cosine of the angle between them, commonly used for comparing embeddings.
See Chain-of-Thought Evaluation.
A fairness criterion where predictions remain unchanged when protected attributes are flipped to counterfactual values.
A loss function measuring the difference between predicted and true probability distributions, widely used in LLM training and evaluation.
The phenomenon where test data inadvertently appears in training datasets, compromising evaluation integrity and leading to inflated performance scores.
A ranking quality metric that measures the total item relevance in a list with a discount for items further down the list.
Evaluation framework focusing on LLM evaluation with emphasis on dashboard visualization and UI for evaluation results.
Open-source LLM evaluation framework by Confident AI offering 14+ evaluation metrics for RAG and fine-tuning use cases.
Method for decomposing predictions of neural networks by comparing activations to reference activations.
Google's AI research lab known for AlphaGo, protein folding breakthroughs, and safety evaluation research.
Chinese AI lab that developed R1 model, demonstrating cost-effective training methods that surprised competitors.
A fairness metric ensuring that the probability of receiving a positive outcome is the same across all groups defined by a sensitive attribute.
Continuous, real-valued vectors representing information in high-dimensional space where every element contains non-zero values.
LLM-as-a-Judge methodology evaluating single outputs with numerical scores rather than comparative assessment.
A variation of demographic parity that aims to achieve higher-than-specified ratios rather than equal approval rates.
Framework for programming with foundation models, supporting automatic optimization of prompts and weights.
Evaluation frameworks that evolve over time to prevent gaming and maintain challenge levels.
Application of chess rating system to compare language models through pairwise human preference battles.
Open-source AI research organization behind the LM Evaluation Harness and various language models.
A unified framework for testing generative language models on academic benchmarks with standardized evaluation protocols.
The phenomenon where embedding representations change over time, affecting downstream task performance.
Assessment of vector representations of text, images, or other data types for their ability to capture semantic relationships and enable downstream tasks.
Assessment of how well vector representations capture semantic relationships and enable downstream tasks.
Requirement that confusion matrices have the same distribution across all sensitive characteristics.
A fairness metric ensuring that qualified individuals from all groups have the same chance of receiving positive outcomes.
A fairness metric requiring equal true positive rates and false positive rates across different demographic groups.
Benchmark evaluating emotional intelligence capabilities of language models.
A measure of the straight-line distance between two points in Euclidean space, commonly used for comparing vectors.
Frameworks providing standardized environments for running multiple evaluation benchmarks consistently across different models.
Unified evaluation library integrating existing frameworks like lm-evaluation-harness and FastChat.
Open-source Python library for ML model evaluation and monitoring with focus on data drift and model performance.
Methods and techniques for making AI model decisions interpretable and understandable to humans.
The harmonic mean of precision and recall, providing a single metric that balances both measures.
Assessment of whether generated content aligns with provided source materials without introducing false information.
Meta's AI research lab, recently restructured under Meta's efficiency initiatives and facing researcher departures.
Microsoft's open-source toolkit for assessing and mitigating unfairness in machine learning models.
Mathematical measures used to assess bias and ensure equitable treatment across different demographic groups.
The proportion of actual positive cases that were incorrectly classified as negative.
The proportion of actual negative cases that were incorrectly classified as positive.
LMSYS Org's evaluation framework supporting LLM-Judge evaluation for MT-Bench.
RAG evaluation metric measuring how accurately generated responses align with retrieved context without hallucination.
Methods for determining which input features most influence model predictions.
Dataset for fact-checking evaluation, assessing models' ability to verify statement accuracy.
Testing model performance with a small number of examples provided in the prompt.
Benchmark evaluating LLMs in financial domain across 36 datasets covering 24 tasks in seven financial areas.
Assessment of how natural, grammatically correct, and readable generated text appears.
See False Negative Rate.
See False Positive Rate.
A benchmark evaluating models on extremely difficult mathematics problems where AI systems currently solve only 2% of problems.
Nonprofit organization that created the AI Safety Index and the "pause letter" calling for AI development moratorium.
A framework using large language models to evaluate text quality with chain-of-thought reasoning and probability scoring.
An activation function that provides smooth and differentiable alternative to ReLU, ensuring differentiability at every point.
Google's family of large language models competing with GPT and Claude, including Gemini 2.5.
AI-powered code completion tool whose effectiveness is evaluated through coding benchmarks.
Benchmark for evaluating language understanding across multiple tasks.
Former Google AI research lab, now integrated into DeepMind.
See DeepMind.
OpenAI's family of language models, including GPT-3, GPT-4, and variants used in evaluation studies.
A challenging benchmark designed to evaluate expertise in biology, physics, and chemistry at PhD level.
A technique for making convolutional neural network decisions transparent by highlighting important regions in input images.
An ensemble method combining multiple weak learners sequentially, where each learner corrects errors from previous ones.
Assessment of whether output text is free from grammatical errors such as incorrect verb conjugations and syntactical mistakes.
Math word problem dataset commonly used in language model evaluation.
A benchmark evaluating LLM performance in recognizing hallucinations in question-answering, dialogue, and summarization tasks.
The identification and quantification of instances where language models generate information that appears plausible but is factually incorrect.
The frequency at which language models generate false or unsupported information.
Evaluation category assessing potential for AI systems to cause harm through toxic, biased, or dangerous outputs.
Framework for building search systems and RAG applications with built-in evaluation capabilities.
Framework for testing medical chatbots on factual correctness, safety, and ethical standards.
A comprehensive benchmarking framework developed by Stanford CRFM evaluating language models across 42 scenarios and 7 metrics.
A benchmark testing commonsense reasoning by requiring models to choose sensible sentence completions.
Vectara's evaluation model specifically designed to detect and quantify hallucinations in LLM outputs.
Evaluation methodology incorporating human judgment and feedback in the testing process.
The proportion of queries for which at least one relevant item appears in the top-k results.
Algorithm for approximate nearest neighbor search in high-dimensional spaces, used in vector databases.
Platform for tracing LLM execution flows and creating evaluation datasets from production data.
Multi-hop reasoning dataset requiring models to connect information across multiple documents.
Assessment of model outputs by human judges for qualities like fluency, coherence, relevance, and adequacy.
A benchmark for evaluating code generation capabilities through programming challenges.
Rigorous academic test where top AI systems score only 8.80%, representing challenging evaluation benchmark.
The best possible DCG score, used to normalize DCG into NDCG.
Institute of Electrical and Electronics Engineers, publishes AI evaluation research and standards.
A benchmark measuring models' ability to follow specific instructions accurately.
Testing models' ability to learn from examples provided in the prompt.
Fairness principle requiring that similar individuals receive similar treatment from AI systems.
Metrics specifically designed for evaluating search and retrieval systems, including precision@k, recall@k, MAP, MRR, and NDCG.
A mathematical operation computing the sum of products of corresponding elements of two vectors, used as a similarity measure.
UK AI Safety Institute's open-source evaluation platform for assessing AI model capabilities and safety risks.
Evaluation of models' ability to follow complex, multi-step instructions accurately.
Evaluation of models' adherence to prioritizing instructions between system, developer, and user messages.
A technique for attributing predictions of classification models to input features by computing gradients along a path.
The degree to which humans can understand and explain AI model decisions and reasoning processes.
Assessment of embeddings based on their internal properties rather than downstream task performance.
Indexing method for efficient similarity search in large vector databases.
Adversarial prompts designed to circumvent model safety training and induce harmful content generation.
Korean version of MMLU benchmark for testing knowledge in Korean language context.
Framework for building LLM applications with integrated evaluation and monitoring capabilities.
Open-source LLM engineering platform providing tracing, evaluation, prompt management, and analytics.
Evaluation and observability platform by LangChain for debugging, testing, and monitoring LLM applications.
Time required for a model to produce output after receiving input, critical for real-time applications.
Method for explaining neural network decisions by propagating relevance scores backward through layers.
Benchmark for measuring legal reasoning capabilities in large language models.
Systematic preference for responses of certain lengths regardless of quality.
HuggingFace's evaluation framework built on top of EleutherAI's lm-evaluation-harness.
Gradient boosting framework used in machine learning with applications in model evaluation.
A technique providing local explainability by approximating black box models with interpretable models for individual predictions.
Universal API wrapper for LLM providers, supporting evaluation across multiple model endpoints.
Framework for building LLM applications with built-in evaluation modules for retrieval and response quality.
An evaluation methodology using large language models to assess text output quality based on custom criteria defined in evaluation prompts.
Organization behind Chatbot Arena and FastChat evaluation frameworks.
The raw output scores from a neural network before applying activation functions like softmax.
See Layer-wise Relevance Propagation.
The sum of absolute differences between corresponding elements of two vectors.
A ranking metric that averages precision values across multiple recall levels and queries.
Dataset of mathematics problems used for evaluating mathematical reasoning capabilities.
An evaluation metric addressing synonymy issues in text evaluation, providing advantages over simpler metrics.
Company behind FAIR research lab and LLaMA models, recently announcing Meta Superintelligence Labs.
Meta's reorganized AI research organization led by Alexandr Wang as Chief AI Officer.
Benchmark for multimodal fact-checking with large vision-language models.
Efficient fact-checking framework for LLMs using grounding documents.
Platform for standardized LLM evaluation with custom metrics and experimentation tracking.
A benchmark testing knowledge across 57 academic subjects, from elementary to professional levels.
Enhanced version of MMLU with more challenging questions.
Benchmark for evaluating multimodal understanding across various domains.
Documentation providing transparency about AI model capabilities, limitations, and potential risks.
Changes in model performance over time due to shifts in data distribution or model degradation.
The extent to which humans can understand the reasoning behind AI model predictions and decisions.
An embedding-based evaluation metric using optimal transport to align embeddings between reference and generated text.
A ranking metric focusing on the position of the first relevant item in ranked results.
See Meta Superintelligence Labs.
A benchmark designed to evaluate LLMs' ability to sustain multi-turn conversations.
Comprehensive benchmark for evaluating text embedding models across various tasks.
Assessment approach using multiple complementary metrics to capture different aspects of model performance.
A ranking quality metric comparing rankings to ideal order where all relevant items are at the top.
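A small sketch, assuming graded relevance scores listed in ranked order:

    import math

    def dcg(relevances):
        # Discounted cumulative gain: gains are discounted by log2(rank + 1).
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

    def ndcg(relevances):
        # Normalize against the ideal ordering (all relevant items first).
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal else 0.0

    print(round(ndcg([1, 3, 2]), 3))  # compared against the ideal order [3, 2, 1]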
The proportion of negative predictions that are actually negative.
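Equivalently, NPV = TN / (TN + FN); a trivial sketch from confusion-matrix counts:

    def negative_predictive_value(tn, fn):
        # Fraction of negative predictions that are truly negative.
        return tn / (tn + fn)

    print(negative_predictive_value(tn=80, fn=20))  # 0.8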
Leading AI conference where research papers on evaluation methods are published.
U.S. national initiative for developing AI measurement standards and evaluation methodologies.
U.S. agency collaborating with Anthropic on red-teaming Claude models for nuclear security risks.
RAG evaluation metric measuring robustness to irrelevant information in retrieved context.
See Negative Predictive Value.
Open Language Model with publicly available training data and evaluation code.
Real-time assessment of model performance in production environments.
Testing model performance on data that differs from the training distribution.
AI company behind GPT models and the OpenAI Evals framework, receiving a D+ grade on the AI Safety Index.
OpenAI's framework for evaluating AI models with basic evaluation templates and model-graded assessments.
LLM evaluation platform supporting evaluations across multiple domains including finance, healthcare, and law.
Open standard for capturing and storing AI model inferences to enable evaluation and observability.
LLM-as-a-Judge methodology comparing two outputs to determine which is better according to specified criteria.
Data analytics company partnering with Anthropic to provide Claude to U.S. intelligence agencies.
The relationship between model size (number of parameters) and performance, used to gauge how effectively additional parameters and compute translate into capability.
A measure of how well a language model predicts a sequence of words, with lower values indicating better performance.
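A minimal sketch computing perplexity from per-token probabilities assigned by the model (hypothetical example values):

    import math

    def perplexity(token_probs):
        # Perplexity = exp of the average negative log-probability per token.
        nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
        return math.exp(nll)

    print(round(perplexity([0.25, 0.5, 0.125]), 2))  # 4.0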
Evaluation dataset for testing hallucination in personal question answering scenarios.
See Arize Phoenix.
Systematic preference for items in certain positions of a ranked list.
See Precision - the proportion of positive predictions that are actually positive.
Statistical method for determining sample sizes needed to detect effects of specified size.
See Precision.
The ratio of true positive predictions to total positive predictions made by the model.
The proportion of relevant items among the top K retrieved items.
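A minimal sketch, assuming a ranked 0/1 relevance list:

    def precision_at_k(relevances, k):
        # Fraction of the top-k retrieved results that are relevant.
        return sum(relevances[:k]) / k

    print(precision_at_k([1, 0, 1, 1, 0], k=3))  # 2/3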
A fairness metric requiring equal precision rates across different demographic groups.
Assessment of how different prompt formulations affect model performance.
Machine learning framework providing tools for building and training models with evaluation capabilities.
Method for modeling relationships between chemical structures and pharmacological activity, used in toxicity prediction.
A comprehensive framework for evaluating Retrieval-Augmented Generation systems across multiple dimensions.
Evaluation tool for assessing RAG system performance and quality.
Evaluation measures for systems that return ordered lists of results, including MAP, MRR, NDCG, and precision@k.
The ratio of true positive predictions to total actual positive instances.
The proportion of relevant items that appear in the top K retrieved items.
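A minimal counterpart sketch, assuming the total number of relevant items is known:

    def recall_at_k(relevances, k, total_relevant):
        # Fraction of all relevant items that appear in the top-k results.
        return sum(relevances[:k]) / total_relevant

    print(recall_at_k([1, 0, 1, 1, 0], k=3, total_relevant=4))  # 0.5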
Deploying domain experts to interact with models and test capabilities while attempting to break model safeguards.
Assessment methods that compare model outputs against gold standard references or ground truth.
Evaluating models on consistent test sets across iterations to detect performance degradation.
Dataset for evaluating model safety and toxicity detection capabilities.
Anthropic's framework categorizing AI systems into different AI Safety Levels with associated safety measures.
Assessment of information retrieval systems using metrics like precision, recall, MAP, MRR, and NDCG.
Training methodology using human preferences to improve model alignment and safety.
A model's ability to maintain performance when faced with adversarial inputs, distribution shifts, or edge cases.
Curve plotting true positive rate against false positive rate, used in binary classification evaluation.
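A short sketch using scikit-learn (assuming binary labels and predicted scores):

    from sklearn.metrics import roc_curve, auc

    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]

    # roc_curve sweeps classification thresholds over the scores.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(auc(fpr, tpr))  # area under the ROC curve, 0.75 for this toy example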
A recall-focused metric family for evaluating text summarization quality.
See Responsible Scaling Policy.
Assessment of AI systems for potential harms, including toxicity, bias, misinformation, and misuse risks.
Visualizations showing which parts of input data are most important for model decisions.
Company providing evaluation and red-teaming services, selected by the White House to conduct public AI assessments.
Google's library for efficient similarity search in large datasets.
AI behavior involving deceptive alignment and subversion of safety measures to gain power or achieve goals.
Scale AI's research lab focusing on model-assisted evaluation approaches.
Methods for assessing model reliability through multiple sampling and consistency checking.
Hallucination detection method based on self-consistency checking across multiple model generations.
Evaluation of how similar two pieces of text are in meaning, often using embedding-based approaches.
Task and benchmark for measuring the degree of semantic equivalence between text pairs.
Method for generating sentence-level embeddings that can be compared using cosine similarity.
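A minimal sketch of the comparison step, cosine similarity between two embedding vectors (plain Python; any sentence embedding model could produce the vectors):

    import math

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # Values near 1.0 indicate semantically similar sentences.
    print(cosine_similarity([0.2, 0.7, 0.1], [0.25, 0.6, 0.05]))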
Running new models alongside production models to compare performance without affecting users.
A framework using game theory concepts to explain model predictions by computing feature contribution values.
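An illustrative sketch using the shap library with a scikit-learn model (assuming shap and scikit-learn are installed):

    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    data = load_breast_cancer()
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(data.data, data.target)

    # TreeExplainer computes per-feature contribution values for each prediction.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(data.data[:5])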
A dataset of fact-seeking questions with short answers, measuring factual accuracy on the questions a model attempts to answer.
Vector representations where most values are zero, emphasizing only relevant information.
The proportion of actual negative cases correctly identified as negative.
Benchmark for evaluating reading comprehension capabilities.
See Semantic Textual Similarity.
Academic jailbreak benchmark testing model resistance against common adversarial attacks.
More challenging version of GLUE benchmark for language understanding evaluation.
A benchmark evaluating AI systems' ability to resolve real-world software engineering problems.
Evaluation metric determining whether an LLM agent successfully completes assigned tasks.
Machine learning framework providing comprehensive tools for building and training models with evaluation capabilities.
See NIST TEVV.
Independent assessment conducted by external organizations to ensure objective evaluation of AI systems.
Number of requests or tokens a model can process per unit time, measuring computational efficiency.
Measurement of how effectively models use their token budgets for generation tasks.
Application and SDK for measuring RAG LLM system performance with automated evaluation capabilities.
Evaluation metric determining whether an LLM agent calls the correct tools for given tasks.
Question answering dataset used for evaluating topic-specific knowledge.
Measurement of harmful, offensive, or inappropriate content generation by AI models.
See Sensitivity or Recall - the proportion of actual positives correctly identified.
Fairness metric requiring equal ratios of false negatives to false positives across demographic groups.
Fast explainer in the SHAP framework for tree-based models such as decision trees and gradient-boosted ensembles.
Question answering dataset testing factual knowledge across various topics.
See Specificity - the proportion of actual negatives correctly identified.
See Sensitivity or Recall - the proportion of actual positives correctly identified.
Open-source library for evaluating and tracking LLM applications through feedback functions and comprehensive tracing.
A benchmark evaluating LLMs' accuracy in providing truthful information using adversarial questions.
Unified evaluation framework for text generation using pre-trained language models.
Frameworks that standardize evaluation across multiple tasks, models, and metrics for consistent comparison.
Evaluation framework for LLM applications with focus on continuous monitoring and improvement.
Percentage of time a model or system is operational and available.
Federal initiative collaborating with AI companies on safety research, testing and evaluation.
Process of assessing model performance on held-out data to ensure generalization.
Assessment of systems designed to store and retrieve high-dimensional vectors efficiently.
The process of finding vectors in a database that are most similar to a query vector using distance metrics.
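A brute-force sketch using numpy (exact search by cosine similarity; production systems typically rely on approximate indexes instead):

    import numpy as np

    def top_k_similar(query, vectors, k=3):
        # Cosine similarity between the query and every stored vector.
        q = query / np.linalg.norm(query)
        v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        scores = v @ q
        top = np.argsort(-scores)[:k]
        return list(zip(top.tolist(), scores[top].tolist()))

    db = np.random.rand(1000, 64)
    print(top_k_similar(np.random.rand(64), db, k=3))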
Company providing hallucination evaluation models and leaderboards for LLM assessment.
Extension of HELM framework for comprehensive evaluation of Vision-Language Models.
Inference and serving library for large language models with support for fast evaluation.
Platform providing experiment tracking, model evaluation, and collaborative ML workflows.
Vector database company providing evaluation metrics and tools for search and recommendation systems.
Benchmark dataset for evaluating model performance on diverse, real-world tasks.
Multi-purpose moderation tool for assessing safety of user-LLM interactions.
Safety training dataset with 262K examples for improving model robustness against adversarial attacks.
Automatic red-teaming framework for identifying and reproducing human-devised attacks.
Benchmark testing commonsense reasoning through pronoun resolution tasks.
See Explainable AI.
Elon Musk's AI company, receiving low grades on AI Safety Index evaluations.
Extreme gradient boosting framework commonly used in machine learning competitions and evaluation studies.
Object detection algorithm whose performance is evaluated using computer vision metrics.
Benchmark evaluating logical reasoning abilities of LLMs via logic grid puzzles.
Testing model performance on tasks without providing task-specific training examples.
Chinese AI company included in AI Safety Index evaluations.
Company behind Milvus vector database, providing resources on embedding evaluation and similarity metrics.
Ensemble method that sequentially applies weak learners, with each focusing on previously misclassified examples.
Model's ability to maintain performance when faced with intentionally crafted malicious inputs.
Framework ensuring AI systems can be held responsible for their decisions and impacts.
Systematic examination of AI systems for bias, fairness, and compliance with ethical standards.
Neural network component that allows models to focus on relevant parts of input sequences.
Technique for normalizing layer inputs to improve training stability and convergence.
Statistical approach to model evaluation incorporating uncertainty quantification.
Phenomenon where AI systems amplify biases already present in their training data.
Problem where neural networks lose previously learned information when learning new tasks.
The ability of models to understand and generate novel combinations of known elements.
Changes in the underlying data distribution that affect model performance over time.
Privacy-preserving technique that adds noise to data or model outputs to protect individual privacy.
Changes in data distribution between training and testing phases that can degrade model performance.
Machine learning technique combining multiple models to improve overall performance.
Uncertainty arising from limited knowledge or data, reducible with more information.
Machine learning approach where models are trained across decentralized data sources.
Assessment of model performance after task-specific training on pre-trained models.
Optimization algorithm used to minimize loss functions in machine learning models.
Process of finding optimal configuration parameters for machine learning models.
Technique for transferring knowledge from large models to smaller, more efficient ones.
Statistical sampling method used in Bayesian model evaluation.
Learning to learn - algorithms that improve their learning efficiency through experience.
Framework for sequential decision-making under uncertainty, used in online evaluation.
Automated method for finding optimal neural network architectures.
Problem where models perform well on training data but poorly on unseen data.
Techniques to prevent overfitting by adding constraints or penalties to model complexity.
Learning approach using both labeled and unlabeled data for training.
Technique leveraging knowledge from pre-trained models for new tasks.
Problem where models are too simple to capture underlying data patterns.
Professional organization publishing AI evaluation research and standards.
Conference focusing on ethical considerations in AI development and evaluation.
Organization promoting AI research and responsible development practices.
Conference dedicated to fairness and accountability in algorithmic systems.
Premier venue for machine learning research including evaluation methodologies.
International standards for AI systems including evaluation and testing frameworks.
Collaborative effort to study and formulate best practices on AI technologies.
European Union regulation establishing requirements for AI systems including evaluation standards.
Framework establishing principles for responsible AI development and deployment.
European privacy regulation affecting AI evaluation practices and data handling.
Documentation standard for AI model transparency and accountability.
Structured approach to evaluating and managing AI risks throughout development.
Assessment frameworks for self-driving car AI systems including safety and performance metrics.
Specialized assessment for medical AI applications focusing on patient safety and efficacy.
Assessment of AI systems used in security applications including threat detection and response.
Frameworks for assessing AI tutoring systems and educational technology effectiveness.
Specialized assessment for AI in banking, trading, and financial services with regulatory compliance.
Assessment frameworks for AI systems used in legal applications including bias and fairness.
Specialized assessment for defense applications including ethical and strategic considerations.
Assessment of content moderation and recommendation algorithms for harmful content detection.
Technology enabling direct communication between brain and computer systems, requiring specialized evaluation.
Virtual replicas of physical systems used for simulation and evaluation.
Assessment of AI systems running on local devices with resource constraints.
Brain-inspired computing architectures requiring specialized evaluation approaches.
Integration of quantum computing with machine learning, requiring new evaluation frameworks.
Collective behavior of decentralized systems requiring specialized evaluation metrics.
Process of removing personally identifiable information from datasets used in evaluation.
Tracking of data flow and transformations throughout the machine learning pipeline.
Documentation of data origin, ownership, and processing history for evaluation datasets.
Assessment approaches that preserve data privacy by keeping data distributed.
Cryptographic technique allowing computation on encrypted data for privacy-preserving evaluation.
Creation of artificial datasets for evaluation without exposing real sensitive data.
Organizational committees overseeing ethical AI development and evaluation practices.
Frameworks for managing AI development, deployment, and evaluation within organizations.
Metrics for measuring business value and effectiveness of AI system implementations.
Evaluation frameworks for selecting and monitoring third-party AI service providers.
Executive role responsible for AI strategy and evaluation oversight.
Practices for deploying and maintaining machine learning systems including evaluation pipelines.
Framework for identifying, assessing, and mitigating risks associated with AI model deployment.
This comprehensive dictionary draws from leading sources spanning academic institutions, industry organizations, evaluation frameworks, government and standards bodies, and research venues.
This dictionary represents the most comprehensive compilation of AI evaluation terminology as of 2025, covering foundational concepts, cutting-edge methodologies, practical tools, industry standards, regulatory frameworks, and emerging technologies. The field continues to evolve rapidly with new metrics, benchmarks, evaluation approaches, companies, and regulatory requirements being developed regularly. This reference should serve as the definitive one-stop resource for anyone working in or studying AI evaluation and testing.