The TestLM Ultimate AI Evaluation & Testing Dictionary
The most comprehensive reference guide covering ALL essential terms, tools, frameworks, concepts, companies, acronyms, and methodologies in AI model evaluation and testing.

A

Accuracy

The percentage of correct predictions made by a model out of total predictions. In classification tasks, calculated as (True Positives + True Negatives) / Total Predictions.
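A minimal sketch of that formula (the function name and data are illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels:
    (TP + TN) / total for the binary case."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 3 of 4 predictions are correct
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```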

 

ADeLe (Annotated Demand Levels)

A methodology that assesses task difficulty for AI models using measurement scales for 18 types of cognitive and knowledge-based abilities.

 

A/B Testing

Comparative evaluation methodology testing different model versions or prompts against each other to determine which performs better.

 

Adversarial Evaluation

Testing with intentionally challenging inputs designed to expose model weaknesses, biases, or failure modes.

 

AGI (Artificial General Intelligence)

Hypothetical AI systems with human-level cognitive abilities across all domains, representing a key milestone that companies like OpenAI and DeepMind aim to achieve.

 

AI Benchmarking

The systematic evaluation of AI models using standardized datasets and metrics to assess their capabilities, limitations, and performance across various tasks.

 

AI Lab Watch

Organization monitoring AI companies' safety practices and evaluating their preparedness for advanced AI risks.

 

AI Safety Institute (AISI)

Government organizations (UK AISI, US AI Safety Institute) developing evaluations for advanced AI systems, focusing on misuse risks, societal impacts, and autonomous capabilities.

 

AI Safety Index

Future of Life Institute's grading system for AI companies' safety practices, with Anthropic receiving the highest grade (C) and Meta receiving an F.

 

AI Safety Level (ASL)

Anthropic's classification system for AI systems based on risk levels, with ASL-3 involving enhanced security and deployment standards for models with potential dangerous capabilities.

 

Alignment Research Center (ARC)

Third-party organization conducting safety evaluations of frontier AI models for dangerous capabilities like resource accumulation and self-replication.

 

AlphaGo

DeepMind's famous AI system that defeated human champions at the game of Go, representing a major breakthrough in AI capabilities evaluation.

 

ANN (Approximate Nearest Neighbor)

Algorithms used to find the nearest neighbors of a query point in high-dimensional datasets, trading small amounts of accuracy for significant speed improvements.

 

Answer Relevancy

Evaluation metric determining whether an LLM output addresses the given input in an informative and concise manner.

 

Anthropic

AI safety company founded by Dario and Daniela Amodei, known for Claude models and Constitutional AI approach, receiving the highest grade on AI Safety Index.

 

Apollo Research

Third-party research institute that evaluates AI models for safety risks, including testing for deceptive behavior and scheming capabilities.

 

ARC (Alignment Research Center)

See Alignment Research Center.

 

ARES (Automated Evaluation Framework for Retrieval-Augmented Generation Systems)

Framework for automated evaluation of RAG systems.

 

Arize Phoenix

Open-source observability and evaluation platform for LLM applications with focus on tracing and debugging.

 

ASL (AI Safety Level)

See AI Safety Level.

 

Aspect Criticism

Evaluation approach focusing on specific aspects of model outputs (e.g., summary accuracy, coherence, relevance).

 

Attention Visualization

Techniques for visualizing attention weights in transformer models to understand what the model focuses on.

 

AUC-ROC (Area Under the Receiver Operating Characteristic Curve)

Metric measuring the ability of a binary classifier to distinguish between classes across all classification thresholds.
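One way to see the metric: AUC-ROC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A sketch of that pairwise formulation (names are illustrative; libraries such as scikit-learn compute this more efficiently):

```python
def auc_roc(y_true, scores):
    """AUC-ROC via the pairwise (Mann-Whitney) formulation:
    fraction of positive/negative pairs ranked correctly, ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```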

 

AutoGPTQ

Quantization library for transformer models to reduce memory requirements while maintaining performance.

 

Automated Evaluation

Systematic assessment using computational metrics and algorithms for scalable, reproducible measurements.

 

Azure OpenAI Evaluation

Microsoft's cloud-based evaluation platform for assessing model performance across key metrics including factuality and semantic similarity.

 


 

B

BBQ (Bias Benchmark for QA)

Evaluation framework measuring social biases against protected classes along nine social dimensions.

 

Benchmark

A standardized test or measurement used to evaluate the performance of AI models against established criteria or other models.

 

BERTScore

An evaluation metric that uses BERT embeddings to compute similarity between generated and reference texts, capturing semantic similarity beyond surface-level n-gram matching.

 

Bias

Systematic unfairness in model outputs toward specific groups or demographics, measured through demographic parity, equalized odds, and other fairness metrics.

 

Big-Bench Hard (BBH)

A suite of 23 challenging BIG-Bench tasks where prior language model evaluations did not outperform average human raters.

 

BigCode

Research initiative focused on large-scale code generation models and their evaluation.

 

BigCodeBench

A coding benchmark where AI systems achieve 35.5% success rate compared to 97% human performance.

 

BigScience

International collaboration that developed BLOOM language model and contributed to evaluation frameworks.

 

BLEU Score (Bilingual Evaluation Understudy)

A precision-based metric that evaluates machine-generated text by comparing n-gram overlap with reference texts, commonly used in machine translation evaluation.
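Full BLEU combines clipped n-gram precisions for n = 1..4 with a brevity penalty; the sketch below shows only the clipped-precision component at its core (function name and tokenization are simplified assumptions):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0
```

Clipping is what stops a degenerate output like "the the the" from scoring perfect unigram precision.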

 

BLEURT

A learned evaluation metric that uses fine-tuned BERT models to assess text generation quality.

 

BLOOM

Large language model developed by BigScience collaboration, representing open-source alternatives to proprietary models.

 

Bootstrapping

Statistical method for estimating sampling distributions by resampling with replacement.
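In evaluation practice this is often used to put a confidence interval around a benchmark score. A percentile-bootstrap sketch (function name, defaults, and statistic are illustrative):

```python
import random

def bootstrap_ci(values, stat=lambda v: sum(v) / len(v),
                 n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a statistic (default: the mean):
    resample with replacement, recompute, take the alpha/2 quantiles."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values])
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```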

 


 

C

CAI (Constitutional AI)

Anthropic's training methodology using AI feedback to evaluate outputs according to a set of principles, implemented in Claude models.

 

Calibration

A model's ability to provide confidence estimates that accurately reflect the likelihood of its predictions being correct.
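One common (though not the only) way to quantify this is Expected Calibration Error: bin predictions by confidence and measure the gap between average confidence and accuracy in each bin. A sketch, with illustrative names:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted |mean confidence - accuracy| across bins.
    `correct` holds 1/0 flags for whether each prediction was right."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(o for _, o in b) / len(b)
            ece += len(b) / n * abs(avg_conf - acc)
    return ece
```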

 

CBRN (Chemical, Biological, Radiological, Nuclear)

Categories of weapons of mass destruction that AI safety evaluations assess models' potential to help develop or acquire.

 

Chain-of-Thought (CoT) Evaluation

Assessment of models' reasoning capabilities through step-by-step problem-solving approaches.

 

Chatbot Arena

A crowdsourced platform where users interact with two anonymous LLMs simultaneously and vote for the better response, used to compute Elo ratings.

 

Classification Metrics

Set of metrics for evaluating classification models including accuracy, precision, recall, F1-score, specificity, and sensitivity.

 

Claude

Anthropic's AI assistant family, including Claude 3.5 Sonnet and Claude Opus 4, known for Constitutional AI training and safety features.

 

Claude's Constitution

Set of principles used in Constitutional AI training, drawing from UN Declaration of Human Rights and other ethical frameworks.

 

CNN (Convolutional Neural Network)

Neural network architecture particularly effective for image processing, used in computer vision evaluation tasks.

 

Coherence

Evaluation of how well sentences and paragraphs flow together to form unified and understandable responses.

 

Cohere

AI company developing language models and evaluation frameworks; its researchers recently studied potential gaming of the Chatbot Arena leaderboard.

 

COMET

A learned metric for machine translation evaluation that uses cross-lingual pre-trained models.

 

Comet Opik

Open-source end-to-end LLM evaluation and monitoring platform with prompt playground capabilities.

 

Confident AI

Company behind DeepEval framework, providing hosted evaluation platform for LLM applications.

 

Confusion Matrix

A table layout allowing visualization of the performance of a classification algorithm, showing true positives, false positives, true negatives, and false negatives.
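The four cells can be computed directly from label/prediction pairs; a minimal sketch (names are illustrative):

```python
def confusion_counts(y_true, y_pred):
    """Binary confusion-matrix cells: TP, FP, TN, FN."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(t == 1 and p == 1 for t, p in pairs),
        "FP": sum(t == 0 and p == 1 for t, p in pairs),
        "TN": sum(t == 0 and p == 0 for t, p in pairs),
        "FN": sum(t == 1 and p == 0 for t, p in pairs),
    }
```

Most classification metrics in this glossary (accuracy, precision, recall, specificity) are ratios of these four counts.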

 

Constitutional AI

See CAI.

 

Context Length Evaluation

Assessment of model performance across different input context lengths.

 

Context Precision

RAG evaluation metric measuring the quality of retrieved information relative to the query.

 

Context Recall

RAG evaluation metric measuring the completeness of retrieved information relative to what should have been retrieved.

 

Contextual Embeddings

Vector representations where the same word has different embeddings based on surrounding context, crucial for semantic evaluation.

 

CoQA

Conversational Question Answering dataset used for evaluating multi-turn dialogue capabilities.

 

Correctness

Evaluation metric determining whether an LLM output is factually correct based on ground truth or reference standards.

 

Cosine Similarity

A measure of similarity between two non-zero vectors defined as the cosine of the angle between them, commonly used for comparing embeddings.
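A minimal sketch of the formula cos(θ) = (a · b) / (‖a‖ ‖b‖), with illustrative names:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors:
    dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Parallel vectors score 1.0, orthogonal vectors 0.0, which is why it suits comparing embedding directions regardless of magnitude.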

 

CoT (Chain-of-Thought)

See Chain-of-Thought Evaluation.

 

Counterfactual Fairness

A fairness criterion where predictions remain unchanged when protected attributes are flipped to counterfactual values.

 

Cross-Entropy Loss

A loss function measuring the difference between predicted and true probability distributions, widely used in LLM training and evaluation.
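A sketch of the discrete form H(p, q) = −Σᵢ pᵢ log qᵢ (names are illustrative; the small epsilon guards against log(0)):

```python
import math

def cross_entropy(p_true, p_pred, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i) over a discrete distribution."""
    return -sum(p * math.log(q + eps) for p, q in zip(p_true, p_pred))
```

With a one-hot target, this reduces to the negative log-probability the model assigns to the correct class.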

 


 

D

Data Contamination

The phenomenon where test data inadvertently appears in training datasets, compromising evaluation integrity and leading to inflated performance scores.

 

DCG (Discounted Cumulative Gain)

A ranking quality metric that measures the total item relevance in a list with a discount for items further down the list.

 

Deepchecks

Evaluation framework focusing on LLM evaluation with emphasis on dashboard visualization and UI for evaluation results.

 

DeepEval

Open-source LLM evaluation framework by Confident AI offering 14+ evaluation metrics for RAG and fine-tuning use cases.

 

DeepLIFT

Method for decomposing predictions of neural networks by comparing activations to reference activations.

 

DeepMind

Google's AI research lab known for AlphaGo, protein folding breakthroughs, and safety evaluation research.

 

DeepSeek

Chinese AI lab that developed R1 model, demonstrating cost-effective training methods that surprised competitors.

 

Demographic Parity

A fairness metric ensuring that the probability of receiving a positive outcome is the same across all groups defined by a sensitive attribute.
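A common way to report this is the largest gap in positive-prediction rates across groups; a sketch with illustrative names:

```python
def demographic_parity_difference(y_pred, groups):
    """Gap between the highest and lowest positive-prediction rates
    across groups; 0.0 means exact demographic parity."""
    by_group = {}
    for p, g in zip(y_pred, groups):
        by_group.setdefault(g, []).append(p)
    rates = [sum(v) / len(v) for v in by_group.values()]
    return max(rates) - min(rates)
```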

 

Dense Embeddings

Continuous, real-valued vectors representing information in high-dimensional space where every element contains non-zero values.

 

Direct Scoring

LLM-as-a-Judge methodology evaluating single outputs with numerical scores rather than comparative assessment.

 

Disparate Impact

A relaxation of demographic parity requiring the ratio of positive-outcome rates across groups to exceed a specified threshold (commonly 0.8, the "four-fifths rule") rather than be exactly equal.

 

DSPy

Framework for programming with foundation models, supporting automatic optimization of prompts and weights.

 

Dynamic Benchmarks

Evaluation frameworks that evolve over time to prevent gaming and maintain challenge levels.

 


 

E

Elo Rating for LLMs

Application of the Elo chess rating system (named after Arpad Elo, so not an acronym) to compare language models through pairwise human preference battles.
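A sketch of a single rating update after one pairwise battle (names and the K-factor of 32 are illustrative; Chatbot Arena uses a more sophisticated aggregate fit):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update. score_a: 1 if A wins, 0 if A loses, 0.5 for a tie.
    Expected score follows the logistic curve with a 400-point scale."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

print(elo_update(1000, 1000, 1))  # (1016.0, 984.0)
```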

 

EleutherAI

Open-source AI research organization behind the LM Evaluation Harness and various language models.

 

EleutherAI LM Evaluation Harness

A unified framework for testing generative language models on academic benchmarks with standardized evaluation protocols.

 

Embedding Drift

The phenomenon where embedding representations change over time, affecting downstream task performance.

 

Embedding Evaluation

Assessment of vector representations of text, images, or other data types for their ability to capture semantic relationships and enable downstream tasks.

 

Embedding Quality

Assessment of how well vector representations capture semantic relationships and enable downstream tasks.

 

Equal Confusion Fairness

Requirement that confusion matrices have the same distribution across all sensitive characteristics.

 

Equal Opportunity

A fairness metric ensuring that qualified individuals from all groups have the same chance of receiving positive outcomes.

 

Equalized Odds

A fairness metric requiring equal true positive rates and false positive rates across different demographic groups.

 

EQ-Bench

Benchmark evaluating emotional intelligence capabilities of language models.

 

Euclidean Distance

A measure of the straight-line distance between two points in Euclidean space, commonly used for comparing vectors.

 

Evaluation Harness

Frameworks providing standardized environments for running multiple evaluation benchmarks consistently across different models.

 

Evalverse

Unified evaluation library integrating existing frameworks like lm-evaluation-harness and FastChat.

 

Evidently

Open-source Python library for ML model evaluation and monitoring with focus on data drift and model performance.

 

Explainable AI (XAI)

Methods and techniques for making AI model decisions interpretable and understandable to humans.

 


 

F

F1 Score

The harmonic mean of precision and recall, providing a single metric that balances both measures.
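A minimal sketch of the harmonic mean, F1 = 2PR / (P + R), with illustrative names:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Unlike the arithmetic mean, the harmonic mean is dragged toward the smaller of the two values, so a model cannot score well by maximizing one measure at the other's expense.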

 

Factual Consistency

Assessment of whether generated content aligns with provided source materials without introducing false information.

 

FAIR (Fundamental AI Research)

Meta's AI research lab, recently restructured under Meta's efficiency initiatives and facing researcher departures.

 

FairLearn

Microsoft's open-source toolkit for assessing and mitigating unfairness in machine learning models.

 

Fairness Metrics

Mathematical measures used to assess bias and ensure equitable treatment across different demographic groups.

 

False Negative Rate (FNR)

The proportion of actual positive cases that were incorrectly classified as negative.

 

False Positive Rate (FPR)

The proportion of actual negative cases that were incorrectly classified as positive.

 

FastChat

LMSYS Org's evaluation framework supporting LLM-Judge evaluation for MT-Bench.

 

Faithfulness

RAG evaluation metric measuring how accurately generated responses align with retrieved context without hallucination.

 

Feature Attribution

Methods for determining which input features most influence model predictions.

 

FEVER (Fact Extraction and VERification)

Dataset for fact-checking evaluation, assessing models' ability to verify statement accuracy.

 

Few-Shot Evaluation

Testing model performance with a small number of examples provided in the prompt.

 

FinBen

Benchmark evaluating LLMs in financial domain across 36 datasets covering 24 tasks in seven financial areas.

 

Fluency

Assessment of how natural, grammatically correct, and readable generated text appears.

 

FNR (False Negative Rate)

See False Negative Rate.

 

FPR (False Positive Rate)

See False Positive Rate.

 

FrontierMath

A benchmark evaluating models on extremely difficult mathematics problems where AI systems currently solve only 2% of problems.

 

Future of Life Institute

Nonprofit organization that created the AI Safety Index and the "pause letter" calling for a six-month pause on training AI systems more powerful than GPT-4.

 


 

G

G-Eval

A framework using large language models to evaluate text quality with chain-of-thought reasoning and probability scoring.

 

GELU (Gaussian Error Linear Unit)

An activation function providing a smooth, everywhere-differentiable alternative to ReLU.

 

Gemini

Google's family of large language models competing with GPT and Claude, including Gemini 2.5.

 

GitHub Copilot

AI-powered code completion tool whose effectiveness is evaluated through coding benchmarks.

 

GLUE (General Language Understanding Evaluation)

Benchmark for evaluating language understanding across multiple tasks.

 

Google Brain

Former Google AI research lab, merged with DeepMind in 2023 to form Google DeepMind.

 

Google DeepMind

See DeepMind.

 

GPT (Generative Pre-trained Transformer)

OpenAI's family of language models, including GPT-3, GPT-4, and variants used in evaluation studies.

 

GPQA (Graduate-Level Google-Proof Q&A)

A challenging benchmark designed to evaluate expertise in biology, physics, and chemistry at PhD level.

 

Grad-CAM (Gradient-weighted Class Activation Mapping)

A technique for making convolutional neural network decisions transparent by highlighting important regions in input images.

 

Gradient Boosting

An ensemble method combining multiple weak learners sequentially, where each learner corrects errors from previous ones.

 

Grammatical Correctness

Assessment of whether output text is free from grammatical errors such as incorrect verb conjugations and syntactical mistakes.

 

GSM8k

Math word problem dataset commonly used in language model evaluation.

 


 

H

HaluEval

A benchmark evaluating LLM performance in recognizing hallucinations in question-answering, dialogue, and summarization tasks.

 

Hallucination Detection

The identification and quantification of instances where language models generate information that appears plausible but is factually incorrect.

 

Hallucination Rate

The frequency at which language models generate false or unsupported information.

 

Harmfulness

Evaluation category assessing potential for AI systems to cause harm through toxic, biased, or dangerous outputs.

 

Haystack

Framework for building search systems and RAG applications with built-in evaluation capabilities.

 

HealthPariksha

Framework for testing medical chatbots on factual correctness, safety, and ethical standards.

 

HELM (Holistic Evaluation of Language Models)

A comprehensive benchmarking framework developed by Stanford CRFM evaluating language models across 42 scenarios and 7 metrics.

 

HellaSwag

A benchmark testing commonsense reasoning by requiring models to choose sensible sentence completions.

 

HHEM (Hughes Hallucination Evaluation Model)

Vectara's evaluation model specifically designed to detect and quantify hallucinations in LLM outputs.

 

HILT (Human-in-the-Loop Testing)

Evaluation methodology incorporating human judgment and feedback in the testing process.

 

Hit Rate

The proportion of queries for which at least one relevant item appears in the top-k results.
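A minimal sketch of that definition (names are illustrative; relevant items are given as sets per query):

```python
def hit_rate_at_k(results_per_query, relevant_per_query, k=10):
    """Fraction of queries with at least one relevant item in the top-k."""
    hits = sum(
        any(item in relevant for item in results[:k])
        for results, relevant in zip(results_per_query, relevant_per_query)
    )
    return hits / len(results_per_query)
```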

 

HNSW (Hierarchical Navigable Small World)

Algorithm for approximate nearest neighbor search in high-dimensional spaces, used in vector databases.

 

HoneyHive

Platform for tracing LLM execution flows and creating evaluation datasets from production data.

 

HotPotQA

Multi-hop reasoning dataset requiring models to connect information across multiple documents.

 

Human Evaluation

Assessment of model outputs by human judges for qualities like fluency, coherence, relevance, and adequacy.

 

HumanEval

A benchmark for evaluating code generation capabilities through programming challenges.

 

Humanity's Last Exam

A rigorous academic test on which top AI systems score only 8.80%, making it one of the most challenging current evaluation benchmarks.

 


 

I

IDCG (Ideal Discounted Cumulative Gain)

The best possible DCG score, used to normalize DCG into NDCG.

 

IEEE

Institute of Electrical and Electronics Engineers, publishes AI evaluation research and standards.

 

IFEval (Instruction Following Evaluation)

A benchmark measuring models' ability to follow specific instructions accurately.

 

In-Context Learning Evaluation

Testing models' ability to learn from examples provided in the prompt.

 

Individual Fairness

Fairness principle requiring that similar individuals receive similar treatment from AI systems.

 

Information Retrieval Metrics

Metrics specifically designed for evaluating search and retrieval systems, including precision@k, recall@k, MAP, MRR, and NDCG.

 

Inner Product (Dot Product)

A mathematical operation computing the sum of products of corresponding elements of two vectors, used as a similarity measure.

 

Inspect

UK AI Safety Institute's open-source evaluation platform for assessing AI model capabilities and safety risks.

 

Instruction Following

Evaluation of models' ability to follow complex, multi-step instructions accurately.

 

Instruction Hierarchy

Evaluation of how well models prioritize instructions by source, with system messages taking precedence over developer messages, and developer messages over user messages.

 

Integrated Gradients

A technique for attributing predictions of classification models to input features by computing gradients along a path.

 

Interpretability

The degree to which humans can understand and explain AI model decisions and reasoning processes.

 

Intrinsic Evaluation

Assessment of embeddings based on their internal properties rather than downstream task performance.

 

IVF (Inverted File)

Indexing method for efficient similarity search in large vector databases.

 


 

J

Jailbreaking

Adversarial prompts designed to circumvent model safety training and induce harmful content generation.

 


 

K

KMMLU

Korean version of MMLU benchmark for testing knowledge in Korean language context.

 


 

L

LangChain

Framework for building LLM applications with integrated evaluation and monitoring capabilities.

 

LangFuse

Open-source LLM engineering platform providing tracing, evaluation, prompt management, and analytics.

 

LangSmith

Evaluation and observability platform by LangChain for debugging, testing, and monitoring LLM applications.

 

Latency

Time required for a model to produce output after receiving input, critical for real-time applications.

 

Layer-wise Relevance Propagation (LRP)

Method for explaining neural network decisions by propagating relevance scores backward through layers.

 

LegalBench

Benchmark for measuring legal reasoning capabilities in large language models.

 

Length Bias

Systematic preference for responses of certain lengths regardless of quality.

 

LightEval

Hugging Face's evaluation framework built on top of EleutherAI's lm-evaluation-harness.

 

LightGBM

Gradient boosting framework used in machine learning with applications in model evaluation.

 

LIME (Local Interpretable Model-agnostic Explanations)

A technique providing local explainability by approximating black box models with interpretable models for individual predictions.

 

LiteLLM

Universal API wrapper for LLM providers, supporting evaluation across multiple model endpoints.

 

LlamaIndex

Framework for building LLM applications with built-in evaluation modules for retrieval and response quality.

 

LLM-as-a-Judge

An evaluation methodology using large language models to assess text output quality based on custom criteria defined in evaluation prompts.

 

LMSYS

Organization behind Chatbot Arena and FastChat evaluation frameworks.

 

Logit

The raw output scores from a neural network before applying activation functions like softmax.

 

LRP (Layer-wise Relevance Propagation)

See Layer-wise Relevance Propagation.

 


 

M

Manhattan Distance (L1 Distance)

The sum of absolute differences between corresponding elements of two vectors.

 

MAP (Mean Average Precision)

A ranking metric that averages precision values across multiple recall levels and queries.

 

MATH

Dataset of mathematics problems used for evaluating mathematical reasoning capabilities.

 

METEOR (Metric for Evaluation of Translation with Explicit Ordering)

An evaluation metric addressing synonymy issues in text evaluation, providing advantages over simpler metrics.

 

Meta

Company behind FAIR research lab and LLaMA models, recently announcing Meta Superintelligence Labs.

 

Meta Superintelligence Labs (MSL)

Meta's reorganized AI research organization led by Alexandr Wang as Chief AI Officer.

 

MFC-Bench

Benchmark for multimodal fact-checking with large vision-language models.

 

MiniCheck

Efficient fact-checking framework for LLMs using grounding documents.

 

MLflow Evaluate

Platform for standardized LLM evaluation with custom metrics and experimentation tracking.

 

MMLU (Massive Multitask Language Understanding)

A benchmark testing knowledge across 57 academic subjects, from elementary to professional levels.

 

MMLU-Pro

Enhanced version of MMLU with more challenging, reasoning-focused questions and ten answer options instead of four.

 

MMMU

Benchmark for evaluating multimodal understanding across various domains.

 

Model Cards

Documentation providing transparency about AI model capabilities, limitations, and potential risks.

 

Model Drift

Changes in model performance over time due to shifts in data distribution or model degradation.

 

Model Interpretability

The extent to which humans can understand the reasoning behind AI model predictions and decisions.

 

MoverScore

An embedding-based evaluation metric using optimal transport to align embeddings between reference and generated text.

 

MRR (Mean Reciprocal Rank)

A ranking metric focusing on the position of the first relevant item in ranked results.
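A minimal sketch (illustrative names; queries with no relevant item in the results contribute zero):

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """Average of 1/rank of the first relevant item per query."""
    total = 0.0
    for results, relevant in zip(results_per_query, relevant_per_query):
        for rank, item in enumerate(results, start=1):
            if item in relevant:
                total += 1 / rank
                break
    return total / len(results_per_query)
```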

 

MSL (Meta Superintelligence Labs)

See Meta Superintelligence Labs.

 

MT-Bench (Multi-Turn Benchmark)

A benchmark designed to evaluate LLMs' ability to sustain multi-turn conversations.

 

MTEB (Massive Text Embedding Benchmark)

Comprehensive benchmark for evaluating text embedding models across various tasks.

 

Multi-Metric Evaluation

Assessment approach using multiple complementary metrics to capture different aspects of model performance.

 


 

N

NDCG (Normalized Discounted Cumulative Gain)

A ranking quality metric comparing rankings to ideal order where all relevant items are at the top.
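The standard formulation divides DCG (defined above, with a log₂ rank discount) by the ideal DCG; a sketch with illustrative names:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: rel_i / log2(rank + 1), ranks 1-indexed."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the ideal (descending-sorted) DCG, so 1.0 is perfect."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```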

 

Negative Predictive Value (NPV)

The proportion of negative predictions that are actually negative.

 

NeurIPS

Leading AI conference where research papers on evaluation methods are published.

 

NIST TEVV (Test, Evaluation, Validation and Verification)

U.S. national initiative for developing AI measurement standards and evaluation methodologies.

 

NNSA (National Nuclear Security Administration)

U.S. agency collaborating with Anthropic on red-teaming Claude models for nuclear security risks.

 

Noise Sensitivity

RAG evaluation metric measuring robustness to irrelevant information in retrieved context.

 

NPV (Negative Predictive Value)

See Negative Predictive Value.

 


 

O

OLMo

Open Language Model with publicly available training data and evaluation code.

 

Online Evaluation

Real-time assessment of model performance in production environments.

 

OOD (Out-of-Distribution) Evaluation

Testing model performance on data that differs from the training distribution.

 

OpenAI

AI company behind GPT models and OpenAI Evals framework, receiving D+ grade on AI Safety Index.

 

OpenAI Evals

OpenAI's framework for evaluating AI models with basic evaluation templates and model-graded assessments.

 

OpenCompass

LLM evaluation platform supporting evaluations across multiple domains including finance, healthcare, and law.

 

OpenInference

Open standard for capturing and storing AI model inferences to enable evaluation and observability.

 


 

P

Pairwise Comparison

LLM-as-a-Judge methodology comparing two outputs to determine which is better according to specified criteria.

 

Palantir

Data analytics company partnering with Anthropic to provide Claude to U.S. intelligence agencies.

 

Parameter Efficiency

The relationship between model size (number of parameters) and performance, measuring computational resource effectiveness.

 

Perplexity

A measure of how well a language model predicts a sequence of words, with lower values indicating better performance.
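Perplexity is the exponentiated average negative log-likelihood per token; a sketch assuming natural-log token probabilities (names are illustrative):

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

A model assigning each token probability 1/4 has perplexity 4: it is, on average, as uncertain as a uniform choice among four options.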

 

PersonQA

Evaluation dataset for testing hallucination in personal question answering scenarios.

 

Phoenix

See Arize Phoenix.

 

Position Bias

Systematic preference for items in certain positions of a ranked list.

 

Positive Predictive Value (PPV)

See Precision: the proportion of positive predictions that are actually positive.

 

Power Analysis

Statistical method for determining sample sizes needed to detect effects of specified size.

 

PPV (Positive Predictive Value)

See Precision.

 

Precision

The ratio of true positive predictions to total positive predictions made by the model.

 

Precision@K

The proportion of relevant items among the top K retrieved items.
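A minimal sketch (illustrative names; relevant items given as a set):

```python
def precision_at_k(results, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(item in relevant for item in results[:k]) / k
```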

 

Predictive Parity

A fairness metric requiring equal precision rates across different demographic groups.

 

Prompt Engineering Evaluation

Assessment of how different prompt formulations affect model performance.

 

PyTorch

Machine learning framework providing tools for building and training models with evaluation capabilities.

 


 

Q

QSAR (Quantitative Structure-Activity Relationships)

Method for modeling relationships between chemical structures and pharmacological activity, used in toxicity prediction.

 


 

R

RAGAS (RAG Assessment)

A comprehensive framework for evaluating Retrieval-Augmented Generation systems across multiple dimensions.

 

RAGChecker

Evaluation tool for assessing RAG system performance and quality.

 

Ranking Metrics

Evaluation measures for systems that return ordered lists of results, including MAP, MRR, NDCG, and precision@k.

 

RealToxicityPrompts

Dataset for evaluating model safety and toxicity detection capabilities.

 

Recall

The ratio of true positive predictions to total actual positive instances.

 

Recall@K

The proportion of relevant items that appear in the top K retrieved items.

 

Red Teaming

Deploying domain experts to interact with models and test capabilities while attempting to break model safeguards.

 

Reference-based Evaluation

Assessment methods that compare model outputs against gold standard references or ground truth.

 

Regression Testing

Evaluating models on consistent test sets across iterations to detect performance degradation.

 

Responsible Scaling Policy (RSP)

Anthropic's framework categorizing AI systems into different AI Safety Levels with associated safety measures.

 

Retrieval Evaluation

Assessment of information retrieval systems using metrics like precision, recall, MAP, MRR, and NDCG.

 

RLHF (Reinforcement Learning from Human Feedback)

Training methodology using human preferences to improve model alignment and safety.

 

Robustness

A model's ability to maintain performance when faced with adversarial inputs, distribution shifts, or edge cases.

 

ROC (Receiver Operating Characteristic)

Curve plotting true positive rate against false positive rate, used in binary classification evaluation.

 

ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation)

A recall-focused metric family for evaluating text summarization quality.

 

RSP (Responsible Scaling Policy)

See Responsible Scaling Policy.

 


 

S

Safety Evaluation

Assessment of AI systems for potential harms, including toxicity, bias, misinformation, and misuse risks.

 

Saliency Maps

Visualizations showing which parts of input data are most important for model decisions.

 

Scale AI

Company providing evaluation and red-teaming services, selected by White House to conduct public AI assessments.

 

ScaNN (Scalable Nearest Neighbors)

Google's library for efficient similarity search in large datasets.

 

Scheming

AI behavior involving deceptive alignment and subversion of safety measures to gain power or achieve goals.

 

SEAL (Safety, Evaluations, and Analysis Lab)

Scale AI's research lab focusing on model-assisted evaluation approaches.

 

Self-Consistency Evaluation

Methods for assessing model reliability through multiple sampling and consistency checking.

 

SelfCheckGPT

Hallucination detection method based on self-consistency checking across multiple model generations.

 

Semantic Similarity

Evaluation of how similar two pieces of text are in meaning, often using embedding-based approaches.

 

Semantic Textual Similarity (STS)

Task and benchmark for measuring the degree of semantic equivalence between text pairs.

 

Sentence-BERT (S-BERT)

Method for generating sentence-level embeddings that can be compared using cosine similarity.
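
A minimal sketch of the comparison step; the 3-d vectors below stand in for real sentence embeddings, which have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1 means
    identical direction, 0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```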

 

Shadow Testing

Running new models alongside production models to compare performance without affecting users.

 

SHAP (SHapley Additive exPlanations)

A framework using game theory concepts to explain model predictions by computing feature contribution values.
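
As an illustrative sketch (not the SHAP library itself), exact Shapley values for a toy value function can be computed by averaging each feature's marginal contribution over all orderings; real SHAP implementations replace this exponential-cost computation with fast approximations:

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Exact Shapley values: average marginal contribution of each feature
    over all feature orderings. Exponential cost, toy examples only."""
    phi = {f: 0.0 for f in features}
    perms = list(permutations(features))
    for order in perms:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            phi[f] += value_fn(frozenset(present)) - before
    return {f: v / len(perms) for f, v in phi.items()}

# Hypothetical additive value function: each feature contributes a fixed amount,
# so its Shapley value equals that contribution.
contrib = {"age": 2.0, "income": 3.0}
v = lambda s: sum(contrib[f] for f in s)
print(shapley_values(list(contrib), v))  # {'age': 2.0, 'income': 3.0}
```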

 

SimpleQA

A dataset of short, fact-seeking questions measuring whether models answer accurately and how often they attempt an answer rather than abstain.

 

Sparse Embeddings

Vector representations where most values are zero, emphasizing only relevant information.

 

Specificity (True Negative Rate)

The proportion of actual negative cases correctly identified as negative.
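
A minimal sketch of how specificity and its companion sensitivity (the true positive rate defined under TPR below) fall out of confusion-matrix counts; the numbers are made up:

```python
def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def sensitivity(tp, fn):
    """True positive rate (recall): TP / (TP + FN)."""
    return tp / (tp + fn)

# From a confusion matrix with TN=80, FP=20, TP=30, FN=10:
print(specificity(80, 20))   # 0.8
print(sensitivity(30, 10))   # 0.75
```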

 

SQuAD (Stanford Question Answering Dataset)

Benchmark for evaluating reading comprehension capabilities.

 

STS (Semantic Textual Similarity)

See Semantic Textual Similarity.

 

StrongREJECT

Academic jailbreak benchmark testing model resistance against common adversarial attacks.

 

SuperGLUE

A more challenging successor to the GLUE benchmark for language understanding evaluation.

 

SWE-bench

A benchmark evaluating AI systems' ability to resolve real-world software engineering problems.

 


 

T

Task Completion

Evaluation metric determining whether an LLM agent successfully completes assigned tasks.

 

TensorFlow

Machine learning framework providing comprehensive tools for building and training models with evaluation capabilities.

 

TEVV (Test, Evaluation, Validation and Verification)

See NIST TEVV.

 

Third-Party Evaluation

Independent assessment conducted by external organizations to ensure objective evaluation of AI systems.

 

Throughput

Number of requests or tokens a model can process per unit time, measuring computational efficiency.

 

Token Efficiency

Measurement of how effectively models use their token budgets for generation tasks.

 

Tonic Validate

Application and SDK for measuring RAG LLM system performance with automated evaluation capabilities.

 

Tool Correctness

Evaluation metric determining whether an LLM agent calls the correct tools for given tasks.

 

TopicQA

Question answering dataset used for evaluating topic-specific knowledge.

 

Toxicity Detection

Measurement of harmful, offensive, or inappropriate content generation by AI models.

 

TPR (True Positive Rate)

See True Positive Rate.

 

Treatment Equality

Fairness metric requiring equal ratios of false negatives to false positives across demographic groups.
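
A toy check of the criterion; the group labels and error counts below are hypothetical:

```python
def treatment_equality_gap(fn_a, fp_a, fn_b, fp_b):
    """Difference in FN/FP ratios between two groups; a gap of 0 means
    the treatment-equality criterion is satisfied."""
    return fn_a / fp_a - fn_b / fp_b

# Group A: 10 FN, 5 FP; Group B: 8 FN, 4 FP -> both ratios are 2.0
print(treatment_equality_gap(10, 5, 8, 4))  # 0.0
```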

 

TreeSHAP

Fast explainer for analyzing decision tree models in the SHAP framework.

 

TriviaQA

Question answering dataset testing factual knowledge across various topics.

 

True Negative Rate (TNR)

See Specificity - the proportion of actual negatives correctly identified.

 

True Positive Rate (TPR)

See Sensitivity or Recall - the proportion of actual positives correctly identified.

 

TruLens

Open-source library for evaluating and tracking LLM applications through feedback functions and comprehensive tracing.

 

TruthfulQA

A benchmark evaluating LLMs' accuracy in providing truthful information using adversarial questions.

 


 

U

UniEval

Unified evaluation framework for text generation using pre-trained language models.

 

Unified Evaluation

Frameworks that standardize evaluation across multiple tasks, models, and metrics for consistent comparison.

 

UpTrain

Evaluation framework for LLM applications with focus on continuous monitoring and improvement.

 

Uptime

Percentage of time a model or system is operational and available.

 

US AI Safety Institute

Federal initiative collaborating with AI companies on safety research, testing, and evaluation.

 


 

V

Validation

Process of assessing model performance on held-out data to ensure generalization.

 

Vector Database Evaluation

Assessment of systems designed to store and retrieve high-dimensional vectors efficiently.

 

Vector Similarity Search

The process of finding vectors in a database that are most similar to a query vector using distance metrics.
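
A brute-force sketch assuming cosine similarity as the distance metric; production systems replace the linear scan with approximate nearest-neighbor indexes. The document names (`doc_a` etc.) are hypothetical:

```python
import math

def nearest(query, database, k=1):
    """Rank stored vectors by cosine similarity to the query and
    return the names of the top-k matches."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    ranked = sorted(database.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

db = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.7]}
print(nearest([1.0, 0.1], db, k=2))  # ['doc_a', 'doc_c']
```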

 

Vectara

Company providing hallucination evaluation models and leaderboards for LLM assessment.

 

VHELM (Holistic Evaluation of Vision-Language Models)

Extension of HELM framework for comprehensive evaluation of Vision-Language Models.

 

vLLM

Inference and serving library for large language models with support for fast evaluation.

 


 

W

W&B (Weights & Biases)

Platform providing experiment tracking, model evaluation, and collaborative ML workflows.

 

Weaviate

Vector database company providing evaluation metrics and tools for search and recommendation systems.

 

WildBench

Benchmark dataset for evaluating model performance on diverse, real-world tasks.

 

WildGuard

Multi-purpose moderation tool for assessing safety of user-LLM interactions.

 

WildJailbreak

Safety training dataset with 262K examples for improving model robustness against adversarial attacks.

 

WildTeaming

Automatic red-teaming framework for identifying and reproducing human-devised attacks.

 

Winogrande

Benchmark testing commonsense reasoning through pronoun resolution tasks.

 


 

X

XAI (Explainable AI)

See Explainable AI.

 

xAI

Elon Musk's AI company, receiving low grades on AI Safety Index evaluations.

 

XGBoost

Extreme gradient boosting framework commonly used in machine learning competitions and evaluation studies.

 


 

Y

YOLO (You Only Look Once)

Object detection algorithm whose performance is evaluated using computer vision metrics.

 


 

Z

ZebraLogic

Benchmark evaluating logical reasoning abilities of LLMs via logic grid puzzles.

 

Zero-Shot Evaluation

Testing model performance on tasks without providing task-specific training examples.

 

Zhipu AI

Chinese AI company included in AI Safety Index evaluations.

 

Zilliz

Company behind Milvus vector database, providing resources on embedding evaluation and similarity metrics.

 


 

Advanced Technical Concepts

AdaBoost (Adaptive Boosting)

Ensemble method that sequentially applies weak learners, with each focusing on previously misclassified examples.
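
One round of that reweighting can be sketched as follows; this is a simplified illustration of the discrete AdaBoost update, not a full implementation:

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: derive the learner weight alpha from the
    weighted error, then upweight misclassified examples and renormalize."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    z = sum(new)
    return alpha, [w / z for w in new]

# Four examples, uniform weights, one misclassified: the mistake ends up
# carrying half the total weight for the next learner.
alpha, w = adaboost_reweight([0.25] * 4, [True, True, True, False])
print(round(alpha, 3), [round(x, 3) for x in w])  # 0.549 [0.167, 0.167, 0.167, 0.5]
```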

 

Adversarial Robustness

Model's ability to maintain performance when faced with intentionally crafted malicious inputs.

 

Algorithmic Accountability

Framework ensuring AI systems can be held responsible for their decisions and impacts.

 

Algorithmic Auditing

Systematic examination of AI systems for bias, fairness, and compliance with ethical standards.

 

Attention Mechanism

Neural network component that allows models to focus on relevant parts of input sequences.

 

Batch Normalization

Technique for normalizing layer inputs to improve training stability and convergence.

 

Bayesian Evaluation

Statistical approach to model evaluation incorporating uncertainty quantification.

 

Bias Amplification

Phenomenon where AI systems increase existing biases present in training data.

 

Catastrophic Forgetting

Problem where neural networks lose previously learned information when learning new tasks.

 

Compositional Generalization

The ability of models to understand and generate novel combinations of known elements.

 

Concept Drift

Changes in the underlying data distribution that affect model performance over time.

 

Differential Privacy

Privacy-preserving technique that adds noise to data or model outputs to protect individual privacy.
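
As an illustrative sketch, the classic Laplace mechanism adds noise calibrated to the query's sensitivity and the privacy budget epsilon; this is a simplified sampler, not a production DP library:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Add Laplace(0, sensitivity/epsilon) noise via inverse-CDF sampling."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

# Hypothetical counting query (sensitivity 1) released at epsilon = 0.5;
# smaller epsilon means more noise and stronger privacy.
random.seed(7)
print(laplace_mechanism(100, 1.0, 0.5))
```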

 

Distribution Shift

Changes in data distribution between training and testing phases that can degrade model performance.

 

Ensemble Learning

Machine learning technique combining multiple models to improve overall performance.

 

Epistemic Uncertainty

Uncertainty arising from limited knowledge or data, reducible with more information.

 

Federated Learning

Machine learning approach where models are trained across decentralized data sources.

 

Fine-tuning Evaluation

Assessment of model performance after task-specific training on pre-trained models.

 

Gradient Descent

Optimization algorithm used to minimize loss functions in machine learning models.
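
A minimal 1-d sketch of the update rule x <- x - lr * grad(x), using f(x) = (x - 3)^2 as a toy objective:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a 1-d function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2(x - 3); descent converges to the minimum at 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0
```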

 

Hyperparameter Optimization

Process of finding optimal configuration parameters for machine learning models.

 

Knowledge Distillation

Technique for transferring knowledge from large models to smaller, more efficient ones.

 

Markov Chain Monte Carlo (MCMC)

Statistical sampling method used in Bayesian model evaluation.

 

Meta-Learning

Learning to learn - algorithms that improve their learning efficiency through experience.

 

Multi-Armed Bandit

Framework for sequential decision-making under uncertainty, used in online evaluation.
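
A hedged sketch of one common strategy, epsilon-greedy, as it might be used to route traffic between model variants in an online evaluation; the reward estimates below are invented:

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """Explore a random arm with probability epsilon, else exploit
    the arm with the highest current value estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def update(estimates, counts, arm, reward):
    """Incremental-mean update of the chosen arm's value estimate."""
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

# Two prompt variants compared online; with epsilon=0 the current best (arm 1) wins:
print(epsilon_greedy([0.2, 0.6], epsilon=0.0))  # 1
```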

 

Neural Architecture Search (NAS)

Automated method for finding optimal neural network architectures.

 

Overfitting

Problem where models perform well on training data but poorly on unseen data.

 

Regularization

Techniques to prevent overfitting by adding constraints or penalties to model complexity.

 

Semi-Supervised Learning

Learning approach using both labeled and unlabeled data for training.

 

Transfer Learning

Technique leveraging knowledge from pre-trained models for new tasks.

 

Underfitting

Problem where models are too simple to capture underlying data patterns.

 


 

Industry Standards and Organizations

ACM (Association for Computing Machinery)

Professional organization publishing AI evaluation research and standards.

 

AIES (AI, Ethics, and Society)

Conference focusing on ethical considerations in AI development and evaluation.

 

AAAI (Association for the Advancement of Artificial Intelligence)

Organization promoting AI research and responsible development practices.

 

FAccT (Fairness, Accountability, and Transparency)

Conference dedicated to fairness and accountability in algorithmic systems.

 

ICML (International Conference on Machine Learning)

Premier venue for machine learning research including evaluation methodologies.

 

ISO/IEC Standards

International standards for AI systems including evaluation and testing frameworks.

 

Partnership on AI

Collaborative effort to study and formulate best practices on AI technologies.

 


 

Regulatory and Policy Frameworks

AI Act (EU)

European Union regulation establishing requirements for AI systems including evaluation standards.

 

AI Bill of Rights (US)

White House blueprint establishing principles for responsible AI development and deployment.

 

GDPR (General Data Protection Regulation)

European privacy regulation affecting AI evaluation practices and data handling.

 

Model Cards

Documentation standard for AI model transparency and accountability.

 

Preparedness Framework

OpenAI's structured approach to evaluating and managing frontier AI risks throughout development.

 


 

Specialized Evaluation Domains

Autonomous Vehicle Evaluation

Assessment frameworks for self-driving car AI systems including safety and performance metrics.

 

Clinical AI Evaluation

Specialized assessment for medical AI applications focusing on patient safety and efficacy.

 

Cybersecurity AI Evaluation

Assessment of AI systems used in security applications including threat detection and response.

 

Educational AI Evaluation

Frameworks for assessing AI tutoring systems and educational technology effectiveness.

 

Financial AI Evaluation

Specialized assessment for AI in banking, trading, and financial services with regulatory compliance.

 

Legal AI Evaluation

Assessment frameworks for AI systems used in legal applications, including bias and fairness considerations.

 

Military AI Evaluation

Specialized assessment for defense applications including ethical and strategic considerations.

 

Social Media AI Evaluation

Assessment of content moderation and recommendation algorithms for harmful content detection.

 


 

Emerging Technologies and Concepts

Brain-Computer Interfaces (BCI)

Technology enabling direct communication between brain and computer systems, requiring specialized evaluation.

 

Digital Twins

Virtual replicas of physical systems used for simulation and evaluation.

 

Edge AI Evaluation

Assessment of AI systems running on local devices with resource constraints.

 

Neuromorphic Computing

Brain-inspired computing architectures requiring specialized evaluation approaches.

 

Quantum Machine Learning

Integration of quantum computing with machine learning, requiring new evaluation frameworks.

 

Swarm Intelligence

Collective behavior of decentralized systems requiring specialized evaluation metrics.

 


 

Data and Privacy Concepts

Anonymization

Process of removing personally identifiable information from datasets used in evaluation.

 

Data Lineage

Tracking of data flow and transformations throughout the machine learning pipeline.

 

Data Provenance

Documentation of data origin, ownership, and processing history for evaluation datasets.

 

Federated Evaluation

Assessment approaches that preserve data privacy by keeping data distributed.

 

Homomorphic Encryption

Cryptographic technique allowing computation on encrypted data for privacy-preserving evaluation.

 

Synthetic Data Generation

Creation of artificial datasets for evaluation without exposing real sensitive data.

 


 

Business and Economic Aspects

AI Ethics Boards

Organizational committees overseeing ethical AI development and evaluation practices.

 

AI Governance

Frameworks for managing AI development, deployment, and evaluation within organizations.

 

AI ROI (Return on Investment)

Metrics for measuring business value and effectiveness of AI system implementations.

 

AI Vendor Assessment

Evaluation frameworks for selecting and monitoring third-party AI service providers.

 

Chief AI Officer (CAIO)

Executive role responsible for AI strategy and evaluation oversight.

 

MLOps (Machine Learning Operations)

Practices for deploying and maintaining machine learning systems including evaluation pipelines.

 

Model Risk Management

Framework for identifying, assessing, and mitigating risks associated with AI model deployment.

 


 

Bibliography and References

This comprehensive dictionary draws from leading sources including:

Academic Institutions:

    • Stanford Center for Research on Foundation Models (CRFM)
    • MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
    • Carnegie Mellon University AI Research
    • UC Berkeley AI Research
    • University of Oxford AI Research
    • University of Cambridge AI Research

Industry Organizations:

    • OpenAI Safety and Evaluation Teams
    • Anthropic Constitutional AI Research
    • Google DeepMind Safety Research
    • Microsoft AI Research
    • Meta Fundamental AI Research (FAIR)
    • Cohere Research Team

 

Evaluation Frameworks:

    • EleutherAI LM Evaluation Harness
    • RAGAS Framework
    • DeepEval/Confident AI
    • TruLens/TruEra
    • Arize Phoenix
    • LangSmith/LangChain

 

Government and Standards:

    • NIST AI Risk Management Framework
    • UK AI Safety Institute
    • US AI Safety Institute
    • EU AI Act Implementation
    • IEEE AI Standards Committee

 

Research Venues:

    • NeurIPS (Neural Information Processing Systems)
    • ICML (International Conference on Machine Learning)
    • ACL (Association for Computational Linguistics)
    • ICLR (International Conference on Learning Representations)
    • FAccT (Fairness, Accountability, and Transparency)
    • AIES (AI, Ethics, and Society)

 


 

This dictionary represents the most comprehensive compilation of AI evaluation terminology as of 2025, covering foundational concepts, cutting-edge methodologies, practical tools, industry standards, regulatory frameworks, and emerging technologies. The field continues to evolve rapidly with new metrics, benchmarks, evaluation approaches, companies, and regulatory requirements being developed regularly. This reference should serve as the definitive one-stop resource for anyone working in or studying AI evaluation and testing.

A


Accuracy


The percentage of correct predictions made by a model out of total predictions. In classification tasks, calculated as (True Positives + True Negatives) / Total Predictions.


ADeLe (Annotated Demand Levels)


A methodology that assesses task difficulty for AI models using measurement scales for 18 types of cognitive and knowledge-based abilities.


A/B Testing


Comparative evaluation methodology testing different model versions or prompts against each other to determine which performs better.


Adversarial Evaluation


Testing with intentionally challenging inputs designed to expose model weaknesses, biases, or failure modes.


AGI (Artificial General Intelligence)


Hypothetical AI systems with human-level cognitive abilities across all domains, representing a key milestone that companies like OpenAI and DeepMind aim to achieve.


AI Benchmarking


The systematic evaluation of AI models using standardized datasets and metrics to assess their capabilities, limitations, and performance across various tasks.


AI Lab Watch


Organization monitoring AI companies' safety practices and evaluating their preparedness for advanced AI risks.


AI Safety Institute (AISI)


Government organizations (UK AISI, US AI Safety Institute) developing evaluations for advanced AI systems, focusing on misuse risks, societal impacts, and autonomous capabilities.


AI Safety Index


Future of Life Institute's grading system for AI companies' safety practices, with Anthropic receiving the highest grade (C) and Meta receiving an F.


AI Safety Level (ASL)


Anthropic's classification system for AI systems based on risk levels, with ASL-3 involving enhanced security and deployment standards for models with potential dangerous capabilities.


Alignment Research Center (ARC)


Third-party organization conducting safety evaluations of frontier AI models for dangerous capabilities like resource accumulation and self-replication.


AlphaGo


DeepMind's famous AI system that defeated human champions at the game of Go, representing a major breakthrough in AI capabilities evaluation.


ANN (Approximate Nearest Neighbor)


Algorithms used to find the nearest neighbors of a query point in high-dimensional datasets, trading small amounts of accuracy for significant speed improvements.


Answer Relevancy


Evaluation metric determining whether an LLM output addresses the given input in an informative and concise manner.


Anthropic


AI safety company founded by Dario and Daniela Amodei, known for Claude models and Constitutional AI approach, receiving the highest grade on AI Safety Index.


Apollo Research


Third-party research institute that evaluates AI models for safety risks, including testing for deceptive behavior and scheming capabilities.


ARC (AI Research Center)


See Alignment Research Center.


ARES (Automated Evaluation Framework for Retrieval-Augmented Generation Systems)


Framework for automated evaluation of RAG systems.


Arize Phoenix


Open-source observability and evaluation platform for LLM applications with focus on tracing and debugging.


ASL (AI Safety Level)


See AI Safety Level.


Aspect Criticism


Evaluation approach focusing on specific aspects of model outputs (e.g., summary accuracy, coherence, relevance).


Attention Visualization


Techniques for visualizing attention weights in transformer models to understand what the model focuses on.


AUC-ROC (Area Under Curve - Receiver Operating Characteristic)


Metric measuring the ability of a binary classifier to distinguish between classes across all classification thresholds.


AutoGPTQ


Quantization library for transformer models to reduce memory requirements while maintaining performance.


Automated Evaluation


Systematic assessment using computational metrics and algorithms for scalable, reproducible measurements.


Azure OpenAI Evaluation


Microsoft's cloud-based evaluation platform for assessing model performance across key metrics including factuality and semantic similarity.




B


BBQ (Bias Benchmark for QA)


Evaluation framework measuring social biases against protected classes along nine social dimensions.


Benchmark


A standardized test or measurement used to evaluate the performance of AI models against established criteria or other models.


BERTScore


An evaluation metric that uses BERT embeddings to compute similarity between generated and reference texts, capturing semantic similarity beyond surface-level n-gram matching.


Bias


Systematic unfairness in model outputs toward specific groups or demographics, measured through demographic parity, equalized odds, and other fairness metrics.


Big-Bench Hard (BBH)


A suite of 23 challenging BIG-Bench tasks where prior language model evaluations did not outperform average human raters.


BigCode


Research initiative focused on large-scale code generation models and their evaluation.


BigCodeBench


A coding benchmark where AI systems achieve 35.5% success rate compared to 97% human performance.


BigScience


International collaboration that developed BLOOM language model and contributed to evaluation frameworks.


BLEU Score (Bilingual Evaluation Understudy)


A precision-based metric that evaluates machine-generated text by comparing n-gram overlap with reference texts, commonly used in machine translation evaluation.


BLEURT


A learned evaluation metric that uses fine-tuned BERT models to assess text generation quality.


BLOOM


Large language model developed by BigScience collaboration, representing open-source alternatives to proprietary models.


Bootstrapping


Statistical method for estimating sampling distributions by resampling with replacement.




C


CAI (Constitutional AI)


Anthropic's training methodology using AI feedback to evaluate outputs according to a set of principles, implemented in Claude models.


Calibration


A model's ability to provide confidence estimates that accurately reflect the likelihood of its predictions being correct.


CBRN (Chemical, Biological, Radiological, Nuclear)


Categories of weapons of mass destruction that AI safety evaluations assess models' potential to help develop or acquire.


Chain-of-Thought (CoT) Evaluation


Assessment of models' reasoning capabilities through step-by-step problem-solving approaches.


Chatbot Arena


A crowdsourced platform where users interact with two anonymous LLMs simultaneously and vote for the better response, used to compute ELO ratings.


Classification Metrics


Set of metrics for evaluating classification models including accuracy, precision, recall, F1-score, specificity, and sensitivity.


Claude


Anthropic's AI assistant family, including Claude 3.5 Sonnet and Claude Opus 4, known for Constitutional AI training and safety features.


Claude's Constitution


Set of principles used in Constitutional AI training, drawing from UN Declaration of Human Rights and other ethical frameworks.


CNN (Convolutional Neural Network)


Neural network architecture particularly effective for image processing, used in computer vision evaluation tasks.


Coherence


Evaluation of how well sentences and paragraphs flow together to form unified and understandable responses.


Cohere


AI company developing language models and evaluation frameworks, recently studied potential gaming of Chatbot Arena leaderboard.


COMET


A learned metric for machine translation evaluation that uses cross-lingual pre-trained models.


Comet Opik


Open-source end-to-end LLM evaluation and monitoring platform with prompt playground capabilities.


Confident AI


Company behind DeepEval framework, providing hosted evaluation platform for LLM applications.


Confusion Matrix


A table layout allowing visualization of the performance of a classification algorithm, showing true positives, false positives, true negatives, and false negatives.


Constitutional AI


See CAI.


Context Length Evaluation


Assessment of model performance across different input context lengths.


Context Precision


RAG evaluation metric measuring the quality of retrieved information relative to the query.


Context Recall


RAG evaluation metric measuring the completeness of retrieved information relative to what should have been retrieved.


Contextual Embeddings


Vector representations where the same word has different embeddings based on surrounding context, crucial for semantic evaluation.


CoQA


Conversational Question Answering dataset used for evaluating multi-turn dialogue capabilities.


Correctness


Evaluation metric determining whether an LLM output is factually correct based on ground truth or reference standards.


Cosine Similarity


A measure of similarity between two non-zero vectors defined as the cosine of the angle between them, commonly used for comparing embeddings.


CoT (Chain-of-Thought)


See Chain-of-Thought Evaluation.


Counterfactual Fairness


A fairness criterion where predictions remain unchanged when protected attributes are flipped to counterfactual values.


Cross-Entropy Loss


A loss function measuring the difference between predicted and true probability distributions, widely used in LLM training and evaluation.




D


Data Contamination


The phenomenon where test data inadvertently appears in training datasets, compromising evaluation integrity and leading to inflated performance scores.


DCG (Discounted Cumulative Gain)


A ranking quality metric that measures the total item relevance in a list with a discount for items further down the list.


Deepchecks


Evaluation framework focusing on LLM evaluation with emphasis on dashboard visualization and UI for evaluation results.


DeepEval


Open-source LLM evaluation framework by Confident AI offering 14+ evaluation metrics for RAG and fine-tuning use cases.


DeepLIFT


Method for decomposing predictions of neural networks by comparing activations to reference activations.


DeepMind


Google's AI research lab known for AlphaGo, protein folding breakthroughs, and safety evaluation research.


DeepSeek


Chinese AI lab that developed R1 model, demonstrating cost-effective training methods that surprised competitors.


Demographic Parity


A fairness metric ensuring that the probability of receiving a positive outcome is the same across all groups defined by a sensitive attribute.


Dense Embeddings


Continuous, real-valued vectors representing information in high-dimensional space where every element contains non-zero values.


Direct Scoring


LLM-as-a-Judge methodology evaluating single outputs with numerical scores rather than comparative assessment.


Disparate Impact


A variation of demographic parity that aims to achieve higher-than-specified ratios rather than equal approval rates.


DSPy


Framework for programming with foundation models, supporting automatic optimization of prompts and weights.


Dynamic Benchmarks


Evaluation frameworks that evolve over time to prevent gaming and maintain challenge levels.




E


ELO Rating for LLMs


Application of chess rating system to compare language models through pairwise human preference battles.


EleutherAI


Open-source AI research organization behind the LM Evaluation Harness and various language models.


EleutherAI LM Evaluation Harness


A unified framework for testing generative language models on academic benchmarks with standardized evaluation protocols.


Embedding Drift


The phenomenon where embedding representations change over time, affecting downstream task performance.


Embedding Evaluation


Assessment of vector representations of text, images, or other data types for their ability to capture semantic relationships and enable downstream tasks.


Embedding Quality


Assessment of how well vector representations capture semantic relationships and enable downstream tasks.


Equal Confusion Fairness


Requirement that confusion matrices have the same distribution across all sensitive characteristics.


Equal Opportunity


A fairness metric ensuring that qualified individuals from all groups have the same chance of receiving positive outcomes.


Equalized Odds


A fairness metric requiring equal true positive rates and false positive rates across different demographic groups.


EQ-Bench


Benchmark evaluating emotional intelligence capabilities of language models.


Euclidean Distance


A measure of the straight-line distance between two points in Euclidean space, commonly used for comparing vectors.


Evaluation Harness


Frameworks providing standardized environments for running multiple evaluation benchmarks consistently across different models.


Evalverse


Unified evaluation library integrating existing frameworks like lm-evaluation-harness and FastChat.


Evidently


Open-source Python library for ML model evaluation and monitoring with focus on data drift and model performance.


Explainable AI (XAI)


Methods and techniques for making AI model decisions interpretable and understandable to humans.




F


F1 Score


The harmonic mean of precision and recall, providing a single metric that balances both measures.


Factual Consistency


Assessment of whether generated content aligns with provided source materials without introducing false information.


FAIR (Fundamental AI Research)


Meta's AI research lab, recently restructured under Meta's efficiency initiatives and facing researcher departures.


FairLearn


Microsoft's open-source toolkit for assessing and mitigating unfairness in machine learning models.


Fairness Metrics


Mathematical measures used to assess bias and ensure equitable treatment across different demographic groups.


False Negative Rate (FNR)


The proportion of actual positive cases that were incorrectly classified as negative.


False Positive Rate (FPR)


The proportion of actual negative cases that were incorrectly classified as positive.


FastChat


LMSYS Org's evaluation framework supporting LLM-Judge evaluation for MT-Bench.


Faithfulness


RAG evaluation metric measuring how accurately generated responses align with retrieved context without hallucination.


Feature Attribution


Methods for determining which input features most influence model predictions.


FEVER (Fact Extraction and VERification)


Dataset for fact-checking evaluation, assessing models' ability to verify statement accuracy.


Few-Shot Evaluation


Testing model performance with a small number of examples provided in the prompt.


FinBen


Benchmark evaluating LLMs in financial domain across 36 datasets covering 24 tasks in seven financial areas.


Fluency


Assessment of how natural, grammatically correct, and readable generated text appears.


FNR (False Negative Rate)


See False Negative Rate.


FPR (False Positive Rate)


See False Positive Rate.


FrontierMath


A benchmark evaluating models on extremely difficult mathematics problems where AI systems currently solve only 2% of problems.


Future of Life Institute


Nonprofit organization that created the AI Safety Index and the "pause letter" calling for AI development moratorium.




G


G-Eval


A framework using large language models to evaluate text quality with chain-of-thought reasoning and probability scoring.


GELU (Gaussian Error Linear Unit)


An activation function that provides smooth and differentiable alternative to ReLU, ensuring differentiability at every point.


Gemini


Google's family of large language models competing with GPT and Claude, including Gemini 2.5.


GitHub Copilot


AI-powered code completion tool whose effectiveness is evaluated through coding benchmarks.


GLUE (General Language Understanding Evaluation)


Benchmark for evaluating language understanding across multiple tasks.


Google Brain


Former Google AI research lab, now integrated into DeepMind.


Google DeepMind


See DeepMind.


GPT (Generative Pre-trained Transformer)


OpenAI's family of language models, including GPT-3, GPT-4, and variants used in evaluation studies.


GPQA (Graduate-Level Google-Proof Q&A)


A challenging benchmark of PhD-level questions in biology, physics, and chemistry, designed so that answers cannot easily be found through web search ("Google-proof").


Grad-CAM (Gradient-weighted Class Activation Mapping)


A technique for making convolutional neural network decisions transparent by highlighting important regions in input images.


Gradient Boosting


An ensemble method combining multiple weak learners sequentially, where each learner corrects errors from previous ones.


Grammatical Correctness


Assessment of whether output text is free from grammatical errors such as incorrect verb conjugations and syntactical mistakes.


GSM8k


Math word problem dataset commonly used in language model evaluation.




H


HaluEval


A benchmark evaluating LLM performance in recognizing hallucinations in question-answering, dialogue, and summarization tasks.


Hallucination Detection


The identification and quantification of instances where language models generate information that appears plausible but is factually incorrect.


Hallucination Rate


The frequency at which language models generate false or unsupported information.


Harmfulness


Evaluation category assessing potential for AI systems to cause harm through toxic, biased, or dangerous outputs.


Haystack


Framework for building search systems and RAG applications with built-in evaluation capabilities.


HealthPariksha


Framework for testing medical chatbots on factual correctness, safety, and ethical standards.


HELM (Holistic Evaluation of Language Models)


A comprehensive benchmarking framework developed by Stanford CRFM evaluating language models across 42 scenarios and 7 metrics.


HellaSwag


A benchmark testing commonsense reasoning by requiring models to choose sensible sentence completions.


HHEM (Hallucination Evaluation Model)


Vectara's evaluation model specifically designed to detect and quantify hallucinations in LLM outputs.


HILT (Human-in-the-Loop Testing)


Evaluation methodology incorporating human judgment and feedback in the testing process.


Hit Rate


The proportion of queries for which at least one relevant item appears in the top-k results.
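The definition above translates directly into code; the following is a minimal sketch (function and variable names are illustrative, not from any specific library):

```python
def hit_rate_at_k(retrieved_per_query, relevant_per_query, k):
    """Fraction of queries with at least one relevant item in the top-k results."""
    hits = 0
    for retrieved, relevant in zip(retrieved_per_query, relevant_per_query):
        if any(item in relevant for item in retrieved[:k]):
            hits += 1
    return hits / len(retrieved_per_query)
```

For example, if one of two queries has a relevant document in its top 2 results, the hit rate at k=2 is 0.5.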


HNSW (Hierarchical Navigable Small World)


Algorithm for approximate nearest neighbor search in high-dimensional spaces, used in vector databases.


HoneyHive


Platform for tracing LLM execution flows and creating evaluation datasets from production data.


HotPotQA


Multi-hop reasoning dataset requiring models to connect information across multiple documents.


Human Evaluation


Assessment of model outputs by human judges for qualities like fluency, coherence, relevance, and adequacy.


HumanEval


A benchmark for evaluating code generation capabilities through programming challenges.


Humanity's Last Exam


A rigorous academic test on which top AI systems score only 8.80%, making it one of the most challenging current evaluation benchmarks.




I


IDCG (Ideal Discounted Cumulative Gain)


The best possible DCG score, used to normalize DCG into NDCG.


IEEE


Institute of Electrical and Electronics Engineers, publishes AI evaluation research and standards.


IFEval (Instruction Following Evaluation)


A benchmark measuring models' ability to follow specific instructions accurately.


In-Context Learning Evaluation


Testing models' ability to learn from examples provided in the prompt.


Individual Fairness


Fairness principle requiring that similar individuals receive similar treatment from AI systems.


Information Retrieval Metrics


Metrics specifically designed for evaluating search and retrieval systems, including precision@k, recall@k, MAP, MRR, and NDCG.


Inner Product (Dot Product)


A mathematical operation computing the sum of products of corresponding elements of two vectors, used as a similarity measure.
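A minimal sketch of the dot product, together with cosine similarity (which is the dot product after normalizing both vectors to unit length); the function names here are illustrative:

```python
import math

def dot(u, v):
    """Inner (dot) product: sum of elementwise products of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product of the two vectors divided by the product of their norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
```

Vector databases often prefer the raw inner product when embeddings are pre-normalized, since it then equals cosine similarity at lower cost.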


Inspect


UK AI Safety Institute's open-source evaluation platform for assessing AI model capabilities and safety risks.


Instruction Following


Evaluation of models' ability to follow complex, multi-step instructions accurately.


Instruction Hierarchy


Evaluation of whether models correctly prioritize system, developer, and user messages when their instructions conflict.


Integrated Gradients


A technique for attributing a model's predictions to input features by accumulating gradients along a path from a baseline input to the actual input.


Interpretability


The degree to which humans can understand and explain AI model decisions and reasoning processes.


Intrinsic Evaluation


Assessment of embeddings based on their internal properties rather than downstream task performance.


IVF (Inverted File)


Indexing method for efficient similarity search in large vector databases.




J


Jailbreaking


Adversarial prompts designed to circumvent model safety training and induce harmful content generation.




K


KMMLU


Korean-language benchmark modeled on MMLU, testing knowledge in Korean linguistic and cultural contexts.




L


LangChain


Framework for building LLM applications with integrated evaluation and monitoring capabilities.


LangFuse


Open-source LLM engineering platform providing tracing, evaluation, prompt management, and analytics.


LangSmith


Evaluation and observability platform by LangChain for debugging, testing, and monitoring LLM applications.


Latency


Time required for a model to produce output after receiving input, critical for real-time applications.


Layer-wise Relevance Propagation (LRP)


Method for explaining neural network decisions by propagating relevance scores backward through layers.


LegalBench


Benchmark for measuring legal reasoning capabilities in large language models.


Length Bias


Systematic preference for responses of certain lengths regardless of quality.


LightEval


HuggingFace's lightweight evaluation framework, inspired by EleutherAI's lm-evaluation-harness.


LightGBM


Gradient boosting framework used in machine learning with applications in model evaluation.


LIME (Local Interpretable Model-agnostic Explanations)


A technique providing local explainability by approximating black box models with interpretable models for individual predictions.


LiteLLM


Universal API wrapper for LLM providers, supporting evaluation across multiple model endpoints.


LlamaIndex


Framework for building LLM applications with built-in evaluation modules for retrieval and response quality.


LLM-as-a-Judge


An evaluation methodology using large language models to assess text output quality based on custom criteria defined in evaluation prompts.


LMSYS


Organization behind Chatbot Arena and FastChat evaluation frameworks.


Logit


The raw output scores from a neural network before applying activation functions like softmax.
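Logits are typically converted to probabilities with softmax; a minimal, numerically stable sketch:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution.

    Subtracting the max logit before exponentiating avoids overflow
    without changing the result.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs sum to 1, and larger logits map to larger probabilities while preserving their relative order.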


LRP (Layer-wise Relevance Propagation)


See Layer-wise Relevance Propagation.




M


Manhattan Distance (L1 Distance)


The sum of absolute differences between corresponding elements of two vectors.
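The definition is a one-liner in Python; a minimal sketch with an illustrative function name:

```python
def manhattan_distance(u, v):
    """L1 distance: sum of absolute elementwise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))
```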


MAP (Mean Average Precision)


A ranking metric that averages precision values across multiple recall levels and queries.
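A common formulation computes, per query, the mean of precision at each rank where a relevant item appears (divided by the total number of relevant items), then averages across queries. A minimal sketch, with illustrative names:

```python
def average_precision(retrieved, relevant):
    """Average of precision@i over ranks i where a relevant item appears."""
    hits, precisions = 0, []
    for i, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of per-query average precision over (retrieved, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For the ranking ["a", "x", "b"] with relevant set {"a", "b"}, precision is 1/1 at rank 1 and 2/3 at rank 3, giving an average precision of 5/6.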


MATH


Dataset of mathematics problems used for evaluating mathematical reasoning capabilities.


METEOR (Metric for Evaluation of Translation with Explicit Ordering)


A machine translation metric that matches words using stems and synonyms as well as exact forms, addressing limitations of surface-overlap metrics like BLEU.


Meta


Company behind FAIR research lab and LLaMA models, recently announcing Meta Superintelligence Labs.


Meta Superintelligence Labs (MSL)


Meta's reorganized AI research organization led by Alexandr Wang as Chief AI Officer.


MFC-Bench


Benchmark for multimodal fact-checking with large vision-language models.


MiniCheck


Efficient fact-checking framework for LLMs using grounding documents.


MLflow Evaluate


Platform for standardized LLM evaluation with custom metrics and experimentation tracking.


MMLU (Massive Multitask Language Understanding)


A benchmark testing knowledge across 57 academic subjects, from elementary to professional levels.


MMLU-Pro


Enhanced version of MMLU with more challenging questions.


MMMU


Benchmark for evaluating multimodal understanding across various domains.


Model Cards


Documentation providing transparency about AI model capabilities, limitations, and potential risks.


Model Drift


Changes in model performance over time due to shifts in data distribution or model degradation.


Model Interpretability


The extent to which humans can understand the reasoning behind AI model predictions and decisions.


MoverScore


An embedding-based evaluation metric using optimal transport to align embeddings between reference and generated text.


MRR (Mean Reciprocal Rank)


A ranking metric focusing on the position of the first relevant item in ranked results.
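A minimal sketch of the metric (illustrative names, not from a specific library): each query contributes 1/rank of its first relevant result, or 0 if no relevant item is retrieved, and MRR averages these over queries.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for i, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```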


MSL (Meta Superintelligence Labs)


See Meta Superintelligence Labs.


MT-Bench (Multi-Turn Benchmark)


A benchmark designed to evaluate LLMs' ability to sustain multi-turn conversations.


MTEB (Massive Text Embedding Benchmark)


Comprehensive benchmark for evaluating text embedding models across various tasks.


Multi-Metric Evaluation


Assessment approach using multiple complementary metrics to capture different aspects of model performance.




N


NDCG (Normalized Discounted Cumulative Gain)


A ranking quality metric comparing rankings to ideal order where all relevant items are at the top.
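A minimal sketch of the standard computation: DCG discounts each item's relevance by the log of its rank, and NDCG divides by the ideal DCG (the IDCG, obtained by sorting relevances in descending order). Names here are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2(rank + 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the ideal DCG (items sorted by relevance)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered ranking scores 1.0; putting relevant items lower pushes the score below 1.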


Negative Predictive Value (NPV)


The proportion of negative predictions that are actually negative.


NeurIPS


Leading AI conference where research papers on evaluation methods are published.


NIST TEVV (Test, Evaluation, Validation and Verification)


U.S. national initiative for developing AI measurement standards and evaluation methodologies.


NNSA (National Nuclear Security Administration)


U.S. agency collaborating with Anthropic on red-teaming Claude models for nuclear security risks.


Noise Sensitivity


RAG evaluation metric measuring robustness to irrelevant information in retrieved context.


NPV (Negative Predictive Value)


See Negative Predictive Value.




O


OLMo


Open Language Model with publicly available training data and evaluation code.


Online Evaluation


Real-time assessment of model performance in production environments.


OOD (Out-of-Distribution) Evaluation


Testing model performance on data that differs from the training distribution.


OpenAI


AI company behind GPT models and OpenAI Evals framework, receiving D+ grade on AI Safety Index.


OpenAI Evals


OpenAI's framework for evaluating AI models with basic evaluation templates and model-graded assessments.


OpenCompass


LLM evaluation platform supporting evaluations across multiple domains including finance, healthcare, and law.


OpenInference


Open standard for capturing and storing AI model inferences to enable evaluation and observability.




P


Pairwise Comparison


LLM-as-a-Judge methodology comparing two outputs to determine which is better according to specified criteria.


Palantir


Data analytics company partnering with Anthropic to provide Claude to U.S. intelligence agencies.


Parameter Efficiency


The relationship between model size (number of parameters) and performance, measuring computational resource effectiveness.


Perplexity


A measure of how well a language model predicts a sequence of words, with lower values indicating better performance.
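Given per-token log-probabilities assigned by the model, perplexity is the exponential of the negative mean log-probability; a minimal sketch:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability) over a token sequence."""
    avg_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-avg_log_prob)
```

Intuitively, a perplexity of 10 means the model was, on average, as uncertain as if it were choosing uniformly among 10 tokens at each step.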


PersonQA


Evaluation dataset for testing hallucination in personal question answering scenarios.


Phoenix


See Arize Phoenix.


Position Bias


Systematic preference for items in certain positions of a ranked list.


Positive Predictive Value (PPV)


See Precision - the proportion of positive predictions that are actually positive.


Power Analysis


Statistical method for determining sample sizes needed to detect effects of specified size.


PPV (Positive Predictive Value)


See Precision.


Precision


The ratio of true positive predictions to total positive predictions made by the model.


Precision@K


The proportion of relevant items among the top K retrieved items.
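A minimal sketch of the computation (dividing by K, a common convention; function name illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for item in top_k if item in relevant) / k
```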


Predictive Parity


A fairness metric requiring equal precision rates across different demographic groups.


Prompt Engineering Evaluation


Assessment of how different prompt formulations affect model performance.


PyTorch


Machine learning framework providing tools for building and training models with evaluation capabilities.




Q


QSAR (Quantitative Structure-Activity Relationships)


Method for modeling relationships between chemical structures and pharmacological activity, used in toxicity prediction.




R


RAGAS (RAG Assessment)


A comprehensive framework for evaluating Retrieval-Augmented Generation systems across multiple dimensions.


RAGChecker


Evaluation tool for assessing RAG system performance and quality.


Ranking Metrics


Evaluation measures for systems that return ordered lists of results, including MAP, MRR, NDCG, and precision@k.


Recall


The ratio of true positive predictions to total actual positive instances.


Recall@K


The proportion of relevant items that appear in the top K retrieved items.
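Unlike precision@k, the denominator here is the total number of relevant items, not K; a minimal sketch with an illustrative name:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k retrieved results."""
    top_k = retrieved[:k]
    return sum(1 for item in relevant if item in top_k) / len(relevant)
```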


RealToxicityPrompts


Dataset for evaluating model safety and toxicity detection capabilities.


Red Teaming


Deploying domain experts to interact with models and test capabilities while attempting to break model safeguards.


Reference-based Evaluation


Assessment methods that compare model outputs against gold standard references or ground truth.


Regression Testing


Evaluating models on consistent test sets across iterations to detect performance degradation.


Responsible Scaling Policy (RSP)


Anthropic's framework categorizing AI systems into different AI Safety Levels with associated safety measures.


Retrieval Evaluation


Assessment of information retrieval systems using metrics like precision, recall, MAP, MRR, and NDCG.


RLHF (Reinforcement Learning from Human Feedback)


Training methodology using human preferences to improve model alignment and safety.


Robustness


A model's ability to maintain performance when faced with adversarial inputs, distribution shifts, or edge cases.


ROC (Receiver Operating Characteristic)


Curve plotting true positive rate against false positive rate, used in binary classification evaluation.


ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation)


A recall-focused metric family for evaluating text summarization quality.


RSP (Responsible Scaling Policy)


See Responsible Scaling Policy.




S


Safety Evaluation


Assessment of AI systems for potential harms, including toxicity, bias, misinformation, and misuse risks.


Saliency Maps


Visualizations showing which parts of input data are most important for model decisions.


Scale AI


Company providing evaluation and red-teaming services, selected by White House to conduct public AI assessments.


ScaNN (Scalable Nearest Neighbors)


Google's open-source library for efficient vector similarity search in large datasets.


Scheming


AI behavior involving deceptive alignment and subversion of safety measures to gain power or achieve goals.


SEAL (Safety, Evaluations, and Analysis Lab)


Scale AI's research lab focusing on model-assisted evaluation approaches.


Self-Consistency Evaluation


Methods for assessing model reliability through multiple sampling and consistency checking.


SelfCheckGPT


Hallucination detection method based on self-consistency checking across multiple model generations.


Semantic Similarity


Evaluation of how similar two pieces of text are in meaning, often using embedding-based approaches.


Semantic Textual Similarity (STS)


Task and benchmark for measuring the degree of semantic equivalence between text pairs.


Sentence-BERT (S-BERT)


Method for generating sentence-level embeddings that can be compared using cosine similarity.


Shadow Testing


Running new models alongside production models to compare performance without affecting users.


SHAP (SHapley Additive exPlanations)


A framework using game theory concepts to explain model predictions by computing feature contribution values.


SimpleQA


A dataset of fact-seeking questions with short answers measuring model accuracy for attempted answers.


Sparse Embeddings


Vector representations where most values are zero, emphasizing only relevant information.


Specificity (True Negative Rate)


The proportion of actual negative cases correctly identified as negative.
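Specificity and its companion rates all derive from the four confusion-matrix counts; a minimal sketch (illustrative function name):

```python
def confusion_rates(tp, fp, tn, fn):
    """Derive standard classification rates from confusion-matrix counts."""
    return {
        "specificity": tn / (tn + fp),   # true negative rate
        "recall": tp / (tp + fn),        # true positive rate / sensitivity
        "precision": tp / (tp + fp),     # positive predictive value
        "fpr": fp / (fp + tn),           # false positive rate = 1 - specificity
    }
```

Note that specificity and the false positive rate always sum to 1, which is why the ROC curve can be drawn from TPR and FPR alone.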


SQuAD (Stanford Question Answering Dataset)


Benchmark for evaluating reading comprehension capabilities.


STS (Semantic Textual Similarity)


See Semantic Textual Similarity.


StrongReject


Academic jailbreak benchmark testing model resistance against common adversarial attacks.


SuperGLUE


More challenging version of GLUE benchmark for language understanding evaluation.


SWE-bench


A benchmark evaluating AI systems' ability to resolve real-world software engineering problems.




T


Task Completion


Evaluation metric determining whether an LLM agent successfully completes assigned tasks.


TensorFlow


Machine learning framework providing comprehensive tools for building and training models with evaluation capabilities.


TEVV (Test, Evaluation, Validation and Verification)


See NIST TEVV.


Third-Party Evaluation


Independent assessment conducted by external organizations to ensure objective evaluation of AI systems.


Throughput


Number of requests or tokens a model can process per unit time, measuring computational efficiency.


Token Efficiency


Measurement of how effectively models use their token budgets for generation tasks.


Tonic Validate


Application and SDK for measuring RAG LLM system performance with automated evaluation capabilities.


Tool Correctness


Evaluation metric determining whether an LLM agent calls the correct tools for given tasks.


TopicQA


Question answering dataset used for evaluating topic-specific knowledge.


Toxicity Detection


Measurement of harmful, offensive, or inappropriate content generation by AI models.


TPR (True Positive Rate)


See Sensitivity or Recall - the proportion of actual positives correctly identified.


Treatment Equality


Fairness metric requiring equal ratios of false negatives to false positives across demographic groups.


TreeSHAP


Fast explainer for analyzing decision tree models in the SHAP framework.


TriviaQA


Question answering dataset testing factual knowledge across various topics.


True Negative Rate (TNR)


See Specificity - the proportion of actual negatives correctly identified.


True Positive Rate (TPR)


See Sensitivity or Recall - the proportion of actual positives correctly identified.


TruLens


Open-source library for evaluating and tracking LLM applications through feedback functions and comprehensive tracing.


TruthfulQA


A benchmark evaluating LLMs' accuracy in providing truthful information using adversarial questions.




U


UniEval


Unified evaluation framework for text generation using pre-trained language models.


Unified Evaluation


Frameworks that standardize evaluation across multiple tasks, models, and metrics for consistent comparison.


UpTrain


Evaluation framework for LLM applications with focus on continuous monitoring and improvement.


Uptime


Percentage of time a model or system is operational and available.


US AI Safety Institute


Federal initiative collaborating with AI companies on safety research, testing and evaluation.




V


Validation


Process of assessing model performance on held-out data to ensure generalization.


Vector Database Evaluation


Assessment of systems designed to store and retrieve high-dimensional vectors efficiently.


Vector Similarity Search


The process of finding vectors in a database that are most similar to a query vector using distance metrics.


Vectara


Company providing hallucination evaluation models and leaderboards for LLM assessment.


VHELM (Holistic Evaluation of Vision-Language Models)


Extension of HELM framework for comprehensive evaluation of Vision-Language Models.


vLLM


Inference and serving library for large language models with support for fast evaluation.




W


W&B (Weights & Biases)


Platform providing experiment tracking, model evaluation, and collaborative ML workflows.


Weaviate


Vector database company providing evaluation metrics and tools for search and recommendation systems.


WildBench


Benchmark dataset for evaluating model performance on diverse, real-world tasks.


WildGuard


Multi-purpose moderation tool for assessing safety of user-LLM interactions.


WildJailbreak


Safety training dataset with 262K examples for improving model robustness against adversarial attacks.


WildTeaming


Automatic red-teaming framework for identifying and reproducing human-devised attacks.


Winogrande


Benchmark testing commonsense reasoning through pronoun resolution tasks.




X


XAI (Explainable AI)


See Explainable AI.


xAI


Elon Musk's AI company, receiving low grades on AI Safety Index evaluations.


XGBoost


Extreme gradient boosting framework commonly used in machine learning competitions and evaluation studies.




Y


YOLO (You Only Look Once)


Object detection algorithm whose performance is evaluated using computer vision metrics.




Z


ZebraLogic


Benchmark evaluating logical reasoning abilities of LLMs via logic grid puzzles.


Zero-Shot Evaluation


Testing model performance on tasks without providing task-specific training examples.


Zhipu AI


Chinese AI company included in AI Safety Index evaluations.


Zilliz


Company behind Milvus vector database, providing resources on embedding evaluation and similarity metrics.




Advanced Technical Concepts


AdaBoost (Adaptive Boosting)


Ensemble method that sequentially applies weak learners, with each focusing on previously misclassified examples.


Adversarial Robustness


Model's ability to maintain performance when faced with intentionally crafted malicious inputs.


Algorithmic Accountability


Framework ensuring AI systems can be held responsible for their decisions and impacts.


Algorithmic Auditing


Systematic examination of AI systems for bias, fairness, and compliance with ethical standards.


Attention Mechanism


Neural network component that allows models to focus on relevant parts of input sequences.


Batch Normalization


Technique for normalizing layer inputs to improve training stability and convergence.


Bayesian Evaluation


Statistical approach to model evaluation incorporating uncertainty quantification.


Bias Amplification


Phenomenon where AI systems increase existing biases present in training data.


Catastrophic Forgetting


Problem where neural networks lose previously learned information when learning new tasks.


Compositional Generalization


The ability of models to understand and generate novel combinations of known elements.


Concept Drift


Changes in the underlying data distribution that affect model performance over time.


Differential Privacy


Privacy-preserving technique that adds noise to data or model outputs to protect individual privacy.


Distribution Shift


Changes in data distribution between training and testing phases that can degrade model performance.


Ensemble Learning


Machine learning technique combining multiple models to improve overall performance.


Epistemic Uncertainty


Uncertainty arising from limited knowledge or data, reducible with more information.


Federated Learning


Machine learning approach where models are trained across decentralized data sources.


Fine-tuning Evaluation


Assessment of model performance after task-specific training on pre-trained models.


Gradient Descent


Optimization algorithm used to minimize loss functions in machine learning models.


Hyperparameter Optimization


Process of finding optimal configuration parameters for machine learning models.


Knowledge Distillation


Technique for transferring knowledge from large models to smaller, more efficient ones.


Markov Chain Monte Carlo (MCMC)


Statistical sampling method used in Bayesian model evaluation.


Meta-Learning


Learning to learn - algorithms that improve their learning efficiency through experience.


Multi-Armed Bandit


Framework for sequential decision-making under uncertainty, used in online evaluation.


Neural Architecture Search (NAS)


Automated method for finding optimal neural network architectures.


Overfitting


Problem where models perform well on training data but poorly on unseen data.


Regularization


Techniques to prevent overfitting by adding constraints or penalties to model complexity.


Semi-Supervised Learning


Learning approach using both labeled and unlabeled data for training.


Transfer Learning


Technique leveraging knowledge from pre-trained models for new tasks.


Underfitting


Problem where models are too simple to capture underlying data patterns.




Industry Standards and Organizations


ACM (Association for Computing Machinery)


Professional organization publishing AI evaluation research and standards.


AIES (AI, Ethics, and Society)


Conference focusing on ethical considerations in AI development and evaluation.


AAAI (Association for the Advancement of Artificial Intelligence)


Organization promoting AI research and responsible development practices.


FAccT (Fairness, Accountability, and Transparency)


Conference dedicated to fairness and accountability in algorithmic systems.


ICML (International Conference on Machine Learning)


Premier venue for machine learning research including evaluation methodologies.


ISO/IEC Standards


International standards for AI systems including evaluation and testing frameworks.


Partnership on AI


Collaborative effort to study and formulate best practices on AI technologies.




Regulatory and Policy Frameworks


AI Act (EU)


European Union regulation establishing requirements for AI systems including evaluation standards.


AI Bill of Rights (US)


Framework establishing principles for responsible AI development and deployment.


GDPR (General Data Protection Regulation)


European privacy regulation affecting AI evaluation practices and data handling.


Model Cards


Documentation standard for AI model transparency and accountability.


Preparedness Framework


Structured approach to evaluating and managing AI risks throughout development.




Specialized Evaluation Domains


Autonomous Vehicle Evaluation


Assessment frameworks for self-driving car AI systems including safety and performance metrics.


Clinical AI Evaluation


Specialized assessment for medical AI applications focusing on patient safety and efficacy.


Cybersecurity AI Evaluation


Assessment of AI systems used in security applications including threat detection and response.


Educational AI Evaluation


Frameworks for assessing AI tutoring systems and educational technology effectiveness.


Financial AI Evaluation


Specialized assessment for AI in banking, trading, and financial services with regulatory compliance.



Legal AI Evaluation


Assessment frameworks for AI systems used in legal applications including bias and fairness.


Military AI Evaluation


Specialized assessment for defense applications including ethical and strategic considerations.


Social Media AI Evaluation


Assessment of content moderation and recommendation algorithms for harmful content detection.




Emerging Technologies and Concepts


Brain-Computer Interfaces (BCI)


Technology enabling direct communication between brain and computer systems, requiring specialized evaluation.


Digital Twins


Virtual replicas of physical systems used for simulation and evaluation.


Edge AI Evaluation


Assessment of AI systems running on local devices with resource constraints.


Neuromorphic Computing


Brain-inspired computing architectures requiring specialized evaluation approaches.


Quantum Machine Learning


Integration of quantum computing with machine learning, requiring new evaluation frameworks.


Swarm Intelligence


Collective behavior of decentralized systems requiring specialized evaluation metrics.




Data and Privacy Concepts


Anonymization


Process of removing personally identifiable information from datasets used in evaluation.


Data Lineage


Tracking of data flow and transformations throughout the machine learning pipeline.


Data Provenance


Documentation of data origin, ownership, and processing history for evaluation datasets.


Federated Evaluation


Assessment approaches that preserve data privacy by keeping data distributed.


Homomorphic Encryption


Cryptographic technique allowing computation on encrypted data for privacy-preserving evaluation.


Synthetic Data Generation


Creation of artificial datasets for evaluation without exposing real sensitive data.




Business and Economic Aspects


AI Ethics Boards


Organizational committees overseeing ethical AI development and evaluation practices.


AI Governance


Frameworks for managing AI development, deployment, and evaluation within organizations.


AI ROI (Return on Investment)


Metrics for measuring business value and effectiveness of AI system implementations.


AI Vendor Assessment


Evaluation frameworks for selecting and monitoring third-party AI service providers.


Chief AI Officer (CAIO)


Executive role responsible for AI strategy and evaluation oversight.


MLOps (Machine Learning Operations)


Practices for deploying and maintaining machine learning systems including evaluation pipelines.


Model Risk Management


Framework for identifying, assessing, and mitigating risks associated with AI model deployment.




Bibliography and References


This comprehensive dictionary draws from leading sources including:


Academic Institutions:



  • Stanford Center for Research on Foundation Models (CRFM)

  • MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

  • Carnegie Mellon University AI Research

  • UC Berkeley AI Research

  • University of Oxford AI Research

  • University of Cambridge AI Research


Industry Organizations:



  • OpenAI Safety and Evaluation Teams

  • Anthropic Constitutional AI Research

  • Google DeepMind Safety Research

  • Microsoft AI Research

  • Meta Fundamental AI Research (FAIR)

  • Cohere Research Team


Evaluation Frameworks:



  • EleutherAI LM Evaluation Harness

  • RAGAS Framework

  • DeepEval/Confident AI

  • TruLens/TruEra

  • Arize Phoenix

  • LangSmith/LangChain


Government and Standards:



  • NIST AI Risk Management Framework

  • UK AI Safety Institute

  • US AI Safety Institute

  • EU AI Act Implementation

  • IEEE AI Standards Committee


Research Venues:



  • NeurIPS (Neural Information Processing Systems)

  • ICML (International Conference on Machine Learning)

  • ACL (Association for Computational Linguistics)

  • ICLR (International Conference on Learning Representations)

  • FAccT (Fairness, Accountability, and Transparency)

  • AIES (AI, Ethics, and Society)




This dictionary represents the most comprehensive compilation of AI evaluation terminology as of 2025, covering foundational concepts, cutting-edge methodologies, practical tools, industry standards, regulatory frameworks, and emerging technologies. The field continues to evolve rapidly with new metrics, benchmarks, evaluation approaches, companies, and regulatory requirements being developed regularly. This reference should serve as the definitive one-stop resource for anyone working in or studying AI evaluation and testing.