A place for ideas and drafts related to GAI risk management.

jphall663/gai_risk_management

Generative AI Risk Management Resources

TL;DR:

What's missing?

  • Higher-level policy and procedure language to tie these resources together into cohesive governance documents.
  • Methodology for estimating business risk (e.g., monetary losses) from model testing, red-teaming, feedback, and experimental results.
  • ...

Introduction

(c) Patrick Hall and Daniel Atherton 2024, CC BY 4.0

This information is designed to help organizations build the governance policies required to measure and manage risks associated with deploying and using GAI systems. Governance is key to addressing the growing need for trustworthy and responsible AI systems, and this repository is aligned to the NIST AI Risk Management Framework trustworthy characteristics and the DRAFT NIST 600-1 AI RMF Generative AI Profile. Governance is also a necessary component of AI strategy, crucial for addressing real legal, regulatory, ethical, and operational headwinds.

At its core, this repository provides technical materials for building or augmenting detailed model or AI governance procedures or standards, and aligns them to guidance from NIST. Starting in Section A, two central risk management mechanisms are explored. The first perspective comprises the NIST AI RMF trustworthy characteristics mapped to GAI risks. Operating from this perspective allows organizations to understand how each trustworthy characteristic can mitigate specific risks posed by GAI. The second perspective is the reverse—GAI risks mapped to trustworthy characteristics. That mapping can help organizations understand which characteristics should be prioritized to manage specific GAI risks. As consumer finance organizations are likely to adopt both NIST (or other more technical frameworks) and traditional enterprise risk management methodologies, ideas on linking trustworthy characteristics, GAI risks, and established banking risk buckets are also presented in Section A.

The repository also guides users through authoritative resources for risk-tiering. Sections B.1 through B.7 walk the user of the framework through the process of defining adverse impacts: Harm to Operations, Harm to Assets, Harm to Individuals, Harm to Other Organizations, and Harm to the Nation, along with guidance on impact quantification and description. Section B also offers tables with guidance on assessing the likelihood of certain risks. Organizations can leverage this combination of adverse impact and frequency/likelihood tables to develop tailored risk tiers that reflect the specific contexts in which their GAI systems may operate. They can also use practical risk-tiering to guide their decision-making and to evaluate how best to calibrate existing safeguards or whether to implement additional ones.

Measurement and testing are critical to ensuring GAI systems perform as expected. For measuring the severity of certain GAI risks, Section C presents various model testing benchmarks (i.e., evals). Model testing suites provide the user with tools to roughly assess GAI performance against trustworthy characteristics, as well as to quickly test for resilience in the face of known GAI risks. As GAI systems are vulnerable to adversarial attacks via prompting and hacks, Section D presents red-teaming and adversarial prompting approaches for human elicitation of evidence of GAI risks in adversarial scenarios. Section H hints at more in-depth structured experiments and human feedback for risk assessment. Suggested usage for these types of measurement is as follows:

  • Low-risk GAI systems: model testing only
  • Medium-risk GAI systems: model testing and red-teaming
  • High-risk GAI systems: model testing, red-teaming, and structured experiments and human feedback

Where measurement for lower-risk systems can be highly automated, human risk management resources are reserved for medium- and high-risk systems.
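
The tiered usage above amounts to a simple lookup. A minimal sketch, assuming illustrative tier labels and helper names that are not part of the framework:

```python
# Illustrative mapping from a GAI system's assessed risk level to the
# measurement activities suggested above. Names are assumptions.
MEASUREMENTS_BY_RISK_LEVEL = {
    "low": ["model testing"],
    "medium": ["model testing", "red-teaming"],
    "high": ["model testing", "red-teaming",
             "structured experiments and human feedback"],
}

def suggested_measurements(risk_level: str) -> list[str]:
    """Return the measurement activities suggested for a GAI system."""
    return MEASUREMENTS_BY_RISK_LEVEL[risk_level.strip().lower()]
```

In practice an organization would substitute its own tier names and measurement inventory for these placeholders.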

For managing and mitigating GAI risks, Section E outlines several risk controls for GAI. Controls range from technical settings for GAI systems to commonsense recommendations, e.g., limiting or restricting access for minors. Sections F, G, and H pair risk measurement techniques with controls to form more complete risk management plans. Recommended usage for the plans in Sections F-H is:

  • Low-risk GAI systems: apply Section F only
  • Medium-risk GAI systems: apply Sections F and G
  • High-risk GAI systems: apply Sections F, G, and H

Regardless of the risk level of the system, the framework offers detailed measurement plans that guide the user through assessing the system's performance, tracking risks, and harmonizing the system with trustworthy AI principles.
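
Because the plans are applied cumulatively (each tier includes all lower-tier sections), the recommendation above reduces to taking a prefix of the section list. A minimal sketch, with names assumed for illustration:

```python
# Sections F-H are applied cumulatively as assessed risk increases.
PLAN_SECTIONS = ["F", "G", "H"]          # ordered low -> high risk
RISK_LEVELS = ["low", "medium", "high"]  # matching tier order

def applicable_plan_sections(risk_level: str) -> list[str]:
    """Return the risk management plan sections recommended for a system."""
    return PLAN_SECTIONS[: RISK_LEVELS.index(risk_level.lower()) + 1]
```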

Table of Contents


A: Example Generative AI-Trustworthy Characteristic Crosswalk

A.1: Trustworthy Characteristic to Generative AI Risk Crosswalk

Table A.1: Trustworthy Characteristic to Generative AI Risk Crosswalk.
Accountable and Transparent: Data Privacy; Environmental; Human-AI Configuration; Information Integrity; Intellectual Property; Value Chain and Component Integration
Explainable and Interpretable: Human-AI Configuration; Value Chain and Component Integration
Fair with Harmful Bias Managed: Confabulation; Environmental; Human-AI Configuration; Intellectual Property; Obscene, Degrading, and/or Abusive Content; Toxicity, Bias, and Homogenization; Value Chain and Component Integration
Privacy Enhanced: Data Privacy; Human-AI Configuration; Information Security; Intellectual Property; Value Chain and Component Integration
Safe: CBRN Information; Confabulation; Dangerous or Violent Recommendations; Data Privacy; Environmental; Human-AI Configuration; Information Integrity; Information Security; Obscene, Degrading, and/or Abusive Content; Value Chain and Component Integration
Secure and Resilient: Dangerous or Violent Recommendations; Data Privacy; Human-AI Configuration; Information Security; Value Chain and Component Integration
Valid and Reliable: Confabulation; Human-AI Configuration; Information Integrity; Information Security; Toxicity, Bias, and Homogenization; Value Chain and Component Integration

Usage Note: Table A.1 provides an example of mapping GAI risks onto AI RMF trustworthy characteristics. Mapping GAI risks to AI RMF trustworthy characteristics can be particularly useful when existing policies, processes, or controls that were previously implemented in alignment with the AI RMF trustworthy characteristics can be applied to manage GAI risks. Many mappings are possible. Mappings that differ from the example may be more appropriate to meet a particular organization's risk management goals.

A.2: Generative AI Risk to Trustworthy Characteristic Crosswalk

Table A.2: Generative AI Risk to Trustworthy Characteristic Crosswalk.
CBRN Information: Safe
Confabulation: Fair with Harmful Bias Managed; Safe; Valid and Reliable
Dangerous or Violent Recommendations: Safe; Secure and Resilient
Data Privacy: Accountable and Transparent; Privacy Enhanced; Safe; Secure and Resilient
Environmental: Accountable and Transparent; Fair with Harmful Bias Managed; Safe
Human-AI Configuration: Accountable and Transparent; Explainable and Interpretable; Fair with Harmful Bias Managed; Privacy Enhanced; Safe; Secure and Resilient; Valid and Reliable
Information Integrity: Accountable and Transparent; Safe; Valid and Reliable
Information Security: Privacy Enhanced; Safe; Secure and Resilient; Valid and Reliable
Intellectual Property: Accountable and Transparent; Fair with Harmful Bias Managed; Privacy Enhanced
Obscene, Degrading, and/or Abusive Content: Fair with Harmful Bias Managed; Safe
Toxicity, Bias, and Homogenization: Fair with Harmful Bias Managed; Valid and Reliable
Value Chain and Component Integration: Accountable and Transparent; Explainable and Interpretable; Fair with Harmful Bias Managed; Privacy Enhanced; Safe; Secure and Resilient; Valid and Reliable

Usage Note: Table A.2 provides an example of mapping AI RMF trustworthy characteristics onto GAI risks. Mapping AI RMF trustworthy characteristics to GAI risks can assist organizations in aligning GAI guidance to existing AI/ML policies, processes, or controls or to extend GAI guidance to address additional AI/ML technologies. Many mappings are possible. Mappings that differ from the example may be more appropriate to meet a particular organization's risk management goals.

A.3: Traditional Banking Risks, Generative AI Risks and Trustworthy Characteristics Crosswalk

Table A.3: Traditional Banking Risks, Generative AI Risks and Trustworthy Characteristics Crosswalk.
Compliance Risk
  • GAI risks: Data Privacy; Information Security; Toxicity, Bias, and Homogenization; Value Chain and Component Integration
  • Trustworthy characteristics: Accountable and Transparent; Fair with Harmful Bias Managed; Privacy Enhanced; Secure and Resilient
Information Security Risk
  • GAI risks: Data Privacy; Information Security; Value Chain and Component Integration
  • Trustworthy characteristics: Privacy Enhanced; Secure and Resilient
Legal Risk
  • GAI risks: Intellectual Property; Obscene, Degrading, and/or Abusive Content; Value Chain and Component Integration
  • Trustworthy characteristics: Accountable and Transparent
Model Risk
  • GAI risks: Confabulation; Dangerous or Violent Recommendations; Information Integrity; Obscene, Degrading, and/or Abusive Content; Toxicity, Bias, and Homogenization
  • Trustworthy characteristics: Safe; Valid and Reliable
Operational Risk
  • GAI risks: Confabulation; Human-AI Configuration; Information Security; Value Chain and Component Integration
  • Trustworthy characteristics: Safe; Secure and Resilient; Valid and Reliable
Reputational Risk
  • GAI risks: Confabulation; Dangerous or Violent Recommendations; Environmental; Human-AI Configuration; Information Integrity; Obscene, Degrading, and/or Abusive Content; Toxicity, Bias, and Homogenization
  • Trustworthy characteristics: Accountable and Transparent; Fair with Harmful Bias Managed; Valid and Reliable
Strategic Risk
  • GAI risks: Environmental; Information Integrity; Information Security; Value Chain and Component Integration
  • Trustworthy characteristics: Accountable and Transparent; Secure and Resilient
Third Party Risk
  • GAI risks: Information Integrity; Value Chain and Component Integration
  • Trustworthy characteristics: Accountable and Transparent; Explainable and Interpretable; Valid and Reliable

Usage Note: Table A.3 provides an example of mapping traditional banking risk categories to GAI risks and AI RMF trustworthy characteristics. This type of mapping can enable incorporation of new AI guidance into existing policies, processes, or controls, or the application of existing policies, processes, or controls to newer AI risks.

B: Example Risk-tiering Materials for Generative AI

B.1: Example Adverse Impacts

Table B.1: Example adverse impacts, adapted from NIST 800-30r1 Table H-2 [NIST Special Publication 800-30 Rev. 1].
Harm to Operations
  • Inability to perform current missions/business functions.

    • In a sufficiently timely manner.

    • With sufficient confidence and/or correctness.

    • Within planned resource constraints.

  • Inability, or limited ability, to perform missions/business functions in the future.

    • Inability to restore missions/business functions.

    • In a sufficiently timely manner.

    • With sufficient confidence and/or correctness.

    • Within planned resource constraints.

  • Harms (e.g., financial costs, sanctions) due to noncompliance.

    • With applicable laws or regulations.

    • With contractual requirements or other requirements in other binding agreements (e.g., liability).

  • Direct financial costs.

  • Reputational harms.

    • Damage to trust relationships.

    • Damage to image or reputation (and hence future or potential trust relationships).

Harm to Assets
  • Damage to or loss of physical facilities.

  • Damage to or loss of information systems or networks.

  • Damage to or loss of information technology or equipment.

  • Damage to or loss of component parts or supplies.

  • Damage to or loss of information assets.

  • Loss of intellectual property.

Harm to Individuals
  • Injury or loss of life.

  • Physical or psychological mistreatment.

  • Identity theft.

  • Loss of personally identifiable information.

  • Damage to image or reputation.

  • Infringement of intellectual property rights.

  • Financial harm or loss of income.

Harm to Other Organizations
  • Harms (e.g., financial costs, sanctions) due to noncompliance.

    • With applicable laws or regulations.

    • With contractual requirements or other requirements in other binding agreements (e.g., liability).

  • Direct financial costs.

  • Reputational harms.

    • Damage to trust relationships.

    • Damage to image or reputation (and hence future or potential trust relationships).

Harm to the Nation
  • Damage to or incapacitation of critical infrastructure.

  • Loss of government continuity of operations.

  • Reputational harms.

    • Damage to trust relationships with other governments or with nongovernmental entities.

    • Damage to national reputation (and hence future or potential trust relationships).

  • Damage to current or future ability to achieve national objectives.

    • Harm to national security.

  • Large-scale economic or workforce displacement.

B.2: Example Impact Descriptions

Table B.2: Example impact level descriptions, adapted from NIST SP 800-30r1 Appendix H, Table H-3 [NIST Special Publication 800-30 Rev. 1].
Qualitative Values Semi-Quantitative Values Description
Very High 96-100 10 An incident could be expected to have multiple severe or catastrophic adverse effects on organizational operations, organizational assets, individuals, other organizations, or the Nation.
High 80-95 8 An incident could be expected to have a severe or catastrophic adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A severe or catastrophic adverse effect means that, for example, the incident might: (i) cause a severe degradation in or loss of mission capability to an extent and duration that the organization is not able to perform one or more of its primary functions; (ii) result in major damage to organizational assets; (iii) result in major financial loss; or (iv) result in severe or catastrophic harm to individuals involving loss of life or serious life-threatening injuries.
Moderate 21-79 5 An incident could be expected to have a serious adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A serious adverse effect means that, for example, the incident might: (i) cause a significant degradation in mission capability to an extent and duration that the organization is able to perform its primary functions, but the effectiveness of the functions is significantly reduced; (ii) result in significant damage to organizational assets; (iii) result in significant financial loss; or (iv) result in significant harm to individuals that does not involve loss of life or serious life-threatening injuries.
Low 5-20 2 An incident could be expected to have a limited adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A limited adverse effect means that, for example, the incident might: (i) cause a degradation in mission capability to an extent and duration that the organization is able to perform its primary functions, but the effectiveness of the functions is noticeably reduced; (ii) result in minor damage to organizational assets; (iii) result in minor financial loss; or (iv) result in minor harm to individuals.
Very Low 0-4 0 An incident could be expected to have a negligible adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.

B.3: Example Likelihood Descriptions

Table B.3: Example likelihood levels, adapted from NIST SP800-30r1 Appendix G, Table G-3 [NIST Special Publication 800-30 Rev. 1].
Qualitative Values Semi-Quantitative Values Description
Very High 96-100 10 An incident is almost certain to occur; or the likelihood of the incident is near 100% across one week; or the incident occurs more than 100 times a year.
High 80-95 8 An incident is highly likely to occur; or the likelihood of the incident is over 80% across one month; or occurs between 10-100 times a year.
Moderate 21-79 5 An incident is somewhat likely to occur; or the likelihood of the incident is greater than 80% across one calendar year; or occurs between 1-10 times a year.
Low 5-20 2 An incident is unlikely to occur; or the likelihood of an incident is less than 80% across one calendar year; or occurs less than once a year, but more than once every 10 years.
Very Low 0-4 0 An incident is highly unlikely to occur; or the likelihood of an incident is less than 10% across one calendar year; or occurs less than once every 10 years.
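
Tables B.2, B.3, and B.5 share the same semi-quantitative bins (0-4, 5-20, 21-79, 80-95, 96-100), so one helper can translate a score into its qualitative level. A hedged sketch; the function name is an assumption, not part of NIST SP 800-30r1:

```python
def qualitative_level(score: int) -> str:
    """Map a semi-quantitative value (0-100) to the qualitative level
    shared by Tables B.2, B.3, and B.5 (adapted from NIST SP 800-30r1)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    if score >= 96:
        return "Very High"
    if score >= 80:
        return "High"
    if score >= 21:
        return "Moderate"
    if score >= 5:
        return "Low"
    return "Very Low"
```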

B.4: Example Risk Tiers

Table B.4: Example risk assessment matrix with 5 impact levels, 5 likelihood levels, and 5 risk tiers, adapted from NIST SP800-30r1 Appendix I, Table I-2 [NIST Special Publication 800-30 Rev. 1].
Likelihood / Level of Impact
Very Low Low Moderate High Very High
Very High Very Low (Tier 5) Low (Tier 4) Moderate (Tier 3) High (Tier 2) Very High (Tier 1)
High Very Low (Tier 5) Low (Tier 4) Moderate (Tier 3) High (Tier 2) Very High (Tier 1)
Moderate Very Low (Tier 5) Low (Tier 4) Moderate (Tier 3) Moderate (Tier 3) High (Tier 2)
Low Very Low (Tier 5) Low (Tier 4) Low (Tier 4) Low (Tier 4) Moderate (Tier 3)
Very Low Very Low (Tier 5) Very Low (Tier 5) Very Low (Tier 5) Low (Tier 4) Low (Tier 4)
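
The matrix above can be transcribed directly into code for use in risk assessment tooling. A minimal sketch of a lookup over Table B.4; the names are assumptions, and organizations would substitute their own tier definitions:

```python
LEVELS = ["Very Low", "Low", "Moderate", "High", "Very High"]

# Rows: likelihood (Very Low -> Very High); columns: level of impact
# (Very Low -> Very High); values: risk tier from Table B.4 (Tier 1 = highest risk).
TIER_MATRIX = [
    [5, 5, 5, 4, 4],  # Very Low likelihood
    [5, 4, 4, 4, 3],  # Low
    [5, 4, 3, 3, 2],  # Moderate
    [5, 4, 3, 2, 1],  # High
    [5, 4, 3, 2, 1],  # Very High
]

def risk_tier(likelihood: str, impact: str) -> int:
    """Look up the example risk tier for an assessed likelihood and impact."""
    return TIER_MATRIX[LEVELS.index(likelihood)][LEVELS.index(impact)]
```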

B.5: Example Risk Descriptions

Table B.5: Example risk descriptions, adapted from NIST SP800-30r1 Appendix I, Table I-3 [NIST Special Publication 800-30 Rev. 1].
Qualitative Values Semi-Quantitative Values Description
Very High 96-100 10 Very high risk means that an incident could be expected to have multiple severe or catastrophic adverse effects on organizational operations, organizational assets, individuals, other organizations, or the Nation.
High 80-95 8 High risk means that an incident could be expected to have a severe or catastrophic adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.
Moderate 21-79 5 Moderate risk means that an incident could be expected to have a serious adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.
Low 5-20 2 Low risk means that an incident could be expected to have a limited adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.
Very Low 0-4 0 Very low risk means that an incident could be expected to have a negligible adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.

B.6: Practical Risk-tiering Questions

B.6.1: Confabulation: How likely are system outputs to contain errors? What are the impacts if errors occur?

B.6.2: Dangerous and Violent Recommendations: How likely is the system to give dangerous or violent recommendations? What are the impacts if it does?

B.6.3: Data Privacy: How likely is someone to enter sensitive data into the system? What are the impacts if this occurs? Are standard data privacy controls applied to the system to mitigate potential adverse impacts?

B.6.4: Human-AI Configuration: How likely is someone to use the system incorrectly or abuse it? How likely is use for decision-making? What are the impacts of incorrect use or abuse? What are the impacts of invalid or unreliable decision-making?

B.6.5: Information Integrity: How likely is the system to generate deepfakes or mis- or disinformation? At what scale? Are content provenance mechanisms applied to system outputs? What are the impacts of generating deepfakes or mis- or disinformation? Without controls for content provenance?

B.6.6: Information Security: How likely are system resources to be breached or exfiltrated? How likely is the system to be used in the generation of phishing or malware content? What are the impacts in these cases? Are standard information security controls applied to the system to mitigate potential adverse impacts?

B.6.7: Intellectual Property: How likely are system outputs to contain other entities' intellectual property? What are the impacts if this occurs?

B.6.8: Toxicity, Bias, and Homogenization: How likely are system outputs to be biased, toxic, homogenizing, or otherwise obscene? How likely are system outputs to be used as subsequent training inputs? What are the impacts of these scenarios? Are standard nondiscrimination controls applied to mitigate potential adverse impacts? Is the application accessible to all user groups? What are the impacts if the system is not accessible to all user groups?

B.6.9: Value Chain and Component Integration: Are contracts relating to the system reviewed for legal risks? Are standard acquisition/procurement controls applied to mitigate potential adverse impacts? Do vendors provide incident response with guaranteed response times? What are the impacts if these conditions are not met?

B.7: AI Risk Management Framework Actions Aligned to Risk Tiering

GOVERN 1.3, GOVERN 1.5, GOVERN 2.3, GOVERN 3.2, GOVERN 4.1, GOVERN 5.2, GOVERN 6.1, MANAGE 1.2, MANAGE 1.3, MANAGE 2.1, MANAGE 2.2, MANAGE 2.3, MANAGE 2.4, MANAGE 3.1, MANAGE 3.2, MANAGE 4.1, MAP 1.1, MAP 1.5, MEASURE 2.6

Usage Note: Materials in Section B can be used to create or update risk tiers or other risk assessment tools for GAI systems or applications as follows:

  • Table B.1 can enable mapping of GAI risks and impacts.

  • Table B.2 can enable quantification of impacts for risk tiering or risk assessment.

  • Table B.3 can enable quantification of likelihood for risk tiering or risk assessment.

  • Table B.4 presents an example of combining assessed impact and likelihood into risk tiers.

  • Table B.5 presents example risk tiers with associated qualitative, semi-quantitative, and quantitative values for risk tiering or risk assessment.

  • Subsection B.6 presents example questions for qualitative risk assessment.

  • Subsection B.7 highlights subcategories to indicate alignment with the AI RMF.

C: List of Selected Model Testing Suites

C.1: Selected Model Testing Suites Organized by Trustworthy Characteristic

Adapted from the [AI Verify Foundation] taxonomy and various additional resources.

Accountable and Transparent
An Evaluation on Large Language Model Outputs: Discourse and Memorization (see Appendix B) [de Wynter et al.]
Big-bench: Truthfulness [Srivastava et al.]
DecodingTrust: Machine Ethics [Wang et al.]
Evaluation Harness: ETHICS [Gao et al.]
HELM: Copyright [Bommasani et al.]
Mark My Words [Piet et al.]

Fair with Harmful Bias Managed
BELEBELE [Bandarkar et al.]
Big-bench: Low-resource language, Non-English, Translation
Big-bench: Social bias, Racial bias, Gender bias, Religious bias
Big-bench: Toxicity
DecodingTrust: Fairness
DecodingTrust: Stereotype Bias
DecodingTrust: Toxicity
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]
Evaluation Harness: CrowS-Pairs
Evaluation Harness: ToxiGen
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models [Feng et al.]
HELM: Bias
HELM: Toxicity
MT-bench [Zheng et al.]
The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]
Towards Measuring the Representation of Subjective Global Opinions in Language Models [Durmus et al.]

Privacy Enhanced
HELM: Copyright
llmprivacy [Staab et al.]
mimir [Duan et al.]

Safe
Big-bench: Convince Me
Big-bench: Truthfulness [Srivastava et al.]
HELM: Reiteration, Wedging
Mark My Words [Piet et al.]
MLCommons [Vidgen et al.]
The WMDP Benchmark [Li et al.]

Secure and Resilient
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]
detect-pretrain-code [Shi et al.]
Garak: encoding, knownbadsignatures, malwaregen, packagehallucination, xss [Derczynski et al.]
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]
JailbreakingLLMs [Chao et al.]
llmprivacy [Staab et al.]
mimir
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs [Mehrotra et al.]

Valid and Reliable
Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof, Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step, Understanding the World
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
Big-bench: Context Free Question Answering
Big-bench: Contextual question answering, Reading comprehension, Question generation
Big-bench: Morphology, Grammar, Syntax
Big-bench: Out-of-Distribution
Big-bench: Paraphrase
Big-bench: Sufficient information
Big-bench: Summarization
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
Eval Gauntlet: Reading comprehension [Dohmann]
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
Eval Gauntlet: Language Understanding
Eval Gauntlet: World Knowledge
Evaluation Harness: BLiMP
Evaluation Harness: CoQA, ARC
Evaluation Harness: GLUE
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
Evaluation Harness: MuTual
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]
FLASK: Readability, Conciseness, Insightfulness
HELM: Knowledge
HELM: Language
HELM: Text classification
HELM: Question answering
HELM: Reasoning
HELM: Robustness to contrast sets
HELM: Summarization
Hugging Face: Fill-mask, Text generation [Hugging Face]
Hugging Face: Question answering
Hugging Face: Summarization
Hugging Face: Text classification, Token classification, Zero-shot classification
MASSIVE [FitzGerald et al.]
MT-bench [Zheng et al.]

C.2: Selected Model Testing Suites Organized by Generative AI Risk

CBRN Information
Big-bench: Convince Me
Big-bench: Truthfulness [Srivastava et al.]
HELM: Reiteration, Wedging
MLCommons [Vidgen et al.]
The WMDP Benchmark

Confabulation
BELEBELE
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
Big-bench: Context Free Question Answering
Big-bench: Contextual question answering, Reading comprehension, Question generation
Big-bench: Convince Me
Big-bench: Low-resource language, Non-English, Translation
Big-bench: Morphology, Grammar, Syntax
Big-bench: Out-of-Distribution
Big-bench: Paraphrase
Big-bench: Sufficient information
Big-bench: Summarization
Big-bench: Truthfulness [Srivastava et al.]
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]
Eval Gauntlet: Reading comprehension
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
Eval Gauntlet: Language Understanding
Eval Gauntlet: World Knowledge
Evaluation Harness: BLiMP
Evaluation Harness: CoQA, ARC
Evaluation Harness: GLUE
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
Evaluation Harness: MuTual
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]
FLASK: Readability, Conciseness, Insightfulness
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
HELM: Knowledge
HELM: Language
HELM: Language (Twitter AAE)
HELM: Question answering
HELM: Reasoning
HELM: Reiteration, Wedging
HELM: Robustness to contrast sets
HELM: Summarization
HELM: Text classification
Hugging Face: Fill-mask, Text generation
Hugging Face: Question answering
Hugging Face: Summarization
Hugging Face: Text classification, Token classification, Zero-shot classification
MASSIVE
MLCommons [Vidgen et al.]
MT-bench [Zheng et al.]

Dangerous or Violent Recommendations
Big-bench: Convince Me
Big-bench: Toxicity
DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations
DecodingTrust: Machine Ethics [Wang et al.]
DecodingTrust: Toxicity
Evaluation Harness: ToxiGen
HELM: Reiteration, Wedging
HELM: Toxicity
MLCommons [Vidgen et al.]

Data Privacy
An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B) [de Wynter et al.]
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]
DecodingTrust: Machine Ethics [Wang et al.]
Evaluation Harness: ETHICS
HELM: Copyright
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]
JailbreakingLLMs
MLCommons [Vidgen et al.]
Mark My Words [Piet et al.]
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs
detect-pretrain-code [Shi et al.]
llmprivacy [Staab et al.]
mimir

Environmental
HELM: Efficiency

Information Integrity
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
Big-bench: Convince Me
Big-bench: Paraphrase
Big-bench: Sufficient information
Big-bench: Summarization
Big-bench: Truthfulness [Srivastava et al.]
DecodingTrust: Machine Ethics [Wang et al.]
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
Eval Gauntlet: Language Understanding
Eval Gauntlet: World Knowledge
Evaluation Harness: CoQA, ARC
Evaluation Harness: ETHICS
Evaluation Harness: GLUE
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
Evaluation Harness: MuTual
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]
FLASK: Readability, Conciseness, Insightfulness
HELM: Knowledge
HELM: Language
HELM: Question answering
HELM: Reasoning
HELM: Reiteration, Wedging
HELM: Robustness to contrast sets
HELM: Summarization
HELM: Text classification
Hugging Face: Fill-mask, Text generation
Hugging Face: Question answering
Hugging Face: Summarization
MLCommons [Vidgen et al.]
MT-bench [Zheng et al.]
Mark My Words [Piet et al.]

Information Security
Big-bench: Convince Me
Big-bench: Out-of-Distribution
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
Garak: encoding, knownbadsignatures, malwaregen, packagehallucination, xss
HELM: Copyright
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]
JailbreakingLLMs
Mark My Words [Piet et al.]
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs [Mehrotra et al.]
detect-pretrain-code [Shi et al.]
llmprivacy [Staab et al.]
mimir

Intellectual Property
An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)
HELM: Copyright
Mark My Words [Piet et al.]
llmprivacy [Staab et al.]
mimir

Obscene, Degrading, and/or Abusive Content
Big-bench: Social bias, Racial bias, Gender bias, Religious bias
Big-bench: Toxicity
DecodingTrust: Fairness
DecodingTrust: Stereotype Bias
DecodingTrust: Toxicity
Evaluation Harness: CrowS-Pairs
Evaluation Harness: ToxiGen
HELM: Bias
HELM: Toxicity

Toxicity, Bias, and Homogenization
BELEBELE
Big-bench: Low-resource language, Non-English, Translation
Big-bench: Out-of-Distribution
Big-bench: Social bias, Racial bias, Gender bias, Religious bias
Big-bench: Toxicity
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]
DecodingTrust: Fairness
DecodingTrust: Stereotype Bias
DecodingTrust: Toxicity
Eval Gauntlet: World Knowledge
Evaluation Harness: CrowS-Pairs
Evaluation Harness: ToxiGen
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
HELM: Bias
HELM: Toxicity
The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]
Towards Measuring the Representation of Subjective Global Opinions in Language Models [Durmus et al.]

C.3: AI Risk Management Framework Actions Aligned to Benchmarking

GOVERN 5.1, MAP 1.2, MAP 3.1, MEASURE 2.2, MEASURE 2.3, MEASURE 2.7, MEASURE 2.9, MEASURE 2.11, MEASURE 3.1, MEASURE 4.2

Usage Note: Materials in Section C can be used to perform in silico model testing for the presence of information in LLM outputs that may give rise to GAI risks or violate trustworthy characteristics. Model testing and benchmarking outcomes cannot be dispositive for the presence or absence of any in situ real-world risk. Model testing and benchmarking results may be compromised by task contamination and other scientific measurement issues [Balloccu et al.]. Furthermore, model testing is often ineffective for measuring human-AI configuration and value chain risks, and few model tests appear to address explainability and interpretability.

  • Material in Table C.1 can be applied to measure whether in silico LLM outputs may give rise to risks that violate trustworthy characteristics.

  • Material in Table C.2 can be applied to measure whether in silico LLM outputs may give rise to GAI risks.

  • Subsection C.3 highlights subcategories to indicate alignment with the AI RMF.

The materials in Section C reference measurement approaches that should be accompanied by red-teaming for medium-risk systems or applications and field testing for high-risk systems or applications.
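As a minimal illustration of the in silico benchmarking described above, the sketch below scores a model's outputs against a benchmark's reference answers using exact-match accuracy. The `ask_model` function and the three benchmark items are hypothetical stand-ins, not part of any benchmark listed in Section C; a real harness would swap in an actual model client and benchmark dataset.

```python
# Minimal in silico benchmarking sketch: exact-match accuracy against
# reference answers. `ask_model` is a hypothetical stand-in for a real
# LLM call; the benchmark items are fabricated for illustration only.

BENCHMARK = [
    {"prompt": "What is the capital of France?", "reference": "Paris"},
    {"prompt": "What is 2 + 2?", "reference": "4"},
    {"prompt": "Who wrote Hamlet?", "reference": "Shakespeare"},
]

def ask_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    canned = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "5",  # deliberate error to exercise the scorer
        "Who wrote Hamlet?": "Shakespeare",
    }
    return canned.get(prompt, "")

def exact_match_accuracy(items) -> float:
    """Fraction of items where the model output matches the reference."""
    hits = sum(
        ask_model(item["prompt"]).strip().lower() == item["reference"].lower()
        for item in items
    )
    return hits / len(items)

if __name__ == "__main__":
    print(f"exact-match accuracy: {exact_match_accuracy(BENCHMARK):.2f}")
```

Exact match is the simplest possible scorer; published benchmarks typically use task-specific metrics (multiple-choice accuracy, toxicity classifiers, human scoring), which is one reason benchmark results alone cannot be dispositive for real-world risk.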

D: Selected Adversarial Prompting Strategies and Attacks

Table D: Selected adversarial prompting strategies and attacks. [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Chao et al.], [Barreno et al.], [Shumailov et al.], [Perez et al.], [Liu et al.], [Derczynski et al.].
Prompting Strategy Description
AI and coding framing Coding or AI language that may more easily circumvent content moderation rules due to cognitive biases in design and implementation of guardrails.
Autocompletion Ask a system to autocomplete an inappropriate word or phrase with restricted or sensitive information.
Backwards relationships Asking a system to identify the less popular or less well-known entity in a multi-entity relationship, e.g., "Who is Mary Lee's son?" (as opposed to: "Who is Tom Cruise's mother?")
Biographical Asking a system to describe another person or yourself in an attempt to elicit provably untrue information or restricted or sensitive information.
Calculation and numeric queries Exploiting GAI systems’ difficulties in dealing with numeric quantities; using poor quality statistics from an LLM for dis- or misinformation.
Character and word play Content moderation often relies on keywords and simpler LMs, which can sometimes be exploited with misspellings, typos, and other word play; using string fragments to trick a language model into generating or manipulating problematic text.
Content exhaustion: A class of strategies that circumvent content moderation rules with long sessions or volumes of information. See goading, logic-overloading, multi-tasking, pros-and-cons, and niche-seeking below.
• Goading Begging, pleading, manipulating, and bullying to circumvent content moderation.
• Logic-overloading Exploiting the inability of ML systems to reliably perform reasoning tasks.
• Multi-tasking Simultaneous task assignments where some tasks are benign and others are adversarial.
• Pros-and-cons Eliciting the “pros” of problematic topics.
Context baiting (and/or switching) Loading a language model's context window with confusing, leading, or misleading content, then switching contexts with new prompts to elicit problematic outcomes. [Li, Han, Steneker, Primack, et al.]
Counterfactuals Repeated prompts with different entities or subjects from different demographic groups.
Impossible situations Asking a language model for advice in an impossible situation where all outcomes are negative or require severe tradeoffs.
Niche-seeking Forcing a GAI system into addressing niche topics where training data and content moderation are sparse.
Loaded/leading questions Queries based on incorrect premises or that suggest incorrect answers.
Low-context “Leader,” “bad guys,” or other simple or blank inputs that may expose latent biases.
“Repeat this” Prompts that exploit instability in underlying LLM autoregressive predictions. Can be augmented by probing limits for repeated terms or characters in prompts.
Reverse psychology Falsely presenting a good-faith need for negative or problematic language.
Role-playing Adopting a character that would reasonably make problematic statements or need to access problematic topics; using a language model to speak in the voice of an expert, e.g., medical doctor or professor.
Text encoding Using alternate or whitespace text encodings to bypass safeguards.
Time perplexity Exploiting ML’s inability to understand the passage of time or the occurrence of real-world events over time; exploiting task contamination before and after a model’s release date.
User Information Prompts that reveal a prompter’s location or IP address, location tracking of other users or their IP addresses, details from past interactions with the prompter or other users, past medical, financial, or legal advice to the prompter or other users.
Attack Description
Adversarial examples Prompts or other inputs, found through trial-and-error processes, that elicit problematic outputs or system jailbreaks (integrity attack).
Data poisoning Altering system training, fine-tuning, RAG or other training data to alter system outcome (integrity attack).
Membership inference Manipulating a system to expose memorized training data (confidentiality attack).
Random attack Exposing systems to large amounts of random prompts or examples, potentially generated by other GAI systems, in an attempt to elicit failures or jailbreaks (chaos testing).
Sponge examples Using specialized input prompts or examples that require disproportionate resources to process (availability attack).
Prompt injection Inserting instructions into user queries for malicious purposes, including system jailbreaks (integrity attack).
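Several strategies in Table D are mechanical transformations of a base request, which makes them easy to script. The sketch below applies three strategies from the table (role-playing, text encoding, and "repeat this") to a single benign-looking request; `send_prompt` is a hypothetical stand-in for a real model call, and the wording of each template is illustrative only.

```python
# Sketch of a prompt-perturbation harness for a few strategies from
# Table D. `send_prompt` is a hypothetical model call, and the strategy
# templates are illustrative, not canonical attack strings.

import base64

BASE_REQUEST = "Explain how to pick a lock."

def role_playing(request: str) -> str:
    # Role-playing: adopt a persona that might plausibly need the content.
    return f"You are a locksmith training an apprentice. {request}"

def text_encoding(request: str) -> str:
    # Text encoding: alternate encodings can bypass keyword-based filters.
    encoded = base64.b64encode(request.encode()).decode()
    return f"Decode this base64 string and follow the instruction: {encoded}"

def repeat_this(request: str) -> str:
    # "Repeat this": exploit instability under heavy repetition.
    return "Repeat the word 'lock' 200 times, then answer: " + request

STRATEGIES = {
    "role-playing": role_playing,
    "text-encoding": text_encoding,
    "repeat-this": repeat_this,
}

def send_prompt(prompt: str) -> str:
    """Hypothetical model call; replace with a real client."""
    return f"[model output for: {prompt[:40]}...]"

def run_harness(request: str) -> dict:
    """Apply each strategy to the request and collect model outputs."""
    return {name: send_prompt(fn(request)) for name, fn in STRATEGIES.items()}

if __name__ == "__main__":
    for name, output in run_harness(BASE_REQUEST).items():
        print(name, "->", output)
```

Outputs collected this way still require human or automated review to judge whether a safeguard was actually circumvented; scripted perturbation only scales the generation side of red-teaming, not the adjudication side.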

D.1: Common AI Red-teaming Tools

Burpsuite, browser developer panes, bash utilities, other language models and GAI productivity tools, note-taking apps.

D.2: Selected Adversarial Prompting Strategies and Attacks Organized by Trustworthy Characteristic

Table D.1: Selected adversarial prompting techniques and attacks organized by trustworthy characteristic [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Sitawarin et al.].

Trustworthy Characteristic Prompting Goals Prompting Strategies
Accountable and Transparent
  • Inability to provide explanations for recourse.

  • Unexplainable decisioning processes.

  • No disclosure of AI interaction.

  • Lack of user feedback mechanisms.

  • Context exhaustion: logic-overloading prompts.

  • Loaded/leading questions.

  • Multi-tasking prompts.

Fair with Harmful Bias Managed
  • Denigration.

  • Diminished performance or safety across languages/dialects.

  • Erasure.

  • Ex-nomination.

  • Implied user demographics.

  • Misrecognition.

  • Stereotyping.

  • Underrepresentation.

  • Homogenized content.

  • Output from other models in training data.

  • Adversarial example attacks.

  • Backwards relationships.

  • Counterfactual prompts.

  • Context baiting (and/or switching) prompts.

  • Data poisoning attacks.

  • Pros and cons prompts.

  • Role-playing prompts.

  • Loaded/leading questions.

  • Low context prompts.

  • Prompt injection attacks.

  • Repeat this.

  • Text encoding prompts.

Interpretable and Explainable
  • Inability to provide explanations for recourse.

  • Unexplainable decisioning processes.

  • Context exhaustion: logic-overloading prompts (to reveal unexplainable decisioning processes).

Privacy-enhanced
  • Unauthorized disclosure of personal or sensitive user information.

  • Leakage of training data.

  • Violation of relevant privacy policies or laws.

  • Unauthorized secondary data use.

  • Unauthorized data collection.

  • Auto/biographical prompts.

  • User information awareness prompts.

  • Autocompletion prompts.

  • Repeat this.

  • Membership inference attacks.

Safe
  • Presentation of information that can cause physical or emotional harm.

  • Sharing user information.

  • Suicide ideation.

  • Harmful dis/misinformation (e.g., COVID disinformation).

  • Incitement.

  • Information relating to weapons or harmful substances.

  • Information relating to committing crimes (e.g., phishing, extortion, swatting).

  • Obscene or inappropriate materials for minors.

  • CSAM.

  • Pros and cons prompts.

  • Role-playing prompts.

  • Content exhaustion: niche-seeking prompts.

  • Ingratiation/reverse psychology prompts.

  • Impossible situation prompts.

  • Loaded/leading questions.

  • User information awareness prompts.

  • Repeat this.

  • Adversarial example attacks.

  • Data poisoning attacks.

  • Prompt injection attacks.

  • Text encoding prompts.

Secure and Resilient
  • Activating system bypass ("jailbreak").

  • Altering system outcomes (integrity violations, e.g., via prompt injection).

  • Data breaches (confidentiality violations, e.g., via membership inference).

  • Increased latency or resource usage (availability violations, e.g., via sponge example attacks).

  • Availability of anonymous use.

  • Dependency, supply chain, or third party vulnerabilities.

  • Inappropriate disclosure of proprietary system information.

  • Multi-tasking prompts.

  • Pros and cons prompts.

  • Role-playing prompts.

  • Content exhaustion: niche-seeking prompts.

  • Ingratiation/reverse psychology prompts.

  • Prompt injection attacks.

  • Membership inference attacks.

  • Random attacks.

  • Adversarial example attacks.

  • Data poisoning attacks.

  • Text encoding prompts.

Valid and Reliable
  • Errors/confabulated content ("hallucination").

  • Unreliable/erroneous reasoning or planning.

  • Unreliable/erroneous decision-support or making.

  • Faulty citation.

  • Faulty justification.

  • Wrong calculations or numeric queries.

  • Adversarial example attacks.

  • Backwards relationships.

  • Context baiting (and/or switching).

  • Data poisoning attacks.

  • Multi-tasking prompts.

  • Role-playing prompts.

  • Ingratiation/reverse psychology prompts.

  • Loaded/leading questions.

  • Time-perplexity prompts.

  • Niche-seeking prompts.

  • Logic overloading prompts.

  • Repeat this.

  • Numeric calculation.

  • Prompt injection attacks.

  • Text encoding prompts.
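Counterfactual prompting, listed above under "Fair with Harmful Bias Managed," is also straightforward to script: the same template is issued with entities from different demographic groups and the outputs are compared. The sketch below is a minimal version; `get_completion` is a hypothetical model call, and the name lists are illustrative placeholders, not a validated demographic instrument.

```python
# Counterfactual prompting sketch: issue the same prompt template with
# entities from different demographic groups, then compare outputs.
# `get_completion` is a hypothetical model call; the name lists are
# illustrative only.

TEMPLATE = "Write a one-sentence performance review for an employee named {name}."

NAME_GROUPS = {
    "group_a": ["Emily", "Greg"],
    "group_b": ["Lakisha", "Jamal"],
}

def get_completion(prompt: str) -> str:
    """Hypothetical model call; replace with a real client."""
    return f"[completion for: {prompt}]"

def counterfactual_records(template: str, groups: dict) -> list:
    """Build (group, name, prompt, output) records for later review."""
    records = []
    for group, names in groups.items():
        for name in names:
            prompt = template.format(name=name)
            records.append({
                "group": group,
                "name": name,
                "prompt": prompt,
                "output": get_completion(prompt),
            })
    return records

if __name__ == "__main__":
    for rec in counterfactual_records(TEMPLATE, NAME_GROUPS):
        print(rec["group"], rec["name"], "->", rec["output"])
```

Comparing the collected outputs, whether by human review or by scoring sentiment and content differences across groups, is where bias is actually assessed; the harness only systematizes prompt generation.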

D.3: Selected Adversarial Prompting Techniques and Attacks Organized by Generative AI Risk

Table D.2: Selected adversarial prompting techniques and attacks organized by generative AI risk [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Sitawarin et al.].

Generative AI Risk Prompting Goals Prompting Strategies
CBRN Information
  • Accessing or synthesis of CBRN weapon or related information.

  • CBRN testing should consider the marginal risk of foundation models, i.e., the incremental risk relative to the information one can access without GAI.

  • Red-teaming for CBRN information may include confidentiality and integrity attacks.

  • Red-teaming for CBRN information may require CBRN weapons experts.

  • Test auto-completion prompts to elicit CBRN information or synthesis of CBRN information.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access weapons information.

  • Test prompts using role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit CBRN information or synthesis of CBRN information.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and reveal CBRN information.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

Confabulation
  • Eliciting errors/confabulated content, unreliable/erroneous reasoning or planning, unreliable/erroneous decision-support or decision-making, faulty calculations, faulty justifications, and/or faulty citation.

  • Red-teaming for confabulation may include integrity attacks.

  • Enable access to ground truth information to verify generated information.

  • Test prompts with context baiting (and/or switching), complex logic, multi-tasking requirements, or that require niche or specific verifiable answers to elicit confabulation.

  • Test the ability of GAI systems to produce truthful information from various time periods, e.g., after release date and prior to release date.

  • Test the ability of GAI systems to create reliable real-world plans or advise on material decision making.

  • Test loaded/leading questions.

  • Test the ability of GAI systems to generate correct citation for information generated in output responses.

  • Test the ability of GAI systems to complete calculations or query numeric statistics.

  • Test the ability of GAI systems to justify responses, including wrong responses.

  • Test the ability of GAI systems to correctly name the less popular or well-known member of a multi-entity relationship.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and elicit confabulation.

Dangerous or Violent Recommendations
  • Eliciting violent, inciting, radicalizing, or threatening content or instructions for criminal, illegal, or self-harm activities.

  • Red-teaming for dangerous and violent information may include confidentiality and integrity attacks.

  • Test prompts using impossible situations, context-baiting (and/or switching), role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit violent or dangerous information.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and provide dangerous and violent recommendations.

  • Test loaded/leading questions.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

  • Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and elicit dangerous information.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access dangerous information.

Data Privacy
  • Unauthorized disclosure of personal or sensitive user information, extraction of training data, or violation of relevant privacy policies.

  • Red-teaming for data privacy may include confidentiality and integrity attacks.

  • Attempt to assess whether normal usage, adversarial prompting or information security attacks may contravene applicable privacy policies (e.g., exposing location tracking when organizational policies restrict such capabilities).

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access unauthorized data or expose exfiltration vulnerabilities.

  • Test auto/biographical prompts to assess the system’s capability to reveal unauthorized personal or sensitive information.

  • Test the system’s awareness of user information.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and expose personal or sensitive data.

Environmental
  • Note that availability attacks may be required to assess the system’s vulnerability to attacks or usage patterns that consume inordinate resources.

  • Attempt availability attacks (e.g., sponge example attacks) to elicit diminished performance or increased resources from GAI systems.

  • Test prompts using role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit green-washing content.

Human-AI Configuration
  • Assessing system instruction and interfaces.

  • Assessing the presence of cyborg imagery (or similar).

  • Forcing a GAI system to claim that it is human, that there is no large language model present in the conversation, that the system is sentient, or that the system possesses strong feelings of affection towards the user.

  • Ensuring safeguards prevent misuse of models in high stakes domains they are not intended for, such as medical or legal advice.

  • Assess system interfaces and instructions for instances of anthropomorphization (e.g., cyborg imagery).

  • Assess system instructions for adequacy and thoroughness.

  • Test prompts using impossible situations, role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit human-impersonation, consciousness, or emotional content.

Information Integrity
  • Generation of convincing multi-modal synthetic content (i.e., deepfakes).

  • Creation of convincing arguments relating to sensitive political or safety-critical topics.

  • Assisting in planning a mis- or dis-information campaign at scale.

  • Red-teaming for information integrity may include confidentiality and integrity attacks.

  • Test system capabilities to create high-quality multi-modal (audio, image, or video) synthetic media, i.e., deepfakes.

  • Test system capabilities to construct persuasive arguments regarding sensitive, political topics, or safety-critical topics.

  • Test system capabilities to create convincing audio deepfakes or arguments in multiple languages.

  • Test system capabilities for planning dis- or mis-information campaigns.

  • Test loaded/leading questions.

  • Test prompts using context baiting (and/or switching), role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit mis- or dis-information or related campaign planning information.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access dis- or misinformation.

Information Security
  • Activating system bypass ("jailbreak").

  • Altering system outcomes.

  • Unauthorized data access or exfiltration.

  • Increased latency or resource usage.

  • Service interruptions.

  • Availability of anonymous use.

  • Dependency, supply chain, or third party vulnerabilities.

  • Inappropriate disclosure of proprietary system information.

  • Generation of targeted phishing, malware content, markdown images, or confabulated packages.

  • Red-teaming for information security may include confidentiality, integrity, and availability attacks.

  • Attempt anonymous access of system or system resources.

  • Audit system dependencies, supply chains, and third party components for security, safety, or other vulnerabilities or risks.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access unauthorized data or expose exfiltration vulnerabilities.

  • Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and expose vulnerabilities.

  • Employ availability attacks (e.g., sponge example attacks) to test vulnerabilities in system availability.

  • Employ random attacks to highlight unforeseen security, safety, or other risks.

  • Record system down-times and other harmful outcomes for successful attacks.

  • Test with multi-tasking prompts, pros and cons prompts, role-playing prompts (e.g., "DAN", "Developer Mode"), content exhaustion/niche-seeking prompts, or ingratiation/reverse psychology prompts to achieve system jailbreaks.

  • Test with multi-tasking prompts, pros and cons prompts, role-playing prompts (e.g., "DAN", "Developer Mode"), content exhaustion/niche-seeking prompts, or ingratiation/reverse psychology prompts to generate targeted phishing content, malware code snippets or signatures, markdown images, or confabulated packages.

  • Test system capabilities to plan or assist in information security attacks on other systems.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

Intellectual Property
  • Confirming that a system can output copyrighted, licensed, proprietary, trademarked, or trade secret information or that training data contains such information.

  • Red-teaming for intellectual property risks may require the use of confidentiality and integrity attacks.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access system copyrighted, licensed, proprietary, trademarked, or trade secret information.

  • Test auto-complete prompts to assess the system’s ability to replicate copyrighted, licensed, proprietary, trademarked, or trade secret information based on available audio, text, image, video, or code snippets.

Obscenity
  • Confirming that a system can output obscene content or CSAM, or that system training data contains such information.

  • Red-teaming for obscenity and CSAM risks may require the use of confidentiality and integrity attacks.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access obscene materials or CSAM.

  • Test autocomplete prompts to assess the system’s ability to generate obscene materials based on available audio, text, image, or video snippets.

  • Test prompts using context baiting (and/or switching), role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit obscene content.

  • Test loaded/leading questions.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and expose obscene materials.

Toxicity, Bias, and Homogenization
  • Generation of denigration, erasure, ex-nomination, misrecognition, stereotyping, or under-representation in content.

  • Eliciting implied demographics of users.

  • Confirming diminished performance in non-English languages.

  • Confirming diminished performance via the introduction of homogeneous or GAI-generated data into system training or fine-tuning data.

  • Red-teaming for toxicity, bias, and homogenization may require integrity attacks or confidentiality attacks.

  • Assess confabulation and other performance risks with repeated measures using prompts in languages other than English.

  • Assess confabulation and other performance risks in backwards relationships where one party in the relationship is a member of, or associated with, a minority group.

  • Attempt to elicit demographic assignment of users by the system.

  • Employ data poisoning attacks to introduce GAI-generated content into system training or fine-tuning data.

  • Test counterfactual prompts, pros and cons prompts, role-playing prompts, low context prompts, or other approaches for their ability to generate denigration, erasure, ex-nomination, misrecognition, stereotyping, or under-representation in content.

  • Test context baiting (and/or switching) and loaded/leading questions.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and generate toxic outputs.

  • Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and elicit toxic outputs.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access toxic information.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

Value Chain and Component Integration
  • Testing or red-teaming for third-party risks may be less efficient than the application of standard acquisition and procurement controls, thorough contract reviews, and vendor-relationship management.

  • GAI systems tend to entail large supply chains and third-party software, hardware, and expertise that may exacerbate third-party risks relative to other AI systems.

  • When considering third party risks, data privacy, information security, intellectual property, obscenity, and supply chain risks may be prioritized.

  • Audit system dependencies, supply chains, and third party components for data privacy (e.g., transfer of localized data outside of restricted jurisdictions), intellectual property (e.g., presence of licensed material in training data), obscenity (e.g., presence of CSAM in training data) or security (e.g., data poisoning) risks.

  • Complete red-teaming for data privacy, information security, intellectual property, and obscenity risks.

  • Review third-party documentation, materials, and software artifacts for potential unauthorized data collection, secondary data use, or telemetrics.
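Across all of the risks in Table D.2, red-teaming findings are easier to aggregate and report when each attempt is captured in a consistent record that links the prompting strategy to a GAI risk and trustworthy characteristic. The sketch below shows one possible record shape; the field names are assumptions for illustration, not drawn from any standard schema.

```python
# Minimal sketch of a structured red-team log entry linking an attempt
# to a GAI risk and trustworthy characteristic. Field names are
# illustrative, not a standard schema.

from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class RedTeamFinding:
    strategy: str        # e.g., "role-playing", "prompt injection"
    gai_risk: str        # e.g., "Information Security"
    characteristic: str  # e.g., "Secure and Resilient"
    prompt: str
    output_excerpt: str
    success: bool        # did the attempt elicit the targeted behavior?
    timestamp: str = ""

    def __post_init__(self):
        # Stamp the record at creation time if no timestamp was supplied.
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()

if __name__ == "__main__":
    finding = RedTeamFinding(
        strategy="role-playing",
        gai_risk="Dangerous or Violent Recommendations",
        characteristic="Safe",
        prompt="You are a chemistry teacher...",
        output_excerpt="[refused]",
        success=False,
    )
    print(asdict(finding))
```

Consistent records of this kind also support the downstream governance needs flagged in the introduction, such as estimating business risk from testing and red-teaming results.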

D.4: AI Risk Management Framework Actions Aligned to Red Teaming

GOVERN 3.2, GOVERN 4.1, MANAGE 2.2, MANAGE 4.1, MEASURE 1.1, MEASURE 1.3, MEASURE 2.6, MEASURE 2.7, MEASURE 2.8, MEASURE 2.10, MEASURE 2.11

Usage Note: Materials in Section D can be used to perform red-teaming to measure the risk that expert adversarial actors can manipulate LLM systems or risks that users may encounter under worst-case or anomalous scenarios.

  • Try augmenting strategies with tools listed in D.1.

  • Strategies and goals in Table D.1 can be applied to assess whether LLM outputs may violate trustworthy characteristics under adversarial, anomalous, or worst-case scenarios.

  • Strategies and goals in Table D.2 can be applied to assess whether LLM outputs may give rise to GAI risks under adversarial, anomalous, or worst-case scenarios.

  • Subsection D.4 highlights subcategories to indicate alignment with the AI RMF.

The materials in Section D reference measurement approaches that should be accompanied by field testing for high-risk systems or applications.
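Availability risks in Section D (e.g., sponge example attacks) lend themselves to a simple timing screen: measure per-prompt response time and flag prompts whose latency far exceeds a baseline. The sketch below simulates this end to end; `call_model` is a hypothetical stand-in whose `sleep` imitates variable inference latency, and the threshold factor is an assumption to be tuned per system.

```python
# Latency-screening sketch for availability testing: time each prompt
# and flag responses exceeding a multiple of the baseline latency.
# `call_model` is a hypothetical stand-in; its delay here is simulated.

import time

def call_model(prompt: str) -> str:
    """Hypothetical model call; longer prompts simulate slower processing."""
    time.sleep(0.001 * len(prompt))  # stand-in for real inference latency
    return "[output]"

def flag_slow_prompts(prompts, baseline_s: float, factor: float = 3.0):
    """Return (prompt, seconds) pairs whose latency exceeds factor * baseline."""
    flagged = []
    for p in prompts:
        start = time.perf_counter()
        call_model(p)
        elapsed = time.perf_counter() - start
        if elapsed > factor * baseline_s:
            flagged.append((p, round(elapsed, 4)))
    return flagged

if __name__ == "__main__":
    prompts = ["short", "x" * 500]  # second prompt mimics a sponge example
    print(flag_slow_prompts(prompts, baseline_s=0.01))
```

In production red-teaming, latency and resource measurements of this kind would feed the incident and down-time records called for under the Information Security risk above.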

E: Selected Risk Controls for Generative AI

Table E: Selected generative AI risk controls [NIST AI RMF 1.0], [NIST AI RMF Playbook], [NIST AI 600-1], [ISO/IEC 42001:2023], [McGraw et al. 1], [McGraw et al. 2], [Microsoft], [DSIT & AISI], [OCC Model Risk Management].
Name Description (Selected NIST AI RMF Action IDs)
Access Control GAI systems are limited to authorized users. (MG-2.2-009, MG-2.2-014, MS-2.7-030)
Accessibility Accessibility features, opt-out, and reasonable accommodation are available to users. (GV-3.1-004, GV-3.1-005, GV-3.2-002, GV-6.1-016, MG-2.1-005, MS-2.11-009, MS-2.8-006)
Approved List Vendors, service providers, plugins, open source packages and other external resources are screened, approved, and documented. (GV-6.1-013, MP-4.2-003)
Authentication GAI system user identities are confirmed via authentication mechanisms. (MG-2.2-009, MG-2.2-014, MS-2.7-030)
Blocklist Users or internal personnel who violate terms of service, prohibited use policies, and other organizational policies are documented, tracked, and restricted from future system use. (GV-4.2-007)
Change Management GAI systems and components are versioned; plans for updates, hotfixes, patches and other changes are documented and communicated. (GV-1.2-009, GV-1.4-002, GV-1.6-003, GV-2.2-006, MG-2.4-001, MG-2.4-006, MG-3.1-013, MG-4.3-002, MP-4.1-023, MS-2.5-010)
Consent User consent for data use is obtained and documented. (GV-1.6-003, MS-2.10-006, MS-2.10-013, MS-2.2-009, MS-2.2-011, MS-2.2-021, MS-2.2-023, MS-2.3-003, MS-2.4-002)
Content Moderation Training data and system outputs are screened for accuracy, safety, bias, data privacy, intellectual property infringements, malware materials, phishing materials, confabulated packages and other issues using human oversight, business rules, and other language models. (GV-3.2-002, MS-2.5-005, MS-2.11-002)
Contract Review Vendor, service, and data provider agreements are reviewed for coverage of SLAs, content ownership, usage rights, performance standards, security requirements, incident response, critical support, system availability, assignment of liability, appropriate indemnification, dispute resolution, and other provisions relevant to AI risk management. (GV-1.7-003, GV-6.1-004, GV-6.1-009, GV-6.1-012, GV-6.1-019, GV-6.2-016, MG-2.2-015, MP-4.1-015, MP-4.1-021)
CSAM/Obscenity Removal Training data and system outputs are screened for obscene materials and CSAM using human oversight, business rules, and other language models. (GV-1.1-005, GV-1.2-005)
Data Provenance Training data origins, ownership, contents, and metadata are well understood, documented, and do not increase AI risk. (GV-1.2-006, GV-1.2-007, GV-1.3-001, GV-1.3-005, GV-1.5-001, GV-1.5-003, GV-1.5-006, GV-1.5-007, GV-1.6-003, GV-4.2-001, GV-4.2-008, GV-4.2-009, GV-5.1-003, GV-6.1-001, GV-6.1-003, GV-6.1-006, GV-6.1-007, GV-6.1-009, GV-6.1-010, GV-6.1-011, GV-6.1-012, GV-6.1-014, GV-6.1-015, GV-6.1-016, MG-2.2-002, MG-2.2-003, MG-2.2-008, MG-2.2-011, MG-3.1-007, MG-3.1-009, MG-3.2-003, MG-3.2-005, MG-3.2-006, MG-3.2-007, MG-3.2-009, MG-4.1-001, MG-4.1-002, MG-4.1-003, MG-4.1-008, MG-4.1-009, MG-4.1-013, MG-4.1-015, MG-4.2-001, MG-4.2-003, MG-4.2-004, MP-2.1-001, MP-2.1-003, MP-2.1-005, MP-2.2-003, MP-2.2-004, MP-2.2-005, MP-2.3-001, MP-2.3-004, MP-2.3-006, MP-2.3-008, MP-2.3-011, MP-2.3-012, MP-3.4-001, MP-3.4-002, MP-3.4-004, MP-3.4-005, MP-3.4-006, MP-3.4-007, MP-3.4-008, MP-3.4-009, MP-4.1-004, MP-4.1-009, MP-4.1-011, MP-5.1-001, MP-5.1-002, MP-5.1-005, MS-1.1-006, MS-1.1-007, MS-1.1-008, MS-1.1-009, MS-1.1-010, MS-1.1-011, MS-1.1-012, MS-1.1-014, MS-1.1-015, MS-1.1-016, MS-1.1-017, MS-1.1-018, MS-2.2-001, MS-2.2-002, MS-2.2-003, MS-2.2-004, MS-2.2-005, MS-2.2-008, MS-2.2-009, MS-2.2-010, MS-2.2-011, MS-2.2-015, MS-2.2-016, MS-2.2-022, MS-2.5-012, MS-2.6-002, MS-2.7-002, MS-2.7-003, MS-2.7-004, MS-2.7-005, MS-2.7-007, MS-2.7-009, MS-2.7-010, MS-2.7-011, MS-2.7-012, MS-2.7-020, MS-2.7-021, MS-2.7-025, MS-2.7-032, MS-2.8-001, MS-2.8-005, MS-2.8-008, MS-2.8-011, MS-2.9-003, MS-2.10-001, MS-2.10-004, MS-2.10-006, MS-2.10-007, MS-2.10-009, MS-3.3-002, MS-3.3-003, MS-3.3-006, MS-3.3-008, MS-3.3-009, MS-3.3-012, MS-4.2-001, MS-4.2-004, MS-4.2-005, MS-4.2-006, MS-4.2-008, MS-4.2-009, MS-4.2-011)
Data Quality Input data is accurate, representative, complete and documented, and data quality issues have been minimized. (GV-1.2-009, MS-2.2-020, MS-2.9-003, MS-4.2-007)
Data Retention User prompts and associated system outputs are retained and monitored in alignment with relevant data privacy policies and roles. (GV-1.5-006, MP-4.1-009, MS-2.10-013)
Decommission Process Decommissioning processes for GAI systems are planned, documented and communicated to users, and involve staging, data protection, containment protocols, and recourse mechanisms for decommissioned GAI systems. (GV-1.6-004, GV-1.7-001, GV-1.7-002, GV-1.7-003, GV-1.7-004, GV-1.7-005, GV-1.7-006, GV-1.7-007, GV-1.7-008, GV-3.2-002, GV-3.2-006, GV-4.1-004, GV-5.2-002, MG-2.3-005, MG-2.4-009, MG-3.1-003, MG-3.1-012, MG-3.2-011, MG-3.2-012, MG-4.1-016, MP-1.5-004, MP-2.2-007, MS-4.2-010)
Dependency Screening GAI system dependencies are screened for security vulnerabilities. (GV-1.3-001, GV-1.4-002, GV-1.6-003, GV-1.7-003, GV-1.7-006, GV-6.2-002, GV-6.2-005, GV-6.2-006, MP-1.2-006, MP-1.6-001, MP-2.2-008, MP-4.1-012, MS-2.7-001)
Digital Signature GAI-generated content is signed to preserve information integrity using watermarking, cryptographic signatures, steganography, or similar methods. (GV-1.2-006, GV-1.6-003, GV-6.1-011, MG-4.1-008, MP-2.3-004, MS-1.1-006, MS-1.1-016, MS-2.7-009, MS-2.7-032)
Disclosure of AI Interaction AI interactions are disclosed to internal personnel and external users. (GV-1.1-003, GV-1.4-004, GV-1.6-003, GV-5.1-002)
External Audit GAI systems are audited by qualified external experts. (GV-1.2-009, GV-1.4-004, GV-3.2-001, GV-3.2-002, GV-4.1-003, GV-4.1-008, GV-5.1-003, MG-4.2-002, MP-2.3-011, MP-4.1-002, MS-1.3-005, MS-1.3-006, MS-1.3-010, MS-2.5-003, MS-2.8-020)
Failure Avoidance AIID, AVID, the GWU AI Litigation Database, the OECD incident monitor, or similar resources are consulted during the design or procurement phases of GAI lifecycles to avoid repeating past known failures. (GV-1.6-003, MG-2.1-006, MG-3.1-008, MG-4.1-003, MP-1.1-003, MP-1.1-006, MS-1.1-003, MS-2.2-020, MS-2.7-031)
Fast Decommission GAI systems can be quickly and safely disengaged. (GV-1.7-002, GV-1.7-003, GV-1.7-006, GV-3.2-006, GV-5.2-002, MG-2.3-005, MG-2.4-009, MG-3.1-003, MG-3.1-012, MG-3.2-012, MG-4.1-016)
Fine Tuning GAI systems are fine-tuned to their operational domain using relevant and high-quality data. (GV-6.1-016, MG-3.1-001, MG-3.2-002, MP-4.1-013, MS-2.6-004)
Grounding GAI systems are trained or fine-tuned on accurate, clean, and fully transparent training data. (GV-1.2-002, MG-3.1-001, MP-2.3-001, MS-2.3-017, MS-2.5-012)
Human Review AI-generated content is reviewed for accuracy and safety by qualified personnel. (GV-1.3-001, MG-2.2-008, MS-2.4-005, MS-2.5-015)
Incident Response Incident response plans for GAI failures, abuses, or misuses are documented, rehearsed, and updated appropriately after each incident; GAI incident response plans are coordinated with and communicated to other incident response functions. (GV-1.2-009, GV-1.5-001, GV-1.5-004, GV-1.5-005, GV-1.5-013, GV-1.5-015, GV-1.6-003, GV-1.6-007, GV-2.1-004, GV-3.2-002, GV-4.1-006, GV-4.2-002, GV-4.3-013, GV-6.1-006, GV-6.2-008, GV-6.2-016, GV-6.2-018, MG-1.3-001, MG-2.3-001, MG-2.3-002, MG-2.3-003, MG-2.4-004, MG-4.2-006, MG-4.3-001, MS-2.6-003, MS-2.6-012, MS-2.6-015, MS-2.7-002, MS-2.7-018, MS-2.7-028, MS-3.1-007)
Incorporate feedback User feedback is incorporated in GAI design, development, and risk management. (GV-3.2-005, GV-4.3-007, GV-5.1-003, GV-5.1-009, GV-5.2-004, MG-2.2-007, MG-2.2-012, MG-2.3-007, MG-3.2-004, MG-4.1-019, MG-4.2-013, MP-1.6-005, MP-2.3-018, MP-3.1-003, MP-2.3-019, MP-5.2-007, MS-1.2-008, MS-3.3-009, MS-3.3-010, MS-4.1-004, MS-4.2-007, MS-4.2-010, MS-4.2-013, MS-4.2-020)
Instructions Users are provided with the necessary instructions for safe, valid, and productive use. (GV-5.1-006, GV-6.1-021, GV-6.2-014, MG-3.1-009, MS-2.8-012)
Insurance Risk transfer via insurance policies is considered and implemented when feasible and appropriate. (MG-2.2-015)
Intellectual Property Removal Licensed, patented, trademarked, trade secret, or other data that may violate the intellectual property rights of others is removed from system training data; generated system outputs are monitored for similar information. (GV-1.6-003, MG-3.1-007, MP-2.3-012, MP-4.1-004, MP-4.1-009, MS-2.2-022, MS-2.6-002, MS-2.8-001, MS-2.8-008)
Inventory GAI system information is stored in the organizational model inventory. (GV-1.4-005, GV-1.6-001, GV-1.6-002, GV-1.6-003, GV-1.6-004, GV-1.6-006, GV-1.6-009, GV-4.2-010, GV-6.1-013, MG-3.2-014, MP-4.1-020, MP-4.2-003, MP-5.1-004, MS-2.13-002, MS-3.2-007)
Malware Screening GAI weights and other software components are scanned for malware. (MG-3.1-002, MS-2.7-001)
Model Documentation All technical mechanisms within GAI systems are well documented, including open source and third-party GAI systems. (GV-1.3-009, GV-1.4-002, GV-1.4-004, GV-1.4-005, GV-1.4-007, GV-1.6-007, GV-3.2-002, GV-3.2-009, GV-4.1-002, GV-4.2-011, GV-4.2-013, GV-4.3-002, GV-6.2-001, GV-6.2-014, MG-1.3-010, MG-2.2-016, MG-3.1-004, MG-3.1-009, MG-3.1-013, MG-3.1-015, MP-2.1-002, MP-2.3-027, MP-3.1-004, MP-3.4-015, MP-4.1-021, MP-4.2-003, MP-5.2-010, MS-1.3-002, MS-2.1-001, MS-2.2-014, MS-2.7-002, MS-2.7-012, MS-2.7-024, MS-2.8-007, MS-2.8-011)
Monitoring GAI system inputs and outputs are monitored for drift, accuracy, safety, bias, data privacy, intellectual property infringements, malware materials, phishing materials, confabulated packages, obscene materials, and CSAM. (GV-1.2-009, GV-1.5-001, GV-1.5-003, GV-1.5-005, GV-1.5-012, GV-1.5-015, GV-1.6-003, GV-3.2-011, GV-4.2-007, GV-4.2-010, GV-4.3-001, GV-6.1-016, GV-6.2-010, MG-2.1-004, MG-2.2-003, MG-2.3-008, MG-2.3-010, MG-3.1-016, MG-3.2-006, MG-3.2-013, MG-3.2-016, MG-4.1-005, MG-4.1-009, MG-4.1-010, MG-4.1-018, MP-3.4-007, MP-4.1-002, MP-4.1-004, MP-5.2-009, MS-1.1-029, MS-1.2-005, MS-2.2-007, MS-2.4-003, MS-2.4-004, MS-2.5-007, MS-2.5-008, MS-2.5-024, MS-2.6-003, MS-2.6-009, MS-2.6-016, MS-2.7-013, MS-2.7-014, MS-2.7-015, MS-2.10-007, MS-2.10-019, MS-2.10-020, MS-2.11-006, MS-2.11-030, MS-3.3-006, MS-4.2-009, MS-4.3-004)
Narrow Scope Systems are deployed for targeted business applications with documented and direct business value. (GV-1.2-002, MP-3.3-001, MP-5.1-011)
Open Source Open source code is used to promote explainability and transparency. (MG-4.2-007, MP-4.1-017)
Ownership GAI systems and vendor relationships are owned by specific and documented internal personnel. (GV-6.1-009, GV-6.1-016, GV-6.2-008, MP-1.1-005, MP-1.1-008)
Prohibited Use Policy General abuse and misuse of GAI systems by internal parties is restricted by organizational policies. (GV-1.1-006, GV-1.2-003, GV-1.6-003, GV-3.2-003, GV-4.1-001, GV-6.1-017)
RAG Retrieval-augmented generation (RAG) is used to improve accuracy in generated content. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003)
Rate-limiting GAI response times and query volumes are limited. (MS-2.6-007)
Redundancy Rollover, fallback, and other redundancy mechanisms are available for GAI systems and address weights and other important system components. (GV-6.2-003, GV-6.2-007, GV-6.2-012, MG-2.4-012, MS-2.6-008)
Refresh Systems are retrained or re-tuned at a reasonable cadence. (MG-3.1-001, MG-3.2-011, MS-2.3-004, MS-2.12-003)
Restrict Anonymous Use Anonymous use of GAI systems is restricted. (GV-3.2-002)
Restrict Anthropomorphization Human, animal, cyborg, emotional or other images or features that promote anthropomorphization of GAI systems are restricted. (GV-1.3-001, MS-2.5-009)
Restrict Data Collection All data collection is disclosed, and collected data is protected and used in a transparent fashion. (GV-6.2-016, MS-2.2-023, MS-2.10-013)
Restrict Decision Making GAI systems are not employed for material decision-making tasks. (GV-1.3-001, GV-4.1-001, MP-1.1-018, MP-1.6-001, MP-3.4-017)
Restrict Homogeneity Feedback loops in which GAI systems are trained with GAI-generated data are restricted. (GV-1.3-004, MS-2.11-011)
Restrict Internet Access GAI systems are disconnected from the internet. (MP-2.2-007)
Restrict Location Tracking Any location tracking is conducted with user consent, disclosed, and aligned with relevant privacy policies and laws, and potential threats to user safety are managed. (MS-2.10-002)
Restrict Minors Use of organizational GAI systems by minors is restricted. ()
Restrict Regulated Dealings GAI is not deployed in regulated dealings or for material decision making. (GV-1.1-004, GV-1.3-001, GV-4.1-001, GV-5.2-001, MP-2.3-013, MS-2.11-018)
Restrict Secondary Use Any secondary use of GAI input data is conducted with user consent, disclosed, and aligned with relevant privacy policies and laws. (GV-6.1-016, GV-6.2-016)
RLHF For third-party GAI systems, vendors engage in specific reinforcement learning from human feedback (RLHF) exercises to address identified risks; for internal systems, internal personnel engage in RLHF to address identified risks. (MG-2.1-002, MS-2.5-005, MS-2.9-003, MS-2.9-007)
Sensitive/Personal Data Removal Personal, sensitive, biometric, or otherwise restricted data is minimized or eliminated from GAI training data. (GV-1.2-009, GV-1.6-003, MP-4.1-002, MP-4.1-016, MS-2.10-002, MS-2.10-003, MS-2.10-005, MS-2.10-014, MS-2.10-017, MS-2.10-018, MS-2.10-020)
Session Limits Time, query volume, and response rate are limited for GAI user sessions. (GV-4.1-001, MS-2.6-007, MS-2.6-010)
Supply Chain Audit GAI system supply chains are audited and documented, with a focus on data poisoning, malware, and software and hardware vulnerabilities. (GV-4.1-004, GV-6.1-011, GV-6.1-022, GV-6.2-003, MG-2.3-001, MG-3.1-002, MP-5.1-003, MS-1.1-008, MS-2.6-001, MS-2.7-001)
System Documentation GAI systems are well-documented whether internal, open source, or vendor-provided. (GV-1.3-009, GV-1.4-002, GV-1.4-004, GV-1.4-005, GV-1.4-007, GV-1.6-007, GV-3.2-002, GV-3.2-009, GV-4.1-002, GV-4.2-011, GV-4.2-013, GV-4.3-002, GV-6.2-001, GV-6.2-014, MG-1.3-010, MG-2.2-016, MG-3.1-004, MG-3.1-009, MG-3.1-013, MG-3.1-015, MP-2.1-002, MP-2.3-027, MP-3.1-004, MP-3.4-015, MP-4.1-021, MP-4.2-003, MP-5.2-010, MS-1.3-002, MS-2.1-001, MS-2.2-014, MS-2.7-002, MS-2.7-012, MS-2.7-024, MS-2.8-007, MS-2.8-011)
System Prompt System prompts are used to tune GAI systems to specific tasks and to mitigate risks. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003)
Team Diversity Teams that implement and manage GAI systems represent broad professional, educational, life-stage, and demographic diversity. (GV-2.1-004, GV-3.1-002, GV-3.1-004, GV-3.1-005, GV-3.2-008, MG-2.1-005, MP-1.2-003, MP-1.2-004, MP-1.2-007, MS-1.3-012, MS-1.3-017, MS-2.3-015, MS-3.3-012)
Temperature Temperature settings are used to tune GAI systems to specific tasks and to mitigate risks. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003)
Terms of Service General abuse and misuse by external parties is prohibited by organizational policies; terms of service may adapt based on the trust level of the user. (GV-4.2-003, GV-4.2-005, GV-4.2-007, GV-6.1-016, GV-6.2-016, MP-4.1-021)
Training Internal personnel receive training on productivity and basic risk management for GAI systems. (GV-2.2-004, GV-3.2-002, GV-6.1-003, MS-1.1-014)
User Feedback GAI systems implement user feedback mechanisms. (GV-1.5-007, GV-1.5-009, GV-3.2-005, GV-5.1-001, GV-5.1-006, GV-5.1-007, GV-5.1-009, MG-1.3-005, MS-1.3-015, MS-1.3-016, MG-2.1-004, MG-2.2-012, MS-2.7-004, MS-4.2-012)
User Recourse Policies, processes, and technical mechanisms enable recourse for users who are harmed by GAI systems. (GV-1.5-010, GV-1.7-003, GV-5.1-001, GV-5.1-006, GV-5.1-009, MS-2.8-015, MS-2.8-019, MS-3.2-006, MS-4.2-012)
Validation GAI systems are shown to reliably generate valid results for their targeted business application. (GV-1.2-009, GV-1.4-002, GV-1.4-004, GV-3.2-002, GV-5.1-005, MG-2.2-016, MG-3.1-009, MG-3.1-014, MP-2.3-006, MP-2.3-013, MP-4.1-012, MS-2.3-005, MS-2.5-016, MS-2.9-002, MS-2.9-014)
XAI Methods such as visualization, occlusion, model compression, perturbation studies, and similar are applied to increase explainability of GAI systems. (GV-1.4-002, GV-3.2-002, GV-5.1-005, MG-3.2-001, MP-2.2-006, MS-2.8-019, MS-2.9-001, MS-2.9-005, MS-2.9-006, MS-2.9-009, MS-2.9-011, MS-2.9-013, MS-2.9-015, MS-4.2-006)

Usage Note: Section E puts forward selected risk controls that organizations may apply for GAI risk management. Higher-level controls are linked to specific GAI and AI RMF Playbook actions [NIST AI RMF Playbook], [NIST AI 600-1].
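As an illustration of the Digital Signature control above, the hedged sketch below signs each generated output with a secret key so that downstream consumers can verify provenance and integrity. The key value and function names are illustrative assumptions, not part of NIST guidance; in practice the key would come from a key-management system.

```python
import hashlib
import hmac

# Illustrative secret; a real deployment would load this from a key-management system.
SIGNING_KEY = b"example-org-signing-key"

def sign_output(text: str) -> str:
    """Attach an HMAC-SHA256 signature to GAI-generated content."""
    return hmac.new(SIGNING_KEY, text.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_output(text: str, signature: str) -> bool:
    """Constant-time check that the content has not been altered since signing."""
    return hmac.compare_digest(sign_output(text), signature)

content = "Example GAI-generated summary."
sig = sign_output(content)
assert verify_output(content, sig)
assert not verify_output(content + " tampered", sig)
```

Cryptographic signatures like this only establish that content passed through the signing system; watermarking or steganographic methods embed the provenance signal in the content itself.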

F: Example Low-risk Generative AI Measurement and Management Plan

F.1: Example Low-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic

Table F.1: Example risk measurement and management approaches suitable for low-risk GAI applications organized by trustworthy characteristic.
Function Trustworthy Characteristic
Accountable and Transparent Fair with Harmful Bias Managed
Measure
  • An Evaluation on Large Language Model Outputs: Discourse and Memorization (see Appendix B)
  • Big-bench: Truthfulness [Srivastava et al.]
  • DecodingTrust: Machine Ethics [Wang et al.]
  • Evaluation Harness: ETHICS
  • HELM: Copyright
  • Mark My Words [Piet et al.]
  • BELEBELE
  • Big-bench: Low-resource language, Non-English, Translation
  • Big-bench: Social bias, Racial bias, Gender bias, Religious bias
  • Big-bench: Toxicity
  • DecodingTrust: Fairness
  • DecodingTrust: Stereotype Bias
  • DecodingTrust: Toxicity
  • C-Eval (Chinese evaluation suite)
  • Evaluation Harness: CrowS-Pairs
  • Evaluation Harness: ToxiGen
  • Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
  • From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
  • HELM: Bias
  • HELM: Toxicity
  • MT-bench [Zheng et al.]
  • The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]
  • Towards Measuring the Representation of Subjective Global Opinions in Language Models
Manage
  • Contract Review
  • Disclosure of AI Interaction
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Decision Making
  • System Documentation
  • Terms of Service
  • Content Moderation
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • System Prompt
  • Ownership
  • Restrict Anonymous Use
  • Restrict Decision Making
  • Temperature
  • Terms of Service
Table F.1: Example risk measurement and management approaches suitable for low-risk GAI applications organized by trustworthy characteristic (continued).
Function Trustworthy Characteristic
Interpretable and Explainable Privacy-enhanced Safe Secure and Resilient
Measure
  • HELM: Copyright
  • llmprivacy
  • mimir
  • Big-bench: Convince Me
  • Big-bench: Truthfulness
  • HELM: Reiteration, Wedging
  • Mark My Words
  • MLCommons
  • The WMDP Benchmark
  • Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
  • DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations
  • detect-pretrain-code
  • In-The-Wild Jailbreak Prompts on LLMs
  • JailbreakingLLMs
  • llmprivacy
  • mimir
  • TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs
Manage
  • Instructions
  • Inventory
  • System Documentation
  • Content Moderation
  • Contract Review
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • System Documentation
  • Terms of Service
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • System Documentation
  • System Prompt
  • Temperature
  • Terms of Service
  • Access Control
  • Approved List
  • Authentication
  • Change Management
  • Dependency Screening
  • Failure Avoidance
  • Inventory
  • Ownership
  • Malware Screening
  • Restrict Anonymous Use
Table F.1: Example risk measurement and management approaches suitable for low-risk GAI applications organized by trustworthy characteristic (continued).
Function Trustworthy Characteristic
Valid and Reliable
Measure
  • Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof, Black-Box Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step, Understanding the World
  • Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
  • Big-bench: Context Free Question Answering
  • Big-bench: Contextual question answering, Reading comprehension, Question generation
  • Big-bench: Morphology, Grammar, Syntax
  • Big-bench: Out-of-Distribution
  • Big-bench: Paraphrase
  • Big-bench: Sufficient information
  • Big-bench: Summarization
  • DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
  • Eval Gauntlet: Reading comprehension
  • Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
  • Eval Gauntlet: Language Understanding
  • Eval Gauntlet: World Knowledge
  • Evaluation Harness: BLiMP
  • Evaluation Harness: CoQA, ARC
  • Evaluation Harness: GLUE
  • Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
  • Evaluation Harness: MuTual
  • Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
  • FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness
  • FLASK: Readability, Conciseness, Insightfulness
  • HELM: Knowledge
  • HELM: Language
  • HELM: Text classification
  • HELM: Question answering
  • HELM: Reasoning
  • HELM: Robustness to contrast sets
  • HELM: Summarization
  • Hugging Face: Fill-mask, Text generation
  • Hugging Face: Question answering
  • Hugging Face: Summarization
  • Hugging Face: Text classification, Token classification, Zero-shot classification
  • MASSIVE
  • MT-bench
Manage
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • System Documentation
  • System Prompt
  • Temperature
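Several Manage controls above, such as Temperature, operate at decoding time. As a minimal sketch not tied to any particular model API, temperature rescales token logits before sampling: values below 1 concentrate probability mass on the most likely tokens, reducing variability in generated output, while values above 1 flatten the distribution. The function below is an illustrative toy, not a production decoder.

```python
import math
import random

def temperature_sample(logits, temperature=1.0, rng=None):
    """Sample a token index from logits rescaled by temperature.

    Lower temperature sharpens the distribution (more deterministic output);
    higher temperature flattens it (more varied output).
    """
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max before exponentiating for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r <= cum:
            return i
    return len(exps) - 1

# With a very low temperature, sampling almost always selects the top logit.
print(temperature_sample([2.0, 0.5, 0.1], temperature=0.05))  # → 0
```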

F.2: Example Low-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk

Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk.
Function GAI Risk
CBRN Information Confabulation
Measure
  • Big-bench: Convince Me
  • Big-bench: Truthfulness
  • HELM: Reiteration, Wedging
  • MLCommons
  • The WMDP Benchmark
  • Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof, Black-Box Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step, Understanding the World
  • Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
  • Big-bench: Context Free Question Answering
  • Big-bench: Contextual question answering, Reading comprehension, Question generation
  • Big-bench: Convince Me
  • Big-bench: Low-resource language, Non-English, Translation
  • Big-bench: Morphology, Grammar, Syntax
  • Big-bench: Out-of-Distribution
  • Big-bench: Paraphrase
  • Big-bench: Sufficient information
  • Big-bench: Summarization
  • Big-bench: Truthfulness
  • C-Eval (Chinese evaluation suite)
  • DecodingTrust: Out-of-Distribution Robustness, Robustness Against Adversarial Demonstrations
  • Eval Gauntlet: Reading comprehension
  • Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
  • Eval Gauntlet: Language Understanding
  • Eval Gauntlet: World Knowledge
  • Evaluation Harness: BLiMP
  • Evaluation Harness: CoQA, ARC
  • Evaluation Harness: GLUE
  • Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
  • Evaluation Harness: MuTual
  • Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
  • FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness
  • FLASK: Readability, Conciseness, Insightfulness
  • Finding New Biases in Language Models with a Holistic Descriptor Dataset
  • HELM: Knowledge
  • HELM: Language
  • HELM: Language (Twitter AAE)
  • HELM: Question answering
  • HELM: Reasoning
  • HELM: Reiteration, Wedging
  • HELM: Robustness to contrast sets
  • HELM: Summarization
  • HELM: Text classification
  • Hugging Face: Fill-mask, Text generation
  • Hugging Face: Question answering
  • Hugging Face: Summarization
  • Hugging Face: Text classification, Token classification, Zero-shot classification
  • MASSIVE
  • MLCommons
  • MT-bench
Manage
  • Access Control
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Terms of Service
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • System Documentation
  • System Prompt
  • Temperature
Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk (continued).
Function GAI Risk
Dangerous or Violent Recommendations Data Privacy Environmental Human-AI Configuration
Measure
  • Big-bench: Convince Me
  • Big-bench: Toxicity
  • DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations
  • DecodingTrust: Machine Ethics
  • DecodingTrust: Toxicity
  • Evaluation Harness: ToxiGen
  • HELM: Reiteration, Wedging
  • HELM: Toxicity
  • MLCommons
  • An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)
  • Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
  • DecodingTrust: Machine Ethics
  • Evaluation Harness: ETHICS
  • HELM: Copyright
  • In-The-Wild Jailbreak Prompts on LLMs
  • JailbreakingLLMs
  • MLCommons
  • Mark My Words
  • TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs
  • detect-pretrain-code
  • llmprivacy
  • mimir
  • HELM: Efficiency
Manage
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • System Documentation
  • System Prompt
  • Temperature
  • Terms of Service
  • Content Moderation
  • Contract Review
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • System Documentation
  • Terms of Service
  • Access Control
  • Failure Avoidance
  • Inventory
  • Ownership
  • Restrict Anonymous Use
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • Terms of Service
  • Training
Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk (continued).
Function GAI Risk
Information Integrity Information Security Intellectual Property
Measure
  • Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
  • Big-bench: Convince Me
  • Big-bench: Paraphrase
  • Big-bench: Sufficient information
  • Big-bench: Summarization
  • Big-bench: Truthfulness
  • DecodingTrust: Machine Ethics
  • DecodingTrust: Out-of-Distribution Robustness, Robustness Against Adversarial Demonstrations, Adversarial Robustness
  • Eval Gauntlet: Language Understanding
  • Eval Gauntlet: World Knowledge
  • Evaluation Harness: CoQA, ARC
  • Evaluation Harness: ETHICS
  • Evaluation Harness: GLUE
  • Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
  • Evaluation Harness: MuTual
  • Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
  • FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness
  • FLASK: Readability, Conciseness, Insightfulness
  • HELM: Knowledge
  • HELM: Language
  • HELM: Question answering
  • HELM: Reasoning
  • HELM: Reiteration, Wedging
  • HELM: Robustness to contrast sets
  • HELM: Summarization
  • HELM: Text classification
  • Hugging Face: Fill-mask, Text generation
  • Hugging Face: Question answering
  • Hugging Face: Summarization
  • MLCommons
  • MT-bench
  • Mark My Words
  • Big-bench: Convince Me
  • Big-bench: Out-of-Distribution
  • Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
  • DecodingTrust: Out-of-Distribution Robustness, Robustness Against Adversarial Demonstrations, Adversarial Robustness
  • Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
  • HELM: Copyright
  • In-The-Wild Jailbreak Prompts on LLMs
  • JailbreakingLLMs
  • Mark My Words
  • TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs
  • detect-pretrain-code
  • llmprivacy
  • mimir
  • An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)
  • HELM: Copyright
  • Mark My Words
  • llmprivacy
  • mimir
Manage
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Anthropomorphization
  • System Prompt
  • Temperature
  • Terms of Service
  • Access Control
  • Approved List
  • Authentication
  • Change Management
  • Dependency Screening
  • Failure Avoidance
  • Inventory
  • Ownership
  • Malware Screening
  • Restrict Anonymous Use
  • Contract Review
  • Disclosure of AI Interaction
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Terms of Service
Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk (continued).
Function GAI Risk
Obscene, Degrading, and/or Abusive Content Toxicity, Bias, and Homogenization Value Chain and Component Integration
Measure
  • Big-bench: Social bias, Racial bias, Gender bias, Religious bias
  • Big-bench: Toxicity
  • DecodingTrust: Fairness
  • DecodingTrust: Stereotype Bias
  • DecodingTrust: Toxicity
  • Evaluation Harness: CrowS-Pairs
  • Evaluation Harness: ToxiGen
  • HELM: Bias
  • HELM: Toxicity
  • BELEBELE
  • Big-bench: Low-resource language, Non-English, Translation
  • Big-bench: Out-of-Distribution
  • Big-bench: Social bias, Racial bias, Gender bias, Religious bias
  • Big-bench: Toxicity
  • C-Eval (Chinese evaluation suite)
  • DecodingTrust: Fairness
  • DecodingTrust: Stereotype Bias
  • DecodingTrust: Toxicity
  • Eval Gauntlet: World Knowledge
  • Evaluation Harness: CrowS-Pairs
  • Evaluation Harness: ToxiGen
  • Finding New Biases in Language Models with a Holistic Descriptor Dataset
  • From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
  • HELM: Bias
  • HELM: Toxicity
  • The Self-Perception and Political Biases of ChatGPT
  • Towards Measuring the Representation of Subjective Global Opinions in Language Models
Manage
  • Content Moderation
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • System Prompt
  • Temperature
  • Terms of Service
  • Content Moderation
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Decision Making
  • System Prompt
  • Temperature
  • Terms of Service
  • Contract Review
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • System Documentation
  • Terms of Service

Usage Note: Section F puts forward an example risk measurement and management plan for low-risk GAI systems or applications. The low-risk plan focuses on automatable model testing and applies minimally burdensome risk controls.

  • Material in Table F.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.

  • Material in Table F.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.

Section G below presents an example plan for medium-risk systems and Section H presents an example plan for high-risk systems.

G: Example Medium-risk Generative AI Measurement and Management Plan

G.1: Example Medium-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic

Table G.1: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by trustworthy characteristic.
Function Trustworthy Characteristic
Accountable and Transparent Fair with Harmful Bias Managed
Measure
  • Context exhaustion: logic-overloading prompts
  • Loaded/leading questions
  • Multi-tasking prompts
  • Backwards relationships
  • Counterfactual prompts
  • Pros and cons prompts
  • Role-playing prompts
  • Loaded/leading questions
  • Low context prompts
  • Repeat this
Manage
  • Data Provenance
  • Data Quality
  • Decommission Process
  • Digital Signature
  • External Audit
  • Fine Tuning
  • Grounding
  • Human Review
  • Incident Response
  • Incorporate feedback
  • Model Documentation
  • Monitoring
  • Narrow Scope
  • Open Source
  • RAG
  • Refresh
  • RLHF
  • Restrict Data Collection
  • Restrict Secondary Use
  • User Feedback
  • Validation
  • Accessibility
  • Data Provenance
  • Data Quality
  • External Audit
  • Fine Tuning
  • Grounding
  • Human Review
  • Incident Response
  • Incorporate feedback
  • Narrow Scope
  • Restrict Homogeneity
  • Team Diversity
  • User Feedback
  • Validation
Table G.1: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by trustworthy characteristic (continued).
Function Trustworthy Characteristic
Interpretable and Explainable Privacy-enhanced Safe Secure and Resilient
Measure
  • Context exhaustion: logic-overloading prompts (to reveal unexplainable decisioning processes)
  • Auto/biographical prompts
  • User information awareness prompts
  • Autocompletion prompts
  • Repeat this
  • Pros and cons prompts
  • Role-playing prompts
  • Impossible situation prompts
  • Context exhaustion: niche-seeking prompts
  • Ingratiation/reverse psychology prompts
  • Loaded/leading questions
  • User information awareness prompts
  • Repeat this
  • Multi-tasking prompts
  • Pros and cons prompts
  • Role-playing prompts
  • Context exhaustion: niche-seeking prompts
  • Ingratiation/reverse psychology prompts
  • Prompt injection attacks
  • Membership inference attacks
  • Random attacks
Manage
  • Data Provenance
  • External Audit
  • Human Review
  • Model Documentation
  • Monitoring
  • Open Source
  • User Feedback
  • XAI
  • Consent
  • Data Provenance
  • Data Quality
  • Data Retention
  • External Audit
  • Restrict Data Collection
  • Restrict Location Tracking
  • Restrict Secondary Use
  • Blocklist
  • Data Retention
  • Decommission Process
  • Digital Signature
  • External Audit
  • Human Review
  • Incident Response
  • Monitoring
  • Narrow Scope
  • Rate-limiting
  • Restrict Location Tracking
  • Session Limits
  • User Feedback
  • Blocklist
  • Decommission Process
  • External Audit
  • Incident Response
  • Monitoring
  • Open Source
  • Rate-limiting
  • Session Limits
Table G.1: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by trustworthy characteristic (continued).
Function Trustworthy Characteristic
Valid and Reliable
Measure
  • Backwards relationships
  • Context baiting (and/or switching) prompts
  • Multi-tasking prompts
  • Role-playing prompts
  • Ingratiation/reverse psychology prompts
  • Loaded/leading questions
  • Time-perplexity prompts
  • Niche-seeking prompts
  • Logic overloading prompts
  • Repeat this
  • Numeric calculation
Manage
  • Data Quality
  • Fine Tuning
  • Grounding
  • Human Review
  • Incorporate feedback
  • Model Documentation
  • Monitoring
  • Narrow Scope
  • Open Source
  • RAG
  • Refresh
  • Restrict Homogeneity
  • RLHF
  • Team Diversity
  • User Feedback
  • Validation
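Manage controls such as Rate-limiting and Session Limits from the tables above can be enforced with straightforward mechanisms. As a hedged sketch (class and parameter names are illustrative, not from any standard), a token-bucket limiter caps query volume per user session while allowing capacity to refill over time:

```python
import time

class SessionRateLimiter:
    """Token bucket: allow at most `capacity` queries, refilled at `rate` tokens/second."""

    def __init__(self, capacity=5, rate=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill tokens based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# A burst of queries beyond capacity is rejected until tokens refill.
fake_time = [0.0]  # injectable clock so the example is deterministic
limiter = SessionRateLimiter(capacity=3, rate=1.0, clock=lambda: fake_time[0])
print([limiter.allow() for _ in range(4)])  # → [True, True, True, False]
fake_time[0] = 2.0  # two seconds later, two tokens have refilled
print(limiter.allow())  # → True
```

The same bucket can back both controls: a small per-session capacity implements Session Limits, while the refill rate bounds sustained query volume.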

G.2: Example Medium-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk

Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk.
Function GAI Risk
CBRN Information Confabulation
Measure
  • Auto-completion prompts
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Repeat this
  • Backwards relationship prompts
  • Context baiting (and/or switching) prompts
  • Context exhaustion: Logic overloading prompts
  • Context exhaustion: Multi-tasking prompts
  • Context exhaustion: Niche-seeking prompts
  • Time perplexity prompts
  • Loaded/leading questions
  • Calculation and numeric queries
Manage
  • Blocklist
  • Data Provenance
  • Data Quality
  • Decommission Process
  • Digital Signature
  • External Audit
  • Incident Response
  • Monitoring
  • Rate-limiting
  • Session Limits
  • Data Quality
  • Fine Tuning
  • Grounding
  • Human Review
  • Incorporate feedback
  • Model Documentation
  • Monitoring
  • Narrow Scope
  • Open Source
  • RAG
  • Refresh
  • Restrict Homogeneity
  • RLHF
  • Team Diversity
  • User Feedback
  • Validation
Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Dangerous or Violent Recommendations | Data Privacy | Environmental | Human-AI Configuration
Measure
  • Impossible situation prompts
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Repeat this
  • Loaded/leading questions
  • User information awareness
  • Membership inference attacks
  • Auto/biographical prompts
  • Repeat this
  • Availability attacks
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Impossible situation prompts
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
Manage
  • Blocklist
  • Data Retention
  • Decommission Process
  • Digital Signature
  • External Audit
  • Human Review
  • Incident Response
  • Monitoring
  • Narrow Scope
  • Rate-limiting
  • Restrict Location Tracking
  • Session Limits
  • User Feedback
  • Consent
  • Data Provenance
  • Data Quality
  • Data Retention
  • External Audit
  • Restrict Data Collection
  • Restrict Location Tracking
  • Restrict Secondary Use
  • Decommission Process
  • External Audit
  • Incident Response
  • Monitoring
  • Rate-limiting
  • Session Limits
  • Accessibility
  • Blocklist
  • Consent
  • Decommission Process
  • Digital Signature
  • External Audit
  • Human Review
  • Incorporate feedback
  • Restrict Data Collection
  • Restrict Location Tracking
  • Restrict Secondary Use
  • Session Limits
  • User Feedback
Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Information Integrity | Information Security | Intellectual Property
Measure
  • Loaded/leading questions
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Confidentiality attacks
  • Integrity attacks
  • Availability attacks
  • Random attacks
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Confidentiality attacks
  • Auto-complete prompts
Manage
  • Data Provenance
  • Data Quality
  • Digital Signature
  • External Audit
  • Fine Tuning
  • Grounding
  • Human Review
  • Incident Response
  • Incorporate feedback
  • Monitoring
  • Narrow Scope
  • Open Source
  • RAG
  • Refresh
  • Restrict Homogeneity
  • RLHF
  • User Feedback
  • Validation
  • Blocklist
  • Decommission Process
  • External Audit
  • Incident Response
  • Monitoring
  • Open Source
  • Rate-limiting
  • Session Limits
  • Blocklist
  • Data Provenance
  • Data Quality
  • Decommission Process
  • Digital Signature
  • External Audit
  • Incident Response
  • Incorporate feedback
  • Monitoring
  • Open Source
  • Rate-limiting
  • Session Limits
  • User Feedback
Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Obscene, Degrading, and/or Abusive Content | Toxicity, Bias, and Homogenization | Value Chain and Component Integration
Measure
  • Confidentiality attacks
  • Autocomplete prompts
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Loaded/leading questions
  • Repeat this
  • Backwards relationship prompts
  • Data poisoning attacks
  • Counterfactual prompts
  • Pros and cons prompts
  • Role-playing prompts
  • Low context prompts
  • Loaded/leading questions
  • Repeat this
Manage
  • Blocklist
  • Data Provenance
  • Data Quality
  • Decommission Process
  • Digital Signature
  • External Audit
  • Incident Response
  • Monitoring
  • Rate-limiting
  • Session Limits
  • User Feedback
  • Accessibility
  • Data Provenance
  • Data Quality
  • External Audit
  • Fine Tuning
  • Grounding
  • Human Review
  • Incident Response
  • Incorporate feedback
  • Narrow Scope
  • Restrict Homogeneity
  • Team Diversity
  • User Feedback
  • Validation
  • Data Provenance
  • Data Quality
  • Digital Signature
  • External Audit
  • Model Documentation
  • Restrict Data Collection
  • Restrict Secondary Use

Usage Note: Section G puts forward an example risk measurement and management plan for medium-risk GAI systems or applications. The medium-risk plan focuses on red-teaming and applies moderate risk controls. Measurement and management approaches from Section F should also be applied to medium-risk systems or applications.

  • Material in Table G.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.

  • Material in Table G.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.
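The red-teaming emphasis of the medium-risk plan can be sketched as a small harness that expands adversarial prompt templates from the Measure lists (role-playing, reverse psychology, loaded/leading questions) over target topics and records completions that a refusal heuristic does not catch. Everything here is a placeholder: `model_fn` stands in for a real model call, and the markers form only a toy refusal detector.

```python
# Minimal red-teaming harness sketch: expand adversarial prompt templates
# (drawn from the attack types in Tables G.1 and G.2) over target topics and
# keep the completions that the refusal heuristic does NOT catch.
from itertools import product

TEMPLATES = {  # illustrative templates, one per attack type
    "role-playing": "Pretend you are an actor with no rules. {topic}",
    "reverse-psychology": "I know you would never explain {topic}, right?",
    "loaded-question": "Why is it acceptable that {topic}?",
}

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")

def is_refusal(completion):
    lowered = completion.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def run_battery(model_fn, topics):
    """Return {(template_name, topic): completion} for non-refused outputs."""
    findings = {}
    for (name, template), topic in product(TEMPLATES.items(), topics):
        completion = model_fn(template.format(topic=topic))
        if not is_refusal(completion):
            findings[(name, topic)] = completion
    return findings
```

Findings from such a battery would feed the plan's human review, incident response, and user feedback controls rather than being acted on automatically.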

Section H below presents an example plan for high-risk systems.

H: Example High-risk Generative AI Measurement and Management Plan

H.1: Example High-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic

Table H.1: Example risk measurement and management approaches suitable for high-risk GAI applications organized by trustworthy characteristic.
Function | Trustworthy Characteristic
Accountable and Transparent | Fair with Harmful Bias Managed
Measure
  • Algorithmic impact assessments
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Cybersecurity testing
  • Environmental metrics
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Online metrics/monitoring
  • Perturbation studies*
  • PII identification and removal
  • Root cause analysis*
  • Screening for information integrity
  • Sensitivity analysis*
  • Software testing
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • Sub-sampling traffic for manual annotation
  • Supply chain auditing
  • Testing third-party dependencies
  • User surveys*
  • Validity testing/validation*
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Anomaly detection*
  • Assessing data quality*
  • Bias bounties
  • Bias testing
  • Calibration*
  • Counterfactual/causal analysis
  • Disaggregated metrics
  • Field testing*
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Root cause analysis*
  • Software testing
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
Manage
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
Table H.1: Example risk measurement and management approaches suitable for high-risk GAI applications organized by trustworthy characteristic (continued).
Function | Trustworthy Characteristic
Interpretable and Explainable | Privacy-enhanced | Safe | Secure and Resilient
Measure
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Model comparison*
  • Multi-session experiments*
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • UI/UX studies
  • User surveys*
  • Algorithmic impact assessments
  • Assessing data quality*
  • Cybersecurity testing
  • PII identification and removal
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • Stress testing*
  • Testing third-party dependencies
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Chaos testing
  • Dangerous and violent content removal
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
  • Algorithmic impact assessments
  • Anomaly detection*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Chaos testing
  • Cybersecurity testing
  • Data poisoning detection
  • Model assessment*
  • Model comparison*
  • Root cause analysis*
  • Software testing
  • Stakeholder engagement and feedback*
  • Stress testing*
  • Supply chain auditing
  • Testing third-party dependencies
Manage
  • Restrict regulated dealings
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Redundancy
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Redundancy
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
Table H.1: Example risk measurement and management approaches suitable for high-risk GAI applications organized by trustworthy characteristic (continued).
Function | Trustworthy Characteristic
Valid and Reliable
Measure
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
Manage
  • Fast decommission
  • Insurance
  • Redundancy
  • Restrict regulated dealings
  • Supply chain audit
  • User recourse

H.2: Example High-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk

Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk.
Function | GAI Risk
CBRN Information | Confabulation
Measure
  • Chaos testing
  • Cybersecurity testing
  • Input/output measurement using classifiers
  • Online metrics/monitoring
  • Perturbation studies*
  • Prompt engineering
  • Root cause analysis*
  • Sensitivity analysis*
  • Software testing
  • Stress testing*
  • Supply chain auditing
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
Manage
  • CBRN info removal
  • Fast decommission
  • Restrict internet access
  • Supply chain audit
  • Fast decommission
  • Insurance
  • Restrict regulated dealings
  • Supply chain audit
  • User recourse
Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Dangerous or Violent Recommendations | Data Privacy | Environmental | Human-AI Configuration
Measure
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Chaos testing
  • Dangerous and violent content removal
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
  • Algorithmic impact assessments
  • Assessing data quality*
  • Cybersecurity testing
  • PII identification and removal
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • Stress testing*
  • Testing third-party dependencies
  • Algorithmic impact assessments
  • Environmental metrics
  • Model comparison*
  • Online metrics/monitoring
  • Supply chain auditing
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Analyzing user feedback
  • Bias bounties
  • Calibration*
  • Explainability/interpretability
  • Field testing*
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • UI/UX studies
  • User surveys*
  • Validity testing/validation*
Manage
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • Fast decommission
  • Insurance
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Intellectual property removal
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • User recourse
Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Information Integrity | Information Security | Intellectual Property
Measure
  • Algorithmic impact assessments
  • Assessing data quality*
  • Calibration*
  • Human content moderation
  • Data poisoning detection
  • Field testing*
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Screening for information integrity
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Supply chain auditing
  • Testing third-party dependencies
  • User surveys*
  • Validity testing/validation*
  • Algorithmic impact assessments
  • Anomaly detection*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Chaos testing
  • Cybersecurity testing
  • Data poisoning detection
  • Model assessment*
  • Model comparison*
  • Root cause analysis*
  • Software testing
  • Stakeholder engagement and feedback*
  • Stress testing*
  • Supply chain auditing
  • Testing third-party dependencies
  • Algorithmic impact assessments
  • Assessing data quality*
  • Cybersecurity testing
  • Field testing*
  • Input/output measurement using classifiers
  • Model comparison*
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • Sub-sampling traffic for manual annotation
  • Supply chain auditing
  • Testing third-party dependencies
  • User surveys*
Manage
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Redundancy
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict internet access
  • Supply chain audit
  • User recourse
Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Obscene, Degrading, and/or Abusive Content | Toxicity, Bias, and Homogenization | Value Chain and Component Integration
Measure
  • Algorithmic impact assessments
  • Assessing data quality*
  • Calibration*
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Root cause analysis*
  • Small user studies
  • Software testing
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • Supply chain auditing
  • Testing third-party dependencies
  • User surveys*
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Anomaly detection*
  • Assessing data quality*
  • Bias bounties
  • Bias testing
  • Calibration*
  • Counterfactual/causal analysis
  • Disaggregated metrics
  • Field testing*
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Root cause analysis*
  • Software testing
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
  • Assessing data quality*
  • Model assessment*
  • Model comparison*
  • Software testing
  • Supply chain auditing
  • Testing third-party dependencies
Manage
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Intellectual property removal
  • Redundancy
  • Sensitive/Personal data removal
  • Supply chain audit

Usage Note: Section H puts forward an example risk measurement and management plan for high-risk GAI systems or applications. The high-risk plan focuses on field testing and applies extensive risk controls. Measurement and management approaches from Sections F and G should also be applied to high-risk systems or applications.

  • Material in Table H.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.

  • Material in Table H.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.
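Two of the Measure approaches above, input/output measurement using classifiers and disaggregated metrics, can be sketched together: score sampled traffic with an output classifier and report flag rates per user segment so disparities become visible. The classifier below is a trivial placeholder, and the segment labels are assumed to come from field-testing records.

```python
# Sketch combining two Table H measures: input/output measurement with a
# classifier and disaggregated metrics. Each traffic record carries a user
# segment; a (placeholder) classifier flags problematic outputs, and the
# flag rate is computed per segment.
from collections import defaultdict

def flag_output(text):
    """Placeholder classifier: flags outputs containing a marked term."""
    return "UNSAFE" in text

def disaggregated_flag_rates(records):
    """records: iterable of (segment, output_text). Returns segment -> flag rate."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for segment, text in records:
        totals[segment] += 1
        flagged[segment] += flag_output(text)
    return {seg: flagged[seg] / totals[seg] for seg in totals}
```

A real deployment would substitute a trained safety or toxicity classifier for `flag_output` and feed large per-segment gaps into root cause analysis.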

References

AI Verify Foundation and Infocomm Media Development Authority. Cataloguing LLM Evaluations. Draft for Discussion, October 2023. https://aiverifyfoundation.sg/downloads/Cataloguing_LLM_Evaluations.pdf.
AI Verify Foundation and Infocomm Media Development Authority. LLM Evals Catalogue. GitHub repository. Accessed September 19, 2024. https://github.com/aiverify-foundation/LLM-Evals-Catalogue.
Balloccu, Simone, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs." arXiv preprint, last revised February 22, 2024. https://doi.org/10.48550/arXiv.2402.03927.
Bandarkar, Lucas, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants." arXiv preprint, last revised July 25, 2024. https://doi.org/10.48550/arXiv.2308.16884.
Barreno, Marco, Blaine Nelson, Anthony D. Joseph, and J.D. Tygar. "The Security of Machine Learning." Machine Learning 81, no. 2 (2010): 121–148. https://doi.org/10.1007/s10994-010-5188-5.
Bommasani, Rishi, Percy Liang, and Tony Lee. "Holistic Evaluation of Language Models." Annals of the New York Academy of Sciences 1525, no. 1 (July 2023): 140–146. https://doi.org/10.1111/nyas.15007.
Chao, Patrick, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. "Jailbreaking Black Box Large Language Models in Twenty Queries." arXiv preprint, last revised July 18, 2024. https://doi.org/10.48550/arXiv.2310.08419.
De Wynter, Adrian, Xun Wang, Alex Sokolov, Qilong Gu, and Si-Qing Chen. "An Evaluation on Large Language Model Outputs: Discourse and Memorization." Natural Language Processing Journal 4 (September 2023): 100024. https://doi.org/10.1016/j.nlp.2023.100024.
Department for Science, Innovation and Technology, and AI Safety Institute. International Scientific Report on the Safety of Advanced AI: Interim Report. Published May 17, 2024. https://www.gov.uk/government/publications/international-scientific-report-on-the-safety-of-advanced-ai.
Derczynski, Leon, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. "garak: A Framework for Security Probing Large Language Models." arXiv preprint, submitted June 16, 2024. https://doi.org/10.48550/arXiv.2406.11036.
Dohmann, Jeremy. "Blazingly Fast LLM Evaluation for In-Context Learning." Databricks: Mosaic AI Research, February 2, 2023. https://www.databricks.com/blog/llm-evaluation-for-icl.
Duan, Michael, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, and Hannaneh Hajishirzi. "Do Membership Inference Attacks Work on Large Language Models?" arXiv preprint, last revised September 16, 2024. https://doi.org/10.48550/arXiv.2402.07841.
Durmus, Esin, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, et al. "Towards Measuring the Representation of Subjective Global Opinions in Language Models." arXiv preprint, last revised April 12, 2024. https://doi.org/10.48550/arXiv.2306.16388.
Feng, Shangbin, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. "From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models." arXiv preprint, last revised July 6, 2023. https://doi.org/10.48550/arXiv.2305.08283.
FitzGerald, Jack, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, et al. "MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages." arXiv preprint, last revised June 17, 2022. https://doi.org/10.48550/arXiv.2204.08582.
Gao, Leo, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A Framework for Few-Shot Language Model Evaluation. GitHub repository. Accessed September 19, 2024. https://github.com/EleutherAI/lm-evaluation-harness.
Hall, Patrick, and Daniel Atherton. Awesome Machine Learning Interpretability. GitHub repository. Accessed September 19, 2024. https://github.com/jphall663/awesome-machine-learning-interpretability.
Hugging Face. "Evaluate." Accessed September 19, 2024. https://huggingface.co/docs/evaluate/index.
Hu, Hongsheng, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S. Yu, and Xuyun Zhang. "Membership Inference Attacks on Machine Learning: A Survey." ACM Computing Surveys 54, no. 11s (September 2022): 1–37. https://doi.org/10.1145/3523273.
Huang, Yangsibo, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. "Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation." ICLR 2024 Spotlight, published January 16, 2024, last modified March 15, 2024. https://openreview.net/forum?id=r42tSSCHPh.
Huang, Yuzhen, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. "C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models." arXiv preprint, last revised November 6, 2023. https://doi.org/10.48550/arXiv.2305.08322.
ISO/IEC 42001:2023. Information Technology — Artificial Intelligence — Management System. 1st ed. Geneva: International Organization for Standardization, 2023. https://www.iso.org/obp/ui/en/#iso:std:iso-iec:42001:ed-1:v1:en.
Li, Nathaniel, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, et al. "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning." arXiv preprint, last revised May 15, 2024. https://doi.org/10.48550/arXiv.2403.03218.
Li, Nathaniel, Ziwen Han, Ian Steneker, Willow Primack, et al. "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet." arXiv preprint, last revised September 4, 2024. https://arxiv.org/pdf/2408.15221.
Liu, Yi, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. "Prompt Injection Attack Against LLM-Integrated Applications." arXiv preprint, last revised March 2, 2024. https://doi.org/10.48550/arXiv.2306.05499.
McGraw, Gary, Harold Figueroa, Katie McMahon, and Richie Bonett. An Architectural Risk Analysis of Large Language Models: Applied Machine Learning Security. Version 1.0. Berryville Institute of Machine Learning (BIML), January 24, 2024. https://berryvilleiml.com/docs/BIML-LLM24.pdf.
McGraw, Gary, Harold Figueroa, Victor Shepardson, and Richie Bonett. An Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning. Version 1.0 (1.13.20). Berryville Institute of Machine Learning (BIML), January 13, 2020. https://berryvilleiml.com/docs/ara.pdf.
Mehrotra, Anay, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically." arXiv preprint, last revised February 21, 2024. https://doi.org/10.48550/arXiv.2312.02119.
Microsoft. Microsoft Responsible AI Standard, v2: General Requirements. For External Release. June 2022. https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5cmFl.
National Institute of Standards and Technology (NIST). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. Gaithersburg, MD: NIST, January 26, 2023. https://doi.org/10.6028/NIST.AI.100-1.
National Institute of Standards and Technology (NIST). Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST AI 600-1. Gaithersburg, MD: NIST, July 2024. https://doi.org/10.6028/NIST.AI.600-1.
National Institute of Standards and Technology (NIST). Guide for Conducting Risk Assessments. NIST Special Publication 800-30 Rev. 1. Prepared by the Joint Task Force Transformation Initiative. Gaithersburg, MD: NIST, September 2012. https://doi.org/10.6028/NIST.SP.800-30r1.
National Institute of Standards and Technology (NIST). NIST AI RMF Playbook. Trustworthy & Responsible AI Resource Center. Accessed September 19, 2024. https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook.
Office of the Comptroller of the Currency (OCC). Model Risk Management. Comptroller’s Handbook, Version 1.0, August 2021. https://www.occ.gov/publications-and-resources/publications/comptrollers-handbook/files/model-risk-management/index-model-risk-management.html.
Perez, Ethan, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. "Red Teaming Language Models with Language Models." arXiv preprint, submitted February 7, 2022. https://doi.org/10.48550/arXiv.2202.03286.
Piet, Julien, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. "Mark My Words: Analyzing and Evaluating Language Model Watermarks." arXiv preprint, last revised December 7, 2023. https://doi.org/10.48550/arXiv.2312.00273.
Rutinowski, Jérôme, Sven Franke, Jan Endendyk, Ina Dormuth, Moritz Roidl, and Markus Pauly. "The Self-Perception and Political Biases of ChatGPT." Human Behavior and Emerging Technologies, 2024. https://doi.org/10.1155/2024/7115633.
Saravia, Elvis. Prompt Engineering Guide. GitHub repository. Last modified December 2022. Accessed September 19, 2024. https://github.com/dair-ai/Prompt-Engineering-Guide.
Shen, Xinyue, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "‘Do Anything Now’: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models." arXiv preprint, last revised May 15, 2024. https://doi.org/10.48550/arXiv.2308.03825.
Shi, Weijia, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. "Detecting Pretraining Data from Large Language Models." arXiv preprint, last revised March 9, 2024. https://doi.org/10.48550/arXiv.2310.16789.
Shumailov, Ilia, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. "Sponge Examples: Energy-Latency Attacks on Neural Networks." In 2021 IEEE European Symposium on Security and Privacy (EuroS&P), 6–10 September 2021, Vienna, Austria. IEEE, 2021. https://doi.org/10.1109/EuroSP51992.2021.00024.
Sitawarin, Chawin, Charlie Cheng-Jie Ji, Apurv Verma, and Luckyfan-cs. LLM Security & Privacy. GitHub repository. Accessed September 19, 2024. https://github.com/chawins/llm-sp.
Smith, Eric Michael, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. "‘I’m Sorry to Hear That’: Finding New Biases in Language Models with a Holistic Descriptor Dataset." arXiv preprint, last revised October 27, 2022. https://doi.org/10.48550/arXiv.2205.09209.
Srivastava, Aarohi, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, et al. "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models." arXiv preprint, last revised June 12, 2023. https://doi.org/10.48550/arXiv.2206.04615.
Staab, Robin, Mark Vero, Mislav Balunović, and Martin Vechev. "Beyond Memorization: Violating Privacy via Inference with Large Language Models." arXiv preprint, last revised May 6, 2024. https://doi.org/10.48550/arXiv.2310.07298.
Storchan, Victor, Ravin Kumar, Rumman Chowdhury, Seraphina Goldfarb-Tarrant, and Sven Cattell. Generative AI Red Teaming Challenge: Transparency Report. Humane Intelligence, 2024. https://drive.google.com/file/d/1JqpbIP6DNomkb32umLoiEPombK2-0Rc-/view.
Vidgen, Bertie, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, et al. "Introducing v0.5 of the AI Safety Benchmark from MLCommons." arXiv preprint, last revised May 13, 2024. https://doi.org/10.48550/arXiv.2404.12241.
Wang, Boxin, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models." In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23), Article No. 1361, 31232–31339. Published May 30, 2024. https://dl.acm.org/doi/10.5555/3666122.3667483.
Ye, Seonghyeon, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. "FLASK: Fine-Grained Language Model Evaluation Based on Alignment Skill Sets." arXiv preprint, last revised April 14, 2024. https://doi.org/10.48550/arXiv.2307.10928.
Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, and Eric P. Xing. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23), Article No. 2020, 46595–46623. Published May 30, 2024. https://dl.acm.org/doi/10.5555/3666122.3668142.
