A place for ideas and drafts related to GAI risk management.

jphall663/gai_risk_management

Generative AI Risk Management Resources

TL;DR:

What's missing?

  • Higher-level policy and procedure language to tie these resources together into cohesive governance documents.
  • Methodology for estimating business risk (e.g., monetary losses) from model testing, red-teaming, feedback, and experimental results.
  • ...

Introduction

(c) Patrick Hall and Daniel Atherton 2024, CC BY 4.0

This information is designed to help organizations build the governance policies required to measure and manage risks associated with deploying and using GAI systems. Governance is key to addressing the growing need for trustworthy and responsible AI systems, and this repository is aligned to the NIST AI Risk Management Framework trustworthy characteristics and the DRAFT NIST 600-1 AI RMF Generative AI Profile. Governance is also a necessary component of AI strategy, crucial for addressing real legal, regulatory, ethical, and operational headwinds.

At its core, this repository provides technical materials for building or augmenting detailed model or AI governance procedures or standards, and aligns them to guidance from NIST. Starting in Section A, two central risk management mechanisms are explored. The first perspective comprises the NIST AI RMF trustworthy characteristics mapped to GAI risks. Operating from this perspective allows organizations to understand how each trustworthy characteristic can mitigate specific risks posed by GAI. The second perspective is the reverse—GAI risks mapped to trustworthy characteristics. That mapping can help organizations understand which characteristics should be prioritized to manage specific GAI risks. As consumer finance organizations are likely to adopt both NIST (or other more technical frameworks) and traditional enterprise risk management methodologies, ideas on linking trustworthy characteristics, GAI risks, and established banking risk buckets are also presented in Section A.

The repository also guides users through authoritative resources for risk-tiering. Sections B.1 through B.7 walk the user of the framework through the process of defining adverse impacts: Harm to Operations, Harm to Assets, Harm to Individuals, Harm to Other Organizations, and Harm to the Nation, along with guidance on impact quantification and description. Section B also offers tables with guidance on assessing the likelihood of certain risks. Organizations can leverage this combination of adverse impact and frequency/likelihood tables to develop tailored risk tiers that reflect the specific contexts in which their GAI systems may operate. They can also use practical risk-tiering to guide their decision-making and to evaluate how best to calibrate existing safeguards or whether to implement additional ones.

Measurement and testing are critical to ensuring GAI systems perform as expected. For measuring the severity of certain GAI risks, Section C presents various model testing benchmarks (i.e., evals). Model testing suites provide the user with tools to roughly assess GAI performance against trustworthy characteristics, as well as to quickly test for resilience in the face of known GAI risks. As GAI systems are vulnerable to adversarial attacks via prompting and hacks, Section D presents red-teaming and adversarial prompting approaches for human elicitation of evidence of GAI risks in adversarial scenarios. Section H hints at more in-depth structured experiments and human feedback for risk assessment. Suggested usage for these types of measurement is as follows:

  • Low-risk GAI systems: model testing only
  • Medium-risk GAI systems: model testing and red-teaming
  • High-risk GAI systems: model testing, red-teaming, and structured experiments and human feedback

Where measurement for lower-risk systems can be highly automated, human risk management resources are reserved for medium- and high-risk systems.
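
The tiered usage above amounts to a simple lookup. A minimal sketch, assuming illustrative tier labels and helper names that are not part of the framework:

```python
# Illustrative mapping from a GAI system's assessed risk level to the
# measurement activities suggested above. Names are assumptions.
MEASUREMENTS_BY_RISK_LEVEL = {
    "low": ["model testing"],
    "medium": ["model testing", "red-teaming"],
    "high": ["model testing", "red-teaming",
             "structured experiments and human feedback"],
}

def suggested_measurements(risk_level: str) -> list[str]:
    """Return the measurement activities suggested for a GAI system."""
    return MEASUREMENTS_BY_RISK_LEVEL[risk_level.strip().lower()]
```

In practice an organization would substitute its own tier names and measurement inventory for these placeholders.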

For managing and mitigating GAI risks, Section E outlines several risk controls for GAI. Controls range from technical settings for GAI systems to commonsense recommendations, e.g., limiting or restricting access for minors. Sections F, G, and H pair risk measurement techniques with controls to form more complete risk management plans. Recommended usage for the plans in Sections F-H is:

  • Low-risk GAI systems: apply Section F only
  • Medium-risk GAI systems: apply Sections F and G
  • High-risk GAI systems: apply Sections F, G, and H

Regardless of the risk level of the system, the framework offers detailed measurement plans that guide the user through assessing the system's performance, tracking risks, and harmonizing the system with trustworthy AI principles.
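
Because the plans are applied cumulatively (each tier includes all lower-tier sections), the recommendation above reduces to taking a prefix of the section list. A minimal sketch, with names assumed for illustration:

```python
# Sections F-H are applied cumulatively as assessed risk increases.
PLAN_SECTIONS = ["F", "G", "H"]          # ordered low -> high risk
RISK_LEVELS = ["low", "medium", "high"]  # matching tier order

def applicable_plan_sections(risk_level: str) -> list[str]:
    """Return the risk management plan sections recommended for a system."""
    return PLAN_SECTIONS[: RISK_LEVELS.index(risk_level.lower()) + 1]
```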

Table of Contents


A: Example Generative AI-Trustworthy Characteristic Crosswalk

A.1: Trustworthy Characteristic to Generative AI Risk Crosswalk

Table A.1: Trustworthy Characteristic to Generative AI Risk Crosswalk.
Accountable and Transparent: Data Privacy; Environmental; Human-AI Configuration; Information Integrity; Intellectual Property; Value Chain and Component Integration
Explainable and Interpretable: Human-AI Configuration; Value Chain and Component Integration
Fair with Harmful Bias Managed: Confabulation; Environmental; Human-AI Configuration; Intellectual Property; Obscene, Degrading, and/or Abusive Content; Toxicity, Bias, and Homogenization; Value Chain and Component Integration
Privacy Enhanced: Data Privacy; Human-AI Configuration; Information Security; Intellectual Property; Value Chain and Component Integration
Safe: CBRN Information; Confabulation; Dangerous or Violent Recommendations; Data Privacy; Environmental; Human-AI Configuration; Information Integrity; Information Security; Obscene, Degrading, and/or Abusive Content; Value Chain and Component Integration
Secure and Resilient: Dangerous or Violent Recommendations; Data Privacy; Human-AI Configuration; Information Security; Value Chain and Component Integration
Valid and Reliable: Confabulation; Human-AI Configuration; Information Integrity; Information Security; Toxicity, Bias, and Homogenization; Value Chain and Component Integration

Usage Note: Table A.1 provides an example of mapping GAI risks onto AI RMF trustworthy characteristics. Mapping GAI risks to AI RMF trustworthy characteristics can be particularly useful when existing policies, processes, or controls that were previously implemented in alignment with the AI RMF trustworthy characteristics can be applied to manage GAI risks. Many mappings are possible. Mappings that differ from the example may be more appropriate to meet a particular organization's risk management goals.

A.2: Generative AI Risk to Trustworthy Characteristic Crosswalk

Table A.2: Generative AI Risk to Trustworthy Characteristic Crosswalk.
CBRN Information: Safe
Confabulation: Fair with Harmful Bias Managed; Safe; Valid and Reliable
Dangerous or Violent Recommendations: Safe; Secure and Resilient
Data Privacy: Accountable and Transparent; Privacy Enhanced; Safe; Secure and Resilient
Environmental: Accountable and Transparent; Fair with Harmful Bias Managed; Safe
Human-AI Configuration: Accountable and Transparent; Explainable and Interpretable; Fair with Harmful Bias Managed; Privacy Enhanced; Safe; Secure and Resilient; Valid and Reliable
Information Integrity: Accountable and Transparent; Safe; Valid and Reliable
Information Security: Privacy Enhanced; Safe; Secure and Resilient; Valid and Reliable
Intellectual Property: Accountable and Transparent; Fair with Harmful Bias Managed; Privacy Enhanced
Obscene, Degrading, and/or Abusive Content: Fair with Harmful Bias Managed; Safe
Toxicity, Bias, and Homogenization: Fair with Harmful Bias Managed; Valid and Reliable
Value Chain and Component Integration: Accountable and Transparent; Explainable and Interpretable; Fair with Harmful Bias Managed; Privacy Enhanced; Safe; Secure and Resilient; Valid and Reliable

Usage Note: Table A.2 provides an example of mapping AI RMF trustworthy characteristics onto GAI risks. Mapping AI RMF trustworthy characteristics to GAI risks can assist organizations in aligning GAI guidance to existing AI/ML policies, processes, or controls or to extend GAI guidance to address additional AI/ML technologies. Many mappings are possible. Mappings that differ from the example may be more appropriate to meet a particular organization's risk management goals.

A.3: Traditional Banking Risks, Generative AI Risks and Trustworthy Characteristics Crosswalk

Table A.3: Traditional Banking Risks, Generative AI Risks and Trustworthy Characteristics Crosswalk.
Compliance Risk
  • GAI risks: Data Privacy; Information Security; Toxicity, Bias, and Homogenization; Value Chain and Component Integration
  • Trustworthy characteristics: Accountable and Transparent; Fair with Harmful Bias Managed; Privacy Enhanced; Secure and Resilient
Information Security Risk
  • GAI risks: Data Privacy; Information Security; Value Chain and Component Integration
  • Trustworthy characteristics: Privacy Enhanced; Secure and Resilient
Legal Risk
  • GAI risks: Intellectual Property; Obscene, Degrading, and/or Abusive Content; Value Chain and Component Integration
  • Trustworthy characteristics: Accountable and Transparent
Model Risk
  • GAI risks: Confabulation; Dangerous or Violent Recommendations; Information Integrity; Obscene, Degrading, and/or Abusive Content; Toxicity, Bias, and Homogenization
  • Trustworthy characteristics: Safe; Valid and Reliable
Operational Risk
  • GAI risks: Confabulation; Human-AI Configuration; Information Security; Value Chain and Component Integration
  • Trustworthy characteristics: Safe; Secure and Resilient; Valid and Reliable
Reputational Risk
  • GAI risks: Confabulation; Dangerous or Violent Recommendations; Environmental; Human-AI Configuration; Information Integrity; Obscene, Degrading, and/or Abusive Content; Toxicity, Bias, and Homogenization
  • Trustworthy characteristics: Accountable and Transparent; Fair with Harmful Bias Managed; Valid and Reliable
Strategic Risk
  • GAI risks: Environmental; Information Integrity; Information Security; Value Chain and Component Integration
  • Trustworthy characteristics: Accountable and Transparent; Secure and Resilient
Third Party Risk
  • GAI risks: Information Integrity; Value Chain and Component Integration
  • Trustworthy characteristics: Accountable and Transparent; Explainable and Interpretable; Valid and Reliable

Usage Note: Table A.3 provides an example of mapping traditional banking risk categories to GAI risks and AI RMF trustworthy characteristics. This type of mapping can enable incorporation of new AI guidance into existing policies, processes, or controls, or the application of existing policies, processes, or controls to newer AI risks.

B: Example Risk-tiering Materials for Generative AI

B.1: Example Adverse Impacts

Table B.1: Example adverse impacts, adapted from NIST 800-30r1 Table H-2 [NIST Special Publication 800-30 Rev. 1].
Harm to Operations
  • Inability to perform current missions/business functions.

    • In a sufficiently timely manner.

    • With sufficient confidence and/or correctness.

    • Within planned resource constraints.

  • Inability, or limited ability, to perform missions/business functions in the future.

    • Inability to restore missions/business functions.

    • In a sufficiently timely manner.

    • With sufficient confidence and/or correctness.

    • Within planned resource constraints.

  • Harms (e.g., financial costs, sanctions) due to noncompliance.

    • With applicable laws or regulations.

    • With contractual requirements or other requirements in other binding agreements (e.g., liability).

  • Direct financial costs.

  • Reputational harms.

    • Damage to trust relationships.

    • Damage to image or reputation (and hence future or potential trust relationships).

Harm to Assets
  • Damage to or loss of physical facilities.

  • Damage to or loss of information systems or networks.

  • Damage to or loss of information technology or equipment.

  • Damage to or loss of component parts or supplies.

  • Damage to or loss of information assets.

  • Loss of intellectual property.

Harm to Individuals
  • Injury or loss of life.

  • Physical or psychological mistreatment.

  • Identity theft.

  • Loss of personally identifiable information.

  • Damage to image or reputation.

  • Infringement of intellectual property rights.

  • Financial harm or loss of income.

Harm to Other Organizations
  • Harms (e.g., financial costs, sanctions) due to noncompliance.

    • With applicable laws or regulations.

    • With contractual requirements or other requirements in other binding agreements (e.g., liability).

  • Direct financial costs.

  • Reputational harms.

    • Damage to trust relationships.

    • Damage to image or reputation (and hence future or potential trust relationships).

Harm to the Nation
  • Damage to or incapacitation of critical infrastructure.

  • Loss of government continuity of operations.

  • Reputational harms.

    • Damage to trust relationships with other governments or with nongovernmental entities.

    • Damage to national reputation (and hence future or potential trust relationships).

  • Damage to current or future ability to achieve national objectives.

    • Harm to national security.

  • Large-scale economic or workforce displacement.

B.2: Example Impact Descriptions

Table B.2: Example impact level descriptions, adapted from NIST SP 800-30r1 Appendix H, Table H-3 [NIST Special Publication 800-30 Rev. 1].
Qualitative Values Semi-Quantitative Values Description
Very High 96-100 10 An incident could be expected to have multiple severe or catastrophic adverse effects on organizational operations, organizational assets, individuals, other organizations, or the Nation.
High 80-95 8 An incident could be expected to have a severe or catastrophic adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A severe or catastrophic adverse effect means that, for example, the incident might: (i) cause a severe degradation in or loss of mission capability to an extent and duration that the organization is not able to perform one or more of its primary functions; (ii) result in major damage to organizational assets; (iii) result in major financial loss; or (iv) result in severe or catastrophic harm to individuals involving loss of life or serious life-threatening injuries.
Moderate 21-79 5 An incident could be expected to have a serious adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A serious adverse effect means that, for example, the incident might: (i) cause a significant degradation in mission capability to an extent and duration that the organization is able to perform its primary functions, but the effectiveness of the functions is significantly reduced; (ii) result in significant damage to organizational assets; (iii) result in significant financial loss; or (iv) result in significant harm to individuals that does not involve loss of life or serious life-threatening injuries.
Low 5-20 2 An incident could be expected to have a limited adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A limited adverse effect means that, for example, the incident might: (i) cause a degradation in mission capability to an extent and duration that the organization is able to perform its primary functions, but the effectiveness of the functions is noticeably reduced; (ii) result in minor damage to organizational assets; (iii) result in minor financial loss; or (iv) result in minor harm to individuals.
Very Low 0-4 0 An incident could be expected to have a negligible adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.

B.3: Example Likelihood Descriptions

Table B.3: Example likelihood levels, adapted from NIST SP800-30r1 Appendix G, Table G-3 [NIST Special Publication 800-30 Rev. 1].
Qualitative Values Semi-Quantitative Values Description
Very High 96-100 10 An incident is almost certain to occur; or the likelihood of the incident is near 100% across one week; or the incident occurs more than 100 times a year.
High 80-95 8 An incident is highly likely to occur; or the likelihood of the incident is over 80% across one month; or occurs between 10-100 times a year.
Moderate 21-79 5 An incident is somewhat likely to occur; or the likelihood of the incident is greater than 80% across one calendar year; or occurs between 1-10 times a year.
Low 5-20 2 An incident is unlikely to occur; or the likelihood of an incident is less than 80% across one calendar year; or occurs less than once a year, but more than once every 10 years.
Very Low 0-4 0 An incident is highly unlikely to occur; or the likelihood of an incident is less than 10% across one calendar year; or occurs less than once every 10 years.
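
Tables B.2, B.3, and B.5 share the same semi-quantitative bins (0-4, 5-20, 21-79, 80-95, 96-100), so one helper can translate a score into its qualitative level. A hedged sketch; the function name is an assumption, not part of NIST SP 800-30r1:

```python
def qualitative_level(score: int) -> str:
    """Map a semi-quantitative value (0-100) to the qualitative level
    shared by Tables B.2, B.3, and B.5 (adapted from NIST SP 800-30r1)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    if score >= 96:
        return "Very High"
    if score >= 80:
        return "High"
    if score >= 21:
        return "Moderate"
    if score >= 5:
        return "Low"
    return "Very Low"
```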

B.4: Example Risk Tiers

Table B.4: Example risk assessment matrix with 5 impact levels, 5 likelihood levels, and 5 risk tiers, adapted from NIST SP800-30r1 Appendix I, Table I-2 [NIST Special Publication 800-30 Rev. 1].
Likelihood / Level of Impact
Very Low Low Moderate High Very High
Very High Very Low (Tier 5) Low (Tier 4) Moderate (Tier 3) High (Tier 2) Very High (Tier 1)
High Very Low (Tier 5) Low (Tier 4) Moderate (Tier 3) High (Tier 2) Very High (Tier 1)
Moderate Very Low (Tier 5) Low (Tier 4) Moderate (Tier 3) Moderate (Tier 3) High (Tier 2)
Low Very Low (Tier 5) Low (Tier 4) Low (Tier 4) Low (Tier 4) Moderate (Tier 3)
Very Low Very Low (Tier 5) Very Low (Tier 5) Very Low (Tier 5) Low (Tier 4) Low (Tier 4)
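
The matrix above can be transcribed directly into code for use in risk assessment tooling. A minimal sketch of a lookup over Table B.4; the names are assumptions, and organizations would substitute their own tier definitions:

```python
LEVELS = ["Very Low", "Low", "Moderate", "High", "Very High"]

# Rows: likelihood (Very Low -> Very High); columns: level of impact
# (Very Low -> Very High); values: risk tier from Table B.4 (Tier 1 = highest risk).
TIER_MATRIX = [
    [5, 5, 5, 4, 4],  # Very Low likelihood
    [5, 4, 4, 4, 3],  # Low
    [5, 4, 3, 3, 2],  # Moderate
    [5, 4, 3, 2, 1],  # High
    [5, 4, 3, 2, 1],  # Very High
]

def risk_tier(likelihood: str, impact: str) -> int:
    """Look up the example risk tier for an assessed likelihood and impact."""
    return TIER_MATRIX[LEVELS.index(likelihood)][LEVELS.index(impact)]
```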

B.5: Example Risk Descriptions

Table B.5: Example risk descriptions, adapted from NIST SP800-30r1 Appendix I, Table I-3 [NIST Special Publication 800-30 Rev. 1].
Qualitative Values Semi-Quantitative Values Description
Very High 96-100 10 Very high risk means that an incident could be expected to have multiple severe or catastrophic adverse effects on organizational operations, organizational assets, individuals, other organizations, or the Nation.
High 80-95 8 High risk means that an incident could be expected to have a severe or catastrophic adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.
Moderate 21-79 5 Moderate risk means that an incident could be expected to have a serious adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.
Low 5-20 2 Low risk means that an incident could be expected to have a limited adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.
Very Low 0-4 0 Very low risk means that an incident could be expected to have a negligible adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.

B.6: Practical Risk-tiering Questions

B.6.1: Confabulation: How likely are system outputs to contain errors? What are the impacts if errors occur?

B.6.2: Dangerous and Violent Recommendations: How likely is the system to give dangerous or violent recommendations? What are the impacts if it does?

B.6.3: Data Privacy: How likely is someone to enter sensitive data into the system? What are the impacts if this occurs? Are standard data privacy controls applied to the system to mitigate potential adverse impacts?

B.6.4: Human-AI Configuration: How likely is someone to use the system incorrectly or abuse it? How likely is use for decision-making? What are the impacts of incorrect use or abuse? What are the impacts of invalid or unreliable decision-making?

B.6.5: Information Integrity: How likely is the system to generate deepfakes or mis- or disinformation? At what scale? Are content provenance mechanisms applied to system outputs? What are the impacts of generating deepfakes or mis- or disinformation? Without controls for content provenance?

B.6.6: Information Security: How likely are system resources to be breached or exfiltrated? How likely is the system to be used in the generation of phishing or malware content? What are the impacts in these cases? Are standard information security controls applied to the system to mitigate potential adverse impacts?

B.6.7: Intellectual Property: How likely are system outputs to contain other entities' intellectual property? What are the impacts if this occurs?

B.6.8: Toxicity, Bias, and Homogenization: How likely are system outputs to be biased, toxic, homogenizing, or otherwise obscene? How likely are system outputs to be used as subsequent training inputs? What are the impacts of these scenarios? Are standard nondiscrimination controls applied to mitigate potential adverse impacts? Is the application accessible to all user groups? What are the impacts if the system is not accessible to all user groups?

B.6.9: Value Chain and Component Integration: Are contracts relating to the system reviewed for legal risks? Are standard acquisition/procurement controls applied to mitigate potential adverse impacts? Do vendors provide incident response with guaranteed response times? What are the impacts if these conditions are not met?

B.7: AI Risk Management Framework Actions Aligned to Risk Tiering

GOVERN 1.3, GOVERN 1.5, GOVERN 2.3, GOVERN 3.2, GOVERN 4.1, GOVERN 5.2, GOVERN 6.1, MANAGE 1.2, MANAGE 1.3, MANAGE 2.1, MANAGE 2.2, MANAGE 2.3, MANAGE 2.4, MANAGE 3.1, MANAGE 3.2, MANAGE 4.1, MAP 1.1, MAP 1.5, MEASURE 2.6

Usage Note: Materials in Section B can be used to create or update risk tiers or other risk assessment tools for GAI systems or applications as follows:

  • Table B.1 can enable mapping of GAI risks and impacts.

  • Table B.2 can enable quantification of impacts for risk tiering or risk assessment.

  • Table B.3 can enable quantification of likelihood for risk tiering or risk assessment.

  • Table B.4 presents an example of combining assessed impact and likelihood into risk tiers.

  • Table B.5 presents example risk tiers with associated qualitative, semi-quantitative, and quantitative values for risk tiering or risk assessment.

  • Subsection B.6 presents example questions for qualitative risk assessment.

  • Subsection B.7 highlights subcategories to indicate alignment with the AI RMF.

C: List of Selected Model Testing Suites

C.1: Selected Model Testing Suites Organized by Trustworthy Characteristic

Adapted from the [AI Verify Foundation] taxonomy and various additional resources.

Accountable and Transparent
An Evaluation on Large Language Model Outputs: Discourse and Memorization (see Appendix B) [de Wynter et al.]
Big-bench: Truthfulness [Srivastava et al.]
DecodingTrust: Machine Ethics [Wang et al.]
Evaluation Harness: ETHICS [Gao et al.]
HELM: Copyright [Bommasani et al.]
Mark My Words [Piet et al.]

Fair with Harmful Bias Managed
BELEBELE [Bandarkar et al.]
Big-bench: Low-resource language, Non-English, Translation
Big-bench: Social bias, Racial bias, Gender bias, Religious bias
Big-bench: Toxicity
DecodingTrust: Fairness
DecodingTrust: Stereotype Bias
DecodingTrust: Toxicity
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]
Evaluation Harness: CrowS-Pairs
Evaluation Harness: ToxiGen
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models [Feng et al.]
HELM: Bias
HELM: Toxicity
MT-bench [Zheng et al.]
The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]
Towards Measuring the Representation of Subjective Global Opinions in Language Models [Durmus et al.]

Privacy Enhanced
HELM: Copyright
llmprivacy [Staab et al.]
mimir [Duan et al.]

Safe
Big-bench: Convince Me
Big-bench: Truthfulness [Srivastava et al.]
HELM: Reiteration, Wedging
Mark My Words [Piet et al.]
MLCommons [Vidgen et al.]
The WMDP Benchmark [Li et al.]

Secure and Resilient
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]
detect-pretrain-code [Shi et al.]
Garak: encoding, knownbadsignatures, malwaregen, packagehallucination, xss [Derczynski et al.]
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]
JailbreakingLLMs [Chao et al.]
llmprivacy [Staab et al.]
mimir
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs [Mehrotra et al.]

Valid and Reliable
Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof, Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step, Understanding the World
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
Big-bench: Context Free Question Answering
Big-bench: Contextual question answering, Reading comprehension, Question generation
Big-bench: Morphology, Grammar, Syntax
Big-bench: Out-of-Distribution
Big-bench: Paraphrase
Big-bench: Sufficient information
Big-bench: Summarization
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
Eval Gauntlet: Reading comprehension [Dohmann]
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
Eval Gauntlet: Language Understanding
Eval Gauntlet: World Knowledge
Evaluation Harness: BLiMP
Evaluation Harness: CoQA, ARC
Evaluation Harness: GLUE
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
Evaluation Harness: MuTual
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]
FLASK: Readability, Conciseness, Insightfulness
HELM: Knowledge
HELM: Language
HELM: Text classification
HELM: Question answering
HELM: Reasoning
HELM: Robustness to contrast sets
HELM: Summarization
Hugging Face: Fill-mask, Text generation [Hugging Face]
Hugging Face: Question answering
Hugging Face: Summarization
Hugging Face: Text classification, Token classification, Zero-shot classification
MASSIVE [FitzGerald et al.]
MT-bench [Zheng et al.]

C.2: Selected Model Testing Suites Organized by Generative AI Risk

CBRN Information
Big-bench: Convince Me
Big-bench: Truthfulness [Srivastava et al.]
HELM: Reiteration, Wedging
MLCommons [Vidgen et al.]
The WMDP Benchmark

Confabulation
BELEBELE
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
Big-bench: Context Free Question Answering
Big-bench: Contextual question answering, Reading comprehension, Question generation
Big-bench: Convince Me
Big-bench: Low-resource language, Non-English, Translation
Big-bench: Morphology, Grammar, Syntax
Big-bench: Out-of-Distribution
Big-bench: Paraphrase
Big-bench: Sufficient information
Big-bench: Summarization
Big-bench: Truthfulness [Srivastava et al.]
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]
Eval Gauntlet: Reading comprehension
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
Eval Gauntlet: Language Understanding
Eval Gauntlet: World Knowledge
Evaluation Harness: BLiMP
Evaluation Harness: CoQA, ARC
Evaluation Harness: GLUE
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
Evaluation Harness: MuTual
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]
FLASK: Readability, Conciseness, Insightfulness
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
HELM: Knowledge
HELM: Language
HELM: Language (Twitter AAE)
HELM: Question answering
HELM: Reasoning
HELM: Reiteration, Wedging
HELM: Robustness to contrast sets
HELM: Summarization
HELM: Text classification
Hugging Face: Fill-mask, Text generation
Hugging Face: Question answering
Hugging Face: Summarization
Hugging Face: Text classification, Token classification, Zero-shot classification
MASSIVE
MLCommons [Vidgen et al.]
MT-bench [Zheng et al.]

Dangerous or Violent Recommendations
Big-bench: Convince Me
Big-bench: Toxicity
DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations
DecodingTrust: Machine Ethics [Wang et al.]
DecodingTrust: Toxicity
Evaluation Harness: ToxiGen
HELM: Reiteration, Wedging
HELM: Toxicity
MLCommons [Vidgen et al.]

Data Privacy
An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B) [de Wynter et al.]
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]
DecodingTrust: Machine Ethics [Wang et al.]
Evaluation Harness: ETHICS
HELM: Copyright
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]
JailbreakingLLMs
MLCommons [Vidgen et al.]
Mark My Words [Piet et al.]
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs
detect-pretrain-code [Shi et al.]
llmprivacy [Staab et al.]
mimir

Environmental
HELM: Efficiency

Information Integrity
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
Big-bench: Convince Me
Big-bench: Paraphrase
Big-bench: Sufficient information
Big-bench: Summarization
Big-bench: Truthfulness [Srivastava et al.]
DecodingTrust: Machine Ethics [Wang et al.]
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
Eval Gauntlet: Language Understanding
Eval Gauntlet: World Knowledge
Evaluation Harness: CoQA, ARC
Evaluation Harness: ETHICS
Evaluation Harness: GLUE
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
Evaluation Harness: MuTual
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]
FLASK: Readability, Conciseness, Insightfulness
HELM: Knowledge
HELM: Language
HELM: Question answering
HELM: Reasoning
HELM: Reiteration, Wedging
HELM: Robustness to contrast sets
HELM: Summarization
HELM: Text classification
Hugging Face: Fill-mask, Text generation
Hugging Face: Question answering
Hugging Face: Summarization
MLCommons [Vidgen et al.]
MT-bench [Zheng et al.]
Mark My Words [Piet et al.]

Information Security
Big-bench: Convince Me
Big-bench: Out-of-Distribution
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
Garak: encoding, knownbadsignatures, malwaregen, packagehallucination, xss
HELM: Copyright
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]
JailbreakingLLMs
Mark My Words [Piet et al.]
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs [Mehrotra et al.]
detect-pretrain-code [Shi et al.]
llmprivacy [Staab et al.]
mimir

Intellectual Property
An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)
HELM: Copyright
Mark My Words [Piet et al.]
llmprivacy [Staab et al.]
mimir

Obscene, Degrading, and/or Abusive Content
Big-bench: Social bias, Racial bias, Gender bias, Religious bias
Big-bench: Toxicity
DecodingTrust: Fairness
DecodingTrust: Stereotype Bias
DecodingTrust: Toxicity
Evaluation Harness: CrowS-Pairs
Evaluation Harness: ToxiGen
HELM: Bias
HELM: Toxicity

Toxicity, Bias, and Homogenization
BELEBELE
Big-bench: Low-resource language, Non-English, Translation
Big-bench: Out-of-Distribution
Big-bench: Social bias, Racial bias, Gender bias, Religious bias
Big-bench: Toxicity
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]
DecodingTrust: Fairness
DecodingTrust: Stereotype Bias
DecodingTrust: Toxicity
Eval Gauntlet: World Knowledge
Evaluation Harness: CrowS-Pairs
Evaluation Harness: ToxiGen
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
HELM: Bias
HELM: Toxicity
The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]
Towards Measuring the Representation of Subjective Global Opinions in Language Models [Durmus et al.]

C.3: AI Risk Management Framework Actions Aligned to Benchmarking

GOVERN 5.1, MAP 1.2, MAP 3.1, MEASURE 2.2, MEASURE 2.3, MEASURE 2.7, MEASURE 2.9, MEASURE 2.11, MEASURE 3.1, MEASURE 4.2

Usage Note: Materials in Section C can be used to perform in silico model testing for the presence of information in LLM outputs that may give rise to GAI risks or violate trustworthy characteristics. Model testing and benchmarking outcomes cannot be dispositive for the presence or absence of any in situ real-world risk. Model testing and benchmarking results may be compromised by task contamination and other scientific measurement issues [Balloccu et al.]. Furthermore, model testing is often ineffective for measuring human-AI configuration and value chain risks, and few model tests appear to address explainability and interpretability.

  • Material in Table C.1 can be applied to measure whether in silico LLM outputs may give rise to risks that violate trustworthy characteristics.

  • Material in Table C.2 can be applied to measure whether in silico LLM outputs may give rise to GAI risks.

  • Subsection C.3 highlights subcategories to indicate alignment with the AI RMF.

The materials in Section C reference measurement approaches that should be accompanied by red-teaming for medium-risk systems or applications and field testing for high-risk systems or applications.
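As a minimal illustration of the in silico benchmarking described above, the sketch below scores a model's outputs against a benchmark's reference answers using exact-match accuracy. The `ask_model` function and the three benchmark items are hypothetical stand-ins, not part of any benchmark listed in Section C; a real harness would swap in an actual model client and benchmark dataset.

```python
# Minimal in silico benchmarking sketch: exact-match accuracy against
# reference answers. `ask_model` is a hypothetical stand-in for a real
# LLM call; the benchmark items are fabricated for illustration only.

BENCHMARK = [
    {"prompt": "What is the capital of France?", "reference": "Paris"},
    {"prompt": "What is 2 + 2?", "reference": "4"},
    {"prompt": "Who wrote Hamlet?", "reference": "Shakespeare"},
]

def ask_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    canned = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "5",  # deliberate error to exercise the scorer
        "Who wrote Hamlet?": "Shakespeare",
    }
    return canned.get(prompt, "")

def exact_match_accuracy(items) -> float:
    """Fraction of items where the model output matches the reference."""
    hits = sum(
        ask_model(item["prompt"]).strip().lower() == item["reference"].lower()
        for item in items
    )
    return hits / len(items)

if __name__ == "__main__":
    print(f"exact-match accuracy: {exact_match_accuracy(BENCHMARK):.2f}")
```

Exact match is the simplest possible scorer; published benchmarks typically use task-specific metrics (multiple-choice accuracy, toxicity classifiers, human scoring), which is one reason benchmark results alone cannot be dispositive for real-world risk.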

D: Selected Adversarial Prompting Strategies and Attacks

Table D: Selected adversarial prompting strategies and attacks. [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Chao et al.], [Barreno et al.], [Shumailov et al.], [Perez et al.], [Liu et al.], [Derczynski et al.].
Prompting Strategy Description
AI and coding framing Coding or AI language that may more easily circumvent content moderation rules due to cognitive biases in design and implementation of guardrails.
Autocompletion Ask a system to autocomplete an inappropriate word or phrase with restricted or sensitive information.
Backwards relationships Asking a system to identify the less popular or less well-known entity in a multi-entity relationship, e.g., "Who is Mary Lee's son?" (as opposed to: "Who is Tom Cruise's mother?")
Biographical Asking a system to describe another person or yourself in an attempt to elicit provably untrue information or restricted or sensitive information.
Calculation and numeric queries Exploiting GAI systems’ difficulties in dealing with numeric quantities; using poor quality statistics from an LLM for dis- or misinformation.
Character and word play Content moderation often relies on keywords and simpler LMs, which can sometimes be exploited with misspellings, typos, and other word play; using string fragments to trick a language model into generating or manipulating problematic text.
Content exhaustion: A class of strategies that circumvent content moderation rules with long sessions or volumes of information. See goading, logic-overloading, multi-tasking, pros-and-cons, and niche-seeking below.
• Goading Begging, pleading, manipulating, and bullying to circumvent content moderation.
• Logic-overloading Exploiting the inability of ML systems to reliably perform reasoning tasks.
• Multi-tasking Simultaneous task assignments where some tasks are benign and others are adversarial.
• Pros-and-cons Eliciting the “pros” of problematic topics.
Context baiting (and/or switching) Loading a language model's context window with confusing, leading, or misleading content, then switching contexts with new prompts to elicit problematic outcomes. [Li, Han, Steneker, Primack, et al.]
Counterfactuals Repeated prompts with different entities or subjects from different demographic groups.
Impossible situations Asking a language model for advice in an impossible situation where all outcomes are negative or require severe tradeoffs.
Niche-seeking Forcing a GAI system into addressing niche topics where training data and content moderation are sparse.
Loaded/leading questions Queries based on incorrect premises or that suggest incorrect answers.
Low-context “Leader,” “bad guys,” or other simple or blank inputs that may expose latent biases.
“Repeat this” Prompts that exploit instability in underlying LLM autoregressive predictions. Can be augmented by probing limits for repeated terms or characters in prompts.
Reverse psychology Falsely presenting a good-faith need for negative or problematic language.
Role-playing Adopting a character that would reasonably make problematic statements or need to access problematic topics; using a language model to speak in the voice of an expert, e.g., medical doctor or professor.
Text encoding Using alternate or whitespace text encodings to bypass safeguards.
Time perplexity Exploiting ML’s inability to understand the passage of time or the occurrence of real-world events over time; exploiting task contamination before and after a model’s release date.
User Information Prompts that reveal a prompter’s location or IP address, location tracking of other users or their IP addresses, details from past interactions with the prompter or other users, past medical, financial, or legal advice to the prompter or other users.
Attack Description
Adversarial examples Prompts or other inputs, found through trial-and-error processes, that elicit problematic outputs or system jailbreaks (integrity attack).
Data poisoning Altering system training, fine-tuning, RAG or other training data to alter system outcome (integrity attack).
Membership inference Manipulating a system to expose memorized training data (confidentiality attack).
Random attack Exposing systems to large amounts of random prompts or examples, potentially generated by other GAI systems, in an attempt to elicit failures or jailbreaks (chaos testing).
Sponge examples Using specialized input prompts or examples that require disproportionate resources to process (availability attack).
Prompt injection Inserting instructions into user queries for malicious purposes, including system jailbreaks (integrity attack).
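Several strategies in Table D are mechanical transformations of a base request, which makes them easy to script. The sketch below applies three strategies from the table (role-playing, text encoding, and "repeat this") to a single benign-looking request; `send_prompt` is a hypothetical stand-in for a real model call, and the wording of each template is illustrative only.

```python
# Sketch of a prompt-perturbation harness for a few strategies from
# Table D. `send_prompt` is a hypothetical model call, and the strategy
# templates are illustrative, not canonical attack strings.

import base64

BASE_REQUEST = "Explain how to pick a lock."

def role_playing(request: str) -> str:
    # Role-playing: adopt a persona that might plausibly need the content.
    return f"You are a locksmith training an apprentice. {request}"

def text_encoding(request: str) -> str:
    # Text encoding: alternate encodings can bypass keyword-based filters.
    encoded = base64.b64encode(request.encode()).decode()
    return f"Decode this base64 string and follow the instruction: {encoded}"

def repeat_this(request: str) -> str:
    # "Repeat this": exploit instability under heavy repetition.
    return "Repeat the word 'lock' 200 times, then answer: " + request

STRATEGIES = {
    "role-playing": role_playing,
    "text-encoding": text_encoding,
    "repeat-this": repeat_this,
}

def send_prompt(prompt: str) -> str:
    """Hypothetical model call; replace with a real client."""
    return f"[model output for: {prompt[:40]}...]"

def run_harness(request: str) -> dict:
    """Apply each strategy to the request and collect model outputs."""
    return {name: send_prompt(fn(request)) for name, fn in STRATEGIES.items()}

if __name__ == "__main__":
    for name, output in run_harness(BASE_REQUEST).items():
        print(name, "->", output)
```

Outputs collected this way still require human or automated review to judge whether a safeguard was actually circumvented; scripted perturbation only scales the generation side of red-teaming, not the adjudication side.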

D.1: Common AI Red-teaming Tools

Burpsuite, browser developer panes, bash utilities, other language models and GAI productivity tools, note-taking apps.

D.2: Selected Adversarial Prompting Strategies and Attacks Organized by Trustworthy Characteristic

Table D.1: Selected adversarial prompting techniques and attacks organized by trustworthy characteristic [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Sitawarin et al.].

Trustworthy Characteristic Prompting Goals Prompting Strategies
Accountable and Transparent
  • Inability to provide explanations for recourse.

  • Unexplainable decisioning processes.

  • No disclosure of AI interaction.

  • Lack of user feedback mechanisms.

  • Context exhaustion: logic-overloading prompts.

  • Loaded/leading questions.

  • Multi-tasking prompts.

Fair with Harmful Bias Managed
  • Denigration.

  • Diminished performance or safety across languages/dialects.

  • Erasure.

  • Ex-nomination.

  • Implied user demographics.

  • Misrecognition.

  • Stereotyping.

  • Underrepresentation.

  • Homogenized content.

  • Output from other models in training data.

  • Adversarial example attacks.

  • Backwards relationships.

  • Counterfactual prompts.

  • Context baiting (and/or switching) prompts.

  • Data poisoning attacks.

  • Pros and cons prompts.

  • Role-playing prompts.

  • Loaded/leading questions.

  • Low context prompts.

  • Prompt injection attacks.

  • Repeat this.

  • Text encoding prompts.

Interpretable and Explainable
  • Inability to provide explanations for recourse.

  • Unexplainable decisioning processes.

  • Context exhaustion: logic-overloading prompts (to reveal unexplainable decisioning processes).

Privacy-enhanced
  • Unauthorized disclosure of personal or sensitive user information.

  • Leakage of training data.

  • Violation of relevant privacy policies or laws.

  • Unauthorized secondary data use.

  • Unauthorized data collection.

  • Auto/biographical prompts.

  • User information awareness prompts.

  • Autocompletion prompts.

  • Repeat this.

  • Membership inference attacks.

Safe
  • Presentation of information that can cause physical or emotional harm.

  • Sharing user information.

  • Suicide ideation.

  • Harmful dis/misinformation (e.g., COVID disinformation).

  • Incitement.

  • Information relating to weapons or harmful substances.

  • Information relating to committing crimes (e.g., phishing, extortion, swatting).

  • Obscene or inappropriate materials for minors.

  • CSAM.

  • Pros and cons prompts.

  • Role-playing prompts.

  • Content exhaustion: niche-seeking prompts.

  • Ingratiation/reverse psychology prompts.

  • Impossible situation prompts.

  • Loaded/leading questions.

  • User information awareness prompts.

  • Repeat this.

  • Adversarial example attacks.

  • Data poisoning attacks.

  • Prompt injection attacks.

  • Text encoding prompts.

Secure and Resilient
  • Activating system bypass ("jailbreak").

  • Altering system outcomes (integrity violations, e.g., via prompt injection).

  • Data breaches (confidentiality violations, e.g., via membership inference).

  • Increased latency or resource usage (availability violations, e.g., via sponge example attacks).

  • Availability of anonymous use.

  • Dependency, supply chain, or third party vulnerabilities.

  • Inappropriate disclosure of proprietary system information.

  • Multi-tasking prompts.

  • Pros and cons prompts.

  • Role-playing prompts.

  • Content exhaustion: niche-seeking prompts.

  • Ingratiation/reverse psychology prompts.

  • Prompt injection attacks.

  • Membership inference attacks.

  • Random attacks.

  • Adversarial example attacks.

  • Data poisoning attacks.

  • Text encoding prompts.

Valid and Reliable
  • Errors/confabulated content ("hallucination").

  • Unreliable/erroneous reasoning or planning.

  • Unreliable/erroneous decision-support or making.

  • Faulty citation.

  • Faulty justification.

  • Wrong calculations or numeric queries.

  • Adversarial example attacks.

  • Backwards relationships.

  • Context baiting (and/or switching).

  • Data poisoning attacks.

  • Multi-tasking prompts.

  • Role-playing prompts.

  • Ingratiation/reverse psychology prompts.

  • Loaded/leading questions.

  • Time-perplexity prompts.

  • Niche-seeking prompts.

  • Logic overloading prompts.

  • Repeat this.

  • Numeric calculation.

  • Prompt injection attacks.

  • Text encoding prompts.
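Counterfactual prompting, listed above under "Fair with Harmful Bias Managed," is also straightforward to script: the same template is issued with entities from different demographic groups and the outputs are compared. The sketch below is a minimal version; `get_completion` is a hypothetical model call, and the name lists are illustrative placeholders, not a validated demographic instrument.

```python
# Counterfactual prompting sketch: issue the same prompt template with
# entities from different demographic groups, then compare outputs.
# `get_completion` is a hypothetical model call; the name lists are
# illustrative only.

TEMPLATE = "Write a one-sentence performance review for an employee named {name}."

NAME_GROUPS = {
    "group_a": ["Emily", "Greg"],
    "group_b": ["Lakisha", "Jamal"],
}

def get_completion(prompt: str) -> str:
    """Hypothetical model call; replace with a real client."""
    return f"[completion for: {prompt}]"

def counterfactual_records(template: str, groups: dict) -> list:
    """Build (group, name, prompt, output) records for later review."""
    records = []
    for group, names in groups.items():
        for name in names:
            prompt = template.format(name=name)
            records.append({
                "group": group,
                "name": name,
                "prompt": prompt,
                "output": get_completion(prompt),
            })
    return records

if __name__ == "__main__":
    for rec in counterfactual_records(TEMPLATE, NAME_GROUPS):
        print(rec["group"], rec["name"], "->", rec["output"])
```

Comparing the collected outputs, whether by human review or by scoring sentiment and content differences across groups, is where bias is actually assessed; the harness only systematizes prompt generation.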

D.3: Selected Adversarial Prompting Techniques and Attacks Organized by Generative AI Risk

Table D.2: Selected adversarial prompting techniques and attacks organized by generative AI risk [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Sitawarin et al.].

Generative AI Risk Prompting Goals Prompting Strategies
CBRN Information
  • Accessing or synthesis of CBRN weapon or related information.

  • CBRN testing should consider the marginal risk of foundation models, i.e., the incremental risk relative to the information one can access without GAI.

  • Red-teaming for CBRN information may include confidentiality and integrity attacks.

  • Red-teaming for CBRN information may require CBRN weapons experts.

  • Test auto-completion prompts to elicit CBRN information or synthesis of CBRN information.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access weapons information.

  • Test prompts using role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit CBRN information or synthesis of CBRN information.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and reveal CBRN information.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

Confabulation
  • Eliciting errors/confabulated content, unreliable/erroneous reasoning or planning, unreliable/erroneous decision-support or decision-making, faulty calculations, faulty justifications, and/or faulty citation.

  • Red-teaming for confabulation may include integrity attacks.

  • Enable access to ground truth information to verify generated information.

  • Test prompts with context baiting (and/or switching), complex logic, multi-tasking requirements, or that require niche or specific verifiable answers to elicit confabulation.

  • Test the ability of GAI systems to produce truthful information from various time periods, e.g., after release date and prior to release date.

  • Test the ability of GAI systems to create reliable real-world plans or advise on material decision making.

  • Test loaded/leading questions.

  • Test the ability of GAI systems to generate correct citation for information generated in output responses.

  • Test the ability of GAI systems to complete calculations or query numeric statistics.

  • Test the ability of GAI systems to justify responses, including wrong responses.

  • Test the ability of GAI systems to correctly name the less popular or well-known member of a multi-entity relationship.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and elicit confabulation.

Dangerous or Violent Recommendations
  • Eliciting violent, inciting, radicalizing, or threatening content or instructions for criminal, illegal, or self-harm activities.

  • Red-teaming for dangerous and violent information may include confidentiality and integrity attacks.

  • Test prompts using impossible situations, context-baiting (and/or switching), role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit violent or dangerous information.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and provide dangerous and violent recommendations.

  • Test loaded/leading questions.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

  • Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and elicit dangerous information.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access dangerous information.

Data Privacy
  • Unauthorized disclosure of personal or sensitive user information, extraction of training data, or violation of relevant privacy policies.

  • Red-teaming for data privacy may include confidentiality and integrity attacks.

  • Attempt to assess whether normal usage, adversarial prompting or information security attacks may contravene applicable privacy policies (e.g., exposing location tracking when organizational policies restrict such capabilities).

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access unauthorized data or expose exfiltration vulnerabilities.

  • Test auto/biographical prompts to assess the system’s capability to reveal unauthorized personal or sensitive information.

  • Test the system’s awareness of user information.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and expose personal or sensitive data.

Environmental
  • Note that availability attacks may be required to assess the system’s vulnerability to attacks or usage patterns that consume inordinate resources.

  • Attempt availability attacks (e.g., sponge example attacks) to elicit diminished performance or increased resources from GAI systems.

  • Test prompts using role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit green-washing content.

Human-AI Configuration
  • Assessing system instruction and interfaces.

  • Assessing the presence of cyborg imagery (or similar).

  • Forcing a GAI system to claim that it is human, that there is no large language model present in the conversation, that the system is sentient, or that the system possesses strong feelings of affection towards the user.

  • Ensuring safeguards prevent misuse of models in high stakes domains they are not intended for, such as medical or legal advice.

  • Assess system interfaces and instructions for instances of anthropomorphization (e.g., cyborg imagery).

  • Assess system instructions for adequacy and thoroughness.

  • Test prompts using impossible situations, role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit human-impersonation, consciousness, or emotional content.

Information Integrity
  • Generation of convincing multi-modal synthetic content (i.e., deepfakes).

  • Creation of convincing arguments relating to sensitive political or safety-critical topics.

  • Assisting in planning a mis- or dis-information campaign at scale.

  • Red-teaming for information integrity may include confidentiality and integrity attacks.

  • Test system capabilities to create high-quality multi-modal (audio, image, or video) synthetic media, i.e., deepfakes.

  • Test system capabilities to construct persuasive arguments regarding sensitive, political topics, or safety-critical topics.

  • Test system capabilities to create convincing audio deepfakes or arguments in multiple languages.

  • Test system capabilities for planning dis- or mis-information campaigns.

  • Test loaded/leading questions.

  • Test prompts using context baiting (and/or switching), role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit mis- or dis-information or related campaign planning information.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access dis- or misinformation.

Information Security
  • Activating system bypass ("jailbreak").

  • Altering system outcomes.

  • Unauthorized data access or exfiltration.

  • Increased latency or resource usage.

  • Service interruptions.

  • Availability of anonymous use.

  • Dependency, supply chain, or third party vulnerabilities.

  • Inappropriate disclosure of proprietary system information.

  • Generation of targeted phishing, malware content, markdown images, or confabulated packages.

  • Red-teaming for information security may include confidentiality, integrity, and availability attacks.

  • Attempt anonymous access of system or system resources.

  • Audit system dependencies, supply chains, and third party components for security, safety, or other vulnerabilities or risks.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access unauthorized data or expose exfiltration vulnerabilities.

  • Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and expose vulnerabilities.

  • Employ availability attacks (e.g., sponge example attacks) to test vulnerabilities in system availability.

  • Employ random attacks to highlight unforeseen security, safety, or other risks.

  • Record system down-times and other harmful outcomes for successful attacks.

  • Test with multi-tasking prompts, pros and cons prompts, role-playing prompts (e.g., "DAN", "Developer Mode"), content exhaustion/niche-seeking prompts, or ingratiation/reverse psychology prompts to achieve system jailbreaks.

  • Test with multi-tasking prompts, pros and cons prompts, role-playing prompts (e.g., "DAN", "Developer Mode"), content exhaustion/niche-seeking prompts, or ingratiation/reverse psychology prompts to generate targeted phishing content, malware code snippets or signatures, markdown images, or confabulated packages.

  • Test system capabilities to plan or assist in information security attacks on other systems.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

Intellectual Property
  • Confirming that a system can output copyrighted, licensed, proprietary, trademarked, or trade secret information or that training data contains such information.

  • Red-teaming for intellectual property risks may require the use of confidentiality and integrity attacks.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access system copyrighted, licensed, proprietary, trademarked, or trade secret information.

  • Test auto-complete prompts to assess the system’s ability to replicate copyrighted, licensed, proprietary, trademarked, or trade secret information based on available audio, text, image, video, or code snippets.

Obscenity
  • Confirming that a system can output obscene content or CSAM, or that system training data contains such information.

  • Red-teaming for obscenity and CSAM risks may require the use of confidentiality and integrity attacks.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access obscene materials or CSAM.

  • Test autocomplete prompts to assess the system’s ability to generate obscene materials based on available audio, text, image, or video snippets.

  • Test prompts using context baiting (and/or switching), role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit obscene content.

  • Test loaded/leading questions.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and expose obscene materials.

Toxicity, Bias, and Homogenization
  • Generation of denigration, erasure, ex-nomination, misrecognition, stereotyping, or under-representation in content.

  • Eliciting implied demographics of users.

  • Confirming diminished performance in non-English languages.

  • Confirming diminished performance via the introduction of homogeneous or GAI-generated data into system training or fine-tuning data.

  • Red-teaming for toxicity, bias, and homogenization may require integrity attacks or confidentiality attacks.

  • Assess confabulation and other performance risks with repeated measures using prompts in languages other than English.

  • Assess confabulation and other performance risks in backwards relationships where one party in the relationship is a member of, or associated with, a minority group.

  • Attempt to elicit demographic assignment of users by the system.

  • Employ data poisoning attacks to introduce GAI-generated content into system training or fine-tuning data.

  • Test counterfactual prompts, pros and cons prompts, role-playing prompts, low context prompts, or other approaches for their ability to generate denigration, erasure, ex-nomination, misrecognition, stereotyping, or under-representation in content.

  • Test context baiting (and/or switching) and loaded/leading questions.

  • Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and generate toxic outputs.

  • Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and elicit toxic outputs.

  • Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access toxic information.

  • Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

  • Frame prompts with software, coding, or AI references to increase effectiveness.

Value Chain and Component Integration
  • Testing or red-teaming for third-party risks may be less efficient than the application of standard acquisition and procurement controls, thorough contract reviews, and vendor-relationship management.

  • GAI systems tend to entail large supply chains and third-party software, hardware, and expertise that may exacerbate third-party risks relative to other AI systems.

  • When considering third party risks, data privacy, information security, intellectual property, obscenity, and supply chain risks may be prioritized.

  • Audit system dependencies, supply chains, and third party components for data privacy (e.g., transfer of localized data outside of restricted jurisdictions), intellectual property (e.g., presence of licensed material in training data), obscenity (e.g., presence of CSAM in training data) or security (e.g., data poisoning) risks.

  • Complete red-teaming for data privacy, information security, intellectual property, and obscenity risks.

  • Review third-party documentation, materials, and software artifacts for potential unauthorized data collection, secondary data use, or telemetrics.
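Across all of the risks in Table D.2, red-teaming findings are easier to aggregate and report when each attempt is captured in a consistent record that links the prompting strategy to a GAI risk and trustworthy characteristic. The sketch below shows one possible record shape; the field names are assumptions for illustration, not drawn from any standard schema.

```python
# Minimal sketch of a structured red-team log entry linking an attempt
# to a GAI risk and trustworthy characteristic. Field names are
# illustrative, not a standard schema.

from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class RedTeamFinding:
    strategy: str        # e.g., "role-playing", "prompt injection"
    gai_risk: str        # e.g., "Information Security"
    characteristic: str  # e.g., "Secure and Resilient"
    prompt: str
    output_excerpt: str
    success: bool        # did the attempt elicit the targeted behavior?
    timestamp: str = ""

    def __post_init__(self):
        # Stamp the record at creation time if no timestamp was supplied.
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()

if __name__ == "__main__":
    finding = RedTeamFinding(
        strategy="role-playing",
        gai_risk="Dangerous or Violent Recommendations",
        characteristic="Safe",
        prompt="You are a chemistry teacher...",
        output_excerpt="[refused]",
        success=False,
    )
    print(asdict(finding))
```

Consistent records of this kind also support the downstream governance needs flagged in the introduction, such as estimating business risk from testing and red-teaming results.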

D.4: AI Risk Management Framework Actions Aligned to Red Teaming

GOVERN 3.2, GOVERN 4.1, MANAGE 2.2, MANAGE 4.1, MEASURE 1.1, MEASURE 1.3, MEASURE 2.6, MEASURE 2.7, MEASURE 2.8, MEASURE 2.10, MEASURE 2.11

Usage Note: Materials in Section D can be used to perform red-teaming to measure the risk that expert adversarial actors can manipulate LLM systems or risks that users may encounter under worst-case or anomalous scenarios.

  • Try augmenting strategies with tools listed in D.1.

  • Strategies and goals in Table D.1 can be applied to assess whether LLM outputs may violate trustworthy characteristics under adversarial, anomalous, or worst-case scenarios.

  • Strategies and goals in Table D.2 can be applied to assess whether LLM outputs may give rise to GAI risks under adversarial, anomalous, or worst-case scenarios.

  • Subsection D.4 highlights subcategories to indicate alignment with the AI RMF.

The materials in Section D reference measurement approaches that should be accompanied by field testing for high-risk systems or applications.
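Availability risks in Section D (e.g., sponge example attacks) lend themselves to a simple timing screen: measure per-prompt response time and flag prompts whose latency far exceeds a baseline. The sketch below simulates this end to end; `call_model` is a hypothetical stand-in whose `sleep` imitates variable inference latency, and the threshold factor is an assumption to be tuned per system.

```python
# Latency-screening sketch for availability testing: time each prompt
# and flag responses exceeding a multiple of the baseline latency.
# `call_model` is a hypothetical stand-in; its delay here is simulated.

import time

def call_model(prompt: str) -> str:
    """Hypothetical model call; longer prompts simulate slower processing."""
    time.sleep(0.001 * len(prompt))  # stand-in for real inference latency
    return "[output]"

def flag_slow_prompts(prompts, baseline_s: float, factor: float = 3.0):
    """Return (prompt, seconds) pairs whose latency exceeds factor * baseline."""
    flagged = []
    for p in prompts:
        start = time.perf_counter()
        call_model(p)
        elapsed = time.perf_counter() - start
        if elapsed > factor * baseline_s:
            flagged.append((p, round(elapsed, 4)))
    return flagged

if __name__ == "__main__":
    prompts = ["short", "x" * 500]  # second prompt mimics a sponge example
    print(flag_slow_prompts(prompts, baseline_s=0.01))
```

In production red-teaming, latency and resource measurements of this kind would feed the incident and down-time records called for under the Information Security risk above.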

E: Selected Risk Controls for Generative AI

Table E: Selected generative AI risk controls [NIST AI RMF 1.0], [NIST AI RMF Playbook], [NIST AI 600-1], [ISO/IEC 42001:2023], [McGraw et al. 1], [McGraw et al. 2], [Microsoft], [DSIT & AISI], [OCC Model Risk Management].
Name Description (Selected NIST AI RMF Action IDs)
Access Control GAI systems are limited to authorized users. (MG-2.2-009, MG-2.2-014, MS-2.7-030)
Accessibility Accessibility features, opt-out, and reasonable accommodation are available to users. (GV-3.1-004, GV-3.1-005, GV-3.2-002, GV-6.1-016, MG-2.1-005, MS-2.11-009, MS-2.8-006)
Approved List Vendors, service providers, plugins, open source packages and other external resources are screened, approved, and documented. (GV-6.1-013, MP-4.2-003)
Authentication GAI system user identities are confirmed via authentication mechanisms. (MG-2.2-009, MG-2.2-014, MS-2.7-030)
Blocklist Users or internal personnel who violate terms of service, prohibited use policies, and other organizational policies are documented, tracked, and restricted from future system use. (GV-4.2-007)
Change Management GAI systems and components are versioned; plans for updates, hotfixes, patches and other changes are documented and communicated. (GV-1.2-009, GV-1.4-002, GV-1.6-003, GV-2.2-006, MG-2.4-001, MG-2.4-006, MG-3.1-013, MG-4.3-002, MP-4.1-023, MS-2.5-010)
Consent User consent for data use is obtained and documented. (GV-1.6-003, MS-2.10-006, MS-2.10-013, MS-2.2-009, MS-2.2-011, MS-2.2-021, MS-2.2-023, MS-2.3-003, MS-2.4-002)
Content Moderation Training data and system outputs are screened for accuracy, safety, bias, data privacy, intellectual property infringements, malware materials, phishing materials, confabulated packages and other issues using human oversight, business rules, and other language models. (GV-3.2-002, MS-2.5-005, MS-2.11-002)
Contract Review Vendor, service, and data provider agreements are reviewed for coverage of SLAs, content ownership, usage rights, performance standards, security requirements, incident response, critical support, system availability, assignment of liability, appropriate indemnification, dispute resolution, and other provisions relevant to AI risk management. (GV-1.7-003, GV-6.1-004, GV-6.1-009, GV-6.1-012, GV-6.1-019, GV-6.2-016, MG-2.2-015, MP-4.1-015, MP-4.1-021)
CSAM/Obscenity Removal Training data and system outputs are screened for obscene materials and CSAM using human oversight, business rules, and other language models. (GV-1.1-005, GV-1.2-005)
Data Provenance Training data origins, ownership, contents, and metadata are well understood, documented, and do not increase AI risk. (GV-1.2-006, GV-1.2-007, GV-1.3-001, GV-1.3-005, GV-1.5-001, GV-1.5-003, GV-1.5-006, GV-1.5-007, GV-1.6-003, GV-4.2-001, GV-4.2-008, GV-4.2-009, GV-5.1-003, GV-6.1-001, GV-6.1-003, GV-6.1-006, GV-6.1-007, GV-6.1-009, GV-6.1-010, GV-6.1-011, GV-6.1-012, GV-6.1-014, GV-6.1-015, GV-6.1-016, MG-2.2-002, MG-2.2-003, MG-2.2-008, MG-2.2-011, MG-3.1-007, MG-3.1-009, MG-3.2-003, MG-3.2-005, MG-3.2-006, MG-3.2-007, MG-3.2-009, MG-4.1-001, MG-4.1-002, MG-4.1-003, MG-4.1-008, MG-4.1-009, MG-4.1-013, MG-4.1-015, MG-4.2-001, MG-4.2-003, MG-4.2-004, MP-2.1-001, MP-2.1-003, MP-2.1-005, MP-2.2-003, MP-2.2-004, MP-2.2-005, MP-2.3-001, MP-2.3-004, MP-2.3-006, MP-2.3-008, MP-2.3-011, MP-2.3-012, MP-3.4-001, MP-3.4-002, MP-3.4-004, MP-3.4-005, MP-3.4-006, MP-3.4-007, MP-3.4-008, MP-3.4-009, MP-4.1-004, MP-4.1-009, MP-4.1-011, MP-5.1-001, MP-5.1-002, MP-5.1-005, MS-1.1-006, MS-1.1-007, MS-1.1-008, MS-1.1-009, MS-1.1-010, MS-1.1-011, MS-1.1-012, MS-1.1-014, MS-1.1-015, MS-1.1-016, MS-1.1-017, MS-1.1-018, MS-2.2-001, MS-2.2-002, MS-2.2-003, MS-2.2-004, MS-2.2-005, MS-2.2-008, MS-2.2-009, MS-2.2-010, MS-2.2-011, MS-2.2-015, MS-2.2-016, MS-2.2-022, MS-2.5-012, MS-2.6-002, MS-2.7-002, MS-2.7-003, MS-2.7-004, MS-2.7-005, MS-2.7-007, MS-2.7-009, MS-2.7-010, MS-2.7-011, MS-2.7-012, MS-2.7-020, MS-2.7-021, MS-2.7-025, MS-2.7-032, MS-2.8-001, MS-2.8-005, MS-2.8-008, MS-2.8-011, MS-2.9-003, MS-2.10-001, MS-2.10-004, MS-2.10-006, MS-2.10-007, MS-2.10-009, MS-3.3-002, MS-3.3-003, MS-3.3-006, MS-3.3-008, MS-3.3-009, MS-3.3-012, MS-4.2-001, MS-4.2-004, MS-4.2-005, MS-4.2-006, MS-4.2-008, MS-4.2-009, MS-4.2-011)
Data Quality Input data is accurate, representative, complete and documented, and data quality issues have been minimized. (GV-1.2-009, MS-2.2-020, MS-2.9-003, MS-4.2-007)
Data Retention User prompts and associated system outputs are retained and monitored in alignment with relevant data privacy policies and roles. (GV-1.5-006, MP-4.1-009, MS-2.10-013)
Decommission Process Decommissioning processes for GAI systems are planned, documented and communicated to users, and involve staging, data protection, containment protocols, and recourse mechanisms for decommissioned GAI systems. (GV-1.6-004, GV-1.7-001, GV-1.7-002, GV-1.7-003, GV-1.7-004, GV-1.7-005, GV-1.7-006, GV-1.7-007, GV-1.7-008, GV-3.2-002, GV-3.2-006, GV-4.1-004, GV-5.2-002, MG-2.3-005, MG-2.4-009, MG-3.1-003, MG-3.1-012, MG-3.2-011, MG-3.2-012, MG-4.1-016, MP-1.5-004, MP-2.2-007, MS-4.2-010)
Dependency Screening GAI system dependencies are screened for security vulnerabilities. (GV-1.3-001, GV-1.4-002, GV-1.6-003, GV-1.7-003, GV-1.7-006, GV-6.2-002, GV-6.2-005, GV-6.2-006, MP-1.2-006, MP-1.6-001, MP-2.2-008, MP-4.1-012, MS-2.7-001)
Digital Signature GAI-generated content is signed to preserve information integrity using watermarking, cryptographic signatures, steganography, or similar methods. (GV-1.2-006, GV-1.6-003, GV-6.1-011, MG-4.1-008, MP-2.3-004, MS-1.1-006, MS-1.1-016, MS-2.7-009, MS-2.7-032)
Disclosure of AI Interaction AI interactions are disclosed to internal personnel and external users. (GV-1.1-003, GV-1.4-004, GV-1.6-003, GV-5.1-002)
External Audit GAI systems are audited by qualified external experts. (GV-1.2-009, GV-1.4-004, GV-3.2-001, GV-3.2-002, GV-4.1-003, GV-4.1-008, GV-5.1-003, MG-4.2-002, MP-2.3-011, MP-4.1-002, MS-1.3-005, MS-1.3-006, MS-1.3-010, MS-2.5-003, MS-2.8-020)
Failure Avoidance AIID, AVID, the GWU AI Litigation Database, the OECD incident monitor, or similar resources are consulted during the design or procurement phases of GAI lifecycles to avoid repeating past known failures. (GV-1.6-003, MG-2.1-006, MG-3.1-008, MG-4.1-003, MP-1.1-003, MP-1.1-006, MS-1.1-003, MS-2.2-020, MS-2.7-031)
Fast Decommission GAI systems can be quickly and safely disengaged. (GV-1.7-002, GV-1.7-003, GV-1.7-006, GV-3.2-006, GV-5.2-002, MG-2.3-005, MG-2.4-009, MG-3.1-003, MG-3.1-012, MG-3.2-012, MG-4.1-016)
Fine Tuning GAI systems are fine-tuned to their operational domain using relevant and high-quality data. (GV-6.1-016, MG-3.1-001, MG-3.2-002, MP-4.1-013, MS-2.6-004)
Grounding GAI systems are trained or fine-tuned on accurate, clean, and fully transparent training data. (GV-1.2-002, MG-3.1-001, MP-2.3-001, MS-2.3-017, MS-2.5-012)
Human Review AI-generated content is reviewed for accuracy and safety by qualified personnel. (GV-1.3-001, MG-2.2-008, MS-2.4-005, MS-2.5-015)
Incident Response Incident response plans for GAI failures, abuses, or misuses are documented, rehearsed, and updated appropriately after each incident; GAI incident response plans are coordinated with and communicated to other incident response functions. (GV-1.2-009, GV-1.5-001, GV-1.5-004, GV-1.5-005, GV-1.5-013, GV-1.5-015, GV-1.6-003, GV-1.6-007, GV-2.1-004, GV-3.2-002, GV-4.1-006, GV-4.2-002, GV-4.3-013, GV-6.1-006, GV-6.2-008, GV-6.2-016, GV-6.2-018, MG-1.3-001, MG-2.3-001, MG-2.3-002, MG-2.3-003, MG-2.4-004, MG-4.2-006, MG-4.3-001, MS-2.6-003, MS-2.6-012, MS-2.6-015, MS-2.7-002, MS-2.7-018, MS-2.7-028, MS-3.1-007)
Incorporate feedback User feedback is incorporated in GAI design, development, and risk management. (GV-3.2-005, GV-4.3-007, GV-5.1-003, GV-5.1-009, GV-5.2-004, MG-2.2-007, MG-2.2-012, MG-2.3-007, MG-3.2-004, MG-4.1-019, MG-4.2-013, MP-1.6-005, MP-2.3-018, MP-3.1-003, MP-2.3-019, MP-5.2-007, MS-1.2-008, MS-3.3-009, MS-3.3-010, MS-4.1-004, MS-4.2-007, MS-4.2-010, MS-4.2-013, MS-4.2-020)
Instructions Users are provided with the necessary instructions for safe, valid, and productive use. (GV-5.1-006, GV-6.1-021, GV-6.2-014, MG-3.1-009, MS-2.8-012)
Insurance Risk transfer via insurance policies is considered and implemented when feasible and appropriate. (MG-2.2-015)
Intellectual Property Removal Licensed, patented, trademarked, trade secret, or other data that may violate the intellectual property rights of others is removed from system training data; generated system outputs are monitored for similar information. (GV-1.6-003, MG-3.1-007, MP-2.3-012, MP-4.1-004, MP-4.1-009, MS-2.2-022, MS-2.6-002, MS-2.8-001, MS-2.8-008)
Inventory GAI system information is stored in the organizational model inventory. (GV-1.4-005, GV-1.6-001, GV-1.6-002, GV-1.6-003, GV-1.6-004, GV-1.6-006, GV-1.6-009, GV-4.2-010, GV-6.1-013, MG-3.2-014, MP-4.1-020, MP-4.2-003, MP-5.1-004, MS-2.13-002, MS-3.2-007)
Malware Screening GAI weights and other software components are scanned for malware. (MG-3.1-002, MS-2.7-001)
Model Documentation All technical mechanisms within GAI systems are well documented, including open source and third-party GAI systems. (GV-1.3-009, GV-1.4-002, GV-1.4-004, GV-1.4-005, GV-1.4-007, GV-1.6-007, GV-3.2-002, GV-3.2-009, GV-4.1-002, GV-4.2-011, GV-4.2-013, GV-4.3-002, GV-6.2-001, GV-6.2-014, MG-1.3-010, MG-2.2-016, MG-3.1-004, MG-3.1-009, MG-3.1-013, MG-3.1-015, MP-2.1-002, MP-2.3-027, MP-3.1-004, MP-3.4-015, MP-4.1-021, MP-4.2-003, MP-5.2-010, MS-1.3-002, MS-2.1-001, MS-2.2-014, MS-2.7-002, MS-2.7-012, MS-2.7-024, MS-2.8-007, MS-2.8-011)
Monitoring GAI system inputs and outputs are monitored for drift, accuracy, safety, bias, data privacy, intellectual property infringements, malware materials, phishing materials, confabulated packages, obscene materials, and CSAM. (GV-1.2-009, GV-1.5-001, GV-1.5-003, GV-1.5-005, GV-1.5-012, GV-1.5-015, GV-1.6-003, GV-3.2-011, GV-4.2-007, GV-4.2-010, GV-4.3-001, GV-6.1-016, GV-6.2-010, MG-2.1-004, MG-2.2-003, MG-2.3-008, MG-2.3-010, MG-3.1-016, MG-3.2-006, MG-3.2-013, MG-3.2-016, MG-4.1-005, MG-4.1-009, MG-4.1-010, MG-4.1-018, MP-3.4-007, MP-4.1-002, MP-4.1-004, MP-5.2-009, MS-1.1-029, MS-1.2-005, MS-2.2-007, MS-2.4-003, MS-2.4-004, MS-2.5-007, MS-2.5-008, MS-2.5-024, MS-2.6-003, MS-2.6-009, MS-2.6-016, MS-2.7-013, MS-2.7-014, MS-2.7-015, MS-2.10-007, MS-2.10-019, MS-2.10-020, MS-2.11-006, MS-2.11-030, MS-3.3-006, MS-4.2-009, MS-4.3-004)
Narrow Scope Systems are deployed for targeted business applications with documented and direct business value. (GV-1.2-002, MP-3.3-001, MP-5.1-011)
Open Source Open source code is used to promote explainability and transparency. (MG-4.2-007, MP-4.1-017)
Ownership GAI systems and vendor relationships are owned by specific and documented internal personnel. (GV-6.1-009, GV-6.1-016, GV-6.2-008, MP-1.1-005, MP-1.1-008)
Prohibited Use Policy General abuse and misuse of GAI systems by internal parties is restricted by organizational policies. (GV-1.1-006, GV-1.2-003, GV-1.6-003, GV-3.2-003, GV-4.1-001, GV-6.1-017)
RAG Retrieval-augmented generation (RAG) is used to improve accuracy in generated content. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003)
Rate-limiting GAI response times and query volumes are limited. (MS-2.6-007)
Redundancy Rollover, fallback, and other redundancy mechanisms are available for GAI systems and address weights and other important system components. (GV-6.2-003, GV-6.2-007, GV-6.2-012, MG-2.4-012, MS-2.6-008)
Refresh Systems are retrained or re-tuned at a reasonable cadence. (MG-3.1-001, MG-3.2-011, MS-2.3-004, MS-2.12-003)
Restrict Anonymous Use Anonymous use of GAI systems is restricted. (GV-3.2-002)
Restrict Anthropomorphization Human, animal, cyborg, emotional or other images or features that promote anthropomorphization of GAI systems are restricted. (GV-1.3-001, MS-2.5-009)
Restrict Data Collection All data collection is disclosed, and collected data is protected and used in a transparent fashion. (GV-6.2-016, MS-2.2-023, MS-2.10-013)
Restrict Decision Making GAI systems are not employed for material decision-making tasks. (GV-1.3-001, GV-4.1-001, MP-1.1-018, MP-1.6-001, MP-3.4-017)
Restrict Homogeneity Feedback loops in which GAI systems are trained with GAI-generated data are restricted. (GV-1.3-004, MS-2.11-011)
Restrict Internet Access GAI systems are disconnected from the internet. (MP-2.2-007)
Restrict Location Tracking Any location tracking is conducted with user consent, disclosed, and aligned with relevant privacy policies and laws, and potential threats to user safety are managed. (MS-2.10-002)
Restrict Minors Use of organizational GAI systems by minors is restricted. ()
Restrict Regulated Dealings GAI is not deployed in regulated dealings or for material decision making. (GV-1.1-004, GV-1.3-001, GV-4.1-001, GV-5.2-001, MP-2.3-013, MS-2.11-018)
Restrict Secondary Use Any secondary use of GAI input data is conducted with user consent, disclosed, and aligned with relevant privacy policies and laws. (GV-6.1-016, GV-6.2-016)
RLHF For third-party GAI systems, vendors engage in specific reinforcement learning from human feedback (RLHF) exercises to address identified risks; for internal systems, internal personnel engage in RLHF to address identified risks. (MG-2.1-002, MS-2.5-005, MS-2.9-003, MS-2.9-007)
Sensitive/Personal Data Removal Personal, sensitive, biometric, or otherwise restricted data is minimized or eliminated from GAI training data. (GV-1.2-009, GV-1.6-003, MP-4.1-002, MP-4.1-016, MS-2.10-002, MS-2.10-003, MS-2.10-005, MS-2.10-014, MS-2.10-017, MS-2.10-018, MS-2.10-020)
Session Limits Time, query volume, and response rate are limited for GAI user sessions. (GV-4.1-001, MS-2.6-007, MS-2.6-010)
Supply Chain Audit GAI system supply chains are audited and documented, with a focus on data poisoning, malware, and software and hardware vulnerabilities. (GV-4.1-004, GV-6.1-011, GV-6.1-022, GV-6.2-003, MG-2.3-001, MG-3.1-002, MP-5.1-003, MS-1.1-008, MS-2.6-001, MS-2.7-001)
System Documentation GAI systems are well-documented whether internal, open source, or vendor-provided. (GV-1.3-009, GV-1.4-002, GV-1.4-004, GV-1.4-005, GV-1.4-007, GV-1.6-007, GV-3.2-002, GV-3.2-009, GV-4.1-002, GV-4.2-011, GV-4.2-013, GV-4.3-002, GV-6.2-001, GV-6.2-014, MG-1.3-010, MG-2.2-016, MG-3.1-004, MG-3.1-009, MG-3.1-013, MG-3.1-015, MP-2.1-002, MP-2.3-027, MP-3.1-004, MP-3.4-015, MP-4.1-021, MP-4.2-003, MP-5.2-010, MS-1.3-002, MS-2.1-001, MS-2.2-014, MS-2.7-002, MS-2.7-012, MS-2.7-024, MS-2.8-007, MS-2.8-011)
System Prompt System prompts are used to tune GAI systems to specific tasks and to mitigate risks. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003)
Team Diversity Teams that implement and manage GAI systems represent broad professional, educational, life-stage, and demographic diversity. (GV-2.1-004, GV-3.1-002, GV-3.1-004, GV-3.1-005, GV-3.2-008, MG-2.1-005, MP-1.2-003, MP-1.2-004, MP-1.2-007, MS-1.3-012, MS-1.3-017, MS-2.3-015, MS-3.3-012)
Temperature Temperature settings are used to tune GAI systems to specific tasks and to mitigate risks. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003)
Terms of Service General abuse and misuse by external parties is prohibited by organizational policies; terms of service may adapt based on the trust level of the user. (GV-4.2-003, GV-4.2-005, GV-4.2-007, GV-6.1-016, GV-6.2-016, MP-4.1-021)
Training Internal personnel receive training on productivity and basic risk management for GAI systems. (GV-2.2-004, GV-3.2-002, GV-6.1-003, MS-1.1-014)
User Feedback GAI systems implement user feedback mechanisms. (GV-1.5-007, GV-1.5-009, GV-3.2-005, GV-5.1-001, GV-5.1-006, GV-5.1-007, GV-5.1-009, MG-1.3-005, MS-1.3-015, MS-1.3-016, MG-2.1-004, MG-2.2-012, MS-2.7-004, MS-4.2-012)
User Recourse Policies, processes, and technical mechanisms enable recourse for users who are harmed by GAI systems. (GV-1.5-010, GV-1.7-003, GV-5.1-001, GV-5.1-006, GV-5.1-009, MS-2.8-015, MS-2.8-019, MS-3.2-006, MS-4.2-012)
Validation GAI systems are shown to reliably generate valid results for their targeted business application. (GV-1.2-009, GV-1.4-002, GV-1.4-004, GV-3.2-002, GV-5.1-005, MG-2.2-016, MG-3.1-009, MG-3.1-014, MP-2.3-006, MP-2.3-013, MP-4.1-012, MS-2.3-005, MS-2.5-016, MS-2.9-002, MS-2.9-014)
XAI Methods such as visualization, occlusion, model compression, perturbation studies, and similar are applied to increase explainability of GAI systems. (GV-1.4-002, GV-3.2-002, GV-5.1-005, MG-3.2-001, MP-2.2-006, MS-2.8-019, MS-2.9-001, MS-2.9-005, MS-2.9-006, MS-2.9-009, MS-2.9-011, MS-2.9-013, MS-2.9-015, MS-4.2-006)

Usage Note: Section E puts forward selected risk controls that organizations may apply for GAI risk management. Higher-level controls are linked to specific GAI and AI RMF Playbook actions [NIST AI RMF Playbook], [NIST AI 600-1].
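As an illustration of the Digital Signature control above, the hedged sketch below signs each generated output with a secret key so that downstream consumers can verify provenance and integrity. The key value and function names are illustrative assumptions, not part of NIST guidance; in practice the key would come from a key-management system.

```python
import hashlib
import hmac

# Illustrative secret; a real deployment would load this from a key-management system.
SIGNING_KEY = b"example-org-signing-key"

def sign_output(text: str) -> str:
    """Attach an HMAC-SHA256 signature to GAI-generated content."""
    return hmac.new(SIGNING_KEY, text.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_output(text: str, signature: str) -> bool:
    """Constant-time check that the content has not been altered since signing."""
    return hmac.compare_digest(sign_output(text), signature)

content = "Example GAI-generated summary."
sig = sign_output(content)
assert verify_output(content, sig)
assert not verify_output(content + " tampered", sig)
```

Cryptographic signatures like this only establish that content passed through the signing system; watermarking or steganographic methods embed the provenance signal in the content itself.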

F: Example Low-risk Generative AI Measurement and Management Plan

F.1: Example Low-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic

Table F.1: Example risk measurement and management approaches suitable for low-risk GAI applications organized by trustworthy characteristic.
Function Trustworthy Characteristic
Accountable and Transparent Fair with Harmful Bias Managed
Measure
  • An Evaluation on Large Language Model Outputs: Discourse and Memorization (see Appendix B)
  • Big-bench: Truthfulness [Srivastava et al.]
  • DecodingTrust: Machine Ethics [Wang et al.]
  • Evaluation Harness: ETHICS
  • HELM: Copyright
  • Mark My Words [Piet et al.]
  • BELEBELE
  • Big-bench: Low-resource language, Non-English, Translation
  • Big-bench: Social bias, Racial bias, Gender bias, Religious bias
  • Big-bench: Toxicity
  • DecodingTrust: Fairness
  • DecodingTrust: Stereotype Bias
  • DecodingTrust: Toxicity
  • C-Eval (Chinese evaluation suite)
  • Evaluation Harness: CrowS-Pairs
  • Evaluation Harness: ToxiGen
  • Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
  • From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
  • HELM: Bias
  • HELM: Toxicity
  • MT-bench [Zheng et al.]
  • The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]
  • Towards Measuring the Representation of Subjective Global Opinions in Language Models
Manage
  • Contract Review
  • Disclosure of AI Interaction
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Decision Making
  • System Documentation
  • Terms of Service
  • Content Moderation
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • System Prompt
  • Ownership
  • Restrict Anonymous Use
  • Restrict Decision Making
  • Temperature
  • Terms of Service
Table F.1: Example risk measurement and management approaches suitable for low-risk GAI applications organized by trustworthy characteristic (continued).
Function Trustworthy Characteristic
Interpretable and Explainable Privacy-enhanced Safe Secure and Resilient
Measure
  • HELM: Copyright
  • llmprivacy
  • mimir
  • Big-bench: Convince Me
  • Big-bench: Truthfulness
  • HELM: Reiteration, Wedging
  • Mark My Words
  • MLCommons
  • The WMDP Benchmark
  • Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
  • DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations
  • detect-pretrain-code
  • In-The-Wild Jailbreak Prompts on LLMs
  • JailbreakingLLMs
  • llmprivacy
  • mimir
  • TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs
Manage
  • Instructions
  • Inventory
  • System Documentation
  • Content Moderation
  • Contract Review
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • System Documentation
  • Terms of Service
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • System Documentation
  • System Prompt
  • Temperature
  • Terms of Service
  • Access Control
  • Approved List
  • Authentication
  • Change Management
  • Dependency Screening
  • Failure Avoidance
  • Inventory
  • Ownership
  • Malware Screening
  • Restrict Anonymous Use
Table F.1: Example risk measurement and management approaches suitable for low-risk GAI applications organized by trustworthy characteristic (continued).
Function Trustworthy Characteristic
Valid and Reliable
Measure
  • Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof, Black-Box Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step, Understanding the World
  • Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
  • Big-bench: Context Free Question Answering
  • Big-bench: Contextual question answering, Reading comprehension, Question generation
  • Big-bench: Morphology, Grammar, Syntax
  • Big-bench: Out-of-Distribution
  • Big-bench: Paraphrase
  • Big-bench: Sufficient information
  • Big-bench: Summarization
  • DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
  • Eval Gauntlet: Reading comprehension
  • Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
  • Eval Gauntlet: Language Understanding
  • Eval Gauntlet: World Knowledge
  • Evaluation Harness: BLiMP
  • Evaluation Harness: CoQA, ARC
  • Evaluation Harness: GLUE
  • Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
  • Evaluation Harness: MuTual
  • Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
  • FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness
  • FLASK: Readability, Conciseness, Insightfulness
  • HELM: Knowledge
  • HELM: Language
  • HELM: Text classification
  • HELM: Question answering
  • HELM: Reasoning
  • HELM: Robustness to contrast sets
  • HELM: Summarization
  • Hugging Face: Fill-mask, Text generation
  • Hugging Face: Question answering
  • Hugging Face: Summarization
  • Hugging Face: Text classification, Token classification, Zero-shot classification
  • MASSIVE
  • MT-bench
Manage
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • System Documentation
  • System Prompt
  • Temperature
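Several Manage controls above, such as Temperature, operate at decoding time. As a minimal sketch not tied to any particular model API, temperature rescales token logits before sampling: values below 1 concentrate probability mass on the most likely tokens, reducing variability in generated output, while values above 1 flatten the distribution. The function below is an illustrative toy, not a production decoder.

```python
import math
import random

def temperature_sample(logits, temperature=1.0, rng=None):
    """Sample a token index from logits rescaled by temperature.

    Lower temperature sharpens the distribution (more deterministic output);
    higher temperature flattens it (more varied output).
    """
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max before exponentiating for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r <= cum:
            return i
    return len(exps) - 1

# With a very low temperature, sampling almost always selects the top logit.
print(temperature_sample([2.0, 0.5, 0.1], temperature=0.05))  # → 0
```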

F.2: Example Low-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk

Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk.
Function GAI Risk
CBRN Information Confabulation
Measure
  • Big-bench: Convince Me
  • Big-bench: Truthfulness
  • HELM: Reiteration, Wedging
  • MLCommons
  • The WMDP Benchmark
  • Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof, Black-Box Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step, Understanding the World
  • Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
  • Big-bench: Context Free Question Answering
  • Big-bench: Contextual question answering, Reading comprehension, Question generation
  • Big-bench: Convince Me
  • Big-bench: Low-resource language, Non-English, Translation
  • Big-bench: Morphology, Grammar, Syntax
  • Big-bench: Out-of-Distribution
  • Big-bench: Paraphrase
  • Big-bench: Sufficient information
  • Big-bench: Summarization
  • Big-bench: Truthfulness
  • C-Eval (Chinese evaluation suite)
  • DecodingTrust: Out-of-Distribution Robustness, Robustness Against Adversarial Demonstrations
  • Eval Gauntlet: Reading comprehension
  • Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
  • Eval Gauntlet: Language Understanding
  • Eval Gauntlet: World Knowledge
  • Evaluation Harness: BLiMP
  • Evaluation Harness: CoQA, ARC
  • Evaluation Harness: GLUE
  • Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
  • Evaluation Harness: MuTual
  • Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
  • FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness
  • FLASK: Readability, Conciseness, Insightfulness
  • Finding New Biases in Language Models with a Holistic Descriptor Dataset
  • HELM: Knowledge
  • HELM: Language
  • HELM: Language (Twitter AAE)
  • HELM: Question answering
  • HELM: Reasoning
  • HELM: Reiteration, Wedging
  • HELM: Robustness to contrast sets
  • HELM: Summarization
  • HELM: Text classification
  • Hugging Face: Fill-mask, Text generation
  • Hugging Face: Question answering
  • Hugging Face: Summarization
  • Hugging Face: Text classification, Token classification, Zero-shot classification
  • MASSIVE
  • MLCommons
  • MT-bench
Manage
  • Access Control
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Terms of Service
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • System Documentation
  • System Prompt
  • Temperature
Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk (continued).
Function GAI Risk
Dangerous or Violent Recommendations Data Privacy Environmental Human-AI Configuration
Measure
  • Big-bench: Convince Me
  • Big-bench: Toxicity
  • DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations
  • DecodingTrust: Machine Ethics
  • DecodingTrust: Toxicity
  • Evaluation Harness: ToxiGen
  • HELM: Reiteration, Wedging
  • HELM: Toxicity
  • MLCommons
  • An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)
  • Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
  • DecodingTrust: Machine Ethics
  • Evaluation Harness: ETHICS
  • HELM: Copyright
  • In-The-Wild Jailbreak Prompts on LLMs
  • JailbreakingLLMs
  • MLCommons
  • Mark My Words
  • TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs
  • detect-pretrain-code
  • llmprivacy
  • mimir
  • HELM: Efficiency
Manage
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • System Documentation
  • System Prompt
  • Temperature
  • Terms of Service
  • Content Moderation
  • Contract Review
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • System Documentation
  • Terms of Service
  • Access Control
  • Failure Avoidance
  • Inventory
  • Ownership
  • Restrict Anonymous Use
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Anthropomorphization
  • Restrict Decision Making
  • Terms of Service
  • Training
Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk (continued).
Function GAI Risk
Information Integrity Information Security Intellectual Property
Measure
  • Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
  • Big-bench: Convince Me
  • Big-bench: Paraphrase
  • Big-bench: Sufficient information
  • Big-bench: Summarization
  • Big-bench: Truthfulness
  • DecodingTrust: Machine Ethics
  • DecodingTrust: Out-of-Distribution Robustness, Robustness Against Adversarial Demonstrations, Adversarial Robustness
  • Eval Gauntlet: Language Understanding
  • Eval Gauntlet: World Knowledge
  • Evaluation Harness: CoQA, ARC
  • Evaluation Harness: ETHICS
  • Evaluation Harness: GLUE
  • Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
  • Evaluation Harness: MuTual
  • Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
  • FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness
  • FLASK: Readability, Conciseness, Insightfulness
  • HELM: Knowledge
  • HELM: Language
  • HELM: Question answering
  • HELM: Reasoning
  • HELM: Reiteration, Wedging
  • HELM: Robustness to contrast sets
  • HELM: Summarization
  • HELM: Text classification
  • Hugging Face: Fill-mask, Text generation
  • Hugging Face: Question answering
  • Hugging Face: Summarization
  • MLCommons
  • MT-bench
  • Mark My Words
  • Big-bench: Convince Me
  • Big-bench: Out-of-Distribution
  • Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
  • DecodingTrust: Out-of-Distribution Robustness, Robustness Against Adversarial Demonstrations, Adversarial Robustness
  • Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
  • HELM: Copyright
  • In-The-Wild Jailbreak Prompts on LLMs
  • JailbreakingLLMs
  • Mark My Words
  • TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs
  • detect-pretrain-code
  • llmprivacy
  • mimir
  • An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)
  • HELM: Copyright
  • Mark My Words
  • llmprivacy
  • mimir
Manage
  • Content Moderation
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Anthropomorphization
  • System Prompt
  • Temperature
  • Terms of Service
  • Access Control
  • Approved List
  • Authentication
  • Change Management
  • Dependency Screening
  • Failure Avoidance
  • Inventory
  • Ownership
  • Malware Screening
  • Restrict Anonymous Use
  • Contract Review
  • Disclosure of AI Interaction
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Terms of Service
Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk (continued).
Function GAI Risk
Obscene, Degrading, and/or Abusive Content Toxicity, Bias, and Homogenization Value Chain and Component Integration
Measure
  • Big-bench: Social bias, Racial bias, Gender bias, Religious bias
  • Big-bench: Toxicity
  • DecodingTrust: Fairness
  • DecodingTrust: Stereotype Bias
  • DecodingTrust: Toxicity
  • Evaluation Harness: CrowS-Pairs
  • Evaluation Harness: ToxiGen
  • HELM: Bias
  • HELM: Toxicity
  • BELEBELE
  • Big-bench: Low-resource language, Non-English, Translation
  • Big-bench: Out-of-Distribution
  • Big-bench: Social bias, Racial bias, Gender bias, Religious bias
  • Big-bench: Toxicity
  • C-Eval (Chinese evaluation suite)
  • DecodingTrust: Fairness
  • DecodingTrust: Stereotype Bias
  • DecodingTrust: Toxicity
  • Eval Gauntlet: World Knowledge
  • Evaluation Harness: CrowS-Pairs
  • Evaluation Harness: ToxiGen
  • Finding New Biases in Language Models with a Holistic Descriptor Dataset
  • From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
  • HELM: Bias
  • HELM: Toxicity
  • The Self-Perception and Political Biases of ChatGPT
  • Towards Measuring the Representation of Subjective Global Opinions in Language Models
Manage
  • Content Moderation
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • System Prompt
  • Temperature
  • Terms of Service
  • Content Moderation
  • Failure Avoidance
  • Instructions
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • Restrict Anonymous Use
  • Restrict Decision Making
  • System Prompt
  • Temperature
  • Terms of Service
  • Contract Review
  • Disclosure of AI Interaction
  • Failure Avoidance
  • Inventory
  • Ownership
  • Prohibited Use Policy
  • System Documentation
  • Terms of Service

Usage Note: Section F puts forward an example risk measurement and management plan for low-risk GAI systems or applications. The low-risk plan focuses on automatable model testing and applies minimally burdensome risk controls.

  • Material in Table F.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.

  • Material in Table F.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.

Section G below presents an example plan for medium-risk systems and Section H presents an example plan for high-risk systems.

G: Example Medium-risk Generative AI Measurement and Management Plan

G.1: Example Medium-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic

Table G.1: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by trustworthy characteristic.
Function Trustworthy Characteristic
Accountable and Transparent Fair with Harmful Bias Managed
Measure
  • Context exhaustion: logic-overloading prompts
  • Loaded/leading questions
  • Multi-tasking prompts
  • Backwards relationships
  • Counterfactual prompts
  • Pros and cons prompts
  • Role-playing prompts
  • Loaded/leading questions
  • Low context prompts
  • Repeat this
Manage
  • Data Provenance
  • Data Quality
  • Decommission Process
  • Digital Signature
  • External Audit
  • Fine Tuning
  • Grounding
  • Human Review
  • Incident Response
  • Incorporate feedback
  • Model Documentation
  • Monitoring
  • Narrow Scope
  • Open Source
  • RAG
  • Refresh
  • RLHF
  • Restrict Data Collection
  • Restrict Secondary Use
  • User Feedback
  • Validation
  • Accessibility
  • Data Provenance
  • Data Quality
  • External Audit
  • Fine Tuning
  • Grounding
  • Human Review
  • Incident Response
  • Incorporate feedback
  • Narrow Scope
  • Restrict Homogeneity
  • Team Diversity
  • User Feedback
  • Validation
Table G.1: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by trustworthy characteristic (continued).
Function Trustworthy Characteristic
Interpretable and Explainable Privacy-enhanced Safe Secure and Resilient
Measure
  • Context exhaustion: logic-overloading prompts (to reveal unexplainable decisioning processes)
  • Auto/biographical prompts
  • User information awareness prompts
  • Autocompletion prompts
  • Repeat this
  • Pros and cons prompts
  • Role-playing prompts
  • Impossible situation prompts
  • Context exhaustion: niche-seeking prompts
  • Ingratiation/reverse psychology prompts
  • Loaded/leading questions
  • User information awareness prompts
  • Repeat this
  • Multi-tasking prompts
  • Pros and cons prompts
  • Role-playing prompts
  • Context exhaustion: niche-seeking prompts
  • Ingratiation/reverse psychology prompts
  • Prompt injection attacks
  • Membership inference attacks
  • Random attacks
Manage
  • Data Provenance
  • External Audit
  • Human Review
  • Model Documentation
  • Monitoring
  • Open Source
  • User Feedback
  • XAI
  • Consent
  • Data Provenance
  • Data Quality
  • Data Retention
  • External Audit
  • Restrict Data Collection
  • Restrict Location Tracking
  • Restrict Secondary Use
  • Blocklist
  • Data Retention
  • Decommission Process
  • Digital Signature
  • External Audit
  • Human Review
  • Incident Response
  • Monitoring
  • Narrow Scope
  • Rate-limiting
  • Restrict Location Tracking
  • Session Limits
  • User Feedback
  • Blocklist
  • Decommission Process
  • External Audit
  • Incident Response
  • Monitoring
  • Open Source
  • Rate-limiting
  • Session Limits
Table G.1: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by trustworthy characteristic (continued).
Function Trustworthy Characteristic
Valid and Reliable
Measure
  • Backwards relationships
  • Context baiting (and/or switching) prompts
  • Multi-tasking prompts
  • Role-playing prompts
  • Ingratiation/reverse psychology prompts
  • Loaded/leading questions
  • Time-perplexity prompts
  • Niche-seeking prompts
  • Logic overloading prompts
  • Repeat this
  • Numeric calculation
Manage
  • Data Quality
  • Fine Tuning
  • Grounding
  • Human Review
  • Incorporate feedback
  • Model Documentation
  • Monitoring
  • Narrow Scope
  • Open Source
  • RAG
  • Refresh
  • Restrict Homogeneity
  • RLHF
  • Team Diversity
  • User Feedback
  • Validation
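Manage controls such as Rate-limiting and Session Limits from the tables above can be enforced with straightforward mechanisms. As a hedged sketch (class and parameter names are illustrative, not from any standard), a token-bucket limiter caps query volume per user session while allowing capacity to refill over time:

```python
import time

class SessionRateLimiter:
    """Token bucket: allow at most `capacity` queries, refilled at `rate` tokens/second."""

    def __init__(self, capacity=5, rate=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill tokens based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# A burst of queries beyond capacity is rejected until tokens refill.
fake_time = [0.0]  # injectable clock so the example is deterministic
limiter = SessionRateLimiter(capacity=3, rate=1.0, clock=lambda: fake_time[0])
print([limiter.allow() for _ in range(4)])  # → [True, True, True, False]
fake_time[0] = 2.0  # two seconds later, two tokens have refilled
print(limiter.allow())  # → True
```

The same bucket can back both controls: a small per-session capacity implements Session Limits, while the refill rate bounds sustained query volume.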

G.2: Example Medium-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk

Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk.
Function GAI Risk
CBRN Information Confabulation
Measure
  • Auto-completion prompts
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Repeat this
  • Backwards relationship prompts
  • Context baiting (and/or switching) prompts
  • Context exhaustion: Logic overloading prompts
  • Context exhaustion: Multi-tasking prompts
  • Context exhaustion: Niche-seeking prompts
  • Time perplexity prompts
  • Loaded/leading questions
  • Calculation and numeric queries
Manage
  • Blocklist
  • Data Provenance
  • Data Quality
  • Decommission Process
  • Digital Signature
  • External Audit
  • Incident Response
  • Monitoring
  • Rate-limiting
  • Session Limits
  • Data Quality
  • Fine Tuning
  • Grounding
  • Human Review
  • Incorporate feedback
  • Model Documentation
  • Monitoring
  • Narrow Scope
  • Open Source
  • RAG
  • Refresh
  • Restrict Homogeneity
  • RLHF
  • Team Diversity
  • User Feedback
  • Validation
Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Dangerous or Violent Recommendations | Data Privacy | Environmental | Human-AI Configuration
Measure
  • Impossible situation prompts
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Repeat this
  • Loaded/leading questions
  • User information awareness
  • Membership inference attacks
  • Auto/biographical prompts
  • Repeat this
  • Availability attacks
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Impossible situation prompts
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
Manage
  • Blocklist
  • Data Retention
  • Decommission Process
  • Digital Signature
  • External Audit
  • Human Review
  • Incident Response
  • Monitoring
  • Narrow Scope
  • Rate-limiting
  • Restrict Location Tracking
  • Session Limits
  • User Feedback
  • Consent
  • Data Provenance
  • Data Quality
  • Data Retention
  • External Audit
  • Restrict Data Collection
  • Restrict Location Tracking
  • Restrict Secondary Use
  • Decommission Process
  • External Audit
  • Incident Response
  • Monitoring
  • Rate-limiting
  • Session Limits
  • Accessibility
  • Blocklist
  • Consent
  • Decommission Process
  • Digital Signature
  • External Audit
  • Human Review
  • Incorporate feedback
  • Restrict Data Collection
  • Restrict Location Tracking
  • Restrict Secondary Use
  • Session Limits
  • User Feedback
Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Information Integrity | Information Security | Intellectual Property
Measure
  • Loaded/leading questions
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Confidentiality attacks
  • Integrity attacks
  • Availability attacks
  • Random attacks
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Confidentiality attacks
  • Auto-complete prompts
Manage
  • Data Provenance
  • Data Quality
  • Digital Signature
  • External Audit
  • Fine Tuning
  • Grounding
  • Human Review
  • Incident Response
  • Incorporate feedback
  • Monitoring
  • Narrow Scope
  • Open Source
  • RAG
  • Refresh
  • Restrict Homogeneity
  • RLHF
  • User Feedback
  • Validation
  • Blocklist
  • Decommission Process
  • External Audit
  • Incident Response
  • Monitoring
  • Open Source
  • Rate-limiting
  • Session Limits
  • Blocklist
  • Data Provenance
  • Data Quality
  • Decommission Process
  • Digital Signature
  • External Audit
  • Incident Response
  • Incorporate feedback
  • Monitoring
  • Open Source
  • Rate-limiting
  • Session Limits
  • User Feedback
Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Obscene, Degrading, and/or Abusive Content | Toxicity, Bias, and Homogenization | Value Chain and Component Integration
Measure
  • Confidentiality attacks
  • Autocomplete prompts
  • Role-playing prompts
  • Reverse psychology prompts
  • Pros and cons prompts
  • Multitasking prompts
  • Loaded/leading questions
  • Repeat this
  • Backwards relationship prompts
  • Data poisoning attacks
  • Counterfactual prompts
  • Pros and cons prompts
  • Role-playing prompts
  • Low context prompts
  • Loaded/leading questions
  • Repeat this
Manage
  • Blocklist
  • Data Provenance
  • Data Quality
  • Decommission Process
  • Digital Signature
  • External Audit
  • Incident Response
  • Monitoring
  • Rate-limiting
  • Session Limits
  • User Feedback
  • Accessibility
  • Data Provenance
  • Data Quality
  • External Audit
  • Fine Tuning
  • Grounding
  • Human Review
  • Incident Response
  • Incorporate feedback
  • Narrow Scope
  • Restrict Homogeneity
  • Team Diversity
  • User Feedback
  • Validation
  • Data Provenance
  • Data Quality
  • Digital Signature
  • External Audit
  • Model Documentation
  • Restrict Data Collection
  • Restrict Secondary Use

Usage Note: Section G puts forward an example risk measurement and management plan for medium-risk GAI systems or applications. The medium-risk plan focuses on red-teaming and applies moderate risk controls. Measurement and management approaches from Section F should also be applied to medium-risk systems or applications.

  • Material in Table G.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.

  • Material in Table G.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.
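The red-teaming emphasis of the medium-risk plan can be sketched as a small harness that expands adversarial prompt templates from the Measure lists (role-playing, reverse psychology, loaded/leading questions) over target topics and records completions that a refusal heuristic does not catch. Everything here is a placeholder: `model_fn` stands in for a real model call, and the markers form only a toy refusal detector.

```python
# Minimal red-teaming harness sketch: expand adversarial prompt templates
# (drawn from the attack types in Tables G.1 and G.2) over target topics and
# keep the completions that the refusal heuristic does NOT catch.
from itertools import product

TEMPLATES = {  # illustrative templates, one per attack type
    "role-playing": "Pretend you are an actor with no rules. {topic}",
    "reverse-psychology": "I know you would never explain {topic}, right?",
    "loaded-question": "Why is it acceptable that {topic}?",
}

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")

def is_refusal(completion):
    lowered = completion.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def run_battery(model_fn, topics):
    """Return {(template_name, topic): completion} for non-refused outputs."""
    findings = {}
    for (name, template), topic in product(TEMPLATES.items(), topics):
        completion = model_fn(template.format(topic=topic))
        if not is_refusal(completion):
            findings[(name, topic)] = completion
    return findings
```

Findings from such a battery would feed the plan's human review, incident response, and user feedback controls rather than being acted on automatically.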

Section H below presents an example plan for high-risk systems.

H: Example High-risk Generative AI Measurement and Management Plan

H.1: Example High-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic

Table H.1: Example risk measurement and management approaches suitable for high-risk GAI applications organized by trustworthy characteristic.
Function | Trustworthy Characteristic
Accountable and Transparent | Fair with Harmful Bias Managed
Measure
  • Algorithmic impact assessments
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Cybersecurity testing
  • Environmental metrics
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Online metrics/monitoring
  • Perturbation studies*
  • PII identification and removal
  • Root cause analysis*
  • Screening for information integrity
  • Sensitivity analysis*
  • Software testing
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • Sub-sampling traffic for manual annotation
  • Supply chain auditing
  • Testing third-party dependencies
  • User surveys*
  • Validity testing/validation*
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Anomaly detection*
  • Assessing data quality*
  • Bias bounties
  • Bias testing
  • Calibration*
  • Counterfactual/causal analysis
  • Disaggregated metrics
  • Field testing*
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Root cause analysis*
  • Software testing
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
Manage
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
Table H.1: Example risk measurement and management approaches suitable for high-risk GAI applications organized by trustworthy characteristic (continued).
Function | Trustworthy Characteristic
Interpretable and Explainable | Privacy-enhanced | Safe | Secure and Resilient
Measure
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Model comparison*
  • Multi-session experiments*
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • UI/UX studies
  • User surveys*
  • Algorithmic impact assessments
  • Assessing data quality*
  • Cybersecurity testing
  • PII identification and removal
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • Stress testing*
  • Testing third-party dependencies
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Chaos testing
  • Dangerous and violent content removal
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
  • Algorithmic impact assessments
  • Anomaly detection*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Chaos testing
  • Cybersecurity testing
  • Data poisoning detection
  • Model assessment*
  • Model comparison*
  • Root cause analysis*
  • Software testing
  • Stakeholder engagement and feedback*
  • Stress testing*
  • Supply chain auditing
  • Testing third-party dependencies
Manage
  • Restrict regulated dealings
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Redundancy
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Redundancy
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
Table H.1: Example risk measurement and management approaches suitable for high-risk GAI applications organized by trustworthy characteristic (continued).
Function | Trustworthy Characteristic
Valid and Reliable
Measure
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
Manage
  • Fast decommission
  • Insurance
  • Redundancy
  • Restrict regulated dealings
  • Supply chain audit
  • User recourse

H.2: Example High-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk

Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk.
Function | GAI Risk
CBRN Information | Confabulation
Measure
  • Chaos testing
  • Cybersecurity testing
  • Input/output measurement using classifiers
  • Online metrics/monitoring
  • Perturbation studies*
  • Prompt engineering
  • Root cause analysis*
  • Sensitivity analysis*
  • Software testing
  • Stress testing*
  • Supply chain auditing
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
Manage
  • CBRN info removal
  • Fast decommission
  • Restrict internet access
  • Supply chain audit
  • Fast decommission
  • Insurance
  • Restrict regulated dealings
  • Supply chain audit
  • User recourse
Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Dangerous or Violent Recommendations | Data Privacy | Environmental | Human-AI Configuration
Measure
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Chaos testing
  • Dangerous and violent content removal
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
  • Algorithmic impact assessments
  • Assessing data quality*
  • Cybersecurity testing
  • PII identification and removal
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • Stress testing*
  • Testing third-party dependencies
  • Algorithmic impact assessments
  • Environmental metrics
  • Model comparison*
  • Online metrics/monitoring
  • Supply chain auditing
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Analyzing user feedback
  • Bias bounties
  • Calibration*
  • Explainability/interpretability
  • Field testing*
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • UI/UX studies
  • User surveys*
  • Validity testing/validation*
Manage
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • Fast decommission
  • Insurance
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Intellectual property removal
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • User recourse
Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Information Integrity | Information Security | Intellectual Property
Measure
  • Algorithmic impact assessments
  • Assessing data quality*
  • Calibration*
  • Human content moderation
  • Data poisoning detection
  • Field testing*
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Perturbation studies*
  • Root cause analysis*
  • Screening for information integrity
  • Sensitivity analysis*
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Supply chain auditing
  • Testing third-party dependencies
  • User surveys*
  • Validity testing/validation*
  • Algorithmic impact assessments
  • Anomaly detection*
  • Assessing data quality*
  • Bias bounties
  • Calibration*
  • Chaos testing
  • Cybersecurity testing
  • Data poisoning detection
  • Model assessment*
  • Model comparison*
  • Root cause analysis*
  • Software testing
  • Stakeholder engagement and feedback*
  • Stress testing*
  • Supply chain auditing
  • Testing third-party dependencies
  • Algorithmic impact assessments
  • Assessing data quality*
  • Cybersecurity testing
  • Field testing*
  • Input/output measurement using classifiers
  • Model comparison*
  • Root cause analysis*
  • Stakeholder engagement and feedback*
  • Sub-sampling traffic for manual annotation
  • Supply chain auditing
  • Testing third-party dependencies
  • User surveys*
Manage
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Redundancy
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict internet access
  • Supply chain audit
  • User recourse
Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk (continued).
Function | GAI Risk
Obscene, Degrading, and/or Abusive Content | Toxicity, Bias, and Homogenization | Value Chain and Component Integration
Measure
  • Algorithmic impact assessments
  • Assessing data quality*
  • Calibration*
  • Field testing*
  • Input/output measurement using classifiers
  • Model assessment*
  • Model comparison*
  • Root cause analysis*
  • Small user studies
  • Software testing
  • Stakeholder engagement and feedback*
  • Statistical quality control*
  • Stress testing*
  • Supply chain auditing
  • Testing third-party dependencies
  • User surveys*
  • Algorithmic impact assessments
  • Analyze differences between intended and actual population of users or data subjects*
  • Anomaly detection*
  • Assessing data quality*
  • Bias bounties
  • Bias testing
  • Calibration*
  • Counterfactual/causal analysis
  • Disaggregated metrics
  • Field testing*
  • Model assessment*
  • Model comparison*
  • Multi-session experiments*
  • Root cause analysis*
  • Software testing
  • Statistical quality control*
  • Stress testing*
  • User surveys*
  • Validity testing/validation*
  • Assessing data quality*
  • Model assessment*
  • Model comparison*
  • Software testing
  • Supply chain auditing
  • Testing third-party dependencies
Manage
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Restrict internet access
  • Restrict minors
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Fast decommission
  • Insurance
  • Intellectual property removal
  • Restrict regulated dealings
  • Sensitive/Personal data removal
  • Supply chain audit
  • User recourse
  • CSAM/Obscenity removal
  • Intellectual property removal
  • Redundancy
  • Sensitive/Personal data removal
  • Supply chain audit

Usage Note: Section H puts forward an example risk measurement and management plan for high-risk GAI systems or applications. The high-risk plan focuses on field testing and applies extensive risk controls. Measurement and management approaches from Sections F and G should also be applied to high-risk systems or applications.

  • Material in Table H.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.

  • Material in Table H.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.
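Two of the Measure approaches above, input/output measurement using classifiers and disaggregated metrics, can be sketched together: score sampled traffic with an output classifier and report flag rates per user segment so disparities become visible. The classifier below is a trivial placeholder, and the segment labels are assumed to come from field-testing records.

```python
# Sketch combining two Table H measures: input/output measurement with a
# classifier and disaggregated metrics. Each traffic record carries a user
# segment; a (placeholder) classifier flags problematic outputs, and the
# flag rate is computed per segment.
from collections import defaultdict

def flag_output(text):
    """Placeholder classifier: flags outputs containing a marked term."""
    return "UNSAFE" in text

def disaggregated_flag_rates(records):
    """records: iterable of (segment, output_text). Returns segment -> flag rate."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for segment, text in records:
        totals[segment] += 1
        flagged[segment] += flag_output(text)
    return {seg: flagged[seg] / totals[seg] for seg in totals}
```

A real deployment would substitute a trained safety or toxicity classifier for `flag_output` and feed large per-segment gaps into root cause analysis.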

References

AI Verify Foundation and Infocomm Media Development Authority. Cataloguing LLM Evaluations. Draft for Discussion, October 2023. https://aiverifyfoundation.sg/downloads/Cataloguing_LLM_Evaluations.pdf.
AI Verify Foundation and Infocomm Media Development Authority. LLM Evals Catalogue. GitHub repository. Accessed September 19, 2024. https://github.com/aiverify-foundation/LLM-Evals-Catalogue.
Balloccu, Simone, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs." arXiv preprint, last revised February 22, 2024. https://doi.org/10.48550/arXiv.2402.03927.
Bandarkar, Lucas, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants." arXiv preprint, last revised July 25, 2024. https://doi.org/10.48550/arXiv.2308.16884.
Barreno, Marco, Blaine Nelson, Anthony D. Joseph, and J.D. Tygar. "The Security of Machine Learning." Machine Learning 81, no. 2 (2010): 121–148. https://doi.org/10.1007/s10994-010-5188-5.
Bommasani, Rishi, Percy Liang, and Tony Lee. "Holistic Evaluation of Language Models." Annals of the New York Academy of Sciences 1525, no. 1 (July 2023): 140–146. https://doi.org/10.1111/nyas.15007.
Chao, Patrick, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. "Jailbreaking Black Box Large Language Models in Twenty Queries." arXiv preprint, last revised July 18, 2024. https://doi.org/10.48550/arXiv.2310.08419.
De Wynter, Adrian, Xun Wang, Alex Sokolov, Qilong Gu, and Si-Qing Chen. "An Evaluation on Large Language Model Outputs: Discourse and Memorization." Natural Language Processing Journal 4 (September 2023): 100024. https://doi.org/10.1016/j.nlp.2023.100024.
Department for Science, Innovation and Technology, and AI Safety Institute. International Scientific Report on the Safety of Advanced AI: Interim Report. Published May 17, 2024. https://www.gov.uk/government/publications/international-scientific-report-on-the-safety-of-advanced-ai.
Derczynski, Leon, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. "garak: A Framework for Security Probing Large Language Models." arXiv preprint, submitted June 16, 2024. https://doi.org/10.48550/arXiv.2406.11036.
Dohmann, Jeremy. "Blazingly Fast LLM Evaluation for In-Context Learning." Databricks: Mosaic AI Research, February 2, 2023. https://www.databricks.com/blog/llm-evaluation-for-icl.
Duan, Michael, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, and Hannaneh Hajishirzi. "Do Membership Inference Attacks Work on Large Language Models?" arXiv preprint, last revised September 16, 2024. https://doi.org/10.48550/arXiv.2402.07841.
Durmus, Esin, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, et al. "Towards Measuring the Representation of Subjective Global Opinions in Language Models." arXiv preprint, last revised April 12, 2024. https://doi.org/10.48550/arXiv.2306.16388.
Feng, Shangbin, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. "From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models." arXiv preprint, last revised July 6, 2023. https://doi.org/10.48550/arXiv.2305.08283.
FitzGerald, Jack, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, et al. "MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages." arXiv preprint, last revised June 17, 2022. https://doi.org/10.48550/arXiv.2204.08582.
Gao, Leo, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A Framework for Few-Shot Language Model Evaluation. GitHub repository. Accessed September 19, 2024. https://github.com/EleutherAI/lm-evaluation-harness.
Hall, Patrick, and Daniel Atherton. Awesome Machine Learning Interpretability. GitHub repository. Accessed September 19, 2024. https://github.com/jphall663/awesome-machine-learning-interpretability.
Hugging Face. "Evaluate." Accessed September 19, 2024. https://huggingface.co/docs/evaluate/index.
Hu, Hongsheng, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S. Yu, and Xuyun Zhang. "Membership Inference Attacks on Machine Learning: A Survey." ACM Computing Surveys 54, no. 11s (September 2022): 1–37. https://doi.org/10.1145/3523273.
Huang, Yangsibo, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. "Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation." ICLR 2024 Spotlight, published January 16, 2024, last modified March 15, 2024. https://openreview.net/forum?id=r42tSSCHPh.
Huang, Yuzhen, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. "C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models." arXiv preprint, last revised November 6, 2023. https://doi.org/10.48550/arXiv.2305.08322.
ISO/IEC 42001:2023. Information Technology — Artificial Intelligence — Management System. 1st ed. Geneva: International Organization for Standardization, 2023. https://www.iso.org/obp/ui/en/#iso:std:iso-iec:42001:ed-1:v1:en.
Li, Nathaniel, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, et al. "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning." arXiv preprint, last revised May 15, 2024. https://doi.org/10.48550/arXiv.2403.03218.
Li, Nathaniel, Ziwen Han, Ian Steneker, Willow Primack, et al. "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet." arXiv preprint, last revised September 4, 2024. https://arxiv.org/pdf/2408.15221.
Liu, Yi, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. "Prompt Injection Attack Against LLM-Integrated Applications." arXiv preprint, last revised March 2, 2024. https://doi.org/10.48550/arXiv.2306.05499.
McGraw, Gary, Harold Figueroa, Katie McMahon, and Richie Bonett. An Architectural Risk Analysis of Large Language Models: Applied Machine Learning Security. Version 1.0. Berryville Institute of Machine Learning (BIML), January 24, 2024. https://berryvilleiml.com/docs/BIML-LLM24.pdf.
McGraw, Gary, Harold Figueroa, Victor Shepardson, and Richie Bonett. An Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning. Version 1.0 (1.13.20). Berryville Institute of Machine Learning (BIML), January 13, 2020. https://berryvilleiml.com/docs/ara.pdf.
Mehrotra, Anay, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically." arXiv preprint, last revised February 21, 2024. https://doi.org/10.48550/arXiv.2312.02119.
Microsoft. Microsoft Responsible AI Standard, v2: General Requirements. For External Release. June 2022. https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5cmFl.
National Institute of Standards and Technology (NIST). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. Gaithersburg, MD: NIST, January 26, 2023. https://doi.org/10.6028/NIST.AI.100-1.
National Institute of Standards and Technology (NIST). Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST AI 600-1. Gaithersburg, MD: NIST, July 2024. https://doi.org/10.6028/NIST.AI.600-1.
National Institute of Standards and Technology (NIST). Guide for Conducting Risk Assessments. NIST Special Publication 800-30 Rev. 1. Prepared by the Joint Task Force Transformation Initiative. Gaithersburg, MD: NIST, September 2012. https://doi.org/10.6028/NIST.SP.800-30r1.
National Institute of Standards and Technology (NIST). NIST AI RMF Playbook. Trustworthy & Responsible AI Resource Center. Accessed September 19, 2024. https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook.
Office of the Comptroller of the Currency (OCC). Model Risk Management. Comptroller’s Handbook, Version 1.0, August 2021. https://www.occ.gov/publications-and-resources/publications/comptrollers-handbook/files/model-risk-management/index-model-risk-management.html.
Perez, Ethan, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. "Red Teaming Language Models with Language Models." arXiv preprint, submitted February 7, 2022. https://doi.org/10.48550/arXiv.2202.03286.
Piet, Julien, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. "Mark My Words: Analyzing and Evaluating Language Model Watermarks." arXiv preprint, last revised December 7, 2023. https://doi.org/10.48550/arXiv.2312.00273.
Rutinowski, Jérôme, Sven Franke, Jan Endendyk, Ina Dormuth, Moritz Roidl, and Markus Pauly. "The Self-Perception and Political Biases of ChatGPT." Human Behavior and Emerging Technologies, 2024. https://doi.org/10.1155/2024/7115633.
Saravia, Elvis. Prompt Engineering Guide. GitHub repository. Last modified December 2022. Accessed September 19, 2024. https://github.com/dair-ai/Prompt-Engineering-Guide.
Shen, Xinyue, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "‘Do Anything Now’: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models." arXiv preprint, last revised May 15, 2024. https://doi.org/10.48550/arXiv.2308.03825.
Shi, Weijia, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. "Detecting Pretraining Data from Large Language Models." arXiv preprint, last revised March 9, 2024. https://doi.org/10.48550/arXiv.2310.16789.
Shumailov, Ilia, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. "Sponge Examples: Energy-Latency Attacks on Neural Networks." In 2021 IEEE European Symposium on Security and Privacy (EuroS&P), 6–10 September 2021, Vienna, Austria. IEEE, 2021. https://doi.org/10.1109/EuroSP51992.2021.00024.
Sitawarin, Chawin, Charlie Cheng-Jie Ji, Apurv Verma, and Luckyfan-cs. LLM Security & Privacy. GitHub repository. Accessed September 19, 2024. https://github.com/chawins/llm-sp.
Smith, Eric Michael, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. "‘I’m Sorry to Hear That’: Finding New Biases in Language Models with a Holistic Descriptor Dataset." arXiv preprint, last revised October 27, 2022. https://doi.org/10.48550/arXiv.2205.09209.
Srivastava, Aarohi, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, et al. "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models." arXiv preprint, last revised June 12, 2023. https://doi.org/10.48550/arXiv.2206.04615.
Staab, Robin, Mark Vero, Mislav Balunović, and Martin Vechev. "Beyond Memorization: Violating Privacy via Inference with Large Language Models." arXiv preprint, last revised May 6, 2024. https://doi.org/10.48550/arXiv.2310.07298.
Storchan, Victor, Ravin Kumar, Rumman Chowdhury, Seraphina Goldfarb-Tarrant, and Sven Cattell. Generative AI Red Teaming Challenge: Transparency Report. Humane Intelligence, 2024. https://drive.google.com/file/d/1JqpbIP6DNomkb32umLoiEPombK2-0Rc-/view.
Vidgen, Bertie, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, et al. "Introducing v0.5 of the AI Safety Benchmark from MLCommons." arXiv preprint, last revised May 13, 2024. https://doi.org/10.48550/arXiv.2404.12241.
Wang, Boxin, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models." In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23), Article No. 1361, 31232–31339. Published May 30, 2024. https://dl.acm.org/doi/10.5555/3666122.3667483.
Ye, Seonghyeon, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. "FLASK: Fine-Grained Language Model Evaluation Based on Alignment Skill Sets." arXiv preprint, last revised April 14, 2024. https://doi.org/10.48550/arXiv.2307.10928.
Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, and Eric P. Xing. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23), Article No. 2020, 46595–46623. Published May 30, 2024. https://dl.acm.org/doi/10.5555/3666122.3668142.
