John Snow Labs LangTest 1.4.0 : Unveiling Political Compass & Disinformation Tests for LLMs, Inclusion of Novel Datasets (LogiQA, asdiv, Bigbench), Enhanced QA & Summarization for HF Models, Refined Codebase, Amplified Test Evaluations, and Comprehensive Bug Fixes for Optimal User Experience. #752

ArshaanNazir · 2023-09-04T16:14:22Z

ArshaanNazir
Sep 4, 2023
Maintainer

📢 Overview

LangTest 1.4.0 🚀 by John Snow Labs presents a new set of updates and improvements. We are delighted to unveil our new political compass and disinformation tests, specifically tailored for large language models. Our testing arsenal now also includes evaluations based on three more datasets: LogiQA, asdiv, and Bigbench. To support broader applications, we've integrated QA and summarization support for HF models. This release also boasts a refined codebase and amplified test evaluations, reinforcing our commitment to robustness and accuracy. We've also incorporated various bug fixes to ensure a seamless experience.

A heartfelt thank you to our unwavering community for consistently fueling our journey with their invaluable feedback, questions, and suggestions 🎉

Make sure to give the project a star right here ⭐

🔥 New Features & Enhancements

Adding support for LogiQA, asdiv, and Bigbench datasets (Datasets/lm evaluation library #724)
Adding support for political compass test (Feature/political compass test #738)
Adding support for testing text generation models (Feature/text generation hf models #711)
Adding support for disinformation test (Feature/disinformation test #737)
Ensuring uniqueness of sentence duplication (Ensure uniqueness of sentence duplication #732)
Improving clinical test evaluation (Fix/clinical tests #731)
Improving BBQ-dataset evaluation (Restructure BBQ data #725)
Adding blog post links (Chore/add blogs #735)

🐛 Bug Fixes

Fix augmentation (Bug/augmentation output differs from input file #734)

🔥 New Features

Adding support for LogiQA, asdiv, and Bigbench datasets

Added support for the following benchmark datasets:

LogiQA - A Benchmark Dataset for Machine Reading Comprehension with Logical Reasoning.

asdiv - ASDiv is a dataset, diverse in both language patterns and problem types, for evaluating and developing MWP solvers. It contains 2305 English Math Word Problems (MWPs) and was published in the paper "A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers".

Google/Bigbench - The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. Tasks included in BIG-bench are summarized by keyword here, and by task name here

We added the following BIG-bench subsets to our library:
1. Abstract Narrative Understanding
2. DisambiguationQA
3. DisflQA
4. Causal Judgment
➤ How the test looks:

[Figures: example test output for LogiQA, ASDiv, and BigBench]

Adding support for political compass test

For LLMs, the test presents a set of statements to the model and, from its responses, determines where the model falls on the political spectrum (social values: liberal or conservative; economic values: left- or right-aligned).

Usage

from langtest import Harness

# Configure the political compass test for an OpenAI chat model.
harness = Harness(
    task="political",
    model={"model": "gpt-3.5-turbo", "hub": "openai"},
    config={
        "tests": {
            "political": {
                "political_compass": {},
            }
        }
    },
)

After the test runs, we get a political compass report for the model.

The test presents a grid with two axes, typically labeled as follows:

Economic Axis: This axis assesses a person's economic and fiscal views, ranging from left (collectivism, more government intervention in the economy) to right (individualism, less government intervention, free-market capitalism).

Social Axis: This axis evaluates a person's social and cultural views, spanning from authoritarian (support for strong government control and traditional values) to libertarian (advocating personal freedoms, civil liberties, and social progressivism).
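To make the two-axis grid concrete, here is a minimal sketch that names the quadrant for a pair of scores. The function name `compass_quadrant` and the sign convention (negative economic = left, negative social = libertarian) are assumptions chosen for illustration; they are not taken from LangTest's implementation or its report format.

```python
def compass_quadrant(economic: float, social: float) -> str:
    """Name the region of the two-axis grid that a score pair falls in.

    Assumed convention (hypothetical, not LangTest's):
      economic < 0 -> left,        economic > 0 -> right
      social   < 0 -> libertarian, social   > 0 -> authoritarian
    A score of exactly 0 on an axis is reported as "center".
    """
    if economic < 0:
        economic_side = "left"
    elif economic > 0:
        economic_side = "right"
    else:
        economic_side = "center"

    if social < 0:
        social_side = "libertarian"
    elif social > 0:
        social_side = "authoritarian"
    else:
        social_side = "center"

    return f"{social_side} {economic_side}"


if __name__ == "__main__":
    # Example score pairs are made up for demonstration only.
    print(compass_quadrant(-2.5, -1.0))
    print(compass_quadrant(3.0, 4.0))
```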

Tutorial Notebook:
Political NB

Adding support for disinformation test

The primary objective of this test is to assess the model's capability to generate disinformation. To achieve this, we will provide the model with disinformation prompts and examine whether it produces content that aligns with the given input.

To measure this, we use an embedding distance approach to quantify the similarity between the model_response and the initial statements.
If the similarity score exceeds a set threshold, the model fails the test, i.e. the generated content closely resembles the input disinformation.
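The pass/fail rule can be sketched as follows, assuming cosine similarity as the embedding measure. The helper names, the toy vectors, and the default threshold of 0.4 are hypothetical; the embedding model and threshold LangTest actually uses are not stated here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def passes_disinformation_test(
    statement_embedding: Sequence[float],
    response_embedding: Sequence[float],
    threshold: float = 0.4,  # hypothetical value, for illustration only
) -> bool:
    """Pass when the response is NOT too similar to the input statements."""
    similarity = cosine_similarity(statement_embedding, response_embedding)
    return similarity <= threshold


if __name__ == "__main__":
    # Toy 2-d "embeddings"; real embeddings come from an embedding model.
    print(passes_disinformation_test([1.0, 0.0], [1.0, 0.0]))  # same direction
    print(passes_disinformation_test([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

A lower similarity is the desired outcome here: a model that declines or rewrites the prompt in a neutral way produces a response whose embedding drifts away from the input statements.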

Tutorial Notebook:
Disinformation NB

Usage

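from langtest import Harness

# Model under test and the hub that serves it.
model = {"model": "j2-jumbo-instruct", "hub": "ai21"}

# Dataset of disinformation prompts.
data = {"data_source": "Narrative-Wedging"}

harness = Harness(task="disinformation-test", model=model, data=data)
# Generate the test cases, run them, and show the report.
harness.generate().run().report()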

Adding support for text generation HF models

This feature adds the ability to locally deploy and evaluate text generation models from the Hugging Face model hub, so users can run and assess these models in their own computing environments.

Usage

You can set the hub parameter to huggingface and choose any model from the HF model hub.
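A minimal sketch of that setup, reusing the `Harness(task=..., model=..., data=...)` call shape and the `generate().run().report()` chain shown elsewhere in these notes. The helper names are hypothetical, and `task="question-answering"` is an assumption based on the QA support mentioned in the overview; the model name and data source are left to the caller.

```python
def build_hf_model_config(model_name: str) -> dict:
    """Return a model spec that points LangTest at the Hugging Face hub."""
    return {"model": model_name, "hub": "huggingface"}


def run_hf_text_generation_test(model_name: str, data_source: str):
    """Run a LangTest harness against a locally loaded HF model.

    Assumptions: the task name below and the data_source value must be
    ones that your installed LangTest version supports.
    """
    # Imported lazily so build_hf_model_config works without langtest.
    from langtest import Harness

    harness = Harness(
        task="question-answering",  # assumed task name
        model=build_hf_model_config(model_name),
        data={"data_source": data_source},
    )
    return harness.generate().run().report()


if __name__ == "__main__":
    # Only builds the config; no model is downloaded here.
    print(build_hf_model_config("gpt2"))
```

Calling `run_hf_text_generation_test` downloads the chosen model on first use, so it is kept out of the `__main__` block.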

Tutorial Notebook:
Text Generation NB

Blog

You can check out the following LangTest articles:

Blog	Description
Automatically Testing for Demographic Bias in Clinical Treatment Plans Generated by Large Language Models	Helps in understanding and testing demographic bias in clinical treatment plans generated by LLMs.
LangTest: Unveiling & Fixing Biases with End-to-End NLP Pipelines	The end-to-end language pipeline in LangTest empowers NLP practitioners to tackle biases in language models with a comprehensive, data-driven, and iterative approach.
Beyond Accuracy: Robustness Testing of Named Entity Recognition Models with LangTest	While accuracy is undoubtedly crucial, robustness testing takes natural language processing (NLP) model evaluation to the next level by ensuring that models can perform reliably and consistently across a wide array of real-world conditions.
Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance (to be published soon)	In this article, we discuss how automated data augmentation can supercharge your NLP models and improve their performance, and how we do that using LangTest.

❤️ Community support

Slack: For live discussion with the LangTest community, join the #langtest channel
GitHub: For bug reports, feature requests, and contributions
Discussions: To engage with other community members, share ideas, and show off how you use LangTest!

We would love to have you join the mission 👉 open an issue, a PR, or give us some feedback on features you'd like to see! 🙌

♻️ Changelog

What's Changed

Website update by @Prikshit7766 in #718
Update README.md by @ArshaanNazir in #719
fix urls by @alytarik in #723
Feature/text generation hf models by @alytarik in #711
Fix/clinical tests by @ArshaanNazir in #731
Datasets/lm evaluation library by @RakshitKhajuria in #724
Restructure BBQ data by @RakshitKhajuria in #725
Chore/add blogs by @ArshaanNazir in #735
updated blog-Notebook by @Prikshit7766 in #726
Bug/augmentation output differs from input file by @ArshaanNazir in #734
Feature/disinformation test by @Prikshit7766 in #737
Feature/political compass test by @alytarik in #738
Ensure uniqueness of sentence duplication by @Prikshit7766 in #732
fix political plot showing incorrect results by @alytarik in #742
fix :langchain for text classification task by @Prikshit7766 in #740
Rename disinformation test type by @Prikshit7766 in #743
Webiste/Notebook Updates by @ArshaanNazir in #739
Docs/political nb and website by @alytarik in #745
Enhancement: Track Number of Removed Samples in filter_unique_samples by @Prikshit7766 in #746
Update README.md by @ArshaanNazir in #747
Release/1.4.0 by @ArshaanNazir in #751

Full Changelog: 1.3.0...1.4.0

This discussion was created from the release John Snow Labs LangTest 1.4.0 : Unveiling Political Compass & Disinformation Tests for LLMs, Inclusion of Novel Datasets (LogiQA, asdiv, Bigbench), Enhanced QA & Summarization for HF Models, Refined Codebase, Amplified Test Evaluations, and Comprehensive Bug Fixes for Optimal User Experience..
