From 0d9a0dcc6f02abf577f1c6ab402c74eb8e98947e Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Mon, 9 Sep 2024 17:47:12 -0500 Subject: [PATCH 01/39] Name correction for Ranran Haoran Zhang, closes #3292. --- data/xml/2020.findings.xml | 2 +- data/xml/2021.naacl.xml | 2 +- data/xml/2023.eacl.xml | 2 +- data/xml/2023.emnlp.xml | 2 +- data/xml/2023.findings.xml | 2 +- data/yaml/name_variants.yaml | 2 ++ 6 files changed, 7 insertions(+), 5 deletions(-) diff --git a/data/xml/2020.findings.xml b/data/xml/2020.findings.xml index bc2a296f9a..4b4fb7e83a 100644 --- a/data/xml/2020.findings.xml +++ b/data/xml/2020.findings.xml @@ -330,7 +330,7 @@ Minimize Exposure Bias of <fixed-case>S</fixed-case>eq2<fixed-case>S</fixed-case>eq Models in Joint Entity and Relation Extraction - Ranran HaoranZhang + Ranran HaoranZhang QianyingLiu Aysa XuemoFan HengJi diff --git a/data/xml/2021.naacl.xml b/data/xml/2021.naacl.xml index afb238d89b..9eb131274b 100644 --- a/data/xml/2021.naacl.xml +++ b/data/xml/2021.naacl.xml @@ -7496,7 +7496,7 @@ JiaweiMa JingxuanTu YingLin - Ranran HaoranZhang + Ranran HaoranZhang WeiliLiu AabhasChauhan YingjunGuan diff --git a/data/xml/2023.eacl.xml b/data/xml/2023.eacl.xml index 1f8185590d..ef68a7341f 100644 --- a/data/xml/2023.eacl.xml +++ b/data/xml/2023.eacl.xml @@ -1920,7 +1920,7 @@ <fixed-case>C</fixed-case>on<fixed-case>E</fixed-case>ntail: An Entailment-based Framework for Universal Zero and Few Shot Classification with Supervised Contrastive Pretraining - Ranran HaoranZhangThe Pennsylvania State University + Ranran HaoranZhangThe Pennsylvania State University Aysa XuemoFanUniversity of Illinois at Urbana-Champaign RuiZhangPenn State University 1941-1953 diff --git a/data/xml/2023.emnlp.xml b/data/xml/2023.emnlp.xml index 81d58b48c8..3e425af594 100644 --- a/data/xml/2023.emnlp.xml +++ b/data/xml/2023.emnlp.xml @@ -6067,7 +6067,7 @@ Unified Low-Resource Sequence Labeling by Sample-Aware Dynamic Sparse Finetuning Sarkar Snigdha SarathiDas - 
HaoranZhang + Ranran HaoranZhang PengShi WenpengYin RuiZhang diff --git a/data/xml/2023.findings.xml b/data/xml/2023.findings.xml index 53f5e39640..ef19af444f 100644 --- a/data/xml/2023.findings.xml +++ b/data/xml/2023.findings.xml @@ -21079,7 +21079,7 @@ Exploring the Potential of Large Language Models in Generating Code-Tracing Questions for Introductory Programming Courses AysaFan - HaoranZhang + Ranran HaoranZhang LucPaquette RuiZhang 7406-7421 diff --git a/data/yaml/name_variants.yaml b/data/yaml/name_variants.yaml index ff118b613c..f5ad0712e9 100644 --- a/data/yaml/name_variants.yaml +++ b/data/yaml/name_variants.yaml @@ -10637,3 +10637,5 @@ - canonical: {first: Genta Indra, last: Winata} variants: - {first: Genta, last: Winata} +- canonical: {first: Ranran Haoran, last: Zhang} + id: ranran-haoran-zhang From 3663c56ed82afed742d38fda04591aa7e37a7699 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Mon, 9 Sep 2024 20:14:59 -0500 Subject: [PATCH 02/39] Paper Revision{2024.acl-long.387}, closes #3839. --- data/xml/2024.acl.xml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index e0a20a98fa..8e2ef425dc 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -4973,10 +4973,11 @@ ZiyuYaoGeorge Mason University 7174-7193 Large language models (LLMs) have shown strong arithmetic reasoning capabilities when prompted with Chain-of-Thought (CoT) prompts. However, we have only a limited understanding of how they are processed by LLMs. To demystify it, prior work has primarily focused on ablating different components in the CoT prompt and empirically observing their resulting LLM performance change. Yet, the reason why these components are important to LLM reasoning is not explored. To fill this gap, in this work, we investigate “neuron activation” as a lens to provide a unified explanation to observations made by prior work. 
Specifically, we look into neurons within the feed-forward layers of LLMs that may have activated their arithmetic reasoning capabilities, using Llama2 as an example. To facilitate this investigation, we also propose an approach based on GPT-4 to automatically identify neurons that imply arithmetic reasoning. Our analyses revealed that the activation of reasoning neurons in the feed-forward layers of an LLM can explain the importance of various components in a CoT prompt, and future research can extend it for a more complete understanding. - 2024.acl-long.387 + 2024.acl-long.387 rai-yao-2024-investigation Minor updates. + Minor updates. Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling From e1c30cf24c9ae333b6cd9f94d6603f96b45ef363 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Mon, 9 Sep 2024 20:16:39 -0500 Subject: [PATCH 03/39] Paper Revision{2024.acl-long.3}, closes #3843. --- data/xml/2024.acl.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index 8e2ef425dc..1e19a5e108 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -57,8 +57,10 @@ YueZhangWestlake University 36-53 Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective deepfake text detection to mitigate risks like the spread of fake news and plagiarism. Existing research has been constrained by evaluating detection methods on specific domains or particular language models. In practical scenarios, however, the detector faces texts from various domains or LLMs without knowing their sources. To this end, we build a comprehensive testbed by gathering texts from diverse human writings and deepfake texts generated by different LLMs. Empirical results on mainstream detection methods demonstrate the difficulties associated with detecting deepfake text in a wide-ranging testbed, particularly in out-of-distribution scenarios.
Such difficulties align with the diminishing linguistic differences between the two text sources. Despite challenges, the top-performing detector can identify 84.12% out-of-domain texts generated by a new LLM, indicating the feasibility for application scenarios. - 2024.acl-long.3 + 2024.acl-long.3 li-etal-2024-mage + + Minor updates. <fixed-case>P</fixed-case>riv<fixed-case>LM</fixed-case>-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models From 6bb4e667edb2c327d434e27367c238d3c14cdcb1 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Mon, 9 Sep 2024 20:17:56 -0500 Subject: [PATCH 04/39] Paper Revision{2024.nlp4convai-1.5}, closes #3852. --- data/xml/2024.nlp4convai.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2024.nlp4convai.xml b/data/xml/2024.nlp4convai.xml index 29b9ef42e0..aa9cbefa8b 100644 --- a/data/xml/2024.nlp4convai.xml +++ b/data/xml/2024.nlp4convai.xml @@ -77,8 +77,10 @@ FlorianMatthesTechnische Universität München 73-88 Conversational search systems enable information retrieval via natural language interactions, with the goal of maximizing users’ information gain over multiple dialogue turns. The increasing prevalence of conversational interfaces adopting this search paradigm challenges traditional information retrieval approaches, stressing the importance of better understanding the engineering process of developing these systems. We undertook a systematic literature review to investigate the links between theoretical studies and technical implementations of conversational search systems. Our review identifies real-world application scenarios, system architectures, and functional components. We consolidate our results by presenting a layered architecture framework and explaining the core functions of conversational search systems. 
Furthermore, we reflect on our findings in light of the rapid progress in large language models, discussing their capabilities, limitations, and directions for future research. - 2024.nlp4convai-1.5 + 2024.nlp4convai-1.5 schneider-etal-2024-engineering + + This revision corrects the page numbering. Efficient Dynamic Hard Negative Sampling for Dialogue Selection From 69bda12f3c14f8f599cf2032ed045e7d582e4ec4 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Mon, 9 Sep 2024 20:20:26 -0500 Subject: [PATCH 05/39] Paper Revision: {2024.acl-long.233}, closes #3856. --- data/xml/2024.acl.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index 1e19a5e108..15363b8ffa 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -3028,8 +3028,10 @@ JianguoLiAnt Group 4247-4262 Self-attention and position embedding are two crucial modules in transformer-based Large Language Models (LLMs). However, the potential relationship between them is far from well studied, especially for long context window extending. In fact, anomalous behaviors that hinder long context extrapolation exist between Rotary Position Embedding (RoPE) and vanilla self-attention.Incorrect initial angles between Q and K can cause misestimation in modeling rotary position embedding of the closest tokens.To address this issue, we propose \textbf{Co}llinear \textbf{C}onstrained \textbf{A}ttention mechanism, namely CoCA. Specifically, we enforce a collinear constraint between Q and K to seamlessly integrate RoPE and self-attention.While only adding minimal computational and spatial complexity, this integration significantly enhances long context window extrapolation ability. We provide an optimized implementation, making it a drop-in replacement for any existing transformer-based models.Extensive experiments demonstrate that CoCA excels in extending context windows. 
A CoCA-based GPT model, trained with a context length of 512, can extend the context window up to 32K (60\times) without any fine-tuning.Additionally, incorporating CoCA into LLaMA-7B achieves extrapolation up to 32K within a training length of only 2K.Our code is publicly available at: https://github.com/codefuse-ai/Collinear-Constrained-Attention - 2024.acl-long.233 + 2024.acl-long.233 zhu-etal-2024-coca + + The author's affiliation changed. <fixed-case>I</fixed-case>nfo<fixed-case>L</fixed-case>oss<fixed-case>QA</fixed-case>: Characterizing and Recovering Information Loss in Text Simplification From 22cd9cb4bb5d86b48fbc971d2b693a9435dc71df Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 10 Sep 2024 06:22:41 -0500 Subject: [PATCH 06/39] Paper Metadata: 2024.arabicnlp-1.47, closes #3857. --- data/xml/2024.arabicnlp.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.arabicnlp.xml b/data/xml/2024.arabicnlp.xml index 6460f59db7..ca9f0e6ffd 100644 --- a/data/xml/2024.arabicnlp.xml +++ b/data/xml/2024.arabicnlp.xml @@ -557,7 +557,7 @@ Mela at <fixed-case>A</fixed-case>r<fixed-case>AIE</fixed-case>val Shared Task: Propagandistic Techniques Detection in <fixed-case>A</fixed-case>rabic with a Multilingual Approach - MdRiyadh + Md Abdur RazzaqRiyadh SaraNabhani 478-482 This paper presents our system submitted for Task 1 of the ArAIEval Shared Task on Unimodal (Text) Propagandistic Technique Detection in Arabic. Task 1 involves identifying all employed propaganda techniques in a given text from a set of possible techniques or detecting that no propaganda technique is present. Additionally, the task requires identifying the specific spans of text where these techniques occur. We explored the capabilities of a multilingual BERT model for this task, focusing on the effectiveness of using outputs from different hidden layers within the model. 
By fine-tuning the multilingual BERT, we aimed to improve the model’s ability to recognize and locate various propaganda techniques. Our experiments showed that leveraging the hidden layers of the BERT model enhanced detection performance. Our system achieved competitive results, ranking second in the shared task, demonstrating that multilingual BERT models, combined with outputs from hidden layers, can effectively detect and identify spans of propaganda techniques in Arabic text. From 275c77d81c37675f9c38cac23253c2a9baf8e819 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Tue, 10 Sep 2024 09:40:48 -0400 Subject: [PATCH 07/39] Remove author name merge (#3292) --- data/xml/2020.findings.xml | 2 +- data/xml/2021.naacl.xml | 2 +- data/xml/2023.eacl.xml | 2 +- data/xml/2023.emnlp.xml | 2 +- data/xml/2023.findings.xml | 2 +- data/yaml/name_variants.yaml | 5 ----- 6 files changed, 5 insertions(+), 10 deletions(-) diff --git a/data/xml/2020.findings.xml b/data/xml/2020.findings.xml index 4b4fb7e83a..bc2a296f9a 100644 --- a/data/xml/2020.findings.xml +++ b/data/xml/2020.findings.xml @@ -330,7 +330,7 @@ Minimize Exposure Bias of <fixed-case>S</fixed-case>eq2<fixed-case>S</fixed-case>eq Models in Joint Entity and Relation Extraction - Ranran HaoranZhang + Ranran HaoranZhang QianyingLiu Aysa XuemoFan HengJi diff --git a/data/xml/2021.naacl.xml b/data/xml/2021.naacl.xml index 9eb131274b..afb238d89b 100644 --- a/data/xml/2021.naacl.xml +++ b/data/xml/2021.naacl.xml @@ -7496,7 +7496,7 @@ JiaweiMa JingxuanTu YingLin - Ranran HaoranZhang + Ranran HaoranZhang WeiliLiu AabhasChauhan YingjunGuan diff --git a/data/xml/2023.eacl.xml b/data/xml/2023.eacl.xml index ef68a7341f..1f8185590d 100644 --- a/data/xml/2023.eacl.xml +++ b/data/xml/2023.eacl.xml @@ -1920,7 +1920,7 @@ <fixed-case>C</fixed-case>on<fixed-case>E</fixed-case>ntail: An Entailment-based Framework for Universal Zero and Few Shot Classification with Supervised Contrastive Pretraining - Ranran HaoranZhangThe Pennsylvania 
State University + Ranran HaoranZhangThe Pennsylvania State University Aysa XuemoFanUniversity of Illinois at Urbana-Champaign RuiZhangPenn State University 1941-1953 diff --git a/data/xml/2023.emnlp.xml b/data/xml/2023.emnlp.xml index 3e425af594..4743c3f697 100644 --- a/data/xml/2023.emnlp.xml +++ b/data/xml/2023.emnlp.xml @@ -6067,7 +6067,7 @@ Unified Low-Resource Sequence Labeling by Sample-Aware Dynamic Sparse Finetuning Sarkar Snigdha SarathiDas - Ranran HaoranZhang + Ranran HaoranZhang PengShi WenpengYin RuiZhang diff --git a/data/xml/2023.findings.xml b/data/xml/2023.findings.xml index ef19af444f..4296a51471 100644 --- a/data/xml/2023.findings.xml +++ b/data/xml/2023.findings.xml @@ -21079,7 +21079,7 @@ Exploring the Potential of Large Language Models in Generating Code-Tracing Questions for Introductory Programming Courses AysaFan - Ranran HaoranZhang + Ranran HaoranZhang LucPaquette RuiZhang 7406-7421 diff --git a/data/yaml/name_variants.yaml b/data/yaml/name_variants.yaml index f5ad0712e9..0d6f09d7ef 100644 --- a/data/yaml/name_variants.yaml +++ b/data/yaml/name_variants.yaml @@ -10557,9 +10557,6 @@ - canonical: {first: Zhicheng, last: Guo} comment: xidian id: zhicheng-guo-xidian -- canonical: {first: Ranran Haoran, last: Zhang} - variants: - - {first: Haoran, last: Zhang} - canonical: {first: Michael, last: Schlichtkrull} variants: - {first: Michael Sejr, last: Schlichtkrull} @@ -10637,5 +10634,3 @@ - canonical: {first: Genta Indra, last: Winata} variants: - {first: Genta, last: Winata} -- canonical: {first: Ranran Haoran, last: Zhang} - id: ranran-haoran-zhang From b1ba90f3cfe4b46bb74bdb58eaf4550fe4e99c1d Mon Sep 17 00:00:00 2001 From: Matt Post Date: Tue, 10 Sep 2024 09:48:05 -0400 Subject: [PATCH 08/39] Revert "Remove author name merge (#3292)" This reverts commit 275c77d81c37675f9c38cac23253c2a9baf8e819. 
--- data/xml/2020.findings.xml | 2 +- data/xml/2021.naacl.xml | 2 +- data/xml/2023.eacl.xml | 2 +- data/xml/2023.emnlp.xml | 2 +- data/xml/2023.findings.xml | 2 +- data/yaml/name_variants.yaml | 5 +++++ 6 files changed, 10 insertions(+), 5 deletions(-) diff --git a/data/xml/2020.findings.xml b/data/xml/2020.findings.xml index bc2a296f9a..4b4fb7e83a 100644 --- a/data/xml/2020.findings.xml +++ b/data/xml/2020.findings.xml @@ -330,7 +330,7 @@ Minimize Exposure Bias of <fixed-case>S</fixed-case>eq2<fixed-case>S</fixed-case>eq Models in Joint Entity and Relation Extraction - Ranran HaoranZhang + Ranran HaoranZhang QianyingLiu Aysa XuemoFan HengJi diff --git a/data/xml/2021.naacl.xml b/data/xml/2021.naacl.xml index afb238d89b..9eb131274b 100644 --- a/data/xml/2021.naacl.xml +++ b/data/xml/2021.naacl.xml @@ -7496,7 +7496,7 @@ JiaweiMa JingxuanTu YingLin - Ranran HaoranZhang + Ranran HaoranZhang WeiliLiu AabhasChauhan YingjunGuan diff --git a/data/xml/2023.eacl.xml b/data/xml/2023.eacl.xml index 1f8185590d..ef68a7341f 100644 --- a/data/xml/2023.eacl.xml +++ b/data/xml/2023.eacl.xml @@ -1920,7 +1920,7 @@ <fixed-case>C</fixed-case>on<fixed-case>E</fixed-case>ntail: An Entailment-based Framework for Universal Zero and Few Shot Classification with Supervised Contrastive Pretraining - Ranran HaoranZhangThe Pennsylvania State University + Ranran HaoranZhangThe Pennsylvania State University Aysa XuemoFanUniversity of Illinois at Urbana-Champaign RuiZhangPenn State University 1941-1953 diff --git a/data/xml/2023.emnlp.xml b/data/xml/2023.emnlp.xml index 4743c3f697..3e425af594 100644 --- a/data/xml/2023.emnlp.xml +++ b/data/xml/2023.emnlp.xml @@ -6067,7 +6067,7 @@ Unified Low-Resource Sequence Labeling by Sample-Aware Dynamic Sparse Finetuning Sarkar Snigdha SarathiDas - Ranran HaoranZhang + Ranran HaoranZhang PengShi WenpengYin RuiZhang diff --git a/data/xml/2023.findings.xml b/data/xml/2023.findings.xml index 4296a51471..ef19af444f 100644 --- a/data/xml/2023.findings.xml +++ 
b/data/xml/2023.findings.xml @@ -21079,7 +21079,7 @@ Exploring the Potential of Large Language Models in Generating Code-Tracing Questions for Introductory Programming Courses AysaFan - Ranran HaoranZhang + Ranran HaoranZhang LucPaquette RuiZhang 7406-7421 diff --git a/data/yaml/name_variants.yaml b/data/yaml/name_variants.yaml index 0d6f09d7ef..f5ad0712e9 100644 --- a/data/yaml/name_variants.yaml +++ b/data/yaml/name_variants.yaml @@ -10557,6 +10557,9 @@ - canonical: {first: Zhicheng, last: Guo} comment: xidian id: zhicheng-guo-xidian +- canonical: {first: Ranran Haoran, last: Zhang} + variants: + - {first: Haoran, last: Zhang} - canonical: {first: Michael, last: Schlichtkrull} variants: - {first: Michael Sejr, last: Schlichtkrull} @@ -10634,3 +10637,5 @@ - canonical: {first: Genta Indra, last: Winata} variants: - {first: Genta, last: Winata} +- canonical: {first: Ranran Haoran, last: Zhang} + id: ranran-haoran-zhang From 8ee40075d79f18efd8f0ad18649a81e6d1a26a16 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Tue, 10 Sep 2024 09:53:14 -0400 Subject: [PATCH 09/39] Fixed duplicate key --- data/yaml/name_variants.yaml | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/data/yaml/name_variants.yaml b/data/yaml/name_variants.yaml index f5ad0712e9..93342e530e 100644 --- a/data/yaml/name_variants.yaml +++ b/data/yaml/name_variants.yaml @@ -10558,8 +10558,8 @@ comment: xidian id: zhicheng-guo-xidian - canonical: {first: Ranran Haoran, last: Zhang} - variants: - - {first: Haoran, last: Zhang} + comment: Penn State University + id: ranran-haoran-zhang - canonical: {first: Michael, last: Schlichtkrull} variants: - {first: Michael Sejr, last: Schlichtkrull} @@ -10637,5 +10637,3 @@ - canonical: {first: Genta Indra, last: Winata} variants: - {first: Genta, last: Winata} -- canonical: {first: Ranran Haoran, last: Zhang} - id: ranran-haoran-zhang From 7d794607df51f66b5e304a8f80dfec6a3e40f80d Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Wed, 11 Sep 2024 
17:00:27 -0500 Subject: [PATCH 10/39] Paper Metadata: 2024.propor-1.31, closes #3861. --- data/xml/2024.propor.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.propor.xml b/data/xml/2024.propor.xml index 1e3759f240..90fd47bc39 100644 --- a/data/xml/2024.propor.xml +++ b/data/xml/2024.propor.xml @@ -322,7 +322,7 @@ Exploring <fixed-case>P</fixed-case>ortuguese Hate Speech Detection in Low-Resource Settings: Lightly Tuning Encoder Models or In-Context Learning of Large Models? GabrielAssis AnnieAmorim - JonnatahnCarvalho + JonnathanCarvalho Danielde Oliveira DanielaVianna AlinePaes From 57f82fb5386f234c8bbe04a0e4864632c158a091 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Wed, 11 Sep 2024 17:19:00 -0500 Subject: [PATCH 11/39] Paper Metadata: {2024.starsem-1.30}, closes #3864. --- data/xml/2024.starsem.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data/xml/2024.starsem.xml b/data/xml/2024.starsem.xml index 9872757673..2d25741ccc 100644 --- a/data/xml/2024.starsem.xml +++ b/data/xml/2024.starsem.xml @@ -355,10 +355,10 @@ A Trip Towards Fairness: Bias and De-Biasing in Large Language Models LeonardoRanaldiUniversity of Rome Tor Vergata and Idiap Research Institute - ElenaRuzzettiUniversity of Rome Tor Vergata + Elena SofiaRuzzettiUniversity of Rome Tor Vergata DavideVendittiUniversity of Rome Tor Vergata DarioOnoratiSapienza University of Rome - FabioZanzottoUniversity of Rome Tor Vergata + Fabio MassimoZanzottoUniversity of Rome Tor Vergata 372-384 Cheap-to-Build Very Large-Language Models (CtB-LLMs) with affordable training are emerging as the next big revolution in natural language processing and understanding. These CtB-LLMs are democratizing access to trainable Very Large-Language Models (VLLMs) and, thus, may represent the building blocks of many NLP systems solving downstream tasks. Hence, a little or a large bias in CtB-LLMs may cause huge harm. 
In this paper, we performed a large investigation of the bias of three families of CtB-LLMs, and we showed that debiasing techniques are effective and usable. Indeed, according to current tests, the LLaMA and the OPT families have an important bias in gender, race, religion, and profession. In contrast to the analysis for other LMMs, we discovered that bias depends not on the number of parameters but on the perplexity. Finally, the debiasing of OPT using LORA reduces bias up to 4.12 points in the normalized stereotype score. 2024.starsem-1.30 From b607f5dfe44ee939f6fb12e42da49afc1c208ac9 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Wed, 11 Sep 2024 17:30:30 -0500 Subject: [PATCH 12/39] Paper Metadata: 2024.findings-acl.847, closes #3869. --- data/xml/2024.findings.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.findings.xml b/data/xml/2024.findings.xml index 52a810e049..901e4ddb44 100644 --- a/data/xml/2024.findings.xml +++ b/data/xml/2024.findings.xml @@ -16598,7 +16598,7 @@ Pushing the Limits of Zero-shot End-to-End Speech Translation IoannisTsiamasApple and Universidad Politécnica de Cataluna - GerardGállegoUniversidad Politécnica de Cataluna + Gerard I.GállegoUniversidad Politécnica de Cataluna JoséFonollosaUniversitat Politècnica de Catalunya MartaCosta-jussàMeta 14245-14267 From 8d9cfd7f95687e727f5040d1b547f4b778842db2 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 13:41:31 -0500 Subject: [PATCH 13/39] Paper Revision: {2024.acl-long.693}, closes #3875. 
--- data/xml/2024.acl.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index ef3eb8c793..f20229ebe1 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -8993,8 +8993,10 @@ ArmanCohanYale University and Allen Institute for Artificial Intelligence 12841-12858 We introduce KnowledgeFMath, a novel benchmark designed to evaluate LLMs’ capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, KnowledgeFMath includes 1,259 problems with a hybrid of textual and tabular content. These problems require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. We also construct a finance-domain knowledge bank and investigate various knowledge integration strategies. Finally, we evaluate a wide spectrum of 26 LLMs with different prompting strategies like Chain-of-Thought and Program-of-Thought. Our experimental results reveal that the current best-performing system (i.e., GPT-4 with CoT prompting) achieves only 56.6% accuracy, leaving substantial room for improvement. Moreover, while augmenting LLMs with external knowledge can improve their performance (e.g., from 33.5% to 47.1% for GPT-3.5), their accuracy remains significantly lower than the estimated human expert performance of 92%. We believe that KnowledgeFMath can advance future research in the area of domain-specific knowledge retrieval and integration, particularly within the context of solving math reasoning problems. - 2024.acl-long.693 + 2024.acl-long.693 zhao-etal-2024-knowledgefmath + + Revised the dataset name. 
<fixed-case>API</fixed-case>-<fixed-case>BLEND</fixed-case>: A Comprehensive Corpora for Training and Benchmarking <fixed-case>API</fixed-case> <fixed-case>LLM</fixed-case>s From bb69dcf083ec99fbd1d2ca3d17d2db0a684bcd74 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 13:43:32 -0500 Subject: [PATCH 14/39] Paper Revision: {2024.acl-long.852}, closes #3879. --- data/xml/2024.acl.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index f20229ebe1..8e53e8ad6d 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -11145,8 +11145,10 @@ ArmanCohanYale University 16103-16120 Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables. We evaluate a wide spectrum of 27 LLMs, including those specialized in math, coding and finance, with Chain-of-Thought and Program-of-Thought prompting methods. We found that even the current best-performing system (i.e., GPT-4) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe DocMath-Eval can be used as a valuable benchmark to evaluate LLMs’ capabilities to solve challenging numerical reasoning problems in expert domains. - 2024.acl-long.852 + 2024.acl-long.852 zhao-etal-2024-docmath + + Included experimental results. 
Unintended Impacts of <fixed-case>LLM</fixed-case> Alignment on Global Representation From c45051cdcb441a3d52e7bb29b20e6425484e1b5a Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 13:44:58 -0500 Subject: [PATCH 15/39] Paper Revision{2024.findings-acl.354}, closes #3881. --- data/xml/2024.findings.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2024.findings.xml b/data/xml/2024.findings.xml index 901e4ddb44..2c69d3a85b 100644 --- a/data/xml/2024.findings.xml +++ b/data/xml/2024.findings.xml @@ -10422,8 +10422,10 @@ BenoitCrabbéUniversité de Paris 5935-5947 We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Our dataset provides examples that include a clarified intent, code snippets associated, and an average of three related unit tests. It encompasses a range of libraries such as Pandas, Numpy, and Regex, along with more than 70 standard libraries in Python code derived from Stack Overflow. Comprising 3,402 crafted examples by Python experts, our dataset is designed for both model finetuning and standalone evaluation. To complete unit tests evaluation, we categorize examples in order to get more fine grained analysis, enhancing the understanding of models’ strengths and weaknesses in specific coding tasks. The examples have been refined to reduce data contamination, a process confirmed by the performance of three leading models: Mistral 7B, CodeLLAMA 13B, and Starcoder 15B. We further investigate data-contamination testing GPT-4 performance on a part of our dataset. The benchmark can be accessed at anonymized address. - 2024.findings-acl.354 + 2024.findings-acl.354 beau-crabbe-2024-codeinsight + + Minor updates. 
<fixed-case>V</fixed-case>i<fixed-case>H</fixed-case>ate<fixed-case>T</fixed-case>5: Enhancing Hate Speech Detection in <fixed-case>V</fixed-case>ietnamese With a Unified Text-to-Text Transformer Model From 536512fdc51ee41a2fd1387f840db92a01c4e16d Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 13:47:36 -0500 Subject: [PATCH 16/39] Paper Revision{2023.findings-acl.38}, closes #3885. --- data/xml/2023.findings.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2023.findings.xml b/data/xml/2023.findings.xml index ef19af444f..c4ea33d98c 100644 --- a/data/xml/2023.findings.xml +++ b/data/xml/2023.findings.xml @@ -3166,10 +3166,12 @@ RyanCotterellETH Zürich 598-614 Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method.BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a 1/sigma*(1-e(-sigma))-approximation of an optimal merge sequence, where sigma is the total backward curvature with respect to the optimal merge sequence. Empirically the lower bound of the approximation is approx0.37.We provide a faster implementation of BPE which improves the runtime complexity from O(NM) to O(N log M), where N is the sequence length and M is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization. - 2023.findings-acl.38 + 2023.findings-acl.38 zouhar-etal-2023-formal 10.18653/v1/2023.findings-acl.38 Automatic Named Entity Obfuscation in Speech From 520683f08cae3abbd2156c80fe6980429567a033 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 13:49:20 -0500 Subject: [PATCH 17/39] Paper Revision{2021.findings-emnlp.96}, closes #3888. 
--- data/xml/2021.findings.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2021.findings.xml b/data/xml/2021.findings.xml index d4652661ea..9890bc3539 100644 --- a/data/xml/2021.findings.xml +++ b/data/xml/2021.findings.xml @@ -7958,13 +7958,15 @@ Albert Y.S.Lam 1114–1120 This paper investigates the effectiveness of pre-training for few-shot intent classification. While existing paradigms commonly further pre-train language models such as BERT on a vast amount of unlabeled corpus, we find it highly effective and efficient to simply fine-tune BERT with a small set of labeled utterances from public datasets. Specifically, fine-tuning BERT with roughly 1,000 labeled data yields a pre-trained model – IntentBERT, which can easily surpass the performance of existing pre-trained models for few-shot intent classification on novel domains with very different semantics. The high effectiveness of IntentBERT confirms the feasibility and practicality of few-shot intent detection, and its high generalization ability across different domains suggests that intent classification tasks may share a similar underlying structure, which can be efficiently learned from a small set of labeled data. The source code can be found at https://github.com/hdzhang-code/IntentBERT. - 2021.findings-emnlp.96 + 2021.findings-emnlp.96 zhang-etal-2021-effectiveness-pre 10.18653/v1/2021.findings-emnlp.96 Improving Abstractive Dialogue Summarization with Hierarchical Pretraining and Topic Segment From 0015196169bc4381913bd96a1621b03284414cda Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 13:52:18 -0500 Subject: [PATCH 18/39] Paper Revision{2022.naacl-main.39}, closes #3890. 
--- data/xml/2022.naacl.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2022.naacl.xml b/data/xml/2022.naacl.xml index c729852f14..ae7b509cf4 100644 --- a/data/xml/2022.naacl.xml +++ b/data/xml/2022.naacl.xml @@ -621,7 +621,7 @@ AlbertLam 532-542 It is challenging to train a good intent classifier for a task-oriented dialogue system with only a few annotations. Recent studies have shown that fine-tuning pre-trained language models with a small set of labeled utterances from public benchmarks in a supervised manner is extremely helpful. However, we find that supervised pre-training yields an anisotropic feature space, which may suppress the expressive power of the semantic representations. Inspired by recent research in isotropization, we propose to improve supervised pre-training by regularizing the feature space towards isotropy. We propose two regularizers based on contrastive learning and correlation matrix respectively, and demonstrate their effectiveness through extensive experiments. Our main finding is that it is promising to regularize supervised pre-training with isotropization to further improve the performance of few-shot intent detection. The source code can be found at https://github.com/fanolabs/isoIntentBert-main. - 2022.naacl-main.39 + 2022.naacl-main.39 2022.naacl-main.39.software.zip zhang-etal-2022-fine 10.18653/v1/2022.naacl-main.39 @@ -630,6 +630,8 @@ BANKING77 HINT3 HWU64 + + Changes the order of the authors. Cross-document Misinformation Detection based on Event Graph Reasoning From d7d9525ac0fb58522ddfd7c8753392a656b0cc9e Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 13:53:43 -0500 Subject: [PATCH 19/39] Paper Revision{2023.findings-acl.706}, closes #3892. 
--- data/xml/2023.findings.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2023.findings.xml b/data/xml/2023.findings.xml index c4ea33d98c..0ca9816ec8 100644 --- a/data/xml/2023.findings.xml +++ b/data/xml/2023.findings.xml @@ -11998,9 +11998,11 @@ Albert Y.S.LamFano Labs 11105-11121 We consider the task of few-shot intent detection, which involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data. The current approach to address this problem is through continual pre-training, i.e., fine-tuning pre-trained language models (PLMs) on external resources (e.g., conversational corpora, public intent detection datasets, or natural language understanding datasets) before using them as utterance encoders for training an intent classifier. In this paper, we show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected. Specifically, we find that directly fine-tuning PLMs on only a handful of labeled examples already yields decent results compared to methods that employ continual pre-training, and the performance gap diminishes rapidly as the number of labeled data increases. To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance. Comprehensive experiments on real-world benchmarks show that given only two or more labeled samples per class, direct fine-tuning outperforms many strong baselines that utilize external data sources for continual pre-training. The code can be found at https://github.com/hdzhang-code/DFTPlus. - 2023.findings-acl.706 + 2023.findings-acl.706 zhang-etal-2023-revisit 10.18653/v1/2023.findings-acl.706 + + Changes the order of the authors. 
Improving Contrastive Learning of Sentence Embeddings from <fixed-case>AI</fixed-case> Feedback From 23221f5a5fd719c7e8e2ff9d1663005b27e35335 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 13:55:49 -0500 Subject: [PATCH 20/39] Paper correction for 2024.bea-1.32, closes #3844. --- data/xml/2024.bea.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2024.bea.xml b/data/xml/2024.bea.xml index 1b0ce81bb3..40af6cf92e 100644 --- a/data/xml/2024.bea.xml +++ b/data/xml/2024.bea.xml @@ -366,8 +366,10 @@ GioraAlexandronWeizmann Institute of Science 391-402 Unsupervised clustering of student responses to open-ended questions into behavioral and cognitive profiles using pre-trained LLM embeddings is an emerging technique, but little is known about how well this captures pedagogically meaningful information. We investigate this in the context of student responses to open-ended questions in biology, which were previously analyzed and clustered by experts into theory-driven Knowledge Profiles (KPs).Comparing these KPs to ones discovered by purely data-driven clustering techniques, we report poor discoverability of most KPs, except for the ones including the correct answers. We trace this ‘discoverability bias’ to the representations of KPs in the pre-trained LLM embeddings space. - 2024.bea-1.32 + 2024.bea-1.32 gurin-schleifer-etal-2024-anna + + Corrected a typo. Assessing Student Explanations with Large Language Models Using Fine-Tuning and Few-Shot Learning From 2fbf0cd9c7e4bcc6a5c917d5a57918b21052681e Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 13:59:17 -0500 Subject: [PATCH 21/39] Paper Metadata: {2023.findings-acl.706}, closes #3891. 
--- data/xml/2023.findings.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data/xml/2023.findings.xml b/data/xml/2023.findings.xml index 0ca9816ec8..fd1cbbd68f 100644 --- a/data/xml/2023.findings.xml +++ b/data/xml/2023.findings.xml @@ -11993,9 +11993,9 @@ Revisit Few-shot Intent Classification with <fixed-case>PLM</fixed-case>s: Direct Fine-tuning vs. Continual Pre-training HaodeZhangThe Hong Kong Polytechnic University HaowenLiangThe Hong Kong Polytechnic University - Li-MingZhanThe Hong Kong Polytechnic University - Xiao-MingWuHong Kong Polytechnic University + LimingZhanThe Hong Kong Polytechnic University Albert Y.S.LamFano Labs + Xiao-MingWuHong Kong Polytechnic University 11105-11121 We consider the task of few-shot intent detection, which involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data. The current approach to address this problem is through continual pre-training, i.e., fine-tuning pre-trained language models (PLMs) on external resources (e.g., conversational corpora, public intent detection datasets, or natural language understanding datasets) before using them as utterance encoders for training an intent classifier. In this paper, we show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected. Specifically, we find that directly fine-tuning PLMs on only a handful of labeled examples already yields decent results compared to methods that employ continual pre-training, and the performance gap diminishes rapidly as the number of labeled data increases. To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance. 
Comprehensive experiments on real-world benchmarks show that given only two or more labeled samples per class, direct fine-tuning outperforms many strong baselines that utilize external data sources for continual pre-training. The code can be found at https://github.com/hdzhang-code/DFTPlus. 2023.findings-acl.706 From 19db664723b7743bfef976e07dbf3cd041b4b98d Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 14:01:14 -0500 Subject: [PATCH 22/39] Paper Metadata: {2022.naacl-main.39}, closes #3889. --- data/xml/2022.naacl.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data/xml/2022.naacl.xml b/data/xml/2022.naacl.xml index ae7b509cf4..6acf3a1f8d 100644 --- a/data/xml/2022.naacl.xml +++ b/data/xml/2022.naacl.xml @@ -615,10 +615,10 @@ HaodeZhang HaowenLiang YuweiZhang - Li-MingZhan - Xiao-MingWu + LimingZhan XiaoleiLu AlbertLam + Xiao-MingWu 532-542 It is challenging to train a good intent classifier for a task-oriented dialogue system with only a few annotations. Recent studies have shown that fine-tuning pre-trained language models with a small set of labeled utterances from public benchmarks in a supervised manner is extremely helpful. However, we find that supervised pre-training yields an anisotropic feature space, which may suppress the expressive power of the semantic representations. Inspired by recent research in isotropization, we propose to improve supervised pre-training by regularizing the feature space towards isotropy. We propose two regularizers based on contrastive learning and correlation matrix respectively, and demonstrate their effectiveness through extensive experiments. Our main finding is that it is promising to regularize supervised pre-training with isotropization to further improve the performance of few-shot intent detection. The source code can be found at https://github.com/fanolabs/isoIntentBert-main. 
2022.naacl-main.39 From 672f252a6ae3e6741138df51122a1d23625b5331 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 14:04:54 -0500 Subject: [PATCH 23/39] Paper Metadata{2021.findings-emnlp.96}, closes #3887. --- data/xml/2021.findings.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2021.findings.xml b/data/xml/2021.findings.xml index 9890bc3539..121747d13b 100644 --- a/data/xml/2021.findings.xml +++ b/data/xml/2021.findings.xml @@ -7954,8 +7954,8 @@ Li-MingZhan JiaxinChen GuangyuanShi - Xiao-MingWu Albert Y.S.Lam + Xiao-MingWu 1114–1120 This paper investigates the effectiveness of pre-training for few-shot intent classification. While existing paradigms commonly further pre-train language models such as BERT on a vast amount of unlabeled corpus, we find it highly effective and efficient to simply fine-tune BERT with a small set of labeled utterances from public datasets. Specifically, fine-tuning BERT with roughly 1,000 labeled data yields a pre-trained model – IntentBERT, which can easily surpass the performance of existing pre-trained models for few-shot intent classification on novel domains with very different semantics. The high effectiveness of IntentBERT confirms the feasibility and practicality of few-shot intent detection, and its high generalization ability across different domains suggests that intent classification tasks may share a similar underlying structure, which can be efficiently learned from a small set of labeled data. The source code can be found at https://github.com/hdzhang-code/IntentBERT. 2021.findings-emnlp.96 From 0aa1ce0a4293e312dfbc7981ee762413b3732773 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 14:08:14 -0500 Subject: [PATCH 24/39] 2024.wassa-1.8 : Swap author 3 and 4 according to the order in the paper, closes #3886. 
--- data/xml/2024.wassa.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.wassa.xml b/data/xml/2024.wassa.xml index ceb7e36daf..1ba88b73ca 100644 --- a/data/xml/2024.wassa.xml +++ b/data/xml/2024.wassa.xml @@ -93,8 +93,8 @@ Entity-Level Sentiment: More than the Sum of Its Parts EgilRønningstad RomanKlingerOtto-Friedrich Universität Bamberg - ErikVelldalUniversity of Oslo LiljaØvrelidDept. of Informatics, University of Oslo + ErikVelldalUniversity of Oslo 84-96 In sentiment analysis of longer texts, there may be a variety of topics discussed, of entities mentioned, and of sentiments expressed regarding each entity. We find a lack of studies exploring how such texts express their sentiment towards each entity of interest, and how these sentiments can be modelled. In order to better understand how sentiment regarding persons and organizations (each entity in our scope) is expressed in longer texts, we have collected a dataset of expert annotations where the overall sentiment regarding each entity is identified, together with the sentence-level sentiment for these entities separately. We show that the reader’s perceived sentiment regarding an entity often differs from an arithmetic aggregation of sentiments at the sentence level. Only 70% of the positive and 55% of the negative entities receive a correct overall sentiment label when we aggregate the (human-annotated) sentiment labels for the sentences where the entity is mentioned. Our dataset reveals the complexity of entity-specific sentiment in longer texts, and allows for more precise modelling and evaluation of such sentiment expressions. 2024.wassa-1.8 From e5776ec9af2d0edd417da7671f8266afdd888929 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 14:11:05 -0500 Subject: [PATCH 25/39] Paper Metadata: {2024.acl-long.852}, closes #3880.
--- data/xml/2024.acl.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index 8e53e8ad6d..cae735d0c4 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -11132,7 +11132,7 @@ jin-etal-2024-mmtom - <fixed-case>D</fixed-case>oc<fixed-case>M</fixed-case>ath-Eval: Evaluating Math Reasoning Capabilities of <fixed-case>LLM</fixed-case>s in Understanding Financial Documents + <fixed-case>D</fixed-case>oc<fixed-case>M</fixed-case>ath-Eval: Evaluating Math Reasoning Capabilities of <fixed-case>LLM</fixed-case>s in Understanding Long and Specialized Documents YilunZhaoYale University YitaoLongNew York University HongjunLiuCollege of Computer Science and Technology, Zhejiang University @@ -11144,7 +11144,7 @@ RuiZhangPennsylvania State University ArmanCohanYale University 16103-16120 - Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables. We evaluate a wide spectrum of 27 LLMs, including those specialized in math, coding and finance, with Chain-of-Thought and Program-of-Thought prompting methods. We found that even the current best-performing system (i.e., GPT-4) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe DocMath-Eval can be used as a valuable benchmark to evaluate LLMs’ capabilities to solve challenging numerical reasoning problems in expert domains. + Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. 
However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing specialized documents containing both text and tables. We conduct an extensive evaluation of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting methods, aiming to comprehensively assess the capabilities and limitations of existing LLMs in DocMath-Eval. We found that even the current best-performing system (i.e., GPT-4o) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe that DocMath-Eval can serve as a valuable benchmark for evaluating LLMs' capabilities in solving challenging numerical reasoning problems within expert domains. 2024.acl-long.852 zhao-etal-2024-docmath From 9cc02a1aa0c847821be302b79ba3096f288ec0ad Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Tue, 17 Sep 2024 14:12:47 -0500 Subject: [PATCH 26/39] Paper Metadata: {2024.acl-long.693}, closes #3876. 
--- data/xml/2024.acl.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index cae735d0c4..b754558313 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -8984,7 +8984,7 @@ zhao-etal-2024-tapera - <fixed-case>K</fixed-case>nowledge<fixed-case>FM</fixed-case>ath: A Knowledge-Intensive Math Reasoning Dataset in Finance Domains + FinanceMATH: Knowledge-Intensive Math Reasoning in Finance Domains YilunZhaoYale University HongjunLiu YitaoLongNew York University @@ -8992,7 +8992,7 @@ ChenZhaoNew York University Shanghai ArmanCohanYale University and Allen Institute for Artificial Intelligence 12841-12858 - We introduce KnowledgeFMath, a novel benchmark designed to evaluate LLMs’ capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, KnowledgeFMath includes 1,259 problems with a hybrid of textual and tabular content. These problems require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. We also construct a finance-domain knowledge bank and investigate various knowledge integration strategies. Finally, we evaluate a wide spectrum of 26 LLMs with different prompting strategies like Chain-of-Thought and Program-of-Thought. Our experimental results reveal that the current best-performing system (i.e., GPT-4 with CoT prompting) achieves only 56.6% accuracy, leaving substantial room for improvement. Moreover, while augmenting LLMs with external knowledge can improve their performance (e.g., from 33.5% to 47.1% for GPT-3.5), their accuracy remains significantly lower than the estimated human expert performance of 92%. 
We believe that KnowledgeFMath can advance future research in the area of domain-specific knowledge retrieval and integration, particularly within the context of solving math reasoning problems. + We introduce FinanceMath, a novel benchmark designed to evaluate LLMs' capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, FinanceMath includes 1,200 problems with a hybrid of textual and tabular content. These problems require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. We also construct a finance-domain knowledge bank and investigate various knowledge integration strategies. Finally, we evaluate a wide spectrum of 44 LLMs with both Chain-of-Thought and Program-of-Thought prompting methods. Our experimental results reveal that the current best-performing system (i.e., GPT-4o) achieves only 60.9% accuracy using CoT prompting, leaving substantial room for improvement. Moreover, while augmenting LLMs with external knowledge can improve model performance (e.g., from 47.5% to 54.5% for Gemini-1.5-Pro), their accuracy remains significantly lower than the estimated human expert performance of 92%. We believe that FinanceMath can advance future research in the area of domain-specific knowledge retrieval and integration, particularly within the context of solving reasoning-intensive tasks. 2024.acl-long.693 zhao-etal-2024-knowledgefmath From 4f52bdcd1b33a738a555e837043a17da0b3e3c97 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Wed, 18 Sep 2024 14:17:39 -0500 Subject: [PATCH 27/39] Paper Revision{2024.acl-long.329}, closes #3896. 
--- data/xml/2024.acl.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index b754558313..5237618275 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -4266,8 +4266,10 @@ AnetteFrankRuprecht-Karls-Universität Heidelberg 6048-6089 Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this work we argue that these faithfulness tests do not measure faithfulness to the models’ inner workings – but rather their self-consistency at output level.Our contributions are three-fold: i) We clarify the status of faithfulness tests in view of model explainability, characterising them as self-consistency tests instead. This assessment we underline by ii) constructing a Comparative Consistency Bank for self-consistency tests that for the first time compares existing tests on a common suite of 11 open LLMs and 5 tasks – including iii) our new self-consistency measure CC-SHAP. CC-SHAP is a fine-grained measure (not a test) of LLM self-consistency. It compares how a model’s input contributes to the predicted answer and to generating the explanation. Our fine-grained CC-SHAP metric allows us iii) to compare LLM behaviour when making predictions and to analyse the effect of other consistency tests at a deeper level, which takes us one step further towards measuring faithfulness by bringing us closer to the internals of the model than strictly surface output-oriented tests. - 2024.acl-long.329 + 2024.acl-long.329 parcalabescu-frank-2024-measuring + + This revision mentions a sponsor in the acknowledgements and fixes the typo in Eq. 4. Learning or Self-aligning? 
Rethinking Instruction Fine-tuning From d9fc9ef906603abb6dc7644018e523f3a89f990a Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Wed, 18 Sep 2024 14:19:15 -0500 Subject: [PATCH 28/39] Paper Revision{2023.acl-long.223}, closes #3895. --- data/xml/2023.acl.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/data/xml/2023.acl.xml b/data/xml/2023.acl.xml index 39a5c32c41..799f88d5fb 100644 --- a/data/xml/2023.acl.xml +++ b/data/xml/2023.acl.xml @@ -3128,10 +3128,12 @@ AnetteFrankHeidelberg University 4032-4059 Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. That a unimodal model achieves similar accuracy on a VL task to a multimodal one, indicates that so-called unimodal collapse occurred. However, accuracy-based tests fail to detect e.g., when the model prediction is wrong, while the model used relevant information from a modality. Instead, we propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values that reliably quantifies in which proportions a multimodal model uses individual modalities. We apply MM-SHAP in two ways: (1) to compare models for their average degree of multimodality, and (2) to measure for individual models the contribution of individual modalities for different tasks and datasets. Experiments with six VL models – LXMERT, CLIP and four ALBEF variants – on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the wide-spread assumption that unimodal collapse is one-sided. Based on our results, we recommend MM-SHAP for analysing multimodal tasks, to diagnose and guide progress towards multimodal integration. Code available at https://github.com/Heidelberg-NLP/MM-SHAP. 
- 2023.acl-long.223 + 2023.acl-long.223 parcalabescu-frank-2023-mm 10.18653/v1/2023.acl-long.223 Towards Boosting the Open-Domain Chatbot with Human Feedback From a38ec953c56e1836e7906b714749eb7c66f64f3d Mon Sep 17 00:00:00 2001 From: Matt Post Date: Mon, 23 Sep 2024 08:28:20 -0400 Subject: [PATCH 29/39] Name correction: Cesar Yoshikawa --- data/xml/2022.deeplo.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2022.deeplo.xml b/data/xml/2022.deeplo.xml index f0ed86c1ea..b847f095d4 100644 --- a/data/xml/2022.deeplo.xml +++ b/data/xml/2022.deeplo.xml @@ -30,7 +30,7 @@ WilliamChen RichardCastro NúriaBel - CesarToshio + CesarYoshikawa RenzoVenturas HilarioAradiel NelsiMelgarejo From c7955ccd57a78535ad7689bb722b3d2b0ca48ea3 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Mon, 23 Sep 2024 09:10:06 -0400 Subject: [PATCH 30/39] =?UTF-8?q?Name=20correction:=20Patr=C3=ADcia=20Ferr?= =?UTF-8?q?eira?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- data/xml/2024.sigdial.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.sigdial.xml b/data/xml/2024.sigdial.xml index 8aecc67d7a..1f2eecb171 100644 --- a/data/xml/2024.sigdial.xml +++ b/data/xml/2024.sigdial.xml @@ -277,7 +277,7 @@ Sentiment-Aware Dialogue Flow Discovery for Interpreting Communication Trends - Patrícia Sofia PereiraFerreira + PatríciaFerreira IsabelCarvalho AnaAlves CatarinaSilva From 2836e050dcc04f4faf6638c6eafdd2b43f6900dc Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Mon, 23 Sep 2024 17:12:52 -0500 Subject: [PATCH 31/39] Paper Metadata: 2024.gebnlp-1.5, closes #3898.
--- data/xml/2024.gebnlp.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.gebnlp.xml b/data/xml/2024.gebnlp.xml index 40fa3caf73..66835b8816 100644 --- a/data/xml/2024.gebnlp.xml +++ b/data/xml/2024.gebnlp.xml @@ -64,7 +64,7 @@ A Fairness Analysis of Human and <fixed-case>AI</fixed-case>-Generated Student Reflection Summaries BhimanBaghelUniversity of Pittsburgh Arun BalajieeLekshmi NarayananUniversity of Pittsburgh - MichaelYoderSchool of Computer Science, Carnegie Mellon University + Michael MillerYoderSchool of Computer Science, Carnegie Mellon University 60-77 This study examines the fairness of human- and AI-generated summaries of student reflections in university STEM classes, focusing on potential gender biases. Using topic modeling, we first identify topics that are more prevalent in reflections from female students and others that are more common among male students. We then analyze whether human and AI-generated summaries reflect the concerns of students of any particular gender over others. Our analysis reveals that though human-generated and extractive AI summarization techniques do not show a clear bias, abstractive AI-generated summaries exhibit a bias towards male students. Pedagogical themes are over-represented from male reflections in these summaries, while concept-specific topics are under-represented from female reflections. This research contributes to a deeper understanding of AI-generated bias in educational contexts, highlighting the need for future work on mitigating these biases. 2024.gebnlp-1.5 From 99bd72070b5c2ea19b7c24de5868f54cfa654024 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Mon, 23 Sep 2024 17:14:33 -0500 Subject: [PATCH 32/39] Paper Metadata: {2024.findings-acl.872}, closes #3901. 
--- data/xml/2024.findings.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.findings.xml b/data/xml/2024.findings.xml index 2c69d3a85b..7e6b8e0413 100644 --- a/data/xml/2024.findings.xml +++ b/data/xml/2024.findings.xml @@ -16903,7 +16903,7 @@ MatteoGabburoUniversity of Trento NicolaasJedemaAmazon SiddhantGargMeta - LeonardoRibeiroAmazon + Leonardo F. R.RibeiroAmazon AlessandroMoschittiAmazon AGI 14636-14650 In this paper, we investigate which questions are challenging for retrieval-based Question Answering (QA). We (i) propose retrieval complexity (RC), a novel metric conditioned on the completeness of retrieved documents, which measures the difficulty of answering questions, and (ii) propose an unsupervised pipeline to measure RC given an arbitrary retrieval system.Our proposed pipeline measures RC more accurately than alternative estimators, including LLMs, on six challenging QA benchmarks. Further investigation reveals that RC scores strongly correlate with both QA performance and expert judgment across five of the six studied benchmarks, indicating that RC is an effective measure of question difficulty.Subsequent categorization of high-RC questions shows that they span a broad set of question shapes, including multi-hop, compositional, and temporal QA, indicating that RC scores can categorize a new subset of complex questions. Our system can also have a major impact on retrieval-based systems by helping to identify more challenging questions on existing datasets. From 9a3f5095e25065f92d139a8003d1dd07a47abf07 Mon Sep 17 00:00:00 2001 From: anthology-assist Date: Mon, 23 Sep 2024 17:16:34 -0500 Subject: [PATCH 33/39] Author correction for William Soto Martinez, closes #3899. 
--- data/xml/2024.inlg.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.inlg.xml b/data/xml/2024.inlg.xml index 54c83e5151..ca123a4fad 100644 --- a/data/xml/2024.inlg.xml +++ b/data/xml/2024.inlg.xml @@ -77,7 +77,7 @@ Generating from <fixed-case>AMR</fixed-case>s into High and Low-Resource Languages using Phylogenetic Knowledge and Hierarchical <fixed-case>QL</fixed-case>o<fixed-case>RA</fixed-case> Training (<fixed-case>HQL</fixed-case>) - William EduardoSoto Martinez + WilliamSoto Martinez YannickParmentier ClaireGardent 70–81 From bdb00ef998bc613dcb4df8dac637a19e7682a88a Mon Sep 17 00:00:00 2001 From: Matt Post Date: Mon, 23 Sep 2024 18:32:19 -0400 Subject: [PATCH 34/39] Update 2024.lrec-main.464 (closes #3874) --- data/xml/2024.lrec.xml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/data/xml/2024.lrec.xml b/data/xml/2024.lrec.xml index f6a8a37449..2d281a3e78 100644 --- a/data/xml/2024.lrec.xml +++ b/data/xml/2024.lrec.xml @@ -5509,10 +5509,11 @@ Does the Language Matter? Curriculum Learning over Neo-<fixed-case>L</fixed-case>atin Languages - GiuliaPucci LeonardoRanaldi + GiuliaPucci + AndréFreitas 5212–5220 - Curriculum Learning (CL) is emerging as a relevant technique to reduce the cost of pre-training Large Language Models. The idea, tested for the English language, is to train LLMs by organizing training examples from the simplest to the most complex. Complexity measures may depend on the specific language. Hence, this paper aims to investigate whether CL and the complexity measure can be easily exported to other languages. For this reason, we present a set of linguistically motivated measures to determine the complexity of examples, which has been used in English: these measures are based on text length, rarity, and comprehensibility. We then test the approach to two Romance languages: Italian and French. 
Our results show that the technique can be easily exported to languages other than English without adaptation. + Curriculum Learning (CL) has been emerged as an effective technique for improving the performances and reducing the cost of pre-training Large Language Models (LLMs). The efficacy of CL demonstrated in different scenarios is in the training LLMs by organizing examples from the simplest to the most complex. Although improvements have been shown extensively, this approach was used for pre-training, leaving novel fine-tuning approaches such as instruction-tuning unexplored. In this paper, we propose a novel complexity measure to empower the instruction-tuning method using the CL paradigm. To complement previous works, we propose cognitively motivated measures to determine the complexity of training demonstrations used in the instruction-tuning paradigm. Hence, we experiment with the proposed heuristics first in English and then in other languages. The downstream results show that delivering training examples by complexity ranking is also effective for instruction tuning, as it improves downstream performance while reducing costs. Furthermore, the technique can be easily transferred to languages other than English, e.g., Italian and French, without any adaptation, maintaining functionality and effectiveness. 2024.lrec-main.464 pucci-ranaldi-2024-language From 6cf9e2c71c75dc2dbbcb7a9e74c6a9aedf399d2e Mon Sep 17 00:00:00 2001 From: Matt Post Date: Tue, 24 Sep 2024 08:03:41 -0400 Subject: [PATCH 35/39] Fix broken merge --- data/xml/2024.acl.xml | 15 --------------- data/xml/2024.lrec.xml | 2 +- 2 files changed, 1 insertion(+), 16 deletions(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index dd1b7e42e5..1292c8a73c 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -3262,12 +3262,9 @@ Self-attention and position embedding are two crucial modules in transformer-based Large Language Models (LLMs). 
However, the potential relationship between them is far from well studied, especially for long context window extending. In fact, anomalous behaviors that hinder long context extrapolation exist between Rotary Position Embedding (RoPE) and vanilla self-attention.Incorrect initial angles between Q and K can cause misestimation in modeling rotary position embedding of the closest tokens.To address this issue, we propose \textbf{Co}llinear \textbf{C}onstrained \textbf{A}ttention mechanism, namely CoCA. Specifically, we enforce a collinear constraint between Q and K to seamlessly integrate RoPE and self-attention.While only adding minimal computational and spatial complexity, this integration significantly enhances long context window extrapolation ability. We provide an optimized implementation, making it a drop-in replacement for any existing transformer-based models.Extensive experiments demonstrate that CoCA excels in extending context windows. A CoCA-based GPT model, trained with a context length of 512, can extend the context window up to 32K (60\times) without any fine-tuning.Additionally, incorporating CoCA into LLaMA-7B achieves extrapolation up to 32K within a training length of only 2K.Our code is publicly available at: https://github.com/codefuse-ai/Collinear-Constrained-Attention 2024.acl-long.233 zhu-etal-2024-coca -<<<<<<< HEAD The author's affiliation changed. -======= 10.18653/v1/2024.acl-long.233 ->>>>>>> origin/master <fixed-case>I</fixed-case>nfo<fixed-case>L</fixed-case>oss<fixed-case>QA</fixed-case>: Characterizing and Recovering Information Loss in Text Simplification @@ -4599,12 +4596,9 @@ Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. 
In this work we argue that these faithfulness tests do not measure faithfulness to the models’ inner workings – but rather their self-consistency at output level. Our contributions are three-fold: i) We clarify the status of faithfulness tests in view of model explainability, characterising them as self-consistency tests instead. This assessment we underline by ii) constructing a Comparative Consistency Bank for self-consistency tests that for the first time compares existing tests on a common suite of 11 open LLMs and 5 tasks – including iii) our new self-consistency measure CC-SHAP. CC-SHAP is a fine-grained measure (not a test) of LLM self-consistency. It compares how a model’s input contributes to the predicted answer and to generating the explanation. Our fine-grained CC-SHAP metric allows us iii) to compare LLM behaviour when making predictions and to analyse the effect of other consistency tests at a deeper level, which takes us one step further towards measuring faithfulness by bringing us closer to the internals of the model than strictly surface output-oriented tests. 2024.acl-long.329 parcalabescu-frank-2024-measuring -<<<<<<< HEAD This revision mentions a sponsor in the acknowledgements and fixes the typo in Eq. 4. -======= 10.18653/v1/2024.acl-long.329 ->>>>>>> origin/master Learning or Self-aligning? Rethinking Instruction Fine-tuning @@ -5375,11 +5369,8 @@ rai-yao-2024-investigation Minor updates. -<<<<<<< HEAD Minor updates. -======= 10.18653/v1/2024.acl-long.387 ->>>>>>> origin/master Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling @@ -9698,12 +9689,9 @@ We introduce FinanceMath, a novel benchmark designed to evaluate LLMs' capabilities in solving knowledge-intensive math reasoning problems. Compared to prior works, this study features three core advancements. First, FinanceMath includes 1,200 problems with a hybrid of textual and tabular content.
These problems require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. We also construct a finance-domain knowledge bank and investigate various knowledge integration strategies. Finally, we evaluate a wide spectrum of 44 LLMs with both Chain-of-Thought and Program-of-Thought prompting methods. Our experimental results reveal that the current best-performing system (i.e., GPT-4o) achieves only 60.9% accuracy using CoT prompting, leaving substantial room for improvement. Moreover, while augmenting LLMs with external knowledge can improve model performance (e.g., from 47.5% to 54.5% for Gemini-1.5-Pro), their accuracy remains significantly lower than the estimated human expert performance of 92%. We believe that FinanceMath can advance future research in the area of domain-specific knowledge retrieval and integration, particularly within the context of solving reasoning-intensive tasks. 2024.acl-long.693 zhao-etal-2024-knowledgefmath -<<<<<<< HEAD Revised the dataset name. -======= 10.18653/v1/2024.acl-long.693 ->>>>>>> origin/master <fixed-case>API</fixed-case>-<fixed-case>BLEND</fixed-case>: A Comprehensive Corpora for Training and Benchmarking <fixed-case>API</fixed-case> <fixed-case>LLM</fixed-case>s @@ -12012,12 +12000,9 @@ Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing specialized documents containing both text and tables. 
We conduct an extensive evaluation of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting methods, aiming to comprehensively assess the capabilities and limitations of existing LLMs in DocMath-Eval. We found that even the current best-performing system (i.e., GPT-4o) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe that DocMath-Eval can serve as a valuable benchmark for evaluating LLMs' capabilities in solving challenging numerical reasoning problems within expert domains. 2024.acl-long.852 zhao-etal-2024-docmath -<<<<<<< HEAD Included experimental results. -======= 10.18653/v1/2024.acl-long.852 ->>>>>>> origin/master Unintended Impacts of <fixed-case>LLM</fixed-case> Alignment on Global Representation diff --git a/data/xml/2024.lrec.xml b/data/xml/2024.lrec.xml index 2d281a3e78..383adc9a3f 100644 --- a/data/xml/2024.lrec.xml +++ b/data/xml/2024.lrec.xml @@ -5515,7 +5515,7 @@ 5212–5220 Curriculum Learning (CL) has emerged as an effective technique for improving the performance and reducing the cost of pre-training Large Language Models (LLMs). The efficacy of CL demonstrated in different scenarios lies in training LLMs by organizing examples from the simplest to the most complex. Although improvements have been shown extensively, this approach was used for pre-training, leaving novel fine-tuning approaches such as instruction-tuning unexplored. In this paper, we propose a novel complexity measure to empower the instruction-tuning method using the CL paradigm. To complement previous works, we propose cognitively motivated measures to determine the complexity of training demonstrations used in the instruction-tuning paradigm. Hence, we experiment with the proposed heuristics first in English and then in other languages.
The downstream results show that delivering training examples by complexity ranking is also effective for instruction tuning, as it improves downstream performance while reducing costs. Furthermore, the technique can be easily transferred to languages other than English, e.g., Italian and French, without any adaptation, maintaining functionality and effectiveness. 2024.lrec-main.464 - pucci-ranaldi-2024-language + ranaldi-etal-2024-language Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages From 041b2ff5f5f864731cb8b1dffd5b109df6f33be3 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Tue, 24 Sep 2024 10:33:21 -0400 Subject: [PATCH 36/39] Name corrections to 2024.arabicnlp-1.65 --- data/xml/2024.arabicnlp.xml | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/data/xml/2024.arabicnlp.xml b/data/xml/2024.arabicnlp.xml index a0792f7313..6cfc944a86 100644 --- a/data/xml/2024.arabicnlp.xml +++ b/data/xml/2024.arabicnlp.xml @@ -824,15 +824,15 @@ <fixed-case>B</fixed-case>ias<fixed-case>G</fixed-case>anda at <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: A Quest to Uncover Biased Views in News Coverage - BlqeesBlqees - AlWardi + Al ManarAl Wardi + BlqeesAl Busaidi MalathAl-Sibani - HibaAl-Siyabi - NajmaZidjalySultan Qaboos University + Hiba Salim MuhammadAl-Siyabi + NajmaAl ZidjalySultan Qaboos University 609-613 In this study, we aimed to identify biased language in a dataset provided by the FIGNEWS 2024 committee on the Gaza-Israel war. We classified entries into seven categories: Unbiased, Biased against Palestine, Biased against Israel, Biased against Others, Biased against both Palestine and Israel, Unclear, and Not Applicable. Our team reviewed the literature to develop a codebook of terminologies and definitions. By coding each example, we sought to detect language tendencies used by media outlets when reporting on the same event. 
The primary finding was that most examples were classified as “Biased against Palestine,” as all examined language data used one-sided terms to describe the October 7 event. The least used category was “Not Applicable,” reserved for irrelevant examples or those lacking context. It is recommended to use neutral and balanced language when reporting volatile political news. 2024.arabicnlp-1.65 - blqees-etal-2024-biasganda + al-wardi-etal-2024-biasganda 10.18653/v1/2024.arabicnlp-1.65 From b182dc94bafd79ce38d2a5c0f4671820e7e6f797 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Wed, 25 Sep 2024 17:13:03 -0400 Subject: [PATCH 37/39] Name correction; 2024.eamt-1.14 --- data/xml/2024.eamt.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.eamt.xml b/data/xml/2024.eamt.xml index 7e0b7edcf9..a0ce21b508 100644 --- a/data/xml/2024.eamt.xml +++ b/data/xml/2024.eamt.xml @@ -150,7 +150,7 @@ Quality Estimation with <tex-math>k</tex-math>-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation - TuDinhKarlsruher Institut für Technologie + TuAnh DinhKarlsruhe Institut für Technologie TobiasPalzerTechnische Universität München JanNiehues 133-146 From abc4d953a476ff138cca8f0fad17ad7ea5320c94 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Mon, 30 Sep 2024 06:29:42 -0500 Subject: [PATCH 38/39] Remove middle name --- data/xml/2024.lrec.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/xml/2024.lrec.xml b/data/xml/2024.lrec.xml index 383adc9a3f..d014c43cbd 100644 --- a/data/xml/2024.lrec.xml +++ b/data/xml/2024.lrec.xml @@ -3041,7 +3041,7 @@ <fixed-case>CASIMIR</fixed-case>: A Corpus of Scientific Articles Enhanced with Multiple Author-Integrated Revisions - Léane IsabelleJourdan + LéaneJourdan FlorianBoudin NicolasHernandez RichardDufour From e19d9b89ee977f758b87b16fe21a2a51b8e123f0 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Tue, 24 Sep 2024 10:12:11 -0400 Subject: [PATCH 39/39] Name correction (closes 
#3872) --- data/xml/2024.acl.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml index f1d3c6de9b..3308126c2b 100644 --- a/data/xml/2024.acl.xml +++ b/data/xml/2024.acl.xml @@ -5988,7 +5988,7 @@ <fixed-case>LLME</fixed-case>mbed: Rethinking Lightweight <fixed-case>LLM</fixed-case>’s Genuine Function in Text Classification - ChunLiuChunLiuAMS + ChunLiuAMS HongguangZhangSystems Engineering Institute, AMS KainanZhaoAMS XinghaiJuInformation Engineering University 7994-8004 With the booming of Large Language Models (LLMs), prompt-learning has become a promising method mainly researched in various research areas. Recently, many attempts based on prompt-learning have been made to improve the performance of text classification. However, most of these methods are based on heuristic Chain-of-Thought (CoT), and tend to be more complex but less efficient. In this paper, we rethink the LLM-based text classification methodology, propose a simple and effective transfer learning strategy, namely LLMEmbed, to address this classical but challenging task. To illustrate, we first study how to properly extract and fuse the text embeddings via various lightweight LLMs at different network depths to improve their robustness and discrimination, then adapt such embeddings to train the classifier. We perform extensive experiments on publicly available datasets, and the results show that LLMEmbed achieves strong performance while enjoying low training overhead using lightweight LLM backbones compared to recent methods based on larger LLMs, *i.e.* GPT-3, and sophisticated prompt-based strategies. Our LLMEmbed achieves adequate accuracy on publicly available benchmarks without any fine-tuning while merely using 4% model parameters, 1.8% electricity consumption and 1.5% runtime compared to its counterparts. Code is available at: https://github.com/ChunLiu-cs/LLMEmbed-ACL2024.
2024.acl-long.433 - chunliu-etal-2024-llmembed + liu-etal-2024-llmembed 10.18653/v1/2024.acl-long.433