Paper Metadata: Maarten de Rijke, closes #3811.
anthology-assist committed Oct 19, 2024
1 parent 051701b commit c6761e1
Showing 6 changed files with 9 additions and 9 deletions.
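
For the Maarten de Rijke entries below, the correction is mechanical: the <last> child of each affected <author> element changes from "Rijke" to "de Rijke" while <first>Maarten</first> stays the same. As an illustration only, a batch fix of this kind could be scripted roughly as sketched below; the fix_surname helper is hypothetical, not the Anthology's own tooling, and a plain ElementTree round-trip will not preserve the files' original formatting, comments, or doctype.

# Hypothetical sketch: change <last>Rijke</last> to <last>de Rijke</last> for
# authors whose <first> is Maarten in a single Anthology XML data file.
# Caveat: ElementTree rewrites the whole file and drops original formatting.
import xml.etree.ElementTree as ET

def fix_surname(path, first="Maarten", old_last="Rijke", new_last="de Rijke"):
    tree = ET.parse(path)
    changed = 0
    for author in tree.getroot().iter("author"):
        first_el = author.find("first")
        last_el = author.find("last")
        if (first_el is not None and last_el is not None
                and first_el.text == first and last_el.text == old_last):
            last_el.text = new_last  # Rijke -> de Rijke
            changed += 1
    if changed:
        tree.write(path, encoding="UTF-8", xml_declaration=True)
    return changed

# Example on one of the files touched here; the 2022.acl.xml diff below
# shows exactly one matching <author> entry.
print(fix_surname("data/xml/2022.acl.xml"))
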
data/xml/2022.acl.xml (2 changes: 1 addition & 1 deletion)
@@ -4000,7 +4000,7 @@ in the Case of Unambiguous Gender</title>
<author><first>Pengjie</first><last>Ren</last></author>
<author><first>Wentao</first><last>Deng</last></author>
<author><first>Zhumin</first><last>Chen</last></author>
-<author><first>Maarten</first><last>Rijke</last></author>
+<author><first>Maarten</first><last>de Rijke</last></author>
<pages>3543-3555</pages>
<abstract>A dialogue response is malevolent if it is grounded in negative emotions, inappropriate behavior, or an unethical value basis in terms of content and dialogue acts. The detection of malevolent dialogue responses is attracting growing interest. Current research on detecting dialogue malevolence has limitations in terms of datasets and methods. First, available dialogue datasets related to malevolence are labeled with a single category, but in practice assigning a single category to each utterance may not be appropriate as some malevolent utterances belong to multiple labels. Second, current methods for detecting dialogue malevolence neglect label correlation. Therefore, we propose the task of multi-label dialogue malevolence detection and crowdsource a multi-label dataset, multi-label dialogue malevolence detection (MDMD) for evaluation. We also propose a multi-label malevolence detection model, multi-faceted label correlation enhanced CRF (MCRF), with two label correlation mechanisms, label correlation in taxonomy (LCT) and label correlation in context (LCC). Experiments on MDMD show that our method outperforms the best performing baseline by a large margin, i.e., 16.1%, 11.9%, 12.0%, and 6.1% on precision, recall, F1, and Jaccard score, respectively.</abstract>
<url hash="f0c66248">2022.acl-long.248</url>
data/xml/2022.dialdoc.xml (2 changes: 1 addition & 1 deletion)
@@ -82,7 +82,7 @@
<title>Parameter-Efficient Abstractive Question Answering over Tables or Text</title>
<author><first>Vaishali</first><last>Pal</last></author>
<author><first>Evangelos</first><last>Kanoulas</last></author>
-<author><first>Maarten</first><last>Rijke</last></author>
+<author><first>Maarten</first><last>de Rijke</last></author>
<pages>41-53</pages>
<abstract>A long-term ambition of information seeking QA systems is to reason over multi-modal contexts and generate natural answers to user queries. Today, memory intensive pre-trained language models are adapted to downstream tasks such as QA by fine-tuning the model on QA data in a specific modality like unstructured text or structured tables. To avoid training such memory-hungry models while utilizing a uniform architecture for each modality, parameter-efficient adapters add and train small task-specific bottle-neck layers between transformer layers. In this work, we study parameter-efficient abstractive QA in encoder-decoder models over structured tabular data and unstructured textual data using only 1.5% additional parameters for each modality. We also ablate over adapter layers in both encoder and decoder modules to study the efficiency-performance trade-off and demonstrate that reducing additional trainable parameters down to 0.7%-1.0% leads to comparable results. Our models out-perform current state-of-the-art models on tabular QA datasets such as Tablesum and FeTaQA, and achieve comparable performance on a textual QA dataset such as NarrativeQA using significantly less trainable parameters than fine-tuning.</abstract>
<url hash="fd968a0b">2022.dialdoc-1.5</url>
data/xml/2022.naacl.xml (2 changes: 1 addition & 1 deletion)
@@ -66,7 +66,7 @@
<title>What Makes a Good and Useful Summary? <fixed-case>I</fixed-case>ncorporating Users in Automatic Summarization Research</title>
<author><first>Maartje</first><last>Ter Hoeve</last></author>
<author><first>Julia</first><last>Kiseleva</last></author>
-<author><first>Maarten</first><last>Rijke</last></author>
+<author><first>Maarten</first><last>de Rijke</last></author>
<pages>46-75</pages>
<abstract>Automatic text summarization has enjoyed great progress over the years and is used in numerous applications, impacting the lives of many. Despite this development, there is little research that meaningfully investigates how the current research focus in automatic summarization aligns with users’ needs. To bridge this gap, we propose a survey methodology that can be used to investigate the needs of users of automatically generated summaries. Importantly, these needs are dependent on the target group. Hence, we design our survey in such a way that it can be easily adjusted to investigate different user groups. In this work we focus on university students, who make extensive use of summaries during their studies. We find that the current research directions of the automatic summarization community do not fully align with students’ needs. Motivated by our findings, we present ways to mitigate this mismatch in future research on automatic summarization: we propose research directions that impact the design, the development and the evaluation of automatically generated summaries.</abstract>
<url hash="a5360123">2022.naacl-main.4</url>
data/xml/2024.acl.xml (2 changes: 1 addition & 1 deletion)
@@ -2338,7 +2338,7 @@
<author><first>Shiguang</first><last>Wu</last></author>
<author><first>Mengqi</first><last>Zhang</last><affiliation>Shandong University</affiliation></author>
<author><first>Zhaochun</first><last>Ren</last><affiliation>Leiden University</affiliation></author>
-<author><first>Maarten</first><last>Rijke</last><affiliation>University of Amsterdam</affiliation></author>
+<author><first>Maarten</first><last>de Rijke</last><affiliation>University of Amsterdam</affiliation></author>
<author><first>Zhumin</first><last>Chen</last><affiliation>Shandong University</affiliation></author>
<author><first>Jiahuan</first><last>Pei</last><affiliation>Centrum voor Wiskunde en Informatica</affiliation></author>
<pages>3052-3064</pages>
data/xml/2024.findings.xml (8 changes: 4 additions & 4 deletions)
@@ -3015,7 +3015,7 @@
<title>Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems</title>
<author><first>Clemencia</first><last>Siro</last></author>
<author><first>Mohammad</first><last>Aliannejadi</last><affiliation>University of Amsterdam</affiliation></author>
-<author><first>Maarten</first><last>Rijke</last><affiliation>University of Amsterdam</affiliation></author>
+<author><first>Maarten</first><last>de Rijke</last><affiliation>University of Amsterdam</affiliation></author>
<pages>1258-1273</pages>
<abstract>Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems (TDSs). Obtaining high-quality and consistent ground-truth labels from annotators presents challenges. When evaluating a TDS, annotators must fully comprehend the dialogue before providing judgments. Previous studies suggest using only a portion of the dialogue context in the annotation process. However, the impact of this limitation on label quality remains unexplored. This study investigates the influence of dialogue context on annotation quality, considering the truncated context for relevance and usefulness labeling. We further propose to use large language models ( LLMs) to summarize the dialogue context to provide a rich and short description of the dialogue context and study the impact of doing so on the annotator’s performance. Reducing context leads to more positive ratings. Conversely, providing the entire dialogue context yields higher-quality relevance ratings but introduces ambiguity in usefulness ratings. Using the first user utterance as context leads to consistent ratings, akin to those obtained using the entire dialogue, with significantly reduced annotation effort. Our findings show how task design, particularly the availability of dialogue context, affects the quality and consistency of crowdsourced evaluation labels.</abstract>
<url hash="874ad2d5">2024.findings-naacl.80</url>
@@ -14249,7 +14249,7 @@
<author><first>Yubao</first><last>Tang</last></author>
<author><first>Ruqing</first><last>Zhang</last></author>
<author><first>Jiafeng</first><last>Guo</last><affiliation>Institute of Computing Technolgy, Chinese Academy of Sciences</affiliation></author>
-<author><first>Maarten</first><last>Rijke</last><affiliation>University of Amsterdam</affiliation></author>
+<author><first>Maarten</first><last>de Rijke</last><affiliation>University of Amsterdam</affiliation></author>
<author><first>Yixing</first><last>Fan</last></author>
<author><first>Xueqi</first><last>Cheng</last><affiliation>, Chinese Academy of Sciences</affiliation></author>
<pages>10303-10317</pages>
@@ -17202,7 +17202,7 @@
<author><first>Mohanna</first><last>Hoveyda</last></author>
<author><first>Arjen</first><last>Vries</last><affiliation>Institute for Computing and Information Sciences, Radboud University Nijmegen, Radboud University</affiliation></author>
<author><first>Faegheh</first><last>Hasibi</last><affiliation>Radboud University</affiliation></author>
-<author><first>Maarten</first><last>Rijke</last><affiliation>University of Amsterdam</affiliation></author>
+<author><first>Maarten</first><last>de Rijke</last><affiliation>University of Amsterdam</affiliation></author>
<pages>13938-13946</pages>
<abstract>Entity linking (EL) in conversations faces notable challenges in practical applications, primarily due to scarcity of entity-annotated conversational datasets and sparse knowledge bases (KB) containing domain-specific, long-tail entities. We designed targeted evaluation scenarios to measure the efficacy of EL models under resource constraints. Our evaluation employs two KBs: Fandom, exemplifying real-world EL complexities, and the widely used Wikipedia. First, we assess EL models’ ability to generalize to a new unfamiliar KB using Fandom and a novel zero-shot conversational entity linking dataset that we curated based on Reddit discussions on Fandom entities. We then evaluate the adaptability of EL models to conversational settings without prior training. Our results indicate that current zero-shot EL models falter when introduced to new, domain-specific KBs without prior training, significantly dropping in performance.Our findings reveal that previous evaluation approaches fall short of capturing real-world complexities for zero-shot EL, highlighting the necessity for new approaches to design and assess conversational EL models to adapt to limited resources. The evaluation frame-work and dataset proposed are tailored to facilitate this research.</abstract>
<url hash="4eda4d75">2024.findings-acl.829</url>
@@ -17765,7 +17765,7 @@
<author><first>Zhaochun</first><last>Ren</last><affiliation>Leiden University</affiliation></author>
<author><first>Arian</first><last>Askari</last></author>
<author><first>Mohammad</first><last>Aliannejadi</last><affiliation>University of Amsterdam</affiliation></author>
-<author><first>Maarten</first><last>Rijke</last><affiliation>University of Amsterdam</affiliation></author>
+<author><first>Maarten</first><last>de Rijke</last><affiliation>University of Amsterdam</affiliation></author>
<author><first>Suzan</first><last>Verberne</last><affiliation>Universiteit Leiden</affiliation></author>
<pages>14623-14635</pages>
<abstract>An important unexplored aspect in previous work on user satisfaction estimation for Task-Oriented Dialogue (TOD) systems is their evaluation in terms of robustness for the identification of user dissatisfaction: current benchmarks for user satisfaction estimation in TOD systems are highly skewed towards dialogues for which the user is satisfied. The effect of having a more balanced set of satisfaction labels on performance is unknown. However, balancing the data with more dissatisfactory dialogue samples requires further data collection and human annotation, which is costly and time-consuming. In this work, we leverage large language models (LLMs) and unlock their ability to generate satisfaction-aware counterfactual dialogues to augment the set of original dialogues of a test collection. We gather human annotations to ensure the reliability of the generated samples. We evaluate two open-source LLMs as user satisfaction estimators on our augmented collection against state-of-the-art fine-tuned models. Our experiments show that when used as few-shot user satisfaction estimators, open-source LLMs show higher robustness to the increase in the number of dissatisfaction labels in the test collection than the fine-tuned state-of-the-art models. Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems. We release our aligned counterfactual dialogues, which are curated by human annotation, to facilitate further research on this topic.</abstract>
data/xml/2024.sighan.xml (2 changes: 1 addition & 1 deletion)
@@ -58,7 +58,7 @@
<author><first>Yu Yan</first><last>Lam</last><affiliation>Hong Kong Metropolitan University</affiliation></author>
<author><first>Wing Lam</first><last>Suen</last><affiliation>Hong Kong Metropolitan University</affiliation></author>
<author><first>Elsie Li Chen</first><last>Ong</last><affiliation>Hong Kong Metropolitan University</affiliation></author>
-<author><first>Samuel Kai Wah</first><last>Chu</last><affiliation>Hong Kong Metropolitan University</affiliation></author>
+<author><first>Samuel Kai Wah</first><last>Chu</last><affiliation>Hong Kong Metropolitan University</affiliation></author>
<pages>21-27</pages>
<abstract>According to the internationally recognized PIRLS (Progress in International Reading Literacy Study) assessment standards, reading comprehension questions should require not only information retrieval, but also higher-order processes such as inferencing, interpreting and evaluation. However, these kinds of questions are often not available in large quantities for training question generation models. This paper investigates whether pre-trained Large Language Models (LLMs) can produce higher-order questions. Human assessment on a Chinese dataset shows that few-shot LLM prompting generates more usable and higher-order questions than two competitive neural baselines.</abstract>
<url hash="5f10f2d4">2024.sighan-1.3</url>
