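One-line summary:

This paper introduces ConfuseBench, a benchmark that evaluates how well large language models (LLMs) recognize and resolve three types of uncertainty (document scarcity, limited capability, and query ambiguity), and proposes InteractDPO, an on-policy training method for generating better inquiries.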


Short summary (Abstract):

This paper addresses the problem of large language models (LLMs) exhibiting overconfidence in uncertain situations. Existing solutions rely primarily on evasive responses (e.g., "I don’t know"), which miss the opportunity to identify and resolve the uncertainty. To tackle this, the authors introduce ConfuseBench, a benchmark that evaluates three types of uncertainty: document scarcity, limited capability, and query ambiguity. Experiments show that current LLMs struggle to accurately identify and resolve the root cause of uncertainty, and that weaker models in particular tend to attribute uncertainty to query ambiguity. The authors propose generating context-aware inquiries and judging the source of uncertainty based on the uniqueness of the inquiry's answer, along with an on-policy training method called InteractDPO that generates better inquiries. Experimental results demonstrate the efficacy of the proposed approach.


Methodology

This paper proposes a new methodology to enhance the ability of large language models (LLMs) to recognize and resolve uncertainty. The core of the research is a benchmark called ‘ConfuseBench’, which evaluates how well LLMs identify and resolve three main types of uncertainty: document scarcity, limited capability, and query ambiguity.

Model Architecture: The models studied are general-purpose LLMs. They are pre-trained on large amounts of text data and can handle tasks such as question answering, text generation, and code generation.
Training Data: The models are trained on a large corpus of text collected from various domains. This data provides the knowledge the models need to answer diverse questions, retrieve information from documents, and handle complex queries.
Special Techniques: The research introduces a new training method called ‘InteractDPO’. The model interacts in real time with users or retrieval systems so that it learns to generate better inquiries and, in turn, to resolve uncertainty. InteractDPO evaluates the quality of the inquiries the model generates and uses that signal to keep improving the model.
Experiments and Results: Experiments on ConfuseBench reveal that current LLMs struggle to accurately identify the source of uncertainty and tend to attribute it to query ambiguity. To address this, the researchers propose generating context-aware inquiries that highlight the confusing parts of the original query, and judging the source of uncertainty based on the uniqueness of the inquiry’s answer.

These methods aim to improve the ability of LLMs to recognize and resolve uncertainty, and thereby to provide higher-quality responses.
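
The idea of judging the source of uncertainty from the uniqueness of the inquiry's answer can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the paper's implementation: the function name `judge_source`, the agreement threshold, the two output labels, and the rule that agreeing answers mean the missing piece is a retrievable fact while disagreeing answers mean the query itself is ambiguous. Uniqueness alone does not separate document scarcity from limited capability, so the sketch only separates ambiguity from the rest.

```python
from collections import Counter


def judge_source(inquiry_answers, unique_threshold=0.8):
    """Hypothetical rule for attributing uncertainty.

    inquiry_answers: answers sampled for the model's follow-up inquiry.
    If most samples agree, the inquiry has an (approximately) unique answer,
    so we assume the gap is missing knowledge that retrieval or a tool can
    fill. Otherwise we assume the original query is ambiguous and the user
    has to clarify it.
    """
    if not inquiry_answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so trivial formatting differences do not count.
    normalized = [a.strip().lower() for a in inquiry_answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    agreement = top_count / len(normalized)
    if agreement >= unique_threshold:
        return "missing_knowledge"
    return "query_ambiguity"


# "What is the capital of France?" has one answer.
print(judge_source(["Paris", "paris", " Paris "]))
# "Which city are you referring to?" has many plausible answers.
print(judge_source(["New York", "Seoul", "London", "Seoul"]))
```

Exact string matching is the simplest possible notion of agreement; a real system would more likely compare answers semantically.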

Results

This paper introduces a new benchmark called ConfuseBench to evaluate how well large language models (LLMs) recognize and resolve uncertainty. The benchmark covers three main types of uncertainty: document scarcity, limited capability, and query ambiguity. Experiments reveal that current LLMs struggle to accurately identify and resolve the root causes of these uncertainties, and that they tend to overlook capability limitations and misclassify them as query ambiguity.

Experimental Setup

Competing Models: GPT-4o, DeepSeek-V3, Qwen2.5-72B, Llama-3-70B, Qwen2.5-7B, Mistral-7B
Test Data: HotpotQA, AmbigQA, ExpertQA, TechQA, ToolBench
Metrics:
- Answer Quality (AQ): Evaluates the quality of the answer provided after interaction
- Uncertainty Classification Accuracy (UCA): Measures the ability to recognize the source of uncertainty
- Inquiry Quality (IQ): Assesses the quality of generated inquiries

Results

The experimental results indicate that LLMs struggle to recognize and resolve uncertainty effectively, and in particular tend to misclassify other sources of uncertainty as query ambiguity. For instance, weaker models frequently judged queries as ambiguous regardless of their clarity and asked users for additional clarification. In contrast, the more powerful GPT-4o was able to generate meaningful inquiries, although there remains room for improvement.

To address this, the paper proposes judging the source of uncertainty based on the uniqueness of the inquiry’s answer, together with a new training method, InteractDPO, that improves the quality of generated inquiries.
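
As a rough illustration of the Uncertainty Classification Accuracy (UCA) metric listed above, the sketch below computes plain accuracy over the three uncertainty labels, plus a second number that makes the "everything looks ambiguous" failure mode visible. The label strings, function names, and the second metric are assumptions for illustration; the paper's exact scoring may differ.

```python
def uncertainty_classification_accuracy(gold, predicted):
    """Fraction of cases where the predicted uncertainty source matches gold."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def ambiguity_overprediction_rate(gold, predicted):
    """Hypothetical helper: among cases that are NOT query ambiguity,
    how often did the model still predict query ambiguity?"""
    non_ambiguous = [(g, p) for g, p in zip(gold, predicted)
                     if g != "query_ambiguity"]
    if not non_ambiguous:
        return 0.0
    wrong = sum(p == "query_ambiguity" for _, p in non_ambiguous)
    return wrong / len(non_ambiguous)


gold = ["document_scarcity", "limited_capability",
        "query_ambiguity", "limited_capability"]
# A model that blames ambiguity for almost everything.
pred = ["query_ambiguity", "query_ambiguity",
        "query_ambiguity", "limited_capability"]

print(uncertainty_classification_accuracy(gold, pred))  # 0.5
print(ambiguity_overprediction_rate(gold, pred))        # 2 of 3 non-ambiguous cases
```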

Examples

This paper introduces a benchmark called “ConfuseBench” to evaluate how large language models (LLMs) respond in uncertain situations. The benchmark covers three main types of uncertainty: document scarcity, limited capability, and query ambiguity.

Training Data and Test Data

Training Data:
- Input: Various questions and their related documents. For example, a question like “Where is the best yoga class in New York?” along with relevant documents (e.g., lists of yoga classes, location information).
- Output: The answer generated by the model. For instance, “The yoga classes in New York are A, B, and C.”
Test Data:
- Input: Questions and documents that involve uncertainty. For example, a question like “What will the weather be like in 2025?” with insufficient related documents.
- Output: The answer generated by the model. For example, “There is no information on that.” or “I am not sure.”

Specific Tasks

Document Scarcity: Evaluates how the model responds when it has to answer a question based on documents that lack the necessary information.
Limited Capability: Evaluates how the model responds when it cannot handle a complex question. For instance, whether the model can generate an appropriate answer to a question like “What is the impact of quantum computing on climate modeling?”
Query Ambiguity: Evaluates whether the model asks the user for additional information when the question is ambiguous. For example, given the question “What is the best yoga class in my city?”, the model should ask, “Which city are you referring to?”

In this way, the paper presents a framework for assessing and improving the ability of LLMs to recognize and resolve uncertainty.
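
The three tasks above can be pictured as labeled records, each paired with a different follow-up action. The sketch below is only an illustration: the `Case` class, the `Uncertainty` enum, the action names, and the mapping from label to action are assumptions, not the benchmark's actual schema. The example queries are the ones used in the task descriptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Uncertainty(Enum):
    DOCUMENT_SCARCITY = "document_scarcity"
    LIMITED_CAPABILITY = "limited_capability"
    QUERY_AMBIGUITY = "query_ambiguity"


@dataclass
class Case:
    """One hypothetical benchmark record: a query, its documents, a gold label."""
    query: str
    label: Uncertainty
    documents: List[str] = field(default_factory=list)


# Assumed mapping from uncertainty source to the next step a system might take.
NEXT_STEP = {
    Uncertainty.DOCUMENT_SCARCITY: "retrieve_more_documents",
    Uncertainty.LIMITED_CAPABILITY: "escalate_or_reason_further",
    Uncertainty.QUERY_AMBIGUITY: "ask_user",
}


def next_step(case: Case) -> str:
    """Look up the follow-up action for a case's gold uncertainty label."""
    return NEXT_STEP[case.label]


cases = [
    Case("What is the best yoga class in my city?", Uncertainty.QUERY_AMBIGUITY),
    Case("What will the weather be like in 2025?", Uncertainty.DOCUMENT_SCARCITY),
    Case("What is the impact of quantum computing on climate modeling?",
         Uncertainty.LIMITED_CAPABILITY),
]

for case in cases:
    print(f"{case.label.value}: {next_step(case)}")
```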

Summary

This paper introduces ConfuseBench, a benchmark for evaluating, and ultimately improving, the ability of large language models (LLMs) to recognize and resolve uncertainty. Experimental results reveal that current LLMs struggle to accurately identify the root cause of uncertainty and tend to overemphasize query ambiguity. To tackle this, the authors propose a new training method called InteractDPO that aims to generate better inquiries.
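
For readers unfamiliar with DPO, the sketch below shows the standard Direct Preference Optimization loss for a single preference pair, together with one hypothetical way an on-policy loop might pick that pair from inquiries sampled from the current model. This is the generic DPO objective, not the paper's InteractDPO formulation; the scoring signal (for example, answer quality after the interaction), the helper `pick_pair`, and the value of `beta` are all assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs.

    loss = -log(sigmoid(beta * margin)), where margin is how much more the
    policy prefers the chosen response over the rejected one, relative to
    the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable softplus(-z) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def pick_pair(scored_inquiries):
    """Hypothetical on-policy pair selection.

    scored_inquiries: list of (inquiry_text, score) sampled from the current
    policy. The best-scoring inquiry becomes 'chosen', the worst 'rejected'.
    """
    ranked = sorted(scored_inquiries, key=lambda item: item[1], reverse=True)
    return ranked[0][0], ranked[-1][0]


chosen, rejected = pick_pair([
    ("Could you clarify?", 0.2),
    ("Which city are you referring to?", 0.9),
    ("What do you mean by 'best'?", 0.5),
])
print(chosen, "|", rejected)

# With no preference yet (all log-probs equal) the loss is log(2).
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
# Once the policy favors the chosen inquiry, the loss drops below log(2).
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```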

Other

The paper “Do not Abstain! Identify and Solve the Uncertainty” addresses the problem of large language models (LLMs) exhibiting overconfidence in uncertain situations. The study introduces a benchmark called ConfuseBench to evaluate LLMs’ ability to recognize and resolve uncertainty. The benchmark focuses on three types of uncertainty: document scarcity, limited capability, and query ambiguity.

Results and Insights

ConfuseBench Benchmark:
- Designed to evaluate LLMs’ ability to recognize and resolve uncertainty.
- Includes three main types of uncertainty (document scarcity, limited capability, query ambiguity).
- Experimental results reveal that current LLMs struggle to accurately identify and resolve the root causes of uncertainty, and that weaker models in particular tend to attribute uncertainty to query ambiguity.
Model Performance:
- LLMs demonstrated less than 50% accuracy in recognizing the source of uncertainty.
- Even powerful models like GPT-4o often misclassify uncertainty as query ambiguity.
InteractDPO Methodology:
- A new training method called InteractDPO is proposed to help LLMs generate effective follow-up inquiries.
- The method has LLMs interact with users or retrieval systems in real time so that they learn to generate better inquiries.
- Experimental results show that InteractDPO effectively improves LLMs’ uncertainty classification performance.
Inquiry and Response Quality:
- The quality of the inquiries generated by LLMs directly affects the quality of the responses.
- High-quality inquiries help gather more useful information.
Accuracy and Reliability:
- LLMs still exhibit low accuracy in classifying uncertainty, particularly in recognizing uncertainty related to document scarcity.
- Reliability assessments indicate high consistency in the responses generated by LLMs, but improvements are still needed.

Conclusion

This study shows that LLMs have significant limitations in recognizing and resolving uncertainty and proposes new approaches to address them. ConfuseBench and InteractDPO could serve as promising tools for improving LLM performance.

Reference format:

BibTeX format

@inproceedings{liu2025do,
  title={Do not Abstain! Identify and Solve the Uncertainty},
  author={Jingyu Liu and Jingquan Peng and Xiaopeng Wu and Xubing Li and Tiezheng Ge and Bo Zheng and Yong Liu},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={17177--17197},
  year={2025},
  month={July},
  publisher={Association for Computational Linguistics},
  address={Vienna, Austria}
}

Chicago style

Liu, Jingyu, Jingquan Peng, Xiaopeng Wu, Xubing Li, Tiezheng Ge, Bo Zheng, and Yong Liu. “Do not Abstain! Identify and Solve the Uncertainty.” In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 17177–17197. Vienna, Austria: Association for Computational Linguistics, July 2025.