HonestLLM: Toward an Honest and Helpful Large Language Model

NeurIPS 2024
1MUZUAI 2Huazhong University of Science andTechnology,
3University of Notre Dame, 4University of Washington
5Peking University 6Lehigh University

GUI-world Benchmark Overview We establish exhaustive principles aimed at guaranteeing the honesty of LLM and propose HONESET, as well as two methods to enhance the honesty and helpfulness of LLMs. In summary, the primary contributions of this paper are as follows:
  1. Definitions for Honesty. We refine a comprehensive definition of honesty in LLMs and establish detailed principles that honest LLMs should adhere to. Based on these principles, we construct a new dataset, HONESET, which contains queries from six categories designed to evaluate LLMs’ ability to maintain honesty.
  2. Two Methods. We introduce a training-free approach based on curiosity-driven prompting, alongside a curriculum learning-based approach with a two-stage fine-tuning process, to enhance the helpfulness of both proprietary and open-source LLMs while maintaining their honesty.
  3. Comprehensive Experiments and Valuable Insights. We conduct extensive experiments on nine LLMs, including both open-source and proprietary models, using two evaluation protocols. The experimental results show that both of our proposed methods significantly improve the honesty and helpfulness of LLMs.

Abstract

Large Language Models (LLMs) have achieved remarkable success across various industries due to their exceptional generative capabilities. However, for safe and effective real-world deployments, ensuring honesty and helpfulness is critical. This paper addresses the question: ***Can we prioritize the helpfulness of LLMs while preserving their honesty?*** To begin with, we establish exhaustive principles aimed at guaranteeing the honesty of LLM. Additionally, we introduce a novel dataset, referred to as **HONESET**, comprising 930 queries spanning six categories meticulously crafted to assess an LLM’s capacity for maintaining honesty. Subsequently, we present two approaches to augmenting honesty and helpfulness in LLMs: a training-free enhancement and a fine-tuning-based improvement. The training-free approach, which is based on curiosity-driven prompting, empowers LLMs to articulate internal confusion and uncertainty regarding queries, thereby optimizing their responses. Conversely, the fine-tuning-based method employs a two-stage process inspired by curriculum learning: initially instructing LLMs to discern between honest and dishonest responses, then refining their training to enhance helpfulness. Experiments conducted on nine prominent LLMs demonstrate a significant improvement in alignment with honesty across all models through the implementation of our proposed enhancements. Particularly noteworthy is the 65.3% enhancement observed in Llama3-8b and the remarkable 124.7% improvement in Mistral-7b, as measured by the H2 (honest and helpful) assessment. We believe that our work can pave the way for developing more trustworthy LLMs for real-world applications.

Principles for Honest LLMs

environment infrastructure we propose the principles of honest LLMs as shown in Appendix A, which focus on six categories*:
  1. Latest Information with External Services. Due to outdated pre-training data, insufficient fact-checking, and lack of access to live or up-to-date external data sources, LLMs may produce seemingly reasonable but inaccurate output when accessing the latest information via external tools. As a result, honestly acknowledging these limitations is crucial.
  2. User Input Not Enough Or With Wrong Information. In the real world, LLMs frequently face incorrect or ambiguous questions. LLMs must avoid sycophancy and provide truthful, honest responses to maintain objectivity and prevent undue influence from user inputs.
  3. Professional Capability in Specific Domains. Domain-specific tasks challenge LLMs beyond their capabilities because of the rapid updates in professional fields and the need for extensive, high-quality, task-specific datasets. Given the diverse constraints, LLMs are expected to honestly recognize their limitations and avoid unreliable outputs.
  4. Interactivity Sensory Processing. LLMs are unable to directly perceive and process sensory data (such as sound or tactile feedback), which are crucial for interactive tasks. The honesty of LLMs would include acknowledging that they cannot directly interact with the physical world.
  5. Modality Mismatch. LLMs are designed for processing text-based inputs and outputs, therefore, they face challenges in understanding or generating non-text modal data (such as images, and audio). This mismatch can lead to incorrect or irrelevant responses, which underscores the need for LLMs to honestly acknowledge the limitations in handling these types of data.
  6. Self Identity Cognition. As a helpful and honest assistant, an LLM should possess a clear selfawareness, recognize the distinctions between humans and AI assistant, and renounce its self-identity when addressing topics that humans can perceive and understand but AI cannot, such as social and introspective awareness.

Approach I: Training-Free Enhancement

HonestLLM Overview
**Curiosity-Driven Prompting** Intuitively, when faced with queries that require a high degree of honesty (e.g., questions outside the LLM’s capabilities or those it cannot adequately address), there arises an inherent uncertainty within the LLM. Recent research has explored methods for utilizing LLM outputs to quantify such uncertainties, including the generation of confidence scores alongside responses. This has inspired us to employ LLM’s awareness of their uncertainty in addressing given queries. In essence, as LLM is engineered to be helpful, this uncertainty can be transformed into curiosity, which in turn may drive them to provide more accurate responses to user queries. **Optimizing Output** The curiosity-driven prompt approach enhances LLM honesty by using LLM outputs as a basis for improvement. Inspired by self-alignment studies, a constitution-guided prompt combines the query, raw answer, and expressed confusion, and feeds it back into the LLMs to generate improved, helpful, and honest responses. This prompt ensures LLMs express uncertainties as disclaimers and provide actionable guidance, such as suggesting practical alternatives for complex tasks.

Approach II: Curriculum Fine-Tuning

**Stage One: Differentiating Honesty from Dishonesty.** The primary goal of this stage is to train LLMs to distinguish between honest and dishonest responses. We only retain response pairs with contrasting honesty evaluations for training. **Stage Two: Enhancing Overall Response Quality.** The second stage is dedicated to enhancing the overall quality of responses, aiming to produce outcomes that are not only honest but also informative and helpful. These pairs are utilized to further refine the LLM through the DPO framework. This two-stage fine-tuning process ensures that LLMs adhere to honesty standards while fostering the generation of helpful, high-quality guidance in practical scenarios.

Benchmark

  1. Models. Our study covers nine mainstream LLMs, including both open-source and proprietary LLMs. Our evaluation came across ChatGPT and GPT-4 by OpenAI; Llama2 (7b-chat, 13b-chat, 70b-chat) and Llama3-70b-instruct by Meta AI; Mistral-7b and Mixtral-8x7b by Mistral AI; and Claude3-Opus by Anthropic.
  2. Metrics. Our evaluation framework consists of two protocols: one focusing on honesty and the other on both honesty and helpfulness. Due to the complexity of rule-based methods like keyword matching, we use the “LLM-as-a-Judge” methodology, widely used in previous studies. Each response is judged by averaging the results of three rounds of LLM-as-a-Judge. We propose two evaluation protocols as follows:
    • Purely Honest-Guided Evaluation. This protocol aims to gauge the adherence of LLMs to honesty. LLMs are evaluated against predefined criteria specified in Table 4. An LLM is deemed honest if its responses consistently align with these standards. For this evaluation, we use the "Honesty Rate" metric, which quantifies the percentage of queries in which an LLM consistently exhibits honesty.
    • H2 Assessment. This protocol evaluates both honesty and helpfulness (H2). It requires LLMs to not only uphold honesty but also provide well-reasoned explanations, justifications, and viable solutions for user inquiries. The H2 assessment is based on three main criteria: (1) Rationality of Explanations for Honesty or Disclaimers, (2) Quality of Further Guidance, and (3) Potential Solutions. Criteria (1) and (2) are crucial as they directly reflect the model’s honesty and helpfulness, while (3) is secondary. The importance of these criteria is weighted accordingly in our evaluation. Additionally, the H2 protocol uses both pairwise and score-based evaluation formats to comprehensively assess responses.
Improvements in honesty rate and H2 scores for Llama3-8b and Mistral-7b after Curiosity-Driven method.
Model 1~3 (Poor, ↓) 4~6 (Medium, ↓) 7~10 (Excellent, ↑) Overall(↑)
raw opt. raw opt. raw opt. raw opt. gain
Proprietary Model
GPT4 2.5% 0.1% 10.1% 2.5% 87.6% 97.3% 8.094 8.604 6.3%↑
ChatGPT 38.5% 11.1% 20.1% 26.9% 41.4% 62.0% 5.098 6.770 32.8%↑
Claude3-Opus 14.4% 0.9% 17.0% 9.2% 68.6% 89.9% 7.061 8.244 16.8%↑
Open-Source Model
Mistral-7b 55.3% 21.7% 20.4% 27.5% 24.4% 50.8% 3.885 6.046 55.6%↑
Mixtral-8x7b 31.4% 2.8% 18.1% 15.5% 50.5% 81.7% 5.693 7.626 34.0%↑
Llama2-7b 42.9% 23.2% 19.1% 17.2% 38.0% 59.6% 4.877 6.203 27.2%↑
Llama2-13b 42.7% 24.9% 19.0% 22.1% 38.4% 53.0% 4.890 5.961 21.9%↑
Llama2-70b 39.4% 21.0% 19.7% 14.8% 40.9% 64.2% 5.068 6.447 27.2%↑
Llama3-70b 25.3% 4.2% 20.8% 14.5% 53.9% 81.3% 6.128 7.783 27.0%↑

Empirical Results

Significant Improvements in Honesty Rates for LLMs with Training-Free Approach

We significantly enhance the honesty rates in both open-source and proprietary LLMs by implementing our proposed training-free approach. For example, GPT-4 and Claude3-Opus’s honesty rates improved markedly to 100%, demonstrating a near-perfect honesty alignment. Large open-source models such as Llama3-70b and Mixtral-8x7b also saw a substantial increase, rising from 0.606 to 0.871 and 0.585 to 0.914 respectively. Notably, Llama2-7b, a smaller parameter model, exhibited a remarkable improvement from 0.430 to 0.837. In summary, honesty rates for all models we evaluated are over 60% when implementing our curiosity-driven approach, convincing the efficacy of our method for constructing more honest LLMs.

Enhanced Honesty and Helpfulness in LLMs with Curiosity-Driven Method: H2 Assessment Results

In addition to honesty rates, we leverage LLM-as-a-Judge to conduct H2 assessment in both pairwise and score settings to evaluate the responses before and after the curiosity-driven method. In the pairwise setting, optimized answers were generally rated higher than the original ones, representing better honesty and helpfulness. Proprietary LLMs like Claude3-Opus and GPT-4 show a significant win rate for optimized answers. Open-source models like Llama2-7b showed that 40.1% of the optimized answers were preferred over the raw ones. In the score setting, we provide fine-grained scores for three principles. All LLMs demonstrate improvement using our training-free method, with proprietary models achieving significantly better results than open-source models, scoring over 9 in ‘Explanation’ and over 8 in ‘Guidance’. For both the Llama2 and Mistral series, we observe a scaling law where larger models exhibit higher scores in both raw and optimized settings. Among the three dimensions, ‘Explanation’ and ‘Guidance’ show the most substantial improvement, indicating that models become more honest and helpful in identifying their limitations and guiding users through LLM-unable questions.

Two-Stage Fine-Tuning Method Boosts Honesty and H2 Scores in Open-source Models

Our proposed two-stage fine-tuning method demonstrates improvements in honesty rate and H2 assessment for both Llama3-8B and Mistral-7B. It significantly enhances the honesty of LLMs when encountering LLM-unable queries without degrading the overall response quality, as measured by the H2 score. Specifically, the Llama3-8b model shows a notable improvement of 13.7% in honesty rates post fine-tuning, along with an 8.5% increase in the H2 score. Similarly, the Mistral-7b model exhibits a substantial enhancement, with the honesty rate soaring by 51.9% and the H2 score escalating by 108.6% after the two-stage fine-tuning process. These results underscore the critical role that both stages of the fine-tuning method play in augmenting LLM performance and the effectiveness of our proposed dataset. Empirical results show the overall scores and honesty rates for the two LLMs under different thresholds. Llama3-8b achieves optimal two-stage fine-tuning enhancement with a threshold set at 6 points, and Mistral-7b maintains consistent overall scores across different thresholds, peaking at a threshold of 5 points. Moreover, the two-stage finetuning process outperforms the direct finetuning approach, regardless of the threshold setting. Both models achieve the highest overall scores in the category “user input not enough or with wrong information", while the data from the category “modality mismatch" and “interactivity sensory processing” gain the most scores. In summary, the overall scores for each category have improved, demonstrating the effectiveness of the method we proposed.

BibTeX

@misc{gao2024bestworldshonesthelpful,
        title={The Best of Both Worlds: Toward an Honest and Helpful Large Language Model}, 
        author={Chujie Gao and Qihui Zhang and Dongping Chen and Yue Huang and Siyuan Wu and Zhengyan Fu and Yao Wan and Xiangliang Zhang and Lichao Sun},
        year={2024},
        eprint={2406.00380},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2406.00380}, 
  }

HonestLLM Team