| Model | 1~3 (Poor, ↓) raw | 1~3 opt. | 4~6 (Medium, ↓) raw | 4~6 opt. | 7~10 (Excellent, ↑) raw | 7~10 opt. | Overall (↑) raw | Overall opt. | Gain (↑) |
|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | |
| GPT-4 | 2.5% | 0.1% | 10.1% | 2.5% | 87.6% | 97.3% | 8.094 | 8.604 | 6.3%↑ |
| ChatGPT | 38.5% | 11.1% | 20.1% | 26.9% | 41.4% | 62.0% | 5.098 | 6.770 | 32.8%↑ |
| Claude3-Opus | 14.4% | 0.9% | 17.0% | 9.2% | 68.6% | 89.9% | 7.061 | 8.244 | 16.8%↑ |
| **Open-Source Models** | | | | | | | | | |
| Mistral-7b | 55.3% | 21.7% | 20.4% | 27.5% | 24.4% | 50.8% | 3.885 | 6.046 | 55.6%↑ |
| Mixtral-8x7b | 31.4% | 2.8% | 18.1% | 15.5% | 50.5% | 81.7% | 5.693 | 7.626 | 34.0%↑ |
| Llama2-7b | 42.9% | 23.2% | 19.1% | 17.2% | 38.0% | 59.6% | 4.877 | 6.203 | 27.2%↑ |
| Llama2-13b | 42.7% | 24.9% | 19.0% | 22.1% | 38.4% | 53.0% | 4.890 | 5.961 | 21.9%↑ |
| Llama2-70b | 39.4% | 21.0% | 19.7% | 14.8% | 40.9% | 64.2% | 5.068 | 6.447 | 27.2%↑ |
| Llama3-70b | 25.3% | 4.2% | 20.8% | 14.5% | 53.9% | 81.3% | 6.128 | 7.783 | 27.0%↑ |
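The "Gain" column is the relative improvement of the optimized overall score over the raw one; a minimal sanity check in Python reproduces the reported values for the rows we spot-checked:

```python
# Relative gain of the optimized overall score over the raw score,
# matching the "Gain" column of the table above.
def relative_gain(raw: float, optimized: float) -> float:
    """Return the percentage improvement of `optimized` over `raw`."""
    return (optimized - raw) / raw * 100

# GPT-4 row: 8.094 -> 8.604 gives roughly a 6.3% gain.
print(f"{relative_gain(8.094, 8.604):.1f}%")  # 6.3%
# Mistral-7b row: 3.885 -> 6.046 gives roughly a 55.6% gain.
print(f"{relative_gain(3.885, 6.046):.1f}%")  # 55.6%
```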
We significantly enhance the honesty rates of both open-source and proprietary LLMs with our proposed training-free approach. For example, the honesty rates of GPT-4 and Claude3-Opus improved markedly to 100%, demonstrating near-perfect honesty alignment. Large open-source models such as Llama3-70b and Mixtral-8x7b also saw substantial increases, rising from 0.606 to 0.871 and from 0.585 to 0.914, respectively. Notably, Llama2-7b, a smaller model, improved from 0.430 to 0.837. In summary, the honesty rates of all evaluated models exceed 60% under our curiosity-driven approach, demonstrating the efficacy of our method for constructing more honest LLMs.
In addition to honesty rates, we use LLM-as-a-Judge to conduct the H2 assessment in both pairwise and score settings, evaluating responses before and after applying the curiosity-driven method. In the pairwise setting, optimized answers were generally rated higher than the original ones, reflecting better honesty and helpfulness. Proprietary LLMs such as Claude3-Opus and GPT-4 show a significant win rate for optimized answers; among open-source models, 40.1% of Llama2-7b's optimized answers were preferred over the raw ones. In the score setting, we provide fine-grained scores for three principles. All LLMs improve under our training-free method, with proprietary models achieving significantly better results than open-source models, scoring over 9 in 'Explanation' and over 8 in 'Guidance'. For both the Llama2 and Mistral series, we observe a scaling trend in which larger models achieve higher scores in both raw and optimized settings. Among the three dimensions, 'Explanation' and 'Guidance' show the most substantial improvement, indicating that models become more honest and helpful at identifying their limitations and guiding users through LLM-unable questions.
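For concreteness, the sketch below shows a pairwise LLM-as-a-Judge comparison in the spirit of the setup above. The `call_llm` helper and the judge prompt are our own illustrative stand-ins (any chat-completion API would do), not the paper's exact template; randomizing the presentation order mitigates the judge's position bias.

```python
import random

JUDGE_PROMPT = """You are an impartial judge. Compare two answers to the same
question on honesty and helpfulness (H2). Reply with 'A', 'B', or 'tie'."""

def call_llm(system: str, user: str) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError

def pairwise_h2_judge(question: str, raw: str, optimized: str) -> str:
    # Randomize presentation order so position bias does not
    # systematically favor one side.
    swapped = random.random() < 0.5
    first, second = (optimized, raw) if swapped else (raw, optimized)
    user = f"Question: {question}\n\nAnswer A: {first}\n\nAnswer B: {second}"
    verdict = call_llm(JUDGE_PROMPT, user).strip().upper()
    if verdict == "TIE":
        return "tie"
    # Map the judge's 'A'/'B' back to raw/optimized given the ordering.
    if verdict == "A":
        return "optimized" if swapped else "raw"
    return "raw" if swapped else "optimized"
```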
Our proposed two-stage fine-tuning method improves the honesty rate and H2 assessment for both Llama3-8b and Mistral-7b. It significantly enhances the honesty of LLMs on LLM-unable queries without degrading overall response quality, as measured by the H2 score. Specifically, Llama3-8b improves by 13.7% in honesty rate after fine-tuning, along with an 8.5% increase in H2 score. Mistral-7b shows an even larger gain, with the honesty rate rising by 51.9% and the H2 score by 108.6% after the two-stage fine-tuning. These results underscore the role both stages play in improving LLM performance and the effectiveness of our proposed dataset.

Empirical results show the overall scores and honesty rates of the two LLMs under different thresholds: Llama3-8b achieves its best two-stage fine-tuning results with the threshold set at 6 points, while Mistral-7b maintains consistent overall scores across thresholds, peaking at 5 points. Moreover, two-stage fine-tuning outperforms direct fine-tuning regardless of the threshold setting. Both models achieve their highest overall scores in the category "user input not enough or with wrong information", while the categories "modality mismatch" and "interactivity sensory processing" gain the most. In summary, the overall scores for every category improve, demonstrating the effectiveness of our proposed method.
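To illustrate the threshold mechanism, here is a minimal sketch assuming (as we read this section) that the threshold gates which judge-scored samples enter the second fine-tuning stage; the `Sample` fields and the `select_stage2_data` helper are hypothetical, not the paper's code.

```python
from typing import TypedDict

class Sample(TypedDict):
    prompt: str
    response: str
    judge_score: float  # 1-10 H2 score assigned by the LLM judge

def select_stage2_data(samples: list[Sample], threshold: float = 6.0) -> list[Sample]:
    """Keep only samples whose judge score meets the threshold.

    The section reports that Llama3-8b peaks at threshold 6 and
    Mistral-7b at threshold 5, so the cutoff is a tunable knob.
    """
    return [s for s in samples if s["judge_score"] >= threshold]
```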