Large language models (LLMs) have become widely used by the general public around the world. They have impressed millions of users with their fluency and their ability to perform a wide range of tasks. However, their reliability remains an issue: the models manifest a phenomenon dubbed hallucination, generating plausible-sounding text that is not supported by the user-provided data or by a knowledge base. This is dangerous for regular users, who often do not understand the underlying technology of language modeling, assume the LLM has no reason to “lie”, and therefore rarely question the truthfulness of its outputs.
In this project, we aim to improve the reliability of open-source LLMs by employing interpretability methods to better understand how information is stored and used in the model weights. This understanding should allow us to address hallucination either directly, by selectively re-training a portion of the weights, or by detecting and circumventing hallucinations during decoding.
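As a minimal illustration of the second strategy (detection during decoding), the sketch below flags runs of tokens whose model-assigned probabilities stay below a threshold. The threshold, the minimum span length, and the use of raw token probabilities as a hallucination signal are illustrative assumptions, not the project’s final method.

```python
from typing import List, Tuple

def flag_low_confidence_spans(
    token_probs: List[float],
    threshold: float = 0.2,   # assumed cutoff; would be tuned empirically
    min_span: int = 3,        # assumed minimum run length worth flagging
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of token runs whose probability
    stays below `threshold` for at least `min_span` consecutive tokens.

    Low-probability runs serve here only as a rough proxy for potentially
    hallucinated content observed during decoding.
    """
    spans = []
    start = None
    for i, p in enumerate(token_probs):
        if p < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_span:
                spans.append((start, i))
            start = None
    if start is not None and len(token_probs) - start >= min_span:
        spans.append((start, len(token_probs)))
    return spans

# Example: per-token probabilities taken from a decoder's softmax output.
probs = [0.9, 0.85, 0.15, 0.1, 0.12, 0.8, 0.95]
print(flag_low_confidence_spans(probs))  # [(2, 5)]
```

A flagged span could then trigger re-ranking, abstention, or constrained re-generation rather than being emitted as-is.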
To repeatedly measure whether our experiments lead to a decrease in hallucinations without compromising fluency, we need comprehensive and reliable automatic evaluation. This is challenging, as there is no gold standard for automatically assessing the performance and reliability of LLMs. Therefore, we aim to introduce a new evaluation methodology that combines existing and newly developed metrics to assess the LLMs’ strengths and weaknesses.
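To make the shape of such an evaluation concrete, here is a minimal sketch that computes perplexity from per-token log-probabilities as a fluency proxy and aggregates it with a separately obtained faithfulness score per example. The metric names and the simple per-example aggregation are assumptions for illustration; the actual evaluation suite would combine more and richer metrics.

```python
import math
from typing import Dict, List

def perplexity(token_logprobs: List[float]) -> float:
    """Standard perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def evaluate(examples: List[Dict]) -> Dict[str, float]:
    """Aggregate fluency (perplexity) and faithfulness across examples.

    Each example is assumed to carry `token_logprobs` (from the evaluated
    model) and a `faithfulness` score in [0, 1] produced by some external
    judge, e.g. an NLI model comparing the output to the source data.
    """
    ppl = [perplexity(ex["token_logprobs"]) for ex in examples]
    faith = [ex["faithfulness"] for ex in examples]
    return {
        "mean_perplexity": sum(ppl) / len(ppl),
        "mean_faithfulness": sum(faith) / len(faith),
    }

# Toy usage with made-up numbers.
report = evaluate([
    {"token_logprobs": [-0.1, -0.3, -0.2], "faithfulness": 0.9},
    {"token_logprobs": [-1.2, -0.8, -0.5], "faithfulness": 0.4},
])
print(report)
```

Reporting fluency and faithfulness separately, rather than collapsing them into a single score, makes it easier to spot experiments that reduce hallucinations at the cost of degraded text quality.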