CALLAGHAN, Martin (2025). Synthesising Summaries: A novel Retrieval-Augmented Generation-based pipeline for multi-document summarisation. Doctoral, Sheffield Hallam University. [Thesis]
Documents
Callaghan_2025_PhD_SynthesisingSummariesNovel.pdf - Accepted Version (9MB)
Available under License Creative Commons Attribution Non-commercial No Derivatives.
Abstract
The rapid growth of scientific literature in recent years has created a need for efficient methods to synthesise information from multiple related documents. This thesis addresses that challenge by developing and evaluating novel approaches to multi-document summarisation (MDS) of scientific papers, with a focus on hybrid and deep learning techniques leveraging both extractive and abstractive methods.
The research explores the application of state-of-the-art large language models (LLMs), specifically Google's Gemma 2B and 7B models, to the task of scientific literature summarisation. A key innovation is the integration of Retrieval-Augmented Generation (RAG) techniques to enhance the summarisation process. The study employs a mixed-methods approach, combining quantitative evaluation metrics with qualitative human assessment and the recently developed LLM-as-judge methodology.
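To illustrate the LLM-as-judge idea in general terms (this is a minimal sketch, not the evaluation protocol used in the thesis), the following Python snippet builds a rubric-style judging prompt and parses a numeric score from a judge model's reply. The rubric wording, the 1-5 scale, and the `judge_llm` callable are illustrative assumptions.

```python
# Minimal LLM-as-judge sketch: score a candidate summary against its source.
# The rubric, the 1-5 scale, and the judge model supplied by the caller are
# illustrative assumptions, not the evaluation protocol used in the thesis.
import re
from typing import Callable

RUBRIC = (
    "You are evaluating a summary of scientific papers.\n"
    "Rate the summary from 1 (poor) to 5 (excellent) for coherence and "
    "factual consistency with the source, then explain briefly.\n"
    "Answer in the form 'Score: <number>'.\n"
)

def judge_summary(source: str, summary: str, judge_llm: Callable[[str], str]) -> int:
    """Ask a judge LLM to rate a summary and extract the numeric score."""
    prompt = f"{RUBRIC}\nSource:\n{source}\n\nSummary:\n{summary}\n"
    reply = judge_llm(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from: {reply!r}")
    return int(match.group(1))

if __name__ == "__main__":
    # Stand-in judge that always answers 4; replace with a real LLM call.
    fake_judge = lambda prompt: "Score: 4. The summary is coherent and faithful."
    print(judge_summary("...source text...", "...candidate summary...", fake_judge))
```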
A comprehensive literature review provides the theoretical foundation, covering the evolution of summarisation techniques, the emergence of transformer-based models, and recent advances in LLMs and related tools and techniques. The experimental design involves fine-tuning embedding models, optimising chunking strategies, and developing a RAG pipeline that integrates retrieval mechanisms with generative LLMs.
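As a rough sketch of the kind of RAG pipeline the experimental design describes (chunking source documents, retrieving the most relevant chunks, and assembling a grounded prompt for a generative LLM), the snippet below uses an off-the-shelf sentence-transformers embedder and a simple word-window chunker; the embedder choice, chunk sizes, and example query are assumptions, not the fine-tuned components developed in the thesis.

```python
# Minimal sketch of a RAG-style multi-document summarisation pipeline.
# The embedder, chunking scheme, and prompt are illustrative stand-ins,
# not the fine-tuned components described in the thesis.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def retrieve(query: str, chunks: list[str], embedder, k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    doc_vecs = embedder.encode(chunks, normalize_embeddings=True)
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble retrieved chunks into a grounded summarisation prompt."""
    joined = "\n\n".join(context)
    return f"Summarise the following excerpts with respect to: {query}\n\n{joined}\n\nSummary:"

if __name__ == "__main__":
    papers = ["...full text of paper one...", "...full text of paper two..."]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedder
    query = "main findings on model size versus performance"
    all_chunks = [c for p in papers for c in chunk(p)]
    context = retrieve(query, all_chunks, embedder)
    prompt = build_prompt(query, context)
    # The prompt would then be passed to a generative LLM (e.g. a Gemma model
    # via Hugging Face transformers) to produce the final abstractive summary.
    print(prompt)
```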
Results demonstrate significant improvements in summary quality, coherence, and factual accuracy compared to baseline methods. The fine-tuned Gemma models, coupled with RAG techniques, show promise in handling the complexities of scientific text. The study also reveals trade-offs between model size and performance, with implications for resource-constrained applications.
This research contributes to the field by advancing the state-of-the-art in scientific literature summarisation, providing insights into the effective application of LLMs to this area, and suggesting improved evaluation methodologies. The findings have potential implications for enhancing scientific communication, accelerating literature reviews, and improving access to scientific knowledge.