CALLAGHAN, Martin (2025). Synthesising Summaries: A novel Retrieval-Augmented Generation-based pipeline for multi-document summarisation. Doctoral, Sheffield Hallam University. [Thesis]
Documents
Callaghan_2025_PhD_SynthesisingSummariesNovel.pdf - Accepted Version (9MB)
Available under License Creative Commons Attribution Non-commercial No Derivatives.
Abstract
The rapid growth of scientific literature in recent years has created a need for efficient methods to synthesise information from multiple related documents. This thesis addresses that challenge by developing and evaluating novel approaches to multi-document summarisation (MDS) of scientific papers, with a focus on hybrid and deep learning techniques leveraging both extractive and abstractive methods.
The research explores the application of state-of-the-art large language models (LLMs), specifically Google's Gemma 2B and 7B models, to the task of scientific literature summarisation. A key innovation is the integration of Retrieval-Augmented Generation (RAG) techniques to enhance the summarisation process. The study employs a mixed-methods approach, combining quantitative evaluation metrics with qualitative human assessment and the recently developed LLM-as-judge methodology.
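To illustrate the LLM-as-judge idea in general terms (this is a minimal sketch, not the evaluation protocol used in the thesis), the following Python snippet builds a rubric-style judging prompt and parses a numeric score from a judge model's reply. The rubric wording, the 1-5 scale, and the `judge_llm` callable are illustrative assumptions.

```python
# Minimal LLM-as-judge sketch: score a candidate summary against its source.
# The rubric, the 1-5 scale, and the judge model supplied by the caller are
# illustrative assumptions, not the evaluation protocol used in the thesis.
import re
from typing import Callable

RUBRIC = (
    "You are evaluating a summary of scientific papers.\n"
    "Rate the summary from 1 (poor) to 5 (excellent) for coherence and "
    "factual consistency with the source, then explain briefly.\n"
    "Answer in the form 'Score: <number>'.\n"
)

def judge_summary(source: str, summary: str, judge_llm: Callable[[str], str]) -> int:
    """Ask a judge LLM to rate a summary and extract the numeric score."""
    prompt = f"{RUBRIC}\nSource:\n{source}\n\nSummary:\n{summary}\n"
    reply = judge_llm(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from: {reply!r}")
    return int(match.group(1))

if __name__ == "__main__":
    # Stand-in judge that always answers 4; replace with a real LLM call.
    fake_judge = lambda prompt: "Score: 4. The summary is coherent and faithful."
    print(judge_summary("...source text...", "...candidate summary...", fake_judge))
```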
A comprehensive literature review provides the theoretical foundation, covering the evolution of summarisation techniques, the emergence of transformer-based models, and recent advances in LLMs and related tools and techniques. The experimental design involves fine-tuning embedding models, optimising chunking strategies, and developing a RAG pipeline that integrates retrieval mechanisms with generative LLMs.
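As a rough sketch of the kind of RAG pipeline the experimental design describes (chunking source documents, retrieving the most relevant chunks, and assembling a grounded prompt for a generative LLM), the snippet below uses an off-the-shelf sentence-transformers embedder and a simple word-window chunker; the embedder choice, chunk sizes, and example query are assumptions, not the fine-tuned components developed in the thesis.

```python
# Minimal sketch of a RAG-style multi-document summarisation pipeline.
# The embedder, chunking scheme, and prompt are illustrative stand-ins,
# not the fine-tuned components described in the thesis.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def retrieve(query: str, chunks: list[str], embedder, k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    doc_vecs = embedder.encode(chunks, normalize_embeddings=True)
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble retrieved chunks into a grounded summarisation prompt."""
    joined = "\n\n".join(context)
    return f"Summarise the following excerpts with respect to: {query}\n\n{joined}\n\nSummary:"

if __name__ == "__main__":
    papers = ["...full text of paper one...", "...full text of paper two..."]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedder
    query = "main findings on model size versus performance"
    all_chunks = [c for p in papers for c in chunk(p)]
    context = retrieve(query, all_chunks, embedder)
    prompt = build_prompt(query, context)
    # The prompt would then be passed to a generative LLM (e.g. a Gemma model
    # via Hugging Face transformers) to produce the final abstractive summary.
    print(prompt)
```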
Results demonstrate significant improvements in summary quality, coherence, and factual accuracy compared to baseline methods. The fine-tuned Gemma models, coupled with RAG techniques, show promise in handling the complexities of scientific text. The study also reveals trade-offs between model size and performance, with implications for resource-constrained applications.
This research contributes to the field by advancing the state-of-the-art in scientific literature summarisation, providing insights into the effective application of LLMs to this area, and suggesting improved evaluation methodologies. The findings have potential implications for enhancing scientific communication, accelerating literature reviews, and improving access to scientific knowledge.