Symposium on calibration in forensic science

Forensic Data Science Laboratory, Aston Institute for Forensic Linguistics

Date & Time: 3 June 2021, 12:00–15:00 UTC

Version 2021-06-08a



Abstract

In the first decade of the 2000s, procedures and statistical models were developed for calibrating the likelihood-ratio output of automatic-speaker-recognition systems. These procedures and models were quickly adopted for calibrating the likelihood-ratio output of human-supervised-automatic forensic-voice-comparison systems. Since at least the early 2010s, recommendations have been made to use the same calibration procedures and models in other branches of forensic science. Interest in doing this is now growing. Published examples can be found in the context of multiple branches of forensic science, including fingerprints, DNA, mRNA, glass fragments, and mobile telephone colocation. There are also published examples of the use of these procedures and models to calibrate human judgements. The 2021 Consensus on validation of forensic voice comparison recommends, and the Forensic Science Regulator of England & Wales’s 2021 Development of evaluative opinions requires, the use of calibration.

This symposium brings together some of the leading researchers in the calibration of the likelihood-ratio output of automatic-speaker-recognition systems and of forensic-evaluation systems. They explain what calibration is and why it is important. They present algorithms used for calibrating likelihood-ratio systems, and metrics used for assessing the degree of calibration of likelihood-ratio systems. They discuss aspects of calibration on which there is consensus, aspects on which there is disagreement, and aspects requiring additional research. They also discuss how to encourage wider adoption of calibration of likelihood-ratio systems in forensic practice.




Introduction

Roberto Puch-Solis

Forensic Data Science Laboratory, Department of Computer Science & Aston Institute for Forensic Linguistics, Aston University

Slides




Calibration in forensic science

Geoffrey Stewart Morrison

Forensic Data Science Laboratory, Department of Computer Science & Aston Institute for Forensic Linguistics, Aston University


Slides


Video

In the first decade of the 2000s, procedures and statistical models were developed for calibrating the likelihood-ratio output of automatic-speaker-recognition systems. These calibration procedures and models were quickly adopted for calibrating the likelihood-ratio output of human-supervised-automatic forensic-voice-comparison systems. They were adopted in both research and casework. The 2021 Consensus on validation of forensic voice comparison recommended that “In order for the forensic-voice-comparison system to answer the specific question formed by the propositions in the case, the output of the system should be well calibrated” and that “forensic-voice-comparison system should be calibrated using a statistical model that forms the final stage of the system”. Since at least the early 2010s, recommendations have been made to use the same calibration procedures and models in other branches of forensic science. Interest in doing this is now growing. Published examples can be found in the context of multiple branches of forensic science, including fingerprints, DNA, mRNA, glass fragments, and mobile telephone colocation. There are also published examples of the use of these procedures and models to calibrate human judgements. In this presentation I answer three questions: What is calibration? Why is it important? How is it performed? I also discuss how this approach to calibration relates to the calibration requirements in the Forensic Science Regulator of England & Wales’s 2021 appendix to the Codes of Practice and Conduct: Development of evaluative opinions.

Dr Morrison is Director of Aston University’s Forensic Data Science Laboratory & Forensic Speech Science Laboratory. Since 2008, he has published multiple papers related to calibration of forensic-evaluation systems, including a 2013 tutorial paper on the topic. He was lead author of the 2021 Consensus on validation of forensic voice comparison.



Calibration in automatic speaker recognition

Luciana Ferrer

Instituto de Ciencias de la Computación, Universidad de Buenos Aires – CONICET


Slides


Video

Most modern speaker verification systems produce uncalibrated scores at their output. Although these scores contain valuable information for separating same-speaker from different-speaker trials, their values cannot be interpreted in absolute terms – they can only be interpreted in relative terms. A calibration stage is usually applied to convert scores into interpretable absolute measures that can be reliably thresholded to make decisions. In this presentation, I review the definition of calibration and explain its relationship with Bayes decision theory. I then present ways to measure the quality of calibration, discuss when and why we should care about it, and show different methods that can be used to fix calibration when necessary.
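One widely used way to fix calibration is to learn an affine (scale-and-shift) map from raw score to log-likelihood-ratio by logistic regression. The following minimal NumPy sketch is an illustration of that idea only, not code from the talk; the function name, the gradient-descent settings, and the synthetic scores are all assumptions:

```python
import numpy as np

def fit_affine_calibration(scores_ss, scores_ds, n_iter=3000, step=0.1):
    """Fit a, b so that llr = a*score + b behaves as a calibrated log-LR.
    Minimises a class-balanced logistic-regression cross-entropy by gradient descent."""
    s = np.concatenate([scores_ss, scores_ds])
    y = np.concatenate([np.ones(len(scores_ss)), np.zeros(len(scores_ds))])
    # Per-trial weights so each class counts equally (effective prior odds of 1)
    w = np.where(y == 1, 0.5 / len(scores_ss), 0.5 / len(scores_ds))
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))  # posterior prob. of same-source
        g = w * (p - y)                         # d(loss)/d(logit) for each trial
        a -= step * np.sum(g * s)               # gradient step on the scale
        b -= step * np.sum(g)                   # gradient step on the shift
    return a, b

# Synthetic, well-separated scores (an assumption, for illustration only)
rng = np.random.default_rng(0)
scores_ss = rng.normal(2.0, 1.0, 500)    # same-speaker trials
scores_ds = rng.normal(-2.0, 1.0, 500)   # different-speaker trials
a, b = fit_affine_calibration(scores_ss, scores_ds)
llr_new = a * 1.0 + b                    # calibrated log-LR for a raw score of 1.0
```

With such a map in place, a calibrated log-likelihood-ratio above 0 supports the same-speaker proposition and below 0 the different-speaker proposition, and Bayes decision theory then gives the optimal decision threshold from the priors and costs.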

Dr Ferrer is a researcher at the Computer Science Institute, affiliated with the University of Buenos Aires and with the National Scientific and Technical Research Council of Argentina (CONICET). She received her PhD in Electrical Engineering from Stanford University in 2009. Her primary research focus is machine learning applied to speech processing tasks.



Calibration in forensic voice comparison

Daniel Ramos

AUDIAS Lab, Escuela Politécnica Superior, Universidad Autónoma de Madrid


Slides


Video

In this presentation, I describe the role of calibration in forensic voice comparison, focusing on the use of automatic systems in a Bayesian decision framework. I describe computation of calibrated likelihood ratios in the context of scenarios and recording conditions typically encountered in forensic casework. I present algorithms commonly used for calibration. I also discuss the importance of calibration in the process of validating forensic-voice-comparison systems, and discuss recommendations and guidelines published by the European Network of Forensic Science Institutes (ENFSI).
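Two algorithms that recur in this literature for the calibration stage are logistic regression and the pool-adjacent-violators (PAV) algorithm. As a rough self-contained sketch (my own code under assumed names and synthetic data, not necessarily the form presented in the talk), PAV fits a monotone map from pooled training scores to log-likelihood-ratios:

```python
import numpy as np

def pav(y, w):
    """Pool adjacent violators: weighted isotonic (non-decreasing) regression of y."""
    vals, wts, sizes = [], [], []          # stack of merged blocks
    for yi, wi in zip(y, w):
        vals.append(yi); wts.append(wi); sizes.append(1)
        while len(vals) > 1 and vals[-2] >= vals[-1]:   # merge violating blocks
            v = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / (wts[-2] + wts[-1])
            wts[-2:] = [wts[-2] + wts[-1]]
            sizes[-2:] = [sizes[-2] + sizes[-1]]
            vals[-2:] = [v]
    return np.repeat(vals, sizes)

def pav_llrs(scores_ss, scores_ds, eps=1e-6):
    """Monotone score-to-log-LR map learned on pooled training scores."""
    s = np.concatenate([scores_ss, scores_ds])
    y = np.concatenate([np.ones(len(scores_ss)), np.zeros(len(scores_ds))])
    w = np.where(y == 1, 1.0 / len(scores_ss), 1.0 / len(scores_ds))  # prior odds 1
    order = np.argsort(s)
    p = np.clip(pav(y[order], w[order]), eps, 1 - eps)  # isotonic posterior, clipped
    return s[order], np.log(p / (1.0 - p))  # posterior odds = LR when prior odds = 1

rng = np.random.default_rng(1)
s_sorted, llr = pav_llrs(rng.normal(2.0, 1.0, 300), rng.normal(-2.0, 1.0, 300))
```

Unlike the affine logistic-regression map, PAV is non-parametric: on the training pool it is the optimal monotone transformation, which is also why it serves as the reference recalibration when measuring calibration loss.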

Dr Ramos is an Associate Professor at the Audio, Data, Intelligence and Speech (AUDIAS) Laboratory of the Autonomous University of Madrid. He is author of numerous publications on applying and measuring calibration, especially in the context of forensic problems. He has served on scientific committees, and has often been invited to present on the role of calibration in forensic science.



Measuring calibration of likelihood-ratio systems

Peter Vergeer

Netherlands Forensic Institute


Slides


Video

In this presentation, I explain the concepts of what constitutes well-calibrated probabilities and well-calibrated likelihood ratios. I briefly describe graphical representations for assessing degree of calibration. I then focus on several metrics designed to assess degree of calibration, and present the results of a study comparing the performance of different metrics. Three metrics are taken from the existing literature, and one is a novel metric. One existing metric is based on the expected value of different-source likelihood-ratio values and the expected value of the inverse of same-source likelihood-ratio values (after Good, 1985), another is based on the proportion of different-source likelihood ratios above 2 and the proportion of same-source likelihood ratios below 0.5 (after Royall, 1997), and the third is Cllr^cal (Brümmer & du Preez, 2006). The novel metric is devPAV (Vergeer et al., 2021).
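As an illustrative sketch of how such quantities can be computed from a set of validation log-likelihood-ratios (my own code with assumed function names; the implementations compared in the study may differ), the Cllr metric and the Good- and Royall-style diagnostics look roughly like this:

```python
import numpy as np

def cllr(llr_ss, llr_ds):
    """Log-likelihood-ratio cost (Brümmer & du Preez, 2006).
    Takes natural-log LRs; returns cost in bits."""
    return 0.5 * (np.mean(np.log2(1 + np.exp(-llr_ss)))
                  + np.mean(np.log2(1 + np.exp(llr_ds))))

def calibration_diagnostics(llr_ss, llr_ds):
    """Good (1985): both expectations equal 1 for well-calibrated LRs.
    Royall (1997): proportions of misleading evidence beyond a threshold."""
    lr_ss, lr_ds = np.exp(llr_ss), np.exp(llr_ds)
    return {"E[LR | ds]": np.mean(lr_ds),
            "E[1/LR | ss]": np.mean(1.0 / lr_ss),
            "P(LR > 2 | ds)": np.mean(lr_ds > 2),
            "P(LR < 1/2 | ss)": np.mean(lr_ss < 0.5)}
```

An uninformative system that always outputs LR = 1 scores Cllr = 1, and smaller is better. Cllr^cal is the calibration loss: the difference between Cllr and its value after an oracle PAV recalibration of the validation data; devPAV is likewise defined from the PAV transform (roughly, its deviation from the identity), so neither is sketched here.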

Dr Vergeer is a research scientist in forensic statistics at the Netherlands Forensic Institute. His research focuses on computer-based methods for evaluation of strength of evidence, and on measuring and improving the performance of human experts. He has published multiple research papers on calibration of likelihood-ratio systems and on measuring the degree of calibration of likelihood-ratio systems.



Panel Discussion

Moderator: Rolf J.F. Ypma

Principal Scientist, Netherlands Forensic Institute

Forensic Data Science Laboratory, Department of Computer Science & Aston Institute for Forensic Linguistics, Aston University

TBA

Video

The presenters will discuss aspects of calibration on which there is consensus, aspects on which there is disagreement, and aspects requiring additional research. They will also discuss how to encourage wider adoption of calibration of likelihood-ratio systems in forensic practice.