Fake bin Laden audio tape
On November 30, 2002, the Guardian ran the following story under a very definitive headline: "Swiss scientists 95% sure that Bin Laden recording was fake":
The tape, delivered to the Arab satellite television channel al-Jazeera earlier this month, appeared to provide the first concrete evidence that Bin Laden is still alive because it mentioned recent attacks on western targets.
American experts initially concluded that the voice on the tape was probably Bin Laden, though it is unlikely ever to be fully authenticated because of the recording's poor quality.
The Swiss findings conflict with other research published by the French news magazine L'Express last week.
In that study, Bernard Gautheron, director of the phonetic testing laboratory at the Institute of Linguistics and Phonetics in Paris, concluded there was a "very strong probability" that the al-Jazeera tape was genuine.
But researchers at the Dalle Molle Institute for Perceptual Artificial Intelligence, in Lausanne, believe the message was recorded by an impostor.
In a study commissioned by France 2 television, researchers built a computer model of Bin Laden's voice, based on an hour of genuine recordings.
Using voice recognition systems being developed for banking security, they tested the model against 20 known recordings of Bin Laden. The system correctly identified his voice in 19 of them.
This meant there was only a 5% risk of error in their conclusion that the latest tape is a fake, Professor Hervé Bourlard, the institute's director, told the Guardian yesterday. "It's an automatic system but it's very sensitive," he said. "It picks up things the human ear doesn't pick up."
He agreed that the sound quality of the recent tape was poor but added: "Many of our 20 [test] recordings were also of poor quality. Some were very good, some very bad, but our results were all positive except in one case."
Prof Bourlard, a voice recognition expert, is the author or joint author of 150 research papers and two books, and has worked extensively with the International Computer Science Institute at Berkeley, California.
http://www.guardian.co.uk/world/2002/nov/30/alqaida.terrorism
This conclusion was subsequently used by many 9/11 researchers to support the idea that bin Laden may be dead, that the US was producing fake tapes to justify the "War on Terror", and so on. For some reason, though, the Paris analysis, which found a "very strong probability" that the tape was genuine, got rather less attention. That's odd, because this was also a detailed study, and the experts involved don't appear convinced by the IDIAP methodology:
The Lausanne-based Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) claims it was recorded by an impostor.
The review of the tape, which was first aired on November 12 by the Arabic television network, Al-Jazeera, was commissioned by the French television channel, France 2.
However, having compared the most recent tape with 20 previous recordings attributed to bin Laden, the IDIAP claims the voice is not that of the al-Qaeda leader.
Professor Hervé Bourlard, the institute’s director, told France 2 that he was 95 per cent certain that “it has not been recorded by bin Laden”.
The IDIAP, which is affiliated to the Federal Institute of Technology in Lausanne and the University of Geneva, said the risk of error was just five per cent.
The voice on the tape praises recent attacks against Western targets including the siege of a Moscow theatre by Chechen rebels, the bombing of a French oil tanker off the coast of Yemen, and the Bali bombing.
The speaker also warns America’s allies – particularly Britain, France, Italy, Canada, Germany and Australia – that they will also be targeted if they continue to back the US.
At the time of the tape’s appearance, officials in the United States concluded that it was “probably” bin Laden speaking. They said it was the clearest evidence yet that bin Laden had survived the US-led bombing in Afghanistan.
Rival study
The Swiss findings, however, are in direct contrast to voice recognition tests commissioned by the weekly French news magazine, “L’Express”.
The magazine asked two experts to examine the tape: Bernard Gautheron, director of the phonetic testing laboratory at the University of Paris’s Institute of Linguistics and Phonetics; and Olivier Fiani, a translator and expert on Middle East affairs.
Gautheron concluded that there was a “very strong probability” that the tape was authentic.
In an interview with swissinfo, he cast doubt on whether the computers used by the IDIAP had been subtle enough to pick up all the known nuances of bin Laden’s speech pattern, including his use of language.
“Osama bin Laden’s speech is characterised by a certain number of tics,” he said. “Computers do not take into account these elements which are nonetheless very important.”
http://www.swissinfo.ch/eng/Home/Archive/Bin_Laden_tape_faked.html?cid=3050322
Read a Google translation explaining the French study here.
Press interpretations of scientific studies aren't always reliable, though, in our experience. Newspaper stories on medical studies, for instance, will frequently take the most alarming part of a report and minimise (or entirely ignore) any caveats introduced by the author: they're really just after the headline. To really understand the story we needed to take a look at the full text:
Background information
Among other activities in speech processing, computer vision and machine learning, IDIAP has been involved in automatic speaker verification/authentication for about 10 years, where they incorporate state-of-the-art statistical approaches, as are also used in many other laboratories.
After the release of the latest bin Laden tape, IDIAP was often approached by journalists asking for general information about automatic speaker verification systems: how they worked, and how well they perform. Our answer has always been that these systems perform relatively well in well-controlled environments, such as banking and telephony applications, the main focus of the researchers working in this area. These systems are based on the collection and modeling of many utterances spoken by numerous people, speaking the targeted language, as well as a few utterances, pronounced in clean environments, from each person whose voice print will later have to be identified. In these well-controlled environments, correct verification performance in the range of 95-98% is often reported. However, when working in uncontrolled environments with degraded quality, and/or when there are insufficient training utterances (which is typically the case in forensic applications), this performance level can drop dramatically, making it impossible to draw conclusions with strong certainty.
More recently, IDIAP was also approached by the French TV station France 2, asking again for the same kind of information. In this framework, it was also agreed that IDIAP would illustrate the way state-of-the-art speaker verification systems work by processing the bin Laden recordings that were available at France 2, including the unauthenticated bin Laden recording broadcasted by Quatar Al Jazeera on November 12, 2002. US experts who have heard or processed the tape usually support the conclusion by US law enforcement officials that it probably is bin Laden speaking. However, it is also usually agreed that the latest tape will likely never be fully authenticated because its poor quality defies complete analysis by the best voice/linguist experts or the most sophisticated voice print technology. Although IDIAP fully agrees with these statements, it was decided, mainly motivated by pure scientific curiosity, to go ahead with the experiment and see what conclusion our state-of-the-art speaker authentication system would reach.
Experiment
A few days ago, France 2 thus kindly provided IDIAP with about 1h30 of audio/video recordings, recorded through CNN or Al Jazeera, and including about one hour of authenticated (through video) recordings from bin Laden and about 30 minutes voice signal from his associates or other persons speaking Arabic.
As usually done in the scientific community, and to allow for a fair and unbiased evaluation, these recordings were split into an independent « training set » (i.e., not used to evaluate the system) used to build a statistical model of bin Laden’s voice (see technical detail below), and a « test set » used to evaluate/predict the performance of the system. Usually, the training data should also include a large number of speakers (in the targeted language) to build a so-called « world model » used as a reference of « non-bin Laden examples », and several examples of bin Laden voice. Given the very limited number of « non-bin Laden » Arabic speakers, we decided not to train a new « world model » and to rely on a pre-trained English model.
Although this is a first weakness of our approach, results below will show that it didn’t significantly limit the performance of the system on the data in hand.
Before presenting the system with the latest unauthenticated recording, we extracted 44 recordings of audio data from the above broadcast material (thus covering about 1h30 of audio data). These recordings included:
• 30 recordings authenticated as from bin Laden, which were split in two sets:
  o 15 to train the model, referred to as “train set”: not represented on the model below
  o 15 to evaluate the model: these appear as green squares in the figure below.
• 14 recordings authenticated as from other Arabic speakers and also used to evaluate the model (to evaluate “impostor accesses”): these appear as red squares in the figure below.
The last two sets of recordings, referred to as “test set” thus contained 29 recordings, including 15 recordings from bin Laden and 14 from other Arabic speakers. The quality of these recordings (in the training set, as well as in the test set) was highly variable, ranging from good quality to mediocre and very poor quality. In an informal blind test performed with colleagues at IDIAP (who are not experts in linguistics), including one native Arabic speaker, it was often very difficult to achieve a high human classification rate.
After having optimized our statistical model (briefly discussed below), and keeping in mind the limited amount of data, we used the resulting automatic user authentication system to generate one data point (associated with an identity decision) for each test recording, thus resulting in 29 data points in the two-dimensional space represented in the figure below. According to the figure plot below, all the points above the decision threshold (the line in black, optimized to maximize the decision margin) would be classified as « non bin Laden », while the recordings falling below the decision threshold would be classified as containing the voice print of bin Laden. Thus, it can be seen on the figure below that all of the 29 recordings, but one (the green square above the line), were properly classified.
Based on this experiment, an unreasonable conclusion (often drawn by some journalists) is that the resulting system is reliable at 97% since it made one error over 29 examples. However, when dealing with statistics, drawing this kind of conclusion when dealing with so few examples is simply too premature, and often wrong. In theory, we would need an infinite number of examples to draw any definite conclusion, but the more examples we have, the more reliable the conclusions would be. However, to further test our model, we also recorded two utterances (represented by blue squares in the figure plot below) from a native Arabic speaker at IDIAP repeating and mimicking one of bin Laden’s recording. It can be seen from the plot that these two additional recordings were also properly rejected as « non bin Laden », finally resulting in one error over 31 examples.
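(An illustration added here; it is not part of the IDIAP text. The authors' warning about small samples can be made concrete by putting a confidence interval around the observed error rate. The sketch below uses the Wilson score interval, which is our choice of method rather than anything IDIAP describes, and the function name `wilson_interval` is our own.)

```python
import math

def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for an error rate.

    z=1.96 corresponds to roughly 95% coverage. This is an illustrative
    choice of method, not the one used by IDIAP.
    """
    p = errors / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# The two error counts reported in the study: 1 in 29, then 1 in 31.
for errors, trials in [(1, 29), (1, 31)]:
    low, high = wilson_interval(errors, trials)
    print(f"{errors}/{trials}: plausible error rate from {low:.1%} to {high:.1%}")
```

(With so few test recordings the upper end of the interval comes out well above 10%, which is why a single headline figure such as "97% reliable" overstates what the test set can show.)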
Thus, relying on the relatively good performance of the system on this (too limited!) test data, we then presented our system with the latest, unauthenticated bin Laden recording. The resulting point is represented by a light blue circle pointed by an arrow. According to the system, this last recording would thus not be attributed to bin Laden. However, on top of the limitations already discussed above, it can be seen that this point also falls very close to the decision threshold, which further decreases the confidence we can have in this result.
Conclusion
The work reported here was mainly motivated by pure scientific curiosity, also aiming at showing the possibilities and limitations offered by automatic speaker authentication system in non-optimal conditions (typically, noisy environments and limited amount of recordings). While this study does not permit us to draw any definite (statistically significant) conclusions, it nonetheless shows that there is serious room for doubt, and that it is also difficult to agree with some US officials saying that it is 100% sure that it is bin Laden. When addressing a problem with a scientific perspective (as opposed to a political approach), one has to be ready to also accept the uncertainty of the results. Even if the confidence of these results can be boosted by exploiting multiple automatic systems and multiple human expert opinions, it will never be possible to authenticate the latest bin Laden tape with 100% assurance.
A few words about the statistical user authentication technology used
The speaker authentication technology used here is a state-of-the-art approach based on (text independent) statistical modeling of the spectral characteristics of each voice by a set of multi-Gaussian densities in an acoustic space, optimized to extract at best the voice print characteristics while being independent of the transfer channel characteristics. Of course, the state-of-the-art acoustic features currently used do not perfectly achieve this ideal goal! Typically, the speaker’s model is optimized by adapting a « world model » (also represented by a large set of multi-Gaussian parameters) to the targeted speaker.
During authentication of a new utterance, acoustic features are extracted, and the probability that the model (customer model) of the claimed identity could have generated this utterance is calculated and compared to the probability of the « world model » (corresponding to « anybody else »). If the probability of the customer model is greater than the probability of the world model, the system authenticates the speaker, otherwise it rejects it. These two probabilities correspond to the two dimensional space represented in the figure above, and the decision threshold is represented by the line (also optimized on a separate training data) separating at best the two classes.
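(An illustration added here; it is not part of the IDIAP text. The accept/reject rule described above can be sketched as a comparison of average log-likelihoods under two Gaussian mixtures. Everything in this sketch is an assumption made for illustration: the features are one-dimensional, the mixture parameters are toy numbers, and the names `gmm_log_likelihood` and `verify` are hypothetical. A real system works on multi-dimensional spectral features and adapts the customer model from the world model.)

```python
import math

def gmm_log_likelihood(samples, components):
    """Average log-likelihood of 1-D samples under a Gaussian mixture.

    components is a list of (weight, mean, std) tuples; all values used
    below are made-up toy numbers, not parameters from the study.
    """
    total = 0.0
    for x in samples:
        density = sum(
            w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
            for w, mu, sd in components
        )
        total += math.log(density)
    return total / len(samples)

def verify(samples, customer_model, world_model, threshold=0.0):
    """Accept when the customer model explains the data better than the world model."""
    score = (gmm_log_likelihood(samples, customer_model)
             - gmm_log_likelihood(samples, world_model))
    return score > threshold, score

# Hypothetical models: a narrow "customer" mixture and a broad "world" mixture.
customer = [(0.6, 1.0, 0.5), (0.4, 2.0, 0.7)]
world = [(0.5, -1.0, 1.5), (0.5, 3.0, 1.5)]

accepted, genuine_score = verify([0.9, 1.2, 1.8, 2.1], customer, world)
rejected, impostor_score = verify([-1.5, -0.8, 3.9, 4.2], customer, world)
print(accepted, rejected)
```

(A threshold of zero on the log-likelihood difference matches the simple rule "customer probability greater than world probability". The study instead tunes its decision line on separate data, and a score that lands close to the threshold, as the disputed tape did, gives little confidence either way.)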
For the study of the latest recording attributed to bin Laden, it was not possible to build an Arabic « world model » given the limited amount of recordings. We thus used an English world model available at IDIAP. Of course, the bin Laden model was optimized by using many recordings, of different quality, attributed to bin Laden. The decision threshold was also optimized by using bin Laden’s recordings (customer accesses), as well as all the recordings of the other Arabic speakers available (impostor accesses).
A few words about IDIAP
Currently numbering about 60 scientists, IDIAP (Dalle Molle Institute for Perceptual Artificial Intelligence) is a semi-private, non-profit, research institute located in Martigny, Switzerland, and carrying research and development in the fields of automatic speech and speaker recognition, computer vision, and machine learning. Involved in numerous national and international research projects, IDIAP is also the Leading House of a National Research Center of Competence (NCCR) in « Interactive Multimodal Information Management ».
www.iadp.ch (Web Archive)
(Reproducing PDF formatting in a wiki is difficult, and tedious in the extreme. Go follow the link to get the full picture.)
We can see, then, that nowhere in the study do the authors say they are "95% sure" that the bin Laden recording was fake. In fact, they specifically disown simple percentage reliability conclusions as "unreasonable" and "often drawn by some journalists", while saying the study does not permit them "to draw any definite (statistically significant) conclusions".
The furthest the authors go is to say that there is "serious room for doubt" over the recording's authenticity. But this is tempered by talk of "limitations" in the study. And if you look at the graph, you'll see that the recording in question (the blue circle) is closer to the authenticated bin Laden recordings than any of the non-bin Laden recordings are. In addition, the model wrongly rejected one of the authenticated bin Laden recordings.
IDIAP has done some clever work here; there's no doubt about that. However, the study does not support the certainty in the press headlines, nor does it in any sense prove that the tape was fake. As usual, the reality just isn't that black and white.