Researchers say it’s possible to hide garbled commands among other noise and surreptitiously trigger voice control on a phone. But it seems highly unlikely the method could be used for a genuine attack in its current form.
A group of eight researchers from Georgetown University and the University of California, Berkeley investigated the proposition that it’s possible to make voice commands which a human can’t detect but a phone’s speech recognition can.
The testing involved playing a recording of the voice commands “OK Google”, “Call 911” and “Turn on airplane mode” and then obscuring them in two ways. The first was to distort them in a manner the BBC likens to the speech effects used for Daleks. The second was to add a range of background noise tracks taken from busy situations such as a casino or a crowd applauding.
The researchers then checked whether the phones understood the commands and triggered the relevant actions. They then hired online workers to listen to the recordings and try to make out the commands.
They found that the phones understood the messages 85 percent of the time when the voice command was not distorted at all, 80 percent of the time when the voice was distorted but without the background noise track, and 60 percent of the time with both the distortion and noise track. However, the phones could only understand the messages in any form when they were within 3.5 meters of the playback speaker.
Testing whether the humans could hear the speech provided some surprising results. As you might expect, with just the background noise and no distortion, the human understanding was almost identical to that of the phones.
With the distortion added as well, the human performance was much worse in two cases: 22 percent success for humans compared with 95 percent for phones for “OK Google” and 24 percent for humans compared with 45 percent for phones with “Turn on airplane mode.”
However, it was a completely different picture with “Call 911”, which 94 percent of humans understood compared with just 40 percent of machines. The researchers speculated it’s possible the human participants had culturally developed a sensitivity to this particular phrase being urgent.
The researchers then recreated the experiment but used a machine learning algorithm to specifically optimize the distorted speech for a particular playback speaker and phone. In this case, the phone could understand the speech 82 percent of the time. Not only did none of the humans understand it correctly, but only 24 percent even attempted to transcribe it, with the rest assuming it wasn’t speech at all.
At the moment, the researchers say that using this technique to attack phones for real would be practically challenging but possible. They say what’s important now is to explore ways to defend against or mitigate such attacks before the technique is refined and improved.