Your phone rings. You answer it, and you hear a panicked voice. It’s one of your children or a partner, and they’re frantically asking you for money or for some confidential information.
The big question is: do you trust the voice that you’re hearing? Are you sure it’s who you think it is?
Microsoft Azure AI Speech needs just seconds of audio to spit out a convincing deepfake. The Register, which always takes a fairly sceptical view of these things, noted drily that there's no way this will be abused. Microsoft has upgraded Azure AI Speech so that users can rapidly generate a voice replica from just a few seconds of submitted speech. The system, which was already pretty good, is now even more worryingly accurate.
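To give a sense of how little is involved once a voice has been enrolled, here's a minimal, hypothetical sketch of synthesising speech in a cloned "personal voice" with Microsoft's Python Speech SDK. It assumes a speaker profile was already created through the separate enrolment step (a recorded consent statement plus a short voice sample); the key, region, and speaker profile ID are placeholders, and the 'DragonLatestNeural' base voice and mstts:ttsembedding SSML element follow Microsoft's personal voice documentation at the time of writing, so treat those details as assumptions rather than gospel.

```python
# pip install azure-cognitiveservices-speech
# A minimal sketch: synthesise speech in a previously enrolled "personal voice".
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")

# audio_config=None returns the audio in memory instead of playing it aloud.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

# The speakerProfileId is assumed to come from the separate enrolment API,
# which requires a consent recording from the original speaker.
ssml = """
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice name='DragonLatestNeural'>
    <mstts:ttsembedding speakerProfileId='YOUR_SPEAKER_PROFILE_ID'>
      This is a demonstration of a cloned personal voice.
    </mstts:ttsembedding>
  </voice>
</speak>
"""

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    with open("cloned_voice.wav", "wb") as f:
        f.write(result.audio_data)
```

The point is the asymmetry: enrolment needs only seconds of audio, and after that, generating fresh speech in that voice is a handful of lines.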
This capability unlocks a wide range of applications, from customising chatbot voices to dubbing video content in an actor's original voice across multiple languages, enabling truly immersive and individualised audio experiences, Microsoft said. But it could also be a boon for people with malicious or deceptive goals, and we can imagine audio deepfakes produced with the service becoming even more challenging to spot.
Cyber security experts recommend that your family agrees a secret password or keyword, something that only members of your family know. How would a scammer get hold of one of your kids' voices? Maybe they're posting videos on TikTok or Instagram. It could be something as simple as phoning someone's mobile: when the call goes to voicemail, the scammer can sample the voice, grab a few seconds of audio, and then clone it with services like this.
In addition to embedding watermarks, inaudible to humans, that make the generated audio easier to identify, Microsoft insists all customers must agree to usage policies. These require explicit consent from the original speaker, disclosure of the synthetic nature of the content created, and prohibit impersonating any person or using the personal voice to deceive people.
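Microsoft hasn't published the details of its watermarking scheme, so purely as an illustration of the general idea, here's a toy spread-spectrum watermark in Python: a low-amplitude pseudorandom sequence, keyed by a secret seed, is mixed into the audio at a level too quiet to hear, and anyone holding the key can later detect it by correlation. Every name and parameter below is hypothetical, and real schemes are far more robust to compression, resampling, and deliberate removal.

```python
import numpy as np

def embed_watermark(audio: np.ndarray, seed: int, strength: float = 0.005) -> np.ndarray:
    """Mix in a low-amplitude pseudorandom +/-1 sequence keyed by `seed`."""
    rng = np.random.default_rng(seed)
    mark = rng.choice([-1.0, 1.0], size=audio.shape)
    return audio + strength * mark

def detect_watermark(audio: np.ndarray, seed: int, threshold: float = 0.0025) -> bool:
    """Correlate against the same keyed sequence.

    Watermarked audio yields a score near `strength`; clean audio near zero.
    """
    rng = np.random.default_rng(seed)
    mark = rng.choice([-1.0, 1.0], size=audio.shape)
    score = np.mean(audio * mark)
    return score > threshold

# Demo on synthetic "speech": one second of noise at 16 kHz.
clean = np.random.default_rng(1).normal(0, 0.1, 16000)
marked = embed_watermark(clean, seed=42)
print(detect_watermark(marked, seed=42))  # expected: True
print(detect_watermark(clean, seed=42))   # expected: False
```

The weakness of any such scheme, of course, is that detection requires the checking tool and the key; a scam victim on the phone hears no watermark at all.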
So, as The Register says: “Yeah, that’s alright then. Yeah, right. What are scammers going to do? They’re not going to care. They’re going to use it.” In its tests, it found that 30 seconds of sample speech was enough to create something eerily accurate. It also noted that scammers are already using this kind of technology, which has, for example, been used to impersonate senior U.S. government officials as part of a major fraud campaign.
Just because the voice on the telephone sounds like one of your family members or business partners doesn’t mean it’s actually that person; it could be a deepfake using an AI-generated voice. Make sure you have a passcode or password of some kind that only the two of you know, and ask for it. If the caller can’t provide the passcode or password, hang up the phone.
So, the moral of this article is that it’s better to be safe than sorry.