Voice cloning is too easy
All the attention goes to frontier LLMs. The rest of AI is moving just as fast.
Most of the discourse around AI right now is about the frontier LLM race. Who has the best reasoning model, which lab just raised another thirty billion. The rest of the field doesn't get nearly as much attention, and I think that's a mistake, because some of it is moving just as fast. Funnily enough, the result below was good enough that some of the family members I showed it to asked why I didn't include the AI-generated part, when in reality everything after I hit play is exactly that: AI-generated.
The model is Qwen3-TTS, open-weight from Alibaba, running locally on Apple Silicon through mlx-audio. You record thirty seconds of yourself reading a script, and from that point on it can say anything in your voice. No cloud, no API key, and the quality is good enough to fool most people on a phone call.
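For a sense of how little is involved, the whole workflow fits in one command. The flags and model id below are assumptions from memory, not a recipe; mlx-audio's interface changes between versions, so check its documentation for the exact invocation:

```shell
# Hypothetical invocation: the mlx_audio.tts.generate module path is real,
# but the model id and flag names here are assumptions.
# The reference audio is the ~30-second recording of you reading a script.
python -m mlx_audio.tts.generate \
  --model Qwen/Qwen3-TTS \
  --ref_audio me_reading_a_script.wav \
  --text "Anything you want said in the cloned voice" \
  --output cloned.wav
```

Everything runs on-device; the only download is the model weights.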
Cybercriminals are already doing this, with better tooling, at scale. The grandparent scam used to require someone who could convincingly fake a voice under pressure in real time. Now it takes thirty seconds of audio pulled from someone's Instagram story and a script. The only barrier is knowing the tools exist.
I converted the model weights to float32 for better quality and published them to Hugging Face, since the community only had bfloat16 versions. The code is on GitHub.
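For context on why the dtype matters: bfloat16 keeps float32's 8 exponent bits but only 7 stored mantissa bits, so a bfloat16 copy of a weight tensor is effectively float32 with the low 16 bits zeroed. A minimal numpy sketch of that loss (simulating plain truncation; the actual conversion went through the model's own tooling, and real converters often round rather than truncate):

```python
import numpy as np

def truncate_to_bfloat16(x: np.ndarray) -> np.ndarray:
    """Simulate bfloat16 storage by zeroing the low 16 bits of each
    float32 value (plain truncation; real converters often round)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

w = np.array([1.0009765625], dtype=np.float32)   # 1 + 2^-10
w_bf16 = truncate_to_bfloat16(w)
print(w[0], w_bf16[0])   # the truncated copy collapses to 1.0
```

The fraction 2^-10 needs ten mantissa bits, more than bfloat16 stores, so it vanishes in the truncated copy; float32 weights keep it.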
The standard advice is to set up a code word with the people close to you, something only the two of you know, so that if someone calls in a familiar voice asking for help or money, you have a way to check it's actually them.