hckrnws
Show HN: Voice-Pro – AI Voice Cloning
by abuskorea
Imagine creating a podcast where Mark Zuckerberg interviews Elon Musk – using their actual voices?
What sounds like science fiction is now reality.
Voice-Pro is an open-source Gradio WebUI that breaks the boundaries of audio manipulation.
Powered by cutting-edge Whisper engines, this tool turns voice replication into child's play.
Key Features:
- Zero-shot Voice Cloning
- Voice Changer with 50+ Celebrity Voices
- YouTube Audio Downloading
- Vocal Isolation
- Multi-Language Text-to-Speech (Edge-TTS, F5-TTS)
- Multi-Language Translation
- Powered by Whisper Engines (Whisper, Faster-Whisper, Whisper-Timestamped)
Video Demos:
1. Voice-Pro Usage Tutorial: https://youtu.be/z8g8LMhoh_o
2. Voice Cloning Celebrity Podcast Demo: https://youtu.be/Wfo7vQCD4no
3. Full Demo Playlist: https://www.youtube.com/playlist?list=PLwx5dnMDVC9Y7dAjm9r26...
Whether you're a content creator, developer, or audio experiment enthusiast,
Voice-Pro provides a user-friendly interface to push the boundaries of audio manipulation.
I do think that voice cloning for personal usage has actual genuine uses - in fact there was a relatively interesting news article about a person who was irrevocably losing their voice who had their vocal pattern cloned.
https://www.voanews.com/a/illness-took-away-her-voice-ai-cre...
That being said, it does seem a bit bizarre that the repo's home page is proudly trumpeting the ability to co-opt other people's identities without their permission (and yes your unique vocal pattern is definitely part of your identity - I mean it's used in some forms of biometric data). They're doing the project a bit of a disservice.
Of course there are legitimate uses, which means everyone should have completely unfettered access and nobody selling it should be responsible for irresponsible users. Personally, I’m sick of the government limiting my artistic freedom because the mediums I use might be misused by a tiny group of bad actors. For example, it’s unnecessarily difficult to source pineapple grenades for my large scale abstract punched tin crafts. The other people who live in my apartment building haven’t complained when I asked if they had a problem with it, so what’s the problem? And when I can get ahold of it, white phosphorous makes a great addition to my annual deep-woods pyrotechnic light shows. I just don’t understand this nanny state garbage.
Right? I am an avid keeper of terrariums and micro-ecosystems. Government over-reach means I am having a really hard time seeding my anthrax enclosure.
Ridiculous! Complete overreach. Be strong, oppressed one. “This too shall pass.”
Polonium has useful uses
If nothing else, I can confirm it’s delicious.
Take my upvote you greedy bastard.
Taken, as recommended. It tingled.
It does have actual genuine uses. I'm in the process of recording a series of tutorials for my peers but I'd like them to hear things in my voice so it doesn't sound like I have offloaded the work to someone else.
I don't know if this helps or harms the credibility but I can't really talk more than an hour without seriously straining my voice. So cloning it sounds like a great use-case for someone with a similar problem.
Looking forward to trying this.
I like this idea. I've been playing with the idea of having all my blog entries have corresponding narration with my own voice but I'd love to see some kind of voice cloner + gradio interface that lets me make some adjustments to things like cadence, delivery, etc. (I mean beyond just making me sound like Alvin and the Chipmunks).
I don't know about changing tone but I have used Adobe Podcast editor and it allows you to adjust the words and rearrange what you said so you can cut "umms" and stuff. I know they are constantly adding features so I don't know if you can improve cadence and stuff but it's worth looking at if you have Adobe stuff.
Wondercraft.ai. It's not mine; I just used it for a bit a few months ago.
> so it doesn't sound like I have offloaded the work to someone else.
So, deception. Deception that you feel is justified, but deception nonetheless.
I disagree. Deception is the act of convincing someone of untrue information.
The information I'm conveying is truthful and it's my words. The voice, generated or not, is not what I'm trying to convince people into believing.
That is an interesting perspective. I disagree that there is no deception, but do see the validity in your point. Thank you.
When my IoT geiger counter starts going off, I do want the in-home PA system's voice to be Admiral Adama warning my family of an imminent radiological threat, and preparing the Vipers for launch.
Edward James Olmos if you're reading this, I'm willing to pay a license fee, but then I expect actual recordings and not just AI bullshit. I'm not pirating your voice, you're refusing to let me hire it.
> proudly trumpeting the ability to co-opt other people's identities without their permission
EXACTLY. Clone the wrong person's voice and it's game over.
It's useful for some things, like satire. Presidents Play is a good series on YouTube that uses US presidents' cloned voices for comedic satire.
A gun is useful to shoot someone, what has that to do with it being right or wrong?
Not sure you picked the most cogent example because lots of people will debate you on that topic...
Randy Travis also used AI on his last album after losing his voice.
Thanks for sharing this! But I have some doubts about hidden installation procedures. It imports all functions from one_click (from one_click import *), which points to a compiled file. It then runs functions like install_webui and install_extra_packages. At least suspicious.
> Windows Defender may give a warning about untrusted application and disallow further execution of Voice-Pro. If SmartScreen security level is set to "Warn", just click "More info" and then click "Run anyway". If SmartScreen is set to level "Block" there will be no button to run the installation. In this case, open the properties of the start.bat file, and check "Unblock", apply the change and run the start.bat again.
https://github.com/abus-aikorea/voice-pro?tab=readme-ov-file...
clear as day, do not trust this code
Exactly, and this isn't adding anything significant from what I can see that isn't already achieved in much more clear and openly presented repositories. Take coqui for example. Cloning is as easy as recording an example of your voice and using
```python
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# generate speech by cloning a voice using default settings
tts.tts_to_file(
    text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
    file_path="output.wav",
    speaker_wav="/path/to/target/speaker.wav",
    language="en",
)
```
Perhaps I'm paranoid but this has multiple red flags that make me hesitate to install - even the "too good to be true" aspect of such comprehensive features makes me wonder (which is probably irrational and taking it a bit too far!)
I have resorted to using separate physical computers + vlan network separation when exploring untrusted AI workloads. Yes, it costs, but so does a breach.
Thanks for raising this aspect.
Btw https://github.com/haimgel/display-switch helps a lot.
Try recording the installation process with a camera. The entire installation process is displayed in the Windows command. It's just installing Python packages and downloading the AI model and audio files. That's all.
The file I mentioned is just the beginning... there is a folder full of .dll files, renamed to .pyd. I understand that this is the proprietary part that limits usage to 30 minutes, but I think it is too closed for an MIT license.
Pretty easy for a script to not print everything it does at the command line. You have to inspect the code if you want to be sure.
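As a starting point for that kind of inspection, here is a minimal sketch that lists compiled artifacts and wildcard imports in a local checkout. The suffix list and the function names are assumptions made for illustration; a clean result would not prove a repository is safe, it only shows where source review cannot reach.

```python
import ast
from pathlib import Path

# Suffixes treated as "cannot be reviewed as source" (an assumed, non-exhaustive list).
BINARY_SUFFIXES = {".pyd", ".so", ".dll", ".pyc", ".exe"}


def find_binaries(root):
    """Return files under root whose suffix marks them as compiled artifacts."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in BINARY_SUFFIXES
    )


def find_star_imports(root):
    """Return (file, module) pairs for every `from module import *` in .py files."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8", errors="replace"))
        except (SyntaxError, ValueError):
            continue  # skip files that do not parse as Python source
        for node in ast.walk(tree):
            if isinstance(node, ast.ImportFrom) and any(a.name == "*" for a in node.names):
                hits.append((path, node.module))
    return hits
```

Calling `find_binaries(".")` and `find_star_imports(".")` from the root of a checkout gives a quick list of files to look at first.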
I don't have much real use for celebrity voices (other than fun experimentation), but I'd love to be able to clone my own voice and character voices for the purposes of creating audiobooks / audioplays without having to pay monthly fees with monthly usage limits. So I'm excited by this sort of project!
P.S. Are there any tools for synthetic voice creation? Maybe melding two or more voices together, or just exploring latent space? Would be fun for character creation to create completely new voices.
I'd be interested as well. This is where I imagine the space is going - particularly as the potential for litigation increases around cloning.
Game studios will spin up a bunch of unique virtual voices for all the dialogue of extras. It'll probably be longer before we see replacements of main characters though. There's been some research in speech-to-speech transference as well - this means that company employee A records the character B's line with the appropriate emotional nuance (angry, sad, etc.) and the emotional aspect is copied on top of the generated TTS.
Have you tried ElevenLabs? I used that. Had to record 3 hours of training audio reading books and news articles. But the result was really good.
They're great! They just cost too much for how much output I want.
How much did the training cost?
I’ve used tortoise tts before and trained it on my voice and a mix of voices. It’s not perfect but still impressive.
StyleTTSv2 is pretty good and open source, you can easily traverse its latent space for voice
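For what "melding two voices" could look like in practice, here is a minimal sketch that assumes the model exposes each voice as a fixed-length embedding vector. `blend_embeddings` is a hypothetical helper, not part of StyleTTS2 or any other library, and whether a blended vector yields a natural-sounding voice depends entirely on the model.

```python
def blend_embeddings(a, b, weight):
    """Linearly interpolate two equal-length voice embeddings.

    weight=0.0 returns voice a, weight=1.0 returns voice b.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]


# Toy two-dimensional vectors stand in for real embeddings here.
halfway = blend_embeddings([0.0, 2.0], [1.0, 4.0], 0.5)
```

Sweeping `weight` from 0 to 1 and synthesizing a sample at each step is one simple way to explore the space between two voices.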
Similarly, I’m not excited by “voice cloning” at all, but I’d like to have very high quality, natural sounding TTS. All of the projects that do that seem to be geared towards also allowing arbitrary voice cloning based on short audio clips I’ve noticed.
Isn't it funny how some text changes the voice in your head? Now you're hearing the best voice. It's amazing. I tell you. It's the greatest voice. Everybody’s talking about it. They are saying it's incredible. They say they've never heard as beautiful a voice before.
When Arnold Schwarzenegger was governor of California, he refused clemency for notorious gang founder Stanley "Tookie" Williams, who was sentenced to death for four murders in 1979.
https://www.ocregister.com/2005/12/12/governors-full-stateme...
Reading over the governor's statement explaining his reasons for denial of clemency, my brain couldn't help but do so in an Arnold voice. Sometimes, to amuse friends, I would read portions of it aloud while doing the voice.
Maybe it's a bit tasteless, like the anime-girl Demon Core memes, but there's just something about hearing the legal and administrative justification for proceeding with an execution in the voice of the Terminator.
I'm the same way with famous YouTubers. If I see "Guru Larry" Bundy Jr. or Clint "LGR" Basinger leave a comment on someone else's video, my brain reads it in their voice.
I needed until "Everybody’s talking about it" to hear it in his voice :)
Please no spoilers!
Voices can be beautiful.
> Windows Defender may give a warning about untrusted application and disallow further execution of Voice-Pro. If SmartScreen security level is set to "Warn", just click "More info" and then click "Run anyway". If SmartScreen is set to level "Block" there will be no button to run the installation. In this case, open the properties of the start.bat file, and check "Unblock", apply the change and run the start.bat again.
https://github.com/abus-aikorea/voice-pro?tab=readme-ov-file...
hard pass and anyone who reads this and continues is bonkers
Doesn't that basically apply to all binary executables? Anything new and unrecognised by the scanner.
There's also the fact that there's a load of precompiled binary files in the app directory https://github.com/abus-aikorea/voice-pro/tree/main/app - sure, they might be the binaries from compiling the source code you see in the repo, or they might be something else. Roll the dice.
Running a Python application using a Windows batch file is not a special task at all. Oobabooga and AUTOMATIC1111 work in the same way. They also have the same issues regarding Windows Defender.
https://github.com/oobabooga/text-generation-webui https://github.com/AUTOMATIC1111/stable-diffusion-webui
They are complaining about the binary files, not the batch files.
This application is executed in a virtual environment (venv) created using Miniconda, independent of the Windows OS. It does not damage the Windows OS.
If you have concerns or doubts about telemetry or spyware, there are countless software options available for detection. Give it a try.
Python venvs are not intended to provide isolation from the host system and therefore do not provide any isolation from the host system.
So yes, the app can certainly harm the OS, and the venv would not provide any protection against this.
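A small sketch of that point: code running inside a venv or conda environment can still write anywhere the current user can. The function names and the demo file name are made up for this illustration.

```python
import os
import sys
import tempfile
from pathlib import Path


def in_virtualenv():
    """True when running inside a venv, or with a conda environment activated."""
    return sys.prefix != sys.base_prefix or bool(os.environ.get("CONDA_PREFIX"))


def write_outside_env(directory=None):
    """Write a file to a directory that is not part of the environment.

    Defaults to the system temp directory. Nothing about the environment
    blocks this; only ordinary OS permissions apply.
    """
    target = Path(directory or tempfile.gettempdir()) / "venv_isolation_demo.txt"
    target.write_text("written from inside the environment\n")
    return target
```

The same holds for reading files, opening network connections, or spawning processes. An environment only changes which packages get imported.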
I don't have concerns, because I won't be running this code
I'm legitimately wondering what you recommend from among those options.
This is absolutely not true. In fact, considering the deception of this post, if the person making this post is associated with the project then the project should be considered malware.
My neighbour is a detective and did a course on crypto scams. He told me scammers call someone's cell phone, record their voicemail greeting and use that to clone their voice. They can then have a very real life conversation with their grandparent and take their money.
I'm all for innovation, but I don't really see the use case of cloning random voices to make podcasts? Listening to Zuck interview Elon? ok...?
It's really easy for a technical person to do as well.
I use Coqui TTS[0] as part of my home automation, I wrote a small python script that lets me upload a voice clip for it to clone after I got the idea from HeyWillow[1], and a small shim that lets me send the output to a Home Assistant media player instead of using their standard output device. I run the TTS container on a VM with a Tesla P4 (~£100 to buy) and get about 1x-2x (roughly the same time it'd take to say it, to process) using the large model.
Just for a giggle, I uploaded a few 3-5 second clips of myself speaking and cloned my voice, then executed a command to our living room media player to call my wife into the room; from another room, she was 100% convinced it was myself speaking words I'd never spoken.
I tried playing with a variety of sentences for a few hours and overall, it sounded almost exactly like me, to me, with the exception of some "attitude" and "intonation" I know I wouldn't use in my speech. I didn't notice much of an improvement using much longer clips; the short ones were "good enough".
Tangentially, it really bugs me that most phone providers in the UK insist you record a "personal greeting" now before they'll let you check your voice mail box, I just record silence, because the last thing I want/need is a voicemail greeting in my voice confirming to some randomer I didn't want calling me, who I am and that my number is active, even more so knowing how I can clone any voice to a reasonably good accuracy with just a few seconds of audio.
[0] https://github.com/coqui-ai/TTS [1] https://heywillow.io/
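The commenter's shim is not shown, so the following is only a guess at its shape: a sketch that builds a call to Home Assistant's REST API service `media_player.play_media`. The host, token, entity ID, and audio URL in the usage note are placeholders, and the sketch assumes the generated WAV is already served over HTTP somewhere the media player can reach.

```python
import json
import urllib.request


def build_play_media_request(base_url, token, entity_id, media_url):
    """Build (but do not send) a POST asking a media player to play an audio URL."""
    payload = {
        "entity_id": entity_id,
        "media_content_id": media_url,
        "media_content_type": "music",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/services/media_player/play_media",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # long-lived access token
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it, for example with `build_play_media_request("http://homeassistant.local:8123", "YOUR_TOKEN", "media_player.living_room", "http://tts.local/output.wav")`.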
The best thing about crypto is that it is an ever growing bug bounty program for all aspects of authentication :)
Technically, wouldn't a simple "Hold on, I'll call you back" test call stop that?
Scammers will use pressure and emotion. "Grandpa they put me in jail, I need you to bail me out please, there's not much time!" The last thing on the victim's mind is to hang up on what sounds like their crying distressed grandson to call them back. Sometimes even calling back won't work, the real grandson isn't picking up their phone and the scammer is saying that it's because they're in jail and their phone was taken.
I've been thinking a lot about this possibility. I think people will have to come up with family passwords eventually. A word or phrase that is regularly practised, but strictly private, for verification in times of crisis.
For example, my family's passphrase is- just kidding.
Either that or Android and iOS will add something like Caller ID but with actual authentication.
My family already does this.
Mmm. Safewords.
[flagged]
[flagged]
[flagged]
I read the entire thing. My point was that mental capacity and acuity (including the ability to resist panic scamming) go down especially after 80, and gullibility goes up. No matter how smart you were prior. And that this is a vulnerability for everyone. In 2023, Americans aged 60 and older reported losses exceeding $3.4 billion due to scams, marking an 11% increase from the previous year. The median loss in a romance scam for ages 70 and older is $9,000. And now there's also pig-butchering. When combined with isolation and a lack of digital literacy, everyone's older family members are vulnerable.
And my point is that coming up with ways for a fully calm, collected, reasonable person to detect a scammer is a waste of time, because that’s the easy part. The hard part is being calm, collected, and reasonable enough to actually consider that you might be getting scammed.
And that is hard. For most people, extremely hard. For people who lived most of their lives before the era of cheap and fast worldwide communication, it’s even worse. For people with declining mental abilities, it’s worse yet. Saying “oh you can expose the scam just by saying you’ll call them back” is looking at the wrong thing.
So yeah, you’re making a great point that meshes well with my own and definitely does not deserve to be wrapped in snideness.
ah. ok. I apologize for misinterpreting. You're saying that because these scams use irrational techniques, there is no point to expecting "reasonableness" to work?
Exactly. Scam protection needs to focus on staying calm and being open to the possibility of a scam. Once you’re in a good mental state and the thought has occurred, “hey, this might be a scam,” then you’ve basically already won. Finding a question that can authenticate the purported family member is trivial by comparison.
Comment was deleted :(
Yes, if the callee has reason to believe the caller isn't who they say they are. But this will never enter the mind of someone who's retirement age.
Some old people become very gullible.
In all fairness, the number of old people who even know that realistic recreations of their loved ones' voices are possible is probably pretty low.
I'm looking down the comments, but not really seeing much about what this actually is, by my very quick look, it's a front end for f5-tts with a yt-dlp and whisper?
Is there anything new in this?
It’s wrappers all the way down
Yeah they made an easy to use frontend. Don't be the dropbox guy
We can't just keep saying "Don't be the dropbox guy" as a comeback to criticism of new technology. Anyone who uses that phrase should have to place a bet in a prediction market that only pays out if the product they're talking about succeeds. Blindly supporting stuff out of a sort of "Pascal's Wager against looking foolish later" should have some cost if you're wrong.
Let’s default to being supportive and very careful with being negative.
That kind of imbalance makes it easier for scammers and hucksters to get away with things. It is not a feelgood prescription with no cost.
This is another cost of scamming: the cynicism it creates.
I completely agree with you. This is just a web front-end, and there's nothing new about it. However, it's very easy to use, and it's not easy to create something like this.
Wind your neck in.
I simply asked "is there anything new in this?" because I was interested to know if, you know, there was anything new in this.
> When Windows Defender mistakenly recognizes a [virus] as a Trojan, this is often called a 'False Positive'. To solve this problem, you can go through the following steps:
Not to mention a directory full of binaries which could do who knows what. The author is asking people to turn off their antivirus, execute their code as admin, and be fine with it running binary files doing whatever.
Yeah, I also noticed the install instructions are to run this batch file that gets administrator access and starts downloading things…
It's not any worse than all the projects on github with an "easy" install instructions of "curl ... | sudo sh". Heck, even an innocent "sudo make install" command can easily contain a malicious payload.
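One common alternative to piping a download straight into a shell is to save the script, read it, and compare its hash against a checksum the project publishes through a separate channel. A minimal sketch, assuming such a published SHA-256 exists (the function names are illustrative):

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex):
    """True when the file's digest equals the published one (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A matching checksum only shows the file is the one the publisher intended to ship. It says nothing about whether the publisher is trustworthy.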
It's not really the sort of tool that should require admin rights though.
> Not to mention a directory full of binaries which could do who knows what. The author is asking people to turn off their antivirus, execute their code as admin, and be fine with it running binary files doing whatever
If it requires dependencies, how else do you expect it to work?
Vendoring.
Yea not to mention the entire homebrew ecosystem is built around trusting random people's shell scripts.
MacOS devs blindly trust it like it's the app store.
The assumption is that maintainers at Homebrew are reviewing each pull request before being merged, though it's obviously not a full security audit. Homebrew will also use macOS's sandboxing if a formula needs to be built during installation, which will limit file access to specific Homebrew directories and restrict network access.
But I agree that everyone should review the Homebrew install script for any package they're installing if they're concerned about security.
A simple `brew cat <packagename>` (possibly piping to bat if you want syntax highlighting) should spit out the ruby install formula for that package, for inspection.
Yeah it’s not great but it’s definitely not unusual. And windows reputation-based execution blocking does have false positives. I work for a company that has some very very popular products and some that only see a few dozen downloads per week, and despite being signed, it still takes a while for new versions to build enough rep to not trigger the block.
These tools make it very easy to scam vulnerable people, and have pretty limited use otherwise.
I'm absolutely using celebrity voices for my Home Assistant voice. Amazon has spent the last couple years removing the voices for Alexa that people had paid for.
I’d love some more info on using custom voices in HA. I have an esp32-s3-box that I am setting up over the holiday to do voice with HA.
If you have a how-to, I’d love to work on one for my home. I feel like this is all right around the corner…
To be fair, they’ve got pretty serious potential for letting tech companies get paid for a seasoned voice actor’s unique delivery, tone, inflection, etc rather than the voice actor themselves.
> they’ve got pretty serious potential for letting tech companies get paid for a seasoned voice actor’s unique delivery, tone, inflection, etc rather than the voice actor themselves.
I think you mean "steal the labor of an actor"?
Sure, and people that already agree with you will feel good reading it, but other people who don’t agree see it as an attack. It’s pretty much impossible to slip a new idea into someone’s mind if your approach made them slam the door before even considering it. So what’s the benefit of saying it like that?
It calls attention to the ethical implications of using a part of someone else's personal identity without their direct involvement.
So does what I said. Someone taking pay for someone else’s work is pretty unambiguously shitty. But when you call taking anything that isn’t a physical item theft, a large percentage of people— especially in the ‘data wants to be free’ crowd— will roll their eyes, think “that’s ridiculous... they aren’t stealing anything. That voice actor still has their voice” and just stop listening. The only people that feel the impact of statements like that are people that already agree. It turns it from an intellectual discussion to a reinforcement of existing tribes. Divisive language works for rallying those who already agree around a specific cause but it’s not even useless— it’s counterproductive— for changing people’s minds. When’s the last time someone you disagreed with changed your mind by being more aggressive towards your stance, and more terse in their portrayal of the dichotomy? If you can even think of one time that it has, you’re in the extreme minority.
Indirect involvement can still be ok within the confines of a license agreement for using the actor's voice.
But this requires a legal framework that mandates such licenses and effective enforcement / prosecution of violations.
As far as I know, most countries are lagging behind when it comes to updating legislation to set binding rules around that.
> Indirect involvement can still be ok within the confines of a license agreement for using the actor's voice.
This assumes existence of a license agreement or likeness/right of publicity law that prevents unauthorized use. But this is far from the case.
Companies have shown willingness to use actors’ voices to create synthetic voices without permission, compensation, or regard for their livelihoods. [1][2][3]
[1] https://animehunch.com/popular-japanese-voice-actors-band-to...
[2] https://www.theatlantic.com/technology/archive/2024/05/eleve...
[3] https://www.yahoo.com/entertainment/morgan-freeman-calls-una...
Of course we need laws in place to require such licensing. The fact that people are having their voice stolen now does not mean that there should never be a case where a voice can legally be cloned and used by a third party.
Precisely. We must recognize this as a fundamental issue of workers’ rights and personal autonomy in the digital age, beyond viewing it as a technical challenge. Without proper protections, voice cloning technology risks concentrating power in large companies and undermining creative workers’ economic security.
It’s weird to me that people look at a technology and then assume from their reckoning that they know all the uses for that technology immediately. Most technological progress happens because someone notices a creative use for something that already exists which nobody else has noticed.
I like tools like these cause they make zero trust default even more obvious, and their "pretty limited use" is saving people hours of work.
They are pretty good for leaving messages for my blind friend. I generally find calling / voice texts a waste of time (I type and read far faster than I talk or listen, not to mention the ability to reread etc), but my blind friend prefers getting voice messages when on his phone and this works for us. I type and send and when he comes back with something, Whisper makes it into text for me.
Gen AI space to everyone else: “Your computer scientists were so preoccupied with whether or not they should, they didn’t stop to think if they could just do it anyway”
How many victims will it take for lawmakers to do something about this?
It's already illegal to scam somebody. While it's always positive to protect people more, what can be done here? Any alternative I can imagine is massively oppressive of the current state of the software industry.
You can regulate large companies, you can regulate published software sold for profit, but it's impossible to regulate free and open source tools.
You essentially have to regulate access to computing power if you want to prevent bad actors doing bad things using these sort of tools.
>You can regulate large companies, you can regulate published software sold for profit, but it's impossible to regulate free and open source tools.
Regulation is putting legal limitations on things, if it is impossible to regulate free and open source tools then it would be impossible to regulate murder and lots of other things, but it turns out it isn't impossible, sure - murder happens - but people get caught for it and punished.
Sorry, but this argument is much like the early internet triumphalism - back when people said it was impossible to regulate. Turns out lots of countries now regulate it.
It depends on what you do with the tool. Going with your murder analogy, if there's a stabbing epidemic what do you do? 1) Ban knives 2) invest in public safety 3) investigate the root causes and improve on them?
I'm also not sure what's so regulated about the internet besides net neutrality in certain countries. Of course the government can put limits on the network, like banning services, but it's easy since they are rather easy to target. With content traveling on the network it's much harder to say if it's legit or not.
> lots of countries
What about those countries that don't regulate it and people will keep pumping out better, leaner and faster models from there? Spreading software is trivial, all you achieve is the public won't be aware of what's possible.
The more I think about it, if anything should be regulated, it's a requirement to provide a third-party (probably government-backed) ID verification system so it would be possible for my mom to know it's me calling. Basically, kill caller ID spoofing.
>I'm also not sure what's so regulated about the internet besides net neutrality in certain countries.
Generally, things are regulated on the internet that supposedly were never going to be regulated because they were on the internet. Example: sales taxes. Perhaps you are old enough to remember when sales tax collection would never be enforceable on internet transactions: those idiot lawyers don't know, it's on the internet, the sale didn't happen in that country or in that state, sales taxes will never happen on the internet, hah hah. It's unenforceable, it is logically undoable, there are so many edge cases. Ugh, the law just does not understand technology!
oops, sales taxes now on internet purchases.
GDPR is another example of things that are regulated on the internet that basically most of HN years before it happened was completely convinced would be impossible!!
If this thing becomes too big a problem for society, regulations will be made, with varying levels of effectiveness, I'm sure.
And then in twenty years time we will be saying what, you can't regulate genital eating viral synths because a guy can make those in his garage and spread them via nasal spray, this technology is unstoppable and unregulatable, not like some open source deepfake library!!
It's always amusing listening to techies' musings on law... lots of misunderstandings, I suspect due to the helpful but inaccurate "code but for humans" analogy.
Obligatory/relevant xkcd: https://xkcd.com/538/
Lots of countries impose exactly what specific regulations with respect to open source tooling?
The closest thing I can think of is maybe the regulation of DRM ripping tools, but they're still out there in the wild and determined actors can easily get ahold of them. So I'm not at all confident that regulation will have any measurable meaningful effect.
>Lots of countries impose exactly what specific regulations with respect to open source tooling?
That something is not currently regulated does not mean it can never be regulated. Further, it does not seem likely that they would regulate open source tooling itself, but rather some uses, and if the open source tooling allowed those uses then what would happen is -
github and other big sources of code would refuse to host it as containing not legally allowed things, so for example if they regulated it in the U.S then Github stops allowing it, and everyone moves to some European git provider.
At the same time bigger companies will stop using the library because liability.
Europe then regulates and it can't be in European git repos.. at some point many devs abandon the particular library because it's not worth it (I get it, this is actually for the love of doing the illegal thing so they won't abandon it, but despite the power of love most things in this world do not actually run on it)
Can determined actors get ahold of them and do the things with them the law forbids them to do, sure! That's called crime. Then law enforcement catches determined actors and puts them in prison, that's called the real world!
Will criminals stop - nope because there is benefit to what they're doing. Maybe some will stop because they will think screw it I can make more money working for the man. And some will be caught sooner or later. And maybe in version two of the regulations there will be AI enhancements - this crime was committed with AI allowing us to take all your belongings and add 10 years to your sentence and deprive you of the right to ever own a computing device again...etc. etc. And some people will stop and others will get more violent and aggressive about their criminal business.
I don't know necessarily what measurable meaningful effect means, for some people it will be measurable and meaningful, for some not, for some of society the regulation would in many ways be worse than what it is fighting against. I'm not saying regulation will solve problems 100%, I'm just saying this whole they can't regulate us thing because "TECH!!!" that developers seem to regularly go through with anything they set their eye on is a pipe dream.
The fable of the "determined actor".
The "determined actor" can get bombs, tanks, fissile material. There, no one says "WHELP they can get it anyway so why bother regulating it LMAO" - somehow this is different for anything not physical?
> impossible to regulate free and open source tools
BS. Can you imagine such legislation? Yes, thus it can be done.
As an early example, the CRA (Cyber Resilience Act) already contains provisions about open source stewards and security. So far they are legal persons, aka foundations, but could easily relate to any contributor or maintainer.
As I made the comment, I can't really imagine anything that's not so absurd that has a more than zero chance of happening.
Seriously, what can anybody do about random hacker Joe publishing under the name XoX? Even if they burn GitHub and friends to the ground, if something is useful it will be really really hard to get rid of it. Remember youtube-dl? It's now https://github.com/yt-dlp/yt-dlp
If they make anything that cripples open source development they will feel it quite soon when they realize that it also cripples their world as much of the tooling and infrastructure also depends on it.
Killing open source is like killing the internet itself.
Consequences never stopped anyone.
Your example with yt-dl doesn't matter.
Open source/free software inherently relies on copyright and all state legal infrastructure. Once you operate outside, it's no longer open source/free software.
Can you host software in a way that's really hard to block? Sure. There is onion routing and plethora of other options.
But that's no longer open source/free software. You are in a realm of dark web and marketplaces.
I do maintain a semi-popular open source project that I took over after about a year of inactivity, and I seriously considered quitting because of the CRA. It's quite easy to cripple/kill something when it basically runs on volunteering of your free time.
[dead]
Serious question: what do you think lawmakers should do?
For people's image being used without their permission: strengthen U.S. right of publicity laws with private right of action, enabling people to sue for unauthorized use of their voice or likeness.
Digital signatures as part of audio/video that can't be easily modified or faked which can trace the origin of a piece of media. Some camera manufacturers are already working on it.
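A rough sketch of the idea, not any manufacturer's actual scheme: the capture device computes a tag over the exact media bytes, and any later modification invalidates it. This uses HMAC from the Python standard library as a stand-in; HMAC is a symmetric keyed tag, so the verifier needs the same secret, whereas real provenance schemes would use public-key signatures so that anyone can verify. The key and the function names `sign_media` and `verify_media` are made up for illustration.

```python
import hashlib
import hmac

# Hypothetical device key; a real camera would keep a private key in secure hardware.
DEVICE_KEY = b"example-device-key"


def sign_media(media: bytes, key: bytes = DEVICE_KEY) -> str:
    """Return a hex tag bound to these exact bytes and this key."""
    return hmac.new(key, media, hashlib.sha256).hexdigest()


def verify_media(media: bytes, tag: str, key: bytes = DEVICE_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_media(media, key), tag)


original = b"RIFF....placeholder audio bytes"
tag = sign_media(original)

print(verify_media(original, tag))         # True: untouched media verifies
print(verify_media(original + b"x", tag))  # False: any edit breaks the tag
```

Note that this only proves a recording came unmodified from a key holder; it says nothing about media that never carried a tag in the first place.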
How do you propose to keep watermark-free models out of the hands of evildoers? I can't build my own digital camera or laser printer, but I can certainly write software.
I don't have a good solution, but maybe legislation helps. There may not be a foolproof solution but I think the more that such devices are widely used, the less likelihood there may be of e.g. a court case hinged on bad evidence.
how many victims did it take for lawmakers to do something about Photoshop/GIMP/etc?
Bulldozing grandma is just the cost of technological progress /s
This tech is not only great for bulldozing grandma, it's great at stealing content from other creators and rebranding it as your own. Based on the GitHub, it kind of seems like that's exactly what's being advertised as the use case. Steal content from the BBC, cut it up, and pull the noise out/isolate the vocals/revoice the content so the algorithm can't detect the plagiarism easily. The image detection is nowhere near the audio detection for copyright strikes.
There is a massive problem with this on YouTube. Pretty much every category on YouTube now has a host of these bots trawling content and playing the YouTube strike system like a banjo. There are channels dedicated to showing you how to set up these content mills. This tool can make you good money.
First generative AI destroyed Google search, and now it has pretty much destroyed YouTube. Social platforms, including this one, are probably goners too. We live in interesting times.
This tech is going to be ubiquitous, it's just too easy to distribute it. Grandma better start adapting now.
Because people make it so, not because the natural order of the world gets us there
For some reason "because we can" validates that we should. Any jackass has the power of a research team of PhDs. It's kinda weird.
Indeed. Humans ascended to dominance because we can cooperate. This every-man-for-themself idea is an aberration, not the natural order as so many claim. It’s rather astounding to think otherwise considering the logistics of how we’re communicating right now.
Cooperation works if the potential damage caused by a rogue actor is sufficiently low. Otherwise, it's too easy to sabotage things. This is why we don't want random rogue states to have nukes. AI will give so much leverage to rogue actors that it will significantly shift the game theory in favour of not cooperating.
> Cooperation works if the potential damage caused by a rogue actor is sufficiently low. Otherwise, it's too easy to sabotage things. This is why we don't want random rogue states to have nukes. AI will give so much leverage to rogue actors that it will significantly shift the game theory in favour of not cooperating.
Governments successfully collectively controlling dangerous things so they don’t fall into the hands of rogue bad actors fundamentally opposes the extreme individualist every-man-for-himself perspective in every conceivable way. It’s the absolute opposite of “it’s everybody’s responsibility to protect themselves because everybody else is only going to look out for themselves.”
And when individuals have that much leverage, collective action is the only conceivable way to oppose it. Some of those things might be cultural, like mores, some might be laws, some might be more martial. I don’t see how extreme individualism even theoretically could be more powerful.
Are you suggesting government action against putting up code like this to GitHub? It’s ok if you are, but I want to put into more concrete terms what we’re talking about.
You’re the one that made the direct government control analogy. I mentioned a number of non-individualistic mechanisms in my previous comment. I’m not going to keep engaging in a fishing expedition of things to argue about — I think it’s pretty clear what aspect of your stance I disagree with— and am going to leave it at that.
So you don't have a concrete suggestion to solve the scamming problem?
Demanding responsible behaviour from everybody is not going to work. Some people don't care about negative externalities that much and it's enough if only a few of them decide not to play ball. So either grandma needs to adapt which will upset some people or distributing the tech should be regulated/prosecuted which will upset another group of people.
I think either way grandma needs to adapt though since Russian scammers and trolls are still going to run scams with fake voices.
how very politically correct of you to pretend it's Russians who scam your grandmas
Insert any other country you like that doesn't have extradition agreements with the United States. In any other country, the law can at least ostensibly be enforced, even if it isn't always.
You can’t adapt around brain age making it more difficult to distinguish truth from lies.
Yeah, I don't really get the hullabaloo. If granny doesn't have the mental fortitude to keep up with the times, she shouldn't be managing her own money. I guess better her son/daughter than a scammer, but both are better than letting money rot. Put granny on food stamps, pay $1 for her rent-controlled housing, and be done with it.
Are we forgetting that there are many elderly people without living descendants?
Quit being a doomer or keep it to yourself. This reminds me of the sound boards that were popular in the early 2000s except way more versatile. Some things are just good for people to have fun and THAT'S OKAY.
People are allowed to recognize the realistic negative outcomes of technology, especially on a forum that frequently discusses the tradeoffs of modern, cutting edge technologies.
So many AI posts are overrun with this kind of complaining from folks with limited imaginations.
On a forum that frequently discusses technology with enthusiasm you'd think there'd be more enthusiasm and more constructive criticism instead of blanket write-offs.
I would argue that being able to see the drawbacks and potential negative externalities of a new technology is not a sign of a "limited imagination", but quite the contrary. An actual display of a limited imagination is the inability to imagine how a new technology can (and will) be abused in society by bad actors.
Developing some insight on its negative potential could demonstrate imagination, but the claim that it could be used to scam people is pretty much just rote repetition by now - an obligatory point made in every article and under every post about this tech (and not something that I think actually works out in practice the way most imagine it, since cold-call scam operations that dial numbers at a huge scale expecting most not to pick up can't really find a voice clip prior to each automated call).
As for positive applications, some I see:
* Allowing those with speech impairments to communicate using their natural voice again
* Allowing those uncomfortable with their natural voice, such as transgender people, to communicate closer to how they wish to be perceived
* Translation of a user's voice, maintaining emotion and intonation, for natural cross-language communication on calls
* Professional-quality audio from cheap microphone setups (for video tutorials, indie games, etc.)
* Doing character voices for a D&D session, audiobook, etc.
* Customization of voice assistants, such as to use a native accent/dialect
* Movies, podcasts, audiobooks, news broadcasts, etc. made available in a huge range of languages
* If integrated with something like airpods, babelfish-like automatic isolation and translation of any speech around you
* Privacy from being able to communicate online or record videos without revealing your real voice, which I think is why many (myself included) currently resort to text-only
* New forms of interactive media - customised movies, audio dramas where the listener plays a role, videogame NPCs that react with more than just prerecorded lines, etc.
* And of course: memes, satire, and parody
I appreciate HN's general view on technologies like encrypted messaging - not falling into "we need to ban this now because pedophiles could use it" hysteria. But for anything involving machine learning, I'm concerned how often the hacker mentality seems to go out the window and we instead get people advocating for it to be made illegal to host the code, for instance.
Of the 11 positive applications that you listed, only the 1st, 3rd, 11th and arguably the 4th would benefit from voice cloning, which is what's being promoted here. The rest are solved merely by (improved) TTS and do not require the cloning of any actual human voice.
Also, notice how the legitimate use-cases 1, 3 and 4 imply the user consenting to clone their own voice, which is fine. However, the only use-case which would require cloning a specific human voice belonging to a third party, use-case 11, is "memes, satire, and parody"... and not much imagination is needed to see how steep and buttery that Teflon slippery slope is.
> Of the 11 positive applications that you listed, only the 1st, 3rd, 11th and arguably the 4th would benefit from voice cloning, which is what's being promoted here. The rest are solved merely by (improved) TTS and do not require the cloning of any actual human voice.
2, 5, 6, 9: It's true that in theory all you need is some way to capture the characteristics of a desired voice, but voice-cloning methods are the way to do this currently. If you want a voice assistant with a native accent, you fine-tune on the voice of a native speaker - opposed to turning a bunch of dials manually.
7, 8, 10: Here I think there is benefit specifically from sounding like a particular person. The dynamically generated lines of movie characters/videogame NPCs should be consistent with the actor's pre-recorded lines, for instance, and hearing someone in their own voice is more natural for communication and makes conversation easier to follow.
Pedantically, what's promoted here is a tool which features voice cloning prominently but not exclusively - other workflows demonstrated (like generating subtitles) seem mostly unobjectionable.
> Also, notice how the legitimate use-cases 1, 3 and 4 imply the user consenting to clone their own voice, which is fine
I think all, outside of potentially 8 and 11, could be done with full consent of the voice being cloned - an agreement with the movie actor to use their voice for dubbing to other languages, for example. That's already a significant number of use-cases for this tool.
> use-case 11, is "memes, satire, and parody"... and not much imagination is needed to see how steep and buttery that Teflon slippery slope is.
IMO prohibition around satire/parody would be the slippery slope, particularly with the potential for selective enforcement.
This is a GitHub repo, not an article on the effects of TTS. Policy discussions at the level of the parent comment feel off topic.
Just a heads up, this is a trial; you have to pay to use it after 30 mins.
Easier and (cheaper?) to just use elevenlabs.
It’s a bit of a hassle, but after closing the Windows command, you can restart the program and use it indefinitely. The results you worked on will still remain in the workspace folder.
Yeah, it felt like it positions itself as an open source project here and on GitHub, but buries the cost in other pages... Doesn't even say the subscription cost anywhere I could find (in English). Not a huge fan of this advertising model.
I haven’t looked at the code, but can you just patch out the 30 minute limit?
Looks to me like the app code is compiled into pyd files. One could try and decompile. Interestingly, it's licensed as MIT.
Is there speech to speech? I have been hoping for a model I can use to do voice acting with inflection
Do you mean Inflection's Pi?
I think they mean speech "in the style of", the same as "repaint this picture in the style of Van Gogh": they will do the audio and put the correct inflection on things, but then re-render it with the voice of Katharine Hepburn, for example.
on edit: example of course showing the difficulty as so much of Hepburn was her inflection.
More so I wish to voice act a line and then have the bot mimic it with a different voice but with the same contextual voicing.
“I’m going to kill you” could be delivered (laughing jokingly / seething with rage / ominously and creepily). I’d like a bot that can mimic the delivery in a different voice.
Project looks interesting. Are there short term plans to support MacOS?
If not, any recommendations for alternative projects?
The description, since many commenters are not clicking through but asking questions this answers:
Comprehensive Gradio WebUI for audio processing, powered by Whisper engines (Whisper, Faster-Whisper, Whisper-Timestamped). Features Voice Changer, zero-shot Voice Cloning (E2, F5-TTS), YouTube downloading, vocal isolation(UVR5), Text-to-Speech (Edge-TTS), and multi-language translation. Perfect for content creators and developers.
Are banks moving away from voice verification as a means to identity checks? It seems like it's getting easier and easier to clone voices.
Have you considered supporting whisper-at - https://github.com/YuanGongND/whisper-at ? Being able to identify sounds on a timeline can be useful, e.g. for a politician's speech and how the audience is reacting to it (e.g. clapping, applauding).
The real utility of something like this is for reducing the creative costs of voice acting, i.e. something like this is a massive boon for mod-makers, where making fully voiced anything is a huge undertaking. While my friends and family could probably provide their voice if I asked, getting a decent recording and performance out of them is just not going to be possible.
But if I can get the performance I want and shift it to another voice, then fully voicing free works becomes very accessible (even better would be generative AI which could take a sample of what you want and re-render it into something which sounds like a more professional performance - voice in-fill I suppose).
Great stuff well done. What is your latency for real time Audio?
This is cool. I want to use this combined with NotebookLM to create a podcast with my mom and my dad’s voice covering a concept like gradient descent, Explain Like I’m 5 (ELI5).
I wonder if certain familiar voices like that of your parents would lead to higher understanding and retention.
> Imagine creating a podcast where Mark Zuckerberg interviews Elon Musk – using their actual voices?
I'm imagining it. It sucks to imagine.
I'm imagining it being used to scam people. I'm imagining it to leech off of performers who have worked very hard to build a recognizable voice (and it is a lot of work to speak like a performer). I'm imagining how this will be used in revenge porn. I'm imagining how this will be used to circumvent access to voice controlled things.
This is bad. You should feel bad.
And I know you are thinking, "Wait, but I worked really hard on this!" Sorry, I appreciate that it might be technically impressive, but you've basically come out with "we've invented a device that mixes bleach and ammonia automatically in your bedroom! It's so efficient at mixing those two, we can fill a space with chlorine gas in under 10 seconds! Imagine a world where every bedroom could become a toxic site with only the push of a button."
That this is posted here, proudly, is quite frankly astoundingly embarrassing for you.
I'd claim the way most people imagine it being used for scamming, cold-calls impersonating someone the victim knows, doesn't really end up working out in practice because scam operations dial numbers at a huge scale expecting most not to pick up a "scam likely" call (or be away, or a dead number, etc.). Having to find a voice clip prior to each unanswered call would tank the quantity they're able to make.
For spear-phishing (impersonate CEO, tell assistant to transfer money) it's more feasible, but I hope it forces acceptance that "somebody sounds like X over the phone" is not and has never been a good verification method - people have been falling for scams like those fake ransom calls[0] for decades.
Not that there aren't potential harms, but I think they're outweighed by positive applications. Those uncomfortable with their natural voice, such as transgender people, can communicate closer to how they wish to be perceived - or someone whose voice has been impaired (whether just a temporary cold or a permanent disorder/illness/accident) can use it from previous recordings. Privacy benefits from being able to communicate online or record videos without revealing your real voice, which I think is why many (myself included) currently resort to text-only. There's huge potential in the translation and vocal isolation aspects aiding communication - feels to me as though we're heading towards creating our own babelfish. There's also a bunch of creative applications - doing character voices for a D&D session or audiobook, memes/satire, and likely new forms of interactive media (customised movies, audio dramas where the listener plays a role, videogame NPCs that react with more than just prerecorded lines, etc.)
I think most people in America are more wary of foreign sounding voices. If the person on the other end sounds like a good ol boy, they get more trust.
Scammers don't have to sound like a specific person to be helped by software like this.
That aspect feels to me like "I used to racially profile people on the street to judge risk, but winter clothing now obscures skin color at a distance". There are heuristics that give non-zero information but are harmful to use, with the cost borne by some marginalized group, and I don't see it as a negative for use of such heuristics to be made less feasible. Reducing people's use of accent as a factor would be a positive for the ~1.5B Indians that aren't scammers, for instance.
I think there's also an autonomy argument to be made, if the alternative is to the effect of ensuring that people cannot use tools to hide their accent (and particularly if, as above, the intent is so they can be discriminated against based on it). Even though it isn't something we've really been able to do before, I think it's generally a person's own right to modify their voice.
You do realise this is not the first AI release to clone voices?
I don't think the parent said they were. "I'm the Nth person to do a shitty thing!" doesn't absolve them of doing a shitty thing. Just because there are other thieves doesn't make theft ok.
Sure, and PoisonIvy wasn't the first RAT. So what? Does it get more ethical to assist fraudsters and so on once more people are doing it?
[flagged]
This doesn't appear to have any training facility, so its misuse would seem to be limited to the pre-trained voices supplied - for the casual user (and the ease-of-use seems to be the central issue in these comments).
My experience with voice cloning is that training is typically not required for it to work. You just embed a bit of audio of the desired voice to be cloned using the backing VAE and the model can do the rest.
Is it not the same with this project?
If you are looking for automatic dubbing without voice cloning: https://github.com/Softcatala/open-dubbing
The syncing of the original English is way off. I don't really know how they got that to be so broken.
There are a bunch of YC start-ups who are building new models and stuff in the space. I fear they are going to get decimated really soon as the quality of local llamas keeps improving.
Just need to use this with some recordings of Majel Barrett, make a voice interface for Claude's computer use agent and we'll be all set.
Honestly, I'm not super worried about AI — at least this iteration of it — because of the uncanny valley effect. I would expect the VO industry to outlaw it purely because if people start to wonder if they're listening to an AI voice, that's a non-starter and they will stop paying attention. Even with the best AI, there are artifacts that make it easy to identify.
The primary goal of the voice actor is to achieve a personal connection, and I don't see how AI is a real threat to that end. I feel the same about other mediums as well. This will likely be used for scams, but I doubt it will ever draw as many eyes, or ears, as something a real human can produce. Thus, it won't be a valuable tool to marketers and largely unprofitable.
Looks cool! Also, is there a reason you went with a Web-UI instead of making a native desktop app?
Comment was deleted :(
I'm with the nay-sayers. Your product doesn't bring any good to this world, but it does make it easier to harm people. It's a disgrace.
"If, by whiskey...."
Are there any TTS models which are decent but can work on devices without a GPU and with relatively low RAM (4GB)?
> Linux and Mac OS are not supported
Well, that's a big old fail. Just a reminder: The given (and proper) home of open source is on an open source OS.
This is gross. The person who made it and pitched it this way disgusts me.
[dead]
[dead]
Without Linux support it is going to have a very limited audience.
There is nothing in here that precludes you from running this on any OS that supports python + CUDA. They use miniconda for installation of python and python packages, but this could just as easily be a venv + system CUDA install or even better: a container. This is only one tiny Dockerfile away from running anywhere.
Comment was deleted :(
[flagged]
You don't want an Instagram hack app. You want to go home and rethink your life.