Microsoft VibeVoice: Open-Source Frontier Voice AI

(github.com)

113 points | by tosh 2 hours ago

20 comments

  • maxloh 1 hour ago
    I think we should stop calling this type of model open source. They are indeed "open weight." The training code is proprietary and was never released.

    https://github.com/microsoft/VibeVoice/issues/102

    • jcmfernandes 31 minutes ago
      Indeed. We now live in a world where freeware is named open source. We are very sorry, Stallman.
      • MarsIronPI 14 minutes ago
        If you're going to apologize to Stallman, you should apologize for conflating open source with software freedom. ;D
    • JumpCrisscross 1 hour ago
      > we should stop calling this type of model open source. They are indeed "open weight"

      This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.

      • andy_ppp 1 hour ago
        I think you mean GIF.
      • giancarlostoro 57 minutes ago
        It's the same as GIS, you wouldn't say jizz now would you?
        • DoctorOW 42 minutes ago
          I absolutely do, every single time it comes up.
        • kevin_thibedeau 35 minutes ago
          The developer of the format declared the pronunciation 30+ years ago. It has always been jif.
          • Geezus_42 16 minutes ago
            Yeah, but society overruled them.
        • pardon_me 28 minutes ago
          How do you pronounce giraffe?
          • parineum 24 minutes ago
            How do you pronounce gift?
        • dijksterhuis 23 minutes ago
          i am absolutely going to from now on
        • notabotiswear 49 minutes ago
          I take it that you haven’t met the Arcgees people…
      • WarmWash 42 minutes ago
        And "hallucination" which should have been "delusion".

        Way early on (spring 2023) people tried to stop it, but no luck.

    • btown 12 minutes ago
      At least it's MIT licensed! As much as non-open training data irks me, restrictive licensing irks me more!
    • giancarlostoro 57 minutes ago
      I mean, you have "AI" which means just about anything in marketing speak, "Agentic" is kind of becoming similar, hopefully they don't goof that one too badly, would be nice to know what you are trying to sell me. Used to be "Cloud" meant storage not just hosting (I guess it still does).

      Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.

      I do think "Open Weight" should be more commonly used. On the other hand, there are definitely communities that spring up and build the training and inference infrastructure around open models.

    • notabotiswear 46 minutes ago
      Openwashing is the new greenwashing, which, coincidentally, seems to have gone out of fashion a few hundred datacentres ago.
      • dist-epoch 38 minutes ago
        it was replaced with abundancewashing
        • Geezus_42 14 minutes ago
          What is "abundancewashing"?
          • dist-epoch 0 minutes ago
            > “This means a future of abundance. A future where there is no poverty, where people can have whatever they want in terms of goods and services.” – Elon Musk

            > “I think we see a path now where the world gets much more abundant and much better every year.” – Sam Altman

            https://www.diamandis.com/blog/elon-sam-abundance

  • steinvakt2 1 hour ago
    This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow at inference. It's also bad at multilingual audio.

    Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.

  • Void_ 1 hour ago
    In the past month or so, I added 2 models to my app Whisper Memos (https://whispermemos.com):

    - Cohere Transcribe (self hosted)

    - Grok Speech To Text (they provide an API, only $0.10/hr!)

    They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?

    • olejorgenb 1 hour ago
      I've had good experiences with the Mistral Voxtral models (I've used the API, but some of the model-variants are open weight)
    • Barbing 32 minutes ago
      Does Cohere work with longer transcripts? Do you have to do some magic to merge recordings over 35 seconds long?
    • 2ndorderthought 1 hour ago
      Have you tried qwen?
    • SecretDreams 1 hour ago
      Any non-Musk alternatives that are comparable in quality and cost?
      • jayphen 28 minutes ago
        Voxtral competes on price ($0.003/min) and quality. Speechmatics has best in class accuracy but is a bit more expensive ($0.004/min)
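
        The prices quoted in this thread use different units (per hour for Grok, per minute for Voxtral and Speechmatics), so a quick normalization may help when comparing them. This is only a sketch: the figures are the ones commenters posted here, not verified against any vendor's pricing page, and the names `QUOTED_PRICES`, `usd_per_hour`, and `hourly_table` are made up for illustration.

```python
from fractions import Fraction

# Prices as quoted by commenters in this thread (assumption: not checked
# against the vendors' current pricing pages).
QUOTED_PRICES = {
    "Grok Speech To Text": (Fraction("0.10"), "hour"),
    "Voxtral": (Fraction("0.003"), "minute"),
    "Speechmatics": (Fraction("0.004"), "minute"),
}


def usd_per_hour(price, unit):
    """Convert a quoted price to US dollars per hour of audio."""
    if unit == "hour":
        return price
    if unit == "minute":
        return price * 60
    raise ValueError(f"unknown unit: {unit}")


def hourly_table(prices):
    """Return {service name: price in USD per hour} for every quoted price."""
    return {
        name: float(usd_per_hour(price, unit))
        for name, (price, unit) in prices.items()
    }


if __name__ == "__main__":
    # Print the services from cheapest to most expensive per hour.
    for name, cost in sorted(hourly_table(QUOTED_PRICES).items(), key=lambda kv: kv[1]):
        print(f"{name}: ${cost:.2f}/hr")
```

        On the quoted figures, that works out to $0.10/hr for Grok, $0.18/hr for Voxtral, and $0.24/hr for Speechmatics.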
      • Void_ 1 hour ago
        Our default is still OpenAI Whisper. Grok is just a choice for users who might prefer it.
  • Mobius01 9 minutes ago
    Microsoft has historically made poor choices in product naming, but this has to be a new low.
  • aqme28 1 hour ago
    Interesting to see "vibe" enshrined by the likes of Microsoft as an AI product word.
    • accrual 53 minutes ago
      Especially when "vibe coded" can have a negative connotation meaning quickly put together without understanding.
      • Barbing 29 minutes ago
        I’m just surprised they put the name of the e-waste slop company in their product
    • altmanaltman 40 minutes ago
      Which makes it even weirder that they get offended when people use "Microslop". They are the ones leaning into the marketing
  • embedding-shape 1 hour ago
    Isn't this project the one Microsoft published but then soon after pulled it for security/safety reasons? What has changed since then?
    • 542458 1 hour ago
      Look at the "News" section in the readme - The original TTS model is gone from this repo (you can still find it in other places), but the STT/ASR, long-form TTS, and streaming TTS models are newer.
    • infecto 43 minutes ago
      It’s confusing (at least for me) because the project covers a number of things including what you are mentioning.
      • Barbing 17 minutes ago
        [off topic]

        When explanations get posted directly in HN comments, I imagine someone somewhere in the world is able to learn in spite of their Internet restrictions/firewalls

        People will also post their own interpretations in response to comments, and quickly find out they missed something.

        … But if you try to automate it, like include a summary under every HN post, you encourage laziness too much and are pre-chewing too heavily. Some balance here.

        [on topic]

        (OK I’m done making excuses, time to read the article… thanks for the encouragement!)

        I thought this was not explained in the readme directly but in fact I missed it. I wasn’t going to read Microsoft’s entire changelog! But it was substantive, thanks to sibling commenter:

        “2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.”

  • frangonf 35 minutes ago
    I took a look into local options for ASR and diarization some months ago, and I missed that VibeVoice now has this feature.

    My conclusion back then (which came only from shallow research on the topic and 0 real experience, mind you) was that Whisper + Pyannote was the "stable" approach.

    Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?

  • CubsFan1060 1 hour ago
    Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/
    • 542458 1 hour ago
      Note that this just covers the Speech-to-Text/Speech-Recognition aspect (à la Whisper); there are also models for long-form Text-To-Speech and streaming Text-To-Speech.
    • JumpCrisscross 1 hour ago
      “VibeVoice can only handle up to an hour of audio”

      Why?

  • podgietaru 1 hour ago
    So we've really just settled on Vibe as the verb for AI then?
    • giarc 1 hour ago
      I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?
      • internet_points 1 hour ago
        it'll probably be something we're not even talking about yet - we still have 7 months in which to make the world even worse
    • pryanshu89 1 hour ago
      Why use precise technical language when you can just vibe with your AI system?
  • chaosprint 18 minutes ago
    Microsoft Store App Vibing.exe Accused of Harvesting Screens, Audio, and Clipboard Data:

    https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...

  • khimaros 21 minutes ago
    looks like this offers ASR support in GGUF https://github.com/CrispStrobe/CrispASR -- haven't tested
  • Anonyneko 1 hour ago
    You have selected Microsoft Sam as the computer's default voice.
    • accrual 51 minutes ago
      My friends and I had fun in the computer lab with Microsoft Sam, inputting long strings of characters to create funny sound effects. Sususususususu.
  • ryukoposting 43 minutes ago
    Holy moly, a Microsoft AI product that isn't named Copilot!
    • DoctorOW 41 minutes ago
      Missed opportunity to call it Vopilot
  • JumpCrisscross 1 hour ago
    What’s the current state of the art, for each of training locally and in the cloud, for learning my voice?
    • yreg 27 minutes ago
      Locally maybe https://voicebox.sh/

      Elevenlabs in the cloud.

    • chrsw 54 minutes ago
      Local? No idea. Cloud? Eleven Labs, probably. But it's described as "cloning" not "training". Not sure what the distinction is or why it matters if the end result is that you can generate any TTS that sounds like you. There might very well be an important one, I just don't know it.
    • khimaros 19 minutes ago
      open weights i would say S2: https://github.com/rodrigomatta/s2.cpp
  • pluc 1 hour ago
    Interesting story about this repo/product/author by cybersecurity researcher Kevin Beaumont: https://cyberplace.social/@GossiTheDog/116454846703138243
  • BlastBash192 1 hour ago
    Maybe Microsoft’s real strength was never making the best model, it was knowing you don’t need to, as long as you own the platform everyone builds on.
  • mistic92 1 hour ago
    For me, it's giving very poor results
  • ChrisArchitect 14 minutes ago
  • starkeeper 10 minutes ago
    Microsoft is famous for choosing terrible names but how could they be this terrible.
  • walthamstow 1 hour ago
    Seems quite heavy for a STT model, Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?

    The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck