Voxtral Transcribe 2

(mistral.ai)

228 points | by meetpateltech 2 hours ago

18 comments

simonw 1 hour ago
This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...
Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.
I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:
> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?
- Oras 1 hour ago
  Thank you for the link! Their playground in Mistral does not have a microphone. It just uploads files, which does not demonstrate the speed and accuracy, but the link you shared does.
  I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.
- tekacs 1 hour ago
  Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.
  And open weight too! So grateful for this.
- daemonologist 1 hour ago
  404 on https://mistralai-voxtral-mini-realtime.hf.space/gradio_api/... for me (which shows up in the UI as a little red error in the top right).
- jaggederest 56 minutes ago
  It can transcribe Eminem's Rap God fast sequence, really, really impressive.
  - rafram 27 minutes ago
    That's almost certainly in the training data, to be fair.
- pyprism 40 minutes ago
  Wow, that’s weird. I tried Bengali, but the text was transcribed into Hindi! I know there are some similar words in these languages, but I used pure Bengali that is not similar to Hindi.
  - derefr 7 minutes ago
    Well, on the linked page, it mentions "strong transcription performance in 13 languages, including [...] Hindi" but with no mention of Bengali. It probably doesn't know a lick of Bengali, and is just trying to snap your words into the closest language it does know.
- rafram 23 minutes ago
  Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.
- th0ma5 1 hour ago
  [dead]
- adarsh2321 39 minutes ago
  [flagged]
- adarsh2321 33 minutes ago
  [flagged]
janalsncm 17 minutes ago
I noticed that this model is multilingual and understands 14 languages. For many use cases, we probably only need a single language, and the extra 13 are simply adding extra latency. I believe there will be a trend in the coming years of trimming the fat off of these jack-of-all-trades models.
https://aclanthology.org/2025.findings-acl.87/
- decide1000 7 minutes ago
  I think this model proves it's very efficient and accurate.
dmix 1 hour ago
> At approximately 4% word error rate on FLEURS and $0.003/min
Amazon's transcription service is $0.024 per minute, a pretty big difference: https://aws.amazon.com/transcribe/pricing/
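A rough way to sanity-check the gap, using only the two per-minute prices quoted in this comment. Free tiers, volume discounts, and batch vs. streaming rates are not modeled, and the names below are made up for illustration:

```python
from decimal import Decimal

# Per-minute prices as quoted in the comment above (USD per audio minute).
PRICE_PER_MIN = {
    "voxtral": Decimal("0.003"),
    "amazon_transcribe": Decimal("0.024"),
}

def cost(provider: str, audio_hours: int) -> Decimal:
    """Cost in USD to transcribe `audio_hours` of audio at the quoted rate."""
    return PRICE_PER_MIN[provider] * 60 * audio_hours

# How many times more expensive the Amazon figure is than the Voxtral figure.
ratio = PRICE_PER_MIN["amazon_transcribe"] / PRICE_PER_MIN["voxtral"]
print(f"price ratio: {ratio}x")
for name in PRICE_PER_MIN:
    print(f"{name}: ${cost(name, 1000)} per 1,000 hours of audio")
```

At those quoted rates the difference is 8x, or $180 versus $1,440 per 1,000 hours of audio.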
- mdrzn 1 hour ago
  Is it 0.003 per minute of audio uploaded, or "compute minute"?
  For example, fal.ai has a Whisper API endpoint priced at "$0.00125 per compute second" which (at 10-25x realtime) is far cheaper than all the competitors.
  - Oras 1 hour ago
    I think the point is having it for real-time; this is for conversations rather than transcribing audio files.
    - jamilton 6 minutes ago
      That quote was for the non-realtime model.
yewenjie 4 minutes ago
One week ago I was on the hunt for an open source model that can do diarization and I literally had to give up because I could not find any easy-to-use setup.
observationist 2 hours ago
Native diarization, this looks exciting. edit: or not, no diarization in real-time.
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
~9GB model.
- coder543 1 hour ago
  The diarization is on Voxtral Mini Transcribe V2, not Voxtral Mini 4B.
  - sbrother 1 hour ago
    Do you have experience with that model for diarization? Does it feel accurate, and what's its realtime factor on a typical GPU? Diarization has been the biggest thorn in my side for a long time.
    - coder543 36 minutes ago
      > Do you have experience with that model
      No, I just heard about it this morning.
  - observationist 1 hour ago
    Ahh, yeah, and it's explicitly not working for realtime streams. Good catch!
pietz 1 hour ago
Do we know if this is better than Nvidia Parakeet V3? That has been my go-to model locally and it's hard to imagine there's something even better.
- czottmann 5 minutes ago
  I liked Parakeet v3 a lot until it started to drop whole sentences, willy-nilly.
- tylergetsay 58 minutes ago
  I've been using Parakeet V3 locally and, totally anecdotally, this feels more accurate but slightly slower.
siddbudd 10 minutes ago
Wired advertises this as "Ultra-Fast Translation"[^1]. A bit weird coming from a tech magazine. I hope it's just a "typo".
[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...
- bigyabai 8 minutes ago
  It might be capable of translation; OpenAI Whisper was a transcription model that could do it.
mdrzn 1 hour ago
There's no comparison to Whisper Large v3 or other Whisper models.
Is it better? Worse? Why do they only compare to gpt4o mini transcribe?
- tekacs 1 hour ago
  WER is slightly misleading, but Whisper Large v3 WER is classically around 10%, I think, and 12% with Turbo.
  The thing that makes it particularly misleading is that models that do transcription to lowercase and then use inverse text normalization to restore structure and grammar end up making a very different class of mistakes than Whisper, which goes directly to final form text including punctuation and quotes and tone.
  But nonetheless, they're claiming such a lower error rate than Whisper that it's almost not in the same bucket.
  - tekacs 1 hour ago
    On the topic of things being misleading, GPT-4o transcriber is a very _different_ transcriber to Whisper. I would say not better or worse, despite characterizations as such. So it is a little difficult to compare on just the numbers.
    There's a reason that quite a lot of good transcribers still use V2, not V3.
    - satvikpendem 1 hour ago
      Different how?
- GaggiX 1 hour ago
  Gpt4o mini transcribe is better and actually realtime. Whisper is trained to encode the entire audio (or at least 30s chunks) and then decode it.
  - mdrzn 1 hour ago
    So "gpt4o mini transcribe" is not just whisper v3 under the hood? Btw it's $0.006 / minute
    For Whisper API online (with v3 large) I've found "$0.00125 per compute second", which is the absolute cheapest I've ever found.
    - GaggiX 1 hour ago
      >So it's not just whisper v3 under the hood?
      Why should it be Whisper v3? They even released an open model: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
  - emmettm 1 hour ago
    The linked article claims the average word error rate for Voxtral mini v2 is lower than GPT-4o mini transcribe.
    - GaggiX 1 hour ago
      Gpt4o mini transcribe is better than whisper, the context is the parent comment.
satvikpendem 1 hour ago
Looks like this model doesn't do realtime diarization. What model should I use if I want that? So far I've only seen paid models do diarization well. I heard about Nvidia NeMo but haven't tried it or even figured out where to try it out.
aavci 1 hour ago
What are the cheapest device specs that this could realistically run on?
- kamranjon 48 minutes ago
  I haven't quite figured out if the open weights they released on huggingface amount to being able to run the (realtime) model locally. I hope so though! For the larger model with diarization I don't think they open sourced anything.
XCSme 15 minutes ago
Is it just me, or is an error rate of 3% really high?
If you transcribe a minute of conversation, you'll have like 5 words transcribed wrongly. In an hour-long podcast, that is 300 wrongly transcribed words.
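For what it's worth, the numbers depend a lot on the speaking rate you assume. A quick sketch, where the 150 words per minute figure is an assumption (not something from the announcement) and WER is treated as a flat per-word rate:

```python
def expected_errors(wer: float, words_per_minute: float, minutes: float) -> float:
    """Expected number of word errors, treating WER as a flat per-word rate."""
    return wer * words_per_minute * minutes

WPM = 150  # assumed conversational speaking rate, not from the source

# 3% is the rate discussed here; ~4% is the FLEURS figure quoted earlier in the thread.
for wer in (0.03, 0.04):
    per_minute = expected_errors(wer, WPM, 1)
    per_hour = expected_errors(wer, WPM, 60)
    print(f"WER {wer:.0%}: ~{per_minute:.1f} errors/min, ~{per_hour:.0f} errors/hour")
```

At 150 words per minute, 3% comes to about 4.5 errors per minute and about 270 per hour; the "5 per minute" estimate corresponds to a faster speaker, roughly 167 words per minute. WER also counts inserted and deleted words, not only substituted ones, so this is a ballpark.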
- cootsnuck 12 minutes ago
  The error rate for human transcription can be as high as 5%.
  - XCSme 6 minutes ago
    Oh wow, I thought humans are like 0.1% error rate, if they are native speakers and aware of the subject being discussed.
serf 2 hours ago
things I hate:
"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"
So, you don't mean 'try this out', you mean 'buy this product'.
Let's not act like it's a free sampler.
I can't comment on the model: I'm not giving them money.
- ReadEvalPost 2 hours ago
  You can try it on HF: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...
  - boobsbr 1 hour ago
    I'm impressed.
ewuhic 7 minutes ago
Can it translate in real time?
antirez 1 hour ago
Italian represents, I believe, the most phonetically advanced human language. It has the right compromise among information density, understandability, and the ability to speak much faster to compensate for the redundancy. It's as if it had error correction built in. Note that it's not just that it has the lowest error rate, but that it is also underrepresented in most datasets.
- Archelaos 1 hour ago
  This is largely due to the fact that modern Italian is a systematised language that emerged from a literary movement (whose most prominent representative is Alessandro Manzoni) to establish a uniform language for the Italian people. At the time of Italian unification in 1861, only about 2.5% of the population could speak this language.
  - gbalduzzi 1 hour ago
    The language itself was not invented for the purpose: it was the language spoken in Florence, then adopted by the literary movement and then selected as the national language.
    It seems like the best tradeoff between information density and understandability actually comes from the deep Latin roots of the language.
- hackyhacky 5 minutes ago
  > the most phonetically advanced human language
  That's interesting. As a linguist, I have to say that Haskell is the most computationally advanced programming language, having the best balance of clear syntax and expressiveness. I am qualified to say this because I once used Haskell to make a web site, and I also tried C++ but I kept on getting errors.
  /s obviously.
  Tldr: computer scientists feel unjustifiably entitled to make scientific-sounding but meaningless pronouncements on topics outside their field of expertise.
- gbalduzzi 1 hour ago
  I was honestly surprised to find it in first place, because I assumed English would be first given the simpler grammar and the huge dataset available.
  I agree with your belief: other languages have either lower density (e.g. German) or lower understandability (e.g. English).
  - riffraff 1 hour ago
    English has a ton of homophones, way more sounds that differ slightly (long/short vowels), and major pronunciation differences across the major "official" varieties (think Australia/US/Canada/UK).
    Italian has one official Italian (two, if you count IT_ch, but the difference is minor), doesn't pay much attention to stress and vowel length, and only has a few "confusable" sounds (gl/l, gn/n, double consonants, stuff you get wrong in primary school). Italian dialects would be a disaster tho :)
- NewsaHackO 1 hour ago
  The only knowledge I have about how difficult Italian is comes from Inglourious Basterds.
- mmooss 32 minutes ago
  At least some relatively well-known research finds that all languages have similar information density in terms of bits/second (~39 bits/second based on a quick search). Languages do it with different amounts of phonetic sound / syllables / words per bit and per second, but the bps comes out the same.
  I don't know how widely accepted that conclusion is, what exceptions there may be, etc.
Archelaos 1 hour ago
As a rule of thumb for software that I use regularly, it is very useful to consider the costs over a 10-year period in order to compare it with software that I purchase with a lifetime license to install at home. So that means $1,798.80 for the Pro version.
What estimates do others use?
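One way to make that comparison concrete. The 10-year total comes from the comment above; the lifetime-license price is a made-up placeholder, and the sketch assumes a constant price with no discounting:

```python
from decimal import Decimal, ROUND_CEILING

TEN_YEAR_TOTAL = Decimal("1798.80")  # figure quoted in the comment above
MONTHS = 10 * 12

# Monthly price implied by the quoted 10-year total.
implied_monthly = TEN_YEAR_TOTAL / MONTHS

def subscription_cost(monthly: Decimal, years: int) -> Decimal:
    """Total subscription spend over `years`, assuming a constant price."""
    return monthly * 12 * years

def break_even_months(monthly: Decimal, lifetime_price: Decimal) -> int:
    """First month in which cumulative subscription spend reaches the one-time price."""
    return int((lifetime_price / monthly).to_integral_value(rounding=ROUND_CEILING))

HYPOTHETICAL_LIFETIME_PRICE = Decimal("300")  # placeholder, not a real quote

print(f"implied monthly price: ${implied_monthly}")
months = break_even_months(implied_monthly, HYPOTHETICAL_LIFETIME_PRICE)
print(f"break-even against a ${HYPOTHETICAL_LIFETIME_PRICE} license: month {months}")
```

The quoted total implies $14.99 per month. Swapping in a different horizon (5 or 15 years) or a real one-time price changes the answer quickly, which is probably the interesting part of the question.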
dumpstate 27 minutes ago
I'm on voxtral-mini-latest and that's why I started seeing 500s today lol
boringg 1 hour ago
Pseudo related -- am I the only one uncomfortable using my voice with AI out of concern that once it is in the training data it is forever reproducible? As a non-public person it seems like a risk vector (albeit small).
varispeed 1 hour ago
[flagged]
- Empact 1 hour ago
  Many people speak Russian, including many who do not live in Russia, e.g. about 30% of Ukrainians.
  Beyond that, I don't see how we stand to durably reduce military action by making languages mutually unintelligible.
  https://simple.wikipedia.org/wiki/Russian_language#/media/Fi...
- laffOr 1 hour ago
  Don't they have a partnership with the French Armed Forces? I am sure they are interested in automating Russian Audio or Text (-> Russian Text) -> French text.
- gostsamo 1 hour ago
  They've chosen languages which would help them to cover the highest percentage of human population.