Open the Hugging Face Open ASR Leaderboard and sort by RTFx, the inverse real-time factor. Among models with competitive WER, the top of the table is dominated by one family: Nvidia's Parakeet TDT checkpoints. They process more than 3x as many seconds of audio per second of wall-clock time as the nearest competitor, and their word error rate is competitive with the rest of the top ten.

A gap that wide is rarely just kernel engineering. The mechanism here is architectural. Nvidia's models use a modification to the RNN-Transducer called the Token-and-Duration Transducer, or TDT (Xu et al., 2023). It changes the decoder loop in a small but consequential way: instead of stepping through encoder frames one at a time, the model jointly predicts a token and the number of frames that token covers, then jumps. On long utterances with stretches of silence or steady-state audio, that turns out to be worth up to 2.82x faster inference at comparable or better accuracy.

This post walks through what TDT actually is, why it speeds things up, and the handful of training details that make it work in practice. If you want the full math, our blog post "Token Duration Transducer (TDT) Explained: How Frame-Skipping Achieves 2.8x Faster ASR" covers the forward-backward derivation and gradient algebra in detail.

## The bottleneck in standard RNN-T

RNN-T is a sensible default for ASR. It captures label dependencies (which CTC cannot) without the heavyweight autoregressive decoder of encoder-decoder or decoder-only models. It also trains end-to-end with a well-understood loss. For a fuller comparison of the alternatives, see Desh's analysis.

The model has three parts. An encoder maps audio frames to hidden representations. A small autoregressive predictor consumes previously emitted non-blank tokens. A joint network combines the two and outputs a distribution over the vocabulary plus a blank symbol.

At inference, the decoding loop looks like this:

```python
# RNN-T greedy decoding (simplified)
t = 0
output = []
while t < T:
    logits = joint(encoder[t], predictor(output))
    token = argmax(logits)
    if token == BLANK:
        t += 1                    # advance ONE frame
    else:
        output.append(token)      # stay at the same t, advance u
```

Each step the model emits either blank (move forward one frame) or a token (stay at the same frame, append the token). For a 10-second clip at an 80ms frame rate after subsampling, that is roughly 125 sequential joint network calls. Most of them are blanks, because tokens are sparse relative to frames. The joint network is cheap per call, but the strict one-frame-at-a-time structure leaves throughput on the table.

## TDT: predicting how long the token lasts

The TDT modification is small. The joint network grows a second head. Instead of one distribution over |V| + 1 symbols, it now produces two independent distributions. One is over the token vocabulary plus blank, identical to RNN-T. The other is over a predefined set of allowed durations, typically D = {0, 1, 2, 3, 4}.
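Stated as a formula (a restatement of the factorization just described, using $\varnothing$ for blank), each decoding step scores a (token, duration) pair as the product of two separately normalized softmaxes:

$$
P(v, d \mid t, u) \;=\; P_{\text{token}}(v \mid t, u) \cdot P_{\text{dur}}(d \mid t, u), \qquad v \in V \cup \{\varnothing\},\; d \in D
$$

The factorization is what keeps the head cheap: the joint output needs $|V| + 1 + |D|$ logits rather than the $(|V| + 1) \times |D|$ a fully coupled distribution over pairs would require.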
The two heads share the same encoder and predictor representations but are normalized separately:

```python
# TDT joint network output
logits = joint(encoder[t], predictor(output))   # shape: [V + 1 + |D|]
token_logits = logits[:V + 1]
duration_logits = logits[V + 1:]
token_probs = softmax(token_logits)
duration_probs = softmax(duration_logits)
```

The decoding loop now uses both heads:

```python
# TDT greedy decoding (simplified)
t = 0
output = []
while t < T:
    logits = joint(encoder[t], predictor(output))
    token_logits, duration_logits = logits[:V + 1], logits[V + 1:]
    token = argmax(token_logits)
    duration = argmax(duration_logits)   # index == value for D = {0, 1, 2, 3, 4}
    if token == BLANK:
        t += max(1, duration)            # skip MULTIPLE frames
    else:
        output.append(token)
        t += duration                    # tokens can also skip frames
```

If the model predicts blank with duration 4, the decoder skips 4 frames in one joint call rather than four. The model decides per step how aggressive to be.

A worked example makes the saving concrete. Take 8 encoder frames, target "hi", and D = {0, 1, 2, 3}:

```
t=0: joint(enc[0], pred([]))     -> token=h     (p=0.8),  duration=0 (p=0.7) -> emit 'h', stay at t=0
t=0: joint(enc[0], pred([h]))    -> token=i     (p=0.6),  duration=2 (p=0.5) -> emit 'i', jump to t=2
t=2: joint(enc[2], pred([h,i]))  -> token=blank (p=0.9),  duration=3 (p=0.6) -> skip to t=5
t=5: joint(enc[5], pred([h,i]))  -> token=blank (p=0.95), duration=3 (p=0.8) -> skip to t=8 -> DONE
```

Four joint network calls instead of eight or more. The speedup is most pronounced on longer utterances with stretches of silence, which is most real-world audio.

## What has to change in training

Training has to teach the duration head to make sensible predictions, and the loss has to account for the richer set of transitions. The mechanism is the same family as standard RNN-T: a forward-backward algorithm over a lattice of valid alignments.

The lattice is a T × (U+1) grid where every node represents how much of the audio has been consumed against how much of the transcript has been emitted. A valid path runs from bottom-left to top-right and corresponds to one possible frame-level alignment. Because we don't know the true alignment, RNN-T training maximizes the total probability mass flowing through every valid path.

Forward-backward computes that sum efficiently in O(T·U) by reusing partial sums at each node, in much the same way HMM training does. The gradient with respect to any transition reduces to the fraction of total probability mass flowing through it, which is why dominant alignments tend to take over later in training.

For TDT, the lattice transitions are richer. A blank with duration d jumps d ≥ 1 frames horizontally. A token with duration d jumps d ≥ 0 frames and one step vertically. The forward and backward variables now sum over each duration in D, making the algorithm O(T·U·|D|). With |D| around four or five, the constant is small. The full derivation is in the TDT paper; a sketch of the recursion follows at the end of this section.

Two practitioner-level tricks are worth knowing about, because neither is obvious from the architecture alone.

**The sigma trick (logit under-normalization).** Every transition in the lattice gets penalized by σ (typically 0.05) in log-space. Because the penalty is per-transition, paths with more steps rack up more total penalty. That biases the model toward fewer, longer-duration steps rather than many duration-1 ones. Without it, the model has no real incentive to use the larger durations during inference.

**The omega trick (sampled RNN-T loss).** With probability ω, the loss falls back to plain RNN-T and ignores durations entirely. This regularizes the token head, keeping it well-calibrated even when duration information is missing. It matters for batched inference, where the whole batch has to advance by the minimum predicted duration, so token predictions still need to be sensible at every frame in between.
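As promised above, here is the forward recursion the TDT lattice transitions imply, transcribed from the description of those transitions; boundary handling at the grid edges is elided, and the TDT paper gives the exact treatment. $\alpha(t, u)$ is the total probability of consuming $t$ frames while emitting the first $u$ labels, $P_T$ and $P_D$ are the token and duration heads, $\varnothing$ is blank, and $y_u$ is the $u$-th target label:

$$
\alpha(t, u) = \sum_{\substack{d \in D \\ d \ge 1}} \alpha(t-d, u)\, P_T(\varnothing \mid t-d, u)\, P_D(d \mid t-d, u) \;+\; \sum_{d \in D} \alpha(t-d, u-1)\, P_T(y_u \mid t-d, u-1)\, P_D(d \mid t-d, u-1)
$$

with $\alpha(0, 0) = 1$ and out-of-grid terms dropped. The first sum covers blank transitions (jump $d$ frames, emit nothing), the second covers token transitions (emit $y_u$, jump $d$ frames). Each node now sums over up to $2|D|$ predecessors instead of two, which is where the O(T·U·|D|) cost comes from.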
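And a toy, pure-Python version of the same recursion with the sigma trick wired in. This is a readability sketch, not the production path: real implementations use fused log-space kernels, and this toy assumes alignments must land exactly on frame T, glossing over the right-edge handling. The function name and smoke-test values are illustrative, not from any library.

```python
import math

NEG_INF = float("-inf")

def logsumexp(xs):
    """Stable log(sum(exp(x) for x in xs)) for a small list."""
    m = max(xs, default=NEG_INF)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def tdt_forward_logprob(token_logp, dur_logp, target, durations, blank, sigma=0.0):
    """Toy TDT forward pass: log P(target | audio), summed over alignments.

    token_logp[t][u][v] : log P_token(v | frame t, predictor state u)
    dur_logp[t][u][i]   : log P_dur(durations[i] | frame t, predictor state u)
    target              : list of label ids (length U)
    sigma               : per-transition log-space penalty (the sigma trick)
    """
    T, U = len(token_logp), len(target)
    # alpha[t][u] = log-prob of having consumed t frames and emitted u labels
    alpha = [[NEG_INF] * (U + 1) for _ in range(T + 1)]
    alpha[0][0] = 0.0
    for t in range(T + 1):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue  # base case
            terms = []
            for i, d in enumerate(durations):
                if not 0 <= t - d < T:
                    continue  # source node must sit on a real encoder frame
                if d >= 1:  # blank: (t-d, u) -> (t, u), must advance
                    terms.append(alpha[t - d][u]
                                 + token_logp[t - d][u][blank]
                                 + dur_logp[t - d][u][i] - sigma)
                if u >= 1:  # token: (t-d, u-1) -> (t, u), d == 0 allowed
                    terms.append(alpha[t - d][u - 1]
                                 + token_logp[t - d][u - 1][target[u - 1]]
                                 + dur_logp[t - d][u - 1][i] - sigma)
            alpha[t][u] = logsumexp(terms)
    return alpha[T][U]

# Smoke test: 2 frames, 1 target label, vocab {0, 1} + blank id 2, D = {0, 1},
# uniform heads so every transition costs log(1/3) + log(1/2).
T, U = 2, 1
tok = [[[math.log(1 / 3)] * 3 for _ in range(U + 1)] for _ in range(T)]
dur = [[[math.log(1 / 2)] * 2 for _ in range(U + 1)] for _ in range(T)]
print(tdt_forward_logprob(tok, dur, [0], [0, 1], blank=2))
```

The omega trick sits one level up from this function: with probability ω, a training step would compute the plain RNN-T forward (durations ignored) instead, so the token head keeps receiving gradients that do not depend on the duration head.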
A few other practical notes. The duration set D must include 1, so the model can always fall back to single-frame advance. Including 0 is optional, but it lets the model emit a token without advancing the frame, which helps on fast speech. And the joint network output is a 4D tensor of shape (B, T, U, V + 1 + |D|), which gets large fast: production training uses fused-loss kernels that materialize one (t, u) slice at a time.

## How TDT differs from Multi-Blank

TDT is often confused with the Multi-Blank Transducer, which attacks the same problem differently. Multi-Blank adds extra blank symbols (big-blank-2, big-blank-3, and so on) to the vocabulary, each advancing by a fixed number of frames. TDT keeps the vocabulary unchanged and adds a separate duration head.

|                     | Multi-Blank                | TDT                      |
|---------------------|----------------------------|--------------------------|
| Duration prediction | Implicit, via blank type   | Explicit, separate head  |
| Token durations     | Always 0                   | Variable                 |
| Vocab size          | Grows by \|D\|             | Unchanged                |
| Independence        | Token and duration coupled | Independently normalized |

Decoupling means the duration distribution can be more fine-grained without bloating the vocabulary, and tokens themselves can carry duration, not just blanks.

## Why this matters

The Open ASR Leaderboard is one of the few honest benchmarks in production ASR, because it forces models to compete on accuracy and throughput at once. A leading WER at 0.5x RTFx is not shippable; slightly worse WER at 3x throughput often is. TDT is a clean example of an architectural change that buys a lot of throughput for very little accuracy cost, and it's why one family of checkpoints currently sits 3x clear at the top.

The NeMo toolkit has the full implementation, including the forward-backward kernels. The Parakeet-TDT checkpoints are on Hugging Face if you want to benchmark them against whatever you are running today.

What's interesting about TDT, beyond the speed numbers, is how little had to change to get there: a second softmax head, a slightly richer lattice, two regularization tricks during training. Production wins in ML often look like this: not a new paradigm, but a small architectural decision that respects the structure of the problem.

RNN-T spent most of its life predicting silence one frame at a time. TDT just stopped doing that.