Triton Kernels for Evo 2 Long-Context Inference

Status: Proposed upstream as an opt-in pull request, vortex#77, open and unmerged at the time of writing.

Evo 2 is a genomic foundation model (DNA in, next-token logits out) trained across all domains of life with a context window of up to a million nucleotides.Brixi et al., “Genome modelling and design across all domains of life with Evo 2,” Nature (2026), doi:10.1038/s41586-026-10176-5; open manuscript at arcinstitute.org/manuscripts/Evo2. Weights and inference code are open; the inference engine lives in Zymrael/vortex. Genomic context is long by nature: genes, regulatory regions, and whole genomes routinely span tens of thousands to millions of bases. Evo 2 reaches these lengths by using the Hyena operator for most sequence-mixing layers, with a small number of attention layers retained; Hyena’s cost grows sub-quadratically rather than quadratically in sequence length.Poli, Massaroli, et al., “Hyena Hierarchy: Towards Larger Convolutional Language Models,” ICML 2023, arXiv:2302.10866. The operator interleaves data-controlled gating with a long convolution whose filter is produced from a compact parameterisation rather than stored explicitly.

Long context is central to the architecture, and also where the reference implementation strains. Run the stock path on a single 80GB H100 and it runs out of memory at 131K (131,072) tokens, one of the context lengths the model was built to serve. The cause is not the convolution arithmetic but a single intermediate tensor that the reference builds in full, even though the computation never needs it held all at once.

Three Triton kernels close the gap: they fill the empty HC{S,M,L} interface stubs vortex’s main branch left open for downstream optimisation. The profiler, benchmarks, correctness sweep, and full result set produced for this work are public.Profiler, benchmarks, and the correctness sweep: AlphaKhaw/vortex-kernels. Net result: 131K-token inference now runs on a single 80GB H100 where the stock path runs out of memory, and the lengths that already fit run up to 1.73x faster at 65K (65,536) tokens. The same kernels carry to the 40B checkpoint.

Hyena Convolution Blocks

A Hyena layer mixes information across sequence positions with a long, implicitly parametrised convolution gated by the input, instead of relying on a full attention matrix. Direct convolution against a filter of length L costs O(L²). The Fast Fourier Transform (FFT) path computes the same convolution in O(L log L): transform both signals, multiply in the spectral domain, then transform back.The FFT turns convolution into pointwise multiplication via the convolution theorem. cuFFT is NVIDIA’s GPU FFT library; vortex reaches it through torch.fft, which dispatches to cuFFT on CUDA tensors. HCM uses rfft/irfft; the stock HCL path uses rfft(h), a full fft(x1v), then irfft. This is why the long-context paths are FFT-based.

Vortex’s HyenaInferenceEngine routes each block down one of three paths by filter length, and there is one kernel for each:

  • HCS (Hyena Cascade Short): The short-filter path (fir_length < 128). A depthwise causal convolution applied per channel; in evo2_7b the filter is 7 taps, meaning seven learned filter coefficients per channel. This is the gated FIR branch of parallel_fir.
  • HCM (Hyena Cascade Medium): The medium FFT-conv path (fir_length >= 128). Three cuFFT transforms with elementwise glue around them, the drop-in for fftconv_func.
  • HCL (Hyena Cascade Long): the long modal-filter path in parallel_iir. The same FFT convolution as HCM, preceded by a step that recomputes the filter itself from a compact pole-residue parameterisation. The filter is not stored as explicit taps: it is described by a small set of modes (each with an exponential rate and a weight), and the L taps are reconstructed from those modes on demand.In code, each tap is h[l] = sum_s residue[s] * exp(log_pole[s] * l): add the contribution of every mode s at position l. The arithmetic is cheap because S is small; the memory problem appears only when the implementation materialises every (s, l) term before summing. The recompute is where the memory goes.

Together the three cover the entire HC{S,M,L} cascade: the sequence-mixing core of every Hyena layer, and the role attention would otherwise play.

Inference Time and Memory

The first step before any development is to profile. The profiler runs an Evo 2 7B forward pass at four sequence lengths, attributing every leaf-level CUDA op to the block that issued it. This is a diagnostic sweep, run on an H200 because profiling needs the stock path to execute and only the H200’s 141GB reaches 131K unmodified. What it exposes is a property of the computation, identical on the 80GB H100 the headline results use.

The share of forward-pass CUDA time, by block kind, is stable across lengths:

Sequence Length (tokens)HCLHCMHCSAttentionOutside the Blocks
819211.2%6.2%5.7%3.2%73.7%
3276810.9%5.9%4.8%4.2%74.2%
6553611.9%5.2%3.7%4.8%74.4%
13107213.2%4.2%2.8%4.4%75.5%

Two things stand out. First, roughly three-quarters of forward-pass time lives outside the Hyena and attention blocks entirely — embedding, the LM head, the final norm, and a long tail of GEMMs and casts.GEMM alone is 15–30% of forward CUDA time across these lengths (30% at L = 8192) — part of the fixed cost that holds the L = 8192 numbers below the per-kernel microbenchmark speedups. No convolution kernel can touch it, and this fixed cost caps the achievable end-to-end speedup.

Second, within the addressable surface, HCL is the largest single block and grows with context. But its share of time understates its importance, because HCL is also the path doing redundant work. The stock HCL filter recompute is one line:

# vortex/model/model.py — modal filter build
h = (residues[..., None] * (log_poles * self.t).exp()).sum(1)[None]  # B, D, L

Read left to right: form residues[d, s] * exp(log_poles[d, s] * t[l]) as a full (D, state_size, L) tensor in fp32, then .sum(1) collapses the state axis away. The intermediate exists for one operation and is immediately reduced. At D = 4096 and L = 131072 it is about 34 GB — a buffer allocated only to be summed and discarded.34 GB is the intermediate alone, on top of the model weights, the activations, and the FFT workspaces. It is what tips evo2_7b over 80GB at L = 131072. The filter it reduces to is merely (D, L). The state dimension is small (16 modes in evo2_7b), which is exactly why summing it on the fly is cheap.

HCL: The Memory Unlock

The fix is to avoid building the buffer at all. The sum over the state axis is an accumulation: each mode can be added into the output tile, then discarded. The tiled kernel gives each program instance one (BLOCK_D, BLOCK_L) tile of the output filter, keeps a register accumulator, and walks the modes in a compile-time-unrolled loop (simplified here):

acc = tl.zeros((BLOCK_D, BLOCK_L), dtype=tl.float32)
for s in tl.static_range(S):
    rs_offs = offs_d * stride_rs_d + s * stride_rs_s
    r_s = tl.load(residues_ptr + rs_offs, mask=mask_d).to(tl.float32)
    lp_s = tl.load(log_poles_ptr + rs_offs, mask=mask_d).to(tl.float32)
    acc += r_s[:, None] * tl.exp(lp_s[:, None] * t_tile[None, :])

The (D, state_size, L) intermediate is never allocated — not tiled, not streamed, simply never formed. Peak memory for the recompute drops from a 34 GB temporary at L = 131072, growing linearly with context length, to one output tile plus the (D, L) result. With the buffer gone, the whole L = 131072 forward pass fits in 51.1 GB, comfortably inside an 80GB H100, where the stock path runs out of memory entirely:

Sequence Length (tokens)Stock Peak (GB)Kernels Peak (GB)
819218.0815.34
3276832.6421.64
6553652.0432.18
131072OOM51.13

The convolution that consumes the filter is still the same FFT convolution the engine already used. The only change is the input transform: the stock path runs a full complex fft(x1v) and slices the spectrum, while the kernel uses rfft(x1v) directly. The half-spectrum is enough in the forward path, where the full spectrum kept for prefill state caching is not needed.

The transforms still run on cuFFT (rfft(h), rfft(x1v), and one inverse transform), with Triton handling the elementwise glue around them: the scaled spectral product and the bias-residual-gate epilogue. At these sizes, the glue is the work worth giving to Triton.

Isolating the HCL operation (the modal recompute together with the convolution that consumes its filter) shows both wins at the kernel level. Against the stock path it is 2.6–4x faster across lengths and uses 2.3–2.6x less memory, and at 131K it is the only version that runs at all:

Sequence Length (tokens)Triton (ms)Stock (ms)SpeedupMemory Ratio
20480.532.013.77x2.61x
81921.977.884.00x2.61x
327689.2332.703.54x2.62x
6553626.7371.382.67x2.27x
9830442.38110.922.62x2.27x
13107253.20OOMUnlockedUnlocked

The above are isolated-kernel measurements in fp32, the fixed-precision floor the HCM and HCS microbenchmarks below also share, distinct from the bf16 end-to-end validation reported later. The isolated speedup runs ahead of the end-to-end figure for the reason the profile already gave.Memory Ratio is the buffer that never gets built; Speedup is the global-memory traffic that buffer would have cost, plus the elementwise glue folded into the kernel.

HCM and HCS Kernels

HCL carries the headline (both the memory unlock and the bulk of the speedup), but the contribution is the whole cascade. Every Hyena layer routes through one of these three paths, so leaving HCM or HCS on the stock path means the sequence-mixing core is only partly optimised. Both kernels are built on the same fused-epilogue machinery as HCL, and both have narrower margins. Each is profiled across its own operating regime: HCM over the longer lengths where the FFT path is active, HCS over the short-filter lengths where depthwise convolutions live. The end-to-end sweep below exercises all three together at one common set of lengths.

HCM fuses the elementwise work around the FFT rather than the transforms themselves. At a 128-tap filter, Triton cannot out-write cuFFT for the three transforms, so the win is launch count: the scaled spectral product and the skip-residual add collapse from several separate elementwise launches into two Triton kernels.Visible in the launch count: enabling HCM on top of HCS removes about 480 kernel launches from the L = 65536 forward pass (34,930 to 34,454, across the nine HCM layers), folding the stock path’s separate spectral-multiply, bias, and residual launches into two. In isolation the margin is modest and shrinking, from about 1.14x at 2K tokens to 1.06x at 65K as the cuFFT transforms come to dominate the total runtime:

Sequence Length (tokens)Triton (ms)Stock (ms)Speedup
20480.4170.4741.14x
81921.5861.7831.12x
327687.8318.5861.10x
6553623.88925.3891.06x

HCS is the depthwise short convolution. It is the one path whose stock implementation is a general-purpose cuDNN convolution rather than a near-optimal FFT, which leaves the most room. A bespoke depthwise kernel for a tiny causal filter is up to 2.65x faster at L = 4096:

Sequence Length (tokens)Triton (ms)cuDNN (ms)Speedup
5120.0180.0301.63x
10240.0280.0541.93x
20480.0430.1002.33x
40960.0730.1932.65x

At the model level, the saving concentrates where the redundant work was: HCL accounts for ~88% of the end-to-end saving at L = 65536, HCM for ~9%, and HCS for the remaining ~3%. HCS’s small in-model share at this length sits next to a much larger standalone win in the short-filter regime; the distribution reflects this model’s shapes, not a verdict on the kernels. Each wins cleanly in isolation, and with HCL also running, the three lift net throughput rather than drag on it.

Design: Opt-In and Signature-Preserving

Because this is a third-party contribution, the change must respect the engine’s existing call sites. Dispatch is gated by three optional flags on HyenaInferenceEngine (use_hcs_kernel, use_hcm_kernel, use_hcl_kernel), each defaulting to False. The flags arrive through the model config, so the model-loading path and the forward signature are unchanged, and no call site needs editing. With every flag off, the stock path runs unchanged; flipping one on swaps a single function for a numerically equivalent kernel. A maintainer can reason about the change as purely additive, and a user can A/B it by toggling a boolean.

End-to-End Results

Full forward-pass sweep on Evo 2 7B, on a single 80GB H100, five timed runs after three warmups.H100 80GB HBM3 (SM 9.0), CUDA 12.9, driver 580.126.09, torch 2.7.1, triton 3.3.1, vortex SHA a1dcc0f. Every JSON under results/ carries a run_meta envelope pinning the GPU, driver, CUDA, torch, triton, and both repo SHAs at measurement time, so a number is never separated from the machine that produced it.

Sequence Length (tokens)Stock (ms)Kernels (ms)SpeedupStock Peak (GB)Kernels Peak (GB)
8192382.2288.31.33x18.0815.34
327682065.61397.11.48x32.6421.64
655366653.33843.41.73x52.0432.18
131072OOM12701.9UnlockedOOM51.13

The speedup grows with context, from 1.33x at 8K to 1.73x at 65K, because HCL’s share of the work grows with L, while the part of the forward pass the kernels cannot touch is a fixed tax that matters less as the convolution work grows. At 131K the stock path has no number to beat: it runs out of memory on the 80GB card, so the kernel result is an unlock rather than a ratio.

There is a speedup hiding inside that unlock, visible only on a card large enough to run the stock path at 131K. An H200’s 141GB is large enough: the stock 7B/131K forward peaks at 90.9 GB, comfortably past the 80GB an H100 offers and the concrete reason it overflows there, and against the kernel path it times at 21.8s versus 12.0s. That is a 1.82x speedup, the same trend the shorter lengths set, now measured at the one length an H100 can report only as an unlock.H200 141GB HBM3e (SM 9.0), CUDA 12.9, driver 570.211.01, torch 2.7.1, triton 3.3.1, vortex SHA 9c54bb3. On the H200 the shorter-length speedups are 1.28x / 1.45x / 1.67x at 8K / 32K / 65K, marginally below the H100’s because the faster card shrinks the stock baseline, so the fixed non-convolution cost weighs more; the relative win is otherwise the same shape, and the extra memory only lets the 131K comparison run at all.

The kernels carry no Hopper-specific code. They have been measured on SM 8.9 (Ada) and SM 9.0 (Hopper), and the same Triton source should compile across recent NVIDIA architectures. The harness reproduces this sweep with a single command, and results from other hardware are welcome.

Scaling to Evo 2 40B

The same three kernels, unchanged, carry the 40B checkpoint. Evo 2 40B doubles the hidden size to 8192, so its filters and FFT workspaces grow with it and the memory headroom 7B had disappears: even at 8K the stock baseline already needs 92.0 GB, past an 80GB H100. These are H200 numbers throughout, which makes 40B the clean test of the second regime, where the kernels stop being a speedup and become the only way the forward pass runs at all:

Sequence Length (tokens)Stock (ms)Kernels (ms)SpeedupStock Peak (GB)Kernels Peak (GB)
81922294.11904.01.20x92.0086.50
3276813593.59106.81.49x121.0799.06
65536OOM26119.0UnlockedOOM120.11

Two reads. The speedup holds: 1.20x at 8K rising to 1.49x at 32K, the same shape as 7B and for the same reason, HCL’s share growing with context. And the unlock returns: at 65K the stock 40B forward runs an H200’s 141GB out of memory, while the kernel path completes in 120.1 GB.

This is the memory-relative nature of the unlock made concrete. The kernel is a speedup when the baseline fits and an unlock when it does not, and the boundary between the two climbs with available memory. On an 80GB H100 it sits at 7B/131K. Give the workload an H200’s 141GB and 7B/131K slides back into the speedup regime (the 1.82x above); the unlock reappears one model size up, at 40B/65K. Same kernel, same mechanism (the avoided buffer), relocated by the hardware it runs on.

The 40B sweep stops at 65K for a reason outside the kernels: at 131K each cuFFT transform crosses the 2^31-element limit of its 32-bit indexing, and torch.fft raises rather than falling back to 64-bit math.The limit lives in cuFFT / torch.fft, not the kernels: the Triton path fuses the elementwise glue around the same transforms the stock engine already calls, so it inherits the ceiling rather than introducing it. 7B clears 131K because it has half the channels and stays under the limit; 40B’s wider hidden size tips the same transform over it. Lifting it would mean a blocked overlap-add FFT or the 64-bit cuFFT API — both out of scope for a signature-preserving contribution.

Correctness

A speed number is worthless if the logits drift. The kernels are validated end-to-end against the stock computation in the model’s native bf16 inference dtype, not a forgiving fp32 harness. The procedure:

  1. Load one checkpoint.
  2. Run the forward pass twice at a fixed seed: once with every kernel off (the baseline logits), once with every kernel on.
  3. Compare the two logit tensors on the metrics below.

The metrics, each a column in the table:7B is validated on the H100 and 40B on the H200, each on its headline card. 7B coverage stops at 32K: 65K is omitted because every metric holds or improves from 8K to 32K, the two shorter lengths the conservative bound; 131K has no baseline to diff against on the 80GB H100, where the stock pass runs out of memory. 40B is checked at 8K, the largest length at which its stock and kernel passes both fit alongside each other.

  • Max / mean absolute difference over all positions and vocabulary entries.
  • Cosine similarity of the baseline and kernel logit vectors, for the last token and averaged over the sequence.
  • Argmax match rate: the fraction of positions where both paths predict the same top token.
ModelSequence Length (tokens)Max Absolute DifferenceMean Absolute DifferenceCosine (Last Token)Cosine (Sequence Mean)Argmax Match Rate
7B81923.180.03040.99999980.99979899.988%
7B327681.940.02920.99999980.99999599.997%
40B81922.440.03810.99999980.99999998.511%

A cosine similarity of 0.9999998 on the last-token logits and argmax agreement of at least 99.988% are the important numbers: the sampling distribution is preserved and next-token decisions almost always match. The maximum absolute difference reaches 3.18 in bf16 logit units, which looks alarming but is roundoff concentrated in low-impact cells, with top-token decisions matching at all but roughly one position in each checked run. On 40B the last-token cosine is identical at 0.9999998 and the sequence-mean cosine is 0.999999; argmax agreement is 98.5%, lower than 7B because a deeper bf16 stack accumulates more roundoff, but the next-token distribution still holds.

This is the check no unit test can replace. Per-kernel checks confirm each kernel matches its reference in isolation; only an end-to-end logit comparison through the real model, in the real dtype, at real sequence lengths, catches the way small per-block deltas compose across the full stack of layers. An end-to-end comparison is the artifact a reviewer would want to see, so it is the one this work provides.

Reproducibility

Everything above is scripted in the companion harness. The commands that produce the tables:

# Forward-pass timing and peak memory at each context length.
# On an 80GB H100 the stock 131K pass is recorded as OOM; on an
# H200 it runs, giving the head-to-head 1.82x at 131K.
pixi run python -m benchmarks.profile_evo2 \
    --seq-lens 8192 32768 65536 131072 \
    --triton hcs,hcm,hcl

# The same profiler on the 40B checkpoint, against its own memory ceiling
pixi run python -m benchmarks.profile_evo2 \
    --models evo2_40b \
    --seq-lens 8192 32768 65536 \
    --triton hcs,hcm,hcl

# End-to-end logit agreement, stock vs all kernels on, fixed seed
pixi run python -m benchmarks.correctness_evo2 \
    --models evo2_7b --seq-lens 8192 32768 --seed 0
pixi run python -m benchmarks.correctness_evo2 \
    --models evo2_40b --seq-lens 8192 --seed 0

Each run writes a JSON with a run_meta envelope pinning the GPU, driver, CUDA, torch, triton, and both repo SHAs.

Road Ahead

Two threads are open.

  1. Evo 2 40B at scale. The single-GPU 40B sweep above confirms the kernels carry to the larger checkpoint. They are per-block operations, agnostic to tensor and pipeline parallelism, so a sharded 40B deployment should benefit identically per-rank.
  2. Fusing the FFT. HCM and HCL still defer the three transforms to cuFFT, which is the right call today. A fused real-FFT path that keeps the spectral data resident is the obvious next lever if the launch-and-transfer overhead around cuFFT becomes the bottleneck.

The Lesson

The reference was correct and clear, but inefficient only at the long-context operating point. (residues * exp(...)).sum(1) is the clearest possible way to express a modal sum, and at shorter sequences it is effectively free. At 131K tokens, the same expression materialises a 34 GB buffer for a reduction that only needs O(D·L) output. The fix came from profiling the forward pass, reading the reference line by line, and tiling the one sum that did not need to be stored.

Harness | PR | Vortex | Evo 2