SeqPacker: 11 Bin-Packing Algorithms in Rust for LLM Sequence Packing

LLM training wastes a lot of compute on padding.

Most datasets do not give you fixed-length samples. You get one example at 120 tokens, the next at 2,000, the next at 600, and the training stack still wants fixed shapes somewhere down the line. The standard answer is padding: make everything match the longest sample in the batch and move on.

It works, but part of every step is now spent processing tokens that do not contribute anything. NVIDIA’s NeMo-RL docs (“Sequence Packing and Dynamic Batching”) put the waste at 50–70% of computation spent on padding tokens for skewed length distributions. The NeMo Framework User Guide’s “Sequence Packing” page calls out two specific inefficiencies: “Computation performed on the pad values is eventually ignored for model output, resulting in wasted FLOPs,” and “Micro batch size is often limited by the batch which contains longer sequences, so that most other micro batches have underutilized GPU memory.” Hugging Face (“Packing with Flash Attention 2”) reports a 2x throughput increase on short-sequence datasets (FLAN) and 1.4x on longer ones (OrcaMath) when packing with Flash Attention 2, with no convergence degradation; the benchmarks ran on 8x A100-80GB with llama2-7B, mistral-7B, and granite-8B-code, and also showed a 20% peak-memory reduction on FLAN.

That is the first reason this package exists.

The second is that packing inside training frameworks is still a moving target. TRL has had a recurring trail of packing bugs:

  • ConstantLengthDataset reported the wrong length, causing training to end mid-epoch (#943, Nov 2023: len() returned 9,846 but iteration yielded 4,283 examples; related: #475, #761).
  • Users flagged that attention masks might cross-contaminate packed sequences (#3067, Mar 2025, still open: without block-diagonal masking, packed sequences leak attention across unrelated samples).
  • A mid-2025 report showed packing was silently broken because Transformers’ _remove_unused_columns() dropped seq_lengths before the collator could use it (#3705, Jul 2025, titled “Packing is currently broken?”: the collator generated position_ids as if each packed sequence were a single contiguous sequence).
  • In early 2026, a fix for BFD token loss introduced re-queuing of truncated fragments, which broke SFT datasets by packing nonsensical conversation scraps as standalone sequences (#5075, Feb 2026: the v0.27.0 fix for #4554 re-queued truncated tails, producing fragments like [“you”, ”!”, “<end_of_turn>”] as standalone packed sequences).

Separately, #3645 found the “ffd” strategy was mislabelled (it implemented BFD, not FFD), and #3887 found BFD silently failed to pack for certain sequence lengths. None of this means TRL is unusable, but it does indicate there is room for a standalone package that does this job independently and exposes the algorithms directly instead of hiding them behind trainer internals.

That became seqpacker.

The Problem Visually

Real datasets are messier, but the failure mode is the same. If your length distribution is skewed, padding burns memory bandwidth and compute on zeros. Packing turns that into a bin-packing problem: given a fixed capacity, place variable-length items into as few bins as possible.
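
To make the waste concrete, here is a toy batch with made-up lengths (chosen to be skewed, not taken from any benchmark). Padding every sample to the longest one, the wasted fraction falls squarely in the 50–70% range the NeMo docs describe:

```python
# Toy illustration: how much of a padded batch is spent on pad tokens.
# Lengths are hypothetical, picked to show a skewed distribution.
lengths = [120, 2000, 600, 350, 90, 1400]
max_len = max(lengths)

total_slots = max_len * len(lengths)    # tokens actually computed after padding
real_tokens = sum(lengths)              # tokens that carry signal
pad_fraction = 1 - real_tokens / total_slots

print(f"padded slots: {total_slots}, real tokens: {real_tokens}")
print(f"wasted on padding: {pad_fraction:.0%}")   # → wasted on padding: 62%
```

With a capacity of 2048, the same six samples fit in three packed bins instead of six padded rows.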

Finding the optimal packing is NP-hard: it reduces from 3-PARTITION, so no polynomial-time exact solver exists unless P = NP. (The decision version, “can these items fit in k bins of capacity C?”, is NP-complete; the optimisation version, minimising bins, is NP-hard. See Garey and Johnson (1979), Computers and Intractability.) In practice, what matters is getting close enough, fast enough, that packing itself does not become the bottleneck.

What SeqPacker Is

seqpacker is a bin-packing library written in Rust with Python bindings via PyO3. Rust gives type safety, cross-platform compilation, and runtime speed; PyO3 and maturin produce prebuilt wheels for macOS, Linux, and Windows — pip install seqpacker with no build dependencies.

The LLM use case — pack variable-length tokenised samples into fixed-capacity bins for training or fine-tuning — is the obvious application, but the implementation is not tied to one trainer or one workflow. The same core supports three operating modes:

  • Offline Packing — all sequence lengths are known up front. Sort, reorder, and optimise for the tightest possible bin utilisation. This is the natural fit for preprocessing a tokenised corpus before training.
  • Buffered Packing — pack in chunks when the full dataset is too large to hold in memory or when fully offline preprocessing is inconvenient. Each chunk is packed independently, then results are concatenated.
  • Online Packing — make placement decisions one item at a time as sequences arrive. No lookahead, no reordering.
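
The buffered mode's chunk-then-concatenate logic can be sketched in plain Python. This is illustrative only (a simple first-fit heuristic standing in for seqpacker's Rust internals; `first_fit` and `buffered_pack` are names invented for this sketch):

```python
# Buffered packing sketch: pack each chunk independently, concatenate results.
# Plain-Python illustration, not seqpacker's actual implementation.

def first_fit(lengths, capacity):
    bins = []                              # each bin is a list of lengths
    for n in lengths:
        for b in bins:
            if sum(b) + n <= capacity:     # leftmost bin with room
                b.append(n)
                break
        else:
            bins.append([n])               # no bin fits: open a new one
    return bins

def buffered_pack(stream, capacity, chunk_size):
    bins, chunk = [], []
    for n in stream:
        chunk.append(n)
        if len(chunk) == chunk_size:       # chunk full: pack it independently
            bins.extend(first_fit(chunk, capacity))
            chunk = []
    if chunk:                              # flush the final partial chunk
        bins.extend(first_fit(chunk, capacity))
    return bins

bins = buffered_pack([120, 2000, 600, 350, 90, 1400, 700],
                     capacity=2048, chunk_size=4)
print(bins)
```

The trade-off is visible in the structure: each chunk is packed with no knowledge of the others, so utilisation sits between fully offline and fully online.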

These three modes push you towards fundamentally different algorithms. They also surface a subtler distinction: “online” in the algorithmic sense and “streaming” in the systems sense are not the same thing. The Online ≠ Streaming section below expands on why.

Existing Options

Trainer-integrated packing (TRL, NeMo) is convenient but coupled to trainer behaviour. Pure Python libraries such as prtpy and binpacking (both general-purpose bin-packing libraries) achieve good packing efficiency but are too slow at scale for an offline pipeline; the same holds for greedy_ffd, my own Python FFD baseline used for benchmarking. The closest public package to what I wanted was LightBinPack, a C++ implementation with Python bindings (pybind11) aimed at LLM training: fast and clearly built for this use case, but effectively Linux-focused today.

Efficiency is mostly commoditised — on the same dataset, FFD and BFD variants from different packages all land at the same ~98.76% utilisation because the heuristic dominates, not the implementation. The real differentiators are runtime, algorithm choice, and how practical the package is to install and use. The full runtime comparison is in the Benchmarks section below.

seqpacker’s goal was to match LightBinPack’s performance class while keeping a clean install story across macOS, Linux, and Windows. The Rust toolchain around PyO3 and maturin made that realistic.

11 Algorithms Across Three Constraints

I implemented 11 algorithms. Not because I expect most users to try all 11, but because sequence packing has three different operating constraints, and they push you towards different algorithms.

Offline algorithms see the full input up front. They sort, reorder, and make better placement decisions.

Algorithm            Short   Approx. Ratio   Notes
FirstFitDecreasing   FFD     1.22            Good offline default
BestFitDecreasing    BFD     1.22            Tighter offline packing
FirstFitShuffle      FFS     ~1.3            Training randomness (NeMo-style)
ModifiedFFD          MFFD    1.18            Mixed-size distributions
OptimizedBFD         OBFD    1.22            Production default

Online algorithms place items one at a time as they arrive.

Algorithm    Short   Approx. Ratio   Notes
NextFit      NF      2.0             One open bin at a time
FirstFit     FF      1.7             Leftmost bin that fits
BestFit      BF      1.7             Tightest-fitting bin
WorstFit     WF      2.0             Even distribution
Harmonic-K   HK      ~1.69           Bounded-space, size-class bins

Parallel packing is its own category because the constraint changes: which heuristic parallelises cleanly enough to justify coordination overhead.

Algorithm      Short   Notes
ParallelOBFD   OBFDP   Rayon work-stealing, cyclical distribution

The same input packed three different ways shows why algorithm choice matters:

NextFit processes items in arrival order with one open bin. When an item does not fit, the bin closes and a new one opens. Fast, but wasteful — it never revisits.
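
The Next Fit rule fits in a few lines. A plain-Python sketch (illustrative only; seqpacker's actual implementation is in Rust):

```python
# Next Fit: keep exactly one open bin; when the next item does not fit,
# close the bin for good and open a new one.

def next_fit(lengths, capacity):
    bins, current = [], []
    for n in lengths:
        if current and sum(current) + n > capacity:
            bins.append(current)           # close the open bin, never revisit
            current = []
        current.append(n)
    if current:
        bins.append(current)
    return bins

print(next_fit([5, 7, 5, 2, 4, 2], capacity=10))
# → [[5], [7], [5, 2], [4, 2]]  (4 bins)
```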

FirstFitDecreasing sorts largest-first, then scans left to right for the first bin with room. Fewer bins, tighter packing.
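
On the same toy input, sorting largest-first before placing items saves a bin. Another plain-Python sketch, not seqpacker's code:

```python
# First Fit Decreasing: sort largest-first, then place each item into the
# leftmost bin with enough remaining capacity.

def first_fit_decreasing(lengths, capacity):
    bins, free = [], []                    # free[i] = remaining room in bins[i]
    for n in sorted(lengths, reverse=True):
        for i, room in enumerate(free):
            if n <= room:                  # leftmost bin that fits
                bins[i].append(n)
                free[i] -= n
                break
        else:
            bins.append([n])
            free.append(capacity - n)
    return bins

print(first_fit_decreasing([5, 7, 5, 2, 4, 2], capacity=10))
# → [[7, 2], [5, 5], [4, 2]]  (3 bins, vs 4 for Next Fit)
```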

OptimizedBFD sorts the same way but skips the scan — a capacity-indexed segment tree jumps directly to the best-fitting bin. Same packing quality, less work per placement.
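
The jump-to-best-fit idea can be sketched without the segment tree. This simplification (an assumption for illustration, not seqpacker's actual data structure) keeps bins in a list sorted by remaining capacity and binary-searches for the tightest fit:

```python
import bisect

# Best Fit Decreasing with a fast lookup. seqpacker's OBFD uses a
# capacity-indexed segment tree; here a sorted list plus binary search
# stands in for it: jump straight to the tightest bin that still fits,
# instead of scanning every bin.

def best_fit_decreasing(lengths, capacity):
    bins = []
    free = []   # sorted list of (remaining_capacity, bin_index)
    for n in sorted(lengths, reverse=True):
        i = bisect.bisect_left(free, (n, -1))   # tightest bin with room >= n
        if i < len(free):
            room, idx = free.pop(i)
            bins[idx].append(n)
            bisect.insort(free, (room - n, idx))
        else:
            bins.append([n])
            bisect.insort(free, (capacity - n, len(bins) - 1))
    return bins

print(best_fit_decreasing([5, 7, 5, 2, 4, 2], capacity=10))
# → [[7, 2], [5, 5], [4, 2]]
```

Each placement is a search plus two sorted-structure updates rather than a scan over all bins, which is where the per-placement savings come from.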

Online ≠ Streaming

Only two of the online algorithms are supported in the streaming API, which is deliberate. A dataloader-style stream wants bounded state, predictable memory, and a way to emit bins without keeping a long tail of partially filled bins alive — not just one-pass processing. Some online heuristics process items sequentially but still maintain an in-memory view of all active bins, which is a valid online algorithm but not a good streaming API.

The streaming subset is NextFit (one open bin) and Harmonic-K (one bin per size class). Both have bounded active state and can emit completed bins before seeing all input.
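
The bounded-state property is easiest to see in code. A plain-Python generator sketch of the Harmonic-K idea (illustrative only; class boundaries and the next-fit handling of the smallest class follow the textbook scheme, not necessarily seqpacker's exact implementation):

```python
# Harmonic-K sketch: items are grouped into K size classes by what fraction
# of the capacity they occupy, with exactly one open bin per class. Active
# state is bounded by K, so completed bins can be emitted as a stream.

def harmonic_k(stream, capacity, k=4):
    open_bins = {}                       # class j -> its single open bin
    for n in stream:
        j = min(capacity // n, k)        # class j: n in (cap/(j+1), cap/j]
        b = open_bins.setdefault(j, [])
        if j < k:
            b.append(n)
            if len(b) == j:              # a class-j bin holds j items
                yield open_bins.pop(j)
        else:                            # smallest class: next-fit the bin
            if sum(b) + n > capacity:
                yield open_bins.pop(j)
                b = open_bins.setdefault(j, [])
            b.append(n)
    yield from open_bins.values()        # flush partial bins at end of input

for packed in harmonic_k([6, 4, 3, 2, 6, 3, 2, 2], capacity=10, k=3):
    print(packed)
```

At any point, at most k bins are alive, which is exactly the bounded-state guarantee a streaming API needs.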

“Online” is an algorithm class. “Streaming” is an operational constraint.

Benchmarks

The difference between these packages is not packing quality — it is how quickly they get there.

Package        Language         Runtime (10K seq)   vs seqpacker     Efficiency
seqpacker      Rust (PyO3)      5 ms                baseline         98.76%
LightBinPack   C++ (pybind11)   6–7 ms              ~1.2x slower     98.76%
greedy_ffd     Python           2,000 ms            ~400x slower     98.76%
binpacking     Python           8,500 ms            ~1,700x slower   98.76%
prtpy          Python           9,500 ms            ~1,900x slower   98.76%

Runtime is the mean of 5 runs on 10,000 sequences across Alpaca, UltraChat, and C4; results were consistent across runs. Full per-algorithm breakdowns and dataset-level results are on the benchmark site, and reproduction scripts are in the repository.

Usage

from seqpacker import Packer

packer = Packer(capacity=4096, strategy="obfd")
result = packer.pack(sequence_lengths)

print(f"Bins: {result.num_bins}, Efficiency: {result.efficiency:.2%}")

If all you need is a good offline default, that is enough. Streaming, buffered packing, and algorithm comparison are exposed too — the point is not to force one policy on every workflow.

Road Ahead

Two directions are open on the algorithm side. The current Harmonic-K implementation achieves a ~1.69 asymptotic ratio (Lee and Lee, 1985, proved the bound of ~1.69103), but the bin-packing literature has refinements that push the bound lower while keeping bounded space: Refined Harmonic (Ramanan et al., 1989) improved it to ~1.63597, and Modified Harmonic (Woeginger and Zhang, 1993) pushed further to ~1.61562. Whether the practical gain justifies the added complexity for typical LLM length distributions is an open question.

The other direction is multi-GPU load balancing. Packing today produces bins, but says nothing about how those bins distribute across devices. If total token counts per GPU are unbalanced, the slowest rank becomes the bottleneck — the same variance problem that padding causes within a batch, but across data-parallel workers. A packing-aware distribution step is a natural extension if the need materialises.
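
Such a distribution step could be a standard greedy longest-processing-time schedule. A plain-Python sketch of that idea (hypothetical; `distribute` is not part of seqpacker today):

```python
import heapq

# Greedy LPT scheduling for bin-to-GPU assignment: visit bins largest-first
# and always hand the next bin to the GPU with the fewest total tokens.

def distribute(bin_sizes, num_gpus):
    heap = [(0, gpu) for gpu in range(num_gpus)]    # (total tokens, gpu id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]
    for size in sorted(bin_sizes, reverse=True):
        total, gpu = heapq.heappop(heap)            # least-loaded GPU so far
        assignment[gpu].append(size)
        heapq.heappush(heap, (total + size, gpu))
    return assignment

loads = distribute([4096, 4090, 3900, 3800, 2048, 1500], num_gpus=2)
print([sum(g) for g in loads])    # → [9944, 9490]
```

Even this simple greedy rule keeps per-rank token totals within a few percent of each other on this toy input, which is the variance the slowest rank would otherwise absorb.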

Once the algorithms were in place, the next question was how far the implementation itself could be pushed. The next article covers that — profiling Rust code, identifying where compiler optimisations land versus where manual code-level changes are needed, and the measurable results of each.

GitHub | PyPI | crates.io | Python Docs | Rust Docs