Profiling Rust: A Flamegraph vs PGO, BOLT, and Native CPU Targeting
One of the goals behind seqpacker, covered in the previous article, was matching or exceeding LightBinPack’s C++ performance. I expected that would require PGO (Profile-Guided Optimisation), BOLT (Binary Optimisation and Layout Tool), and architecture-specific compiler flags. I tried all three, and the results were not what I expected: PGO gives ~15% on unoptimised code, but once I profiled with a flamegraph and fixed the hotspots in source, PGO added nothing. BOLT showed no measurable improvement beyond PGO in either phase. Here is the full breakdown.
The Design Choices That Carry the Algorithm
seqpacker ships 11 algorithms, but the performance story has two layers. The first is data structure and algorithm design, which determines the complexity ceiling for each algorithm. The second is implementation-level profiling: release profile tuning, compiler flags, and source-level fixes that apply across the package regardless of algorithm.
This section covers the first layer using OBFD (Optimized Best-Fit Decreasing), the default and fastest single-threaded algorithm. The compiler-level and profiling findings in later sections apply across all 11 algorithms.
OBFD achieves O(N log L) time where N is item count and L is max bin capacity. Two design choices make that complexity possible.
Counting sort instead of comparison sort. Standard BFD (Best-Fit Decreasing) sorts items descending in O(N log N). But sequence lengths are bounded integers, typically 1–8192. Counting sort, a well-known non-comparison sort for integer keys, groups indices by length in O(N + L) and then iterates buckets from largest to smallest. The key insight for bin packing is that sequence lengths are small bounded integers, making O(N + L) faster than O(N log N) when L ≪ N log N:
```rust
pub fn counting_sort(lengths: &[usize], max_length: usize) -> Vec<Vec<usize>> {
    let mut buckets = vec![Vec::new(); max_length + 1];
    for (i, &length) in lengths.iter().enumerate() {
        buckets[length].push(i);
    }
    buckets
}
```
For 10,000 sequences with L = 4096, this replaces ~130,000 comparisons with a single linear pass.
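To make the descending walk concrete, here is a self-contained sketch (the counting_sort body is copied from above; main and the sample lengths are illustrative):

```rust
pub fn counting_sort(lengths: &[usize], max_length: usize) -> Vec<Vec<usize>> {
    let mut buckets = vec![Vec::new(); max_length + 1];
    for (i, &length) in lengths.iter().enumerate() {
        buckets[length].push(i);
    }
    buckets
}

fn main() {
    let lengths = [3usize, 1, 4, 1, 5, 2];
    let buckets = counting_sort(&lengths, 5);
    // Walking buckets from the largest length down yields item indices
    // in descending-length order without a single comparison.
    let order: Vec<usize> = (0..buckets.len())
        .rev()
        .flat_map(|len| buckets[len].iter().copied())
        .collect();
    assert_eq!(order, vec![4, 2, 0, 5, 1, 3]);
}
```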
Capacity-indexed segment tree instead of bin-indexed. Standard BFD implementations index a segment tree by bin, answering best-fit queries in O(log B) per item, where B is the number of open bins and grows with input size. OBFD indexes by remaining capacity instead: the tree has L + 1 leaves, one per possible remaining-capacity value, and each leaf tracks whether any bin exists at that capacity. The capacity-indexed approach originates from LightBinPack's C++ implementation. A standard segment tree stores max values at internal nodes and answers range-max queries in O(log N); see e.g. cp-algorithms for background. A best-fit query walks the tree left-first (smaller capacities first) to find the tightest fit in O(log L):
```rust
#[inline(always)]
pub fn find_best_fit(&self, target: usize) -> Option<usize> {
    if self.tree[0] < target {
        return None;
    }
    let mut idx = 0;
    while idx < self.n - 1 {
        let left = 2 * idx + 1;
        let right = 2 * idx + 2;
        if self.tree[left] >= target {
            idx = left; // Prefer left (smaller capacity = tighter fit)
        } else {
            idx = right;
        }
    }
    Some(idx - (self.n - 1))
}
```
L is fixed and small (4096 for a typical training run). B can be 2,000–5,000 bins depending on the length distribution. The query cost is O(log L) regardless of input size, versus O(log B) that grows with every new bin. The tree rounds up to the next power of 2, so L = 4096 produces a tree with 8192 leaves and depth 13; the asymptotic class is the same, and the constant is one extra step. At L = 4096, the tree occupies ~128 KB with usize nodes, small enough to fit in L2 cache on modern CPUs, which matters more than the asymptotic difference.
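The mechanics are easier to see at a toy scale. The sketch below is a simplified stand-in for the real tree (the names and the set_available signature are illustrative, not seqpacker's API): each leaf stores its capacity value when at least one bin has that much room left, and internal nodes store the max of their children.

```rust
struct CapacityTree {
    n: usize,         // leaf count, rounded up to a power of two
    tree: Vec<usize>, // max available capacity in each subtree
}

impl CapacityTree {
    fn new(max_capacity: usize) -> Self {
        let n = (max_capacity + 1).next_power_of_two();
        CapacityTree { n, tree: vec![0; 2 * n - 1] }
    }

    // Mark whether any bin currently has `capacity` tokens of room left.
    fn set_available(&mut self, capacity: usize, available: bool) {
        let mut idx = self.n - 1 + capacity;
        self.tree[idx] = if available { capacity } else { 0 };
        while idx > 0 {
            idx = (idx - 1) / 2;
            self.tree[idx] = self.tree[2 * idx + 1].max(self.tree[2 * idx + 2]);
        }
    }

    // Tightest capacity >= target, walking left-first: O(log L).
    fn find_best_fit(&self, target: usize) -> Option<usize> {
        if self.tree[0] < target {
            return None;
        }
        let mut idx = 0;
        while idx < self.n - 1 {
            let left = 2 * idx + 1;
            idx = if self.tree[left] >= target { left } else { left + 1 };
        }
        Some(idx - (self.n - 1))
    }
}

fn main() {
    let mut tree = CapacityTree::new(7);
    tree.set_available(3, true);
    tree.set_available(5, true);
    assert_eq!(tree.find_best_fit(2), Some(3)); // tightest fit, not 5
    assert_eq!(tree.find_best_fit(4), Some(5));
    assert_eq!(tree.find_best_fit(6), None);    // nothing big enough
}
```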
These two choices, counting sort and capacity-indexed lookup, are the foundation. Everything that follows is about how far compiler tooling can push this base versus how far manual profiling can.
The Release Profile
```toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
panic = "abort"
strip = true
```
lto = "fat" is the single most impactful setting. It gives LLVM whole-program visibility across all crates: cross-module inlining, dead code elimination, and interprocedural optimisations that are impossible with the default lto = false. codegen-units = 1 forces the entire crate through one LLVM pipeline instead of splitting into parallel chunks: slower compilation, better codegen. See the Cargo reference on profiles for details on each setting. The combination of lto = "fat" and codegen-units = 1 gives LLVM the maximum optimisation surface at the cost of compile time.
panic = "abort" removes unwind tables. strip = true strips debug symbols. This is what ships in the PyPI wheel: portable, reproducible, no special tooling required.
Compiler-Level Optimisations on the Base Implementation
Before touching any source code, I measured what compiler tooling could do on its own. All benchmarks ran on Linux x86_64 (Ubuntu), with OBFD averaged across 7 datasets at 4 capacities, giving 28 data points per build. Datasets: Uniform, Log-Normal, Exponential, Bimodal (synthetic) and Alpaca, UltraChat, C4 (real). Capacities: 512, 1024, 2048, 4096. 10,000 sequences per run.
target-cpu=native
RUSTFLAGS="-C target-cpu=native" tells LLVM to use every instruction set the host CPU supports: AVX2, AVX-512, BMI2, and the rest. See rustc codegen options: target-cpu.
Result: ~0%, no measurable change (0.905 ms vs 0.904 ms baseline). Bin packing is integer arithmetic and pointer chasing through tree structures. There are no float arrays to vectorise, no data-parallel loops for SIMD to exploit. The resulting binary also only runs on machines with the same CPU features, making it unusable for PyPI wheels.
PGO (Profile-Guided Optimisation)
Build an instrumented binary, run a representative workload, feed the profile back to the compiler, and rebuild. See The rustc book: Profile-Guided Optimization. The seqpacker PGO script instruments with -Cprofile-generate, runs 100K sequences across five strategies, merges with llvm-profdata, and rebuilds with -Cprofile-use. On the unoptimised codebase:
| Build | Avg time (ms) | vs Baseline |
|---|---|---|
| Baseline release | 0.904 | — |
| Native CPU | 0.905 | ~0% |
| PGO | 0.767 | 15.2% faster |
| PGO+BOLT | 0.770 | 14.8% faster |
| Code-optimised release | 0.757 | 16.3% faster |
PGO helps here. The compiler learns which branches are hot and cold, reorders basic blocks, and makes better inlining decisions. But notice the last row: code-level optimisation alone, with no PGO, gives 16.3%. The manual fixes slightly outperform the compiler-automated ones.
BOLT (Binary Optimisation and Layout Tool)
Post-link optimisation that reorganises the binary layout based on runtime profiles. The profiling step requires Linux perf, making the full pipeline Linux-only in practice. See LLVM BOLT. The build script records with perf, converts to BOLT format via perf2bolt, then applies --reorder-blocks=ext-tsp and --reorder-functions=hfsort.
Result: no measurable improvement over PGO alone (0.770 ms vs 0.767 ms). BOLT’s strength is reorganising large binaries with many cold paths: web browsers, databases, compilers. The seqpacker critical path is a segment tree traversal that fits in L1 cache. There is no layout problem to solve.
Profiling the Base Implementation
PGO found ~15% through compiler automation. The next step was to see whether the same gains could be made permanent in source through manual profiling.
On Linux: cargo-flamegraph wraps perf and produces an interactive SVG. Compile with the profiling profile (debug = true, strip = false) to get readable function names in the output; seqpacker's [profile.profiling] inherits from release with those overrides.
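A profile along those lines might look like this in Cargo.toml (a sketch; seqpacker's actual profile may differ):

```toml
[profile.profiling]
inherits = "release"
debug = true   # keep symbols so the flamegraph shows function names
strip = false  # do not strip them back out
```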
On macOS: samply gives a Firefox Profiler-compatible flamegraph with no setup beyond cargo install samply. This is what I used. samply uses macOS dtrace under the hood and outputs to the Firefox Profiler web UI. For deeper analysis, cargo-instruments wraps Xcode's Instruments for Time Profiler and Allocations views.
On Windows: cargo-xperf wraps ETW (Event Tracing for Windows) for CPU sampling. Windows Performance Analyzer (WPA) provides the visual analysis layer. Tracy and Superluminal are also options. I did not profile on Windows for this project.
For benchmarks: Criterion via cargo bench isolates individual algorithm paths with statistical rigour. For end-to-end wall-clock timing against real datasets, a Python benchmark script calls the PyO3 bindings directly.
The flamegraph on the unoptimised code revealed five hotspots, each with a corresponding fix.
Hot Spots and the Code-Level Fixes
1. Heap allocation in the inner loop (all algorithms). realloc showed up as a visible slice of runtime. Every Vec::push that exceeded its allocation triggered a reallocation.
Fix: pre-allocate all vectors to estimated size with Vec::with_capacity. The bin count estimate is cheap (total_tokens / capacity + 1) and eliminates reallocation for typical inputs:
```rust
let total_tokens: usize = lengths.iter().sum();
let estimated_bins = (total_tokens / capacity) + 1;
let mut bins_remaining: Vec<usize> = Vec::with_capacity(estimated_bins);
let mut bins_items: Vec<SmallVec<[usize; 8]>> = Vec::with_capacity(estimated_bins);
```
2. Per-bin heap allocation (all algorithms). Most bins hold fewer than 8 sequences. Vec<usize> allocates on the heap for every bin, even for a single item.
Fix: SmallVec<[usize; 8]> keeps the common case on the stack. Only bins exceeding 8 items spill to the heap. The threshold of 8 was chosen because typical LLM packing at capacity 4096 averages 3–6 sequences per bin:
```rust
pub items: SmallVec<[usize; 8]>,
```
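The mechanism SmallVec relies on can be sketched in plain std Rust. This toy SmallBuf (a made-up name, not the smallvec crate's implementation) keeps up to 8 items inline and spills to a heap-allocated Vec on the ninth push:

```rust
enum SmallBuf {
    Inline { data: [usize; 8], len: usize }, // common case: stack only
    Spilled(Vec<usize>),                     // rare case: heap allocation
}

impl SmallBuf {
    fn new() -> Self {
        SmallBuf::Inline { data: [0; 8], len: 0 }
    }

    fn push(&mut self, v: usize) {
        match self {
            SmallBuf::Inline { data, len } if *len < 8 => {
                data[*len] = v;
                *len += 1;
            }
            SmallBuf::Inline { data, len: _ } => {
                // Ninth push: all 8 inline slots are full; spill once.
                let mut vec = data.to_vec();
                vec.push(v);
                *self = SmallBuf::Spilled(vec);
            }
            SmallBuf::Spilled(vec) => vec.push(v),
        }
    }

    fn len(&self) -> usize {
        match self {
            SmallBuf::Inline { len, .. } => *len,
            SmallBuf::Spilled(vec) => vec.len(),
        }
    }

    fn is_spilled(&self) -> bool {
        matches!(self, SmallBuf::Spilled(_))
    }
}

fn main() {
    let mut buf = SmallBuf::new();
    for i in 0..8 {
        buf.push(i);
    }
    assert!(!buf.is_spilled()); // 8 items still live on the stack
    buf.push(8);
    assert!(buf.is_spilled()); // the ninth push triggers one heap allocation
    assert_eq!(buf.len(), 9);
}
```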
3. Missed inlining on segment tree methods (OBFD). CapacitySegmentTree::update and CapacitySegmentTree::find_best_fit accounted for ~15% of runtime. LLVM chose not to inline them, a reasonable heuristic for general code but a wrong call for a 10-line method called N times in a tight loop.
Fix: #[inline(always)] on every method in the hot path. This forces LLVM to inline regardless of its cost model, eliminating the function call overhead:
```rust
#[inline(always)]
pub fn update(&mut self, capacity: usize, value: usize) {
    // ...
}
```
4. Early termination in tree propagation (OBFD). After placing an item, the segment tree propagates changes upward from leaf to root. If a parent’s value does not change (common when other bins share the same capacity range) the rest of the propagation is wasted.
Fix: Break as soon as the parent value is unchanged. This turns worst-case O(log L) propagation into O(1) for the common case:
```rust
while idx > 0 {
    idx = (idx - 1) / 2;
    let new_val = self.tree[2 * idx + 1].max(self.tree[2 * idx + 2]);
    if self.tree[idx] == new_val {
        break; // No change: skip remaining ancestors
    }
    self.tree[idx] = new_val;
}
```
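The effect is easy to quantify on a toy tree. The sketch below (illustrative names, not seqpacker's code) counts how many ancestors a propagation touches with and without the break, when a sibling leaf already holds the parent's max:

```rust
// Propagate a leaf update toward the root, optionally stopping as soon
// as a parent's value is unchanged. Returns the number of parents visited.
fn propagate(tree: &mut [usize], mut idx: usize, early_break: bool) -> usize {
    let mut visited = 0;
    while idx > 0 {
        idx = (idx - 1) / 2;
        visited += 1;
        let new_val = tree[2 * idx + 1].max(tree[2 * idx + 2]);
        if early_break && tree[idx] == new_val {
            break;
        }
        tree[idx] = new_val;
    }
    visited
}

fn main() {
    // 8 leaves (indices 7..15), all holding 5; internal nodes already 5.
    let make_tree = || vec![5usize; 15];

    // Lower one leaf to 4 while its sibling still holds 5: the parent's
    // max is unchanged, so early termination stops after one step.
    let mut t1 = make_tree();
    t1[7] = 4;
    assert_eq!(propagate(&mut t1, 7, true), 1);

    // Without the break, every ancestor up to the root is visited.
    let mut t2 = make_tree();
    t2[7] = 4;
    assert_eq!(propagate(&mut t2, 7, false), 3);
}
```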
5. Cold-path annotation for bin creation (all algorithms). Opening a new bin is rare relative to placing into an existing one.
Fix: Marking open_new_bin as #[cold] tells LLVM to optimise the placement path (the common case) at the expense of bin creation (the uncommon case):
```rust
#[cold]
fn open_new_bin(
    size: usize,
    orig_idx: usize,
    capacity: usize,
    // ...
) { /* ... */ }
```
Combined, these code-level changes gave 16.3% improvement over the baseline release build, slightly more than PGO’s 15.2%, without any compiler tooling.
Re-Applying Compiler Optimisations
After the code-level fixes, I re-ran the same compiler pipeline on the optimised codebase:
| Build | Avg time (ms) | vs Release |
|---|---|---|
| Release (optimised code) | 0.732 | — |
| Native CPU | 0.771 | 5.3% slower |
| PGO | 0.741 | 1.2% slower |
| PGO+BOLT | 0.745 | 1.8% slower |
| Latest code optimisations | 0.643 | 12.1% faster |
The last row is not a compiler flag. It reflects a further round of source-level changes applied after the initial five fixes, measured with the same plain release build. The "latest code optimisations" row includes refinements to the capacity-to-bins mapping and tighter pre-allocation estimates, built with the same [profile.release] settings as the baseline. No PGO, no BOLT, no target-cpu.
PGO went from 15.2% faster to 1.2% slower. The hot paths are already inlined. The allocations are gone. The branches the profiler would have reordered are already fast. PGO’s profile data now points at paths that are already optimised, and the overhead of processing those stale hints produces a marginal regression.
target-cpu=native is 5.3% slower on average, and up to 8% slower at higher capacities. This is consistent across all datasets. The likely cause: LLVM’s instruction selection for wider registers (AVX-512 or AVX2) introduces register pressure and transition penalties on workloads that do not benefit from wider operations. For integer-heavy tree traversals, the generic instruction mix is better.
The total improvement from baseline (0.904 ms) to final (0.643 ms) is 28.9%, all from source-level changes and standard compiler settings.
The Full Picture
| Technique | Unoptimised code | Optimised code | Portable | CI complexity |
|---|---|---|---|---|
| Release profile (LTO, codegen-units=1) | Baseline | Baseline | Yes | None |
| target-cpu=native | ~0% | 5% slower | No | Low |
| PGO | 15% faster | 1% slower | Yes | High |
| BOLT | No measurable gain over PGO | ~0% | No (Linux perf) | Very high |
| Code optimisations (flamegraph-driven) | 16% faster | N/A | Yes | None |
On code that has not been profiled by a human, PGO’s automation is worth ~15%. Once the same hotspots have been identified and fixed in source, PGO’s profile data points at paths that are already optimised, producing no improvement or marginal regression.
The source-level fixes are permanent, portable, visible in code review, and carry no CI cost. For a library shipping prebuilt wheels on PyPI across three platforms, the tradeoff was clear: source-level optimisation only, no PGO in the build pipeline.
What Is Left on the Table
Two directions remain open.
CPU-specific builds. The PyPI wheel ships with generic x86_64 and aarch64 targets. Platform-specific builds with AVX2 paths for Intel/AMD or NEON intrinsics for Apple Silicon could benefit the parallel OBFDP (Optimized Best-Fit Decreasing Parallel) variant where Rayon distributes work across cores. The target-cpu=native regression on single-threaded OBFD suggests the ceiling is low or negative for scalar code, but parallel workloads with larger datasets may behave differently.
Lower-level memory control. The core crate uses #![forbid(unsafe_code)] today. An arena allocator for bin metadata could reduce allocator pressure further by avoiding per-bin SmallVec overhead entirely. unsafe access to the segment tree array with unchecked indexing would eliminate bounds checks in the inner loop. Both trade safety guarantees for speed. Neither is justified yet, as the current implementation is already ~1.2x faster than LightBinPack’s C++ core, but they are the next lever if the workload scales to millions of sequences.
The Lesson
Profile the code before reaching for compiler tools. A flamegraph costs 10 minutes. PGO costs a build pipeline. BOLT costs a Linux perf workflow. All three attempt the same thing: identifying and optimising hot paths. The flamegraph gives that answer directly, and the fix goes permanently into source.
GitHub | PyPI | crates.io | Python Docs | Rust Docs