← All posts
DataFusion · 3 min read

Zero-Copy Strings in Apache DataFusion: How StringViewArray Boosted Performance by 8%

How StringViewArray cut the copy tax on string operations and lifted ClickBench performance by 8%.

What if you could perform complex string operations like substr, trim, and split_part without copying a single byte of string data?

In Apache DataFusion, the introduction of StringViewArray (and its sibling BinaryViewArray) did exactly that. By moving away from the traditional offset-based string storage to a “view-based” architecture, DataFusion achieved an 8% performance improvement across the entire ClickBench suite just by enabling it for Parquet reads.

The Problem: The “Copy Tax” of Traditional Strings

Traditionally, Arrow strings (StringArray) use two buffers:

  1. An offsets buffer (e.g., [0, 5, 11])
  2. A values buffer (e.g., "HelloWorld")

This design is simple but has a hidden “copy tax.” If you want to take a substring or trim a string, you almost always have to allocate a new values buffer and copy the bytes into it. In a high-performance query engine processing billions of rows, these copies add up to massive CPU and memory pressure.

The Setup: StringViewArray

StringViewArray implements the “German Style” string optimization. Instead of offsets, each string is represented by a 16-byte “view”:

For strings $\le$ 12 bytes, the entire string is inlined directly into the 16-byte view, requiring zero additional memory lookups.

The Key Result: ClickBench & Beyond

When DataFusion enabled StringViewArray by default for Parquet, the results were immediate:

MetricTraditional StringArrayStringViewArrayImprovement
ClickBench SuiteBaseline-8% Runtime8% Faster
starts_with (Scalar)Baseline~1.5x FasterInlined Prefix
substr / split_partHeavy CopiesZero-CopyO(1) vs O(N)

Why It Happens: The Magic of Zero-Copy

The real power of StringViewArray is that it allows “virtual” transformations.

1. Zero-Copy Substrings

When you call substr(col, 1, 5), DataFusion doesn’t copy the bytes. It simply creates a new view pointing to the same underlying buffer but with a different offset and length.

// Conceptual zero-copy view creation
let sub_view = make_view(
    substr.as_bytes(), 
    original_view.buffer_index, 
    original_view.offset + start_offset
);

2. Fast Path Comparison

Because the first 4 bytes are inlined in the view, DataFusion can often determine that two strings are not equal without ever touching the actual string data in memory. This is a massive win for GROUP BY and JOIN operations.

3. Efficient Aggregation

Specialized structures like ArrowBytesViewMap allow COUNT DISTINCT and GROUP BY to intern unique strings without copying them into the hash map.

pub struct ArrowBytesViewMap<V> {
    /// Views for all stored values (zero-copy!)
    views: Vec<u128>,
    /// Completed buffers containing string data
    completed: Vec<Buffer>,
    // ...
}

Evidence: Real-World Performance

In the DataFusion codebase, we see this optimization applied across the board:

Practical Takeaways

  1. Inlining is King: For short strings ($\le$ 12 bytes), StringViewArray eliminates memory fragmentation and cache misses.
  2. Avoid the Copy Tax: By using views, complex string manipulations become metadata-only operations.
  3. Metadata-First Processing: Inlining prefixes allows the CPU to skip expensive memory accesses during comparisons.

Run It Yourself

You can see these optimizations in action in the Apache DataFusion repository. Check out the string benchmarks:

# Run the starts_with benchmark
cargo bench --bench starts_with

Conclusion

The shift to StringViewArray represents a fundamental shift in how DataFusion handles variable-length data. By treating strings as views rather than owned buffers, DataFusion has eliminated one of the most persistent bottlenecks in analytical processing.


Are you using Arrow’s View types in your Rust projects? The performance gains are hard to ignore.