Thursday, December 4, 2025

Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch


This post is part of a series of posts on the topic of analyzing and optimizing PyTorch models. Throughout the series, we have advocated for using the PyTorch Profiler in AI model development and demonstrated the potential impact of performance optimization on the speed and cost of running AI/ML workloads. One common phenomenon we have seen is how seemingly innocent code can hamper runtime performance. In this post, we explore some of the consequences associated with the naive use of variable-shaped tensors, that is, tensors whose shape depends on prior computations and/or inputs. While not applicable to all situations, there are times when the use of variable-shaped tensors can be avoided, although this may come at the expense of additional compute and/or memory. We will demonstrate the tradeoffs of these alternatives on a toy implementation of data sampling in PyTorch.

Three Downsides of Variable-Shaped Tensors

We motivate the discussion by presenting three disadvantages of the use of variable-shaped tensors:

Host-Device Sync Events

In an ideal scenario, the CPU and GPU are able to run in parallel in an asynchronous manner, with the CPU continuously feeding the GPU with input samples, allocating required GPU memory, and loading GPU compute kernels, and the GPU executing the loaded kernels on the provided inputs using the allocated memory. The presence of dynamic-shaped tensors throws a wrench into this parallelism. In order to allocate the appropriate amount of memory, the CPU must wait for the GPU to report the tensor's shape, and then the GPU must wait for the CPU to allocate the memory and proceed with the kernel loading. The overhead of this sync event can cause a drop in GPU utilization and slow runtime performance.
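
To make this concrete, here is a minimal sketch of our own (not from the original post) of how a single data-dependent operation introduces such a blocking point:

import torch

x = torch.randint(0, 2, (1_000_000,), device='cuda')

# static-shaped op: the CPU queues the kernel and moves on immediately
mask = x == 1

# dynamic-shaped op: the output size is unknown until the GPU finishes,
# so the CPU must wait before it can allocate the result (implicit sync)
idx = torch.nonzero(mask)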

We saw an example of this in part three of this series, where we studied a naive implementation of the common cross-entropy loss that included calls to torch.nonzero and torch.unique. Both APIs return tensors whose shapes are dynamic and depend on the contents of the input. When these functions are run on the GPU, a host-device synchronization event occurs. In the case of the cross-entropy loss, we discovered the inefficiency through the use of the PyTorch Profiler and were able to easily overcome it with an alternative implementation that avoided the use of variable-shaped tensors and demonstrated much better runtime performance.

Graph Compilation

In a recent post we explored the performance benefits of applying just-in-time (JIT) compilation using the torch.compile operator. One of our observations was that graph compilation provided much better results when the graph was static. The presence of dynamic shapes in the graph limits the extent of the optimization via compilation: in some cases, it fails entirely; in others, it results in lower performance gains. The same implications also apply to other forms of graph compilation, such as XLA, ONNX, OpenVINO, and TensorRT.
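
The snippet below is a small illustration of our own (not from the original post) of how a value-dependent output shape interferes with torch.compile. Exact behavior varies across PyTorch versions, but requesting a single graph with fullgraph=True typically surfaces the problem:

import torch

def dynamic_fn(x):
    # output length depends on the values of x
    return x[torch.nonzero(x > 0, as_tuple=True)[0]]

def static_fn(x):
    # fixed-shape alternative that zeroes out (rather than removes) entries
    return torch.where(x > 0, x, torch.zeros_like(x))

x = torch.randn(1000, device='cuda')

# the static version compiles into a single graph
torch.compile(static_fn, fullgraph=True)(x)

# the dynamic version typically triggers a graph break; with fullgraph=True
# this is reported as an error on the first call
try:
    torch.compile(dynamic_fn, fullgraph=True)(x)
except Exception as e:
    print(f'compilation issue: {type(e).__name__}')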

Data Batching

Another optimization we have encountered in several of our posts (e.g., here) is sample batching. Batching improves performance in two main ways:

  1. Reducing the overhead of kernel loading: Rather than loading the GPU kernels required for the computation pipeline once per input sample, the CPU can load the kernels once per batch.
  2. Maximizing parallelization across compute units: GPUs are highly parallel compute engines. The more we are able to parallelize computation, the more we can saturate the GPU and increase its utilization. By batching, we can potentially increase the degree of parallelization by a factor of the batch size (see the short sketch after this list).
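
The sketch below is a toy example of our own (not from the original post) that illustrates both effects: the batched form launches a single kernel over the whole batch instead of one small kernel per sample:

import torch

batch = torch.randn(32, 10_000, 16, device='cuda')
weight = torch.randn(16, 16, device='cuda')

# per-sample processing: 32 separate kernel launches, each under-utilizing the GPU
out_loop = torch.stack([sample @ weight for sample in batch])

# batched processing: one kernel launch over the entire batch
out_batched = batch @ weight

print(torch.allclose(out_loop, out_batched, atol=1e-4))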

Despite their downsides, the use of variable-shaped tensors is often unavoidable. But sometimes we can modify our model implementation to bypass them. Sometimes these modifications will be straightforward (as in the cross-entropy loss example). Other times they may require some creativity in coming up with a different sequence of fixed-shape PyTorch APIs that produce the same numerical result. Often, this effort can deliver meaningful rewards in runtime and cost.

In the next sections, we will study the use of variable-shaped tensors in the context of the data sampling operation. We will start with a trivial implementation and analyze its performance. We will then propose a GPU-friendly alternative that avoids the use of variable-shaped tensors.

To compare our implementations, we will use an Amazon EC2 g6e.xlarge instance with an NVIDIA L40S GPU, running an AWS Deep Learning AMI (DLAMI) with PyTorch (2.8). The code we will share is intended for demonstration purposes. Please do not rely on it for accuracy or optimality. Please do not interpret our mention of any framework, library, or platform as an endorsement of its use.

Sampling in AI Model Workloads

In the context of this post, sampling refers to the selection of a subset of items from a large set of candidates for the purposes of computational efficiency, balancing of data types, or regularization. Sampling is common in many AI/ML models, such as detection, ranking, and contrastive learning systems.

We define a simple variation of the sampling problem: given a list of N tensors, each with a binary label, we are asked to return a subset of K tensors containing both positive and negative examples, in random order. If the input list contains enough samples of each label (K/2), the returned subset should be evenly split. If it is lacking samples of one type, those should be filled with random samples of the other type.

The code block below contains a PyTorch implementation of our sampling function. The implementation is inspired by the popular Detectron2 library (e.g., see here and here). For the experiments in this post, we will fix the sampling ratio to 1:10.

import torch

INPUT_SAMPLES = 10000
SUB_SAMPLE = INPUT_SAMPLES // 10
FEATURE_DIM = 16

def sample_data(input_array, labels):
    device = labels.device
    positive = torch.nonzero(labels == 1, as_tuple=True)[0]
    negative = torch.nonzero(labels == 0, as_tuple=True)[0]
    num_pos = min(positive.numel(), SUB_SAMPLE // 2)
    num_neg = min(negative.numel(), SUB_SAMPLE // 2)
    if num_neg < SUB_SAMPLE // 2:
        num_pos = SUB_SAMPLE - num_neg
    elif num_pos < SUB_SAMPLE // 2:
        num_neg = SUB_SAMPLE - num_pos

    # randomly select positive and negative examples
    perm1 = torch.randperm(positive.numel(), device=device)[:num_pos]
    perm2 = torch.randperm(negative.numel(), device=device)[:num_neg]

    pos_idxs = positive[perm1]
    neg_idxs = negative[perm2]

    sampled_idxs = torch.cat([pos_idxs, neg_idxs], dim=0)
    rand_perm = torch.randperm(SUB_SAMPLE, device=labels.device)
    sampled_idxs = sampled_idxs[rand_perm]
    return input_array[sampled_idxs], labels[sampled_idxs]

Performance Analysis With PyTorch Profiler

Even when not immediately apparent, the use of dynamic shapes is easily identifiable in the PyTorch Profiler Trace view. We use the following function to enable the PyTorch Profiler:

def profile(fn, input, labels):

    def export_trace(p):
        p.export_chrome_trace(f"{fn.__name__}.json")

    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA],
            with_stack=True,
            schedule=torch.profiler.schedule(wait=0, warmup=10, active=5),
            on_trace_ready=export_trace
    ) as prof:
        for _ in range(20):
            fn(input, labels)
            torch.cuda.synchronize()  # explicit sync for trace readability
            prof.step()

# create random input
input_samples = torch.randn((INPUT_SAMPLES, FEATURE_DIM), device='cuda')
labels = torch.randint(0, 2, (INPUT_SAMPLES,),
                       device='cuda', dtype=torch.int64)

# run with profiler
profile(sample_data, input_samples, labels)

The image below was captured for an input size of ten million samples. It clearly shows the presence of sync events coming from the torch.nonzero call, as well as the corresponding drops in GPU utilization:

Profiler Trace of Sampler (by Author)

The use of torch.nonzero in our implementation is not ideal, but can it be avoided?

A GPU-Friendly Data Sampler

We propose an alternative implementation of our sampling function that replaces the dynamic torch.nonzero function with a creative combination of the static torch.count_nonzero, torch.topk, and other APIs:

def opt_sample_data(input, labels):
    pos_mask = labels == 1
    neg_mask = labels == 0
    num_pos_idxs = torch.count_nonzero(pos_mask, dim=-1)
    num_neg_idxs = torch.count_nonzero(neg_mask, dim=-1)
    half_samples = labels.new_full((), SUB_SAMPLE // 2)
    num_pos = torch.minimum(num_pos_idxs, half_samples)
    num_neg = torch.minimum(num_neg_idxs, half_samples)
    num_pos = torch.where(
        num_neg < SUB_SAMPLE // 2,
        SUB_SAMPLE - num_neg,
        num_pos
    )
    num_neg = SUB_SAMPLE - num_pos

    # create random ordering on pos and neg entries
    rand = torch.rand_like(labels, dtype=torch.float32)
    pos_rand = torch.where(pos_mask, rand, -1)
    neg_rand = torch.where(neg_mask, rand, -1)

    # select top pos entries and invalidate the others
    # since the CPU doesn't know num_pos, we assume the maximum to avoid a sync
    top_pos_rand, top_pos_idx = torch.topk(pos_rand, k=SUB_SAMPLE)
    arange = torch.arange(SUB_SAMPLE, device=labels.device)
    if num_pos.numel() > 1:
        # unsqueeze to support batched input
        arange = arange.unsqueeze(0)
        num_pos = num_pos.unsqueeze(-1)
        num_neg = num_neg.unsqueeze(-1)
    top_pos_rand = torch.where(arange >= num_pos, -1, top_pos_rand)

    # repeat for neg entries
    top_neg_rand, top_neg_idx = torch.topk(neg_rand, k=SUB_SAMPLE)
    top_neg_rand = torch.where(arange >= num_neg, -1, top_neg_rand)

    # combine and mix together positive and negative idxs
    cat_rand = torch.cat([top_pos_rand, top_neg_rand], dim=-1)
    cat_idx = torch.cat([top_pos_idx, top_neg_idx], dim=-1)
    topk_rand_idx = torch.topk(cat_rand, k=SUB_SAMPLE)[1]
    sampled_idxs = torch.gather(cat_idx, dim=-1, index=topk_rand_idx)
    # expand the index over the feature dim so full feature rows are gathered
    sampled_input = torch.gather(
        input, dim=-2,
        index=sampled_idxs.unsqueeze(-1).expand(*sampled_idxs.shape, input.shape[-1]))
    sampled_labels = torch.gather(labels, dim=-1, index=sampled_idxs)
    return sampled_input, sampled_labels
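
As a quick sanity check (our own addition, not part of the original post), we can verify that the two samplers return the same number of items and the same label balance for the same input. The specific items will differ, since the sampling is random:

inp = torch.randn((INPUT_SAMPLES, FEATURE_DIM), device='cuda')
lbl = torch.randint(0, 2, (INPUT_SAMPLES,), device='cuda', dtype=torch.int64)

_, labels_ref = sample_data(inp, lbl)
_, labels_opt = opt_sample_data(inp, lbl)

assert labels_ref.numel() == labels_opt.numel() == SUB_SAMPLE
assert labels_ref.sum() == labels_opt.sum()  # same number of positives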

Clearly, this function requires more memory and more operations than our first implementation. The question is: do the performance benefits of a static, synchronization-free implementation outweigh the additional cost in memory and compute?

To assess the tradeoffs between the two implementations, we introduce the following benchmarking utility:

def benchmark(fn, input, labels):
    # warm-up
    for _ in range(20):
        _ = fn(input, labels)

    iters = 100
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        _ = fn(input, labels)
    end.record()
    torch.cuda.synchronize()
    avg_time = start.elapsed_time(end) / iters

    print(f"{fn.__name__} average step time: {avg_time:.4f} ms")

benchmark(sample_data, input_samples, labels)
benchmark(opt_sample_data, input_samples, labels)

The following table compares the average runtime of each of the implementations for a variety of input sample sizes:

Comparative Step Time Performance, Lower is Better (by Author)

For most of the input sample sizes, the overhead of the host-device sync event is either comparable to or lower than the extra compute of the static implementation. Disappointingly, we only see a major benefit from the sync-free alternative when the input sample size reaches ten million. Sample sizes that large are uncommon in AI/ML settings. But it is not our tendency to give up so easily. As noted above, the static implementation enables additional optimizations such as graph compilation and input batching.

Graph Compilation

Unlike the original function, which fails to compile, our static implementation is fully compatible with torch.compile:

benchmark(torch.compile(opt_sample_data), input_samples, labels)
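
As an optional check (our own suggestion, not from the original post), passing fullgraph=True asks torch.compile to fail loudly if any graph breaks remain, which is a convenient way to confirm that the implementation is truly static:

benchmark(torch.compile(opt_sample_data, fullgraph=True), input_samples, labels)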

The following table includes the runtimes of our compiled function:

Comparative Step Time Performance, Lower is Better (by Author)

The results are significantly better, providing a 70–75 percent boost over the original sampler implementation in the 1–10 thousand range. But we still have one more optimization up our sleeve.

Maximizing Performance with Batched Input

Because the original implementation contains variable-shaped operations, it cannot handle batched input directly. To process a batch, we have no choice but to apply it to each input individually, in a Python loop:

BATCH_SIZE = 32

def batched_sample_data(inputs, labels):
    sampled_inputs = []
    sampled_labels = []
    for i in range(inputs.size(0)):
        inp, lab = sample_data(inputs[i], labels[i])
        sampled_inputs.append(inp)
        sampled_labels.append(lab)
    return torch.stack(sampled_inputs), torch.stack(sampled_labels)

In contrast, our optimized function supports batched inputs as is; no modifications are necessary.

input_batch = torch.randn((BATCH_SIZE, INPUT_SAMPLES, FEATURE_DIM),
                          device='cuda')
labels = torch.randint(0, 2, (BATCH_SIZE, INPUT_SAMPLES),
                       device='cuda', dtype=torch.int64)

benchmark(batched_sample_data, input_batch, labels)
benchmark(opt_sample_data, input_batch, labels)
benchmark(torch.compile(opt_sample_data), input_batch, labels)

The table below compares the step times of our sampling functions on a batch size of 32:

Step Time Performance on Batched Input, Lower is Better (by Author)

Now the results are definitive: by using a static implementation of the data sampler, we are able to boost performance by 2X–52X(!!) over the variable-shaped option, depending on the input sample size.

Note that although our experiments were run on a GPU device, the model compilation and input batching optimizations also apply to a CPU environment. Thus, avoiding variable shapes can have implications for AI/ML model performance on CPU as well.

Summary

The optimization process we demonstrated in this post generalizes beyond the specific case of data sampling:

  • Discovery via Performance Profiling: Using the PyTorch Profiler, we were able to identify drops in GPU utilization and uncover their source: the presence of variable-shaped tensors resulting from the torch.nonzero operation.
  • An Alternative Implementation: Our profiling findings allowed us to develop an alternative implementation that accomplished the same goal while avoiding the use of variable-shaped tensors. However, this step came at the cost of additional compute and memory overhead. As seen in our initial benchmarks, the sync-free alternative demonstrated worse performance on common input sizes.
  • Unlocking Further Potential for Optimization: The real breakthrough came because the static-shaped implementation was compilation-friendly and supported batching. These optimizations provided performance gains that dwarfed the initial overhead, leading to a 2X to 52X speedup over the original implementation.

Naturally, not all stories will end as happily as ours. In many cases, we may come across PyTorch code that performs poorly on the GPU but does not have an alternative implementation, or it may have one that requires significantly more compute resources. Nonetheless, given the potential for meaningful gains in performance and reductions in cost, the process of identifying runtime inefficiencies and exploring alternative implementations is an important part of AI/ML development.
