C++ / Performance

Why Parallel Code Doesn't Always Go Faster

The previous article showed real benchmarks from parallelizing six DSA problems with OpenMP. Some scaled well. Some barely moved. In every case, the speedup eventually plateaued — well below what the number of CPU cores might suggest is possible.

This article explains why. Adding #pragma omp parallel for to a loop is the easy part. Understanding what silently limits or breaks your parallel program is what separates a working parallel implementation from a correct and fast one.

1. Race Conditions

A race condition occurs when two or more threads access the same memory location concurrently, at least one of the accesses is a write, and the final result depends on the order in which the threads happen to execute. The program produces different, and usually wrong, output on different runs.

Consider a naive parallel frequency count:

int count = 0;

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    if (arr[i] == target)
        count++;  // RACE CONDITION
}

The operation count++ is not a single atomic step: it expands to a read-modify-write sequence. The thread loads the current value, increments it, and stores it back. If two threads load the same value before either has stored its result, one increment is silently lost. The final count is almost always lower than the correct answer, and it changes between runs.

Race conditions also appear in BFS when multiple threads try to mark the same node as visited, and in histogram building when threads write to the same bucket simultaneously. The code compiles cleanly and often runs without crashing — the corruption is silent, which makes these bugs particularly dangerous.

2. Fixing Races: Atomics vs Reductions

There are two main tools for eliminating race conditions in OpenMP. They solve the same problem but with very different performance characteristics.

Atomics

An atomic operation guarantees that a read-modify-write sequence completes without interruption from other threads:

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    if (arr[i] == target) {   // braces required: an OpenMP directive may not
        #pragma omp atomic    // be the immediate substatement of an if
        count++;
    }
}

This is correct, but slow under contention. Every atomic increment forces the threads to serialize on that single memory location, and the cache line holding count bounces between cores on every update. When the loop runs millions of iterations and many threads hit the condition, threads spend more time waiting for exclusive access to that location than doing actual work. The speedup from parallelism is eaten by contention.

Reductions

A reduction gives each thread its own private copy of the variable. Each thread increments its local copy independently. When the parallel region ends, all private copies are combined into the final result:

#pragma omp parallel for reduction(+:count)
for (int i = 0; i < n; i++) {
    if (arr[i] == target)
        count++;
}

No contention during the loop. The combination step happens once at the end. This is the correct default for commutative accumulation operations like sum, product, min, and max. In the counting benchmark from the previous article, using a reduction was what produced the clean 2.08× speedup.

Use atomics when you need to update a shared data structure that cannot be reduced — like inserting into a shared list or updating specific indices of an array. Use reductions whenever you are aggregating a single scalar result across iterations.
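As a sketch of the atomic case, consider a parallel histogram (the function name and bucket scheme here are illustrative, not from the benchmarks). Different iterations can land in the same bucket, so there is no single scalar to reduce; the per-bucket increment is the shared update that must be atomic:

```cpp
#include <cassert>
#include <vector>

// Parallel histogram: different iterations may target the same bucket,
// so each bucket update must be atomic. Assumes all values in data are
// in the range [0, num_buckets).
std::vector<int> parallel_histogram(const std::vector<int>& data, int num_buckets) {
    std::vector<int> hist(num_buckets, 0);
    #pragma omp parallel for
    for (long i = 0; i < (long)data.size(); i++) {
        #pragma omp atomic
        hist[data[i]]++;   // safe concurrent update to a shared bucket
    }
    return hist;
}
```

When buckets are numerous and contention per bucket is low, this is acceptable; when a few buckets are hot, per-thread private histograms merged at the end usually win.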

3. False Sharing

False sharing is one of the most counterintuitive performance problems in parallel programming. It can silently destroy performance even when your code has no race conditions and produces correct results.

CPUs do not read memory one byte at a time. They operate on cache lines — contiguous blocks of memory, typically 64 bytes on modern x86 processors. When one thread writes to a memory location, the entire cache line containing that location is invalidated in every other core's cache.

Now suppose you give each thread its own counter, stored in a simple array:

int local_counts[NUM_THREADS];  // Each thread writes to local_counts[thread_id]

This looks safe — each thread writes to a different index. But if local_counts[0] and local_counts[1] sit on the same 64-byte cache line (which they do, since each int is only 4 bytes), then every write by Thread 0 invalidates Thread 1's cache entry for that line, and vice versa. Both threads continuously trigger cache misses and stall waiting for the line to be reloaded — even though they are never touching the same actual data.

The fix is to pad each thread's data so it occupies its own cache line:

struct alignas(64) PaddedCounter {
    int value;
};
PaddedCounter local_counts[NUM_THREADS];

This guarantees that no two threads share a cache line. False sharing in array aggregations and per-thread buffers was one of the more surprising performance issues I encountered when tuning the kNN pipeline — the slowdown was real and visible in profiling, but the code looked perfectly correct.
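A sketch of the padded layout in use, hand-rolling what a reduction would otherwise do (count_matches and the serial fallback are illustrative additions of mine):

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// serial fallback so the sketch also builds without OpenMP
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num()  { return 0; }
#endif

// One counter per thread, each forced onto its own 64-byte cache line,
// so writes by different threads never invalidate each other's lines.
struct alignas(64) PaddedCounter {
    long value = 0;
};

long count_matches(const std::vector<int>& arr, int target) {
    std::vector<PaddedCounter> local(omp_get_max_threads());

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < (long)arr.size(); i++)
            if (arr[i] == target)
                local[tid].value++;   // private cache line: no false sharing
    }

    long total = 0;
    for (const auto& c : local)       // combine the per-thread counts once
        total += c.value;
    return total;
}
```

For a single scalar, a reduction clause is simpler and equivalent; explicit padding matters when the per-thread state is a larger structure or when the access pattern cannot be expressed as a reduction.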

4. Algorithmic Dependencies

Some algorithms simply cannot be fully parallelized because later steps depend on earlier results. No amount of clever threading eliminates a structural dependency.

Prefix sum is the canonical example. The definition is:

output[i] = output[i-1] + input[i]

Every element depends on the one before it. A parallel implementation exists — compute partial sums in one pass, propagate offsets in a second pass — but it requires explicit synchronization barriers between phases. Every barrier forces all threads to wait for the slowest one before proceeding. The result is that prefix sum on 100 million elements only achieved 1.13× speedup, as the previous benchmarks showed.
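The two-pass scheme can be sketched as follows; the explicit barrier and the single construct are exactly the synchronization points described above (the chunking strategy is my own illustration, not necessarily the benchmarked version):

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// serial fallback so the sketch also builds without OpenMP
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num()  { return 0; }
#endif

// Two-phase parallel inclusive prefix sum.
// Phase 1: each thread scans its own chunk independently.
// Serial step: chunk totals are turned into per-chunk offsets.
// Phase 2: each thread adds its chunk's offset back in parallel.
std::vector<long> parallel_prefix_sum(const std::vector<int>& in) {
    long n = (long)in.size();
    std::vector<long> out(n);
    int t = omp_get_max_threads();
    std::vector<long> chunk_sum(t + 1, 0);

    #pragma omp parallel num_threads(t)
    {
        int tid = omp_get_thread_num();
        long lo = n * tid / t, hi = n * (tid + 1) / t;

        long running = 0;                 // phase 1: local scan
        for (long i = lo; i < hi; i++) {
            running += in[i];
            out[i] = running;
        }
        chunk_sum[tid + 1] = running;

        #pragma omp barrier               // wait for every local scan
        #pragma omp single                // one thread builds the offsets
        for (int k = 1; k <= t; k++)
            chunk_sum[k] += chunk_sum[k - 1];
        // implicit barrier at the end of single

        for (long i = lo; i < hi; i++)    // phase 2: apply the offset
            out[i] += chunk_sum[tid];
    }
    return out;
}
```

Both barriers are structural: phase 2 cannot start until every chunk total exists, and the offsets cannot be built until every scan is done. That waiting is precisely what caps the speedup.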

BFS has the same structure at a higher level. Within a single frontier level, all nodes can be processed in parallel. But level k+1 cannot begin until level k is complete, because the frontier for the next level is determined by the current one. The barrier between levels is unavoidable.
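A level-synchronous sketch of that structure (adjacency list as a vector of vectors; the atomic exchange is a test-and-set so that two threads cannot both claim a node; all names here are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Level-synchronous parallel BFS. Nodes within one frontier are expanded in
// parallel; the barrier at the end of the parallel region separates levels.
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int src) {
    int n = (int)adj.size();
    std::vector<int> level(n, -1);
    std::vector<std::atomic<bool>> visited(n);
    for (auto& v : visited) v.store(false);

    std::vector<int> frontier = {src};
    visited[src].store(true);
    level[src] = 0;

    for (int depth = 1; !frontier.empty(); depth++) {
        std::vector<int> next;
        #pragma omp parallel
        {
            std::vector<int> local_next;          // per-thread buffer
            #pragma omp for nowait
            for (long i = 0; i < (long)frontier.size(); i++)
                for (int nb : adj[frontier[i]])
                    // exchange() is an atomic test-and-set: only the first
                    // thread to flip the flag claims the node
                    if (!visited[nb].exchange(true)) {
                        level[nb] = depth;
                        local_next.push_back(nb);
                    }
            #pragma omp critical
            next.insert(next.end(), local_next.begin(), local_next.end());
        }   // implicit barrier: level k is fully done before level k+1 starts
        frontier = std::move(next);
    }
    return level;
}
```

The per-thread buffers avoid contention on the next frontier during expansion; the single critical section per thread per level is cheap compared to the work inside the level.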

5. Why Speedups Plateau

Even on problems that parallelize well, adding more threads eventually stops helping. Several forces converge to create this ceiling:

Amdahl's Law. Every program has a sequential fraction — initialization, I/O, final aggregation — that cannot be parallelized. If 10% of your runtime is inherently sequential, your maximum possible speedup is 10×, regardless of how many cores you add. Real programs have much higher sequential fractions than people expect.
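The bound is worth writing down. With parallelizable fraction p and n cores, the best possible speedup is 1 / ((1 - p) + p / n), which approaches 1 / (1 - p) as n grows. A tiny helper makes the ceiling concrete:

```cpp
#include <cassert>
#include <cmath>

// Amdahl's Law: p is the parallelizable fraction of the runtime, n the core
// count. The serial fraction (1 - p) always runs at full length.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.9, eight cores deliver only about 4.7x, and even a thousand cores stay just under the 10x ceiling.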

Memory bandwidth saturation. Multiple threads reading large arrays in parallel compete for the same memory bus. Once the memory bandwidth is saturated, adding more threads does not improve throughput — they all stall waiting for data. This is why all-pairs distance computation showed a less-than-expected speedup despite being computationally intensive: the memory access pattern is not cache-friendly and bandwidth becomes the bottleneck.

Thread overhead. Creating, scheduling, and synchronizing threads has a fixed cost. For small workloads, this overhead exceeds the time saved by parallelism. The crossover point — where parallel becomes faster than sequential — depends on the problem and the machine. Always benchmark at realistic input sizes.

Load imbalance. If iterations vary significantly in how long they take, some threads will finish early and sit idle waiting for others. The parallel region only ends when the slowest thread finishes. Uneven work distribution wastes the cores that completed early and limits the effective speedup.
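OpenMP's schedule clause addresses this directly: instead of giving each thread one fixed block of iterations up front (the static default), dynamic scheduling hands out small chunks as threads become free. A sketch with a deliberately uneven per-iteration cost (the workload model is invented for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// When iteration cost varies widely, static scheduling leaves early finishers
// idle. schedule(dynamic, 64) lets idle threads grab the next 64 iterations.
double sum_uneven(const std::vector<double>& xs) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total) schedule(dynamic, 64)
    for (long i = 0; i < (long)xs.size(); i++) {
        double v = xs[i];
        // invented uneven workload: later iterations do more dummy work
        for (long k = 0; k < i % 1000; k++)
            v = std::sqrt(v * v);        // |v|: leaves nonnegative v unchanged
        total += v;
    }
    return total;
}
```

Dynamic scheduling has per-chunk bookkeeping overhead, so it only pays off when the imbalance it removes is larger than that cost; schedule(guided) is the usual middle ground.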

A Mental Model for Parallel Code

After working through these issues, the most useful mental model I have found is to ask three questions before parallelizing anything:

  1. Are the iterations independent? If not, restructure or accept limited gains.
  2. Is the workload large enough? Thread overhead is fixed — the benefit must exceed it.
  3. Where is the bottleneck? If memory bandwidth is the limit, more threads will not help and may hurt.

Parallel computing is genuinely powerful when these conditions are met. The kNN brute-force search — 10,000 test points each compared against 60,000 training points across 734 dimensions — achieved a ~55× speedup with OpenMP because the outer loop over test points was completely independent, the workload per iteration was heavy, and the access pattern was regular enough to use memory bandwidth efficiently.

Conclusion

Race conditions, false sharing, atomic contention, algorithmic dependencies, and memory bandwidth limits all work against you the moment you move beyond embarrassingly parallel problems. Understanding them does not make parallel programming easy, but it makes the failures predictable and diagnosable rather than mysterious.

The practical takeaway: parallelize at the outermost independent loop you can find, use reductions instead of atomics wherever possible, be mindful of data layout and cache line boundaries, and always measure — intuitions about where the bottleneck is are frequently wrong.
