Most of us learn algorithms assuming a single thread of execution — one loop, one core, one output. But modern CPUs have 4, 8, sometimes 16 hardware threads sitting idle while your code runs on just one of them. When a problem can be split into independent chunks, parallel execution can turn a slow loop into something dramatically faster without changing the algorithm itself.
This article looks at what that actually means in practice. Working on a high-performance kNN classifier in C++ gave me a good reason to dig into OpenMP seriously, and I ran benchmarks on six common DSA problems to see where parallelism helps, where it barely helps, and why.
What is OpenMP?
OpenMP is a set of compiler directives, library routines, and environment variables that let you add parallelism to C, C++, or Fortran programs without rewriting them from scratch. The most common pattern looks like this:
```cpp
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    result[i] = expensive_computation(data[i]);
}
```
That single pragma tells the compiler to distribute the loop iterations across all available threads. Each thread gets a slice of the index range and runs independently. At the end of the loop, execution returns to a single thread.
To compile with OpenMP support on GCC:
```shell
g++ -O2 -fopenmp -o program program.cpp
```
OpenMP works best when loop iterations are independent — meaning no iteration reads a result written by another iteration in the same loop. Once dependencies creep in, things get more complicated, as the benchmarks below show clearly.
The Benchmarks
All benchmarks were run on the same machine with the same inputs, comparing sequential execution against parallel execution with OpenMP. The goal was not to find the theoretical maximum speedup but to see realistic gains on a normal development laptop.
1. Counting Occurrences in a Large Array
Array size: 100 million elements
| Mode | Time |
|---|---|
| Sequential | 0.709 s |
| Parallel (OpenMP) | 0.341 s |
| Speedup | ~2.08× |
This is the cleanest case for parallelism. Each element is checked independently and contributes to a local count. Using a reduction clause, each thread maintains its own counter and they are summed at the end — no shared state, no contention. The speedup is close to what you'd expect from two threads doing equal work.
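A minimal sketch of that reduction pattern (the name `count_occurrences` and the exact signature are illustrative, not taken from the benchmark code):

```cpp
#include <vector>

// Count elements equal to `target`. reduction(+:count) gives each
// thread a private copy of `count` and sums the copies when the
// parallel region ends: no shared counter, no locking.
long long count_occurrences(const std::vector<int>& data, int target) {
    long long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (long long i = 0; i < (long long)data.size(); i++) {
        if (data[i] == target) count++;
    }
    return count;
}
```

Without the `reduction` clause, every thread would increment one shared counter, which is both a data race and a contention hotspot.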
2. All-Pairs Distance Computation
V = 2000 vertices
| Mode | Time |
|---|---|
| Sequential | 9.857 s |
| Parallel (OpenMP) | 5.522 s |
| Speedup | ~1.78× |
Computing distances between every pair of nodes is a compute-heavy O(V²) workload. The outer loop over source nodes is embarrassingly parallel — each iteration is fully independent. The speedup is good but already short of linear, an early hint that memory bandwidth, rather than compute, is starting to matter at this scale.
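A sketch of the shape of this workload, using 2-D Euclidean distance as a stand-in (the benchmark's actual distance function and point representation are not specified here). Iteration i writes only row i of the output, so the outer loop parallelizes cleanly:

```cpp
#include <cmath>
#include <utility>
#include <vector>

// All-pairs Euclidean distances between 2-D points.
std::vector<std::vector<double>> all_pairs(
        const std::vector<std::pair<double, double>>& pts) {
    long long n = (long long)pts.size();
    std::vector<std::vector<double>> d(n, std::vector<double>(n, 0.0));
    #pragma omp parallel for
    for (long long i = 0; i < n; i++) {      // each thread owns whole rows
        for (long long j = 0; j < n; j++) {
            double dx = pts[i].first - pts[j].first;
            double dy = pts[i].second - pts[j].second;
            d[i][j] = std::sqrt(dx * dx + dy * dy);
        }
    }
    return d;
}
```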
3. Matrix Multiplication
N = 800 (square matrices)
| Mode | Time |
|---|---|
| Sequential | 0.837 s |
| Parallel (OpenMP) | 0.343 s |
| Speedup | ~2.44× |
Matrix multiplication is one of the most satisfying problems to parallelize. Each output element C[i][j] is computed independently from all others, so the outer loop distributes cleanly across threads. This benchmark shows the best speedup of the set, which makes sense — the computation is heavy, memory access is structured, and there is zero inter-thread communication.
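A sketch of the naive parallel version (the `Matrix` alias and function name are my own; the benchmarked code may differ in details such as loop order):

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Naive O(N^3) multiply. Row i of C is written only by iteration i of
// the outer loop, so rows distribute across threads with no shared
// writes. The i-p-j loop order keeps the inner accesses to B and C
// row-contiguous, which is kinder to the cache.
Matrix multiply(const Matrix& A, const Matrix& B) {
    long long n = (long long)A.size();
    std::size_t k = B.size(), m = B[0].size();
    Matrix C(n, std::vector<double>(m, 0.0));
    #pragma omp parallel for
    for (long long i = 0; i < n; i++)
        for (std::size_t p = 0; p < k; p++)
            for (std::size_t j = 0; j < m; j++)
                C[i][j] += A[i][p] * B[p][j];
    return C;
}
```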
4. Prefix Sum
Array size: 100 million elements
| Mode | Time |
|---|---|
| Sequential | 1.215 s |
| Parallel (OpenMP) | 1.074 s |
| Speedup | ~1.13× |
This is where things get interesting. A prefix sum (or scan) is inherently sequential in its basic form — each element depends on the previous result. Parallelizing it requires a two-phase approach: compute partial sums independently in the first phase, then combine them in the second. The overhead of this coordination nearly cancels out the benefit of parallelism. The ~1.13× gain is real but modest, and it illustrates an important point: not all algorithms are structurally amenable to parallelism.
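A sketch of that two-phase scheme. The fixed chunk count is for clarity only; a tuned version would match it to the thread count via `omp_get_max_threads()`:

```cpp
#include <vector>

// Two-phase parallel inclusive prefix sum, in place.
void prefix_sum(std::vector<long long>& a, int nchunks = 4) {
    long long n = (long long)a.size();
    std::vector<long long> chunk_sum(nchunks, 0);

    // Phase 1: scan each chunk independently.
    #pragma omp parallel for
    for (int c = 0; c < nchunks; c++) {
        long long lo = n * c / nchunks, hi = n * (c + 1) / nchunks;
        for (long long i = lo + 1; i < hi; i++) a[i] += a[i - 1];
        if (hi > lo) chunk_sum[c] = a[hi - 1];
    }

    // Sequential combine over the (tiny) per-chunk totals.
    std::vector<long long> offset(nchunks, 0);
    for (int c = 1; c < nchunks; c++)
        offset[c] = offset[c - 1] + chunk_sum[c - 1];

    // Phase 2: shift every chunk by the sum of all chunks before it.
    #pragma omp parallel for
    for (int c = 0; c < nchunks; c++) {
        long long lo = n * c / nchunks, hi = n * (c + 1) / nchunks;
        for (long long i = lo; i < hi; i++) a[i] += offset[c];
    }
}
```

Note that the array is traversed twice instead of once, which is exactly why the measured gain is so thin: the extra pass costs nearly as much as the parallelism saves.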
5. Merge Sort
N = 10 million elements
| Mode | Time |
|---|---|
| Sequential | 2.444 s |
| Parallel (OpenMP) | 1.088 s |
| Speedup | ~2.25× |
Merge sort is a divide-and-conquer algorithm where the two recursive halves are completely independent of each other. Parallelism here happens at the algorithm level — spawn threads for the left and right subtrees and let them sort simultaneously. The gains are strong but only materialize once the input is large enough to justify the thread-creation overhead. On small arrays, parallel merge sort can actually be slower than sequential.
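One way to express this in OpenMP is with tasks rather than `parallel for` — a sketch, with an arbitrary cutoff below which sequential `std::sort` avoids paying task overhead on tiny subproblems:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Parallel merge sort with OpenMP tasks.
void merge_sort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo < 10000) {                 // cutoff: an arbitrary tuning knob
        std::sort(a.begin() + lo, a.begin() + hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    #pragma omp task shared(a)
    merge_sort(a, lo, mid);                // left half as its own task
    #pragma omp task shared(a)
    merge_sort(a, mid, hi);                // right half, in parallel with the left
    #pragma omp taskwait                   // both halves must finish before merging
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}

void parallel_sort(std::vector<int>& a) {
    #pragma omp parallel
    {
        #pragma omp single                 // one thread spawns the root task
        merge_sort(a, 0, a.size());
    }
}
```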
6. BFS (Level-by-Level)
V = 1,000,000 vertices | E = 10,000,000 edges
| Mode | Time |
|---|---|
| Sequential | 0.035 s |
| Parallel (OpenMP) | 0.012 s |
| Speedup | ~2.91× |
The best raw speedup of the set. Within each BFS level, processing all frontier nodes is independent — each node's neighbours can be explored simultaneously. The parallelism is natural and the work per level is substantial enough to make it worthwhile. The catch is that levels themselves are sequential: you cannot start level k+1 until level k is fully processed.
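A sketch of level-synchronous BFS under those constraints. The compare-and-swap on the distance array ensures each vertex is claimed by exactly one thread; collecting the next frontier under `critical` is the simple-but-slow choice (a tuned version would use per-thread buffers and splice them):

```cpp
#include <atomic>
#include <vector>

// BFS distances from src over an adjacency list; -1 means unreached.
std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int src) {
    int n = (int)adj.size();
    std::vector<std::atomic<int>> dist(n);
    for (auto& d : dist) d.store(-1);
    dist[src].store(0);

    std::vector<int> frontier = {src};
    int level = 0;
    while (!frontier.empty()) {            // levels are sequential...
        std::vector<int> next;
        #pragma omp parallel for           // ...but each level is parallel
        for (int i = 0; i < (int)frontier.size(); i++) {
            for (int v : adj[frontier[i]]) {
                int expected = -1;
                if (dist[v].compare_exchange_strong(expected, level + 1)) {
                    #pragma omp critical
                    next.push_back(v);
                }
            }
        }
        frontier = std::move(next);
        level++;
    }

    std::vector<int> out(n);
    for (int i = 0; i < n; i++) out[i] = dist[i].load();
    return out;
}
```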
What the Numbers Actually Say
Looking across all six benchmarks, a clear pattern emerges:
- Problems with independent iterations scale well — matrix multiplication, BFS within a level, merge sort subtrees.
- Problems with dependencies barely scale — prefix sum is the clearest example. The dependency chain limits how much work can be done in parallel regardless of thread count.
- None hit linear speedup. Even the best result (2.91×) falls well short of the theoretical maximum. This is Amdahl's Law in action — the sequential portions of every program limit the overall speedup.
A Note on Input Size
One thing these benchmarks do not show explicitly but matters a great deal: parallelism has a fixed overhead cost — thread creation, synchronization, and combining results. For small inputs, this overhead can easily exceed the time saved. Always benchmark at the scale you actually expect in production. A parallel prefix sum on 1000 elements will likely be slower than the sequential version.
Conclusion
OpenMP is a remarkably low-friction way to add parallelism to existing C++ code. A single pragma can cut execution time in half on the right problem. But it rewards problems that are structurally parallel — independent iterations, divide-and-conquer structure, or level-based traversal. Applied to problems with inherent dependencies, the gains are marginal and the code becomes harder to reason about.
The next article in this series examines exactly why these speedups plateau and sometimes disappear entirely: race conditions, false sharing, atomic operations, and the limits imposed by memory bandwidth.