## How exactly to do linear time quickselect with duplicate elements?

It’s well known that the quickselect algorithm, which runs in average case linear time, can be made linear time in the worst case as well, by using the median of medians strategy to select the pivot. This also allows us to implement quicksort in guaranteed $O(n \log n)$ time.

While median-of-medians is theoretically interesting, it is not of much use in the real world. For sorting, the introsort approach with median-of-three partitioning is more suitable. Median-of-three partitioning has low overhead and is unlikely to degenerate to quadratic-time behaviour except on unusual adversarial sequences. On those rare occasions when the recursion gets too deep, introsort switches from quicksort to heapsort. Median-of-medians partitioning has too much overhead compared to heapsort to actually be useful. For selection, libstdc++ pursues a similar approach, namely median-of-three partitioning and switching from quickselect to heapselect if the recursion gets too deep; this implies that the selection may take $O(n \log n)$ time in rare situations, which is unlikely to cause any practical issues. Interestingly, libc++’s std::nth_element implementation simply degenerates to quadratic time when given a median-of-three killer sequence as input (Musser, 1997). Presumably, few people complain since such sequences rarely occur in practice, and this is perfectly kosher according to the C++ standard, which merely requires linear time performance on average.

Why might one be interested in worst-case linear time quickselect despite the fact that it’s not useful enough to have made it into the two most widely used and respected C++ standard library implementations? The main reason I can think of is that you might, at some point, be applying for a job, and you might get an annoying interviewer who will dock marks from you if you don’t write a solution that has asymptotically optimal worst-case performance. You might say to yourself, I wouldn’t want to work for a company with this kind of interview process anyway. If that’s the case, I suppose there’s no compelling reason for you to read the rest of this post, other than curiosity.

When all elements are distinct, quickselect with median-of-medians is not hard to implement. The particular way in which the partitioning loop is implemented is immaterial, since there is only a single correct result. For example, we might use the simple Lomuto partitioning procedure given in CLRS. However, the version in CLRS takes the rightmost element of the range as the pivot, and CLRS itself remarks that the procedure needs to be modified to take an explicit pivot element in order to be usable with the median of medians. Of course, we can simply swap this element with the rightmost element and proceed with the original procedure:

```cpp
// let x = *pivot; then, after this call, [begin, result) <= x && *result == x && (result, end) > x
template <std::random_access_iterator It>
It partition(It begin, It pivot, It end) {
    using std::ranges::iter_swap;
    iter_swap(pivot, end - 1);
    It result = begin;
    for (It i = begin; i < end - 1; i++) {
        if (*i <= end[-1]) {  // NB: end[-1] is where the pivot value is right now
            iter_swap(i, result);
            ++result;
        }
    }
    iter_swap(result, end - 1);
    return result;
}
```


(This code requires some additional concepts beyond std::random_access_iterator, and should probably be rangified. This is left as an exercise for the reader.)

The full quickselect may look something like this:

```cpp
template <std::random_access_iterator It>
void nth_element(It begin, It nth, It end) {
    const auto N = end - begin;

    if (N <= 5) {
        // insertion sort
        for (auto i = begin; i < end; i++) {
            const auto dest = std::upper_bound(begin, i, *i);
            std::rotate(dest, i, i + 1);
        }
        return;
    }

    for (int i = 0; i + 5 <= N; i += 5) {
        // move the median of this block of 5 into the middle of the block
        nth_element(begin + i, begin + (i + 2), begin + (i + 5));
        // append it to the prefix of medians being collected at the front of the array
        std::ranges::iter_swap(begin + (i / 5), begin + (i + 2));
    }
    // move the median of those medians to position N / 10
    nth_element(begin, begin + (N / 10), begin + (N / 5));

    const auto m = partition(begin, begin + (N / 10), end);

    if (nth < m) {
        nth_element(begin, nth, m);
    } else if (nth > m) {
        nth_element(m, nth, end);
    }
    // if nth == m: nothing to do
}
```


Before we address the issue of duplicate elements, we should review the reason why this algorithm runs in guaranteed linear time. Let $T(n)$ denote the worst-case running time of quickselect with median-of-medians partitioning, run on an array with $n$ elements. There are approximately $n/5$ groups of 5. The median of medians is greater than half of these medians (constituting approximately $n/10$ elements). Each of those medians is greater than two elements in its group of 5; thus, the median of medians is greater than approximately $3n/10$ elements. By the same token, the median of medians is less than approximately $3n/10$ elements. It therefore follows that the side of the partition we descend into has size at most approximately $7n/10$. The time to compute the median of medians itself is $T(\lceil n/5\rceil) + O(n)$, and the partitioning step is of course linear. Thus

$\displaystyle T(n) = T(n/5 + O(1)) + T(7n/10 + O(1)) + \Theta(n)$

where we have used $O(1)$ to make our use of “approximately” precise, since e.g., $\lceil n/5 \rceil = n/5 + O(1)$. Now, because $n/5 + 7n/10$ is strictly less than $n$, it follows that $T(n) = \Theta(n)$. A careful proof can be found in CLRS; you can also use the Akra–Bazzi theorem, which has the advantage that it also lets us easily answer the question of what happens if the two factors of 1/5 and 7/10 were replaced by factors whose sum is exactly 1, or strictly greater than 1. As you can probably guess, the answers would be $T(n) = \Theta(n \log n)$ and $T(n) = \Theta(n^p)$, respectively, where $p$ is a constant greater than 1.

With the partitioning procedure given above, we can immediately see that there will be a problem if duplicate elements are present. In particular, if all elements in the array are equal, the right half of the partition will always be empty, leaving us to recurse into the left half, which has $N - 1$ elements. In fact, because there is only one possible pivot value, it doesn’t matter how we select it; the running time will be at least quadratic, and the recursive nature of the median-of-medians computation actually results in superpolynomial time. It’s therefore clear that we must change the way we partition.

Indeed, the poor behaviour of Lomuto partitioning in an array where all (or almost all) elements are equal is a pretty good reason not to use it, even in the sorting case where we can switch to heapsort after a certain point. This switch-over would happen after $k \log n$ recursive calls for some constant $k$, meaning that by that time, $\Theta(n \log n)$ operations have already been done, and nearly all of the array remains unsorted. It would be better, even for introsort, to use a partitioning procedure that behaves well in the presence of many duplicates. This is typically the Hoare partitioning procedure or a variant:

```cpp
// let x be the initial value of *pivot and k the return value; then,
// [begin, k) <= x and [k, end) >= x
template <std::random_access_iterator It>
It partition(It begin, It pivot, It end) {
    using std::ranges::iter_swap;
    const auto x = *pivot;  // copy the pivot value: the swaps below may move it
    iter_swap(begin, pivot);
    It i = begin;
    It j = end;
    while (true) {
        do {
            --j;
        } while (*j > x);
        while (*i < x) {
            ++i;
        }
        if (i < j) {
            iter_swap(i, j);
            ++i;
        } else {
            return j + 1;
        }
    }
}
```


The Hoare partitioning procedure is more complex than Lomuto’s; it’s not immediately obvious, upon reading the code, that it’s even correct at all (this is an exercise in CLRS; solutions can be found online). Lomuto thus has pedagogical value, whereas Hoare is what gets used in practice. When all elements are distinct, Lomuto generally performs many unnecessary swaps, since the kth element less than the pivot (in index order) always gets swapped into position k, even if its original position was fine (i.e., to the left of the pivot’s final position). Hoare does not do this: every swap it performs (apart from the initial swap that moves the pivot to the front, which is needed to avoid degenerate partitions) exchanges an element that belongs on the left but sits on the right with an element that belongs on the right but sits on the left. In the presence of duplicates, Hoare can perform some unnecessary swaps of an element equal to the pivot on the left with an element equal to the pivot on the right. However, Lomuto will also perform unnecessary swaps here, and indeed, when all elements are equal, Lomuto will perform about $n/2$ swaps compared to Hoare’s $n/4$.

A slight modification to the nth_element implementation is needed when using the two-way Hoare partitioning shown above: the else branch must be made unconditional, since the element at position m is no longer guaranteed to be in its final sorted position. With this change made, our quickselect implementation should perform reasonably well in the presence of duplicates. But is it linear in the worst case?

The answer hinges on the issue of how lopsided a partition is possible in the presence of duplicates, where the pivot is chosen by the median-of-medians strategy. The case with all elements equal is actually a best case for Hoare partitioning; while it will perform unnecessary swaps, it will also move the left pointer in lockstep with the right pointer, meeting exactly in the middle. But what is a worst case? Well, if $3n/10 + 1$ of the elements are zeroes, and $7n/10 - 1$ are ones, and zero is chosen as the pivot (which is possible by gerrymandering), and the right pointer skips over all the ones at once and then moves in lockstep with the left, we’ll get a split where the left partition is about $3n/20$ zeroes and the right partition has the remaining $17n/20$ elements. Since $17/20 + 1/5 > 1$, this is a sufficiently bad split to be problematic. However, this particular case isn’t actually possible, because in this case 0 wouldn’t be chosen as the pivot anyway. If we gerrymander the array to make 0 the median of medians, but otherwise move the ones as far to the right as possible, it looks like 00011 00011 00011 … 11111 11111 11111 …, and Hoare partitioning will yield a 3:13 split in this case. Still, $13/16 + 1/5 = 81/80$, greater than 1 by just a hair. Wait a second, though: even this exact case is not possible, because we have written our quickselect above so that the $n/5$ medians all get moved to become the first $n/5$ elements of the array before the partitioning step is actually run. And even if a split worse than 1:4 can still happen, it’s not clear whether we can engineer the input so that the split is sufficiently bad upon every iteration, as the array gets reordered and whittled down, to yield superlinear running time. I don’t have an answer here; Hoare partitioning is difficult to analyze.

So what do we do? Well, you might feel that our algorithm by this point probably runs in worst-case linear time, and even if it doesn’t, it runs in $O(n^{1+c})$ time where $c$ is some small constant. But, as I said, you may be applying for a job where the interviewer wants you to write something that has provably worst-case linear time performance. In that case, it might occur to you that we need to be smart about how we handle elements that are equal to the pivot. (Duplicates that are unequal to the pivot are not a problem, since it is clear which half of the array they will end up in after partitioning.)

The solution to this problem is three-way partitioning: we must modify our approach so that the partitioning procedure is guaranteed to place all values equal to the pivot in their sorted positions. Writing such a partitioning loop is known as the Dutch national flag problem. There are various possible solutions; this one is based on (Dijkstra, 1976):

```cpp
// returns the range of elements equal to the pivot.
// pivot is a value, not an iterator; it is taken by copy because the swaps
// below may move the element it was read from
template <std::random_access_iterator It>
auto partition(It begin, It end, auto pivot) {
    using std::ranges::iter_swap;
    auto i = begin, j = begin, k = end;
    while (j < k) {
        const auto c = (*j <=> pivot);
        if (c == std::weak_ordering::less) {
            iter_swap(j++, i++);
        } else if (c == std::weak_ordering::greater) {
            iter_swap(j, --k);
        } else {
            j++;
        }
    }
    return std::ranges::subrange(i, j);
}
```


Why does this help? Well, the reasoning for why the median of medians is greater than at least $(3/10 - \epsilon)n$ elements when all elements are distinct carries over easily to the non-distinct case, but the inequality becomes non-strict: the median of medians is greater than or equal to at least $(3/10 - \epsilon)n$ elements and likewise less than or equal to at least $(3/10 - \epsilon)n$ elements. This implies that when the three-way partition is run on a range with $n$ elements, the returned range at least overlaps with the middle two-fifths of the array; this is a generalization of the case with all elements distinct, where the returned iterator would always fall within the middle two-fifths of the array. We amend the calling code thus:

```cpp
const auto r = partition(begin, end, begin[N / 10]);

if (nth < r.begin()) {
    nth_element(begin, nth, r.begin());
} else if (nth >= r.end()) {
    nth_element(r.end(), nth, end);
}
// nth falls within r: nothing to do
```


Now, in each case where nth_element is recursively called, it cannot be with more than $7n/10$ elements, since that would require the equal range to fall entirely within the $3n/10$ elements at the other end, which cannot happen.

I haven’t measured the performance of the overall algorithm, but it’s clear that it suffers from more overhead than the case with all elements distinct (using Hoare partitioning) since, as with Lomuto partitioning, we may perform unnecessary swaps, wherein an element that is already in the correct region of the array (according to whether it is less than, equal to, or greater than the pivot value) is nevertheless moved to a different location. This overhead is incurred even if it turns out that there are no duplicates in the array after all. The overhead is made significantly greater if we do not have three-way comparisons available (or if we do not make use of them), as each element is then potentially subjected to two full comparisons: one to determine whether it is less than the pivot, and (if the answer is no) a subsequent one to determine whether it is equal. Such is the cost of writing a guaranteed linear time quickselect. It’s not hard to see why, in the real world, the libstdc++ approach is preferred: select the pivot using median-of-three, run Hoare partitioning (which will usually perform well in the presence of duplicates despite not checking for them explicitly), and then switch over to heapselect in the very rare cases where this fails to behave well.