Find N largest or lowest values with heapq

You have a collection of items and you need to keep the N highest or lowest of them. Instead of writing your own algorithm, use build-in heapq module.

This module has two handy functions, nlargest(), and nsmallest().

This is how it works.

import heap

col = [79, 90, 21, 22, 30, 68, 55, 94, 12, 82]
heapq.nlargest(3, col) # [94, 90, 82]
heapq.nsmallest(3, col) # [12, 21, 22]

Under the hood, these functions use a heap data structure. The important feature of a heap is, that the first element is always the smallest one. The heapq module contains heapify() function which converts the elements to the heap data structure

col = [79, 90, 21, 22, 30, 68, 55, 94, 12, 82]
heapq.heapify(col)
col # prints [12, 22, 21, 79, 30, 68, 55, 94, 90, 82]

Notice two things.

1) The order of the elements changed in place.
2) That the elements are not sorted from smallest to largest as per se, but the underlying structure is heap, which means that the first element is always smallest.

To get the subsequent elements, you cannot simply iterate using for loop. To check, if you write something like

for i in col: print(i)
# [12, 22, 21, 79, 30, 68, 55, 94, 90, 82]

In order to get the next subsequent element, you need to use heappop() method.

heapq.heappop(col)
# 12
heapq.heappop(col)
# 21
heapq.heappop(col)
# 22

This is a destructive process so when you iterate over all the collection, you get IndexError.

The important point is that as heappop() pops the smallest item and replaces it with another smallest one this operation has O(log N) complexity where N is size of the heap. In practical terms, it means, that if you have a large list of items and want to get a few smallest, it is faster compared to sorting the list and slicing it. But if you have a large list and want to get almost the same size as the smallest items, it may be better to sort it first and slice it.

Another thing to mention is that this can be used to more complicated data structures.

students = [ {"name": "Michael", "score":90},
            {"name": "Jane", "score": 67},
            {"name": "Paul", "score": 89},
            {"name": "Mike", "score": 99},
            {"name": "Rocky", "score": 76},
            {"name": "Lucy", "score": 79},
            {"name": "Frank", "score": 84}]

# get the three best students
heapq.nlargest(3, students, key=lambda x:x["score"])

Real-world example.

I used this handy module quite a lot, lately also in one of my public projects - the COVID Dashboard. There was a need to annotate some charts but I wanted to annotate only a few highest values in the figure, not all of them.

Here is how it looks like

    largest_positive = heapq.nlargest(5, y_positive_df)
    for xa, ya in zip(x_state_df, y_positive_df):
        if ya in largest_positive:
            fig.add_annotation(
                xref='x2',
                yref='y2',
                x=xa,
                y=ya + 1300,
                text="{:,}".format(ya),
                showarrow=False)

Of course, this snippet is out of context but the point is that it is very useful module, finding the largest/smallest elements is a very common task in programming.