
My Optimization isn't Premature, it's just Right

While looking up some information on StackOverflow comparing the performance of os.walk to pathlib's rglob, I ran into a bunch of "answers" that put me into rant mode. Without going too far down the rabbit hole, this quote on one of the questions summarizes my feelings on the matter:

It's a plague on SO to discard questions related to performance with something like "premature optimizations are root of blabla" (which are actually misquotations of Knuth)

The other garbage non-answer that I run into is "try it out yourself". Oh, really? Why didn't I think of that!?

Trying it out myself

So, I already know that os.walk is faster than rglob.

I discovered that a couple of years ago while running some tests on a large directory tree, which was network mounted over a samba share, via a very low bandwidth wireless connection, on a device I could only access for limited periods of time. In my entire programming career, it was the only time that the performance of enumerating files on the filesystem was ever a bottleneck - and that was strictly due to how unusual and bizarre the entire setup was.

I needed to identify a small number of files based on a filename pattern - but had no deterministic way to access the deeply nested files. I even tried using find, grep, and other Linux tools - but they were similarly slow.

In the end, it was a reasonable solution to enumerate the files in my Python application, then action them accordingly. I use pathlib all the time, so I didn't think too much of it at the time. Conveniently, Trey Hunner wrote a blog post about why you should use pathlib and then wrote a follow-up covering some criticisms. One of these criticisms was that pathlib is too slow at enumerating files, but that in most cases, it's not a problem. I wholeheartedly agree with this.

In Trey's words:

But in my case, this speed difference doesn’t matter much. I searched for every file in my home directory and lost 6 seconds to the slower version of my code. If I needed to scale this code to search 10 million files, I’d probably want to rewrite it. But that’s a problem I can get to if I experience it.

While I didn't have to look through 10 million files, the speed difference mattered to me. A quick test in my environment comparing os.walk to rglob showed that os.walk was about 2-3x faster in my case, but more importantly, it brought the absolute time down from 2-3 minutes to about 1 minute - on a resource I might only be able to access for 10 minutes. So, a SUBSTANTIAL improvement.
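For reference, this is roughly the shape of the two approaches I was comparing (a sketch of my own - the function names are mine, not from the original project):

```python
import fnmatch
import os
from pathlib import Path


def find_files_walk(root: str, pattern: str) -> list[str]:
    """Enumerate files matching a filename pattern using os.walk + fnmatch."""
    matches: list[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return matches


def find_files_rglob(root: str, pattern: str) -> list[str]:
    """Enumerate files matching a filename pattern using pathlib's rglob."""
    return [str(p) for p in Path(root).rglob(pattern) if p.is_file()]
```

Both return the same results; the difference is purely in how fast they can churn through a large (or, in my case, painfully slow) directory tree.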

Outside of this project, I've been a happy pathlib user... Until today.

For only the second time in my career, I've run into a situation where I need to evaluate the performance of an operation across a large number of files (> 5 million files).

This brought me full circle back to looking for Trey's blog, so I could read his thoughts on the matter again, then re-run my tests locally to figure out what I wanted to do. Trying to find that blog post brought me to StackOverflow, and, well ... [see rant above].

First principles

To make this post useful, I'm going to detail some of the ways I've tried to walk through my filesystem (with code).

But first, for loop vs list comprehension... This has to be discussed, because I use one or the other in my sample code - and I'm sure someone will ask why I'm not using the other. Moreover, I'm sure someone will ask why I'm not using map or filter or some other functional programming construct. Even moreover, I'm sure someone will ask why I'm not using numpy or pandas or some other data science library. Yet even moreover, I'm sure someone will ask why I'm not using asyncio or multiprocessing or some other concurrency library. And on and on and on.

Also, it's still a pretty contentious topic, so I'm going to try to keep this as objective as possible for as long as possible. I'm just going to show you how I've used them in the past, and how I've tried to optimize them - and the results of said optimizations.

Here are the 4 methods I'm comparing:

def squares_comprehension(number_of_values: int):
    """Plain ol' list comprehension."""
    return [f"{x}+{x}" for x in range(number_of_values)]


def squares_forloop_naive(number_of_values: int):
    """Naive `for` loop with no bells nor whistles."""
    squares: list[str] = []
    for i in range(number_of_values):
        squares.append(f"{i}+{i}")
    return squares


def squares_forloop_nodot(number_of_values: int):
    """`for` loop avoiding dots - https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Avoiding_dots"""
    squares: list[str] = []
    append = squares.append
    for i in range(number_of_values):
        append(f"{i}+{i}")
    return squares


def squares_forloop_preallocate(number_of_values: int):
    """`for` loop pre-allocating the list's memory (a common C++ tactic)."""
    squares: list[str] = [""] * number_of_values
    i = 0
    for x in range(number_of_values):
        squares[i] = f"{x}+{x}"
        i += 1
    return squares

I've used strings instead of ints to add a slightly higher burden to the tests, as ints are heavily optimized across machines.

Testing methodology

Gotcha

import gc
import time
from functools import wraps


def time_this(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        is_gc_enabled = gc.isenabled()
        gc.disable()
        try:
            start = time.perf_counter_ns()
            result = func(*args, **kwargs)
            end = time.perf_counter_ns()
            elapsed_ms = (end - start) / 1_000_000
            print(f"{func.__name__} took {elapsed_ms} ms")
            return result
        finally:
            if is_gc_enabled:
                gc.enable()
    return wrapper

I had originally thought to use this timing decorator, and to run each test in order (and mix up the order a bunch) - but I realized that this is not a fair test. The first function takes an unfair hit due to warming everything up for the other functions. Also, tests get muddled since there is sometimes garbage collection, sometimes pre-allocated arrays, etc. The only fair way to do this is to run each function in isolation (using something like timeit), and then compare the results.

timeit

timeit comes with Python and it's a simple runner for timing code. It's not the most accurate, but it's good enough for this purpose. It handles some edge cases like garbage collection, and it's straightforward to use from the command line or from code.

Something like this was used for each method (with the appropriate function name):

import timeit

from lib import NUMBER_OF_RUNS

times = timeit.repeat(
    "squares_comprehension(NUMBER_OF_VALUES)",
    setup="from lib import squares_comprehension, NUMBER_OF_VALUES",
    number=NUMBER_OF_RUNS,
)
print("squares_comprehension:")
print(f"    min: {min(times) / NUMBER_OF_RUNS * 1000}ms")
print(f"    avg: {sum(times) / len(times) / NUMBER_OF_RUNS * 1000}ms")
print(f"    max: {max(times) / NUMBER_OF_RUNS * 1000}ms")

If I wasn't importing from another library (lib), I could have imported from __main__ instead because timeit runs in a separate namespace.
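As an aside, timeit.repeat also accepts a globals argument (since Python 3.5), which sidesteps the setup-string import entirely. A minimal self-contained sketch:

```python
import timeit

NUMBER_OF_VALUES = 10_000
NUMBER_OF_RUNS = 100


def squares_comprehension(number_of_values: int) -> list[str]:
    return [f"{x}+{x}" for x in range(number_of_values)]


# Passing globals() lets timeit resolve names defined in this module directly,
# instead of re-importing them via a setup string.
times = timeit.repeat(
    "squares_comprehension(NUMBER_OF_VALUES)",
    globals=globals(),
    number=NUMBER_OF_RUNS,
    repeat=3,
)
print(f"min: {min(times) / NUMBER_OF_RUNS * 1000:.3f}ms")
```

I stuck with the setup-string version above because my functions lived in a separate lib module anyway.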

Neofetch

Getting performance results without knowing what the system was run on is pretty much useless - so... Neofetch!

                    'c.
                 ,xNMM.          -----------------
               .OMMMMo           OS: macOS 13.1 22C65 x86_64
               OMMM0,            Host: Macmini8,1
     .;loddo:' loolloddol;.      Kernel: 22.2.0
   cKMMMMMMMMMMNWMMMMMMMMMM0:    Uptime: 1 day, 16 hours, 40 mins
 .KMMMMMMMMMMMMMMMMMMMMMMMWd.    Packages: 176 (brew)
 XMMMMMMMMMMMMMMMMMMMMMMMX.      Shell: zsh 5.8.1
;MMMMMMMMMMMMMMMMMMMMMMMM:       Resolution: 2560x1080
:MMMMMMMMMMMMMMMMMMMMMMMM:       DE: Aqua
.MMMMMMMMMMMMMMMMMMMMMMMMX.      WM: Quartz Compositor
 kMMMMMMMMMMMMMMMMMMMMMMMMWd.    WM Theme: Blue (Dark)
 .XMMMMMMMMMMMMMMMMMMMMMMMMMMk   Terminal: Apple_Terminal
  .XMMMMMMMMMMMMMMMMMMMMMMMMK.   Terminal Font: SFMono-Regular
    kMMMMMMMMMMMMMMMMMMMMMMd     CPU: Intel i5-8500B (6) @ 3.00GHz
     ;KMMMMMMMWXXWMMMMMMMk.      GPU: Intel UHD Graphics 630
       .cooc,.    .,coo:.        Memory: 16342MiB / 32768MiB

Results

I've run a small list test, and a large list test - while trying to keep total test time similar (by increasing the number of runs). Spoiler alert: it doesn't really matter which you use, but stick to list comprehensions or vanilla for loops if you can, because those are the most common, most maintainable, and follow the "least surprise" principle (i.e. they're the most likely to be what some random developer would expect to see).

NUMBER_OF_VALUES = 10_000
NUMBER_OF_RUNS = 1_000
NUMBER_OF_REPEATS = 5

squares_comprehension:
    min: 2.198ms
    avg: 2.234ms
    max: 2.295ms

squares_forloop_naive (~0.1ms/5% slower than comprehension per run):
    min: 2.317ms
    avg: 2.323ms
    max: 2.329ms

squares_forloop_nodot (~0.1ms/5% slower than comprehension per run):
    min: 2.309ms
    avg: 2.334ms
    max: 2.361ms

squares_forloop_preallocate (~0.2ms/10% slower than comprehension per run):
    min: 2.433ms
    avg: 2.452ms
    max: 2.470ms

NUMBER_OF_VALUES = 1_000_000
NUMBER_OF_RUNS = 10
NUMBER_OF_REPEATS = 5

squares_comprehension:
    min: 263.775ms
    avg: 271.824ms
    max: 276.505ms

squares_forloop_naive (~8-10ms/3-4% slower than comprehension per run):
    min: 273.416ms
    avg: 276.072ms
    max: 283.245ms

squares_forloop_nodot (~10-15ms/4-6% slower than comprehension per run):
    min: 276.128ms
    avg: 278.774ms
    max: 282.404ms

squares_forloop_preallocate (~50ms/20% slower than comprehension per run):
    min: 309.654ms
    avg: 311.013ms
    max: 312.896ms

Amusing aside

I was pretty surprised by the results that came out of this test - because the standard for loop and append was already almost as fast as list comprehension. It took me a few minutes to realize that I was testing on Python 3.11 which had a lot of optimizations under the hood. I think cheaper frames, inlined function calls, and method loading had the largest impact.

I'll discuss these optimizations in a future post, but for context, here is what Python 3.10 looks like:

NUMBER_OF_VALUES = 1_000_000
NUMBER_OF_RUNS = 10
NUMBER_OF_REPEATS = 5

squares_comprehension:
    min: 276.031ms
    avg: 282.553ms
    max: 287.557ms

squares_forloop_naive (~35ms/12% slower than comprehension per run):
    min: 310.676ms
    avg: 311.744ms
    max: 312.777ms

squares_forloop_nodot (~10-15ms/4-5% slower than comprehension per run):
    min: 289.103ms
    avg: 291.680ms
    max: 293.609ms

squares_forloop_preallocate (~40ms/15% slower than comprehension per run):
    min: 316.805ms
    avg: 317.813ms
    max: 318.840ms

There we go! That makes wayyyy more sense now when compared against the mountains of StackOverflow posts that say "list comprehension is much, much faster than a for loop" - because it was, until Python 3.11. However, even these "slow" for loops are still perfectly acceptable for the overwhelming number of use cases.

Compile it

Before I leave this topic - I want to talk about an easy win for this kind of performance-driven code. A library like numpy runs so fast on such large data because it makes heavy use of C extensions. If you're doing a lot of math, or a lot of data manipulation, you can get a huge performance boost by compiling your code to C and then calling it from Python. I've done this a few times, and it's not too hard, but it's also not something I'd recommend for a beginner. Instead of trying to learn how to compile your code, I'd recommend learning how to use a library like numpy or pandas that already has the performance optimizations built-in. In fact, that's not even a recommendation - it's a statement of fact. Use those libraries, and you'll be fine.

If you're writing code that's a bit more custom, you can go the route of something like nuitka or cython. Both of those options are a bit more complicated, but they're also a bit more powerful. You don't even need to delve into the cython-specific syntax. Just write regular Python, compile it, and take the win.

However, I'm going to propose an even easier solution. If you're writing Python code with type hints (which you should be), and if your code passes mypy strict checks, you can save yourself the hassle and use mypyc instead. It has its own set of limitations, but when it works, it's trivial.

For example:

mypyc lib.py

There we go. You should have a compiled library version of your code that you can then import like any other Python module. It's that easy.

Faster Results

Annnnd let's see what happened after using the mypyc compiler:

NUMBER_OF_VALUES = 1_000_000
NUMBER_OF_RUNS = 10
NUMBER_OF_REPEATS = 5

squares_comprehension_mypyc (~82ms/31% faster than non-compiled comprehension per run):
    min: 180.988ms
    avg: 183.842ms
    max: 185.987ms

squares_forloop_naive_mypyc (~1-2ms/1% faster than compiled comprehension per run):
    min: 178.299ms
    avg: 181.452ms
    max: 185.080ms

squares_forloop_nodot_mypyc (~1-2ms/1% faster than compiled comprehension per run):
    min: 179.741ms
    avg: 182.105ms
    max: 184.054ms

squares_forloop_preallocate_mypyc (~5ms/3% faster than compiled comprehension per run):
    min: 175.048ms
    avg: 179.784ms
    max: 185.061ms

Wow, check that out. The compiled versions are about 30% faster than the non-compiled versions, and the compiled for loops have slightly overtaken the compiled list comprehensions. That's pretty neat. And to boot, my poor, slow, lonely pre-allocated array idea went from being the tortoise to being the hare.

I should note that on really computationally heavy operations (e.g. fib), I've seen a 5-10x speedup. For something as trivial as a loop, there is already so much optimization under the hood that we "only" see a 30% bump; in more computation-heavy code, the gains from compiling are typically much larger.
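For context on that claim, the kind of function I mean is a tight, type-annotated numeric loop like this (an illustrative sketch, not my exact benchmark):

```python
def fib(n: int) -> int:
    """Iterative Fibonacci - the sort of tight, typed numeric loop where
    a compiler like mypyc can skip most of the interpreter's overhead."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

Compile a module containing a function like this with mypyc and you'll see far bigger relative wins than with list-building, because nearly all the time is spent in pure arithmetic rather than in allocating Python objects.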

Final Thoughts

I went into complete digression mode for this post, as I was originally going to talk about the practical speed differences between os.walk and pathlib's rglob, but I got sucked into the for loop vs. list comprehension debate. Instead of giving you the lame "premature optimization is the root of all evil" speech, I'll just leave you with my recommendations:

  1. Use Python 3.11 or later
  2. Write code for readability and maintainability first, and performance second
  3. Write code with type hints and mypy checks
  4. Use numpy or pandas for math and data manipulation (or other libraries with C extensions for the heavy lifting)
  5. Compile your code with mypyc if you want some free speed

Hey wait a sec, where's the loop recommendation? Whelp, sorry to break it to my fellow keyboard warriors, but following my recommendations above - it honestly doesn't matter which one you use.

As shown above, with Python 3.11, the for loop vs list comprehension numbers are so close (compiled or not) that it doesn't matter. Even in Python 3.10, they weren't crazy-far-apart in wall clock times. What's more, assuming your loop is doing something remotely non-trivial, the loop time will be completely wiped out by your actual business logic.