My Optimization isn't Premature, it's just Right
While looking up some information on StackOverflow comparing the performance of os.walk to Pathlib's rglob, I ran into a bunch of "answers" that put me into rant mode. Without going too far down the rabbit-hole, this quote on one of the questions summarizes my feelings on the matter:
It's a plague on SO to discard questions related to performance with something like "premature optimizations are root of blabla" (which are actually misquotations of Knuth)
The other garbage non-answer that I run into is "try it out yourself". Oh, really? Why didn't I think of that!?
Trying it out myself
So, I already know that os.walk is faster than rglob.
I discovered that a couple of years ago while running some tests on a large directory tree, which was network mounted over a samba share, via a very low bandwidth wireless connection, on a device I could only access for limited periods of time. In my entire programming career, it was the only time that the performance of enumerating files on the filesystem was ever a bottleneck - and that was strictly due to how unusual and bizarre the entire setup was.
I needed to identify a small number of files based on a filename pattern - but had no deterministic way to access the deeply nested files. I even tried using find, grep, and other Linux tools - but they were similarly slow.
In the end, it was a reasonable solution to enumerate the files in my Python application, then action them accordingly. I use pathlib all the time, so I didn't think too much of it at the time. Conveniently, Trey Hunner wrote a blog post about why you should use pathlib and then wrote a follow-up covering some criticisms. One of these criticisms was that pathlib is too slow at enumerating files, but that in most cases, it's not a problem. I wholeheartedly agree with this.
But in my case, this speed difference doesn’t matter much. I searched for every file in my home directory and lost 6 seconds to the slower version of my code. If I needed to scale this code to search 10 million files, I’d probably want to rewrite it. But that’s a problem I can get to if I experience it.
While I didn't have to look through 10 million files, the speed difference mattered to me. A quick test in my environment comparing os.walk to rglob showed that os.walk was about 2-3x faster in my case, but more importantly, it was bringing the absolute time of 2-3 minutes down to about 1 minute - on a resource I might only be able to access for 10 minutes. So, a SUBSTANTIAL improvement.
Outside of this project, I've been a happy pathlib user... Until today.
For only the second time in my career, I've run into a situation where I need to evaluate the performance of an action performed on a large number of files (> 5 million files).
This brought me full circle back to looking for Trey's blog, so I could read his thoughts on the matter again, then re-run my tests locally to figure out what I wanted to do. Trying to find that blog post brought me to StackOverflow, and, well ... [see rant above].
First principles
To make this post useful, I'm going to detail some of the ways I've tried to walk through my filesystem (with code).
But first, for loop vs list comprehension... This has to be discussed, because I use one or the other in my sample code - and I'm sure someone will ask why I'm not using the other. Moreover, I'm sure someone will ask why I'm not using map or filter or some other functional programming construct. Even moreover, I'm sure someone will ask why I'm not using numpy or pandas or some other data science library. Yet even moreover, I'm sure someone will ask why I'm not using asyncio or multiprocessing or some other concurrency library. And on and on and on.
Also, it's still a pretty contentious topic, so I'm going to try to keep this as objective as possible for as long as possible. I'm just going to show you how I've used them in the past, and how I've tried to optimize them - and the results of said optimizations.
Here are the 4 methods I'm comparing:
def squares_comprehension(number_of_values: int):
    """Plain ol' list comprehension."""
    return [f"{x}+{x}" for x in range(number_of_values)]


def squares_forloop_naive(number_of_values: int):
    """Naive `for` loop with no bells nor whistles."""
    squares: list[str] = []
    for i in range(number_of_values):
        squares.append(f"{i}+{i}")
    return squares


def squares_forloop_nodot(number_of_values: int):
    """`for` loop avoiding dots - https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Avoiding_dots"""
    squares: list[str] = []
    append = squares.append
    for i in range(number_of_values):
        append(f"{i}+{i}")
    return squares


def squares_forloop_preallocate(number_of_values: int):
    """`for` loop pre-allocating the list's memory (a common C++ tactic)"""
    squares: list[str] = [""] * number_of_values
    i = 0
    for x in range(number_of_values):
        squares[i] = f"{x}+{x}"
        i += 1
    return squares
I've used strings instead of ints to add a slightly higher burden to the tests, as ints are heavily optimized across machines.
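Since the timing comparison is only meaningful if all four variants do identical work, a quick equivalence check is cheap insurance - a sketch repeating the definitions from above so it runs standalone:

```python
def squares_comprehension(n: int) -> list[str]:
    return [f"{x}+{x}" for x in range(n)]


def squares_forloop_naive(n: int) -> list[str]:
    squares: list[str] = []
    for i in range(n):
        squares.append(f"{i}+{i}")
    return squares


def squares_forloop_nodot(n: int) -> list[str]:
    squares: list[str] = []
    append = squares.append  # cache the bound method to skip the dot lookup
    for i in range(n):
        append(f"{i}+{i}")
    return squares


def squares_forloop_preallocate(n: int) -> list[str]:
    squares: list[str] = [""] * n  # pre-size the list, then assign by index
    i = 0
    for x in range(n):
        squares[i] = f"{x}+{x}"
        i += 1
    return squares


expected = squares_comprehension(1_000)
for fn in (squares_forloop_naive, squares_forloop_nodot, squares_forloop_preallocate):
    assert fn(1_000) == expected, fn.__name__
print("all four variants agree")
```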
Testing methodology
Gotcha
import gc
import time
from functools import wraps


def time_this(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Disable garbage collection during the measurement, then restore it
        is_gc_enabled = gc.isenabled()
        gc.disable()
        try:
            start = time.perf_counter_ns()
            result = func(*args, **kwargs)
            end = time.perf_counter_ns()
            elapsed_ms = (end - start) / 1_000_000
            print(f"{func.__name__} took {elapsed_ms} ms")
            return result
        finally:
            if is_gc_enabled:
                gc.enable()
    return wrapper
I had originally thought to use this timing decorator, and to run each test in order (and mix up the order a bunch) - but I realized that this is not a fair test. The first function takes an unfair hit due to warming everything up for the other functions. Also, tests get muddled since there is sometimes garbage collection, sometimes pre-allocated arrays, etc. The only fair way to do this is to run each function in isolation (using something like timeit), and then compare the results.
timeit
timeit comes with Python and it's a simple runner for timing code. It's not the most accurate, but it's good enough for this purpose. It handles some edge cases like garbage collection, and it's straightforward to use from the command line or from code.
Something like this was used for each method (with the appropriate function name):
import timeit

times = timeit.repeat(
    "squares_comprehension(NUMBER_OF_VALUES)",
    setup="from lib import squares_comprehension, NUMBER_OF_VALUES",
    number=NUMBER_OF_RUNS,
)
print("squares_comprehension:")
print(f"  min: {min(times) / NUMBER_OF_RUNS * 1000}ms")
print(f"  avg: {sum(times) / len(times) / NUMBER_OF_RUNS * 1000}ms")
print(f"  max: {max(times) / NUMBER_OF_RUNS * 1000}ms")
If I wasn't importing from another library (lib), I could have imported from __main__ instead, because timeit runs in a separate namespace.
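For completeness, here's what that single-file version would look like - a hypothetical sketch with arbitrary run counts; the key detail is the `from __main__ import ...` in the setup string, since timeit executes its statements in a fresh namespace:

```python
import timeit


def squares_comprehension(number_of_values: int) -> list[str]:
    return [f"{x}+{x}" for x in range(number_of_values)]


if __name__ == "__main__":
    times = timeit.repeat(
        "squares_comprehension(10_000)",
        # timeit won't see module-level names unless we import them in setup
        setup="from __main__ import squares_comprehension",
        number=100,
        repeat=3,
    )
    # Each entry in `times` is the total seconds for `number` runs
    print(f"min: {min(times) / 100 * 1000:.3f}ms per run")
```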
Neofetch
Getting performance results without knowing what the system was run on is pretty much useless - so... Neofetch!
'c.
,xNMM. -----------------
.OMMMMo OS: macOS 13.1 22C65 x86_64
OMMM0, Host: Macmini8,1
.;loddo:' loolloddol;. Kernel: 22.2.0
cKMMMMMMMMMMNWMMMMMMMMMM0: Uptime: 1 day, 16 hours, 40 mins
.KMMMMMMMMMMMMMMMMMMMMMMMWd. Packages: 176 (brew)
XMMMMMMMMMMMMMMMMMMMMMMMX. Shell: zsh 5.8.1
;MMMMMMMMMMMMMMMMMMMMMMMM: Resolution: 2560x1080
:MMMMMMMMMMMMMMMMMMMMMMMM: DE: Aqua
.MMMMMMMMMMMMMMMMMMMMMMMMX. WM: Quartz Compositor
kMMMMMMMMMMMMMMMMMMMMMMMMWd. WM Theme: Blue (Dark)
.XMMMMMMMMMMMMMMMMMMMMMMMMMMk Terminal: Apple_Terminal
.XMMMMMMMMMMMMMMMMMMMMMMMMK. Terminal Font: SFMono-Regular
kMMMMMMMMMMMMMMMMMMMMMMd CPU: Intel i5-8500B (6) @ 3.00GHz
;KMMMMMMMWXXWMMMMMMMk. GPU: Intel UHD Graphics 630
.cooc,. .,coo:. Memory: 16342MiB / 32768MiB
Results
I've run a small list test, and a large list test - while trying to keep total test time similar (by increasing the number of runs). Spoiler alert: it doesn't really matter which you use, but stick to list comprehensions or vanilla for loops if you can, because those are the most common, most maintainable, and follow the "least surprise" principle (i.e. they're the most likely to be what some random developer would expect to see).
NUMBER_OF_VALUES = 10_000
NUMBER_OF_RUNS = 1_000
NUMBER_OF_REPEATS = 5
squares_comprehension:
min: 2.198ms
avg: 2.234ms
max: 2.295ms
squares_forloop_naive (~0.1ms/5% slower than comprehension per run):
min: 2.317ms
avg: 2.323ms
max: 2.329ms
squares_forloop_nodot (~0.1ms/5% slower than comprehension per run):
min: 2.309ms
avg: 2.334ms
max: 2.361ms
squares_forloop_preallocate (~0.2ms/10% slower than comprehension per run):
min: 2.433ms
avg: 2.452ms
max: 2.470ms
NUMBER_OF_VALUES = 1_000_000
NUMBER_OF_RUNS = 10
NUMBER_OF_REPEATS = 5
squares_comprehension:
min: 263.775ms
avg: 271.824ms
max: 276.505ms
squares_forloop_naive (~8-10ms/3-4% slower than comprehension per run):
min: 273.416ms
avg: 276.072ms
max: 283.245ms
squares_forloop_nodot (~10-15ms/4-6% slower than comprehension per run):
min: 276.128ms
avg: 278.774ms
max: 282.404ms
squares_forloop_preallocate (~50ms/20% slower than comprehension per run):
min: 309.654ms
avg: 311.013ms
max: 312.896ms
Amusing aside
I was pretty surprised by the results that came out of this test - because the standard for
loop and append
was already almost as fast as list comprehension. It took me a few minutes to realize that I was testing on Python 3.11 which had a lot of optimizations under the hood. I think cheaper frames, inlined function calls, and method loading had the largest impact.
I'll discuss these optimizations in a future post, but for context, here is what Python 3.10 looks like:
NUMBER_OF_VALUES = 1_000_000
NUMBER_OF_RUNS = 10
NUMBER_OF_REPEATS = 5
squares_comprehension:
min: 276.031ms
avg: 282.553ms
max: 287.557ms
squares_forloop_naive (~35ms/12% slower than comprehension per run):
min: 310.676ms
avg: 311.744ms
max: 312.777ms
squares_forloop_nodot (~10-15ms/4-5% slower than comprehension per run):
min: 289.103ms
avg: 291.680ms
max: 293.609ms
squares_forloop_preallocate (~40ms/15% slower than comprehension per run):
min: 316.805ms
avg: 317.813ms
max: 318.840ms
There we go! That makes wayyyy more sense now when compared against the mountains of StackOverflow posts that say "list comprehension is much, much faster than a for loop" - because it was, until Python 3.11. However, even these "slow" for loops are still perfectly acceptable for the overwhelming majority of use cases.
Compile it
Before I leave this topic - I want to talk about an easy win for this kind of performance-driven code. A library like numpy runs so fast on such large data because it makes heavy use of C extensions. If you're doing a lot of math, or a lot of data manipulation, you can get a huge performance boost by compiling your code to C and then calling it from Python. I've done this a few times, and it's not too hard, but it's also not something I'd recommend for a beginner. Instead of trying to learn how to compile your code, I'd recommend learning how to use a library like numpy or pandas that already has the performance optimizations built in. In fact, that's not even a recommendation - it's a statement of fact. Use those libraries, and you'll be fine.
If you're writing code that's a bit more custom, you can go the route of something like nuitka or cython. Both of those options are a bit more complicated, but they're also a bit more powerful. You don't even need to delve into the cython-specific syntax. Just write regular Python, compile it, and take the win.
However, I'm going to propose an even easier solution. If you're writing Python code with type hints (which you should be doing), and if your code passes mypy strict checks, you can save yourself a hassle and use mypyc instead. It has its own set of limitations, but when it works, it's trivial.
For example:
mypyc lib.py
There we go. You should have a compiled library version of your code that you can then import like any other Python module. It's that easy.
Faster Results
Annnnd let's see what happened after using the mypyc compiler:
NUMBER_OF_VALUES = 1_000_000
NUMBER_OF_RUNS = 10
NUMBER_OF_REPEATS = 5
squares_comprehension_mypyc (~82ms/31% faster than non-compiled comprehension per run):
min: 180.988ms
avg: 183.842ms
max: 185.987ms
squares_forloop_naive_mypyc (~1-2ms/1% faster than compiled comprehension per run):
min: 178.299ms
avg: 181.452ms
max: 185.080ms
squares_forloop_nodot_mypyc (~1-2ms/1% faster than compiled comprehension per run):
min: 179.741ms
avg: 182.105ms
max: 184.054ms
squares_forloop_preallocate_mypyc (~5ms/3% faster than compiled comprehension per run):
min: 175.048ms
avg: 179.784ms
max: 185.061ms
Wow, check that out. The compiled versions are about 30% faster than the non-compiled versions, and the compiled for loops have slightly overtaken the compiled list comprehensions. That's pretty neat. And to boot, my poor, slow, lonely pre-allocated array idea went from being the tortoise to being the hare.
I should note that on really computationally heavy operations (e.g. fib), I've seen a 5-10x speedup. For something as trivial as a loop, there is already so much optimization under the hood that we "only" see a 30% bump. The gains are much higher in typical compiled applications.
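To make the "computationally heavy" case concrete, this is the kind of call-heavy function I mean by fib - a naive sketch where compilation's cheaper typed calls pay off the most:

```python
def fib(n: int) -> int:
    """Naive doubly-recursive Fibonacci: almost pure function-call overhead,
    which is exactly the cost that compilation strips away."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


print(fib(30))  # ~2.7 million recursive calls - enough to feel the difference
```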
Final Thoughts
I went into complete digression mode for this post, as I was originally going to talk about the practical speed differences between os.walk and pathlib's rglob, but I got sucked into the for loop vs. list comprehension debate. Instead of giving you the lame "premature optimization is the root of all evil" speech, I'll just leave you with my recommendations:
- Use Python 3.11 or later
- Write code for readability and maintainability first, and performance second
- Write code with type hints and mypy checks
- Use numpy or pandas for math and data manipulation (or other libraries with C extensions for the heavy lifting)
- Compile your code with mypyc if you want some free speed
Hey wait a sec, where's the loop recommendation? Whelp, sorry to break it to my fellow keyboard warriors, but following my recommendations above - it honestly doesn't matter which one you use.
As shown above, with Python 3.11, the for loop vs list comprehension numbers are so close (compiled or not) that it doesn't matter. Even in Python 3.10, they weren't crazy-far-apart in wall-clock times. What's more, assuming your loop is doing something remotely non-trivial, the loop time will be completely wiped out by your actual business logic.