Brandon Rohrer
Data scientist
Numba rule of thumb #7: Pass return variables in as input arguments.

This avoids initializing a fresh array each time, shaving off precious microseconds.

It's natural to write a function that looks like this

@njit
def add(a, b):
    c = np.zeros(a.size)
    for i in range(a.size):
        c[i] = a[i] + b[i]
    return c

where the result array, c, is created and initialized before it is populated.

Often, functions are called repeatedly with arguments of the same shape. (The fact that they are called so often is what makes them appealing targets for speeding up with Numba.) When that is the case, it's possible to use a shortcut.

@njit
def add(a, b, c):
    for i in range(a.size):
        c[i] = a[i] + b[i]

where the result array, c, is created just once, outside the function, and reused. This way the memory is preallocated and the function can get right to the business at hand.

This is such a useful trick that NumPy uses it too. Most NumPy functions have an optional `out` parameter that you can use to pass in a preallocated result array.

The difference is typically just a small fraction of the total compute time, but it's a freebie: an optimization that comes with simpler code and logic. There's no downside! That's a rare thing. Letting it go unclaimed is like leaving the last bite of cheesecake just sitting on the table.
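As a quick illustration of the NumPy side of this trick, here is a minimal sketch of the `out` parameter on NumPy's ufuncs. The array names are made up for the example; the point is that the result lands in a preallocated buffer instead of a freshly created array.

```python
# Demonstrating NumPy's `out` parameter: the result is written into
# a preallocated array instead of a newly allocated one.
import numpy as np

a = np.arange(5, dtype=np.float64)
b = np.ones(5)
c = np.empty(5)  # preallocated once, up front

np.add(a, b, out=c)       # writes a + b into c, no new allocation
np.multiply(a, b, out=c)  # the same buffer can be reused for later calls
```

The same preallocated array can be passed to many different calls, which is exactly the pattern the rule above recommends for Numba functions.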
Chris A.
Software Engineer at PNNL (Center for AI, data.pnnl.gov)
IMO add(a, b) should be the default unless you've profiled the code and know that it is the bottleneck. There are downsides to the output-argument style (Google around for why people prefer to write side-effect-free functions and avoid global variables; there is no shortage of people who have written about it).
Yigit Mertol Kayabasi
Data Scientist / ML
🍰

Numba rule of thumb #6: Call Numba-jitted functions once before kicking off the program.

This avoids awkward hiccups in execution.

Numba functions are so fast because they are precompiled to machine code, but this compiling step takes a few moments to complete. The compiler is also "lazy" because it waits until the absolute last possible second. It is a "just in time" or JIT compiler. The upside of this is that it doesn't incur any latency in the program starting up, and it avoids unnecessary compilations. The downside is that it can make for an unexpected several-second pause in the program the first time the function is called.

Having that unscheduled pause can knock processes out of synchronization or make for a bumpy user experience. To take back control of when this occurs, you can make a gratuitous first call to your Numba functions during startup, when nothing important is going on yet and a user will be least annoyed by it. For example, when I'm timing a Numba-jitted function, including the first call in the timing estimate would grossly overestimate the average execution time, so I make sure to call it first, outside the loop. This "warms up" the functions, so that they are already compiled by the time they are encountered in the natural flow of the program.

It's a small thing, but small things add up.
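The warm-up pattern described above can be sketched like this. The function and array names are invented for the example, and the try/except makes the sketch run even without Numba installed (with a plain-Python fallback, so the timing benefit only appears when Numba is present).

```python
# Warm-up pattern: call the jitted function once with throwaway inputs
# at startup, so compilation happens before the real work begins.
import numpy as np

try:
    from numba import njit
except ImportError:
    njit = lambda f: f  # no-op decorator if Numba is unavailable

@njit
def add(a, b, c):
    for i in range(a.size):
        c[i] = a[i] + b[i]

def warm_up():
    # Tiny dummy arrays are enough. Numba compiles per argument type
    # (dtype and dimensionality), not per array size, so this builds
    # the same machine code the real calls will use.
    dummy = np.zeros(1)
    add(dummy, dummy, np.zeros(1))

warm_up()  # pay the compilation cost here, during startup

# Later calls in the natural flow of the program run at full speed.
a = np.arange(3.0)
b = np.ones(3)
c = np.zeros(3)
add(a, b, c)
```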

Sometimes the tests test the code and sometimes the code tests the tests

Numba rule of thumb #5: Use @njit rather than @jit.

This tip is already outdated, which shows how active Numba development is. In version 0.58 and earlier, the default behavior of the compiler was to fall back to regular Python execution if anything should happen to frustrate the Numba compiler. A small glitch like a data type mismatch could turn a bullet-fast Numba-jitted function into a slower-than-tar Python for loop. And the bad part is that there would be no error, no hint to the developer or user that anything was wrong, other than a mysterious performance drop.

The way to get around this was to use @jit(nopython=True) as the decorator for Numba functions. This was so commonly used that it got its own nickname, @njit. Compiling with @njit ensured that if Numba compilation failed, an error would be thrown. It embodied the software engineering best practice of having all failures be noisy.

It was so useful, in fact, that as of Numba 0.59 (released January 2024) @jit now defaults to nopython=True. Changing the default behavior of the decorator may be a breaking change for some code bases, but it comes with the benefit of better-engineered code for many others. And as a bonus, if you are using a recent version of Numba, you can stop worrying about this issue entirely and just use @jit.
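For reference, here is a sketch of the three equivalent spellings side by side. The function names are made up for the example, and the try/except fallback is only there so the sketch runs without Numba installed; the fail-loudly behavior it illustrates of course requires real Numba.

```python
# Three spellings of the same intent. On Numba 0.58 and earlier, only
# the first two refuse to fall back silently; on 0.59+, plain @jit
# behaves the same way.
try:
    from numba import jit, njit
except ImportError:
    # no-op stand-ins so the sketch runs without Numba installed
    def jit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f
    njit = jit

@jit(nopython=True)   # explicit: raise an error if compilation fails
def f1(x):
    return x + 1

@njit                 # the nickname for @jit(nopython=True)
def f2(x):
    return x + 1

@jit                  # Numba 0.59+: now also defaults to nopython=True
def f3(x):
    return x + 1
```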

Numba rule of thumb #4: Don't write your own matrix multiplication.

The widest, best-paved road in scientific computing is matrix multiplication. NumPy's matrix multiplication has been optimized for your system in ways Numba can't match. Comparing a straightforward Numba for-loop implementation to NumPy's matmul() is sobering.

@njit
def matmul_numba(a, b, c):
    n_i, n_j = a.shape
    n_k = b.shape[1]
    for i in range(n_i):
        for j in range(n_j):
            for k in range(n_k):
                c[i, k] += a[i, j] * b[j, k]

For a pair of 2000 x 2000 matrices, my system shows that matmul_numba() takes 2800 ms, compared to numpy.matmul()'s 125 ms. NumPy is more than 20X faster. You can't beat NumPy's matmul(). But don't let that stop you from trying! One trick you can use is

@njit(parallel=True)

and substituting Numba's prange() for range(). prange() is a special variant of range() that supports parallelization. Together these instruct Numba to parallelize the matrix operation across multiple threads, as NumPy does. For me, this reduces Numba's run time by a factor of four, to 720 ms. It's still more than 5X slower than NumPy, but we've closed the gap a bit.

There are two good lessons here. The first is that there are tricks to speed up Numba even more. The second is that Numba is not the right tool for every job. For large, optimized calculations there may be a better tool, and numpy.matmul() is one of these.
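Putting the prange() trick together, a full parallel variant might look like the sketch below. The function name is invented for the example, and the try/except fallback (with prange degrading to a serial range) is only there so the sketch runs without Numba installed.

```python
# Parallel variant: @njit(parallel=True) plus prange lets Numba spread
# the outer loop across threads. Each i writes its own row of c, so
# there is no race between threads.
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f
    prange = range  # serial fallback without Numba

@njit(parallel=True)
def matmul_parallel(a, b, c):
    n_i, n_j = a.shape
    n_k = b.shape[1]
    for i in prange(n_i):  # parallelized across threads
        for j in range(n_j):
            for k in range(n_k):
                c[i, k] += a[i, j] * b[j, k]

a = np.random.sample((40, 50))
b = np.random.sample((50, 30))
c = np.zeros((40, 30))
matmul_parallel(a, b, c)
```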

Numba rule of thumb #3: Don't create intermediate arrays.

It's a fine point, but you can shave precious time off your Numba execution by not creating extra arrays. Intermediate arrays can make code more readable, but Numba takes them literally. It takes the extra time to allocate the memory for the intermediate variables.

Here's an example from physics simulations: calculating all the pairwise distances between two groups of points. These two functions produce identical results, except that one makes several stops along the way to the final answer.

For 5000 points in each group, the distances_intermediate() function takes 600 ms on my machine, while distances_direct() takes 90 ms. This is a contrived example, but it shows how those intermediate arrays can bog you down.

@njit
def distances_intermediate(x1, y1, x2, y2, d):
    dx = np.zeros((x1.size, x2.size))
    for i in range(x1.size):
        for j in range(x2.size):
            dx[i, j] = x1[i] - x2[j]

    dy = np.zeros((y1.size, y2.size))
    for i in range(y1.size):
        for j in range(y2.size):
            dy[i, j] = y1[i] - y2[j]

    dx_squared = np.zeros((x1.size, x2.size))
    for i in range(x1.size):
        for j in range(x2.size):
            dx_squared[i, j] = dx[i, j] ** 2

    dy_squared = np.zeros((y1.size, y2.size))
    for i in range(y1.size):
        for j in range(y2.size):
            dy_squared[i, j] = dy[i, j] ** 2

    d_squared = np.zeros((x1.size, x2.size))
    for i in range(x1.size):
        for j in range(x2.size):
            d_squared[i, j] = dx_squared[i, j] + dy_squared[i, j]

    for i in range(x1.size):
        for j in range(x2.size):
            d[i, j] = d_squared[i, j] ** .5

@njit
def distances_direct(x1, y1, x2, y2, d):
    for i in range(x1.size):
        for j in range(x2.size):
            d[i, j] = ((x1[i] - x2[j]) ** 2 + (y1[i] - y2[j]) ** 2) ** .5

Numba rule of thumb #2: Avoid NumPy array operations and functions.

This is a repeat of rule #1 about preferring for loops, but it is so counterintuitive that it bears repeating.

Avoid doing any NumPy operations in a Numba-jitted function. Don't create new arrays, don't broadcast existing arrays, don't reshape() or transpose() or concatenate(). (We'll talk about exceptions to this in later rules.)

NumPy is fast because it uses precompiled, optimized C code. Numba is fast because it compiles Python code in a highly optimized way. But Numba can't change the optimized NumPy code, so it's stuck trying to shove a square peg into a round hole, and some performance is lost.

To demonstrate, here are NumPy and Numba functions that multiply three one-dimensional arrays to get a three-dimensional array, then sum it along its second dimension.

def numpy_version(a, b, c, d):
    d[:] = np.sum(
        a[:, np.newaxis, np.newaxis]
        * b[np.newaxis, :, np.newaxis]
        * c[np.newaxis, np.newaxis, :],
        axis=1,
    )

@njit
def numba_version(a, b, c, d):
    for i in range(a.size):
        for j in range(b.size):
            for k in range(c.size):
                d[i, k] += a[i] * b[j] * c[k]

With these input arguments

a = np.random.sample(200)
b = np.random.sample(300)
c = np.random.sample(400)
d = np.zeros((200, 400))

I get 46.0 ms for numpy_version() and 2.6 ms for numba_version(), a speedup of more than 17X. That factor only grows as a, b, and c get larger.

Numba rule of thumb #1: Try for loops first.

Young Python programmers quickly get for loops beaten out of them. Large for loops are glacially slow. Instead, we are taught vectorization: to put our numbers into arrays before working with them. This allows the under-the-hood optimizations of NumPy to speed things up.

When working in Numba, it's the opposite. Within a Numba function, for loops generally perform better than array operations. For instance, check out these two functions.

@njit
def add_arrays(a, b, c):
    c[:] = a + b

@njit
def add_for_loop(a, b, c):
    for i in range(a.size):
        c[i] = a[i] + b[i]

For 10-million-element arrays, the add_arrays() function runs in 35 milliseconds on my machine. The add_for_loop() function runs in 12.6 milliseconds.

Numba loves for loops. Even though it operates naturally on NumPy arrays as input arguments, I've found that it runs fastest when I avoid using any array operations in the function. For loops and base Python are your friends. I'm not sure why, but my best guess is that the optimizations that NumPy has already performed conflict with the compile-time optimizations of Numba. (If you know more about this, please drop your insights into the comments.)
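For anyone who wants to reproduce numbers like these, here is one way a timing harness might look. It is a sketch, not the author's actual benchmark: the array size is smaller than the post's 10 million elements so the plain-Python fallback (used when Numba is not installed) stays quick, and the first call is made outside the timed region so compilation time is excluded.

```python
# Minimal timing harness for the for-loop version above.
import time
import numpy as np

try:
    from numba import njit
except ImportError:
    njit = lambda f: f  # no-op decorator if Numba is unavailable

@njit
def add_for_loop(a, b, c):
    for i in range(a.size):
        c[i] = a[i] + b[i]

n = 100_000
a = np.random.sample(n)
b = np.random.sample(n)
c = np.zeros(n)

add_for_loop(a, b, c)  # warm-up call: triggers compilation (see rule #6)

start = time.perf_counter()
add_for_loop(a, b, c)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"add_for_loop: {elapsed_ms:.3f} ms")
```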

Your first Numba function

If you're new to Numba, not to worry. It's not nearly as intimidating as it sounds. Imagine you have two arrays and you want to add them. You can of course use NumPy's array operations.

import numpy as np

n = 10_000_000
a = np.random.sample(size=n)
b = np.random.sample(size=n)
c = a + b

This typically takes 15 ms on my box.

But if you need to go even faster, you can use Numba. First you'll need to make sure you have it. For me this happens at the command line:

python3 -m pip install numba

Then write a function that uses Numba's just-in-time compiler.

from numba import jit

@jit
def add(a, b, c):
    for i in range(a.size):
        c[i] = a[i] + b[i]

c = np.zeros(n)
add(a, b, c)

The first time this function is called it takes a little time to compile, but after that it runs in about 12 ms for me. Faster than even NumPy! In my experience, the more complex the calculation, the greater the benefit of moving to Numba.

It's fun to compare this against base Python to see how far we've come. Without the @jit decorator, add() takes 2300 ms to run. Numba makes it almost 200 times faster.