The fastest 2D convolution in the world
Convolutions are essential components of any neural network, image-processing pipeline or computer-vision algorithm, but they are also a computational bottleneck. I will here benchmark different solutions using numpy, scipy or pytorch. This is work-in-progress, so any suggestion is welcome, for instance on StackExchange or in the comments below this post.
Let's first initialize the notebook:
from __future__ import division, print_function
import numpy as np
np.set_printoptions(precision=6, suppress=True)
import os
%matplotlib inline
#%config InlineBackend.figure_format='retina'
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
phi = (np.sqrt(5)+1)/2
fig_width = 10
figsize = (fig_width, fig_width/phi)
%load_ext autoreload
%autoreload 2
setting up the problem¶
The convolution operates on two 2-D matrices. We will here always consider the case which is most typical in computer vision:
- a first matrix $A$ is the input and is typically large ($N \times N$ where $N$ is typically larger than $2^{10}=1024$),
- a second matrix $B$ is the template and is typically smaller (say $M=128$),
- the result of the convolution $C = A \ast B$ is padded such that it is of the same size as $A$ (the operation is written out just below this list).
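For reference, the operation we are after reads, for each pixel $(i, j)$ of $C$ (up to the exact centering convention of each library):
$$C[i, j] = \sum_{u=0}^{M-1} \sum_{v=0}^{M-1} A[i - u + \lfloor M/2 \rfloor,\, j - v + \lfloor M/2 \rfloor] \cdot B[u, v],$$
where $A$ is taken to be zero outside of its borders.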
Often, you need to do this with many images and many kernels. In addition, we will test the effect of the memory layout, referred to as prefetching below: whether the image and kernel indices come first (prefetching=True) or last (prefetching=False) in the arrays.
Let's write a small function for this:
N, n_N, M, n_M = 1024, 2, 16, 3
# DEBUG
# N, n_N, M, n_M = 64, 10, 32, 8
def get_data(N=N, n_N=n_N, M=M, n_M=n_M, seed=42, prefetching=False):
np.random.seed(seed)
if prefetching:
A = np.random.rand(n_N, N, N)
B = np.random.rand(n_M, M, M)
C = np.zeros((n_N, n_M, N, N))
else:
A = np.random.rand(N, N, n_N)
B = np.random.rand(M, M, n_M)
C = np.zeros((N, N, n_N, n_M))
return A, B, C
for prefetching in [True, False]:
print('with prefetching=', prefetching)
A, B, C = get_data(prefetching=prefetching)
print('Checking size of A =', A.shape, ' of B=', B.shape, ', and of C=', C.shape)
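With the default values, this should print shapes (2, 1024, 1024), (3, 16, 16) and (2, 3, 1024, 1024) with prefetching, and (1024, 1024, 2), (16, 16, 3) and (1024, 1024, 2, 3) without.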
Calling this function takes some initialization time, but it makes the subsequent tests cleaner:
def test_get_data(N=N, n_N=n_N, M=M, n_M=n_M):
A, B, C = get_data(N, n_N, M, n_M)
%%timeit
test_get_data(N, n_N, M, n_M)
using scipy¶
The scipy library has different solutions:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html
from scipy.signal import convolve2d
def test_scipy(A, B, C, prefetching=False):
if prefetching:
for i_N in np.arange(A.shape[0]):
for i_M in np.arange(B.shape[0]):
C[i_N, i_M, :, :] = convolve2d(A[i_N, :, :], B[i_M, :, :], mode='same', boundary='fill', fillvalue=0)
else:
for i_N in np.arange(A.shape[-1]):
for i_M in np.arange(B.shape[-1]):
C[:, :, i_N, i_M] = convolve2d(A[:, :, i_N], B[:, :, i_M], mode='same', boundary='fill', fillvalue=0)
A, B, C = get_data(N, n_N, M, n_M)
%%timeit
test_scipy(A, B, C)
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
%%timeit
test_scipy(A, B, C, prefetching=True)
A, B, C = get_data(N, n_N, M, n_M)
from scipy.signal import fftconvolve
def test_scipy_fft(A, B, C, prefetching=False):
if prefetching:
for i_N in np.arange(A.shape[0]):
for i_M in np.arange(B.shape[0]):
C[i_N, i_M, :, :] = fftconvolve(A[i_N, :, :], B[i_M, :, :], mode='same')
else:
for i_N in np.arange(A.shape[-1]):
for i_M in np.arange(B.shape[-1]):
C[:, :, i_N, i_M] = fftconvolve(A[:, :, i_N], B[:, :, i_M], mode='same')
A, B, C = get_data(N, n_N, M, n_M)
%%timeit
test_scipy_fft(A, B, C)
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
%%timeit
test_scipy_fft(A, B, C, prefetching=True)
Further profiling shows that most of the computing time is divided between the three FFTs (two forward, one inverse).
This shows the advantage of using the Fourier transform to perform the convolution. There is also a slight advantage in using prefetching. Note that fftconvolve pads internally, so the borders are zero-filled here; the periodic (circular) boundary issue only appears with the raw FFT product used below.
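As a quick sanity check (with arbitrary small sizes so that it runs fast), the direct and FFT-based convolutions should agree up to floating-point error:
from scipy.signal import convolve2d, fftconvolve
a, b = np.random.rand(64, 64), np.random.rand(8, 8)
# both compute a 'same'-size, zero-padded convolution:
print(np.allclose(convolve2d(a, b, mode='same', boundary='fill', fillvalue=0),
                  fftconvolve(a, b, mode='same')))  # should print True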
using numpy directly¶
Instead of loading scipy (or more reasonably just the subset that loads the fftpack), one can simply use numpy (see this comment by FonderPrism).
The idea is to simply load the appropriate function, which is documented at https://docs.scipy.org/doc/numpy/reference/generated/numpy.fft.fft2.html :
A, B, C = get_data(N, n_N, M, n_M)
from numpy.fft import fft2, ifft2
def np_fftconvolve(A, B):
return np.real(ifft2(fft2(A)*fft2(B, s=A.shape)))
def test_numpy_fft(A, B, C, prefetching=False):
if prefetching:
for i_N in np.arange(A.shape[0]):
for i_M in np.arange(B.shape[0]):
C[i_N, i_M, :, :] = np_fftconvolve(A[i_N, :, :], B[i_M, :, :])
else:
for i_N in np.arange(A.shape[-1]):
for i_M in np.arange(B.shape[-1]):
C[:, :, i_N, i_M] = np_fftconvolve(A[:, :, i_N], B[:, :, i_M])
A, B, C = get_data(N, n_N, M, n_M)
%%timeit
test_numpy_fft(A, B, C)
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
%%timeit
test_numpy_fft(A, B, C, prefetching=True)
Further profiling shows that most of the computing time is divided between the three FFTs (two forward, one inverse).
This shows the advantage of using the Fourier transform to perform the convolution. There is also a slight advantage in using prefetching. Beware, however, that this direct product of FFTs implements a circular convolution: the kernel wraps around the borders of the image, and the result is shifted by half the kernel size relative to scipy's mode='same'.
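To see the difference, compare it with scipy's fftconvolve, which pads internally (the sizes below are arbitrary):
a, b = np.random.rand(64, 64), np.random.rand(8, 8)
# circular (wrap-around) boundaries and a half-kernel shift:
print(np.allclose(np_fftconvolve(a, b), fftconvolve(a, b, mode='same')))  # should print False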
Given the structure of the problem, an optimization consists of computing the FFT of each image and each filter just once and reusing them for every pair. Note that this test function writes into the global output array C allocated by get_data:
A, B, C = get_data(N, n_N, M, n_M)
from numpy.fft import fft2, ifft2
def test_numpy_fft_opt(A, B, prefetching=False):
if prefetching:
f_B = np.zeros((B.shape[0], A.shape[-2], A.shape[-1]), dtype=np.complex128)
for i_M in np.arange(B.shape[0]):
f_B[i_M, :, :] = fft2(B[i_M, :, :], s=A.shape[-2:])
for i_N in np.arange(A.shape[0]):
f_A = fft2(A[i_N, :, :])
for i_M in np.arange(B.shape[0]):
C[i_N, i_M, :, :] = np.real(ifft2(f_A*f_B[i_M, :, :]))
else:
f_B = np.zeros((A.shape[0], A.shape[1], B.shape[-1]), dtype=np.complex128)
for i_M in np.arange(B.shape[-1]):
f_B[:, :, i_M] = fft2(B[:, :, i_M], s=A.shape[:2])
for i_N in np.arange(A.shape[-1]):
f_A = fft2(A[:, :, i_N])
for i_M in np.arange(B.shape[-1]):
C[:, :, i_N, i_M] = np.real(ifft2(f_A*f_B[:, :, i_M]))
A, B, C = get_data(N, n_N, M, n_M, prefetching=False)
%%timeit
test_numpy_fft_opt(A, B, prefetching=False)
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
%%timeit
test_numpy_fft_opt(A, B, prefetching=True)
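Since the inputs are real-valued, one could shave off more time by using the real-to-complex transforms, which roughly halve the work. A minimal sketch (np_rfftconvolve is our own name; this variant is not included in the benchmark):
from numpy.fft import rfft2, irfft2
def np_rfftconvolve(A, B):
    # rfft2 only computes the non-redundant half of the spectrum
    return irfft2(rfft2(A) * rfft2(B, s=A.shape), s=A.shape)
a, b = np.random.rand(64, 64), np.random.rand(8, 8)
print(np.allclose(np_fftconvolve(a, b), np_rfftconvolve(a, b)))  # should print True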
using ndimage¶
Within scipy, the ndimage library has different solutions:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.correlate.html
from scipy.ndimage import convolve
# help(convolve)
def test_ndimage(A, B, C, prefetching=False):
if prefetching:
for i_N in np.arange(A.shape[0]):
for i_M in np.arange(B.shape[0]):
C[i_N, i_M, :, :] = convolve(A[i_N, :, :], B[i_M, :, :], mode='constant', cval=0)
else:
for i_N in np.arange(A.shape[-1]):
for i_M in np.arange(B.shape[-1]):
C[:, :, i_N, i_M] = convolve(A[:, :, i_N], B[:, :, i_M], mode='constant', cval=0)
A, B, C = get_data(N, n_N, M, n_M)
%%timeit
test_ndimage(A, B, C)
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
%%timeit
test_ndimage(A, B, C, prefetching=True)
There is also a correlate function; since a convolution is a correlation with a kernel flipped along both axes, both give the same result (exactly so for odd-sized kernels):
%%timeit
from scipy.ndimage import correlate
C[0, 0, :, :] = correlate(A[0, :, :], B[0, ::-1, ::-1], mode='constant', cval=0)
using skimage¶
The skimage package has recently grown in popularity and includes many advanced computer-vision algorithms. Its match_template operator is quick (it uses the Fourier transform), while allowing for many convenient options (mode, boundaries):
https://github.com/scikit-image/scikit-image/blob/master/skimage/feature/template.py
Note: this function computes a normalized cross-correlation, so that you would need to flip the template along both axes to mimic a convolution; because of the normalization, the values are not directly comparable to those of the other methods, only the timings are.
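For reference, the quantity computed at each position is, up to the exact normalization conventions used by skimage, the normalized cross-correlation
$$\mathrm{NCC}[i, j] = \frac{\sum_{u,v} \left(A[i+u, j+v] - \bar{A}_{i,j}\right) \left(B[u, v] - \bar{B}\right)}{\sigma_{A,i,j} \, \sigma_{B}},$$
where $\bar{A}_{i,j}$ and $\sigma_{A,i,j}$ are the mean and standard deviation of $A$ under the template window, and $\bar{B}$ and $\sigma_B$ those of the template.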
from skimage.feature import match_template
def test_skimage(A, B, C, prefetching=False):
if prefetching:
for i_N in np.arange(A.shape[0]):
for i_M in np.arange(B.shape[0]):
C[i_N, i_M, :, :] = match_template(A[i_N, :, :], B[i_M, :, :], pad_input=True, mode='constant', constant_values=0.)
else:
for i_N in np.arange(A.shape[-1]):
for i_M in np.arange(B.shape[-1]):
C[:, :, i_N, i_M] = match_template(A[:, :, i_N], B[:, :, i_M], pad_input=True, mode='constant', constant_values=0.)
A, B, C = get_data(N, n_N, M, n_M)
%%timeit
test_skimage(A, B, C)
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
%%timeit
test_skimage(A, B, C, prefetching=True)
using numba¶
The numba package gives access to very fast, just-in-time compiled routines:
http://numba.pydata.org/numba-doc/0.15.1/examples.html#filterbank-correlation
import numpy as np
import numba
from numba import jit
nd4type = numba.double[:,:,:,:]
nd3type = numba.double[:,:,:]
@jit(nopython=True)
def nb_get_data(N=N, n_N=n_N, M=M, n_M=n_M, seed=42, prefetching=False):
np.random.seed(seed)
if prefetching:
A = np.random.rand(n_N, N, N)
B = np.random.rand(n_M, M, M)
C = np.zeros((n_N, n_M, N, N))
else:
A = np.random.rand(N, N, n_N)
B = np.random.rand(M, M, n_M)
C = np.zeros((N, N, n_N, n_M))
return A, B, C
@jit((nd3type, nd3type, nd4type))
def nbcorr_prefetching(imgs, filters, output):
n_imgs, n_rows, n_cols = imgs.shape
n_filters, height, width = filters.shape
for ii in range(n_imgs):
for rr in range(n_rows - height + 1):
for cc in range(n_cols - width + 1):
for hh in range(height):
for ww in range(width):
for ff in range(n_filters):
imgval = imgs[ii, rr + hh, cc + ww]
filterval = filters[ff, hh, ww]
output[ii, ff, rr, cc] += imgval * filterval
@jit((nd3type, nd3type, nd4type))
def nbcorr(imgs, filters, output):
n_rows, n_cols, n_imgs = imgs.shape
height, width, n_filters = filters.shape
for ii in range(n_imgs):
for rr in range(n_rows - height + 1):
for cc in range(n_cols - width + 1):
for hh in range(height):
for ww in range(width):
for ff in range(n_filters):
imgval = imgs[rr + hh, cc + ww, ii]
filterval = filters[hh, ww, ff]
output[rr, cc, ii, ff] += imgval * filterval
def test_numba(A, B, C, prefetching=False):
if prefetching:
nbcorr_prefetching(A, B, C)
else:
nbcorr(A, B, C)
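Note that these loops only fill the 'valid' part of the output (the top-left $(N - M + 1) \times (N - M + 1)$ region of each slice of C): there is no padding, so the timings are comparable to the other methods but the borders of the output are left untouched.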
A, B, C = nb_get_data(N, n_N, M, n_M)
%%timeit
test_numba(A, B, C)
A, B, C = nb_get_data(N, n_N, M, n_M, prefetching=True)
%%timeit
test_numba(A, B, C, prefetching=True)
There may be an advantage when processing multiple images at once:
A, B, C = nb_get_data(N=256, n_N=10, M=32, n_M=10)
%%timeit
test_numba(A, B, C)
A, B, C = nb_get_data(N=256, n_N=10, M=32, n_M=10, prefetching=True)
%%timeit
test_numba(A, B, C, prefetching=True)
This seems promising, as many numpy operations are supported: http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
using pytorch¶
The pytorch package is currently used in deep-learning (DL) architectures, which also often rely on convolutions (for instance in CNNs):
https://pytorch.org/docs/stable/nn.html#torch.nn.functional.conv2d
A first advantage is that this convolution operator is optimized to handle multiple examples (mini-batches in DL terminology) over multiple filters (channels in DL), such that no explicit loop is needed. The ordering of dimensions follows a similar strategy as when using prefetching: try to remember (B, C, H, W) for, respectively, mini-batches, channels, height and width:
import torch
from torch.nn.functional import conv2d
filters = torch.randn(8, 4, 3, 3)
inputs = torch.randn(10, 4, 5, 5)
out = conv2d(inputs, filters, padding=1)
out.shape
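This should return torch.Size([10, 8, 5, 5]): ten examples, eight output channels, and the same $5 \times 5$ spatial size thanks to padding=1.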
In our benchmark, we have only one channel, but multiple batches:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
A = torch.from_numpy(A[:, None, :, :])
B = torch.from_numpy(B[:, None, :, :])
C = torch.from_numpy(C)
A.shape, B.shape, C.shape
To get a convolution of the same size, it is necessary to pad the input (as for numpy). Note that the padding is symmetric, and since $M$ is even here, padding=M//2 yields an output slightly bigger than the input ($N + 1$ instead of $N$):
C = conv2d(A, B, padding=M//2)
A.shape, B.shape, C.shape
def test_torch(A, B, prefetching=False):
    if not prefetching:
        # move the image / kernel index to the front: (N, N, n_N) -> (n_N, N, N)
        A = np.moveaxis(A, -1, 0)
        B = np.moveaxis(B, -1, 0)
    A = torch.from_numpy(A[:, None, :, :])
    B = torch.from_numpy(B[:, None, :, :])
    C = conv2d(A, B, padding=B.shape[-1]//2)
A, B, C = get_data(N, n_N, M, n_M)
%%timeit
test_torch(A, B)
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
%%timeit
test_torch(A, B, prefetching=True)
It is faster than scipy_fft, for instance. It would be even faster on a GPU, which we do not include in this particular benchmark because it is difficult to compare it to the other methods, for which no such solution exists; a minimal sketch of a GPU variant is given below.
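Assuming a CUDA-capable device is available, such a variant could look like this (the .float() cast and the synchronization call are our additions; it is not part of the benchmark):
import torch
from torch.nn.functional import conv2d
if torch.cuda.is_available():
    A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
    A = torch.from_numpy(A[:, None, :, :]).float().cuda()
    B = torch.from_numpy(B[:, None, :, :]).float().cuda()
    C = conv2d(A, B, padding=B.shape[-1]//2)
    torch.cuda.synchronize()  # wait for the kernel to finish before timing
number of images¶
Let's first benchmark how the different solutions scale with the number of images; this should scale roughly linearly: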
from timeit import Timer
reps = 20
sp, sk, nb, pt, npo = [], [], [], [], []
n_Ns = 2**np.arange(6)
for prefetching in [False, True]:
for n_N_ in n_Ns:
A, B, C = get_data(N, n_N_, M, n_M, prefetching=prefetching)
t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
sk.append(t.timeit(number=reps))
t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
sp.append(t.timeit(number=reps))
t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
nb.append(t.timeit(number=reps))
t = Timer(lambda: test_torch(A, B, prefetching=prefetching))
pt.append(t.timeit(number=reps))
t = Timer(lambda: test_numpy_fft_opt(A, B, prefetching=prefetching))
npo.append(t.timeit(number=reps))
fig, ax = plt.subplots(figsize=(8, 5))
ax.loglog(n_Ns, sk[:len(n_Ns)], c='b', label='skimage')
ax.loglog(n_Ns, pt[:len(n_Ns)], c='c', label='torch')
ax.loglog(n_Ns, nb[:len(n_Ns)], c='g', label='numba')
ax.loglog(n_Ns, sp[:len(n_Ns)], c='r', label='scipy')
ax.loglog(n_Ns, npo[:len(n_Ns)], c='m', label='numpy')
ax.loglog(n_Ns, sk[len(n_Ns):], '--', c='b')
ax.loglog(n_Ns, nb[len(n_Ns):], '--', c='g')
ax.loglog(n_Ns, pt[len(n_Ns):], '--', c='c')
ax.loglog(n_Ns, sp[len(n_Ns):], '--', c='r')
ax.loglog(n_Ns, npo[len(n_Ns):], '--', c='m')
ax.set_xlabel('n_N')
ax.set_ylabel('time (s)')
ax.legend();
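In this figure and the following ones, solid lines correspond to prefetching=False and dashed lines to prefetching=True.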
size of images¶
The computation should scale supra-linearly with the image size $N$, since the number of pixels grows as $N^2$:
from timeit import Timer
reps = 20
sp, sk, nb, pt, npo = [], [], [], [], []
Ns = 2**np.arange(6, 10)
for prefetching in [False, True]:
for N_ in Ns:
A, B, C = get_data(N_, n_N, M, n_M, prefetching=prefetching)
t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
sk.append(t.timeit(number=reps))
t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
sp.append(t.timeit(number=reps))
t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
nb.append(t.timeit(number=reps))
t = Timer(lambda: test_torch(A, B, prefetching=prefetching))
pt.append(t.timeit(number=reps))
t = Timer(lambda: test_numpy_fft_opt(A, B, prefetching=prefetching))
npo.append(t.timeit(number=reps))
fig, ax = plt.subplots(figsize=(8, 5))
ax.loglog(Ns, sk[:len(Ns)], c='b', label='skimage')
ax.loglog(Ns, nb[:len(Ns)], c='g', label='numba')
ax.loglog(Ns, pt[:len(Ns)], c='c', label='torch')
ax.loglog(Ns, sp[:len(Ns)], c='r', label='scipy')
ax.loglog(Ns, npo[:len(Ns)], c='m', label='numpy')
ax.loglog(Ns, sp[len(Ns):], '--', c='r')
ax.loglog(Ns, sk[len(Ns):], '--', c='b')
ax.loglog(Ns, nb[len(Ns):], '--', c='g')
ax.loglog(Ns, pt[len(Ns):], '--', c='c')
ax.loglog(Ns, npo[len(Ns):], '--', c='m')
ax.set_xlabel('N')
ax.set_ylabel('time (s)')
ax.legend();
number of kernels¶
This should scale linearly with the number of kernels:
from timeit import Timer
reps = 20
sp, sk, nb, pt, npo = [], [], [], [], []
n_Ms = 2**np.arange(6)
for prefetching in [False, True]:
for n_M_ in n_Ms:
A, B, C = get_data(N, n_N, M, n_M_, prefetching=prefetching)
t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
sk.append(t.timeit(number=reps))
t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
sp.append(t.timeit(number=reps))
t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
nb.append(t.timeit(number=reps))
t = Timer(lambda: test_torch(A, B, prefetching=prefetching))
pt.append(t.timeit(number=reps))
t = Timer(lambda: test_numpy_fft_opt(A, B, prefetching=prefetching))
npo.append(t.timeit(number=reps))
fig, ax = plt.subplots(figsize=(8, 5))
ax.loglog(n_Ms, sp[:len(n_Ms)], c='r', label='scipy')
ax.loglog(n_Ms, sk[:len(n_Ms)], c='b', label='skimage')
ax.loglog(n_Ms, nb[:len(n_Ms)], c='g', label='numba')
ax.loglog(n_Ms, pt[:len(n_Ms)], c='c', label='torch')
ax.loglog(n_Ms, npo[:len(n_Ms)], c='m', label='numpy')
ax.loglog(n_Ms, sp[len(n_Ms):], '--', c='r')
ax.loglog(n_Ms, sk[len(n_Ms):], '--', c='b')
ax.loglog(n_Ms, nb[len(n_Ms):], '--', c='g')
ax.loglog(n_Ms, pt[len(n_Ms):], '--', c='c')
ax.loglog(n_Ms, npo[len(n_Ms):], '--', c='m')
ax.set_xlabel('n_M')
ax.set_ylabel('time (s)')
ax.legend();
size of kernels¶
Direct methods should scale supra-linearly with the kernel size $M$ (as $M^2$), while FFT-based methods should be largely insensitive to it:
from timeit import Timer
reps = 20
sp, sk, nb, pt, npo = [], [], [], [], []
Ms = 2**np.arange(2, 7)
for prefetching in [False, True]:
for M_ in Ms:
A, B, C = get_data(N, n_N, M_, n_M, prefetching=prefetching)
t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
sk.append(t.timeit(number=reps))
t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
sp.append(t.timeit(number=reps))
t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
nb.append(t.timeit(number=reps))
t = Timer(lambda: test_torch(A, B, prefetching=prefetching))
pt.append(t.timeit(number=reps))
t = Timer(lambda: test_numpy_fft_opt(A, B, prefetching=prefetching))
npo.append(t.timeit(number=reps))
fig, ax = plt.subplots(figsize=(8, 5))
ax.loglog(Ms, sp[:len(Ms)], c='r', label='scipy')
ax.loglog(Ms, sk[:len(Ms)], c='b', label='skimage')
ax.loglog(Ms, nb[:len(Ms)], c='g', label='numba')
ax.loglog(Ms, pt[:len(Ms)], c='c', label='torch')
ax.loglog(Ms, npo[:len(Ms)], c='m', label='numpy')
ax.loglog(Ms, sp[len(Ms):], '--', c='r')
ax.loglog(Ms, sk[len(Ms):], '--', c='b')
ax.loglog(Ms, nb[len(Ms):], '--', c='g')
ax.loglog(Ms, pt[len(Ms):], '--', c='c')
ax.loglog(Ms, npo[len(Ms):], '--', c='m')
ax.set_xlabel('M')
ax.set_ylabel('time (s)')
ax.legend();
conclusion?¶
As a conclusion, there is no single answer for all situations: the fastest method depends on the task at hand. This benchmark needs to be extended to the case where you have access to a GPU, for which the parallelization should (in theory) make convolutions faster with pytorch. Still, the FFT-based solution using numpy seems the fastest. Overall, there is also a slight advantage in using prefetching. In the end, knowing the relative advantages of each library is crucial for picking the right one for the job.
some book keeping for the notebook¶
%load_ext version_information
%version_information numpy, scipy, matplotlib, torch, numba