numba를 이용한 Single-CPU, Multi-CPU, GPU-CUDA 의 box blur 속도 비교

얼마전 성능 및 코드 간결성 비교 numba.vectorize vs numba.jit 에서 O(n) 복잡도의 연산에 대한 Single-CPU 와 Multi-CPU 의 수행 속도를 비교하여 보았다. 수행에 소요되는 시간은 예상대로 입력 데이터의 개수 대해 선형적으로 증가하였다. 이번에는 O(n*r^2) 의 복잡도를 갖는 간단한 동일가중치 box blur 알고리즘에 대한 수행 속도를 비교해 보려고한다.(계산 복잡도는 box 크기 r 의 제곱에 비례한다.)

box 의 크기를 임의의 값인 21 x 21 로 하였다. single 의 독보적 증가 추세로 인해 나머지 두 항목(parallel, cuda)의 추세가 드러나지 않는다.

parallel, cuda 의 보다 정확한 비교를 위해 single 을 제외하여 보았다. cuda 의 속도가 그리 높지 않은 의외의 결과가 나왔다. CPU 가 아무리 E5-2690v4 * 2ea (총 코어수 56개)라고 해도 cuda device 가 무려 P100 임을 고려하면 이해할 수 없는 결과이다. 입력되는 데이터 양 대비 복잡도가 매우 큰 알고리즘이므로 측정용 알고리즘 선택이 잘못된 것은 아닐 것이다. numba.cuda 의 코드가 잘못 작성되었을 것으로 추측하지만 지금으로서는 어찌할 도리가 없다.(향후 수정 보완 필요하며, 어떤 부분의 원인인지 제시할 것)

앞서 측정된 결과의 의문점을 확인하기 위해 box 크기를 201 x 201 로 하여 복잡도를 더욱 증가시켰다.(코어 수가 많은 cuda가 더 유리하도록) 그리고 parallel과 cuda 의 측정 결과를 비교하여 보았다. CPU 보다 빠르기는 하지만 여전히 의문이다...

측정에 사용된 코드를 아래와 같다.(단순히 수행 속도 비교를 위해 작성되어, 효율이 고려하지 않은 코드이다)

import matplotlib

matplotlib.use('Agg')

import time

import numpy
from numba import jit, prange, cuda
import matplotlib.pylab as plt
import math

SIZE = 10000
BOX_SIZE = 200
TPB = 32


@jit(nopython=True)
def func1(data, box_size):
    res = numpy.empty_like(data)
    for y in range(res.shape[0]):
        for x in range(res.shape[1]):
            res[y, x] = 0
            for yy in range(0, box_size):
                for xx in range(0, box_size):
                    nx = x + xx - int(box_size / 2)
                    ny = y + yy - int(box_size / 2)

                    if nx < 0:
                        nx = 0
                    elif nx >= res.shape[1]:
                        nx = res.shape[1] - 1

                    if ny < 0:
                        ny = 0
                    elif ny >= res.shape[0]:
                        ny = res.shape[0] - 1

                    res[y, x] += data[ny, nx] / (box_size * box_size)
    return res


@jit(nopython=True, parallel=True)
def func2(data, box_size):
    res = numpy.empty_like(data)
    for y in prange(res.shape[0]):
        for x in prange(res.shape[1]):
            res[y, x] = 0
            for yy in range(0, box_size):
                for xx in range(0, box_size):
                    nx = x + xx - int(box_size / 2)
                    ny = y + yy - int(box_size / 2)

                    if nx < 0:
                        nx = 0
                    elif nx >= res.shape[1]:
                        nx = res.shape[1] - 1

                    if ny < 0:
                        ny = 0
                    elif ny >= res.shape[0]:
                        ny = res.shape[0] - 1

                    res[y, x] += data[ny, nx] / (box_size * box_size)
    return res


@cuda.jit
def func3(data, box_size, res):
    y, x = cuda.grid(2)

    if y < res.shape[0] and x < res.shape[1]:

        res[y, x] = 0
        for yy in range(0, box_size):
            for xx in range(0, box_size):
                nx = x + xx - int(box_size / 2)
                ny = y + yy - int(box_size / 2)

                if nx < 0:
                    nx = 0
                elif nx >= res.shape[1]:
                    nx = res.shape[1] - 1

                if ny < 0:
                    ny = 0
                elif ny >= res.shape[0]:
                    ny = res.shape[0] - 1

                res[y, x] += data[ny, nx] / (box_size * box_size)


def main():
    sizes = []
    results = []
    for size in range(int(SIZE / 10), int(SIZE + SIZE / 10), int(SIZE / 10)):
        print(size)

        sizes.append(size)
        x = numpy.arange(size * size, dtype=numpy.float32).reshape((size, size))

        s = time.time()
        a = func1(x, BOX_SIZE)
        t1 = time.time() - s

        s = time.time()
        b = func2(x, BOX_SIZE)
        t2 = time.time() - s

        s = time.time()
        xx = numpy.empty_like(x)

        threadsperblock = (TPB, TPB)
        blockspergrid_x = int(math.ceil(x.shape[0] / threadsperblock[1]))
        blockspergrid_y = int(math.ceil(x.shape[1] / threadsperblock[0]))
        blockspergrid = (blockspergrid_x, blockspergrid_y)

        x_dary = cuda.to_device(x)
        xx_dary = cuda.device_array(xx.shape, xx.dtype)
        func3[blockspergrid, threadsperblock](x_dary, BOX_SIZE, xx_dary)
        xx_dary.copy_to_host(xx)

        t3 = time.time() - s

        assert numpy.all(a == xx)
        assert numpy.all(b == xx)

        results.append((t1, t2, t3))

    plt.plot(sizes, [x[0] for x in results], 'rs-', label='single')
    plt.plot(sizes, [x[1] for x in results], 'bs-', label='parallel')
    plt.plot(sizes, [x[2] for x in results], 's-', label='cuda')
    plt.legend()
    # plt.show()
    plt.savefig('output.png')


if __name__ == '__main__':
    main()

matplotlib 을 CLI 기반의 서버에서 사용하려고 했더니 오류가 발생하여 아래와 같은 환경변수를 설정하였다.

export MPLBACKEND="agg"

CUDA Toolkit 를 찾지 못하는 오류가 발생하여 아래와 같은 환경변수를 설정하였다.

export NUMBAPRO_NVVM=/usr/local/cuda-8.0/nvvm/lib64/libnvvm.so
export NUMBAPRO_LIBDEVICE=/usr/local/cuda-8.0/nvvm/libdevice/

종합적으로 아래와 같은 환경변수를 설정하였다.

export MPLBACKEND="agg"
export NUMBAPRO_NVVM=/usr/local/cuda-8.0/nvvm/lib64/libnvvm.so
export NUMBAPRO_LIBDEVICE=/usr/local/cuda-8.0/nvvm/libdevice/

저작자표시 비영리 변경금지 (새창열림)

dogbull(犬牛)

numba를 이용한 Single-CPU, Multi-CPU, GPU-CUDA 의 box blur 속도 비교

티스토리툴바