OpenCL Vision Image Extension

OpenCL Vision Image Extension 使用的指令集,又叫做 EVIS (Enhanced Vision Instruction Set)指令集,他时一个API Level的变成语言,在一些NPU、GPU、FPGA等一些OpenCL的硬件实现中适合用来处理视觉图像任务,同时它也是一个OPENCL VX Extension API , 它的源文件通常在 /usr/include/CL/cl_viv_vx_ext.h 中 。

一、参考资料

  • OPENCL官方文档:
  • OPENVX官方文档:
  • https://github.com/KhronosGroup/Khronosdotorg/blob/main/api/openvx/resources.md
  • https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/OpenVX-Vision-Image-Extension-API-Introduction-Basic-API/ta-p/1117280
  • Github

本站网盘对一批关于opencl和openvx资料进行整理:

  • https://oss.p-chao.com/book/opengl_openvx_opencl_cuda

二、数据和指令

2.1 数据类型

packed data type,同opencl 的矢量数据类型基本相同,部分类型被直接定义为OpenCL的矢量类型,

同时我们也需要注意部分类型是两个更短的类型拼在一起的,长类型使用过程中需要查看源码。

矢量类型的索引同opencl一样,可以使用 .s[n] .x/.y/.. .hi/.lo 来访问

vxc_charn
vxc_ucharn
vxc_shortn
vxc_ushortn
vxc_intn
vxc_uintn
vxc_floatn
vxc_halfn
我们可以从源文件中查看定义
/* packed char vector 2/4/8/16 */
typedef _viv_char2_packed     vxc_char2;
typedef _viv_char4_packed     vxc_char4;
typedef _viv_char8_packed     vxc_char8;
typedef _viv_char16_packed    vxc_char16;
typedef struct _vxc_char32
{
    vxc_char16  hi;
    vxc_char16  lo;
} vxc_char32;

/* packed uchar vector 2/4/8/16 */
typedef _viv_uchar2_packed    vxc_uchar2;
typedef _viv_uchar4_packed    vxc_uchar4;
typedef _viv_uchar8_packed    vxc_uchar8;
typedef _viv_uchar16_packed   vxc_uchar16;
typedef struct _vxc_uchar32
{
    vxc_uchar16  hi;
    vxc_uchar16  lo;
} vxc_uchar32;

/* packed short vector 2/4/8 */
typedef _viv_short2_packed    vxc_short2;
typedef _viv_short4_packed    vxc_short4;
typedef _viv_short8_packed    vxc_short8;
typedef struct _vxc_short16
{
    vxc_short8  hi;
    vxc_short8  lo;
} vxc_short16;

/* packed ushort vector 2/4/8 */
typedef _viv_ushort2_packed   vxc_ushort2;
typedef _viv_ushort4_packed   vxc_ushort4;
typedef _viv_ushort8_packed   vxc_ushort8;
typedef struct _vxc_ushort16
{
    vxc_ushort8  hi;
    vxc_ushort8  lo;
} vxc_ushort16;

/* int vector 2/4/8/16 */
typedef int      vxc_int;
typedef int2     vxc_int2;
typedef int4     vxc_int4;
typedef int8     vxc_int8;
typedef int16    vxc_int16;

/* uint vector 2/4/8/16 */
typedef uint     vxc_uint;
typedef uint2    vxc_uint2;
typedef uint4    vxc_uint4;
typedef uint8    vxc_uint8;
typedef uint16   vxc_uint16;

/* float vector 2/4/8/16 */
typedef float     vxc_float;
typedef float2    vxc_float2;
typedef float4    vxc_float4;
typedef float8    vxc_float8;
typedef float16   vxc_float16;

/* half (float16) vector 2/4/8/16 */
typedef half                  vxc_half;
typedef _viv_half2_packed     vxc_half2;
typedef _viv_half4_packed     vxc_half4;
typedef _viv_half8_packed     vxc_half8;
typedef struct _vxc_half16
{
    vxc_half8  hi;
    vxc_half8  lo;
} vxc_half16;

typedef uint16 vxc_512bits;
typedef uint4  vxc_128bits;

typedef vxc_512bits VXC_512Bits;
typedef vxc_128bits VXC_128Bits;
typedef vxc_modifier VXC_Modifier_t ;
typedef vxc_round_mode VXC_RoundMode;
typedef vxc_filter_mode VXC_FilterMode;

2.2 _viv_uniform 关键字

常量通常在opencl运行阶段吃实话,_viv_uniform 关键字用于在加载阶段初始化内核常量, 比如:

_viv_uniform vxc_512bits u512;

2.3 ASM指令

packed data type 一般不能使用 build-in-function了。如果要使用常用的build-in-function 需要把它转化为 unpacked data。或者使用内联汇编进行操作。

_viv_asm 是内联汇编调用关键字,一般指令格式是 _viv_asm(OP_CODE, dest, src0, src1), 在有些指令中, src1需要是立即数。

ABS绝对值
ADD求和vxc_uchar16 a, b, c
_viv_asm(ADD, a, b, c); // c = a + b
ADD_SAT带饱和的整形求和
AND_BITWISE位与
BIT_REVERSQL整形按位翻转 (reversal)
BITEXTRACT从src到dest的位提取
BITINSERT位替换 bit replacement
BITSEL位选择 bitwise select
CLAMP0MAX截断到 0 到 max
CMPcompare
CONVconvert
COPY复制src1 表示要copy的字节数
DIVdivide
FINDLSB找最小有效位
FINDMSB找最大有效位
LEADZERO检测leading zero
LSHIFT左移
MADSAT带饱和的整形乘加
MODModulus
MULMultiply
MULHI
MULSAT带饱和的整形Multiply
NEG负数
NOT_BITWISE
OR_BITWISE
POPCOUNTpopulation Count
ROTATE旋转(大小位颠倒)
RSHIFT右移
SUB
SUBSAT带饱和的整形减法
XOR_BITWISE按位异或

2.4 OP Code指令

VXC_OP 枚举中定义了操作符,需要注意这些操作不一定都有硬件实现。

关于DP指令(Dot Product),博主实验下来执行的是逐个元素相乘,而非向量内积,不是特别make sense,不是到是不是博主用的硬件实现的差别。opencl本身还有 dot 运算符…

ABS_DIFFAbsolute difference between two values
IADDAdds two or three integer values
IACC_SQSquares a value and adds it to an accumulator
LERPLinear interpolation between two values
FILTERPerforms a filter on a 3×3 block
MAG_PHASEComputes magnitude and phase of 2 packed data values
MUL_SHIFTMultiples two 8-or 16-bit integers and shifts
DP16x11 Dot Product from 2 16 component values
DP8x22 Dot Products from 2 8 component values
DP4x44 Dot Products from 2 4 component values
DP2x88 Dot Products from 2 2 component values
CLAMPClamps up to 16 values to a max or min value
BI_LINEARComputes a bi0linear interpolation of 4 pixel values
SELECT_ADDAdds a pixel value or increments a counter inside bins
ATOMIC_ADDAdds a valid atomically to an address
BIT_EXTRACTExtracts up to 8 bitfields from a packed stream
BIT_REPLACEReplaces up to 8 bitfields from a packed stream
DP32x11 Dot Product from 2 32 component values
DP16x22 Dot Products from 2 16 component values
DP8x44 Dot Products from 2 8 component values
DP4x88 Dot Products from 2 4 component values
DP2x1616 Dot Products from 2 2 component values

还有一些其它指令,参照头文件枚举定义:
enum VXC_OP {
    VXC_OP_abs_diff = 3, /* it must be the same value as VIR_IK_abs_diff */
    VXC_OP_iadd,
    VXC_OP_iacc_sq,
    VXC_OP_lerp,
    VXC_OP_filter,
    VXC_OP_mag_phase,
    VXC_OP_mul_shift,
    VXC_OP_dp16x1,
    VXC_OP_dp8x2,
    VXC_OP_dp4x4,
    VXC_OP_dp2x8,
    VXC_OP_clamp,
    VXC_OP_bi_linear,
    VXC_OP_select_add,
    VXC_OP_atomic_add,
    VXC_OP_bit_extract,
    VXC_OP_bit_replace,
    VXC_OP_dp32x1,
    VXC_OP_dp16x2,
    VXC_OP_dp8x4,
    VXC_OP_dp4x8,
    VXC_OP_dp2x16,
    VXC_OP_dp32x1_b,
    VXC_OP_dp16x2_b,
    VXC_OP_dp8x4_b,
    VXC_OP_dp4x8_b,
    VXC_OP_dp2x16_b,
    VXC_OP_img_load,
    VXC_OP_img_load_3d,
    VXC_OP_img_store,
    VXC_OP_img_store_3d,
    VXC_OP_vload2,
    VXC_OP_vload3,
    VXC_OP_vload4,
    VXC_OP_vload8,
    VXC_OP_vload16,
    VXC_OP_vstore2,
    VXC_OP_vstore3,
    VXC_OP_vstore4,
    VXC_OP_vstore8,
    VXC_OP_vstore16,
    VXC_OP_index_add,
    VXC_OP_vert_min3,
    VXC_OP_vert_max3,
    VXC_OP_vert_med3,
    VXC_OP_horz_min3,
    VXC_OP_horz_max3,
    VXC_OP_horz_med3,
    VXC_OP_error,
    OP_bit_extract,
    VXC_OP_dp16x1_b,
    VXC_OP_dp8x2_b,
    VXC_OP_dp4x4_b,
    VXC_OP_dp2x8_b,
    VXC_OP_gather,
    VXC_OP_gather_b,
    VXC_OP_scatter,
    VXC_OP_scatter_b,
    VXC_OP_atomic_s,
    VXC_OP_atomic_s_b,
};

我们有时会通过VXC_OPn(OP, dest, src …) (VXC_OP1, VXC_OP2, VXC_OP3 …) 来调用上述操作,实际上VXC_OPn中的n就表示操作数个数,我们可以直接从头文件中看到它们是如何展开的:

#define VXC_OP1(Op, Dest, Src0)   _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, Src0)

#define VXC_OP2(Op, Dest, Src0, Src1)                  \
    do {                                               \
        int _t1;                                       \
        _viv_asm(PARAM_CHAIN, _t1, Src0, Src1);         \
        _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, _t1);    \
    } while(0)

#define VXC_OP3(Op, Dest, Src0, Src1, Src2)            \
    do {                                               \
        int _t1, _t2;                                  \
        _viv_asm(PARAM_CHAIN, _t1, Src0, Src1);        \
        _viv_asm(PARAM_CHAIN, _t2, _t1, Src2);         \
        _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, _t2);   \
    } while(0)

关于Filter,可以支持3×3窗口上的滤波操作:

VXC_FM_BOXCompute a 3×3 box filter:
|1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9|.
VXC_FM_GuassianCompute a 3×3 Gaussian filter:
|1/16, 2/16, 1/16, 2/16, 4/16, 2/16, 1/16, 2/16, 1/16|.
VXC_FM_SobelXCompute a 3×3 Sobel filter in the x-direction:
|-1, 0, 1, -2, 0, 2, -1, 0, 1|.
VXC_FM_SobelYCompute a 3×3 Sobel filter in the y-direction:
|-1, -2, -1, 0, 0, 0, 1, 2, 1|.
VXC_FM_ScharrXCompute a 3×3 Scharr filter in the x-direction:
|3, 0, -3, 10, 0, -10, 3, 0, -3|.
VXC_FM_ScharrYCompute a 3×3 Scharr filter in the y-direction:
|3, 10, 3, 0, 0, 0, -3, -10, -3|.
VXC_FM_MaxGet the maximum from a 3×3 kernel.
VXC_FM_MinGet the minimum from a 3×3 kernel.
VXC_FM_MedianGet the median from a 3×3 kernel.

三、一些例子

3.1 读取和写出数据

可以使用img_load 来load数据,写出数据接口和load数据接口参数类似。

坐标只能是 intn 或者 floatn类型。

关于VXC_MODIFIER

VXC_MODIFIER(StartBin, EndBin, SourceBin, RoundingMode, Clamp)

我们在从图像中取数据的时候常用这个宏来产生指令,它的参数含义如下:

  • StartBin/EndBin 取数据的开始和结束位置,总工会取 End-Start个数据出来
  • Source Bin 可以使用 VXC_5BITOFFSET_XY指示便宜,如果没有使用则填0即可
  • RoundingMode    表示 取数据时取整的方式 0 VXC_RM_TowardZero 是向0取整, 1 是向上取整,2是向最近奇数取整
  • Clamp: 0表示结果truncate到合适的类型, 1表示结果被clamp到合适的结果类型

例子:从 in_image 中 取 16 个 uchar类型的数据放到r1中

int2 coord = (int2)(get_global_id(0), get_global_id(1));

vxc_uchar16 r1;

VXC_OP4(img_load, r1, in_image, coord, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));

再有的读取数据的接口中可以使用VXC_5BITOFFSET_XY(offsetX, offsetY)指定取数据的偏移,其中 offsetX 和 offsetY 值范围是-16~15,例子:

VXC_ReadImage(v0, input, coord, VXC_5BITOFFSET_XY(0, 0), 
                            VXC_MODIFIER(0, 7, 0, VXC_RM_TowardZero, 0));

图像读写的接口和通道顺序的支持

// ---Read_Imagef,i,ui  
/* OCL image builtins can be used in VX kernel */  
float4 read_imagef (image2d_t image, int2 coord);  
int4 read_imagei (image2d_t image, int2 coord);  
uint4 read_imageui (image2d_t image, int2 coord);  
float4 read_imagef (image1d_t image, int coord);  
int4 read_imagei (image1d_t image, int coord);  
uint4 read_imageui (image1d_t image, int coord);  
float4 read_imagef (image1d_array_t image, int2 coord);  
int4 read_imagei (image1d_array_t image, int2 coord);  
uint4 read_imageui (image1d_array_t image, int2 coord); 

// ---Write_Imagef,i,ui  
void write_imagef (image2d_t image, int2 coord, float4 color);  
void write_imagei (image2d_t image, int2 coord, int4 color);  
void write_imageui (image2d_t image, int2 coord, uint4 color);  
void write_imagef (image1d_t image, int coord, float4 color);  
void write_imagei (image1d_t image, int coord, int4 color);  
void write_imageui (image1d_t image, int coord, uint4 color);  
void write_imagef (image1d_array_t image, int2 coord, float4 color);  
void write_imagei (image1d_array_t image, int2 coord, int4 color);  
void write_imageui (image1d_array_t image, int2 coord, uint4 color) 

// ---Query Image Dimensions 
int2 get_image_dim (image2d_t image);  
size_t get_image_array_size(image1d_array_t image);  
/* Built-in Image Query Functions */  
int get_image_width (image1d_t image);  
int get_image_width (image2d_t image);  
int get_image_width (image1d_array_t image);  
int get_image_height (image2d_t image); 

// ---Channel Data Types Supported 
/* Return the channel data type. Valid values are:  
* CLK_SNORM_INT8  
* CLK_SNORM_INT16  
* CLK_UNORM_INT8  
* CLK_UNORM_INT16  
* CLK_UNORM_SHORT_565  
* CLK_UNORM_SHORT_555  
* CLK_UNORM_SHORT_101010  
* CLK_SIGNED_INT8  
* CLK_SIGNED_INT16  
* CLK_SIGNED_INT32  
* CLK_UNSIGNED_INT8  
* CLK_UNSIGNED_INT16  
* CLK_UNSIGNED_INT32  
* CLK_HALF_FLOAT  
* CLK_FLOAT  
*/  
int get_image_channel_data_type (image1d_t image);  
int get_image_channel_data_type (image2d_t image);  
int get_image_channel_data_type (image1d_array_t image); 
// ---Image Channel Orders Supported 
/* Return the image channel order. Valid values are:  
NXP Semiconductors 
* CLK_A  
* CLK_R  
* CLK_Rx  
* CLK_RG  
* CLK_RGx  
* CLK_RA  
* CLK_RGB  
* CLK_RGBx  
* CLK_RGBA  
* CLK_ARGB  
* CLK_BGRA  
* CLK_INTENSITY  
* CLK_LUMINANCE  
*/  
int get_image_channel_order (image1d_t image);  
int get_image_channel_order (image2d_t image);  
int get_image_channel_order (image1d_array_t image);

// ---Image Channel Orders Supported 
/* Return the image channel order. Valid values are:  
* CLK_A  
* CLK_R  
* CLK_Rx  
* CLK_RG  
* CLK_RGx  
* CLK_RA  
* CLK_RGB  
* CLK_RGBx  
* CLK_RGBA  
* CLK_ARGB  
* CLK_BGRA  
* CLK_INTENSITY  
* CLK_LUMINANCE  
*/  
int get_image_channel_order (image1d_t image);  
int get_image_channel_order (image2d_t image);  
int get_image_channel_order (image1d_array_t image);

3.2 高斯滤波的例子

这里有一个高斯滤波的例子:https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/How-to-use-OpenVX-extension-for-NPU-GPU-to-accelerate-machine/ta-p/1113429

这里博主贴一下kernel部分的代码:

__kernel void gaussian
(
__read_only image2d_t in_image,
__write_only image2d_t out_image
)
{
int2 coord = (int2)(get_global_id(0), get_global_id(1));
int2 coord_out = coord;
vxc_uchar16 lineA, lineB, lineC, out;
int2 coord_in1 = coord + (int2)(-1, -1);
VXC_OP4(img_load, lineA, in_image, coord_in1, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
int2 coord_in2 = coord + (int2)(-1, 0);
VXC_OP4(img_load, lineB, in_image, coord_in2, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
int2 coord_in3 = coord + (int2)(-1, 1);
VXC_OP4(img_load, lineC, in_image, coord_in3, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
int info = VXC_MODIFIER_FILTER(0, 13, 0, VXC_FM_Guassian, 0);
VXC_OP4(filter, out, lineA, lineB, lineC, info); 
VXC_OP4_NoDest(img_store, out_image, coord_out, out, VXC_MODIFIER(0, 13, 0, VXC_RM_TowardZero, 0));
}

3.3 Half处理的例子

显示转换:需要在half赋值前用类型符修饰,而不能直接赋值

half a = (half)1000;

重解释:我们读取数据为short,再重新解释为half,但是通常较老的编译器是不带as_half选项的,于是我们通过内联汇编指令COPY来处理

vxc_short8 val
vxc_half8 val_fp16;
_viv_asm(COPY, val_fp16, val, sizeof(val));

3.4 其它开源实现

  1. TIM-VX: https://github.com/VeriSilicon/TIM-VX
    • TIM-VX 是一个神经网络加速库,其中可以找到大量vx kernel代码参考
  2. AML-NPU-SDK: aml-npu-sdk/nnvxc_kernels at 63b337b73f6143aa58d0ba1c57d0907ac9b0fb39 · leemgs/aml-npu-sdk (github.com)
    • 也是一个npu sdk 开源库,有大量nn相关的 opevx kernel nnvxc kernel实现
  3. 有关stereo和softisp几个CV Function的vx实现:gtec-demo-framework/DemoApps/OpenVX/Stereo/Content/CTF_kernels_vxc.vx at master · nxp-imx/gtec-demo-framework (github.com)

其它文档相关的资源整理在文章开头,希望能有帮助。

发表评论