OpenCL Vision Image Extension 使用的指令集,又叫做 EVIS (Enhanced Vision Instruction Set)指令集,他时一个API Level的变成语言,在一些NPU、GPU、FPGA等一些OpenCL的硬件实现中适合用来处理视觉图像任务,同时它也是一个OPENCL VX Extension API , 它的源文件通常在 /usr/include/CL/cl_viv_vx_ext.h 中 。
一、参考资料
- OPENCL官方文档:
- OPENVX官方文档:
- https://github.com/KhronosGroup/Khronosdotorg/blob/main/api/openvx/resources.md
- https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/OpenVX-Vision-Image-Extension-API-Introduction-Basic-API/ta-p/1117280
- Github
本站网盘对一批关于opencl和openvx资料进行整理:
- https://oss.p-chao.com/book/opengl_openvx_opencl_cuda
二、数据和指令
2.1 数据类型
packed data type,同opencl 的矢量数据类型基本相同,部分类型被直接定义为OpenCL的矢量类型,
同时我们也需要注意部分类型是两个更短的类型拼在一起的,长类型使用过程中需要查看源码。
矢量类型的索引同opencl一样,可以使用 .s[n] .x/.y/.. .hi/.lo 来访问
vxc_charn | |
vxc_ucharn | |
vxc_shortn | |
vxc_ushortn | |
vxc_intn | |
vxc_uintn | |
vxc_floatn | |
vxc_halfn |
/* packed char vector 2/4/8/16 */ typedef _viv_char2_packed vxc_char2; typedef _viv_char4_packed vxc_char4; typedef _viv_char8_packed vxc_char8; typedef _viv_char16_packed vxc_char16; typedef struct _vxc_char32 { vxc_char16 hi; vxc_char16 lo; } vxc_char32; /* packed uchar vector 2/4/8/16 */ typedef _viv_uchar2_packed vxc_uchar2; typedef _viv_uchar4_packed vxc_uchar4; typedef _viv_uchar8_packed vxc_uchar8; typedef _viv_uchar16_packed vxc_uchar16; typedef struct _vxc_uchar32 { vxc_uchar16 hi; vxc_uchar16 lo; } vxc_uchar32; /* packed short vector 2/4/8 */ typedef _viv_short2_packed vxc_short2; typedef _viv_short4_packed vxc_short4; typedef _viv_short8_packed vxc_short8; typedef struct _vxc_short16 { vxc_short8 hi; vxc_short8 lo; } vxc_short16; /* packed ushort vector 2/4/8 */ typedef _viv_ushort2_packed vxc_ushort2; typedef _viv_ushort4_packed vxc_ushort4; typedef _viv_ushort8_packed vxc_ushort8; typedef struct _vxc_ushort16 { vxc_ushort8 hi; vxc_ushort8 lo; } vxc_ushort16; /* int vector 2/4/8/16 */ typedef int vxc_int; typedef int2 vxc_int2; typedef int4 vxc_int4; typedef int8 vxc_int8; typedef int16 vxc_int16; /* uint vector 2/4/8/16 */ typedef uint vxc_uint; typedef uint2 vxc_uint2; typedef uint4 vxc_uint4; typedef uint8 vxc_uint8; typedef uint16 vxc_uint16; /* float vector 2/4/8/16 */ typedef float vxc_float; typedef float2 vxc_float2; typedef float4 vxc_float4; typedef float8 vxc_float8; typedef float16 vxc_float16; /* half (float16) vector 2/4/8/16 */ typedef half vxc_half; typedef _viv_half2_packed vxc_half2; typedef _viv_half4_packed vxc_half4; typedef _viv_half8_packed vxc_half8; typedef struct _vxc_half16 { vxc_half8 hi; vxc_half8 lo; } vxc_half16; typedef uint16 vxc_512bits; typedef uint4 vxc_128bits; typedef vxc_512bits VXC_512Bits; typedef vxc_128bits VXC_128Bits; typedef vxc_modifier VXC_Modifier_t ; typedef vxc_round_mode VXC_RoundMode; typedef vxc_filter_mode VXC_FilterMode;
2.2 _viv_uniform 关键字
常量通常在opencl运行阶段吃实话,_viv_uniform 关键字用于在加载阶段初始化内核常量, 比如:
_viv_uniform vxc_512bits u512;
2.3 ASM指令
packed data type 一般不能使用 build-in-function了。如果要使用常用的build-in-function 需要把它转化为 unpacked data。或者使用内联汇编进行操作。
_viv_asm 是内联汇编调用关键字,一般指令格式是 _viv_asm(OP_CODE, dest, src0, src1), 在有些指令中, src1需要是立即数。
ABS | 绝对值 | |
ADD | 求和 | vxc_uchar16 a, b, c _viv_asm(ADD, a, b, c); // c = a + b |
ADD_SAT | 带饱和的整形求和 | |
AND_BITWISE | 位与 | |
BIT_REVERSQL | 整形按位翻转 (reversal) | |
BITEXTRACT | 从src到dest的位提取 | |
BITINSERT | 位替换 bit replacement | |
BITSEL | 位选择 bitwise select | |
CLAMP0MAX | 截断到 0 到 max | |
CMP | compare | |
CONV | convert | |
COPY | 复制 | src1 表示要copy的字节数 |
DIV | divide | |
FINDLSB | 找最小有效位 | |
FINDMSB | 找最大有效位 | |
LEADZERO | 检测leading zero | |
LSHIFT | 左移 | |
MADSAT | 带饱和的整形乘加 | |
MOD | Modulus | |
MUL | Multiply | |
MULHI | ||
MULSAT | 带饱和的整形Multiply | |
NEG | 负数 | |
NOT_BITWISE | ||
OR_BITWISE | ||
POPCOUNT | population Count | |
ROTATE | 旋转(大小位颠倒) | |
RSHIFT | 右移 | |
SUB | 减 | |
SUBSAT | 带饱和的整形减法 | |
XOR_BITWISE | 按位异或 | |
2.4 OP Code指令
VXC_OP 枚举中定义了操作符,需要注意这些操作不一定都有硬件实现。
关于DP指令(Dot Product),博主实验下来执行的是逐个元素相乘,而非向量内积,不是特别make sense,不是到是不是博主用的硬件实现的差别。opencl本身还有 dot 运算符…
ABS_DIFF | Absolute difference between two values |
IADD | Adds two or three integer values |
IACC_SQ | Squares a value and adds it to an accumulator |
LERP | Linear interpolation between two values |
FILTER | Performs a filter on a 3×3 block |
MAG_PHASE | Computes magnitude and phase of 2 packed data values |
MUL_SHIFT | Multiples two 8-or 16-bit integers and shifts |
DP16x1 | 1 Dot Product from 2 16 component values |
DP8x2 | 2 Dot Products from 2 8 component values |
DP4x4 | 4 Dot Products from 2 4 component values |
DP2x8 | 8 Dot Products from 2 2 component values |
CLAMP | Clamps up to 16 values to a max or min value |
BI_LINEAR | Computes a bi0linear interpolation of 4 pixel values |
SELECT_ADD | Adds a pixel value or increments a counter inside bins |
ATOMIC_ADD | Adds a valid atomically to an address |
BIT_EXTRACT | Extracts up to 8 bitfields from a packed stream |
BIT_REPLACE | Replaces up to 8 bitfields from a packed stream |
DP32x1 | 1 Dot Product from 2 32 component values |
DP16x2 | 2 Dot Products from 2 16 component values |
DP8x4 | 4 Dot Products from 2 8 component values |
DP4x8 | 8 Dot Products from 2 4 component values |
DP2x16 | 16 Dot Products from 2 2 component values |
还有一些其它指令,参照头文件枚举定义:
enum VXC_OP { VXC_OP_abs_diff = 3, /* it must be the same value as VIR_IK_abs_diff */ VXC_OP_iadd, VXC_OP_iacc_sq, VXC_OP_lerp, VXC_OP_filter, VXC_OP_mag_phase, VXC_OP_mul_shift, VXC_OP_dp16x1, VXC_OP_dp8x2, VXC_OP_dp4x4, VXC_OP_dp2x8, VXC_OP_clamp, VXC_OP_bi_linear, VXC_OP_select_add, VXC_OP_atomic_add, VXC_OP_bit_extract, VXC_OP_bit_replace, VXC_OP_dp32x1, VXC_OP_dp16x2, VXC_OP_dp8x4, VXC_OP_dp4x8, VXC_OP_dp2x16, VXC_OP_dp32x1_b, VXC_OP_dp16x2_b, VXC_OP_dp8x4_b, VXC_OP_dp4x8_b, VXC_OP_dp2x16_b, VXC_OP_img_load, VXC_OP_img_load_3d, VXC_OP_img_store, VXC_OP_img_store_3d, VXC_OP_vload2, VXC_OP_vload3, VXC_OP_vload4, VXC_OP_vload8, VXC_OP_vload16, VXC_OP_vstore2, VXC_OP_vstore3, VXC_OP_vstore4, VXC_OP_vstore8, VXC_OP_vstore16, VXC_OP_index_add, VXC_OP_vert_min3, VXC_OP_vert_max3, VXC_OP_vert_med3, VXC_OP_horz_min3, VXC_OP_horz_max3, VXC_OP_horz_med3, VXC_OP_error, OP_bit_extract, VXC_OP_dp16x1_b, VXC_OP_dp8x2_b, VXC_OP_dp4x4_b, VXC_OP_dp2x8_b, VXC_OP_gather, VXC_OP_gather_b, VXC_OP_scatter, VXC_OP_scatter_b, VXC_OP_atomic_s, VXC_OP_atomic_s_b, };
我们有时会通过VXC_OPn(OP, dest, src …) (VXC_OP1, VXC_OP2, VXC_OP3 …) 来调用上述操作,实际上VXC_OPn中的n就表示操作数个数,我们可以直接从头文件中看到它们是如何展开的:
#define VXC_OP1(Op, Dest, Src0) _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, Src0) #define VXC_OP2(Op, Dest, Src0, Src1) \ do { \ int _t1; \ _viv_asm(PARAM_CHAIN, _t1, Src0, Src1); \ _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, _t1); \ } while(0) #define VXC_OP3(Op, Dest, Src0, Src1, Src2) \ do { \ int _t1, _t2; \ _viv_asm(PARAM_CHAIN, _t1, Src0, Src1); \ _viv_asm(PARAM_CHAIN, _t2, _t1, Src2); \ _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, _t2); \ } while(0)
关于Filter,可以支持3×3窗口上的滤波操作:
VXC_FM_BOX | Compute a 3×3 box filter: |1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9|. |
VXC_FM_Guassian | Compute a 3×3 Gaussian filter: |1/16, 2/16, 1/16, 2/16, 4/16, 2/16, 1/16, 2/16, 1/16|. |
VXC_FM_SobelX | Compute a 3×3 Sobel filter in the x-direction: |-1, 0, 1, -2, 0, 2, -1, 0, 1|. |
VXC_FM_SobelY | Compute a 3×3 Sobel filter in the y-direction: |-1, -2, -1, 0, 0, 0, 1, 2, 1|. |
VXC_FM_ScharrX | Compute a 3×3 Scharr filter in the x-direction: |3, 0, -3, 10, 0, -10, 3, 0, -3|. |
VXC_FM_ScharrY | Compute a 3×3 Scharr filter in the y-direction: |3, 10, 3, 0, 0, 0, -3, -10, -3|. |
VXC_FM_Max | Get the maximum from a 3×3 kernel. |
VXC_FM_Min | Get the minimum from a 3×3 kernel. |
VXC_FM_Median | Get the median from a 3×3 kernel. |
三、一些例子
3.1 读取和写出数据
可以使用img_load 来load数据,写出数据接口和load数据接口参数类似。
坐标只能是 intn 或者 floatn类型。
关于VXC_MODIFIER
VXC_MODIFIER(StartBin, EndBin, SourceBin, RoundingMode, Clamp)
我们在从图像中取数据的时候常用这个宏来产生指令,它的参数含义如下:
- StartBin/EndBin 取数据的开始和结束位置,总工会取 End-Start个数据出来
- Source Bin 可以使用 VXC_5BITOFFSET_XY指示便宜,如果没有使用则填0即可
- RoundingMode 表示 取数据时取整的方式 0 VXC_RM_TowardZero 是向0取整, 1 是向上取整,2是向最近奇数取整
- Clamp: 0表示结果truncate到合适的类型, 1表示结果被clamp到合适的结果类型
例子:从 in_image 中 取 16 个 uchar类型的数据放到r1中
int2 coord = (int2)(get_global_id(0), get_global_id(1)); vxc_uchar16 r1; VXC_OP4(img_load, r1, in_image, coord, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
再有的读取数据的接口中可以使用VXC_5BITOFFSET_XY(offsetX, offsetY)指定取数据的偏移,其中 offsetX 和 offsetY 值范围是-16~15,例子:
VXC_ReadImage(v0, input, coord, VXC_5BITOFFSET_XY(0, 0), VXC_MODIFIER(0, 7, 0, VXC_RM_TowardZero, 0));
图像读写的接口和通道顺序的支持
// ---Read_Imagef,i,ui /* OCL image builtins can be used in VX kernel */ float4 read_imagef (image2d_t image, int2 coord); int4 read_imagei (image2d_t image, int2 coord); uint4 read_imageui (image2d_t image, int2 coord); float4 read_imagef (image1d_t image, int coord); int4 read_imagei (image1d_t image, int coord); uint4 read_imageui (image1d_t image, int coord); float4 read_imagef (image1d_array_t image, int2 coord); int4 read_imagei (image1d_array_t image, int2 coord); uint4 read_imageui (image1d_array_t image, int2 coord); // ---Write_Imagef,i,ui void write_imagef (image2d_t image, int2 coord, float4 color); void write_imagei (image2d_t image, int2 coord, int4 color); void write_imageui (image2d_t image, int2 coord, uint4 color); void write_imagef (image1d_t image, int coord, float4 color); void write_imagei (image1d_t image, int coord, int4 color); void write_imageui (image1d_t image, int coord, uint4 color); void write_imagef (image1d_array_t image, int2 coord, float4 color); void write_imagei (image1d_array_t image, int2 coord, int4 color); void write_imageui (image1d_array_t image, int2 coord, uint4 color) // ---Query Image Dimensions int2 get_image_dim (image2d_t image); size_t get_image_array_size(image1d_array_t image); /* Built-in Image Query Functions */ int get_image_width (image1d_t image); int get_image_width (image2d_t image); int get_image_width (image1d_array_t image); int get_image_height (image2d_t image); // ---Channel Data Types Supported /* Return the channel data type. Valid values are: * CLK_SNORM_INT8 * CLK_SNORM_INT16 * CLK_UNORM_INT8 * CLK_UNORM_INT16 * CLK_UNORM_SHORT_565 * CLK_UNORM_SHORT_555 * CLK_UNORM_SHORT_101010 * CLK_SIGNED_INT8 * CLK_SIGNED_INT16 * CLK_SIGNED_INT32 * CLK_UNSIGNED_INT8 * CLK_UNSIGNED_INT16 * CLK_UNSIGNED_INT32 * CLK_HALF_FLOAT * CLK_FLOAT */ int get_image_channel_data_type (image1d_t image); int get_image_channel_data_type (image2d_t image); int get_image_channel_data_type (image1d_array_t image); // ---Image Channel Orders Supported /* Return the image channel order. Valid values are: NXP Semiconductors * CLK_A * CLK_R * CLK_Rx * CLK_RG * CLK_RGx * CLK_RA * CLK_RGB * CLK_RGBx * CLK_RGBA * CLK_ARGB * CLK_BGRA * CLK_INTENSITY * CLK_LUMINANCE */ int get_image_channel_order (image1d_t image); int get_image_channel_order (image2d_t image); int get_image_channel_order (image1d_array_t image); // ---Image Channel Orders Supported /* Return the image channel order. Valid values are: * CLK_A * CLK_R * CLK_Rx * CLK_RG * CLK_RGx * CLK_RA * CLK_RGB * CLK_RGBx * CLK_RGBA * CLK_ARGB * CLK_BGRA * CLK_INTENSITY * CLK_LUMINANCE */ int get_image_channel_order (image1d_t image); int get_image_channel_order (image2d_t image); int get_image_channel_order (image1d_array_t image);
3.2 高斯滤波的例子
这里有一个高斯滤波的例子:https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/How-to-use-OpenVX-extension-for-NPU-GPU-to-accelerate-machine/ta-p/1113429
这里博主贴一下kernel部分的代码:
__kernel void gaussian ( __read_only image2d_t in_image, __write_only image2d_t out_image ) { int2 coord = (int2)(get_global_id(0), get_global_id(1)); int2 coord_out = coord; vxc_uchar16 lineA, lineB, lineC, out; int2 coord_in1 = coord + (int2)(-1, -1); VXC_OP4(img_load, lineA, in_image, coord_in1, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0)); int2 coord_in2 = coord + (int2)(-1, 0); VXC_OP4(img_load, lineB, in_image, coord_in2, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0)); int2 coord_in3 = coord + (int2)(-1, 1); VXC_OP4(img_load, lineC, in_image, coord_in3, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0)); int info = VXC_MODIFIER_FILTER(0, 13, 0, VXC_FM_Guassian, 0); VXC_OP4(filter, out, lineA, lineB, lineC, info); VXC_OP4_NoDest(img_store, out_image, coord_out, out, VXC_MODIFIER(0, 13, 0, VXC_RM_TowardZero, 0)); }
3.3 Half处理的例子
显示转换:需要在half赋值前用类型符修饰,而不能直接赋值
half a = (half)1000;
重解释:我们读取数据为short,再重新解释为half,但是通常较老的编译器是不带as_half选项的,于是我们通过内联汇编指令COPY来处理
vxc_short8 val vxc_half8 val_fp16; _viv_asm(COPY, val_fp16, val, sizeof(val));
3.4 其它开源实现
- TIM-VX: https://github.com/VeriSilicon/TIM-VX
- TIM-VX 是一个神经网络加速库,其中可以找到大量vx kernel代码参考
- AML-NPU-SDK: aml-npu-sdk/nnvxc_kernels at 63b337b73f6143aa58d0ba1c57d0907ac9b0fb39 · leemgs/aml-npu-sdk (github.com)
- 也是一个npu sdk 开源库,有大量nn相关的 opevx kernel nnvxc kernel实现
- 有关stereo和softisp几个CV Function的vx实现:gtec-demo-framework/DemoApps/OpenVX/Stereo/Content/CTF_kernels_vxc.vx at master · nxp-imx/gtec-demo-framework (github.com)
其它文档相关的资源整理在文章开头,希望能有帮助。