OpenCL Vision Image Extension 使用的指令集,又叫做 EVIS (Enhanced Vision Instruction Set)指令集,他时一个API Level的变成语言,在一些NPU、GPU、FPGA等一些OpenCL的硬件实现中适合用来处理视觉图像任务,同时它也是一个OPENCL VX Extension API , 它的源文件通常在 /usr/include/CL/cl_viv_vx_ext.h 中 。
一、参考资料
- OPENCL官方文档:
- OPENVX官方文档:
- https://github.com/KhronosGroup/Khronosdotorg/blob/main/api/openvx/resources.md
- https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/OpenVX-Vision-Image-Extension-API-Introduction-Basic-API/ta-p/1117280
- Github
本站网盘对一批关于opencl和openvx资料进行整理:
- https://oss.p-chao.com/book/opengl_openvx_opencl_cuda

二、数据和指令
2.1 数据类型
packed data type,同opencl 的矢量数据类型基本相同,部分类型被直接定义为OpenCL的矢量类型,
同时我们也需要注意部分类型是两个更短的类型拼在一起的,长类型使用过程中需要查看源码。
矢量类型的索引同opencl一样,可以使用 .s[n] .x/.y/.. .hi/.lo 来访问
| vxc_charn | |
| vxc_ucharn | |
| vxc_shortn | |
| vxc_ushortn | |
| vxc_intn | |
| vxc_uintn | |
| vxc_floatn | |
| vxc_halfn |
/* packed char vector 2/4/8/16 */
typedef _viv_char2_packed vxc_char2;
typedef _viv_char4_packed vxc_char4;
typedef _viv_char8_packed vxc_char8;
typedef _viv_char16_packed vxc_char16;
typedef struct _vxc_char32
{
vxc_char16 hi;
vxc_char16 lo;
} vxc_char32;
/* packed uchar vector 2/4/8/16 */
typedef _viv_uchar2_packed vxc_uchar2;
typedef _viv_uchar4_packed vxc_uchar4;
typedef _viv_uchar8_packed vxc_uchar8;
typedef _viv_uchar16_packed vxc_uchar16;
typedef struct _vxc_uchar32
{
vxc_uchar16 hi;
vxc_uchar16 lo;
} vxc_uchar32;
/* packed short vector 2/4/8 */
typedef _viv_short2_packed vxc_short2;
typedef _viv_short4_packed vxc_short4;
typedef _viv_short8_packed vxc_short8;
typedef struct _vxc_short16
{
vxc_short8 hi;
vxc_short8 lo;
} vxc_short16;
/* packed ushort vector 2/4/8 */
typedef _viv_ushort2_packed vxc_ushort2;
typedef _viv_ushort4_packed vxc_ushort4;
typedef _viv_ushort8_packed vxc_ushort8;
typedef struct _vxc_ushort16
{
vxc_ushort8 hi;
vxc_ushort8 lo;
} vxc_ushort16;
/* int vector 2/4/8/16 */
typedef int vxc_int;
typedef int2 vxc_int2;
typedef int4 vxc_int4;
typedef int8 vxc_int8;
typedef int16 vxc_int16;
/* uint vector 2/4/8/16 */
typedef uint vxc_uint;
typedef uint2 vxc_uint2;
typedef uint4 vxc_uint4;
typedef uint8 vxc_uint8;
typedef uint16 vxc_uint16;
/* float vector 2/4/8/16 */
typedef float vxc_float;
typedef float2 vxc_float2;
typedef float4 vxc_float4;
typedef float8 vxc_float8;
typedef float16 vxc_float16;
/* half (float16) vector 2/4/8/16 */
typedef half vxc_half;
typedef _viv_half2_packed vxc_half2;
typedef _viv_half4_packed vxc_half4;
typedef _viv_half8_packed vxc_half8;
typedef struct _vxc_half16
{
vxc_half8 hi;
vxc_half8 lo;
} vxc_half16;
typedef uint16 vxc_512bits;
typedef uint4 vxc_128bits;
typedef vxc_512bits VXC_512Bits;
typedef vxc_128bits VXC_128Bits;
typedef vxc_modifier VXC_Modifier_t ;
typedef vxc_round_mode VXC_RoundMode;
typedef vxc_filter_mode VXC_FilterMode;
2.2 _viv_uniform 关键字
常量通常在opencl运行阶段吃实话,_viv_uniform 关键字用于在加载阶段初始化内核常量, 比如:
_viv_uniform vxc_512bits u512;
2.3 ASM指令
packed data type 一般不能使用 build-in-function了。如果要使用常用的build-in-function 需要把它转化为 unpacked data。或者使用内联汇编进行操作。
_viv_asm 是内联汇编调用关键字,一般指令格式是 _viv_asm(OP_CODE, dest, src0, src1), 在有些指令中, src1需要是立即数。
| ABS | 绝对值 | |
| ADD | 求和 | vxc_uchar16 a, b, c _viv_asm(ADD, a, b, c); // c = a + b |
| ADD_SAT | 带饱和的整形求和 | |
| AND_BITWISE | 位与 | |
| BIT_REVERSQL | 整形按位翻转 (reversal) | |
| BITEXTRACT | 从src到dest的位提取 | |
| BITINSERT | 位替换 bit replacement | |
| BITSEL | 位选择 bitwise select | |
| CLAMP0MAX | 截断到 0 到 max | |
| CMP | compare | |
| CONV | convert | |
| COPY | 复制 | src1 表示要copy的字节数 |
| DIV | divide | |
| FINDLSB | 找最小有效位 | |
| FINDMSB | 找最大有效位 | |
| LEADZERO | 检测leading zero | |
| LSHIFT | 左移 | |
| MADSAT | 带饱和的整形乘加 | |
| MOD | Modulus | |
| MUL | Multiply | |
| MULHI | ||
| MULSAT | 带饱和的整形Multiply | |
| NEG | 负数 | |
| NOT_BITWISE | ||
| OR_BITWISE | ||
| POPCOUNT | population Count | |
| ROTATE | 旋转(大小位颠倒) | |
| RSHIFT | 右移 | |
| SUB | 减 | |
| SUBSAT | 带饱和的整形减法 | |
| XOR_BITWISE | 按位异或 | |
2.4 OP Code指令
VXC_OP 枚举中定义了操作符,需要注意这些操作不一定都有硬件实现。
关于DP指令(Dot Product),博主实验下来执行的是逐个元素相乘,而非向量内积,不是特别make sense,不是到是不是博主用的硬件实现的差别。opencl本身还有 dot 运算符…
| ABS_DIFF | Absolute difference between two values |
| IADD | Adds two or three integer values |
| IACC_SQ | Squares a value and adds it to an accumulator |
| LERP | Linear interpolation between two values |
| FILTER | Performs a filter on a 3×3 block |
| MAG_PHASE | Computes magnitude and phase of 2 packed data values |
| MUL_SHIFT | Multiples two 8-or 16-bit integers and shifts |
| DP16x1 | 1 Dot Product from 2 16 component values |
| DP8x2 | 2 Dot Products from 2 8 component values |
| DP4x4 | 4 Dot Products from 2 4 component values |
| DP2x8 | 8 Dot Products from 2 2 component values |
| CLAMP | Clamps up to 16 values to a max or min value |
| BI_LINEAR | Computes a bi0linear interpolation of 4 pixel values |
| SELECT_ADD | Adds a pixel value or increments a counter inside bins |
| ATOMIC_ADD | Adds a valid atomically to an address |
| BIT_EXTRACT | Extracts up to 8 bitfields from a packed stream |
| BIT_REPLACE | Replaces up to 8 bitfields from a packed stream |
| DP32x1 | 1 Dot Product from 2 32 component values |
| DP16x2 | 2 Dot Products from 2 16 component values |
| DP8x4 | 4 Dot Products from 2 8 component values |
| DP4x8 | 8 Dot Products from 2 4 component values |
| DP2x16 | 16 Dot Products from 2 2 component values |
还有一些其它指令,参照头文件枚举定义:
enum VXC_OP {
VXC_OP_abs_diff = 3, /* it must be the same value as VIR_IK_abs_diff */
VXC_OP_iadd,
VXC_OP_iacc_sq,
VXC_OP_lerp,
VXC_OP_filter,
VXC_OP_mag_phase,
VXC_OP_mul_shift,
VXC_OP_dp16x1,
VXC_OP_dp8x2,
VXC_OP_dp4x4,
VXC_OP_dp2x8,
VXC_OP_clamp,
VXC_OP_bi_linear,
VXC_OP_select_add,
VXC_OP_atomic_add,
VXC_OP_bit_extract,
VXC_OP_bit_replace,
VXC_OP_dp32x1,
VXC_OP_dp16x2,
VXC_OP_dp8x4,
VXC_OP_dp4x8,
VXC_OP_dp2x16,
VXC_OP_dp32x1_b,
VXC_OP_dp16x2_b,
VXC_OP_dp8x4_b,
VXC_OP_dp4x8_b,
VXC_OP_dp2x16_b,
VXC_OP_img_load,
VXC_OP_img_load_3d,
VXC_OP_img_store,
VXC_OP_img_store_3d,
VXC_OP_vload2,
VXC_OP_vload3,
VXC_OP_vload4,
VXC_OP_vload8,
VXC_OP_vload16,
VXC_OP_vstore2,
VXC_OP_vstore3,
VXC_OP_vstore4,
VXC_OP_vstore8,
VXC_OP_vstore16,
VXC_OP_index_add,
VXC_OP_vert_min3,
VXC_OP_vert_max3,
VXC_OP_vert_med3,
VXC_OP_horz_min3,
VXC_OP_horz_max3,
VXC_OP_horz_med3,
VXC_OP_error,
OP_bit_extract,
VXC_OP_dp16x1_b,
VXC_OP_dp8x2_b,
VXC_OP_dp4x4_b,
VXC_OP_dp2x8_b,
VXC_OP_gather,
VXC_OP_gather_b,
VXC_OP_scatter,
VXC_OP_scatter_b,
VXC_OP_atomic_s,
VXC_OP_atomic_s_b,
};
我们有时会通过VXC_OPn(OP, dest, src …) (VXC_OP1, VXC_OP2, VXC_OP3 …) 来调用上述操作,实际上VXC_OPn中的n就表示操作数个数,我们可以直接从头文件中看到它们是如何展开的:
#define VXC_OP1(Op, Dest, Src0) _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, Src0)
#define VXC_OP2(Op, Dest, Src0, Src1) \
do { \
int _t1; \
_viv_asm(PARAM_CHAIN, _t1, Src0, Src1); \
_viv_asm(INTRINSIC, Dest, VXC_OP_##Op, _t1); \
} while(0)
#define VXC_OP3(Op, Dest, Src0, Src1, Src2) \
do { \
int _t1, _t2; \
_viv_asm(PARAM_CHAIN, _t1, Src0, Src1); \
_viv_asm(PARAM_CHAIN, _t2, _t1, Src2); \
_viv_asm(INTRINSIC, Dest, VXC_OP_##Op, _t2); \
} while(0)
关于Filter,可以支持3×3窗口上的滤波操作:
| VXC_FM_BOX | Compute a 3×3 box filter: |1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9|. |
| VXC_FM_Guassian | Compute a 3×3 Gaussian filter: |1/16, 2/16, 1/16, 2/16, 4/16, 2/16, 1/16, 2/16, 1/16|. |
| VXC_FM_SobelX | Compute a 3×3 Sobel filter in the x-direction: |-1, 0, 1, -2, 0, 2, -1, 0, 1|. |
| VXC_FM_SobelY | Compute a 3×3 Sobel filter in the y-direction: |-1, -2, -1, 0, 0, 0, 1, 2, 1|. |
| VXC_FM_ScharrX | Compute a 3×3 Scharr filter in the x-direction: |3, 0, -3, 10, 0, -10, 3, 0, -3|. |
| VXC_FM_ScharrY | Compute a 3×3 Scharr filter in the y-direction: |3, 10, 3, 0, 0, 0, -3, -10, -3|. |
| VXC_FM_Max | Get the maximum from a 3×3 kernel. |
| VXC_FM_Min | Get the minimum from a 3×3 kernel. |
| VXC_FM_Median | Get the median from a 3×3 kernel. |
三、一些例子
3.1 读取和写出数据
可以使用img_load 来load数据,写出数据接口和load数据接口参数类似。
坐标只能是 intn 或者 floatn类型。
关于VXC_MODIFIER
VXC_MODIFIER(StartBin, EndBin, SourceBin, RoundingMode, Clamp)
我们在从图像中取数据的时候常用这个宏来产生指令,它的参数含义如下:
- StartBin/EndBin 取数据的开始和结束位置,总工会取 End-Start个数据出来
- Source Bin 可以使用 VXC_5BITOFFSET_XY指示便宜,如果没有使用则填0即可
- RoundingMode 表示 取数据时取整的方式 0 VXC_RM_TowardZero 是向0取整, 1 是向上取整,2是向最近奇数取整
- Clamp: 0表示结果truncate到合适的类型, 1表示结果被clamp到合适的结果类型
例子:从 in_image 中 取 16 个 uchar类型的数据放到r1中
int2 coord = (int2)(get_global_id(0), get_global_id(1)); vxc_uchar16 r1; VXC_OP4(img_load, r1, in_image, coord, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
再有的读取数据的接口中可以使用VXC_5BITOFFSET_XY(offsetX, offsetY)指定取数据的偏移,其中 offsetX 和 offsetY 值范围是-16~15,例子:
VXC_ReadImage(v0, input, coord, VXC_5BITOFFSET_XY(0, 0),
VXC_MODIFIER(0, 7, 0, VXC_RM_TowardZero, 0));
图像读写的接口和通道顺序的支持
// ---Read_Imagef,i,ui /* OCL image builtins can be used in VX kernel */ float4 read_imagef (image2d_t image, int2 coord); int4 read_imagei (image2d_t image, int2 coord); uint4 read_imageui (image2d_t image, int2 coord); float4 read_imagef (image1d_t image, int coord); int4 read_imagei (image1d_t image, int coord); uint4 read_imageui (image1d_t image, int coord); float4 read_imagef (image1d_array_t image, int2 coord); int4 read_imagei (image1d_array_t image, int2 coord); uint4 read_imageui (image1d_array_t image, int2 coord); // ---Write_Imagef,i,ui void write_imagef (image2d_t image, int2 coord, float4 color); void write_imagei (image2d_t image, int2 coord, int4 color); void write_imageui (image2d_t image, int2 coord, uint4 color); void write_imagef (image1d_t image, int coord, float4 color); void write_imagei (image1d_t image, int coord, int4 color); void write_imageui (image1d_t image, int coord, uint4 color); void write_imagef (image1d_array_t image, int2 coord, float4 color); void write_imagei (image1d_array_t image, int2 coord, int4 color); void write_imageui (image1d_array_t image, int2 coord, uint4 color) // ---Query Image Dimensions int2 get_image_dim (image2d_t image); size_t get_image_array_size(image1d_array_t image); /* Built-in Image Query Functions */ int get_image_width (image1d_t image); int get_image_width (image2d_t image); int get_image_width (image1d_array_t image); int get_image_height (image2d_t image); // ---Channel Data Types Supported /* Return the channel data type. Valid values are: * CLK_SNORM_INT8 * CLK_SNORM_INT16 * CLK_UNORM_INT8 * CLK_UNORM_INT16 * CLK_UNORM_SHORT_565 * CLK_UNORM_SHORT_555 * CLK_UNORM_SHORT_101010 * CLK_SIGNED_INT8 * CLK_SIGNED_INT16 * CLK_SIGNED_INT32 * CLK_UNSIGNED_INT8 * CLK_UNSIGNED_INT16 * CLK_UNSIGNED_INT32 * CLK_HALF_FLOAT * CLK_FLOAT */ int get_image_channel_data_type (image1d_t image); int get_image_channel_data_type (image2d_t image); int get_image_channel_data_type (image1d_array_t image); // ---Image Channel Orders Supported /* Return the image channel order. Valid values are: NXP Semiconductors * CLK_A * CLK_R * CLK_Rx * CLK_RG * CLK_RGx * CLK_RA * CLK_RGB * CLK_RGBx * CLK_RGBA * CLK_ARGB * CLK_BGRA * CLK_INTENSITY * CLK_LUMINANCE */ int get_image_channel_order (image1d_t image); int get_image_channel_order (image2d_t image); int get_image_channel_order (image1d_array_t image); // ---Image Channel Orders Supported /* Return the image channel order. Valid values are: * CLK_A * CLK_R * CLK_Rx * CLK_RG * CLK_RGx * CLK_RA * CLK_RGB * CLK_RGBx * CLK_RGBA * CLK_ARGB * CLK_BGRA * CLK_INTENSITY * CLK_LUMINANCE */ int get_image_channel_order (image1d_t image); int get_image_channel_order (image2d_t image); int get_image_channel_order (image1d_array_t image);
3.2 高斯滤波的例子
这里有一个高斯滤波的例子:https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/How-to-use-OpenVX-extension-for-NPU-GPU-to-accelerate-machine/ta-p/1113429
这里博主贴一下kernel部分的代码:
__kernel void gaussian
(
__read_only image2d_t in_image,
__write_only image2d_t out_image
)
{
int2 coord = (int2)(get_global_id(0), get_global_id(1));
int2 coord_out = coord;
vxc_uchar16 lineA, lineB, lineC, out;
int2 coord_in1 = coord + (int2)(-1, -1);
VXC_OP4(img_load, lineA, in_image, coord_in1, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
int2 coord_in2 = coord + (int2)(-1, 0);
VXC_OP4(img_load, lineB, in_image, coord_in2, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
int2 coord_in3 = coord + (int2)(-1, 1);
VXC_OP4(img_load, lineC, in_image, coord_in3, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
int info = VXC_MODIFIER_FILTER(0, 13, 0, VXC_FM_Guassian, 0);
VXC_OP4(filter, out, lineA, lineB, lineC, info);
VXC_OP4_NoDest(img_store, out_image, coord_out, out, VXC_MODIFIER(0, 13, 0, VXC_RM_TowardZero, 0));
}
3.3 Half处理的例子
显示转换:需要在half赋值前用类型符修饰,而不能直接赋值
half a = (half)1000;
重解释:我们读取数据为short,再重新解释为half,但是通常较老的编译器是不带as_half选项的,于是我们通过内联汇编指令COPY来处理
vxc_short8 val vxc_half8 val_fp16; _viv_asm(COPY, val_fp16, val, sizeof(val));
3.4 其它开源实现
- TIM-VX: https://github.com/VeriSilicon/TIM-VX
- TIM-VX 是一个神经网络加速库,其中可以找到大量vx kernel代码参考
- AML-NPU-SDK: aml-npu-sdk/nnvxc_kernels at 63b337b73f6143aa58d0ba1c57d0907ac9b0fb39 · leemgs/aml-npu-sdk (github.com)
- 也是一个npu sdk 开源库,有大量nn相关的 opevx kernel nnvxc kernel实现
- 有关stereo和softisp几个CV Function的vx实现:gtec-demo-framework/DemoApps/OpenVX/Stereo/Content/CTF_kernels_vxc.vx at master · nxp-imx/gtec-demo-framework (github.com)
其它文档相关的资源整理在文章开头,希望能有帮助。