OpenCL Vision Image Extension

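The instruction set used by the OpenCL Vision Image Extension is also called EVIS (Enhanced Vision Instruction Set). It is an API-level programming language that is well suited to vision and image-processing tasks on OpenCL hardware implementations such as some NPUs, GPUs, and FPGAs. It is also an OpenCL VX extension API, and its header file is usually found at /usr/include/CL/cl_viv_vx_ext.h.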

1. References

Official OpenCL documentation:
Official OpenVX documentation:
https://github.com/KhronosGroup/Khronosdotorg/blob/main/api/openvx/resources.md
https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/OpenVX-Vision-Image-Extension-API-Introduction-Basic-API/ta-p/1117280
Github

A collection of OpenCL and OpenVX materials is organized on this site's file share:

https://oss.p-chao.com/book/opengl_openvx_opencl_cuda

2. Data and Instructions

2.1 Data Types

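Packed data types are largely the same as OpenCL vector data types, and some of them are defined directly as OpenCL vector types.

Note, however, that some types are two shorter types joined together; check the header source before using these long types.

Vector components are accessed as in OpenCL, with .s[n], .x/.y/..., or .hi/.lo.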

vxc_charn
vxc_ucharn
vxc_shortn
vxc_ushortn
vxc_intn
vxc_uintn
vxc_floatn
vxc_halfn

The definitions can be read from the header file:

/* packed char vector 2/4/8/16 */
typedef _viv_char2_packed     vxc_char2;
typedef _viv_char4_packed     vxc_char4;
typedef _viv_char8_packed     vxc_char8;
typedef _viv_char16_packed    vxc_char16;
typedef struct _vxc_char32
{
    vxc_char16  hi;
    vxc_char16  lo;
} vxc_char32;

/* packed uchar vector 2/4/8/16 */
typedef _viv_uchar2_packed    vxc_uchar2;
typedef _viv_uchar4_packed    vxc_uchar4;
typedef _viv_uchar8_packed    vxc_uchar8;
typedef _viv_uchar16_packed   vxc_uchar16;
typedef struct _vxc_uchar32
{
    vxc_uchar16  hi;
    vxc_uchar16  lo;
} vxc_uchar32;

/* packed short vector 2/4/8 */
typedef _viv_short2_packed    vxc_short2;
typedef _viv_short4_packed    vxc_short4;
typedef _viv_short8_packed    vxc_short8;
typedef struct _vxc_short16
{
    vxc_short8  hi;
    vxc_short8  lo;
} vxc_short16;

/* packed ushort vector 2/4/8 */
typedef _viv_ushort2_packed   vxc_ushort2;
typedef _viv_ushort4_packed   vxc_ushort4;
typedef _viv_ushort8_packed   vxc_ushort8;
typedef struct _vxc_ushort16
{
    vxc_ushort8  hi;
    vxc_ushort8  lo;
} vxc_ushort16;

/* int vector 2/4/8/16 */
typedef int      vxc_int;
typedef int2     vxc_int2;
typedef int4     vxc_int4;
typedef int8     vxc_int8;
typedef int16    vxc_int16;

/* uint vector 2/4/8/16 */
typedef uint     vxc_uint;
typedef uint2    vxc_uint2;
typedef uint4    vxc_uint4;
typedef uint8    vxc_uint8;
typedef uint16   vxc_uint16;

/* float vector 2/4/8/16 */
typedef float     vxc_float;
typedef float2    vxc_float2;
typedef float4    vxc_float4;
typedef float8    vxc_float8;
typedef float16   vxc_float16;

/* half (float16) vector 2/4/8/16 */
typedef half                  vxc_half;
typedef _viv_half2_packed     vxc_half2;
typedef _viv_half4_packed     vxc_half4;
typedef _viv_half8_packed     vxc_half8;
typedef struct _vxc_half16
{
    vxc_half8  hi;
    vxc_half8  lo;
} vxc_half16;

typedef uint16 vxc_512bits;
typedef uint4  vxc_128bits;

typedef vxc_512bits VXC_512Bits;
typedef vxc_128bits VXC_128Bits;
typedef vxc_modifier VXC_Modifier_t ;
typedef vxc_round_mode VXC_RoundMode;
typedef vxc_filter_mode VXC_FilterMode;
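
Because the 32-lane types above are plain structs of two 16-lane halves, they cannot be indexed like a native vector; every access has to go through .hi or .lo. The following host-side C sketch models that layout. The names demo_char16, demo_char32, and demo_get_lane are hypothetical, and mapping lanes 0 to 15 onto the first member (hi) is an assumption made only for this illustration; check the header and your compiler for the real lane order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-ins for vxc_char16 and vxc_char32. */
typedef struct { signed char s[16]; } demo_char16;

typedef struct {
    demo_char16 hi;  /* declared first, as in the header's struct */
    demo_char16 lo;
} demo_char32;

/* Read lane i (0..31). Assumption for this sketch only: lanes follow
 * memory order, so lanes 0..15 live in the first member (hi). */
static signed char demo_get_lane(const demo_char32 *v, size_t i)
{
    assert(v != NULL && i < 32);
    return (i < 16) ? v->hi.s[i] : v->lo.s[i - 16];
}
```

The point of the sketch is only that a "long" type is two separate 16-lane values, so any operation on it has to be issued once per half.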

2.2 The _viv_uniform Keyword

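Constants are normally initialized while the OpenCL kernel runs; the _viv_uniform keyword instead initializes a kernel constant at load time, for example: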

_viv_uniform vxc_512bits u512;

2.3 ASM Instructions

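Packed data types generally cannot be passed to built-in functions. To use the common built-in functions, first convert the value to unpacked data, or operate on it with inline assembly instead.

_viv_asm is the inline-assembly keyword. The general instruction format is _viv_asm(OP_CODE, dest, src0, src1); for some instructions src1 must be an immediate value.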

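ABS	Absolute value
ADD	Addition	vxc_uchar16 a, b, c; _viv_asm(ADD, a, b, c); // a = b + c
ADD_SAT	Integer addition with saturation
AND_BITWISE	Bitwise AND
BIT_REVERSAL	Integer bitwise reversal
BITEXTRACT	Bit extraction from src to dest
BITINSERT	Bit replacement
BITSEL	Bitwise select
CLAMP0MAX	Clamp to the range 0 to max
CMP	Compare
CONV	Convert
COPY	Copy	src1 is the number of bytes to copy
DIV	Divide
FINDLSB	Find the least significant bit
FINDMSB	Find the most significant bit
LEADZERO	Detect leading zeros
LSHIFT	Left shift
MADSAT	Integer multiply-add with saturation
MOD	Modulus
MUL	Multiply
MULHI	High half of the product
MULSAT	Integer multiply with saturation
NEG	Negate
NOT_BITWISE	Bitwise NOT
OR_BITWISE	Bitwise OR
POPCOUNT	Population count
ROTATE	Rotate (bits shifted out of one end re-enter at the other)
RSHIFT	Right shift
SUB	Subtract
SUBSAT	Integer subtraction with saturation
XOR_BITWISE	Bitwise XOR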
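
To make the difference between a plain and a saturating instruction concrete, here is a small host-side C model of ADD and ADD_SAT on 16 uchar lanes, using the same dest-first argument order as _viv_asm. The function names demo_add and demo_add_sat are hypothetical, and the wrap-around behaviour assumed for plain ADD is ordinary unsigned arithmetic rather than something confirmed on hardware.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LANES 16  /* matches the 16 lanes of vxc_uchar16 */

/* dest = src0 + src1 per lane, wrapping modulo 256 (assumed ADD behaviour). */
static void demo_add(uint8_t *dest, const uint8_t *src0, const uint8_t *src1)
{
    assert(dest && src0 && src1);
    for (int i = 0; i < DEMO_LANES; ++i)
        dest[i] = (uint8_t)(src0[i] + src1[i]);
}

/* dest = src0 + src1 per lane, saturating at 255 (assumed ADD_SAT behaviour). */
static void demo_add_sat(uint8_t *dest, const uint8_t *src0, const uint8_t *src1)
{
    assert(dest && src0 && src1);
    for (int i = 0; i < DEMO_LANES; ++i) {
        unsigned sum = (unsigned)src0[i] + (unsigned)src1[i];
        dest[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

With both inputs set to 200 and 100, the wrapping version yields 44 in every lane while the saturating version yields 255, which is usually what image code wants.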

2.4 OP Code Instructions

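The VXC_OP enum defines the operations. Note that not all of them necessarily have a hardware implementation.

As for the DP (Dot Product) instructions: in my experiments they performed element-wise multiplication rather than a vector inner product, which does not quite make sense to me. I am not sure whether this is specific to the hardware I used. OpenCL itself also has a dot built-in…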

ABS_DIFF	Absolute difference between two values
IADD	Adds two or three integer values
IACC_SQ	Squares a value and adds it to an accumulator
LERP	Linear interpolation between two values
FILTER	Performs a filter on a 3×3 block
MAG_PHASE	Computes magnitude and phase of 2 packed data values
MUL_SHIFT	Multiplies two 8- or 16-bit integers and shifts
DP16x1	1 Dot Product from 2 16 component values
DP8x2	2 Dot Products from 2 8 component values
DP4x4	4 Dot Products from 2 4 component values
DP2x8	8 Dot Products from 2 2 component values
CLAMP	Clamps up to 16 values to a max or min value
BI_LINEAR	Computes a bilinear interpolation of 4 pixel values
SELECT_ADD	Adds a pixel value or increments a counter inside bins
ATOMIC_ADD	Adds a value atomically to an address
BIT_EXTRACT	Extracts up to 8 bitfields from a packed stream
BIT_REPLACE	Replaces up to 8 bitfields from a packed stream
DP32x1	1 Dot Product from 2 32 component values
DP16x2	2 Dot Products from 2 16 component values
DP8x4	4 Dot Products from 2 8 component values
DP4x8	8 Dot Products from 2 4 component values
DP2x16	16 Dot Products from 2 2 component values
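
The two behaviours contrasted in the DP remark earlier in this section can be written out in plain C. This is only a host-side illustration of "element-wise product" versus "inner product"; demo_mul_lanes and demo_dot are hypothetical names, and the sketch does not model how the DP instructions are actually configured or executed on any hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Element-wise product: one output per lane, no accumulation. */
static void demo_mul_lanes(int32_t *out, const int16_t *a, const int16_t *b, int n)
{
    assert(out && a && b && n >= 0);
    for (int i = 0; i < n; ++i)
        out[i] = (int32_t)a[i] * (int32_t)b[i];
}

/* Inner product: the same lane products summed into a single value. */
static int32_t demo_dot(const int16_t *a, const int16_t *b, int n)
{
    assert(a && b && n >= 0);
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

For a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, the first function produces {5, 12, 21, 32} and the second produces 70, so comparing a kernel's output against both is a quick way to tell which behaviour a device exhibits.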

There are some further operations as well; see the enum definition in the header file:

enum VXC_OP {
    VXC_OP_abs_diff = 3, /* it must be the same value as VIR_IK_abs_diff */
    VXC_OP_iadd,
    VXC_OP_iacc_sq,
    VXC_OP_lerp,
    VXC_OP_filter,
    VXC_OP_mag_phase,
    VXC_OP_mul_shift,
    VXC_OP_dp16x1,
    VXC_OP_dp8x2,
    VXC_OP_dp4x4,
    VXC_OP_dp2x8,
    VXC_OP_clamp,
    VXC_OP_bi_linear,
    VXC_OP_select_add,
    VXC_OP_atomic_add,
    VXC_OP_bit_extract,
    VXC_OP_bit_replace,
    VXC_OP_dp32x1,
    VXC_OP_dp16x2,
    VXC_OP_dp8x4,
    VXC_OP_dp4x8,
    VXC_OP_dp2x16,
    VXC_OP_dp32x1_b,
    VXC_OP_dp16x2_b,
    VXC_OP_dp8x4_b,
    VXC_OP_dp4x8_b,
    VXC_OP_dp2x16_b,
    VXC_OP_img_load,
    VXC_OP_img_load_3d,
    VXC_OP_img_store,
    VXC_OP_img_store_3d,
    VXC_OP_vload2,
    VXC_OP_vload3,
    VXC_OP_vload4,
    VXC_OP_vload8,
    VXC_OP_vload16,
    VXC_OP_vstore2,
    VXC_OP_vstore3,
    VXC_OP_vstore4,
    VXC_OP_vstore8,
    VXC_OP_vstore16,
    VXC_OP_index_add,
    VXC_OP_vert_min3,
    VXC_OP_vert_max3,
    VXC_OP_vert_med3,
    VXC_OP_horz_min3,
    VXC_OP_horz_max3,
    VXC_OP_horz_med3,
    VXC_OP_error,
    OP_bit_extract,
    VXC_OP_dp16x1_b,
    VXC_OP_dp8x2_b,
    VXC_OP_dp4x4_b,
    VXC_OP_dp2x8_b,
    VXC_OP_gather,
    VXC_OP_gather_b,
    VXC_OP_scatter,
    VXC_OP_scatter_b,
    VXC_OP_atomic_s,
    VXC_OP_atomic_s_b,
};

These operations are sometimes invoked through VXC_OPn(OP, dest, src, ...) (VXC_OP1, VXC_OP2, VXC_OP3, ...), where n is the number of source operands. The header file shows directly how they expand:

#define VXC_OP1(Op, Dest, Src0)   _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, Src0)

#define VXC_OP2(Op, Dest, Src0, Src1)                  \
    do {                                               \
        int _t1;                                       \
        _viv_asm(PARAM_CHAIN, _t1, Src0, Src1);         \
        _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, _t1);    \
    } while(0)

#define VXC_OP3(Op, Dest, Src0, Src1, Src2)            \
    do {                                               \
        int _t1, _t2;                                  \
        _viv_asm(PARAM_CHAIN, _t1, Src0, Src1);        \
        _viv_asm(PARAM_CHAIN, _t2, _t1, Src2);         \
        _viv_asm(INTRINSIC, Dest, VXC_OP_##Op, _t2);   \
    } while(0)
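
Two preprocessor idioms carry these macros: the ## operator pastes the Op argument onto the VXC_OP_ prefix to form an enumerator name, and the do { ... } while(0) wrapper makes a multi-statement macro behave as a single statement. The C sketch below reproduces both on the host. DEMO_OP, DEMO_OP_ID, DEMO_OP2, and demo_call are hypothetical stand-ins; only the first four enumerator values are taken from the enum above.

```c
#include <assert.h>

/* First few enumerators, with the same values as in the VXC_OP enum. */
enum DEMO_OP {
    DEMO_OP_abs_diff = 3,
    DEMO_OP_iadd,      /* 4 */
    DEMO_OP_iacc_sq,   /* 5 */
    DEMO_OP_lerp       /* 6 */
};

/* Enumerators count up from the first explicit value. */
static_assert(DEMO_OP_lerp == 6, "enumerators count up from abs_diff");

/* Op is pasted onto the prefix, so DEMO_OP_ID(iadd) becomes DEMO_OP_iadd. */
#define DEMO_OP_ID(Op) DEMO_OP_##Op

/* Hypothetical record of one call: which op, and how many sources. */
typedef struct { int op; int nsrc; } demo_call;

/* Multi-statement macro wrapped in do { } while (0) so that it acts as
 * one statement, even in an unbraced if/else. */
#define DEMO_OP2(Op, Call)            \
    do {                              \
        (Call).op = DEMO_OP_ID(Op);   \
        (Call).nsrc = 2;              \
    } while (0)
```

A call such as DEMO_OP2(iadd, c) therefore stores 4 in c.op, in the same way that VXC_OP2(iadd, ...) ends up passing VXC_OP_iadd to the INTRINSIC instruction.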

The filter operation supports the following filters on a 3×3 window:

VXC_FM_BOX	Compute a 3×3 box filter: |1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9|.
VXC_FM_Guassian	Compute a 3×3 Gaussian filter: |1/16, 2/16, 1/16, 2/16, 4/16, 2/16, 1/16, 2/16, 1/16|.
VXC_FM_SobelX	Compute a 3×3 Sobel filter in the x-direction: |-1, 0, 1, -2, 0, 2, -1, 0, 1|.
VXC_FM_SobelY	Compute a 3×3 Sobel filter in the y-direction: |-1, -2, -1, 0, 0, 0, 1, 2, 1|.
VXC_FM_ScharrX	Compute a 3×3 Scharr filter in the x-direction: |3, 0, -3, 10, 0, -10, 3, 0, -3|.
VXC_FM_ScharrY	Compute a 3×3 Scharr filter in the y-direction: |3, 10, 3, 0, 0, 0, -3, -10, -3|.
VXC_FM_Max	Get the maximum from a 3×3 kernel.
VXC_FM_Min	Get the minimum from a 3×3 kernel.
VXC_FM_Median	Get the median from a 3×3 kernel.

3. Examples

3.1 Reading and Writing Data

Data can be loaded with img_load; the store interface takes parameters similar to those of the load interface.

Coordinates must be of type intn or floatn.

About VXC_MODIFIER

VXC_MODIFIER(StartBin, EndBin, SourceBin, RoundingMode, Clamp)

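This macro is commonly used to build the instruction modifier when fetching data from an image. Its parameters have the following meanings:

StartBin/EndBin: the first and last bin of the data to fetch; EndBin - StartBin + 1 elements are fetched in total
SourceBin: VXC_5BITOFFSET_XY can be used here to indicate an offset; pass 0 if it is not used
RoundingMode: how values are rounded when fetching data; 0 (VXC_RM_TowardZero) rounds toward zero, 1 rounds up (toward infinity), and 2 rounds to the nearest even value
Clamp: 0 truncates the result to fit the result type; 1 clamps the result to the range of the result type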
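
The bin range and the Clamp flag are easy to mix up, so here is a host-side C model of both. It treats the bin range as inclusive, consistent with the (0, 15) example below fetching 16 values, and shows truncation versus clamping when a wider result is narrowed to uchar. The names demo_to_uchar and demo_write_bins are hypothetical, and the sketch models only the semantics described in the list above, not the bit encoding of the real macro.

```c
#include <assert.h>
#include <stdint.h>

/* Convert one 32-bit result to uchar.
 * clamp == 0: keep the low 8 bits (truncate).
 * clamp == 1: saturate to the uchar range 0..255. */
static uint8_t demo_to_uchar(int32_t value, int clamp)
{
    if (!clamp)
        return (uint8_t)((uint32_t)value & 0xFFu);
    if (value < 0)   return 0;
    if (value > 255) return 255;
    return (uint8_t)value;
}

/* Write results into bins start_bin..end_bin (inclusive) of a 16-lane
 * destination and leave every other bin untouched.
 * Returns the number of bins written: end_bin - start_bin + 1. */
static int demo_write_bins(uint8_t dest[16], const int32_t src[16],
                           int start_bin, int end_bin, int clamp)
{
    assert(0 <= start_bin && start_bin <= end_bin && end_bin < 16);
    for (int i = start_bin; i <= end_bin; ++i)
        dest[i] = demo_to_uchar(src[i], clamp);
    return end_bin - start_bin + 1;
}
```

Calling demo_write_bins(dest, src, 0, 13, 1) writes 14 bins and leaves bins 14 and 15 as they were; a source value of 300 becomes 255 with clamping and 44 with truncation.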

Example: fetch 16 uchar values from in_image into r1

int2 coord = (int2)(get_global_id(0), get_global_id(1));

vxc_uchar16 r1;

VXC_OP4(img_load, r1, in_image, coord, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));

Some data-read interfaces accept VXC_5BITOFFSET_XY(offsetX, offsetY) to specify the fetch offset, where offsetX and offsetY range from -16 to 15. Example:

VXC_ReadImage(v0, input, coord, VXC_5BITOFFSET_XY(0, 0), 
                            VXC_MODIFIER(0, 7, 0, VXC_RM_TowardZero, 0));
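
The -16 to 15 range is exactly what a 5-bit two's-complement field can hold. The C sketch below packs two such offsets into one integer and recovers them by sign extension. The helper names and the bit layout (X in the low 5 bits, Y in the next 5) are hypothetical choices for illustration, not taken from the header.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two signed offsets into 5 bits each (hypothetical layout: X in the
 * low 5 bits, Y in the next 5). 5-bit two's complement covers -16..15. */
static uint32_t demo_pack_offset_xy(int offset_x, int offset_y)
{
    assert(offset_x >= -16 && offset_x <= 15);
    assert(offset_y >= -16 && offset_y <= 15);
    return ((uint32_t)offset_x & 0x1Fu) | (((uint32_t)offset_y & 0x1Fu) << 5);
}

/* Sign-extend the low 5 bits of field back to an int in -16..15. */
static int demo_sign_extend5(uint32_t field)
{
    field &= 0x1Fu;
    return (field & 0x10u) ? (int)field - 32 : (int)field;
}
```

An offset outside the range cannot be represented in 5 bits, which is why larger shifts have to be applied to the coordinate itself instead.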

Image read/write interfaces and supported channel orders

// ---Read_Imagef,i,ui  
/* OCL image builtins can be used in VX kernel */  
float4 read_imagef (image2d_t image, int2 coord);  
int4 read_imagei (image2d_t image, int2 coord);  
uint4 read_imageui (image2d_t image, int2 coord);  
float4 read_imagef (image1d_t image, int coord);  
int4 read_imagei (image1d_t image, int coord);  
uint4 read_imageui (image1d_t image, int coord);  
float4 read_imagef (image1d_array_t image, int2 coord);  
int4 read_imagei (image1d_array_t image, int2 coord);  
uint4 read_imageui (image1d_array_t image, int2 coord); 

// ---Write_Imagef,i,ui  
void write_imagef (image2d_t image, int2 coord, float4 color);  
void write_imagei (image2d_t image, int2 coord, int4 color);  
void write_imageui (image2d_t image, int2 coord, uint4 color);  
void write_imagef (image1d_t image, int coord, float4 color);  
void write_imagei (image1d_t image, int coord, int4 color);  
void write_imageui (image1d_t image, int coord, uint4 color);  
void write_imagef (image1d_array_t image, int2 coord, float4 color);  
void write_imagei (image1d_array_t image, int2 coord, int4 color);  
void write_imageui (image1d_array_t image, int2 coord, uint4 color);

// ---Query Image Dimensions 
int2 get_image_dim (image2d_t image);  
size_t get_image_array_size(image1d_array_t image);  
/* Built-in Image Query Functions */  
int get_image_width (image1d_t image);  
int get_image_width (image2d_t image);  
int get_image_width (image1d_array_t image);  
int get_image_height (image2d_t image); 

// ---Channel Data Types Supported 
/* Return the channel data type. Valid values are:  
* CLK_SNORM_INT8  
* CLK_SNORM_INT16  
* CLK_UNORM_INT8  
* CLK_UNORM_INT16  
* CLK_UNORM_SHORT_565  
* CLK_UNORM_SHORT_555  
* CLK_UNORM_SHORT_101010  
* CLK_SIGNED_INT8  
* CLK_SIGNED_INT16  
* CLK_SIGNED_INT32  
* CLK_UNSIGNED_INT8  
* CLK_UNSIGNED_INT16  
* CLK_UNSIGNED_INT32  
* CLK_HALF_FLOAT  
* CLK_FLOAT  
*/  
int get_image_channel_data_type (image1d_t image);  
int get_image_channel_data_type (image2d_t image);  
int get_image_channel_data_type (image1d_array_t image); 
// ---Image Channel Orders Supported 
/* Return the image channel order. Valid values are:  
* CLK_A  
* CLK_R  
* CLK_Rx  
* CLK_RG  
* CLK_RGx  
* CLK_RA  
* CLK_RGB  
* CLK_RGBx  
* CLK_RGBA  
* CLK_ARGB  
* CLK_BGRA  
* CLK_INTENSITY  
* CLK_LUMINANCE  
*/  
int get_image_channel_order (image1d_t image);  
int get_image_channel_order (image2d_t image);  
int get_image_channel_order (image1d_array_t image);


3.2 Gaussian Filter Example

Here is a Gaussian filter example: https://community.nxp.com/t5/i-MX-Processors-Knowledge-Base/How-to-use-OpenVX-extension-for-NPU-GPU-to-accelerate-machine/ta-p/1113429

The kernel part of the code is reproduced here:

__kernel void gaussian
(
    __read_only image2d_t in_image,
    __write_only image2d_t out_image
)
{
    int2 coord = (int2)(get_global_id(0), get_global_id(1));
    int2 coord_out = coord;
    vxc_uchar16 lineA, lineB, lineC, out;

    /* Load 16 pixels from each of the rows above, at, and below coord,
       starting one column to the left. */
    int2 coord_in1 = coord + (int2)(-1, -1);
    VXC_OP4(img_load, lineA, in_image, coord_in1, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
    int2 coord_in2 = coord + (int2)(-1, 0);
    VXC_OP4(img_load, lineB, in_image, coord_in2, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));
    int2 coord_in3 = coord + (int2)(-1, 1);
    VXC_OP4(img_load, lineC, in_image, coord_in3, 0, VXC_MODIFIER(0, 15, 0, VXC_RM_TowardZero, 0));

    /* 3x3 Gaussian: 16 input columns yield 14 results (bins 0..13). */
    int info = VXC_MODIFIER_FILTER(0, 13, 0, VXC_FM_Guassian, 0);
    VXC_OP4(filter, out, lineA, lineB, lineC, info);

    /* Store the 14 filtered pixels. */
    VXC_OP4_NoDest(img_store, out_image, coord_out, out, VXC_MODIFIER(0, 13, 0, VXC_RM_TowardZero, 0));
}
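
To check a kernel like this against known-good values, a scalar reference on the host is handy. The sketch below applies the 3×3 Gaussian weights from the filter table to three 16-pixel rows and produces 14 outputs, which is why the filter and store above use bins 0 to 13: a 3-wide window over 16 columns has 16 - 2 = 14 valid positions. The name demo_gaussian3x3 is hypothetical, and dividing by 16 with truncation is an assumption; the hardware's rounding may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Reference 3x3 Gaussian over three 16-pixel rows.
 * Writes out[0..13]; out[x] is centred on column x + 1 of the inputs,
 * so 16 input columns yield 16 - 2 = 14 results. */
static void demo_gaussian3x3(uint8_t out[14], const uint8_t row_a[16],
                             const uint8_t row_b[16], const uint8_t row_c[16])
{
    static const int w[3] = {1, 2, 1};   /* horizontal weights */
    assert(out && row_a && row_b && row_c);
    for (int x = 0; x < 14; ++x) {
        int acc = 0;
        /* Vertical weights 1, 2, 1 applied to rows a, b, c per column. */
        for (int k = 0; k < 3; ++k)
            acc += w[k] * (row_a[x + k] + 2 * row_b[x + k] + row_c[x + k]);
        out[x] = (uint8_t)(acc / 16);    /* assumption: truncating division */
    }
}
```

Feeding it a single bright pixel in the middle row gives back the centre column of the kernel (2/16, 4/16, 2/16 of the pixel value), and a constant image comes back unchanged.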

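3.3 Half-Precision Examples

Explicit conversion: a value must be cast with the type name before it is assigned to a half; it cannot be assigned directly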

half a = (half)1000;

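Reinterpretation: we read the data as short and then reinterpret it as half. Older compilers usually do not provide as_half, so we use the inline-assembly COPY instruction instead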

vxc_short8 val;
vxc_half8 val_fp16;
_viv_asm(COPY, val_fp16, val, sizeof(val));
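
Reinterpreting differs from converting: the bits stay the same and only their meaning changes. The host-side C sketch below mirrors the COPY approach with memcpy and then decodes the IEEE 754 binary16 bit pattern by hand, so the result can be inspected without half support in the compiler. All function names are hypothetical, and the sketch assumes the device's half type is standard binary16.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Multiply v by 2^e using exact power-of-two steps (no libm needed). */
static float demo_scale_pow2(float v, int e)
{
    assert(e >= -24 && e <= 5);
    while (e > 0) { v *= 2.0f; --e; }
    while (e < 0) { v *= 0.5f; ++e; }
    return v;
}

/* Decode an IEEE 754 binary16 bit pattern into a float. */
static float demo_half_bits_to_float(uint16_t h)
{
    int sign = (h >> 15) & 0x1;
    int exponent = (h >> 10) & 0x1F;
    int mantissa = h & 0x3FF;
    float value;

    if (exponent == 0)            /* zero or subnormal: mantissa * 2^-24 */
        value = demo_scale_pow2((float)mantissa, -24);
    else if (exponent == 31)      /* infinity or NaN */
        value = mantissa ? NAN : INFINITY;
    else                          /* normal: (1024 + mantissa) * 2^(exponent - 25) */
        value = demo_scale_pow2((float)(mantissa + 1024), exponent - 25);
    return sign ? -value : value;
}

/* Reinterpret a short's bits as a half: copy the bytes, then decode. */
static float demo_short_as_half(int16_t s)
{
    uint16_t bits;
    memcpy(&bits, &s, sizeof bits);   /* same role as COPY with sizeof(val) */
    return demo_half_bits_to_float(bits);
}
```

For the short value 0x3C00 (15360 as an integer), reinterpretation gives 1.0, whereas a numeric cast would give 15360.0; that gap is the reason the COPY instruction is used instead of a cast.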

3.4 Other Open-Source Implementations

TIM-VX: https://github.com/VeriSilicon/TIM-VX
- TIM-VX is a neural-network acceleration library that contains a large amount of vx kernel code for reference
AML-NPU-SDK: aml-npu-sdk/nnvxc_kernels at 63b337b73f6143aa58d0ba1c57d0907ac9b0fb39 · leemgs/aml-npu-sdk (github.com)
- Also an open-source NPU SDK, with many NN-related OpenVX kernel (nnvxc kernel) implementations
VX implementations of several CV functions for stereo and soft ISP: gtec-demo-framework/DemoApps/OpenVX/Stereo/Content/CTF_kernels_vxc.vx at master · nxp-imx/gtec-demo-framework (github.com)

Other documentation resources are collected at the beginning of this article; I hope they help.