Patch tries to improve vectorization of the following code:
void add1(int * __restrict dst, const int * __restrict src) { *dst++ = *src++; *dst++ = *src++ + 1; *dst++ = *src++ + 2; *dst++ = *src++ + 3; }
Currently this code cannot be vectorized because the very first operation is not a binary add, but just a load.
Turn this to a class. Also, not enough comments, the design of the structure also should be improved. Currently, it is too complex. Maybe base class + templates will help.