"vpmaddwd + vpaddd" can be combined into vpdpwssd, and the combined
instruction has lower latency than the pair. However, when vpdpwssd is on
a critical path, the combination reduces ILP. This happens when vpdpwssd
is in a loop: the vpmaddwd of different iterations can execute in
parallel, while vpdpwssd carries a data dependency from each iteration to
the next. If vpaddd is on a critical path while vpmaddwd is not, it is
profitable to split vpdpwssd into "vpmaddwd + vpaddd".
This patch uses the machine combiner framework to decide whether to
combine "vpmaddwd + vpaddd". A typical example is shown below.
__m256i foo(int cnt, __m256i c, __m256i b, __m256i *p) {
  for (int i = 0; i < cnt; ++i) {
    __m256i a = p[i];
    __m256i m = _mm256_madd_epi16 (b, a);
    c = _mm256_add_epi32(m, c);
  }
  return c;
}
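
To make the equivalence behind the combination concrete, here is a minimal
scalar sketch of a single 32-bit lane. It is not part of the patch: the
helper names (madd_lane, add_lane, dpwssd_lane, verify_lane) are
hypothetical, and the model assumes wrapping 32-bit arithmetic with no
saturation, which is how vpmaddwd, vpaddd and vpdpwssd are understood to
behave here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one 32-bit lane. The unsigned-to-signed
   conversions assume the usual two's-complement wrap-around. */

/* vpmaddwd lane: two signed 16x16 products, summed with wrap-around. */
static int32_t madd_lane(int16_t a0, int16_t a1, int16_t b0, int16_t b1) {
    uint32_t p0 = (uint32_t)((int32_t)a0 * (int32_t)b0);
    uint32_t p1 = (uint32_t)((int32_t)a1 * (int32_t)b1);
    return (int32_t)(p0 + p1);
}

/* vpaddd lane: 32-bit add with wrap-around. */
static int32_t add_lane(int32_t x, int32_t y) {
    return (int32_t)((uint32_t)x + (uint32_t)y);
}

/* vpdpwssd lane: accumulator plus both products in one step. */
static int32_t dpwssd_lane(int32_t c, int16_t a0, int16_t a1,
                           int16_t b0, int16_t b1) {
    uint32_t p0 = (uint32_t)((int32_t)a0 * (int32_t)b0);
    uint32_t p1 = (uint32_t)((int32_t)a1 * (int32_t)b1);
    return (int32_t)((uint32_t)c + p0 + p1);
}

/* Checks that the fused and the split form agree for one lane. */
static void verify_lane(int32_t c, int16_t a0, int16_t a1,
                        int16_t b0, int16_t b1) {
    assert(dpwssd_lane(c, a0, a1, b0, b1) ==
           add_lane(madd_lane(a0, a1, b0, b1), c));
}
```

In the split form, only add_lane depends on the accumulator c, so the
madd_lane work of later iterations does not have to wait for it. In the
fused form, the whole dpwssd_lane step depends on the previous value of c.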
What is this change for?