In vectorized add reduction code, the final "reduce" step is sub-optimal.
This change wll combine :
ext v1.16b, v0.16b, v0.16b, #8 add v0.4s, v1.4s, v0.4s dup v1.4s, v0.s[1] add v0.4s, v1.4s, v0.4s
into
addv s0, v0.4s
This fixes PR21371.
Differential D12325
Improve ISel using across lane addition reduction junbuml on Aug 25 2015, 11:16 AM. Authored by
Details
In vectorized add reduction code, the final "reduce" step is sub-optimal. ext v1.16b, v0.16b, v0.16b, #8 add v0.4s, v1.4s, v0.4s dup v1.4s, v0.s[1] add v0.4s, v1.4s, v0.4s into addv s0, v0.4s This fixes PR21371.
Diff Detail Event TimelineComment Actions Thanks for working on this, Jun. Mostly style nits with a few other comments.
Comment Actions Hi, Thanks for working on this! I'm generally happy with the algorithm, but I have a bunch of comment and style requests. Cheers, James
Comment Actions Thanks James for the review. I will address your comments and update it at this morning.
Comment Actions Jun/James, While this patch is focused on the add reductions, aren't there other opportunities here for optimizing the reduce step of the other reduction types? For example, couldn't we similarly use the sminv/uminv and smaxv/umaxv instructions for min/max reductions? We currently don't do this. Min/max reductions are important for 456.hmmer. Comment Actions I now see that the above comment was already made, and a FIXME was added. Please disregard. Comment Actions Yes, there are more across lane reduction instructions as mentioned in FIXME. We could extend it to support other types as well. Comment Actions The LLVM community prefers incremental changes, which simplifies code review as well as eases bisection. It might be wise to land this patch before moving onto the other types of reductions. Comment Actions Confirmed that this patch was passed correctness tests for spec2000/2006, and applied several spec benchmarks including gcc, h264, hmmer, sjeng, and vpr. Comment Actions Hi James, I'm planning on pushing another patch to handle other across vector reductions (SMAX/SMIN/UMAX/UMIN), which require some change in this patch. Do you prefer to have a single merged patch or separate patches? Either ways are fine with me. Please let me know your thought. Comment Actions @jmolloy: This looks to be in good shape, AFAICT. Mind giving this a LGTM, if you're satisfied. Comment Actions Thanks James for the review ! |
I'd like to see here an example using SDAG nodes. It's nice to have the assembly before/after, but what's really important is the exact pattern this function is intending to match, and what it replaces it with.
For example:
%2 = SHUFFLE_VECTOR ...
%1 = ADD %2, %3
EXTRACT_ELEMENT %1, 0
Something like that.
And specifically, this is the log2-shuffle pattern that the LoopVectorizer produces. Please make that very explicit, because this code won't match all reductions.