Details
- Reviewers: craig.topper

Diff Detail
- Repository: rG LLVM Github Monorepo

Event Timeline
test/CodeGen/X86/memcmp.ll:1015 ↗ (On Diff #223186)

I'm not sure this is better than the AVX code. vpcmpeq to a mask register and kortest both have 3 cycle latency. vpcmpeqb to xmm is 1 cycle, vpmovmskb is 1 cycle, and cmpl is 1 cycle.
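For readers following the thread, here is a minimal sketch of the two lowerings being compared for a 16-byte memcmp-for-equality. The function name is hypothetical, and the assembly in the comments is roughly what the two code paths emit, not a verbatim copy of the patch's output:

```c
#include <string.h>

/* 16-byte equality test; the memcmp expansion pass turns this into
 * straight-line vector code (function name is hypothetical). */
int eq16(const char *a, const char *b) {
  return memcmp(a, b, 16) == 0;
}

/* AVX512 opmask lowering under discussion (roughly):
 *   vmovdqu    (%rdi), %xmm0
 *   vpcmpneqb  (%rsi), %xmm0, %k0   # byte compare into a mask register
 *   kortestw   %k0, %k0             # ZF=1 iff no byte differed
 *   sete       %al
 *
 * Existing AVX lowering through a GPR (roughly):
 *   vmovdqu    (%rdi), %xmm0
 *   vpcmpeqb   (%rsi), %xmm0, %xmm0
 *   vpmovmskb  %xmm0, %eax          # compare mask lands in a GPR
 *   cmpl       $0xFFFF, %eax        # all 16 bytes equal?
 *   sete       %al
 */
```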
That's a reasonable point. From my perspective:
- Staying in the opmask registers as long as possible relieves general-purpose register pressure.
- When/if MaxLoadsPerMemcmp becomes greater than 2, keeping the intermediate results in the opmask registers is the right thing to do (see the sketch below).
- I have a pending patch to increase MaxLoadsPerMemcmp to 4 if-and-only-if the memcmp() result is compared against zero. Raising it to 8 (again, only for compares against zero) will require more work, I think.
- In the 512-bit case, using AVX512BW instead of just AVX512F better matches people's mental model: byte-wise memcmp() should generate byte-wise vector instructions.
What do you think?
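To make the multi-load argument concrete, here is a scalar C model of the shape a multi-load expansion of memcmp(a, b, n) == 0 takes: each chunk's "difference" is OR-accumulated and tested once at the end. In the vector analogue, each chunk's compare result stays in an opmask register and the masks are combined with korw/korq, so no intermediate value round-trips through a GPR. The function name and the 4x8-byte chunking are illustrative assumptions:

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a 4-load expansion of memcmp(a, b, 32) == 0.
 * Each chunk contributes a difference value; the differences are
 * OR'd together and tested once at the end. */
int eq32_expanded(const char *a, const char *b) {
  uint64_t acc = 0;
  for (int i = 0; i < 32; i += 8) {
    uint64_t x, y;
    memcpy(&x, a + i, sizeof x);  /* unaligned 8-byte loads */
    memcpy(&y, b + i, sizeof y);
    acc |= x ^ y;                 /* nonzero iff this chunk differs */
  }
  return acc == 0;
}
```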
I'm fine with the 512-bit change to use byte compare with AVX512BW.
For 128/256, I wonder about the PTEST suggestion in the FIXME above this code. Unfortunately PTEST has 3 cycle latency on recent Intel CPUs. But it would avoid the GPR use entirely.
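For reference, the PTEST variant that the FIXME hints at would look roughly like the comment below; this is an assumption about what such a lowering would emit, not what the patch currently produces:

```c
#include <string.h>

/* 16-byte equality via PTEST (hypothetical lowering):
 *   vmovdqu  (%rdi), %xmm0
 *   vpxor    (%rsi), %xmm0, %xmm0  # nonzero bytes mark mismatches
 *   vptest   %xmm0, %xmm0          # ZF=1 iff xmm0 is all zeros
 *   sete     %al
 * No vector compare mask ever lands in a GPR. */
int eq16_ptest(const char *a, const char *b) {
  return memcmp(a, b, 16) == 0;
}
```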
Simplified to just enabling AVX512BW for 512 and letting 128/256 keep using AVX/AVX2.
PS: I trust your knowledge of instruction latencies, but then I wonder what Agner Fog did wrong, since he measured PMOVMSKB as having 2-3 cycles of latency.
I think I just misremembered the PMOVMSKB latency; it probably is 2-3 cycles. I should have looked it up. What I did remember from earlier reading is that VPCMPEQ/KORTEST are both 3 cycles.