Now that rL336250 has (hopefully) landed, I'd like to prefer 2 immediate shifts + a shuffle blend over performing a multiply. Despite the increase in instruction count, this is quicker (especially for slow v4i32 multiplies) and avoids loads and constant pool usage. It does, however, increase register pressure. The code size will go up a little, but by less than what we save on the constant pool data.
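To make the equivalence concrete, here is a minimal scalar sketch in portable C++ (not the actual X86 lowering code). It models a v4i32 multiply by a constant vector whose lanes are one of two powers of two, and the same result built from two uniform immediate shifts plus a blend. The names `V4`, `mulLanes`, `shlImm`, `blend` and `mulViaShifts` are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4 = std::array<std::uint32_t, 4>;

// Reference: lane-wise multiply by a constant vector.
// Unsigned arithmetic wraps modulo 2^32, like a low 32-bit vector multiply.
V4 mulLanes(const V4 &x, const V4 &c) {
  V4 r{};
  for (std::size_t i = 0; i < 4; ++i)
    r[i] = x[i] * c[i];
  return r;
}

// Shift every lane left by the same immediate amount.
V4 shlImm(const V4 &x, unsigned amt) {
  assert(amt < 32 && "shift amount must be smaller than the lane width");
  V4 r{};
  for (std::size_t i = 0; i < 4; ++i)
    r[i] = x[i] << amt;
  return r;
}

// Blend: take lane i from b if bit i of mask is set, otherwise from a.
V4 blend(const V4 &a, const V4 &b, unsigned mask) {
  assert(mask < 16 && "only four lanes to select");
  V4 r{};
  for (std::size_t i = 0; i < 4; ++i)
    r[i] = ((mask >> i) & 1u) ? b[i] : a[i];
  return r;
}

// Two immediate shifts + one blend, replacing a multiply whose constant
// lanes are either (1 << lo) or (1 << hi). hiMask marks the `hi` lanes.
V4 mulViaShifts(const V4 &x, unsigned lo, unsigned hi, unsigned hiMask) {
  return blend(shlImm(x, lo), shlImm(x, hi), hiMask);
}
```

For example, multiplying by `{2, 2, 8, 8}` corresponds to `mulViaShifts(x, 1, 3, 0xC)`. Both shifted vectors are live until the blend, which is where the extra register pressure comes from.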
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Comment Actions
Dammit, just realised that pre-SSE41 targets might introduce AND/ANDN/OR blend masks, which will be even more costly - I'll see if there is a better way to do this.
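For reference, the AND/ANDN/OR pattern mentioned here can be sketched per lane as follows. This is a scalar model in portable C++, assuming a v8i16 vector and a mask whose lanes are all-zeros or all-ones; `V8` and `bitSelect` are hypothetical names for illustration, not anything in the patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V8 = std::array<std::uint16_t, 8>;

// Select without a blend instruction: (mask & b) | (~mask & a).
// That is three logic ops plus a mask constant, versus a single blend.
V8 bitSelect(const V8 &a, const V8 &b, const V8 &mask) {
  V8 r{};
  for (std::size_t i = 0; i < 8; ++i) {
    assert((mask[i] == 0 || mask[i] == 0xFFFF) &&
           "each mask lane must be all-zeros or all-ones");
    r[i] = static_cast<std::uint16_t>((mask[i] & b[i]) | (~mask[i] & a[i]));
  }
  return r;
}
```

The mask has to be materialised as a constant, which would give back part of the constant pool saving that motivated the change.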
Comment Actions
Make vXi16 "2shifts+select" more selective - only do it on pre-SSE41 if the shuffle can be widened. Only do it on SSE41+ if a single PBLENDW can be used.
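As a rough illustration of the "can be widened" condition for v8i16, the sketch below checks whether an 8-lane blend mask can be expressed as a 4-lane (v4i32) blend, i.e. whether each adjacent pair of i16 lanes picks the same source. This is an assumption about what the check amounts to for a two-input blend; `canWidenBlendToV4I32` is a hypothetical helper, not the function the patch uses.

```cpp
#include <cassert>

// mask8: bit i set means i16 lane i comes from the second shifted vector.
// Returns true if every adjacent lane pair (0,1), (2,3), (4,5), (6,7)
// selects the same source, so the blend can be done on 32-bit lanes.
bool canWidenBlendToV4I32(unsigned mask8) {
  assert(mask8 < 256 && "only eight i16 lanes to select");
  for (unsigned pair = 0; pair < 4; ++pair) {
    unsigned bits = (mask8 >> (2 * pair)) & 3u;
    if (bits != 0u && bits != 3u)
      return false; // the pair is split between the two sources
  }
  return true;
}
```

Under this model, a mask such as `0xF0` (upper four lanes from the second shift) is widenable, while an alternating mask such as `0xAA` is not.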
test/CodeGen/X86/lower-vec-shift.ll | ||
---|---|---|
211–232 ↗ | (On Diff #154197) | Subj only talks about mul, but this is div. |
test/CodeGen/X86/lower-vec-shift.ll | ||
---|---|---|
211–232 ↗ | (On Diff #154197) | This is a side effect of only accepting v8i16 2shifts+blend on pre-SSE41 (no PBLENDW) if the shuffle can be widened to v4i32. Without PBLENDW we have to perform a bitmask select with OR(ANDN,AND), but for other shifts we'd end up doing that anyway. I suppose I could limit this to SHL cases only? |