This is an archive of the discontinued LLVM Phabricator instance.

[X86][AVX2] Prefer VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles
ClosedPublic

Authored by RKSimon on Jul 31 2018, 8:21 AM.

Details

Summary

Noticed while looking at D49562 codegen - we can avoid a large constant mask load and a slow VPBLENDVB select op by using VPBLENDW+VPBLENDD instead.

TODO: We should investigate adding VPBLENDVB handling to target shuffle combining as well.

Should we be preferring VPBLENDVB/VSELECT for AVX512 targets?

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision. Jul 31 2018, 8:21 AM
pcordes added a comment (edited). Aug 5 2018, 5:38 PM

Nice idea to chain vpblendw + vpblendd; those are both single-uop instructions on AVX2 CPUs, and vpblendd can run on any port.
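A minimal sketch in C intrinsics of what the chained blend looks like next to the single vpblendvb (the function names and the particular blend mask are made up for illustration; this isn't the patch's exact lowering):

#include <immintrin.h>

// Take words 1, 9, 12 and 13 from b, everything else from a.
// One-instruction form: vpblendvb needs a 32-byte constant mask in a register.
static __m256i blend_with_vpblendvb(__m256i a, __m256i b) {
  const __m256i mask = _mm256_set_epi16(0, 0, -1, -1, 0, 0, -1, 0,  // words 15..8
                                        0, 0, 0, 0, 0, 0, -1, 0);   // words 7..0
  return _mm256_blendv_epi8(a, b, mask);
}

// Chained form: vpblendw covers the part of the mask that repeats in both
// 128-bit lanes (words 1 and 9), vpblendd covers the remaining dword-aligned
// difference (words 12-13 = dword 6). No constant load needed.
static __m256i blend_with_vpblendw_vpblendd(__m256i a, __m256i b) {
  __m256i t = _mm256_blend_epi16(a, b, 0x02); // word 1 of each lane from b
  return _mm256_blend_epi32(t, b, 0x40);      // dword 6 (words 12-13) from b
}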

Skylake: vpblendvb is 2 uops for any of p015; vpblendw is 1 uop for p5. If you can hoist the vector constant, vpblendvb is equal to 2 immediate blends, and replacing it with 3 would be worse.

Fun fact: Skylake's legacy-SSE pblendvb x,x,xmm0 is a single uop for p015. The 4-explicit-operand VEX encoding makes it cost 2 uops.

Haswell's vpblendvb is 2 uops for p5 only, so it and shuffles can easily bottleneck on port 5. vpblendw is also port-5 only. vpblendw+vpblendd is better, but depending on port pressure, 2x vpblendw + vpblendd is worse (again assuming you can hoist the vector constant).

Agner Fog strangely doesn't have numbers for pblendvb on Piledriver or Ryzen, not even the SSE4 version. http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt says vpblendvb xmm is single-cycle latency on Ryzen, but vpblendvb ymm is 2c latency. (With throughput = latency, so I guess only one port.) VPBLENDW ymm is single-cycle latency, with 0.67c throughput, so I guess it's 1 uop per lane, on 3 ports. So Ryzen's vpblendw is better than Intel's at avoiding port bottlenecks. But vpblendvb is also only 1 uop per lane, so it's definitely efficient when we can hoist the mask out of the loop and register pressure allows spending a register on it.


My instinct here is that for Intel tunings and probably also generic, we should replace vpblendvb with up to 2 uops of vpblendw + vpblendd, but not 3.

If we can analyze the situation and figure out that vpblendvb will definitely have to reload the mask every time, we should consider replacing it even if it takes 3 immediate blends. Ideally we can check the loop for port 5 pressure.

Stuff like this makes clang hard to use when hand-tuning a loop, though. I know I'd be very annoyed if I was using a vpblendvb intrinsic, and clang replaced it with 2x vpblendw + vpblendd and created a port 5 bottleneck on Skylake, plus costing more uops. So we should be very cautious about 3-instruction replacements.

Replacing it with 2 uops can obviously be harmful too in some cases, because vpblendw only runs on port 5. It would be great if there was an option that asked clang to use instructions more closely matching the intrinsics for hand-tuned loops, but we can always write asm by hand to tune for a specific uarch.

pcordes added a comment (edited). Aug 5 2018, 6:22 PM

With AVX512BW, we should *definitely* be using vpblendmb zmm0{k1}{z}, zmm1, zmm2. According to IACA for SKX, it's single-uop, 1c latency, and runs on any vector ALU port. (i.e. port 0 or port 5 when port 1 is shut down because 512-bit uops are in flight.)

Even if setting up the constant takes extra uops, the critical path latency for the vectors is only 1 cycle.

To get a 64-bit constant into a k register, we need a movabs rcx, 0x12346... / kmovq k1, rcx, or load it from memory. I think normally k registers aren't under a lot of pressure in most functions, so we should normally be able to hoist it if there is a loop.
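A minimal sketch with intrinsics, assuming AVX512BW (the 64-bit mask value is arbitrary, just for illustration):

#include <immintrin.h>

// Constant byte blend: b where the mask bit is set, a where it is clear.
static __m512i byte_blend_avx512bw(__m512i a, __m512i b) {
  const __mmask64 k = 0x0123456789ABCDEFULL; // expect movabs r64, imm64 + kmovq k1, r64 (or a load)
  return _mm512_mask_blend_epi8(k, a, b);    // vpblendmb zmm{k1}, zmm, zmm
}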

Loading a k register from memory costs 2 uops according to IACA, one of them being a micro-fused load+ALU. That sounds weird; IDK why you'd need an ALU uop other than the integer->k port-5-only uop. It might be correct, though; k-register store+load has 0.67c or 0.5c throughput (http://users.atw.hu/instlatx64/GenuineIntel0050654_SkylakeX_InstLatX64.txt). Anyway, I think probably a mov-immediate is a good choice even for 64-bit integers, because it apparently doesn't save uops to load straight into mask registers.


With only AVX512F, we can do bit/byte blends using vpternlogd, using a vector control mask (in a zmm reg, not a k reg). Given the right truth table, one source can select the corresponding bit from either of the other two operands, so we can replace one of the inputs or replace the selector.

vpternlogd is single-uop on AVX512 CPUs, including KNL.
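A minimal sketch of that bit-select with intrinsics (AVX512F only; the helper name is made up, but imm8 = 0xCA is the standard select truth table):

#include <immintrin.h>

// For each bit: result = (mask & x) | (~mask & y), i.e. mask picks x or y,
// so a byte (or bit) blend falls out of one vpternlogd.
// The first operand is both a source and the destination, so the compiler
// overwrites a copy of `mask` with the blended result.
static __m512i bit_select_vpternlogd(__m512i mask, __m512i x, __m512i y) {
  return _mm512_ternarylogic_epi32(mask, x, y, 0xCA);
}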

This could be an interesting option for byte blends of 256-bit vectors when used with AVX2 compare results (that put the result in a vector instead of mask reg). e.g. building manually-vectorized code with 256-bit vectors with -march=knl, where we have AVX512F but not BW. (And not VL, so we'd actually have to use a ZMM instruction. That's fine on KNL, but very bad on SKX if no other 512-bit instructions were in flight. We'd like to avoid -mtune=generic -mavx512f being a pitfall of nasty code-gen compared to -march=skylake-avx512)


I don't think AVX512 has any immediate blends. Even vpblendpd doesn't have an EVEX encoding, only VEX using only the low 4 bits of the imm8. At least if it does have any, they don't have blend or select in the mnemonic or short description.

I guess you're meant to use k registers, even though it's a 2-step 2-uop process to get an immediate into a k reg. (But one of those uops can run on any port, including port 6). Both those extra uops are off the critical path of vectors in -> vector out, unlike with multi-uop vpblendvb.

Of course, VEX vpblendd is still excellent, and should be used on 256-bit vectors whenever possible. e.g. for _mm256_mask_blend_epi32 with a compile-time constant mask, if register allocation has the operands in the low 16 registers.
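For example (a hedged sketch, arbitrary 0xA5 mask): with a compile-time-constant mask and operands in the low 16 registers, something like this can become a single VEX vpblendd:

#include <immintrin.h>

static __m256i dword_blend(__m256i a, __m256i b) {
  return _mm256_mask_blend_epi32(0xA5, a, b); // dwords 0, 2, 5, 7 from b
}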

(Fun fact: using only ymm16..31 avoids the need for vzeroupper, because their low lanes aren't accessible with legacy SSE instructions. But missing out on VEX instructions / short-encodings when doing 256-bit vectorization with AVX512 available is a downside to that.)

Cheers Peter, I'm going to look at adding combining of shuffles to VPBLENDVB/VPBLENDMB in the target shuffle combiner. We already have a 'variable mask' threshold mechanism that allows recent Intel CPUs to merge >2 shuffles to a single variable mask shuffle, so the 2*VPBLENDW+VPBLENDD regression case can be avoided on those targets (see the 'SLOW' vs 'FAST' codegen checks above).

I can look at combining shuffles to VPTERNLOG in the future if/when it's requested.

Cheers Peter, I'm going to look at adding combining of shuffles to VPBLENDVB/VPBLENDMB in the target shuffle combiner.

Don't forget that a 32-bit mask is cheaper to create with a mov r32, imm32, so look for the chance to use vpblendmw.
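A minimal sketch (arbitrary mask value, for illustration only):

#include <immintrin.h>

// A word blend of 512-bit vectors only needs a 32-bit mask, so the constant
// can be built with mov r32, imm32 + kmovd instead of a 64-bit movabs.
static __m512i word_blend_avx512bw(__m512i a, __m512i b) {
  const __mmask32 k = 0x0000FF0F;          // words 0-3 and 8-15 from b
  return _mm512_mask_blend_epi16(k, a, b); // vpblendmw zmm{k1}, zmm, zmm
}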

We already have a 'variable mask' threshold mechanism that allows recent Intel CPUs to merge >2 shuffles to a single variable mask shuffle, so the 2*VPBLENDW+VPBLENDD regression case can be avoided on those targets (see the 'SLOW' vs 'FAST' codegen checks above).

From that code-gen, I hope KNL is the only CPU in the "AVX2-SLOW" category; its vpblendw/d are efficient but vpblendvb is 4 uops (and thus a front-end bottleneck). But KNL's vpshuflw/hw ymm are horrible too, 8c throughput vs. 12c for vpshufb ymm, so I'm not convinced that multiple vpshufl/hw + combine is the way to go vs. just using vpshufb ymm. Given the way its front-end works, one huge instruction that gets a lot of uops from microcode ROM in one go is probably better than multiple multi-uop instructions that stall the decoders multiple times. (But this is just based on Agner Fog's guide, not any real testing. Still, the per-instruction throughput numbers can be misleading because nearly every multi-uop instruction's throughput is based on the resulting front-end bottleneck. IDK if microcode can be read fast enough to fill that bubble for later insns...)

If KNL can load a mask for vpternlogd, that's probably your best bet for efficient byte blends if AVX512F isn't disabled. But maybe not a high priority to implement because AVX2 byte-manipulation code is generally going to suck on KNL anyway.


On anything other than KNL that supports AVX2, it comes down to whether the blend mask can be hoisted out of a loop.

Haswell is almost always better off with 1 vpblendvb (2p5) than 3 separate instructions (2p5 + p015), when we have the mask in a reg already.

Ryzen is much better off with vpblendvb ymm (2 uops) than 6 uops.

(IDK about Excavator).

RKSimon updated this revision to Diff 162664. Aug 27 2018, 5:40 AM

Rebased - still investigating how best to start including vselect inside shuffle combining. Trying to optimize for everything from SSE41 to AVX512BWVL isn't straightforward, especially as we don't do much to optimize vselect nodes most of the time since their behaviour is target-specific after legalization.

TBH I reckon this could go in as it is and we improve VSELECT combines later on.

TBH I reckon this could go in as it is and we improve VSELECT combines later on.

Sounds reasonable as long as we aren't pessimizing Skylake by turning vpblendvb into 3 uops (including 2 for port 5) instead of 2 for any port, inside a loop.

The AVX2-FAST-LABEL: PR24935 test seems to be doing that still.

Especially in manually-vectorized code, I think it would be bad to compile _mm256_blendv_epi8 with a constant into 2x vpblendw + vpblendd. Could easily cause a performance regression in some code.

Can we add a check that only at most 2 immediate blends will be needed, as a conservative option to get the improvements in place for the cases where it is a win?

RKSimon updated this revision to Diff 162716. Aug 27 2018, 11:21 AM
RKSimon retitled this revision from [X86][AVX2] Prefer VPBLENDW+VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles to [X86][AVX2] Prefer VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles.
RKSimon edited the summary of this revision. (Show Details)

Limit to only one PBLENDW until we have shuffle combining support for VSELECT.

This is looking pretty good as far as the optimization you're aiming for goes. Most of my comments are on only semi-related things that happen to be in the diffs, e.g. we end up picking vpermw now instead of blending.

test/CodeGen/X86/insertelement-ones.ll
312 (On Diff #162716)

We should have another testcase that blends something other than -1, because the best way to do this blend with three -1 elements is with an OR: OR with -1 produces -1 regardless of the previous contents, and OR with 0 is a no-op, so the OR acts as a blend.
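A minimal sketch of that trick (element positions made up; not the actual testcase):

#include <immintrin.h>

// Blending the constant -1 into words 0-2 needs no blend instruction at all:
// x | -1 == -1 and x | 0 == x, so a single OR with a constant vector does it.
static __m256i insert_ones_low3(__m256i x) {
  const __m256i ones_mask = _mm256_set_epi16(0, 0, 0, 0, 0, 0, 0, 0,
                                             0, 0, 0, 0, 0, -1, -1, -1);
  return _mm256_or_si256(x, ones_mask);
}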

I don't have numbers on loading a constant vs. a couple extra uops outside a loop. Obviously any time we have a loop that will either keep a constant hot in cache, or let us hoist into a reg, this is a very nice win.

https://godbolt.org/z/JNv5VZ shows that this works: a manually optimized version of the function gives the same result for constant-propagation.

vpor   .LCPI1_0(%rip), %ymm0, %ymm0
ret

clang actually used vorps, but that can only run on port 5 before Skylake. I used -march=haswell, so compiling _mm256_or_si256 to vorps (port 5) instead of vpor (port 0/1/5) is really silly for an integer vector. (SKL lets vorps run on any port, with latency between FP instructions dependent on which port it actually picks. But I compiled with -march=haswell, and this is a poor choice for HSW.)

Without AVX, por is 1 byte longer than orps, but even then por can be worth it on pre-Skylake depending on the surrounding code (port 5 pressure, and/or if there's any ILP for this blend). Also with Hyperthreading, uops that can be assigned to any port are more likely to be able to take full advantage of the extra ILP exposed by SMT, vs. potentially having both threads bottleneck together on the same port.

test/CodeGen/X86/vector-shuffle-256-v32.ll
436 (On Diff #162716)

vpermw costs 2 shuffle uops and 4c latency on SKX, so it's implemented internally as a lane-crossing + in-lane shuffle. Some future CPU might make it single-uop, though.

If we need a vpshufb anyway, can we use a wider-granularity shuffle like vpermd (single uop), using a vector constant there? I guess immediate vpermq isn't super helpful.

437 (On Diff #162716)

movw $1, %ax has a false dependency on the old value of RAX. Saving 1 byte of code size is probably not worth it, unless benchmarks show that the code-density benefit to the uop cache / front-end matters more than the back-end false deps. But the false-dep effect can randomly range from sucking a lot (when the last use of RAX was a cache miss or a long dep chain) to negligible, so it's hard to benchmark.

(But at least this doesn't cause a decode stall, even though the operand-size prefix changes the length of the *rest* of the instruction because of the imm16. mov is exempt from LCP stalls on Sandybridge-family.)

I've noticed clang using narrow operand sizes (like 8-bit) in other cases when it's not gaining anything, and potentially hurting.

(TODO: report this as a separate bug; it's obviously unrelated to this.)

440 (On Diff #162716)

Could we have used merge-masking on the vpshufb control, %ymm0, %ymm1{%k1} with a mask of eax=3 (2 bytes) instead of eax=1 (1 word)? (I'm falling asleep here, so I can't tell right now.)

pcordes accepted this revision. Aug 28 2018, 11:01 PM
This revision is now accepted and ready to land. Aug 28 2018, 11:01 PM
This revision was automatically updated to reflect the committed changes.