Page MenuHomePhabricator

[X86] Prefer 512-bit vectors on Ice Lake Server cpus (PR48336)
AbandonedPublic

Authored by RKSimon on Oct 3 2021, 11:13 AM.

Details

Summary

This patch proposes we drop the TuningPrefer256Bit tuning flag on Ice Lake Server targets and later.

It doesn't appear that IceLake and later CPUs have the drastic frequency scaling issues with 512-bit vector usage, unlike earlier AVX512-capable targets: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

After internal testing by @pengfei I limited this to just the Server variant - Ice Lake Client, Rocket Lake and Tiger Lake targets still prefer 256-bit vectors at this time.

Diff Detail

Event Timeline

RKSimon created this revision.Oct 3 2021, 11:13 AM
RKSimon requested review of this revision.Oct 3 2021, 11:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 3 2021, 11:13 AM

Probably good for -march=icelake-client. We need to check on -march=icelake-server (launched April 2021), ideally trying it with some benchmarks where only some parts of the code can auto-vectorize.

That's always the concern: one small part of a problem hurting frequency for the rest. Or for one short-lived process lowering frequency for other processes on the same physical core. Server CPUs are maybe more likely than client to have use-cases where multiple programs are running with some waking up for a short interval, so having one program hurt CPU frequency for everything else is maybe more likely to be a concern there.

Any frequency change at all has a cost in stopping the CPU clock until the new voltage settles (like 8.5 microseconds = 8500 nanoseconds), but yes, it seems on those client chips, there isn't one between scalar and light-512 or even heavy-512, at least with multiple cores active. I think the HW power management keeps the CPU at the new frequency for long enough that the worst case isn't a huge number of transitions per second, though.
https://stackoverflow.com/questions/45472147/lost-cycles-on-intel-an-inconsistency-between-rdtsc-and-cpu-clk-unhalted-ref-ts re: clock halt time on switching frequency on Skylake.

Travis's article didn't have test results from -march=icelake-server Xeon chips; that's something to check on.
If they still only have one 512-bit FMA unit per core, the max power for FMA-heavy workloads (the most power-intensive thing you can do) is not much higher than with AVX2 keeping both halves of the FMA unit busy separately. (But wider cache accesses can mean a bit more power). Anyway, there's reason to hope that Ice Lake server might be similarly not too badly affected.

Having any 512-bit uops in flight (or something like that) also shuts down the vector ALU on port 1. But AFAIK that recovers quickly, and is usually still worth it. For something that bottlenecks on SIMD integer stuff like bitwise booleans, 2x work per uop * 2/3 uop throughput is still a factor of 4/3 = 1.333 speedup for booleans/vpternlogd, and stuff like integer addition. And most things will hopefully benefit more than that, and having scalar integer and load/store uops in the pipeline often means that you don't bottleneck just on back-end ALU ports. (Wider loads/stores often means fewer other uops, so helping with other bottlenecks, and out-of-order exec can see farther).

For something that bottlenecks on front-end throughput, we could expect more like 2x speedup.

Ice Lake has a shuffle unit on port 1 that can handle some common 128/256-bit shuffles like vpshufb ymm, but not zmm. If auto-vectorization ends up needing a lot of shuffling, 256-bit might be a win. And also, with really wide vectors, it's easier for a bad vectorization strategy to get really bad.

Matt added a subscriber: Matt.Oct 6 2021, 9:16 AM

Our internal data is not in favor of the patch, but we are collecting more data.

Our internal data is not in favor of the patch, but we are collecting more data.

If its different for particular cpus we do have the option of splitting the tuning flags down further to icelake-client, icelake-server, rocketlake and tigerlake,

Our internal data is not in favor of the patch, but we are collecting more data.

If its different for particular cpus we do have the option of splitting the tuning flags down further to icelake-client, icelake-server, rocketlake and tigerlake,

I agree, that's the data we need to collect too.

Also forgot to mention, 64-byte vectors are more sensitive to alignment, even when data isn't hot in L1d cache. e.g. loops over data coming from DRAM or maybe L3 are about 15% to 20% slower with misaligned loads IIRC, vs. only a couple % for AVX2. At least this was the case on Skylake-SP; IDK about client chips with AVX-512.

So the usual optimistic strategy of using unaligned loads but not spending any extra instructions to reach an alignment boundary might not be the best choice for some loops with 512-bit vectors.

Going scalar until an alignment boundary is pretty terrible, especially for "vertical" operations like a[i] *= 3.0 or something that means it's ok to process the same element twice, as long as any reads are before any potentially overlapping stores. e.g.

  • load a first vector
  • round the pointer up to the next alignment boundary with add reg, 64 / and reg, -64
  • load the first-iteration loop vector (peeled from first iteration)
  • store the first (unaligned) vector
  • enter a loop that ends on a pointer-compare condition.
  • cleanup that starts with the final aligned vector loaded and processed but not stored yet

If the array already was aligned, there's no overlap. For short arrays, AVX-512 masking can be used to avoid reading or writing past the end, generating masks on the fly with shlx or shrx.

Anyway, this is obviously much better than going scalar until an alignment boundary, in loops where we can sort out aliasing sufficiently, and where there's only one pointer to worry about so relative misalignment isn't a factor. In many non-reductions, there are at least pointers so it may not be possible to align both.

An efficient alignment strategy like this might help make vector width = 512 worth it for more code which doesn't take care to align its arrays. Clearly that should be a separate feature-request / proposal if there isn't one open for that already; IDK how hard it would be to teach LLVM (or GCC) that an overlapping vectors strategy can be good, or if it's just something that nobody's pointed out before.

Vector ISAs like ARM SVE and I think RISC-V's planned one have good HW support for generating masks from pointers and stuff like that, but it can be done manually especially in AVX-512 with mask registers.

Thanks @pcordes - I've raised https://bugs.llvm.org/show_bug.cgi?id=52348 about better unaligned handling in vectorization.

I think we're still waiting on comparative feedback on the various archs, and without access to hw I'm not sure if there's anything more I can do to help?

pengfei added a subscriber: wxiao3.Apr 5 2022, 7:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 5 2022, 7:12 AM
Herald added a subscriber: StephenFan. · View Herald Transcript

@pengfei Do you have any update on this please?

@pengfei Do you have any update on this please?

Hi @RKSimon, our internal testing shows no obvious improvement in typical benchmarks on Alderlake. Considering the negative side, we think it’s not good for client products.

OK, I'll rework this just for icelake-server

RKSimon updated this revision to Diff 437481.Jun 16 2022, 2:42 AM
RKSimon retitled this revision from [X86] Prefer 512-bit vectors on Ice/Rocket/TigerLake (PR48336) to [X86] Prefer 512-bit vectors on Ice Lake Server cpus (PR48336).
RKSimon edited the summary of this revision. (Show Details)

Updated to just target Ice Lake Server

I think in general, we would like to prefer 512-bit vector on server. But notice the changes on arguments passing, which is ABI breaking. We need some downstream work before ready for the change. Please hold for a while, thanks!

RKSimon planned changes to this revision.Jun 16 2022, 3:37 AM

@pengfei reverse-ping

We are still working on it. Will update the status by end of this week.

Hi @RKSimon , our internal tests showed ~1% geomean degradation in spec 2017 on Ice Lake Server compared to 256-bit vectors. So I'd suggest we still prefer 256-bit vectors for it.

RKSimon abandoned this revision.Sun, Jul 24, 7:03 AM

Thanks for the confirmation