This is an archive of the discontinued LLVM Phabricator instance.

[private] Add min_vector_width function attribute. Use it to annotate all of the x86 intrinsic header files. Emit an attribute in IR
AbandonedPublic

Authored by craig.topper on May 10 2018, 11:56 AM.

Details

Summary

I believe this is what we talked about on IRC. If you would prefer a different name for the attribute, I'm happy to change it.

There are a bunch of intrinsics that are implemented as macros, so there's no place to put the attribute other than in the user's code. But that would be a compatibility issue. Should use of a target-specific builtin automatically imply the minimum vector width required by that builtin?

There are also a bunch of intrinsics that are implemented as macros using target-independent builtins like __builtin_shufflevector. For those intrinsics I have no way of expressing a vector width requirement, unless we want to make __builtin_shufflevector require the vector width of its arguments. Or we could make x86-specific builtins that generate the correct shuffles using custom code in CGBuiltin.cpp.

How do we want to encode the required vector width for a builtin? Do we want to parse the type string for the builtins and infer it from that? Or should we encode it in the attribute string for the builtin?

Diff Detail

Event Timeline

craig.topper created this revision.May 10 2018, 11:56 AM

I suspect you want at least one of Reid or Richard to look at this from the Clang side, but this is definitely the direction I was thinking.

No strong opinions about the name of the attribute here.

lib/Headers/avx2intrin.h
32–33

Does it make sense to call this 256?

I'd really like to discuss this. I was speaking with Ori (cc'd) the other day, and he doesn't think vector width is sufficient to account for some of the various slowdowns. I'll let him speak up, but we may need to rethink some of this.

atdt added a comment.May 21 2018, 11:10 AM

I'd really like to discuss this. I was speaking with Ori (cc'd) the other day, and he doesn't think vector width is sufficient to account for some of the various slowdowns. I'll let him speak up, but we may need to rethink some of this.

I don't have all the context, but since Eric is looping me in, I assume this change is in some way related to the work of adding a -mprefer-avx128 flag, or some other mechanism that would prevent LLVM's vectorizers from using power-hungry vector instructions that can hurt performance by forcing the CPU into a lower frequency band. If this is the intent, then Eric is correct: an instruction's register width is not a reliable proxy for its power requirements.

While it is true that none of the 128-bit AVX instructions are dangerous, neither are the majority of 256-bit AVX instructions.

My conclusions are based on analysis of each AVX and AVX2 instruction, in which I measured the maximum ratio of CPU cycles to reference cycles while the processor is executing the instruction in a tight loop. A higher ratio indicates more "extra" cycles delivered by TurboBoost. I found that only a subset of 256-bit instructions were associated with frequency reductions.

On Haswell, the instructions that cause frequency reductions include most (but not all) floating-point operations, as well as integer multiplication. On Broadwell, a few more floating-point operations become "safe". On Skylake I haven't observed loss of Turbo cycles with any 256-bit instructions except VMOV* (store).

https://docs.google.com/a/google.com/spreadsheets/d/e/2PACX-1vSqPiY9GZStOzH7yCgmHIqOzuazDrdDm0ORu127Eauu16IhHEqaQvwz243fBZN13IO0LN2b4fggJMMr/pubhtml?gid=1023319529&single=true

The smattering of publicly-available information on this topic agrees with my findings. The descriptions Intel provided for the new Turbo license PMU events on Skylake X include references to "high-current" and "low-current" AVX 256-bit code. Something like "-mprefer-low-current-avx" seems like the right approach here, but the set of instructions it disables would have to vary somewhat by architecture if it is to be precise (i.e., not disable more instructions than necessary).

@atdt How can I access that document?

My main focus was on preventing 512-bit instructions on Skylake X. Do you have measurements for that?

atdt added a comment.May 21 2018, 11:40 AM

@craig.topper, I don't have measurements for AVX-512, but the Intel-provided descriptions for the CORE_POWER.LVL#_TURBO_LICENSE PMU events on Skylake X imply that "low current AVX 512-bit instructions" execute with power-delivery license level 1, whereas "high-current" 512-bit instructions execute with license level 2. So it sounds like the low-current / high-current distinction applies to AVX-512 as well.

I suspect you want at least one of Reid or Richard to look at this from the Clang side, but this is definitely the direction I was thinking.

Sorry, I have no context for this, and the description of this review doesn't seem to provide any either. Can someone explain what this patch is about?

Sorry @rsmith, this patch was created only to go to Chandler for a preliminary review. I didn't expect it to get a wider audience.

My ultimate goal is to stop the backend from generating 512-bit vector instructions on Skylake Server, to avoid a frequency penalty, unless the user passes -mprefer-vector-width=512 or explicitly uses one of the intrinsics that corresponds to a 512-bit instruction. If we don't see one of those things, the backend type legalizer will be told that 512-bit types are illegal and will split any IR with types wider than 256 bits. This allows the vectorizer to use wider types, as it already does with AVX2, and rely on the type legalizer to make everything fit into a legal vector size. But if the user explicitly wrote 512-bit code, we need to allow it to be legalized to 512 bits.

To capture the use of 512-bit intrinsics, Chandler proposed on IRC that I add a new function attribute that we can tag all the intrinsics with, which will then be passed along to an IR attribute. This patch adds that attribute and annotates the intrinsics I could. I still had questions about the macro intrinsics, which is why I created this review for Chandler.

hsaito added a subscriber: hsaito.May 21 2018, 2:20 PM

@atdt and @echristo, the hypothetical -mprefer-low-current-avx[=<uarch>] would certainly be more precise, but that comes with a much higher design/implementation cost in the areas of the TTI/vectorizer cost model, the vector type legalizer, and CodeGen, at least.

-mprefer-vector-width=256 provides a common-denominator version of the more precise flag at a much lower implementation cost: it is readily available (subject to community review approval), easy for programmers to understand, and serves much of the purpose for many programmers, at least until the -mprefer-low-current-avx alternative becomes viable, if ever.

atdt added a comment.May 22 2018, 9:11 AM

@atdt and @echristo, the hypothetical -mprefer-low-current-avx[=<uarch>] would certainly be more precise, but that comes with a much higher design/implementation cost in the areas of the TTI/vectorizer cost model, the vector type legalizer, and CodeGen, at least.

-mprefer-vector-width=256 provides a common-denominator version of the more precise flag at a much lower implementation cost: it is readily available (subject to community review approval), easy for programmers to understand, and serves much of the purpose for many programmers, at least until the -mprefer-low-current-avx alternative becomes viable, if ever.

Makes sense. I agree that -mprefer-vector-width=256 would be an improvement.

Tag every X86 vector builtin with a new attribute to indicate its required vector width. I thought about inferring it from the type string, but this makes it explicit and avoids affecting other targets.

If a builtin is used by a function, we will now infer that the required vector width is at least as large as the builtin requires. This way existing user code won't have to change to adopt this vector-width concept.

Rebase due to recent builtin and intrinsic changes

Rebase the builtins file again

I think this looks really nice.

The only other big thing I think we should add is some docs. Because of the somewhat subtle semantics here, I think it'll be important to explain the programming model to people: everything you test for, and that essentially the maximum of the builtin/intrinsic you use and the attribute you specify wins.

Honestly, we'll even want to discuss the inlining strategy here to make it really clear how this will work in practice.

I'm OK if you want that in a separate follow-up patch of course, I just want to make sure we get there.

lib/Basic/Builtins.cpp
111

To avoid confusion with returning the actual width, maybe WidthPos? Or just Pos?

lib/CodeGen/CodeGenFunction.cpp
1195–1196

FWIW, I have no idea. I'd ask Eric or Richard.

lib/Headers/avx512fintrin.h
176–177

Want this to be suffixed with 512? No strong opinion here.

Rebase the builtins file again. Address Chandler's comments.

I intend to post a real review that CCs cfe-commits. This was originally intended to get initial feedback from Chandler before going to the full community.

craig.topper abandoned this revision.Sep 10 2018, 7:35 PM
include/clang/Basic/Builtins.def