This is an archive of the discontinued LLVM Phabricator instance.

[MC][X86] Automatic alignment for Macro-Op Fusion
AbandonedPublic

Authored by Amir on May 3 2021, 11:54 PM.

Details

Summary

Introduce an option x86-align-for-macrofusion to prevent a pair of
macro-fusion eligible instructions from being split by a given alignment
boundary by automatically padding the first instruction in a pair with
a minimal size nop.

In effect, it ensures that a pair of macro-fusible instructions is not split by
a cache line boundary, which is a precondition for macro-op fusion in
modern Intel Cores (see Intel Architecture Optimization Reference Manual,
2.3.2.1 Legacy Decode Pipeline: Macro-Fusion).

Diff Detail

Event Timeline

Amir created this revision.May 3 2021, 11:54 PM
Amir requested review of this revision.May 3 2021, 11:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2021, 11:54 PM
Amir updated this revision to Diff 342664.May 4 2021, 12:54 AM
Amir edited the summary of this revision. (Show Details)

Addressed clang-format warnings

skan added inline comments.May 4 2021, 7:43 PM
llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h
364–367

The extension is definitely wrong. BranchFused indicates fused macro fusion pairs, why do you add something like "AlignMacroFusionCmp, AlignMacroFusionBranch" .

skan added a comment.May 4 2021, 7:49 PM

In D97982, NeverAlign fragment type is introduced to prevent macro fusion pairs from ending at a specified boundary. I am confused why you are trying to reuse the BoundaryAlign fragment here do the same thing. Maybe I do not see the big pitcture, so could you clarify you design?

Amir added a comment.May 5 2021, 4:37 PM

@skan:
The big picture is that NeverAlign fragment insertion is controlled by a client (BOLT), while BoundaryAlign insertion is done automatically. This automatic macro-fusion alignment might be useful as a standalone performance optimization.
This diff leverages the infrastructure to detect eligible macro-fusion pairs and insert BoundaryAlign fragment. I've added logic on top of it to perform macro-fusion alignment when that is requested by an option.

Amir updated this revision to Diff 343238.May 5 2021, 4:42 PM

Addressed the comment by @skan:

  • removed newly added BranchKinds, reused existing ones
Amir marked an inline comment as done.May 5 2021, 4:44 PM
Amir added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h
364–367

I wanted to avoid overloading the existing types with different actions based on them: ie JCC erratum would have used existing BranchBoundaryKinds, while auto MF alignment would use new ones. But I guess it's OK to reuse and make the action depend on a cl opt.

Amir marked an inline comment as done.May 5 2021, 4:45 PM
skan added inline comments.May 6 2021, 6:39 PM
llvm/include/llvm/MC/MCFragment.h
563–568

The comments here is weired after you added the new usage to the fragment, you need to refine it.

llvm/lib/MC/MCAssembler.cpp
1086–1088

uint64_t NewSize = 0 is enough, we used U suffix before because of the conditional operator.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
80–81

The words "cmp+jcc" and "falling between" seem not accurate.

166–173

In fact, X86AlignForMacroFusion is compatible with X86AlignBranchWithin32BBoundaries, you can set AlignBoundary to 64 and add AlignBranchFused to AlignBranchType if both of the options exist.

Amir updated this revision to Diff 343813.May 7 2021, 10:40 PM

Addressed comments by @skan

Amir marked 2 inline comments as done.May 7 2021, 10:52 PM
Amir added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
166–173

No, they're semantically incompatible: X86AlignBranchWithin32BBoundaries would prevent macrofusion pair from crossing the 32B boundary (stronger alignment restriction), while X86AlignForMacroFusion would prevent macrofusion pair being perfectly split by 64B boundary.
So if X86AlignForMacroFusion is on, it's known not to satisfy X86AlignBranchWithin32BBoundaries restrictions.
(Conversely, if X86AlignBranchWithin32BBoundaries is on, X86AlignForMacroFusion is satisfied automatically).

They're further incompatible by the fact that X86AlignForMacroFusion flag enables the new BoundaryAlign behavior isAvoidEndAlign which performs the same alignment as NeverAlign.

Amir marked 2 inline comments as done.May 7 2021, 10:54 PM
Amir added inline comments.
llvm/include/llvm/MC/MCFragment.h
563–568

The comments are up-to-date. Please take a closer look at AvoidEndAlign flag usage below

Amir updated this revision to Diff 343814.May 7 2021, 11:00 PM
Amir marked an inline comment as done.

Addressed linter warning

skan added inline comments.May 7 2021, 11:30 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
166–173

"So if X86AlignForMacroFusion is on, it's known not to satisfy X86AlignBranchWithin32BBoundaries restrictions.
(Conversely, if X86AlignBranchWithin32BBoundaries is on, X86AlignForMacroFusion is satisfied automatically)."

The requirements of X86AlignForMacroFusion and X86AlignBranchWithin32BBoundaries can be met at the same time, so they are compatible.

isAvoidEndAlign is a detail about implementation and is not a good proof for the incompatibility.

Meanwhile, AlignBranchType, AlignBoundary can be also be set by options X86AlignBranchBoundary, X86AlignBranch. So the assert here is not correct.

Amir added inline comments.May 8 2021, 12:27 AM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
166–173

I see what you mean. We can handle X86AlignForMacroFusion first and then override it if X86AlignBranchWithin32BBoundaries is set as it's a stronger alignment requirement. Let me fix that.

Amir updated this revision to Diff 343867.May 8 2021, 2:57 PM

Handle X86AlignForMacroFusion option first.
Override it if X86AlignBranchWithin32BBoundaries is set.

Amir marked an inline comment as done.May 8 2021, 3:02 PM
Amir updated this revision to Diff 343870.May 8 2021, 3:19 PM

Addressed linter warning

skan added inline comments.May 10 2021, 8:51 PM
llvm/lib/MC/MCAssembler.cpp
1098–1100

Replace tab with space here.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
165–170
  1. You need test case for this
  2. I don't think it's correct to set value to an option in the constructor, maybe you need to define a class member
  3. Nest the if clauses here is incorrect. As I commented before, you should care about the fields set by options rather than the options themselves because the values like AlignBranchType, AlignBoundary are not set only by X86AlignBranchWithin32BBoundaries.

Maybe a simple solution is to add the following code to the end of the constructor

// Clean the alignment request
if (AlignBoundary == Align())
          AlignBranchType.clear();
else if (AlignBranchType)
         AlignBoundary = Align();

If(X86AlignForMacroFusion) {
   if(AlignBoundary == Align())
         AlignForMacroFusionOnly = true;
   if (AlignBoundary > Align(64) || AlignBoundary == Align()) {
          AlignBoundary = assumeAligned(64);
   } 
   AlignBranchType.addKind(X86::AlignBranchFused);         
}
661–662

Name IsBranchFused is quite misleading. Inst here is not a branch and may not be fused when IsBranchFused is true.

You need a better name here, maybe IsFirstMacroFusibleInstAndMayNeedAlign?

665–666

X86AlignForMacroFusion && IsBranchFused -> AlignForMacroFusionOnly
// Add some comments ...

Amir updated this revision to Diff 345063.May 13 2021, 12:45 AM
Amir marked 3 inline comments as done.

Addressed comments by @skan.

llvm/lib/MC/MCAssembler.cpp
1098–1100

Verified that these are spaces (I think Phabricator just shows with >> that the indentation changed)

skan added inline comments.May 13 2021, 12:53 AM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
195

Why do you clear the branch type here?

196–197

We need test to make sure unfused jcc is not aligned.

Amir abandoned this revision.Jun 11 2021, 3:42 PM

Intel folks have collected perf data for this optimization on SPECint17 workload:

SPECcpu2017
 
Compiler/optimizer:  bin/clang 
 
Overall score:              228
Breakdown
500.perlbench_r        235
502.gcc_r                    238
505.mcf_r                   165  
520.omnetpp_r         132
523.xalancbmk_r       190
525.x264_r                 465
531.deepsjeng_r       228
541.leela_r                 232
548.exchange2_r      423
557.xz_r                      162
 
Compiler/optimizer:  bin/clang -mllvm -x86-align-for-macrofusion
 
Overall score:              228
Breakdown
500.perlbench_r        235
502.gcc_r                    238
505.mcf_r                   164  
520.omnetpp_r         131
523.xalancbmk_r       190
525.x264_r                 464
531.deepsjeng_r       228
541.leela_r                 234
548.exchange2_r      422
557.xz_r                      161

I'll abandon this diff based on that data. This functionality has non-trivial compile-time overhead (as it relies on relaxation for extra fragments), but it doesn't bring any perf benefit to workloads represented by SPECint17.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
195

I think it doesn't make sense to have JCC erratum mitigation for the other branch types,