This is an archive of the discontinued LLVM Phabricator instance.

Fix X86 regression on linpack
Needs ReviewPublic

Authored by evstupac on Oct 5 2017, 7:37 PM.

Details

Summary

After setting LSR instructions number priority number 1 for X86, I've received several regression (on of them from Wei Mi) on the tests like this:

for (int i = 0; i < n; i++) 
  y[i] += c * x[i];
// y[i], c, x[i] are float

The reason was in too many complicated address calculations which conflict with other instructions.
The suggested heuristic limit number of complicated addresses, leaving all gains and avoiding the regression.

Diff Detail

Repository
rL LLVM

Event Timeline

evstupac created this revision.Oct 5 2017, 7:37 PM
qcolombet added inline comments.Nov 2 2017, 12:41 PM
lib/Target/X86/X86TargetTransformInfo.cpp
2528

Could you add a comment explaining the >> 3?

It does not make sense to count folded addresses as additional instructions on its own and the >> 3 makes it even more cryptic why we do that :).

test/Transforms/LoopStrengthReduce/X86/folded_addresses.ll
8

Could you add a description of what are the key things this function have and explain what they exercise in LSR?

9

Could you use opt -instnamer on the input test?

Hi Quentin,

Thanks for taking a look.
I'll address your comments.

Evgeny

lib/Target/X86/X86TargetTransformInfo.cpp
2528

Sure.
All folded addresses are somehow executed in CPU as micro-ops. Anyway this is a resource. The question is how costly they are. X86 instructions with folded address access can go to less ports than without. This limitation leads to regressions in some cases.
The heuristic here tries to address this. When we have too many folded addresses in solution we count them as an additional instruction(s). Ideally it should be more complicated analysis, taking in account other loop instructions - but it is harder to implement here.

">> 3" is a kind of average bound for all x86 CPUs. It could differ for -march=slm and -march=core-avx2 for example, however testings showed that ">> 3" is the best average.

evstupac updated this revision to Diff 144178.Apr 26 2018, 12:17 PM

Addressed comments. Made heuristic more precise.

RKSimon added a subscriber: RKSimon.
RKSimon added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
2538

This seems incredibly specific to a particular Intel architecture - you said you've tried to make a best average - based on what?

test/Transforms/LoopStrengthReduce/X86/folded_addresses.ll
9

Is there any way that this can be reduced further?

evstupac added inline comments.Jun 1 2018, 11:59 AM
lib/Target/X86/X86TargetTransformInfo.cpp
2538

I have not seen significant difference on Atom CPUs and have seen gains on Cores. It is hard to cover all CPUs. We can adjust this (by limiting to specific CPU(s)) if get a regression.

test/Transforms/LoopStrengthReduce/X86/folded_addresses.ll
9

Maybe we could get rid of vectorization, however now this is the test that got ~30% regression. It looks reasonable to track it.

Is the intent here to limit folding stores? If so, could you update the summary to reflect that? And probably add a comment to that effect as well.

lib/Target/X86/X86TargetTransformInfo.cpp
2538

This is obviously a larger change but it would seem to be useful to tie into the scheduler model and query what can execute on which ports.