The greedy register allocator occasionally decides to insert a large number of
unnecessary copies; see below for an example. The -consider-local-interval-cost
option (which X86 already enables by default) fixes this. We enable this option
for AArch64 only, after receiving feedback that the change is not beneficial
for PowerPC.
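
For reference, the knob itself is a hidden command-line option inside the
greedy allocator. A rough sketch of the declaration in RegAllocGreedy.cpp
(abbreviated, not the verbatim source):

  static cl::opt<bool> ConsiderLocalIntervalCost(
      "consider-local-interval-cost", cl::Hidden,
      cl::desc("Consider the cost of local intervals created by a split "
               "candidate when choosing the best split candidate."),
      cl::init(false));

Note that the option itself defaults to false; the per-target default comes
from the enableAdvancedRASplitCost() subtarget hook described below.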
We evaluated the impact of this change on compile time, code size and
performance benchmarks.
The impact on compile time, measured on CTMark, is small: a 0.1% geomean
regression at -O1 and -O2, and 0.2% at -O3, with at most 0.5% on individual
benchmarks.
The effect on both code size and performance on AArch64 for the LLVM test suite
is nil on the geomean, with individual outliers (ignoring short exec_times)
between:

                 best     worst
  size..text    -3.3%    +0.0%
  exec_time     -5.8%    +2.3%
On SPEC CPU® 2017 (compiled for AArch64) there is a minor reduction (-0.2% at
most) in code size on some benchmarks, with a tiny movement (-0.01%) on the
geomean. Neither intrate nor fprate shows any change in performance.
This patch makes the following changes (sketched in the fragment below):
- For the AArch64 target, enableAdvancedRASplitCost() now returns true.
- An explicit -consider-local-interval-cost=false on the command line can
  still disable the new behaviour if necessary.
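
A minimal sketch of the two changes, assuming the same shape as the existing
X86 override (surrounding code abbreviated):

  // AArch64Subtarget.h (sketch): opt in to the advanced split-cost
  // analysis, mirroring the existing X86 override.
  bool enableAdvancedRASplitCost() const override { return true; }

  // RegAllocGreedy.cpp (sketch): a value passed explicitly on the
  // command line takes precedence over the subtarget hook, which is
  // what keeps -consider-local-interval-cost=false working.
  EnableAdvancedRASplitCost =
      ConsiderLocalIntervalCost.getNumOccurrences()
          ? ConsiderLocalIntervalCost
          : MF->getSubtarget().enableAdvancedRASplitCost();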
This matrix multiply example:
$ cat test.c
long A[8][8];
long B[8][8];
long C[8][8];

void run_test() {
  for (int k = 0; k < 8; k++) {
    for (int i = 0; i < 8; i++) {
      for (int j = 0; j < 8; j++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}
results in the following generated code on AArch64:
$ clang --target=aarch64-arm-none-eabi -O3 -S test.c -o -
[...]
                                        // %for.cond1.preheader
                                        // =>This Inner Loop Header: Depth=1
        add     x14, x11, x9
        str     q0, [sp, #16]           // 16-byte Folded Spill
        ldr     q0, [x14]
        mov     v2.16b, v15.16b
        mov     v15.16b, v14.16b
        mov     v14.16b, v13.16b
        mov     v13.16b, v12.16b
        mov     v12.16b, v11.16b
        mov     v11.16b, v10.16b
        mov     v10.16b, v9.16b
        mov     v9.16b, v8.16b
        mov     v8.16b, v31.16b
        mov     v31.16b, v30.16b
        mov     v30.16b, v29.16b
        mov     v29.16b, v28.16b
        mov     v28.16b, v27.16b
        mov     v27.16b, v26.16b
        mov     v26.16b, v25.16b
        mov     v25.16b, v24.16b
        mov     v24.16b, v23.16b
        mov     v23.16b, v22.16b
        mov     v22.16b, v21.16b
        mov     v21.16b, v20.16b
        mov     v20.16b, v19.16b
        mov     v19.16b, v18.16b
        mov     v18.16b, v17.16b
        mov     v17.16b, v16.16b
        mov     v16.16b, v7.16b
        mov     v7.16b, v6.16b
        mov     v6.16b, v5.16b
        mov     v5.16b, v4.16b
        mov     v4.16b, v3.16b
        mov     v3.16b, v1.16b
        mov     x12, v0.d[1]
        fmov    x15, d0
        ldp     q1, q0, [x14, #16]
        ldur    x1, [x10, #-256]
        ldur    x2, [x10, #-192]
        add     x9, x9, #64             // =64
        mov     x13, v1.d[1]
        fmov    x16, d1
        ldr     q1, [x14, #48]
        mul     x3, x15, x1
        mov     x14, v0.d[1]
        fmov    x17, d0
        mov     x18, v1.d[1]
        fmov    x0, d1
        mov     v1.16b, v3.16b
        mov     v3.16b, v4.16b
        mov     v4.16b, v5.16b
        mov     v5.16b, v6.16b
        mov     v6.16b, v7.16b
        mov     v7.16b, v16.16b
        mov     v16.16b, v17.16b
        mov     v17.16b, v18.16b
        mov     v18.16b, v19.16b
        mov     v19.16b, v20.16b
        mov     v20.16b, v21.16b
        mov     v21.16b, v22.16b
        mov     v22.16b, v23.16b
        mov     v23.16b, v24.16b
        mov     v24.16b, v25.16b
        mov     v25.16b, v26.16b
        mov     v26.16b, v27.16b
        mov     v27.16b, v28.16b
        mov     v28.16b, v29.16b
        mov     v29.16b, v30.16b
        mov     v30.16b, v31.16b
        mov     v31.16b, v8.16b
        mov     v8.16b, v9.16b
        mov     v9.16b, v10.16b
        mov     v10.16b, v11.16b
        mov     v11.16b, v12.16b
        mov     v12.16b, v13.16b
        mov     v13.16b, v14.16b
        mov     v14.16b, v15.16b
        mov     v15.16b, v2.16b
        ldr     q2, [sp]                // 16-byte Folded Reload
        fmov    d0, x3
        mul     x3, x12, x1
[...]
With -consider-local-interval-cost enabled, the same section of code results in
the following:
$ clang --target=aarch64-arm-none-eabi -mllvm -consider-local-interval-cost -O3 -S test.c -o -
[...]
.LBB0_1:                                // %for.cond1.preheader
                                        // =>This Inner Loop Header: Depth=1
        add     x14, x11, x9
        ldp     q0, q1, [x14]
        ldur    x1, [x10, #-256]
        ldur    x2, [x10, #-192]
        add     x9, x9, #64             // =64
        mov     x12, v0.d[1]
        fmov    x15, d0
        mov     x13, v1.d[1]
        fmov    x16, d1
        ldp     q0, q1, [x14, #32]
        mul     x3, x15, x1
        cmp     x9, #512                // =512
        mov     x14, v0.d[1]
        fmov    x17, d0
        fmov    d0, x3
        mul     x3, x12, x1
[...]