This is an archive of the discontinued LLVM Phabricator instance.

[LoopStrengthReduce, x86] don't add cost for a cmp that will be macro-fused (PR35681)
ClosedPublic

Authored by spatel on Jan 26 2018, 3:31 PM.

Details

Summary

I think there would be a small code size win from this change and possibly a slight perf win, but I don't have a representative benchmarking system to test that theory. I figure it's worth posting this patch to get feedback and let others give it a try if they're interested. If you have access to SPEC or other standard benchmarks, I'd be most grateful to know if it helps.

The irony is that AMD Jaguar apparently does not have macro-fusion, so the target I was hoping to help the most is excluded from consideration...

In the motivating case from PR35681, which is represented by the new test in this patch:
https://bugs.llvm.org/show_bug.cgi?id=35681
...there's a 37 -> 31 byte size win for the loop because we eliminate the big base address offsets.
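
To make the mechanism concrete, here is a minimal C++ sketch of the cost-model idea only (illustrative, not the code in this patch; the hook name and cost fields are assumptions): when rating an LSR formula, an explicit compare before the loop-ending branch should not be charged as an extra instruction if the target will macro-fuse the cmp with the branch.

// Minimal sketch of the idea -- not the actual LLVM patch. The parameter
// TargetMacroFusesCmpBr and the cost fields are illustrative names.
struct LoopCost {
  unsigned Insns = 0;     // instructions executed inside the loop body
  unsigned BaseAdds = 0;  // extra base-address arithmetic / large offsets
};

LoopCost rateFormula(bool NeedsSeparateCmp, unsigned BaseAdds,
                     bool TargetMacroFusesCmpBr) {
  LoopCost C;
  C.BaseAdds = BaseAdds;
  // Keeping an explicit cmp only costs a real slot if the CPU does not fuse
  // cmp+jcc into a single op; otherwise treat it as free, which lets LSR
  // prefer small constant offsets plus a cmp over large folded offsets.
  if (NeedsSeparateCmp && !TargetMacroFusesCmpBr)
    ++C.Insns;
  return C;
}

In the real change the decision would come from a target query, so only subtargets that actually fuse cmp+branch get the free compare.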

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Jan 26 2018, 3:31 PM
qcolombet accepted this revision. Jan 29 2018, 11:00 AM
qcolombet added a subscriber: qcolombet.

LGTM

Couple of nits on the test case.

test/Transforms/LoopStrengthReduce/X86/macro-fuse-cmp.ll
3 (On Diff #131663)

Could you also make an IR-to-IR test with opt -loop-reduce?

57 (On Diff #131663)

Could you run instnamer on the test case?

This revision is now accepted and ready to land. Jan 29 2018, 11:00 AM

It is not obvious that constants in addresses can hurt performance.
I'll run the patch on the benchmarks I have. However, I believe the performance issue in PR35681 is about complicated addresses, which limit the execution ports available for stores.

Hi,

I agree that removing the lengthy (9-byte) instructions and reducing the size of the loop is good. But on the performance side, I need to run some tests.

spatel updated this revision to Diff 132009. Jan 30 2018, 11:32 AM
spatel marked 2 inline comments as done.

Patch updated (no code changes; improved the new test file based on Quentin's feedback):

  1. Give the IR values real names, so the test is easier to understand.
  2. Add IR output testing, so we see what that difference looks like independent of anything else in the x86 backend.

Thanks @evstupac and @venkataramanan.kumar.llvm for the feedback. I micro-benchmarked the maxArray code in its -O2 unrolled form from PR28343 (see below for the object dumps with address offsets), but as expected, I can't measure any perf difference on Haswell.

It does shrink the loop from 134 bytes to 97 bytes though.
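
For context, the maxArray kernel from PR28343 looks roughly like the following C++ (a reconstruction for illustration; treat the exact signature and trip count as assumptions inferred from the dumps below):

// Approximate maxArray loop from PR28343 (reconstruction; the signature and
// the trip count of 65536 are assumptions). At -O2 it vectorizes and unrolls
// into the movupd/maxpd sequences shown in the dumps below.
void maxArray(double *__restrict x, double *__restrict y) {
  for (int i = 0; i < 65536; i++)
    if (y[i] > x[i])
      x[i] = y[i];
}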

Baseline (doesn't account for macro-fusion, so fewer instructions are considered better):

0000000100001000	movq	$-0x80000, %rax
0000000100001007	nopw	(%rax,%rax)
0000000100001010	movupd	0x80000(%rsi,%rax), %xmm0
0000000100001019	movupd	0x80000(%rdi,%rax), %xmm1
0000000100001022	maxpd	%xmm1, %xmm0
0000000100001026	movupd	0x80010(%rdi,%rax), %xmm1
000000010000102f	movupd	0x80020(%rdi,%rax), %xmm2
0000000100001038	movupd	0x80030(%rdi,%rax), %xmm3
0000000100001041	movupd	%xmm0, 0x80000(%rdi,%rax)
000000010000104a	movupd	0x80010(%rsi,%rax), %xmm0
0000000100001053	maxpd	%xmm1, %xmm0
0000000100001057	movupd	%xmm0, 0x80010(%rdi,%rax)
0000000100001060	movupd	0x80020(%rsi,%rax), %xmm0
0000000100001069	maxpd	%xmm2, %xmm0
000000010000106d	movupd	0x80030(%rsi,%rax), %xmm1
0000000100001076	movupd	%xmm0, 0x80020(%rdi,%rax)
000000010000107f	maxpd	%xmm3, %xmm1
0000000100001083	movupd	%xmm1, 0x80030(%rdi,%rax)
000000010000108c	addq	$0x40, %rax
0000000100001090	jne	0x100001010
0000000100001096	retq

Use macro-fusion in the cost calculation (one extra instruction, but it allows smaller constant offsets):

0000000100001000	xorl	%eax, %eax
0000000100001002	nopw	%cs:(%rax,%rax)
000000010000100c	nopl	(%rax)
0000000100001010	movupd	(%rsi,%rax,8), %xmm0
0000000100001015	movupd	0x10(%rsi,%rax,8), %xmm1
000000010000101b	movupd	(%rdi,%rax,8), %xmm2
0000000100001020	maxpd	%xmm2, %xmm0
0000000100001024	movupd	0x10(%rdi,%rax,8), %xmm2
000000010000102a	maxpd	%xmm2, %xmm1
000000010000102e	movupd	0x20(%rdi,%rax,8), %xmm2
0000000100001034	movupd	0x30(%rdi,%rax,8), %xmm3
000000010000103a	movupd	%xmm0, (%rdi,%rax,8)
000000010000103f	movupd	%xmm1, 0x10(%rdi,%rax,8)
0000000100001045	movupd	0x20(%rsi,%rax,8), %xmm0
000000010000104b	maxpd	%xmm2, %xmm0
000000010000104f	movupd	0x30(%rsi,%rax,8), %xmm1
0000000100001055	maxpd	%xmm3, %xmm1
0000000100001059	movupd	%xmm0, 0x20(%rdi,%rax,8)
000000010000105f	movupd	%xmm1, 0x30(%rdi,%rax,8)
0000000100001065	addq	$0x8, %rax
0000000100001069	cmpq	$0x10000, %rax
000000010000106f	jne	0x100001010
0000000100001071	retq

I ran SPEC2017 (C/C++ benchmarks) on Ryzen with -O2 -fno-unroll-loops. There is no significant change in performance with the patch.

Great - thank you for checking that. If there are no objections, I'll commit this so we get wider testing (or I can wait for more data to make sure there are no regressions).

That will allow more experimentation with other LSR changes (that might have more impact) without worrying about this case.

This revision was automatically updated to reflect the committed changes.