This is an archive of the discontinued LLVM Phabricator instance.

[LoongArch] Align functions and loops better according to uarch
ClosedPublic

Authored by xen0n on Apr 18 2023, 2:20 AM.

Details

Summary

The LA464 micro-architecture is very sensitive to alignment of hot code,
with performance variation of up to ~12% in the go1 benchmark suite of
the Go language (as observed by me during my work on the Go loong64
port).
Manual alignment of certain loops and automatic alignment of loop heads
helps a lot there, by reducing much of the random variation and
generally increasing performance, so we naturally want to do the same
here.

Practically, LA464 is the only LoongArch micro-architecture in wide use,
and we are currently supporting just that. The first "4" in "LA464"
stands for "4-issue", in particular its instruction fetch and decode
stages are 4-wide; so functions and branch targets should be preferably
aligned to at least 16 bytes for best throughput.

The Loongson team has benchmarked various combinations of function,
loop, and branch target alignments with GCC.
The results
show that "16-byte label alignment together with 32-byte function
alignment gives best results in terms of SPEC score". A "label" in GCC
means a branch target; while we don't currently align branch targets,
we do align loops, so in this patch we default to 32-byte function
alignment and 16-byte loop alignment.

Diff Detail

Event Timeline

xen0n created this revision.Apr 18 2023, 2:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2023, 2:20 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
xen0n requested review of this revision.Apr 18 2023, 2:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2023, 2:20 AM
xen0n edited the summary of this revision. (Show Details)Apr 18 2023, 2:43 AM
xen0n edited the summary of this revision. (Show Details)

Hmm, it seems to me setMaxBytesForAlignment sets "the maximum amount of padding allowed for aligning the BB", not the max loop size. Or am I missing something?

I'll do the similar thing for GCC but we've already missed GCC 13 deadline :(.

xen0n updated this revision to Diff 514582.Apr 18 2023, 2:54 AM

use STI consistently and fix one grammatical nit in comment

xen0n added a comment.Apr 18 2023, 2:59 AM

Hmm, it seems to me setMaxBytesForAlignment sets "the maximum amount of padding allowed for aligning the BB", not the max loop size. Or am I missing something?

I simply copied from AArch64 for that part. Maybe I should mention this in the commit message...

I'll do the similar thing for GCC but we've already missed GCC 13 deadline :(.

No worries, things happen, and it's not like everyone's gonna burn 2x more CPU cycles without such changes. I only discovered this because of surprising go1 benchmark results over and over again, getting 13% performance hits for a change that theoretically cannot increase dynamic instruction counts; it actually proves such fluctuations are largely unnoticed in everyday use.

Hmm, it seems to me setMaxBytesForAlignment sets "the maximum amount of padding allowed for aligning the BB", not the max loop size. Or am I missing something?

I simply copied from AArch64 for that part. Maybe I should mention this in the commit message...

I mean, I didn't see how the change only sets loop alignments for tight loops. Maybe I'm missing something, but with GCC there seems no way to limit -falign-loops only for tight loops.

Hmm, it seems to me setMaxBytesForAlignment sets "the maximum amount of padding allowed for aligning the BB", not the max loop size. Or am I missing something?

I agree. It should be the third expression described in https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_chapter/as_7.html#SEC113.

I mean, I didn't see how the change only sets loop alignments for tight loops. Maybe I'm missing something, but with GCC there seems no way to limit -falign-loops only for tight loops.

I have the same concern.

BTW, @xen0n Could you please add some dedicate tests for function alignment settings. For example:

; RUN: llc --mtriple=loongarch64 %s -o - | FileCheck %s
; RUN: llc --mtriple=loongarch64 -mcpu=la464 %s -o - | FileCheck %s

define void @foo() nounwind {
;CHECK: .p2align 4
;CHECK: foo:
entry:
  ret void
}
xen0n added a comment.Apr 18 2023, 9:59 PM

Hmm, it seems to me setMaxBytesForAlignment sets "the maximum amount of padding allowed for aligning the BB", not the max loop size. Or am I missing something?

I agree. It should be the third expression described in https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_chapter/as_7.html#SEC113.

I mean, I didn't see how the change only sets loop alignments for tight loops. Maybe I'm missing something, but with GCC there seems no way to limit -falign-loops only for tight loops.

I have the same concern.

BTW, @xen0n Could you please add some dedicate tests for function alignment settings. For example:

; RUN: llc --mtriple=loongarch64 %s -o - | FileCheck %s
; RUN: llc --mtriple=loongarch64 -mcpu=la464 %s -o - | FileCheck %s

define void @foo() nounwind {
;CHECK: .p2align 4
;CHECK: foo:
entry:
  ret void
}

I've checked the logic behind MaxBytesForAlignment and it seems to be the case, sorry for the confusion. I'll fix in the next revision. Thank you both.

Lulu said a benchmark should be performed for such a change (via gcc-patches) and I agree, so please hold the change until a benchmark is done.

@SixWeining I guess you can collaborate with Lulu internally for the benchmark. I don't have an access to SPEC CPU either.

Lulu said a benchmark should be performed for such a change (via gcc-patches) and I agree, so please hold the change until a benchmark is done.

@SixWeining I guess you can collaborate with Lulu internally for the benchmark. I don't have an access to SPEC CPU either.

Thanks. Let me talk with her offline.

SixWeining added inline comments.Apr 19 2023, 6:24 PM
llvm/lib/Target/LoongArch/LoongArchSubtarget.cpp
38

Should we pass CPU or TuneCPU to this function and use it to set different numbers?

xen0n retitled this revision from [LoongArch] Prefer to align functions and tight loops to 16 bytes to [LoongArch] Align functions and loops better according to uarch.Jun 1 2023, 11:09 PM
xen0n edited the summary of this revision. (Show Details)
xen0n updated this revision to Diff 527763.Jun 2 2023, 12:11 AM
xen0n edited the summary of this revision. (Show Details)

Rebased and added a dedicated test case for this change.

Also changed the preferred function alignment to 32 bytes per Loongson benchmark results.

SixWeining added inline comments.Jun 2 2023, 12:15 AM
llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
203

Almost all parts of this constructor use the member variable Subtarget but not the parameter STI except line 187 which I will change it later.

llvm/lib/Target/LoongArch/LoongArchSubtarget.h
56

Is this initial value useful? PrefFunctionAlignment and PrefLoopAlignment do not have initial values?

xen0n marked 2 inline comments as done.Jun 2 2023, 12:23 AM
xen0n added inline comments.
llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
203

Okay I'll change to Subtarget for consistency with the local style shortly. (Although IMO local variables/arguments should be preferred over instance states which is more global, it's best done separately.)

llvm/lib/Target/LoongArch/LoongArchSubtarget.cpp
38

For now I've added the TuneCPU argument for smaller future diffs when we get to add more uarch data later.

llvm/lib/Target/LoongArch/LoongArchSubtarget.h
56

This is similar to the way AArch64 does it (that I obviously referred to). Apparently the default ctor of Align means "byte alignment" and zero MaxBytesForAlignment means disabling that feature.

xen0n updated this revision to Diff 527769.Jun 2 2023, 12:30 AM
xen0n marked 2 inline comments as done.

Move default alignment setting to member declaration and use Subtarget consistently during LoongArchISelLowering init.

xen0n marked an inline comment as done.Jun 2 2023, 12:31 AM
xen0n edited the summary of this revision. (Show Details)Jul 17 2023, 9:51 PM
xen0n edited the summary of this revision. (Show Details)
xen0n edited the summary of this revision. (Show Details)Jul 18 2023, 1:52 AM
SixWeining accepted this revision.Jul 18 2023, 1:55 AM

LGTM. Thanks.

This revision is now accepted and ready to land.Jul 18 2023, 1:55 AM