This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Adjust RV64I data layout by using n32:64 in layout string
ClosedPublic

Authored by craig.topper on Jan 6 2022, 3:52 AM.

Details

Summary

Although i32 type is illegal in the backend, RV64I has pretty good support for i32 types by using W instructions.

By adding n32 to the DataLayout string, middle end optimizations will consider i32 to be a native type. One known effect of this is enabling LoopStrengthReduce on loops with i32 induction variables. This can be beneficial because C/C++ code often has loops with i32 induction variables due to the use of int or unsigned int.

If this patch exposes performance issues, those are better addressed by tuning LSR or other passes.

Diff Detail

Event Timeline

joshua-arch1 created this revision.Jan 6 2022, 3:52 AM
joshua-arch1 requested review of this revision.Jan 6 2022, 3:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 6 2022, 3:52 AM
joshua-arch1 edited the summary of this revision. (Show Details)Jan 6 2022, 3:53 AM
asb added a comment.EditedJan 6 2022, 7:55 AM

Thanks - I don't think much thought has gone into whether n32:n64 would be reasonable for RV64, so I really appreciate you digging in to it.

I've only run this against the GCC torture suite, and on average this seems to be about neutral (or mildly positive). There are a few cases where code is worse though. Here's a rather surprising example (I haven't investigated any further):

--- a/output_rv64imafdc_lp64_O1/930921-1.s
+++ b/output_rv64imafdc_lp64_O1/930921-1.s
@@ -24,18 +24,29 @@ main:                                   # @main
 # %bb.0:                                # %entry
        addi    sp, sp, -16
        sd      ra, 8(sp)                       # 8-byte Folded Spill
+       li      a0, 0
        li      a1, 0
-       li      a0, 1
-       lui     a2, 2
-       addiw   a2, a2, 1806
+       lui     a2, 699051
+       addiw   a2, a2, -1365
+       slli    a6, a2, 32
+       lui     a3, 171
+       addiw   a3, a3, -1365
+       slli    a3, a3, 12
+       addi    a3, a3, -1365
+       lui     a4, 2
+       addiw   a4, a4, 1808
 .LBB1_1:                                # %for.body
                                         # =>This Inner Loop Header: Depth=1
-       beqz    a0, .LBB1_4
+       srli    a5, a0, 33
+       slli    a2, a1, 32
+       mulhu   a2, a2, a6
+       srli    a2, a2, 33
+       bne     a2, a5, .LBB1_4
 # %bb.2:                                # %for.cond
                                         #   in Loop: Header=BB1_1 Depth=1
-       sext.w  a3, a1
        addiw   a1, a1, 1
-       bgeu    a2, a3, .LBB1_1
+       add     a0, a0, a3
+       bne     a1, a4, .LBB1_1
 # %bb.3:                                # %for.end
        li      a0, 0
        call    exit
jrtc27 added a comment.Jan 6 2022, 8:00 AM

That's trying to do return (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1; for a range of x. Presumably the issue is MULW exists but not MULWHU and friends.

jrtc27 added a comment.Jan 6 2022, 8:04 AM

The DataLayout documentation says:

This specifies a set of native integer widths for the target CPU in bits. For example, it might contain n32 for 32-bit PowerPC, n32:64 for PowerPC 64, or n8:16:32:64 for X86-64. Elements of this set are considered to support most general arithmetic operations efficiently.

It's unclear what exactly that means for RISC-V.

Having said that, for the IVUsers example pointed out, that's just bad code in LLVM, it should permit things that are smaller than native integers, I would imagine. Otherwise, say, i16, won't be used on most targets (other than X86 and a few experimental, embedded or graphics-y targets) as they'll only have n32:64, but the comment there is clearly to just stop it creating overly-large integer types.

I believe there are two checks for DataLayout.isLegalInteger in IndVarSimplify that are interesting as well.

jrtc27 added a comment.Jan 6 2022, 8:14 AM

Possibly also SROA's, though probably unlikely to matter much as it'd be such a tiny aggregate

This will probably require something in AutoUpgrade.cpp to fix up old IR files. I think maybe X86 does an upgrade for some address space additions.

That's trying to do return (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1; for a range of x. Presumably the issue is MULW exists but not MULWHU and friends.

So the f function gets inlined into this compare f (i) != i / 3. The i/3 gets optimized by SelectionDAG into (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1. Before this patch, the automatic CSE in SelectionDAG merges this with the code from the inlined function. This makes a the comparison be statically always true so all the code gets optimized away.

With this patch, LSR kicks in and rewrites some of the inlined code from f by removing the multiply with 0xAAAAAAAB in favor of a modified induction variable that increments by 0xAAAAAAAB on each iteration. Now there's nothing for the div by constant optimization to CSE with so the comparison doesn't become statically always true.

I don't think this test is a blocker for this patch.

joshua-arch1 added a comment.EditedJan 9 2022, 6:01 PM

I have checked all the benchmarks.
For coremark, spec2006_int-471.omnetpp and eembc_networking_pktflow, the performance can improve by more than 3%, without degression of other cases.

I have checked all the benchmarks.
For coremark, spec2006_int-471.omnetpp and eembc_networking_pktflow, the performance can improve by more than 3%, without degression of other cases.

Can you say more about what hardware and ISA extensions were used for that testing?

joshua-arch1 added a comment.EditedJan 16 2022, 7:31 PM

I have checked all the benchmarks.
For coremark, spec2006_int-471.omnetpp and eembc_networking_pktflow, the performance can improve by more than 3%, without degression of other cases.

Can you say more about what hardware and ISA extensions were used for that testing?

I used T-HEAD XuanTie C906 processor based on the RV64GCV instruction set and THEAD instruction extension.

joshua-arch1 added a comment.EditedJan 25 2022, 11:02 PM

For the following benchmark, the performance can improve without degression of other cases (on T-HEAD XuanTie C906 processor with THEAD instruction extension)

spec2006_int-471.omnetpp+18.9672%
spec2006_int-483.xalancbmk+3.5192%
coremark+3.8695%
eembc_networking_pktflowb1m+4.1295%
eembc_networking_pktflowb2m+4.5903%
eembc_networking_pktflowb4m+4.0367%
eembc_networking_pktflowb512k+5.2104%

Can you test on your hardware?

I have checked all the benchmarks.
For coremark, spec2006_int-471.omnetpp and eembc_networking_pktflow, the performance can improve by more than 3%, without degression of other cases.

Can you say more about what hardware and ISA extensions were used for that testing?

I used T-HEAD XuanTie C906 processor based on the RV64GCV instruction set and THEAD instruction extension.

Was it tested using a version of LLVM that supports the THEAD instruction extension or just the public LLVM?

That's trying to do return (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1; for a range of x. Presumably the issue is MULW exists but not MULWHU and friends.

So the f function gets inlined into this compare f (i) != i / 3. The i/3 gets optimized by SelectionDAG into (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1. Before this patch, the automatic CSE in SelectionDAG merges this with the code from the inlined function. This makes a the comparison be statically always true so all the code gets optimized away.

With this patch, LSR kicks in and rewrites some of the inlined code from f by removing the multiply with 0xAAAAAAAB in favor of a modified induction variable that increments by 0xAAAAAAAB on each iteration. Now there's nothing for the div by constant optimization to CSE with so the comparison doesn't become statically always true.

I don't think this test is a blocker for this patch.

Does the patch improve performance absolutely in theory?
Could anybody else also test to see if the patch can improve performance because guys may have different environment or benchmarks?

asb added a comment.Feb 16 2022, 6:16 AM

I think I'm correct in thinking that everyone so far is broadly in favour of this change, but as Craig suggests it likely wants something in AutoUpgrade.cpp to handle the change.

Does anyone feel differently?

joshua-arch1 added a comment.EditedFeb 23 2022, 10:52 PM

Ping.

I think I'm correct in thinking that everyone so far is broadly in favour of this change, but as Craig suggests it likely wants something in AutoUpgrade.cpp to handle the change.

Does anyone feel differently?

I have updated this patch by doing an upgrade for RISCV in AutoUpgrade.cpp to fix up old IR files. Could anyone review again?

jrtc27 added inline comments.Feb 27 2022, 7:43 PM
llvm/lib/IR/AutoUpgrade.cpp
4583

Don't hard-code the data layout, replace the -n64- with -n32:64-, otherwise it's a pain for downstreams like us who change the data layout depending on both the ISA and the ABI. All the other examples in this function (which aren't visible here because you uploaded a diff without context... don't do that) avoid hard-coding data layouts so you have plenty of different examples to copy from.

Reverse ping.

Sorry for the delay. The patch has been updated by replacing the '-n64-' with '-n32:64-' for RISCV in AutoUpgrade.cpp.

jrtc27 added inline comments.Apr 1 2022, 4:00 AM
llvm/lib/IR/AutoUpgrade.cpp
4601

Comment?

llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store.ll
1077

This is a regression

llvm/unittests/Bitcode/DataLayoutUpgradeTest.cpp
35

Comment like AMDGPU?

craig.topper added inline comments.Apr 1 2022, 8:33 AM
llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store.ll
1081

This move is interesting. I'll look closer at that.

craig.topper added inline comments.Apr 1 2022, 1:13 PM
llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store.ll
1077

This does save an add inside the loop.

1081

D122933 should fix the mv here.

I'm seeing a regression on 401.bzip2 and possibly 471.astar. And I'm not seeing large improvements on 471.omnetpp or 483.xalancbmk.

craig.topper commandeered this revision.Oct 4 2022, 1:40 PM
craig.topper edited reviewers, added: joshua-arch1; removed: craig.topper.
craig.topper marked 2 inline comments as done.

Rebase.

Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 2:04 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Use TT.isRISCV64() in AutoUpgrade.cpp

Add requested comment to test.

I'm seeing a regression on 401.bzip2 and possibly 471.astar. And I'm not seeing large improvements on 471.omnetpp or 483.xalancbmk.

LGTM. But do you still get regression on spec? Or any improvements?

I'm seeing a regression on 401.bzip2 and possibly 471.astar. And I'm not seeing large improvements on 471.omnetpp or 483.xalancbmk.

LGTM. But do you still get regression on spec? Or any improvements?

I only have numbers from my downstream repo which has a bunch of other changes. For one of our CPUs, I'm now seeing improvements, and another looks neutral. So I think we should move forward this. If there are individual regressions we can work on fixing those by improving the backend or fixing cost models or whatever.

reames added a subscriber: reames.Oct 20 2022, 11:20 AM

Just want to note that I have no strong opinion on this patch. It doesn't seem unreasonable, and I'm comfortable with this being an empirically driven decision.

I agree with @reames, though I do think the patch description could use a rewrite.

asb added a comment.Oct 27 2022, 5:51 AM

I've put it on the agenda for the call today so we can close this off (I figure given it's been sitting so long, waiting a few extra days to cover it in the call is no big deal). But if the data shows this overall improves things, I think it's a sensible change to make.

asb accepted this revision.Oct 27 2022, 8:54 AM

There were no objections on the call. Looks good to me - two minor changes before landing that would make sense:

  • Fraser suggested tweaking the patch description
  • Probably worth adding this to docs/ReleaseNotes.rst (I know we're not very consistent about doing this, but adding it now minimises the risk I miss it later!).
This revision is now accepted and ready to land.Oct 27 2022, 8:54 AM

Add to release notes

craig.topper retitled this revision from [RISCV] Adjust RISCV data layout by using n32:64 in layout string to [RISCV] Adjust RV64I data layout by using n32:64 in layout string.Oct 27 2022, 10:16 AM
craig.topper edited the summary of this revision. (Show Details)
arichardson added inline comments.Oct 27 2022, 4:52 PM
llvm/docs/ReleaseNotes.rst
122 ↗(On Diff #471209)

Without additional context I don't think this makes much sense to most readers. Before looking at this patch description I would not have been and to say what n is used for.

Maybe something like "i32 has been marked as a legal integer type for RV64, improving code generation for some benchmarks"?

jrtc27 added inline comments.Oct 27 2022, 5:14 PM
llvm/docs/ReleaseNotes.rst
122 ↗(On Diff #471209)

"native integer type" not "legal integer type"; it still gets legalised during ISel

Refine ReleaseNotes

frasercrmck accepted this revision.Oct 28 2022, 2:03 AM

LGTM other than a nit, but I concur that a comment in AutoUpgrade would be nice.

llvm/docs/ReleaseNotes.rst
123 ↗(On Diff #471385)

something like , among other optimizations to avoid full stop + "and"?