Page MenuHomePhabricator

[RISCV] Adjust RISCV data layout by using n32:64 in layout string
Needs ReviewPublic

Authored by craig.topper on Jan 6 2022, 3:52 AM.

Details

Summary

Although i32 type is illegal in llvm back-end, I don't think it necessory to keep consistent for layout string.
In the current version, layout string is "e-m:e-p:64:64-i64:64-i128:128-n64-S128" for 64-bit arch where native integer type is only i64. However, under this circumstance, some middle-end passes like LSRs may be influenced due to the predictor("DL.isLegalInteger") result.

Let's look at line 157-162 in IVUsers.cpp.

// LSR is not APInt clean, do not touch integers bigger than 64-bits.
// Also avoid creating IVs of non-native types. For example, we don't want a
// 64-bit IV in 32-bit code just because the loop has one 64-bit cast.
uint64_t Width = SE->getTypeSizeInBits(I->getType());
if (Width > 64 || !DL.isLegalInteger(Width))
  return false;

Now I try to use n32:64 in layout string without degression of some other passes that use this info.

Diff Detail

Unit TestsFailed

TimeTest
60,100 msx64 debian > LLVM.CodeGen/NVPTX::wmma.py
Script: -- : 'RUN: at line 5'; "/usr/bin/python3.9" /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/NVPTX/wmma.py --ptx=60 --gpu-arch=70 > /var/lib/buildkite-agent/builds/llvm-project/build/test/CodeGen/NVPTX/Output/wmma.py.tmp-ptx60-sm_70.ll
60,040 msx64 debian > libFuzzer.libFuzzer::large.test
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -O2 -gline-tables-only -fsanitize=address,fuzzer -I/var/lib/buildkite-agent/builds/llvm-project/compiler-rt/lib/fuzzer -m64 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/fuzzer/LargeTest.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/fuzzer/X86_64DefaultLinuxConfig/Output/large.test.tmp-LargeTest

Event Timeline

joshua-arch1 created this revision.Jan 6 2022, 3:52 AM
joshua-arch1 requested review of this revision.Jan 6 2022, 3:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 6 2022, 3:52 AM
joshua-arch1 edited the summary of this revision. (Show Details)Jan 6 2022, 3:53 AM
asb added a comment.EditedJan 6 2022, 7:55 AM

Thanks - I don't think much thought has gone into whether n32:n64 would be reasonable for RV64, so I really appreciate you digging in to it.

I've only run this against the GCC torture suite, and on average this seems to be about neutral (or mildly positive). There are a few cases where code is worse though. Here's a rather surprising example (I haven't investigated any further):

--- a/output_rv64imafdc_lp64_O1/930921-1.s
+++ b/output_rv64imafdc_lp64_O1/930921-1.s
@@ -24,18 +24,29 @@ main:                                   # @main
 # %bb.0:                                # %entry
        addi    sp, sp, -16
        sd      ra, 8(sp)                       # 8-byte Folded Spill
+       li      a0, 0
        li      a1, 0
-       li      a0, 1
-       lui     a2, 2
-       addiw   a2, a2, 1806
+       lui     a2, 699051
+       addiw   a2, a2, -1365
+       slli    a6, a2, 32
+       lui     a3, 171
+       addiw   a3, a3, -1365
+       slli    a3, a3, 12
+       addi    a3, a3, -1365
+       lui     a4, 2
+       addiw   a4, a4, 1808
 .LBB1_1:                                # %for.body
                                         # =>This Inner Loop Header: Depth=1
-       beqz    a0, .LBB1_4
+       srli    a5, a0, 33
+       slli    a2, a1, 32
+       mulhu   a2, a2, a6
+       srli    a2, a2, 33
+       bne     a2, a5, .LBB1_4
 # %bb.2:                                # %for.cond
                                         #   in Loop: Header=BB1_1 Depth=1
-       sext.w  a3, a1
        addiw   a1, a1, 1
-       bgeu    a2, a3, .LBB1_1
+       add     a0, a0, a3
+       bne     a1, a4, .LBB1_1
 # %bb.3:                                # %for.end
        li      a0, 0
        call    exit
jrtc27 added a comment.Jan 6 2022, 8:00 AM

That's trying to do return (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1; for a range of x. Presumably the issue is MULW exists but not MULWHU and friends.

jrtc27 added a comment.Jan 6 2022, 8:04 AM

The DataLayout documentation says:

This specifies a set of native integer widths for the target CPU in bits. For example, it might contain n32 for 32-bit PowerPC, n32:64 for PowerPC 64, or n8:16:32:64 for X86-64. Elements of this set are considered to support most general arithmetic operations efficiently.

It's unclear what exactly that means for RISC-V.

Having said that, for the IVUsers example pointed out, that's just bad code in LLVM, it should permit things that are smaller than native integers, I would imagine. Otherwise, say, i16, won't be used on most targets (other than X86 and a few experimental, embedded or graphics-y targets) as they'll only have n32:64, but the comment there is clearly to just stop it creating overly-large integer types.

I believe there are two checks for DataLayout.isLegalInteger in IndVarSimplify that are interesting as well.

jrtc27 added a comment.Jan 6 2022, 8:14 AM

Possibly also SROA's, though probably unlikely to matter much as it'd be such a tiny aggregate

This will probably require something in AutoUpgrade.cpp to fix up old IR files. I think maybe X86 does an upgrade for some address space additions.

That's trying to do return (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1; for a range of x. Presumably the issue is MULW exists but not MULWHU and friends.

So the f function gets inlined into this compare f (i) != i / 3. The i/3 gets optimized by SelectionDAG into (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1. Before this patch, the automatic CSE in SelectionDAG merges this with the code from the inlined function. This makes a the comparison be statically always true so all the code gets optimized away.

With this patch, LSR kicks in and rewrites some of the inlined code from f by removing the multiply with 0xAAAAAAAB in favor of a modified induction variable that increments by 0xAAAAAAAB on each iteration. Now there's nothing for the div by constant optimization to CSE with so the comparison doesn't become statically always true.

I don't think this test is a blocker for this patch.

joshua-arch1 added a comment.EditedJan 9 2022, 6:01 PM

I have checked all the benchmarks.
For coremark, spec2006_int-471.omnetpp and eembc_networking_pktflow, the performance can improve by more than 3%, without degression of other cases.

I have checked all the benchmarks.
For coremark, spec2006_int-471.omnetpp and eembc_networking_pktflow, the performance can improve by more than 3%, without degression of other cases.

Can you say more about what hardware and ISA extensions were used for that testing?

joshua-arch1 added a comment.EditedJan 16 2022, 7:31 PM

I have checked all the benchmarks.
For coremark, spec2006_int-471.omnetpp and eembc_networking_pktflow, the performance can improve by more than 3%, without degression of other cases.

Can you say more about what hardware and ISA extensions were used for that testing?

I used T-HEAD XuanTie C906 processor based on the RV64GCV instruction set and THEAD instruction extension.

joshua-arch1 added a comment.EditedJan 25 2022, 11:02 PM

For the following benchmark, the performance can improve without degression of other cases (on T-HEAD XuanTie C906 processor with THEAD instruction extension)

spec2006_int-471.omnetpp+18.9672%
spec2006_int-483.xalancbmk+3.5192%
coremark+3.8695%
eembc_networking_pktflowb1m+4.1295%
eembc_networking_pktflowb2m+4.5903%
eembc_networking_pktflowb4m+4.0367%
eembc_networking_pktflowb512k+5.2104%

Can you test on your hardware?

I have checked all the benchmarks.
For coremark, spec2006_int-471.omnetpp and eembc_networking_pktflow, the performance can improve by more than 3%, without degression of other cases.

Can you say more about what hardware and ISA extensions were used for that testing?

I used T-HEAD XuanTie C906 processor based on the RV64GCV instruction set and THEAD instruction extension.

Was it tested using a version of LLVM that supports the THEAD instruction extension or just the public LLVM?

That's trying to do return (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1; for a range of x. Presumably the issue is MULW exists but not MULWHU and friends.

So the f function gets inlined into this compare f (i) != i / 3. The i/3 gets optimized by SelectionDAG into (unsigned) (((unsigned long long) x * 0xAAAAAAAB) >> 32) >> 1. Before this patch, the automatic CSE in SelectionDAG merges this with the code from the inlined function. This makes a the comparison be statically always true so all the code gets optimized away.

With this patch, LSR kicks in and rewrites some of the inlined code from f by removing the multiply with 0xAAAAAAAB in favor of a modified induction variable that increments by 0xAAAAAAAB on each iteration. Now there's nothing for the div by constant optimization to CSE with so the comparison doesn't become statically always true.

I don't think this test is a blocker for this patch.

Does the patch improve performance absolutely in theory?
Could anybody else also test to see if the patch can improve performance because guys may have different environment or benchmarks?

asb added a comment.Feb 16 2022, 6:16 AM

I think I'm correct in thinking that everyone so far is broadly in favour of this change, but as Craig suggests it likely wants something in AutoUpgrade.cpp to handle the change.

Does anyone feel differently?

joshua-arch1 added a comment.EditedFeb 23 2022, 10:52 PM

Ping.

I think I'm correct in thinking that everyone so far is broadly in favour of this change, but as Craig suggests it likely wants something in AutoUpgrade.cpp to handle the change.

Does anyone feel differently?

I have updated this patch by doing an upgrade for RISCV in AutoUpgrade.cpp to fix up old IR files. Could anyone review again?

jrtc27 added inline comments.Feb 27 2022, 7:43 PM
llvm/lib/IR/AutoUpgrade.cpp
4583

Don't hard-code the data layout, replace the -n64- with -n32:64-, otherwise it's a pain for downstreams like us who change the data layout depending on both the ISA and the ABI. All the other examples in this function (which aren't visible here because you uploaded a diff without context... don't do that) avoid hard-coding data layouts so you have plenty of different examples to copy from.

Reverse ping.

Sorry for the delay. The patch has been updated by replacing the '-n64-' with '-n32:64-' for RISCV in AutoUpgrade.cpp.

jrtc27 added inline comments.Apr 1 2022, 4:00 AM
llvm/lib/IR/AutoUpgrade.cpp
4601

Comment?

llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store.ll
1077

This is a regression

llvm/unittests/Bitcode/DataLayoutUpgradeTest.cpp
35

Comment like AMDGPU?

craig.topper added inline comments.Apr 1 2022, 8:33 AM
llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store.ll
1081

This move is interesting. I'll look closer at that.

craig.topper added inline comments.Apr 1 2022, 1:13 PM
llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store.ll
1077

This does save an add inside the loop.

1081

D122933 should fix the mv here.

I'm seeing a regression on 401.bzip2 and possibly 471.astar. And I'm not seeing large improvements on 471.omnetpp or 483.xalancbmk.

craig.topper commandeered this revision.Tue, Oct 4, 1:40 PM
craig.topper edited reviewers, added: joshua-arch1; removed: craig.topper.
craig.topper marked 2 inline comments as done.

Rebase.

Herald added a project: Restricted Project. · View Herald TranscriptTue, Oct 4, 2:04 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Use TT.isRISCV64() in AutoUpgrade.cpp

Add requested comment to test.