This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Custom type legalize i32 loads by sign extending.
ClosedPublic

Authored by craig.topper on Jul 22 2022, 1:56 PM.

Details

Summary

The default is to use extload which can become a zextload or
sextload if it is followed by an 'and' or sext_inreg.

Sometimes type legalization will introduce an 'and' from promoting
something like 'srl X, C' and a sext_inreg from from a setcc. The
'and' could be freely folded with the promoted 'srl' by using srliw,
but the sext_inreg can't be folded into a compare. DAG combiner
will see both of these choices and may decide to fold the 'and'
instead of the 'sext_inreg'. This forces the sext_inreg to become
a sext.w.

By picking sextload in the type legalizer we take this choice away.
Looking at spec2006 compiled with Zba and Zbb this appeared to be
net reduction in lines of code in the objdump disassembly output.

This is similar to what we do with i32 add/sub/mul/shl in
type legalization where we always emit a sext_inreg.

There's some followup improvements we could do. For example, folding
(and (sextload X), 0xffffffff) to (zextload X) if the 'and' is the
only user.

Diff Detail

Event Timeline

craig.topper created this revision.Jul 22 2022, 1:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2022, 1:56 PM
craig.topper requested review of this revision.Jul 22 2022, 1:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2022, 1:56 PM
asb added a comment.Aug 1 2022, 3:16 AM

I wonder if this should be a target hook? I accept that adding a hook for something that only RISC-V uses isn't necessarily an obvious win. But it also doesn't seem ideal for us to accrete too many of these heuristic workarounds within the backend. (not sure I have a super strong opinion either way - just raising for discussion).

I wonder if this should be a target hook? I accept that adding a hook for something that only RISC-V uses isn't necessarily an obvious win. But it also doesn't seem ideal for us to accrete too many of these heuristic workarounds within the backend. (not sure I have a super strong opinion either way - just raising for discussion).

This doesn't feel very different than what we do for ADD/SUB/MUL/SHL so I'm not sure its worth a target hook right now.

@asb have you checked gcc torture suite with this?

asb added a comment.Aug 2 2022, 11:23 AM

Yes, from a quick check the overall impact is positive (550 files changed, 9764 insertions(+), 9810 deletions(-) - that's across rv64imafdc {lp64,lp64d} {O0,O1,O2,O3,Os,Oz}. But there are some cases, e.g. var-arg-24.s where some LWU get converted to LW + SLLI + SRLI.

Some cases where LWU + SRLI => LW + SRLIW which I suppose might have a tiny code-size impact. LWU is never compressible, SRLI may be. LW may be compressible, SRLIW never is. Doesn't feel like a big deal either way.

Yes, from a quick check the overall impact is positive (550 files changed, 9764 insertions(+), 9810 deletions(-) - that's across rv64imafdc {lp64,lp64d} {O0,O1,O2,O3,Os,Oz}. But there are some cases, e.g. var-arg-24.s where some LWU get converted to LW + SLLI + SRLI.

Some cases where LWU + SRLI => LW + SRLIW which I suppose might have a tiny code-size impact. LWU is never compressible, SRLI may be. LW may be compressible, SRLIW never is. Doesn't feel like a big deal either way.

Thanks Alex! I was going took at adding a DAG combine to turn sextload+and into zextload which might fix the LW + SLLI + SRLI case and maybe the LW+SRLIW cases.

Add a small DAGCombiner change that recovers va-arg-24.c I'll write a test
before I commit this. Wanted to get this out for further evaluation on other
tests Alex saw.

We might consider supporting isTruncateFree true for i64->i32 with RV64.
W instructions make it free in a lot of cases.

asb added a comment.EditedAug 2 2022, 10:04 PM

Add a small DAGCombiner change that recovers va-arg-24.c I'll write a test
before I commit this. Wanted to get this out for further evaluation on other
tests Alex saw.

Comparing old patch vs new patch, the inter-diff shows all positive changes (i.e. no cases where codegen is worse vs the first patch version).

There are still cases where the static instruction count rises, I've pasted in all such cases in this gist

More fixes. I'll separate them back out if this gets approved. Just trying to keep them together for testing.

asb added a comment.Aug 4 2022, 3:37 AM

With this version of the patch these are the cases that ended up adding more lines:

output_rv64imafdc_lp64_O0/mode-dependent-address.s: 1 lines added (net)
output_rv64imafdc_lp64_O0/pr53645.s: 16 lines added (net)
output_rv64imafdc_lp64_O1/pr23135.s: 15 lines added (net)
output_rv64imafdc_lp64_O1/pr53645.s: 32 lines added (net)
output_rv64imafdc_lp64_O2/loop-5.s: 1 lines added (net)
output_rv64imafdc_lp64_O2/pr53645.s: 30 lines added (net)
output_rv64imafdc_lp64_O3/loop-5.s: 1 lines added (net)
output_rv64imafdc_lp64_O3/memset-2.s: 46 lines added (net)
output_rv64imafdc_lp64_O3/pr53645.s: 30 lines added (net)
output_rv64imafdc_lp64_Os/pr53645.s: 30 lines added (net)
output_rv64imafdc_lp64d_O0/mode-dependent-address.s: 1 lines added (net)
output_rv64imafdc_lp64d_O0/pr53645.s: 16 lines added (net)
output_rv64imafdc_lp64d_O1/pr23135.s: 15 lines added (net)
output_rv64imafdc_lp64d_O1/pr53645.s: 32 lines added (net)
output_rv64imafdc_lp64d_O2/loop-5.s: 1 lines added (net)
output_rv64imafdc_lp64d_O2/pr53645.s: 30 lines added (net)
output_rv64imafdc_lp64d_O3/loop-5.s: 1 lines added (net)
output_rv64imafdc_lp64d_O3/memset-2.s: 46 lines added (net)
output_rv64imafdc_lp64d_O3/pr53645.s: 30 lines added (net)
output_rv64imafdc_lp64d_Os/pr53645.s: 30 lines added (net)

Diff vs current HEAD.

asb added a comment.Aug 4 2022, 10:30 AM

This gives a better idea of the impact:
output_rv64imafdc_lp64_O0/mode-dependent-address.s: 3 lines added, 2 removed (+1 net)
output_rv64imafdc_lp64_O0/pr53645.s: 112 lines added, 96 removed (+16 net)
output_rv64imafdc_lp64_O1/pr23135.s: 147 lines added, 132 removed (+15 net)
output_rv64imafdc_lp64_O1/pr53645.s: 282 lines added, 250 removed (+32 net)
output_rv64imafdc_lp64_O2/loop-5.s: 6 lines added, 5 removed (+1 net)
output_rv64imafdc_lp64_O2/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64_O3/loop-5.s: 6 lines added, 5 removed (+1 net)
output_rv64imafdc_lp64_O3/memset-2.s: 1129 lines added, 1083 removed (+46 net)
output_rv64imafdc_lp64_O3/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64_Os/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64d_O0/mode-dependent-address.s: 3 lines added, 2 removed (+1 net)
output_rv64imafdc_lp64d_O0/pr53645.s: 112 lines added, 96 removed (+16 net)
output_rv64imafdc_lp64d_O1/pr23135.s: 147 lines added, 132 removed (+15 net)
output_rv64imafdc_lp64d_O1/pr53645.s: 282 lines added, 250 removed (+32 net)
output_rv64imafdc_lp64d_O2/loop-5.s: 6 lines added, 5 removed (+1 net)
output_rv64imafdc_lp64d_O2/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64d_O3/loop-5.s: 6 lines added, 5 removed (+1 net)
output_rv64imafdc_lp64d_O3/memset-2.s: 1129 lines added, 1083 removed (+46 net)
output_rv64imafdc_lp64d_O3/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64d_Os/pr53645.s: 278 lines added, 248 removed (+30 net)

It's probably worth a quick check if there are obvious reasons for the additions, but the overall impact seems positive so if there's not an obvious deficiency I don't have an objection to declaring these cases are just noise due to taking a different codegen path.

asb added inline comments.Aug 4 2022, 10:33 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
182

It's not immediately obvious why Promote is needed, so it's probably worth a comment.

This gives a better idea of the impact:
output_rv64imafdc_lp64_O0/mode-dependent-address.s: 3 lines added, 2 removed (+1 net)
output_rv64imafdc_lp64_O0/pr53645.s: 112 lines added, 96 removed (+16 net)
output_rv64imafdc_lp64_O1/pr23135.s: 147 lines added, 132 removed (+15 net)
output_rv64imafdc_lp64_O1/pr53645.s: 282 lines added, 250 removed (+32 net)
output_rv64imafdc_lp64_O2/loop-5.s: 6 lines added, 5 removed (+1 net)
output_rv64imafdc_lp64_O2/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64_O3/loop-5.s: 6 lines added, 5 removed (+1 net)
output_rv64imafdc_lp64_O3/memset-2.s: 1129 lines added, 1083 removed (+46 net)
output_rv64imafdc_lp64_O3/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64_Os/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64d_O0/mode-dependent-address.s: 3 lines added, 2 removed (+1 net)
output_rv64imafdc_lp64d_O0/pr53645.s: 112 lines added, 96 removed (+16 net)
output_rv64imafdc_lp64d_O1/pr23135.s: 147 lines added, 132 removed (+15 net)
output_rv64imafdc_lp64d_O1/pr53645.s: 282 lines added, 250 removed (+32 net)
output_rv64imafdc_lp64d_O2/loop-5.s: 6 lines added, 5 removed (+1 net)
output_rv64imafdc_lp64d_O2/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64d_O3/loop-5.s: 6 lines added, 5 removed (+1 net)
output_rv64imafdc_lp64d_O3/memset-2.s: 1129 lines added, 1083 removed (+46 net)
output_rv64imafdc_lp64d_O3/pr53645.s: 278 lines added, 248 removed (+30 net)
output_rv64imafdc_lp64d_Os/pr53645.s: 278 lines added, 248 removed (+30 net)

It's probably worth a quick check if there are obvious reasons for the additions, but the overall impact seems positive so if there's not an obvious deficiency I don't have an objection to declaring these cases are just noise due to taking a different codegen path.

I explored these differences. Some notes.

memset-2.s - Directly caused by the isTruncateFree change. We are now sharing a 64-bit constant between i64 and i32 stores by truncating. Somehow this caused some repeated rematerialization of LUI instructions. Despite on the surface the change reducing register pressure. The basic block is quite large with many calls to memset.

pr53645.s - test includes a vector value from a load passed across basic blocks that get scalarized. this increases the use count of the broken down scalars but the other basic block only wants 1 element. There's also a visitation order issue with urem by constant expansion interacting with SimplifyDemandedBits.

mode-dependent-address.s - we have an i32 load used by sext_inreg and an and with 255. This and was used to form a zextload, but it didn't remove the and, but prevented the sext_inreg from making a sextload. Seems related to isTruncateFree.

pr23135.s - DAGCombiner's ForwardStoreValueToDirectLoad needs to support sextload by creating sext_inreg.

loop-5.s - We need to both sext and zext a load. We used to use lwu and sext.w, now we use lw and slli+srli.

I only have a good answer on how to fix pr23135.s

craig.topper planned changes to this revision.Aug 12 2022, 3:44 PM

Going to start splitting this up.

Add comment to extending load change

craig.topper edited the summary of this revision. (Show Details)Aug 27 2022, 1:28 PM
asb accepted this revision.Sep 12 2022, 8:39 AM

[Sorry Craig - it looks like I'd re-reviewed and written my approval but I either didn't submit or it didn't go through].

This looks good to me, thanks!

This revision is now accepted and ready to land.Sep 12 2022, 8:39 AM