
[ARM] Register pressure with -mthumb forces register reload before each call
ClosedPublic

Authored by prathamesh on May 12 2020, 7:14 AM.

Details

Summary

Hi,
Compiling the following test-case (reduced from the uECC_shared_secret function in the tinycrypt library) with -Oz on armv6-m:

typedef unsigned char uint8_t;
extern uint8_t x1;
extern uint8_t x2;

void foo(uint8_t *, unsigned, unsigned);

void uECC_shared_secret(uint8_t *private_key, unsigned num_bytes, unsigned num_words)
{
  foo(private_key, num_bytes, num_words);
  foo(private_key, num_bytes, num_words);
  foo(private_key, num_bytes, num_words);
}

results in an ldr of the function's address before each blx call:

ldr     r3, .LCPI0_0
blx     r3
mov     r0, r6
mov     r1, r5
mov     r2, r4
ldr     r3, .LCPI0_0
blx     r3
ldr     r3, .LCPI0_0
mov     r0, r6
mov     r1, r5
mov     r2, r4
blx     r3

.LCPI0_0:
        .long   foo

As suggested by John Brawn in http://lists.llvm.org/pipermail/llvm-dev/2020-April/140712.html,
this happens because:
(1) ARMTargetLowering::LowerCall prefers an indirect call when the same function is called 3 or more times in a basic block.
(2) For Thumb1, we have only 3 callee-saved registers available (r4-r6).
(3) The function has 3 arguments and needs one extra register to hold its address.
(4) So we end up spilling one of the registers. The register holding the function's address gets its live range split since it can be rematerialized, and we end up with an ldr before each call.

As per the suggestion, the patch implements the foldMemoryOperand hook in Thumb1InstrInfo, to convert back to bl in case of a spill.
Does the patch look OK?
make check-llvm shows no unexpected failures.

Thanks,
Prathamesh

Diff Detail

Event Timeline

prathamesh created this revision. May 12 2020, 7:14 AM

Please add some tests for this to llvm/test/CodeGen/ARM/.

llvm/lib/Target/ARM/Thumb1InstrInfo.cpp
173–174

Please use the standard variable naming convention here (i.e. Callee and FuncName).

178–180

You can't assume that these are the registers that this function uses. You should be copying the implicit register operands of the original BLX. You also need to add the RegMask operand of the BLX, as otherwise the call is treated as preserving all registers.

llvm/lib/Target/ARM/Thumb1InstrInfo.h
58

This doesn't do anything - this using declaration introduces TargetInstrInfo::foldMemoryOperandImpl as a public member of Thumb1InstrInfo, but you can override it as a public function (as is done here) without doing that. I don't see any reason to declare foldMemoryOperandImpl as public though; it should probably be declared as protected (like it is in TargetInstrInfo).

prathamesh updated this revision to Diff 269167. Jun 8 2020, 4:25 AM

Hi John,
Thanks for the review, and I am sorry for the late response; I was on leave.
I have updated the patch with the suggested changes; do they look to be in the right direction?
Testing with make check-llvm shows no unexpected failures.

Thanks,
Prathamesh

Changes look OK, though as I said before please add tests for this.

llvm/lib/Target/ARM/Thumb1InstrInfo.cpp
177–181

I think here you can just call MIB.add(MO) for each of the implicit operands MO of MI.

Hi,
Thanks for the review, and sorry I forgot to add the test earlier.
Does the attached version look OK?
I verified that it removes the indirect calls for the original test-case uECC_shared_secret from tinycrypt.
Tested with make check-llvm, showing no unexpected failures.

Thanks,
Prathamesh

I'd expect to see some extra checks to decide if reverting the indirect-call optimisation is profitable or not. If the loaded address is used more than three times, without being spilled, then I think it's better to leave the calls as indirect ones. If you don't think this is ever likely to be worthwhile for thumb 1, then it would be better to modify ARMTargetLowering::LowerCall to not do the transformation in the first place.

llvm/test/CodeGen/ARM/minsize-call-cse-2.ll
9

Why is one of the calls being left as an indirect branch?

> I'd expect to see some extra checks to decide if reverting the indirect-call optimisation is profitable or not. If the loaded address is used more than three times, without being spilled, then I think it's better to leave the calls as indirect ones. If you don't think this is ever likely to be worthwhile for thumb 1, then it would be better to modify ARMTargetLowering::LowerCall to not do the transformation in the first place.

Hi Oliver,
Instead of using a heuristic in LowerCall, we decided to convert blx -> bl using the foldMemoryOperand hook only if the register holding the function's address got spilled,
and the patch implements the hook for Thumb1.
For the above test-case, it seems the load of the function's address was rematerialized 3 times, which was eliminated by converting the three blx calls to bl's.
This is still not optimal, but better than the original behavior, without affecting other test-cases and heuristics.

Thanks,
Prathamesh

> only if the register holding function's address got spilled

I don't see any checks for this in the code. Why doesn't this trigger on every tLDRpci -> tBLXr pair, even if the register did not get spilled?

> only if the register holding function's address got spilled
>
> I don't see any checks for this in the code. Why doesn't this trigger on every tLDRpci -> tBLXr pair, even if the register did not get spilled?

Oops, sorry about the comment - "register holding function's address gets spilled" - I got mixed up there.
The point of adding foldMemoryOperand was to fold the tLDRpci, tBLXr pair to tBL instead of reloading the function's address and calling it indirectly.

For the above case, where tBLXr isn't folded, it's because tLDRpci is paired with a COPY, instead of tBLXr:

%16:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool)
%23:tgpr = COPY %16:tgpr
tBLXr 14, $noreg, %23:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit $r3, implicit-def $sp

Dumping LoadMI and MI in the Thumb1InstrInfo::foldMemoryOperandImpl hook shows the first two instructions above, respectively.

Thanks,
Prathamesh

That sounds very fragile - I can easily imagine future, unrelated changes removing that COPY, or emitting one in the case which doesn't currently have one, which would completely change the effect of this patch.

I think the right way to do this is to look through the COPYs, find all the uses of the constant pool load, and do this transformation if and only if there are fewer than three uses of the load.

The alternative would be to not do the direct->indirect transformation in the first place for Thumb1-only targets, since they don't have enough registers to make it worthwhile very often. That would be trading off some generated code quality for simpler implementation.

prathamesh updated this revision to Diff 275661. Jul 6 2020, 4:43 AM

Hi,
Sorry for the late response. In the attached patch, I added a couple more constraints if we're compiling for Thumb1:

  1. Number of args passed + caller's number of args < total number of available regs
  2. Each arg to the callee is either a "pass thru" arg, an 8-bit imm, or a constant load. The intent is to allow only those args that need a single register for computation.

The motivation is to allow the transform for simple cases which fit the above constraints, and use a direct call otherwise.
Does it look reasonable? The patch does not regress ARM tests and converts all calls to bl in the test attached in the patch.

Thanks,
Prathamesh

> That sounds very fragile - I can easily imagine future, unrelated changes removing that COPY, or emitting one in the case which doesn't currently have one, which would completely change the effect of this patch.
>
> I think the right way to do this is to look through the COPYs, find all the uses of the constant pool load, and do this transformation if and only if there are fewer than three uses of the load.
>
> The alternative would be to not do the direct->indirect transformation in the first place for Thumb1-only targets, since they don't have enough registers to make it worthwhile very often. That would be trading off some generated code quality for simpler implementation.

> Hi,
> Sorry for late response. In the attached patch, I added a couple of more constraints if we're compiling for Thumb1:
>
>   1. Number of args passed + caller's num of args < total number of available regs
>   2. Each arg to callee, is either a "pass thru" arg, OR 8-bit imm OR a constant load. The intent is to allow only those args that need a single register for computation.
>
> The motivation is to allow the transform for simple cases which fit the above cases, or use direct call otherwise.
> Does it look reasonable ? The patch does not regress ARM tests and converts all calls to bl in the test attached in patch.

ping?

Thanks,
Prathamesh

Hello.

Do you have some codesize number for this, across some large-ish test suite? I would recommend the llvm-test-suite if you can get it compiling on Thumb1 - at least enough to get codesize number from it.

prathamesh added a comment (edited). Jul 28 2020, 5:47 AM

> Hello.
>
> Do you have some codesize number for this, across some large-ish test suite? I would recommend the llvm-test-suite if you can get it compiling on Thumb1 - at least enough to get codesize number from it.

Hi, sorry for the late response. Unfortunately, we are having some infra issues benchmarking code-size natively on SPEC, and I am getting consistent build failures (without the patch) while building llvm natively on arm. Do you have any suggestions on how to cross compile the LLVM testsuite with -march=armv6-m enabled?

I did a cross compilation with SPEC, and the code-size difference across all object files that got built was around 6.14%. I will report more descriptive results for SPEC after our infra issues get resolved. In the interim, I am also trying to test my patch for chromium.

> Do you have any suggestions on how to cross compile LLVM testsuite with -march=armv6-m enabled ?

Use armv6, not armv6m. In Thumb mode, the instruction set is basically the same. And it's much easier to set up; a normal ARM-linux sysroot can be used to build, and the resulting binaries will run on a normal ARM Linux device.

Thanks for the suggestions. I was finally able to build llvm natively on arm. Benchmarking SPEC2006 for code-size with -march=armv6 revealed minimal differences with the previous patch. It works for the original test case from the tinycrypt library, but causes a code-size regression of ~10% for another function from the same library, resulting in overall worse code-size. Likewise, disabling the transform gives similar results.
Should we go instead with the approach suggested by John (reattached in the current patch)?
Implement the foldMemoryOperand hook to fold the tLDRpci, tBLX pair to tBL.
This has the issue noted above: if COPY insns come in between, foldMemoryOperand gets passed tLDRpci and the COPY instead.
I am not sure how to handle that. However, this approach improves somewhat on the original behavior without causing regressions for any test-cases. Benchmarking showed minimal difference in code-size for SPEC2006, but it improves the original test-case in tinycrypt and avoids the 10% code-size regression.

My results seem to agree that this improves things and doesn't break anything (they were only codesize benchmarks, but they all built OK).

> My results seem to agree, that this improves things and doesn't break anything (they were only codesize benchmarks, but they all built OK).

Great, thanks for testing! Is this patch OK to commit?

> My results seem to agree, that this improves things and doesn't break anything (they were only codesize benchmarks, but they all built OK).
>
> Great, thanks for testing! Is this patch OK to commit ?

ping?

Thanks,
Prathamesh

Yeah, sorry for the delay. I had to go and look up where this function was actually being used. It appears to be during register allocation; it folds as opposed to spilling. That actually seems to work quite well, except that it won't remove the last constant pool load and the CPE entry. Like others have said, it does seem a little odd to create these indirect loads just to undo them again, but coming up with a decent heuristic during ISel lowering sounds tricky too. If you can make this a bit more resilient, it sounds like an OK way to go to me.

llvm/lib/Target/ARM/Thumb1InstrInfo.cpp
173

I don't think you can use the name of any old Constant. Consider something like this, but it could be any other form of indirect call:

void test(int *p, int x, int y, int z) {
  ((void (*)(int *, int, int, int))0x1234567)(p, x, y, z);
  ((void (*)(int *, int, int, int))0x1234567)(p, x, y, z);
  ((void (*)(int *, int, int, int))0x1234567)(p, x, y, z);
  ((void (*)(int *, int, int, int))0x1234567)(p, x, y, z);
}

You can maybe try and dyn_cast the CPE.Val.ConstVal (providing you know it's not a MachineCPVal) into a Function* and use that. But it would be good to check all these things are definitely true for the instructions we have.

Hi,
Thanks for the catch! I tried your example, and it crashed llc with my patch. In the attached patch, I followed your suggestion to dyn_cast CPE.Val.ConstVal to Function * and that worked.

Also, I added another heuristic to ARMTargetLowering::LowerCall, to check whether F.arg_size() + the number of arguments exceeds the available number of registers, and in that case avoid emitting indirect calls (which would only be turned back into direct calls later). This doesn't catch all cases, but catches common cases where an argument requires only one reg to compute. With the heuristic, all calls to g are direct in minsize-call-cse-2.ll.

Tested with make check-llvm. Testing SPEC2006 for code-size shows minimal differences, and for tinycrypt shows an overall improvement.
Does the patch look OK?

efriedma added inline comments. Aug 26 2020, 1:07 PM
llvm/lib/Target/ARM/ARMISelLowering.cpp
2236 ↗(On Diff #287988)

I wrote a check for the number of arguments in a different way at https://reviews.llvm.org/D49465 ; maybe you can borrow that? Checking the number of operands in the IR is a very inaccurate approximation.

dmgreen added inline comments. Aug 27 2020, 1:27 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
2236 ↗(On Diff #287988)

I would recommend trying to keep this patch small and separating any other changes into a different review. The foldMemoryOperandImpl change seems to improve things on its own. If we get that in first, we can move on to figuring out an appropriate heuristic for this part separately.

llvm/lib/Target/ARM/Thumb1InstrInfo.cpp
170

Can you add a check that CPI < MCP->getConstants().size(), to be safe.

Added a check in this version to return nullptr if CPI >= MCP->getConstants().size().
Does the patch look OK? Tested with make check-llvm.
Thanks.

prathamesh added inline comments. Aug 28 2020, 7:52 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
2236 ↗(On Diff #287988)

Hi Eli,
Thanks for the suggestions! As suggested by Dave, I will try to address the heuristic in a follow up patch.

dmgreen accepted this revision. Aug 29 2020, 11:34 AM

Yeah. I'm happy with this. Looking forward to seeing any follow ups.

LGTM, thanks!

This revision is now accepted and ready to land. Aug 29 2020, 11:34 AM

Thanks! Could you please commit it on my behalf? I don't have commit access.

Sure can. Will do.

Hi @prathamesh,

the LLVM::minsize-call-cse-2.ll test fails on the llvm-clang-x86_64-expensive-checks-ubuntu builder:
http://lab.llvm.org:8011/builders/llvm-clang-x86_64-expensive-checks-ubuntu/builds/8634

Would you take care of it?
Thanks.

Hello. Sorry I wasn't sent any buildbot failure emails, perhaps because I was not the author. Thanks for letting us know.

It looks like foldMemoryOperand is automatically transferring memory operands to the new instruction, which is not what we want here. I'll revert the patch so that we can sort out a fix.

Sorry for the breakage, I will take a look.

Hi,
It seems the issue comes from TargetInstrInfo::foldMemoryOperand, which adds the memory operands of LoadMI to the newly created MI.
It calls foldMemoryOperandImpl:

else {
  // Ask the target to do the actual folding.
  NewMI = foldMemoryOperandImpl(MF, MI, Ops, MI, LoadMI, LIS);
}

and then:

// Copy the memoperands from the load to the folded instruction.
if (MI.memoperands_empty()) {
  NewMI->setMemRefs(MF, LoadMI.memoperands());
}

If we return immediately after calling foldMemoryOperandImpl, then the test-case compiles successfully.
I guess for this specific folding we don't need to attach the memory operands of LoadMI, but that might not be true in the general case?
Should there be some way for foldMemoryOperandImpl to signal to foldMemoryOperand not to add the memory operands of LoadMI?
Thanks.

> It seems the issue comes from TargetInstrInfo::foldMemoryOperand, which adds the memory operands of LoadMI to newly created MI.
> ...
> If we return immediately after calling foldMemoryOperandImpl, then the test-case compiles successfully.
> I guess for this specific folding we don't need to attach memory operands of LoadMI, but that might not be true in general case ?
> Should there be some way for foldMemoryOperandImpl to signal to foldMemoryOperand not to add memory operands of LoadMI ?

Yeah. That sounds like what I was seeing. I'm not sure if I would recommend changing it so that foldMemoryOperandImpl is responsible for adding the memory operand - there seems to be quite a lot of code already in the different targets' foldMemoryOperandImpl implementations that would need to be changed. That looks complex and error prone.

Perhaps like you say a slightly different interface could be used, so that we have more control over the instruction we are creating in this case.

prathamesh added a comment (edited). Sep 2 2020, 8:37 AM

I was wondering if either of these approaches looks good?
(a) Add a default param to TargetInstrInfo::foldMemoryOperandImpl, say bool &AddMemoryOperands, and have the target-specific hook set it accordingly.
This will require modifying other target hooks to set AddMemoryOperands to true. Or alternatively return a std::pair?

(b) Define another hook, say bool TargetInstrInfo::AddMemoryOperands(MI *), that returns true by default, and override it to return false in Thumb1InstrInfo?
That won't require modifying other backends, but would adding a hook for this purpose be overkill?

Thanks.