This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Modify codegen for memcpy intrinsic to prefer LDM/STM.
ClosedPublic

Authored by scott-0 on Sep 29 2015, 3:22 AM.

Details

Summary

We were previously codegen'ing memcpy as regular load/store operations and
hoping that the register allocator would allocate registers in ascending order
so that we could apply an LDM/STM combine after register allocation. According
to the commit that first introduced this code (r37179), we planned to teach
the register allocator to allocate the registers in ascending order. This
never got implemented, and up to now we've been stuck with very poor codegen.

A much simpler approach for achieving better codegen is to create MEMCPY pseudo
instructions, attach scratch virtual registers to them and then, post register
allocation, expand the MEMCPYs into LDM/STM pairs using the scratch registers.
The register allocator will have picked arbitrary registers which we sort when
expanding the MEMCPY. This approach also avoids the need to repeatedly
calculate offsets which ultimately ought to be eliminated pre-RA in order to
decrease register pressure.

Fixes PR9199 and PR23768.

[This is based on Peter Collingbourne's r238473 which was reverted. I'm happy to produce the diff from r238473 if that's helpful.]

Diff Detail

Event Timeline

scott-0 updated this revision to Diff 35953.Sep 29 2015, 3:22 AM
scott-0 retitled this revision from to [ARM] Modify codegen for memcpy intrinsic to prefer LDM/STM..
scott-0 updated this object.
scott-0 added a reviewer: pcc.
scott-0 added a subscriber: llvm-commits.
jmolloy requested changes to this revision.Oct 1 2015, 5:09 AM
jmolloy added a reviewer: jmolloy.
jmolloy added a subscriber: jmolloy.

Hi Scott,

In general I like this change. It's an elegant solution to the problem. I've got a bikeshed to paint regarding the name - I think MEMCPY might be more appropriate as that's exactly what it's doing.

I think there should be more documentation about the inputs and outputs of the node - it took me a bit to realise that the outputs are the updated base registers (which means you can chain them - neat!)

There also looks to be minimal-zero non-Thumb1 tests for this - this happens in all modes so I'd expect the same amount of testing for each of ARM, T1 and T2.

The instprinter stuff is ugly. But I see why it's needed.

Cheers,

James

lib/Target/ARM/ARMISelLowering.h
190

Can we call this MEMCPY instead of MCOPY? It self-describes a bit better, I think.

lib/Target/ARM/InstPrinter/ARMInstPrinter.cpp
752 ↗(On Diff #35953)

You should be able to construct the vector using the iterator construction syntax:

std::vector<MCOperand> RegOps(MI->begin() + OpNum, MI->end());
754 ↗(On Diff #35953)

Use std::stable_sort instead of std::sort, for deterministicness.

This revision now requires changes to proceed.Oct 1 2015, 5:09 AM
scott-0 marked 2 inline comments as done.Oct 1 2015, 9:35 AM

New version on it's way.

lib/Target/ARM/InstPrinter/ARMInstPrinter.cpp
754 ↗(On Diff #35953)

Ok, but the registers being sorted are unique.

scott-0 updated this revision to Diff 36262.Oct 1 2015, 9:36 AM
scott-0 updated this object.
scott-0 edited edge metadata.

Thanks for the review.

Definitely a good idea on the whole.

lib/Target/ARM/ARMISelLowering.cpp
8097–8098

Why's the second MI needed? Does regalloc not bother to allocate <def,dead> operands or something?

lib/Target/ARM/InstPrinter/ARMInstPrinter.cpp
753–754 ↗(On Diff #36262)

I think it'd be better to do the sorting when the instruction is expanded, it seems like a useful property to have if anyone wants to analyse an LDM/STM.

Thanks for the review. I'll put up a new version after the weekend.

lib/Target/ARM/ARMISelLowering.cpp
8097–8098

I tried it with <def,kill> and BuildMI refused; I'll investigate using <def,dead>.

lib/Target/ARM/InstPrinter/ARMInstPrinter.cpp
753–754 ↗(On Diff #36262)

I'll investigate moving it.

scott-0 updated this revision to Diff 36514.Oct 5 2015, 6:47 AM
scott-0 updated this object.
scott-0 edited edge metadata.

Addressed comments.

Looks good to me now, with 1+epsilon nits:

lib/Target/ARM/ARMBaseInstrInfo.cpp
1252

This is idiomatically:

AddDefaultPred(LDM.addOperand(MI->getOperand(3)));
1259–1263

I'm not entirely convinced this is clearer than

for(unsigned I = 5; I < MI->getNumOperands(); ++I)
  ScratchRegs.push_back(MI->getOperand(I).getReg());

but that's just bike-shedding and probably personal biases, feel free to leave it if you disagree.

I think it'd be better to do the sorting when the instruction is expanded, it seems like a useful property to have if anyone wants to analyse an LDM/STM.

Interestingly, adding an assertion to require ascending order in ARMInstPrinter.cpp fails some other cases, e.g. stack_guard_remat.ll

scott-0 marked 2 inline comments as done.Oct 5 2015, 7:45 AM

Thanks for the review; I will do those final edits and commit.

lib/Target/ARM/ARMBaseInstrInfo.cpp
1259–1263

I agree; I was just a bit lambda-happy.

scott-0 accepted this revision.Oct 6 2015, 1:49 AM
scott-0 added a reviewer: scott-0.
scott-0 marked an inline comment as done.

Committed as r249322

jmolloy accepted this revision.Oct 6 2015, 1:51 AM
jmolloy edited edge metadata.
This revision is now accepted and ready to land.Oct 6 2015, 1:51 AM
scott-0 closed this revision.Oct 6 2015, 1:51 AM
test/CodeGen/Thumb/ldm-stm-base-materialization.ll