This is an archive of the discontinued LLVM Phabricator instance.

Thumb2: Modify codegen for memcpy intrinsic to prefer LDM/STM.
ClosedPublic

Authored by pcc on May 5 2015, 3:55 PM.

Details

Summary

We were previously codegen'ing these as regular load/store operations and
hoping that the register allocator would allocate registers in ascending order
so that we could apply an LDM/STM combine after register allocation. According
to the commit that first introduced this code (r37179), we planned to teach
the register allocator to allocate the registers in ascending order. This
never got implemented, and up to now we've been stuck with very poor codegen.

A much simpler approach for achieving better codegen is to create LDM/STM
instructions with identical sets of virtual registers, let the register
allocator pick arbitrary registers, and sort the register lists when printing
an MCInst. This approach also avoids the need to repeatedly calculate offsets,
which ultimately ought to be eliminated pre-RA in order to decrease register
pressure.
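The printing-time trick can be sketched as follows (a minimal model with a hypothetical helper name, not the actual MCInst printer):

```python
# LDM/STM encode their register list as a set, so the registers are
# always transferred in ascending register-number order. The allocator
# may therefore pick any registers; we just sort the list when printing.
def print_ldm(base, regs):
    ordered = sorted(regs)  # arbitrary allocator choice -> valid list
    return "ldm %s!, {%s}" % (base, ", ".join("r%d" % r for r in ordered))

print(print_ldm("r0", [4, 2, 3]))  # → ldm r0!, {r2, r3, r4}
```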

This is implemented by lowering the memcpy intrinsic to a series of SD-only
MCOPY pseudo-instructions, each of which performs a memory copy using a given
number of registers. During SD->MI lowering, we lower MCOPY to LDM/STM. This
is a little unusual, but it avoids the need to encode register lists in the
SDAG, and we can take advantage of SD use lists to decide whether to use the
_UPD variant of the instructions.

Fixes PR9199.

Diff Detail

Repository
rL LLVM

Event Timeline

pcc updated this revision to Diff 24992.May 5 2015, 3:55 PM
pcc retitled this revision from to Thumb2: Modify codegen for memcpy intrinsic to prefer LDM/STM..
pcc updated this object.
pcc edited the test plan for this revision. (Show Details)
pcc added a reviewer: rengolin.
pcc added a subscriber: Unknown Object (MLST).

This looks like a generally good idea to me, though there are some points that need to be looked at (in the inline comments).

If this were to be committed, it does make me wonder what should become of the load/store optimizer though. Perhaps ARMPreAllocLoadStoreOpt should be made to introduce MCOPY? That's something for another day though.

lib/Target/ARM/ARMISelLowering.cpp
7567 ↗(On Diff #24992)

The assumption here is that !isThumb2 means ARM, but it could also mean Thumb1, which means that for a Thumb1 target we would emit invalid instructions. Either LowerMCOPY should handle Thumb1, or we shouldn't be turning memcpy into MCOPY for Thumb1.

lib/Target/ARM/ARMSelectionDAGInfo.cpp
28 ↗(On Diff #24992)

EmitTargetCodeForMemcpy is called by SelectionDAG::getMemcpy only when getMemcpyLoadsAndStores fails to generate a load/store sequence. ARMTargetLowering::ARMTargetLowering currently sets MaxStoresPerMemcpy to 4, so this function will only be triggered for memcpys of >16 bytes. If MCOPY gives better results than individual loads and stores then maybe that should be lowered to 0 so that this function is always used?

67–68 ↗(On Diff #24992)

If the number of words to be copied is not an exact multiple of MAX_LOADS_IN_LDM, then splitting it up in this way may not be the best idea. Consider, for example, a copy of 7 words. Splitting it into 6+1 means that the total number of registers that need to be available is 8 (source, dest, and a 6-register list), but splitting it as 3+4 brings that down to 6 (or maybe 5 if the source and dest are dead after the memcpy). That would reduce register pressure and in some cases require fewer callee-saved registers to be saved.

Of course that's the current behaviour as well, but lumping everything together into an MCOPY may make things harder for the register allocator than individual loads and stores, where it had more freedom. I don't know enough about LLVM's register allocation to say whether that's actually true, or whether it ever turns out to be a problem.
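The register-count arithmetic above can be checked with a small sketch (a hypothetical model, not LLVM code):

```python
def regs_needed(split):
    # Source pointer + dest pointer + the largest register list used
    # by any single LDM/STM pair in the split.
    return 2 + max(split)

print(regs_needed([6, 1]))  # 8 registers for the 6+1 split
print(regs_needed([3, 4]))  # 6 registers for the 3+4 split
```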

87 ↗(On Diff #24992)

I was a bit confused about why this is 1-7 and not 1-3 like before, but it looks like you can get more than 3 trailing bytes when (SizeVal % (MAX_LOADS_IN_LDM*4)) is between 4 and 7, i.e. some number of MCOPYs of MAX_LOADS_IN_LDM size are emitted, and then we don't want a single-register LDM for the lone remaining word, so we can have up to 7 bytes left.

Maybe it would be clearer if the calculation of BytesLeft were done in one go after the MCOPY generation instead of being split around it, possibly with a more explanatory comment.
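A sketch of the BytesLeft behaviour being described (both the MAX_LOADS_IN_LDM value of 6 and the rule that a lone trailing word gets no single-register LDM are assumptions for illustration, not taken from the patch):

```python
MAX_LOADS_IN_LDM = 6  # assumed value for illustration

def bytes_left(size_val):
    # Bytes remaining after all full-size MCOPYs have been emitted.
    rem = size_val % (MAX_LOADS_IN_LDM * 4)
    words = rem // 4
    if words >= 2:
        rem -= words * 4  # a smaller LDM/STM pair handles these words
    return rem  # 0-3 normally; 4-7 when exactly one word remained

print(bytes_left(31))  # 7: one 6-word MCOPY, then a lone word + 3 bytes
print(bytes_left(11))  # 3: a 2-word MCOPY, then 3 trailing bytes
```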

rengolin removed a subscriber: john.brawn.
pcc updated this revision to Diff 26554.May 26 2015, 4:57 PM
  • Fix Thumb-1
  • Evenly distribute registers among MCOPYs
  • Add MCOPY to ARMTargetLowering::getTargetNodeName
  • Improve comments
  • Improve test
lib/Target/ARM/ARMISelLowering.cpp
7567 ↗(On Diff #24992)

Done

lib/Target/ARM/ARMSelectionDAGInfo.cpp
28 ↗(On Diff #24992)

If the memcpy is short enough (and the target supports it), getMemcpyLoadsAndStores uses the extension registers, which normally ends up being shorter due to lower pressure on the general-purpose registers, so I left this as-is in order to take advantage of that.

(Ideally I think we should have something like a VMCOPY pseudo-instruction that lowers to VLDM/VSTM, but this seems harder to map onto the register allocator's view of the world, as VLDM/VSTM take a consecutive register range.)

67–68 ↗(On Diff #24992)

The code now calculates the number of LDM/STMs we need anyway, and divides the registers evenly among them.

Of course that's the current behaviour as well, but lumping everything together into an MCOPY may make things harder for the register allocator than individual loads and stores, where it had more freedom. I don't know enough about LLVM's register allocation to say whether that's actually true, or whether it ever turns out to be a problem.

I'd be surprised if it were a problem right now. The register allocator is basically solving the same problem except distributed over a smaller number of instructions.
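The even distribution can be sketched like this (a model of the idea, again with an assumed MAX_LOADS_IN_LDM of 6):

```python
MAX_LOADS_IN_LDM = 6  # assumed value for illustration

def split_words(num_words):
    # Compute how many LDM/STM pairs are needed, then spread the words
    # as evenly as possible so the largest register list is minimized.
    num_memops = -(-num_words // MAX_LOADS_IN_LDM)  # ceiling division
    base, extra = divmod(num_words, num_memops)
    return [base + 1] * extra + [base] * (num_memops - extra)

print(split_words(7))   # [4, 3] rather than the greedy [6, 1]
print(split_words(13))  # [5, 4, 4]
```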

87 ↗(On Diff #24992)

I was a bit confused about why this is 1-7 and not 1-3 like before, but it looks like you can get more than 3 trailing bytes when (SizeVal % (MAX_LOADS_IN_LDM*4)) is between 4 and 7, i.e. some number of MCOPYs of MAX_LOADS_IN_LDM size are emitted, and then we don't want a single-register LDM for the lone remaining word, so we can have up to 7 bytes left.

Right. Now that we're distributing the registers among the MCOPYs, we save on overall register pressure when emitting the extra LDM/STM pair.

john.brawn accepted this revision.May 28 2015, 4:20 AM
john.brawn edited edge metadata.

One minor nitpick, otherwise looks good to me.

lib/Target/ARM/Thumb2SizeReduction.cpp
128 ↗(On Diff #26554)

This comment should be updated - tSTMIA_UPD isn't the equivalent of t2STMIA, but the difference is correctly handled elsewhere.

This revision is now accepted and ready to land.May 28 2015, 4:20 AM
This revision was automatically updated to reflect the committed changes.