This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Patch to improve memcpy inlined assembly sequence.
Needs ReviewPublic

Authored by rs on Nov 1 2016, 7:11 AM.

Details

Summary

memcpy calls copying <= 64 bytes (see ARMSubtarget::getMaxInlineSizeThreshold) can be inlined into a sequence of loads/stores; if the number of bytes is greater than 64, the memcpy library function is called instead.

For a copy where the number of bytes is a multiple of 4, the memcpy inlining code can load full words from the source using the ldr instruction (or multiple words using ldm) and then store them to the destination using str (or stm for multiple words). In most cases the optimal sequence is the one that loads and stores multiple words at a time, as a single ldm/stm is faster than a series of single-word loads and stores.

When the number of bytes to copy isn't a multiple of 4, memcpy inlining ends up using ldrb/strb if the remainder is 1, ldrh/strh if the remainder is 2, and ldrb/strb plus ldrh/strh if it's 3. If the number of bytes had been a multiple of 4, the backend could have folded the tail load/store into a preceding ldm/stm or emitted a single ldr/str. When the remainder is 1 or 2 bytes, the ldr/str takes the same time as the corresponding ldrb/strb or ldrh/strh; but when the remainder is 3 bytes and the backend would otherwise generate ldrb/strb/ldrh/strh, a single ldr/str is much better. Also, even when the remainder is 1 or 2 bytes and multiple words are being copied, the backend may already be able to fold the loads/stores into a multi-word ldm/stm.

This patch tries to implement this simple optimization by padding the destination and source and rounding the number of bytes to copy up to a multiple of 4 bytes (one word) when doing a memcpy operation.

The patch implements a pass that looks for the memcpy intrinsic and uses the simple heuristic below to decide whether to pad the destination/source or not:

  1. Is the destination a stack-allocated constant array?
  2. Is the source a constant?
  3. Is the number of bytes to copy a constant?
  4. Is destination array size == constant source size == number of bytes?

If the answer to all of those questions is yes, the pass pads the destination and source and increases the number of bytes to copy in the memcpy operation accordingly.

The pass is implemented as a midend IR-level pass but is only added when the target is a 32-bit ARM core. The reason it's implemented as a midend pass rather than a backend pass (or in ARMTargetLowering::getOptimalMemOpType or ARMSelectionDAGInfo::EmitTargetCodeForMemcpy) is that in previous attempts I found it wasn't possible to pad the source/destination, as the IR objects at that level are immutable. It might be possible to pad the SelectionDAG nodes, but I would have to implement some analysis to make sure it's safe to do so. Perhaps adding a midend analysis pass that propagates information to the backend about which memcpy's are safe to pad would work? If you want to see my previous attempt at doing it in ARMSelectionDAGInfo::EmitTargetCodeForMemcpy then let me know. Ideally I would have liked to do this optimization close to where the memcpy is inlined.

Diff Detail

Event Timeline

rs updated this revision to Diff 76552.Nov 1 2016, 7:11 AM
rs retitled this revision from to [ARM] Patch to improve memcpy lined assembly sequence..
rs updated this object.
rs added reviewers: rengolin, t.p.northover.
rs added a subscriber: llvm-commits.
rs retitled this revision from [ARM] Patch to improve memcpy lined assembly sequence. to [ARM] Patch to improve memcpy inlined assembly sequence..Nov 1 2016, 7:12 AM
t.p.northover edited edge metadata.Nov 15 2016, 9:55 AM

This seems like a ridiculously specific optimization, even in its outline form. This is compounded by even more assumptions the actual implementation makes (that there will be GEPs for example).

The implementation is also really rather broken, but before we even get into details like that I think we need to sort out what benefit we'd get from a correct version. Basically, the high-level intent seems to be to optimize

char arr[] = "whatever";

occurring in function scope. Is that really common enough to be worth writing an entire pass for?

rs added a comment.Nov 16 2016, 9:34 AM

Thanks for the review Tim, I really appreciate it.

There are a few occurrences of this pattern in the LNT test suite that can benefit from this type of optimization. I agree that this pass is very specific; I uploaded it as a way to start a discussion on how this could be implemented in a better way without using a pass. As I said, I've attempted this a few times in other parts of the backend but have been unsuccessful because the destination/source IR node objects are immutable, so I can't pad them. If possible, I would like the opinion of an expert such as yourself on how the padding of the destination/source could be done on the SelectionDAG IR nodes.

The IR is definitely the right place to do this... trying to do the sort of modifications required for this any later would be messy at best.

This needs to be generalized beyond handling just i8 arrays; this would probably trigger with some frequency on structs with small members.

This is probably interesting for other targets to some extent; other common architectures don't have LDM/STM, but they have larger registers which could benefit from a similar transformation (for example, on x86, SSE registers are used to lower memcpy.)

Granted, I'm also skeptical that this actually triggers frequently enough to be worth bothering; saying it only triggers a few times in the entirety of LNT isn't exactly encouraging.

rs added a comment.Nov 16 2016, 11:41 AM

Thanks for your review comments Eli.

The IR is definitely the right place to do this... trying to do the sort of modifications required for this any later would be messy at best.
This needs to be generalized beyond handling just i8 arrays; this would probably trigger with some frequency on structs with small members.

Do you have any suggestions for where this modification could be plugged in? Or do you think it's fine as a pass but needs to be generalised?

This is probably interesting for other targets to some extent; other common architectures don't have LDM/STM, but they have larger registers which could benefit from a similar transformation (for example, on x86, SSE registers are used to lower memcpy.)

ok

Granted, I'm also skeptical that this actually triggers frequently enough to be worth bothering; saying it only triggers a few times in the entirety of LNT isn't exactly encouraging.

After generalising it a bit more it might be able to optimise more examples in LNT.

It's fine as a pass. I mean, it would be nice to include as part of some existing pass, but I'm not sure where it would fit in.

rs added a comment.Nov 17 2016, 7:54 AM

It's fine as a pass. I mean, it would be nice to include as part of some existing pass, but I'm not sure where it would fit in.

A colleague of mine has mentioned GlobalOpt as a potentially good place to plug this optimization in; I'm currently looking to see if it can fit nicely in there.

john.brawn resigned from this revision.May 12 2020, 6:39 AM