Page MenuHomePhabricator

[PowerPC] Add pass to expand extra memcpy calls

Authored by sfertile on Apr 6 2017, 9:46 AM.



This patch adds a new Ir-Ir pass in the PPC back-end that expands memcpy calls. There are 2 types of expansion added:

  1. calls where the size is a known compile time constant, but are large enough that we either don't want to or wont expand them with the straight line expansion.
  2. Expand calls where the size isn't a compile time constant.

This patch is to be followed up with a subsequent patch which will leverage pgo profiling data to make better decisions about when and under what conditions to actually perform these expansions.

Diff Detail


Event Timeline

sfertile created this revision.Apr 6 2017, 9:46 AM
arsenm added a subscriber: arsenm.Apr 6 2017, 9:50 AM

This looks like it handles using wider types unlike llvm::createMemCpyLoop. Could you merge this into that instead and add new TTI interfaces for controlling which types to pick?

This looks like it handles using wider types unlike llvm::createMemCpyLoop. Could you merge this into that instead and add new TTI interfaces for controlling which types to pick?

Yes, I think that makes sense to do.

sfertile updated this revision to Diff 94850.Apr 11 2017, 10:25 AM

Turned off the Unknown size expansion by default and added an option to turn it on.

sfertile updated this revision to Diff 97325.May 1 2017, 12:08 PM

I have moved the expansion of the memcpy calls into Transforms/Utils as part of Extend memcpy expansion in Transform/Utils to handle wider operand types.. I've updated this patch to now be based off that patch.

sfertile updated this revision to Diff 97382.May 1 2017, 7:58 PM

Ran clang-format over the patch.

nemanjai edited edge metadata.May 15 2017, 8:45 AM

Overall, I think this is fine with the comments addressed. It would be good to get some feedback from @hfinkel on whether it would be possible/beneficial to run loop optimizations after this.


I really think the debug type and the options should reflect what this pass does. Naming something lower-mem-intrinsics suggests that without it we don't lower these, which isn't entirely the case. Also, no real information is gained from statements such as "extra memcpy expansions". On the other hand, saying what this thing does is useful - such as "Expand memcpy calls into loops under specified conditions". And respectively, an option such as ppc-enable-memcpy-loops.


I suppose the temporary comments such as the last two sentences are fine as long as you remember to remove them in the follow-up patch.


Seems like in most implementation files, the convention is to put these at the bottom of the file. Having them in a fixed location in every file makes them easy to find.


Umm, this is kind of meaningless. Technically, PowerPC CPU's have been capable of running in both little-endian and big-endian mode for about 4-5 generations. If the pass does not check endianness, there's no real need to mention it here.


Assert message please.


Wouldn't we want the TTI from the caller? I know that in practice there's no difference, but I can certainly envision possibilities where that isn't really the case.


I don't think there's a need to split these two early exits since there's nothing between them.


Would it be possible (or appropriate) to run the loop passes after this since we may have introduced loops? Seems like unrolling and vectorization may be beneficial (at least in some circumstances).


Nit: full sentences for comments. And s/incase/in case.


In general, I think all the functions should include checks for PWR7, OPTSMALL and OPTNONE (providing that the -Os/-Oz can be specified on the command line for the tool you're invoking).


I think this should be ALL shouldn't it? You have the CPU=pwr7 attribute on the function, so regardless of the invocation, we shouldn't optimize. Same goes for memcpy_opt_small.

sfertile updated this revision to Diff 99330.May 17 2017, 11:31 AM

Updated based on Nemanjas feedback.

sfertile marked 6 inline comments as done.May 17 2017, 11:51 AM
sfertile added inline comments.

This makes sense to me. Initially I thought we would end up doing similar expansions for some of the other memory-intrinsics. I've only updated the debug-type and option for now but I think I will change the file-name and pass-name as well since they really aren't describing what we are currently doing. If/when we add other transformations we can update the name to reflect that.


Removed the temporary comment anyway.


I added 'ppc64le' since that is the default target_cpu on little-endian. I thought that little-endian was only officially supported starting in Power8, so having a LE cpu indicated Power8 or above for the architecture. If thats not the case I can rethink this switch.


You are right. The reason I moved the check to inside the loop was so that I could get the TTI from the caller, then I forgot to actually do that.


Yes, especially for the known size loop expansion. I initially tested this with the loop expansion being hand unrolled between 2-4 times and it had much better performance in my micro-benches.


I've changed the optsmall checks to be run as part of the OPT test since they didn't really have to be run on there own. The reason the no-opt test is run with llc is because opt didn't respect -O0 when invoked like this. I'll expand the PWR7 checks to the other functions as well.

inouehrs added inline comments.May 18 2017, 5:15 AM

Why not test DataLayout.isLittleEndian()?
I think PPC chips has been bi-endian for a long time (e.g. ancient Windows NT for PPC ran in little-endian mode). But the LE Linux supports POWER8 and later, but BE Linux also runs on P8.


Currently, there is no use of these two methods. Do you add them for the subsequent patches?

hfinkel added inline comments.May 22 2017, 3:35 PM

I think that it would make sense to put this before the GEPOpt passes above (not after them, as it is now). Did you try that? If you do, then you'll get index optimizations, LICM, etc. after the expansion, which I can imagine might be useful.

sfertile updated this revision to Diff 119196.Oct 16 2017, 12:48 PM
sfertile marked 2 inline comments as done.

Moved the pass to before the GEPOpt passes.

sfertile marked an inline comment as done.Oct 16 2017, 1:02 PM
hfinkel added inline comments.Oct 30 2017, 7:45 PM

I don't understand this comment. It looks like this function is just llvm::expandMemCpyAsLoop whenever TTI.useWideIRMemcpyLoopLowering() returns false. Why not just use that function?

hfinkel added inline comments.Oct 30 2017, 7:47 PM

The code checks for pwr8, pwr9, etc. You should use that form here, I imagine, for the target cpu (I believe the short forms are indeed what the backend recognizes).

sfertile updated this revision to Diff 121503.Nov 3 2017, 10:15 AM

Updated target cpu spellings in tests

sfertile marked an inline comment as done.Nov 3 2017, 10:35 AM
sfertile added inline comments.Nov 3 2017, 11:02 AM

This is so old it took me a while to figure out what was going on here but I think I got it now. This is here essentially because:

  1. It originally took pgo data into account but that is split out into a subsequent patch making this the same as ' llvm::expandMemCpyAsLoop when TTI.useWideIRMemcpyLoopLowering() returns false as you pointed out.
  1. I didn't use ' llvm::expandMemCpyAsLoop' directly because useWideIRMemcpyLoopLowering is actually off by default for every target, even PPC(!).

The reason 'useWideIRMemcpyLoopLowering ' got added in the first place was because I was hesitant to change the targets I didn't have a way of running functional testing for (amd and nvptx backends), and it was only there to give people an easy way to test it out before flipping the switch to the new implementation. It was supposed too get removed in a subsequent cleanup patch that I never implemented.

Rather then changing this to use 'llvm::expandMemCpyAsLoop' I would like to leave it since the follow up patch (D32872) modifies it so it no longer matches, and I will post the cleanup patch that removes the extra TTI hook and the old byte-copy only implementation of memcpy lowering separately since it can go in independent of this.

jtony added inline comments.Dec 4 2017, 12:37 PM

Minor nit: Why not just use MemCpyLoopExpansions++?

hfinkel added inline comments.Dec 12 2017, 4:08 PM

Please remove "The pass is off by default and can be...". These kinds of comments should, by definition, be true only temporarily, and the risk of becoming stale is high. Plus, nearly all passes have these kinds of cl::opts, and we don't have these kinds of comments in general. Feel free to add a comment below, near the relevant cl::opt, stating that this option enables or disables the entire transformation.


You should allow this check to be overridden when the user explicitly enables the transformation. You can do this by checking EnableMemcpyExpansionPass.getNumOccurrences() > 0 && EnableMemcpyExpansionPass.




"Will go away when the old memcpy expansion implementation does." - Please refer here to some specific function (i.e. a particular piece of code), so that we know if this comment becomes stale.

sfertile updated this revision to Diff 126890.Dec 13 2017, 7:52 PM

addressed several review comments.

sfertile marked 3 inline comments as done.Dec 13 2017, 7:57 PM
sfertile added inline comments.Jan 22 2018, 8:09 PM

This pass is off by default right now, so we have to specify this at least once to turn it on. I could add this check in the follow up patch that enables the pass by default though.

sfertile abandoned this revision.Mar 27 2018, 1:05 PM