Memcpy, and other memory intrinsics, typically tries to use LDM/STM if the source and target addresses are 4-byte aligned. In CodeGenPrepare look for calls to memory intrinsics and, if the object is on the stack, 4-byte align it if it's large enough that we expect that memcpy would want to use LDM/STM to copy it.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Hi John,
Have you done some stack consumption analysis? I remember Arnaud trying to reduce the size of the stack and this patch seems to go against it.
I agree it's a good idea to align it at 4, but we need to be careful, especially in -Os/z.
cheers,
--renato
Zero change in stack consumption in any EEMBC benchmark (I haven't checked SPEC). Also by CodeGen time we don't know if we're compiling -Os/-Oz as we just have CodeGenOpt::Level.
No, the size optimization attribute is independent of CodeGenOpt::Level. You can get at it from CodeGen by querying:
MF.getFunction()->hasFnAttribute(Attribute::OptimizeForSize);
That having been said, code size and stack size, in this context, might be inversely related.
It might be better to implement this in CodeGenPrep. You could look directly for allocas that are used by memcpy intrinsics or as byval function argument parameters, and increase the alignment of only those directly.
Also worth noting, is that in getMemcpyLoadsAndStores in lib/CodeGen/SelectionDAG/SelectionDAG.cpp, there already is code that attempts to reset the alignment of a destination stack object when possible. Perhaps this should just be extended to do the same for the source.
The goal is to align not just on direct calls to memcpy, but also calls to functions that then may call memcpy. It looks like doing it as a CodeGenPrep pass can do both - we first align arguments to a call, then check if the call is actually a memcpy and up its alignment if possible.
Can you please put this into CodeGenPrep proper (with some target hook to enable it). I likely want to use this for PowerPC too, and the actual logic is fairly simple (so far).
lib/Target/ARM/ARMAlignAllocaPass.cpp | ||
---|---|---|
32 ↗ | (On Diff #21012) | Line too long? |
62 ↗ | (On Diff #21012) | What are you actually trying to check for here? I assume you're mostly interested in looking for GEPs here (and not, for example, ptrtoint). Also, for GEPs, do you want to check that the offet to the underlying alloca is divisible by the alignment of interest. Maybe you should use something like Val->stripAndAccumulateInBoundsConstantOffsets() -- true, you'd miss the dynamic offset case, but I don't know if you have examples where that's important. |
New patch attached that moves this into CodeGenPrepare, enabled by shouldAlignAllocaArgs in TargetLowering.
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
1236 ↗ | (On Diff #21355) | I'm a bit averse of setting parameters for the computation inside a conditional and having to assert right after every call. I'd rather have (or reuse) some default size/alignment parameters from elsewhere. |
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
1236 ↗ | (On Diff #21355) | If an implementation of shouldAlignAllocaArgs were to not set AllocaAlign then that indicates that it's not doing what it should, so failing an assert is better than hiding that by silently using a default value. |
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
1236 ↗ | (On Diff #21355) | I'm not against the assert, I'm against passing a reference for AllocaSize and AllocaAlign on a check call. The call is not explicit that it's setting the two values here, nor is it on TargetLowering's default implementation. A more clear way would be to make a function whose only job is to set those variables and make it explicit on its name, like "setAlignAllocaArgs(CI, Size, Align)" and add "Size > 0" to the conditional. You can avoid the assert in that case, too. Also, your check for TLI is redundant, since you already check for TD. |
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
1236 ↗ | (On Diff #21355) | Doing it this way (return a bool, set a reference/pointer argument only if returning true) seems to be the standard way these queries are done in TargetInfo - see allowsMisalignedMemoryAccesses, getStackCookieLocation, GetAddrModeArguments. |
include/llvm/Target/TargetLowering.h | ||
---|---|---|
980 ↗ | (On Diff #21355) | Please name AllocaSize -> MinAllocaSize to make its purpose clearer. |
lib/CodeGen/CodeGenPrepare.cpp | ||
1235 ↗ | (On Diff #21355) | Don't initialize these here; that makes it clear that they must be set by the TLI call, and allows mistakes to be caught by tools that detect use of undefined values (valgrind, etc.). |
1236 ↗ | (On Diff #21355) |
I disagree. So long as the values are not initialized above the call, it is clear that the call must be setting them (and returning true if it has done so). This pattern is used by a number of other callbacks, even some in this file (getStackCookieLocation, getPreIndexedAddressParts, getPostIndexedAddressParts, ShrinkDemandedConstant, etc.), and I think it is appropriate here. |
I may be reworking this patch to align not just allocas, but also global variables defined in this translation unit. The ultimate thing I'm trying to optimize is cases like
void fn() { char arr[31]; strcpy(arr, "some string"); // do somthing with arr }
This patch aligns arr, but the variable generated for the string needs to be aligned as well. The plan was to do that in clang, but I'm thinking that maybe doing it here would be better.
Or maybe it would be better to get this checked in then do that, but the name shouldAlignAllocaArgs would have to be changed (maybe shouldAlignPointerArgs).
New patch that addresses review comments. It also renames things slightly in expectation that we'll be also aligning GlobalVariables. I'll be doing that in an upcoming patch.
You're currently doing this only for pointers that are captured as call arguments. Do you care about pointers captured in any other way (by having their address stored somewhere, or returned, for example)?
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
1256 ↗ | (On Diff #21488) | may have improved -> may be able to improve |
Maybe? The original patch did that as it just aligned everything that was used in a FrameIndex instruction. I guess the question is, how much do we care if we over-align things pointlessly? When it's an argument to a memcpy call it's definitely a good idea, but in an argument to some other call or when the address is stored somewhere it may turn out to be pointless (which is possibly an argument that I should be restricting this to only memcpy/move etc. calls).
Agreed.
I don't have a strong opinion here. I suppose I'm inclined to go for other kinds of captures as well, or to stick with just memcpy/move/set. Picking only function calls, which amounts to some, but not all, captures, seems like an inconsistency likely to cause different kinds of abstractions to get different optimizations, and that's likely not good. Let's go to one extreme or the other ;)
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
1252 ↗ | (On Diff #21573) | If this is not an inbounds GEP, then this subtraction could underflow (it could overflow otherwise too, but without inbounds it could underflow even in a case where the code does not have undefined behavior). I recommend making sure the value being subtracted is not greater than the alloc size. Please add a test case for this. |
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
1252 ↗ | (On Diff #21573) | stripAndAccumulateInBoundsConstantOffsets will stop if it hits a non-inbound GEP, so we wouldn't get here in that case. Adding a test sound like a good idea though, I'll do that. |
lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
1252 ↗ | (On Diff #21573) | Yea, but even for an inbounds gep, we should do the right thing (the code may have undefined behavior, but having that reset the alignment of some objects seems a bit silly). |
Attached patch handles size<offset by doing the comparison in a different way (by checking size >= minsize+offset), and adds a bunch of tests of offset handling.
While working on the next part of this (aligning global variables) I realised that we want to 8-byte align on some targets as it's faster, so the attached patch does that.
Looks good as far as I'm concerned. You might want to do the global variable handling as a separate patch.