Thu, Jul 18
It's worth noting here in the review that this patch depends on the dereferenceable attribute (see D64205), and that attribute could change meaning as part of the larger changes related to the Attributor pass (D63243).
Based on current definitions, I think this is correct and allowable, so LGTM.
Just the add/sub opcodes for now.
If we focused just on PR40483 for now, do we just need X86ISD ADD + SUB (ADC + SBB)?
Wed, Jul 17
Add a test with no 'dereferenceable' attribute on the pointer argument.
I think you are missing the negative test without dereferenceable.
Ping * 2.
Tue, Jul 16
- Add limitation based on dereferenceable attribute to prevent information loss.
- Add/adjust tests to include dereferenceable attributes.
Could we check here if the base pointer has dereferenceable annotation and use that as a condition for this transformation? (It's more complicated to be completely lossless but this seems to be an easy to test starting point).
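To make the suggested condition concrete, here is a minimal sketch of the gating check. It is a standalone toy model, not LLVM code: `PointerArg` and `mayNarrowLoad` are hypothetical names, and the assumption is that the transform should only fire when the attribute already covers the full original access.

```cpp
#include <cstdint>

// Hypothetical stand-in for a pointer argument; not an LLVM type.
struct PointerArg {
  // Bytes promised by a 'dereferenceable(N)' attribute; 0 if it is absent.
  std::uint64_t DerefBytes;
};

// Returns true if a load of WideBytes through Arg may be narrowed without
// losing information, i.e. the attribute already proves that the whole
// original access is dereferenceable.
bool mayNarrowLoad(const PointerArg &Arg, std::uint64_t WideBytes) {
  return Arg.DerefBytes >= WideBytes;
}
```

With this check, a pointer without the attribute (or with too few bytes) leaves the wide load alone, which is the case the negative test is meant to cover.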
Patch updated - no functional changes from the previous draft:
- Move local variable for NumElts closer to uses.
- Add TODO comment about handling bigger-than-256-bit types.
Are the multiply test changes due to the flags being used by seto? But seto usage should never be in danger of creating the instruction duplication we're seeing in the motivating case. It does look like we're getting an improvement on those tests, but not for the reason we're selecting LEA.
Allow 256-bit reductions by extracting and using 1 more 128-bit hop.
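For readers unfamiliar with the "hop" wording, a scalar sketch of the shape of such a reduction follows. It assumes an integer add reduction over eight 32-bit lanes purely for illustration; the patch may handle a different reduction kind, and `reduceAdd256` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of reducing a 256-bit vector (8 x i32) with one extra hop.
std::uint32_t reduceAdd256(const std::array<std::uint32_t, 8> &V) {
  // Extra hop: extract the upper 128-bit half and combine it with the lower.
  std::array<std::uint32_t, 4> Lo{};
  for (std::size_t I = 0; I < 4; ++I)
    Lo[I] = V[I] + V[I + 4];
  // The usual 128-bit hops: halve the number of live lanes each step.
  Lo[0] += Lo[2];
  Lo[1] += Lo[3];
  return Lo[0] + Lo[1];
}
```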
Mon, Jul 15
Early exit if wrong types or subtarget.
- Check flag uses to avoid unintended transform.
- Add TODO comment about && vs. ||.
Sun, Jul 14
Sat, Jul 13
Fri, Jul 12
I'm not sure that doing this at the IR level is the best idea. The problem is that when we narrow, we lose the dereferenceable fact about part of the memory access. This can in turn limit other transforms which would have been profitable. As an example:
a = load <2 x i8>* p
b = load <2 x i8>* (p+1)
sum = a + a + b
Narrowing the b load to i8 loses the fact that the memory location corresponding to b is dereferenceable, which would prevent transforms such as:
a = load <4 x i8>* p
a = 0;
sum = horizontal_sum(a);
(Note: I'm not saying this alternate transform is always profitable. I'm just making a point about lost opportunity.)
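The lost fact can be illustrated with a small standalone model. This is only a sketch: `Access` and `knownDerefBytes` are hypothetical names, and it assumes `p+1` on a `<2 x i8>*` means a byte offset of 2.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record of one memory access relative to a base pointer.
struct Access {
  std::uint64_t Offset; // byte offset from the base
  std::uint64_t Size;   // bytes accessed
};

// Largest N such that every byte in [0, N) is covered by some access,
// i.e. what the accesses alone prove to be dereferenceable from the base.
std::uint64_t knownDerefBytes(std::vector<Access> Accesses) {
  std::sort(Accesses.begin(), Accesses.end(),
            [](const Access &A, const Access &B) { return A.Offset < B.Offset; });
  std::uint64_t End = 0;
  for (const Access &A : Accesses) {
    if (A.Offset > End)
      break; // gap: nothing past End is proven
    End = std::max(End, A.Offset + A.Size);
  }
  return End;
}
```

Under these assumptions the two original loads prove 4 bytes, enough for a single `<4 x i8>` load, while narrowing b to one byte proves only 3.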
Thu, Jul 11
So to answer your question, it's about both code duplication and redundant matching. And while the death by a thousand cuts may be unavoidable, we should still try to not hasten it along unduly...
No code changes - just regenerated the test diffs.
Commandeering to post the rebased patch.
I don't have a good sense of how we make fast-isel speed vs. perf trade-offs, so if anyone else has thoughts about that case, feel free to comment.
If we rebase the tests after rL365711, I don't see any regressions. Not sure if we're getting all of the optimizations that were intended, but this patch seems safe to commit now.
Wed, Jul 10
Also, would it make sense to separate readable from writable? We currently have this bug where LLVM will promote all const static globals to rodata, and sometimes generate atomic cmpxchg to them (e.g. because we're trying to load a 128-bit value). Similarly, we might want to honor R / W memory protection in general. Right now dereferenceable just means "you can load from this", because we can't speculate most stores.
I do not understand the problem but I have the feeling this is an orthogonal issue.
mprotect can make memory readable but not writable, or writable but not readable... or neither. What does dereferenceable mean when faced with this fact? Further, what happens to dereferenceable when mprotect is called (any opaque function could call it)? I don't think this is an orthogonal problem at all.
So, I guess what the above means is "dereferenceable" is too coarse grained. We have "global dereferenceability" that cannot be changed, and we have "local dereferenceability" that can be changed, e.g., through calls to free, realloc, or mprotect. From accesses we can only deduce "local dereferenceability". Now, that is why we need D61652, or more precisely, D63243. Once those changes have landed, the reasoning introduced in this patch should be fine; before that, it is as broken as Clang is when it emits dereferenceable for arguments passed by reference. (The logic above, with the same problems and more, is also used in ArgumentPromotion right now...).
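The readable-versus-writable distinction raised in this thread can be demonstrated outside the compiler. The sketch below assumes a POSIX system with anonymous mappings (`MAP_ANONYMOUS`); `readAfterDroppingWrite` is a hypothetical helper name. It shows that a page stays loadable after write permission is dropped, so "you can load from this" says nothing about stores.

```cpp
#include <sys/mman.h>
#include <unistd.h>

// Maps one page, writes a byte, drops write permission with mprotect, and
// reads the byte back. Returns the byte read, or -1 if a system call fails.
int readAfterDroppingWrite() {
  const long PageSize = sysconf(_SC_PAGESIZE);
  if (PageSize <= 0)
    return -1;
  void *Mem = mmap(nullptr, static_cast<size_t>(PageSize),
                   PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (Mem == MAP_FAILED)
    return -1;
  auto *Bytes = static_cast<unsigned char *>(Mem);
  Bytes[0] = 42; // allowed: the page is currently writable
  int Result = -1;
  if (mprotect(Mem, static_cast<size_t>(PageSize), PROT_READ) == 0)
    Result = Bytes[0]; // still loadable; a store here would fault
  munmap(Mem, static_cast<size_t>(PageSize));
  return Result;
}
```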
- Is there a better cost query than checking if the target has a vector register ( TTI->getRegisterBitWidth(true) ) that exceeds the load size?
- Do we require that multiple scalar loads are subsumed by the vector load?
Add TODO code comment about using "isSimple()" and add test with an atomic load.
Right now the control flow isn't clever, but I wonder if, as this analysis becomes more powerful, it'll have to act differently when -fno-delete-null-pointer-checks is specified? Is there a simple test that you can add to make sure null pointer checks don't cause false assumptions whenever this optimization becomes smarter?
- Allow stores to have the same inferences as loads. This exposed more clang test failures, so those diffs are included.
- Don't infer anything from volatile (non-simple) memory accesses.
- There was a bug in how we dealt with isGuaranteedToTransferExecutionToSuccessor(), so added an assert and a test with a function call to verify that.
- Added code/test for replacing DereferenceableOrNull attribute.
- Added FIXME comment to indicate that this pass should be subsumed by Attributor.
Tue, Jul 9
I do not want to block this patch but I still believe that this is the wrong way to go (middle/long term). The fact that we need to put this not in FunctionAttrs.cpp, where the other deductions live, but in InferFunctionAttrs.cpp, where we so far only annotated library functions, should be a first sign. Also, the functionality here is only one way to deduce dereferenceable, arguably, you want all the ways together such that they can benefit from each other.
Are we really allowed to change the exact flag from InstSimplify?
- Used GetPointerBaseWithConstantOffset() to allow more complex pattern matching.
- But limited that matching to cases where the argument and access have the same size to reduce complexity.
- Generalized variable names and comments to allow less churn for follow-up enhancements.
- Added tests with multiple dereferenceable arguments, pointer casts, and negative offsets.
Mon, Jul 8
Thanks for thinking of me ;) And again, I think this is an important change we need!
The Attributor is in tree and, if enabled, it is run very early (as I very very strongly believe it should). I think we can get the Attributor enabled for the next release (maybe with a low iteration count and restrictions on the attributes we derive). Now there are two missing parts to get this functionality into the Attributor in a decent way:
1) A generic way to "look around for existing information" (more on this below).
2) The abstract attribute for dereferenceability(_or_null) that makes use of 1) and potentially performs usual deduction.
Implementing 2) is fairly easy. It should not take long to create the boilerplate if we only want to rely on the deduction through 1). Also, the logic is already in this patch (and the old prototype). Regarding 1): I was going to work on this once I found some free cycles but I could do it now if we decide to go this way. The idea is that you specify a program point PP (=instruction) and a callback. The callback is then automatically applied to all instructions that have to be executed when PP is also reached, either before or after. I would like this to be an abstract interface from the get-go but I am also willing to provide the interface and the initial implementation that will at least suffice for this use case. It should then be used from the AbstractAttribute::initialize and AbstractAttribute::updateImpl methods of the abstract attribute for the dereferenceable attribute (and others later as well).
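A rough standalone sketch of the "program point plus callback" idea, restricted to a single basic block, might look like the following. Nothing here is the Attributor's actual interface: `Inst`, `forEachMustExecute`, and `mustExecuteIds` are hypothetical names, and the `TransfersToSuccessor` flag only mimics what isGuaranteedToTransferExecutionToSuccessor() would answer.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical instruction record; not an LLVM class.
struct Inst {
  int Id;
  bool TransfersToSuccessor; // false for e.g. a call that may not return
};

// Applies Callback to every instruction in Block that must execute
// whenever Block[PP] is reached.
void forEachMustExecute(const std::vector<Inst> &Block, std::size_t PP,
                        const std::function<void(const Inst &)> &Callback) {
  if (PP >= Block.size())
    return;
  // Before (and including) PP: a block is entered at the top, so these ran.
  for (std::size_t I = 0; I <= PP; ++I)
    Callback(Block[I]);
  // After PP: continue only while the current instruction is guaranteed
  // to hand control to the next one.
  for (std::size_t I = PP;
       I + 1 < Block.size() && Block[I].TransfersToSuccessor; ++I)
    Callback(Block[I + 1]);
}

// Convenience wrapper that collects the visited instruction ids.
std::vector<int> mustExecuteIds(const std::vector<Inst> &Block, std::size_t PP) {
  std::vector<int> Ids;
  forEachMustExecute(Block, PP, [&Ids](const Inst &I) { Ids.push_back(I.Id); });
  return Ids;
}
```

An abstract attribute could pass a callback that inspects each visited load or store for dereferenceability facts. A real implementation would also need to look across block boundaries, which this sketch does not attempt.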
Sat, Jul 6
If this transform always returns an existing value, it can go in InstSimplify. Please pre-commit the baseline tests (in the InstSimplify directory and change the RUN line if I got that right).
Fri, Jul 5
The direction of this makes total sense and we will need it. However this shouldn't be here (wrt. the file/pass).
Assuming we want this right now, it should live in FunctionAttrs.cpp. Assuming we want to do it "right" it should become part of the Attributor framework.
The early prototype of the "deref-or-null" abstract attribute already had this functionality, see https://reviews.llvm.org/D59202#C1381429NL1995, and the test case https://reviews.llvm.org/D59202#change-FJbHx7N4s6ye . For the new Attributor, dereferenceable-or-null has not yet been ported and the transfer of "close by information" is not part of the new model. Both things are going to change soon.
I missed diffs in some existing over-reaching clang and AMDGPU tests. These regression tests should not be testing the entire optimization pipeline, but I adjusted the assertions to make them pass.