Do you know if there is a stable contract in IR to tell whether an argument is in fact callee-owned memory?
@efriedma I think coroutine suspend wouldn't make the alloca really 'may alias'. We are just lying to the MemCpyOpt pass.
Thu, Jun 10
This doesn't seem right. A byval argument is essentially a local variable. We should treat it the same way as we would an alloca. If it's live across the first suspend point, it should become part of the coroutine frame.
I don't see any obvious issue here. I'd like to work through the known miscompiles before we start expanding this, though.
Which disables NeverAlign padding (0 bytes)
Defining the value used to establish a control dependency, e.g. the load that later writes depend on (the kernel only defines writes to be control-dependently ordered).
Wed, Jun 9
doing that "cheap enough" has me slightly stuck.
Looks like a real issue for strides that aren't a power of two.
You could break __builtin_load_with_control_dependency(x) into something like __builtin_control_dependency(READ_ONCE(x)). I don't think any transforms will touch that in practice, even if it isn't theoretically sound. The rest of my suggestion still applies to that form, I think. The key point is that the compiler just needs to ensure some branch consumes the loaded value; it doesn't matter which branch it is.
I don't like using metadata like this. Dropping metadata should generally preserve the semantics of the code.
Maybe add a testcase for splat of a constant? Not sure what the code generation should look like, but it would be good to have coverage.
Tue, Jun 8
In practice, the frozen element won't be used in most cases; the middle-end's demanded-elements analysis will trigger instcombine to almost always remove the freeze.
Looks like some combine is incorrectly producing X86ISD::VTRUNCS? Probably just exposed by this patch.
If I'm understanding correctly, the issue you're describing is that if %len is ULONG_MAX in the following loop, on trunk LLVM, the backedge-taken count comes out to zero. This is because of the math in ScalarEvolution::computeBECount: we try to compute ceil(a/b) by converting it to floor((a + b - 1) / b), which doesn't work because a + b - 1 can wrap. There are different ways we could address this. For example, we could change around the computation: we can compute ceil(a/b) by instead expanding it to floor(a/b) + umin(a % b, 1).
I noted the cases where it looks like the undef->poison change might actually impact code using compiler intrinsic functions that have external specifications. The relevant specifications say the elements in question are "undefined", without really specifying what that means.
We could change compiler-rt to expose the relevant routines if we wanted to. See https://reviews.llvm.org/D43106 etc.
LGTM. Making the order of constructors independent of UseInitArray seems obviously good in any case.
That approach looks better. But I think you missed a use of Idx in the magic_cst path?
Mon, Jun 7
Are there any existing optimizations that might be affected by this? In particular, I think GlobalOpt implicitly reorders functions in global_ctors.
Sun, Jun 6
I don't think the revised change really solves the problem. The reason you end up with dead instructions is the "return nullptr;" at the end of the function, around line 402. The new code doesn't seem related to that. If we narrow the cases where we generate the lshr+and, the infinite loop might trigger less frequently, but it doesn't really solve the general problem.
Fri, Jun 4
It's the right diff. The initial version only touched the backedge calculation, but it turned out to be sort of tied together with condition handling because the backedge calculation queries it. I'll update the description.
I'm happy to take some time to discuss this; if we urgently need some other solution for the branch, we can take a more targeted approach.
Also perform ptrtoint conversion when analyzing implied conditions, for consistency with backedge taken analysis. This fixes some minor diffs in the previous version of the patch.
For pointers where the index type is smaller than the pointer type, coming up with the right result is a little tricky. As far as I can tell from the LangRef rules, icmp compares all the bits of the pointer, not just the index bits? So I guess we can special-case comparisons where both sides have the same pointer base: the non-index bits will be the same. If both sides have the same pointer base, and we've inferred some sort of nowrap, we can then convert the icmp to compare the index bits. I guess we can handle non-integral pointers the same way.
The two-instruction sequence leaves the bits in the right positions for a <4 x i16>. If you need a <4 x i32>, you need another zip. If you need sign-extension, you need to sshr the result or something like that. So

  %x = load <4 x i8>, <4 x i8>* %a
  %y = sext <4 x i8> %x to <4 x i32>

would be four instructions total.
Thu, Jun 3
And while we don't have a load instruction that supports this
LGTM with one minor comment.
I wouldn't recommend using AtomicExpansionKind::LLSC for new code. It's been a source of problems on other targets that use/used it: most targets have a forward progress rule that imposes restrictions beyond the ll/sc instructions themselves, and normal code generation can violate them. For example, fast regalloc can insert spills inside the ll/sc loop, or the basic block layout could be rearranged. I think the only target that hasn't run into issues is Hexagon.
I'm a little nervous about using C type merging in C++; it's not designed to support C++ types, so it might end up crashing or something like that. I think I'd prefer to explicitly convert enum types to their underlying type.
Constant::isNullValue() is supposed to mean "is this value all zero bits"; various parts of the code use it that way. It shouldn't care whether the value is semantically equal to zero. Anything doing floating-point arithmetic should be using the appropriate ConstantFP/APFloat methods.
If Constant::isNullValue() isn't returning the right result, please fix it directly. Getting it wrong could have other unexpected consequences.
Wed, Jun 2
I just pushed rGd8b9ed72ee83, a testcase affected by this patch.
Can we add a few end-to-end tests of bool svdupq with constant operands to acle_sve_dupq.c? The pattern matching to create ptrue seems a bit fragile, so I want to make sure we don't break it by accident.
Tue, Jun 1
I'm not sure this is the right fix.
I found some discussion at https://patchwork.kernel.org/project/linux-arm-kernel/patch/20180930024904.14240-1-Jason@zx2c4.com/ . Apparently, the kernel is actually intentionally building for "ARMv3M" by default, to support some ancient hardware that doesn't support all the armv4 instructions. LLVM doesn't support this at the moment. We shouldn't accept -march=armv3m unless we're actually going to generate correct code for it.
I think I'm fine with the general direction, but it would be nice to have a CMake option to force the headers for all targets to be built if someone needs them. For static analysis or something like that, I can imagine someone building LLVM without any backends at all.
Mon, May 31
Oh, hmm, didn't realize the transform could still fail at that point. Need to fix the code to avoid creating the mask instruction if the transform fails. @nathanchance, feel free to revert in the meantime.
Fri, May 28
If a swifttailcc function calls another swifttailcc function in a non-tail position, the proposed rule is that the call has to be marked notail? Not sure I understand the justification here. Is there some reason we need to forbid sibcall optimization of swifttailcc functions?
I looked, and apparently we do have an AMDGPU test for xchg; if it isn't impacted, that's fine, I guess.
The transform makes sense on targets that don't have atomic operations on floating-point registers, but that isn't all targets. In particular, the GPU targets have floating-point atomic operations, and bitcasting like this might get in the way of the natural lowering there.
Thu, May 27
Do we really need a dedicated LLVM intrinsic to make the pattern-matching work here? It would be better if we could leverage @llvm.experimental.vector.insert.nxv16i8.v16i8 or something like that. Something along the lines of https://godbolt.org/z/Wz4azzKrP seems straightforward enough to match, and generates decent code even without any special patterns.
I don't really like unconditionally messing with the type of the atomic operation; not sure what kind of impact that will have.
I'd like to see a little more test coverage, given all the different cases where the behavior of the transform changed.
SCEVLoopAddRecRewriter was originally added for use by polly, and has only ever been used by polly. So the "original intent" is just whatever polly needed, I think.