- User Since
- Aug 10 2016, 1:07 PM (302 w, 2 d)
Concretely, my preferred solution looks something like:
ldr+op+swp still isn't atomic. For each point in the code, please try the exercise of "what if my code is interrupted here"?
I mean that ActOnPragmaFEnvAccess shouldn't call hasRoundingModeOverride()/setRoundingModeOverride(), and setRoundingMode shouldn't call getAllowFEnvAccess().
I think I'd prefer to start with the cases that don't require fast-math flags, then expand the transform as needed to include the cases that do require them.
The other alternative I considered was to not generate atomic_fetch_nand_* symbols at all if c11_atomic_fetch_nand isn't available
This doesn't make sense to me. If clang and LLVM disagree about the layout of a type, you are going to run into trouble. If you haven't yet, it's a matter of time.
For the table lookup, is there an algorithm for creating the special constant that is used in the multiply? Or would we just hardcode known constants for common sizes.
For FENV_ROUND, I think we should try to stick to the standard as closely as possible in Sema; since the standard models FENV_ROUND as a separate state, Sema should also model it as a separate state. If CodeGen decides it needs to modify the FP environment to actually implement the standard's rules, it can change the way it generates code to match. (The example in 7.6.2p5 is just pseudo-code; any internal state changes should be transparent to the user.)
The current version doesn't have the right semantics: if the pointer is lock-free, you have to use a lock-free implementation (e.g. a cmpxchg loop).
as for the second I'm not sure when it would be profitable to transform back and emit the table
Wed, May 25
In summary, the cost of a saturating fptosi vs. plain fptosi+min+max varies so much across targets that that we need two canonical forms: one for targets where saturating fptosi with the given types is cheap, and one for targets where it isn't. I guess that conclusion is fine, but we should state it in a comment in the code, so it's clear why we're cost-modeling this.
Could you lay out the expected interaction between "STDC FENV_ACCESS", "clang fp exceptions", "float_control", and "fenv_access"? If there's some way to map everything to "#pragma clang fp", please lay that out; if that isn't possible, please explain why.
It might make sense to take a step back, and look again at how ASan works here.
Tue, May 24
Mon, May 23
For scheduling, the AArch64 backend overrides "isSchedulingBoundary()" for SEH instructions.
Sun, May 22
Looking at the test deltas I wonder if it be valid to combine (sext_inreg (freeze (aextload X))) -> (freeze (sextload X)).
Fri, May 20
Did you consider instead of adding .seh_endepilogue_nop, just require writing .seh_nop; .seh_endepilogue? Then we can automatically merge a tailing nop with the "end" opcode.
Thu, May 19
uxtw #3 makes me think you're not generating the right code... the zero-extension hopefully doesn't matter, but the shift is significant. Maybe should be generating getelementptr i8?
This makes the names and conventions here seem very confusing. The function is called WidenVecRes_BinaryCanTrap, but we're dealing with operations that can't actually trap. And it now has two different ways of unrolling an operation, depending on whether we're generating a libcall: the existing codepath uses smaller vectors, and the new one completely unrolls.
Wed, May 18
I don't think we reached a consensus this is the right approach; please ping the Discourse thread.
I don't really understand how this is supposed to interact with D125292; even if you strip the readnone attribute from the call instruction, we'll still treat a call to the intrinsic as readnone.
Not sure how much I like the rule about "externally visible functions"... I mean, I guess restricting the checking to external functions avoids triggering on a bunch of cases that would be difficult to handle, but it doesn't really seem self-consistent.
setMinCmpXchgSizeInBits(32) should fix the i8/i16 testcases, I think.
There's a continual struggle between pushing towards canonical IR, vs, pushing towards things we think are cheap on some specific target. Normally, the way we've resolved that is by distinguishing "early" vs. "late" optimizations: we try to push towards a canonical form early on to make the code easier to analyze, then we start doing optimizations like vectorization etc. using target-specific heuristics. AggressiveInstCombine doesn't really have anything to do with "early" vs. "late"; if we want something that runs just before vectorization, we should probably add a dedicated pass.
Seems to match what other targets do, so I guess it's fine. Not sure what you're planning on doing with this, though...
On a target that doesn't support SMP, you don't need memory barriers, sure. (I think we'd want a CMake flag to explicitly assert that you're only going to run the code on chips without SMP.)
Tue, May 17
(This currently doesn't apply cleanly.)
It isn't enough that AddRec is nowrap; you have to prove that the initial value of the AddRec was produced by a nowrap add.
Specifically detecting this exact case doesn't really seem very useful... we should be leveraging some sort of range analysis (known bits, CorrelatedValuePropagation, etc.)
Mon, May 16
.seh_endprologue_fragment looks fine.
Can you add a testcase that has both a vpush and prologue folding?
It looks like this doesn't include any directive to set the "F" bit in xdata, or the "Condition" field in epilogue scopes. Is the omission intentional? (I assume your LLVM patch won't emit sequences that would require either of those, but that might change, and people writing assembly might want them.)
I assume it's illegal to use prologue/epilogue folding with R==1? Should we try to diagnose this?
Please add a testcase with VFP register spills. (Do we push them before FP? Does getMaxFPOffset need to account for that?)
https://github.com/llvm/llvm-project/issues/29472 was never fixed; whatever issues exist with -moutline-atomics also exist with -mno-outline-atomics. (I don't think anyone has run into any practical issues with this, so it hasn't been a priority for anyone.)
Sun, May 15
Can we just rearrange the order we save registers on the stack? I mean, generating an extra push instruction just to satisfy the unwinder is sort of silly, but it seems like the least bad alternative. (And we only need to do it in cases involving dynamic allocation.)
I think the placement new call needs to be in the operator=, not the constructor, to have any effect.
Fri, May 13
Thu, May 12
Fix comments/formatting. Fix generated code for varargs functions with more than 4 fixed arguments.
Fix v15/v16 confusion
Update more places that check for IMAGE_FILE_MACHINE_ARM64
Updated messed up formatting?
If we'd want to use this ABI mode for mingw targets too [...], I guess that would have to be another separate target like ...-gnu_arm64ec - using up yet another entry in enum EnvironmentType for it.
There definitely can't be a gap between the homing area and the stack arguments. Are we sure there's a gap between the homing area and the saved registers, as opposed to putting the saved registers flush against the homing area?