- User Since
- Jun 26 2015, 5:48 AM (238 w, 22 h)
May 23 2018
Thanks, Craig. LGTM.
Hi Craig, just one comment on the details. Everything else looks good.
May 22 2018
I agree with the changes in x86intrin.h and immintrin.h. For the others, I question whether we ought to recommend inclusion of x86intrin.h or immintrin.h. The distinction as I understand it is that immintrin.h is used for Intel-specific intrinsics while x86intrin.h covers all intrinsics for x86-based architectures.
A bit of history: In icc, the f16<=>f32 conversion intrinsics are a bit of an anomaly in that they can be implemented using either native code or emulation code based on the target architecture switch. See https://godbolt.org/g/bQy7xY (thanks, Craig, for the example code). The emulation code lives in the Intel Math Library.
May 21 2018
I had actually done that as part of my initial review, so LGTM.
May 18 2018
Looks right to me, Craig.
May 16 2018
This looks good to me, Craig. I am not worried about the constant folding issue, as I think constant folding these conversion intrinsics (assuming round-to-nearest) is a perfectly valid optimization in the absence of FENV_ACCESS. (FWIW, we don't do this constant folding in icc, but that is only because we have never gotten around to implementing it.) I am also not worried about the spurious 'inexact' exceptions that the changes to the mask intrinsics will cause.
Looks good, Craig. And thanks for the explanation on the 64-bit cases. I had forgotten about the VCVTTSx2USI & VCVTUSI2Sx instructions, which we appear to be generating already.
May 15 2018
Could you please add full context to the patch, Craig?
Mar 16 2018
Thanks, Craig. I noticed it with gcc 4.8.5.
Feb 1 2018
This looks like a nice change.
Nov 3 2017
This looks reasonable to me, Craig. For testing, the obvious thing would be to test the effect on the LoopVectorizer, since it is the primary consumer of this information.
Oct 24 2017
Oct 23 2017
Just one minor suggestion. Otherwise, this LGTM also.
Oct 6 2017
After some internal discussions, we suspect that Sanjay's counter-example may have undefined behavior according to the C standard.
Sep 25 2017
I'd recommend going ahead with this, Craig. Even if it isn't possible for this bug to trigger a failure today, it looks like a latent bug that should be fixed.
Sep 19 2017
This change seems obviously correct. The diffs in the tests all look like tweaks in the register allocation behavior and aren't testing the behavior you are fixing. Is it possible to write a test that miscompiles due to the existing bug?
Sep 18 2017
Sep 13 2017
Sep 8 2017
Sep 7 2017
Aug 11 2017
Adding Zvi as an FYI, but I think this is fine to land.
Aug 10 2017
The change looks correct to me modulo Craig's comment, though I suspect he's right that there is a bug in the Intel documentation for FNOP.
Aug 9 2017
Just a minor question on the test changes. Otherwise, this LGTM.
Aug 1 2017
Jul 24 2017
Thanks, Michael. LGTM, again pending @RKSimon's re-review.
Thanks, Michael. This LGTM pending @RKSimon's re-review. But can you please update your sources so that we can see how you merged with https://reviews.llvm.org/D35638, since that will cause conflicts?
Jul 21 2017
Jul 20 2017
Jul 19 2017
Jul 13 2017
Hi Michael, I have one small additional comment. Otherwise, this looks good.
Jul 12 2017
Thanks for the changes, Michael, and thanks for following up on the perf issue in https://bugs.llvm.org/show_bug.cgi?id=33740.
Jul 6 2017
Hi Michael, sorry it took a while for me to get to this. I was on vacation last week.
Jun 21 2017
Thanks, Farhana. LGTM.
Jun 20 2017
Thanks for the fixes, Farhana. Other than a couple typos, this looks good.
May 30 2017
Thanks, Farhana. Please remember to add llvm-commits as a subscriber for LLVM patches. This LGTM unless Simon has additional suggestions.
May 26 2017
May 17 2017
May 10 2017
Mar 15 2017
Not from me. Looks good.
Mar 13 2017
Can we verify how stack is aligned when there is an error code?
Feb 17 2017
Amjad pointed out to me that the incoming alignment to an interrupt handler is only guaranteed to be 0 mod 8, not 8 mod 16 as is the case with the normal x86-64 ABI. HJ mentions this in 26477.
Aligning the stack is certainly the right thing to do. But this isn't the only problem with 26413. I will add a note to the report explaining what I mean.
Jan 19 2017
Jan 6 2017
Jan 4 2017
This all looks pretty good to me, Andy. Regarding
Dec 1 2016
Looks like this was fixed by r288416.
Nov 23 2016
I have no further comments. This LGTM too.
Oct 31 2016
Oct 13 2016
Thanks for taking care of this, Reid! That fix LGTM.
Oct 7 2016
This LGTM, but I'd like you to also get approvals from Michael & Elena before proceeding.
Oct 6 2016
Oct 3 2016
Sep 30 2016
Aug 23 2016
Just to clarify, the optimal result is movzbl 2(%rdi), %r10d, not
movb 2(%rdi), %al; movzbl %al, %r10d (which is the status quo)
xorl %r10d, %r10d; movb 2(%rdi), %r10d (which is what this patch does)
Aug 22 2016
Jul 28 2016
Jul 25 2016
The example Aaron sent in email is a good one:
void func(int, int, int, int) __attribute__((no_caller_saved_registers, cdecl));
Jun 30 2016
I have no objection to this solution. I think it is robust in a functional sense. And we can always change it later if we discover that the worst case MOV32r0 sinking scenario is more common than we think. It would give me the warm fuzzies if we had some experimental evidence to confirm the suspicion that this is a rare case. Do you already have that? If not, maybe it would be a good idea to write a late pass that looks for this kind of pattern and count the number of occurrences on, say, cpu2006?
SETcc r8 xor r32, r32 (or mov $0, r32) movb r32b, r8
And thanks for the performance data! Once this gets committed, I'll have someone run testing on a broader set of workloads.
Jun 28 2016
Jun 21 2016
I wasn't suggesting that Marina's fixes should replace this one. Sorry for that confusion. My point was simply that this is just one piece of a more comprehensive solution to the false dependence problem.
I think this is a good change. The cost of inserting an xor when it is not needed is very small. The cost of failing to insert an xor when it IS needed can be huge.
Jun 17 2016
Your question is a good one, and I don't know the answer. Who knows? That might be the cause of all the failures I was getting.
Thanks for the quick response, Michael!
May 26 2016
The code change looks reasonable to me Kevin.
May 24 2016
Your changes LGTM, Sanjay.
May 23 2016
May 20 2016
May 17 2016
Thanks for the pointer to https://llvm.org/bugs/show_bug.cgi?id=26776, Reid.
I just have a couple other minor suggestions. Otherwise, LGTM.
May 16 2016
May 9 2016
The connection I saw between removing the _chkstk calls and this transformation is that the pushes will touch the stack in a safe order (starting at %esp, which should be safe, and progressing downwards), whereas the original stores might be in an order that starts by touching an address beyond the allocated stack.
Thanks for the comments, Steven! We are still working on fixing the popcnt false dependence problem, but we are planning to abandon this patch.
May 6 2016
May 4 2016
Thanks, Elena. LGTM.
May 2 2016
I think we want to make sure that we move in a direction that makes it easier to do optimizations that affect the CFI between X86FrameLowering and this late pass. For example, we cannot schedule the pushes generated by the X86CallFrameOptimization pass without moving the CFI along with the push. So we generate very poor code in cases like this where the push operands get in the way of outgoing inreg arguments:
Apr 29 2016
Thanks for the reviews, Reid and Hans!
Apr 28 2016
Apr 27 2016
LGTM too, Kevin.
Apr 26 2016
Thanks for the fixes, Andy.
My comments about vzeroupper make me wonder whether we want skipFunction to be able to make the distinction between skipping a pass for OptBisect and skipping a pass for -O0. I can certainly imagine wanting to run some functionally optional "optimization" passes at -O0, e.g. a pass that can improve the size/performance of the generated code w/o negatively affecting debuggability.