Fri, Feb 16
Thu, Feb 15
Updated to use llvm::SmallSet instead of std::set.
Please don't forget the diff context (it is missing from this patch).
Wed, Feb 14
I was under the impression that ItinRW is a way to map processor resources to itineraries that have already been defined. This in turn avoids the need to update existing instructions (since they have been defined with a list of itineraries).
Wed, Feb 7
Taking this over to abandon the revision as there are currently no plans to implement instruction fusion for Power9.
Of course, this won't catch late additions such as regclass copies and MachineInstrs we add in various MI-level passes and late pseudo-expansion, so perhaps the better choice would be to catch this in EmitInstruction by getting the MCInstrDesc for the MI and checking the register class there.
I have gotten to PPCInstrInfo.td and wanted to submit the comments rather than delaying the review further until I am completely done with it. I'll continue to review it and add comments over the remainder of the week.
A couple of general notes:
- In retrospect, the review process would have probably been much quicker and easier if this was split up into a bunch of smaller patches. Perhaps the target features and calling convention in the first, then the new register classes and spill/restore bits, then legalization of small groups of instructions along with the associated intrinsics/builtins/tests, etc.
- Considering we have a mutually exclusive situation between FPU and SPE and the sheer prevalence of FPU cores and developers adding code for those, it would probably be a good idea to add some checks that we don't break SPE. What I have in mind is a linear traversal in SDAG post-processing that would ensure that all the MachineSDNodes in the DAG are available on SPE. I think this can simply be done by checking the operand/result register classes to make sure you don't end up with any F4RC or F8RC registers. Of course, this check would only fire on SPE subtargets.
- Is there a Clang portion to follow that defines builtins for all the intrinsics in C/C++? Or do you not plan to target SPE from C/C++?
Tue, Feb 6
Ping. Does anyone on the reviewer list have some time to review this?
We decided not to pursue this further as there are better approaches to aiding shrink wrapping.
Mon, Feb 5
Just wanted to also mention that this passes a double bootstrap build/lit/lnt on PPC.
The PPC code looks way nicer. The alignment is fine as well - unaligned scalar loads are cheap on modern PPC HW and the loads are in bounds. Of course, I have no comment on the X86 code as I don't know enough about the target.
Thank you for fixing the issue with the register class - it has clearly existed in this code before your patch as well. My comments are minor nits.
There is still no test case for XSNMADDADP. Also, since you're introducing pairs of patterns that generate each instruction, can you add a test case for each of the patterns?
Fri, Feb 2
Thank you Jonas. As this no longer needs PPC input, I'm resigning from the review.
There are no current plans to implement fusion. Abandoning this patch.
Have we forgotten about this? Can we decide to either abandon, update or proceed with this patch?
Now that https://reviews.llvm.org/rL323991 has actually landed, can you re-evaluate whether this patch is still applicable and, if so, whether it needs any changes? Also, have you looked into Hal's suggestion?
Can you please re-evaluate the applicability and functionality of this patch in light of related patches that recently landed? For example, does https://reviews.llvm.org/rL323991 affect this patch?
Thu, Feb 1
Wed, Jan 31
This seems like a good thing to do. Would you be able to provide some benchmark data to show the impact?
Benchmarking has shown that we should be equally concerned with any type of call in the case statements (i.e. indirect calls are no worse than direct). Furthermore, large switch statements can be lowered to jump tables without much penalty even if there are calls.
Mon, Jan 29
This also reduces the overall number of register moves slightly. LGTM
Fri, Jan 26
Thu, Jan 25
Jan 21 2018
I've finally gotten around to benchmarking this. I'll try to attach the full summary (warning: it is a large PDF file) and here is a somewhat concise summary of what I've done to benchmark this:
Jan 16 2018
Jan 12 2018
Jan 11 2018
The PPC-specific part I discussed with Uli is in D41856.
Jan 10 2018
Legalize by producing PPC-specific atomic compare and swap ISD nodes.
Updated to use MaskedValueIsZero() instead of a custom function. Thanks for the suggestion Eli.
Jan 9 2018
My plan is to write a test case in projects/test-suite that will test these sub-word atomic compare-and-swap operations with negative values in various contexts (compare input is constant, parameter, loaded value, computed value). If there are no objections to this of course.
Jan 8 2018
Abandoning this as at least one target prefers that the compare input be any-extended and each target decide what to do about the high bits that may differ between the two values.
But that's not what the TLI hook says as I understand it; the TLI hook simply allows the back-end to inform common code how the result of a sub-word ATOMIC_CMP_SWAP will have been extended (either by hardware or by the back-end specific implementation). It doesn't say anything about inputs.
I have a similar understanding of the hook - i.e. how the results of atomic operations are extended. I guess I just see ATOMIC_CMP_SWAP as one where the input should conform to this as well, since it is really a load-compare-store operation, so its input will be compared to a value that is loaded atomically. Namely, its input should have the same high bits as the result of ATOMIC_LOAD.
And it's not clear that inputs really need to be extended in any particular way -- if the back-end has sub-word instructions, those will likely ignore the high bits anyway; and if the back-end creates a compare-and-swap loop on the surrounding word, it may be able (like SystemZ) to implement the necessary comparisons without explicit extensions. So IMO this is best left to each target.
If a target has sub-word instructions for all three operations, I suppose it doesn't care if the loaded value and the comparand have different high bits. In all other cases, the target ultimately has to do the work of ensuring the high bits are set the same way for both values at some point. It seemed to me that legalization would be a logical place for ensuring that work is done, but of course since there are objections, I'll just fix that in the PPC target.
Sure, the operands will have been any-extended from i8/i16 to i32. But the target knows that this happened, because it knows that this is a i8/i16 ATOMIC_CMP_SWAP (via getMemoryVT). In that case, if it actually requires a particular sign- or zero-extended version of the original i8/i16 constant, it can still do the appropriate in-reg extension from the any-extended i32 value it got.
Oh OK, of course. I mentioned in the original patch that I can certainly fix this in the PPC back-end. I imagine that other back-ends have done the same thing one way or another (or have a bug here that they don't know about, just as PPC does). However, I ultimately don't see the utility in type legalization ignoring how the target wants these inputs extended and doing an any-extend when the target will have to pick one down the line. And the target has already stated how it wants atomics extended in that TLI hook.
Fair enough. I agree.
Jan 7 2018
I should start by saying that I think this patch seems to me rather uncontroversial, but I don't think it is adequate to address the original problem (explanation follows).
Jan 5 2018
These comments are not an indication that another review is necessary, but please address them on the commit.
OK, I don't think there's an issue with this - if the vector just happens to be wider, we'll end up needing more of the respective instructions, but it won't turn anything into a library call that wouldn't be a library call otherwise.
Jan 3 2018
I think this is a great start. Thanks Florian.
Jan 2 2018
I think it's really cool that you've started on this and I think this is a great start. Thanks.
Dec 29 2017
Dec 28 2017
Add a test case for the edge case where operand 3 of RLWINM is a zero.
Dec 27 2017
Fix the UB identified by Ben Kramer - thanks Ben.
Dec 20 2017
This has broken all the PowerPC buildbots. I'll pull this patch to get the bots back to green unless you have a fix available. The smallest failing benchmark from the test-suite is sieve.c.
This patch removes half of the pre-inc stores that are actually needed. I imagine that the DAGCombiner is getting rid of these stores before they become pre-inc stores since there's a check for unindexed stores as part of this patch, but I haven't spent the time to really understand what this patch does.
I'm attaching both the IR generated for sieve.c as well as a bugpoint-reduced version of it (the reduction was done by comparing the number of pre-inc stores emitted before the patch and after it).
Dec 19 2017
Dec 18 2017
The update_llc_test_checks.py script seems to have produced a CHECK directive that doesn't actually pass. Fixed the test case.
Dec 15 2017
Dec 14 2017
Committed in https://reviews.llvm.org/rL320791
This causes failures on our build bot. Unfortunately, there was a different failure introduced by a previous commit that hid this. The failures are here: http://lab.llvm.org:8011/builders/clang-ppc64le-linux-multistage/builds/5020.
Dec 13 2017
I've already posted this comment on https://reviews.llvm.org/rL320486:
This commit seems to have broken our build bot. Can you please investigate and fix it? The failure:
Dec 12 2017
Committed in r318044.
Dec 11 2017
Dec 9 2017
Address comments from Eli. Thanks Eli!
Dec 8 2017
Eli, thanks so much for your comments. Please keep them coming. I'm not trying to be difficult about any of this - I just want to make sure the final implementation does the right thing both for compile time and run time.
I probably should have noted in the description that this provides significant run time improvements on some important workloads.