I do GPU compilers, floating point shenanigans, and vector SIMD stuff.
- User Since
- May 14 2015, 1:11 PM (214 w, 5 d)
Apr 6 2018
this does not feel like the right solution, but even if it is, this will hurt us; we *commonly* have shaders with single basic blocks exceeding 1000 instructions, and splitting them could sabotage scheduling quite badly. in the worst case, it could guarantee spilling, by splitting the block at a point that creates too many live values between the top and bottom.
Feb 8 2018
This looks like it might fix the problem we've been having, but i'm still extremely nervous about the overall concept. this feels like a fantastically large amount of machinery for such a small optimization, including this dubiously defined "isRenamable" flag that applies to operand registers even though renamability seems to be more of a property of the instruction.
Jan 11 2018
Dec 12 2017
got the approval from puyan so i'm gonna push this later unless someone else has some other comments!
Dec 4 2017
LGTM, though it'd be interesting to see if there's any other cases where (FOO (ROTATE pattern) [...]) gets torn apart besides FOO == truncate.
Dec 1 2017
Updated patch and added comments.
Nov 29 2017
Nov 27 2017
Even if they're not redundant, it's something we still don't want.
Two comments on the trunc thing:
Regardless of semantics, this patch almost surely causes a major problem for us.
turkey week is over pls glance at patch, thank
Nov 20 2017
Nov 14 2017
Closing to start a real patch review.
Nov 9 2017
what is the purpose of this transform? why is the new form considered more canonical?
Nov 2 2017
Nov 1 2017
Oct 30 2017
Oh and of course because N-ary uses SCEV it cannot possibly handle floating point without a huge extension of SCEV, I think.
Hmm, reading through N-ary reassociate, it looks like it has similar goals but is not completely overlapping in action, for better or worse.
Oct 27 2017
Oct 26 2017
Mar 14 2017
Mar 13 2017
Feb 7 2017
We use this function but we don't use the addrspace argument.
Jan 31 2017
That really surprises me that it's faster! I would expect SFU functions like RCP/RSQRT to dwarf the cost of a multiply, especially for double.
Don't be too embarrassed; when we switched internally from an rcp(rsqrt(x)) expansion to x * rsqrt(x), we *also* completely missed this, and literally only found the bug when it showed up as internal test failures. The new expansion was even brought up and talked about at a team meeting and signed off by multiple people, and nobody thought to consider the number zero, myself included.
Jan 30 2017
afaik, x * rsqrt(x) is wrong when x is zero (it gives NaN instead of 0). we use x * rsqrt(x) for our expansion, but we have to use an extra select_cc to handle the zero special case.
Jan 18 2017
Confirmed this fixes the problem I was having.
Nov 14 2016
Thanks for looking into this painful little bundle of nested semantic problems.
Oct 28 2016
Might it be better to just bail if ShiftBits is zero? This seems like a classic case of "trying to optimize a node that itself is going to disappear when it gets combine()'d", so maybe we should just not do this "optimization" if the shift is no shift at all.
Aug 30 2016
Sadly, the original version of this pass was *very* much not designed for the case of loading the same location multiple times in a single basic block... as you can probably tell.
Aug 1 2016
Closing for now; it looks like this can interfere with address mode folding (e.g. on x86_64 where 32->64 zext is free), even though it makes logical sense.
Jul 31 2016
Jul 20 2016
This would work with 'sub' too, right?
Jun 19 2016
Is this correct? I thought "nnan" on an instruction meant that we can optimize it assuming its inputs and outputs aren't NaN -- not that we can assume *for other instructions, that aren't fast-math* that *their inputs* aren't NaN.
Jun 17 2016
As the author of some of that code, it is quite possible the code is just not well-written and can be done better ;-)
Jun 14 2016
Make sure to check that the alias analyses are set up properly in the TM; this bit me when I implemented this out of tree (i.e. confirm that the AA queries are succeeding).
Jun 8 2016
x = NaN
y = 0
fmin(x, fmax(x, y)) -> x ?
fmin(x, fmax(NaN, 0)) -> x ?
fmin(NaN, 0) -> x ?
fmin(NaN, 0) -> 0
0 != x
Jun 6 2016
My intuition would be -- if loading using the texture cache doesn't change the result, but rather is just a performance thing, that would seem to be something you'd set with metadata on the instruction, right?
May 4 2016
This is a nice solution -- instead of worrying too much about LLVM float semantic definitions, just let the target say how it implements things.
Apr 29 2016
Some might be, as this code is fairly old. Here’s an extremely blind copy paste of our code:
Apr 28 2016
LGTM. We actually do this out of tree in our backend for ADD, SUB, MUL, SHL (since we'd rather truncate inputs than do a larger op and truncate that), so it seems more than reasonable to me.
Apr 22 2016
As far as I know -- "Custom" means something is not legal, but has a custom lowering. It has nothing to do with how it's implemented on the target; it just means that the target has a custom legalization hook (as opposed to Expand or such, which means it doesn't have any custom hook).
Our target has a legal FABS (it is in fact, free, as it's a modifier). But it's implemented as FADD DST, SRC.ABS, -0.0, which may modify bits other than the top bit.
Apr 6 2016
Apr 4 2016
Poking this thread.
Apr 1 2016
I am slightly afraid to silently modify the behavior of MaxCount to affect full unrolling (when it didn't before) because out of tree users may be using it.
Added commandline options and a test.
Ironically it looks like MaxCount itself doesn't even have a commandline option either... should I introduce one for both? :/ it feels wrong to bloat the commandline options like this, I guess
For your target there is a case when full unroll fails and then partial fails as well. And when patch applied partial unroll is successful. Right? So you can create target test.
Adding full context.
Yes, I understand that, but what I mean is, the bug won't trigger because we *will* get to the path that corrects Count because UnrolledSize > UP.PartialThreshold with Count = 9, and then it'll lower Count to 3. So we won't end up with a Count that isn't evenly dividing into 9.
I don't *think* it's possible for that problem to happen under normal circumstances without FullUnrollMaxCount being set. Normally, the only way we can fail to do full unrolling is if the full unroll cost is greater than Threshold. In this case, we'll go try partial unrolling, and we'll also be above the partial threshold (since a partial threshold higher than Threshold doesn't make sense), and it'll take the path that makes the count modulo-correct.
What sort of test case should I do for this (which behavior are you looking to test)? Should I make a commandline option so that this can easily be tested via 'opt'?
Mar 31 2016
Going by the code, Count is emphatically not what I want:
That's what UP.Count is intended to do: let the target suggest an unroll count it deems reasonable.
Ugh, my mistake, I missed one hunk when uploading the diff; the hunk where the variable was used! My mistake.
FullUnrollMaxCount isn't used anywhere because it's a TTI option used by our out of tree target.
Mar 29 2016
Mar 22 2016
Thanks for catching this off-by-one.
Mar 15 2016
Ah okay, I think now (looking at this a bit more) I understand why this cost keeps showing up in the profile. Like you said, it also happens during the main InstCombine run as well. Specifically, we have lots of loads and intrinsics with all constant operands that can't be folded (since they're runtime, not compile-time constants). So when running ConstantFoldInstruction, it's forced to call SimplifyConstantExpr on every operand, even though that ends up doing absolutely nothing. The results then get thrown away because you can't constant-fold a load to something whose value isn't known at compile time.
Yeah, I don't understand that part either!
Why not do the following:
- Drop instructions that are dead.
- Check to see if any ConstExpr operands are foldable, and update the instruction
- If constant, fold the constant. [no longer needs to consider constexprs]
- Add to the initial instcombine worklist.
To try to be more clear, this code appears to:
I believe this code is in the initial portion of instcombine that creates the initial worklist; it isn't run iteratively at all, at least if I remember right.
Updated diff with some context.
What git option do I use to get a diff with full context?
Mar 14 2016
Added a test to make sure that resizing is enough to insert N elements without reallocation for a variety of sizes.
Mar 12 2016