
Apr 6 2018

escha added a comment to D44814: [CodeGenPrepare] Split huge basic blocks for faster compilation.

this does not feel like the right solution, but even if it is, this will hurt us; we *commonly* have shaders with single basic blocks exceeding 1000 instructions, and splitting them could sabotage scheduling quite badly. in the worst case, it could guarantee spilling, by splitting the block at a point that creates too many live values between the top and bottom.

Apr 6 2018, 12:18 PM

Feb 8 2018

escha added a comment to D43042: [MachineOperand][Target] MachineOperand::isRenamable semantics changes.

This looks like it might fix the problem we've been having, but i'm still extremely nervous about the overall concept. this feels like a fantastically large amount of machinery for such a small optimization, including this dubiously defined "isRenamable" flag that applies to operand registers even though renamability seems to be more of a property of the instruction.

Feb 8 2018, 3:54 PM

Jan 11 2018

escha committed rL322311: [Sink] Really really fix predicate in legality check.
[Sink] Really really fix predicate in legality check
Jan 11 2018, 1:30 PM
escha created D41960: [Sink] Really really fix predicate in legality check.
Jan 11 2018, 12:58 PM

Dec 12 2017

escha committed rL320515: Reassociate: add global reassociation algorithm.
Reassociate: add global reassociation algorithm
Dec 12 2017, 11:18 AM
escha added a comment to D40049: [PATCH] Global reassociation for improved CSE.

got the approval from puyan so i'm gonna push this later unless someone else has some other comments!

Dec 12 2017, 8:25 AM

Dec 4 2017

escha added inline comments to D40049: [PATCH] Global reassociation for improved CSE.
Dec 4 2017, 12:52 PM
escha added inline comments to D40049: [PATCH] Global reassociation for improved CSE.
Dec 4 2017, 11:17 AM
escha accepted D40792: DAG: Match truncated rotation (PR35487).
Dec 4 2017, 11:08 AM
escha added a comment to D40792: DAG: Match truncated rotation (PR35487).

LGTM, though it'd be interesting to see if there's any other cases where (FOO (ROTATE pattern) [...]) gets torn apart besides FOO == truncate.

Dec 4 2017, 11:04 AM

Dec 1 2017

escha updated the diff for D40049: [PATCH] Global reassociation for improved CSE.

Updated patch and added comments.

Dec 1 2017, 1:04 PM

Nov 29 2017

escha added a comment to D40049: [PATCH] Global reassociation for improved CSE.

My understanding of 'reassociate' isn't good enough to approve this, but I'm curious about running the main loop multiple times (ReassociateStep) because I noticed an improvement in some other code by running the pass twice.

  1. Can we split that part into a preliminary patch?
  2. Is it too expensive in compile time to run that to a fixed-point (like instcombine)?
Nov 29 2017, 11:13 AM
escha added a comment to D38313: [InstCombine] Introducing Aggressive Instruction Combine pass.
In D38313#937267, @zvi wrote:

Two comments on the trunc thing:

  1. Thank you!!! As a GPU target maintainer, one of my main frustrations is how much LLVM *loves* to generate code that is needlessly too wide when smaller would do. We mostly have avoided this problem due to being float-heavy, but as integer code becomes more important, I absolutely love any chance I can get to reduce 32-bit to 16-bit and save register space accordingly.

Sometimes it's LLVM, and sometimes it's the frontend that is required to extend small typed values before performing operations.

  2. I'm worried about this because the DAG *loves* to eliminate """redundant""" truncates and extensions, even if they're both marked as free. I've accidentally triggered infinite loops many times when trying to trick the DAG into emitting code that keeps intermediate variables small, an extreme example being something like this:
; pseudo-asm
; R1 = *b + (*a & 15);
; R2 = *c + ((*a >> 16) & 15);
load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R0H, R0, 16
and.16 R0L, R0L, 15
and.16 R0H, R0H, 15
add.32 R1, R1, R0L
add.32 R2, R2, R0H

The DAG will usually try to turn this into this:

load.32 R0, [a]
load.32 R1, [b]
load.32 R2, [c]
shr.32 R3, R0, 16
and.32 R0, R0, 15
and.32 R3, R3, 15
add.32 R1, R1, R0
add.32 R2, R2, R3

this is just a hypothetical example but in general this makes me worry from past attempts at experimentation in this realm.
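
A minimal C++ sketch of the two pseudo-asm forms above, only to show that they compute the same values. The names `Sums`, `narrowForm`, `wideForm`, and `demoNarrowVsWide` are hypothetical, and C++ integer promotion means a real compiler is free to widen the 16-bit steps anyway, so this illustrates the semantics rather than the register usage.

```cpp
#include <cassert>
#include <cstdint>

struct Sums { uint32_t r1, r2; };

// Narrow form: the two 4-bit fields are extracted with 16-bit values,
// mirroring and.16 on R0L / R0H, and only widened for the final adds.
Sums narrowForm(uint32_t a, uint32_t b, uint32_t c) {
  uint16_t lo = static_cast<uint16_t>(a);        // R0L
  uint16_t hi = static_cast<uint16_t>(a >> 16);  // shr.32 R0H, R0, 16
  lo = static_cast<uint16_t>(lo & 15);           // and.16 R0L, R0L, 15
  hi = static_cast<uint16_t>(hi & 15);           // and.16 R0H, R0H, 15
  return {b + lo, c + hi};
}

// Wide form: the same computation with every intermediate kept at 32 bits.
Sums wideForm(uint32_t a, uint32_t b, uint32_t c) {
  uint32_t lo = a & 15;          // and.32 R0, R0, 15
  uint32_t hi = (a >> 16) & 15;  // shr.32 + and.32 on R3
  return {b + lo, c + hi};
}

// Both forms agree; the difference is only in how wide the temporaries are.
void demoNarrowVsWide() {
  Sums n = narrowForm(0x00070003u, 10u, 20u);
  Sums w = wideForm(0x00070003u, 10u, 20u);
  assert(n.r1 == w.r1 && n.r2 == w.r2);
}
```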

Not sure I fully understand the concern of this patch, but if the problem is root caused to Instruction Selection, shouldn't we fix it there? If DAGCombiner's elimination of free truncates/extensions is an issue, have you considered predicating the specific combines with TLI hooks?

Nov 29 2017, 12:54 AM

Nov 27 2017

escha added a comment to D37989: InstCombine: Insert missing canonicalizes.

Even if they're not redundant, it's something we still don't want.

Nov 27 2017, 11:42 AM
escha added a comment to D38313: [InstCombine] Introducing Aggressive Instruction Combine pass.

Two comments on the trunc thing:

Nov 27 2017, 11:40 AM
escha added a comment to D37989: InstCombine: Insert missing canonicalizes.

Regardless of semantics, this patch almost surely causes a major problem for us.

Nov 27 2017, 11:25 AM
escha added a comment to D37989: InstCombine: Insert missing canonicalizes.

IEEE 754 rules are that everything canonicalizes except bitwise operations (copy, abs, negate, copysign) and decimal re-encoding operations (which you don't care about).

Nov 27 2017, 11:18 AM
escha added a comment to D40049: [PATCH] Global reassociation for improved CSE.

turkey week is over pls glance at patch, thank

Nov 27 2017, 10:06 AM

Nov 20 2017

escha added a comment to D40049: [PATCH] Global reassociation for improved CSE.

*gentle bump*

Nov 20 2017, 11:58 AM

Nov 14 2017

escha updated the summary of D40049: [PATCH] Global reassociation for improved CSE.
Nov 14 2017, 1:58 PM
escha updated the summary of D40049: [PATCH] Global reassociation for improved CSE.
Nov 14 2017, 1:55 PM
escha created D40049: [PATCH] Global reassociation for improved CSE.
Nov 14 2017, 1:54 PM
escha abandoned D39340: Modifying reassociate for improved CSE.

Closing to start a real patch review.

Nov 14 2017, 1:33 PM

Nov 9 2017

escha added a comment to D39830: [DAGCombine] Transform (A + -2.0*B*C) -> (A - (B+B)*C).

what is the purpose of this transform? why is the new form considered more canonical?

Nov 9 2017, 2:29 PM

Nov 2 2017

escha added a comment to D39340: Modifying reassociate for improved CSE.

Hmm, reading through N-ary reassociate, it looks like it has similar goals but is not completely overlapping in action, for better or worse.

N-ary reassociate runs in dominator order and looks to see if any subexpressions of the current computation have already been computed.

But as far as I understand, it does not consider *reassociating expressions* to find those redundancies. In other words, it doesn't consider the N^2 possible sub-combinations of previous expressions; it only considers them in the order they were emitted. Its main goal seems to be optimizing addressing expressions. So for example, if you have (a+b) and a later expression computes (a+(b+2)), it can find that this is redundant. But if you have (a+(b+1)) and (a+(b+2)) but no (a+b), I don't think it handles that. It also only runs in dominator order, so it can't find arbitrary redundancies.
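
A small C++ sketch of the (a+(b+1)) / (a+(b+2)) case, assuming unsigned 32-bit arithmetic so that reassociation is exact. The function names are hypothetical and this is not code from either pass; it only shows the shared subexpression that appears once both expressions are regrouped.

```cpp
#include <cassert>
#include <cstdint>

// As written: the two expressions have no subexpression in common,
// and (a + b) is never computed anywhere.
uint32_t firstExpr(uint32_t a, uint32_t b)  { return a + (b + 1); }
uint32_t secondExpr(uint32_t a, uint32_t b) { return a + (b + 2); }

struct Pair { uint32_t first, second; };

// After regrouping both as ((a + b) + k), the add t = a + b is shared
// and a later CSE pass would only need to compute it once.
Pair reassociated(uint32_t a, uint32_t b) {
  uint32_t t = a + b;
  return {t + 1, t + 2};
}

// Unsigned arithmetic wraps, so the regrouped values match even on overflow.
void demoReassociate() {
  Pair p = reassociated(0xFFFFFFFFu, 5u);
  assert(p.first == firstExpr(0xFFFFFFFFu, 5u));
  assert(p.second == secondExpr(0xFFFFFFFFu, 5u));
}
```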

Similarly, my change doesn't handle GetElementPtr in the way N-ary reassociate does, since it wasn't targeted at addressing expressions.

So I agree they're redundant, and kind of share similar goals and purposes, but annoyingly I don't see any way to unify them and achieve the goals of both.

Thanks. I suppose it should be possible to extend Reassociate to support GetElementPtr, and then we would get what NaryReassociate provides with your patch? I am not saying this should be part of this patch, I am just trying to understand how we can build on this patch.

Nov 2 2017, 9:44 AM

Nov 1 2017

escha added a comment to D39340: Modifying reassociate for improved CSE.

*poke*

Nov 1 2017, 10:06 AM

Oct 30 2017

escha added a comment to D39340: Modifying reassociate for improved CSE.

Oh and of course because N-ary uses SCEV it cannot possibly handle floating point without a huge extension of SCEV, I think.

Oct 30 2017, 11:33 AM
escha added a comment to D39340: Modifying reassociate for improved CSE.

Hmm, reading through N-ary reassociate, it looks like it has similar goals but is not completely overlapping in action, for better or worse.

Oct 30 2017, 10:20 AM

Oct 27 2017

escha added a comment to D39340: Modifying reassociate for improved CSE.

This seems very similar to what n-ary reassociate wants to do, I'd like to understand why we shouldn't do it there?
(where it seems to fit perfectly)

Oct 27 2017, 12:24 PM
escha added inline comments to D39340: Modifying reassociate for improved CSE.
Oct 27 2017, 10:23 AM

Oct 26 2017

escha updated the summary of D39340: Modifying reassociate for improved CSE.
Oct 26 2017, 12:18 PM
escha updated subscribers of D39340: Modifying reassociate for improved CSE.
Oct 26 2017, 11:54 AM
escha created D39340: Modifying reassociate for improved CSE.
Oct 26 2017, 11:45 AM

Mar 14 2017

escha committed rL297788: MemCpyOptimizer: don't create new addrspace casts.
MemCpyOptimizer: don't create new addrspace casts
Mar 14 2017, 3:50 PM
escha updated the diff for D30902: MemCpyOpt: don't create new addrspacecasts.
Mar 14 2017, 12:17 PM

Mar 13 2017

escha added reviewers for D30902: MemCpyOpt: don't create new addrspacecasts: bogner, qcolombet.
Mar 13 2017, 12:47 PM
escha created D30902: MemCpyOpt: don't create new addrspacecasts.
Mar 13 2017, 11:00 AM

Feb 7 2017

escha added a comment to D29301: TargetLowering: Remove AddrSpace parameter from GetAddrModeArguments.

We use this function but we don't use the addrspace argument.

Feb 7 2017, 12:28 PM

Jan 31 2017

escha added a comment to D28508: [NVPTX] Implement NVPTXTargetLowering::getSqrtEstimate.

That really surprises me that it's faster! I would expect SFU functions like RCP/RSQRT to dwarf the cost of a multiply, especially for double.

Jan 31 2017, 3:31 PM
escha added a comment to D28508: [NVPTX] Implement NVPTXTargetLowering::getSqrtEstimate.

Don't be too embarrassed; when we switched internally from an rcp(rsqrt(x)) expansion to x * rsqrt(x), we *also* completely missed this, and literally only found the bug when it showed up as internal test failures. The new expansion was even brought up and talked about at a team meeting and signed off by multiple people, and nobody thought to consider the number zero, myself included.

Jan 31 2017, 6:49 AM

Jan 30 2017

escha added a comment to D28508: [NVPTX] Implement NVPTXTargetLowering::getSqrtEstimate.

afaik, x * rsqrt(x) is wrong when x is zero (it gives NaN instead of 0). we use x * rsqrt(x) for our expansion, but we have to use an extra select_cc to handle the zero special case.
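
A minimal C++ sketch of the zero case, assuming IEEE 754 float semantics with no fast-math flags. `rsqrtRef` is a hypothetical stand-in for a hardware reciprocal square root, and the ternary plays the role of the extra select; none of these names come from the patch.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for a hardware rsqrt; assumed exact here for illustration.
float rsqrtRef(float x) { return 1.0f / std::sqrt(x); }

// Naive expansion: at x == 0, rsqrt(0) is +inf and 0 * inf is NaN.
float sqrtNaive(float x) { return x * rsqrtRef(x); }

// Guarded expansion: select the input itself when it is zero,
// otherwise use the x * rsqrt(x) form.
float sqrtGuarded(float x) { return x == 0.0f ? x : x * rsqrtRef(x); }

void demoSqrtExpansion() {
  assert(std::isnan(sqrtNaive(0.0f)));   // the bug described above
  assert(sqrtGuarded(0.0f) == 0.0f);     // fixed by the select
  assert(sqrtGuarded(4.0f) == 2.0f);     // normal inputs are unchanged
}
```

Returning the input on the zero path also keeps the sign of -0.0, which matches what sqrt produces for that input.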

Jan 30 2017, 11:46 PM
escha accepted D28792: AMDGPU: Fold fneg into fminnum/fmaxnum.
Jan 30 2017, 9:21 AM
escha accepted D28846: DAG: Fold fneg into compare with constant into the constant.
Jan 30 2017, 9:21 AM

Jan 18 2017

escha added a comment to D28881: LiveIntervalAnalysis: Calculate liveness even if a superreg is reserved.

Confirmed this fixes the problem I was having.

Jan 18 2017, 5:40 PM

Nov 14 2016

escha added a comment to D26648: Clarify semantic of reserved registers.

Thanks for looking into this painful little bundle of nested semantic problems.

Nov 14 2016, 8:17 PM

Oct 28 2016

escha added a comment to D26098: [SelectionDAG] Fix a crash visiting `AND` nodes.

Might it be better to just bail if ShiftBits is zero? This seems like a classic case of "trying to optimize a node that itself is going to disappear when it gets combine()'d", so maybe we should just not do this "optimization" if the shift is no shift at all.

Oct 28 2016, 2:11 PM

Aug 30 2016

escha added a comment to D24057: [LoadStoreVectorizer] Change VectorSet to Vector to match head and tail positions. Resolves PR29148.

Sadly, the original version of this pass was *very* much not designed for the case of loading the same location multiple times in a single basic block... as you can probably tell.

Aug 30 2016, 4:53 PM

Aug 1 2016

escha abandoned D23002: DAGCombiner: check isZExtFree before doing combine.
Aug 1 2016, 12:02 PM
escha added a comment to D23002: DAGCombiner: check isZExtFree before doing combine.

Closing for now; it looks like this can interfere with address mode folding (e.g. on x86_64 where 32->64 zext is free), even though it makes logical sense.

Aug 1 2016, 12:02 PM

Jul 31 2016

escha retitled D23002 to DAGCombiner: check isZExtFree before doing combine.
Jul 31 2016, 11:21 AM

Jul 20 2016

escha added a comment to D22601: [SCCP] Mark constant xor %blah, %blah even if the lattice value is overdefined.

This would work with 'sub' too, right?

Jul 20 2016, 3:02 PM

Jun 19 2016

escha added a comment to D21284: Fold fmin(nnan x, inf) -> x, fmax(nnan x, -inf) -> x, fmax(nnan ninf x, -flt_max) -> x and fmin(nnan ninf x, flt_max) -> x.

Is this correct? I thought "nnan" on an instruction meant that we can optimize it assuming its inputs and outputs aren't NaN -- not that we can assume *for other instructions, that aren't fast-math* that *their inputs* aren't NaN.

Jun 19 2016, 4:22 PM

Jun 17 2016

escha added a comment to D19501: Add LoadStoreVectorizer pass.

As the author of some of that code, it is quite possible the code is just not well-written and can be done better ;-)

Jun 17 2016, 9:53 AM

Jun 14 2016

escha added a comment to D19508: AMDGPU: Run LoadStoreVectorizer pass by default.

Make sure to check that the alias analyses are set up properly in the TM; this bit me when I implemented this out of tree (i.e. confirm that the AA queries are succeeding).

Jun 14 2016, 10:52 AM

Jun 8 2016

escha added a comment to D21137: Instcombile min/max intrinsics calls.

x = NaN

y = 0

fmin(x, fmax(x, y)) -> x ?
fmin(x, fmax(NaN, 0)) -> x ?
fmin(NaN, 0) -> x ?
fmin(NaN, 0) -> 0
0 != x
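
The counterexample above can be checked with the C++ standard library, assuming `std::fmin` and `std::fmax` follow the usual rule of returning the non-NaN operand when exactly one operand is NaN. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// The expression that the proposed fold would replace with plain x.
double minOfMax(double x, double y) { return std::fmin(x, std::fmax(x, y)); }

void demoMinMaxFold() {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  // fmax(NaN, 0) is 0, then fmin(NaN, 0) is 0, so the result is 0 and not x.
  double r = minOfMax(nan, 0.0);
  assert(!std::isnan(r));
  assert(r == 0.0);
}
```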

Jun 8 2016, 8:25 AM

Jun 6 2016

escha added a comment to D19501: Add LoadStoreVectorizer pass.

My intuition would be -- if loading using the texture cache doesn't change the result, but rather is just a performance thing, that would seem to be something you'd set with metadata on the instruction, right?

Jun 6 2016, 7:38 AM

May 4 2016

escha added a comment to D19391: transform obscured FP sign bit ops into a fabs/fneg using TLI hook.

This is a nice solution -- instead of worrying too much about LLVM float semantic definitions, just let the target say how it implements things.

May 4 2016, 12:27 PM

Apr 29 2016

escha added a comment to D19453: DAGCombiner: Reduce truncated shl width.

Some might be, as this code is fairly old. Here’s an extremely blind copy paste of our code:

Apr 29 2016, 1:00 PM

Apr 28 2016

escha added a comment to D19453: DAGCombiner: Reduce truncated shl width.

LGTM. We actually do this out of tree in our backend for ADD, SUB, MUL, SHL (since we'd rather truncate inputs than do a larger op and truncate that), so it seems more than reasonable to me.

Apr 28 2016, 10:53 AM

Apr 22 2016

escha added a comment to D19391: transform obscured FP sign bit ops into a fabs/fneg using TLI hook.

As far as I know -- "Custom" means something is not legal, but has a custom lowering. It has nothing to do with how it's implemented on the target; it just means that the target has a custom legalization hook (as opposed to Expand or such, which means it doesn't have any custom hook).

Apr 22 2016, 8:50 AM
escha added a comment to D19391: transform obscured FP sign bit ops into a fabs/fneg using TLI hook.

Our target has a legal FABS (it is, in fact, free, as it's a modifier). But it's implemented as FADD DST, SRC.ABS, -0.0, which may modify bits other than the top bit.

Apr 22 2016, 8:28 AM

Apr 6 2016

escha closed D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.
Apr 6 2016, 10:03 AM
escha committed rL265562: Loop Unroll: add options and tweak to make Partial unrolling more useful.
Loop Unroll: add options and tweak to make Partial unrolling more useful
Apr 6 2016, 10:03 AM
escha committed rL265558: LoopUnroll: only allow non-modulo Partial unrolling when Runtime=true.
LoopUnroll: only allow non-modulo Partial unrolling when Runtime=true
Apr 6 2016, 9:49 AM

Apr 4 2016

escha accepted D18782: Reduce unroll of constant bounds loop with TripCount that is not modulo of unroll factor.

LGTM

Apr 4 2016, 9:57 PM
escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

Poking this thread.

Apr 4 2016, 12:44 PM

Apr 1 2016

escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

I am slightly afraid to silently modify the behavior of MaxCount to affect full unrolling (when it didn't before) because out of tree users may be using it.

Apr 1 2016, 3:58 PM
escha updated the diff for D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

Added commandline options and a test.

Apr 1 2016, 3:52 PM
escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

Ironically it looks like MaxCount itself doesn't even have a commandline option either... should I introduce one for both? :/ it feels wrong to bloat the commandline options like this, I guess

Apr 1 2016, 3:28 PM
escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

For your target there is a case where full unroll fails and then partial unroll fails as well. And when the patch is applied, partial unroll is successful. Right? So you can create a target test.

Apr 1 2016, 2:36 PM
escha updated the diff for D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

Adding full context.

Apr 1 2016, 2:33 PM
escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

Yes, I understand that, but what I mean is, the bug won't trigger because we *will* get to the path that corrects Count because UnrolledSize > UP.PartialThreshold with Count = 9, and then it'll lower Count to 3. So we won't end up with a Count that isn't evenly dividing into 9.
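
A rough C++ sketch of the path described above, under a deliberately simplified cost model where the unrolled size is just loop size times count. `pickPartialUnrollCount` and its parameters are hypothetical names, and the concrete sizes in the demo are made up; this is not the actual LoopUnroll code.

```cpp
#include <algorithm>
#include <cassert>

// Start from a full unroll (count == tripCount). If that exceeds the partial
// threshold, shrink the count to fit, then keep lowering it until it divides
// the trip count evenly so no remainder loop is needed.
unsigned pickPartialUnrollCount(unsigned tripCount, unsigned loopSize,
                                unsigned partialThreshold) {
  assert(tripCount > 0 && loopSize > 0);
  unsigned count = tripCount;
  if (loopSize * count > partialThreshold) {
    count = std::max(1u, partialThreshold / loopSize);
    while (count > 1 && tripCount % count != 0)
      --count;
  }
  return count;
}

void demoUnrollCount() {
  // Trip count 9, assumed loop size 10, assumed threshold 40:
  // 9 copies are too big, 4 fit but do not divide 9, so the count drops to 3.
  assert(pickPartialUnrollCount(9, 10, 40) == 3);
}
```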

Apr 1 2016, 1:00 PM
escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

I don't *think* it's possible for that problem to happen under normal circumstances without FullUnrollMaxCount being set. Normally, the only way we can fail to do full unrolling is if the full unroll cost is greater than Threshold. In this case, we'll go try partial unrolling, and we'll also be above the partial threshold (since a partial threshold higher than Threshold doesn't make sense), and it'll take the path that makes the count modulo-correct.

Apr 1 2016, 11:50 AM
escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

What sort of test case should I do for this (which behavior are you looking to test)? Should I make a commandline option so that this can easily be tested via 'opt'?

Apr 1 2016, 10:04 AM

Mar 31 2016

escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

Going by the code, Count is emphatically not what I want:

Mar 31 2016, 5:49 PM
escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

That's what UP.Count is intended to do: let the target suggest an unroll count it deems reasonable.

Mar 31 2016, 5:47 PM
escha updated the diff for D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

Ugh, my mistake, I missed one hunk when uploading the diff; the hunk where the variable was used! My mistake.

Mar 31 2016, 5:45 PM
escha added a comment to D18670: LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.

FullUnrollMaxCount isn't used anywhere because it's a TTI option used by our out of tree target.

Mar 31 2016, 5:01 PM
escha retitled D18670 to LoopUnroll: some small fixes/tweaks to make it more useful for partial unrolling.
Mar 31 2016, 1:02 PM

Mar 29 2016

escha closed D17843: MachineSink: make shouldSink a TII target hook.
Mar 29 2016, 3:54 PM
escha committed rL264799: MachineSink: make shouldSink a TII target hook.
MachineSink: make shouldSink a TII target hook
Mar 29 2016, 3:50 PM
escha added a reviewer for D17843: MachineSink: make shouldSink a TII target hook: qcolombet.
Mar 29 2016, 3:03 PM

Mar 22 2016

escha added a comment to D18345: Fix DenseMap::reserve(): the formula was wrong.

Thanks for catching this off-by-one.

Mar 22 2016, 9:04 AM

Mar 15 2016

escha added a comment to D18155: Instcombine: try to avoid wasted work in ConstantFold.

Ah okay, I think now (looking at this a bit more) I understand why this cost keeps showing up in the profile. Like you said, it also happens during the main InstCombine run as well. Specifically, we have lots of loads and intrinsics with all constant operands that can't be folded (since they're runtime, not compile-time constants). So when running ConstantFoldInstruction, it's forced to call SimplifyConstantExpr on every operand, even though that ends up doing absolutely nothing. The results then get thrown away because you can't constant-fold a load to something whose value isn't known at compile time.

Mar 15 2016, 10:09 PM
escha added a comment to D18155: Instcombine: try to avoid wasted work in ConstantFold.

Yeah, I don't understand that part either!

Mar 15 2016, 9:54 PM
escha added a comment to D18155: Instcombine: try to avoid wasted work in ConstantFold.

Why not do the following:

  1. Drop instructions that are dead.
  2. Check to see if any ConstExpr operands are foldable, and update the instruction
  3. If constant, fold the constant. [no longer needs to consider constexprs]
  4. Add to the initial instcombine worklist.
Mar 15 2016, 6:32 PM
escha added a comment to D18155: Instcombine: try to avoid wasted work in ConstantFold.

To try to be more clear, this code appears to:

Mar 15 2016, 6:18 PM
escha added a comment to D18155: Instcombine: try to avoid wasted work in ConstantFold.

I believe this code is in the initial portion of instcombine that creates the initial worklist; it isn't run iteratively at all, at least if I remember right.

Mar 15 2016, 6:14 PM
escha updated the diff for D18155: Instcombine: try to avoid wasted work in ConstantFold.

Updated diff with some context.

Mar 15 2016, 5:49 PM
escha added a comment to D18155: Instcombine: try to avoid wasted work in ConstantFold.

What git option do I use to get a diff with full context?

Mar 15 2016, 5:30 PM
escha closed D18154: DenseMap: make .resize() do the intuitive thing.
Mar 15 2016, 11:42 AM
escha accepted D18154: DenseMap: make .resize() do the intuitive thing.
Mar 15 2016, 11:42 AM

Mar 14 2016

escha committed rL263522: DenseMap: make .resize() do the intuitive thing.
DenseMap: make .resize() do the intuitive thing
Mar 14 2016, 6:55 PM
escha updated the diff for D18154: DenseMap: make .resize() do the intuitive thing.

Added a test to make sure that resizing is enough to insert N elements without reallocation for a variety of sizes.

Mar 14 2016, 1:31 PM
escha retitled D18155 to Instcombine: try to avoid wasted work in ConstantFold.
Mar 14 2016, 11:58 AM
escha retitled D18154 to DenseMap: make .resize() do the intuitive thing.
Mar 14 2016, 11:42 AM

Mar 12 2016

escha closed D18124: ConstantFoldInstruction: avoid wasted calls to ConstantFoldConstantExpression.
Mar 12 2016, 9:44 PM
escha accepted D18124: ConstantFoldInstruction: avoid wasted calls to ConstantFoldConstantExpression.
Mar 12 2016, 9:43 PM
escha committed rL263374: ConstantFoldInstruction: avoid wasted calls to ConstantFoldConstantExpression.
ConstantFoldInstruction: avoid wasted calls to ConstantFoldConstantExpression
Mar 12 2016, 9:41 PM
escha updated the diff for D18124: ConstantFoldInstruction: avoid wasted calls to ConstantFoldConstantExpression.

Use llvm::all_of.

Mar 12 2016, 4:36 PM