Authored by dsanders on Apr 29 2019, 5:27 PM.

# Details

Reviewers
 bogner aditya_nandakumar volkan aemerson paquette arsenm
Commits
Summary

Targets often have instructions that can sign-extend certain cases faster
than the equivalent shift-left/arithmetic-shift-right. Such cases can be
identified by matching a shift-left/shift-right pair but there are some
issues with this in the context of combines. For example, suppose you can
sign-extend 8-bit up to 32-bit with a target extend instruction.

```%1:_(s32) = G_SHL %0:_(s32), i32 24 # (I've inlined the G_CONSTANT for brevity)
%2:_(s32) = G_ASHR %1:_(s32), i32 24
%3:_(s32) = G_ASHR %2:_(s32), i32 1```

would reasonably combine to:

```%1:_(s32) = G_SHL %0:_(s32), i32 24
%2:_(s32) = G_ASHR %1:_(s32), i32 25```

which no longer matches the special case. If your shifts and extend are
equal cost, this would break even as a pair of shifts but if your shift is
more expensive than the extend then it's cheaper as:

```%2:_(s32) = G_SEXT_INREG %0:_(s32), i32 8
%3:_(s32) = G_ASHR %2:_(s32), i32 1```

It's possible to match the shift-pair in ISel and emit an extend and ashr.
However, this is far from the only way to break this shift pair and make
it hard to match the extends. Another example is that with the right
known-zeros, this:

```%1:_(s32) = G_SHL %0:_(s32), i32 24
%2:_(s32) = G_ASHR %1:_(s32), i32 24
%3:_(s32) = G_MUL %2:_(s32), i32 2```

can become:

```%1:_(s32) = G_SHL %0:_(s32), i32 24
%2:_(s32) = G_ASHR %1:_(s32), i32 23```

All upstream targets have been configured to lower it to the current
G_SHL,G_ASHR pair but will likely want to make it legal in some cases to
handle their faster cases.

To follow-up: Provide a way to legalize based on the constant. At the
moment, I'm thinking that the best way to achieve this is to provide the
MI in LegalityQuery but that opens the door to breaking core principles
of the legalizer (legality is not context sensitive). That said, it's
worth noting that looking at other instructions and acting on that
information doesn't violate this principle in itself. It's only a
violation if, at the end of legalization, a pass that checks legality
without being able to see the context would say an instruction might not be
legal. That's a fairly subtle distinction so to give a concrete example,
saying %2 in:

```%1 = G_CONSTANT 16
%2 = G_SEXT_INREG %0, %1```

is legal is in violation of that principle if the legality of %2 depends
on %1 being constant and/or being 16. However, legalizing to either:

`%2 = G_SEXT_INREG %0, 16`

or:

```%1 = G_CONSTANT 16
%2:_(s32) = G_SHL %0, %1
%3:_(s32) = G_ASHR %2, %1```

depending on whether %1 is constant and 16 does not violate that principle
since both outputs are genuinely legal.

# Diff Detail

Repository
rL LLVM

### Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
756 ↗(On Diff #197439)

It does but it potentially generates more instructions to do it. Consider %0:_(s32) = G_SEXT %1:_(s32), 4 narrowScalar'd to s8. This method will break it into four components and will emit one G_SEXT_INREG and one G_ASHR. The other method will emit one G_SEXT_INREG and one G_SEXT which may be eliminated entirely or at worst, lower to one G_ASHR

767 ↗(On Diff #197439)

It already does that. We save the register in FullExtensionReg and re-use it if we see any further full extensions. The test for it is in LegalizerHelperTest.cpp on line 821 (the last two operands of the G_MERGE_VALUES are both [[T6]]

Most targets I know would select G_ZEXT_INREG to an AND using either an inline or materialized immediate at which point we haven't really gained anything by protecting it against harmful 'optimization'.

My only consideration was that is faster to narrow scalar G_ZEXT_INREG, then to narrow scalar G_AND and G_CONSTANT. On the other hand AND has simple narrow scalar unlike G_SHL and G_ASHR so it is not that big performance/code size improvement compared to G_SEXT_INREG. Also as sign and zero extend are most of the time mentioned together, I thought that we could add G_ZEXT_INREG alongside with G_SEXT_INREG.

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
698 ↗(On Diff #197439)

Yes. For example with this test on aarch64, store is narrowScalared:

```define void @func(i8 %x, i128* %p) {
entry:
%conv = sext i8 %x to i128
store i128 %conv, i128* %p
ret void
}```

Legalizer is not able to produce something like this

```%2:_(s32) = COPY \$w0
%1:_(p0) = COPY \$x1
%11:_(s64) = G_CONSTANT i64 63
%12:_(s64) = G_SEXT_INREG %2, 8
%13:_(s64) = G_ASHR %12, %11(s64)
G_STORE %12(s64), %1(p0) :: (store 8 into %ir.p, align 16)
%7:_(s64) = G_CONSTANT i64 8
%6:_(p0) = G_GEP %1, %7(s64)
G_STORE %13(s64), %6(p0) :: (store 8 into %ir.p + 8, align 16)
RET_ReallyLR```

which I assume is a desired output.

708 ↗(On Diff #197439)

I did not consider situation when G_SEXT(G_SEXT_INREG) could combine away.
Currently G_SEXT can only combine into G_TRUNC.
The use of this G_SEXT is something legal or G_MERGE that is waiting for G_UNMERGE.
We know this since uses are legalized before defs (G_SEXT is def in this context).
Yet again if use is legal, then G_SEXT_INREG was legal or something else like lower but not narrow scalar.
Are we talking about artifact combiner, or something that is not in tree?
Could you provide a test where G_SEXT and G_TRUNC produced with this narrowScalar combine away?

756 ↗(On Diff #197439)

Could we then add a possibility for targets to chose how they want to narrow scalar that is to have:
narrowScalar - always creates G_UNMERGE+...+G_MERGE
narrowScalarExt - creates G_TRUNC+...+G_{S|Z|ANY}EXT

dsanders marked 7 inline comments as done.May 10 2019, 1:42 PM

Thanks for the explanations. I think you have some good points. Overall, it still looks to me like we're complicating things for backend writers and introducing a lot of subtle distinctions to keep in mind. It would be useful to hear what other people think about this.

It's a optional and small though and I think the wins make it worth the cost for the targets that benefit from it (I expect this to be quite a few targets since 11 in-tree targets had at least some support for the SelectionDAG equivalent*). I'm expecting that to be most targets since most elected to support ISD::SIGN_EXTEND_INREG rather than letting SelectionDAG expand it. Targets that don't benefit from it or don't want to deal with it yet can just use .lower() to opt out of it.

*For targets outside my knowledge this is based on matches for SIGN_EXTEND_INREG that aren't calls to make SelectionDAG expand it.

llvm/include/llvm/Target/GenericOpcodes.td
40 ↗(On Diff #197230)

> One way to achieve that is to use s32 -> G_TRUNC -> s7 -> G_SEXT -> s32. This costs more memory than G_SEXT_INREG (which can add up if you do it a lot, e.g. for code heavily using short or char).
Fair enough, but if that's the only good argument then this is premature optimization.

Yep, it's a nice bonus rather than the motivation.

> [...] it also means that all the register allocation code has to support s7. Similarly, spill/reload/move has to support s7, frame lowering has to support it.
You mean register bank selection? Otherwise, doing register allocation before instruction select is kind of a big difference from what happens in the upstream targets.

Both unless there's a pass that strips out the s7 types. InstructionSelection doesn't normally change the types so any that get there will also pass through to the next pass. RegisterAllocation would be the pass that assigns 32-bit registers to the 7-bit type.

Anyway, any passes would only have to support such types for G_TRUNC and G_SEXT, not for *everything* (just like any other legal operation).

It depends on the pass. Some can limit their support to G_TRUNC/G_SEXT but others don't really work with the operation. For example, Frame lowering needs to know how to promote the s7 to s8 both for storage allocation and for emitting code to load/store from/to a given frame index. Register Allocation needs to know which registers are appropriate for an s7.

Also, the list of types can potentially be large. For a 64-bit target, we have to ensure those later passes can deal with s1 through s64 unless a pass like the legalizer limits it.

If we introduce G_SEXT_INREG, they now also have to support G_SEXT_INREG in addition to G_SEXT, since you can't guarantee that after the legalization we won't have any G_SEXT left.

That's partly true. Targets that don't have getActionDefinitionBuilder(G_SEXT_INREG).lower() in their legalizer have to support both to whatever degree the legalization rules say are legal. There's no lower() for G_SEXT yet but it would be a G_ANYEXT followed by G_SEXT_INREG (which could further lower to a shift pair) which would allow us to limit support to G_ANYEXT (trivial) and G_SEXT_INREG. Those that do have getActionDefinitionBuilder(G_SEXT_INREG).lower() only need to support G_SEXT as G_SEXT_INREG will be lowered to G_SHL/G_ASHR by the legalizer and produce the same legal MIR we have today.

Instruction selection has to support it too which is a problem for imported SelectionDAG patterns as they can't describe s7 unless there's a register class with an i7 type which isn't possible as it isn't one of the MVT types.

This is a good point, but then again if a target wants to keep those instructions unlowered, then they are legal and should be selected somehow.
We should teach TableGen to handle such situations rather than hang on to whatever limitations SelectionDAG had.

I agree. It's not feasible in the SelectionDAG patterns we import (it requires changing the type inferencing engine in a big way). For now, C++ is the only way but hopefully we'll have GlobalISel rules at some point

There's probably more but the point is that being able to eliminate some types simplifies the backend. You might think that this sounds like type
legalization (and I'd be inclined to agree w.r.t the effect at least but I'd still call it operation legalization as the possible removal of types is a side
effect) but the key difference from SelectionDAG is that GlobalISel itself doesn't mandate it or treat it separately from operation legalization. If a
target works in s8 a lot but doesn't really have s8 operations or registers, it can choose to make s8 operations legal anyway and trade more
complexity in the backend for (hopefully) better code quality.

It's a trade-off, fewer types is a simplification, but more opcodes isn't. Having to always keep in mind both G_SEXT and G_SEXT_INREG is going to
be a burden for maintainers.

I agree it's a trade-off and the balance of it will differ between different targets. Where each target falls on that scale will depend on how much of a win selecting a specialized sign-extend instruction is compared to not doing it. Another factor that's important is how difficult it is to recognize a sign-extend after the optimizer has worked on the IR which ranges from easy (the optimizer didn't do much), to tricky (the optimizer mangled the pattern but it's still just about recognizable if you cover all the possibilities), to impossible (it's different code that happens to sign-extend as well). For the targets I'm interested in G_SEXT_INREG is well worth the price and I believe that other targets will land on the same end of the scale once they're optimizing more heavily.

In a sense, this is worse than type legalization, because after type legalization you were certain any funny types were
gone, but now they may or may not have all been eaten up by G_SEXT_INREG and G_SEXTLOAD, so you may or may not need to worry about
G_SEXT, depending on whether or not you left any legal producer for a certain type. It seems easier to shoot yourself in the foot.

It's fairly simple to catch the gaps in a legalization ruleset that aims to eliminate certain types. If any operation involving a given type is legal then G_ANYEXT and G_TRUNC for that type must also be legal. If either of those two are marked unsupported for a given type then the legalizer will fail on the IR and you can tell something is missing from the ruleset. Aside from that, being consistent with .clampScalar() and similar is the way to ensure certain types get eliminated.

> Suppose we lowered our sign-extend to a lsl, ashr pair.
That means we're not interested in it, otherwise we could mark it as legal and select it to whatever better sequence we know in the instruction select or some other smarter pass.

Not necessarily. It happens whenever the artifact combiner combines (sext (trunc x)). The reason this happens is because after merging those artifacts together, there's no way to represent the sign-extension other than the G_SHL/G_ASHR pair (or G_SEXT_INREG) but the operation is still required to happen. If the artifact combiner didn't do this then it would be impossible to eliminate types.

A better answer is to form a higher-level 'composite' operation to smuggle it past all the combiners and optimizers we don't want to happen. This is what G_SEXT_INREG does.

I'm still not convinced. This could be true for any other operation that can be lowered. You wouldn't propose adding G_SREM_FOR_REAL versus G_SREM just because some targets don't want to lower it, right? They'd just have to mark it as legal or custom.

It is true for any other operation that can be lowered and is important for performance. Whether I'd propose it for the G_* namespace upstream largely depends on whether I think it has general applicability to other targets. I've been meaning to propose widening multiplies for a while (e.g. 32x32 -> 64) as many targets have these but I wouldn't propose a G_FFT because only DSP's have a potential use for one (and even then, why not use G_INTRINSIC). For situations where only one or a couple backends would be interested, I recommend legalizing or combining to what we've been calling target-specific generic instructions (which is a _terrible_ name) which is essentially a target-pseudo that uses the type[0-9] constraints. The effect is very much like a generic instruction, but is target specific.

I disagree with this summary as it's too focused on the legalizer. I believe I'm solving an information-preservation problem for the compilation pipeline by preventing the decomposition of a higher-level operation into lower-level components in cases where that information is still useful to optimal codegen

You could also solve it by keeping the type, which would be a more honest representation imo.

I disagree that that's a more honest representation. It starts off honest but it becomes a lie as the compilation pipeline progresses. At the IR Translator stage, I agree that keeping the type is the better representation but once I'm shaping the MIR to suit my target (the legalizer being one of the bigger passes that does that but they all do it to some degree) that opinion changes. It takes on more target specific impurities until we get to a purely target dependent representation. Swapping out G_SEXT on non-existant types for a G_SEXT_INREG or G_SHL/G_ASHR pair is changing honesty w.r.t target independence to honesty w.r.t the target. In the same way, I swap out generic opcodes for target-specific-generic-opcodes and target-instructions as the pipeline progresses. This one just has more common ground with other targets than most.

I also disagree w.r.t to the legalizer but I might be being picky here. I would say I'm removing a requirement that all targets with sign-extension instructions which outperform shift-pairs make G_SEXT legal for all possible source types that benefit from that instruction (for AArch64, this means every type from s1 up to s63). In the context of the legalizer, this means being able to promote the source type without changing the operation and thereby making the operation specified by the opcode and immediate rather than the opcode and types. In terms of implementation, this does turn an operation legality problem from being about types to being about value-of-immediate which is pretty close to the way you stated it and is what makes me think I might be being picky. I do think there is a small distinction between the two though as the value-of-immediate bit falls out of excluding the types from the definition of the operation.

So, you're only removing the requirement because types are implicitly illegal (unless the target says otherwise), whereas immediate values are
implicitly legal. This means that types force you to think about whether or not something should be legal, whereas immediates will just sneak
through. At any rate this is something that hasn't really been discussed.

Could you elaborate on why you think immediates are implicitly legal? Their legality is specified by the legalization rules. AArch64 happens to support all of them in D61290 so it says legal without even looking at the immediate but LegalizerHelperTest.cpp in D61290 has an example where only certain immediates are legal.

I think you have to think about what is legal either way and define your legalization rules accordingly to consume any input and produce a legal output. If your target doesn't have fast sign-extend instructions then use .lower(), if you only have ones for s8 and s16 to s32 then use .legalForTypeWithImm({s32, 8}, {s32, 16}).clampScalar(0, s32, s32).lower(), if you can handle any sX to s32 then use .legalFor({s32}).clampScalar(0, s32, s32).lower()

I would say I'm removing the requirement because:

• Backends should be able to constrain the types they have to support. They shouldn't have to support all types smaller than register width just to handle G_SEXT.
• Backends shouldn't be required to have G_SEXT legal for every type combination smaller than register width.
• Changing value sizes isn't the only way to get a sign-extension (e.g. (x << 24) >> 24 in C/C++)
• It simplifies the optimizers because they can easily identify sign-extends cases which may cause them to harm rather than improve the code
• It allows the compiler to produce better code
• It allows the MIR to more accurately reflect the target
llvm/include/llvm/Target/Target.td
839 ↗(On Diff #197439)

Umm, I'm not sure what happened there. It was supposed to have 'special behaviour' after that. Fixed it

840 ↗(On Diff #197439)

That sounds better to me. Done

llvm/lib/Target/ARM/ARMLegalizerInfo.cpp
87 ↗(On Diff #197439)

Ah ok, I didn't find that one because it only tests a legal G_SEXT and not one that's legalized. I've added a test that checks it gets lowered to the shift pair

llvm/test/CodeGen/AArch64/GlobalISel/legalize-ext.mir
64 ↗(On Diff #197230)

It's inclusion in this patch indicates that the test was affected by the addition of G_SEXT_INREG. It takes a different code path to the same end
result and slightly peturbs the order in the process. FWIW, I think that makes it relevant to G_SEXT_INREG but I don't mind committing the
status-quo tests separately.

You're contradicting yourself a bit. If the order is relevant now, then it will be relevant for future changes as well, so you shouldn't switch to
CHECK-DAG.

The test is relevant because it checks that the lowering path produces the same instructions as before (.lower() must cause the intermediate G_SEXT_INREG to lower to the same G_SHL/G_ASHR pair as before). The instruction order (specifically the placement of the G_CONSTANT argument to the G_SHL/G_ASHR pair) is not important because it has no bearing on the correctness of the output. Using CHECK-DAG lets us ignore the unimportant bit (instruction order) while still checking the important bit (opcode, arguments, etc.).

dsanders updated this revision to Diff 199089.May 10 2019, 2:30 PM
dsanders marked 3 inline comments as done.

dsanders marked 2 inline comments as done.May 10 2019, 2:56 PM
llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
698 ↗(On Diff #197439)

Ah ok. The issue there is that G_STORE hasn't implemented narrowScalar yet. I don't think that we should prevent any opcode from getting a narrowScalar implementation until all opcodes have one. I think we should be implementing them as we need them and gradually build up the full set over time.

708 ↗(On Diff #197439)

It does look like the artifact combiner is missing some combines, for the NarrowTy.getScalarSizeInBits() < SizeInBits case the resulting:

```%6:_(s128) = G_MERGE_VALUES %0(s64), %12(s64)
%7:_(s32) = G_TRUNC %6(s128)```

ought to be:

`%7:_(s32) = G_TRUNC %0(s64)`

and for the NarrowTy.getScalarSizeInBits() >= SizeInBits there's a:

```%10:_(s64) = G_SEXT_INREG %9, 32
%6:_(s128) = G_SEXT %10(s64)
%7:_(s32) = G_TRUNC %6(s128)```

which firstly ought to be:

```%10:_(s64) = G_SEXT_INREG %9, 32
%7:_(s32) = G_TRUNC %10(s64)```

and secondly:

```%10:_(s64) = G_SEXT_INREG %9, 32
%7:_(s32) = %9(s32)```

the former of those two isn't really related to this patch (the G_SEXT_INREG isn't involved in the combine) but the latter is.

756 ↗(On Diff #197439)

Assuming we fix the missing artifact combines, what would be the case where G_UNMERGE/G_MERGE is the better option?

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
698 ↗(On Diff #197439)

The problem (if we use narrow scalar with G_MERGE/G_UNMERGE) is the order in which we attempt to combine artifacts in Legalizer. D61787 and narrow scalar for G_ANYEXT(with G_MERGE/G_UNMERGE) will be able to give output like this.

708 ↗(On Diff #197439)

Sorry, I still don't understand. In this fragment %9 should be s64 not s32. An IR function that results in situation where it is better to narrow scalar with SEXT+TRUNC where they combine with something would be helpful.
Maybe it is possible to perform other combines before we make G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT?
That is to not create G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT and solve everything in combiner with for example some pattern that combines multiple artifacts instead of only two.

756 ↗(On Diff #197439)

%0:_(s128) = G_ANYEXT .....
%1:_(s128) = G_SEXT_INREG %0, 8
%2:_(s64), %3:_(s64) = G_UNMERGE_VALUES %1:_(s128)

At the moment most(if not all?) of narrow scalars make sequence of instructions that starts with G_UNMERGE and ends with G_MERGE. In case we emit G_SEXT as end of sequence it won't be able to combine with G_UNMERGE_VALUES. We then have to perform narrow scalar of this G_SEXT (that will end with G_MERGE) then combine this with %2:_(s64), %3:_(s64) = G_UNMERGE_VALUES %1:_(s128)
Same for G_TRUNC at the start. Not to mention that this complicates order in which we attempt to combine artifacts.
And if G_SEXT_INREG is to be narrow scalared I would expect that it is surrounded with G_MERGE and G_UNMERGE, on mips at least.
Considering llvm test-suite and mips I would say that it is always better to emit G_UNMERGE/G_MERGE. I cannot think of example where G_SEXT would be able to combine with something.
And for general answer I would say that it is better to emit G_UNMERGE/G_MERGE if def of G_SEXT_INREG is used in G_UNMERGE_VALUES. This is problematic since it requires legalizer to decide how to narrow scalar based on surrounding instructions.

llvm/include/llvm/Target/GenericOpcodes.td
40 ↗(On Diff #197230)

Could you elaborate on why you think immediates are implicitly legal? Their legality is specified by the legalization rules.

Sorry about being unclear, let me give it another try.

AArch64 happens to support all of them in D61290 so it says legal without even looking at the immediate [...]

That's exactly what I meant, that it says legal without even looking at the immediate. I haven't looked at D61290 in too much detail, so excuse me if I'm misunderstanding again. Suppose you had a G_OP with Type0 and Type1, if you say .legalFor({s32}) and just that, you'll get an error because you're not covering Type1 with your rules. From your comments I got the impression that for a G_OP2 with Type0 and Imm1, if you say .legalFor({s32}) then that's enough and all values of Imm1 will just be legal. If that's not the case, then I take this comment back.

llvm/include/llvm/Target/Target.td
840 ↗(On Diff #197439)

That sounds better to me. Done

"That" = the old version? I don't mind either way, just making sure you didn't miss it by mistake.

llvm/lib/Target/ARM/ARMLegalizerInfo.cpp
87 ↗(On Diff #197439)

Cool, thanks!

dsanders marked 3 inline comments as done.May 21 2019, 6:45 PM

Sorry for the slow reply. I had to work on other things for a couple days.

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
698 ↗(On Diff #197439)

I don't understand this. What is the issue with the order artifacts are combined?

708 ↗(On Diff #197439)

Sorry, I still don't understand. In this fragment %9 should be s64 not s32.

You're right, that step isn't quite right. It should be %7:_(s32) = G_TRUNC %9(s64). The point was to consume the input rather than the output as the lower 32-bits are unchanged by the G_SEXT_INREG. Climbing the dependency chain like this allows the uses of %7 to start sooner.

An IR function that results in situation where it is better to narrow scalar with SEXT+TRUNC where they combine with something would be helpful.

Any IR where many operations are too large for your target would provide good examples. Consider:

```%2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
%3:_(s64) = G_SEXT_INREG %2:_(s64), 16```

and that both have narrowScalar(/*typeidx=*/0, s32). The legalization would proceed to:

```%2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
%4:_(s32), %5:_(s32) = G_UNMERGE_VALUES %2:_(s64)
%6:_(s32) = G_SEXT_INREG %4:_(s32), 16
%3:_(s64) = G_MERGE_VALUES %6:_(s32), %5:_(s32)```

and then to:

```%9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
%11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
%2:_(s64) = G_MERGE_VALUES %7:_(s32), %8:_(s32)
%4:_(s32), %5:_(s32) = G_UNMERGE_VALUES %2:_(s64)
%6:_(s32) = G_SEXT_INREG %4:_(s32), 16
%3:_(s64) = G_MERGE_VALUES %6:_(s32), %5:_(s32)```

then the artifact combiner would fold the middle merge/unmerge to:

```%9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
%11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
%6:_(s32) = G_SEXT_INREG %7:_(s32), 16
%3:_(s64) = G_MERGE_VALUES %6:_(s32), %8:_(s32)```

Notice that we still have the %8:_(s32) = G_ADD %10:_(s32), %12:_(s32) at this point even though we're about to overwrite it. We're not going to be able to improve on this until a post-legalize combiner.

Now consider the same case with this optimization:

```%2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
%3:_(s64) = G_SEXT_INREG %2:_(s64), 16```

becomes:

```%2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
%4:_(s32) = G_TRUNC %2:_(s64)
%6:_(s32) = G_SEXT_INREG %4:_(s32), 16
%3:_(s64) = G_SEXT %6:_(s32)```

then:

```%9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
%11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
%2:_(s64) = G_MERGE_VALUES %7:_(s32), %8:_(s32)
%4:_(s32) = G_TRUNC %2:_(s64)
%6:_(s32) = G_SEXT_INREG %4:_(s32), 16
%3:_(s64) = G_SEXT %6:_(s32)```

which the artifact combiner (should) simplify to:

```%9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
%11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
%6:_(s32) = G_SEXT_INREG %7:_(s32), 16
%3:_(s64) = G_SEXT %6:_(s32)```

and then to:

```%9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
%11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
%6:_(s32) = G_SEXT_INREG %7:_(s32), 16
%3:_(s64) = G_SEXT %6:_(s32)```

and then to:

```%9:_(s32) = G_TRUNC %0:_(s64)
%11:_(s32) = G_TRUNC %1:_(s64)
%6:_(s32) = G_SEXT_INREG %7:_(s32), 16
%3:_(s64) = G_SEXT %6:_(s32)```

which is simpler. The second G_ADD was correctly recognized as being dead and removed. As a result fewer instructions were emitted by the legalizer and subsequent passes have less work to do.

Maybe it is possible to perform other combines before we make G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT?
That is to not create G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT and solve everything in combiner with for
example some pattern that combines multiple artifacts instead of only two.

It's not entirely clear what you're suggesting here. I'm particularly confused by the 'that would benefit from narrow scalar' bit since it's not a choice to narrow scalar or not. An operation is either too wide for the target and must be narrowScalar'd or it's not.

If your suggestion is to try to form G_SEXT_INREG instructions in a combine pass after the legalizer then I'd say that's too late in the pipeline. There's no guarantee that the combine to form a G_SEXT_INREG will run before a combine that makes them unrecognizable. Ideally we want to make a best effort to form them in a pre-legalizer combiner in addition to the legalizer.

756 ↗(On Diff #197439)

You're assuming that we _don't_ fix the missing artifact combines there. Given:

```%0:_(s128) = G_ANYEXT .....
%1:_(s128) = G_SEXT_INREG %0, 8
%2:_(s64), %3:_(s64) = G_UNMERGE_VALUES %1:_(s128)
... = ... %2:_(s64)
... = ... %3:_(s64)```

then assuming you are doing narrowScalar(/*typeidx=*/0, s64) on the G_SEXT_INREG, we should get:

```%0:_(s128) = G_ANYEXT .....
%5:_(s64) = G_TRUNC %0:_(s128)
%1:_(s64) = G_SEXT_INREG %0, 8
%4:_(s128) = G_SEXT %1
%2:_(s64), %3:_(s64) = G_UNMERGE_VALUES %1:_(s128)
... = ... %2:_(s64)
... = ... %3:_(s64)```

then the artifact combiner should give:

```%0:_(s128) = G_ANYEXT .....
%5:_(s64) = G_TRUNC %0:_(s128)
%1:_(s64) = G_SEXT_INREG %0, 8
%6:_(s64) = G_ASHR %1, i64 63
... = ... %1:_(s64)
... = ... %6:_(s64)```

then if we assume that G_ANYEXT was from anything smaller than s64 (which should be the case for MIPS):

```%5:_(s64) = G_ANYEXT .....
%1:_(s64) = G_SEXT_INREG %0, 8
%6:_(s64) = G_ASHR %1, i64 63
... = ... %1:_(s64)
... = ... %6:_(s64)```

I cannot think of example where G_SEXT would be able to combine with something.

There's plenty of cases where that should be able to combine. The common cases for targets that expect the input MIR to contain s8, s16, s32, s64 operations and need to legalize to only s32 and s64 operations are:
If X == 2Y:

```%1:_(sX) = G_SEXT %0:_(sY)
%2:_(sY), %3:_(sY) = G_UNMERGE_VALUES %1
... = ... %2:_(sY)
... = ... %3:_(sY)```

to

```%3:_(sY) = G_ASHR %0:_(sY), iY (Y-1)
... = ... %0:_(sY)
... = ... %3:_(sY)```

If X == 4Y:

```%1:_(sX) = G_SEXT %0:_(sY)
%2:_(sY), %3:_(sY), %4:_(sY), %5:_(sY) = G_UNMERGE_VALUES %1
... = ... %2:_(sY)
... = ... %3:_(sY)
... = ... %4:_(sY)
... = ... %5:_(sY)```

to

```%3:_(sY) = G_ASHR %0:_(sY), iY (Y-1)
... = ... %0:_(sY)
... = ... %3:_(sY)
... = ... %3:_(sY)
... = ... %3:_(sY)```

For G_ZEXT, replace the G_ASHR with G_CONSTANT iY 0. For G_ANYEXT use IMPLICIT_DEF instead

dsanders updated this revision to Diff 200627.May 21 2019, 7:08 PM
dsanders marked 2 inline comments as done.

Small comment fix

llvm/include/llvm/Target/GenericOpcodes.td
40 ↗(On Diff #197230)

Ah ok, you're just looking for the rule verifier to check that each target looked at it like it does for type indices. Sure, I can add that. For AArch64 we'd say 1-63 is legal (i.e. every valid immediate for an s64 G_SEXT_INREG) because the SBFMX instruction handles them all but that still counts as looking at it and the verifier would still reject it if we didn't at least look.

llvm/include/llvm/Target/Target.td
840 ↗(On Diff #197439)

I meant 'is only used for clarity' sounds better. I made the change in my working copy

Petar.Avramovic marked 2 inline comments as done.May 22 2019, 5:38 AM

Thanks for the explanation. Summary of inline discussion so far:
NarrowScalar G_SEXT_INREG cannot have decent (able to combine away artifacts) mir test with this patch alone since there are problems with artifact combiner.
Approach 1: narrowScalar G_SEXT_INREG with G_UNMERGE/G_MERGE + D61787. Able to perform all combines introduced so far and generates desired output.
Approach 2: narrowScalar G_SEXT_INREG with G_TRUNC/G_SEXT + TODO: introduce more artifacts combines and fix artifact combiner to support them.
I expect that both approaches should have similar performance and generate equivalent output.
Are there there cases where Approach 2 generates better code, as I am now convinced that Approach 2 will have similar (if not better ?) performance compared to Approach 1?

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
698 ↗(On Diff #197439)

Artifact might have to wait for another instruction to be legalized (with e.g. narrowScalar) and only then perform combine. For more detail look at D61787, it covers artifact combines available at the moment.

708 ↗(On Diff #197439)

The point was to consume the input rather than the output as the lower 32-bits are unchanged by the G_SEXT_INREG. Climbing the dependency chain like this allows the uses of %7 to start sooner.

Something like look through copy, but here it is look through G_SEXT_INREG depending on immediate (operand 2) ?

Assuming Artifact combiner is able to perform all mentioned combines, it gets same output as narrowScalar using G_UNMERGE/G_MERGE + D61787.

High bits in

%3:_(s64) = G_MERGE_VALUES %6:_(s32), %5:_(s32)

should be sign bit of %6

%3:_(s64) = G_SEXT_INREG %2:_(s64), 16
narrowScalars to:

```%4:_(s32), %5:_(s32) = G_UNMERGE_VALUES %2:_(s64)
%8:_(s32) = G_CONSTANT i32 31
%6:_(s32) = G_SEXT_INREG %4:_, 16
%7:_(s32) = G_ASHR %6:_, %8:_(s32)
%3:_(s64) = G_MERGE_VALUES %6:_(s32), %7:_(s32)```

also
%3:_(s64) = G_SEXT %6:_(s32) has to be narrowScalared/combined away by G_UNMERGE when available.

I run mentioned example and got following results:
Setup for test is: applying: D61289, D61289 and D61787, in MipsLegalizerInfo changing mips G_SEXT_INREG to:

```getActionDefinitionsBuilder(G_SEXT_INREG)
.legalForTypeWithImm({{s32,8},{s32,16}})
.maxScalar(0, s32)
.lower();```

and changing G_SEXT_INREG to do narrow scalar with G_UNMERGE/G_MERGE always.

running following test with -march=mipsel -global-isel -O0 -stop-after=legalizer

```define i64 @f(i64 signext %a, i64 signext %b) {
entry:
%conv = trunc i64 %add to i16
%conv1 = sext i16 %conv to i64
ret i64 %conv1
}```

gives following result:

```%2:_(s32) = COPY \$a0
%3:_(s32) = COPY \$a1
%4:_(s32) = COPY \$a2
%5:_(s32) = COPY \$a3
%25:_(s32) = G_CONSTANT i32 0
%26:_(s32) = G_CONSTANT i32 1
%27:_(s32) = COPY %25(s32)
%23:_(s32) = G_AND %27, %26
%24:_(s32) = G_ICMP intpred(ult), %16(s32), %2
%28:_(s32) = COPY %24(s32)
%21:_(s32) = G_AND %28, %26
%32:_(s32) = G_CONSTANT i32 31
%33:_(s32) = G_SEXT_INREG %16, 16
%34:_(s32) = G_ASHR %33, %32(s32)
\$v0 = COPY %33(s32)
\$v1 = COPY %34(s32)
RetRA implicit \$v0, implicit \$v1```

there are few places for improvement:
first, lets remove dead instructions (this could take place after main do/while loop in Legalizer.cpp)

```%2:_(s32) = COPY \$a0

%4:_(s32) = COPY \$a2

%25:_(s32) = G_CONSTANT i32 0
%26:_(s32) = G_CONSTANT i32 1
%27:_(s32) = COPY %25(s32)
%23:_(s32) = G_AND %27, %26

%32:_(s32) = G_CONSTANT i32 31
%33:_(s32) = G_SEXT_INREG %16, 16
%34:_(s32) = G_ASHR %33, %32(s32)
\$v0 = COPY %33(s32)
\$v1 = COPY %34(s32)
RetRA implicit \$v0, implicit \$v1```

fragment

```%25:_(s32) = G_CONSTANT i32 0
%26:_(s32) = G_CONSTANT i32 1
%27:_(s32) = COPY %25(s32)
%23:_(s32) = G_AND %27, %26

comes from lower() of

```%15:_(s1) = G_CONSTANT i1 false
%16:_(s32), %17:_(s1) = G_UADDE %11:_, %13:_, %15:_```

from narrowScalar of G_ADD. If we used

`%16:_(s32), %17:_(s1) = G_UADDO %11:_, %13:_`

for low bits, we would get

```%2:_(s32) = COPY \$a0
%4:_(s32) = COPY \$a2
%32:_(s32) = G_CONSTANT i32 31
%33:_(s32) = G_SEXT_INREG %16, 16
%34:_(s32) = G_ASHR %33, %32(s32)
\$v0 = COPY %33(s32)
\$v1 = COPY %34(s32)
RetRA implicit \$v0, implicit \$v1```

This should be equivalent to mentioned desired output when G_TRUNC/G_SEXT is used in narrow scalar.

It's not entirely clear what you're suggesting here

```%10:_(s64) = G_SEXT_INREG %9, 32
%6:_(s128) = G_SEXT %10(s64)
%7:_(s32) = G_TRUNC %6(s128)```

Was most likely generated from something like:

```%11:_(s32) = G_TRUNC %9(s64)
%10:_(s64) = G_SEXT %11:_(s32)
%6:_(s128) = G_SEXT %10(s64)
%7:_(s32) = G_TRUNC %6(s128)```

I meant that we might be able to figure out that these 4 instructions are equivalent to %7:_(s32) = G_TRUNC %9(s64) (combine more then 2 artifact in artifact combiner) instead of combining first two into into G_SEXT_INREG.

that would benefit from narrow scalar

Is there a test that produces better/smaller code when when G_SEXT_INREG is narrow scalar-ed with G_TRUNC/G_SEXT instead of G_UNMERGE/G_MERGE? From discussion so far I am convinced that both approaches generate/should generate same output.

756 ↗(On Diff #197439)

Ah, didn't think of G_UNMERGE + G_SEXT combine. Then we perform work from "the for loop from G_SEXT_INREG narrow scalar" inside combine and it is pretty much same thing. Both narrowScalar approaches have similar overall performance and generate same output.

dsanders updated this revision to Diff 205223.Jun 17 2019, 5:35 PM
• Add verification that immediate indices are used with the same caveats as the type index checks (custom predicates are assumed to check, etc)
Herald added a subscriber: jfb. Jun 17 2019, 5:35 PM

I'm running into a related problem attempting to implement narrowScalar for G_SEXT. AMDGPU is allergic to 64-bit shifts, so I want to implement

```%1:_(s32) = G_TRUNC %0
%2:_(s64).= G_SEXT %1```

As

```%1:_(s32) = G_TRUNC %0
%2:_(s32) = G_ASHR %1, 31
%3:_:(s64) = G_MERGE_VALUES %1, %2```

Since the 64-bit shift is possible, the artifact combiner produces the undesirable 64-bit shift combination. Worse, this combination ends up infinitely looping if the required shift requires legalizatiion

I think it would be useful to have generic UBFE/SBFE instructions, which could take the place of sext_inreg. However, those would allow variable offset/width, so it wouldn't bee quite the same.

aemerson accepted this revision.Jul 2 2019, 1:32 PM

I think it would be useful to have generic UBFE/SBFE instructions, which could take the place of sext_inreg. However, those would allow variable offset/width, so it wouldn't bee quite the same.

Not sure what the benefit would be there. IIRC we don't really have a need for zext_in_reg and the legalizer isn't going to produce variable shifts in most cases.

I think this has been reviewed enough now.

This revision is now accepted and ready to land.Jul 2 2019, 1:32 PM

Sorry for the slow reply, I've been split between quite a few tasks recently.

@rovka: I think I resolved the last issue you had which was that the verifier didn't ensure the immediate had been checked. Could you confirm that?

I'm running into a related problem attempting to implement narrowScalar for G_SEXT. AMDGPU is allergic to 64-bit shifts, so I want to implement

```%1:_(s32) = G_TRUNC %0
%2:_(s64).= G_SEXT %1```

As

```%1:_(s32) = G_TRUNC %0
%2:_(s32) = G_ASHR %1, 31
%3:_:(s64) = G_MERGE_VALUES %1, %2```

Since the 64-bit shift is possible, the artifact combiner produces the undesirable 64-bit shift combination. Worse, this combination ends up infinitely looping if the required shift requires legalizatiion

With this patch, that will input turn to:

%1:_(s64) = G_ANYEXT %0 // Skipped if %0 is s64
%2:_(s64) = G_SEXT_INREG %0, 32

If you then narrowScalar the G_SEXT_INREG (and I've just noticed my last update lost this part of the patch, I'll fix that in a moment), you'd get:

%1:_(s64) = G_ANYEXT %0 // Skipped if %0 is s64
%3:_(s32), %4:_(s32) = G_UNMERGE_VALUES %1:_(s64)
%5:_(s32) = G_ASHR %4, 32
%2:_(s64) = G_MERGE_VALUES %3:_(s32), %5:_(s32)

assuming we have a full set of artifact combines, then we'd get one of the following: For %0 is s64:

%3:_(s32), %4:_(s32) = G_UNMERGE_VALUES %1:_(s64)
%5:_(s32) = G_ASHR %4, 32
%2:_(s64) = G_MERGE_VALUES %3:_(s32), %5:_(s32)

for %0 is <s64 and >s32:

%1:_(s64) = G_ANYEXT %0 // Skipped if %0 is s64
%3:_(s32), %4:_(s32) = G_UNMERGE_VALUES %1:_(s64)
%5:_(s32) = G_ASHR %4, 32
%2:_(s64) = G_MERGE_VALUES %3:_(s32), %5:_(s32)

for %0 is s32:

%5:_(s32) = G_ASHR %0, 32 // %0 is s32
%2:_(s64) = G_MERGE_VALUES %0:_(s32), %5:_(s32)

or this for %0 <s32:

%1:_(s32) = G_ANYEXT %0 // %0 is <s32
%5:_(s32) = G_ASHR %1, 32
%2:_(s64) = G_MERGE_VALUES %1:_(s32), %5:_(s32)

which all look like they'd do what you want.

I think it would be useful to have generic UBFE/SBFE instructions, which could take the place of sext_inreg. However, those would allow variable offset/width, so it wouldn't bee quite the same.

That's a good point, G_SEXT_INREG %0, 16 is really just SBFE %0, 0, 16. The snag is that, as you hint at, including variable offset and width would require us to support context sensitive legality to cover targets that have sext instructions but don't have a full signed-bit-field-extract since the immediates would have to be represented with G_CONSTANT. We could potentially fix that by supporting instructions that allow both MO.isReg() and MO.isImm() operands somehow (currently the opcode statically indicates one or the other is permitted but not both). Another option would be to have a static SBFE that only allows immediates and a dynamic one that only allows registers. In the worst case for that approach, we'd need four but I'm not sure if we'd really need them all. Do any targets allow both the position and width to be specified as registers?

. Do any targets allow both the position and width to be specified as registers?

Both the offset and width can be registers on AMDGPU (for the SALU version they are both packed in one register)

dsanders marked an inline comment as done.Jul 3 2019, 10:48 AM
llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
708 ↗(On Diff #197439)

> The point was to consume the input rather than the output as the lower 32-bits are unchanged by the
> G_SEXT_INREG. Climbing the dependency chain like this allows the uses of %7 to start sooner.
Something like look through copy, but here it is look through G_SEXT_INREG depending on immediate
(operand 2) ?

I suppose it can be viewed that way as both end up simplifying the MIR. I think of them as separate things as this is a pattern-match-and-replace whereas look through copy is something that's done to help the pattern-match part of that by ignoring instructions that aren't relevant to whether the pattern should match or not.

Do any targets allow both the position and width to be specified as registers?

Both the offset and width can be registers on AMDGPU (for the SALU version they are both packed in one register)

In that case we'd need all four combinations if we did it by opcode. It would probably be better to go the other route to achieve a bitfield extract. That has some nasty side effects to sort out (e.g. what happens to the type indices if one of the operands is an immediate) but it should be possible.

I think we should land this patch while we think of that as it will be fairly simple to convert G_SEXT_INREG to the more flexible G_SFBE in future. It's just a matter of checking for the 0 immediate when looking for G_SEXT_INREG and adding it when we build a G_SBFE

Do any targets allow both the position and width to be specified as registers?

Both the offset and width can be registers on AMDGPU (for the SALU version they are both packed in one register)

In that case we'd need all four combinations if we did it by opcode. It would probably be better to go the other route to achieve a bitfield extract. That has some nasty side effects to sort out (e.g. what happens to the type indices if one of the operands is an immediate) but it should be possible.

I think we should land this patch while we think of that as it will be fairly simple to convert G_SEXT_INREG to the more flexible G_SFBE in future. It's just a matter of checking for the 0 immediate when looking for G_SEXT_INREG and adding it when we build a G_SBFE

I would rather have G_SBFE/G_UBFE that accept arbitrary registers, and a separate G_SEXT_INREG. I wouldn't get any benefit from additional static versions

dsanders updated this revision to Diff 207854.Jul 3 2019, 11:46 AM

Bring back the code that went missing on my last update

llvm/include/llvm/Target/GenericOpcodes.td
46 ↗(On Diff #207854)

Are tablegen emitter changes needed for this? I was trying to add an immediate operand in D64054, which seemed to not work correctly

llvm/unittests/CodeGen/GlobalISel/PatternMatchTest.cpp
276 ↗(On Diff #207854)

It's a gtestism that these should be swapped to get the correct expected/actual error message

284 ↗(On Diff #207854)

Ditto

dsanders updated this revision to Diff 207863.Jul 3 2019, 12:31 PM
dsanders marked 3 inline comments as done.
• EXPECT_EQ argument order
llvm/include/llvm/Target/GenericOpcodes.td
46 ↗(On Diff #207854)

You mean the untyped_imm_0? It's needed to drive the verifier that Diana requested and tells it which immediate checks it needs to look for

llvm/include/llvm/Target/GenericOpcodes.td
46 ↗(On Diff #207854)

Yes. As far as I can tell the emitter will try looking for G_CONSTANT defined register for this. I agree there should be a special immediate operand for this, but I don't think this will work in a tablegen pattern as-is. The current form has a ValueType leaf, so matching the immediate wouldn't be quite the same.

dsanders marked an inline comment as done.Jul 3 2019, 12:45 PM
llvm/include/llvm/Target/GenericOpcodes.td
46 ↗(On Diff #207854)

I'm not sure I'm following. InOperandList in a GenericInstruction specifies the type constraints and uses special TypeOperand subclasses. They don't factor into tablegen patterns except in so far as requiring that types match.

llvm/include/llvm/Target/GenericOpcodes.td
46 ↗(On Diff #207854)

I mean I believe there's no way to define a node that will be capable of matching this instruction

dsanders updated this revision to Diff 213174.Fri, Aug 2, 9:51 PM

Rebase and update to match changes in D61321

dsanders marked an inline comment as done.Fri, Aug 2, 9:53 PM
llvm/test/CodeGen/AArch64/GlobalISel/legalize-sext.mir
11–12 ↗(On Diff #213174)

@paquette @aemerson: I ought to draw attention to this. My code appears to be doing the right thing for lower() of G_SEXT_INREG but you appear to have a custom legalization that promotes one of the two constants to s64. Is that intentional? If so, is it also intentional that it only does it for G_ASHR and not G_SHL too?

llvm/test/CodeGen/AArch64/GlobalISel/legalize-sext.mir
11–12 ↗(On Diff #213174)

Yes this is intentional. In order to re-use the existing imported patterns for ashr & lshr we promote the shift amount to i64. For G_SHL we have some custom selection code to deal with non-64b immediates so it's not necessary.

llvm/test/CodeGen/AArch64/GlobalISel/legalize-undef.mir
29 ↗(On Diff #213174)

Did constant and impdef order change? If so we can just re-run the test update script.

dsanders marked 2 inline comments as done.Mon, Aug 5, 12:08 PM
llvm/test/CodeGen/AArch64/GlobalISel/legalize-sext.mir
11–12 ↗(On Diff #213174)

That's ok then. Thanks

llvm/test/CodeGen/AArch64/GlobalISel/legalize-undef.mir
29 ↗(On Diff #213174)

Did constant and impdef order change?

Yes.

Should instruction order matter for these tests? Legalization doesn't give any guarantees on the output order and using CHECK-DAG makes the test robust against that

llvm/test/CodeGen/AArch64/GlobalISel/legalize-undef.mir
29 ↗(On Diff #213174)

It would, but as it's autogenerated It doesn't/shouldn't uses -DAG checks

dsanders marked 24 inline comments as done.Wed, Aug 7, 8:22 PM

The next update fixes the last couple nits that were left open after the LGTM. I'm planning to commit this tomorrow

llvm/include/llvm/Target/GenericOpcodes.td
46 ↗(On Diff #207854)

Ah I see. Yes, we'll need to map sext_inreg to G_SEXT_INREG and add some special handling to convert the MVT operand to an immediate.

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
770 ↗(On Diff #197439)

I tried this change but it resulted in errors.

dsanders updated this revision to Diff 214060.Wed, Aug 7, 8:27 PM
dsanders marked an inline comment as done.

Fix the last few nits

Closed by commit rL368487: [globalisel] Add G_SEXT_INREG (authored by dsanders, committed by ). Fri, Aug 9, 2:10 PM
This revision was automatically updated to reflect the committed changes.
Herald added a subscriber: kristina. Fri, Aug 9, 2:10 PM