This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Remove redundant ORRWrs which is generated by zero-extend
ClosedPublic

Authored by jaykang10 on Sep 30 2021, 8:18 AM.

Details

Summary

There are 2 patterns for zero-extend as below.

//In the case of a 32-bit def that is known to implicitly zero-extend,
//we can use a SUBREG_TO_REG.
def : Pat<(i64 (zext def32:$src)),
          (SUBREG_TO_REG (i64 0), GPR32:$src, sub_32)>;

//When we need to explicitly zero-extend, we use a 32-bit MOV instruction
//and then assert the extension has happened.
def : Pat<(i64 (zext GPR32:$src)),
          (SUBREG_TO_REG (i32 0), (ORRWrs WZR, GPR32:$src, 0), sub_32)>;

The def32 checks the $src needs explicitly zero-extend. However, it can not check the $src in other block and it adds ORRWrs conservatively in this case. This peephole optimization checks ORRWrs is for redundant zero-extend and try to remove it.

Diff Detail

Event Timeline

jaykang10 created this revision.Sep 30 2021, 8:18 AM
jaykang10 requested review of this revision.Sep 30 2021, 8:18 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2021, 8:18 AM
jaykang10 updated this revision to Diff 376464.Oct 1 2021, 3:47 AM

Fixed a bug

  • replaceRegWith changes MI's defintion register. Keep it for SSA form until deleting MI.

Can you explain in more details what makes this valid? Does it depend on the top bits already being zero? What verifies that?

It might be useful to add mir tests too, to test specific cases that should/shouldn't be removed.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
209

Do we need to check for WZR?

215

Do we need to avoid when they are in the same block? It would be easier for testing if we didn't.

228

Why do we rule these out? Why don't we rule out anything else?

234

-> definition

Can you explain in more details what makes this valid? Does it depend on the top bits already being zero? What verifies that?

It might be useful to add mir tests too, to test specific cases that should/shouldn't be removed.

At instruction selection level, we check the operations which do not zero-out the high half of the 64-bit register using isDef32 function as below.

// Any instruction that defines a 32-bit result zeros out the high half of the
// register. Truncate can be lowered to EXTRACT_SUBREG. CopyFromReg may
// be copying from a truncate. But any other 32-bit operation will zero-extend
// up to 64 bits. AssertSext/AssertZext aren't saying anything about the upper
// 32 bits, they're probably just qualifying a CopyFromReg.
static inline bool isDef32(const SDNode &N) {
  unsigned Opc = N.getOpcode();
  return Opc != ISD::TRUNCATE && Opc != TargetOpcode::EXTRACT_SUBREG &&
         Opc != ISD::CopyFromReg && Opc != ISD::AssertSext &&
         Opc != ISD::AssertZext && Opc != ISD::AssertAlign &&
         Opc != ISD::FREEZE;
}

As you can see, the isDef32 checks fundamentally ISD::TRUNCATE within a basic block because the below pattern is matched with the ISD::TRUNCATE and EXTRACT_SUBREG does not guarantee the high 32 bits are zero.

// To truncate, we can simply extract from a subregister.
def : Pat<(i32 (trunc GPR64sp:$src)),
          (i32 (EXTRACT_SUBREG GPR64sp:$src, sub_32))>;

This patch checks the ORRWrs has EXTRACT_SUBREG in different basic block as operand because the existing pattern with isDef32 resolves case in which the operand is in same basic block.

Let me try to add MIR tests.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
209

You are right! We need to check it because there is below assembly pattern. Let me update the code.

def : InstAlias<"mov $dst, $src", (ORRWrs GPR32:$dst, WZR, GPR32:$src, 0), 2>;
215

If the pattern is in same block, I thought it has already been handled by existing pattern with isDef32 and we do not need to check it.

def def32 : PatLeaf<(i32 GPR32:$src), [{
  return isDef32(*N);
}]>;

// In the case of a 32-bit def that is known to implicitly zero-extend,
// we can use a SUBREG_TO_REG.
def : Pat<(i64 (zext def32:$src)), (SUBREG_TO_REG (i64 0), GPR32:$src, sub_32)>;
228

I was not sure it is good to keep track of the operands of PHI and COPY in this patch... because it could make code complicated... for example, checking cycled phi. If possible, I would like to solve it in separate patch...

234

Sorry... let me update it.

jaykang10 updated this revision to Diff 376911.Oct 4 2021, 8:38 AM

Following comments of @dmgreen, updated patch.

  • Checked WZR
  • Added MIR tests

At instruction selection level, we check the operations which do not zero-out the high half of the 64-bit register using isDef32 function as below.

// Any instruction that defines a 32-bit result zeros out the high half of the
// register. Truncate can be lowered to EXTRACT_SUBREG. CopyFromReg may
// be copying from a truncate. But any other 32-bit operation will zero-extend
// up to 64 bits. AssertSext/AssertZext aren't saying anything about the upper
// 32 bits, they're probably just qualifying a CopyFromReg.
static inline bool isDef32(const SDNode &N) {
  unsigned Opc = N.getOpcode();
  return Opc != ISD::TRUNCATE && Opc != TargetOpcode::EXTRACT_SUBREG &&
         Opc != ISD::CopyFromReg && Opc != ISD::AssertSext &&
         Opc != ISD::AssertZext && Opc != ISD::AssertAlign &&
         Opc != ISD::FREEZE;
}

As you can see, the isDef32 checks fundamentally ISD::TRUNCATE within a basic block because the below pattern is matched with the ISD::TRUNCATE and EXTRACT_SUBREG does not guarantee the high 32 bits are zero.

// To truncate, we can simply extract from a subregister.
def : Pat<(i32 (trunc GPR64sp:$src)),
          (i32 (EXTRACT_SUBREG GPR64sp:$src, sub_32))>;

This patch checks the ORRWrs has EXTRACT_SUBREG in different basic block as operand because the existing pattern with isDef32 resolves case in which the operand is in same basic block.

That explains things in terms of DAG-ISel, but there are other instruction selectors and different optimization between then and here. (Plus the isDef32 has had so many bugs it's difficult to trust!)

We know that all (?) instructions that generate a W register under AArch64 will zero the upper bits of the X register. We seems to say in this patch that certain EXTRACT_SUBREG are not valid, COPY and PHI are currently excluded. Is that really all we have to worry about? Do we know that the top bits are always 0 for all other grp32 sources?

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
215

But, is that needed for this pass? I can see it is how DAG ISel works, but if there are this pattern of code in the same block, it should still work fine, shouldn't it?

At instruction selection level, we check the operations which do not zero-out the high half of the 64-bit register using isDef32 function as below.

// Any instruction that defines a 32-bit result zeros out the high half of the
// register. Truncate can be lowered to EXTRACT_SUBREG. CopyFromReg may
// be copying from a truncate. But any other 32-bit operation will zero-extend
// up to 64 bits. AssertSext/AssertZext aren't saying anything about the upper
// 32 bits, they're probably just qualifying a CopyFromReg.
static inline bool isDef32(const SDNode &N) {
  unsigned Opc = N.getOpcode();
  return Opc != ISD::TRUNCATE && Opc != TargetOpcode::EXTRACT_SUBREG &&
         Opc != ISD::CopyFromReg && Opc != ISD::AssertSext &&
         Opc != ISD::AssertZext && Opc != ISD::AssertAlign &&
         Opc != ISD::FREEZE;
}

As you can see, the isDef32 checks fundamentally ISD::TRUNCATE within a basic block because the below pattern is matched with the ISD::TRUNCATE and EXTRACT_SUBREG does not guarantee the high 32 bits are zero.

// To truncate, we can simply extract from a subregister.
def : Pat<(i32 (trunc GPR64sp:$src)),
          (i32 (EXTRACT_SUBREG GPR64sp:$src, sub_32))>;

This patch checks the ORRWrs has EXTRACT_SUBREG in different basic block as operand because the existing pattern with isDef32 resolves case in which the operand is in same basic block.

That explains things in terms of DAG-ISel, but there are other instruction selectors and different optimization between then and here. (Plus the isDef32 has had so many bugs it's difficult to trust!)

We know that all (?) instructions that generate a W register under AArch64 will zero the upper bits of the X register. We seems to say in this patch that certain EXTRACT_SUBREG are not valid, COPY and PHI are currently excluded. Is that really all we have to worry about? Do we know that the top bits are always 0 for all other grp32 sources?

I agree with you. I am also not sure about it...

As you mentioned, If AArch64's 32-bit form of instruction defines the source operand of zero-extend, we do not need the zero-extend.

From https://developer.arm.com/documentation/dui0801/b/BABBGCAC
When you use the 32-bit form of an instruction, the upper 32 bits of the source registers are ignored and the upper 32 bits of the destination register are set to zero.

We need to check the zero-extend's source operands which do not come from the destination register of AArch64's 32-bit form instructions. I thought we are not sure the upper 32-bits set to zero in below two cases.

  1. Function argument - There would be copy instructions from physical register to virtual register.
  2. EXTRACT_SUBREG

If you feel there are more cases, please let me know. Let me update code with the cases.

For keeping track of PHI and COPY's operands, if possible, I would like to handle them after we clearly understanding the cases which does not guarantee the upper 32-bit set to zero. I am sorry for that...

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
215

Yep, I agree with you. Let me remove the check for different block.

Sorry for Ping.

I'm generally worried about allowing unknown instructions here.

Any AArch64 instruction that produces a 32-bit result zeros the high bits, yes. But some MachineInstr opcodes aren't really instructions in this sense. You've noted EXTRACT_SUBREG, PHI, and COPY specifically. Probably missing IMPLICIT_DEF. Not sure if there are other relevant instructions; any pseudo-instruction is potentially an issue, but auditing them for AArch64, the target-specific ones mostly don't produce GPR32.

In any case, I'd be happier if we had a bit to check for a "real" instruction, in TSFlags or something like that. I don't want to worry about modifying this in the future if, for example, we end up with a FREEZE MachineInstr.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
215

Missed update?

I'm generally worried about allowing unknown instructions here.

Any AArch64 instruction that produces a 32-bit result zeros the high bits, yes. But some MachineInstr opcodes aren't really instructions in this sense. You've noted EXTRACT_SUBREG, PHI, and COPY specifically. Probably missing IMPLICIT_DEF. Not sure if there are other relevant instructions; any pseudo-instruction is potentially an issue, but auditing them for AArch64, the target-specific ones mostly don't produce GPR32.

In any case, I'd be happier if we had a bit to check for a "real" instruction, in TSFlags or something like that. I don't want to worry about modifying this in the future if, for example, we end up with a FREEZE MachineInstr.

I agree with you. Let me try to check the real instructions.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
215

Sorry... let me remove it.

Sorry for Ping.

I was looking a bit last week, but didn't get very far into proving this is correct. Eli's suggestions sound good to me, and having a nice big comment explaining why it is valid would be good too.

Sorry for Ping.

I was looking a bit last week, but didn't get very far into proving this is correct. Eli's suggestions sound good to me, and having a nice big comment explaining why it is valid would be good too.

Yep, once I finish to check the real instructions, let me try to add a big comment.

jaykang10 updated this revision to Diff 380439.Oct 18 2021, 9:13 AM

Following the comment of @efriedma, updated patch

  • Added a bit to TSFlags for checking real AArch64 instruction

Do we need to add an extra TS bit, or can we just use GENERIC_OP_END?

As far as I understand the opcodes are always in the order: [TargetOpcodes, G_ opcodes, A64 Pseudos, A64 instructions]. Do we need to rule out A64 pseudos? If so can we check isPseudo().

Do we need to add an extra TS bit, or can we just use GENERIC_OP_END?

As far as I understand the opcodes are always in the order: [TargetOpcodes, G_ opcodes, A64 Pseudos, A64 instructions]. Do we need to rule out A64 pseudos? If so can we check isPseudo().

Ah, you are right!!! Let me update code. Thanks @dmgreen

jaykang10 updated this revision to Diff 380617.Oct 19 2021, 2:29 AM

Following comment of @dmgreen, updated patch.

  • Check TargetOpcode::GENERIC_OP_END to distinguish real AArch64 instruction instead of adding a additional bit to TSFlags. It makes same effect.

I have checked bootstrap build and run check-all with the build. It looks OK.

Thanks.

I assume you have checked through the AArch64 pseudo instructions (the ones before ABS_ZPmZ_B in build/lib/Target/AArch64/AArch64GenInstrInfo.inc) and they look OK? They will follow the same rules of producing zeroed upper bits for W register definitions.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
21

I don't think that representing this in terms of ISel patterns is useful. It should preferably be described in terms of the Machine Instructions that will be present, no matter where they come from.

llvm/test/CodeGen/AArch64/arm64-assert-zext-sext.ll
40 ↗(On Diff #380617)

This looks like the kind of test we could use update_llc_test_checks on.

Thanks.

I assume you have checked through the AArch64 pseudo instructions (the ones before ABS_ZPmZ_B in build/lib/Target/AArch64/AArch64GenInstrInfo.inc) and they look OK? They will follow the same rules of producing zeroed upper bits for W register definitions.

um... I have checked the pseudo MIs which inherits Pseudo class in the AArch64InstrInfo.td and AArch64InstrFormats.td files and expandMI function in AArch64ExpandPseudoInsts.cpp. I was able to see the pseudo MIs are expanded to the AArch64 MIs... and I thought they follows the rule.

If you feel some pseudo MIs do not follow the rule, please let me know. I could miss some MIs...

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
21

Yep, let me update the comment.

llvm/test/CodeGen/AArch64/arm64-assert-zext-sext.ll
40 ↗(On Diff #380617)

It looked it is pre-commit test with NFC tag.

Let me update the expected output with update_llc_test_checks.

jaykang10 updated this revision to Diff 380908.Oct 20 2021, 5:22 AM

Following comment of @dmgreen, updated patch.

Do we need to remove Kill flags from the uses of the register we are replacing? From something like this:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-none-eabi"

@e = dso_local local_unnamed_addr global i16 0, align 4
@b = dso_local local_unnamed_addr global i32 0, align 4
@d = dso_local local_unnamed_addr global i32 0, align 4
@c = dso_local local_unnamed_addr global i32* null, align 8
@a = dso_local local_unnamed_addr global i32 0, align 4

define i32 @i() {
entry:
  %0 = load i32, i32* @b, align 4
  %1 = trunc i32 %0 to i16
  %conv1 = and i16 %1, 255
  %2 = load i32, i32* @d, align 4
  %tobool.not = icmp eq i32 %2, 0
  br i1 %tobool.not, label %if.end, label %if.then

  if.then:                                          ; preds = %entry
  %conv2 = zext i16 %conv1 to i64
  %3 = inttoptr i64 %conv2 to i32*
  store i32* %3, i32** @c, align 8
  br label %if.end

  if.end:                                           ; preds = %if.then, %entry
  %4 = load i32, i32* @a, align 4
  %5 = trunc i32 %4 to i16
  %conv4 = or i16 %conv1, %5
  store i16 %conv4, i16* @e, align 4
  ret i32 0
}
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
233

Can we add a dbgs() output explaining what was removed?

Do we need to remove Kill flags from the uses of the register we are replacing? From something like this:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-none-eabi"

@e = dso_local local_unnamed_addr global i16 0, align 4
@b = dso_local local_unnamed_addr global i32 0, align 4
@d = dso_local local_unnamed_addr global i32 0, align 4
@c = dso_local local_unnamed_addr global i32* null, align 8
@a = dso_local local_unnamed_addr global i32 0, align 4

define i32 @i() {
entry:
  %0 = load i32, i32* @b, align 4
  %1 = trunc i32 %0 to i16
  %conv1 = and i16 %1, 255
  %2 = load i32, i32* @d, align 4
  %tobool.not = icmp eq i32 %2, 0
  br i1 %tobool.not, label %if.end, label %if.then

  if.then:                                          ; preds = %entry
  %conv2 = zext i16 %conv1 to i64
  %3 = inttoptr i64 %conv2 to i32*
  store i32* %3, i32** @c, align 8
  br label %if.end

  if.end:                                           ; preds = %if.then, %entry
  %4 = load i32, i32* @a, align 4
  %5 = trunc i32 %4 to i16
  %conv4 = or i16 %conv1, %5
  store i16 %conv4, i16* @e, align 4
  ret i32 0
}

You are right! We need to clear the kill flag of the source register of the ORRWrs. I was confused with the bitmask peephole opt case.

Let me clear the kill flag. Thanks for catching that. @dmgreen

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
233

Yep, let me add debug output.

jaykang10 updated this revision to Diff 380940.Oct 20 2021, 7:00 AM

Following comment of @dmgreen, updated patch.

  • Cleared kill flag of source operand of ORRWrs.

Following comment of @dmgreen, updated patch.

  • Cleared kill flag of source operand of ORRWrs.

No error from bootstrap build and check-all on AArch64 machine.

dmgreen accepted this revision.Oct 25 2021, 12:13 AM

No error from bootstrap build and check-all on AArch64 machine.

Thanks. Knowing you were away for a few days I took the time to include this in some csmith testing which I happened to already be running for another reason. It didn't come up with any other problems related to this, which is a good sign (but doesn't mean something else might be wrong, just that csmith like code does OK). So LGTM.

This revision is now accepted and ready to land.Oct 25 2021, 12:13 AM

No error from bootstrap build and check-all on AArch64 machine.

Thanks. Knowing you were away for a few days I took the time to include this in some csmith testing which I happened to already be running for another reason. It didn't come up with any other problems related to this, which is a good sign (but doesn't mean something else might be wrong, just that csmith like code does OK). So LGTM.

Thanks for checking csmith @dmgreen! Let me push this patch after rebase.