This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Split bitmask immediate of bitwise AND operation
ClosedPublic

Authored by jaykang10 on Sep 17 2021, 6:38 AM.

Details

Summary

MOVi32imm + ANDWrr ==> ANDWri + ANDWri
MOVi64imm + ANDXrr ==> ANDXri + ANDXri

The mov pseudo instruction could be expanded to multiple mov instructions later. In this case, try to split the constant operand of the mov instruction into two bitmask immediates. It makes only two AND instructions instead of multiple mov + and instructions.
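The transformation can be sketched in a small, hypothetical Python model (illustrative only, not the actual LLVM helpers; the function names `is_logical_imm` and `split_and_imm` are made up for this sketch). The idea: a constant that is not itself a valid AArch64 logical immediate may still be the AND of two logical immediates, one covering the contiguous range of its set bits and one restoring the bits outside that range.

```python
def is_logical_imm(v, size):
    # Rough model of AArch64 logical-immediate encodability: v must be a
    # repetition of a 2/4/8/16/32/64-bit element that is a rotation of a
    # contiguous run of ones (all-zeros and all-ones are not encodable).
    if v == 0 or v == (1 << size) - 1:
        return False
    for esize in (2, 4, 8, 16, 32, 64):
        if esize > size:
            break
        emask = (1 << esize) - 1
        elem = v & emask
        if elem == 0 or elem == emask:
            continue
        # The element must repeat across the full register width.
        rep = 0
        for i in range(0, size, esize):
            rep |= elem << i
        if rep != v:
            continue
        # Some rotation of the element must be a run of low contiguous ones.
        for r in range(esize):
            rot = ((elem >> r) | (elem << (esize - r))) & emask
            if rot == (1 << bin(rot).count("1")) - 1:
                return True
    return False


def split_and_imm(imm, size):
    # Return (imm1, imm2) with imm1 & imm2 == imm and both valid logical
    # immediates, or None when a single AND suffices or no split is found.
    mask = (1 << size) - 1
    u = imm & mask
    if u == 0 or is_logical_imm(u, size):
        return None
    low = (u & -u).bit_length() - 1            # lowest set bit
    high = u.bit_length() - 1                  # highest set bit
    imm1 = ((2 << high) - (1 << low)) & mask   # run covering all set bits
    imm2 = (u | ~imm1) & mask                  # original bits, outside bits set
    if is_logical_imm(imm1, size) and is_logical_imm(imm2, size):
        return imm1, imm2
    return None
```

For example, `split_and_imm(0x200400, 32)` (the constant 2098176 that appears in the review discussion) yields `(0x3ffc00, 0xffe007ff)`, the same pair of ANDWri immediates shown in the generated assembly later in the thread.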

Added a peephole optimization pass at the MIR level to implement it.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.

I have tried to implement it at the ISelDAG level. There are already patterns that fold the and node. After this transformation in ISelDAGToDAG, I was able to see those patterns fail to match. If we could guarantee the pattern for this transformation is matched after matching the other patterns related to and, it could be ok to implement it at the ISelDAG level. Maybe we could add AddedComplexity to the pattern for this transformation, but I thought the CustomInserter is better because it guarantees all pattern matching is done.

Custom Inserters are mostly needed for instructions that expand to multiple basic blocks. As far as I understand, this seems to be a different selection of and(X, C), which should fit in fine as an AArch64 tblgen pattern or with Dag2Dag select. But the AArch64 backend can be a bit complex in places. Was there something getting in the way of that? If so, do you know what?

For example, there are patterns as below.

multiclass SIMDAcrossLanesUnsignedIntrinsic<string baseOpc,
                                            SDPatternOperator opNode>
    : SIMDAcrossLanesIntrinsic<baseOpc, opNode> {
// If there is a masking operation keeping only what has been actually
// generated, consume it.
def : Pat<(i32 (and (i32 (vector_extract (insert_subvector undef,
            (opNode (v8i8 V64:$Rn)), (i64 0)), (i64 0))), maski8_or_more)),
      (i32 (EXTRACT_SUBREG
        (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
          (!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) V64:$Rn), bsub),
        ssub))>;
def : Pat<(i32 (and (i32 (vector_extract (opNode (v16i8 V128:$Rn)), (i64 0))),
            maski8_or_more)),
        (i32 (EXTRACT_SUBREG
          (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) V128:$Rn), bsub),
          ssub))>;
def : Pat<(i32 (and (i32 (vector_extract (insert_subvector undef,
            (opNode (v4i16 V64:$Rn)), (i64 0)), (i64 0))), maski16_or_more)),
          (i32 (EXTRACT_SUBREG
            (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
              (!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn), hsub),
            ssub))>;
def : Pat<(i32 (and (i32 (vector_extract (opNode (v8i16 V128:$Rn)), (i64 0))),
            maski16_or_more)),
        (i32 (EXTRACT_SUBREG
          (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v8i16v")) V128:$Rn), hsub),
          ssub))>;
}

As you can see, the patterns roughly check (and (...), mask[8|16]_or_more) and fold the and node. When I tried to split the bitmask immediate at the ISelDAGToDAG level, I saw cases in which the above patterns did not match because the mask[8|16]_or_more constraint was not satisfied.

As another case, in AArch64ISelDAGToDAG, '(or (and' patterns are folded by tryBitfieldInsertOp(). After splitting the bitmask immediate, I saw cases in which tryBitfieldInsertOp() failed.

I have not checked all of the regressions, but there were more cases in which splitting the bitmask immediate at the ISelDAGToDAG level resulted in more instructions. In order to avoid that, I implemented the logic with a CustomInserter.

Wilco1 requested changes to this revision.Sep 20 2021, 4:08 AM
Wilco1 added a subscriber: Wilco1.
Wilco1 added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
2281 ↗(On Diff #373207)

Is there no function to check it is a 16-bit immediate? If not, it would be best to add it.

2290 ↗(On Diff #373207)

This won't correctly handle 32-bit immediates like 0x1234FFFF since the invert sets the top 32 bits.

2292 ↗(On Diff #373207)

Should be * 8.

2308 ↗(On Diff #373207)

These need to be 64 bits

This revision now requires changes to proceed.Sep 20 2021, 4:08 AM

OK I see. So when adding the code to AArch64DAGToDAGISel::Select, it happened before the tablegen patterns and we do have tablegen patterns that conflict with it. I'm a little surprised to see "_or_more" here.

Could it be done as a tblgen pattern then? That way the larger pattern should win. As far as I understand it would be a case of getting the correct ImmLeaf with a couple of xforms.

OK I see. So when adding the code to AArch64DAGToDAGISel::Select, it happened before the tablegen patterns and we do have tablegen patterns that conflict with it. I'm a little surprised to see "_or_more" here.

Could it be done as a tblgen pattern then? That way the larger pattern should win. As far as I understand it would be a case of getting the correct ImmLeaf with a couple of xforms.

As I mentioned in a previous comment, the patterns for splitting the bitmask would need AddedComplexity to guarantee that the other patterns with and are already matched. I thought that would not be good for potential patterns with and added in the future.
In my personal opinion, the MIR level could be a suitable place to handle the bitmask immediate. At this moment, the AArch64 target generates a pseudo MOV MIR instruction with a big immediate and expands it in AArch64ExpandPseudo. It could be good to generate a pseudo AND or logical MIR instruction with the bitmask immediate and expand it like MOV. However, it could be a big change...
Let me try to add the patterns with xform and AddedComplexity.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
2281 ↗(On Diff #373207)

We can check it with isUInt<16>(Mask). Let me update it.

2290 ↗(On Diff #373207)

You are right! Let me update it.

2292 ↗(On Diff #373207)

Sorry, it is my mistake. Maybe, it could be 16 instead of 8?

2308 ↗(On Diff #373207)

You are right! Let me update it.

jaykang10 edited the summary of this revision. (Show Details)

Following the comment of @dmgreen, updated patch using TableGen patterns.

efriedma added inline comments.
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
8 ↗(On Diff #373655)

Do we want to do something different if the operation is in a loop, so the immediate could be hoisted?

llvm/test/CodeGen/AArch64/arm64-ccmp.ll
374 ↗(On Diff #373655)

This looks wrong?

llvm/test/CodeGen/AArch64/bitfield-insert.ll
13 ↗(On Diff #373655)

This isn't an improvement.

Nice job making the tablegen patterns work.

Is it possible to combine more of the 32-bit and 64-bit logic? They share quite a bit in common. A few more tests sound useful too, for various edge cases.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
1962 ↗(On Diff #373655)

This can probably use AArch64_IMM::expandMOVImm, and count the number of instructions it produces. A similar "how many instructions will this immediate produce" check was needed in https://reviews.llvm.org/D108871
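The suggested cost check can be modeled roughly as follows. This is a sketch with a hypothetical helper name, not the actual AArch64_IMM::expandMOVImm API: the real expansion also tries MOVN and ORR-immediate forms, so a naive MOVZ/MOVK chunk count is only an upper bound on the expansion length.

```python
def naive_movimm_cost(imm, size):
    # Count the 16-bit chunks a plain MOVZ + MOVK sequence would need.
    # The real expandMOVImm can do better (MOVN, ORR-immediate), so this
    # is an upper bound, not the exact instruction count.
    chunks = [(imm >> s) & 0xFFFF for s in range(0, size, 16)]
    nonzero = sum(1 for c in chunks if c != 0)
    return max(nonzero, 1)  # materializing zero still takes one mov
```

Splitting mov + and into two ANDWri instructions is only profitable when this count is at least 2; for example 0x200400 needs mov + movk, while 0x400 fits a single mov and should not be split.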

jaykang10 edited reviewers, added: efriedma; removed: eli.friedman.Sep 21 2021, 2:32 AM

Nice job making the tablegen patterns work.

Is it possible to combine more of the 32-bit and 64-bit logic? They share quite a bit in common. A few more tests sound useful too, for various edge cases.

Yep, let me update patch with it.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
1962 ↗(On Diff #373655)

Yep, let me update code with it.

llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
8 ↗(On Diff #373655)

um... I do not think the split bitmask immediate affects the MachineLICM pass's detection of loop-invariant code.

Let me add a test case with loop.

llvm/test/CodeGen/AArch64/arm64-ccmp.ll
374 ↗(On Diff #373655)

um... I think it is correct.

mov x9, #31
movk x9, #48, lsl #32
==>
(48 << 32) | 31 = 31
and x8, x8, #0x3f
and x8, x8, #0xffffffffffffffdf
==>
0xffffffffffffffdf & 0x3f = 31
llvm/test/CodeGen/AArch64/bitfield-insert.ll
13 ↗(On Diff #373655)

Let me update code to avoid it.

jaykang10 added inline comments.Sep 21 2021, 5:31 AM
llvm/test/CodeGen/AArch64/arm64-ccmp.ll
374 ↗(On Diff #373655)

Sorry, I made a mistake. x register is 64 bit... Let me fix it.

jaykang10 updated this revision to Diff 373902.Sep 21 2021, 6:45 AM

Following comments, updated patch and fixed a bug.

jaykang10 added inline comments.Sep 21 2021, 6:46 AM
llvm/test/CodeGen/AArch64/arm64-ccmp.ll
374 ↗(On Diff #373655)

I have fixed it with new patch.

dmgreen added inline comments.Sep 21 2021, 8:10 AM
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
132 ↗(On Diff #373902)

I think Eli was referring to something like this: https://godbolt.org/z/YMYcqrroT
Which may or may not be meaningfully worse, depending on vectorization and constant hoisting.

jaykang10 added inline comments.Sep 21 2021, 8:51 AM
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
132 ↗(On Diff #373902)

Ah, let me have a look.

jaykang10 added inline comments.Sep 21 2021, 9:12 AM
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
8 ↗(On Diff #373655)

um... I understand it now...

If an instruction has a register operand whose definition is inside the loop, the instruction is not loop invariant in MachineLoop...

AND[W|X]ri has a register operand, unlike MOVi32imm...

um... Thanks @efriedma. Let me think about it more...

jaykang10 added inline comments.Sep 21 2021, 9:34 AM
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
8 ↗(On Diff #373655)

We might need to expand MOVi32imm during pseudo expansion while considering AND... or run a machine function pass which expands the MOVi32imm with AND after MachineLICM...
@efriedma @dmgreen What do you think about it?

I guess there are three options:

  • Ignore it, because we believe it will not come up often enough in practice or will be dealt with elsewhere (I'm not sure this is always true).
  • Teach ISel lowering that the BB is in a loop and pick different strategies based on it.
  • Do it post-isel in a peephole optimization pass or machine combining pass that knows about being in a loop or critical path latencies.

For point 1, the question probably becomes whether these cases will always be vectorized/unrolled at high optimization levels, in which case it will either be a different (vector) instruction doing most of the work, or the constant will be hoisted by constant hoisting. And when optimizing for size, the smaller codesize is probably OK.

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
386 ↗(On Diff #373902)

"Create"

jaykang10 edited the summary of this revision. (Show Details)

Added a peephole optimization pass at the MIR level to split the bitmask immediate.

jaykang10 updated this revision to Diff 374485.Sep 23 2021, 2:15 AM
jaykang10 edited the summary of this revision. (Show Details)

Updated comments

dmgreen added inline comments.Sep 23 2021, 3:50 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
52

Can this preserve CFG, which should help it preserve some analysis passes

66

Can this use regsize == sizeof(T), to simplify it a little

81

Does it matter if MI is loop invariant? As opposed to the MOV.

129

Should this transform be done with multiple uses on the mov?

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

This can move into the peephole optimization pass now, which might allow it to be simplified somewhat.

jaykang10 added inline comments.Sep 23 2021, 4:41 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
52

Yep, I think so. Let me add AU.setPreservesCFG.

66

Yep, let me update it.

81

Let's say the MOV is in the preheader and the AND is in the loop, as below.

preheader:
  %1:gpr32 = MOVi32imm 2098176
  %2:gpr32 =
...
loop:
... 
  %3:gpr32 = ANDWrr killed %2:gpr32, %1:gpr32

In the above example, the AND is loop invariant and we can split the constant of the MOV.

preheader:
  %1:gpr32 = MOVi32imm 2098176
...
loop:
...
  %2:gpr32 = PHI 
  %3:gpr32 = ANDWrr killed %2:gpr32, %1:gpr32

In the above example, the AND is not loop invariant and we cannot split the constant of the MOV, because the first AND would need %2, which is defined inside the loop, while the MOV is outside the loop.

129

I thought there could be other opportunities with two bitmask immediates and other bitwise operations like OR. If possible, I would like to keep the MOV for those potential opportunities.

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

This function could be used in other places. If possible, I would like to keep this check here, conservatively.

jaykang10 updated this revision to Diff 374522.Sep 23 2021, 5:56 AM

Following comments from @dmgreen, updated patch.

jaykang10 edited the summary of this revision. (Show Details)Sep 23 2021, 5:57 AM
dmgreen added inline comments.Sep 24 2021, 2:24 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
85

Consider renaming DefMI to MovMI, to clearly specify what it is expected to contain.

103

Typo in immediate

108

A different typo in immediate :)

129

Yeah that might well happen. Do you have any tests? Maybe in an unrolled loop?

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

What other places do you expect it to be used?

It seems at the moment, with the way these are called they will be re-calculating the same values repeatedly. That was unavoidable from tablegen patterns, but shouldn't be necessary from the new Peephole pass, hopefully.

jaykang10 added inline comments.Sep 24 2021, 3:34 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
85

Yep, let me update it.

103

Sorry, let me update it.

108

Sorry, again... Let me update it.

129

I do not have test cases yet. I could try to split the bitmask more later.

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

You are right! Let me move it to the peephole pass.

dmgreen added inline comments.Sep 24 2021, 3:43 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

I just think it's worth having some tests to show this is working as intended. Maybe where the other uses are transformed and where they are not. (Ignore the unrolled loop comment - it wouldn't trigger in a loop).

jaykang10 added inline comments.Sep 24 2021, 4:01 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

Let me add a simple test in which an and and an or use the same constant.

jaykang10 added inline comments.Sep 24 2021, 4:25 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

Sorry, I was wrong... Let's look at the example below.

define void @test10(i32* nocapture %x, i32* nocapture readonly %y, i32* nocapture %z) {
entry:
  %0 = load i32, i32* %y, align 4
  %and = and i32 %0, 2098176
  store i32 %and, i32* %x, align 4
  %1 = load i32, i32* %y, align 4
  %or = or i32 %1, 2098176
  store i32 %or, i32* %z, align 4
  ret void
}

After instruction selection, the MIR is as below.

; %4:gpr32 = MOVi32imm 2098176
; %5:gpr32 = ANDWrr killed %3:gpr32, %4:gpr32
; STRWui killed %5:gpr32, %0:gpr64common, 0 :: (store (s32) into %ir.x, !tbaa !8) 
; %6:gpr32 = LDRWui %1:gpr64common, 0 :: (load (s32) from %ir.y, !tbaa !8) 
; %7:gpr32 = ORRWrr killed %6:gpr32, %4:gpr32

We can see the and and the or share the constant. The final assembly is as below.

	mov	w9, #1024
	movk	w9, #32, lsl #16
	and	w8, w8, #0x3ffc00
	and	w8, w8, #0xffe007ff
	str	w8, [x0]
	ldr	w8, [x1]
	orr	w8, w8, w9

If there are multiple uses of the constant, we should not split it, because that could create more instructions, as above. I am sorry for the mistake. Let me update the code.

jaykang10 updated this revision to Diff 374805.Sep 24 2021, 4:42 AM

Following the comments from @dmgreen, updated patch.

dmgreen added inline comments.
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
97

"because"

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

Was this going to be moved?

jaykang10 added inline comments.Sep 27 2021, 12:49 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
97

Sorry, let me update it.

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

Ah, sorry. I misunderstood what you want. Let me move this function to the peephole optimization pass.

jaykang10 updated this revision to Diff 375164.Sep 27 2021, 1:17 AM

Following comments of @dmgreen, updated patch.

My point in moving them was that they could be simplified, especially if inlined. As far as I can tell the two calls to splitAndBitmaskImm are recalculating values that were already computed in isValidAndSplitBitmaskImm. Can we not just calculate the values once?

My point in moving them was that they could be simplified, especially if inlined. As far as I can tell the two calls to splitAndBitmaskImm are recalculating values that were already computed in isValidAndSplitBitmaskImm. Can we not just calculate the values once?

Let me update it.

jaykang10 updated this revision to Diff 375295.Sep 27 2021, 9:08 AM

Following comment of @dmgreen, updated patch.

dmgreen accepted this revision.Sep 28 2021, 12:41 AM

Thanks. LGTM

This kind of code can be incredibly easy to get wrong. Perhaps try running a bootstrap or something like it to flush out any issues.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
146

This appears to be checked here and at the start of splitBitmaskImm.

Thanks. LGTM

This kind of code can be incredibly easy to get wrong. Perhaps try running a bootstrap or something like it to flush out any issues.

Thanks for review. Let me run bootstrap build and check some benchmarks.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
146

Yep, let me remove this one.

jaykang10 updated this revision to Diff 375519.Sep 28 2021, 3:52 AM

Fixed a bug

  • The same MI could be added to ToBeRemoved multiple times.
  • Replaced SmallVector with SmallSetVector for ToBeRemoved.

Thanks. LGTM

This kind of code can be incredibly easy to get wrong. Perhaps try running a bootstrap or something like it to flush out any issues.

Thanks for review. Let me run bootstrap build and check some benchmarks.

I have checked the bootstrap build and the test-suite. After fixing a bug, they pass. Let me push this patch.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 28 2021, 3:58 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
jaykang10 updated this revision to Diff 375557.Sep 28 2021, 6:54 AM

Fixed a bug.

The destination register class of the AND register form is different from that of the AND immediate form. In order to align the register classes, COPY MIs are added to convert the register class, as below.

  %1:gpr32 = MOVi32imm 2098176
  %2:gpr32common = ANDWrr %0:gpr32, killed %1:gpr32
==>
  %5:gpr32sp = ANDWri %0:gpr32, 1419
  %6:gpr32 = COPY %5:gpr32sp
  %7:gpr32sp = ANDWri %6:gpr32, 725
  %2:gpr32common = COPY %7:gpr32sp
haowei added a subscriber: haowei.Sep 28 2021, 4:10 PM

We are seeing clang crashes after this patch was relanded in https://reviews.llvm.org/rG73a196a11c0e6fe7bbf33055cc2c96ce3c61ff0d by @jaykang10. Failed build task: https://ci.chromium.org/ui/p/fuchsia/builders/ci/clang_toolchain.ci.core.arm64-release-subbuild/b8834828820366166433/overview and reproducer: https://storage.cloud.google.com/fuchsia-build/builds/8834828820366166433/bcm-990d09.tar.gz. It looks like an assertion error.

I locally reverted your change and retested clang with this reproducer, and clang no longer crashes. Could you revert your change if it takes a while to fix?

Hi, I have tracked a crash down to this patch. This case passes at f701505c45c7 and fails at this change. The target is aarch64 and the case is run at -O1.

Unfortunately, the test case is very large and I am trying to reduce it. But for now, here is the stack trace of the failure:

1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module 'util-e32393.cpp'.
4.	Running pass 'Live Variable Analysis' on function '@_ZN8colossus13ParseOpenModeENSt3__u17basic_string_viewIcNS0_11char_traitsIcEEEE'
 #0 0x0000000004d0cd9a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/Support/Unix/Signals.inc:565:11
 #1 0x0000000004d0cf6b PrintStackTraceSignalHandler(void*) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/Support/Unix/Signals.inc:632:1
 #2 0x0000000004d0b50b llvm::sys::RunSignalHandlers() /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/Support/Signals.cpp:96:5
 #3 0x0000000004d0d6e1 SignalHandler(int) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #4 0x00007f30c11dd140 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14140)
 #5 0x00007f30c0c7cce1 raise ./signal/../sysdeps/unix/sysv/linux/raise.c:51:1
 #6 0x00007f30c0c66537 abort ./stdlib/abort.c:81:7
 #7 0x00007f30c0c6640f get_sysdep_segment_value ./intl/loadmsgcat.c:509:8
 #8 0x00007f30c0c6640f _nl_load_domain ./intl/loadmsgcat.c:970:34
 #9 0x00007f30c0c75662 (/lib/x86_64-linux-gnu/libc.so.6+0x34662)
#10 0x00000000038aff71 llvm::LiveVariables::HandleVirtRegUse(llvm::Register, llvm::MachineBasicBlock*, llvm::MachineInstr&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/LiveVariables.cpp:131:20
#11 0x00000000038b27fc llvm::LiveVariables::runOnInstr(llvm::MachineInstr&, llvm::SmallVectorImpl<unsigned int>&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/LiveVariables.cpp:540:7
#12 0x00000000038b2c9d llvm::LiveVariables::runOnBlock(llvm::MachineBasicBlock*, unsigned int) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/LiveVariables.cpp:572:25
#13 0x00000000038b331d llvm::LiveVariables::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/LiveVariables.cpp:640:5
#14 0x00000000039676b7 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/MachineFunctionPass.cpp:72:8
#15 0x000000000404db6c llvm::FPPassManager::runOnFunction(llvm::Function&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1439:23
#16 0x0000000004052e95 llvm::FPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1485:16
#17 0x000000000404e534 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1554:23
#18 0x000000000404e058 llvm::legacy::PassManagerImpl::run(llvm::Module&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:542:16
#19 0x00000000040531a1 llvm::legacy::PassManager::run(llvm::Module&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1681:3
#20 0x00000000051cd032 (anonymous namespace)::EmitAssemblyHelper::EmitAssemblyWithNewPassManager(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1516:3
#21 0x00000000051ca597 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1675:5
#22 0x0000000006808fe0 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:334:7
#23 0x0000000008649013 clang::ParseAST(clang::Sema&, bool, bool) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/Parse/ParseAST.cpp:178:12
#24 0x0000000005caa882 clang::ASTFrontendAction::ExecuteAction() /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/Frontend/FrontendAction.cpp:1065:1
#25 0x00000000068047bb clang::CodeGenAction::ExecuteAction() /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:1072:5
#26 0x0000000005caa248 clang::FrontendAction::Execute() /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/Frontend/FrontendAction.cpp:960:7
#27 0x0000000005bdfb12 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/Frontend/CompilerInstance.cpp:974:23
#28 0x0000000005e74716 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:278:8
#29 0x00000000014ea90d cc1_main(llvm::ArrayRef<char const*>, char const*, void*) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/tools/driver/cc1_main.cpp:246:13
#30 0x00000000014dd3d8 ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/tools/driver/driver.cpp:317:5
#31 0x00000000014dc3f1 main /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/tools/driver/driver.cpp:388:5
#32 0x00007f30c0c67d0a __libc_start_main ./csu/../csu/libc-start.c:308:16
#33 0x00000000014dbbea _start (/usr/local/google/home/saugustine/llvm-new/build/bin/clang+0x14dbbea)
craig.topper added inline comments.
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
132

Do you need to check if the SUBREG_TO_REG has multiple uses?

Since we have a reproducer, I have reverted this change with rGc07f7099690e.

Since we have a reproducer, I have reverted this change with rGc07f7099690e.

Sorry for the error.

I have fixed it, and I can see that the reproducer from @haowei passes. After checking tests and benchmarks again, let me try to push this patch again.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
132

You are right! I need to check it as well as mov. Thanks @craig.topper!

Additionally, if possible, @saugustine, can you file a bug with your test case to Bugzilla please? It could be a different failure.

jaykang10 updated this revision to Diff 375760.Sep 28 2021, 6:50 PM

Fixed bugs

  • Check MOVimm before checking multiple uses
  • Check SUBREG_TO_REG has multiple uses

I don't think this needs to create all the COPYs if it calls constrainRegClass on the registers to ensure they do not contain the wrong set of registers.

I don't think this needs to create all the COPYs if it calls constrainRegClass on the registers to ensure they do not contain the wrong set of registers.

You are right! We can constrain the register class to GPR[32|64]common. Thanks @dmgreen. Let me update it.

jaykang10 updated this revision to Diff 375824.EditedSep 29 2021, 3:41 AM

Following comment of @dmgreen, updated patch.

  • Removed COPY MIs by constraining register classes.

Thanks. I think this LGTM, provided the reported issues are now working OK.

Thanks. I think this LGTM, provided the reported issues are now working OK.

Yep, I am checking test-suites with different optimization levels. Once it is ok, let me try to push this patch again.

@haowei @saugustine If possible, can you check your examples are ok with latest update of this patch please?

@haowei @saugustine If possible, can you check your examples are ok with latest update of this patch please?

I no longer see this error in our build after applying your latest patch.

@haowei @saugustine If possible, can you check your examples are ok with latest update of this patch please?

I no longer see this error in our build after applying your latest patch.

Thanks for checking it. Let me push this patch again.

zatrazz added a subscriber: zatrazz.Oct 5 2021, 3:54 AM

This has caused a regression on the stage 2 buildbot [1], more specifically the test FAIL: Clang::p4-0x.cpp (but I think the other failures might be related).

[1] https://lab.llvm.org/buildbot/#/builders/185/builds/543

This has caused a regression on the stage 2 buildbot [1], more specifically the test FAIL: Clang::p4-0x.cpp (but I think the other failures might be related).

[1] https://lab.llvm.org/buildbot/#/builders/185/builds/543

um... I checked the bootstrap build before pushing this patch. Let me check it again.

This has caused a regression on the stage 2 buildbot [1], more specifically the test FAIL: Clang::p4-0x.cpp (but I think the other failures might be related).

[1] https://lab.llvm.org/buildbot/#/builders/185/builds/543

um... I checked the bootstrap build before pushing this patch. Let me check it again.

It looks like the test has the -fsyntax-only command line option, so I think the clang patches may have caused the error.

I've also reproduced it. Since the stage 1 check passes, my guess is that the clang built in stage 1 is emitting incorrect code.

I've also reproduced it. Since the stage 1 check passes, my guess is that the clang built in stage 1 is emitting incorrect code.

um... I checked it on an AArch64 machine as below.

make stage2 -j32
./bin/llvm-lit ../clang/test/CXX/over/over.match/over.match.funcs/p4-0x.cpp
llvm-lit: /home/jinkan01/Projects/llvm-project/llvm/utils/lit/lit/llvm/config.py:436: note: using clang: /home/jinkan01/Projects/llvm-project/build/bin/clang
-- Testing: 1 tests, 1 workers --
PASS: Clang :: CXX/over/over.match/over.match.funcs/p4-0x.cpp (1 of 1)

Testing Time: 0.22s
  Passed: 1

If possible, can you let me know how to reproduce it please? @DavidSpickett @zatrazz

DavidSpickett added a comment.EditedOct 5 2021, 8:22 AM

If you have docker:

$ docker pull linaro/ci-arm64-tcwg-llvmbot-ubuntu:bionic
$ docker run -it --entrypoint=/bin/bash linaro/ci-arm64-tcwg-llvmbot-ubuntu:bionic
root@a9e83ff71f75:/# export CC=/usr/local/clang+llvm-12.0.0-aarch64-linux-gnu/bin/clang
root@a9e83ff71f75:/# export CXX=/usr/local/clang+llvm-12.0.0-aarch64-linux-gnu/bin/clang++
<follow commands from the buildbot page>

I suggest following https://lab.llvm.org/buildbot/#/builders/179/builds/1234 because it's the simplest. (not by much, it just doesn't use lld)

If you don't, then the thing to note is that clang-12 is used for stage 1.

I'll check with the system gcc as well.

Edit: yes, it reproduces with gcc building stage 1 as well.

I believe the re-land also caused a few tests to break for us in our 2-stage build:

Failed Tests (4):
  Clang :: CXX/dcl.decl/dcl.init/dcl.init.ref/p5-0x.cpp
  Clang :: CodeGenCXX/catch-undef-behavior.cpp
  Clang :: Profile/def-assignop.cpp
  Clang :: SemaCXX/type-traits.cpp

https://luci-milo.appspot.com/ui/p/fuchsia/builders/prod/clang-linux-arm64/b8834634043817970513/overview

Sorry, let me try to reproduce the errors.

jaykang10 updated this revision to Diff 377994.Oct 7 2021, 1:42 PM

Fixed a bug

  • For the 32-bit form of the instruction, the upper 32 bits of the destination register are set to zero. If there is a SUBREG_TO_REG, set the upper 32 bits of UImm to zero.

I am sorry for the delay. It took time to find the problem and check the stage 2 build again.

I was able to reproduce the regressions with the stage 2 build. I have fixed them, and I can see the regressions are gone.

@DavidSpickett @leonardchan @zatrazz If possible, can you check the regressions with latest update of this patch please?

If there are still regressions with the update, please let me know.

I am sorry for the delay. It took time to find the problem and check the stage 2 build again.

I was able to reproduce the regressions with the stage 2 build. I have fixed them, and I can see the regressions are gone.

@DavidSpickett @leonardchan @zatrazz If possible, can you check the regressions with latest update of this patch please?

If there are still regressions with the update, please let me know.

Following https://lab.llvm.org/buildbot/#/builders/179/builds/1234, I can see that ninja check-all of the stage 2 build with the latest update of this patch passes on an AArch64 machine.

If there are still regressions with the update, please let me know.

Looks good on our side. I think you can reland now and see if anyone else still has failures with that.

If there are still regressions with the update, please let me know.

Looks good on our side. I think you can reland now and see if anyone else still has failures with that.

Thanks @DavidSpickett for checking it! Let me re-commit this patch.

This patch causes a compiler crash; the reduced reproducer is here: https://godbolt.org/z/71sThGE19

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

With the Chromium codebase, we found that getUniqueVRegDef returns null, as it legitimately can.
I'd add a check for that.

137–138

Ditto.

craig.topper added inline comments.Oct 15 2021, 1:46 PM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

Since this is running on SSA form, that should be getVRegDef. Is it returning null because it is a physical register, or for some other reason?

jaykang10 added inline comments.Oct 15 2021, 2:39 PM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

Sorry for the bug again...

The MI is only deleted after visiting all instructions, so after the transformation there are multiple definitions of the same register, which breaks SSA form... I found the problem while implementing https://reviews.llvm.org/D110841, but I missed updating this patch with the fix. I am sorry for that.

I need to check the bootstrap build. Once it is ok, I will let you know.

jaykang10 updated this revision to Diff 380105.Oct 15 2021, 3:06 PM

Fixed a bug.

  • Create new virtual register for the definition of new AND instruction and replace old register by the new one to keep SSA form.

If you feel it is not a good solution, please let me know. I am checking the bootstrap build on an AArch64 machine with this change.

jaykang10 updated this revision to Diff 380106.Oct 15 2021, 3:10 PM

The check-all with the bootstrap build is ok on an AArch64 machine. The reduced case also passes.

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

Thanks for checking it! Let me push this patch.

benshi001 added a comment.EditedOct 21 2021, 9:18 PM

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

Thanks for checking it! Let me push this patch.

Are there any more blame reports from build machines? Can I do my rebase work in https://reviews.llvm.org/D111034?

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

Thanks for checking it! Let me push this patch.

Are there any more blame reports from build machines? Can I do my rebase work in https://reviews.llvm.org/D111034?

I've not heard of any issues; it sounds good to go ahead. Were the problems the same as the ones fixed here?

benshi001 added a comment.EditedNov 1 2021, 12:35 AM

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

Thanks for checking it! Let me push this patch.

Are there any more blame reports from build machines? Can I do my rebase work in https://reviews.llvm.org/D111034?

I've not heard of any issues; it sounds good to go ahead. Were the problems the same as the ones fixed here?

Not sure if the problems are the same as the ones fixed here; they seem to be different, and I cannot reproduce them locally. Maybe I should directly do "git push" and wait for any blame reports?