This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Split bitmask immediate of bitwise AND operation
ClosedPublic

Authored by jaykang10 on Sep 17 2021, 6:38 AM.

Details

Summary

MOVi32imm + ANDWrr ==> ANDWri + ANDWri
MOVi64imm + ANDXrr ==> ANDXri + ANDXri

The mov pseudo instruction could be expanded to multiple mov instructions later. In this case, try to split the constant operand of the mov instruction into two bitmask immediates. It makes only two AND instructions instead of multiple mov + and instructions.
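The transformation can be sketched in a small, hypothetical Python model (illustrative only, not the actual LLVM helpers; the function names `is_logical_imm` and `split_and_imm` are made up for this sketch). The idea: a constant that is not itself a valid AArch64 logical immediate may still be the AND of two logical immediates, one covering the contiguous range of its set bits and one restoring the bits outside that range.

```python
def is_logical_imm(v, size):
    # Rough model of AArch64 logical-immediate encodability: v must be a
    # repetition of a 2/4/8/16/32/64-bit element that is a rotation of a
    # contiguous run of ones (all-zeros and all-ones are not encodable).
    if v == 0 or v == (1 << size) - 1:
        return False
    for esize in (2, 4, 8, 16, 32, 64):
        if esize > size:
            break
        emask = (1 << esize) - 1
        elem = v & emask
        if elem == 0 or elem == emask:
            continue
        # The element must repeat across the full register width.
        rep = 0
        for i in range(0, size, esize):
            rep |= elem << i
        if rep != v:
            continue
        # Some rotation of the element must be a run of low contiguous ones.
        for r in range(esize):
            rot = ((elem >> r) | (elem << (esize - r))) & emask
            if rot == (1 << bin(rot).count("1")) - 1:
                return True
    return False


def split_and_imm(imm, size):
    # Return (imm1, imm2) with imm1 & imm2 == imm and both valid logical
    # immediates, or None when a single AND suffices or no split is found.
    mask = (1 << size) - 1
    u = imm & mask
    if u == 0 or is_logical_imm(u, size):
        return None
    low = (u & -u).bit_length() - 1            # lowest set bit
    high = u.bit_length() - 1                  # highest set bit
    imm1 = ((2 << high) - (1 << low)) & mask   # run covering all set bits
    imm2 = (u | ~imm1) & mask                  # original bits, outside bits set
    if is_logical_imm(imm1, size) and is_logical_imm(imm2, size):
        return imm1, imm2
    return None
```

For example, `split_and_imm(0x200400, 32)` (the constant 2098176 that appears in the review discussion) yields `(0x3ffc00, 0xffe007ff)`, the same pair of ANDWri immediates shown in the generated assembly later in the thread.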

Added a peephole optimization pass at the MIR level to implement it.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.

I have tried to implement it at the ISelDAG level. There are already patterns that fold the and node. After this transformation in ISelDAGToDAG, I was able to see those patterns fail to match. If we could guarantee the pattern for this transformation is matched after matching the other patterns related to and, it could be ok to implement it at the ISelDAG level. Maybe we could add AddedComplexity to the pattern for this transformation, but I thought the CustomInserter is better because it guarantees all pattern matching is done.

Custom Inserters are mostly needed for instructions that expand to multiple basic blocks. As far as I understand, this seems to be a different selection of and(X, C), which should fit in fine as an AArch64 tblgen pattern or with Dag2Dag select. But the AArch64 backend can be a bit complex in places. Was there something getting in the way of that? If so, do you know what?

For example, there are patterns as below.

multiclass SIMDAcrossLanesUnsignedIntrinsic<string baseOpc,
                                            SDPatternOperator opNode>
    : SIMDAcrossLanesIntrinsic<baseOpc, opNode> {
// If there is a masking operation keeping only what has been actually
// generated, consume it.
def : Pat<(i32 (and (i32 (vector_extract (insert_subvector undef,
            (opNode (v8i8 V64:$Rn)), (i64 0)), (i64 0))), maski8_or_more)),
      (i32 (EXTRACT_SUBREG
        (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
          (!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) V64:$Rn), bsub),
        ssub))>;
def : Pat<(i32 (and (i32 (vector_extract (opNode (v16i8 V128:$Rn)), (i64 0))),
            maski8_or_more)),
        (i32 (EXTRACT_SUBREG
          (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) V128:$Rn), bsub),
          ssub))>;
def : Pat<(i32 (and (i32 (vector_extract (insert_subvector undef,
            (opNode (v4i16 V64:$Rn)), (i64 0)), (i64 0))), maski16_or_more)),
          (i32 (EXTRACT_SUBREG
            (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
              (!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn), hsub),
            ssub))>;
def : Pat<(i32 (and (i32 (vector_extract (opNode (v8i16 V128:$Rn)), (i64 0))),
            maski16_or_more)),
        (i32 (EXTRACT_SUBREG
          (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v8i16v")) V128:$Rn), hsub),
          ssub))>;
}

As you can see, the patterns roughly check (and (...), mask[8|16]_or_more) and fold the and node. When I tried to split the bitmask immediate at the ISelDAGToDAG level, I saw cases in which the above patterns did not match because the mask[8|16]_or_more constraint was not satisfied.

As another case, in AArch64ISelDAGToDAG, '(or (and' patterns are folded by tryBitfieldInsertOp(). After splitting the bitmask immediate, I saw cases in which tryBitfieldInsertOp() failed.

I have not checked all of the regressions, but there were more cases in which splitting the bitmask immediate at the ISelDAGToDAG level resulted in more instructions. In order to avoid that, I implemented the logic with a CustomInserter.

Wilco1 requested changes to this revision.Sep 20 2021, 4:08 AM
Wilco1 added a subscriber: Wilco1.
Wilco1 added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
2281 ↗(On Diff #373207)

Is there no function to check it is a 16-bit immediate? If not, it would be best to add it.

2290 ↗(On Diff #373207)

This won't correctly handle 32-bit immediates like 0x1234FFFF since the invert sets the top 32 bits.

2292 ↗(On Diff #373207)

Should be * 8.

2308 ↗(On Diff #373207)

These need to be 64 bits

This revision now requires changes to proceed.Sep 20 2021, 4:08 AM

OK I see. So when adding the code to AArch64DAGToDAGISel::Select, it happened before the tablegen patterns and we do have tablegen patterns that conflict with it. I'm a little surprised to see "_or_more" here.

Could it be done as a tblgen pattern then? That way the larger pattern should win. As far as I understand it would be a case of getting the correct ImmLeaf with a couple of xforms.

OK I see. So when adding the code to AArch64DAGToDAGISel::Select, it happened before the tablegen patterns and we do have tablegen patterns that conflict with it. I'm a little surprised to see "_or_more" here.

Could it be done as a tblgen pattern then? That way the larger pattern should win. As far as I understand it would be a case of getting the correct ImmLeaf with a couple of xforms.

As I mentioned in a previous comment, the patterns for splitting the bitmask would need AddedComplexity to guarantee that the other patterns with and are already matched. I thought that would not be good for potential patterns with and added in the future.
In my personal opinion, the MIR level could be a suitable place to handle the bitmask immediate. At this moment, the AArch64 target generates a pseudo MOV MIR instruction with a big immediate and expands it in AArch64ExpandPseudo. It could be good to generate a pseudo AND or logical MIR instruction with the bitmask immediate and expand it like MOV. However, it could be a big change...
Let me try to add the patterns with xform and AddedComplexity.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
2281 ↗(On Diff #373207)

We can check it with isUInt<16>(Mask). Let me update it.

2290 ↗(On Diff #373207)

You are right! Let me update it.

2292 ↗(On Diff #373207)

Sorry, it is my mistake. Maybe, it could be 16 instead of 8?

2308 ↗(On Diff #373207)

You are right! Let me update it.

jaykang10 edited the summary of this revision. (Show Details)

Following the comment of @dmgreen, updated patch using TableGen patterns.

efriedma added inline comments.
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
8 ↗(On Diff #373655)

Do we want to do something different if the operation is in a loop, so the immediate could be hoisted?

llvm/test/CodeGen/AArch64/arm64-ccmp.ll
374 ↗(On Diff #373655)

This looks wrong?

llvm/test/CodeGen/AArch64/bitfield-insert.ll
13 ↗(On Diff #373655)

This isn't an improvement.

Nice job making the tablegen patterns work.

Is it possible to combine more of the 32-bit and 64-bit logic? They share quite a bit in common. A few more tests sound useful too, for various edge cases.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
1962 ↗(On Diff #373655)

This can probably use AArch64_IMM::expandMOVImm, and count the number of instructions it produces. A similar "how many instructions will this immediate produce" check was needed in https://reviews.llvm.org/D108871
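The suggested cost check can be modeled roughly as follows. This is a sketch with a hypothetical helper name, not the actual AArch64_IMM::expandMOVImm API: the real expansion also tries MOVN and ORR-immediate forms, so a naive MOVZ/MOVK chunk count is only an upper bound on the expansion length.

```python
def naive_movimm_cost(imm, size):
    # Count the 16-bit chunks a plain MOVZ + MOVK sequence would need.
    # The real expandMOVImm can do better (MOVN, ORR-immediate), so this
    # is an upper bound, not the exact instruction count.
    chunks = [(imm >> s) & 0xFFFF for s in range(0, size, 16)]
    nonzero = sum(1 for c in chunks if c != 0)
    return max(nonzero, 1)  # materializing zero still takes one mov
```

Splitting mov + and into two ANDWri instructions is only profitable when this count is at least 2; for example 0x200400 needs mov + movk, while 0x400 fits a single mov and should not be split.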

jaykang10 edited reviewers, added: efriedma; removed: eli.friedman.Sep 21 2021, 2:32 AM

Nice job making the tablegen patterns work.

Is it possible to combine more of the 32-bit and 64-bit logic? They share quite a bit in common. A few more tests sound useful too, for various edge cases.

Yep, let me update patch with it.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
1962 ↗(On Diff #373655)

Yep, let me update code with it.

llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
8 ↗(On Diff #373655)

um... I do not think the split bitmask immediate affects the MachineLICM pass's detection of loop-invariant code.

Let me add a test case with loop.

llvm/test/CodeGen/AArch64/arm64-ccmp.ll
374 ↗(On Diff #373655)

um... I think it is correct.

mov x9, #31
movk x9, #48, lsl #32
==>
(48 << 32) | 31 = 31
and x8, x8, #0x3f
and x8, x8, #0xffffffffffffffdf
==>
0xffffffffffffffdf & 0x3f = 31
llvm/test/CodeGen/AArch64/bitfield-insert.ll
13 ↗(On Diff #373655)

Let me update code to avoid it.

jaykang10 added inline comments.Sep 21 2021, 5:31 AM
llvm/test/CodeGen/AArch64/arm64-ccmp.ll
374 ↗(On Diff #373655)

Sorry, I made a mistake. x register is 64 bit... Let me fix it.

jaykang10 updated this revision to Diff 373902.Sep 21 2021, 6:45 AM

Following comments, updated patch and fixed a bug.

jaykang10 added inline comments.Sep 21 2021, 6:46 AM
llvm/test/CodeGen/AArch64/arm64-ccmp.ll
374 ↗(On Diff #373655)

I have fixed it with new patch.

dmgreen added inline comments.Sep 21 2021, 8:10 AM
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
132 ↗(On Diff #373902)

I think Eli was referring to something like this: https://godbolt.org/z/YMYcqrroT
Which may or may not be meaningfully worse, depending on vectorization and constant hoisting.

jaykang10 added inline comments.Sep 21 2021, 8:51 AM
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
132 ↗(On Diff #373902)

Ah, let me have a look.

jaykang10 added inline comments.Sep 21 2021, 9:12 AM
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
8 ↗(On Diff #373655)

um... I understand it now...

If an instruction has a register operand whose definition is inside the loop, the instruction is not loop invariant in MachineLoop...

AND[W|X]ri has a register operand, unlike MOVi32imm...

um... Thanks @efriedma. Let me think about it more...

jaykang10 added inline comments.Sep 21 2021, 9:34 AM
llvm/test/CodeGen/AArch64/aarch64-split-and-bitmask-immediate.ll
8 ↗(On Diff #373655)

We might need to expand MOVi32imm during pseudo expansion while considering AND... or run a machine function pass which expands the MOVi32imm with AND after MachineLICM...
@efriedma @dmgreen What do you think about it?

I guess there are three options:

  • Ignore it, because we believe it will not come up often enough in practice or will be dealt with elsewhere (I'm not sure this is always true).
  • Teach ISel lowering that the BB is in a loop and pick different strategies based on it.
  • Do it post-isel in a peephole optimization pass or machine combining pass that knows about being in a loop or critical path latencies.

For point 1, the question probably becomes whether these cases will always be vectorized/unrolled at high optimization levels, in which case it will either be a different (vector) instruction doing most of the work, or the constant will be hoisted by constant hoisting. And when optimizing for size, the smaller codesize is probably OK.

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
386 ↗(On Diff #373902)

"Create"

jaykang10 edited the summary of this revision. (Show Details)

Added a peephole optimization pass at the MIR level to split the bitmask immediate.

jaykang10 updated this revision to Diff 374485.Sep 23 2021, 2:15 AM
jaykang10 edited the summary of this revision. (Show Details)

Updated comments

dmgreen added inline comments.Sep 23 2021, 3:50 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
52

Can this preserve CFG, which should help it preserve some analysis passes

66

Can this use regsize == sizeof(T), to simplify it a little

81

Does it matter if MI is loop invariant? As opposed to the MOV.

129

Should this transform be done with multiple uses on the mov?

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

This can move into the peephole optimization pass now, which might allow it to be simplified somewhat.

jaykang10 added inline comments.Sep 23 2021, 4:41 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
52

Yep, I think so. Let me add AU.setPreservesCFG.

66

Yep, let me update it.

81

Let's say the MOV is in the preheader and the AND is in the loop, as below.

preheader:
  %1:gpr32 = MOVi32imm 2098176
  %2:gpr32 =
...
loop:
... 
  %3:gpr32 = ANDWrr killed %2:gpr32, %1:gpr32

In the above example, the AND is loop invariant and we can split the constant of the MOV.

preheader:
  %1:gpr32 = MOVi32imm 2098176
...
loop:
...
  %2:gpr32 = PHI 
  %3:gpr32 = ANDWrr killed %2:gpr32, %1:gpr32

In the above example, the AND is not loop invariant and we cannot split the constant of the MOV, because the first AND would need %2, which is defined inside the loop, while the MOV is outside the loop.

129

I thought there could be other opportunities with two bitmask immediates and other bitwise operations like OR. If possible, I would like to keep the MOV for those potential opportunities.

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

This function could be used in other places. If possible, I would like to keep this check here, conservatively.

jaykang10 updated this revision to Diff 374522.Sep 23 2021, 5:56 AM

Following comments from @dmgreen, updated patch.

jaykang10 edited the summary of this revision. (Show Details)Sep 23 2021, 5:57 AM
dmgreen added inline comments.Sep 24 2021, 2:24 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
85

Consider renaming DefMI to MovMI, to clearly specify what it is expected to contain.

103

Typo in immediate

108

A different typo in immediate :)

129

Yeah that might well happen. Do you have any tests? Maybe in an unrolled loop?

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

What other places do you expect it to be used?

It seems at the moment, with the way these are called they will be re-calculating the same values repeatedly. That was unavoidable from tablegen patterns, but shouldn't be necessary from the new Peephole pass, hopefully.

jaykang10 added inline comments.Sep 24 2021, 3:34 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
85

Yep, let me update it.

103

Sorry, let me update it.

108

Sorry, again... Let me update it.

129

I do not have test cases yet. I could try to split the bitmask more later.

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

You are right! Let me move it to the peephole pass.

dmgreen added inline comments.Sep 24 2021, 3:43 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

I just think it's worth having some tests to show this is working as intended. Maybe where the other uses are transformed and where they are not. (Ignore the unrolled loop comment - it wouldn't trigger in a loop).

jaykang10 added inline comments.Sep 24 2021, 4:01 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

Let me add a simple test in which an and and an or use the same constant.

jaykang10 added inline comments.Sep 24 2021, 4:25 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

Sorry, I was wrong... Let's look at the example below.

define void @test10(i32* nocapture %x, i32* nocapture readonly %y, i32* nocapture %z) {
entry:
  %0 = load i32, i32* %y, align 4
  %and = and i32 %0, 2098176
  store i32 %and, i32* %x, align 4
  %1 = load i32, i32* %y, align 4
  %or = or i32 %1, 2098176
  store i32 %or, i32* %z, align 4
  ret void
}

After instruction selection, the MIR is as below.

; %4:gpr32 = MOVi32imm 2098176
; %5:gpr32 = ANDWrr killed %3:gpr32, %4:gpr32
; STRWui killed %5:gpr32, %0:gpr64common, 0 :: (store (s32) into %ir.x, !tbaa !8) 
; %6:gpr32 = LDRWui %1:gpr64common, 0 :: (load (s32) from %ir.y, !tbaa !8) 
; %7:gpr32 = ORRWrr killed %6:gpr32, %4:gpr32

We can see the and and the or share the constant. The final assembly is as below.

	mov	w9, #1024
	movk	w9, #32, lsl #16
	and	w8, w8, #0x3ffc00
	and	w8, w8, #0xffe007ff
	str	w8, [x0]
	ldr	w8, [x1]
	orr	w8, w8, w9

If there are multiple uses of the constant, we should not split it, because that could create more instructions, as above. I am sorry for the mistake. Let me update the code.

jaykang10 updated this revision to Diff 374805.Sep 24 2021, 4:42 AM

Following the comments from @dmgreen, updated patch.

dmgreen added inline comments.
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
97

"because"

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

Was this going to be moved?

jaykang10 added inline comments.Sep 27 2021, 12:49 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
97

Sorry, let me update it.

llvm/lib/Target/AArch64/MCTargetDesc/AArch64AddressingModes.h
342 ↗(On Diff #374485)

Ah, sorry. I misunderstood what you want. Let me move this function to the peephole optimization pass.

jaykang10 updated this revision to Diff 375164.Sep 27 2021, 1:17 AM

Following comments of @dmgreen, updated patch.

My point in moving them was that they could be simplified, especially if inlined. As far as I can tell the two calls to splitAndBitmaskImm are recalculating values that were already computed in isValidAndSplitBitmaskImm. Can we not just calculate the values once?

My point in moving them was that they could be simplified, especially if inlined. As far as I can tell the two calls to splitAndBitmaskImm are recalculating values that were already computed in isValidAndSplitBitmaskImm. Can we not just calculate the values once?

Let me update it.

jaykang10 updated this revision to Diff 375295.Sep 27 2021, 9:08 AM

Following comment of @dmgreen, updated patch.

dmgreen accepted this revision.Sep 28 2021, 12:41 AM

Thanks. LGTM

This kind of code can be incredibly easy to get wrong. Perhaps try running a bootstrap or something like it to flush out any issues.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
146

This appears to be checked here and at the start of splitBitmaskImm.

Thanks. LGTM

This kind of code can be incredibly easy to get wrong. Perhaps try running a bootstrap or something like it to flush out any issues.

Thanks for review. Let me run bootstrap build and check some benchmarks.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
146

Yep, let me remove this one.

jaykang10 updated this revision to Diff 375519.Sep 28 2021, 3:52 AM

Fixed a bug

  • The same MI could be added to ToBeRemoved multiple times.
  • Replaced SmallVector with SmallSetVector for ToBeRemoved.

Thanks. LGTM

This kind of code can be incredibly easy to get wrong. Perhaps try running a bootstrap or something like it to flush out any issues.

Thanks for review. Let me run bootstrap build and check some benchmarks.

I have checked the bootstrap build and the test-suite. After fixing a bug, they pass. Let me push this patch.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 28 2021, 3:58 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
jaykang10 updated this revision to Diff 375557.Sep 28 2021, 6:54 AM

Fixed a bug.

The destination register class of the AND register form is different from that of the AND immediate form. In order to align the register classes, COPY MIs are added to convert the register class, as below.

  %1:gpr32 = MOVi32imm 2098176
  %2:gpr32common = ANDWrr %0:gpr32, killed %1:gpr32
==>
  %5:gpr32sp = ANDWri %0:gpr32, 1419
  %6:gpr32 = COPY %5:gpr32sp
  %7:gpr32sp = ANDWri %6:gpr32, 725
  %2:gpr32common = COPY %7:gpr32sp
haowei added a subscriber: haowei.Sep 28 2021, 4:10 PM

We are seeing clang crashes after this patch was relanded in https://reviews.llvm.org/rG73a196a11c0e6fe7bbf33055cc2c96ce3c61ff0d by @jaykang10. Failed build task: https://ci.chromium.org/ui/p/fuchsia/builders/ci/clang_toolchain.ci.core.arm64-release-subbuild/b8834828820366166433/overview and reproducer: https://storage.cloud.google.com/fuchsia-build/builds/8834828820366166433/bcm-990d09.tar.gz. It looks like an assertion error.

I locally reverted your change and retested clang with this reproducer, and clang no longer crashes. Could you revert your change if it takes a while to fix?

Hi, I have tracked a crash down to this patch. This case passes at f701505c45c7 and fails at this change. The target is aarch64 and the case is run at -O1.

Unfortunately, the test case is very large and I am trying to reduce it. But for now, here is the stack trace of the failure:

1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module 'util-e32393.cpp'.
4.	Running pass 'Live Variable Analysis' on function '@_ZN8colossus13ParseOpenModeENSt3__u17basic_string_viewIcNS0_11char_traitsIcEEEE'
 #0 0x0000000004d0cd9a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/Support/Unix/Signals.inc:565:11
 #1 0x0000000004d0cf6b PrintStackTraceSignalHandler(void*) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/Support/Unix/Signals.inc:632:1
 #2 0x0000000004d0b50b llvm::sys::RunSignalHandlers() /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/Support/Signals.cpp:96:5
 #3 0x0000000004d0d6e1 SignalHandler(int) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #4 0x00007f30c11dd140 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14140)
 #5 0x00007f30c0c7cce1 raise ./signal/../sysdeps/unix/sysv/linux/raise.c:51:1
 #6 0x00007f30c0c66537 abort ./stdlib/abort.c:81:7
 #7 0x00007f30c0c6640f get_sysdep_segment_value ./intl/loadmsgcat.c:509:8
 #8 0x00007f30c0c6640f _nl_load_domain ./intl/loadmsgcat.c:970:34
 #9 0x00007f30c0c75662 (/lib/x86_64-linux-gnu/libc.so.6+0x34662)
#10 0x00000000038aff71 llvm::LiveVariables::HandleVirtRegUse(llvm::Register, llvm::MachineBasicBlock*, llvm::MachineInstr&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/LiveVariables.cpp:131:20
#11 0x00000000038b27fc llvm::LiveVariables::runOnInstr(llvm::MachineInstr&, llvm::SmallVectorImpl<unsigned int>&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/LiveVariables.cpp:540:7
#12 0x00000000038b2c9d llvm::LiveVariables::runOnBlock(llvm::MachineBasicBlock*, unsigned int) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/LiveVariables.cpp:572:25
#13 0x00000000038b331d llvm::LiveVariables::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/LiveVariables.cpp:640:5
#14 0x00000000039676b7 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/CodeGen/MachineFunctionPass.cpp:72:8
#15 0x000000000404db6c llvm::FPPassManager::runOnFunction(llvm::Function&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1439:23
#16 0x0000000004052e95 llvm::FPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1485:16
#17 0x000000000404e534 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1554:23
#18 0x000000000404e058 llvm::legacy::PassManagerImpl::run(llvm::Module&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:542:16
#19 0x00000000040531a1 llvm::legacy::PassManager::run(llvm::Module&) /usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1681:3
#20 0x00000000051cd032 (anonymous namespace)::EmitAssemblyHelper::EmitAssemblyWithNewPassManager(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1516:3
#21 0x00000000051ca597 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1675:5
#22 0x0000000006808fe0 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:334:7
#23 0x0000000008649013 clang::ParseAST(clang::Sema&, bool, bool) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/Parse/ParseAST.cpp:178:12
#24 0x0000000005caa882 clang::ASTFrontendAction::ExecuteAction() /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/Frontend/FrontendAction.cpp:1065:1
#25 0x00000000068047bb clang::CodeGenAction::ExecuteAction() /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:1072:5
#26 0x0000000005caa248 clang::FrontendAction::Execute() /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/Frontend/FrontendAction.cpp:960:7
#27 0x0000000005bdfb12 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/Frontend/CompilerInstance.cpp:974:23
#28 0x0000000005e74716 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:278:8
#29 0x00000000014ea90d cc1_main(llvm::ArrayRef<char const*>, char const*, void*) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/tools/driver/cc1_main.cpp:246:13
#30 0x00000000014dd3d8 ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/tools/driver/driver.cpp:317:5
#31 0x00000000014dc3f1 main /usr/local/google/home/saugustine/llvm-new/llvm-project/clang/tools/driver/driver.cpp:388:5
#32 0x00007f30c0c67d0a __libc_start_main ./csu/../csu/libc-start.c:308:16
#33 0x00000000014dbbea _start (/usr/local/google/home/saugustine/llvm-new/build/bin/clang+0x14dbbea)
craig.topper added inline comments.
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
132

Do you need to check if the SUBREG_TO_REG has multiple uses?

Since we have a reproducer, I have reverted this change with rGc07f7099690e.

Since we have a reproducer, I have reverted this change with rGc07f7099690e.

Sorry for the error.

I have fixed it, and I can see that the reproducer from @haowei passes. After checking tests and benchmarks again, let me try to push this patch again.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
132

You are right! I need to check it as well as mov. Thanks @craig.topper!

Additionally, if possible, @saugustine, can you file a bug with your test case to Bugzilla please? It could be a different failure.

jaykang10 updated this revision to Diff 375760.Sep 28 2021, 6:50 PM

Fixed bugs

  • Check MOVimm before checking multiple uses
  • Check SUBREG_TO_REG has multiple uses

I don't think this needs to create all the COPYs if it calls constrainRegClass on the registers to ensure they do not contain the wrong set of registers.

I don't think this needs to create all the COPYs if it calls constrainRegClass on the registers to ensure they do not contain the wrong set of registers.

You are right! We can constrain the register class to GPR[32|64]common. Thanks @dmgreen. Let me update it.

jaykang10 updated this revision to Diff 375824.EditedSep 29 2021, 3:41 AM

Following comment of @dmgreen, updated patch.

  • Removed COPY MIs by constraining register classes.

Thanks. I think this LGTM, provided the reported issues are now working OK.

Thanks. I think this LGTM, provided the reported issues are now working OK.

Yep, I am checking test-suites with different optimization levels. Once it is ok, let me try to push this patch again.

@haowei @saugustine If possible, can you check your examples are ok with latest update of this patch please?

@haowei @saugustine If possible, can you check your examples are ok with latest update of this patch please?

I no longer see this error in our build after applying your latest patch.

@haowei @saugustine If possible, can you check your examples are ok with latest update of this patch please?

I no longer see this error in our build after applying your latest patch.

Thanks for checking it. Let me push this patch again.

zatrazz added a subscriber: zatrazz.Oct 5 2021, 3:54 AM

This has caused a regression on the stage 2 buildbot [1], more specifically the test FAIL: Clang::p4-0x.cpp (but I think the other failures might be related).

[1] https://lab.llvm.org/buildbot/#/builders/185/builds/543

This has caused a regression on the stage 2 buildbot [1], more specifically the test FAIL: Clang::p4-0x.cpp (but I think the other failures might be related).

[1] https://lab.llvm.org/buildbot/#/builders/185/builds/543

um... I checked the bootstrap build before pushing this patch. Let me check it again.

This has caused a regression on the stage 2 buildbot [1], more specifically the test FAIL: Clang::p4-0x.cpp (but I think the other failures might be related).

[1] https://lab.llvm.org/buildbot/#/builders/185/builds/543

um... I checked the bootstrap build before pushing this patch. Let me check it again.

It looks like the test has the -fsyntax-only command line option, so I think the clang patches may have caused the error.

I've also reproduced it. Since the stage 1 check passes, my guess is that the clang built in stage 1 is emitting incorrect code.

I've also reproduced it. Since the stage 1 check passes, my guess is that the clang built in stage 1 is emitting incorrect code.

um... I checked it on an AArch64 machine as below.

make stage2 -j32
./bin/llvm-lit ../clang/test/CXX/over/over.match/over.match.funcs/p4-0x.cpp
llvm-lit: /home/jinkan01/Projects/llvm-project/llvm/utils/lit/lit/llvm/config.py:436: note: using clang: /home/jinkan01/Projects/llvm-project/build/bin/clang
-- Testing: 1 tests, 1 workers --
PASS: Clang :: CXX/over/over.match/over.match.funcs/p4-0x.cpp (1 of 1)

Testing Time: 0.22s
  Passed: 1

If possible, can you let me know how to reproduce it please? @DavidSpickett @zatrazz

DavidSpickett added a comment.EditedOct 5 2021, 8:22 AM

If you have docker:

$ docker pull linaro/ci-arm64-tcwg-llvmbot-ubuntu:bionic
$ docker run -it --entrypoint=/bin/bash linaro/ci-arm64-tcwg-llvmbot-ubuntu:bionic
root@a9e83ff71f75:/# export CC=/usr/local/clang+llvm-12.0.0-aarch64-linux-gnu/bin/clang
root@a9e83ff71f75:/# export CXX=/usr/local/clang+llvm-12.0.0-aarch64-linux-gnu/bin/clang++
<follow commands from the buildbot page>

I suggest following https://lab.llvm.org/buildbot/#/builders/179/builds/1234 because it's the simplest. (not by much, it just doesn't use lld)

If you don't, then the thing to note is that clang-12 is used for stage 1.

I'll check with the system gcc as well.

Edit: yes, it reproduces with gcc building stage 1 as well.

I believe the re-land also caused a few tests to break for us in our 2-stage build:

Failed Tests (4):
  Clang :: CXX/dcl.decl/dcl.init/dcl.init.ref/p5-0x.cpp
  Clang :: CodeGenCXX/catch-undef-behavior.cpp
  Clang :: Profile/def-assignop.cpp
  Clang :: SemaCXX/type-traits.cpp

https://luci-milo.appspot.com/ui/p/fuchsia/builders/prod/clang-linux-arm64/b8834634043817970513/overview

Sorry, let me try to reproduce the errors.

jaykang10 updated this revision to Diff 377994.Oct 7 2021, 1:42 PM

Fixed a bug

  • For the 32-bit form of the instruction, the upper 32 bits of the destination register are set to zero. If there is a SUBREG_TO_REG, set the upper 32 bits of UImm to zero.

I am sorry for the delay. It took time to find the problem and check the stage 2 build again.

I was able to reproduce the regressions with the stage 2 build. I have fixed them, and I can see the regressions are gone.

@DavidSpickett @leonardchan @zatrazz If possible, can you check the regressions with latest update of this patch please?

If there are still regressions with the update, please let me know.

I am sorry for the delay. It took time to find the problem and check the stage 2 build again.

I was able to reproduce the regressions with the stage 2 build. I have fixed them, and I can see the regressions are gone.

@DavidSpickett @leonardchan @zatrazz If possible, can you check the regressions with latest update of this patch please?

If there are still regressions with the update, please let me know.

Following https://lab.llvm.org/buildbot/#/builders/179/builds/1234, I can see that ninja check-all of the stage 2 build with the latest update of this patch passes on an AArch64 machine.

If there are still regressions with the update, please let me know.

Looks good on our side. I think you can reland now and see if anyone else still has failures with that.

If there are still regressions with the update, please let me know.

Looks good on our side. I think you can reland now and see if anyone else still has failures with that.

Thanks @DavidSpickett for checking it! Let me re-commit this patch.

This patch causes a compiler crash; the reduced reproducer is here: https://godbolt.org/z/71sThGE19

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

With the Chromium codebase, we found that getUniqueVRegDef returns null, as it legitimately can.
I'd add a check for that.

137–138

Ditto.

craig.topper added inline comments.Oct 15 2021, 1:46 PM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

Since this is running on SSA form, that should be getVRegDef. Is it returning null because it is a physical register, or for some other reason?

jaykang10 added inline comments.Oct 15 2021, 2:39 PM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
129

Sorry for the bug again...

The MI is only deleted after visiting all instructions, so after the transformation there are multiple definitions of the same register, which breaks SSA form... I found the problem while implementing https://reviews.llvm.org/D110841, but I missed updating this patch with the fix. I am sorry for that.

I need to check the bootstrap build. Once it is ok, I will let you know.

jaykang10 updated this revision to Diff 380105.Oct 15 2021, 3:06 PM

Fixed a bug.

  • Create new virtual register for the definition of new AND instruction and replace old register by the new one to keep SSA form.

If you feel it is not a good solution, please let me know. I am checking the bootstrap build on an AArch64 machine with this change.

jaykang10 updated this revision to Diff 380106.Oct 15 2021, 3:10 PM

The check-all with the bootstrap build is ok on an AArch64 machine. The reduced case also passes.

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

Thanks for checking it! Let me push this patch.

benshi001 added a comment.EditedOct 21 2021, 9:18 PM

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

Thanks for checking it! Let me push this patch.

Are there any more blame reports from build machines? Can I do my rebase work in https://reviews.llvm.org/D111034?

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

Thanks for checking it! Let me push this patch.

Are there any more blame reports from build machines? Can I do my rebase work in https://reviews.llvm.org/D111034?

I've not heard of any issues; it sounds good to go ahead. Were the problems the same as the ones fixed here?

benshi001 added a comment.EditedNov 1 2021, 12:35 AM

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

Thanks for checking it! Let me push this patch.

Are there any more blame reports from build machines? Can I do my rebase work in https://reviews.llvm.org/D111034?

I've not heard of any issues; it sounds good to go ahead. Were the problems the same as the ones fixed here?

Not sure if the problems are the same as the ones fixed here; they seem to be different, and I cannot reproduce them locally. Maybe I should directly do "git push" and wait for any blame reports?