This is an archive of the discontinued LLVM Phabricator instance.

[RISCV][ISel] improved compressed instruction use
Needs Review · Public

Authored by dybv-sc on Aug 22 2022, 2:09 AM.

Details

Summary

Help emit a compressed BNEZ when comparing against a constant, where possible.

Addressing the following issue: https://github.com/llvm/llvm-project/issues/56393
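
As a rough illustration (a hypothetical example, not taken from the linked issue), the targeted pattern is a conditional branch guarded by a comparison with a small constant, where the constant would otherwise be materialized with li just to feed the branch:

void g(void);                 // hypothetical callee, only to keep the branch live

void f(int *p, int **q) {
    int x = *p;               // lw
    int *r = *q;              // ld (on RV64)
    *r = 3 * x;               // slliw + addw + sw
    if (x < 101)              // roughly: li + blt today; slti + bnez with the patch
        g();
}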

Diff Detail

Event Timeline

dybv-sc created this revision.Aug 22 2022, 2:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 22 2022, 2:09 AM
dybv-sc requested review of this revision.Aug 22 2022, 2:09 AM
dybv-sc updated this revision to Diff 454424.Aug 22 2022, 2:14 AM

Fixed missing commit

dybv-sc edited the summary of this revision. (Show Details)Aug 22 2022, 2:29 AM
dybv-sc added reviewers: reames, asb.

As noted in the bug, this increases the critical path length in some cases. Have you benchmarked this?

For immediates that fit in c.li, the new sequence might be larger. c.li works for all registers; c.bnez only works for x8-x15 and short distances.
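
For illustration (assuming the standard encodings, where compressed instructions are 2 bytes and base instructions 4): c.li + blt is 2 + 4 = 6 bytes, while slti + c.bnez is 4 + 2 = 6 bytes at best, and 4 + 4 = 8 bytes whenever the condition register falls outside x8-x15 or the branch target is out of c.bnez's range.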

Sorry for the long silence.
I benchmarked SPEC with llvm-test-suite on an Alibaba THead machine and found a slight performance regression with this substitution. I've isolated one case:

lw      s0, 0(a0)
li      a2, 101
ld      a0, 0(a1)
slliw   a1, s0, 1
addw    a1, a1, s0
sw      a1, 0(a0)
blt     s0, a2, .LBB0_2

transforms to:

lw      s0, 0(a0)
ld      a0, 0(a1)
slti    a2, s0, 101
slliw   a1, s0, 1
addw    a1, a1, s0
sw      a1, 0(a0)
bnez    a2, .LBB0_2

When placed in a hot loop, the latter adds 7 more cycles per iteration. I found that it does not affect the branch predictor or cache, so there must be a pipeline stall happening here. I'll investigate this further.

So, after running more SPEC tests in different modes (train and ref) on different RISC-V boards (SiFive and THead), I got mixed performance results. The performance increase on a number of tests was insignificant, while on others there was a slight decrease; on average, performance declined by 0.5%. On the other hand, the size reduction can be seen uniformly across all tests: on average it is 20 fewer bytes, or a 0.04% size reduction. I don't think these amounts can justify the performance cost.
Considering that some of the performance regressions are platform-specific (like the one I mentioned in the previous comment) and depend on internal microarchitectural features, it does not seem possible to come up with a general solution here, and more specialized ones will require more time and effort. A possible 0.04% code size reduction is just not worth it.
What do you think?

Thanks for collecting the data. I agree, it sounds like it's not worth it.