This is an archive of the discontinued LLVM Phabricator instance.

Differential D139645

[WebAssembly] Fold adds with global addresses into load offset
AbandonedPublic

Authored by luke on Dec 8 2022, 9:31 AM.


Details

Reviewers

asb
tlively
aheejin
samparker
dschuff

Summary

This allows loads at `global address + x` to be selected better by putting the global address operand into the load's offset field.
Split out from D139530.
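As a rough illustration of the idea in the summary (not the patch itself), the operand split can be modeled on a toy add node. The names `Operand`, `Split` and `splitAddForLoad` are hypothetical, and the sketch deliberately ignores the no-unsigned-wrap concern that the review raises later.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for one operand of an add node.
struct Operand {
  bool IsGlobalAddress; // true for a symbol such as @str
  std::string Name;
};

// Result of splitting an add into a load's offset field and address operand.
struct Split {
  std::string Offset;
  std::string Addr;
};

// If either operand is a global address, place it in the offset field and use
// the other operand as the address. Nothing is folded for position-independent
// code, mirroring the guard in the patch.
std::optional<Split> splitAddForLoad(const Operand &LHS, const Operand &RHS,
                                     bool PositionIndependent) {
  assert(!LHS.Name.empty() && !RHS.Name.empty() && "operands must be named");
  if (PositionIndependent)
    return std::nullopt;
  if (LHS.IsGlobalAddress)
    return Split{LHS.Name, RHS.Name};
  if (RHS.IsGlobalAddress)
    return Split{RHS.Name, LHS.Name};
  return std::nullopt;
}
```

In this model, an add of `str` and a register value splits into offset `str` and the register as the address, regardless of operand order.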

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

luke created this revision. Dec 8 2022, 9:31 AM

Herald added a project: Restricted Project. Dec 8 2022, 9:31 AM

Herald added subscribers: pmatos, StephenFan, ecnelises and 5 others.

luke published this revision for review. Dec 8 2022, 9:32 AM

luke added a parent revision: D139631: [WebAssembly][NFC] Add ComplexPattern for loads.

Herald added a project: Restricted Project. Dec 8 2022, 9:32 AM

Herald added a subscriber: llvm-commits.

luke mentioned this in D139530: [WebAssembly] Add ComplexPattern for loads. Dec 8 2022, 9:33 AM

Harbormaster completed remote builds in B202008: Diff 481328. Dec 8 2022, 5:13 PM

dschuff added inline comments. Dec 8 2022, 5:29 PM

llvm/test/CodeGen/WebAssembly/userstack.ll
333	So the result of this is that the frame offset will end up in a const or add expression (consumed by the i32.load8_u), and the global address will be folded? Is this better than what we did before (i.e. the address in a const and the frame offset folded)?

luke added inline comments. Dec 12 2022, 1:24 AM

llvm/test/CodeGen/WebAssembly/userstack.ll

333

In this case the frame offset wasn't being folded, although I'm not sure if that is intentional or not. This is the codegen before the patch:

frame_offset_with_global_address:
	.functype	frame_offset_with_global_address () -> (i32)
	i32.const	$push0=, str
	global.get	$push5=, __stack_pointer
	i32.const	$push6=, 16
	i32.sub 	$push9=, $pop5, $pop6
	i32.const	$push7=, 12
	i32.add 	$push8=, $pop9, $pop7
	i32.add 	$push1=, $pop0, $pop8
	i32.load8_u	$push2=, 0($pop1)
	i32.const	$push3=, 67
	i32.and 	$push4=, $pop2, $pop3
	return	$pop4

With the patch the global address gets folded in which saves a const and an add instruction:

frame_offset_with_global_address:
	.functype	frame_offset_with_global_address () -> (i32)
	global.get	$push3=, __stack_pointer
	i32.const	$push4=, 16
	i32.sub 	$push7=, $pop3, $pop4
	i32.const	$push5=, 12
	i32.add 	$push6=, $pop7, $pop5
	i32.load8_u	$push1=, str($pop6)
	i32.const	$push0=, 67
	i32.and 	$push2=, $pop1, $pop0
	return	$pop2

luke added inline comments. Dec 12 2022, 1:49 AM

llvm/lib/Target/WebAssembly/WebAssemblyRegisterInfo.cpp
78–92	Now that (add tga x) is selected into something like i32.load offset=tga, x, the assertion above was being triggered because it assumed that any offset operand would always be an immediate, not a target global address. So this just wraps the fold in a check that the offset operand is an immediate.
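The guarded fold described in this comment can be sketched with a toy operand type. This is a simplified model under stated assumptions, not the LLVM `MachineOperand` API: `OffsetOperand` and `tryFoldFrameOffset` are hypothetical names, and a global address is represented by its symbol name only.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <variant>

// Toy stand-in for a load/store offset operand: either an immediate or a
// symbolic global address that is resolved later.
using OffsetOperand = std::variant<int64_t, std::string>;

// Folds FrameOffset into Op when Op is an immediate and the sum still fits in
// an unsigned 32-bit offset field. Returns true if the fold happened.
bool tryFoldFrameOffset(OffsetOperand &Op, int64_t FrameOffset) {
  auto *Imm = std::get_if<int64_t>(&Op);
  if (!Imm)
    return false; // global address: leave it untouched
  assert(FrameOffset >= 0 && *Imm >= 0);
  int64_t Offset = *Imm + FrameOffset;
  if (static_cast<uint64_t>(Offset) > std::numeric_limits<uint32_t>::max())
    return false; // would not fit in the offset field
  *Imm = Offset;
  return true;
}
```

The point of the model is only that the immediate-specific assertion and arithmetic are reached when the operand really is an immediate; a symbolic operand takes the early return.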

dschuff added inline comments. Dec 13 2022, 1:34 PM

llvm/test/CodeGen/WebAssembly/userstack.ll
333	Ah ok yeah this test was added in https://reviews.llvm.org/D90577 which fixed a bug in the case below the one you are modifying in this CL. So I guess what's happening now is that this IR is being ISel'ed differently and is ending up in the first case in eliminateFrameIndex instead of the second.

dschuff added inline comments. Dec 13 2022, 1:51 PM

llvm/test/CodeGen/WebAssembly/userstack.ll
333	Maybe we could add to the comment something like "(this allows the global address to be relocated)". Since this test doesn't cover that second case anymore, I wonder if we have any tests that do. I would assume that clause is still needed, maybe at least for when we aren't using DAG ISel...

aheejin added inline comments. Dec 14 2022, 12:04 AM

llvm/test/CodeGen/WebAssembly/userstack.ll
333	I commented out the `if` added in D90577 and ran this test, and it still passes. When the test was added that wouldn't have been the case, so I'm not sure what changed since then to make it pass. So what I'm saying is, this test is not covering the `if` in D90577 even now. 🤷🏻 Not sure what we should do. Delete the test and come up with a new test that covers the case? But that shouldn't be this CL's responsibility.

@dschuff And I have another question. I am not very familiar with this prolog-epilog code. What is the difference between what the upper snippet does and what the lower snippet does? Both seem to be about folding offsets. Why do we need to do custom folding here, and why isn't this taken care of by the normal isel patterns? Is this for the case of fast isel?

aheejin accepted this revision. Dec 16 2022, 5:13 PM

aheejin added inline comments.

llvm/test/CodeGen/WebAssembly/userstack.ll
333	By the way, I don't think we need to hold up this CL because of this issue?

This revision is now accepted and ready to land. Dec 16 2022, 5:13 PM

Link to GitHub issue

Update test cases

luke added inline comments. Dec 20 2022, 1:54 PM

llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22	@dschuff This got shuffled about, and now the global address gets folded into the store. But is the offset here still unfolded? Or is "offset" here also referring to the global address?

Harbormaster completed remote builds in B204233: Diff 484371. Dec 20 2022, 2:50 PM

dschuff accepted this revision. Dec 22 2022, 10:44 AM

dschuff added inline comments.

llvm/test/CodeGen/WebAssembly/userstack.ll
333	Oh, that's interesting. It looks like this test now actually does cover the behavior in this CL, so maybe it's good to keep.

IIRC the main difference is that the first one is targeting a load that has the FI operand and the second is targeting an add with the FI operand (so different patterns that come out of ISel). My guess is that at the time the test was written it covered this code, but the patterns coming from ISel and the front half of the MI passes are different now, so it doesn't anymore. It would be interesting to see if there are any redundancies (or other optimization opportunities) for FI elimination, but I agree that none of that needs to block this CL.

The reason there's custom logic here is that code comes out of ISel and goes through many of the MI passes (until after regalloc IIRC) with the FrameIndex operands as part of the IR, and this pass doesn't run until fairly late. So most of the passes that could optimize it have run already.

Sorry, I missed your comment 2 days ago. See my comment below.

llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22	OK, so looking back at https://github.com/llvm/llvm-project/issues/29497 and D24053 this actually does look like the problem that we fixed back then. IIRC the problem is that sometimes the address operand of the store (i.e. local 0, or L1 here) can have a negative value (here it's -128), but the store's effective address calculation (i.e. the operand plus the constant offset, L1 + args + 128 with this CL applied) is unsigned, so it will overflow. So we have to ensure that the calculation that recombines the native local value with the compile-time constant (here 128) happens with an add instruction rather than getting folded. Does that make sense?
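The overflow described in this comment can be sketched numerically. This is a toy model rather than LLVM or engine code: `effectiveAddress` and `wrappingAdd` are hypothetical helpers, and the address 1024 used for `args` in the note after the block is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Model of a wasm32 memory access: the effective address is base + offset
// computed without 32-bit wrapping, and the access traps (modeled here as
// std::nullopt) if it does not fit inside the memory.
std::optional<uint64_t> effectiveAddress(uint32_t Base, uint32_t Offset,
                                         uint64_t MemorySize,
                                         uint32_t AccessSize) {
  assert(AccessSize > 0 && "access must cover at least one byte");
  uint64_t EA = uint64_t{Base} + uint64_t{Offset}; // no wrap at 2^32
  if (EA + AccessSize > MemorySize)
    return std::nullopt; // out of bounds: trap
  return EA;
}

// What an explicit i32.add computes: the sum wraps modulo 2^32.
uint32_t wrappingAdd(uint32_t A, uint32_t B) { return A + B; }
```

With `args` assumed to sit at address 1024 in a 64 KiB memory, adding the base -128 to `args+128` with `wrappingAdd` yields 1024 and a 4-byte store at offset 0 is in bounds. Passing the same base with `args+128` as the folded offset makes the unwrapped sum exceed the memory size, so the model reports a trap.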

This revision now requires changes to proceed. Dec 22 2022, 11:08 AM

luke added inline comments. Dec 22 2022, 1:02 PM

llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22	Thanks, that makes sense. And from what I understand, that would then mean we can't optimise for the case in this GitHub issue, at least not in the case where the argument being passed is signed.

At the risk of going down the rabbit hole here, do you think there's a way to generate a gep that results in an `ISD::ADD` with `nuw`? I.e. is there a way to say that `%n` is unsigned here so that we are still allowed to fold `@global_i32` in?

define i32 @load_i32_global_address_with_folded_offset(i32 %n) {
  %s = getelementptr inbounds i32, i32* @global_i32, i32 %n
  %t = load i32, i32* %s
  ret i32 %t
}

https://reviews.llvm.org/D15544 suggests that it used to be enough to just specify `inbounds`:

; CHECK-LABEL: load_i32_with_folded_gep_offset:
; CHECK: i32.load $push0=, 24($0){{$}}
define i32 @load_i32_with_folded_gep_offset(i32* %p) {
  %s = getelementptr inbounds i32, i32* %p, i32 6
  %t = load i32, i32* %s
  ret i32 %t
}

But now it looks like the pointer needs to be manually calculated:

; CHECK-LABEL: load_i32_with_folded_offset:
; CHECK: i32.load $push0=, 24($0){{$}}
define i32 @load_i32_with_folded_offset(ptr %p) {
  %q = ptrtoint ptr %p to i32
  %r = add nuw i32 %q, 24
  %s = inttoptr i32 %r to ptr
  %t = load i32, ptr %s
  ret i32 %t
}

luke added inline comments. Dec 22 2022, 3:21 PM

llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22	So I guess my question is, given the following C:

extern int *data;
int f(unsigned idx) {
  return data[idx];
}

Should it not be semantically possible to somehow generate `i32.load data($0)` from it, given that:
a) we know the argument is unsigned
b) WebAssembly performs unsigned addition of the offset and address

luke added inline comments. Dec 22 2022, 4:04 PM

llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22	Just also writing this down here so I remember later: Because the spec defines the memarg offset as an unsigned integer, the addition of the offset to the base address won't wrap. And it looks like it may be impossible to do this optimisation because the signed-ness of an integer is lost in LLVM IR.

luke added inline comments. Dec 22 2022, 4:34 PM

llvm/test/CodeGen/WebAssembly/negative-base-reg.ll

20–22

To illustrate my point: Given the following C:

extern char data[1024];
char h(unsigned idx) {
  return data[idx];
}

char i(signed idx) {
  return data[idx];
}

The emitted IR (clang -emit-llvm -S --target=wasm32 foo.c -O3) is exactly the same for the two functions:

; Function Attrs: mustprogress nofree norecurse nosync nounwind readonly willreturn
define hidden signext i8 @h(i32 noundef %0) local_unnamed_addr #0 {
  %2 = getelementptr inbounds [1024 x i8], ptr @data, i32 0, i32 %0
  %3 = load i8, ptr %2, align 1, !tbaa !2
  ret i8 %3
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind readonly willreturn
define hidden signext i8 @i(i32 noundef %0) local_unnamed_addr #0 {
  %2 = getelementptr inbounds [1024 x i8], ptr @data, i32 0, i32 %0
  %3 = load i8, ptr %2, align 1, !tbaa !2
  ret i8 %3
}

If we really wanted to cater for this, could we maybe get clang to generate the ptrtoint -> add nuw -> inttoptr sequence instead of a gep when it knows that the type is unsigned?

Yeah, we had some of this discussion with @sunfish back in 2016 as well. I just ended up disabling some of our tests and accepting reduced optimizations in D24053 and we never came up with a way to get them back at the time.
I haven't had a chance to think about this more since my earlier comment but thanks for writing this down, I'll come back to this in a couple of weeks (feel free to ping me if I don't).

Check for non unsigned wrap before folding global addresses in

Harbormaster completed remote builds in B204783: Diff 485122. Dec 23 2022, 9:18 AM

luke abandoned this revision. Apr 3 2023, 4:36 AM

Thanks for working on this anyway, sorry it didn't quite work out. I had some hope that we could make some progress here. Perhaps in the future :(

In D139645#4241005, @dschuff wrote:

Thanks for working on this anyway, sorry it didn't quite work out. I had some hope that we could make some progress here. Perhaps in the future :(

Not at all, I haven't had any time to work on this either! Just marked it as closed to clean up the review queues. I'd be happy to reopen this/convert it to a GitHub issue. I think the discussion here is a useful starting point

Yeah actually filing a github issue is a great idea.

https://github.com/llvm/llvm-project/issues/61930

Revision Contents

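Path	Size
llvm/lib/Target/WebAssembly/WebAssemblyISelDAGToDAG.cpp	17 lines
llvm/lib/Target/WebAssembly/WebAssemblyRegisterInfo.cpp	21 lines
llvm/test/CodeGen/WebAssembly/eh-lsda.ll	4 lines
llvm/test/CodeGen/WebAssembly/exception.ll	2 lines
llvm/test/CodeGen/WebAssembly/negative-base-reg.ll	6 lines
llvm/test/CodeGen/WebAssembly/offset.ll	26 lines
llvm/test/CodeGen/WebAssembly/userstack.ll	5 lines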

Diff 484371

llvm/lib/Target/WebAssembly/WebAssemblyISelDAGToDAG.cpp

@@ ... @@ bool WebAssemblyDAGToDAGISel::SelectInlineAsmMemoryOperand(
   return true;
 }

 bool WebAssemblyDAGToDAGISel::SelectAddrAddOperands(MVT OffsetType, SDValue N,
                                                     SDValue &Offset,
                                                     SDValue &Addr) {
   assert(N.getNumOperands() == 2 && "Attempting to fold in a non-binary op");

+  // Fold target global addresses in an add into the offset.
+  if (!TM.isPositionIndependent()) {
+    for (size_t i = 0; i < 2; ++i) {
+      SDValue Op = N.getOperand(i);
+      SDValue OtherOp = N.getOperand(i == 0 ? 1 : 0);
+
+      if (Op.getOpcode() == WebAssemblyISD::Wrapper)
+        Op = Op.getOperand(0);
+
+      if (Op.getOpcode() == ISD::TargetGlobalAddress) {
+        Offset = Op;
+        Addr = OtherOp;
+        return true;
+      }
+    }
+  }
+
   // WebAssembly constant offsets are performed as unsigned with infinite
   // precision, so we need to check for NoUnsignedWrap so that we don't fold an
   // offset for an add that needs wrapping.
   if (N.getOpcode() == ISD::ADD && !N.getNode()->getFlags().hasNoUnsignedWrap())
     return false;

   // Folds constants in an add into the offset.
   for (size_t i = 0; i < 2; ++i) {

llvm/lib/Target/WebAssembly/WebAssemblyRegisterInfo.cpp

@@ ... @@ bool WebAssemblyRegisterInfo::eliminateFrameIndex(

   // If this is the address operand of a load or store, make it relative to SP
   // and fold the frame offset directly in.
   unsigned AddrOperandNum = WebAssembly::getNamedOperandIdx(
       MI.getOpcode(), WebAssembly::OpName::addr);
   if (AddrOperandNum == FIOperandNum) {
     unsigned OffsetOperandNum = WebAssembly::getNamedOperandIdx(
         MI.getOpcode(), WebAssembly::OpName::off);
-    assert(FrameOffset >= 0 && MI.getOperand(OffsetOperandNum).getImm() >= 0);
-    int64_t Offset = MI.getOperand(OffsetOperandNum).getImm() + FrameOffset;
-
-    if (static_cast<uint64_t>(Offset) <= std::numeric_limits<uint32_t>::max()) {
-      MI.getOperand(OffsetOperandNum).setImm(Offset);
-      MI.getOperand(FIOperandNum)
-          .ChangeToRegister(FrameRegister, /*isDef=*/false);
-      return false;
-    }
+    auto &OffsetOp = MI.getOperand(OffsetOperandNum);
+    // Don't fold offset in if offset is a global address to be resolved later
+    if (OffsetOp.isImm()) {
+      assert(FrameOffset >= 0 && OffsetOp.getImm() >= 0);
+      int64_t Offset = OffsetOp.getImm() + FrameOffset;
+
+      if (static_cast<uint64_t>(Offset) <=
+          std::numeric_limits<uint32_t>::max()) {
+        OffsetOp.setImm(Offset);
+        MI.getOperand(FIOperandNum)
+            .ChangeToRegister(FrameRegister, /*isDef=*/false);
+        return false;
+      }
+    }
   }

   // If this is an address being added to a constant, fold the frame offset
   // into the constant.
   if (MI.getOpcode() == WebAssemblyFrameLowering::getOpcAdd(MF)) {
     MachineOperand &OtherMO = MI.getOperand(3 - FIOperandNum);
     if (OtherMO.isReg()) {
       Register OtherMOReg = OtherMO.getReg();
       if (Register::isVirtualRegister(OtherMOReg)) {

llvm/test/CodeGen/WebAssembly/eh-lsda.ll

@@ ... @@
 ;
 ; There are three landing pads. The second landing pad should share action table
 ; entries with the first landing pad because they end with the same sequence
 ; (double -> ...). But the third landing table cannot share action table entries
 ; with others, so it should create its own entries.

 ; CHECK-LABEL: test1:
 ; In static linking, we load GCC_except_table as a constant directly.
-; NOPIC: i[[PTR]].const $push[[CONTEXT:.*]]=, __wasm_lpad_context
+; NOPIC: i[[PTR]].const $push[[X:.*]]=, {{[48]}}
 ; NOPIC-NEXT: i[[PTR]].const $push[[EXCEPT_TABLE:.*]]=, GCC_except_table1
-; NOPIC-NEXT: i[[PTR]].store {{[48]}}($pop[[CONTEXT]]), $pop[[EXCEPT_TABLE]]
+; NOPIC-NEXT: i[[PTR]].store __wasm_lpad_context($pop[[X]]), $pop[[EXCEPT_TABLE]]

 ; In case of PIC, we make GCC_except_table symbols a relative on based on
 ; __memory_base.
 ; PIC: global.get $push[[CONTEXT:.*]]=, __wasm_lpad_context@GOT
 ; PIC-NEXT: local.tee $push{{.*}}=, $[[CONTEXT_LOCAL:.*]]=, $pop[[CONTEXT]]
 ; PIC: global.get $push[[MEMORY_BASE:.*]]=, __memory_base
 ; PIC-NEXT: i[[PTR]].const $push[[EXCEPT_TABLE_REL:.*]]=, GCC_except_table1@MBREL
 ; PIC-NEXT: i[[PTR]].add $push[[EXCEPT_TABLE:.*]]=, $pop[[MEMORY_BASE]], $pop[[EXCEPT_TABLE_REL]]

llvm/test/CodeGen/WebAssembly/exception.ll

@@ ... @@
 ; }

 ; CHECK-LABEL: test_catch:
 ; CHECK: global.get ${{.+}}=, __stack_pointer
 ; CHECK: try
 ; CHECK: call foo
 ; CHECK: catch $[[EXN:[0-9]+]]=, __cpp_exception
 ; CHECK: global.set __stack_pointer
-; CHECK: i32.{{store|const}} {{.*}} __wasm_lpad_context
+; CHECK: i32.{{store|const}} __wasm_lpad_context({{.*}})
 ; CHECK: call $drop=, _Unwind_CallPersonality, $[[EXN]]
 ; CHECK: block
 ; CHECK: br_if 0
 ; CHECK: call $drop=, __cxa_begin_catch
 ; CHECK: call __cxa_end_catch
 ; CHECK: br 1
 ; CHECK: end_block
 ; CHECK: rethrow 0

llvm/test/CodeGen/WebAssembly/negative-base-reg.ll

@@ ... @@
 ; CHECK: i32.const $push[[L0:[0-9]+]]=, -128
 ; CHECK-NEXT: local.set 0, $pop[[L0]]
 entry:
   br label %for.body

 for.body: ; preds = %for.body, %entry
   %i.04 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
 ; The offset should not be folded into the store.
-; CHECK: i32.const $push{{[0-9]+}}=, args+128
-; CHECK: i32.add $push[[L1:[0-9]+]]=,
-; CHECK: i32.store 0($pop[[L1]])
+; CHECK: local.get $push[[L1:[0-9]+]]=, 0
+; CHECK-NEXT: i32.const $push[[V:[0-9]+]]=, 1
+; CHECK-NEXT: i32.store args+128($pop[[L1]]), $pop[[V]]
   %arrayidx = getelementptr inbounds [32 x i32], ptr @args, i32 0, i32 %i.04
   store i32 1, ptr %arrayidx, align 4, !tbaa !1
   %inc = add nuw nsw i32 %i.04, 1
   %exitcond = icmp eq i32 %inc, 32
   br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !5

 for.end: ; preds = %for.body
   ret i32 0

llvm/test/CodeGen/WebAssembly/offset.ll

@@ ... @@
 ; CHECK: i32.store16 12($0), $pop[[L1]]{{$}}
 ; CHECK: i32.const $push[[L2:[0-9]+]]=, 0{{$}}
 ; CHECK: i32.store 8($0), $pop[[L2]]{{$}}
 ; CHECK: i64.const $push[[L3:[0-9]+]]=, 0{{$}}
 ; CHECK: i64.store 0($0), $pop[[L3]]{{$}}
 define {i64,i32,i16,i8} @aggregate_return_without_merge() {
   ret {i64,i32,i16,i8} zeroinitializer
 }
+
+;===----------------------------------------------------------------------------
+; Global address loads
+;===----------------------------------------------------------------------------
+
+@global_i32 = external global i32
+@global_i8 = external global i8
+
+; CHECK-LABEL: load_i32_global_address_with_folded_offset:
+; CHECK: i32.const $push0=, 2
+; CHECK: i32.shl $push1=, $0, $pop0
+; CHECK: i32.load $push2=, global_i32($pop1)
+define i32 @load_i32_global_address_with_folded_offset(i32 %n) {
+  %s = getelementptr inbounds i32, i32* @global_i32, i32 %n
+  %t = load i32, i32* %s
+  ret i32 %t
+}
+
+; CHECK-LABEL: load_i8_i32s_global_address_with_folded_offset:
+; CHECK: i32.load8_s $push0=, global_i8($0)
+define i32 @load_i8_i32s_global_address_with_folded_offset(i32 %n) {
+  %s = getelementptr inbounds i8, i8* @global_i8, i32 %n
+  %t = load i8, i8* %s
+  %u = sext i8 %t to i32
+  ret i32 %u
+}

llvm/test/CodeGen/WebAssembly/userstack.ll

@@ ... @@
 define void @inline_asm() {
   %tmp = alloca i8
   call void asm sideeffect "# %0", "r"(ptr %tmp)
   ret void
 }

 ; We optimize the format of "frame offset + operand" by folding it, but this is
 ; only possible when that operand is an immediate. In this example it is a
-; global address, so we should not fold it.
+; global address, so we should fold the global address into the offset, but not
+; the frame offset.
 ; CHECK-LABEL: frame_offset_with_global_address
-; CHECK: i[[PTR]].const ${{.*}}=, str
+; CHECK: i32.load8_u ${{.*}}=, str({{.*}})
 @str = local_unnamed_addr global [3 x i8] c"abc", align 16
 define i8 @frame_offset_with_global_address() {
   %1 = alloca i8, align 4
   %2 = ptrtoint ptr %1 to i32
   ;; Here @str is a global address and not an immediate, so cannot be folded
   %3 = getelementptr [3 x i8], ptr @str, i32 0, i32 %2
   %4 = load i8, ptr %3, align 8
   %5 = and i8 %4, 67
   ret i8 %5
 }

 ; TODO: test over-aligned alloca
