This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Fold adds with global addresses into load offset
AbandonedPublic

Authored by luke on Dec 8 2022, 9:31 AM.

Details

Summary

This allows loads at global address + x to be selected better, by putting the global address operand into the offset.
From splitting up D139530

Diff Detail

Event Timeline

luke created this revision.Dec 8 2022, 9:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 8 2022, 9:31 AM
luke published this revision for review.Dec 8 2022, 9:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 8 2022, 9:32 AM
dschuff added inline comments.Dec 8 2022, 5:29 PM
llvm/test/CodeGen/WebAssembly/userstack.ll
336 ↗(On Diff #481328)

so the result of this is that the frame offset will end up in a const or add expression (consumed by the i32.load8_u), and the global address will be folded?
Is this better than what we did before (i.e. the address in a const and the frame offset folded)?

luke added inline comments.Dec 12 2022, 1:24 AM
llvm/test/CodeGen/WebAssembly/userstack.ll
336 ↗(On Diff #481328)

In this case the frame offset wasn't being folded, although I'm not sure if that is intentional or not. This is the codegen before the patch:

frame_offset_with_global_address:
	.functype	frame_offset_with_global_address () -> (i32)
	i32.const	$push0=, str
	global.get	$push5=, __stack_pointer
	i32.const	$push6=, 16
	i32.sub 	$push9=, $pop5, $pop6
	i32.const	$push7=, 12
	i32.add 	$push8=, $pop9, $pop7
	i32.add 	$push1=, $pop0, $pop8
	i32.load8_u	$push2=, 0($pop1)
	i32.const	$push3=, 67
	i32.and 	$push4=, $pop2, $pop3
	return	$pop4

With the patch the global address gets folded in which saves a const and an add instruction:

frame_offset_with_global_address:
	.functype	frame_offset_with_global_address () -> (i32)
	global.get	$push3=, __stack_pointer
	i32.const	$push4=, 16
	i32.sub 	$push7=, $pop3, $pop4
	i32.const	$push5=, 12
	i32.add 	$push6=, $pop7, $pop5
	i32.load8_u	$push1=, str($pop6)
	i32.const	$push0=, 67
	i32.and 	$push2=, $pop1, $pop0
	return	$pop2
luke added inline comments.Dec 12 2022, 1:49 AM
llvm/lib/Target/WebAssembly/WebAssemblyRegisterInfo.cpp
78–92

Now that (add tga x) is now selected into something like i32.load offset=tga, x, the above assertion was being triggered because it assumed that any offset operand would always be an immediate, not a target global address. So this just wraps around it.

dschuff added inline comments.Dec 13 2022, 1:34 PM
llvm/test/CodeGen/WebAssembly/userstack.ll
336 ↗(On Diff #481328)

Ah ok yeah this test was added in https://reviews.llvm.org/D90577 which fixed a bug in the case below the one you are modifying in this CL. So I guess what's happening now is that this IR is being ISel'ed differently and is ending up in the first case in eliminateFrameIndex instead of the second.

dschuff added inline comments.Dec 13 2022, 1:51 PM
llvm/test/CodeGen/WebAssembly/userstack.ll
336 ↗(On Diff #481328)

Maybe we could add to the comment something like "(this allows the global address to be relocated)"

Since this test doesn't cover that second case anymore, I wonder if we have any tests that do. I would assume that clause is still needed, maybe at least for when we aren't using DAG ISel...

aheejin added inline comments.Dec 14 2022, 12:04 AM
llvm/test/CodeGen/WebAssembly/userstack.ll
336 ↗(On Diff #481328)

I commented the if added in D90577 out and ran this test, and this test still passes. When it was added that wouldn't have been the case, so I'm not sure what changed since then to make this pass. So what I'm saying is, this test is not covering the if in D90577 even now. 🤷🏻 Not sure what we should do. Delete the test and come up with a new test that covers the case? But that shouldn't be this CL's responsibility..

@dschuff And I have another question. I am not very familiar with this prolog-epilog code..

  1. What is the difference between what this upper snippet does and the lower snippet does? Both seem to be about folding offsets.
  1. Why do we need to do custom folding here, and why isn't this taken care of the normal isel patterns? Is this for the case of fast isel?
aheejin accepted this revision.Dec 16 2022, 5:13 PM
aheejin added inline comments.
llvm/test/CodeGen/WebAssembly/userstack.ll
336 ↗(On Diff #481328)

By the way I don't think we need to hold up this CL because of this issue..?

This revision is now accepted and ready to land.Dec 16 2022, 5:13 PM
luke updated this revision to Diff 484354.Dec 20 2022, 12:46 PM

Link to GitHub issue

luke updated this revision to Diff 484371.Dec 20 2022, 1:52 PM

Update test cases

luke added inline comments.Dec 20 2022, 1:54 PM
llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22 ↗(On Diff #484371)

@dschuff This got shuffled about, and now the global address gets folded into the store. But is the offset here still unfolded? Or by "offset" here is it also referring to the global address too

dschuff accepted this revision.Dec 22 2022, 10:44 AM
dschuff added inline comments.
llvm/test/CodeGen/WebAssembly/userstack.ll
336 ↗(On Diff #481328)

Oh, that's interesting.
It looks like this test now actually does cover the behavior in this CL, so maybe it's good to keep.

  1. IIRC the main difference is that the first one is targeting a load that has the FI operand and the second is targeting an add with the FI operand (so different patterns that come out of ISel. My guess is what happened here is that at the time the test was written it covered this code but the patterns coming from ISel and the front half of the MI passes are different now, so it doesn't anymore. It would be interesting to see if there are any redundancies (or other optimization opportunities) for FI elimination but I agree that none of that needs to block this CL.
  1. The reason there's custom logic here is that code comes out of ISel and goes through many of the MI passes (until after regalloc IIRC) with the FrameIndex operands as part of the IR, and this pass doesn't run until fairly late. So most of the passes that could optimize it have run already.
dschuff requested changes to this revision.Dec 22 2022, 11:08 AM

Sorry, I missed your comment 2 days ago. See my comment below.

llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22 ↗(On Diff #484371)

OK, so looking back at https://github.com/llvm/llvm-project/issues/29497 and D24053 this actually does look like the problem that we fixed back then.
IIRC the problem is that sometimes the address operand of the store (i.e. local 0, or L1 here) can have a negative value (here it's -128), but the store's effective address calculation (i.e. the operand plus the constant offset, L1 + args + 128 with this CL applied) is unsigned, so it will overflow.
So we have to ensure that the calculation that recombines the native local value with the compile-time constant (here 128) happens with an add instruction rather than getting folded. Does that make sense?

This revision now requires changes to proceed.Dec 22 2022, 11:08 AM
luke added inline comments.Dec 22 2022, 1:02 PM
llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22 ↗(On Diff #484371)

Thanks, that makes sense.
And from what I understand, that would then mean we can't optimise for the case in this GitHub issue, at least not in the case where the argument being passed is signed.

At the risk of going down the rabbit hole here, do you think there's a way to generate a gep that results in an ISD::ADD with nuw? I.e. is there a way to say that %n is unsigned here so that we are still allowed to fold @global_i32 in?

define i32 @load_i32_global_address_with_folded_offset(i32 %n) {
  %s = getelementptr inbounds i32, i32* @global_i32, i32 %n
  %t = load i32, i32* %s
  ret i32 %t
}

https://reviews.llvm.org/D15544 suggests that it used to be enough to just specify inbounds:

; CHECK-LABEL: load_i32_with_folded_gep_offset:
; CHECK: i32.load  $push0=, 24($0){{$}}
define i32 @load_i32_with_folded_gep_offset(i32* %p) {
  %s = getelementptr inbounds i32, i32* %p, i32 6
  %t = load i32, i32* %s
  ret i32 %t
}

But now it looks like the pointer needs to be manually calculated:

; CHECK-LABEL: load_i32_with_folded_offset:
; CHECK: i32.load  $push0=, 24($0){{$}}
define i32 @load_i32_with_folded_offset(ptr %p) {
  %q = ptrtoint ptr %p to i32
  %r = add nuw i32 %q, 24
  %s = inttoptr i32 %r to ptr
  %t = load i32, ptr %s
  ret i32 %t
}
luke added inline comments.Dec 22 2022, 3:21 PM
llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22 ↗(On Diff #484371)

So I guess my question is given the following C:

extern int *data;
int f(unsigned idx) {
  return data[idx];
}

Should it not be semantically possible to somehow generate i32.load data($0) from it, given that:

a) we know the argument is unsigned
b) WebAssembly performs unsigned addition of the offset and address

luke added inline comments.Dec 22 2022, 4:04 PM
llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22 ↗(On Diff #484371)

Just also writing this down here so I remember later: Because the spec defines the memarg offset as an unsigned integer, the addition of the offset to the base address won't wrap.

And it looks like it may be impossible to do this optimisation because the signed-ness of an integer is lost in LLVM IR.

luke added inline comments.Dec 22 2022, 4:34 PM
llvm/test/CodeGen/WebAssembly/negative-base-reg.ll
20–22 ↗(On Diff #484371)

To illustrate my point: Given the following C:

extern char data[1024];
char h(unsigned idx) {
  return data[idx];
}

char i(signed idx) {
  return data[idx];
}

The emitted IR (clang -emit-llvm -S --target=wasm32 foo.c -O3) is exactly the same for the two functions:

; Function Attrs: mustprogress nofree norecurse nosync nounwind readonly willreturn
define hidden signext i8 @h(i32 noundef %0) local_unnamed_addr #0 {
  %2 = getelementptr inbounds [1024 x i8], ptr @data, i32 0, i32 %0
  %3 = load i8, ptr %2, align 1, !tbaa !2
  ret i8 %3
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind readonly willreturn
define hidden signext i8 @i(i32 noundef %0) local_unnamed_addr #0 {
  %2 = getelementptr inbounds [1024 x i8], ptr @data, i32 0, i32 %0
  %3 = load i8, ptr %2, align 1, !tbaa !2
  ret i8 %3
}

If we really wanted to cater for this, could we maybe get clang to generate the ptrtoint -> add nuw -> inttoptr sequence instead of a gep when it knows that the type is unsigned?

Yeah, we had some of this discussion with @sunfish back in 2016 as well. I just ended up disabling some of our tests and accepting reduced optimizations in D24053 and we never came up with a way to get them back at the time.
I haven't had a chance to think about this more since my earlier comment but thanks for writing this down, I'll come back to this in a couple of weeks (feel free to ping me if I don't).

luke updated this revision to Diff 485122.Dec 23 2022, 8:23 AM

Check for non unsigned wrap before folding global addresses in

luke abandoned this revision.Apr 3 2023, 4:36 AM

Thanks for working on this anyway, sorry it didn't quite work out. I had some hope that we could make some progress here. Perhaps in the future :(

luke added a comment.Apr 3 2023, 10:33 AM

Thanks for working on this anyway, sorry it didn't quite work out. I had some hope that we could make some progress here. Perhaps in the future :(

Not at all, I haven't had any time to work on this either! Just marked it as closed to clean up the review queues. I'd be happy to reopen this/convert it to a GitHub issue. I think the discussion here is a useful starting point

Yeah actually filing a github issue is a great idea.