This is an archive of the discontinued LLVM Phabricator instance.

[RFC][WebAssembly] Optimize GEPs
Needs ReviewPublic

Authored by samparker on Dec 6 2022, 5:21 AM.

Details

Summary

Workaround the shortcomings of LoopStrengthReduce and ScalarEvolutionExpander to allow inbounds getelemenptr instructions, in unrolled loops, to reach the backend. This enables greater usage of memory operations with immediate offsets.

The pass simply pattern matches geps, in the form: (getelementptr ptr %base (or i32 %reg_offset, i32 constant)) and then creates a new base from %base and %reg_offset, using pointer arithmetic. This is then shared amongst the geps, all of each can then use an immediate index too.

Diff Detail

Event Timeline

samparker created this revision.Dec 6 2022, 5:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 6 2022, 5:21 AM
samparker requested review of this revision.Dec 6 2022, 5:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 6 2022, 5:21 AM
Herald added a subscriber: aheejin. · View Herald Transcript

Do you have specific examples of those existing passes deficiencies?
It would be best to fix them, and benefit all, instead of workaround them for a single target.

Do you have specific examples of those existing passes deficiencies?

I'd firstly say that I think this is a very webassembly shaped problem, I don't think any other architectures will have similar addressing modes.

As to the specific problems, it's mainly due to SCEVExpander often not being able to regenerate no wrap flags or inbound geps. Judging by the number of TODOs in that area, and I think I've done some in the past, it seems very much non-trivial. The function that I was looking at for this was SCEVExpander::expandAddToGEP. If there is a clear plan on how this can be implemented, I would be very happy to hear it.

LSR is just the driver of the problem... but a very dense driver at that (which I struggle with!), but the main problem is that it isn't very 'predictable' when it comes to unrolled loops, due it's complexity limit, which results in LSR behaving differently for different sized loops.

tlively added inline comments.Dec 7 2022, 10:18 AM
llvm/lib/Target/WebAssembly/WebAssemblyOptimizeGEPs.cpp
139

Would this be hard to change? Wasm64 is generally supported elsewhere in LLVM these days.

167–168

Why use a hash value as the key? Wouldn't it be better to use the pair itself as the key?

llvm/test/CodeGen/WebAssembly/optimize-geps.ll
7

Would it be possible to have a smaller test case? It's hard to tell what to look at here.

samparker added inline comments.Dec 8 2022, 2:06 AM
llvm/lib/Target/WebAssembly/WebAssemblyOptimizeGEPs.cpp
139

Okay, will do.

167–168

Good point.

llvm/test/CodeGen/WebAssembly/optimize-geps.ll
7

Yeah, if I remove one array and move to i32 this should hopefully half the size.

samparker updated this revision to Diff 481243.Dec 8 2022, 4:31 AM
  • Changed 'key' type.
  • Added wasm64 support.
  • Reduced some tests.
  • Added option to disable the pass.
tlively added inline comments.Dec 9 2022, 7:44 AM
llvm/lib/Target/WebAssembly/WebAssemblyOptimizeGEPs.cpp
12
16
110

Do we need BasePairs? It looks like we could remove it if we iterate through the Candidates keys instead.

llvm/test/CodeGen/WebAssembly/optimize-geps.ll
7

Thanks! Could you also add comments in the test input pointing out where interesting things happen?

lebedev.ri added inline comments.Dec 9 2022, 8:23 AM
llvm/lib/Target/WebAssembly/WebAssemblyOptimizeGEPs.cpp
19

How is this different from SeparateConstOffsetFromGEP pass?

samparker added inline comments.Dec 12 2022, 3:33 AM
llvm/lib/Target/WebAssembly/WebAssemblyOptimizeGEPs.cpp
19

I started by trying to use it and it didn't do anything. But, looking again, that appears to be a TTI issue. It still produces IR that we don't want, as the base address is generated by a GEP and not pointer arithmetic. I should be able to remedy this through another TTI hook though. I'll experiment.

aheejin added inline comments.Dec 14 2022, 1:00 AM
llvm/lib/Target/WebAssembly/WebAssemblyOptimizeGEPs.cpp
2

Can you add a motivating example of small code snippets that tries to solve that the status quo doesn't?

llvm/test/CodeGen/WebAssembly/optimize-geps.ll
7

Can we have some comments for most of the tests to help what to look at? Not sure which part of the code we need to look. Also, what is dim?

dschuff added inline comments.Dec 14 2022, 10:14 AM
llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp
457

Does this pass fix up debug info when necessary?
If feasible, it should; but if not, maybe we should restrict it to CodeGenOpt::Default and higher.

samparker added inline comments.Dec 16 2022, 5:31 AM
llvm/lib/Target/WebAssembly/WebAssemblyOptimizeGEPs.cpp
110

IIUC, we then won't have a stable iteration order - I originally tried doing this but I got intermittent test failures.

llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp
457

No, it doesn't. I'm going to update the patch to not add it into the backend pipeline yet.

samparker updated this revision to Diff 483505.Dec 16 2022, 5:45 AM

Hopefully addressed all the comments...

I've removed the pass from the backend for now, and my expectation is that SeparateConstOffsetFromGEP and EarlyCSE will need to run before this as they get the code into an easily analyzable form. I need to run some benchmarks though to see if they have any negative affects.

I just realized that I'm creating the new base address incorrectly, I will fix it after the weekend.

samparker updated this revision to Diff 483966.Dec 19 2022, 7:47 AM

Fixed base address calculation.

samparker updated this revision to Diff 484284.Dec 20 2022, 8:23 AM

Handling global base addresses.

Taking D141926 as a base, this transform brings the total code size of my local benchmarks down from 7982KB to 7970KB.