This is an archive of the discontinued LLVM Phabricator instance.

[MachineCSE] Allow PRE of instructions that read physical registers
ClosedPublic

Authored by john.brawn on Oct 25 2022, 4:42 AM.

Details

Summary

Currently MachineCSE forbids PRE when the instruction reads a physical register. Relax this so that it's allowed when the value being read is the same as what would be read in the place the instruction would be hoisted to.

This is being done in preparation for adding FPCR handling to the AArch64 backend, in order to prevent it to from worsening the generated code, but for targets that already have a similar register it should improve things.

This patch affects code generation in several tests. The new code looks better except for in Thumb2/LowOverheadLoops/memcall.ll where we perform PRE but the LowOverheadLoops transformation then undoes it. Also in AMDGPU/selectcc-opt.ll the CHECK makes things look worse, but actually the function as a whole is better (as a MOV is PRE'd).

Diff Detail

Event Timeline

john.brawn created this revision.Oct 25 2022, 4:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 25 2022, 4:42 AM
john.brawn requested review of this revision.Oct 25 2022, 4:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 25 2022, 4:42 AM
t.p.northover accepted this revision.Oct 26 2022, 6:34 AM
t.p.northover added a subscriber: t.p.northover.

Thanks for working on this, I think this change is reasonable.

This revision is now accepted and ready to land.Oct 26 2022, 6:34 AM
This revision was landed with ongoing or failed builds.Oct 27 2022, 6:15 AM
This revision was automatically updated to reflect the committed changes.

Headsup, this caused miscompilations on arm for me.

To reproduce the issue, you can e.g. do this:

git clone git://git.videolan.org/ffmpeg.git
cd ffmpeg
./configure --arch=armv7 --cc="clang -target armv7-linux-gnueabihf .." --samples=$(pwd)/../samples
make fate-rsync
make -j$(nproc) fate-oma-demux

One preprocessed source file that differs in behaviour after this commit is this: https://martin.st/temp/id3v2-preproc.c Compiled with clang -target armv7-linux-gnueabihf id3v2-preproc.c -c -o libavformat/id3v2.o -O2.

Ok to revert until the issue has been resolved?

lkail added a subscriber: lkail.Oct 28 2022, 4:31 AM
john.brawn reopened this revision.Oct 28 2022, 6:41 AM

I've reverted. Looks like there's something going on with a flags-setting SUBS being hosted above a conditional branch.

This revision is now accepted and ready to land.Oct 28 2022, 6:41 AM

Adjusted to find physical register uses and defs by iterating through all operands, as the defs list doesn't include implicit defs. Added a test based on the code in ffmpeg that was getting incorrectly optimised.

Hi @john.brawn

This change is causing miscompilations in armv7. I was looking into a chromium bug [1] and eventually figured out that it was a regression caused by a clang update in the chromium tree.

I bisected further within the llvm changes and it bisects to this change. If i manually revert this change at current tip-of-tree, the code compiles correctly and works as expected.

Steps to reproduce:

  1. build libyuv with "-march=armv7-a -mfloat-abi=softfp -mfpu=neon -O2 -mthumb"
  2. Build the attached code snippet from [2] with "-march=armv7-a -mfloat-abi=softfp -mfpu=neon -O2 -mthumb -lyuv"
  3. The resulting binary is doing some of the computations incorrectly after this change. If i revert this change, it works as expected.

Please let me know if you need any more details. :)

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=1424089
[2] https://gist.github.com/vigneshvg/09c8414b94187349429641171475ce47

  1. build libyuv with "-march=armv7-a -mfloat-abi=softfp -mfpu=neon -O2 -mthumb"

link to libyuv: https://chromium.googlesource.com/libyuv/libyuv/

Please let me know if you need any more details. :)

I quickly tried cross-building that example for a bare-metal environment (because it's something I have the setup to do easily) and the output produced by the executable was identical in current trunk and with this patch reverted. So:

  • I'm guessing this failure is happening in some kind of android environment?
  • A description of what's going wrong that's more precise than "doing some of the computations incorrectly" would be helpful
  • A smaller reproducer that doesn't depend on the host environment (e.g. preprocessed with -E) would also be helpful

Also: I'm on holiday tomorrow and next week, so won't be able to look at this until after that.

Please let me know if you need any more details. :)

I quickly tried cross-building that example for a bare-metal environment (because it's something I have the setup to do easily) and the output produced by the executable was identical in current trunk and with this patch reverted. So:

  • I'm guessing this failure is happening in some kind of android environment?
  • A description of what's going wrong that's more precise than "doing some of the computations incorrectly" would be helpful
  • A smaller reproducer that doesn't depend on the host environment (e.g. preprocessed with -E) would also be helpful

Also: I'm on holiday tomorrow and next week, so won't be able to look at this until after that.

Thanks for the reply @john.brawn. Turns out it was a bug in libyuv's assembly code and how it was being optimized with -flto.

My bisect may just have been accidentally triggering the edge case in this commit. Sorry about that.