This is an archive of the discontinued LLVM Phabricator instance.

Differential D58632

[X86] Improve lowering of idemptotent RMW operations
ClosedPublic

Authored by reames on Feb 25 2019, 9:22 AM.

Details

Reviewers

craig.topper
jfb

Commits

rZORGa7e6d75cd491: [X86] Improve lowering of idemptotent RMW operations
rZORG115381d182c5: [X86] Improve lowering of idemptotent RMW operations
rGa7e6d75cd491: [X86] Improve lowering of idemptotent RMW operations
rG115381d182c5: [X86] Improve lowering of idemptotent RMW operations
rGbd588dfd5947: [X86] Improve lowering of idemptotent RMW operations
rL360393: [X86] Improve lowering of idemptotent RMW operations

Summary

The current lowering uses an mfence. An mfence has substantially higher latency than the locked operation originally requested, but we do want to avoid contention on the original cache line. As such, use a locked instruction on a cache line assumed to be thread local.
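For readers less familiar with the term, here is a minimal C++ sketch of what an idempotent RMW looks like at the source level. The function names are hypothetical and not part of the patch; a front end would typically express the first one as an `atomicrmw or` with a zero operand and an unused result, which is the canonical form this revision targets.

```cpp
#include <atomic>
#include <cassert>

// OR-ing with 0 never changes the stored value, so this RMW is idempotent;
// only its seq_cst ordering is observable. The result is deliberately unused.
inline void idempotentFence(std::atomic<int> &Flag) {
  (void)Flag.fetch_or(0, std::memory_order_seq_cst);
}

// Same operation, but returning the previous value to show that the stored
// value is left untouched.
inline int idempotentRead(std::atomic<int> &Flag) {
  return Flag.fetch_or(0, std::memory_order_seq_cst);
}
```

Since the memory location never changes, the backend is free to pick a different instruction and a different address, as long as the ordering guarantees are preserved.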

Diff Detail

Repository: rL LLVM

Event Timeline

reames created this revision. Feb 25 2019, 9:22 AM

Herald added subscribers: jdoerfert, bollu, mcrosier. Feb 25 2019, 9:22 AM

Overall this looks good, but I have a few comments.

lib/Target/X86/X86ISelLowering.cpp
25482 ↗	(On Diff #188200)	This comment needs an update.
25492 ↗	(On Diff #188200)	This seems to need an update too.
test/CodeGen/X86/atomic-idempotent.ll
193 ↗	(On Diff #188200)	Can you test `monotonic` as well? Also, `i8` and `i16`.

jfb added a subscriber: anemet. Feb 25 2019, 10:56 AM

reames planned changes to this revision. Feb 25 2019, 8:31 PM

reames marked 2 inline comments as done.

reames added inline comments.

lib/Target/X86/X86ISelLowering.cpp
25482 ↗	(On Diff #188200)	I'm not sure what update you're asking for. If we fall down this path, the comments appear as correct as they ever were. Is there something I'm missing?
test/CodeGen/X86/atomic-idempotent.ll
193 ↗	(On Diff #188200)	Will do.

Add tests requested by JF

Overall this LGTM, but that part of CodeGen isn't my area of expertise so it would be nice to get another person to chime in.

This revision is now accepted and ready to land. Feb 26 2019, 11:44 AM

anemet edited reviewers, added: craig.topper; removed: jfb. Feb 26 2019, 1:01 PM

This revision now requires review to proceed. Feb 26 2019, 1:01 PM

Herald added a subscriber: jfb. Feb 26 2019, 1:01 PM

anemet added a reviewer: jfb. Feb 26 2019, 1:01 PM

ping?

ping

ping?

Herald added a subscriber: dexonsmith. Apr 19 2019, 12:03 PM

craig.topper added inline comments. Apr 23 2019, 10:28 PM

lib/Target/X86/X86ISelLowering.cpp
25455 ↗	(On Diff #188385)	Should that be "auto *C"? I think we try to keep the * on pointers.
25995 ↗	(On Diff #188385)	Line comments up?
25999 ↗	(On Diff #188385)	Shouldn't this be mi8 for a shorter encoding? Does 64-bit really need a 64-bit access or can we just use a 32-bit access?
26081 ↗	(On Diff #188385)	I think you can use isNullConstant

craig.topper added inline comments. Apr 23 2019, 10:34 PM

lib/Target/X86/X86ISelLowering.cpp
25984 ↗	(On Diff #188385)	Windows 64 definitely doesn't have a red zone. I'm not sure if any 32-bit target does.

reames mentioned this in rL360274: [Tests] Landing tests for D58632 to show diffs in review. May 8 2019, 10:26 AM

reames mentioned this in rG8186e3908269: [Tests] Landing tests for D58632 to show diffs in review. May 8 2019, 10:29 AM

Address review comments

reames marked 4 inline comments as done. May 8 2019, 10:58 AM

reames added inline comments.

lib/Target/X86/X86ISelLowering.cpp
25984 ↗	(On Diff #188385)	Does my whitelist approach seem reasonable here? Or do you have a better suggestion?
25999 ↗	(On Diff #188385)	For the 64 bit, we need a valid stack location. There's no guarantee that ESP will contain a valid address. (Since it's the 32 bit truncation of RSP and the stack may be outside the low 4 GB.)

jfb added inline comments. May 8 2019, 11:16 AM

lib/Target/X86/X86ISelLowering.cpp
26255 ↗	(On Diff #198698)	I'd rather see something more precise here. We have a `noredzone` function attribute which should be honored: https://releases.llvm.org/3.7.0/docs/LangRef.html#id630 aarch64 has some handling for redzone too: https://llvm.org/doxygen/AArch64FrameLowering_8cpp_source.html#l00186 So do X86FrameLowering.cpp, X86InstrInfo.cpp, and X86MachineFunctionInfo.h. Some other architectures do this too. I think you should do the same: if the attribute isn't present, redzone away. If some targets don't have a redzone yet don't set the attribute, they've already opened themselves to pain. Maybe we need a PSA on the mailing list to make sure these targets set `noredzone` on all their functions.

craig.topper added inline comments. May 8 2019, 11:18 AM

lib/Target/X86/X86ISelLowering.cpp
25999 ↗	(On Diff #188385)	Sorry. I meant: can we use OR32mi8Locked in 64-bit mode, but still use RSP for the address? The amount of stack space accessed isn't important, right? We could read/write 4 bytes from the stack? I mainly ask because OR32mi8Locked is 1 byte shorter to encode than LOCK_OR64mi8 if none of the registers used force the use of a REX prefix.
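To make the one-byte difference concrete, here is a small sketch that spells out the two encodings for a locked OR of an 8-bit zero immediate against the memory at the stack pointer. The byte sequences were assembled by hand for illustration (the array names are hypothetical); the only difference is the REX.W prefix that a 64-bit operand size requires.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// lock orl $0, (%rsp)
//   F0    LOCK prefix
//   83    OR r/m32, imm8 (opcode extension /1)
//   0C    ModRM: mod=00, reg=001 (/1), rm=100 (SIB follows)
//   24    SIB: no index, base=RSP
//   00    imm8
constexpr std::array<std::uint8_t, 5> LockOr32 = {0xF0, 0x83, 0x0C, 0x24, 0x00};

// lock orq $0, (%rsp): identical, except for the REX.W prefix (0x48) that
// selects a 64-bit operand size.
constexpr std::array<std::uint8_t, 6> LockOr64 = {0xF0, 0x48, 0x83,
                                                  0x0C, 0x24, 0x00};

// Number of bytes saved by the 32-bit form.
inline std::size_t bytesSaved() { return LockOr64.size() - LockOr32.size(); }
```

The address is still formed from the full RSP in both cases, so the narrower operand size changes only how many bytes are touched, not which stack slot is used.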

craig.topper added inline comments. May 8 2019, 11:30 AM

lib/Target/X86/X86ISelLowering.cpp
26255 ↗	(On Diff #198698)	But clang doesn't currently do that does it? Wouldn't it also create a bitcode backwards compatibility issue?

jfb added inline comments. May 8 2019, 12:46 PM

lib/Target/X86/X86ISelLowering.cpp
26255 ↗	(On Diff #198698)	The above heuristic must, at a minimum, be updated to obey `noredzone`. I'm saying that on top of this I think we want something more general. Yes, there are potential bugs and incompatibilities lurking, which is why I think a PSA is warranted; I think the discussion will be helpful.

reames marked 2 inline comments as done. May 9 2019, 12:21 PM

reames added inline comments.

lib/Target/X86/X86ISelLowering.cpp
26255 ↗	(On Diff #198698)	Ok, I think this is snowballing a bit. From a quick dig through the code, the existing redzone code is a mix of "have we relied on the redzone" and "are we allowed to use the redzone". I'm fine trying to do a bit of cleanup there, but I would very much like to separate that into a distinct set of patches. One point: I don't agree with your interpretation of the noredzone attribute. The LangRef explicitly says it only matters if the ABI would otherwise allow the use of the redzone. Can we land this w/o using the redzone? Banging on top of stack isn't great, but it's a lot better than the mfence. If everyone is okay with that, I'll rev the patch to remove the offset for the moment.
25999 ↗	(On Diff #188385)	Ah, gotcha. When I do the next rev, I'll make this change.

jfb added inline comments. May 9 2019, 12:24 PM

lib/Target/X86/X86ISelLowering.cpp
26255 ↗	(On Diff #198698)	Works for me (as long as it doesn't special-case any ABI), maybe leave a FIXME to use redzone and check the attribute, etc.
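As a rough illustration of the rule being discussed, the following sketch models the decision with plain structs. Everything here is hypothetical: the struct, the function name, and in particular the -64 displacement (picked as the midpoint of a 128-byte red zone) are assumptions for illustration. The revision as landed uses displacement 0, i.e. top of stack, and leaves the redzone case for a later change.

```cpp
#include <cassert>

// Hypothetical summary of the two inputs discussed in the thread.
struct RedZoneInfo {
  bool AbiProvidesRedZone; // e.g. the x86-64 SysV ABI does; Windows 64 does not
  bool HasNoRedZoneAttr;   // the function carries the `noredzone` attribute
};

// Pick a displacement from the stack pointer for the locked operation.
// The red zone is only usable when the ABI provides one AND the function
// has not opted out; otherwise fall back to top of stack.
constexpr int lockedStackOpDisplacement(RedZoneInfo Info) {
  return (Info.AbiProvidesRedZone && !Info.HasNoRedZoneAttr) ? -64 : 0;
}
```

This follows the reading that `noredzone` only matters when the ABI would otherwise allow the red zone, so the attribute alone never enables it.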

Remove redzone handling (to be done in a future change)

Reverse a test change which was only needed by the redzone handling (which has been removed)

jfb accepted this revision. May 9 2019, 3:34 PM

This revision is now accepted and ready to land. May 9 2019, 3:34 PM

Closed by commit rL360393: [X86] Improve lowering of idemptotent RMW operations (authored by reames). May 9 2019, 4:21 PM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. May 9 2019, 4:21 PM

We've seen crashes bisected to this patch with fatal error: error in backend: Cannot emit physreg copy instruction.

I've only seen it with SLH so far. Will try to reduce a testcase, but I think we'll likely need to revert this revision.

@sammccall, any chance you’ve seen it with an asserts build? I think the debug output might say which registers it was trying to copy.

@ctopper thanks for the hint! There is an assertion, hope it helps...

(Unfortunately after running with -emit-llvm I can't get opt to crash, so I'm stuck waiting for creduce)

clang-6.0: /usr/local/google/home/sammccall/src/llvm/include/llvm/CodeGen/TargetRegisterInfo.h:302: static unsigned int llvm::TargetRegisterInfo::virtReg2Index(unsigned int): Assertion `isVirtualRegister(Reg) && "Not a virtual register"' failed.
Stack dump:
0.      Program arguments: /usr/local/google/home/sammccall/llvmbuild/bin/clang-6.0 -cc1 -triple x86_64-unknown-linux-gnu -emit-obj -disable-free -main-file-name crash-reduced.cpp -mrelocation-model static -mthread-model posix -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu x86-64 -target-feature +retpoline-indirect-calls -dwarf-column-info -debugger-tuning=gdb -momit-leaf-frame-pointer -coverage-notes-file /dev/null.gcno -resource-dir /usr/local/google/home/sammccall/llvmbuild/lib/clang/9.0.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/8.0.1/../../../../include/c++/8.0.1 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/8.0.1/../../../../include/x86_64-linux-gnu/c++/8.0.1 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/8.0.1/../../../../include/x86_64-linux-gnu/c++/8.0.1 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/8.0.1/../../../../include/c++/8.0.1/backward -internal-isystem /usr/local/include -internal-isystem /usr/local/google/home/sammccall/llvmbuild/lib/clang/9.0.0/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=gnu++17 -fdeprecated-macro -fdebug-compilation-dir /usr/local/google/home/sammccall/llvmbuild -ferror-limit 19 -fmessage-length 174 -mspeculative-load-hardening -fobjc-runtime=gcc -fcxx-exceptions -fexceptions -fdiagnostics-show-option -fcolor-diagnostics -vectorize-loops -vectorize-slp -target-feature +sse4.2 -o /dev/null -x c++ /tmp/crash-reduced.cpp -faddrsig
1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module '/tmp/crash-reduced.cpp'.
4.      Running pass 'X86 speculative load hardening' on function '@<snip>'
 #0 0x0000000003c7d2b9 llvm::sys::PrintStackTrace(llvm::raw_ostream&) /usr/local/google/home/sammccall/src/llvm/lib/Support/Unix/Signals.inc:494:11
 #1 0x0000000003c7d469 PrintStackTraceSignalHandler(void*) /usr/local/google/home/sammccall/src/llvm/lib/Support/Unix/Signals.inc:558:1
 #2 0x0000000003c7bd76 llvm::sys::RunSignalHandlers() /usr/local/google/home/sammccall/src/llvm/lib/Support/Signals.cpp:67:5
 #3 0x0000000003c7daeb SignalHandler(int) /usr/local/google/home/sammccall/src/llvm/lib/Support/Unix/Signals.inc:357:1
 #4 0x00007f596a05e0c0 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x110c0)
 #5 0x00007f5968be8fcf raise (/lib/x86_64-linux-gnu/libc.so.6+0x32fcf)
 #6 0x00007f5968bea3fa abort (/lib/x86_64-linux-gnu/libc.so.6+0x343fa)
 #7 0x00007f5968be1e37 (/lib/x86_64-linux-gnu/libc.so.6+0x2be37)
 #8 0x00007f5968be1ee2 (/lib/x86_64-linux-gnu/libc.so.6+0x2bee2)
 #9 0x00000000013fd6d4 llvm::TargetRegisterInfo::virtReg2Index(unsigned int) /usr/local/google/home/sammccall/src/llvm/include/llvm/CodeGen/TargetRegisterInfo.h:303:12
#10 0x00000000013fd607 llvm::VirtReg2IndexFunctor::operator()(unsigned int) const /usr/local/google/home/sammccall/src/llvm/include/llvm/CodeGen/TargetRegisterInfo.h:1154:5
#11 0x00000000013fd549 llvm::IndexedMap<std::pair<llvm::PointerUnion<llvm::TargetRegisterClass const*, llvm::RegisterBank const*>, llvm::MachineOperand*>, llvm::VirtReg2IndexFunctor>::operator[](unsigned int) const /usr/local/google/home/sammccall/src/llvm/include/llvm/ADT/IndexedMap.h:51:7
#12 0x00000000013fdc09 llvm::MachineRegisterInfo::getRegClass(unsigned int) const /usr/local/google/home/sammccall/src/llvm/include/llvm/CodeGen/MachineRegisterInfo.h:627:5
#13 0x000000000285042d (anonymous namespace)::X86SpeculativeLoadHardeningPass::hardenLoadAddr(llvm::MachineInstr&, llvm::MachineOperand&, llvm::MachineOperand&, llvm::SmallDenseMap<unsigned int, unsigned int, 32u, llvm::DenseMapInfo<unsigned int>, llvm::detail::DenseMapPair<unsigned int, unsigned int> >&) /usr/local/google/home/sammccall/src/llvm/lib/Target/X86/X86SpeculativeLoadHardening.cpp:2030:11
#14 0x000000000284ccf1 (anonymous namespace)::X86SpeculativeLoadHardeningPass::tracePredStateThroughBlocksAndHarden(llvm::MachineFunction&) /usr/local/google/home/sammccall/src/llvm/lib/Target/X86/X86SpeculativeLoadHardening.cpp:1793:11
#15 0x000000000284902e (anonymous namespace)::X86SpeculativeLoadHardeningPass::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/sammccall/src/llvm/lib/Target/X86/X86SpeculativeLoadHardening.cpp:549:30
#16 0x0000000002ef3527 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /usr/local/google/home/sammccall/src/llvm/lib/CodeGen/MachineFunctionPass.cpp:73:8
#17 0x00000000033c18b9 llvm::FPPassManager::runOnFunction(llvm::Function&) /usr/local/google/home/sammccall/src/llvm/lib/IR/LegacyPassManager.cpp:1648:23
#18 0x00000000033c1cff llvm::FPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/sammccall/src/llvm/lib/IR/LegacyPassManager.cpp:1685:16
#19 0x00000000033c2465 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/sammccall/src/llvm/lib/IR/LegacyPassManager.cpp:1752:23
#20 0x00000000033c1fa5 llvm::legacy::PassManagerImpl::run(llvm::Module&) /usr/local/google/home/sammccall/src/llvm/lib/IR/LegacyPassManager.cpp:1865:16
#21 0x00000000033c29e1 llvm::legacy::PassManager::run(llvm::Module&) /usr/local/google/home/sammccall/src/llvm/lib/IR/LegacyPassManager.cpp:1896:3
#22 0x0000000003fafe68 (anonymous namespace)::EmitAssemblyHelper::EmitAssembly(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) /usr/local/google/home/sammccall/src/llvm/tools/clang/lib/CodeGen/BackendUtil.cpp:894:3
#23 0x0000000003fac4dc clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) /usr/local/google/home/sammccall/src/llvm/tools/clang/lib/CodeGen/BackendUtil.cpp:1466:5
#24 0x000000000509bc7d clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) /usr/local/google/home/sammccall/src/llvm/tools/clang/lib/CodeGen/CodeGenAction.cpp:300:7
#25 0x00000000067cd7fe clang::ParseAST(clang::Sema&, bool, bool) /usr/local/google/home/sammccall/src/llvm/tools/clang/lib/Parse/ParseAST.cpp:178:12
#26 0x0000000004761552 clang::ASTFrontendAction::ExecuteAction() /usr/local/google/home/sammccall/src/llvm/tools/clang/lib/Frontend/FrontendAction.cpp:1037:1
#27 0x000000000509935f clang::CodeGenAction::ExecuteAction() /usr/local/google/home/sammccall/src/llvm/tools/clang/lib/CodeGen/CodeGenAction.cpp:1057:1
#28 0x0000000004760f70 clang::FrontendAction::Execute() /usr/local/google/home/sammccall/src/llvm/tools/clang/lib/Frontend/FrontendAction.cpp:938:7
#29 0x00000000046f8ae0 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) /usr/local/google/home/sammccall/src/llvm/tools/clang/lib/Frontend/CompilerInstance.cpp:945:7
#30 0x00000000048e4c81 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) /usr/local/google/home/sammccall/src/llvm/tools/clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:273:8
#31 0x000000000137d3c3 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) /usr/local/google/home/sammccall/src/llvm/tools/clang/tools/driver/cc1_main.cpp:225:13
#32 0x00000000013711e1 ExecuteCC1Tool(llvm::ArrayRef<char const*>, llvm::StringRef) /usr/local/google/home/sammccall/src/llvm/tools/clang/tools/driver/driver.cpp:309:5
#33 0x0000000001370571 main /usr/local/google/home/sammccall/src/llvm/tools/clang/tools/driver/driver.cpp:381:5
#34 0x00007f5968bd62b1 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x202b1)
#35 0x000000000136fd6a _start (/usr/local/google/home/sammccall/llvmbuild/bin/clang-6.0+0x136fd6a)
clang-6: error: unable to execute command: Aborted
clang-6: error: clang frontend command failed due to signal (use -v to see invocation)

You’ll need to use llc to reproduce the crash. But I can guess the issue now: SLH is expecting the memory address to have a virtual register as a base register, but we forced RSP on this access. We need to teach SLH about this case.

@craig.topper doh, I'm a frontend person and somehow forgot about llc vs opt. I'll get a repro and attach it.

@reames Yes please, reverting until we can address this symptom would be really useful. We try to integrate daily (which is how we found this), and it gets much harder the longer a breakage lasts.

It reduces to this point with the same error:

bugpoint-reduced-simplifycfg.ll (33 KB)

Bugpoint can reduce it further, but to a segfault that I'm not sure is the same error (--append-exit-code doesn't seem to help me):

bugpoint-reduced-simplified.ll (1 KB)

@sammccall, can you try this patch

diff --git a/llvm/lib/Target/X86/X86SpeculativeLoadHardening.cpp b/llvm/lib/Target/X86/X86SpeculativeLoadHardening.cpp
index c8b740ca39e..ab8cb5250c9 100644
--- a/llvm/lib/Target/X86/X86SpeculativeLoadHardening.cpp
+++ b/llvm/lib/Target/X86/X86SpeculativeLoadHardening.cpp
@@ -1958,7 +1958,7 @@ void X86SpeculativeLoadHardeningPass::hardenLoadAddr(
 
   SmallVector<MachineOperand *, 2> HardenOpRegs;
 
-  if (BaseMO.isFI()) {
+  if (BaseMO.isFI() || BaseMO.getReg() == X86::RSP) {
     // A frame index is never a dynamically controllable load, so only
     // harden it if we're covering fixed address loads as well.
     LLVM_DEBUG(
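To see why the assertion in the trace fires and why the guard above avoids it, here is a simplified, self-contained model. It assumes the convention that virtual registers are tagged with the top bit while physical registers are small numbers; the constant `HypotheticalRSP` and the helper `isFixedAddressBase` are made up for this sketch and are not LLVM's API.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model (an assumption, not LLVM's real classes): virtual
// registers carry the top bit, physical registers such as RSP do not.
constexpr std::uint32_t VirtualRegFlag = 1u << 31;
constexpr std::uint32_t HypotheticalRSP = 20; // placeholder register number

constexpr bool isVirtualRegister(std::uint32_t Reg) {
  return (Reg & VirtualRegFlag) != 0;
}

// Mirrors the failing assertion: only virtual registers map to an index, so
// a register-class lookup on a physical register such as RSP trips it.
inline std::uint32_t virtReg2Index(std::uint32_t Reg) {
  assert(isVirtualRegister(Reg) && "Not a virtual register");
  return Reg & ~VirtualRegFlag;
}

// Mirrors the proposed guard: a frame index or an explicit stack-pointer
// base is treated as a fixed address, so the lookup is never reached.
constexpr bool isFixedAddressBase(bool IsFrameIndex, std::uint32_t BaseReg) {
  return IsFrameIndex || BaseReg == HypotheticalRSP;
}
```

Note that the short-circuit matters: the register is only inspected when the operand is not a frame index.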

reames mentioned this in D61799: Factor out redzone ABI checks [NFCI]. May 10 2019, 12:41 PM

@craig.topper Indeed that successfully compiles the problematic (non-reduced) file.

Whether the resulting binary works I should be able to tell you in an hour or so :-)

Yes, that fix works for us. Thanks!

It'd be great to get it landed or a revert of this patch soon if possible.

I've committed a modified version of my patch in r360475. I'll try to reduce a test case and commit that later tonight or tomorrow.

Diffusion added a reverting change: rL360475: [X86] Disable speculative load hardening for operations with an explicit RSP…. May 10 2019, 3:01 PM

craig.topper added a reverting change: rGdf10cc6068b2: [X86] Disable speculative load hardening for operations with an explicit RSP…. May 10 2019, 3:02 PM

reames mentioned this in rL360479: Factor out redzone ABI checks [NFCI]. May 10 2019, 3:56 PM

reames mentioned this in rG849ef823df01: Factor out redzone ABI checks [NFCI].

reames mentioned this in D61862: Use an offset from TOS for idempotent rmw locked op lowering. May 13 2019, 9:52 AM

reames mentioned this in D61863: [X86] Prefer locked stack op over mfence for seq_cst 64-bit stores on 32-bit targets. May 13 2019, 10:07 AM

reames mentioned this in rL360649: [X86] Prefer locked stack op over mfence for seq_cst 64-bit stores on 32-bit…. May 13 2019, 9:42 PM

reames mentioned this in rG3098e44daa76: [X86] Prefer locked stack op over mfence for seq_cst 64-bit stores on 32-bit….

reames mentioned this in rL360719: Use an offset from TOS for idempotent rmw locked op lowering. May 14 2019, 3:31 PM

reames mentioned this in rG445f942fc498: Use an offset from TOS for idempotent rmw locked op lowering.

sidorovd mentioned this in rG46a6adeeedeb: [Tests] Landing tests for D58632 to show diffs in review. May 30 2019, 8:52 AM

sidorovd added a reverting change: rGf62401d7d835: [X86] Disable speculative load hardening for operations with an explicit RSP…. May 30 2019, 9:07 AM

sidorovd mentioned this in rG23ebdeff5427: Factor out redzone ABI checks [NFCI].

sidorovd mentioned this in rGd9555062eba1: [X86] Prefer locked stack op over mfence for seq_cst 64-bit stores on 32-bit…. May 30 2019, 9:18 AM

sidorovd mentioned this in rG83a463edabf8: Use an offset from TOS for idempotent rmw locked op lowering. May 30 2019, 9:22 AM

sidorovd mentioned this in rG722b585a7a59: [Tests] Landing tests for D58632 to show diffs in review. May 30 2019, 9:56 AM

sidorovd added a reverting change: rG2f84c3005f6f: [X86] Disable speculative load hardening for operations with an explicit RSP…. May 30 2019, 10:09 AM

sidorovd mentioned this in rGdc5224138016: Factor out redzone ABI checks [NFCI].

sidorovd mentioned this in rG0e6e9fcb1f05: [X86] Prefer locked stack op over mfence for seq_cst 64-bit stores on 32-bit…. May 30 2019, 10:18 AM

sidorovd mentioned this in rGa75df0b74b21: Use an offset from TOS for idempotent rmw locked op lowering. May 30 2019, 10:22 AM

sidorovd added a reverting change: rZORGf62401d7d835: [X86] Disable speculative load hardening for operations with an explicit RSP…. Oct 28 2019, 4:42 PM

sidorovd added a reverting change: rZORG2f84c3005f6f: [X86] Disable speculative load hardening for operations with an explicit RSP…. Oct 28 2019, 4:59 PM

vchuravy mentioned this in D129947: [X86] Prefer `lock or` over mfence. Jul 16 2022, 5:20 PM

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	89 lines
llvm/trunk/test/CodeGen/X86/atomic-idempotent.ll	88 lines

Diff 198938

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

@@ 25,721 unchanged lines hidden @@ X86TargetLowering::lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const {
   unsigned NativeWidth = Subtarget.is64Bit() ? 64 : 32;
   Type *MemType = AI->getType();
   // Accesses larger than the native width are turned into cmpxchg/libcalls, so
   // there is no benefit in turning such RMWs into loads, and it is actually
   // harmful as it introduces a mfence.
   if (MemType->getPrimitiveSizeInBits() > NativeWidth)
     return nullptr;

+  // If this is a canonical idempotent atomicrmw w/no uses, we have a better
+  // lowering available in lowerAtomicArith.
+  // TODO: push more cases through this path.
+  if (auto *C = dyn_cast<ConstantInt>(AI->getValOperand()))
+    if (AI->getOperation() == AtomicRMWInst::Or && C->isZero() &&
+        AI->use_empty())
+      return nullptr;
+
   auto Builder = IRBuilder<>(AI);
   Module *M = Builder.GetInsertBlock()->getParent()->getParent();
   auto SSID = AI->getSyncScopeID();
   // We must restrict the ordering to avoid generating loads with Release or
   // ReleaseAcquire orderings.
   auto Order = AtomicCmpXchgInst::getStrongestFailureOrdering(AI->getOrdering());

   // Before the load we need a fence. Here is an example lifted from
@@ 480 unchanged lines hidden @@ static SDValue LowerBITREVERSE(SDValue Op, const X86Subtarget &Subtarget,

   SDValue LoMask = DAG.getBuildVector(VT, DL, LoMaskElts);
   SDValue HiMask = DAG.getBuildVector(VT, DL, HiMaskElts);
   Lo = DAG.getNode(X86ISD::PSHUFB, DL, VT, LoMask, Lo);
   Hi = DAG.getNode(X86ISD::PSHUFB, DL, VT, HiMask, Hi);
   return DAG.getNode(ISD::OR, DL, VT, Lo, Hi);
 }

+/// Emit a locked operation on a stack location which does not change any
+/// memory location, but does involve a lock prefix. Location is chosen to be
+/// a) very likely accessed only by a single thread to minimize cache traffic,
+/// and b) definitely dereferenceable. Returns the new Chain result.
+static SDValue emitLockedStackOp(SelectionDAG &DAG,
+                                 const X86Subtarget &Subtarget,
+                                 SDValue Chain, SDLoc DL) {
+  // Implementation notes:
+  // 1) LOCK prefix creates a full read/write reordering barrier for memory
+  // operations issued by the current processor. As such, the location
+  // referenced is not relevant for the ordering properties of the instruction.
+  // See: Intel® 64 and IA-32 ArchitecturesSoftware Developer’s Manual,
+  // 8.2.3.9 Loads and Stores Are Not Reordered with Locked Instructions
+  // 2) Using an immediate operand appears to be the best encoding choice
+  // here since it doesn't require an extra register.
+  // 3) OR appears to be very slightly faster than ADD. (Though, the difference
+  // is small enough it might just be measurement noise.)
+  // 4) For the moment, we are using top of stack. This creates false sharing
+  // with actual stack access/call sequences, and it would be better to use a
+  // location within the redzone. For the moment, this is still better than an
+  // mfence though. TODO: Revise the offset used when we can assume a redzone.
+  //
+  // For a general discussion of the tradeoffs and benchmark results, see:
+  // https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
+
+  if (Subtarget.is64Bit()) {
+    SDValue Zero = DAG.getTargetConstant(0, DL, MVT::i8);
+    SDValue Ops[] = {
+      DAG.getRegister(X86::RSP, MVT::i64),                  // Base
+      DAG.getTargetConstant(1, DL, MVT::i8),                // Scale
+      DAG.getRegister(0, MVT::i64),                         // Index
+      DAG.getTargetConstant(0, DL, MVT::i32),               // Disp
+      DAG.getRegister(0, MVT::i32),                         // Segment.
+      Zero,
+      Chain};
+    SDNode *Res = DAG.getMachineNode(X86::LOCK_OR32mi8, DL, MVT::Other, Ops);
+    return SDValue(Res, 0);
+  }
+
+  SDValue Zero = DAG.getTargetConstant(0, DL, MVT::i32);
+  SDValue Ops[] = {
+    DAG.getRegister(X86::ESP, MVT::i32),            // Base
+    DAG.getTargetConstant(1, DL, MVT::i8),          // Scale
+    DAG.getRegister(0, MVT::i32),                   // Index
+    DAG.getTargetConstant(0, DL, MVT::i32),         // Disp
+    DAG.getRegister(0, MVT::i32),                   // Segment.
+    Zero,
+    Chain
+  };
+  SDNode *Res = DAG.getMachineNode(X86::OR32mi8Locked, DL, MVT::Other, Ops);
+  return SDValue(Res, 0);
+}

 static SDValue lowerAtomicArithWithLOCK(SDValue N, SelectionDAG &DAG,
                                         const X86Subtarget &Subtarget) {
   unsigned NewOpc = 0;
   switch (N->getOpcode()) {
   case ISD::ATOMIC_LOAD_ADD:
     NewOpc = X86ISD::LADD;
     break;
   case ISD::ATOMIC_LOAD_SUB:
@@ 18 unchanged lines hidden @@ return DAG.getMemIntrinsicNode(
       NewOpc, SDLoc(N), DAG.getVTList(MVT::i32, MVT::Other),
       {N->getOperand(0), N->getOperand(1), N->getOperand(2)},
       /*MemVT=*/N->getSimpleValueType(0), MMO);
 }

 /// Lower atomic_load_ops into LOCK-prefixed operations.
 static SDValue lowerAtomicArith(SDValue N, SelectionDAG &DAG,
                                 const X86Subtarget &Subtarget) {
+  AtomicSDNode *AN = cast<AtomicSDNode>(N.getNode());
   SDValue Chain = N->getOperand(0);
   SDValue LHS = N->getOperand(1);
   SDValue RHS = N->getOperand(2);
   unsigned Opc = N->getOpcode();
   MVT VT = N->getSimpleValueType(0);
   SDLoc DL(N);

   // We can lower atomic_load_add into LXADD. However, any other atomicrmw op
   // can only be lowered when the result is unused. They should have already
   // been transformed into a cmpxchg loop in AtomicExpand.
   if (N->hasAnyUseOfValue(0)) {
     // Handle (atomic_load_sub p, v) as (atomic_load_add p, -v), to be able to
     // select LXADD if LOCK_SUB can't be selected.
     if (Opc == ISD::ATOMIC_LOAD_SUB) {
-      AtomicSDNode *AN = cast<AtomicSDNode>(N.getNode());
       RHS = DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), RHS);
       return DAG.getAtomic(ISD::ATOMIC_LOAD_ADD, DL, VT, Chain, LHS,
                            RHS, AN->getMemOperand());
     }
     assert(Opc == ISD::ATOMIC_LOAD_ADD &&
            "Used AtomicRMW ops other than Add should have been expanded!");
     return N;
   }

+  // Specialized lowering for the canonical form of an idemptotent atomicrmw.
+  // The core idea here is that since the memory location isn't actually
+  // changing, all we need is a lowering for the ordering impacts of the
+  // atomicrmw. As such, we can chose a different operation and memory
+  // location to minimize impact on other code.
+  if (Opc == ISD::ATOMIC_LOAD_OR && isNullConstant(RHS)) {
+    // On X86, the only ordering which actually requires an instruction is
+    // seq_cst which isn't SingleThread, everything just needs to be preserved
+    // during codegen and then dropped. Note that we expect (but don't assume),
+    // that orderings other than seq_cst and acq_rel have been canonicalized to
+    // a store or load.
+    if (AN->getOrdering() == AtomicOrdering::SequentiallyConsistent &&
+        AN->getSyncScopeID() == SyncScope::System) {
+      // Prefer a locked operation against a stack location to minimize cache
+      // traffic. This assumes that stack locations are very likely to be
+      // accessed only by the owning thread.
+      SDValue NewChain = emitLockedStackOp(DAG, Subtarget, Chain, DL);
+      DAG.ReplaceAllUsesOfValueWith(N.getValue(1), NewChain);
+      return SDValue();
+    }
+    // MEMBARRIER is a compiler barrier; it codegens to a no-op.
+    SDValue NewChain = DAG.getNode(X86ISD::MEMBARRIER, DL, MVT::Other, Chain);
+    DAG.ReplaceAllUsesOfValueWith(N.getValue(1), NewChain);
+    return SDValue();
+  }
+
   SDValue LockOp = lowerAtomicArithWithLOCK(N, DAG, Subtarget);
   // RAUW the chain, but don't worry about the result, as it's unused.
   assert(!N->hasAnyUseOfValue(0));
   // NOTE: The getUNDEF is needed to give something for the unused result 0.
   return DAG.getNode(ISD::MERGE_VALUES, DL, N->getVTList(),
                      DAG.getUNDEF(VT), LockOp.getValue(1));
 }

@@ 18,094 unchanged lines hidden @@
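The decision made in the new `lowerAtomicArith` branch can be summarized with a small standalone model. This is a sketch only: the enums and the function `chooseIdempotentOrLowering` are hypothetical stand-ins for the LLVM types, and the model covers just the case the patch handles (an `or` with zero whose result is unused).

```cpp
#include <cassert>

// Hypothetical stand-ins for the LLVM enums used in the patch.
enum class Ordering {
  Monotonic,
  Acquire,
  Release,
  AcquireRelease,
  SequentiallyConsistent
};
enum class Scope { SingleThread, System };
enum class Lowering { LockedStackOp, CompilerBarrier };

// Only a cross-thread seq_cst operation needs a real instruction on x86;
// every other ordering only has to constrain the compiler.
constexpr Lowering chooseIdempotentOrLowering(Ordering O, Scope S) {
  return (O == Ordering::SequentiallyConsistent && S == Scope::System)
             ? Lowering::LockedStackOp
             : Lowering::CompilerBarrier;
}

static_assert(chooseIdempotentOrLowering(Ordering::SequentiallyConsistent,
                                         Scope::System) ==
                  Lowering::LockedStackOp,
              "seq_cst at system scope needs the locked stack op");
```

This matches the updated test expectations, where the monotonic, acquire, release, and acq_rel variants all reduce to a `#MEMBARRIER` comment followed by the return.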

llvm/trunk/test/CodeGen/X86/atomic-idempotent.ll

@@ 160 unchanged lines hidden @@
 ; X32-NEXT: mfence
 ; X32-NEXT: movl (%eax), %eax
 ; X32-NEXT: retl
   %1 = atomicrmw and i32* %p, i32 -1 acq_rel
   ret i32 %1
 }

 define void @or32_nouse_monotonic(i32* %p) {
-; X64-LABEL: or32_nouse_monotonic:
-; X64: # %bb.0:
-; X64-NEXT: mfence
-; X64-NEXT: movl (%rdi), %eax
-; X64-NEXT: retq
-;
-; X32-LABEL: or32_nouse_monotonic:
-; X32: # %bb.0:
-; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X32-NEXT: mfence
-; X32-NEXT: movl (%eax), %eax
-; X32-NEXT: retl
+; CHECK-LABEL: or32_nouse_monotonic:
+; CHECK: # %bb.0:
+; CHECK-NEXT: #MEMBARRIER
+; CHECK-NEXT: ret{{[l|q]}}
   atomicrmw or i32* %p, i32 0 monotonic
   ret void
 }


 define void @or32_nouse_acquire(i32* %p) {
-; X64-LABEL: or32_nouse_acquire:
-; X64: # %bb.0:
-; X64-NEXT: mfence
-; X64-NEXT: movl (%rdi), %eax
-; X64-NEXT: retq
-;
-; X32-LABEL: or32_nouse_acquire:
-; X32: # %bb.0:
-; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X32-NEXT: mfence
-; X32-NEXT: movl (%eax), %eax
-; X32-NEXT: retl
+; CHECK-LABEL: or32_nouse_acquire:
+; CHECK: # %bb.0:
+; CHECK-NEXT: #MEMBARRIER
+; CHECK-NEXT: ret{{[l|q]}}
   atomicrmw or i32* %p, i32 0 acquire
   ret void
 }

 define void @or32_nouse_release(i32* %p) {
-; X64-LABEL: or32_nouse_release:
-; X64: # %bb.0:
-; X64-NEXT: mfence
-; X64-NEXT: movl (%rdi), %eax
-; X64-NEXT: retq
-;
-; X32-LABEL: or32_nouse_release:
-; X32: # %bb.0:
-; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X32-NEXT: mfence
-; X32-NEXT: movl (%eax), %eax
-; X32-NEXT: retl
+; CHECK-LABEL: or32_nouse_release:
+; CHECK: # %bb.0:
+; CHECK-NEXT: #MEMBARRIER
+; CHECK-NEXT: ret{{[l|q]}}
   atomicrmw or i32* %p, i32 0 release
   ret void
 }

 define void @or32_nouse_acq_rel(i32* %p) {
-; X64-LABEL: or32_nouse_acq_rel:
-; X64: # %bb.0:
-; X64-NEXT: mfence
-; X64-NEXT: movl (%rdi), %eax
+; CHECK-LABEL: or32_nouse_acq_rel:
+; CHECK: # %bb.0:
+; CHECK-NEXT: #MEMBARRIER
+; CHECK-NEXT: ret{{[l|q]}}
; X64-NEXT: retq
;
; X32-LABEL: or32_nouse_acq_rel:
; X32: # %bb.0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
; X32-NEXT: mfence
; X32-NEXT: movl (%eax), %eax
; X32-NEXT: retl
atomicrmw or i32* %p, i32 0 acq_rel		atomicrmw or i32* %p, i32 0 acq_rel
ret void		ret void
}		}

define void @or32_nouse_seq_cst(i32* %p) {		define void @or32_nouse_seq_cst(i32* %p) {
; X64-LABEL: or32_nouse_seq_cst:		; X64-LABEL: or32_nouse_seq_cst:
; X64: # %bb.0:		; X64: # %bb.0:
; X64-NEXT: mfence		; X64-NEXT: lock orl $0, (%rsp)
; X64-NEXT: movl (%rdi), %eax
; X64-NEXT: retq		; X64-NEXT: retq
;		;
; X32-LABEL: or32_nouse_seq_cst:		; X32-LABEL: or32_nouse_seq_cst:
; X32: # %bb.0:		; X32: # %bb.0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax		; X32-NEXT: lock orl $0, (%esp)
; X32-NEXT: mfence
; X32-NEXT: movl (%eax), %eax
; X32-NEXT: retl		; X32-NEXT: retl
atomicrmw or i32* %p, i32 0 seq_cst		atomicrmw or i32* %p, i32 0 seq_cst
ret void		ret void
}		}

; TODO: The value isn't used on 32 bit, so the cmpxchg8b is unneeded		; TODO: The value isn't used on 32 bit, so the cmpxchg8b is unneeded
define void @or64_nouse_seq_cst(i64* %p) {		define void @or64_nouse_seq_cst(i64* %p) {
; X64-LABEL: or64_nouse_seq_cst:		; X64-LABEL: or64_nouse_seq_cst:
; X64: # %bb.0:		; X64: # %bb.0:
; X64-NEXT: mfence		; X64-NEXT: lock orl $0, (%rsp)
; X64-NEXT: movq (%rdi), %rax
; X64-NEXT: retq		; X64-NEXT: retq
;		;
; X32-LABEL: or64_nouse_seq_cst:		; X32-LABEL: or64_nouse_seq_cst:
; X32: # %bb.0:		; X32: # %bb.0:
; X32-NEXT: pushl %ebx		; X32-NEXT: pushl %ebx
; X32-NEXT: .cfi_def_cfa_offset 8		; X32-NEXT: .cfi_def_cfa_offset 8
; X32-NEXT: pushl %esi		; X32-NEXT: pushl %esi
; X32-NEXT: .cfi_def_cfa_offset 12		; X32-NEXT: .cfi_def_cfa_offset 12
		; X128-NEXT: retl
atomicrmw or i128* %p, i128 0 seq_cst		atomicrmw or i128* %p, i128 0 seq_cst
ret void		ret void
}		}


define void @or16_nouse_seq_cst(i16* %p) {		define void @or16_nouse_seq_cst(i16* %p) {
; X64-LABEL: or16_nouse_seq_cst:		; X64-LABEL: or16_nouse_seq_cst:
; X64: # %bb.0:		; X64: # %bb.0:
; X64-NEXT: mfence		; X64-NEXT: lock orl $0, (%rsp)
; X64-NEXT: movzwl (%rdi), %eax
; X64-NEXT: retq		; X64-NEXT: retq
;		;
; X32-LABEL: or16_nouse_seq_cst:		; X32-LABEL: or16_nouse_seq_cst:
; X32: # %bb.0:		; X32: # %bb.0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax		; X32-NEXT: lock orl $0, (%esp)
; X32-NEXT: mfence
; X32-NEXT: movzwl (%eax), %eax
; X32-NEXT: retl		; X32-NEXT: retl
atomicrmw or i16* %p, i16 0 seq_cst		atomicrmw or i16* %p, i16 0 seq_cst
ret void		ret void
}		}

define void @or8_nouse_seq_cst(i8* %p) {		define void @or8_nouse_seq_cst(i8* %p) {
; X64-LABEL: or8_nouse_seq_cst:		; X64-LABEL: or8_nouse_seq_cst:
; X64: # %bb.0:		; X64: # %bb.0:
; X64-NEXT: mfence		; X64-NEXT: lock orl $0, (%rsp)
; X64-NEXT: movb (%rdi), %al
; X64-NEXT: retq		; X64-NEXT: retq
;		;
; X32-LABEL: or8_nouse_seq_cst:		; X32-LABEL: or8_nouse_seq_cst:
; X32: # %bb.0:		; X32: # %bb.0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax		; X32-NEXT: lock orl $0, (%esp)
; X32-NEXT: mfence
; X32-NEXT: movb (%eax), %al
; X32-NEXT: retl		; X32-NEXT: retl
atomicrmw or i8* %p, i8 0 seq_cst		atomicrmw or i8* %p, i8 0 seq_cst
ret void		ret void
}		}

Revision Contents

Diff 198938

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

llvm/trunk/test/CodeGen/X86/atomic-idempotent.ll
