This is an archive of the discontinued LLVM Phabricator instance.


Differential D138899

[DAGCombiner] handle more store value forwarding
ClosedPublic

Authored by shchenz on Nov 28 2022, 11:13 PM.


Details

Reviewers

nemanjai
scui
tstellar
RKSimon
sdardis
RolandF

Group Reviewers

Restricted Project

Commits

rG4f0ed16a46c5: Reland rGf35a09daebd0a90daa536432e62a2476f708150d and…
rGf35a09daebd0: [DAGCombiner] handle more store value forwarding

Summary

When lowering calls on targets like PPC, some stack loads are generated
for by-value parameters. The CALLSEQ_START node prevents such loads from
being combined.

Suggested by @RolandF, this patch removes the unnecessary loads for the
byval parameter by extending ForwardStoreValueToDirectLoad.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

shchenz created this revision. Nov 28 2022, 11:13 PM

Herald added a project: Restricted Project. Nov 28 2022, 11:13 PM

Herald added subscribers: kbarton, hiraditya.

shchenz requested review of this revision. Nov 28 2022, 11:13 PM

Herald added a subscriber: llvm-commits. Nov 28 2022, 11:13 PM

Harbormaster completed remote builds in B199941: Diff 478465. Nov 28 2022, 11:14 PM

shchenz added inline comments. Nov 28 2022, 11:24 PM

llvm/test/CodeGen/PowerPC/vsx-p9.ll
141	xvadddp 0, 0, 1
	stxv 0, 32(1)
	lxv 34, 32(1)   ------> Code change is because this load is eliminated.
	bl sink

Please add some information about other approaches you considered/attempted to solve this problem. I really don't like the idea of doing store forwarding in the peephole. Are there no existing passes that can perform this type of store forwarding (and most likely more general forms of it)? While I am not all that familiar with MemorySSA, it would seem that this is the exact type of thing it should handle effortlessly.

This revision now requires changes to proceed. Nov 30 2022, 6:56 AM

Thanks for taking a look @nemanjai

Please add some information about other approaches you considered/attempted to solve this problem.

I hope I explained well enough why I didn't do this in DAG-ISEL while lowering calls with byval parameters. Also, PPC is working on switching to Global-ISEL, which is another reason not to implement the load elimination for byval parameters in DAG-ISEL.

My first thought was to implement this in MachineCSE, but I didn't do it there because:

- MachineCSE is a target-independent pass. As far as I can tell, a byval parameter of a call is not lowered to loads from the caller's stack frame on all other targets (at least not on some of them). A target-independent peephole was ruled out for the same reason.
- MachineCSE seems to common only instructions that have the same opcode, i.e., it can handle load elimination for the pattern load addr ... load addr.

While I am not all that familiar with MemorySSA, it would seem that this is the exact type of thing it should handle effortlessly.

MemorySSA is not an analysis that MIR passes can use for now. MemorySSA also depends on AAResultsWrapperPass, which may raise compile-time concerns. The IR-level pass mem2reg does not require MemorySSA for its transformation either.

I don't think any pass in the PPC pipeline other than the ones above can implement this optimization, but maybe you can give some insight here.

request review for further discussion

scui added a subscriber: RolandF. Dec 6 2022, 3:55 PM

There is an existing place in codegen to handle forwarding a store value to the corresponding load - ForwardStoreValueToDirectLoad in lib/CodeGen/SelectionDAG/DAGCombiner.cpp. Handling this in codegen would not require putting the code in multiple places. The code there is relatively simple - it just looks to see if the thing on the chain immediately before the load is the setting store. For the first case in byval-lhs.ll there is a CALLSEQ_START on the chain in between the load and the store. Maybe it is possible to look past that in the chain to see the store. Since the load is for a call there may be a register copy required to replace the load. Where there is a sequence of multiple stores followed by multiple loads it would require looking back in the chain past loads and past stores to fixed stack locations that do not overlap. I don't know if that is allowed, but in theory it could work.
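The idea of looking past CALLSEQ_START can be sketched on a toy chain model. This is only an illustration under stated assumptions, not the DAGCombiner code: `Node`, `Kind`, and `findForwardableStore` are hypothetical names, and stack locations are reduced to plain slot numbers.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-in for chained DAG nodes.
enum class Kind { EntryToken, Store, Load, CallSeqStart };

struct Node {
  Kind K;
  int Slot;          // stack slot accessed (meaningful for Store/Load only)
  int Value;         // value written (meaningful for Store only)
  const Node *Chain; // predecessor on the chain, nullptr at the entry
};

// Returns the store whose value could be forwarded to Load, or nullptr.
// The only extension over "the store is immediately before the load" is
// that a single CALLSEQ_START between the two is skipped.
const Node *findForwardableStore(const Node &Load) {
  const Node *Prev = Load.Chain;
  // Look through CALLSEQ_START.
  if (Prev && Prev->K == Kind::CallSeqStart)
    Prev = Prev->Chain;
  if (Prev && Prev->K == Kind::Store && Prev->Slot == Load.Slot)
    return Prev;
  return nullptr;
}
```

In this model a load of slot 40 chained to a CALLSEQ_START that is itself chained to a store of slot 40 finds that store, while a load of any other slot finds nothing. Sequences of several stores followed by several loads, as mentioned above, would need the additional non-overlap reasoning that this sketch leaves out.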

I would suggest to discuss offline with @nemanjai if such an approach is viable and preferred. If not, I think that is a reasonable argument for an MIR level approach.

address @RolandF comment, fixing in DAG Combiner and add more reviewers for other targets

Herald added subscribers: kosarev, kerbowa, pengfei and 3 others. Dec 9 2022, 2:26 AM

Harbormaster completed remote builds in B202184: Diff 481578. Dec 9 2022, 2:27 AM

shchenz retitled this revision from [PowerPC] add a peephole to eliminate unnecessary load to [DAGCombiner] handle more store value forwarding. Dec 9 2022, 2:29 AM

shchenz edited the summary of this revision.

shchenz added inline comments.

llvm/test/CodeGen/X86/fastcc-byval.ll

2 ↗

(On Diff #481578)

Changing from:

_bar:                                   ## @bar
## %bb.0:
	subl	$12, %esp
	movl	$1, 8(%esp)
	movl	8(%esp), %eax
	movl	%eax, (%esp)
	calll	_foo
	movl	8(%esp), %eax
	addl	$12, %esp
	retl

To:

_bar:                                   ## @bar
## %bb.0:
	subl	$12, %esp
	movl	$1, 8(%esp)
	movl	$1, (%esp)
	calll	_foo
	movl	8(%esp), %eax
	addl	$12, %esp
	retl

It seems one load is saved; this needs confirmation from X86 experts.

RKSimon mentioned this in rG2d63646afdb0: [X86] Regenerate fastcc-byval.ll. Dec 9 2022, 2:37 AM

RKSimon added inline comments. Dec 9 2022, 2:38 AM

llvm/test/CodeGen/X86/fastcc-byval.ll
2 ↗	(On Diff #481578)	Please can you rebase after rG2d63646afdb0e4d8c063fe057916d7900f681968?

RKSimon mentioned this in rG51f154b99797: [X86] Regenerate fastcc-byval.ll with visible stack math. Dec 9 2022, 2:41 AM

RKSimon added inline comments. Dec 9 2022, 2:44 AM

llvm/test/CodeGen/X86/fastcc-byval.ll
2 ↗	(On Diff #481578)	Actually, rG51f154b997975ff1001b45823e903f87070222ad

rebase

Harbormaster completed remote builds in B202186: Diff 481584. Dec 9 2022, 2:49 AM

shchenz added inline comments. Dec 9 2022, 2:50 AM

llvm/test/CodeGen/X86/fastcc-byval.ll
2 ↗	(On Diff #481578)	Nice, thank you!

RolandF added inline comments. Dec 12 2022, 10:07 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17112 ↗	(On Diff #481584)	Is this check really needed for token factor operands? They are supposed to be independent. Aliasing seems like an order dependence.

shchenz marked an inline comment as done. Dec 12 2022, 10:03 PM

shchenz added inline comments.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

17112 ↗

(On Diff #481584)

This is what I originally thought, then I met following case:

t16: ch = store<(store (s16) into @x1a0, !tbaa !8)> t15, t13, GlobalAddress:i64<ptr @x1a0> 0, undef:i64
  t18: i16,ch = load<(dereferenceable load (s16) from @x1a2, !tbaa !8)> t16, GlobalAddress:i64<ptr @x1a2> 0, undef:i64
  t20: i16,ch = load<(dereferenceable load (s16) from @i, !tbaa !8)> t16, GlobalAddress:i64<ptr @i> 0, undef:i64
  t25: i16,ch = load<(load (s16) from %ir.arrayidx.i, !tbaa !8)> t16, t24, undef:i64
  t27: i16,ch = load<(dereferenceable load (s16) from @si, !tbaa !8)> t16, GlobalAddress:i64<ptr @si> 0, undef:i64

    t37: ch = TokenFactor t18:1, t20:1, t25:1, t27:1

  t38: ch = store<(store (s16) into %ir.arrayidx.i, !tbaa !8)> t37, t36, t24, undef:i64

  t41: i16,ch = load<(dereferenceable load (s16) from @x1a0, !tbaa !8)> t38, GlobalAddress:i64<ptr @x1a0> 0, undef:i64

=>

t16: ch = store<(store (s16) into @x1a0, !tbaa !8)> t15, t13, GlobalAddress:i64<ptr @x1a0> 0, undef:i64
t18: i16,ch = load<(dereferenceable load (s16) from @x1a2, !tbaa !8)> t16, GlobalAddress:i64<ptr @x1a2> 0, undef:i64
t20: i16,ch = load<(dereferenceable load (s16) from @i, !tbaa !8)> t16, GlobalAddress:i64<ptr @i> 0, undef:i64

t25: i16,ch = load<(load (s16) from %ir.arrayidx.i, !tbaa !8)> t16, t24, undef:i64
t27: i16,ch = load<(dereferenceable load (s16) from @si, !tbaa !8)> t16, GlobalAddress:i64<ptr @si> 0, undef:i64


  t915: ch = TokenFactor t16, t25:1
t916: ch = store<(store (s16) into %ir.arrayidx.i, !tbaa !8)> t915, t36, t24, undef:i64
  t919: ch = TokenFactor t916, t16
t920: i16,ch = load<(dereferenceable load (s16) from @x1a0, !tbaa !8)> t919, GlobalAddress:i64<ptr @x1a0> 0, undef:i64

Without this aliasing check, DAGCombiner would forward the stored value of t16 to t920. But t916 (t38) is actually aliased to t16, so the load result should be the content of t916 instead of t16.

Also, judging from the implementation of GatherAllAliases, being an operand of a TokenFactor node does not mean a node is unaliased with the other operands of that TokenFactor node. Isn't that the reason why it checks aliasing for each operand of the TokenFactor node?
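The check being discussed can be illustrated with a toy model. This is a sketch under stated assumptions rather than the patch itself: `MemOp`, `mayAlias`, and `findUniqueStore` are hypothetical names, and an address such as %ir.arrayidx.i is modelled simply as "unknown", which conservatively aliases everything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a memory node that is a TokenFactor operand.
struct MemOp {
  bool IsStore;
  bool KnownAddr; // false models an address like %ir.arrayidx.i
  int64_t Offset; // byte offset, only meaningful when KnownAddr is true
  uint64_t Size;  // access size in bytes
};

// Conservative alias query: an unknown address may alias anything.
bool mayAlias(const MemOp &A, const MemOp &B) {
  if (!A.KnownAddr || !B.KnownAddr)
    return true;
  return A.Offset < B.Offset + static_cast<int64_t>(B.Size) &&
         B.Offset < A.Offset + static_cast<int64_t>(A.Size);
}

// Returns the index of the store that can be forwarded to Load, or -1.
// A candidate store must match the load's location exactly, and no other
// operand of the TokenFactor may alias it.
int findUniqueStore(const std::vector<MemOp> &Ops, const MemOp &Load) {
  for (std::size_t I = 0; I < Ops.size(); ++I) {
    const MemOp &Cand = Ops[I];
    if (!Cand.IsStore || !Cand.KnownAddr || !Load.KnownAddr ||
        Cand.Offset != Load.Offset || Cand.Size != Load.Size)
      continue;
    // Give up if any other operand may alias the candidate store.
    for (std::size_t J = 0; J < Ops.size(); ++J)
      if (J != I && mayAlias(Ops[J], Cand))
        return -1;
    return static_cast<int>(I);
  }
  return -1;
}
```

With operands shaped like t916 (a store to an unknown address) and t16 (a store to the loaded location), the function returns -1, which corresponds to not forwarding t16 to t920. If the other operand is a store to a known, disjoint location, the matching store is returned.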

RolandF added inline comments. Dec 14 2022, 8:26 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17107 ↗	(On Diff #481584)	If we can't make the assumption that token factor operands are unaliased, I think that the non-store path needs to be checked more strongly. There are ways to touch memory besides stores. I would suggest to whitelist the kinds of operands we know are safe and to otherwise give up.
17112 ↗	(On Diff #481584)	ok

shchenz marked 2 inline comments as done. Dec 15 2022, 5:53 PM

shchenz added inline comments.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17107 ↗	(On Diff #481584)	The `GatherAllAliases()` call below collects all nodes in the `TokenFactor` (not limited to stores) that are aliased to the candidate store. If any node is found other than the candidate store itself, we give up. That sounds like enough to me. What do you think?

RolandF added inline comments. Jan 5 2023, 10:05 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17107 ↗	(On Diff #481584)	ok

@tstellar Hi, could you please help to confirm if the AMDGPU case changes are valid? Thank you very much.

nemanjai added inline comments. Jan 13 2023, 8:55 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17088–17090 ↗	(On Diff #481584)	This comment is misplaced. I think all this does is:
	// Look through CALLSEQ_START.
17094–17128 ↗	(On Diff #481584)	I think it would be better for readability if you put this into a `static` function. Something like:
	// Find a unique store that feeds this load if such a store exists.
	// This will look through CALLSEQ_START to allow forwarding
	// stores to the stack for byval arguments.
	static StoreSDNode *getUniqueStoreFeeding(LoadSDNode *Load, SelectionDAG &DAG);
17098–17103 ↗	(On Diff #481584)	Nit: this comment can simply be:
	// Look for unique store within the TokenFactor.
17113–17114 ↗	(On Diff #481584)	Nit:
	if (Aliases.empty() || (Aliases.size() == 1 && Aliases.front() == Store))
17122–17125 ↗	(On Diff #481584)	Nit: rather than repeating this code, it might aid readability if you collect all unaliased stores (either the only store chained to CALLSEQ_START or all the unaliased stores in the TokenFactor) and then look for one with a matching `BaseIndexOffset`.

address @nemanjai comments and rebase

Harbormaster completed remote builds in B211123: Diff 493798. Jan 31 2023, 5:46 PM

Thanks for the review, @nemanjai.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17094–17128 ↗	(On Diff #481584)	Done. It seems I have to use a member function, as `GatherAllAliases` is a member function.
17122–17125 ↗	(On Diff #481584)	Thanks, I get your point. A function that finds all unaliased stores in the `TokenFactor` may be time-consuming. Also, we don't care only about store nodes; we care about all nodes that write memory. So it seems to me that finding the candidate store first, and then checking the memory nodes in the `TokenFactor` for aliasing against that specific candidate, can save some compile time?

LGTM. The minor nits do not require another review.

I'm not sure if @tstellar has anything further to add or if he objects to the AMDGPU changes.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

365 ↗

(On Diff #493798)

I think a comment describing this function would be useful. Something like:

// Looks up the chain to find a unique (unaliased) store feeding the
// passed load. If no such store is found, returns a nullptr.
// Note: This will look past a CALLSEQ_START if the load is
//       chained to it so that it can find stack stores for byval
//       params.

17330 ↗

(On Diff #493798)

Nit: you can probably reduce nesting a bit with

if (!BasePtrST.equalBaseIndex(BasePtrLD, DAG, Offset))
  continue;

This revision is now accepted and ready to land. Feb 1 2023, 7:06 AM

This revision was landed with ongoing or failed builds. Feb 1 2023, 6:06 PM

Closed by commit rGf35a09daebd0: [DAGCombiner] handle more store value forwarding (authored by shchenz).

This revision was automatically updated to reflect the committed changes.

shchenz mentioned this in rG0a32e693e31f: [DAGCombiner][NFC] add testcases for D138899.

shchenz added a commit: rGf35a09daebd0: [DAGCombiner] handle more store value forwarding.

@shchenz You neither incorporated nor responded to the comments I left when approving the patch. Could you please respond to the two comments as to why you felt they did not need to be incorporated into the final patch?

shchenz mentioned this in rG63854f91d3ee: [DAGCombiner] fix comments for D138899; NFC. Feb 6 2023, 9:32 PM

I am really sorry, Nemanja (@nemanjai). I remember that I addressed your comments and tested the update when I committed this patch, but I must have forgotten to amend the changes into the final commit. 63854f91d3ee1056796a5ef27753648396cac6ec has been committed to address the two comments.

Heads up that we've bisected a test failure (https://bugs.chromium.org/p/chromium/issues/detail?id=1412740) to this commit. We're still investigating but would be interested to hear if others have found any issues too.

I believe this is causing a miscompile on this code.

$ cat /tmp/a.ll
source_filename = "../../chrome/browser/dom_distiller/dom_distiller_service_factory.cc"
target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-f64:32:64-f80:32-n8:16:32-S128"
target triple = "i686-unknown-linux-android24"

%0 = type { %1 }
%1 = type { %2 }
%2 = type { ptr }
%3 = type { %4 }
%4 = type { %5 }
%5 = type { ptr }
%6 = type { %7 }
%7 = type { %8 }
%8 = type { ptr }
%9 = type { %10 }
%10 = type { %11 }
%11 = type { ptr }

@g = external hidden unnamed_addr constant { [5 x ptr], [9 x ptr] }, align 4
@g2 = external hidden unnamed_addr constant { [5 x ptr] }, align 4

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
define hidden void @f(ptr noundef align 4 dereferenceable_or_null(48) %0, ptr noundef byval(%0) align 4 %1, ptr noundef byval(%3) align 4 %2, ptr noundef byval(%6) align 4 %3, ptr noundef byval(%9) align 4 %4) unnamed_addr #0 align 2 {
  %6 = alloca %0, align 4
  %7 = alloca %3, align 4
  %8 = alloca %6, align 4
  %9 = alloca %9, align 4
  store ptr getelementptr inbounds ({ [5 x ptr] }, ptr @g2, i32 0, inrange i32 0, i32 2), ptr %0, align 4
  %10 = getelementptr inbounds i8, ptr %0, i32 4
  %11 = load ptr, ptr %1, align 4
  store ptr null, ptr %1, align 4
  store ptr %11, ptr %6, align 4
  %12 = load ptr, ptr %2, align 4
  store ptr null, ptr %2, align 4
  store ptr %12, ptr %7, align 4
  %13 = load ptr, ptr %3, align 4
  store ptr null, ptr %3, align 4
  store ptr %13, ptr %8, align 4
  %14 = load ptr, ptr %4, align 4
  store ptr null, ptr %4, align 4
  store ptr %14, ptr %9, align 4
  tail call void @h1(ptr noundef align 4 dereferenceable_or_null(44) %10, ptr noundef nonnull byval(%0) align 4 %6, ptr noundef nonnull byval(%3) align 4 %7, ptr noundef nonnull byval(%6) align 4 %8, ptr noundef nonnull byval(%9) align 4 %9) #2
  store ptr getelementptr inbounds ({ [5 x ptr], [9 x ptr] }, ptr @g, i32 0, inrange i32 0, i32 2), ptr %0, align 4
  store ptr getelementptr inbounds ({ [5 x ptr], [9 x ptr] }, ptr @g, i32 0, inrange i32 1, i32 2), ptr %10, align 4
  call void @h2(ptr noundef align 4 dereferenceable_or_null(4) %4) #2
  call void @h3(ptr noundef align 4 dereferenceable_or_null(4) %3) #2
  call void @h4(ptr noundef align 4 dereferenceable_or_null(4) %2) #2
  call void @h5(ptr noundef align 4 dereferenceable_or_null(4) %1) #2
  ret void
}

; Function Attrs: minsize null_pointer_is_valid optsize
declare void @h1(ptr noundef align 4 dereferenceable_or_null(44), ptr noundef byval(%0) align 4, ptr noundef byval(%3) align 4, ptr noundef byval(%6) align 4, ptr noundef byval(%9) align 4) unnamed_addr #1

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
declare hidden void @h2(ptr noundef align 4 dereferenceable_or_null(4)) unnamed_addr #0 align 2

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
declare hidden void @h3(ptr noundef align 4 dereferenceable_or_null(4)) unnamed_addr #0 align 2

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
declare hidden void @h4(ptr noundef align 4 dereferenceable_or_null(4)) unnamed_addr #0 align 2

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
declare hidden void @h5(ptr noundef align 4 dereferenceable_or_null(4)) unnamed_addr #0 align 2

attributes #0 = { minsize nounwind null_pointer_is_valid optsize uwtable "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="i686" "target-features"="+cx8,+mmx,+sse,+sse2,+sse3,+ssse3,+x87" "tune-cpu"="generic" }
attributes #1 = { minsize null_pointer_is_valid optsize "frame-pointer"="non-leaf" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="i686" "target-features"="+cx8,+mmx,+sse,+sse2,+sse3,+ssse3,+x87" "tune-cpu"="generic" }
attributes #2 = { minsize nounwind optsize }

!llvm.linker.options = !{}
!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6}

!0 = !{i32 1, !"NumRegisterParameters", i32 0}
!1 = !{i32 7, !"Dwarf Version", i32 4}
!2 = !{i32 2, !"Debug Info Version", i32 3}
!3 = !{i32 1, !"wchar_size", i32 4}
!4 = !{i32 8, !"PIC Level", i32 2}
!5 = !{i32 7, !"uwtable", i32 2}
!6 = !{i32 7, !"frame-pointer", i32 1}

$ llc /tmp/a.ll # one commit before this change
...
	movl	8(%ebp), %ebx
	movl	$g2+8, (%ebx)
	leal	12(%ebp), %ecx
	movl	(%ecx), %eax
	movl	%ecx, %edx
	leal	16(%ebp), %ecx
	movl	(%ecx), %ecx
	movl	%eax, -40(%ebp)
	leal	20(%ebp), %edi
	movl	%ecx, -32(%ebp)
	movl	(%edi), %eax
	movl	%eax, -24(%ebp)
	leal	24(%ebp), %esi
	movl	(%esi), %eax
	movl	%eax, -16(%ebp)
	xorps	%xmm0, %xmm0
	movups	%xmm0, (%edx) ; sets 12(%ebp), 16(%ebp), 20(%ebp), 24(%ebp) to 0
	leal	4(%ebx), %eax
	movl	-16(%ebp), %ecx
	movl	%ecx, 16(%esp) ; stores -16(%ebp), which is also 24(%ebp), to 16(%esp)
	movl	-24(%ebp), %ecx
	movl	%ecx, 12(%esp)
	movl	-32(%ebp), %ecx
	movl	%ecx, 8(%esp)
	movl	-40(%ebp), %ecx
	movl	%ecx, 4(%esp)
	movl	%eax, (%esp)
	calll	h1@PLT
...
$ llc /tmp/a.ll # at this commit
...
	movl	8(%ebp), %ebx
	movl	$g2+8, (%ebx)
	leal	12(%ebp), %ecx
	movl	(%ecx), %eax
	movl	%ecx, %edx
	xorps	%xmm0, %xmm0
	leal	16(%ebp), %ecx
	movl	(%ecx), %ecx
	movl	%eax, -32(%ebp)
	leal	20(%ebp), %edi
	movl	%ecx, -24(%ebp)
	movl	(%edi), %eax
	movl	%eax, -16(%ebp)
	leal	24(%ebp), %esi
	movups	%xmm0, (%edx) ; sets 12(%ebp), 16(%ebp), 20(%ebp), 24(%ebp) to 0
	movl	(%esi), %eax ; loads 24(%ebp) to %eax, which is now 0
	movl	%eax, -40(%ebp)
	leal	4(%ebx), %ecx
	movl	%eax, 16(%esp) ; stores 0 to 16(%esp) which is wrong!!!
	movl	-16(%ebp), %eax
	movl	%eax, 12(%esp)
	movl	-24(%ebp), %eax
	movl	%eax, 8(%esp)
	movl	-32(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	%ecx, (%esp)
	calll	h1@PLT
...

see my added comments in the llc output

Going to revert this commit (and the follow up commit)

aeubanks reopened this revision. Feb 13 2023, 7:05 PM

This revision is now accepted and ready to land. Feb 13 2023, 7:05 PM

aeubanks mentioned this in rGac6219d0aeb7: Revert "[DAGCombiner] fix comments for D138899; NFC". Feb 13 2023, 7:07 PM

aeubanks added a reverting change: rG7c6b46e87eaf: Revert "[DAGCombiner] handle more store value forwarding".

As I see it, this patch is not to blame for the failure https://bugs.chromium.org/p/chromium/issues/detail?id=1412740. The patch exposes an issue in the pre-RA machine scheduler on the X86 target.

Before the machine scheduler:

176B      %12:gr32 = LEA32r %fixed-stack.0, 1, $noreg, 0, $noreg
192B      %6:gr32 = MOV32rm %12:gr32, 1, $noreg, 0, $noreg :: (dereferenceable load (s32) from %ir.4, align 16)     // the value that will be stored to 16($esp) is loaded before the zero-filling MOVUPSmr
208B      MOVUPSmr %18:gr32, 1, $noreg, 0, $noreg, %3:vr128 :: (store (s128) into %ir.1, align 4)
320B      MOV32mr $esp, 1, $noreg, 16, $noreg, %6:gr32 :: (store (s32))

After the machine scheduler:

264B      %12:gr32 = LEA32r %fixed-stack.0, 1, $noreg, 0, $noreg
272B      MOVUPSmr %18:gr32, 1, $noreg, 0, $noreg, %3:vr128 :: (store (s128) into %ir.1, align 4)
280B      %6:gr32 = MOV32rm %12:gr32, 1, $noreg, 0, $noreg :: (dereferenceable load (s32) from %ir.4, align 16)      // now the value is loaded after the zero-filling MOVUPSmr
320B      MOV32mr $esp, 1, $noreg, 16, $noreg, %6:gr32 :: (store (s32))

It seems the scheduler does not know that %6:gr32 = MOV32rm and MOVUPSmr are aliased.
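Why the two accesses alias can be checked with simple interval arithmetic. The following is a minimal sketch, not scheduler code: `MemAccess` and `mayOverlap` are hypothetical names, and the offsets come from the comments in the llc output above, where the movups writes 16 bytes starting at 12(%ebp) and the load reads 4 bytes at 24(%ebp).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a memory access relative to a common base register
// (here %ebp): a byte offset and a size in bytes.
struct MemAccess {
  int64_t Offset;
  uint64_t Size;
};

// Two accesses off the same base overlap when their half-open byte ranges
// [Offset, Offset + Size) intersect.
bool mayOverlap(const MemAccess &A, const MemAccess &B) {
  return A.Offset < B.Offset + static_cast<int64_t>(B.Size) &&
         B.Offset < A.Offset + static_cast<int64_t>(A.Size);
}
```

Here the store covers bytes [12, 28) and the load covers bytes [24, 28), so the ranges intersect and the load must not be moved below the store. A 4-byte load at offset 28 would fall outside the stored range.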

@aeubanks Thanks for the quick action. Could you please verify whether the llc option -enable-misched=false solves your issue? I am not very familiar with the X86 target. Do you have any suggestions on how to fix this X86 scheduling issue? Should we create a GitHub issue for the X86 target? Thank you!

I'm not familiar with the scheduler, but it looks to me like a general scheduler problem rather than an X86-specific one. I have filed 60744.
BTW, I found that the problem cannot be reproduced without this patch. I guess the change just happens to trigger the scheduling issue.

Thanks for creating the bug. Yes, reproducing the issue requires applying this patch.

that bug has been fixed, now this can reland (hopefully assuming the fix isn't reverted for some reason), sorry for the trouble

may want to double check the reproducer before relanding

I waited some days for potential issues with D144711. Since no issues have been reported, I will reland this patch.

Closed by commit rG4f0ed16a46c5: Reland rGf35a09daebd0a90daa536432e62a2476f708150d and… (authored by shchenz). Mar 12 2023, 7:12 PM

This revision was automatically updated to reflect the committed changes.

shchenz added a commit: rG4f0ed16a46c5: Reland rGf35a09daebd0a90daa536432e62a2476f708150d and….

Revision Contents

Path	Size
llvm/lib/Target/PowerPC/PPCMIPeephole.cpp	116 lines
llvm/test/CodeGen/PowerPC/byval-lhs.ll	8 lines
llvm/test/CodeGen/PowerPC/convert-load-to-copy.mir	12 lines
llvm/test/CodeGen/PowerPC/vsx-p9.ll	4 lines

Diff 478465

llvm/lib/Target/PowerPC/PPCMIPeephole.cpp

private:
  ...
  // Perform peepholes.
  bool eliminateRedundantCompare();
  bool eliminateRedundantTOCSaves(std::map<MachineInstr *, bool> &TOCSaves);
  bool combineSEXTAndSHL(MachineInstr &MI, MachineInstr *&ToErase);
  bool emitRLDICWhenLoweringJumpTables(MachineInstr &MI);
  void UpdateTOCSaves(std::map<MachineInstr *, bool> &TOCSaves,
                      MachineInstr *MI);
  bool optimizeLoad(MachineInstr &MI);

public:

  void getAnalysisUsage(AnalysisUsage &AU) const override {
    AU.addRequired<MachineDominatorTree>();
    AU.addRequired<MachinePostDominatorTree>();
    AU.addRequired<MachineBlockFrequencyInfo>();
    AU.addPreserved<MachineDominatorTree>();
    ...
    if (MDT->dominates(CurrInst, MI)) {
      Keep = false;
      break;
    }
  }
  // Add new instruction to map.
  TOCSaves[MI] = Keep;
}

// Try to eliminate a load whose value is available from a previous store.
// For example, for byval parameters, PPC generates loads from the caller's
// stack frame. If there is a store to the same stack location, this may cause
// a load-hit-store issue. Because the load is generated in ISEL, mem2reg has
// no chance to optimize this pattern.
bool PPCMIPeephole::optimizeLoad(MachineInstr &MI) {
  // For now, only enable for pre RA.
  if (!MRI->isSSA())
    return false;

  if (MI.isInlineAsm())
    return false;

  if (!MI.hasOneMemOperand() || !MI.memoperands()[0]->isLoad())
    return false;

  const MCInstrDesc &MCID = MI.getDesc();
  if (MCID.getNumDefs() != 1)
    return false;

  bool SawStore = false;
  if (!MI.isSafeToMove(nullptr, SawStore))
    return false;

  auto IsValidStore = [&](MachineInstr &StoreMI) {
    if (StoreMI.isInlineAsm())
      return false;

    if (!StoreMI.hasOneMemOperand() || !StoreMI.memoperands()[0]->isStore())
      return false;

    unsigned StoreDefNum = StoreMI.getNumExplicitDefs();
    unsigned StoreOpNum = StoreMI.getNumExplicitOperands();

    // Store explicit use operands should be 1 bigger than load explicit use
    // operands. The extra 1 use operand is the value stored to the memory.
    if (StoreOpNum - StoreDefNum != MI.getNumExplicitOperands())
      return false;

    if (!StoreMI.explicit_operands().begin()->isReg())
      return false;

    Register DestReg = MI.getOperand(0).getReg();
    Register SrcReg = StoreMI.explicit_operands().begin()->getReg();

    // The content stored to memory should have the same type as the load
    // instruction's single def.
    if (MRI->getRegClass(DestReg) != MRI->getRegClass(SrcReg))
      return false;

    if (StoreMI.memoperands()[0]->getMemoryType().getSizeInBytes() !=
        MI.memoperands()[0]->getMemoryType().getSizeInBytes())
      return false;

    // The extension type must also be the same.
    if (TII->isSignOrZeroExtended(SrcReg, 0, MRI) !=
        TII->isSignOrZeroExtended(DestReg, 0, MRI))
      return false;

    // Each operand for the memory location of the store is the same as for
    // the load. Ignore the first one, which is the value stored to memory.
    for (unsigned UseOp = StoreDefNum + 1; UseOp < StoreOpNum; UseOp++)
      if (!StoreMI.getOperand(UseOp).isIdenticalTo(
              MI.getOperand(UseOp - StoreDefNum)))
        return false;

    return true;
  };

  // To reduce complexity, only check for a possible store in the same block.
  // This is enough for handling the motivating load-hit-store issue.
  MachineBasicBlock::reverse_iterator E = MI.getParent()->rend(), It = MI;
  It++;
  MachineInstr *Store = nullptr;

  for (; It != E; ++It) {
    if (It->isLoadFoldBarrier() && !It->mayStore())
      return false;

    if (!It->mayStore())
      continue;

    if (IsValidStore(*It)) {
      Store = &*It;
      break;
    }

    if (MI.mayAlias(/* AA */ nullptr, *It, /* UseTBAA */ false))
      return false;
  }

  if (!Store)
    return false;

  LLVM_DEBUG(dbgs() << "Replacing load instruction "; MI.dump());
  LLVM_DEBUG(dbgs() << " Related Store is "; Store->dump());

  auto StoreOp = Store->explicit_operands().begin();
  if (StoreOp->isKill())
    StoreOp->setIsKill(false);

  MachineInstr *Copy = BuildMI(*MI.getParent(), &MI, MI.getDebugLoc(),
                               TII->get(PPC::COPY), MI.getOperand(0).getReg())
                           .add(*StoreOp);
  (void)Copy;
  LLVM_DEBUG(dbgs() << "with COPY "; Copy->dump());

  return true;
}

// This function returns a list of all PHI nodes in the tree starting from
// the RootPHI node. We perform a BFS traversal to get an ordered list of nodes.
// The list initially only contains the root PHI. When we visit a PHI node, we
// add it to the list. We continue to look for other PHI node operands while
// there are nodes to visit in the list. The function returns false if the
// optimization cannot be applied on this tree.
static bool collectUnprimedAccPHIs(MachineRegisterInfo *MRI,
                                   MachineInstr *RootPHI,
...
    for (MachineInstr &MI : MBB) {
      ...
        ToErase = &MI;
        continue;
      }

      // Ignore debug instructions.
      if (MI.isDebugInstr())
        continue;

      if (MI.mayLoad() && optimizeLoad(MI)) {
        ToErase = &MI;
        continue;
      }

      // Per-opcode peepholes.
      switch (MI.getOpcode()) {

      default:
        break;
      case PPC::COPY: {
        Register Src = MI.getOperand(1).getReg();
        Register Dst = MI.getOperand(0).getReg();
...

llvm/test/CodeGen/PowerPC/byval-lhs.ll


	define void @bar1(i64 %a) nounwind {			define void @bar1(i64 %a) nounwind {
	; LE-LABEL: bar1:			; LE-LABEL: bar1:
	; LE: # %bb.0:			; LE: # %bb.0:
	; LE-NEXT: mflr r0			; LE-NEXT: mflr r0
	; LE-NEXT: stdu r1, -48(r1)			; LE-NEXT: stdu r1, -48(r1)
	; LE-NEXT: std r0, 64(r1)			; LE-NEXT: std r0, 64(r1)
	; LE-NEXT: std r3, 40(r1)			; LE-NEXT: std r3, 40(r1)
	; LE-NEXT: ld r3, 40(r1)
	; LE-NEXT: bl f0			; LE-NEXT: bl f0
	; LE-NEXT: nop			; LE-NEXT: nop
	; LE-NEXT: addi r1, r1, 48			; LE-NEXT: addi r1, r1, 48
	; LE-NEXT: ld r0, 16(r1)			; LE-NEXT: ld r0, 16(r1)
	; LE-NEXT: mtlr r0			; LE-NEXT: mtlr r0
	; LE-NEXT: blr			; LE-NEXT: blr
	;			;
	; AIX-LABEL: bar1:			; AIX-LABEL: bar1:
	; AIX: # %bb.0:			; AIX: # %bb.0:
	; AIX-NEXT: mflr r0			; AIX-NEXT: mflr r0
	; AIX-NEXT: stdu r1, -128(r1)			; AIX-NEXT: stdu r1, -128(r1)
	; AIX-NEXT: std r0, 144(r1)			; AIX-NEXT: std r0, 144(r1)
	; AIX-NEXT: std r3, 120(r1)			; AIX-NEXT: std r3, 120(r1)
	; AIX-NEXT: ld r3, 120(r1)
	; AIX-NEXT: bl .f0[PR]			; AIX-NEXT: bl .f0[PR]
	; AIX-NEXT: nop			; AIX-NEXT: nop
	; AIX-NEXT: addi r1, r1, 128			; AIX-NEXT: addi r1, r1, 128
	; AIX-NEXT: ld r0, 16(r1)			; AIX-NEXT: ld r0, 16(r1)
	; AIX-NEXT: mtlr r0			; AIX-NEXT: mtlr r0
	; AIX-NEXT: blr			; AIX-NEXT: blr
	%s = alloca %struct.type8, align 8			%s = alloca %struct.type8, align 8
	store i64 %a, ptr %s, align 8			store i64 %a, ptr %s, align 8
	%call = tail call ptr @f0(ptr noundef nonnull byval(%struct.type8) align 8 %s)			%call = tail call ptr @f0(ptr noundef nonnull byval(%struct.type8) align 8 %s)
	ret void			ret void
	}			}

	define void @bar2(i64 %a) nounwind {			define void @bar2(i64 %a) nounwind {
	; LE-LABEL: bar2:			; LE-LABEL: bar2:
	; LE: # %bb.0:			; LE: # %bb.0:
	; LE-NEXT: mflr r0			; LE-NEXT: mflr r0
	; LE-NEXT: stdu r1, -48(r1)			; LE-NEXT: stdu r1, -48(r1)
				; LE-NEXT: mr r4, r3
	; LE-NEXT: std r0, 64(r1)			; LE-NEXT: std r0, 64(r1)
	; LE-NEXT: std r3, 32(r1)			; LE-NEXT: std r3, 32(r1)
	; LE-NEXT: std r3, 40(r1)			; LE-NEXT: std r3, 40(r1)
	; LE-NEXT: ld r4, 40(r1)
	; LE-NEXT: ld r3, 32(r1)
	; LE-NEXT: bl f1			; LE-NEXT: bl f1
	; LE-NEXT: nop			; LE-NEXT: nop
	; LE-NEXT: addi r1, r1, 48			; LE-NEXT: addi r1, r1, 48
	; LE-NEXT: ld r0, 16(r1)			; LE-NEXT: ld r0, 16(r1)
	; LE-NEXT: mtlr r0			; LE-NEXT: mtlr r0
	; LE-NEXT: blr			; LE-NEXT: blr
	;			;
	; AIX-LABEL: bar2:			; AIX-LABEL: bar2:
	; AIX: # %bb.0:			; AIX: # %bb.0:
	; AIX-NEXT: mflr r0			; AIX-NEXT: mflr r0
	; AIX-NEXT: stdu r1, -128(r1)			; AIX-NEXT: stdu r1, -128(r1)
				; AIX-NEXT: mr r4, r3
	; AIX-NEXT: std r0, 144(r1)			; AIX-NEXT: std r0, 144(r1)
	; AIX-NEXT: std r3, 112(r1)			; AIX-NEXT: std r3, 112(r1)
	; AIX-NEXT: std r3, 120(r1)			; AIX-NEXT: std r3, 120(r1)
	; AIX-NEXT: ld r4, 120(r1)
	; AIX-NEXT: ld r3, 112(r1)
	; AIX-NEXT: bl .f1[PR]			; AIX-NEXT: bl .f1[PR]
	; AIX-NEXT: nop			; AIX-NEXT: nop
	; AIX-NEXT: addi r1, r1, 128			; AIX-NEXT: addi r1, r1, 128
	; AIX-NEXT: ld r0, 16(r1)			; AIX-NEXT: ld r0, 16(r1)
	; AIX-NEXT: mtlr r0			; AIX-NEXT: mtlr r0
	; AIX-NEXT: blr			; AIX-NEXT: blr
	%s = alloca %struct.type16, align 8			%s = alloca %struct.type16, align 8
	%index1 = getelementptr inbounds i64, ptr %s, i32 0			%index1 = getelementptr inbounds i64, ptr %s, i32 0
	store i64 %a, ptr %index1, align 8			store i64 %a, ptr %index1, align 8
	%index2 = getelementptr inbounds i64, ptr %s, i32 1			%index2 = getelementptr inbounds i64, ptr %s, i32 1
	store i64 %a, ptr %index2, align 8			store i64 %a, ptr %index2, align 8
	%call = tail call ptr @f1(ptr noundef nonnull byval(%struct.type16) align 8 %s)			%call = tail call ptr @f1(ptr noundef nonnull byval(%struct.type16) align 8 %s)
	ret void			ret void
	}			}

llvm/test/CodeGen/PowerPC/convert-load-to-copy.mir

(12 lines hidden; context: body: |)
   bb.0.entry:
     liveins: $x3

     ; CHECK-LABEL: name: test
     ; CHECK: liveins: $x3
     ; CHECK-NEXT: {{ $}}
     ; CHECK-NEXT: [[COPY:%[0-9]+]]:g8rc = COPY $x3
     ; CHECK-NEXT: STD [[COPY]], 0, %stack.0 :: (store (s64) into %stack.0)
-    ; CHECK-NEXT: [[LD:%[0-9]+]]:g8rc = LD 0, %stack.0 :: (load (s64) from %stack.0)
-    ; CHECK-NEXT: $x3 = COPY [[LD]]
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:g8rc = COPY [[COPY]]
+    ; CHECK-NEXT: $x3 = COPY [[COPY1]]
     ; CHECK-NEXT: BLR8 implicit $lr8, implicit $rm
     %0:g8rc = COPY $x3
     STD %0:g8rc, 0, %stack.0 :: (store (s64) into %stack.0)
     %1:g8rc = LD 0, %stack.0 :: (load (s64) from %stack.0)
     $x3 = COPY %1:g8rc
     BLR8 implicit $lr8, implicit $rm
 ...
 ---
(101 lines hidden; context: body: |)
   bb.0.entry:
     liveins: $x3, $x4

     ; CHECK-LABEL: name: test6
     ; CHECK: liveins: $x3, $x4
     ; CHECK-NEXT: {{ $}}
     ; CHECK-NEXT: [[COPY:%[0-9]+]]:g8rc = COPY $x3
     ; CHECK-NEXT: STDX [[COPY]], $x4, %stack.0 :: (store (s64) into %stack.0)
-    ; CHECK-NEXT: [[LDX:%[0-9]+]]:g8rc = LDX $x4, %stack.0 :: (load (s64) from %stack.0)
-    ; CHECK-NEXT: $x3 = COPY [[LDX]]
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:g8rc = COPY [[COPY]]
+    ; CHECK-NEXT: $x3 = COPY [[COPY1]]
     ; CHECK-NEXT: BLR8 implicit $lr8, implicit $rm
     %0:g8rc = COPY $x3
     STDX %0:g8rc, $x4, %stack.0 :: (store (s64) into %stack.0)
     %1:g8rc = LDX $x4, %stack.0 :: (load (s64) from %stack.0)
     $x3 = COPY %1:g8rc
     BLR8 implicit $lr8, implicit $rm
 ...
 ---
(51 lines hidden; context: body: |)
   bb.0.entry:
     liveins: $x3, $x4

     ; CHECK-LABEL: name: test9
     ; CHECK: liveins: $x3, $x4
     ; CHECK-NEXT: {{ $}}
     ; CHECK-NEXT: [[COPY:%[0-9]+]]:g8rc = COPY $x3
     ; CHECK-NEXT: STD [[COPY]], 0, %stack.0 :: (store monotonic (s64) into %stack.0)
-    ; CHECK-NEXT: [[LD:%[0-9]+]]:g8rc = LD 0, %stack.0 :: (load (s64) from %stack.0)
-    ; CHECK-NEXT: $x3 = COPY [[LD]]
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:g8rc = COPY [[COPY]]
+    ; CHECK-NEXT: $x3 = COPY [[COPY1]]
     ; CHECK-NEXT: BLR8 implicit $lr8, implicit $rm
     %0:g8rc = COPY $x3
     STD %0:g8rc, 0, %stack.0 :: (store monotonic (s64) into %stack.0)
     %1:g8rc = LD 0, %stack.0 :: (load (s64) from %stack.0)
     $x3 = COPY %1:g8rc
     BLR8 implicit $lr8, implicit $rm
 ...
 ---
(49 lines hidden)

llvm/test/CodeGen/PowerPC/vsx-p9.ll

(131 lines hidden)
 ; CHECK: stxv 34,
 ; CHECK: bl sink
   %22 = load <2 x double>, ptr @vda, align 16
   %23 = load <2 x double>, ptr @vdb, align 16
   %add.i12 = fadd <2 x double> %22, %23
   tail call void (...) @sink(<2 x double> %add.i12)
 ; CHECK: lxv 0, 0(3)
 ; CHECK: lxv 1, 0(3)
-; CHECK: xvadddp 0, 0, 1
-; CHECK: stxv 0,
+; CHECK: xvadddp 34, 0, 1
+; CHECK: stxv 34,

shchenz (author, inline comment):
    xvadddp 0, 0, 1
    stxv 0, 32(1)
    lxv 34, 32(1)   ------> Code change is because this load is eliminated.
    bl sink

 ; CHECK: bl sink
   ret void
 }

 ; Function Attrs: nounwind readnone
 define <4 x float> @testXVIEXPSP(<4 x i32> %a, <4 x i32> %b) {
 entry:
   %0 = tail call <4 x float> @llvm.ppc.vsx.xviexpsp(<4 x i32> %a, <4 x i32> %b)
(298 lines hidden)
