This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] handle more store value forwarding
ClosedPublic

Authored by shchenz on Nov 28 2022, 11:13 PM.

Details

Summary

When lowering calls on target like PPC, some stack loads will be generated
for by value parameters. Node CALLSEQ_START prevents such loads from
being combined.

Suggested by @RolandF, this patch removes the unnecessary loads for the
byval parameter by extending ForwardStoreValueToDirectLoad

Diff Detail

Event Timeline

shchenz created this revision.Nov 28 2022, 11:13 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2022, 11:13 PM
shchenz requested review of this revision.Nov 28 2022, 11:13 PM
shchenz added inline comments.Nov 28 2022, 11:24 PM
llvm/test/CodeGen/PowerPC/vsx-p9.ll
141
xvadddp 0, 0, 1
 stxv 0, 32(1)
 lxv 34, 32(1)  ------> Code change is because this load is eliminated.
 bl sink
nemanjai requested changes to this revision.Nov 30 2022, 6:56 AM

Please add some information about other approaches you considered/attempted to solve this problem. I really don't like the idea of doing store forwarding in the peephole. Are there no existing passes that can perform this type of store forwarding (and most likely more general forms of it)? While I am not all that familiar with MemorySSA, it would seem that this is the exact type of thing it should handle effortlessly.

This revision now requires changes to proceed.Nov 30 2022, 6:56 AM

Thanks for taking a look @nemanjai

Please add some information about other approaches you considered/attempted to solve this problem.

Hope I explained enough for the reason I didn't do this in DAG-ISEL while lower calls with byval parameter. And PPC is working on switching to Global-ISEL, so that's another reason it is not so good to implement the load elimination for byval parameter in DAG-ISEL.

My first thought was implementing this in MachineCSE. But I did't do it there because:

  • MachineCSE is a target independent pass. From what I got, byval parameter for the call seems will not be lowered to loads to the caller's stack frame for all the other targets(at least some?). Target independent peephole was also excluded because of this.
  • MachineCSE seems only commoning instructions which have same opcodes, i.e., it can handle load elimination for pattern load addr ... load addr.

While I am not all that familiar with MemorySSA, it would seem that this is the exact type of thing it should handle effortlessly.

MemorySSA is not an analysis pass that can be used by MIR pass for now. And MemorySSA depends on AAResultsWrapperPass which may leads to compiler time concern. IR level pass mem2reg also does not require MemorySSA to do the transformation.

I thought there is no other pass which can implement this opt in PPC pipeline except the above ones, but maybe you can give some insight here.

shchenz requested review of this revision.Dec 4 2022, 9:06 PM

request review for further discussion

scui added a subscriber: RolandF.Dec 6 2022, 3:55 PM

There is an existing place in codegen to handle forwarding a store value to the corresponding load - ForwardStoreValueToDirectLoad in lib/CodeGen/SelectionDAG/DAGCombiner.cpp. Handling this in codegen would not require putting the code in multiple places. The code there is relatively simple - it just looks to see if the thing on the chain immediately before the load is the setting store. For the first case in byval-lhs.ll there is a CALLSEQ_START on the chain in between the load and the store. Maybe it is possible to look past that in the chain to see the store. Since the load is for a call there may be a register copy required to replace the load. Where there is a sequence of multiple stores followed by multiple loads it would require looking back in the chain past loads and past stores to fixed stack locations that do not overlap. I don't know if that is allowed, but in theory it could work.

I would suggest to discuss offline with @nemanjai if such an approach is viable and preferred. If not, I think that is a reasonable argument for an MIR level approach.

shchenz updated this revision to Diff 481578.Dec 9 2022, 2:26 AM
shchenz added reviewers: tstellar, RKSimon, sdardis, RolandF.

address @RolandF comment, fixing in DAG Combiner and add more reviewers for other targets

shchenz retitled this revision from [PowerPC] add a peephole to eliminate unnecessary load to [DAGCombiner] handle more store value forwarding.Dec 9 2022, 2:29 AM
shchenz edited the summary of this revision. (Show Details)
shchenz added inline comments.
llvm/test/CodeGen/X86/fastcc-byval.ll
2 ↗(On Diff #481578)

Changing from:

_bar:                                   ## @bar
## %bb.0:
	subl	$12, %esp
	movl	$1, 8(%esp)
	movl	8(%esp), %eax
	movl	%eax, (%esp)
	calll	_foo
	movl	8(%esp), %eax
	addl	$12, %esp
	retl

To:

_bar:                                   ## @bar
## %bb.0:
	subl	$12, %esp
	movl	$1, 8(%esp)
	movl	$1, (%esp)
	calll	_foo
	movl	8(%esp), %eax
	addl	$12, %esp
	retl

Seems one load is saved, needs confirmation from X86 experts.

RKSimon added inline comments.Dec 9 2022, 2:38 AM
llvm/test/CodeGen/X86/fastcc-byval.ll
2 ↗(On Diff #481578)

Please can you rebase after rG2d63646afdb0e4d8c063fe057916d7900f681968?

RKSimon added inline comments.Dec 9 2022, 2:44 AM
llvm/test/CodeGen/X86/fastcc-byval.ll
2 ↗(On Diff #481578)
shchenz added inline comments.Dec 9 2022, 2:50 AM
llvm/test/CodeGen/X86/fastcc-byval.ll
2 ↗(On Diff #481578)

Nice, thank you!

RolandF added inline comments.Dec 12 2022, 10:07 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17112 ↗(On Diff #481584)

Is this check really needed for token factor operands? They are supposed to be independent. Aliasing seems like an order dependence.

shchenz marked an inline comment as done.Dec 12 2022, 10:03 PM
shchenz added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17112 ↗(On Diff #481584)

This is what I originally thought, then I met following case:

t16: ch = store<(store (s16) into @x1a0, !tbaa !8)> t15, t13, GlobalAddress:i64<ptr @x1a0> 0, undef:i64
  t18: i16,ch = load<(dereferenceable load (s16) from @x1a2, !tbaa !8)> t16, GlobalAddress:i64<ptr @x1a2> 0, undef:i64
  t20: i16,ch = load<(dereferenceable load (s16) from @i, !tbaa !8)> t16, GlobalAddress:i64<ptr @i> 0, undef:i64
  t25: i16,ch = load<(load (s16) from %ir.arrayidx.i, !tbaa !8)> t16, t24, undef:i64
  t27: i16,ch = load<(dereferenceable load (s16) from @si, !tbaa !8)> t16, GlobalAddress:i64<ptr @si> 0, undef:i64

    t37: ch = TokenFactor t18:1, t20:1, t25:1, t27:1

  t38: ch = store<(store (s16) into %ir.arrayidx.i, !tbaa !8)> t37, t36, t24, undef:i64

  t41: i16,ch = load<(dereferenceable load (s16) from @x1a0, !tbaa !8)> t38, GlobalAddress:i64<ptr @x1a0> 0, undef:i64
>
t16: ch = store<(store (s16) into @x1a0, !tbaa !8)> t15, t13, GlobalAddress:i64<ptr @x1a0> 0, undef:i64
t18: i16,ch = load<(dereferenceable load (s16) from @x1a2, !tbaa !8)> t16, GlobalAddress:i64<ptr @x1a2> 0, undef:i64
t20: i16,ch = load<(dereferenceable load (s16) from @i, !tbaa !8)> t16, GlobalAddress:i64<ptr @i> 0, undef:i64

t25: i16,ch = load<(load (s16) from %ir.arrayidx.i, !tbaa !8)> t16, t24, undef:i64
t27: i16,ch = load<(dereferenceable load (s16) from @si, !tbaa !8)> t16, GlobalAddress:i64<ptr @si> 0, undef:i64


  t915: ch = TokenFactor t16, t25:1
t916: ch = store<(store (s16) into %ir.arrayidx.i, !tbaa !8)> t915, t36, t24, undef:i64
  t919: ch = TokenFactor t916, t16
t920: i16,ch = load<(dereferenceable load (s16) from @x1a0, !tbaa !8)> t919, GlobalAddress:i64<ptr @x1a0> 0, undef:i64

DAGCombiner will forward the store value of t16 to t920 if no this aliasing check. But actually t916(t38) is aliased to t16, so the load result should be the content of t916 instead of t16.

And from the implementation of GatherAllAliases, one operand in a TokenFactor node does not mean it is not aliased to other operands of the TokenFactor node. That's the reason why it checks aliasing by checking each child operand of the TokenFactor node?

RolandF added inline comments.Dec 14 2022, 8:26 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17107 ↗(On Diff #481584)

If we can't make the assumption that token factor operands are unaliased, I think that the non-store path needs to be checked more strongly. There are ways to touch memory besides stores. I would suggest to whitelist the kinds of operands we know are safe and to otherwise give up.

17112 ↗(On Diff #481584)

ok

shchenz marked 2 inline comments as done.Dec 15 2022, 5:53 PM
shchenz added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17107 ↗(On Diff #481584)

In the below GatherAllAliases(), it will collect all nodes in the TokenFactor(not limited to store) that are aliased to the candidate Store. If any nodes found except the candidate store itself, we will give up. That sounds like enough, what do you think?

RolandF added inline comments.Jan 5 2023, 10:05 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17107 ↗(On Diff #481584)

ok

shchenz marked an inline comment as done.Jan 10 2023, 12:32 AM

@tstellar Hi, could you please help to confirm if the AMDGPU case changes are valid? Thank you very much.

nemanjai added inline comments.Jan 13 2023, 8:55 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17088–17090 ↗(On Diff #481584)

This comment is misplaced. I think all this does is:

// Look through CALLSEQ_START.
17094–17128 ↗(On Diff #481584)

I think it would be better for readability if you put this into a static function. Something like:

// Find a unique store that feeds this load if such a store exists.
// This will look through CALLSEQ_START to allow forwarding
// stores to the stack for byval arguments.
static StoreSDNode *getUniqueStoreFeeding(LoadSDNode *Load, SelectionDAG &DAG);
17098–17103 ↗(On Diff #481584)

Nit: this comment can simply be:

// Look for unique store within the TokenFactor.
17113–17114 ↗(On Diff #481584)

Nit:

if (Aliases.empty() || (Aliases.size() == 1 && Aliases.front() == Store)
17122–17125 ↗(On Diff #481584)

Nit: rather than repeating this code, it might aid readability if you collect all unaliased stores (either the only store chained to CALLSEQ_START or all the unaliased stores in the TokenFactor) and then look for one with a matching BaseIndexOffset.

shchenz updated this revision to Diff 493798.Jan 31 2023, 5:46 PM
shchenz marked 5 inline comments as done.

address @nemanjai comments and rebase

Thanks for review @nemanjai .

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17094–17128 ↗(On Diff #481584)

Done. Seems I have to use a member function as GatherAllAliases is a member function.

17122–17125 ↗(On Diff #481584)

Thanks, I get your point. A function that finds all unaliased stores in TokenFactor may be time consuming. And we don't just care about store nodes, we care all nodes that write memory. So to me finding the candidate store first and then check alias with memory nodes in TokenFactor with the specific candidate store can save some compile time?

nemanjai accepted this revision.Feb 1 2023, 7:06 AM

LGTM. The minor nits do not require another review.

I'm not sure if @tstellar has anything further to add or if he objects to the AMDGPU changes.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
365 ↗(On Diff #493798)

I think a comment describing this function would be useful. Something like:

// Looks up the chain to find a unique (unaliased) store feeding the
// passed load. If no such store is found, returns a nullptr.
// Note: This will look past a CALLSEQ_START if the load is
//       chained to it so that it can find stack stores for byval
//       params.
17330 ↗(On Diff #493798)

Nit: you can probably reduce nesting a bit with

if (!BasePtrST.equalBaseIndex(BasePtrLD, DAG, Offset))
  continue;
This revision is now accepted and ready to land.Feb 1 2023, 7:06 AM
This revision was landed with ongoing or failed builds.Feb 1 2023, 6:06 PM
This revision was automatically updated to reflect the committed changes.

@shchenz You neither incorporated nor responded to the comments I left when approving the patch. Could you please respond to the two comments as to why you felt they did not need to be incorporated into the final patch?

@shchenz You neither incorporated nor responded to the comments I left when approving the patch. Could you please respond to the two comments as to why you felt they did not need to be incorporated into the final patch?

I am really sorry, Nemanja. @nemanjai I remember I addressed your comments and tested the new update when I committed this patch. I must forget to amend the changes to the final commit. 63854f91d3ee1056796a5ef27753648396cac6ec is committed for the two comments.

hans added a subscriber: hans.Feb 10 2023, 5:37 AM

Heads up that we've bisected a test failure (https://bugs.chromium.org/p/chromium/issues/detail?id=1412740) to this commit. We're still investigating but would be interested to hear if others have found any issues too.

I believe this is causing a miscompile on this code.

$ cat /tmp/a.ll
source_filename = "../../chrome/browser/dom_distiller/dom_distiller_service_factory.cc"
target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-f64:32:64-f80:32-n8:16:32-S128"
target triple = "i686-unknown-linux-android24"

%0 = type { %1 }
%1 = type { %2 }
%2 = type { ptr }
%3 = type { %4 }
%4 = type { %5 }
%5 = type { ptr }
%6 = type { %7 }
%7 = type { %8 }
%8 = type { ptr }
%9 = type { %10 }
%10 = type { %11 }
%11 = type { ptr }

@g = external hidden unnamed_addr constant { [5 x ptr], [9 x ptr] }, align 4
@g2 = external hidden unnamed_addr constant { [5 x ptr] }, align 4

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
define hidden void @f(ptr noundef align 4 dereferenceable_or_null(48) %0, ptr noundef byval(%0) align 4 %1, ptr noundef byval(%3) align 4 %2, ptr noundef byval(%6) align 4 %3, ptr noundef byval(%9) align 4 %4) unnamed_addr #0 align 2 {
  %6 = alloca %0, align 4
  %7 = alloca %3, align 4
  %8 = alloca %6, align 4
  %9 = alloca %9, align 4
  store ptr getelementptr inbounds ({ [5 x ptr] }, ptr @g2, i32 0, inrange i32 0, i32 2), ptr %0, align 4
  %10 = getelementptr inbounds i8, ptr %0, i32 4
  %11 = load ptr, ptr %1, align 4
  store ptr null, ptr %1, align 4
  store ptr %11, ptr %6, align 4
  %12 = load ptr, ptr %2, align 4
  store ptr null, ptr %2, align 4
  store ptr %12, ptr %7, align 4
  %13 = load ptr, ptr %3, align 4
  store ptr null, ptr %3, align 4
  store ptr %13, ptr %8, align 4
  %14 = load ptr, ptr %4, align 4
  store ptr null, ptr %4, align 4
  store ptr %14, ptr %9, align 4
  tail call void @h1(ptr noundef align 4 dereferenceable_or_null(44) %10, ptr noundef nonnull byval(%0) align 4 %6, ptr noundef nonnull byval(%3) align 4 %7, ptr noundef nonnull byval(%6) align 4 %8, ptr noundef nonnull byval(%9) align 4 %9) #2
  store ptr getelementptr inbounds ({ [5 x ptr], [9 x ptr] }, ptr @g, i32 0, inrange i32 0, i32 2), ptr %0, align 4
  store ptr getelementptr inbounds ({ [5 x ptr], [9 x ptr] }, ptr @g, i32 0, inrange i32 1, i32 2), ptr %10, align 4
  call void @h2(ptr noundef align 4 dereferenceable_or_null(4) %4) #2
  call void @h3(ptr noundef align 4 dereferenceable_or_null(4) %3) #2
  call void @h4(ptr noundef align 4 dereferenceable_or_null(4) %2) #2
  call void @h5(ptr noundef align 4 dereferenceable_or_null(4) %1) #2
  ret void
}

; Function Attrs: minsize null_pointer_is_valid optsize
declare void @h1(ptr noundef align 4 dereferenceable_or_null(44), ptr noundef byval(%0) align 4, ptr noundef byval(%3) align 4, ptr noundef byval(%6) align 4, ptr noundef byval(%9) align 4) unnamed_addr #1

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
declare hidden void @h2(ptr noundef align 4 dereferenceable_or_null(4)) unnamed_addr #0 align 2

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
declare hidden void @h3(ptr noundef align 4 dereferenceable_or_null(4)) unnamed_addr #0 align 2

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
declare hidden void @h4(ptr noundef align 4 dereferenceable_or_null(4)) unnamed_addr #0 align 2

; Function Attrs: minsize nounwind null_pointer_is_valid optsize uwtable
declare hidden void @h5(ptr noundef align 4 dereferenceable_or_null(4)) unnamed_addr #0 align 2

attributes #0 = { minsize nounwind null_pointer_is_valid optsize uwtable "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="i686" "target-features"="+cx8,+mmx,+sse,+sse2,+sse3,+ssse3,+x87" "tune-cpu"="generic" }
attributes #1 = { minsize null_pointer_is_valid optsize "frame-pointer"="non-leaf" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="i686" "target-features"="+cx8,+mmx,+sse,+sse2,+sse3,+ssse3,+x87" "tune-cpu"="generic" }
attributes #2 = { minsize nounwind optsize }

!llvm.linker.options = !{}
!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6}

!0 = !{i32 1, !"NumRegisterParameters", i32 0}
!1 = !{i32 7, !"Dwarf Version", i32 4}
!2 = !{i32 2, !"Debug Info Version", i32 3}
!3 = !{i32 1, !"wchar_size", i32 4}
!4 = !{i32 8, !"PIC Level", i32 2}
!5 = !{i32 7, !"uwtable", i32 2}
!6 = !{i32 7, !"frame-pointer", i32 1}

$ llc /tmp/a.ll # one commit before this change
...
	movl	8(%ebp), %ebx
	movl	$g2+8, (%ebx)
	leal	12(%ebp), %ecx
	movl	(%ecx), %eax
	movl	%ecx, %edx
	leal	16(%ebp), %ecx
	movl	(%ecx), %ecx
	movl	%eax, -40(%ebp)
	leal	20(%ebp), %edi
	movl	%ecx, -32(%ebp)
	movl	(%edi), %eax
	movl	%eax, -24(%ebp)
	leal	24(%ebp), %esi
	movl	(%esi), %eax
	movl	%eax, -16(%ebp)
	xorps	%xmm0, %xmm0
	movups	%xmm0, (%edx) ; sets 12(%ebp), 16(%ebp), 20(%ebp), 24(%ebp) to 0
	leal	4(%ebx), %eax
	movl	-16(%ebp), %ecx
	movl	%ecx, 16(%esp) ; stores -16(%ebp), which is also 24(%ebp), to 16(%esp)
	movl	-24(%ebp), %ecx
	movl	%ecx, 12(%esp)
	movl	-32(%ebp), %ecx
	movl	%ecx, 8(%esp)
	movl	-40(%ebp), %ecx
	movl	%ecx, 4(%esp)
	movl	%eax, (%esp)
	calll	h1@PLT
...
$ llc /tmp/a.ll # at this commit
...
	movl	8(%ebp), %ebx
	movl	$g2+8, (%ebx)
	leal	12(%ebp), %ecx
	movl	(%ecx), %eax
	movl	%ecx, %edx
	xorps	%xmm0, %xmm0
	leal	16(%ebp), %ecx
	movl	(%ecx), %ecx
	movl	%eax, -32(%ebp)
	leal	20(%ebp), %edi
	movl	%ecx, -24(%ebp)
	movl	(%edi), %eax
	movl	%eax, -16(%ebp)
	leal	24(%ebp), %esi
	movups	%xmm0, (%edx) ; sets 12(%ebp), 16(%ebp), 20(%ebp), 24(%ebp) to 0
	movl	(%esi), %eax ; loads 24(%ebp) to %eax, which is now 0
	movl	%eax, -40(%ebp)
	leal	4(%ebx), %ecx
	movl	%eax, 16(%esp) ; stores 0 to 16(%esp) which is wrong!!!
	movl	-16(%ebp), %eax
	movl	%eax, 12(%esp)
	movl	-24(%ebp), %eax
	movl	%eax, 8(%esp)
	movl	-32(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	%ecx, (%esp)
	calll	h1@PLT
...

see my added comments in the llc output

Going to revert this commit (and the follow up commit)

aeubanks reopened this revision.Feb 13 2023, 7:05 PM
This revision is now accepted and ready to land.Feb 13 2023, 7:05 PM
shchenz added a comment.EditedFeb 13 2023, 11:00 PM

To me, this patch is innocent for the failure https://bugs.chromium.org/p/chromium/issues/detail?id=1412740. The patch exposes an issue in the pre-ra machine scheduler on X86 target.

Before the machine scheduler:

176B      %12:gr32 = LEA32r %fixed-stack.0, 1, $noreg, 0, $noreg
192B      %6:gr32 = MOV32rm %12:gr32, 1, $noreg, 0, $noreg :: (dereferenceable load (s32) from %ir.4, align 16)     // the content will be stored to 16($esp) is loaded before the filing 0 MOVUPSmr
208B      MOVUPSmr %18:gr32, 1, $noreg, 0, $noreg, %3:vr128 :: (store (s128) into %ir.1, align 4)
320B      MOV32mr $esp, 1, $noreg, 16, $noreg, %6:gr32 :: (store (s32))

After the machine scheduler:

264B      %12:gr32 = LEA32r %fixed-stack.0, 1, $noreg, 0, $noreg
272B      MOVUPSmr %18:gr32, 1, $noreg, 0, $noreg, %3:vr128 :: (store (s128) into %ir.1, align 4)
280B      %6:gr32 = MOV32rm %12:gr32, 1, $noreg, 0, $noreg :: (dereferenceable load (s32) from %ir.4, align 16)      // now the content is loaded after the filling 0 MOVUPSmr
320B      MOV32mr $esp, 1, $noreg, 16, $noreg, %6:gr32 :: (store (s32))

Seems like scheduler does not know %6:gr32 = MOV32rm and MOVUPSmr are aliased.

@aeubanks Thanks for the quick action. Could you please help to verify if llc option -enable-misched=false can solve your issue? I am not very familiar with X86 target, do you have any suggestions about how to fix this X86 scheduling issue, create a github issue for X86 target? Thank you!

To me, this patch is innocent for the failure https://bugs.chromium.org/p/chromium/issues/detail?id=1412740. The patch exposes an issue in the pre-ra machine scheduler on X86 target.

Before the machine scheduler:

176B      %12:gr32 = LEA32r %fixed-stack.0, 1, $noreg, 0, $noreg
192B      %6:gr32 = MOV32rm %12:gr32, 1, $noreg, 0, $noreg :: (dereferenceable load (s32) from %ir.4, align 16)     // the content will be stored to 16($esp) is loaded before the filing 0 MOVUPSmr
208B      MOVUPSmr %18:gr32, 1, $noreg, 0, $noreg, %3:vr128 :: (store (s128) into %ir.1, align 4)
320B      MOV32mr $esp, 1, $noreg, 16, $noreg, %6:gr32 :: (store (s32))

After the machine scheduler:

264B      %12:gr32 = LEA32r %fixed-stack.0, 1, $noreg, 0, $noreg
272B      MOVUPSmr %18:gr32, 1, $noreg, 0, $noreg, %3:vr128 :: (store (s128) into %ir.1, align 4)
280B      %6:gr32 = MOV32rm %12:gr32, 1, $noreg, 0, $noreg :: (dereferenceable load (s32) from %ir.4, align 16)      // now the content is loaded after the filling 0 MOVUPSmr
320B      MOV32mr $esp, 1, $noreg, 16, $noreg, %6:gr32 :: (store (s32))

Seems like scheduler does not know %6:gr32 = MOV32rm and MOVUPSmr are aliased.

@aeubanks Thanks for the quick action. Could you please help to verify if llc option -enable-misched=false can solve your issue? I am not very familiar with X86 target, do you have any suggestions about how to fix this X86 scheduling issue, create a github issue for X86 target? Thank you!

I'm not familiar with scheduler, but it looks to me like a general scheduler problem rather than X86 specific. I have filed 60744.
BTW, I found the problem cannot be reproduced without this patch. I guess it's the change happen to trigger the schedule.

I'm not familiar with scheduler, but it looks to me like a general scheduler problem rather than X86 specific. I have filed 60744.
BTW, I found the problem cannot be reproduced without this patch. I guess it's the change happen to trigger the schedule.

Thanks for creating the bug. Yes, reproducing the issue needs to apply this patch.

that bug has been fixed, now this can reland (hopefully assuming the fix isn't reverted for some reason), sorry for the trouble

may want to double check the reproducer before relanding

I have been waiting for some days for D144711 for potential issues. Since no issues reported, I will reland this patch.

This revision was automatically updated to reflect the committed changes.