This is an archive of the discontinued LLVM Phabricator instance.

[X86] Reduce Store Forward Block issues in HW

Authored by lsaba on Dec 17 2017, 3:25 AM.



If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load. This "store forwarding" saves cycles by letting the load obtain the data directly instead of reading it from the cache or memory.
A "store forward block" occurs when a store cannot be forwarded to the load. The most typical case on Intel Core microarchitectures is a small store that cannot be forwarded to a larger load.
The estimated penalty for a store forward block is ~13 cycles.

This pass tries to recognize and handle cases where a "store forward block" is created by the compiler when lowering memcpy calls to a sequence of a load and a store.

The pass currently only handles cases where memcpy is lowered to XMM/YMM registers; it tries to break the memcpy into smaller copies. Breaking the memcpy up should be legal, since there is no atomicity guarantee for loads and stores to XMM/YMM.
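To illustrate the splitting idea, here is a hypothetical sketch (not the patch's actual code) of how a block copy could be partitioned around the bytes written by a blocking store, so that each resulting sub-load either exactly covers the stored bytes (and can be forwarded) or does not overlap them at all:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: split a [0, CopySize) block copy so that no sub-copy
// straddles the bytes written by a blocking store at [StOff, StOff + StSize).
// Returns (offset, size) pairs describing the smaller copies.
std::vector<std::pair<unsigned, unsigned>>
splitCopy(unsigned CopySize, unsigned StOff, unsigned StSize) {
  std::vector<std::pair<unsigned, unsigned>> Parts;
  if (StOff > 0)
    Parts.emplace_back(0u, StOff);           // bytes before the store
  Parts.emplace_back(StOff, StSize);         // exactly the stored bytes
  unsigned End = StOff + StSize;
  if (End < CopySize)
    Parts.emplace_back(End, CopySize - End); // bytes after the store
  return Parts;
}
```

For a 16-byte copy with a 4-byte blocking store at offset 4, this yields three sub-copies, the middle one matching the store exactly.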

Diff Detail

Event Timeline

lsaba created this revision.Dec 17 2017, 3:25 AM

A few style comments. I didn't look at the algorithm closely yet.


What about the EVEX versions of these instructions?


Remove the parentheses here.


Use a "static const unsigned " variable instead of a #define.


Capitalize variable names.

lsaba updated this revision to Diff 127507.Dec 19 2017, 5:40 AM
lsaba marked 4 inline comments as done.

Addressed Craig's comments


More cases can still be added in the future

hfinkel added inline comments.

AA is too conservative.

Did you try turning it on? :-)

If you overload useAA() in X86Subtarget, then you'll actually get non-trivial AA in the backend. Given all of the recent work on adding scheduling models, we should probably experiment with this again.

That having been said, it relies on having good MMOs. If we don't, then you'll have trouble using it for this purpose until that's fixed.


Can you elaborate on what you're seeing? Missing MMOs are something we should fix. If the MMO size is too small, that's a bug (i.e., can cause miscompiles from invalid instruction scheduling). MMO sizes that are too big should also be fixed.

lsaba added inline comments.Dec 20 2017, 7:08 AM

Perhaps it's better to say that the solution approach suggested in the comment is more conservative, not the AA itself. I will try to explain what I meant with an example:

%struct.S = type { i32, i32, i32, i32 }

; Function Attrs: nounwind uwtable
define void @test_conditional_block(%struct.S* nocapture %s1,
                                    %struct.S* nocapture %s2, i32 %x,
                                    %struct.S* nocapture %s3,
                                    %struct.S* nocapture readonly %s4) local_unnamed_addr #0 {
  %b = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 1
  store i32 %x, i32* %b, align 4
  %0 = bitcast %struct.S* %s3 to i8*
  %1 = bitcast %struct.S* %s4 to i8*
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
  %2 = bitcast %struct.S* %s2 to i8*
  %3 = bitcast %struct.S* %s1 to i8*
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 4, i1 false)
  ret void
}
The currently generated code:

movl %edx, 4(%rdi) // blocking store
movups (%r8), %xmm0
movups %xmm0, (%rcx)
movups (%rdi), %xmm0 // blocked load
movups %xmm0, (%rsi)

If we were to solve the problem by loading the old value, then storing like the comment suggests:

movups (%rdi), %xmm0 // blocked load
vinsert %edx, xmm0, 1 //store b to xmm0 at index 1 
movups %xmm0, (%rsi)

We would have to prove that the previous copy from s4 to s3:

movups (%r8), %xmm0
movups %xmm0, (%rcx)

does not alias (or must-alias, in which case we could handle it similarly) the copy from s1 to s2, while the AA would naturally assume the two may alias in this case.
On the other hand, the proposed solution in this patch (breaking the memcpy) is correct regardless of the memory relation between s3, s4 and s1, s2.

I am not too familiar with the capabilities of the backend AA, and I did not try turning it on :) Correct me if I'm wrong, but relying on it and using the approach in the comment would yield a more conservative optimization than the one implemented here, at least for the trivial cases.

In general, the solution in the comment gets more complicated if the blocking store is in a preceding block; in that case we would need to prove safety in all predecessor blocks, or hoist the memcpy out of conditional blocks.


You are right. I saw missing MMOs, and I will try to fix the ones I encountered, but that doesn't guarantee there won't be more in code I did not reach, so I'm not sure it would be safe to remove those defines.

lsaba updated this revision to Diff 128164.Dec 26 2017, 4:00 AM
lsaba added a comment.Dec 26 2017, 4:00 AM

add MMO info

hfinkel added inline comments.Dec 28 2017, 6:56 AM

I think that this makes sense, please explain this better in the comment (i.e., that you're speculatively assuming aliasing, as the transformation is correct either way, and you often won't be able to prove a sufficient number of the interesting cases).


Okay, thanks (although you should use something other than #defines: constants, a function with a switch statement, etc.).

lsaba updated this revision to Diff 128492.Jan 3 2018, 12:04 AM

Addressed @hfinkel's comments.
Ping to reviewers.

lsaba marked 2 inline comments as done.Jan 3 2018, 12:13 AM
lsaba added a comment.Jan 16 2018, 1:17 AM

Pinging again.

craig.topper added inline comments.Jan 17 2018, 5:19 PM

I think you need to add VMOVDQU64Z128rm/VMOVDQU32Z128rm/VMOVDQA64Z128rm/VMOVDQA32Z128rm. Those are the EVEX integer equivalents.


Add the EVEX integer instructions.


Add a comment about what's considered a relevant addressing mode?


Why can't these be MachineOperand references? Why are they assigned to nullptr and then immediately overwritten?


Add avx512vl command line?

All of the load/store switch statements can be moved to more compact lookup tables, similar to what we do for domain switching in X86InstrInfo.cpp?



lsaba updated this revision to Diff 132197.Jan 31 2018, 9:20 AM
lsaba marked 6 inline comments as done.

addressed comments from @RKSimon and @craig.topper

lsaba added a comment.Jan 31 2018, 9:25 AM
This comment was removed by lsaba.
craig.topper added inline comments.Feb 2 2018, 9:32 AM

Should these be DenseMaps? @RKSimon, what do you think?

Can you use a std::array or std::pair instead of std::vector for the second type since the size is fixed. Each std::vector will be a separate heap allocation.


I don't think this was addressed. Why aren't these references instead of pointers?

lsaba updated this revision to Diff 132765.Feb 4 2018, 1:28 AM
lsaba marked an inline comment as done.

addressed @craig.topper comments

lsaba added a comment.Feb 8 2018, 12:04 AM

ping to reviewers

craig.topper added inline comments.Feb 8 2018, 12:28 AM

I don't think the DenseMap question here was answered.


Return a reference?


Return a reference here too.


"Potentially" is misspelled.


Can you clang-format at least this function? That 8 looks really out of place. So does the splitting of the getRegClass call.


Pass the map by reference. Probably const reference if you're not modifying it.

lsaba updated this revision to Diff 133471.Feb 8 2018, 12:37 PM
lsaba marked 4 inline comments as done.

Addressed @craig.topper's comments.


I think DenseMap is less suitable since it doesn't have a static-initializer constructor.
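For context, the initialization point: std::map can be built from a braced initializer list in a single expression, which (per this review) llvm::DenseMap could not at the time. A sketch with made-up opcode numbers, purely for illustration:

```cpp
#include <map>

// A static table mapping (hypothetical) load opcodes to their matching store
// opcodes, built with std::map's initializer-list constructor. The numeric
// values here are invented and are not real X86 opcode enums.
static const std::map<unsigned, unsigned> LoadToStoreOpcode = {
    {100, 200}, {101, 201}, {102, 202}};

// Returns the matching store opcode, or 0 if the load opcode is unknown.
unsigned lookupStoreOpcode(unsigned LoadOp) {
  auto It = LoadToStoreOpcode.find(LoadOp);
  return It == LoadToStoreOpcode.end() ? 0 : It->second;
}
```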


This weird format is actually clang-format's output; I changed it to make it more readable.

craig.topper added inline comments.Feb 8 2018, 2:04 PM

Is MRI->getTargetRegisterInfo() different than the TRI you already have?

lsaba updated this revision to Diff 133786.Feb 11 2018, 12:26 AM
lsaba marked an inline comment as done.

fixed TRI @craig.topper

This revision is now accepted and ready to land.Feb 11 2018, 12:36 AM

Sorry for showing up late, but I was looking at this code because it turns out it is miscompiling code for us. We're working on a test case, but there are really a number of basic LLVM coding convention problems here.

Craig, I hate to say this, but I don't think this should have gotten approved in its current form. It violates several clear guidelines in the LLVM coding standards, and pretty frequently. Maybe try to help folks ramping up on LLVM get their patches up to a higher code quality bar before approving?


I find using *only* the acronym "SFB" everywhere really confusing. It makes it hard to discover things.

Also, the word Fixup doesn't really communicate much. Maybe X86AvoidStoreForwardingBlocks?


Why can't these predicates be generated by tablegen from the instruction definitions. I really don't like have lots of tables in every pass as it creates a maintenance nightmare when adding new instructions and other issues.


But this still isn't going to be a constant. It isn't marked constexpr. This will require building a map every time the binary starts.

Please don't create globals with complex initializers in LLVM:

Also, std::map is *really* slow...

Why not just build a function with a switch to do this mapping?

Even better, could this be generated by tablegen?
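The switch-based mapping being suggested might look like the following sketch; the enum and the sizes are made up for illustration and are not real X86 opcode definitions:

```cpp
// Hypothetical stand-ins for X86 opcode enumerators, for illustration only.
enum Opcode { MOVUPSrm, MOVAPSrm, MOVUPDrm, UnknownOp };

// Returns the number of bytes loaded by the (hypothetical) opcode, or 0 if
// the opcode is not one this pass cares about. A switch like this needs no
// runtime construction, unlike a global map with a dynamic initializer.
unsigned getLoadSizeInBytes(Opcode Op) {
  switch (Op) {
  case MOVUPSrm:
  case MOVAPSrm:
  case MOVUPDrm:
    return 16; // XMM-sized loads
  default:
    return 0;
  }
}
```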


It seems really weird to have getters above for some of these and then to not use them here.

If we want a better API for extracting the memory operands, we should build one that can be used everywhere (likely in X86InstrInfo.h) rather than a couple of ad-hoc ones that we end up not using in places like this. =[


We don't use all caps constants typically in LLVM. And we almost always expose debug flags for this kind of constant. It should also be at the top of the file and the function comment actually attached to the function rather than the constant.


Please use early-continue or early-exit to reduce indentation when writing LLVM code:
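The guideline being cited can be shown with a generic sketch: both functions below compute the same sum, but the second keeps the loop body flat with an early continue, as the LLVM coding standards recommend.

```cpp
#include <vector>

// Nested style: the loop body sits one level deeper under the condition.
int sumPositiveNested(const std::vector<int> &V) {
  int Sum = 0;
  for (int X : V) {
    if (X > 0) {
      Sum += X;
    }
  }
  return Sum;
}

// LLVM style: skip uninteresting elements up front, keep the body flat.
int sumPositiveEarly(const std::vector<int> &V) {
  int Sum = 0;
  for (int X : V) {
    if (X <= 0)
      continue;
    Sum += X;
  }
  return Sum;
}
```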


Missing vertical space here.


Please follow LLVM's variable naming conventions rather than inst.

Also, Inst isn't a very idiomatic variable name. But worse, it is actively confusing. What is it? I would assume an instruction, perhaps a MachineInstr, but it isn't. It is a pair of an int and an unsigned?? I have no idea what this variable means.

lsaba updated this revision to Diff 135075.Feb 20 2018, 9:22 AM
lsaba marked 5 inline comments as done.

@chandlerc Thanks for your comments.
still checking the option of using tablegen for some of the tables.
Any luck with the reproducer?


back to using switch functions for this.

lsaba updated this revision to Diff 138930.Mar 19 2018, 8:22 AM

Fix the reported bug (added the case from bug 36346, based on code provided by Richard Smith) and limit the transformation to 64-bit only.

lsaba reopened this revision.Mar 19 2018, 8:30 AM
This revision is now accepted and ready to land.Mar 19 2018, 8:30 AM
lsaba updated this revision to Diff 139434.Mar 22 2018, 6:43 AM
lsaba added a subscriber: niravd.

fix comments from @niravd and @craig.topper in review (closing that one)

lsaba added a comment.Mar 27 2018, 4:17 AM

Ping to reviewers; if there are no more comments I'd like to re-submit this.

The changes from just D43619 look correct but I haven't looked deeply at this part of the patch. From the comments it looks like the patch is a bit out of date as std::map should have been converted back to switch statements.

Minor comment: I got tripped up remembering what StoreForwardingBlock meant the first time I saw it. I think Store-Forward Blocking is a more intuitive name than StoreForwardingBlock.

The changes from just D43619 look correct but I haven't looked deeply at this part of the patch. From the comments it looks like the patch is a bit out of date as std::map should have been converted back to switch statements.

The switch statements are back. Are you referring to the DisplacementSizeMap? That one needs to be sorted, which is why I'm using std::map.
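For context, the property being relied on: std::map iterates its keys in ascending order regardless of insertion order, unlike a hash-based container. A sketch with hypothetical displacement-to-size entries:

```cpp
#include <map>
#include <vector>

// Collects the keys of a displacement -> access-size map. Because std::map
// is ordered, the result is always sorted by displacement, no matter the
// order in which entries were inserted.
std::vector<int> sortedDisplacements(const std::map<int, unsigned> &DispSize) {
  std::vector<int> Keys;
  for (const auto &KV : DispSize)
    Keys.push_back(KV.first);
  return Keys;
}
```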

craig.topper added inline comments.Mar 28 2018, 12:17 PM
60 ↗(On Diff #139434)

Prefix command line option strings with "x86-"

344 ↗(On Diff #139434)

Should this be called BlockCnt instead of BlockLimit. It itself is not a limit.

345 ↗(On Diff #139434)

Why do we need InspectionLimit? Can't we use x86AvoidSFBInspectionLimit everywhere? Or at least make InspectionLimit const. Naively it looks like the limit might be changed in the function.

346 ↗(On Diff #139434)

LI isn't a great name for this iterator. The name seems to have been chosen because it starts at LoadInst, but that's not meaningful once you walk away from LoadInst.

347 ↗(On Diff #139434)

This won't visit the first instruction in the block. And it visits LoadInst. Is that intended?

544 ↗(On Diff #139434)

Use std::make_pair so you don't need to be explicit with the types.

638 ↗(On Diff #139434)

Variables should be capitalized.

lsaba marked 10 inline comments as done.Mar 29 2018, 7:09 AM
lsaba added inline comments.
345 ↗(On Diff #139434)

I did it for readability purposes only

347 ↗(On Diff #139434)

Nope, fixed.

lsaba updated this revision to Diff 140236.Mar 29 2018, 7:11 AM
lsaba marked 2 inline comments as done.

Updated with @craig.topper comments

craig.topper added inline comments.Mar 29 2018, 10:38 AM
347 ↗(On Diff #140236)

I don't think you want std::next here. reverse_iterator already makes it point to the instruction before LoadInst. So now I think you skip that instruction.

lsaba added inline comments.Mar 30 2018, 12:30 AM
347 ↗(On Diff #140236)

Are you sure? I've double-checked; without std::next the iterator starts from the load instruction.
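For reference, the off-by-one under discussion comes from how the standard reverse-iterator adapter works: a std::reverse_iterator constructed from a forward iterator It dereferences to *(It - 1). (LLVM's machine-IR bundle iterators obtained via getReverse() instead point at the same instruction, which may explain how both observations can be consistent.) A standalone sketch of the standard behavior, with a std::vector standing in for a basic block:

```cpp
#include <iterator>
#include <vector>

// Returns the element a reverse iterator built from V.begin() + Pos sees,
// optionally after advancing it once with std::next (one element further
// back). Demonstrates the *(It - 1) adapter semantics.
int elementAtReverse(const std::vector<int> &V, int Pos, bool Advance) {
  auto It = V.begin() + Pos;
  std::reverse_iterator<std::vector<int>::const_iterator> RIt(It);
  if (Advance)
    RIt = std::next(RIt);
  return *RIt;
}
```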

This revision was automatically updated to reflect the committed changes.

This seems to break the Machine Verifier. I filed PR37153. Do you mind taking a look please? Thanks!

This seems to break the Machine Verifier. I filed PR37153. Do you mind taking a look please? Thanks!

Sure, taking a look.

lsaba added a comment.Apr 19 2018, 9:06 AM

This seems to break the Machine Verifier. I filed PR37153. Do you mind taking a look please? Thanks!

Please see
This should hopefully solve it