Page MenuHomePhabricator

[X86] Reduce Store Forward Block issues in HW

Authored by lsaba on Dec 17 2017, 3:25 AM.



If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.

This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence
of a load and a store.

The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies.
breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.

Diff Detail


Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
lsaba added a comment.Dec 26 2017, 4:00 AM

add MMO info

hfinkel added inline comments.Dec 28 2017, 6:56 AM
32 ↗(On Diff #127507)

I think that this makes sense, please explain this better in the comment (i.e., that you're speculatively assuming aliasing, as the transformation is correct either way, and you often won't be able to prove a sufficient number of the interesting cases).

224 ↗(On Diff #127507)

Okay, thanks (although you should use something other than #defines (constants or a function with a switch statement, etc.).

lsaba updated this revision to Diff 128492.Jan 3 2018, 12:04 AM

addressed @hfinkel comments.
ping to reviewers.

lsaba marked 2 inline comments as done.Jan 3 2018, 12:13 AM
lsaba added a comment.Jan 16 2018, 1:17 AM

Pinging again,,

craig.topper added inline comments.Jan 17 2018, 5:19 PM
109 ↗(On Diff #128492)

I think you need to add VMOVDQU64Z128rm/VMOVDQU32Z128rm/VMOVDQA64Z128rm/VMOVDQA32Z128rm. Those are the EVEX integer equivalents.

116 ↗(On Diff #128492)

Add the EVEX integer instructions.

246 ↗(On Diff #128492)

Add a comment about what's considered a relevant addressing mode?

254 ↗(On Diff #128492)

Why can't these be MachineOperand references? Why are they assigned to nullptr and then immediately overwritten?

4 ↗(On Diff #128492)

Add avx512vl command line?

All of the load/store switch statements can be moved to more compact lookup tables, similar to what we do for domain switching in X86InstrInfo.cpp?

17 ↗(On Diff #128492)


lsaba updated this revision to Diff 132197.Jan 31 2018, 9:20 AM
lsaba marked 6 inline comments as done.

addressed comments from @RKSimon and @craig.topper

lsaba added a comment.Jan 31 2018, 9:25 AM
This comment was removed by lsaba.
craig.topper added inline comments.Feb 2 2018, 9:32 AM
127 ↗(On Diff #132197)

Should these be DenseMaps? @RKSimon, what do you think?

Can you use a std::array or std::pair instead of std::vector for the second type since the size is fixed. Each std::vector will be a separate heap allocation.

254 ↗(On Diff #128492)

I don't think this was addressed. Why aren't these references instead of pointers?

lsaba updated this revision to Diff 132765.Feb 4 2018, 1:28 AM
lsaba marked an inline comment as done.

addressed @craig.topper comments

lsaba added a comment.Feb 8 2018, 12:04 AM

ping to reviewers

craig.topper added inline comments.Feb 8 2018, 12:28 AM
232 ↗(On Diff #132765)

Return a reference?

237 ↗(On Diff #132765)

Return a reference here too.

425 ↗(On Diff #132765)

Potentially is mispelled.

449 ↗(On Diff #132765)

Can you clang-format at least this function? That 8 looks really out of place. So does the splitting of the getRegClass call.

454 ↗(On Diff #132765)

Pass the map by reference. Probably const reference if you're not modifying it.

127 ↗(On Diff #132197)

I don't think the DenseMap question here was answered.

lsaba updated this revision to Diff 133471.Feb 8 2018, 12:37 PM
lsaba marked 4 inline comments as done.

addressed @craig.topper 's comments

449 ↗(On Diff #132765)

This weird format is actually clang-format's output, I changed it to make it more readable

127 ↗(On Diff #132197)

I think DenseMap is less suitable since it doesn't have a static initializer constructor

craig.topper added inline comments.Feb 8 2018, 2:04 PM
447 ↗(On Diff #132765)

Is MRI->getTargetRegisterInfo() different than the TRI you already have?

lsaba updated this revision to Diff 133786.Feb 11 2018, 12:26 AM
lsaba marked an inline comment as done.

fixed TRI @craig.topper

This revision is now accepted and ready to land.Feb 11 2018, 12:36 AM

Sorry for showing up late, but I was looking at this code because it turns out it is miscompiling code for us. We're working on a test case, but there are really a number of basic LLVM coding convention problems here.

Craig, I hate to say this, but I don't think this should have gotten approved in its current form. It violates several clear guidelines in the LLVM coding standards, and pretty frequently. Maybe try to help folks ramping up on LLVM get their patches up to a higher code quality bar before approving?

1 ↗(On Diff #133786)

I find using *only* the acronym "SFB" everywhere really confusing. It makes it hard to discover things.

Also, the word Fixup doesn't really communicate much. Maybe X86AvoidStoreForwardingBlocks?

104 ↗(On Diff #133786)

Why can't these predicates be generated by tablegen from the instruction definitions. I really don't like have lots of tables in every pass as it creates a maintenance nightmare when adding new instructions and other issues.

248–252 ↗(On Diff #133786)

It seems really weird to have getters above for some of these and then to not use them here.

If we want a better API for extracting the memory operands, we should build one that can be used everywhere (likely in X86InstrInfo.h) rather than a couple of ad-hoc ones that we end up not using in places like this. =[

272 ↗(On Diff #133786)

We don't use all caps constants typically in LLVM. And we almost always expose debug flags for this kind of constant. It should also be at the top of the file and the function comment actually attached to the function rather than the constant.

428–430 ↗(On Diff #133786)

Please use early-continue or early-exit to reduce indentation when writing LLVM code:

444 ↗(On Diff #133786)

Missing vertical space here.

464 ↗(On Diff #133786)

Please follow LLVM's variable naming conventions rather than inst.

Also Inst isn't a very idiomatic variable name. But worse, it is actively confusing. What is it? I would assume an instruction, perhaps a MachineInstr, but it isn't. It is a pair of int and and unsigned?? I have no idea what this variable means.

127 ↗(On Diff #132197)

But this still isn't going to be a constant. It isn't marked constexpr. This will require building a map every time the binary starts.

Please don't create globals with complex initializers in LLVM:

Also, std::map is *really* slow...

Why not just build a function with a switch to do this mapping?

Even better, could this be generated by tablegen?

lsaba updated this revision to Diff 135075.Feb 20 2018, 9:22 AM
lsaba marked 5 inline comments as done.

@chandlerc Thanks for your comments.
still checking the option of using tablegen for some of the tables.
Any luck with the reproducer?

127 ↗(On Diff #132197)

back to using switch functions for this.

lsaba updated this revision to Diff 138930.Mar 19 2018, 8:22 AM

Fix the reported bug (added case to bug 36346, based on code provided by Richard Smith )
and limit the transformation to 64bit only

lsaba reopened this revision.Mar 19 2018, 8:30 AM
This revision is now accepted and ready to land.Mar 19 2018, 8:30 AM
lsaba updated this revision to Diff 139434.Mar 22 2018, 6:43 AM
lsaba added a subscriber: niravd.

fix comments from @niravd and @craig.topper in review (closing that one)

lsaba added a comment.Mar 27 2018, 4:17 AM

ping @ reviewers, if there are no more comments i'd like to re-submit this

The changes from just D43619 look correct but I haven't looked deeply at this part of the patch. From the comments it looks like the patch is a bit out of date as std::map should have been converted back to switch statements.

Minor comment: I got tripped up on remembering what StoreForwardingBlock the first time I saw it. I think it Store-Forward Blocking is a more intuitive name than StoreForwardingBlock.

The changes from just D43619 look correct but I haven't looked deeply at this part of the patch. From the comments it looks like the patch is a bit out of date as std::map should have been converted back to switch statements.

The switch statements are back, are you referring to the DisplacementSizeMap? this one needs to be sorted, that's why i'm using std::map

craig.topper added inline comments.Mar 28 2018, 12:17 PM
60 ↗(On Diff #139434)

Prefix command line option strings with "x86-"

344 ↗(On Diff #139434)

Should this be called BlockCnt instead of BlockLimit. It itself is not a limit.

345 ↗(On Diff #139434)

Why do we need InspectionLimit? Can't we use x86AvoidSFBInspectionLimit everywhere? Or at least make InspectionLimit const. Naively it looks like the limit might be changed in the function.

346 ↗(On Diff #139434)

LI isn't a great name for this iterator. The name seems to have been chosen because it starts at LoadInst, but that's not meaningful once you walk away from LoadInst.

347 ↗(On Diff #139434)

This won't visit the first instruction in the block. And it visits LoadInst. Is that intended?

368 ↗(On Diff #139434)


544 ↗(On Diff #139434)

Use std::make_pair so you don't need to be explicit with the types.

578 ↗(On Diff #139434)


638 ↗(On Diff #139434)

Variables should be capitalized.

697 ↗(On Diff #139434)


lsaba marked 10 inline comments as done.Mar 29 2018, 7:09 AM
lsaba added inline comments.
345 ↗(On Diff #139434)

I did it for readability purposes only

347 ↗(On Diff #139434)

Nope, fixed.

lsaba updated this revision to Diff 140236.Mar 29 2018, 7:11 AM
lsaba marked 2 inline comments as done.

Updated with @craig.topper comments

craig.topper added inline comments.Mar 29 2018, 10:38 AM
347 ↗(On Diff #140236)

I don't think you want std::next here. reverse_iterator already makes it point to the instruction before LoadInst. So now I think you skip that instruction.

lsaba added inline comments.Mar 30 2018, 12:30 AM
347 ↗(On Diff #140236)

are you sure? I've double checked, without std::next the iterator starts from the load instruction

This revision was automatically updated to reflect the committed changes.

This seems to break the Machine Verifier. I filed PR37153. Do you mind taking a look please? Thanks!

This seems to break the Machine Verifier. I filed PR37153. Do you mind taking a look please? Thanks!

Sure, taking a look.

lsaba added a comment.Apr 19 2018, 9:06 AM

This seems to break the Machine Verifier. I filed PR37153. Do you mind taking a look please? Thanks!

Please see
This should hopefully solve it