This is an archive of the discontinued LLVM Phabricator instance.

Support for inserting profile-directed cache prefetches
ClosedPublic

Authored by mtrofin on Nov 2 2018, 2:52 PM.

Details

Summary

Support for profile-driven cache prefetching (X86)

This change is part of a larger system, consisting of a cache prefetches recommender, create_llvm_prof (https://github.com/google/autofdo), and LLVM.

A proof of concept recommender is DynamoRIO's cache miss analyzer. It processes memory access traces obtained from a running binary and identifies patterns in cache misses. Based on them, it produces a csv file with recommendations. The expectation is that, by leveraging such recommendations, we can reduce the amount of clock cycles spent waiting for data from memory. A microbenchmark based on the DynamoRIO analyzer is available as a proof of concept: https://goo.gl/6TM2Xp.

The recommender makes prefetch recommendations in terms of:

  • the binary offset of an instruction with a memory operand;
  • a delta;
  • and a type (nta, t0, t1, t2)

meaning: a prefetch of that type should be inserted right before the instrution at that binary offset, and the prefetch should be for an address delta away from the memory address the instruction will access.

For example:

0x400ab2,64,nta

and assuming the instruction at 0x400ab2 is:

movzbl (%rbx,%rdx,1),%edx

means that the recommender determined it would be beneficial for a prefetchnta instruction to be inserted right before this instruction, as such:

prefetchnta 0x40(%rbx,%rdx,1)
movzbl (%rbx, %rdx, 1), %edx

The workflow for prefetch cache instrumentation is as follows (the proof of concept script details these steps as well):

  1. build binary, making sure -gmlt -fdebug-info-for-profiling is passed. The latter option will enable the X86DiscriminateMemOps pass, which ensures instructions with memory operands are uniquely identifiable (this causes ~2% size increase in total binary size due to the additional debug information).
  1. collect memory traces, run analysis to obtain recommendations (see above-referenced DynamoRIO demo as a proof of concept).
  1. use create_llvm_prof to convert recommendations to reference insertion locations in terms of debug info locations.
  1. rebuild binary, using the exact same set of arguments used initially, to which -mllvm -prefetch-hints-file=<file> needs to be added, using the afdo file obtained at step 3.

Note that if sample profiling feedback-driven optimization is also desired, that happens before step 1 above. In this case, the sample profile afdo file that was used to produce the binary at step 1 must also be included in step 4.

The data needed by the compiler in order to identify prefetch insertion points is very similar to what is needed for sample profiles. For this reason, and given that the overall approach (memory tracing-based cache recommendation mechanisms) is under active development, we use the afdo format as a syntax for capturing this information. We avoid confusing semantics with sample profile afdo data by feeding the two types of information to the compiler through separate files and compiler flags. Should the approach prove successful, we can investigate improvements to this encoding mechanism.

Diff Detail

Repository
rL LLVM

Event Timeline

mtrofin created this revision.Nov 2 2018, 2:52 PM
davidxl added a reviewer: wmi.Nov 2 2018, 3:06 PM
davidxl added subscribers: danielcdh, davide.
craig.topper added inline comments.Nov 3 2018, 8:50 PM
lib/Target/X86/X86.h
1 ↗(On Diff #172441)

Accidental change?

130 ↗(On Diff #172441)

No need for this to be two lines

davidxl added inline comments.Nov 5 2018, 9:17 AM
lib/Target/X86/X86DiscriminateMemOps.cpp
91 ↗(On Diff #172441)

Add a comment for this variable.

108 ↗(On Diff #172441)

move the 'insert' before and check the return value (to see if the insertion fails or not). This can save one lookup.

lib/Target/X86/X86InsertPrefetch.cpp
64 ↗(On Diff #172441)

The name of the function can be changed to reflect its real semantics. Also add a comment.

83 ↗(On Diff #172441)

typo: profided.

101 ↗(On Diff #172441)

Add a comment for the loop.

103 ↗(On Diff #172441)

define a const variable for the prefix string?

wmi added inline comments.Nov 5 2018, 9:18 AM
lib/Target/X86/X86InsertPrefetch.cpp
35–38 ↗(On Diff #172441)

Perhaps make it a clang option instead of an internal option.

64 ↗(On Diff #172441)

The function only borrows the call target format of afdo. It doesn't really return call target map. Can we rename the function to avoid confusion? Maybe also using a new type name for SampleRecord::CallTargetMap.

124–125 ↗(On Diff #172441)

llvm style requires a message for an assertion: assert(expr && "msg");
The same for other asserts in the patch.

171–172 ↗(On Diff #172441)

Do you think it is worthy to issue a warning so people know there is some mismatch here?

175–210 ↗(On Diff #172441)

Perhaps extract the code into an insertPrefetch function? This part is machine dependent.

lib/Target/X86/X86TargetMachine.cpp
500–501 ↗(On Diff #172441)

Does prefetch insertion need to be guarded by "getOptLevel != CodeGenOpt::None"?

500–501 ↗(On Diff #172441)

Have you considered to make the two passes architecture independent?

mtrofin updated this revision to Diff 172632.Nov 5 2018, 12:01 PM
mtrofin marked 15 inline comments as done.
  • Addressed first round of feedback.
mtrofin added inline comments.Nov 5 2018, 12:01 PM
lib/Target/X86/X86InsertPrefetch.cpp
35–38 ↗(On Diff #172441)

I think that would be more appropriate once the overall approach is more mature.

171–172 ↗(On Diff #172441)

Most instructions won't have prefetch hints provided.

175–210 ↗(On Diff #172441)

The whole pass is machine dependent.

lib/Target/X86/X86TargetMachine.cpp
500–501 ↗(On Diff #172441)

Not really, someone might want no optimization and prefetch insertion (although this is a very odd scenario and a very weak argument). Happy to guard it if there's a reason to.

500–501 ↗(On Diff #172441)

It's too early for generalizing this outside X86.

davidxl added inline comments.Nov 6 2018, 12:01 PM
lib/Target/X86/X86DiscriminateMemOps.cpp
106 ↗(On Diff #172632)

The return of the insert should be saved into a variable, so that it can be used later at line 110.

110 ↗(On Diff #172632)

no need to insert again, just update using the iterator returned from the first insert.

mtrofin updated this revision to Diff 172857.Nov 6 2018, 2:23 PM
mtrofin marked 2 inline comments as done.
  • Second round of feedback.
wmi added inline comments.Nov 7 2018, 9:25 AM
lib/Target/X86/X86InsertPrefetch.cpp
171–172 ↗(On Diff #172441)

Then is there anyway to check if there is prefetch which cannot find matching memory instruction?

lib/Target/X86/X86TargetMachine.cpp
500–501 ↗(On Diff #172441)

I was thinking those two passes can be architecture independent with only x86 hook available currently. But I don't have strong opinion if you feel refactoring that later will not be more difficult.

mtrofin marked 2 inline comments as done.Nov 7 2018, 1:39 PM
mtrofin added inline comments.
lib/Target/X86/X86InsertPrefetch.cpp
171–172 ↗(On Diff #172441)

Indeed - that's on my to-do list as a next step. My plan was to surface that as ORE warnings. I'd prefer doing it in a separate CL, if that's OK.

lib/Target/X86/X86TargetMachine.cpp
500–501 ↗(On Diff #172441)

I understand - I'd prefer doing a refactoring when we have at least a second architecture we want to target.

mtrofin marked 2 inline comments as done.Nov 7 2018, 1:41 PM
wmi accepted this revision.Nov 9 2018, 4:20 PM
This revision is now accepted and ready to land.Nov 9 2018, 4:20 PM
davidxl accepted this revision.Nov 12 2018, 10:57 AM

lgtm too.

Wait for Craig to do the final sign off. thanks.

mtrofin marked 7 inline comments as done.Nov 20 2018, 7:58 AM

Gentle reminder - Craig, I'd love to hear your feedback. I'm planning to commit early next week (unless I hear otherwise, of course).

Thank you,
Mircea.

Weird. I thought I wrote some comments last week, but they obviously didn’t make it in here. Guess I’ll redo that today when I get to the office. One thing I remember is that this patch uses ‘auto’ more than llvm prefers https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable

lib/Target/X86/X86InsertPrefetch.cpp
175 ↗(On Diff #172857)

LLVM style prefers to only use "auto" where the type is really obvious from the rest of the line. Like when its mentioned as part of a dyn_cast. The iterators in the for loop is probably fine, but the rest of this function should really use explicit types in place of auto.

mtrofin updated this revision to Diff 174845.Nov 20 2018, 2:40 PM
mtrofin marked an inline comment as done.
  • Addressed first round of feedback.
  • Second round of feedback.
  • Avoiding using 'auto' except for iterators.

Weird. I thought I wrote some comments last week, but they obviously didn’t make it in here. Guess I’ll redo that today when I get to the office. One thing I remember is that this patch uses ‘auto’ more than llvm prefers https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable

Thanks - addressed the use of 'auto' throughout the change.

Please let me know if there is anything else I should address.

Thanks!

This revision was automatically updated to reflect the committed changes.
vitalybuka reopened this revision.Nov 26 2018, 2:57 PM
vitalybuka added a subscriber: vitalybuka.

This patch causes the following errors on multiple bots
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/26513

******************** TEST 'LLVM :: CodeGen/X86/O0-pipeline.ll' FAILED ********************
Script:
--
: 'RUN: at line 3';   /b/sanitizer-x86_64-linux-fast/build/llvm_build_msan/bin/llc -mtriple=x86_64-- -O0 -debug-pass=Structure < /b/sanitizer-x86_64-linux-fast/build/llvm/test/CodeGen/X86/O0-pipeline.ll -o /dev/null 2>&1    | grep -v 'Verify generated machine code' | /b/sanitizer-x86_64-linux-fast/build/llvm_build_msan/bin/FileCheck /b/sanitizer-x86_64-linux-fast/build/llvm/test/CodeGen/X86/O0-pipeline.ll
--
Exit Code: 1

Command Output (stderr):
--
/b/sanitizer-x86_64-linux-fast/build/llvm/test/CodeGen/X86/O0-pipeline.ll:61:15: error: CHECK-NEXT: is not on the line after the previous match
; CHECK-NEXT: Contiguously Lay Out Funclets
              ^
<stdin>:56:2: note: 'next' match was here
 Contiguously Lay Out Funclets
 ^
<stdin>:53:25: note: previous match ended here
 X86 vzeroupper inserter
                        ^
<stdin>:54:1: note: non-matching line after previous match is here
 Unnamed pass: implement Pass::getPassName()
This revision is now accepted and ready to land.Nov 26 2018, 2:57 PM
vitalybuka closed this revision.Nov 26 2018, 3:01 PM

Looks like rL347607 the fix.