Page MenuHomePhabricator

A light-weight solution to align branches within 32B boundary by prefix padding
AbandonedPublic

Authored by skan on Feb 27 2020, 8:50 AM.

Details

Summary

If we want a branch or a fused pair not to cross or be against the boundary, we currently emit NOP before it (D70157), and in most cases, we can bring back the lost performance due to microcode update. We also observed cases in which nop padding doesn't mitigate the effect very well, but prefix padding does (D72225). As we discussed about the prefix padding, D72225 adopts an aggressive way to add prefixes and, as a result, the fact that every single intruction ends up in it's own fragment is a huge increase in memory usage. So we put forward a light-weight solution. In this solution, to align a branch, at most one instruction can be prefixed, and if there is no sufficient room to add segment prefixes, NOP will be inserted instead. We measured the memory usage of the link process with lto when building SPEC, it only increased a little compared to NOP padding. We turned on the new prefix padding by default and passed the internal large test set and llvm's testsuite.

D75203 seems to support a general alignment padding. If the general alignment padding supports adding prefixes for instructions, this patch is not needed. Currently, the revison is opened here to avoid duplicate work .

Diff Detail

Unit TestsFailed

TimeTest
1,100 msMLIR.mlir-cpu-runner::Unknown Unit Message ("")
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/mlir-opt /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/bare_ptr_call_conv.mlir -convert-loop-to-std -convert-std-to-llvm='use-bare-ptr-memref-call-conv=1' | mlir-cpu-runner -shared-libs=/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libmlir_runner_utils.so -entry-point-result=void | /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/FileCheck /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/bare_ptr_call_conv.mlir
1,100 msMLIR.mlir-cpu-runner::Unknown Unit Message ("")
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/mlir-opt /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/unranked_memref.mlir -convert-linalg-to-loops -convert-linalg-to-llvm -convert-std-to-llvm | mlir-cpu-runner -e main -entry-point-result=void -shared-libs=/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libmlir_runner_utils.so,/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libcblas.so,/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libcblas_interface.so | /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/FileCheck /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/unranked_memref.mlir
1,090 msMLIR.mlir-cpu-runner::Unknown Unit Message ("")
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/mlir-opt /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/utils.mlir -convert-linalg-to-loops -convert-linalg-to-llvm -convert-std-to-llvm | mlir-cpu-runner -e print_0d -entry-point-result=void -shared-libs=/mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/lib/libmlir_runner_utils.so | /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/build/bin/FileCheck /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/mlir/test/mlir-cpu-runner/utils.mlir --check-prefix=PRINT-0D

Event Timeline

skan created this revision.Feb 27 2020, 8:50 AM
skan edited the summary of this revision. (Show Details)Feb 27 2020, 8:53 AM

I think this is a step in the right direction. I expect we'll need to iterate on the design to allow further padding without extensive memory usage, but starting with a single padding instruction seems like a reasonable starting point. Being able to iterate in tree is obvious valuable, so I think cleaning this up and getting something in is a good idea.

I'd like to suggest a couple of design changes to simplify the code here though.

First, I think we should introduce a new MCFragment type for the padding opportunity. Having MCBranchAlign mean both "this is a place we need to enforce alignment" and "this is a place we can optional add padding for a later alignment" is confusing. I'd suggest something along the lines of a MCPrefixPaddingFragment. I think splitting this will make the code a lot more readable.

Second, I think we should rebase this on D75203. The advantage of doing so is that we could completely remove the MCPrefixPaddingFragment from relaxation, and only adjust it's size afterwards. Relaxation would be responsible for figuring out how much padding is needed and recording that in the MCBoundaryAlign, and then the post pass would divide the padding between nop, prefix, and other relaxable instructions.

Third, the complexity of the state machine for inserting fragments really bothers me. I don't have a concrete suggestion here, but I think we need to simplify this.

Hm, after reflecting a bit on points 1 and 2, I may have an alternate suggestion which eliminates (1) entirely. The basic idea is that we treat prefix padding as a thing done to relaxable instructions. A relaxable instruction has all the MCInst info to determine legal prefixes. If we added an interface to MCAsmBackend along the lines of "padInstructionEncoding(MCRelaxableFragment&)" which takes a fragment and changes it's encoding to increase encoding size, this would nicely fit into the design of D75203, and would allow all the prefix logic to live inside a single function in the X86AsmBackend. Then we'd just have to decide which instructions to make relaxable instead of directly adding to the data fragment. (i.e. the state machine)

On the surface, that seems like it would work well. I'm going to prototype that a bit on top of D75203 and see if it works as well as it seems.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
385

This looks like a potentially unrelated change. Can it be separated?

493

Please replace constants with X86::CS_PREFIX and friends.

My prototyping resulted in a POC patch (D75300) which nicely demonstrates that we can prefix pad MCRelaxableFragment instructions with zero additional memory overhead. Note that this approach *only* covers relaxable instructions, not those currently combined into DataFragments.

This patch goes several steps further in choosing instructions to insert prefixes before without making them relaxable. We could either a) just make more instructions relaxable, or b) support both mechanisms. This mechanism - particularly after splitting out a 'MCPrefixPaddingFragment' - will require less memory per potentially padded instruction than converting every instruction we might wish to pad into a RelxableFragment of it's own.

Another idea to explore would be trying to frame prefix padding insertion as something analogous to a fixup. Today, all of our fixups are fixed size (I think), but having a fixup which inserts a set of bytes might be reasonable. If we did that, we could use a single DataFragment for a block of instructions and still insert padding later if needed.

To be clear, I think it's fine to iterate on this design (for non-relaxable instructions) in tree. We could potentially start simple by simply making more instructions relaxable, and then implement the fixup like scheme just mentioned or something like this patch does.

skan added a comment.Feb 27 2020, 6:51 PM

I think this is a step in the right direction. I expect we'll need to iterate on the design to allow further padding without extensive memory usage, but starting with a single padding instruction seems like a reasonable starting point. Being able to iterate in tree is obvious valuable, so I think cleaning this up and getting something in is a good idea.

I'd like to suggest a couple of design changes to simplify the code here though.

Thanks for your comments and suggestions! They are really useful!

First, I think we should introduce a new MCFragment type for the padding opportunity. Having MCBranchAlign mean both "this is a place we need to enforce alignment" and "this is a place we can optional add padding for a later alignment" is confusing. I'd suggest something along the lines of a MCPrefixPaddingFragment. I think splitting this will make the code a lot more readable.

Yes, I aslo think using one type of fragment to do two kinds of things is confusing.

Second, I think we should rebase this on D75203. The advantage of doing so is that we could completely remove the MCPrefixPaddingFragment from relaxation, and only adjust it's size afterwards. Relaxation would be responsible for figuring out how much padding is needed and recording that in the MCBoundaryAlign, and then the post pass would divide the padding between nop, prefix, and other relaxable instructions.

Agree, doing this in the post pass would reduce the complexity of laying out the fragments. I think D75203 is a good start and seems reasonable, we don't need to stick to his patch since this patch does not nicely fit into the design of D75203. We can enable prefix padding in another way based on D75203. Let's focus on D75203 first.

Third, the complexity of the state machine for inserting fragments really bothers me. I don't have a concrete suggestion here, but I think we need to simplify this.

Hm, after reflecting a bit on points 1 and 2, I may have an alternate suggestion which eliminates (1) entirely. The basic idea is that we treat prefix padding as a thing done to relaxable instructions. A relaxable instruction has all the MCInst info to determine legal prefixes. If we added an interface to MCAsmBackend along the lines of "padInstructionEncoding(MCRelaxableFragment&)" which takes a fragment and changes it's encoding to increase encoding size, this would nicely fit into the design of D75203, and would allow all the prefix logic to live inside a single function in the X86AsmBackend. Then we'd just have to decide which instructions to make relaxable instead of directly adding to the data fragment. (i.e. the state machine)

Nice design! But I have one concern. D75203 replaces short jumps with long jumps to reduce the number of the NOP, is this friendly to performance? I think currently we should add a option to turn it on rather than enabling it by default before we get some performance data.

On the surface, that seems like it would work well. I'm going to prototype that a bit on top of D75203 and see if it works as well as it seems.

skan marked 2 inline comments as done.Feb 27 2020, 8:06 PM
skan added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
385

Okay

493

X86::CS_PREFIX is a enum value and used to represent the instruction is a cs prefix indeed. Constants 0x2e is the encoding value for X86::CS, thery are different.

skan marked an inline comment as done.Feb 28 2020, 5:34 AM
skan added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
385

Separate this piece of code to D75346

skan marked an inline comment as done.Feb 28 2020, 8:22 AM
skan added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
446–447

Split the piece of code to D75357

skan planned changes to this revision.Mar 2 2020, 1:12 AM

Plan to rebase this on D75203

skan abandoned this revision.Mar 19 2020, 8:27 PM

Reimplemented this by D76286 based on D75300