Page MenuHomePhabricator

[X86] New pass that moves immediate operands to registers.
Needs ReviewPublic

Authored by sepavloff on Sep 30 2014, 10:24 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

The pass scans machine instructions looking for uses of immediate
values. If the same value is used in neighbouring instructions, it
is moved to a register, if this is possible and reduces code size.
For instance, instructions

mov $0, 0x4(%esi)
mov $0, 0x8(%esi)

can be replaced by

mov $0, %eax
mov %eax, 0x4(%esi)
mov %eax, 0x8(%esi)

which is shorter in code size.

This patch fixes PR5124, it uses feedback on the previous patch discussed in the thread
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130916/188079.html .

Diff Detail

Event Timeline

sepavloff updated this revision to Diff 14236.Sep 30 2014, 10:24 AM
sepavloff retitled this revision from to [X86] New pass that moves immediate operands to registers..
sepavloff updated this object.
sepavloff edited the test plan for this revision. (Show Details)
sepavloff added a subscriber: Unknown Object (MLST).
hliao added a subscriber: hliao.Sep 30 2014, 7:46 PM

Shall we do this carefully with consideration of register pressure?
Even it reduces the code size, but it increases register pressure. Do
you have benchmark results on the impact of this pass?

  • Michael
joerg added a subscriber: joerg.Sep 30 2014, 11:58 PM

Could you take a look at http://llvm.org/bugs/show_bug.cgi?id=9517 and see if your pass could help with that?

Unfortunately no. When this pass runs asm statements are not expanded yet.

Thanks,
--Serge

2014-10-01 13:58 GMT+07:00 Joerg Sonnenberger <joerg@netbsd.org>:

Could you take a look at http://llvm.org/bugs/show_bug.cgi?id=9517 and
see if your pass could help with that?

http://reviews.llvm.org/D5544

Hi Serge,

A couple of general comments:

  • None of the comment uses the doxygen style: / instead of . Please fix it.
  • Comments start with a capital letter and end with a period (or any appropriate punctuation :)).
  • As I said in one of my email, we should do that just for Os and Oz functions (i.e., look for function attribute: OptimizeForSize and MinSize).

The overall approach seems more complicated than I would expect for such a pass. In particular, I think that all the InReg, Scheduled, and Waiting states could be avoided, as well as all the history thing with the expiration date.
My understanding is that you try to limit the impact of this transformation on the register pressure. That seems fragile to me (there are a lot of magic numbers involved here for the different limits) and harder than it should be to understand.

That I would recommend is having more faith in the register allocator. The live-ranges, you are creating/extending are rematerializable. Therefore, if the register pressure becomes too high, the register allocator can theoritically rematerialize the values instead of spilling.
The main problem then, is that you may end up with code like this:
reg = mov imm
reg2 = op reg3, reg
instead of
reg2 = op reg3, imm

That said, we could probably teach the register allocator to “fold” the operand as we do for memory load, so I wouldn’t worry too much about that.

The bottom line is that I would expect the pass to be a relatively simple scan that creates appropriate constants when the whole basic block has been traversed.
I.e., something like:
ConstantsUsage // Sort of dictionary. Map a constant to its uses.
for each instr in bb

if instr can use register variant
  record instr, instr.imm in ConstantsUsage // This is could be the trick part. See below for some thoughts.

for each constant in ConstantsUsage

if constant is profitable
  materialize constant // This creates one constant and update all its uses using the appropriate subreg, removing the redundant load imm, etc.

Regarding ‘record instr, imm, ConstantsUsage’.
The hard part, per say, is to enable reuse of wide constant, e.g., 0x0fff is used for 0xff.
I believe that you wouldn’t have that many constants to look through per basic block and a linear search may be sufficient.
Now, if we want something faster, we could do some trick like having a mapping per size, sorted by profitability, etc.

What do you think?

BTW, you could also have some function level scope if you really are interested at saving as much size as possible.

Cheers,
-Quentin

lib/Target/X86/X86MaterializeImmediates.cpp
10

s/resister/register

32

I believe this should come after the includes. In other words, usually right after using namespace llvm.

50

I believe most users won’t mess with the related option and I rather have the number I asked on the command line than a hidden maximal value.
The bottom line is, please do not use a hard-coded maximal value here.

54

Ditto.

58

Ditto.

62

Ditto.

66

Ditto.

72

Init this with the maximal value.

79

Ditto.

88

Ditto.

93

Ditto.

98

Ditto.

112

I would explicitly set the value, to be sure anyone updating this code matches the intent.

130

Period at the end of the comments.

134

IsLoadImm maybe?

135

For this patch I think it is fine to have all the information written here, but in the future it would be nice if it could be part of tablegen.
I do not like having yet another place to update when I add instructions.

241

Could you add a comment on that function?
I do not understand its purpose with just its name.

243

What do these comments mean?

253

Shouldn’t we assert Entire is bigger than Part?

269

Ditto.

795

nullptr instead of 0.

947

expences -> expenses.

Hi Quentin,

Thank you very much for your detailed review.

2014-10-04 4:34 GMT+07:00 Quentin Colombet <qcolombet@apple.com>:

Hi Serge,

A couple of general comments:

  • None of the comment uses the doxygen style: / instead of . Please fix it.
  • Comments start with a capital letter and end with a period (or any appropriate punctuation :)).
  • As I said in one of my email, we should do that just for Os and Oz functions (i.e., look for function attribute: OptimizeForSize and MinSize).

OK, will fix these.

The overall approach seems more complicated than I would expect for such a
pass. In particular, I think that all the InReg, Scheduled, and Waiting
states could be avoided, as well as all the history thing with the
expiration date.

History was introduced to track only "fresh" immediate values. A value that
is not used for a long time would occupy memory and would increase search
time, with long basic block these might be substantial. On the other hand,
only values used in close vicinity of the current instruction should be
taken into account. History solves the problem of maintaining compact and
actual pool of values. It does not require memory allocation and has
predictable search time. This pass was made as a kind of digital filter
that processes instruction flow, terms 'history' and 'time' were borrowed
from there. Using value history is attractive due to guaranty of limited
expenses in memory and execution time.

As for InReg, Scheduled and Waiting states, they exist to postpone
decision, what register class is to be used to store immediate. If we see
that some 32 bit value is profitable to materialize but then see use of 64
bit values, lower half of which is that 32 bit value, we could use part of
64-bit value, if we inserted appropriate load instruction. It looks like
this code indeed may be simplified.

My understanding is that you try to limit the impact of this
transformation on the register pressure. That seems fragile to me (there
are a lot of magic numbers involved here for the different limits) and
harder than it should be to understand.

That I would recommend is having more faith in the register allocator. The
live-ranges, you are creating/extending are rematerializable. Therefore, if
the register pressure becomes too high, the register allocator can
theoritically rematerialize the values instead of spilling.
The main problem then, is that you may end up with code like this:
reg = mov imm
reg2 = op reg3, reg
instead of
reg2 = op reg3, imm

That said, we could probably teach the register allocator to “fold” the
operand as we do for memory load, so I wouldn’t worry too much about that.

Yes, this is good idea, to teach register allocator about spilling
registers loaded from immediates. If register allocator could handle such
register enough flexibly, it would minimize impact of this pass on register
pressure and some magic constants in this pass could be avoided.

The bottom line is that I would expect the pass to be a relatively simple
scan that creates appropriate constants when the whole basic block has been
traversed.
I.e., something like:
ConstantsUsage // Sort of dictionary. Map a constant to its uses.
for each instr in bb

if instr can use register variant
  record instr, instr.imm in ConstantsUsage // This is could be the

trick part. See below for some thoughts.

for each constant in ConstantsUsage

if constant is profitable
  materialize constant // This creates one constant and update all its

uses using the appropriate subreg, removing the redundant load imm, etc.

In this approach we have full information about immediate use at the end of
basic block, this simplifies decision about materialization. The only
drawback is higher resource consumption, both memory (need to keep all
values in BB) and execution time (make two passes instead one). Maybe
enhancements in register allocator can alleviate this problem. For
instance, all values that could be profitable are loaded into virtual
registers and the register allocator decides which values should be
converted into immediates, it anyway use heuristics to map physical
registers.

Regarding ‘record instr, imm, ConstantsUsage’.

The hard part, per say, is to enable reuse of wide constant, e.g., 0x0fff
is used for 0xff.

It is expected in memset expansion. In other cases probably this is rare
case.

I believe that you wouldn’t have that many constants to look through per
basic block and a linear search may be sufficient.
Now, if we want something faster, we could do some trick like having a
mapping per size, sorted by profitability, etc.

What do you think?

For me using history is attractive due to predictable resource consumption,
but probably in most cases a map is enough, this needs investigation.

BTW, you could also have some function level scope if you really are
interested at saving as much size as possible.

Looks like a job for RegisterCoalescer?

Cheers,
-Quentin

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:10
@@ +9,3 @@
+
+
This file defines the pass which will move immediate operand into a
resister
+// if it is used in several adjacent instructions. This transformation
tries to


s/resister/register

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:32
@@ +31,3 @@
+
+#define DEBUG_TYPE "x86-imm2reg"
+#include "X86.h"


I believe this should come after the includes. In other words, usually
right after using namespace llvm.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:50
@@ +49,3 @@
+// may be set smaller by supplying proper command line option.
+const unsigned MaxHistoryDepth = 8;
+


I believe most users won’t mess with the related option and I rather have
the number I asked on the command line than a hidden maximal value.
The bottom line is, please do not use a hard-coded maximal value here.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:54
@@ +53,3 @@
+// may be set smaller by supplying proper command line option.
+const unsigned MaxHistoryWidth = 4;
+


Ditto.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:58
@@ +57,3 @@
+// immediate value before it is dropped from history.
+const unsigned DefMaxSeparatingInstrs = MaxHistoryWidth;
+


Ditto.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:62
@@ +61,3 @@
+// values.
+const unsigned DefMaxRegisters = 2;
+


Ditto.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:66
@@ +65,3 @@
+// register.
+const unsigned DefMinProfit = 1;
+


Ditto.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:72
@@ +71,3 @@
+ cl::desc("Max number of immediate values tracked in history."),
+ cl::init(MaxHistoryWidth), cl::Hidden);
+


Init this with the maximal value.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:79
@@ +78,3 @@
+ cl::desc("Max number of uses stored for a tracked immediate value."),
+ cl::init(MaxHistoryDepth), cl::Hidden);
+


Ditto.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:88
@@ +87,3 @@
+ cl::desc("Max number of instructions separating two immediate uses."),
+ cl::init(DefMaxSeparatingInstrs), cl::Hidden);
+


Ditto.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:93
@@ +92,3 @@
+ cl::desc("Max number of registers used for
materialization."),
+ cl::init(DefMaxRegisters), cl::Hidden);
+


Ditto.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:98
@@ +97,3 @@
+ cl::desc("Minimal number of bytes that materialization must save."),
+ cl::init(DefMinProfit), cl::Hidden);
+


Ditto.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:112
@@ +111,3 @@
+ // Must be 1 << DATA_N == size in bytes
+ DATA_8,
+ DATA_16,


I would explicitly set the value, to be sure anyone updating this code
matches the intent.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:130
@@ +129,3 @@
+ int Opcode;
+ int NewOpCode; Opcode if imm is replaced by reg
+ ImmediateSize Size;
of immediate data


Period at the end of the comments.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:134
@@ +133,3 @@
+ int Profit; Gain in bytes if imm is changed to reg
+ bool IsLoad;
is this an instruction like 'mov imm, reg'?
+};


IsLoadImm maybe?

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:135
@@ +134,3 @@
+ bool IsLoad; // is this an instruction like 'mov imm, reg'?
+};
+


For this patch I think it is fine to have all the information written
here, but in the future it would be nice if it could be part of tablegen.
I do not like having yet another place to update when I add instructions.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:241
@@ +240,3 @@
+
+static int GetSizeOfMovImmToRegInstr(int OpCode) {
+ if (OpCode == X86::MOV16ri)


Could you add a comment on that function?
I do not understand its purpose with just its name.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:243
@@ +242,3 @@
+ if (OpCode == X86::MOV16ri)
+ return 4; // 66 B8 iw
+ if (OpCode == X86::MOV32ri)


What do these comments mean?

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:253
@@ +252,3 @@
+
+static unsigned GetSubreg(ImmediateSize Entire, ImmediateSize Part) {
+ if (Entire == Part)


Shouldn’t we assert Entire is bigger than Part?

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:269
@@ +268,3 @@
+static unsigned GetSubregByBytes(unsigned Entire, unsigned Part) {
+ if (Entire == Part)
+ return X86::NoSubRegister;


Ditto.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:795
@@ +794,3 @@
+ InstrInfo *Rec = OpCodeTable::Find(Instr.getOpcode());
+ if (Rec == 0)
+ return false;


nullptr instead of 0.

================
Comment at: lib/Target/X86/X86MaterializeImmediates.cpp:947
@@ +946,3 @@
+
+ If use set of the value contains load instruction, expences for load
+
instruction can be decreased.


expences -> expenses.

http://reviews.llvm.org/D5544