This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Generate fewer instructions for (sub <constant>, x)
AbandonedPublic

Authored by uweigand on Jul 5 2016, 1:42 PM.

Details

Summary

Created patterns to match subtraction of a variable from a constant to reduce the number of instructions and needed registers. The idea is to load the complement and use addition.
For example, for (sub i64 63, x):

before the change, the following is generated:

lghi    %r0, 63
sgr     %r0, %r2
lgr     %r2, %r0

after the change, the following is generated (less instructions):

lghi    %r0, 63
sgrk    %r2, %r0, %r2

Diff Detail

Event Timeline

assem updated this revision to Diff 62782.Jul 5 2016, 1:42 PM
assem retitled this revision from to [SystemZ] Generate fewer instructions for (sub <constant>, x).
assem updated this object.
assem added reviewers: uweigand, zhanjunl.
assem added a subscriber: llvm-commits.
assem updated this object.Jul 5 2016, 2:50 PM
uweigand edited edge metadata.Jul 6 2016, 7:17 AM

So I see that we get less instructions in this particular case. However, I'm wondering whether execution is also *faster*, given that the new code seems to create longer dependency chains. Normally I'd like to compare benchmark runs before making a change like that ...

Also, I'm wondering why we can't have the best of both worlds: just two instructions *and* the shorter dependency chain, by creating instead something like:

lghi %r0, 63
sgrk %r2, %r0, %r2

(assuming we're on z196 and higher. But then I don't think z10 is a big concern these days anyway.)

assem added a comment.Jul 7 2016, 8:20 AM

Thanks for the comment, Ulrich!

Compared to the old version the new version uses less instructions and only one register. Compared to your suggested change, the new version uses 1 less register. I can run some performance tests (micro benchmarks) to test the performance impact of the new change, but I thought the advantage is clear.

I am not sure what you do mean by the "longer" dependency chain. If I understand correctly, the old version has longer dependency chain and your suggested change has the same dependency chain length (sgrk requires result from lghi) as the new version in the patch. If I misunderstood, can you please clarify?

Thanks

By longer dependency chain I mean this: With the my new variant, the final result depends on the SUBTRACT, which has two inputs, the original input and the constant in another register. Those two inputs do not depend on one another, which means the ouf-of-order logic in the processor can schedule them independently of each other, and they could in particular run in parallel (or loading the constant can happen very early). So the overall number of cycles from the point the original input is available to the point the output is available is just the latency time of the SUBTRACT, and loading the constant may not affect the overall run time at all. [ As always, this is subject to many details, and can be wrong if the OOO logic is already heavily loaded. ]

However, with your variant, the overall result depends on the ADD IMMEDIATE, which depends on the LOAD COMPLEMENT, which depends on the original input. This means the overall latency of the operation is always the *sum* of the latencies of the two instructions. [ Since all instructions in question here are simple integer arithmetic, they will all have a pipelined latency of 1 cycle, so the new variant in effect doubles the latency. ]

[ With the original code with the additional LOAD REGISTER, there's indeed also a longer dependency chain; but that is really limited to this specific test case where input and output are forced into the same register due the ABI constraints ... in general, if the code occurs in the middle of a larger sequence, the register allocator will usually make that LOAD REGISTER go away. In fact, I'm wondering why the register allocator doesn't make it go away even in this case by using the three-operand instruction variant ... ]

assem updated this revision to Diff 63711.Jul 12 2016, 12:24 PM
assem updated this object.
assem edited edge metadata.

I have updated the patch based on Ulrich's comments, i.e., sgrk is used now. I have also updated the summary of the patch.

This can lead to wrong code generation, since you now unconditionally introduce SGRK, even on platforms (like z10) that don't actually have that instruction.

Also, in general SGRK is *not* better than SGR. Only if there is a register conflict should we use the three-address form. Note that there is a separate pass to make that distinction. The way this is supposed to work is that at instruction selection, we generate the two-address from, and then the two-address instruction pass switches this to the three-address form if it considers this form more profitable in any given situation.

This happens in lib/CodeGen/TwoAddressInstructionPass.cpp, in TwoAddressInstructionPass::tryInstructionTransform, which uses the subroutine TwoAddressInstructionPass::isProfitableToConv3Addr to check whether to make that transformation. Having a quick look at that routine, it seems that it currently does not recognize the situation in your test case:

// Look for situations like this:
// %reg1024<def> = MOV r1
// %reg1025<def> = MOV r0
// %reg1026<def> = ADD %reg1024, %reg1025
// r2            = MOV %reg1026
// Turn ADD into a 3-address instruction to avoid a copy.
unsigned FromRegB = getMappedReg(RegB, SrcRegMap);
if (!FromRegB)
  return false;
unsigned ToRegA = getMappedReg(RegA, DstRegMap);
return (ToRegA && !regsAreCompatible(FromRegB, ToRegA, TRI));

So it converts to three-address form if, in the terms of the example above, reg1026 and reg1024 are both tied to hard registers, but different ones. This is correct, since in that situation using a two-address form forces a copy. But this situation does not apply in your test case.

However, there is a *different* case when a copy is forced, and that is when reg1026 and reg1025 are both tied to the same hard register, because this means that we cannot also use that same hard register for the value in reg1024, and thus we also need an extra copy for the two-address form. This is the situation in your test case.

So I think the proper fix would be to improve that routine to detect this case as well, which should then fix the test case without requiring any SystemZ back-end changes.

assem added a comment.Aug 9 2016, 11:23 AM

Thanks Ulrich. I have tried to implement what you have suggested in the two-address pass. However, it seems that checking for reg1026 and reg1025 being mapped to the same physical register is very broad and can catch a lot of cases that we don't want to convert to 3-address form.

I have looked at the log file generated by -debug and -print-after-all, and I think that the correct thing to do is to include a pass after register allocation to check for cases where a copy is done after a 2-address instruction. I am not sure if there is a better place to implement this. I appreciate any suggestions.

Thanks Ulrich. I have tried to implement what you have suggested in the two-address pass. However, it seems that checking for reg1026 and reg1025 being mapped to the same physical register is very broad and can catch a lot of cases that we don't want to convert to 3-address form.

Huh. Do you have an example where this happens (i.e. the two are mapped to the same physical register but we do *not* want to convert to 3-address form)?

jonpa added a subscriber: jonpa.Jun 7 2019, 11:59 PM

This test has now been added as int-sub-11.ll by r362868 "Favor 3-address instructions during instruction selection".

I suppose this patch can be abandoned now?

uweigand commandeered this revision.Jun 9 2019, 5:18 PM
uweigand abandoned this revision.
uweigand edited reviewers, added: assem; removed: uweigand.