This is an archive of the discontinued LLVM Phabricator instance.

LSR tunings for SystemZ, with some minor common code changes
ClosedPublic

Authored by jonpa on Jul 6 2017, 4:17 AM.

Details

Summary

The purpose of this patch is to make LSR generate better code for SystemZ in the cases of memory intrinsics and comparison of immediate with memory. These instructions in particular can have no index register and can only accept a small immediate offset. Improvements on benchmarks have been confirmed.

In order to achieve this, the following common code changes were made:

  • New TTI hook: LSRWithInstrQueries(), which defaults to false. Controls if LSR should do instruction-based addressing evaluations by calling isLegalAddressingMode() and isFoldableMemAccessOffset() with the Instruction pointers.
  • isLegalAddressingMode() gets a new optional Instruction* parameter (defaults to nullptr) used by LSR if Target returns true in LSRWithInstrQueries(). All target methods have been updated as well.
  • In LSR / isAddressUse(): handle address operands of memset, memmove and memcpy as address uses.
  • In LSR / RateFormula(): Don't add to ImmCost if the instructions are already checked. It only adds confusion when the results are otherwise equal. Call isFoldableMemAccessOffset() for any LSRUse::Address, not just loads / stores.
  • In LSR / isAMCompletelyFolded(): Let target look at instructions if it returns true in LSRWithInstrQueries().

    SystemZ:
  • isLSRCostLess() overriden to check instruction counts like X86 does it.
  • isLegalAddressingMode() and isFoldableMemAccessOffset() improved to handle memcpy and compare imm w/ mem.
  • LSRWithInstrQueries() returns true
  • minor updates of tests dag-combine-01.ll and loop-01.ll
  • Two new tests in loop-01.ll

Diff Detail

Event Timeline

jonpa created this revision.Jul 6 2017, 4:17 AM
qcolombet requested changes to this revision.Jul 10 2017, 3:31 PM

Hi Jonas,

Unless I am mistaken, I see three different changes here.
First batch is

    • New TTI hook: LSRWithInstrQueries(), which defaults to false. Controls if LSR should do instruction-based addressing evaluations by calling isLegalAddressingMode() and isFoldableMemAccessOffset() with the Instruction pointers.
  • isLegalAddressingMode() gets a new optional Instruction* parameter (defaults to nullptr) used by LSR if Target returns true in LSRWithInstrQueries(). All target methods have been updated as well.
    • In LSR / isAMCompletelyFolded(): Let target look at instructions if it returns true in LSRWithInstrQueries().

Second batch is

  • In LSR / isAddressUse(): handle address operands of memset, memmove and memcpy as address uses.

Third batch is

  • In LSR / RateFormula(): Don't add to ImmCost if the instructions are already checked. It only adds confusion when the results are otherwise equal. Call isFoldableMemAccessOffset() for any LSRUse::Address, not just loads / stores.

Could you please split the patch accordingly?

More comments inlined.

Cheers,
-Quentin

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1657

I feel that this code does not belong here.

Indeed, we have quite a few isAMCompletelyFolded overloaded functions, and I believe not all invocations would go through that specific instance.
Instead, I would have expected this to happen to the lower most version of the isAMCompletelyFolded version. The one that calls isLegalAddressingMode.

test/CodeGen/SystemZ/loop-01.ll
243

Could you run opt -instnamer on the IR?

This revision now requires changes to proceed.Jul 10 2017, 3:31 PM
jonpa updated this revision to Diff 106034.Jul 11 2017, 8:39 AM
jonpa edited edge metadata.

Test updated with opt -instnamer per request.

jonpa updated this revision to Diff 106042.Jul 11 2017, 9:03 AM
jonpa marked an inline comment as done.

I broke out "second batch" (isAddressUse()) into https://reviews.llvm.org/D35262

I however think 1 & 3 both depend on LSRWithInstrQueries(), so I let them remain here together, if that's ok?

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1657

The reason that I put it here, is because this is where LU is available. The check can't be done without LU (which has the Fixups), so if it's not placed here, the argument lists of the other versions must be changed, as well as the call sites of it and isLegalUse() (and possibly more?) to make the Fixups available. Is this what you have in mind, and if so should LU replace the other LU-arguments like MinOffset etc?

test/CodeGen/SystemZ/loop-01.ll
243

done.

I see why second batch depends on this patch. I wonder if it should though.
Hence, do we rely guard that check with LSRWithInstrQueries?

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1282

Why would we guard that check with TTI.LSRWithInstrQueries, whereas it was guarded previously.

1286

This slightly changes how/when we accumulate ImmCost.
Is that intentional?

1657

The thing that I don't like is that what we're basically copying what's inside the bottom most version of isAMCompletelyFolded in the Address case. If possible I would have liked we call that code.

Could we keep that loop here but call the bottom most isAMCompletelyFolded with an additional Instr parameter?

jonpa updated this revision to Diff 106151.Jul 12 2017, 1:19 AM

Call isAMCompletelyFolded() with Instr parameter instead of duplicating code.

jonpa added inline comments.Jul 12 2017, 1:21 AM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
1282

The idea here is that ImmCost is not updated with Offset when Target checks it for each fixup in isFoldableMemAccessOffset(). I had found loop regressions where the NumBaseAdds were the same, but ImmCost were different, and I found that this would be resolved by not updating ImmCost with Offset. I am not sure exactly why -- my guess is that the better formulas (including pre-LSR/input) go first.

1286

yes -- SA

1657

yes - that works just the same.

qcolombet added inline comments.Jul 12 2017, 10:55 AM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
1282

Two things regarding this comment:

  • 1. ---

The idea was that if NumBaseAdds is the same, ImmCost is used as a tie breaker.
If I understand correctly you're saying not using this as a tie breaker generates better code. I found that concerning.

Could you dig into that before moving forward?

Assuming you're guess is correct, we should document that fact and make sure the order in the list is not pure luck.

  • 2. ---

This does not answer the question why we should guard this check with LSRWithInstrQueries given it wasn't guarded previously. My concern here is that we change a fairly high weighted piece of the formulae rating and given all other target would return false for LSRWithInstrQueries, I am afraid it will affect their performance across the board.

jonpa added a comment.Jul 20 2017, 2:15 AM

The idea was that if NumBaseAdds is the same, ImmCost is used as a tie breaker.
If I understand correctly you're saying not using this as a tie breaker generates better code. I found that concerning.

Could you dig into that before moving forward?

  1. If I move back one step and update the ImmCost like
    if (LU.Kind == LSRUse::Address && TTI.LSRWithInstrQueries()) {
       if (!TTI.isFoldableMemAccessOffset(Fixup.UserInst, Offset))
         C.NumBaseAdds++;
-    } else
-      C.ImmCost += APInt(64, Offset, true).getMinSignedBits();
+    }
+
+    C.ImmCost += APInt(64, Offset, true).getMinSignedBits();
   }

I get that on SPEC-2006, now 717 loops are bigger (as expected), while 361 are smaller. So this affects only ~2% of the loops.

Without ImmCost on address fixups:
717 loops better

205 better because live-out PHI duplicated and modified slightly in other version  (#0)
177 better because of more spilling in other version (#1)
280 better because more COPYs in other version (#2)
132 better becasue less loads of constants in loops (#3)
 56 better because of less immediate adds
    42 before a compare using same IV (#4)
    14 before a memory instruction
 83 better because more fused compare / branches (#5)
--- left:
 (27 loops better)

With ImmCost on address fixups:
361 loops better

  5 better because live-out PHI duplicated and modified slightly in other version   (#0)
173 better because of more spilling in other version (#1)
165 better because more COPYs in other version (#2)
 25 better becasue less loads of constants in loops (#3)
 99 better because of less immediate adds
    89 before a compare using same IV (#4)
    10 before a memory instruction
 26 better because more fused compare / branches (#5)
--- left:
 (22 loops better)

#0: It seems that LSR is not aware if a live-out PHI is reused or not. The formula that reuses a live-out phi does not need the extra phi, which should mean a cost difference of one add in the loop. So if LSR comes up with a formula that offsets the original PHI with some offset which makes the total Offset bits of the fixups less, it will insert a new phi.

Since this is clearly bad (205/5), my guess is that LSR most of the time manages to first use an existing PHI. Without the fixups ImmCost, the formulas have the same cost, and the formula with the original PHI remains.

#1: These loops got randomly different number of spills / reloads in them. This is a current regalloc flaw, and there is a patch in progress to handle it by Wei Mi (https://bugs.llvm.org/show_bug.cgi?id=32722). This happens evenly (177/173) w/ or w/out ImmCost, so this isn't an LSR issue.

#2: The coalescing isn't perfect, and I have for instance seen cases where due to the pre-RA-scheduler the IV increment may be scheduled above the last use of it, prohibiting coalescing of the IV. As a result, there is an extra needless copy after the IV increment. Not sure why there are more COPYs with ImmCost.

#3: Not sure why, but it seems that a lot less loads of constants got hoisted out of loops with ImmCost. Maybe different register allocation for some reason?

#4: It seems that ImmCost did reduce the number of adds of immediates before compares somewhat (42/89).

#5: A few more compares got fused into the branch without ImmCost (83/26)

So, there are maybe a thing or two here possible to improve, but I am not sure if these details affect more than the 2% of the loops or not...

I think for SystemZ, a first step might be to just skip the ImmCost in the cost function, since that actually only affects 18 loops, so that is surely an easier way if this patch cannot be accepted as is. In that case, I would revert the LoopStrenghtReduce patch changes related to ImmCost.

This does not answer the question why we should guard this check with LSRWithInstrQueries given it wasn't guarded previously. My concern here is that we change a fairly high weighted piece of the formulae rating and given all other target would return false for LSRWithInstrQueries, I am afraid it will affect their performance across the board.

I was just seing that only SystemZ is using isFoldableMemAccessOffset(), but I suppose there are out-of-tree targets that you are concerned about?

jonpa updated this revision to Diff 107533.Jul 20 2017, 8:47 AM

Patch updated so that SystemZ drops ImmCost from its implementation of isLSRCostLess() since this simplifies the common code changes and has nearly no impact at all.

jonpa updated this revision to Diff 107535.Jul 20 2017, 8:55 AM
jonpa marked an inline comment as done.

Sorry - missed a comment that needed updating.

uweigand edited edge metadata.Jul 20 2017, 10:59 AM

A few comments/questions, looking only at the SystemZ-specific parts.

I'm now wondering what the difference between isLegalAddressingMode and isFoldableMemAccessOffset is, now that they both take a specific instruction. Don't these two functions now answer the exact same question? And if so, shouldn't the implementation then be merged? The following questions about differences between the two would be moot if we actually had a shared implementation here.

For example, you change isLegalAddressingMode to check for hasNoIndexReg, but not needsD12. If the function does get an Instruction, and we can see that it won't accept large displacements, shouldn't we then reject the address from isLegalAddressingMode as well?

Also, the various checks for float/vector access in isFoldableMemAccessOffset -- shouldn't they move to hasLessAddressing? We know that float/vector accesses won't accept large displacements, so shouldn't the function say so?

The comment before hasLessAddressing talks about accesses being converted to vector instructions, but that happens only on z13 or later. Shouldn't this then be guarded by a Subtarget feature check just like the existing code in isFoldableMemAccessOffset does?

Finally, I'm not really happy about the names ... hasLessAddressing sound a bit strange. Maybe instead a function called "supportedAddressingMode" that takes an Instruction and returns a description of the addressing mode (long vs. short displacement, index vs. no index), either as an enum or as a pair of booleans,

qcolombet accepted this revision.Jul 20 2017, 11:41 AM

Hi Jonas,

Thanks for digging into the regressions.

The patch makes sense now.

#0: It seems that LSR is not aware if a live-out PHI is reused or not. The formula that reuses a live-out phi does not need the extra phi, which should mean a cost difference of one add in the loop. So if LSR comes up with a formula that offsets the original PHI with some offset which makes the total Offset bits of the fixups less, it will insert a new phi.

Since this is clearly bad (205/5), my guess is that LSR most of the time manages to first use an existing PHI. Without the fixups ImmCost, the formulas have the same cost, and the formula with the original PHI remains.

That sounds sensible, we should improve that in a different patch.

Cheers,
-Quentin

This revision is now accepted and ready to land.Jul 20 2017, 11:41 AM

BTW, my review does not include SystemZ changes ;)

jonpa updated this revision to Diff 107638.Jul 21 2017, 1:50 AM

SystemZ part updated per review.

I'm now wondering what the difference between isLegalAddressingMode and isFoldableMemAccessOffset is, now that they both take a specific instruction. Don't these two functions now answer the exact same question? And if so, shouldn't the implementation then be merged? The following questions about differences between the two would be moot if we actually had a shared implementation here.

I agree this isn't perfect. I also noticed this before, and thought this could be possibly refactored but however also noticed the many different overloaded versions of isLegalAddressingMode() and thought that it could wait as a next step, perhaps. RateFormula() currently only checks the fixup instructions for the offsets, so I am not sure this is a trivial change.

For example, you change isLegalAddressingMode to check for hasNoIndexReg, but not needsD12. If the function does get an Instruction, and we can see that it won't accept large displacements, shouldn't we then reject the address from isLegalAddressingMode as well?

Makes sense -- see below.

Also, the various checks for float/vector access in isFoldableMemAccessOffset -- shouldn't they move to hasLessAddressing? We know that float/vector accesses won't accept large displacements, so shouldn't the function say so?

Done.

The comment before hasLessAddressing talks about accesses being converted to vector instructions, but that happens only on z13 or later. Shouldn't this then be guarded by a Subtarget feature check just like the existing code in isFoldableMemAccessOffset does?

Good point - fixed.

Finally, I'm not really happy about the names ... hasLessAddressing sound a bit strange. Maybe instead a function called "supportedAddressingMode" that takes an Instruction and returns a description of the addressing mode (long vs. short displacement, index vs. no index), either as an enum or as a pair of booleans.

Changed it and agree this is more clear by returning both values as a pair. Does it look acceptable now?

I see then two points that could be handled in separate patches:

  • Model the cost of live-out PHIs. I actually see now that this is supposed to be handled with isExistingPhi(), but I guess could always try to look further why it appeared to not always work for me.
  • Try if possible to merge isLegalAddressingMode() with isFoldableMemAccessOffset() and also improve isLegalAddressingMode() to check the offset. I have inserted a TODO comment so that we remember this.
uweigand accepted this revision.Jul 21 2017, 3:05 AM

I agree this isn't perfect. I also noticed this before, and thought this could be possibly refactored but however also noticed the many different overloaded versions of isLegalAddressingMode() and thought that it could wait as a next step, perhaps. RateFormula() currently only checks the fixup instructions for the offsets, so I am not sure this is a trivial change.

Right, changing the whole common code interface would be a larger effort. I was primarily concerned about the SystemZ implementation of those routines -- they should be doing the same thing, ideally by sharing code. However, with this latest version of the patch, they actually already do share most of the code, so my concern is mostly resolved. If we can go on and unify the offset checks (i.e. have the same 12- or 20-bit checks in both routines), then we should be completely OK.

I understand this change might lead to changes in generated code, which might have other performance impacts. So I'd be fine with checking in this patch, and then following up with another patch to fix the offset checks (after doing another round of benchmarking).

Changed it and agree this is more clear by returning both values as a pair. Does it look acceptable now?

Yes, this does indeed look much better now.

The SystemZ changes overall LGTM now, just some minor in-line comments.

lib/Target/SystemZ/SystemZISelLowering.cpp
588

Just as a minor readability enhancement, I'd move the supportedAddressingMode check down here, and write the whole thing like this:

if (I != nullptr && !supportedAddressingMode(I, Subtarget.hasVector()).IndexReg)
  // No indexing allowed.
  return AM.Scale == 0;
else
  // Indexing is OK but no scale factor can be applied.
  return AM.Scale == 0 || AM.Scale == 1;
623

Typo: displacement

633

Hmm. I know the code didn't before either, but shouldn't we check whether the Offset fits into 20 bits here? Maybe at least add a TODO if we don't want to change it right now.

jonpa closed this revision.Jul 21 2017, 5:03 AM
jonpa marked 3 inline comments as done.

Thanks for review!
r308729

lib/Target/SystemZ/SystemZISelLowering.cpp
588

Done.

633

I think my idea was that isLegalAddressingMode() would handle that during formula generation. I anyhow added a TODO and will check this for the next patch.