This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Increase the number of VLREPs
ClosedPublic

Authored by jonpa on Nov 8 2018, 7:26 AM.

Details

Reviewers
uweigand
Summary

As I experimented with making an fp-vector element 0 insert free (given that the "vector merge high" instruction is available, which takes two input vectors and combines them into one), I ran into a regression. In the end, it seemed like the SLP vectorizer was doing one more vectorization (4 instead of 3) in a function, which ended up causing significantly more vector operands. This is probably worth looking into at some point.

However, one of the things I also noticed was many VL64 -> VREPG sequences that could seemingly have been just a single VLREPG. IIUC, SystemZ buildVector() creates a REPLICATE node if there is a loaded element, with the intent of this becoming a VLREP. It seems that this folding does not happen if the load has more than one user.
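
To illustrate the pattern (a hypothetical example using GCC/Clang vector extensions, not code from the actual test case): a loaded double that is both splatted into a vector and used as a scalar gives the load more than one user, which is exactly the situation where the REPLICATE node could not previously fold the load.

  // Hypothetical example only: the load of *p has two users (the splat and
  // the scalar add), so the splat could not previously fold the load into a
  // single load-and-replicate instruction.
  typedef double v2f64 __attribute__((vector_size(16)));

  double splat_and_use(const double *p, v2f64 *out, double y) {
    double x = *p;         // load with more than one user
    v2f64 v = {x, x};      // replicated (splatted) use
    *out = v;
    return x + y;          // scalar use
  }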

To remedy this, I experimented with a patch that handles these cases by rewriting the other users of the load to use element 0 of the REPLICATE node instead of the load itself. This way, the load has only the REPLICATE node as its user, and we get a VLREP.
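
A minimal sketch of how such a rewrite could look at the SelectionDAG level is below. This is only an illustration of the idea, not the code in this patch; the helper name, its call site, and the exact handling of CSE'd users are assumptions.

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/CodeGen/SelectionDAG.h"
  #include "llvm/CodeGen/SelectionDAGNodes.h"
  using namespace llvm;

  // Illustrative only: redirect all other value users of Load to element 0 of
  // Replicate, so that the REPLICATE node becomes the load's only user and
  // the load can fold into a VLREP.  Chain users (result 1 of the load) are
  // left untouched.
  static void redirectLoadUsers(SelectionDAG &DAG, const SDLoc &DL,
                                SDValue Load, SDValue Replicate) {
    // Collect the users first; rewriting operands while iterating the use
    // list would invalidate the iterator.
    SmallVector<SDNode *, 4> OtherUsers;
    for (SDNode::use_iterator UI = Load.getNode()->use_begin(),
                              UE = Load.getNode()->use_end();
         UI != UE; ++UI) {
      if (UI.getUse().getResNo() != 0)      // skip uses of the chain result
        continue;
      if (*UI == Replicate.getNode())       // keep the REPLICATE itself
        continue;
      OtherUsers.push_back(*UI);
    }
    if (OtherUsers.empty())
      return;

    // Element 0 of the replicated vector holds the loaded value.
    SDValue Elt0 =
        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, Load.getValueType(),
                    Replicate, DAG.getConstant(0, DL, MVT::i32));

    for (SDNode *User : OtherUsers) {
      SmallVector<SDValue, 8> Ops;
      for (const SDValue &Op : User->op_values())
        Ops.push_back(Op == Load ? Elt0 : Op);
      // UpdateNodeOperands may CSE to an existing node; redirect uses if so.
      SDNode *New = DAG.UpdateNodeOperands(User, Ops);
      if (New != User)
        DAG.ReplaceAllUsesWith(User, New);
    }
  }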

So far, I have only looked at my test case, which was floating point (calculix/Utilities_DV), and I am not sure of all the implications. It just seems better to get VLREP in more cases... For the files that get a different opcode count compared to trunk on SPEC, see

I think the regression I saw during my experiments mostly disappeared with this fix. I am not sure about the general performance effects yet.

Diff Detail

Event Timeline

jonpa created this revision. Nov 8 2018, 7:26 AM

I'm not sure that this is always a win. In fact, this may depend on the type of the loaded value.

For floating point values (which can typically be re-used directly from the element 0 slot of the replicated VR), I agree that this approach looks useful. But for integer values (which will then need to be moved to a GR), there seem to be drawbacks.

Looking at the first example in your list, I see:

opcode         :                trunk              patched    delta
vlvgp          :                    9                    0       -9
vlgvf          :                    0                    9       +9
l              :                  166                  157       -9
vrepf          :                   11                    2       -9
vlrepf         :                    0                    9       +9

This appears to indicate that we have a 32-bit integer value that is being used both as a scalar in a GPR and as a replicated vector in a VR. The original code would have been something like

L -> GPR -> use GPR
         -> VLVGP -> VR -> VREPF -> use VR

This gives a total latency of 4 cycles for the GPR use, and a latency of 9 cycles for the VR use.

With the new approach, we have instead

VLREPF -> VR -> use VR
             -> VLGVF -> GPR -> use GPR

The total latency of the VR use goes down to 4 cycles, but the latency of the GPR use is now 8 cycles.

While this may be an overall win in many cases, it can also hurt in those cases where the overall performance is bounded more by the latency of the GPR use chain.

As an aside, for the case of a 64-bit integer value, I think the sequence LG -> VLVGP (which already does the replicate implicitly) always seems preferable.

jonpa updated this revision to Diff 173641. Nov 12 2018, 3:30 AM

I'm not sure that this is always a win. In fact, this may depend on the type of the loaded value.

Thanks for review!

After initial feedback, I began to update the patch in the direction of never introducing any new VLGVs (extracts to GPRs). I added checks for cases where there was another scalar use of the load, such as an address register or a scalar operation. There were still some cases left where i16/i32 elements got transferred from the new VLREP via a GPR (to another vector register) instead of via a VPERM. This surprised me, and I suspect there is an issue to fix here, given that a VPERM should be better than a VLGVF + VLVGF pair even though it includes a load of the byte mask, shouldn't it? There were only a few more VLREPs produced in the i16/i32 cases, so I decided to simply avoid them.

There was then no difference in SPEC code generation between handling just i64 (minus scalar uses) and handling no integer loads at all, so I simply removed the integer handling and only do this for floating point.

This now seems generally useful, with some 200 more VLREPs on SPEC. There are also some (2?) files where the load seems to be used in multiple blocks, which causes some minor changes in CodeGen. In one case the Control Flow Optimizer can now no longer merge identical code. I am guessing this is still generally beneficial, since I have seen a case where a small loop got many more instructions just because of this problem (although that was with another patch applied).

For new opcode diffs per file, see:


Summary for SPEC:

This seems reasonable in principle to me now, but see inline comments.

lib/Target/SystemZ/SystemZISelLowering.cpp
5387

What exactly are you trying to check here? As far as I can see, UseVT is always equal to LdVT, right? (It's the type of UI's "Val", which is the value of N ...)

Why not simply check above that the result type of the load is a floating-point type, and then be done with it?
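
For illustration, the simpler check being suggested might look something like the following (the variable name is an assumption):

  // LoadVal is assumed to be the load's value result (result 0); bail out
  // unless the loaded value is floating point.
  if (!LoadVal.getValueType().isFloatingPoint())
    return;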

5388

UserResultVT never appears to be used.

5390

What exactly ensures that there cannot be a second REPLICATE use? You'd be causing an internal compiler error here if that actually does occur ...

If this is just something that this optimization cannot handle, you should gracefully fail here instead of asserting.

5411

Again, I'm not sure what this type check is supposed to achieve. (Also, why isInteger?)

jonpa updated this revision to Diff 173672. Nov 12 2018, 7:28 AM
jonpa marked an inline comment as done.

Updated per review.

lib/Target/SystemZ/SystemZISelLowering.cpp
5387

UseVT here is the type of the used SDValue, so I think this could be i1 for a chain.

I think that chains should be ignored / preserved in this method:

  • If there are only other memory accesses chained after this load, I think VLREP will still happen, even though there are other "users".
  • Don't change the chain edge to the REPLICATE below
5388

Ah, right - I can remove that now also since I removed all the i64 checks.

5390

I suppose I don't really need that check anymore. It was just making sure that I understood correctly that there really should be just one REPLICATE of the same load, since DAG.getNode() should return the same node for the second query.

It has never happened so far, so I think I could remove it, since it wouldn't be disastrous to REPLICATE an already VLREP'ed load again.
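
A small illustration of the CSE behavior this relies on (DL, VT, and LoadVal are assumed to be in scope):

  // SelectionDAG CSEs identical nodes, so the second identical query returns
  // the node created by the first one rather than a new REPLICATE.
  SDValue R1 = DAG.getNode(SystemZISD::REPLICATE, DL, VT, LoadVal);
  SDValue R2 = DAG.getNode(SystemZISD::REPLICATE, DL, VT, LoadVal);
  assert(R1.getNode() == R2.getNode() && "expected the two queries to CSE");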

Should I remove any of the other asserts as well?

5411

See above about preserving Chains.

And for the isInteger(), I had also forgotten to remove that after removing the i64 checks, sorry.

Is the IsDataUse variable ok, or would you rather just eliminate it?

uweigand added inline comments. Nov 12 2018, 8:03 AM
lib/Target/SystemZ/SystemZISelLowering.cpp
5387

I see. If you simply want to distinguish between the value and chain results of the load, wouldn't a check along the lines of UI.getResNo() == 0 be more straightforward?

5390

Just removing the assert doesn't feel correct; you now don't even add the use to the OtherUses list. If that's OK, you should do that (or otherwise, just fail gracefully).

The other asserts can probably be just removed.

jonpa updated this revision to Diff 173686. Nov 12 2018, 8:37 AM

NFC update per review.

lib/Target/SystemZ/SystemZISelLowering.cpp
5387

right.

5390

I think this should never happen, so I just added a check to fail if it ever does.

I still think the DAG would be legal with two REPLICATEs, where one of them takes on the scalar users, but this is probably safest in case something weird like that happens...

uweigand accepted this revision. Nov 12 2018, 8:49 AM

LGTM, thanks.

This revision is now accepted and ready to land. Nov 12 2018, 8:49 AM
jonpa closed this revision. Nov 13 2018, 12:43 AM

Thanks for the review. Committed as r346746.