(MS(G)RKC part of original patch)
Details
Diff Detail
Event Timeline
llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp | ||
---|---|---|
1201 | Is this necessary for correctness, or just for performance tuning? I thought the point of the MemFold pseudos was that they actually are three-operand instructions; and if they're used with non-tied operands, that gets fixed up in a later pass. So I guess it would be *correct* to use them even for non-tied operands, right? |
llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp | ||
---|---|---|
1201 | Yeah, so far we have aimed to only fold instructions when we see that at least at this point the registers match. And in the case when that changes (register reassignment), SystemZPostRewrite will insert a COPY so that the 2-address reg/mem instruction can be used. This is a performance tuning consideration, I think. So yes, I it should still work to fold into a MemFold pseudo even if the registers are different, but I haven't tried it. The current (trunk) check with getTwoOperandOpcode() works as to find a 3-address instruction with a mapped FoldMem pseudo, an they have non-tied operands, so that is certainly correct. |
Simplify the check to simply check for instructions with three explicit operands that are not tied (and fix the check for TIED_TO).
Moving the check for VRM up a bit in the function makes the code cleaner and is NFC (currently).
I'm not sure I understand those latest changes. You seem to no longer check at all whether the target opcode actually requires tied operands, you just always tie them?
The current check 'if (SystemZ::getTwoOperandOpcode(Opcode) != -1)' finds instructions that are three address that have a two-address equivalent opcode, e.g. AGRK -> AGR. I was thinking that the check 'if (NumOps == 3 && MI.getDesc().getOperandConstraint(1, MCOI::TIED_TO) == -1)' should be exactly equivalent to that and would also include the MS(G)RKC opcodes.
I suppose all these instructions should have a MemFoldPseudo, so checking for a target memory instruction to detect this case should also be equivalent.
I am also thinking that there really isn't any reg/memory instructions that are not tied (?), so this can simply be assumed to be the case and checking for that doesn't make much sense.
This is just a heuristic to not use MemFoldPseudos unless it is fairly certain that they do not need a COPY to match the target memory instruction.
This change is NFC on benchmarks.
The more I think about it, the more it seems that the original check has always been somewhat questionable.
What is it that we really need to check here? The main correctness question is, whether the resulting MI would be correct, i.e. have the right operand types. In general, for all the possible MemOpcode values, the memory operand must come last, so it is only valid to replace the final register operand with a memory operand. The exception for this is if the original instruction is commutable, in which case we may also first commute the (commutable pair of) operands and then replace the (now) final register operand with a memory operand. All of this has nothing to do with getTwoOperandOpcode, with existing physreg assignments or anything else. We do this commutability check for compares already, and we can do it just the same for other MemOperands.
And if this is all we do, I believe the code would still be correct.
Now, everything else seems to be just a performance tuning question. And even here I'm not sure whether this is really doing us any good. We originally have a three-way opcode like
dst = ARK src1, src2
where src2 now lives in a spill slot (assuming the non-commutating case for simplicity). If dst and src1 are mapped to the same register, this is transferred to A (using the MemFold pseudo for A as an intermediate step). Now, if dst and src1 are *not* mapped to the same register, we currently do *not* perform the transformation, so we get
tmp = L spill-src2 dst = ARK src1, tmp
However, if we simply did perform the transformation, we'd instead get
dst = COPY src1 dst = A dst, spill-src2
Now, it is a bit unclear which version is preferable. The first version may have one fewer microops, but it uses one more register (which may be an issue given that we're already spilling). So this would be worthwhile to check.
However, if we actually want to avoid the COPY and keep the existing code, then I believe the correct check is simply whether or not MemOperand is a MemFold pseudo. This is the only case where this can ever be a concern. (Note that getMemoryOpcode actually preserves the state of whether or not the destination is tied; I had been confused about that above.) So going back to your previous version of the check and simply only checking for MemFold pseudos should do what we want here.
Now, it is a bit unclear which version is preferable. The first version may have one fewer microops, but it uses one more register (which may be an issue given that we're already spilling). So this would be worthwhile to check.
I made a build of benchmarks (on top of these recent improvements) where the known allocations were ignored, which caused a lot more folds with the needed register moves:
lg : 1090487 1081897 -8590 lgr : 874009 882174 +8165 ag : 37427 44978 +7551 agr : 101720 94572 -7148 l : 229926 225837 -4089 lr : 56066 59363 +3297 a : 59296 61866 +2570 ar : 42553 40021 -2532 ...
Over a nightly run, it looked to be just slightly in favor to *not* do this, in other words it seemed better to avoid the many register moves. But it was a close race...
However, if we actually want to avoid the COPY and keep the existing code, then I believe the correct check is simply whether or not MemOperand is a MemFold pseudo. This is the only case where this can ever be a concern. (Note that getMemoryOpcode actually preserves the state of whether or not the destination is tied; I had been confused about that above.) So going back to your previous version of the check and simply only checking for MemFold pseudos should do what we want here.
I agree this makes most sense and patch is updated.
Thanks for running the benchmark, I guess I'm OK with the current implementation then.
Just one minor inline comment, otherwise this now looks good.
llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp | ||
---|---|---|
1179 | Does it still make sense to move the VRM check up here? I guess this could now cancel the transformation in some cases where it isn't necessary because we don't even need VRM below ... | |
1225 | The one thing I'm wondering about: are there any cases where we don't have a MemFold pseudo, but it still would be beneficial to commute the operands? I guess not, given that "real" memory operations are never three-operand ... |
llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp | ||
---|---|---|
1179 | I do believe that it is theoretically better to have the VRM check further down as before, but that it is cluttering the code. It never happens (at least not on a 2017 SPEC build) that this is reached with a null VRM, but of course that may change in the future. | |
1225 | I checked this by building 2017, and it turns out that there are actually some cases of LOCGR/LOCRMux that show up here as commutable without a MemFold pseudo (select instructions are mapped to the LOC MemFold pseudos, but LOC instructions map directly to the target memory instruction). However, when they show up here (during regalloc), op0 and op1 are tied and have the same virtual register, so since we are only handling cases of exactly one operand, I don't think these instructions could have anything but op2 to fold. |
Does it still make sense to move the VRM check up here? I guess this could now cancel the transformation in some cases where it isn't necessary because we don't even need VRM below ...