This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Improve codegen for memset
ClosedPublic

Authored by jonpa on Oct 18 2021, 8:40 AM.

Details

Summary

Memset with a constant length was implemented with a single store followed by a series of MVCs. This patch changes this so that one store of the byte is emitted for each MVC, which avoids data dependencies between the MVCs. An MVI/STC + MVC(len-1) is emitted for each block.

(A few somewhat unfortunate changes in memset-01.ll (extra STC/MVC) - hopefully not important. E.g. STC;MVC(256) -> STC;MVC(255);STC)
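As a rough illustration (my own sketch, not code from the patch), the constant-length expansion can be thought of as splitting the length into blocks of at most 256 bytes, where each block is lowered independently as an MVI/STC that seeds the first byte plus an MVC of block-1 bytes:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: split a constant memset length into blocks of at most 256
// bytes. Each block is lowered independently as STC (store the byte)
// followed by MVC(block - 1), so there are no data dependencies
// between the blocks.
std::vector<unsigned> memsetBlocks(uint64_t Len) {
  std::vector<unsigned> Blocks;
  while (Len > 0) {
    unsigned ThisLen = Len >= 256 ? 256 : static_cast<unsigned>(Len);
    Blocks.push_back(ThisLen);
    Len -= ThisLen;
  }
  return Blocks;
}
```

This also models the memset-01.ll change mentioned above: a length of 257 now splits into a 256-byte block (STC;MVC(255)) plus a trailing one-byte block that is just a lone STC.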

Memset with a variable length is now also handled without a libcall. Since the byte is first stored and then MVC copies from that address, a length of two must now be subtracted instead of one for the loop. An extra check is needed so that the case of length one is handled in a special block with just a single MVI/STC (this is per what GCC does).

I put the MBB for doing the memset of length 1 at the end of the MF and also set a low probability on that edge, since otherwise the Branch Probability Basic Block Placement pass would move it back into the sequence. When that happens, the common path (length > 1) involves a taken branch across the length-1 block, which would be suboptimal.

I see that GCC prefetches these loops 4 iterations ahead, while clang does 3...

Side note: Unfortunately it seems that memset loops suffer a bit currently from poor register coalescing in cases where the address is used after the memset. This is not just for this patch, but also for the memset-0 case (XC loop). I guess that should be handled if possible...

char fun(char *dst, char byte, int len)
{
  memset(dst, byte, len);
  return dst[14];
}

=>

fun:                                    # @fun
# %bb.0:                                # %entry
	aghi	%r5, -2
	cgije	%r5, -2, .LBB0_6
# %bb.1:                                # %entry
	cgije	%r5, -1, .LBB0_7
# %bb.2:                                # %entry
	srlg	%r0, %r5, 8
	lgr	%r1, %r2                ##### <<<<<<<<<<<<<<<
	cgije	%r0, 0, .LBB0_5
# %bb.3:
	lgr	%r1, %r2                ##### <<<<<<<<<<<<<<< REDUNDANT
.LBB0_4:                                # %entry
                                        # =>This Inner Loop Header: Depth=1
	pfd	2, 768(%r1)
	stc	%r4, 0(%r1)
	mvc	1(255,%r1), 0(%r1)
	la	%r1, 256(%r1)
	brctg	%r0, .LBB0_4
...

So far, I know that Early Machine LICM inserts a preheader for the loop since there is none (a preheader can only have one successor: the loop header). This causes the PHI node using the original address register in the loop to have a different predecessor than the one after the loop (the EXRL block), so two COPYs are inserted into different MBBs by the PHI lowering pass. Not sure yet why this doesn't get coalesced in the memset cases, when it is handled for memcpy...

Diff Detail

Event Timeline

jonpa created this revision.Oct 18 2021, 8:40 AM
jonpa requested review of this revision.Oct 18 2021, 8:40 AM
Herald added a project: Restricted Project. Oct 18 2021, 8:40 AM
jonpa added a comment.Oct 19 2021, 4:37 AM

https://reviews.llvm.org/D112065 is a proposed patch for the register coalescing problem.

jonpa updated this revision to Diff 382595.Oct 27 2021, 3:55 AM

Refactored so that MEMSET_MVC and associated pseudos only carry one address. NFC.

I think it would be a bit simpler and clearer to handle the Src vs. Dest address special case for memset in a single place at the beginning. Something along the lines of:

if (!isMemSet)
{
  DestBase = earlyUseOperand(MI.getOperand(0));
  DestDisp = MI.getOperand(1).getImm();
  SrcBase = earlyUseOperand(MI.getOperand(2));
  SrcDisp = MI.getOperand(3).getImm();
}
else
{
  SrcBase = earlyUseOperand(MI.getOperand(0));
  SrcDisp = MI.getOperand(1).getImm();
  DestBase = SrcBase;
  DestDisp = SrcDisp + 1;
  foldDisplIfNeeded(DestBase, DestDisp);
}

There should be no need for any subsequent +1 displacement arithmetic then.

jonpa updated this revision to Diff 384702.Nov 4 2021, 4:27 AM

Patch updated per suggestion.

I also tried to avoid folding only the Dest displacement, so that two base registers would not be needed when SrcDisp is 4095, like:

+  auto foldDisplIfNeeded = [&](MachineOperand &Base, uint64_t &Disp,
+                               bool Force = false) -> void {
+    if (Disp > 0xfff || Force) {
...
+      if (DestDisp > 0xfff) {
+        // Make sure only one base register is used.
+        foldDisplIfNeeded(SrcBase, SrcDisp, true/*Force*/);
+        DestBase = SrcBase;
+        DestDisp = 1;
+      }

This however did not change the benchmarks at all, so I removed it (it did however improve memset-01.ll:f17,f18...). I also tried generalizing this idea to all operations, but again there were no such cases in the benchmarks.

Also some other minor fixing:

  • Changed the variable names in insertMemMemOp() to avoid reusing names already in use (DestBase, SrcBase, ...).
  • Add ByteMO directly also with STC in insertMemMemOp(), and make sure to call earlyUseOperand() to clear any kill flag.
  • For the PFD, I decided it was simplest to subtract IsMemset rather than using both ThisSrcReg and SrcDisp.

LGTM, just a few minor nits inline. We also should do a performance validation run before committing this.

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
7881

Minor nit: this should use !isUInt<12>(Disp) like elsewhere.

7980

Was this line moved deliberately?

llvm/test/CodeGen/SystemZ/tail-call-mem-intrinsics.ll
12

Maybe we can still find a way to test emission of the call? E.g. using a volatile pointer?

jonpa added inline comments.Nov 15 2021, 5:28 AM
llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
7980

yes, I thought that looked a little better, but it doesn't matter that much...

uweigand added inline comments.Nov 15 2021, 6:10 AM
llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
7980

OK, that's fine with me then.

jonpa marked 3 inline comments as done.Nov 28 2021, 9:15 AM
jonpa added inline comments.
llvm/test/CodeGen/SystemZ/tail-call-mem-intrinsics.ll
12

ah, yes - the fourth argument is actually 'isvolatile'...

jonpa updated this revision to Diff 390211.Nov 28 2021, 9:16 AM
jonpa marked an inline comment as done.

Updated per review.

uweigand accepted this revision.Dec 6 2021, 3:43 AM

LGTM, thanks!

This revision is now accepted and ready to land.Dec 6 2021, 3:43 AM
This revision was landed with ongoing or failed builds.Dec 6 2021, 10:12 AM
This revision was automatically updated to reflect the committed changes.