This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Experimental: Improve getMemoryOpCost() to find foldable loads that are converted.
Closed, Public

Authored by jonpa on Sep 29 2018, 8:09 AM.

Details

Reviewers
uweigand
Summary

The SystemZ backend can perform arithmetic directly on a memory operand, including cases where the loaded value is extended first (load + extend). Similarly, a load + truncate can be folded into an operand.

This was not recognized by the SystemZ TTI cost functions. I found a loop where this was obvious and decided to implement a more complete version of the check for foldable memory operands.

As it turned out, that loop seems to be the only case where this triggers, so there does not seem to be much benefit in applying this patch at the moment, not even on z14, which also has the 16->64 bit memory operands.

I suppose we won't use this for the moment, but I am saving it here for the record.
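As a rough illustration of the folding check described in the summary, here is a minimal standalone sketch. It is a toy model, not the patch itself: `Inst`, `Op`, `isFoldableLoad` and `memoryOpCost` are hypothetical names, and the model ignores operand widths and subtarget features (such as the z14-only 16->64 bit forms), which a real check would have to consider.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for IR instructions; these are NOT LLVM's classes.
enum class Op { Load, SExt, ZExt, Trunc, Add, Sub, Mul, ICmp, Store, Other };

struct Inst {
  Op op;
  unsigned bits;                   // scalar result width in bits (unused here)
  std::vector<const Inst *> users; // instructions that use this value
};

static bool isConversion(Op op) {
  return op == Op::SExt || op == Op::ZExt || op == Op::Trunc;
}

// Assumption for this sketch: these operations accept a memory operand.
static bool takesMemOperand(Op op) {
  return op == Op::Add || op == Op::Sub || op == Op::Mul || op == Op::ICmp;
}

// True if the load has a single user chain: load -> [conversion] -> arithmetic.
bool isFoldableLoad(const Inst &load) {
  if (load.op != Op::Load || load.users.size() != 1)
    return false;
  const Inst *user = load.users[0];
  if (isConversion(user->op)) {
    // The conversion itself must also have exactly one user.
    if (user->users.size() != 1)
      return false;
    user = user->users[0];
  }
  return takesMemOperand(user->op);
}

// A foldable load is modelled as free; any other load costs 1.
unsigned memoryOpCost(const Inst &load) {
  assert(load.op == Op::Load && "expected a load");
  return isFoldableLoad(load) ? 0 : 1;
}
```

In this model, a chain such as load i16 -> sext to i64 -> add gets cost 0, while a load with several users keeps cost 1.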

Event Timeline

jonpa created this revision. Sep 29 2018, 8:09 AM

This does look like a nice refactoring, so it might be worthwhile getting it in anyway, even if there's not much change in codegen ...

jonpa updated this revision to Diff 169183. Oct 11 2018, 3:33 AM

Updated the patch to also handle compares with pointer operands correctly by using the new static getScalarSizeInBits(Ty*). Added CHECKs for the new cases in the test file.

On z13, the number of loads getting 0 cost in cost evaluation of VF=1 goes up from 737 to 778. No change in output.

On z14, the number of loads getting 0 cost in cost evaluation of VF=1 goes up from 737 to 779. One loop (and file) changed in perlbench (pp_pack / S_unpack_rec / while.body720).

Loop before vectorize pass:

while.body720:                                    ; preds = %while.body720.preheader, %while.body720
  %dec7171149 = phi i32 [ %dec717, %while.body720 ], [ %dec7171143, %while.body720.preheader ]
  %s.addr.211148 = phi i8* [ %add.ptr721, %while.body720 ], [ %s.addr.1, %while.body720.preheader ]
  %cuv.131146 = phi i64 [ %add736, %while.body720 ], [ %cuv.1, %while.body720.preheader ]
  %ai16.0..sroa_cast248 = bitcast i8* %s.addr.211148 to i16*
  %ai16.0.copyload = load i16, i16* %ai16.0..sroa_cast248, align 1
  %add.ptr721 = getelementptr inbounds i8, i8* %s.addr.211148, i64 2
  %conv735 = sext i16 %ai16.0.copyload to i64
  %add736 = add i64 %cuv.131146, %conv735
  %dec717 = add nsw i32 %dec7171149, -1
  %cmp718 = icmp sgt i32 %dec7171149, 0
  br i1 %cmp718, label %while.body720, label %sw.epilog1397.loopexit1323
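For readers less used to IR, the loop above roughly corresponds to the following source-level shape. This is a reconstruction from the IR only, not the actual perlbench code; the function name `sumUnalignedI16` and its parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Adds `count` signed 16-bit values, read from a possibly unaligned byte
// buffer, onto the 64-bit accumulator `cuv` and returns the result.
int64_t sumUnalignedI16(const unsigned char *s, int count, int64_t cuv) {
  assert(s != nullptr || count <= 0);
  while (count-- > 0) {
    int16_t v;
    std::memcpy(&v, s, sizeof v);   // the align-1 i16 load
    s += 2;                         // getelementptr ..., i64 2
    cuv += static_cast<int64_t>(v); // sext i16 -> i64, then add
  }
  return cuv;
}
```

The load, sign extension and add in the loop body form the load + extend + arithmetic pattern that the patch treats as a single foldable memory operand.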

Trunk selects VF=2:

renamable $v1 = VLREPF renamable $r5d, 0, $noreg :: (load 4 from %ir.lsr.iv24332435, align 1)
renamable $v1 = VUPHH killed renamable $v1
renamable $v1 = VUPHF killed renamable $v1
renamable $v0 = VAG killed renamable $v0, killed renamable $v1

With this patch, VF=1 is selected instead, since the scalar loop needs just one AGH per iteration. This seems to make sense, and this loop is apparently not on any hot path, so I assume we don't need to check benchmarks.
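To see why a cheaper scalar load can flip the decision, recall that the loop vectorizer compares the estimated cost per lane across candidate VFs. The sketch below is a toy model of that comparison; `Candidate` and `pickVF` are hypothetical names, and the cost numbers in the usage note are made up for illustration, not taken from the actual cost model output.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Candidate {
  unsigned vf;       // vectorization factor (1 = scalar loop)
  unsigned loopCost; // estimated cost of one iteration at this VF
};

// Picks the VF with the lowest cost per lane (loopCost / vf).
// On a tie, the earlier candidate is kept.
unsigned pickVF(const std::vector<Candidate> &cands) {
  assert(!cands.empty());
  Candidate best = cands.front();
  for (const Candidate &c : cands) {
    // c.loopCost / c.vf < best.loopCost / best.vf, cross-multiplied
    // to avoid fractions.
    if (uint64_t(c.loopCost) * best.vf < uint64_t(best.loopCost) * c.vf)
      best = c;
  }
  return best.vf;
}
```

With illustrative costs, `pickVF({{1, 4}, {2, 7}})` returns 2 (3.5 per lane beats 4), while `pickVF({{1, 3}, {2, 7}})` returns 1 once the scalar load is counted as free.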

uweigand accepted this revision. Oct 17 2018, 6:26 AM

LGTM, thanks!

This revision is now accepted and ready to land. Oct 17 2018, 6:26 AM
jonpa closed this revision. Oct 25 2018, 3:43 PM

Thanks for the review.

r345327.