This is an archive of the discontinued LLVM Phabricator instance.

New pre-isel pass SystemZBRCTLoops
Needs ReviewPublic

Authored by jonpa on Apr 8 2016, 8:08 AM.

Details

Reviewers
uweigand
hfinkel
Summary

I experimented with making a new pass for the improved generation of BRCT loops for SystemZ. This is based on PPCCTRLoops.cpp.

Results:
Number of brct's on SPEC:
w/out patch: 4752
w/out patch, unrolled: 1367
w/ patch: 8743
w/ patch, unrolled: 11518 (new prologue loops increase the number)

gcc with unrolling generates 20864 brct's, so this result is an improvement, although not necessarily near optimal.

This seemed necessary in order to activate the loop-unroller and tune it properly, otherwise a loop might get unrolled but the BRCT and some performance will be lost.

Diff Detail

Event Timeline

jonpa updated this revision to Diff 53025.Apr 8 2016, 8:08 AM
jonpa retitled this revision from to New pre-isel pass SystemZBRCTLoops .
jonpa updated this object.
jonpa added reviewers: hfinkel, uweigand.
jonpa added a subscriber: llvm-commits.
uweigand edited edge metadata.Apr 8 2016, 10:23 AM

Can you explain why we need this new intrinsic (int_s390_hwloop_count)? I can understand why the PowerPC code needs intrinsics, since it has to access the special CTR register. But on SystemZ we don't have any special registers for BRCT, so why the intrinsic?

Since there seems to be some general interest for a target independent hw-loop pass, I started out thinking that the hw-loop counter might be a simple beginning for it. So I actually first made just a 'hw-loop-count' intrinsic in the main file, but then moved it into a '-s390-' since this is still only a SystemZ pass.

I think this pass could basically be target-independent right now, so the question is if anyone would want this right now? Should I make a systemz pre-isel pass out of this or a new target-independent hw-loop pass with just the hw-loop-count intrinsic implemented?

Regardless, I don't think SystemZ needs the intrinsic, like you pointed out. For just a SystemZ pre-isel pass, it could just be removed. With a target-independent pass, just the isel patterns for it would remain.

jonpa edited edge metadata.Apr 10 2016, 11:51 PM
jonpa added subscribers: kparzysz, atrick.
hfinkel edited edge metadata.Apr 13 2016, 8:31 PM

Since there seems to be some general interest for a target independent hw-loop pass, I started out thinking that the hw-loop counter might be a simple beginning for it. So I actually first made just a 'hw-loop-count' intrinsic in the main file, but then moved it into a '-s390-' since this is still only a SystemZ pass.

I think this pass could basically be target-independent right now, so the question is if anyone would want this right now? Should I make a systemz pre-isel pass out of this or a new target-independent hw-loop pass with just the hw-loop-count intrinsic implemented?

Regardless, I don't think SystemZ needs the intrinsic, like you pointed out. For just a SystemZ pre-isel pass, it could just be removed. With a target-independent pass, just the isel patterns for it would remain.

I think this could make sense as a target-independent pass? Did you just copy the PPCCTRLoops pass, change the name, change the name of the intrinsics generated, and remove the PPC-specific CTR-register use legality checking? I could certainly see abstracting those bits into callbacks (or just adding some target-independent intrinsics if we can both use the same ones, and adding a callback for additional legality checking to TLI). That seems a much better option than copy-and-paste here.

jonpa added a comment.Apr 25 2016, 8:28 AM

Yes, I started my experiment by copying and modifying until I had a simplest version that seemed to work.
I also think it should definitely be possible to make a target independent version like you proposed.

For SystemZ, this is currently giving mixed results, and I think it will have to wait until the pre-ra scheduler is in place. Thanks for review so far.

jonpa updated this revision to Diff 84105.Jan 12 2017, 4:39 AM
jonpa edited edge metadata.

After improving on LSR and loop unrolling, this pass has been found beneficial on benchmarks on SystemZ

As a first step, it has therefore been rewritten for the purposes of the SystemZ target, which means an intrinsic is not used for the trip count.

It also now handles loop counters of i8 and i16, by zero extending the trip count in the pre-header.

Once this has come to use, the opportunity is still there for a target independent pre-isel pass.