This is an archive of the discontinued LLVM Phabricator instance.

[SampleFDO] Add two passes of MIRAddFSDiscriminatorsPass
ClosedPublic

Authored by xur on Jun 18 2021, 4:29 PM.

Details

Summary

This patch adds Pass1 of MIRADDFSDiscriminatorsPass before register allocation, and
Pass2 of MIRAddFSDiscriminatorsPass before Block-Placement.
This is still under --enable-fs-discrmininator option (default false).

This would reduce the turn-around time for FSAFDO transition.

Diff Detail

Event Timeline

xur created this revision.Jun 18 2021, 4:29 PM
xur requested review of this revision.Jun 18 2021, 4:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 18 2021, 4:29 PM

This would reduce the turn-around time for FSAFDO transition.

Not sure if I follow, could you explain a bit more?

llvm/lib/CodeGen/TargetPassConfig.cpp
1476

Wondering once we finalize the spots for adding FS discriminators, do we want to replace Pass1, Pass2.. to say BBPlacementDisc, etc. eventually

llvm/test/CodeGen/X86/fsafdo_test1.ll
6–7

This is changing from (input discriminator + last discriminator) to (input discriminator + block placement discriminator + last discriminator), right?

xur added a comment.Jun 18 2021, 6:15 PM

This might just be specific to our deployment. Our AFDO optimized build
uses the profile generated in the previous release.
If we hold this Pass1 discriminator change and commit together with the FS
profile change. We won't see the FSAFDO performance in the release right
after -- we need to wait for another release to get all the samples with
Pass1 discriminations.

-Rong

This might just be specific to our deployment. Our AFDO optimized build
uses the profile generated in the previous release.
If we hold this Pass1 discriminator change and commit together with the FS
profile change. We won't see the FSAFDO performance in the release right
after -- we need to wait for another release to get all the samples with
Pass1 discriminations.

-Rong

Thanks for clarification, makes sense.

Sounds like your production usage is going to have one FS-Disc pass before block layout. And if we add more FS-Disc pass, say before RA, or others, it's not worth it?

xur added a comment.Jun 20 2021, 10:46 PM

This might just be specific to our deployment. Our AFDO optimized build
uses the profile generated in the previous release.
If we hold this Pass1 discriminator change and commit together with the FS
profile change. We won't see the FSAFDO performance in the release right
after -- we need to wait for another release to get all the samples with
Pass1 discriminations.

-Rong

Thanks for clarification, makes sense.

Sounds like your production usage is going to have one FS-Disc pass before block layout. And if we add more FS-Disc pass, say before RA, or others, it's not worth it?

That's a good suggestion. I haven't tried to experiment with one pass before RA. I know that the one before block placement will the performance. I also tried add one after block-placement but did not gain much.
Let me do some experiments and report back.

Another aspect of this change is that we want to do this more conservatively. We have many SampleFDO targets and we don't want to commit too much changes in the initial flip ( to minimize compile time, memory usage impact). We prefer to do this step by step.

This might just be specific to our deployment. Our AFDO optimized build
uses the profile generated in the previous release.
If we hold this Pass1 discriminator change and commit together with the FS
profile change. We won't see the FSAFDO performance in the release right
after -- we need to wait for another release to get all the samples with
Pass1 discriminations.

-Rong

Thanks for clarification, makes sense.

Sounds like your production usage is going to have one FS-Disc pass before block layout. And if we add more FS-Disc pass, say before RA, or others, it's not worth it?

That's a good suggestion. I haven't tried to experiment with one pass before RA. I know that the one before block placement will the performance. I also tried add one after block-placement but did not gain much.
Let me do some experiments and report back.

Another aspect of this change is that we want to do this more conservatively. We have many SampleFDO targets and we don't want to commit too much changes in the initial flip ( to minimize compile time, memory usage impact). We prefer to do this step by step.

Incremental approach sounds reasonable. Was just curious whether you have tried other passes and see no improvements.

Looks like FS-AFDO changes are mostly in tree now though the profile generation part for FS profile is still outside of llvm (in your create_llvm_prof tool), right? Otherwise, we can give it a try too.

wenlei added inline comments.Jun 21 2021, 11:11 AM
llvm/lib/CodeGen/TargetPassConfig.cpp
1476

An extra profile loader pass is still to be added here coupled with the FS disc pass?

hoy added inline comments.Jun 21 2021, 2:48 PM
llvm/lib/CodeGen/TargetPassConfig.cpp
1476

addBlockPlacement is very close to addPreEmitPass in pass order. I'm wondering if we should use FSDiscriminatorPass::Pass3.

xur added a comment.Jun 21 2021, 2:52 PM

This might just be specific to our deployment. Our AFDO optimized build
uses the profile generated in the previous release.
If we hold this Pass1 discriminator change and commit together with the FS
profile change. We won't see the FSAFDO performance in the release right
after -- we need to wait for another release to get all the samples with
Pass1 discriminations.

-Rong

Thanks for clarification, makes sense.

Sounds like your production usage is going to have one FS-Disc pass before block layout. And if we add more FS-Disc pass, say before RA, or others, it's not worth it?

That's a good suggestion. I haven't tried to experiment with one pass before RA. I know that the one before block placement will the performance. I also tried add one after block-placement but did not gain much.
Let me do some experiments and report back.

Another aspect of this change is that we want to do this more conservatively. We have many SampleFDO targets and we don't want to commit too much changes in the initial flip ( to minimize compile time, memory usage impact). We prefer to do this step by step.

Incremental approach sounds reasonable. Was just curious whether you have tried other passes and see no improvements.

Looks like FS-AFDO changes are mostly in tree now though the profile generation part for FS profile is still outside of llvm (in your create_llvm_prof tool), right? Otherwise, we can give it a try too.

Most code is in except the FS_ProfileLoader I can send FS ProfileLoad for review (can be turned on by an option).

I think I checked in all of code for create_llvm_prof tool internally. We just need to send it to upstream some time. I'll discuss with Wei.
But I heard you guys do not use create-llvm-prof tool. Is that right?

wenlei added a subscriber: wlei.Jun 21 2021, 3:00 PM

This might just be specific to our deployment. Our AFDO optimized build
uses the profile generated in the previous release.
If we hold this Pass1 discriminator change and commit together with the FS
profile change. We won't see the FSAFDO performance in the release right
after -- we need to wait for another release to get all the samples with
Pass1 discriminations.

-Rong

Thanks for clarification, makes sense.

Sounds like your production usage is going to have one FS-Disc pass before block layout. And if we add more FS-Disc pass, say before RA, or others, it's not worth it?

That's a good suggestion. I haven't tried to experiment with one pass before RA. I know that the one before block placement will the performance. I also tried add one after block-placement but did not gain much.
Let me do some experiments and report back.

Another aspect of this change is that we want to do this more conservatively. We have many SampleFDO targets and we don't want to commit too much changes in the initial flip ( to minimize compile time, memory usage impact). We prefer to do this step by step.

Incremental approach sounds reasonable. Was just curious whether you have tried other passes and see no improvements.

Looks like FS-AFDO changes are mostly in tree now though the profile generation part for FS profile is still outside of llvm (in your create_llvm_prof tool), right? Otherwise, we can give it a try too.

Most code is in except the FS_ProfileLoader I can send FS ProfileLoad for review (can be turned on by an option).

I think I checked in all of code for create_llvm_prof tool internally. We just need to send it to upstream some time. I'll discuss with Wei.
But I heard you guys do not use create-llvm-prof tool. Is that right?

Yeah, we don't use the create_llvm_prof tool from google's github. We have our own internal implementation which is similar, but we want to move everything into llvm-profgen inside llvm-project and switch to use llvm-profgen in production too (@wlei will help on that in H2). Hopefully that will make it easier for all of us to share things in the future.

If you can upstream FS part of create_llvm_prof, it'd be very helpful too, and we can also help integrated that into llvm-profgen if needed down the road.

xur updated this revision to Diff 365539.Aug 10 2021, 10:52 AM
xur retitled this revision from [SampleFDO] Add Pass1 of MIRAddFSDiscriminatorsPass before Block-Placement to [SampleFDO] Add two passes of MIRAddFSDiscriminatorsPass.
xur edited the summary of this revision. (Show Details)

Add another pass berfore RA (as suggested by WenLei).
I tested the performance for FSAFDO Loader using one google benchmark.
This extra pass turns to be positive in performance -- I'm seeing additional 0.6% - 0.8% improvement.

BTW, sorry for the delay. I was off work last month.

wmi added a comment.Aug 10 2021, 11:54 AM

This extra pass turns to be positive in performance -- I'm seeing additional 0.6% - 0.8% improvement.

That is a good improvement. I guess the improvement is because regalloc is profile sensitive and it is benefited from preciser fs profile information after you insert another pass of profile loading before regalloc?

This extra pass turns to be positive in performance -- I'm seeing additional 0.6% - 0.8% improvement.

That is a good improvement. I guess the improvement is because regalloc is profile sensitive and it is benefited from preciser fs profile information after you insert another pass of profile loading before regalloc?

Good to know, thanks for experimenting Rong! We've also seen RA being sensitive to profile in our profile inference work - sometimes small change in profile due to inference change can lead to significant cascading effect in RA.

wmi accepted this revision.Aug 11 2021, 8:55 AM
wmi added inline comments.
llvm/lib/CodeGen/TargetPassConfig.cpp
1116–1118

How about adding a comment that the pass is intentionally placed before regalloc, in case some other passes are inserted in between in the future.

This revision is now accepted and ready to land.Aug 11 2021, 8:55 AM
wenlei added inline comments.Aug 11 2021, 8:59 AM
llvm/lib/CodeGen/TargetPassConfig.cpp
1116–1118

name FSDiscriminatorPass::Pass1 as something like FSDiscriminatorPass::RegAlloc might help too.

hoy accepted this revision.Aug 11 2021, 9:42 AM

Nice to see the win from RegAlloc!

llvm/lib/CodeGen/TargetPassConfig.cpp
1116–1118

Either way sounds good to me. Was wondering if Pass1, Pass2... should be named more specifically by the pass they are preceding. On a second thought, if they are moved around in the future, their name may need be changed too.

xur added inline comments.Aug 11 2021, 10:49 AM
llvm/lib/CodeGen/TargetPassConfig.cpp
1116–1118

Yes. I have the same concern if we change the names. Let's keep the Pass1 etc for now. Adding a comment definitely helps. Will do.

wenlei accepted this revision.Aug 11 2021, 10:50 AM

lgtm, thanks.

This revision was automatically updated to reflect the committed changes.