This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Replace non-kernel function uses of LDS globals by pointers.
AbandonedPublic

Authored by hsmhsm on May 25 2021, 11:26 PM.

Details

Summary

The main motivation behind pointer replacement of LDS use within non-kernel
functions is - to *avoid* subsequent LDS lowering phase from directly packing
LDS (assume large LDS) into a struct type which would otherwise cause allocating
huge memory for struct instance within every kernel.

Diff Detail

Event Timeline

hsmhsm created this revision.May 25 2021, 11:26 PM
hsmhsm requested review of this revision.May 25 2021, 11:26 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 25 2021, 11:26 PM

I am planning to take-up handling of *bit-casted function pointers* (while collecting reacheable callees for kernels) as a seperate patch on top of it.

The reason for this is :

(1) *not* handling of *bit-casted function pointers* does not break any functionality - when we fail to collect reacheable callees because of this, lowering phase directly lower LDS instead of pointers, so no break in any functionality from semantics persepctive
(2) I want this patch to be upstreamed as soon as possible before we getting hit by new JIRAs because of too much used LDS memory.

hsmhsm updated this revision to Diff 347854.May 26 2021, 12:01 AM

Fixed few spell mistakes in comments, and a minor CHECK mistake in one of the lit tests.

hsmhsm updated this revision to Diff 347858.May 26 2021, 1:22 AM

Pointers created for LDS globals during pointer replacement phase should be immediately
lowered during module lowering phase. Hence add a CHECK in the lit tests to check that
pointers do not remain in the output IR file.

foad added inline comments.May 26 2021, 3:03 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
25

Typo "Reachable" here and in lots of comments.

hsmhsm updated this revision to Diff 347884.May 26 2021, 3:26 AM

Fixed review comment by Jay.

s/reacheble/reachable
s/Reacheble/Reachable

hsmhsm marked an inline comment as done.May 26 2021, 3:33 AM
JonChesterfield requested changes to this revision.May 26 2021, 4:45 AM

Same objection to previous diffs with this code in. You're adding loads of complicated code to an existing pass.

Instead, this independent optimisation should be a separate pass that is scheduled before the lowering. That way it can't break the lowering and the optimization can be tested in isolation, making for simpler and more effective test cases.

This revision now requires changes to proceed.May 26 2021, 4:45 AM

Same objection to previous diffs with this code in. You're adding loads of complicated code to an existing pass.

Instead, this independent optimisation should be a separate pass that is scheduled before the lowering. That way it can't break the lowering and the optimization can be tested in isolation, making for simpler and more effective test cases.

I do not agree with you anymore.

Don't you remember that, in the meeting held on Friday 21, for various reasons, we all of us (including you) agreed to add pointer replacement related code in this pass itself, instead of a separate pass?

What is compilcated code here? I cleanly separated pointer replacement implementation from lowering implementation.

I see that. However, we are working on a multipass compiler with a clearly delimited format used between passes.

If your optimisation is written standalone, it can be tested directly. The output doesn't have to embed the lowering pass semantics. It is also easier to write as a separate pass. You have the utils header for sharing code between the two.

I agreed to Stas making the minor adjustments to generalise the lowering pass, for use lowering a function as well as a module. I definitely did not, and do not, agree to splicing an optimisation through the lowering code.

I'm also expecting the optimisation in question to pessimise openmp code, so want to keep an option to disable it from the pipeline when building openmp applications.

I see that. However, we are working on a multipass compiler with a clearly delimited format used between passes.

If your optimisation is written standalone, it can be tested directly. The output doesn't have to embed the lowering pass semantics. It is also easier to write as a separate pass. You have the utils header for sharing code between the two.

I agreed to Stas making the minor adjustments to generalise the lowering pass, for use lowering a function as well as a module. I definitely did not, and do not, agree to splicing an optimisation through the lowering code.

I'm also expecting the optimisation in question to pessimise openmp code, so want to keep an option to disable it from the pipeline when building openmp applications.

Jon,

I get your points. We discussed it in the past and on Friday's meeting. There are advantages and dis-advantages in both the approaches. I am personally fine with either approach, but what I want is a common agreement on the decision taken and being stick to it (in general on all technical topics). When I raised this question in Friday meeting, Stas had an openion of having the pointer-repalcement code within this patch itself, and I seconded it. And we asked about your decision, and also said, you are fine too. But, now you are opposing it after the patch is ready. Then how can I trust on you sticking to a decision taken?

Use different passes for different things, and optimisations are different from lowering. Those should not be contentious statements in a multipass compiler. Strictly I'm not sure this is an optimisation as it burns cycles to reduce memory, and that's only useful if the saved memory improves occupancy.

I see zero advantages to your single-combined-pass preference and have enumerated disadvantages. Mostly that making it harder to review and to test makes it less likely to work. It also means the tests are fragile, which will cost us when improving the heuristics on which variables to transform.

What are the advantages you see that merit breaking from the default of using multiple passes in a multiple pass compiler?

Use different passes for different things, and optimisations are different from lowering.

This is right as per theory. But practicality is different from theory. Those should not be contentious statements in a multipass compiler.

Strictly I'm not sure this is an optimisation as it burns cycles to reduce memory, and that's only useful if the saved memory improves occupancy.

Run time cost v/s memory cost are endless debatable topic, which cost saving, we should aim for, depends on context and enviroment. Here memory cost is too important one and cannot be neglected, though run time cost is also important one. Further, until we test the system with some heavily LDS bound application, you cannot come to a conlusion that it is useless transformation.

I see zero advantages to your single-combined-pass preference and have enumerated disadvantages. Mostly that making it harder to review and to test makes it less likely to work. It also means the tests are fragile, which will cost us when improving the heuristics on which variables to transform.

Yes, these are the dis-advantages of having pointer-replacement code within ModuleLDSLowering pass

What are the advantages you see that merit breaking from the default of using multiple passes in a multiple pass compiler?

Here are advantages:

First, this tranformation is not something unique and independent, it has meaning only in the presence of ModuleLDSLowering pass, otherwise it is totally meaningless to have this pointer-replacement transormation pass. Imagine a scenario, where we run pointer-replacement transormation pass, but ModuleLDSLowering pass is disabled.

Second, if there exist some other transormation passes between pointer-replacement pass and ModuleLDSLowering pass which have some side effect on the the pointer-replacement transformed code, then it would cause more trouble to whole intention behind the pass.

Third, logically both are tied together to achieve the common purpose. So it make sense that both exist together in a single pass. OpenMP use-cases may not care about it, but, AMDGPU backend does not only support OpenMP, but also support other GPU languages.

In anycase, I am fine with a separate pass, and I am going to implement it. Stay tuned for a new patch.

hsmhsm abandoned this revision.May 26 2021, 10:06 PM

We have lately decided to put pointer-replacement code as a seperate pass instead of having it within ModuleLDSLowering pass. Hence abandoning this patch. Will be posting new patch shortly which implements pointer-replacement as a seperate pass.