Page MenuHomePhabricator

[AMDGPU][WIP] Lower Function Local LDS Variables.
Needs ReviewPublic

Authored by hsmhsm on Nov 15 2020, 11:09 PM.

Details

Summary

[1] Added a new pass, namely, amdgpu-lower-function-local-lds.
[2] Implemented required initial plumbing work for both old and new pass managers.
[3] A flag, namely, amdgpu-enable-function-local-lds-lowering is added, which is a hidden flag and enabled by default.

The pass can be explicitly disabled by passing the option -mllvm --amdgpu-enable-function-local-lds-lowering=false.

Further, build all the required data structures which will be later used to lower function local LDS. Below are the data structures being built.

[1] Kernel Set - Holds all the kernels in the module
[2] Function Local LDS Set - Holds all the function local LDS from all functions
[3] Function Address Taken Set - Holds all the functions whose address is taken within the module

[4] LDS to Function Map - Maps each function local LDS to a function within which the LDS is defined
[5] Function to LDS Map - Reverse of above map, which maps each functon F to a SET of LDS which are defined within F

[6] Kernel to Callee Map - Maps each kernel K to a SET of functions which define LDS and there exists call graph path from K to these functions.
[7] Kernel to LDS Map - Maps each kernel K to a set of function local LDS which are supposed to be lowered w.r.t K.

Data structures [1], [2], and [3] are built by iterating over the globals and functions defined within the module.
Data structures [4] and [5] are built using BOTTOM-UP based on the use list of function local LDS.
Data structure [6] is built using TOP-DOWN via call graph traversal.
Data structure [7] is built using the result of above BOTTOM-UP and TOP-DOWN constructed data structures.

Diff Detail

Unit TestsFailed

TimeTest
80 msx64 debian > LLVM.CodeGen/AMDGPU::force-alwaysinline-lds-global-address-codegen.ll
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/llc -mtriple=amdgcn-amd-amdhsa -amdgpu-function-calls -amdgpu-stress-function-calls < /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/force-alwaysinline-lds-global-address-codegen.ll | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck --allow-unused-prefixes=false -check-prefix=GCN /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/force-alwaysinline-lds-global-address-codegen.ll
60 msx64 debian > LLVM.CodeGen/AMDGPU::force-alwaysinline-lds-global-address.ll
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-always-inline /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/force-alwaysinline-lds-global-address.ll | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck --allow-unused-prefixes=false --check-prefix=ALL /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/force-alwaysinline-lds-global-address.ll
290 msx64 debian > LLVM.CodeGen/AMDGPU::local-memory.ll
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/llc -march=amdgcn -mcpu=verde -verify-machineinstrs < /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/local-memory.ll | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck --allow-unused-prefixes=false --check-prefixes=GCN,FUNC /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/local-memory.ll
130 msx64 windows > LLVM.CodeGen/AMDGPU::force-alwaysinline-lds-global-address-codegen.ll
Script: -- : 'RUN: at line 1'; c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\llc.exe -mtriple=amdgcn-amd-amdhsa -amdgpu-function-calls -amdgpu-stress-function-calls < C:\ws\w16c2-1\llvm-project\premerge-checks\llvm\test\CodeGen\AMDGPU\force-alwaysinline-lds-global-address-codegen.ll | c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\filecheck.exe --allow-unused-prefixes=false -check-prefix=GCN C:\ws\w16c2-1\llvm-project\premerge-checks\llvm\test\CodeGen\AMDGPU\force-alwaysinline-lds-global-address-codegen.ll
80 msx64 windows > LLVM.CodeGen/AMDGPU::force-alwaysinline-lds-global-address.ll
Script: -- : 'RUN: at line 1'; c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\opt.exe -S -mtriple=amdgcn-amd-amdhsa -amdgpu-always-inline C:\ws\w16c2-1\llvm-project\premerge-checks\llvm\test\CodeGen\AMDGPU\force-alwaysinline-lds-global-address.ll | c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\filecheck.exe --allow-unused-prefixes=false --check-prefix=ALL C:\ws\w16c2-1\llvm-project\premerge-checks\llvm\test\CodeGen\AMDGPU\force-alwaysinline-lds-global-address.ll
View Full Test Results (6 Failed)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
arsenm added inline comments.Thu, Jan 7, 9:27 AM
llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
480 ↗(On Diff #313659)

i32 should be more than sufficient

507–508 ↗(On Diff #313659)

ConstantInt::get or IRBuilder::getInt64 is simpler

571 ↗(On Diff #313659)

Function name and purpose is unclear

573 ↗(On Diff #313659)

Spelling inctructions

575 ↗(On Diff #313659)

Redundant ifndef

650–654 ↗(On Diff #313659)

I don't think there should be a distinction between direct and indirect call sites. Both should behave the same way

1096 ↗(On Diff #313659)

Document function

1243–1247 ↗(On Diff #313659)

This doesn't make sense, the same variable may appear in a kernel and in a function.

Also this refers to "device scope" which is not a concept in the backend

1417–1420 ↗(On Diff #313659)

What happens if an LDS variable is used in a constant initializer?

1422–1424 ↗(On Diff #313659)

I don't understand this. The whole point of the pass is to handle non-kernel uses

"Device scope" is inappropriate to talk about in the backend, all the functions are device scope.

1486 ↗(On Diff #313659)

Pointless comment

hsmhsm updated this revision to Diff 315322.Fri, Jan 8, 2:37 AM

Fixed few trivial review comments (by Matt).

hsmhsm added a comment.Fri, Jan 8, 2:42 AM

@arsenm

There are too many review comments - some are trivial and some are non-trivial. And I have my own doubts about some of the comments being made. Hence let's handle all these comments file by file and within a file, function by function. Otherwise, it only makes the review process more difficult.

Please go through my first set of inline comments.

hsmhsm added inline comments.Fri, Jan 8, 3:04 AM
llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
1486 ↗(On Diff #313659)

Taken care

1505 ↗(On Diff #313659)

Taken care.

Further, as mentioned in the review comments below, I removed all the redundant code within this function, and since we had then left with single call, I just inlined this call.

1510 ↗(On Diff #313659)

I intentionally used SetVector in order to make sure that the TRAVERSAL ORDER is preserved during multiple traversals of this set in different places.

Is it harm to use SetVector here? Is it causes noticible performance issues? If not, why not we use it when it would provide additional benifit of preserving TRAVERSAL ORDER?

1511 ↗(On Diff #313659)

taken care

1512 ↗(On Diff #313659)

I assume, the only kind entry functions are KERNELS (for hip programming language). And, this pass is targetted only for HIP. With this context, is there anything that we are missing?

1519 ↗(On Diff #313659)

taken care

1527–1529 ↗(On Diff #313659)

taken care

1531–1533 ↗(On Diff #313659)

taken care

1534–1539 ↗(On Diff #313659)

taken care

1534–1539 ↗(On Diff #313659)

taken care

1543–1545 ↗(On Diff #313659)

taken care

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
38

This is needed in the function AMDGPUAlwaysInlinePass.cpp. Please look at the changes being made to this file.

hsmhsm added inline comments.Fri, Jan 8, 6:37 AM
llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
70 ↗(On Diff #313659)

Removed the confusing comment

198 ↗(On Diff #313659)

Taken care

201–207 ↗(On Diff #313659)

Taken care

281 ↗(On Diff #313659)

Taken care

469–470 ↗(On Diff #313659)

Taken care

573 ↗(On Diff #313659)

Taken care

575 ↗(On Diff #313659)

Taken care

1167 ↗(On Diff #313659)

Taken care

arsenm added inline comments.Fri, Jan 8, 7:37 AM
llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
336 ↗(On Diff #315322)

There should be no const_casts

478 ↗(On Diff #315322)

I think we could avoid passing the base pointer since it should always be 0. I would have to think about how to best guarantee this to codegen though

1510 ↗(On Diff #313659)

Traversal order shouldn't matter though

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
38

This should be a required pass without an option to disable it that the pass would need to be aware of. Is this just for debug/bringup?

hsmhsm added inline comments.Fri, Jan 8, 8:15 AM
llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
336 ↗(On Diff #315322)

But GetElementPtrInst::CreateInBounds() receives non_cast Instruction parameter.

478 ↗(On Diff #315322)

How do you avoid it? I am not getting. Could you be more explicit here?

1510 ↗(On Diff #313659)

Agree, but what is the harm in preserving traversal order? If you insist, I will change, but just curious to know, if it is really a serious comment to be taken care.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
38

Current idea and common consensus (as I understand it) is to guard this feature by a flag, and disable it by default. HIP application programmer need to pass the option to enable this feature, at least in the begining. May be in future, we can think of getting rid of the option.

hsmhsm updated this revision to Diff 315567.Fri, Jan 8, 8:05 PM

Fixed few more review comments.

hsmhsm added inline comments.Fri, Jan 8, 8:08 PM
llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
59–60 ↗(On Diff #313659)

Fixed

940 ↗(On Diff #313659)

Fixed

hsmhsm updated this revision to Diff 317500.Tue, Jan 19, 2:25 AM

[0] Started re-implementing from scratch again.
[1] Added a new pass, namely, amdgpu-lower-function-local-lds.
[2] Implemented required initial plumbing work for both old and new pass

managers.

[3] An option, namely, amdgpu-enable-function-local-lds-lowering is

added, when passed, it enables the pass.
hsmhsm retitled this revision from [AMDGPU] Support for device scope shared variables to [AMDGPU][WIP] Lower Function Local LDS Variables..Tue, Jan 19, 2:27 AM
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm edited the summary of this revision. (Show Details)

Started to implement the feature from scratch again. The previous experience tells me that - "a single very big patch is very problamatic and confusing for a meanigful review process". Hence this time, I am planning to submit small patches (time to time) which can be reasonably reviewed. This first patch implements the following.

[1] Add new pass, namely, amdgpu-lower-function-local-lds.
[2] Implement required initial plumbing work for both old and new pass managers.
[3] Add an option, namely, amdgpu-enable-function-local-lds-lowering, when passed, it enables the pass.

You can just use the done checkbox, you don't need to comment on each point

llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
1510 ↗(On Diff #313659)

You should only try to preserve things that are important, otherwise you are adding cost and complexity for no benefit

1512 ↗(On Diff #313659)

The IR is language independent and none of the constructs here are tied to a language

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
38

This isn't a user exposed flag, and there shouldn't be a need for users to set one.

hsmhsm marked 3 inline comments as done.Tue, Jan 19, 7:31 PM
hsmhsm updated this revision to Diff 317763.EditedTue, Jan 19, 8:53 PM

Based on the Matt's comment for previous patch, some changes are done w.r.t the handling of the guarding flag - amdgpu-enable-function-local-lds-lowering.

Though Matt is against the usage of any guarding flag for this pass, I personally feel the need of it for below two reasons.

(1) Presence of this pass means, we should disable the forcefull inlining as it is done within the pass - amdgpu-always-inline. Otherwise, this pass does not make any sense at all. It is a better idea to disable this forcefull inlining via a flag.

(2) In case of any emergency issue within this pass, customer should have an handy approach to disable the pass and temporarily move on until the fix available.

So the amdgpu-enable-function-local-lds-lowering is a hidden flag, and it is enabled by default. It works as below:

(1) Default behavoir is to run the pass as shown below.

  • Old pass manager:
mahesha@brego:[tmp]$ hipcc main.cpp
Running the pass - LowerFunctionLocalLDS
mahesha@brego:[tmp]$
  • New pass manager:
mahesha@brego:[tmp]$ hipcc -fexperimental-new-pass-manager main.cpp
Running the pass - LowerFunctionLocalLDS
mahesha@brego:[tmp]$

(2) The pass will not run when it is explicitly turned off as shown below.

  • Old pass manager:
mahesha@brego:[tmp]$ hipcc -mllvm --amdgpu-enable-function-local-lds-lowering=false main.cpp
mahesha@brego:[tmp]$
  • New pass manager:
mahesha@brego:[tmp]$ hipcc -fexperimental-new-pass-manager  -mllvm --amdgpu-enable-function-local-lds-lowering=false main.cpp
mahesha@brego:[tmp]$
hsmhsm edited the summary of this revision. (Show Details)Tue, Jan 19, 9:44 PM
hsmhsm updated this revision to Diff 318723.EditedFri, Jan 22, 8:01 PM

Build all the required data structures which will be later used to lower function local LDS. Below are the data structures being built.

[1] Kernel Set - Holds all the kernels in the module
[2] Function Local LDS Set - Holds all the function local LDS from all functions
[3] Function Address Taken Set - Holds all the functions whose address is taken within the module

[4] LDS to Function Map - Maps each function local LDS to a function within which the LDS is defined
[5] Function to LDS Map - Reverse of above map, which maps each functon F to a SET of LDS which are defined within F

[6] Kernel to Callee Map - Maps each kernel K to a SET of functions which define LDS and there exists call graph path from K to these functions.
[7] Kernel to LDS - Maps each kernel K to a set of function local LDS which are supposed to be lowered w.r.t K.

Data structures [1], [2], and [3] are built by iterating over the globals and functions defined within the module.
Data structures [4] and [5] are built using BOTTOM-UP based on the use list of function local LDS.
Data structure [6] is built using TOP-DOWN via call graph traversal.
Data structure [7] is built using the result of above BOTTOM-UP and TOP-DOWN constructed data structures.

hsmhsm updated this revision to Diff 318724.Fri, Jan 22, 8:08 PM

Add missing "static" keyword to a function isKernel().

hsmhsm updated this revision to Diff 318726.Fri, Jan 22, 8:44 PM

Added a FIXME comment.

Harbormaster completed remote builds in B86389: Diff 318723.
hsmhsm updated this revision to Diff 318730.Fri, Jan 22, 9:44 PM

Add missing explicit keyword for constructor.

hsmhsm edited the summary of this revision. (Show Details)Fri, Jan 22, 9:46 PM
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm updated this revision to Diff 318731.Fri, Jan 22, 9:59 PM

Fix few spell mistakes in comments.

hsmhsm updated this revision to Diff 318732.Fri, Jan 22, 10:05 PM

Fix few spell mistakes in comments.

hsmhsm updated this revision to Diff 318733.Fri, Jan 22, 10:17 PM

Fix comments.

hsmhsm updated this revision to Diff 318736.Fri, Jan 22, 11:18 PM

Re-arrange code for more readability.

hsmhsm updated this revision to Diff 318738.Fri, Jan 22, 11:40 PM

Fixed clang-tidy warnings.