The main motivation behind pointer replacement of LDS use within non-kernel
functions is - to *avoid* subsequent LDS lowering pass from directly packing
LDS (assume large LDS) into a struct type which would otherwise cause allocating
huge memory for struct instance within every kernel.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I am planning to take-up handling of *bit-casted function pointers* (while collecting reacheable callees for kernels) as a seperate patch on top of it.
The reason for this is :
(1) *not* handling of *bit-casted function pointers* does not break any functionality - when we fail to collect reacheable callees because of this, lowering phase directly lower LDS instead of pointers, so no break in any functionality from semantics persepctive
(2) I want this base patch to be upstreamed first before handling corner cases. It will also help us to mitigate LDS memory ovehead to greater extent when the ModuleLDSLowering code will be shipped in coming rocm releases.
Fix pre-merge check failure notified at https://reviews.llvm.org/harbormaster/unit/view/755898/
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp | ||
---|---|---|
47 | You say "all reachable ... nodes" but then you only iterate over an SCC, which is different, and will miss some reachable nodes. If you really want all reachable nodes then use depth_first_ext which will do it all for you. (See EliminateUnreachableBlocks in BasicBlockUtils.cpp for an example of how to use it.) |
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp | ||
---|---|---|
47 | The comment is bit vague. I actually meant this - "In a call graph, for a given (caller) node, collect all reachable *callee* nodes". Here, when I say reachable callee nodes, it includes those which are called within callees of caller, and so on. For example, say, f1 calls f2, f2 calls f3, and f3 calls f4. Then reachable callee set for f1 is {f2, f3, f4} So, given f1, is not that SCC iterator will collect {f2, f3, f4} for me? What are the cases that it cannot handle? |
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp | ||
---|---|---|
47 | See https://en.wikipedia.org/wiki/Strongly_connected_component f4 will only be in the same SCC as f1 if there is a path from f1 to f4 and back to f1 again. |
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp | ||
---|---|---|
47 | But, that is fine, it does not matter. What we are fundamentally doing here is - We collect all reachable SCCs from given kernel, then we collect all the nodes within those collected SCCs. Actually, I am not sure, if I get you here. Let me further think about it, in case if I am missing any obvious point here. |
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp | ||
---|---|---|
47 | Further, I experimented for the graph that you mentioned: f1 calls f2 f1 calls f4 f2 calls f3 f4 calls f1 For this graph, yes, we land-up with three SCCs and we collect nodes from all three SCCs. Hence we have collected all reachable callees from f1. Start node: f1 SCC1: f3 SCC2: f2 SCC3: f4 f1 Reachable nodes from f1: f3 f2 f4 f1 |
You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp | ||
---|---|---|
353 | It is not strictly required. I would just add it from TM. | |
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
123 | auto UsedList. Less updates if return type will change. | |
136 | It is local variable, no need to clean. It will be destroyed. | |
141 | Require is too strong word here. Not beneficial. | |
152 | return !LDSToNonKernels[GV].empty(); | |
169 | Technically it can be less. Some ASICs have 32K. | |
263 | Can you name it somehow more readable? Entry, Entry2... Not speaking names. | |
310 | That's a copy of set, you could use reference. |
But, I remember that we had decided to avoid barrier here, and instead just make sure that each thread within each wave execute the store instructions? In anycase, let me clarify it with @t-tye and @b-sumner.
I do not remember, but probably we can omit it since it is a singe store readonly memory. Anyway a confirmation from @t-tye would be nice.
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp | ||
---|---|---|
29 | Typo runnning | |
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
148–151 | Why not check this first? | |
201 | The compiler should never introduce ptrtoint. This should offset from a base address with getelementptr | |
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp | ||
246 | This should really be a generic utility function |
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
---|---|---|
201 | Actually I am wondering - how to do it. "inttoptr" can be replaced with getelementptr. On the other-hand, how to achieve it for "ptrtoint"? getelementptr gives pointer value, not integer value, what I want here is - an integer value (pointer converted to integer). So, how to get integer value from getelementptr? |
llvm/include/llvm/IR/ReplaceConstant.h | ||
---|---|---|
31 ↗ | (On Diff #349451) | Should this utility be reviewed an submitted separately since it is a common code? |
llvm/include/llvm/IR/ReplaceConstant.h | ||
---|---|---|
31 ↗ | (On Diff #349451) | Pushed new patch at https://reviews.llvm.org/D103661. |
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
---|---|---|
153 | Don't replace it when the variable is used by all (possibly most) kernels that can access any LDS. For example, when the variable is part of a language runtime that has been linked into all kernels. Such a variable will still be allocated in all the kernels so putting it behind a pointer wastes two bytes of LDS everywhere and introduces code to do the indirection with zero benefit. |
For the record, the agreed way is to do a store from lane 0 of each wave and follow with a wave barrier.
I agree with @rampitec, although we did suggest measuring to confirm that the work-group barrier and a single wave's lane 0 is not faster than multiple waves's lane 0 and a wave barrier.
We also observed that using a wave barrier is UB in the language memory model, although well defined in the AMDGPU hardware memory model. However, the current AMD GPU sync-scope definition implies the language rules and so would be an issue if any future atomic optimizations exploited that.
Two approaches for limiting the stores to lane 0 of each wave:
- Write 1 to exec mask, store, and write -1 to exec mask. This works since the exec mask at the start of the wave when this happens is -1
- Check for lane == 0 and branch. The lane can be computed by a) wave64: builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)) b) wave32: __builtin_amdgcn_mbcnt_lo(~0u, 0u)
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
---|---|---|
151 | If it is empty it probably will not be returned by findVariablesToLower()? | |
153 |
Sound like a good TODO at the very least. | |
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll | ||
42 | We had problems with null before. Instruction selection will happily allocate it overlapping with first non-null variable. That's how AMDGPULowerModuleLDSPass turned to use struct instead of null. Did you check we are not introducing this again in the final ISA? |
OK, let's see, which one is more feasible from the implementation point of view.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
---|---|---|
151 | That is true. But, there is a catch. Assume the below case. Here "@lds.1" used within non-kernel function scope, but also it is used as initializer to global @gptr.1. So, we cannot pointer replace it. In this case, collectNonKernelAccessorsOfLDS() returns empty set. @lds.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1 @gptr.1 = addrspace(1) global i64* addrspacecast ([1 x i8] addrspace(3)* @lds.1 to i64*), align 8 define void @f0() { %bc = bitcast [1 x i8] addrspace(3)* @lds.1 to i8 addrspace(3)* store i8 1, i8 addrspace(3)* %bc, align 1 ret void } define amdgpu_kernel void @k0() { call void @f0() ret void } | |
153 | It is in my TODO list. It requires some logical thinking, before carefully handling this case. That is the reason, I had not yet replied to Jon's comment. | |
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll | ||
42 | The null represent the base address of LDS memory. My understanding is that, earlier, instruction selection had gone for toss because of the bug where we were not actually allocating memory at all for llvm.amdgcn.module.lds, but we were accessing it at address 0. And, instruction selection was allocating address 0 for some other lds (since this address was not occupied), so there was an issue. But, nevertheless, let me look into ISA and see if we are fine or any issue because of null. |
Let only lane 0 from each wave do lds pointer initialization, followed by a wave barrier.
Implemented approach(2). Here we actually do not need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of the wave64 or wave32, _builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is - we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)).
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll | ||
---|---|---|
42 | Further, I looked into ISA, it looks good to me. For example consider below input code. @lds.1 = internal addrspace(3) global i32 undef, align 4 @lds.2 = internal addrspace(3) global i64 undef, align 4 define internal void @func_uses_lds() { entry: store i32 7, i32 addrspace(3)* @lds.1 store i64 31, i64 addrspace(3)* @lds.2 ret void } define protected amdgpu_kernel void @kernel_reaches_lds() { entry: call void @func_uses_lds() ret void } output from pointer-replacement is: @lds.1 = internal addrspace(3) global i32 undef, align 4 @lds.2 = internal addrspace(3) global i64 undef, align 4 @lds.1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2 @lds.2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2 define internal void @func_uses_lds() { entry: %0 = load i16, i16 addrspace(3)* @lds.2.ptr, align 2 %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0 %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)* %3 = load i16, i16 addrspace(3)* @lds.1.ptr, align 2 %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3 %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)* store i32 7, i32 addrspace(3)* %5, align 4 store i64 31, i64 addrspace(3)* %2, align 4 ret void } define protected amdgpu_kernel void @kernel_reaches_lds() { entry: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0) %1 = icmp eq i32 %0, 0 br i1 %1, label %2, label %3 2: ; preds = %entry store i16 ptrtoint (i64 addrspace(3)* @lds.2 to i16), i16 addrspace(3)* @lds.2.ptr, align 2 store i16 ptrtoint (i32 addrspace(3)* @lds.1 to i16), i16 addrspace(3)* @lds.1.ptr, align 2 br label %3 3: ; preds = %entry, %2 call void @llvm.amdgcn.wave.barrier() call void @func_uses_lds() ret void } output from module (kernel) lds lowering is: %llvm.amdgcn.module.lds.t = type { i16, i16 } %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t = type { i64, i32 } @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 2 @llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*) to i8*)], section "llvm.metadata" @llvm.amdgcn.kernel.kernel_reaches_lds.lds = internal addrspace(3) global %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t undef, align 8 define internal void @func_uses_lds() { entry: %0 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2 %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0 %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)* %3 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2 %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3 %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)* store i32 7, i32 addrspace(3)* %5, align 4 store i64 31, i64 addrspace(3)* %2, align 4 ret void } define protected amdgpu_kernel void @kernel_reaches_lds() { entry: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ] %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0) %1 = icmp eq i32 %0, 0 br i1 %1, label %2, label %5 2: ; preds = %entry %3 = ptrtoint i64 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 0) to i16 store i16 %3, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2 %4 = ptrtoint i32 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 1) to i16 store i16 %4, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2 br label %5 5: ; preds = %entry, %2 call void @llvm.amdgcn.wave.barrier() call void @func_uses_lds() ret void } what it fundamentally means is:
func_uses_lds: ; @func_uses_lds ; %bb.0: ; %entry s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) v_mov_b32_e32 v1, 0 ds_read_i16 v2, v1 ds_read_i16 v3, v1 offset:2 v_mov_b32_e32 v4, 7 v_mov_b32_e32 v0, 31 s_waitcnt lgkmcnt(1) ds_write_b32 v2, v4 s_waitcnt lgkmcnt(1) ds_write_b64 v3, v[0:1] s_waitcnt lgkmcnt(0) s_setpc_b64 s[30:31] |
As far as I understand mbcnt_lo will return 0 for any thread >= 32, so you still need to use mbcnt_hi.
@b-sumner why do you suggest to nest the hi and lo calls? I think it shall be (builtin_amdgcn_mbcnt_lo(~0u, 0u) + builtin_amdgcn_mbcnt_hi(~0u, 0u)) == 0.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
---|---|---|
151 |
Should not it be fixed by D103431? | |
153 | You can leave a TODO comment here for now. Patch is already too big, so it will need a separate change anyway. | |
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll | ||
42 | So the idea is llvm.amdgcn.module.lds will only contain converted pointers and since it is forced to be allocated at address zero these pointers will fall into the allocated space? We need to be super careful here because if anything else will be injected into the llvm.amdgcn.module.lds by the module lds pass it will all break. In fact there seems to be no work for module lds after this for the module, only for kernels. You probably can just create module structure right here. Then module lds should skip run on module (but not on kernels) if that variable already exists. It will also untangle dependency between passes. In addition you will resolve issue with null as you would just use that structure and Matt's concern about ptrtoint. |
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
---|---|---|
121 | Do you really need this new map? You can get a pointer to block from KernelToLDSPointers by getting a parent of the first instruction. |
Anyhow, manual says:
Example to compute each thread's position in 0..63: v_mbcnt_lo_u32_b32 v0, -1, 0 v_mbcnt_hi_u32_b32 v0, -1, v0 // v0 now contains ThreadPosition
So it tells to nest it. Plus will fail on lane 32 as far as I understand.
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll | ||
---|---|---|
16 | This likely should not be i16. You could do it this way: @lds_used_within_func.ptr = internal unnamed_addr addrspace(3) global i8 addrspace(3)* undef, align 2 Then use bitcast to i8 addrspace(3)* instead of ptrtoint on store. But I still think that creating module lds right here is better. |
Actually, in wave64 mode, for any thread >= 32, mbcnt_lo returns 32 not 0, practically speaking we can do without mbcnt_hi, but may be it is good and safe to have mbcnt_hi for wave64, I will make changes accordingly.
@b-sumner why do you suggest to nest the hi and lo calls? I think it shall be (builtin_amdgcn_mbcnt_lo(~0u, 0u) + builtin_amdgcn_mbcnt_hi(~0u, 0u)) == 0.
Anyhow, manual says:
Example to compute each thread's position in 0..63: v_mbcnt_lo_u32_b32 v0, -1, 0 v_mbcnt_hi_u32_b32 v0, -1, v0 // v0 now contains ThreadPositionSo it tells to nest it. Plus will fail on lane 32 as far as I understand.
No it does not fail on lane 32. For lane 32, v_mbcnt_lo returns 32, and you pass this return value from v_mbcnt_lo to v_mbcnt_hi. v_mbcnt_hi will not add any additioncal count (on top of 32), hence it rerurns back 32 again, that is lane position for lane 32.
practically speaking we can do without mbcnt_hi, but may be it is good and safe to have mbcnt_hi for wave64
I am baffled by this. Why would you add an extra instruction when you have just explained that it is not required?
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
---|---|---|
121 | I think we need it. Consider below transformed code. Here we need to track the newly introduced block st since this is where we need to insert store instructions. And I am not sure, how can I get it from KernelToLDSPointers. define protected amdgpu_kernel void @k0() { entry: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0) %1 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %0) %2 = icmp eq i32 %1, 0 br i1 %2, label %st, label %ft st: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds to i16), i16 addrspace(3)* @lds.ptr, align 2 br label %ft ft: call void @llvm.amdgcn.wave.barrier() call void @f0() ret void } | |
151 | The patch https://reviews.llvm.org/D103431 has fixed cases where few globally used lds were not getting lowering. Now, here is the basic funda: The pass LowerModuleLDS lowers LDS which are used either in global scope or in non-kernel function scope or in both. And it directly packs the LDS within struct which we want to avoid since it wastes memory. So, in this pointer-replacement patch, we try to replace uses of LDS by pointers so that LowerModuleLDS pass ends-up packing pointers instead of LDS themselves. However, we can only pointer replace those LDS which are used within non-kernel function scope, but not the ones which are used in global scope. Now in the above example, assume that we create a pointer to @lds.1 since it is used within non-kernel function scope, and replace its use within function by pointer. Then LowerModuleLDS pass sees use of @lds.1 (in global scope) and the use of pointer within function. Hence it lowers both lds and its pointer, means it packs both @lds.1 and its pointer within struct, which is waste. Hence this pass does not touch the lds which are used both in global and non-kernel function scope, and leaves those lds to getting directly lowered by LowerModuleLDS pass, means directly getting packed within struct. And, I assume such cases are not frequent in practice. | |
153 | I have kept is as TODO and come to it later in a separate patch. | |
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll | ||
16 | The main goal of using i16 as pointer (with ptrtoint and GEP) is to save memory as much as possible. Using "i8 addrspace(3)*" means we are going to allocate memory size for pointer (not sure if it is 4 or 8 bytes, but definitely greater than 2 bytes). Hence the use of i16 type here. Also, I do not think, it is a good idea to create module lds right here. If that was the case, we could have put this code within module LDS pass itself instead of having separate pass (I did it initially, later we decided that pointer-replacement code should go as separate pass). This pass is already too big, let's not change the design again and further complicate the stuff which will further delay code submit, and which leads to other serious problems. | |
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll | ||
42 | As I mentioned in one my earlier comments - we still need llvm.amdgcn.module.lds. We cannot remove it for now. The main reason for above is - for a given use of LDS within some non-kernel function, we need to make sure that appropriate memory for LDS is being accessed based on which kernel is called that function. We have not yet solved this issue. The current available solution is this llvm.amdgcn.module.lds. But, we should attempt to avoid wasting of memory by llvm.amdgcn.module.lds, hence this pointer replacement pass. What this pass does is below: (1) create an i16 pointer "lds.ptr" to "lds" where "lds" is used in some non-kernel function. lds.ptr = ptrtoint(lds) (3) Within non-kernel where lds is used: lds.addr = GEP(null, lds.ptr) // where null is the base address of the LDS memory, and we use GEP here to avoid inttoptr replace use of "lds" by "lds.addr" after apropriately bitcasting. What it means to LowerModuleLDS pass is below: (1) The use of "lds" is moved to kernel (pointer initialization), hence it is properly allocated within kernel as a part of kernel specific struct, and more important it is allocated only if there is call from kernel to function. Otherwise there is no this pointer initialization, and hence no usage, and hence no allocation. (2) It sees the use of "lds.ptr" within non-kernel function, and hence it lowers this ptr by putting it within module struct. Otherwise, we still allocate memory for llvm.amdgcn.module.lds within every kernel. But, now we are not wasting too much memory. In anycase, within every kernel, llvm.amdgcn.module.lds is allocated at address 0. all the other vars (belonging to kernel) are allocated at some non-zero addresses. In this way, I do not see any problem here, unless I am missing any corner, but serious case for conflcting with address 0. |
Correct, but, I myself could not confidently prove that it is always safe to use only mbcnt_lo even for wave64 mode, hence had to insert mbcnt_hi for wave64 mode.
Actually I think you were right. mbcnt_lo will always return -1 for any lane higher than 31.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp | ||
---|---|---|
121 | Ok. | |
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll | ||
16 |
Should not sizeof local pointer be 2? Checked the DL: p3:32:32. Sigh!
My concern is that you assume these i16 pointers will start a null, but that might not be the case if you rely on the module LDS pass to pack them into the struct along with other LDS and let it sort it. Can you add a test where you have these pointers and also another module LDS which will have higher alignment and size, so module LDS will pack them together? We could then inspect output of 2 passes and the ISA. |
[1] Do not insert mbcnt_hi for wave64 mode, mbcnt_lo itself is enough to detect lane 0 even in wave64 mode.
[2] Add a new lit test as asked by Stas.
Patch is again updated accordingly.
Now I again updated the patch - mbcnt_lo itself is enough to detect lane 0 even in wave64 mode.
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll | ||
---|---|---|
16 | No, what you concern here is not at all arise. Please have a look the new test case - replace-lds-by-ptr-lds-offsets.ll. I have given detailed description within this test, and I hope, it will clarify your above doubt. |
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll | ||
---|---|---|
16 | Thanks, that helps. Tracked pointers and stores there, it seems to be consistent. Matt raised a concern that IR passes may refuse to produce a 'noalias' answer in presence of null, but that is another concern which we may be able to solve later. In particular any operation on such pointer is invariant and noalias in a given function, essentially it is just a single load, so we may just annotate it accordingly. Not required for this patch though. |
LGTM. Please allow at least 24 hours for others to comment too before submitting. In the meanwhile PSDB and ePSDB runs will be useful.
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp | ||
---|---|---|
51 | There is no need to find strongly connected components. This should use a df_iterator instead. |
This may be the root cause of recent crashes. Reported to me offline, posting the stack trace here. @ronlieb may have a repro from different code.
llvm-project/llvm/include/llvm/ADT/Optional.h:96: T& llvm::optional_detail::OptionalStorage<T, <anonymous> >::getValue() & [with T = llvm::WeakTrackingVH; bool <anonymous> = false]: Assertion `hasVal' failed. PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace. Stack dump: 0. Program arguments: $HOME/clang/bin/llc /tmp/zgemm-bb1d9d-gfx908-linked-b8cb9f.bc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -filetype=obj -o /tmp/zgemm-bb1d9d-gfx908-2e0957.o 1. Running pass 'Replace within non-kernel function use of LDS with pointer' on module '/tmp/zgemm-bb1d9d-gfx908-linked-b8cb9f.bc'. #0 0x00007f1e8ed35ebf PrintStackTraceSignalHandler(void*) Signals.cpp:0:0 #1 0x00007f1e8ed33879 SignalHandler(int) Signals.cpp:0:0 #2 0x00007f1e93a112d0 __restore_rt (/lib64/libpthread.so.0+0x132d0) #3 0x00007f1e8dec5520 raise (/lib64/libc.so.6+0x39520) #4 0x00007f1e8dec6b01 abort (/lib64/libc.so.6+0x3ab01) #5 0x00007f1e8debdb1a __assert_fail_base (/lib64/libc.so.6+0x31b1a) #6 0x00007f1e8debdb92 (/lib64/libc.so.6+0x31b92) #7 0x00007f1e8d753ad4 llvm::AMDGPU::CollectReachableCallees::collectReachableCallees(llvm::Function*) ($HOME/clang/lib/libLLVMAMDGPUUtils.so.13git+0x1fad4) #8 0x00007f1e8d753e27 llvm::AMDGPU::collectReachableCallees(llvm::Module&, llvm::DenseMap<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u>, llvm::DenseMapInfo<llvm::Function*>, llvm::detail::DenseMapPair<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u> > >&) ($HOME/clang/lib/libLLVMAMDGPUUtils.so.13git+0x1fe27) #9 0x00007f1e9cc2bf67 (anonymous namespace)::ReplaceLDSUseImpl::replaceLDSUse() AMDGPUReplaceLDSUseWithPointer.cpp:0:0 #10 0x00007f1e9cc2eec8 (anonymous namespace)::AMDGPUReplaceLDSUseWithPointer::runOnModule(llvm::Module&) AMDGPUReplaceLDSUseWithPointer.cpp:0:0 #11 0x00007f1e8f40f4e8 llvm::legacy::PassManagerImpl::run(llvm::Module&) ($HOME/clang/lib/libLLVMCore.so.13git+0x2174e8) #12 0x0000000000416615 compileModule(char**, llvm::LLVMContext&) llc.cpp:0:0 #13 0x000000000040d596 main ($HOME/clang/bin/llc+0x40d596) #14 0x00007f1e8deb034a __libc_start_main (/lib64/libc.so.6+0x2434a)
seeing this assert in trunk today in pass
llvm::AMDGPU::CollectReachableCallees::collectReachableCallees
rlieberm@r4:~/git/aomp12/aomp/test/smoke/helloworld$ /home/rlieberm/rocm/trunk_1.0bin/clang -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 helloworld.c -o helloworld
llc: /work/rlieberm/mono-repo/llvm-project/llvm/include/llvm/ADT/Optional.h:96: T& llvm::optional_detail::OptionalStorage<T, <anonymous> >::getValue() & [with T = llvm::WeakTrackingVH; bool <anonymous> = false]: Assertion `hasVal' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: /home/rlieberm/rocm/trunk_1.0bin/llc /tmp/helloworld-9e3621-gfx906-linked-3dda69.bc -O2 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -filetype=obj -o /tmp/helloworld-9e3621-gfx906-1499ac.o
- Running pass 'Replace within non-kernel function use of LDS with pointer' on module '/tmp/helloworld-9e3621-gfx906-linked-3dda69.bc'. #0 0x000055e8771bf08f PrintStackTraceSignalHandler(void*) Signals.cpp:0:0 #1 0x000055e8771bc8dd SignalHandler(int) Signals.cpp:0:0 #2 0x00007efe2440d980 restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12980) #3 0x00007efe230befb7 raise /build/glibc-S7xCS9/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:51:0 #4 0x00007efe230c0921 abort /build/glibc-S7xCS9/glibc-2.27/stdlib/abort.c:81:0 #5 0x00007efe230b048a assert_fail_base /build/glibc-S7xCS9/glibc-2.27/assert/assert.c:89:0 #6 0x00007efe230b0502 (/lib/x86_64-linux-gnu/libc.so.6+0x30502) #7 0x000055e8774c87c2 llvm::AMDGPU::CollectReachableCallees::collectReachableCallees(llvm::Function*) (/home/rlieberm/rocm/trunk_1.0bin/llc+0x20b27c2) #8 0x000055e8774c8b1f llvm::AMDGPU::collectReachableCallees(llvm::Module&, llvm::DenseMap<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u>, llvm::DenseMapInfo<llvm::Function*>, llvm::detail::DenseMapPair<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u> > >&) (/home/rlieberm/rocm/trunk_1.0bin/llc+0x20b2b1f)
instructions to build aomp from trunk and run test
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/tmp/trunk_1.0 \
-DCMAKE_BUILD_TYPE=Release \ -DLLVM_ENABLE_PROJECTS="clang;lld" \ -DLLVM_TARGETS_TO_BUILD="X86;AMDGPU" \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DLLVM_CCACHE_BUILD=$OPTION_CCACHE \ -DLLVM_ENABLE_RUNTIMES="openmp" \ -DOPENMP_ENABLE_LIBOMPTARGET_HSA=1 \ -DCLANG_DEFAULT_LINKER=lld \ ../llvm -GNinja
ninja install
git clone http://github.com/rocm-developer-tools/aomp
cd aomp/test/smoke/helloworld
/tmp/trunk_1.0//bin/clang -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 helloworld.c -o helloworld
those clone instructions will yield a bunch of code, including some makefile machinery and the following C:
// helloworld.c #include <stdio.h> #include <omp.h> int main(void) { int isHost = 1; #pragma omp target map(tofrom: isHost) { isHost = omp_is_initial_device(); printf("Hello world. %d\n", 100); for (int i =0; i<5; i++) { printf("Hello world. iteration %d\n", i); } } printf("Target region executed on the %s\n", isHost ? "host" : "device"); return isHost; }
I am going to revert this change as the author is unlikely to be available at present (early morning Saturday), in accordance with the guidelines at https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy. It'll reapply easily once the assert is run to ground.
It breaks a lot of the smoke tests in aomp. It's also been hit externally. It reproduces on a minimal openmp example (presumably because our runtime does things with shared memory). I don't see how this change could have passed epsdb, so reverting it also helps with the merge from trunk into rocm.
edit: doesn't revert cleanly. Considering whether to leave trunk broken or not.
Git revert is unclean, gone with D104962 instead. That changes the pass to disabled by default, changes the test header for these tests to re-enable it, deletes one test where that strategy failed.
I am seeing a run-time failure as well in addition to the compile-time failure reported earlier. Here's a test case, veccopy.c.
#include <stdio.h>
#include <omp.h>
int main()
{
int N = 100000; int a[N]; int b[N]; int i; for (i=0; i<N; i++){ a[i]=0; b[i]=i; }
#pragma omp target teams distribute parallel for map(from: a[0:N]) map(to: b[0:N])
{ for (int j = 0; j< N; j++) a[j]=b[j]; } int rc = 0; for (i=0; i<N; i++) if (a[i] != b[i] ) { rc++; printf ("Wrong value: a[%d]=%d\n", i, a[i]); } if (!rc){ printf("Success\n"); return EXIT_SUCCESS; } else{ printf("Failure\n"); return EXIT_FAILURE; }
}
Compile/run:
clang -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 veccopy.c -o veccopy
./veccopy
[GPU Memory Error] Addr: 0x7f60a4210000 Reason: Page not present or supervisor privilege.
Memory access fault by GPU node-1 (Agent handle: 0x168c5d0) on address 0x7f60a4210000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
Adding -mllvm -amdgpu-enable-lds-replace-with-pointer=false gets rid of the problem.
Running with env var LIBOMPTARGET_KERNEL_TRACE=1
shows LDS went down from 16400B to 32B but it triggered the above problem.
W.r.t compile time failure, it might be because of the patch D104704 which is added on top of this patch.
W.r.t run time issues, I am surprised that our internal psdb job did not catch it. We really need to look into it.
psdb (our classic testing) has zero OpenMP tests.
The ePSDB does have OpenMP testing.
a jenkins request to run ePSDB on a phab patch would be needed to get ePSDB testing.
After the patch has landed in trunk, the twice-daily merges from trunk to amd-stg-open will run both a psdb and ePSDB to validate the set of patches which might get merged.
if the merge testing in psdb/ePSDB fails to pass, merge does not land in amd-stg-open.
i think our merge process is about to try and digest the LDS patch.
Jon, Dhruva, i added your two tests to our smoke-fails this morning. i see both failing as you described with trunk build. I see both passing in this mornings amd-stg-open with the pass enabled, and with the pass disabled.
Here's a .ll testcase reduced from the openmp helloworld thing:
; ModuleID = 'z.ll' source_filename = "llvm-link" target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7" target triple = "amdgcn-amd-amdhsa" @usedSlotIdx = internal local_unnamed_addr addrspace(3) global i32 undef, align 4 define weak amdgpu_kernel void @__omp_offloading_10302_1a22e53_main_l7() { entry: %nvptx_num_threads = call i32 @__kmpc_amdgcn_gpu_num_threads() ret void } ; Function Attrs: nounwind readnone speculatable willreturn declare i32 @llvm.amdgcn.workitem.id.x() #0 declare i32 @__kmpc_amdgcn_gpu_num_threads() define internal void @__kmpc_kernel_init() { entry: store i32 undef, i32 addrspace(3)* @usedSlotIdx, align 4, !tbaa !0 ret void } ; Function Attrs: argmemonly nofree nosync nounwind willreturn declare void @llvm.lifetime.start.p5i8(i64 immarg, i8 addrspace(5)* nocapture) #1 ; Function Attrs: argmemonly nofree nosync nounwind willreturn declare void @llvm.lifetime.end.p5i8(i64 immarg, i8 addrspace(5)* nocapture) #1 attributes #0 = { nounwind readnone speculatable willreturn } attributes #1 = { argmemonly nofree nosync nounwind willreturn } !0 = !{!1, !1, i64 0} !1 = !{!"int", !2, i64 0} !2 = !{!"omnipotent char", !3, i64 0} !3 = !{!"Simple C++ TBAA"}
Compile with: llc -march=amdgcn -mcpu=gfx906 -o /dev/null z.ll
Typo runnning