If workgroup size is known inform llvm about range returned by local
id and local size queries.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Doesn't the library already annotate these with the range metadata? We should probably tighten those bounds in a pass when the required workgroup size is known on the IR metadata
Generally library cannot know the workgroup size, it is the attribute on a kernel. Then clang produces amdgpu_flat_work_group_size, which is processed here. Too bad it is flat. There is also OpenCL specific reqd_work_group_size attribute which is now flattened and translated into amdgpu_flat_work_group_size by clang. Technically it shall be possible to get a more precise range with processing OpenCL specific reqd_work_group_size, but practically we do not support flat sizes more than 256, and AssertZExt cannot give a better range representation than 'extend from byte' anyway. A computeKnownBits could do it better, but it needs to process a target opcode, when after lowering it is just a load.
On a side note, there are other calls which can be simplified, like get_local_size(). I do not know how to do it though, because these are just loads yet in the library, they have neither intrinsics nor target opcodes.
The library can use the hardware maximum (which I think it does already), and a pass that knows about the attribute can further reduce it. It can do better than extend from byte, it isn't limited to MVT types. Range metadata is already generically lowered to an arbitrary bitwidth to AssertZExt.
Doing it here doesn't really change anything fundamentally, but fixing the range metadata will allow the IR passes the same benefit and also wouldn't require reimplementing the logic to turn the range into AssertZExt.
I do not see any range metadata, and I also do not think this is a right way to go to use HW maximum. A kernel attribute generally capable to limit it more. For example:
__attribute__((reqd_work_group_size(128, 1, 1))) kernel void zext_grp_size_256(global uint *a) { a[0] = get_local_id(0) & 0xff; }
compiled to:
; Function Attrs: nounwind define amdgpu_kernel void @zext_grp_size_256(i32 addrspace(1)* nocapture %a) local_unnamed_addr #0 !kernel_arg_addr_space !2 !kernel_arg_access_qual !3 !kernel_arg_type !4 !kernel_arg_base_type !4 !kernel_arg_type_qual !5 !kernel_arg_name !6 !reqd_work_group_size !7 { entry: %call = tail call i64 @_Z12get_local_idj(i32 0) #2 %0 = trunc i64 %call to i32 %conv = and i32 %0, 255 store i32 %conv, i32 addrspace(1)* %a, align 4, !tbaa !8 ret void } ; Function Attrs: alwaysinline nounwind readnone define linkonce_odr protected i64 @_Z12get_local_idj(i32) local_unnamed_addr #1 { %2 = tail call i64 @__ockl_get_local_id(i32 %0) #2 ret i64 %2 } attributes #1 = { alwaysinline nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+fp64-fp16-denormals,-fp32-denormals" "unsafe-fp-math"="false" "use-soft-float"="false" } attributes #2 = { nounwind readnone }
BTW, I do not see how to use AssertZExt with an arbitrary bitwidth...
That is probably possible to extend AMDGPULowerIntrinsics pass to generate range metadata. It works before inliner pass, but since it works after opt it shall be reasonable to assume these calls are already inlined.
What do you think, Matt?
I'm not saying the hardware maximum is the final answer, but it is a useful starting point when there is no fixed workgroup size.
__attribute__((reqd_work_group_size(128, 1, 1))) kernel void zext_grp_size_256(global uint *a) { a[0] = get_local_id(0) & 0xff; }compiled to:
; Function Attrs: nounwind define amdgpu_kernel void @zext_grp_size_256(i32 addrspace(1)* nocapture %a) local_unnamed_addr #0 !kernel_arg_addr_space !2 !kernel_arg_access_qual !3 !kernel_arg_type !4 !kernel_arg_base_type !4 !kernel_arg_type_qual !5 !kernel_arg_name !6 !reqd_work_group_size !7 { entry: %call = tail call i64 @_Z12get_local_idj(i32 0) #2 %0 = trunc i64 %call to i32 %conv = and i32 %0, 255 store i32 %conv, i32 addrspace(1)* %a, align 4, !tbaa !8 ret void } ; Function Attrs: alwaysinline nounwind readnone define linkonce_odr protected i64 @_Z12get_local_idj(i32) local_unnamed_addr #1 { %2 = tail call i64 @__ockl_get_local_id(i32 %0) #2 ret i64 %2 }
You need to look a level below this. Ideally these would be annotate as well, but I think just the final intrinsic call has it. Range metadata can also apply to loads, so it works in the library's use for the sizes read out of the dispatch packet
attributes #1 = { alwaysinline nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+fp64-fp16-denormals,-fp32-denormals" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #2 = { nounwind readnone }
BTW, I do not see how to use AssertZExt with an arbitrary bitwidth...
You can get a value type with an EVT. SelectionDAGBuilder::lowerRangeToAssertZExt does this.
That might be a place to do it. I think we would want this done earlier, although then there are call graph problems with multiple kernels
I do not see range info anywhere... In fact it is only generated in AMDGPUPromoteAlloca for newly inserted calls in alloca handling.
Anyway, I will try to patch AMDGPULowerIntrinsics now.
Changed approach to generate range metadata.
Added processing of reqd_work_group_size to refine individual dimension results.
Created common method in subtarget to facilitate all places where we use it.
Promote alloca pass switched to the new method. This refines ranges produced over previous HW limit.
This also fixes bug in the range info produced by promote alloca pass: range metadata is [Lo, Hi), it was generated incorrectly as [0, 2048). Note, for ID query the range shall be one less than for size query, while it was produced the same. I.e. if size would really be 2048 local size range would be incorrectly assumed [0..2047].
Fixed bug in previous revision: promote alloca should not set range on XY component of local size, since it actually loads two lanes.
lib/Target/AMDGPU/AMDGPULowerIntrinsics.cpp | ||
---|---|---|
106 ↗ | (On Diff #94874) | Lowercase first letter |
lib/Target/AMDGPU/AMDGPUSubtarget.cpp | ||
251–255 ↗ | (On Diff #94874) | This seems to be unconditionally adding invariant load metadata. Why? This is broken and also unrelated to the range metadata |
lib/Target/AMDGPU/AMDGPUSubtarget.h | ||
517 ↗ | (On Diff #94874) | Putting this in the subtarget is a weird place. Why not leave it in the pass? |
test/CodeGen/AMDGPU/zext-lid.ll | ||
1 ↗ | (On Diff #94874) | This should be a test running the IR pass, with more checks for the specific ranges added |
lib/Target/AMDGPU/AMDGPUSubtarget.h | ||
---|---|---|
517 ↗ | (On Diff #94874) | I need to access it from the intrinsic lowering and from promote alloca, so I needed some kind of utility function. |
Renamed function and moved invariant load meta back into promote alloca pass.
test/CodeGen/AMDGPU/zext-lid.ll | ||
---|---|---|
1 ↗ | (On Diff #94874) | How do you propose to run it? With opt? |
test/CodeGen/AMDGPU/zext-lid.ll | ||
---|---|---|
1 ↗ | (On Diff #94874) | Actually I have a problem here: |