This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Generate range metadata for workitem id
ClosedPublic

Authored by rampitec on Apr 7 2017, 1:54 AM.

Details

Summary

If the workgroup size is known, inform LLVM about the range returned by local
id and local size queries.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Apr 7 2017, 1:54 AM
arsenm edited edge metadata.Apr 10 2017, 10:47 AM

Doesn't the library already annotate these with the range metadata? We should probably tighten those bounds in a pass when the required workgroup size is known from the IR metadata.

Generally the library cannot know the workgroup size; it is an attribute on a kernel. Clang then produces amdgpu_flat_work_group_size, which is processed here. Too bad it is flat. There is also the OpenCL-specific reqd_work_group_size attribute, which is currently flattened and translated into amdgpu_flat_work_group_size by clang. Technically it should be possible to get a more precise range by processing the OpenCL-specific reqd_work_group_size, but practically we do not support flat sizes above 256, and AssertZExt cannot give a better range representation than 'extend from byte' anyway. computeKnownBits could do it better, but it would need to process a target opcode, whereas after lowering it is just a load.

On a side note, there are other calls which could be simplified, like get_local_size(). I do not know how to do that though, because in the library these are still just loads; they have neither intrinsics nor target opcodes.
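
For example, with __attribute__((reqd_work_group_size(64, 2, 1))) the flat size is 64*2*1 = 128, and clang emits roughly the following (a sketch; the exact metadata numbering is illustrative):

define amdgpu_kernel void @k() #0 !reqd_work_group_size !0 {
  ret void
}

attributes #0 = { "amdgpu-flat-work-group-size"="128,128" }

!0 = !{i32 64, i32 2, i32 1}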

The library can use the hardware maximum (which I think it does already), and a pass that knows about the attribute can further reduce it. It can do better than 'extend from byte'; it isn't limited to MVT types. Range metadata is already generically lowered to AssertZExt at an arbitrary bitwidth.

Doing it here doesn't really change anything fundamentally, but fixing the range metadata will give the IR passes the same benefit and also wouldn't require reimplementing the logic to turn the range into AssertZExt.
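
For example, if a pass knew the flat workgroup size was at most 128, it could put the range directly on the intrinsic call (a minimal sketch, not any pass's actual output):

declare i32 @llvm.amdgcn.workitem.id.x()

define amdgpu_kernel void @k(i32 addrspace(1)* %out) {
  %id = tail call i32 @llvm.amdgcn.workitem.id.x(), !range !0
  store i32 %id, i32 addrspace(1)* %out, align 4
  ret void
}

!0 = !{i32 0, i32 128} ; ids are 0..127; range metadata is half-open [Lo, Hi)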

I do not see any range metadata, and I also do not think using the HW maximum is the right way to go; a kernel attribute can generally limit it more. For example:

__attribute__((reqd_work_group_size(128, 1, 1)))
kernel void zext_grp_size_256(global uint *a) {
  a[0] = get_local_id(0) & 0xff;
}

compiled to:

; Function Attrs: nounwind
define amdgpu_kernel void @zext_grp_size_256(i32 addrspace(1)* nocapture %a) local_unnamed_addr #0 !kernel_arg_addr_space !2 !kernel_arg_access_qual !3 !kernel_arg_type !4 !kernel_arg_base_type !4 !kernel_arg_type_qual !5 !kernel_arg_name !6 !reqd_work_group_size !7 {
entry:
  %call = tail call i64 @_Z12get_local_idj(i32 0) #2
  %0 = trunc i64 %call to i32
  %conv = and i32 %0, 255
  store i32 %conv, i32 addrspace(1)* %a, align 4, !tbaa !8
  ret void
}

; Function Attrs: alwaysinline nounwind readnone
define linkonce_odr protected i64 @_Z12get_local_idj(i32) local_unnamed_addr #1 {
  %2 = tail call i64 @__ockl_get_local_id(i32 %0) #2
  ret i64 %2
}

attributes #1 = { alwaysinline nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+fp64-fp16-denormals,-fp32-denormals" "unsafe-fp-math"="false" "use-soft-float"="false" }

attributes #2 = { nounwind readnone }

BTW, I do not see how to use AssertZExt with an arbitrary bitwidth...

It is probably possible to extend the AMDGPULowerIntrinsics pass to generate range metadata. It runs before the inliner pass, but since it runs after opt it should be reasonable to assume these calls are already inlined.
What do you think, Matt?

> I do not see any range metadata, and I also do not think using the HW maximum is the right way to go; a kernel attribute can generally limit it more. For example:

I'm not saying the hardware maximum is the final answer, but it is a useful starting point when there is no fixed workgroup size.

> __attribute__((reqd_work_group_size(128, 1, 1)))
> kernel void zext_grp_size_256(global uint *a) {
>   a[0] = get_local_id(0) & 0xff;
> }
>
> compiled to:
>
> ; Function Attrs: nounwind
> define amdgpu_kernel void @zext_grp_size_256(i32 addrspace(1)* nocapture %a) local_unnamed_addr #0 !kernel_arg_addr_space !2 !kernel_arg_access_qual !3 !kernel_arg_type !4 !kernel_arg_base_type !4 !kernel_arg_type_qual !5 !kernel_arg_name !6 !reqd_work_group_size !7 {
> entry:
>   %call = tail call i64 @_Z12get_local_idj(i32 0) #2
>   %0 = trunc i64 %call to i32
>   %conv = and i32 %0, 255
>   store i32 %conv, i32 addrspace(1)* %a, align 4, !tbaa !8
>   ret void
> }
>
> ; Function Attrs: alwaysinline nounwind readnone
> define linkonce_odr protected i64 @_Z12get_local_idj(i32) local_unnamed_addr #1 {
>   %2 = tail call i64 @__ockl_get_local_id(i32 %0) #2
>   ret i64 %2
> }

You need to look a level below this. Ideally these would be annotated as well, but I think just the final intrinsic call has it. Range metadata can also apply to loads, so it works for the library's use for the sizes read out of the dispatch packet.
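
For instance, the workgroup size x read out of the dispatch packet (e.g. via the pointer from llvm.amdgcn.dispatch.ptr) could be annotated like this (a sketch; the packet offset and address space are illustrative):

define i16 @group_size_x(i8 addrspace(2)* %dispatch.ptr) {
  %gep = getelementptr inbounds i8, i8 addrspace(2)* %dispatch.ptr, i64 4
  %cast = bitcast i8 addrspace(2)* %gep to i16 addrspace(2)*
  %size.x = load i16, i16 addrspace(2)* %cast, align 4, !range !0
  ret i16 %size.x
}

!0 = !{i16 1, i16 257} ; a valid workgroup size is 1..256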

> BTW, I do not see how to use AssertZExt with an arbitrary bitwidth...

You can get a value type with an EVT. SelectionDAGBuilder::lowerRangeToAssertZExt does this.

> It is probably possible to extend the AMDGPULowerIntrinsics pass to generate range metadata. It runs before the inliner pass, but since it runs after opt it should be reasonable to assume these calls are already inlined.
> What do you think, Matt?

That might be a place to do it. I think we would want this done earlier, although then there are call graph problems with multiple kernels

I do not see range info anywhere... In fact, it is only generated in AMDGPUPromoteAlloca, for the calls newly inserted during alloca handling.
Anyway, I will try to patch AMDGPULowerIntrinsics now.

rampitec updated this revision to Diff 94757.Apr 10 2017, 5:10 PM
rampitec retitled this revision from [AMDGPU] zero extend workitem id to [AMDGPU] Generate range metadata for workitem id.
rampitec edited the summary of this revision. (Show Details)

Changed the approach to generate range metadata.
Added processing of reqd_work_group_size to refine the results for individual dimensions.
Created a common method in the subtarget to serve all the places where we use it.
Switched the promote alloca pass to the new method; this refines the produced ranges compared to the previous HW limit.
This also fixes a bug in the range info produced by the promote alloca pass: range metadata is [Lo, Hi), but it was incorrectly generated as [0, 2048). Note that for an ID query the range shall be one less than for a size query, yet the same range was produced for both; i.e. if the size could really be 2048, the local size range would be incorrectly assumed to be [0..2047].
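
In half-open [Lo, Hi) form the two queries need different upper bounds, e.g. for a maximum flat size of 2048 (illustrative):

!0 = !{i32 0, i32 2048} ; id query:   values 0..2047
!1 = !{i32 0, i32 2049} ; size query: values 0..2048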

rampitec updated this revision to Diff 94779.Apr 10 2017, 10:00 PM

Fixed a bug in the previous revision: promote alloca should not set a range on the XY component of the local size, since it actually loads two lanes.

rampitec updated this revision to Diff 94874.Apr 11 2017, 12:27 PM

Also produce lower bound range info when possible.

arsenm added inline comments.Apr 11 2017, 1:54 PM
lib/Target/AMDGPU/AMDGPULowerIntrinsics.cpp
106 ↗(On Diff #94874)

Lowercase first letter

lib/Target/AMDGPU/AMDGPUSubtarget.cpp
251–255 ↗(On Diff #94874)

This seems to be unconditionally adding invariant load metadata. Why? This is broken and also unrelated to the range metadata

lib/Target/AMDGPU/AMDGPUSubtarget.h
517 ↗(On Diff #94874)

Putting this in the subtarget is a weird place. Why not leave it in the pass?

test/CodeGen/AMDGPU/zext-lid.ll
1 ↗(On Diff #94874)

This should be a test running the IR pass, with more checks for the specific ranges added

rampitec marked 2 inline comments as done.Apr 11 2017, 2:01 PM
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUSubtarget.h
517 ↗(On Diff #94874)

I need to access it from the intrinsic lowering and from promote alloca, so I needed some kind of utility function.

rampitec updated this revision to Diff 94889.Apr 11 2017, 2:15 PM

Renamed function and moved invariant load meta back into promote alloca pass.

test/CodeGen/AMDGPU/zext-lid.ll
1 ↗(On Diff #94874)

How do you propose to run it? With opt?

test/CodeGen/AMDGPU/zext-lid.ll
1 ↗(On Diff #94874)

Actually I have a problem here:
opt -S -mtriple=amdgcn-- -amdgpu-lower-intrinsics
In this situation opt does not create a TargetMachine, so the pass cannot do anything.

rampitec updated this revision to Diff 94896.Apr 11 2017, 2:41 PM

Added a check to not crash if the TM is not created.

rampitec updated this revision to Diff 94898.Apr 11 2017, 3:14 PM
rampitec marked 3 inline comments as done.

Added an IR pass run to the test, plus checks for the specific ranges.
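
For example, the IR run line and one of the range checks might look roughly like this (a sketch, not the exact test contents):

; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-intrinsics < %s | FileCheck -check-prefix=OPT %s

; OPT: tail call i32 @llvm.amdgcn.workitem.id.x(), !range ![[RANGE:[0-9]+]]
; OPT: ![[RANGE]] = !{i32 0, i32 256}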

arsenm accepted this revision.Apr 12 2017, 1:24 PM

LGTM

This revision is now accepted and ready to land.Apr 12 2017, 1:24 PM
This revision was automatically updated to reflect the committed changes.