Preserve vec3 type with CodeGen option.
Repository: rL LLVM
Event Timeline
Could you please add your test here (probably goes to test/CodeGenOpenCL)?
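For reference, a minimal sketch of what such a CodeGenOpenCL test might look like; the flag spelling -fpreserve-vec3 and the exact CHECK lines are assumptions based on the discussion below, not the final patch:

// RUN: %clang_cc1 %s -triple spir-unknown-unknown -fpreserve-vec3 -emit-llvm -o - | FileCheck %s
// Hypothetical test: checks that a float3 copy keeps the <3 x float> type
// instead of being widened to <4 x float>.

// CHECK-LABEL: @foo
// CHECK: load <3 x float>
// CHECK-NOT: <4 x float>
// CHECK: store <3 x float>
kernel void foo(global float3 *a, global float3 *b) {
  *a = *b;
}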
include/clang/Driver/CC1Options.td:661 (On Diff #91307): Can we drop "operations"?
Could you elaborate on the motivation for this change?
I was wondering why clang (CodeGen) needed the help of a command line option to decide whether vec3 should be converted to vec4. Can it just preserve vec3 when the architecture is spir?
Ah, sorry... I forgot to link the cfe-dev discussion: http://lists.llvm.org/pipermail/cfe-dev/2017-March/052956.html I hope it is helpful for you.
Would you be able to update ScalarExprEmitter::VisitAsTypeExpr implementation accordingly to make things consistent?
Probably not... the vec3 load/store generates a vec4 temporarily to keep the alignment, but __builtin_astype needs to convert a vec4 to a vec3 or vice versa, because the spec allows it in section 6.2.4.2. As we discussed in previous e-mails, we would need to add an intrinsic for __builtin_astype in LLVM and add default handling for it in LLVM's codegen. That is beyond clang, and I am not sure the LLVM IR and codegen people would accept it... If you have another idea or I missed something, please let me know.
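For context, a small illustrative OpenCL C snippet of the s6.2.4.2 case being discussed; as_type between 3- and 4-component vectors is allowed because both occupy 16 bytes (the kernel name is made up for illustration):

kernel void astype_example(global float3 *out3, global float4 *out4) {
  float3 f3 = (float3)(1.0f, 2.0f, 3.0f);
  float4 f4 = as_float4(f3);   // allowed: both types occupy 16 bytes; the extra element carries no meaningful value
  float3 g3 = as_float3(f4);   // allowed for the same reason
  *out4 = f4;
  *out3 = g3;
}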
Hi JinGu,
I just read the discussion on cfe-dev, some comments:
My colleague and I are implementing a transformation pass between LLVM IR and another IR and we want to keep the 3-component vector types in our target IR
Why can't you go back through the shuffle to find out that it comes from a vec3 -> vec4 transformation? Adding a flag here just for that seems very odd.
Hi Bruno,
Thanks for your comment. We have a pass that undoes the vec4 back to vec3. I wondered why clang generates vec4 for vec3 load/store in the first place. As the comment in clang's code says, it is done for better performance. I had a question at this point: should clang assume vector load/store are aligned to 4 elements regardless of target? LLVM's codegen could handle vec3 according to each target's vector load/store support and alignment. I agree the flag looks odd, but I was concerned that some LLVM targets depend on the vec4 form, so I suggested adding the flag. If I missed something, please let me know.
I think doing this transformation in a frontend was not the best choice because we are losing the source information too early. But some targets can offer a better support for vec3 natively than converting to vec4. Regarding the option, I am wondering whether adding this as a part of TargetInfo would be a better solution. Although, I don't think any currently available upstream targets support this.
I have a fundamental question about vec3 load/store again. I checked how LLVM's codegen handles vec3 and vec4 load/store, assuming the target's legalization action for unsupported vector types is to widen them. If they have the same alignment, LLVM's codegen generates a single vector load for both of them. But the DAGTypeLegalizer handles stores differently for vec3 and vec4, because a vec4 store writes past the end of the vec3 type. For example,
source code
typedef float float3 __attribute__((ext_vector_type(3)));

void foo(float3 *a, float3 *b) {
  *a = *b;
}
LLVM IR for vec4 (this is clang's original output)
define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
entry:
  %castToVec4 = bitcast <3 x float>* %b to <4 x float>*
  %loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
  %storetmp = bitcast <3 x float>* %a to <4 x float>*
  store <4 x float> %loadVec4, <4 x float>* %storetmp, align 16
  ret void
}
LLVM IR for vec3
define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
entry:
  %0 = load <3 x float>, <3 x float>* %b, align 16
  store <3 x float> %0, <3 x float>* %a, align 16
  ret void
}
As you can see in the vec4 IR above, "store <4 x float> %loadVec4, <4 x float>* %storetmp, align 16" writes beyond the float3. Is that behavior correct? I cannot find a spec or document that mentions it... The vec3 store is lowered to more instructions, such as stores and extractelements, but its behavior is correct. If the vec3 --> vec4 store is not correct, we could make vec3 the default in TargetInfo. Conversely, if the vec3 --> vec4 store is correct, we could modify LLVM codegen's widened vector store for vec3 to generate a single vector store like vec4. If I missed something, please let me know.
It looks like converting vec3 to vec4 is incorrect in some cases. In the following program, IRGen emits "store <4 x float>" to store g1 to *a, which will overwrite s1.f2.
typedef __attribute__((__ext_vector_type__(3))) float float3;

struct S1 {
  float3 f1;
  float f2;
};

float3 g1;

void __attribute__((noinline)) foo2(float3 *a) {
  *a = g1;
}

void foo1() {
  struct S1 s1;
  foo2(&s1.f1);
}
Actually, it's not a mis-compile. The record layout shows that there is padding before field f2, and f2 starts at byte 16. So using "store <4 x float>" doesn't overwrite the field.
It depends on float3's alignment. I guess float3's alignment is 16 on your target, so there are padding bytes before f2 to keep it 16-byte aligned. If you add __attribute__((packed)) to the struct type, you could see the overwrite.
Thanks for your comment. We have a pass that undoes the vec4 back to vec3. I wondered why clang generates vec4 for vec3 load/store in the first place. As the comment in clang's code says, it is done for better performance. I had a question at this point: should clang assume vector load/store are aligned to 4 elements regardless of target?
I believe the assumption is more practical: most upstream LLVM targets only support vectors with an even number of lanes. In those cases you would have to expand to a 4x vector and leave the 4th element as undef anyway, so it was done in the front-end to get rid of it right away. Probably GPU targets do some special tricks here during legalization.
LLVM's codegen could handle vec3 according to each target's vector load/store support and alignment. I agree the flag looks odd, but I was concerned that some LLVM targets depend on the vec4 form, so I suggested adding the flag. If I missed something, please let me know.
My guess here is that targets that do care are looking through the vector shuffle and customizing to whatever seems best to them. If you want to change the default behavior, you need some data to show that your model solves a real issue and actually brings benefits; do you have any real examples of where that happens, or of why GPU targets haven't yet tried to change this? Maybe other custom front-ends based on clang do? Finding the historical reason (if any) would be a good starting point.
I believe the assumption is more practical: most upstream LLVM targets only support vectors with an even number of lanes. In those cases you would have to expand to a 4x vector and leave the 4th element as undef anyway, so it was done in the front-end to get rid of it right away. Probably GPU targets do some special tricks here during legalization.
I compiled the code below, which current clang generates for vec3, using llc with the amdgcn target.
LLVM IR (vec3 --> vec4)
define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
entry:
  %castToVec4 = bitcast <3 x float>* %b to <4 x float>*
  %loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
  %storetmp = bitcast <3 x float>* %a to <4 x float>*
  store <4 x float> %loadVec4, <4 x float>* %storetmp, align 16
  ret void
}
SelectionDAG after legalization.
Legalized selection DAG: BB#0 'foo:entry'
SelectionDAG has 43 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
  t9: i64 = add t2, Constant:i64<40>
  t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
  t4: i64 = add t2, Constant:i64<36>
  t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
  t12: ch = TokenFactor t6:1, t10:1
  t32: v4i32 = BUILD_VECTOR t22, t25, t27, t29
  t21: v4f32 = bitcast t32
  t18: v4i32 = bitcast t21
  t22: i32,ch = load<LD4[%castToVec4](align=16)> t12, t10, undef:i32
  t24: i32 = add t10, Constant:i32<4>
  t25: i32,ch = load<LD4[%castToVec4+4]> t12, t24, undef:i32
  t26: i32 = add t24, Constant:i32<4>
  t27: i32,ch = load<LD4[%castToVec4+8](align=8)> t12, t26, undef:i32
  t28: i32 = add t26, Constant:i32<4>
  t29: i32,ch = load<LD4[%castToVec4+12]> t12, t28, undef:i32
  t31: ch = TokenFactor t22:1, t25:1, t27:1, t29:1
  t35: i32 = extract_vector_elt t18, Constant:i32<0>
  t36: ch = store<ST4[%storetmp](align=16)> t31, t35, t6, undef:i32
  t38: i32 = extract_vector_elt t18, Constant:i32<1>
  t39: i32 = add t6, Constant:i32<4>
  t40: ch = store<ST4[%storetmp+4]> t31, t38, t39, undef:i32
  t42: i32 = extract_vector_elt t18, Constant:i32<2>
  t44: i32 = add t6, Constant:i32<8>
  t45: ch = store<ST4[%storetmp+8](align=8)> t31, t42, t44, undef:i32
  t47: i32 = extract_vector_elt t18, Constant:i32<3>
  t49: i32 = add t6, Constant:i32<12>
  t50: ch = store<ST4[%storetmp+12]> t31, t47, t49, undef:i32
  t51: ch = TokenFactor t36, t40, t45, t50
  t17: ch = ENDPGM t51
As you can see, the SelectionDAG still has 4 loads and 4 stores.
Assembly output
        .section        .AMDGPU.config
        .long   47176
        .long   11272257
        .long   47180
        .long   132
        .long   47200
        .long   0
        .long   4
        .long   0
        .long   8
        .long   0
        .text
        .globl  foo
        .p2align        8
        .type   foo,@function
foo:                            ; @foo
; BB#0:                         ; %entry
        s_load_dword s2, s[0:1], 0x9
        s_load_dword s0, s[0:1], 0xa
        s_mov_b32 s4, SCRATCH_RSRC_DWORD0
        s_mov_b32 s5, SCRATCH_RSRC_DWORD1
        s_mov_b32 s6, -1
        s_mov_b32 s8, s3
        s_mov_b32 s7, 0xe8f000
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v0, s0
        buffer_load_dword v2, v0, s[4:7], s8 offen
        buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
        buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
        buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
        v_mov_b32_e32 v1, s2
        s_waitcnt vmcnt(0)
        buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
        buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
        buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
        buffer_store_dword v2, v1, s[4:7], s8 offen
        s_endpgm
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .section        .AMDGPU.csdata
As you can see, the assembly also has 4 loads and 4 stores. I do not think the GPU target handles this case specially.
My guess here is that targets that do care are looking through the vector shuffle and customizing to whatever seems best to them. If you want to change the default behavior, you need some data to show that your model solves a real issue and actually brings benefits; do you have any real examples of where that happens, or of why GPU targets haven't yet tried to change this? Maybe other custom front-ends based on clang do? Finding the historical reason (if any) would be a good starting point.
I did "git blame" and I read the commit's message. You can also see it with "c58dcdc8facb646d88675bb6fbcb5c787166c4be". It is same with clang code's comment. I also wonder how the vec3 --> vec4 load/store has not caused problems. As Akira's example, if struct type has float3 type as one of fields and it has 'packed' attribute, it overwrites next field. The vec3 load/store generates more instructions like stores and extract_vectors like below SelectionDAG.
LLVM IR for vec3
define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
entry:
  %0 = load <3 x float>, <3 x float>* %b, align 16
  store <3 x float> %0, <3 x float>* %a, align 16
  ret void
}
SelectionDAG after type legalization for amdgcn (other targets have a similar SelectionDAG after type legalization because of the widened vector store in the type legalizer)
Type-legalized selection DAG: BB#0 'foo:entry'
SelectionDAG has 24 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
  t4: i64 = add t2, Constant:i64<36>
  t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
  t9: i64 = add t2, Constant:i64<40>
  t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
  t12: ch = TokenFactor t6:1, t10:1
  t21: v4i32,ch = load<LD16[%b]> t12, t10, undef:i32
  t22: v2i64 = bitcast t21
  t24: i64 = extract_vector_elt t22, Constant:i32<0>
  t25: ch = store<ST8[%a](align=16)> t21:1, t24, t6, undef:i32
  t29: i32 = extract_vector_elt t21, Constant:i32<2>
  t27: i32 = add t6, Constant:i32<8>
  t30: ch = store<ST4[%a+8](align=8)> t21:1, t29, t27, undef:i32
  t33: ch = TokenFactor t25, t30
  t17: ch = ENDPGM t33
As you can see, the type legalizer handles the vec3 load/store properly: it does not write the 4th element. The vec3 load/store generates more instructions, but its behavior is correct. I am not 100% sure whether the vec3 --> vec4 load/store is correct, because no one has complained about it. But if the vec3 --> vec4 load/store is correct, LLVM's type legalizer (or somewhere else in LLVM's codegen) could follow the same approach to generate optimal code. As a result, I think it would be good for clang to have both behaviors, and I would like to stick with the option "-fpreserve-vec3" to switch between them easily.
I don't think there is anything wrong with the generation of vec3->vec4 in Clang. I believe the motivation for this was the OpenCL spec treating vec3 as a vec4-aligned type (see section 6.1.5: https://www.khronos.org/registry/OpenCL/specs/opencl-2.0-openclc.pdf#12). So in terms of memory layout, vec3 is no different from vec4. But in terms of operations (including loads/stores) there can be a potential gain from not handling the 4th element, and some targets can exploit that. I think generating vec3 from the frontend would have been a better choice in the first place, because the backend can then decide how to handle it; for architectures with no SIMD support it would just generate 3 separate loads/stores. Right now it seems that it will be forced to generate 4 loads/stores.
But considering that the transformation to vec4 has been the default implementation in the frontend for quite a while, I think we would need a stronger motivation for switching the default to the original vec3. So the current approach, with a special flag for preserving vec3, should be good enough to fit all needs.
As you can see, the type legalizer handles the vec3 load/store properly: it does not write the 4th element. The vec3 load/store generates more instructions, but its behavior is correct. I am not 100% sure whether the vec3 --> vec4 load/store is correct, because no one has complained about it. But if the vec3 --> vec4 load/store is correct, LLVM's type legalizer (or somewhere else in LLVM's codegen) could follow the same approach to generate optimal code.
Thanks for the nice investigation/explanation on amdgcn.
As a result, I think it would be good for clang to have both behaviors, and I would like to stick with the option "-fpreserve-vec3" to switch between them easily.
The motivation doesn't seem solid to me; who else is going to benefit from this flag? You also didn't explain why doing this transformation yourself (looking through the shuffle) in your downstream pass isn't enough. We generally try to avoid adding flags without a good reason.
The motivation doesn't seem solid to me; who else is going to benefit from this flag? You also didn't explain why doing this transformation yourself (looking through the shuffle) in your downstream pass isn't enough. We generally try to avoid adding flags without a good reason.
If you turn on optimization with the -O option, the shufflevector is removed by the 'InstructionCombining' pass; I showed the IR example and source code in a previous comment. You might suggest looking through the bitcast instead of the shufflevector, but the bitcast can become a constant expression with global variables, and that makes the code dirty. I think Anastasia has already explained the motivation well in her comments. If you need anything more, please let me know.
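For illustration, a case like Akira's earlier example produces that kind of constant expression (a sketch; the function name is made up and the exact IR depends on the optimization level):

typedef float float3 __attribute__((ext_vector_type(3)));

float3 g1;

void set_g1(float3 v) {
  // With the vec4 widening, the store's pointer operand becomes the constant
  // expression bitcast (<3 x float>* @g1 to <4 x float>*) folded into the
  // store, rather than a separate bitcast instruction that a later pass could
  // easily look through.
  g1 = v;
}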
I looked at ItaniumRecordLayoutBuilder to understand how __attribute__((packed)) affects the struct's layout.
If the number of elements of a vector is not a power of 2, ASTContext::getTypeInfoImpl rounds up the width and alignment to the next power of 2. So the size of float3 is 16 bytes. __attribute__((packed)) can change the alignment of float3, but it doesn't change its size, so there are 4 bytes of padding between fields f1 and f2.
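A small sketch that illustrates this point (assuming clang's ext_vector_type extension; the expected values in the comments follow from the layout rules just described):

#include <stddef.h>
#include <stdio.h>

typedef __attribute__((__ext_vector_type__(3))) float float3;

struct S1 {
  float3 f1;
  float f2;
};

struct S1Packed {
  float3 f1;
  float f2;
} __attribute__((packed));

int main(void) {
  // float3 occupies 16 bytes: its width is rounded up to the next power of 2.
  printf("sizeof(float3)          = %zu\n", sizeof(float3));                  // 16
  // f2 starts after f1's full 16 bytes, so there is padding before f2.
  printf("offsetof(S1, f2)        = %zu\n", offsetof(struct S1, f2));         // 16
  // packed lowers the alignment but not the size, so f2 still starts at 16.
  printf("offsetof(S1Packed, f2)  = %zu\n", offsetof(struct S1Packed, f2));   // 16
  return 0;
}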
There are some out-of-tree implementations that would benefit. But in the case of the AMD GPU, 3 loads/stores will be produced instead of 4, which sounds like a good optimization to me. As I said in my previous comment, I think it should have been the default behavior from the beginning, but since a different implementation landed first, we can integrate this one now behind an additional option.
Oops, I am sorry, I was wrong. I did not realize the size is rounded up in the same way as the alignment. Thank you for pointing it out! :) Additionally, I wonder why a non-power-of-2-length vector's size is rounded up like its alignment... Do you know the reason?
Additionally, here is the assembly output for vec3 with the amdgcn target. :)
LLVM IR
define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
entry:
  %0 = load <3 x float>, <3 x float>* %b, align 16
  store <3 x float> %0, <3 x float>* %a, align 16
  ret void
}
Assembly Output
        .text
        .section        .AMDGPU.config
        .long   47176
        .long   11272256
        .long   47180
        .long   132
        .long   47200
        .long   0
        .long   4
        .long   0
        .long   8
        .long   0
        .text
        .globl  foo
        .p2align        8
        .type   foo,@function
foo:                            ; @foo
; BB#0:                         ; %entry
        s_load_dword s2, s[0:1], 0x9
        s_load_dword s0, s[0:1], 0xa
        s_mov_b32 s4, SCRATCH_RSRC_DWORD0
        s_mov_b32 s5, SCRATCH_RSRC_DWORD1
        s_mov_b32 s6, -1
        s_mov_b32 s8, s3
        s_mov_b32 s7, 0xe8f000
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v0, s0
        buffer_load_dword v2, v0, s[4:7], s8 offen
        buffer_load_dword v3, v0, s[4:7], s8 offen offset:8
        buffer_load_dword v0, v0, s[4:7], s8 offen offset:4
        v_mov_b32_e32 v1, s2
        s_waitcnt vmcnt(0)
        buffer_store_dword v0, v1, s[4:7], s8 offen offset:4
        buffer_store_dword v2, v1, s[4:7], s8 offen
        buffer_store_dword v3, v1, s[4:7], s8 offen offset:8
        s_endpgm
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .section        .AMDGPU.csdata
; Kernel info:
; codeLenInByte = 112
; NumSgprs: 9
; NumVgprs: 4
; FloatMode: 192
; IeeeMode: 1
; ScratchSize: 0
; LDSByteSize: 0 bytes/workgroup (compile time only)
; SGPRBlocks: 1
; VGPRBlocks: 0
; NumSGPRsForWavesPerEU: 9
; NumVGPRsForWavesPerEU: 4
; ReservedVGPRFirst: 0
; ReservedVGPRCount: 0
; COMPUTE_PGM_RSRC2:USER_SGPR: 2
; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0
; COMPUTE_PGM_RSRC2:TGID_X_EN: 1
; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0
; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0
; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0
        .section        ".note.GNU-stack"
Hi Anastasia, Bruno,
Do you still have any other opinions, or can we go ahead and commit this patch?
Hi Anastasia, Bruno,
Do you still have any other opinions, or can we go ahead and commit this patch?
I would like to see this patch committed. I see clear evidence of it improving existing GPU targets in the master repo as well as out-of-tree implementations. Bruno, do you have any more concerns?
I guess that, due to the 4-element alignment, the casts generated as vec4 should be fine for vec3, but it is worth checking...
Let's look at examples.
Source code
void kernel float4_to_float3(global float3 *a, global float4 *b) {
  // float4 --> float3
  *a = __builtin_astype(*b, float3);
}

void kernel float3_to_float4(global float3 *a, global float4 *b) {
  // float3 --> float4
  *b = __builtin_astype(*a, float4);
}
LLVM IR
; Function Attrs: norecurse nounwind
define void @float4_to_float3(<3 x float>* nocapture %a, <4 x float>* nocapture readonly %b) {
entry:
  %0 = load <4 x float>, <4 x float>* %b, align 16, !tbaa !6
  %storetmp = bitcast <3 x float>* %a to <4 x float>*
  store <4 x float> %0, <4 x float>* %storetmp, align 16, !tbaa !6
  ret void
}

; Function Attrs: norecurse nounwind
define void @float3_to_float4(<3 x float>* nocapture readonly %a, <4 x float>* nocapture %b) {
entry:
  %castToVec4 = bitcast <3 x float>* %a to <4 x float>*
  %loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
  store <4 x float> %loadVec4, <4 x float>* %b, align 16, !tbaa !6
  ret void
}
We could change the above IR to use vec3 as follows:
; Function Attrs: norecurse nounwind
define void @float4_to_float3(<3 x float>* nocapture %a, <4 x float>* nocapture readonly %b) {
entry:
  %0 = load <4 x float>, <4 x float>* %b, align 16
  %1 = shufflevector <4 x float> %0, <4 x float> undef, <3 x i32> <i32 0, i32 1, i32 2>
  store <3 x float> %1, <3 x float>* %a, align 16
  ret void
}

; Function Attrs: norecurse nounwind
define void @float3_to_float4(<3 x float>* nocapture readonly %a, <4 x float>* nocapture %b) {
entry:
  %0 = load <3 x float>, <3 x float>* %a, align 16
  %1 = shufflevector <3 x float> %0, <3 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
  store <4 x float> %1, <4 x float>* %b, align 16
  ret void
}
I guess this is what you want at the LLVM IR level. I compiled the above IRs with the amdgcn target.
vec4 float4_to_float3 assembly code
float4_to_float3:               ; @float4_to_float3
; BB#0:                         ; %entry
        s_load_dword s2, s[0:1], 0x9
        s_load_dword s0, s[0:1], 0xa
        s_mov_b32 s4, SCRATCH_RSRC_DWORD0
        s_mov_b32 s5, SCRATCH_RSRC_DWORD1
        s_mov_b32 s6, -1
        s_mov_b32 s8, s3
        s_mov_b32 s7, 0xe8f000
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v0, s0
        buffer_load_dword v2, v0, s[4:7], s8 offen
        buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
        buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
        buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
        v_mov_b32_e32 v1, s2
        s_waitcnt vmcnt(0)
        buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
        buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
        buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
        buffer_store_dword v2, v1, s[4:7], s8 offen
        s_endpgm
vec3 float4_to_float3 assembly code
float4_to_float3:               ; @float4_to_float3
; BB#0:                         ; %entry
        s_load_dword s2, s[0:1], 0x9
        s_load_dword s0, s[0:1], 0xa
        s_mov_b32 s4, SCRATCH_RSRC_DWORD0
        s_mov_b32 s5, SCRATCH_RSRC_DWORD1
        s_mov_b32 s6, -1
        s_mov_b32 s8, s3
        s_mov_b32 s7, 0xe8f000
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v0, s0
        buffer_load_dword v2, v0, s[4:7], s8 offen
        buffer_load_dword v3, v0, s[4:7], s8 offen offset:8
        buffer_load_dword v0, v0, s[4:7], s8 offen offset:4
        v_mov_b32_e32 v1, s2
        s_waitcnt vmcnt(0)
        buffer_store_dword v0, v1, s[4:7], s8 offen offset:4
        buffer_store_dword v2, v1, s[4:7], s8 offen
        buffer_store_dword v3, v1, s[4:7], s8 offen offset:8
        s_endpgm
I think this is what we expect, because there are 3 loads/stores for the vec3 version instead of 4.
float3_to_float4's output is different from float4_to_float3's.
vec4 float3_to_float4 assembly code
float3_to_float4:               ; @float3_to_float4
; BB#0:                         ; %entry
        s_load_dword s2, s[0:1], 0x9
        s_mov_b32 s4, SCRATCH_RSRC_DWORD0
        s_mov_b32 s5, SCRATCH_RSRC_DWORD1
        s_mov_b32 s6, -1
        s_mov_b32 s8, s3
        s_mov_b32 s7, 0xe8f000
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v0, s2
        buffer_load_dword v2, v0, s[4:7], s8 offen
        buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
        buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
        buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
        s_load_dword s0, s[0:1], 0xa
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v1, s0
        s_waitcnt vmcnt(0)
        buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
        buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
        buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
        buffer_store_dword v2, v1, s[4:7], s8 offen
        s_endpgm
vec3 float3_to_float4 assembly code
float3_to_float4:               ; @float3_to_float4
; BB#0:                         ; %entry
        s_load_dword s2, s[0:1], 0x9
        s_mov_b32 s4, SCRATCH_RSRC_DWORD0
        s_mov_b32 s5, SCRATCH_RSRC_DWORD1
        s_mov_b32 s6, -1
        s_mov_b32 s8, s3
        s_mov_b32 s7, 0xe8f000
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v0, s2
        buffer_load_dword v2, v0, s[4:7], s8 offen
        buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
        buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
        buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
        s_load_dword s0, s[0:1], 0xa
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v1, s0
        s_waitcnt vmcnt(0)
        buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
        buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
        buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
        buffer_store_dword v2, v1, s[4:7], s8 offen
        s_endpgm
The output is the same for float3_to_float4 between vec3 and vec4, because codegen legalizes the type by widening the vector according to the alignment. In the float4_to_float3 case, codegen legalizes the store's vec3 type differently, and in the end it generates 3 stores. As a result, I think we could generate the above vec3 IR (or something similar) under the option, and it could give better output for GPUs. In previous comments I was wrong about "ScalarExprEmitter::VisitAsTypeExpr"; I am sorry for that. If you agree with the above vec3 IRs, I will add them to this patch. If I missed something, or you have another idea or opinion, please let me know.
Yes. This would make sense. I am guessing that in vec3->vec4, we will have 3 loads and 4 stores and in vec4->vec3 we will have 4 loads and 3 stores?
Although, I am guessing the load/store optimizations would come from your current change? I don't see anything related to it in the VisitAsTypeExpr implementation. But I think it might be a good idea to add this to the test to make sure the IR output is as we expect.
Yes. This would make sense. I am guessing that in vec3->vec4, we will have 3 loads and 4 stores and in vec4->vec3 we will have 4 loads and 3 stores?
It depends on the implementation. If you scalarize all vector operations at the LLVM IR level before entering LLVM's codegen, vec3->vec4 would generate 3 loads and 4 stores, and vec4->vec3 would generate 4 loads and 3 stores; I guess your implementation follows this approach. The AMDGPU target keeps the vector form at the LLVM IR level and handles it during SelectionDAG legalization in LLVM's codegen. In that situation, vec3->vec4 generates 4 loads and 4 stores, because the type legalizer widens the vec3 load to a vec4 load since the alignment is 16, while vec4->vec3 generates 4 loads and 3 stores after type legalization. The output can differ depending on the LLVM backend implementation.
Although, I am guessing the load/store optimizations would come from your current change? I don't see anything related to it in the VisitAsTypeExpr implementation. But I think it might be a good idea to add this to the test to make sure the IR output is as we expect.
We should generate those IRs under the option. I will update the patch accordingly.
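A hedged sketch of what that additional test might check, mirroring the vec3 IR proposed above; the flag spelling and CHECK patterns are assumptions to be adjusted to the final patch:

// RUN: %clang_cc1 %s -triple spir-unknown-unknown -fpreserve-vec3 -emit-llvm -o - | FileCheck %s

// CHECK-LABEL: @float4_to_float3
// CHECK: load <4 x float>
// CHECK: shufflevector <4 x float> %{{.*}}, <4 x float> undef, <3 x i32> <i32 0, i32 1, i32 2>
// CHECK: store <3 x float>
void kernel float4_to_float3(global float3 *a, global float4 *b) {
  *a = __builtin_astype(*b, float3);
}

// CHECK-LABEL: @float3_to_float4
// CHECK: load <3 x float>
// CHECK: shufflevector <3 x float> %{{.*}}, <3 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
// CHECK: store <4 x float>
void kernel float3_to_float4(global float3 *a, global float4 *b) {
  *b = __builtin_astype(*a, float4);
}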
I would like to see this patch committed. I see clear evidence of it improving existing GPU targets in the master repo as well as out-of-tree implementations. Bruno, do you have any more concerns?
I believe the argument lacks numbers (or at least you have them but didn't mention them). I didn't hear about performance results, or validation that this was actually tested for correctness. Small test cases prove a point, but can't be considered general.
OTOH, it seems like this is exactly why you want the flag: to hide any potential issues and play with it. I'm not opposed to adding the flag if there's a commitment to actually collect the results and change the default to whatever turns out to be better; does that seem reasonable?
I believe the argument lacks numbers (or at least you have them but didn't mention them). I didn't hear about performance results, or validation that this was actually tested for correctness. Small test cases prove a point, but can't be considered general.
OTOH, it seems like this is exactly why you want the flag: to hide any potential issues and play with it. I'm not opposed to adding the flag if there's a commitment to actually collect the results and change the default to whatever turns out to be better; does that seem reasonable?
To be honest, I did not start this discussion to get better performance for GPUs, but the goal has shifted while we were discussing it. I think I have shown the gain for the AMD GPU target in previous comments: with the current version of clang/llvm, preserving the vec3 type instead of vec4 generates one less load/store on that target. But it could cause more loads/stores on other targets that have vector registers, because they usually expect aligned vector load/store. That means that if we want to change clang's default to vec3, we would need to change LLVM codegen's default legalization, or backend targets would need to be modified, so that vec3 still gets aligned vector loads/stores. I think that is too big a change at this moment. In the future we could move this discussion to llvm-dev. Until then, as you mentioned, I would like to hide the potential issues with the vec3 type behind the flag and experiment with it.
Just to summarize: there seems to be no issue with adding this feature behind a special flag, considering that changing the default behavior requires backend changes that we might consider at some later point. We can start a separate discussion on that.
The change can be used to generate more generic and optimal IR, resulting in 1/4 fewer instructions in the final binary for each load and store of a vec3 type on some architectures (e.g. AMD). I am not sure what further numbers are needed, or whether the LLVM project provides any measuring infrastructure for that.
Considering all the above I don't feel there is any reason this can't be committed.