At the moment, Loop Versioning for LICM does not support the following loops, which, if versioned, improve benchmark scores on Cortex-M7 by roughly 18-40%:
void mem_copy_01(char **dst, char **src, int bytes_count) {
  while (bytes_count--) {
    *((*dst)++) = *((*src)++);
  }
}

void mem_copy_02(char **dst, char *src, int bytes_count) {
  while (bytes_count--) {
    *((*dst)++) = *src++;
  }
}

void mem_copy_03(char *dst, char **src, int bytes_count) {
  while (bytes_count--) {
    *dst++ = *((*src)++);
  }
}
IR of mem_copy_01:
define void @mem_copy_01(i8** nocapture %dst, i8** nocapture %src, i32 %bytes_count) {
entry:
  %tobool2 = icmp eq i32 %bytes_count, 0
  br i1 %tobool2, label %while.end, label %while.body

while.body:                                       ; preds = %entry, %while.body
  %bytes_count.addr.03 = phi i32 [ %dec, %while.body ], [ %bytes_count, %entry ]
  %dec = add nsw i32 %bytes_count.addr.03, -1
  %0 = load i8*, i8** %src, align 4, !tbaa !3
  %incdec.ptr = getelementptr inbounds i8, i8* %0, i32 1
  store i8* %incdec.ptr, i8** %src, align 4, !tbaa !3
  %1 = load i8, i8* %0, align 1, !tbaa !7
  %2 = load i8*, i8** %dst, align 4, !tbaa !3
  %incdec.ptr1 = getelementptr inbounds i8, i8* %2, i32 1
  store i8* %incdec.ptr1, i8** %dst, align 4, !tbaa !3
  store i8 %1, i8* %2, align 1, !tbaa !7
  %tobool = icmp eq i32 %dec, 0
  br i1 %tobool, label %while.end, label %while.body

while.end:                                        ; preds = %while.body, %entry
  ret void
}
LoopAccessAnalysis can create aliasing checks for src and dst but not for *src and *dst, because *src and *dst are loaded from memory inside the loop. In the IR above, the pointers are defined and used according to the following pattern (InvPtr is a loop-invariant pointer):
Ptr = Load(InvPtr)
NewPtr = GEP(Ptr, Const)
Store(NewPtr, InvPtr)
Mem_operations using Ptr
- If Ptr and InvPtr do not alias at iteration N, then at iteration N+1 the value of Ptr is the value defined by the GEP instruction.
- Without aliasing, Ptr takes values from [Ptr0, Ptr0 + (number_of_iterations-1) * type_size * GEP_index], where Ptr0 is Load(InvPtr) at the first iteration.
Absence of aliasing means:
4_or_8_bytes_aligned(Ptr0) != InvPtr                         : iteration 1
4_or_8_bytes_aligned(Ptr0 + type_size*GEP_index) != InvPtr   : iteration 2
4_or_8_bytes_aligned(Ptr0 + 2*type_size*GEP_index) != InvPtr : iteration 3
...
The aligned value of Ptr0 is used because InvPtr is a pointer to a pointer, so it is aligned to either 4 or 8 bytes.
We can write a stricter check:
InvPtr is not in [4_or_8_bytes_aligned(Ptr0), Ptr0 + (number_of_iterations-1) * type_size * GEP_index]
which guarantees all checks above are satisfied.
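As an illustration only, the range check and the resulting versioned loop for mem_copy_01 could be sketched at the source level as follows. This is not the code the patch generates: the names align_down_to_ptr, inv_ptr_outside_range and mem_copy_01_versioned are hypothetical, the pointer slot alignment is assumed to be sizeof(void *), and the stride is 1 because the accesses are single bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round an address down to pointer-slot alignment
 * (assumed to be sizeof(void *), i.e. 4 or 8 bytes). */
static uintptr_t align_down_to_ptr(uintptr_t addr) {
    return addr & ~((uintptr_t)sizeof(void *) - 1);
}

/* Hypothetical helper: true when inv_ptr lies outside
 * [aligned(ptr0), ptr0 + (iterations - 1) * stride],
 * where stride stands for type_size * GEP_index. */
static bool inv_ptr_outside_range(const void *inv_ptr, const char *ptr0,
                                  size_t iterations, size_t stride) {
    if (iterations == 0)
        return true; /* no accesses, nothing can alias */
    uintptr_t lo = align_down_to_ptr((uintptr_t)ptr0);
    uintptr_t hi = (uintptr_t)ptr0 + (iterations - 1) * stride;
    uintptr_t inv = (uintptr_t)inv_ptr;
    return inv < lo || inv > hi;
}

/* Source-level picture of the versioned loop: a fast path with the
 * loads/stores of *dst and *src hoisted and sunk, plus the original loop
 * as the fallback when any runtime check fails. */
void mem_copy_01_versioned(char **dst, char **src, int bytes_count) {
    if (bytes_count > 0) {
        size_t n = (size_t)bytes_count;
        char *d = *dst;
        char *s = *src;
        /* Each loaded pointer is checked against each invariant slot. */
        bool safe = dst != src
            && inv_ptr_outside_range(dst, d, n, 1)
            && inv_ptr_outside_range(src, d, n, 1)
            && inv_ptr_outside_range(dst, s, n, 1)
            && inv_ptr_outside_range(src, s, n, 1);
        if (safe) {
            for (size_t i = 0; i < n; ++i)
                *d++ = *s++;
            *dst = d; /* stores sunk out of the loop */
            *src = s;
            return;
        }
    }
    /* Fallback: the unmodified loop. */
    while (bytes_count--) {
        *((*dst)++) = *((*src)++);
    }
}
```

The fast path is taken only when all four range checks and the src/dst slot check pass; otherwise the original loop runs unchanged, which is the usual shape of a versioned loop.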
We check aliasing only between pointers loaded from invariant locations and the pointers to those locations, which is enough to decide whether operations on invariant pointers can be moved out of a loop. Because these checks exist for the purpose of LICM and do not cover all pointer combinations, creating and adding them cannot live in LoopAccessAnalysis/LoopVersioning. Instead, LoopAccessAnalysis/LoopVersioning should provide a means of processing unrecognized pointers and adding checks for them.
Summary of changes:
- Cleanup of the Loop Versioning for LICM code. See the comments on the changes below.
- LoopVersioning::versionLoop functions are changed to return the BasicBlock where RT checks are inserted. The returned basic block can be used for inserting additional checks.
- LoopAccessAnalysis can operate in a 'PartialCheckingAllowed' state, which means it creates RT checks for recognized pointers and collects unrecognized pointers. The unrecognized pointers can be processed later by a user of LAA.
- Recognition of the new optimization case is added to Loop Versioning for LICM.
- New tests are added.
- Old tests are updated.
Won't this just make canCheckPtrAtRT say that it can ignore all pointers with unknown bounds?
I think this would be incorrect for all other LAA users.