When running the HardwareLoops pass on the following code, the range of the 'step' variable cannot be determined, so the pass falls back to conservative behavior and no hardware loop is generated.
opt -passes='hardware-loops<force-hardware-loops>' test.ll -S -o test_opt.ll
define void @TestHWLoops(ptr %out, i32 %step) {
entry:
  %gt0 = icmp sgt i32 %step, 0
  %lt10 = icmp slt i32 %step, 10
  tail call void @llvm.assume(i1 %gt0)
  tail call void @llvm.assume(i1 %lt10)
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %i = phi i32 [ 0, %entry ], [ %i.next, %for.body ]
  %arrayidx = getelementptr inbounds i32, ptr %out, i32 %i
  store i32 0, ptr %arrayidx, align 4
  %i.next = add i32 %i, %step
  %cmp = icmp slt i32 %i.next, 10
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.body
  ret void
}

declare void @llvm.assume(i1)
I tried adding assume calls on 'step' to constrain its range to [1, 10), but they still had no effect.
I observed two reasons for this:
- When checking whether it is safe to transform the loop, ScalarEvolution::isKnownPositive is used, which ultimately calls ScalarEvolution::getRangeRef. For scUnknown expressions, however, assumes are not effectively used to derive range information, so I used ValueTracking's computeConstantRange here instead.
- In computeConstantRange, an Argument cannot serve as its own CtxI the way a regular Instruction can. Also, unlike regular instructions, Arguments are not at risk of being deleted.
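To illustrate why the two assumes should be enough for the positivity check, here is a toy model of range narrowing. It is only a sketch: `SignedRange`, `range_from_assume`, and `is_known_positive` are hypothetical names for illustration and are not LLVM's ConstantRange, ScalarEvolution, or the code in this patch.

```python
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


@dataclass(frozen=True)
class SignedRange:
    """Half-open signed interval [lo, hi). Toy stand-in, not LLVM's ConstantRange."""
    lo: int
    hi: int

    def intersect(self, other: "SignedRange") -> "SignedRange":
        return SignedRange(max(self.lo, other.lo), min(self.hi, other.hi))

    def is_known_positive(self) -> bool:
        # Non-empty and every value is strictly greater than zero.
        return self.lo < self.hi and self.lo > 0


# With no facts available, an i32 value can be anything.
FULL = SignedRange(INT32_MIN, INT32_MAX + 1)


def range_from_assume(pred: str, rhs: int) -> SignedRange:
    """Range implied for x by assume(icmp pred x, rhs). Only sgt and slt are modeled."""
    if pred == "sgt":
        return SignedRange(rhs + 1, INT32_MAX + 1)
    if pred == "slt":
        return SignedRange(INT32_MIN, rhs)
    raise ValueError(f"unmodeled predicate: {pred}")


# The two assumes from the test case: step > 0 and step < 10.
step_range = FULL
for pred, rhs in [("sgt", 0), ("slt", 10)]:
    step_range = step_range.intersect(range_from_assume(pred, rhs))
```

In this model, ignoring the assumes leaves 'step' at the full range, where positivity cannot be proven; intersecting both facts narrows it to [1, 10), where it can.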
With these changes, the hardware loop is generated correctly:
define void @TestHWLoops(ptr %out, i32 %step) {
entry:
  %gt0 = icmp sgt i32 %step, 0
  %lt10 = icmp slt i32 %step, 10
  tail call void @llvm.assume(i1 %gt0)
  tail call void @llvm.assume(i1 %lt10)
  %0 = udiv i32 9, %step
  %1 = add nuw nsw i32 %0, 1
  call void @llvm.set.loop.iterations.i32(i32 %1)
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %i = phi i32 [ 0, %entry ], [ %i.next, %for.body ]
  %arrayidx = getelementptr inbounds i32, ptr %out, i32 %i
  store i32 0, ptr %arrayidx, align 4
  %i.next = add i32 %i, %step
  %2 = call i1 @llvm.loop.decrement.i32(i32 1)
  br i1 %2, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.body
  ret void
}

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(inaccessiblemem: readwrite)
declare void @llvm.assume(i1 noundef) #0

; Function Attrs: nocallback noduplicate nofree nosync nounwind willreturn
declare void @llvm.set.loop.iterations.i32(i32) #1

; Function Attrs: nocallback noduplicate nofree nosync nounwind willreturn
declare i1 @llvm.loop.decrement.i32(i32) #1

attributes #0 = { nocallback nofree nosync nounwind willreturn memory(inaccessiblemem: readwrite) }
attributes #1 = { nocallback noduplicate nofree nosync nounwind willreturn }
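As a sanity check on the preheader arithmetic (`udiv i32 9, %step` followed by `add ..., 1`), the sketch below simulates the loop for every 'step' in the assumed range [1, 10) and compares the iteration count with the closed form. The function names are hypothetical, and plain Python integers stand in for i32, which is safe here because nothing overflows for these values.

```python
def simulated_trip_count(step: int, bound: int = 10) -> int:
    """Count body executions of the bottom-tested loop in the IR above."""
    i = 0
    trips = 0
    while True:
        trips += 1         # loop body: the store executes
        i += step          # %i.next = add i32 %i, %step
        if not i < bound:  # %cmp = icmp slt i32 %i.next, 10
            return trips


def closed_form_trip_count(step: int, bound: int = 10) -> int:
    """Mirror of the generated preheader: (bound - 1) udiv step, plus 1."""
    return (bound - 1) // step + 1


# Compare both counts for every step permitted by the assumes.
results = {
    step: (simulated_trip_count(step), closed_form_trip_count(step))
    for step in range(1, 10)
}
assert all(sim == closed for sim, closed in results.values())
```

The check only covers step >= 1. With step == 0 the original loop never terminates and the division is undefined, which is why the transformation needs 'step' to be known positive.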
An empty block isn't well-formed IR. Does it really need to handle that?