This is an archive of the discontinued LLVM Phabricator instance.

[LoopIdiom] Replace cttz loop by call to cttz intrinsic.
Needs ReviewPublic

Authored by eiffel on Jan 4 2021, 9:17 AM.

Details

Summary

Hi.

First, I hope you are fine and the same for your relatives.

I wrote a patch which solves the issue 17128.
The goal of this patch is to replace such snippet:

int cttz(unsigned long x){
	unsigned long i = 0;
	while(i < 64 && (((x >> i) & 0x1) == 0))
		i++;
	return i;
}

by calls to llvm cttz intrinsic which can then be translated to the corresponding assembly instruction, if the architecture has one.
In my case, the intrinsic was replaced by bfsq instruction.

To confirm my results, I wrote cttz.ll test to confirm the patch works and I ran the the check-llvm-unit and check-llvm targets.
The second gave me one failure for Bindings/Go/go.test:

# llvm.org/llvm/bindings/go/llvm.test
/usr/lib/go-1.11/pkg/tool/linux_amd64/link: running /usr/bin/c++ failed: exit status 1
ld.lld: error: unknown --compress-debug-sections value: zlib-gnu
collect2: error: ld returned 1 exit status

I do not think this problem is related to my patch but rather to my configuration.

I also quickly benchmarked my modifications.
First, I measured the compilation time of this program by compiling it 100 times using this command clang -O3 -S -emit-llvm and time as measuring tool:

#include <stdlib.h>


int cttz(unsigned long x){
	unsigned long i = 0;
	while(i < 64 && (((x >> i) & 0x1) == 0))
		i++;
	return i;
}

int main(void){
	int bits_field;
	int first_set;

	bits_field = rand();

	first_set = cttz(bits_field);

	return first_set;
}

The results are the following (in second):

100 compilationsmean for 1 compilation
without patch20.54.21
with patch16.19.17

So, the patch reduces compilation time of approximately 23%.
However, I am not really sure of this result as I first though that the modification would make the compilation slower.
Maybe it is quicker due to the loop removing and then not having to optimize it.

Then, I measure the performance of the generated code by running the above code 10000 times using time as measuring tool, the results are as follows (in millisecond):

10000 runsmean for 1 run
without patch8730.873
with patch8220.822

So, the patch reduces code execution time by around 6%.

If you see any way to improve the patch or mistake I made, feel free to share.

Best regards.

Diff Detail

Event Timeline

eiffel created this revision.Jan 4 2021, 9:17 AM
eiffel requested review of this revision.Jan 4 2021, 9:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 4 2021, 9:17 AM
lebedev.ri retitled this revision from [SCEV] Replace cttz loop by call to cttz intrinsic. to [LoopIdiom] Replace cttz loop by call to cttz intrinsic..Jan 4 2021, 9:55 AM
eiffel updated this revision to Diff 314542.EditedJan 5 2021, 2:20 AM

Add a check in recognizeAndReplaceCTZ() to test that CurLoop has a UniqueExitBlock.
This should correct the check-all failed test compiling compiler-rt/lib/gwp_asan/tests/mutex_test.cpp.

eiffel updated this revision to Diff 421805.Apr 10 2022, 1:06 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 10 2022, 1:06 PM
Herald added a subscriber: StephenFan. · View Herald Transcript

Hi.

This contribution is a bit old but I rebased it and tested it.
All the tests are passing:

[100%] Running the LLVM regression tests

Testing Time: 6737.87s
  Skipped          :     7
  Unsupported      :   472
  Passed           : 47326
  Expectedly Failed:   161
[100%] Built target check-llvm

So is it possible to, please, get some opinions regarding it?

Best regards and thank you in advance.

craig.topper added inline comments.Apr 11 2022, 10:00 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1885

Drop the comment about jdoerfert. Its fine for a commit message, but doesn't add anything to the understanding of the code.

1892

Use auto *AddIncValue0

1900

auto

1906

auto

eiffel updated this revision to Diff 422612.EditedApr 13 2022, 12:49 PM

Hi.

Thank you for your reviews.
I addressed your comments and updated the patch, I only ran the transforms tests and they are passing:

[100%] Running lit suite /home/francis/llvm/llvm-project/llvm/test/Transforms

Testing Time: 357.54s
  Unsupported      :   11
  Passed           : 7333
  Expectedly Failed:   33
[100%] Built target check-llvm-transforms

Best regards.

yurai007 added inline comments.
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1972

nit: const

llvm/test/Transforms/LoopIdiom/cttz.ll
67

Would it make sense to add some negative tests covering scenarios when pattern is (as expected) not recognized?

eiffel updated this revision to Diff 423752.Apr 19 2022, 3:39 PM

Hi.

I updated my code to adress the const nit.

However, I am not really sure of the output for cttz64 when ran with -O1.
Indeed, I get:

; ./bin/opt -O1 -S < ../llvm-project/llvm/test/Transforms/LoopIdiom/cttz.ll
; ModuleID = '<stdin>'
source_filename = "<stdin>"

; Function Attrs: nofree norecurse nosync nounwind readnone uwtable
define i32 @cttz32(i32 %x) local_unnamed_addr #0 {
entry:
  %0 = call i32 @llvm.cttz.i32(i32 %x, i1 true), !range !0
  ret i32 %0
}

; Function Attrs: nofree norecurse nosync nounwind readnone ssp uwtable
define i32 @cttz64(i64 %x) local_unnamed_addr #1 {
entry:
  br label %land.rhs

land.rhs:                                         ; preds = %while.body, %entry
  %i.06 = phi i64 [ 0, %entry ], [ %inc, %while.body ]
  %0 = shl nuw i64 1, %i.06
  %1 = and i64 %0, %x
  %cmp1 = icmp eq i64 %1, 0
  br i1 %cmp1, label %while.body, label %while.end.split.loop.exit2

while.body:                                       ; preds = %land.rhs
  %inc = add nuw nsw i64 %i.06, 1
  %cmp = icmp ult i64 %i.06, 63
  br i1 %cmp, label %land.rhs, label %while.end

while.end.split.loop.exit2:                       ; preds = %land.rhs
  %extract.t1.le = trunc i64 %i.06 to i32
  br label %while.end

while.end:                                        ; preds = %while.body, %while.end.split.loop.exit2
  %i.0.lcssa.off0 = phi i32 [ %extract.t1.le, %while.end.split.loop.exit2 ], [ 64, %while.body ]
  ret i32 %i.0.lcssa.off0
}

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i32 @llvm.cttz.i32(i32, i1 immarg) #2

attributes #0 = { nofree norecurse nosync nounwind readnone uwtable }
attributes #1 = { nofree norecurse nosync nounwind readnone ssp uwtable }
attributes #2 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

!0 = !{i32 0, i32 33}

While the result is correct for cttz32, the intrinsic is not present for cttz64.
Note that, when I ran this, I get an output which is correct for both:

; ./bin/opt -loop-idiom -loop-deletion -S < ../llvm-project/llvm/test/Transforms/LoopIdiom/cttz.ll          
; ModuleID = '<stdin>'
source_filename = "<stdin>"

; Function Attrs: norecurse nounwind readnone uwtable
define i32 @cttz32(i32 %x) #0 {
entry:
  br label %while.end

while.end:                                        ; preds = %entry
  %0 = call i32 @llvm.cttz.i32(i32 %x, i1 true)
  ret i32 %0
}

; Function Attrs: nounwind readnone ssp uwtable
define i32 @cttz64(i64 %x) #1 {
entry:
  br label %while.end

while.end:                                        ; preds = %entry
  %0 = call i64 @llvm.cttz.i64(i64 %x, i1 true)
  %conv = trunc i64 %0 to i32
  ret i32 %conv
}

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i32 @llvm.cttz.i32(i32, i1 immarg) #2

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i64 @llvm.cttz.i64(i64, i1 immarg) #2

attributes #0 = { norecurse nounwind readnone uwtable }
attributes #1 = { nounwind readnone ssp uwtable }
attributes #2 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

So, can someone with better experience give me his/her thoughts about this?

Best regards and thank you in advance.

Hi.

I updated my code to adress the const nit.

However, I am not really sure of the output for cttz64 when ran with -O1.
Indeed, I get:

; ./bin/opt -O1 -S < ../llvm-project/llvm/test/Transforms/LoopIdiom/cttz.ll
; ModuleID = '<stdin>'
source_filename = "<stdin>"

; Function Attrs: nofree norecurse nosync nounwind readnone uwtable
define i32 @cttz32(i32 %x) local_unnamed_addr #0 {
entry:
  %0 = call i32 @llvm.cttz.i32(i32 %x, i1 true), !range !0
  ret i32 %0
}

; Function Attrs: nofree norecurse nosync nounwind readnone ssp uwtable
define i32 @cttz64(i64 %x) local_unnamed_addr #1 {
entry:
  br label %land.rhs

land.rhs:                                         ; preds = %while.body, %entry
  %i.06 = phi i64 [ 0, %entry ], [ %inc, %while.body ]
  %0 = shl nuw i64 1, %i.06
  %1 = and i64 %0, %x
  %cmp1 = icmp eq i64 %1, 0
  br i1 %cmp1, label %while.body, label %while.end.split.loop.exit2

while.body:                                       ; preds = %land.rhs
  %inc = add nuw nsw i64 %i.06, 1
  %cmp = icmp ult i64 %i.06, 63
  br i1 %cmp, label %land.rhs, label %while.end

while.end.split.loop.exit2:                       ; preds = %land.rhs
  %extract.t1.le = trunc i64 %i.06 to i32
  br label %while.end

while.end:                                        ; preds = %while.body, %while.end.split.loop.exit2
  %i.0.lcssa.off0 = phi i32 [ %extract.t1.le, %while.end.split.loop.exit2 ], [ 64, %while.body ]
  ret i32 %i.0.lcssa.off0
}

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i32 @llvm.cttz.i32(i32, i1 immarg) #2

attributes #0 = { nofree norecurse nosync nounwind readnone uwtable }
attributes #1 = { nofree norecurse nosync nounwind readnone ssp uwtable }
attributes #2 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

!0 = !{i32 0, i32 33}

While the result is correct for cttz32, the intrinsic is not present for cttz64.
Note that, when I ran this, I get an output which is correct for both:

; ./bin/opt -loop-idiom -loop-deletion -S < ../llvm-project/llvm/test/Transforms/LoopIdiom/cttz.ll          
; ModuleID = '<stdin>'
source_filename = "<stdin>"

; Function Attrs: norecurse nounwind readnone uwtable
define i32 @cttz32(i32 %x) #0 {
entry:
  br label %while.end

while.end:                                        ; preds = %entry
  %0 = call i32 @llvm.cttz.i32(i32 %x, i1 true)
  ret i32 %0
}

; Function Attrs: nounwind readnone ssp uwtable
define i32 @cttz64(i64 %x) #1 {
entry:
  br label %while.end

while.end:                                        ; preds = %entry
  %0 = call i64 @llvm.cttz.i64(i64 %x, i1 true)
  %conv = trunc i64 %0 to i32
  ret i32 %conv
}

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i32 @llvm.cttz.i32(i32, i1 immarg) #2

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i64 @llvm.cttz.i64(i64, i1 immarg) #2

attributes #0 = { norecurse nounwind readnone uwtable }
attributes #1 = { nounwind readnone ssp uwtable }
attributes #2 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

So, can someone with better experience give me his/her thoughts about this?

Best regards and thank you in advance.

That's because -O1 runs pipeline of transformations and apparently cttz32 at the time of reaching LIR is expected and pattern is recognized successfully.
However it's not the case for cttz64 - perhaps IR was mutated by previous passes and in consequence recognizeAndReplaceCTZ bails out on unexpected pattern.
If you want to find out what's going on then LLVM_DEBUG macro and -debug/-print-changed options may be useful for debugging purpose.

yurai007 added inline comments.Apr 20 2022, 12:27 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1975

Since it's running on all targets I wonder whether checking size is enough. For example - in case of CTLZ transformation profitability is checked by additional query to TTI. More context: https://reviews.llvm.org/D32605#752717

lebedev.ri resigned from this revision.Jan 12 2023, 5:30 PM

This review may be stuck/dead, consider abandoning if no longer relevant.
Removing myself as reviewer in attempt to clean dashboard.