Page MenuHomePhabricator

[LoopIdiom] Replace cttz loop by call to cttz intrinsic.
Needs ReviewPublic

Authored by eiffel on Mon, Jan 4, 9:17 AM.

Details

Summary

Hi.

First, I hope you are fine and the same for your relatives.

I wrote a patch which solves the issue 17128.
The goal of this patch is to replace such snippet:

int cttz(unsigned long x){
	unsigned long i = 0;
	while(i < 64 && (((x >> i) & 0x1) == 0))
		i++;
	return i;
}

by calls to llvm cttz intrinsic which can then be translated to the corresponding assembly instruction, if the architecture has one.
In my case, the intrinsic was replaced by bfsq instruction.

To confirm my results, I wrote cttz.ll test to confirm the patch works and I ran the the check-llvm-unit and check-llvm targets.
The second gave me one failure for Bindings/Go/go.test:

# llvm.org/llvm/bindings/go/llvm.test
/usr/lib/go-1.11/pkg/tool/linux_amd64/link: running /usr/bin/c++ failed: exit status 1
ld.lld: error: unknown --compress-debug-sections value: zlib-gnu
collect2: error: ld returned 1 exit status

I do not think this problem is related to my patch but rather to my configuration.

I also quickly benchmarked my modifications.
First, I measured the compilation time of this program by compiling it 100 times using this command clang -O3 -S -emit-llvm and time as measuring tool:

#include <stdlib.h>


int cttz(unsigned long x){
	unsigned long i = 0;
	while(i < 64 && (((x >> i) & 0x1) == 0))
		i++;
	return i;
}

int main(void){
	int bits_field;
	int first_set;

	bits_field = rand();

	first_set = cttz(bits_field);

	return first_set;
}

The results are the following (in second):

100 compilationsmean for 1 compilation
without patch20.54.21
with patch16.19.17

So, the patch reduces compilation time of approximately 23%.
However, I am not really sure of this result as I first though that the modification would make the compilation slower.
Maybe it is quicker due to the loop removing and then not having to optimize it.

Then, I measure the performance of the generated code by running the above code 10000 times using time as measuring tool, the results are as follows (in millisecond):

10000 runsmean for 1 run
without patch8730.873
with patch8220.822

So, the patch reduces code execution time by around 6%.

If you see any way to improve the patch or mistake I made, feel free to share.

Best regards.

Diff Detail

Event Timeline

eiffel created this revision.Mon, Jan 4, 9:17 AM
eiffel requested review of this revision.Mon, Jan 4, 9:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptMon, Jan 4, 9:17 AM
lebedev.ri retitled this revision from [SCEV] Replace cttz loop by call to cttz intrinsic. to [LoopIdiom] Replace cttz loop by call to cttz intrinsic..Mon, Jan 4, 9:55 AM
eiffel updated this revision to Diff 314542.EditedTue, Jan 5, 2:20 AM

Add a check in recognizeAndReplaceCTZ() to test that CurLoop has a UniqueExitBlock.
This should correct the check-all failed test compiling compiler-rt/lib/gwp_asan/tests/mutex_test.cpp.