This is an archive of the discontinued LLVM Phabricator instance.

[LoongArch] Heuristically load FP immediates by movgr2fr from materialized integer
Accepted · Public

Authored by gonglingqin on Jul 13 2022, 6:32 PM.

Details

Summary

Load FP immediates by movgr2fr from a materialized integer if the bitcast integer
can be materialized within 2 instructions.
For example, when loading double 1024.0, use

lu52i.d $a0, $zero, 1033
movgr2fr.d $fa0, $a0

instead of

pcalau12i $a0, %pc_hi20(.LCPI2_0)
addi.d $a0, $a0, %pc_lo12(.LCPI2_0)
fld.d $fa0, $a0, 0
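
To see why a single lu52i.d suffices for 1024.0, one can check the bit pattern with a small standalone program (an illustrative sketch, not part of the patch):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  double d = 1024.0;
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits); // bitcast: reinterpret, don't convert
  // 1024.0 = 2^10, so the biased exponent is 1023 + 10 = 1033 = 0x409 and the
  // mantissa is all zeroes: the whole pattern is 0x409 << 52.
  std::printf("%#llx\n", (unsigned long long)bits);             // 0x4090000000000000
  std::printf("%#llx\n", (unsigned long long)(0x409ULL << 52)); // same value
}

Only bits 63:52 are non-zero, so lu52i.d $a0, $zero, 1033 materializes the whole integer in one instruction.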

Testing this patch with a 3A5000 on llvm13 shows that the SPEC CPU2006 FP
score increases by 1.2% on average, and the 470.lbm score increases by 11.9%.

Thanks to @xry111 for the suggestion: https://reviews.llvm.org/D128898#3632140

Diff Detail

Event Timeline

gonglingqin created this revision.Jul 13 2022, 6:32 PM
Herald added a project: Restricted Project.Jul 13 2022, 6:32 PM
gonglingqin requested review of this revision.Jul 13 2022, 6:32 PM
xen0n accepted this revision.Jul 13 2022, 7:05 PM

Thanks!

This revision is now accepted and ready to land.Jul 13 2022, 7:05 PM

SixWeining added a comment.

This change optimizes out the constant pool for loading a floating-point constant by li+i2f, so I think the title could be: "[LoongArch] Optimize the loading of floating-point immediates by li+i2f", and it'd be better to use a common case in the summary rather than 1.0.

To other reviewers: we tested this optimization with an internal LLVM version (llvm13) on a 3A5000, and it shows that the SPEC CPU2006 FP score increases by 1% on average, and the 470.lbm score increases by 8.9%. But we wonder why other architectures have not done this. Is there any potential issue?

xen0n added a comment.Jul 13 2022, 7:53 PM

This change optimizes out the constant pool for loading a floating-point constant by li+i2f, so I think the title could be: "[LoongArch] Optimize the loading of floating-point immediates by li+i2f", and it'd be better to use a common case in the summary rather than 1.0.

To other reviewers: we tested this optimization with an internal LLVM version (llvm13) on a 3A5000, and it shows that the SPEC CPU2006 FP score increases by 1% on average, and the 470.lbm score increases by 8.9%. But we wonder why other architectures have not done this. Is there any potential issue?

Hmm, yeah, I overlooked the overly generic patch title (I was reviewing the code in the metro). But it seems i2f isn't found anywhere in the repo or the commit history; instead there's i2fp, but that usage isn't common either. I assume i2fp is short for the {s,u}itofp IR insn. Then the description is incorrect, because {s,u}itofp transfers the numeric value, not the bit layout.

I think the title could be further simplified into something like "[LoongArch] Load FP immediates by movgr2fr from materialized integer", and the justification (such as the performance numbers you cited) could be put in the commit message body. What do you think?
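
To make the distinction concrete (a minimal C++ sketch, not from the patch; the function names are made up): {s,u}itofp transfers the numeric value, while the bitcast this patch relies on keeps the bit layout.

#include <cstdint>
#include <cstring>

// Value conversion, like the sitofp IR instruction:
// 1 becomes 1.0, i.e. the bit pattern 0x3FF0000000000000.
double value_convert(int64_t i) { return static_cast<double>(i); }

// Bit reinterpretation, like an i64-to-f64 bitcast:
// the integer 1 becomes the denormal 4.9e-324, since the bits are unchanged.
double bit_reinterpret(uint64_t bits) {
  double d;
  std::memcpy(&d, &bits, sizeof d);
  return d;
}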

This change optimizes out the constant pool for loading a floating-point constant by li+i2f, so I think the title could be: "[LoongArch] Optimize the loading of floating-point immediates by li+i2f", and it'd be better to use a common case in the summary rather than 1.0.

To other reviewers: we tested this optimization with an internal LLVM version (llvm13) on a 3A5000, and it shows that the SPEC CPU2006 FP score increases by 1% on average, and the 470.lbm score increases by 8.9%. But we wonder why other architectures have not done this. Is there any potential issue?

Hmm, yeah, I overlooked the overly generic patch title (I was reviewing the code in the metro). But it seems i2f isn't found anywhere in the repo or the commit history; instead there's i2fp, but that usage isn't common either. I assume i2fp is short for the {s,u}itofp IR insn. Then the description is incorrect, because {s,u}itofp transfers the numeric value, not the bit layout.

I think the title could be further simplified into something like "[LoongArch] Load FP immediates by movgr2fr from materialized integer", and the justification (such as the performance numbers you cited) could be put in the commit message body. What do you think?

That sounds good! Thanks!

This change optimizes out the constant pool for loading a floating-point constant by li+i2f, so I think the title could be: "[LoongArch] Optimize the loading of floating-point immediates by li+i2f", and it'd be better to use a common case in the summary rather than 1.0.

To other reviewers: we tested this optimization with an internal LLVM version (llvm13) on a 3A5000, and it shows that the SPEC CPU2006 FP score increases by 1% on average, and the 470.lbm score increases by 8.9%. But we wonder why other architectures have not done this. Is there any potential issue?

Hmm, yeah, I overlooked the overly generic patch title (I was reviewing the code in the metro). But it seems i2f isn't found anywhere in the repo or the commit history; instead there's i2fp, but that usage isn't common either. I assume i2fp is short for the {s,u}itofp IR insn. Then the description is incorrect, because {s,u}itofp transfers the numeric value, not the bit layout.

I think the title could be further simplified into something like "[LoongArch] Load FP immediates by movgr2fr from materialized integer", and the justification (such as the performance numbers you cited) could be put in the commit message body. What do you think?

Thanks. I will change that.

Address @xen0n and @SixWeining's comments.

gonglingqin retitled this revision from [LoongArch] Optimize the loading of floating-point immediates to [LoongArch] Load FP immediates by movgr2fr from materialized integer.Jul 13 2022, 8:26 PM
gonglingqin edited the summary of this revision. (Show Details)

To other reviewers: we tested this optimization with an internal LLVM version (llvm13) on a 3A5000, and it shows that the SPEC CPU2006 FP score increases by 1% on average, and the 470.lbm score increases by 8.9%. But we wonder why other architectures have not done this. Is there any potential issue?

I guess the reason is "for very simple test cases fld is really faster".

bench_imm.S:

#define VALUE	0x4090000000000000

.text
.type	main, @function
.globl	main

main:
	li.w	$t1, 1048576	# outer loop count: 1,048,576 iterations
.loop:
	.rept 1024
#if LOAD_IMM
	li.d	$t0, VALUE	# materialize the bit pattern in a GPR...
	movgr2fr.d	$ft0, $t0	# ...then move it to an FPR
#else
	la.local	$t0, .const0	# take the constant pool address...
	fld.d	$ft0, $t0, 0	# ...and load the f64 from it
#endif
	.endr
	addi.w	$t1, $t1, -1
	bnez	$t1, .loop
	li.w	$a0, 0
	jr	$ra


.data
.hidden	.const0
.const0:
	.dword	VALUE

On my 3A5000 (at 2.3 GHz) cc bench_imm.S && time ./a.out gives 0.35s, but cc bench_imm.S -DLOAD_IMM && time ./a.out gives 0.60s. But I think it's just because for the simple case the constant pool is always in the L1 cache...

On my 3A5000 (at 2.3 GHz) cc bench_imm.S && time ./a.out gives 0.35s, but cc bench_imm.S -DLOAD_IMM && time ./a.out gives 0.60s. But I think it's just because for the simple case the constant pool is always in the L1 cache...

Ah, just make the fetch unit busier and then the result will prefer immediate loading:

.loop:
	.rept 1024
#if LOAD_IMM
	li.d	$t0, VALUE
	movgr2fr.d	$ft0, $t0
#else
	la.local	$t0, .const0
	fld.d	$ft0, $t0, 0
#endif
	la.local	$t0, .const1	# .const1: a second .dword in .data (definition not shown)
	ld.d	$t2, $t0, 0	# extra load to keep the memory pipeline busy
	.endr
	addi.w	$t1, $t1, -1
	bnez	$t1, .loop
	li.w	$a0, 0
	jr	$ra

cc bench_imm.S && time ./a.out gives 0.70s, and cc bench_imm.S -DLOAD_IMM && time ./a.out gives 0.59s. But for a more complex bit pattern (like 0x400921FB54442D18 for PI) fld.d will win again.

Is it possible to limit the use of movgr2fr to the patterns that can be loaded with only one or two instructions and see how the SPEC score will change?

On my 3A5000 (at 2.3 GHz) cc bench_imm.S && time ./a.out gives 0.35s, but cc bench_imm.S -DLOAD_IMM && time ./a.out gives 0.60s. But I think it's just because for the simple case the constant pool is always in the L1 cache...

Ah, just make the fetch unit busier and then the result will prefer immediate loading:

.loop:
	.rept 1024
#if LOAD_IMM
	li.d	$t0, VALUE
	movgr2fr.d	$ft0, $t0
#else
	la.local	$t0, .const0
	fld.d	$ft0, $t0, 0
#endif
	la.local	$t0, .const1	# .const1: a second .dword in .data (definition not shown)
	ld.d	$t2, $t0, 0	# extra load to keep the memory pipeline busy
	.endr
	addi.w	$t1, $t1, -1
	bnez	$t1, .loop
	li.w	$a0, 0
	jr	$ra

cc bench_imm.S && time ./a.out gives 0.70s, and cc bench_imm.S -DLOAD_IMM && time ./a.out gives 0.59s. But for a more complex bit pattern (like 0x400921FB54442D18 for PI) fld.d will win again.

Is it possible to limit the use of movgr2fr to the patterns that can be loaded with only one or two instructions and see how the SPEC score will change?

Thanks for the suggestion. It may be possible; I will test it.

Address @xry111's comments. Load FP immediates by movgr2fr from a materialized integer if the bitcast integer can be materialized within 2 instructions.

gonglingqin retitled this revision from [LoongArch] Load FP immediates by movgr2fr from materialized integer to [LoongArch] Heuristically load FP immediates by movgr2fr from materialized integer.Jul 14 2022, 7:42 PM
gonglingqin edited the summary of this revision. (Show Details)
xen0n accepted this revision.Jul 14 2022, 7:48 PM

I don't know if you did the experiments thoroughly and found out 2 is the optimal threshold (on SPEC2006), or if it was just an arbitrary choice (an off-the-top-of-the-head guess).

You could mention how the threshold was chosen, in case it is indeed arbitrary but others wrongly assume it's something related to micro-architecture details, or empirically verified.

llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
854

nit: bitcast -- the verb "cast"'s past participle is itself, as is the compound word "bitcast".

gonglingqin added a comment.EditedJul 14 2022, 8:08 PM

I don't know if you did the experiments thoroughly and found out 2 is the optimal threshold (on SPEC2006), or if it was just an arbitrary choice (an off-the-top-of-the-head guess).

You could mention how the threshold was chosen, in case it is indeed arbitrary but others wrongly assume it's something related to micro-architecture details, or empirically verified.

I used a 3A5000 on llvm13 to test materializing the integer within 1, 2, and 4 instructions. The results show that the performance is best when using no more than 2 instructions. Maybe we should test the case of materializing the integer within 3 instructions.

gonglingqin added inline comments.Jul 14 2022, 8:11 PM
llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
854

Thanks. I will change that.

xen0n added a comment.Jul 14 2022, 8:22 PM

I don't know if you did the experiments thoroughly and found out 2 is the optimal threshold (on SPEC2006), or if it was just an arbitrary choice (an off-the-top-of-the-head guess).

You could mention how the threshold was chosen, in case it is indeed arbitrary but others wrongly assume it's something related to micro-architecture details, or empirically verified.

I used a 3A5000 on llvm13 to test materializing the integer within 1, 2, and 4 instructions. The results show that the performance is best when using no more than 2 instructions. Maybe we should test the case of materializing the integer within 3 instructions.

Could be better to find some time to upgrade your benchmarking environment for testing the actual main branch. ;-)

Regarding the actual benchmarks, yes, I think testing the 3-instruction case could be useful. But again, it may not make a significant difference: since the sign bit and the IEEE-754 biased exponent occupy the highest 12 bits, all f64's with the top 12 bits zeroed are denormals. And numbers whose binary representations have big "holes" of all-0s or all-1s in their two "middle" 20-bit segments or lowest 12 bits are probably not commonly used in the wild, let alone as immediates. You could try benchmarking of course, but I doubt the result would be much different from the 2-insn case.

(The 4-insn case is useless and equivalent to unconditionally loading via integer immediates, because all 64-bit values can be loaded in 4 insns (lu12i.w + ori + lu32i.d + lu52i.d) in LA64, and in LA32 you need two pairs of materialization and GPR-FPR moves for the higher and lower 32 bits anyway.)
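
For reference, the instruction counting described above could be sketched as follows (a simplified illustration using the same field layout; this is not LLVM's actual LoongArchMatInt logic, and the function names are made up):

#include <cstdint>

// Sign-extend the low `bits` bits of v.
static int64_t sext(uint64_t v, int bits) {
  return static_cast<int64_t>(v << (64 - bits)) >> (64 - bits);
}

// Rough LA64 materialization cost (in instructions) for a 64-bit pattern.
// (The real logic also uses addi.w to cover sign-extended 12-bit values
// in one instruction; that path is omitted here for brevity.)
int matInsnCount(uint64_t val) {
  // Only bits 63:52 set: a single lu52i.d $rd, $zero, imm suffices,
  // e.g. 0x4090000000000000 (1024.0).
  if ((val & 0x000FFFFFFFFFFFFFULL) == 0 && (val >> 52) != 0)
    return 1;
  int n = 0;
  uint64_t hi20 = (val >> 12) & 0xFFFFF;
  if (hi20 != 0)
    ++n; // lu12i.w for bits 31:12
  if (hi20 == 0 || (val & 0xFFF) != 0)
    ++n; // ori for bits 11:0
  if (sext(val >> 32, 20) != (sext(val, 32) >> 32))
    ++n; // lu32i.d needed: bits 51:32 aren't the sign-extension of bit 31
  if (sext(val >> 52, 12) != (sext(val, 52) >> 52))
    ++n; // lu52i.d needed: bits 63:52 aren't the sign-extension of bit 51
  return n;
}

For example, matInsnCount(0x400921FB54442D18) is 4 (pi's pattern needs the full sequence), while matInsnCount(0x4090000000000000) is 1.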

xen0n added a comment.Jul 14 2022, 9:13 PM

I don't know if you did the experiments thoroughly and found out 2 is the optimal threshold (on SPEC2006), or if it was just an arbitrary choice (an off-the-top-of-the-head guess).

You could mention how the threshold was chosen, in case it is indeed arbitrary but others wrongly assume it's something related to micro-architecture details, or empirically verified.

I used a 3A5000 on llvm13 to test materializing the integer within 1, 2, and 4 instructions. The results show that the performance is best when using no more than 2 instructions. Maybe we should test the case of materializing the integer within 3 instructions.

Could be better to find some time to upgrade your benchmarking environment for testing the actual main branch. ;-)

Ignore this; I forgot the main branch has no clang support yet.

I used a 3A5000 on llvm13 to test materializing the integer within 1, 2, and 4 instructions. The results show that the performance is best when using no more than 2 instructions. Maybe we should test the case of materializing the integer within 3 instructions.

Could be better to find some time to upgrade your benchmarking environment for testing the actual main branch. ;-)

Regarding the actual benchmarks, yes, I think testing the 3-instruction case could be useful. But again, it may not make a significant difference: since the sign bit and the IEEE-754 biased exponent occupy the highest 12 bits, all f64's with the top 12 bits zeroed are denormals. And numbers whose binary representations have big "holes" of all-0s or all-1s in their two "middle" 20-bit segments or lowest 12 bits are probably not commonly used in the wild, let alone as immediates. You could try benchmarking of course, but I doubt the result would be much different from the 2-insn case.

(The 4-insn case is useless and equivalent to unconditionally loading via integer immediates, because all 64-bit values can be loaded in 4 insns (lu12i.w + ori + lu32i.d + lu52i.d) in LA64, and in LA32 you need two pairs of materialization and GPR-FPR moves for the higher and lower 32 bits anyway.)

The test results show that the performance of materializing the integer within 3 instructions is better than that of the 2-instruction case. The results are shown in the table below.

Benchmarks  | Score of 2-instruction case | Score of 3-instruction case | diff
433.milc    | 13.2                        | 13.2                        | 0
444.namd    | 15                          | 15.1                        | 0.1
447.dealII  | 26.6                        | 26.7                        | 0.1
450.soplex  | 23.6                        | 24.2                        | 0.6
453.povray  | 23.3                        | 23.4                        | 0.1
470.lbm     | 21.5                        | 21.9                        | 0.4
482.sphinx3 | 25.5                        | 25.5                        | 0

It seems that the 3-instruction case outperforms the other cases. @xen0n, do you have any suggestions?
(Since we do not support Flang for the time being, I didn't test the Fortran-related benchmarks.)

xen0n added a comment.Jul 15 2022, 3:26 AM

I used a 3A5000 on llvm13 to test materializing the integer within 1, 2, and 4 instructions. The results show that the performance is best when using no more than 2 instructions. Maybe we should test the case of materializing the integer within 3 instructions.

Could be better to find some time to upgrade your benchmarking environment for testing the actual main branch. ;-)

Regarding the actual benchmarks, yes, I think testing the 3-instruction case could be useful. But again, it may not make a significant difference: since the sign bit and the IEEE-754 biased exponent occupy the highest 12 bits, all f64's with the top 12 bits zeroed are denormals. And numbers whose binary representations have big "holes" of all-0s or all-1s in their two "middle" 20-bit segments or lowest 12 bits are probably not commonly used in the wild, let alone as immediates. You could try benchmarking of course, but I doubt the result would be much different from the 2-insn case.

(The 4-insn case is useless and equivalent to unconditionally loading via integer immediates, because all 64-bit values can be loaded in 4 insns (lu12i.w + ori + lu32i.d + lu52i.d) in LA64, and in LA32 you need two pairs of materialization and GPR-FPR moves for the higher and lower 32 bits anyway.)

The test results show that the performance of materializing the integer within 3 instructions is better than that of the 2-instruction case. The results are shown in the table below.

Benchmarks  | Score of 2-instruction case | Score of 3-instruction case | diff
433.milc    | 13.2                        | 13.2                        | 0
444.namd    | 15                          | 15.1                        | 0.1
447.dealII  | 26.6                        | 26.7                        | 0.1
450.soplex  | 23.6                        | 24.2                        | 0.6
453.povray  | 23.3                        | 23.4                        | 0.1
470.lbm     | 21.5                        | 21.9                        | 0.4
482.sphinx3 | 25.5                        | 25.5                        | 0

It seems that the 3-instruction case outperforms the other cases. @xen0n, do you have any suggestions?
(Since we do not support Flang for the time being, I didn't test the Fortran-related benchmarks.)

This is interesting data. Are the SPEC2006 runs one-shot or averaged over multiple runs like the Phoronix Test Suite? The 450.soplex case does seem statistically significant enough, though.

I think some assembly comparison could go a long way, but again, SPEC2006 is *horribly outdated*, so IMO the argument for the 3-instruction threshold would be a lot stronger if you could replicate this result on some more recent or comprehensive benchmark suites. (PTS or newer SPEC versions are both better than SPEC2006 in this regard.)

This is interesting data. Are the SPEC2006 runs one-shot or averaged over multiple runs like the Phoronix Test Suite? The 450.soplex case does seem statistically significant enough, though.

SPEC2006 was tested twice and the average of the scores was taken.

I think some assembly comparison could go a long way, but again, SPEC2006 is *horribly outdated*, so IMO the argument for the 3-instruction threshold would be a lot stronger if you could replicate this result on some more recent or comprehensive benchmark suites. (PTS or newer SPEC versions are both better than SPEC2006 in this regard.)

Thanks, I will test other benchmark sets.

I'm now feeling guilty because I raised the suggestion without any benchmarking done... When I get some spare time I'll try to implement this for GCC and benchmark it.

I'm now feeling guilty because I raised the suggestion without any benchmarking done... When I get some spare time I'll try to implement this for GCC and benchmark it.

You don't have to feel guilty. We must do this to make the best decision. :)

I think some assembly comparison could go a long way, but again, SPEC2006 is *horribly outdated*, so IMO the argument for the 3-instruction threshold would be a lot stronger if you could replicate this result on some more recent or comprehensive benchmark suites. (PTS or newer SPEC versions are both better than SPEC2006 in this regard.)

Thanks, I will test other benchmark sets.

I used CPU2017 (Fortran excluded) to test the performance in 5 cases:

  1. using the constant pool,
  2. materializing the integer with 1 instruction,
  3. materializing the integer within 2 instructions,
  4. materializing the integer within 3 instructions,
  5. materializing the integer within 4 instructions.

(Tests were run three times for each condition and the scores were geometrically averaged.)
The results showed no change in the scores for the 5 cases. @xen0n, @xry111, do you have any suggestions?

xry111 added a comment.EditedJul 26 2022, 7:31 PM

I think some assembly comparison could go a long way, but again, SPEC2006 is *horribly outdated*, so IMO the argument for the 3-instruction threshold would be a lot stronger if you could replicate this result on some more recent or comprehensive benchmark suites. (PTS or newer SPEC versions are both better than SPEC2006 in this regard.)

Thanks, I will test other benchmark sets.

I used CPU2017 (Fortran excluded) to test the performance in 5 cases:

  1. using the constant pool,
  2. materializing the integer with 1 instruction,
  3. materializing the integer within 2 instructions,
  4. materializing the integer within 3 instructions,
  5. materializing the integer within 4 instructions.

(Tests were run three times for each condition and the scores were geometrically averaged.)
The results showed no change in the scores for the 5 cases. @xen0n, @xry111, do you have any suggestions?

Make it a tunable (-loongarch-materialize-float-imm=0/1/2/3/4, or some better name), I guess. And set the default to 0 for -mtune=generic or -mtune=la464. Then we can set it to other values if a future uarch behaves differently.
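
If the tunable route were taken, the wiring might look roughly like this (a hypothetical sketch based on xry111's suggested flag name; the option variable, default, and description are illustrative, not from the actual patch):

#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Hypothetical backend option: the maximum number of integer-materialization
// instructions allowed before falling back to the constant pool.
// 0 disables the movgr2fr path entirely.
static cl::opt<unsigned> MaterializeFPImmCost(
    "loongarch-materialize-float-imm",
    cl::desc("Load an FP immediate via movgr2fr if its bit pattern can be "
             "materialized in at most this many instructions (0 = always "
             "use the constant pool)"),
    cl::init(0), cl::Hidden);

Per-uarch defaults could then be derived from the -mtune value, as suggested above.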

I used CPU2017 (Fortran excluded) to test the performance in 5 cases:

  1. using the constant pool,
  2. materializing the integer with 1 instruction,
  3. materializing the integer within 2 instructions,
  4. materializing the integer within 3 instructions,
  5. materializing the integer within 4 instructions.

(Tests were run three times for each condition and the scores were geometrically averaged.)
The results showed no change in the scores for the 5 cases. @xen0n, @xry111, do you have any suggestions?

Make it a tunable (-loongarch-materialize-float-imm=0/1/2/3/4, or some better name), I guess. And set the default to 0 for -mtune=generic or -mtune=la464. Then we can set it to other values if a future uarch behaves differently.

Good suggestion! Thanks! If others agree with this opinion, I will implement it.

xen0n added a comment.Jul 27 2022, 5:49 AM

I used CPU2017 (Fortran excluded) to test the performance in 5 cases:

  1. using the constant pool,
  2. materializing the integer with 1 instruction,
  3. materializing the integer within 2 instructions,
  4. materializing the integer within 3 instructions,
  5. materializing the integer within 4 instructions.

(Tests were run three times for each condition and the scores were geometrically averaged.)
The results showed no change in the scores for the 5 cases. @xen0n, @xry111, do you have any suggestions?

Make it a tunable (-loongarch-materialize-float-imm=0/1/2/3/4, or some better name), I guess. And set the default to 0 for -mtune=generic or -mtune=la464. Then we can set it to other values if a future uarch behaves differently.

Good suggestion! Thanks! If others agree with this opinion, I will implement it.

Hmm, so the workload characteristics of SPEC2017fp actually changed enough to make this optimization negligible. Interesting.

I think @xry111's suggestion to make this optimization tunable is reasonable, but it may need more work then. Perhaps this patch could be put on the back burner; we can always come back to finish this later after the more essential things. Lots of downstream projects are blocked on the availability of LLVM, so it may be very worthwhile to shift priorities for now.

I think @xry111's suggestion to make this optimization tunable is reasonable, but it may need more work then. Perhaps this patch could be put on the back burner; we can always come back to finish this later after the more essential things. Lots of downstream projects are blocked on the availability of LLVM, so it may be very worthwhile to shift priorities for now.

Yes, if there is too much extra cost we can just delay this one. My suggestion to make it a tunable is based on "the logic is already written and let's not waste it".

Apologies again for raising some premature thoughts too early. :(

I think @xry111's suggestion to make this optimization tunable is reasonable, but it may need more work then. Perhaps this patch could be put on the back burner; we can always come back to finish this later after the more essential things. Lots of downstream projects are blocked on the availability of LLVM, so it may be very worthwhile to shift priorities for now.

Yes, if there is too much extra cost we can just delay this one. My suggestion to make it a tunable is based on "the logic is already written and let's not waste it".

Thank you for your suggestions. After discussion, we will come back to improve this patch after the more essential functionality is implemented.

Apologies again for raising some premature thoughts too early. :(

You don't have to feel guilty. This is an interesting optimization.