This is an initial attempt at a pass that will traverse the machine code while still in SSA form and attempt various transformations aimed at reducing the number of CR-logical operations. The only support provided by the initial patch is for splitting basic blocks on binary CR-logical operations (i.e. branch early on operands). The hope however is that the provided infrastructure will allow for further transformations that can reduce the number of these operations.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
I plan to run those to try to get reliable performance data to motivate heuristics for this pass. I'll do that after the dev conference and report the impact of this patch on those.
lib/Target/PowerPC/PPCMachineBasicBlockUtils.h | ||
---|---|---|
93 | It looks to me this static function uses too many parameters, which makes it a little bit hard to understand. Is it possible for us to reduce the number of parameters used here? One way I could think of is to make this a member function of class PPCReduceCRLogicals. In that case we could make some of the parameters member of the class, so that we don't have to pass them around. But it is totally up to you to decide whether the effort is worthwhile or not. | |
lib/Target/PowerPC/PPCReduceCRLogicals.cpp | ||
220 | Can we do what the FIXME suggested here in this patch? Since the community doesn't like us putting FIXME in the code. |
lib/Target/PowerPC/PPCMachineBasicBlockUtils.h | ||
---|---|---|
93 | I don't want to make it a member variable as I think this is useful for other purposes than just this pass. However, I think it might be useful to wrap all the parameters into a struct that we'll build up and pass. | |
lib/Target/PowerPC/PPCReduceCRLogicals.cpp | ||
220 | I meant to remove this FIXME. Thanks for pointing it out. |
- Rebase
- Fix test cases that fail on ToT
- Simplify the split function signature
- Add the input CR logical instructions back to the queue if we split on the use
Some performance numbers as requested by @echristo (all run on a Power8 2GHz machine, bound to a specific physical CPU):
SPEC2006 (run time improvements - negative is good):
444.namd,868.7807,0.0000,850.4903,0.0000,-2.11%
447.dealII,630.6040,0.0000,631.2014,0.0000,0.09%
450.soplex,473.6103,0.0000,417.0427,0.0000,-11.94%
453.povray,505.4464,0.0000,487.0534,0.0000,-3.64%
401.bzip2,969.0608,0.0000,923.7337,0.0000,-4.68%
445.gobmk,868.6733,0.0000,857.6179,0.0000,-1.27%
464.h264ref,1053.8501,0.0000,1005.3811,0.0000,-4.60%
471.omnetpp,473.8660,0.0000,444.2557,0.0000,-6.25%
Changes below 1% omitted.
This was from a single run of SPEC2016 with r319300 and then same revision with this patch applied. Since it was a single run, I don't have information about the noise range, but the important thing is that the trend seems to be rather clear - performance improves with this patch.
For Hal's benchmarks, the relative performance doesn't change almost at all with this patch. The absolute performance improves with 4,3,float, 4,4,float by about 30% and degrades with 4,4,int by about 20%. The rest are unchanged.
Quite.
For Hal's benchmarks, the relative performance doesn't change almost at all with this patch. The absolute performance improves with 4,3,float, 4,4,float by about 30% and degrades with 4,4,int by about 20%. The rest are unchanged.
Weird.
Anyhow, I'm ok with this.
It looks to me this static function uses too many parameters, which makes it a little bit hard to understand. Is it possible for us to reduce the number of parameters used here? One way I could think of is to make this a member function of class PPCReduceCRLogicals. In that case we could make some of the parameters member of the class, so that we don't have to pass them around. But it is totally up to you to decide whether the effort is worthwhile or not.