This is the second part of the commit fixing PR38917 (hoisting partitially redundant machine instruction). Most of PRE (partitial redundancy elimination) and CSE work is done on LLVM IR, but some of redundancy arises during DAG legalization. Machine CSE is not enough to deal with it. This simple PRE implementation works a little bit intricately: it passes before CSE, looking for partitial redundancy and transforming it to fully redundancy, anticipating that the next CSE step will eliminate this created redundancy. If CSE doesn't eliminate this, than created instruction will remain dead and eliminated later by Remove Dead Machine Instructions pass. The third part of the commit is supposed to refactor MachineCSE, to make it more clear and to merge MachinePRE with MachineCSE, so one need no rely on further Remove Dead pass to clear instrs not eliminated by CSE. First step: https://reviews.llvm.org/D54839 Fixes llvm.org/PR38917
Details
- Reviewers
RKSimon MatzeB spatel craig.topper aemerson qcolombet atrick rtereshin - Commits
- rZORG3b8eade8a420: [MIR] Add simple PRE pass to MachineCSE
rZORGea3d1df9d6a7: [MIR] Add simple PRE pass to MachineCSE
rG3b8eade8a420: [MIR] Add simple PRE pass to MachineCSE
rGea3d1df9d6a7: [MIR] Add simple PRE pass to MachineCSE
rG9c20156de39b: [MIR] Add simple PRE pass to MachineCSE
rL359870: [MIR] Add simple PRE pass to MachineCSE
Diff Detail
- Repository
- rG LLVM Github Monorepo
- Build Status
Buildable 30931 Build 30930: arc lint + arc unit
Event Timeline
FYI: Generally accepted ping frequency is once per week.
lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
54 ↗ | (On Diff #181982) | Our general style is not to use line continuations like this, but instead rely on adjacent string concatenation, so this would be: STATISTIC(NumPREs, "Number of partial redundant expression" " transformed to fully redundant"); (where the starting quotes are vertically aligned) |
754 ↗ | (On Diff #181982) | Comments should be full sentences and end with a period (or other punctuation). Also, please explain *why* the conditions are stronger in this case. |
790 ↗ | (On Diff #181982) | It looks like this could be: if (!PREMap.count(MI)) { PREMap[MI] = MBB; continue; } and then you don't need the else and extra associated indentation. |
812 ↗ | (On Diff #181982) | It would be better to not create, and then delete, instructions speculatively. isProfitableToCSE does look at the parent blocks of the instructions, but we could abstract that by:
|
Ok, thanks!
lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
54 ↗ | (On Diff #181982) | Ok, fixed. |
754 ↗ | (On Diff #181982) | Ok, changed. |
790 ↗ | (On Diff #181982) | Thanks, that is better! Fixed. |
812 ↗ | (On Diff #181982) | Thanks, changed. Actually isProfitableToCSE() doesn't need CSMI (new instr) as argument, only its basic block, so just changed function arglist. |
Have you measured the compile-time impact?
lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
755 ↗ | (On Diff #183272) | I think it would be useful to elaborate on under what conditions we create these PRE instructions that are not later CSEd. Do you know how often that happens (I assume that you can figure this out by looking at the statistics and comparing runs where this happens vs. where it doesn't)? |
What do you mean by compile-time impact? I've run test suite of course, compared with baseline.
The results for exec_time and size..text metrics are like here:
> ~/llvm/test-suite/utils/compare.py --filter-short -m exec_time ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_exp3.json ... Program results_rel_base results_rel_exp3 diff MicroBench...RayFDRMultiThreaded/threads:32 1055.94 931.21 -11.8% Bitcode/Be...rid/halide_bilateral_grid.test 30.26 26.70 -11.8% MultiSourc...mbolics-flt/Symbolics-flt.test 1.10 0.98 -10.9% MicroBench...XRayFDRMultiThreaded/threads:2 497.85 547.92 10.1% MicroBench...lcalsCRaw.test:BM_EOS_RAW/5001 10.33 9.49 -8.2% MicroBench...bda.test:BM_PIC_1D_LAMBDA/5001 75.75 81.43 7.5% MicroBench...XRayFDRMultiThreaded/threads:8 1029.95 1100.16 6.8% MicroBench...XRayFDRMultiThreaded/threads:4 626.46 667.17 6.5% MicroBench...ambda.test:BM_EOS_LAMBDA/44217 87.21 92.66 6.2% MultiSourc...flt/LoopRestructuring-flt.test 3.76 3.53 -6.2% MicroBench...calsCRaw.test:BM_EOS_RAW/44217 85.94 90.96 5.8% MicroBench...Lambda.test:BM_ADI_LAMBDA/5001 128.27 135.35 5.5% MicroBench...est:BM_ReturnNeverInstrumented 1.60 1.68 5.3% MicroBench....test:BM_MULADDSUB_LAMBDA/5001 14.14 14.84 4.9% SingleSour...ncils/fdtd-apml/fdtd-apml.test 0.63 0.66 4.4% ... > ~/llvm/test-suite/utils/compare.py --filter-short -m size..text ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_exp3.json ... Program results_rel_base results_rel_exp3 diff SingleSour...marks/Shootout/Shootout-random 658.00 642.00 -2.4% SingleSour...ce/UnitTests/SignlessTypes/rem 4834.00 4930.00 2.0% SingleSour...ootout-C++/Shootout-C++-random 818.00 802.00 -2.0% MultiSource/Benchmarks/Olden/mst/mst 2658.00 2706.00 1.8% SingleSource/Benchmarks/Misc/ffbench 2322.00 2290.00 -1.4% MultiSource/Benchmarks/NPB-serial/is/is 4946.00 4898.00 -1.0% MultiSource/Benchmarks/McCat/18-imp/imp 13266.00 13138.00 -1.0% MultiSourc...rks/McCat/03-testtrie/testtrie 1810.00 1794.00 -0.9% MultiSourc...chmarks/McCat/01-qbsort/qbsort 2066.00 2050.00 -0.8% MultiSourc...iabench/g721/g721encode/encode 7394.00 7346.00 -0.6% SingleSource/Benchmarks/Misc/fbench 2546.00 2562.00 0.6% SingleSour...ce/Benchmarks/Misc/ReedSolomon 11170.00 11234.00 0.6% SingleSource/Benchmarks/Stanford/Towers 2802.00 2786.00 -0.6% MultiSourc...e/Applications/SIBsim4/SIBsim4 47666.00 47938.00 0.6% MultiSourc...OE-ProxyApps-C/miniGMG/miniGMG 76258.00 76674.00 0.5%
Although I'm not sure exec_time is good metrics. I've also tried compile_time, never used it before:
~/llvm/test-suite/utils/compare.py --filter-short -m compile_time ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_exp3.json ... Program results_rel_base results_rel_exp3 diff SingleSour.../Benchmarks/Misc-C++/Large/ray 0.68 0.74 7.6% MultiSourc...Prolangs-C/simulator/simulator 1.56 1.64 5.7% MultiSource/Benchmarks/McCat/18-imp/imp 0.82 0.78 -5.3% MultiSourc...OE-ProxyApps-C/RSBench/rsbench 0.88 0.84 -4.6% MultiSource/Benchmarks/Ptrdist/bc/bc 2.42 2.52 4.1% SingleSour...enchmarks/CoyoteBench/fftbench 1.08 1.03 -4.1% SingleSour...UnitTests/Vectorizer/gcc-loops 1.42 1.48 3.9% MultiSourc...chmarks/Prolangs-C++/city/city 3.20 3.08 -3.9% MultiSource/Benchmarks/sim/sim 0.72 0.75 3.9% MultiSourc...nchmarks/FreeBench/pifft/pifft 2.56 2.65 3.6% MultiSource/Applications/spiff/spiff 1.58 1.64 3.5% MultiSourc...lds-flt/CrossingThresholds-flt 3.80 3.94 3.5% MultiSourc...s/Prolangs-C/compiler/compiler 0.84 0.81 -3.3% MultiSourc...rks/Prolangs-C++/employ/employ 0.74 0.71 -3.3% SingleSour...ks/Misc-C++/stepanov_container 3.01 3.11 3.2% ...
Also one can see many tests increased their machine-cse.NumCSEs statistics (meaning that PRE step created common subexpressions eliminated by CSE):
> ~/llvm/test-suite/utils/compare.py --filter-short -m machine-cse.NumCSEs ~/llvm/test-suite-results_deb_base.json ~/llvm/test-suite-results/results_deb_exp.json ... Program results_deb_base results_deb_exp diff SingleSour...ce/UnitTests/SignlessTypes/rem 1.00 19.00 1800.0% MultiSource/Benchmarks/llubenchmark/llu 1.00 10.00 900.0% MultiSource/Benchmarks/NPB-serial/is/is 1.00 9.00 800.0% SingleSource/Benchmarks/Misc/fbench 1.00 7.00 600.0% MultiSourc...OE-ProxyApps-C/XSBench/XSBench 2.00 8.00 300.0% SingleSource/Benchmarks/Misc/ffbench 2.00 8.00 300.0% SingleSour...chmarks/Misc-C++/stepanov_v1p2 1.00 3.00 200.0% MultiSourc...arks/FreeBench/distray/distray 7.00 20.00 185.7% MicroBench.../ImageProcessing/Dither/Dither 3.00 8.00 166.7% MultiSourc...ench/telecomm-FFT/telecomm-fft 5.00 13.00 160.0% MultiSourc...enchmarks/mafft/pairlocalalign 482.00 1214.00 151.9% MultiSourc...aran/netbench-url/netbench-url 3.00 7.00 133.3% MultiSourc...Benchmarks/SciMark2-C/scimark2 6.00 13.00 116.7% MultiSourc...nch/mpeg2/mpeg2dec/mpeg2decode 57.00 115.00 101.8% MultiSourc...chmarks/McCat/01-qbsort/qbsort 2.00 4.00 100.0% ...
lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
755 ↗ | (On Diff #183272) | Thanks for pointing this out, I've cleaned this place. I've removed needless checks like MI->isReturn() after comparing statistics (no change). Mostly, these checks are implicitly done while CSE work (not inside isCSECandidate()). I've left aux checks of virtual registers though -- they are also implicitly done. I've also updated comment. This looks artificially, but I'm going to refactor this place in the third (and last) step of this work -- merging PRE and CSE, so we need no different checks. |
lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
800 ↗ | (On Diff #185240) | Are these calls expensive? SingleSour.../Benchmarks/Misc-C++/Large/ray 0.68 0.74 7.6% That 7% increase in compile time, and a few other others, look significant. (Maybe there's something you can cache?) |
lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
800 ↗ | (On Diff #185240) | I believe caching this gives nothing, cause two common subexpressions unlikely belong to the same basic blocks pair. 7% are only for one test among hundreds, there are also -5% and so on. I'm to run benchmarks again to compare results. |
Hi @hfinkel,
Unfortunately, the results of compile_time are unstable and not reproducible. Below are two different runs.
The first run:
> ~/llvm/test-suite/utils/compare.py --filter-short -m compile_time ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_exp.json Tests: 1163 Short Running: 1010 (filtered out) Remaining: 153 Metric: compile_time Program results_rel_base results_rel_exp diff MultiSourc...OE-ProxyApps-C/XSBench/XSBench 0.74 0.79 6.5% MicroBench...y/ReturnReference/retref-bench 1.24 1.17 -6.1% SingleSour...rks/Misc-C++/Large/sphereflake 0.71 0.68 -4.5% MultiSourc...ps-C++/HACCKernels/HACCKernels 0.72 0.76 4.4% MultiSourc...chmarks/Prolangs-C/gnugo/gnugo 1.40 1.34 -4.3% MultiSourc...rks/Prolangs-C++/employ/employ 0.76 0.73 -4.2% MicroBench...arks/ImageProcessing/Blur/blur 1.61 1.55 -4.0% MultiSourc...rks/Trimaran/enc-3des/enc-3des 0.62 0.65 3.8% MultiSourc...rks/mediabench/gsm/toast/toast 2.29 2.38 3.7% MultiSourc...Benchmarks/SciMark2-C/scimark2 0.79 0.82 3.6% MicroBench.../ImageProcessing/Dilate/Dilate 1.32 1.36 2.7% MicroBench...oopInterchange/LoopInterchange 0.74 0.72 -2.7% MultiSourc...s/Prolangs-C/unix-tbl/unix-tbl 2.22 2.16 -2.7% MultiSourc...rolangs-C/archie-client/archie 1.20 1.17 -2.7% MultiSourc...marks/Prolangs-C/loader/loader 0.62 0.60 -2.6% results_rel_base results_rel_exp diff count 153.000000 152.000000 152.000000 mean 7.235216 7.270763 -0.000904 std 12.322023 12.328571 0.016021 min 0.600000 0.604000 -0.061093 25% 1.516000 1.535000 -0.007504 50% 3.784000 3.800000 0.000000 75% 5.568000 5.581000 0.006387 max 88.724000 88.972000 0.064516
and the second run:
> ~/llvm/test-suite/utils/compare.py --filter-short -m compile_time ~/llvm/test-suite-results/results_rel_base2.json ~/llvm/test-suite-results/results_rel_exp2.json Tests: 1163 Short Running: 1009 (filtered out) Remaining: 154 Metric: compile_time Program results_rel_base2 results_rel_exp2 diff ultiSource/Benchmarks/McCat/09-vor/vor 0.66 0.62 -6.6% ultiSource...Benchmarks/SciMark2-C/scimark2 0.80 0.75 -5.5% ultiSource...rolangs-C/archie-client/archie 1.16 1.22 5.2% ultiSource...OE-ProxyApps-C/XSBench/XSBench 0.79 0.76 -4.1% ultiSource/Benchmarks/McCat/18-imp/imp 0.80 0.83 4.0% ultiSource...OE-ProxyApps-C/RSBench/rsbench 0.87 0.90 3.7% ultiSource...rks/Trimaran/enc-3des/enc-3des 0.66 0.63 -3.7% ultiSource...hmarks/MallocBench/cfrac/cfrac 1.64 1.70 3.6% icroBenchm.../ImageProcessing/Dilate/Dilate 1.37 1.32 -3.5% ultiSource...ALAC/decode/alacconvert-decode 2.41 2.34 -3.0% ultiSource...chmarks/Prolangs-C++/city/city 3.22 3.13 -3.0% ultiSource...xyApps-C/Pathfinder/PathFinder 1.77 1.72 -2.9% ultiSource...s/Prolangs-C/compiler/compiler 0.86 0.84 -2.8% ultiSource...rks/Prolangs-C++/employ/employ 0.73 0.71 -2.7% ultiSource/Applications/sgefa/sgefa 0.99 1.02 2.4% results_rel_base2 results_rel_exp2 diff count 152.000000 153.000000 151.000000 mean 7.284842 7.236418 -0.002551 std 12.346005 12.320806 0.015407 min 0.600000 0.604000 -0.066265 25% 1.567000 1.504000 -0.009324 50% 3.794000 3.776000 -0.001112 75% 5.609000 5.584000 0.005669 max 89.004000 89.036000 0.051724
Also, I don't think the heavy functions here (like findNearestCommonDominator() and isPotentiallyReachable()) can significant impact compile time -- they are called rarely only for two BBs with common subexpressions, which is unusual. Taking the machine-cse.NumCSE statistics above as example, one can make an estimation of no more than 10-100 calls of these functions for the benchs from test suite. Also, these heavy calls are already cached in a sense, they are using prebuilt Machine Dominator Tree.
LGTM - based on the improved codegen and the negligible effect on compile time. Thanks.
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
784 | Very minor - but why can't the ++I be moved inside the for-loop increment statement? |
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
784 | It can, I'm to fix it, thanks. |
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
784 | This one too? |
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
784 | Hmm, looks like I've eventually fixed L506 -- the original loop inside ProcessBlockCSE() instead of ProcessBlockPRE(). |
This is causing incorrect code generation for this piece of IR:
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128" target triple = "aarch64-arm-none-eabi" @var = hidden local_unnamed_addr global i32 0, align 4 @_ZTIi = external dso_local constant i8* declare dso_local void @_Z2fnv() local_unnamed_addr #1 declare dso_local i32 @__gxx_personality_v0(...) declare i32 @llvm.eh.typeid.for(i8*) #2 declare dso_local i8* @__cxa_begin_catch(i8*) local_unnamed_addr declare dso_local void @__cxa_end_catch() local_unnamed_addr define hidden i32 @_Z7examplev() personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*) { entry: invoke void @_Z2fnv() to label %try.cont unwind label %lpad lpad: ; preds = %entry %0 = landingpad { i8*, i32 } catch i8* bitcast (i8** @_ZTIi to i8*) catch i8* null %1 = extractvalue { i8*, i32 } %0, 0 %2 = extractvalue { i8*, i32 } %0, 1 %3 = tail call i32 @llvm.eh.typeid.for(i8* bitcast (i8** @_ZTIi to i8*)) %matches = icmp eq i32 %2, %3 %4 = tail call i8* @__cxa_begin_catch(i8* %1) %5 = load i32, i32* @var, align 4 br i1 %matches, label %catch1, label %catch catch1: ; preds = %lpad %or3 = or i32 %5, 4 store i32 %or3, i32* @var, align 4 tail call void @__cxa_end_catch() br label %try.cont try.cont: ; preds = %entry, %catch1, %catch %6 = load i32, i32* @var, align 4 ret i32 %6 catch: ; preds = %lpad %or = or i32 %5, 8 store i32 %or, i32* @var, align 4 tail call void @__cxa_end_catch() br label %try.cont }
As part of the accesses to var an ADRP instruction is being generated and MachineCSE is hoisting it to the entry block, but it gets hoisted to after the call to fn so when we catch an exception the ADRP hasn't been executed so the loads and stores use an undefined base register.
Hi @john.brawn , thanks, I'm looking into it. At first glance, this instruction should have special flag to being non-hoistable.
The issue actually is not with this instruction, but it's related to exception handling. Hoisted instruction is inserted before getFirstTerminator(), but there could be EH_LABEL's which are not terminators, but could change control flow.
I'm to fix it.
Hi @john.brawn , here is the fix, could you please look to it: https://reviews.llvm.org/D63148
llvm/trunk/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
814 ↗ | (On Diff #197945) | Do we need to enhance the algorithm to consider more about register pressure on the profit calculation? I'm afraid there is performance drop when the register pressure is heavy. |
llvm/trunk/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
814 ↗ | (On Diff #197945) | Hi @LuoYuanke, yes, this could be the case. Actually this commit doesn't change profit calculation, ProcessBlockPRE() uses the same isProfitableToCommit() function as ProcessBlockCSE() uses. Do you have concrete test cases where register pressure increases? |
llvm/trunk/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
814 ↗ | (On Diff #197945) | CSE only eliminate MI, but current PRE insert MI to the common dominated block, so the instruction is hoisted to dominated block. It increase register pressure more than CSE. |
llvm/trunk/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
814 ↗ | (On Diff #197945) | Hi @LuoYuanke, actually PRE doesn't hoist instruction, but _duplicates_ it, real hoisting is made by CSE when it _eliminates_ instruction. I do believe that CSE should consider register pressure while eliminating instruction. Have you tried MachineLICMBase::isProfitableToHoist() with your test case? We can adopt and insert it to CSE. |
llvm/trunk/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
814 ↗ | (On Diff #197945) | Hi @anton-afanasyev, sorry for reply late. I'm trying to figure out this issue, but it seems complex for me. Can you run SPEC cpu2017/500.perlbench_r? There is some significant performance drop on X86 with the patch. If you have resource to run the benchmark, could you help to look into the issue? |
@anton-afanasyev
Hi,
Did you look into the SPEC cpu2017/500.perlbench_r issue? There is some significant performance drop on X86 with the patch. I ask you to revert the patch first, and when the SPEC2017 regression is fixed, we can submit the patch again. How do you think?
Hi @LuoYuanke, I'm sorry, it looks like my previous answer to you remained unsubmitted.
Here it is:
Hi @LuoYuanke , I have only access to SPEC cpu2006, I see 400.perlbench there. Could it be the case you know regressed sample from cpu2006?
Also, I have prepared new patch for review: https://reviews.llvm.org/D63934 -- I believe it can potentially decrease reg pressure.
At the moment, I believe that another patch (from Kai Luo) https://reviews.llvm.org/D64394 fixes this regression.
llvm/trunk/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
807 ↗ | (On Diff #197945) | Hi @anton-afanasyev , I have a concern here that CFG of LLVM IR might not be equivalent to CFG of Machine IR. |
@anton-afanasyev
Hi,
Do you have any performance data for the patch? I'd like to know what benchmark has performance gain with your patch. https://reviews.llvm.org/D64394 fixed perlbench regression, but I wonder what the performance gain do we achieve with the 2 patch?
Hi @LuoYuanke, I've benchmarked the first patch and posted results in two posts started from here: https://reviews.llvm.org/D56772#1376284.
But you are right: with the second patch the performance gain may be eliminated! I'm to measure second patch effect.
llvm/trunk/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
807 ↗ | (On Diff #197945) | Hi @lkail, yes this is possible case. But that is not an issue actually, for this case in place it would be just non-optimization. PRE will create instruction, but if CSE doesn't eliminate original instructions, than "hoisted" instruction will be deleted by Remove Dead Machine Instructions pass later. |
Hi @LuoYuanke, I've benchmarked the effect of the revertion my and @lkail patches.
The benchmark showed some increase of the exec_time:
~/llvm/test-suite/utils/compare.py --filter-short -m exec_time ~/llvm/test-suite-results/results_rel_base.json ~/llvm/test-suite-results/results_rel_base2.json vs ~/llvm/test-suite-results/results_rel_exp.json ~/llvm/test-suite-results/results_rel_exp2.json Program lhs rhs diff test-suite...Raw.test:BM_MULADDSUB_RAW/5001 13.96 17.35 24.3% test-suite...XRayFDRMultiThreaded/threads:4 568.16 634.38 11.7% test-suite...RayFDRMultiThreaded/threads:32 858.02 944.45 10.1% test-suite...XRayFDRMultiThreaded/threads:2 464.93 501.92 8.0% test-suite...algebra/kernels/symm/symm.test 14.00 15.12 7.9% test-suite...st:BM_BAND_LIN_EQ_LAMBDA/44217 39.41 37.34 -5.2% test-suite...test:BM_MULADDSUB_LAMBDA/44217 141.97 148.38 4.5% test-suite...Lambda.test:BM_EOS_LAMBDA/5001 9.88 9.48 -4.0% test-suite...ambda.test:BM_EOS_LAMBDA/44217 89.08 85.78 -3.7% test-suite...s/Halide/blur/halide_blur.test 1.77 1.83 3.6% test-suite...BenchmarkGame/Large/fasta.test 0.72 0.70 -3.3% test-suite...lcalsCRaw.test:BM_EOS_RAW/5001 9.86 9.54 -3.3% test-suite...calsCRaw.test:BM_EOS_RAW/44217 88.90 86.00 -3.3% test-suite...XRayFDRMultiThreaded/threads:8 983.92 1014.32 3.1% test-suite...RayFDRMultiThreaded/threads:16 979.55 1008.68 3.0%
So I'd like to leave these changes unreverted and to proceed with particular patches (like this one https://reviews.llvm.org/D63934). Please tell if you have any objections.
@anton-afanasyev, I have no objection. Thank you for the effort to improve the performance.
Very minor - but why can't the ++I be moved inside the for-loop increment statement?