This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Improve computation of scalarization overhead.
Needs Review · Public

Authored by jonpa on Oct 30 2018, 5:12 AM.

Details

Summary

I found loops whose vectorization cost was much overestimated when two instructions were both scalarized. Since one was using the result of the other, the defining instruction does not need to do the inserts, and the user does not have to extract any elements.

I experimented with this patch, which makes the LoopVectorizer collect any instructions that the target will scalarize (expand). This set is then used to find these cases and is (eventually) passed to getScalarizationOverhead(), which then returns a reduced value.
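The intended effect can be sketched roughly as follows (a pseudocode-level sketch against the LLVM C++ API; the helper name and parameters are hypothetical and do not reflect the actual patch):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"

// Hypothetical helper: the insert/extract overhead for one instruction that
// the target is known to expand into scalar operations. Inserts are skipped
// when every user is also expanded, and extracts are skipped for operands
// whose defining instruction is expanded as well.
static unsigned targetScalarizationOverhead(
    const llvm::Instruction *I, unsigned VF,
    const llvm::SmallPtrSetImpl<const llvm::Instruction *> &TargetExpanded,
    unsigned InsertCost, unsigned ExtractCost) {
  unsigned Overhead = 0;

  // Inserts for the result are only needed if some user still wants the
  // value as a vector.
  for (const llvm::User *U : I->users()) {
    const auto *UI = llvm::dyn_cast<llvm::Instruction>(U);
    if (!UI || !TargetExpanded.count(UI)) {
      Overhead += VF * InsertCost;
      break;
    }
  }

  // Extracts are only needed for operands that are produced as vector
  // values, i.e. whose defining instruction is not itself expanded.
  for (const llvm::Value *Op : I->operands())
    if (const auto *OpI = llvm::dyn_cast<llvm::Instruction>(Op))
      if (!TargetExpanded.count(OpI))
        Overhead += VF * ExtractCost;

  return Overhead;
}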

So far, in practice on SystemZ, this amounts to more float loops being vectorized, typically with the only benefit being vectorized memory operations. I am not sure how beneficial this is.

This should be easily usable by other targets as well, but so far it is SystemZ only.

Is this useful enough to include in the loop vectorizer?

Diff Detail

Event Timeline

jonpa created this revision. Oct 30 2018, 5:12 AM
nhaehnle removed a subscriber: nhaehnle. Nov 6 2018, 7:51 AM

I have found some more potential use for this:

  • Currently, a load is always widened if possible. However, if it is known that the user of the load will be scalarized by the target, I think it might be better to also scalarize the load rather than first loading the vector register and then extracting all the elements. This would be true if an extract costs more than 1, I think (see the rough comparison below this list).
  • In addition, if the target knows that both the load and its user will be scalarized, it can also consider folding the load as a memory operand into the user.
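As a rough back-of-the-envelope comparison for the first point (assuming, hypothetically, unit cost for both a scalar load and a widened vector load): widening costs about 1 + VF * C_extract once the users' extracts are counted, while VF scalar loads cost about VF, so scalarizing the load wins once an extract costs roughly 1 or more, in line with the guess above.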

So, please take a look anyone :-)

PS. A related subject here is the presence of loop-invariant operands, which will affect CodeGen. Did anyone try passing the Loop* or some kind of LoopInvariant flag for the operands?

jonpa added a comment. Nov 27 2018, 8:17 AM

ping!

Is this a fundamentally good idea, considering the future with VPlan, for instance?
Would this be acceptable if it would only affect the specific target(s) that desired this?

Sorry, I must have missed this review.

VPlan based cost modeling (plus VPlan based code motion) should naturally capture this kind of situation, but only to the extent that producer/consumer can reside in the same BB. It's taking a lot longer than I wanted to stabilize it (i.e., to compute exactly the same value as the existing cost model in LV). If you are interested in developing that area of the VPlan based cost model, I can clean up my workspace and upload what I have to Phabricator as a WIP patch.

When producer/consumer have to reside in separate BBs for some reason, the current recipe-based modeling (where a recipe resides in a BB) won't help much. As such, from a generalized implementation perspective, some kind of use-def based mapping (like the one used here) may be inevitable. So, from the technical aspect, we should discuss what the plausible scenarios are that force producer and consumer to reside in separate BBs, and whether those situations are rare enough to ignore.

typically with the only benefit being vectorized memory operations

I have mixed feelings here. ICC vectorizes relatively aggressively, and it has good enough reasons to do so. Having said that, it comes with the code size and compile time associated with aggressive vectorization (on top of vectorization not always being profitable in terms of execution time). So, if we are doing this, we should make it easy to tune (by the vectorizer developer as well as by the programmer using the compiler). This comment is not specific to this patch, though.

Sorry, I must have missed this review.

VPlan based cost modeling (plus VPlan based code motion) should naturally capture this kind of situation, but only to the extent that producer/consumer can reside in the same BB. It's taking a lot longer than I wanted to stabilize it (i.e., to compute exactly the same value as the existing cost model in LV).

Thanks for taking a look! IIUC, my patch is not useful since VPlan will soon improve this area without it.

I am curious as to how VPlan will accomplish this: will it also add some kind of check with TLI as to whether the instruction will be expanded, and propagate this information? Or is there some other way this may be accomplished?

Ayal added a comment.Nov 28 2018, 12:45 PM

Making better decisions about what to vectorize and what to keep scalar is clearly useful enough to include in the loop vectorizer. However, this is best done in a target-independent way; e.g., the way computePredInstDiscount() and sinkScalarOperands() work to expand the scope of scalarized instructions according to the cumulative cost discount of potentially scalarized instruction chains. Unless there's a good reason for it to be target specific(?)

Sorry, I must have missed this review.

VPlan based cost modeling (plus VPlan based code motion) should naturally capture this kind of situation, but only to the extent that producer/consumer can reside in the same BB. It's taking a lot longer than I wanted to stabilize it (i.e., to compute exactly the same value as the existing cost model in LV).

Thanks for taking a look! IIUC, my patch is not useful since VPlan will soon improve this area without it.

Not so quick.
The underlying supporting mechanism is VPReplicateRecipe, in VPlan.h. The parent of a VP*Recipe is VPBasicBlock. If both use and def belong to the same ReplicateRecipe, things are simple.
Your map-based query becomes a "do the instructions belong to the same Recipe" query. The question is, of course, can we always do that? If the answer is NO, then this approach has a hole that needs to be filled by some other means.

I am curious as to how VPlan will accomplish this: will it also add some kind of check with TLI as to whether the instruction will be expanded, and propagate this information? Or is there some other way this may be accomplished?

Recipes make instruction grouping (within a VPBasicBlock) easier to identify. If code motion across VPBasicBlocks is legal, we want to merge two or more ReplicateRecipes. In ideal cases, both use and def are in the same Recipe, so you don't need a map. You just ask for the cost of the ReplicateRecipe: scalar compute * VF for each instruction in the recipe + an extract for each live-in + an insert for each live-out. In the general case, however, use and def are in different ReplicateRecipes, so things aren't that simple. For each live-in, we should check whether it is computed in scalar form, and ditto for each live-out.
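For illustration (hypothetically assuming unit extract/insert and scalar-operation costs, and VF = 4): a recipe containing two chained scalarized instructions with one vector live-in and one vector live-out would cost 2 * 4 (scalar compute) + 4 (extracts for the live-in) + 4 (inserts for the live-out) = 16, with no extract/insert charged between the two instructions themselves.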

My question back to you is why Scalars is not good enough for your purpose. Do you get different "scalarization" answers in collectLoopScalars() and collectTargetScalarized()? If so, that's probably where you want to dig in.

/// Collect the instructions that are scalar after vectorization. An
/// instruction is scalar if it is known to be uniform or will be scalarized
/// during vectorization. Non-uniform scalarized instructions will be
/// represented by VF values in the vectorized loop, each corresponding to an
/// iteration of the original scalar loop.
void collectLoopScalars(unsigned VF);

As for the computation of the cost of a scalarized instruction:

I think the TTI-based per-instruction cost should fully account for extracts and inserts. I don't think we should change that.

What we need to change is the caller side. If the instruction is scalarized, its base cost should be the scalar execution cost * VF. We should then add an extract cost for each vector source (but only if not already accounted for at another scalar use site) and an insert cost if one or more of the uses is vector.

jonpa added a comment. Nov 29 2018, 2:08 AM

Making better decisions about what to vectorize and what to keep scalar is clearly useful enough to include in the loop vectorizer. However, this is best done in a target-independent way; e.g., the way computePredInstDiscount() and sinkScalarOperands() work to expand the scope of scalarized instructions according to the cumulative cost discount of potentially scalarized instruction chains. Unless there's a good reason for it to be target specific(?)

The only target-specific part I am thinking about is which instructions will later be expanded during *isel*.

My question back to you is why Scalars is not good enough for your purpose. Do you get different "scalarization" answers in collectLoopScalars() and collectTargetScalarized()?

My understanding is that currently the LoopVectorizer's notion of a scalarized instruction refers to an instruction scalarized at the *LLVM IR* level; in other words, the instructions it will itself emit in scalarized form. These are the instructions contained in Scalars[VF].

As an example, consider this loop:

define void @fun(i64 %NumIters, float* %Ptr1, float* %Ptr2, float* %Dst) {
entry:
  br label %for.body

for.body:
  %IV  = phi i64 [ 0, %entry ], [ %IVNext, %for.body ]
  %GEP1 = getelementptr inbounds float, float* %Ptr1, i64 %IV
  %LD1 = load float, float* %GEP1
  %GEP2 = getelementptr inbounds float, float* %Ptr2, i64 %IV
  %LD2 = load float, float* %GEP2
  %mul = fmul float %LD1, %LD2
  %add = fadd float %mul, %LD2
  store float %add, float* %GEP1
  %IVNext = add nuw nsw i64 %IV, 1
  %exitcond = icmp eq i64 %IVNext, %NumIters
  br i1 %exitcond, label %exit, label %for.body

exit:
  ret void
}

This loop is interesting because on z13 vector float operations are not supported, so they are expanded during instruction selection to scalar instructions. If forced to vectorize (even though costs would normally prevent it) with

clang -S -o - -O3 -march=z13 tc_targscal.ll -mllvm -unroll-count=1 -mllvm -debug-only=loop-vectorize -mllvm -force-vector-width=4

, the loop vectorizer produces this loop:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %0 = add i64 %index, 0
  %1 = getelementptr inbounds float, float* %Ptr1, i64 %0
  %2 = getelementptr inbounds float, float* %1, i32 0
  %3 = bitcast float* %2 to <4 x float>*
  %wide.load = load <4 x float>, <4 x float>* %3, align 4, !alias.scope !0, !noalias !3
  %4 = getelementptr inbounds float, float* %Ptr2, i64 %0
  %5 = getelementptr inbounds float, float* %4, i32 0
  %6 = bitcast float* %5 to <4 x float>*
  %wide.load6 = load <4 x float>, <4 x float>* %6, align 4, !alias.scope !3
  %7 = fmul <4 x float> %wide.load, %wide.load6
  %8 = fadd <4 x float> %wide.load6, %7
  %9 = bitcast float* %2 to <4 x float>*
  store <4 x float> %8, <4 x float>* %9, align 4, !alias.scope !0, !noalias !3
  %index.next = add i64 %index, 4
  %10 = icmp eq i64 %index.next, %n.vec
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !5

The cost computation looked like:

LV: Found an estimated cost of 0 for VF 4 For instruction:   %IV = phi i64 [ 0, %entry ], [ %IVNext, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %GEP1 = getelementptr inbounds float, float* %Ptr1, i64 %IV
LV: Found an estimated cost of 1 for VF 4 For instruction:   %LD1 = load float, float* %GEP1, align 4
LV: Found an estimated cost of 0 for VF 4 For instruction:   %GEP2 = getelementptr inbounds float, float* %Ptr2, i64 %IV
LV: Found an estimated cost of 1 for VF 4 For instruction:   %LD2 = load float, float* %GEP2, align 4
LV: Found an estimated cost of 16 for VF 4 For instruction:   %mul = fmul float %LD1, %LD2
LV: Found an estimated cost of 16 for VF 4 For instruction:   %add = fadd float %LD2, %mul
LV: Found an estimated cost of 1 for VF 4 For instruction:   store float %add, float* %GEP1, align 4
LV: Found an estimated cost of 1 for VF 4 For instruction:   %IVNext = add nuw nsw i64 %IV, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %exitcond = icmp eq i64 %IVNext, %NumIters
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %exitcond, label %exit, label %for.body

So the costs for the vectorized float operations have been calculated by the target as 2x4 extracts + 4 mul/add + 4 inserts = 16. The loop vectorizer has produced vector instructions, and as far as it is concerned, that's what they are.

However, the assembly output looks like:

.LBB0_4:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vl      %v2, 0(%r1,%r3)
        vl      %v3, 0(%r1,%r4)
        vrepf   %v0, %v3, 1
        vrepf   %v1, %v2, 1
        vrepf   %v4, %v2, 2
        vrepf   %v5, %v2, 3
        vrepf   %v6, %v3, 2
        vrepf   %v7, %v3, 3
        meebr   %f1, %f0
        meebr   %f2, %f3
        meebr   %f4, %f6
        meebr   %f5, %f7
        aebr    %f5, %f7
        aebr    %f4, %f6
        aebr    %f2, %f3
        aebr    %f1, %f0
        aghi    %r5, -4
        vmrhf   %v4, %v4, %v5
        vmrhf   %v0, %v2, %v1
        vmrhg   %v0, %v0, %v4
        vst     %v0, 0(%r1,%r3)
        la      %r1, 16(%r1)
        jne     .LBB0_4

, which is 2 vector loads + 6 extracts (fp element 0 overlaps the vector register and does not need an extract) + 4 fp multiplies (meebr) + 4 fp adds (aebr) + 3 inserts + a vector store.

There is no need to insert into and extract from a vector register between the meebr and aebr instructions. This is where the costs of 16 are wrong; they should have been lower.
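As a rough back-of-the-envelope with the same unit costs as in the 2x4 extracts + 4 operations + 4 inserts = 16 breakdown above: the fmul would then cost 8 extracts + 4 multiplies = 12 (no inserts, since its only user is itself scalarized), and the fadd would cost 4 extracts (for %wide.load6) + 4 adds + 4 inserts = 12 (no extracts needed for %mul), instead of 16 each.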

My question then is how to fix this? My idea was to let the loop vectorizer keep thinking about these instructions as vectorized, while being aware of a later expansion during isel (perhaps ISelExpanded would be a better name than TargetScalarized?).

This could be used to compute better scalarization overhead costs. It would also be interesting at least on SystemZ to use to do scalar stores/loads instead of widening if the def / user is "isel-expanded". This could perhaps be controlled by TTI.supportsEfficientVectorElementLoadStore(), which is already available.

Is this making sense?

Making better decisions about what to vectorize and what to keep scalar is clearly useful enough to include in the loop vectorizer. However, this is best done in a target-independent way; e.g., the way computePredInstDiscount() and sinkScalarOperands() work to expand the scope of scalarized instructions according to the cumulative cost discount of potentially scalarized instruction chains. Unless there's a good reason for it to be target specific(?)

The only target-specific part I am thinking about is which instructions will later be expanded during *isel*.

My question back to you is why Scalars is not good enough for your purpose. Do you get different "scalarization" answers in collectLoopScalars() and collectTargetScalarized()?

My understanding is that currently the LoopVectorizer's notion of a scalarized instruction refers to an instruction scalarized at the *LLVM IR* level; in other words, the instructions it will itself emit in scalarized form. These are the instructions contained in Scalars[VF].

I think you are either 1) arm-twisting the vectorizer to emit vector code which you know will be scalar, or 2) arm-twisting the vectorizer's cost model into believing that what you are emitting as "vector" is really scalar. I certainly do not see why "you have to" do that, because letting the vectorizer emit scalar IR instructions in those cases should be "equivalent". So, why "do you WANT to" do that? The IR coming out of the vectorizer may be more compact, but what that will accomplish is cheating all downstream optimizers and their cost models.

It seems to me that what you are trying to do here is bring a "packed scalar" notion into LLVM IR, i.e., part of the IR uses a vector data type but it's really just a compact form of VF scalar operations. I suggest bringing this to a wider RFC discussion with the rest of the community, but I'm sure one of the questions people will ask would be "why?".

My question then is how to fix this? My idea was to let the loop vectorizer keep thinking about these instructions as vectorized, while being aware of a later expansion during isel (perhaps ISelExpanded would be a better name than TargetScalarized?).

So there are basically two possible ways to model sequences like this:

  1. The vectorizer models/emits the instructions as "vector" instructions, but gives a discount to back-to-back instructions which will be scalarized.
  2. The vectorizer compares vector and scalar instructions based on the cost model, and explicitly scalarizes instruction sequences if they would be cheaper.

This patch implements the first possibility, but the second is more useful, I think; it's closer to how the underlying target actually works, and it composes with other cost modeling more easily. For example, for some operations, a vector instruction exists, but it's only a little cheaper than transforming it into an extract+scalar+insert sequence.

jonpa added a comment. Dec 12 2018, 2:01 AM

Thank you for your feedback!

I think you are either 1) arm-twisting the vectorizer to emit vector code which you know will be scalar, or 2) arm-twisting the vectorizer's cost model into believing that what you are emitting as "vector" is really scalar. I certainly do not see why "you have to" do that, because letting the vectorizer emit scalar IR instructions in those cases should be "equivalent". So, why "do you WANT to" do that? The IR coming out of the vectorizer may be more compact, but what that will accomplish is cheating all downstream optimizers and their cost models.

I am just trying to keep it simple by not changing how LV generates code, but merely improving the cost computations. Changing the output of a vectorized loop seems like a much bigger project, which I did not attempt.

So there are basically two possible ways to model sequences like this:

  1. The vectorizer models/emits the instructions as "vector" instructions, but gives a discount to back-to-back instructions which will be scalarized.
  2. The vectorizer compares vector and scalar instructions based on the cost model, and explicitly scalarizes instruction sequences if they would be cheaper.

I can see the benefit of (2) in cases where an instruction is vectorizable but sits in between two scalarized instructions, like Scal -> Vec -> Scal. In such a case it may be better to also scalarize the vectorizable (Vec) instruction.

Are you proposing some kind of search over instruction sequences with some limited lookahead? Like comparing the cost of scalarizing I1, I2, I3 with the cost of vectorizing them? So {Extr + Scal1 + Scal2 + Scal3 + Ins} would be compared against {Vec1 + Vec2 + Vec3}, where a target-expanded vector instruction would automatically (as it currently does) include the scalarization cost...
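For illustration (with VF = 4 and hypothetical unit costs throughout): {Extr + Scal1 + Scal2 + Scal3 + Ins} would come to 4 + 3 * 4 + 4 = 20, to be weighed against {Vec1 + Vec2 + Vec3} = 3 if all three operations have cheap vector forms, or against something like 3 * 16 = 48 if all three would be target-expanded as in the example above.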

I suspect that an algorithm that makes these decisions would benefit from knowing which instructions the target *must* scalarize (no vector instruction available). I think those would be the starting points for these searches, wouldn't they? After all, if all instructions are vectorizable, this step could simply be skipped. In that sense, my patch might be a first step in this direction.

Are you proposing some kind of search over instruction sequences with some limited lookahead?

Yes, something like this.

I suspect that an algorithm that makes these decisions would benefit from knowing which instructions the target *must* scalarize

I think you would just start with instructions where the cost model says that VectorOperationCost > ScalarOperationCost*VF. It doesn't really matter why it's expensive. (I guess at some point, we might want to model which execution units are used by a vector instruction, but I think you'd want a different interface for that.)

I think you are either 1) arm-twisting the vectorizer to emit vector code which you know will be scalar, or 2) arm-twisting the vectorizer's cost model into believing that what you are emitting as "vector" is really scalar. I certainly do not see why "you have to" do that, because letting the vectorizer emit scalar IR instructions in those cases should be "equivalent". So, why "do you WANT to" do that? The IR coming out of the vectorizer may be more compact, but what that will accomplish is cheating all downstream optimizers and their cost models.

I am just trying to keep it simple by not changing how LV generates code, but merely improving the cost computations. Changing the output of a vectorized loop seems like a much bigger project, which I did not attempt.

From my perspective, that's not simple at all. LV already has a mechanism to scalarize instructions, and ways to compute the cost of those scalarized instructions. If everyone keeps adding their own way of scalarizing (or of computing the cost for it), that adds unnecessary complexity.

Please see if you can add a TTI-based "can this instruction be vectorized" check inside this function. Consider something like isScalarWithoutPredication(), analogous to isScalarWithPredication().

void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF)

Once the instruction is added to InstsToScalarize, the rest of the cost model you desire should just happen; if not, that's where you'd want to contribute, to the benefit of everyone including you.
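A minimal sketch of the shape such a check could take (simplified and hypothetical; the cost inputs stand in for whatever per-instruction cost queries the cost model provides):

// Hypothetical sketch: an instruction is worth scalarizing "without
// predication" when the target's cost for the vector form exceeds the cost
// of simply replicating the scalar form VF times (e.g. because the target
// would expand the vector instruction into scalars during isel anyway).
static bool shouldScalarizeWithoutPredication(unsigned VectorFormCost,
                                              unsigned ScalarFormCost,
                                              unsigned VF) {
  return VectorFormCost > ScalarFormCost * VF;
}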

Then, go look at the Recipe construction. Again, we need a check similar to isScalarWithPredication, and we should return false here.

bool VPRecipeBuilder::tryToWiden(Instruction *I, VPBasicBlock *VPBB, VFRange &Range)

That should force the creation of a replicate recipe.

Hopefully, that's all you need to do to serialize those vector FP operations for which you don't have HW vector support.

Ayal added a comment. Dec 12 2018, 1:39 PM

Are you proposing some kind of search over instruction sequences with some limited lookahead?

Yes, something like this.

Agreed. Again, computePredInstDiscount() and sinkScalarOperands() already perform such a search. See also "Throttled SLP" by Vasileios in LLVM Dev 2015, PACT 2015.

I suspect that an algorithm that makes these decisions would benefit from knowing which instructions the target *must* scalarize

I think you would just start with instructions where the cost model says that VectorOperationCost > ScalarOperationCost*VF. It doesn't really matter why it's expensive. (I guess at some point, we might want to model which execution units are used by a vector instruction, but I think you'd want a different interface for that.)

The interface is already there: expectedCost() notes whether the best cost for a given VF is obtained by using vector or scalar instructions. But this is only used to report whether any vector instructions will actually be generated, to comply with the user's "vectorize" directive.

jonpa updated this revision to Diff 178627. Dec 18 2018, 2:21 AM

Thanks for the review and suggestions! I tried your ideas and was pleasantly surprised that explicitly scalarizing the instructions, instead of just correcting the costs for them as I first did, actually worked quite "out of the box", as you said. :-)

I still think this patch is experimental in the sense that I haven't seen any convincing benchmark improvements on my target (SystemZ) to motivate this extra code in the LoopVectorizer, although it seems to do what it's intended for. I see two main parts in this patch now.

  1. Find instructions (sequences) that will be scalarized by the target, and explicitly scalarize them with correct costs, just like for predicated instructions.
  2. Scalarize loads and stores that are connected to a scalarized user/def, instead of always widening them, when that is cheaper.
  • For (1), I implemented isScalarWithoutPredication(), per your suggestion, to compare the "scalar * VF" cost against the "vector" cost. Unfortunately, I had to add an "InPassedVF" argument to getInstructionCost() and getMemoryInstructionCost(), so as not to confuse those methods, which may pass the Instruction pointer to the target implementation. When the scalar cost (VF=1) is queried for a vectorized context, this argument is set to false so that the target does not treat the instruction as a scalar instruction in a scalar context and find folding opportunities that may not be present in the vectorized loop. I then took inspiration from computePredInstDiscount() and implemented computeInstDiscount() to find the scalarized sequences. I did not attempt to search past vectorizable instructions, but it recognizes scalarized loads/stores.

At this point I noticed that on SystemZ (z13) this primarily affected ~100 loops containing the float (fp32) type, which is scalarized. Typically, these are of the form "load; op(s); store", with some variations. The patch made those vector loops cheaper (with more correct scalarization costs), while the scalar version still suffered a bit from not considering loads folded into the scalar operations such as fadd. The operand scalarization cost was also improved by checking for loop-invariant operands, which is not currently done. (Side notes: I saw that VF=2 now gets good treatment since the loop is scalarized, and the DAG problem of expansion to VF=4 disappeared. Also, I had to disable this for integer div/rem, since we still see spilling if those loops are vectorized.)

  • (2) needs (1) to work, since the correct cost is only attained if, e.g., the scalarized instruction using the load does not also get an (incorrect) extraction cost. Basically, if the cost of the vector instruction plus extractions/insertions is greater than the cost of the scalar instructions, the memory access is scalarized instead of widened.

Unfortunately, this could not be done with computeInstDiscount() as a simple extension of (1), since memory accesses are dealt with before anything else; it had to be done in setCostBasedWideningDecision(). It was also not possible to use the new isScalarWithoutPredication() at this point, since getInstructionCost() cannot be called before the scalars have been collected. I therefore used the TLI check to see which instructions would be scalarized, wrapped in a target hook I named preferScalarizedMemFor(). Perhaps there is a better way to do this, but for SystemZ I wanted to avoid extractions into GPRs (integer extractions), and I also prefer to store 2 x 64 bits instead of 128 bits if the wide store requires inserts.
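In essence, the decision described above boils down to a comparison along these lines (a simplified sketch with hypothetical names and parameters, not the actual hook):

// Hypothetical sketch of the widening decision in (2): prefer scalarizing a
// load/store when the widened access, plus the per-lane extracts/inserts its
// scalarized users/defs would then need, is costlier than VF scalar accesses
// (which may later be folded into the scalar operations during isel).
static bool preferScalarizedMemAccess(unsigned WidenedAccessCost,
                                      unsigned PerLaneTransferCost,
                                      unsigned ScalarAccessCost, unsigned VF) {
  return WidenedAccessCost + VF * PerLaneTransferCost > VF * ScalarAccessCost;
}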

This seemed beneficial for loads in places, but I saw that LSR generated poor code for the 2 x 64-bit stores, with duplicated address computations in address registers instead of using an immediate offset.

Unfortunately, this could not be done with computeInstDiscount() as a simple extension of (1), since memory accesses are dealt with before anything else.

If we change "memory widening decision" from "hard decision" to "soft decision to be finalized after collect scalars", we might be able to solve that. In general, I don't think this is any different from data flow based optimization problem

  • and the standard solution there is do this iteratively. So, some form of going back and revise the prior decision is inevitable if we want to improve this.

In any case, the newer code you have written is more generically applicable, and the improvements needed on top of that are also generically applicable. Best of all, we don't have to teach downstream optimizers about weirdness in cost modeling.
That's what I was aiming for in reviewing your change.

Is the uploaded code ready for review? Or shall we continue discussing the approach?

jonpa added a comment. Jan 6 2019, 11:53 PM

Unfortunately, this could not be done with computeInstDiscount() as a simple extension of (1), since memory accesses are dealt with before anything else.

If we change "memory widening decision" from "hard decision" to "soft decision to be finalized after collect scalars", we might be able to solve that. In general, I don't think this is any different from data flow based optimization problem

  • and the standard solution there is do this iteratively. So, some form of going back and revise the prior decision is inevitable if we want to improve this.

I was not sure this is a natural extension of the current algorithm. Currently, the memory widening decisions are made first, and the loop uniforms and scalars are then collected based on them. Are you saying that changing a decision from e.g. 'widen' to 'scalarize' could be done without affecting the set of uniforms/scalars? Perhaps it could... Scalarizing a widened access might just mean some extra accesses with immediate offsets, which would not require any changes to the induction variables...

In any case, the newer code you have written is more generically applicable, and the improvements needed on top of that are also generically applicable. Best of all, we don't have to teach downstream optimizers about weirdness in cost modeling.
That's what I was aiming for in reviewing your change.

Yes, I agree it looks better now :-)

Is the uploaded code ready for review? Or shall we continue discussing the approach?

It's quite ready, except for our discussion above.

Also, in the spirit of keeping the LoopVectorizer as clean as possible, I think we should have some more people agreeing that this is a general improvement. If that were the case, I would be happy to commit, but I can't say at the moment whether this really improves things enough to go in... It may be that this is mostly a theoretical improvement which may or may not be worth having, depending on the complexity of the patch...