Today
We possibly do have a test like that in another file, but I've made sure we have a specific one in reassoc_nonfast_f32x4 at the end of the newly added tests.
Sounds like a nice idea. This may want to custom lower 128-bit vectors too, to turn them into vpmin.u8 lowhalf, highhalf as a first step.
Yesterday
Hi - I like the idea of generating tbl instructions from non-constant masks. Does the patch need an update to be based against main?
Fri, Mar 17
I think this makes a lot of sense. Especially if we are treating many shuffles with a cost of 1, floating point operations should not be twice the cost. We could consider doing the same for fmul from looking at software optimization guides, but the changes for fadd/fsub already have a high likelihood of causing some large changes. Adding fneg is worth it though as that should be a simple operation.
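For concreteness (my own example, not from the patch), this is the sort of operation whose throughput cost sits next to a cost-1 shuffle:

  define <4 x float> @fadd_v4f32(<4 x float> %a, <4 x float> %b) {
    %r = fadd <4 x float> %a, %b
    ret <4 x float> %r
  }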
Hello - yeah I was thinking of 2, as isConsecutivePtr is fairly simple. The Arm backend uses getPtrStride and is more careful about the strides it allows. LD2/LD4 are ruled out, but other strides are allowed to use the gather/scatters for example. It will miss the SymbolicStrides and "Assume" checks though.
Thu, Mar 16
LoopVectorizationLegality::containsDecreasingPointers seems to loop over all the instructions and call isConsecutivePtr, which just calls getPtrStride. Could that logic just be placed in AArch64TTIImpl::preferPredicateOverEpilogue? That is how it has worked in ARMTTIImpl::preferPredicateOverEpilogue via canTailPredicateLoop. Otherwise the code in containsDecreasingPointers is run for any architecture, but only used by AArch64.
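For reference, as I understand it the access containsDecreasingPointers is looking for is a loop whose pointer steps backwards, something like this minimal sketch (not taken from the patch), which walks %p from element %n-1 down to 0:

  define void @reverse(ptr %p, i64 %n) {
  entry:
    br label %loop
  loop:
    %i = phi i64 [ %n, %entry ], [ %i.next, %loop ]
    %i.next = add i64 %i, -1
    %gep = getelementptr float, ptr %p, i64 %i.next
    %v = load float, ptr %gep
    %m = fmul float %v, 2.0
    store float %m, ptr %gep
    %done = icmp eq i64 %i.next, 0
    br i1 %done, label %exit, label %loop
  exit:
    ret void
  }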
I had written a very similar patch recently, but it would only use the fixed length if the scalable was unknown. The performance of it was pretty bad though, so I ended up dropping it. I had noticed that there is an xfail in llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll at the moment. Can it now be replaced with a check for store <vscale x 4 x i32>?
Wed, Mar 15
Thanks. Sometimes you wonder how things ever worked. LGTM.
Tue, Mar 14
Does this fix https://github.com/llvm/llvm-project/issues/61203?
It looks like this comes up from odd-numbered vector elements quite a bit. The AArch64 changes all look OK.
I checked the AArch64 tests and they all seem to be valid. LGTM, thanks.
Mon, Mar 13
Brilliant, thanks. LGTM.
This is too old to be useful now and I don't have any plans to work on it in the near term. (It would be good to see improvements though, where the vplan is costed more directly as opposed to continuing to go through the IR instructions).
The CostKind can be TCK_RecipThroughput (the default and the one we usually care most about), TCK_Latency, TCK_CodeSize or TCK_SizeAndLatency. I think if we have the code we might as well get TCK_CodeSize correct and return 0 in that case, so the load+dup have a combined cost of 1. TCK_Latency and TCK_SizeAndLatency I'm less sure about, perhaps leave them with the same costs as TCK_RecipThroughput?
Sorry for the delay - I've had less time than I would like to get back to this. I have updated and rebased the patch. There is still one large MVE issue I need to work through, and the combo of epilog vectorization + DataAndControlFlow is currently not working correctly. I will split that off into another patch though as it is a bit of a more involved change. Plus there is another patch for letting this be controlled by the target or an option.
Fri, Mar 10
Hello. This looks like a nice idea. We have done some work to make v4i8 better in the recent past. I hadn't realized that the slp vectorizer wasn't making use of that for stores.
If you can move the call to identifySymmetricOperation into this patch, then it LGTM. Cheers
If you can move the identifySymmetricOperation call to the other patch then what remains LGTM. Thanks.
This LGTM but please make sure the correct parts are in the correct patches. I don't think this will actually build on its own.
LGTM. Thanks
Yeah nothing else from me. LGTM, thanks for the changes.
Hello. I had to remind myself where this came from. It looks like it was introduced in D123638, and there were some comments already about the performance not always being ideal. It apparently helped for some <2 x double> vectorization. I'm not sure if there is a perfect answer, but an effective cost of 2 for the throughput of the ld1r would seem to match the hardware better. This doesn't alter isLegalBroadcastLoad and the tests added in D123638 don't seem to change.
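For reference, the pattern in question is the usual load-plus-splat, e.g. this minimal sketch (my example, not from D123638):

  define <2 x double> @splat_load(ptr %p) {
    %s = load double, ptr %p
    %ins = insertelement <2 x double> poison, double %s, i64 0
    %splat = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer
    ret <2 x double> %splat
  }

which should select to a single ld1r on AArch64.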
Wed, Mar 8
There is hopefully a fix in 1c6ea961938488997712763762079e535b8b704. Please let me know if that does or doesn't fix your issue, and if you have details on getting assembly from MLIR. Thanks
Hi - thanks for the report. It sounds like the offset might be wrong from the look of the assembly. These instructions specifically:
10534: 40 f4 7f 3d   ldr b0, [x2, #4093]
vs
10534: 46 0c 00 d1   sub x6, x2, #3
1053c: c0 00 40 0d   ld1 { v0.b }[0], [x6]
Tue, Mar 7
I think it's worth adding tests for both the ashr and lshr versions, but otherwise I think this LGTM. Thanks
Mon, Mar 6
The add/sub sounds like a nice change.
Thanks. Here are some Alive2 proofs for the transform: https://alive2.llvm.org/ce/z/N6hwQY and https://alive2.llvm.org/ce/z/u_GjYJ.
Thu, Mar 2
Thanks. LGTM
It's a bit odd to read code with the filter-out push/pop, but LGTM.
Sounds good to me.
I don't think this is the right approach to take. As far as I understand, this will affect any aarch64-none-eabi targets that are compiling for bare metal (as the tests show), causing a performance hit to any bare metal application. They shouldn't have to pay the price if they don't actually use x18 for anything, which many will not.
Wed, Mar 1
Additional tests sound OK. We don't always match splats to a single cost like we should.
Hi - I was just looking at the patch whilst you updated it! Please ignore any comments that don't apply any more.
Sounds good. I'm a little surprised that, if we have a vector load where only one lane is demanded, we don't change it into a scalar load.
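i.e. something like this minimal sketch (my example), where only lane 0 of the load is demanded and the load could in principle be narrowed to a scalar i32 load:

  define i32 @one_lane(ptr %p) {
    %v = load <4 x i32>, ptr %p
    %e = extractelement <4 x i32> %v, i64 0
    ret i32 %e
  }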
Mon, Feb 27
Cheers
Fri, Feb 24
Hello. Sorry for the delay in looking at this but I wasn't sure exactly what you were trying to do, and I've never been a huge fan of DAG combines that create the wrong node just to expand it later. It looks like for legal types this can lead to a nice decrease in instruction count though.
Tue, Feb 21
Sounds good to me. This LGTM but you may be able to remove the extra pattern, and it may be worth quickly adding tests for each of the 4 type sizes.
Hi - We noticed some regressions from this. The largest ones appear to be similar to the potential regressions that were reported in https://reviews.llvm.org/D142359#4131731, but with code that includes intrinsics and didn't change the vector cost: https://godbolt.org/z/P1zoxz85f.
Mon, Feb 20
Yeah, they sound OK to add to me. The tests LGTM, whether you want to update the triple or not.
Do you mean the AES pass tests? Do you think that should happen before we land this? They've been an issue for a little while.
Thanks. From what I can see this LGTM.
Can you make sure you upload with full context (-U999999), as per https://llvm.org/docs/DeveloperPolicy.html#making-and-submitting-a-patch. It can help to make the patch easier to review.
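If you're working from git, something like git diff -U999999 @{u} > mypatch.patch should produce a diff with full context (assuming an upstream branch is set; the output filename is just an example).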
Would it be possible to optimize the ADDCARRY to the same result as without this fold? Similar to combineADDCARRYDiamond. I looked at the DAG that was being produced, but it's not obvious to me how it would be sensibly combined to the same result as before.
I think this change is OK for AArch64 too, I don't think it will change much in practice. Some of the tests may not be testing what they did in the past though.