- Removed support for unaligned accesses.
- Added a couple of fp tests.
Thinking about this some more, I think it would be best to at least check some features of the loop for legality:
- no vector widths greater than 128 bits.
- all vector operations should have the same number of lanes.
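The two checks above could be sketched roughly as follows. This is a standalone illustration, not the actual vectorizer code: `VecOp` is a hypothetical stand-in for a vector operation found in the loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record for a vector operation in the loop:
// element width in bits and number of lanes.
struct VecOp {
  unsigned ElemBits;
  unsigned NumLanes;
  unsigned widthInBits() const { return ElemBits * NumLanes; }
};

// Returns true only if every vector op fits in 128 bits and all
// ops share the same number of lanes, per the checks above.
static bool isLegalLoop(const std::vector<VecOp> &Ops) {
  if (Ops.empty())
    return true;
  unsigned Lanes = Ops.front().NumLanes;
  for (const VecOp &Op : Ops) {
    if (Op.widthInBits() > 128)
      return false; // wider than 128 bits
    if (Op.NumLanes != Lanes)
      return false; // lane count mismatch
  }
  return true;
}
```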
Wed, Oct 16
Tue, Oct 15
I agree that it would be good to know why a loop isn't a candidate, but this looks like a good start to me!
How about adding support for the QD* versions as well?
- Addressed comments in the dag combiner.
- Changed the X86 backend so that extending masked loads are not desirable.
- Changed the ARM backend so that expanding extending masked loads is not desirable.
- Added more tests.
Thanks @craig.topper. I'll add the necessary changes into the X86 backend.
Mon, Oct 14
@craig.topper This patch currently causes an isel failure for pr35443.ll when a v4i8 masked load is being zero extended into a v4i64. I know nothing about AVX, could you please advise whether this operation is supported or how to address the issue? Thanks.
- Rebased so we're now using MaybeAlign.
- Removed codegen support for unaligned masked loads.
- Added tests for wider than 128-bit vectors.
- Added loop vectorize tests for unaligned accesses.
Tue, Oct 8
- Now using MaybeAlign
- Now using an alignment helper function in the vectorizer.
Looks like it! I'll try it.
Yes, it turns out Align is simple to use, the constructor just takes the unsigned value... but it will not accept zero and so causes assertion failures.
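That zero-handling difference can be mocked up like so. This is an illustrative sketch of the behaviour described above, not the actual llvm::Align / llvm::MaybeAlign implementation:

```cpp
#include <cassert>
#include <optional>

// Mock Align: constructed from an unsigned, but a zero value trips
// the assertion, as described above.
struct Align {
  unsigned Value;
  explicit Align(unsigned V) : Value(V) {
    assert(V != 0 && "Align must be non-zero");
  }
};

// MaybeAlign-style wrapper: "no alignment known" is an empty optional
// rather than a zero, so call sites passing 0 don't assert.
using MaybeAlign = std::optional<Align>;

inline MaybeAlign toMaybeAlign(unsigned V) {
  return V ? MaybeAlign(Align(V)) : std::nullopt;
}
```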
Mon, Oct 7
Just wondering, who generates the intrinsics? From the little that I remember from the last time I looked, I thought it was clang, but that it was x86 specific? Do we have some hooks somewhere saying that we support them?
I had missed the shift value on the input patterns.
Rebased and added a '1' shift value for strh.
Fri, Oct 4
Now handling a bitcast passthru value in LowerMLOAD. Corrected the half load address values.
I'd missed out the test changes that also change in D68337.
I saw that, but wasn't sure how to use it! The alignment is just an unsigned value at all the call sites, which is why I went for that.
- Moved the combine into generic dagcombine.
- Now checking memory alignment to decide legality.
- Not allowing v2 vectors.
- Masked load patterns are now explicitly either aligned or unaligned.
- Added more tests.
Thu, Oct 3
Thanks for those points, I'll add loads more tests.
Wed, Oct 2
Added support for VFMA and VMLA
Tue, Oct 1
Mon, Sep 30
Fri, Sep 27
Thu, Sep 26
Cheers. I'm gonna put this patch on hold to do some more investigations into how hardware loops and loop unrolling interact... The current testing for this is in Transforms/HardwareLoops/ARM/structure.ll
Inverted the predicate and created a whitelist instead.
Cheers Dave, top-bottom and early clobber sound like complications. Having a whitelist sounds like a good idea, with the list of our unknowns growing, it sounds like the switch statement could be smaller too!
We had a bug involving SIToFP, and then I got worried about the signext arg in the reproducer too, as I had just forgotten that we explicitly zext any arguments used. Adding some profitability checks is on my todo list though.
Wed, Sep 25
Thanks both. I've added a threshold to the number of loads that we can inspect, which causes us to bail before examining any loads. I've also added an early exit into the troublesome loop in RecordMemoryOps.
Tue, Sep 24
Updated a couple of the tests.
Mon, Sep 23
Fri, Sep 20
Added a few more tests.
Not across the loop because LoopEnd implicitly defines it.
Created Cleanup function.
So the pass shouldn't attempt to convert a loop that contains any vector intrinsics, other than masked.load and masked.store. So, we will actually need to do extra work for this to operate with MVE intrinsics. Even once we've done that, the pass only tries to remove the old icmp predicates, which we pattern match. If the user defined predicates match those that the vectorizer outputs, then there's no reason why we won't perform this transform. I would say that I'd add a test, but we don't have intrinsic support yet...
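The bail-out described above could look something like this. A hedged sketch only: the intrinsic names are matched as plain strings here, standing in for however the real pass identifies calls in the loop.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Walk the intrinsic calls found in the loop and reject the loop if
// any vector intrinsic other than masked.load / masked.store appears.
static bool loopIsConvertible(const std::vector<std::string> &Intrinsics) {
  for (const std::string &Name : Intrinsics) {
    if (Name != "llvm.masked.load" && Name != "llvm.masked.store")
      return false; // any other intrinsic blocks the transform
  }
  return true;
}
```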
Added a couple more tests.
Thu, Sep 19
Added 'hasSideEffects' to VCTP because MachineCSE did a good job of removing the duplicate vctp in the exit block. I think the cost of always executing the 'extra' instruction is outweighed by the risk of having to spill/reload the VPR, or of not being able to perform proper tail predication.
Now cloning VCTP in the exit block if needed. The reason being that if we create a real TP loop, then VPR will not hold the predicate.