This patch allows the mixing of scaled and unscaled load/stores to form
load/store pairs.
PR24465
Differential D12116: [AArch64] Improve load/store optimizer to handle LDUR + LDR.
ab, Aug 18 2015, 2:12 PM
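For a concrete picture of what the patch enables: an unscaled access such as ldur x1, [x0, #-8] followed by a scaled access such as ldr x2, [x0] touches two adjacent 8-byte slots and can be rewritten as ldp x1, x2, [x0, #-8]. The sketch below is a hypothetical, simplified model of the pairing check, not the actual AArch64LoadStoreOptimizer code; the point it illustrates is that the scaled LDR/STR immediate has to be converted to a byte offset before it can be compared with the unscaled LDUR/STUR offset.

    // Hypothetical sketch (names and structure are illustrative only): decide
    // whether a scaled access and an unscaled access address adjacent memory.
    #include <cstdint>

    struct MemAccess {
      int64_t Offset;   // immediate as written in the instruction
      int64_t MemSize;  // access width in bytes (also the scale factor)
      bool IsUnscaled;  // LDUR/STUR use a raw byte offset; LDR/STR scale by MemSize
    };

    // Returns true if the two accesses could form an LDP/STP, assuming the same
    // base register, the same width, and no intervening writes to that memory.
    bool canFormPair(const MemAccess &A, const MemAccess &B) {
      if (A.MemSize != B.MemSize)
        return false;
      // Normalise both offsets to bytes so scaled and unscaled forms compare equally.
      int64_t ByteOffA = A.IsUnscaled ? A.Offset : A.Offset * A.MemSize;
      int64_t ByteOffB = B.IsUnscaled ? B.Offset : B.Offset * B.MemSize;
      // The accesses must be exactly one element apart, in either order.
      return ByteOffA - ByteOffB == A.MemSize || ByteOffB - ByteOffA == A.MemSize;
    }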
I haven't done any testing beyond basic lit tests. Hopefully, I'll have correctness/perf testing in flight by tomorrow. Regardless, I think this is in good enough shape to review.
Hi Chad, I checked on a small testcase, and with this patch we do merge STUR and STR. There is one potential issue though: in some cases we intentionally split STP into two STUR/STR, as there is a big performance penalty if an STP crosses a cache line (see e.g. performSTORECombine in AArch64ISelLowering.cpp and getMemoryOpCost in AArch64TargetTransformInfo.cpp). So, I think we might want to perform this combining only for non-temporal or known-to-be-well-aligned memory accesses. What do you think, does it make sense? Thanks, Michael
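For readers not familiar with the split Michael refers to: on some subtargets the backend intentionally breaks an under-aligned 128-bit store into two 64-bit halves so that no single store can straddle a cache line, and those halves are what later show up as an STUR/STR pair. The snippet below is only a rough, hypothetical model of that heuristic, not the actual performSTORECombine logic.

    // Rough, illustrative model of the store-splitting heuristic (hypothetical
    // helper, not LLVM code): wide stores that are not known to be sufficiently
    // aligned may cross a cache-line boundary and are split into two halves.
    #include <cstdint>

    bool shouldSplitWideStore(uint64_t StoreBytes, uint64_t AlignBytes) {
      if (StoreBytes != 16)
        return false;   // only 128-bit stores are considered here
      if (AlignBytes >= 16)
        return false;   // a 16-byte-aligned store cannot straddle a 64-byte line
      return true;      // under-aligned: split into two 64-bit stores
    }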
Hi Michael, If I understand things correctly, I think this makes sense. However, I'd probably implement this in a separate patch and test it on other subtargets (e.g., A57). Sound reasonable?
Hi Chad, As far as I understand, currently ISelLowering splits the unaligned stores, from which we happen to get STUR and STR, which we can't combine to STP without this patch. With this patch, we'll be able to merge them back, so we'll undo that optimization. I think that being able to combine STUR/STR to STP is good by itself, but from the patch alone we'll probably get regressions (AFAIR, the biggest one was matmul from the LNT testsuite). That's my only concern, but I don't mind landing the patch and addressing this issue later if we actually observe it. Also, I can't provide a high-quality review of this code, as I'm not very familiar with it, so I'd appreciate other people's feedback on the patch. Thanks! Michael
This LGTM (modulo nits), but let's see what the testsuite says first. But this issue isn't specific to this patch, right? If the unaligned store was split to STR+STR we would have generated an STP even before this change. I agree we'll need to do something about this though, but separately, and for both mixed and non-mixed STR/STUR pairs.
Thanks for the feedback, Ahmed. Unfortunately, I don't have a setup that can test with the testsuite at the moment. That is in the works... This transformation is very narrow and did not hit anything in Spec2000. Therefore, I'm going to move on to other work with a higher ROI. However, feel free to push this one along.
That is absolutely correct! This patch only applies to the very narrow case of STR/STUR. The common case is when we're pairing STUR/STUR, which has already been committed.
Hi Chad, Ahmed, I ran some testing for this patch and found a bug there. The issue is that when we scale the offset, we need to make sure that the original value is divisible by the scale factor. Here is a test to illustrate the issue:

    define <2 x i64> @test4(i64* %p) nounwind {
      %a1 = bitcast i64* %p to <2 x i64>*
      %tmp1 = load <2 x i64>, <2 x i64>* %a1, align 8         ; Load <p[0], p[1]>
      %add.ptr2 = getelementptr inbounds i64, i64* %p, i64 3
      %a2 = bitcast i64* %add.ptr2 to <2 x i64>*
      %tmp2 = load <2 x i64>, <2 x i64>* %a2, align 8         ; Load <p[3], p[4]>
      %add = add nsw <2 x i64> %tmp1, %tmp2
      ret <2 x i64> %add
    }

The current patch will combine these two loads, which is incorrect. Michael

And this should catch the unscaled offset not being a multiple of the scale (with abundant asserts).

Hi, I just finished an LNT+externals run and saw no failures. Thanks for the fix! Michael

Committed r246769.
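In the test above the two loads are 16 bytes wide but only 24 bytes apart, so they do not form adjacent 16-byte slots and must not be paired. The fix described boils down to rejecting the pair when the unscaled offset is not an exact multiple of the scale factor; below is a minimal, hypothetical sketch of such a guard, not the committed code.

    // Hypothetical guard sketching the fix: an unscaled byte offset may only be
    // converted to the scaled form used by LDP/STP if it is an exact multiple
    // of the access size; otherwise the candidate pair must be rejected.
    #include <cassert>
    #include <cstdint>

    bool convertToScaledOffset(int64_t UnscaledOffset, int64_t MemSize,
                               int64_t &ScaledOffset) {
      assert(MemSize > 0 && "access size must be positive");
      if (UnscaledOffset % MemSize != 0)
        return false;   // e.g. a 24-byte distance between 16-byte loads
      ScaledOffset = UnscaledOffset / MemSize;
      return true;
    }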
I speculatively reverted r246769 in r246782 as the compiler looks to be ICEing on MultiSource/Benchmarks/tramp3d-v4.
Inline nit: MI -> Paired, FirstMI -> I ?