Recent commit history on llvm: https://github.com/llvm-mirror/llvm/commits/master?author=annamthomas
User Details
- User Since
- Mar 30 2016, 11:13 AM (390 w, 1 d)
Yesterday
Looks like the missing concern from Nikita is about the compile time impact. Marek is checking this using the compile time tracker.
Added ctpop test
Tue, Sep 19
Any comments to progress this further?
Tue, Sep 12
ping?
Fri, Sep 8
This is a POC of most of the changes for the RFC to follow. The change will be split into LangRef + API as one change, Verifier + pass updates as another change.
shall I go ahead and land this? Simple enough change.
Wed, Sep 6
rebased over changes from dmgreen with more accurate costs for ARM out-of-loop reductions.
Our X86 results are in. Over 244 workloads, the geomean changes by about ~0.4%. Looking at individual workloads, there are no major gains or regressions in large applications (unfortunately, we cannot share the exact benchmark names publicly). However, we do see big gains in 3 of these workloads where it performs a floating point minimum/maximum reduction over a float array of 3 elements (without this change we were vectorizing it).
Overall, I think the change is still reasonable to have since we now account for out of loop reductions more accurately.
Thanks @dantrushin for the patch. We have now had nasty miscompiles in practice (lowering memory intrinsics to regular loops and miss adding barriers for GC) and would like to make progress with this patch.
Tue, Sep 5
Rebase needed. thanks.
Thu, Aug 31
The regressions on ARM and RISCV are because we have an updated minimum trip count we expect (to offset the cost of the out-of-loop reductions), so the CHECK lines have been updated accordingly. RISCV looks okay with the minimum trip count being at least 4 or 8 (we round the MinimumTripCount up to VF, so it is more conservative).
changed the flag to on by default. Fixed the ARM/RISCV tests (by updating the lines). More details will follow.
Wed, Aug 30
LGTM.
Gentle ping @fhahn. Anything I could do to move this forward? I have kept the option as off by default, since I would need to help to test this on supported targets upstream. It helps our use-case where loops are not vectorized when out-of-loop reductions are present in small trip count loops.
There maybe some fallouts where correction in reductions cost modelling will be required.
Mon, Aug 28
Thanks!
Fri, Aug 25
added an option which switches this off by default. The failures above are related to ARM and RISCV targets.
Thu, Aug 24
Aug 22 2023
Rewrote the patch to extend existing logic for runtime checks calculation.
This now supports all kinds of out of loop reductions. Since we compute an upper-bound on MinProfitableTripCount, this maybe a more conservative estimate.
Aug 18 2023
addressed review comment.
One thing to note though is we would need to future proof against incompatible types (whenever new return types are added), but that happens in the common API: AttributeFuncs::typeIncompatible.
remove incompatible return attributes at the point we change the return type.
Aug 17 2023
LGTM w/ comments.
do you need review for this? Could you pls state the review stack in the description so it is clear the order of reviews. Thanks.
do you need review for this? Could you pls state the review stack in the description so it is clear the order of reviews. Thanks.
Aug 15 2023
LGTM w/ comments.
Aug 9 2023
Thanks, I've moved the test and removed the x86-registered-target
Aug 8 2023
Thanks everyone for the review and test cases. I'll try landing this again today.
LGTM - we don't use the legacy pass pipeline. In fact, we're moving away from guard representation completely (not sure if anyone else downstream uses it though).
Aug 3 2023
added Ayal's minimal reproducer. Addressed review comments.
updated test case with correct check lines and comments.
thank you for the fix! Details of the diagnostic is here (for reference later) : https://reviews.llvm.org/D155145#4556178
Aug 2 2023
thank you @craig.topper and @pengfei .
Can you capture the values of EAX, EBX, ECX, and EDX after the two calls to getX86CpuIDAndInfoEx that have 0x7 as the first argument? Maybe there's a bug in CPUID on Sandy Bridge.
Sure, on the original code before the patch you suggested right?
The two calls are:
bool HasLeaf7 = MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x0, &EAX, &EBX, &ECX, &EDX); + llvm::errs() << "Before setting fsgsbase the value for EAX: " << EAX + << " EBX: " << EBX << " ECX: " << ECX << " EDX: " << EDX + << "\n";
addressed review comment for use-after-free error (updated test to show the LoopAccessAnalysis bailout no longer present).
Thanks Ayal for the root cause. I'll update the patch
Aug 1 2023
Jul 28 2023
We see a crash bisected to this patch about using an illegal instruction.
Here's the CPUInfo for the machine:
CPU info: current cpu id: 22 total 32(physical cores 16) (assigned logical cores 32) (assigned physical cores 16) (assigned_sockets:2 of 2) (8 cores per cpu, 2 threads per core) family 6 model 45 stepping 7 microcode 0x71a, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, vzeroupper, avx, aes, clmul, ht, tsc, tscinvbit, tscinv, clflush AvgLoads: 0.30, 0.10, 0.18 CPU Model and flags from /proc/cpuinfo: model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d Online cpus: 0-31 Offline cpus: BIOS frequency limitation: <Not Available> Frequency switch latency (ns): 20000 Available cpu frequencies: <Not Available> Current governor: schedutil Core performance/turbo boost: <Not Available>
I don't see avxvnniint16 in the flags list nor avx2. So, this (relatively new) instruction shouldn't be generated for this machine. Any ideas on why this might be happening?
Jul 27 2023
Fixed the bug caused by this patch.
fixed use-after-free error (with added testcase). This bug wasn't there before the patch because we would break out of the inner loop accessing A, whereas now we were continuing to see which other A accesses needed to release the group.
I reduced the testcase enough to show how GroupA can be same as GroupB, but there are two ways to fix this:
- Identify that GroupA is same as GroupB and once we released groupA, we should iterate to the next B instruction (in the outer loop).
- GroupA which is being released should never be same as GroupB and we do this by making sure that there's no dependency between *any of* the stores when inserting into GroupB. AFAICT, we don't do that since we check only between A and B when inserting (store) A into (store) groupB.
Thanks @mstorsjo and just for info, the use-after-free error shows up as a hang in non-asan builds (so its highly likely both reports are the same bug). I've reduced the first repro with bugpoint and it is due to GroupA == GroupB.
Jul 26 2023
Jul 25 2023
Thank you for the review Ayal!
Hi Ayal, any more comments? Thanks.
Jul 21 2023
addressed review comments.
Jul 19 2023
addressed most review comments: corrected if (DependentInst) check and couple other NFC.
Jul 17 2023
Jul 14 2023
Jul 13 2023
thank you Florian, this is a nice idea. Working on it.
Jul 12 2023
rebased over test in D155096
removed check tags