Page MenuHomePhabricator

[LV] Do not use vector type to compute cost of scalar address comp.
Needs ReviewPublic

Authored by fhahn on Dec 11 2020, 9:51 AM.

Details

Summary

Currently getMemInstScalarizationCost severely overestimates the cost of
computing the addresses for scalarized memory instructions. It computes
the cost of computing vector addresses for each scalar access.

This patch adjusts the address cost computation to use the scalar
pointer type. Unfortunately this means that we now prefer scalarization
over gather/scatters in some cases on X86. This is due to the fact that
we divide the cost for predicated memory instructions by the reciprocal
block frequency. This means the cost is usually lower than the masked
gather/scatters. I am not sure if there's an easy way to avoid that in
the short term. The problem is that we currently do not really consider
the cost overhead for the compares/branches of predicated blocks.

With this change, we now vectorize the gather-cost.ll tests. I checked
on both ARM64 and X86 (skylake), and the vectorized versions are roughly
2x faster than the scalar loops.

Diff Detail

Unit TestsFailed

TimeTest
70 msx64 windows > LLVM.CodeGen/XCore::threads.ll
Script: -- : 'RUN: at line 1'; c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\llc.exe -march=xcore < C:\ws\w16c2-1\llvm-project\premerge-checks\llvm\test\CodeGen\XCore\threads.ll | c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16c2-1\llvm-project\premerge-checks\llvm\test\CodeGen\XCore\threads.ll

Event Timeline

fhahn created this revision.Dec 11 2020, 9:51 AM
fhahn requested review of this revision.Dec 11 2020, 9:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 11 2020, 9:51 AM

I haven't looked at x86 gather/scatter code at all - ping @pengfei for more thoughts on that.

llvm/test/Transforms/LoopVectorize/X86/gather-cost.ll
16–17

Seems like we should do a bit more checking here to show what is happening. Would it make sense to also add a RUN line with -mcpu=skylake or similar, so we can show some difference between the full vectorization with gathers vs. partial without gathers?

I don't have much experience on performance tuning. Adding more people who contributed to the affected tests.

I checked on both ARM64 and X86 (skylake), and the vectorized versions are roughly 2x faster than the scalar loops.

How did you check the performance? Can we evaluate the affected tests inv_val_store_to_inv_address_conditional and masked_strided2 using the same method?

fhahn added a comment.Feb 1 2021, 4:57 AM

I don't have much experience on performance tuning. Adding more people who contributed to the affected tests.

I checked on both ARM64 and X86 (skylake), and the vectorized versions are roughly 2x faster than the scalar loops.

How did you check the performance? Can we evaluate the affected tests inv_val_store_to_inv_address_conditional and masked_strided2 using the same method?

I built used a C wrapper which called the IR function and linked them together. I can try the same for the others ,but it's a bit time consuming.

sdesmalen added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6550

drive-by nit: VF.getFixedValue(), since this function only makes sense for fixed-width vectors.
(same for some other places in this function, not necessarily caused by this patch)