This patch adds accurate instructions cost. The formula presents two cases(stride 3 and stride 4) and calculates the cost according to the VF and stride.
Details
Diff Detail
Event Timeline
My only main concerns are with respect to interleave-group with gaps (see below), and the fact that we don't distinguish the AVX2 from the AVX512 case (also see below). Just minor comments beyond that.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2556 | Can use \p when you refer to the function arguments. Some phrasing suggestions: | |
2562 | v64i8 is not a legal type for AVX2... will you be getting the right cost here for AVX2? | |
2563 | Maybe add a comment here "Currently the X86InterleavedAccess pass supports only char accesses," | |
2570 | Would be nice if you could add some intuition to the numbers below... | |
2570 | What if the interleave-group has gaps? does the X86InterleavedAccess pass still support it? (I think not, right?). In that case you want to move the call to the cost function to after this check. | |
2687 | Why did you place the call to optimizeInterleaveCost specifically here, if it's not using the MemOpCost calculation? | |
2740 | is this the right indentation? (same for the occurrence below) |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2562 | according to what I saw in the code, only legal types pass into this section and therefore you don't need to care for that. |
The AVX2 changes look ok to me now.
A couple comments about the AVX512 changes below.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2660 | I think the common practice here is to place such cost tables inside the function, exactly where it is used. Also, does the cost in these two tables include the load/store instructions, or does it account only for the shuffles? | |
2691 | Please add a comment here saying that if the X86InterleaveAccessPass supports the specific interleaved access group at hand, we return the hard coded cost of the shuffle sequence that the X86InterleaveAccessPass will generate. Otherwise we compute the cost of the (unoptimized) shuffle sequence (that would be a result of lowering each of the IR shuffles of the interleaved group independently). |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2660 | In the mine computation, I check all the instructions include load and store. For the value of the table to be consistent with either avx2 code or avx512 I had to adapt the computation to the user code. |
AVX512 side of things now also looks good to me (with the tiny comments below).
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2692 | you can fit in more text in this line; and while you're touching this, maybe clarify this comment saying that the cost in the table accounts only for the shuffle sequence (the loads/stores are accounted for separately); | |
2763 | "If entry" --> "[ ]If [an] entry" |
returns true