This is an archive of the discontinued LLVM Phabricator instance.

[Aarch64] Add cost for missing extensions.
ClosedPublic

Authored by mssimpso on Nov 16 2015, 2:17 PM.

Details

Summary

This patch adds a cost estimate for some missing sign and zero extensions. The
costs were determined by counting the number of shift instructions generated
without context for each new extension.

For example, the zero extension below has a cost estimate of 2.

define <8 x i32> @f(<8 x i16> %a) {
entry:
  %b = zext <8 x i16> %a to <8 x i32>
  ret <8 x i32> %b
}

Because the backend generates the following code.

ushll2  v1.4s, v0.8h, #0
ushll   v0.4s, v0.4h, #0
ret

Diff Detail

Repository
rL LLVM

Event Timeline

mssimpso updated this revision to Diff 40341.Nov 16 2015, 2:17 PM
mssimpso retitled this revision from to [Aarch64] Add zero cost for extensions that may be eliminated..
mssimpso updated this object.
mssimpso added a subscriber: llvm-commits.
mssimpso updated this object.Nov 16 2015, 2:18 PM
sbaranga edited edge metadata.Nov 17 2015, 2:34 AM

Correct me if I misunderstood, but emitting these instructions would require some context?

For example:

define <8 x i32> @g(<8 x i16>* %a, <8 x i32>* %b) {
entry:

%opa = load <8 x i16>, <8 x i16>* %a
%opb = load <8 x i32>, <8 x i32>* %b
%zea = zext <8 x i16> %opa to <8 x i32>
%res  = sub nsw <8 x i32> %zea, %opb
ret <8 x i32> %res

}

gives:

ldr q0, [x0]
ldp q2, q1, [x1]
ushll v3.4s, v0.4h, #0
ushll2 v0.4s, v0.8h, #0
sub v1.4s, v0.4s, v1.4s
sub v0.4s, v3.4s, v2.4s
ret

which does perform an extension. This is a general problem with the cost model.
A fix for this this is matching this pattern (two exts and an add) in SLP and having a cost hook for it.

Do you think this might be an issue?

Thanks,
Silviu

Hi Silviu,

Thanks very much for the feedback. I appreciate the counterexample. To be
clear, I'm not suggesting that these casts can always be eliminated. As you
mention that requires more context.

However, I do think that modeling these particular casts with zero (or near
zero) cost is probably more accurate for the majority of cases than the current
values. If I'm reading the intent of the current heuristics correctly, even in
your counterexample, the expected cost for the zext should be 2 (the number of
shift instructions) whereas the current cost model calculates a cost of 44.

I don't think the intent of the cost model is to predict with complete accuracy
the sequence of instructions the backend will select, but we should produce
reasonable values for common cases. I think that a zero (or near zero) cost for
these casts is the more reasonable value.

Having said that, we can certainly detect the interesting patterns in SLP and
other transformations with additional cost hooks. But I worry that this would
add unneeded complexity.

Since the casts in question aren't always guaranteed to be eliminated (but
often are), would you be opposed to giving them a near zero cost instead? I
think that would still be preferable to the current situation.

Ah, right. 44 is way too high. I suspect that the model was thinking that we would scalarize the operation.

I certainly wouldn't be opposed to a cost equal to the number of instructions (which would be near zero), but without assuming any context. Would that be enough for your use-case?

Thanks,
Silviu

That would certainly be enough for my use-case, Silviu. Thanks! I will upload a
new patch that counts the number of shifts generated without context.

mssimpso updated this revision to Diff 40410.Nov 17 2015, 10:42 AM
mssimpso edited edge metadata.

Added context-free costs for new extensions and organized existing entries.

mssimpso retitled this revision from [Aarch64] Add zero cost for extensions that may be eliminated. to [Aarch64] Add cost for missing extensions..Nov 17 2015, 10:44 AM
mssimpso updated this object.

Thanks! I think this generally looks good and we only need some performance data.

-Silviu

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
208 ↗(On Diff #40410)

A small style issue: it would be nice to have a consistent order in this table. The logic for the ordering of the costs above doesn't match the one for the costs below.

FWIW I think the costs for the extensions between legal types (of size 64 or 128) was already 1 from the generic logic, but I think it's ok to have them explicitly specified here.

mssimpso added inline comments.Nov 18 2015, 8:18 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
208 ↗(On Diff #40410)

Nice catch. I'd rather not add extra lines of code that we don't need, so I'll remove the entries I created that are already covered by the generic logic. I'll also insert the new entries according to the existing sorting.

mssimpso updated this revision to Diff 40519.Nov 18 2015, 8:45 AM

Removed entries covered by the generic logic and sorted the remaining ones
according to the existing scheme.

mssimpso marked an inline comment as done.Nov 18 2015, 8:45 AM
mssimpso updated this revision to Diff 40520.Nov 18 2015, 8:48 AM

Actually sort the entries this time.

Performance data for the updated revision on spec2000 and spec2006 is below. A
binary diff of the programs before and after the change shows that only three
programs are modified. All the other programs in spec are unaffected. A
positive difference indicates improvement. The programs were tested on a
Cortex-A57-like chip.

Benchmark           Diff (%)
----------------   ---------
spec2006/gcc       +0.413233
spec2006/h264ref   +0.183790
spec2006/sphinx3   +0.129399

For the LLVM test-suite, a binary diff of the programs before and after the
change shows that only the three programs shown below are modified. All other
programs are unaffected. Unfortunately, our testing infrastructure is not yet
capable of providing performance data for the test-suite programs. If the
programs below are of concern to you, I would have to ask for assistance from
the community for gathering the data. Otherwise, given the small number of
affected programs, a post-commit performance review may be sufficient here.

MultiSource/Applications/JM/lencod/lencod
MultiSource/Applications/sgefa/sgefa
MultiSource/Benchmarks/PAQ8p/paq8p
sbaranga accepted this revision.Nov 18 2015, 8:54 AM
sbaranga edited edge metadata.

Thanks, lgtm!

-Silviu

This revision is now accepted and ready to land.Nov 18 2015, 8:54 AM

Thanks very much for the review, Silviu!

This revision was automatically updated to reflect the committed changes.