Support for neoverse-512tvb mirrors the equivalent option available in GCC[1].
The option has no functional effect yet.
This patch ensures the driver accepts "-mcpu=neoverse-512tvb" and puts enough
plumbing in place for the new option to be used in the future.
Event Timeline
This is a bit of a shame. I was hoping we wouldn't need the same hacks as GCC. The LLVM cost modelling can at times work quite differently from GCC's, and I didn't think we were close enough to optimal code to need to worry about these kinds of differences. I guess having the option is useful for consistency.
llvm/lib/Support/Host.cpp | ||
---|---|---|
216 ↗ | (On Diff #381853) | This doesn't sound right - for a fake cpu to work with -mcpu=native. |
llvm/lib/Target/AArch64/AArch64Subtarget.cpp | ||
---|---|---|
168 | Should this have Loop Alignment too? Is the interleave factor higher due to the 512-bit vector bandwidth? | |
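For readers unfamiliar with the knobs mentioned in the comment above: MaxInterleaveFactor, loop alignment, and VScaleForTuning are per-CPU tuning properties that the AArch64 subtarget sets up for each processor family. The snippet below is a minimal, self-contained mock of that idea, with a hypothetical CpuTuning struct and purely illustrative values; it is not the code from this patch.

```cpp
#include <cstdio>

// Illustrative stand-in for the per-CPU tuning knobs discussed above. The
// struct name, members, and values are placeholders, not the ones set by
// the patch in AArch64Subtarget.cpp.
struct CpuTuning {
  unsigned MaxInterleaveFactor = 2; // interleave (unroll) factor the vectorizer may use
  unsigned PrefLoopAlignment = 0;   // preferred loop alignment, 0 = target default
  unsigned VScaleForTuning = 1;     // vscale value the cost model assumes for SVE
};

int main() {
  // Hypothetical settings for a core advertising wide total vector bandwidth.
  CpuTuning Tuning;
  Tuning.MaxInterleaveFactor = 4;
  Tuning.VScaleForTuning = 1;
  std::printf("interleave=%u vscale=%u loop-align=%u\n",
              Tuning.MaxInterleaveFactor, Tuning.VScaleForTuning,
              Tuning.PrefLoopAlignment);
  return 0;
}
```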
llvm/lib/Target/ARM/ARM.td | ||
---|---|---|
1412 ↗ | (On Diff #381853) | Are we sure gcc has a -mcpu=neoverse-512tvb option for Arm? Or is it AArch64 only? |
This one might get a VScaleForTuning:
https://reviews.llvm.org/D112459
Do you need this as well?
Thank you for your review @dmgreen and @tschuett.
I have rebased the patch; VScaleForTuning is now being set.
I also removed support for neoverse-512tvb from 32-bit Arm, since
@paulwalker-arm pointed out that neoverse-512tvb is only supported on AArch64.
llvm/lib/Support/Host.cpp | ||
---|---|---|
216 ↗ | (On Diff #381853) | Good catch, I should have removed that. |
Thanks. If the cpu has 512-bit total vector bandwidth, should VScaleForTuning be 1 or 2 (or higher)? LLVM doesn't usually deal with total bandwidth much, perhaps not as much as it should.
@david-arm, any thoughts?
The total vector bandwidth includes unrolling, so currently having VScaleForTuning=1 and MaxInterleaveFactor=4 implies a 512-bit TVB. If the target has >128-bit vectors, then vector loops will likely have more work than they can handle in parallel, but as long as that does not negatively affect register pressure it shouldn't be a problem.
> The total vector bandwidth includes unrolling, so currently having VScaleForTuning=1 and MaxInterleaveFactor=4 implies a 512-bit TVB. If the target has >128-bit vectors, then vector loops will likely have more work than they can handle in parallel, but as long as that does not negatively affect register pressure it shouldn't be a problem.
That doesn't fit with my understanding of how VScaleForTuning is currently used, and vectorizing/unrolling too far can easily cause the vector part to be skipped for many loop counts, falling back to the scalar part. But that all sounds fine to me for what this is. Cheers.
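To spell out the arithmetic behind the exchange above: with a 128-bit base vector length, VScaleForTuning=1 and MaxInterleaveFactor=4 account for 128 * 1 * 4 = 512 bits of vector work per iteration, which is where the 512-bit-TVB reading comes from. A minimal sketch of that calculation follows; the helper function is hypothetical, not an LLVM API.

```cpp
#include <cstdio>

// Hypothetical helper (not an LLVM API): bits of vector work issued per
// vectorized loop iteration, assuming a 128-bit base vector length scaled by
// vscale and multiplied by the interleave (unroll) factor.
unsigned totalVectorBandwidthBits(unsigned VScaleForTuning,
                                  unsigned MaxInterleaveFactor) {
  const unsigned BaseVectorBits = 128; // Neon / minimum SVE vector length
  return BaseVectorBits * VScaleForTuning * MaxInterleaveFactor;
}

int main() {
  // VScaleForTuning=1, MaxInterleaveFactor=4 -> 512 bits, matching the
  // reading above.
  std::printf("%u\n", totalVectorBandwidthBits(1, 4)); // prints 512
  // A core with 256-bit SVE vectors (vscale 2) would reach 512 bits with an
  // interleave factor of 2 instead.
  std::printf("%u\n", totalVectorBandwidthBits(2, 2)); // prints 512
  return 0;
}
```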
nit: s/512/512-TVB/