This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Unrestrict Armv8 IT blocks
AbandonedPublic

Authored by samparker on Dec 30 2019, 2:29 AM.

Details

Summary

In Section 5.10 of the Cortex-A75 optimization guide and Section 4.12 of the Cortex-A76 and A77 optimization guide, it is stated that multi-instruction IT blocks are preferred over multiple, single instruction, blocks.
We're still checking for conditional eligibility, those listed in 'Partial deprecation of IT' of the Armv8 architecture manual, but most instructions will now be allowed within an IT block. isV8EligibleForIT has been modified to take a 'Restricted' boolean which is used to either reject, or allow, most instructions.
The biggest cause of code generation changes will be that the size reduction pass will not run before if-conversion and then if-conversion will do more converting because more instructions are eligible.

Diff Detail

Event Timeline

samparker created this revision.Dec 30 2019, 2:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 30 2019, 2:29 AM

Sounds interesting. My understanding is that there were several features of IT instructions that were performance deprecated. Some of the instructions inside the blocks and accesses to PC were ruled out too. Take a look at isV8EligibleForIT for how they are disabled (I happened to be looking at it the other day for different reasons. Someone wanted to try restrict-it on other cpus).

The opt guide makes it sound like only the multi-instructions in a block were un-deprecated. Would it be better to call this AllowMultiInstrIT (or something like it), and have it just control that part of Thumb2ITBlock, without the isV8EligibleForIT changes too?

llvm/test/CodeGen/Thumb2/v8_deprecate_IT.ll
9

Like it. That was going to be my other question.

Renaming and just using the feature in the IT block pass sounds good to me.

samparker updated this revision to Diff 235589.Dec 30 2019, 5:50 AM
  • Renamed the feature.
  • Modified the IT block pass a bit
  • Added a couple of lines to another existing test.

TODO: I didn't appreciate the convoluted way that this feature is used... so this still doesn't prevent us from skipping the other V8 IT performance checks. There's three cases here that a single boolean is used for throughout the backend:

  1. The user has passed an option to restrict IT block sizes for any Thumb2 target.
  2. The user has passed an option to prevent IT block sizes restrictions.
  3. The default behaviour of Armv8, which includes the size of the block and some instructions.
  4. Performance recommendations of some Armv8 cores.

I'm still thinking about how to handle this matrix in a clear and correct way, suggestions welcome!

samparker updated this revision to Diff 235607.Dec 30 2019, 8:40 AM
samparker edited the summary of this revision. (Show Details)
  • Replaced RestrictIT within ForceRestrictIT and ForceNoRestrictIT.
  • AllowMultiInstrIT causes us to skip the instruction eligibility checks.
samparker edited the summary of this revision. (Show Details)Dec 30 2019, 8:43 AM
samparker edited the summary of this revision. (Show Details)
  • Renamed the option back to 'DontRestrictIT'.
  • Now calling isV8EligibleForIT, even when unrestricted, so we can check for the conditionally eligible opcodes.
evandro added inline comments.Jan 10 2020, 12:22 PM
llvm/lib/Target/ARM/ARM.td
410
s/FeatureDontRestrictIT/FeatureUnestrictedIT/

Are there any cores where we don't want FeatureDontRestrictIT ? I mean, multi-instruction it blocks were formally deprecated, but I'm not sure anyone actually took advantage of that.

I think we currently generate a warning in the assembler for deprecated it blocks? Do we want to eliminate that?

llvm/lib/Target/ARM/ARMFeatures.h
77

The checks for tADDspr and tADDrSP look wrong; neither of those instructions are deprecated, as far as I can tell. Maybe it was confused with something else? (I guess this isn't really part of you patch; feel free to ignore.)

Are these checks intentionally not controlled by the "Restricted" boolean?

llvm/lib/Target/ARM/ARMSubtarget.h
423

Maybe worth noting in the comments that the "Force" booleans correspond to command-line options, not subtarget features?

samparker marked an inline comment as done.Jan 15 2020, 6:55 AM

Are there any cores where we don't want FeatureDontRestrictIT?

How would you feel about unrestricting by default..? One of the GCC guys I was talking too also said it's hampering performance compared to armv7. I've got little bandwidth and I want to get some more numbers, and I'm not quite trusting my chromebook at the minute... but its' A72/A53 and I'm seeing good improvements on my data set of 1 (coremark) of 14% improvement.

llvm/lib/Target/ARM/ARMFeatures.h
77

There's a note in Table F1-15 'Non-deprecated IT 16-bit conditional instructions' about using the PC, I think this logic is okay. And yes, it's intention, I've heard that some cores branch predictors don't work as well when some of these instructions are used.

I'm not really tracking performance for 32-bit armv8-a at the moment... but I would be fine with changing the default with appropriate testing. I don't think we should have core-specific heuristics unless there's evidence some core is actually different in a meaningful way.

I'm a little concerned this could have some unexpected impact; getting rid of extra "it" markers probably doesn't have any performance impact, but more aggressive predication of 32-bit instructions could. (I would guess 32-bit instructions that have an equivalent 16-bit instruction are fine, but it might not be wise to predicate certain expensive instructions, like udiv.)

llvm/lib/Target/ARM/ARMFeatures.h
77

I was going to ask about equivalent 32-bit instructions, but I guess most Thumb2 instructions aren't allowed to access pc anyway.

If you're worried about performance penalties for certain branches, can we handle that in a separate "isWeirdBranch" function? It sounds like the concern here is independent of the general deprecation.

Ok, I'll get some test suite numbers across some cores. I would hope long latency instructions, such as div, would be predicated in the same way as armv7 so the performance would be comparable... I definitely share your concern there though, it will be the main source of the performance changes.

yroux added a subscriber: yroux.Mar 10 2021, 3:57 AM

Hi Sam,

did you have a chance to get these numbers on various cores ?

Hi Yvan,

I had forgotten about this... the reason I didn't follow-up on it was that I bricked my WoA laptop trying to install ubuntu on it :( and didn't have access to any other reasonably modern armv8 machine for performance testing. I guess the situation may have changed over the last 12 months or so though. What kind of hardware do you have available at linaro, and do you have time/interest in picking this up?

yroux added a comment.Mar 10 2021, 6:53 AM

I'm on a D05 box (Cortex-A72) and I should have a remote access to WoA surface laptop... I'll check

samparker abandoned this revision.Feb 3 2022, 5:13 AM

Abandoning in favour of D118044.