Page MenuHomePhabricator

[RISCV] Add pre-emit pass to make more instructions compressible

Authored by lewis-revill on Nov 25 2020, 8:03 AM.



When optimizing for size, this pass searches for instructions that are
prevented from being compressed by one of the following:

  1. The use of a single uncompressed register.
  2. A base register + offset where the offset is too large to be compressed and the base register may or may not already be compressed.

In the first case, if there is a compressed register available, then the
uncompressed register is copied to the compressed register and its uses
replaced. This is only done if there are enough uses that code size
would be improved.

In the second case, if a compressed register is available, then the
original base register is copied and adjusted such that:

new_base_register = base_register + adjustment
base_register + large_offset = new_base_register + small_offset

and the uses of the base register are replaced with the new base
register. Again this is only done if there are enough uses for code size
to be improved.

This pass was authored by Lewis Revill, with large offset optimization
added by Craig Blackmore.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
jrtc27 added inline comments.Jan 18 2021, 9:10 AM
244 ↗(On Diff #317371)

You need the type of the register (int or float) in order to support compressing floating-point loads/stores (and so we can support capabilities in our downstream CHERI fork, both for the base and for the source/destination).

Removed hard-coded assumptions noted by @jrtc27 and rebased. Thank you for the review and apologies for the delay in updating the patch.

lenary removed a subscriber: lenary.Oct 15 2021, 2:00 AM
asb accepted this revision.Nov 6 2021, 6:54 AM

This looks good to me - many thanks for the contribution.

@jrtc27 - are you happy your comments were all addressed?

This revision is now accepted and ready to land.Nov 6 2021, 6:54 AM
jrtc27 added a comment.Nov 6 2021, 9:03 AM

I don't understand why this is still limited to just LW and SW; you have all the bits for dealing with the differing immediate shifts already so it should be ~no effort to add LD and SD on top, and floating-point variants are just supporting additional register classes for the source and destination, some of which you again already have code for.

108 ↗(On Diff #379935)

Comment says bytes, this is bits

124 ↗(On Diff #379935)

This is just 31 * width_in_bytes (or 31 << log2(width_in_bytes))

133 ↗(On Diff #379935)

If this returned log2(bytes) (or you could compute it, but every user of the function is happy with log2(bytes)) then this is just

return isUIntN(6 + log2(bytes), Offset) && (Offset % (UINT64_C(1) << log2(bytes)) == 0);

(inlining a non-templated version of isShiftedUInt, since we don't seem to have an isShiftedUIntN in MathExtras.h, though perhaps it'd be better to just add it). Ditto for compressibleOffset just with 5 not 6.

163 ↗(On Diff #379935)

You don't need the parens

214 ↗(On Diff #379935)

There's a lot of duplicated code between loads and stores

craigblackmore edited the summary of this revision. (Show Details)

Add support for LD/SD/FLW/FLD/FSW/FSD and address other inline comments from @jrtc27.

jrtc27 added inline comments.Nov 22 2021, 4:40 AM
181 ↗(On Diff #388866)

Stray semicolon

189 ↗(On Diff #388866)

Can also be scaled by 8 for double-sized quantities (and 16 for us downstream). The scaling isn't really relevant here, just that it has a larger immediate (and no source/destination register restriction, as you already note).

194 ↗(On Diff #388866)

Not rs2 in the case of loads, so I guess SrcDest or something?

202 ↗(On Diff #388866)

I think this would be more helpful if it then went on to say what it actually did for stores

205 ↗(On Diff #388866)

How often is storing an address to itself at offset 0 actually hit?..

206 ↗(On Diff #388866)

I can't obviously see tests that exercise this?

1 ↗(On Diff #388866)

These will need +f and +d given you use F[LS][WD]

32 ↗(On Diff #388866)

Why null for all of these? That seems like a bad idea...

80 ↗(On Diff #388866)

This would be better as an offset to a real pointer, not a horrible inttoptr of a constant

jrtc27 added inline comments.Nov 22 2021, 4:48 AM
1 ↗(On Diff #388866)

File name is also misleading. Makes it sound like where instruction compression happens, but it's not, it's a transformation to make the instructions more amenable to compression, at the expense (in most cases) of instruction count.

219 ↗(On Diff #388866)

You can get the MBB from the MI, otherwise you risk passing the wrong MBB

297 ↗(On Diff #388866)

What if you hit a def of the new register?

347 ↗(On Diff #388866)

What about when Imm is 0 and all uses can be updated?

355 ↗(On Diff #388866)

Is it ever possible for this to return false on anything other than the final MI? Shouldn't analyzeCompressibleUses have stopped on such an instruction?

craig.topper added inline comments.Nov 22 2021, 4:30 PM
77 ↗(On Diff #388866)

This is also misleading like the filename.

100 ↗(On Diff #388866)

There's little reason for this to return uint8_t. Just use unsigned.

139 ↗(On Diff #388866)

Why do we need to compressibleOffset? Wouldn't the Offset & ~compressedLDSTOffsetMask(Opcode) return 0 for the same cases?

146 ↗(On Diff #388866)

Function name should start with a verb action so isCompressedReg

153 ↗(On Diff #388866)


159 ↗(On Diff #388866)


246 ↗(On Diff #388866)

There should be a comma after instruction.

247 ↗(On Diff #388866)

Comma after compressed

Thank you for the reviews. I have updated based on the feedback and rebased.

Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2022, 11:41 AM
jrtc27 added inline comments.Mar 4 2022, 11:56 AM

You don't want to be transforming FLW on RV64, that's reused for the LD encodings (ditto stores)


Weird spacing

1 ↗(On Diff #413085)

The names for these are a bit dodgy IMO in that they encode far too much about the specific instruction sets used, so changing that requires changing the name. Maybe make-compressible.mir and the RV64-only one to be make-compressible-rv64.mir? Or split fp out from i32 into its own file?

42 ↗(On Diff #413085)

Please don't, ptrtoint is a terrible thing... store %p itself to a bitcasted %p

craigblackmore marked 19 inline comments as done.Mar 4 2022, 11:57 AM
craigblackmore added inline comments.
206 ↗(On Diff #388866)

I have added tests for this now.

297 ↗(On Diff #388866)

I have added a comment about this.

// The new register was scavenged for the range of instructions that are
// being updated, therefore it should not be defined within this range
// except possibly in the final instruction.
347 ↗(On Diff #388866)

I have added a TODO to possibly update all uses in the future.

355 ↗(On Diff #388866)

I have removed the check of the return value and removed the bool from updateOperands since it is no longer used.

craigblackmore updated this revision to Diff 414422.EditedMar 10 2022, 10:24 AM
craigblackmore marked an inline comment as done.

Address latest review.

  • No longer transforming FLW/FSW on 64-bit.
  • Fixed weird spacing.
  • Renamed test files.
  • Used bitcast instead of ptrtoint.
asb added a comment.Mar 17 2022, 7:47 AM

Any thoughts from anyone on whether this should be turned on at Oz only? It can certainly increase the number of instructions executed.

Enable at -Oz instead of -Os.

Do you have an example of a benchmark or some other codebase where this pass makes a difference? I tested this in 3 different places and there were no changes at all.

I tested this in 3 different places and there were no changes at all.

Actually, in OpenTitan I see up to 0.5% code size gains (60 bytes) in the best case.

The pass saves 456 bytes (0.76%) on a proprietary application.

If no one has additional concerns I think this could be merged.
Thanks for the info regarding the code size gains.

asb added a comment.EditedApr 14 2022, 5:11 AM

@craigblackmore: could you please take a look at compiling complex-2.c from the GCC torture suite with this patch. I'm getting incorrect output for RV32IMAFDC at Oz:

./output_rv32imafdc_ilp32_Oz/complex-2.s:16:7: error: invalid operand for instruction
        addi    fa0, ft0, 0
./output_rv32imafdc_ilp32_Oz/complex-2.s:18:7: error: invalid operand for instruction
        addi    fa1, ft1, 0
lewis-revill commandeered this revision.Apr 25 2022, 5:13 AM
lewis-revill added a reviewer: craigblackmore.

Rebased including an update from using isShiftedUIntN to the templated replacement isShiftedUInt<N, S>. Addressed the issue of inserting incorrect instructions to copy FP registers to compressible FP registers by using FSGNJ_S and FSGNJ_D instead. We note that we can't have an additional offset to adjust the resulting register by if it is an FP register.

reames added a subscriber: reames.Apr 26 2022, 7:41 AM

Drive by review. I'm mostly using this as an exercise to get familiar with the code and the target, so please don't consider any of my comments blocking.


I would expect us to have a means for querying the width of a load/store somewhere, and for this to simply be the log of an existing utility. I haven't dug around, but are you sure that doesn't exist?


Don't we have a means already for computing max offset representable in the addressing of an instruction? I'd expect us to,


It's not clear me to me why the last instruction has to be a load as this assert would seem to imply. Could we instead directly assert this is the last instruction somehow?


While this is a size opt, I'd expect improving code density to outweigh the effect of an extra register move and/or add. We should measure, but this might be useful even for O2.


The algorithm here appears to be O(N^2). Can we do better here?

Drive by review. I'm mostly using this as an exercise to get familiar with the code and the target, so please don't consider any of my comments blocking.

Thanks very much for the review


I've had a brief look, and it seems to me the only existing way would be to use MI.memoperands_begin()->getSize(). Doesn't seem to be anything simple in TargetInstrInfo.h that we can use/override either to get this functionality.


No, every place which would need to check this just repeats the same logic since it's so simple.


We only look at loads and stores for compressibility, and as such if we see a compressible instruction which also defines the register we know it is a compressible load. We also know that since we stop when we see the next definition of the register we are looking at, but only after we've checked the defining instruction for compressibility. So the only way we are looking at a definition is if it's the final instruction.


Until recently this was enabled at -Os but changed to -Oz due to the increase in number of instructions executed. I suggest keeping at -Oz and changing this later if there is a motivation for it.


Hmm, I guess there'd be a way to accumulate the ranges that we're interested in as we go through in one iteration - though I think this would be far more difficult to implement and you might end up with issues to do with looking at instructions which are yet to be updated and making incorrect conclusions (I don't see any cause for that at a first glance but it's always possible).

The pass saves 456 bytes (0.76%) on a proprietary application.

The pass saves 160 bytes (0.101%) on a proprietary application, from 157974 bytes-->157814 bytes.

asb added a comment.May 17 2022, 8:04 AM

At the last sync-up call, @reames I believe said he had no further blocking comments (thanks for the review on this Philip). I think we're good to land this Lewis if you're happy with the current code.

Rebased prior to commit

This revision was landed with ongoing or failed builds.May 25 2022, 1:27 AM
This revision was automatically updated to reflect the committed changes.