Page MenuHomePhabricator

[RISCV][compiler-rt] Add support for save-restore
ClosedPublic

Authored by edward-jones on Nov 18 2020, 8:30 AM.

Details

Summary

This adds the compiler-rt entry points required by the -msave-restore option. The added entry points are riscv_save_X and riscv_restore_X, where X is the number of callee-saved registers to be saved or restored respectively.

Diff Detail

Event Timeline

edward-jones created this revision.Nov 18 2020, 8:30 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 18 2020, 8:30 AM
Herald added subscribers: Restricted Project, frasercrmck, NickHung and 28 others. · View Herald Transcript
edward-jones requested review of this revision.Nov 18 2020, 8:30 AM

It seems a bit excessive to me to coalesce the entry points into bundles of 4. Do you have any particular benchmarking data or reasoning that supports choosing that threshold?
Also, shouldn't this implementation include CFI directives?

compiler-rt/lib/builtins/riscv/save_restore.h
1 ↗(On Diff #306122)

The header banner is missing here.

Also, compiler-rt seems to use a mix of .h and .inc files. I'm not sure which one is more canonical, but the existing one in the riscv directory is a .inc.

9–19 ↗(On Diff #306122)

clang-format is complaining here and suggesting that you remove all indentation.
You can still indent it. The existing style is to leave the # in the first column and indent the rest of the line.

For what it's worth, libgcc seems to group by 2 for RV64 and by 4 for RV32, i.e. it always rounds up to a multiple of 16 bytes as that's an easy way to preserve stack alignment requirements.

It seems a bit excessive to me to coalesce the entry points into bundles of 4. Do you have any particular benchmarking data or reasoning that supports choosing that threshold?
Also, shouldn't this implementation include CFI directives?

I used bundles of 4 just to follow the behaviour I saw in libgcc, and the grouping of 2 for rv64 seemed a bit too fine-grained. I'm not sure what the original justification for the coalescing into groups of 2/4 was in libgcc.

I'll update to account for other suggested changes and see if I can find any benchmarks which show the tradeoff for the grouping threshold

I used bundles of 4 just to follow the behaviour I saw in libgcc, and the grouping of 2 for rv64 seemed a bit too fine-grained. I'm not sure what the original justification for the coalescing into groups of 2/4 was in libgcc.
I'll update to account for other suggested changes and see if I can find any benchmarks which show the tradeoff for the grouping threshold

Thanks. After thinking about it some more I don't see any significant issue with a bundle size of 4.
I also examined the assembly more thoroughly and tested it, and it seemed correct.

It seems a bit excessive to me to coalesce the entry points into bundles of 4. Do you have any particular benchmarking data or reasoning that supports choosing that threshold?
Also, shouldn't this implementation include CFI directives?

I used bundles of 4 just to follow the behaviour I saw in libgcc, and the grouping of 2 for rv64 seemed a bit too fine-grained. I'm not sure what the original justification for the coalescing into groups of 2/4 was in libgcc.

I'll update to account for other suggested changes and see if I can find any benchmarks which show the tradeoff for the grouping threshold

Because the stack alignment is 16 bytes; see my earlier comment.

It seems a bit excessive to me to coalesce the entry points into bundles of 4. Do you have any particular benchmarking data or reasoning that supports choosing that threshold?
Also, shouldn't this implementation include CFI directives?

I used bundles of 4 just to follow the behaviour I saw in libgcc, and the grouping of 2 for rv64 seemed a bit too fine-grained. I'm not sure what the original justification for the coalescing into groups of 2/4 was in libgcc.

I'll update to account for other suggested changes and see if I can find any benchmarks which show the tradeoff for the grouping threshold

You have to store the correct number of registers in order to access the stack above the libcall size, for instance arguments passed in via the stack. If you spill too many by in some cases spilling two, these offset calculations will be wrong and you'll read the incorrect argument.

Because the stack alignment is 16 bytes; see my earlier comment.

Thanks for your earlier comment. I did think a bit about some possible alternative implementations, which could perhaps make some interesting trade-offs while properly preserving the stack alignment, but in the end the approach used in libgcc and this patch is probably the most sensible one.

It seems a bit excessive to me to coalesce the entry points into bundles of 4. Do you have any particular benchmarking data or reasoning that supports choosing that threshold?
Also, shouldn't this implementation include CFI directives?

I used bundles of 4 just to follow the behaviour I saw in libgcc, and the grouping of 2 for rv64 seemed a bit too fine-grained. I'm not sure what the original justification for the coalescing into groups of 2/4 was in libgcc.

I'll update to account for other suggested changes and see if I can find any benchmarks which show the tradeoff for the grouping threshold

Because the stack alignment is 16 bytes; see my earlier comment.

Ah apologies, I missed your comment. So then libgcc groups them up into the smallest groups it can whist maintaining alignment.

I could switch rv64 to use groups of 2 then if matching libgcc would be advantageous. The only reason for sharing 4 for both rv32/rv64 is that it made the implementation more compact, but that's obviously not the metric to optimize for in an emulation library.

lenary resigned from this revision.Jan 14 2021, 9:43 AM

I could switch rv64 to use groups of 2 then if matching libgcc would be advantageous. The only reason for sharing 4 for both rv32/rv64 is that it made the implementation more compact, but that's obviously not the metric to optimize for in an emulation library.

While keeping them both using a group of 4 would be reasonable (and has a neater implementation), I think overall it would be best to switch rv64 to groups of 2. If you made that change and addressed the review nits I think this would be an easy accept and merge :-)

@jrtc27 do you agree it would be worthwhile to make the change for rv64?

I've rebased, switched to a grouping of 2 for rv64 and 4 for rv32, and fixed the formatting/comments.

Now that RV32 and RV64 have separate implementations, is there still a point in keeping the LOAD/STORE/STRIDE macros?

Now that RV32 and RV64 have separate implementations, is there still a point in keeping the LOAD/STORE/STRIDE macros?

Probably not. I wasn't sure whether the macros made it more readable.

Probably not. I wasn't sure whether the macros made it more readable.

I would probably remove them now.
Also, even if we kept the macros we probably don't really need the .inc here, do we?
I suppose we also don't need the .inc for the RISC-V multiply builtins, although there it's a slightly different and more subtle situation.

lenary removed a subscriber: lenary.Mar 1 2021, 2:56 PM
luismarques accepted this revision.Mar 15 2021, 5:57 AM

LGTM. Thanks!

This revision is now accepted and ready to land.Mar 15 2021, 5:57 AM
This revision was automatically updated to reflect the committed changes.