This is an archive of the discontinued LLVM Phabricator instance.

Propeller: LLD Support for Basic Block Sections
ClosedPublic

Authored by tmsriram on Sep 25 2019, 5:58 PM.

Details

Summary

LLD Support for Basic Block Sections

This is part of the Propeller framework to do post link code layout optimizations. Please see the RFC here: https://groups.google.com/forum/#!msg/llvm-dev/ef3mKzAdJ7U/1shV64BYBAAJ and the detailed RFC doc here: https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf

This is one in the series of patches for Propeller.

This patch adds lld support for basic block sections and performs relaxations after the basic blocks have been reordered.

After the linker has reordered the basic block sections according to the desired sequence, it runs a relaxation pass to optimize jump instructions. Currently, the compiler emits the long form of all jump instructions . AMD64 ISA supports variants of jump instructions with one byte offset or a four byte offset. The compiler generates jump instructions with R_X86_64 32-bit PC relative relocations. We would like to use a new relocation type for these jump instructions as it makes it easy and accurate while relaxing these instructions.

A new relocation type for jmp instructions which need to be relaxed makes it easy and accurate when the linker tries to find these instructions. Our current method peeks back to look at the opcode of the relocation type that could correspond to jmp instructions.

The relaxation pass does two things:

First, it deletes all explicit fall-through direct jump instructions between adjacent basic blocks. This is done by discarding the tail of the basic block section.

Second, If there are consecutive jump instructions, it checks if the first conditional jump can be inverted to convert the second into a fall through and delete the second.

The jump instructions are relaxed by using jump instruction mods, something like relocations. These are used to modify the opcode of the jump instruction. Jump instruction mods contain three values, instruction offset, jump type and size. While writing this jump instruction out to the final binary, the linker uses the jump instruction mod to determine the opcode and the size of the modified jump instruction. These mods are required because the input object files are memory-mapped without write permissions and directly modifying the object files requires copying these sections. Copying a large number of basic block sections significantly bloats memory.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ruiu added inline comments.Feb 11 2020, 11:11 PM
lld/ELF/Arch/X86_64.cpp
169

Does "direct jmp" actually mean just J_JMP_32?

227

In this context, does "direct jump" mean all J* instructions?

276

Please fix local variable names.

lld/ELF/InputSection.h
225

Since JumpInstrMod is a public member, you don't need this?

lld/ELF/LTO.cpp
81

nit: it is more common to use operator== instead of equals to compare StringRefs.

tmsriram updated this revision to Diff 244304.Feb 12 2020, 4:38 PM
tmsriram marked 23 inline comments as done.

Address reviewer comments, camel case, delete and rename methods.

lld/ELF/Arch/X86_64.cpp
169

Yes, it is J_JMP_32. Do you want this refactored into getJmpInsnType?

tmsriram updated this revision to Diff 244307.Feb 12 2020, 5:18 PM

Add a test for ICF with bb sections.

tmsriram marked an inline comment as done.Feb 12 2020, 5:24 PM
tmsriram added inline comments.
lld/ELF/Writer.cpp
1668

I added an icf all test for this.

ruiu added inline comments.Feb 12 2020, 11:43 PM
lld/ELF/Arch/X86_64.cpp
169

I found it a little confusing to use "direct jump" to refer only to J_JMP_32 here and used the same term to represent all J* instructions in the deleteFallThruJmpInsn function comment. But maybe you can just inline and remove this function.

lld/ELF/InputSection.h
134

Now I wonder if you can just shrink rawData? Then you can revert he change you made to getSize().

lld/ELF/OutputSections.cpp
248

Please use the actual type instead of auto.

tmsriram updated this revision to Diff 244487.Feb 13 2020, 11:40 AM
tmsriram marked 3 inline comments as done.

Address reviewer comments.

tmsriram added inline comments.Feb 13 2020, 11:44 AM
lld/ELF/InputSection.h
134

I looked at this and saw that "trimmed" can be deleted which simplifies the code here and in getSize() much more.

I need to keep track of actual available space and shrunk space. This is because we both shrink and grow the section, a follow-up patch that does this optimization. So, bytesDropped is useful to undo the shrink. PTAL and see if the simplification helps.

Also, if we dont keep track of the actual size of the section when growing and shrinking, it is not possible to catch bugs where we accidentally grow more than the original size of the section, potentially writing out-of-bounds into rawData.

tmsriram marked an inline comment as done.Feb 13 2020, 11:50 AM
tmsriram added inline comments.
lld/ELF/InputSection.h
134

Sorry, I mistyped the last line. It should read "potentially reading out-of-bounds from rawData".

ruiu added a comment.Feb 13 2020, 11:34 PM

Overall looking good, but it looks like this test file doesn't cover all relocations you want to handle. You are handling the following relocations, and compared to that the test file seems too small.

J_JMP_32,
J_JNE_32,
J_JE_32,
J_JG_32,
J_JGE_32,
J_JB_32,
J_JBE_32,
J_JL_32,
J_JLE_32,
J_JA_32,
J_JAE_32,
lld/ELF/Arch/X86_64.cpp
157

nit: you should cache the result of is.relocations.size() as for (unsigned e = is.relocations.size(); i < e; ++i)

269

nit: jmpOpcode_B -> jmpOpcodeB

lld/ELF/Driver.cpp
934

This should be named ltoBasicblockSections to exactly match the option name.

lld/ELF/InputSection.cpp
1020–1021

nit: this can be uint64_t offset = jumpMod.Offset + sec->outSecOff

lld/ELF/InputSection.h
134

I see. Maybe we can have two ArrayRefs, one is an ArrayRef of the original size and the other is a (possibly) shrunk one, but I think it doesn't matter much, so I'm fine with this approach.

lld/ELF/OutputSections.cpp
246

Lowercase

250

Lowercase

lld/ELF/Relocations.h
115–117

Original -> original
Offset -> offset
Size -> size

lld/ELF/Target.h
91–92

Lowercase

101–102

Lowercase

lld/ELF/Writer.cpp
1698

What this !ELFT::Is64Bits condition for? It looks like you can just remove it.

tmsriram updated this revision to Diff 244725.Feb 14 2020, 11:34 AM
tmsriram marked 13 inline comments as done.

Add tests for more jmp opcodes other nits.

lld/ELF/InputSection.h
134

Once the relocations are added, this could be made simpler so I will leave it for now.

ruiu accepted this revision.Feb 24 2020, 6:31 PM

LGTM

lld/ELF/Arch/X86_64.cpp
177

nit: remove extra parentheses

181

nit: remove extra parentheses

This revision is now accepted and ready to land.Feb 24 2020, 6:31 PM
MaskRay requested changes to this revision.EditedFeb 24 2020, 7:43 PM

40+ comments have not been addressed. Sorry, but I don't think this can be accepted as is. I am not catching up recent discussions but I believe there are still major concerns about the overall approach, whether code should be implemented in CodeGenPrepare.cpp, etc.

Mark as "Request Changes" in fear of an accidental commit.

lld/ELF/Arch/X86_64.cpp
41

Fix parameter case.

63

const

159

In an old comment, I suggested

See Relocations.cpp:scanRelocs. A better way is to sort relocations beforehand.

It is not addressed.

176
.byte 0xe8
.long foo - . - 4
This revision now requires changes to proceed.Feb 24 2020, 7:43 PM
MaskRay added inline comments.Feb 24 2020, 8:03 PM
lld/ELF/Arch/X86_64.cpp
171

Delete this once psABI reservers a new relocation type for fallthrough jumps.

181

Delete SignExtend64.

config->wordsize * 8 is a constant, 64.

185

Fix variable case.

187

Delete excess parentheses

258

Delete excess parentheses

lld/ELF/OutputSections.cpp
351

Is target->nopInstrs redundant?

lld/ELF/Writer.cpp
1674

Better to check if there are cases where st_value > original_input_section_size.

I asked because in binutils, bfd/elfnn-riscv.c has a similar check.

	  if (sym->st_value > addr && sym->st_value <= toaddr)
	    sym->st_value -= count;
1683

Similarly, in BFD, this is:

	  else if (sym->st_value <= addr
		   && sym->st_value + sym->st_size > addr
		   && sym->st_value + sym->st_size <= toaddr)
	    sym->st_size -= count;
1718

excess parentheses

lld/test/ELF/bb-sections-and-icf.s
23

This comment is redundant.

38

ditto

40+ comments have not been addressed.

Since the patch was split many of the changes didn't apply as we separated the shrink instruction optimization into another. We will address the comments you pointed.

Sorry, but I don't think this can be accepted as is. I am not catching up recent discussions but I believe there are still major concerns about the overall approach, whether code should be implemented in CodeGenPrepare.cpp, etc.

Could you be more specific as to what is the major concern? Removing code from CodeGenPrepare.cpp is for the LLVM patch, not this one, and is not a major concern, we will be updating that patch removing all code from CodeGenPrepare.cpp. We cannot land this patch without first getiing the LLVM patch approved anyways.

Mark as "Request Changes" in fear of an accidental commit.

tmsriram updated this revision to Diff 246588.Feb 25 2020, 3:20 PM
tmsriram marked 77 inline comments as done.

Address reviewer comments, new test for PC32 reloc.

Pending: fixing size of symbols.

lld/ELF/Arch/X86_64.cpp
159

Could you please reconsider this? I understand what you mean here.
This code is going to change when new psABI relocations are added. Could we re-open this then?

176

Okay, added a test for PC32. Also, the PC8 reloc will be created when we shrink a jmp insn offset from 32 to 8. The actual shrink code will be presented in a separate patch which will test for the PC8 so I will leave that out for now. Further, when we add new psABI relocs, this code and tests will have to be modified.

lld/ELF/InputSection.cpp
942

I am assuming this is alright.

lld/ELF/OutputSections.cpp
351

I can just assert instead!

lld/ELF/Target.h
105

The grow part has been split, not applicable.

lld/ELF/Writer.cpp
43

Not applicable.

568

Not applicable.

649

Not applicable.

651

Not applicable.

657

Not applicable.

695

Not applicable.

720

Not applicable.

1674

I will address this in your most recent comment.

1683

Working on this one, will fix this in the next iteration. Keeping this open until then.

1696

Accomodating thunks and creating them optimally needs a little more thought. We haven't looked at this and we haven't tested it.

Could we do this in a later patch?

1719

Not applicable.

1726

Not applicable.

1743

Not applicable.

1749

Not applicable.

1751

Not applicable.

1758

Not applicable.

1758

Not applicable.

tmsriram updated this revision to Diff 246819.Feb 26 2020, 12:44 PM

Fix symbol shrinking code to incorporate reviewer comments.

tmsriram marked an inline comment as done.Feb 26 2020, 12:45 PM

A Phabricator tip: marking every review comment Not applicable. is not needed. You can just click "Close" (status: Unsubmitted) but don't click Submit. When you upload a new Diff (with arc diff), the comments will be closed automatically.

I believe a large number of Not applicable comments were originally applicable. There have been many changes (see the number of Diff in History: 16). A large number of comments were not addressed with several versions. Now a bunch of comments were commented as Not applicable at once... It made it slightly difficult to understand which Diff addresses the comments. Though, they don't matter now.

When you upload a new Diff, I hope you can rebase the patch on origin/master. "Patch not applicable on master" has been complained by other reviewers in some other patches.

I think a patch hierarchy has been suggested by other reviewers and me several times. It is not clear what patches to apply and in what order. You can mention it in the SUMMARY. It will be very useful and also save reviewers' time. (To be honest I felt a lot of pressure when I received messages after I clicked "Request Changes". I try to be responsive.)

lld/ELF/Arch/X86_64.cpp
159

What is the average size of is.relocations.size()? It it is small in practice, the comment should mention it.

A Phabricator tip: marking every review comment Not applicable. is not needed. You can just click "Close" (status: Unsubmitted) but don't click Submit. When you upload a new Diff (with arc diff), the comments will be closed automatically.

Alright, will do that in future.

I believe a large number of Not applicable comments were originally applicable. There have been many changes (see the number of Diff in History: 16). A large number of comments were not addressed with several versions. Now a bunch of comments were commented as Not applicable at once... It made it slightly difficult to understand which Diff addresses the comments. Though, they don't matter now.

Just to be clear, the original patch also had shrink and grow optimization which we split out of the main patch and hence many of the original comments don't apply.

When you upload a new Diff, I hope you can rebase the patch on origin/master. "Patch not applicable on master" has been complained by other reviewers in some other patches.

Sure, will do.

I think a patch hierarchy has been suggested by other reviewers and me several times. It is not clear what patches to apply and in what order. You can mention it in the SUMMARY. It will be very useful and also save reviewers' time. (To be honest I felt a lot of pressure when I received messages after I clicked "Request Changes". I try to be responsive.)

! In D68065#1894573, @MaskRay wrote:

A Phabricator tip: marking every review comment Not applicable. is not needed. You can just click "Close" (status: Unsubmitted) but don't click Submit. When you upload a new Diff (with arc diff), the comments will be closed automatically.

Sure, this patch depends on TargetOptions.h changes in D68063. I marked it as a parent patch. I will put a larger summary of all patches associated with BBSections and Propeller.

I believe a large number of Not applicable comments were originally applicable. There have been many changes (see the number of Diff in History: 16). A large number of comments were not addressed with several versions. Now a bunch of comments were commented as Not applicable at once... It made it slightly difficult to understand which Diff addresses the comments. Though, they don't matter now.

When you upload a new Diff, I hope you can rebase the patch on origin/master. "Patch not applicable on master" has been complained by other reviewers in some other patches.

I think a patch hierarchy has been suggested by other reviewers and me several times. It is not clear what patches to apply and in what order. You can mention it in the SUMMARY. It will be very useful and also save reviewers' time. (To be honest I felt a lot of pressure when I received messages after I clicked "Request Changes". I try to be responsive.)

tmsriram updated this revision to Diff 247143.Feb 27 2020, 5:21 PM

Rebase.

The number of relocations per input section with basic blocks is a few, eyeballing it is about 3-4 relocations including the extra jump relocation that gets added.

Here is the summary of all the basic block sections patches in clang, llvm and lld:

D68063 : Propeller: LLVM support for basic block sections [BB: LLVM1]
D73674 : Propeller: LLVM support for basic block sections (Base Patch - Part 2) [BB: LLVM2]
D68049 : Propeller: Clang options for basic block sections [BB: Clang]
D68065 : Propeller: LLD Support for Basic Block Sections [BB: LLD] -- This Patch!
D73497 : lld: Propeller framework part I [Propeller: LLD]

 BB: LLVM1
  /                \
/                    \

BB: LLVM2 BB: LLD

|           \                    |
|             \                  |
|               \                |

BB: CLang Propeller: LLD

ruiu added a comment.Feb 27 2020, 8:32 PM

If Sriraman has already resolved all Fangrui's comments, I can reiterate my LGTM, as the new code is well written, and I don't have a specific concern. I actually think that this is an interesting feature and in some degree a natural extension of -ffunction-sections.

ruiu accepted this revision.Mar 16 2020, 2:51 AM

It looks like there's no activities in the llvm-dev thread and this review thread, so I'll reiterate my LGTM, as I think we should not block this change, in particular given that the corresponding change to LLVM has been submitted.

In general I think a lot of this looks pretty good. It could use some more comments - in particular to call out what a lot of this code is used for. It's not used on a typical linking path and so could be confusing to people going through the code.

-eric

lld/ELF/OutputSections.cpp
249

Might want to assert size != 0 or early return.

tmsriram updated this revision to Diff 250933.Mar 17 2020, 3:49 PM
tmsriram marked an inline comment as done.

Address Reviewer comments.

+ More comments
+ Check for early return in nopInstrFill

In general I think a lot of this looks pretty good. It could use some more comments - in particular to call out what a lot of this code is used for. It's not used on a typical linking path and so could be confusing to people going through the code.

-eric

I have added more comments and mentioned that this is only applicable with basic block sections. I have also added "TODO:" in places where the code is going to get a lot simpler with newer relocations

@echristo Eric, is this alright now? Thanks.

MaskRay requested changes to this revision.EditedMar 20 2020, 5:30 PM

I am very glad to see that we have made progress by landing D68063 (llvm/CodeGen/CommandFlags.inc) and D73674 (llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp). Basic block sections is agreed to be useful even outside Properller.

There are several optimizations goals:

  • Alignment inserting
  • Automatic cache prefetching
  • Large code model addressing can lower performance quite a bit. A post-link scheme can relax large code model addressting to small code model addressing.
  • ...

There is a CPU erratum that we want to mitigate.

  • Intel's Jump Condition Code Erratum

By making this change, we will go the object file level route: annotate object files with metadata so that certain transformations can be performed.

Whether this scheme can satisfy the goals and avoid the erratum, and the uncertainty about how many stuff we will have to reinvent is my biggest concerns.

On https://lists.llvm.org/pipermail/llvm-dev/2020-February/139543.html (my brainstorming), I mentioned we may achieve our goals and make it suitable for future optimizations by using a file format with more semantics (rather than an object file). I hope we can think more on this, rather than rush to conclusions "this is redoing full LTO. it can't scale"

Considering the above points, I re-iterate my "Request Changes". We need a plan to prove that we can achieve our optimization goals while avoiding caveats (erratum).

This revision now requires changes to proceed.Mar 20 2020, 5:30 PM

Fangrui, I agree with Eric here. JCC erratum handling is not part of the core feature and it is perfectly reasonable to submit that as a follow up patch. This review thread has been going on for almost half a year and if the comments about the core patch have been addressed, we shall unlock the progress and move on.

In the meantime, the discussion on the erratum handling should also start -- probably in the contexts of other mitigations (not just this JCC one).

thanks,

David

I am very glad to see that we have made progress by landing D68063 (llvm/CodeGen/CommandFlags.inc) and D73674 (llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp). Basic block sections is agreed to be useful even outside Properller.

There are several optimizations goals:

  • Alignment inserting
  • Automatic cache prefetching
  • Large code model addressing can lower performance quite a bit. A post-link scheme can relax large code model addressting to small code model addressing.
  • ...

There is a CPU erratum that we want to mitigate.

  • Intel's Jump Condition Code Erratum

By making this change, we will go the object file level route: annotate object files with metadata so that certain transformations can be performed.

Whether this scheme can satisfy the goals and avoid the erratum, and the uncertainty about how many stuff we will have to reinvent is my biggest concerns.

On https://lists.llvm.org/pipermail/llvm-dev/2020-February/139543.html (my brainstorming), I mentioned we may achieve our goals and make it suitable for future optimizations by using a file format with more semantics (rather than an object file). I hope we can think more on this, rather than rush to conclusions "this is redoing full LTO. it can't scale"

Considering the above points, I re-iterate my "Request Changes". We need a plan to prove that we can achieve our optimization goals while avoiding caveats (erratum).

@echristo @ruiu

If the JCC erratum is the only concern then we are able to show now with experiments that Propeller can produce JCC erratum free binaries with almost no performance impact and only by using the existing assembler mitigations : http://lists.llvm.org/pipermail/llvm-dev/2020-March/140134.html

Let's use that thread to continue to investigate how the linker could potentially do a better job of handling this or other erratums in general. Could we please unblock this?

I am very glad to see that we have made progress by landing D68063 (llvm/CodeGen/CommandFlags.inc) and D73674 (llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp). Basic block sections is agreed to be useful even outside Properller.

There are several optimizations goals:

  • Alignment inserting
  • Automatic cache prefetching
  • Large code model addressing can lower performance quite a bit. A post-link scheme can relax large code model addressting to small code model addressing.
  • ...

There is a CPU erratum that we want to mitigate.

  • Intel's Jump Condition Code Erratum

By making this change, we will go the object file level route: annotate object files with metadata so that certain transformations can be performed.

Whether this scheme can satisfy the goals and avoid the erratum, and the uncertainty about how many stuff we will have to reinvent is my biggest concerns.

On https://lists.llvm.org/pipermail/llvm-dev/2020-February/139543.html (my brainstorming), I mentioned we may achieve our goals and make it suitable for future optimizations by using a file format with more semantics (rather than an object file). I hope we can think more on this, rather than rush to conclusions "this is redoing full LTO. it can't scale"

Considering the above points, I re-iterate my "Request Changes". We need a plan to prove that we can achieve our optimization goals while avoiding caveats (erratum).

@echristo @ruiu

If the JCC erratum is the only concern then we are able to show now with experiments that Propeller can produce JCC erratum free binaries with almost no performance impact and only by using the existing assembler mitigations : http://lists.llvm.org/pipermail/llvm-dev/2020-March/140134.html

Let's use that thread to continue to investigate how the linker could potentially do a better job of handling this or other erratums in general. Could we please unblock this?

I agree. While the erratum is going to be important I think that this going in now doesn't block any future work on erratum handling in the linker or make it more difficult to add other than adding it on top of propeller perhaps, but that's a different issue?

@MaskRay : Is the propeller patch going to (outside of in propeller) going to cause implementing the jcc erratum to be more difficult? If not, then I think we should go forward. If so, can you propose some (hopefully narrow) changes to this patch in order to make that more possible - even "you asked for jcc erratum and we can't do that here" as an error seems like it would be an ok incremental step forward?

dxf added a subscriber: dxf.Mar 23 2020, 2:33 PM
MaskRay added a comment.EditedMar 23 2020, 10:56 PM

I am very glad to see that we have made progress by landing D68063 (llvm/CodeGen/CommandFlags.inc) and D73674 (llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp). Basic block sections is agreed to be useful even outside Properller.

There are several optimizations goals:

  • Alignment inserting
  • Automatic cache prefetching
  • Large code model addressing can lower performance quite a bit. A post-link scheme can relax large code model addressting to small code model addressing.
  • ...

There is a CPU erratum that we want to mitigate.

  • Intel's Jump Condition Code Erratum

By making this change, we will go the object file level route: annotate object files with metadata so that certain transformations can be performed.

Whether this scheme can satisfy the goals and avoid the erratum, and the uncertainty about how many stuff we will have to reinvent is my biggest concerns.

On https://lists.llvm.org/pipermail/llvm-dev/2020-February/139543.html (my brainstorming), I mentioned we may achieve our goals and make it suitable for future optimizations by using a file format with more semantics (rather than an object file). I hope we can think more on this, rather than rush to conclusions "this is redoing full LTO. it can't scale"

Considering the above points, I re-iterate my "Request Changes". We need a plan to prove that we can achieve our optimization goals while avoiding caveats (erratum).

@echristo @ruiu

If the JCC erratum is the only concern then we are able to show now with experiments that Propeller can produce JCC erratum free binaries with almost no performance impact and only by using the existing assembler mitigations : http://lists.llvm.org/pipermail/llvm-dev/2020-March/140134.html

Let's use that thread to continue to investigate how the linker could potentially do a better job of handling this or other erratums in general. Could we please unblock this?

Intel JCC erratum is not the only concern. My bigger concern is whether we can achieve our post-link optimization goals other than layout shuffling with the current scheme:

  • Alignment inserting
  • Automatic cache prefetching
  • Large code model addressing can lower performance quite a bit. A post-link scheme can relax large code model addressting to small code model addressing. ...
  • ...

These points were already listed in my previous comments. I believe internally you probably have more brainstorming thoughts. As I said on https://lists.llvm.org/pipermail/llvm-dev/2020-March/139639.html , I am not yet convinced that with the no disassembly assumption, reordering opaque sections can achieve the above goals. Post-link optimization is not a new idea and there have been several engineering efforts before Propeller. However, Propeller is the first integrating the great idea into LLVM. As I said I look forward to its bright future. I just hope that we can create a generic framework. Our focus is currently section reordering. When we start to think future optimization opportunities, we don't need to create one, two, three, four more frameworks.

I saw Rahman posted https://lists.llvm.org/pipermail/llvm-dev/2020-March/140134.html yesterday. Sorry that I did not have time reading it today. If the idea is that more layout work will be done by the compiler, then it starts to look good to me.

I am very glad to see that we have made progress by landing D68063 (llvm/CodeGen/CommandFlags.inc) and D73674 (llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp). Basic block sections is agreed to be useful even outside Properller.

There are several optimizations goals:

  • Alignment inserting
  • Automatic cache prefetching
  • Large code model addressing can lower performance quite a bit. A post-link scheme can relax large code model addressting to small code model addressing.
  • ...

@echristo @snehasish @dxf @ruiu @tejohnson @rnk

All the above ideas you mentioned here, you heard it from my team in face to face meetings and the last one in internal discussions. I recall specifically telling you some of it is in very early stages and to keep it to yourself. You could have at least checked with us before mentioning it, as just courtesy if not anything else.

There is a CPU erratum that we want to mitigate.

  • Intel's Jump Condition Code Erratum

By making this change, we will go the object file level route: annotate object files with metadata so that certain transformations can be performed.

Whether this scheme can satisfy the goals and avoid the erratum, and the uncertainty about how many stuff we will have to reinvent is my biggest concerns.

On https://lists.llvm.org/pipermail/llvm-dev/2020-February/139543.html (my brainstorming), I mentioned we may achieve our goals and make it suitable for future optimizations by using a file format with more semantics (rather than an object file). I hope we can think more on this, rather than rush to conclusions "this is redoing full LTO. it can't scale"

Considering the above points, I re-iterate my "Request Changes". We need a plan to prove that we can achieve our optimization goals while avoiding caveats (erratum).

@echristo @ruiu

If the JCC erratum is the only concern then we are able to show now with experiments that Propeller can produce JCC erratum free binaries with almost no performance impact and only by using the existing assembler mitigations : http://lists.llvm.org/pipermail/llvm-dev/2020-March/140134.html

Let's use that thread to continue to investigate how the linker could potentially do a better job of handling this or other erratums in general. Could we please unblock this?

Intel JCC erratum is not the only concern. My bigger concern is whether we can achieve our post-link optimization goals other than layout shuffling with the current scheme:

  • Alignment inserting
  • Automatic cache prefetching
  • Large code model addressing can lower performance quite a bit. A post-link scheme can relax large code model addressting to small code model addressing. ...
  • ...

These points were already listed in my previous comments. I believe internally you probably have more brainstorming thoughts. As I said on https://lists.llvm.org/pipermail/llvm-dev/2020-March/139639.html , I am not yet convinced that with the no disassembly assumption, reordering opaque sections can achieve the above goals. Post-link optimization is not a new idea and there have been several engineering efforts before Propeller. However, Propeller is the first integrating the great idea into LLVM. As I said I look forward to its bright future. I just hope that we can create a generic framework. Our focus is currently section reordering. When we start to think future optimization opportunities, we don't need to create one, two, three, four more frameworks.

I don't mean any disrespect here but your tone suggests that you are quite experienced in this area :) If you have a better proposal, I strongly encourage you to propose it, evaluate it with experiments and performance numbers and get it into LLVM. Asking us to evaluate completely new designs along the lines of Full or Partial LTO is not feasible as it takes several weeks if not months, and IMHO, not reasonable particularly at this stage in the review.

We have put in a lot of effort towards this work to come this far. Asking us to go do it totally differently is not something we are going to do. We now know the partial LTO idea was actually not yours and only suggested to you, which I believe you should at least acknowledge for transparency.

Multiple people with significant LTO experience have told you ideas resembling Full LTO have scalability issues and ThinLTO has had a lot of adoption due to exactly this. If you want to prove them wrong, good luck to you but please don't ask us to do the heavy lifting!

As for disassembly, we have not presented a single patch that does any serious disassembly and we fully understand the pitfalls. We understand that the jump relaxation does mild disassembly and we are looking at relocations to overcome that as you already know.

We are looking at efficient ways to accomplish our other optimization objectives and we will present clear designs with experiments on llvm-dev when we get it done. The idea here is to do thin links like @mehdi_amini alluded to in that thread of yours: https://llvm.org/devmtg/2020-02-23/#kl which will do most of the transformations in the compiler and use thin links to generate summaries that are whole program.

I saw Rahman posted https://lists.llvm.org/pipermail/llvm-dev/2020-March/140134.html yesterday. Sorry that I did not have time reading it today. If the idea is that more layout work will be done by the compiler, then it starts to look good to me.

I urge you to read that as we spent significant time to conclusively prove that JCC erratum is a non-issue. I can summarize the plan to you:

  • We have been looking at constantly reducing the bloat from extra sections to be as low as possible. Some of the work we did here was to selectively create basic block sections.
  • During this, we realized we can even do better if we can compute basic block orders early immediately after profiling. This would require using a dynamic CFG but we have just protoyped it and it has the same performance benefits.
  • This could also moves the bulk of the Propeller work from the linker to a third party tool, create_llvm_prof. This patch is still necessary as it relates to bb sections.
  • This allows us to form larger sections in the compiler and not wait for the linker to do the reordering. We would still have to create multiple sections but a lot fewer, significantly reducing the bloats.
  • This means we can also reuse the existing assembler mitigations developed without having to reinvent them in the linker which gives us an immediate solution.
  • Performance is neutral (0.2% slowdown) after applying the mitigations. Infact, without Propeller the performance is down by 0.6% from the mitigations.
  • To be clear, we get all of the Propeller wins even with the mitigations, measured on clang benchmark.
  • We feel the linker can do a much better job here since the mitigations are only using NOPs and contrary to what you told us in the meeting, prefixes dont seem to help. You have noted this yourself so I am still wondering why you told us that prefixes help: https://reviews.llvm.org/D72225#1818149 where you say " NOP padding alone seems good"
  • The linker does not have to use the large hammer of aligning every function to 32 byte boundaries but this is something best discussed in a design thread

You also say in your previous message that "I am very glad to see that we have made progress by landing D68063 ..." and yet you are blocking this. This is ridiculous. If you have fundamental disagreements, you should also have been blocking the other patches. The other patches are not very useful without this, whats up with the selective blocking!

To conclude, it is perfectly fine if you are opposed to this and don't wish to unblock. I am trying to act in good faith and I am just going to have to push this around you if I get the approvals or kill basic block sections. We have to agree to disagree.

MaskRay added a comment.EditedMar 24 2020, 5:15 PM

Hi Sriraman,

Sounds like there is strong support for your patch, so let’s move forward on it.

I do have a few more code review items I’d like addressed if we can before we commit.
There are several nits about excess parentheses which do not fit with the rest of lld code.
Please scan the full patch and delete them. Rebase and upload a new diff so that I can patch my local repository with arc patch D68065.
It does not apply so it isn’t convenient for me to review the tests now.

In addition, we probably should move test/ELF/bb-sections-* tests to test/ELF/propeller/

I don't mean any disrespect here but your tone suggests that you are quite experienced in this area :) If you have a better proposal, I strongly encourage you to propose it, evaluate it with experiments and performance numbers and get it into LLVM. Asking us to evaluate completely new designs along the lines of Full or Partial LTO is not feasible as it takes several weeks if not months, and IMHO, not reasonable particularly at this stage in the review.

The first sentence is fair. I have just a bit more than 2 years experience contributing to LLVM. I made a llvm-dev proposal as everyone can see. In my spare time I will probably make more investigations how to speed up things.

We have put in a lot of effort towards this work to come this far. Asking us to go do it totally differently is not something we are going to do. We now know the partial LTO idea was actually not yours and only suggested to you, which I believe you should at least acknowledge for transparency.

Just wanted to defend for myself. I was not sure I could mention the person. That email also includes lots of my own ideas.

You also say in your previous message that "I am very glad to see that we have made progress by landing D68063 ..." and yet you are blocking this. This is ridiculous.

Please note that I made it very clear that basic block sections are helpful for other things. I was blocking this linker patch because of concerns about long-term viability and maintenance.

That said, I think we need to try out new ideas. I am now focusing on code view itself, instead of repeating my previous high level concerns.

lld/ELF/Arch/X86_64.cpp
66

Should follow variableName convention.

Add See X86AsmBackend::writeNopData

159
for (unsigned i = is.relocations.size(); i != 0; ) {
  --i;
  if (is.relocations[i].offset == offset && is.relocations[i].expr != R_NONE)
    return i;
}
168

delete parens

251

Nit: delete parens (is.getSize() - 4)

lld/ELF/Options.td
47

Add (default). See other options.

510

"Do not give unique names to every basic block section for LTO (default)"

lld/ELF/OutputSections.cpp
260

assert(nopFiller[remaining - 1].size() == remaining)

350–355

The code (if (isec->nopFiller)) self explains. No need for a comment.

lld/ELF/Target.h
96

delete excess space after //

lld/ELF/Writer.cpp
1674

I requested a research for st_value but I think it is not needed to get it accurate.

However, I don't think we should just copy the elfnn-riscv.c behavior.

if (def->value + def->size > NewSize && def->value <= OldSize && def->value + def->size <= OldSize) {

should be simplified to

if (def->value + def->size > NewSize && def->value + def->size <= OldSize) {
1679

Nit: Moving -> move (make it a bit simpler)

1703

Nit: move the condition to the call site.

1708

There is only one list item. Why use 1. ?

1718

Where is Step 2?

1723

Nit: delete parens

1726

Nit: static_cast<unsigned>

1732

Removing -> removed

2057

optimizeBasicBlockJumps calls assignAddresses, which was only called in finalizeAddressDependentContent.

We hope assignAddresses caller are grouped together (if in.symTab needs to be finalized first, please add a comment). Can you move this pass immediately before (or after) finalizeAddressDependentContent?

lld/test/ELF/bb-sections-and-icf.s
5

deleted

7

x86_64-pc-linux -> x86_64

8

delete excess space.

Prefer --optimize-bb-jumps over -optimize-bb-jumps

8

If --icf=all result is different from --icf=none. Add a comment.

17

Use ## for test comments.

18

Don't add excess space.

MaskRay added inline comments.Mar 24 2020, 5:19 PM
lld/test/ELF/bb-sections-delete-fallthru.s
14

Just delete {{[0-9|a-f| ]*}} if the address is not significant.

Please also apply to other test files.

19

Delete Begin function foo

Scrub clang output to the minimum

tmsriram updated this revision to Diff 252733.Mar 25 2020, 8:31 PM
tmsriram marked 33 inline comments as done and an inline comment as not done.

Address recent reviewer comments.

  • rebase
  • check it applies clean to trunk

Hi Sriraman,

Sounds like there is strong support for your patch, so let’s move forward on it.

I do have a few more code review items I’d like addressed if we can before we commit.
There are several nits about excess parentheses which do not fit with the rest of lld code.
Please scan the full patch and delete them. Rebase and upload a new diff so that I can patch my local repository with arc patch D68065.
It does not apply so it isn’t convenient for me to review the tests now.

In addition, we probably should move test/ELF/bb-sections-* tests to test/ELF/propeller/

In D68063 and D73674, we decided (one of your suggestions too) to remove all references to Propeller as this was related to basic block sections. This is also fully a part of bb sections and fits naturally and is part of core LLD. I dont think we should move it to propeller/.

I don't mean any disrespect here but your tone suggests that you are quite experienced in this area :) If you have a better proposal, I strongly encourage you to propose it, evaluate it with experiments and performance numbers and get it into LLVM. Asking us to evaluate completely new designs along the lines of Full or Partial LTO is not feasible as it takes several weeks if not months, and IMHO, not reasonable particularly at this stage in the review.

The first sentence is fair. I have just a bit more than 2 years experience contributing to LLVM. I made a llvm-dev proposal as everyone can see. In my spare time I will probably make more investigations how to speed up things.

We have put in a lot of effort towards this work to come this far. Asking us to go do it totally differently is not something we are going to do. We now know the partial LTO idea was actually not yours and only suggested to you, which I believe you should at least acknowledge for transparency.

Just wanted to defend for myself. I was not sure I could mention the person. That email also includes lots of my own ideas.

You also say in your previous message that "I am very glad to see that we have made progress by landing D68063 ..." and yet you are blocking this. This is ridiculous.

Please note that I made it very clear that basic block sections are helpful for other things. I was blocking this linker patch because of concerns about long-term viability and maintenance.

That said, I think we need to try out new ideas. I am now focusing on code view itself, instead of repeating my previous high level concerns.

lld/ELF/Arch/X86_64.cpp
159

This won't work because this function should return the max. size when not found. If I change that assumption, I must also change how this is consumed by its callers. Is there a reason why you want to search in the reverse direction? I mean I will change it if you could tell me why.

lld/ELF/OutputSections.cpp
350–355

I removed it but maybe I should say that NOPs are needed here as opposed to TRAP because they might be executed.

lld/ELF/Writer.cpp
1674

@amharc We are going back and forth here. We didn't think it was necessary to copy the behavior either, but since you suggested that we copy elfnn-riscv.c behavior, we went ahead and did it. We dont see a problem with either, so should we just leave it as is? Is there a particular concern here, that is not clear to me.

1679

I am not a native speaker but "Moving" seems more appropriate here unless you meant that 'M' should be lower case. My take, "move" seems to imply the user must do it.

1703

Moved and asserted at the beginning.

1708

Rephrased, shrinking the jump instrs was the 2. but I split that patch out and this remained unnoticed.

1718

Same deal.

1723

With parens reads better to me personally. If this is not acceptable w.r.t the coding style, lmk and I will delete.

1732

Past tense is preferred? I see mixed use here in many places in LLD.

2057

Correct me if I am wrong but finalizing in.symtab before we optimize seems important, I added a comment here. If we relax jumps going forward, we would definitely need to know how far they are.

lld/test/ELF/bb-sections-and-icf.s
8

Didn't follow. This test explicitly checks for the folding. You want to test icf=none too? Why?

tmsriram edited the summary of this revision. (Show Details)Mar 25 2020, 8:36 PM

Ping.

Will read tomorrow.

MaskRay added inline comments.Apr 2 2020, 2:51 PM
lld/test/ELF/bb-sections-delete-fallthru.s
9

delete excess space and use --optimize-bb-jumps

I mentioned this in a previous comment.

25

Add some spaces after CHECK: to make the label aligned.

For subsequent -NEXT: lines, you can increase the indentation (say, by 1) to make it clear the instructions follow the label:

# CHECK:      <a.BB.foo>
# CHECK-NEXT:  nopl (%rax)

Please fix other tests as well.

27

je {{.*}} <r.BB.foo> I have recently updated llvm-objdump -d to print the target address (to be consistent with GNU objdump ; is what most users desire) instead of a decimal PC relative immediate

No need to align the first operand.

MaskRay added inline comments.Apr 2 2020, 2:53 PM
lld/test/ELF/bb-sections-delete-fallthru.s
25
# CHECK:      <a.BB.foo>:
# CHECK-NEXT:  nopl (%rax)

Add a colon to make it clear that a.BB.foo is a label:

MaskRay added a comment.EditedApr 2 2020, 3:26 PM

I have a question about aaaaaaa.BB.foo style labels. Are they STB_LOCAL symbol? If so,

  • The assembler (MC) will generally convert relocations referencing STB_LOCAL to reference STT_SECTION instead.
  • The assembler will keep them in .symtab
  • The linker will retain such labels in the .symtab section in the executable/shared object unless --discard-locals is specified.
  • aaaa.BB.foo will just be marker symbols in executables/shared objects: "hey, these addresses are special and are referenced by some instructions."
  • The strip tool will drop .symtab . It seems that the executables/shared objects cannot be stripped.

Are all the items above expected?

lld/test/ELF/bb-sections-delete-fallthru.s
8

Drop -pc-linux.

I mentioned this in a previous comment.

10

Drop --check-prefix=CHECK. It is the default.

MaskRay added inline comments.Apr 2 2020, 3:32 PM
lld/ELF/Arch/X86_64.cpp
159
for (unsigned i = is.relocations.size(); i != 0; ) {
  --i;
  if (is.relocations[i].offset == offset && is.relocations[i].expr != R_NONE)
    return i;
}
return is.relocations.size();

An input section may have several relocations. Scanning backward can be more efficient.

tmsriram marked an inline comment as done.Apr 2 2020, 3:42 PM
tmsriram added inline comments.
lld/ELF/Arch/X86_64.cpp
159

We discussed this f2f and even about sorting it and you expressed that you were alright leaving it as is, maybe you forgot.

tmsriram marked an inline comment as done.Apr 2 2020, 3:44 PM
tmsriram added inline comments.
lld/ELF/Arch/X86_64.cpp
159

Forgot to add that since this will change when the new relocations come in.

MaskRay added inline comments.Apr 2 2020, 5:00 PM
lld/ELF/Arch/X86_64.cpp
159

No, I do not forget it. The relocations are still sorted in practice. It is just that this is not a guaranteed property. Iterating the relocations backward is faster.

tmsriram updated this revision to Diff 254655.Apr 2 2020, 5:20 PM
tmsriram marked 10 inline comments as done.

Rebase and make recent suggested changes.

I have a question about aaaaaaa.BB.foo style labels. Are they STB_LOCAL symbol? If so,

  • The assembler (MC) will generally convert relocations referencing STB_LOCAL to reference STT_SECTION instead.
  • The assembler will keep them in .symtab
  • The linker will retain such labels in the .symtab section in the executable/shared object unless --discard-locals is specified.
  • aaaa.BB.foo will just be marker symbols in executables/shared objects: "hey, these addresses are special and are referenced by some instructions."
  • The strip tool will drop .symtab . It seems that the executables/shared objects cannot be stripped.

Acknowledged. We are aware of all these points. We will be adding -mrelocate-with-symbols which will relocate with symbols instead of sections to perform relaxation more easily. These symbols are used to map profiles to virtual addresses and hence the unstripped binary is needed. This is our first version and we are brain-storming other methods including putting it in Debug Info. Thanks.

Are all the items above expected?

lld/ELF/Arch/X86_64.cpp
159

Alright, backward scanning now.

ruiu accepted this revision.Apr 6 2020, 8:29 PM

LGTM

I had a VC meeting wtih maskray to discuss concerns he has, and he is ok with this patch. I'm fine with this patch too. I'll give a final LGTM to unblock you. Please make the following changes and submit. Thanks!

lld/ELF/LTO.cpp
84

nit: if else has {}, add {} to other if and else if clauses.

94–95

Please use error() to report an error and use the same style as other error messages. E.g.

error("cannot open " + config->ltoBasicBlockSections + ":" + MBOrErr.getError().message());
MaskRay added inline comments.Apr 6 2020, 8:55 PM
lld/ELF/Writer.cpp
1722

delete excess parens

2057

Run after in.symTab is finalized.

Why is important?

This revision was not accepted when it landed; it landed in state Needs Review.Apr 7 2020, 7:02 AM
This revision was automatically updated to reflect the committed changes.
tmsriram marked 6 inline comments as done and an inline comment as not done.
tmsriram added inline comments.Apr 7 2020, 7:07 AM
lld/ELF/Writer.cpp
2057

Modified comment.

MaskRay added inline comments.Apr 7 2020, 12:00 PM
lld/ELF/Writer.cpp
2057

I am still not sure this is correct. .symtab and .strtab and potential .shstrtab are placed after everything else. The internal representation does not use the contents of .symtab at all.

Though, you already committed this patch. We probably don't want to change here more to cause churn.