This is an archive of the discontinued LLVM Phabricator instance.

Disable machine function splitting for functions with inline asm br
AbandonedPublic

Authored by adriantong1024 on Jul 13 2022, 11:44 AM.

Details

Summary

Terminator intstructions may need to be recoded after machine function splitting.
The ones in the inline assembly can not.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2022, 11:44 AM
adriantong1024 requested review of this revision.Jul 13 2022, 11:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2022, 11:44 AM

Update test case.

The ones in the inline assembly cannot.

This should be fixable on top of D129288. But an incremental fix separate from that would be okay.

Not sure the extra flag carries its weight; it's already hard enough to keep track of all the various flags we put on MachineFunction , and you can already easily check IsInlineAsmBrIndirectTarget() while you're iterating over basic blocks.

The ones in the inline assembly cannot.

This should be fixable on top of D129288. But an incremental fix separate from that would be okay.

Err, actually, now I'm confused. What exactly do you mean by "recoded"? Do you mean there's a rule that imposes a maximum distance between an INLINEASM_BR and its indirect destinations? Is that rule written down anywhere?

The ones in the inline assembly cannot.

This should be fixable on top of D129288. But an incremental fix separate from that would be okay.

Err, actually, now I'm confused. What exactly do you mean by "recoded"? Do you mean there's a rule that imposes a maximum distance between an INLINEASM_BR and its indirect destinations? Is that rule written down anywhere?

In the inline assembly, there may be branch that can only jump a limited distance, if we run MFS on the function, the resulting distance maybe to far to encode into the instruction.

The ones in the inline assembly cannot.

This should be fixable on top of D129288. But an incremental fix separate from that would be okay.

Not sure the extra flag carries its weight; it's already hard enough to keep track of all the various flags we put on MachineFunction , and you can already easily check IsInlineAsmBrIndirectTarget() while you're iterating over basic blocks.

I agree we should be careful to add more and more stuff into MachineFunction. Yes, using IsInlineAsmBrIndirectTarget() on all the basic blocks would achieve my purpose here as well. Thanks !

In the inline assembly, there may be branch that can only jump a limited distance, if we run MFS on the function, the resulting distance maybe to far to encode into the instruction.

If there's a rule like this, we should explicitly state in LangRef which branches are/are not allowed.

In the inline assembly, there may be branch that can only jump a limited distance, if we run MFS on the function, the resulting distance maybe to far to encode into the instruction.

If there's a rule like this, we should explicitly state in LangRef which branches are/are not allowed.

It also seems like such a case is easy to fix; in the inline asm the user should just use the wider encoding.

Maybe would be a surprising failure though, since you'd think the labels are nearby.

In the inline assembly, there may be branch that can only jump a limited distance, if we run MFS on the function, the resulting distance maybe to far to encode into the instruction.

If there's a rule like this, we should explicitly state in LangRef which branches are/are not allowed.

In the inline assembly, there may be branch that can only jump a limited distance, if we run MFS on the function, the resulting distance maybe to far to encode into the instruction.

If there's a rule like this, we should explicitly state in LangRef which branches are/are not allowed.

It also seems like such a case is easy to fix; in the inline asm the user should just use the wider encoding.

Maybe would be a surprising failure though, since you'd think the labels are nearby.

@efriedma Thanks for the comment. Should it not be the compiler's job not to break valid inline assembly user put into their code ? or this is where we should have a specification to tell the user to not assume the block to be very close by.
@nickdesaulniers Thanks for the comment. I agree this does not happen much. but in case of machine function splitting, I do see it happening once in one of our workloads. The cold blocks are relocated too far.

I don't think we should make this change.

Should it not be the compiler's job not to break valid inline assembly user put into their code?

I would argue that in this case, the asm is not "valid" -- it makes an unwarranted assumption as to the compiler's output. Unless there is a way to specify with an inline-asm constraint that the target address must be nearby, the assembly code cannot in general make such an assumption.

It also seems like such a case is easy to fix; in the inline asm the user should just use the wider encoding.

There's always going to be some limit to the distance unless you use an indirect branch. Well, except on weird targets like x86, where a direct branch can reach everywhere. I guess on targets like AArch64, you could mark the branch destinations as functions, and let the linker could insert a stub. But people writing inline asm probably don't expect a branch to clobber x16/x17...

Once we start imposing any restriction on the distance, we need a patch like this: if the destination is in a different section, we can't promise anything about the distance.

I wouldn't expect this to be an issue on x86, though; even on 64-bit, people normally use the "small" code model, so a "jmp" should reach anywhere in the binary. So I'm not sure what the testcase is supposed to be testing.

adriantong1024 abandoned this revision.Jul 14 2022, 1:27 PM

It also seems like such a case is easy to fix; in the inline asm the user should just use the wider encoding.

There's always going to be some limit to the distance unless you use an indirect branch. Well, except on weird targets like x86, where a direct branch can reach everywhere. I guess on targets like AArch64, you could mark the branch destinations as functions, and let the linker could insert a stub. But people writing inline asm probably don't expect a branch to clobber x16/x17...

Once we start imposing any restriction on the distance, we need a patch like this: if the destination is in a different section, we can't promise anything about the distance.

I wouldn't expect this to be an issue on x86, though; even on 64-bit, people normally use the "small" code model, so a "jmp" should reach anywhere in the binary. So I'm not sure what the testcase is supposed to be testing.

Thanks for the discussion!

The problem this patch is trying to fix is discovered on AArch64 where the conditional branch b.ge is too short after MFS places hot and cold blocks apart.

I am convinced its a bad idea to impose any restriction on the distance. I think having a warning in MFS to help user prevent linker error would be something nice to have.

The problem this patch is trying to fix is discovered on AArch64 where the conditional branch b.ge is too short after MFS places hot and cold blocks apart.

Isn't it straightforward to change b.ge to b.lt to the other branch destination, then b (no condition) to the label?

The problem this patch is trying to fix is discovered on AArch64 where the conditional branch b.ge is too short after MFS places hot and cold blocks apart.

Isn't it straightforward to change b.ge to b.lt to the other branch destination, then b (no condition) to the label?

We can do this. I am slightly worried about performance as this is in a performance critical part of the code.

The problem this patch is trying to fix is discovered on AArch64 where the conditional branch b.ge is too short after MFS places hot and cold blocks apart.

Isn't it straightforward to change b.ge to b.lt to the other branch destination, then b (no condition) to the label?

We can do this. I am slightly worried about performance as this is in a performance critical part of the code.

Generally, asm goto is modeled as the indirect branches being taken are the exceptional cases. So the indirect branch targets should be treated as if they were cold. If they're moved far away...good.

Otherwise if that's surprising for that code, it sounds like machine function splitting should be disabled for that function.

The problem this patch is trying to fix is discovered on AArch64 where the conditional branch b.ge is too short after MFS places hot and cold blocks apart.

Isn't it straightforward to change b.ge to b.lt to the other branch destination, then b (no condition) to the label?

We can do this. I am slightly worried about performance as this is in a performance critical part of the code.

Generally, asm goto is modeled as the indirect branches being taken are the exceptional cases. So the indirect branch targets should be treated as if they were cold. If they're moved far away...good.

Otherwise if that's surprising for that code, it sounds like machine function splitting should be disabled for that function.

I think MFS is doing the right thing to move the target block away (as it is cold in the profile). Your suggestion of changing b.ge to b.lt is probably not as bad as I initially thought, because the b (no condition) is going to be rarely executed. However, it does make code size 4 bytes larger, which is probably not a big problem either.

AArch64 "b" has a range of +-128MB. Which isn't enough for arbitrary programs. So in general, you need a sequence like the following (assuming small code model):

adrp x0, dest
add x0, x0, :lo12:dest
blr x0

That is, unless you're okay with the restriction that your binary is at most 128MB. Which might be reasonable for the Linux kernel, I guess. But again, something you'd want to document...


That said, I'm surprised machine function splitting on aarch64 works without any other changes; currently, branch relaxation isn't aware of section markings at all. Or do you have some other out-of-tree patches?

AArch64 "b" has a range of +-128MB. Which isn't enough for arbitrary programs. So in general, you need a sequence like the following (assuming small code model):

adrp x0, dest
add x0, x0, :lo12:dest
blr x0

That is, unless you're okay with the restriction that your binary is at most 128MB. Which might be reasonable for the Linux kernel, I guess. But again, something you'd want to document...


That said, I'm surprised machine function splitting on aarch64 works without any other changes; currently, branch relaxation isn't aware of section markings at all. Or do you have some other out-of-tree patches?

I think in case 128MB is not enough, the linker will help here. https://reviews.llvm.org/D39744

MFS does not work on AArch64, I am trying to make it work. I extended branch relaxation to handle cross-section branches. I plan to send out a RFC before sending up the out-of-tree patches I have.

Thanks !

The primary issue with range extension thunks is that they clobber x16 (and in theory are allowed to clobber x17). So we'd need to ensure that all asm goto blocks clobber x16 and x17.

I can't think of any other issue with depending on range extension thunks, I guess. (See also https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#call-and-jump-relocations .)

The primary issue with range extension thunks is that they clobber x16 (and in theory are allowed to clobber x17). So we'd need to ensure that all asm goto blocks clobber x16 and x17.

I can't think of any other issue with depending on range extension thunks, I guess. (See also https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#call-and-jump-relocations .)

I guess we could error if the inline asm used x16 or x17 and we did do MFS? IIRC, we already error for 32b r7 being reserved (under some conditions, which I've forgotten). Or maybe avoid MFS if the inline asm had x16 or x17 in its clobber list?

The primary issue with range extension thunks is that they clobber x16 (and in theory are allowed to clobber x17). So we'd need to ensure that all asm goto blocks clobber x16 and x17.

I can't think of any other issue with depending on range extension thunks, I guess. (See also https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#call-and-jump-relocations .)

I guess we could error if the inline asm used x16 or x17 and we did do MFS? IIRC, we already error for 32b r7 being reserved (under some conditions, which I've forgotten). Or maybe avoid MFS if the inline asm had x16 or x17 in its clobber list?

Yes, I think this is a good option. Thanks !