This is an archive of the discontinued LLVM Phabricator instance.

[MC] Change AsmParser to leverage Assembler during evaluation
ClosedPublic

Authored by niravd on Apr 2 2018, 7:37 AM.

Details

Summary

Teach AsmParser to check with Assembler for when evaluating constant expressions. This improves the handing of preprocessor expressions that must be resolved at parse time. This idiom can be found as assembling-time assertion checks in Source-level assemblers. Note that this relies on the MCStreamer to keep sufficient tabs on Section / Fragment information which the MCAsmStreamer does not. As a result the textual output may fail where the equivalent object generation would pass. This can most easily be resolved by folding the MCAsmStreamer and MCObjectStreamer together which is planned for in a separate patch.

Diff Detail

Repository
rC Clang

Event Timeline

niravd created this revision.Apr 2 2018, 7:37 AM
niravd updated this revision to Diff 140623.Apr 2 2018, 7:37 AM
  • [MC] Allow MCAssembler to be constructed without all subcomponents. NFCI.
  • [MC] Modify MCAsmStreamer to always build MCAssembler. NFCI.
niravd updated this revision to Diff 140624.Apr 2 2018, 7:38 AM
  • [MC] Change AsmParser to leverage Assembler during evaluation.
Harbormaster completed remote builds in B16626: Diff 140623.
Harbormaster completed remote builds in B16627: Diff 140624.
niravd retitled this revision from [MC] Allow MCAssembler to be constructed without all subcomponents. NFCI. to [MC] Change AsmParser to leverage Assembler during evaluation.Apr 2 2018, 7:49 AM
niravd edited the summary of this revision. (Show Details)
niravd added a comment.Apr 2 2018, 7:52 AM

For ease of reviewing the history reflects 3 stages of patches:

  1. [MC] Allow MCAssembler to be constructed without all subcomponents (NFCI) which is not guaranteed by MCAsmStreamer
  2. [MC] Modify MCAsmStreamer to always build MCAssembler. NFCI.
  3. [MC] Change AsmParser/Streamers to leverage Assembler during expression evaluation.
rnk added a comment.Apr 2 2018, 10:23 AM

This can most easily be resolved by folding the MCAsmStreamer and MCObjectStreamer together which is planned for in a separate patch.

So, basically, gas directives allow the user to do a lot of reflection. In order to support that reflection, we need to lay out an object file, regardless of whether we're emitting textual assembly or an object.

I checked, and this doesn't appear to affect users of -fno-integrated-as, because the AsmPrinter will bypass the MCStreamer API when that flag is set.

espindola edited reviewers, added: espindola; removed: rafael.Apr 2 2018, 4:42 PM

As a result the textual output may fail where the equivalent object generation would pass.

I don't think that is OK.

Why can't the asm streamer blindly print the entire if to the output?

niravd added a comment.Apr 3 2018, 8:49 AM

As a result the textual output may fail where the equivalent object generation would pass.

I don't think that is OK.

It's certainly not ideal but this is at least a somewhat reasonable intermediate point until the follow up patch is finished. The divergence between object and text only happens with preprocessor directives in assembly which should mostly happen with .S files which are probably being assembled directly to object.

The follow up patch to requires merging the various ObjectStreamer and AsmStreamer paths and is rather large.

Why can't the asm streamer blindly print the entire if to the output?

The 'if' is a preprocessor directive and is only valid in the input (.S preprocessor assembly) and not the output (.s preprocessed assembly).

As a result the textual output may fail where the equivalent object generation would pass.

I don't think that is OK.

It's certainly not ideal but this is at least a somewhat reasonable intermediate point until the follow up patch is finished. The divergence between object and text only happens with preprocessor directives in assembly which should mostly happen with .S files which are probably being assembled directly to object.

The follow up patch to requires merging the various ObjectStreamer and AsmStreamer paths and is rather large.

That would cause us to compute offsets when producing a .s file, right? If we must process .if directives instead of printing them, I don't think we have another option.

But I still don't think this is a reasonable intermediary step. Producing equivalent output is a big guarantee of MC with very few exceptions (bugs).

Why can't the asm streamer blindly print the entire if to the output?

The 'if' is a preprocessor directive and is only valid in the input (.S preprocessor assembly) and not the output (.s preprocessed assembly).

Is that a documented restriction?

As a result the textual output may fail where the equivalent object generation would pass.

I don't think that is OK.

It's certainly not ideal but this is at least a somewhat reasonable intermediate point until the follow up patch is finished. The divergence between object and text only happens with preprocessor directives in assembly which should mostly happen with .S files which are probably being assembled directly to object.

The follow up patch to requires merging the various ObjectStreamer and AsmStreamer paths and is rather large.

That would cause us to compute offsets when producing a .s file, right? If we must process .if directives instead of printing them, I don't think we have another option.

This is really tricky too...

If you compute offsets when producing a textual assembly file, except in _very_ limited circumstances where the layout is self-evident and not up to interpretation, you're going to risk getting a different answer than the actual assembler. Consider that X86 has many ways to encode a given instruction, and different assemblers may or may not choose to encode a given textual instruction into the same size output.

For example, llvm used to assemble "movw %cs, (%eax)" as [0x66,0x8c,0x08], instead of [0x8c,0x08]. They mean the same thing, and GNU as has used the short sequence for ages. It would be pretty horrible if you had something like the following input, and at the end of processing through llvm to textual asm, and then GNU as, you ended up with only 2 bytes of output, which shouldn't be possible. (this is a contrived example, yes...)

foo:
  movw %cs, (%eax)
.if . - foo == 2
.byte 0
.endif

I suppose we might be able to emit a textual asm file that uses only ".byte"/".word"/etc directives...instead of textual instructions. Then we _could_ be certain of the size. (Although that may not actually be feasible when relocations are involved? And in any case, super-ugly and I doubt what any user would want to see...)

Why can't the asm streamer blindly print the entire if to the output?

The 'if' is a preprocessor directive and is only valid in the input (.S preprocessor assembly) and not the output (.s preprocessed assembly).

Is that a documented restriction?

I don't see a reason why we couldn't emit an ".if" directive into the output. However, I don't see how emitting the original conditions could really be viable, unless you're just passing the entire input textually through to the output without parsing it at all. Consider that _anything_ can go inside an .if/.endif. E.g. defining a macro, or starting a new section. You'd effectively need to fork the entire assembler state upon seeing such a condition, to assemble each path of the conditional separately, and then output both possibilities. And keep forking, on every conditional in the input. The combinatorics of that would be very unfortunate...

One alternative I see for supporting textual output would be to emit a _verification_ check for the value of every layout-dependent absolute expression which was evaluated during the compile. E.g., in the above example, emit something like:

foo:
  movw %cs, (%eax)
.Ltmp0:

/* Verification of layout assumptions: */
.if .Ltmp0 - foo != 3
.err
.endif

However, even given an implementation strategy that seems like it'd probably work, I'm not really sure supporting this for textual output is really that worthwhile?

Nirav: do you know if this comes up in inline asm in any real world project? If not, perhaps this feature could just be disabled when evaluating llvm inline asm expressions, where the ability to emit a .s file is critical.

But for a standalone assembler -- where this sort of use-case occurs rather frequently -- is it really important that you be able to re-emit textual asm?

It's documented that we output (.s) and I believe this is specifically so we are compatible with assemblers without sufficient preprocessor support. It may be reasonable to add a (.S) output but as it's been pointed out the textual semantics of the preprocessor are not suited for this and
error/warnings quality would almost certainly degrade.

All of the inputs I've seen are .S files; no inline assembly. They've been are limited to .data blocks where there's no ambiguity about sizes (This is what the current patch handles). The gnu assembler does a bit more and handles assembler-dependent preprocessor expressions when the intermediate artifact's sizes are explicitly known (i.e. data and instructions of known size), but it's not clear if the extra capabilities are ever used or needed (The closest I've found is a case where a .fill directive was used to do pad a block with nops but that's utterable in our assembly output currently). Regardless, gas's support rules out all of the tricky cases James mentioned so textual output is reasonable as a output artifact (at least as reasonable as what we have currently).

I have not found any documentation giving guarantees about the correspondence between output types, but it seem natural to me that direct object generation may be more permissive than compilation through assembly hence this patch. There already appear to be additional restrictions in the AsmStreamer (e.g. dwarf CUID) over the ObjectStreamer so this isn't a new thing.

That said, modulo the extra bookkeeping costs for textual assembly additional checks for incomplete assemblers, there's no real reason why MCAsmStreamer and MCObjectStreamer are separate structures and it would be good to eventually merge them.

rnk added a comment.Apr 11 2018, 9:36 AM

I have not found any documentation giving guarantees about the correspondence between output types, but it seem natural to me that direct object generation may be more permissive than compilation through assembly hence this patch. There already appear to be additional restrictions in the AsmStreamer (e.g. dwarf CUID) over the ObjectStreamer so this isn't a new thing.

I'm also not that concerned about this difference.

That said, modulo the extra bookkeeping costs for textual assembly additional checks for incomplete assemblers, there's no real reason why MCAsmStreamer and MCObjectStreamer are separate structures and it would be good to eventually merge them.

Why? What would you replace the MCStreamer interface virtual dispatch with? Would MCStreamer become the main implementation, with every method checking if (emitTextualAssembly) OS << ".foo";? That doesn't seem like a clear win.

That said, modulo the extra bookkeeping costs for textual assembly additional checks for incomplete assemblers, there's no real reason why MCAsmStreamer and MCObjectStreamer are separate structures and it would be good to eventually merge them.

Why? What would you replace the MCStreamer interface virtual dispatch with? Would MCStreamer become the main implementation, with every method checking if (emitTextualAssembly) OS << ".foo";? That doesn't seem like a clear win.

Yes. If we're going to require the assembly and object generation must be the same, we're going to need to do equivalent bookkeeping and factoring out just the bookkeeping into a merged seems unreasonable given it depends on enough details from the various ObjectStreamer subclass. Textual output would effectively be a trace emitted during a truncated object generation.

If we're okay with object generation being more permissive than textual generation for assembly files then the only potential issue is changes exposed by inline asm; I believe we do expect compilation to always be able to generate textual assembly if we can generate an object for C compilation.
This could be resolved by disabling assembler-level information for inline assembly.

If I've understood correctly, this will evaluate the expression if there is something simple and unrelaxable such as (Armv7a)

.thumb
start:
nop
end:
.if (end - start == 2)

But not if there may be relaxations involved:

.thumb
start:
ldr r0,=0x12345678 // Relaxable instruction that generates a constant pool.
end:
.if (end - start == 2) // expect error message here

If this is the case, please can there be a test that checks for the error message as I think it is important that we don't accidentally allow these expressions to be evaluated early if their result depends on a later layout pass.

niravd updated this revision to Diff 142957.Apr 18 2018, 12:06 PM

Add tests to verify failure when layout-dependant cases happen.

Thanks for adding the test, it looks like it will cover my concern.

niravd updated this revision to Diff 143128.Apr 19 2018, 10:57 AM

Disable assembler-information for parsing of inline assembly to maintain equivalence of object and assembly generation from llvm ir paths.

Ping. Just a recap out the state of this patch:

  • Assembler information is only enabled for compilation from assembly to object files (compilation from LLVM IR/ C will be equivalent between assembly and objects)
  • I've most of a follow up patch which merges the AsmSstreamer and ObjectStreamer, emitting assertion checks in the simplified assembly where decisions that may not be upheld by the eventual assembler are checked.
  • With this we can evaluated at parse time relative offset differences in sections with fixed sized values (data and non-relaxable instructions). The GNU assembler appears to to marginally better in that it only requires the bits between the offsets are known size, but appears non-essential and can be added afterwards.

FWIW I'm in favour of this approach, and of merging the Asm and ObjectStreamer as I've recently found a case where this would have been useful (deriving a MCSubtargetInfo when emitting constant pools). There were others with stronger objections though.

Since this is using information inside a single fragment when producing assembly I am OK with it.

Different assemblers will have different relaxations, but looking at offsets of the labels in the same fragment should be ok.

Ping? I think we've a positive consesus. Anyone want to LGTM?

peter.smith accepted this revision.Apr 27 2018, 1:49 AM

I'm happy to LGTM. Apologies for the delay.

This revision is now accepted and ready to land.Apr 27 2018, 1:49 AM
This revision was automatically updated to reflect the committed changes.