This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ][z/OS] Add support to validate a HLASM Label.
ClosedPublic

Authored by anirudhp on Mar 1 2021, 4:42 PM.

Details

Summary
  • This patch adds in support to determine whether a particular label is valid for the hlasm variant
  • The label syntax being checked is that of an ordinary HLASM symbol (Reference, Chapter 2 (Coding and Structure) - Terms, Literals and Expressions - Terms - Symbols - Ordinary Symbol)
  • To achieve this, the virtual function isLabel defined in MCTargetAsmParser.h is made use of
  • The isLabel function is overridden in SystemZAsmParser for the hlasm variant, and the syntax is checked appropriately
  • Things remain unchanged for the att variant
  • Further patches will add in support to emit the label. These future patches will make use of this isLabel function

Diff Detail

Event Timeline

anirudhp created this revision.Mar 1 2021, 4:42 PM
anirudhp requested review of this revision.Mar 1 2021, 4:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 1 2021, 4:42 PM
anirudhp retitled this revision from [SystemZ][z/OS to [SystemZ][z/OS] Add support to validate a HLASM Label..Mar 1 2021, 4:46 PM
anirudhp edited the summary of this revision. (Show Details)
anirudhp set the repository for this revision to rG LLVM Github Monorepo.
hubert.reinterpretcast added inline comments.
llvm/lib/Target/SystemZ/AsmParser/SystemZAsmParser.cpp
1609

Checking the reverse condition and doing the quick exit first reduces the indentation level of almost the entire function body. Please consider it.

1610

Missing period at the end of the sentence.

1611

Missing "label" after HLASM?

1614

Please fix the spelling for "alphanumeric". Also, I don't think the hyphen should be used here. If the "62" is a reference to the possible length, then the sentence is missing "up to".

Also, by "alphabetic", you mean (aside from _) 'considered alphabetic in the "C" locale'?

1616

Let's stick with "alphabetic"? Also, add the period to end the sentence.

1617

This is redundant with point 1 unless if some more subtle point is being made.

1618–1620

Please fix various typos, missing commas, and formatting issues.

1623

It might not hurt to check that the string isn't empty somehow.

1630

Is this supposed to be a call to the locale-sensitive check? LLVM has an isAlpha function.

1635

s/an alphabet/alphabetic/;

1637

Minor nit: Use prefix ++.

1638

See my comment about isalpha versus isAlpha.

uweigand added inline comments.Mar 2 2021, 12:53 AM
llvm/lib/Target/SystemZ/MCTargetDesc/SystemZMCAsmInfo.h
17

Hmm, if we move this to a more "public" location we might want to rename the type to something like SystemZAsmDialect to avoid potential conflicts.

Thank you reviewers for your feedback! I will address them shortly.

An additional point:
I checked the V1R6 Reference manual again and I seem to have forgotten to include a check for "National Characters" ($,@ and #). I would like to remedy this by adding two private inline functions isHLASMAlpha and isHLASMAlnum which encapsulates the checks for what is an "HLASM" alphabet, and use them in the overridden isLabel routine. Is this acceptable?

llvm/lib/Target/SystemZ/AsmParser/SystemZAsmParser.cpp
1614

Yes. By "alphabetic", aside from "_" it is what is considered alphabetic in the "C" locale (a character in the range 'a' to 'z' or 'A' to 'Z')

1630

As mentioned above, the intent of the call to isalpha is to check whether the character is alphabetic according to the c-locale.

If it is preferred to use the isAlpha function provided in StringExtras.h I can use it.

1638

Same comment as above. If it is preferred to use the LLVM isAlnum I can use that.

An additional point:
I checked the V1R6 Reference manual again and I seem to have forgotten to include a check for "National Characters" ($,@ and #). I would like to remedy this by adding two private inline functions isHLASMAlpha and isHLASMAlnum which encapsulates the checks for what is an "HLASM" alphabet, and use them in the overridden isLabel routine. Is this acceptable?

This raises an issue having to do with the encoding used for string literals. It seems that the XL compiler processes the string like it would for emission into the object file and then the assembly is parsed as IBM-1047.

For example:

%:pragma filetag("IBM-1143")
%:include <stdio.h>
int main(void) ??<
  unsigned a = 0;
  %:pragma convert("IBM-1026")
  __asm__(
  "LAB??/x5B??/x7B??/x7C AFI %0,C'A' "
  : "+r"(a)
  :
  );
  printf("%02x??/n", a);
??>

parses okay because 0x5B, 0x7B, and 0x7C are the encodings for the "National Characters" in IBM-1047.

llvm/lib/Target/SystemZ/AsmParser/SystemZAsmParser.cpp
1614

I think adding (in the "C" locale) to the comment would help clarify that.

1630

I believe isAlpha is preferred (because isalpha will give an answer based on the current locale).

1638

Same response: isAlnum is a better match in terms of intent.

anirudhp updated this revision to Diff 327820.Mar 3 2021, 9:15 AM
anirudhp added a reviewer: hubert.reinterpretcast.
  • Fixed grammatical errors in various comments and in some cases, made them more descriptive.
  • Added support to check for national characters ($,@,#) along with "_". To achieve this, introduced two new private inline functions called isHLASMAlpha and isHLASMAlnum
anirudhp marked 14 inline comments as done.Mar 3 2021, 9:16 AM
anirudhp added inline comments.
llvm/lib/Target/SystemZ/AsmParser/SystemZAsmParser.cpp
1623

I would assume the parsing support which will eventually make use of this routine, will ensure that we don't call it if its empty. But like you mentioned, there's no harm in checking if it isn't empty. I've added the check.

anirudhp marked an inline comment as done.Mar 3 2021, 9:17 AM

An additional point:
I checked the V1R6 Reference manual again and I seem to have forgotten to include a check for "National Characters" ($,@ and #). I would like to remedy this by adding two private inline functions isHLASMAlpha and isHLASMAlnum which encapsulates the checks for what is an "HLASM" alphabet, and use them in the overridden isLabel routine. Is this acceptable?

This raises an issue having to do with the encoding used for string literals. It seems that the XL compiler processes the string like it would for emission into the object file and then the assembly is parsed as IBM-1047.

For example:

%:pragma filetag("IBM-1143")
%:include <stdio.h>
int main(void) ??<
  unsigned a = 0;
  %:pragma convert("IBM-1026")
  __asm__(
  "LAB??/x5B??/x7B??/x7C AFI %0,C'A' "
  : "+r"(a)
  :
  );
  printf("%02x??/n", a);
??>

parses okay because 0x5B, 0x7B, and 0x7C are the encodings for the "National Characters" in IBM-1047.

Thank you for bringing this up Hubert!

For now, We don’t have a firm decision on whether we will implement/support pragma convert. I understand that if it is supported, it will be a problem since the character values for the national characters will be different based on the encoding. At a future point, when we have more information, we can come back and review whether certain applicable functions would need to take the encoding into account, or, whether it can be designed in a way such that for example, everything is encoded in UTF-8 and the emission/translation is handled elsewhere (preferably much earlier than the respective backend). For now, I'm assuming the current implementation is fine.

uweigand accepted this revision.Mar 4 2021, 7:13 AM

LGTM, thanks.

This revision is now accepted and ready to land.Mar 4 2021, 7:13 AM

For now, We don’t have a firm decision on whether we will implement/support pragma convert. I understand that if it is supported, it will be a problem since the character values for the national characters will be different based on the encoding. At a future point, when we have more information, we can come back and review whether certain applicable functions would need to take the encoding into account, or, whether it can be designed in a way such that for example, everything is encoded in UTF-8 and the emission/translation is handled elsewhere (preferably much earlier than the respective backend). For now, I'm assuming the current implementation is fine.

And the current implementation here expects UTF-8 (or that's what the internal character encoding is assumed to be anyway). There's a patch for -fexec-charset and the string being parsed here went into the IR in what form?

And the current implementation here expects UTF-8 (or that's what the internal character encoding is assumed to be anyway). There's a patch for -fexec-charset and the string being parsed here went into the IR in what form?

Currently, the asm string goes into the IR in ASCII, irrespective of the -fexec-charset option.

Kai accepted this revision.Mar 5 2021, 9:58 AM

LGTM.
The mentioned issue with the encoding does not affect this change. It's nevertheless a very valuable hint, as there are some implications with char/string literals inside an inline asm string, but this is not part of this change.

And the current implementation here expects UTF-8 (or that's what the internal character encoding is assumed to be anyway). There's a patch for -fexec-charset and the string being parsed here went into the IR in what form?

Currently, the asm string goes into the IR in ASCII, irrespective of the -fexec-charset option.

Numeric escapes will be a problem (because any usage of such in HLASM inline assembly is not likely to be encoding for UTF-8). I don't know how common such usage is.

In D97748#2607182, @Kai wrote:

there are some implications with char/string literals inside an inline asm string

Yes, the defined semantics associated with constructs like CE'^' were designed to work from a baseline where the HLASM source is in EBCDIC. With XL, inline assembly just doesn't work with -qASCII without also having the associated strings "converted" to EBCDIC by user intervention of some sort. So now there's a need to convert only some of the characters from the inline assembly to EBCDIC from UTF-8 at some point (which needs to be one where we know what EBCDIC encoding to convert to). If there's no EBCDIC encoding to convert to (a -qASCII case), then an error seems reasonable.

but this is not part of this change.

Agreed.

This revision was automatically updated to reflect the committed changes.