Details

Reviewers

aaron.ballman
erichkeane
tahonermann
shafik

Commits

rG0a3243de62c1: [clang][Interp] Array initialization via string literal

Diff Detail

Event Timeline

tbaeder created this revision. Nov 5 2022, 6:06 AM

Herald added a project: Restricted Project. Nov 5 2022, 6:06 AM

tbaeder requested review of this revision. Nov 5 2022, 6:06 AM

Herald added a subscriber: cfe-commits.

Harbormaster completed remote builds in B196288: Diff 473424. Nov 5 2022, 6:07 AM

tschuett added a subscriber: tschuett. Nov 5 2022, 7:15 AM

tschuett added inline comments.

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1081	Program

tbaeder marked an inline comment as done. Nov 5 2022, 10:38 AM

tschuett added inline comments. Nov 5 2022, 11:29 AM

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1081	Probably I misunderstood something. It says: `Porgram::createGlobalString`. I would have expected it to start with Program and not Porgram.

tbaeder added inline comments. Nov 5 2022, 2:09 PM

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1081	Yep, I fixed it locally.

aaron.ballman added inline comments. Nov 8 2022, 9:49 AM

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1076–1077
1082
1086	Should we be looking at the sign of `char` to decide whether to use a uint8 or an sint8?
1100
clang/test/AST/Interp/literals.cpp
354–359	I'd like to see some tests for the other encodings, as well as a test with embedded null characters in the literal. Testing a string literal that's longer than the array is something we should think about. That code is ill-formed in C++, so I don't think we can add a test for it yet, but it's only a warning in C.
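
The cases requested here could look roughly like the following. This is only a sketch: the variable names are hypothetical and the assertions are not taken from the final literals.cpp. It covers an embedded null and the wide, UTF-16 and UTF-32 literal kinds, and leaves the too-long literal commented out because it is ill-formed in C++.

```cpp
// Hypothetical test sketch; names are illustrative, not from literals.cpp.
constexpr char     Narrow[6] = "a\0b"; // embedded null in the middle
constexpr wchar_t  Wide[4]   = L"ab";
constexpr char16_t U16[4]    = u"ab";
constexpr char32_t U32[4]    = U"ab";

static_assert(Narrow[0] == 'a', "");
static_assert(Narrow[1] == 0, "");   // the embedded null
static_assert(Narrow[2] == 'b', ""); // code units after it are still copied
static_assert(Narrow[3] == 0, "");   // implicit terminator
static_assert(Narrow[5] == 0, "");   // trailing elements are zero-initialized

static_assert(Wide[1] == L'b' && Wide[2] == 0 && Wide[3] == 0, "");
static_assert(U16[1] == u'b' && U16[2] == 0 && U16[3] == 0, "");
static_assert(U32[1] == U'b' && U32[2] == 0 && U32[3] == 0, "");

// Ill-formed in C++ (the literal is longer than the array), so it cannot be
// part of a C++ test:
// constexpr char TooLong[2] = "abc";
```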

tbaeder updated this revision to Diff 474177. Nov 9 2022, 12:35 AM

tbaeder marked 4 inline comments as done.

tbaeder added inline comments.

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1076–1077	I feel like every time I write the code to get the `ConstantArrayType` from some array expression, I use a different version. I've never used `getAsConstantArrayType()` before :)
1086	I was wondering about that too, but this code is copy/paste from `Program.cpp` where we create global storage for string literals. That code works, so I assume it will work here, too. Getting rid of the duplication might be nice though.
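
To make the signedness question concrete, here is a minimal sketch of a width-to-type mapping that consults the signedness of plain `char` for the 1-byte case. The `SketchPrim` enum and `classifyChar` function are hypothetical stand-ins, not clang's `PrimType` or any interpreter API, and the sketch asks the host compiler about `char`, whereas a real compiler would have to ask the target. The diff under review keeps `PT_Sint8` for width 1.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical stand-in for the interpreter's primitive types.
enum class SketchPrim { Sint8, Uint8, Uint16, Uint32, Unsupported };

// Map a character width in bytes to a primitive type. For width 1, consult
// the signedness of plain `char` instead of hard-coding a signed type.
// Assumption: host `char` signedness is used here purely for illustration.
constexpr SketchPrim classifyChar(std::size_t CharWidth) {
  return CharWidth == 1
             ? (std::numeric_limits<char>::is_signed ? SketchPrim::Sint8
                                                     : SketchPrim::Uint8)
         : CharWidth == 2 ? SketchPrim::Uint16
         : CharWidth == 4 ? SketchPrim::Uint32
                          : SketchPrim::Unsupported;
}

static_assert(classifyChar(2) == SketchPrim::Uint16, "");
static_assert(classifyChar(4) == SketchPrim::Uint32, "");
static_assert(classifyChar(3) == SketchPrim::Unsupported, "");
```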

Harbormaster completed remote builds in B196843: Diff 474177. Nov 9 2022, 12:36 AM

tahonermann added inline comments. Nov 15 2022, 12:36 PM

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1098–1099	Aren't `N` and `NumElems` guaranteed to have the same value here? Both are derived from `SL`. The code seems to be written with the expectation that `NumElems` corresponds to the number of elements to be initialized in the target array.
1100	`CodePoint` should be `CodeUnit` here. `char` and friends all hold code units (in some cases, those code units may also constitute a code point).
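
The distinction can be checked at compile time. The sketch below (the `Emoji` name is hypothetical) shows single code points that occupy more than one code unit in UTF-8 and UTF-16, which is why a value read from one array element is a code unit and not necessarily a code point.

```cpp
// U+00E9 needs two UTF-8 code units. U+1F600 needs two UTF-16 code units
// (a surrogate pair) but only one UTF-32 code unit. sizeof also counts the
// implicit terminator.
static_assert(sizeof(u8"\u00e9") == 3, "2 UTF-8 code units + terminator");
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3,
              "surrogate pair + terminator");
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2,
              "1 code unit + terminator");

// Each array element holds one code unit of the surrogate pair.
constexpr char16_t Emoji[] = u"\U0001F600";
static_assert(Emoji[0] == 0xD83D && Emoji[1] == 0xDE00, "");
```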
clang/test/AST/Interp/literals.cpp
354–359	I agree with Aaron's requests. Please also extend the test to include a `char` element that would be negative for a `signed` 8-bit `char`. Something like:

    constexpr char foo[12] = "abc\xff";
    ...
    #if defined(__CHAR_UNSIGNED__) || __CHAR_BIT__ > 8
    static_assert(foo[3] == 255, "");
    #else
    static_assert(foo[3] == -1, "");
    #endif

A couple more tests to add:
- One where the string literal has the same length (including the implicit terminator) as the array; to ensure that the implicit terminator is properly accounted for.
- One where the target array size is deduced from the string literal; to ensure there are no dependencies on an explicit array size.
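
The signedness case and the two additional cases from this comment might be written as below. This is a sketch with hypothetical names (`Neg`, `Exact`, `Deduced`), not the test that landed, and it assumes a GCC- or Clang-style compiler that defines `__CHAR_UNSIGNED__` and `__CHAR_BIT__`.

```cpp
// Element that is negative when plain char is a signed 8-bit type.
constexpr char Neg[12] = "abc\xff";
#if defined(__CHAR_UNSIGNED__) || __CHAR_BIT__ > 8
static_assert(Neg[3] == 255, "");
#else
static_assert(Neg[3] == -1, "");
#endif

// Exact fit: three code units plus the implicit terminator fill the array.
constexpr char Exact[4] = "abc";
static_assert(Exact[2] == 'c', "");
static_assert(Exact[3] == 0, "");

// Deduced size: the array bound comes from the literal itself.
constexpr char Deduced[] = "abc";
static_assert(sizeof(Deduced) == 4, "");
static_assert(Deduced[3] == 0, "");
```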

tbaeder updated this revision to Diff 475697. Nov 15 2022, 11:39 PM

tbaeder marked 2 inline comments as done.

Harbormaster completed remote builds in B197913: Diff 475697. Nov 15 2022, 11:40 PM

tahonermann added inline comments. Nov 18 2022, 3:34 PM

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1098–1099	I see the change to now use the minimum of `SL->getLength()` and `CAT->getSize().getZExtValue()`. Based on https://godbolt.org/z/5sTWExTac this looks to be unnecessary. When a string literal is used as an array initializer, it appears that the type of the string literal is adjusted to match the size of the array being initialized. I suggest using only `CAT->getSize().getZExtValue()` and adding a comment that this code depends on that adjustment.
clang/test/AST/Interp/literals.cpp
354–359	These cases all look to have been added now. Thank you!

tbaeder updated this revision to Diff 476660. Nov 18 2022, 9:57 PM

tbaeder marked 3 inline comments as done.

Harbormaster completed remote builds in B198606: Diff 476660. Nov 18 2022, 9:57 PM

tbaeder added inline comments. Nov 18 2022, 10:03 PM

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1098–1099	That is good to know and makes sense, thanks!

tbaeder added inline comments. Nov 23 2022, 12:55 AM

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1098–1099	That actually doesn't work. The type might be adjusted, but `getCodeUnit()` still asserts that the index is `< getLength()`. :(

tbaeder updated this revision to Diff 477409. Nov 23 2022, 1:03 AM

Harbormaster completed remote builds in B199131: Diff 477409. Nov 23 2022, 1:03 AM

@tahonermann Anything still missing here?

This looks good to me now. I'm sorry for taking so long to return to review.

clang/lib/AST/Interp/ByteCodeExprGen.cpp
1098–1099	Ah, ok, that makes sense, thanks. I agree this is the right approach for enumerating just the code units in the string literal that are used to initialize the target now.
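
The approach agreed on in this thread, reading only as many code units as both the literal and the target array provide and leaving the rest zero, can be sketched outside of clang as follows. `initFromLiteral` is a hypothetical helper, and `std::u32string` merely plays the role of the `StringLiteral`; none of this is interpreter code.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: fill an array of NumElems code units from a literal.
template <std::size_t NumElems>
std::array<std::uint32_t, NumElems> initFromLiteral(const std::u32string &Units) {
  // Value-initialization zeroes every element, so the tail needs no writes.
  std::array<std::uint32_t, NumElems> Result{};
  // Never index past the literal (the role of getLength()) or past the array.
  const std::size_t N = std::min(NumElems, Units.size());
  for (std::size_t I = 0; I != N; ++I)
    Result[I] = Units[I];
  return Result;
}
```

With this shape, `initFromLiteral<5>(U"abc")` yields the three code units followed by two zeros, and `initFromLiteral<2>(U"abc")` reads only the first two code units.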

This revision is now accepted and ready to land. Dec 7 2022, 2:37 PM

This revision was landed with ongoing or failed builds. Jan 25 2023, 5:24 AM

Closed by commit rG0a3243de62c1: [clang][Interp] Array initialization via string literal (authored by tbaeder).

This revision was automatically updated to reflect the committed changes.

tbaeder added a commit: rG0a3243de62c1: [clang][Interp] Array initialization via string literal.

Diff 474177

clang/lib/AST/Interp/ByteCodeExprGen.cpp

[1,066 lines collapsed; context: for (size_t I = 0; I != NumElems; ++I) {]

        if (!this->visit(Arg))
          return false;
      }
      if (!this->emitCall(Func, Initializer))
        return false;
    }
    return true;
  } else if (const auto *SL = dyn_cast<StringLiteral>(Initializer)) {
    const ConstantArrayType *CAT = Ctx.getASTContext().getAsConstantArrayType(SL->getType());
    assert(CAT && "a string literal that's not a constant array?");

[aaron.ballman, Done]
      } else if (const auto *SL = dyn_cast<StringLiteral>(Initializer)) {
    -   const ArrayType *AT = SL->getType()->getAsArrayTypeUnsafe();
    -   const auto *CAT = cast<ConstantArrayType>(AT);
    +   const ArrayType *CAT = Ctx.getASTContext().getAsConstantArrayType(SL->getType());
    +   assert(CAT && "a string literal that's not a constant array?");
        size_t NumElems = CAT->getSize().getZExtValue();

[tbaeder, Author, Done] I feel like every time I write the code to get the `ConstantArrayType` from some array expression, I use a different version. I've never used `getAsConstantArrayType()` before :)

    size_t NumElems = CAT->getSize().getZExtValue();
    // FIXME: There is a certain code duplication between here
    // and Program::createGlobalString().

[tschuett, Done] Program

[tschuett, Done] Probably I misunderstood something. It says: `Porgram::createGlobalString`. I would have expected it to start with Program and not Porgram.

[tbaeder, Author, Done] Yep, I fixed it locally.

    size_t CharWidth = SL->getCharByteWidth();

[aaron.ballman, Done]
      // and Porgram::createGlobalString().
    - const size_t CharWidth = SL->getCharByteWidth();
    + size_t CharWidth = SL->getCharByteWidth();
      PrimType CharType;

    PrimType CharType;
    switch (CharWidth) {
    case 1:
      CharType = PT_Sint8;

[aaron.ballman, Not Done] Should we be looking at the sign of `char` to decide whether to use a uint8 or an sint8?

[tbaeder, Author, Done] I was wondering about that too, but this code is copy/paste from `Program.cpp` where we create global storage for string literals. That code works, so I assume it will work here, too. Getting rid of the duplication might be nice though.

      break;
    case 2:
      CharType = PT_Uint16;
      break;
    case 4:
      CharType = PT_Uint32;
      break;
    default:
      llvm_unreachable("unsupported character width");
    }

    unsigned N = SL->getLength();
    for (size_t I = 0; I != NumElems; ++I) {

[tahonermann, Done] Aren't `N` and `NumElems` guaranteed to have the same value here? Both are derived from `SL`. The code seems to be written with the expectation that `NumElems` corresponds to the number of elements to be initialized in the target array.

[tahonermann, Done] I see the change to now use the minimum of `SL->getLength()` and `CAT->getSize().getZExtValue()`. Based on https://godbolt.org/z/5sTWExTac this looks to be unnecessary. When a string literal is used as an array initializer, it appears that the type of the string literal is adjusted to match the size of the array being initialized. I suggest using only `CAT->getSize().getZExtValue()` and adding a comment that this code depends on that adjustment.

[tbaeder, Author, Done] That is good to know and makes sense, thanks!

[tbaeder, Author, Done] That actually doesn't work. The type might be adjusted, but `getCodeUnit()` still asserts that the index is `< getLength()`. :(

[tahonermann, Not Done] Ah, ok, that makes sense, thanks. I agree this is the right approach for enumerating just the code units in the string literal that are used to initialize the target now.

      uint32_t CodePoint = I < N ? SL->getCodeUnit(I) : 0;

[aaron.ballman, Done]
      for (size_t I = 0; I != NumElems; ++I) {
    -   const uint32_t CodePoint = I < N ? SL->getCodeUnit(I) : 0;
    +   uint32_t CodePoint = I < N ? SL->getCodeUnit(I) : 0;
        // TODO(Perf): 0 is implicit; we can just stop iterating at that point.

[tahonermann, Done] `CodePoint` should be `CodeUnit` here. `char` and friends all hold code units (in some cases, those code units may also constitute a code point).

      // TODO(Perf): 0 is implicit; we can just stop iterating at that point.
      if (CharWidth == 1)
        this->emitConstSint8(CodePoint, SL);
      else if (CharWidth == 2)
        this->emitConstUint16(CodePoint, SL);
      else if (CharWidth == 4)
        this->emitConstUint32(CodePoint, SL);
      else
        return false;

      if (!this->emitInitElem(CharType, I, SL))
        return false;
    }
    return true;
  }

  assert(false && "Unknown expression for array initialization");
  return false;
}

template <class Emitter>
bool ByteCodeExprGen<Emitter>::visitRecordInitializer(const Expr *Initializer) {

[497 lines collapsed]

clang/test/AST/Interp/literals.cpp

[344 lines collapsed; context: #pragma clang diagnostic ignored "-Wmultichar"]

  __WCHAR_TYPE__ wU = U'abc'; // ref-error{{Unicode character literals may not contain multiple characters}} \
                              // expected-error{{Unicode character literals may not contain multiple characters}}
  #if __cplusplus > 201103L
  __WCHAR_TYPE__ wu8 = u8'abc'; // ref-error{{Unicode character literals may not contain multiple characters}} \
                                // expected-error{{Unicode character literals may not contain multiple characters}}
  #endif

  #pragma clang diagnostic pop

+ constexpr char foo[12] = "abc";
+ static_assert(foo[0] == 'a', "");
+ static_assert(foo[1] == 'b', "");
+ static_assert(foo[2] == 'c', "");
+ static_assert(foo[3] == 0, "");
+ static_assert(foo[11] == 0, "");

[aaron.ballman, Not Done] I'd like to see some tests for the other encodings, as well as a test with embedded null characters in the literal. Testing a string literal that's longer than the array is something we should think about. That code is ill-formed in C++, so I don't think we can add a test for it yet, but it's only a warning in C.

[tahonermann, Done] I agree with Aaron's requests. Please also extend the test to include a `char` element that would be negative for a `signed` 8-bit `char`. Something like:

    constexpr char foo[12] = "abc\xff";
    ...
    #if defined(__CHAR_UNSIGNED__) || __CHAR_BIT__ > 8
    static_assert(foo[3] == 255, "");
    #else
    static_assert(foo[3] == -1, "");
    #endif

A couple more tests to add:
- One where the string literal has the same length (including the implicit terminator) as the array; to ensure that the implicit terminator is properly accounted for.
- One where the target array size is deduced from the string literal; to ensure there are no dependencies on an explicit array size.

[tahonermann, Done] These cases all look to have been added now. Thank you!

  };

  #if __cplusplus > 201402L
  namespace IncDec {
    constexpr int zero() {
      int a = 0;
      a++;
      ++a;

[256 lines collapsed]

This is an archive of the discontinued LLVM Phabricator instance.

[clang][Interp] Array initialization via string literal
Closed · Public
