This is an archive of the discontinued LLVM Phabricator instance.

[Support][unittests] Enforce alignment in ConvertUTFTest
ClosedPublic

Authored by ro on Oct 5 2020, 4:25 AM.

Details

Summary

LLVM-Unit :: Support/./SupportTests/ConvertUTFTest.ConvertUTF16LittleEndianToUTF8String FAILs on Solaris/sparcv9:

In llvm/lib/Support/ConvertUTFWrapper.cpp (convertUTF16ToUTF8String) the SrcBytes is arg is reinterpreted/accessed as UTF16 (unsigned short, which requires 2-byte alignment on strict-alignment targets like Sparc) without anything guaranteeing the alignment, so the access yields a SIGBUS.

This patch avoids this by enforcing the required alignment in the callers.

Tested on sparcv9-sun-solaris2.11.

Diff Detail

Event Timeline

ro created this revision.Oct 5 2020, 4:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 5 2020, 4:25 AM
ro requested review of this revision.Oct 5 2020, 4:25 AM
grimar added a comment.Oct 5 2020, 4:29 AM

Perhaps, worth adding an assert to convertUTF16ToUTF8String to validate the address of SrcBytes passed in?

ro updated this revision to Diff 296164.Oct 5 2020, 6:07 AM

Add assertion.
Run clang-format.

Tested on sparcv9-sun-solaris2.11 and x86-64-pc-linux-gnu.

grimar accepted this revision.Oct 5 2020, 6:14 AM

LGTM with nit addressed. But please wait for a second opinion.

llvm/lib/Support/ConvertUTFWrapper.cpp
100

I think more natural way is to test pre-conditions for arguments before using them?
I.e. I'd move this line before const UTF16 *Src =...

assert((uintptr_t)SrcBytes.begin() % sizeof(UTF16) == 0);
This revision is now accepted and ready to land.Oct 5 2020, 6:14 AM
ro added inline comments.Oct 5 2020, 6:43 AM
llvm/lib/Support/ConvertUTFWrapper.cpp
100

I had it before the casts initially (but without the uintptr_t cast, breaking the build). I moved it after to avoid doubling the SrcBytes.begin(), keeping the line shorter.

IMO Src is used only when dereferenced, but I'm fine either way.

jhenderson accepted this revision.Oct 6 2020, 4:59 AM

Looks good.

llvm/lib/Support/ConvertUTFWrapper.cpp
100

I think this is fine here, thought I might assert SrcEnd too for consistency and to show that the end result makes sense. I think the key thing is that the thing being read (i.e. Src) is aligned. The fact that it came from SrcBytes is somewhat irrelevant. In a future version of this code, there might be an alternative way to initialise Src, not using SrcBytes.begin(), and the assertion ideally would be valid for all such cases.

rnk requested changes to this revision.Oct 6 2020, 10:08 AM
rnk added inline comments.
llvm/unittests/Support/ConvertUTFTest.cpp
19

MSVC doesn't support __attribute__, use alignas(UTF16) instead. It should be in use elsewhere in the codebase.

This revision now requires changes to proceed.Oct 6 2020, 10:08 AM
ro marked an inline comment as done.Oct 6 2020, 1:28 PM
ro added inline comments.
llvm/unittests/Support/ConvertUTFTest.cpp
19

Nice, that's way more readable than the __attribute__ syntax.

ro updated this revision to Diff 296536.Oct 6 2020, 1:29 PM
ro marked an inline comment as done.

Use alignas.

MaskRay accepted this revision.Oct 6 2020, 2:09 PM
rnk accepted this revision.Oct 6 2020, 3:36 PM

lgtm

This revision is now accepted and ready to land.Oct 6 2020, 3:36 PM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Oct 7 2020, 12:39 PM

This breaks tests on Windows: http://45.33.8.238/win/25377/step_11.txt

PTAL, and if it takes a while to fix please revert while you investigate.

Why are we trying to enforce alignment like this in the first place? Writing an API that expects an overaligned ArrayRef<char> is just asking for trouble. There should be two possibilities:

  1. We have a native-endianness UTF-16 string. The type of the input should be something like ArrayRef<uint16_t>, and that naturally has the right alignment.
  2. We have some sequence of bytes which are supposed to be either UTF-16LE or UTF-16BE, independent of the host. In this case, we shouldn't be making random alignment assumptions.
ro added a comment.Oct 8 2020, 1:36 AM

This breaks tests on Windows: http://45.33.8.238/win/25377/step_11.txt

PTAL, and if it takes a while to fix please revert while you investigate.

Sorry about that. I see what's going on: ConvertUTFTest.UTF16WrappersForConvertUTF16ToUTF8String fails like the two others and could be fixed similarly by setting the alignment in the caller. The two other callers of convertUTF16ToUTF8String in ConvertUTFTest.cpp (OddLengthInput and Empty) cannot be affected because the size and empty checks return before the assert.

However, all this is incredibly fragile: ConvertUTFTest.ConvertUTF16LittleEndianToUTF8String only FAILed for me in Release builds while ConvertUTFTest.ConvertUTF16BigEndianToUTF8String does in Debug builds. The assert cannot distinguish if SrcBytes is properly aligned by construction of just by accident.

ro added a comment.Oct 8 2020, 1:39 AM

Why are we trying to enforce alignment like this in the first place? Writing an API that expects an overaligned ArrayRef<char> is just asking for trouble. There should be two possibilities:

Agreed: the current interface is fundamentally flawed. However, I have no idea which of your suggested alternatives is better; my primary concern is getting the Solaris/sparcv9 results clean...

rnk added a comment.Oct 12 2020, 10:24 AM

I may have been the one that picked ArrayRef<char> as the parameter type here, and I think what I was thinking is: "files are full of bytes, the caller is most likely to have a bag of bytes that they want to hand to the re-encoding routine, so it should accept bytes, and I'll pick some arbitrary byte-like character type here". For example, MemoryBuffer gives you back a StringRef for the file contents, and you should be able to just plug that in here. MemoryBuffer probably happens to return aligned file contents in practice, but it doesn't advertise any alignment guarantee. So maybe the best fix is to underalign the UTF16 typedef, if that's possible. This is a serialization routine, after all: it should be really generous about what it accepts.

llvm/unittests/Support/ConvertUTFTest.cpp
83

I think you just need to align this to reland.

ro added a comment.Oct 13 2020, 2:20 AM
In D88824#2325457, @rnk wrote:

I may have been the one that picked ArrayRef<char> as the parameter type here, and I think what I was thinking is: "files are full of bytes, the caller is most likely to have a bag of bytes that they want to hand to the re-encoding routine, so it should accept bytes, and I'll pick some arbitrary byte-like character type here". For example, MemoryBuffer gives you back a StringRef for the file contents, and you should be able to just plug that in here. MemoryBuffer probably happens to return aligned file contents in practice, but it doesn't advertise any alignment guarantee. So maybe the best fix is to underalign the UTF16 typedef, if that's possible. This is a serialization routine, after all: it should be really generous about what it accepts.

I suspect this works fine in all cases where some allocation function comes into play: they tend to guarantee the strictest alignment. Only in those cases (like the current one) where local variables are used there is no alignment guarantee.

llvm/unittests/Support/ConvertUTFTest.cpp
83

That would be really nice: this way we could decouple getting the Solaris/sparcv9 bot closer to green from devising a full solution.

ro updated this revision to Diff 297795.Oct 13 2020, 2:21 AM

Align the last affected instance.

rnk reopened this revision.Oct 13 2020, 11:07 AM

I have to reopen to reaccept. Step 1.

This revision is now accepted and ready to land.Oct 13 2020, 11:07 AM
rnk accepted this revision.Oct 13 2020, 11:08 AM

lgtm

This revision was landed with ongoing or failed builds.Oct 14 2020, 3:03 AM
This revision was automatically updated to reflect the committed changes.