This is an archive of the discontinued LLVM Phabricator instance.

[Windows] Convert from UTF-8 to UTF-16 when writing to a Windows console
ClosedPublic

Authored by rnk on Aug 31 2018, 2:10 PM.

Details

Summary

Calling WriteConsoleW is the most reliable way to print Unicode
characters to a Windows console.

If binary data gets printed to the console, attempting to re-encode it
shouldn't be a problem, since garbage in can produce garbage out.

This breaks printing strings in the local codepage, which WriteConsoleA
knows how to handle. For example, this can happen when user source code
is encoded with the local codepage, and an LLVM tool quotes it while
emitting a caret diagnostic. This is unfortunate, but well-behaved tools
should validate that their input is UTF-8 and escape non-UTF-8
characters before sending them to raw_fd_ostream. Clang already does
this, but not all LLVM tools do this.

One drawback to the current implementation is printing a string a byte
at a time doesn't work. Consider this LLVM code:

for (char C : MyStr) outs() << C;

Because outs() is now unbuffered, we wil try to convert each byte to
UTF-16, which will fail. However, this already didn't work, so I think
we may as well update callers that do that as we find them to print
complete portions of strings. You can see a real example of this in my
patch to SourceMgr.cpp

Fixes PR38669 and PR36267.

Diff Detail

Repository
rL LLVM

Event Timeline

rnk created this revision.Aug 31 2018, 2:10 PM
zturner added inline comments.Aug 31 2018, 2:19 PM
llvm/lib/Support/raw_ostream.cpp
572 ↗(On Diff #163593)

You can use llvm::sys::Process::FileDescriptorIsDisplayed(fd) here. Also, no need to put it behind the #ifdef, the member variable can be set correctly regardless of platform.

632 ↗(On Diff #163593)

StringRef Data?

755–780 ↗(On Diff #163593)

Given that we already have this IsConsole variable, combined with my suggestion above to set the variable to a real value for every platform, can we re-write this entire function:

if (IsConsole)
  return 0;

#if !defined(__minix)
  struct stat statbuf;
  if (fstat(FD, &statbuf) != 0)
    return 0;
  return statbuf.st_blksize.
#else
  return raw_ostream::preferred_buffer_size();
#endif
rnk marked an inline comment as done.Aug 31 2018, 2:36 PM
rnk added inline comments.
llvm/lib/Support/raw_ostream.cpp
572 ↗(On Diff #163593)

We need to differentiate between the case where the output is being displayed in mintty and a Windows console. FileDescriptorIsDisplayed() could reasonably return true for mintty, which uses pipes, and then we would try to use WriteConsoleW on a pipe. It's true that today FileDescriptorIsDisplayed will return false inside mintty, but someone might come along and fix that bug at some point.

I really want to be explicit that this is just for Windows console devices, and not some general cross-platform "is displayed" boolean, which is why the variable is IsConsole and not IsATTy. WDYT?

632 ↗(On Diff #163593)

Sure, but it just moves it to the caller.

755–780 ↗(On Diff #163593)

If we do cache tty-ness, we could do that, but that's not what IsConsole is.

rnk updated this revision to Diff 163595.Aug 31 2018, 2:37 PM
rnk marked an inline comment as done.
  • stringref
zturner added inline comments.Aug 31 2018, 3:00 PM
llvm/lib/Support/raw_ostream.cpp
650–652 ↗(On Diff #163595)

Can you test this with the following?

outs() << "Test\nTest2";

I'm wondering if the LF will be handled properly. I think part of what ::write does is to transcode LF to CRLF. So if we're directly calling WriteConsole, we might not get this for free anymore.

572 ↗(On Diff #163593)

In that case call the variable IsWindowsConsole perhaps.

632 ↗(On Diff #163593)

Well sure, but my next question would be "why doesn't write_impl take a StringRef?" Seems like something that should keep pushing up the call stack as high as possible (Not that I'm asking you to make that change, but it just seems like the right api, despite the fact that the caller has what is, IMO, a buggy API).

rsmith added a subscriber: rsmith.Aug 31 2018, 4:01 PM
rsmith added inline comments.
llvm/lib/Support/raw_ostream.cpp
636 ↗(On Diff #163595)

Just out of curiosity: what's the difference between this and llvm::convertUTF8ToUTF16String?

650 ↗(On Diff #163595)

I think you want &WideText[WCharWritten] here.

zturner added inline comments.Aug 31 2018, 4:11 PM
llvm/lib/Support/raw_ostream.cpp
636 ↗(On Diff #163595)

The main difference is that sys::windows::UTF8ToUTF16 calls a native Windows API to do the conversion. Documentation for the Windows API claims to produce conformant UTF-16 in all cases, but historically people have still tried to use the Windows API whenever possible because there have been issues in the past where Windows was *almost* UTF-16 but not quite. I think those days are gone, but people still do it.

The other difference is that convertUTF8ToUTF16String returns a vector<unsigned short>, whereas all of the windows APIs take const wchar_t*. So we'd need a reintepret_cast before calling the actual Windows API, which is a little annoying. (Incidentally, even *that* function should probably be changed to return a vector<char16_t>.

rnk marked an inline comment as done.Sep 4 2018, 1:30 PM
rnk added inline comments.
llvm/lib/Support/raw_ostream.cpp
650 ↗(On Diff #163595)

Thanks. Unfortunately, this isn't testable without faking up some kind of console.

650–652 ↗(On Diff #163595)

I think the Windows console doesn't care if it gets \r\n or \n at this point. I don't think there's a way I can observe a difference in behavior now, since I can't WriteConsole to a pipe. =/

572 ↗(On Diff #163593)

Done, thanks, that was something I thought about but didn't write down.

rnk updated this revision to Diff 163898.Sep 4 2018, 1:30 PM
rnk marked an inline comment as done.
  • rename variable
zturner added inline comments.Sep 4 2018, 1:56 PM
llvm/lib/Support/raw_ostream.cpp
650–652 ↗(On Diff #163595)

Just put some code in main that calls the above with your change and run the program. See what it outputs?

rnk added inline comments.Sep 4 2018, 2:04 PM
llvm/lib/Support/raw_ostream.cpp
650–652 ↗(On Diff #163595)

Right, I compiled and ran a program that does outs() << "Test\nTest2"; and the newlines show up. The Windows console doesn't need CRLF, it treats LF as implicitly returning the cursor to the beginning of the next line.

zturner accepted this revision.Sep 4 2018, 2:14 PM
This revision is now accepted and ready to land.Sep 4 2018, 2:14 PM
This revision was automatically updated to reflect the committed changes.