This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Unicode support on Win32
ClosedPublic

Authored by cameron314 on Feb 10 2016, 3:24 PM.

Details

Summary

Various fixes for Win32 interop, particularly with regard to paths; paths containing non-ASCII characters now work properly (they are converted to and from UTF-8 internally).

Diff Detail

Event Timeline

cameron314 updated this revision to Diff 47539.Feb 10 2016, 3:24 PM
cameron314 retitled this revision from to [lldb] Unicode support on Win32.
cameron314 updated this object.
cameron314 added a reviewer: zturner.
cameron314 set the repository for this revision to rL LLVM.
cameron314 added a subscriber: lldb-commits.

I like that you're doing this, but ...

There's a ton of duplication here. I'd really like to see the conversions centralized with consistent error handling. Cluttering up already-too-long functions with more #ifdefs and a half dozen lines of boilerplate (that's working at a much lower level of abstraction than the surrounding code) doesn't seem like moving in the right direction.

lldb/trunk/source/Core/ConnectionSharedMemory.cpp
140 ↗(On Diff #47539)

It's too bad there isn't an overload that gives the wide string as a std::wstring. Then we wouldn't need all the ugly C-style casts of wname.data().

lldb/trunk/source/Host/common/File.cpp
330 ↗(On Diff #47539)

_wopen is officially deprecated. How about _wsopen_s?

357 ↗(On Diff #47539)

Why is the error reported here different than many of the other locations?

lldb/trunk/source/Host/common/FileSpec.cpp
237 ↗(On Diff #47539)

Instead of another #ifdef here, can't we just re-use the GetFileStats function above?

1141 ↗(On Diff #47539)

WIN32_FIND_DATATW to be consistent with calling the wide versions of the functions explicitly.

1179 ↗(On Diff #47539)

MAX_PATH in Windows. (And someday, we should get rid of these fixed buffers so we can use UNC paths.)

lldb/trunk/source/Host/windows/FileSystem.cpp
231 ↗(On Diff #47539)

MAX_PATH

lldb/trunk/source/Host/windows/Host.cpp
73 ↗(On Diff #47539)

The comment above says why you can't use MAX_PATH/PATH_MAX. If that's no longer relevant, please update the comment.

160 ↗(On Diff #47539)

MAX_PATH on Windows.

zturner requested changes to this revision.Feb 10 2016, 4:31 PM
zturner edited edge metadata.

Overall I like the idea and I'm happy to get Unicode paths and strings working properly. I'll look at the rest in more detail later, but I have some starting comments for now.

One thing I'm I'm not real excited about is how ugly this makes the code, but then again there's not a great way to handle this in an elegant fashion. But I do have one suggestion. Suppose we had a class like this:

class WinUtfString
{
public:
    explicit WinUtfString(const char *utf8)
        : m_utf8(utf8), m_utf16(nullptr)
    { }
    explicit WinUtfString(const UTF16 *utf16)
        : m_utf8(nullptr), m_utf16(utf16)
    { }
    // Probably would want to add more useful constructors here that take StringRefs, std::strings, ConstStrings, and whatever else.

    const char *GetUtf8()
    {
        if (!m_utf8)
            ConvertToUtf8();
        return m_utf8;
    }
    const UTF16 *GetUtf16()
    {
        if (!m_utf16)
            ConvertToUtf16();
        return m_utf16;
    }

private:
    void ConvertToUtf8();  // initialize m_utf8 from m_utf16
    void ConvertToUtf16();  // initialize m_utf16 from m_utf8

    llvm::SmallVector<UTF8, 128> m_utf8;
    llvm::SmallVector<UTF16, 128> m_utf16;
};

The point of all this? Much of the code becomes cleaner and easier to read. For example, this:

llvm::SmallVector<UTF16, 128> srcBuffer, dstBuffer;
if (!llvm::convertUTF8ToUTF16String(src.GetCString(), srcBuffer))
    error.SetError(1, lldb::eErrorTypeGeneric);
else if (!llvm::convertUTF8ToUTF16String(dst.GetCString(), dstBuffer))
    error.SetError(1, lldb::eErrorTypeGeneric);
else if (!::CreateHardLinkW((LPCWSTR)srcBuffer.data(), (LPCWSTR)dstBuffer.data(), nullptr))
    error.SetError(::GetLastError(), lldb::eErrorTypeWin32);

Becomes this:

WinUtfString utfSrc(src.GetCString());
WinUtfString utfDest(dest.GetCString());
if (!CreateHardLinkW(utfSrc.GetUtf16(), utfDest.GetUtf16(), nullptr))
    error.SetError(::GetLastError(), lldb::eErrorTypeWin32);

and this:

llvm::SmallVector<UTF16, 128> wexecutable;
llvm::convertUTF8ToUTF16String(executable, wexecutable);

launch_info.GetArguments().GetQuotedCommandString(commandLine);
llvm::SmallVector<UTF16, 64> wcommandLine;
llvm::convertUTF8ToUTF16String(commandLine, wcommandLine);

llvm::SmallVector<UTF16, 128> wworkingDirectory;
llvm::convertUTF8ToUTF16String(launch_info.GetWorkingDirectory().GetCString(), wworkingDirectory);

BOOL result = ::CreateProcessW((LPCWSTR)wexecutable.data(), (LPWSTR)wcommandLine.data(), NULL, NULL, TRUE, flags,
                               env_block, (LPCWSTR)wworkingDirectory.data(), &startupinfo, &pi);

becomes this:

launch_info.GetArguments().GetQuotedCommandString(commandLine);
WinUtfString utfExecutable(executable);
WinUtfString utfCommandLine(commandLine);
WinUtfString utfWorkingDirectory(launch_info.GetWorkingDirectory().GetCString());
BOOL result = ::CreateProcessW(utfExecutable.GetUtf16(), utfCommandLine.GetUtf16(), NULL, NULL, TRUE, flags,
                               env_block, utfWorkingDirectory.GetUtf16(), &startupinfo, &pi);

etc.

I know this adds more work for you, but this is the type of thing that if it's not done right hte first time, it will stagnate and be gross forever. So we should think about how to do this in the least-bad way possible.

cfe/trunk/tools/libclang/CIndexer.cpp
57 ↗(On Diff #47539)

You'll probably need to submit this as a separate patch to cfe-commits, since it goes in a separate repo. You can then commit them independently.

lldb/trunk/source/Commands/CommandCompletions.cpp
171 ↗(On Diff #47539)

I'm not fond of this approach of sprinkling #ifdefs around in every location where we call into a file system API. We have lldb/Host/FileSystem, probably some of these methods should be moved over there, and we should provide a windows implementation and a non-windows implementation. This way in places like this, you simply write:

isa_directory = FileSystem::GetPermissions(partial_name_copy) & FileSystem::eDirectory);

and all this ugliness is hidden.

lldb/trunk/source/Core/ConnectionSharedMemory.cpp
138 ↗(On Diff #47539)

Same thing applies here, we should probably have a Host/SharedMemory.h that exposes a Create() method. Although, in this particular case the #ifdef was already here so I can't entirely complain, but it would be nice to improve the code as we go and get ifdefs like this out of generic code.

lldb/trunk/source/Core/Disassembler.cpp
881 ↗(On Diff #47539)

Would LLDB's File class work here instead of using raw APIs? I see you fixed the code in lldb_private::File, so perhaps simply using that class here instead of a FILE* would allow you to reuse the fix from that class.

lldb/trunk/source/Host/common/File.cpp
353 ↗(On Diff #47539)

Ahh, here's that method I was looking for earlier when I mentioend FileSystem. So I guess you don't have to implement it, it's already here. So you could just use this method earlier in the part that gets file permissions by manually calling stat.

lldb/trunk/source/Host/common/FileSpec.cpp
108 ↗(On Diff #47539)

I don't think this line is correct. The source and destination struct types do not always have the same layout, so I think you may need to copy the values out one by one.

237 ↗(On Diff #47539)

Maybe you can just call GetStats() here

829 ↗(On Diff #47539)

I think a FileSystem::IsSymbolicLink() here would be better, and then have File call FileSpec::IsSymbolicLink()

There's currently some confusion and overlap about what goes in FileSpec and what goes in FileSystem, but my general opinion is that wherever possible, anything that queries the filesystem should be a static method in Host/FileSystem, and any other classes who want to do the same thing can call the methods in FileSystem. This way you don't have to, for example, construct a FileSpec from a string just to check if something is a symbolic link since you could just call the low level method directly.

1173–1180 ↗(On Diff #47539)

sprintf'ing parent and child paths together to concatenate them doesn't seem like the best idea. Can we use llvm::sys::path::append() here instead?

This revision now requires changes to proceed.Feb 10 2016, 4:31 PM

Unless something has changed since last I looked, the llvm path utilities are host specific. I didn't inspect the code where this is being used, but it is in Host/common, so unless there's an ifdef WINDOWS around it, I don't think you should use the llvm path utilities.

Jim

Unless something has changed since last I looked, the llvm path utilities are host specific. I didn't inspect the code where this is being used, but it is in Host/common, so unless there's an ifdef WINDOWS around it, I don't think you should use the llvm path utilities.

Jim

Unless I'm misunderstanding, code in Host/Common need only satisfy the property that it will work on every host with a single syntax. That's what the llvm path libraries do as well. so llvm::sys::path::append('foo', 'bar') will produce foo\bar on Windows, and foo/bar on Linux, OSX, etc.

My bad, that sees to be what it does.

Jim

Thanks for all the feedback! I'll look into it in detail tomorrow.

Perhaps it would make more sense if I clarified where I'm coming from with this patch -- we use LLDB on Windows (with a custom backend) where I work, and wanted to be able to run a program with a non-ASCII path (and similarly place breakpoints in such files, etc.). So, I tried to make minimally invasive changes across the codebase in order to accomplish this, using the existing LLVM support functions as much as possible. This is why the changes are not entirely consistent as a whole -- I preferred consistency with the surrounding code (e.g. error handling and buffer allocation) over global consistency.

Thanks for all the feedback! I'll look into it in detail tomorrow.

Perhaps it would make more sense if I clarified where I'm coming from with this patch -- we use LLDB on Windows (with a custom backend) where I work, and wanted to be able to run a program with a non-ASCII path (and similarly place breakpoints in such files, etc.). So, I tried to make minimally invasive changes across the codebase in order to accomplish this, using the existing LLVM support functions as much as possible. This is why the changes are not entirely consistent as a whole -- I preferred consistency with the surrounding code (e.g. error handling and buffer allocation) over global consistency.

Just to be clear, I don't want you to not use llvm support functions. I just want to kind of hide them behind a common abstraction that removes some of the boilerplate just so that readability doesn't suffer too much. In that WinUtfString class I whipped up, I left two functions unimplemented. ConvertToUtf16 and ConvertToUtf8. Those would still be implemented in terms of those llvm support functions.

I'm also not saying that's the best possible solution since I kind of concoted it on the fly without too much attention to detail. But it makes the conversion points look much cleaner, and you can wrap up all your error handling logic in there (ignored in my 5-minute implementation), so that the error codes, messages, etc are always consistent when something fails.

mamai added a subscriber: mamai.Feb 11 2016, 7:34 AM

I've replied to most of your comments, thanks everyone. I'll make most of the suggested changes, but some of them have nothing to do with correct Unicode handling and are rather pre-existing design issues, and so are out of the scope of this patch.

@zturner: I agree that the conversion as it is now is ugly, but I'm not sure a WinUtfString would be the best way to clean it up, considering the unfortunate necessity of error handling (which is still context-dependent 90% of the time). I'm also slightly allergic to adding more string classes, especially platform-specific ones ;-) I'm more inclined to simply adopt a cleaner set of std::string<->std::wstring conversion function wrappers as suggested by @amccarth. I'll start with that and we'll see from there.

cfe/trunk/tools/libclang/CIndexer.cpp
57 ↗(On Diff #47539)

Alright, thanks. I'll do that. Thought it would make sense to sneak it in here ;-)

lldb/trunk/source/Commands/CommandCompletions.cpp
171 ↗(On Diff #47539)

I agree. Again, my goal was not to refactor (I don't know this codebase very well at all and can't test it in much depth), but to fix existing code. It happens that stat is particularly awkward on Win32, so yes, it should probably be wrapped instead (or better yet replaced with GetFileAttributes()). But this is outside the scope of the patch.

lldb/trunk/source/Core/ConnectionSharedMemory.cpp
138 ↗(On Diff #47539)

Again, I agree, but this is outside the scope of my patch.

140 ↗(On Diff #47539)

I didn't want to add/change the string conversion support libraries too much (there's already too many functions with overlapping functionality IMO), but on the whole this would indeed yield much nicer and simpler code. I'll make this change.

lldb/trunk/source/Core/Disassembler.cpp
881 ↗(On Diff #47539)

No idea. Probably :-) I'll look into it, and make this change if it's trivial.

lldb/trunk/source/Host/common/File.cpp
330 ↗(On Diff #47539)

Sure. I'll make this change. open was deprecated too...

353 ↗(On Diff #47539)

This method only returns the permissions -- the other one you're referring to checks if a path is a directory.

357 ↗(On Diff #47539)

About a quarter of the functions report errors through an Error (like this one), so I followed suit in this context. Might be better if I set it with a message instead, though -- I'll make that change.

The rest report errors either through errno; by some context-dependent return value (e.g. functions that return false on any failure); or don't check for errors at all.

lldb/trunk/source/Host/common/FileSpec.cpp
108 ↗(On Diff #47539)

It's probably better if I copy the fields for the sake of forwards compatibility, but I checked the definitions of these structures and they should be 100% compatible. I'll make the change, though, the reinterpret_cast is really ugly.

237 ↗(On Diff #47539)

Probably, yes, but it requires a FileSpec. I'm not sure what creating a temporary one implicates. Surely there was a reason it wasn't already called before?

829 ↗(On Diff #47539)

I agree, but this is outside the scope of my patch :-)
This method already existed, and was already written this way. I merely fixed the way it calls into the OS by doing a string conversion, which is completely independent of the existing file system code's design.

1141 ↗(On Diff #47539)

Right, thanks, missed this one. I'll make this change.

1173–1180 ↗(On Diff #47539)

Probably, but a lot of this function's code would have to be changed to work with those objects instead of fixed buffers. The snprintf was there before, so I conserved it. The more changes I make, the more likely I am to introduce a bug, especially since I don't know the codebase very well at all.

1179 ↗(On Diff #47539)

I absolutely agree re the fixed buffers. I'll try to make this change in all of the functions that can allocate dynamic memory.

MAX_PATH is not PATH_MAX, though -- MAX_PATH is a Win32 constant of 260, whereas PATH_MAX is an LLDB constant of 32K. (Actually PATH_MAX is defined in multiple places and was either that or MAX_PATH depending on which header was included, but mostly it was 32K so I changed them all to at least be consistently 32K.)

lldb/trunk/source/Host/windows/FileSystem.cpp
231 ↗(On Diff #47539)

PATH_MAX :-)

Like most of the wide Win32 API, this function can return nice long UNC paths.

PATH_MAX should be better named to avoid confusion, though. Maybe LLDB_MAX_PATH instead?

lldb/trunk/source/Host/windows/Host.cpp
73 ↗(On Diff #47539)

It's still relevant, I just changed it to use the symbolic constant of the same size instead of a hard-coded constant. MAX_PATH != PATH_MAX, unfortunately.

zturner added inline comments.Feb 11 2016, 10:31 AM
lldb/trunk/source/Commands/CommandCompletions.cpp
171 ↗(On Diff #47539)

I know your intent was to just fix existing code, but when those fixes make the code harder to maintain or do things the wrong way, then I think it's worth asking whether the scope should expand. Introducing ifdefs in common code is not a direction we want to move in unless it's absolutely necessary, so somehow we have to address that before this can go in, even if it means expanding the scope of the patch slightly.

Also make sure you run ninja check-lldb and ninja check-lldb-unit with this patch to make sure everything works.

lldb/trunk/source/Core/ConnectionSharedMemory.cpp
140 ↗(On Diff #47539)

It's going to be difficult to make this change. Because that will require a separate review on the LLVM side with different reviewers. You can certainly try though.

lldb/trunk/source/Host/common/File.cpp
357 ↗(On Diff #47539)

This is why having the conversion centralized as I suggested earlier would be advantageous. Everything would always behave the same way.

lldb/trunk/source/Host/common/FileSpec.cpp
237 ↗(On Diff #47539)

This is another reason I argued for moving this to FileSystem so that we wouldn't need a FileSpec here.

lldb/trunk/source/Host/windows/FileSystem.cpp
231 ↗(On Diff #47539)

Alternatively, you could just call it with 0 and then allocate a vector of the appropriate size based on the return value.

lldb/trunk/source/Host/windows/Windows.cpp
34 ↗(On Diff #47539)

Do the LLVM functions not work here for some reason?

210 ↗(On Diff #47539)

Why can't we just convert to the result of _wgetcwd directly to UTF8?

lldb/trunk/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp
1175 ↗(On Diff #47539)

Here's another example of an #ifdef in common code that we will need to do something about. Is there a constructor of the File class that takes a path and a mode?

lldb/trunk/source/Target/ProcessLaunchInfo.cpp
457 ↗(On Diff #47539)

Anothe here.

lldb/trunk/tools/driver/Driver.cpp
1277 ↗(On Diff #47539)

This ifdef can probably stay, I don't know if there's a good way around this one.

1289 ↗(On Diff #47539)

Is this going to affect the state of the console even after quitting LLDB?

zturner added inline comments.Feb 11 2016, 11:52 AM
lldb/trunk/source/Host/common/FileSpec.cpp
108 ↗(On Diff #47539)

Apparently the CRT uses a reinterpret_cast too. As long as you put:

static_assert(sizeof(struct stat) == sizeof(struct _stat64i32));

then the cast is probably fine.

I've addressed more feedback. I'm working on a second version of the patch with a lot less #ifdefs and nicer wrappers (and better error handling where possible).

lldb/trunk/source/Commands/CommandCompletions.cpp
171 ↗(On Diff #47539)

I understand. I'll try to refactor out the ifdefs.

lldb/trunk/source/Core/ConnectionSharedMemory.cpp
140 ↗(On Diff #47539)

Hmm. Alright then. I'll put the wrappers in LLDB for now.

lldb/trunk/source/Core/Disassembler.cpp
881 ↗(On Diff #47539)

I don't trust myself to make this change without introducing a bug. But I'll get rid of the #ifdef.

lldb/trunk/source/Host/common/File.cpp
353 ↗(On Diff #47539)

...which turns out to exist here too. I call that instead of using stat now, much cleaner!

lldb/trunk/source/Host/common/FileSpec.cpp
108 ↗(On Diff #47539)

Makes sense. Done.

237 ↗(On Diff #47539)

Actually, it seems the stat is just done to see if the path exists. llvm::sys::fs::exists does this already, I'll call that instead. It does the proper string conversion and uses the right Win32 API call on Windows.

829 ↗(On Diff #47539)

By the way, all the FileSystem methods seem to accept FileSpecs as input, not raw paths. I agree it would be nice if FileSystem abstracted the OS-level FS, which all the other higher-level parts (FileSpec, File) then depended on. But that would require changing the API rather significantly.

lldb/trunk/source/Host/windows/Windows.cpp
34 ↗(On Diff #47539)

They would, but these are called from functions that are trying to emulate libc functions, by avoiding calling malloc among other things. Hence these wrappers that work with buffers.

210 ↗(On Diff #47539)

Not sure I understand. That's what this code does? There's a tad of extra logic needed because the path passed in is allowed to be NULL, indicating we should allocate a buffer for the caller.

lldb/trunk/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp
1175 ↗(On Diff #47539)

There is, but the mode is an enum and not a string. I'll get rid of the #ifdef, though.

lldb/trunk/source/Target/ProcessLaunchInfo.cpp
457 ↗(On Diff #47539)

Quite so. I'll add a portable getenv wrapper in HostInfo.

zturner added inline comments.Feb 22 2016, 3:24 PM
lldb/trunk/source/Core/Disassembler.cpp
881 ↗(On Diff #47539)

One possibility is to revert this portion of the patch (along with any other changes that you don't feel comfortable making) back to their original state. This will still leave certain parts of unicode functionality broken, but it would reduce the surface area of "things that might be wrong with the patch" and give people a chance to make sure nothing else is seriously broken. Then, we can do these other "more difficult" locations as a followup patch. The nice thing about that is that if it does end up breaking something, and we have to revert it, it doesn't revert 100% of the work towards getting Unicode working, but only a couple of points.

lldb/trunk/source/Host/common/FileSpec.cpp
237 ↗(On Diff #47539)

Even better :)

829 ↗(On Diff #47539)

Yea, it's a bit unfortunate. In reality there should be overloads that take FileSpecs and std::strings or llvm::StringRefs, and then only 1 implementation that the others re-use. But right now there aren't.

lldb/trunk/source/Host/windows/Windows.cpp
210 ↗(On Diff #47539)

Ahh that extra logic is what I was referring to. I wasn't aware of those semantics of getcwd. Can you add a comment explaining that, since it's not obvious from looking at the code.

lldb/trunk/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp
1175 ↗(On Diff #47539)

We have a function somewhere that converts a mode string to the enum. I think it's in File.h or File.cpp. Let me know if you can't find it.

cameron314 added inline comments.Feb 22 2016, 3:39 PM
lldb/trunk/tools/driver/Driver.cpp
1289 ↗(On Diff #47539)

Hmm, turns out it does. I'll add code to revert it at the end.

amccarth added inline comments.Feb 22 2016, 3:47 PM
lldb/trunk/source/Host/common/FileSpec.cpp
1179 ↗(On Diff #47539)

Ah, I see. I work in another code base where PATH_MAX is synonymous with MAX_PATH, so I was confused.

Buffers of 32K characters on the stack seem like a bad idea. We need a vector or string or some other container that puts the bulk of the data somewhere other than the stack.

cameron314 added inline comments.Feb 22 2016, 4:01 PM
lldb/trunk/source/Core/Disassembler.cpp
881 ↗(On Diff #47539)

This makes sense. I'm almost done with the next version of the patch, we'll see what should make the cut at that point.

lldb/trunk/source/Host/common/FileSpec.cpp
1179 ↗(On Diff #47539)

Right. There's currently ~110 places in LLDB that allocate buffers of size PATH_MAX on the stack. Nearly all of those predate my patch, it seems.

lldb/trunk/source/Host/windows/Windows.cpp
210 ↗(On Diff #47539)

Haha, sure. I wasn't aware either, but it blew up at runtime since it turns out LLDB actually makes use of this when booting with a relative path.

lldb/trunk/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp
1175 ↗(On Diff #47539)

You know, it turns out there's one right in this very file. Thanks!

amccarth added inline comments.Feb 22 2016, 4:04 PM
lldb/trunk/source/Host/common/FileSpec.cpp
1179 ↗(On Diff #47539)

Right, but this is a new case because you changed it from MAX_PATH (260) to PATH_MAX (32K). I'd rather not introduce more instances like this.

cameron314 added inline comments.Feb 22 2016, 5:42 PM
lldb/trunk/source/Host/common/FileSpec.cpp
1179 ↗(On Diff #47539)

Good point, I'll fix it up :-)

cameron314 updated this revision to Diff 49066.Feb 25 2016, 8:25 AM
cameron314 edited edge metadata.
cameron314 removed rL LLVM as the repository for this revision.

Here's a new version of the patch which takes into account most of the feedback so far (less #ifdefs, etc.). It depends on my pending patch in LLVM (D17549) that introduces a few more helper wrappers for UTF conversion.

cameron314 added inline comments.Feb 25 2016, 8:28 AM
lldb/trunk/tools/lldb-mi/MICmnLLDBDebugSessionInfo.cpp
618 ↗(On Diff #49066)

Unrelated to this patch, but can this method be called from several threads at once?

I'm growing more comfortable with these changes, but I'll defer to Zach.

lldb/trunk/source/Host/common/FileSpec.cpp
242 ↗(On Diff #49066)

I recognize that you're just repeating the pattern from above, but ...

This seems whacky and dangerous. It appears the attempt is to put a null terminator on the end, but not count it in the length of the vector. And I guess that we know it's safe here because path is an llvm::SmallVector, so the implementation details are known. But, ugh. If path were ever changed to std::vector, we'd no longer have assurance that the terminator remains after the pop.

lldb/trunk/source/Host/windows/FileSystem.cpp
231 ↗(On Diff #49066)

I agree with Zach that a dynamic solution here is better. It's already icky that we have a 32KB stack buffer here. Turning it into a 64KB +1B stack buffer seem egregious.

cameron314 added inline comments.Feb 26 2016, 7:32 AM
lldb/trunk/source/Host/common/FileSpec.cpp
242 ↗(On Diff #49066)

I totally agree. Took me a few minutes before I figured out what it was doing the first time I saw it :-) This is also done in the existing convertUTF8ToUTF16String wrapper too.

All the implementations of std::vector that I know of will work with this, though as you say it's not guaranteed by the standard.

Looking more closely, I think in this case it was only done for the call to stat, which I've removed. I had added it here too in case the caller relied on this null trick, but I don't think it's necessary in either place anymore. I'll remove it.

lldb/trunk/source/Host/windows/FileSystem.cpp
231 ↗(On Diff #49066)

Hah, yes. I'll replace this with a vector to start with.

cameron314 updated this revision to Diff 49185.Feb 26 2016, 7:36 AM
cameron314 edited edge metadata.

I don't want to be pushy, but is there a chance of this patch being accepted soon?
It does depend on D17549 in LLVM which is also still pending, but that one's at least very a simple patch adding the support functions that this one uses.

zturner added a subscriber: zturner.Mar 7 2016, 8:07 AM

Sorry slipped under my radar, I'll look again today

No worries, thanks for taking a look!

Can you rebase against tip of trunk? I want to apply and run the test suite with the patch applied, but I'm getting a lot of errors. After uploading a new rebased patch, can you also respond with the revision you rebased against so I can make sure I'm at the same place.

lldb/trunk/source/Host/common/File.cpp
330 ↗(On Diff #49185)

Any particular reason you're using _SH_DENYWR instead of _SH_DENYNO? No matter what we do we will always have subtle differences in semantics, but _SH_DENYNO is closer to posix semantics.

Absolutely, I'm rebasing now. This is unfortunately the type of patch that rots really quickly ;-)

I did run check-lldb locally at one point, with no major differences before and after the patch. But I recently discovered the version of Python I'd compiled had accidentally been compiled with VS2008, and since the tip's changed anyway, it's absolutely worth running them again. This was the day before the instructions to use VS21015 with Python 3 were posted, haha.

lldb/trunk/source/Host/common/File.cpp
330 ↗(On Diff #49185)

No particular reason other than I find _SH_DENYWR a reasonable default. It generally doesn't hurt to let others read the file while it's being written, and that's sometimes crucial (e.g. for log files which are written to incrementally and only closed at the very end). I can change it if you prefer.

zturner added inline comments.Mar 7 2016, 11:45 AM
lldb/trunk/source/Host/common/File.cpp
330 ↗(On Diff #49185)

I like deny write in principle as well, but I kind of do prefer to change it if for no other reason than it matches the semantics on Posix more closely. We've had sharing access denied errors in the past on Windows that we traced back to sharing violations, and it's easier if all platforms just use the same semantics.

cameron314 added inline comments.Mar 7 2016, 12:01 PM
lldb/trunk/source/Host/common/File.cpp
330 ↗(On Diff #49185)

OK, I'll change it for the next patch.

cameron314 updated this revision to Diff 49986.Mar 7 2016, 4:16 PM

Sorry for the delay, I wanted to make sure LLDB compiled without warnings after I rebased the patch. It applies cleanly on the tip as of this writing, and is based on rL262819.

I'm waiting for the LLVM side change to go in and this patch rebased on top of that before I test it out.

In the meantime, one thing I'm concerned about is setting the codepage of the console. You determined earlier that it affects the state of the console even after exiting LLDB. So we clean it up on shutdown. But what if LLDB crashes? I'm not sure if this is a concern, but it's troublesome at the very least.

If I remember correctly, this patch was originally intended to support your use case of linking against liblldb but not using the lldb.exe driver, is that right? Or maybe you were using the python extension module, I don't remember. In any case, neither of those use cases requires you to display UTF8 text to the console window. So I wonder if it is possible to do without the codepage portion of the patch.

I would like to avoid that change if at all possible, but if it is really necessary, I would at least like to do it as a separate patch, so that if need be we can revert it later without reverting the rest of the changes here.

Yes, if lldb.exe crashes or otherwise exits without returning to main, then the codepage will stay set in the current console. The latter can be fixed with a static object destructor, but not the former. I don't think this would be a practical problem, since very few console applications depend on the codepage in the first place (those that do tend to set it manually when they start). The setting isn't permanent, either, it's only for the current console window until it closes (or is reset by somebody else). And it doesn't affect the bytes input/output by any programs, only the way the console interprets those bytes. I'd say the impact of changing the codepage without changing it back is very low.

That part of the patch is not quite as important -- without it UTF-8 I/O is broken, but it's already broken anyway unless the user has taken explicit steps to reconfigure Window's console system to support Unicode character display (which requires fiddling in the registry and rebooting). No one I know has done this. I can remove it from the patch if you think it's too risky.

We're using both liblldb and lldb.exe, though mostly liblldb. No Python bindings as of now.

The patch for LLVM has been accepted, I'm waiting for someone to commit it when they have a moment. I'll rebase this one on the tip again afterwards.

Sounds good. I will test this out once that patch goes in. And yea, I
would prefer to remove the codepage stuff for now, and we can re-consider
adding it back if/when someone actually needs it. Then we can discuss some
other options like a preprocessor define that enables setting the codepage,
or a command line switch, etc.

Here we go! D17549 has been committed, so I've rebased this patch onto that latest version (r263233).

Oops, was missing a slash in the dst-prefix of the previous patch update.

... and had a slash too many at the start of the path prefixes. Sorry for the spam.

@zturner: Let me know when you're ready for this patch and I'll rebase it again, since there's been quite a few more commits.

I can look at it today. Sorry again for the delay, rebase 1 more time and
I'll check it out today

Oh yea I think the reason I didn't look at it again is because I was
waiting for the update where you removed the codepage stuff from the driver.

No worries. I removed the codepage stuff when I did the last rebase. Another one coming up shortly!

cameron314 updated this revision to Diff 50976.Mar 17 2016, 2:30 PM

Here we go! Rebased to tip (rL263735).

I'm not sure what your source tree layout looks like, but this isn't applying for me. All your paths have "trunk" in front of them, which is a little unusual in the directory structure is supposed to be something like this:

llvm
  tools
    lldb
      source

And I would apply the patch inside of the lldb directory. But yours appears to look like this:

lldb
  trunk
    source

So it doesn't apply.

Ah. Sorry. I have the same source layout as you locally, but I changed the paths in the patch to match the layout of Diffusion (rL). I'll redo the patch with relative paths from lldb...

cameron314 updated this revision to Diff 50979.Mar 17 2016, 2:45 PM

I think we're in different time zones -- I'm heading home for the day, but I'll look back in the morning to see if there's anything else that needs addressing.

I'm getting this when linking:

[826/826] Linking CXX executable bin\lldb.exe
FAILED: cmd.exe /C "cd . && "C:\Program Files (x86)\CMake\bin\cmake.exe" -E vs_link_exe --intdir=tools\lldb\tools\driver\CMakeFiles\lldb.dir --manifests  -- C:\PROGRA~2\MICROS~3.0\VC\bin\AMD64_~2\link.exe /nologo @CMakeFiles/lldb.rsp  /out:bin\lldb.exe /implib:lib\lldb.lib /pdb:bin\lldb.pdb /version:3.9  /machine:X86 /STACK:10000000 /debug /INCREMENTAL /subsystem:console  && cd ."
ucrtd.lib(ucrtbased.dll) : error LNK2005: _signal already defined in Platform.cpp.obj
   Creating library lib\lldb.lib and object lib\lldb.exp
bin\lldb.exe : fatal error LNK1169: one or more multiply defined symbols found
LINK Pass 1 failed. with 1169
[826/826] Linking CXX executable bin\lldb-mi.exe
FAILED: cmd.exe /C "cd . && "C:\Program Files (x86)\CMake\bin\cmake.exe" -E vs_link_exe --intdir=tools\lldb\tools\lldb-mi\CMakeFiles\lldb-mi.dir --manifests  -- C:\PROGRA~2\MICROS~3.0\VC\bin\AMD64_~2\link.exe /nologo @CMakeFiles/lldb-mi.rsp  /out:bin\lldb-mi.exe /implib:lib\lldb-mi.lib /pdb:bin\lldb-mi.pdb /version:3.9  /machine:X86 /STACK:10000000 /debug /INCREMENTAL /subsystem:console  && cd ."
ucrtd.lib(ucrtbased.dll) : error LNK2005: _signal already defined in Platform.cpp.obj
   Creating library lib\lldb-mi.lib and object lib\lldb-mi.exp
bin\lldb-mi.exe : fatal error LNK1169: one or more multiply defined symbols found
LINK Pass 1 failed. with 1169
ninja: build stopped: subcommand failed.

I assume it must be related to the fact that we're now defining UNICODE and _UNICODE, but I need to run and don't have enough time to figure out why.

I've always disliked the fact that we redefine the signal function which is already builtin in Windows anyway. I actually feel like we need to introduce the function sighandler_t Host::Signal(int sig, sighandler_t sigFunc), and on non-windows have this function call signal, and on Windows use our own custom implementation.

Let me know if you have any better ideas.

Hmm, that doesn't seem good. Locally it links for me (I get compile time warnings about the signal stuff, though).
It looks like the WinSDK signal.h is being pulled in despite _INC_SIGNAL being defined (note that there's no include guard in that header -- it uses #pragma once instead). And the signal handler callback has a different return type. I don't see anything related to Unicode.

Here's the tail of my build output for lldb-mi.exe. I'm using VS2015 on Windows 7 x64, compiling in release with ninja:

[77/78] 13.653s Building CXX object tools\lldb\tools\lldb-mi\CMakeFiles\lldb-mi.dir\Platform.cpp.obj
d:\dev\llvm\tools\lldb\tools\lldb-mi\Platform.h(84): warning C4005: 'SIG_DFL': macro redefinition
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\signal.h(35): note: see previous definition of 'SIG_DFL'
d:\dev\llvm\tools\lldb\tools\lldb-mi\Platform.h(85): warning C4005: 'SIG_IGN': macro redefinition
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\signal.h(36): note: see previous definition of 'SIG_IGN'
d:\dev\llvm\tools\lldb\tools\lldb-mi\Platform.h(87): warning C4273: 'signal': inconsistent dll linkage
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\signal.h(57): note: see previous definition of 'signal'

[78/78] 14.332s Linking CXX executable bin\lldb-mi.exe
   Creating library lib\lldb-mi.lib and object lib\lldb-mi.exp

The reason I mentioned UNICODE is because it's one of the main differences between pre-patch and post-patch. It works for me pre-patch. What command line are you passing to CMake? Are you using MSVC 2015 Update 1 or initial release?

The other difference is that I'm compiling in Debug and you said you were compiling release. Also I'm doing an x86 build of LLDB. Can you try x86 debug build with MSVC 2015 Update 1 and see if it still succeeds for you?

Since it works without my patch, you're probably right about it being related to the UNICODE define. All the other changes are completely removed from that part of the code. But I still don't see how it could affect that.

Here's the script I use to run cmake:

setlocal 
set PATH=%ProgramFiles%\Ninja;%ProgramFiles(x86)%\CMake\bin;%SYSTEMROOT%\system32;%SYSTEMROOT%;%SWIGDIR% 
call "%VS140COMNTOOLS%\vsvars32.bat" 
mkdir VS2015Debug 
cd VS2015Debug 
cmake -G "Ninja" -DLLVM_BUILD_EXAMPLES:BOOL=false -DPYTHON_HOME="C:\Python35" -DLLDB_DISABLE_PYTHON:BOOL=false -DLLDB_TEST_COMPILER=%cd%\bin\clang.exe -DCMAKE_BUILD_TYPE=Debug .. 
cd .. 
mkdir VS2015Release 
cd VS2015Release 
cmake -G "Ninja" -DLLVM_BUILD_EXAMPLES:BOOL=false -DPYTHON_HOME="C:\Python35" -DLLDB_DISABLE_PYTHON:BOOL=false -DLLDB_TEST_COMPILER=%cd%\bin\clang.exe -DCMAKE_BUILD_TYPE=Release .. 
cd .. 
endlocal

This is with VS2015 update 1. And it's already the x86 version, the patch has diverged significantly from what we have locally (which is based on a snapshot from several weeks ago) so I've been using the tip as-is :-)

I'm trying a fresh Debug build from scratch (deleting and re-creating the cmake build folder). I'll report back when it's done.

I'm using the amd64_x86 toolchain. They're supposed to be identical so
that's unlikely to be the problem, but it's the only difference i can see.
Let me know what happens on your clean rebuild

Aha, I get the same error now:

fatal error LNK1169: one or more multiply defined symbols found

I'm looking into it!

Patch doesn't appear to be clang-formatted. All patches need to be run through clang-format before submitting.

I think this is my last set of comments. If there's any you disagree with let me know. If you agree with everything, I can probably just make the changes on my end since they are mostly mechanical, and then submit.

cmake/modules/LLDBConfig.cmake
250 ↗(On Diff #50979)

This is using a tab, but it should be 2 spaces.

source/Core/ConnectionSharedMemory.cpp
147 ↗(On Diff #50979)

nullptr instead of NULL

source/Host/common/FileSpec.cpp
94–115 ↗(On Diff #50979)

Here we are doing some stuff with _USE_32BIT_TIME_T, but we reproduce almost this exact same logic in File::GetPermissions() except without the special case for _USE_32BIT_TIME_T. I think we should introduce a function called FileSystem::Stat() in very much the same way that you've done FileSystem::Fopen(). And then we put the implementation there so that we can have it in one place and make sure it's always going to be consistent and correct.

tools/lldb-mi/MIUtilFileStd.cpp
82–95 ↗(On Diff #50979)

Should we be using FileSystem::Fopen() here?

234–241 ↗(On Diff #50979)

Same here. FileSystem::Fopen()

And btw, the multiply defined symbol error is gone now that the custom signal function is removed

I'll run clang-format and submit a new patch. Thanks for fixing that duplicate symbol with the signal stuff!

source/Host/common/FileSpec.cpp
94–115 ↗(On Diff #50979)

Yeah, I agree this should really go in FileSystem.

tools/lldb-mi/MIUtilFileStd.cpp
82–95 ↗(On Diff #50979)

Since this is MSVC specific code already, I think it's OK. The _SH_DENYWR seems important here.

234–241 ↗(On Diff #50979)

Absolutely. I missed this one.

cameron314 updated this revision to Diff 51231.Mar 21 2016, 2:30 PM

Here's the latest patch, now with 100% more autoformatting and a centralized FileSystem::Stat function.
I didn't rebase, but it should still apply cleanly.

zturner accepted this revision.Mar 22 2016, 10:59 AM
zturner edited edge metadata.

I think this is ready to go in. You probably noticed that this took a long time to work out all the issues with. In the future if there's any way you could break things up into smaller patches, you will probably end up getting lots of small patches in in a shorter total time than one large patch. Even if it just means fixing all the Unicode issues in file X, and then in file Y in a separate patch, etc.

Anyway, thanks for seeing this through to the end!

This revision is now accepted and ready to land.Mar 22 2016, 10:59 AM

Excellent, thank you!
Fortunately, almost all the other changes I've made to llvm/clang/lldb locally are fairly small patches.

cameron314 closed this revision.Mar 22 2016, 3:04 PM

Closing now that the patch has been committed by rL264074.
(Not sure why it didn't auto-link.)

cameron314 added inline comments.Mar 22 2016, 4:26 PM
tools/lldb-mi/MIUtilFileStd.cpp
234–241 ↗(On Diff #51231)

Seems this breaks the LLDB build on OS X. Sean has reverted this part, but it should be using the #ifdef that was there in the earlier version of the patch:

#ifdef _WIN32
    std::wstring path;
    if (!llvm::ConvertUTF8toWide(vFileNamePath.c_str(), path))
        return false;
    pTmp = ::_wfopen(path.c_str(), L"wb");
#else
    pTmp = ::fopen(vFileNamePath.c_str(), "wb");
#endif