This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Add support for different file URI schemas.
ClosedPublic

Authored by ioeric on Jan 11 2018, 5:44 AM.

Details

Summary

I will replace the existing URI struct in Protocol.h with the new URI and rename FileURI to URI in a followup patch.

Diff Detail

Repository
rL LLVM

Event Timeline

ioeric created this revision.Jan 11 2018, 5:44 AM

Lots of quibbles here, some things we can certainly defer. Happy to chat offline if you're not sure...

clangd/URI.cpp
17 ↗(On Diff #129438)

This isn't correct, we need to escape some characters, turn \ to /, etc.

It's not quite obvious what to do here in general, because URI escaping rules are defined in terms of the full generic URI syntax.

I'd suggest:

  1. instead of just splitting scheme/path, split scheme/authority/path if it starts with //, authority is the part up to the next /. Otherwise empty.

Splitting authority/scheme/path allows "file" schema to just care about the path part, not the // prefix, without mandating that all schemas have authorities.

  1. when deserializing authority/scheme, decode all percent-encoding. when serializing, percent-encode characters that are neither reserved nor unreserved.

(The reserved characters basically have no business in filenames, apart from / which you don't want escaped. This allows custom URI schemes to be flexible)

35 ↗(On Diff #129438)

this error needs to be handled, it shouldn't be an assert

clangd/URI.h
19 ↗(On Diff #129438)

Why allow an invalid state?
I think I'd lean towards more type safety here: having a few factory functions that always produce valid URIs (like fromPath) and some that return Expected<URI> (like parse), but require URI itself to be valid.

This would probably mean making it a class and having schema/path be readonly...

19 ↗(On Diff #129438)

This doesn't really describe what it's for. Consider:

/// A URI describes the location of a source file.
/// In the simplest case, this is a file:// URI that directly encodes the absolute path to a file.
/// More abstract cases are possible: a shared index service might expose repo:// URIs that
/// are relative to the source control root.
21–22 ↗(On Diff #129438)

nit: scheme, not schema (they're synonyms, but URIs always talk about scheme)

24 ↗(On Diff #129438)

I think calling this "path" is confusing, because there's ambiguity between the URI path component, and the file path that you're ultimately trying to transform it into.

Consider "body", without changing semantics.

31 ↗(On Diff #129438)

(if you keep this: isValid(), or just valid(), or operator bool())

38 ↗(On Diff #129438)

fromURI is confusing, because the input is a String and the output is a URI.
parse()?

45 ↗(On Diff #129438)

Hmm, it manages file paths in exactly one schema :-)
"manages" is also very vague. What about

/ \brief URIScheme is an extension point for teaching clangd to recognize a custom URI scheme.
A scheme's job is to translate between URIs and file paths.

53 ↗(On Diff #129438)

nit: move this to the registry?

64 ↗(On Diff #129438)

Uri not in schema: why would the framework even call getAbsolutePath in that case?
Easier just to make that inexpressible by just having the framework pass the path/body as a stringref, I think.

We should have a way to handle errors though, e.g. file:// URIs that have relative paths. Return Expected rather than use the empty string for signaling?

65 ↗(On Diff #129438)

make currentfile the last param and optional? Uri is the primary param

69 ↗(On Diff #129438)

again:

  • return std::string to avoid the possibility of returning a FileURI with the wrong scheme
  • maybe provide a way to signal errors. In this case, optional<string> might be right, as we don't know in advance which scheme to use so failures are expected.
74 ↗(On Diff #129438)

this should be defined in the cpp file

89 ↗(On Diff #129438)

There are other possible failure cases.
Please propagate errors here rather than magic empty strings.

90 ↗(On Diff #129438)

this is part of the "user" API, not the extension API, move up above?
in fact, should this maybe be a method on URI?

90 ↗(On Diff #129438)

I think you want a public function to turn an absolute path into a file:// URI.
When communicating with the editor, we don't want to send custom URI schemes.

ioeric updated this revision to Diff 130201.Jan 17 2018, 9:23 AM
ioeric marked 17 inline comments as done.
  • Addressed review comments.

Thanks for the comments! I've refactored the code according to your suggestions. PTAL

Few more interface nits, sorry for picking here! Will get through the impl in the morning.

clangd/URI.cpp
43 ↗(On Diff #130201)

Don't you want the opposite here, convert from unix to native?

50 ↗(On Diff #130201)

this error can be returned by the framework already (it's valid for all paths)

57 ↗(On Diff #130201)

conversely, here you want native (which is the default)

58 ↗(On Diff #130201)

hmm, file:///foo/bar is more widely used than file:/foo/bar.
You way want to emit that instead?

clangd/URI.h
24 ↗(On Diff #130201)

You probably also want to describe roughly the subset of URI parsing we do.

e.g.
Clangd handles URIs of the form <scheme>:[//<authority>]<body>.
It doesn't further split the authority or body into constituent parts (e.g. query strings is included in the body).

27 ↗(On Diff #130201)

e.g. "https"

29 ↗(On Diff #130201)

e.g. "//reviews.llvm.org"

31 ↗(On Diff #130201)

e.g. "/D41946"

43 ↗(On Diff #130201)

I think "the first matching scheme" is a bit confusing - I guess here "scheme" refers to URIScheme, but in practice these should be 1:1 with schemes.
I'd just drop that, and instead add a sentence that errors may occur if the scheme isn't understood or the URI isn't valid in the scheme.

46 ↗(On Diff #130201)

this is the "more public" API, so the semantics of CurrentFile should probably be defined here, and merely referred to below.
(See below comment for questions about semantics)

63 ↗(On Diff #130201)

Hmm - what does "users" mean :-)
I can't think of a good way to rephrase. It might be clear enough without this sentence.

65 ↗(On Diff #130201)

nit: I'm not sure this longer comment would really help someone use this API.
I'd suggest either making the example concrete and focusing around that, or dropping it entirely - it's OK to be a little sparse here, since we don't really expect unfamiliar users.

76 ↗(On Diff #130201)

I think "the file from which the request is issued" is overly-specified (and a little unclear).

For instance, consider this workflow:
vim $projectroot
go to symbol -> "MyClass" (from global index)
Now we're trying to navigate to "monorepo://proj/myclass.h", but we don't have a "current file" to use as a hint. $projectroot would probably do nicely, and is available to clangd.

So maybe llvm::StringRef HintPath- "A related path, such as the current file or working directory, which can help disambiguate when the same file exists in many workspaces".

80 ↗(On Diff #130201)

You should probably pass authority here too, or we're committing all schemes to ignore authority

82 ↗(On Diff #130201)

similarly, this should return (authority, body), not just body.
a pair seems fine for such an internal API

90 ↗(On Diff #130201)

this seems easy enough to test via uri::create rather than exposing it publicly, but up to you.

97 ↗(On Diff #130201)

nit: you probably don't want anything between URIScheme and this - if you keep percent*, move them below?

ioeric updated this revision to Diff 130621.Jan 19 2018, 7:57 AM
ioeric marked 14 inline comments as done.
  • Address review comments
clangd/URI.cpp
43 ↗(On Diff #130201)

Ah, right!

57 ↗(On Diff #130201)

Don't we still need to convert \ to / on windows?

clangd/URI.h
29 ↗(On Diff #130201)

I think authority doesn't include leading "//"?

76 ↗(On Diff #130201)

Makes sense. Thanks for the suggestion!

90 ↗(On Diff #130201)

I think this would also be useful for other scheme extensions.

ioeric added a subscriber: jkorous-apple.
sammccall added inline comments.Jan 19 2018, 8:13 AM
clangd/URI.cpp
94 ↗(On Diff #130201)

Performance doesn't really matter here, but it looks like this is intended to save time by precomputing, and I don't think it's any faster than the direct version of percentEncode which might be easier to read:

for (unsigned char C : Content)
  if (shouldEscape(C))
    OS << '%' << format_hex_no_prefix(C, 2);
  else
    OS << C;

where shouldEscape just does some range checks.
(the "unsigned" is important!)

95 ↗(On Diff #130201)

128 should be 256, right?

123 ↗(On Diff #130201)

most implementations seem to just treat treat the % as literal in this case, e.g. "%41%" -> "A%", which makes this function unable to fail, which simplifies parse() a bit

125 ↗(On Diff #130201)

alternatively, using more llvm and less std:

if (*I == '%' && I + 2 < Content.end()
  && isHexDigit(*(I+1)) && isHexDigit(*(I+2)) {
  Result.push_back(hexFromNibbles(*(I+1), *(I+2)));
  I += 2;
} else
  Result.push_back(*I);
136 ↗(On Diff #130201)

nit: i'd probably prefer to keep "uri" referring to the whole URI, and call the one you mutate something else

138 ↗(On Diff #130201)

consider StringRef::split which avoids some index arithmetic:

pair<StringRef, StringRef> SchemeSplit = Uri.split(':');
if (SchemeSplit.second == "")
  return error
U.Scheme = percentDecode(SchemeSplit.first);
// then SchemeSplit.second contains the rest
149 ↗(On Diff #130201)

this is e.g. http://llvm.org
This is valid: scheme = 'http', authority = 'llvm.org', path = ''.
(Wikipedia is wrong, but the RFC covers this)

155 ↗(On Diff #130201)

this is e.g. file:///foo/bar, which is definitely valid!
scheme = 'file', authority = '', path = '/foo/bar'

(we should have a test for this one!)

168 ↗(On Diff #130201)

body may be empty. (http://llvm.org again)

unittests/clangd/URITests.cpp
115 ↗(On Diff #130201)

please include some typical examples:

"file:///etc/passwd"
"file:///c:/windows/system32/"
"http://llvm.org"
"urn:isbn:0451450523"

117 ↗(On Diff #130201)

the body should be "//x/y/z" here - the / that terminates the authority is part of the path.
(This is a valid URI, but not a typical example)

143 ↗(On Diff #130201)

this is the same as the next one

144 ↗(On Diff #130201)

these are both valid

148 ↗(On Diff #130201)

this is valid, authority may be empty

Sorry, our comments crossed...

clangd/URI.cpp
57 ↗(On Diff #130201)

yes. convert_to_slash(path, style) assumes that path is in style.

convert_to_slash("c:\\foo.txt", windows) --> "c:/foo.txt"
convert_to_slash("c:\\foo.txt", posix) --> "c:\\foo.txt" (because that's a perfectly valid filename on a posix system! well, apart from the colon)
convert_to_slash("c:\\foo.txt", native) --> "c:/foo.txt" if the host is windows, "c:\\foo.txt" if the host is windows (this is what you want)

clangd/URI.h
29 ↗(On Diff #130201)

You're right! my mistake.

ioeric updated this revision to Diff 130766.Jan 20 2018, 8:13 AM
ioeric marked 16 inline comments as done.
  • Addressed review comments.
  • Check windows vs unit paths in tests.
  • Properly check absolute paths.
ioeric added inline comments.Jan 20 2018, 8:15 AM
clangd/URI.cpp
57 ↗(On Diff #130201)

I see. Thanks for the clarification!

138 ↗(On Diff #130201)

split is neat. But here we want to distinguish between http: vs http while it's not trivial wth split. Similar for / following the authority if we want to keep body '/', say, in http://llvm.org/.

155 ↗(On Diff #130201)

Right! Realized file:////foo/bar was weird while preparing the previous diff ;)

sammccall accepted this revision.Jan 22 2018, 12:43 AM

LG with some nits addressed

clangd/URI.cpp
46 ↗(On Diff #130766)

should this be native(Path, posix)?
native(Path) seems to be a no-op

89 ↗(On Diff #130766)

nit: reserved
nit: you're not unescaping it, you're just not escaping it

and I don't think "for readability" is accurate.
Reserved characters are only reserved in certain contexts, escaping a slash in a path is (I think) incorrect.

138 ↗(On Diff #130201)

I don't understand what you mean by distinguishing http: vs http - "http://foo" will split into {"http", "foo"}. "httpfoo" will not split at all, and "http::foo" will split into "http" and "://foo".

Agreed regarding authority/body.

clangd/URI.h
31 ↗(On Diff #130766)

nit: prefer to omit \brief unless you want the brief comment to be something other than the first sentence.

(I know we have this in lots of places - it adds noise, and I finally got around to looking it up in the style guide!)

91 ↗(On Diff #130766)

nit: unreserved

82 ↗(On Diff #130201)

Still confused about this API, but let's chat offline?

90 ↗(On Diff #130201)

Hmm, I think if individual schemes need to do encoding/decoding, then our API is wrong :-)

unittests/clangd/URITests.cpp
1 ↗(On Diff #130766)

you may want a test that round-trips getVirtualTestFilePath("x") through a file URI and back again.
This should exercise the platform-specific code path - that path is under C:\ on windows, and /tmp/ on linux. The buildbots should give us platform coverage.

This revision is now accepted and ready to land.Jan 22 2018, 12:43 AM
ioeric updated this revision to Diff 130862.Jan 22 2018, 3:44 AM
ioeric marked 4 inline comments as done.
  • Addressed review comments.
ioeric edited the summary of this revision. (Show Details)Jan 22 2018, 3:45 AM
ioeric added inline comments.
clangd/URI.h
31 ↗(On Diff #130766)

Good to know. Thanks!

unittests/clangd/URITests.cpp
1 ↗(On Diff #130766)

Good idea! Didn't know such a nice function exists.

This revision was automatically updated to reflect the committed changes.