This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] Flexible line endings
ClosedPublic

Authored by mxbOctasic on Apr 12 2016, 1:36 PM.

Details

Summary

Line ending detection is now set with the `DeriveLineEnding` option.
CRLF can now be used as the default line ending by setting `UseCRLF`.
Without line ending detection, all line endings are converted according to the `UseCRLF` option.

Diff Detail

Event Timeline

mxbOctasic updated this revision to Diff 53446.Apr 12 2016, 1:36 PM
mxbOctasic retitled this revision from to clang-format: Flexible line endings.
mxbOctasic updated this object.
mxbOctasic added a reviewer: djasper.
mxbOctasic added a subscriber: cameron314.
djasper edited edge metadata.Apr 12 2016, 1:40 PM

I am not convinced that this behavior makes sense. It's not clang-format's task to clean up files with mixed line endings. I'd really like to avoid this complexity and most importantly adding additional configuration options. It doesn't fully apply, but please read:
http://clang.llvm.org/docs/ClangFormatStyleOptions.html#adding-additional-style-options

mxbOctasic edited edge metadata.Apr 12 2016, 1:43 PM
mxbOctasic added a subscriber: cfe-commits.

ClangFormat still needs to know what kind of newline to insert when reformatting code.
It already had the "auto-detect newline" behaviour; this patch simply allows the auto-detection to be overridden with an explicit setting. This is important for short files that have no newlines to start with, in which case the autodetection cannot work.

Can you elaborate more on what case this is actually addressing? Are you saying that you have a one-line file without linebreaks that you want clang-format to format?

Sure :-)

The motivation behind this patch (and a few more upcoming ones) is to make ClangFormat (the library) more friendly for IDE integration (which is our primary use case), in particular for on-the-fly auto-formatting.

In an IDE auto-format setting, it's often the case that only fragments of the file at a time are being formatted. Often that means only a single line is passed as context. Or, sometimes, the file really is only one line to start with (e.g. formatting a file containing a one-line function declaration with the 'wrap after return type' option on needs to insert a newline).

Passing snippets / single lines from an IDE does not make sense. Don't try to go down this road. clang-format always needs the full file as context.

If you are already in an IDE, that IDE should control the line endings, not clang-format. Write a thin wrapper that changes clang-format's output to use the correct line endings.

Too late ;-) We've successfully integrated it as the engine behind our auto-formatting feature, passing only the minimal context necessary.
We tried at first to supply the full file context, but that turned out to be much too slow in larger files. (We did a fair bit of testing with some of the real code our users have written.)

We can, of course, change the newlines on the IDE side instead (indeed, we already have a few transforms on both the input and output). But it seemed to us that it would make more sense to have ClangFormat generate the correct newlines in the first place given that this is what it already attempts to do with its newline detection feature. And we felt this minor extension would be useful to others.

It has very basic newline detection, but if possible, I'd like to not significantly increase that. Specifically, I don't think it is worth the complexity of additional configuration options that users have to worry about, if it can just do the right thing (which IMO, it currently does other than in a few some rare corner cases).

I am curious: Have you measured what makes clang-format slow if you give it the full file? Is that the file itself or the time clang-format spends on parsing it? Maybe some of the optimizations you now do in your IDE would actually be suited to clang-format itself.

We did profile at one point, to see if it was something we could trivially speed up (it wasn't). Most of it was on the parsing/annotation side, yes (which makes sense, since that's the majority of the work when the format region itself is limited). I'm afraid I don't recall more details (this was a few months ago after we had just started). The memory usage spikes significantly, too. Anyway, we didn't want the size of the file to affect the insertion time of a semicolon if at all possible.

All of our optimization boils down to only giving ClangFormat the minimum amount of code necessary to format the statement or block we want. The way we determine this context is almost certainly not something you want to see in ClangFormat :-) It's quite heuristic in some places, and is very specific to C/C++. Sometimes it will even bail out and not format anything if the context spans too many (e.g. 2000+) lines. There's multiple levels of context involved, too (e.g. when declaration alignment is on, adjacent statements that look like declarations are also included in the context to get things to line up). And fake braces to set up the indentation correctly. It's also all written in C#, leveraging a large body of existing parsing code that we'd already written (for features like smart indent); things taken for granted in the IDE, like O(1) access to the tokens that make up a given line (which is constantly maintained as the user types instead of having to reparse from the top of the file), are simply not possible to do as efficiently in ClangFormat (barring persistent caching between calls, which of course is a whole can of worms in itself). Anyway, all that stuff is very "porcelain-y" code (to borrow a term from the git developers) that would no doubt vary significantly from IDE to IDE.

Ooh, and I should also add, there's a non-trivial cost to grabbing the entire file contents as a string and then converting it to UTF-8. If ClangFormat was faster, this would certainly become a significant bottleneck that can't really be worked around.

MyDeveloperDay added a project: Restricted Project.

Sorry I'm reviewing very old revisions, @mxbOctasic are you still interested in a patch like this?

> It's not clang-format's task to clean up files with mixed line endings

This is an interesting thought given that clang-format does exactly that based on how many ^M's there are in the file

if I make a file:

// A^M
// B^M
// C
// D

and have vim auto format on save when I come back it says

// A^M
// B^M
// C^M
// D

according to notepad++ its

but if I take this file and clang-format test.cpp > out.cpp I get

So I'm fine if the rule is it isn't "clang-format" job to adjust line endings... but if that's the case it should leave them alone, I think the 50/50 rule is a little crude at best!

I think it would be nice if clang-format could be setup to be "specific" and not random based on how large a section someone editing in a file in Visual studio without the correct setting.

For me, not having to remember to do a separate dos2unix would really help reduce the number of "Fix line endings" commits we see (especially on platforms where dos2unix isn't available in the shell)

This could really have helped the 1/2 million commits on github...(https://github.com/search?q=%22line+endings%22&type=Commits)

I think this patch like this is a good compromise, I'd personally just make UseCRLF and enum, I don't think it needs 2 options

enum {
    Derived   // |Do as we do now
    LF            // like doing a dos2unix
    CRLF        // like doing a unix2dos
}

Stil interested?

MyDeveloperDay retitled this revision from clang-format: Flexible line endings to [clang-format] Flexible line endings.Nov 14 2019, 7:59 AM
MyDeveloperDay added a reviewer: mitchell-stellar.
MyDeveloperDay added a project: Restricted Project.

@MyDeveloperDay: @mxbOctasic has moved on but I'm still maintaining this patch upstream, so I'm still interested in having it merged if possible.

The advantage of having two options is that a default line ending can still be specified when none can be derived (this is important to our use case where we sometimes pass individual lines to clang-format).

I suspect this patch might need a rebase, but I personally don't see anything wrong with it. I think this could really help.

I'm pretty sure the Unix line ending requirement is in almost every major style guide (I saw @STL_MSFT mention that they standardize on \r\n too) so I think it meets the criteria. (I know i'll personally use it as I live in a cross-platform shop and users forgetting to configure their Visual Studio is a major pain!)

If you're happy to rebase it, I'm happy to review it again.

MSVC's STL currently uses CRLF (DOS) line endings, not LF (Unix). I wrote a validator, https://github.com/microsoft/STL/blob/58bb49d63d92e7a0346a05af29816aeea6b4cf0f/tools/validate/validate.cpp , to detect LF files, mixed line endings (LF and CRLF in the same file), damaged endings (CR only), and enforcing one newline at the end of every file, because clang-format doesn't currently handle those whitespace issues.

MSVC's STL currently uses CRLF (DOS) line endings, not LF (Unix). I wrote a validator, https://github.com/microsoft/STL/blob/58bb49d63d92e7a0346a05af29816aeea6b4cf0f/tools/validate/validate.cpp , to detect LF files, mixed line endings (LF and CRLF in the same file), damaged endings (CR only), and enforcing one newline at the end of every file, because clang-format doesn't currently handle those whitespace issues.

@STL_MSFT I saw your tweet on the matter, I'd like to bring some of these rules into clang-format I think this patch might help, because it would be able to setup CRLF endings in the same way others would setup LF,

I also agree about the EOF, this can be a major pain as not all compilers complain and if you are doing cross-platform work developers accidentally forgetting them can cause a reasonable number of housekeeping post commits https://github.com/search?p=2&q=%22EOF%22&type=Commits

I like the idea of being able to turn off the derived rule, and using specific CRLF or LF rule to ensure conformance, (even if I'm just using clang-format to identify the issues in a pre merge check. i.e. having my .clang-format on my CI system being a little stricter)

cameron314 added a comment.EditedNov 15 2019, 5:49 AM

I've got a rebased diff but Phabricator won't let me attach it to this differential review (too old?). What do I do?

EDIT: Ah, it must be because I'm not the owner, despite it being editable by all.

Rebased diff on latest upstream.

MyDeveloperDay accepted this revision.Nov 15 2019, 7:00 AM

This LGTM, I think this could help more than people might at first realize. I love the idea that I can use clang-format in dry-run (-n) mode to now make line endings seem a compile error..

clang/lib/Format/Format.cpp
1360

The great thing about this is if I don't want derived line endings I save scanning through the code twice counting the line endings

This revision is now accepted and ready to land.Nov 15 2019, 7:00 AM
This revision was automatically updated to reflect the committed changes.