This is an archive of the discontinued LLVM Phabricator instance.

[clang-tidy] new check: bugprone-signed-char-misuse
ClosedPublic

Authored by ztamas on Dec 8 2019, 5:21 AM.

Details

Summary

This check searches for signed char -> integer conversions which might
indicate programming error, because of the misinterpretation of char
values. A signed char might store the non-ASCII characters as negative
values. The human programmer probably expects that after an integer
conversion the converted value matches with the character code
(a value from [0..255]), however, the actual value is in
[-128..127] interval.

See also:
STR34-C. Cast characters to unsigned char before converting to larger integer sizes
https://wiki.sei.cmu.edu/confluence/display/c/STR34-C.+Cast+characters+to+unsigned+char+before+converting+to+larger+integer+sizes

By now this check is limited to assignment / variable declarations.
If we would catch all signed char -> integer conversion, then it would
produce a lot of findings and also false positives. So I added only
this use case now, but this check can be extended with additional
use cases later.
The CERT documentation mentions another use case when the char is
used for array subscript. Next to that a third use case can be
the signed char - unsigned char comparison, which also a use case
where things happen unexpectedly because of conversion to integer.

Diff Detail

Event Timeline

ztamas created this revision.Dec 8 2019, 5:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 8 2019, 5:21 AM
ztamas added a comment.Dec 8 2019, 5:50 AM

I run the new check on LibreOffice codebase with the option CharTypdefsToIgnore = "sal_Int8".
The check produced 32 findings.

I see 3 false positives which are similar to the DereferenceWithTypdef test case where the code
uses sal_Int8 typedef, but the check still catches the cast, because the typedef information is
somehow lost by the dereferencing. Other use cases seem to be valid catches.

I had a closer look at some of the catches. Two of them were doing the same conversion what
an unsigned char conversion does. So there I replaced the old solution with a cast:
https://cgit.freedesktop.org/libreoffice/core/commit/?id=85b5253492cd1c87cebc98d4c453812562a2f9ef

In other cases, the code is suspicious and surely would fail on non-ASCII characters, but
because of the context that does not happen. For example, a code which works with
font names never could get any special character, since font names are ASCII strings.

The output can be found here:
https://gist.github.com/tzolnai/c7e70d9bb51b78dd8a4036d209b19682

ztamas added a comment.EditedDec 8 2019, 6:03 AM

I run the new check on LLVM codebase with the option CharTypdefsToIgnore = "int8_t".
The check produced 12 findings.

All findings seem valid use cases, where a char -> integer conversion happens.
I had a closer look at some of the catches.

Somewhere I see only type mismatch. Char is used instead of an integer type.
For example, in the finding bellow, trailingBytesForUTF8 is defined as char array, but it is used
everywhere as integer:

/home/zolnai/clang/llvm-project/llvm/lib/Support/ConvertUTF.cpp:623:43: warning: singed char -> integer ('unsigned short') conversion; consider to cast to unsigned char first. [bugprone-signed-char-misuse]
        unsigned short extraBytesToRead = trailingBytesForUTF8[*source];

Another use case seems suspicious to me. Here we have some parsing code, if I'm right, which uses the EOF from cstdio header, which is -1,
so might be a similar use case which is mentioned by the CERT specification.

/home/zolnai/clang/llvm-project/llvm/lib/TableGen/TGLexer.cpp:596:20: warning: singed char -> integer ('int') conversion; consider to cast to unsigned char first. [bugprone-signed-char-misuse]
    int NextChar = *CurPtr;

At the location of the third use case, there is already a comment about signed chars ("/* xxx what about signed chars here... */"),
so I interpret it as a validation of the check.

/home/zolnai/clang/llvm-project/llvm/lib/Support/regcomp.c:929:12: warning: singed char -> integer ('int') conversion; consider to cast to unsigned char first. [bugprone-signed-char-misuse]
                for (i = start; i <= finish; i++)

The whole output can be found here:
https://gist.github.com/tzolnai/b1cbe3f021b53a7ca1a6ecbffc1a9bf6

Eugene.Zelenko added a project: Restricted Project.
Eugene.Zelenko added inline comments.
clang-tools-extra/clang-tidy/bugprone/SignedCharMisuseCheck.cpp
23

Please use static for functions. See LLVM Coding Guidelines.

clang-tools-extra/docs/ReleaseNotes.rst
100

One sentence is enough.

clang-tools-extra/docs/clang-tidy/checks/bugprone-signed-char-misuse.rst
77

Please add newline.

clang-tools-extra/test/clang-tidy/checkers/bugprone-signed-char-misuse-with-option.cpp
75

Please add newline.

ztamas updated this revision to Diff 232814.Dec 9 2019, 4:58 AM

Fixes small things mentioned by reviewer commments.

ztamas marked 4 inline comments as done.Dec 9 2019, 4:59 AM
Eugene.Zelenko added inline comments.Dec 9 2019, 6:39 AM
clang-tools-extra/clang-tidy/bugprone/SignedCharMisuseCheck.cpp
22

Anonymous namespace is not needed anymore.

ztamas updated this revision to Diff 232851.Dec 9 2019, 7:19 AM

Remove anonymus namespace

ztamas marked an inline comment as done.Dec 9 2019, 7:20 AM
aaron.ballman added inline comments.Dec 10 2019, 10:48 AM
clang-tools-extra/clang-tidy/bugprone/SignedCharMisuseCheck.cpp
44–45

Does this properly handle architectures where char is signed rather than unsigned? Or does this only handle the case where the user wrote signed char?

50

Spurious semicolon?

57

Spurious semicolon?

clang-tools-extra/docs/clang-tidy/checks/bugprone-signed-char-misuse.rst
26

Should this check also be registered in the CERT module?

ztamas updated this revision to Diff 233638.Dec 12 2019, 9:13 AM

Update code based on reviewer comments.

  • Remove spurious semicolon.
  • Add tests about plain char.
  • Extend documentation describing how to control char signdness.
ztamas marked 2 inline comments as done.Dec 12 2019, 9:14 AM
ztamas updated this revision to Diff 233643.Dec 12 2019, 9:23 AM
ztamas marked 2 inline comments as done.

Add newlines to the end of the files.

ztamas marked an inline comment as done.Dec 27 2019, 11:42 PM
ztamas added inline comments.
clang-tools-extra/clang-tidy/bugprone/SignedCharMisuseCheck.cpp
44–45

It catches plain char too. I tested on 64 bit linux where char is signed by default.
The funsigned-char and fsigned-char options can be used to override the default behavior. Added some new test cases and also extended the documentation to make this clear.

44–45

Most of the catches I found in LibreOffice / LLVM code had char type.
I found isSignedCharDefault() method in LLVM code where clang handles platform differences, as I see, so I assume clang-tidy works based on that. With the compilation options, the char signedness can be controlled anyway, so I think this will be OK.
I could test only with a 64 bit linux. There plain char is caught by default, as I mentioned, and I could change this behavior with the -funsigned-char option.

clang-tools-extra/docs/clang-tidy/checks/bugprone-signed-char-misuse.rst
26

This check does not cover the whole CERT description. I guess a cert check should catch all bugous code related to the CERT issue, right?

I added the above comments two weeks ago, I just forget to push the submit button.

sylvestre.ledru added inline comments.
clang-tools-extra/docs/clang-tidy/checks/list.rst
404

list.rst has changed, you should try to rebase your patch!

ztamas updated this revision to Diff 235470.Dec 28 2019, 2:34 AM

Rebase patch.

ztamas marked 2 inline comments as done.Dec 28 2019, 2:37 AM
ztamas added inline comments.
clang-tools-extra/docs/clang-tidy/checks/list.rst
404

I rebased the patch. I added medium severity for this check because the CERT rule has the same severity.

aaron.ballman added inline comments.Dec 28 2019, 6:35 AM
clang-tools-extra/clang-tidy/bugprone/SignedCharMisuseCheck.cpp
98–99

How about: 'signed char' to %0 conversion; consider casting to 'unsigned char' first?

Also, should we have a fix-it to automatically insert the cast to unsigned char in the proper location for the user?

clang-tools-extra/docs/clang-tidy/checks/bugprone-signed-char-misuse.rst
22

Backticks around the command line options.

24

By now -> Currently

26

So long as this covers a decent amount of the CERT rule, I think it is fine to document the areas where we don't quite match the CERT rule.

26

bugous -> bogus

ztamas updated this revision to Diff 235765.Jan 1 2020, 7:37 AM
ztamas marked an inline comment as done.

Update docs / warning message, according to reviewer comments.

ztamas marked 4 inline comments as done.Jan 1 2020, 7:38 AM
ztamas marked 2 inline comments as done.Jan 1 2020, 8:01 AM
ztamas added inline comments.
clang-tools-extra/clang-tidy/bugprone/SignedCharMisuseCheck.cpp
98–99

I don't think it's a good idea to add a cast automatically. I think this kind of issue typically needs a human programmer to check what can be done in the code.
For example, when the program uses a char variable, but it is used as an int intentionally (without adding an int alias). It's not a good practice of course but it can happen. In this case, the best would be to use an int variable instead. Adding a cast might break the code if it's called with negative values.
It also can happen that the code works with actual characters, but handles the negative values on its own. In this case, it might be not enough to add an unsigned char cast, but it might be needed to adjust the surrounding code too.
All in all, based on my experience with the findings in the LibreOffice code base, I would not use an automatic correction here.

clang-tools-extra/docs/clang-tidy/checks/bugprone-signed-char-misuse.rst
26

My plan is to extend this check with other use cases in the next months. Can we get back to this question after that? Now, It's hard to tell how well it covers the CERT rule.

aaron.ballman added inline comments.Jan 1 2020, 8:05 AM
clang-tools-extra/clang-tidy/bugprone/SignedCharMisuseCheck.cpp
98–99

Okay, that makes sense, thank you!

clang-tools-extra/docs/clang-tidy/checks/bugprone-signed-char-misuse.rst
26

I checked and I think it covers the rule pretty closely. However, if you're already planning changes to cover new use cases, I'm fine waiting to add the alias.

clang-tools-extra/docs/clang-tidy/checks/list.rst
10

Just an FYI, but the severity columns were recently removed, so you'll probably have to rebase your patch.

clang-tools-extra/test/clang-tidy/checkers/bugprone-signed-char-misuse-fsigned-char.cpp
7

I think you need to update all the diagnostic text in the tests as the diagnostic has changed.

ztamas updated this revision to Diff 235790.Jan 1 2020, 11:42 AM

Update warning message in test files and rebase patch.

ztamas marked 2 inline comments as done.Jan 1 2020, 11:44 AM
This revision is now accepted and ready to land.Jan 1 2020, 12:46 PM
ztamas added a comment.Jan 4 2020, 5:47 AM

Thanks for the review!
I applied for the renewal of my commit access. I had commit access only for the SVN repo. After I get access to the GitHub repo I'll push this change.

This revision was automatically updated to reflect the committed changes.

A little bit late to the party, but still.
I'm honestly wondering if this should be a proper clang static analyzer data-flow aware check.
This is basically diagnosing every signed char -> signed int promotion, regardless of whether
the char is used as an ASCII char or not. It really seems like a wrong, noise-filled approach..

ztamas added a comment.EditedJan 12 2020, 4:02 AM

A little bit late to the party, but still.
I'm honestly wondering if this should be a proper clang static analyzer data-flow aware check.
This is basically diagnosing every signed char -> signed int promotion, regardless of whether
the char is used as an ASCII char or not. It really seems like a wrong, noise-filled approach..

I've considered creating a static analyzer check instead before I implemented this clang-tidy check
and it seemed to me path analysis would not help too much. At least in those cases, what I found
in LibreOffice codebase, there was no additional knowledge in the code about the character's value range.
(Actually, in one use case the method name contained "ASCII", but this information can't be used by the
static analyzer I guess.)
For example, when the code reads a file it's not known whether the file will contain only ASCII characters
or not. Or if you have a function working with characters (without having any explicit check in it about
character value range), then the static analyzer won't help there anyway.
I'm not sure how many test cases can be filtered out with a static analyzer check in practice, but I feel like
it's not much more effective than a clang-tidy check in this case. As I recall path analysis stops working if it
reaches a loop, right? It loses the information from before the loop or something like that, which makes a static
analyzer check even less useful, because characters are rare to go alone.

I run the new check on LibreOffice codebase with the option CharTypdefsToIgnore = "sal_Int8".
The check produced 32 findings.

Interesting. It found 31 issues on the Firefox code base!
I can share the list if you are interested!

ztamas added a comment.EditedJan 14 2020, 4:58 AM

I run the new check on LibreOffice codebase with the option CharTypdefsToIgnore = "sal_Int8".
The check produced 32 findings.

Interesting. It found 31 issues on the Firefox code base!
I can share the list if you are interested!

Yes, share it, please! It would be very useful to see how the check works on another project.