This is an archive of the discontinued LLVM Phabricator instance.

[clang-tidy] Add bugprone-strlen-in-array-index checker
Needs ReviewPublic

Authored by kovacs-levent on Dec 7 2020, 10:33 AM.

Details

Summary

According to https://wiki.sei.cmu.edu/confluence/display/c/FIO37-C.+Do+not+assume+that+fgets%28%29+or+fgetws%28%29+returns+a+nonempty+string+when+successful, bugprone-strlen-in-array-index checker is created.

The checker checks for the usage of strlen() on an array within it's own subscript operator. The main issue in the ticket is not the usage of fgets(), but rather the resulting write-outside-array-bounds error. This could occur if the array contains binary data and the terminating '\0' character is read into the array.

The check helps to narrow the scope by guarding against the usage of strlen in the subscript operator.

Diff Detail

Event Timeline

kovacs-levent created this revision.Dec 7 2020, 10:33 AM
kovacs-levent requested review of this revision.Dec 7 2020, 10:33 AM

Messed up the first diff, here's the proper one.

Eugene.Zelenko added inline comments.
clang-tools-extra/docs/ReleaseNotes.rst
123

Please enclose strlen in double back-ticks.

124

Please enclose operator [] in double back-ticks.

152

Please add empty line after.

clang-tools-extra/docs/clang-tidy/checks/bugprone-strlen-in-array-index.rst
7

Please synchronize first statement with Release Notes.

8

Please enclose \0 on single back-ticks and strlen() in double back-ticks.

10

Please enclose strlen in double back-ticks.

Doc improvements.

Fix in Releasenotes.rst.

kovacs-levent marked 6 inline comments as done.Dec 8 2020, 3:30 AM
Eugene.Zelenko added inline comments.Dec 8 2020, 6:37 AM
clang-tools-extra/docs/clang-tidy/checks/bugprone-strlen-in-array-index.rst
17

Please Clang-format example code. (void) is not necessary in C++ code. Or change code block to C.

kovacs-levent marked an inline comment as done.

Thanks. I missed formatting rst files because the clang-format messed up the rst part and I just forgot about them.

aaron.ballman added inline comments.Dec 15 2020, 1:45 PM
clang-tools-extra/clang-tidy/bugprone/StrlenInArrayIndexCheck.cpp
39

I think this diagnostic is going to be very chatty -- doing strlen(buf) - N inside of an array index isn't dangerous unless there is a null character within the first N bytes of the buffer, and I suspect there's plenty of code that validates the buffer before performing the operation. Have you tried running this check over some large code bases to see what it's true/false positive ratio is?

This check feels like it is a very specific instance of checking for buffer overflows which would be better-handled by the static analyzer (which can track the characteristics of the buffer better through control and data flow analysis).

steakhal added inline comments.Dec 16 2020, 2:22 AM
clang-tools-extra/clang-tidy/bugprone/StrlenInArrayIndexCheck.cpp
39

I agree with you.
With the right configuration, the CSA can already catch this in some scenarios.
However, a tidy check could also be beneficial, if the false/true-positive rate is convincing enough.

I will measure this patch on several projects in the future and publish the results in a public CodeChecker instance.
@kovacs-levent, you can categorize the results and publish a small summary about them here.
Right now I'm a bit busy, but I hope we don't rush anywhere :)

kovacs-levent added inline comments.Dec 17 2020, 2:42 AM
clang-tools-extra/clang-tidy/bugprone/StrlenInArrayIndexCheck.cpp
39

After the checker was written I gave it a quick test run using CodeChecker on some open source projects (with some help by @Charusso). Unfortunately, I don't have the results now, but you're right in that there were some false positives.

However, it was kind of dependent on the project, some projects had the validation sorted out, while some didn't. I couldn't say an overall false/true-positive rate, but I'll try to find the test results. What are we aiming for regarding this rate?

Thanks @steakhal, I'm looking forward to your results.

aaron.ballman added inline comments.Dec 17 2020, 5:30 AM
clang-tools-extra/clang-tidy/bugprone/StrlenInArrayIndexCheck.cpp
39

Right now I'm a bit busy, but I hope we don't rush anywhere :)

There's no rush that I'm aware of, so take whatever time you need. :-)

However, it was kind of dependent on the project, some projects had the validation sorted out, while some didn't. I couldn't say an overall false/true-positive rate, but I'll try to find the test results. What are we aiming for regarding this rate?

We don't have a hard-and-fast rule for false positive rates, but the general guidance is that the frontend should have almost no false positives, analysis can have more false positives than the FE, and tidy can have more false positives than analysis.

I'm mostly looking for a sense as to whether this finds enough true positives that users would keep the check enabled, or whether the check is mostly false positives and likely to be disabled. I'm also curious whether the existing CSA checks catch the same true positives or could be easily modified to catch them, as I think the flow-sensitive check will lead to better results in the long-run.

@kovacs-levent @aaron.ballman
On this CodeChecker instance, you can find the reports on several projects.
The relevant runs are suffixed with kovacs_D92777. @kovacs-levent, you can use the demo/demo user as well, for categorizing the reports into false/true positives.
Once you clicked on a run, you can filter out the relevant reports by the Checker's name bugprone-strlen-in-array-index.

There are some funny ones, but overall, I think it does not produce an immense amount of reports.

How should the developer 'fix' their code if they certain that the report is a false-positive? What pattern do you suggest suppressing a false-positive? @kovacs-levent
That could suppress the one I cherry-picked before.


I'm mostly looking for a sense as to whether this finds enough true positives that users would keep the check enabled, or whether the check is mostly false positives and likely to be disabled.

Yeah, we should aggregate the results first.

I'm also curious whether the existing CSA checks catch the same true positives or could be easily modified to catch them, as I think the flow-sensitive check will lead to better results in the long-run.

It can. The core,alpha.security.taint,alpha.security.ArrayBoundV2 checkers could catch it, if the strlen would propagate taint to the length value.
That way if the analyzer could not prove that the access is in bound, it could aggressively diagnose this because the subscript expression is tainted.
But currently, taint analysis and the ArrayBoundV2 are not yet ready for production.

This comment was removed by kovacs-levent.

Sorry for taking so long with the reply, I've prepared the results.

First of all,a recommended solution generally is to store the length in a variable, and use that. Most of the false-positive results are calculating the strlen() of the same string multiple times and this gets rid of that as well.

size_t len = strlen(buf);
if (len > 0) {
    buf[len-1] = '0';
}

Also this makes the check much more readable than in some of the false-positive examples.

There weren't many hits in the test data so I compiled all of them.

memcached
Positives:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?run=Memchached_1.6.8_Changed_kovacs_D92777&detection-status=New&sort-by=checkerId&sort-desc=false&report-id=55297
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?run=Memchached_1.6.8_Changed_kovacs_D92777&detection-status=New&sort-by=checkerId&sort-desc=false&report-id=55298
There's no length checking after copying from optarg.

False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?run=Memchached_1.6.8_Changed_kovacs_D92777&detection-status=New&sort-by=checkerId&sort-desc=false&report-id=55316
There are two of these accesses, there's a NULL check in place, so the strlen() call is guarded.

False-postivie rate: 2/4

curl
Positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=Curl_curl-7_66_0_Changed_kovacs_D92777&report-id=55430
The path buffer is not checked

False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=Curl_curl-7_66_0_Changed_kovacs_D92777&report-id=55429
There's a check for the first character.

False-positive rate: 1/2

vim
Positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=Vim_v8.2.1920_Changed_kovacs_D92777&report-id=56169
I can't see an obvious check here for lno, if the returned token is exactly one character long this can cause problems

False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=Vim_v8.2.1920_Changed_kovacs_D92777&report-id=56167
There are possibly 3 accesses here for the same buffer, and the fname buffer check is a long way up at L 509.

False-postivie rate: 1/2

postgreSQL
Positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60172
I can't see a check on boot_yytext, but it's assumed to be a quoted string.

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=62650
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=62649
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=62648
For me these are a bit wtf, so I added them. Even if there's a check on the duplicated strings, this should be more readable

False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60042
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60582
Yeah, these are a bit funny...

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60615
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60582
There's a check for the first character

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=62172
I guess this is false positive, because it overwrites the first character beforehand and pg_strdup will throw an error if the first character is a null pointer in the copied string

False-positive rate: 5/9

Tmux
In this project all the matches are false-positives unfortunately.
False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=tmux_3.0_Changed_kovacs_D92777&report-id=55401
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=tmux_3.0_Changed_kovacs_D92777&report-id=55404
It's a false positive because there's a check in place, but computing strlen(copy) twice is not pretty, the recommended code snipet avoids this.

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=tmux_3.0_Changed_kovacs_D92777&report-id=55403
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=tmux_3.0_Changed_kovacs_D92777&report-id=55406
The buffer is filled beforehand in a wierd way, but it should be safe for access.

False-positive rate: 7/7

Overall
Overall except tmux the False-positive rate for the projects individually is mainly around 50%, but in some cases the recommended snippet can improve performance and readability.

Overall false-positive rate: 16/24.

The false-positives which could be improved by the code snippet are:

if(path_to_use[0] &&
   (path_to_use[strlen(path_to_use) - 1] != '/') )

This snippet is from curl's false positive, but tmux also uses it a few times.

@kovacs-levent, thorough and accurate evaluation!
See my comments inline.


First of all,a recommended solution generally is to store the length in a variable, and use that. Most of the false-positive results are calculating the strlen() of the same string multiple times and this gets rid of that as well.

size_t len = strlen(buf);
if (len > 0) {
    buf[len-1] = '0';
}

Also this makes the check much more readable than in some of the false-positive examples.

There are multiple ways of checking this.
Eg.: *buf, buf[0], len > 0, len mean the same thing in this context.

In most of the cases, your checker fired for read accesses. In those cases, you could offer a simpler alternative such as:

if (path_to_use[0] && path_to_use[strlen(path_to_use) - 1] != '/')

You can guard the access using a negation and the or operator as well. You should also recognize this.

if (!path_to_use[0] || path_to_use[strlen(path_to_use) - 1] != '/')

Accessing the first character might be done using the pointer dereference operator:

if (*path_to_use && path_to_use[strlen(path_to_use) - 1] != '/')

To ignore these cases, here is a grammar in BNF form. If the access you are dealing with is a <safe_read_access>, you should ignore that.

<access> ::= <id>[ strlen(<id>) - 1 ]

<relational_operator> ::=  < | > | <= | >= | != | ==

<compare_access> ::= <access>
                 | <access> <relational_operator> <constant?>
                 | <constant?> <relational_operator> <access>

<first_character> ::= <id>[0]
                  | * <id>

<safe_read_access> ::= <first_character> && <compare_access>
                   | ! <first_character> || <compare_access>
                   |   <first_character> && <compare_access>
                   | ! <first_character> || <compare_access>

For write accesses, it's really hard to give a generic pattern for suppression. The one you proposed is the best I can think of.
But we have to find a reasonable answer for these questions to get it working:

  • How should one suppress multiple warnings of the same kind? How does this compose?
if (len1 && len2) {
  buf1[len1 - 1] = 'A';
  buf2[len2 - 1] = 'B';
}
  • Only the immediate parent if condition will be considered as length checks?
  • What if the user doesn't want an extra branch there due to performance reasons? Could an assert(len1); assert(len2); suppress the warning?

AFAIK, walking backward the AST is really expensive, thus not recommended.
Am I right about that @aaron.ballman?

So, finding the condition of the parent if seems tricky to pull off. The same applies for asserts.

There weren't many hits in the test data so I compiled all of them.

memcached
Positives:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?run=Memchached_1.6.8_Changed_kovacs_D92777&detection-status=New&sort-by=checkerId&sort-desc=false&report-id=55297
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?run=Memchached_1.6.8_Changed_kovacs_D92777&detection-status=New&sort-by=checkerId&sort-desc=false&report-id=55298
There's no length checking after copying from optarg.

Confirmed! This bug can be triggered indeed. Nice catch.

False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?run=Memchached_1.6.8_Changed_kovacs_D92777&detection-status=New&sort-by=checkerId&sort-desc=false&report-id=55316
There are two of these accesses, there's a NULL check in place, so the strlen() call is guarded.

Yes, it's safe.

False-postivie rate: 2/4

curl
Positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=Curl_curl-7_66_0_Changed_kovacs_D92777&report-id=55430
The path buffer is not checked

Aaah, it's really tricky.
The data->state.up.path aliases with path since path is initialized to data->state.up.path and not modified again.
And the access looks like this: !*data->state.up.path && path[strlen(path) - 1] != '/' which tries checking that the path is not empty and not ends with /.
Unfortunately, there is a negation, so it is checking the ending ONLY IF path points to an empty string.
Even if it were checking the right thing, I still agree with you. We should warn about this case as well. Aliasing is just a headache for everybody.

Really nice catch. You should probably file an issue about this.

False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=Curl_curl-7_66_0_Changed_kovacs_D92777&report-id=55429
There's a check for the first character.

Yes, for this time they are checking the right thing!

False-positive rate: 1/2

vim
Positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=Vim_v8.2.1920_Changed_kovacs_D92777&report-id=56169
I can't see an obvious check here for lno, if the returned token is exactly one character long this can cause problems

It is safe. You can assume there that the lno is a non-empty string (due to strtok) - which means that it has at least ONE character + a zero-terminator character.
So, it is a false-positive.

False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=Vim_v8.2.1920_Changed_kovacs_D92777&report-id=56167
There are possibly 3 accesses here for the same buffer, and the fname buffer check is a long way up at L 509.

It requires contextual information to decide. It's probably a false-positive, yup.

False-postivie rate: 1/2

postgreSQL
Positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60172
I can't see a check on boot_yytext, but it's assumed to be a quoted string.

I don't think YACC scanner codes really have such bugs. I think the grammar it parses ensures that the content is quoted. I did not investigate it though.

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=62650
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=62649
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=62648
For me these are a bit wtf, so I added them. Even if there's a check on the duplicated strings, this should be more readable

The same reasoning applies.

False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60042

Calling strlen 2x is inefficient. It's not a problem to warn for this.

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60582
Yeah, these are a bit funny...

I don't mind to warn for this as well. These are beyond our capabilities.

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60615

Hm, it's interesting. We should recognize this pattern as well. And ignore such cases.

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=60582
There's a check for the first character

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=PostgreSQL_REL_13_0_Changed_kovacs_D92777&report-id=62172
I guess this is false positive, because it overwrites the first character beforehand and pg_strdup will throw an error if the first character is a null pointer in the copied string

Yes, it's a false-positive. If grolist is strlen(grolist) < 3 it skips it. It can not overwrite the zero-terminator accidentally, and it remains a valid c-string, calling strlen subsequently is safe.
IMO it's perfectly fine to warn for this case as well.

False-positive rate: 5/9

Tmux
In this project all the matches are false-positives unfortunately.
False-positive:
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=tmux_3.0_Changed_kovacs_D92777&report-id=55401

Wait, why the second line is highlighted?
The check happens a line above the report.

https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=tmux_3.0_Changed_kovacs_D92777&report-id=55403
https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?review-status=Unreviewed&review-status=Confirmed%20bug&detection-status=New&detection-status=Reopened&detection-status=Unresolved&run=tmux_3.0_Changed_kovacs_D92777&report-id=55406
The buffer is filled beforehand in a wierd way, but it should be safe for access.

For the first yes, but I'm not sure about the second one.
If cmd_table == NULL, then the body of the loop will be skipped. In that case strlen(s) == 0, causing an invalid access.
I wouldn't mind a warning there.

I'm going to be honest, I have no idea how to move forward with this checker. I guess you've made some suggestions @steakhal, but I don't have any idea how to implement these as improvements.

You gave this grammar for improving read access triggers, but this says almost nothing to me about how should I change the code.

For write accesses you even mentioned that it's tricky to give a generic pattern and that walking backwards the AST is not a good idea. What this says to me is that this whole ordeal should be avoided.

I suppose I'm looking for some sort of verdict. The results are there, whether the change is accepted or not is up to the review. If it's not accepted, then I'd like to get some guidance about how to move forward to hopefully improve the checker and get it accepted. @aaron.ballman can you please comment?

I'm going to be honest, I have no idea how to move forward with this checker. I guess you've made some suggestions @steakhal, but I don't have any idea how to implement these as improvements.

Sorry for my complicated wording xD Let me express my thoughts in code instead.
Here is the matcher I roughly imagined:
https://godbolt.org/z/q4WsWj
The idea behind this is to explicitly recognize and ignore certain patterns. Such as *p && Val == p[strlen(p) - 1].
This way you can further reduce the number of false positives.

However, I can't see any reasonable way of suppressing this checker for write accesses.
Imagine this code:

p[0] = 'a';
// more code
// strlen returns 1 or greater
p[strlen(p) - 1] = 'X'; // safe, but you will still report it

Recommending to hoist the strlen(p) - 1 to a local variable is not a real solution, but a workaround. Recommending this pattern to the user is probably not ideal.
However, I don't have any better suggestions for this - so we stuck with this suppression mechanism.

I'd like to see how well the AST matcher suppression mechanism achieves in the wild.

Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2021, 4:08 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

The tests are a bit slim. We definitely need more cases... what if the array isn't allocated locally, but received as a parameter? I take that we need both an fgets and a strlen call in the same function(?) to emit diagnostics. Are we sure it will work if the array is originally from somewhere else?

clang-tools-extra/clang-tidy/bugprone/StrlenInArrayIndexCheck.cpp
22

While certainly not natural, if we're already here, it wouldn't take more than a few additional lines to match the equivalent but "expanded" version of array indexing:

*(buf + strlen(buf) - 1) = 0;
23

What's a hard implicitCastExpr doing here? Does a cast always happen? I feel this might be a bit too restrictive... there are matcher adaptors like ignoringParenImpCasts, that route should be investigated.

27–29

strlen_s, wcslen_s? If I am reading this correctly then in case the input buffer starts with a NUL (and the safe size is large enough) then it behaves the same way strlen and will still return 0.

If we are already here, it might be worth to also check for strnlen. It is POSIX and BSD specific, but behaves the same way as strlen_s.

32–33

Why is the inner matcher duplicated here? What cases do we have a direct (?) - and a non-direct (descendant) one?

clang-tools-extra/test/clang-tidy/checkers/bugprone-strlen-in-array-index.cpp
17–18

Such code examples might be covered by non-permissive copyright, which might be incompatible with LLVM's licence. I was unable to find anything wrt. copyright on SEI CERT's Confluence, which in this case makes the work to be (strictly) copyrighted by default. I am not sure what the actual situation is, but such a direct mention of copying could lead to problems.

MTC added a subscriber: MTC.Aug 16 2021, 6:43 PM