This is an archive of the discontinued LLVM Phabricator instance.

[clang] replaces numeric SARIF ids with heirarchical names
Needs ReviewPublic

Authored by cjdb on Mar 22 2023, 11:45 AM.

Details

Summary

Per §3.27.5 and §3.27.7, the rule.id and ruleId properties need to
be heirarchical names (this is only a requirement of rule.id, but they
need to match when both are present). We've been using the enum values
for these properties up until now, but that's neither conforming nor
stable, so we've changed it to a dotted version of the enum identifiers
(which are substantially more stable than their numeric values).

Fixes #61597.
Depends on D145178.

Diff Detail

Event Timeline

cjdb created this revision.Mar 22 2023, 11:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 22 2023, 11:45 AM
cjdb requested review of this revision.Mar 22 2023, 11:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 22 2023, 11:45 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
cjdb updated this revision to Diff 507494.Mar 22 2023, 2:08 PM
cjdb edited the summary of this revision. (Show Details)

swaps commits to see if that fixes CI

cjdb updated this revision to Diff 507503.Mar 22 2023, 2:14 PM
cjdb edited the summary of this revision. (Show Details)

rebasing continues

cjdb updated this revision to Diff 507510.Mar 22 2023, 2:29 PM

I think this corrects everything.

denik added inline comments.Mar 22 2023, 3:26 PM
clang/lib/Basic/DiagnosticIDs.cpp
103

As of now it's 4600 strings, around 150k characters. Doesn't look bad but I think it should be mentioned in the change message because it affects all diagnostic.

clang/lib/Frontend/SARIFDiagnostic.cpp
52–53

§3.5.4 says that the hierarchical strings are separated by "/".

But more generally, are diagnostic names really fall under "hierarchical" definition?
Words between underscores should be treated as components. And $3.27.5 says that the leading components have to be the identifier of the rule.
In some cases they look like valid components, e.g. err_access, err_access_dtor, err_access_dtor_exception.
But in cases like err_cannot_open_file neither of the leading components exists.

Theoretically we could use groups as the leading component for warnings for example. For errors the leading components are probably not even necessary, since if I understood correctly they are needed to suppress subsets of violations on the SARIF consumer side.
Or we just could keep the names with underscores as is. WDYT?

cjdb added inline comments.Mar 22 2023, 5:52 PM
clang/lib/Frontend/SARIFDiagnostic.cpp
52–53

I think in light of what you've said, changing back to underscores is probably best.

vaibhav.y added inline comments.Apr 5 2023, 11:53 AM
clang/lib/Frontend/SARIFDiagnostic.cpp
52–53

But more generally, are diagnostic names really fall under "hierarchical" definition?

I have the same concern, but this is okay for a first pass as a "flat hierarchy" :)

If we want a deeper structure, we'll need some extra metadata in DiagnosticSemaKinds.td's Error<...> to add cluster names as a follow up to this.

WDYT about something like clang/visibility/err_access or clang/syntax/err_stmtexpr_file_scope.

Alternatively: we could draw from the c++ standard structure: https://eel.is/c++draft/basic.def.odr#term.odr.use and say the error code for an ODR violation could be clang/basic/def/odr, again I'm unsure how well this meshes with clang's diagnostic model.

Endill added a subscriber: Endill.Apr 28 2023, 11:40 AM
Endill added inline comments.
clang/lib/Frontend/SARIFDiagnostic.cpp
52–53

Alternatively: we could draw from the c++ standard structure: https://eel.is/c++draft/basic.def.odr#term.odr.use and say the error code for an ODR violation could be clang/basic/def/odr, again I'm unsure how well this meshes with clang's diagnostic model.

The only reliable thing there are stable clause names like [namespace.udecl], because there are precedents of them being rearranged in the table of contents (1, 2). We've got bitten by that in C++ conformance tests, which use table of contents for directory hierarchy.

cjdb added inline comments.Apr 28 2023, 11:45 AM
clang/lib/Frontend/SARIFDiagnostic.cpp
52–53

WDYT about something like clang/visibility/err_access or clang/syntax/err_stmtexpr_file_scope.

I think this might work!

Alternatively: we could draw from the c++ standard structure: https://eel.is/c++draft/basic.def.odr#term.odr.use and say the error code for an ODR violation could be clang/basic/def/odr, again I'm unsure how well this meshes with clang's diagnostic model.

The only reliable thing there are stable clause names like [namespace.udecl], because there are precedents of them being rearranged in the table of contents (1, 2). We've got bitten by that in C++ conformance tests, which use table of contents for directory hierarchy.

Right. Given that section names aren't actually stable, relying on those would be brittle at best.

My primary concern with this is that we're calling these stable IDs -- they're not really stable IDs, they're internal implementation details that we change from time to time. So what happens when we rename ext_foo_bar into warn_foo_bar as far as SARIF is concerned? Does it matter, or are we setting some tools up for problems because they'll want to map between the old name and the new name if they're the "same diagnostic" as before?