This is an archive of the discontinued LLVM Phabricator instance.

Fix missing/mismatched html tags
ClosedPublic

Authored by MatzeB on Jun 23 2017, 4:45 PM.

Details

Summary
  • Fix various missing end tags and mismatched tags
  • Also add closing slash to various empty tags (<br> => <br/>, <input> => <input/>...)
  • Replace some HTML entities with the actual UTF-8 characters; this looks nicer and makes it easier to run xmllint on the output.
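
As a small illustration of the last point: an XML parser only knows the five predefined XML entities, so an HTML named entity such as &nbsp; is rejected unless a DTD declares it, while the literal UTF-8 character is accepted. The sketch below uses Python's standard xml.etree.ElementTree rather than xmllint; the sample strings and the helper name parses_as_xml are made up for illustration and are not taken from the LNT sources.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragments, not taken from LNT.
with_entity = "<p>LNT&nbsp;report</p>"     # HTML named entity
with_utf8 = "<p>LNT\u00a0report</p>"       # same character, written literally


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(with_entity))
print(parses_as_xml(with_utf8))
```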

Diff Detail

Repository
rL LLVM

Event Timeline

MatzeB created this revision. Jun 23 2017, 4:45 PM
cmatthews accepted this revision. Jun 26 2017, 8:00 AM

It would be wonderful if we could programmatically check these during testing. Thanks Matthias!

This revision is now accepted and ready to land. Jun 26 2017, 8:00 AM
kristof.beyls edited edge metadata. Jun 26 2017, 8:10 AM

I had to make the tags in the daily report page XML-compliant so that its content could be tested in tests/server/ui/V4Pages.py using Python's built-in XML parser (see the functions check_nr_machines_reported and get_xml_tree).
So, in other words, I think checking this programmatically would probably be easy: it would boil down to parsing every HTML page with Python's built-in XML parser (see the function get_xml_tree mentioned above).
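
A minimal sketch of what such a check could look like, assuming the pages are already available as strings. The function name find_malformed_pages and the sample pages are hypothetical; this is not the code in V4Pages.py, only an illustration of parsing each page with xml.etree.ElementTree and collecting the parse errors.

```python
import xml.etree.ElementTree as ET


def find_malformed_pages(pages):
    """Map each page name to its XML parse error message.

    `pages` is a dict of {name: page text}. Pages that parse cleanly
    are left out of the result. (Hypothetical helper, not part of LNT.)
    """
    errors = {}
    for name, text in pages.items():
        try:
            ET.fromstring(text)
        except ET.ParseError as err:
            errors[name] = str(err)
    return errors


# Made-up sample pages: the second one has an unclosed <br>.
sample_pages = {
    "daily_report": "<html><body><p>1 machine reported<br/></p></body></html>",
    "order": "<html><body><p>1 machine reported<br></p></body></html>",
}

for name, message in find_malformed_pages(sample_pages).items():
    print(name, "->", message)
```

A test could then simply assert that the returned dict is empty for all pages it fetches.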

This revision was automatically updated to reflect the committed changes.

Adding the tests would be easy (you could easily tweak the V4Pages script to check not just the presence of pages but also their well-formedness). Unfortunately, I had to give up on the approach of validating XML for now:
The Flask WTForms stuff only outputs HTML5 and has no option to output XHTML (at least none that is accessible without a lot of hackery, according to my web searches), so it produces <input> elements that aren't properly closed according to XML rules.

For the record: I just added an optional integration with pytidylib/tidy-html5 that checks LNT pages for HTML problems (r312061). It can be used with lnt -Dtidylib=1.

Very nice! Thanks for all the cleanups and improvements you've been making to LNT lately!

I guess there's a chance tidylib might be better than just aiming to parse XHTML as proper XML, as it may go beyond what an XML validator is capable of, by being written specifically for HTML?

I would have preferred simpler XML validation too; in fact, that is what I tried first. In the end that approach failed because the WTForms library that we use only outputs HTML5 and has no way to produce XHTML, which results in unclosed <input> tags without a good way to fix them. So various pages with forms on them will fail XML validation right now.

So the best thing I could find was the tidylib/tidy-html5 combination.
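
To illustrate why an HTML-aware checker copes with this where an XML parser does not, here is a rough sketch using only Python's standard html.parser module. It is not tidylib and not what the LNT integration does; TagBalanceChecker and find_tag_problems are hypothetical names, and the check is far cruder than tidy-html5 (for example, it ignores HTML5's optional end tags). The point is only that a checker which knows the HTML5 void elements can accept an unclosed <input> while still flagging mismatched tags.

```python
from html.parser import HTMLParser

# HTML5 void elements: these never take an end tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records mismatched or unclosed non-void tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open non-void tags
        self.problems = []    # human-readable problem descriptions

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # e.g. the implicit end of <input/>
        if not self.open_tags or self.open_tags[-1] != tag:
            self.problems.append("unexpected </%s>" % tag)
        else:
            self.open_tags.pop()


def find_tag_problems(page_text):
    """Return a list of tag problems found in page_text (empty if none)."""
    checker = TagBalanceChecker()
    checker.feed(page_text)
    checker.close()
    return checker.problems + ["unclosed <%s>" % t for t in checker.open_tags]


# An unclosed <input> is fine in HTML5; a mismatched <b> is not.
print(find_tag_problems('<form><input type="text"></form>'))
print(find_tag_problems("<p><b>bold</p>"))
```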