This is an archive of the discontinued LLVM Phabricator instance.

Rework machine creation strategy
Closed, Public

Authored by MatzeB on Jul 18 2017, 6:30 PM.

Details

Summary

Currently, when a run is submitted and its machine data does not match
the previous data, LNT creates a new machine with the same name (but a
different id). This was often very confusing to users.

This changes the strategy to reject the submission if the data does not
match the previous data unless the --update-machine flag (or the
update_machine post parameters, etc.) is set in which case the new data
overrides the previous machine data.

Adding previously unset keys to the machine will not lead to a rejection
either way. Leaving out previously set keys is fine (but will not remove
the keys).

This new strategy will result in machine names being unique (except for
the case of older entries in the database before this change).
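
As a rough illustration of the rules above, here is a minimal Python sketch. It is not the code from this patch: `reconcile_machine`, `MachineMismatch`, and the plain-dict representation of machine fields are hypothetical names chosen for the example, and the `update_machine` argument merely stands in for the --update-machine flag.

```python
class MachineMismatch(Exception):
    """Raised when submitted machine data conflicts with the stored data."""


def reconcile_machine(stored, submitted, update_machine=False):
    """Return the machine fields to keep after a submission.

    Hypothetical helper for illustration only, not LNT's implementation.
    `stored` and `submitted` are dicts mapping field names to values.
    """
    # A conflict is a key present on both sides with different values.
    conflicts = sorted(
        key for key, value in submitted.items()
        if key in stored and stored[key] != value
    )
    if conflicts and not update_machine:
        raise MachineMismatch("fields differ: " + ", ".join(conflicts))

    merged = dict(stored)     # keys left out of the submission are kept
    merged.update(submitted)  # new keys are added; conflicts are overridden
    return merged
```

With stored fields `{"os": "linux-4.4"}`, submitting `{"pages": "64k"}` would be accepted and add the new key, while submitting `{"os": "linux-4.13"}` would raise unless `update_machine=True` is passed.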

Diff Detail

Repository
rL LLVM

Event Timeline

MatzeB created this revision. Jul 18 2017, 6:30 PM
grosser edited edge metadata. Jul 24 2017, 11:50 PM

Hi Matze,

I did not check the implementation in detail, but this makes total sense to me. From my perspective this is a clear improvement and should go in.

This revision was automatically updated to reflect the committed changes.

A more useful flag for our use would restore the previous behavior, rather than always update the machine. We have a lot of historical data, crossing a number of kernel versions and other machine characteristics to import to LNT. The old behavior was very convenient for this data set - we automatically ended up with new machines after each system configuration change (actually, this automatic disambiguation of machines with variations was a key benefit of LNT for us, and caught a number of bugs in our test infrastructure). We don't want to always update the machine, that would "poison" the quality of the historical data.

Inventing unique names for each of the variants would be difficult but possible. It feels like it somewhat defeats the point of the machine field: I would be re-encoding the same information into a machine name (for example, something like gcc7-cortex-a57-ubuntu-14.04-linux-4.13-64k-pages). That makes all other interactions with the system (e.g. choosing runs for comparison) very cumbersome.

I agree that the old behavior could be confusing, but I don't really know how to sensibly interact with the new design in a way that preserves data quality without needing an explosion in naming complexity. For me, this is not a clear improvement.

Hi James,

this is very interesting to hear as I would not have expected the previous behavior to be desirable. Just to explain some more where I am coming from:

  • LNT submissions are typically performed by CI jobs, which for us are required to have a unique name, so it is natural to use the CI job's name as the LNT machine name.
  • When selecting a machine in LNT, the only thing to go on is the machine name. If, for example, I have 7 different machines named "gcc7" (with some of the other fields differing), I would need to click 7 times today to figure out which machine is the one I want.
  • Similarly, when connecting 3rd party visualisation/analysis to LNT, it is convenient to have unique machine names that you can reference. Machine id numbers are only valid within one LNT database, and are also not predictable.
  • Looking at lnt runtest test-suite mode, nobody even bothered filling out the machine fields, and to my knowledge nobody has complained to this day.

To your points:

  • Note that submissions are rejected if the machine data doesn't match the previous data, so bugs are caught and incompatible/incomparable data is avoided. (The --update-machine flag is not intended for the regular CI job, but rather to be used manually after updating a machine in a way that changes the data but is believed not to change performance, i.e. the data stays comparable.)
  • If, after changing a machine, the new data is not comparable to the historical data, I would expect the user to choose a new machine name (which is also nice, as it makes the changed configuration more obvious).
  • I also created the lnt admin subcommands to rename, merge, and delete machines, which allows cleanup/reorganisation of the data.

So I am not completely convinced the automatic machine name creation is a desirable behavior. I can see the convenience of machines getting created automatically at the cost of the machine names becoming less meaningful.

Having said all that, I'd be fine with adding a flag supporting a variation of the previous behavior, where we create a new machine if the machine data doesn't match (however, I'd slightly change the behavior to append a number to the new machine's name, to maintain the property that machine names are unique). Would that be fine with you?
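
To make the proposal concrete, a minimal Python sketch of the naming scheme follows. Everything in it is an assumption for illustration: the function name `pick_machine_name`, the `name-N` suffix format, the in-memory dict standing in for the database, and the use of exact equality to decide whether machine data matches.

```python
def pick_machine_name(machines, name, fields):
    """Return the unique machine name a submission should be filed under.

    Hypothetical helper for illustration only, not LNT's implementation.
    `machines` maps each existing unique machine name to its fields dict.
    """
    candidate = name
    suffix = 1
    while candidate in machines:
        if machines[candidate] == fields:
            return candidate  # an existing machine already matches; reuse it
        suffix += 1
        candidate = "%s-%d" % (name, suffix)  # try name-2, name-3, ...
    return candidate  # first unused name; the caller creates the machine
```

Starting from a single machine "gcc7", a submission with different fields would be filed under "gcc7-2", so every name still refers to exactly one machine.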

Hi James,

Hi,

So I am not completely convinced the automatic machine name creation is a desirable behavior. I can see the convenience of machines getting created automatically at the cost of the machine names becoming less meaningful.

Having said all that, I'd be fine with adding a flag supporting a variation of the previous behavior, where we create a new machine if the machine data doesn't match (however, I'd slightly change the behavior to append a number to the new machine's name, to maintain the property that machine names are unique). Would that be fine with you?

Before I go into more detail about how we're testing (for background, and your interest): that sounds like a very helpful solution, thanks!

this is very interesting to hear as I would not have expected the previous behavior to be desirable. Just to explain some more where I am coming from:

  • LNT submissions are typically performed by CI jobs, which for us are required to have a unique name, so it is natural to use the CI job's name as the LNT machine name.

This is also true for us; however, we are building nightly using 30 groupings of "machines" (which are themselves pools of real machines driven by buildbot), with historical data going back 4 years. Each of these groupings uses a name automatically derived from the key aspects of its hardware and the release branch it tracks, and we're diligent about recording more subtle machine differences in the "machine" field.

  • When selecting a machine in LNT the only thing to go on is the machine name. If for example I have 7 different machines named "gcc7" (with some of the other fields differing), I would need to click 7 times today, to figure out which is the machine that I want.

That is probably where the difference in perspectives comes from. We would have 7 machines producing results, which we expect to have identical configuration, and which we would group under the name "gcc7.$target_board.$target_cpu". In normal use, we would not expect 7 machines named "gcc7" to produce a result in one night; we would expect there to be one "active" gcc7 at a time (measured in months), and so choosing the right machine would be a matter of picking the one which has been building most recently. Occasionally, due to sysadmin/user error, one of the machines in the pool might malfunction and end up in an inappropriate configuration. As an example from today, one rogue machine in the pool was accidentally patched up to a newer kernel version. When we import a run from that machine, the old LNT behaviour would create a separate machine, ensuring that data integrity was maintained with the new machine isolated (which it wouldn't be if we forced --update-machine), while we still had data in the system that we could compare (which we wouldn't get with the new error behaviour). This becomes important to us when importing historical data, as we really do want a new machine every time the configuration changes, but we don't want to have to encode that in the machine name. Put another way, when we migrate a grouping of boards to a more recent kernel version, we want that change to make the data sets disjoint, but without us having to invent a new name for the machine pool.

Automatically appending a number to the machine name would therefore work well for our use case.

  • Similarly, when connecting 3rd party visualisation/analysis to LNT, it is convenient to have unique machine names that you can reference. Machine id numbers are only valid within one LNT database, and are also not predictable.

We don't do this, but I can see why this would be useful to you.

  • Looking at lnt runtest test-suite mode, nobody even bothered filling out the machine fields and to my knowledge nobody complained to this day.

We're a somewhat unusual consumer of LNT, in that we have a completely separate infrastructure for running and recording test results; from it we generate JSON that is suitable for import into LNT for visualisation. We make heavy use of the machine fields.

I'm very grateful for your help in resolving this; I appreciate we're running a non-standard configuration over here!

James

FYI: I'm still working on a fix for this, but I am currently occupied with other things. I hope to have something next week.