reiterative | I agree that there is no difference between these different types of testing process at one level - in the sense that they may all be reduced to an act of deployment to a test environment and evaluation against a set of constraints - but I am convinced that the value of tests increases the more closely the test environment corresponds to the target environment. | 09:04 |
---|---|---|
reiterative | But this is a slippery concept, because it depends on (a) what you are testing, (b) what you define as the target environment and (c) the nature of the associated constraints. | 09:05 |
reiterative | When verifying the behaviour of an individual component, for example, you *might* argue that it doesn't matter where you instantiate it, as long as you use the same interface that would be used in its target environment. But that would only hold true for a certain set of constraints - and might deliberately *ignore* other constraints that might pertain to the target environment itself. | 09:14
* paulsherwood currently thinks we could just go with 'tests that are concerned with behaviour of the software in general, not specific to an environment', and 'tests that are concerned with the software's behaviour/properties in a specific environment' | 09:21 | |
paulsherwood | i may be wrong of course | 09:21 |
reiterative | In my opinion, that is an axis rather than a binary distinction | 09:22 |
paulsherwood | but this dichotomy would align with the distinction between 'software properties' and 'system properties' | 09:22 |
paulsherwood | i agree it's not binary | 09:22 |
reiterative | Software is frequently systems within systems within systems | 09:23 |
*** sambishop has joined #trustable | 09:23 | |
reiterative | The increased value of testing in close proximity to the target environment is actually about the range of constraints that may be considered. | 09:24 |
* paulsherwood is expressly shorthanding 'software' to mean source, bits+bytes, 'systems' to mean hardware running software | 09:24 | |
paulsherwood | again i may be wrong | 09:25 |
reiterative | OK, I will rephrase | 09:25 |
reiterative | It's a balancing act between testing software in its final (and complete) form in its final (and complete) environment versus testing progressively smaller parts of that software in an environment progressively removed from the final target. | 09:26
reiterative | There's more value at one end of the axis because you can apply all the relevant constraints | 09:27 |
reiterative | But there's more value at the other end because it costs you much less to remedy the defects that you find | 09:27
paulsherwood | so you're saying there's more value at both ends than in the middle? | 09:28 |
reiterative | No, I'm saying that the value of testing at any point on the axis needs to be set against the cost of undertaking it | 09:29 |
reiterative | There will be sweet spots, but they are not always easy to identify | 09:30 |
reiterative | Unit testing may be immensely valuable in one context, and a complete waste of effort in another | 09:32
reiterative | But if you can perform meaningful testing early, then it will help to reduce the number of defects that you find in later testing | 09:34 |
reiterative | But the main issue with this is the availability of properly defined constraints in the earlier stages of development | 09:36 |
* paulsherwood previously sketched this https://imgur.com/jcZCH9G on various whiteboards | 09:36 | |
paulsherwood | not just for testing... | 09:36 |
reiterative | :-) | 09:36 |
paulsherwood | in many cases there's a cost curve where doing too little or too much are both suboptimal | 09:37 |
reiterative | Agreed. And with testing, it's not enough to consider the cost of executing the tests - you have to factor in the cost of fixing the defects - including defects that may be just 'noise' | 09:38 |
persia | Not just classic cost, but also costing in terms of response time, context loss, etc. | 09:39 |
* reiterative nods | 09:39 | |
persia | In terms of calendric delivery, if a potential defect can be scheduled for remediation within a short time of candidate proposal, the relevant folk are likely to respond "oh, right" and be able to immediately address it. The longer the delay, the more (calendar) time required to regain sufficient context, the longer the project runs as a whole. | 09:40 |
reiterative | Yes. And if a defect is discovered late in a project, the people who implemented the relevant bit of code might not even be available any more. | 09:42 |
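A minimal sketch of the cost model implied by the exchange above: the effective cost of a test stage combines execution cost, the cost of fixing the defects it surfaces, the extra cost incurred when remediation is delayed and context is lost, and the triage cost of 'noise' failures. All names, weights and the linear context-loss factor are illustrative assumptions, not something agreed in the discussion.

```python
# Sketch of the testing cost model discussed above. Names and weights are
# illustrative assumptions, not an agreed model.

def stage_cost(execution_cost: float,
               defects_found: int,
               base_fix_cost: float,
               days_until_fix: float,
               context_loss_per_day: float = 0.1,
               noise_ratio: float = 0.0) -> float:
    """Estimated cost of running one test stage and remediating its findings.

    noise_ratio is the fraction of FAIL results that turn out not to be real
    defects but still consume triage effort.
    """
    # Fixing a defect gets more expensive the longer remediation is delayed,
    # because the people involved must regain context (or are no longer available).
    fix_cost = base_fix_cost * (1 + context_loss_per_day * days_until_fix)
    # Noise still costs triage effort even though nothing needs fixing;
    # assume triage is some fraction of a real fix.
    triage_cost = noise_ratio * defects_found * 0.25 * base_fix_cost
    return execution_cost + defects_found * fix_cost + triage_cost

# Example: the same defects are far cheaper to address when found a day after
# introduction than when found months later, late in the project.
early = stage_cost(execution_cost=5, defects_found=3, base_fix_cost=2, days_until_fix=1)
late = stage_cost(execution_cost=50, defects_found=3, base_fix_cost=2, days_until_fix=90)
print(early, late)
```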
persia | But, if I'm summarising the above correctly, we consider there to be only one class of test/validation/etc., and there are likely to be metrics associated with any given stage in a process related to a) the signal/noise ratio of FAIL results, b) the time between scheduling and response, and c) the costings associated with preparation of the execution environment. | 09:42 |
reiterative | My conclusion is that *all* types of testing are potentially valuable. | 09:42 |
persia | Organisations benefit the most where (a) is relatively small, and (b)/(c) is either relatively static as one progresses a pipeline in a process or slowly increases as the pipeline progresses. | 09:43 |
reiterative | and (d) the completeness of the constraints covered | 09:43 |
persia | I consider (d) to be a different class of property than (a), (b), (c). (a),(b),(c) can be usefully measured for a single result to be evidenced by a single vote. (d) can only be measured in terms of a system totality. | 09:44 |
reiterative | Agreed | 09:45 |
persia | Note that I do agree it is important to measure completeness, both in terms of whether all constraints are satisfied under all reasonable conditions and in terms of how much of the total functionality existing within the system is exercised by the procedure of validation. I just think they are different. | 09:46
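One way to picture the distinction persia draws: (a), (b) and (c) attach to a single stage and can be evidenced by a single result, while (d), constraint coverage, only makes sense over the process as a whole. The structure below is a hypothetical sketch, not an agreed schema; all field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class StageMetrics:
    """Per-stage properties (a), (b), (c): measurable for a single vote/result."""
    signal_noise_ratio: float     # (a) real defects per FAIL result
    response_latency_days: float  # (b) time between scheduling and response
    env_preparation_cost: float   # (c) cost of preparing the execution environment

def constraint_coverage(constraints_checked: set[str],
                        constraints_total: set[str]) -> float:
    """(d) completeness: a property of the whole process, not of one stage."""
    if not constraints_total:
        return 0.0
    return len(constraints_checked & constraints_total) / len(constraints_total)

# Example: coverage is computed over the union of what every stage checks.
stages = {
    "unit":        ({"interface", "logic"}, StageMetrics(0.9, 0.5, 1.0)),
    "integration": ({"interface", "timing"}, StageMetrics(0.6, 3.0, 20.0)),
}
checked = set().union(*(constraints for constraints, _ in stages.values()))
print(constraint_coverage(checked, {"interface", "logic", "timing", "safety"}))
```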
reiterative | The question is how do you measure / account for that when comparing the testing strategies used by two candidate trustable processes | 09:47 |
reiterative | There are answers, but they involve a lot of overhead in collecting metrics | 09:48 |
reiterative | So I distrust them | 09:48 |
persia | For (a), (b), (c), one presumably wants to create some (f) that represents a collective over the entire process. | 09:49 |
persia | For (d), (e), (f), one generates an abstract quantitative metric, and then compares them. This permits tradeoffs. | 09:49
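A hedged sketch of what a collective (f) over the per-stage metrics might look like, so that two processes can be compared numerically and tradeoffs made explicit. The stage representation and the linear weighting scheme are entirely illustrative.

```python
def collective_score(stages: list[dict],
                     w_signal: float = 1.0,
                     w_latency: float = 0.5,
                     w_cost: float = 0.1) -> float:
    """Collapse per-stage (a), (b), (c) into one number for a whole process.

    Each stage is a dict with keys "signal_noise", "latency_days", "env_cost".
    Higher is better: signal is rewarded, latency and cost are penalised.
    The weights are placeholders, not a recommendation.
    """
    return sum(w_signal * s["signal_noise"]
               - w_latency * s["latency_days"]
               - w_cost * s["env_cost"]
               for s in stages)

# Comparing two candidate processes then reduces to comparing their scores,
# alongside the separately measured constraint coverage (d).
process_a = [{"signal_noise": 0.9, "latency_days": 0.5, "env_cost": 1.0},
             {"signal_noise": 0.6, "latency_days": 3.0, "env_cost": 20.0}]
process_b = [{"signal_noise": 0.4, "latency_days": 0.2, "env_cost": 0.5}]
print(collective_score(process_a), collective_score(process_b))
```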
persia | In practice, the vast majority of comparisons are going to be performed within an organisation, so one can probably reuse most of the data and/or execution units when performing the comparison. | 09:50 |
persia | In an abstract "Is this trustable" way, it doesn't matter as much, as that bar will be set externally, and is likely to be based on provided collateral about processes and arguments for compliance of a given process, rather than in terms of comparison of two processes. | 09:51
reiterative | Are we saying that the details of a process are irrelevant, so long as we have evidence that it has been applied? | 09:57 |
reiterative | I'd want evidence for the effectiveness of a process if I was deciding whether it was trustworthy | 09:58 |
reiterative | (but I appreciate that's not the same as trustable) | 09:59 |
persia | I think trustability and effectiveness are independent. I expect to be able to create an untrustable effective process or a trustable ineffective process. | 09:59
persia | But, yes, I assert that the details of process are unimportant as long as there exists sufficient collateral to cause a meaningful chain of argument between the base expectations of "trustable" and whatever process is being evaluated. | 10:00 |
reiterative | I'm not convinced that is enough | 10:01 |
persia | If an arbitrary process can answer questions about provenance, construction, reproducibility, functionality/reliability, consistency between system and intent, ability to update, and safety, I don't see any reason not to call it "trustable". | 10:02 |
reiterative | But some of those characteristics require an evaluation of the process - functionality/reliability most obviously. | 10:03 |
persia | Mind you, I might disagree with a given argument, and so might not personally wish to call some process for which I thought the argument was weak by that term, but that's about each of our own sense of logic and our ability to argue cases. | 10:03 |
persia | In practice, any process will be evaluated against some set of stipulations (e.g. "it does what it is supposed to do"). The collateral produced during this evaluation can be considered in terms of whether it provides sufficient justification for a claim about that stipulation. So long as there is a supportable claim of conformance to each stipulation, how isn't the process "trustable"? | 10:06 |
persia | Mind you, it may be that the set of stipulations will end up being revised, but that is independent of the evaluation of processes. | 10:06 |
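Read literally, the argument above suggests a simple check: a process is "trustable" (in this narrow sense) if, for every stipulation, there is a claim of conformance backed by collateral. A minimal sketch with hypothetical names, which deliberately says nothing about the strength of each argument:

```python
def is_trustable(stipulations: list[str],
                 claims: dict[str, list[str]]) -> bool:
    """claims maps each stipulation to the collateral supporting its claim of conformance.

    A process passes only if every stipulation has at least one piece of
    supporting collateral. Judging how convincing that collateral is remains
    the part individual reviewers may still disagree about.
    """
    return all(claims.get(stipulation) for stipulation in stipulations)

stipulations = ["provenance", "construction", "reproducibility",
                "functionality/reliability", "consistency with intent",
                "ability to update", "safety"]
claims = {s: ["evidence-record"] for s in stipulations}
print(is_trustable(stipulations, claims))  # True
```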
reiterative | I guess that would depend on what a process is claiming to achieve | 10:06 |
reiterative | And whether that holds up to scrutiny | 10:06 |
persia | Right, and we can only judge a process against the claims. | 10:07 |
persia | Now, if a process claims to do a variety of particularly interesting things (e.g. never link any non-GPL application against a GPL library), there needs to be support for those claims, which can be evaluated in terms of frameworks. | 10:07 |
persia | And I suspect that constructing metrics to allow one to appreciate confidence in evidence quantitatively will massively improve the ability of various parties to evaluate such claims. | 10:08 |
reiterative | Agreed. So the available evidence should support the claims that are made for a process? | 10:08 |
persia | But it is important not to be distracted by the potential universe of claims to be supported to ensure the metrics for evaluating argument are sufficiently general. | 10:09 |
persia | That restatement feels like it might miss something, but basically, yes. The potential missing bit is the process of argument. While it is true that if parallel lines never intersect, the sum of the inner angles of a triangle will be 180˚, demonstrating this from the evidence available requires presentation of additional collateral information (the proof) | 10:10 |
reiterative | Yes, we probably don't want to get into the proof, but I think we do want to define some minimum characteristics for our trustable principles (provenance, construction, reproducibility, etc) and factor the availability of supporting evidence for these into any trustability metrics that we may devise. | 10:14 |
persia | Right, and it becomes the responsibility of someone claiming a given process is "trustable" to provide sufficient collateral (argument/proof) to satisfy anyone to whom they wish to make such a claim. | 10:23 |
*** sambishop has quit IRC | 11:10 | |
*** sambishop has joined #trustable | 11:13 | |
*** traveltissues has joined #trustable | 11:20 | |
reiterative | I would be tempted to go further. As I've just said in another conversation: | 11:27 |
reiterative | In my opinion, it suggests that we need to distinguish between two (related) types of evidence: | 11:28 |
reiterative | (1) Evidence that a policy exists and has been applied | 11:28 |
reiterative | (2) Evidence that the application of the policy enhances trustability | 11:28 |
reiterative | We have been focussing on (1) thus far, but I think it is important not to lose sight of (2), since (1) is arguably meaningless without it. | 11:28 |
reiterative | We have already identified a set of factors that we believe must be considered when assessing the trustability of software: its provenance, construction, reproducibility, clarity of purpose, reliability, resilience and safety. We have also made some attempt to examine 'what good looks like' in some of these areas. | 11:28 |
reiterative | I believe that the next challenge should be to express these ideas as a set of 'trustability intents' (or constraints, if possible at this stage) that can be used to evaluate a set of available evidence. | 11:28 |
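To make the two types of evidence concrete, a hypothetical sketch of a 'trustability intent' that records both evidence that a policy exists and was applied (1) and evidence that applying it enhanced trustability (2). The class and field names are illustrative, not a proposed schema.

```python
from dataclasses import dataclass, field

@dataclass
class TrustabilityIntent:
    """A trustable factor expressed as something available evidence can be evaluated against."""
    name: str                                                   # e.g. "provenance", "reproducibility"
    applied_evidence: list[str] = field(default_factory=list)   # (1) policy exists and has been applied
    outcome_evidence: list[str] = field(default_factory=list)   # (2) applying it enhanced trustability

    def status(self) -> str:
        if self.applied_evidence and self.outcome_evidence:
            return "supported"
        if self.applied_evidence:
            return "applied, outcome unproven"
        return "unsupported"

intents = [TrustabilityIntent("provenance", ["signed commits"], []),
           TrustabilityIntent("reproducibility", ["rebuild log"], ["rebuild matched release bit-for-bit"])]
for intent in intents:
    print(intent.name, "->", intent.status())
```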
reiterative | Perhaps a way for us to determine the overall goal for our 'trustability intents' would be to consider how they contribute to the identification and management of risk as part of a software engineering process? | 11:37 |
*** sambishop has quit IRC | 12:46 | |
*** sambishop has joined #trustable | 13:00 | |
reiterative | I've pushed a new commit to pa-nomenclature, incorporating Edmund's review feedback into core-concepts.md | 14:12 |
* persia sets aside time for rebuttal | 14:13 | |
* reiterative expected that | 14:13 | |
persia | On (1) vs (2), that matches what I have been calling comparatives. | 14:13 |
persia | I am uncertain if we can make an argument that a given process enhances trustability, but I am confident that if we have a standardised mechanism to describe two processes, we can probably suggest which of them is able to provide greater confidence in the validity of a specific claim. | 14:15
persia | If these evaluations are quantitative, it ought be possible to suggest weighted models for a collective score for a set of claims (such as those listed for trustable), which I suspect is close to what you describe. | 14:16 |
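A hedged sketch of the weighted model persia describes: each trustable claim gets a quantitative confidence, and a weighted combination yields a collective score that allows two processes to be compared. The claim names, weights and confidence values are placeholders.

```python
def collective_claim_score(confidences: dict[str, float],
                           weights: dict[str, float]) -> float:
    """Weighted aggregate of per-claim confidence values in [0, 1].

    confidences: confidence in the evidence for each claim (e.g. "provenance": 0.8)
    weights: relative importance of each claim; missing claims count as zero confidence.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    return sum(weights[c] * confidences.get(c, 0.0) for c in weights) / total_weight

process_a = {"provenance": 0.9, "construction": 0.7, "reproducibility": 0.4, "safety": 0.6}
process_b = {"provenance": 0.6, "construction": 0.8, "reproducibility": 0.9, "safety": 0.5}
weights = {"provenance": 2, "construction": 1, "reproducibility": 1, "safety": 3}
print(collective_claim_score(process_a, weights), collective_claim_score(process_b, weights))
```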
reiterative | Yes, but I think we need to define a set of qualitative metrics (i.e. establish what constitutes evidence of good practice) that can be used as part of a quantitative evaluation. | 14:20 |
persia | I believe one can go from quantitative to qualitative, where one can assign value to a metric. | 14:24 |
persia | I do not know of any way to go in the other direction. | 14:24 |
persia | As such, I think it important to capture evidence of practice and evidence of result, allowing one to make assertions about “good”. | 14:25 |
persia | For example, if it is interesting to assert that “test before release” is “good”, it makes sense to show that the number of defects experienced by release consumers differs. If there is no difference in the experience of release customers, then there is no reason to suggest prerelease testing is “good”. | 14:27 |
persia | If someone believes that it is not worth the experiment, that is fine, but in such a case, the argument rests on the assumption, and it becomes useful to document and attribute that assumption. | 14:29 |
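The "test before release" example can be made concrete with a simple comparison of the defect rates experienced by consumers of releases produced with and without prerelease testing. This is a sketch of one possible analysis (a plain rate comparison with invented numbers), not a prescribed experimental design.

```python
def defect_rate(defects_reported: int, releases: int) -> float:
    """Defects experienced by release consumers, per release."""
    return defects_reported / releases if releases else 0.0

# Hypothetical numbers: releases that went through prerelease testing vs those that did not.
tested_rate = defect_rate(defects_reported=12, releases=10)
untested_rate = defect_rate(defects_reported=45, releases=10)

# If the two rates are indistinguishable, there is no evidence here that
# prerelease testing is "good"; if they differ, the difference (and its size)
# is the evidence of result that supports the practice.
print(f"tested: {tested_rate:.1f}/release, untested: {untested_rate:.1f}/release")
```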
reiterative | If you *can* use evidence of positive outcome to distinguish what constitutes good practice in this way, then I agree that it's a good approach. I'm just not convinced that it will always be possible - and I think we will need to base initial evaluations on more subjective value judgements about what constitutes good practice. | 14:31 |
persia | (On qualitative->quantitative, it is also important to understand impact: do the effects of “good” scale exponentially, linearly, logarithmically, or differently to, e.g. lines of source code?) | 14:34 |
persia | I think we may be saying more similar things than might be apparent. I believe that early evaluations of processes will be based on an arbitrary unproven set of assumptions. I just very strongly believe that while performing such assessments depends on an agreed language for assessment, the language for assessment ought be unaffected by the assumptions of early assessments (a one-way dependency). As such, I find it unuseful to attempt to discuss | 14:42 |
persia | particular assessment criteria or expected assumptions about best practice until there is a common semantic mapping to use for such discussion. | 14:42 |
*** sambishop has quit IRC | 17:16 | |
*** traveltissues has quit IRC | 20:21 |