reiterative | I agree that there is no difference between these different types of testing process at one level - in the sense that they may all be reduced to an act of deployment to a test environment and evaluation against a set of constraints - but I am convinced that the value of tests increases the more closely the test environment corresponds to the target environment. | 09:04 |
---|---|---|
reiterative | But this is a slippery concept, because it depends on (a) what you are testing, (b) what you define as the target environment and (c) the nature of the associated constraints. | 09:05 |
reiterative | When verifying the behaviour of an individual component, for example, you *might* argue that it doesn't matter where you instantiate it, as long as you use the same interface that would be used in its target environment. But that would only hold true for a certain set of constraints - and might deliberately *ignore* other constraints that might pertain to the target environment itself. | 09:14
* paulsherwood currently thinks we could just go with 'tests that are concerned with behaviour of the software in general, not specific to an environment', and 'tests that are concerned with the software's behaviour/properties in a specific environment' | 09:21 | |
paulsherwood | i may be wrong of course | 09:21 |
reiterative | In my opinion, that is an axis rather than a binary distinction | 09:22 |
paulsherwood | but this dichotomy would align with the distinction between 'software properties' and 'system properties' | 09:22 |
paulsherwood | i agree it's not binary | 09:22 |
reiterative | Software is frequently systems within systems within systems | 09:23 |
*** sambishop has joined #trustable | 09:23 | |
reiterative | The increased value of testing in close proximity to the target environment is actually about the range of constraints that may be considered. | 09:24 |
* paulsherwood is expressly shorthanding 'software' to mean source, bits+bytes, 'systems' to mean hardware running software | 09:24 | |
paulsherwood | again i may be wrong | 09:25 |
reiterative | OK, I will rephrase | 09:25 |
reiterative | It's a balancing act between testing software in its final (and complete) form in its final (and complete) environment versus testing progressively smaller parts of that software in an environment progressively removed from the final target. | 09:26
reiterative | There's more value at one end of the axis because you can apply all the relevant constraints | 09:27 |
reiterative | But there's more value at the other end because it costs you much less to remedy the defects that you find | 09:27
paulsherwood | so you're saying there's more value at both ends than in the middle? | 09:28 |
reiterative | No, I'm saying that the value of testing at any point on the axis needs to be set against the cost of undertaking it | 09:29 |
reiterative | There will be sweet spots, but they are not always easy to identify | 09:30 |
reiterative | Unit testing may be immensely valuable in one context, and a complete waste of effort in another | 09:32
reiterative | But if you can perform meaningful testing early, then it will help to reduce the number of defects that you find in later testing | 09:34 |
reiterative | But the main issue with this is the availability of properly defined constraints in the earlier stages of development | 09:36 |
* paulsherwood previously sketched this https://imgur.com/jcZCH9G on various whiteboards | 09:36 | |
paulsherwood | not just for testing... | 09:36 |
reiterative | :-) | 09:36 |
paulsherwood | in many cases there's a cost curve where doing too little or too much are both suboptimal | 09:37 |
reiterative | Agreed. And with testing, it's not enough to consider the cost of executing the tests - you have to factor in the cost of fixing the defects - including defects that may be just 'noise' | 09:38 |
persia | Not just classic cost, but also costing in terms of response time, context loss, etc. | 09:39 |
* reiterative nods | 09:39 | |
persia | In terms of calendric delivery, if a potential defect can be scheduled for remediation within a short time of candidate proposal, the relevant folk are likely to respond "oh, right" and be able to immediately address it. The longer the delay, the more (calendar) time required to regain sufficient context, the longer the project runs as a whole. | 09:40 |
reiterative | Yes. And if a defect is discovered late in a project, the people who implemented the relevant bit of code might not even be available any more. | 09:42 |
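A minimal sketch of the cost model implied by the exchange above: the effective cost of a test stage combines execution cost, the cost of fixing the defects it surfaces, the extra cost incurred when remediation is delayed and context is lost, and the triage cost of 'noise' failures. All names, weights and the linear context-loss factor are illustrative assumptions, not something agreed in the discussion.

```python
# Sketch of the testing cost model discussed above. Names and weights are
# illustrative assumptions, not an agreed model.

def stage_cost(execution_cost: float,
               defects_found: int,
               base_fix_cost: float,
               days_until_fix: float,
               context_loss_per_day: float = 0.1,
               noise_ratio: float = 0.0) -> float:
    """Estimated cost of running one test stage and remediating its findings.

    noise_ratio is the fraction of FAIL results that turn out not to be real
    defects but still consume triage effort.
    """
    # Fixing a defect gets more expensive the longer remediation is delayed,
    # because the people involved must regain context (or are no longer available).
    fix_cost = base_fix_cost * (1 + context_loss_per_day * days_until_fix)
    # Noise still costs triage effort even though nothing needs fixing;
    # assume triage is some fraction of a real fix.
    triage_cost = noise_ratio * defects_found * 0.25 * base_fix_cost
    return execution_cost + defects_found * fix_cost + triage_cost

# Example: the same defects are far cheaper to address when found a day after
# introduction than when found months later, late in the project.
early = stage_cost(execution_cost=5, defects_found=3, base_fix_cost=2, days_until_fix=1)
late = stage_cost(execution_cost=50, defects_found=3, base_fix_cost=2, days_until_fix=90)
print(early, late)
```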
persia | But, if I'm summarising the above correctly, we consider there to be only one class of test/validation/etc., and there are likely to be metrics associated with any given stage in a process related to a) the signal/noise ratio of FAIL results, b) the time between scheduling and response, and c) the costings associated with preparation of the execution environment. | 09:42 |
reiterative | My conclusion is that *all* types of testing are potentially valuable. | 09:42 |
persia | Organisations benefit the most where (a) is relatively small, and (b)/(c) is either relatively static as one progresses a pipeline in a process or slowly increases as the pipeline progresses. | 09:43 |
reiterative | and (d) the completeness of the constraints covered | 09:43 |
persia | I consider (d) to be a different class of property than (a), (b), (c). (a),(b),(c) can be usefully measured for a single result to be evidenced by a single vote. (d) can only be measured in terms of a system totality. | 09:44 |
reiterative | Agreed | 09:45 |
persia | Note that I do agree it is important to measure completeness, both in terms of whether all constraints are satisfied under all reasonable conditions and in terms of how much of the total functionality existing within the system is exercised by the procedure of validation. I just think they are different. | 09:46
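One way to picture the distinction persia draws: (a), (b) and (c) attach to a single stage and can be evidenced by a single result, while (d), constraint coverage, only makes sense over the process as a whole. The structure below is a hypothetical sketch, not an agreed schema; all field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class StageMetrics:
    """Per-stage properties (a), (b), (c): measurable for a single vote/result."""
    signal_noise_ratio: float     # (a) real defects per FAIL result
    response_latency_days: float  # (b) time between scheduling and response
    env_preparation_cost: float   # (c) cost of preparing the execution environment

def constraint_coverage(constraints_checked: set[str],
                        constraints_total: set[str]) -> float:
    """(d) completeness: a property of the whole process, not of one stage."""
    if not constraints_total:
        return 0.0
    return len(constraints_checked & constraints_total) / len(constraints_total)

# Example: coverage is computed over the union of what every stage checks.
stages = {
    "unit":        ({"interface", "logic"}, StageMetrics(0.9, 0.5, 1.0)),
    "integration": ({"interface", "timing"}, StageMetrics(0.6, 3.0, 20.0)),
}
checked = set().union(*(constraints for constraints, _ in stages.values()))
print(constraint_coverage(checked, {"interface", "logic", "timing", "safety"}))
```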
reiterative | The question is how do you measure / account for that when comparing the testing strategies used by two candidate trustable processes | 09:47 |
reiterative | There are answers, but they involve a lot of overhead in collecting metrics | 09:48 |
reiterative | So I distrust them | 09:48 |
persia | For (a), (b), (c), one presumably wants to create some (f) that represents a collective over the entire process. | 09:49 |
persia | For (d), (e), (f), one generates an abstract quantitative metric, and then compares them. This permits tradeoffs. | 09:49
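A hedged sketch of what a collective (f) over the per-stage metrics might look like, so that two processes can be compared numerically and tradeoffs made explicit. The stage representation and the linear weighting scheme are entirely illustrative.

```python
def collective_score(stages: list[dict],
                     w_signal: float = 1.0,
                     w_latency: float = 0.5,
                     w_cost: float = 0.1) -> float:
    """Collapse per-stage (a), (b), (c) into one number for a whole process.

    Each stage is a dict with keys "signal_noise", "latency_days", "env_cost".
    Higher is better: signal is rewarded, latency and cost are penalised.
    The weights are placeholders, not a recommendation.
    """
    return sum(w_signal * s["signal_noise"]
               - w_latency * s["latency_days"]
               - w_cost * s["env_cost"]
               for s in stages)

# Comparing two candidate processes then reduces to comparing their scores,
# alongside the separately measured constraint coverage (d).
process_a = [{"signal_noise": 0.9, "latency_days": 0.5, "env_cost": 1.0},
             {"signal_noise": 0.6, "latency_days": 3.0, "env_cost": 20.0}]
process_b = [{"signal_noise": 0.4, "latency_days": 0.2, "env_cost": 0.5}]
print(collective_score(process_a), collective_score(process_b))
```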
persia | In practice, the vast majority of comparisons are going to be performed within an organisation, so one can probably reuse most of the data and/or execution units when performing the comparison. | 09:50 |
persia | In an abstract "Is this trustable" way, it doesn't matter as much, as that bar will be set externally, and is likely to be based on provided collateral about processes and arguments for compliance of a given process, rather than in terms of comparison of two processes. | 09:51
reiterative | Are we saying that the details of a process are irrelevant, so long as we have evidence that it has been applied? | 09:57 |
reiterative | I'd want evidence for the effectiveness of a process if I was deciding whether it was trustworthy | 09:58 |
reiterative | (but I appreciate that's not the same as trustable) | 09:59 |
persia | I think trustability and effectiveness are independent. I expect to be able to create an untrustable effective process or a trustable ineffective process. | 09:59
persia | But, yes, I assert that the details of process are unimportant as long as there exists sufficient collateral to cause a meaningful chain of argument between the base expectations of "trustable" and whatever process is being evaluated. | 10:00 |
reiterative | I'm not convinced that is enough | 10:01 |
persia | If an arbitrary process can answer questions about provenance, construction, reproducibility, functionality/reliability, consistency between system and intent, ability to update, and safety, I don't see any reason not to call it "trustable". | 10:02 |
reiterative | But some of those characteristics require an evaluation of the process - functionality/reliability most obviously. | 10:03 |
persia | Mind you, I might disagree with a given argument, and so might not personally wish to call some process for which I thought the argument was weak by that term, but that's about each of our own sense of logic and our ability to argue cases. | 10:03 |
persia | In practice, any process will be evaluated against some set of stipulations (e.g. "it does what it is supposed to do"). The collateral produced during this evaluation can be considered in terms of whether it provides sufficient justification for a claim about that stipulation. So long as there is a supportable claim of conformance to each stipulation, how isn't the process "trustable"? | 10:06 |
persia | Mind you, it may be that the set of stipulations will end up being revised, but that is independent of the evaluation of processes. | 10:06 |
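Read literally, the argument above suggests a simple check: a process is "trustable" (in this narrow sense) if, for every stipulation, there is a claim of conformance backed by collateral. A minimal sketch with hypothetical names, which deliberately says nothing about the strength of each argument:

```python
def is_trustable(stipulations: list[str],
                 claims: dict[str, list[str]]) -> bool:
    """claims maps each stipulation to the collateral supporting its claim of conformance.

    A process passes only if every stipulation has at least one piece of
    supporting collateral. Judging how convincing that collateral is remains
    the part individual reviewers may still disagree about.
    """
    return all(claims.get(stipulation) for stipulation in stipulations)

stipulations = ["provenance", "construction", "reproducibility",
                "functionality/reliability", "consistency with intent",
                "ability to update", "safety"]
claims = {s: ["evidence-record"] for s in stipulations}
print(is_trustable(stipulations, claims))  # True
```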
reiterative | I guess that would depend on what a process is claiming to achieve | 10:06 |
reiterative | And whether that holds up to scrutiny | 10:06 |
persia | Right, and we can only judge a process against the claims. | 10:07 |
persia | Now, if a process claims to do a variety of particularly interesting things (e.g. never link any non-GPL application against a GPL library), there needs to be support for those claims, which can be evaluated in terms of frameworks. | 10:07 |
persia | And I suspect that constructing metrics to allow one to appreciate confidence in evidence quantitatively will massively improve the ability of various parties to evaluate such claims. | 10:08 |
reiterative | Agreed. So the available evidence should support the claims that are made for a process? | 10:08 |
persia | But it is important not to be distracted by the potential universe of claims to be supported to ensure the metrics for evaluating argument are sufficiently general. | 10:09 |
persia | That restatement feels like it might miss something, but basically, yes. The potential missing bit is the process of argument. While it is true that if parallel lines never intersect, the sum of the inner angles of a triangle will be 180˚, demonstrating this from the evidence available requires presentation of additional collateral information (the proof) | 10:10 |
reiterative | Yes, we probably don't want to get into the proof, but I think we do want to define some minimum characteristics for our trustable principles (provenance, construction, reproducibility, etc) and factor the availability of supporting evidence for these into any trustability metrics that we may devise. | 10:14 |
persia | Right, and it becomes the responsibility of someone claiming a given process is "trustable" to provide sufficient collateral (argument/proof) to satisfy anyone to whom they wish to make such a claim. | 10:23 |
*** sambishop has quit IRC | 11:10 | |
*** sambishop has joined #trustable | 11:13 | |
*** traveltissues has joined #trustable | 11:20 | |
reiterative | I would be tempted to go further. As I've just said in another conversation: | 11:27 |
reiterative | In my opinion, it suggests that we need to distinguish between two (related) types of evidence: | 11:28 |
reiterative | (1) Evidence that a policy exists and has been applied | 11:28 |
reiterative | (2) Evidence that the application of the policy enhances trustability | 11:28 |
reiterative | We have been focussing on (1) thus far, but I think it is important not to lose sight of (2), since (1) is arguably meaningless without it. | 11:28 |
reiterative | We have already identified a set of factors that we believe must be considered when assessing the trustability of software: its provenance, construction, reproducibility, clarity of purpose, reliability, resilience and safety. We have also made some attempt to examine 'what good looks like' in some of these areas. | 11:28 |
reiterative | I believe that the next challenge should be to express these ideas as a set of 'trustability intents' (or constraints, if possible at this stage) that can be used to evaluate a set of available evidence. | 11:28 |
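To make the two types of evidence concrete, a hypothetical sketch of a 'trustability intent' that records both evidence that a policy exists and was applied (1) and evidence that applying it enhanced trustability (2). The class and field names are illustrative, not a proposed schema.

```python
from dataclasses import dataclass, field

@dataclass
class TrustabilityIntent:
    """A trustable factor expressed as something available evidence can be evaluated against."""
    name: str                                                   # e.g. "provenance", "reproducibility"
    applied_evidence: list[str] = field(default_factory=list)   # (1) policy exists and has been applied
    outcome_evidence: list[str] = field(default_factory=list)   # (2) applying it enhanced trustability

    def status(self) -> str:
        if self.applied_evidence and self.outcome_evidence:
            return "supported"
        if self.applied_evidence:
            return "applied, outcome unproven"
        return "unsupported"

intents = [TrustabilityIntent("provenance", ["signed commits"], []),
           TrustabilityIntent("reproducibility", ["rebuild log"], ["rebuild matched release bit-for-bit"])]
for intent in intents:
    print(intent.name, "->", intent.status())
```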
reiterative | Perhaps a way for us to determine the overall goal for our 'trustability intents' would be to consider how they contribute to the identification and management of risk as part of a software engineering process? | 11:37 |
*** sambishop has quit IRC | 12:46 | |
*** sambishop has joined #trustable | 13:00 | |
reiterative | I've pushed a new commit to pa-nomenclature, incorporating Edmund's review feedback into core-concepts.md | 14:12 |
* persia sets aside time for rebuttal | 14:13 | |
* reiterative expected that | 14:13 | |
persia | On (1) vs (2), that matches what I have been calling comparatives. | 14:13 |
persia | I am uncertain if we can make an argument that a given process enhances trustability, but I am confident that if we have a standardised mechanism to describe two processes, we can probably suggest which of them is able to provide greater confidence in the validity of a specific claim. | 14:15
persia | If these evaluations are quantitative, it ought be possible to suggest weighted models for a collective score for a set of claims (such as those listed for trustable), which I suspect is close to what you describe. | 14:16 |
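A hedged sketch of the weighted model persia describes: each trustable claim gets a quantitative confidence, and a weighted combination yields a collective score that allows two processes to be compared. The claim names, weights and confidence values are placeholders.

```python
def collective_claim_score(confidences: dict[str, float],
                           weights: dict[str, float]) -> float:
    """Weighted aggregate of per-claim confidence values in [0, 1].

    confidences: confidence in the evidence for each claim (e.g. "provenance": 0.8)
    weights: relative importance of each claim; missing claims count as zero confidence.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    return sum(weights[c] * confidences.get(c, 0.0) for c in weights) / total_weight

process_a = {"provenance": 0.9, "construction": 0.7, "reproducibility": 0.4, "safety": 0.6}
process_b = {"provenance": 0.6, "construction": 0.8, "reproducibility": 0.9, "safety": 0.5}
weights = {"provenance": 2, "construction": 1, "reproducibility": 1, "safety": 3}
print(collective_claim_score(process_a, weights), collective_claim_score(process_b, weights))
```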
reiterative | Yes, but I think we need to define a set of qualitative metrics (i.e. establish what constitutes evidence of good practice) that can be used as part of a quantitative evaluation. | 14:20 |
persia | I believe one can go from quantitative to qualitative, where one can assign value to a metric. | 14:24 |
persia | I do not know of any way to go in the other direction. | 14:24 |
persia | As such, I think it important to capture evidence of practice and evidence of result, allowing one to make assertions about “good”. | 14:25 |
persia | For example, if it is interesting to assert that “test before release” is “good”, it makes sense to show that the number of defects experienced by release consumers differs. If there is no difference in the experience of release customers, then there is no reason to suggest prerelease testing is “good”. | 14:27 |
persia | If someone believes that it is not worth the experiment, that is fine, but in such a case, the argument rests on the assumption, and it becomes useful to document and attribute that assumption. | 14:29 |
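The "test before release" example can be made concrete with a simple comparison of the defect rates experienced by consumers of releases produced with and without prerelease testing. This is a sketch of one possible analysis (a plain rate comparison with invented numbers), not a prescribed experimental design.

```python
def defect_rate(defects_reported: int, releases: int) -> float:
    """Defects experienced by release consumers, per release."""
    return defects_reported / releases if releases else 0.0

# Hypothetical numbers: releases that went through prerelease testing vs those that did not.
tested_rate = defect_rate(defects_reported=12, releases=10)
untested_rate = defect_rate(defects_reported=45, releases=10)

# If the two rates are indistinguishable, there is no evidence here that
# prerelease testing is "good"; if they differ, the difference (and its size)
# is the evidence of result that supports the practice.
print(f"tested: {tested_rate:.1f}/release, untested: {untested_rate:.1f}/release")
```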
reiterative | If you *can* use evidence of positive outcome to distinguish what constitutes good practice in this way, then I agree that it's a good approach. I'm just not convinced that it will always be possible - and I think we will need to base initial evaluations on more subjective value judgements about what constitutes good practice. | 14:31 |
persia | (On qualitative->quantitative, it is also important to understand impact: do the effects of “good” scale exponentially, linearly, logarithmically, or differently to, e.g. lines of source code?) | 14:34 |
persia | I think we may be saying more similar things than might be apparent. I believe that early evaluations of processes will be based on an arbitrary unproven set of assumptions. I just very strongly believe that while performing such assessments depends on an agreed language for assessment, the language for assessment ought be unaffected by the assumptions of early assessments (a one-way dependency). As such, I find it unuseful to attempt to discuss | 14:42 |
persia | particular assessment criteria or expected assumptions about best practice until there is a common semantic mapping to use for such discussion. | 14:42 |
*** sambishop has quit IRC | 17:16 | |
*** traveltissues has quit IRC | 20:21 |