Both the fib and prime factor examples are shit. The tests are individual cases, which treats the input as individual points in a Cartesian product (points on a line, since these are single-variable functions). What you want are sets: the input domain for both is all natural numbers. However, that alone doesn't tell you much; there's still a lot of state to work with. So what you want are end conditions, but again, that doesn't tell you much.
Tests are no substitute for reasoning and TDD is just another fad.
You have to have tests for all combinations though. At least those combinations that you actually want to use. You get the same problem when your code is a big ifdef-hell.
I kind of wish I had the time to actually do it right now and see how it works. But here's how I imagine it going:
1) Establish tests for the is-in-set function. You're absolutely right that the most obvious way to do this meaningfully is to reimplement the function. A better approach would be to find some way to leverage an existing "known good" implementation for the test. Maybe a graphics file of the Mandelbrot set we can test against?
2) Establish tests that given an arbitrary (and quick!) is-in-set function, we write out the correct chunk of graphics (file?) for it.
3) Profit.
Observations:
1) I absolutely would NOT do the "write a test; write just enough code to pass that test; write another test..." thing for this. My strong inclination would be to write a fairly complete set of tests for is-in-set, and then focus on making that function work. (A sketch of what those tests might look like follows below.)
2) There's really no significant design going on here. I'd be using the exact same overall design I used for my first Mandelbrot program, back in the early 90s. (And of course, that design is dead obvious.)
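To make step 1 above concrete, here's a minimal sketch in Python of what those is-in-set tests might look like, leaning on mathematically known membership facts rather than a reimplementation or a reference image. The function name, iteration limit, and sample points are all my own choices, not anything prescribed above:

    # Hypothetical escape-time implementation under test. Note the one-sided
    # certainty: a point reported outside definitely escaped, but a point
    # reported inside might still escape after more than max_iter iterations.
    def in_mandelbrot_set(c, max_iter=1000):
        z = 0
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2:
                return False
        return True

    # Membership facts known from the mathematics, so the tests don't just
    # duplicate the function under test:
    assert in_mandelbrot_set(0)                  # z stays at 0 forever
    assert in_mandelbrot_set(-1)                 # falls into the cycle 0, -1, 0, -1, ...
    assert in_mandelbrot_set(complex(0, 1))      # i lands in a bounded cycle
    assert not in_mandelbrot_set(1)              # 0, 1, 2, 5, 26, ... diverges
    assert not in_mandelbrot_set(complex(0, 2))  # escapes after two iterations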
In my mind, the world of software breaks down something like this:
1) Exhaustive tests are easy to write.
2) Tests are easy to write.
3) Tests are a pain to write.
4) Tests are incredibly hard to write.
5) Tests are impossible to write.
I think it's pretty telling that when TDD people talk about tests that are hard to write, they mean easy tests in hard-to-get-at areas of your code. I've never heard one discuss what to do if the actual computations are hard to verify (i.e. 4 and 5 above), and when I've brought it up to them, the typical response is "Wow, guess it sucks to be you."
I think this is an important point: your design literacy is key to creating good software. To me, software design hinges a lot on finding good ways of composing things together simply. Functional programming is really good at encouraging this by its very nature.
TDD I think works best when, mentally, it is design centric rather than test centric. The tests are used to prod a design around till you are happy with the design and act as validation of the intent of the design. The goal is to keep the tests at the minimal level to test that intent.
Take testing the sum of two numbers, from the article:
    def sum(a, b)
      if b == 2
        4
      else
        5
      end
    end
If this is what you do, I think you've gone wrong. The name of your function is either wrong and should be renamed, or your design of sum is wrong. Attempting to test that functionality into place is the wrong approach. I know it's tricky with examples, but every extra sum(2,3), sum(4,5), sum(9,1), etc. that doesn't reveal more design intent from the implementation of returning a+b is a wasted test that must be maintained.
You don't want your tests to try to rule out every possible implementation of sum. It really doesn't help in practical terms. I know that when I used to use the bowling game as an interview coding question, I was amazed how many tests I would have needed to guard against every possible implementation. People would do crazy things with their design, such as the equivalent of this implementation for sum...
    return a == 392984589 ? 0 : a + b
It was a conditional to handle an edge case that introduced a bizarre bug for 392984589. To guard against every possible implementation of bowling (or sum), you would need to test every combination. But that's not really the purpose of TDD. TDD is about design and incrementally putting a design in place.
Yes. But what's striking to me is that anyone would ever want to test only one input state. If I have to test a function f(x) and see whether its output is as expected, I always want to test many inputs (including "silly" ones like wrong types, nulls, and extremes). Writing one test for each of them is absurd; just give a list of inputs and a list of expected outputs, and check them all.
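For what it's worth, most test frameworks support exactly that. A minimal sketch with pytest's parametrize, using a trivial stand-in function since no concrete f(x) was given:

    import pytest

    def square(x):  # stand-in for the f(x) under discussion
        return x * x

    # One test definition; a table of inputs and expected outputs.
    @pytest.mark.parametrize("x, expected", [
        (0, 0),
        (2, 4),
        (-3, 9),
        (10**6, 10**12),  # an extreme value
    ])
    def test_square(x, expected):
        assert square(x) == expected

    def test_square_rejects_wrong_type():  # one of the "silly" inputs
        with pytest.raises(TypeError):
            square("two")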
Wanted? Yes. Feasible? No. At least not with current computing power and technologies. Combinatorial explosion applies here as well (as with "full" input test coverage).
Something like unit-test inverse-deduction software would be great, instead of blind guessing (using mutations).
What you are advocating is manually duplicating the calculations the program does for some set of specific cases, which can then be used for tests. Let's consider a few practical questions about how that might work.
1. How do you choose which specific cases to calculate manually?
Remember, you have no prior knowledge of what the correct answer looks like or where the interesting parts of the input space are. You have no way to determine whether any given set of manual calculations is representative of the work your program will be doing or covers the areas with the greatest risk of error.
2. How practical is it to make all those manual calculations?
Even in this very simple case, calculating the correct answer for a single input point manually might require hundreds of complex arithmetic operations. That's going to be slow and error-prone. After all, isn't that why we're writing a program to do this for us? (A rough sense of the per-point cost is sketched below.)
Now, how does this idea scale? What if our program isn’t computing a nice analytical solution to a simple arithmetic problem, but instead running a complicated numerical method to process many thousands of data points in each input? It quickly becomes impractical to rely on this strategy for testing.
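To put a rough number on the per-point cost mentioned above, here's a small escape-time counter in Python; the sample point is mine, chosen arbitrarily near the boundary at -0.75, where iteration counts blow up:

    def escape_iterations(c, limit=10_000):
        z = 0
        for n in range(limit):
            z = z * z + c
            if abs(z) > 2:
                return n + 1  # escaped on this iteration
        return None  # never escaped within the limit; treated as inside

    # Dozens of complex multiply-adds for this one point, and each complex
    # step is several real multiplications and additions when done by hand.
    print(escape_iterations(complex(-0.75, 0.05)))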
3. How will having a set of known outputs you can test against drive an implementation from scratch?
I mentioned before that Mandelbrot is my second standard challenge to TDD evangelists. The first is to write add(x,y) driven by tests. After a bit of back and forth, this invariably ends up with an implementation that was generalised from however many specific cases were given to the general case. Invariably, that generalisation is the step that actually creates a useful solution to the original problem, and invariably it uses an insight that was not driven by the tests.
Our Mandelbrot scenario is the same situation, just a slightly more complicated example. No matter how many individual tests you create by choosing sample points in the input space and manually calculating the expected output, you won’t have a systematic way to work back from those answers to derive a correct general implementation of the Mandelbrot calculation. Your specific cases might be useful for verifying an existing implementation, but they give you no insight into how to write a good implementation from scratch. (If I’m dealing with a particularly strident advocate of TDD, this is the point where I mention the word “sudoku”.)
And again, we have to ask how this process scales. What if we had a more challenging problem, say extracting an audio track from a video file and running a speech recognition process on it to generate subtitles? It might actually be easier in that scenario to identify individual test cases: just take a known video file as input and write down the expected output for it, and now you have an end-to-end test. However, it's still true that no amount of end-to-end tests will necessarily offer any insight into how to structure a good implementation of the required functionality in detail, nor will it tell us how to implement any specific part or generate useful unit test cases for those parts. End-to-end test cases might help us to verify an existing implementation, but in general they won't reliably drive a correct one from scratch.
I've not used Hypothesis, but it says it's inspired by QuickCheck, and I'm a regular user of FsCheck, which is the F# equivalent. Those tools have a goal similar to ours, but whereas they generate randomised test data (I think FsCheck creates 100 test cases, IIRC), we effectively solve the equation, so we can actually say that it holds for all inputs. The key difference is that we don't actually invoke the code with concrete values, but rather treat it like a set of constraints that should always be satisfied. We then check these constraints, and if they fail we can reverse-engineer a concrete value that would cause the violation. So we effectively run it for all inputs, without actually running it for any inputs.
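The tooling isn't named above, but the idea can be sketched with an off-the-shelf solver such as Z3 (my choice, purely for illustration), using the buggy sum from earlier in the thread:

    from z3 import Ints, If, Solver, sat

    a, b = Ints("a b")
    # The buggy implementation expressed as a constraint, never executed
    # with concrete values:
    buggy_sum = If(a == 392984589, 0, a + b)

    s = Solver()
    s.add(buggy_sum != a + b)  # is there ANY input that violates the spec?
    if s.check() == sat:
        # Reverse-engineer a concrete failing input, e.g. a = 392984589
        print("counterexample:", s.model())

This checks the property for all integer inputs at once, without ever running the code on any particular one.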
Can you actually give an example of a test that would realistically be written in a dynamic language, but that would instead be encoded into the types of values?
If I really needed that, I'd use property-based testing. I'm on my phone, or I'd code an example. But basically it generates test inputs and gives you confidence by running some number of random tests that verify the property holds, or by finding a counterexample.
A more realistic but still basic example is showing that a function is its own inverse (reverse reverse list == list) or that one function is the other's inverse (decode encode plain == plain).
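Since Hypothesis came up earlier in the thread, here's roughly what both of those properties look like in Python (a sketch; the JSON round trip stands in for a generic encode/decode pair):

    import json
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_reverse_is_its_own_inverse(xs):
        assert list(reversed(list(reversed(xs)))) == xs

    @given(st.dictionaries(st.text(), st.integers()))
    def test_decode_inverts_encode(plain):
        assert json.loads(json.dumps(plain)) == plain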
If someone gave me that test, I'd implement the code as "return input;"
Forget that you, a human, think you know what an ISorter should do. What do the tests demand that you do?
Doing TDD, the goal, when coding, is to write the minimal amount of broken code that makes the test pass. If you passed me {4,3,2,1}, then my code would be "return {1,2,3,4};". If you wrote a second test that passed me {6,5,4,3}, my code would be "return (input[0]==4)?{1,2,3,4}:{3,4,5,6};". You've got to write a test that makes me actually code what you want. I use this adversarial approach even when I'm writing both the test and the code.
Usually the first test I write, if I'm starting with a blank slate, is to pass null and check that I get a NullPointerException. That gets the class created. After that, it's got to be randomly generated data.
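A sketch of the randomly-generated-data step, in Python rather than Java, with a seeded generator and the built-in sorted() as the oracle (both my choices, not the commenter's):

    import random
    import pytest

    def my_sort(xs):               # hypothetical unit under test
        return sorted(xs)          # stand-in implementation

    def test_sort_rejects_none():  # the "pass null" first test, Python-flavoured
        with pytest.raises(TypeError):
            my_sort(None)

    def test_sort_random_data():
        rng = random.Random(42)    # seeded so any failure is reproducible
        for _ in range(100):
            data = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
            assert my_sort(data) == sorted(data)  # oracle: the built-in sort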
It is! And I'm a big fan of test systems that optionally invert the logic and require failure - to weed out bullshit tests.
But there are some things that are challenging to test for. Tests sample the space; proofs define the space. You can't build it wrong. (You can sure have a bad spec and build the wrong thing, though.)
Tests definitely help. It sorta feels like there's value left on the table, though. The spec and the types let you define broader constraints, and you catch them all.
You're right that if you play the sort of game where you write the test "assertEq(2, add(1, 1))" and then write "fun add(_, _) = 2", you won't get very far with this sort of problem. I personally (without much experience) prefer the kind of TDD where you write, say, "assertEq(0, add(0, 0)); assertEq(43, add(42, 1))", fully implement the add function, and then move on. In the Mandelbrot case I'd maybe compute what a few pixels should be, write all the code, and see if those pixels are right. Not perfect, but I find it better than nothing.
PBT is nice in place of (some) unit tests in that you can describe immediately the properties you expect without needing to produce a collection of examples (or write a custom generator that produces a limited set of examples, at which point you're halfway to PBT anyways).
It's also helpful to use it in a piecewise fashion if you're doing TDD. An illustrative example (though perhaps not stellar, as it is a synthetic, non-real-world example) uses the diamond kata, TDD, and PBT together [0]. None of the tests on their own fully specify the system, but in total they do.
If you're doing TDD (or attempting to) I think this is an interesting case. Many TDD methods have you start off with an example case like (to stick with this kata, and using Pythonesque pseudocode because I'm still not awake this Saturday morning):
    diamond-kata-test-a:
        assert(diamond('A') == 'A')
Great, so now someone makes that absolute simplest solution:
    diamond(c):
        return 'A'
Now repeat with a second test case:
    diamond-kata-test-b:
        assert(diamond('B') == ' A \nB B\n A ')
And the function is duly complicated:
    diamond(c):
        switch c:
            case 'A': return 'A'
            case 'B': return ' A \nB B\n A '
            default: return 'blah'  // or error, doesn't matter, it's not tested
But not actually generalized to reflect the intent of the system. By focusing on properties, I've found, the progression of the UUT is a bit better and more natural.
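To make "focusing on properties" concrete, here's roughly the flavour of properties a PBT version of the kata checks, sketched with Hypothesis; the stand-in implementation is mine, included only so the tests run:

    import string
    from hypothesis import given, strategies as st

    def diamond(letter):
        # Stand-in implementation so the properties below are runnable.
        n = ord(letter) - ord('A')
        lines = []
        for i in range(n + 1):
            ch = chr(ord('A') + i)
            outer = ' ' * (n - i)
            if i == 0:
                lines.append(outer + ch + outer)
            else:
                inner = ' ' * (2 * i - 1)
                lines.append(outer + ch + inner + ch + outer)
        lines += lines[-2::-1]  # mirror the top half to form the bottom
        return '\n'.join(lines)

    letters = st.sampled_from(string.ascii_uppercase)

    @given(letters)
    def test_vertically_symmetric(c):  # bottom half mirrors the top half
        lines = diamond(c).split('\n')
        assert lines == lines[::-1]

    @given(letters)
    def test_is_square(c):  # every line is as wide as the diamond is tall
        lines = diamond(c).split('\n')
        assert all(len(line) == len(lines) for line in lines)

    @given(letters)
    def test_first_and_last_lines_contain_A(c):
        lines = diamond(c).split('\n')
        assert 'A' in lines[0] and 'A' in lines[-1]

No single property pins the output down, but together they force the generalized shape rather than the case-by-case switch above.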
Another interesting thing to do with PBT is model-based testing [1]. The useful thing here is that sometimes errors are triggered by a peculiar, though plausible, sequence of commands to your system. We've all worked with that one guy who somehow manages to find exactly the right sequence that triggers weird edge cases and errors, but unless we're him, having a system which will generate arbitrary execution traces for you is the next best thing. (I actually used FsCheck for this last year in trying to sell PBT to my colleagues, and was able to identify where a known issue originated, as well as several other problems that hadn't been found by users or testers yet.)
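Hypothesis offers the same facility in Python via its stateful testing; a toy sketch (mine, not from [1]) that drives a real set and a deliberately naive model with the same generated command sequences:

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

    class SetAgainstModel(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.real = set()   # stand-in for the system under test
            self.model = []     # naive model of the same state

        @rule(x=st.integers())
        def add(self, x):
            self.real.add(x)
            if x not in self.model:
                self.model.append(x)

        @rule(x=st.integers())
        def discard(self, x):
            self.real.discard(x)
            if x in self.model:
                self.model.remove(x)

        @invariant()
        def agrees_with_model(self):
            assert sorted(self.real) == sorted(self.model)

    # Any generated command sequence that makes the two disagree is shrunk
    # to a minimal reproducing trace and reported.
    TestSetAgainstModel = SetAgainstModel.TestCase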
In the end, when these failures are found you can always turn them into distinct unit tests in order to preserve them and prevent regressions. The two modes of testing fit well together.