Incomparable testing

I appreciate the efforts made by the many people who have tested their knives and steel in a myriad of ways to try to educate the community at large about just what they're buying and how best to make choices. But I also want to highlight something that seems to cause contention at times. Many of the tests are not comparable, for many reasons: conditions, equipment, individuals, and method. I've mentioned it in posts before, but just wanted to start a thread on it so as not to detract from discussion of any specific test.

If I can throw a couple of lousy analogies out there, they may help describe my thoughts. If I wanted to evaluate a few running shoes, or a few features used by different kinds of running shoes, then I might run in them myself. Or maybe some other runners would do it. Now, if I, someone with no track and field experience, and a college track star, and Usain Bolt, and Patrick Makau (world record holder for the marathon) all tried these shoes, our results may or may not be the same. We are different people, with different skill levels, running in different ways, over different distances. The conditions may or may not be similar as far as track and weather go. But I think this illustrates how the raw time to completion for any person in any of the tests is just not comparable.

Or perhaps I wanted to see how the Mustang, Camaro, and Challenger compared. Doing 0-60 for the Mustang, a 1/4 mile for the Camaro, and a lap at Talladega for the Challenger obviously isn't going to help me make a comparison. Sure, I'm getting an assessment of speed/acceleration, as opposed to comparing ride quality vs braking distance vs fuel economy, but even when I restrict what I measure, I can still measure it in many ways. 60 mph is 60 mph, a 1/4 mile track is available all over the country, and Talladega is a well maintained facility. But no matter how standard and accurate each individual test is, you can't compare auto performance without repeating the tests with each car.

And this is what I see happening in knife testing. I don't think any of the tests are bad, as long as they have repeatability. But I don't see the need to argue results because a totally different test gave different results. If you cut until you reach a given force, then that is one test of edge retention. If you cut until you reach a given number of cuts, then measure cutting force/ability on a different set of media, then that is another method of retention testing. If you cut for a given number of strokes with a constant given force, and measure the amount of stuff cut, then that is a third method of retention testing. If we think about it, we see that the specific measure of edge retention, and therefore the specific results, are different. And these are three types of tests I see referenced a fair amount. Again, there's nothing wrong with any as long as the methods are sound. But there's no comparison.

To take it back to running - one test would be like running until you slowed to a certain speed, another would be running a given distance with no regard to speed and then sprinting the 40 for time, and another would be running a given distance and measuring any changes in pace/stride over the course. Not everyone is going to get the same results, even if they're wearing the same shoes.

My examples would have pretty extreme differences in numbers, and that is probably not the case with edge retention testing. But that is the thing: the differences between the tests may be subtle, but so may the gaps in the rankings. 10% here or there can really switch things up in some cases.
 
I agree completely. It's nice to have some empirically repeatable method of testing; the real issue is, what test?

I can see difficulty enough getting the consumer to agree on methodology, adding in the manufacturers would stir the pot even more. Many will choose a method designed to advance their products at the expense of the competition.

Case in point - firearms. Easy enough to test one for accuracy, shoot some rounds at a specified distance, do the math, and declare it's capable of shooting 1/2MOA - or .5" at 100 yards. Or meters. Oh, was that iron sights or scoped? What power? Standing, sitting, prone, benchrest, what cross wind, was it raining? How many rounds - two, three, five, or ten?

The US Army standard is 10 - ten - rounds, to average out the statistical anomalies to a definable standard, and that gets into huge controversies among shooters. Should be a slam dunk: the Army says ten rounds, has since the 1950s, get over it. Still a contentious issue, tho. Some dispute the math - even tho professional statisticians and shooters designed the test. Some just don't want to have a standard imposed. Some just like to stir up crap.
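To put rough numbers on the round-count argument, here is a toy simulation - not any official Army protocol, and the dispersion value is arbitrary - showing why a small group understates how a rifle actually scatters:

```python
# Illustrative only: a toy simulation of why measured group size depends on round count.
# Assumes shot impacts scatter normally around the point of aim (a simplification).
import random
import math

def extreme_spread(n_rounds, sigma=0.25):
    """Largest center-to-center distance (inches) among n_rounds simulated impacts."""
    shots = [(random.gauss(0, sigma), random.gauss(0, sigma)) for _ in range(n_rounds)]
    return max(math.dist(a, b) for a in shots for b in shots)

for n in (3, 5, 10):
    avg = sum(extreme_spread(n) for _ in range(2000)) / 2000
    print(f"{n}-round groups average about {avg:.2f} in. extreme spread")
```

The more rounds in the group, the larger (and more honest) the measured spread, which is the whole point of insisting on a fixed round count.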

One approach may be to outline the specific measures used in the tests commonly known, like cutting rope - what kind of sisal or hemp, twisted to what construction and thickness, and what starting and ending pressure. Same with the card stock testing another maker uses - detail the construction of the rig well enough that we could all build one, and give a literal description of the size and weight of the card stock.

That would do two things - apprise those who test that there are known methods, and exactly what they are. Then two tests conducted literally a thousand miles apart would at least be comparable. From there, it could be determined if two tests are representative of that model as a whole, out of the thousands actually made and in use.
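As a sketch of the level of detail that would take, something like the following - every field name and value here is hypothetical, just to show the idea of a spec detailed enough that anyone could rebuild the test:

```python
# A sketch of what a written-down rope-cut test spec might look like.
# All fields and values are hypothetical examples, not anyone's actual protocol.
from dataclasses import dataclass

@dataclass
class RopeCutSpec:
    rope_material: str        # e.g. "sisal"
    rope_diameter_mm: float   # e.g. 9.5
    rope_construction: str    # e.g. "3-strand twisted"
    cut_style: str            # "draw slice" or "push cut"
    starting_force_N: float   # force on the first cut of a fresh edge
    stopping_force_N: float   # force that ends the test
    edge_angle_deg: float     # inclusive edge angle
    edge_finish: str          # e.g. "600 grit"

example = RopeCutSpec("sisal", 9.5, "3-strand twisted", "draw slice",
                      20.0, 60.0, 30.0, "600 grit")
print(example)
```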

We'll just have to hope they don't get a Monday knife that looks great but should have never been sold.
 
I don't think any of the tests are bad, as long as they have repeatability.

I'll throw in $0.02 that may be obvious.
Repetition is used to confirm/contradict previous results and establish statistical variation. When replicates evince contradictory results, there may be a problem with the method. However, even when results correlate, the method might still be flawed in terms of what data you are collecting and how it correlates to the property you are trying to measure.
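For instance, a minimal sketch of what replicate runs buy you (the cut counts below are invented purely for illustration):

```python
# Minimal sketch: using replicate runs to estimate variation before comparing knives.
from statistics import mean, stdev

knife_a_cuts = [118, 124, 121, 119]   # cuts to a stopping force, four replicate runs
knife_b_cuts = [131, 109, 142, 120]   # same test, a different knife

for name, runs in (("A", knife_a_cuts), ("B", knife_b_cuts)):
    print(f"Knife {name}: mean {mean(runs):.1f}, std dev {stdev(runs):.1f}")

# If the spread within one knife's replicates is as large as the gap between knives,
# the ranking is not telling you much - and a tight spread still cannot reveal a
# method that consistently measures the wrong property.
```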

If you cut until you reach a given force, then that is one test of edge retention. If you cut until you reach a given number of cuts, then measure cutting force/ability on a different set of media, then that is another method of retention testing. If you cut for a given number of strokes with a constant given force, and measure the amount of stuff cut, then that is a third method of retention testing. If we think about it, we see that the specific measure of edge retention, and therefore the specific results, are different. And these are three types of tests I see referenced a fair amount. Again, there's nothing wrong with any as long as the methods are sound. But there's no comparison.

If the same medium (the "standard") is repeatedly cut until a specific level of measured resistance is attained, you can graph the rate of edge degradation as it correlates to force required to complete each cut. You should probably also measure the amount of medium deflection as this may be more perceptible than force of resistance depending on instrument precision levels.
If a medium is repeatedly cut a set number of times, ignoring changes in resistance for each cut, then resistance is measured on a second medium (the "standard"), you are still measuring the rate of edge degradation as it correlates to force required to complete each cut of the "standard" medium. This is the same as the first test method but employs less precision. If a medium is repeatedly cut a set number of times and the force required (i.e. resistance) does not noticeably change, then this second method would be preferable to the first, increasing the number of cuts in the testing medium until a change is detected in the 'standard' cut.
The third method you mention must be contained within the second - a medium is repeatedly cut until the force required (i.e. resistance) changes a set amount, and then the number of cuts is measured. The test cannot be completed until the force required to complete each cut changes. Cutting with "a constant given force" implies either that no edge degradation is occurring, in which case it is a poor test since nothing is being measured, or it rejects the laws of physics - equal and opposite forces. Measuring the "amount of stuff cut" is practical for demonstration purposes, but it must correlate back to the number of cuts required and the edge degradation experienced.
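To make the precision point concrete, here is a rough sketch of how the first two methods sample the same underlying degradation curve; the wear model and numbers are made up for illustration:

```python
# Toy model: force to finish a cut grows as the edge wears.
def force_after(cuts, start=15.0, wear_rate=0.4):
    """Hypothetical force (N) needed to finish a cut after `cuts` prior cuts."""
    return start + wear_rate * cuts

# Method 1: cut until a force threshold is reached, logging force on every cut.
threshold = 35.0
cuts, log = 0, []
while force_after(cuts) < threshold:
    log.append((cuts, force_after(cuts)))
    cuts += 1
print(f"Method 1 stopped after {cuts} cuts; every point on the curve was recorded.")

# Method 2: cut a fixed number of times, then check force on the 'standard' once.
for checkpoint in (10, 25, 50):
    print(f"Method 2 checkpoint at {checkpoint} cuts: {force_after(checkpoint):.1f} N")
# Same curve, sampled more coarsely - which is the 'less precision' point above.
```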

In conclusion, the three methods you mention are indeed comparable, MUST BE comparable. While the first may provide greater precision than the latter methods, the three should tell the same overall story. That is, their graphs of edge-degradation should coincide (within statistical range). If they do not, then there is something missing, some variation in the methods, left unmentioned. Examples:
- the 'standard' medium: how well does it represent the cutting performance of the test subjects
- the 'Control' cut: measurement of the "sharpness" as resistance from the 'standard' medium in the first cut
- cutting motion: slice vs push vs some variation
- edge polish
- (of course) edge geometry (angles, thickness, bevel heights, etc.)

This is where I see stories not lining up: cutting implements being compared in the same general way - measuring edge degradation - without controlling for the relevant features. Am I way off base?
 
I'm very, very interested in a more scientific approach to testing, but man are there a lot of complications. I've recently become mildly obsessed with testing ergonomics, but it's a very obnoxious and time consuming activity. I figure ergonomics are highly personal, but we can generally agree that there are knives with good ergos and bad ergos, so there must be some agreeable standard. What I'm trying to do is break ergonomics down into two categories, grip and comfort, and have as wide a sample size as possible perform a few simple cutting tasks with each knife and rate them in those categories. Problem is I feel like I can't eliminate enough variables. Right now I'm having people cut potatoes, cardboard and whittle a bit of wood, but I can't guarantee any level of consistency in the cutting media, much less ask people to do the kind of really extended cutting required to give the ergos a full workout.
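For what it's worth, the aggregation side is the easy part; a small sketch of pooling grip/comfort ratings across people might look like this, where the knife names, raters, and scores are all invented:

```python
# Sketch of aggregating per-rater grip/comfort ratings into a per-knife summary.
from collections import defaultdict
from statistics import mean

# (rater, knife, category, score 1-10) collected after each cutting task
ratings = [
    ("r1", "Knife X", "grip", 7), ("r1", "Knife X", "comfort", 6),
    ("r2", "Knife X", "grip", 8), ("r2", "Knife X", "comfort", 5),
    ("r1", "Knife Y", "grip", 5), ("r1", "Knife Y", "comfort", 8),
    ("r2", "Knife Y", "grip", 6), ("r2", "Knife Y", "comfort", 9),
]

scores = defaultdict(list)
for _, knife, category, score in ratings:
    scores[(knife, category)].append(score)

for (knife, category), vals in sorted(scores.items()):
    print(f"{knife:8s} {category:7s} mean {mean(vals):.1f} (n={len(vals)})")
```

The hard part, as you say, is keeping the cutting media and the amount of cutting consistent enough that the ratings mean the same thing from person to person.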
 
I'll throw in $0.02 that may be obvious.
If the same medium (the "standard") is repeatedly cut until a specific level of measured resistance is attained, you can graph the rate of edge degradation as it correlates to force required to complete each cut. You should probably also measure the amount of medium deflection as this may be more perceptible than force of resistance depending on instrument precision levels.
If a medium is repeatedly cut a set number of times, ignoring changes in resistance for each cut, then resistance is measured on a second medium (the "standard"), you are still measuring the rate of edge degradation as it correlates to force required to complete each cut of the "standard" medium. This is the same as the first test method but employs less precision. If a medium is repeatedly cut a set number of times and the force required (i.e. resistance) does not noticeably change, then this second method would be preferable to the first, increasing the number of cuts in the testing medium until a change is detected in the 'standard' cut.
The third method you mention must be contained within the second - a medium is repeatedly cut until the force required (i.e. resistance) changes a set amount, and then the number of cuts is measured. The test cannot be completed until the force required to complete each cut changes. Cutting with "a constant given force" implies either that no edge degradation is occurring, in which case it is a poor test since nothing is being measured, or it rejects the laws of physics - equal and opposite forces. Measuring the "amount of stuff cut" is practical for demonstration purposes, but it must correlate back to the number of cuts required and the edge degradation experienced.

In conclusion, the three methods you mention are indeed comparable, MUST BE comparable. While the first may provide greater precision than the latter methods, the three should tell the same overall story. That is, their graphs of edge-degradation should coincide (within statistical range). If they do not, then there is something missing, some variation in the methods, left unmentioned. Examples:
- the 'standard' medium: how well does it represent the cutting performance of the test subjects
- the 'Control' cut: measurement of the "sharpness" as resistance from the 'standard' medium in the first cut
- cutting motion: slice vs push vs some variation
- edge polish
- (of course) edge geometry (angles, thickness, bevel heights, etc.)

This is where I see stories not lining up: cutting implements being compared in the same general way - measuring edge degradation - without controlling for the relevant features. Am I way off base?
This is what I am talking about. The second medium is push cut, and that force is measured. The force for the standard medium is completely ignored, and it is a slicing cut. In the other test using the same media, only the standard is cut, and only the slicing cut force is measured. There is no push cut measured. One test runs a specific number of cuts, through the specific depth of the standard medium, and only the force to cut through a second medium in a different way is measured. In the other, only the force to cut to the specific depth of the test media is measured, and the number of cuts is controlled by the force needed to cut through only that media.

The third, with the constant force, is CATRA. The standard test setup has a 50 newton load used on each and every cutting stroke. The standard test runs 60 strokes. There is no requirement for the test media, the card stock, to be cut to any particular depth. Either the blade makes it through some cards, or it doesn't. The machine just moves the blade back and forth 60 times with 50 newtons of force between the blade and the stack of card stock. The cut depth is measured, but not the force to slice, as it is always 50 newtons, and not the force to push cut, as the machine applies motion to the blade.

In the other tests, the force varies with each cut, but the cutting depth is always the same. In the specific cases I am thinking of, that cut depth is the diameter of rope. There are also tests where cardboard is cut, and the ability to shave arm hair is measured at specific intervals. Or the 'cleanness' of phone book or copy paper is tested as cardboard or rope are cut.

The question is what do you need the edge to do. After you cut rope a hundred times, do you need it to shave your face, or do you need it to cut a hundred more pieces of rope? Perhaps the tests we pay the most attention to should be the most relevant to our cutting needs.
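A toy contrast of the two kinds of output described above may make the difference clearer; all the numbers are invented, and this is not the CATRA procedure itself, just the shape of the data each approach produces:

```python
def catra_style(strokes=60):
    """Fixed load, fixed stroke count: the result is total depth of card cut (mm)."""
    total = 0.0
    for i in range(strokes):
        depth_this_stroke = 3.0 / (1 + 0.05 * i)   # hypothetical dulling curve
        total += depth_this_stroke
    return total

def rope_style(stop_force=60.0):
    """Fixed cut depth (one rope diameter), variable force: result is cuts completed."""
    cuts, force = 0, 20.0
    while force < stop_force:
        cuts += 1
        force += 0.8                               # hypothetical force growth per cut
    return cuts

print(f"CATRA-style result: {catra_style():.0f} mm of card cut in 60 strokes")
print(f"Rope-style result:  {rope_style()} cuts before reaching the stopping force")
# One number is depth cut at constant force; the other is cuts completed at constant
# depth. Both track dulling, but they are not the same measurement.
```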

I do not think you can compare a slicing aggression test to a push cutting test. While both may degrade the edge with the same material, the cutting ability is measured differently. I would not compare a cross-cut saw and a straight razor, for example. Again, an extreme example, but here is where I take it from: the highest cut number I had on a run of CATRA tests came from the coarsest edge. It was also the most acute angle, but we didn't have time to do more runs at other levels of edge polish. I do find that the percentage increase in score is very much in line with what happened in the other edge angle progressions, while it is far out of line with what happened as we progressed through edge finish.

When I say tests aren't comparable, I say that someone should not look at that sort of result, where the initial cut depth and the total cut depth were the highest with the coarsest edge, and then draw the conclusion that it is the overall 'sharpest' edge. Would you prefer to shave your face with an edge that tends to be sharper or less sharp? Probably sharper. Well, if the test results I have place a 120 grit edge at the top in 'sharpness' or 'cutting ability', then wouldn't that just be the best edge to work on those chin whiskers? Probably not. Wrong test for the results we want. Wrong to compare that test to one where only push cutting is involved, perhaps like the one we used to do where a piece of phone book paper is held at one corner, and you see how far from the point of hold you can push cut into the paper with no slicing action. Is that the best edge to slice rope with? Maybe, but you can't guess that; someone actually has to go and slice some rope and measure the performance against others.

That is why I think the tests aren't as comparable as some would wish; they don't measure sharpness in quite the same way. Like the 0-60 vs 1/4 mile. Those both rely on acceleration for results, but too much happens between the time the car hits 60 and when it passes the 1/4 mile mark to extrapolate one from the other. Some cars finish the quarter in less time than others while crossing the line at a lower speed. Some steels wear slower, some have coarser carbides, some are higher alloy, etc. And edge geometry affects how hard it is to cut through a solid material, but has less effect when you are cutting through something the diameter of fine thread or a human hair.

I mention repeatability because there is one particular set of test results where a blade was retested and it improved from the bottom to the middle of the rankings. The exact same knife, in what were supposed to be the exact same test conditions. Then I looked at the data overall and saw what I believe to be a pattern of improving sharpness measurements. There was no obvious pattern of steel or blade type, but the newest results were consistently better than the prior ones. The measuring equipment, some of the setup, and the overall procedure had been adjusted over time, but the fact that the exact same tool did not score exactly the same was not seen to be an issue.
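That kind of drift is simple to check for if the results are dated or at least ordered; a quick sketch with invented scores (statistics.linear_regression needs Python 3.10+):

```python
# Checking whether scores trend upward with test order, regardless of which blade was tested.
from statistics import linear_regression

test_order = [1, 2, 3, 4, 5, 6, 7, 8]
scores     = [96, 101, 99, 104, 108, 107, 112, 115]   # same nominal test over time

slope, intercept = linear_regression(test_order, scores)
print(f"Scores drift by about {slope:.1f} points per test session")
# A consistent upward slope across unrelated blades suggests the rig or procedure
# changed over time, which is exactly the repeatability problem described above.
```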

Other than that issue, I again say that I cannot complain about people spending a lot of their free time and expending a ton of effort to share with all of us. I just think everyone being educated by these testers should be analytical about their own needs when viewing the results.
 