Neither time nor stomach

Screen Shot 2015-06-19 at 4.43.19 PM

Mark Palko writes:

Thought you might be interested in an EngageNY lesson plan for statistics. So far no (-2)x(-2) = -4 (based on a quick read), but still kind of weak. It bothers me that they keep talking about randomization but only for order of test; they assigned treatment A to the first ten of each batch.

Maybe I’m just in a picky mood.

I replied that I don’t like this bit at all: “Students use a randomization distribution to determine if there is a significant difference between two treatments.”

I don’t like randomization tests and I really really really don’t like the idea that the purpose of a study is “to determine if there is a significant difference between two treatments.”

Also it’s a bit weird that it’s in Algebra II. This doesn’t seem like algebra at all.

Palko added:

If you have the time (and the stomach), I’d recommend going through the entire “Topic D” section. You’ll find lots more to blog about.

I fear I have neither time nor stomach for this.

50 thoughts on “Neither time nor stomach

    • Adam:

      I don’t see the connection. “Ideas are tested by experiment,” that’s fine. “Determine if there is a significant difference between two treatments,” that doesn’t seem so useful to me.

      • I see what they are doing, arguably not terribly well, as demonstrating both how to experiment and how to think about the idea that just because group A and B’a summary statistics are different doesn’t mean you’ve discovered something. And that’s what brought the xkcd comment to mind – whether the randomization test is a “good” test or not for a host of problems, it gives students an introduction to uncertainty (particularly notable when many adults I work with don’t think about it at all). Thus I see a complaint about lack of rigor rather than seeing the benefits of any attempt to experiment with uncertainty at all.

        Do we believe the attempt does more harm than good?

        • Adam,

          “Do we believe the attempt does more harm than good?”

          Compared to what?

          This question needs to be framed in terms of opportunity costs . The algebra curriculum is so packed that everything you include has to push something else out of the way. This lesson comes at the expense of some other arguably better lesson either on the same topic or on something else we would really like students to cover.

        • I think actually for a lot of students, maybe particularly those who aren’t going to college (which is the majority in New York City) some exposure to statistical thinking and the data collection/analysis process is probably a good thing. There are not many times in life when people need to factor a polynomial, but there are lots of times in life when they will see a newspaper article or advertisement with statistical data.

        • Adam:

          My complaint has nothing to do with rigor. The randomization test is rigorous, I just don’t think it’s generally a good idea. Similarly with statistical significance, which, as I’ve written, is typically used as a way to avoid thinking about uncertainty. I think this lesson is teaching bad stuff. Rigor has nothing to do with it.

        • Andrew – what if they used randomization testing to a construct sampling distribution of BetaHat under the null and looked at that instead of just doing a yes/no type hypothesis test? You know – make them do essentially a simulation to see what the potential values of BetaHat are when the difference is known to be 0. Better? Doesn’t that get at the kind of uncertainty we want students to think about? Without almost any “theory” at all – just the idea of random assignment and variation in the data.

        • Jrc:

          There are definitely situations where randomization distributions are useful, I just have a feeling that these situations are not coming up in this lesson.

        • Ok, so given your reservations about the test and statistical significance, what modifications would you recommend in order to make it an effective but simple demonstration of the concept of thinking about uncertainty? It’s easy to say “this is inadequate” but leave as an exercise to the reader to figure out what then might be an acceptable alternative.

        • Adam:

          I like the first part of the plan (“Students carry out a statistical experiment to compare to treatments”). After that, there are many different ways to go which I think would be better than the randomization test and statistical significance. For example, they could do an experiment with multiple measures over time and graph the results. Or they could run a regression to control for pre-treatment differences between the two groups. Or they could simply do a t test, that would be fine too.

        • I am guessing that the t test comes later (I am pretty sure it’s in the common core), but this is supposed to give students a more visceral understanding of the idea that observed differences could be due to chance before they learn to look things up in a table. I assume that’s where things are heading since the teacher material for the later lessons is focused on the difference of two means.

        • “could be due to chance” is one of the most worst memes in applied stats. It assumes that “due to chance” is even well defined, which is not the case in 99% of applied statistics. Certainly rejecting a null hypothesis does not tell me if something is “due to chance” or not.

        • Andrew:

          However evil significance testing is the fact remains that these students are going to enter a world where half the articles they read use significance testing in some form.

          That’s a good enough pragmatic reason to teach it. If you want change, you should focus on the research community that produces these papers.

          Students are too easy a fight in the ideological battles.

        • I think that’s exactly what they are doing. It’s wrong to use the term significant but it’s good critical thinking for students to learn to ask whether they should believe an observed difference reflects reality. There is something similar in the GAISE materials. I think it’s important to remember audience.

        • I agree. This is where I’m at with the curriculum. I get the criticisms about it, but if it engages students to recognize there’s a whole field out there thinking about these things, and ways of thinking about these things, then great. It’s a single lesson, not an entire semester on statistics.

  1. It would be useful if they would have students make these paper helicopters with different wing lengths, then have them plot wing length vs flight time. There is also ample opportunity for learning about variability they do not take advantage of. How variable is the same helicopter when dropped ten times? How does this compare to ten different helicopters you attempted to make using exactly the same process?

    I say cut out all the significance testing and instead do that.

    • Why start with the poor students? Cut it out in the Journals first. Convince your colleagues not to use significance testing first.

      If these students are entering a world where half of every Journal out there is filled with articles that use significance testing in one form or another why this anxiety about teaching them what significance testing is?

      It’s like the long rants out there about how Word and Excel are evil. OK, but that’s still not a good argument to hinder students from learning those if the real world they enter into is going to be full of Word and Excel users.

      • Rahul:

        These are high school students. They don’t have to worry about what’s required to publish a paper in a sociology journal. The point of these curricula is for the kids to learn fundamentals of math, statistics, etc. In any case, I don’t mind students learning skills such as identification of terms such as “statistical significance.” They could even learn how to read simple journal articles. But using randomization distributions “to determine if there is a significant difference,” that’s just silly. I don’t think the “real world” they’re entering will be full of people calculating randomization distributions. It’ll be full of people running regressions, but that’s a different story. I’d be happy with kids being taught how to run regressions.

    • I can’t speak for Andrew, but for me it is because I already know the two sets of helicopters are samples from different populations. Why would I go through all the trouble of cutting them out and folding them just to merely detect a difference? It makes no sense.

      Instead, I wonder (I am actually genuinely curious) what the relationship between wing length and flight time would be. That seems like a better use of time. Anyway, it is not possible to use the randomization test to answer their questions:

      “Questions: Does a 1-inch addition in wing length appear to result in a change in average flight time? If so, do helicopters with longer wing length or shorter wing length tend to have longer flight times on average?
      Carry out a complete randomization test to answer these questions. Show all 5 steps and use the “Anova Shuffle” Applet described in the previous lessons to assist both in creating the distribution and with your computations. Be sure to write a final conclusion that clearly answers the questions in context.”

      First, the different wing lengths were generated by cutting off the ends, not adding to them, so it is misleading to talk about additional length. I do think the process by which the difference was arrived at is important to keep in mind.

      Second, they do not randomize until the time when the helicopters are being dropped. When you are folding and cutting to make these helicopters, it seems plausible that you would become more skilled as time went on. So you need to randomize before shortening the wings, which they don’t do.

      Third, by cutting off the ends of the wings you are also affecting the weight of the helicopter. It may be possible to see a similar effect by cutting a circle in the middle, so attributing the difference to wing length is inappropriate.

      They are teaching children to disprove a null hypothesis and then jump to the conclusion that one preferred explanation is correct. This is awful, that is exactly why there is a reproducibility crisis in science. No one is examining other explanations for their outcomes. I will say that nearly every single medical research paper published makes that mistake, so I know doctors are being misinformed and people are suffering and dying. We should not be teaching children this stuff.

      • Anoneuoid,

        thanks!

        To me, the problem is not with these procedures, but rather with the statistically illiterate scientists that apply statistics in their data in order to find significant results, in order to get them published in some scientific Journal, in order to progress in their academic careers, etc. The solution for this modern problem is not prohibiting some types of statistics in favor of another, the chain of statistical misuse will certainly continue. Notice that whatever procedure you propose, I can surely find its limitations, i.e., I can find the domains where it fails. But that reasoning should not be used to banish your procedure from all domains.

        Statistics is a very complicated logical language with very peculiar rules of inferences. These rules of inferences vary with the adopted principles, just like all the mathematical logics and all different set theories we have today*. Saying that “that type of inference is wrong, avoid it!” or “my inferential procedure is better, use it!” just perpetuate the illiteracy among statisticians and practitioners. One way, which seems to be far better than those dichotomic advices, is to learn the limitations of the inferential rules and teach them. Every inferential rule has its own bright and dark sides. I think our main task, as statisticians, is to study the domains of applicability of statistical approaches.

        * Notice that almost all mathematics that are taught are based on the ZFC set-theory. The function definition depends on the definition of a set, properties of natural numbers depend on the definition of a set and the hidden adopted mathematical principles, the definitions of integrals and derivatives depend on the definition of limit, and so on… However, there are many other definitions for sets, for limits and so on… There are many ways to justify an apparent wrong procedure.

        • At least according to Fisher these issues were present before significance testing was adopted. If that is true, the significance/hypothesis testing procedures simply formalized the bad idea:

          ‘For some reason, which the author cannot attempt to explain, even simple statistical methods are not widely taught at the present time, although those needed are certainly not more difficult than much of the same kind which is included in the arithmetic books, and is therefore a part of general knowledge. Moreover, the concepts which these simple methods are fitted to clarify pervade everyday speech, and appear universally in all the observational sciences. We need only recall such phrases as “highly exceptional,” “relatively constant,” ” increases the probability,” “adds to our information,” “chance fluctuations,” and so forth.’
          http://www.jstor.org/stable/2844116

          On the other hand you can find plenty of people blaming statistics going back into the 1950s… so I don’t know. My position is that you cannot be confident you understand something until your theory can make precise predictions that are consistent with future data. Even then, you can have multiple competing theories, but it is much more difficult to come up with them.

        • As I understand, in this excerpt Fisher says that statistical concepts can be used to clarify everyday speech like “highly exceptional”, “relatively constant” and so on that appear frequently in sciences. He does not understand why anthropologists and biologists are so unfamiliar with statistical concepts.

          If we are to blame significance tests, we must have to blame also logics applied to observable events, numbers used to represent observable events, probability used to model uncertain events and so on. In a significance test, we have a (sharp or non-sharp) null hypothesis that purports to represent a scientific statement (remember that significance tests do not specify an alternative hypothesis for several good reasons). In this translation process (scientific to statistics), we use numbers, probability models, we also apply the mathematical principle modus tollens and many others. We do nothing more than any other applied mathematician must do.

          In many cases, a specific null hypothesis does not represent the scientific statement of interest, but it is not a problem with significance test per se. We shouldn’t use a bunch of examples of bad translations (or of bad uses of significance tests) to blame significance tests, these bad translations (or bad uses) only denunciate the translator’s (practitioner’s) limitations. We can always find a number of problems in any translation process, that is part of using whatever linguist model (equipped with precise and internally consistent rules) in the sensible world.

      • Regarding randomization: AFAIK the experimental design principle is “balance what you know, randomize what you don’t”. Balancing by alternating the helicopter type over order of creation rules out unbalanced assignments so that the effect of, e.g., learning over time is minimized, not merely low in expectation.

        • Alternating makes sense, but that doesn’t seem to be what they’re doing:

          Exercise 1: Build the Helicopters In preparation for your data collection, you will need to construct ???????? paper helicopters following the blueprint given at the end of this lesson. For consistency, use the same type of paper for each helicopter. For greater stability, you may want to use a piece of tape to secure the two folded body panels to the body of the helicopter. By design, there will be some overlap from this folding in some helicopters. You will carry out an experiment to investigate the effect of wing length on flight time
          a. Construct ???????? helicopters with wing length = ???? inches, body length = ???? inches. Label ???????? each of these helicopters with the word “long.”
          b. Take the other ???????? helicopters and cut one inch off each of the wings so that you have ???????? helicopters with ???? inch wings. Label each of these helicopters with the word “short.”
          c. How do you think wing length will affect flight time? Explain your answer.

      • > I already know the two sets of helicopters are samples from different populations. Why would I go through all the trouble of cutting them out and folding them just to merely detect a difference?

        I don’t understand your point. In a clinical trial they know that some patients took the drug and some the placebo, and they go through all the trouble to see if one group outlived the other. In any case, the objective of the exercise is not really to learn aerodynamics but how to perform and analyze experiments.

        > First, the different wing lengths were generated by cutting off the ends, not adding to them, so it is misleading to talk about additional length. I do think the process by which the difference was arrived at is important to keep in mind.

        I don’t see a problem, I read it as “addition in wing length *to the design*”. They could provide two blueprints and the result would be essentially the same, don’t you agree? Why is it important to keep in mind how the difference was arrived at?

        > Second, they do not randomize until the time when the helicopters are being dropped. When you are folding and cutting to make these helicopters, it seems plausible that you would become more skilled as time went on. So you need to randomize before shortening the wings, which they don’t do.

        Maybe they could randomize on both procedures, but given the simplicity of the design it’s not unreasonable to focus the discussion on the measurement part (and the possibility of randomizing the building step might arise in the discussion with students ). In fact, given that we want to control mainly for changes in the environment or how the dropping/timing operations are performed as the experiment goes on (and not on bias related to the length of the wing, there are no provision for blinding the dropper and the timer), alternating both classes would be more efficient but the randomization procedure is more general and worth learning.

        > Third, by cutting off the ends of the wings you are also affecting the weight of the helicopter. It may be possible to see a similar effect by cutting a circle in the middle, so attributing the difference to wing length is inappropriate.

        The effect of the lower weight in the short-winged helicopters will be to make them fly longer. If we find that a 1-inch addition in wing length increases the flight time it will not be because of the increase in weight compared to the modified design but in spite of it.

        Apart from the existence of a confounding variable, you don’t really address why it is not possible to use the randomization test to answer their questions. It provides an easy way to calculate a p-value (which is what they do, without naming it ). Andrew thinks they “could simply do a t test”, but I think high-school students (and anyone, really) would find the concept of a randomization test easier to grasp.

        • Anon:

          No, the point of a medical experiment is not to see “if one group outlived the other.” There’s no doubt that one group will outlive the other. What is interesting is, by how much, and who is doing the outliving.

        • The students are asked not only if changing the wing length affects the flight time, but also which design has longer flight times and how significant is the difference (probability of obtaining a “Diff” more extreme than the value from the experiment).

          This is pretty similar to the clinical trials that support drugs approvals all the time. According to the FDA guidance for clinical endpoints in oncology “demonstration of a statistically significant improvement in overall survival can be considered to be clinically significant if the toxicity profile is acceptable.”

        • “Apart from the existence of a confounding variable, you don’t really address why it is not possible to use the randomization test to answer their questions.”

          The problem is that the students cannot answer the question for at least the two reasons that I mentioned. They do not know the effect of weight, and they did not randomize the order of folding the helicopters. Asking them to answer that is a trick question, and I do not think it will be graded as if it was.

          If you want them to make a justifiable conclusion that it was the shorter wings that affected flight time, they need to do experiments to rule out all the other possibilities they can come up with. That question has the logic of science backwards.

        • I don’t think they are asked to answer a trick question. They are asked to determine the effect of a 1-inch addition in wing length when the intervention operates on that variable. They are also asked to discuss what other characteristics of the helicopter they could modify to perform further experiments in flight times.

          You seem bothered by the effect of the intervention on weight and its effect on flight times. A bigger issue would be that the main driver behind the longer flights in the experiment is actually the increase in wing surface. I think we would see shorter flight times if the wing width was reduced to keep the wing surface constant.

          And of course there are many other potentially relevant variables that change (v.g. location of the center of gravity, inertia tensor, ratios between dimensions) and the more you try to fix something the more you change something else. A trick question would be to ask for the effect of a change in length when everything else is kept fixed.

        • I agree, there are endless possible explanations for the mere presence of a difference. However, I’ve learned we only really need one to turn the discussion into one about plausible explanations for an effect of the observed type and size. In my experience, this is always the case.

          Here is the quote we are discussing:
          “Does a 1-inch addition in wing length appear to result in a change in average flight time? If so, do helicopters with longer wing length or shorter wing length tend to have longer flight times on average?
          Carry out a complete randomization test to answer these questions.”

          To emphasize: ***Carry out a complete randomization test to answer these questions.*** I do not believe a randomization test can answer those questions. I would appreciate it if you could write down the entire chain of logic you are using that makes you think differently. As far as I know there is no academic paper that has ever even dared to attempt explaining the logic of this procedure (other than those putting it in a negative light).

          The other issue about short/long wing length are just the names of the conditions, it doesn’t mean it is actually the *wing length* that matters… I’ll have to think more about what exactly the problem is with that, but it strikes me as something that will cause confusion. Either way, it is the type and size of the effect that allows us to distinguish between different influences, not the mere presence of a difference.

        • 1] build the 10 helicopters in group A (longer wing length = unmodified design) and the 10 helicopters in group B (shorter wing length = mutilated wings)
          2] measure the flight times of the 20 helicopters (the document instructs students to randomize the test order to reduce the bias coming from changes in the operating environment, but this is irrelevant for this discussion)
          3] calculate the difference between the average flight time for group A and the average flight time for group B (let’s assume this different is positive)
          4] find the distribution of this statistic under the null hypothesis (the treatment has no effect) by calculating the difference on the averages between the first ten elements and the last ten elements of the permutations of the set {a_1,…,a_10,b_1,…,b_10} (resample enough times to get satisfied with the stability of the distribution or calculate explicitly the 184,756 ways to assign 10 measurements to each group)
          5] compare the observed difference calculated in 3] with the distribution calculated in 4]
          6a] If the observed difference is not particularly remarkable (for example 40% of the differences under the null hypothesis are bigger than the observed one) the answer is “we observed the helicopters in group A to fly longer than those in group B, but the difference is small compared to the variability between individuals and we don’t have enough evidence to say it was caused by the modification in the design.”
          6b] If the observed is quite remarkable (for example, no permutation leads to a difference higher than the observed one because the shorter flight in group A was longer than the longer flight in group B) the answer is “yes, there is a strong evidence for the difference in the design of the helicopters in groups A and B leading to longer flight times in the first group.”
          6c] One could even use the same arbitrary threshold that grown-ups use to take multi-million dollar decisions, determining to which percentile of the randomized distribution corresponds the observed difference and if the corresponding p-value is less than 0.05 say loudly and confidently “our improved design leads to a statistically significant increase in flight time.”

          I’m not sure at what point do you think things become so illogical that no one would dare to attempt explaining them.

          https://projecteuclid.org/download/pdfview_1/euclid.ss/1113832732

        • “yes, there is a strong evidence for the difference in the design of the helicopters in groups A and B leading to longer flight times in the first group.”

          This is the step I would like clarified, it makes no sense to me. First, I note you have backed off the original claims that “additional wing length” specifically was responsible. However, even that weakened claim is problematic. Even if God told me group A has longer flight times than group B, it is still problematic.

          You appear to be saying if the design affected the flight time we would see a difference between groups A and B. We saw a difference between groups A and B, therefore the design affected the flight time. This is simple affirming the consequent.

          Scientific reasoning is different. In science, you measure the size of the difference, then rule out various explanations for a difference of that size. For example, the order in which the helicopters were folded. What is the biggest effect you observe due to that? What about the slight differences in weight of the paper used? If we vary that, how much of an effect do we see? I can go on but you get the point. Then if the difference is still too large to be explained by all these possible influences, you conclude that the best explanation has something to do with the design.

          In other words, the scientific approach to comparing two conditions uses a series of modus tollens arguments to rule out possibilities until only one explanation remains. And remember, by “longer flight time” we are talking about any value greater than zero, so any way of messing up the experiment has 50% chance of resulting in longer flight time.

          It is true that this procedure is one that “grown-ups use to take multi-million dollar decisions”, I consider that a big problem. Luckily there are some people still using the scientific method for whom the significance test is only a spurious activity, so some progress gets made despite (or regardless) of what seems to me clearly flawed logic.

        • (For some reason, maybe a depth limit, I cannot reply to the last comment in the thread)

          Let’s remember the objective of the whole exercise, which consists of lessons 28 (where students perform experiments to observe the effect of changes in body length and wing length) and 29 (where they create a poster summarizing the experiment and the results).

          Drawing a Conclusion from an Experiment
          – Students carry out a statistical experiment to compare two treatments.
          – Given data from a statistical experiment with two treatments, students create a randomization distribution.
          – Students use a randomization distribution to determine if there is a significant difference between two treatments.

          “You appear to be saying if the design affected the flight time we would see a difference between groups A and B. We saw a difference between groups A and B, therefore the design affected the flight time. This is simple affirming the consequent. Scientific reasoning is different. In science, you measure the size of the difference, then rule out various explanations for a difference of that size.”

          Imagine the treatment was to paint ten helicopters red and the rest green. You measure the average flight times and they are different (or you’re incredibly lucky). How would you proceed to rule out various explanations for a difference of that size?

          Here we try to rule out “chance”, if we can do it we say the difference is “significant”. It’s likely that the difference for the difference in color will be non-significant. It’s likely that the difference for the change in wing length (with the corresponding changes in volume, surface, weight, inertia tensor, capacitance, torsional rigidity, drag coefficients, vibration modes and anything else that changes) is significant.

          What other explanation would you like to rule out? In principle, they are identical by design (except for all the things listed before that are different by design). The question is whether helicopters of model A fly longer than helicopters of model B, not why.

          Earlier you said “It would be useful if they would have students make these paper helicopters with different wing lengths, then have them plot wing length vs flight time.” In what sense would it be useful? Does it allow them to extract any conclusion from the experiment? If yes, why wouldn’t they be able to extract a conclusion when the number of different lengths is two?

          By the way, I didn’t backed off the original claims that “additional wing length” specifically was responsible. I already mentioned how the treatment is an intervention on this variable. Additional wing length here should be understand as all the helicopters have an identical body, in half of them the wings are 2″ wide x 4″ long while in the others the wings are 2″x3″. How would you determine the specific effect of additional wing length? How would you even define “specific effect of additional wing length”?

          It seems you don’t like statistical inference in general. I see your point. In six years of undergraduate/graduate courses in Physics, I did never encounter statistical inference. I had a semester of Statistical Mechanics and in the lab we were told about measurement errors and error propagation and we would of course calculate averages and standard deviations. But no exposure that I can remember to sampling theory, hypothesis testing and all this frequentist voodoo. If your experiment needs statistics, you ought to do a better experiment.

          For what is worth, I find Jaynes’ approach to statistical inference much more satisfying than orthodox methods: better foundations and broader applicability. Maybe the problem with frequentist methods is that they are so widely misunderstood that it’s too easy to misuse them (either by choice or by mistake). They enable the publication of implausible results and the approval of useless (or worse, harmful) drugs, but it would be unfair to say that science in general (from high energy physics to stamp collecting) has progressed despite of them. And hopefully the non-parametric permutation test will indeed allow NYC high-school students to understand significance testing better (better that they would if they had been told to apply a t-test, that’s for sure).

        • >”Imagine the treatment was to paint ten helicopters red and the rest green. You measure the average flight times and they are different (or you’re incredibly lucky). How would you proceed to rule out various explanations for a difference of that size?”

          I do believe this would have an effect because the paints will not have the same densities or not adhere to the paper exactly the same, etc. We still need to measure the same things mentioned earlier. For example, fold the 20 airplanes, paint them all red, call the first ten group A and second ten group B. Then measure flight time, repeat this enough to get a sense of the distribution. Is this difference due to order effects consistent with that observed by painting half, red half green, or not? Repeat for all the possibilities you can. And I think significance tests are fine (but probably not optimal) for that purpose, although doing this is a fools errand anyway (more on that later).

          >”It seems you don’t like statistical inference in general”

          This is incorrect. I do not understand the logic of ruling out a null hypothesis of zero effect and then jumping to any conclusion beyond “data in column A is different from data in column B”. That jump has never been justified by anyone, and it involves a clear logical fallacy (affirming the consequent). Despite this departure from the accepted rules of valid reasoning, the practice was adopted without any discussion as a central aspect of experimental design and data interpretation. I do dislike that, I can’t think of a better recipe for confusing intelligent people, wasting their lives, and destroying public goodwill towards science.

          >”The question is whether helicopters of model A fly longer than helicopters of model B, not why.”

          This is *still* going too far. You cannot even attribute the presence of a difference to the model, it could have been due to the differences in paper weight, learning to fold and cut better, how much tape was used, anything. Remember, we are talking about the presence of any difference at all. Even .00001 seconds difference (true underlying difference, not the measured) rules out the null hypothesis of chance.

          >”Earlier you said “It would be useful if they would have students make these paper helicopters with different wing lengths, then have them plot wing length vs flight time.” In what sense would it be useful? Does it allow them to extract any conclusion from the experiment? If yes, why wouldn’t they be able to extract a conclusion when the number of different lengths is two?”

          Because it is impractical to rule out every possible reason for the mere presence of an increase/decrease in flight time, the students will never be able to make any justifiable conclusion using this two group comparison method. That type of experimental design is inappropriate for most lines of research. Instead, the thing to do is vary one thing at a time and plot it against the outcome. Does it look like a line, some kind of curve, what? Why would it have that shape? Then come up with a theory to explain this and use it to predict what will happen if you do a new experiment. Then the prediction of the model is the null hypothesis.

        • My previous reply was a bit too long, please let me provide a shorter argument with a single question.

          I give you the blueprints for two models of helicopter A and B, detailed instructions on how to fold and drop the helicopters, a ream of paper, a pair of scissors, a stopwatch, and 30 minutes. I would like to build myself several helicopters and I want their average flight time to be as high as possible. I want to know if I should go for model A or model B, or if it doesn’t matter.

          The students can follow the procedure described in the document to tell me what model do they think I should use and how confident are they in their conclusion. The procedure might based on flawed logic, but their answer will be helpful more often than not.

          What can you say using scientific reasoning?

        • I didn’t notice you had replied in the meantime.

          I understand you main point, you prefer to keep adding data and refining the model instead of drawing a conclusion on what you have. I don’t think you can “vary one thing at a time”, though.

          Now, you say you don’t dislike statistical inference. Can something be inferred from the described experiment or not? Could something be inferred from an improved experimental design?

          I really don’t understand what’s your position on this. If making an inference from twenty data points covering two different values for one single variable is nonsense (because there might be hidden variables and random noise), what does substantially change that would allow you to make an inference if you had millions of measures covering hundreds of values for dozens of variables?

          “You cannot even attribute the presence of a difference to the model, ” Is there any point down your scientific reasoning process at which you might be able to say if model A is expected to fly longer than model B? Is statistical inference possible at all?

        • >”I give you the blueprints for two models of helicopter A and B, detailed instructions on how to fold and drop the helicopters, a ream of paper, a pair of scissors, a stopwatch, and 30 minutes. I would like to build myself several helicopters and I want their average flight time to be as high as possible. I want to know if I should go for model A or model B, or if it doesn’t matter.”

          If that is all the information it is possible to gather, then just go with the condition that had longest average flight time, whether a difference of a given size “matters” is up to your requirements (do you care about .00001 second difference?). I would not be confident in this decision unless the difference was large enough to be unaccountable for reasons other than the different model.

          In realistic scenarios the recommendation would depend upon how consistent the process of construction can be (cutting the wings adds an additional step where something could go wrong), the cost of construction and maintenance, the durability of the helicopters, and would involve testing the two models under the various conditions that they may be used. If for some reason not all this data could be collected, a rough estimate needs to be made from prior information.

          >”The students can follow the procedure described in the document to tell me what model do they think I should use and how confident are they in their conclusion.”

          The significance test provides no justification for deciding a difference “matters”, since this depends on what effect size would be important. They also cannot know how confident they should be without considering how much other factors may have affected the result.

          >”Can something be inferred from the described experiment or not?”

          You can say the reported flight times of helicopters in group A were x seconds longer than group B on average. You can also get an idea of the variability for that batch of helicopters under those conditions. With additional assumptions or outside information, much more can be said.

          >”Is there any point down your scientific reasoning process at which you might be able to say if model A is expected to fly longer than model B?”

          Yes, if the effect size is too large to be accounted for by other factors or you have some model that precisely predicted an effect consistent with the observation.

          >”Is statistical inference possible at all?”

          Yes, how else would you compare a theoretical prediction or expected effect size to data?

          But let’s not get side tracked discussing what I think should be done instead. Even if my alternatives are incorrect, that is no excuse for using a logically invalid procedure. I claim choosing a strawman for the null hypothesis (as is usual for the comparison of two groups) can only lead to affirming the consequent fallacies. If it actually does not do that, I want someone to publish a paper explaining exactly why not and under what conditions.

        • > If that is all the information it is possible to gather, then just go with the condition that had longest average flight time, whether a difference of a given size “matters” is up to your requirements (do you care about .00001 second difference?). I would not be confident in this decision unless the difference was large enough to be unaccountable for reasons other than the different model.

          Imagine the average times are 4s and 5s. In which one of the following scenarios would you be more confident in this decision?
          1) a={4,4,4,4,4,4,4,4,4,4} b={5,5,5,5,5,5,5,5,5,5}
          2) a={5,3,4,7,4,3,2,4,6,2} b={8,2,5,3,7,2,9,4,6,4}

          > The significance test provides no justification for deciding a difference “matters”,

          It provides a tool that can help to do so. How did you arrive to the decision above, and the confidence you put in it?. Was it done using statistical inference or by some other mechanism?

          > ”Is there any point down your scientific reasoning process at which you might be able to say if model A is expected to fly longer than model B?”
          > Yes, if the effect size is too large to be accounted for by other factors or you have some model that precisely predicted an effect consistent with the observation.

          So when you run out of imagination and cannot think of yet another factor that could explain the observed effect, you call it a day and the logically invalid procedure becomes valid? And in the case were you have a model, how do you measure consistency?

          >”Is statistical inference possible at all?”
          > Yes, how else would you compare a theoretical prediction or expected effect size to data?

          Are there any published papers explaining exactly how, why and under what conditions? Honest question.

        • “So when you run out of imagination and cannot think of yet another factor that could explain the observed effect, you call it a day and the logically invalid procedure becomes valid?”

          No, you wait until someone comes up with another explanation then compare to that. And no I would not consider values of rep(4,10) rep(5,10) to be credible. Anyway, reread what I suggested, it involves comparing the observed outcome to distributions for each explanation in order to rule each out or account for them.

          The problem is not with how to best make a decision about what size difference is meaningful, there is no general algorithm for that because it depends upon each situation. The problem is with using invalid logic to jump from the presence of a difference to an explanation. Afaik, that issue is first explained here:
          https://ebooks.adelaide.edu.au/a/aristotle/sophistical/book1.html

    • a better wording, for me at least would be “to determine how big the difference is between two treatments”.

      “if there is a significant difference between two treatments” is rarely something we want to know, especially when “significant” means “statistically significant”

  2. I think it’s in algebra because the common core requires students to get through bivariate linear regression by the end of 12th grade, yet in the traditional math curriculum of algebra/geometry/algebra 2- trig/ calculus there is no space for a statistics class. So they stick bits of the statistics in different places.

Leave a Reply

Your email address will not be published. Required fields are marked *