Just for reference, if we could have some "gold standard(s)", we could also evaluate the reliability of the "test" by sensitivity/specificity analysis; in this kind of audio ABX test, however, the concept of sensitivity/specificity would not be directly applicable, right? (I have worked in medical imaging diagnosis R&D for many years.)

View attachment 149669

It's interesting how statistics has different flavors depending on the field of application. I was not familiar with those quantities, although that doesn't mean much, as I don't have that much experience.

To clarify, this is not ABX testing we are talking about in this thread. Here we just have scores for 4 different speakers, broken down by factors such as the song and the listener, and we can look at several things.

An ABX test consists of a series of Bernoulli trials, in which the outcome of each trial is A or B, corresponding to the subject stating that the sound sample he is hearing corresponds to A or B. If the subject gives the right answer in 9 out of 10 trials, we can say that the probability that he has been able to differentiate the two sound samples by chance is low enough to state that he is able to tell them apart. We know beforehand how many times the subject has to get it right, because the level of the test does not depend on the data but on the definition of the test.
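To put numbers on the 9-out-of-10 example, here is a minimal sketch of the binomial arithmetic. It assumes the subject guesses with probability 0.5 under the null hypothesis and a 5% level; the function names `tail_probability` and `critical_count` are just made up for illustration.

```python
from math import comb


def tail_probability(successes: int, trials: int, p_guess: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p_guess): the one-sided p-value under pure guessing."""
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def critical_count(trials: int, level: float = 0.05, p_guess: float = 0.5) -> int:
    """Smallest number of correct answers whose tail probability is at most `level`."""
    for k in range(trials + 1):
        if tail_probability(k, trials, p_guess) <= level:
            return k
    return trials + 1  # no count reaches the level (too few trials)


p_value = tail_probability(9, 10)
threshold = critical_count(10)
print(f"P(at least 9 of 10 by chance) = {p_value:.4f}")
print(f"correct answers needed at the 5% level: {threshold}")
```

Note that `critical_count` only uses the number of trials and the level, not the answers themselves, which is the sense in which the threshold is known before the experiment.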

The amount of false positives accepted is determined by the level of the test. But we have to be precise about what we are testing. The null hypothesis of those t-tests, whether they come from a simple approach or from a linear model, is that the sample means obtained come from the same distribution and the differences are due to chance. When we get a low p-value, we can reject this null hypothesis, knowing that the probability of a false positive is at most the level of the test (strictly speaking, the p-value is the probability of seeing data at least this extreme if the null hypothesis were true).

The approach of @Semla is more ambitious, as it builds a model to predict future scores.

The discussion with @Semla is more about which assumptions can or cannot be made than about the nature or reliability of the tests themselves. He/she proposes a linear model with dummy variables because it accounts for all relations between the different factors, which is indeed very reasonable.
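For anyone not familiar with dummy variables, this is roughly what the coding looks like. It is only a sketch of the idea, not the actual model from the thread: the level names are hypothetical, only main effects are included (no interaction columns), and the first level of each factor is taken as the reference.

```python
def dummy_code(value, levels):
    """One 0/1 indicator per level except the first, which acts as the reference."""
    return [1 if value == level else 0 for level in levels[1:]]


# Hypothetical factor levels, for illustration only.
speakers = ["S1", "S2", "S3", "S4"]
songs = ["song_a", "song_b"]
listeners = ["L1", "L2", "L3"]


def design_row(speaker, song, listener):
    """Intercept followed by the dummy columns of each factor (main effects only)."""
    return (
        [1]
        + dummy_code(speaker, speakers)
        + dummy_code(song, songs)
        + dummy_code(listener, listeners)
    )


row_reference = design_row("S1", "song_a", "L1")
row_other = design_row("S3", "song_b", "L2")
print(row_reference)
print(row_other)
```

Each score gets one such row, and the fitted coefficient of a column is the estimated shift relative to the reference level of that factor.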

Please, correct me if I'm wrong.

Edit: I forgot to answer whether the coefficients you point to can be applied here.

The answer is no, because they are based on real observations after the experiment, measuring all four possible outcomes and calculating some ratios. But that doesn't make sense in this context, as there aren't outcomes since there is no treatment effect. It's a different setting.
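For reference, this is how I understand those ratios are computed once the four outcomes have been counted against a gold standard. The counts below are invented purely to show the arithmetic; they are not data from this thread.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly positive cases that the test flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of truly negative cases that the test flags as negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts of the four possible outcomes (illustration only).
tp, fn, fp, tn = 45, 5, 10, 40

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Both ratios need the true status of every case, which is exactly what the speaker scores here do not have.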