The full version of the super-stimulus theory states that musicality is calculated by the listener across multiple aspects of speech perception as an estimate of a particular aspect of the internal mental state of the speaker.
One way to test this theory would be to compute the correlation between computed musicality and a direct estimate of the relevant aspect of mental state.
There are two major difficulties with this approach:
The greatest difficulty seems to be that of identifying and measuring the relevant aspect of mental state. The most likely candidate for this aspect is "conscious arousal", but even assuming that we know exactly what that means, it is not obvious how to measure it, particularly at the same time as capturing speech output under "natural" conditions.
However, it is an assumption of the theory that each component of musicality is separately correlated with the current level of conscious arousal, or with whatever other aspect of mental state is correlated with musicality. If that is so, the components should also be correlated with each other. Therefore, it may be sufficient to calculate separate components of musicality on a speech corpus and then calculate the correlations of these components with each other. This avoids the whole problem of attempting to measure or estimate an uncertain and possibly unidentified aspect of a speaker's mental state.
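As a minimal sketch of this kind of test, suppose each utterance in a corpus has already been given a score for two components of musicality. The function name `pearson` and the variable names `scale_scores` and `beat_scores` are hypothetical, and the use of the Pearson correlation coefficient is an assumption made for illustration; the theory itself does not prescribe a particular correlation measure, nor how the component scores are obtained.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    xs and ys are assumed to hold one score per utterance for two
    different (hypothetical) components of musicality.
    """
    if len(xs) != len(ys):
        raise ValueError("both components need one score per utterance")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two utterances are required")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)

    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant component")

    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Placeholder numbers only, not measurements from any corpus.
    scale_scores = [0.2, 0.5, 0.1, 0.9, 0.4]
    beat_scores = [0.3, 0.6, 0.2, 0.7, 0.5]
    print(pearson(scale_scores, beat_scores))
```

A value near zero across a large corpus would count against the theory, while a consistently positive value would be compatible with it, provided the two components are not correlated merely because of how they were computed.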
A certain amount of guesswork is still required to calculate two or more aspects of musicality. Fortunately it is not necessary to calculate all aspects of musicality – it is only necessary to choose those aspects that can most easily be identified from specific and universal (or near universal) features of music. For example, the sub-theories of musical scales and regular beat (as detailed in my book) provide straightforward candidates for the computation of the corresponding aspects of musicality.
One minor difficulty is that different computed components of musicality could be correlated with each other simply because they are derived in a similar way from the input data. A good way to avoid this is to choose aspects that are not obviously related, for example a pitch-related aspect and a time-related aspect, such as those related to musical scales and regular beat, as just mentioned.
Capturing a suitable body of "normal" speech is itself a potential problem. One way to get people to provide speech for research purposes is to sit them down and get them to read text, or perhaps to answer questions. The trouble is that the artificiality of such procedures and the self-consciousness of the subjects may eliminate those subtle aspects of speech which one is hoping to analyse.
Short of doing something unethical, like recording subjects who do not know they are being recorded, it may be necessary to contrive some task which requires people to talk to each other about things without having the time to be self-conscious about what they say. Or you could use the reality TV approach of recording subjects so continuously that they cease to be affected by their awareness of being recorded.