It is science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation. (Siegfried, 2010)
The problem is not that people use p-values poorly; it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis. (Leek, 2014)
The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions.
The following widely agreed principles clarify how p-values should be used and interpreted:
P-values can indicate how incompatible the data are with a specified statistical model. The incompatibility can be interpreted as casting doubt on, or providing evidence against, the null hypothesis or the underlying assumptions.
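To make "incompatibility with a specified model" concrete, here is a minimal sketch of an exact two-sided binomial test. The data (60 heads in 100 tosses), the fair-coin null model, and the function names are all assumptions chosen for illustration; they do not come from the notes.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials under success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binomial_p_value(k_observed, n, p_null=0.5):
    """Sum the probabilities of all outcomes that are no more likely than the
    observed one, assuming the null model (success rate p_null) is true."""
    p_observed = binomial_pmf(k_observed, n, p_null)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    total = sum(
        pmf
        for pmf in (binomial_pmf(k, n, p_null) for k in range(n + 1))
        if pmf <= p_observed * tolerance
    )
    return min(1.0, total)


# Hypothetical data: 60 heads in 100 tosses, with a fair coin as the null model.
p_value = two_sided_binomial_p_value(60, 100)
print(f"p-value = {p_value:.4f}")
```

The result depends entirely on the model that was specified: changing `p_null`, or the assumption that tosses are independent, changes the p-value even though the data stay the same.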
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. A p-value is a statement about data in relation to a specified hypothetical explanation; it is not a statement about the explanation itself.
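A short arithmetic sketch shows why a p-value below 0.05 is not the same as a 5% chance that the null hypothesis is true. The inputs (90% of tested hypotheses are true nulls, a 0.05 threshold, 80% power) are hypothetical values picked for illustration, and the function name is an assumption.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """Share of 'significant' results that come from true null hypotheses,
    computed with Bayes' rule from assumed long-run rates."""
    false_positives = prior_null * alpha       # true nulls that pass the threshold
    true_positives = (1 - prior_null) * power  # real effects that pass the threshold
    return false_positives / (false_positives + true_positives)


# Hypothetical scenario: 90% of tested hypotheses are true nulls,
# threshold alpha = 0.05, and studies detect real effects 80% of the time.
share = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(f"P(null is true | p < 0.05) = {share:.2f}")
```

Under these assumed rates, 0.045 / (0.045 + 0.08) = 0.36, so about 36% of significant results would come from true nulls. The figure moves with the assumed prior and power, which is exactly the information a p-value does not contain.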
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. Other factors need to be considered: (1) the design of the study, (2) the quality of the measurements, (3) the external evidence for the phenomenon under study, and (4) the validity of the assumptions that underlie the data analysis.
Proper inference requires full reporting and transparency. Researchers should also disclose the number of hypotheses explored, all data collection decisions, all statistical analyses conducted, and all p-values computed. Avoid cherry-picking (also known as data dredging, significance chasing, significance questing, selective inference, and p-hacking).
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. Smaller p-values do not necessarily imply the presence of larger or more important effects. Larger p-values do not imply a lack of importance or even a lack of effect. Any effect, no matter how tiny, can produce a small p-value if the sample size or measurement precision is high enough, and large effects may produce unimpressive p-values if the sample size is small or measurements are imprecise. (Open question: does a large p-value indicate uncertainty in drawing a conclusion from the data under the current model?)
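The dependence of the p-value on sample size can be sketched with a two-sample z-test. This is a simplified illustration that assumes a known common standard deviation, equal group sizes, and an observed difference exactly equal to the stated effect; the effect sizes, group sizes, and function name are hypothetical.

```python
from math import erfc, sqrt


def two_sample_z_p_value(effect_in_sd_units, n_per_group):
    """Two-sided p-value for a difference in means, measured in units of the
    (assumed known) common standard deviation, with n_per_group per arm."""
    z = effect_in_sd_units * sqrt(n_per_group / 2)
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return erfc(abs(z) / sqrt(2))


# Tiny effect (0.01 SD) measured on a huge sample.
p_tiny_effect = two_sample_z_p_value(0.01, 1_000_000)

# Large effect (0.8 SD) measured on a very small sample.
p_large_effect = two_sample_z_p_value(0.8, 5)

print(f"tiny effect, n = 1,000,000 per group: p = {p_tiny_effect:.2e}")
print(f"large effect, n = 5 per group:        p = {p_large_effect:.3f}")
```

In this sketch the negligible effect yields the far smaller p-value, so ranking results by p-value would rank them in the wrong order of practical importance.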
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. Other available methods should also be tried: confidence, credibility, or prediction intervals (?); Bayesian methods; likelihood ratios (?) or Bayes factors; decision-theoretic modeling (?); and false discovery rates (?).
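As one example of the alternatives listed, false discovery rate control can be sketched with the Benjamini-Hochberg step-up procedure, which is one common way to control the false discovery rate and not necessarily the one the notes have in mind. The ten p-values and the target rate `q = 0.05` are made up for illustration, and the function name is an assumption.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    when the false discovery rate is controlled at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at most k * q / m.
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values from ten tests run on the same data set.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive_count = sum(p < 0.05 for p in p_values)
bh_count = sum(benjamini_hochberg(p_values, q=0.05))
print(f"rejected with p < 0.05 alone: {naive_count}")
print(f"rejected with Benjamini-Hochberg: {bh_count}")
```

With these made-up values, five tests pass the plain 0.05 threshold but only two survive the procedure, which shows how reporting the number of hypotheses explored changes what counts as a finding.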