The interobserver reliability of clinical relevance in medical research

Published: January 18, 2022


      • Interobserver reliability was weak when clinical relevance was judged from the observed difference (OD) and its confidence interval (CI).
      • The p-value did not distinguish results that physicians rated as clinically relevant from those they did not.
      • The novel OD/CI ratio differed significantly between clinically relevant and non-relevant results.
      • The OD/CI ratio identified clinical relevance with a better combination of sensitivity (SN) and specificity (SP) than the p-value.



      A measure of effect size, such as the observed difference (OD) and its 95% confidence interval (CI), is necessary to determine the clinical relevance (CR) of research findings. The purpose of this paper is to determine (1) the interobserver reliability (IOR) of judging CR when presented with only the OD and CI and (2) whether a ratio of OD to CI (OD/CI) has a stronger association with CR than the p-value.
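      The abstract does not spell out how the OD/CI ratio is computed. As a minimal sketch, the Python function below assumes it is the absolute observed difference divided by the width of the confidence interval (upper bound minus lower bound). The function name, this definition, and the example numbers are assumptions for illustration, not the authors' stated method.

```python
def od_ci_ratio(od: float, ci_lower: float, ci_upper: float) -> float:
    """Ratio of the observed difference to the CI width.

    ASSUMPTION: the ratio is |OD| / (upper - lower). The source abstract
    only says "a ratio of OD over CI" and does not give a formula.
    """
    width = ci_upper - ci_lower
    if width <= 0:
        raise ValueError("ci_upper must be greater than ci_lower")
    return abs(od) / width


# Hypothetical study result: OD = 5.0 with a CI of 1.0 to 9.0.
example_ratio = od_ci_ratio(od=5.0, ci_lower=1.0, ci_upper=9.0)
print(example_ratio)
```

      Under this assumed definition, a narrow interval around a large difference gives a large ratio, while a wide interval around a small difference gives a ratio near zero.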


      A survey presenting the OD and CI from 21 studies was sent to 36 physicians, of whom 21 responded. Respondents were asked to classify each result as clinically relevant or not clinically relevant.
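      Agreement among several raters who each give a binary rating, as in this survey design, is often summarized with Fleiss' kappa. The abstract does not say which kappa variant was used, so the sketch below is one plausible approach rather than the authors' analysis. The function name and the small rating table are hypothetical.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical table: 4 study results, 5 raters each,
# columns are [clinically relevant, not clinically relevant].
ratings = [[5, 0], [3, 2], [2, 3], [0, 5]]
print(round(fleiss_kappa(ratings), 3))
```

      Values near 0 indicate agreement close to chance, and values near 1 indicate near-perfect agreement.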


      Twenty-one (58%) physicians responded. The IOR of interpreting CR based on the OD and CI was weak (kappa = 0.13, CI 0.10 to 0.15). The p-value did not differ significantly between CR and non-CR results (median difference -0.001, CI -0.005 to 0.0, p = 0.07). The OD/CI, however, was greater for CR than for non-CR results (median difference 0.5, CI 0.09 to 0.95, p = 0.02). The areas under the receiver operating characteristic curve for the p-value and OD/CI were 0.70 and 0.80, respectively. The p-value and OD/CI cutoffs that maximized sensitivity (SN) and specificity (SP) for identifying CR were 0.001 (SN 88%, SP 59%) and 0.95 (SN 88%, SP 84%), respectively.
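      The area under the curve and the cutoff that jointly maximizes sensitivity and specificity can be illustrated with a short Python sketch. It computes the AUC through its rank interpretation (the probability that a CR result scores higher than a non-CR result) and picks the cutoff with the largest Youden index, J = SN + SP - 1. Using the Youden index is an assumption here, since the abstract does not name the criterion, and the scores and labels are made-up values, not the study data.

```python
from typing import Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as P(positive score > negative score), counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (cutoff, sensitivity, specificity) maximizing SN + SP - 1.

    A result is called positive when its score is >= cutoff. If several
    cutoffs tie, the smallest one is returned.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_j, best = float("-inf"), (float("nan"), 0.0, 0.0)
    for cutoff in sorted(set(scores)):
        sn = sum(s >= cutoff for s in pos) / len(pos)
        sp = sum(s < cutoff for s in neg) / len(neg)
        if sn + sp - 1.0 > best_j:
            best_j, best = sn + sp - 1.0, (cutoff, sn, sp)
    return best


# Hypothetical OD/CI ratios and physician consensus (1 = CR, 0 = non-CR).
ratios = [0.2, 0.4, 0.6, 0.9, 1.0, 1.3]
is_cr = [0, 0, 0, 1, 0, 1]

print(auc(ratios, is_cr))
print(youden_cutoff(ratios, is_cr))
```

      For a marker such as the p-value, where smaller values point toward the positive class, the scores would be negated before calling these functions.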


      Determining CR from the OD and CI alone had weak interobserver reliability. The OD/CI ratio had a stronger association with CR than the p-value, making it potentially useful in evaluating the CR of research findings.