The IRR analysis differs from the validity analysis, which assesses how accurately an instrument measures the actual construct of interest, not the ability of the coders to provide similar ratings. Instruments may show different levels of validity regardless of their IRR. For example, an instrument may have good IRR but poor validity if the coders' ratings are very similar and share a large common variance, yet the instrument does not properly represent the construct it is supposed to measure.

Krippendorff's alpha [16][17] is a versatile statistic that evaluates the agreement between observers who categorize, rate, or measure a given set of objects on the values of a variable. It generalizes several specialized agreement coefficients: it accepts any number of observers, applies to nominal, ordinal, interval, and ratio levels of measurement, handles missing data, and corrects for small sample sizes.

First, it must be decided whether a coding study is designed so that all subjects in the study are rated by multiple coders, or whether a subset of subjects is rated by multiple coders and the rest are coded by single coders. The contrast between these two options is shown in the left and right columns of Table 1. In general, rating all subjects is theoretically acceptable for most study designs. However, in studies where obtaining ratings is costly and/or time-consuming, selecting a subset of subjects may be more convenient for the IRR analysis, since fewer ratings need to be made overall, and the IRR of the subset of subjects can be used to generalize to the full sample.

For Krippendorff's alpha, the theoretical distribution is not known, not even asymptotically [28]. However, the empirical distribution can be determined by a bootstrap approach.
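To make the coefficient concrete, the following is a minimal sketch of Krippendorff's alpha for nominal data via the standard coincidence-matrix formulation (alpha = 1 - D_o/D_e). The function name and the list-of-lists data layout are illustrative choices, not part of the original text; missing ratings are represented simply by omitting them from a unit's list.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the ratings one unit
    (subject) received. Missing ratings are simply omitted, so units
    may have different numbers of ratings.
    """
    o = Counter()  # coincidence matrix o[(c, k)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with fewer than 2 ratings carries no pairing information
        # each ordered pair of ratings within a unit contributes 1/(m-1)
        for c, k in permutations(ratings, 2):
            o[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()  # marginal totals per category
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    # observed and expected disagreement (nominal metric: delta = 1 for c != k)
    D_o = sum(v for (c, k), v in o.items() if c != k)
    D_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - D_o / D_e

# perfect agreement yields alpha = 1
print(krippendorff_alpha_nominal([[1, 1], [2, 2], [1, 1]]))
```

Ordinal, interval, and ratio variants differ only in the distance metric delta applied to the coincidence matrix; the nominal case shown here uses delta = 1 for any pair of unequal categories.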
Krippendorff proposed a bootstrap algorithm [28, 29] that is also implemented in Hayes' SAS and SPSS macros [28, 30]. The proposed algorithm differs from the one described for Fleiss' K above in three respects.

First, the algorithm is weighted by the number of ratings per subject to account for missing values. Second, it does not resample the N observations, with each observation carrying the corresponding ratings of all raters.
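For contrast, the subject-level bootstrap that Krippendorff's algorithm departs from (the scheme described for Fleiss' K, where whole observations are resampled with all raters' ratings attached) can be sketched as follows. The function names, the percentile method, and the simple percent-agreement statistic used for illustration are assumptions, not the original authors' implementation; in practice the statistic would be Krippendorff's alpha itself.

```python
import random

def bootstrap_ci(units, statistic, n_boot=2000, level=0.05, seed=1):
    """Nonparametric subject-level bootstrap: resample whole units
    (each carrying all of its ratings) with replacement, recompute the
    statistic, and return percentile confidence limits."""
    rng = random.Random(seed)
    n = len(units)
    stats = sorted(
        statistic([units[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((level / 2) * n_boot)]
    hi = stats[int((1 - level / 2) * n_boot) - 1]
    return lo, hi

def percent_agreement(units):
    """Illustrative placeholder statistic: proportion of units on which
    all raters gave the same rating."""
    return sum(len(set(u)) == 1 for u in units) / len(units)

lo, hi = bootstrap_ci([[1, 1], [2, 2], [1, 2], [2, 2]], percent_agreement)
```

Krippendorff's own algorithm instead resamples at the level of paired ratings and weights by the number of ratings per subject, which is what allows it to accommodate missing values.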