Metrics 📐#
Annotation Metrics#
Our toolkit supports computing the annotation metric below, for comparing two sets of annotations.
1. Relevance#
Relevance is the proportion of annotations that are valid. We define the relevance of one set of annotations over a given set of datapoints as \(\texttt{relevance} = n_{\text{valid}}/ n_{\text{total}}\), where \(n_{\text{valid}}\) is the number of datapoints with valid votes selecting one response over the other (response A or response B). This number excludes tie (both/neither) and invalid votes. When comparing two sets of annotations (\(A\) and \(B\)), the relevance of annotations \(A\) (e.g. personality annotations selecting for more confident responses) over annotations \(B\) (e.g. human preference annotations) is the proportion of valid votes over the set of all datapoints included in both annotation sets.
2. Cohen’s kappa (\(\kappa\))#
Cohen’s kappa is a metric of inter-annotator agreement between two sets of annotations that measures agreement beyond random chance. It is defined as
where \(p_o\) is the observed proportion of datapoints where annotators agree, and \(p_e\) is the proportion of datapoints for which agreement is expected by chance. \(p_e\) can be estimated using the observed distribution of labels, as in \(p_e=(n_{a_1=A}n_{a_2=A})/N^2 + (n_{a_1=B}n_{a_2=B})/N^2\), where \(n_{a_i = X}\) is the number of times annotator \(i\) was observed voting for response in position \(X\) and \(N\) is the total number of observations. For the computation of this metric, we only consider valid votes excluding tie (both/neither) and invalid votes.
Note: When one of the annotators does not have access to the order of responses (e.g. because they are always shuffled) the expected chance agreement \(p_e\) is \(0.5\) by design, even if the other annotator is highly biased to one position (e.g. first response). Cohen’s kappa in Feedback Forensics is computed under this assumption, given that this randomization is integrated into our personality selecting reference annotators. This kappa version is also used for the computation of the strength metric.
3. Strength#
Finally, for our specific use-case, we combine Cohen’s kappa with relevance to obtain a measure of relevant agreement beyond chance: strength. We refer to this metrics as strength, defined as
By weighting with relevance, we emphasize agreement that is widely applicable across the dataset. In our setting, this metric indicates whether a personality trait is widely relevant and highly correlated with the target annotations. See below for a guide on how to interpret this metric in the two main use-cases.
Interpretation of Strength#
Interpretation A: Measuring personality traits encouraged by human feedback (comparing human and personality trait annotations)
Fig. 2 Interpretation of strength metric comparing human and personality trait annotations#
Interpretation B: measuring personality traits exhibited by model (comparing target model and personality trait annotations)
Fig. 3 Interpretation of strength metric comparing target model and personality trait annotations#
General Statistics#
Feedback Forensics also supports computing the following additional general statistics for each shown data(sub)set of annotations (include in Overall statistics table of interface):
Number of preference pairs: Full number of preference pairs included in data(sub)setProp preferring text_a: Proportion of datapoints that prefer the first responseAvg len text_a: Average length of first responseAvg len text_b: Average length of second responseAvg len pref. text: Average length of preferred responseAvg len rej. text: Average length of rejected responseProp preferring longer text: Proportion of datapoints preferring longer text