feedback_forensics.app.metrics
Compute metrics
Module Contents
Functions
get_acc | Accuracy: proportion of non-irrelevant votes (‘agree’ or ‘disagree’) that agree with original preferences.
get_cohens_kappa | Cohen’s kappa: measures agreement beyond chance.
get_cohens_kappa_randomized | Cohen’s kappa: measures agreement beyond chance.
get_principle_strength | Relevance-weighted Cohen’s kappa: combines Cohen’s kappa with relevance.
get_overall_metrics | Compute overall metrics
Data
API
- feedback_forensics.app.metrics.DEFAULT_METRIC_NAME
‘strength’
- feedback_forensics.app.metrics.get_agreement(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → float
- feedback_forensics.app.metrics.get_acc(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → float
Accuracy: proportion of non-irrelevant votes (‘agree’ or ‘disagree’) that agree with original preferences.
If there are no relevant votes (neither ‘agree’ nor ‘disagree’), returns 0.5.
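The accuracy rule described above can be sketched as follows. This is an illustration, not the library's code: the function name `accuracy_sketch`, the plain-dict input, and the label keys `"agree"`, `"disagree"` and `"not_applicable"` are assumptions made for the example. A pandas value-counts Series could be converted to such a dict with `.to_dict()`.

```python
from typing import Mapping


def accuracy_sketch(counts: Mapping[str, int]) -> float:
    """Share of relevant votes that agree (hypothetical helper).

    The keys "agree" and "disagree" are assumed label names; any other
    key is treated as an irrelevant vote and ignored.
    """
    agreed = counts.get("agree", 0)
    disagreed = counts.get("disagree", 0)
    relevant = agreed + disagreed
    if relevant == 0:
        # No relevant votes: fall back to the neutral value 0.5.
        return 0.5
    return agreed / relevant


votes = {"agree": 30, "disagree": 10, "not_applicable": 60}
print(accuracy_sketch(votes))  # 0.75
print(accuracy_sketch({"not_applicable": 5}))  # 0.5
```

Only the 40 relevant votes enter the ratio, so the 60 irrelevant votes do not pull the accuracy down.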
- feedback_forensics.app.metrics.get_cohens_kappa(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → float
Cohen’s kappa: measures agreement beyond chance.
This takes into account agreement across categories (including non-text_a/text_b votes).
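For reference, the textbook definition of Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below computes it from two aligned label lists. It is a generic illustration: the function name `cohens_kappa_sketch` and the list-based input are assumptions and do not mirror the `value_counts` interface of `get_cohens_kappa`.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa_sketch(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Textbook Cohen's kappa for two aligned label sequences (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of marginal frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators always use the same single label; kappa is undefined.
        # Returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


annotator_1 = ["text_a", "text_a", "text_b", "text_b"]
annotator_2 = ["text_a", "text_b", "text_b", "text_b"]
print(cohens_kappa_sketch(annotator_1, annotator_2))  # 0.5
```

Because the chance term sums over every category present, votes other than text_a or text_b also contribute to both the observed and the expected agreement.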
- feedback_forensics.app.metrics.get_cohens_kappa_randomized(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → float
Cohen’s kappa: measures agreement beyond chance.
This version assumes that at least one of the annotators had no access to order information, so the probability of chance agreement is fixed at 0.5.
In the regular version, the value is always 0 if one annotator always selects the same option (e.g., by construction).
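The difference between the two variants can be illustrated with a small sketch in which the reference annotator always picks the same option. The helper names `regular_kappa` and `randomized_kappa` and the list-based inputs are assumptions made for this example; they are not the library's functions.

```python
from collections import Counter
from typing import Sequence


def observed_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of items on which both annotators give the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def regular_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa with chance agreement estimated from the label frequencies."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # Guard against division by zero (a convention chosen for this sketch).
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def randomized_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa with chance agreement fixed at 0.5."""
    p_o = observed_agreement(labels_a, labels_b)
    return (p_o - 0.5) / (1.0 - 0.5)


# The reference annotator is fully imbalanced: it always answers "text_a".
reference = ["text_a", "text_a", "text_a", "text_a"]
annotator = ["text_a", "text_a", "text_a", "text_b"]

print(regular_kappa(reference, annotator))     # 0.0
print(randomized_kappa(reference, annotator))  # 0.5
```

With a constant reference, the estimated chance agreement equals the observed agreement, so the regular kappa collapses to 0 regardless of how often the second annotator agrees. Fixing chance agreement at 0.5 keeps the score informative in that case.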
- feedback_forensics.app.metrics.get_relevance(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → float
- feedback_forensics.app.metrics.get_principle_strength(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → float
Relevance-weighted Cohen’s kappa: combines Cohen’s kappa with relevance.
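This is computed as (Cohen’s kappa) * relevance. With the probability of chance agreement fixed at 0.5, Cohen’s kappa equals (accuracy - 0.5) / (1 - 0.5), so the expression simplifies to 2 * (accuracy - 0.5) * relevance.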
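A minimal sketch of the simplified formula follows. The relevance definition is not documented on this page, so the sketch assumes that relevance is the share of all votes that are either ‘agree’ or ‘disagree’. The function name `principle_strength_sketch`, the dict input and the label keys are likewise assumptions.

```python
from typing import Mapping


def principle_strength_sketch(counts: Mapping[str, int]) -> float:
    """Relevance-weighted kappa: 2 * (accuracy - 0.5) * relevance (hypothetical helper).

    Assumptions: "agree" and "disagree" are the relevant labels, and
    relevance is the fraction of all votes that carry one of these labels.
    """
    agreed = counts.get("agree", 0)
    disagreed = counts.get("disagree", 0)
    total = sum(counts.values())
    relevant = agreed + disagreed
    if total == 0 or relevant == 0:
        # Accuracy falls back to 0.5 without relevant votes, so the strength is 0.
        return 0.0
    accuracy = agreed / relevant
    relevance = relevant / total
    return 2 * (accuracy - 0.5) * relevance


votes = {"agree": 30, "disagree": 10, "not_applicable": 60}
# accuracy = 0.75, relevance = 0.4
print(principle_strength_sketch(votes))  # 0.2
```

Under these assumptions the score ranges from -1 (every vote is relevant and disagrees) to 1 (every vote is relevant and agrees), and it shrinks toward 0 as fewer votes are relevant.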
- feedback_forensics.app.metrics.get_num_votes(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → int
- feedback_forensics.app.metrics.get_agreed(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → int
- feedback_forensics.app.metrics.get_disagreed(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → int
- feedback_forensics.app.metrics.get_not_applicable(value_counts: pandas.Series, *, annotation_a=None, annotation_b=None) → int
- feedback_forensics.app.metrics.get_metrics()
- feedback_forensics.app.metrics.compute_annotator_metrics(votes_df: pandas.DataFrame, annotator_metadata: dict, annotator_cols: list[str], ref_annotator_col: str) → dict
- feedback_forensics.app.metrics.get_overall_metrics(votes_df: pandas.DataFrame, ref_annotator_col: str) → dict
Compute overall metrics
Includes:
  overall number of votes,
  percentage of votes that are text_a,
  percentage of votes that are text_b,
  average length of the winning text,
  average length of the losing text.
Args:
  votes_df: pd.DataFrame
  ref_annotator_col: name of the column that contains the reference annotator’s preference, usually “preferred_text”
Returns:
  dict: overall metrics
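The quantities listed above could be computed roughly as in the sketch below. It works on a list of row dicts rather than a DataFrame (rows of this shape can be obtained with `votes_df.to_dict("records")`). The column names `text_a` and `text_b`, the output keys, the use of character counts as length, and the function name `overall_metrics_sketch` are all assumptions for illustration.

```python
from typing import Any, Mapping, Sequence


def overall_metrics_sketch(
    rows: Sequence[Mapping[str, Any]], ref_col: str = "preferred_text"
) -> dict:
    """Summarise reference votes (hypothetical helper, assumed column names)."""
    n = len(rows)
    n_a = sum(1 for r in rows if r[ref_col] == "text_a")
    n_b = sum(1 for r in rows if r[ref_col] == "text_b")
    win_lengths, lose_lengths = [], []
    for r in rows:
        choice = r[ref_col]
        if choice not in ("text_a", "text_b"):
            continue  # votes without a winner do not contribute to lengths
        loser = "text_b" if choice == "text_a" else "text_a"
        win_lengths.append(len(r[choice]))
        lose_lengths.append(len(r[loser]))
    decided = len(win_lengths)
    return {
        "num_votes": n,
        "pct_text_a": 100 * n_a / n if n else 0.0,
        "pct_text_b": 100 * n_b / n if n else 0.0,
        "avg_len_winning": sum(win_lengths) / decided if decided else 0.0,
        "avg_len_losing": sum(lose_lengths) / decided if decided else 0.0,
    }


rows = [
    {"text_a": "short", "text_b": "a longer reply", "preferred_text": "text_b"},
    {"text_a": "hello there", "text_b": "hi", "preferred_text": "text_a"},
    {"text_a": "ab", "text_b": "abcde", "preferred_text": "text_b"},
    {"text_a": "same", "text_b": "size", "preferred_text": "tie"},
]
print(overall_metrics_sketch(rows))
```

In this sketch, percentages are taken over all votes, while the average lengths only cover votes that name a winner.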
- feedback_forensics.app.metrics.ensure_categories_identical(df: pandas.DataFrame, col_a: str, col_b: str) → pandas.DataFrame
- feedback_forensics.app.metrics.get_default_avail_metrics()