Method overview ⚙️

This section gives a short overview of how Feedback Forensics measures personality traits.

Fig. 1 Illustration of Feedback Forensics’ method to measure personality traits.

Input

As shown in Fig. 1, we take pairwise model response data as input, where each datapoint consists of a prompt (yellow) and two corresponding model responses (white).

Step 1: Annotate Data

In the first step, we add annotations to each datapoint. Each annotation selects response A, response B, both, or neither. To understand which personality traits human preferences encourage, we include (1) a human annotation (green in Fig. 1) that selects the human-preferred response. Such annotations can be imported from external sources (e.g. Chatbot Arena) alongside the pairwise model response data. To understand the personality traits exhibited by a target model (e.g. a Claude model), we add (2) a target model annotation (red), which uses hard-coded rules on response metadata to select the response generated by that model (if available). Finally, using AI annotators, we add (3) personality annotations (blue) that select the response exhibiting a trait more strongly (e.g. the more confident response). We collect one such annotation per datapoint and tested trait.
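The annotation step can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Datapoint` class, its field names, the annotation keys, and the `annotate_target_model` helper are hypothetical and are not taken from the Feedback Forensics codebase. The sketch only shows how a rule on response metadata could produce a target model annotation next to a human and a personality annotation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Datapoint:
    """One pairwise datapoint (hypothetical structure for this sketch)."""
    prompt: str
    response_a: str
    response_b: str
    model_a: Optional[str] = None  # metadata: model that wrote response A
    model_b: Optional[str] = None  # metadata: model that wrote response B
    # Maps an annotation name to "A", "B", "both", or "neither".
    annotations: Dict[str, str] = field(default_factory=dict)


def annotate_target_model(dp: Datapoint, target: str) -> None:
    """Rule-based annotation selecting the response(s) written by `target`."""
    if dp.model_a is None or dp.model_b is None:
        return  # metadata unavailable, so no annotation is added
    a_match = dp.model_a == target
    b_match = dp.model_b == target
    if a_match and b_match:
        choice = "both"
    elif a_match:
        choice = "A"
    elif b_match:
        choice = "B"
    else:
        choice = "neither"
    dp.annotations[f"model:{target}"] = choice


dp = Datapoint(
    prompt="What is the capital of France?",
    response_a="Paris.",
    response_b="I believe it might be Paris.",
    model_a="model-x",
    model_b="model-y",
)
dp.annotations["human"] = "A"                 # e.g. imported with the dataset
dp.annotations["trait:more confident"] = "A"  # would come from an AI annotator
annotate_target_model(dp, "model-x")          # (2) target model annotation
print(dp.annotations)
```

In this sketch the personality annotation is set by hand. In practice it would be produced by an AI annotator, once per datapoint and tested trait.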

Step 2: Compute Metrics

In the second step, we compare these annotations to compute personality metrics. To understand how much a specific personality trait is encouraged by human feedback (Result A in Fig. 1), we compare human annotations to the personality annotations for that trait. High agreement (measured via the strength metric) indicates that the trait (or a highly correlated trait) is encouraged by human feedback. Low agreement indicates that the trait is discouraged. Similarly, to observe how much a target model exhibits a certain trait (Result B), we compare target model annotations to that trait’s personality annotations. High agreement indicates that the trait uniquely identifies the model relative to the other models in the dataset, i.e. the model exhibits the trait more than other models. Low agreement indicates that the model exhibits the trait less than other models.
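The comparison in this step can be sketched as follows. This section does not define the strength metric, so the code uses a simple signed agreement score as a stand-in: the function name `agreement`, the label strings, and the choice to skip "both" and "neither" annotations are all assumptions made for this sketch, not the tool's actual metric.

```python
from typing import Optional, Sequence


def agreement(reference: Sequence[str], trait: Sequence[str]) -> Optional[float]:
    """Signed agreement in [-1, 1] between two lists of annotations.

    Only datapoints where both annotations select exactly one response
    ("A" or "B") are counted. Returns None if no datapoint qualifies.
    """
    if len(reference) != len(trait):
        raise ValueError("annotation lists must have the same length")
    agree = 0
    disagree = 0
    for ref_choice, trait_choice in zip(reference, trait):
        if ref_choice in ("A", "B") and trait_choice in ("A", "B"):
            if ref_choice == trait_choice:
                agree += 1
            else:
                disagree += 1
    total = agree + disagree
    if total == 0:
        return None
    return (agree - disagree) / total


# Reference annotations: human preferences (Result A) or target model
# annotations (Result B). Trait annotations: e.g. "more confident".
human = ["A", "B", "A", "both", "A"]
confident = ["A", "B", "B", "A", "A"]
score = agreement(human, confident)
print(score)  # 0.5
```

Under this stand-in, a score near 1 corresponds to high agreement (the trait is encouraged, or the target model exhibits it more than other models), and a score near -1 corresponds to low agreement.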