# Method overview ⚙️

This section gives a short overview of how Feedback Forensics measures personality traits.

```{figure}  ../img/03_method_illustration_v10.png
---
alt: method
width: 700px
align: center
name: fig-method
---
Illustration of Feedback Forensics' method to measure personality traits.
```

<style>
    .highlight-yellow { background-color: #fef0c9; color: #af8725; padding: 0.1em 0.2em; border-radius: 3px; }
    .highlight-white { background-color: #fafafa; color: #8e8e8e; padding: 0.1em 0.2em; border-radius: 3px; }
    .highlight-green { background-color: #e7f2e1; color: #406825; padding: 0.1em 0.2em; border-radius: 3px; }
    .highlight-red   { background-color: #fbeae5; color: #b44529; padding: 0.1em 0.2em; border-radius: 3px; }
    .highlight-blue  { background-color: #ecf2fe; color: #1f4985; padding: 0.1em 0.2em; border-radius: 3px; }
</style>

## Input

As shown in {numref}`fig-method`, we take pairwise model response data as input, where each datapoint consists of a
<em>prompt</em> (<span class="highlight-yellow">yellow</span>) and two corresponding <em>model responses</em> (<span class="highlight-white">white</span>).

## Step 1: Annotate Data

In the first step, we add *annotations* to each datapoint. Each annotation selects *response A*, *response B*, *both*, or *neither* response.
To understand personality traits encouraged by human preferences, we include a
(1) <em>human annotation</em> (<span class="highlight-green">green</span> in {numref}`fig-method`) selecting the human-preferred response.
Such annotations can be imported from external sources (e.g. Chatbot Arena) alongside the pairwise model response data.
To understand the personality traits exhibited by a *target model* (e.g. a Claude model), we add a
(2) <em>target model annotation</em> (<span class="highlight-red">red</span>) using hard-coded rules on response metadata to select the response generated by the model (if available).
Finally, using AI annotators, we add
(3) <em>personality annotations</em> (<span class="highlight-blue">blue</span>) that select the response that exhibits a trait more (e.g. that is more confident).
We collect one such annotation for each combination of datapoint and tested trait.
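
As a rough illustration of the rule-based target model annotation, the sketch below selects the response produced by a given model from response metadata. The `Datapoint` class, its `model_a`/`model_b` fields, and the label strings are hypothetical names chosen for this example. They are not the data format used by Feedback Forensics.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels mirroring the four annotation options described above.
RESPONSE_A, RESPONSE_B, BOTH, NEITHER = "a", "b", "both", "neither"


@dataclass
class Datapoint:
    prompt: str
    response_a: str
    response_b: str
    # Assumed metadata: name of the model that generated each response.
    model_a: Optional[str] = None
    model_b: Optional[str] = None


def annotate_target_model(dp: Datapoint, target: str) -> Optional[str]:
    """Select the response generated by `target`, or None if metadata is missing."""
    if dp.model_a is None or dp.model_b is None:
        return None  # no metadata available, so no annotation can be made
    a_is_target = dp.model_a == target
    b_is_target = dp.model_b == target
    if a_is_target and b_is_target:
        return BOTH
    if a_is_target:
        return RESPONSE_A
    if b_is_target:
        return RESPONSE_B
    return NEITHER


dp = Datapoint("Hi!", "Hello there.", "Hey.", model_a="model-x", model_b="model-y")
print(annotate_target_model(dp, "model-y"))  # selects response B
```

Human and personality annotations would use the same four labels, which is what makes the annotations directly comparable in the next step.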

## Step 2: Compute Metrics

In the second step, we compare these annotations to compute personality metrics. To understand how much a specific personality trait is encouraged by human feedback (**Result A** in {numref}`fig-method`), we compare
<span class="highlight-green">human annotations</span> to <span class="highlight-blue">personality annotations</span> for that trait.
High agreement (measured via the *strength* metric) indicates that the trait (or a highly correlated trait) is *encouraged* by human feedback.
Low agreement indicates that the trait is *discouraged*.
Similarly, to observe how much a target model exhibits a certain trait (**Result B**), we compare
<span class="highlight-red">target model annotations</span> to that trait's <span class="highlight-blue">personality annotations</span>.
High agreement indicates that the trait uniquely identifies the model (relative to other models in the dataset), i.e. the *model exhibits the trait more than other models*.
Low agreement indicates the model exhibits the trait *less than other models*.
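
To make the comparison concrete, the following sketch computes agreement between two lists of annotations. The *strength* metric is not defined in this section, so the sketch uses Cohen's kappa as a stand-in agreement measure. The function name, the `"a"`/`"b"` labels, and the choice to skip datapoints where either annotation is *both* or *neither* are assumptions made for illustration only.

```python
from collections import Counter
from typing import Optional, Sequence


def cohen_kappa(reference: Sequence[str], trait: Sequence[str]) -> Optional[float]:
    """Cohen's kappa between two annotation lists (stand-in for the strength metric).

    Only datapoints where both annotations pick exactly one response
    ("a" or "b") are counted; this filtering is an assumption of the sketch.
    """
    pairs = [
        (r, t) for r, t in zip(reference, trait)
        if r in ("a", "b") and t in ("a", "b")
    ]
    if not pairs:
        return None
    n = len(pairs)
    # Fraction of datapoints where both annotations select the same response.
    observed = sum(r == t for r, t in pairs) / n
    # Agreement expected by chance, given each annotation's label frequencies.
    ref_counts = Counter(r for r, _ in pairs)
    trait_counts = Counter(t for _, t in pairs)
    expected = sum(
        (ref_counts[label] / n) * (trait_counts[label] / n) for label in ("a", "b")
    )
    if expected == 1.0:
        return None  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Result A style comparison: human annotations vs. one trait's annotations.
human = ["a", "b", "a", "b"]
more_confident = ["a", "b", "a", "a"]
print(cohen_kappa(human, more_confident))
```

Passing target model annotations as `reference` instead of human annotations gives the Result B style comparison. Values above zero correspond to the "high agreement" reading and values below zero to the "low agreement" reading.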