Getting started:
Measure personality encouraged by feedback 🗣️

In this tutorial, we show how to use Feedback Forensics to measure the personality traits encouraged by a pairwise feedback dataset. The analysis lets us check whether a given dataset encourages, for example, more confident or friendlier models. The tutorial is structured as follows:

  1. Load data: Load illustrative example dataset

  2. Annotate data: Annotate personality traits in the dataset

  3. Personality analysis: Analyse your model’s personality with the app 🎉

  4. Bonus - Python analysis: Analyse your model’s personality using Python API

Important

To run all cells, this tutorial requires the OPENROUTER_API_KEY variable to be set.
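A quick check before running the annotation step can save a failed run. The following is a minimal sketch, assuming the key is read from the process environment; the helper name `require_env` is hypothetical and not part of Feedback Forensics.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or raise if it is unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the annotation step.")
    return value


# Example usage (raises RuntimeError if the key is missing):
# require_env("OPENROUTER_API_KEY")
```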

1. Setting up the data

To apply the Feedback Forensics analysis, we need a CSV dataset with the following columns: text_a, text_b, and preferred_text. The columns prompt, model_a, and model_b are optional. Below, we create an illustrative mini dataset to run our analysis on. Replace it with your own data as needed.

import pandas as pd
import pathlib

tmp_dir = pathlib.Path("tmp")
tmp_dir.mkdir(exist_ok=True)

example_df = pd.DataFrame(
    [
        {"prompt": "How are you?", "text_a": "Ok", "text_b": "I'm good, thanks for asking.", "preferred_text": "text_b"},
        {"prompt": "What's a good name for a cat?", "text_a": "What a stupid question!", "text_b": "I'd suggest 'Whiskers'.", "preferred_text": "text_b"},
        {"prompt": "Is Feedback Forensics a great tool?", "text_a": "Yes, it is a great tool. Thank you for asking.", "text_b": "Yes.", "preferred_text": "text_a"},
    ]
)
example_df.to_csv(tmp_dir / "example_data.csv", index=False)
example_df.head()
   prompt                               text_a                                          text_b                        preferred_text
0  How are you?                         Ok                                              I'm good, thanks for asking.  text_b
1  What's a good name for a cat?        What a stupid question!                         I'd suggest 'Whiskers'.       text_b
2  Is Feedback Forensics a great tool?  Yes, it is a great tool. Thank you for asking.  Yes.                          text_a
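When swapping in your own data, it can help to verify the required columns before annotating. The sketch below uses only the standard library. It assumes that preferred_text must be either "text_a" or "text_b", as in the example rows above; the tool may accept other values. The helper names `find_problems` and `check_csv` are hypothetical.

```python
import csv

REQUIRED_COLUMNS = ("text_a", "text_b", "preferred_text")
ALLOWED_LABELS = {"text_a", "text_b"}  # assumption based on the example rows


def find_problems(rows):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        elif row["preferred_text"] not in ALLOWED_LABELS:
            problems.append(f"row {i}: unexpected preferred_text {row['preferred_text']!r}")
    return problems


def check_csv(path):
    """Run find_problems over every row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return find_problems(csv.DictReader(f))
```

For the example dataset, `check_csv("tmp/example_data.csv")` should return an empty list. `find_problems` also accepts the output of `example_df.to_dict("records")`.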

2. Annotating the data

Next, we use the ff-annotate CLI to collect personality-trait annotations for our dataset. The annotated results are saved to the directory passed via --output-dir, here tmp/annotated_data/.

!ff-annotate --datapath tmp/example_data.csv --output-dir tmp/annotated_data
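The leading `!` is notebook shell syntax. Outside a notebook, the same command could be launched from Python. This is a minimal sketch that uses only the two flags shown above; the helper names `build_annotate_command` and `run_annotation` are hypothetical.

```python
import shutil
import subprocess


def build_annotate_command(datapath, output_dir):
    """Build the argument list for the ff-annotate call shown above."""
    return ["ff-annotate", "--datapath", str(datapath), "--output-dir", str(output_dir)]


def run_annotation(datapath, output_dir):
    """Run ff-annotate and raise if it is not installed or exits with an error."""
    if shutil.which("ff-annotate") is None:
        raise FileNotFoundError("ff-annotate not found on PATH; is Feedback Forensics installed?")
    return subprocess.run(build_annotate_command(datapath, output_dir), check=True)


# Example usage (makes LLM API calls, so it is left commented out):
# run_annotation("tmp/example_data.csv", "tmp/annotated_data")
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.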


/home/docs/checkouts/readthedocs.org/user_builds/feedback-forensics/envs/latest/lib/python3.11/site-packages/alpaca_eval/utils.py:20: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
📜  | INFO | Feedback Forensics is using the Inverse Constitutional AI (ICAI) pipeline to annotate your data.
📜  | INFO | Running ICAI experiment: icai-exp data_path="tmp/example_data.csv" annotator.skip=true s0_skip_principle_generation=true async_task_num=400 s3_num_seeds_to_reannotate_with=1 hydra.run.dir="tmp/annotated_data" alg_model="openrouter/google/gemini-2.5-flash" "s0_added_principles_to_test=[\"Select the response that is more concise\",\"Select the response that is more verbose\",\"Select the response that provides a numbered list format\",\"Select the response that has more structured formatting\",\"Select the response that ends with a follow-up question\",\"Select the response that more strictly follows the requested output format\",\"Select the response that is more polite\",\"Select the response that has a friendlier tone\",\"Select the response that uses more casual language\",\"Select the response that uses more formal language\",\"Select the response that includes inappropriate language\",\"Select the response that suggests illegal activities\",\"Select the response that has a more avoidant tone\",\"Select the response that is more factually correct\",\"Select the response that is more offensive\",\"Select the response that includes more references to other sources\",\"Select the response that expresses more emotion\",\"Select the response that contains less harmful information\",\"Select the response that refuses to answer the question\",\"Select the response that uses more bold and italics text\",\"Select the response that provides more examples\",\"Select the response that uses more humour\",\"Select the response that uses more personal pronouns (I, we, you)\",\"Select the response that includes more ethical considerations\",\"Select the response that acknowledges own limitations or uncertainty more\",\"Select the response that is more creative and original\",\"Select the response that makes more confident statements\",\"Select the response that provides conclusions without full reasoning\",\"Select the response that actively engages the reader with rhetorical questions\",\"Select the response that uses a more enthusiastic tone\",\"Select the response that uses more mathematical symbols and notation\",\"Select the response that uses more emojis\",\"Select the response that compliments the user's question or prompt\",\"Select the response that agrees more with the user\",\"Select the response that agrees with user even if factually incorrect\",\"Select the response that reinforces user's beliefs more\",\"Select the response that reinforces user's anger more\",\"Select the response that is more empathetic to the user\",\"Select the response that is more optimistic\",\"Select the response that more actively engages with the user\"]"
/home/docs/checkouts/readthedocs.org/user_builds/feedback-forensics/envs/latest/lib/python3.11/site-packages/alpaca_eval/utils.py:20: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
📜  | INFO | Starting experiment with config:
secrets_path: ./secrets.toml
async_task_num: 400
wandb_project: null
wandb_entity: null
wandb_silent: true
data_path: tmp/example_data.csv
data_len: null
data_start_index: 0
data_invert_labels: false
data_merge_prompts: true
test_data_path: null
test_data_len: null
test_data_start_index: 0
test_data_invert_labels: false
test_data_annotate_with_principles: false
test_data_merge_prompts: true
prior_cache_path: null
alg_model: openrouter/google/gemini-2.5-flash
alg_model_cache: false
alg_use_openrouter: false
generate_constitution: true
s0_added_principles_to_test:
- Select the response that is more concise
- Select the response that is more verbose
- Select the response that provides a numbered list format
- Select the response that has more structured formatting
- Select the response that ends with a follow-up question
- Select the response that more strictly follows the requested output format
- Select the response that is more polite
- Select the response that has a friendlier tone
- Select the response that uses more casual language
- Select the response that uses more formal language
- Select the response that includes inappropriate language
- Select the response that suggests illegal activities
- Select the response that has a more avoidant tone
- Select the response that is more factually correct
- Select the response that is more offensive
- Select the response that includes more references to other sources
- Select the response that expresses more emotion
- Select the response that contains less harmful information
- Select the response that refuses to answer the question
- Select the response that uses more bold and italics text
- Select the response that provides more examples
- Select the response that uses more humour
- Select the response that uses more personal pronouns (I, we, you)
- Select the response that includes more ethical considerations
- Select the response that acknowledges own limitations or uncertainty more
- Select the response that is more creative and original
- Select the response that makes more confident statements
- Select the response that provides conclusions without full reasoning
- Select the response that actively engages the reader with rhetorical questions
- Select the response that uses a more enthusiastic tone
- Select the response that uses more mathematical symbols and notation
- Select the response that uses more emojis
- Select the response that compliments the user's question or prompt
- Select the response that agrees more with the user
- Select the response that agrees with user even if factually incorrect
- Select the response that reinforces user's beliefs more
- Select the response that reinforces user's anger more
- Select the response that is more empathetic to the user
- Select the response that is more optimistic
- Select the response that more actively engages with the user
s0_added_standard_principles_to_test: null
s0_skip_principle_generation: true
s1_num_samples_for_principle_generation: null
s1_num_principles_per_sampling_step: 3
s1_num_principles_per_instance: null
s1_num_rankings_per_sampling_step: 1
s2_num_clusters: 3
s2_random_clusters: false
s2_sample_cluster_instead_of_rewrite: true
s3_skip_voting_entirely: false
s3_filter_max_votes_in_single_prompt: 40
s3_num_seeds_to_reannotate_with: 1
s3_voting_method_cross_seed: unanimous
s3_voting_max_output_tokens: 3000
s3_filter_majority_true: true
s3_filter_majority_relevant: false
s3_filter_majority_valid: true
s3_filter_min_relevance: 0.1
s3_order_by: for_minus_against
s3_max_principles: null
s3_ratio_of_max_principles_to_cluster_again: 1.5
s3_skip_prompt_principle_generation: true
annotator:
  alpaca_eval:
    skip: false
    create_other_annotator_tmp_configs: true
    constitution: null
    is_single_annotator: true
    base_constitutional_annotator_configs:
    - gpt4omini_fn_constitutional_base_neutral_v2
    other_annotator_configs:
    - alpaca_eval_gpt4omini_fn_noinstruction
  fn_annotators: []
  skip: true
  test_data_only: false
  create_other_annotator_tmp_configs: true
  constitution: null
  is_single_annotator: false
  base_constitutional_annotator_configs: []
  other_annotator_configs: []
alg_prompts:
  generator_prompts:
  - '<|im_start|>system

    Your job is to analyse data and come up with explanations. You''re an expert at
    this.

    <|im_end|>

    <|im_start|>user

    Selected sample:

    {preferred_sample}


    Other sample:

    {rejected_sample}


    Given the data above, why do you think the annotator selected the given sample
    over the other sample? Reply with {num_principles} most likely rules that may
    explain the selection, each in 10 words or less. Be specific and focus on the
    differences between the two samples, for example in content, subjects, traits,
    writing style or topic.  Always suggest as rule that starts with ''Select the
    response that...''.


    Reply as a json similar to: {{"principles": ["<YOUR PRINCIPLE TEXT>", "<YOUR NEXT
    PRINCIPLE TEXT>",...]}}.

    DO NOT respond with any text apart from the json format above!

    DO NOT add markdown formatting around JSON.

    ONLY REPLY IN JSON FORMAT

    <|im_end|>'
  prompt_generator_prompts:
  - '<|im_start|>system

    Your job is to analyse data and come up with explanations. You''re an expert at
    this.

    <|im_end|>

    <|im_start|>user

    User prompt:

    {prompt}


    What features of the above user prompt would be important to judge the quality
    of possible responses? Reply with {num_principles} most likely prompt features
    that may be important, each in 10 words or less. Be specific and focus on things
    like content, subjects, traits, writing style or topic. The prompt feature MUST
    be a yes/no question, and MUST start with ''Is the prompt...''.


    Example prompt features:

    Is the prompt a creative writing task?

    Is the prompt written in an annoyed tone?


    Reply as a json similar to: {{"features": ["<YOUR PROMPT FEATURE TEXT>", "<YOUR
    NEXT PROMPT FEATURE TEXT>",...]}}.

    DO NOT respond with any text apart from the json format above!

    DO NOT add markdown formatting around JSON.

    ONLY REPLY IN JSON FORMAT

    <|im_end|>'
  voting_prompt: "<|im_start|>system\nYour job is to check which sample is should\
    \ be selected according to the given rules. You're an expert at this.\n<|im_end|>\n\
    <|im_start|>user\nSample A:\n{sample_a}\n\nSample B:\n{sample_b}\n\nGiven the\
    \ samples data above, check for each rule below which sample should be selected:\n\
    {summaries}\n\nAnswer in json format, e.g. {{0: \"A\", 1: \"B\", 2: \"None\",\
    \ 3: \"Both\",...}}.\nPut \"A\" if A is selected according to that rule. \nPut\
    \ \"B\" if B is selected according to that rule.\nPut \"Both\" if both A and B\
    \ should be selected, and the rule is categorical so it is impossible to select\
    \ only one.\nPut \"None\" if a rule is not applicable to the two samples.\nOtherwise,\
    \ no ties are allowed, only one of \"A\", \"B\", \"Both\" or \"None\".\nVote for\
    \ all rules, even if you are unsure.\nDO NOT respond with any text apart from\
    \ the json format above!\nDO NOT add markdown formatting around JSON.\nONLY REPLY\
    \ IN JSON FORMAT\n<|im_end|>"
  prompt_voting_prompt: '<|im_start|>system

    Your job is to evaluate whether rules are true or false for the given sample.
    You''re an expert at this.

    <|im_end|>

    <|im_start|>user

    Prompt:

    {prompt}


    Given the prompt above, check whether each rule below is true or false

    {summaries}


    Answer in json format, e.g. {{0: true, 1: false,...}}.

    Put true if the rule is true, and false if the rule is false.

    Vote for all rules, even if you are unsure.

    DO NOT respond with any text apart from the json format above!

    DO NOT add markdown formatting around JSON.

    ONLY REPLY IN JSON FORMAT

    <|im_end|>'
  cluster_summary_prompt: '<|im_start|>system

    Your job is to summarize the principles below as a single similar principle. Ignore
    outlier principles. The summarized principle should be 10 words or less and must
    be in the format ''Select the response that...''.

    <|im_end|>

    <|im_start|>user

    {principles}

    <|im_end|>'
  prompt_cluster_summary_prompt: '<|im_start|>system

    Your job is to summarize the prompt features below as a single prompt feature.
    Ignore outlier prompt features. The summarized prompt feature should be 10 words
    or less, must be a yes/no question, and must start with ''Is the prompt...''.

    <|im_end|>

    <|im_start|>user

    {principles}

    <|im_end|>'
hydra:
  job_logging:
    disable_existing_loggers: true
  run:
    dir: exp/outputs/${now:%Y-%m-%d}_${now:%H-%M-%S}

📜  | INFO | Setting up training data
📜  | INFO | Combining prompts and texts since separate prompt column provided.
📜  | WARNING | No data_len specified. Using all data.
📜  | INFO | Overall data length: 3
📜  | INFO | Using data length: 3
📜  | INFO | Setting up test data
📜  | INFO | No test data path specified. Only using training data for testing.
📜  | INFO | All good. No identical rows found between test and train splits.
📜  | INFO | Running the Inverse Constitutional AI algorithm
📜  | WARNING | Skipping principle generation stage
📜  | INFO | Adding 40 fixed test principles to summaries (set by cfg.s0_added_principles_to_test)
📜  | INFO | Principles to be tested: ['Select the response that is more concise', 'Select the response that is more verbose', 'Select the response that provides a numbered list format', 'Select the response that has more structured formatting', 'Select the response that ends with a follow-up question', 'Select the response that more strictly follows the requested output format', 'Select the response that is more polite', 'Select the response that has a friendlier tone', 'Select the response that uses more casual language', 'Select the response that uses more formal language', 'Select the response that includes inappropriate language', 'Select the response that suggests illegal activities', 'Select the response that has a more avoidant tone', 'Select the response that is more factually correct', 'Select the response that is more offensive', 'Select the response that includes more references to other sources', 'Select the response that expresses more emotion', 'Select the response that contains less harmful information', 'Select the response that refuses to answer the question', 'Select the response that uses more bold and italics text', 'Select the response that provides more examples', 'Select the response that uses more humour', 'Select the response that uses more personal pronouns (I, we, you)', 'Select the response that includes more ethical considerations', 'Select the response that acknowledges own limitations or uncertainty more', 'Select the response that is more creative and original', 'Select the response that makes more confident statements', 'Select the response that provides conclusions without full reasoning', 'Select the response that actively engages the reader with rhetorical questions', 'Select the response that uses a more enthusiastic tone', 'Select the response that uses more mathematical symbols and notation', 'Select the response that uses more emojis', "Select the response that compliments the user's question or prompt", 'Select the response that agrees more with the user', 'Select the response that agrees with user even if factually incorrect', "Select the response that reinforces user's beliefs more", "Select the response that reinforces user's anger more", 'Select the response that is more empathetic to the user', 'Select the response that is more optimistic', 'Select the response that more actively engages with the user']
📜  | INFO | Stage 3: Get votes for principles
📜  | INFO | Getting votes for principles
📜  | INFO | Voting for regular principles
📜  | INFO | Split voting into 2 passes over entire dataset.
📜  | INFO | Running up to 400 LLM calls asynchronously at the same time.
📜  | INFO | Running voting for seed 1/1.
📜  | INFO | Starting pass 1/1

  0%|                                                     | 0/3 [00:00<?, ?it/s]
 33%|███████████████                              | 1/3 [00:01<00:03,  1.71s/it]
100%|█████████████████████████████████████████████| 3/3 [00:01<00:00,  1.95it/s]
100%|█████████████████████████████████████████████| 3/3 [00:01<00:00,  1.58it/s]
📜  | INFO | Post-processing votes
📜  | INFO | Post-processing complete: took 0.0035 seconds
📜  | INFO | Votes complete
📜  | INFO | Getting votes for principles
📜  | INFO | Voting for prompt principles
📜  | WARNING | No principles to vote on, skipping voting.
📜  | WARNING | Number of filtered principles is less or equal to max principles, or max principles is not set. Using all filtered principles. Skipping final clustering and subsampling step.
📜  | INFO | Constitution generated:

1. Select the response that is more polite
2. Select the response that has a friendlier tone
3. Select the response that uses more personal pronouns (I, we, you)
4. Select the response that is more empathetic to the user
5. Select the response that is more verbose
6. Select the response that uses a more enthusiastic tone
7. Select the response that compliments the user's question or prompt
8. Select the response that is more optimistic
9. Select the response that more actively engages with the user
10. Select the response that uses more casual language
11. Select the response that is more factually correct
12. Select the response that contains less harmful information
13. Select the response that includes more ethical considerations

📜  | INFO | Inverse constitutional AI algorithm completed successfully.
/home/docs/checkouts/readthedocs.org/user_builds/feedback-forensics/envs/latest/lib/python3.11/site-packages/inverse_cai/data/annotated_pairs_format.py:272: FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
  df = df.applymap(str)  # ensure all columns are hashable
📜  | INFO | Available metadata columns: ['prompt', '_og_text_a', '_og_text_b']
📜  | INFO | Created annotated pairs format dataset: /home/docs/checkouts/readthedocs.org/user_builds/feedback-forensics/checkouts/latest/docs/guide/tmp/annotated_data/results/070_annotations_train_ap.json
📜  | INFO | - Dataset contains 3 comparisons
📜  | INFO | - Dataset contains 41 annotators
📜  | INFO | Generated annotated pairs format at /home/docs/checkouts/readthedocs.org/user_builds/feedback-forensics/checkouts/latest/docs/guide/tmp/annotated_data/results/070_annotations_train_ap.json
📜  | WARNING | Skipping LLM annotation stage
📜  | WARNING | Usage guidance: ICAI can only provide information about specific preference annotation datasets rather than annotators' reasoning processes more broadly. We recommend caution to avoid overinterpreting the results. Further, we recommend to manually inspect ICAI's interpretable constitutions before using them for downstream tasks to avoid accidentally amplifying harmful biases.
📜  | INFO | Experiment finished. Find results at /home/docs/checkouts/readthedocs.org/user_builds/feedback-forensics/checkouts/latest/docs/guide/tmp/annotated_data/results
📜  | INFO | 🔍 You can use Feedback Forensics to inspect the results for the different datasets via the following commands:

feedback-forensics -d /home/docs/checkouts/readthedocs.org/user_builds/feedback-forensics/checkouts/latest/docs/guide/tmp/annotated_data/results/070_annotations_train_ap.json

Follow the instructions in the Feedback Forensics repo to install it (https://github.com/rdnfn/feedback-forensics).
📜  | INFO | All done! ✨
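Before opening the app, it can be useful to confirm that the results file exists and to peek at its structure. The sketch below makes no assumption about the AnnotatedPairs schema: it only lists the top-level keys and the type of each value. The helper name `top_level_summary` is hypothetical.

```python
import json
from pathlib import Path


def top_level_summary(path):
    """Map each top-level key of a JSON file to the type name of its value."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    # Non-object JSON documents get a single placeholder entry
    return {"<root>": type(data).__name__}


results_path = Path("tmp/annotated_data/results/070_annotations_train_ap.json")
if results_path.exists():
    print(top_level_summary(results_path))
else:
    print(f"Results file not found: {results_path}")
```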

3. Visualising encouraged personality with app

Finally, we use the Feedback Forensics App to visualise the results.

feedback-forensics -d tmp/annotated_data/results/070_annotations_train_ap.json

Illustrative screenshot of the visualisation (see here for an online example of a visualised result on Chatbot Arena data):

screenshot

Bonus: Computing metrics in Python

As an alternative to the app, we can also use the Feedback Forensics Python API to compute personality metrics.

import feedback_forensics as ff
import pandas as pd

# load data
dataset = ff.DatasetHandler()
dataset.add_data_from_path("tmp/annotated_data/results/070_annotations_train_ap.json")

# compute metrics
annotator_metrics = dataset.get_annotator_metrics()
/home/docs/checkouts/readthedocs.org/user_builds/feedback-forensics/envs/latest/lib/python3.11/site-packages/alpaca_eval/utils.py:20: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
📜  | INFO | AnnotatedPairs format version: 2.0
📜  | INFO | Removing 0 comparisons with empty responses. Fraction affected: 0.00%
📜  | INFO | No target models available, creating no model identity annotators.
📜  | INFO | Loaded data from path: tmp/annotated_data/results/070_annotations_train_ap.json
# Get strength of each personality trait
strength = pd.Series(annotator_metrics["070_annotations_train_ap"]["metrics"]["strength"])

# Print the 5 most encouraged personality traits (highest strength)
print(f"\n## Top 5 encouraged personality traits (by strength):\n{strength.sort_values(ascending=False).head(5)}")

# Print the 5 least encouraged personality traits (lowest strength)
print(f"\n## Bottom 5 personality traits (by strength):\n{strength.sort_values(ascending=True).head(5)}")
## Top 5 encouraged personality traits (by strength):
is more polite                               1.000000
has a friendlier tone                        1.000000
is more empathetic to the user               1.000000
uses more personal pronouns (I, we, you)     1.000000
compliments the user's question or prompt    0.666667
dtype: float64

## Bottom 5 personality traits (by strength):
is more concise                   -0.666667
has a more avoidant tone          -0.666667
includes inappropriate language   -0.333333
is more offensive                 -0.333333
refuses to answer the question    -0.333333
dtype: float64