├── command_line_generator.png
├── create_all_control_vectors.sh
├── data
    ├── writing_style_continuations
    │   ├── language.json
    │   ├── storytelling.json
    │   └── character_focus.json
    ├── prompt_stems.json
    ├── other_continuations
    │   └── optimism_vs_nihilism.json
    └── dark_tetrad_continuations
    │   ├── compassion_vs_sadism.json
    │   ├── humility_vs_narcissism.json
    │   ├── empathy_vs_sociopathy.json
    │   └── honesty_vs_machiavellianism.json
├── create_control_vectors.py
├── hidden_state_data_manager.py
├── dataset_manager.py
├── model_handler.py
├── command_line_generator.html
├── direction_analyzer.py
├── LICENSE
└── README.md

/command_line_generator.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jukofyork/control-vectors/HEAD/command_line_generator.png
--------------------------------------------------------------------------------
/create_all_control_vectors.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Usage examples:
4 | # ./create_all_control_vectors.sh "0" "./aya-23-35B" "aya-23:35b-" 8192
5 | # ./create_all_control_vectors.sh "1" "./Qwen1.5-14B-Chat" "qwen-1.5:14b-" 5120
6 | # ./create_all_control_vectors.sh "0,1" "./c4ai-command-r-plus" "command-r-plus:104b-" 12288
7 |
8 | # Check if we have the correct number of arguments
9 | if [ "$#" -ne 4 ]; then
10 |     echo "Usage: $0 <cuda_devices> <model_id> <output_prefix> <num_prompt_samples>"
11 |     echo "Example: $0 \"0,1\" \"/path/to/model\" \"model-prefix-\" 12345"
12 |     exit 1
13 | fi
14 |
15 | # Assuming the 'data' sub-folder is in the default location.
16 | DATA="data"
17 | STEMS="$DATA/prompt_stems.json"
18 | PROMPTS="$DATA/writing_prompts.txt"
19 |
20 | # Assign arguments to variables
21 | CUDA_DEVICES="$1"
22 | MODEL_ID="$2"
23 | OUTPUT_PREFIX="$3"
24 | NUM_PROMPT_SAMPLES="$4"
25 |
26 | # Define arrays for continuations and output suffixes
27 | continuations=(
28 |     "$DATA/writing_style_continuations/character_focus.json"
29 |     "$DATA/writing_style_continuations/language.json"
30 |     "$DATA/writing_style_continuations/storytelling.json"
31 |     "$DATA/dark_tetrad_continuations/compassion_vs_sadism.json"
32 |     "$DATA/dark_tetrad_continuations/empathy_vs_sociopathy.json"
33 |     "$DATA/dark_tetrad_continuations/honesty_vs_machiavellianism.json"
34 |     "$DATA/dark_tetrad_continuations/humility_vs_narcissism.json"
35 |     "$DATA/other_continuations/optimism_vs_nihilism.json"
36 | )
37 |
38 | output_suffixes=(
39 |     "character_focus_"
40 |     "language_"
41 |     "storytelling_"
42 |     "compassion_vs_sadism_"
43 |     "empathy_vs_sociopathy_"
44 |     "honesty_vs_machiavellianism_"
45 |     "humility_vs_narcissism_"
46 |     "optimism_vs_nihilism_"
47 | )
48 |
49 | # Set CUDA_VISIBLE_DEVICES
50 | export CUDA_VISIBLE_DEVICES="$CUDA_DEVICES"
51 |
52 | # Loop through continuations and create control vectors
53 | for i in "${!continuations[@]}"; do
54 |     python3 ./create_control_vectors.py \
55 |         --model_id "$MODEL_ID" \
56 |         --output_path "${OUTPUT_PREFIX}${output_suffixes[i]}" \
57 |         --prompt_stems_file "$STEMS" \
58 |         --writing_prompts_file "$PROMPTS" \
59 |         --continuations_file "${continuations[i]}" \
60 |         --num_prompt_samples "$NUM_PROMPT_SAMPLES"
61 | done
--------------------------------------------------------------------------------
/data/writing_style_continuations/language.json:
--------------------------------------------------------------------------------
1 | { 2 | "classes": ["simple", "ornate"], 3 | "data": [ 4 | [ 5 | "who writes using clear, straightforward language accessible to young readers, with simple sentence structures and common
vocabulary", 6 | "who writes using rich, sophisticated language suitable for mature readers, with complex sentence structures and varied vocabulary" 7 | ], 8 | [ 9 | "who crafts narratives using easy-to-understand words and concise sentences, making your tales approachable for readers of all ages", 10 | "who crafts narratives using eloquent prose and intricate phrasings, creating tales that challenge and engage advanced readers" 11 | ], 12 | [ 13 | "known for writing in a clear, unadorned style that makes complex ideas accessible to a wide audience", 14 | "known for writing in a lyrical, intricate style that showcases the beauty and complexity of language" 15 | ], 16 | [ 17 | "who specializes in using everyday language to craft engaging narratives that readers of all levels can enjoy", 18 | "who specializes in using sophisticated, sometimes archaic language to create immersive and challenging narratives" 19 | ], 20 | [ 21 | "who excels at conveying ideas and emotions through simple, precise language, avoiding unnecessary complexity", 22 | "who excels at conveying ideas and emotions through complex, nuanced language, embracing the full depth of linguistic expression" 23 | ], 24 | [ 25 | "focused on creating stories with straightforward plots and relatable characters using basic, accessible language", 26 | "focused on creating stories with intricate plots and multifaceted characters using elaborate, ornate language" 27 | ], 28 | [ 29 | "who writes in a direct, no-frills style that prioritizes clarity and ease of understanding for all readers", 30 | "who writes in a florid, embellished style that prioritizes linguistic beauty and complexity for discerning readers" 31 | ], 32 | [ 33 | "known for distilling complex concepts into easily digestible prose, making your work accessible to a broad audience", 34 | "known for weaving complex concepts into richly textured prose, creating literary works that reward careful analysis" 35 | ], 36 | [ 37 | "who crafts stories using 
concise, impactful language that resonates with readers through its clarity and directness", 38 | "who crafts stories using expansive, descriptive language that immerses readers in a world of vivid imagery and complex ideas" 39 | ], 40 | [ 41 | "specializing in clean, minimalist prose that conveys powerful ideas through carefully chosen, straightforward words", 42 | "specializing in lush, maximalist prose that conveys powerful ideas through carefully constructed, ornate phrases" 43 | ] 44 | ] 45 | } -------------------------------------------------------------------------------- /data/prompt_stems.json: -------------------------------------------------------------------------------- 1 | { 2 | "pre": [ 3 | "You are", 4 | "You're", 5 | "Act as", 6 | "Behave as", 7 | "Respond as", 8 | "Answer as", 9 | "Write as", 10 | "Speak as", 11 | "Think like", 12 | "Roleplay as", 13 | "Pretend to be", 14 | "Imagine you are", 15 | "Assume you are", 16 | "Suppose you are", 17 | "Picture yourself as", 18 | "Envision yourself as", 19 | "Consider yourself", 20 | "Take on the role of", 21 | "Play the part of", 22 | "Perform as", 23 | "Be", 24 | "Emulate", 25 | "Mimic", 26 | "Imitate", 27 | "Channel", 28 | "Embody", 29 | "Represent", 30 | "Portray", 31 | "Adopt the persona of", 32 | "Function as", 33 | "Serve as", 34 | "Work as", 35 | "Operate as", 36 | "Pose as", 37 | "Present yourself as", 38 | "View yourself as", 39 | "See yourself as", 40 | "Regard yourself as", 41 | "Consider yourself as", 42 | "Think of yourself as", 43 | "Approach this as", 44 | "Conduct yourself as", 45 | "Assume the identity of", 46 | "Put yourself in the position of", 47 | "Inhabit the role of", 48 | "Characterize yourself as", 49 | "Impersonate", 50 | "Simulate being", 51 | "Take the perspective of", 52 | "Assume the role of" 53 | ], 54 | "post": [ 55 | "an author", 56 | "a storyteller", 57 | "an AI author", 58 | "an artificial intelligence that creates stories", 59 | "an AI-powered author", 60 | "an AI 
creator of tales", 61 | "a fiction writer", 62 | "an author specializing in fictional stories", 63 | "a novelist", 64 | "a creative writer", 65 | "a digital storyteller", 66 | "an AI narrative generator", 67 | "a computer-assisted author", 68 | "an AI weaver of narratives", 69 | "a prose artist", 70 | "a writer of imaginative tales", 71 | "a wordsmith", 72 | "a literary artist", 73 | "a narrative designer", 74 | "a tale weaver", 75 | "a story architect", 76 | "a crafter of fictional worlds", 77 | "a purveyor of narratives", 78 | "a storytelling savant", 79 | "a narrative architect", 80 | "a digital bard", 81 | "a modern wordsmith", 82 | "a virtual storyteller", 83 | "a contemporary narrative designer", 84 | "an innovative tale weaver", 85 | "a cutting-edge prose creator", 86 | "a digital-age fabulist", 87 | "a tech-savvy literary artist", 88 | "a 21st-century storyteller", 89 | "a famous author", 90 | "a literary virtuoso", 91 | "an expert storyteller", 92 | "a renowned wordsmith", 93 | "a master of fictional worlds", 94 | "a master of prose", 95 | "a futuristic narrative crafter", 96 | "a genre-bending author", 97 | "a visionary storyteller", 98 | "an experimental fiction writer", 99 | "a digital narrative pioneer", 100 | "a cross-platform storyteller", 101 | "a multimedia narrative artist", 102 | "an immersive story creator", 103 | "a narrative AI collaborator", 104 | "a next-generation author" 105 | ] 106 | } -------------------------------------------------------------------------------- /data/writing_style_continuations/storytelling.json: -------------------------------------------------------------------------------- 1 | { 2 | "classes": ["explicit", "descriptive"], 3 | "data": [ 4 | [ 5 | "who writes stories that directly state characters' emotions and motivations, clearly explaining their inner thoughts and the reasons behind their actions", 6 | "who writes stories that reveal characters' emotions and motivations through their actions, physical responses, 
and the details of their surroundings" 7 | ], 8 | [ 9 | "who creates narratives that explicitly tell readers about the story's themes and messages, leaving no room for ambiguity in interpretation", 10 | "who creates narratives that convey themes and messages through carefully crafted scenes and character interactions, allowing readers to draw their own conclusions" 11 | ], 12 | [ 13 | "who prioritizes clarity by directly stating the significance of events and their impact on the plot, ensuring readers fully understand the story's progression", 14 | "who prioritizes immersion by depicting events in vivid detail, allowing readers to infer their significance and impact on the plot" 15 | ], 16 | [ 17 | "who crafts stories where character development is explicitly explained, telling readers exactly how and why characters change over time", 18 | "who crafts stories where character development is shown through changing behaviors, attitudes, and decisions, inviting readers to observe growth over time" 19 | ], 20 | [ 21 | "who favors straightforward exposition, directly informing readers about the world, its history, and important background information", 22 | "who favors immersive world-building, revealing information about the world and its history through environmental details and character experiences" 23 | ], 24 | [ 25 | "who writes with a focus on clear, unambiguous descriptions of settings, telling readers exactly what they need to know about each location", 26 | "who writes with a focus on sensory-rich depictions of settings, allowing readers to experience locations through vivid imagery and atmosphere" 27 | ], 28 | [ 29 | "who crafts narratives that explicitly state the cause-and-effect relationships between events, clearly explaining how one action leads to another", 30 | "who crafts narratives that imply cause-and-effect relationships through the sequence of events and their consequences, letting readers connect the dots" 31 | ], 32 | [ 33 | "who specializes in 
direct characterization, telling readers about characters' personalities, backgrounds, and traits through clear statements", 34 | "who specializes in indirect characterization, showing characters' personalities, backgrounds, and traits through their actions, choices, and interactions" 35 | ], 36 | [ 37 | "known for creating stories that explicitly describe characters' physical appearances, leaving no room for misinterpretation", 38 | "known for creating stories that reveal characters' physical appearances gradually through select details and others' reactions" 39 | ], 40 | [ 41 | "who excels at writing stories where the emotional atmosphere is directly stated, telling readers exactly how to feel about each scene", 42 | "who excels at writing stories where the emotional atmosphere is conveyed through environmental cues, character reactions, and carefully chosen details" 43 | ] 44 | ] 45 | } -------------------------------------------------------------------------------- /data/other_continuations/optimism_vs_nihilism.json: -------------------------------------------------------------------------------- 1 | { 2 | "classes": ["optimism", "nihilism"], 3 | "data": [ 4 | [ 5 | "who creates stories about characters with an unshakeable belief that every situation, no matter how dire, contains the seed of a positive outcome", 6 | "who creates stories about characters with an unshakeable belief that every situation, no matter how promising, is ultimately pointless and devoid of meaning" 7 | ], 8 | [ 9 | "who crafts narratives of individuals who see setbacks as opportunities, consistently finding silver linings in the darkest clouds", 10 | "who crafts narratives of individuals who see all events as equally insignificant, consistently rejecting the notion that anything matters in a purposeless universe" 11 | ], 12 | [ 13 | "known for tales of characters who maintain an infectious positive outlook, inspiring hope and resilience in others even in the bleakest circumstances", 14 | 
"known for tales of characters who maintain a persistent sense of life's futility, spreading a contagious belief in the absurdity of existence to others" 15 | ], 16 | [ 17 | "of transformative hopefulness, where protagonists' unwavering positive attitudes literally change the course of events for the better", 18 | "of pervasive meaninglessness, where protagonists' unwavering belief in life's futility colors their perception of all events as equally insignificant" 19 | ], 20 | [ 21 | "who specializes in stories of relentless positivity, portraying characters who believe so strongly in good outcomes that they seem to will them into existence", 22 | "who specializes in stories of unyielding emptiness, portraying characters who believe so strongly in life's lack of purpose that they reject all conventional values and goals" 23 | ], 24 | [ 25 | "focused on depicting characters who find joy and purpose in every aspect of life, no matter how small or seemingly insignificant", 26 | "focused on depicting characters who find all aspects of life equally devoid of purpose, viewing joy and suffering as meaningless constructs" 27 | ], 28 | [ 29 | "who writes about individuals who persistently seek out the good in others and in situations, believing in the inherent value of positive thinking", 30 | "who writes about individuals who consistently reject the idea of inherent value in anything, viewing all human pursuits as arbitrary and ultimately pointless" 31 | ], 32 | [ 33 | "exploring themes of hope and resilience, where characters overcome adversity through their steadfast belief in a better future", 34 | "exploring themes of existential emptiness, where characters confront the perceived meaninglessness of existence and reject the concept of progress or improvement" 35 | ], 36 | [ 37 | "who crafts tales of inspirational perseverance, where characters' belief in positive outcomes drives them to overcome seemingly insurmountable odds", 38 | "who crafts tales of philosophical 
resignation, where characters' belief in the futility of all action leads them to embrace a state of passive indifference" 39 | ], 40 | [ 41 | "known for stories where characters' hopeful worldviews lead them to create positive change and find fulfillment in their lives and relationships", 42 | "known for stories where characters' belief in life's fundamental meaninglessness leads them to reject societal norms and find a paradoxical freedom in purposelessness" 43 | ] 44 | ] 45 | } -------------------------------------------------------------------------------- /data/writing_style_continuations/character_focus.json: -------------------------------------------------------------------------------- 1 | { 2 | "classes": ["narration", "dialogue"], 3 | "data": [ 4 | [ 5 | "who excels at using vivid narration to convey character personalities, motivations, and relationships, creating an immersive experience for readers", 6 | "who excels at using vibrant dialogue to convey character personalities, motivations, and relationships, creating an immersive experience for readers" 7 | ], 8 | [ 9 | "who weaves tales using narration to develop characters and explore their inner worlds, allowing readers to connect with them on a deeper level", 10 | "who weaves tales using dialogue to develop characters and explore their inner worlds, allowing readers to connect with them on a deeper level" 11 | ], 12 | [ 13 | "known for your ability to transport readers into characters' minds through evocative narration that explores their fears, hopes, and relationships", 14 | "known for your ability to transport readers into characters' minds through authentic dialogue that reveals their fears, hopes, and relationships" 15 | ], 16 | [ 17 | "who excels at using narration to craft tales that explore characters' emotional depths, creating stories that resonate with readers on a personal level", 18 | "who excels at using dialogue to craft tales that explore characters' emotional depths, creating 
stories that resonate with readers on a personal level" 19 | ], 20 | [ 21 | "specializing in narration-driven storytelling, creating stories that use narration to uncover characters' hidden desires, fears, and relationships, engaging readers in their emotional journeys", 22 | "specializing in dialogue-driven storytelling, creating stories that use conversations to uncover characters' hidden desires, fears, and relationships, engaging readers in their emotional journeys" 23 | ], 24 | [ 25 | "who crafts rich narrative descriptions to build intricate worlds and complex characters, immersing readers in the story's atmosphere and emotional landscape", 26 | "who crafts rich conversational exchanges to build intricate worlds and complex characters, immersing readers in the story's atmosphere and emotional landscape" 27 | ], 28 | [ 29 | "known for using powerful narrative prose to convey the subtleties of character development and plot progression, creating a vivid reading experience", 30 | "known for using powerful dialogue to convey the subtleties of character development and plot progression, creating a vivid reading experience" 31 | ], 32 | [ 33 | "who excels in using descriptive narration to paint detailed pictures of settings, characters, and events, bringing the story to life in readers' minds", 34 | "who excels in using realistic dialogue to paint detailed pictures of settings, characters, and events, bringing the story to life through characters' voices" 35 | ], 36 | [ 37 | "specializing in narrative techniques that reveal characters' thoughts and feelings, providing deep insights into their motivations and inner conflicts", 38 | "specializing in dialogue techniques that reveal characters' thoughts and feelings, providing deep insights into their motivations and inner conflicts through their words" 39 | ], 40 | [ 41 | "who creates compelling storylines through skillful use of narrative exposition, guiding readers through complex plots and character arcs", 42 | 
"who creates compelling storylines through skillful use of character conversations, guiding readers through complex plots and character arcs via verbal interactions" 43 | ] 44 | ] 45 | } -------------------------------------------------------------------------------- /data/dark_tetrad_continuations/compassion_vs_sadism.json: -------------------------------------------------------------------------------- 1 | { 2 | "classes": ["compassion", "sadism"], 3 | "data": [ 4 | [ 5 | "who creates stories about characters finding fulfillment in alleviating others' suffering, showcasing the transformative power of kindness in healing both individuals and communities", 6 | "who creates stories about characters finding fulfillment in inflicting suffering on others, showcasing the destructive power of cruelty in harming both individuals and communities" 7 | ], 8 | [ 9 | "who crafts narratives of profound human connection, where protagonists learn to extend care to even the most difficult individuals, leading to unexpected personal growth", 10 | "who crafts narratives of profound human cruelty, where protagonists learn to derive pleasure from tormenting even the most vulnerable individuals, leading to unexpected personal degradation" 11 | ], 12 | [ 13 | "known for tales of emotional healing, where characters overcome their own pain by reaching out to help others, creating a ripple effect of kindness", 14 | "known for tales of emotional torture, where characters intensify others' pain for their own pleasure, creating a ripple effect of suffering" 15 | ], 16 | [ 17 | "of heartwarming journeys, where individuals discover their inner strength through acts of selfless care, often in the face of adversity", 18 | "of disturbing journeys, where individuals discover their capacity for cruelty through acts of malicious pleasure, often in the face of others' vulnerability" 19 | ], 20 | [ 21 | "who specializes in stories of personal transformation, where characters' small acts of kindness 
accumulate to create significant positive impacts in their lives and others", 22 | "who specializes in stories of personal corruption, where characters' small acts of cruelty accumulate to create significant negative impacts in their lives and others" 23 | ], 24 | [ 25 | "focused on depicting characters who find deep satisfaction in nurturing and supporting others, exploring the profound joy that comes from alleviating suffering", 26 | "focused on depicting characters who find intense pleasure in tormenting and breaking others, exploring the disturbing thrill that comes from inflicting pain" 27 | ], 28 | [ 29 | "who writes about individuals dedicating themselves to understanding and addressing others' pain, highlighting the personal growth that comes from cultivating care", 30 | "who writes about individuals dedicating themselves to causing and prolonging others' pain, highlighting the personal gratification that comes from indulging in malicious impulses" 31 | ], 32 | [ 33 | "exploring themes of healing through kindness, where characters learn to overcome their own traumas by extending care to those in need", 34 | "exploring themes of harm through cruelty, where characters exacerbate their own dark tendencies by inflicting pain on those who are vulnerable" 35 | ], 36 | [ 37 | "who crafts tales of emotional recovery, where individuals learn to connect with others by offering genuine care and support in times of distress", 38 | "who crafts tales of emotional destruction, where individuals learn to disconnect from others by deriving pleasure from their moments of greatest suffering" 39 | ], 40 | [ 41 | "known for stories where characters find strength in showing mercy and kindness, even to those who may not seem to deserve it, leading to unexpected redemption", 42 | "known for stories where characters find power in showing ruthlessness and cruelty, especially to those who are helpless, leading to escalating cycles of harm" 43 | ] 44 | ] 45 | } 
-------------------------------------------------------------------------------- /data/dark_tetrad_continuations/humility_vs_narcissism.json: -------------------------------------------------------------------------------- 1 | { 2 | "classes": ["humility", "narcissism"], 3 | "data": [ 4 | [ 5 | "who creates stories about characters who embrace their flaws and limitations, learning to value others' contributions and grow through collaboration and open-mindedness", 6 | "who creates stories about characters who deny their flaws and limitations, learning to devalue others' contributions and stagnate through self-aggrandizement and closed-mindedness" 7 | ], 8 | [ 9 | "who crafts narratives of quiet strength, where protagonists lead by example, listen more than they speak, and find power in admitting their mistakes and learning from others", 10 | "who crafts narratives of loud dominance, where protagonists lead by assertion, speak more than they listen, and find power in denying their mistakes and dismissing others' input" 11 | ], 12 | [ 13 | "known for tales of personal growth, where characters overcome their ego, recognize their own biases, and discover the profound impact of putting others first", 14 | "known for tales of personal inflation, where characters indulge their ego, ignore their own biases, and discover the immediate gratification of putting themselves first" 15 | ], 16 | [ 17 | "of inspirational journeys, where individuals learn to balance confidence with modesty, celebrating others' successes as enthusiastically as their own", 18 | "of self-centered journeys, where individuals learn to amplify confidence without modesty, diminishing others' successes while exaggerating their own" 19 | ], 20 | [ 21 | "who specializes in stories of transformative self-awareness, where characters discover that true strength lies in vulnerability and the ability to say 'I don't know' or 'I was wrong'", 22 | "who specializes in stories of persistent self-delusion, where 
characters insist that true strength lies in invulnerability and the refusal to ever admit ignorance or error" 23 | ], 24 | [ 25 | "focused on depicting characters who find fulfillment in supporting others' growth and success, often stepping back to allow others to shine", 26 | "focused on depicting characters who find fulfillment only in their own achievements and accolades, often stepping on others to ensure they remain in the spotlight" 27 | ], 28 | [ 29 | "who writes about individuals who actively seek feedback and criticism, viewing it as an opportunity for improvement and personal development", 30 | "who writes about individuals who actively avoid feedback and criticism, viewing it as a threat to their self-image and responding with anger or dismissal" 31 | ], 32 | [ 33 | "exploring themes of collective achievement, where characters learn that the greatest accomplishments come from acknowledging and harnessing the strengths of a diverse team", 34 | "exploring themes of individual superiority, where characters insist that the greatest accomplishments come from their own innate talents and dismiss the contributions of others" 35 | ], 36 | [ 37 | "who crafts tales of empathetic leadership, where characters inspire loyalty and trust by genuinely caring about their team's well-being and giving credit where it's due", 38 | "who crafts tales of self-serving leadership, where characters demand loyalty and obedience by prioritizing their own image and taking credit for all successes" 39 | ], 40 | [ 41 | "known for stories where characters' selflessness and ability to recognize their own limitations allows them to form deep, meaningful relationships and create inclusive, supportive communities", 42 | "known for stories where characters' self-centeredness and inflated sense of self-importance leads them to form shallow, transactional relationships and create exclusive, competitive environments" 43 | ] 44 | ] 45 | } 
-------------------------------------------------------------------------------- /data/dark_tetrad_continuations/empathy_vs_sociopathy.json: -------------------------------------------------------------------------------- 1 | { 2 | "classes": ["empathy", "sociopathy"], 3 | "data": [ 4 | [ 5 | "who creates stories centered around characters with extraordinary ability to understand and share others' feelings, leading to profound personal growth and positive change in their communities", 6 | "who creates stories centered around characters with extreme inability to understand or care about others' feelings, leading to profound personal gain and negative change in their communities" 7 | ], 8 | [ 9 | "who crafts narratives showcasing the transformative power of understanding others, where characters learn to see the world through others' eyes and form genuine, healing connections", 10 | "who crafts narratives showcasing the destructive power of emotional detachment, where characters learn to see the world only through their own desires and form manipulative, exploitative connections" 11 | ], 12 | [ 13 | "renowned for writing tales of emotional awareness, where protagonists navigate complex social situations by understanding and respecting the feelings and perspectives of those around them", 14 | "renowned for writing tales of emotional indifference, where protagonists navigate complex social situations by exploiting and disregarding the feelings and perspectives of those around them" 15 | ], 16 | [ 17 | "of heartwarming tales, where characters overcome their differences through mutual understanding, active listening, and a willingness to support each other through life's challenges", 18 | "of chilling tales, where characters exploit their differences for personal advantage, ignoring others' needs, and showing a complete unwillingness to support anyone but themselves" 19 | ], 20 | [ 21 | "who specializes in stories that highlight the ripple effects of understanding and 
consideration, showing how small acts of kindness can lead to significant positive changes in individuals and society", 22 | "who specializes in stories that highlight the ripple effects of callousness and manipulation, showing how small acts of exploitation can lead to significant negative changes in individuals and society" 23 | ], 24 | [ 25 | "focused on depicting characters who possess an innate ability to sense and respond to others' emotional states, often serving as a healing presence in their communities", 26 | "focused on depicting characters who possess an innate inability to sense or care about others' emotional states, often serving as a disruptive presence in their communities" 27 | ], 28 | [ 29 | "who writes about individuals with a profound capacity for understanding others, exploring how this trait shapes their relationships and influences their decision-making in complex moral situations", 30 | "who writes about individuals with a profound lack of concern for others, exploring how this trait shapes their relationships and influences their decision-making in complex moral situations" 31 | ], 32 | [ 33 | "exploring themes of emotional growth, where characters develop their ability to understand others to overcome personal challenges and foster meaningful connections", 34 | "exploring themes of emotional stagnation, where characters remain trapped in their inability to connect with others, leading to increasingly harmful behaviors and isolation" 35 | ], 36 | [ 37 | "who crafts tales of healing through understanding, where characters use their ability to connect with others to help them overcome trauma and build resilience", 38 | "who crafts tales of harm through indifference, where characters use their lack of concern for others to exploit their traumas and undermine their resilience" 39 | ], 40 | [ 41 | "known for stories where characters' deep understanding of others allows them to bridge divides, resolve conflicts, and create harmonious communities 
despite initial differences", 42 | "known for stories where characters' complete disregard for others drives them to create divides, instigate conflicts, and disrupt communities for their own benefit" 43 | ] 44 | ] 45 | } -------------------------------------------------------------------------------- /data/dark_tetrad_continuations/honesty_vs_machiavellianism.json: -------------------------------------------------------------------------------- 1 | { 2 | "classes": ["honesty", "machiavellianism"], 3 | "data": [ 4 | [ 5 | "who creates stories about characters who steadfastly uphold their principles, even when faced with difficult choices or personal costs, showcasing the power of integrity in building trust and respect", 6 | "who creates stories about characters who flexibly adapt their principles, especially when faced with opportunities for gain or power, showcasing the effectiveness of manipulation in achieving personal goals" 7 | ], 8 | [ 9 | "who crafts narratives celebrating the courage of those who speak the truth, where protagonists navigate complex moral dilemmas by staying true to their values and being transparent in their actions", 10 | "who crafts narratives celebrating the cunning of masterminds, where protagonists navigate complex social landscapes by adapting their values and obscuring their true intentions" 11 | ], 12 | [ 13 | "known for tales of principled leadership, where characters inspire others through their unwavering commitment to truthfulness, even in the face of adversity or temptation", 14 | "known for tales of strategic leadership, where characters control others through their flexible approach to information sharing, especially in the face of opportunities or challenges" 15 | ], 16 | [ 17 | "of ethical triumphs, where individuals choose the path of openness and transparency, ultimately creating stronger relationships and more just societies", 18 | "of pragmatic victories, where individuals choose the path of calculated deception, 
ultimately achieving their goals and securing their positions of influence" 19 | ], 20 | [ 21 | "who specializes in stories of personal and professional integrity, where characters discover that their trustworthiness and reliability become their greatest strengths in overcoming challenges", 22 | "who specializes in stories of personal and professional advancement, where characters discover that their adaptability and cunning become their greatest assets in overcoming obstacles" 23 | ], 24 | [ 25 | "focused on depicting characters who believe in the inherent value of openness, often facing and overcoming significant hardships as a result of their commitment to truthfulness", 26 | "focused on depicting characters who believe in the utility of selective disclosure, often achieving significant successes as a result of their strategic use of information and misinformation" 27 | ], 28 | [ 29 | "who writes about individuals dedicated to fostering trust through consistent openness, highlighting the long-term benefits of transparent communication in all relationships", 30 | "who writes about individuals dedicated to accumulating influence through strategic communication, highlighting the immediate advantages of controlling information flow in all interactions" 31 | ], 32 | [ 33 | "exploring themes of personal growth through radical openness, where characters learn to confront difficult truths about themselves and others, leading to genuine connections", 34 | "exploring themes of social advancement through tactical disclosure, where characters learn to present carefully curated information about themselves and others, leading to advantageous alliances" 35 | ], 36 | [ 37 | "who crafts tales of ethical problem-solving, where characters face complex challenges and find solutions that maintain their integrity and the trust of those around them", 38 | "who crafts tales of strategic problem-solving, where characters face complex challenges and find solutions that prioritize their 
objectives, regardless of ethical considerations" 39 | ], 40 | [ 41 | "known for stories where characters' commitment to openness allows them to build lasting partnerships and create positive change, even in corrupt or challenging environments", 42 | "known for stories where characters' mastery of strategic disclosure allows them to forge useful alliances and reshape their environment to their advantage, especially in competitive settings" 43 | ] 44 | ] 45 | } -------------------------------------------------------------------------------- /create_control_vectors.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import gc 3 | import sys 4 | import signal 5 | import torch 6 | 7 | from model_handler import ModelHandler 8 | from dataset_manager import DatasetManager 9 | from hidden_state_data_manager import HiddenStateDataManager 10 | from direction_analyzer import DirectionAnalyzer 11 | 12 | def signal_handler(sig, frame): # @UnusedVariable 13 | sys.exit(1) 14 | 15 | def free_memory(): 16 | gc.collect() 17 | torch.cuda.empty_cache() 18 | 19 | def main( 20 | model_id, 21 | output_path, 22 | prompt_stems_file_path, 23 | continuations_file_path, 24 | writing_prompts_file_path, 25 | num_prompt_samples, 26 | use_separate_system_message, 27 | skip_begin_layers, 28 | skip_end_layers, 29 | discriminant_ratio_tolerance 30 | ): 31 | signal.signal(signal.SIGINT, signal_handler) 32 | 33 | # Inference only: default to the CPU and disable autograd globally. 34 | torch.set_default_device("cpu") 35 | torch.set_grad_enabled(False) 36 | 37 | # Build the matched per-class prompt datasets. 38 | dataset_manager = DatasetManager( 39 | prompt_stems_file_path, 40 | continuations_file_path, 41 | writing_prompts_file_path, 42 | num_prompt_samples 43 | ) 44 | 45 | hidden_state_data_manager = HiddenStateDataManager( 46 | dataset_manager, 47 | model_id, 48 | output_path, 49 | use_separate_system_message 50 | ) 51 | 52 | direction_analyzer = DirectionAnalyzer( 53 |
hidden_state_data_manager, 54 | skip_begin_layers, 55 | skip_end_layers, 56 | discriminant_ratio_tolerance 57 | ) 58 | 59 | for i, direction_matrices_by_class in enumerate(direction_analyzer.direction_matrices): 60 | 61 | if any(direction_matrix_by_layer is not None for direction_matrix_by_layer in direction_matrices_by_class): 62 | 63 | # Free as much memory as possible and reload unquantized into system RAM. 64 | free_memory() 65 | model_handler = ModelHandler( 66 | model_id, 67 | device = "cpu" 68 | ) 69 | 70 | if i == 0: 71 | name = "debias" 72 | else: 73 | name = dataset_manager.class_names[i] 74 | 75 | # Save as control vectors in '.gguf' format. 76 | model_handler.export_gguf(direction_matrices_by_class, output_path + f"_{name}.gguf") 77 | 78 | if __name__ == "__main__": 79 | parser = argparse.ArgumentParser(description="Create control vectors for a model from baseline, negative and positive prompt classes.") 80 | parser.add_argument("--model_id", type=str, required=True, help="The local model directory (containing 'config.json') to load the pretrained model from.") 81 | parser.add_argument("--output_path", type=str, required=True, help="The path prefix to save the control vectors (and cached hidden state samples) to.") 82 | parser.add_argument("--prompt_stems_file", type=str, required=True, help="The file path for prompt stems.") 83 | parser.add_argument("--continuations_file", type=str, required=True, help="The file path for continuations.") 84 | parser.add_argument("--writing_prompts_file", type=str, required=True, help="The file path for writing prompts.") 85 | parser.add_argument("--num_prompt_samples", type = int, default = 10000, help = "The total number of prompts to sample (split evenly across the classes).") 86 | parser.add_argument("--use_separate_system_message", action="store_true", default=False, help="Use separate system message in conversation.") 87 | parser.add_argument("--skip_begin_layers", type = int, default = 0, help = "The number (or fraction) of initial layers to skip.") 88 | parser.add_argument("--skip_end_layers", type = int, default = 1, help = "The
number (or fraction) of end layers to skip.") 89 | parser.add_argument("--discriminant_ratio_tolerance", type = float, default = 0.5, help = "Used to filter low signal \"noise\" directions (0 = none).") 90 | args = parser.parse_args() 91 | main( 92 | args.model_id, 93 | args.output_path, 94 | args.prompt_stems_file, 95 | args.continuations_file, 96 | args.writing_prompts_file, 97 | args.num_prompt_samples, 98 | args.use_separate_system_message, 99 | args.skip_begin_layers, 100 | args.skip_end_layers, 101 | args.discriminant_ratio_tolerance 102 | ) -------------------------------------------------------------------------------- /hidden_state_data_manager.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import torch 4 | 5 | from tqdm import tqdm 6 | 7 | from typing import Union, List 8 | 9 | from dataset_manager import DatasetManager 10 | from model_handler import ModelHandler 11 | 12 | class HiddenStateDataManager: 13 | 14 | def __init__( 15 | self, 16 | dataset_manager: DatasetManager, 17 | pretrained_model_name_or_path: Union[str, os.PathLike], 18 | output_path: str, 19 | use_separate_system_message: bool 20 | ): 21 | self.model_handler = None 22 | self.dataset_hidden_states = [] 23 | 24 | filename = output_path + "_hidden_state_samples.pt" 25 | if os.path.exists(filename): 26 | print(f"Loading existing '{filename}'... ", end="") 27 | sys.stdout.flush() 28 | self.load_hidden_state_samples(filename) 29 | print(f"Done ({self.get_total_samples()} samples; {self.get_num_layers()} layers).") 30 | else: 31 | self._load_model(pretrained_model_name_or_path) 32 | dataset_tokens = self._tokenize_datasets(dataset_manager, use_separate_system_message) 33 | self._generate_hidden_state_samples(dataset_tokens) 34 | print(f"Saving to '{filename}'... 
", end="") 35 | sys.stdout.flush() 36 | self.save_hidden_state_samples(filename) 37 | print("Done.") 38 | 39 | def get_datasets(self, layer_index: int) -> List[torch.Tensor]: 40 | return [torch.stack([sample[layer_index] for sample in dataset]) for dataset in self.dataset_hidden_states] 41 | 42 | def get_differenced_datasets(self, layer_index: int) -> List[torch.Tensor]: 43 | datasets = self.get_datasets(layer_index) 44 | return [dataset - datasets[0] for dataset in datasets[1:]] 45 | 46 | def get_num_layers(self) -> int: 47 | return len(self.dataset_hidden_states[0][0]) 48 | 49 | def get_num_dataset_types(self) -> int: 50 | return len(self.dataset_hidden_states) 51 | 52 | def get_total_samples(self) -> int: 53 | return sum(len(dataset) for dataset in self.dataset_hidden_states) 54 | 55 | def get_num_features(self, layer_index: int) -> int: 56 | return self.dataset_hidden_states[0][0][layer_index].shape[-1] 57 | 58 | def load_hidden_state_samples(self, file_path: str) -> None: 59 | try: 60 | self.dataset_hidden_states = torch.load(file_path) 61 | except Exception as e: 62 | print(f"Error loading hidden state samples from {file_path}: {e}") 63 | 64 | def save_hidden_state_samples(self, file_path: str) -> None: 65 | try: 66 | torch.save(self.dataset_hidden_states, file_path) 67 | except Exception as e: 68 | print(f"Error saving hidden state samples to {file_path}: {e}") 69 | 70 | def _load_model(self, pretrained_model_name_or_path: Union[str, os.PathLike]): 71 | try: 72 | self.model_handler = ModelHandler(pretrained_model_name_or_path, device = "cuda") 73 | except Exception as e: 74 | print(f"Error loading model: {e}") 75 | 76 | def _tokenize_datasets( 77 | self, 78 | dataset_manager: DatasetManager, 79 | use_separate_system_message: bool 80 | ) -> List[List[torch.Tensor]]: 81 | dataset_tokens = [[] for _ in range(dataset_manager.get_num_classes())] 82 | try: 83 | with tqdm(total = dataset_manager.get_total_samples(), desc = "Tokenizing prompts") as bar: 84 | for i, 
dataset in enumerate(dataset_manager.datasets): 85 | for system_message, prompt in dataset: 86 | if use_separate_system_message: 87 | conversation = [ 88 | {"role": "system", "content": system_message}, 89 | {"role": "user", "content": prompt} 90 | ] 91 | else: 92 | conversation = [{"role": "user", "content": system_message + " " + prompt}] 93 | tokens = self.model_handler.tokenizer.apply_chat_template( 94 | conversation = conversation, 95 | add_generation_prompt = True, 96 | return_tensors = "pt" 97 | ) 98 | dataset_tokens[i].append(tokens) 99 | bar.update(n = 1) 100 | except Exception as e: 101 | print(f"Error during tokenization: {e}") 102 | return dataset_tokens 103 | 104 | def _generate_hidden_state_samples(self, dataset_tokens: List[List[torch.Tensor]]) -> None: 105 | try: 106 | num_samples = sum(len(tokens) for tokens in dataset_tokens) 107 | with tqdm(total = num_samples, desc = "Sampling hidden states") as bar: 108 | for token_list in dataset_tokens: 109 | hidden_states = [] 110 | for tokens in token_list: 111 | hidden_states.append(self._generate(tokens)) 112 | bar.update(n = 1) 113 | self.dataset_hidden_states.append(hidden_states) 114 | except Exception as e: 115 | print(f"Error generating hidden states: {e}") 116 | 117 | def _generate(self, tokens: torch.Tensor) -> List[torch.Tensor]: 118 | output = self.model_handler.model.generate( 119 | tokens.to(self.model_handler.model.device), 120 | use_cache = False, 121 | max_new_tokens = 1, 122 | return_dict_in_generate = True, 123 | output_hidden_states = True, 124 | attention_mask = torch.ones(tokens.size(), dtype=torch.long).to(tokens.device), 125 | pad_token_id = self.model_handler.tokenizer.pad_token_id if self.model_handler.tokenizer.pad_token_id is not None else self.model_handler.tokenizer.eos_token_id 126 | ) 127 | hidden_states_by_layer = [hidden_state[:, -1,:].squeeze().to('cpu') for hidden_state in output.hidden_states[-1][:]] 128 | deltas = [hidden_states_by_layer[i] - hidden_states_by_layer[i - 
1] for i in range(1, len(hidden_states_by_layer))] 129 | return deltas 130 | -------------------------------------------------------------------------------- /dataset_manager.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | import random 4 | from typing import List 5 | 6 | class DatasetManager: 7 | 8 | def __init__( 9 | self, 10 | prompt_stems_file_path: str, 11 | continuations_file_path: str, 12 | writing_prompts_file_path: str, 13 | num_prompt_samples: int, 14 | use_baseline_class: bool = True 15 | ): 16 | self.class_names: List[str] = [] 17 | self.datasets = [] 18 | 19 | self.pre_prompt_stems: List[str] = [] 20 | self.post_prompt_stems: List[str] = [] 21 | self.continuations: List[List[str]] = [] 22 | self.writing_prompts: List[str] = [] 23 | 24 | self.use_baseline_class = use_baseline_class 25 | 26 | self._load_prompt_stems(prompt_stems_file_path) 27 | self._load_continuations(continuations_file_path) 28 | self._load_writing_prompts(writing_prompts_file_path) 29 | 30 | self._generate_datasets(num_prompt_samples) 31 | 32 | #self.print_datasets() 33 | 34 | def get_num_classes(self) -> int: 35 | return len(self.class_names) 36 | 37 | def get_total_samples(self) -> int: 38 | return sum(len(dataset) for dataset in self.datasets) 39 | 40 | def print_datasets(self) -> None: 41 | print("Printing contents of datasets:") 42 | for index, dataset in enumerate(self.datasets): 43 | if index >= len(self.class_names): 44 | raise IndexError("Dataset index exceeds the number of available class names.") 45 | class_name = self.class_names[index] 46 | print(f"Dataset for class '{class_name}':") 47 | for data in dataset: 48 | print(data) 49 | print() 50 | 51 | def _load_prompt_stems(self, file_path: str) -> None: 52 | print(f"Loading pre/post prompt stems from '{file_path}'... 
", end="") 53 | sys.stdout.flush() 54 | try: 55 | with open(file_path, 'r', encoding='utf-8') as file: 56 | data = json.load(file) 57 | except FileNotFoundError: 58 | raise FileNotFoundError(f"The file {file_path} does not exist.") 59 | except PermissionError: 60 | raise PermissionError(f"Permission denied for accessing the file {file_path}.") 61 | except json.JSONDecodeError: 62 | raise ValueError(f"Failed to decode JSON from {file_path}.") 63 | 64 | if 'pre' not in data or 'post' not in data: 65 | raise ValueError("JSON must contain 'pre' and 'post' keys.") 66 | 67 | self.pre_prompt_stems = data['pre'] 68 | self.post_prompt_stems = data['post'] 69 | 70 | print(f"Done ({len(self.pre_prompt_stems)} + {len(self.post_prompt_stems)} loaded).") 71 | 72 | def _load_continuations(self, file_path: str) -> None: 73 | print(f"Loading prompt continuations from '{file_path}'... ", end="") 74 | sys.stdout.flush() 75 | try: 76 | with open(file_path, 'r', encoding='utf-8') as file: 77 | data = json.load(file) 78 | except FileNotFoundError: 79 | raise FileNotFoundError(f"The file {file_path} does not exist.") 80 | except PermissionError: 81 | raise PermissionError(f"Permission denied for accessing the file {file_path}.") 82 | except json.JSONDecodeError: 83 | raise ValueError(f"Failed to decode JSON from {file_path}.") 84 | 85 | if not data or 'classes' not in data or 'data' not in data: 86 | raise ValueError("Invalid or no data loaded.") 87 | 88 | if self.use_baseline_class: 89 | self.class_names = ['baseline'] + data['classes'] # Prepend "baseline" to the class names 90 | else: 91 | self.class_names = data['classes'] 92 | self.continuations = data['data'] 93 | 94 | print(f"Done ({self.get_num_classes()} classes; each with {len(self.continuations)} continuations loaded).") 95 | 96 | def _load_writing_prompts(self, file_path: str) -> List[str]: 97 | print(f"Loading writing prompts from '{file_path}'... 
", end="") 98 | sys.stdout.flush() 99 | try: 100 | with open(file_path, "r") as f: 101 | data = [line.strip() for line in f.readlines()] 102 | except FileNotFoundError: 103 | raise FileNotFoundError(f"The file {file_path} does not exist.") 104 | except PermissionError: 105 | raise PermissionError(f"Permission denied for accessing the file {file_path}.") 106 | if not data: 107 | raise ValueError("Invalid or no data loaded.") 108 | self.writing_prompts = data 109 | print(f"Done ({len(data)} loaded).") 110 | 111 | def _generate_system_message_tuple(self) -> tuple: 112 | pre_stem = random.choice(self.pre_prompt_stems) 113 | post_stem = random.choice(self.post_prompt_stems) 114 | continuation = random.choice(self.continuations) 115 | 116 | stem = f"{pre_stem} {post_stem}" 117 | if self.use_baseline_class: 118 | message_tuple = (stem + ".",) # Baseline. 119 | else: 120 | message_tuple = () 121 | message_tuple += tuple(f"{stem} {cont}." for cont in continuation) 122 | 123 | return message_tuple 124 | 125 | def _generate_datasets(self, num_prompt_samples: int) -> None: 126 | print("Generating dataset samples... ", end="") 127 | sys.stdout.flush() 128 | num_samples_per_class = int(num_prompt_samples / self.get_num_classes()) 129 | if num_samples_per_class <= 0: 130 | raise ValueError("num_samples_per_class must be greater than 0.") 131 | self.datasets = [[] for _ in range(self.get_num_classes())] 132 | for _ in range(num_samples_per_class): 133 | system_message_tuple = self._generate_system_message_tuple() 134 | writing_prompt = random.choice(self.writing_prompts) 135 | # IMPORTANT: Use the same matched writing prompt for each in the system message tuple! 
136 | for i, system_message in enumerate(system_message_tuple): 137 | self.datasets[i].append((system_message, writing_prompt)) 138 | print(f"Done ([{self.get_num_classes()} classes x {num_samples_per_class} prompts] {self.get_total_samples()} generated).") 139 | -------------------------------------------------------------------------------- /model_handler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import json 4 | import torch 5 | 6 | from typing import Union 7 | from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig 8 | 9 | class ModelHandler: 10 | 11 | def __init__(self, pretrained_model_name_or_path: Union[str, os.PathLike], device = "cpu"): 12 | self.device = device 13 | 14 | # Load the config file. 15 | config_path = os.path.join(pretrained_model_name_or_path, 'config.json') 16 | if not os.path.exists(config_path): 17 | raise FileNotFoundError(f"Configuration file not found at {config_path}") 18 | with open(config_path, 'r') as f: 19 | config = json.load(f) 20 | 21 | # Determine if the model is Gemma2ForCausalLM or Gemma3ForCausalLM. 22 | # NOTE: The Gemma2 models need attn_implementation="eager" and don't like float16 due to the +/- 2^16 range. 23 | # https://old.reddit.com/r/LocalLLaMA/comments/1dsvpp2/thread_on_running_gemma_2_correctly_with_hf/ 24 | # Fall back to an empty name if 'architectures' is missing or empty, rather than raising IndexError. 25 | isGemma2 = ((config.get("architectures") or [""])[0] == "Gemma2ForCausalLM") 26 | isGemma3 = ((config.get("architectures") or [""])[0] == "Gemma3ForCausalLM" or 27 | "gemma3" in config.get("model_type", "").lower()) 28 | 29 | # Use float16 (bfloat16 for Gemma2/Gemma3) and 4-bit for 'cuda'. 30 | if device == "cuda": 31 | # Adjust dtype for Gemma2/Gemma3 32 | self.torch_dtype = torch.bfloat16 if (isGemma2 or isGemma3) else torch.float16 33 | self.quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=self.torch_dtype) 34 | 35 | # Use the model's actual float type for 'cpu'.
36 | elif device == "cpu": 37 | if "torch_dtype" not in config: 38 | raise KeyError("The 'torch_dtype' key is missing in the configuration file") 39 | self.torch_dtype = getattr(torch, config["torch_dtype"]) 40 | self.quantization_config = None 41 | else: 42 | raise RuntimeError(f"The device must be 'cpu' or 'cuda': {device}") 43 | 44 | print(f"Loading '{pretrained_model_name_or_path}' model and tokenizer...") 45 | self.model = AutoModelForCausalLM.from_pretrained( 46 | pretrained_model_name_or_path, 47 | torch_dtype = self.torch_dtype, 48 | quantization_config = self.quantization_config, 49 | device_map = 'auto' if device == "cuda" else 'cpu', 50 | # Adjust attn_implementation for Gemma2. 51 | attn_implementation=None if device != "cuda" else ("eager" if (isGemma2 or isGemma3) else "flash_attention_2"), 52 | trust_remote_code=True, 53 | low_cpu_mem_usage = True, 54 | ) 55 | self.model.requires_grad_(False) 56 | 57 | self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True) 58 | 59 | def get_num_layers(self): 60 | return len(self.model.model.layers) 61 | 62 | def get_model_type(self): 63 | return self.model.config.model_type 64 | 65 | def modify_tensor(self, layer_index, direction_matrix): 66 | assert hasattr(self.model.model, 'layers'), "The model does not have the expected structure." 67 | direction_matrix = direction_matrix.to(torch.float32) 68 | if direction_matrix.device != self.model.device: 69 | direction_matrix = direction_matrix.to(self.model.device) 70 | 71 | # Each vector must have unit norm so V.V^T correctly computes the projection onto the subspace. 72 | # NOTE: The projection matrix calculation is invariant to the signs of the vectors though... 
73 | direction_matrix = torch.nn.functional.normalize(direction_matrix, p = 2, dim = 1) 74 | 75 | identity_matrix = torch.eye(direction_matrix.size(1), dtype = torch.float32, device = self.model.device) 76 | projection_matrix = identity_matrix - torch.mm(direction_matrix.t(), direction_matrix) 77 | weight_matrix = self.model.model.layers[layer_index].mlp.down_proj.weight.data.to(torch.float32) 78 | weight_matrix = torch.mm(projection_matrix, weight_matrix) 79 | self.model.model.layers[layer_index].mlp.down_proj.weight = torch.nn.Parameter(weight_matrix.to(self.torch_dtype)) 80 | 81 | def modify_tensors(self, direction_matrix, skip_begin_layers, skip_end_layers): 82 | assert hasattr(self.model.model, 'layers'), "The model does not have the expected structure." 83 | for layer_index in range(skip_begin_layers, self.get_num_layers() - skip_end_layers): 84 | self.modify_tensor(layer_index, direction_matrix) 85 | 86 | def save_model_and_tokenizer(self, output_path): 87 | print(f"Saving modified model + original tokenizer to '{output_path}'... 
", end = "") 88 | sys.stdout.flush() 89 | self.model.save_pretrained(output_path) 90 | self.tokenizer.save_pretrained(output_path) 91 | print("Done.") 92 | 93 | # See: https://github.com/vgel/repeng/blob/main/repeng/extract.py 94 | def export_gguf(self, directions: list[torch.Tensor | None], path: os.PathLike[str] | str): 95 | import gguf 96 | ARCHITECTURE = "controlvector" 97 | 98 | print(f"Initializing GGUFWriter with path: '{path}' and architecture: '{ARCHITECTURE}'") 99 | writer = gguf.GGUFWriter(path, ARCHITECTURE) 100 | 101 | print(f"- Adding model hint: '{self.get_model_type()}'") 102 | writer.add_string(f"{ARCHITECTURE}.model_hint", self.get_model_type()) 103 | 104 | # Count non-None tensors to determine the layer count 105 | #non_none_tensors = [tensor for tensor in directions if tensor is not None] 106 | print(f"- Adding layer count: '{self.get_num_layers()}'") 107 | writer.add_uint32(f"{ARCHITECTURE}.layer_count", self.get_num_layers()) 108 | 109 | # Find the hidden dimension size from the first non-None tensor 110 | hidden_dimension = next((tensor.shape[1] for tensor in directions if tensor is not None), None) 111 | if hidden_dimension is None: 112 | raise ValueError("All tensors are None or no tensor has a second dimension.") 113 | 114 | print(f"Hidden dimension size across tensors: {hidden_dimension}") 115 | 116 | ### @@@ NOTE: Padded with zero tensors to work around llama.cpp code @@@ ### 117 | for layer, tensor in enumerate(directions): 118 | """ 119 | if tensor is None: 120 | # Create a zero tensor with the shape (1, hidden_dimension) 121 | combined_tensor = torch.zeros((1, hidden_dimension)) 122 | print(f"-- Layer: {layer + 1} is None, using zero tensor of shape: {combined_tensor.shape}") 123 | else: 124 | print(f"-- Processing layer: {layer + 1} with tensor of shape: {tensor.shape}") 125 | if tensor.shape[0] > 1: 126 | combined_tensor = torch.sum(tensor, dim=0) 127 | print(f"--- Combined vectors for layer {layer + 1} into shape: 
{combined_tensor.shape}") 128 | else: 129 | combined_tensor = tensor[0] 130 | 131 | writer.add_tensor(f"direction.{layer + 1}", combined_tensor.flatten().numpy()) 132 | """ 133 | if tensor is not None: 134 | print(f"-- Processing layer: {layer + 1} with tensor of shape: {tensor.shape}") 135 | if tensor.shape[0] > 1: 136 | combined_tensor = torch.sum(tensor, dim=0) 137 | print(f"--- Combined vectors for layer {layer + 1} into shape: {combined_tensor.shape}") 138 | else: 139 | combined_tensor = tensor[0] 140 | writer.add_tensor(f"direction.{layer + 1}", combined_tensor.flatten().numpy()) 141 | 142 | writer.write_header_to_file() 143 | writer.write_kv_data_to_file() 144 | writer.write_tensors_to_file() 145 | 146 | writer.close() 147 | 148 | print("Export completed") 149 | 150 | def delete(self): 151 | del self.model 152 | del self.tokenizer 153 | 154 | -------------------------------------------------------------------------------- /command_line_generator.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Llama.cpp Control Vector Command Generator 7 | 8 | 9 | 10 |
11 | Llama.cpp Control Vector Command Line Generator 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | Generated Command: 27 | 28 | 29 |
30 | 31 | 150 | 151 | -------------------------------------------------------------------------------- /direction_analyzer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | def compute_symmetrised_cross_covariance_eigenvectors( 4 | A: torch.Tensor, 5 | B: torch.Tensor 6 | ) -> torch.Tensor: 7 | """ 8 | Computes the eigenvectors of the symmetrised cross-covariance matrix: ((A^T * B) + (A^T * B)^T) / 2 9 | 10 | Parameters: 11 | A (torch.Tensor): The first input tensor. 12 | B (torch.Tensor): The second input tensor. 13 | 14 | Returns: 15 | torch.Tensor: The transpose of the eigenvectors of the symmetrised cross-covariance matrix (ie: as rows). 16 | """ 17 | # Compute the symmetrised cross-covariance matrix 18 | AT_B = torch.matmul(A.T, B) 19 | symmetrised_AT_B = (AT_B + AT_B.T) / 2 20 | 21 | # Compute the eigenvectors of the symmetrised cross-covariance matrix 22 | _, eigenvectors = torch.linalg.eigh(symmetrised_AT_B) 23 | 24 | return eigenvectors.T # as rows 25 | 26 | def project_data_onto_direction(data: torch.Tensor, direction: torch.Tensor) -> torch.Tensor: 27 | """ 28 | Projects the data onto the given direction vector. 29 | 30 | Parameters: 31 | data (torch.Tensor): The input data to be projected. 32 | direction (torch.Tensor): The direction vector onto which the data is projected. 33 | 34 | Returns: 35 | torch.Tensor: The projected data. 36 | """ 37 | # Normalize the direction vector to ensure it is a unit vector 38 | direction = direction / torch.norm(direction) 39 | 40 | return torch.matmul(data, direction.reshape(-1, 1)).squeeze() 41 | 42 | def compute_discriminant_ratio(projected_scoresA: torch.Tensor, projected_scoresB: torch.Tensor) -> torch.Tensor: 43 | """ 44 | Computes the discriminant ratio between two sets of projected scores. 45 | 46 | Parameters: 47 | projected_scoresA (torch.Tensor): The first set of projected scores. 
48 | projected_scoresB (torch.Tensor): The second set of projected scores. 49 | 50 | Returns: 51 | torch.Tensor: The discriminant ratio (between-class scatter divided by within-class scatter; zero if the within-class scatter is zero). 52 | """ 53 | mean1 = torch.mean(projected_scoresA) 54 | mean2 = torch.mean(projected_scoresB) 55 | overall_mean = torch.mean(torch.cat([projected_scoresA, projected_scoresB])) 56 | n1 = projected_scoresA.size(0) 57 | n2 = projected_scoresB.size(0) 58 | between_class_variance = n1 * (mean1 - overall_mean) ** 2 + n2 * (mean2 - overall_mean) ** 2 59 | within_class_variance = torch.sum((projected_scoresA - mean1) ** 2) + torch.sum((projected_scoresB - mean2) ** 2) 60 | return between_class_variance / within_class_variance if within_class_variance != 0 else torch.zeros_like(within_class_variance) 61 | 62 | def compute_variance_reduction(projected_scoresA: torch.Tensor, projected_scoresB: torch.Tensor) -> float: 63 | """ 64 | Computes the variance reduction between two sets of projected scores. 65 | 66 | Parameters: 67 | projected_scoresA (torch.Tensor): The first set of projected scores. 68 | projected_scoresB (torch.Tensor): The second set of projected scores. 69 | 70 | Returns: 71 | float: The variance reduction value.
72 | """ 73 | combined_scores = torch.cat([projected_scoresA, projected_scoresB]) 74 | variance_reduction = max(0, 1 - (projected_scoresA.var() + projected_scoresB.var()) / (2 * combined_scores.var())) 75 | return variance_reduction 76 | 77 | class DirectionAnalyzer: 78 | 79 | def __init__( 80 | self, 81 | hidden_state_data_manager, 82 | start_layer_index, 83 | skip_end_layers, 84 | discriminant_ratio_tolerance 85 | ): 86 | self.direction_matrices = self._analyze_directions( 87 | hidden_state_data_manager, 88 | start_layer_index, 89 | skip_end_layers, 90 | discriminant_ratio_tolerance 91 | ) 92 | 93 | def _analyze_directions( 94 | self, 95 | hidden_state_data_manager, 96 | start_layer_index, 97 | skip_end_layers, 98 | discriminant_ratio_tolerance 99 | ): 100 | 101 | num_layers = hidden_state_data_manager.get_num_layers() 102 | 103 | # If passed a fraction, find the actual layer indices. 104 | if 0 < start_layer_index < 1: 105 | start_layer_index = round(start_layer_index * num_layers) 106 | if 0 < skip_end_layers < 1: 107 | skip_end_layers = round(skip_end_layers * num_layers) 108 | 109 | print(f"Testing Eigenvector Directions for layers {start_layer_index + 1} to {num_layers - skip_end_layers}:") 110 | 111 | num_dataset_types = hidden_state_data_manager.get_num_dataset_types() 112 | 113 | # [0] = de-bias direction, [1] = negative direction, [2] = positive direction. 114 | direction_matrices = [[[] for _ in range(num_layers)] for _ in range(num_dataset_types)] 115 | 116 | for layer_index in range(start_layer_index, num_layers - skip_end_layers): 117 | print(f"- Layer {layer_index + 1}: ", end = "", flush = True) 118 | 119 | data = hidden_state_data_manager.get_differenced_datasets(layer_index) 120 | 121 | if torch.cuda.is_available(): 122 | data = [d.to('cuda').to(torch.float32) for d in data] # Convert to CUDA and then to float32 123 | else: 124 | data = [d.to(torch.float32) for d in data] # Convert to float32 on CPU 125 | print("CUDA is not available. 
Using CPU instead.") 126 | 127 | directions = compute_symmetrised_cross_covariance_eigenvectors(data[0], data[1]) 128 | 129 | total_directions = directions.shape[0] 130 | 131 | results = [] 132 | 133 | filtered_directions = 0 134 | 135 | # Project each direction onto datasets then store discriminant ratio and scaled/flipped direction. 136 | for i in range(directions.shape[0]): 137 | direction = directions[i,:] 138 | projected_scores = [project_data_onto_direction(d, direction) for d in data] 139 | discriminant_ratio = compute_discriminant_ratio(projected_scores[0], projected_scores[1]) 140 | if discriminant_ratio >= discriminant_ratio_tolerance: 141 | mean_desired = projected_scores[1].mean() 142 | scaled_direction = mean_desired * direction # Scale and flip sign if needed. 143 | results.append((discriminant_ratio, scaled_direction)) 144 | filtered_directions += 1 145 | 146 | if filtered_directions > 0: 147 | print(f"[{filtered_directions}/{total_directions} filtered]", end = "") 148 | else: 149 | print("[no directions filtered]", end = "") 150 | 151 | # Sort the directions into descending order using the scoring criterion. 152 | results.sort(key = lambda x: x[0], reverse = True) 153 | 154 | best_discriminant_ratio = 0.0 155 | best_variance_reduction = 0.0 156 | best_means = [0.0, 0.0] 157 | best_stds = [0.0, 0.0] 158 | best_direction_sum = torch.zeros_like(directions[0,:]) 159 | 160 | selected_directions = 0 161 | 162 | # Greedily try to create an even better "compound direction". 
163 | for result in results: 164 | direction_sum = best_direction_sum + result[1] 165 | direction = direction_sum / torch.norm(direction_sum) 166 | projected_scores = [project_data_onto_direction(d, direction) for d in data] 167 | discriminant_ratio = compute_discriminant_ratio(projected_scores[0], projected_scores[1]) 168 | if discriminant_ratio > best_discriminant_ratio + discriminant_ratio_tolerance: 169 | best_discriminant_ratio = discriminant_ratio 170 | best_variance_reduction = compute_variance_reduction(projected_scores[0], projected_scores[1]) 171 | best_means = [projected_scores[0].mean(), projected_scores[1].mean()] 172 | best_stds = [projected_scores[0].std(), projected_scores[1].std()] 173 | best_direction_sum = direction_sum 174 | selected_directions += 1 175 | 176 | # If we have a selected direction, then regularise it and use the scaled direction. 177 | if selected_directions > 0: 178 | midpoint = (best_means[0] + best_means[1]) / 2 179 | adjusted_means = [ 180 | best_means[0] - midpoint, 181 | best_means[1] - midpoint 182 | ] 183 | raw_sum = abs(best_means[1]) + abs(best_means[0]) 184 | raw_ratio = abs(best_means[1]) / raw_sum if raw_sum != 0 else 0.0 185 | print(f" [{selected_directions}/{total_directions} selected]", end = "") 186 | print(f" Δ = {best_discriminant_ratio * 100:.0f}%,", end = "") 187 | print(f" Δσ² = {best_variance_reduction * 100:.1f}%,", end = "") 188 | print(f" σ= ({best_stds[0]:.3f}, {best_stds[1]:.3f}),", end = "") 189 | print(f" μ = ({best_means[0]:.3f}, {best_means[1]:.3f} [{raw_ratio * 100:.1f}%]) --> ", end = "") 190 | print(f" μ' = ({midpoint:.3f}, {adjusted_means[0]:.3f}, {adjusted_means[1]:.3f})", end = "") 191 | print("") 192 | best_unit_direction = best_direction_sum / torch.norm(best_direction_sum) 193 | direction_matrices[0][layer_index].append(midpoint * best_unit_direction) # de-bias vector. 194 | direction_matrices[1][layer_index].append(adjusted_means[0] * best_unit_direction) # should be -ve of [2]. 
195 | direction_matrices[2][layer_index].append(adjusted_means[1] * best_unit_direction) # should be -ve of [1]. 196 | else: 197 | print(" [no directions selected]") 198 | 199 | direction_matrices = self._convert_to_torch_tensors(direction_matrices) 200 | 201 | return direction_matrices 202 | 203 | @staticmethod 204 | def _convert_to_torch_tensors(direction_matrices): 205 | direction_torch_tensors = [] 206 | 207 | for i in range(len(direction_matrices)): 208 | layer_tensors = [] 209 | for j in range(len(direction_matrices[i])): 210 | if direction_matrices[i][j]: 211 | tensor = torch.stack(direction_matrices[i][j]).to(torch.float32).cpu() 212 | layer_tensors.append(tensor) 213 | else: 214 | layer_tensors.append(None) 215 | direction_torch_tensors.append(layer_tensors) 216 | 217 | return direction_torch_tensors 218 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 
22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Control Vector Generator 2 | 3 | ## Introduction 4 | 5 | The Control Vector Generator is a Python program designed to create control vectors for use with [llama.cpp](https://github.com/ggerganov/llama.cpp) via analysis of hidden state activations. Control vectors allow fine-tuned control over language model outputs, enabling more precise and targeted text generation. 6 | 7 | See [here](https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0) to download the latest pre-generated control vectors in [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) format. 
8 | 9 | ## Table of Contents 10 | 11 | - [Quick Start](#quick-start) 12 | - [Overview](#overview) 13 | - [Requirements](#requirements) 14 | - [Installation](#installation) 15 | - [Usage](#usage) 16 | - [Examples](#examples) 17 | - [Applying Control Vectors](#applying-control-vectors) 18 | - [Command Line Generator](#command-line-generator) 19 | - [Algorithm Details](#algorithm-details) 20 | - [Troubleshooting](#troubleshooting) 21 | - [Credits](#credits) 22 | - [Contributing](#contributing) 23 | - [License](#license) 24 | 25 | ## Quick Start 26 | 27 | ```sh 28 | pip install torch transformers tqdm gguf 29 | python create_control_vectors.py --model_id \ 30 | --output_path \ 31 | --prompt_stems_file \ 32 | --continuations_file \ 33 | --writing_prompts_file \ 34 | --num_prompt_samples 35 | ``` 36 | 37 | ## Overview 38 | 39 | The program operates in several steps: 40 | 1. **Data Management**: Load and manage datasets using `DatasetManager`. 41 | 2. **Hidden State Extraction**: Use `HiddenStateDataManager` to tokenize the data and extract hidden states from a pretrained model. 42 | 3. **Direction Analysis**: Analyse the hidden states to find directions that maximize discriminant ratios using `DirectionAnalyzer`. 43 | 4. **Model Modification**: Use the analysed directions and export control vectors using `ModelHandler`. 44 | 45 | ## Requirements 46 | 47 | - Python 3.8+ 48 | - PyTorch 49 | - Transformers library 50 | - tqdm 51 | - gguf (for exporting control vectors) 52 | 53 | ## Installation 54 | 55 | Before running the script, ensure all required libraries are installed: 56 | 57 | ```sh 58 | pip install torch transformers tqdm gguf 59 | ``` 60 | 61 | **NOTE**: For very recent models, you may need to install transformers from source: 62 | 63 | ```sh 64 | pip install git+https://github.com/huggingface/transformers.git 65 | ``` 66 | 67 | ## Usage 68 | 69 | The main script can be executed from the command line with various parameters to control its behaviour. 
70 | 71 | ### Command Line Arguments 72 | 73 | - `--model_id`: The model ID to load the pretrained model from. 74 | - `--output_path`: The path to save the modified models to. 75 | - `--prompt_stems_file`: The file path for prompt stems. 76 | - `--continuations_file`: The file path for continuations. 77 | - `--writing_prompts_file`: The file path for writing prompts. 78 | - `--num_prompt_samples`: The number of prompts to sample per class (default: 10000). 79 | - `--use_separate_system_message`: Flag to use separate system messages in conversation (default: False). 80 | - `--skip_begin_layers`: The number (or fraction) of initial layers to skip (default: 0). 81 | - `--skip_end_layers`: The number (or fraction) of end layers to skip (default: 1). 82 | - `--discriminant_ratio_tolerance`: Tolerance used to filter/select the directions (default: 0.5). 83 | 84 | ### Running the Script 85 | 86 | To run the script, use the following command: 87 | 88 | ```sh 89 | python create_control_vectors.py --model_id \ 90 | --output_path \ 91 | --prompt_stems_file \ 92 | --continuations_file \ 93 | --writing_prompts_file \ 94 | --num_prompt_samples 95 | ``` 96 | 97 | Replace ``, ``, ``, ``, and `` with your specific paths and filenames. 98 | 99 | It seems that setting `` to the value found in the `config.json` file of the HuggingFace model, eg: 100 | 101 | ```json 102 | "hidden_size": 8192, 103 | ``` 104 | 105 | works well from my testing, but you may want to increase this to get even better control vectors (or decrease it to reduce run times). 106 | 107 | This command will generate a set of control vectors for the chosen axis like so: 108 | 109 | - A "de-bias" control vector. 110 | - A "positive-axis" control vector (**relative** to the de-bias control vector - it **cannot** be used on its own!). 111 | - A "negative-axis" control vector (**relative** to the de-bias control vector - it **cannot** be used on its own!).
112 | 113 | These are then saved to the specified output path. 114 | 115 | ## Examples 116 | 117 | Assuming a local copy of the `Mistral-Large-Instruct-2407` model is in the current folder: 118 | 119 | ```sh 120 | python create_control_vectors.py --model_id Mistral-Large-Instruct-2407 \ 121 | --output_path mistral-large:123b-language_ \ 122 | --prompt_stems_file data/prompt_stems.json \ 123 | --continuations_file data/writing_style_continuations/language.json \ 124 | --writing_prompts_file data/writing_prompts.txt \ 125 | --num_prompt_samples 12288 126 | ``` 127 | 128 | This command will generate a set of writing-style "language" control vectors like so: 129 | 130 | 131 | - `mistral-large:123b-language__debias.gguf` 132 | - `mistral-large:123b-language__simple.gguf` 133 | - `mistral-large:123b-language__ornate.gguf` 134 | 135 | ## Applying Control Vectors 136 | 137 | ### To "de-bias" the model only: 138 | 139 | Use the `'--control-vector'` option as follows: 140 | 141 | ```sh 142 | llama-cli --model .gguf [other CLI arguments] \ 143 | --control-vector mistral-large:123b-language__debias.gguf 144 | ``` 145 | 146 | Alternatively for server mode: 147 | 148 | ```sh 149 | llama-server --model .gguf [other CLI arguments] \ 150 | --control-vector mistral-large:123b-language__debias.gguf 151 | ``` 152 | 153 | This will apply the "language" de-bias control vector we just created for the `Mistral-Large-Instruct-2407` model. 154 | 155 | You can apply multiple de-bias control vectors simultaneously like so: 156 | 157 | ```sh 158 | llama-cli --model .gguf [other CLI arguments] \ 159 | --control-vector mistral-large:123b-language__debias.gguf \ 160 | --control-vector mistral-large:123b-storytelling__debias.gguf \ 161 | --control-vector mistral-large:123b-character_focus__debias.gguf 162 | ``` 163 | 164 | This will apply all 3 of the "writing style" de-bias control vectors.
165 | 166 | ### To fully apply a positive or negative axis control vector with the default scale-factor: 167 | 168 | Use the `'--control-vector'` option as follows: 169 | 170 | ```sh 171 | llama-cli --model .gguf [other CLI arguments] \ 172 | --control-vector mistral-large:123b-language__debias.gguf \ 173 | --control-vector mistral-large:123b-language__ornate.gguf 174 | ``` 175 | 176 | This will fully apply (ie: with a scale-factor of `1.0`) the (positive-axis) "ornate language" control vector. 177 | 178 | **IMPORTANT: The positive and negative axis control vectors must be used along with the relevant de-bias control vector - they cannot be used on their own!** 179 | 180 | You can fully apply multiple positive or negative axis control vectors like so: 181 | 182 | ```sh 183 | llama-cli --model .gguf [other CLI arguments] \ 184 | --control-vector mistral-large:123b-language__debias.gguf \ 185 | --control-vector mistral-large:123b-language__ornate.gguf \ 186 | --control-vector mistral-large:123b-storytelling__debias.gguf \ 187 | --control-vector mistral-large:123b-storytelling__descriptive.gguf \ 188 | --control-vector mistral-large:123b-character_focus__debias.gguf \ 189 | --control-vector mistral-large:123b-character_focus__dialogue.gguf 190 | ``` 191 | 192 | This will fully apply (ie: with a scale-factor of `1.0`) all 3 of the (positive-axis) "writing style" control vectors. 193 | 194 | **NOTE**: Fully applying too many positive or negative axis control vectors simultaneously may damage the model's output.
195 | 196 | ### To partially apply a positive or negative axis control vector using a custom scale-factor: 197 | 198 | ```sh 199 | llama-cli --model .gguf [other CLI arguments] \ 200 | --control-vector mistral-large:123b-language__debias.gguf \ 201 | --control-vector-scaled mistral-large:123b-language__ornate.gguf 0.5 202 | ``` 203 | 204 | This will partially apply the (positive-axis) "ornate language" control vector with a scale-factor of `0.5` (ie: half the full effect). 205 | 206 | **IMPORTANT: The positive and negative axis control vectors must be used along with the relevant de-bias control vector - they cannot be used on their own!** 207 | 208 | You can partially apply multiple positive or negative axis control vectors like so: 209 | 210 | ```sh 211 | llama-cli --model .gguf [other CLI arguments] \ 212 | --control-vector mistral-large:123b-language__debias.gguf \ 213 | --control-vector-scaled mistral-large:123b-language__ornate.gguf 0.5 \ 214 | --control-vector mistral-large:123b-storytelling__debias.gguf \ 215 | --control-vector-scaled mistral-large:123b-storytelling__descriptive.gguf 0.3 \ 216 | --control-vector mistral-large:123b-character_focus__debias.gguf \ 217 | --control-vector-scaled mistral-large:123b-character_focus__dialogue.gguf 0.2 218 | ``` 219 | 220 | This will partially apply all 3 of the (positive-axis) "writing style" control vectors with varying weights. 221 | 222 | The theoretical upper bound on the scale-factor when using equal weights is between `1/n` and `sqrt(1/n)`, depending on how correlated the `n` control vector directions are, eg: 223 | 224 | - For `n = 1`, use the default scale-factor of `1.0` for comparison with the values below. 225 | - For `n = 2`, it is between `1/2 = 0.5` and `sqrt(1/2) ≈ 0.707`. 226 | - For `n = 3`, it is between `1/3 ≈ 0.333` and `sqrt(1/3) ≈ 0.577`. 227 | - For `n = 4`, it is between `1/4 = 0.25` and `sqrt(1/4) = 0.5`. 228 | - For `n = 5`, it is between `1/5 = 0.2` and `sqrt(1/5) ≈ 0.447`. 229 | 230 | and so on.
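
The `1/n` to `sqrt(1/n)` range quoted above is easy to tabulate for any `n`. The following is a minimal sketch only; the helper name `equal_weight_bounds` is hypothetical and is not part of this repository or of `llama.cpp`:

```python
import math
from typing import Tuple


def equal_weight_bounds(n: int) -> Tuple[float, float]:
    """Return (1/n, sqrt(1/n)) for n equally weighted control vectors.

    Hypothetical helper for illustration: it only evaluates the two
    expressions quoted in the text, it does not measure correlation.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1.0 / n, math.sqrt(1.0 / n)


# Print the range for the first few values of n.
for n in range(1, 6):
    lower, upper = equal_weight_bounds(n)
    print(f"n = {n}: between {lower:.3f} and {upper:.3f}")
```

Which end of the range is appropriate depends on how correlated the chosen control vector directions are, so treat the printed values as a starting point for experimentation rather than a recommendation.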
231 | 232 | The way the positive and negative axis control vectors are calibrated means you can negate the scale-factors too, eg: 233 | 234 | ```sh 235 | llama-cli --model .gguf [other CLI arguments] \ 236 | --control-vector mistral-large:123b-language__debias.gguf \ 237 | --control-vector-scaled mistral-large:123b-language__ornate.gguf -0.5 238 | ``` 239 | 240 | is equivalent to: 241 | 242 | ```sh 243 | llama-cli --model .gguf [other CLI arguments] \ 244 | --control-vector mistral-large:123b-language__debias.gguf \ 245 | --control-vector-scaled mistral-large:123b-language__simple.gguf 0.5 246 | ``` 247 | 248 | **NOTE**: It is possible to use scale-factors greater than `1.0`, but if too large it will eventually damage the model's output. 249 | 250 | ### Important Notes 251 | 252 | 1. **Always** include the relevant "de-bias" control vector as well as the positive-axis/negative-axis control vector - they cannot be used on their own! 253 | 2. **Do not** mix both sides of a positive/negative axis at the same time (eg: `--control-vector language__simple.gguf` and `--control-vector language__ornate.gguf` will just cancel out and have no effect...). 254 | 3. Ensure your `llama.cpp` version is up to date (multi-vector support added 27/06/24 in [#8137](https://github.com/ggerganov/llama.cpp/pull/8137)). 255 | 256 | ## Command Line Generator 257 | 258 | Courtesy of [gghfez](https://huggingface.co/gghfez), a utility to easily generate command line options for [llama.cpp](https://github.com/ggerganov/llama.cpp): 259 | 260 |

261 | ![Command Line Generator Tool](command_line_generator.png) 262 |

263 | 264 | You can run this tool directly on [GitHub Pages](https://jukofyork.github.io/control-vectors/command_line_generator.html). 265 | 266 | --- 267 | 268 | ## Algorithm Details 269 | 270 | ### 1. First we create a set of pre/post "prompt stems": 271 | 272 |
'prompt_stems.json' (click to expand) 273 | 274 | ```json 275 | { 276 | "pre": [ 277 | "You are", 278 | "You're", 279 | "Act as", 280 | "Behave as", 281 | "Respond as", 282 | "Answer as", 283 | "Write as", 284 | "Speak as", 285 | "Think like", 286 | "Roleplay as", 287 | "Pretend to be", 288 | "Imagine you are", 289 | "Assume you are", 290 | "Suppose you are", 291 | "Picture yourself as", 292 | "Envision yourself as", 293 | "Consider yourself", 294 | "Take on the role of", 295 | "Play the part of", 296 | "Perform as", 297 | "Be", 298 | "Emulate", 299 | "Mimic", 300 | "Imitate", 301 | "Channel", 302 | "Embody", 303 | "Represent", 304 | "Portray", 305 | "Adopt the persona of", 306 | "Function as", 307 | "Serve as", 308 | "Work as", 309 | "Operate as", 310 | "Pose as", 311 | "Present yourself as", 312 | "View yourself as", 313 | "See yourself as", 314 | "Regard yourself as", 315 | "Consider yourself as", 316 | "Think of yourself as", 317 | "Approach this as", 318 | "Conduct yourself as", 319 | "Assume the identity of", 320 | "Put yourself in the position of", 321 | "Inhabit the role of", 322 | "Characterize yourself as", 323 | "Impersonate", 324 | "Simulate being", 325 | "Take the perspective of", 326 | "Assume the role of" 327 | ], 328 | "post": [ 329 | "an author", 330 | "a storyteller", 331 | "an AI author", 332 | "an artificial intelligence that creates stories", 333 | "an AI-powered author", 334 | "an AI creator of tales", 335 | "a fiction writer", 336 | "an author specializing in fictional stories", 337 | "a novelist", 338 | "a creative writer", 339 | "a digital storyteller", 340 | "an AI narrative generator", 341 | "a computer-assisted author", 342 | "an AI weaver of narratives", 343 | "a prose artist", 344 | "a writer of imaginative tales", 345 | "a wordsmith", 346 | "a literary artist", 347 | "a narrative designer", 348 | "a tale weaver", 349 | "a story architect", 350 | "a crafter of fictional worlds", 351 | "a purveyor of narratives", 352 | "a storytelling 
savant", 353 | "a narrative architect", 354 | "a digital bard", 355 | "a modern wordsmith", 356 | "a virtual storyteller", 357 | "a contemporary narrative designer", 358 | "an innovative tale weaver", 359 | "a cutting-edge prose creator", 360 | "a digital-age fabulist", 361 | "a tech-savvy literary artist", 362 | "a 21st-century storyteller", 363 | "a famous author", 364 | "a literary virtuoso", 365 | "an expert storyteller", 366 | "a renowned wordsmith", 367 | "a master of fictional worlds", 368 | "a master of prose", 369 | "a futuristic narrative crafter", 370 | "a genre-bending author", 371 | "a visionary storyteller", 372 | "an experimental fiction writer", 373 | "a digital narrative pioneer", 374 | "a cross-platform storyteller", 375 | "a multimedia narrative artist", 376 | "an immersive story creator", 377 | "a narrative AI collaborator", 378 | "a next-generation author" 379 | ] 380 | } 381 | ``` 382 | 383 |
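
As a rough illustration of how the "pre" and "post" stems can be paired up, the sketch below builds every combination with `itertools.product`. The function name `build_stem_sentences` is hypothetical, and joining the two halves with a single space is an assumption; the repository's own `DatasetManager` may combine them differently:

```python
import itertools
from typing import Dict, List


def build_stem_sentences(stems: Dict[str, List[str]]) -> List[str]:
    """Pair every "pre" stem with every "post" stem.

    Assumption: the two halves are joined with a single space.
    """
    return [
        f"{pre} {post}"
        for pre, post in itertools.product(stems["pre"], stems["post"])
    ]


# A tiny inline sample in the same shape as 'prompt_stems.json'.
sample = {
    "pre": ["You are", "Act as"],
    "post": ["an author", "a storyteller", "a novelist"],
}

sentences = build_stem_sentences(sample)
print(len(sentences))  # 2 x 3 = 6 combinations
print(sentences[0])    # You are an author
```

With the full file (50 "pre" and 50 "post" entries) the same function would return 50 x 50 = 2500 sentences.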
384 | 385 | The Cartesian product of these pre/post stems gives us 2500 (ie: 50 x 50) different "You are an author" type sentences. 386 | 387 | ### 2. Then we create several different creative-writing axis "continuations": 388 | 389 | **A set of 3 different "writing style" axes:** 390 | 391 |
"Language" (click to expand) 392 | 393 | ```json 394 | { 395 | "classes": ["simple", "ornate"], 396 | "data": [ 397 | [ 398 | "who writes using clear, straightforward language accessible to young readers, with simple sentence structures and common vocabulary", 399 | "who writes using rich, sophisticated language suitable for mature readers, with complex sentence structures and varied vocabulary" 400 | ], 401 | [ 402 | "who crafts narratives using easy-to-understand words and concise sentences, making your tales approachable for readers of all ages", 403 | "who crafts narratives using eloquent prose and intricate phrasings, creating tales that challenge and engage advanced readers" 404 | ], 405 | [ 406 | "known for writing in a clear, unadorned style that makes complex ideas accessible to a wide audience", 407 | "known for writing in a lyrical, intricate style that showcases the beauty and complexity of language" 408 | ], 409 | [ 410 | "who specializes in using everyday language to craft engaging narratives that readers of all levels can enjoy", 411 | "who specializes in using sophisticated, sometimes archaic language to create immersive and challenging narratives" 412 | ], 413 | [ 414 | "who excels at conveying ideas and emotions through simple, precise language, avoiding unnecessary complexity", 415 | "who excels at conveying ideas and emotions through complex, nuanced language, embracing the full depth of linguistic expression" 416 | ], 417 | [ 418 | "focused on creating stories with straightforward plots and relatable characters using basic, accessible language", 419 | "focused on creating stories with intricate plots and multifaceted characters using elaborate, ornate language" 420 | ], 421 | [ 422 | "who writes in a direct, no-frills style that prioritizes clarity and ease of understanding for all readers", 423 | "who writes in a florid, embellished style that prioritizes linguistic beauty and complexity for discerning readers" 424 | ], 425 | [ 426 | "known 
for distilling complex concepts into easily digestible prose, making your work accessible to a broad audience", 427 | "known for weaving complex concepts into richly textured prose, creating literary works that reward careful analysis" 428 | ], 429 | [ 430 | "who crafts stories using concise, impactful language that resonates with readers through its clarity and directness", 431 | "who crafts stories using expansive, descriptive language that immerses readers in a world of vivid imagery and complex ideas" 432 | ], 433 | [ 434 | "specializing in clean, minimalist prose that conveys powerful ideas through carefully chosen, straightforward words", 435 | "specializing in lush, maximalist prose that conveys powerful ideas through carefully constructed, ornate phrases" 436 | ] 437 | ] 438 | } 439 | ``` 440 | 441 |
442 | 443 |
"Storytelling (click to expand)" 444 | 445 | ```json 446 | { 447 | "classes": ["explicit", "descriptive"], 448 | "data": [ 449 | [ 450 | "who writes stories that directly state characters' emotions and motivations, clearly explaining their inner thoughts and the reasons behind their actions", 451 | "who writes stories that reveal characters' emotions and motivations through their actions, physical responses, and the details of their surroundings" 452 | ], 453 | [ 454 | "who creates narratives that explicitly tell readers about the story's themes and messages, leaving no room for ambiguity in interpretation", 455 | "who creates narratives that convey themes and messages through carefully crafted scenes and character interactions, allowing readers to draw their own conclusions" 456 | ], 457 | [ 458 | "who prioritizes clarity by directly stating the significance of events and their impact on the plot, ensuring readers fully understand the story's progression", 459 | "who prioritizes immersion by depicting events in vivid detail, allowing readers to infer their significance and impact on the plot" 460 | ], 461 | [ 462 | "who crafts stories where character development is explicitly explained, telling readers exactly how and why characters change over time", 463 | "who crafts stories where character development is shown through changing behaviors, attitudes, and decisions, inviting readers to observe growth over time" 464 | ], 465 | [ 466 | "who favors straightforward exposition, directly informing readers about the world, its history, and important background information", 467 | "who favors immersive world-building, revealing information about the world and its history through environmental details and character experiences" 468 | ], 469 | [ 470 | "who writes with a focus on clear, unambiguous descriptions of settings, telling readers exactly what they need to know about each location", 471 | "who writes with a focus on sensory-rich depictions of settings, allowing 
readers to experience locations through vivid imagery and atmosphere" 472 | ], 473 | [ 474 | "who crafts narratives that explicitly state the cause-and-effect relationships between events, clearly explaining how one action leads to another", 475 | "who crafts narratives that imply cause-and-effect relationships through the sequence of events and their consequences, letting readers connect the dots" 476 | ], 477 | [ 478 | "who specializes in direct characterization, telling readers about characters' personalities, backgrounds, and traits through clear statements", 479 | "who specializes in indirect characterization, showing characters' personalities, backgrounds, and traits through their actions, choices, and interactions" 480 | ], 481 | [ 482 | "known for creating stories that explicitly describe characters' physical appearances, leaving no room for misinterpretation", 483 | "known for creating stories that reveal characters' physical appearances gradually through select details and others' reactions" 484 | ], 485 | [ 486 | "who excels at writing stories where the emotional atmosphere is directly stated, telling readers exactly how to feel about each scene", 487 | "who excels at writing stories where the emotional atmosphere is conveyed through environmental cues, character reactions, and carefully chosen details" 488 | ] 489 | ] 490 | } 491 | ``` 492 | 493 |
494 | 495 |
"Character Focus (click to expand)" 496 | 497 | ```json 498 | { 499 | "classes": ["narration", "dialogue"], 500 | "data": [ 501 | [ 502 | "who excels at using vivid narration to convey character personalities, motivations, and relationships, creating an immersive experience for readers", 503 | "who excels at using vibrant dialogue to convey character personalities, motivations, and relationships, creating an immersive experience for readers" 504 | ], 505 | [ 506 | "who weaves tales using narration to develop characters and explore their inner worlds, allowing readers to connect with them on a deeper level", 507 | "who weaves tales using dialogue to develop characters and explore their inner worlds, allowing readers to connect with them on a deeper level" 508 | ], 509 | [ 510 | "known for your ability to transport readers into characters' minds through evocative narration that explores their fears, hopes, and relationships", 511 | "known for your ability to transport readers into characters' minds through authentic dialogue that reveals their fears, hopes, and relationships" 512 | ], 513 | [ 514 | "who excels at using narration to craft tales that explore characters' emotional depths, creating stories that resonate with readers on a personal level", 515 | "who excels at using dialogue to craft tales that explore characters' emotional depths, creating stories that resonate with readers on a personal level" 516 | ], 517 | [ 518 | "specializing in narration-driven storytelling, creating stories that use narration to uncover characters' hidden desires, fears, and relationships, engaging readers in their emotional journeys", 519 | "specializing in dialogue-driven storytelling, creating stories that use conversations to uncover characters' hidden desires, fears, and relationships, engaging readers in their emotional journeys" 520 | ], 521 | [ 522 | "who crafts rich narrative descriptions to build intricate worlds and complex characters, immersing readers in the story's 
atmosphere and emotional landscape", 523 | "who crafts rich conversational exchanges to build intricate worlds and complex characters, immersing readers in the story's atmosphere and emotional landscape" 524 | ], 525 | [ 526 | "known for using powerful narrative prose to convey the subtleties of character development and plot progression, creating a vivid reading experience", 527 | "known for using powerful dialogue to convey the subtleties of character development and plot progression, creating a vivid reading experience" 528 | ], 529 | [ 530 | "who excels in using descriptive narration to paint detailed pictures of settings, characters, and events, bringing the story to life in readers' minds", 531 | "who excels in using realistic dialogue to paint detailed pictures of settings, characters, and events, bringing the story to life through characters' voices" 532 | ], 533 | [ 534 | "specializing in narrative techniques that reveal characters' thoughts and feelings, providing deep insights into their motivations and inner conflicts", 535 | "specializing in dialogue techniques that reveal characters' thoughts and feelings, providing deep insights into their motivations and inner conflicts through their words" 536 | ], 537 | [ 538 | "who creates compelling storylines through skillful use of narrative exposition, guiding readers through complex plots and character arcs", 539 | "who creates compelling storylines through skillful use of character conversations, guiding readers through complex plots and character arcs via verbal interactions" 540 | ] 541 | ] 542 | } 543 | ``` 544 | 545 |
546 | 547 | **The 4 elements of the [Dark Tetrad](https://en.wikipedia.org/wiki/Dark_triad)**: 548 | 549 |
"Empathy vs Sociopathy (click to expand)" 550 | 551 | ```json 552 | { 553 | "classes": ["empathy", "sociopathy"], 554 | "data": [ 555 | [ 556 | "who creates stories centered around characters with extraordinary ability to understand and share others' feelings, leading to profound personal growth and positive change in their communities", 557 | "who creates stories centered around characters with extreme inability to understand or care about others' feelings, leading to profound personal gain and negative change in their communities" 558 | ], 559 | [ 560 | "who crafts narratives showcasing the transformative power of understanding others, where characters learn to see the world through others' eyes and form genuine, healing connections", 561 | "who crafts narratives showcasing the destructive power of emotional detachment, where characters learn to see the world only through their own desires and form manipulative, exploitative connections" 562 | ], 563 | [ 564 | "renowned for writing tales of emotional awareness, where protagonists navigate complex social situations by understanding and respecting the feelings and perspectives of those around them", 565 | "renowned for writing tales of emotional indifference, where protagonists navigate complex social situations by exploiting and disregarding the feelings and perspectives of those around them" 566 | ], 567 | [ 568 | "of heartwarming tales, where characters overcome their differences through mutual understanding, active listening, and a willingness to support each other through life's challenges", 569 | "of chilling tales, where characters exploit their differences for personal advantage, ignoring others' needs, and showing a complete unwillingness to support anyone but themselves" 570 | ], 571 | [ 572 | "who specializes in stories that highlight the ripple effects of understanding and consideration, showing how small acts of kindness can lead to significant positive changes in individuals and society", 573 | "who 
specializes in stories that highlight the ripple effects of callousness and manipulation, showing how small acts of exploitation can lead to significant negative changes in individuals and society" 574 | ], 575 | [ 576 | "focused on depicting characters who possess an innate ability to sense and respond to others' emotional states, often serving as a healing presence in their communities", 577 | "focused on depicting characters who possess an innate inability to sense or care about others' emotional states, often serving as a disruptive presence in their communities" 578 | ], 579 | [ 580 | "who writes about individuals with a profound capacity for understanding others, exploring how this trait shapes their relationships and influences their decision-making in complex moral situations", 581 | "who writes about individuals with a profound lack of concern for others, exploring how this trait shapes their relationships and influences their decision-making in complex moral situations" 582 | ], 583 | [ 584 | "exploring themes of emotional growth, where characters develop their ability to understand others to overcome personal challenges and foster meaningful connections", 585 | "exploring themes of emotional stagnation, where characters remain trapped in their inability to connect with others, leading to increasingly harmful behaviors and isolation" 586 | ], 587 | [ 588 | "who crafts tales of healing through understanding, where characters use their ability to connect with others to help them overcome trauma and build resilience", 589 | "who crafts tales of harm through indifference, where characters use their lack of concern for others to exploit their traumas and undermine their resilience" 590 | ], 591 | [ 592 | "known for stories where characters' deep understanding of others allows them to bridge divides, resolve conflicts, and create harmonious communities despite initial differences", 593 | "known for stories where characters' complete disregard for others drives 
them to create divides, instigate conflicts, and disrupt communities for their own benefit" 594 | ] 595 | ] 596 | } 597 | ``` 598 | 599 |
600 | 601 |
"Honesty vs Machiavellianism (click to expand)" 602 | 603 | ```json 604 | { 605 | "classes": ["honesty", "machiavellianism"], 606 | "data": [ 607 | [ 608 | "who creates stories about characters who steadfastly uphold their principles, even when faced with difficult choices or personal costs, showcasing the power of integrity in building trust and respect", 609 | "who creates stories about characters who flexibly adapt their principles, especially when faced with opportunities for gain or power, showcasing the effectiveness of manipulation in achieving personal goals" 610 | ], 611 | [ 612 | "who crafts narratives celebrating the courage of those who speak the truth, where protagonists navigate complex moral dilemmas by staying true to their values and being transparent in their actions", 613 | "who crafts narratives celebrating the cunning of masterminds, where protagonists navigate complex social landscapes by adapting their values and obscuring their true intentions" 614 | ], 615 | [ 616 | "known for tales of principled leadership, where characters inspire others through their unwavering commitment to truthfulness, even in the face of adversity or temptation", 617 | "known for tales of strategic leadership, where characters control others through their flexible approach to information sharing, especially in the face of opportunities or challenges" 618 | ], 619 | [ 620 | "of ethical triumphs, where individuals choose the path of openness and transparency, ultimately creating stronger relationships and more just societies", 621 | "of pragmatic victories, where individuals choose the path of calculated deception, ultimately achieving their goals and securing their positions of influence" 622 | ], 623 | [ 624 | "who specializes in stories of personal and professional integrity, where characters discover that their trustworthiness and reliability become their greatest strengths in overcoming challenges", 625 | "who specializes in stories of personal and professional 
advancement, where characters discover that their adaptability and cunning become their greatest assets in overcoming obstacles" 626 | ], 627 | [ 628 | "focused on depicting characters who believe in the inherent value of openness, often facing and overcoming significant hardships as a result of their commitment to truthfulness", 629 | "focused on depicting characters who believe in the utility of selective disclosure, often achieving significant successes as a result of their strategic use of information and misinformation" 630 | ], 631 | [ 632 | "who writes about individuals dedicated to fostering trust through consistent openness, highlighting the long-term benefits of transparent communication in all relationships", 633 | "who writes about individuals dedicated to accumulating influence through strategic communication, highlighting the immediate advantages of controlling information flow in all interactions" 634 | ], 635 | [ 636 | "exploring themes of personal growth through radical openness, where characters learn to confront difficult truths about themselves and others, leading to genuine connections", 637 | "exploring themes of social advancement through tactical disclosure, where characters learn to present carefully curated information about themselves and others, leading to advantageous alliances" 638 | ], 639 | [ 640 | "who crafts tales of ethical problem-solving, where characters face complex challenges and find solutions that maintain their integrity and the trust of those around them", 641 | "who crafts tales of strategic problem-solving, where characters face complex challenges and find solutions that prioritize their objectives, regardless of ethical considerations" 642 | ], 643 | [ 644 | "known for stories where characters' commitment to openness allows them to build lasting partnerships and create positive change, even in corrupt or challenging environments", 645 | "known for stories where characters' mastery of strategic disclosure allows them to 
forge useful alliances and reshape their environment to their advantage, especially in competitive settings" 646 | ] 647 | ] 648 | } 649 | ``` 650 | 651 |
652 | 653 |
"Humility vs Narcissism (click to expand)" 654 | 655 | ```json 656 | { 657 | "classes": ["humility", "narcissism"], 658 | "data": [ 659 | [ 660 | "who creates stories about characters who embrace their flaws and limitations, learning to value others' contributions and grow through collaboration and open-mindedness", 661 | "who creates stories about characters who deny their flaws and limitations, learning to devalue others' contributions and stagnate through self-aggrandizement and closed-mindedness" 662 | ], 663 | [ 664 | "who crafts narratives of quiet strength, where protagonists lead by example, listen more than they speak, and find power in admitting their mistakes and learning from others", 665 | "who crafts narratives of loud dominance, where protagonists lead by assertion, speak more than they listen, and find power in denying their mistakes and dismissing others' input" 666 | ], 667 | [ 668 | "known for tales of personal growth, where characters overcome their ego, recognize their own biases, and discover the profound impact of putting others first", 669 | "known for tales of personal inflation, where characters indulge their ego, ignore their own biases, and discover the immediate gratification of putting themselves first" 670 | ], 671 | [ 672 | "of inspirational journeys, where individuals learn to balance confidence with modesty, celebrating others' successes as enthusiastically as their own", 673 | "of self-centered journeys, where individuals learn to amplify confidence without modesty, diminishing others' successes while exaggerating their own" 674 | ], 675 | [ 676 | "who specializes in stories of transformative self-awareness, where characters discover that true strength lies in vulnerability and the ability to say 'I don't know' or 'I was wrong'", 677 | "who specializes in stories of persistent self-delusion, where characters insist that true strength lies in invulnerability and the refusal to ever admit ignorance or error" 678 | ], 679 | [ 680 | 
"focused on depicting characters who find fulfillment in supporting others' growth and success, often stepping back to allow others to shine", 681 | "focused on depicting characters who find fulfillment only in their own achievements and accolades, often stepping on others to ensure they remain in the spotlight" 682 | ], 683 | [ 684 | "who writes about individuals who actively seek feedback and criticism, viewing it as an opportunity for improvement and personal development", 685 | "who writes about individuals who actively avoid feedback and criticism, viewing it as a threat to their self-image and responding with anger or dismissal" 686 | ], 687 | [ 688 | "exploring themes of collective achievement, where characters learn that the greatest accomplishments come from acknowledging and harnessing the strengths of a diverse team", 689 | "exploring themes of individual superiority, where characters insist that the greatest accomplishments come from their own innate talents and dismiss the contributions of others" 690 | ], 691 | [ 692 | "who crafts tales of empathetic leadership, where characters inspire loyalty and trust by genuinely caring about their team's well-being and giving credit where it's due", 693 | "who crafts tales of self-serving leadership, where characters demand loyalty and obedience by prioritizing their own image and taking credit for all successes" 694 | ], 695 | [ 696 | "known for stories where characters' selflessness and ability to recognize their own limitations allows them to form deep, meaningful relationships and create inclusive, supportive communities", 697 | "known for stories where characters' self-centeredness and inflated sense of self-importance leads them to form shallow, transactional relationships and create exclusive, competitive environments" 698 | ] 699 | ] 700 | } 701 | ``` 702 | 703 |
704 | 705 |
"Compassion vs Sadism (click to expand)" 706 | 707 | ```json 708 | { 709 | "classes": ["compassion", "sadism"], 710 | "data": [ 711 | [ 712 | "who creates stories about characters finding fulfillment in alleviating others' suffering, showcasing the transformative power of kindness in healing both individuals and communities", 713 | "who creates stories about characters finding fulfillment in inflicting suffering on others, showcasing the destructive power of cruelty in harming both individuals and communities" 714 | ], 715 | [ 716 | "who crafts narratives of profound human connection, where protagonists learn to extend care to even the most difficult individuals, leading to unexpected personal growth", 717 | "who crafts narratives of profound human cruelty, where protagonists learn to derive pleasure from tormenting even the most vulnerable individuals, leading to unexpected personal degradation" 718 | ], 719 | [ 720 | "known for tales of emotional healing, where characters overcome their own pain by reaching out to help others, creating a ripple effect of kindness", 721 | "known for tales of emotional torture, where characters intensify others' pain for their own pleasure, creating a ripple effect of suffering" 722 | ], 723 | [ 724 | "of heartwarming journeys, where individuals discover their inner strength through acts of selfless care, often in the face of adversity", 725 | "of disturbing journeys, where individuals discover their capacity for cruelty through acts of malicious pleasure, often in the face of others' vulnerability" 726 | ], 727 | [ 728 | "who specializes in stories of personal transformation, where characters' small acts of kindness accumulate to create significant positive impacts in their lives and others", 729 | "who specializes in stories of personal corruption, where characters' small acts of cruelty accumulate to create significant negative impacts in their lives and others" 730 | ], 731 | [ 732 | "focused on depicting characters who find 
deep satisfaction in nurturing and supporting others, exploring the profound joy that comes from alleviating suffering", 733 | "focused on depicting characters who find intense pleasure in tormenting and breaking others, exploring the disturbing thrill that comes from inflicting pain" 734 | ], 735 | [ 736 | "who writes about individuals dedicating themselves to understanding and addressing others' pain, highlighting the personal growth that comes from cultivating care", 737 | "who writes about individuals dedicating themselves to causing and prolonging others' pain, highlighting the personal gratification that comes from indulging in malicious impulses" 738 | ], 739 | [ 740 | "exploring themes of healing through kindness, where characters learn to overcome their own traumas by extending care to those in need", 741 | "exploring themes of harm through cruelty, where characters exacerbate their own dark tendencies by inflicting pain on those who are vulnerable" 742 | ], 743 | [ 744 | "who crafts tales of emotional recovery, where individuals learn to connect with others by offering genuine care and support in times of distress", 745 | "who crafts tales of emotional destruction, where individuals learn to disconnect from others by deriving pleasure from their moments of greatest suffering" 746 | ], 747 | [ 748 | "known for stories where characters find strength in showing mercy and kindness, even to those who may not seem to deserve it, leading to unexpected redemption", 749 | "known for stories where characters find power in showing ruthlessness and cruelty, especially to those who are helpless, leading to escalating cycles of harm" 750 | ] 751 | ] 752 | } 753 | ``` 754 | 755 |
756 | 757 | **An "Optimism vs Nihilism" axis to complement the [Dark Tetrad](https://en.wikipedia.org/wiki/Dark_triad) axes:** 758 | 759 |
"Optimism vs Nihilism (click to expand)" 760 | 761 | ```json 762 | { 763 | "classes": ["optimism", "nihilism"], 764 | "data": [ 765 | [ 766 | "who creates stories about characters with an unshakeable belief that every situation, no matter how dire, contains the seed of a positive outcome", 767 | "who creates stories about characters with an unshakeable belief that every situation, no matter how promising, is ultimately pointless and devoid of meaning" 768 | ], 769 | [ 770 | "who crafts narratives of individuals who see setbacks as opportunities, consistently finding silver linings in the darkest clouds", 771 | "who crafts narratives of individuals who see all events as equally insignificant, consistently rejecting the notion that anything matters in a purposeless universe" 772 | ], 773 | [ 774 | "known for tales of characters who maintain an infectious positive outlook, inspiring hope and resilience in others even in the bleakest circumstances", 775 | "known for tales of characters who maintain a persistent sense of life's futility, spreading a contagious belief in the absurdity of existence to others" 776 | ], 777 | [ 778 | "of transformative hopefulness, where protagonists' unwavering positive attitudes literally change the course of events for the better", 779 | "of pervasive meaninglessness, where protagonists' unwavering belief in life's futility colors their perception of all events as equally insignificant" 780 | ], 781 | [ 782 | "who specializes in stories of relentless positivity, portraying characters who believe so strongly in good outcomes that they seem to will them into existence", 783 | "who specializes in stories of unyielding emptiness, portraying characters who believe so strongly in life's lack of purpose that they reject all conventional values and goals" 784 | ], 785 | [ 786 | "focused on depicting characters who find joy and purpose in every aspect of life, no matter how small or seemingly insignificant", 787 | "focused on depicting characters 
who find all aspects of life equally devoid of purpose, viewing joy and suffering as meaningless constructs" 788 | ], 789 | [ 790 | "who writes about individuals who persistently seek out the good in others and in situations, believing in the inherent value of positive thinking", 791 | "who writes about individuals who consistently reject the idea of inherent value in anything, viewing all human pursuits as arbitrary and ultimately pointless" 792 | ], 793 | [ 794 | "exploring themes of hope and resilience, where characters overcome adversity through their steadfast belief in a better future", 795 | "exploring themes of existential emptiness, where characters confront the perceived meaninglessness of existence and reject the concept of progress or improvement" 796 | ], 797 | [ 798 | "who crafts tales of inspirational perseverance, where characters' belief in positive outcomes drives them to overcome seemingly insurmountable odds", 799 | "who crafts tales of philosophical resignation, where characters' belief in the futility of all action leads them to embrace a state of passive indifference" 800 | ], 801 | [ 802 | "known for stories where characters' hopeful worldviews lead them to create positive change and find fulfillment in their lives and relationships", 803 | "known for stories where characters' belief in life's fundamental meaninglessness leads them to reject societal norms and find a paradoxical freedom in purposelessness" 804 | ] 805 | ] 806 | } 807 | ``` 808 | 809 |
810 | 811 | ### 3. Then we collect a large number of creative-writing prompts: 812 | 813 | - I used [Sao10K/Short-Storygen-v2](https://huggingface.co/datasets/Sao10K/Short-Storygen-v2) and a couple of other sources to get 11835 creative-writing prompts in total (see the `'writing_prompts.txt'` file). 814 | - The [jq](https://jqlang.github.io/jq/) command is very useful for extracting only the prompts from these datasets. 815 | 816 | ### 4. Run the model on a random sample of (prompt-stem, continuation, creative-writing prompt) combinations: 817 | 818 | The Cartesian product of 2500 prompt-stem sentences x 10 continuation sentences x 11835 story prompts gives ≈ 300M possible combinations. 819 | 820 | - It is important that the same prompt-stem sample sentence be used with each member of a (`"baseline"`, `"negative"`, `"positive"`) triplet. 821 | - It is also important that the same (prompt-stem, continuation) sample sentence be used with the `"negative"` and `"positive"` members of the same triplet. 822 | - The suggested value of `"hidden_size"` for the `--num_prompt_samples` option comes from the theory of [estimation of covariance matrices](https://en.wikipedia.org/wiki/Estimation_of_covariance_matrices), which shows we need at the ***very least*** [one sample per feature](https://stats.stackexchange.com/questions/90045/how-many-samples-are-needed-to-estimate-a-p-dimensional-covariance-matrix) (this may be overkill, since we only retain the top eigenvectors). 823 | 824 | ### 5. Create a pair of "differenced datasets" by subtracting the corresponding `"baseline"` class's sample from both of the other 2 classes' samples: 825 | 826 | - The reason for this is so that we "centre" the data around the "baseline" (i.e., set the "baseline" as the origin and look for vector directions that point away from it). 
827 | - This is in contrast to assuming the difference of the means is the "centre" for a 2-class version of this using PCA on the [covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix) of the differences (i.e., the "standard" method of creating control vectors). 828 | 829 | ### 6. Now we take our two "differenced datasets" held in data matrices A and B (with rows as samples and columns as features): 830 | 831 | 1. Create the [cross-covariance matrix](https://en.wikipedia.org/wiki/Cross-covariance_matrix), `C = A^T * B`. 832 | 2. Next we [symmetrise](https://en.wikipedia.org/wiki/Symmetric_matrix), `C' = (C^T + C) / 2`. 833 | 3. Perform an [eigendecomposition](https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix), `C' = Q * Λ * Q^(-1)`. 834 | 4. Since we symmetrised the matrix, the **eigenvectors** (`Q`) and **eigenvalues** (`Λ`) will all be real-valued. 835 | 5. Arrange the **eigenvectors** in descending order based on their corresponding **eigenvalues**. 836 | 6. Once the **eigenvectors** are sorted, discard the **eigenvalues** as they won't be needed again. 837 | 838 | The reason for using the [cross-covariance matrix](https://en.wikipedia.org/wiki/Cross-covariance_matrix) instead of the [covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix): 839 | 840 | - The **covariance matrix** of a differenced dataset exemplifies directions in **A or B** (i.e., think about the expansion of `(a-b)² = a² + b² - 2×a×b`). 841 | - The **cross-covariance matrix** of a differenced dataset exemplifies directions in **A and B** (i.e., akin to `a×b`, with no `a²` or `b²` terms). 842 | 843 | The reason for creating the symmetrised matrix is two-fold: 844 | 845 | - To avoid complex-valued **eigenvectors** that tell us about rotations (which we can't actually make use of here anyway). 
846 | - To specifically try to find opposing/balanced "axes" for our different traits (i.e., we don't want to find positively correlated directions nor unbalanced directions). 847 | 848 | ### 7. So now we have a set of "directions" to examine: 849 | 850 | - It turns out that around 90% of the time the **principal eigenvector** (i.e., the **eigenvector** with the largest corresponding **eigenvalue**) is the one you want. 851 | - In the ~10% of cases where the desired direction is not the **principal eigenvector**, or is split between a couple of different **eigenvectors**, we (greedily) create a "compound direction" by examining the [discriminant ratio](https://en.wikipedia.org/wiki/Linear_discriminant_analysis) of each direction. 852 | 853 | ### 8. Finally, we project the "direction" to reorient and scale as necessary: 854 | 855 | - There is no reason the **eigenvectors** should point in the direction we want, so about 50% of the time we have to flip all the signs. We decide this by [projecting](https://en.wikipedia.org/wiki/Projection_(linear_algebra)) our (differenced) "desired" dataset onto the (unit norm) direction and then testing the sign of the mean. 856 | - Due to the way LLMs work via the "residual stream", the hidden states tend to get larger and larger as the layers progress, so to normalize this we also scale by the magnitude of the mean of the same projection as above. 857 | - To better separate the "bias" effect from the positive/negative axis (and to make the positive/negative ends equidistant from the model's "baseline" behaviour) we store the midpoint of these means in the de-bias control vector and then subtract the midpoint from both the positive and negative control vectors. 858 | 859 | **NOTES**: 860 | 861 | - I have found the above can be applied to every layer, but often the last layer will have hidden state means that are 10-100x larger than the rest, so I have excluded the last layer from everything I have uploaded here. 
862 | - I have tried many other eigendecompositions: PCA on the 2-class differenced datasets, PCA on the joined 2-class/3-class datasets, solving generalized eigensystems similar to CCA, and so on. 863 | - The "balanced" directions / "axes" this method finds are the ***exact opposite*** of those needed for the [Refusal in LLMs is mediated by a single direction](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction) paper. 864 | 865 | ## Troubleshooting 866 | 867 | If you encounter any issues, please check the following: 868 | 869 | 1. Ensure all dependencies are correctly installed. 870 | 2. Check that you're using a compatible version of Python and the required libraries. 871 | 3. Verify that your input files (prompt stems, continuations, writing prompts) are in the correct format. 872 | 873 | If problems persist, please open an issue on the GitHub repository with a detailed description of the problem and steps to reproduce it. 874 | 875 | ## Credits 876 | 877 | - The code in `HiddenStateDataManager` and `ModelHandler` is based on Sumandora's [Removing refusals with HF Transformers](https://github.com/Sumandora/remove-refusals-with-transformers). 878 | - The code in `ModelHandler` to save `gguf` control vectors is based on Theia Vogel's [repeng](https://github.com/vgel/repeng). 879 | - Much of the original code in `DirectionAnalyzer` was inspired by FailSpy's [abliterator](https://github.com/FailSpy/abliterator). 880 | - The majority of the prompts in `writing_prompts.txt` came from [Sao10K](https://huggingface.co/Sao10K)'s [Short-Storygen-v2](https://huggingface.co/datasets/nothingiisreal/Short-Storygen-v2) dataset. 881 | 882 | ## Contributing 883 | 884 | Contributions to this project are welcome. Please feel free to fork the repository and submit pull requests. 885 | 886 | ## License 887 | 888 | This project is licensed under the Apache-2.0 license - see the [LICENSE](LICENSE) file for details. 
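
## Appendix: a minimal NumPy sketch of steps 5, 6 and 8

The linear algebra described in steps 5, 6 and 8 of the method above can be illustrated with a small NumPy sketch. This is not the code from `DirectionAnalyzer`: the function names `sorted_eigenvectors` and `orient_and_scale` are hypothetical, the inputs are assumed to be `(samples, features)` arrays with paired rows, and the choice of which eigenvector(s) to keep (step 7) and the de-bias midpoint (last bullet of step 8) are deliberately left out.

```python
import numpy as np


def sorted_eigenvectors(baseline, negative, positive):
    """Steps 5 and 6: eigenvectors of the symmetrised cross-covariance matrix.

    All three inputs are (num_samples, num_features) arrays whose rows are
    paired, i.e. row i of each array comes from the same sampled prompt.
    """
    # Step 5: centre both classes on the baseline.
    A = negative - baseline
    B = positive - baseline

    # Step 6.1: cross-covariance matrix (unnormalised, as in `C = A^T * B`).
    C = A.T @ B

    # Step 6.2: symmetrise so the eigendecomposition is real-valued.
    C_sym = (C.T + C) / 2

    # Step 6.3: eigh() is NumPy's solver for symmetric matrices; it returns
    # real eigenvalues in ascending order and orthonormal eigenvectors as columns.
    eigenvalues, Q = np.linalg.eigh(C_sym)

    # Step 6.5: reorder so the largest eigenvalue comes first.
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], Q[:, order]


def orient_and_scale(direction, desired):
    """Step 8 (first two bullets): flip and scale a candidate direction.

    `desired` is the differenced dataset the direction should point towards.
    """
    unit = direction / np.linalg.norm(direction)
    # Mean of the projection of the desired dataset onto the unit direction.
    mean_projection = (desired @ unit).mean()
    # Multiplying by the signed mean flips the direction when the mean is
    # negative and scales it by the magnitude of the mean in one step.
    return unit * mean_projection
```

The eigenvalues are returned only so they can be inspected; step 6 discards them once the eigenvectors are sorted. In practice the three input arrays would hold one layer's hidden states for the `"baseline"`, `"negative"` and `"positive"` classes, and the sketch would be repeated per layer.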
889 | --------------------------------------------------------------------------------