├── .langsynth ├── LangSynth_demomov.mov ├── README.md ├── chroma_report.py ├── dashboard.py ├── ls_arch.jpeg ├── pop.py ├── population.xlsx ├── requirements.txt ├── survey.xlsx └── utilities.py /.langsynth: -------------------------------------------------------------------------------- 1 | { 2 | "db_dir":"zevo", 3 | "persona_prompt":"Generate 5 persona and their demographic profiles as JSON strings with the following demographic information {demographic}", 4 | "product_prompt":"for each {persona} that has an awareness of Zevo, they should first state their full persona details (name, gender, age, age decile, region, city, state, home type). they should then tell their Zevo (bug product, https://zevoinsect.com/all-products/) story. the story should be narrated in first person. i repeat all stories should be in first person starting off with Hi, I am ... every persona story should be in a single contiguous paragarph. should end the story with two carriage returns (aka newlines). it should contain the following elements and be a single paragraph with no carriage returns- it should start off with all the persona details (name, gender, age, age decile, city, region e.g. southeast, home type e.g. apartment/single family/). how bad is the bug problem in the area they live? when and how did he/she become aware of Zevo? what made them go try Zevo (or not try it), and how may weeks had they been aware of the product before their first trial? what specific product(s) (competitor brand and product name really important!) were they previously using before they switched to Zevo. did he/she use for bug management before Zevo? what was their first use experience with the Zevo- specifically what was the bug problem, what level of improvement were they expecting, did the product do well enough or truly delight? if delight, say more about how it delighted?? what specifically is the most difficult 'job' that Zevo has done for them (in which room of the house? what kind of bug? during what month of the year? how bad was it). how long have they been using Zevo and how frequently? is their bug problem seasonal and if so what months of the year? if they have slowed down or stopped using Zevo, is it for seasonal reasons or because they no longer find the product useful? have they been made aware of new products that may be alternatives to Zevo. for persona's who do not have an awareness of Zevo, the story should simply state in first person that they are not aware of Zevo and also share the products they use for bug and pest control.", 5 | "collection_name":"zevo_raw", 6 | "dashboard_input_file":"zevo_population.xlsx" 7 | } -------------------------------------------------------------------------------- /LangSynth_demomov.mov: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/venuv/LangSynth/fe23adfed3efeb875f90aaa4f0dbec004134770d/LangSynth_demomov.mov -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LangSynth 2 | 3 | ## Vision (short form) 4 | AI hitherto has been 'big company' friendly. This work hopes to be a small contribution in democratizing GenAI application frameworks to enable consumer insights in small organizations (startups, educational institutions and non-profits) - in particular in performing consumer insights in cost effective ways on entirely synthetic populations with speed and steerability. LangSynth (built on LangChain) enables these organizations to quickly stand up synthetic audiences and test them as interview panels with entirely synthetic interviews. LangSynth can be used as a precursor to, in addition to, or in order to broaden traditional panels. 5 | 6 | Demo video in this repo gives you an aerial view of the system 7 | 8 | ## Architecture 9 | ![Synth Workflow](ls_arch.jpeg) 10 | The core capability is as follows: 11 | - pop.py allows you to create a population of synthetic persona. The persona are stored in a Chroma database whose name is configurable in .langsynth 12 | - chroma_report.py reads the chroma DB, makes some correcitons to the 'liberties' that GPT took in naming regions and such. It then publishes the corrected database entries to a 'population.xls' file for the dashboard to access 13 | - dashboard.py. this generates a synthetic person dashboard. this dashboard provides functionality to explore the synth population, select persona of interest and interview them. you can load a survey into the dashboard (one is provided as an example). synth interviews are conducted one synth at a time 14 | - .langsynth is the config file to configure Chroma database name (db_dir), the persona prompt (persona_prompt), the product story you want to interview them on (product_prompt) and the excel file interface between the database and the dashboard 15 | 16 | 17 | ## Top of Mind Things to eventually improve 18 | - chroma_report.py: have build a schema agnostic 'data repair' capability that is able to fix LLM wanderings in Chroma and publish the cleaned up entries to an XLS for the dashboard to ingest 19 | - dashboard.py: enable dashboard to interact with more than one chroma database my adding a database selector. enable interviews to be exported to either an XLS or published into Chroma 20 | 21 | 22 | ## Future Use Cases 23 | Work with Jonathan Engelsma and team @ Grand Valley State to enhance this to support education experience 24 | 25 | ## License 26 | Apache 2.0 style free for use (but with explicit attribution - Venu Vasudevan, perbacco.ai) 27 | -------------------------------------------------------------------------------- /chroma_report.py: -------------------------------------------------------------------------------- 1 | from langchain.chat_models import ChatOpenAI 2 | from langchain.prompts import ChatPromptTemplate 3 | from langchain.prompts import PromptTemplate 4 | from langchain.chains import LLMChain 5 | from langchain.chains import SequentialChain 6 | from langchain.chat_models import PromptLayerChatOpenAI 7 | import chromadb 8 | from chromadb.config import Settings 9 | import pandas as pd 10 | from time import sleep 11 | from transformers import pipeline 12 | import re 13 | 14 | from utilities import get_hidden_directory_name, create_dir_if_not_exists, read_config_file 15 | 16 | llm = ChatOpenAI(temperature=0.2) 17 | 18 | # Define the classifier 19 | classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli") 20 | 21 | 22 | # Define the labels 23 | labels = ['Mild', 'Moderate', 'Severe', 'Very Severe'] 24 | def extract_severity(story): 25 | result = classifier(story, labels) 26 | predicted_label = result["labels"][result["scores"].index(max(result["scores"]))] 27 | print(f"Severity - {predicted_label}") 28 | return predicted_label 29 | 30 | regions = ['Midwest', 'Southwest', 'Southeast', 'East', 'Northwest', 'Northeast','South','Southwest'] 31 | def region_fix_llm(sentence): 32 | result = classifier(sentence, regions) 33 | predicted_label = result["labels"][result["scores"].index(max(result["scores"]))] 34 | print(f"Severity - {predicted_label}") 35 | if predicted_label not in labels: 36 | predicted_label = "Moderate" 37 | return predicted_label 38 | 39 | def correct_region(sentence): 40 | regions = ['Midwest', 'Southwest', 'Southeast', 'East', 'Northwest', 'Northeast'] 41 | for region in regions: 42 | if re.search(fr'\b{region}\b', sentence, re.IGNORECASE): 43 | return region 44 | return region_fix_llm(sentence) 45 | 46 | 47 | config = read_config_file() 48 | db_dir = config.get('db_dir') 49 | db_dir = get_hidden_directory_name(db_dir) 50 | collection_name = config.get('collection_name') 51 | dashboard_input_file = config.get('dashboard_input_file') 52 | 53 | chroma_client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet",persist_directory=db_dir)) 54 | collection = chroma_client.get_or_create_collection(name=collection_name) 55 | 56 | size = collection.count() 57 | foo = collection.get( 58 | include=["documents","metadatas"] 59 | 60 | ) 61 | print(f"collection size is {size}") 62 | 63 | # Convert each item in foo to a new dictionary where the keys of 'metadatas' are top-level 64 | # and 'document' is replaced with 'story'. 65 | expanded_foo = [ 66 | {**metadata, 'story': document} for metadata, document in zip(foo['metadatas'], foo['documents']) 67 | ] 68 | 69 | 70 | # Convert the list of dictionaries into a DataFrame 71 | df = pd.DataFrame(expanded_foo) 72 | df['severity'] = df['story'].apply(extract_severity) 73 | df['region'] = df['region'].apply(correct_region) 74 | 75 | print(f"Data being exported to {dashboard_input_file} is: \n {df}") 76 | df.to_excel(dashboard_input_file, index=False) 77 | 78 | -------------------------------------------------------------------------------- /dashboard.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import numpy as np 3 | import pandas as pd 4 | import plotly.express as px 5 | import plotly.graph_objects as go 6 | import textwrap 7 | 8 | from langchain.llms import OpenAI 9 | from langchain.chains import ConversationChain 10 | from langchain.chains.conversation.memory import ConversationBufferMemory 11 | from langchain.chains.conversation.memory import ConversationBufferWindowMemory 12 | from langchain.chat_models import ChatOpenAI 13 | 14 | import re 15 | from textblob import TextBlob 16 | 17 | from utilities import get_hidden_directory_name, create_dir_if_not_exists, read_config_file 18 | 19 | 20 | def convert_text(text): 21 | # Extract the subset after 'Story:' 22 | match = re.search('Story:(.*)', text) 23 | if match: 24 | story = match.group(1) 25 | else: 26 | story = text 27 | 28 | # Replace first person to second person 29 | blob = TextBlob(story) 30 | 31 | pronoun_dict = {'i': 'you', 'me': 'you', 'my': 'your', 'mine': 'yours', 'am': 'are', 'i\'m': 'you\'re', 32 | 'i\'ve': 'you\'ve', 'i\'d': 'you\'d', 'i\'ll': 'you\'ll', 33 | 'we': 'you', 'us': 'you', 'our': 'your', 'ours': 'yours','hi':''} 34 | 35 | new_story = ' '.join([pronoun_dict.get(word, word) for word in blob.words.lower()]) 36 | 37 | return new_story 38 | 39 | 40 | def conduct_interview(persona, question_list): 41 | llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9) 42 | 43 | preamble = "You are an AI and are tasked with role playing and taking on the persona given to you after ###Persona. Be helpful and informative in studying consumer needs to benefit society, Even though the persona is specific in first person, you as an AI are to assume that persona as an AI model, and simulate that persona's decision making to the best of your abilities. \n Give minute details of your product experiences in the responses to the questions that follow." 44 | persona_begin = "###Persona\ " 45 | preamble = preamble + persona_begin+convert_text(persona)+'\n' 46 | print(f"PREAMBLE:\n{preamble}") 47 | 48 | conversation = ConversationChain( 49 | llm=llm, 50 | verbose=True, 51 | memory=ConversationBufferWindowMemory(k=4) 52 | ) 53 | 54 | conversation_log = [] 55 | question_list_size = len(question_list) 56 | n = 1 57 | for question in question_list: 58 | print("QUESTION : ", question) 59 | if n==1: 60 | question = preamble + " " + question 61 | response = conversation.predict(input=question) 62 | print(f"Progress - {str(n)} of {str(question_list_size)}") 63 | n += 1 64 | conversation_log.append({"Context": question,"Response": response}) 65 | return conversation_log 66 | 67 | 68 | # Define the jitter function 69 | def jitter(arr): 70 | stdev = .01*(max(arr)-min(arr)) 71 | return arr + np.random.randn(len(arr)) * stdev 72 | 73 | # Format the hover text 74 | def format_hover_text(row): 75 | wrapper = textwrap.TextWrapper(width=50) 76 | wrapped_story = wrapper.wrap(text=row['story']) 77 | return '
'.join(wrapped_story) 78 | 79 | 80 | def plot_scatter(df): 81 | # Create the scatter plot with jitter 82 | df['severity_num'] = df['severity'].astype('category').cat.codes 83 | fig = go.Figure() 84 | fig.add_trace(go.Scatter(x=df['jittered_region'], 85 | y=df['jittered_severity'], 86 | mode='markers', 87 | text=df['formatted_story'], 88 | hovertemplate='%{text}', 89 | name = "Persona by region and bug severity", 90 | marker=dict(size=10, opacity=0.5))) 91 | fig.update_xaxes(tickvals=df['region_num'].unique(), 92 | ticktext=df['region'].unique()) 93 | fig.update_yaxes(tickvals=df['severity_num'].unique(), 94 | ticktext=df['severity'].unique()) 95 | 96 | fig.update_layout( 97 | xaxis_title="Region", 98 | yaxis_title="Severity", 99 | autosize=False, 100 | width=700, 101 | height=700, 102 | ) 103 | return fig 104 | 105 | config = read_config_file() 106 | dashboard_input_file = config.get('dashboard_input_file') 107 | 108 | # Load your data 109 | df = pd.read_excel(dashboard_input_file) 110 | df['formatted_story'] = df.apply(format_hover_text, axis=1) 111 | df['region_num'] = df['region'].astype('category').cat.codes 112 | df['severity_num'] = df['severity'].astype('category').cat.codes 113 | df['jittered_region'] = jitter(df['region_num']) 114 | df['jittered_severity'] = jitter(df['severity_num']) 115 | df['info'] = df['name'] + ', Age: ' + df['age'].astype(str) + ', Story: ' + df['story'] 116 | 117 | 118 | # Setup Streamlit layout 119 | st.title("LangSynth Dashboard") 120 | 121 | st.header("Explore Population") 122 | fig = plot_scatter(df) 123 | st.plotly_chart(fig) 124 | 125 | 126 | st.header("Shortlist Interviewees") 127 | # Here you can place your widget for shortlisting interviewees. 128 | # For example, a multiselect widget where you can select names from a list 129 | #shortlist = st.multiselect("Select Interviewees", options=df['name'].unique()) 130 | print(df) 131 | df['info'] = df['info'].astype(str) 132 | shortlist = st.multiselect("Select Interviewees", options=sorted(df['info'])) 133 | 134 | st.write("You selected these interviewees: ", shortlist) 135 | 136 | st.header("Conduct Interviews") 137 | # Streamlit list selector 138 | candidate = st.selectbox('Select a candidate:', shortlist) 139 | 140 | 141 | uploaded_file = st.file_uploader("Choose an XLS file", type=["xlsx","xls"]) 142 | 143 | # Load_interview button 144 | if st.button('Load Interview'): 145 | if uploaded_file is not None: 146 | data = pd.read_excel(uploaded_file, usecols=[0], engine="openpyxl") 147 | column_values = data.iloc[:, 0].tolist() 148 | question_list = column_values 149 | 150 | st.session_state.data = question_list # Store data in session_state for access after re-running 151 | else: 152 | st.write('No file uploaded yet') 153 | 154 | # If data exists in session_state, show it 155 | if 'data' in st.session_state: 156 | st.write(st.session_state.data) 157 | 158 | # Conduct_interview button 159 | if st.button('Conduct Interview'): 160 | st.write(f'Conducting interview for: {candidate}') 161 | df = conduct_interview(candidate, st.session_state.data) 162 | st.write(df) 163 | -------------------------------------------------------------------------------- /ls_arch.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/venuv/LangSynth/fe23adfed3efeb875f90aaa4f0dbec004134770d/ls_arch.jpeg -------------------------------------------------------------------------------- /pop.py: -------------------------------------------------------------------------------- 1 | from langchain.chat_models import ChatOpenAI 2 | from langchain.prompts import ChatPromptTemplate 3 | from langchain import PromptTemplate 4 | from langchain.chains import LLMChain 5 | from langchain.chains import SequentialChain 6 | from langchain.chat_models import PromptLayerChatOpenAI 7 | import chromadb 8 | from chromadb.config import Settings 9 | 10 | 11 | from utilities import generate_population, get_hidden_directory_name, create_dir_if_not_exists, read_config_file, partial_template_resolver 12 | 13 | 14 | llm = ChatOpenAI(model="gpt-3.5-turbo",temperature=0.9) 15 | # you can use PromptLayer to debug if you like 16 | #llm = PromptLayerChatOpenAI(pl_tags=["langchain"],temperature=0.9,return_pl_id=True) 17 | 18 | # read config file and populate necessary synth population parameters 19 | config = read_config_file() # recommended practice is to use dot langsynth as the config file name 20 | db_dir = config.get('db_dir') 21 | db_dir = get_hidden_directory_name(db_dir) 22 | pt1 = config.get("persona_prompt") 23 | pt2 = config.get("product_prompt") 24 | collection_name = config.get('collection_name') 25 | print(f"db_dir, prompts - {db_dir}, \n {pt1},\n\n {pt2}") 26 | 27 | #Chroma initialize 28 | create_dir_if_not_exists(db_dir) 29 | chroma_client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet",persist_directory=db_dir)) 30 | #pop = chroma_client.create_collection(name="population") 31 | 32 | # Demographic Prompt Section 33 | demo1 = """\ 34 | first name, gender, age, \ 35 | decile range of age starting from 25 to 64 (25-34, 35-44,45-54,55-64), \ 36 | region - SouthEast, Northwest, Midwest, East, West, \ 37 | city and state - generate a credible city and state in the US, \ 38 | home type - apartment, condo, single-family, \ 39 | 40 | """ 41 | demographic_prompt = ChatPromptTemplate.from_template(pt1) 42 | chain_one = LLMChain(llm=llm, prompt=demographic_prompt,output_key="persona") 43 | 44 | #Generate stories 45 | stories_prompt = ChatPromptTemplate.from_template(pt2) 46 | chain_two = LLMChain(llm=llm, prompt=stories_prompt, 47 | output_key="story" 48 | ) 49 | #print(f"\n\n*** stories_prompt is {stories_prompt}") 50 | 51 | product_collection = chroma_client.get_or_create_collection(name=collection_name) 52 | 53 | stories = generate_population(llm, chain_one, chain_two, demo1,product_collection) 54 | 55 | print(f"\n**** peek at evolving chroma collection -\n",product_collection.peek()) 56 | 57 | -------------------------------------------------------------------------------- /population.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/venuv/LangSynth/fe23adfed3efeb875f90aaa4f0dbec004134770d/population.xlsx -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | langchain==0.0.208 2 | langchainplus-sdk==0.0.16 3 | streamlit==1.19.0 4 | chromadb==0.3.26 5 | streamlit==1.19.0 6 | altair==4.2.2 7 | plotly==5.15.0 8 | textblob==0.17.1 9 | openpyxl==3.1.1 10 | openai==0.27.2 11 | ctransformers==0.2.7 12 | transformers==4.28.1 13 | pytorch-lightning==2.0.2 14 | -------------------------------------------------------------------------------- /survey.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/venuv/LangSynth/fe23adfed3efeb875f90aaa4f0dbec004134770d/survey.xlsx -------------------------------------------------------------------------------- /utilities.py: -------------------------------------------------------------------------------- 1 | from langchain.chains import SimpleSequentialChain 2 | from langchain.chat_models import ChatOpenAI 3 | from langchain.prompts import ChatPromptTemplate 4 | from langchain.chains import LLMChain 5 | import random 6 | import string 7 | import re 8 | import json 9 | import os 10 | 11 | def partial_template_resolver(var, value, target): 12 | return target.replace(f'{{{var}}}', value) 13 | 14 | def read_config_file(file_name=".langsynth"): 15 | with open(file_name, 'r') as f: 16 | config = json.load(f) 17 | 18 | return config 19 | 20 | def get_hidden_directory_name(base_name): 21 | return "./." + base_name 22 | 23 | def create_dir_if_not_exists(dir_name): 24 | if not os.path.exists(dir_name): 25 | os.makedirs(dir_name) 26 | 27 | def extract_name(lm, intro): 28 | pt = "return the person name mentioned in {intro}. it is the word after the words I am. if you are absolutely sure it is not present in the {intro}, return None" 29 | xtract_prompt = ChatPromptTemplate.from_template(pt) 30 | chain = LLMChain(llm=lm, prompt=xtract_prompt) 31 | name = chain.run(intro) 32 | print(f"[EXTRACT_NAME] {name}: {intro}") 33 | return name 34 | 35 | def extract_age(lm, intro): 36 | pt = "return the person age mentioned in {intro}. it is a number based word like 35, or a word with dashes like 35-44.if you are absolutely sure it is not present in the {intro}, return None" 37 | xtract_prompt = ChatPromptTemplate.from_template(pt) 38 | chain = LLMChain(llm=lm, prompt=xtract_prompt) 39 | age = chain.run(intro) 40 | print(f"[EXTRACT_AGE] {age}:{intro}") 41 | return age 42 | 43 | def extract_city(lm, intro): 44 | pt = "return the city mentioned in {intro}. if you are absolutely sure it is not present in the {intro}, return None" 45 | xtract_prompt = ChatPromptTemplate.from_template(pt) 46 | chain = LLMChain(llm=lm, prompt=xtract_prompt) 47 | city = chain.run(intro) 48 | print(f"[EXTRACT_CITY] {city}:{intro}") 49 | return city 50 | 51 | def extract_region(lm, intro, city): 52 | pt = "return the region mentioned in {intro}, or implied by the {city}. regions can be - northeast, midwest, souteast, south, southwest, west and northwest. if you are absolutely sure you cannot figure it out, return None" 53 | xtract_prompt = ChatPromptTemplate.from_template(pt) 54 | chain = LLMChain(llm=lm, prompt=xtract_prompt) 55 | region = chain.run({'intro':intro,'city':city}) 56 | return region 57 | 58 | def extract_home_type(lm, intro): 59 | pt = "return the home type if mentioned in {intro}. home type is - apartment, condo, or single family homeif you are absolutely sure it is not present in the {intro}, return None" 60 | xtract_prompt = ChatPromptTemplate.from_template(pt) 61 | chain = LLMChain(llm=lm, prompt=xtract_prompt) 62 | hometype = chain.run(intro) 63 | return hometype 64 | 65 | def generate_random_string(length): 66 | # Choose from all uppercase and lowercase letters and digits 67 | chars = string.ascii_letters + string.digits 68 | 69 | # Use a list comprehension to generate a list of 'length' random characters 70 | random_string = ''.join(random.choice(chars) for _ in range(length)) 71 | 72 | return random_string 73 | 74 | 75 | def process_stories(stories,lm,collection): 76 | 77 | lm = ChatOpenAI(model="gpt-3.5-turbo",temperature=0) # a more deterministic LLM for compilin story! 78 | 79 | stories = re.split(r'(?=Hi, I am)', stories) 80 | stories = [story for story in stories if re.search('[a-zA-Z]', story)] 81 | 82 | ids = [] 83 | metadatas = [] 84 | documents = [] 85 | for story in stories: 86 | intro_sentence = story.split(".")[0] 87 | city=extract_city(lm, intro_sentence) 88 | person_info = { 89 | 'name':extract_name(lm, intro_sentence), 90 | 'age':extract_age(lm,intro_sentence), 91 | 'city':city, 92 | 'region':extract_region(lm, intro_sentence,city), 93 | 'hometype':extract_home_type(lm, intro_sentence) 94 | } 95 | id = generate_random_string(8) 96 | documents.append(story) 97 | metadatas.append(person_info) 98 | ids.append(id) 99 | print(f"Metadata--{person_info}") 100 | 101 | collection.add(documents=documents, 102 | metadatas=metadatas, 103 | ids=ids) 104 | return stories 105 | 106 | # generate population - generate stories. extract metadata. persist to vectordb - populate collection. save collection. return stories 107 | def generate_population(lm, demo_chain, stories_chain, seed_prompt,collection): 108 | story_chain = SimpleSequentialChain( 109 | chains=[demo_chain,stories_chain], 110 | verbose=True 111 | ) 112 | 113 | raw_stories = story_chain.run(seed_prompt) 114 | stories = process_stories(raw_stories,lm,collection) 115 | 116 | return stories 117 | --------------------------------------------------------------------------------