├── .langsynth
├── LangSynth_demomov.mov
├── README.md
├── chroma_report.py
├── dashboard.py
├── ls_arch.jpeg
├── pop.py
├── population.xlsx
├── requirements.txt
├── survey.xlsx
└── utilities.py
/.langsynth:
--------------------------------------------------------------------------------
{
"db_dir":"zevo",
"persona_prompt":"Generate 5 personas and their demographic profiles as JSON strings with the following demographic information {demographic}",
"product_prompt":"for each {persona} that has an awareness of Zevo, they should first state their full persona details (name, gender, age, age decile, region, city, state, home type). they should then tell their Zevo (bug product, https://zevoinsect.com/all-products/) story. the story should be narrated in first person. i repeat, all stories should be in first person, starting off with Hi, I am ... every persona story should be a single contiguous paragraph and should end with two carriage returns (aka newlines). it should contain the following elements and be a single paragraph with no carriage returns - it should start off with all the persona details (name, gender, age, age decile, city, region e.g. southeast, home type e.g. apartment/single family). how bad is the bug problem in the area they live? when and how did he/she become aware of Zevo? what made them go try Zevo (or not try it), and how many weeks had they been aware of the product before their first trial? what specific product(s) (competitor brand and product name really important!) did he/she use for bug management before they switched to Zevo? what was their first use experience with Zevo - specifically what was the bug problem, what level of improvement were they expecting, did the product do well enough or truly delight? if delight, say more about how it delighted. what specifically is the most difficult 'job' that Zevo has done for them (in which room of the house? what kind of bug? during what month of the year? how bad was it)? how long have they been using Zevo and how frequently? is their bug problem seasonal and if so what months of the year? if they have slowed down or stopped using Zevo, is it for seasonal reasons or because they no longer find the product useful? have they been made aware of new products that may be alternatives to Zevo? for personas who do not have an awareness of Zevo, the story should simply state in first person that they are not aware of Zevo and also share the products they use for bug and pest control.",
"collection_name":"zevo_raw",
"dashboard_input_file":"zevo_population.xlsx"
}
--------------------------------------------------------------------------------
/LangSynth_demomov.mov:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/venuv/LangSynth/fe23adfed3efeb875f90aaa4f0dbec004134770d/LangSynth_demomov.mov
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# LangSynth

## Vision (short form)
AI has hitherto been 'big company' friendly. This work hopes to be a small contribution toward democratizing GenAI application frameworks so that small organizations (startups, educational institutions, and non-profits) can perform consumer insights cost-effectively, with speed and steerability, on entirely synthetic populations. LangSynth (built on LangChain) enables these organizations to quickly stand up synthetic audiences and test them as interview panels with entirely synthetic interviews. LangSynth can be used as a precursor to, in addition to, or as a way to broaden traditional panels.

The demo video in this repo gives you an aerial view of the system.

## Architecture

The core capability is as follows:
- pop.py lets you create a population of synthetic personas. The personas are stored in a Chroma database whose name is configurable in .langsynth
- chroma_report.py reads the Chroma DB and corrects some of the 'liberties' GPT took in naming regions and the like. It then publishes the corrected database entries to a 'population.xlsx' file for the dashboard to access
- dashboard.py generates a synthetic-persona dashboard. The dashboard lets you explore the synth population, select personas of interest, and interview them. You can load a survey into the dashboard (one is provided as an example); synth interviews are conducted one synth at a time
- .langsynth is the config file for the Chroma database name (db_dir), the persona prompt (persona_prompt), the product story you want to interview the personas on (product_prompt), and the Excel file that is the interface between the database and the dashboard; a minimal read of it is sketched below
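
A typical run is: pop.py to generate the population, chroma_report.py to repair and export it, then `streamlit run dashboard.py` to explore and interview. As a quick orientation, here is a minimal sketch (assuming the sample .langsynth shipped in this repo) of the shared settings all three scripts read at startup:

```python
# Minimal sketch: the shared settings every script reads from .langsynth
# (mirrors utilities.read_config_file; keys match the sample config in this repo)
import json

with open(".langsynth") as f:
    config = json.load(f)

print(config["db_dir"])                # Chroma persistence directory (a leading dot is added at runtime)
print(config["collection_name"])       # Chroma collection that holds the synthetic personas
print(config["dashboard_input_file"])  # Excel handoff from chroma_report.py to dashboard.py
```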

## Top of Mind Things to eventually improve
- chroma_report.py: build a schema-agnostic 'data repair' capability that can fix LLM wanderings in Chroma and publish the cleaned-up entries to an XLS for the dashboard to ingest
- dashboard.py: let the dashboard work with more than one Chroma database by adding a database selector (a rough sketch follows this list), and enable interviews to be exported to an XLS or published into Chroma
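
A rough, hypothetical sketch of that selector (the directory scan and widget below are illustrative, not current behavior; pop.py currently creates each database as a hidden directory named after db_dir):

```python
# Hypothetical sketch of a multi-database selector for the dashboard (not implemented yet).
# Assumes each Chroma store lives in a hidden directory created by pop.py, e.g. ./.zevo
import os
import streamlit as st

# Treat every hidden directory in the repo root (except .git) as a candidate Chroma store
db_options = sorted(d.lstrip(".") for d in os.listdir(".")
                    if os.path.isdir(d) and d.startswith(".") and d != ".git")
db_choice = st.selectbox("Select a Chroma database:", db_options)
st.write(f"Exploring population from ./.{db_choice}")
```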

## Future Use Cases
Work with Jonathan Engelsma and team @ Grand Valley State to enhance this to support the education experience

## License
Apache 2.0 style, free for use (but with explicit attribution - Venu Vasudevan, perbacco.ai)
--------------------------------------------------------------------------------
/chroma_report.py:
--------------------------------------------------------------------------------
import chromadb
from chromadb.config import Settings
import pandas as pd
from transformers import pipeline
import re

from utilities import get_hidden_directory_name, create_dir_if_not_exists, read_config_file

# Zero-shot classifier used to normalize free-text fields the LLM generated
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


# Classify each story's bug problem into one of these severity labels
labels = ['Mild', 'Moderate', 'Severe', 'Very Severe']
def extract_severity(story):
    result = classifier(story, labels)
    predicted_label = result["labels"][result["scores"].index(max(result["scores"]))]
    print(f"Severity - {predicted_label}")
    return predicted_label

regions = ['Midwest', 'Southwest', 'Southeast', 'East', 'Northwest', 'Northeast', 'South']
def region_fix_llm(sentence):
    # Fall back to the zero-shot classifier when no region name appears verbatim.
    # The classifier is constrained to `regions`, so its top label is always a valid region.
    result = classifier(sentence, regions)
    predicted_label = result["labels"][result["scores"].index(max(result["scores"]))]
    print(f"Region - {predicted_label}")
    return predicted_label

def correct_region(sentence):
    regions = ['Midwest', 'Southwest', 'Southeast', 'East', 'Northwest', 'Northeast']
    for region in regions:
        if re.search(fr'\b{region}\b', sentence, re.IGNORECASE):
            return region
    return region_fix_llm(sentence)


config = read_config_file()
db_dir = config.get('db_dir')
db_dir = get_hidden_directory_name(db_dir)
collection_name = config.get('collection_name')
dashboard_input_file = config.get('dashboard_input_file')

chroma_client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet", persist_directory=db_dir))
collection = chroma_client.get_or_create_collection(name=collection_name)

size = collection.count()
foo = collection.get(
    include=["documents", "metadatas"]
)
print(f"collection size is {size}")

# Convert each item in foo to a new dictionary where the keys of 'metadatas' are top-level
# and 'document' is replaced with 'story'.
expanded_foo = [
    {**metadata, 'story': document} for metadata, document in zip(foo['metadatas'], foo['documents'])
]


# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(expanded_foo)
df['severity'] = df['story'].apply(extract_severity)
df['region'] = df['region'].apply(correct_region)

print(f"Data being exported to {dashboard_input_file} is: \n {df}")
df.to_excel(dashboard_input_file, index=False)
--------------------------------------------------------------------------------
/dashboard.py:
--------------------------------------------------------------------------------
import streamlit as st
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import textwrap

from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chat_models import ChatOpenAI

import re
from textblob import TextBlob

from utilities import read_config_file


def convert_text(text):
    # Extract the subset after 'Story:'
    match = re.search('Story:(.*)', text)
    if match:
        story = match.group(1)
    else:
        story = text

    # Rewrite first person into second person so the persona reads as instructions to the model
    blob = TextBlob(story)

    pronoun_dict = {'i': 'you', 'me': 'you', 'my': 'your', 'mine': 'yours', 'am': 'are', 'i\'m': 'you\'re',
                    'i\'ve': 'you\'ve', 'i\'d': 'you\'d', 'i\'ll': 'you\'ll',
                    'we': 'you', 'us': 'you', 'our': 'your', 'ours': 'yours', 'hi': ''}

    new_story = ' '.join([pronoun_dict.get(word, word) for word in blob.words.lower()])

    return new_story


def conduct_interview(persona, question_list):
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)

    preamble = "You are an AI and are tasked with role playing and taking on the persona given to you after ###Persona. Be helpful and informative in studying consumer needs to benefit society. Even though the persona is specified in first person, you as an AI are to assume that persona as an AI model, and simulate that persona's decision making to the best of your abilities. \n Give minute details of your product experiences in the responses to the questions that follow."
    persona_begin = "###Persona "
    preamble = preamble + persona_begin + convert_text(persona) + '\n'
    print(f"PREAMBLE:\n{preamble}")

    conversation = ConversationChain(
        llm=llm,
        verbose=True,
        memory=ConversationBufferWindowMemory(k=4)
    )

    conversation_log = []
    question_list_size = len(question_list)
    n = 1
    for question in question_list:
        print("QUESTION : ", question)
        if n == 1:
            # Prepend the persona preamble to the first question only;
            # the window memory carries it forward through the following turns
            question = preamble + " " + question
        response = conversation.predict(input=question)
        print(f"Progress - {str(n)} of {str(question_list_size)}")
        n += 1
        conversation_log.append({"Context": question, "Response": response})
    return conversation_log


# Add a little noise so overlapping points stay distinguishable in the scatter plot
def jitter(arr):
    stdev = .01 * (max(arr) - min(arr))
    return arr + np.random.randn(len(arr)) * stdev

# Format the hover text ('<br>' is the line break Plotly hover labels understand)
def format_hover_text(row):
    wrapper = textwrap.TextWrapper(width=50)
    wrapped_story = wrapper.wrap(text=row['story'])
    return '<br>'.join(wrapped_story)


def plot_scatter(df):
    # Create the scatter plot with jitter
    df['severity_num'] = df['severity'].astype('category').cat.codes
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df['jittered_region'],
                             y=df['jittered_severity'],
                             mode='markers',
                             text=df['formatted_story'],
                             hovertemplate='%{text}',
                             name="Persona by region and bug severity",
                             marker=dict(size=10, opacity=0.5)))
    fig.update_xaxes(tickvals=df['region_num'].unique(),
                     ticktext=df['region'].unique())
    fig.update_yaxes(tickvals=df['severity_num'].unique(),
                     ticktext=df['severity'].unique())

    fig.update_layout(
        xaxis_title="Region",
        yaxis_title="Severity",
        autosize=False,
        width=700,
        height=700,
    )
    return fig

config = read_config_file()
dashboard_input_file = config.get('dashboard_input_file')

# Load your data
df = pd.read_excel(dashboard_input_file)
df['formatted_story'] = df.apply(format_hover_text, axis=1)
df['region_num'] = df['region'].astype('category').cat.codes
df['severity_num'] = df['severity'].astype('category').cat.codes
df['jittered_region'] = jitter(df['region_num'])
df['jittered_severity'] = jitter(df['severity_num'])
df['info'] = df['name'] + ', Age: ' + df['age'].astype(str) + ', Story: ' + df['story']


# Set up Streamlit layout
st.title("LangSynth Dashboard")

st.header("Explore Population")
fig = plot_scatter(df)
st.plotly_chart(fig)


st.header("Shortlist Interviewees")
# A multiselect widget where you can shortlist personas from the population
df['info'] = df['info'].astype(str)
shortlist = st.multiselect("Select Interviewees", options=sorted(df['info']))

st.write("You selected these interviewees: ", shortlist)

st.header("Conduct Interviews")
# Streamlit list selector
candidate = st.selectbox('Select a candidate:', shortlist)


uploaded_file = st.file_uploader("Choose an XLS file", type=["xlsx", "xls"])

# Load_interview button: questions are read from the first column of the survey sheet
if st.button('Load Interview'):
    if uploaded_file is not None:
        data = pd.read_excel(uploaded_file, usecols=[0], engine="openpyxl")
        question_list = data.iloc[:, 0].tolist()

        st.session_state.data = question_list  # Store in session_state for access after re-running
    else:
        st.write('No file uploaded yet')

# If data exists in session_state, show it
if 'data' in st.session_state:
    st.write(st.session_state.data)

# Conduct_interview button
if st.button('Conduct Interview'):
    st.write(f'Conducting interview for: {candidate}')
    interview_log = conduct_interview(candidate, st.session_state.data)
    st.write(interview_log)
--------------------------------------------------------------------------------
/ls_arch.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/venuv/LangSynth/fe23adfed3efeb875f90aaa4f0dbec004134770d/ls_arch.jpeg
--------------------------------------------------------------------------------
/pop.py:
--------------------------------------------------------------------------------
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models import PromptLayerChatOpenAI
import chromadb
from chromadb.config import Settings


from utilities import generate_population, get_hidden_directory_name, create_dir_if_not_exists, read_config_file


llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.9)
# you can use PromptLayer to debug if you like
#llm = PromptLayerChatOpenAI(pl_tags=["langchain"], temperature=0.9, return_pl_id=True)

# read config file and populate the necessary synth population parameters
config = read_config_file()  # recommended practice is to use dot langsynth as the config file name
db_dir = config.get('db_dir')
db_dir = get_hidden_directory_name(db_dir)
pt1 = config.get("persona_prompt")
pt2 = config.get("product_prompt")
collection_name = config.get('collection_name')
print(f"db_dir, prompts - {db_dir}, \n {pt1},\n\n {pt2}")

# Chroma initialize
create_dir_if_not_exists(db_dir)
chroma_client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet", persist_directory=db_dir))

# Demographic Prompt Section: the seed description substituted into the
# {demographic} slot of persona_prompt
demo1 = """\
first name, gender, age, \
decile range of age starting from 25 to 64 (25-34, 35-44, 45-54, 55-64), \
region - SouthEast, Northwest, Midwest, East, West, \
city and state - generate a credible city and state in the US, \
home type - apartment, condo, single-family, \
"""
demographic_prompt = ChatPromptTemplate.from_template(pt1)
chain_one = LLMChain(llm=llm, prompt=demographic_prompt, output_key="persona")

# Generate stories
stories_prompt = ChatPromptTemplate.from_template(pt2)
chain_two = LLMChain(llm=llm, prompt=stories_prompt,
                     output_key="story"
                     )

product_collection = chroma_client.get_or_create_collection(name=collection_name)

stories = generate_population(llm, chain_one, chain_two, demo1, product_collection)

print(f"\n**** peek at evolving chroma collection -\n", product_collection.peek())
--------------------------------------------------------------------------------
/population.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/venuv/LangSynth/fe23adfed3efeb875f90aaa4f0dbec004134770d/population.xlsx
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
langchain==0.0.208
langchainplus-sdk==0.0.16
streamlit==1.19.0
chromadb==0.3.26
altair==4.2.2
plotly==5.15.0
textblob==0.17.1
openpyxl==3.1.1
openai==0.27.2
ctransformers==0.2.7
transformers==4.28.1
pytorch-lightning==2.0.2
--------------------------------------------------------------------------------
/survey.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/venuv/LangSynth/fe23adfed3efeb875f90aaa4f0dbec004134770d/survey.xlsx
--------------------------------------------------------------------------------
/utilities.py:
--------------------------------------------------------------------------------
from langchain.chains import SimpleSequentialChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
import random
import string
import re
import json
import os

def partial_template_resolver(var, value, target):
    # Substitute a single {var} placeholder in a template string
    return target.replace(f'{{{var}}}', value)

def read_config_file(file_name=".langsynth"):
    with open(file_name, 'r') as f:
        config = json.load(f)

    return config

def get_hidden_directory_name(base_name):
    return "./." + base_name

def create_dir_if_not_exists(dir_name):
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)

def extract_name(lm, intro):
    pt = "return the person name mentioned in {intro}. it is the word after the words I am. if you are absolutely sure it is not present in the {intro}, return None"
    xtract_prompt = ChatPromptTemplate.from_template(pt)
    chain = LLMChain(llm=lm, prompt=xtract_prompt)
    name = chain.run(intro)
    print(f"[EXTRACT_NAME] {name}: {intro}")
    return name

def extract_age(lm, intro):
    pt = "return the person's age mentioned in {intro}. it is a number based word like 35, or a word with dashes like 35-44. if you are absolutely sure it is not present in the {intro}, return None"
    xtract_prompt = ChatPromptTemplate.from_template(pt)
    chain = LLMChain(llm=lm, prompt=xtract_prompt)
    age = chain.run(intro)
    print(f"[EXTRACT_AGE] {age}:{intro}")
    return age

def extract_city(lm, intro):
    pt = "return the city mentioned in {intro}. if you are absolutely sure it is not present in the {intro}, return None"
    xtract_prompt = ChatPromptTemplate.from_template(pt)
    chain = LLMChain(llm=lm, prompt=xtract_prompt)
    city = chain.run(intro)
    print(f"[EXTRACT_CITY] {city}:{intro}")
    return city

def extract_region(lm, intro, city):
    pt = "return the region mentioned in {intro}, or implied by the {city}. regions can be - northeast, midwest, southeast, south, southwest, west and northwest. if you are absolutely sure you cannot figure it out, return None"
    xtract_prompt = ChatPromptTemplate.from_template(pt)
    chain = LLMChain(llm=lm, prompt=xtract_prompt)
    region = chain.run({'intro': intro, 'city': city})
    return region

def extract_home_type(lm, intro):
    pt = "return the home type if mentioned in {intro}. home type is - apartment, condo, or single family home. if you are absolutely sure it is not present in the {intro}, return None"
    xtract_prompt = ChatPromptTemplate.from_template(pt)
    chain = LLMChain(llm=lm, prompt=xtract_prompt)
    hometype = chain.run(intro)
    return hometype

def generate_random_string(length):
    # Choose from all uppercase and lowercase letters and digits
    chars = string.ascii_letters + string.digits

    # Generate 'length' random characters and join them into an id
    random_string = ''.join(random.choice(chars) for _ in range(length))

    return random_string


def process_stories(stories, lm, collection):

    lm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)  # a more deterministic LLM for compiling stories!

    # Every persona story is prompted to begin with "Hi, I am", so split on that marker
    stories = re.split(r'(?=Hi, I am)', stories)
    stories = [story for story in stories if re.search('[a-zA-Z]', story)]

    ids = []
    metadatas = []
    documents = []
    for story in stories:
        # The opening sentence carries the persona details; mine it for metadata
        intro_sentence = story.split(".")[0]
        city = extract_city(lm, intro_sentence)
        person_info = {
            'name': extract_name(lm, intro_sentence),
            'age': extract_age(lm, intro_sentence),
            'city': city,
            'region': extract_region(lm, intro_sentence, city),
            'hometype': extract_home_type(lm, intro_sentence)
        }
        id = generate_random_string(8)
        documents.append(story)
        metadatas.append(person_info)
        ids.append(id)
        print(f"Metadata--{person_info}")

    collection.add(documents=documents,
                   metadatas=metadatas,
                   ids=ids)
    return stories

# generate population - generate stories, extract metadata, persist to the vector DB
# (populate and save the collection), and return the stories
def generate_population(lm, demo_chain, stories_chain, seed_prompt, collection):
    story_chain = SimpleSequentialChain(
        chains=[demo_chain, stories_chain],
        verbose=True
    )

    raw_stories = story_chain.run(seed_prompt)
    stories = process_stories(raw_stories, lm, collection)

    return stories
--------------------------------------------------------------------------------