├── .gitignore
├── assets
│   ├── image_example_1.jpeg
│   └── image_example_2.jpeg
├── data
│   └── kakaotalk_data
│       ├── KakaoTalkChats.txt
│       └── process_data.py
├── requirements.txt
├── README.md
└── main.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Ignore venv folder
venv/

# Ignore .env file
.env

# Ignore db folder
db/

--------------------------------------------------------------------------------
/assets/image_example_1.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sanggubot/doppelganger-gpt/HEAD/assets/image_example_1.jpeg

--------------------------------------------------------------------------------
/assets/image_example_2.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sanggubot/doppelganger-gpt/HEAD/assets/image_example_2.jpeg

--------------------------------------------------------------------------------
/data/kakaotalk_data/KakaoTalkChats.txt:
--------------------------------------------------------------------------------
others_name 님과 카카오톡 대화
저장한 날짜 : 2023년 1월 1일 오전 1:01


2023년 1월 1일 오후 1:01
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.6.2
async-timeout==4.0.2
attrs==22.2.0
backoff==2.2.1
certifi==2022.12.7
charset-normalizer==3.1.0
chromadb==0.3.20
click==8.1.3
clickhouse-connect==0.5.18
dataclasses-json==0.5.7
duckdb==0.7.1
fastapi==0.95.0
filelock==3.10.7
frozenlist==1.3.3
h11==0.14.0
hnswlib==0.7.0
httptools==0.5.0
huggingface-hub==0.13.3
idna==3.4
Jinja2==3.1.2
joblib==1.2.0
langchain==0.0.130
lz4==4.3.2
MarkupSafe==2.1.2
marshmallow==3.19.0
marshmallow-enum==1.5.1
monotonic==1.6
mpmath==1.3.0
multidict==6.0.4
mypy-extensions==1.0.0
networkx==3.0
nltk==3.8.1
numpy==1.24.2
openai==0.27.3
packaging==23.0
pandas==2.0.0
Pillow==9.5.0
posthog==2.4.2
pydantic==1.10.7
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
scikit-learn==1.2.2
scipy==1.10.1
sentence-transformers==2.2.2
sentencepiece==0.1.97
six==1.16.0
sniffio==1.3.0
SQLAlchemy==1.4.47
starlette==0.26.1
sympy==1.11.1
tenacity==8.2.2
threadpoolctl==3.1.0
tokenizers==0.13.2
torch==2.0.0
torchvision==0.15.1
tqdm==4.65.0
transformers==4.27.4
typing-inspect==0.8.0
typing_extensions==4.5.0
tzdata==2023.3
urllib3==1.26.15
uvicorn==0.21.1
uvloop==0.17.0
watchfiles==0.19.0
websockets==11.0
yarl==1.8.2
zstandard==0.20.0

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DoppelgangerGPT

![python](https://img.shields.io/badge/python-v3.11-blue) ![langchain](https://img.shields.io/badge/langchain-v0.0.130-blue) ![chromadb](https://img.shields.io/badge/chromadb-v0.3.20-blue)

This repository uses the OpenAI API, vector search, and langchain to create a personalized digital doppelganger that mimics your language and communication style. DoppelgangerGPT provides an AI chatbot experience that reflects your personality, built from your exported KakaoTalk chat data.

## Installation

To install the dependencies, run the following command:

```
pip install -r requirements.txt
```

## Environment Variables

Create a .env file in the root folder and add the following line:

```
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
```

Make sure to replace YOUR_OPENAI_API_KEY with your actual OpenAI API key.

## Dataset Setup

Export your KakaoTalk chat data and save it as KakaoTalkChats.txt. Then move the file to the data/kakaotalk_data/ folder.

## Usage

To process the data, run the following commands:

```
cd data/kakaotalk_data
python process_data.py
```

This will create a db/ folder in the root directory.

Next, run the following command to start the chatbot:

```
python main.py
```
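
Optionally, you can sanity-check the vector store that process_data.py builds. The snippet below is a minimal sketch, not part of the project: it only uses calls that main.py itself relies on (Chroma, OpenAIEmbeddings, similarity_search) and assumes it is run from the project root with OPENAI_API_KEY set; the query string is just an illustrative example.

```
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

load_dotenv()  # loads OPENAI_API_KEY, needed to embed the query

# Open the persisted Chroma store created by process_data.py
vectordb = Chroma(persist_directory="db", embedding_function=OpenAIEmbeddings())

# Print the two stored chunks most similar to an example query
for doc in vectordb.similarity_search("example chat", k=2):
    print(doc.page_content)
    print("---")
```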

## Examples

Examples from a previous version (the speaker names were set to "상대방" ("the other person") and "나" ("me")):

![Example 1](assets/image_example_1.jpeg)

![Example 2](assets/image_example_2.jpeg)

## Licence

- [ ] Currently writing a licence

--------------------------------------------------------------------------------
/data/kakaotalk_data/process_data.py:
--------------------------------------------------------------------------------
import re
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from dotenv import load_dotenv

load_dotenv()


def get_opponent_name(text: str) -> str:
    # The other person's name appears on the first line of a KakaoTalk export
    # ("<name> 님과 카카오톡 대화").
    return text.split(" 님과")[0]


def mask_personal_info(chat: str, my_name: str, opponent_name: str) -> str:
    # Replace real names with the role names used in the prompt template.
    masked_chat = chat.replace(opponent_name, "You").replace(my_name, "Doppelganger")
    return masked_chat


def delete_date_info(chat: str) -> str:
    # Strip the "2022년 1월 1일 오후 1:01, " message timestamps and the bare
    # date divider lines from the export.
    patterns = [
        r"\d{4}년\s\d{1,2}월\s\d{1,2}일\s(?:오전|오후)\s\d{1,2}:\d{1,2}, ",
        r"\d{4}년\s\d{1,2}월\s\d{1,2}일\s(?:오전|오후)\s\d{1,2}:\d{1,2}",
    ]

    for pattern in patterns:
        chat = re.sub(pattern, "", chat)

    return chat


def preprocess_kakaotalk_data(file_path: str) -> str:
    with open(file_path, encoding="utf-8") as f:
        chat = f.read()

    my_name = "my_name"  # TODO: Replace with actual logic for getting the user's name
    opponent_name = get_opponent_name(chat)
    chat = mask_personal_info(chat, my_name, opponent_name)
    chat = delete_date_info(chat)

    return chat


def create_text_vectordb(text: str) -> None:
    persist_directory = '../../db'
    # Heavily overlapping chunks so each stored snippet keeps conversational context.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=250,
        length_function=len,
    )
    # create_documents returns Document objects; drop the first chunk, which
    # mostly contains the export header.
    texts = text_splitter.create_documents([text])[1:]

    embeddings = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(documents=texts, embedding=embeddings, persist_directory=persist_directory)
    vectordb.persist()


def main() -> None:
    kakao_data_path: str = "./KakaoTalkChats.txt"
    chat: str = preprocess_kakaotalk_data(kakao_data_path)

    create_text_vectordb(chat)


if __name__ == "__main__":
    main()
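
For reference, the following is a minimal sketch (not part of the repository) of what the preprocessing above does to a single line of the sample export. It imports the two helpers from process_data.py, assumes it is run from data/kakaotalk_data/, and uses a line taken from KakaoTalkChats.txt.

```
# Sketch: apply the preprocessing steps to one sample line
# (assumes it is run from data/kakaotalk_data/, next to process_data.py).
from process_data import delete_date_info, mask_personal_info

line = "2022년 1월 1일 오후 1:01, my_name : this is my example chat"
line = mask_personal_info(line, my_name="my_name", opponent_name="others_name")
line = delete_date_info(line)
print(line)  # -> "Doppelganger : this is my example chat"
```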

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import os
from dotenv import load_dotenv
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain, TransformChain, SequentialChain
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma

load_dotenv()


def get_memory():
    # Conversation history is stored under "chat_history", using the same role
    # names that were applied when the KakaoTalk data was masked.
    memory = ConversationBufferMemory(memory_key="chat_history", ai_prefix="Doppelganger", human_prefix="You")
    return memory


def get_search_chain():
    embeddings = OpenAIEmbeddings()
    persist_directory = "db"
    vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

    def transform_func(input_variables):
        # Retrieve chat chunks similar to the incoming message to use as
        # style examples in the prompt.
        chat = input_variables["chat"]
        docs = vectordb.similarity_search(chat)
        example_conversations = [doc.page_content for doc in docs]
        return {"example_conversations": example_conversations}

    search_chain = TransformChain(input_variables=["chat"], output_variables=["example_conversations"], transform=transform_func)

    return search_chain


def get_current_memory_chain():
    def transform_memory_func(input_variables):
        # Keep only the last 10 lines of the buffered history for the prompt.
        current_chat_history = input_variables["chat_history"].split('\n')[-10:]
        current_chat_history = '\n'.join(current_chat_history)
        return {"current_chat_history": current_chat_history}

    current_memory_chain = TransformChain(input_variables=["chat_history"], output_variables=["current_chat_history"], transform=transform_memory_func)

    return current_memory_chain


def get_chatgpt_chain():
    llm = ChatOpenAI(model_name='gpt-4')
    # Korean prompt: show a retrieved example conversation, then ask the model
    # to answer as 'Doppelganger' -- in Doppelganger's own style and tone,
    # returning only Doppelganger's line and keeping the reply short.
    template = """너는 'You'가 말을 했을 때 'Doppelganger' 처럼 행동해야해

예시를 보여줄테니 'Doppelganger' 의 말과 습관, 생각을 잘 유추해봐
Examples:
{example_conversations[0]}

자 이제 다음 대화에서 'Doppelganger'가 할것같은 답변을 해봐.
1. 'Doppelganger' 의 스타일대로, 'Doppelganger'가 할것같은 말을 해야해.
2. 자연스럽게 'Doppelganger'의 말투와 성격을 따라해야해. 번역한거같은 말투 쓰지마
3. 'You' 의 말을 이어서 만들지 말고 'Doppelganger' 말만 결과로 줘.
4. 너무 길게 말하지는 마
5. 'Doppelganger'의 평소 생각을 담아봐

이전 대화:
{current_chat_history}
You: {chat}
Doppelganger: """

    prompt_template = PromptTemplate(input_variables=["chat", "example_conversations", "current_chat_history"], template=template)
    chatgpt_chain = LLMChain(llm=llm, prompt=prompt_template, output_key="received_chat")

    return chatgpt_chain


class OverallChain:
    def __init__(self) -> None:
        self.memory = get_memory()
        self.search_chain = get_search_chain()
        self.current_memory_chain = get_current_memory_chain()
        self.chatgpt_chain = get_chatgpt_chain()

        # chat -> (search_chain) example_conversations
        #      -> (current_memory_chain) current_chat_history
        #      -> (chatgpt_chain) received_chat
        self.overall_chain = SequentialChain(
            memory=self.memory,
            chains=[self.search_chain, self.current_memory_chain, self.chatgpt_chain],
            input_variables=["chat"],
            output_variables=["received_chat"],
            verbose=True)

    def receive_chat(self, chat):
        review = self.overall_chain({"chat": chat})
        return review['received_chat']


def main() -> None:
    overall_chain = OverallChain()

    while True:
        received_chat = input("You: ")
        overall_chain.receive_chat(received_chat)

        os.system("clear")  # clear the terminal (Unix) before reprinting the history
        print(overall_chain.memory.load_memory_variables({})['chat_history'])


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
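
As a quick illustration of how the pieces above fit together, the sketch below (not part of the repository) drives OverallChain for a single turn instead of the interactive loop in main(). It assumes the db/ folder already exists and OPENAI_API_KEY is set in .env; the message text is just an example.

```
# Single-turn sketch (assumes db/ exists and OPENAI_API_KEY is set in .env).
from main import OverallChain

chain = OverallChain()

# One user message flows through search_chain -> current_memory_chain -> chatgpt_chain.
reply = chain.receive_chat("오늘 뭐해?")
print("Doppelganger:", reply)

# The buffered memory now contains both the user turn and the reply.
print(chain.memory.load_memory_variables({})["chat_history"])
```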