├── .gitignore
├── assets
│   ├── image_example_1.jpeg
│   └── image_example_2.jpeg
├── data
│   └── kakaotalk_data
│       ├── KakaoTalkChats.txt
│       └── process_data.py
├── requirements.txt
├── README.md
└── main.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Ignore venv folder
venv/

# Ignore .env file
.env

# Ignore db folder
db/

--------------------------------------------------------------------------------
/assets/image_example_1.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sanggubot/doppelganger-gpt/HEAD/assets/image_example_1.jpeg

--------------------------------------------------------------------------------
/assets/image_example_2.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sanggubot/doppelganger-gpt/HEAD/assets/image_example_2.jpeg

--------------------------------------------------------------------------------
/data/kakaotalk_data/KakaoTalkChats.txt:
--------------------------------------------------------------------------------
others_name 님과 카카오톡 대화
저장한 날짜 : 2023년 1월 1일 오전 1:01


2023년 1월 1일 오후 1:01
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat
2022년 1월 1일 오후 1:01, my_name : this is my example chat
2022년 1월 1일 오후 1:01, others_name : this is other's splitted
2022년 1월 1일 오후 1:01, others_name : example chat

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.6.2
async-timeout==4.0.2
attrs==22.2.0
backoff==2.2.1
certifi==2022.12.7
charset-normalizer==3.1.0
chromadb==0.3.20
click==8.1.3
clickhouse-connect==0.5.18
dataclasses-json==0.5.7
duckdb==0.7.1
fastapi==0.95.0
filelock==3.10.7
frozenlist==1.3.3
h11==0.14.0
hnswlib==0.7.0
httptools==0.5.0
huggingface-hub==0.13.3
idna==3.4
Jinja2==3.1.2
joblib==1.2.0
langchain==0.0.130
lz4==4.3.2
MarkupSafe==2.1.2
marshmallow==3.19.0
marshmallow-enum==1.5.1
monotonic==1.6
mpmath==1.3.0
multidict==6.0.4
mypy-extensions==1.0.0
networkx==3.0
nltk==3.8.1
numpy==1.24.2
openai==0.27.3
packaging==23.0
pandas==2.0.0
Pillow==9.5.0
posthog==2.4.2
pydantic==1.10.7
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
scikit-learn==1.2.2
scipy==1.10.1
sentence-transformers==2.2.2
sentencepiece==0.1.97
six==1.16.0
sniffio==1.3.0
SQLAlchemy==1.4.47
starlette==0.26.1
sympy==1.11.1
tenacity==8.2.2
threadpoolctl==3.1.0
tokenizers==0.13.2
torch==2.0.0
torchvision==0.15.1
tqdm==4.65.0
transformers==4.27.4
typing-inspect==0.8.0
typing_extensions==4.5.0
tzdata==2023.3
urllib3==1.26.15
uvicorn==0.21.1
uvloop==0.17.0
watchfiles==0.19.0
websockets==11.0
yarl==1.8.2
zstandard==0.20.0

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DoppelgangerGPT

![python](https://img.shields.io/badge/python-v3.11-blue) ![langchain](https://img.shields.io/badge/langchain-v0.0.130-blue) ![chromadb](https://img.shields.io/badge/chromadb-v0.3.20-blue)

This repository uses the OpenAI API, vector search, and langchain to create a personalized digital doppelganger that mimics your language and communication style. DoppelgangerGPT provides an AI chatbot experience that reflects your personality, built from your exported KakaoTalk chat data.

## Installation

To install the dependencies, run the following command:

```
pip install -r requirements.txt
```

## Environment Variables

Create a .env file in the root folder and add the following line:

```
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
```

Make sure to replace YOUR_OPENAI_API_KEY with your actual OpenAI API key.

## Dataset Setup

Export your KakaoTalk chat data and save it as KakaoTalkChats.txt. Then move the file to the data/kakaotalk_data/ folder.

## Usage

To process the data, run the following commands:

```
cd data/kakaotalk_data
python process_data.py
```

This will create a db/ folder in the root directory.

Next, run the following command to start the chatbot:

```
python main.py
```
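
Optionally, you can sanity-check the vector store that process_data.py builds. The snippet below is a minimal sketch, not part of the project: it only uses calls that main.py itself relies on (Chroma, OpenAIEmbeddings, similarity_search) and assumes it is run from the project root with OPENAI_API_KEY set; the query string is just an illustrative example.

```
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

load_dotenv()  # loads OPENAI_API_KEY, needed to embed the query

# Open the persisted Chroma store created by process_data.py
vectordb = Chroma(persist_directory="db", embedding_function=OpenAIEmbeddings())

# Print the two stored chunks most similar to an example query
for doc in vectordb.similarity_search("example chat", k=2):
    print(doc.page_content)
    print("---")
```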

## Examples

Examples from a previous version (the speaker names were set to "상대방" ("the other person") and "나" ("me")):

![Example 1](assets/image_example_1.jpeg)

![Example 2](assets/image_example_2.jpeg)

## Licence

- [ ] Currently writing a licence

--------------------------------------------------------------------------------
/data/kakaotalk_data/process_data.py:
--------------------------------------------------------------------------------
import re
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from dotenv import load_dotenv

load_dotenv()


def get_opponent_name(text: str) -> str:
    # The other person's name appears on the first line of a KakaoTalk export
    # ("<name> 님과 카카오톡 대화").
    return text.split(" 님과")[0]


def mask_personal_info(chat: str, my_name: str, opponent_name: str) -> str:
    # Replace real names with the role names used in the prompt template.
    masked_chat = chat.replace(opponent_name, "You").replace(my_name, "Doppelganger")
    return masked_chat


def delete_date_info(chat: str) -> str:
    # Strip the "2022년 1월 1일 오후 1:01, " message timestamps and the bare
    # date divider lines from the export.
    patterns = [
        r"\d{4}년\s\d{1,2}월\s\d{1,2}일\s(?:오전|오후)\s\d{1,2}:\d{1,2}, ",
        r"\d{4}년\s\d{1,2}월\s\d{1,2}일\s(?:오전|오후)\s\d{1,2}:\d{1,2}",
    ]

    for pattern in patterns:
        chat = re.sub(pattern, "", chat)

    return chat


def preprocess_kakaotalk_data(file_path: str) -> str:
    with open(file_path, encoding="utf-8") as f:
        chat = f.read()

    my_name = "my_name"  # TODO: Replace with actual logic for getting the user's name
    opponent_name = get_opponent_name(chat)
    chat = mask_personal_info(chat, my_name, opponent_name)
    chat = delete_date_info(chat)

    return chat


def create_text_vectordb(text: str) -> None:
    persist_directory = '../../db'
    # Heavily overlapping chunks so each stored snippet keeps conversational context.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=250,
        length_function=len,
    )
    # create_documents returns Document objects; drop the first chunk, which
    # mostly contains the export header.
    texts = text_splitter.create_documents([text])[1:]

    embeddings = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(documents=texts, embedding=embeddings, persist_directory=persist_directory)
    vectordb.persist()


def main() -> None:
    kakao_data_path: str = "./KakaoTalkChats.txt"
    chat: str = preprocess_kakaotalk_data(kakao_data_path)

    create_text_vectordb(chat)


if __name__ == "__main__":
    main()
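
For reference, the following is a minimal sketch (not part of the repository) of what the preprocessing above does to a single line of the sample export. It imports the two helpers from process_data.py, assumes it is run from data/kakaotalk_data/, and uses a line taken from KakaoTalkChats.txt.

```
# Sketch: apply the preprocessing steps to one sample line
# (assumes it is run from data/kakaotalk_data/, next to process_data.py).
from process_data import delete_date_info, mask_personal_info

line = "2022년 1월 1일 오후 1:01, my_name : this is my example chat"
line = mask_personal_info(line, my_name="my_name", opponent_name="others_name")
line = delete_date_info(line)
print(line)  # -> "Doppelganger : this is my example chat"
```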

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import os
from dotenv import load_dotenv
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain, TransformChain, SequentialChain
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma

load_dotenv()


def get_memory():
    # Conversation history is stored under "chat_history", using the same role
    # names that were applied when the KakaoTalk data was masked.
    memory = ConversationBufferMemory(memory_key="chat_history", ai_prefix="Doppelganger", human_prefix="You")
    return memory


def get_search_chain():
    embeddings = OpenAIEmbeddings()
    persist_directory = "db"
    vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

    def transform_func(input_variables):
        # Retrieve chat chunks similar to the incoming message to use as
        # style examples in the prompt.
        chat = input_variables["chat"]
        docs = vectordb.similarity_search(chat)
        example_conversations = [doc.page_content for doc in docs]
        return {"example_conversations": example_conversations}

    search_chain = TransformChain(input_variables=["chat"], output_variables=["example_conversations"], transform=transform_func)

    return search_chain


def get_current_memory_chain():
    def transform_memory_func(input_variables):
        # Keep only the last 10 lines of the buffered history for the prompt.
        current_chat_history = input_variables["chat_history"].split('\n')[-10:]
        current_chat_history = '\n'.join(current_chat_history)
        return {"current_chat_history": current_chat_history}

    current_memory_chain = TransformChain(input_variables=["chat_history"], output_variables=["current_chat_history"], transform=transform_memory_func)

    return current_memory_chain


def get_chatgpt_chain():
    llm = ChatOpenAI(model_name='gpt-4')
    # Korean prompt: show a retrieved example conversation, then ask the model
    # to answer as 'Doppelganger' -- in Doppelganger's own style and tone,
    # returning only Doppelganger's line and keeping the reply short.
    template = """너는 'You'가 말을 했을 때 'Doppelganger' 처럼 행동해야해

예시를 보여줄테니 'Doppelganger' 의 말과 습관, 생각을 잘 유추해봐
Examples:
{example_conversations[0]}

자 이제 다음 대화에서 'Doppelganger'가 할것같은 답변을 해봐.
1. 'Doppelganger' 의 스타일대로, 'Doppelganger'가 할것같은 말을 해야해.
2. 자연스럽게 'Doppelganger'의 말투와 성격을 따라해야해. 번역한거같은 말투 쓰지마
3. 'You' 의 말을 이어서 만들지 말고 'Doppelganger' 말만 결과로 줘.
4. 너무 길게 말하지는 마
5. 'Doppelganger'의 평소 생각을 담아봐

이전 대화:
{current_chat_history}
You: {chat}
Doppelganger: """

    prompt_template = PromptTemplate(input_variables=["chat", "example_conversations", "current_chat_history"], template=template)
    chatgpt_chain = LLMChain(llm=llm, prompt=prompt_template, output_key="received_chat")

    return chatgpt_chain


class OverallChain:
    def __init__(self) -> None:
        self.memory = get_memory()
        self.search_chain = get_search_chain()
        self.current_memory_chain = get_current_memory_chain()
        self.chatgpt_chain = get_chatgpt_chain()

        # chat -> (search_chain) example_conversations
        #      -> (current_memory_chain) current_chat_history
        #      -> (chatgpt_chain) received_chat
        self.overall_chain = SequentialChain(
            memory=self.memory,
            chains=[self.search_chain, self.current_memory_chain, self.chatgpt_chain],
            input_variables=["chat"],
            output_variables=["received_chat"],
            verbose=True)

    def receive_chat(self, chat):
        review = self.overall_chain({"chat": chat})
        return review['received_chat']


def main() -> None:
    overall_chain = OverallChain()

    while True:
        received_chat = input("You: ")
        overall_chain.receive_chat(received_chat)

        os.system("clear")  # clear the terminal (Unix) before reprinting the history
        print(overall_chain.memory.load_memory_variables({})['chat_history'])


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
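
As a quick illustration of how the pieces above fit together, the sketch below (not part of the repository) drives OverallChain for a single turn instead of the interactive loop in main(). It assumes the db/ folder already exists and OPENAI_API_KEY is set in .env; the message text is just an example.

```
# Single-turn sketch (assumes db/ exists and OPENAI_API_KEY is set in .env).
from main import OverallChain

chain = OverallChain()

# One user message flows through search_chain -> current_memory_chain -> chatgpt_chain.
reply = chain.receive_chat("오늘 뭐해?")
print("Doppelganger:", reply)

# The buffered memory now contains both the user turn and the reply.
print(chain.memory.load_memory_variables({})["chat_history"])
```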