├── .gitignore
├── README.md
├── constitucion
│   ├── codigo_civil_colombia.pdf
│   ├── codigo_comercio.pdf
│   └── codigopenal_colombia.pdf
├── docs
│   └── traduccion.docx
├── formato_demanda
│   └── formato_demanda.docx
├── home.py
└── pages
    ├── 2_🧐Preguntas_al_caso.py
    ├── 3_⚖Preguntas_a_la_constitución.py
    ├── 4_💼Crear_documento.py
    ├── 5_💡Corregir_redaccion.py
    ├── 6_🗺️Traduccion_documentos.py
    └── front
        └── htmlTemplates.py
/.gitignore:
--------------------------------------------------------------------------------
.env
informes
llm
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# AI Bot Law Assistance

The JurisBot App is a Python application built with Streamlit that automates the processing of legal documents, particularly those related to law cases and lawsuits. The app uses AI agents built with LangChain and OpenAI to extract data and insights from large documents. Users can also converse with an AI agent, a chatbot tuned for legal questions about the uploaded documents and the Colombian constitution.

## Project Structure

The project folder is organized as follows:

- home.py: The landing page of the application, which introduces JurisBot and its features.
- pages/: The Streamlit pages of the app: questions about a case, questions about the constitution, document creation, writing correction, and document translation.
- pages/front/: HTML templates and CSS used to render the chat interface.
- constitucion/: PDFs of the Colombian civil, commercial, and penal codes that back the legal Q&A.
- docs/ and formato_demanda/: Word templates used to generate translations and lawsuit documents.

## Features

- Document Processing: Upload PDFs and other legal documents; the text content is extracted and processed automatically for further analysis (see the ingestion sketch below).
- AI ChatBot: Converse with an AI agent powered by OpenAI's GPT-3.5 model. The agent answers questions about the legal matters found in the uploaded documents, with a focus on the Colombian constitution.
- Image-to-Text Conversion: Amazon Textract converts text embedded in images so that content is also available for analysis.

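A minimal sketch of the ingestion step, condensed from the `read_file` helper in `pages/2_🧐Preguntas_al_caso.py`: PDFs are read with PyPDF2, and images are OCR'd with Amazon Textract.

```python
from PyPDF2 import PdfReader
import boto3

# Textract client; the app reads its AWS credentials from a .env file
textract = boto3.client("textract", region_name="us-east-1")

def read_file(uploaded_file):
    """Extract raw text from an uploaded PDF or image."""
    name = uploaded_file.name.lower()
    if name.endswith(".pdf"):
        # Concatenate the text of every page of the PDF
        reader = PdfReader(uploaded_file)
        return "".join(page.extract_text() or "" for page in reader.pages)
    # Otherwise treat the upload as an image and OCR it line by line
    response = textract.detect_document_text(Document={"Bytes": uploaded_file.read()})
    return "\n".join(
        block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"
    )
```
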
## Tech Stack

- Python: The core programming language used to build the application.
- Streamlit: A Python library for creating interactive web applications for data science and machine learning.
- OpenAI API: Integrates the GPT-3.5 model so users can hold legal discussions with the AI agent.
- LangChain + FAISS: Split the documents into chunks, embed them, and retrieve the passages relevant to each question (see the sketch below).
- Amazon Textract: Extracts text from images, extending the app's ability to process varied legal documents.

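Question answering follows the standard LangChain retrieval pattern; a condensed sketch of the chain built in `pages/2_🧐Preguntas_al_caso.py`:

```python
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

def build_conversation(raw_text):
    # Split the extracted text into overlapping chunks
    chunks = CharacterTextSplitter(
        separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len
    ).split_text(raw_text)
    # Embed the chunks into an in-memory FAISS index
    vectorstore = FAISS.from_texts(texts=chunks, embedding=OpenAIEmbeddings())
    # Wrap the retriever in a conversational chain that keeps chat history
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    return ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(), retriever=vectorstore.as_retriever(), memory=memory
    )
```
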

## Authors

- Sergio Quintero
--------------------------------------------------------------------------------
/constitucion/codigo_civil_colombia.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/constitucion/codigo_civil_colombia.pdf
--------------------------------------------------------------------------------
/constitucion/codigo_comercio.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/constitucion/codigo_comercio.pdf
--------------------------------------------------------------------------------
/constitucion/codigopenal_colombia.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/constitucion/codigopenal_colombia.pdf
--------------------------------------------------------------------------------
/docs/traduccion.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/docs/traduccion.docx
--------------------------------------------------------------------------------
/formato_demanda/formato_demanda.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/formato_demanda/formato_demanda.docx
--------------------------------------------------------------------------------
/home.py:
--------------------------------------------------------------------------------
import streamlit as st

# Page configuration
st.set_page_config(
    page_title="Gestión documental de casos de derecho con IA",
    page_icon=":books:",
    layout="wide"
)

# Custom CSS styles
st.markdown(
    """
    """,
    unsafe_allow_html=True
)

# Header (assumed <h1> markup; the original HTML tags were stripped)
st.markdown('<h1>JurisBot 🤖</h1>', unsafe_allow_html=True)
st.markdown("---")

# App description (assumed <p> markup; the original HTML tags were stripped)
st.markdown('<p>Bienvenidos a la aplicación de gestión documental de casos de derecho con IA.</p>', unsafe_allow_html=True)
st.write(
    """
    Esta aplicación permite realizar las siguientes tareas:
    - Realizar cualquier pregunta sobre algún dato en el documento.
    - Identificar nombres de lugares en el documento con descripciones.
    - Identificar montos de dinero en el documento con descripciones.
    - Identificar leyes y artículos en el documento con descripciones.
    - Identificar nombres de personas en el documento con descripciones.
    - Realizar preguntas sobre leyes o artículos de la Constitución Colombiana:
      Código penal, Código civil, Código comercial.
    """
)
st.markdown("---")

# Sidebar menu
st.sidebar.markdown('', unsafe_allow_html=True)
st.sidebar.info("Selecciona una opción de la lista arriba.")
--------------------------------------------------------------------------------
/pages/2_🧐Preguntas_al_caso.py:
--------------------------------------------------------------------------------
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain import OpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.question_answering import load_qa_chain
from pages.front.htmlTemplates import css, bot_template, user_template
import pandas as pd
import boto3
import os

load_dotenv()

aws_access_key = os.getenv("AWS_ACCESS_KEY")
aws_access_secret_key = os.getenv("AWS_ACCESS_SECRET_KEY")

client = boto3.client('textract', region_name='us-east-1', aws_access_key_id=aws_access_key,
                      aws_secret_access_key=aws_access_secret_key)

def read_file(files):
    """Extract text from the uploaded files: PDFs via PyPDF2, images via Amazon Textract."""
    text = ""
    for uploaded_file in files:
        file_name = uploaded_file.name
        if file_name.lower().endswith('.pdf'):
            pdf_reader = PdfReader(uploaded_file)
            for page in pdf_reader.pages:
                text += page.extract_text()
        elif file_name.lower().endswith(('.jpg', '.jpeg', '.png', '.gif', '.bmp')):
            image_bytes = uploaded_file.read()
            response = client.detect_document_text(Document={'Bytes': image_bytes})
            for item in response['Blocks']:
                if item["BlockType"] == "LINE":
                    text += item["Text"] + "\n"  # newline keeps the OCR'd lines separated
    return text

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks


def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore


def get_conversation_chain(vectorstore):
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain

def generate_summary(vectorstore):
    llm = OpenAI(temperature=0)
    chain_1 = load_qa_chain(llm=llm, chain_type='stuff')
    query = 'Haz un resumen del documento en 1 párrafo'
    docs = vectorstore.similarity_search(query)
    summary = chain_1.run(input_documents=docs, question=query)
    return summary

def dataframe_output(output):
    # The model is prompted to answer either as comma-separated "Name : Description"
    # pairs, or as a ";"-separated list of laws without descriptions.
    try:
        entries = [entry.strip() for entry in output.split(", ")]
        data = {
            'Name': [entry.split(": ")[0] for entry in entries],
            'Description': [entry.split(": ")[1] for entry in entries]
        }
        df = pd.DataFrame(data)
    except Exception:
        # Fall back to a single-column table when the output has no descriptions
        entries = [entry.strip() for entry in output.split(";")]
        data = {
            'Law': [entry for entry in entries]
        }
        df = pd.DataFrame(data)
    return df

def generate_nombres(vectorstore):
    chain_df = load_qa_chain(OpenAI(), chain_type="stuff")
    query = """Identifica todos los nombres propios de personas que están en el documento con su respectivo papel o rol en el documento.
    Ejemplo:
    Sergio : Demandante,Maria : Testigo,Jose : Demandado"""
    docs = vectorstore.similarity_search(query)
    salida = chain_df.run(input_documents=docs, question=query)
    df = dataframe_output(salida)
    return df

def generate_lugares(vectorstore):
    chain_df = load_qa_chain(OpenAI(), chain_type="stuff")
    query = """Identifica todos los nombres de lugares que están en el documento con su respectiva descripción.
    Ejemplo:
    Barcelona : Vereda,Paraiso : Finca,Santuario : Pueblo"""
    docs = vectorstore.similarity_search(query)
    salida = chain_df.run(input_documents=docs, question=query)
    df = dataframe_output(salida)
    return df

def generate_montos(vectorstore):
    chain_df = load_qa_chain(OpenAI(), chain_type="stuff")
    query = """Identifica todos los montos de dinero que están en el documento con su respectiva descripción.
    Ejemplo:
    100 : Valor terreno,200 : Valor casa,300 : Valor demandado,500 : Valor escrituras"""
    docs = vectorstore.similarity_search(query)
    salida = chain_df.run(input_documents=docs, question=query)
    df = dataframe_output(salida)
    return df

def generate_leyes(vectorstore):
    chain_df = load_qa_chain(OpenAI(), chain_type="stuff")
    query = """Identifica todas las leyes y artículos que están en el documento con su respectiva descripción.
    Ejemplo:
    Ley 100 : Ley de salud,Artículo 1 : Derecho a la vida,Artículo 2 : Derecho a la salud"""
    docs = vectorstore.similarity_search(query)
    salida = chain_df.run(input_documents=docs, question=query)
    df = dataframe_output(salida)
    return df

def handle_userinput(user_question):
    # Guard: the chain only exists after the documents have been processed
    if st.session_state.conversation is None:
        st.warning("Primero carga y procesa los documentos del caso en la barra lateral.")
        return
    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history = response['chat_history']

    # Even indexes are user turns, odd indexes are bot turns
    for i, message in enumerate(st.session_state.chat_history):
        if i % 2 == 0:
            st.write(user_template.replace(
                "{{MSG}}", message.content), unsafe_allow_html=True)
        else:
            st.write(bot_template.replace(
                "{{MSG}}", message.content), unsafe_allow_html=True)

def main():
    load_dotenv()
    st.set_page_config(page_title="Pregúntale al asistente virtual Juris Bot",
                       page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Pregúntale al asistente virtual Juris Bot :books:")
    user_question = st.text_input("Pregunta cualquier dato sobre el caso o sobre cualquier ley colombiana:")
    if user_question:
        handle_userinput(user_question)

    with st.sidebar:
        st.subheader("Tus casos")
        pdf_docs = st.file_uploader(
            "Carga acá todos los documentos del caso", accept_multiple_files=True)

        if "summary" in st.session_state:
            st.subheader("Resumen del caso")
            st.write(st.session_state.summary)

    st.sidebar.title("Entidades")
    entidades_options = ["Nombres", "Lugares", "Montos", "Leyes"]
    entidades_selected = st.sidebar.selectbox("Seleccione una opción", entidades_options)

    if pdf_docs and entidades_selected:
        with st.spinner("Procesando..."):
            raw_text = read_file(pdf_docs)
            text_chunks = get_text_chunks(raw_text)
            vectorstore = get_vectorstore(text_chunks)
            summary = generate_summary(vectorstore)
            st.session_state.summary = summary

            if entidades_selected == "Nombres":
                nombres_df = generate_nombres(vectorstore)
                st.subheader("Nombres de personas y su respectivo papel o rol")
                st.dataframe(nombres_df, height=500)

            elif entidades_selected == "Lugares":
                lugares_df = generate_lugares(vectorstore)
                st.subheader("Lugares y su respectiva descripción")
                st.dataframe(lugares_df, height=500)

            elif entidades_selected == "Montos":
                montos_df = generate_montos(vectorstore)
                st.subheader("Montos de dinero y su respectiva descripción")
                st.dataframe(montos_df, height=500)

            elif entidades_selected == "Leyes":
                leyes_df = generate_leyes(vectorstore)
                st.subheader("Leyes y sus respectivos artículos")
                st.dataframe(leyes_df, height=500)

            st.session_state.conversation = get_conversation_chain(vectorstore)

if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/pages/3_⚖Preguntas_a_la_constitución.py:
--------------------------------------------------------------------------------
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from pages.front.htmlTemplates import css, bot_template, user_template

load_dotenv()

codigo_civil = "./constitucion/codigo_civil_colombia.pdf"
codigo_comercio = "./constitucion/codigo_comercio.pdf"
codigo_penal = "./constitucion/codigopenal_colombia.pdf"

pdf_docs_law = [codigo_civil, codigo_comercio, codigo_penal]

def get_pdf_text(pdf_docs_law):
    text = ""
    for pdf in pdf_docs_law:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks

def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore

def get_conversation_chain(vectorstore):
    llm = ChatOpenAI()
    memory_law = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True)
    conversation_chain_law = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory_law
    )
    return conversation_chain_law

def main():
    load_dotenv()
    st.write(css, unsafe_allow_html=True)

    if "conversation_law" not in st.session_state:
        st.session_state.conversation_law = None
    if "chat_history_law" not in st.session_state:
        st.session_state.chat_history_law = None

    st.header("Asistente virtual Juris Bot :books:")
    user_question_law = st.text_input("Pregunta cualquier dato sobre la Constitución colombiana:")
    if user_question_law:
        handle_userinput(user_question_law)


def handle_userinput(user_question_law):
    # Build the index over the legal codes lazily, on the first question of the session
    if st.session_state.conversation_law is None:
        raw_text_law = get_pdf_text(pdf_docs_law)
        text_chunks_law = get_text_chunks(raw_text_law)
        vectorstore_law = get_vectorstore(text_chunks_law)
        st.session_state.conversation_law = get_conversation_chain(vectorstore_law)

    response_law = st.session_state.conversation_law({'question': user_question_law})
    st.session_state.chat_history_law = response_law['chat_history']

    # Even indexes are user turns, odd indexes are bot turns
    for i, message in enumerate(st.session_state.chat_history_law):
        if i % 2 == 0:
            st.write(user_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
        else:
            st.write(bot_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)

if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/pages/4_💼Crear_documento.py:
--------------------------------------------------------------------------------
import streamlit as st
from dotenv import load_dotenv
from docxtpl import DocxTemplate
import os

docx_tpl = DocxTemplate("./formato_demanda/formato_demanda.docx")

def main():
    load_dotenv()
    st.header("Crea tus documentos legales :notebook_with_decorative_cover:")

    documentos_legales = ["Demanda laboral", "Contrato de arrendamiento", "Contrato de compraventa", "Contrato de trabajo",
                          "Contrato de prestación de servicios", "Contrato de sociedad"]

    documento_legal = st.selectbox("Selecciona el tipo de documento legal que deseas generar:", documentos_legales)

    def creacion_documento(numero_radicado, context):
        # Render the shared template with the fields collected by the active form
        nombre_informe = str(numero_radicado)
        os.makedirs('Informes', exist_ok=True)  # the output folder may not exist yet
        docx_tpl.render(context)
        docx_tpl.save(f'Informes/{nombre_informe}.docx')

    if documento_legal == "Demanda laboral":
        st.write("A continuación, ingresa la información necesaria para crear el documento")
        numero_radicado = st.text_input("Número de radicado:")
        context = {
            'nombre_demandante': st.text_input("Nombre del demandante:"),
            'nombre_demandado': st.text_input("Nombre del demandado:"),
            'cedula_demandante': st.text_input("Cédula del demandante:"),
            'cedula_demandado': st.text_input("Cédula del demandado:"),
            'ciudad': st.text_input("Ciudad:"),
            'fecha': st.date_input("Fecha:"),
            'nombre_abogado': st.text_input("Nombre del abogado:"),
            'cedula_abogado': st.text_input("Cédula del abogado:")
        }
        if st.button("Generar documento"):
            creacion_documento(numero_radicado, context)
            st.write("Documento generado")

    if documento_legal == "Contrato de compraventa":
        st.write("A continuación, ingresa la información necesaria para crear el documento")
        numero_radicado = st.text_input("Número de radicado:")
        # This form collects fewer fields; missing template variables render as blanks
        context = {
            'nombre_demandante': st.text_input("Nombre del demandante:"),
            'nombre_demandado': st.text_input("Nombre del demandado:"),
            'cedula_demandante': st.text_input("Cédula del demandante:"),
            'cedula_demandado': st.text_input("Cédula del demandado:")
        }
        if st.button("Generar documento"):
            creacion_documento(numero_radicado, context)
            st.write("Documento generado")

    if documento_legal == "Contrato de trabajo":
        st.write("A continuación, ingresa la información necesaria para crear el documento")
        numero_radicado = st.text_input("Número de radicado:")
        context = {
            'nombre_demandante': st.text_input("Nombre del demandante:"),
            'nombre_demandado': st.text_input("Nombre del demandado:"),
            'cedula_demandante': st.text_input("Cédula del demandante:"),
            'cedula_demandado': st.text_input("Cédula del demandado:")
        }
        if st.button("Generar documento"):
            creacion_documento(numero_radicado, context)
            st.write("Documento generado")


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/pages/5_💡Corregir_redaccion.py:
--------------------------------------------------------------------------------
import streamlit as st
from dotenv import load_dotenv
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain import OpenAI
from langchain.chains.question_answering import load_qa_chain

load_dotenv()

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks


def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore

def generate_correction(vectorstore):
    llm = OpenAI(temperature=0)
    chain_1 = load_qa_chain(llm=llm, chain_type='stuff')
    query = '''Simula que eres un abogado profesional y trata de conservar el lenguaje propio
    del campo legal. Haz una corrección al texto siguiendo los protocolos legales y profesionales'''
    docs = vectorstore.similarity_search(query)
    correction = chain_1.run(input_documents=docs, question=query)
    return correction

def main():
    load_dotenv()
    # Custom CSS styles
    st.markdown(
        """
        """,
        unsafe_allow_html=True,
    )
    st.title("Corregir redacción")
    st.markdown("Escribe un texto y el sistema te ayudará a corregir la redacción")
    texto = st.text_area("Escribe o copia un texto", height=300)

    if st.button("Corregir redacción") and texto:
        text_chunks = get_text_chunks(texto)
        vectorstore = get_vectorstore(text_chunks)
        correction = generate_correction(vectorstore)
        st.write(correction)

if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/pages/6_🗺️Traduccion_documentos.py:
--------------------------------------------------------------------------------
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain import OpenAI
from langchain.chains.question_answering import load_qa_chain
import boto3
import docx2txt
import openai
from docxtpl import DocxTemplate
import os

# Load the .env file before reading the credentials from the environment
load_dotenv()

docx_tpl = DocxTemplate("./docs/traduccion.docx")

openai_api_key = os.getenv("OPENAI_API_KEY")
aws_access_key = os.getenv("AWS_ACCESS_KEY")
aws_access_secret_key = os.getenv("AWS_ACCESS_SECRET_KEY")

openai.api_key = openai_api_key

client = boto3.client('textract', region_name='us-east-1', aws_access_key_id=aws_access_key,
                      aws_secret_access_key=aws_access_secret_key)

def read_file(files):
    """Extract text from a single uploaded PDF, Word, or image file."""
    text = ""
    file_name = files.name
    if file_name.lower().endswith('.pdf'):
        pdf_reader = PdfReader(files)
        for page in pdf_reader.pages:
            text += page.extract_text()
    elif file_name.lower().endswith('.docx'):
        # The uploader accepts .docx, so extract its text with docx2txt
        text = docx2txt.process(files)
    elif file_name.lower().endswith(('.jpg', '.jpeg', '.png', '.gif', '.bmp')):
        image_bytes = files.read()
        response = client.detect_document_text(Document={'Bytes': image_bytes})
        for item in response['Blocks']:
            if item["BlockType"] == "LINE":
                text += item["Text"] + "\n"  # newline keeps the OCR'd lines separated
    return text


def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks


def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore

def detect_language(vectorstore):
    llm = OpenAI(temperature=0)
    chain_1 = load_qa_chain(llm=llm, chain_type='stuff')
    query = 'Detecta el idioma de este documento y devuélvelo en español en una sola palabra'
    docs = vectorstore.similarity_search(query)
    source_language = chain_1.run(input_documents=docs, question=query)
    return source_language

def translate_text(text, source_language, target_language):
    prompt = f"Translate the following '{source_language}' text to '{target_language}': {text}"

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates text."},
            {"role": "user", "content": prompt}
        ],
        n=1,
        stop=None,
        temperature=0,
    )

    translation = response.choices[0].message.content.strip()
    return translation


def main():
    load_dotenv()
    st.title("Traducción de texto")
    st.markdown("Sube el documento que deseas traducir")
    file = st.file_uploader("Sube el documento que deseas traducir",
                            type=['pdf', 'docx', 'jpg', 'jpeg', 'png'])
    if file is not None:
        texto = read_file(file)
        text_chunks = get_text_chunks(texto)
        vectorstore = get_vectorstore(text_chunks)
        language1 = detect_language(vectorstore)
        st.write(f"El idioma detectado es: {language1}")

        language2 = st.selectbox("Escoge el idioma al que deseas traducirlo",
                                 ["Español", "Inglés", "Francés", "Alemán", "Italiano", "Portugués",
                                  "Ruso", "Chino", "Japonés", "Coreano", "Árabe", "Hindi", "Bengalí"])

        def creacion_documento():
            context = {
                'traduccion': traduccion
            }
            os.makedirs('Traducciones', exist_ok=True)  # the output folder may not exist yet
            docx_tpl.render(context)
            docx_tpl.save('Traducciones/traduccion.docx')

        if st.button("Traducir documento"):
            traduccion = translate_text(texto, language1, language2)
            st.write(traduccion)
            creacion_documento()


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/pages/front/htmlTemplates.py:
--------------------------------------------------------------------------------
css = '''