├── .gitignore
├── README.md
├── constitucion
│   ├── codigo_civil_colombia.pdf
│   ├── codigo_comercio.pdf
│   └── codigopenal_colombia.pdf
├── docs
│   └── traduccion.docx
├── formato_demanda
│   └── formato_demanda.docx
├── home.py
└── pages
    ├── 2_🧐Preguntas_al_caso.py
    ├── 3_⚖Preguntas_a_la_constitución.py
    ├── 4_💼Crear_documento.py
    ├── 5_💡Corregir_redaccion.py
    ├── 6_🗺️Traduccion_documentos.py
    └── front
        └── htmlTemplates.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .env
2 | informes
3 | llm
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # AI Bot Law Assistance
3 | 
4 | The JurisBot App is a Python application built with Streamlit, designed to automate the processing of various legal documents, particularly those related to law cases and lawsuits. The app uses AI agents built with LangChain and the OpenAI API to extract data and insights from large documents. Users can also converse with an AI chatbot tailored to legal matters related to the uploaded documents and the Colombian constitution.
5 | 
6 | 
7 | ## Project Structure
8 | 
9 | The project folder is organized as follows:
10 | 
11 | home.py: The main page of the application, where users upload documents and interact with the AI agent.
12 | 
13 | pages/: The app's feature pages (case Q&A, constitution Q&A, document creation, style correction, and document translation); pages/front/ holds the HTML/CSS templates for the chat interface.
14 | ## Features
15 | 
16 | - Document Processing: The app accepts PDF and other types of legal documents, automatically extracting and processing the text content for further analysis.
17 | 
18 | - AI ChatBot: Users can converse with an AI agent powered by OpenAI's GPT-3.5 model. The agent provides information and insights on legal matters found in the uploaded documents, with a focus on the Colombian constitution.
19 | 
20 | - Image-to-Text Conversion: The app employs Amazon Textract (an AWS service) to extract text embedded within images, making that content accessible for analysis.
21 | 
22 | 
23 | ## Tech Stack
24 | 
25 | Python: The core programming language used to build the application.
26 | 
27 | Streamlit: A Python library for creating interactive web applications for data science and machine learning.
28 | 
29 | OpenAI API: Integrates the GPT-3.5 model, allowing users to hold legal discussions with the AI agent.
30 | 
31 | Amazon Textract: An AWS service employed to extract text from images, enhancing the app's capability to process various types of legal documents.
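
For orientation before the source listings below: the question-answering flow that pages/2 and pages/3 implement reduces to a handful of LangChain calls. This is a condensed sketch, not code from the repo; the file name, the question, and an OPENAI_API_KEY in the environment are assumptions.

```python
# Minimal sketch of the app's retrieval pipeline (assumed inputs, see note above).
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# 1. Extract raw text from a case PDF ("caso.pdf" is a placeholder).
text = "".join(page.extract_text() for page in PdfReader("caso.pdf").pages)

# 2. Chunk the text and index the chunks in a FAISS vector store.
chunks = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200).split_text(text)
store = FAISS.from_texts(texts=chunks, embedding=OpenAIEmbeddings())

# 3. Wrap the store in a conversational retrieval chain with chat memory.
chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(),
    retriever=store.as_retriever(),
    memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True),
)

print(chain({"question": "¿Quién es el demandante?"})["answer"])
```

The pages reuse this same pattern; they differ mainly in where the text comes from (uploads, the bundled law codes) and in the prompts they run against the store.
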
32 | 
33 | 
34 | ## Authors
35 | 
36 | - Sergio Quintero
37 | 
38 | 
--------------------------------------------------------------------------------
/constitucion/codigo_civil_colombia.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/constitucion/codigo_civil_colombia.pdf
--------------------------------------------------------------------------------
/constitucion/codigo_comercio.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/constitucion/codigo_comercio.pdf
--------------------------------------------------------------------------------
/constitucion/codigopenal_colombia.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/constitucion/codigopenal_colombia.pdf
--------------------------------------------------------------------------------
/docs/traduccion.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/docs/traduccion.docx
--------------------------------------------------------------------------------
/formato_demanda/formato_demanda.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sergioq2/Law_Bot_Assistance/d49bf207733ac50ea59a07bf533dd7e89c887a90/formato_demanda/formato_demanda.docx
--------------------------------------------------------------------------------
/home.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | from PIL import Image
3 | 
4 | # Configuración de la página
5 | st.set_page_config(
6 |     page_title="Gestión documental de casos de derecho con IA",
7 |     page_icon=":books:",
8 |     layout="wide"
9 | )
10 | 
11 | # Estilos de CSS personalizados
12 | st.markdown(
13 |     """
14 | 
32 |     """,
33 |     unsafe_allow_html=True
34 | )
35 | 
36 | # Diseño del encabezado
37 | st.markdown('<div>JurisBot 🤖</div>', unsafe_allow_html=True)
38 | st.markdown("---")
39 | 
40 | # Descripción de la aplicación
41 | st.markdown('<div>Bienvenidos a la aplicación de gestión documental de casos de derecho con IA.</div>', unsafe_allow_html=True)
42 | st.write(
43 |     """
44 |     Esta aplicación permite realizar las siguientes tareas:
45 |     - Realizar cualquier pregunta sobre algún dato en el documento.
46 |     - Identificar nombres de lugares en el documento con descripciones.
47 |     - Identificar montos de dinero en el documento con descripciones.
48 |     - Identificar leyes y artículos en el documento con descripciones.
49 |     - Identificar nombres de personas en el documento con descripciones.
50 |     - Realizar preguntas sobre leyes o artículos de la Constitución colombiana:
51 |       Código penal, Código civil, Código comercial.
52 |     """
53 | )
54 | st.markdown("---")
55 | 
56 | # Menú lateral
57 | st.sidebar.markdown('', unsafe_allow_html=True)
58 | st.sidebar.info("Selecciona una opción de la lista arriba.")
59 | 
60 | # Fin del código
61 | 
62 | 
63 | 
64 | 
--------------------------------------------------------------------------------
/pages/2_🧐Preguntas_al_caso.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | from dotenv import load_dotenv
3 | from PyPDF2 import PdfReader
4 | from langchain.text_splitter import CharacterTextSplitter
5 | from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
6 | from langchain.vectorstores import FAISS
7 | from langchain.chat_models import ChatOpenAI
8 | from langchain import OpenAI, PromptTemplate
9 | from langchain.memory import ConversationBufferMemory
10 | from langchain.chains import ConversationalRetrievalChain
11 | from pages.front.htmlTemplates import css, bot_template, user_template
12 | from langchain.llms import HuggingFaceHub
13 | from langchain.chains.summarize import load_summarize_chain
14 | from langchain.document_loaders import PyPDFLoader
15 | from langchain.docstore.document import Document
16 | from langchain.chains.question_answering import load_qa_chain
17 | import pandas as pd
18 | from PIL import Image, ImageDraw
19 | import boto3
20 | import os
21 | 
22 | load_dotenv()
23 | 
24 | aws_access_key = os.getenv("AWS_ACCESS_KEY")
25 | aws_access_secret_key = os.getenv("AWS_ACCESS_SECRET_KEY")
26 | 
27 | client = boto3.client('textract', region_name='us-east-1', aws_access_key_id=aws_access_key,
28 |                       aws_secret_access_key=aws_access_secret_key)
29 | 
30 | def read_file(files):
31 |     text = ""
32 |     for uploaded_file in files:
33 |         file_name = uploaded_file.name
34 |         if file_name.lower().endswith('.pdf'):
35 |             pdf_reader = PdfReader(uploaded_file)
36 |             for page in pdf_reader.pages:
37 |                 text += page.extract_text()
38 |         elif file_name.lower().endswith(('.jpg', '.jpeg', '.png', '.gif', '.bmp')):
39 |             image_bytes = uploaded_file.read()
40 |             response = client.detect_document_text(Document={'Bytes': image_bytes})
41 |             for item in response['Blocks']:
42 |                 if item["BlockType"] == "LINE":
43 |                     text = text + item["Text"] + "\n"  # conservar el salto de línea entre renglones detectados
44 |     return text
45 | 
46 | 
47 | def get_text_chunks(text):
48 |     text_splitter = CharacterTextSplitter(
49 |         separator="\n",
50 |         chunk_size=1000,
51 |         chunk_overlap=200,
52 |         length_function=len
53 |     )
54 |     chunks = text_splitter.split_text(text)
55 |     return chunks
56 | 
57 | 
58 | def get_vectorstore(text_chunks):
59 |     embeddings = OpenAIEmbeddings()
60 |     vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
61 |     return vectorstore
62 | 
63 | 
64 | def get_conversation_chain(vectorstore):
65 |     llm = ChatOpenAI()
66 |     memory = ConversationBufferMemory(
67 |         memory_key='chat_history', return_messages=True)
68 |     conversation_chain = ConversationalRetrievalChain.from_llm(
69 |         llm=llm,
70 |         retriever=vectorstore.as_retriever(),
71 |         memory=memory
72 |     )
73 |     return conversation_chain
74 | 
75 | def generate_summary(vectorstore):
76 |     llm = OpenAI(temperature=0)
77 |     chain_1 = load_qa_chain(llm=llm, chain_type='stuff')
78 |     query = 'Haz un resumen del documento de 1 párrafo'
79 |     docs = vectorstore.similarity_search(query)
80 |     summary = chain_1.run(input_documents=docs, question=query)
81 |     return summary
82 | 
83 | def dataframe_output(output):
84 |     try:
85 |         entries = [entry.strip() for entry in output.split(", ")]
86 | 
87 |         data = {
88 |             'Name': [entry.split(": ")[0] for entry in entries],
89 |             'Description': [entry.split(": ")[1] for entry in entries]
90 |         }
91 | 
92 |         df = pd.DataFrame(data)
93 |     except IndexError:  # la respuesta no vino como pares "Nombre: Descripción"
94 |         entries = [entry.strip() for entry in output.split(";")]
95 |         data = {
96 |             'Law': [entry for entry in entries]
97 |         }
98 | 
99 |         df = pd.DataFrame(data)
100 |     return df
101 | 
102 | def generate_nombres(vectorstore):
103 |     chain_df = load_qa_chain(OpenAI(), chain_type="stuff")
104 |     query = """Identifica todos los nombres propios de personas que están en el documento con su respectivo papel o rol en el documento.
105 |     Ejemplo:
106 |     Sergio: Demandante, Maria: Testigo, Jose: Demandado"""
107 |     docs = vectorstore.similarity_search(query)
108 |     salida = chain_df.run(input_documents=docs, question=query)
109 |     df = dataframe_output(salida)
110 |     return df
111 | 
112 | def generate_lugares(vectorstore):
113 |     chain_df = load_qa_chain(OpenAI(), chain_type="stuff")
114 |     query = """Identifica todos los nombres de lugares que están en el documento con su respectiva descripción.
115 |     Ejemplo:
116 |     Barcelona: Vereda, Paraiso: Finca, Santuario: Pueblo"""
117 |     docs = vectorstore.similarity_search(query)
118 |     salida = chain_df.run(input_documents=docs, question=query)
119 |     df = dataframe_output(salida)
120 |     return df
121 | 
122 | def generate_montos(vectorstore):
123 |     chain_df = load_qa_chain(OpenAI(), chain_type="stuff")
124 |     query = """Identifica todos los montos de dinero que están en el documento con su respectiva descripción.
125 |     Ejemplo:
126 |     100: Valor terreno, 200: Valor casa, 300: Valor demandado, 500: Valor escrituras"""
127 |     docs = vectorstore.similarity_search(query)
128 |     salida = chain_df.run(input_documents=docs, question=query)
129 |     df = dataframe_output(salida)
130 |     return df
131 | 
132 | def generate_leyes(vectorstore):
133 |     chain_df = load_qa_chain(OpenAI(), chain_type="stuff")
134 |     query = """Identifica todas las leyes y artículos que están en el documento con su respectiva descripción.
135 |     Ejemplo:
136 |     Ley 100: Ley de salud, Artículo 1: Derecho a la vida, Artículo 2: Derecho a la salud"""
137 |     docs = vectorstore.similarity_search(query)
138 |     salida = chain_df.run(input_documents=docs, question=query)
139 |     df = dataframe_output(salida)
140 |     return df
141 | 
142 | def handle_userinput(user_question):
143 |     response = st.session_state.conversation({'question': user_question})
144 |     st.session_state.chat_history = response['chat_history']
145 | 
146 |     for i, message in enumerate(st.session_state.chat_history):
147 |         if i % 2 == 0:
148 |             st.write(user_template.replace(
149 |                 "{{MSG}}", message.content), unsafe_allow_html=True)
150 |         else:
151 |             st.write(bot_template.replace(
152 |                 "{{MSG}}", message.content), unsafe_allow_html=True)
153 | 
154 | 
155 | def main():
156 |     load_dotenv()
157 |     st.set_page_config(page_title="Pregúntale al asistente virtual Juris Bot",
158 |                        page_icon=":books:")
159 |     st.write(css, unsafe_allow_html=True)
160 | 
161 |     if "conversation" not in st.session_state:
162 |         st.session_state.conversation = None
163 |     if "chat_history" not in st.session_state:
164 |         st.session_state.chat_history = None
165 | 
166 |     st.header("Pregúntale al asistente virtual Juris Bot :books:")
167 |     user_question = st.text_input("Pregunta cualquier dato sobre el caso o sobre cualquier ley colombiana:")
168 |     if user_question:
169 |         handle_userinput(user_question)
170 | 
171 |     with st.sidebar:
172 |         st.subheader("Tus casos")
173 |         pdf_docs = st.file_uploader(
174 |             "Carga acá todos los documentos del caso y haz clic en 'Procesar'", accept_multiple_files=True)
175 | 
176 |         if "summary" in st.session_state:
177 |             st.subheader("Resumen del caso")
178 |             st.write(st.session_state.summary)
179 | 
180 |         st.sidebar.title("Entidades")
181 |         entidades_options = ["Nombres", "Lugares", "Montos", "Leyes"]
182 |         entidades_selected = st.sidebar.selectbox("Seleccione una opción", entidades_options)
183 | 
184 | 
185 |         if pdf_docs and entidades_selected:
186 |             with st.spinner("Procesando..."):
187 |                 raw_text = read_file(pdf_docs)
188 |                 text_chunks = get_text_chunks(raw_text)
189 |                 vectorstore = get_vectorstore(text_chunks)
190 |                 summary = generate_summary(vectorstore)
191 |                 st.session_state.summary = summary
192 | 
193 |                 if entidades_selected == "Nombres":
194 |                     nombres_df = generate_nombres(vectorstore)
195 |                     st.subheader("Nombres de personas y su respectivo papel o rol")
196 |                     st.dataframe(nombres_df, height=500)
197 | 
198 |                 elif entidades_selected == "Lugares":
199 |                     lugares_df = generate_lugares(vectorstore)
200 |                     st.subheader("Lugares y su respectiva descripción")
201 |                     st.dataframe(lugares_df, height=500)
202 | 
203 |                 elif entidades_selected == "Montos":
204 |                     montos_df = generate_montos(vectorstore)
205 |                     st.subheader("Montos de dinero y su respectiva descripción")
206 |                     st.dataframe(montos_df, height=500)
207 | 
208 |                 elif entidades_selected == "Leyes":
209 |                     leyes_df = generate_leyes(vectorstore)
210 |                     st.subheader("Leyes y sus respectivos artículos")
211 |                     st.dataframe(leyes_df, height=500)
212 | 
213 |                 st.session_state.conversation = get_conversation_chain(vectorstore)
214 | 
215 | if __name__ == '__main__':
216 |     main()
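
An aside on the parsing contract in this page: the extraction prompts ask the model for comma-separated `Nombre: Descripción` pairs, which dataframe_output splits into a two-column DataFrame, falling back to a one-column Law table when the reply uses `;` separators instead. A self-contained illustration with made-up values (not repo code):

```python
import pandas as pd

# The shape the few-shot examples ask the model to produce.
salida = "Sergio: Demandante, Maria: Testigo, Jose: Demandado"

entries = [e.strip() for e in salida.split(", ")]
df = pd.DataFrame({
    "Name": [e.split(": ")[0] for e in entries],
    "Description": [e.split(": ")[1] for e in entries],
})
print(df)
#      Name  Description
# 0  Sergio   Demandante
# 1   Maria      Testigo
# 2    Jose    Demandado
```

Any reply that deviates from this exact `", "` and `": "` punctuation falls through to the one-column fallback, which is why the few-shot examples in the prompts mirror the separators the parser expects.
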
--------------------------------------------------------------------------------
/pages/3_⚖Preguntas_a_la_constitución.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | from dotenv import load_dotenv
3 | from PyPDF2 import PdfReader
4 | from langchain.text_splitter import CharacterTextSplitter
5 | from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
6 | from langchain.vectorstores import FAISS
7 | from langchain.chat_models import ChatOpenAI
8 | from langchain import OpenAI, PromptTemplate
9 | from langchain.memory import ConversationBufferMemory
10 | from langchain.chains import ConversationalRetrievalChain
11 | from pages.front.htmlTemplates import css, bot_template, user_template
12 | from langchain.llms import HuggingFaceHub
13 | from langchain.chains.summarize import load_summarize_chain
14 | from langchain.document_loaders import PyPDFLoader
15 | from langchain.docstore.document import Document
16 | from langchain.chains.question_answering import load_qa_chain
17 | import pandas as pd
18 | import os
19 | 
20 | load_dotenv()
21 | 
22 | codigo_civil = "./constitucion/codigo_civil_colombia.pdf"
23 | codigo_comercio = "./constitucion/codigo_comercio.pdf"
24 | codigo_penal = "./constitucion/codigopenal_colombia.pdf"
25 | 
26 | pdf_docs_law = [codigo_civil, codigo_comercio, codigo_penal]
27 | 
28 | def get_pdf_text(pdf_docs_law):
29 |     text = ""
30 |     for pdf in pdf_docs_law:
31 |         pdf_reader = PdfReader(pdf)
32 |         for page in pdf_reader.pages:
33 |             text += page.extract_text()
34 |     return text
35 | 
36 | def get_text_chunks(text):
37 |     text_splitter = CharacterTextSplitter(
38 |         separator="\n",
39 |         chunk_size=1000,
40 |         chunk_overlap=200,
41 |         length_function=len
42 |     )
43 |     chunks = text_splitter.split_text(text)
44 |     return chunks
45 | 
46 | def get_vectorstore(text_chunks):
47 |     embeddings = OpenAIEmbeddings()
48 |     vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
49 |     return vectorstore
50 | 
51 | def get_conversation_chain(vectorstore):
52 |     llm = ChatOpenAI()
53 |     memory_law = ConversationBufferMemory(
54 |         memory_key='chat_history', return_messages=True)
55 |     conversation_chain_law = ConversationalRetrievalChain.from_llm(
56 |         llm=llm,
57 |         retriever=vectorstore.as_retriever(),
58 |         memory=memory_law
59 |     )
60 |     return conversation_chain_law
61 | 
62 | def main():
63 |     load_dotenv()
64 |     st.write(css, unsafe_allow_html=True)
65 | 
66 |     if "conversation_law" not in st.session_state:
67 |         st.session_state.conversation_law = None
68 |     if "chat_history_law" not in st.session_state:
69 |         st.session_state.chat_history_law = None
70 | 
71 |     st.header("Asistente virtual Juris Bot :books:")
72 |     user_question_law = st.text_input("Pregunta cualquier dato sobre la Constitución colombiana:")
73 |     if user_question_law:
74 |         handle_userinput(user_question_law)
75 | 
76 | 
77 | def handle_userinput(user_question_law):
78 |     if st.session_state.conversation_law is None:
79 |         raw_text_law = get_pdf_text(pdf_docs_law)
80 |         text_chunks_law = get_text_chunks(raw_text_law)
81 |         vectorstore_law = get_vectorstore(text_chunks_law)
82 |         st.session_state.conversation_law = get_conversation_chain(vectorstore_law)
83 | 
84 | 
85 |     response_law = st.session_state.conversation_law({'question': user_question_law})
86 |     st.session_state.chat_history_law = response_law['chat_history']
87 | 
88 |     for i, message in enumerate(st.session_state.chat_history_law):
89 |         if i % 2 == 0:
90 |             st.write(user_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
91 |         else:
92 |             st.write(bot_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
93 | 
94 | if __name__ == '__main__':
95 |     main()
--------------------------------------------------------------------------------
/pages/4_💼Crear_documento.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | from dotenv import load_dotenv
3 | from docxtpl import DocxTemplate
4 | import os
5 | 
6 | docx_tpl = DocxTemplate("./formato_demanda/formato_demanda.docx")
7 | 
8 | def main():
9 |     load_dotenv()
10 |     st.header("Crea tus documentos legales :notebook_with_decorative_cover:")
11 | 
12 |     documentos_legales = ["Demanda laboral", "Contrato de arrendamiento", "Contrato de compraventa", "Contrato de trabajo",
13 |                           "Contrato de prestación de servicios", "Contrato de sociedad"]
14 | 
15 |     documento_legal = st.selectbox("Selecciona el tipo de documento legal que deseas generar:", documentos_legales)
16 | 
17 |     # Valores por defecto: no todas las ramas piden todos los campos de la plantilla
18 |     ciudad = fecha = nombre_abogado = cedula_abogado = ""
19 | 
20 |     def creacion_documento():
21 |         nombre_informe = str(numero_radicado)
22 |         context = {
23 |             'nombre_demandante': nombre_demandante,
24 |             'nombre_demandado': nombre_demandado,
25 |             'cedula_demandante': cedula_demandante,
26 |             'cedula_demandado': cedula_demandado,
27 |             'ciudad': ciudad,
28 |             'fecha': fecha,
29 |             'nombre_abogado': nombre_abogado,
30 |             'cedula_abogado': cedula_abogado
31 |         }
32 |         docx_tpl.render(context)
33 |         os.makedirs('Informes', exist_ok=True)  # asegurar que exista la carpeta de salida
34 |         docx_tpl.save(f'Informes/{nombre_informe}.docx')
35 | 
36 | 
37 |     if documento_legal == "Demanda laboral":
38 |         st.write("A continuación, ingresa la información necesaria para crear el documento")
39 |         numero_radicado = st.text_input("Número de radicado:")
40 |         nombre_demandante = st.text_input("Nombre del demandante:")
41 |         nombre_demandado = st.text_input("Nombre del demandado:")
42 |         cedula_demandante = st.text_input("Cédula del demandante:")
43 |         cedula_demandado = st.text_input("Cédula del demandado:")
44 |         ciudad = st.text_input("Ciudad:")
45 |         fecha = st.date_input("Fecha:")
46 |         nombre_abogado = st.text_input("Nombre del abogado:")
47 |         cedula_abogado = st.text_input("Cédula del abogado:")
48 |         if st.button("Generar documento"):
49 |             creacion_documento()
50 |             st.write("Documento generado")
51 | 
52 |     if documento_legal == "Contrato de compraventa":
53 |         st.write("A continuación, ingresa la información necesaria para crear el documento")
54 |         numero_radicado = st.text_input("Número de radicado:")
55 |         nombre_demandante = st.text_input("Nombre del demandante:")
56 |         nombre_demandado = st.text_input("Nombre del demandado:")
57 |         cedula_demandante = st.text_input("Cédula del demandante:")
58 |         cedula_demandado = st.text_input("Cédula del demandado:")
59 |         if st.button("Generar documento"):
60 |             creacion_documento()
61 |             st.write("Documento generado")
62 | 
63 |     if documento_legal == "Contrato de trabajo":
64 |         st.write("A continuación, ingresa la información necesaria para crear el documento")
65 |         numero_radicado = st.text_input("Número de radicado:")
66 |         nombre_demandante = st.text_input("Nombre del demandante:")
67 |         nombre_demandado = st.text_input("Nombre del demandado:")
68 |         cedula_demandante = st.text_input("Cédula del demandante:")
69 |         cedula_demandado = st.text_input("Cédula del demandado:")
70 |         if st.button("Generar documento"):
71 |             creacion_documento()
72 |             st.write("Documento generado")
73 | 
74 | 
75 | if __name__ == "__main__":
76 |     main()
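
A note on the generation step above: docxtpl fills the Word template by matching Jinja-style tags inside formato_demanda.docx against the keys of the context dict. The template is a binary file, so its tags are not visible in this dump; the snippet below is a hypothetical, self-contained illustration of the same render/save cycle, with made-up values and assumed tags such as {{ nombre_demandante }} inside the document.

```python
from docxtpl import DocxTemplate

# Assumed: the .docx contains tags like "Demandante: {{ nombre_demandante }}".
tpl = DocxTemplate("./formato_demanda/formato_demanda.docx")
tpl.render({
    'nombre_demandante': 'Ana Pérez',
    'nombre_demandado': 'Luis Gómez',
    'cedula_demandante': '123456',
    'cedula_demandado': '654321',
    'ciudad': 'Bogotá',
    'fecha': '2024-01-01',
    'nombre_abogado': 'Carlos Ruiz',
    'cedula_abogado': '987654',
})
tpl.save("demo_demanda.docx")  # escribe el documento con los tags reemplazados
```
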
--------------------------------------------------------------------------------
/pages/5_💡Corregir_redaccion.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | from dotenv import load_dotenv
3 | from langchain.text_splitter import CharacterTextSplitter
4 | from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
5 | from langchain.vectorstores import FAISS
6 | from langchain import OpenAI, PromptTemplate
7 | from langchain.chains.question_answering import load_qa_chain
8 | 
9 | load_dotenv()
10 | 
11 | def get_text_chunks(text):
12 |     text_splitter = CharacterTextSplitter(
13 |         separator="\n",
14 |         chunk_size=1000,
15 |         chunk_overlap=200,
16 |         length_function=len
17 |     )
18 |     chunks = text_splitter.split_text(text)
19 |     return chunks
20 | 
21 | 
22 | def get_vectorstore(text_chunks):
23 |     embeddings = OpenAIEmbeddings()
24 |     vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
25 |     return vectorstore
26 | 
27 | def generate_correction(vectorstore):
28 |     llm = OpenAI(temperature=0)
29 |     chain_1 = load_qa_chain(llm=llm, chain_type='stuff')
30 |     query = '''Simula que eres un abogado profesional y trata de conservar el lenguaje propio
31 |     del campo legal. Haz una corrección al texto siguiendo los protocolos legales y profesionales'''
32 |     docs = vectorstore.similarity_search(query)
33 |     correction = chain_1.run(input_documents=docs, question=query)
34 |     return correction
35 | 
36 | def main():
37 |     load_dotenv()
38 |     st.markdown(
39 |         """
40 | 
48 |         """,
49 |         unsafe_allow_html=True,
50 |     )
51 |     st.title("Corregir redacción")
52 |     st.markdown("Escribe un texto y el sistema te ayudará a corregir la redacción")
53 |     texto = st.text_area("Escribe o copia un texto", height=300)
54 | 
55 |     if st.button("Corregir redacción"):
56 |         text_chunks = get_text_chunks(texto)
57 |         vectorstore = get_vectorstore(text_chunks)
58 |         correction = generate_correction(vectorstore)
59 |         st.write(correction)
60 | 
61 | if __name__ == "__main__":
62 |     main()
--------------------------------------------------------------------------------
/pages/6_🗺️Traduccion_documentos.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | from dotenv import load_dotenv
3 | from PyPDF2 import PdfReader
4 | from langchain.text_splitter import CharacterTextSplitter
5 | from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
6 | from langchain.vectorstores import FAISS
7 | from langchain.chat_models import ChatOpenAI
8 | from langchain import OpenAI, PromptTemplate
9 | from langchain.memory import ConversationBufferMemory
10 | from langchain.chains import ConversationalRetrievalChain
11 | from pages.front.htmlTemplates import css, bot_template, user_template
12 | from langchain.llms import HuggingFaceHub
13 | from langchain.chains.summarize import load_summarize_chain
14 | from langchain.document_loaders import PyPDFLoader
15 | from langchain.docstore.document import Document
16 | from langchain.chains.question_answering import load_qa_chain
17 | import pandas as pd
18 | from PIL import Image, ImageDraw
19 | import boto3
20 | import docx2txt
21 | import openai
22 | from docxtpl import DocxTemplate
23 | import os
24 | import json
25 | 
26 | load_dotenv()  # cargar las variables de entorno antes de leerlas a nivel de módulo
27 | 
28 | docx_tpl = DocxTemplate("./docs/traduccion.docx")
29 | 
30 | openai_api_key = os.getenv("OPENAI_API_KEY")
31 | aws_access_key = os.getenv("AWS_ACCESS_KEY")
32 | aws_access_secret_key = os.getenv("AWS_ACCESS_SECRET_KEY")
33 | 
34 | openai.api_key = openai_api_key
35 | 
36 | client = boto3.client('textract', region_name='us-east-1', aws_access_key_id=aws_access_key,
37 |                       aws_secret_access_key=aws_access_secret_key)
38 | 
39 | def read_file(files):
40 |     text = ""
41 |     file_name = files.name
42 |     if file_name.lower().endswith('.pdf'):
43 |         pdf_reader = PdfReader(files)
44 |         for page in pdf_reader.pages:
45 |             text += page.extract_text()
46 |     elif file_name.lower().endswith(('.jpg', '.jpeg', '.png', '.gif', '.bmp')):
47 |         image_bytes = files.read()
48 |         response = client.detect_document_text(Document={'Bytes': image_bytes})
49 |         for item in response['Blocks']:
50 |             if item["BlockType"] == "LINE":
51 |                 text = text + item["Text"] + "\n"  # conservar el salto de línea entre renglones
52 |     elif file_name.lower().endswith('.docx'):  # el uploader acepta .docx; extraerlo con docx2txt
53 |         text = docx2txt.process(files)
54 |     return text
55 | 
56 | 
57 | def get_text_chunks(text):
58 |     text_splitter = CharacterTextSplitter(
59 |         separator="\n",
60 |         chunk_size=1000,
61 |         chunk_overlap=200,
62 |         length_function=len
63 |     )
64 |     chunks = text_splitter.split_text(text)
65 |     return chunks
66 | 
67 | 
68 | def get_vectorstore(text_chunks):
69 |     embeddings = OpenAIEmbeddings()
70 |     vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
71 |     return vectorstore
72 | 
73 | def detect_language(vectorstore):
74 |     llm = OpenAI(temperature=0)
75 |     chain_1 = load_qa_chain(llm=llm, chain_type='stuff')
76 |     query = 'Detecta el idioma de este documento y devuélvelo en español en una sola palabra'
77 |     docs = vectorstore.similarity_search(query)
78 |     source_language = chain_1.run(input_documents=docs, question=query)
79 |     return source_language
80 | 
81 | def translate_text(text, source_language, target_language):
82 |     prompt = f"Translate the following '{source_language}' text to '{target_language}': {text}"
83 | 
84 |     response = openai.ChatCompletion.create(
85 |         model="gpt-3.5-turbo",
86 |         messages=[
87 |             {"role": "system", "content": "You are a helpful assistant that translates text."},
88 |             {"role": "user", "content": prompt}
89 |         ],
90 |         n=1,
91 |         stop=None,
92 |         temperature=0,
93 |     )
94 | 
95 |     translation = response.choices[0].message.content.strip()
96 |     return translation
97 | 
98 | 
99 | def main():
100 |     load_dotenv()
101 |     st.title("Traducción de texto")
102 |     st.markdown("Sube el documento que deseas traducir")
103 |     file = st.file_uploader("Sube el documento que deseas traducir", type=['pdf', 'docx', 'jpg'])
104 |     if file is not None:
105 |         texto = read_file(file)
106 |         text_chunks = get_text_chunks(texto)
107 |         vectorstore = get_vectorstore(text_chunks)
108 |         language1 = detect_language(vectorstore)
109 |         st.write(f"El idioma detectado es: {language1}")
110 | 
111 |         language2 = st.selectbox("Escoge el idioma al que deseas traducirlo",
112 |                                  ["Español", "Inglés", "Francés", "Alemán", "Italiano", "Portugués",
113 |                                   "Ruso", "Chino", "Japonés", "Coreano", "Árabe", "Hindi", "Bengalí"])
114 | 
115 |         def creacion_documento():
116 |             context = {
117 |                 'traduccion': traduccion
118 |             }
119 |             docx_tpl.render(context)
120 |             os.makedirs('Traducciones', exist_ok=True)  # asegurar que exista la carpeta de salida
121 |             docx_tpl.save('Traducciones/traduccion.docx')
122 | 
123 | 
124 |         if st.button("Traducir documento"):
125 |             traduccion = translate_text(texto, language1, language2)
126 |             st.write(traduccion)
127 |             creacion_documento()
128 | 
129 | 
130 | if __name__ == "__main__":
131 |     main()
--------------------------------------------------------------------------------
/pages/front/htmlTemplates.py:
--------------------------------------------------------------------------------
1 | css = '''
2 | 
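
The dump is truncated inside pages/front/htmlTemplates.py. The pages above rely on only three names from it: css (injected once per page) and bot_template/user_template (HTML snippets whose {{MSG}} placeholder is replaced with each message). A minimal stand-in consistent with those call sites could look like the sketch below; the class names and styling are assumptions, not the original file.

```python
# Hypothetical reconstruction of pages/front/htmlTemplates.py; the real markup
# and styles were lost in the truncation. Only the shape matters: css is raw
# HTML/CSS, and both templates carry a {{MSG}} placeholder that the pages
# replace via str.replace("{{MSG}}", message.content).
css = '''
<style>
.chat-message { padding: 1rem; border-radius: 0.5rem; margin-bottom: 1rem; display: flex; }
.chat-message.user { background-color: #2b313e; }
.chat-message.bot { background-color: #475063; }
.chat-message .message { width: 100%; padding: 0 1rem; color: #fff; }
</style>
'''

bot_template = '''
<div class="chat-message bot">
    <div class="message">{{MSG}}</div>
</div>
'''

user_template = '''
<div class="chat-message user">
    <div class="message">{{MSG}}</div>
</div>
'''
```
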