├── .DS_Store ├── README.md ├── img ├── deep-retriever.png ├── makeup-schema-bq.png ├── poor_output_formatting.png ├── zietghost_arch.png ├── zietghost_concept.png ├── zietghost_marketer.png ├── zietghost_process.png └── zietghost_title.png ├── notebooks ├── .DS_Store ├── .env ├── .env.sample ├── 00-env-setup.ipynb ├── 01-setup-vertex-vector-store.ipynb ├── 02-gdelt-data-ops.ipynb ├── 03-vector-store-index-loader.ipynb ├── 03a-optional-chunk-up-the-docs.ipynb ├── 04-build-zeitghost-image.ipynb ├── 05-gdelt-pipelines.ipynb ├── 06-plan-and-execute-agents.ipynb ├── 07-streamlit-ui-plan-and-execute.py ├── imgs │ ├── add-gdelt-to-bq.gif │ ├── agent_plan_execute_chain_output.png │ ├── architecture.png │ ├── chunk_bq_tables_flow.png │ ├── chunk_gcs_blobs_flow.png │ ├── chunk_youtube_flow.png │ ├── da-ui.png │ ├── deep-retriever.png │ ├── fullarchitecture.png │ ├── google-trends-explore.png │ ├── info-architecture.png │ ├── langchain-diagram.png │ ├── langchain-overview.png │ ├── langchain_intro.png │ ├── pipeline-complete.png │ ├── pipeline_metadata.png │ ├── plan-execute-example-output.png │ ├── public_trends_data.png │ ├── user-flow-plan-execute.png │ ├── zghost_overview.png │ ├── zghost_overview_ME.png │ ├── zghost_overview_agents.png │ ├── zghost_overview_gdelt.png │ ├── zghost_overview_load_index.png │ └── zghost_overview_pipeline_steps.png └── requirements.txt ├── streamlit_agent ├── __init__.py ├── callbacks │ ├── __init__.py │ └── capturing_callback_handler.py └── clear_results.py └── zeitghost ├── .dockerignore ├── .env ├── .idea ├── .gitignore ├── codeStyles │ ├── Project.xml │ └── codeStyleConfig.xml ├── misc.xml ├── modules.xml └── vcs.xml ├── __init__.py ├── __pycache__ ├── __init__.cpython-311.pyc └── main.cpython-311.pyc ├── agents ├── Helpers.py ├── LangchainAgent.py ├── __init__.py └── __pycache__ │ ├── Helpers.cpython-311.pyc │ ├── LangchainAgent.cpython-311.pyc │ └── __init__.cpython-311.pyc ├── bigquery ├── BigQueryAccessor.py ├── __init__.py └── __pycache__ │ ├── BigQueryAccessor.cpython-311.pyc │ └── __init__.cpython-311.pyc ├── capturing_callback_handler.py ├── gdelt ├── GdeltData.py ├── Helpers.py ├── __init__.py └── __pycache__ │ ├── GdeltData.cpython-311.pyc │ ├── Helpers.cpython-311.pyc │ └── __init__.cpython-311.pyc ├── testing ├── __init__.py └── basic_agent_unit_tests.py ├── ts_embedding ├── .ipynb_checkpoints │ └── kats_embedding_tools-checkpoint.py ├── bq_data_tools.py └── kats_embedding_tools.py ├── vertex ├── Embeddings.py ├── Helpers.py ├── LLM.py ├── MatchingEngineCRUD.py ├── MatchingEngineVectorstore.py ├── __init__.py └── __pycache__ │ ├── Embeddings.cpython-311.pyc │ ├── Helpers.cpython-311.pyc │ ├── LLM.cpython-311.pyc │ ├── MatchingEngineCRUD.cpython-311.pyc │ ├── MatchingEngineVectorstore.cpython-311.pyc │ └── __init__.cpython-311.pyc ├── webserver ├── __pycache__ │ └── __init__.cpython-311.pyc ├── blueprints │ ├── __pycache__ │ │ └── __init__.cpython-311.pyc │ ├── agents │ │ └── __pycache__ │ │ │ ├── __init__.cpython-311.pyc │ │ │ └── models.cpython-311.pyc │ ├── celery │ │ └── __pycache__ │ │ │ ├── __init__.cpython-311.pyc │ │ │ └── models.cpython-311.pyc │ ├── gdelt │ │ └── __pycache__ │ │ │ ├── __init__.cpython-311.pyc │ │ │ └── models.cpython-311.pyc │ ├── llm │ │ └── __pycache__ │ │ │ ├── __init__.cpython-311.pyc │ │ │ └── models.cpython-311.pyc │ └── vectorstore │ │ └── __pycache__ │ │ ├── __init__.cpython-311.pyc │ │ └── models.cpython-311.pyc └── celery │ └── __pycache__ │ ├── __init__.cpython-311.pyc │ ├── gdelt_tasks.cpython-311.pyc │ ├── 
vertex_tasks.cpython-311.pyc │ └── worker.cpython-311.pyc └── zeitghost-trendspotting.iml /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/.DS_Store -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Welcome to the Zeitghost - Build Your Own News & Media Listening Platform with a Conversational Agent 2 | 3 | 4 | # Update - 7.20.23 - Streamlit UI 5 | 6 | A hardcoded example of running a UI application with Streamlit is now found in [`notebooks/07-streamlit-ui-plan-and-execute.py`](notebooks/07-streamlit-ui-plan-and-execute.py) 7 | 8 | `pip install streamlit` 9 | 10 | Also, upgrade langchain and sqlalchemy-bigquery: 11 | 12 | `pip install -U langchain` 13 | 14 | `pip install -U sqlalchemy-bigquery` 15 | 16 | ### To run the UI: 17 | 18 | `cd notebooks` 19 | 20 | `streamlit run 07-streamlit-ui-plan-and-execute.py` 21 | 22 | Your browser will pop up with the UI 23 | 24 |
25 | 26 |
27 | 28 | The repo contains a set of notebooks and helper classes to enable you to create a conversational agent with access to a variety of datasets and APIs (tools) to answer end-user questions. 29 | 30 | By the final notebook, you'll have created an agent with access to the following tools: 31 | * News and media dataset (GDELT) index comprising global news related to your `ACTOR` 32 | * The [Google Trends](https://trends.google.com/trends/explore?hl=en) public dataset. This data source helps us understand what people are searching for, in real time. We can use this data to measure search interest in a particular topic, in a particular place, and at a particular time. 33 | * [Google Search API Wrapper](https://developers.google.com/custom-search/v1/overview) - to retrieve and display search results from web searches 34 | * A calculator to help the LLM with math 35 | * An [SQL Database Agent](https://python.langchain.com/en/latest/modules/agents/toolkits/examples/sql_database.html) for interacting with SQL databases (e.g., BigQuery). As an example, an agent's plan may require it to search for trends in the Google Trends BigQuery table (a minimal code sketch of wiring these tools into a single agent follows the example output below) 36 | 37 | ### **Below is an image showing the user flow interacting with the multi-tool agent using a plan-execute strategy**: 38 |
39 | 40 |
41 | 42 | ### **Here's an example of what the user output will look like in a notebook**: 43 |
44 | 45 |
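To make the tool list above concrete, here is a minimal sketch of wiring a calculator, a web-search wrapper, and a Vertex AI LLM into a single LangChain agent. It follows the pattern used in [`notebooks/07-streamlit-ui-plan-and-execute.py`](notebooks/07-streamlit-ui-plan-and-execute.py), which additionally extends the tool list with the GDELT vector store tools and the BigQuery SQL agent tools; the example question and model settings here are illustrative only.

```python
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chains import LLMMathChain
from langchain.llms import VertexAI
from langchain.utilities import GoogleSearchAPIWrapper

# Vertex AI PaLM text model exposed through LangChain
llm = VertexAI(temperature=0)

# Calculator tool backed by an LLM math chain
llm_math_chain = LLMMathChain.from_llm(llm=llm, verbose=True)

# Google Custom Search wrapper; reads GOOGLE_API_KEY and GOOGLE_CSE_ID from the
# environment (see notebooks/.env.sample)
search = GoogleSearchAPIWrapper()

tools = [
    Tool(
        name="Calculator",
        func=llm_math_chain.run,
        description="useful for when you need to answer questions about math",
    ),
    Tool(
        name="Search",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    ),
]
# The notebooks extend `tools` with the GDELT vector store tools and the
# BigQuery (Google Trends) SQL agent tools before initializing the agent.

agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("How has search interest in generative AI changed recently?")
```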
46 | 47 | 48 | ## GDELT: A global database of society 49 | 50 | Google's mission is to organise the world's information and make it universally accessible and useful. Supported by Google Jigsaw, the [GDELT Project](https://www.gdeltproject.org/) monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world. 51 | 52 | Monitoring nearly the entire world's news media is only the beginning - even the largest team of humans could not begin to read and analyze the billions upon billions of words and images published each day. GDELT uses some of the world's most sophisticated computer algorithms, custom-designed for global news media, running on "one of the most powerful server networks in the known Universe", together with some of the world's most powerful deep learning algorithms, to create a realtime computable record of global society that can be visualized, analyzed, modeled, examined and even forecasted. A huge array of datasets totaling trillions of datapoints is available. Three primary data streams are created: one codifying physical activities around the world in over 300 categories, one recording the people, places, organizations, millions of themes and thousands of emotions underlying those events and their interconnections, and one codifying the visual narratives of the world's news imagery. 53 | 54 | All three streams update every 15 minutes, offering near-realtime insights into the world around us. Underlying the streams are a vast array of sources, from hundreds of thousands of global media outlets to special collections like 215 years of digitized books, 21 billion words of academic literature spanning 70 years, human rights archives and even saturation processing of the raw closed captioning stream of almost 100 television stations across the US in collaboration with the Internet Archive's Television News Archive. Finally, also in collaboration with the Internet Archive, the Archive captures nearly all worldwide online news coverage monitored by GDELT each day into its permanent archive to ensure its availability for future generations even in the face of repressive forces that continue to erode press freedoms around the world. 55 | 56 | For more information on how to navigate the datasets, see the [GDELT 2.0 Data format codebook](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf) 57 | 58 | **How to add GDELT 2.0 to your Google Cloud Project** 59 | 60 | ![Animated gif showing how to add Gdelt 2.0 to bigquery by clicking new data and public data sources](notebooks/imgs/add-gdelt-to-bq.gif) 61 | 62 | ## Large Data from GDELT needs Effective and Efficient Knowledge Retrieval 63 | 64 | Simply accessing and storing the data isn't enough. Processing large queries of news sources from GDELT is easy to do with BigQuery; however, to accelerate exploration and discovery of the information, tools are needed to provide semantic search on the contents of the global news and media dataset. 65 | 66 | [Vertex AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview) provides the industry's leading high-scale, low-latency vector database. These vector databases are commonly referred to as vector similarity-matching or an approximate nearest neighbor (ANN) service.
Vertex AI Matching Engine provides the ability to scale knowledge retrieval, paired with our new [Vertex AI Text Embeddings API](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). We use document chunking and [`newspaper3k`](https://pypi.org/project/newspaper3k/) to pull down the articles and convert the document passages into embeddings with the Vertex Embeddings API (via Python SDK). 67 | 68 | Below is an example depiction of the information architecture, utilizing Matching Engine and the text embeddings model. 69 | 70 | ![Information Architecture showing how Matching Engine is used for semantic search](notebooks/imgs/info-architecture.png) 71 | 72 | ## Leveraging GenAI Language Models to get Relevant & Real-time Information Using Conversational Agents 73 | More than ever, access to relevant and real-time information to make decisions and understand changing patterns and trends is critical. Whilst GDELT contains a massive amount of relevant and real-time (up to every 15 min) data, it can be challenging and overwhelming to make sense of it and extract what is most important for specific areas of interest, topics, events, and entities. This project, the Zeitghost, provides a reference solution for how you can specify your entities and events of interest to extract from GDELT, index and load them into a vector database, and leverage the Vertex AI Generative Language models with Langchain Agents to interact in a Q&A style with the information. We also show how you can orchestrate and schedule ongoing refreshes of the data to keep the system up to date with the latest information. 74 | 75 | For more information about getting started with Langchain Agents, see [Langchain Examples](https://github.com/GoogleCloudPlatform/generative-ai/tree/dev/language/examples/oss-samples/langchain) 76 | 77 | Finally, to go beyond an agent with one chain of thought along with one tool, we explore how you can start to combine plan-and-execute agents together. [Plan-and-execute Agents](https://python.langchain.com/en/latest/modules/agents/plan_and_execute.html) accomplish an objective by first planning what to do, then executing the subtasks. 78 | * This idea is largely inspired by [BabyAGI](https://github.com/yoheinakajima/babyagi) and the [Plan-and-Solve](https://arxiv.org/abs/2305.04091) paper. 79 | * The planning is almost always done by an LLM. 80 | * The execution is usually done by a separate agent (**equipped with tools**). 81 | By allowing agents with access to different source data to interact with each other, we can uncover new insights that may not have been obvious by examining each of the datasets in isolation. 82 | 83 | ## Solution Overview 84 | 85 | ### Architecture 86 | 87 | ![Full end to end architecture](notebooks/imgs/fullarchitecture.png) 88 | 89 | ### Component Flow 90 | This project consists of a series of notebooks leveraging a customized code base to: 91 | - Filter and extract all of the relevant web URLs for a given entity from the GDELT global entity graph or a type of global event for a specified time period, leveraging the GDELT data that is publicly available natively in BigQuery. An example could be an ACTOR='World Health Organization', for the time period of March 2020 to present, including events about COVID lockdown.
92 | - Extract the full article and news content from every URL that is returned from the GDELT datasets and generate text embeddings using the [Vertex AI Embeddings Model](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) 93 | - Create a [Vertex AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview) Vector Database Index and deploy it to an Index Endpoint 94 | - Stream update generated embeddings into the Matching Engine Vector Database Index 95 | - Create a managed pipeline to orchestrate the ongoing refresh of the GDELT data into the Matching Engine Vector DB 96 | ![gdelt-pipeline](notebooks/imgs/pipeline-complete.png) 97 | - Test the generic semantic search capabilities of the Vector DB, and test using a Langchain Agent with one chain of thought along with one tool 98 | - Create a plan-and-execute agent framework where different agents (the GDELT Langchain agent, a BigQuery public trends agent, and a Google Search API agent) are able to talk to each other to answer questions 99 | 100 | We are currently working on adding: 101 | - An application to build and deploy the conversational agent as an API on Google Cloud Run - where it can then be integrated into any application. 102 | - A customizable front-end reference architecture that uses the agents, which can be used to showcase the art of the possible 103 | - Incorporating future enhancements to the embedding techniques used to improve the relevancy and performance of retrieval 104 | 105 | ## How to use this repo 106 | 107 | * Setup VPC Peering 108 | * Create a Vertex AI Workbench Instance 109 | * Run the notebooks 110 | 111 | #### Important: be sure to create the Vertex AI notebook instance within the same VPC Network used for the Vertex AI Matching Engine deployment 112 | 113 | If you don't use the same VPC network, you will not be able to make calls to the matching engine vector store database. 114 | 115 | ### Step 1: Setup VPC Peering using the following `gcloud` commands in Cloud Shell 116 | 117 | * Similar to the setup instructions for Vertex AI Matching Engine in this [Sample Notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/matching_engine/sdk_matching_engine_for_indexing.ipynb), there are a few permissions needed in order to create the VPC network and set up VPC peering 118 | * Run the following `gcloud` commands in Cloud Shell to create the network and VPC peering needed for deploying Matching Engine indexes to Private Endpoints: 119 | 120 | ```bash 121 | VPC_NETWORK="YOUR_NETWORK_NAME" 122 | PROJECT_ID="YOUR_PROJECT_ID" 123 | PEERING_RANGE_NAME="PEERINGRANGENAME" 124 | 125 | # Create a VPC network 126 | gcloud compute networks create $VPC_NETWORK --bgp-routing-mode=regional --subnet-mode=auto --project=$PROJECT_ID 127 | 128 | # Add necessary firewall rules 129 | gcloud compute firewall-rules create $VPC_NETWORK-allow-icmp --network $VPC_NETWORK --priority 65534 --project $PROJECT_ID --allow icmp 130 | 131 | gcloud compute firewall-rules create $VPC_NETWORK-allow-internal --network $VPC_NETWORK --priority 65534 --project $PROJECT_ID --allow all --source-ranges 10.128.0.0/9 132 | 133 | gcloud compute firewall-rules create $VPC_NETWORK-allow-rdp --network $VPC_NETWORK --priority 65534 --project $PROJECT_ID --allow tcp:3389 134 | 135 | gcloud compute firewall-rules create $VPC_NETWORK-allow-ssh --network $VPC_NETWORK --priority 65534 --project $PROJECT_ID --allow tcp:22 136 | 137 | # Reserve IP range 138 |
gcloud compute addresses create $PEERING_RANGE_NAME --global --prefix-length=16 --network=$VPC_NETWORK --purpose=VPC_PEERING --project=$PROJECT_ID --description="peering range" 139 | 140 | # Set up peering with service networking 141 | # Your account must have the "Compute Network Admin" role to run the following. 142 | gcloud services vpc-peerings connect --service=servicenetworking.googleapis.com --network=$VPC_NETWORK --ranges=$PEERING_RANGE_NAME --project=$PROJECT_ID 143 | ``` 144 | 145 | ### Step 2: Create Vertex AI Workbench notebook using the following `gcloud` command in Cloud Shell: 146 | 147 | * Using this base image will ensure you have the proper starting environment to use these notebooks 148 | * You can optionally use a CPU-only image family if you don't need GPUs 149 | 150 | ```bash 151 | INSTANCE_NAME='your-instance-name' 152 | 153 | gcloud notebooks instances create $INSTANCE_NAME \ 154 | --vm-image-project=deeplearning-platform-release \ 155 | --vm-image-family=tf-ent-2-11-cu113-notebooks-debian-11-py39 \ 156 | --machine-type=n1-standard-8 \ 157 | --location=us-central1-a \ 158 | --network=$VPC_NETWORK 159 | # alternative image family: tf-latest-cu113-debian-11-py39 160 | ``` 161 | 162 | ### Step 3: Clone this repo 163 | * Once the Vertex AI Workbench instance is created, open a terminal via the file menu: **File > New > Terminal** 164 | * Run the following code to clone this repo: 165 | 166 | ```bash 167 | git clone https://github.com/hello-d-lee/conversational-agents-zeitghost.git 168 | ``` 169 | 170 | ### Step 4: Go to the first notebook (`00-env-setup.ipynb`), follow the instructions, and continue through the remaining notebooks 171 | 172 | 0. [Environment Setup](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/00-env-setup.ipynb) - used to create configurations once that can be used for the rest of the notebooks 173 | 1. [Setup Vertex Vector Store](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/01-setup-vertex-vector-store.ipynb) - create the Vertex AI Matching Engine Vector Store Index and deploy it to an endpoint. This can take 40-50 min, so whilst waiting the next notebook can be run. 174 | 2. [GDELT DataOps](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/02-gdelt-data-ops.ipynb) - parameterize the topics and time period of interest, run the extraction against GDELT for article and news content 175 | 3. [Vector Store Index Loader](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/03-vector-store-index-loader.ipynb) - create embeddings and load the vectors into the Matching Engine Vector Store. Test the semantic search capabilities and langchain agent using the Vector Store. 176 | 4. [Build Zeitghost Image](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/04-build-zeitghost-image.ipynb) - create a custom container to be used to create the GDELT pipeline for ongoing data updates 177 | 5. [GDELT Pipelines](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/05-gdelt-pipelines.ipynb) - create the pipeline to orchestrate and automatically refresh the data and update the new vectors into the matching engine index 178 | 6.
[Plan and Execute Agents](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/06-plan-and-execute-agents.ipynb) - create new agents using the BigQuery public trends dataset, and the Google Search API, and use the agents together to uncover new insights 179 | -------------------------------------------------------------------------------- /img/deep-retriever.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/deep-retriever.png -------------------------------------------------------------------------------- /img/makeup-schema-bq.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/makeup-schema-bq.png -------------------------------------------------------------------------------- /img/poor_output_formatting.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/poor_output_formatting.png -------------------------------------------------------------------------------- /img/zietghost_arch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_arch.png -------------------------------------------------------------------------------- /img/zietghost_concept.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_concept.png -------------------------------------------------------------------------------- /img/zietghost_marketer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_marketer.png -------------------------------------------------------------------------------- /img/zietghost_process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_process.png -------------------------------------------------------------------------------- /img/zietghost_title.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_title.png -------------------------------------------------------------------------------- /notebooks/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/.DS_Store -------------------------------------------------------------------------------- /notebooks/.env: -------------------------------------------------------------------------------- 1 | GOOGLE_CSE_ID=8743920w 2 | GOOGLE_API_KEY=9238rewr 
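These two variables are the credentials that LangChain's `GoogleSearchAPIWrapper` (the Google Search API tool listed in the README) expects to find in the environment, and `.env.sample` below shows the expected keys. A minimal sketch of loading them with `python-decouple` (which is pinned in `notebooks/requirements.txt`) is shown here; whether the notebooks load the file this way or export the variables some other way is an assumption.

```python
import os
from decouple import config  # python-decouple; resolves values from a nearby .env file

# Export the Custom Search credentials under the standard names that
# langchain.utilities.GoogleSearchAPIWrapper looks up.
os.environ["GOOGLE_CSE_ID"] = config("GOOGLE_CSE_ID")
os.environ["GOOGLE_API_KEY"] = config("GOOGLE_API_KEY")
```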
-------------------------------------------------------------------------------- /notebooks/.env.sample: -------------------------------------------------------------------------------- 1 | GOOGLE_CSE_ID= 2 | GOOGLE_API_KEY= -------------------------------------------------------------------------------- /notebooks/01-setup-vertex-vector-store.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "4a3c8d01", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "# Copyright 2023 Google LLC\n", 11 | "#\n", 12 | "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", 13 | "# you may not use this file except in compliance with the License.\n", 14 | "# You may obtain a copy of the License at\n", 15 | "#\n", 16 | "# https://www.apache.org/licenses/LICENSE-2.0\n", 17 | "#\n", 18 | "# Unless required by applicable law or agreed to in writing, software\n", 19 | "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", 20 | "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", 21 | "# See the License for the specific language governing permissions and\n", 22 | "# limitations under the License." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "id": "b9d4ec90-df65-4910-b3bd-1aecf803b6a4", 28 | "metadata": {}, 29 | "source": [ 30 | "# Setting up Vector Stores with Vertex Matching Engine\n", 31 | "\n", 32 | " \n", 37 | " \n", 43 | " \n", 49 | "
\n", 33 | " \n", 34 | " \"Colab Run in Colab\n", 35 | " \n", 36 | " \n", 38 | " \n", 39 | " \"GitHub\n", 40 | " View on GitHub\n", 41 | " \n", 42 | " \n", 44 | " \n", 45 | " \"Vertex\n", 46 | " Open in Vertex AI Workbench\n", 47 | " \n", 48 | "
" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "id": "299bd214", 55 | "metadata": {}, 56 | "source": [ 57 | "## Overview\n", 58 | "\n", 59 | "
\n", 60 | "\n", 61 | "
\n", 62 | "When working with LLMs and conversational agents, how the data that they are accessing is stored is crucial - efficient data processing is more important than ever for applications involving large language models, genAI, and semantic search. Many of these new applications using large unstructured datasets use vector embeddings, a data representation containing semantic information that LLMs can use to answer questions and maintain in a long-term memory. \n", 63 | "\n", 64 | "In this application we will use a specialized database - a Vector Database - for handling embeddings, optimized for storage and querying capabilities for embeddings. The GDELT dataset extract could be quite large depending on the actor_name and time range, so we want to make sure that we aren't sacrificing performance to interact with such a potentially large dataset, which is where Vertex AI Matching Engine's Vector Database will ensure that we can scale for any very large number of embeddings.\n", 65 | "\n", 66 | "In this notebook you'll go through the process to create and deploy a vector store in Vertex Matching Engine. Whilst the setup may take 40-50min, once you've done this once, you can update, delete, and continue to add embeddings to this instance. \n", 67 | "\n", 68 | "---\n", 69 | "\n", 70 | "[Vertex AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview) provides the industry's leading high-scale low latency vector database. These vector databases are commonly referred to as vector similarity-matching or an approximate nearest neighbor (ANN) service.\n", 71 | "\n", 72 | "Matching Engine provides tooling to build use cases that match semantically similar items. More specifically, given a query item, Matching Engine finds the most semantically similar items to it from a large corpus of candidate items. This ability to search for semantically similar or semantically related items has many real world use cases and is a vital part of applications such as:\n", 73 | "\n", 74 | "* Recommendation engines\n", 75 | "* Search engines\n", 76 | "* Ad targeting systems\n", 77 | "* Image classification or image search\n", 78 | "* Text classification\n", 79 | "* Question answering\n", 80 | "* Chatbots\n", 81 | "\n", 82 | "To build semantic matching systems, you need to compute vector representations of all items. These vector representations are often called embeddings. Embeddings are computed by using machine learning models, which are trained to learn an embedding space where similar examples are close while dissimilar ones are far apart. 
The closer two items are in the embedding space, the more similar they are.\n", 83 | "\n", 84 | "At a high level, semantic matching can be simplified into two critical steps:\n", 85 | "\n", 86 | "* Generate embedding representations of items.\n", 87 | "* Perform nearest neighbor searches on embeddings.\n", 88 | "\n", 89 | "### Objectives\n", 90 | "\n", 91 | "In this notebook, you will create a Vector Store using Vertex AI Matching Engine\n", 92 | "\n", 93 | "The steps performed include:\n", 94 | "\n", 95 | "- Installing the Python SDK \n", 96 | "- Create or initialize an existing matching engine index\n", 97 | " - Creating a new index can take 40-50 minutes\n", 98 | " - If you have already created an index and want to use this existing one, follow the instructions to initialize an existing index\n", 99 | " - Whilst creating a new index, consider proceeding to [GDELT DataOps](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb) notebook\n", 100 | "- Create the Vector Store with embedddings, leveraging the embeddings model with `textembedding-gecko@001`\n", 101 | " " 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "id": "96dda8e7", 107 | "metadata": {}, 108 | "source": [ 109 | "### Costs\n", 110 | "This tutorial uses billable components of Google Cloud:\n", 111 | "\n", 112 | "* Vertex AI Generative AI Studio\n", 113 | "* Vertex AI Matching Engine\n", 114 | "\n", 115 | "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),\n", 116 | "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", 117 | "to generate a cost estimate based on your projected usage." 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "id": "c63c9095", 123 | "metadata": {}, 124 | "source": [ 125 | "## Getting Started" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "id": "a5df5dc2", 131 | "metadata": {}, 132 | "source": [ 133 | "**Colab only:** Uncomment the following cell to restart the kernel. For Vertex AI Workbench you can restart the terminal using the button on top. " 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 2, 139 | "id": "ba34e308", 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "# # Automatically restart kernel after installs so that your environment can access the new packages\n", 144 | "# import IPython\n", 145 | "\n", 146 | "# app = IPython.Application.instance()\n", 147 | "# app.kernel.do_shutdown(True)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "id": "5943f6fc", 153 | "metadata": {}, 154 | "source": [ 155 | "### Authenticating your notebook environment\n", 156 | "* If you are using **Colab** to run this notebook, uncomment the cell below and continue.\n", 157 | "* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)." 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 3, 163 | "id": "51d84780", 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "# from google.colab import auth\n", 168 | "# auth.authenticate_user()" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "id": "b7f4d6ec-0e4d-40da-8a03-e7038bab7485", 174 | "metadata": {}, 175 | "source": [ 176 | "### Make sure you edit the values below\n", 177 | "Each time you run the notebook for the first time with new variables, you just need to edit the actor prefix and version variables below. 
They are needed to grab all the other variables in the notebook configuration." 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 2, 183 | "id": "b105ad1f-1b76-4551-a269-c31bc7b6da74", 184 | "metadata": {}, 185 | "outputs": [ 186 | { 187 | "name": "stdout", 188 | "output_type": "stream", 189 | "text": [ 190 | "ACTOR_PREFIX : ggl\n", 191 | "VERSION : v1\n" 192 | ] 193 | } 194 | ], 195 | "source": [ 196 | "# CREATE_NEW_ASSETS = True # True | False\n", 197 | "ACTOR_PREFIX = \"ggl\"\n", 198 | "VERSION = 'v1'\n", 199 | "\n", 200 | "# print(f\"CREATE_NEW_ASSETS : {CREATE_NEW_ASSETS}\")\n", 201 | "print(f\"ACTOR_PREFIX : {ACTOR_PREFIX}\")\n", 202 | "print(f\"VERSION : {VERSION}\")" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "id": "808f751b-e348-4357-a294-7bf4ba3a6ff5", 208 | "metadata": {}, 209 | "source": [ 210 | "### Load configuration settings from setup notebook\n", 211 | "Set the variables used in this notebook and load the config settings from the `00-env-setup.ipynb` notebook." 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 3, 217 | "id": "aef986cc-3211-4093-bce0-3bac431a07a1", 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "name": "stdout", 222 | "output_type": "stream", 223 | "text": [ 224 | "\n", 225 | "PROJECT_ID = \"wortz-project-352116\"\n", 226 | "PROJECT_NUM = \"679926387543\"\n", 227 | "LOCATION = \"us-central1\"\n", 228 | "\n", 229 | "REGION = \"us-central1\"\n", 230 | "BQ_LOCATION = \"US\"\n", 231 | "VPC_NETWORK_NAME = \"me-network\"\n", 232 | "\n", 233 | "CREATE_NEW_ASSETS = \"True\"\n", 234 | "ACTOR_PREFIX = \"ggl\"\n", 235 | "VERSION = \"v1\"\n", 236 | "ACTOR_NAME = \"google\"\n", 237 | "ACTOR_CATEGORY = \"technology\"\n", 238 | "\n", 239 | "BUCKET_NAME = \"zghost-ggl-v1-wortz-project-352116\"\n", 240 | "EMBEDDING_DIR_BUCKET = \"zghost-ggl-v1-wortz-project-352116-emd-dir\"\n", 241 | "\n", 242 | "BUCKET_URI = \"gs://zghost-ggl-v1-wortz-project-352116\"\n", 243 | "EMBEDDING_DIR_BUCKET_URI = \"gs://zghost-ggl-v1-wortz-project-352116-emd-dir\"\n", 244 | "\n", 245 | "VPC_NETWORK_FULL = \"projects/679926387543/global/networks/me-network\"\n", 246 | "\n", 247 | "ME_INDEX_NAME = \"vectorstore_ggl_v1\"\n", 248 | "ME_INDEX_ENDPOINT_NAME = \"vectorstore_ggl_v1_endpoint\"\n", 249 | "ME_DIMENSIONS = \"768\"\n", 250 | "\n", 251 | "MY_BQ_DATASET = \"zghost_ggl_v1\"\n", 252 | "MY_BQ_TRENDS_DATASET = \"zghost_ggl_v1_trends\"\n", 253 | "\n", 254 | "BUCKET_NAME : zghost-ggl-v1-wortz-project-352116\n", 255 | "BUCKET_URI : gs://zghost-ggl-v1-wortz-project-352116\n" 256 | ] 257 | } 258 | ], 259 | "source": [ 260 | "# staging GCS\n", 261 | "GCP_PROJECTS = !gcloud config get-value project\n", 262 | "PROJECT_ID = GCP_PROJECTS[0]\n", 263 | "\n", 264 | "BUCKET_NAME = f'zghost-{ACTOR_PREFIX}-{VERSION}-{PROJECT_ID}'\n", 265 | "BUCKET_URI = f'gs://{BUCKET_NAME}'\n", 266 | "\n", 267 | "config = !gsutil cat {BUCKET_URI}/config/notebook_env.py\n", 268 | "print(config.n)\n", 269 | "exec(config.n)\n", 270 | "\n", 271 | "print(f\"BUCKET_NAME : {BUCKET_NAME}\")\n", 272 | "print(f\"BUCKET_URI : {BUCKET_URI}\")" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "id": "0ed27576-f85a-4b5b-a54a-e4f61b30dd4e", 278 | "metadata": {}, 279 | "source": [ 280 | "### Import Packages" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 7, 286 | "id": "98bbd868-e768-44a0-bf7c-862201209616", 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "import sys\n", 291 | "import os\n", 292 | 
"sys.path.append(\"..\")\n", 293 | "# the following helper classes create and instantiate the matching engine resources\n", 294 | "from zeitghost.vertex.MatchingEngineCRUD import MatchingEngineCRUD\n", 295 | "from zeitghost.vertex.MatchingEngineVectorstore import MatchingEngineVectorStore\n", 296 | "from zeitghost.vertex.LLM import VertexLLM\n", 297 | "from zeitghost.vertex.Embeddings import VertexEmbeddings\n", 298 | "\n", 299 | "import uuid\n", 300 | "import time\n", 301 | "import numpy as np\n", 302 | "import json\n", 303 | "\n", 304 | "from google.cloud import aiplatform as vertex_ai\n", 305 | "from google.cloud import storage\n", 306 | "from google.cloud import bigquery" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 8, 312 | "id": "3944efc9-ee04-4b40-b4c4-64652beddf3c", 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "storage_client = storage.Client(project=PROJECT_ID)\n", 317 | "\n", 318 | "vertex_ai.init(project=PROJECT_ID,location=LOCATION)\n", 319 | "\n", 320 | "# bigquery client\n", 321 | "bqclient = bigquery.Client(\n", 322 | " project=PROJECT_ID,\n", 323 | " # location=LOCATION\n", 324 | ")" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "id": "aa97925b-22d6-457a-9e53-212de1ca3fdb", 330 | "metadata": {}, 331 | "source": [ 332 | "## Matching Engine Index: initialize existing or create a new one\n", 333 | "\n", 334 | "Validate access and bucket contents" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 9, 340 | "id": "96dcdd09-9e35-419a-8347-2de086a6500f", 341 | "metadata": {}, 342 | "outputs": [ 343 | { 344 | "name": "stdout", 345 | "output_type": "stream", 346 | "text": [ 347 | "gs://zghost-way-v1-wortz-project-352116-emd-dir/init_index/embeddings_0.json\n" 348 | ] 349 | } 350 | ], 351 | "source": [ 352 | "! gsutil ls $EMBEDDING_DIR_BUCKET_URI/init_index" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "id": "549d790f", 358 | "metadata": {}, 359 | "source": [ 360 | "Pass the required parameters that will be used to create the matching engine index" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": 10, 366 | "id": "dc7e096f-9784-4bbf-8512-bd3000db21d9", 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "mengine = MatchingEngineCRUD(\n", 371 | " project_id=PROJECT_ID \n", 372 | " , project_num=PROJECT_NUM\n", 373 | " , region=LOCATION \n", 374 | " , index_name=ME_INDEX_NAME\n", 375 | " , vpc_network_name=VPC_NETWORK_FULL\n", 376 | ")" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "id": "be9f7438-cf41-4c12-9785-f8a9a49bfde9", 382 | "metadata": {}, 383 | "source": [ 384 | "### Create or Initialize Existing Index\n", 385 | "\n", 386 | "Creating a Vertex Matching Engine index can take ~40-50 minutes due to the index compaction algorithm it uses to structure the index for high performance queries at scale. 
Read more about the [novel algorithm](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) proposed by Google Researchand the [official whitepaper](https://arxiv.org/abs/1908.10396)\n", 387 | "\n", 388 | "**Considering this setup time, proceed to Notebook `02-gdelt-data-ops.ipynb` to start extracting events and articles related to your actor**" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 11, 394 | "id": "f54160cb-3787-4608-9ede-ed2a6a89ee20", 395 | "metadata": {}, 396 | "outputs": [ 397 | { 398 | "name": "stderr", 399 | "output_type": "stream", 400 | "text": [ 401 | "INFO:root:Index vectorstore_way_v1 does not exists. Creating index ...\n", 402 | "INFO:root:Poll the operation to create index ...\n" 403 | ] 404 | }, 405 | { 406 | "name": "stdout", 407 | "output_type": "stream", 408 | "text": [ 409 | "........................................" 410 | ] 411 | }, 412 | { 413 | "name": "stderr", 414 | "output_type": "stream", 415 | "text": [ 416 | "\n", 417 | "KeyboardInterrupt\n", 418 | "\n" 419 | ] 420 | } 421 | ], 422 | "source": [ 423 | "start = time.time()\n", 424 | "# create ME index\n", 425 | "me_index = mengine.create_index(\n", 426 | " f\"{EMBEDDING_DIR_BUCKET_URI}/init_index\"\n", 427 | " , int(ME_DIMENSIONS)\n", 428 | ")\n", 429 | "\n", 430 | "end = time.time()\n", 431 | "print(f\"elapsed time: {end - start}\")\n", 432 | "\n", 433 | "if me_index:\n", 434 | " print(me_index.name)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "id": "8830a613-5510-4e7c-b821-50b90dbe1392", 440 | "metadata": {}, 441 | "source": [ 442 | "### Create or Initialize Index Endpoint\n", 443 | "Once your Matching Engine Index has been created, create an index endpoint where the Index will be deployed to " 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "id": "4d12ee68-5ea0-435d-968c-8b09c4576eb4", 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "start = time.time()\n", 454 | "\n", 455 | "index_endpoint=mengine.create_index_endpoint(\n", 456 | " endpoint_name=ME_INDEX_ENDPOINT_NAME\n", 457 | " , network=VPC_NETWORK_FULL\n", 458 | ")\n", 459 | "\n", 460 | "end = time.time()\n", 461 | "print(f\"elapsed time: {end - start}\")" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "id": "3aea30d7", 467 | "metadata": {}, 468 | "source": [ 469 | "Print out the detailed information about the index endpoint and VPC network where it is deployed, and any indexes that are already deployed to that endpoint" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "id": "dc517c3a-316f-46b4-ba07-e643a13c882f", 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "if index_endpoint:\n", 480 | " print(f\"Index endpoint resource name: {index_endpoint.name}\")\n", 481 | " print(f\"Index endpoint VPC network name: {index_endpoint.network}\")\n", 482 | " print(f\"Deployed indexes on the index endpoint:\")\n", 483 | " for d in index_endpoint.deployed_indexes:\n", 484 | " print(f\" {d.id}\")" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "id": "bfdd8f2d-cb15-4007-952d-e8a496da7652", 490 | "metadata": {}, 491 | "source": [ 492 | "### Deploy Index to Index Endpoint\n", 493 | "To interact with a matching engine index, you'll need to deploy it to an endpoint, where you can customize the underlying infrastructure behind the endpoint. For example, you can specify the scaling properties. 
" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": null, 499 | "id": "5fbe4a6f-833e-4632-9047-b770dd6521b3", 500 | "metadata": {}, 501 | "outputs": [], 502 | "source": [ 503 | "if CREATE_NEW_ASSETS == 'True':\n", 504 | " \n", 505 | " index_endpoint = mengine.deploy_index(\n", 506 | " index_name = ME_INDEX_NAME\n", 507 | " , endpoint_name = ME_INDEX_ENDPOINT_NAME\n", 508 | " , min_replica_count = 2\n", 509 | " , max_replica_count = 2\n", 510 | " )" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "id": "032d860d", 516 | "metadata": {}, 517 | "source": [ 518 | "Print out the information about the matching engine resources" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": null, 524 | "id": "42b0a04e-2d0f-4b15-b2c5-1551866697c4", 525 | "metadata": {}, 526 | "outputs": [], 527 | "source": [ 528 | "if index_endpoint:\n", 529 | " print(f\"Index endpoint resource name: {index_endpoint.name}\")\n", 530 | " print(f\"Index endpoint VPC network name: {index_endpoint.network}\")\n", 531 | " print(f\"Deployed indexes on the index endpoint:\")\n", 532 | " for d in index_endpoint.deployed_indexes:\n", 533 | " print(f\" {d.id}\")" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "id": "097dee06-6772-43a3-83c0-8ba3f94b7846", 539 | "metadata": {}, 540 | "source": [ 541 | "### Get Index and IndexEndpoint IDs\n", 542 | "Set the variable values and print out the resource details" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": null, 548 | "id": "7ffaf92a-bd91-482a-a51f-088429c1c277", 549 | "metadata": {}, 550 | "outputs": [], 551 | "source": [ 552 | "ME_INDEX_RESOURCE_NAME, ME_INDEX_ENDPOINT_ID = mengine.get_index_and_endpoint()\n", 553 | "ME_INDEX_ID=ME_INDEX_RESOURCE_NAME.split(\"/\")[5]\n", 554 | "\n", 555 | "print(f\"ME_INDEX_RESOURCE_NAME = {ME_INDEX_RESOURCE_NAME}\")\n", 556 | "print(f\"ME_INDEX_ENDPOINT_ID = {ME_INDEX_ENDPOINT_ID}\")\n", 557 | "print(f\"ME_INDEX_ID = {ME_INDEX_ID}\")" 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "id": "a785b0bd-a597-4284-b8e6-6ffb9d9bbe08", 563 | "metadata": {}, 564 | "source": [ 565 | "## Matching Engine Vector Store" 566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "id": "02c6333c-2368-4ed0-8103-e431450d08b4", 571 | "metadata": {}, 572 | "source": [ 573 | "### Define Vertex LLM & Embeddings\n", 574 | "The base class to create the various LLMs can be found in in the root repository - in zeitghost.vertex the `LLM.py` file" 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": null, 580 | "id": "381e0c0f-de69-4ffe-b039-6d233d4da80f", 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "llm = VertexLLM(\n", 585 | " stop=None \n", 586 | " , temperature=0.0\n", 587 | " , max_output_tokens=1000\n", 588 | " , top_p=0.7\n", 589 | " , top_k=40\n", 590 | ")\n", 591 | "\n", 592 | "# llm that can be used for a BigQuery agent, containing stopwords to prevent hallucinations and string parsing\n", 593 | "langchain_llm_for_bq = VertexLLM(\n", 594 | " stop=['Observation:'] \n", 595 | " , strip=True \n", 596 | " , temperature=0.0\n", 597 | " , max_output_tokens=1000\n", 598 | " , top_p=0.7\n", 599 | " , top_k=40\n", 600 | ")\n", 601 | "\n", 602 | "# llm that can be used for a pandas agent, containing stopwords to prevent hallucinations\n", 603 | "langchain_llm_for_pandas = VertexLLM(\n", 604 | " stop=['Observation:']\n", 605 | " , strip=False\n", 606 | " , temperature=0.0\n", 607 | " , max_output_tokens=1000\n", 
608 | " , top_p=0.7\n", 609 | " , top_k=40\n", 610 | ")" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "id": "029adebc", 616 | "metadata": {}, 617 | "source": [ 618 | "Let's ping the language model to ensure we are getting an expected response" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": null, 624 | "id": "c38b5ef5-4ffb-4594-807a-b6c4717a53d0", 625 | "metadata": {}, 626 | "outputs": [], 627 | "source": [ 628 | "# llm('how are you doing today?')\n", 629 | "llm('In no more than 50 words, what can you tell me about the band Widespread Panic?')" 630 | ] 631 | }, 632 | { 633 | "cell_type": "markdown", 634 | "id": "1feb1e91", 635 | "metadata": {}, 636 | "source": [ 637 | "Now let's call the VertexEmbeddings class which helps us get document embeddings using the [Vertex AI Embeddings model](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). Make sure that your REQUESTS_PER_MINUTE does not exceed your project quota." 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "id": "660ece82-5e45-476c-a854-2d2aba646529", 644 | "metadata": {}, 645 | "outputs": [], 646 | "source": [ 647 | "from zeitghost.vertex.Embeddings import VertexEmbeddings\n", 648 | "\n", 649 | "REQUESTS_PER_MINUTE = 299 # example project quota==300\n", 650 | "vertex_embedding = VertexEmbeddings(requests_per_minute=REQUESTS_PER_MINUTE)" 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "id": "f18fffa5-ad26-43e0-8af8-b023ff8aeae8", 656 | "metadata": {}, 657 | "source": [ 658 | "## Initialize Matching Engine Vector Store\n", 659 | "Finally, to interact with the matching engine instance initialize it with everything that you have created" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": null, 665 | "id": "df037bcd-5bff-417c-988e-6ab4806acb86", 666 | "metadata": {}, 667 | "outputs": [], 668 | "source": [ 669 | "# initialize vector store\n", 670 | "me = MatchingEngineVectorStore.from_components(\n", 671 | " project_id=PROJECT_ID\n", 672 | " # , project_num=PROJECT_NUM\n", 673 | " , region=LOCATION\n", 674 | " , gcs_bucket_name=EMBEDDING_DIR_BUCKET_URI\n", 675 | " , embedding=vertex_embedding\n", 676 | " , index_id=ME_INDEX_ID\n", 677 | " , endpoint_id=ME_INDEX_ENDPOINT_ID\n", 678 | ")" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "id": "7bfbda36", 684 | "metadata": {}, 685 | "source": [ 686 | "Validate that you have created the vector store with the Vertex embeddings" 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": null, 692 | "id": "5f984ea1-040b-4e2d-b4e1-401721288228", 693 | "metadata": {}, 694 | "outputs": [], 695 | "source": [ 696 | "me.embedding" 697 | ] 698 | } 699 | ], 700 | "metadata": { 701 | "environment": { 702 | "kernel": "python3", 703 | "name": "tf2-gpu.2-6.m108", 704 | "type": "gcloud", 705 | "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-6:m108" 706 | }, 707 | "kernelspec": { 708 | "display_name": "Python 3 (ipykernel)", 709 | "language": "python", 710 | "name": "python3" 711 | }, 712 | "language_info": { 713 | "codemirror_mode": { 714 | "name": "ipython", 715 | "version": 3 716 | }, 717 | "file_extension": ".py", 718 | "mimetype": "text/x-python", 719 | "name": "python", 720 | "nbconvert_exporter": "python", 721 | "pygments_lexer": "ipython3", 722 | "version": "3.9.16" 723 | } 724 | }, 725 | "nbformat": 4, 726 | "nbformat_minor": 5 727 | } 728 | 
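The notebook above ends by constructing `me`, the Matching Engine vector store; the heavy lifting of loading documents happens later in `03-vector-store-index-loader.ipynb`. For orientation, here is a minimal sketch of exercising the store, assuming the repo's custom `MatchingEngineVectorStore` follows LangChain's standard `VectorStore` interface (`add_texts` / `similarity_search`); the sample texts and query are illustrative.

```python
# Assumption: MatchingEngineVectorStore implements LangChain's VectorStore interface.
sample_docs = [
    "Google announced new generative AI features for Cloud customers.",
    "Regulators opened a review of competition in the cloud market.",
]

# Embeds the texts with the Vertex embeddings client and streams the vectors
# into the deployed Matching Engine index.
me.add_texts(texts=sample_docs)

# Approximate nearest neighbor search against the deployed index.
for doc in me.similarity_search("What has Google announced recently?", k=2):
    print(doc.page_content)
```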
-------------------------------------------------------------------------------- /notebooks/07-streamlit-ui-plan-and-execute.py: -------------------------------------------------------------------------------- 1 | from os import system 2 | from pathlib import Path 3 | 4 | import sys 5 | import os 6 | sys.path.append("..") 7 | 8 | import streamlit as st 9 | from langchain import SQLDatabase 10 | from langchain.agents import AgentType 11 | from langchain.agents import initialize_agent, Tool 12 | from langchain.callbacks import StreamlitCallbackHandler 13 | from langchain.chains import LLMMathChain, SQLDatabaseChain 14 | from langchain.utilities import DuckDuckGoSearchAPIWrapper 15 | from langchain.llms import VertexAI 16 | from langchain.embeddings import VertexAIEmbeddings 17 | 18 | 19 | 20 | 21 | from zeitghost.agents.LangchainAgent import LangchainAgent 22 | 23 | from streamlit_agent.callbacks.capturing_callback_handler import playback_callbacks 24 | from streamlit_agent.clear_results import with_clear_container 25 | 26 | st.set_page_config( 27 | page_title="Google Langchain Agents", page_icon="🦜", layout="wide", initial_sidebar_state="collapsed" 28 | ) 29 | "# 🦜🔗 Langchain for Google Palm" 30 | 31 | ACTOR_PREFIX = "ggl" 32 | VERSION = 'v1' 33 | PROJECT_ID = 'cpg-cdp' 34 | BUCKET_NAME = f'zghost-{ACTOR_PREFIX}-{VERSION}-{PROJECT_ID}' 35 | BUCKET_URI = f'gs://{BUCKET_NAME}' 36 | 37 | 38 | ###HARDCODED VALUES BELOW - TODO UPDATE LATER 39 | 40 | PROJECT_ID = "cpg-cdp" 41 | PROJECT_NUM = "939655404703" 42 | LOCATION = "us-central1" 43 | REGION = "us-central1" 44 | BQ_LOCATION = "US" 45 | VPC_NETWORK_NAME = "genai-haystack-vpc" 46 | CREATE_NEW_ASSETS = "True" 47 | VERSION = "v1" 48 | ACTOR_NAME = "google" 49 | ACTOR_CATEGORY = "technology" 50 | BUCKET_NAME = "zghost-ggl-v1-cpg-cdp" 51 | EMBEDDING_DIR_BUCKET = "zghost-ggl-v1-cpg-cdp-emd-dir" 52 | BUCKET_URI = "gs://zghost-ggl-v1-cpg-cdp" 53 | EMBEDDING_DIR_BUCKET_URI = "gs://zghost-ggl-v1-cpg-cdp-emd-dir" 54 | VPC_NETWORK_FULL = "projects/939655404703/global/networks/me-network" 55 | ME_INDEX_NAME = "vectorstore_ggl_v1" 56 | ME_INDEX_ENDPOINT_NAME = "vectorstore_ggl_v1_endpoint" 57 | ME_DIMENSIONS = "768" 58 | MY_BQ_DATASET = "zghost_ggl_v1" 59 | MY_BQ_TRENDS_DATASET = "zghost_ggl_v1_trends" 60 | 61 | 62 | #TODO - this works fine from a notebook but getting UNKNOWN errors when trying to access ME from a signed-in env (for user) 63 | # from zeitghost.vertex.Embeddings import VertexEmbeddings 64 | 65 | from zeitghost.vertex.MatchingEngineCRUD import MatchingEngineCRUD 66 | from zeitghost.vertex.MatchingEngineVectorstore import MatchingEngineVectorStore 67 | 68 | # Google Cloud 69 | # from google.cloud import aiplatform as vertex_ai 70 | # from google.cloud import storage 71 | # from google.cloud import bigquery 72 | 73 | 74 | #Instantiate Google cloud SDK clients 75 | # storage_client = storage.Client(project=PROJECT_ID) 76 | 77 | ## Instantiate the Vertex AI resources, Agents, and Tools 78 | mengine = MatchingEngineCRUD( 79 | project_id=PROJECT_ID 80 | , project_num=PROJECT_NUM 81 | , region=LOCATION 82 | , index_name=ME_INDEX_NAME 83 | , vpc_network_name=VPC_NETWORK_FULL 84 | ) 85 | 86 | ME_INDEX_RESOURCE_NAME, ME_INDEX_ENDPOINT_ID = mengine.get_index_and_endpoint() 87 | ME_INDEX_ID=ME_INDEX_RESOURCE_NAME.split("/")[5] 88 | 89 | 90 | REQUESTS_PER_MINUTE = 200 # project quota==300 91 | vertex_embedding = VertexAIEmbeddings(requests_per_minute=REQUESTS_PER_MINUTE) 92 | 93 | 94 | me = MatchingEngineVectorStore.from_components( 95 | 
project_id=PROJECT_ID 96 | , region=LOCATION 97 | , gcs_bucket_name=BUCKET_NAME 98 | , embedding=vertex_embedding 99 | , index_id=ME_INDEX_ID 100 | , endpoint_id=ME_INDEX_ENDPOINT_ID 101 | , k = 10 102 | ) 103 | 104 | 105 | ## Create VectorStore Agent tool 106 | 107 | vertex_langchain_agent = LangchainAgent() 108 | 109 | vectorstore_agent = vertex_langchain_agent.get_vectorstore_agent( 110 | vectorstore=me 111 | , vectorstore_name=f"news on {ACTOR_NAME}" 112 | , vectorstore_description=f"a vectorstore containing news articles and current events for {ACTOR_NAME}." 113 | ) 114 | 115 | ## BigQuery Agent 116 | 117 | 118 | vertex_langchain_agent = LangchainAgent() 119 | bq_agent = vertex_langchain_agent.get_bigquery_agent(PROJECT_ID) 120 | 121 | 122 | bq_agent_tools = bq_agent.tools 123 | 124 | bq_agent_tools[0].description = bq_agent_tools[0].description + \ 125 | f""" 126 | only use the schema {MY_BQ_TRENDS_DATASET} 127 | NOTE YOU CANNOT DO OPERATIONS ON AN AGGREGATED FIELD UNLESS IT IS IN A CTE WHICH IS ALLOWED 128 | also - use a like operator for the term field e.g. WHERE term LIKE '%keyword%' 129 | make sure to lower case the term in the WHERE clause 130 | be sure to LIMIT 100 for all queries 131 | if you don't have a LIMIT 100, there will be problems 132 | """ 133 | 134 | 135 | ## Build an Agent that has access to Multiple Tools 136 | 137 | llm = VertexAI() 138 | 139 | dataset = 'google_trends_my_project' 140 | 141 | db = SQLDatabase.from_uri(f"bigquery://{PROJECT_ID}/{dataset}") 142 | 143 | llm_math_chain = LLMMathChain.from_llm(llm=llm, verbose=True) 144 | 145 | me_tools = vectorstore_agent.tools 146 | 147 | search = DuckDuckGoSearchAPIWrapper() 148 | 149 | 150 | tools = [ 151 | Tool( 152 | name="Calculator", 153 | func=llm_math_chain.run, 154 | description="useful for when you need to answer questions about math", 155 | ), 156 | Tool( 157 | name="Search", 158 | func=search.run, 159 | description="useful for when you need to answer questions about current events. You should ask targeted questions", 160 | ), 161 | ] 162 | 163 | 164 | # tools.extend(me_tools) #TODO - this is not working on a local macbook; may work on cloudtop or other config 165 | tools.extend(bq_agent_tools) 166 | 167 | # Run the streamlit app 168 | 169 | # what are the unique terms in the top_rising_terms table?
170 | 171 | enable_custom = True 172 | # Initialize agent 173 | mrkl = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True) 174 | 175 | with st.form(key="form"): 176 | user_input = "" 177 | user_input = st.text_input("Ask your question") 178 | submit_clicked = st.form_submit_button("Submit Question") 179 | 180 | 181 | output_container = st.empty() 182 | if with_clear_container(submit_clicked): 183 | output_container = output_container.container() 184 | output_container.chat_message("user").write(user_input) 185 | answer_container = output_container.chat_message("assistant", avatar="🦜") 186 | st_callback = StreamlitCallbackHandler(answer_container) 187 | answer = mrkl.run(user_input, callbacks=[st_callback]) 188 | answer_container.write(answer) 189 | 190 | 191 | "#### Here's some info on the tools in this agent: " 192 | for t in tools: 193 | st.write(t.name) 194 | st.write(t.description) 195 | st.write('\n') 196 | 197 | -------------------------------------------------------------------------------- /notebooks/imgs/add-gdelt-to-bq.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/add-gdelt-to-bq.gif -------------------------------------------------------------------------------- /notebooks/imgs/agent_plan_execute_chain_output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/agent_plan_execute_chain_output.png -------------------------------------------------------------------------------- /notebooks/imgs/architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/architecture.png -------------------------------------------------------------------------------- /notebooks/imgs/chunk_bq_tables_flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/chunk_bq_tables_flow.png -------------------------------------------------------------------------------- /notebooks/imgs/chunk_gcs_blobs_flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/chunk_gcs_blobs_flow.png -------------------------------------------------------------------------------- /notebooks/imgs/chunk_youtube_flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/chunk_youtube_flow.png -------------------------------------------------------------------------------- /notebooks/imgs/da-ui.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/da-ui.png -------------------------------------------------------------------------------- 
/notebooks/imgs/deep-retriever.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/deep-retriever.png -------------------------------------------------------------------------------- /notebooks/imgs/fullarchitecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/fullarchitecture.png -------------------------------------------------------------------------------- /notebooks/imgs/google-trends-explore.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/google-trends-explore.png -------------------------------------------------------------------------------- /notebooks/imgs/info-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/info-architecture.png -------------------------------------------------------------------------------- /notebooks/imgs/langchain-diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/langchain-diagram.png -------------------------------------------------------------------------------- /notebooks/imgs/langchain-overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/langchain-overview.png -------------------------------------------------------------------------------- /notebooks/imgs/langchain_intro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/langchain_intro.png -------------------------------------------------------------------------------- /notebooks/imgs/pipeline-complete.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/pipeline-complete.png -------------------------------------------------------------------------------- /notebooks/imgs/pipeline_metadata.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/pipeline_metadata.png -------------------------------------------------------------------------------- /notebooks/imgs/plan-execute-example-output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/plan-execute-example-output.png 
-------------------------------------------------------------------------------- /notebooks/imgs/public_trends_data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/public_trends_data.png -------------------------------------------------------------------------------- /notebooks/imgs/user-flow-plan-execute.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/user-flow-plan-execute.png -------------------------------------------------------------------------------- /notebooks/imgs/zghost_overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview.png -------------------------------------------------------------------------------- /notebooks/imgs/zghost_overview_ME.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_ME.png -------------------------------------------------------------------------------- /notebooks/imgs/zghost_overview_agents.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_agents.png -------------------------------------------------------------------------------- /notebooks/imgs/zghost_overview_gdelt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_gdelt.png -------------------------------------------------------------------------------- /notebooks/imgs/zghost_overview_load_index.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_load_index.png -------------------------------------------------------------------------------- /notebooks/imgs/zghost_overview_pipeline_steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_pipeline_steps.png -------------------------------------------------------------------------------- /notebooks/requirements.txt: -------------------------------------------------------------------------------- 1 | google-api-core==2.10.0 2 | google-cloud-resource-manager 3 | google-cloud-core 4 | google-cloud-documentai 5 | google-cloud-storage 6 | google-cloud-secret-manager 7 | google-cloud-bigquery 8 | google-cloud-aiplatform==1.25.0 9 | protobuf==3.20.3 10 | oauth2client==3.0.0 11 | pydantic==1.10.9 12 | pypdf 13 | gcsfs 14 | langchain 15 | newspaper3k 16 | python-decouple 17 | numpy 18 | scipy 19 | pandas 20 | nltk 21 | flask 22 | 
flask-restx 23 | db-dtypes 24 | gunicorn 25 | pystan 26 | lunarcalendar 27 | convertdate 28 | pexpect 29 | pandas-gbq 30 | pytube 31 | celery 32 | redis 33 | pybigquery 34 | kfp 35 | youtube-transcript-api -------------------------------------------------------------------------------- /streamlit_agent/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/streamlit_agent/__init__.py -------------------------------------------------------------------------------- /streamlit_agent/callbacks/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/streamlit_agent/callbacks/__init__.py -------------------------------------------------------------------------------- /streamlit_agent/callbacks/capturing_callback_handler.py: -------------------------------------------------------------------------------- 1 | """Callback Handler captures all callbacks in a session for future offline playback.""" 2 | 3 | from __future__ import annotations 4 | 5 | import pickle 6 | import time 7 | from typing import Any, TypedDict 8 | 9 | from langchain.callbacks.base import BaseCallbackHandler 10 | 11 | 12 | # This is intentionally not an enum so that we avoid serializing a 13 | # custom class with pickle. 14 | class CallbackType: 15 | ON_LLM_START = "on_llm_start" 16 | ON_LLM_NEW_TOKEN = "on_llm_new_token" 17 | ON_LLM_END = "on_llm_end" 18 | ON_LLM_ERROR = "on_llm_error" 19 | ON_TOOL_START = "on_tool_start" 20 | ON_TOOL_END = "on_tool_end" 21 | ON_TOOL_ERROR = "on_tool_error" 22 | ON_TEXT = "on_text" 23 | ON_CHAIN_START = "on_chain_start" 24 | ON_CHAIN_END = "on_chain_end" 25 | ON_CHAIN_ERROR = "on_chain_error" 26 | ON_AGENT_ACTION = "on_agent_action" 27 | ON_AGENT_FINISH = "on_agent_finish" 28 | 29 | 30 | # We use TypedDict, rather than NamedTuple, so that we avoid serializing a 31 | # custom class with pickle. All of this class's members should be basic Python types. 32 | class CallbackRecord(TypedDict): 33 | callback_type: str 34 | args: tuple[Any, ...] 
35 | kwargs: dict[str, Any] 36 | time_delta: float # Number of seconds between this record and the previous one 37 | 38 | 39 | def load_records_from_file(path: str) -> list[CallbackRecord]: 40 | """Load the list of CallbackRecords from a pickle file at the given path.""" 41 | with open(path, "rb") as file: 42 | records = pickle.load(file) 43 | 44 | if not isinstance(records, list): 45 | raise RuntimeError(f"Bad CallbackRecord data in {path}") 46 | return records 47 | 48 | 49 | def playback_callbacks( 50 | handlers: list[BaseCallbackHandler], 51 | records_or_filename: list[CallbackRecord] | str, 52 | max_pause_time: float, 53 | ) -> str: 54 | if isinstance(records_or_filename, list): 55 | records = records_or_filename 56 | else: 57 | records = load_records_from_file(records_or_filename) 58 | 59 | for record in records: 60 | pause_time = min(record["time_delta"], max_pause_time) 61 | if pause_time > 0: 62 | time.sleep(pause_time) 63 | 64 | for handler in handlers: 65 | if record["callback_type"] == CallbackType.ON_LLM_START: 66 | handler.on_llm_start(*record["args"], **record["kwargs"]) 67 | elif record["callback_type"] == CallbackType.ON_LLM_NEW_TOKEN: 68 | handler.on_llm_new_token(*record["args"], **record["kwargs"]) 69 | elif record["callback_type"] == CallbackType.ON_LLM_END: 70 | handler.on_llm_end(*record["args"], **record["kwargs"]) 71 | elif record["callback_type"] == CallbackType.ON_LLM_ERROR: 72 | handler.on_llm_error(*record["args"], **record["kwargs"]) 73 | elif record["callback_type"] == CallbackType.ON_TOOL_START: 74 | handler.on_tool_start(*record["args"], **record["kwargs"]) 75 | elif record["callback_type"] == CallbackType.ON_TOOL_END: 76 | handler.on_tool_end(*record["args"], **record["kwargs"]) 77 | elif record["callback_type"] == CallbackType.ON_TOOL_ERROR: 78 | handler.on_tool_error(*record["args"], **record["kwargs"]) 79 | elif record["callback_type"] == CallbackType.ON_TEXT: 80 | handler.on_text(*record["args"], **record["kwargs"]) 81 | elif record["callback_type"] == CallbackType.ON_CHAIN_START: 82 | handler.on_chain_start(*record["args"], **record["kwargs"]) 83 | elif record["callback_type"] == CallbackType.ON_CHAIN_END: 84 | handler.on_chain_end(*record["args"], **record["kwargs"]) 85 | elif record["callback_type"] == CallbackType.ON_CHAIN_ERROR: 86 | handler.on_chain_error(*record["args"], **record["kwargs"]) 87 | elif record["callback_type"] == CallbackType.ON_AGENT_ACTION: 88 | handler.on_agent_action(*record["args"], **record["kwargs"]) 89 | elif record["callback_type"] == CallbackType.ON_AGENT_FINISH: 90 | handler.on_agent_finish(*record["args"], **record["kwargs"]) 91 | 92 | # Return the agent's result 93 | for record in records: 94 | if record["callback_type"] == CallbackType.ON_AGENT_FINISH: 95 | return record["args"][0][0]["output"] 96 | 97 | return "[Missing Agent Result]" 98 | 99 | 100 | class CapturingCallbackHandler(BaseCallbackHandler): 101 | def __init__(self) -> None: 102 | self._records: list[CallbackRecord] = [] 103 | self._last_time: float | None = None 104 | 105 | def dump_records_to_file(self, path: str) -> None: 106 | """Write the list of CallbackRecords to a pickle file at the given path.""" 107 | with open(path, "wb") as file: 108 | pickle.dump(self._records, file) 109 | 110 | def _append_record( 111 | self, type: str, args: tuple[Any, ...], kwargs: dict[str, Any] 112 | ) -> None: 113 | time_now = time.time() 114 | time_delta = time_now - self._last_time if self._last_time is not None else 0 115 | self._last_time = time_now 116 | 
self._records.append( 117 | CallbackRecord( 118 | callback_type=type, args=args, kwargs=kwargs, time_delta=time_delta 119 | ) 120 | ) 121 | 122 | def on_llm_start(self, *args: Any, **kwargs: Any) -> None: 123 | self._append_record(CallbackType.ON_LLM_START, args, kwargs) 124 | 125 | def on_llm_new_token(self, *args: Any, **kwargs: Any) -> None: 126 | self._append_record(CallbackType.ON_LLM_NEW_TOKEN, args, kwargs) 127 | 128 | def on_llm_end(self, *args: Any, **kwargs: Any) -> None: 129 | self._append_record(CallbackType.ON_LLM_END, args, kwargs) 130 | 131 | def on_llm_error(self, *args: Any, **kwargs: Any) -> None: 132 | self._append_record(CallbackType.ON_LLM_ERROR, args, kwargs) 133 | 134 | def on_tool_start(self, *args: Any, **kwargs: Any) -> None: 135 | self._append_record(CallbackType.ON_TOOL_START, args, kwargs) 136 | 137 | def on_tool_end(self, *args: Any, **kwargs: Any) -> None: 138 | self._append_record(CallbackType.ON_TOOL_END, args, kwargs) 139 | 140 | def on_tool_error(self, *args: Any, **kwargs: Any) -> None: 141 | self._append_record(CallbackType.ON_TOOL_ERROR, args, kwargs) 142 | 143 | def on_text(self, *args: Any, **kwargs: Any) -> None: 144 | self._append_record(CallbackType.ON_TEXT, args, kwargs) 145 | 146 | def on_chain_start(self, *args: Any, **kwargs: Any) -> None: 147 | self._append_record(CallbackType.ON_CHAIN_START, args, kwargs) 148 | 149 | def on_chain_end(self, *args: Any, **kwargs: Any) -> None: 150 | self._append_record(CallbackType.ON_CHAIN_END, args, kwargs) 151 | 152 | def on_chain_error(self, *args: Any, **kwargs: Any) -> None: 153 | self._append_record(CallbackType.ON_CHAIN_ERROR, args, kwargs) 154 | 155 | def on_agent_action(self, *args: Any, **kwargs: Any) -> Any: 156 | self._append_record(CallbackType.ON_AGENT_ACTION, args, kwargs) 157 | 158 | def on_agent_finish(self, *args: Any, **kwargs: Any) -> None: 159 | self._append_record(CallbackType.ON_AGENT_FINISH, args, kwargs) 160 | -------------------------------------------------------------------------------- /streamlit_agent/clear_results.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | 3 | 4 | # A hack to "clear" the previous result when submitting a new prompt. This avoids 5 | # the "previous run's text is grayed-out but visible during rerun" Streamlit behavior. 
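# A minimal usage sketch, mirroring how notebooks/07-streamlit-ui-plan-and-execute.py
# calls this helper (the surrounding Streamlit form code is assumed, not shown here):
#
#     submit_clicked = st.form_submit_button("Submit Question")
#     if with_clear_container(submit_clicked):
#         # a new submission is ready and any stale output from the previous
#         # run has been cleared, so render the fresh answer here
#         ...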
6 | class DirtyState: 7 | NOT_DIRTY = "NOT_DIRTY" 8 | DIRTY = "DIRTY" 9 | UNHANDLED_SUBMIT = "UNHANDLED_SUBMIT" 10 | 11 | 12 | def get_dirty_state() -> str: 13 | return st.session_state.get("dirty_state", DirtyState.NOT_DIRTY) 14 | 15 | 16 | def set_dirty_state(state: str) -> None: 17 | st.session_state["dirty_state"] = state 18 | 19 | 20 | def with_clear_container(submit_clicked: bool) -> bool: 21 | if get_dirty_state() == DirtyState.DIRTY: 22 | if submit_clicked: 23 | set_dirty_state(DirtyState.UNHANDLED_SUBMIT) 24 | st.experimental_rerun() 25 | else: 26 | set_dirty_state(DirtyState.NOT_DIRTY) 27 | 28 | if submit_clicked or get_dirty_state() == DirtyState.UNHANDLED_SUBMIT: 29 | set_dirty_state(DirtyState.DIRTY) 30 | return True 31 | 32 | return False 33 | -------------------------------------------------------------------------------- /zeitghost/.dockerignore: -------------------------------------------------------------------------------- 1 | Dockerfile 2 | README.md 3 | *.pyc 4 | *.pyo 5 | *.pyd 6 | __pycache__ 7 | .pytest_cache -------------------------------------------------------------------------------- /zeitghost/.env: -------------------------------------------------------------------------------- 1 | PROJECT_ID='cpg-cdp' 2 | PROJECT_NUM='939655404703' 3 | LOCATION='us-central1' 4 | DATASET_ID='genai_cap_v1' 5 | TABLE_NAME='estee_lauder_1_mentions' 6 | -------------------------------------------------------------------------------- /zeitghost/.idea/.gitignore: -------------------------------------------------------------------------------- 1 | # Default ignored files 2 | /shelf/ 3 | /workspace.xml 4 | # Editor-based HTTP Client requests 5 | /httpRequests/ 6 | # Datasource local storage ignored files 7 | /dataSources/ 8 | /dataSources.local.xml 9 | # Zeppelin ignored files 10 | /ZeppelinRemoteNotebooks/ 11 | -------------------------------------------------------------------------------- /zeitghost/.idea/codeStyles/Project.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 10 | 766 | -------------------------------------------------------------------------------- /zeitghost/.idea/codeStyles/codeStyleConfig.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | -------------------------------------------------------------------------------- /zeitghost/.idea/misc.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | -------------------------------------------------------------------------------- /zeitghost/.idea/modules.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /zeitghost/.idea/vcs.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 36 | 37 | -------------------------------------------------------------------------------- /zeitghost/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/__init__.py -------------------------------------------------------------------------------- /zeitghost/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/__pycache__/main.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/__pycache__/main.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/agents/Helpers.py: -------------------------------------------------------------------------------- 1 | from typing import Dict 2 | from typing import List 3 | from typing import Union 4 | from langchain import PromptTemplate 5 | from langchain.callbacks.base import BaseCallbackHandler 6 | from typing import Any 7 | from langchain.schema import AgentAction 8 | from langchain.schema import AgentFinish 9 | from langchain.schema import LLMResult 10 | from zeitghost.vertex.LLM import VertexLLM 11 | import time 12 | import logging 13 | 14 | 15 | QPS = 600 16 | 17 | core_template = """Question: {question} 18 | 19 | Answer: """ 20 | 21 | core_prompt = PromptTemplate( 22 | template=core_template 23 | , input_variables=['question'] 24 | ) 25 | 26 | vector_template = """ 27 | Question: Use [{name}]: 28 | {question} 29 | 30 | Answer: """ 31 | 32 | vector_prompt = PromptTemplate( 33 | template=vector_template 34 | , input_variables=['name', 'question'] 35 | ) 36 | 37 | bq_template = """{prompt} in {table} from this table of search term volume on google.com 38 | - do not download the entire table 39 | - do not ORDER BY or GROUP BY count(*) 40 | - the datetime field is called date_field 41 | """ 42 | bq_prompt = PromptTemplate( 43 | template=bq_template 44 | , input_variables=['prompt', 'table'] 45 | ) 46 | 47 | BQ_PREFIX = """ 48 | LIMIT TO ONLY 100 ROWS - e.g. LIMIT 100 49 | REMOVE all observation output that has any special characters , or \n 50 | you are a helpful agent that knows how to use bigquery 51 | you are using sqlalchemy {dialect} 52 | Check the table schemas before constructing sql 53 | Only use the information returned by the below tools to construct your final answer.\nYou MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again. 54 | DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.\n\n 55 | REMOVE all observation output that has any special characters , or \n 56 | you are a helpful agent that knows how to use bigquery 57 | READ THE SCHEMA BEFORE YOU WRITE QUERIES 58 | DOUBLE CHECK YOUR QUERY LOGIC 59 | you are using sqlalchemy for Big Query 60 | ALL QUERIES MUST HAVE LIMIT 100 at the end of them 61 | Check the table schemas before constructing sql 62 | Only use the information returned by the below tools to construct your final answer.\nYou MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again. 63 | DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.\n\n 64 | If you don't use a where statement in your SQL - there will be problems. 
65 | To get hints on the field contents, consider a select distinct - I don't care about a where statement given there is low cardinality in the data set 66 | make sure you prepend the table name with the schema: eg: schema.tablename 67 | MAKE SURE the FROM statement includes the schema like so: schema.tablename 68 | THERE MUST BE A WHERE CLAUSE IN THIS BECAUSE YOU DON'T HAVE ENOUGH MEMORY TO STORE LOCAL RESULTS 69 | do not use the same action as you did in any prior step 70 | MAKE SURE YOU DO NOT REPEAT THOUGHTS - if a thought is the same as a prior thought in the chain, come up with another one 71 | """ 72 | 73 | bq_agent_llm = VertexLLM(stop=['Observation:'], #in this case, we are stopping on Observation to avoid hallucinations with the BigQuery agent 74 | strip=True, #this strips out special characters for the BQ agent 75 | temperature=0.0, 76 | max_output_tokens=1000, 77 | top_p=0.7, 78 | top_k=40, 79 | ) 80 | 81 | pandas_agent_llm = VertexLLM(stop=['Observation:'], #in this case, we are stopping on Observation to avoid hallucinations with the pandas agent 82 | strip=False, #no special-character stripping for the pandas agent 83 | temperature=0.0, 84 | max_output_tokens=1000, 85 | top_p=0.7, 86 | top_k=40, 87 | ) 88 | 89 | vectorstore_agent_llm = VertexLLM(stop=['Observation:'], #in this case, we are stopping on Observation to avoid hallucinations with the vectorstore agent 90 | strip=False, #no special-character stripping for the vectorstore agent 91 | temperature=0.0, 92 | max_output_tokens=1000, 93 | top_p=0.7, 94 | top_k=40, 95 | ) 96 | 97 | 98 | base_llm = VertexLLM(stop=None, #no stop sequence for the base LLM 99 | temperature=0.0, 100 | max_output_tokens=1000, 101 | top_p=0.7, 102 | top_k=40 103 | ) 104 | 105 | 106 | class MyCustomHandler(BaseCallbackHandler): 107 | def rate_limit(self): 108 | time.sleep(1/QPS) 109 | 110 | def on_llm_start(self, serialized: Dict[str, Any], prompts: List[str], 111 | **kwargs: Any) -> Any: 112 | pass 113 | 114 | def on_llm_new_token(self, token: str, **kwargs: Any) -> Any: 115 | self.rate_limit() 116 | pass 117 | 118 | def on_llm_end(self, response: LLMResult, **kwargs: Any) -> Any: 119 | pass 120 | 121 | def on_llm_error(self, error: Union[Exception, KeyboardInterrupt], 122 | **kwargs: Any) -> Any: 123 | pass 124 | 125 | def on_chain_start(self, serialized: Dict[str, Any], inputs: Dict[str, Any], 126 | **kwargs: Any) -> Any: 127 | logging.info(serialized) 128 | pass 129 | 130 | def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> Any: 131 | pass 132 | 133 | def on_chain_error(self, error: Union[Exception, KeyboardInterrupt], 134 | **kwargs: Any) -> Any: 135 | pass 136 | 137 | def on_tool_start(self, serialized: Dict[str, Any], input_str: str, 138 | **kwargs: Any) -> Any: 139 | logging.info(serialized) 140 | pass 141 | 142 | def on_tool_end(self, output: str, **kwargs: Any) -> Any: 143 | pass 144 | 145 | def on_tool_error(self, error: Union[Exception, KeyboardInterrupt], 146 | **kwargs: Any) -> Any: 147 | pass 148 | 149 | def on_agent_action(self, action: AgentAction, **kwargs: Any) -> Any: 150 | logging.info(action) 151 | pass 152 | 153 | def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> Any: 154 | pass 155 | 156 | def on_text(self, text: str, **kwargs: Any) -> Any: 157 | """Run on arbitrary text.""" 158 | # return str(text[:4000]) #character limiter 159 | # self.rate_limit() 160 | --------------------------------------------------------------------------------
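For orientation between these two modules: Helpers.py above supplies the prompt templates, the pre-configured VertexLLM instances, and MyCustomHandler, whose rate_limit() sleeps 1/QPS seconds per streamed token (1/600 s, roughly 1.7 ms, at QPS = 600); LangchainAgent.py below wraps that handler in a CallbackManager and passes it to every agent it builds. A minimal sketch of calling these helpers directly, assuming the zeitghost package is importable and using an illustrative question:

from langchain.chains import LLMChain
from zeitghost.agents.Helpers import core_prompt, base_llm

# core_prompt exposes a single 'question' input variable, so run() takes one string
chain = LLMChain(prompt=core_prompt, llm=base_llm)
print(chain.run(question="What does the GoldsteinScale field in GDELT measure?"))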
/zeitghost/agents/LangchainAgent.py: -------------------------------------------------------------------------------- 1 | #from langchain import LLMChain 2 | from langchain.chains import LLMChain 3 | from langchain.agents import create_sql_agent 4 | from langchain.agents.agent_toolkits import SQLDatabaseToolkit 5 | from langchain.agents import create_pandas_dataframe_agent 6 | from langchain.agents import create_vectorstore_agent 7 | from langchain.agents.agent_toolkits import VectorStoreInfo 8 | from langchain.agents.agent_toolkits import VectorStoreToolkit 9 | from langchain.agents.agent import AgentExecutor 10 | from langchain.schema import LLMResult 11 | from langchain.sql_database import SQLDatabase 12 | from zeitghost.agents.Helpers import core_prompt, vector_prompt 13 | from zeitghost.agents.Helpers import BQ_PREFIX, bq_template, core_template, vector_template 14 | from zeitghost.agents.Helpers import base_llm, bq_agent_llm, pandas_agent_llm, vectorstore_agent_llm 15 | from zeitghost.agents.Helpers import MyCustomHandler 16 | from zeitghost.vertex.MatchingEngineVectorstore import MatchingEngineVectorStore 17 | from zeitghost.vertex.LLM import VertexLLM 18 | from langchain.callbacks.manager import CallbackManager 19 | 20 | 21 | class LangchainAgent: 22 | """ 23 | A class used to represent an llm agent to ask questions of. 24 | Contains agents for Pandas DataFrames, BigQuery, and Vertex Matching Engine. 25 | """ 26 | callback_handler = MyCustomHandler() 27 | callback_manager = CallbackManager([callback_handler]) 28 | 29 | def get_vectorstore_agent( 30 | self 31 | , vectorstore: MatchingEngineVectorStore 32 | , vectorstore_name: str 33 | , vectorstore_description: str 34 | , llm: VertexLLM = vectorstore_agent_llm 35 | ) -> AgentExecutor: 36 | """ 37 | Gets a langchain agent to query against a Matching Engine vectorstore 38 | 39 | :param llm: zeitghost.vertex.LLM.VertexLLM 40 | :param vectorstore_description: str 41 | :param vectorstore_name: str 42 | :param vectorstore: zeitghost.vertex.MatchingEngineVectorstore.MatchingEngine 43 | 44 | :return langchain.agents.agent.AgentExecutor: 45 | """ 46 | vectorstore_info = VectorStoreInfo( 47 | name=vectorstore_name 48 | , description=vectorstore_description 49 | , vectorstore=vectorstore 50 | ) 51 | vectorstore_toolkit = VectorStoreToolkit( 52 | vectorstore_info=vectorstore_info 53 | , llm=llm 54 | ) 55 | return create_vectorstore_agent( 56 | llm=llm 57 | , toolkit=vectorstore_toolkit 58 | , verbose=True 59 | , callback_manager=self.callback_manager 60 | , return_intermediate_steps=True 61 | ) 62 | 63 | def get_pandas_agent( 64 | self 65 | , dataframe 66 | , llm=pandas_agent_llm 67 | ) -> AgentExecutor: 68 | """ 69 | Gets a langchain agent to query against a pandas dataframe 70 | 71 | :param llm: zeitghost.vertex.llm.VertexLLM 72 | :param dataframe: pandas.DataFrame 73 | Input dataframe for agent to interact with 74 | 75 | :return: langchain.agents.agent.AgentExecutor 76 | """ 77 | return create_pandas_dataframe_agent( 78 | llm=llm 79 | , df=dataframe 80 | , verbose=True 81 | , callback_manager=self.callback_manager 82 | , return_intermediate_steps=True 83 | ) 84 | 85 | def get_bigquery_agent( 86 | self 87 | , project_id='cpg-cdp' 88 | , dataset='google_trends_my_project' 89 | , llm=bq_agent_llm 90 | ) -> AgentExecutor: 91 | """ 92 | Gets a langchain agent to query against a BigQuery dataset 93 | 94 | :param llm: zeitghost.vertex.llm.VertexLLM 95 | :param dataset: 96 | :param project_id: str 97 | Google Cloud Project ID 98 | 99 | :return: 
langchain.SQLDatabaseChain 100 | """ 101 | db = SQLDatabase.from_uri(f"bigquery://{project_id}/{dataset}") 102 | toolkit = SQLDatabaseToolkit(llm=llm, db=db) 103 | 104 | return create_sql_agent( 105 | llm=llm 106 | , toolkit=toolkit 107 | , verbose=True 108 | , prefix=BQ_PREFIX 109 | , callback_manager=self.callback_manager 110 | , return_intermediate_steps=True 111 | ) 112 | 113 | def query_bq_agent( 114 | self 115 | , agent: AgentExecutor 116 | , table: str 117 | , prompt: str 118 | ) -> str: 119 | """ 120 | Queries a BQ Agent given a table and a prompt. 121 | 122 | :param agent: AgentExecutor 123 | :param table: str 124 | Table to ask question against 125 | :param prompt: str 126 | Question prompt 127 | 128 | :return: Dict[str, Any] 129 | """ 130 | 131 | return agent.run( 132 | bq_template.format(prompt=prompt, table=table) 133 | ) 134 | 135 | def query_pandas_agent( 136 | self 137 | , agent: AgentExecutor 138 | , prompt: str 139 | ) -> str: 140 | """ 141 | Queries a BQ Agent given a table and a prompt. 142 | 143 | :param agent: langchain. 144 | :param prompt: str 145 | Question prompt 146 | 147 | :return: Dict[str, Any] 148 | """ 149 | 150 | return agent.run( 151 | core_template.format(question=prompt) 152 | ) 153 | 154 | def query_vectorstore_agent( 155 | self 156 | , agent: AgentExecutor 157 | , prompt: str 158 | , vectorstore_name: str 159 | ): 160 | """ 161 | Queries a VectorStore Agent given a prompt 162 | 163 | :param vectorstore_name: 164 | :param agent: AgentExecutor 165 | :param prompt: str 166 | 167 | :return: str 168 | """ 169 | return agent.run( 170 | vector_template.format(question=prompt, name=vectorstore_name) 171 | ) 172 | 173 | def chain_questions(self, questions) -> LLMResult: 174 | """ 175 | Executes a chain of questions against the configured LLM 176 | :param questions: list(str) 177 | A list of questions to ask the llm 178 | 179 | :return: langchain.schema.LLMResult 180 | """ 181 | llm_chain = LLMChain(prompt=core_prompt, llm=vectorstore_agent_llm) 182 | res = llm_chain.generate(questions) 183 | 184 | return res 185 | 186 | -------------------------------------------------------------------------------- /zeitghost/agents/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/agents/__init__.py -------------------------------------------------------------------------------- /zeitghost/agents/__pycache__/Helpers.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/agents/__pycache__/Helpers.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/agents/__pycache__/LangchainAgent.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/agents/__pycache__/LangchainAgent.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/agents/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/agents/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/bigquery/BigQueryAccessor.py: -------------------------------------------------------------------------------- 1 | from google.cloud import bigquery 2 | from google.cloud.bigquery import QueryJob 3 | from google.cloud.bigquery.table import RowIterator 4 | import pandas as pd 5 | 6 | 7 | class BigQueryAccessor: 8 | """ 9 | Interface for querying BigQuery. 10 | """ 11 | def __init__(self 12 | , project_id 13 | , gdelt_project_id='gdelt-bq' 14 | , gdelt_dataset_id='gdeltv2' 15 | , gdelt_table_name='events'): 16 | """ 17 | :param gdelt_project_id: str 18 | Project ID for building BigQuery client 19 | :param gdelt_dataset_id: str 20 | Dataset ID for building BigQuery Client 21 | :param gdelt_table_name: str 22 | Table name for building BigQuery Client 23 | """ 24 | self.project_id = project_id 25 | self.gdelt_project_id = gdelt_project_id 26 | self.gdelt_dataset_id = gdelt_dataset_id 27 | self.gdelt_table_name = gdelt_table_name 28 | self.client = bigquery.Client(project=self.project_id) 29 | 30 | def _query_bq(self, query_string: str) -> QueryJob: 31 | """ 32 | 33 | :param query_string: str 34 | Full SQL query string to execute against BigQuery 35 | 36 | :return: google.cloud.bigquery.job.QueryJob 37 | """ 38 | return self.client.query(query_string) 39 | 40 | def get_records_from_sourceurl(self, source_url) -> RowIterator: 41 | """ 42 | Retrieve article record from Gdelt dataset given a source_url 43 | 44 | :param source_url: str 45 | 46 | :return: google.cloud.bigquery.table.RowIterator 47 | """ 48 | query = f""" 49 | SELECT 50 | max(SQLDATE) as SQLDATE, 51 | max(Actor1Name) as Actor1Name, 52 | max(Actor2Name) as Actor2Name, 53 | avg(GoldsteinScale) as GoldsteinScale, 54 | max(NumMentions) as NumMentions, 55 | max(NumSources) as NumSources, 56 | max(NumArticles) as NumArticles, 57 | avg(AvgTone) as AvgTone, 58 | SOURCEURL as SOURCEURL, 59 | SOURCEURL as url 60 | FROM `{self.gdelt_project_id}.{self.gdelt_dataset_id}.{self.gdelt_table_name}` 61 | WHERE lower(SOURCEURL) like '%{source_url}%' 62 | GROUP BY SOURCEURL 63 | """ 64 | 65 | return self._query_bq(query_string=query).result() 66 | 67 | def get_records_from_sourceurl_df(self, source_url): 68 | """ 69 | Retrieve article record from Gdelt dataset given a source_url 70 | 71 | :param source_url: str 72 | 73 | :return: pandas.DataFrame 74 | """ 75 | response = self.get_records_from_sourceurl(source_url) 76 | 77 | return response.to_dataframe() 78 | 79 | def get_records_from_actor_keyword(self 80 | , keyword: str 81 | , min_date: str = "2023-01-01" 82 | , max_date: str = "2023-05-30" 83 | ) -> RowIterator: 84 | """ 85 | Retrieve BQ records given input keyword 86 | 87 | :param keyword: str 88 | Keyword used for filtering actor names 89 | 90 | :return: google.cloud.bigquery.table.RowIterator 91 | """ 92 | 93 | query = f""" 94 | SELECT 95 | max(SQLDATE) as SQLDATE, 96 | PARSE_DATE('%Y%m%d', CAST(max(SQLDATE) AS STRING)) as new_date, 97 | max(Actor1Name) as Actor1Name, 98 | max(Actor2Name) as Actor2Name, 99 | avg(GoldsteinScale) as GoldsteinScale, 100 | max(NumMentions) as NumMentions, 101 | max(NumSources) as NumSources, 102 | max(NumArticles) as NumArticles, 103 | avg(AvgTone) as AvgTone, 104 | SOURCEURL as SOURCEURL, 105 | SOURCEURL as url 106 | FROM 
`{self.gdelt_project_id}.{self.gdelt_dataset_id}.{self.gdelt_table_name}` 107 | WHERE lower(SOURCEURL) != 'unspecified' 108 | AND 109 | ( 110 | REGEXP_CONTAINS(LOWER(Actor1Name),'{keyword.lower()}') 111 | OR REGEXP_CONTAINS(LOWER(Actor2Name), '{keyword.lower()}') 112 | ) 113 | AND PARSE_DATE('%Y%m%d', CAST(SQLDATE AS STRING)) >= "{min_date}" 114 | AND PARSE_DATE('%Y%m%d', CAST(SQLDATE AS STRING)) <= "{max_date}" 115 | GROUP BY url 116 | """ 117 | 118 | return self._query_bq(query_string=query).result() 119 | 120 | def get_records_from_actor_keyword_df(self 121 | , keyword: str 122 | , min_date: str = "2023-01-01" 123 | , max_date: str = "2023-05-30" 124 | ) -> pd.DataFrame: 125 | """ 126 | Retrieves BQ records given input actor info 127 | 128 | :param keyword: str 129 | 130 | :return: pandas.DataFrame 131 | """ 132 | response = self.get_records_from_actor_keyword(keyword, min_date, max_date) 133 | 134 | return response.to_dataframe() 135 | 136 | def get_term_set(self 137 | , project_id='cpg-cdp' 138 | , dataset='bigquery-public-data' 139 | , table_id='top_terms' 140 | ) -> RowIterator: 141 | """ 142 | Simple function to get the unique, sorted terms in the table 143 | 144 | :param project_id: str 145 | project_id that holds the dataset. 146 | :param dataset: str 147 | dataset name that holds the table. 148 | :param table_id: str 149 | table name 150 | 151 | :return: google.cloud.bigquery.table.RowIterator 152 | """ 153 | 154 | query = f""" 155 | SELECT distinct 156 | term 157 | FROM `{project_id}.{dataset}.{table_id}` 158 | order by 1 159 | """ 160 | 161 | return self._query_bq(query_string=query).result() 162 | 163 | def get_term_set_df(self 164 | , project_id='cpg-cdp' 165 | , dataset='trends_data' 166 | , table_id='makeupcosmetics_10054_unitedstates_2840' 167 | ) -> list: 168 | """ 169 | Simple function to get the unique, sorted terms in the table 170 | 171 | :param project_id: str 172 | project_id that holds the dataset. 173 | :param dataset: str 174 | dataset name that holds the table. 175 | :param table_id: str 176 | table name 177 | 178 | :return: pandas.DataFrame 179 | """ 180 | df = self.get_term_set(project_id, dataset, table_id).to_dataframe() 181 | 182 | return df["term"].to_list() 183 | 184 | def pull_term_data_from_bq(self 185 | , term: tuple = ('mascara', 'makeup') 186 | , project_id='bigquery-public-data' 187 | , dataset='google_trends' 188 | , table_id='top_rising_terms' 189 | ) -> RowIterator: 190 | """ 191 | Pull terms based on `in` sql clause from term 192 | takes a tuple of terms (str) and produces pandas dataset 193 | 194 | :param term: tuple(str) 195 | A tuple of terms to query for 196 | :param project_id: str 197 | project_id that holds the dataset. 198 | :param dataset: str 199 | dataset name that holds the table. 
200 | :param table_id: str 201 | table name 202 | 203 | :return: google.cloud.bigguqery.table.RowIterator 204 | """ 205 | query = f""" 206 | SELECT 207 | week, 208 | term, 209 | rank 210 | FROM `{project_id}.{dataset}.{table_id}` 211 | WHERE 212 | lower(term) in {term} 213 | order by term, 1 214 | """ 215 | 216 | return self._query_bq(query_string=query).result() 217 | 218 | def pull_term_data_from_bq_df(self 219 | , term: tuple = ('mascara', 'makeup') 220 | , project_id='bigquery-public-data' 221 | , dataset='google_trends' 222 | , table_id='top_rising_terms' 223 | ) -> pd.DataFrame: 224 | """ 225 | Pull terms based on `in` sql clause from term 226 | takes a tuple of terms (str) and produces pandas dataset 227 | 228 | :param term: tuple(str) 229 | A tuple of terms to query for 230 | :param project_id: str 231 | project_id that holds the dataset. 232 | :param dataset: str 233 | dataset name that holds the table. 234 | :param table_id: str 235 | table name 236 | 237 | :return: pandas.DataFrame 238 | """ 239 | result = self.pull_term_data_from_bq(term, project_id, dataset, table_id) 240 | 241 | return result.to_dataframe() 242 | 243 | def pull_regexp_term_data_from_bq(self 244 | , term: str 245 | , project_id='bigquery-public-data' 246 | , dataset='google_trends' 247 | , table_id='top_rising_terms' 248 | ) -> RowIterator: 249 | """ 250 | Pull terms based on `in` sql clause from term 251 | takes a tuple of terms (str) and produces pandas dataset 252 | 253 | :param term: tuple(str) 254 | A tuple of terms to query for 255 | :param project_id: str 256 | project_id that holds the dataset. 257 | :param dataset: str 258 | dataset name that holds the table. 259 | :param table_id: str 260 | table name 261 | 262 | :return: google.cloud.bigguqery.table.RowIterator 263 | """ 264 | query = f""" 265 | SELECT 266 | week, 267 | term, 268 | rank 269 | FROM `{project_id}.{dataset}.{table_id}` 270 | WHERE ( 271 | REGEXP_CONTAINS(LOWER(term), r'{term}') 272 | ) 273 | order by term 274 | """ 275 | 276 | return self._query_bq(query_string=query).result() 277 | 278 | def get_entity_from_geg_full(self 279 | , entity: str 280 | , min_date: str = "2023-01-01" 281 | ) -> RowIterator: 282 | entity_lower = entity.lower() 283 | 284 | query = f""" 285 | WITH 286 | entities AS ( 287 | SELECT 288 | b.*, 289 | url 290 | FROM 291 | `{self.gdelt_project_id}.{self.gdelt_dataset_id}.geg_gcnlapi` AS a, 292 | UNNEST(a.entities) AS b 293 | WHERE 294 | LOWER(b.name) LIKE '%{entity_lower}%' 295 | AND DATE(date) >= '{min_date}' ) 296 | SELECT 297 | * 298 | FROM 299 | `gdelt-bq.gdeltv2.geg_gcnlapi` a 300 | INNER JOIN 301 | entities AS b 302 | ON 303 | a.url = b.url 304 | WHERE 305 | DATE(date) >= '{min_date}' 306 | """ 307 | 308 | return self._query_bq(query_string=query).result() 309 | 310 | def get_entity_from_geg_full_df(self 311 | , entity: str 312 | , min_date: str = "2023-01-01"): 313 | result = self.get_entity_from_geg_full(entity, min_date) 314 | 315 | return result.to_dataframe() 316 | 317 | 318 | def get_geg_entities_data( 319 | self 320 | , entity: str 321 | , min_date: str = "2023-01-01" 322 | , max_date: str = "2023-05-17" 323 | ) -> RowIterator: 324 | 325 | query = f""" 326 | WITH geg_data AS (( 327 | SELECT 328 | groupId, 329 | entity_type, 330 | a.entity as entity_name, 331 | a.numMentions, 332 | a.avgSalience, 333 | eventTime, 334 | polarity, 335 | magnitude, 336 | score, 337 | mid, 338 | wikipediaUrl 339 | FROM ( 340 | SELECT 341 | polarity, 342 | magnitude, 343 | score, 344 | FARM_FINGERPRINT(url) groupId, 345 | 
entity.type AS entity_type, 346 | FORMAT_TIMESTAMP("%Y-%m-%d", date, "UTC") eventTime, 347 | entity.mid AS mid, 348 | entity.wikipediaUrl AS wikipediaUrl 349 | FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, 350 | UNNEST(entities) entity 351 | WHERE entity.mid is not null 352 | AND LOWER(name) LIKE '%{entity}%' 353 | AND lang='en' 354 | AND DATE(date) >= "{min_date}" 355 | AND DATE(date) <= "{max_date}" 356 | ) b JOIN ( 357 | # grab the entities from the nested json in the graph 358 | SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity, 359 | entities.mid mid, 360 | sum(entities.numMentions) as numMentions, 361 | avg(entities.avgSalience) as avgSalience 362 | FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, 363 | UNNEST(entities) entities where entities.mid is not null 364 | AND lang='en' 365 | AND DATE(date) >= "{min_date}" 366 | AND DATE(date) <= "{max_date}" 367 | GROUP BY entities.mid 368 | ) a USING(mid))) 369 | SELECT * 370 | FROM ( SELECT *, RANK() OVER (PARTITION BY eventTime ORDER BY numMentions desc) as rank # get ranks 371 | FROM ( 372 | SELECT 373 | entity_name, 374 | max(entity_type) AS entity_type, 375 | DATE(eventTime) AS eventTime, 376 | sum(numMentions) as numMentions, 377 | avg(magnitude) as avgMagnitude, 378 | max(mid) AS mid, 379 | max(wikipediaUrl) AS wikipediaUrl, 380 | FROM geg_data 381 | GROUP BY 1,3 382 | ) grouped_all 383 | ) 384 | WHERE rank < 300 385 | """ 386 | 387 | return self._query_bq(query_string=query).result() 388 | 389 | def get_geg_entities_data_full_df( 390 | self 391 | , entity: str 392 | , min_date: str = "2023-01-01" 393 | , max_date: str = "2023-05-17" 394 | ): 395 | result = self.get_geg_entities_data(entity, min_date, max_date) 396 | 397 | return result.to_dataframe() 398 | 399 | def get_geg_article_data( 400 | self 401 | , entity: str 402 | , min_date: str = "2023-01-01" 403 | , max_date: str = "2023-05-17" 404 | ) -> RowIterator: 405 | 406 | # here 407 | 408 | query = f""" 409 | WITH geg_data AS (( 410 | SELECT 411 | groupId, 412 | url, 413 | name, 414 | -- a.entity AS entity_name, 415 | wikipediaUrl, 416 | a.numMentions AS numMentions, 417 | a.avgSalience AS avgSalience, 418 | DATE(eventTime) AS eventTime, 419 | polarity, 420 | magnitude, 421 | score 422 | FROM ( 423 | SELECT 424 | name, 425 | polarity, 426 | magnitude, 427 | score, 428 | url, 429 | FARM_FINGERPRINT(url) AS groupId, 430 | CONCAT(entity.type," - ",entity.type) AS entity_id, 431 | FORMAT_TIMESTAMP("%Y-%m-%d", date, "UTC") AS eventTime, 432 | entity.mid AS mid, 433 | entity.wikipediaUrl AS wikipediaUrl , 434 | entity.numMentions AS numMentions 435 | FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, 436 | UNNEST(entities) entity 437 | WHERE entity.mid is not null 438 | AND LOWER(name) LIKE '%{entity}%' 439 | AND lang='en' 440 | AND DATE(date) >= "{min_date}" 441 | AND DATE(date) <= "{max_date}" 442 | ) b JOIN ( 443 | # grab the entities from the nested json in the graph 444 | SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity, 445 | entities.mid mid, 446 | sum(entities.numMentions) as numMentions, 447 | avg(entities.avgSalience) as avgSalience 448 | FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, 449 | UNNEST(entities) entities 450 | WHERE 451 | entities.mid is not null AND 452 | lang='en' 453 | AND DATE(date) >= "{min_date}" 454 | AND DATE(date) <= "{max_date}" 455 | GROUP BY entities.mid 456 | ) a USING(mid))) 457 | SELECT * 458 | FROM ( SELECT *, RANK() OVER (PARTITION BY eventTime ORDER BY numMentions desc) as rank # get ranks 459 | FROM ( 460 | SELECT 461 | -- ARRAY_AGG(entity_name) as 
entity_names, 462 | STRING_AGG(name) as entity_names, 463 | max(eventTime) AS eventTime, 464 | url, 465 | avg(numMentions) AS numMentions, 466 | avg(avgSalience) AS avgSalience, 467 | --sum(numMentions) as numMentions, 468 | --avg(magnitude) as avgMagnitude 469 | FROM geg_data 470 | GROUP BY url 471 | ) 472 | -- grouped_all 473 | ) 474 | WHERE rank < 300 475 | """ 476 | 477 | return self._query_bq(query_string=query).result() 478 | 479 | def get_geg_article_data_full_df( 480 | self 481 | , entity: str 482 | , min_date: str = "2023-01-01" 483 | , max_date: str = "2023-05-26" 484 | ): 485 | result = self.get_geg_article_data(entity, min_date, max_date) 486 | 487 | return result.to_dataframe() 488 | 489 | 490 | def get_geg_article_data_v2( 491 | self 492 | , entity: str 493 | , min_date: str = "2023-01-01" 494 | , max_date: str = "2023-05-26" 495 | ) -> RowIterator: 496 | 497 | # TODO - add arg for avgSalience 498 | 499 | query = f""" 500 | WITH 501 | entities AS ( 502 | SELECT 503 | distinct url, 504 | b.avgSalience AS avgSalience, 505 | date AS date 506 | FROM 507 | `gdelt-bq.gdeltv2.geg_gcnlapi` AS a, 508 | UNNEST(a.entities) AS b 509 | WHERE 510 | LOWER(b.name) LIKE '%{entity}%' 511 | AND DATE(date) >= "{min_date}" 512 | AND DATE(date) <= "{max_date}" 513 | AND b.avgSalience > 0.1 ) 514 | SELECT 515 | entities.url AS url, 516 | -- entities.url AS source, 517 | entities.date, 518 | -- a.polarity, 519 | -- a.magnitude, 520 | -- a.score, 521 | avgSalience 522 | FROM entities inner join `gdelt-bq.gdeltv2.geg_gcnlapi` AS a 523 | ON a.url=entities.url 524 | AND a.date=entities.date 525 | """ 526 | return self._query_bq(query_string=query).result() 527 | 528 | 529 | 530 | def get_geg_article_data_v2_full_df( 531 | self 532 | , entity: str 533 | , min_date: str = "2023-01-01" 534 | , max_date: str = "2023-05-26" 535 | ): 536 | result = self.get_geg_article_data_v2(entity, min_date, max_date) 537 | 538 | return result.to_dataframe() 539 | -------------------------------------------------------------------------------- /zeitghost/bigquery/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/bigquery/__init__.py -------------------------------------------------------------------------------- /zeitghost/bigquery/__pycache__/BigQueryAccessor.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/bigquery/__pycache__/BigQueryAccessor.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/bigquery/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/bigquery/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/capturing_callback_handler.py: -------------------------------------------------------------------------------- 1 | """Callback Handler captures all callbacks in a session for future offline playback.""" 2 | 3 | from __future__ import annotations 4 | 5 | import pickle 6 | import time 7 | from typing import Any, TypedDict 8 | 9 | from 
langchain.callbacks.base import BaseCallbackHandler 10 | 11 | 12 | # This is intentionally not an enum so that we avoid serializing a 13 | # custom class with pickle. 14 | class CallbackType: 15 | ON_LLM_START = "on_llm_start" 16 | ON_LLM_NEW_TOKEN = "on_llm_new_token" 17 | ON_LLM_END = "on_llm_end" 18 | ON_LLM_ERROR = "on_llm_error" 19 | ON_TOOL_START = "on_tool_start" 20 | ON_TOOL_END = "on_tool_end" 21 | ON_TOOL_ERROR = "on_tool_error" 22 | ON_TEXT = "on_text" 23 | ON_CHAIN_START = "on_chain_start" 24 | ON_CHAIN_END = "on_chain_end" 25 | ON_CHAIN_ERROR = "on_chain_error" 26 | ON_AGENT_ACTION = "on_agent_action" 27 | ON_AGENT_FINISH = "on_agent_finish" 28 | 29 | 30 | # We use TypedDict, rather than NamedTuple, so that we avoid serializing a 31 | # custom class with pickle. All of this class's members should be basic Python types. 32 | class CallbackRecord(TypedDict): 33 | callback_type: str 34 | args: tuple[Any, ...] 35 | kwargs: dict[str, Any] 36 | time_delta: float # Number of seconds between this record and the previous one 37 | 38 | 39 | def load_records_from_file(path: str) -> list[CallbackRecord]: 40 | """Load the list of CallbackRecords from a pickle file at the given path.""" 41 | with open(path, "rb") as file: 42 | records = pickle.load(file) 43 | 44 | if not isinstance(records, list): 45 | raise RuntimeError(f"Bad CallbackRecord data in {path}") 46 | return records 47 | 48 | 49 | def playback_callbacks( 50 | handlers: list[BaseCallbackHandler], 51 | records_or_filename: list[CallbackRecord] | str, 52 | max_pause_time: float, 53 | ) -> str: 54 | if isinstance(records_or_filename, list): 55 | records = records_or_filename 56 | else: 57 | records = load_records_from_file(records_or_filename) 58 | 59 | for record in records: 60 | pause_time = min(record["time_delta"], max_pause_time) 61 | if pause_time > 0: 62 | time.sleep(pause_time) 63 | 64 | for handler in handlers: 65 | if record["callback_type"] == CallbackType.ON_LLM_START: 66 | handler.on_llm_start(*record["args"], **record["kwargs"]) 67 | elif record["callback_type"] == CallbackType.ON_LLM_NEW_TOKEN: 68 | handler.on_llm_new_token(*record["args"], **record["kwargs"]) 69 | elif record["callback_type"] == CallbackType.ON_LLM_END: 70 | handler.on_llm_end(*record["args"], **record["kwargs"]) 71 | elif record["callback_type"] == CallbackType.ON_LLM_ERROR: 72 | handler.on_llm_error(*record["args"], **record["kwargs"]) 73 | elif record["callback_type"] == CallbackType.ON_TOOL_START: 74 | handler.on_tool_start(*record["args"], **record["kwargs"]) 75 | elif record["callback_type"] == CallbackType.ON_TOOL_END: 76 | handler.on_tool_end(*record["args"], **record["kwargs"]) 77 | elif record["callback_type"] == CallbackType.ON_TOOL_ERROR: 78 | handler.on_tool_error(*record["args"], **record["kwargs"]) 79 | elif record["callback_type"] == CallbackType.ON_TEXT: 80 | handler.on_text(*record["args"], **record["kwargs"]) 81 | elif record["callback_type"] == CallbackType.ON_CHAIN_START: 82 | handler.on_chain_start(*record["args"], **record["kwargs"]) 83 | elif record["callback_type"] == CallbackType.ON_CHAIN_END: 84 | handler.on_chain_end(*record["args"], **record["kwargs"]) 85 | elif record["callback_type"] == CallbackType.ON_CHAIN_ERROR: 86 | handler.on_chain_error(*record["args"], **record["kwargs"]) 87 | elif record["callback_type"] == CallbackType.ON_AGENT_ACTION: 88 | handler.on_agent_action(*record["args"], **record["kwargs"]) 89 | elif record["callback_type"] == CallbackType.ON_AGENT_FINISH: 90 | handler.on_agent_finish(*record["args"], 
**record["kwargs"]) 91 | 92 | # Return the agent's result 93 | for record in records: 94 | if record["callback_type"] == CallbackType.ON_AGENT_FINISH: 95 | return record["args"][0][0]["output"] 96 | 97 | return "[Missing Agent Result]" 98 | 99 | 100 | class CapturingCallbackHandler(BaseCallbackHandler): 101 | def __init__(self) -> None: 102 | self._records: list[CallbackRecord] = [] 103 | self._last_time: float | None = None 104 | 105 | def dump_records_to_file(self, path: str) -> None: 106 | """Write the list of CallbackRecords to a pickle file at the given path.""" 107 | with open(path, "wb") as file: 108 | pickle.dump(self._records, file) 109 | 110 | def _append_record( 111 | self, type: str, args: tuple[Any, ...], kwargs: dict[str, Any] 112 | ) -> None: 113 | time_now = time.time() 114 | time_delta = time_now - self._last_time if self._last_time is not None else 0 115 | self._last_time = time_now 116 | self._records.append( 117 | CallbackRecord( 118 | callback_type=type, args=args, kwargs=kwargs, time_delta=time_delta 119 | ) 120 | ) 121 | 122 | def on_llm_start(self, *args: Any, **kwargs: Any) -> None: 123 | self._append_record(CallbackType.ON_LLM_START, args, kwargs) 124 | 125 | def on_llm_new_token(self, *args: Any, **kwargs: Any) -> None: 126 | self._append_record(CallbackType.ON_LLM_NEW_TOKEN, args, kwargs) 127 | 128 | def on_llm_end(self, *args: Any, **kwargs: Any) -> None: 129 | self._append_record(CallbackType.ON_LLM_END, args, kwargs) 130 | 131 | def on_llm_error(self, *args: Any, **kwargs: Any) -> None: 132 | self._append_record(CallbackType.ON_LLM_ERROR, args, kwargs) 133 | 134 | def on_tool_start(self, *args: Any, **kwargs: Any) -> None: 135 | self._append_record(CallbackType.ON_TOOL_START, args, kwargs) 136 | 137 | def on_tool_end(self, *args: Any, **kwargs: Any) -> None: 138 | self._append_record(CallbackType.ON_TOOL_END, args, kwargs) 139 | 140 | def on_tool_error(self, *args: Any, **kwargs: Any) -> None: 141 | self._append_record(CallbackType.ON_TOOL_ERROR, args, kwargs) 142 | 143 | def on_text(self, *args: Any, **kwargs: Any) -> None: 144 | self._append_record(CallbackType.ON_TEXT, args, kwargs) 145 | 146 | def on_chain_start(self, *args: Any, **kwargs: Any) -> None: 147 | self._append_record(CallbackType.ON_CHAIN_START, args, kwargs) 148 | 149 | def on_chain_end(self, *args: Any, **kwargs: Any) -> None: 150 | self._append_record(CallbackType.ON_CHAIN_END, args, kwargs) 151 | 152 | def on_chain_error(self, *args: Any, **kwargs: Any) -> None: 153 | self._append_record(CallbackType.ON_CHAIN_ERROR, args, kwargs) 154 | 155 | def on_agent_action(self, *args: Any, **kwargs: Any) -> Any: 156 | self._append_record(CallbackType.ON_AGENT_ACTION, args, kwargs) 157 | 158 | def on_agent_finish(self, *args: Any, **kwargs: Any) -> None: 159 | self._append_record(CallbackType.ON_AGENT_FINISH, args, kwargs) 160 | -------------------------------------------------------------------------------- /zeitghost/gdelt/GdeltData.py: -------------------------------------------------------------------------------- 1 | from urllib.parse import urlparse 2 | from collections import defaultdict 3 | from newspaper import news_pool, Article, Source 4 | import nltk 5 | from typing import Dict, Any, List 6 | import logging 7 | from google.cloud import storage 8 | from google.cloud import bigquery as bq 9 | from google.cloud.bigquery.table import RowIterator 10 | import pandas as pd 11 | from zeitghost.gdelt.Helpers import gdelt_processed_record 12 | 13 | 14 | #TODO: 15 | # Optimizations: 16 | # Generate 
embeddings during parsing process, then save to to bq 17 | # Build in "checker" to see if we have already pulled and processed articles in bq table 18 | class GdeltData: 19 | """ 20 | Gdelt query and parser class 21 | """ 22 | def __init__( 23 | self 24 | , gdelt_data 25 | , destination_table: str = 'gdelt_actors' 26 | , project: str = 'cpg-cdp' 27 | , destination_dataset: str = 'genai_cap_v1' 28 | ): 29 | """ 30 | :param gdelt_data: pandas.DataFrame|google.cloud.bigquery.table.RowIterator 31 | Input data for GDelt processing 32 | """ 33 | logging.debug('Downloading nltk["punkt"]') 34 | nltk.download('punkt', "./") 35 | # BigQuery prepping 36 | self.__project = project 37 | self.__bq_client = bq.Client(project=self.__project) 38 | self.__location = 'us-central1' 39 | self.__destination_table = destination_table 40 | self.__destination_dataset = destination_dataset 41 | self.destination_table_id = f'{self.__destination_dataset}.{self.__destination_table}' 42 | # Prep for particulars of gdelt dataset 43 | 44 | # Builds self.gdelt_df based on incoming dataset type 45 | # TODO: 46 | if type(gdelt_data) is RowIterator: 47 | logging.debug("gdelt data came in as RowIterator") 48 | self.gdelt_df = self._row_iterator_loader(gdelt_data) 49 | elif type(gdelt_data) is pd.DataFrame: 50 | logging.debug("gdelt data came in as DataFrame") 51 | self.gdelt_df = self._dataframe_loader(gdelt_data) 52 | else: 53 | logging.error("Unrecognized datatype for input dataset") 54 | 55 | self.urllist = self.gdelt_df['url'].map(str).to_list() 56 | self.domains = [ 57 | {urlparse(url).scheme + "://" + urlparse(url).hostname: url} 58 | for url in self.urllist 59 | ] 60 | self.news_sources = self._prepare_news_sources() 61 | self.full_source_data = self._parallel_parse_nlp_transform() 62 | self.chunk_df = pd.DataFrame.from_records(self.full_source_data) 63 | self.index_data = self._prepare_for_indexing() 64 | 65 | def _dataframe_loader(self, gdelt_df: pd.DataFrame) -> pd.DataFrame: 66 | logging.debug(f"DataFrame came in with columns: [{','.join(gdelt_df.columns)}]") 67 | #gdelt_df.fillna(0.0) 68 | 69 | return gdelt_df 70 | 71 | def _row_iterator_loader(self, row_iterator: RowIterator) -> pd.DataFrame: 72 | """ 73 | This takes a bq iterator and loads data back into a bq table 74 | """ 75 | # iterate over the bq result page - page size is default 100k rows or 10mb 76 | holder_df = [] 77 | for df in row_iterator.to_dataframe_iterable(): 78 | logging.debug(f"RowIterator came in with columns: [{','.join(df.columns)}]") 79 | tmp_df = df 80 | #tmp_df = tmp_df.fillna(0.0) 81 | holder_df.append(tmp_df) 82 | 83 | return pd.concat(holder_df) 84 | 85 | def pull_article_text(self, source_url) -> dict[str, Any]: 86 | """ 87 | Process individual article for extended usage 88 | 89 | :param source_url: str 90 | url for article to download and process 91 | 92 | :return: dict 93 | """ 94 | article = Article(source_url) 95 | article.parse() 96 | article.nlp() 97 | return { 98 | "title": article.title 99 | , "text": article.text 100 | , "authors": article.authors 101 | # , "keywords": article.keywords 102 | # , "tags" : article.tags 103 | , "summary": article.summary 104 | , "publish_date": article.publish_date 105 | , "url": article.url 106 | , "language": article.meta_lang 107 | } 108 | 109 | def _prepare_news_sources(self): 110 | """ 111 | Given a Gdelt record: group articles by domain, download domain level information. 
112 | For each article: download articles, parse downloaded information, and do simple nlp summarization 113 | 114 | :return: List[Source] 115 | """ 116 | domain_article = defaultdict(list) 117 | tmp_list = list() 118 | 119 | # Build {: [
]} dictionary in preparation 120 | # for newspaper activity 121 | for entry in self.domains: 122 | for domain, article in entry.items(): 123 | domain_article[domain].append( 124 | Article(article, fetch_images=False) 125 | ) 126 | logging.debug("Attempting to fetch domain and article information") 127 | for domain, articles in domain_article.items(): 128 | # Create Article Source 129 | tmp_domain = Source( 130 | url=domain 131 | , request_timeout=5 132 | , number_threads=2 133 | ) 134 | # Download and parse top-level domain 135 | tmp_domain.download() 136 | tmp_domain.parse() 137 | 138 | # Build category information 139 | #tmp_domain.set_categories() 140 | #tmp_domain.download_categories() 141 | #tmp_domain.parse_categories() 142 | 143 | # Set articles to Articles built from urllist parameter 144 | tmp_domain.articles = articles 145 | tmp_list.append(tmp_domain) 146 | # Parallelize and download articles, with throttling 147 | news_pool.set(tmp_list, override_threads=1, threads_per_source=1) 148 | news_pool.join() 149 | 150 | # Handle articles in each domain 151 | logging.debug("Parsing and running simple nlp on articles") 152 | for domain in tmp_list: 153 | domain.parse_articles() 154 | for article in domain.articles: 155 | article.parse() 156 | article.nlp() 157 | 158 | return tmp_list 159 | 160 | def _parallel_parse_nlp_transform(self) -> List[Dict[str, Any]]: 161 | """ 162 | Given a list of GDelt records, parse and process the site information. 163 | Actual data structure for dictionary is a 164 | list(zeitghost.gdelt.Helpers.gdelt_processed_records) 165 | :return: List[Dict[str, Any]] 166 | """ 167 | # Prepare for final return list[dict()] 168 | logging.debug("Preparing full domain and article payloads") 169 | tmp_list = list() 170 | for src in self.news_sources: 171 | tmp = { 172 | "domain": src.domain 173 | , "url": src.url 174 | , "brand": src.brand 175 | , "description": src.description 176 | #, "categories": [category.url for category in src.categories] 177 | , "article_count": len(src.articles) 178 | , "articles": [ 179 | { 180 | "title": article.title 181 | , "text": article.text 182 | , "authors": article.authors 183 | # , "keywords": article.keywords 184 | # , "tags" : article.tags 185 | , "summary": article.summary 186 | , "publish_date": article.publish_date 187 | , "url": article.url 188 | , "language": article.meta_lang 189 | , "date": self.gdelt_df[self.gdelt_df['url'] == article.url]['SQLDATE'].item() if 'SQLDATE' in self.gdelt_df.columns else ''# self.gdelt_df[self.gdelt_df['url'] == article.url]['date'].item() 190 | , "Actor1Name": self.gdelt_df[self.gdelt_df['url'] == article.url]['Actor1Name'].item() if 'Actor1Name' in self.gdelt_df.columns else '' 191 | , "Actor2Name": self.gdelt_df[self.gdelt_df['url'] == article.url]['Actor2Name'].item() if 'Actor2Name' in self.gdelt_df.columns else '' 192 | , "GoldsteinScale": self.gdelt_df[self.gdelt_df['url'] == article.url]['GoldsteinScale'].item() if 'GoldsteinScale' in self.gdelt_df.columns else '' 193 | , "NumMentions": [self.gdelt_df[self.gdelt_df['url'] == article.url]['NumMentions'].item()] if 'NumMentions' in self.gdelt_df.columns else []#if self.gdelt_df[self.gdelt_df['url'] == article.url]['entities'].map(lambda x: [int(e['numMentions']) for e in x]).values else [] 194 | , "NumSources": self.gdelt_df[self.gdelt_df['url'] == article.url]['NumSources'].item() if 'NumSources' in self.gdelt_df.columns else 0 195 | , "NumArticles": self.gdelt_df[self.gdelt_df['url'] == article.url]['NumArticles'].item() if 'NumArticles' in 
self.gdelt_df.columns else 0 196 | , "AvgTone": self.gdelt_df[self.gdelt_df['url'] == article.url]['AvgTone'].item() if 'AvgTone' in self.gdelt_df.columns else 0.0 197 | #, "entities_name": self.gdelt_df[self.gdelt_df['url'] == article.url]['entities'].map(lambda x: [str(e['name']) for e in x]).values if 'entities' in self.gdelt_df.columns else [] 198 | #, "entities_type": self.gdelt_df[self.gdelt_df['url'] == article.url]['entities'].map(lambda x: [str(e['type']) for e in x]).values if 'entities' in self.gdelt_df.columns else [] 199 | #, "entities_avgSalience": self.gdelt_df[self.gdelt_df['url'] == article.url]['entities'].map(lambda x: [float(e['avgSalience']) for e in x]).values if 'entities' in self.gdelt_df.columns else [] 200 | } for article in src.articles 201 | ] 202 | } 203 | tmp_list.append(tmp) 204 | 205 | return tmp_list 206 | 207 | def _reduced_articles(self) -> List[Dict[str, Any]]: 208 | """ 209 | Given a list of GDelt records, parse and process the site information. 210 | Actual data structure for dictionary is a 211 | list(zeitghost.gdelt.Helpers.gdelt_reduced_articles) 212 | :return: List[Dict[str, Any]] 213 | """ 214 | # Prepare for final return list[dict()] 215 | logging.debug("Preparing full domain and article payloads") 216 | tmp_list = list() 217 | for src in self.news_sources: 218 | for article in src.articles: 219 | row = self.gdelt_df[self.gdelt_df['url'] == article.url] 220 | tmp = { 221 | "title": article.title 222 | , "text": article.text 223 | , "article_url": article.url 224 | , "summary": article.summary 225 | , "date": str(row['date'].values) 226 | , "entities_name": row['entities'].map(lambda x: [str(e['name']) for e in x]).values 227 | , "entities_type": row['entities'].map(lambda x: [str(e['type']) for e in x]).values 228 | , "entities_numMentions": row['entities'].map(lambda x: [int(e['numMentions']) for e in x]).values 229 | , "entities_avgSalience": row['entities'].map(lambda x: [float(e['avgSalience']) for e in x]).values 230 | } 231 | tmp_list.append(tmp) 232 | 233 | return tmp_list 234 | 235 | def _prepare_for_indexing(self): 236 | """ 237 | Reduces the larger Gdelt and newspaper download into a more compact payload tuned for indexing 238 | 239 | :return: pandas.DataFrame 240 | """ 241 | logging.debug("Reducing full payload into what Chroma expects for indexing") 242 | final_return_df = pd.DataFrame.from_dict(self.full_source_data) 243 | pre_vector_df = final_return_df[['articles', 'url']].copy() 244 | 245 | pre_vector_df.columns = ['text', 'url'] 246 | 247 | pre_vector_df['text'] = str(pre_vector_df['text']) 248 | 249 | pre_vector_df['text'].astype("string") 250 | pre_vector_df['url'].astype("string") 251 | 252 | return pre_vector_df 253 | 254 | def write_to_gcs(self, output_df: pd.DataFrame, bucket_name: str): 255 | """ 256 | Output article information to a cloud storage bucket 257 | 258 | :param output_df: pandas.DataFrame 259 | Input dataframe to write out to GCS 260 | :param bucket_name: str 261 | Bucket name for writing to 262 | 263 | :return: str 264 | """ 265 | client = storage.Client() 266 | bucket = client.get_bucket(bucket_name) 267 | if not bucket.exists(): 268 | bucket.create() 269 | blob_name = "articles/data.json" 270 | 271 | bucket.blob(blob_name).upload_from_string( 272 | output_df.to_json(index=False) 273 | , 'text/json' 274 | ) 275 | 276 | return f"gs://{bucket_name}/{blob_name}" 277 | 278 | def write_to_bq(self) -> str: 279 | self.chunk_df.to_gbq(self.destination_table_id 280 | , project_id=self.__project 281 | , 
if_exists='append' 282 | , table_schema=gdelt_processed_record 283 | ) 284 | 285 | return f"{self.__project}:{self.destination_table_id}" -------------------------------------------------------------------------------- /zeitghost/gdelt/Helpers.py: -------------------------------------------------------------------------------- 1 | from google.cloud import bigquery as bq 2 | 3 | gdelt_input_record = [ 4 | bq.SchemaField(name="SQLDATE", field_type="TIMESTAMP", mode="REQUIRED") 5 | , bq.SchemaField(name="Actor1Name", field_type="STRING", mode="REQUIRED") 6 | , bq.SchemaField(name="Actor2Name", field_type="STRING", mode="REQUIRED") 7 | , bq.SchemaField(name="GoldsteinScale", field_type="FLOAT64", mode="REQUIRED") 8 | , bq.SchemaField(name="NumMentions", field_type="INT64", mode="REQUIRED") 9 | , bq.SchemaField(name="NumSources", field_type="INT64", mode="REQUIRED") 10 | , bq.SchemaField(name="NumArticles", field_type="INT64", mode="REQUIRED") 11 | , bq.SchemaField(name="AvgTone", field_type="FLOAT64", mode="REQUIRED") 12 | , bq.SchemaField(name="SOURCEURL", field_type="STRING", mode="REQUIRED") 13 | ] 14 | 15 | gdelt_processed_article = [ 16 | bq.SchemaField(name="title", field_type="STRING", mode="REQUIRED") 17 | , bq.SchemaField(name="text", field_type="STRING", mode="REQUIRED") 18 | , bq.SchemaField(name="authors", field_type="STRING", mode="REPEATED") 19 | , bq.SchemaField(name="summary", field_type="STRING", mode="REQUIRED") 20 | , bq.SchemaField(name="publish_date", field_type="TIMESTAMP", mode="NULLABLE") 21 | , bq.SchemaField(name="url", field_type="STRING", mode="REQUIRED") 22 | , bq.SchemaField(name="language", field_type="STRING", mode="REQUIRED") 23 | , bq.SchemaField(name="date", field_type="DATETIME", mode="REQUIRED") 24 | , bq.SchemaField(name="Actor1Name", field_type="STRING", mode="REQUIRED") 25 | , bq.SchemaField(name="Actor2Name", field_type="STRING", mode="REQUIRED") 26 | , bq.SchemaField(name="GoldsteinScale", field_type="FLOAT64", mode="REQUIRED") 27 | , bq.SchemaField(name="NumMentions", field_type="INT64", mode="REPEATED") 28 | , bq.SchemaField(name="NumSources", field_type="INT64", mode="REQUIRED") 29 | , bq.SchemaField(name="NumArticles", field_type="INT64", mode="REQUIRED") 30 | , bq.SchemaField(name="AvgTone", field_type="FLOAT64", mode="REQUIRED") 31 | #, bq.SchemaField(name="entities_name", field_type="STRING", mode="REPEATED") 32 | #, bq.SchemaField(name="entities_type", field_type="STRING", mode="REPEATED") 33 | #, bq.SchemaField(name="entities_avgSalience", field_type="FLOAT64", mode="REPEATED") 34 | ] 35 | 36 | gdelt_processed_record = [ 37 | bq.SchemaField(name="domain", field_type="STRING", mode="REQUIRED") 38 | , bq.SchemaField(name="url", field_type="STRING", mode="REQUIRED") 39 | , bq.SchemaField(name="brand", field_type="STRING", mode="REQUIRED") 40 | , bq.SchemaField(name="description", field_type="STRING", mode="REQUIRED") 41 | , bq.SchemaField(name="categories", field_type="STRING", mode="REPEATED") 42 | , bq.SchemaField(name="article_count", field_type="INT64", mode="REQUIRED") 43 | , bq.SchemaField(name="articles", field_type="RECORD", mode="REPEATED", fields=gdelt_processed_article) 44 | ] 45 | 46 | gdelt_reduced_articles = [ 47 | bq.SchemaField(name="title", field_type="STRING", mode="REQUIRED") 48 | , bq.SchemaField(name="text", field_type="STRING", mode="REQUIRED") 49 | , bq.SchemaField(name="article_url", field_type="STRING", mode="REQUIRED") 50 | , bq.SchemaField(name="summary", field_type="STRING", mode="REQUIRED") 51 | , 
bq.SchemaField(name="date", field_type="TIMESTAMP", mode="NULLABLE") 52 | , bq.SchemaField(name="entities_name", field_type="STRING", mode="REPEATED") 53 | , bq.SchemaField(name="entities_type", field_type="STRING", mode="REPEATED") 54 | , bq.SchemaField(name="entities_numMentions", field_type="INT64", mode="REPEATED") 55 | , bq.SchemaField(name="entities_avg_Salience", field_type="FLOAT64", mode="REPEATED") 56 | ] 57 | 58 | # gdelt_geg_articles_to_scrape = [ 59 | # bq.SchemaField(name="url", field_type="STRING", mode="REQUIRED") 60 | # , bq.SchemaField(name="date", field_type="TIMESTAMP", mode="REQUIRED") 61 | # , bq.SchemaField(name="avgSalience", field_type="FLOAT", mode="REQUIRED") 62 | # ] 63 | 64 | -------------------------------------------------------------------------------- /zeitghost/gdelt/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/gdelt/__init__.py -------------------------------------------------------------------------------- /zeitghost/gdelt/__pycache__/GdeltData.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/gdelt/__pycache__/GdeltData.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/gdelt/__pycache__/Helpers.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/gdelt/__pycache__/Helpers.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/gdelt/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/gdelt/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/testing/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/testing/__init__.py -------------------------------------------------------------------------------- /zeitghost/testing/basic_agent_unit_tests.py: -------------------------------------------------------------------------------- 1 | # test_with_unittest.py 2 | import pandas as pd 3 | import sys 4 | sys.path.append('../..') 5 | 6 | import langchain #for class assertions 7 | from google.cloud.bigquery.table import RowIterator #for class exertions 8 | from zeitghost.agents.LangchainAgent import LangchainAgent 9 | from zeitghost.vertex.LLM import VertexLLM#, VertexLangchainLLM 10 | from zeitghost.vertex.Embeddings import VertexEmbeddings 11 | from zeitghost.bigquery.BigQueryAccessor import BigQueryAccessor 12 | import unittest 13 | from unittest import TestCase 14 | dataset='trends_data' 15 | table_id='makeupcosmetics_10054_unitedstates_2840_external' 16 | 17 | TEST_PANDAS_SCRIPT = '''This is a dataframe of google search terms (term column) 18 | scored by volume (score column) by weekly date 
(date_field column): 19 | when were certain terms popular compared to others? 20 | why? double check your answer''' 21 | 22 | PROJECT_ID = 'cpg-cdp' 23 | gdelt_keyword = 'estee lauder' # lower case 24 | term_data_bq = ('mascara', 'makeup', 'ulta', 'tonymoly') 25 | 26 | GDELT_COLS = ['SQLDATE', 'Actor1Name', 'Actor2Name', 'GoldsteinScale', 'NumMentions', 'NumSources', 'NumArticles', 'AvgTone', 'SOURCEURL'] 27 | TRENDSPOTTING_COLS = ['date_field', 'term', 'score'] 28 | 29 | BQ_AGENT_PROMPT = f"""Describe the {dataset}.{table_id} table? Don't download the entire table, when complete, say I now know the final answer""" 30 | 31 | class AgentTests(TestCase): 32 | 33 | def __init__(self, project_id=PROJECT_ID, 34 | table_id=table_id, 35 | dataset=dataset, 36 | gdelt_keyword=gdelt_keyword, 37 | term_data_bq = term_data_bq 38 | ): 39 | self.project_id = project_id 40 | self.table_id = table_id 41 | self.dataset = dataset 42 | self.gdelt_keyword = gdelt_keyword 43 | self.term_data_bq = term_data_bq 44 | # unittest.TestCase.__init__ does not accept these custom keyword arguments, 45 | # so initialise the base class with its default signature before running the test body 46 | super().__init__() 47 | self._act() 48 | self._assert() 49 | 50 | 51 | 52 | def _act(self): 53 | self.llm = VertexLLM(stop=['Observation:']) 54 | self.llm_test = self.llm.predict('how are you doing today?', ['Observation:']) 55 | self.langchain_llm = self.llm 56 | self.langchain_llm_test = self.langchain_llm('how are you doing today?')#, stop=['Observation:']) #you need that for the pandas bot 57 | self.data_accessor = BigQueryAccessor(self.project_id) 58 | self.gdelt_accessor = self.data_accessor.get_records_from_actor_keyword_df(self.gdelt_keyword) 59 | self.term_data_from_bq = self.data_accessor.pull_term_data_from_bq(self.term_data_bq) 60 | self.trendspotting_subset = self.term_data_from_bq.to_dataframe() 61 | self.vertex_langchain_agent = LangchainAgent(self.langchain_llm) 62 | self.trendspotting_subset = self.term_data_from_bq.to_dataframe() 63 | self.pandas_agent = self.vertex_langchain_agent.get_pandas_agent(self.trendspotting_subset) 64 | self.pandas_agent_result = self.pandas_agent.run(TEST_PANDAS_SCRIPT) 65 | self.langchain_agent_instance = LangchainAgent(self.langchain_llm) 66 | self.agent_executor = self.langchain_agent_instance.get_bigquery_agent(self.project_id) 67 | self.agent_executor_test = self.agent_executor(BQ_AGENT_PROMPT) 68 | 69 | def _assert(self): 70 | assert True is True #trivial start 71 | assert type(self.llm) is VertexLLM 72 | assert type(self.llm_test) is str 73 | assert type(self.langchain_llm) is VertexLLM 74 | assert type(self.langchain_llm_test) is str 75 | assert len(self.llm_test) > 1 76 | assert len(self.langchain_llm_test) > 1 77 | assert type(self.data_accessor) is BigQueryAccessor 78 | assert type(self.gdelt_accessor) is pd.core.frame.DataFrame #is this right?? 
79 | assert len(self.gdelt_accessor) > 1 80 | assert type(self.term_data_from_bq) is RowIterator 81 | assert self.gdelt_accessor.columns.to_list() == GDELT_COLS 82 | assert type(self.trendspotting_subset) == pd.core.frame.DataFrame 83 | assert len(self.trendspotting_subset) > 1 84 | assert self.trendspotting_subset.columns.to_list() == TRENDSPOTTING_COLS 85 | assert type(self.vertex_langchain_agent) is zeitghost.agents.LangchainAgent.LangchainAgent 86 | assert type(self.pandas_agent) is langchain.agents.agent.AgentExecutor 87 | assert len(self.pandas_agent_result) > 1 88 | assert type(self.langchain_agent_instance) is zeitghost.agents.LangchainAgent.LangchainAgent 89 | assert type(self.agent_executor) is langchain.agents.agent.AgentExecutor 90 | assert len(agent_executor_test) > 1 91 | 92 | 93 | 94 | 95 | 96 | -------------------------------------------------------------------------------- /zeitghost/ts_embedding/.ipynb_checkpoints/kats_embedding_tools-checkpoint.py: -------------------------------------------------------------------------------- 1 | from sklearn.preprocessing import MinMaxScaler 2 | from kats.consts import TimeSeriesData 3 | from kats.tsfeatures.tsfeatures import TsFeatures 4 | import os 5 | import pandas as pd 6 | import numpy as np 7 | from .bq_data_tools import pull_term_data_from_bq 8 | from decimal import Decimal 9 | 10 | 11 | 12 | # https://stackoverflow.com/questions/434287/how-to-iterate-over-a-list-in-chunks 13 | 14 | def chunker(seq, size): 15 | return (seq[pos:pos + size] for pos in range(0, len(seq), size)) 16 | 17 | SLIDING_WINDOW_SIZE = 30 #n months for chunking - complete examples only used 18 | STEP = 1 #step - default to 1 19 | 20 | 21 | def write_embeddings_to_disk(term_chunk, filename='data/ts_embeddings.jsonl'): 22 | ''' 23 | this funciton takes a chunk of n_terms (see chunker for input) 24 | and writes to `filename` a jsonl file compliant with 25 | matching engine 26 | ''' 27 | term_data = pull_term_data_from_bq(tuple(term_chunk)) 28 | #run through by term 29 | for term in term_chunk: 30 | # emb_pair = get_feature_embedding_for_window(term_data[term_data.term == term], term) 31 | # ts_emedding_pairs.append(emb_pair) 32 | wdf = windows(term_data[term_data.term == term], SLIDING_WINDOW_SIZE, STEP) 33 | for window, new_df in wdf.groupby(level=0): 34 | # print(window, new_df) 35 | if new_df.shape[0] == SLIDING_WINDOW_SIZE: #full examples only 36 | emb_pair = get_feature_embedding_for_window(new_df, term) 37 | label, emb = emb_pair 38 | formatted_emb = '{"id":"' + str(label) + '","embedding":[' + ",".join(str(x) for x in list(emb)) + ']}' 39 | with open(filename, 'a') as f: 40 | f.write(formatted_emb) 41 | f.write("\n") 42 | f.close() 43 | 44 | def windows(data, window_size, step): 45 | ''' 46 | creates slices of the time series used for 47 | creating embeddings 48 | ''' 49 | r = np.arange(len(data)) 50 | s = r[::step] 51 | z = list(zip(s, s + window_size)) 52 | f = '{0[0]}:{0[1]}'.format 53 | g = lambda t: data.iloc[t[0]:t[1]] 54 | return pd.concat(map(g, z), keys=map(f, z)) 55 | 56 | def get_feature_embedding_for_window(df, term): 57 | ''' 58 | this takes a df with schema of type `date_field` and `score` to create an embeddding 59 | takes 30 weeks of historical timeseries data 60 | ''' 61 | ts_name = f"{term}_{str(df.date_field.min())}_{str(df.date_field.max())}" 62 | scaler=MinMaxScaler() 63 | df[['score']] = scaler.fit_transform(df[['score']]) 64 | scores = df[['score']].values.tolist() 65 | flat_values = [item for sublist in scores for item in 
sublist] 66 | df = df.rename(columns={"date_field":"time"}) 67 | ts_df = pd.DataFrame({'time':df.time, 68 | 'score':flat_values}) 69 | ts_df.drop_duplicates(keep='first', inplace=True) 70 | 71 | # Use Kats to extract features for the time window 72 | try: 73 | if not (len(np.unique(ts_df.score.tolist())) == 1 \ 74 | or len(np.unique(ts_df.score.tolist())) == 0): 75 | timeseries = TimeSeriesData(ts_df) 76 | features = TsFeatures().transform(timeseries) 77 | feature_list = [float(v) if not pd.isnull(v) else float(0) for _, v in features.items()] 78 | if Decimal('Infinity') in feature_list or Decimal('-Infinity') in feature_list: 79 | return None 80 | return (ts_name, feature_list) 81 | except np.linalg.LinAlgError as e: 82 | print(f"Can't process {ts_name}:{e}") 83 | return None 84 | 85 | def chunks(iterable, batch_size=100): 86 | it = iter(iterable) 87 | chunk = tuple(itertools.islice(it, batch_size)) 88 | while chunk: 89 | yield chunk 90 | chunk = tuple(itertools.islice(it, batch_size)) -------------------------------------------------------------------------------- /zeitghost/ts_embedding/bq_data_tools.py: -------------------------------------------------------------------------------- 1 | from google.cloud import bigquery 2 | import pandas as pd 3 | 4 | PROJECT_ID = 'cpg-cdp' 5 | TABLE_ID = 'makeupcosmetics_10054_unitedstates_2840' 6 | DATASET = 'trends_data' 7 | 8 | bqclient = bigquery.Client( 9 | project=PROJECT_ID, 10 | # location=LOCATION 11 | ) 12 | 13 | def get_term_set(project_id=PROJECT_ID, 14 | dataset=DATASET, 15 | table_id=TABLE_ID): 16 | ''' 17 | Simple function to get the unique, sorted terms in the table 18 | ''' 19 | query = f""" 20 | SELECT distinct 21 | term 22 | FROM `{project_id}.{dataset}.{table_id}` 23 | order by 1 24 | """ 25 | 26 | df = bqclient.query(query = query).to_dataframe() 27 | return df["term"].to_list() 28 | 29 | 30 | def pull_term_data_from_bq(term: tuple = ('mascara', 'makeup'), 31 | project_id=PROJECT_ID, 32 | dataset=DATASET, 33 | table_id=TABLE_ID): 34 | ''' 35 | pull terms based on `in` sql clause from term 36 | takes a tuple of terms (str) and produces pandas dataset 37 | ''' 38 | query = f""" 39 | SELECT 40 | cast(date AS DATE FORMAT 'YYYY-MM-DD') as date_field, 41 | term, 42 | score 43 | FROM `{project_id}.{dataset}.{table_id}` 44 | WHERE 45 | term in {term} 46 | order by term, 1 47 | """ 48 | 49 | df = bqclient.query(query = query).to_dataframe() 50 | return df -------------------------------------------------------------------------------- /zeitghost/ts_embedding/kats_embedding_tools.py: -------------------------------------------------------------------------------- 1 | from sklearn.preprocessing import MinMaxScaler 2 | from kats.consts import TimeSeriesData 3 | from kats.tsfeatures.tsfeatures import TsFeatures 4 | import pandas as pd 5 | import numpy as np 6 | #from zeitghost.ts_embedding.bq_data_tools import pull_term_data_from_bq 7 | from zeitghost.bigquery.BigQueryAccessor import BigQueryAccessor 8 | from decimal import Decimal 9 | 10 | 11 | 12 | # https://stackoverflow.com/questions/434287/how-to-iterate-over-a-list-in-chunks 13 | def chunker(seq, size): 14 | return (seq[pos:pos + size] for pos in range(0, len(seq), size)) 15 | 16 | SLIDING_WINDOW_SIZE = 30 #n months for chunking - complete examples only used 17 | STEP = 1 #step - default to 1 18 | 19 | 20 | def write_embeddings_to_disk(term_chunk, filename='data/ts_embeddings.jsonl'): 21 | ''' 22 | this funciton takes a chunk of n_terms (see chunker for input) 23 | and writes to 
`filename` a jsonl file compliant with 24 | matching engine 25 | ''' 26 | term_data = pull_term_data_from_bq(tuple(term_chunk)) 27 | #run through by term 28 | for term in term_chunk: 29 | # emb_pair = get_feature_embedding_for_window(term_data[term_data.term == term], term) 30 | # ts_emedding_pairs.append(emb_pair) 31 | wdf = windows(term_data[term_data.term == term], SLIDING_WINDOW_SIZE, STEP) 32 | for window, new_df in wdf.groupby(level=0): 33 | # print(window, new_df) 34 | if new_df.shape[0] == SLIDING_WINDOW_SIZE: #full examples only 35 | emb_pair = get_feature_embedding_for_window(new_df, term) 36 | label, emb = emb_pair 37 | formatted_emb = '{"id":"' + str(label) + '","embedding":[' + ",".join(str(x) for x in list(emb)) + ']}' 38 | with open(filename, 'a') as f: 39 | f.write(formatted_emb) 40 | f.write("\n") 41 | f.close() 42 | 43 | def windows(data, window_size, step): 44 | ''' 45 | creates slices of the time series used for 46 | creating embeddings 47 | ''' 48 | r = np.arange(len(data)) 49 | s = r[::step] 50 | z = list(zip(s, s + window_size)) 51 | f = '{0[0]}:{0[1]}'.format 52 | g = lambda t: data.iloc[t[0]:t[1]] 53 | return pd.concat(map(g, z), keys=map(f, z)) 54 | 55 | def get_feature_embedding_for_window(df, term): 56 | ''' 57 | this takes a df with schema of type `date_field` and `score` to create an embeddding 58 | takes 30 weeks of historical timeseries data 59 | ''' 60 | ts_name = f"{term}_{str(df.date_field.min())}_{str(df.date_field.max())}" 61 | scaler=MinMaxScaler() 62 | df[['score']] = scaler.fit_transform(df[['score']]) 63 | scores = df[['score']].values.tolist() 64 | flat_values = [item for sublist in scores for item in sublist] 65 | df = df.rename(columns={"date_field":"time"}) 66 | ts_df = pd.DataFrame({'time':df.time, 67 | 'score':flat_values}) 68 | ts_df.drop_duplicates(keep='first', inplace=True) 69 | 70 | # Use Kats to extract features for the time window 71 | try: 72 | if not (len(np.unique(ts_df.score.tolist())) == 1 \ 73 | or len(np.unique(ts_df.score.tolist())) == 0): 74 | timeseries = TimeSeriesData(ts_df) 75 | features = TsFeatures().transform(timeseries) 76 | feature_list = [float(v) if not pd.isnull(v) else float(0) for _, v in features.items()] 77 | if Decimal('Infinity') in feature_list or Decimal('-Infinity') in feature_list: 78 | return None 79 | return (ts_name, feature_list) 80 | except np.linalg.LinAlgError as e: 81 | print(f"Can't process {ts_name}:{e}") 82 | return None 83 | 84 | def chunks(iterable, batch_size=100): 85 | it = iter(iterable) 86 | chunk = tuple(itertools.islice(it, batch_size)) 87 | while chunk: 88 | yield chunk 89 | chunk = tuple(itertools.islice(it, batch_size)) -------------------------------------------------------------------------------- /zeitghost/vertex/Embeddings.py: -------------------------------------------------------------------------------- 1 | from langchain.embeddings.base import Embeddings 2 | from typing import List 3 | from zeitghost.vertex.Helpers import rate_limit, _get_api_key, VertexModels 4 | from vertexai.preview.language_models import TextEmbeddingModel 5 | 6 | 7 | class VertexEmbeddings(Embeddings): 8 | """ 9 | Helper class for getting document embeddings 10 | """ 11 | model: TextEmbeddingModel 12 | project_id: str 13 | location: str 14 | requests_per_minute: int 15 | _api_key: str 16 | 17 | def __init__(self 18 | , project_id='cpg-cdp' 19 | , location='us-central1' 20 | , model=VertexModels.MODEL_EMBEDDING_GECKO.value 21 | , requests_per_minute=15): 22 | """ 23 | :param project_id: str 24 | Google 
Cloud Project ID 25 | :param location: str 26 | Google Cloud Location 27 | :param model: str 28 | LLM Embedding Model name 29 | :param requests_per_minute: int 30 | Rate Limiter for managing API limits 31 | """ 32 | super().__init__() 33 | 34 | self.model = TextEmbeddingModel.from_pretrained(model) 35 | self.project_id = project_id 36 | self.location = location 37 | self.requests_per_minute = requests_per_minute 38 | # self._api_key = _get_api_key() 39 | 40 | def _call_llm_embedding(self, prompt: str) -> List[List[float]]: 41 | """ 42 | Retrieve embeddings from the embeddings llm 43 | 44 | :param prompt: str 45 | Document to retrieve embeddings 46 | 47 | :return: List[List[float]] 48 | """ 49 | embeddings = self.model.get_embeddings([prompt]) 50 | embeddings = [e.values for e in embeddings] #list of list 51 | return embeddings 52 | 53 | def embed_documents(self, texts: List[str]) -> List[List[float]]: 54 | """ 55 | Retrieve embeddings for a list of documents 56 | 57 | :param texts: List[str] 58 | List of documents for embedding 59 | 60 | :return: List[List[float] 61 | """ 62 | # print(f"Setting requests per minute limit: {self.requests_per_minute}\n") 63 | limiter = rate_limit(self.requests_per_minute) 64 | results = [] 65 | for doc in texts: 66 | chunk = self.embed_query(doc) 67 | results.append(chunk) 68 | rate_limit(self.requests_per_minute) 69 | next(limiter) 70 | return results 71 | 72 | def embed_query(self, text) -> List[float]: 73 | """ 74 | Retrieve embeddings for a singular document 75 | 76 | :param text: str 77 | Singleton document 78 | 79 | :return: List[float] 80 | """ 81 | single_result = self._call_llm_embedding(text) 82 | # single_result = self.embed_documents([text]) 83 | return single_result[0] #should be a singleton list 84 | -------------------------------------------------------------------------------- /zeitghost/vertex/Helpers.py: -------------------------------------------------------------------------------- 1 | from google.cloud import secretmanager 2 | from decouple import config 3 | import time 4 | import os 5 | from enum import Enum 6 | from google.protobuf import struct_pb2 7 | from langchain import PromptTemplate 8 | 9 | _SECRET_ID = 'projects/939655404703/secrets/genai-key' 10 | _SECRET_VERSION = '{}/versions/1'.format(_SECRET_ID) 11 | project = os.environ.get('PROJECT_ID') 12 | 13 | 14 | def _get_api_key() -> str: 15 | """ 16 | Retrieve API key from Secret Manager 17 | 18 | :return: str 19 | """ 20 | sm_client = secretmanager.SecretManagerServiceClient() 21 | name = sm_client.secret_path(project, _SECRET_ID) 22 | response = sm_client.access_secret_version(request={"name": _SECRET_VERSION}) 23 | 24 | return response.payload.data.decode("UTF-8") 25 | 26 | 27 | def rate_limit(max_per_minute=15) -> None: 28 | """ 29 | Rate Limiter to manage API limitations 30 | 31 | :param max_per_minute: 32 | 33 | :return: yield 34 | """ 35 | period = 60 / max_per_minute 36 | # print('Waiting') 37 | while True: 38 | before = time.time() 39 | yield 40 | after = time.time() 41 | elapsed = after - before 42 | sleep_time = max(0, period - elapsed) 43 | if sleep_time > 0: 44 | # print(f'Sleeping {sleep_time:.1f} seconds') 45 | print('.', end='') 46 | time.sleep(sleep_time) 47 | 48 | 49 | class VertexModels(Enum): 50 | """ 51 | Enumeration of available LLM models 52 | """ 53 | MODEL_EMBEDDING_GECKO = "textembedding-gecko@001" 54 | MODEL_TEXT_BISON_001 = "text-bison@001" 55 | MODEL_CHAT_BISON_001 = "chat-bison-001" #TODO UPDATE WITH SDK 56 | MODEL_CODE_BISON_001 = 'code-bison-001' 
#TODO UPDATE WITH SDK 57 | MODEL_TEXT_BISON_ALPHA = "text-bison-alpha" #TODO UPDATE WITH SDK 58 | 59 | 60 | # prompt: python stop a string at first occurance of word in list of words 61 | 62 | # Sure, here is the Python code to stop a string at the first occurrence of a word in a list of words: 63 | 64 | # ```python 65 | def stop_string_at_first_occurrence(string, words): 66 | """ 67 | Stops a string at the first occurrence of a word in a list of words. 68 | 69 | Args: 70 | string: The string to stop. 71 | words: A list of words to stop the string at. 72 | 73 | Returns: 74 | The string, stopped at the first occurrence of a word in the list. 75 | """ 76 | 77 | for word in words: 78 | if word in string: 79 | return string.partition(word)[0] 80 | 81 | return string 82 | # ``` 83 | 84 | # Here is an example of how to use the `stop_string_at_first_occurrence()` function: 85 | 86 | # ```python 87 | # string = "This is a string with the words 'stop' and 'word'." 88 | # words = ["stop", "word"] 89 | 90 | # print(stop_string_at_first_occurrence(string, words)) 91 | # ``` 92 | 93 | # This will print the following output to the console: 94 | 95 | # ``` 96 | # This is a string with the words 'stop'. 97 | # ``` 98 | 99 | 100 | def _build_index_config(embedding_gcs_uri: str, dimensions: int): 101 | _treeAhConfig = struct_pb2.Struct( 102 | fields={ 103 | "leafNodeEmbeddingCount": struct_pb2.Value(number_value=500), 104 | "leafNodesToSearchPercent": struct_pb2.Value(number_value=7), 105 | } 106 | ) 107 | _algorithmConfig = struct_pb2.Struct( 108 | fields={"treeAhConfig": struct_pb2.Value(struct_value=_treeAhConfig)} 109 | ) 110 | _config = struct_pb2.Struct( 111 | fields={ 112 | "dimensions": struct_pb2.Value(number_value=dimensions), 113 | "approximateNeighborsCount": struct_pb2.Value(number_value=150), 114 | "distanceMeasureType": struct_pb2.Value(string_value="DOT_PRODUCT_DISTANCE"), 115 | "algorithmConfig": struct_pb2.Value(struct_value=_algorithmConfig), 116 | "shardSize": struct_pb2.Value(string_value="SHARD_SIZE_SMALL"), 117 | } 118 | ) 119 | metadata = struct_pb2.Struct( 120 | fields={ 121 | "config": struct_pb2.Value(struct_value=_config), 122 | "contentsDeltaUri": struct_pb2.Value(string_value=embedding_gcs_uri), 123 | } 124 | ) 125 | 126 | return metadata 127 | 128 | map_prompt_template = """ 129 | Write a concise summary of the following: 130 | 131 | {text} 132 | 133 | CONSCISE SUMMARY: 134 | """ 135 | map_prompt = PromptTemplate( 136 | template=map_prompt_template 137 | , input_variables=["text"] 138 | ) 139 | 140 | combine_prompt_template = """ 141 | Write a concise summary of the following: 142 | 143 | {text} 144 | 145 | CONSCISE SUMMARY IN BULLET POINTS: 146 | """ 147 | combine_prompt = PromptTemplate( 148 | template=combine_prompt_template 149 | , input_variables=["text"] 150 | ) 151 | 152 | 153 | class ResourceNotExistException(Exception): 154 | def __init__(self, resource: str, message="Resource Does Not Exist."): 155 | self.resource = resource 156 | self.message = message 157 | super().__init__(self.message) 158 | -------------------------------------------------------------------------------- /zeitghost/vertex/LLM.py: -------------------------------------------------------------------------------- 1 | from typing import List, Optional 2 | from zeitghost.vertex.Helpers import VertexModels, stop_string_at_first_occurrence 3 | from langchain.llms.base import LLM 4 | from vertexai.preview.language_models import TextGenerationModel 5 | 6 | 7 | class VertexLLM(LLM): 8 | """ 9 | A class to 
Vertex LLM model that fits in the langchain framework 10 | this extends the langchain.llms.base.LLM class 11 | """ 12 | model: TextGenerationModel 13 | predict_kwargs: dict 14 | model_source: str 15 | stop: Optional[List[str]] 16 | strip: bool 17 | strip_chars: List[str] 18 | 19 | def __init__(self 20 | , stop: Optional[List[str]] 21 | , strip: bool = False 22 | , strip_chars: List[str] = ['{','}','\n'] 23 | , model_source=VertexModels.MODEL_TEXT_BISON_001.value 24 | , **predict_kwargs 25 | ): 26 | """ 27 | :param model_source: str 28 | Name of LLM model to interact with 29 | :param endpoint: str 30 | Endpoint information for HTTP calls 31 | :param project: str 32 | Google Cloud Project ID 33 | :param location: str 34 | Google Cloud Location 35 | """ 36 | super().__init__(model=TextGenerationModel.from_pretrained(model_source) 37 | , strip=strip 38 | , strip_chars=strip_chars 39 | , predict_kwargs=predict_kwargs 40 | , model_source=VertexModels.MODEL_TEXT_BISON_001.value 41 | ) 42 | self.model = TextGenerationModel.from_pretrained(model_source) 43 | self.stop = stop 44 | self.model_source = model_source 45 | self.predict_kwargs = predict_kwargs 46 | self.strip = strip 47 | self.strip_chars = strip_chars 48 | 49 | @property 50 | def _llm_type(self): 51 | return 'vertex' 52 | 53 | @property 54 | def _identifying_params(self): 55 | return {} 56 | 57 | def _trim_output(self, raw_results: str) -> str: 58 | ''' 59 | utility function to strip out brackets and other non useful info 60 | ''' 61 | for char in self.strip_chars: 62 | raw_results = raw_results.replace(char, '') 63 | return raw_results 64 | 65 | def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str: 66 | """ 67 | Wrapper around predict. 68 | Has special handling for SQL response formatting. 
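        The prompt is cast to ``str`` and truncated to its first 7,999 characters
        before being sent to the model. The ``stop`` argument is ignored in favour of
        the instance-level ``self.stop``: when stop sequences are configured, the
        response is cut at the first occurrence (via stop_string_at_first_occurrence),
        and braces/newlines are removed from the result when ``strip`` is True.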
69 | 70 | :param prompt: 71 | :return: str 72 | """ 73 | stop = self.stop 74 | prompt = str(prompt) 75 | prompt = prompt[:7999] #trimming the first chars to avoid issue 76 | result = str(self.model.predict(prompt, **self.predict_kwargs)) 77 | if stop is not None: 78 | result = str(stop_string_at_first_occurrence(result, self.stop)) #apply stopwords 79 | if self.strip: 80 | return str(self._trim_output(result)) 81 | else: 82 | return str(result) 83 | 84 | def _acall(self, prompt: str, stop: Optional[List[str]] = None) -> str: 85 | result = str(self.model.predict(prompt, **self.predict_kwargs)) 86 | stop = self.stop 87 | if stop: 88 | result = str(stop_string_at_first_occurrence(result, self.stop)) #apply stopwords 89 | return str(result) 90 | -------------------------------------------------------------------------------- /zeitghost/vertex/MatchingEngineCRUD.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | import time 3 | import logging 4 | from google.cloud import aiplatform_v1 as aipv1 5 | from google.cloud.aiplatform_v1 import CreateIndexEndpointRequest 6 | from google.cloud.aiplatform_v1.types.index import Index 7 | from google.cloud.aiplatform_v1.types.index_endpoint import IndexEndpoint 8 | from google.cloud.aiplatform_v1.types.index_endpoint import DeployedIndex 9 | from zeitghost.vertex.Helpers import _build_index_config, ResourceNotExistException 10 | from google.protobuf import struct_pb2 11 | from typing import List 12 | 13 | logging.basicConfig(level=logging.INFO) 14 | logger = logging.getLogger() 15 | 16 | 17 | class MatchingEngineCRUD: 18 | def __init__( 19 | self 20 | , project_id: str 21 | , region: str 22 | , project_num: int 23 | , index_name: str = None 24 | , vpc_network_name: str = None 25 | ): 26 | self.project_id = project_id 27 | self.project_num = project_num 28 | self.region = region 29 | self.index_name = index_name if index_name is not None else None 30 | self.vpc_network_name = vpc_network_name if vpc_network_name is not None else None 31 | 32 | self.index_endpoint_name = f"{self.index_name}_endpoint" if self.index_name is not None else None 33 | self.PARENT = f"projects/{self.project_num}/locations/{self.region}" 34 | 35 | ENDPOINT = f"{self.region}-aiplatform.googleapis.com" 36 | 37 | # set index client 38 | self.index_client = aipv1.IndexServiceClient( 39 | client_options=dict(api_endpoint=ENDPOINT) 40 | ) 41 | # set index endpoint client 42 | self.index_endpoint_client = aipv1.IndexEndpointServiceClient( 43 | client_options=dict(api_endpoint=ENDPOINT) 44 | ) 45 | 46 | def _set_index_name(self, index_name: str) -> None: 47 | """ 48 | 49 | :param index_name: 50 | :return: 51 | """ 52 | self.index_name = index_name 53 | 54 | def _set_index_endpoint_name(self, index_endpoint_name: str = None) -> None: 55 | """ 56 | 57 | :param index_endpoint_name: 58 | :return: 59 | """ 60 | if index_endpoint_name is not None: 61 | self.index_endpoint_name = index_endpoint_name 62 | elif self.index_name is not None: 63 | self.index_endpoint_name = f"{self.index_name}_endpoint" 64 | else: 65 | raise ResourceNotExistException("index") 66 | 67 | def _get_index(self) -> Index: 68 | """ 69 | 70 | :return: 71 | """ 72 | # Check if index exists 73 | if self.index_name is not None: 74 | indexes = [ 75 | index.name for index in self.list_indexes() 76 | if index.display_name == self.index_name 77 | ] 78 | else: 79 | raise ResourceNotExistException("index") 80 | 81 | if len(indexes) == 0: 82 | return None 83 | else: 
84 | index_id = indexes[0] 85 | request = aipv1.GetIndexRequest(name=index_id) 86 | index = self.index_client.get_index(request=request) 87 | return index 88 | 89 | def _get_index_endpoint(self) -> IndexEndpoint: 90 | """ 91 | 92 | :return: 93 | """ 94 | # Check if index endpoint exists 95 | if self.index_endpoint_name is not None: 96 | index_endpoints = [ 97 | response.name for response in self.list_index_endpoints() 98 | if response.display_name == self.index_endpoint_name 99 | ] 100 | else: 101 | raise ResourceNotExistException("index_endpoint") 102 | 103 | if len(index_endpoints) == 0: 104 | logging.info(f"Could not find index endpoint: {self.index_endpoint_name}") 105 | return None 106 | else: 107 | index_endpoint_id = index_endpoints[0] 108 | index_endpoint = self.index_endpoint_client.get_index_endpoint( 109 | name=index_endpoint_id 110 | ) 111 | return index_endpoint 112 | 113 | def list_indexes(self) -> List[Index]: 114 | """ 115 | 116 | :return: 117 | """ 118 | request = aipv1.ListIndexesRequest(parent=self.PARENT) 119 | page_result = self.index_client.list_indexes(request=request) 120 | indexes = [ 121 | response for response in page_result 122 | ] 123 | return indexes 124 | 125 | def list_index_endpoints(self) -> List[IndexEndpoint]: 126 | """ 127 | 128 | :return: 129 | """ 130 | request = aipv1.ListIndexEndpointsRequest(parent=self.PARENT) 131 | page_result = self.index_endpoint_client.list_index_endpoints(request=request) 132 | index_endpoints = [ 133 | response for response in page_result 134 | ] 135 | return index_endpoints 136 | 137 | def list_deployed_indexes( 138 | self 139 | , endpoint_name: str = None 140 | ) -> List[DeployedIndex]: 141 | """ 142 | 143 | :param endpoint_name: 144 | :return: 145 | """ 146 | try: 147 | if endpoint_name is not None: 148 | self._set_index_endpoint_name(endpoint_name) 149 | index_endpoint = self._get_index_endpoint() 150 | deployed_indexes = index_endpoint.deployed_indexes 151 | except ResourceNotExistException as rnee: 152 | raise rnee 153 | 154 | return list(deployed_indexes) 155 | 156 | def create_index( 157 | self 158 | , embedding_gcs_uri: str 159 | , dimensions: int 160 | , index_name: str = None 161 | ) -> Index: 162 | """ 163 | 164 | :param index_name: 165 | :param embedding_gcs_uri: 166 | :param dimensions: 167 | :return: 168 | """ 169 | if index_name is not None: 170 | self._set_index_name(index_name) 171 | # Get index 172 | if self.index_name is None: 173 | raise ResourceNotExistException("index") 174 | index = self._get_index() 175 | # Create index if does not exists 176 | if index: 177 | logger.info(f"Index {self.index_name} already exists with id {index.name}") 178 | else: 179 | logger.info(f"Index {self.index_name} does not exists. Creating index ...") 180 | 181 | metadata = _build_index_config( 182 | embedding_gcs_uri=embedding_gcs_uri 183 | , dimensions=dimensions 184 | ) 185 | 186 | index_request = { 187 | "display_name": self.index_name, 188 | "description": "Index for LangChain demo", 189 | "metadata": struct_pb2.Value(struct_value=metadata), 190 | "index_update_method": aipv1.Index.IndexUpdateMethod.STREAM_UPDATE, 191 | } 192 | 193 | r = self.index_client.create_index( 194 | parent=self.PARENT, 195 | index=Index(index_request) 196 | ) 197 | 198 | # Poll the operation until it's done successfully. 
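            # `r` is a google.api_core long-running operation object; the loop below
            # polls r.done() every 60 seconds. A blocking alternative (not used here)
            # would be `index = r.result()`, which waits until the LRO completes.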
199 | logging.info("Poll the operation to create index ...") 200 | while True: 201 | if r.done(): 202 | break 203 | time.sleep(60) 204 | print('.', end='') 205 | 206 | index = r.result() 207 | logger.info(f"Index {self.index_name} created with resource name as {index.name}") 208 | 209 | return index 210 | 211 | # TODO: this is generating an error about publicEndpointEnabled not being set without network 212 | def create_index_endpoint( 213 | self 214 | , endpoint_name: str = None 215 | , network: str = None 216 | ) -> IndexEndpoint: 217 | """ 218 | 219 | :param endpoint_name: 220 | :param network: 221 | :return: 222 | """ 223 | try: 224 | if endpoint_name is not None: 225 | self._set_index_endpoint_name(endpoint_name) 226 | # Get index endpoint if exists 227 | index_endpoint = self._get_index_endpoint() 228 | 229 | # Create Index Endpoint if does not exists 230 | if index_endpoint is not None: 231 | logger.info("Index endpoint already exists") 232 | else: 233 | logger.info(f"Index endpoint {self.index_endpoint_name} does not exists. Creating index endpoint...") 234 | index_endpoint_request = { 235 | "display_name": self.index_endpoint_name 236 | } 237 | index_endpoint = IndexEndpoint(index_endpoint_request) 238 | if network is not None: 239 | index_endpoint.network = network 240 | else: 241 | index_endpoint.public_endpoint_enabled = True 242 | index_endpoint.publicEndpointEnabled = True 243 | r = self.index_endpoint_client.create_index_endpoint( 244 | parent=self.PARENT, 245 | index_endpoint=index_endpoint 246 | ) 247 | 248 | logger.info("Poll the operation to create index endpoint ...") 249 | while True: 250 | if r.done(): 251 | break 252 | time.sleep(60) 253 | print('.', end='') 254 | 255 | index_endpoint = r.result() 256 | except Exception as e: 257 | logger.error(f"Failed to create index endpoint {self.index_endpoint_name}") 258 | raise e 259 | 260 | return index_endpoint 261 | 262 | def deploy_index( 263 | self 264 | , index_name: str = None 265 | , endpoint_name: str = None 266 | , machine_type: str = "e2-standard-2" 267 | , min_replica_count: int = 2 268 | , max_replica_count: int = 2 269 | ) -> IndexEndpoint: 270 | """ 271 | 272 | :param endpoint_name: 273 | :param index_name: 274 | :param machine_type: 275 | :param min_replica_count: 276 | :param max_replica_count: 277 | :return: 278 | """ 279 | if index_name is not None: 280 | self._set_index_name(index_name) 281 | if endpoint_name is not None: 282 | self._set_index_endpoint_name(endpoint_name) 283 | 284 | index = self._get_index() 285 | index_endpoint = self._get_index_endpoint() 286 | # Deploy Index to endpoint 287 | try: 288 | # Check if index is already deployed to the endpoint 289 | if index.name in index_endpoint.deployed_indexes: 290 | logger.info(f"Skipping deploying Index. 
Index {self.index_name}" + 291 | f"already deployed with id {index.name} to the index endpoint {self.index_endpoint_name}") 292 | return index_endpoint 293 | 294 | timestamp = datetime.now().strftime("%Y%m%d%H%M%S") 295 | deployed_index_id = f"{self.index_name.replace('-', '_')}_{timestamp}" 296 | deploy_index = { 297 | "id": deployed_index_id, 298 | "display_name": deployed_index_id, 299 | "index": index.name, 300 | "dedicated_resources": { 301 | "machine_spec": { 302 | "machine_type": machine_type, 303 | }, 304 | "min_replica_count": min_replica_count, 305 | "max_replica_count": max_replica_count 306 | } 307 | } 308 | logger.info(f"Deploying index with request = {deploy_index}") 309 | r = self.index_endpoint_client.deploy_index( 310 | index_endpoint=index_endpoint.name, 311 | deployed_index=DeployedIndex(deploy_index) 312 | ) 313 | 314 | # Poll the operation until it's done successfullly. 315 | logger.info("Poll the operation to deploy index ...") 316 | while True: 317 | if r.done(): 318 | break 319 | time.sleep(60) 320 | print('.', end='') 321 | 322 | logger.info(f"Deployed index {self.index_name} to endpoint {self.index_endpoint_name}") 323 | 324 | except Exception as e: 325 | logger.error(f"Failed to deploy index {self.index_name} to the index endpoint {self.index_endpoint_name}") 326 | raise e 327 | 328 | return index_endpoint 329 | 330 | def get_index_and_endpoint(self) -> (str, str): 331 | """ 332 | 333 | :return: 334 | """ 335 | # Get index id if exists 336 | index = self._get_index() 337 | index_id = index.name if index else '' 338 | 339 | # Get index endpoint id if exists 340 | index_endpoint = self._get_index_endpoint() 341 | index_endpoint_id = index_endpoint.name if index_endpoint else '' 342 | 343 | return index_id, index_endpoint_id 344 | 345 | def delete_index( 346 | self 347 | , index_name: str = None 348 | ) -> str: 349 | """ 350 | :param index_name: str 351 | :return: 352 | """ 353 | if index_name is not None: 354 | self._set_index_name(index_name) 355 | # Check if index exists 356 | index = self._get_index() 357 | 358 | # create index if does not exists 359 | if index: 360 | # Delete index 361 | index_id = index.name 362 | logger.info(f"Deleting Index {self.index_name} with id {index_id}") 363 | self.index_client.delete_index(name=index_id) 364 | return f"index {index_id} deleted." 
365 | else: 366 | raise ResourceNotExistException(f"{self.index_name}") 367 | 368 | def undeploy_index( 369 | self 370 | , index_name: str 371 | , endpoint_name: str 372 | ): 373 | """ 374 | 375 | :param index_name: 376 | :param endpoint_name: 377 | :return: 378 | """ 379 | logger.info(f"Undeploying index with id {index_name} from Index endpoint {endpoint_name}") 380 | endpoint_id = f"{self.PARENT}/indexEndpoints/{endpoint_name}" 381 | r = self.index_endpoint_client.undeploy_index( 382 | index_endpoint=endpoint_id 383 | , deployed_index_id=index_name 384 | ) 385 | response = r.result() 386 | logger.info(response) 387 | return response.display_name 388 | 389 | def delete_index_endpoint( 390 | self 391 | , index_endpoint_name: str = None 392 | ) -> str: 393 | """ 394 | 395 | :param index_endpoint_name: str 396 | :return: 397 | """ 398 | if index_endpoint_name is not None: 399 | self._set_index_endpoint_name(index_endpoint_name) 400 | # Check if index endpoint exists 401 | index_endpoint = self._get_index_endpoint() 402 | 403 | # Delete the index endpoint (and its deployed indexes) if it exists 404 | if index_endpoint is not None: 405 | logger.info( 406 | f"Index endpoint {self.index_endpoint_name} exists with resource " + 407 | f"name as {index_endpoint.name}" #+ 408 | # f"{index_endpoint.public_endpoint_domain_name}") 409 | ) 410 | 411 | #index_endpoint_id = index_endpoint.name 412 | #index_endpoint = self.index_endpoint_client.get_index_endpoint( 413 | # name=index_endpoint.name 414 | #) 415 | 416 | # Undeploy existing indexes 417 | for d_index in index_endpoint.deployed_indexes: 418 | self.undeploy_index( 419 | index_name=d_index.id 420 | , endpoint_name=self.index_endpoint_name 421 | ) 422 | 423 | # Delete index endpoint 424 | logger.info(f"Deleting Index endpoint {self.index_endpoint_name} with id {index_endpoint.name}") 425 | self.index_endpoint_client.delete_index_endpoint(name=index_endpoint.name) 426 | return f"Index endpoint {index_endpoint.name} deleted." 
427 | else: 428 | raise ResourceNotExistException(f"{self.index_endpoint_name}") 429 | -------------------------------------------------------------------------------- /zeitghost/vertex/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__init__.py -------------------------------------------------------------------------------- /zeitghost/vertex/__pycache__/Embeddings.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/Embeddings.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/vertex/__pycache__/Helpers.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/Helpers.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/vertex/__pycache__/LLM.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/LLM.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/vertex/__pycache__/MatchingEngineCRUD.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/MatchingEngineCRUD.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/vertex/__pycache__/MatchingEngineVectorstore.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/MatchingEngineVectorstore.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/vertex/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/agents/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/agents/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/agents/__pycache__/models.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/agents/__pycache__/models.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/celery/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/celery/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/celery/__pycache__/models.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/celery/__pycache__/models.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/gdelt/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/gdelt/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/gdelt/__pycache__/models.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/gdelt/__pycache__/models.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/llm/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/llm/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/llm/__pycache__/models.cpython-311.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/llm/__pycache__/models.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/vectorstore/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/vectorstore/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/blueprints/vectorstore/__pycache__/models.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/vectorstore/__pycache__/models.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/celery/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/celery/__pycache__/__init__.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/celery/__pycache__/gdelt_tasks.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/celery/__pycache__/gdelt_tasks.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/celery/__pycache__/vertex_tasks.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/celery/__pycache__/vertex_tasks.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/webserver/celery/__pycache__/worker.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/celery/__pycache__/worker.cpython-311.pyc -------------------------------------------------------------------------------- /zeitghost/zeitghost-trendspotting.iml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | --------------------------------------------------------------------------------
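
The classes above are typically wired together roughly as follows. This is only an illustrative sketch based on the constructors and methods shown in this repo (`BigQueryAccessor`, `GdeltData`, `VertexLLM`, `LangchainAgent`); the project, keyword and prompt values are placeholders rather than values taken from the repo, and a Vertex AI-enabled environment with application-default credentials is assumed.

```python
# Illustrative only: placeholder project / keyword values, not taken from the repo.
from zeitghost.bigquery.BigQueryAccessor import BigQueryAccessor
from zeitghost.gdelt.GdeltData import GdeltData
from zeitghost.vertex.LLM import VertexLLM
from zeitghost.agents.LangchainAgent import LangchainAgent

PROJECT_ID = 'your-project-id'   # placeholder project
KEYWORD = 'estee lauder'         # GDELT actor keyword to listen for

# 1. Pull GDELT event records mentioning the keyword from BigQuery.
accessor = BigQueryAccessor(PROJECT_ID)
gdelt_df = accessor.get_records_from_actor_keyword_df(KEYWORD)

# 2. Download and parse the underlying articles, then persist them to BigQuery.
gdelt = GdeltData(gdelt_df, project=PROJECT_ID)
print(gdelt.write_to_bq())  # returns "<project>:<dataset>.<table>"

# 3. Build a Vertex-backed LLM and a LangChain pandas agent over the parsed articles.
llm = VertexLLM(stop=['Observation:'])
agent = LangchainAgent(llm)
pandas_agent = agent.get_pandas_agent(gdelt.chunk_df)
print(pandas_agent.run('Which domains published the most articles? Keep the answer short.'))
```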