├── .DS_Store
├── README.md
├── img
│   ├── deep-retriever.png
│   ├── makeup-schema-bq.png
│   ├── poor_output_formatting.png
│   ├── zietghost_arch.png
│   ├── zietghost_concept.png
│   ├── zietghost_marketer.png
│   ├── zietghost_process.png
│   └── zietghost_title.png
├── notebooks
│   ├── .DS_Store
│   ├── .env
│   ├── .env.sample
│   ├── 00-env-setup.ipynb
│   ├── 01-setup-vertex-vector-store.ipynb
│   ├── 02-gdelt-data-ops.ipynb
│   ├── 03-vector-store-index-loader.ipynb
│   ├── 03a-optional-chunk-up-the-docs.ipynb
│   ├── 04-build-zeitghost-image.ipynb
│   ├── 05-gdelt-pipelines.ipynb
│   ├── 06-plan-and-execute-agents.ipynb
│   ├── 07-streamlit-ui-plan-and-execute.py
│   ├── imgs
│   │   ├── add-gdelt-to-bq.gif
│   │   ├── agent_plan_execute_chain_output.png
│   │   ├── architecture.png
│   │   ├── chunk_bq_tables_flow.png
│   │   ├── chunk_gcs_blobs_flow.png
│   │   ├── chunk_youtube_flow.png
│   │   ├── da-ui.png
│   │   ├── deep-retriever.png
│   │   ├── fullarchitecture.png
│   │   ├── google-trends-explore.png
│   │   ├── info-architecture.png
│   │   ├── langchain-diagram.png
│   │   ├── langchain-overview.png
│   │   ├── langchain_intro.png
│   │   ├── pipeline-complete.png
│   │   ├── pipeline_metadata.png
│   │   ├── plan-execute-example-output.png
│   │   ├── public_trends_data.png
│   │   ├── user-flow-plan-execute.png
│   │   ├── zghost_overview.png
│   │   ├── zghost_overview_ME.png
│   │   ├── zghost_overview_agents.png
│   │   ├── zghost_overview_gdelt.png
│   │   ├── zghost_overview_load_index.png
│   │   └── zghost_overview_pipeline_steps.png
│   └── requirements.txt
├── streamlit_agent
│   ├── __init__.py
│   ├── callbacks
│   │   ├── __init__.py
│   │   └── capturing_callback_handler.py
│   └── clear_results.py
└── zeitghost
    ├── .dockerignore
    ├── .env
    ├── .idea
    │   ├── .gitignore
    │   ├── codeStyles
    │   │   ├── Project.xml
    │   │   └── codeStyleConfig.xml
    │   ├── misc.xml
    │   ├── modules.xml
    │   └── vcs.xml
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-311.pyc
    │   └── main.cpython-311.pyc
    ├── agents
    │   ├── Helpers.py
    │   ├── LangchainAgent.py
    │   ├── __init__.py
    │   └── __pycache__
    │       ├── Helpers.cpython-311.pyc
    │       ├── LangchainAgent.cpython-311.pyc
    │       └── __init__.cpython-311.pyc
    ├── bigquery
    │   ├── BigQueryAccessor.py
    │   ├── __init__.py
    │   └── __pycache__
    │       ├── BigQueryAccessor.cpython-311.pyc
    │       └── __init__.cpython-311.pyc
    ├── capturing_callback_handler.py
    ├── gdelt
    │   ├── GdeltData.py
    │   ├── Helpers.py
    │   ├── __init__.py
    │   └── __pycache__
    │       ├── GdeltData.cpython-311.pyc
    │       ├── Helpers.cpython-311.pyc
    │       └── __init__.cpython-311.pyc
    ├── testing
    │   ├── __init__.py
    │   └── basic_agent_unit_tests.py
    ├── ts_embedding
    │   ├── .ipynb_checkpoints
    │   │   └── kats_embedding_tools-checkpoint.py
    │   ├── bq_data_tools.py
    │   └── kats_embedding_tools.py
    ├── vertex
    │   ├── Embeddings.py
    │   ├── Helpers.py
    │   ├── LLM.py
    │   ├── MatchingEngineCRUD.py
    │   ├── MatchingEngineVectorstore.py
    │   ├── __init__.py
    │   └── __pycache__
    │       ├── Embeddings.cpython-311.pyc
    │       ├── Helpers.cpython-311.pyc
    │       ├── LLM.cpython-311.pyc
    │       ├── MatchingEngineCRUD.cpython-311.pyc
    │       ├── MatchingEngineVectorstore.cpython-311.pyc
    │       └── __init__.cpython-311.pyc
    ├── webserver
    │   ├── __pycache__
    │   │   └── __init__.cpython-311.pyc
    │   ├── blueprints
    │   │   ├── __pycache__
    │   │   │   └── __init__.cpython-311.pyc
    │   │   ├── agents
    │   │   │   └── __pycache__
    │   │   │       ├── __init__.cpython-311.pyc
    │   │   │       └── models.cpython-311.pyc
    │   │   ├── celery
    │   │   │   └── __pycache__
    │   │   │       ├── __init__.cpython-311.pyc
    │   │   │       └── models.cpython-311.pyc
    │   │   ├── gdelt
    │   │   │   └── __pycache__
    │   │   │       ├── __init__.cpython-311.pyc
    │   │   │       └── models.cpython-311.pyc
    │   │   ├── llm
    │   │   │   └── __pycache__
    │   │   │       ├── __init__.cpython-311.pyc
    │   │   │       └── models.cpython-311.pyc
    │   │   └── vectorstore
    │   │       └── __pycache__
    │   │           ├── __init__.cpython-311.pyc
    │   │           └── models.cpython-311.pyc
    │   └── celery
    │       └── __pycache__
    │           ├── __init__.cpython-311.pyc
    │           ├── gdelt_tasks.cpython-311.pyc
    │           ├── vertex_tasks.cpython-311.pyc
    │           └── worker.cpython-311.pyc
    └── zeitghost-trendspotting.iml
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/.DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Welcome to the Zeitghost - Build Your Own News & Media Listening Platform with a Conversational Agent
2 |
3 |
4 | # Update - 7.20.23 - Streamlit UI
5 |
  6 | A hardcoded example of running a UI application with Streamlit can now be found in [`notebooks/07-streamlit-ui-plan-and-execute.py`](notebooks/07-streamlit-ui-plan-and-execute.py).
7 |
8 | `pip install streamlit`
9 |
10 | Also, upgrade langchain and sqlalchemy-bigquery:
11 |
12 | `pip install -U langchain`
13 |
14 | `pip install -U sqlalchemy-bigquery`
15 |
16 | ### To run the UI:
17 |
18 | `cd notebooks`
19 |
 20 | `streamlit run 07-streamlit-ui-plan-and-execute.py`
21 |
 22 | Your browser will open with the UI.
23 |
24 |
25 |
26 |
27 |
28 | The repo contains a set of notebooks and helper classes to enable you to create a conversational agent with access to a variety of datasets and APIs (tools) to answer end-user questions.
29 |
 30 | By the final notebook, you'll have created an agent with access to the following tools (a minimal wiring sketch follows this list):
31 | * News and media dataset (GDELT) index comprised of global news related to your `ACTOR`
32 | * The [Google Trends](https://trends.google.com/trends/explore?hl=en) public dataset. This data source helps us understand what people are searching for, in real time. We can use this data to measure search interest in a particular topic, in a particular place, and at a particular time
33 | * [Google Search API Wrapper](https://developers.google.com/custom-search/v1/overview) - to retrieve and display search results from web searches
34 | * A calculator to help the LLM with math
35 | * An [SQL Database Agent](https://python.langchain.com/en/latest/modules/agents/toolkits/examples/sql_database.html) for interacting with SQL databases (e.g., BigQuery). As an example, an agent's plan may require it to search for trends in the Google Trends BigQuery table
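
A minimal sketch of wiring two of these tools into a single LangChain agent backed by Vertex AI is shown below. It assumes only the public LangChain and Vertex AI SDKs; the GDELT vector store and BigQuery trends tools built in the notebooks (via the helper classes under `zeitghost/`) can be appended to the same `tools` list.

```python
# Hedged sketch: combine a calculator and a Google Search tool into one agent.
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chains import LLMMathChain
from langchain.llms import VertexAI
from langchain.utilities import GoogleSearchAPIWrapper  # needs GOOGLE_CSE_ID / GOOGLE_API_KEY env vars

llm = VertexAI(temperature=0)
search = GoogleSearchAPIWrapper()
calculator = LLMMathChain.from_llm(llm=llm, verbose=True)

tools = [
    Tool(
        name="Search",
        func=search.run,
        description="useful for answering questions about current events",
    ),
    Tool(
        name="Calculator",
        func=calculator.run,
        description="useful for answering math questions",
    ),
]

# The GDELT vector store and BigQuery trends tools from the later notebooks
# are appended to `tools` in the same way before initializing the agent.
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("What was in the news this week, and what is 2 to the 10th power?")
```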
36 |
37 | ### **Below is an image showing the user flow interacting with the multi-tool agent using a plan-execute strategy**:
38 |
39 |
40 |
41 |
42 | ### **Here's an example of what the user output will look like in a notebook**:
43 |
44 |
45 |
46 |
47 |
48 | ## GDELT: A global database of society
49 |
50 | Google's mission is to organise the world's information and make it universally accessible and useful. Supported by Google Jigsaw, the [GDELT Project](https://www.gdeltproject.org/) monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
51 |
52 | Monitoring nearly the entire world's news media is only the beginning - even the largest team of humans could not begin to read and analyze the billions upon billions of words and images published each day. GDELT uses some of the world's most sophisticated computer algorithms, custom-designed for global news media, running on "one of the most powerful server networks in the known Universe", together with some of the world's most powerful deep learning algorithms, to create a realtime computable record of global society that can be visualized, analyzed, modeled, examined and even forecasted. A huge array of datasets totaling trillions of datapoints are available. Three primary data streams are created, one codifying physical activities around the world in over 300 categories, one recording the people, places, organizations, millions of themes and thousands of emotions underlying those events and their interconnections and one codifying the visual narratives of the world's news imagery.
53 |
54 | All three streams update every 15 minutes, offering near-realtime insights into the world around us. Underlying the streams are a vast array of sources, from hundreds of thousands of global media outlets to special collections like 215 years of digitized books, 21 billion words of academic literature spanning 70 years, human rights archives and even saturation processing of the raw closed captioning stream of almost 100 television stations across the US in collaboration with the Internet Archive's Television News Archive. Finally, also in collaboration with the Internet Archive, the Archive captures nearly all worldwide online news coverage monitored by GDELT each day into its permanent archive to ensure its availability for future generations even in the face of repressive forces that continue to erode press freedoms around the world.
55 |
56 | For more information on how to navigate the datasets - see the [GDELT 2.0 Data format codebook](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf)
57 |
 58 | **How to add GDELT 2.0 to your Google Cloud Project**
59 |
60 | 
61 |
62 | ## Large Data from GDELT needs Effective and Efficient Knowledge Retrieval
63 |
 64 | Simply accessing and storing the data isn't enough. Processing large queries of news sources from GDELT is easy to do with BigQuery; however, to accelerate exploration and discovery of the information, tools are needed to provide semantic search on the contents of the global news and media dataset.
65 |
 66 | [Vertex AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview) provides the industry's leading high-scale, low-latency vector database. These vector databases are commonly referred to as vector similarity-matching or approximate nearest neighbor (ANN) services. Vertex AI Matching Engine provides the ability to scale knowledge retrieval, paired with the new [Vertex AI Text Embeddings model](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). We use document chunking and [`newspaper3k`](https://pypi.org/project/newspaper3k/) to pull down the articles and convert the document passages into embeddings with the Vertex Embeddings API (via the Python SDK).
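
As a concrete illustration of that flow, the hedged sketch below pulls one article with `newspaper3k`, chunks it, and embeds the passages with `textembedding-gecko@001`. The URL is a placeholder, and depending on your SDK version the embeddings import may live under `vertexai.preview.language_models`; the notebooks wrap this logic in the helper classes under `zeitghost/vertex/`.

```python
# Hedged sketch of the article -> chunks -> embeddings path described above.
from newspaper import Article
from langchain.text_splitter import RecursiveCharacterTextSplitter
from vertexai.language_models import TextEmbeddingModel

url = "https://example.com/some-news-article"  # placeholder URL

# 1. Pull down the article text with newspaper3k
article = Article(url)
article.download()
article.parse()

# 2. Chunk the document into passages
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(article.text)

# 3. Convert the passages into 768-dimensional embeddings (the API accepts up to 5 texts per call)
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
vectors = [emb.values for emb in model.get_embeddings(chunks[:5])]
print(f"{len(vectors)} passages embedded, {len(vectors[0])} dimensions each")
```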
67 |
68 | Below is an example depiction of the information architecture, utilizing Matching Engine and the text embeddings model.
69 |
70 | 
71 |
72 | ## Leveraging GenAI Language Models to get Relevant & Real-time Information Using Conversational Agents
 73 | More than ever, access to relevant and real-time information to make decisions and understand changing patterns and trends is critical. Whilst GDELT contains a massive amount of relevant and real-time (up to every 15 min) data, it can be challenging and overwhelming to make sense of it and extract what is most important for specific areas of interest, topics, events, and entities. This project, the Zeitghost, provides a reference solution for how you can specify your entities and events of interest to extract from GDELT, index and load them into a vector database, and leverage the Vertex AI Generative Language models with Langchain Agents to interact in a Q&A style with the information. We also show how you can orchestrate and schedule ongoing refreshes of the data to keep the system up to date with the latest information.
74 |
75 | For more information about getting started with Langchain Agents, see [Langchain Examples](https://github.com/GoogleCloudPlatform/generative-ai/tree/dev/language/examples/oss-samples/langchain)
76 |
 77 | Finally, to go beyond an agent with one chain of thought along with one tool, we explore how you can start to combine plan-and-execute agents together. [Plan-and-execute Agents](https://python.langchain.com/en/latest/modules/agents/plan_and_execute.html) accomplish an objective by first planning what to do, then executing the subtasks (see the sketch after this list).
78 | * This idea is largely inspired by [BabyAGI](https://github.com/yoheinakajima/babyagi) and the [Plan-and-Solve](https://arxiv.org/abs/2305.04091) paper.
79 | * The planning is almost always done by an LLM.
80 | * The execution is usually done by a separate agent (**equipped with tools**)
81 | By allowing agents with access to different source data to interact with each other, we can uncover new insights that may not have been obvious by examining each of the datasets in isolation.
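
Here is a minimal, hedged sketch of the plan-and-execute pattern with LangChain. Depending on your LangChain version the module may be `langchain_experimental.plan_and_execute` or `langchain.experimental.plan_and_execute`, and the single search tool below stands in for the GDELT, trends, and search tools used in the notebooks.

```python
# Hedged sketch: a planner LLM writes the step list, a tool-equipped executor runs each step.
from langchain.agents import Tool
from langchain.llms import VertexAI
from langchain.utilities import GoogleSearchAPIWrapper
from langchain_experimental.plan_and_execute import (
    PlanAndExecute,
    load_agent_executor,
    load_chat_planner,
)

llm = VertexAI(temperature=0)
search = GoogleSearchAPIWrapper()
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="useful for answering questions about current events",
    ),
]

planner = load_chat_planner(llm)                          # the LLM plans the subtasks
executor = load_agent_executor(llm, tools, verbose=True)  # a tool-equipped agent executes each subtask
agent = PlanAndExecute(planner=planner, executor=executor, verbose=True)

agent.run("How does recent news coverage of the actor compare with its search interest?")
```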
82 |
83 | ## Solution Overview
84 |
85 | ### Architecture
86 |
87 | 
88 |
89 | ### Component Flow
90 | This project consists of a series of notebooks leveraging a customized code base to:
 91 | - Filter and extract all of the relevant web URLs for a given entity from the GDELT global entity graph, or for a type of global event for a specified time period, leveraging the GDELT data that is publicly available natively in BigQuery (see the query sketch after this list). An example could be an ACTOR='World Health Organization', for the time period of March 2020 to present, including events about COVID lockdown.
92 | - Extract the full article and news content from every URL that is returned from the GDELT datasets and generate text embeddings using the [Vertex AI Embeddings Model](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings)
 93 | - Create a [Vertex AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview) Vector Database Index and deploy it to an Index Endpoint
94 | - Stream update generated embeddings into the Matching Engine Vector Database Index
95 | - Create a managed pipeline to orchestrate the ongoing refresh of the GDELT data into the Matching Engine Vector DB
96 | 
97 | - Test the generic semantic search capabilities of the Vector DB, and test using a Langchain Agent with one chain of thought along with one tool
98 | - Create a plan-and-execute agent framework where different agents (the GDELT Langchain agent, a BigQuery public trends agent, and a Google Search API agent) are able to talk to each other to answer questions
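
As an illustration of the first step, the sketch below pulls article URLs for an actor straight from GDELT's public BigQuery datasets. The project ID is a placeholder, and the `geg_gcnl` table and column names are shown as assumptions for illustration; the notebooks use the `BigQueryAccessor` helper class instead.

```python
# Illustrative sketch: fetch article URLs mentioning an actor from the GDELT Global Entity Graph.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

query = """
SELECT DISTINCT url
FROM `gdelt-bq.gdeltv2.geg_gcnl`, UNNEST(entities) AS entity
WHERE entity.name = 'World Health Organization'
  AND date >= TIMESTAMP('2020-03-01')
LIMIT 100
"""

for row in client.query(query).result():
    print(row.url)
```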
99 |
100 | We are currently working on adding:
101 | - An application to build and deploy the conversational agent as an API on Google Cloud Run - where it can then be integrated into any application.
102 | - A customizable front-end reference architecture that uses the agents once they are deployed, which can be used to showcase the art of the possible.
103 | - Incorporating future enhancements to the embedding techniques used to improve the relevancy and performance of retrieval
104 |
105 | ## How to use this repo
106 |
107 | * Setup VPC Peering
108 | * Create a Vertex AI Workbench Instance
109 | * Run the notebooks
110 |
111 | #### Important: be sure to create the Vertex AI notebook instance within the same VPC network used for the Vertex AI Matching Engine deployment
112 |
113 | If you don't use the same VPC network, you will not be able to make calls to the matching engine vector store database
114 |
115 | ### Step 1: Setup VPC Peering using the following `gcloud` commands in Cloud Shell
116 |
117 | * Similar to the setup instructions for Vertex AI Matching Engine in this [Sample Notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/matching_engine/sdk_matching_engine_for_indexing.ipynb), there are a few permissions needed in order to create the VPC network and set up VPC peering
118 | * Run the following `gcloud` commands in Cloud Shell to create the network and VPC peering needed for deploying Matching Engine indexes to Private Endpoints:
119 |
120 | ```
121 | VPC_NETWORK="YOUR_NETWORK_NAME"
122 | PROJECT_ID="YOUR_PROJECT_ID"
123 | PEERING_RANGE_NAME="PEERINGRANGENAME"
124 |
125 | # Create a VPC network
126 | gcloud compute networks create $VPC_NETWORK --bgp-routing-mode=regional --subnet-mode=auto --project=$PROJECT_ID
127 |
128 | # Add necessary firewall rules
129 | gcloud compute firewall-rules create $VPC_NETWORK-allow-icmp --network $VPC_NETWORK --priority 65534 --project $PROJECT_ID --allow icmp
130 |
131 | gcloud compute firewall-rules create $VPC_NETWORK-allow-internal --network $VPC_NETWORK --priority 65534 --project $PROJECT_ID --allow all --source-ranges 10.128.0.0/9
132 |
133 | gcloud compute firewall-rules create $VPC_NETWORK-allow-rdp --network $VPC_NETWORK --priority 65534 --project $PROJECT_ID --allow tcp:3389
134 |
135 | gcloud compute firewall-rules create $VPC_NETWORK-allow-ssh --network $VPC_NETWORK --priority 65534 --project $PROJECT_ID --allow tcp:22
136 |
137 | # Reserve IP range
138 | gcloud compute addresses create $PEERING_RANGE_NAME --global --prefix-length=16 --network=$VPC_NETWORK --purpose=VPC_PEERING --project=$PROJECT_ID --description="peering range"
139 |
140 | # Set up peering with service networking
141 | # Your account must have the "Compute Network Admin" role to run the following.
142 | gcloud services vpc-peerings connect --service=servicenetworking.googleapis.com --network=$VPC_NETWORK --ranges=$PEERING_RANGE_NAME --project=$PROJECT_ID
143 | ```
144 |
145 | ### Step 2: Create Vertex AI Workbench notebook using the following `gcloud` command in Cloud Shell:
146 |
147 | * Using this base image will ensure you have the proper starting environment to use these notebooks
148 | * You can optionally remove the GPU-related flags
149 |
150 | ```bash
151 | INSTANCE_NAME='your-instance-name'
152 |
153 | gcloud notebooks instances create $INSTANCE_NAME \
154 | --vm-image-project=deeplearning-platform-release \
155 | --vm-image-family=tf-ent-2-11-cu113-notebooks-debian-11-py39 \
157 | --machine-type=n1-standard-8 \
158 | --location=us-central1-a \
159 | --network=$VPC_NETWORK
160 | ```
161 |
162 | ### Step 3: Clone this repo
163 | * Once the Vertex AI Workbench instance is created, open a terminal via the file menu: **File > New > Terminal**
164 | * Run the following code to clone this repo:
165 |
166 | ```bash
167 | git clone https://github.com/hello-d-lee/conversational-agents-zeitghost.git
168 | ```
169 |
170 | ### Step 4: Go to the first notebook (`00-env-setup.ipynb`), follow the instructions, and continue through the remaining notebooks
171 |
172 | 0. [Environment Setup](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/00-env-setup.ipynb) - used to create configuration settings once, which are then reused by the rest of the notebooks
173 | 1. [Setup Vertex Vector Store](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/01-setup-vertex-vector-store.ipynb) - create the Vertex AI Matching Engine Vector Store Index, deploy it to an endpoint. This can take 40-50 min, so whilst waiting the next notebook can be run.
174 | 2. [GDELT DataOps](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/02-gdelt-data-ops.ipynb) - parameterize the topics and time period of interest, run the extraction against GDELT for article and news content
175 | 3. [Vector Store Index Loader](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/03-vector-store-index-loader.ipynb) - create embeddings and load the vectors into the Matching Engine Vector Store. Test the semantic search capabilities and langchain agent using the Vector Store.
176 | 4. [Build Zeitghost Image](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/04-build-zeitghost-image.ipynb) - create a custom container to be used to create the GDELT pipeline for ongoing data updates
177 | 5. [GDELT Pipelines](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/05-gdelt-pipelines.ipynb) - create the pipeline to orchestrate and automatically refresh the data and update the new vectors into the matching engine index
178 | 6. [Plan and Execute Agents](https://github.com/hello-d-lee/conversational-agents-zeitghost/blob/main/notebooks/06-plan-and-execute-agents.ipynb) - create new agents using the BigQuery public trends dataset, and the Google Search API, and use the agents together to uncover new insights
179 |
--------------------------------------------------------------------------------
/img/deep-retriever.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/deep-retriever.png
--------------------------------------------------------------------------------
/img/makeup-schema-bq.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/makeup-schema-bq.png
--------------------------------------------------------------------------------
/img/poor_output_formatting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/poor_output_formatting.png
--------------------------------------------------------------------------------
/img/zietghost_arch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_arch.png
--------------------------------------------------------------------------------
/img/zietghost_concept.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_concept.png
--------------------------------------------------------------------------------
/img/zietghost_marketer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_marketer.png
--------------------------------------------------------------------------------
/img/zietghost_process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_process.png
--------------------------------------------------------------------------------
/img/zietghost_title.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/img/zietghost_title.png
--------------------------------------------------------------------------------
/notebooks/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/.DS_Store
--------------------------------------------------------------------------------
/notebooks/.env:
--------------------------------------------------------------------------------
1 | GOOGLE_CSE_ID=8743920w
2 | GOOGLE_API_KEY=9238rewr
--------------------------------------------------------------------------------
/notebooks/.env.sample:
--------------------------------------------------------------------------------
1 | GOOGLE_CSE_ID=
2 | GOOGLE_API_KEY=
--------------------------------------------------------------------------------
/notebooks/01-setup-vertex-vector-store.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "4a3c8d01",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "# Copyright 2023 Google LLC\n",
11 | "#\n",
12 | "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
13 | "# you may not use this file except in compliance with the License.\n",
14 | "# You may obtain a copy of the License at\n",
15 | "#\n",
16 | "# https://www.apache.org/licenses/LICENSE-2.0\n",
17 | "#\n",
18 | "# Unless required by applicable law or agreed to in writing, software\n",
19 | "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
20 | "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
21 | "# See the License for the specific language governing permissions and\n",
22 | "# limitations under the License."
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "id": "b9d4ec90-df65-4910-b3bd-1aecf803b6a4",
28 | "metadata": {},
29 | "source": [
30 | "# Setting up Vector Stores with Vertex Matching Engine\n",
31 | "\n",
32 | " \n",
33 | " \n",
34 | " Run in Colab\n",
35 | " \n",
36 | " \n",
37 | " \n",
38 | " \n",
39 | " \n",
40 | " View on GitHub\n",
41 | " \n",
42 | " \n",
43 | " \n",
44 | " \n",
45 | " \n",
46 | " Open in Vertex AI Workbench\n",
47 | " \n",
48 | " \n",
49 | "
"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "id": "299bd214",
55 | "metadata": {},
56 | "source": [
57 | "## Overview\n",
58 | "\n",
59 | "\n",
60 | " \n",
61 | " \n",
62 | "When working with LLMs and conversational agents, how the data that they are accessing is stored is crucial - efficient data processing is more important than ever for applications involving large language models, genAI, and semantic search. Many of these new applications using large unstructured datasets use vector embeddings, a data representation containing semantic information that LLMs can use to answer questions and maintain in a long-term memory. \n",
63 | "\n",
64 | "In this application we will use a specialized database - a Vector Database - for handling embeddings, optimized for storage and querying capabilities for embeddings. The GDELT dataset extract could be quite large depending on the actor_name and time range, so we want to make sure that we aren't sacrificing performance to interact with such a potentially large dataset, which is where Vertex AI Matching Engine's Vector Database will ensure that we can scale for any very large number of embeddings.\n",
65 | "\n",
66 | "In this notebook you'll go through the process to create and deploy a vector store in Vertex Matching Engine. Whilst the setup may take 40-50min, once you've done this once, you can update, delete, and continue to add embeddings to this instance. \n",
67 | "\n",
68 | "---\n",
69 | "\n",
70 | "[Vertex AI Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview) provides the industry's leading high-scale low latency vector database. These vector databases are commonly referred to as vector similarity-matching or an approximate nearest neighbor (ANN) service.\n",
71 | "\n",
72 | "Matching Engine provides tooling to build use cases that match semantically similar items. More specifically, given a query item, Matching Engine finds the most semantically similar items to it from a large corpus of candidate items. This ability to search for semantically similar or semantically related items has many real world use cases and is a vital part of applications such as:\n",
73 | "\n",
74 | "* Recommendation engines\n",
75 | "* Search engines\n",
76 | "* Ad targeting systems\n",
77 | "* Image classification or image search\n",
78 | "* Text classification\n",
79 | "* Question answering\n",
80 | "* Chatbots\n",
81 | "\n",
82 | "To build semantic matching systems, you need to compute vector representations of all items. These vector representations are often called embeddings. Embeddings are computed by using machine learning models, which are trained to learn an embedding space where similar examples are close while dissimilar ones are far apart. The closer two items are in the embedding space, the more similar they are.\n",
83 | "\n",
84 | "At a high level, semantic matching can be simplified into two critical steps:\n",
85 | "\n",
86 | "* Generate embedding representations of items.\n",
87 | "* Perform nearest neighbor searches on embeddings.\n",
88 | "\n",
89 | "### Objectives\n",
90 | "\n",
91 | "In this notebook, you will create a Vector Store using Vertex AI Matching Engine\n",
92 | "\n",
93 | "The steps performed include:\n",
94 | "\n",
95 | "- Installing the Python SDK \n",
96 | "- Create or initialize an existing matching engine index\n",
97 | " - Creating a new index can take 40-50 minutes\n",
98 | " - If you have already created an index and want to use this existing one, follow the instructions to initialize an existing index\n",
99 | " - Whilst creating a new index, consider proceeding to [GDELT DataOps](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/intro_palm_api.ipynb) notebook\n",
100 | "- Create the Vector Store with embedddings, leveraging the embeddings model with `textembedding-gecko@001`\n",
101 | " "
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "id": "96dda8e7",
107 | "metadata": {},
108 | "source": [
109 | "### Costs\n",
110 | "This tutorial uses billable components of Google Cloud:\n",
111 | "\n",
112 | "* Vertex AI Generative AI Studio\n",
113 | "* Vertex AI Matching Engine\n",
114 | "\n",
115 | "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),\n",
116 | "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n",
117 | "to generate a cost estimate based on your projected usage."
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "id": "c63c9095",
123 | "metadata": {},
124 | "source": [
125 | "## Getting Started"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "id": "a5df5dc2",
131 | "metadata": {},
132 | "source": [
133 | "**Colab only:** Uncomment the following cell to restart the kernel. For Vertex AI Workbench you can restart the terminal using the button on top. "
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": 2,
139 | "id": "ba34e308",
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "# # Automatically restart kernel after installs so that your environment can access the new packages\n",
144 | "# import IPython\n",
145 | "\n",
146 | "# app = IPython.Application.instance()\n",
147 | "# app.kernel.do_shutdown(True)"
148 | ]
149 | },
150 | {
151 | "cell_type": "markdown",
152 | "id": "5943f6fc",
153 | "metadata": {},
154 | "source": [
155 | "### Authenticating your notebook environment\n",
156 | "* If you are using **Colab** to run this notebook, uncomment the cell below and continue.\n",
157 | "* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)."
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 3,
163 | "id": "51d84780",
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "# from google.colab import auth\n",
168 | "# auth.authenticate_user()"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "id": "b7f4d6ec-0e4d-40da-8a03-e7038bab7485",
174 | "metadata": {},
175 | "source": [
176 | "### Make sure you edit the values below\n",
177 | "Each time you run the notebook for the first time with new variables, you just need to edit the actor prefix and version variables below. They are needed to grab all the other variables in the notebook configuration."
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 2,
183 | "id": "b105ad1f-1b76-4551-a269-c31bc7b6da74",
184 | "metadata": {},
185 | "outputs": [
186 | {
187 | "name": "stdout",
188 | "output_type": "stream",
189 | "text": [
190 | "ACTOR_PREFIX : ggl\n",
191 | "VERSION : v1\n"
192 | ]
193 | }
194 | ],
195 | "source": [
196 | "# CREATE_NEW_ASSETS = True # True | False\n",
197 | "ACTOR_PREFIX = \"ggl\"\n",
198 | "VERSION = 'v1'\n",
199 | "\n",
200 | "# print(f\"CREATE_NEW_ASSETS : {CREATE_NEW_ASSETS}\")\n",
201 | "print(f\"ACTOR_PREFIX : {ACTOR_PREFIX}\")\n",
202 | "print(f\"VERSION : {VERSION}\")"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "id": "808f751b-e348-4357-a294-7bf4ba3a6ff5",
208 | "metadata": {},
209 | "source": [
210 | "### Load configuration settings from setup notebook\n",
211 | "Set the variables used in this notebook and load the config settings from the `00-env-setup.ipynb` notebook."
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 3,
217 | "id": "aef986cc-3211-4093-bce0-3bac431a07a1",
218 | "metadata": {},
219 | "outputs": [
220 | {
221 | "name": "stdout",
222 | "output_type": "stream",
223 | "text": [
224 | "\n",
225 | "PROJECT_ID = \"wortz-project-352116\"\n",
226 | "PROJECT_NUM = \"679926387543\"\n",
227 | "LOCATION = \"us-central1\"\n",
228 | "\n",
229 | "REGION = \"us-central1\"\n",
230 | "BQ_LOCATION = \"US\"\n",
231 | "VPC_NETWORK_NAME = \"me-network\"\n",
232 | "\n",
233 | "CREATE_NEW_ASSETS = \"True\"\n",
234 | "ACTOR_PREFIX = \"ggl\"\n",
235 | "VERSION = \"v1\"\n",
236 | "ACTOR_NAME = \"google\"\n",
237 | "ACTOR_CATEGORY = \"technology\"\n",
238 | "\n",
239 | "BUCKET_NAME = \"zghost-ggl-v1-wortz-project-352116\"\n",
240 | "EMBEDDING_DIR_BUCKET = \"zghost-ggl-v1-wortz-project-352116-emd-dir\"\n",
241 | "\n",
242 | "BUCKET_URI = \"gs://zghost-ggl-v1-wortz-project-352116\"\n",
243 | "EMBEDDING_DIR_BUCKET_URI = \"gs://zghost-ggl-v1-wortz-project-352116-emd-dir\"\n",
244 | "\n",
245 | "VPC_NETWORK_FULL = \"projects/679926387543/global/networks/me-network\"\n",
246 | "\n",
247 | "ME_INDEX_NAME = \"vectorstore_ggl_v1\"\n",
248 | "ME_INDEX_ENDPOINT_NAME = \"vectorstore_ggl_v1_endpoint\"\n",
249 | "ME_DIMENSIONS = \"768\"\n",
250 | "\n",
251 | "MY_BQ_DATASET = \"zghost_ggl_v1\"\n",
252 | "MY_BQ_TRENDS_DATASET = \"zghost_ggl_v1_trends\"\n",
253 | "\n",
254 | "BUCKET_NAME : zghost-ggl-v1-wortz-project-352116\n",
255 | "BUCKET_URI : gs://zghost-ggl-v1-wortz-project-352116\n"
256 | ]
257 | }
258 | ],
259 | "source": [
260 | "# staging GCS\n",
261 | "GCP_PROJECTS = !gcloud config get-value project\n",
262 | "PROJECT_ID = GCP_PROJECTS[0]\n",
263 | "\n",
264 | "BUCKET_NAME = f'zghost-{ACTOR_PREFIX}-{VERSION}-{PROJECT_ID}'\n",
265 | "BUCKET_URI = f'gs://{BUCKET_NAME}'\n",
266 | "\n",
267 | "config = !gsutil cat {BUCKET_URI}/config/notebook_env.py\n",
268 | "print(config.n)\n",
269 | "exec(config.n)\n",
270 | "\n",
271 | "print(f\"BUCKET_NAME : {BUCKET_NAME}\")\n",
272 | "print(f\"BUCKET_URI : {BUCKET_URI}\")"
273 | ]
274 | },
275 | {
276 | "cell_type": "markdown",
277 | "id": "0ed27576-f85a-4b5b-a54a-e4f61b30dd4e",
278 | "metadata": {},
279 | "source": [
280 | "### Import Packages"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": 7,
286 | "id": "98bbd868-e768-44a0-bf7c-862201209616",
287 | "metadata": {},
288 | "outputs": [],
289 | "source": [
290 | "import sys\n",
291 | "import os\n",
292 | "sys.path.append(\"..\")\n",
293 | "# the following helper classes create and instantiate the matching engine resources\n",
294 | "from zeitghost.vertex.MatchingEngineCRUD import MatchingEngineCRUD\n",
295 | "from zeitghost.vertex.MatchingEngineVectorstore import MatchingEngineVectorStore\n",
296 | "from zeitghost.vertex.LLM import VertexLLM\n",
297 | "from zeitghost.vertex.Embeddings import VertexEmbeddings\n",
298 | "\n",
299 | "import uuid\n",
300 | "import time\n",
301 | "import numpy as np\n",
302 | "import json\n",
303 | "\n",
304 | "from google.cloud import aiplatform as vertex_ai\n",
305 | "from google.cloud import storage\n",
306 | "from google.cloud import bigquery"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": 8,
312 | "id": "3944efc9-ee04-4b40-b4c4-64652beddf3c",
313 | "metadata": {},
314 | "outputs": [],
315 | "source": [
316 | "storage_client = storage.Client(project=PROJECT_ID)\n",
317 | "\n",
318 | "vertex_ai.init(project=PROJECT_ID,location=LOCATION)\n",
319 | "\n",
320 | "# bigquery client\n",
321 | "bqclient = bigquery.Client(\n",
322 | " project=PROJECT_ID,\n",
323 | " # location=LOCATION\n",
324 | ")"
325 | ]
326 | },
327 | {
328 | "cell_type": "markdown",
329 | "id": "aa97925b-22d6-457a-9e53-212de1ca3fdb",
330 | "metadata": {},
331 | "source": [
332 | "## Matching Engine Index: initialize existing or create a new one\n",
333 | "\n",
334 | "Validate access and bucket contents"
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": 9,
340 | "id": "96dcdd09-9e35-419a-8347-2de086a6500f",
341 | "metadata": {},
342 | "outputs": [
343 | {
344 | "name": "stdout",
345 | "output_type": "stream",
346 | "text": [
347 | "gs://zghost-way-v1-wortz-project-352116-emd-dir/init_index/embeddings_0.json\n"
348 | ]
349 | }
350 | ],
351 | "source": [
352 | "! gsutil ls $EMBEDDING_DIR_BUCKET_URI/init_index"
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "id": "549d790f",
358 | "metadata": {},
359 | "source": [
360 | "Pass the required parameters that will be used to create the matching engine index"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": 10,
366 | "id": "dc7e096f-9784-4bbf-8512-bd3000db21d9",
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "mengine = MatchingEngineCRUD(\n",
371 | " project_id=PROJECT_ID \n",
372 | " , project_num=PROJECT_NUM\n",
373 | " , region=LOCATION \n",
374 | " , index_name=ME_INDEX_NAME\n",
375 | " , vpc_network_name=VPC_NETWORK_FULL\n",
376 | ")"
377 | ]
378 | },
379 | {
380 | "cell_type": "markdown",
381 | "id": "be9f7438-cf41-4c12-9785-f8a9a49bfde9",
382 | "metadata": {},
383 | "source": [
384 | "### Create or Initialize Existing Index\n",
385 | "\n",
386 | "Creating a Vertex Matching Engine index can take ~40-50 minutes due to the index compaction algorithm it uses to structure the index for high performance queries at scale. Read more about the [novel algorithm](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) proposed by Google Researchand the [official whitepaper](https://arxiv.org/abs/1908.10396)\n",
387 | "\n",
388 | "**Considering this setup time, proceed to Notebook `02-gdelt-data-ops.ipynb` to start extracting events and articles related to your actor**"
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": 11,
394 | "id": "f54160cb-3787-4608-9ede-ed2a6a89ee20",
395 | "metadata": {},
396 | "outputs": [
397 | {
398 | "name": "stderr",
399 | "output_type": "stream",
400 | "text": [
401 | "INFO:root:Index vectorstore_way_v1 does not exists. Creating index ...\n",
402 | "INFO:root:Poll the operation to create index ...\n"
403 | ]
404 | },
405 | {
406 | "name": "stdout",
407 | "output_type": "stream",
408 | "text": [
409 | "........................................"
410 | ]
411 | },
412 | {
413 | "name": "stderr",
414 | "output_type": "stream",
415 | "text": [
416 | "\n",
417 | "KeyboardInterrupt\n",
418 | "\n"
419 | ]
420 | }
421 | ],
422 | "source": [
423 | "start = time.time()\n",
424 | "# create ME index\n",
425 | "me_index = mengine.create_index(\n",
426 | " f\"{EMBEDDING_DIR_BUCKET_URI}/init_index\"\n",
427 | " , int(ME_DIMENSIONS)\n",
428 | ")\n",
429 | "\n",
430 | "end = time.time()\n",
431 | "print(f\"elapsed time: {end - start}\")\n",
432 | "\n",
433 | "if me_index:\n",
434 | " print(me_index.name)"
435 | ]
436 | },
437 | {
438 | "cell_type": "markdown",
439 | "id": "8830a613-5510-4e7c-b821-50b90dbe1392",
440 | "metadata": {},
441 | "source": [
442 | "### Create or Initialize Index Endpoint\n",
443 | "Once your Matching Engine Index has been created, create an index endpoint where the Index will be deployed to "
444 | ]
445 | },
446 | {
447 | "cell_type": "code",
448 | "execution_count": null,
449 | "id": "4d12ee68-5ea0-435d-968c-8b09c4576eb4",
450 | "metadata": {},
451 | "outputs": [],
452 | "source": [
453 | "start = time.time()\n",
454 | "\n",
455 | "index_endpoint=mengine.create_index_endpoint(\n",
456 | " endpoint_name=ME_INDEX_ENDPOINT_NAME\n",
457 | " , network=VPC_NETWORK_FULL\n",
458 | ")\n",
459 | "\n",
460 | "end = time.time()\n",
461 | "print(f\"elapsed time: {end - start}\")"
462 | ]
463 | },
464 | {
465 | "cell_type": "markdown",
466 | "id": "3aea30d7",
467 | "metadata": {},
468 | "source": [
469 | "Print out the detailed information about the index endpoint and VPC network where it is deployed, and any indexes that are already deployed to that endpoint"
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": null,
475 | "id": "dc517c3a-316f-46b4-ba07-e643a13c882f",
476 | "metadata": {},
477 | "outputs": [],
478 | "source": [
479 | "if index_endpoint:\n",
480 | " print(f\"Index endpoint resource name: {index_endpoint.name}\")\n",
481 | " print(f\"Index endpoint VPC network name: {index_endpoint.network}\")\n",
482 | " print(f\"Deployed indexes on the index endpoint:\")\n",
483 | " for d in index_endpoint.deployed_indexes:\n",
484 | " print(f\" {d.id}\")"
485 | ]
486 | },
487 | {
488 | "cell_type": "markdown",
489 | "id": "bfdd8f2d-cb15-4007-952d-e8a496da7652",
490 | "metadata": {},
491 | "source": [
492 | "### Deploy Index to Index Endpoint\n",
493 | "To interact with a matching engine index, you'll need to deploy it to an endpoint, where you can customize the underlying infrastructure behind the endpoint. For example, you can specify the scaling properties. "
494 | ]
495 | },
496 | {
497 | "cell_type": "code",
498 | "execution_count": null,
499 | "id": "5fbe4a6f-833e-4632-9047-b770dd6521b3",
500 | "metadata": {},
501 | "outputs": [],
502 | "source": [
503 | "if CREATE_NEW_ASSETS == 'True':\n",
504 | " \n",
505 | " index_endpoint = mengine.deploy_index(\n",
506 | " index_name = ME_INDEX_NAME\n",
507 | " , endpoint_name = ME_INDEX_ENDPOINT_NAME\n",
508 | " , min_replica_count = 2\n",
509 | " , max_replica_count = 2\n",
510 | " )"
511 | ]
512 | },
513 | {
514 | "cell_type": "markdown",
515 | "id": "032d860d",
516 | "metadata": {},
517 | "source": [
518 | "Print out the information about the matching engine resources"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": null,
524 | "id": "42b0a04e-2d0f-4b15-b2c5-1551866697c4",
525 | "metadata": {},
526 | "outputs": [],
527 | "source": [
528 | "if index_endpoint:\n",
529 | " print(f\"Index endpoint resource name: {index_endpoint.name}\")\n",
530 | " print(f\"Index endpoint VPC network name: {index_endpoint.network}\")\n",
531 | " print(f\"Deployed indexes on the index endpoint:\")\n",
532 | " for d in index_endpoint.deployed_indexes:\n",
533 | " print(f\" {d.id}\")"
534 | ]
535 | },
536 | {
537 | "cell_type": "markdown",
538 | "id": "097dee06-6772-43a3-83c0-8ba3f94b7846",
539 | "metadata": {},
540 | "source": [
541 | "### Get Index and IndexEndpoint IDs\n",
542 | "Set the variable values and print out the resource details"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": null,
548 | "id": "7ffaf92a-bd91-482a-a51f-088429c1c277",
549 | "metadata": {},
550 | "outputs": [],
551 | "source": [
552 | "ME_INDEX_RESOURCE_NAME, ME_INDEX_ENDPOINT_ID = mengine.get_index_and_endpoint()\n",
553 | "ME_INDEX_ID=ME_INDEX_RESOURCE_NAME.split(\"/\")[5]\n",
554 | "\n",
555 | "print(f\"ME_INDEX_RESOURCE_NAME = {ME_INDEX_RESOURCE_NAME}\")\n",
556 | "print(f\"ME_INDEX_ENDPOINT_ID = {ME_INDEX_ENDPOINT_ID}\")\n",
557 | "print(f\"ME_INDEX_ID = {ME_INDEX_ID}\")"
558 | ]
559 | },
560 | {
561 | "cell_type": "markdown",
562 | "id": "a785b0bd-a597-4284-b8e6-6ffb9d9bbe08",
563 | "metadata": {},
564 | "source": [
565 | "## Matching Engine Vector Store"
566 | ]
567 | },
568 | {
569 | "cell_type": "markdown",
570 | "id": "02c6333c-2368-4ed0-8103-e431450d08b4",
571 | "metadata": {},
572 | "source": [
573 | "### Define Vertex LLM & Embeddings\n",
574 | "The base class to create the various LLMs can be found in in the root repository - in zeitghost.vertex the `LLM.py` file"
575 | ]
576 | },
577 | {
578 | "cell_type": "code",
579 | "execution_count": null,
580 | "id": "381e0c0f-de69-4ffe-b039-6d233d4da80f",
581 | "metadata": {},
582 | "outputs": [],
583 | "source": [
584 | "llm = VertexLLM(\n",
585 | " stop=None \n",
586 | " , temperature=0.0\n",
587 | " , max_output_tokens=1000\n",
588 | " , top_p=0.7\n",
589 | " , top_k=40\n",
590 | ")\n",
591 | "\n",
592 | "# llm that can be used for a BigQuery agent, containing stopwords to prevent hallucinations and string parsing\n",
593 | "langchain_llm_for_bq = VertexLLM(\n",
594 | " stop=['Observation:'] \n",
595 | " , strip=True \n",
596 | " , temperature=0.0\n",
597 | " , max_output_tokens=1000\n",
598 | " , top_p=0.7\n",
599 | " , top_k=40\n",
600 | ")\n",
601 | "\n",
602 | "# llm that can be used for a pandas agent, containing stopwords to prevent hallucinations\n",
603 | "langchain_llm_for_pandas = VertexLLM(\n",
604 | " stop=['Observation:']\n",
605 | " , strip=False\n",
606 | " , temperature=0.0\n",
607 | " , max_output_tokens=1000\n",
608 | " , top_p=0.7\n",
609 | " , top_k=40\n",
610 | ")"
611 | ]
612 | },
613 | {
614 | "cell_type": "markdown",
615 | "id": "029adebc",
616 | "metadata": {},
617 | "source": [
618 | "Let's ping the language model to ensure we are getting an expected response"
619 | ]
620 | },
621 | {
622 | "cell_type": "code",
623 | "execution_count": null,
624 | "id": "c38b5ef5-4ffb-4594-807a-b6c4717a53d0",
625 | "metadata": {},
626 | "outputs": [],
627 | "source": [
628 | "# llm('how are you doing today?')\n",
629 | "llm('In no more than 50 words, what can you tell me about the band Widespread Panic?')"
630 | ]
631 | },
632 | {
633 | "cell_type": "markdown",
634 | "id": "1feb1e91",
635 | "metadata": {},
636 | "source": [
637 | "Now let's call the VertexEmbeddings class which helps us get document embeddings using the [Vertex AI Embeddings model](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). Make sure that your REQUESTS_PER_MINUTE does not exceed your project quota."
638 | ]
639 | },
640 | {
641 | "cell_type": "code",
642 | "execution_count": null,
643 | "id": "660ece82-5e45-476c-a854-2d2aba646529",
644 | "metadata": {},
645 | "outputs": [],
646 | "source": [
647 | "from zeitghost.vertex.Embeddings import VertexEmbeddings\n",
648 | "\n",
649 | "REQUESTS_PER_MINUTE = 299 # example project quota==300\n",
650 | "vertex_embedding = VertexEmbeddings(requests_per_minute=REQUESTS_PER_MINUTE)"
651 | ]
652 | },
653 | {
654 | "cell_type": "markdown",
655 | "id": "f18fffa5-ad26-43e0-8af8-b023ff8aeae8",
656 | "metadata": {},
657 | "source": [
658 | "## Initialize Matching Engine Vector Store\n",
659 | "Finally, to interact with the matching engine instance initialize it with everything that you have created"
660 | ]
661 | },
662 | {
663 | "cell_type": "code",
664 | "execution_count": null,
665 | "id": "df037bcd-5bff-417c-988e-6ab4806acb86",
666 | "metadata": {},
667 | "outputs": [],
668 | "source": [
669 | "# initialize vector store\n",
670 | "me = MatchingEngineVectorStore.from_components(\n",
671 | " project_id=PROJECT_ID\n",
672 | " # , project_num=PROJECT_NUM\n",
673 | " , region=LOCATION\n",
674 | " , gcs_bucket_name=EMBEDDING_DIR_BUCKET_URI\n",
675 | " , embedding=vertex_embedding\n",
676 | " , index_id=ME_INDEX_ID\n",
677 | " , endpoint_id=ME_INDEX_ENDPOINT_ID\n",
678 | ")"
679 | ]
680 | },
681 | {
682 | "cell_type": "markdown",
683 | "id": "7bfbda36",
684 | "metadata": {},
685 | "source": [
686 | "Validate that you have created the vector store with the Vertex embeddings"
687 | ]
688 | },
689 | {
690 | "cell_type": "code",
691 | "execution_count": null,
692 | "id": "5f984ea1-040b-4e2d-b4e1-401721288228",
693 | "metadata": {},
694 | "outputs": [],
695 | "source": [
696 | "me.embedding"
697 | ]
698 | }
699 | ],
700 | "metadata": {
701 | "environment": {
702 | "kernel": "python3",
703 | "name": "tf2-gpu.2-6.m108",
704 | "type": "gcloud",
705 | "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-6:m108"
706 | },
707 | "kernelspec": {
708 | "display_name": "Python 3 (ipykernel)",
709 | "language": "python",
710 | "name": "python3"
711 | },
712 | "language_info": {
713 | "codemirror_mode": {
714 | "name": "ipython",
715 | "version": 3
716 | },
717 | "file_extension": ".py",
718 | "mimetype": "text/x-python",
719 | "name": "python",
720 | "nbconvert_exporter": "python",
721 | "pygments_lexer": "ipython3",
722 | "version": "3.9.16"
723 | }
724 | },
725 | "nbformat": 4,
726 | "nbformat_minor": 5
727 | }
728 |
--------------------------------------------------------------------------------
/notebooks/07-streamlit-ui-plan-and-execute.py:
--------------------------------------------------------------------------------
1 | from os import system
2 | from pathlib import Path
3 |
4 | import sys
5 | import os
6 | sys.path.append("..")
7 |
8 | import streamlit as st
9 | from langchain import SQLDatabase
10 | from langchain.agents import AgentType
11 | from langchain.agents import initialize_agent, Tool
12 | from langchain.callbacks import StreamlitCallbackHandler
13 | from langchain.chains import LLMMathChain, SQLDatabaseChain
14 | from langchain.utilities import DuckDuckGoSearchAPIWrapper
15 | from langchain.llms import VertexAI
16 | from langchain.embeddings import VertexAIEmbeddings
17 |
18 |
19 |
20 |
21 | from zeitghost.agents.LangchainAgent import LangchainAgent
22 |
23 | from streamlit_agent.callbacks.capturing_callback_handler import playback_callbacks
24 | from streamlit_agent.clear_results import with_clear_container
25 |
26 | st.set_page_config(
27 | page_title="Google Langchain Agents", page_icon="🦜", layout="wide", initial_sidebar_state="collapsed"
28 | )
29 | "# 🦜🔗 Langchain for Google Palm"
30 |
31 | ACTOR_PREFIX = "ggl"
32 | VERSION = 'v1'
33 | PROJECT_ID = 'cpg-cdp'
34 | BUCKET_NAME = f'zghost-{ACTOR_PREFIX}-{VERSION}-{PROJECT_ID}'
35 | BUCKET_URI = f'gs://{BUCKET_NAME}'
36 |
37 |
38 | ###HARDCODED VALUES BELOW - TODO UPDATE LATER
39 |
40 | PROJECT_ID = "cpg-cdp"
41 | PROJECT_NUM = "939655404703"
42 | LOCATION = "us-central1"
43 | REGION = "us-central1"
44 | BQ_LOCATION = "US"
45 | VPC_NETWORK_NAME = "genai-haystack-vpc"
46 | CREATE_NEW_ASSETS = "True"
47 | VERSION = "v1"
48 | ACTOR_NAME = "google"
49 | ACTOR_CATEGORY = "technology"
50 | BUCKET_NAME = "zghost-ggl-v1-cpg-cdp"
51 | EMBEDDING_DIR_BUCKET = "zghost-ggl-v1-cpg-cdp-emd-dir"
52 | BUCKET_URI = "gs://zghost-ggl-v1-cpg-cdp"
53 | EMBEDDING_DIR_BUCKET_URI = "gs://zghost-ggl-v1-cpg-cdp-emd-dir"
54 | VPC_NETWORK_FULL = "projects/939655404703/global/networks/me-network"
55 | ME_INDEX_NAME = "vectorstore_ggl_v1"
56 | ME_INDEX_ENDPOINT_NAME = "vectorstore_ggl_v1_endpoint"
57 | ME_DIMENSIONS = "768"
58 | MY_BQ_DATASET = "zghost_ggl_v1"
59 | MY_BQ_TRENDS_DATASET = "zghost_ggl_v1_trends"
60 |
61 |
62 | #TODO - this works fine from a notebook but getting UNKNOWN errors when trying to access ME from a signed-in env (for user)
63 | # from zeitghost.vertex.Embeddings import VertexEmbeddings
64 |
65 | from zeitghost.vertex.MatchingEngineCRUD import MatchingEngineCRUD
66 | from zeitghost.vertex.MatchingEngineVectorstore import MatchingEngineVectorStore
67 |
68 | # Google Cloud
69 | # from google.cloud import aiplatform as vertex_ai
70 | # from google.cloud import storage
71 | # from google.cloud import bigquery
72 |
73 |
74 | #Instantiate Google cloud SDK clients
75 | # storage_client = storage.Client(project=PROJECT_ID)
76 |
77 | ## Instantiate the Vertex AI resources, Agents, and Tools
78 | mengine = MatchingEngineCRUD(
79 | project_id=PROJECT_ID
80 | , project_num=PROJECT_NUM
81 | , region=LOCATION
82 | , index_name=ME_INDEX_NAME
83 | , vpc_network_name=VPC_NETWORK_FULL
84 | )
85 |
86 | ME_INDEX_RESOURCE_NAME, ME_INDEX_ENDPOINT_ID = mengine.get_index_and_endpoint()
87 | ME_INDEX_ID=ME_INDEX_RESOURCE_NAME.split("/")[5]
88 |
89 |
90 | REQUESTS_PER_MINUTE = 200 # project quota==300
91 | vertex_embedding = VertexAIEmbeddings(requests_per_minute=REQUESTS_PER_MINUTE)
92 |
93 |
94 | me = MatchingEngineVectorStore.from_components(
95 | project_id=PROJECT_ID
96 | , region=LOCATION
97 | , gcs_bucket_name=BUCKET_NAME
98 | , embedding=vertex_embedding
99 | , index_id=ME_INDEX_ID
100 | , endpoint_id=ME_INDEX_ENDPOINT_ID
101 | , k = 10
102 | )
103 |
104 |
105 | ## Create VectorStore Agent tool
106 |
107 | vertex_langchain_agent = LangchainAgent()
108 |
109 | vectorstore_agent = vertex_langchain_agent.get_vectorstore_agent(
110 | vectorstore=me
111 | , vectorstore_name=f"news on {ACTOR_NAME}"
112 | , vectorstore_description=f"a vectorstore containing news articles and current events for {ACTOR_NAME}."
113 | )
114 |
115 | ## BigQuery Agent
116 |
117 |
118 | vertex_langchain_agent = LangchainAgent()
119 | bq_agent = vertex_langchain_agent.get_bigquery_agent(PROJECT_ID)
120 |
121 |
122 | bq_agent_tools = bq_agent.tools
123 |
124 | bq_agent_tools[0].description = bq_agent_tools[0].description + \
125 | f"""
126 | only use the schema {MY_BQ_TRENDS_DATASET}
127 | NOTE YOU CANNOT DO OPERATIONS ON AN AGGREGATED FIELD UNLESS IT IS IN A CTE, WHICH IS ALLOWED
128 | also - use a like operator for the term field e.g. WHERE term LIKE '%keyword%'
129 | make sure to lower case the term in the WHERE clause
130 | be sure to LIMIT 100 for all queries
131 | if you don't have a LIMIT 100, there will be problems
132 | """
133 |
134 |
135 | ## Build an Agent that has access to Multiple Tools
136 |
137 | llm = VertexAI()
138 |
139 | dataset = 'google_trends_my_project'
140 |
141 | db = SQLDatabase.from_uri(f"bigquery://{PROJECT_ID}/{dataset}")
142 |
143 | llm_math_chain = LLMMathChain.from_llm(llm=llm, verbose=True)
144 |
145 | me_tools = vectorstore_agent.tools
146 |
147 | search = DuckDuckGoSearchAPIWrapper()
148 |
149 |
150 | tools = [
151 | Tool(
152 | name="Calculator",
153 | func=llm_math_chain.run,
154 | description="useful for when you need to answer questions about math",
155 | ),
156 | Tool(
157 | name="Search",
158 | func=search.run,
159 | description="useful for when you need to answer questions about current events. You should ask targeted questions",
160 | ),
161 | ]
162 |
163 |
164 | # tools.extend(me_tools) #TODO - this is not working on a local macbook; may work on cloudtop or other config
165 | tools.extend(bq_agent_tools)
166 |
167 | # Run the streamlit app
168 |
169 | # what are the unique terms in the top_rising_terms table?
170 |
171 | enable_custom = True
172 | # Initialize agent
173 | mrkl = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
174 |
175 | with st.form(key="form"):
176 | user_input = ""
177 | user_input = st.text_input("Ask your question")
178 | submit_clicked = st.form_submit_button("Submit Question")
179 |
180 |
181 | output_container = st.empty()
182 | if with_clear_container(submit_clicked):
183 | output_container = output_container.container()
184 | output_container.chat_message("user").write(user_input)
185 | answer_container = output_container.chat_message("assistant", avatar="🦜")
186 | st_callback = StreamlitCallbackHandler(answer_container)
187 | answer = mrkl.run(user_input, callbacks=[st_callback])
188 | answer_container.write(answer)
189 |
190 |
191 | "#### Here's some info on the tools in this agent: "
192 | for t in tools:
193 | st.write(t.name)
194 | st.write(t.description)
195 | st.write('\n')
196 |
197 |
--------------------------------------------------------------------------------
/notebooks/imgs/add-gdelt-to-bq.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/add-gdelt-to-bq.gif
--------------------------------------------------------------------------------
/notebooks/imgs/agent_plan_execute_chain_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/agent_plan_execute_chain_output.png
--------------------------------------------------------------------------------
/notebooks/imgs/architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/architecture.png
--------------------------------------------------------------------------------
/notebooks/imgs/chunk_bq_tables_flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/chunk_bq_tables_flow.png
--------------------------------------------------------------------------------
/notebooks/imgs/chunk_gcs_blobs_flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/chunk_gcs_blobs_flow.png
--------------------------------------------------------------------------------
/notebooks/imgs/chunk_youtube_flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/chunk_youtube_flow.png
--------------------------------------------------------------------------------
/notebooks/imgs/da-ui.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/da-ui.png
--------------------------------------------------------------------------------
/notebooks/imgs/deep-retriever.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/deep-retriever.png
--------------------------------------------------------------------------------
/notebooks/imgs/fullarchitecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/fullarchitecture.png
--------------------------------------------------------------------------------
/notebooks/imgs/google-trends-explore.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/google-trends-explore.png
--------------------------------------------------------------------------------
/notebooks/imgs/info-architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/info-architecture.png
--------------------------------------------------------------------------------
/notebooks/imgs/langchain-diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/langchain-diagram.png
--------------------------------------------------------------------------------
/notebooks/imgs/langchain-overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/langchain-overview.png
--------------------------------------------------------------------------------
/notebooks/imgs/langchain_intro.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/langchain_intro.png
--------------------------------------------------------------------------------
/notebooks/imgs/pipeline-complete.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/pipeline-complete.png
--------------------------------------------------------------------------------
/notebooks/imgs/pipeline_metadata.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/pipeline_metadata.png
--------------------------------------------------------------------------------
/notebooks/imgs/plan-execute-example-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/plan-execute-example-output.png
--------------------------------------------------------------------------------
/notebooks/imgs/public_trends_data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/public_trends_data.png
--------------------------------------------------------------------------------
/notebooks/imgs/user-flow-plan-execute.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/user-flow-plan-execute.png
--------------------------------------------------------------------------------
/notebooks/imgs/zghost_overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview.png
--------------------------------------------------------------------------------
/notebooks/imgs/zghost_overview_ME.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_ME.png
--------------------------------------------------------------------------------
/notebooks/imgs/zghost_overview_agents.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_agents.png
--------------------------------------------------------------------------------
/notebooks/imgs/zghost_overview_gdelt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_gdelt.png
--------------------------------------------------------------------------------
/notebooks/imgs/zghost_overview_load_index.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_load_index.png
--------------------------------------------------------------------------------
/notebooks/imgs/zghost_overview_pipeline_steps.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/notebooks/imgs/zghost_overview_pipeline_steps.png
--------------------------------------------------------------------------------
/notebooks/requirements.txt:
--------------------------------------------------------------------------------
1 | google-api-core==2.10.0
2 | google-cloud-resource-manager
3 | google-cloud-core
4 | google-cloud-documentai
5 | google-cloud-storage
6 | google-cloud-secret-manager
7 | google-cloud-bigquery
8 | google-cloud-aiplatform==1.25.0
9 | protobuf==3.20.3
10 | oauth2client==3.0.0
11 | pydantic==1.10.9
12 | pypdf
13 | gcsfs
14 | langchain
15 | newspaper3k
16 | python-decouple
17 | numpy
18 | scipy
19 | pandas
20 | nltk
21 | flask
22 | flask-restx
23 | db-dtypes
24 | gunicorn
25 | pystan
26 | lunarcalendar
27 | convertdate
28 | pexpect
29 | pandas-gbq
30 | pytube
31 | celery
32 | redis
33 | pybigquery
34 | kfp
35 | youtube-transcript-api
--------------------------------------------------------------------------------
/streamlit_agent/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/streamlit_agent/__init__.py
--------------------------------------------------------------------------------
/streamlit_agent/callbacks/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/streamlit_agent/callbacks/__init__.py
--------------------------------------------------------------------------------
/streamlit_agent/callbacks/capturing_callback_handler.py:
--------------------------------------------------------------------------------
1 | """Callback Handler captures all callbacks in a session for future offline playback."""
2 |
3 | from __future__ import annotations
4 |
5 | import pickle
6 | import time
7 | from typing import Any, TypedDict
8 |
9 | from langchain.callbacks.base import BaseCallbackHandler
10 |
11 |
12 | # This is intentionally not an enum so that we avoid serializing a
13 | # custom class with pickle.
14 | class CallbackType:
15 | ON_LLM_START = "on_llm_start"
16 | ON_LLM_NEW_TOKEN = "on_llm_new_token"
17 | ON_LLM_END = "on_llm_end"
18 | ON_LLM_ERROR = "on_llm_error"
19 | ON_TOOL_START = "on_tool_start"
20 | ON_TOOL_END = "on_tool_end"
21 | ON_TOOL_ERROR = "on_tool_error"
22 | ON_TEXT = "on_text"
23 | ON_CHAIN_START = "on_chain_start"
24 | ON_CHAIN_END = "on_chain_end"
25 | ON_CHAIN_ERROR = "on_chain_error"
26 | ON_AGENT_ACTION = "on_agent_action"
27 | ON_AGENT_FINISH = "on_agent_finish"
28 |
29 |
30 | # We use TypedDict, rather than NamedTuple, so that we avoid serializing a
31 | # custom class with pickle. All of this class's members should be basic Python types.
32 | class CallbackRecord(TypedDict):
33 | callback_type: str
34 | args: tuple[Any, ...]
35 | kwargs: dict[str, Any]
36 | time_delta: float # Number of seconds between this record and the previous one
37 |
38 |
39 | def load_records_from_file(path: str) -> list[CallbackRecord]:
40 | """Load the list of CallbackRecords from a pickle file at the given path."""
41 | with open(path, "rb") as file:
42 | records = pickle.load(file)
43 |
44 | if not isinstance(records, list):
45 | raise RuntimeError(f"Bad CallbackRecord data in {path}")
46 | return records
47 |
48 |
49 | def playback_callbacks(
50 | handlers: list[BaseCallbackHandler],
51 | records_or_filename: list[CallbackRecord] | str,
52 | max_pause_time: float,
53 | ) -> str:
54 | if isinstance(records_or_filename, list):
55 | records = records_or_filename
56 | else:
57 | records = load_records_from_file(records_or_filename)
58 |
59 | for record in records:
60 | pause_time = min(record["time_delta"], max_pause_time)
61 | if pause_time > 0:
62 | time.sleep(pause_time)
63 |
64 | for handler in handlers:
65 | if record["callback_type"] == CallbackType.ON_LLM_START:
66 | handler.on_llm_start(*record["args"], **record["kwargs"])
67 | elif record["callback_type"] == CallbackType.ON_LLM_NEW_TOKEN:
68 | handler.on_llm_new_token(*record["args"], **record["kwargs"])
69 | elif record["callback_type"] == CallbackType.ON_LLM_END:
70 | handler.on_llm_end(*record["args"], **record["kwargs"])
71 | elif record["callback_type"] == CallbackType.ON_LLM_ERROR:
72 | handler.on_llm_error(*record["args"], **record["kwargs"])
73 | elif record["callback_type"] == CallbackType.ON_TOOL_START:
74 | handler.on_tool_start(*record["args"], **record["kwargs"])
75 | elif record["callback_type"] == CallbackType.ON_TOOL_END:
76 | handler.on_tool_end(*record["args"], **record["kwargs"])
77 | elif record["callback_type"] == CallbackType.ON_TOOL_ERROR:
78 | handler.on_tool_error(*record["args"], **record["kwargs"])
79 | elif record["callback_type"] == CallbackType.ON_TEXT:
80 | handler.on_text(*record["args"], **record["kwargs"])
81 | elif record["callback_type"] == CallbackType.ON_CHAIN_START:
82 | handler.on_chain_start(*record["args"], **record["kwargs"])
83 | elif record["callback_type"] == CallbackType.ON_CHAIN_END:
84 | handler.on_chain_end(*record["args"], **record["kwargs"])
85 | elif record["callback_type"] == CallbackType.ON_CHAIN_ERROR:
86 | handler.on_chain_error(*record["args"], **record["kwargs"])
87 | elif record["callback_type"] == CallbackType.ON_AGENT_ACTION:
88 | handler.on_agent_action(*record["args"], **record["kwargs"])
89 | elif record["callback_type"] == CallbackType.ON_AGENT_FINISH:
90 | handler.on_agent_finish(*record["args"], **record["kwargs"])
91 |
92 | # Return the agent's result
93 | for record in records:
94 | if record["callback_type"] == CallbackType.ON_AGENT_FINISH:
95 | return record["args"][0][0]["output"]
96 |
97 | return "[Missing Agent Result]"
98 |
99 |
100 | class CapturingCallbackHandler(BaseCallbackHandler):
101 | def __init__(self) -> None:
102 | self._records: list[CallbackRecord] = []
103 | self._last_time: float | None = None
104 |
105 | def dump_records_to_file(self, path: str) -> None:
106 | """Write the list of CallbackRecords to a pickle file at the given path."""
107 | with open(path, "wb") as file:
108 | pickle.dump(self._records, file)
109 |
110 | def _append_record(
111 | self, type: str, args: tuple[Any, ...], kwargs: dict[str, Any]
112 | ) -> None:
113 | time_now = time.time()
114 | time_delta = time_now - self._last_time if self._last_time is not None else 0
115 | self._last_time = time_now
116 | self._records.append(
117 | CallbackRecord(
118 | callback_type=type, args=args, kwargs=kwargs, time_delta=time_delta
119 | )
120 | )
121 |
122 | def on_llm_start(self, *args: Any, **kwargs: Any) -> None:
123 | self._append_record(CallbackType.ON_LLM_START, args, kwargs)
124 |
125 | def on_llm_new_token(self, *args: Any, **kwargs: Any) -> None:
126 | self._append_record(CallbackType.ON_LLM_NEW_TOKEN, args, kwargs)
127 |
128 | def on_llm_end(self, *args: Any, **kwargs: Any) -> None:
129 | self._append_record(CallbackType.ON_LLM_END, args, kwargs)
130 |
131 | def on_llm_error(self, *args: Any, **kwargs: Any) -> None:
132 | self._append_record(CallbackType.ON_LLM_ERROR, args, kwargs)
133 |
134 | def on_tool_start(self, *args: Any, **kwargs: Any) -> None:
135 | self._append_record(CallbackType.ON_TOOL_START, args, kwargs)
136 |
137 | def on_tool_end(self, *args: Any, **kwargs: Any) -> None:
138 | self._append_record(CallbackType.ON_TOOL_END, args, kwargs)
139 |
140 | def on_tool_error(self, *args: Any, **kwargs: Any) -> None:
141 | self._append_record(CallbackType.ON_TOOL_ERROR, args, kwargs)
142 |
143 | def on_text(self, *args: Any, **kwargs: Any) -> None:
144 | self._append_record(CallbackType.ON_TEXT, args, kwargs)
145 |
146 | def on_chain_start(self, *args: Any, **kwargs: Any) -> None:
147 | self._append_record(CallbackType.ON_CHAIN_START, args, kwargs)
148 |
149 | def on_chain_end(self, *args: Any, **kwargs: Any) -> None:
150 | self._append_record(CallbackType.ON_CHAIN_END, args, kwargs)
151 |
152 | def on_chain_error(self, *args: Any, **kwargs: Any) -> None:
153 | self._append_record(CallbackType.ON_CHAIN_ERROR, args, kwargs)
154 |
155 | def on_agent_action(self, *args: Any, **kwargs: Any) -> Any:
156 | self._append_record(CallbackType.ON_AGENT_ACTION, args, kwargs)
157 |
158 | def on_agent_finish(self, *args: Any, **kwargs: Any) -> None:
159 | self._append_record(CallbackType.ON_AGENT_FINISH, args, kwargs)
160 |
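161 | # Usage sketch (illustrative; assumes an existing LangChain agent `agent` and a live handler `st_callback`):
162 | #   capture = CapturingCallbackHandler()
163 | #   agent.run("your question", callbacks=[capture])
164 | #   capture.dump_records_to_file("runs/session.pickle")
165 | #   # Later, replay the captured callbacks through live handlers:
166 | #   playback_callbacks([st_callback], "runs/session.pickle", max_pause_time=2.0)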
--------------------------------------------------------------------------------
/streamlit_agent/clear_results.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 |
3 |
4 | # A hack to "clear" the previous result when submitting a new prompt. This avoids
5 | # the "previous run's text is grayed-out but visible during rerun" Streamlit behavior.
6 | class DirtyState:
7 | NOT_DIRTY = "NOT_DIRTY"
8 | DIRTY = "DIRTY"
9 | UNHANDLED_SUBMIT = "UNHANDLED_SUBMIT"
10 |
11 |
12 | def get_dirty_state() -> str:
13 | return st.session_state.get("dirty_state", DirtyState.NOT_DIRTY)
14 |
15 |
16 | def set_dirty_state(state: str) -> None:
17 | st.session_state["dirty_state"] = state
18 |
19 |
20 | def with_clear_container(submit_clicked: bool) -> bool:
21 | if get_dirty_state() == DirtyState.DIRTY:
22 | if submit_clicked:
23 | set_dirty_state(DirtyState.UNHANDLED_SUBMIT)
24 | st.experimental_rerun()
25 | else:
26 | set_dirty_state(DirtyState.NOT_DIRTY)
27 |
28 | if submit_clicked or get_dirty_state() == DirtyState.UNHANDLED_SUBMIT:
29 | set_dirty_state(DirtyState.DIRTY)
30 | return True
31 |
32 | return False
33 |
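34 | # Usage sketch (inside a Streamlit form; names are illustrative):
35 | #   submit_clicked = st.form_submit_button("Submit Question")
36 | #   if with_clear_container(submit_clicked):
37 | #       st.write(answer)  # render the fresh result once stale output has been cleared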
--------------------------------------------------------------------------------
/zeitghost/.dockerignore:
--------------------------------------------------------------------------------
1 | Dockerfile
2 | README.md
3 | *.pyc
4 | *.pyo
5 | *.pyd
6 | __pycache__
7 | .pytest_cache
--------------------------------------------------------------------------------
/zeitghost/.env:
--------------------------------------------------------------------------------
1 | PROJECT_ID='cpg-cdp'
2 | PROJECT_NUM='939655404703'
3 | LOCATION='us-central1'
4 | DATASET_ID='genai_cap_v1'
5 | TABLE_NAME='estee_lauder_1_mentions'
6 |
--------------------------------------------------------------------------------
/zeitghost/.idea/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | /shelf/
3 | /workspace.xml
4 | # Editor-based HTTP Client requests
5 | /httpRequests/
6 | # Datasource local storage ignored files
7 | /dataSources/
8 | /dataSources.local.xml
9 | # Zeppelin ignored files
10 | /ZeppelinRemoteNotebooks/
11 |
--------------------------------------------------------------------------------
/zeitghost/.idea/codeStyles/Project.xml:
--------------------------------------------------------------------------------
[IntelliJ code-style settings; the XML markup was lost in extraction. The surviving fragments are Android XML attribute arrangement rules (xmlns:android, .*:id, .*:name, style, .*:layout_*, .*:padding*) matched against the http://schemas.android.com/apk/res/android, res-auto, and tools namespaces and ordered BY_NAME.]
--------------------------------------------------------------------------------
/zeitghost/.idea/codeStyles/codeStyleConfig.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/zeitghost/.idea/misc.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/zeitghost/.idea/modules.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/zeitghost/.idea/vcs.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/zeitghost/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/__init__.py
--------------------------------------------------------------------------------
/zeitghost/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/__pycache__/main.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/__pycache__/main.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/agents/Helpers.py:
--------------------------------------------------------------------------------
1 | from typing import Dict
2 | from typing import List
3 | from typing import Union
4 | from langchain import PromptTemplate
5 | from langchain.callbacks.base import BaseCallbackHandler
6 | from typing import Any
7 | from langchain.schema import AgentAction
8 | from langchain.schema import AgentFinish
9 | from langchain.schema import LLMResult
10 | from zeitghost.vertex.LLM import VertexLLM
11 | import time
12 | import logging
13 |
14 |
15 | QPS = 600
16 |
17 | core_template = """Question: {question}
18 |
19 | Answer: """
20 |
21 | core_prompt = PromptTemplate(
22 | template=core_template
23 | , input_variables=['question']
24 | )
25 |
26 | vector_template = """
27 | Question: Use [{name}]:
28 | {question}
29 |
30 | Answer: """
31 |
32 | vector_prompt = PromptTemplate(
33 | template=vector_template
34 | , input_variables=['name', 'question']
35 | )
36 |
37 | bq_template = """{prompt} in {table} from this table of search term volume on google.com
38 | - do not download the entire table
39 | - do not ORDER BY or GROUP BY count(*)
40 | - the datetime field is called date_field
41 | """
42 | bq_prompt = PromptTemplate(
43 | template=bq_template
44 | , input_variables=['prompt', 'table']
45 | )
46 |
47 | BQ_PREFIX = """
48 | LIMIT TO ONLY 100 ROWS - e.g. LIMIT 100
49 | REMOVE all observation output that has any special characters , or \n
50 | you are a helpful agent that knows how to use bigquery
51 | you are using sqlalchemy {dialect}
52 | Check the table schemas before constructing sql
53 | Only use the information returned by the below tools to construct your final answer.\nYou MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.
54 | DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.\n\n
55 | REMOVE all observation output that has any special characters , or \n
56 | you are a helpful agent that knows how to use bigquery
57 | READ THE SCHEMA BEFORE YOU WRITE QUERIES
58 | DOUBLE CHECK YOUR QUERY LOGIC
59 | you are using sqlalchemy for Big Query
60 | ALL QUERIES MUST HAVE LIMIT 100 at the end of them
61 | Check the table schemas before constructing sql
62 | Only use the information returned by the below tools to construct your final answer.\nYou MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.
63 | DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.\n\n
64 | If you don't use a where statement in your SQL - there will be problems.
65 | To get hints on the field contents, consider a SELECT DISTINCT - a WHERE clause is not required there given the low cardinality of the data set
66 | make sure you prepend the table name with the schema: eg: schema.tablename
67 | MAKE SURE the FROM statement includes the schema like so: schema.tablename
68 | THERE MUST BE A WHERE CLAUSE IN THIS BECAUSE YOU DON'T HAVE ENOUGH MEMORY TO STORE LOCAL RESULTS
69 | do not use the same action as you did in any prior step
70 | MAKE SURE YOU DO NOT REPEAT THOUGHTS - if a thought is the same as a prior thought in the chain, come up with another one
71 | """
72 |
73 | bq_agent_llm = VertexLLM(stop=['Observation:'], # stop on 'Observation:' to avoid hallucinated tool observations
74 | strip=True, # strip special characters from model output for the BQ agent
75 | temperature=0.0,
76 | max_output_tokens=1000,
77 | top_p=0.7,
78 | top_k=40,
79 | )
80 |
81 | pandas_agent_llm = VertexLLM(stop=['Observation:'], # stop on 'Observation:' to avoid hallucinations with the pandas agent
82 | strip=False, # keep special characters; stripping is only needed for the BQ agent
83 | temperature=0.0,
84 | max_output_tokens=1000,
85 | top_p=0.7,
86 | top_k=40,
87 | )
88 |
89 | vectorstore_agent_llm = VertexLLM(stop=['Observation:'], # stop on 'Observation:' to avoid hallucinated tool observations
90 | strip=False, # keep special characters; stripping is only needed for the BQ agent
91 | temperature=0.0,
92 | max_output_tokens=1000,
93 | top_p=0.7,
94 | top_k=40,
95 | )
96 |
97 |
98 | base_llm = VertexLLM(stop=None, # no stop sequence for the general-purpose base LLM
99 | temperature=0.0,
100 | max_output_tokens=1000,
101 | top_p=0.7,
102 | top_k=40
103 | )
104 |
105 |
106 | class MyCustomHandler(BaseCallbackHandler):
107 | def rate_limit(self):
108 | time.sleep(1/QPS)
109 |
110 | def on_llm_start(self, serialized: Dict[str, Any], prompts: List[str],
111 | **kwargs: Any) -> Any:
112 | pass
113 |
114 | def on_llm_new_token(self, token: str, **kwargs: Any) -> Any:
115 | self.rate_limit()
116 | pass
117 |
118 | def on_llm_end(self, response: LLMResult, **kwargs: Any) -> Any:
119 | pass
120 |
121 | def on_llm_error(self, error: Union[Exception, KeyboardInterrupt],
122 | **kwargs: Any) -> Any:
123 | pass
124 |
125 | def on_chain_start(self, serialized: Dict[str, Any], inputs: Dict[str, Any],
126 | **kwargs: Any) -> Any:
127 | logging.info(serialized)
128 | pass
129 |
130 | def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> Any:
131 | pass
132 |
133 | def on_chain_error(self, error: Union[Exception, KeyboardInterrupt],
134 | **kwargs: Any) -> Any:
135 | pass
136 |
137 | def on_tool_start(self, serialized: Dict[str, Any], input_str: str,
138 | **kwargs: Any) -> Any:
139 | logging.info(serialized)
140 | pass
141 |
142 | def on_tool_end(self, output: str, **kwargs: Any) -> Any:
143 | pass
144 |
145 | def on_tool_error(self, error: Union[Exception, KeyboardInterrupt],
146 | **kwargs: Any) -> Any:
147 | pass
148 |
149 | def on_agent_action(self, action: AgentAction, **kwargs: Any) -> Any:
150 | logging.info(action)
151 | pass
152 |
153 | def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> Any:
154 | pass
155 |
156 | def on_text(self, text: str, **kwargs: Any) -> Any:
157 | """Run on arbitrary text."""
158 | # return str(text[:4000]) #character limiter
159 | # self.rate_limit()
160 |
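161 | # Prompt usage sketch (illustrative values):
162 | #   bq_prompt.format(prompt='top rising search terms this week', table='my_dataset.top_rising_terms')
163 | #   vector_prompt.format(name='news_vectorstore', question='what happened to the brand this month?')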
--------------------------------------------------------------------------------
/zeitghost/agents/LangchainAgent.py:
--------------------------------------------------------------------------------
1 | #from langchain import LLMChain
2 | from langchain.chains import LLMChain
3 | from langchain.agents import create_sql_agent
4 | from langchain.agents.agent_toolkits import SQLDatabaseToolkit
5 | from langchain.agents import create_pandas_dataframe_agent
6 | from langchain.agents import create_vectorstore_agent
7 | from langchain.agents.agent_toolkits import VectorStoreInfo
8 | from langchain.agents.agent_toolkits import VectorStoreToolkit
9 | from langchain.agents.agent import AgentExecutor
10 | from langchain.schema import LLMResult
11 | from langchain.sql_database import SQLDatabase
12 | from zeitghost.agents.Helpers import core_prompt, vector_prompt
13 | from zeitghost.agents.Helpers import BQ_PREFIX, bq_template, core_template, vector_template
14 | from zeitghost.agents.Helpers import base_llm, bq_agent_llm, pandas_agent_llm, vectorstore_agent_llm
15 | from zeitghost.agents.Helpers import MyCustomHandler
16 | from zeitghost.vertex.MatchingEngineVectorstore import MatchingEngineVectorStore
17 | from zeitghost.vertex.LLM import VertexLLM
18 | from langchain.callbacks.manager import CallbackManager
19 |
20 |
21 | class LangchainAgent:
22 | """
23 | Builds LLM agents that can be asked natural-language questions.
24 | Contains agents for Pandas DataFrames, BigQuery, and Vertex Matching Engine.
25 | """
26 | callback_handler = MyCustomHandler()
27 | callback_manager = CallbackManager([callback_handler])
28 |
29 | def get_vectorstore_agent(
30 | self
31 | , vectorstore: MatchingEngineVectorStore
32 | , vectorstore_name: str
33 | , vectorstore_description: str
34 | , llm: VertexLLM = vectorstore_agent_llm
35 | ) -> AgentExecutor:
36 | """
37 | Gets a langchain agent to query against a Matching Engine vectorstore
38 |
39 | :param llm: zeitghost.vertex.LLM.VertexLLM
40 | :param vectorstore_description: str
41 | :param vectorstore_name: str
42 | :param vectorstore: zeitghost.vertex.MatchingEngineVectorstore.MatchingEngineVectorStore
43 |
44 | :return langchain.agents.agent.AgentExecutor:
45 | """
46 | vectorstore_info = VectorStoreInfo(
47 | name=vectorstore_name
48 | , description=vectorstore_description
49 | , vectorstore=vectorstore
50 | )
51 | vectorstore_toolkit = VectorStoreToolkit(
52 | vectorstore_info=vectorstore_info
53 | , llm=llm
54 | )
55 | return create_vectorstore_agent(
56 | llm=llm
57 | , toolkit=vectorstore_toolkit
58 | , verbose=True
59 | , callback_manager=self.callback_manager
60 | , return_intermediate_steps=True
61 | )
62 |
63 | def get_pandas_agent(
64 | self
65 | , dataframe
66 | , llm=pandas_agent_llm
67 | ) -> AgentExecutor:
68 | """
69 | Gets a langchain agent to query against a pandas dataframe
70 |
71 | :param llm: zeitghost.vertex.LLM.VertexLLM
72 | :param dataframe: pandas.DataFrame
73 | Input dataframe for agent to interact with
74 |
75 | :return: langchain.agents.agent.AgentExecutor
76 | """
77 | return create_pandas_dataframe_agent(
78 | llm=llm
79 | , df=dataframe
80 | , verbose=True
81 | , callback_manager=self.callback_manager
82 | , return_intermediate_steps=True
83 | )
84 |
85 | def get_bigquery_agent(
86 | self
87 | , project_id='cpg-cdp'
88 | , dataset='google_trends_my_project'
89 | , llm=bq_agent_llm
90 | ) -> AgentExecutor:
91 | """
92 | Gets a langchain agent to query against a BigQuery dataset
93 |
94 | :param llm: zeitghost.vertex.LLM.VertexLLM
95 | :param dataset:
96 | :param project_id: str
97 | Google Cloud Project ID
98 |
99 | :return: langchain.agents.agent.AgentExecutor
100 | """
101 | db = SQLDatabase.from_uri(f"bigquery://{project_id}/{dataset}")
102 | toolkit = SQLDatabaseToolkit(llm=llm, db=db)
103 |
104 | return create_sql_agent(
105 | llm=llm
106 | , toolkit=toolkit
107 | , verbose=True
108 | , prefix=BQ_PREFIX
109 | , callback_manager=self.callback_manager
110 | , return_intermediate_steps=True
111 | )
112 |
113 | def query_bq_agent(
114 | self
115 | , agent: AgentExecutor
116 | , table: str
117 | , prompt: str
118 | ) -> str:
119 | """
120 | Queries a BQ Agent given a table and a prompt.
121 |
122 | :param agent: AgentExecutor
123 | :param table: str
124 | Table to ask question against
125 | :param prompt: str
126 | Question prompt
127 |
128 | :return: str
129 | """
130 |
131 | return agent.run(
132 | bq_template.format(prompt=prompt, table=table)
133 | )
134 |
135 | def query_pandas_agent(
136 | self
137 | , agent: AgentExecutor
138 | , prompt: str
139 | ) -> str:
140 | """
141 | Queries a Pandas DataFrame agent given a prompt.
142 |
143 | :param agent: AgentExecutor
144 | :param prompt: str
145 | Question prompt
146 |
147 | :return: str
148 | """
149 |
150 | return agent.run(
151 | core_template.format(question=prompt)
152 | )
153 |
154 | def query_vectorstore_agent(
155 | self
156 | , agent: AgentExecutor
157 | , prompt: str
158 | , vectorstore_name: str
159 | ):
160 | """
161 | Queries a VectorStore Agent given a prompt
162 |
163 | :param vectorstore_name:
164 | :param agent: AgentExecutor
165 | :param prompt: str
166 |
167 | :return: str
168 | """
169 | return agent.run(
170 | vector_template.format(question=prompt, name=vectorstore_name)
171 | )
172 |
173 | def chain_questions(self, questions) -> LLMResult:
174 | """
175 | Executes a chain of questions against the configured LLM
176 | :param questions: list(str)
177 | A list of questions to ask the llm
178 |
179 | :return: langchain.schema.LLMResult
180 | """
181 | llm_chain = LLMChain(prompt=core_prompt, llm=vectorstore_agent_llm)
182 | res = llm_chain.generate(questions)
183 |
184 | return res
185 |
186 |
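187 | # Usage sketch (illustrative project/dataset/table names):
188 | #   agent_factory = LangchainAgent()
189 | #   bq_agent = agent_factory.get_bigquery_agent(project_id='my-project', dataset='my_dataset')
190 | #   answer = agent_factory.query_bq_agent(bq_agent, table='top_rising_terms', prompt='what are the top terms?')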
--------------------------------------------------------------------------------
/zeitghost/agents/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/agents/__init__.py
--------------------------------------------------------------------------------
/zeitghost/agents/__pycache__/Helpers.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/agents/__pycache__/Helpers.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/agents/__pycache__/LangchainAgent.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/agents/__pycache__/LangchainAgent.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/agents/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/agents/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/bigquery/BigQueryAccessor.py:
--------------------------------------------------------------------------------
1 | from google.cloud import bigquery
2 | from google.cloud.bigquery import QueryJob
3 | from google.cloud.bigquery.table import RowIterator
4 | import pandas as pd
5 |
6 |
7 | class BigQueryAccessor:
8 | """
9 | Interface for querying BigQuery.
10 | """
11 | def __init__(self
12 | , project_id
13 | , gdelt_project_id='gdelt-bq'
14 | , gdelt_dataset_id='gdeltv2'
15 | , gdelt_table_name='events'):
16 | """
17 | :param gdelt_project_id: str
18 | Project ID hosting the public GDELT dataset
19 | :param gdelt_dataset_id: str
20 | Dataset ID of the GDELT data
21 | :param gdelt_table_name: str
22 | Table name of the GDELT events data
23 | """
24 | self.project_id = project_id
25 | self.gdelt_project_id = gdelt_project_id
26 | self.gdelt_dataset_id = gdelt_dataset_id
27 | self.gdelt_table_name = gdelt_table_name
28 | self.client = bigquery.Client(project=self.project_id)
29 |
30 | def _query_bq(self, query_string: str) -> QueryJob:
31 | """
32 |
33 | :param query_string: str
34 | Full SQL query string to execute against BigQuery
35 |
36 | :return: google.cloud.bigquery.job.QueryJob
37 | """
38 | return self.client.query(query_string)
39 |
40 | def get_records_from_sourceurl(self, source_url) -> RowIterator:
41 | """
42 | Retrieve article record from Gdelt dataset given a source_url
43 |
44 | :param source_url: str
45 |
46 | :return: google.cloud.bigquery.table.RowIterator
47 | """
48 | query = f"""
49 | SELECT
50 | max(SQLDATE) as SQLDATE,
51 | max(Actor1Name) as Actor1Name,
52 | max(Actor2Name) as Actor2Name,
53 | avg(GoldsteinScale) as GoldsteinScale,
54 | max(NumMentions) as NumMentions,
55 | max(NumSources) as NumSources,
56 | max(NumArticles) as NumArticles,
57 | avg(AvgTone) as AvgTone,
58 | SOURCEURL as SOURCEURL,
59 | SOURCEURL as url
60 | FROM `{self.gdelt_project_id}.{self.gdelt_dataset_id}.{self.gdelt_table_name}`
61 | WHERE lower(SOURCEURL) like '%{source_url}%'
62 | GROUP BY SOURCEURL
63 | """
64 |
65 | return self._query_bq(query_string=query).result()
66 |
67 | def get_records_from_sourceurl_df(self, source_url):
68 | """
69 | Retrieve article record from Gdelt dataset given a source_url
70 |
71 | :param source_url: str
72 |
73 | :return: pandas.DataFrame
74 | """
75 | response = self.get_records_from_sourceurl(source_url)
76 |
77 | return response.to_dataframe()
78 |
79 | def get_records_from_actor_keyword(self
80 | , keyword: str
81 | , min_date: str = "2023-01-01"
82 | , max_date: str = "2023-05-30"
83 | ) -> RowIterator:
84 | """
85 | Retrieve BQ records given input keyword
86 |
87 | :param keyword: str
88 | Keyword used for filtering actor names
89 |
90 | :return: google.cloud.bigquery.table.RowIterator
91 | """
92 |
93 | query = f"""
94 | SELECT
95 | max(SQLDATE) as SQLDATE,
96 | PARSE_DATE('%Y%m%d', CAST(max(SQLDATE) AS STRING)) as new_date,
97 | max(Actor1Name) as Actor1Name,
98 | max(Actor2Name) as Actor2Name,
99 | avg(GoldsteinScale) as GoldsteinScale,
100 | max(NumMentions) as NumMentions,
101 | max(NumSources) as NumSources,
102 | max(NumArticles) as NumArticles,
103 | avg(AvgTone) as AvgTone,
104 | SOURCEURL as SOURCEURL,
105 | SOURCEURL as url
106 | FROM `{self.gdelt_project_id}.{self.gdelt_dataset_id}.{self.gdelt_table_name}`
107 | WHERE lower(SOURCEURL) != 'unspecified'
108 | AND
109 | (
110 | REGEXP_CONTAINS(LOWER(Actor1Name),'{keyword.lower()}')
111 | OR REGEXP_CONTAINS(LOWER(Actor2Name), '{keyword.lower()}')
112 | )
113 | AND PARSE_DATE('%Y%m%d', CAST(SQLDATE AS STRING)) >= "{min_date}"
114 | AND PARSE_DATE('%Y%m%d', CAST(SQLDATE AS STRING)) <= "{max_date}"
115 | GROUP BY url
116 | """
117 |
118 | return self._query_bq(query_string=query).result()
119 |
120 | def get_records_from_actor_keyword_df(self
121 | , keyword: str
122 | , min_date: str = "2023-01-01"
123 | , max_date: str = "2023-05-30"
124 | ) -> pd.DataFrame:
125 | """
126 | Retrieves BQ records given input actor info
127 |
128 | :param keyword: str
129 |
130 | :return: pandas.DataFrame
131 | """
132 | response = self.get_records_from_actor_keyword(keyword, min_date, max_date)
133 |
134 | return response.to_dataframe()
135 |
136 | def get_term_set(self
137 | , project_id='cpg-cdp'
138 | , dataset='bigquery-public-data'
139 | , table_id='top_terms'
140 | ) -> RowIterator:
141 | """
142 | Simple function to get the unique, sorted terms in the table
143 |
144 | :param project_id: str
145 | project_id that holds the dataset.
146 | :param dataset: str
147 | dataset name that holds the table.
148 | :param table_id: str
149 | table name
150 |
151 | :return: google.cloud.bigquery.table.RowIterator
152 | """
153 |
154 | query = f"""
155 | SELECT distinct
156 | term
157 | FROM `{project_id}.{dataset}.{table_id}`
158 | order by 1
159 | """
160 |
161 | return self._query_bq(query_string=query).result()
162 |
163 | def get_term_set_df(self
164 | , project_id='cpg-cdp'
165 | , dataset='trends_data'
166 | , table_id='makeupcosmetics_10054_unitedstates_2840'
167 | ) -> list:
168 | """
169 | Simple function to get the unique, sorted terms in the table
170 |
171 | :param project_id: str
172 | project_id that holds the dataset.
173 | :param dataset: str
174 | dataset name that holds the table.
175 | :param table_id: str
176 | table name
177 |
178 | :return: pandas.DataFrame
179 | """
180 | df = self.get_term_set(project_id, dataset, table_id).to_dataframe()
181 |
182 | return df["term"].to_list()
183 |
184 | def pull_term_data_from_bq(self
185 | , term: tuple = ('mascara', 'makeup')
186 | , project_id='bigquery-public-data'
187 | , dataset='google_trends'
188 | , table_id='top_rising_terms'
189 | ) -> RowIterator:
190 | """
191 | Pull terms based on `in` sql clause from term
192 | takes a tuple of terms (str) and produces a BigQuery row iterator
193 |
194 | :param term: tuple(str)
195 | A tuple of terms to query for
196 | :param project_id: str
197 | project_id that holds the dataset.
198 | :param dataset: str
199 | dataset name that holds the table.
200 | :param table_id: str
201 | table name
202 |
203 | :return: google.cloud.bigquery.table.RowIterator
204 | """
205 | query = f"""
206 | SELECT
207 | week,
208 | term,
209 | rank
210 | FROM `{project_id}.{dataset}.{table_id}`
211 | WHERE
212 | lower(term) in {term}
213 | order by term, 1
214 | """
215 |
216 | return self._query_bq(query_string=query).result()
217 |
218 | def pull_term_data_from_bq_df(self
219 | , term: tuple = ('mascara', 'makeup')
220 | , project_id='bigquery-public-data'
221 | , dataset='google_trends'
222 | , table_id='top_rising_terms'
223 | ) -> pd.DataFrame:
224 | """
225 | Pull terms based on `in` sql clause from term
226 | takes a tuple of terms (str) and produces pandas dataset
227 |
228 | :param term: tuple(str)
229 | A tuple of terms to query for
230 | :param project_id: str
231 | project_id that holds the dataset.
232 | :param dataset: str
233 | dataset name that holds the table.
234 | :param table_id: str
235 | table name
236 |
237 | :return: pandas.DataFrame
238 | """
239 | result = self.pull_term_data_from_bq(term, project_id, dataset, table_id)
240 |
241 | return result.to_dataframe()
242 |
243 | def pull_regexp_term_data_from_bq(self
244 | , term: str
245 | , project_id='bigquery-public-data'
246 | , dataset='google_trends'
247 | , table_id='top_rising_terms'
248 | ) -> RowIterator:
249 | """
250 | Pull terms whose name matches a regular expression
251 | takes a single regexp string and produces a BigQuery row iterator
252 |
253 | :param term: str
254 | A regular expression to match terms against
255 | :param project_id: str
256 | project_id that holds the dataset.
257 | :param dataset: str
258 | dataset name that holds the table.
259 | :param table_id: str
260 | table name
261 |
262 | :return: google.cloud.bigquery.table.RowIterator
263 | """
264 | query = f"""
265 | SELECT
266 | week,
267 | term,
268 | rank
269 | FROM `{project_id}.{dataset}.{table_id}`
270 | WHERE (
271 | REGEXP_CONTAINS(LOWER(term), r'{term}')
272 | )
273 | order by term
274 | """
275 |
276 | return self._query_bq(query_string=query).result()
277 |
278 | def get_entity_from_geg_full(self
279 | , entity: str
280 | , min_date: str = "2023-01-01"
281 | ) -> RowIterator:
282 | entity_lower = entity.lower()
283 |
284 | query = f"""
285 | WITH
286 | entities AS (
287 | SELECT
288 | b.*,
289 | url
290 | FROM
291 | `{self.gdelt_project_id}.{self.gdelt_dataset_id}.geg_gcnlapi` AS a,
292 | UNNEST(a.entities) AS b
293 | WHERE
294 | LOWER(b.name) LIKE '%{entity_lower}%'
295 | AND DATE(date) >= '{min_date}' )
296 | SELECT
297 | *
298 | FROM
299 | `gdelt-bq.gdeltv2.geg_gcnlapi` a
300 | INNER JOIN
301 | entities AS b
302 | ON
303 | a.url = b.url
304 | WHERE
305 | DATE(date) >= '{min_date}'
306 | """
307 |
308 | return self._query_bq(query_string=query).result()
309 |
310 | def get_entity_from_geg_full_df(self
311 | , entity: str
312 | , min_date: str = "2023-01-01"):
313 | result = self.get_entity_from_geg_full(entity, min_date)
314 |
315 | return result.to_dataframe()
316 |
317 |
318 | def get_geg_entities_data(
319 | self
320 | , entity: str
321 | , min_date: str = "2023-01-01"
322 | , max_date: str = "2023-05-17"
323 | ) -> RowIterator:
324 |
325 | query = f"""
326 | WITH geg_data AS ((
327 | SELECT
328 | groupId,
329 | entity_type,
330 | a.entity as entity_name,
331 | a.numMentions,
332 | a.avgSalience,
333 | eventTime,
334 | polarity,
335 | magnitude,
336 | score,
337 | mid,
338 | wikipediaUrl
339 | FROM (
340 | SELECT
341 | polarity,
342 | magnitude,
343 | score,
344 | FARM_FINGERPRINT(url) groupId,
345 | entity.type AS entity_type,
346 | FORMAT_TIMESTAMP("%Y-%m-%d", date, "UTC") eventTime,
347 | entity.mid AS mid,
348 | entity.wikipediaUrl AS wikipediaUrl
349 | FROM `gdelt-bq.gdeltv2.geg_gcnlapi`,
350 | UNNEST(entities) entity
351 | WHERE entity.mid is not null
352 | AND LOWER(name) LIKE '%{entity}%'
353 | AND lang='en'
354 | AND DATE(date) >= "{min_date}"
355 | AND DATE(date) <= "{max_date}"
356 | ) b JOIN (
357 | # grab the entities from the nested json in the graph
358 | SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity,
359 | entities.mid mid,
360 | sum(entities.numMentions) as numMentions,
361 | avg(entities.avgSalience) as avgSalience
362 | FROM `gdelt-bq.gdeltv2.geg_gcnlapi`,
363 | UNNEST(entities) entities where entities.mid is not null
364 | AND lang='en'
365 | AND DATE(date) >= "{min_date}"
366 | AND DATE(date) <= "{max_date}"
367 | GROUP BY entities.mid
368 | ) a USING(mid)))
369 | SELECT *
370 | FROM ( SELECT *, RANK() OVER (PARTITION BY eventTime ORDER BY numMentions desc) as rank # get ranks
371 | FROM (
372 | SELECT
373 | entity_name,
374 | max(entity_type) AS entity_type,
375 | DATE(eventTime) AS eventTime,
376 | sum(numMentions) as numMentions,
377 | avg(magnitude) as avgMagnitude,
378 | max(mid) AS mid,
379 | max(wikipediaUrl) AS wikipediaUrl,
380 | FROM geg_data
381 | GROUP BY 1,3
382 | ) grouped_all
383 | )
384 | WHERE rank < 300
385 | """
386 |
387 | return self._query_bq(query_string=query).result()
388 |
389 | def get_geg_entities_data_full_df(
390 | self
391 | , entity: str
392 | , min_date: str = "2023-01-01"
393 | , max_date: str = "2023-05-17"
394 | ):
395 | result = self.get_geg_entities_data(entity, min_date, max_date)
396 |
397 | return result.to_dataframe()
398 |
399 | def get_geg_article_data(
400 | self
401 | , entity: str
402 | , min_date: str = "2023-01-01"
403 | , max_date: str = "2023-05-17"
404 | ) -> RowIterator:
405 |
406 | # here
407 |
408 | query = f"""
409 | WITH geg_data AS ((
410 | SELECT
411 | groupId,
412 | url,
413 | name,
414 | -- a.entity AS entity_name,
415 | wikipediaUrl,
416 | a.numMentions AS numMentions,
417 | a.avgSalience AS avgSalience,
418 | DATE(eventTime) AS eventTime,
419 | polarity,
420 | magnitude,
421 | score
422 | FROM (
423 | SELECT
424 | name,
425 | polarity,
426 | magnitude,
427 | score,
428 | url,
429 | FARM_FINGERPRINT(url) AS groupId,
430 | CONCAT(entity.type," - ",entity.type) AS entity_id,
431 | FORMAT_TIMESTAMP("%Y-%m-%d", date, "UTC") AS eventTime,
432 | entity.mid AS mid,
433 | entity.wikipediaUrl AS wikipediaUrl ,
434 | entity.numMentions AS numMentions
435 | FROM `gdelt-bq.gdeltv2.geg_gcnlapi`,
436 | UNNEST(entities) entity
437 | WHERE entity.mid is not null
438 | AND LOWER(name) LIKE '%{entity}%'
439 | AND lang='en'
440 | AND DATE(date) >= "{min_date}"
441 | AND DATE(date) <= "{max_date}"
442 | ) b JOIN (
443 | # grab the entities from the nested json in the graph
444 | SELECT APPROX_TOP_COUNT(entities.name, 1)[OFFSET(0)].value entity,
445 | entities.mid mid,
446 | sum(entities.numMentions) as numMentions,
447 | avg(entities.avgSalience) as avgSalience
448 | FROM `gdelt-bq.gdeltv2.geg_gcnlapi`,
449 | UNNEST(entities) entities
450 | WHERE
451 | entities.mid is not null AND
452 | lang='en'
453 | AND DATE(date) >= "{min_date}"
454 | AND DATE(date) <= "{max_date}"
455 | GROUP BY entities.mid
456 | ) a USING(mid)))
457 | SELECT *
458 | FROM ( SELECT *, RANK() OVER (PARTITION BY eventTime ORDER BY numMentions desc) as rank # get ranks
459 | FROM (
460 | SELECT
461 | -- ARRAY_AGG(entity_name) as entity_names,
462 | STRING_AGG(name) as entity_names,
463 | max(eventTime) AS eventTime,
464 | url,
465 | avg(numMentions) AS numMentions,
466 | avg(avgSalience) AS avgSalience,
467 | --sum(numMentions) as numMentions,
468 | --avg(magnitude) as avgMagnitude
469 | FROM geg_data
470 | GROUP BY url
471 | )
472 | -- grouped_all
473 | )
474 | WHERE rank < 300
475 | """
476 |
477 | return self._query_bq(query_string=query).result()
478 |
479 | def get_geg_article_data_full_df(
480 | self
481 | , entity: str
482 | , min_date: str = "2023-01-01"
483 | , max_date: str = "2023-05-26"
484 | ):
485 | result = self.get_geg_article_data(entity, min_date, max_date)
486 |
487 | return result.to_dataframe()
488 |
489 |
490 | def get_geg_article_data_v2(
491 | self
492 | , entity: str
493 | , min_date: str = "2023-01-01"
494 | , max_date: str = "2023-05-26"
495 | ) -> RowIterator:
496 |
497 | # TODO - add arg for avgSalience
498 |
499 | query = f"""
500 | WITH
501 | entities AS (
502 | SELECT
503 | distinct url,
504 | b.avgSalience AS avgSalience,
505 | date AS date
506 | FROM
507 | `gdelt-bq.gdeltv2.geg_gcnlapi` AS a,
508 | UNNEST(a.entities) AS b
509 | WHERE
510 | LOWER(b.name) LIKE '%{entity}%'
511 | AND DATE(date) >= "{min_date}"
512 | AND DATE(date) <= "{max_date}"
513 | AND b.avgSalience > 0.1 )
514 | SELECT
515 | entities.url AS url,
516 | -- entities.url AS source,
517 | entities.date,
518 | -- a.polarity,
519 | -- a.magnitude,
520 | -- a.score,
521 | avgSalience
522 | FROM entities inner join `gdelt-bq.gdeltv2.geg_gcnlapi` AS a
523 | ON a.url=entities.url
524 | AND a.date=entities.date
525 | """
526 | return self._query_bq(query_string=query).result()
527 |
528 |
529 |
530 | def get_geg_article_data_v2_full_df(
531 | self
532 | , entity: str
533 | , min_date: str = "2023-01-01"
534 | , max_date: str = "2023-05-26"
535 | ):
536 | result = self.get_geg_article_data_v2(entity, min_date, max_date)
537 |
538 | return result.to_dataframe()
539 |
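540 | # Usage sketch (illustrative project and keyword):
541 | #   accessor = BigQueryAccessor(project_id='my-project')
542 | #   mentions_df = accessor.get_records_from_actor_keyword_df('acme', min_date='2023-01-01', max_date='2023-05-30')
543 | #   terms_df = accessor.pull_term_data_from_bq_df(term=('mascara', 'makeup'))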
--------------------------------------------------------------------------------
/zeitghost/bigquery/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/bigquery/__init__.py
--------------------------------------------------------------------------------
/zeitghost/bigquery/__pycache__/BigQueryAccessor.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/bigquery/__pycache__/BigQueryAccessor.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/bigquery/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/bigquery/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/capturing_callback_handler.py:
--------------------------------------------------------------------------------
1 | """Callback Handler captures all callbacks in a session for future offline playback."""
2 |
3 | from __future__ import annotations
4 |
5 | import pickle
6 | import time
7 | from typing import Any, TypedDict
8 |
9 | from langchain.callbacks.base import BaseCallbackHandler
10 |
11 |
12 | # This is intentionally not an enum so that we avoid serializing a
13 | # custom class with pickle.
14 | class CallbackType:
15 | ON_LLM_START = "on_llm_start"
16 | ON_LLM_NEW_TOKEN = "on_llm_new_token"
17 | ON_LLM_END = "on_llm_end"
18 | ON_LLM_ERROR = "on_llm_error"
19 | ON_TOOL_START = "on_tool_start"
20 | ON_TOOL_END = "on_tool_end"
21 | ON_TOOL_ERROR = "on_tool_error"
22 | ON_TEXT = "on_text"
23 | ON_CHAIN_START = "on_chain_start"
24 | ON_CHAIN_END = "on_chain_end"
25 | ON_CHAIN_ERROR = "on_chain_error"
26 | ON_AGENT_ACTION = "on_agent_action"
27 | ON_AGENT_FINISH = "on_agent_finish"
28 |
29 |
30 | # We use TypedDict, rather than NamedTuple, so that we avoid serializing a
31 | # custom class with pickle. All of this class's members should be basic Python types.
32 | class CallbackRecord(TypedDict):
33 | callback_type: str
34 | args: tuple[Any, ...]
35 | kwargs: dict[str, Any]
36 | time_delta: float # Number of seconds between this record and the previous one
37 |
38 |
39 | def load_records_from_file(path: str) -> list[CallbackRecord]:
40 | """Load the list of CallbackRecords from a pickle file at the given path."""
41 | with open(path, "rb") as file:
42 | records = pickle.load(file)
43 |
44 | if not isinstance(records, list):
45 | raise RuntimeError(f"Bad CallbackRecord data in {path}")
46 | return records
47 |
48 |
49 | def playback_callbacks(
50 | handlers: list[BaseCallbackHandler],
51 | records_or_filename: list[CallbackRecord] | str,
52 | max_pause_time: float,
53 | ) -> str:
54 | if isinstance(records_or_filename, list):
55 | records = records_or_filename
56 | else:
57 | records = load_records_from_file(records_or_filename)
58 |
59 | for record in records:
60 | pause_time = min(record["time_delta"], max_pause_time)
61 | if pause_time > 0:
62 | time.sleep(pause_time)
63 |
64 | for handler in handlers:
65 | if record["callback_type"] == CallbackType.ON_LLM_START:
66 | handler.on_llm_start(*record["args"], **record["kwargs"])
67 | elif record["callback_type"] == CallbackType.ON_LLM_NEW_TOKEN:
68 | handler.on_llm_new_token(*record["args"], **record["kwargs"])
69 | elif record["callback_type"] == CallbackType.ON_LLM_END:
70 | handler.on_llm_end(*record["args"], **record["kwargs"])
71 | elif record["callback_type"] == CallbackType.ON_LLM_ERROR:
72 | handler.on_llm_error(*record["args"], **record["kwargs"])
73 | elif record["callback_type"] == CallbackType.ON_TOOL_START:
74 | handler.on_tool_start(*record["args"], **record["kwargs"])
75 | elif record["callback_type"] == CallbackType.ON_TOOL_END:
76 | handler.on_tool_end(*record["args"], **record["kwargs"])
77 | elif record["callback_type"] == CallbackType.ON_TOOL_ERROR:
78 | handler.on_tool_error(*record["args"], **record["kwargs"])
79 | elif record["callback_type"] == CallbackType.ON_TEXT:
80 | handler.on_text(*record["args"], **record["kwargs"])
81 | elif record["callback_type"] == CallbackType.ON_CHAIN_START:
82 | handler.on_chain_start(*record["args"], **record["kwargs"])
83 | elif record["callback_type"] == CallbackType.ON_CHAIN_END:
84 | handler.on_chain_end(*record["args"], **record["kwargs"])
85 | elif record["callback_type"] == CallbackType.ON_CHAIN_ERROR:
86 | handler.on_chain_error(*record["args"], **record["kwargs"])
87 | elif record["callback_type"] == CallbackType.ON_AGENT_ACTION:
88 | handler.on_agent_action(*record["args"], **record["kwargs"])
89 | elif record["callback_type"] == CallbackType.ON_AGENT_FINISH:
90 | handler.on_agent_finish(*record["args"], **record["kwargs"])
91 |
92 | # Return the agent's result
93 | for record in records:
94 | if record["callback_type"] == CallbackType.ON_AGENT_FINISH:
95 | return record["args"][0][0]["output"]
96 |
97 | return "[Missing Agent Result]"
98 |
99 |
100 | class CapturingCallbackHandler(BaseCallbackHandler):
101 | def __init__(self) -> None:
102 | self._records: list[CallbackRecord] = []
103 | self._last_time: float | None = None
104 |
105 | def dump_records_to_file(self, path: str) -> None:
106 | """Write the list of CallbackRecords to a pickle file at the given path."""
107 | with open(path, "wb") as file:
108 | pickle.dump(self._records, file)
109 |
110 | def _append_record(
111 | self, type: str, args: tuple[Any, ...], kwargs: dict[str, Any]
112 | ) -> None:
113 | time_now = time.time()
114 | time_delta = time_now - self._last_time if self._last_time is not None else 0
115 | self._last_time = time_now
116 | self._records.append(
117 | CallbackRecord(
118 | callback_type=type, args=args, kwargs=kwargs, time_delta=time_delta
119 | )
120 | )
121 |
122 | def on_llm_start(self, *args: Any, **kwargs: Any) -> None:
123 | self._append_record(CallbackType.ON_LLM_START, args, kwargs)
124 |
125 | def on_llm_new_token(self, *args: Any, **kwargs: Any) -> None:
126 | self._append_record(CallbackType.ON_LLM_NEW_TOKEN, args, kwargs)
127 |
128 | def on_llm_end(self, *args: Any, **kwargs: Any) -> None:
129 | self._append_record(CallbackType.ON_LLM_END, args, kwargs)
130 |
131 | def on_llm_error(self, *args: Any, **kwargs: Any) -> None:
132 | self._append_record(CallbackType.ON_LLM_ERROR, args, kwargs)
133 |
134 | def on_tool_start(self, *args: Any, **kwargs: Any) -> None:
135 | self._append_record(CallbackType.ON_TOOL_START, args, kwargs)
136 |
137 | def on_tool_end(self, *args: Any, **kwargs: Any) -> None:
138 | self._append_record(CallbackType.ON_TOOL_END, args, kwargs)
139 |
140 | def on_tool_error(self, *args: Any, **kwargs: Any) -> None:
141 | self._append_record(CallbackType.ON_TOOL_ERROR, args, kwargs)
142 |
143 | def on_text(self, *args: Any, **kwargs: Any) -> None:
144 | self._append_record(CallbackType.ON_TEXT, args, kwargs)
145 |
146 | def on_chain_start(self, *args: Any, **kwargs: Any) -> None:
147 | self._append_record(CallbackType.ON_CHAIN_START, args, kwargs)
148 |
149 | def on_chain_end(self, *args: Any, **kwargs: Any) -> None:
150 | self._append_record(CallbackType.ON_CHAIN_END, args, kwargs)
151 |
152 | def on_chain_error(self, *args: Any, **kwargs: Any) -> None:
153 | self._append_record(CallbackType.ON_CHAIN_ERROR, args, kwargs)
154 |
155 | def on_agent_action(self, *args: Any, **kwargs: Any) -> Any:
156 | self._append_record(CallbackType.ON_AGENT_ACTION, args, kwargs)
157 |
158 | def on_agent_finish(self, *args: Any, **kwargs: Any) -> None:
159 | self._append_record(CallbackType.ON_AGENT_FINISH, args, kwargs)
160 |
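161 | 
162 | # Example usage (sketch): capture a session's callbacks, persist them, and replay them later.
163 | # The commented-out agent call is a placeholder; any LangChain agent/chain that accepts
164 | # callbacks=[...] would populate the capture handler.
165 | if __name__ == "__main__":
166 |     capture = CapturingCallbackHandler()
167 |     # agent.run("What happened this week?", callbacks=[capture])
168 |     capture.dump_records_to_file("session.pkl")
169 | 
170 |     # Replay without re-running the agent; pauses between records are capped at half a second.
171 |     result = playback_callbacks([], "session.pkl", max_pause_time=0.5)
172 |     print(result)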
--------------------------------------------------------------------------------
/zeitghost/gdelt/GdeltData.py:
--------------------------------------------------------------------------------
1 | from urllib.parse import urlparse
2 | from collections import defaultdict
3 | from newspaper import news_pool, Article, Source
4 | import nltk
5 | from typing import Dict, Any, List
6 | import logging
7 | from google.cloud import storage
8 | from google.cloud import bigquery as bq
9 | from google.cloud.bigquery.table import RowIterator
10 | import pandas as pd
11 | from zeitghost.gdelt.Helpers import gdelt_processed_record
12 |
13 |
14 | #TODO:
15 | # Optimizations:
16 |     #   Generate embeddings during parsing process, then save to bq
17 | # Build in "checker" to see if we have already pulled and processed articles in bq table
18 | class GdeltData:
19 | """
20 | Gdelt query and parser class
21 | """
22 | def __init__(
23 | self
24 | , gdelt_data
25 | , destination_table: str = 'gdelt_actors'
26 | , project: str = 'cpg-cdp'
27 | , destination_dataset: str = 'genai_cap_v1'
28 | ):
29 | """
30 | :param gdelt_data: pandas.DataFrame|google.cloud.bigquery.table.RowIterator
31 | Input data for GDelt processing
32 | """
33 | logging.debug('Downloading nltk["punkt"]')
34 | nltk.download('punkt', "./")
35 | # BigQuery prepping
36 | self.__project = project
37 | self.__bq_client = bq.Client(project=self.__project)
38 | self.__location = 'us-central1'
39 | self.__destination_table = destination_table
40 | self.__destination_dataset = destination_dataset
41 | self.destination_table_id = f'{self.__destination_dataset}.{self.__destination_table}'
42 | # Prep for particulars of gdelt dataset
43 |
44 | # Builds self.gdelt_df based on incoming dataset type
45 | # TODO:
46 | if type(gdelt_data) is RowIterator:
47 | logging.debug("gdelt data came in as RowIterator")
48 | self.gdelt_df = self._row_iterator_loader(gdelt_data)
49 | elif type(gdelt_data) is pd.DataFrame:
50 | logging.debug("gdelt data came in as DataFrame")
51 | self.gdelt_df = self._dataframe_loader(gdelt_data)
52 | else:
53 | logging.error("Unrecognized datatype for input dataset")
54 |
55 | self.urllist = self.gdelt_df['url'].map(str).to_list()
56 | self.domains = [
57 | {urlparse(url).scheme + "://" + urlparse(url).hostname: url}
58 | for url in self.urllist
59 | ]
60 | self.news_sources = self._prepare_news_sources()
61 | self.full_source_data = self._parallel_parse_nlp_transform()
62 | self.chunk_df = pd.DataFrame.from_records(self.full_source_data)
63 | self.index_data = self._prepare_for_indexing()
64 |
65 | def _dataframe_loader(self, gdelt_df: pd.DataFrame) -> pd.DataFrame:
66 | logging.debug(f"DataFrame came in with columns: [{','.join(gdelt_df.columns)}]")
67 | #gdelt_df.fillna(0.0)
68 |
69 | return gdelt_df
70 |
71 | def _row_iterator_loader(self, row_iterator: RowIterator) -> pd.DataFrame:
72 | """
73 |         Concatenate the pages of a BigQuery RowIterator into a single DataFrame
74 | """
75 | # iterate over the bq result page - page size is default 100k rows or 10mb
76 | holder_df = []
77 | for df in row_iterator.to_dataframe_iterable():
78 | logging.debug(f"RowIterator came in with columns: [{','.join(df.columns)}]")
79 | tmp_df = df
80 | #tmp_df = tmp_df.fillna(0.0)
81 | holder_df.append(tmp_df)
82 |
83 | return pd.concat(holder_df)
84 |
85 | def pull_article_text(self, source_url) -> dict[str, Any]:
86 | """
87 | Process individual article for extended usage
88 |
89 | :param source_url: str
90 | url for article to download and process
91 |
92 | :return: dict
93 | """
94 | article = Article(source_url)
95 |         article.download()  # newspaper requires download() before parse()
96 |         article.parse(); article.nlp()
97 | return {
98 | "title": article.title
99 | , "text": article.text
100 | , "authors": article.authors
101 | # , "keywords": article.keywords
102 | # , "tags" : article.tags
103 | , "summary": article.summary
104 | , "publish_date": article.publish_date
105 | , "url": article.url
106 | , "language": article.meta_lang
107 | }
108 |
109 | def _prepare_news_sources(self):
110 | """
111 | Given a Gdelt record: group articles by domain, download domain level information.
112 | For each article: download articles, parse downloaded information, and do simple nlp summarization
113 |
114 | :return: List[Source]
115 | """
116 | domain_article = defaultdict(list)
117 | tmp_list = list()
118 |
119 |         # Build {domain: [Article, ...]} dictionary in preparation
120 | # for newspaper activity
121 | for entry in self.domains:
122 | for domain, article in entry.items():
123 | domain_article[domain].append(
124 | Article(article, fetch_images=False)
125 | )
126 | logging.debug("Attempting to fetch domain and article information")
127 | for domain, articles in domain_article.items():
128 | # Create Article Source
129 | tmp_domain = Source(
130 | url=domain
131 | , request_timeout=5
132 | , number_threads=2
133 | )
134 | # Download and parse top-level domain
135 | tmp_domain.download()
136 | tmp_domain.parse()
137 |
138 | # Build category information
139 | #tmp_domain.set_categories()
140 | #tmp_domain.download_categories()
141 | #tmp_domain.parse_categories()
142 |
143 | # Set articles to Articles built from urllist parameter
144 | tmp_domain.articles = articles
145 | tmp_list.append(tmp_domain)
146 | # Parallelize and download articles, with throttling
147 | news_pool.set(tmp_list, override_threads=1, threads_per_source=1)
148 | news_pool.join()
149 |
150 | # Handle articles in each domain
151 | logging.debug("Parsing and running simple nlp on articles")
152 | for domain in tmp_list:
153 | domain.parse_articles()
154 | for article in domain.articles:
155 | article.parse()
156 | article.nlp()
157 |
158 | return tmp_list
159 |
160 | def _parallel_parse_nlp_transform(self) -> List[Dict[str, Any]]:
161 | """
162 | Given a list of GDelt records, parse and process the site information.
163 | Actual data structure for dictionary is a
164 | list(zeitghost.gdelt.Helpers.gdelt_processed_records)
165 | :return: List[Dict[str, Any]]
166 | """
167 | # Prepare for final return list[dict()]
168 | logging.debug("Preparing full domain and article payloads")
169 | tmp_list = list()
170 | for src in self.news_sources:
171 | tmp = {
172 | "domain": src.domain
173 | , "url": src.url
174 | , "brand": src.brand
175 | , "description": src.description
176 | #, "categories": [category.url for category in src.categories]
177 | , "article_count": len(src.articles)
178 | , "articles": [
179 | {
180 | "title": article.title
181 | , "text": article.text
182 | , "authors": article.authors
183 | # , "keywords": article.keywords
184 | # , "tags" : article.tags
185 | , "summary": article.summary
186 | , "publish_date": article.publish_date
187 | , "url": article.url
188 | , "language": article.meta_lang
189 | , "date": self.gdelt_df[self.gdelt_df['url'] == article.url]['SQLDATE'].item() if 'SQLDATE' in self.gdelt_df.columns else ''# self.gdelt_df[self.gdelt_df['url'] == article.url]['date'].item()
190 | , "Actor1Name": self.gdelt_df[self.gdelt_df['url'] == article.url]['Actor1Name'].item() if 'Actor1Name' in self.gdelt_df.columns else ''
191 | , "Actor2Name": self.gdelt_df[self.gdelt_df['url'] == article.url]['Actor2Name'].item() if 'Actor2Name' in self.gdelt_df.columns else ''
192 | , "GoldsteinScale": self.gdelt_df[self.gdelt_df['url'] == article.url]['GoldsteinScale'].item() if 'GoldsteinScale' in self.gdelt_df.columns else ''
193 | , "NumMentions": [self.gdelt_df[self.gdelt_df['url'] == article.url]['NumMentions'].item()] if 'NumMentions' in self.gdelt_df.columns else []#if self.gdelt_df[self.gdelt_df['url'] == article.url]['entities'].map(lambda x: [int(e['numMentions']) for e in x]).values else []
194 | , "NumSources": self.gdelt_df[self.gdelt_df['url'] == article.url]['NumSources'].item() if 'NumSources' in self.gdelt_df.columns else 0
195 | , "NumArticles": self.gdelt_df[self.gdelt_df['url'] == article.url]['NumArticles'].item() if 'NumArticles' in self.gdelt_df.columns else 0
196 | , "AvgTone": self.gdelt_df[self.gdelt_df['url'] == article.url]['AvgTone'].item() if 'AvgTone' in self.gdelt_df.columns else 0.0
197 | #, "entities_name": self.gdelt_df[self.gdelt_df['url'] == article.url]['entities'].map(lambda x: [str(e['name']) for e in x]).values if 'entities' in self.gdelt_df.columns else []
198 | #, "entities_type": self.gdelt_df[self.gdelt_df['url'] == article.url]['entities'].map(lambda x: [str(e['type']) for e in x]).values if 'entities' in self.gdelt_df.columns else []
199 | #, "entities_avgSalience": self.gdelt_df[self.gdelt_df['url'] == article.url]['entities'].map(lambda x: [float(e['avgSalience']) for e in x]).values if 'entities' in self.gdelt_df.columns else []
200 | } for article in src.articles
201 | ]
202 | }
203 | tmp_list.append(tmp)
204 |
205 | return tmp_list
206 |
207 | def _reduced_articles(self) -> List[Dict[str, Any]]:
208 | """
209 | Given a list of GDelt records, parse and process the site information.
210 | Actual data structure for dictionary is a
211 | list(zeitghost.gdelt.Helpers.gdelt_reduced_articles)
212 | :return: List[Dict[str, Any]]
213 | """
214 | # Prepare for final return list[dict()]
215 | logging.debug("Preparing full domain and article payloads")
216 | tmp_list = list()
217 | for src in self.news_sources:
218 | for article in src.articles:
219 | row = self.gdelt_df[self.gdelt_df['url'] == article.url]
220 | tmp = {
221 | "title": article.title
222 | , "text": article.text
223 | , "article_url": article.url
224 | , "summary": article.summary
225 | , "date": str(row['date'].values)
226 | , "entities_name": row['entities'].map(lambda x: [str(e['name']) for e in x]).values
227 | , "entities_type": row['entities'].map(lambda x: [str(e['type']) for e in x]).values
228 | , "entities_numMentions": row['entities'].map(lambda x: [int(e['numMentions']) for e in x]).values
229 | , "entities_avgSalience": row['entities'].map(lambda x: [float(e['avgSalience']) for e in x]).values
230 | }
231 | tmp_list.append(tmp)
232 |
233 | return tmp_list
234 |
235 | def _prepare_for_indexing(self):
236 | """
237 | Reduces the larger Gdelt and newspaper download into a more compact payload tuned for indexing
238 |
239 | :return: pandas.DataFrame
240 | """
241 | logging.debug("Reducing full payload into what Chroma expects for indexing")
242 | final_return_df = pd.DataFrame.from_dict(self.full_source_data)
243 | pre_vector_df = final_return_df[['articles', 'url']].copy()
244 |
245 | pre_vector_df.columns = ['text', 'url']
246 |
247 |         pre_vector_df['text'] = pre_vector_df['text'].map(str)
248 | 
249 |         pre_vector_df['text'] = pre_vector_df['text'].astype("string")
250 |         pre_vector_df['url'] = pre_vector_df['url'].astype("string")
251 |
252 | return pre_vector_df
253 |
254 | def write_to_gcs(self, output_df: pd.DataFrame, bucket_name: str):
255 | """
256 | Output article information to a cloud storage bucket
257 |
258 | :param output_df: pandas.DataFrame
259 | Input dataframe to write out to GCS
260 | :param bucket_name: str
261 | Bucket name for writing to
262 |
263 | :return: str
264 | """
265 | client = storage.Client()
266 | bucket = client.get_bucket(bucket_name)
267 | if not bucket.exists():
268 | bucket.create()
269 | blob_name = "articles/data.json"
270 |
271 | bucket.blob(blob_name).upload_from_string(
272 |             output_df.to_json(orient='records')  # records orient keeps the payload index-free
273 | , 'text/json'
274 | )
275 |
276 | return f"gs://{bucket_name}/{blob_name}"
277 |
278 | def write_to_bq(self) -> str:
279 | self.chunk_df.to_gbq(self.destination_table_id
280 | , project_id=self.__project
281 | , if_exists='append'
282 | , table_schema=gdelt_processed_record
283 | )
284 |
285 | return f"{self.__project}:{self.destination_table_id}"
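286 | 
287 | 
288 | # Example usage (sketch): build a GdeltData instance from a small DataFrame of article URLs.
289 | # __init__ downloads and parses the articles, so this makes live network calls and only
290 | # works with reachable article URLs; the URL below is a placeholder.
291 | if __name__ == "__main__":
292 |     sample = pd.DataFrame({"url": ["https://example.com/some-article"]})
293 |     gdelt = GdeltData(sample, destination_table='gdelt_actors', project='cpg-cdp')
294 |     print(gdelt.index_data.head())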
--------------------------------------------------------------------------------
/zeitghost/gdelt/Helpers.py:
--------------------------------------------------------------------------------
1 | from google.cloud import bigquery as bq
2 |
3 | gdelt_input_record = [
4 | bq.SchemaField(name="SQLDATE", field_type="TIMESTAMP", mode="REQUIRED")
5 | , bq.SchemaField(name="Actor1Name", field_type="STRING", mode="REQUIRED")
6 | , bq.SchemaField(name="Actor2Name", field_type="STRING", mode="REQUIRED")
7 | , bq.SchemaField(name="GoldsteinScale", field_type="FLOAT64", mode="REQUIRED")
8 | , bq.SchemaField(name="NumMentions", field_type="INT64", mode="REQUIRED")
9 | , bq.SchemaField(name="NumSources", field_type="INT64", mode="REQUIRED")
10 | , bq.SchemaField(name="NumArticles", field_type="INT64", mode="REQUIRED")
11 | , bq.SchemaField(name="AvgTone", field_type="FLOAT64", mode="REQUIRED")
12 | , bq.SchemaField(name="SOURCEURL", field_type="STRING", mode="REQUIRED")
13 | ]
14 |
15 | gdelt_processed_article = [
16 | bq.SchemaField(name="title", field_type="STRING", mode="REQUIRED")
17 | , bq.SchemaField(name="text", field_type="STRING", mode="REQUIRED")
18 | , bq.SchemaField(name="authors", field_type="STRING", mode="REPEATED")
19 | , bq.SchemaField(name="summary", field_type="STRING", mode="REQUIRED")
20 | , bq.SchemaField(name="publish_date", field_type="TIMESTAMP", mode="NULLABLE")
21 | , bq.SchemaField(name="url", field_type="STRING", mode="REQUIRED")
22 | , bq.SchemaField(name="language", field_type="STRING", mode="REQUIRED")
23 | , bq.SchemaField(name="date", field_type="DATETIME", mode="REQUIRED")
24 | , bq.SchemaField(name="Actor1Name", field_type="STRING", mode="REQUIRED")
25 | , bq.SchemaField(name="Actor2Name", field_type="STRING", mode="REQUIRED")
26 | , bq.SchemaField(name="GoldsteinScale", field_type="FLOAT64", mode="REQUIRED")
27 | , bq.SchemaField(name="NumMentions", field_type="INT64", mode="REPEATED")
28 | , bq.SchemaField(name="NumSources", field_type="INT64", mode="REQUIRED")
29 | , bq.SchemaField(name="NumArticles", field_type="INT64", mode="REQUIRED")
30 | , bq.SchemaField(name="AvgTone", field_type="FLOAT64", mode="REQUIRED")
31 | #, bq.SchemaField(name="entities_name", field_type="STRING", mode="REPEATED")
32 | #, bq.SchemaField(name="entities_type", field_type="STRING", mode="REPEATED")
33 | #, bq.SchemaField(name="entities_avgSalience", field_type="FLOAT64", mode="REPEATED")
34 | ]
35 |
36 | gdelt_processed_record = [
37 | bq.SchemaField(name="domain", field_type="STRING", mode="REQUIRED")
38 | , bq.SchemaField(name="url", field_type="STRING", mode="REQUIRED")
39 | , bq.SchemaField(name="brand", field_type="STRING", mode="REQUIRED")
40 | , bq.SchemaField(name="description", field_type="STRING", mode="REQUIRED")
41 | , bq.SchemaField(name="categories", field_type="STRING", mode="REPEATED")
42 | , bq.SchemaField(name="article_count", field_type="INT64", mode="REQUIRED")
43 | , bq.SchemaField(name="articles", field_type="RECORD", mode="REPEATED", fields=gdelt_processed_article)
44 | ]
45 |
46 | gdelt_reduced_articles = [
47 | bq.SchemaField(name="title", field_type="STRING", mode="REQUIRED")
48 | , bq.SchemaField(name="text", field_type="STRING", mode="REQUIRED")
49 | , bq.SchemaField(name="article_url", field_type="STRING", mode="REQUIRED")
50 | , bq.SchemaField(name="summary", field_type="STRING", mode="REQUIRED")
51 | , bq.SchemaField(name="date", field_type="TIMESTAMP", mode="NULLABLE")
52 | , bq.SchemaField(name="entities_name", field_type="STRING", mode="REPEATED")
53 | , bq.SchemaField(name="entities_type", field_type="STRING", mode="REPEATED")
54 | , bq.SchemaField(name="entities_numMentions", field_type="INT64", mode="REPEATED")
55 |     , bq.SchemaField(name="entities_avgSalience", field_type="FLOAT64", mode="REPEATED")
56 | ]
57 |
58 | # gdelt_geg_articles_to_scrape = [
59 | # bq.SchemaField(name="url", field_type="STRING", mode="REQUIRED")
60 | # , bq.SchemaField(name="date", field_type="TIMESTAMP", mode="REQUIRED")
61 | # , bq.SchemaField(name="avgSalience", field_type="FLOAT", mode="REQUIRED")
62 | # ]
63 |
64 |
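65 | # Example usage (sketch): create a destination table for processed GDELT records using the
66 | # schema above. The dataset matches GdeltData's default; the table name here is a placeholder.
67 | if __name__ == "__main__":
68 |     client = bq.Client(project='cpg-cdp')
69 |     table = bq.Table('cpg-cdp.genai_cap_v1.gdelt_processed', schema=gdelt_processed_record)
70 |     client.create_table(table, exists_ok=True)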
--------------------------------------------------------------------------------
/zeitghost/gdelt/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/gdelt/__init__.py
--------------------------------------------------------------------------------
/zeitghost/gdelt/__pycache__/GdeltData.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/gdelt/__pycache__/GdeltData.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/gdelt/__pycache__/Helpers.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/gdelt/__pycache__/Helpers.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/gdelt/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/gdelt/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/testing/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/testing/__init__.py
--------------------------------------------------------------------------------
/zeitghost/testing/basic_agent_unit_tests.py:
--------------------------------------------------------------------------------
1 | # test_with_unittest.py
2 | import pandas as pd
3 | import sys
4 | sys.path.append('../..')
5 |
6 | import langchain #for class assertions
7 | from google.cloud.bigquery.table import RowIterator #for class assertions
8 | from zeitghost.agents.LangchainAgent import LangchainAgent
9 | from zeitghost.vertex.LLM import VertexLLM#, VertexLangchainLLM
10 | from zeitghost.vertex.Embeddings import VertexEmbeddings
11 | from zeitghost.bigquery.BigQueryAccessor import BigQueryAccessor
12 | import unittest
13 | from unittest import TestCase
14 | dataset='trends_data'
15 | table_id='makeupcosmetics_10054_unitedstates_2840_external'
16 |
17 | TEST_PANDAS_SCRIPT = '''This is a dataframe of google search terms (term column)
18 | scored by volume (score column) by weekly date (date_field column):
19 | when were certain terms popular compared to others?
20 | why? double check your answer'''
21 |
22 | PROJECT_ID = 'cpg-cdp'
23 | gdelt_keyword = 'estee lauder'  # lower case
24 | term_data_bq = ('mascara', 'makeup', 'ulta', 'tonymoly')
25 |
26 | GDELT_COLS = ['SQLDATE', 'Actor1Name', 'Actor2Name', 'GoldsteinScale', 'NumMentions', 'NumSources', 'NumArticles', 'AvgTone', 'SOURCEURL']
27 | TRENDSPOTTING_COLS = ['date_field', 'term', 'score']
28 |
29 | BQ_AGENT_PROMPT = f"""Describe the {dataset}.{table_id} table? Don't download the entire table, when complete, say I now know the final answer"""
30 |
31 | class AgentTests(TestCase):
32 | 
33 |     def __init__(self, methodName='runTest',
34 |                  project_id=PROJECT_ID,
35 |                  table_id=table_id,
36 |                  dataset=dataset,
37 |                  gdelt_keyword=gdelt_keyword,
38 |                  term_data_bq=term_data_bq
39 |                  ):
40 |         # unittest.TestCase expects only a methodName, so pass it through first,
41 |         # then wire up the fixtures consumed by _act() and _assert()
42 |         super().__init__(methodName)
43 |         self.project_id = project_id
44 |         self.table_id = table_id
45 |         self.dataset = dataset
46 |         self.gdelt_keyword = gdelt_keyword
47 |         self.term_data_bq = term_data_bq
48 |         self._act()
49 |         self._assert()
50 | 
51 |
52 |     def _act(self):
53 |         self.llm = VertexLLM(stop=['Observation:'])
54 |         self.llm_test = self.llm.predict('how are you doing today?', stop=['Observation:'])
55 |         self.langchain_llm = self.llm
56 |         self.langchain_llm_test = self.langchain_llm('how are you doing today?')  # stop sequences come from the VertexLLM constructor and are needed for the pandas agent
57 |         self.data_accessor = BigQueryAccessor(self.project_id)
58 |         self.gdelt_accessor = self.data_accessor.get_records_from_actor_keyword_df(self.gdelt_keyword)
59 |         self.term_data_from_bq = self.data_accessor.pull_term_data_from_bq(self.term_data_bq)
60 |         # materialise the RowIterator once - it can only be consumed a single time
61 |         self.trendspotting_subset = self.term_data_from_bq.to_dataframe()
62 |         self.vertex_langchain_agent = LangchainAgent(self.langchain_llm)
63 |         self.pandas_agent = self.vertex_langchain_agent.get_pandas_agent(self.trendspotting_subset)
64 |         self.pandas_agent_result = self.pandas_agent.run(TEST_PANDAS_SCRIPT)
65 |         self.langchain_agent_instance = LangchainAgent(self.langchain_llm)
66 |         self.agent_executor = self.langchain_agent_instance.get_bigquery_agent(self.project_id)
67 |         self.agent_executor_test = self.agent_executor(BQ_AGENT_PROMPT)
68 |
69 |     def _assert(self):
70 |         assert True is True  # trivial start
71 |         assert type(self.llm) is VertexLLM
72 |         assert type(self.llm_test) is str
73 |         assert type(self.langchain_llm) is VertexLLM
74 |         assert type(self.langchain_llm_test) is str
75 |         assert len(self.llm_test) > 1
76 |         assert len(self.langchain_llm_test) > 1
77 |         assert type(self.data_accessor) is BigQueryAccessor
78 |         assert type(self.gdelt_accessor) is pd.core.frame.DataFrame  # is this right??
79 |         assert len(self.gdelt_accessor) > 1
80 |         assert type(self.term_data_from_bq) is RowIterator
81 |         assert self.gdelt_accessor.columns.to_list() == GDELT_COLS
82 |         assert type(self.trendspotting_subset) == pd.core.frame.DataFrame
83 |         assert len(self.trendspotting_subset) > 1
84 |         assert self.trendspotting_subset.columns.to_list() == TRENDSPOTTING_COLS
85 |         assert type(self.vertex_langchain_agent) is LangchainAgent
86 |         assert type(self.pandas_agent) is langchain.agents.agent.AgentExecutor
87 |         assert len(self.pandas_agent_result) > 1
88 |         assert type(self.langchain_agent_instance) is LangchainAgent
89 |         assert type(self.agent_executor) is langchain.agents.agent.AgentExecutor
90 |         assert len(self.agent_executor_test) > 1
91 |
92 |
93 |
94 |
95 |
96 |
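97 | # Standard unittest entry point so the checks above can be run directly.
98 | if __name__ == '__main__':
99 |     unittest.main()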
--------------------------------------------------------------------------------
/zeitghost/ts_embedding/.ipynb_checkpoints/kats_embedding_tools-checkpoint.py:
--------------------------------------------------------------------------------
1 | from sklearn.preprocessing import MinMaxScaler
2 | from kats.consts import TimeSeriesData
3 | from kats.tsfeatures.tsfeatures import TsFeatures
4 | import os
5 | import pandas as pd
6 | import numpy as np
7 | from .bq_data_tools import pull_term_data_from_bq
8 | from decimal import Decimal
9 | import itertools  # used by chunks() below
10 |
11 |
12 | # https://stackoverflow.com/questions/434287/how-to-iterate-over-a-list-in-chunks
13 |
14 | def chunker(seq, size):
15 | return (seq[pos:pos + size] for pos in range(0, len(seq), size))
16 |
17 | SLIDING_WINDOW_SIZE = 30 #n months for chunking - complete examples only used
18 | STEP = 1 #step - default to 1
19 |
20 |
21 | def write_embeddings_to_disk(term_chunk, filename='data/ts_embeddings.jsonl'):
22 | '''
23 |     this function takes a chunk of n_terms (see chunker for input)
24 | and writes to `filename` a jsonl file compliant with
25 | matching engine
26 | '''
27 | term_data = pull_term_data_from_bq(tuple(term_chunk))
28 | #run through by term
29 | for term in term_chunk:
30 | # emb_pair = get_feature_embedding_for_window(term_data[term_data.term == term], term)
31 | # ts_emedding_pairs.append(emb_pair)
32 | wdf = windows(term_data[term_data.term == term], SLIDING_WINDOW_SIZE, STEP)
33 | for window, new_df in wdf.groupby(level=0):
34 | # print(window, new_df)
35 | if new_df.shape[0] == SLIDING_WINDOW_SIZE: #full examples only
36 | emb_pair = get_feature_embedding_for_window(new_df, term)
37 |                 if emb_pair is None:  # skip windows Kats could not featurize
38 |                     continue
39 |                 label, emb = emb_pair
40 |                 formatted_emb = '{"id":"' + str(label) + '","embedding":[' + ",".join(str(x) for x in list(emb)) + ']}'
41 |                 with open(filename, 'a') as f:
42 |                     f.write(formatted_emb + "\n")
43 |
44 | def windows(data, window_size, step):
45 | '''
46 | creates slices of the time series used for
47 | creating embeddings
48 | '''
49 | r = np.arange(len(data))
50 | s = r[::step]
51 | z = list(zip(s, s + window_size))
52 | f = '{0[0]}:{0[1]}'.format
53 | g = lambda t: data.iloc[t[0]:t[1]]
54 | return pd.concat(map(g, z), keys=map(f, z))
55 |
56 | def get_feature_embedding_for_window(df, term):
57 | '''
58 |     this takes a df with schema of type `date_field` and `score` to create an embedding
59 | takes 30 weeks of historical timeseries data
60 | '''
61 | ts_name = f"{term}_{str(df.date_field.min())}_{str(df.date_field.max())}"
62 | scaler=MinMaxScaler()
63 | df[['score']] = scaler.fit_transform(df[['score']])
64 | scores = df[['score']].values.tolist()
65 | flat_values = [item for sublist in scores for item in sublist]
66 | df = df.rename(columns={"date_field":"time"})
67 | ts_df = pd.DataFrame({'time':df.time,
68 | 'score':flat_values})
69 | ts_df.drop_duplicates(keep='first', inplace=True)
70 |
71 | # Use Kats to extract features for the time window
72 | try:
73 | if not (len(np.unique(ts_df.score.tolist())) == 1 \
74 | or len(np.unique(ts_df.score.tolist())) == 0):
75 | timeseries = TimeSeriesData(ts_df)
76 | features = TsFeatures().transform(timeseries)
77 | feature_list = [float(v) if not pd.isnull(v) else float(0) for _, v in features.items()]
78 | if Decimal('Infinity') in feature_list or Decimal('-Infinity') in feature_list:
79 | return None
80 | return (ts_name, feature_list)
81 | except np.linalg.LinAlgError as e:
82 | print(f"Can't process {ts_name}:{e}")
83 | return None
84 |
85 | def chunks(iterable, batch_size=100):
86 | it = iter(iterable)
87 | chunk = tuple(itertools.islice(it, batch_size))
88 | while chunk:
89 | yield chunk
90 | chunk = tuple(itertools.islice(it, batch_size))
--------------------------------------------------------------------------------
/zeitghost/ts_embedding/bq_data_tools.py:
--------------------------------------------------------------------------------
1 | from google.cloud import bigquery
2 | import pandas as pd
3 |
4 | PROJECT_ID = 'cpg-cdp'
5 | TABLE_ID = 'makeupcosmetics_10054_unitedstates_2840'
6 | DATASET = 'trends_data'
7 |
8 | bqclient = bigquery.Client(
9 | project=PROJECT_ID,
10 | # location=LOCATION
11 | )
12 |
13 | def get_term_set(project_id=PROJECT_ID,
14 | dataset=DATASET,
15 | table_id=TABLE_ID):
16 | '''
17 | Simple function to get the unique, sorted terms in the table
18 | '''
19 | query = f"""
20 | SELECT distinct
21 | term
22 | FROM `{project_id}.{dataset}.{table_id}`
23 | order by 1
24 | """
25 |
26 | df = bqclient.query(query = query).to_dataframe()
27 | return df["term"].to_list()
28 |
29 |
30 | def pull_term_data_from_bq(term: tuple = ('mascara', 'makeup'),
31 | project_id=PROJECT_ID,
32 | dataset=DATASET,
33 | table_id=TABLE_ID):
34 | '''
35 | pull terms based on `in` sql clause from term
36 | takes a tuple of terms (str) and produces pandas dataset
37 | '''
38 | query = f"""
39 | SELECT
40 | cast(date AS DATE FORMAT 'YYYY-MM-DD') as date_field,
41 | term,
42 | score
43 | FROM `{project_id}.{dataset}.{table_id}`
44 | WHERE
45 | term in {term}
46 | order by term, 1
47 | """
48 |
49 | df = bqclient.query(query = query).to_dataframe()
50 | return df
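51 | 
52 | 
53 | # Example usage (sketch): list the distinct terms, then pull weekly scores for two of them.
54 | if __name__ == '__main__':
55 |     terms = get_term_set()
56 |     print(f"{len(terms)} distinct terms in {DATASET}.{TABLE_ID}")
57 |     df = pull_term_data_from_bq(('mascara', 'makeup'))
58 |     print(df.head())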
--------------------------------------------------------------------------------
/zeitghost/ts_embedding/kats_embedding_tools.py:
--------------------------------------------------------------------------------
1 | from sklearn.preprocessing import MinMaxScaler
2 | from kats.consts import TimeSeriesData
3 | from kats.tsfeatures.tsfeatures import TsFeatures
4 | import pandas as pd
5 | import numpy as np
6 | from zeitghost.ts_embedding.bq_data_tools import pull_term_data_from_bq  # used by write_embeddings_to_disk
7 | from zeitghost.bigquery.BigQueryAccessor import BigQueryAccessor
8 | from decimal import Decimal
9 | import itertools  # used by chunks() below
10 |
11 |
12 | # https://stackoverflow.com/questions/434287/how-to-iterate-over-a-list-in-chunks
13 | def chunker(seq, size):
14 | return (seq[pos:pos + size] for pos in range(0, len(seq), size))
15 |
16 | SLIDING_WINDOW_SIZE = 30 #n months for chunking - complete examples only used
17 | STEP = 1 #step - default to 1
18 |
19 |
20 | def write_embeddings_to_disk(term_chunk, filename='data/ts_embeddings.jsonl'):
21 | '''
22 |     this function takes a chunk of n_terms (see chunker for input)
23 | and writes to `filename` a jsonl file compliant with
24 | matching engine
25 | '''
26 | term_data = pull_term_data_from_bq(tuple(term_chunk))
27 | #run through by term
28 | for term in term_chunk:
29 | # emb_pair = get_feature_embedding_for_window(term_data[term_data.term == term], term)
30 | # ts_emedding_pairs.append(emb_pair)
31 | wdf = windows(term_data[term_data.term == term], SLIDING_WINDOW_SIZE, STEP)
32 | for window, new_df in wdf.groupby(level=0):
33 | # print(window, new_df)
34 | if new_df.shape[0] == SLIDING_WINDOW_SIZE: #full examples only
35 | emb_pair = get_feature_embedding_for_window(new_df, term)
36 |                 if emb_pair is None:  # skip windows Kats could not featurize
37 |                     continue
38 |                 label, emb = emb_pair
39 |                 formatted_emb = '{"id":"' + str(label) + '","embedding":[' + ",".join(str(x) for x in list(emb)) + ']}'
40 |                 with open(filename, 'a') as f:
41 |                     f.write(formatted_emb + "\n")
42 |
43 | def windows(data, window_size, step):
44 | '''
45 | creates slices of the time series used for
46 | creating embeddings
47 | '''
48 | r = np.arange(len(data))
49 | s = r[::step]
50 | z = list(zip(s, s + window_size))
51 | f = '{0[0]}:{0[1]}'.format
52 | g = lambda t: data.iloc[t[0]:t[1]]
53 | return pd.concat(map(g, z), keys=map(f, z))
54 |
55 | def get_feature_embedding_for_window(df, term):
56 | '''
57 |     this takes a df with schema of type `date_field` and `score` to create an embedding
58 | takes 30 weeks of historical timeseries data
59 | '''
60 | ts_name = f"{term}_{str(df.date_field.min())}_{str(df.date_field.max())}"
61 | scaler=MinMaxScaler()
62 | df[['score']] = scaler.fit_transform(df[['score']])
63 | scores = df[['score']].values.tolist()
64 | flat_values = [item for sublist in scores for item in sublist]
65 | df = df.rename(columns={"date_field":"time"})
66 | ts_df = pd.DataFrame({'time':df.time,
67 | 'score':flat_values})
68 | ts_df.drop_duplicates(keep='first', inplace=True)
69 |
70 | # Use Kats to extract features for the time window
71 | try:
72 | if not (len(np.unique(ts_df.score.tolist())) == 1 \
73 | or len(np.unique(ts_df.score.tolist())) == 0):
74 | timeseries = TimeSeriesData(ts_df)
75 | features = TsFeatures().transform(timeseries)
76 | feature_list = [float(v) if not pd.isnull(v) else float(0) for _, v in features.items()]
77 | if Decimal('Infinity') in feature_list or Decimal('-Infinity') in feature_list:
78 | return None
79 | return (ts_name, feature_list)
80 | except np.linalg.LinAlgError as e:
81 | print(f"Can't process {ts_name}:{e}")
82 | return None
83 |
84 | def chunks(iterable, batch_size=100):
85 | it = iter(iterable)
86 | chunk = tuple(itertools.islice(it, batch_size))
87 | while chunk:
88 | yield chunk
89 | chunk = tuple(itertools.islice(it, batch_size))
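90 | 
91 | 
92 | # Example usage (sketch): featurize every term in chunks of 5 and append Matching Engine
93 | # compliant JSONL records. Assumes the data/ directory exists and BigQuery access is configured.
94 | if __name__ == '__main__':
95 |     from zeitghost.ts_embedding.bq_data_tools import get_term_set
96 |     for term_chunk in chunker(get_term_set(), 5):
97 |         write_embeddings_to_disk(term_chunk, filename='data/ts_embeddings.jsonl')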
--------------------------------------------------------------------------------
/zeitghost/vertex/Embeddings.py:
--------------------------------------------------------------------------------
1 | from langchain.embeddings.base import Embeddings
2 | from typing import List
3 | from zeitghost.vertex.Helpers import rate_limit, _get_api_key, VertexModels
4 | from vertexai.preview.language_models import TextEmbeddingModel
5 |
6 |
7 | class VertexEmbeddings(Embeddings):
8 | """
9 | Helper class for getting document embeddings
10 | """
11 | model: TextEmbeddingModel
12 | project_id: str
13 | location: str
14 | requests_per_minute: int
15 | _api_key: str
16 |
17 | def __init__(self
18 | , project_id='cpg-cdp'
19 | , location='us-central1'
20 | , model=VertexModels.MODEL_EMBEDDING_GECKO.value
21 | , requests_per_minute=15):
22 | """
23 | :param project_id: str
24 | Google Cloud Project ID
25 | :param location: str
26 | Google Cloud Location
27 | :param model: str
28 | LLM Embedding Model name
29 | :param requests_per_minute: int
30 | Rate Limiter for managing API limits
31 | """
32 | super().__init__()
33 |
34 | self.model = TextEmbeddingModel.from_pretrained(model)
35 | self.project_id = project_id
36 | self.location = location
37 | self.requests_per_minute = requests_per_minute
38 | # self._api_key = _get_api_key()
39 |
40 | def _call_llm_embedding(self, prompt: str) -> List[List[float]]:
41 | """
42 | Retrieve embeddings from the embeddings llm
43 |
44 | :param prompt: str
45 | Document to retrieve embeddings
46 |
47 | :return: List[List[float]]
48 | """
49 | embeddings = self.model.get_embeddings([prompt])
50 | embeddings = [e.values for e in embeddings] #list of list
51 | return embeddings
52 |
53 | def embed_documents(self, texts: List[str]) -> List[List[float]]:
54 | """
55 | Retrieve embeddings for a list of documents
56 |
57 | :param texts: List[str]
58 | List of documents for embedding
59 |
60 |         :return: List[List[float]]
61 | """
62 | # print(f"Setting requests per minute limit: {self.requests_per_minute}\n")
63 | limiter = rate_limit(self.requests_per_minute)
64 | results = []
65 | for doc in texts:
66 | chunk = self.embed_query(doc)
67 | results.append(chunk)
68 |             # block until the limiter allows the next request
69 | next(limiter)
70 | return results
71 |
72 | def embed_query(self, text) -> List[float]:
73 | """
74 | Retrieve embeddings for a singular document
75 |
76 | :param text: str
77 | Singleton document
78 |
79 | :return: List[float]
80 | """
81 | single_result = self._call_llm_embedding(text)
82 | # single_result = self.embed_documents([text])
83 | return single_result[0] #should be a singleton list
84 |
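85 | 
86 | # Example usage (sketch): embed a couple of documents with the default gecko model.
87 | # Assumes Vertex AI credentials and quota for the configured requests_per_minute.
88 | if __name__ == '__main__':
89 |     embedder = VertexEmbeddings(requests_per_minute=15)
90 |     vectors = embedder.embed_documents(["hello zeitghost", "search trends for cosmetics"])
91 |     print(len(vectors), len(vectors[0]))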
--------------------------------------------------------------------------------
/zeitghost/vertex/Helpers.py:
--------------------------------------------------------------------------------
1 | from google.cloud import secretmanager
2 | from decouple import config
3 | import time
4 | import os
5 | from enum import Enum
6 | from google.protobuf import struct_pb2
7 | from langchain import PromptTemplate
8 |
9 | _SECRET_ID = 'projects/939655404703/secrets/genai-key'
10 | _SECRET_VERSION = '{}/versions/1'.format(_SECRET_ID)
11 | project = os.environ.get('PROJECT_ID')
12 |
13 |
14 | def _get_api_key() -> str:
15 | """
16 | Retrieve API key from Secret Manager
17 |
18 | :return: str
19 | """
20 | sm_client = secretmanager.SecretManagerServiceClient()
21 | name = sm_client.secret_path(project, _SECRET_ID)
22 | response = sm_client.access_secret_version(request={"name": _SECRET_VERSION})
23 |
24 | return response.payload.data.decode("UTF-8")
25 |
26 |
27 | def rate_limit(max_per_minute=15) -> None:
28 | """
29 | Rate Limiter to manage API limitations
30 |
31 | :param max_per_minute:
32 |
33 | :return: yield
34 | """
35 | period = 60 / max_per_minute
36 | # print('Waiting')
37 | while True:
38 | before = time.time()
39 | yield
40 | after = time.time()
41 | elapsed = after - before
42 | sleep_time = max(0, period - elapsed)
43 | if sleep_time > 0:
44 | # print(f'Sleeping {sleep_time:.1f} seconds')
45 | print('.', end='')
46 | time.sleep(sleep_time)
47 |
48 |
49 | class VertexModels(Enum):
50 | """
51 | Enumeration of available LLM models
52 | """
53 | MODEL_EMBEDDING_GECKO = "textembedding-gecko@001"
54 | MODEL_TEXT_BISON_001 = "text-bison@001"
55 | MODEL_CHAT_BISON_001 = "chat-bison-001" #TODO UPDATE WITH SDK
56 | MODEL_CODE_BISON_001 = 'code-bison-001' #TODO UPDATE WITH SDK
57 | MODEL_TEXT_BISON_ALPHA = "text-bison-alpha" #TODO UPDATE WITH SDK
58 |
59 |
60 | # prompt: python stop a string at first occurance of word in list of words
61 |
62 | # Sure, here is the Python code to stop a string at the first occurrence of a word in a list of words:
63 |
64 | # ```python
65 | def stop_string_at_first_occurrence(string, words):
66 | """
67 | Stops a string at the first occurrence of a word in a list of words.
68 |
69 | Args:
70 | string: The string to stop.
71 | words: A list of words to stop the string at.
72 |
73 | Returns:
74 | The string, stopped at the first occurrence of a word in the list.
75 | """
76 |
77 | for word in words:
78 | if word in string:
79 | return string.partition(word)[0]
80 |
81 | return string
82 | # ```
83 |
84 | # Here is an example of how to use the `stop_string_at_first_occurrence()` function:
85 |
86 | # ```python
87 | # string = "This is a string with the words 'stop' and 'word'."
88 | # words = ["stop", "word"]
89 |
90 | # print(stop_string_at_first_occurrence(string, words))
91 | # ```
92 |
93 | # This will print the following output to the console:
94 |
95 | # ```
96 | # This is a string with the words 'stop'.
97 | # ```
98 |
99 |
100 | def _build_index_config(embedding_gcs_uri: str, dimensions: int):
101 | _treeAhConfig = struct_pb2.Struct(
102 | fields={
103 | "leafNodeEmbeddingCount": struct_pb2.Value(number_value=500),
104 | "leafNodesToSearchPercent": struct_pb2.Value(number_value=7),
105 | }
106 | )
107 | _algorithmConfig = struct_pb2.Struct(
108 | fields={"treeAhConfig": struct_pb2.Value(struct_value=_treeAhConfig)}
109 | )
110 | _config = struct_pb2.Struct(
111 | fields={
112 | "dimensions": struct_pb2.Value(number_value=dimensions),
113 | "approximateNeighborsCount": struct_pb2.Value(number_value=150),
114 | "distanceMeasureType": struct_pb2.Value(string_value="DOT_PRODUCT_DISTANCE"),
115 | "algorithmConfig": struct_pb2.Value(struct_value=_algorithmConfig),
116 | "shardSize": struct_pb2.Value(string_value="SHARD_SIZE_SMALL"),
117 | }
118 | )
119 | metadata = struct_pb2.Struct(
120 | fields={
121 | "config": struct_pb2.Value(struct_value=_config),
122 | "contentsDeltaUri": struct_pb2.Value(string_value=embedding_gcs_uri),
123 | }
124 | )
125 |
126 | return metadata
127 |
128 | map_prompt_template = """
129 | Write a concise summary of the following:
130 |
131 | {text}
132 |
133 | CONCISE SUMMARY:
134 | """
135 | map_prompt = PromptTemplate(
136 | template=map_prompt_template
137 | , input_variables=["text"]
138 | )
139 |
140 | combine_prompt_template = """
141 | Write a concise summary of the following:
142 |
143 | {text}
144 |
145 | CONCISE SUMMARY IN BULLET POINTS:
146 | """
147 | combine_prompt = PromptTemplate(
148 | template=combine_prompt_template
149 | , input_variables=["text"]
150 | )
151 |
152 |
153 | class ResourceNotExistException(Exception):
154 | def __init__(self, resource: str, message="Resource Does Not Exist."):
155 | self.resource = resource
156 | self.message = message
157 | super().__init__(self.message)
158 |
--------------------------------------------------------------------------------
/zeitghost/vertex/LLM.py:
--------------------------------------------------------------------------------
1 | from typing import List, Optional
2 | from zeitghost.vertex.Helpers import VertexModels, stop_string_at_first_occurrence
3 | from langchain.llms.base import LLM
4 | from vertexai.preview.language_models import TextGenerationModel
5 |
6 |
7 | class VertexLLM(LLM):
8 | """
9 |     A wrapper around a Vertex LLM model that fits into the LangChain framework
10 |     by extending the langchain.llms.base.LLM class
11 | """
12 | model: TextGenerationModel
13 | predict_kwargs: dict
14 | model_source: str
15 | stop: Optional[List[str]]
16 | strip: bool
17 | strip_chars: List[str]
18 |
19 | def __init__(self
20 | , stop: Optional[List[str]]
21 | , strip: bool = False
22 | , strip_chars: List[str] = ['{','}','\n']
23 | , model_source=VertexModels.MODEL_TEXT_BISON_001.value
24 | , **predict_kwargs
25 | ):
26 |         """
27 |         :param stop: Optional[List[str]]
28 |             Stop sequences applied to the model output
29 |         :param strip: bool
30 |             Whether to remove strip_chars from the output
31 |         :param strip_chars: List[str]
32 |             Characters removed when strip is True
33 |         :param model_source: str
34 |             Name of the Vertex LLM model to call; extra kwargs are passed to predict()
35 |         """
36 | super().__init__(model=TextGenerationModel.from_pretrained(model_source)
37 | , strip=strip
38 | , strip_chars=strip_chars
39 | , predict_kwargs=predict_kwargs
40 | , model_source=VertexModels.MODEL_TEXT_BISON_001.value
41 | )
42 | self.model = TextGenerationModel.from_pretrained(model_source)
43 | self.stop = stop
44 | self.model_source = model_source
45 | self.predict_kwargs = predict_kwargs
46 | self.strip = strip
47 | self.strip_chars = strip_chars
48 |
49 | @property
50 | def _llm_type(self):
51 | return 'vertex'
52 |
53 | @property
54 | def _identifying_params(self):
55 | return {}
56 |
57 | def _trim_output(self, raw_results: str) -> str:
58 | '''
59 | utility function to strip out brackets and other non useful info
60 | '''
61 | for char in self.strip_chars:
62 | raw_results = raw_results.replace(char, '')
63 | return raw_results
64 |
65 | def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
66 | """
67 | Wrapper around predict.
68 | Has special handling for SQL response formatting.
69 |
70 | :param prompt:
71 | :return: str
72 | """
73 | stop = self.stop
74 | prompt = str(prompt)
75 |         prompt = prompt[:7999]  # keep only the first ~8k characters to stay within the model's input limit
76 | result = str(self.model.predict(prompt, **self.predict_kwargs))
77 | if stop is not None:
78 | result = str(stop_string_at_first_occurrence(result, self.stop)) #apply stopwords
79 | if self.strip:
80 | return str(self._trim_output(result))
81 | else:
82 | return str(result)
83 |
84 | def _acall(self, prompt: str, stop: Optional[List[str]] = None) -> str:
85 | result = str(self.model.predict(prompt, **self.predict_kwargs))
86 | stop = self.stop
87 | if stop:
88 | result = str(stop_string_at_first_occurrence(result, self.stop)) #apply stopwords
89 | return str(result)
90 |
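91 | 
92 | # Example usage (sketch): wrap text-bison with Observation-style stop sequences, as the agents
93 | # in this repo do; temperature is an optional predict kwarg passed through to Vertex.
94 | if __name__ == '__main__':
95 |     llm = VertexLLM(stop=['Observation:'], strip=False, temperature=0.0)
96 |     print(llm('Say hello to the zeitghost marketing analyst.'))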
--------------------------------------------------------------------------------
/zeitghost/vertex/MatchingEngineCRUD.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | import time
3 | import logging
4 | from google.cloud import aiplatform_v1 as aipv1
5 | from google.cloud.aiplatform_v1 import CreateIndexEndpointRequest
6 | from google.cloud.aiplatform_v1.types.index import Index
7 | from google.cloud.aiplatform_v1.types.index_endpoint import IndexEndpoint
8 | from google.cloud.aiplatform_v1.types.index_endpoint import DeployedIndex
9 | from zeitghost.vertex.Helpers import _build_index_config, ResourceNotExistException
10 | from google.protobuf import struct_pb2
11 | from typing import List
12 |
13 | logging.basicConfig(level=logging.INFO)
14 | logger = logging.getLogger()
15 |
16 |
17 | class MatchingEngineCRUD:
18 | def __init__(
19 | self
20 | , project_id: str
21 | , region: str
22 | , project_num: int
23 | , index_name: str = None
24 | , vpc_network_name: str = None
25 | ):
26 | self.project_id = project_id
27 | self.project_num = project_num
28 | self.region = region
29 | self.index_name = index_name if index_name is not None else None
30 | self.vpc_network_name = vpc_network_name if vpc_network_name is not None else None
31 |
32 | self.index_endpoint_name = f"{self.index_name}_endpoint" if self.index_name is not None else None
33 | self.PARENT = f"projects/{self.project_num}/locations/{self.region}"
34 |
35 | ENDPOINT = f"{self.region}-aiplatform.googleapis.com"
36 |
37 | # set index client
38 | self.index_client = aipv1.IndexServiceClient(
39 | client_options=dict(api_endpoint=ENDPOINT)
40 | )
41 | # set index endpoint client
42 | self.index_endpoint_client = aipv1.IndexEndpointServiceClient(
43 | client_options=dict(api_endpoint=ENDPOINT)
44 | )
45 |
46 | def _set_index_name(self, index_name: str) -> None:
47 | """
48 |
49 | :param index_name:
50 | :return:
51 | """
52 | self.index_name = index_name
53 |
54 | def _set_index_endpoint_name(self, index_endpoint_name: str = None) -> None:
55 | """
56 |
57 | :param index_endpoint_name:
58 | :return:
59 | """
60 | if index_endpoint_name is not None:
61 | self.index_endpoint_name = index_endpoint_name
62 | elif self.index_name is not None:
63 | self.index_endpoint_name = f"{self.index_name}_endpoint"
64 | else:
65 | raise ResourceNotExistException("index")
66 |
67 | def _get_index(self) -> Index:
68 | """
69 |
70 | :return:
71 | """
72 | # Check if index exists
73 | if self.index_name is not None:
74 | indexes = [
75 | index.name for index in self.list_indexes()
76 | if index.display_name == self.index_name
77 | ]
78 | else:
79 | raise ResourceNotExistException("index")
80 |
81 | if len(indexes) == 0:
82 | return None
83 | else:
84 | index_id = indexes[0]
85 | request = aipv1.GetIndexRequest(name=index_id)
86 | index = self.index_client.get_index(request=request)
87 | return index
88 |
89 | def _get_index_endpoint(self) -> IndexEndpoint:
90 | """
91 |
92 | :return:
93 | """
94 | # Check if index endpoint exists
95 | if self.index_endpoint_name is not None:
96 | index_endpoints = [
97 | response.name for response in self.list_index_endpoints()
98 | if response.display_name == self.index_endpoint_name
99 | ]
100 | else:
101 | raise ResourceNotExistException("index_endpoint")
102 |
103 | if len(index_endpoints) == 0:
104 | logging.info(f"Could not find index endpoint: {self.index_endpoint_name}")
105 | return None
106 | else:
107 | index_endpoint_id = index_endpoints[0]
108 | index_endpoint = self.index_endpoint_client.get_index_endpoint(
109 | name=index_endpoint_id
110 | )
111 | return index_endpoint
112 |
113 | def list_indexes(self) -> List[Index]:
114 | """
115 |
116 | :return:
117 | """
118 | request = aipv1.ListIndexesRequest(parent=self.PARENT)
119 | page_result = self.index_client.list_indexes(request=request)
120 | indexes = [
121 | response for response in page_result
122 | ]
123 | return indexes
124 |
125 | def list_index_endpoints(self) -> List[IndexEndpoint]:
126 | """
127 |
128 | :return:
129 | """
130 | request = aipv1.ListIndexEndpointsRequest(parent=self.PARENT)
131 | page_result = self.index_endpoint_client.list_index_endpoints(request=request)
132 | index_endpoints = [
133 | response for response in page_result
134 | ]
135 | return index_endpoints
136 |
137 | def list_deployed_indexes(
138 | self
139 | , endpoint_name: str = None
140 | ) -> List[DeployedIndex]:
141 | """
142 |
143 | :param endpoint_name:
144 | :return:
145 | """
146 | try:
147 | if endpoint_name is not None:
148 | self._set_index_endpoint_name(endpoint_name)
149 | index_endpoint = self._get_index_endpoint()
150 | deployed_indexes = index_endpoint.deployed_indexes
151 | except ResourceNotExistException as rnee:
152 | raise rnee
153 |
154 | return list(deployed_indexes)
155 |
156 | def create_index(
157 | self
158 | , embedding_gcs_uri: str
159 | , dimensions: int
160 | , index_name: str = None
161 | ) -> Index:
162 | """
163 |
164 | :param index_name:
165 | :param embedding_gcs_uri:
166 | :param dimensions:
167 | :return:
168 | """
169 | if index_name is not None:
170 | self._set_index_name(index_name)
171 | # Get index
172 | if self.index_name is None:
173 | raise ResourceNotExistException("index")
174 | index = self._get_index()
175 |         # Create the index if it does not exist
176 | if index:
177 | logger.info(f"Index {self.index_name} already exists with id {index.name}")
178 | else:
179 |             logger.info(f"Index {self.index_name} does not exist. Creating index ...")
180 |
181 | metadata = _build_index_config(
182 | embedding_gcs_uri=embedding_gcs_uri
183 | , dimensions=dimensions
184 | )
185 |
186 | index_request = {
187 | "display_name": self.index_name,
188 | "description": "Index for LangChain demo",
189 | "metadata": struct_pb2.Value(struct_value=metadata),
190 | "index_update_method": aipv1.Index.IndexUpdateMethod.STREAM_UPDATE,
191 | }
192 |
193 | r = self.index_client.create_index(
194 | parent=self.PARENT,
195 | index=Index(index_request)
196 | )
197 |
198 | # Poll the operation until it's done successfully.
199 | logging.info("Poll the operation to create index ...")
200 | while True:
201 | if r.done():
202 | break
203 | time.sleep(60)
204 | print('.', end='')
205 |
206 | index = r.result()
207 | logger.info(f"Index {self.index_name} created with resource name as {index.name}")
208 |
209 | return index
210 |
211 | # TODO: this is generating an error about publicEndpointEnabled not being set without network
212 | def create_index_endpoint(
213 | self
214 | , endpoint_name: str = None
215 | , network: str = None
216 | ) -> IndexEndpoint:
217 | """
218 |
219 | :param endpoint_name:
220 | :param network:
221 | :return:
222 | """
223 | try:
224 | if endpoint_name is not None:
225 | self._set_index_endpoint_name(endpoint_name)
226 | # Get index endpoint if exists
227 | index_endpoint = self._get_index_endpoint()
228 |
229 |             # Create the index endpoint if it does not exist
230 | if index_endpoint is not None:
231 | logger.info("Index endpoint already exists")
232 | else:
233 |                 logger.info(f"Index endpoint {self.index_endpoint_name} does not exist. Creating index endpoint...")
234 | index_endpoint_request = {
235 | "display_name": self.index_endpoint_name
236 | }
237 | index_endpoint = IndexEndpoint(index_endpoint_request)
238 | if network is not None:
239 | index_endpoint.network = network
240 | else:
241 | index_endpoint.public_endpoint_enabled = True
242 |                     # the proto field is exposed in snake_case (public_endpoint_enabled) only
243 | r = self.index_endpoint_client.create_index_endpoint(
244 | parent=self.PARENT,
245 | index_endpoint=index_endpoint
246 | )
247 |
248 | logger.info("Poll the operation to create index endpoint ...")
249 | while True:
250 | if r.done():
251 | break
252 | time.sleep(60)
253 | print('.', end='')
254 |
255 | index_endpoint = r.result()
256 | except Exception as e:
257 | logger.error(f"Failed to create index endpoint {self.index_endpoint_name}")
258 | raise e
259 |
260 | return index_endpoint
261 |
262 | def deploy_index(
263 | self
264 | , index_name: str = None
265 | , endpoint_name: str = None
266 | , machine_type: str = "e2-standard-2"
267 | , min_replica_count: int = 2
268 | , max_replica_count: int = 2
269 | ) -> IndexEndpoint:
270 | """
271 |
272 | :param endpoint_name:
273 | :param index_name:
274 | :param machine_type:
275 | :param min_replica_count:
276 | :param max_replica_count:
277 | :return:
278 | """
279 | if index_name is not None:
280 | self._set_index_name(index_name)
281 | if endpoint_name is not None:
282 | self._set_index_endpoint_name(endpoint_name)
283 |
284 | index = self._get_index()
285 | index_endpoint = self._get_index_endpoint()
286 | # Deploy Index to endpoint
287 | try:
288 | # Check if index is already deployed to the endpoint
289 |             if index.name in [d.index for d in index_endpoint.deployed_indexes]:
290 |                 logger.info(f"Skipping deploying Index. Index {self.index_name} " +
291 | f"already deployed with id {index.name} to the index endpoint {self.index_endpoint_name}")
292 | return index_endpoint
293 |
294 | timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
295 | deployed_index_id = f"{self.index_name.replace('-', '_')}_{timestamp}"
296 | deploy_index = {
297 | "id": deployed_index_id,
298 | "display_name": deployed_index_id,
299 | "index": index.name,
300 | "dedicated_resources": {
301 | "machine_spec": {
302 | "machine_type": machine_type,
303 | },
304 | "min_replica_count": min_replica_count,
305 | "max_replica_count": max_replica_count
306 | }
307 | }
308 | logger.info(f"Deploying index with request = {deploy_index}")
309 | r = self.index_endpoint_client.deploy_index(
310 | index_endpoint=index_endpoint.name,
311 | deployed_index=DeployedIndex(deploy_index)
312 | )
313 |
314 | # Poll the operation until it's done successfully.
315 | logger.info("Poll the operation to deploy index ...")
316 | while True:
317 | if r.done():
318 | break
319 | time.sleep(60)
320 | print('.', end='')
321 |
322 | logger.info(f"Deployed index {self.index_name} to endpoint {self.index_endpoint_name}")
323 |
324 | except Exception as e:
325 | logger.error(f"Failed to deploy index {self.index_name} to the index endpoint {self.index_endpoint_name}")
326 | raise e
327 |
328 | return index_endpoint
329 |
330 | def get_index_and_endpoint(self) -> tuple[str, str]:
331 | """
332 |
333 | :return:
334 | """
335 | # Get index id if exists
336 | index = self._get_index()
337 | index_id = index.name if index else ''
338 |
339 | # Get index endpoint id if exists
340 | index_endpoint = self._get_index_endpoint()
341 | index_endpoint_id = index_endpoint.name if index_endpoint else ''
342 |
343 | return index_id, index_endpoint_id
344 |
345 | def delete_index(
346 | self
347 | , index_name: str = None
348 | ) -> str:
349 | """
350 | :param index_name: str
351 | :return:
352 | """
353 | if index_name is not None:
354 | self._set_index_name(index_name)
355 | # Check if index exists
356 | index = self._get_index()
357 |
358 | # Delete the index if it exists
359 | if index:
360 | # Delete index
361 | index_id = index.name
362 | logger.info(f"Deleting Index {self.index_name} with id {index_id}")
363 | self.index_client.delete_index(name=index_id)
364 | return f"index {index_id} deleted."
365 | else:
366 | raise ResourceNotExistException(f"{self.index_name}")
367 |
368 | def undeploy_index(
369 | self
370 | , index_name: str
371 | , endpoint_name: str
372 | ):
373 | """
374 |
375 | :param index_name:
376 | :param endpoint_name:
377 | :return:
378 | """
379 | logger.info(f"Undeploying index with id {index_name} from Index endpoint {endpoint_name}")
380 | endpoint_id = f"{self.PARENT}/indexEndpoints/{endpoint_name}"
381 | r = self.index_endpoint_client.undeploy_index(
382 | index_endpoint=endpoint_id
383 | , deployed_index_id=index_name
384 | )
385 | response = r.result()
386 | logger.info(response)
387 | return response.display_name
388 |
389 | def delete_index_endpoint(
390 | self
391 | , index_endpoint_name: str = None
392 | ) -> str:
393 | """
394 |
395 | :param index_endpoint_name: str
396 | :return:
397 | """
398 | if index_endpoint_name is not None:
399 | self._set_index_endpoint_name(index_endpoint_name)
400 | # Check if index endpoint exists
401 | index_endpoint = self._get_index_endpoint()
402 |
403 | # Delete the index endpoint if it exists
404 | if index_endpoint is not None:
405 | logger.info(
406 | f"Index endpoint {self.index_endpoint_name} exists with resource "
407 | f"name {index_endpoint.name}"
408 | # note: public_endpoint_domain_name is only populated for public endpoints
409 | )
410 |
411 | # Keep the full resource name for the log message below
412 | index_endpoint_id = index_endpoint.name
415 |
416 | # Undeploy existing indexes
417 | for d_index in index_endpoint.deployed_indexes:
418 | self.undeploy_index(
419 | index_name=d_index.id
420 | , endpoint_name=index_endpoint.name.split('/')[-1]  # numeric endpoint id, as expected by undeploy_index
421 | )
422 |
423 | # Delete index endpoint
424 | logger.info(f"Deleting Index endpoint {self.index_endpoint_name} with id {index_endpoint_id}")
425 | self.index_endpoint_client.delete_index_endpoint(name=index_endpoint.name)
426 | return f"Index endpoint {index_endpoint.name} deleted."
427 | else:
428 | raise ResourceNotExistException(f"{self.index_endpoint_name}")
429 |
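
The class above only defines the lifecycle operations, so here is a minimal usage sketch of the intended call order. It assumes the class is named MatchingEngineCRUD (matching its module name) and that its constructor, defined earlier in this file, takes the project/region/index configuration; the constructor arguments and the bare create_index() call below are illustrative placeholders rather than verified signatures.

    from zeitghost.vertex.MatchingEngineCRUD import MatchingEngineCRUD

    # Placeholder constructor arguments -- see __init__ earlier in this file for the real signature.
    crud = MatchingEngineCRUD(...)

    # Create (or reuse) the index and endpoint, then deploy the index onto the endpoint.
    index = crud.create_index()               # may require arguments not shown in this excerpt
    endpoint = crud.create_index_endpoint()   # public endpoint when no VPC network is supplied
    endpoint = crud.deploy_index(
        machine_type="e2-standard-2",
        min_replica_count=2,
        max_replica_count=2,
    )

    # Resource names for downstream use (empty strings if either resource is missing).
    index_id, endpoint_id = crud.get_index_and_endpoint()

    # Tear down in reverse order: undeploy and delete the endpoint, then delete the index.
    crud.delete_index_endpoint()
    crud.delete_index()
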
--------------------------------------------------------------------------------
/zeitghost/vertex/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__init__.py
--------------------------------------------------------------------------------
/zeitghost/vertex/__pycache__/Embeddings.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/Embeddings.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/vertex/__pycache__/Helpers.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/Helpers.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/vertex/__pycache__/LLM.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/LLM.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/vertex/__pycache__/MatchingEngineCRUD.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/MatchingEngineCRUD.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/vertex/__pycache__/MatchingEngineVectorstore.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/MatchingEngineVectorstore.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/vertex/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/vertex/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/agents/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/agents/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/agents/__pycache__/models.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/agents/__pycache__/models.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/celery/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/celery/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/celery/__pycache__/models.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/celery/__pycache__/models.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/gdelt/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/gdelt/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/gdelt/__pycache__/models.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/gdelt/__pycache__/models.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/llm/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/llm/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/llm/__pycache__/models.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/llm/__pycache__/models.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/vectorstore/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/vectorstore/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/blueprints/vectorstore/__pycache__/models.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/blueprints/vectorstore/__pycache__/models.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/celery/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/celery/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/celery/__pycache__/gdelt_tasks.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/celery/__pycache__/gdelt_tasks.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/celery/__pycache__/vertex_tasks.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/celery/__pycache__/vertex_tasks.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/webserver/celery/__pycache__/worker.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hello-d-lee/conversational-agents-zeitghost/f4857c11f3b25b04e1361ee0772465d2cb20e476/zeitghost/webserver/celery/__pycache__/worker.cpython-311.pyc
--------------------------------------------------------------------------------
/zeitghost/zeitghost-trendspotting.iml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------