├── images ├── 01-login.png ├── 02-create-db.png ├── 04-active-db.png ├── 03-pending-db.png ├── 06-data-explorer.png └── 05-create-token-db.png ├── requirements.txt ├── .env.example ├── .gitignore ├── README.md ├── astra-glean-import-job.py └── AstraDB_Glean_Integration.ipynb /images/01-login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-glean/main/images/01-login.png -------------------------------------------------------------------------------- /images/02-create-db.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-glean/main/images/02-create-db.png -------------------------------------------------------------------------------- /images/04-active-db.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-glean/main/images/04-active-db.png -------------------------------------------------------------------------------- /images/03-pending-db.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-glean/main/images/03-pending-db.png -------------------------------------------------------------------------------- /images/06-data-explorer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-glean/main/images/06-data-explorer.png -------------------------------------------------------------------------------- /images/05-create-token-db.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datastaxdevs/mini-demo-astradb-glean/main/images/05-create-token-db.png 
-------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | astrapy>=2.0,<3.0 2 | colorama>=0.4.5 3 | datasets>=3.5,<4.0 4 | python-dotenv>=1.0.0 5 | 6 | # There seems not to be a PyPI distribution for this one: 7 | # (see https://developers.glean.com/sdk#indexing-api) 8 | https://app.glean.com/meta/indexing_api_client.zip 9 | -------------------------------------------------------------------------------- /.env.example: -------------------------------------------------------------------------------- 1 | # Astra DB Configuration 2 | export ASTRA_DB_APPLICATION_TOKEN= 3 | export ASTRA_DB_API_ENDPOINT= 4 | export ASTRA_DB_COLLECTION_NAME="glean_source_collection" 5 | # export ASTRA_DB_KEYSPACE="default_keyspace" # Optional 6 | 7 | # Glean Configuration 8 | export GLEAN_CUSTOMER= 9 | export GLEAN_DATASOURCE_NAME= 10 | export GLEAN_API_TOKEN= 11 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | astra-sdk-java.wiki/ 2 | .env 3 | .astrarc 4 | sec 5 | 6 | # eclipse conf file 7 | .settings 8 | .classpath 9 | .project 10 | .cache 11 | 12 | # idea conf files 13 | .idea 14 | *.ipr 15 | *.iws 16 | *.iml 17 | 18 | # building 19 | target 20 | build 21 | tmp 22 | dist 23 | 24 | # misc 25 | .DS_Store 26 | 27 | .factorypath 28 | .sts4-cache 29 | *.log 30 | 31 | release.properties 32 | pom.xml.releaseBackup -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # mini-demo-astradb-glean 2 | 3 | Demo showing how to index [Astra DB](https://docs.datastax.com/en/astra-db-serverless/index.html) data into Glean. 
4 | 5 | You can run this tutorial entirely in Google Colab, or run it locally by following the instructions below. 6 | 7 | ## Work in a Colab 8 | 9 | [![Open In Colab](https://img.shields.io/badge/Open%20in%20Colab-blue?logo=google-colab&style=for-the-badge)](https://colab.research.google.com/github/datastaxdevs/mini-demo-astradb-glean/blob/main/AstraDB_Glean_Integration.ipynb) 10 | 11 | ## Run Locally 12 | 13 | [![Run Locally](https://img.shields.io/badge/Run%20Locally-python3-blue?style=for-the-badge)](#) 14 | 15 | 16 | ### 1. Set up Astra DB 17 | 18 | ℹ️ See the [Astra Reference documentation](https://docs.datastax.com/en/astra-db-serverless/databases/create-database.html). 19 | 20 | 21 | `✅ 1.1`: Create an Astra account 22 | 23 | Access [https://astra.datastax.com](https://astra.datastax.com) and register with a `Google` or `GitHub` account. 24 | 25 | ![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/01-login.png?raw=true) 26 | 27 | 28 | `✅ 1.2`: Create a Database in Astra DB 29 | 30 | Get to the databases dashboard (by clicking on Databases in the left-hand navigation bar, expanding it if necessary), and click the `[Create Database]` button on the right. 31 | 32 | ![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/02-create-db.png?raw=true) 33 | 34 | 35 | **ℹ️ Field Description** 36 | 37 | | Field | Description | 38 | |--------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 39 | | **Vector Database vs Serverless Database** | Choose `Vector Database`. In June 2023, Cassandra introduced support for vector search to enable Generative AI use cases. | 40 | | **Database name** | Database names are permanent. 
They must start and end with a letter or number, and they can contain no more than 50 characters, including letters, numbers, and the special characters `& + - _ ( ) < > . , @`. It is recommended to have a database for each of your applications. The free tier is limited to 5 databases. | 41 | | **Cloud Provider** | Choose whatever you like. Click a cloud provider logo, pick an Area in the list and finally pick a region. We recommend choosing the region closest to you to reduce latency. In the free tier, there is very little difference. | 42 | | **Cloud Region** | Pick a region close to you, among those available for the selected cloud provider and your plan. | 43 | 44 | If all fields are filled properly, clicking the "Create Database" button will start the process. 45 | 46 | ![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/03-pending-db.png?raw=true) 47 | 48 | It should take a couple of minutes for your database to become `Active`. 49 | 50 | ![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/04-active-db.png?raw=true) 51 | 52 | `✅ 1.3`: Create an Astra Database token 53 | 54 | To [connect to your database](https://docs.datastax.com/en/astra-db-serverless/get-started/quickstart.html#create-a-database-and-store-your-credentials), you need the **API endpoint** and a **Database token**. 55 | 56 | The API endpoint is available on the database screen; a small icon copies the URL to your clipboard (it should look like `https://-.apps.astra.datastax.com`). 57 | 58 | ![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/05-create-token-db.png?raw=true) 59 | 60 | To get a token, click the `[Generate Token]` button on the right. It will generate a token that you can copy to your clipboard. 61 | 62 | ### 2. 
Obtain a Glean token 63 | 64 | > [Glean Documentation](https://developers.glean.com/indexing#authentication) 65 | 66 | Admins can manage Glean API tokens via the API tokens page within Workspace Settings: 67 | 68 | ``` 69 | Workspace > Setup > API tokens > Indexing tokens tab 70 | ``` 71 | 72 | As a Glean admin, create a token and assign permissions (or have an admin do it for you). 73 | 74 | ### 3. Installation 75 | 76 | - `✅ 3.1`: Create and activate a virtual environment. You need Python version 3.9 or higher. 77 | 78 | ```console 79 | python3 -m venv my_virtual_env 80 | ``` 81 | 82 | _macOS/Linux:_ 83 | ``` 84 | source my_virtual_env/bin/activate 85 | ``` 86 | 87 | _Windows:_ 88 | ``` 89 | my_virtual_env\Scripts\activate 90 | ``` 91 | 92 | - `✅ 3.2`: Install the dependencies: 93 | 94 | ```console 95 | pip install -r requirements.txt 96 | ``` 97 | 98 | ### 4. Create environment file 99 | 100 | Copy `.env.example` to `.env`, and fill in your Astra DB and Glean credentials: 101 | 102 | ```ini 103 | # Astra DB Configuration 104 | export ASTRA_DB_APPLICATION_TOKEN= 105 | export ASTRA_DB_API_ENDPOINT= 106 | export ASTRA_DB_COLLECTION_NAME="glean_source_collection" 107 | # export ASTRA_DB_KEYSPACE="default_keyspace" # Optional 108 | 109 | # Glean Configuration 110 | export GLEAN_CUSTOMER= 111 | export GLEAN_DATASOURCE_NAME= 112 | export GLEAN_API_TOKEN= 113 | ``` 114 | 115 | ### 5. Run the script 116 | 117 | ```console 118 | python3 astra-glean-import-job.py 119 | ``` 120 | 121 | ## Wrap up and more information 122 | 123 | Congratulations: you have indexed data from an Astra DB collection into Glean! 124 | 125 | You can inspect the Astra DB collection in your Astra dashboard: navigate to the database and find the "Data explorer" tab to locate your collection. 126 | 127 | You can perform a test with Glean: search for the content you just indexed and verify the response contains information coming from the inserted dataset. 
128 | 129 | ℹ️ [Glean integration page](https://docs.datastax.com/en/astra-db-serverless/integrations/glean.html) on Astra DB documentation. 130 | -------------------------------------------------------------------------------- /astra-glean-import-job.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from astrapy import DataAPIClient 4 | from colorama import Fore, Style 5 | from datasets import load_dataset 6 | from dotenv import load_dotenv 7 | 8 | import glean_indexing_api_client as indexing_api 9 | from glean_indexing_api_client.api import datasources_api, documents_api 10 | from glean_indexing_api_client.model.custom_datasource_config import ( 11 | CustomDatasourceConfig, 12 | ) 13 | from glean_indexing_api_client.model.object_definition import ObjectDefinition 14 | from glean_indexing_api_client.model.index_document_request import IndexDocumentRequest 15 | from glean_indexing_api_client.model.document_definition import DocumentDefinition 16 | from glean_indexing_api_client.model.content_definition import ContentDefinition 17 | from glean_indexing_api_client.model.document_permissions_definition import ( 18 | DocumentPermissionsDefinition, 19 | ) 20 | 21 | 22 | # Load environment variables from .env 23 | load_dotenv() 24 | 25 | ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"] 26 | ASTRA_DB_API_ENDPOINT = os.environ["ASTRA_DB_API_ENDPOINT"] 27 | ASTRA_DB_COLLECTION_NAME = os.environ["ASTRA_DB_COLLECTION_NAME"] 28 | ASTRA_DB_KEYSPACE = os.getenv("ASTRA_DB_KEYSPACE") 29 | 30 | GLEAN_API_TOKEN = os.environ["GLEAN_API_TOKEN"] 31 | GLEAN_CUSTOMER = os.environ["GLEAN_CUSTOMER"] 32 | GLEAN_DATASOURCE_NAME = os.environ["GLEAN_DATASOURCE_NAME"] 33 | 34 | 35 | print(f"{Fore.GREEN}============================={Style.RESET_ALL}") 36 | print(f"{Fore.GREEN} ASTRADB - GLEAN INTEGRATION {Style.RESET_ALL}") 37 | print(f"{Fore.GREEN}============================={Style.RESET_ALL}\n") 38 | 39 | # 
Initialize Astra DB client 40 | client = DataAPIClient(callers=[("glean", "1.0")]) 41 | database = client.get_database( 42 | ASTRA_DB_API_ENDPOINT, 43 | token=ASTRA_DB_APPLICATION_TOKEN, 44 | keyspace=ASTRA_DB_KEYSPACE, 45 | ) 46 | print( 47 | f"{Fore.CYAN}[ OK ] - Credentials are OK, your database name is " 48 | f"{Style.RESET_ALL}{database.name()}{Fore.CYAN}." 49 | ) 50 | 51 | # Create collection 52 | source_collection = database.create_collection(ASTRA_DB_COLLECTION_NAME) 53 | print( 54 | f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}" 55 | f"{Fore.CYAN} is ready{Style.RESET_ALL}{Fore.CYAN}." 56 | ) 57 | 58 | # Load philosophers dataset 59 | print(f"{Fore.CYAN}[INFO] - Downloading data from Hugging Face 🤗.{Style.RESET_ALL}") 60 | philo_dataset = load_dataset("datastax/philosopher-quotes")["train"] 61 | print(f"{Fore.CYAN}[ OK ] - Dataset loaded in memory.{Style.RESET_ALL}") 62 | print(f"{Fore.CYAN}[INFO] - Sample record: {Style.RESET_ALL}{philo_dataset[16]}") 63 | 64 | 65 | def load_to_astra_db(data_to_insert, collection): 66 | """Load all of the provided data into a collection.""" 67 | def split_tags(t): 68 | return [tag for tag in (t or "").split(";") if tag] 69 | 70 | documents_to_insert = [ 71 | { 72 | **item, 73 | **{"_id": index, "tags": split_tags(item["tags"])}, 74 | } 75 | for index, item in enumerate(data_to_insert) 76 | ] 77 | collection.insert_many(documents_to_insert) 78 | 79 | # Insert documents into Astra DB 80 | philo_count = len(philo_dataset) 81 | print( 82 | f"{Fore.CYAN}[INFO] - Inserting {philo_count} documents into Astra DB..." 
83 | f"{Style.RESET_ALL}" 84 | ) 85 | load_to_astra_db(philo_dataset, source_collection) 86 | print(f"{Fore.CYAN}[ OK ] - Insertion finished.{Style.RESET_ALL}") 87 | 88 | # Setup Glean API 89 | GLEAN_API_ENDPOINT = f"https://{GLEAN_CUSTOMER}-be.glean.com/api/index/v1" 90 | print( 91 | f"{Fore.CYAN}[INFO] - Glean API setup, endpoint is:" 92 | f"{Style.RESET_ALL} {GLEAN_API_ENDPOINT}" 93 | ) 94 | 95 | # Initialize Glean client 96 | configuration = indexing_api.Configuration( 97 | host=GLEAN_API_ENDPOINT, access_token=GLEAN_API_TOKEN 98 | ) 99 | api_client = indexing_api.ApiClient(configuration) 100 | datasource_api = datasources_api.DatasourcesApi(api_client) 101 | print(f"{Fore.CYAN}[ OK ] - Glean client initialized{Style.RESET_ALL}") 102 | 103 | # Create and register datasource in Glean 104 | datasource_config = CustomDatasourceConfig( 105 | name=GLEAN_DATASOURCE_NAME, 106 | display_name="Astra DB Collection DataSource", 107 | datasource_category="PUBLISHED_CONTENT", 108 | url_regex=f"^{ASTRA_DB_API_ENDPOINT}", 109 | object_definitions=[ 110 | ObjectDefinition(doc_category="PUBLISHED_CONTENT", name="AstraVectorEntry") 111 | ], 112 | ) 113 | 114 | try: 115 | datasource_api.adddatasource_post(datasource_config) 116 | print( 117 | f"{Fore.GREEN}[ OK ] - DataSource has been created!" 118 | f"{Style.RESET_ALL}{Fore.GREEN}." 119 | ) 120 | except indexing_api.ApiException as e: 121 | print( 122 | f"{Fore.RED}[ ERROR ] - Error creating datasource: " 123 | f"{e}{Style.RESET_ALL}{Fore.GREEN}." 
124 | ) 125 | 126 | 127 | def index_astra_db_document_into_glean(astra_document): 128 | """Index one Astra DB document into Glean.""" 129 | document_id = str(astra_document["_id"]) 130 | title = f"{astra_document['author']} quote_{astra_document['_id']}" 131 | body_text = astra_document["quote"] 132 | datasource_name = GLEAN_DATASOURCE_NAME 133 | request = IndexDocumentRequest( 134 | document=DocumentDefinition( 135 | datasource=datasource_name, 136 | title=title, 137 | id=document_id, 138 | view_url=ASTRA_DB_API_ENDPOINT, 139 | body=ContentDefinition(mime_type="text/plain", text_content=body_text), 140 | permissions=DocumentPermissionsDefinition(allow_anonymous_access=True), 141 | ) 142 | ) 143 | documents_api_client = documents_api.DocumentsApi(api_client) 144 | try: 145 | documents_api_client.indexdocument_post(request) 146 | except indexing_api.ApiException as e: 147 | print(f"{Fore.RED}Error indexing document {document_id}: {e}{Style.RESET_ALL}") 148 | 149 | 150 | def index_documents_to_glean(collection): 151 | """Index all documents from an Astra DB collection to Glean.""" 152 | total_docs = collection.count_documents({}, upper_bound=1000) 153 | print( 154 | f"{Fore.CYAN}[INFO] - Indexing {total_docs} " 155 | f"documents into Glean...{Style.RESET_ALL}" 156 | ) 157 | for doc in collection.find(): 158 | try: 159 | index_astra_db_document_into_glean(doc) 160 | except Exception as error: 161 | print( 162 | f"{Fore.RED}Error indexing document " 163 | f"{doc['_id']}: {error}{Style.RESET_ALL}" 164 | ) 165 | print(f"{Fore.CYAN}[ OK ] - Indexing finished.{Style.RESET_ALL}") 166 | 167 | 168 | # Use the function to index documents into Glean 169 | index_documents_to_glean(source_collection) 170 | 171 | print(f"{Fore.GREEN}Import job completed successfully!{Style.RESET_ALL}") 172 | -------------------------------------------------------------------------------- /AstraDB_Glean_Integration.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "collapsed_sections": [ 8 | "m0QckGYCJwMF", 9 | "J_CXLd0lmGTh", 10 | "V-NVozAhMqKY" 11 | ] 12 | }, 13 | "kernelspec": { 14 | "name": "python3", 15 | "display_name": "Python 3" 16 | }, 17 | "language_info": { 18 | "name": "python" 19 | } 20 | }, 21 | "cells": [ 22 | { 23 | "cell_type": "markdown", 24 | "source": [ 25 | "# Integrate Glean with Astra DB\n", 26 | "\n", 27 | "> Demo showing how to index [Astra DB](https://docs.datastax.com/en/astra-db-serverless/index.html) data into Glean.\n", 28 | "\n", 29 | "This notebook is a walkthrough explaining how to use an Astra DB collection as a data source for Glean.\n", 30 | "\n", 31 | "Using the Python Data API client, we'll read from a collection and use the glean `indexingAPI` through a `Datasource` to index the collection contents in Glean." 32 | ], 33 | "metadata": { 34 | "id": "cJVCdm1AU_oN" 35 | } 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "source": [ 40 | "## 1. Set up Astra DB\n" 41 | ], 42 | "metadata": { 43 | "id": "oNwAew2VJk2K" 44 | } 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "source": [ 49 | "ℹ️ See the [Astra Reference documentation](https://docs.datastax.com/en/astra-db-serverless/databases/create-database.html)." 
50 | ], 51 | "metadata": { 52 | "id": "g8zkF7f3aEK7" 53 | } 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "source": [ 58 | "### 1.1: Create an Astra account\n", 59 | "\n", 60 | "Access [https://astra.datastax.com](https://astra.datastax.com) and register with a `Google` or `GitHub` account.\n", 61 | "\n", 62 | "![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/01-login.png?raw=true)" 63 | ], 64 | "metadata": { 65 | "id": "evmBzYYoiOsW" 66 | } 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "source": [ 71 | "### 1.2: Create a Database in Astra DB\n", 72 | "\n", 73 | "Get to the databases dashboard (by clicking on Databases in the left-hand navigation bar, expanding it if necessary), and click the `[Create Database]` button on the right.\n", 74 | "\n", 75 | "![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/02-create-db.png?raw=true)\n", 76 | "\n", 77 | "\n", 78 | "**ℹ️ Field Description**\n", 79 | "\n", 80 | "| Field | Description |\n", 81 | "|--------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", 82 | "| **Vector Database vs Serverless Database** | Choose `Vector Database`. In June 2023, Cassandra introduced support for vector search to enable Generative AI use cases. |\n", 83 | "| **Database name** | Database names are permanent. They must start and end with a letter or number, and they can contain no more than 50 characters, including letters, numbers, and the special characters `& + - _ ( ) < > . , @`. It is recommended to have a database for each of your applications. The free tier is limited to 5 databases. |\n", 84 | "| **Cloud Provider** | Choose whatever you like. Click a cloud provider logo, pick an Area in the list and finally pick a region. 
We recommend choosing a region that is closest to you to reduce latency. In the free tier, there is very little difference. |\n", 85 | "| **Cloud Region** | Pick a region close to you, among those available for the selected cloud provider and your plan. |\n", 86 | "\n", 87 | "If all fields are filled properly, clicking the \"Create Database\" button will start the process.\n", 88 | "\n", 89 | "![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/03-pending-db.png?raw=true)\n", 90 | "\n", 91 | "It should take a couple of minutes for your database to become `Active`.\n", 92 | "\n", 93 | "![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/04-active-db.png?raw=true)" 94 | ], 95 | "metadata": { 96 | "id": "WrG_h7M4iR4u" 97 | } 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "source": [ 102 | "### 1.3: Create an Astra database token\n", 103 | "\n", 104 | "To [connect to your database](https://docs.datastax.com/en/astra-db-serverless/get-started/quickstart.html#create-a-database-and-store-your-credentials), you need the **API endpoint** and a **Database token**.\n", 105 | "\n", 106 | "The API endpoint is available on the database screen; a small icon copies the URL to your clipboard (it should look like `https://-.apps.astra.datastax.com`).\n", 107 | "\n", 108 | "![](https://github.com/datastaxdevs/mini-demo-astradb-glean/blob/main/images/05-create-token-db.png?raw=true)\n", 109 | "\n", 110 | "To get a token, click the `[Generate Token]` button on the right. It will generate a token that you can copy to your clipboard." 111 | ], 112 | "metadata": { 113 | "id": "qWazmnXxiU5x" 114 | } 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "source": [ 119 | "## 2. 
Obtain a Glean token\n", 120 | "\n", 121 | "> [Glean Documentation](https://developers.glean.com/docs/indexing_api/indexing_api_tokens/)\n", 122 | "\n", 123 | "Admins can manage Glean API tokens via the API tokens page within Workspace Settings:\n", 124 | "\n", 125 | "```\n", 126 | "Workspace > Setup > API tokens > Indexing tokens tab\n", 127 | "```\n", 128 | "\n", 129 | "As a Glean admin, create a token and assign permissions (or have an admin do it for you).\n" 130 | ], 131 | "metadata": { 132 | "id": "cAzovP1iJuwG" 133 | } 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "source": [ 138 | "## 3. Installation\n", 139 | "\n", 140 | "Install the required dependencies:" 141 | ], 142 | "metadata": { 143 | "id": "m0QckGYCJwMF" 144 | } 145 | }, 146 | { 147 | "cell_type": "code", 148 | "source": [ 149 | "!pip install --quiet \\\n", 150 | " \"astrapy>=2.0,<3.0\" \\\n", 151 | " \"datasets>=3.5,<4.0\" \\\n", 152 | " \"https://app.glean.com/meta/indexing_api_client.zip\" # Glean does not distribute on PyPI" 153 | ], 154 | "metadata": { 155 | "id": "mditQSWbWFfn" 156 | }, 157 | "execution_count": null, 158 | "outputs": [] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "source": [ 163 | "Import the required packages:" 164 | ], 165 | "metadata": { 166 | "id": "GRhKXAiUiYul" 167 | } 168 | }, 169 | { 170 | "cell_type": "code", 171 | "source": [ 172 | "import os\n", 173 | "\n", 174 | "from getpass import getpass\n", 175 | "\n", 176 | "from astrapy import DataAPIClient\n", 177 | "from datasets import load_dataset\n", 178 | "\n", 179 | "import glean_indexing_api_client as indexing_api\n", 180 | "from glean_indexing_api_client.api import datasources_api, documents_api\n", 181 | "from glean_indexing_api_client.model.custom_datasource_config import (\n", 182 | " CustomDatasourceConfig,\n", 183 | ")\n", 184 | "from glean_indexing_api_client.model.object_definition import ObjectDefinition\n", 185 | "from glean_indexing_api_client.model.index_document_request import 
IndexDocumentRequest\n", 186 | "from glean_indexing_api_client.model.document_definition import DocumentDefinition\n", 187 | "from glean_indexing_api_client.model.content_definition import ContentDefinition\n", 188 | "from glean_indexing_api_client.model.document_permissions_definition import (\n", 189 | " DocumentPermissionsDefinition,\n", 190 | ")" 191 | ], 192 | "metadata": { 193 | "id": "QatfGIPWgZ_E" 194 | }, 195 | "execution_count": null, 196 | "outputs": [] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "source": [ 201 | "## 4. Set up variables" 202 | ], 203 | "metadata": { 204 | "id": "ER4CxMWXZ3qG" 205 | } 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "id": "z7LiKliuU-Le" 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"] = getpass(\"ASTRA_DB_APPLICATION_TOKEN = \").strip()\n", 216 | "os.environ[\"ASTRA_DB_API_ENDPOINT\"] = input(\"ASTRA_DB_API_ENDPOINT = \").strip()\n", 217 | "os.environ[\"ASTRA_DB_COLLECTION_NAME\"] = input(\"ASTRA_DB_COLLECTION_NAME = \").strip() or \"glean_source_collection\"\n", 218 | "os.environ[\"ASTRA_DB_KEYSPACE\"] = input(\"(optional) ASTRA_DB_KEYSPACE = \").strip()\n", 219 | "\n", 220 | "if os.environ[\"ASTRA_DB_KEYSPACE\"] == \"\":\n", 221 | " del os.environ[\"ASTRA_DB_KEYSPACE\"]" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "source": [ 227 | "os.environ[\"GLEAN_API_TOKEN\"] = getpass(\"GLEAN_API_TOKEN = \").strip()\n", 228 | "os.environ[\"GLEAN_CUSTOMER\"] = input(\"GLEAN_CUSTOMER = \").strip()\n", 229 | "os.environ[\"GLEAN_DATASOURCE_NAME\"] = input(\"GLEAN_DATASOURCE_NAME = \").strip()" 230 | ], 231 | "metadata": { 232 | "id": "-toIPpuPhG8j" 233 | }, 234 | "execution_count": null, 235 | "outputs": [] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "source": [ 240 | "ASTRA_DB_APPLICATION_TOKEN = os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"]\n", 241 | "ASTRA_DB_API_ENDPOINT = 
os.environ[\"ASTRA_DB_API_ENDPOINT\"]\n", 242 | "ASTRA_DB_COLLECTION_NAME = os.environ[\"ASTRA_DB_COLLECTION_NAME\"]\n", 243 | "ASTRA_DB_KEYSPACE = os.getenv(\"ASTRA_DB_KEYSPACE\")\n", 244 | "\n", 245 | "GLEAN_API_TOKEN = os.environ[\"GLEAN_API_TOKEN\"]\n", 246 | "GLEAN_CUSTOMER = os.environ[\"GLEAN_CUSTOMER\"]\n", 247 | "GLEAN_DATASOURCE_NAME = os.environ[\"GLEAN_DATASOURCE_NAME\"]" 248 | ], 249 | "metadata": { 250 | "id": "aIG7kFd2i2M7" 251 | }, 252 | "execution_count": null, 253 | "outputs": [] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "source": [ 258 | "## 5. Populate Astra DB\n", 259 | "\n", 260 | "Create an empty collection and fill it with sample data." 261 | ], 262 | "metadata": { 263 | "id": "J_CXLd0lmGTh" 264 | } 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "source": [ 269 | "### 5.1: Connect to Astra DB" 270 | ], 271 | "metadata": { 272 | "id": "jxAAGbCeWtDH" 273 | } 274 | }, 275 | { 276 | "cell_type": "code", 277 | "source": [ 278 | "# Initialize Astra DB client\n", 279 | "client = DataAPIClient(callers=[(\"glean\", \"1.0\")])\n", 280 | "database = client.get_database(\n", 281 | " ASTRA_DB_API_ENDPOINT,\n", 282 | " token=ASTRA_DB_APPLICATION_TOKEN,\n", 283 | " keyspace=ASTRA_DB_KEYSPACE,\n", 284 | ")\n", 285 | "print(f\"[ OK ] - Credentials are OK, your database name is {database.name()}.\")" 286 | ], 287 | "metadata": { 288 | "id": "OXNqpCLoXNzF" 289 | }, 290 | "execution_count": null, 291 | "outputs": [] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "source": [ 296 | "### 5.2: Create a collection" 297 | ], 298 | "metadata": { 299 | "id": "BMpJ_XoUYGTa" 300 | } 301 | }, 302 | { 303 | "cell_type": "code", 304 | "source": [ 305 | "# Create collection\n", 306 | "source_collection = database.create_collection(ASTRA_DB_COLLECTION_NAME)\n", 307 | "print(f\"[ OK ] - Collection {source_collection.name} is ready.\")" 308 | ], 309 | "metadata": { 310 | "id": "zentae6IYT5k" 311 | }, 312 | "execution_count": null, 313 | "outputs": [] 
314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "source": [ 318 | "### 5.3: Load dataset" 319 | ], 320 | "metadata": { 321 | "id": "75d6ZWKsYtDA" 322 | } 323 | }, 324 | { 325 | "cell_type": "code", 326 | "source": [ 327 | "print(f\"[INFO] - Downloading data from Hugging Face 🤗.\")\n", 328 | "philo_dataset = load_dataset(\"datastax/philosopher-quotes\")[\"train\"]\n", 329 | "print(f\"[ OK ] - Dataset loaded in memory.\")\n", 330 | "print(f\"[INFO] - Sample record: {philo_dataset[16]}\")" 331 | ], 332 | "metadata": { 333 | "id": "wvX02-JdYzxt" 334 | }, 335 | "execution_count": null, 336 | "outputs": [] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "source": [ 341 | "### 5.4 Load data into the collection" 342 | ], 343 | "metadata": { 344 | "id": "ausoJNtGac1V" 345 | } 346 | }, 347 | { 348 | "cell_type": "code", 349 | "source": [ 350 | "def load_to_astra_db(data_to_insert, collection):\n", 351 | " \"\"\"Load all of the provided data into a collection.\"\"\"\n", 352 | " def split_tags(t):\n", 353 | " return [tag for tag in (t or \"\").split(\";\") if tag]\n", 354 | "\n", 355 | " documents_to_insert = [\n", 356 | " {\n", 357 | " **item,\n", 358 | " **{\"_id\": index, \"tags\": split_tags(item[\"tags\"])},\n", 359 | " }\n", 360 | " for index, item in enumerate(data_to_insert)\n", 361 | " ]\n", 362 | " collection.insert_many(documents_to_insert)\n", 363 | "\n", 364 | "\n", 365 | "# Insert documents into Astra DB\n", 366 | "philo_count = len(philo_dataset)\n", 367 | "print(f\"[INFO] - Inserting {philo_count} documents into Astra DB...\")\n", 368 | "load_to_astra_db(philo_dataset, source_collection)\n", 369 | "print(f\"[ OK ] - Insertion finished.\")" 370 | ], 371 | "metadata": { 372 | "id": "AdEMKIiRahqq" 373 | }, 374 | "execution_count": null, 375 | "outputs": [] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "source": [ 380 | "## 6. 
Index data in Glean" 381 | ], 382 | "metadata": { 383 | "id": "ufbFgCcobVF-" 384 | } 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "source": [ 389 | "### 6.1 Initialize Glean client" 390 | ], 391 | "metadata": { 392 | "id": "QZgyw-RGmH95" 393 | } 394 | }, 395 | { 396 | "cell_type": "code", 397 | "source": [ 398 | "# Setup Glean API\n", 399 | "GLEAN_API_ENDPOINT = f\"https://{GLEAN_CUSTOMER}-be.glean.com/api/index/v1\"\n", 400 | "print(f\"[INFO] - Glean API setup, endpoint is: {GLEAN_API_ENDPOINT}\")\n", 401 | "\n", 402 | "# Initialize Glean client\n", 403 | "configuration = indexing_api.Configuration(\n", 404 | " host=GLEAN_API_ENDPOINT, access_token=GLEAN_API_TOKEN\n", 405 | ")\n", 406 | "api_client = indexing_api.ApiClient(configuration)\n", 407 | "datasource_api = datasources_api.DatasourcesApi(api_client)\n", 408 | "print(f\"[ OK ] - Glean client initialized\")\n", 409 | "\n", 410 | "# Create and register datasource in Glean\n", 411 | "datasource_config = CustomDatasourceConfig(\n", 412 | " name=GLEAN_DATASOURCE_NAME,\n", 413 | " display_name=\"Astra DB Collection DataSource\",\n", 414 | " datasource_category=\"PUBLISHED_CONTENT\",\n", 415 | " url_regex=f\"^{ASTRA_DB_API_ENDPOINT}\",\n", 416 | " object_definitions=[\n", 417 | " ObjectDefinition(doc_category=\"PUBLISHED_CONTENT\", name=\"AstraVectorEntry\")\n", 418 | " ],\n", 419 | ")\n", 420 | "\n", 421 | "try:\n", 422 | " datasource_api.adddatasource_post(datasource_config)\n", 423 | " print(f\"[ OK ] - DataSource has been created!.\")\n", 424 | "except indexing_api.ApiException as e:\n", 425 | " print(f\"[ ERROR ] - Error creating datasource: {e}.\")" 426 | ], 427 | "metadata": { 428 | "id": "fOD7yLUemKrZ" 429 | }, 430 | "execution_count": null, 431 | "outputs": [] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "source": [ 436 | "### 6.2 Create functions to index documents" 437 | ], 438 | "metadata": { 439 | "id": "0c7AYE3Vmj3k" 440 | } 441 | }, 442 | { 443 | "cell_type": "code", 444 | 
"source": [ 445 | "def index_astra_db_document_into_glean(astra_document):\n", 446 | " \"\"\"Index one Astra DB document into Glean.\"\"\"\n", 447 | " document_id = str(astra_document[\"_id\"])\n", 448 | " title = f\"{astra_document['author']} quote_{astra_document['_id']}\"\n", 449 | " body_text = astra_document[\"quote\"]\n", 450 | " datasource_name = GLEAN_DATASOURCE_NAME\n", 451 | " request = IndexDocumentRequest(\n", 452 | " document=DocumentDefinition(\n", 453 | " datasource=datasource_name,\n", 454 | " title=title,\n", 455 | " id=document_id,\n", 456 | " view_url=ASTRA_DB_API_ENDPOINT,\n", 457 | " body=ContentDefinition(mime_type=\"text/plain\", text_content=body_text),\n", 458 | " permissions=DocumentPermissionsDefinition(allow_anonymous_access=True),\n", 459 | " )\n", 460 | " )\n", 461 | " documents_api_client = documents_api.DocumentsApi(api_client)\n", 462 | " try:\n", 463 | " documents_api_client.indexdocument_post(request)\n", 464 | " except indexing_api.ApiException as e:\n", 465 | " print(f\"Error indexing document {document_id}: {e}\")\n", 466 | "\n", 467 | "\n", 468 | "def index_documents_to_glean(collection):\n", 469 | " \"\"\"Index all documents from an Astra DB collection to Glean.\"\"\"\n", 470 | " total_docs = collection.count_documents({}, upper_bound=1000)\n", 471 | " print(f\"[INFO] - Indexing {total_docs} documents into Glean...\")\n", 472 | " for doc in collection.find():\n", 473 | " try:\n", 474 | " index_astra_db_document_into_glean(doc)\n", 475 | " except Exception as error:\n", 476 | " print(f\"Error indexing document {doc['_id']}: {error}\")\n", 477 | " print(f\"[ OK ] - Indexing finished.\")\n" 478 | ], 479 | "metadata": { 480 | "id": "UQdT1ZrwbSeO" 481 | }, 482 | "execution_count": null, 483 | "outputs": [] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "source": [ 488 | "### 6.3 Index documents\n" 489 | ], 490 | "metadata": { 491 | "id": "EgBuejZmm5p0" 492 | } 493 | }, 494 | { 495 | "cell_type": "code", 496 | "source": [ 
497 | "index_documents_to_glean(source_collection)\n", 498 | "\n", 499 | "print(f\"Import job completed successfully!\")" 500 | ], 501 | "metadata": { 502 | "id": "7uNTfbSqnBp9" 503 | }, 504 | "execution_count": null, 505 | "outputs": [] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "source": [ 510 | "## Wrap up and more information\n", 511 | "\n", 512 | "Congratulations: you have indexed data from an Astra DB collection into Glean!\n", 513 | "\n", 514 | "You can inspect the Astra DB collection in your Astra dashboard: navigate to the database and find the \"Data explorer\" tab to locate your collection.\n", 515 | "\n", 516 | "You can perform a test with Glean: search for the content you just indexed and verify the response contains information coming from the inserted dataset.\n", 517 | "\n", 518 | "ℹ️ [Glean integration page](https://docs.datastax.com/en/astra-db-serverless/integrations/glean.html) on Astra DB documentation." 519 | ], 520 | "metadata": { 521 | "id": "X5lzDJC5h5KV" 522 | } 523 | } 524 | ] 525 | } --------------------------------------------------------------------------------
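
Both the script and the notebook prepare each dataset record before calling `insert_many`: the row is copied, a numeric `_id` is taken from its position, and the `;`-separated `tags` string is turned into a list. The sketch below isolates that step so it can be tried without Astra DB or Glean credentials. The name `prepare_documents` and the sample rows are hypothetical, chosen for illustration only; the field names (`author`, `quote`, `tags`) are the ones the code above reads.

```python
def split_tags(raw_tags):
    """Turn a ';'-separated tag string into a list, skipping empty entries."""
    return [tag for tag in (raw_tags or "").split(";") if tag]


def prepare_documents(rows):
    """Copy each row, add a positional numeric _id, and split its tags.

    Hypothetical helper: it mirrors the list comprehension inside
    load_to_astra_db but returns the documents instead of inserting them.
    """
    return [
        {**row, "_id": index, "tags": split_tags(row.get("tags"))}
        for index, row in enumerate(rows)
    ]


# Made-up rows shaped like the dataset records (not real dataset content).
sample_rows = [
    {"author": "aristotle", "quote": "Placeholder quote one.", "tags": "ethics;knowledge"},
    {"author": "plato", "quote": "Placeholder quote two.", "tags": None},
]

documents = prepare_documents(sample_rows)
for document in documents:
    print(document["_id"], document["author"], document["tags"])
```

Keeping this step as a pure function makes it possible to inspect the documents before any network call; passing its result to `collection.insert_many(...)` would then match what `load_to_astra_db` does.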