├── .devcontainer
│   ├── Dockerfile
│   └── devcontainer.json
├── README.md
├── data
│   ├── paper01.pdf
│   └── paper02.pdf
├── notebooks
│   ├── 01_blog.ipynb
│   └── pdf01_snippet.png
└── requirements.txt


/.devcontainer/Dockerfile:
--------------------------------------------------------------------------------
# For more information, please refer to https://aka.ms/vscode-docker-python
FROM quay.io/unstructured-io/unstructured:0.6.3

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1

# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

--------------------------------------------------------------------------------
/.devcontainer/devcontainer.json:
--------------------------------------------------------------------------------
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
{
    "name": "Existing Dockerfile",
    "build": {
        // Sets the run context to one level up instead of the .devcontainer folder.
        "context": "..",
        // Update the 'dockerFile' property if you aren't using the standard 'Dockerfile' filename.
        "dockerfile": "./Dockerfile"
    },
    // Features to add to the dev container. More info: https://containers.dev/features.
    // "features": {},
    // Use 'forwardPorts' to make a list of ports inside the container available locally.
    // "forwardPorts": [],
    // Uncomment the next line to run commands after the container is created.
    "postCreateCommand": "pip3 install -r requirements.txt",
    // Configure tool-specific properties.
    "customizations": {
        "vscode": {
            "extensions": [
                "GitHub.vscode-pull-request-github",
                "ms-python.python",
                "GitHub.copilot",
                "eamodio.gitlens"
            ]
        }
    },

    "containerEnv": {
        "OPENAI_APIKEY": "sk-THISISNOTMYOPENAIAPIKEY"
    }
    // Uncomment to connect as an existing user other than the container default. More info: https://aka.ms/dev-containers-non-root.
    // "remoteUser": "devcontainer"
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Introduction
This repo demonstrates how to use the [Unstructured library](https://unstructured-io.github.io/unstructured/getting_started.html) with Weaviate. The Unstructured library parses a variety of data sources and extracts structured text from them. Supported sources include, but are not limited to, PDFs, PowerPoint presentations, and JPEG images.

The included dataset consists of two publicly available research papers. One paper uses a single-column layout, and the other uses a two-column layout. The notebook starts with a basic approach to using Unstructured and ends with an end-to-end example. That example connects to your Weaviate instance, defines a schema, uploads the data, and then runs two queries.

Read the [blog post](https://weaviate.io/blog/ingesting-pdfs-into-weaviate) for more information!
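
The step between parsing and uploading can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes the parsed elements expose `category` and `text` attributes, as the elements returned by Unstructured's `partition_pdf` should. The helper names `load_pdf_elements` and `group_by_title`, the `FakeElement` stand-in, and the output keys `source`, `title`, and `content` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


def load_pdf_elements(path: str) -> list:
    """Parse a PDF with Unstructured.

    The import is local so the rest of this sketch runs without the library.
    """
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path)


def group_by_title(elements: Iterable, source: str) -> list[dict]:
    """Attach each run of NarrativeText to the most recent Title element.

    Elements of any other category are skipped, and a title that has no
    narrative text under it is dropped.
    """
    sections: list[dict] = []
    current = {"source": source, "title": "", "content": []}
    for element in elements:
        if element.category == "Title":
            if current["content"]:
                sections.append(current)
            current = {"source": source, "title": element.text, "content": []}
        elif element.category == "NarrativeText":
            current["content"].append(element.text)
    if current["content"]:
        sections.append(current)
    # Join the collected paragraphs into one string per section.
    return [{**s, "content": " ".join(s["content"])} for s in sections]


@dataclass
class FakeElement:
    """Hypothetical stand-in for a parsed element, used only for this demo."""

    category: str
    text: str


sample = [
    FakeElement("Title", "Abstract"),
    FakeElement("NarrativeText", "We study X."),
    FakeElement("NarrativeText", "Results follow."),
    FakeElement("Title", "1 Introduction"),
    FakeElement("UncategorizedText", "3"),
    FakeElement("NarrativeText", "Background."),
]
objects = group_by_title(sample, source="paper01.pdf")
for obj in objects:
    print(obj)
```

With the library installed, `group_by_title(load_pdf_elements("data/paper01.pdf"), "paper01.pdf")` would produce one dictionary per section, which could then be uploaded to a Weaviate class whose properties match the chosen keys.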
--------------------------------------------------------------------------------
/data/paper01.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/weaviate/how-to-ingest-pdfs-with-unstructured/1873c11bdfcbe66d0ae38c8015efb77a7b1a6a50/data/paper01.pdf
--------------------------------------------------------------------------------
/data/paper02.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/weaviate/how-to-ingest-pdfs-with-unstructured/1873c11bdfcbe66d0ae38c8015efb77a7b1a6a50/data/paper02.pdf
--------------------------------------------------------------------------------
/notebooks/pdf01_snippet.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/weaviate/how-to-ingest-pdfs-with-unstructured/1873c11bdfcbe66d0ae38c8015efb77a7b1a6a50/notebooks/pdf01_snippet.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
# To ensure app dependencies are ported from your virtual environment/host machine into your container, run 'pip freeze > requirements.txt' in the terminal to overwrite this file
weaviate-client
pytest
pytest-cov
black[jupyter]
ipykernel
textwrap3
--------------------------------------------------------------------------------