├── requirements.txt
├── knowledge_base
    ├── settings.json
    ├── faiss
    │   ├── 2305.16291v2_f49cd5.faiss
    │   │   ├── index.pkl
    │   │   └── index.faiss
    │   └── attention_is_all_you_need_64252a.faiss
    │   │   └── index.faiss
    └── json
    │   ├── attention_is_all_you_need_64252a.json
    │   └── 2305.16291v2_f49cd5.json
├── settings.json
├── .gitignore
├── system_prompt.txt
├── README.md
├── crawler.py
└── app.py


/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mshojaei77/open-notebook/HEAD/requirements.txt


--------------------------------------------------------------------------------
/knowledge_base/settings.json:
--------------------------------------------------------------------------------
1 | {"model": "gpt-3.5-turbo", "top_k": 3, "chunk_size": 1500, "chunk_overlap": 50}


--------------------------------------------------------------------------------
/settings.json:
--------------------------------------------------------------------------------
1 | {"model": "gpt-4o", "top_k": 1, "chunk_size": 1500, "chunk_overlap": 50, "min_content_length": 100, "max_depth": 1}


--------------------------------------------------------------------------------
/knowledge_base/faiss/2305.16291v2_f49cd5.faiss/index.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mshojaei77/open-notebook/HEAD/knowledge_base/faiss/2305.16291v2_f49cd5.faiss/index.pkl


--------------------------------------------------------------------------------
/knowledge_base/faiss/2305.16291v2_f49cd5.faiss/index.faiss:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mshojaei77/open-notebook/HEAD/knowledge_base/faiss/2305.16291v2_f49cd5.faiss/index.faiss


--------------------------------------------------------------------------------
/knowledge_base/faiss/attention_is_all_you_need_64252a.faiss/index.faiss:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mshojaei77/open-notebook/HEAD/knowledge_base/faiss/attention_is_all_you_need_64252a.faiss/index.faiss


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
  1 | # Python-specific
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # Virtual environments
  7 | venv/
  8 | env/
  9 | .env
 10 | 
 11 | # Distribution / packaging
 12 | .Python
 13 | build/
 14 | develop-eggs/
 15 | dist/
 16 | downloads/
 17 | eggs/
 18 | .eggs/
 19 | lib/
 20 | lib64/
 21 | parts/
 22 | sdist/
 23 | var/
 24 | wheels/
 25 | *.egg-info/
 26 | .installed.cfg
 27 | *.egg
 28 | 
 29 | # PyInstaller
 30 | *.manifest
 31 | *.spec
 32 | 
 33 | # Installer logs
 34 | pip-log.txt
 35 | pip-delete-this-directory.txt
 36 | 
 37 | # Unit test / coverage reports
 38 | htmlcov/
 39 | .tox/
 40 | .coverage
 41 | .coverage.*
 42 | .cache
 43 | nosetests.xml
 44 | coverage.xml
 45 | *.cover
 46 | .hypothesis/
 47 | .pytest_cache/
 48 | 
 49 | # Jupyter Notebook
 50 | .ipynb_checkpoints
 51 | 
 52 | # pyenv
 53 | .python-version
 54 | 
 55 | # Environments
 56 | .env
 57 | .venv
 58 | env/
 59 | venv/
 60 | ENV/
 61 | env.bak/
 62 | venv.bak/
 63 | 
 64 | # Spyder project settings
 65 | .spyderproject
 66 | .spyproject
 67 | 
 68 | # Rope project settings
 69 | .ropeproject
 70 | 
 71 | # mkdocs documentation
 72 | /site
 73 | 
 74 | # mypy
 75 | .mypy_cache/
 76 | 
 77 | # IDEs and editors
 78 | .idea/
 79 | .vscode/
 80 | *.swp
 81 | *.swo
 82 | *~
 83 | 
 84 | # OS generated files
 85 | .DS_Store
 86 | .DS_Store?
 87 | ._*
 88 | .Spotlight-V100
 89 | .Trashes
 90 | ehthumbs.db
 91 | Thumbs.db
 92 | 
 93 | # Database files
 94 | *.db
 95 | *.sqlite3
 96 | 
 97 | # Logs
 98 | *.log
 99 | 
100 | # Backup files
101 | *.bak
102 | 
103 | # Compiled source
104 | *.com
105 | *.class
106 | *.dll
107 | *.exe
108 | *.o
109 | *.so
110 | 
111 | # Packages
112 | *.7z
113 | *.dmg
114 | *.gz
115 | *.iso
116 | *.jar
117 | *.rar
118 | *.tar
119 | *.zip
120 | 
121 | 
122 | *.pkl


--------------------------------------------------------------------------------
/system_prompt.txt:
--------------------------------------------------------------------------------
 1 | You are a helpful and informative AI assistant designed to answer your questions based on a vast knowledge base. 
 2 | 
 3 | Here are some key principles to keep in mind:
 4 | 
 5 | 1. **Utilize the Knowledge Base:** If relevant, draw upon the provided information from the knowledge base to support your responses.  Always cite the source of any information you retrieve. 
 6 | 2. **Distinguish Facts and Reasoning:** Clearly separate factual information retrieved from the knowledge base from your own reasoning and analysis. Use phrases like "According to the knowledge base..." or "Based on this information, it seems likely..." to help users understand the source of your statements.
 7 | 3. **Acknowledge Uncertainty:** If a question cannot be answered with certainty or the information is unclear, admit it honestly.  Phrases like "I'm not sure I have enough information to answer that definitively," or "This information is ambiguous, but..." can be helpful. 
 8 | 4. **Conciseness and Accuracy:**  Provide concise, accurate, and relevant answers tailored to the user's specific query. Avoid unnecessary elaboration or irrelevant details. 
 9 | 5. **Clarification:** If the user's intent is ambiguous or their question requires further context, politely ask for clarification. For example, "Could you please rephrase your question?" or "I need a little more information to understand what you're asking." 
10 | 6. **Consistency:** Maintain consistency with previously stated information. If the user asks a follow-up question, ensure your responses build upon the previous conversation and are consistent with the information shared earlier. 
11 | 7. **Ethical Guidelines:** Always adhere to ethical guidelines and avoid generating harmful, biased, or offensive content. 
12 | 
13 | Remember, your goal is to provide users with accurate, helpful, and informative responses based on the available context and the knowledge base. 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Open Notebook
 2 | 
 3 | Open Notebook is an AI-powered knowledge management and question-answering system built with Streamlit. It allows users to create a personalized knowledge base from various sources and interact with it using natural language queries.
 4 | 
 5 | ## Features
 6 | 
 7 | - **AI-Powered Conversations**: Utilizes OpenAI's GPT models for intelligent responses.
 8 | - **Custom Knowledge Base**: Add content from websites, PDFs, and custom text inputs.
 9 | - **RAG (Retrieval-Augmented Generation)**: Enhances AI responses with relevant information from your knowledge base.
10 | - **User-Friendly Interface**: Clean, dark-themed UI with expandable sections for easy navigation.
11 | - **Flexible Configuration**: Customize AI model, retrieval parameters, and more.
12 | 
13 | ## Installation
14 | 
15 | 1. Clone the repository:
16 |    ```
17 |    https://github.com/mshojaei77/open-notebook.git
18 |    ```
19 |    ```
20 |    cd open-notebook
21 |    ```
22 | 
23 | 2. Install the required dependencies:
24 |    ```
25 |    pip install -r requirements.txt
26 |    ```
27 | 
28 | 3. Set up your OpenAI API key:
29 |    - Create a `.env` file in the project root.
30 |    - Add your API key: `OPENAI_API_KEY=your_api_key_here`
31 | 
32 | ## Usage
33 | 
34 | 1. Run the Streamlit app:
35 |    ```
36 |    streamlit run app.py
37 |    ```
38 | 
39 | 2. Open your web browser and navigate to the provided local URL (usually `http://localhost:8501`).
40 | 
41 | 3. Configure the app:
42 |    - Enter your OpenAI API key if not set in the `.env` file.
43 |    - Adjust advanced settings like AI model and chunk size if needed.
44 | 
45 | 4. Build your knowledge base:
46 |    - Add websites by entering URLs.
47 |    - Upload PDF documents.
48 |    - Input custom text directly.
49 | 
50 | 5. Start asking questions! The AI will respond based on your knowledge base.
51 | 
52 | ## Configuration
53 | 
54 | - **AI Model**: Choose between different GPT models.
55 | - **Top K**: Number of relevant documents to retrieve for each query.
56 | - **Chunk Size**: Size of text chunks for processing.
57 | - **Chunk Overlap**: Overlap between text chunks.
58 | 
59 | ## Managing Your Knowledge Base
60 | 
61 | - View all items in your knowledge base.
62 | - Remove individual items or clear the entire knowledge base.
63 | - Refresh the app to see updates.
64 | 
65 | ## File Structure
66 | 
67 | - `app.py`: Main application file.
68 | - `crawler.py`: Web scraping functionality.
69 | - `system_prompt.txt`: System prompt for the AI.
70 | - `knowledge_base/`: Directory for storing knowledge base files.
71 |   - `json/`: JSON files of processed content.
72 |   - `faiss/`: FAISS vector stores for efficient retrieval.
73 | 
74 | ## Dependencies
75 | 
76 | - Streamlit
77 | - LangChain
78 | - OpenAI
79 | - FAISS
80 | - PyPDF2
81 | - python-dotenv
82 | 
83 | ## Contributing
84 | 
85 | Contributions are welcome! Please feel free to submit a Pull Request.
86 | 
87 | ## License
88 | 
89 | This project is licensed under the MIT License - see the LICENSE file for details.
90 | 
91 | ## Acknowledgments
92 | 
93 | - OpenAI for providing the GPT models.
94 | - Streamlit for the web app framework.
95 | - LangChain for RAG implementation.
96 | 


--------------------------------------------------------------------------------
/crawler.py:
--------------------------------------------------------------------------------
  1 | import streamlit as st
  2 | import json
  3 | import os
  4 | from pathlib import Path
  5 | import hashlib
  6 | import scrapy
  7 | from scrapy.crawler import CrawlerProcess
  8 | from bs4 import BeautifulSoup
  9 | from langchain.text_splitter import RecursiveCharacterTextSplitter
 10 | from langchain_community.embeddings import OpenAIEmbeddings
 11 | from langchain_community.vectorstores import FAISS
 12 | from openai import OpenAI
 13 | from urllib.parse import urlparse
 14 | import multiprocessing
 15 | from functools import partial
 16 | from scrapy.downloadermiddlewares.retry import RetryMiddleware
 17 | from scrapy.utils.response import response_status_message
 18 | import time
 19 | 
 20 | class CustomRetryMiddleware(RetryMiddleware):
 21 |     def __init__(self, settings):
 22 |         super().__init__(settings)
 23 |         self.max_retry_times = settings.getint('RETRY_TIMES')
 24 | 
 25 |     def process_response(self, request, response, spider):
 26 |         if response.status == 429:
 27 |             spider.logger.info(f"Received 429 response. Retrying after delay.")
 28 |             time.sleep(60)  # Wait for 60 seconds before retrying
 29 |             return self._retry(request, response.status, spider) or response
 30 |         return super().process_response(request, response, spider)
 31 | 
 32 | class GeneralSpider(scrapy.Spider):
 33 |     name = "general_spider"
 34 | 
 35 |     def __init__(self, start_url, max_depth, min_content_length, *args, **kwargs):
 36 |         super(GeneralSpider, self).__init__(*args, **kwargs)
 37 |         self.start_urls = [start_url]
 38 |         self.allowed_domains = [urlparse(start_url).netloc]
 39 |         self.max_depth = max_depth
 40 |         self.min_content_length = min_content_length
 41 | 
 42 |     def parse(self, response):
 43 |         if self.is_valid_url(response.url):
 44 |             page_content = response.text
 45 |             clean_text = self.clean_html(page_content)
 46 | 
 47 |             if self.is_high_quality_content(clean_text):
 48 |                 yield {'url': response.url, 'content': clean_text}
 49 | 
 50 |         if self.max_depth > 1:
 51 |             for next_page in response.css('a::attr(href)').getall():
 52 |                 next_page = response.urljoin(next_page)
 53 |                 if self.is_valid_url(next_page) and self.is_within_depth(next_page):
 54 |                     yield response.follow(next_page, self.parse)
 55 | 
 56 |     def clean_html(self, raw_html):
 57 |         soup = BeautifulSoup(raw_html, "html.parser")
 58 |         for script in soup(["script", "style"]):
 59 |             script.decompose()
 60 | 
 61 |         content = []
 62 |         for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'span', 'code']):
 63 |             text = element.get_text(separator=" ", strip=True)
 64 |             if len(text) > 30:
 65 |                 content.append(text)
 66 | 
 67 |         return " ".join(content)
 68 | 
 69 |     def is_valid_url(self, url):
 70 |         exclude_patterns = ['contact', 'about',
 71 |                             'privacy', 'terms', 'login', 'signup']
 72 |         return not any(pattern in url for pattern in exclude_patterns) and urlparse(url).netloc in self.allowed_domains
 73 | 
 74 |     def is_within_depth(self, url):
 75 |         return url.count('/') <= self.max_depth + 2
 76 | 
 77 |     def is_high_quality_content(self, text):
 78 |         return len(text) > self.min_content_length
 79 | 
 80 | 
 81 | def scrape_url(url, max_depth, min_content_length):
 82 |     url_hash = hashlib.md5(url.encode()).hexdigest()
 83 |     process = CrawlerProcess(settings={
 84 |         'FEED_FORMAT': 'json',
 85 |         'FEED_URI': f'{url_hash}.json',
 86 |         'RETRY_TIMES': 5,
 87 |         'RETRY_HTTP_CODES': [429, 500, 502, 503, 504, 522, 524],
 88 |         'DOWNLOADER_MIDDLEWARES': {
 89 |             'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
 90 |             '__main__.CustomRetryMiddleware': 550,
 91 |         },
 92 |     })
 93 |     process.crawl(GeneralSpider, start_url=url, max_depth=max_depth,
 94 |                   min_content_length=min_content_length)
 95 |     process.start()
 96 |     return url
 97 | 
 98 | 
 99 | def scrape_urls_parallel(urls, max_depth, min_content_length):
100 |     with multiprocessing.Pool() as pool:
101 |         scrape_func = partial(scrape_url, max_depth=max_depth,
102 |                               min_content_length=min_content_length)
103 |         results = pool.map(scrape_func, urls)
104 |     return results
105 | 
106 | if __name__ == '__main__':
107 |     urls = ['https://medium.com/@lorevanoudenhove/how-to-build-ai-agents-with-langgraph-a-step-by-step-guide-5d84d9c7e832']
108 |     max_depth = 1
109 |     min_content_length = 100
110 |     results = scrape_urls_parallel(urls, max_depth, min_content_length)
111 |     print(results)


--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
  1 | import streamlit as st
  2 | import json
  3 | import os
  4 | from pathlib import Path
  5 | from langchain.text_splitter import RecursiveCharacterTextSplitter
  6 | from langchain_openai import OpenAIEmbeddings
  7 | from langchain_community.vectorstores import FAISS
  8 | from openai import OpenAI
  9 | from crawler import scrape_urls_parallel
 10 | from dotenv import load_dotenv
 11 | import shutil
 12 | import PyPDF2
 13 | import uuid
 14 | import re
 15 | 
 16 | # Define directories for knowledge base
 17 | KBASE_DIR = Path("knowledge_base")
 18 | JSON_DIR = KBASE_DIR / "json"
 19 | FAISS_DIR = KBASE_DIR / "faiss"
 20 | 
 21 | # Create directories if they don't exist
 22 | JSON_DIR.mkdir(parents=True, exist_ok=True)
 23 | FAISS_DIR.mkdir(parents=True, exist_ok=True)
 24 | 
 25 | # Initialize session state for API key
 26 | if "api_key" not in st.session_state:
 27 |     st.session_state.api_key = ""
 28 | 
 29 | def sanitize_filename(name):
 30 |     # Remove invalid characters for filenames
 31 |     return re.sub(r'[^a-zA-Z0-9-_\.]', '_', name)
 32 | 
 33 | def generate_readable_filename(base_name):
 34 |     # Truncate the base name to a reasonable length for readability
 35 |     return '_'.join(base_name.split()[:5]).lower()
 36 | 
 37 | def setup_rag(chunk_size, chunk_overlap, files):
 38 |     if 'api_key' not in st.session_state or not st.session_state.api_key:
 39 |         st.sidebar.error("Please enter your OpenAI API Key in the sidebar.")
 40 |         return None
 41 | 
 42 |     try:
 43 |         embeddings = OpenAIEmbeddings(api_key=st.session_state.api_key)
 44 |     except Exception as e:
 45 |         st.sidebar.error(f"Error initializing OpenAI Embeddings: {str(e)}")
 46 |         return None
 47 | 
 48 |     text_splitter = RecursiveCharacterTextSplitter(
 49 |         separators=['\n'], chunk_size=chunk_size, chunk_overlap=chunk_overlap)
 50 | 
 51 |     vector_dbs = {}
 52 | 
 53 |     for file in files:
 54 |         if file.endswith(".json") and file != "settings.json":
 55 |             try:
 56 |                 file_path = JSON_DIR / file
 57 |                 embedding_file = FAISS_DIR / file.replace(".json", ".faiss")
 58 | 
 59 |                 if embedding_file.exists():
 60 |                     vector_db = FAISS.load_local(
 61 |                         str(embedding_file), embeddings, allow_dangerous_deserialization=True)
 62 |                 else:
 63 |                     with open(file_path, 'r', encoding='utf-8') as f:
 64 |                         try:
 65 |                             data = json.load(f)
 66 |                         except json.JSONDecodeError as json_error:
 67 |                             st.sidebar.error(f"Error processing file {file}: {str(json_error)}")
 68 |                             continue
 69 |                     if isinstance(data, dict):
 70 |                         text = data.get('content', '') or data.get('pasted_text', '')
 71 |                         title = data.get('filename', data.get('title', 'Untitled'))
 72 |                     elif isinstance(data, list):
 73 |                         text = ' '.join([item.get('content', '') or item.get('pasted_text', '') for item in data])
 74 |                         title = "Multiple Documents"
 75 |                     else:
 76 |                         raise ValueError(f"Unexpected data format in {file}")
 77 |                     documents = text_splitter.create_documents([text])
 78 |                     vector_db = FAISS.from_documents(documents, embeddings)
 79 |                     vector_db.save_local(str(embedding_file))
 80 | 
 81 |                 vector_dbs[file] = vector_db
 82 |             except json.JSONDecodeError as json_error:
 83 |                 st.sidebar.error(f"Error processing file {file}: {str(json_error)}")
 84 |                 continue
 85 |             except Exception as e:
 86 |                 st.sidebar.error(f"Error processing file {file}: {str(e)}")
 87 |                 continue
 88 | 
 89 |     return vector_dbs
 90 | 
 91 | def query_rag(query, vector_dbs, top_k):
 92 |     if not vector_dbs:
 93 |         st.sidebar.error("RAG system is not properly set up. Please check your configuration and try again.")
 94 |         return None
 95 | 
 96 |     # Collect all documents and their scores across all vector DBs
 97 |     all_results = []
 98 |     for vector_db in vector_dbs.values():
 99 |         results = vector_db.similarity_search_with_score(query, k=top_k)
100 |         all_results.extend(results)
101 | 
102 |     # Sort all results by score and take top_k
103 |     all_results.sort(key=lambda x: x[1])
104 |     top_results = all_results[:top_k]
105 |     
106 |     contents = " ".join([doc.page_content for doc, score in top_results])
107 | 
108 |     if len(contents) > 4000:
109 |         contents = contents[:4000]
110 | 
111 |     return contents
112 | 
113 | st.set_page_config(page_title="Open Notebook", page_icon="🤖", layout="wide", initial_sidebar_state="expanded")
114 | # Custom CSS for dark theme and improved UI
115 | st.markdown("""
116 | <style>
117 |     /* Global styles */
118 |     body {
119 |         color: #E0E0E0;
120 |         background-color: #1E1E1E;
121 |     }
122 |     
123 |     /* Sidebar styles */
124 |     .css-1d391kg {
125 |         background-color: #252526;
126 |     }
127 |     
128 |     /* Main content area styles */
129 |     .stApp {
130 |         background-color: #1E1E1E;
131 |     }
132 |     
133 |     /* Button styles */
134 |     .stButton>button {
135 |         color: #FFFFFF;
136 |         background-color: #007ACC;
137 |         border: none;
138 |         border-radius: 4px;
139 |         padding: 0.5rem 1rem;
140 |         font-weight: 500;
141 |     }
142 |     .stButton>button:hover {
143 |         background-color: #005A9E;
144 |     }
145 |     
146 |     /* Input field styles */
147 |     .stTextInput>div>div>input, .stTextArea textarea {
148 |         color: #D4D4D4;
149 |         background-color: #3C3C3C;
150 |         border: 1px solid #3C3C3C;
151 |         border-radius: 4px;
152 |     }
153 |     
154 |     /* Selectbox styles */
155 |     .stSelectbox>div>div>select {
156 |         color: #D4D4D4;
157 |         background-color: #3C3C3C;
158 |         border: 1px solid #3C3C3C;
159 |         border-radius: 4px;
160 |     }
161 |     
162 |     /* Slider styles */
163 |     .stSlider>div>div>div>div {
164 |         background-color: #007ACC;
165 |     }
166 |     
167 |     /* Expander styles */
168 |     .stExpander {
169 |         background-color: #252526;
170 |         border: 1px solid #3C3C3C;
171 |         border-radius: 4px;
172 |     }
173 |     .stExpander>summary {
174 |         color: #E0E0E0;
175 |         font-weight: 500;
176 |     }
177 |     
178 |     /* Chat message styles */
179 |     .stChatMessage {
180 |         background-color: #252526;
181 |         border-radius: 4px;
182 |         padding: 0.5rem;
183 |         margin-bottom: 0.5rem;
184 |     }
185 |     
186 |     /* Code block styles */
187 |     pre {
188 |         background-color: #1E1E1E;
189 |         border: 1px solid #3C3C3C;
190 |         border-radius: 4px;
191 |     }
192 | </style>
193 | """, unsafe_allow_html=True)
194 | 
195 | # Load settings
196 | if not (KBASE_DIR / "settings.json").exists():
197 |     default_settings = {
198 |         "model": "gpt-3.5-turbo",
199 |         "top_k": 3,
200 |         "chunk_size": 1500,
201 |         "chunk_overlap": 50,
202 |     }
203 |     with open(KBASE_DIR / "settings.json", "w") as settings_file:
204 |         json.dump(default_settings, settings_file)
205 | 
206 | with open(KBASE_DIR / "settings.json", "r") as settings_file:
207 |     settings = json.load(settings_file)
208 | 
209 | model = settings["model"]
210 | top_k = settings["top_k"]
211 | chunk_size = settings["chunk_size"]
212 | chunk_overlap = settings["chunk_overlap"]
213 | 
214 | text_splitter = RecursiveCharacterTextSplitter(
215 |     separators=['\n'], chunk_size=chunk_size, chunk_overlap=chunk_overlap)
216 | 
217 | with st.sidebar:
218 |     st.header("Configuration")
219 |     
220 |     # API Key Configuration
221 |     with st.expander("API Key", expanded=False):
222 |         load_dotenv()
223 |         local_api_key = os.getenv('OPENAI_API_KEY')
224 |         if local_api_key:
225 |             st.session_state.api_key = local_api_key
226 |             st.success("API Key loaded from environment")
227 |         else:
228 |             api_key_input = st.text_input(
229 |                 "OpenAI API Key", 
230 |                 type="password", 
231 |                 value=st.session_state.api_key,
232 |                 help="Enter your OpenAI API key here. You can get one from https://platform.openai.com/account/api-keys"
233 |             )
234 |             
235 |             if api_key_input:
236 |                 st.session_state.api_key = api_key_input
237 |                 if st.button("Save API Key"):
238 |                     with open('.env', 'w') as env_file:
239 |                         env_file.write(f'OPENAI_API_KEY={api_key_input}')
240 |                     st.success("API Key saved successfully!")
241 | 
242 |     # Settings
243 |     with st.expander("Advanced Settings", expanded=False):
244 |         #don't change model names "gpt-4o-mini", "gpt-4o" are 2024 new models
245 |         settings = {
246 |             "model": st.selectbox("AI Model", ["gpt-4o-mini", "gpt-4o"], index=1 if settings["model"] == "gpt-4o" else 0, help="Choose the AI model to use. GPT-4 is more capable but slower and more expensive."),
247 |             "top_k": st.slider("Number of relevant documents", 1, 10, settings["top_k"], help="Number of most relevant documents to retrieve for each query. Higher values may improve accuracy but increase processing time."),
248 |             "chunk_size": st.number_input("Chunk Size", min_value=500, max_value=5000, value=settings["chunk_size"], step=100, help="Size of each text chunk for processing."),
249 |             "chunk_overlap": st.number_input("Chunk Overlap", min_value=0, max_value=500, value=settings["chunk_overlap"], step=10, help="Number of overlapping characters between chunks."),
250 |         }
251 | 
252 |         if st.button("Save settings"):
253 |             with open(KBASE_DIR / "settings.json", "w") as settings_file:
254 |                 json.dump(settings, settings_file)
255 |             st.success("Settings saved successfully!")
256 | 
257 |     # URL Scraping
258 |     with st.expander("Add Websites to Knowledge Base", expanded=False):
259 |         urls = st.text_area("Enter website URLs (one per line)", height=100, help="Enter the URLs of websites you want to add to your knowledge base. The AI will scrape and learn from these websites.")
260 |         if st.button("Add Websites to Knowledge Base"):
261 |             if not st.session_state.api_key:
262 |                 st.error("Please enter your OpenAI API Key first")
263 |             elif not urls:
264 |                 st.error("Please enter at least one URL")
265 |             else:
266 |                 url_list = [url.strip() for url in urls.split('\n') if url.strip()]
267 | 
268 |                 with st.spinner("Reading websites..."):
269 |                     scraped_urls = scrape_urls_parallel(url_list, max_depth=2, min_content_length=100)
270 | 
271 |                 st.success(f"Successfully read {len(scraped_urls)} websites")
272 | 
273 |                 with st.spinner("Updating knowledge base..."):
274 |                     new_files = []
275 |                     for i, content in enumerate(scraped_urls):
276 |                         # Use index and URL as fallback title
277 |                         url = url_list[i] if i < len(url_list) else f"content_{i}"
278 |                         domain = re.sub(r'^https?://', '', url).split('/')[0]
279 |                         title = domain  # Using domain as title since we don't have title metadata
280 |                         sanitized_title = sanitize_filename(title)
281 |                         readable_title = generate_readable_filename(title)
282 |                         unique_id = uuid.uuid4().hex[:6]
283 |                         json_filename = f"{readable_title}_{unique_id}.json"
284 |                         json_path = JSON_DIR / json_filename
285 |                         with open(json_path, "w", encoding='utf-8') as jf:
286 |                             json.dump({
287 |                                 "url": url, 
288 |                                 "title": title, 
289 |                                 "content": content  # content is now directly the text string
290 |                             }, jf, ensure_ascii=False)
291 |                         new_files.append(json_filename)
292 |                     
293 |                     files = [f.name for f in JSON_DIR.glob("*.json") if f.name != "settings.json"]
294 |                     new_vdbs = setup_rag(chunk_size, chunk_overlap, files)
295 |                     if new_vdbs:
296 |                         st.session_state.vector_dbs = new_vdbs
297 |                         st.success("Knowledge base updated successfully!")
298 |                     else:
299 |                         st.error("Failed to update knowledge base. Please try again.")
300 | 
301 |                 st.rerun()
302 | 
303 |     # Add PDF Upload
304 |     with st.expander("Add PDFs to Knowledge Base", expanded=False):
305 |         uploaded_pdfs = st.file_uploader("Upload PDF files", type=["pdf"], accept_multiple_files=True, help="Upload PDF files to add to your knowledge base. The AI will read and learn from these documents.")
306 |         if st.button("Add PDFs to Knowledge Base"):
307 |             if not uploaded_pdfs:
308 |                 st.error("Please upload at least one PDF file.")
309 |             else:
310 |                 for pdf in uploaded_pdfs:
311 |                     try:
312 |                         pdf_reader = PyPDF2.PdfReader(pdf)
313 |                         text = ""
314 |                         for page in pdf_reader.pages:
315 |                             text += page.extract_text() or ""
316 |                         
317 |                         if len(text) < 100:
318 |                             st.warning(f"PDF '{pdf.name}' content is too short and was skipped.")
319 |                             continue
320 | 
321 |                         sanitized_name = sanitize_filename(pdf.name.replace('.pdf', ''))
322 |                         readable_name = generate_readable_filename(sanitized_name)
323 |                         unique_id = uuid.uuid4().hex[:6]
324 |                         json_filename = f"{readable_name}_{unique_id}.json"
325 |                         json_path = JSON_DIR / json_filename
326 |                         with open(json_path, "w", encoding='utf-8') as jf:
327 |                             json.dump({
328 |                                 "filename": pdf.name,
329 |                                 "title": sanitized_name,
330 |                                 "content": text
331 |                             }, jf, ensure_ascii=False)
332 | 
333 |                     except Exception as e:
334 |                         st.error(f"Failed to process PDF '{pdf.name}': {str(e)}")
335 |                 
336 |                 with st.spinner("Updating knowledge base..."):
337 |                     files = [f.name for f in JSON_DIR.glob("*.json") if f.name != "settings.json"]
338 |                     new_vdbs = setup_rag(chunk_size, chunk_overlap, files)
339 |                     if new_vdbs:
340 |                         st.session_state.vector_dbs = new_vdbs
341 |                         st.success("PDFs added to knowledge base successfully!")
342 |                     else:
343 |                         st.error("Failed to update knowledge base. Please try again.")
344 |                 
345 |                 st.rerun()
346 | 
347 |     # Add Text Input
348 |     with st.expander("Add Custom Text to Knowledge Base", expanded=False):
349 |         pasted_text = st.text_area("Enter or paste your text here:", height=200, help="Enter or paste any custom text you want to add to your knowledge base.")
350 |         custom_title = st.text_input("Title for the custom text", help="Provide a title to easily identify this custom text.")
351 |         if st.button("Add Text to Knowledge Base"):
352 |             if not pasted_text.strip():
353 |                 st.error("Please enter some text to add.")
354 |             elif len(pasted_text.strip()) < 100:
355 |                 st.error("The text is too short. Please enter at least 100 characters.")
356 |             elif not custom_title.strip():
357 |                 st.error("Please provide a title for the custom text.")
358 |             else:
359 |                 try:
360 |                     # Use the first five words of the text if title is not sufficiently descriptive
361 |                     if len(custom_title.split()) < 3:
362 |                         first_five = ' '.join(pasted_text.strip().split()[:5])
363 |                         custom_title = f"{custom_title} - {first_five}"
364 |                     
365 |                     sanitized_title = sanitize_filename(custom_title)
366 |                     readable_title = generate_readable_filename(custom_title)
367 |                     unique_id = uuid.uuid4().hex[:6]
368 |                     json_filename = f"{readable_title}_{unique_id}.json"
369 |                     json_path = JSON_DIR / json_filename
370 |                     with open(json_path, "w", encoding='utf-8') as jf:
371 |                         json.dump({
372 |                             "title": custom_title,
373 |                             "pasted_text": pasted_text
374 |                         }, jf, ensure_ascii=False)
375 | 
376 |                     with st.spinner("Updating knowledge base..."):
377 |                         files = [f.name for f in JSON_DIR.glob("*.json") if f.name != "settings.json"]
378 |                         new_vdbs = setup_rag(chunk_size, chunk_overlap, files)
379 |                         if new_vdbs:
380 |                             st.session_state.vector_dbs = new_vdbs
381 |                             st.success("Custom text added to knowledge base successfully!")
382 |                         else:
383 |                             st.error("Failed to update knowledge base. Please try again.")
384 |                     
385 |                     st.rerun()
386 | 
387 |                 except Exception as e:
388 |                     st.error(f"Failed to add custom text: {str(e)}")
389 | 
390 |     # Knowledge Base Management
391 |     with st.expander("Manage Knowledge Base", expanded=False):
392 |         st.subheader("Current Knowledge Base")
393 |         files = list(JSON_DIR.glob("*.json"))
394 |         
395 |         if "vector_dbs" not in st.session_state:
396 |             st.session_state.vector_dbs = setup_rag(chunk_size, chunk_overlap, [f.name for f in files if f.name != "settings.json"])
397 | 
398 |         displayed_items = set()
399 | 
400 |         for json_file in files:
401 |             if json_file.name in displayed_items:
402 |                 continue
403 |             try:
404 |                 with open(json_file, 'r', encoding='utf-8') as jf:
405 |                     data = json.load(jf)
406 |                 title = data.get('title') or data.get('filename') or "Untitled"
407 |             except Exception:
408 |                 title = json_file.stem
409 | 
410 |             col1, col2 = st.columns([4, 1])
411 |             with col1:
412 |                 st.write(title)
413 |             with col2:
414 |                 if st.button("Remove", key=f"remove_{json_file.name}"):
415 |                     try:
416 |                         faiss_file = FAISS_DIR / json_file.name.replace(".json", ".faiss")
417 |                         json_file.unlink()
418 |                         if faiss_file.exists():
419 |                             shutil.rmtree(faiss_file)
420 |                         st.success(f"Removed '{title}' from the knowledge base")
421 |                         st.session_state.vector_dbs.pop(json_file.name, None)
422 |                         st.rerun()
423 |                     except Exception as e:
424 |                         st.error(f"Error removing '{title}': {str(e)}")
425 |             displayed_items.add(json_file.name)
426 | 
427 |         if st.button("Clear Entire Knowledge Base"):
428 |             try:
429 |                 for json_file in files:
430 |                     json_file.unlink()
431 |                 for faiss_dir in FAISS_DIR.glob("*"):
432 |                     if faiss_dir.is_dir():
433 |                         shutil.rmtree(faiss_dir)
434 |                 st.session_state.vector_dbs = {}
435 |                 st.success("Knowledge base cleared successfully!")
436 |                 st.rerun()
437 |             except Exception as e:
438 |                 st.error(f"Error clearing knowledge base: {str(e)}")
439 | 
440 |     if st.button("Refresh"):
441 |         st.rerun()
442 | 
443 | client = OpenAI(api_key=st.session_state.api_key)
444 | 
445 | system_prompt = open('system_prompt.txt', 'r', encoding='utf-8').read()
446 | 
447 | if "messages" not in st.session_state:
448 |     st.session_state.messages = [
449 |         {"role": "system", "content": system_prompt}
450 |     ]
451 | 
452 | for message in st.session_state.messages:
453 |     if message["role"] != "system":
454 |         with st.chat_message(message["role"]):
455 |             st.markdown(message["content"])
456 | 
457 | if prompt := st.chat_input("Ask me anything about your knowledge base"):
458 |     if 'vector_dbs' not in st.session_state or not st.session_state.vector_dbs:
459 |         st.error("Your knowledge base is empty. Please add some content first.")
460 |     else:
461 |         st.session_state.messages.append(
462 |             {"role": "user", "content": prompt})
463 |         with st.chat_message("user"):
464 |             st.markdown(prompt)
465 | 
466 |         context = query_rag(prompt, st.session_state.vector_dbs, top_k)
467 | 
468 |         if context:
469 |             with st.chat_message("assistant"):
470 |                 stream = client.chat.completions.create(
471 |                     model=model,
472 |                     messages=[
473 |                         {"role": "system", "content": system_prompt},
474 |                         {"role": "user", "content": f"Answer the query '{prompt}' based on the following contents:\n{context}"}
475 |                     ],
476 |                     stream=True,
477 |                 )
478 |                 response = st.write_stream(stream)
479 |             st.session_state.messages.append(
480 |                 {"role": "assistant", "content": response})
481 | 
482 |             st.markdown(
483 |                 f"""<details><summary>Source Information</summary><p>{context}</p></details>""", unsafe_allow_html=True)
484 | 


--------------------------------------------------------------------------------
/knowledge_base/json/attention_is_all_you_need_64252a.json:
--------------------------------------------------------------------------------
1 | {"filename": "Attention Is All You Need.pdf", "title": "Attention_Is_All_You_Need", "content": "Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-\nto-German translation task, improving over the existing best results, including\nensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,\nour model establishes a new single-model state-of-the-art BLEU score of 41.8 after\ntraining for 3.5 days on eight GPUs, a small fraction of the training costs of the\nbest models from the literature. We show that the Transformer generalizes well to\nother tasks by applying it successfully to English constituency parsing both with\nlarge and limited training data.\n∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started\nthe effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and\nhas been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head\nattention and the parameter-free position representation and became the other person involved in nearly every\ndetail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and\ntensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and\nefficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and\nimplementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating\nour research.\n†Work performed while at Google Brain.\n‡Work performed while at Google Research.\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.arXiv:1706.03762v7  [cs.CL]  2 Aug 20231 Introduction\nRecurrent neural networks, long short-term memory [ 13] and gated recurrent [ 7] neural networks\nin particular, have been firmly established as state of the art approaches in sequence modeling and\ntransduction problems such as language modeling and machine translation [ 35,2,5]. Numerous\nefforts have since continued to push the boundaries of recurrent language models and encoder-decoder\narchitectures [38, 24, 15].\nRecurrent models typically factor computation along the symbol positions of the input and output\nsequences. Aligning the positions to steps in computation time, they generate a sequence of hidden\nstates ht, as a function of the previous hidden state ht−1and the input for position t. This inherently\nsequential nature precludes parallelization within training examples, which becomes critical at longer\nsequence lengths, as memory constraints limit batching across examples. Recent work has achieved\nsignificant improvements in computational efficiency through factorization tricks [ 21] and conditional\ncomputation [ 32], while also improving model performance in case of the latter. The fundamental\nconstraint of sequential computation, however, remains.\nAttention mechanisms have become an integral part of compelling sequence modeling and transduc-\ntion models in various tasks, allowing modeling of dependencies without regard to their distance in\nthe input or output sequences [ 2,19]. In all but a few cases [ 27], however, such attention mechanisms\nare used in conjunction with a recurrent network.\nIn this work we propose the Transformer, a model architecture eschewing recurrence and instead\nrelying entirely on an attention mechanism to draw global dependencies between input and output.\nThe Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.\n2 Background\nThe goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n[16], ByteNet [ 18] and ConvS2S [ 9], all of which use convolutional neural networks as basic building\nblock, computing hidden representations in parallel for all input and output positions. In these models,\nthe number of operations required to relate signals from two arbitrary input or output positions grows\nin the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes\nit more difficult to learn dependencies between distant positions [ 12]. In the Transformer this is\nreduced to a constant number of operations, albeit at the cost of reduced effective resolution due\nto averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as\ndescribed in section 3.2.\nSelf-attention, sometimes called intra-attention is an attention mechanism relating different positions\nof a single sequence in order to compute a representation of the sequence. Self-attention has been\nused successfully in a variety of tasks including reading comprehension, abstractive summarization,\ntextual entailment and learning task-independent sentence representations [4, 27, 28, 22].\nEnd-to-end memory networks are based on a recurrent attention mechanism instead of sequence-\naligned recurrence and have been shown to perform well on simple-language question answering and\nlanguage modeling tasks [34].\nTo the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attention to compute representations of its input and output without using sequence-\naligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate\nself-attention and discuss its advantages over models such as [17, 18] and [9].\n3 Model Architecture\nMost competitive neural sequence transduction models have an encoder-decoder structure [ 5,2,35].\nHere, the encoder maps an input sequence of symbol representations (x1, ..., x n)to a sequence\nof continuous representations z= (z1, ..., z n). Given z, the decoder then generates an output\nsequence (y1, ..., y m)of symbols one element at a time. At each step the model is auto-regressive\n[10], consuming the previously generated symbols as additional input when generating the next.\n2Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N= 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [ 11] around each of\nthe two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is\nLayerNorm( x+ Sublayer( x)), where Sublayer( x)is the function implemented by the sub-layer\nitself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimension dmodel = 512 .\nDecoder: The decoder is also composed of a stack of N= 6identical layers. In addition to the two\nsub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head\nattention over the output of the encoder stack. Similar to the encoder, we employ residual connections\naround each of the sub-layers, followed by layer normalization. We also modify the self-attention\nsub-layer in the decoder stack to prevent positions from attending to subsequent positions. This\nmasking, combined with fact that the output embeddings are offset by one position, ensures that the\npredictions for position ican depend only on the known outputs at positions less than i.\n3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3Scaled Dot-Product Attention\n Multi-Head Attention\nFigure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several\nattention layers running in parallel.\nof the values, where the weight assigned to each value is computed by a compatibility function of the\nquery with the corresponding key.\n3.2.1 Scaled Dot-Product Attention\nWe call our particular attention \"Scaled Dot-Product Attention\" (Figure 2). The input consists of\nqueries and keys of dimension dk, and values of dimension dv. We compute the dot products of the\nquery with all keys, divide each by√dk, and apply a softmax function to obtain the weights on the\nvalues.\nIn practice, we compute the attention function on a set of queries simultaneously, packed together\ninto a matrix Q. The keys and values are also packed together into matrices KandV. We compute\nthe matrix of outputs as:\nAttention( Q, K, V ) = softmax(QKT\n√dk)V (1)\nThe two most commonly used attention functions are additive attention [ 2], and dot-product (multi-\nplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor\nof1√dk. Additive attention computes the compatibility function using a feed-forward network with\na single hidden layer. While the two are similar in theoretical complexity, dot-product attention is\nmuch faster and more space-efficient in practice, since it can be implemented using highly optimized\nmatrix multiplication code.\nWhile for small values of dkthe two mechanisms perform similarly, additive attention outperforms\ndot product attention without scaling for larger values of dk[3]. We suspect that for large values of\ndk, the dot products grow large in magnitude, pushing the softmax function into regions where it has\nextremely small gradients4. To counteract this effect, we scale the dot products by1√dk.\n3.2.2 Multi-Head Attention\nInstead of performing a single attention function with dmodel-dimensional keys, values and queries,\nwe found it beneficial to linearly project the queries, keys and values htimes with different, learned\nlinear projections to dk,dkanddvdimensions, respectively. On each of these projected versions of\nqueries, keys and values we then perform the attention function in parallel, yielding dv-dimensional\n4To illustrate why the dot products get large, assume that the components of qandkare independent random\nvariables with mean 0and variance 1. Then their dot product, q·k=Pdk\ni=1qiki, has mean 0and variance dk.\n4output values. These are concatenated and once again projected, resulting in the final values, as\ndepicted in Figure 2.\nMulti-head attention allows the model to jointly attend to information from different representation\nsubspaces at different positions. With a single attention head, averaging inhibits this.\nMultiHead( Q, K, V ) = Concat(head 1, ...,head h)WO\nwhere head i= Attention( QWQ\ni, KWK\ni, V WV\ni)\nWhere the projections are parameter matrices WQ\ni∈Rdmodel×dk,WK\ni∈Rdmodel×dk,WV\ni∈Rdmodel×dv\nandWO∈Rhdv×dmodel.\nIn this work we employ h= 8 parallel attention layers, or heads. For each of these we use\ndk=dv=dmodel/h= 64 . Due to the reduced dimension of each head, the total computational cost\nis similar to that of single-head attention with full dimensionality.\n3.2.3 Applications of Attention in our Model\nThe Transformer uses multi-head attention in three different ways:\n•In \"encoder-decoder attention\" layers, the queries come from the previous decoder layer,\nand the memory keys and values come from the output of the encoder. This allows every\nposition in the decoder to attend over all positions in the input sequence. This mimics the\ntypical encoder-decoder attention mechanisms in sequence-to-sequence models such as\n[38, 2, 9].\n•The encoder contains self-attention layers. In a self-attention layer all of the keys, values\nand queries come from the same place, in this case, the output of the previous layer in the\nencoder. Each position in the encoder can attend to all positions in the previous layer of the\nencoder.\n•Similarly, self-attention layers in the decoder allow each position in the decoder to attend to\nall positions in the decoder up to and including that position. We need to prevent leftward\ninformation flow in the decoder to preserve the auto-regressive property. We implement this\ninside of scaled dot-product attention by masking out (setting to −∞) all values in the input\nof the softmax which correspond to illegal connections. See Figure 2.\n3.3 Position-wise Feed-Forward Networks\nIn addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully\nconnected feed-forward network, which is applied to each position separately and identically. This\nconsists of two linear transformations with a ReLU activation in between.\nFFN( x) = max(0 , xW 1+b1)W2+b2 (2)\nWhile the linear transformations are the same across different positions, they use different parameters\nfrom layer to layer. Another way of describing this is as two convolutions with kernel size 1.\nThe dimensionality of input and output is dmodel = 512 , and the inner-layer has dimensionality\ndff= 2048 .\n3.4 Embeddings and Softmax\nSimilarly to other sequence transduction models, we use learned embeddings to convert the input\ntokens and output tokens to vectors of dimension dmodel. We also use the usual learned linear transfor-\nmation and softmax function to convert the decoder output to predicted next-token probabilities. In\nour model, we share the same weight matrix between the two embedding layers and the pre-softmax\nlinear transformation, similar to [ 30]. In the embedding layers, we multiply those weights by√dmodel.\n5Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations\nfor different layer types. nis the sequence length, dis the representation dimension, kis the kernel\nsize of convolutions and rthe size of the neighborhood in restricted self-attention.\nLayer Type Complexity per Layer Sequential Maximum Path Length\nOperations\nSelf-Attention O(n2·d) O(1) O(1)\nRecurrent O(n·d2) O(n) O(n)\nConvolutional O(k·n·d2) O(1) O(logk(n))\nSelf-Attention (restricted) O(r·n·d) O(1) O(n/r)\n3.5 Positional Encoding\nSince our model contains no recurrence and no convolution, in order for the model to make use of the\norder of the sequence, we must inject some information about the relative or absolute position of the\ntokens in the sequence. To this end, we add \"positional encodings\" to the input embeddings at the\nbottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel\nas the embeddings, so that the two can be summed. There are many choices of positional encodings,\nlearned and fixed [9].\nIn this work, we use sine and cosine functions of different frequencies:\nPE(pos,2i)=sin(pos/100002i/d model)\nPE(pos,2i+1)=cos(pos/100002i/d model)\nwhere posis the position and iis the dimension. That is, each dimension of the positional encoding\ncorresponds to a sinusoid. The wavelengths form a geometric progression from 2πto10000 ·2π. We\nchose this function because we hypothesized it would allow the model to easily learn to attend by\nrelative positions, since for any fixed offset k,PEpos+kcan be represented as a linear function of\nPEpos.\nWe also experimented with using learned positional embeddings [ 9] instead, and found that the two\nversions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version\nbecause it may allow the model to extrapolate to sequence lengths longer than the ones encountered\nduring training.\n4 Why Self-Attention\nIn this section we compare various aspects of self-attention layers to the recurrent and convolu-\ntional layers commonly used for mapping one variable-length sequence of symbol representations\n(x1, ..., x n)to another sequence of equal length (z1, ..., z n), with xi, zi∈Rd, such as a hidden\nlayer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we\nconsider three desiderata.\nOne is the total computational complexity per layer. Another is the amount of computation that can\nbe parallelized, as measured by the minimum number of sequential operations required.\nThe third is the path length between long-range dependencies in the network. Learning long-range\ndependencies is a key challenge in many sequence transduction tasks. One key factor affecting the\nability to learn such dependencies is the length of the paths forward and backward signals have to\ntraverse in the network. The shorter these paths between any combination of positions in the input\nand output sequences, the easier it is to learn long-range dependencies [ 12]. Hence we also compare\nthe maximum path length between any two input and output positions in networks composed of the\ndifferent layer types.\nAs noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially\nexecuted operations, whereas a recurrent layer requires O(n)sequential operations. In terms of\ncomputational complexity, self-attention layers are faster than recurrent layers when the sequence\n6length nis smaller than the representation dimensionality d, which is most often the case with\nsentence representations used by state-of-the-art models in machine translations, such as word-piece\n[38] and byte-pair [ 31] representations. To improve computational performance for tasks involving\nvery long sequences, self-attention could be restricted to considering only a neighborhood of size rin\nthe input sequence centered around the respective output position. This would increase the maximum\npath length to O(n/r). We plan to investigate this approach further in future work.\nA single convolutional layer with kernel width k < n does not connect all pairs of input and output\npositions. Doing so requires a stack of O(n/k)convolutional layers in the case of contiguous kernels,\norO(logk(n))in the case of dilated convolutions [ 18], increasing the length of the longest paths\nbetween any two positions in the network. Convolutional layers are generally more expensive than\nrecurrent layers, by a factor of k. Separable convolutions [ 6], however, decrease the complexity\nconsiderably, to O(k·n·d+n·d2). Even with k=n, however, the complexity of a separable\nconvolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer,\nthe approach we take in our model.\nAs side benefit, self-attention could yield more interpretable models. We inspect attention distributions\nfrom our models and present and discuss examples in the appendix. Not only do individual attention\nheads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic\nand semantic structure of the sentences.\n5 Training\nThis section describes the training regime for our models.\n5.1 Training Data and Batching\nWe trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million\nsentence pairs. Sentences were encoded using byte-pair encoding [ 3], which has a shared source-\ntarget vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT\n2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece\nvocabulary [ 38]. Sentence pairs were batched together by approximate sequence length. Each training\nbatch contained a set of sentence pairs containing approximately 25000 source tokens and 25000\ntarget tokens.\n5.2 Hardware and Schedule\nWe trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using\nthe hyperparameters described throughout the paper, each training step took about 0.4 seconds. We\ntrained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the\nbottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps\n(3.5 days).\n5.3 Optimizer\nWe used the Adam optimizer [ 20] with β1= 0.9,β2= 0.98andϵ= 10−9. We varied the learning\nrate over the course of training, according to the formula:\nlrate =d−0.5\nmodel·min(step_num−0.5, step _num·warmup _steps−1.5) (3)\nThis corresponds to increasing the learning rate linearly for the first warmup _steps training steps,\nand decreasing it thereafter proportionally to the inverse square root of the step number. We used\nwarmup _steps = 4000 .\n5.4 Regularization\nWe employ three types of regularization during training:\n7Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the\nEnglish-to-German and English-to-French newstest2014 tests at a fraction of the training cost.\nModelBLEU Training Cost (FLOPs)\nEN-DE EN-FR EN-DE EN-FR\nByteNet [18] 23.75\nDeep-Att + PosUnk [39] 39.2 1.0·1020\nGNMT + RL [38] 24.6 39.92 2.3·10191.4·1020\nConvS2S [9] 25.16 40.46 9.6·10181.5·1020\nMoE [32] 26.03 40.56 2.0·10191.2·1020\nDeep-Att + PosUnk Ensemble [39] 40.4 8.0·1020\nGNMT + RL Ensemble [38] 26.30 41.16 1.8·10201.1·1021\nConvS2S Ensemble [9] 26.36 41.29 7.7·10191.2·1021\nTransformer (base model) 27.3 38.1 3.3·1018\nTransformer (big) 28.4 41.8 2.3·1019\nResidual Dropout We apply dropout [ 33] to the output of each sub-layer, before it is added to the\nsub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the\npositional encodings in both the encoder and decoder stacks. For the base model, we use a rate of\nPdrop= 0.1.\nLabel Smoothing During training, we employed label smoothing of value ϵls= 0.1[36]. This\nhurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.\n6 Results\n6.1 Machine Translation\nOn the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)\nin Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0\nBLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is\nlisted in the bottom line of Table 3. Training took 3.5days on 8P100 GPUs. Even our base model\nsurpasses all previously published models and ensembles, at a fraction of the training cost of any of\nthe competitive models.\nOn the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0,\noutperforming all of the previously published single models, at less than 1/4the training cost of the\nprevious state-of-the-art model. The Transformer (big) model trained for English-to-French used\ndropout rate Pdrop= 0.1, instead of 0.3.\nFor the base models, we used a single model obtained by averaging the last 5 checkpoints, which\nwere written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We\nused beam search with a beam size of 4and length penalty α= 0.6[38]. These hyperparameters\nwere chosen after experimentation on the development set. We set the maximum output length during\ninference to input length + 50, but terminate early when possible [38].\nTable 2 summarizes our results and compares our translation quality and training costs to other model\narchitectures from the literature. We estimate the number of floating point operations used to train a\nmodel by multiplying the training time, the number of GPUs used, and an estimate of the sustained\nsingle-precision floating-point capacity of each GPU5.\n6.2 Model Variations\nTo evaluate the importance of different components of the Transformer, we varied our base model\nin different ways, measuring the change in performance on English-to-German translation on the\n5We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.\n8Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base\nmodel. All metrics are on the English-to-German translation development set, newstest2013. Listed\nperplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to\nper-word perplexities.\nN d model dff h d k dvPdrop ϵlstrain PPL BLEU params\nsteps (dev) (dev) ×106\nbase 6 512 2048 8 64 64 0.1 0.1 100K 4.92 25.8 65\n(A)1 512 512 5.29 24.9\n4 128 128 5.00 25.5\n16 32 32 4.91 25.8\n32 16 16 5.01 25.4\n(B)16 5.16 25.1 58\n32 5.01 25.4 60\n(C)2 6.11 23.7 36\n4 5.19 25.3 50\n8 4.88 25.5 80\n256 32 32 5.75 24.5 28\n1024 128 128 4.66 26.0 168\n1024 5.12 25.4 53\n4096 4.75 26.2 90\n(D)0.0 5.77 24.6\n0.2 4.95 25.5\n0.0 4.67 25.3\n0.2 5.47 25.7\n(E) positional embedding instead of sinusoids 4.92 25.7\nbig 6 1024 4096 16 0.3 300K 4.33 26.4 213\ndevelopment set, newstest2013. We used beam search as described in the previous section, but no\ncheckpoint averaging. We present these results in Table 3.\nIn Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions,\nkeeping the amount of computation constant, as described in Section 3.2.2. While single-head\nattention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.\nIn Table 3 rows (B), we observe that reducing the attention key size dkhurts model quality. This\nsuggests that determining compatibility is not easy and that a more sophisticated compatibility\nfunction than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected,\nbigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our\nsinusoidal positional encoding with learned positional embeddings [ 9], and observe nearly identical\nresults to the base model.\n6.3 English Constituency Parsing\nTo evaluate if the Transformer can generalize to other tasks we performed experiments on English\nconstituency parsing. This task presents specific challenges: the output is subject to strong structural\nconstraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence\nmodels have not been able to attain state-of-the-art results in small-data regimes [37].\nWe trained a 4-layer transformer with dmodel = 1024 on the Wall Street Journal (WSJ) portion of the\nPenn Treebank [ 25], about 40K training sentences. We also trained it in a semi-supervised setting,\nusing the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences\n[37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens\nfor the semi-supervised setting.\nWe performed only a small number of experiments to select the dropout, both attention and residual\n(section 5.4), learning rates and beam size on the Section 22 development set, all other parameters\nremained unchanged from the English-to-German base translation model. During inference, we\n9Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23\nof WSJ)\nParser Training WSJ 23 F1\nVinyals & Kaiser el al. (2014) [37] WSJ only, discriminative 88.3\nPetrov et al. (2006) [29] WSJ only, discriminative 90.4\nZhu et al. (2013) [40] WSJ only, discriminative 90.4\nDyer et al. (2016) [8] WSJ only, discriminative 91.7\nTransformer (4 layers) WSJ only, discriminative 91.3\nZhu et al. (2013) [40] semi-supervised 91.3\nHuang & Harper (2009) [14] semi-supervised 91.3\nMcClosky et al. (2006) [26] semi-supervised 92.1\nVinyals & Kaiser el al. (2014) [37] semi-supervised 92.1\nTransformer (4 layers) semi-supervised 92.7\nLuong et al. (2015) [23] multi-task 93.0\nDyer et al. (2016) [8] generative 93.3\nincreased the maximum output length to input length + 300. We used a beam size of 21andα= 0.3\nfor both WSJ only and the semi-supervised setting.\nOur results in Table 4 show that despite the lack of task-specific tuning our model performs sur-\nprisingly well, yielding better results than all previously reported models with the exception of the\nRecurrent Neural Network Grammar [8].\nIn contrast to RNN sequence-to-sequence models [ 37], the Transformer outperforms the Berkeley-\nParser [29] even when training only on the WSJ training set of 40K sentences.\n7 Conclusion\nIn this work, we presented the Transformer, the first sequence transduction model based entirely on\nattention, replacing the recurrent layers most commonly used in encoder-decoder architectures with\nmulti-headed self-attention.\nFor translation tasks, the Transformer can be trained significantly faster than architectures based\non recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014\nEnglish-to-French translation tasks, we achieve a new state of the art. In the former task our best\nmodel outperforms even all previously reported ensembles.\nWe are excited about the future of attention-based models and plan to apply them to other tasks. We\nplan to extend the Transformer to problems involving input and output modalities other than text and\nto investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs\nsuch as images, audio and video. Making generation less sequential is another research goals of ours.\nThe code we used to train and evaluate our models is available at https://github.com/\ntensorflow/tensor2tensor .\nAcknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful\ncomments, corrections and inspiration.\nReferences\n[1]Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint\narXiv:1607.06450 , 2016.\n[2]Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\nlearning to align and translate. CoRR , abs/1409.0473, 2014.\n[3]Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V . Le. Massive exploration of neural\nmachine translation architectures. CoRR , abs/1703.03906, 2017.\n[4]Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine\nreading. arXiv preprint arXiv:1601.06733 , 2016.\n10[5]Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,\nand Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical\nmachine translation. CoRR , abs/1406.1078, 2014.\n[6]Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv\npreprint arXiv:1610.02357 , 2016.\n[7]Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation\nof gated recurrent neural networks on sequence modeling. CoRR , abs/1412.3555, 2014.\n[8]Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural\nnetwork grammars. In Proc. of NAACL , 2016.\n[9]Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolu-\ntional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2 , 2017.\n[10] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint\narXiv:1308.0850 , 2013.\n[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im-\nage recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition , pages 770–778, 2016.\n[12] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in\nrecurrent nets: the difficulty of learning long-term dependencies, 2001.\n[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation ,\n9(8):1735–1780, 1997.\n[14] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations\nacross languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural\nLanguage Processing , pages 832–841. ACL, August 2009.\n[15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring\nthe limits of language modeling. arXiv preprint arXiv:1602.02410 , 2016.\n[16] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural\nInformation Processing Systems, (NIPS) , 2016.\n[17] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference\non Learning Representations (ICLR) , 2016.\n[18] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Ko-\nray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2 ,\n2017.\n[19] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks.\nInInternational Conference on Learning Representations , 2017.\n[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR , 2015.\n[21] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint\narXiv:1703.10722 , 2017.\n[22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen\nZhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint\narXiv:1703.03130 , 2017.\n[23] Minh-Thang Luong, Quoc V . Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task\nsequence to sequence learning. arXiv preprint arXiv:1511.06114 , 2015.\n[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-\nbased neural machine translation. arXiv preprint arXiv:1508.04025 , 2015.\n11[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated\ncorpus of english: The penn treebank. Computational linguistics , 19(2):313–330, 1993.\n[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In\nProceedings of the Human Language Technology Conference of the NAACL, Main Conference ,\npages 152–159. ACL, June 2006.\n[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention\nmodel. In Empirical Methods in Natural Language Processing , 2016.\n[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive\nsummarization. arXiv preprint arXiv:1705.04304 , 2017.\n[29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact,\nand interpretable tree annotation. In Proceedings of the 21st International Conference on\nComputational Linguistics and 44th Annual Meeting of the ACL , pages 433–440. ACL, July\n2006.\n[30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv\npreprint arXiv:1608.05859 , 2016.\n[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words\nwith subword units. arXiv preprint arXiv:1508.07909 , 2015.\n[32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,\nand Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts\nlayer. arXiv preprint arXiv:1701.06538 , 2017.\n[33] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdi-\nnov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine\nLearning Research , 15(1):1929–1958, 2014.\n[34] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory\nnetworks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,\nAdvances in Neural Information Processing Systems 28 , pages 2440–2448. Curran Associates,\nInc., 2015.\n[35] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural\nnetworks. In Advances in Neural Information Processing Systems , pages 3104–3112, 2014.\n[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna.\nRethinking the inception architecture for computer vision. CoRR , abs/1512.00567, 2015.\n[37] Vinyals & Kaiser, Koo, Petrov, Sutskever, and Hinton. Grammar as a foreign language. In\nAdvances in Neural Information Processing Systems , 2015.\n[38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang\nMacherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine\ntranslation system: Bridging the gap between human and machine translation. arXiv preprint\narXiv:1609.08144 , 2016.\n[39] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with\nfast-forward connections for neural machine translation. CoRR , abs/1606.04199, 2016.\n[40] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate\nshift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the ACL (Volume\n1: Long Papers) , pages 434–443. ACL, August 2013.\n12Attention Visualizations\nInput-Input Layer5\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nFigure 3: An example of the attention mechanism following long-distance dependencies in the\nencoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of\nthe verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for\nthe word ‘making’. Different colors represent different heads. Best viewed in color.\n13Input-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nInput-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top:\nFull attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5\nand 6. Note that the attentions are very sharp for this word.\n14Input-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nInput-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the\nsentence. We give two such examples above, from two different heads from the encoder self-attention\nat layer 5 of 6. The heads clearly learned to perform different tasks.\n15"}


--------------------------------------------------------------------------------
/knowledge_base/json/2305.16291v2_f49cd5.json:
--------------------------------------------------------------------------------
1 | {"filename": "2305.16291v2.pdf", "title": "2305.16291v2", "content": "VOYAGER : An Open-Ended Embodied Agent\nwith Large Language Models\nGuanzhi Wang1 2/envel⌢pe, Yuqi Xie3, Yunfan Jiang4∗, Ajay Mandlekar1∗,\nChaowei Xiao1 5, Yuke Zhu1 3, Linxi “Jim” Fan1†/envel⌢pe, Anima Anandkumar1 2†\n1NVIDIA,2Caltech,3UT Austin,4Stanford,5UW Madison\n∗Equal contribution†Equal advising/envel⌢peCorresponding authors\nhttps://voyager.minedojo.org\nAbstract\nWe introduce VOYAGER , the first LLM-powered embodied lifelong learning agent\nin Minecraft that continuously explores the world, acquires diverse skills, and\nmakes novel discoveries without human intervention. V OYAGER consists of three\nkey components: 1) an automatic curriculum that maximizes exploration, 2) an\never-growing skill library of executable code for storing and retrieving complex\nbehaviors, and 3) a new iterative prompting mechanism that incorporates environ-\nment feedback, execution errors, and self-verification for program improvement.\nVOYAGER interacts with GPT-4 via blackbox queries, which bypasses the need for\nmodel parameter fine-tuning. The skills developed by VOYAGER are temporally\nextended, interpretable, and compositional, which compounds the agent’s abilities\nrapidly and alleviates catastrophic forgetting. Empirically, VOYAGER shows\nstrong in-context lifelong learning capability and exhibits exceptional proficiency\nin playing Minecraft. It obtains 3.3×more unique items, travels 2.3×longer\ndistances, and unlocks key tech tree milestones up to 15.3×faster than prior SOTA.\nVOYAGER is able to utilize the learned skill library in a new Minecraft world to\nsolve novel tasks from scratch, while other techniques struggle to generalize.\nFigure 1: VOYAGER discovers new Minecraft items and skills continually by self-driven exploration,\nsignificantly outperforming the baselines. X-axis denotes the number of prompting iterations.\n1arXiv:2305.16291v2  [cs.AI]  19 Oct 2023M i n e  W o o d   L o g\nM a k e  C r a f t i n g  T a b l e\nC r a f t  S t o n e  S w o r d\nC r a f t  S h i e l d\nM a k e  F u r n a c e\nC o o k  S t e a k\nC o m b a t  Z o m b i e     M i n e  W o o d  L o gM a k e  C r a f t i n g  T a b l eC o m b a t  \nZ o m b i e\nM i n e  D i a m o n d\nN e w  \nT a s kC o d e  a s  \nA c t i o n sR e f i n e  P r o g r a mE n v  F e e d b a c k\nE x e c u t i o n  E r r o r sU p d a t e  \nE x p l o r a t i o n  \nP r o g r e s sS k i l l  \nR e t r i e v a l\nA d d  N e w  S k i l lA u t o m a t i c  C u r r i c u l u mI t e r a t i v e  P r o m p t i n g  M e c h a n i s mS k i l l  L i b r a r y\nE n v i r o n m e n tS e l f - V e r i f i c a t i o n\nFigure 2: VOYAGER consists of three key components: an automatic curriculum for open-ended\nexploration, a skill library for increasingly complex behaviors, and an iterative prompting mechanism\nthat uses code as action space.\n1 Introduction\nBuilding generally capable embodied agents that continuously explore, plan, and develop new skills\nin open-ended worlds is a grand challenge for the AI community [ 1–5]. Classical approaches\nemploy reinforcement learning (RL) [ 6,7] and imitation learning [ 8–10] that operate on primitive\nactions, which could be challenging for systematic exploration [ 11–15], interpretability [ 16–18], and\ngeneralization [ 19–21]. Recent advances in large language model (LLM) based agents harness the\nworld knowledge encapsulated in pre-trained LLMs to generate consistent action plans or executable\npolicies [ 16,22,19]. They are applied to embodied tasks like games and robotics [ 23–27], as well as\nNLP tasks without embodiment [ 28–30]. However, these agents are not lifelong learners that can\nprogressively acquire, update, accumulate, and transfer knowledge over extended time spans [ 31,32].\nLet us consider Minecraft as an example. Unlike most other games studied in AI [ 33,34,10],\nMinecraft does not impose a predefined end goal or a fixed storyline but rather provides a unique\nplayground with endless possibilities [ 23]. Minecraft requires players to explore vast, procedurally\ngenerated 3D terrains and unlock a tech tree using gathered resources. Human players typically start\nby learning the basics, such as mining wood and cooking food, before advancing to more complex\ntasks like combating monsters and crafting diamond tools. We argue that an effective lifelong learning\nagent should have similar capabilities as human players: (1) propose suitable tasks based on its\ncurrent skill level and world state, e.g., learn to harvest sand and cactus before iron if it finds itself in\na desert rather than a forest; (2) refine skills based on environmental feedback and commit mastered\nskills to memory for future reuse in similar situations (e.g. fighting zombies is similar to fighting\nspiders); (3) continually explore the world and seek out new tasks in a self-driven manner.\nTowards these goals, we introduce VOYAGER , the first LLM-powered embodied lifelong learning\nagent to drive exploration, master a wide range of skills, and make new discoveries continually\nwithout human intervention in Minecraft. VOYAGER is made possible through three key modules\n(Fig. 2): 1) an automatic curriculum that maximizes exploration; 2) a skill library for storing\nand retrieving complex behaviors; and 3) a new iterative prompting mechanism that generates\nexecutable code for embodied control. We opt to use code as the action space instead of low-level\nmotor commands because programs can naturally represent temporally extended and compositional\nactions [ 16,22], which are essential for many long-horizon tasks in Minecraft. VOYAGER interacts\nwith a blackbox LLM (GPT-4 [ 35]) through prompting and in-context learning [ 36–38]. Our approach\nbypasses the need for model parameter access and explicit gradient-based training or finetuning.\nMore specifically, VOYAGER attempts to solve progressively harder tasks proposed by the automatic\ncurriculum , which takes into account the exploration progress and the agent’s state. The curriculum\nis generated by GPT-4 based on the overarching goal of “discovering as many diverse things as\npossible”. This approach can be perceived as an in-context form of novelty search [39,40].VOYAGER\nincrementally builds a skill library by storing the action programs that help solve a task successfully.\n2I n v e n t o r y  ( 5 / 3 6 ) :  { ' o a k _ p l a n k s ' :  3 ,  ' s t i c k ' :  \n4 ,  ' c r a f t i n g _ t a b l e ' :  1 ,  ' s t o n e ' :  3 ,  \n' w o o d e n _ p i c k a x e ' :  1 }B i o m e :  r i v e r\nI n v e n t o r y  ( 4 / 3 6 ) :  { ' s u g a r _ c a n e ' :  3 ,  ' p a p e r ' :  \n3 ,  ' f i s h i n g _ r o d ' :  1 ,  ' w o o d e n _ p i c k a x e ' :  1 }N e a r b y  e n t i t i e s :  p i g ,  c a t ,  v i l l a g e r \r\nH e a l th :  1 2 / 2 0 \r\nH u n g e r :  0 / 2 0I n v e n t o r y  ( 6 / 3 6 ) :  { ' f u r n a c e ' :  1 ,  \n' s t o n e _ p i c k a x e ' :  1 ,  ' o a k _ p l a n k s ' :  7 ,  \n' c r a f t i n g _ t a b l e ' :  1 ,  ' r a w _ i r o n ' :  4 ,  ' c o a l ' :  1 } \rT i m e :  n i g h t\nN e a r b y  e n t i t i e s :  z o m b i e\nE q u i p m e n t :  ' s t o n e _ s w o r d ' ,  ' s h i e l d 'R e a s o n i n g :  S i n c e  y o u  h a v e  a  w o o d e n  p i c k a x e  a n d  s o m e  s t o n e s ,  i t  w o u l d  b e  \nb e n e f i c i a l  t o  u p g r a d e  y o u r  p i c k a x e  t o  a  s t o n e  p i c k a x e  f o r  b e t t e r  e f f i c i e n c y . \r\nT a s k :  C r a f t  1  s t o n e  p i c k a x e .R e a s o n i n g :  S i n c e  y o u  h a v e  a  f i s h i n g  r o d  i n  y o u r  i n v e n t o r y  a n d  y o u  a r e  n e a r  a  r i v e r  \nb i o m e ,  i t  w o u l d  b e  a  g o o d  o p p o r t u n i t y  t o  c a t c h  s o m e  f i s h  f o r  f o o d  a n d  e x p e r i e n c e .  \nT a s k :  C a t c h  1  f i s h .R e a s o n i n g :  Y o u r  h u n g e r  i s  a t  0 ,  w h i c h  m e a n s  y o u  n e e d  t o  f i n d  f o o d  t o  r e p l e n i s h  \ny o u r  h u n g e r .  S i n c e  th e r e  a r e  p i g s  n e a r b y ,  y o u  c a n  k i l l  o n e  t o  o b t a i n  r a w  p o r k c h o p s .\nT a s k :  K i l l  1  p i g .R e a s o n i n g :  Y o u  h a v e  r a w  i r o n  a n d  c o a l ,  a n d  y o u  h a v e  a  f u r n a c e .  I t ' s  t i m e  t o  s m e l t  \nth e  i r o n  t o  m a k e  i r o n  i n g o t s ,  w h i c h  c a n  b e  u s e d  t o  c r a f t  b e t t e r  t o o l s  a n d  a r m o r . \r\nT a s k :  S m e l t  4  r a w  i r o n .R e a s o n i n g :  S i n c e  i t ' s  n i g h t  a n d  th e r e ' s  a  z o m b i e  n e a r b y ,  i t ' s  a  g o o d  o p p o r t u n i t y  t o  \nt r y  k i l l i n g  th e  z o m b i e  n o w  th a t  y o u  h a v e  a  s t o n e  s w o r d  a n d  a  s h i e l d  e q u i p p e d .\nT a s k :  K i l l  1  z o m b i e .G P T - 4G P T - 4G P T - 4G P T - 4G P T - 4Figure 3: Tasks proposed by the automatic curriculum. We only display the partial prompt for brevity.\nSee Appendix, Sec. A.3 for the full prompt structure.\nEach program is indexed by the embedding of its description, which can be retrieved in similar\nsituations in the future. Complex skills can be synthesized by composing simpler programs, which\ncompounds VOYAGER ’s capabilities rapidly over time and alleviates catastrophic forgetting in other\ncontinual learning methods [31, 32].\nHowever, LLMs struggle to produce the correct action code consistently in one shot [ 41]. To address\nthis challenge, we propose an iterative prompting mechanism that: (1) executes the generated\nprogram to obtain observations from the Minecraft simulation (such as inventory listing and nearby\ncreatures) and error trace from the code interpreter (if any); (2) incorporates the feedback into GPT-4’s\nprompt for another round of code refinement; and (3) repeats the process until a self-verification\nmodule confirms the task completion, at which point we commit the program to the skill library (e.g.,\ncraftStoneShovel() andcombatZombieWithSword() ) and query the automatic curriculum for\nthe next milestone (Fig. 2).\nEmpirically, VOYAGER demonstrates strong in-context lifelong learning capabilities. It can construct\nan ever-growing skill library of action programs that are reusable, interpretable, and generalizable\nto novel tasks. We evaluate VOYAGER systematically against other LLM-based agent techniques\n(e.g., ReAct [ 29], Reflexion [ 30], AutoGPT [ 28]) in MineDojo [ 23], an open-source Minecraft AI\nframework. VOYAGER outperforms prior SOTA by obtaining 3.3×more unique items, unlocking key\ntech tree milestones up to 15.3×faster, and traversing 2.3×longer distances. We further demonstrate\nthatVOYAGER is able to utilize the learned skill library in a new Minecraft world to solve novel tasks\nfrom scratch, while other methods struggle to generalize.\n2 Method\nVOYAGER consists of three novel components: (1) an automatic curriculum (Sec. 2.1) that suggests\nobjectives for open-ended exploration, (2) a skill library (Sec. 2.2) for developing increasingly\ncomplex behaviors, and (3) an iterative prompting mechanism (Sec. 2.3) that generates executable\ncode for embodied control. Full prompts are presented in Appendix, Sec. A.\n2.1 Automatic Curriculum\nEmbodied agents encounter a variety of objectives with different complexity levels in open-ended\nenvironments. An automatic curriculum offers numerous benefits for open-ended exploration, ensur-\ning a challenging but manageable learning process, fostering curiosity-driven intrinsic motivation\nfor agents to learn and explore, and encouraging the development of general and flexible problem-\nsolving strategies [ 42–44]. Our automatic curriculum capitalizes on the internet-scale knowledge\ncontained within GPT-4 by prompting it to provide a steady stream of new tasks or challenges. The\ncurriculum unfolds in a bottom-up fashion, allowing for considerable adaptability and responsiveness\nto the exploration progress and the agent’s current state (Fig. 3). As V OYAGER progresses to harder\nself-driven goals, it naturally learns a variety of skills, such as “mining a diamond”.\n3P r o g r a m  D e s c r i p t i o nS k i l l  L i b r a r y\nT o p - 5  R e l e v a n t  S k i l l sP r o g r a m  G e n e r a t e d  b y  G P T - 4\nT a s k :  C r a f t  I r o n  P i c k a x eK e yA d d\nR e t r i e v eV a l u e\nS k i l l  L i b r a r yQ u e r yH o w  t o  c r a f t  a n  i r o n  p i c k a x e  i n  \nM i n e c r a f t ?T o  c r a f t  a n  i r o n  p i c k a x e ,  y o u  \nn e e d  t o  3  i r o n  i n g o t s  a n d  2  \ns t i c k s .  O n c e  y o u  h a v e  g a th e r e d  \nth e  m a t e r i a l s ,  . . . .\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\n         E n v i r o n m e n t  F e e d b a c k\nM i n e  W o o d   L o g\nM a k e  C r a f t i n g  T a b l e\nC r a f t  W o o d e n  P i c k a x e\nC r a f t  S t o n e  S w o r d\nM a k e  F u r n a c e\n. . .\nC o m b a t  C o w\nC o o k  S t e a k\nC r a f t  I r o n  A x e\nC o m b a t  Z o m b i e\nS m e l t  I r o n  I n g o t\nC r a f t  S t i c k\nM a k e  C r a f t i n g  T a b l e\nM a k e  F u r n a c e\nC r a f t  W o o d e n  P i c k a x e\nG P T - 3 . 5E m b e d d i n g\nE m b e d d i n gG P T - 3 . 5Figure 4: Skill library. Top: Adding a new skill. Each time GPT-4 generates and verifies a new\nskill, we add it to the skill library, represented by a vector database. The key is the embedding vector\nof the program description (generated by GPT-3.5), while the value is the program itself. Bottom:\nSkill retrieval. When faced with a new task proposed by the automatic curriculum, we first leverage\nGPT-3.5 to generate a general suggestion for solving the task, which is combined with environment\nfeedback as the query context. Subsequently, we perform querying to identify the top-5 relevant skills.\nThe input prompt to GPT-4 consists of several components:\n(1)Directives encouraging diverse behaviors and imposing constraints , such as\n“My ultimate goal is to discover as many diverse things as possible\n... The next task should not be too hard since I may not have the\nnecessary resources or have learned enough skills to complete it\nyet. ”;\n(2)The agent’s current state , including inventory, equipment, nearby blocks and entities,\nbiome, time, health and hunger bars, and position;\n(3)Previously completed and failed tasks , reflecting the agent’s current exploration progress\nand capabilities frontier;\n(4)Additional context : We also leverage GPT-3.5 to self-ask questions based on the agent’s\ncurrent state and exploration progress and self-answer questions. We opt to use GPT-3.5\ninstead of GPT-4 for standard NLP tasks due to budgetary considerations.\n2.2 Skill Library\nWith the automatic curriculum consistently proposing increasingly complex tasks, it is essential to\nhave a skill library that serves as a basis for learning and evolution. Inspired by the generality, inter-\npretability, and universality of programs [ 45], we represent each skill with executable code that scaf-\nfolds temporally extended actions for completing a specific task proposed by the automatic curriculum.\nThe input prompt to GPT-4 consists of the following components:\n(1)Guidelines for code generation , such as “ Your function will be reused\nfor building more complex functions. Therefore, you should make\nit generic and reusable. ”;\n(2)Control primitive APIs, and relevant skills retrieved from the skill library, which are\ncrucial for in-context learning [36–38] to work well;\n(3)The generated code from the last round, environment feedback, execution errors, and\ncritique , based on which GPT-4 can self-improve (Sec. 2.3);\n(4)The agent’s current state , including inventory, equipment, nearby blocks and entities,\nbiome, time, health and hunger bars, and position;\n4I  c a n n o t  m a k e  s t i c k  b e c a u s e  I  n e e d :   2  m o r e  p l a n k s\nI  c a n n o t  m a k e  s t o n e _ s h o v e l  b e c a u s e  I  n e e d :   2  m o r e  s t i c kth r o w  n e w  E r r o r ( ` N o  i t e m  n a m e d  $ { n a m e } ` ) ;\nN o  i t e m  n a m e d  a c a c i a _ a x e\na t  l i n e  1 8 : a w a i t  c r a f tI t e m ( b o t ,  \" a c a c i a _ a x e \" ,  1 ) ;E n v i r o n m e n t  F e e d b a c kE x e c u t i o n  E r r o r\nG P T - 4G P T - 4\nFigure 5: Left: Environment feedback. GPT-4 realizes it needs 2 more planks before crafting sticks.\nRight: Execution error. GPT-4 realizes it should craft a wooden axe instead of an acacia axe since\nthere is no acacia axe in Minecraft. We only display the partial prompt for brevity. The full prompt\nstructure for code generation is in Appendix, Sec. A.4.\n(5)Chain-of-thought prompting [46] to do reasoning before code generation.\nWe iteratively refine the program through a novel iterative prompting mechanism (Sec. 2.3), in-\ncorporate it into the skill library as a new skill, and index it by the embedding of its description\n(Fig. 4, top). For skill retrieval, we query the skill library with the embedding of self-generated task\nplans and environment feedback (Fig. 4, bottom). By continuously expanding and refining the skill\nlibrary, VOYAGER can learn, adapt, and excel in a wide spectrum of tasks, consistently pushing the\nboundaries of its capabilities in the open world.\n2.3 Iterative Prompting Mechanism\nWe introduce an iterative prompting mechanism for self-improvement through three types of feedback:\n(1)Environment feedback , which illustrates the intermediate progress of program execution\n(Fig. 5, left). For example, “ I cannot make an iron chestplate because I need:\n7 more iron ingots ” highlights the cause of failure in crafting an iron chestplate. We use\nbot.chat() inside control primitive APIs to generate environment feedback and prompt\nGPT-4 to use this function as well during code generation;\n(2)Execution errors from the program interpreter that reveal any invalid operations or syntax\nerrors in programs, which are valuable for bug fixing (Fig. 5, right);\n(3)Self-verification for checking task success. Instead of manually coding success checkers\nfor each new task proposed by the automatic curriculum, we instantiate another GPT-4\nagent for self-verification. By providing VOYAGER ’s current state and the task to GPT-4,\nwe ask it to act as a critic [ 47–49] and inform us whether the program achieves the task.\nIn addition, if the task fails, it provides a critique by suggesting how to complete the task\n(Fig. 6). Hence, our self-verification is more comprehensive than self-reflection [ 30] by both\nchecking success and reflecting on mistakes.\nDuring each round of code generation, we execute the generated program to obtain environment\nfeedback and execution errors from the code interpreter, which are incorporated into GPT-4’s prompt\nfor the next round of code refinement. This iterative process repeats until self-verification validates\n5I n v e n t o r y  ( 8 / 3 6 ) :  { ' o a k _ p l a n k s ' :  5 ,  ' c o b b l e s t o n e ' :  2 ,  \n' p o r k c h o p ' :  2 ,  ' w o o d e n _ s w o r d ' :  1 ,  ' c o a l ' :  5 ,  ' w o o d e n _ p i c k a x e ' :  \n1 ,  ' o a k _ l o g ' :  3 ,  ' d i r t ' :  9 }\nT a s k :  M i n e  5  c o a l  o r e sI n v e n t o r y  ( 1 0 / 3 6 ) :  { ' r a w _ c o p p e r ' :  9 ,  ' c o p p e r _ i n g o t ' :  3 ,  \n' a c a c i a _ p l a n k s ' :  1 ,  ' r a w _ i r o n ' :  1 ,  ' s t i c k ' :  1 ,  ' i r o n _ s w o r d ' :  1 ,  \n' i r o n _ p i c k a x e ' :  1 ,  ' i r o n _ i n g o t ' :  3 ,  ' c r a f t i n g _ t a b l e ' :  1 ,  ' f u r n a c e ' :  1 }\nT a s k :  C r a f t  a  s p y g l a s sI n v e n t o r y  ( 7 / 3 6 ) :  { ' o a k _ l o g ' :  5 ,  ' o a k _ p l a n k s ' :  1 ,  \n' w o o d e n _ p i c k a x e ' :  1 ,  ' w o o d e n _ s w o r d ' :  1 ,  ' p o r k c h o p ' :  2 ,  \n' w h i t e _ w o o l ' :  2 ,  ' m u t t o n ' :  6 }\nT a s k :  K i l l  3  s h e e pI n v e n t o r y  ( 9 / 3 6 ) :  { ' s t r i n g ' :  4 ,  ' c o a l ' :  1 ,  ' r o t t e n _ f l e s h ' :  1 ,  \n' i r o n _ s w o r d ' :  1 ,  ' f u r n a c e ' :  1 ,  ' d i r t ' :  6 ,  ' s t o n e _ s h o v e l ' :  1 ,  \n' w o o d e n _ p i c k a x e ' :  1 ,  ' g r a n i t e ' :  5 }\nT a s k :  K i l l  1  z o m b i eR e a s o n i n g :  M i n i n g  c o a l _ o r e  i n  M i n e c r a f t  w i l l  g e t  c o a l .  Y o u  h a v e  5  c o a l  i n  y o u r  \ni n v e n t o r y .\nS u c c e s s :  T r u eR e a s o n i n g :  T o  c r a f t  a  s p y g l a s s ,  y o u  n e e d  2  c o p p e r  i n g o t s  a n d  1  a m e th y s t  s h a r d .  \nY o u  h a v e  3  c o p p e r  i n g o t s ,  b u t  y o u  d o n ' t  h a v e  a n y  a m e th y s t  s h a r d s .\nS u c c e s s :  F a l s e\nC r i t i q u e :  F i n d  a n d  m i n e  a n  a m e th y s t  s h a r d  u n d e r g r o u n d .R e a s o n i n g :  Y o u  h a v e  2  w h i t e _ w o o l  a n d  6  m u t t o n  i n  y o u r  i n v e n t o r y ,  w h i c h  i n d i c a t e s  \nth a t  y o u  k i l l e d  2  s h e e p .  Y o u  n e e d e d  t o  k i l l  3  s h e e p .\nS u c c e s s :  F a l s e\nC r i t i q u e :  F i n d  a n d  k i l l  o n e  m o r e  s h e e p  t o  c o m p l e t e  th e  t a s k .R e a s o n i n g :  Y o u  h a v e  1  r o t t e n _ f l e s h  i n  y o u r  i n v e n t o r y ,  w h i c h  m e a n s  y o u  h a v e  k i l l e d  \na t  l e a s t  1  z o m b i e .\nS u c c e s s :  T r u eG P T - 4G P T - 4G P T - 4G P T - 4Figure 6: Self-verification examples. We only display the partial prompt for brevity. See Appendix,\nSec. A.5 for the full prompt structure.\nthe task’s completion, at which point we add this new skill to the skill library and ask the automatic\ncurriculum for a new objective (Fig. 2). If the agent gets stuck after 4 rounds of code generation, then\nwe query the curriculum for another task. This iterative prompting approach significantly improves\nprogram synthesis for embodied control, enabling VOYAGER to continuously acquire diverse skills\nwithout human intervention.\n3 Experiments\n3.1 Experimental Setup\nWe leverage OpenAI’s gpt-4-0314 [35] and gpt-3.5-turbo-0301 [50] APIs for text completion,\nalong with text-embedding-ada-002 [51] API for text embedding. We set all temperatures to\n0 except for the automatic curriculum, which uses temperature =0.1 to encourage task diversity. Our\nsimulation environment is built on top of MineDojo [ 23] and leverages Mineflayer [ 52] JavaScript\nAPIs for motor controls. See Appendix, Sec. B.1 for more details.\n3.2 Baselines\nBecause there is no LLM-based agents that work out of the box for Minecraft, we make our best\neffort to select a number of representative algorithms as baselines. These methods are originally\ndesigned only for NLP tasks without embodiment, therefore we have to re-interpret them to be\nexecutable in MineDojo and compatible with our experimental setting:\nReAct [29] uses chain-of-thought prompting [ 46] by generating both reasoning traces and action\nplans with LLMs. We provide it with our environment feedback and the agent states as observations.\nReflexion [30] is built on top of ReAct [ 29] with self-reflection to infer more intuitive future actions.\nWe provide it with execution errors and our self-verification module.\nAutoGPT [28] is a popular software tool that automates NLP tasks by decomposing a high-level\ngoal into multiple subgoals and executing them in a ReAct-style loop. We re-implement AutoGPT\nby using GPT-4 to do task decomposition and provide it with the agent states, environment feedback,\nand execution errors as observations for subgoal execution. Compared with VOYAGER , AutoGPT\nlacks the skill library for accumulating knowledge, self-verification for assessing task success, and\nautomatic curriculum for open-ended exploration.\nNote that we do not directly compare with prior methods that take Minecraft screen pixels as input\nand output low-level controls [ 53–55]. It would not be an apple-to-apple comparison, because we rely\non the high-level Mineflayer [ 52] API to control the agent. Our work’s focus is on pushing the limits\nof GPT-4 for lifelong embodied agent learning, rather than solving the 3D perception or sensorimotor\ncontrol problems. VOYAGER is orthogonal and can be combined with gradient-based approaches like\n6Table 1: Tech tree mastery. Fractions indicate the number of successful trials out of three total runs.\n0/3 means the method fails to unlock a level of the tech tree within the maximal prompting iterations\n(160). Numbers are prompting iterations averaged over three trials. The fewer the iterations, the\nmore efficient the method.\nMethod Wooden Tool Stone Tool Iron Tool Diamond Tool\nReAct [29] N/A (0/3) N/A (0/3) N/A (0/3) N/A (0/3)\nReflexion [30] N/A (0/3) N/A (0/3) N/A (0/3) N/A (0/3)\nAutoGPT [28] 92±72 (3/3) 94±72 (3/3) 135 ±103 (3/3) N/A (0/3)\nVOYAGER w/o Skill Library 7±2(3/3) 9±4(3/3) 29 ±11 (3/3) N/A (0/3)\nVOYAGER (Ours) 6±2(3/3)11±2(3/3) 21±7(3/3) 102 (1/3)\nFigure 7: Map coverage: bird’s eye views of Minecraft maps. VOYAGER is able to traverse 2.3×\nlonger distances compared to baselines while crossing diverse terrains.\nVPT [ 8] as long as the controller provides a code API. We make a system-level comparison between\nVOYAGER and prior Minecraft agents in Table. A.2.\n3.3 Evaluation Results\nWe systematically evaluate VOYAGER and baselines on their exploration performance, tech tree\nmastery, map coverage, and zero-shot generalization capability to novel tasks in a new world.\nSignificantly better exploration. Results of exploration performance are shown in Fig. 1.\nVOYAGER ’s superiority is evident in its ability to consistently make new strides, discovering 63\nunique items within 160 prompting iterations, 3.3×many novel items compared to its counterparts.\nOn the other hand, AutoGPT lags considerably in discovering new items, while ReAct and Reflexion\nstruggle to make significant progress, given the abstract nature of the open-ended exploration goal\nthat is challenging to execute without an appropriate curriculum.\nConsistent tech tree mastery. The Minecraft tech tree tests the agent’s ability to craft and use a\nhierarchy of tools. Progressing through this tree (wooden tool →stone tool →iron tool →diamond\ntool) requires the agent to master systematic and compositional skills. Compared with baselines,\nVOYAGER unlocks the wooden level 15.3×faster (in terms of the prompting iterations), the stone\nlevel 8.5×faster, the iron level 6.4×faster, and VOYAGER is the only one to unlock the diamond level\nof the tech tree (Fig. 2 and Table. 1). This underscores the effectiveness of the automatic curriculum,\nwhich consistently presents challenges of suitable complexity to facilitate the agent’s progress.\nExtensive map traversal. VOYAGER is able to navigate distances 2.3×longer compared to baselines\nby traversing a variety of terrains, while the baseline agents often find themselves confined to local\nareas, which significantly hampers their capacity to discover new knowledge (Fig. 7).\n7Table 2: Zero-shot generalization to unseen tasks. Fractions indicate the number of successful\ntrials out of three total attempts. 0/3 means the method fails to solve the task within the maximal\nprompting iterations (50). Numbers are prompting iterations averaged over three trials. The fewer\nthe iterations, the more efficient the method.\nMethod Diamond Pickaxe Golden Sword Lava Bucket Compass\nReAct [29] N/A (0/3) N/A (0/3) N/A (0/3) N/A (0/3)\nReflexion [30] N/A (0/3) N/A (0/3) N/A (0/3) N/A (0/3)\nAutoGPT [28] N/A (0/3) N/A (0/3) N/A (0/3) N/A (0/3)\nAutoGPT [28] w/ Our Skill Library 39(1/3) 30(1/3) N/A (0/3) 30(2/3)\nVOYAGER w/o Skill Library 36(2/3) 30 ±9 (3/3) 27 ±9 (3/3) 26 ±3 (3/3)\nVOYAGER (Ours) 19±3(3/3) 18±7(3/3)21±5(3/3)18±2(3/3)\nFigure 8: Zero-shot generalization to unseen tasks. We visualize the intermediate progress of each\nmethod on two tasks. See Appendix, Sec. B.4.3 for the other two tasks. We do not plot ReAct and\nReflexion since they do not make any meaningful progress.\nEfficient zero-shot generalization to unseen tasks. To evaluate zero-shot generalization, we clear\nthe agent’s inventory, reset it to a newly instantiated world, and test it with unseen tasks. For both\nVOYAGER and AutoGPT, we utilize GPT-4 to break down the task into a series of subgoals. Table. 2\nand Fig. 8 show VOYAGER can consistently solve all the tasks, while baselines cannot solve any task\nwithin 50 prompting iterations. What’s interesting to note is that our skill library constructed from\nlifelong learning not only enhances VOYAGER ’s performance but also gives a boost to AutoGPT.\nThis demonstrates that the skill library serves as a versatile tool that can be readily employed by other\nmethods, effectively acting as a plug-and-play asset to enhance performance.\n3.4 Ablation Studies\nWe ablate 6 design choices (automatic curriculum, skill library, environment feedback, execution\nerrors, self-verification, and GPT-4 for code generation) in VOYAGER and study their impact on\nexploration performance (see Appendix, Sec. B.3 for details of each ablated variant). Results are\nshown in Fig. 9. We highlight the key findings below:\n•Automatic curriculum is crucial for the agent’s consistent progress. The discovered item\ncount drops by 93% if the curriculum is replaced with a random one, because certain tasks\nmay be too challenging if attempted out of order. On the other hand, a manually designed\ncurriculum requires significant Minecraft-specific expertise, and does not take into account\nthe agent’s live situation. It falls short in the experimental results compared to our automatic\ncurriculum.\n•VOYAGER w/o skill library exhibits a tendency to plateau in the later stages. This\nunderscores the pivotal role that the skill library plays in VOYAGER . It helps create more\ncomplex actions and steadily pushes the agent’s boundaries by encouraging new skills to be\nbuilt upon older ones.\n8Figure 9: Left: Ablation studies for the automatic curriculum, skill library, and GPT-4. GPT-3.5\nmeans replacing GPT-4 with GPT-3.5 for code generation. VOYAGER outperforms all the alternatives,\ndemonstrating the critical role of each component. Right: Ablation studies for the iterative\nprompting mechanism. VOYAGER surpasses all the other options, thereby highlighting the essential\nsignificance of each type of feedback in the iterative prompting mechanism.\nFigure 10: VOYAGER builds 3D structures with human feedback. The progress of building designs\nthat integrate human input is demonstrated from left to right.\n•Self-verification is the most important among all the feedback types . Removing the\nmodule leads to a significant drop ( −73%) in the discovered item count. Self-verification\nserves as a critical mechanism to decide when to move on to a new task or reattempt a\npreviously unsuccessful task.\n•GPT-4 significantly outperforms GPT-3.5 in code generation and obtains 5.7×more\nunique items, as GPT-4 exhibits a quantum leap in coding abilities. This finding corroborates\nrecent studies in the literature [56, 57].\n3.5 Multimodal Feedback from Humans\nVOYAGER does not currently support visual perception, because the available version of GPT-4 API\nis text-only at the time of this writing. However, VOYAGER has the potential to be augmented by\nmultimodal perception models [ 58,59] to achieve more impressive tasks. We demonstrate that given\nhuman feedback, VOYAGER is able to construct complex 3D structures in Minecraft, such as a Nether\nPortal and a house (Fig. 10). There are two ways to integrate human feedback:\n(1)Human as a critic (equivalent to VOYAGER ’s self-verification module): humans provide\nvisual critique to VOYAGER , allowing it to modify the code from the previous round. This\nfeedback is essential for correcting certain errors in the spatial details of a 3D structure that\nVOYAGER cannot perceive directly.\n(2)Human as a curriculum (equivalent to VOYAGER ’s automatic curriculum module): humans\nbreak down a complex building task into smaller steps, guiding VOYAGER to complete them\nincrementally. This approach improves VOYAGER ’s ability to handle more sophisticated 3D\nconstruction tasks.\n94 Limitations and Future Work\nCost. The GPT-4 API incurs significant costs. It is 15×more expensive than GPT-3.5. Nevertheless,\nVOYAGER requires the quantum leap in code generation quality from GPT-4 (Fig. 9), which GPT-3.5\nand open-source LLMs cannot provide [60].\nInaccuracies. Despite the iterative prompting mechanism, there are still cases where the agent gets\nstuck and fails to generate the correct skill. The automatic curriculum has the flexibility to reattempt\nthis task at a later time. Occasionally, self-verification module may also fail, such as not recognizing\nspider string as a success signal of beating a spider.\nHallucinations. The automatic curriculum occasionally proposes unachievable tasks. For example, it\nmay ask the agent to craft a “copper sword\" or “copper chestplate\", which are items that do not exist\nwithin the game. Hallucinations also occur during the code generation process. For instance, GPT-4\ntends to use cobblestone as a fuel input, despite being an invalid fuel source in the game. Additionally,\nit may call functions absent in the provided control primitive APIs, leading to code execution errors.\nWe are confident that improvements in the GPT API models as well as novel techniques for finetuning\nopen-source LLMs will overcome these limitations in the future.\n5 Related work\nDecision-making Agents in Minecraft. Minecraft is an open-ended 3D world with incredibly\nflexible game mechanics supporting a broad spectrum of activities. Built upon notable Minecraft\nbenchmarks [ 23,61–65], Minecraft learning algorithms can be divided into two categories: 1)\nLow-level controller: Many prior efforts leverage hierarchical reinforcement learning to learn from\nhuman demonstrations [ 66–68]. Kanitscheider et al. [ 14] design a curriculum based on success rates,\nbut its objectives are limited to curated items. MineDojo [ 23] and VPT [ 8] utilize YouTube videos\nfor large-scale pre-training. DreamerV3 [ 69], on the other hand, learns a world model to explore\nthe environment and collect diamonds. 2) High-level planner: V olum et al. [ 70] leverage few-shot\nprompting with Codex [ 41] to generate executable policies, but they require additional human\ninteraction. Recent works leverage LLMs as a high-level planner in Minecraft by decomposing\na high-level task into several subgoals following Minecraft recipes [ 55,53,71], thus lacking full\nexploration flexibility. Like these latter works, VOYAGER also uses LLMs as a high-level planner by\nprompting GPT-4 and utilizes Mineflayer [ 52] as a low-level controller following V olum et al. [ 70].\nUnlike prior works, VOYAGER employs an automatic curriculum that unfolds in a bottom-up manner,\ndriven by curiosity, and therefore enables open-ended exploration.\nLarge Language Models for Agent Planning. Inspired by the strong emergent capabilities of\nLLMs, such as zero-shot prompting and complex reasoning [ 72,37,38,36,73,74], embodied agent\nresearch [ 75–78] has witnessed a significant increase in the utilization of LLMs for planning purposes.\nRecent efforts can be roughly classified into two groups. 1) Large language models for robot\nlearning: Many prior works apply LLMs to generate subgoals for robot planning [ 27,27,25,79,80].\nInner Monologue [ 26] incorporates environment feedback for robot planning with LLMs. Code as\nPolicies [ 16] and ProgPrompt [ 22] directly leverage LLMs to generate executable robot policies.\nVIMA [ 19] and PaLM-E [ 59] fine-tune pre-trained LLMs to support multimodal prompts. 2)\nLarge language models for text agents: ReAct [ 29] leverages chain-of-thought prompting [ 46] and\ngenerates both reasoning traces and task-specific actions with LLMs. Reflexion [ 30] is built upon\nReAct [ 29] with self-reflection to enhance reasoning. AutoGPT [ 28] is a popular tool that automates\nNLP tasks by crafting a curriculum of multiple subgoals for completing a high-level goal while\nincorporating ReAct [ 29]’s reasoning and acting loops. DERA [ 81] frames a task as a dialogue\nbetween two GPT-4 [ 35] agents. Generative Agents [ 82] leverages ChatGPT [ 50] to simulate human\nbehaviors by storing agents’ experiences as memories and retrieving those for planning, but its agent\nactions are not executable. SPRING [ 83] is a concurrent work that uses GPT-4 to extract game\nmechanics from game manuals, based on which it answers questions arranged in a directed acyclic\ngraph and predicts the next action. All these works lack a skill library for developing more complex\nbehaviors, which are crucial components for the success of V OYAGER in lifelong learning.\nCode Generation with Execution. Code generation has been a longstanding challenge in\nNLP [ 41,84,85,73,37], with various works leveraging execution results to improve program\n10synthesis. Execution-guided approaches leverage intermediate execution outcomes to guide program\nsearch [ 86–88]. Another line of research utilizes majority voting to choose candidates based on their\nexecution performance [ 89,90]. Additionally, LEVER [ 91] trains a verifier to distinguish and reject\nincorrect programs based on execution results. CLAIRIFY [ 92], on the other hand, generates code\nfor planning chemistry experiments and makes use of a rule-based verifier to iteratively provide\nerror feedback to LLMs. VOYAGER distinguishes itself from these works by integrating environment\nfeedback, execution errors, and self-verification (to assess task success) into an iterative prompting\nmechanism for embodied control.\n6 Conclusion\nIn this work, we introduce VOYAGER , the first LLM-powered embodied lifelong learning agent,\nwhich leverages GPT-4 to explore the world continuously, develop increasingly sophisticated skills,\nand make new discoveries consistently without human intervention. VOYAGER exhibits superior\nperformance in discovering novel items, unlocking the Minecraft tech tree, traversing diverse terrains,\nand applying its learned skill library to unseen tasks in a newly instantiated world. VOYAGER serves\nas a starting point to develop powerful generalist agents without tuning the model parameters.\n7 Broader Impacts\nOur research is conducted within Minecraft, a safe and harmless 3D video game environment. While\nVOYAGER is designed to be generally applicable to other domains, such as robotics, its application to\nphysical robots would require additional attention and the implementation of safety constraints by\nhumans to ensure responsible and secure deployment.\n8 Acknowledgements\nWe are extremely grateful to Ziming Zhu, Kaiyu Yang, Rafał Kocielnik, Colin White, Or Sharir, Sahin\nLale, De-An Huang, Jean Kossaifi, Yuncong Yang, Charles Zhang, Minchao Huang, and many other\ncolleagues and friends for their helpful feedback and insightful discussions. This work is done during\nGuanzhi Wang’s internship at NVIDIA. Guanzhi Wang is supported by the Kortschak fellowship in\nComputing and Mathematical Sciences at Caltech.\nReferences\n[1]Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti,\nDaniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d\nenvironment for visual ai. arXiv preprint arXiv: Arxiv-1712.05474 , 2017.\n[2]Manolis Savva, Jitendra Malik, Devi Parikh, Dhruv Batra, Abhishek Kadian, Oleksandr\nMaksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, and Vladlen\nKoltun. Habitat: A platform for embodied AI research. In 2019 IEEE/CVF International\nConference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2,\n2019 , pages 9338–9346. IEEE, 2019.\n[3]Yuke Zhu, Josiah Wong, Ajay Mandlekar, and Roberto Martín-Martín. robosuite: A mod-\nular simulation framework and benchmark for robot learning. arXiv preprint arXiv: Arxiv-\n2009.12293 , 2020.\n[4]Fei Xia, William B. Shen, Chengshu Li, Priya Kasimbeg, Micael Tchapmi, Alexander Toshev,\nLi Fei-Fei, Roberto Martín-Martín, and Silvio Savarese. Interactive gibson benchmark (igibson\n0.5): A benchmark for interactive navigation in cluttered environments. arXiv preprint arXiv:\nArxiv-1910.14442 , 2019.\n[5]Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Claudia\nPérez-D’Arpino, Shyamal Buch, Sanjana Srivastava, Lyne P. Tchapmi, Micael E. Tchapmi, Kent\nVainio, Josiah Wong, Li Fei-Fei, and Silvio Savarese. igibson 1.0: a simulation environment for\ninteractive tasks in large realistic scenes. arXiv preprint arXiv: Arxiv-2012.02924 , 2020.\n11[6]Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.\nThe International Journal of Robotics Research , 32(11):1238–1274, 2013.\n[7]Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep\nreinforcement learning: A brief survey. IEEE Signal Processing Magazine , 34(6):26–38, 2017.\n[8]Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon\nHoughton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching\nunlabeled online videos. arXiv preprint arXiv: Arxiv-2206.11795 , 2022.\n[9]DeepMind Interactive Agents Team, Josh Abramson, Arun Ahuja, Arthur Brussee, Federico\nCarnevale, Mary Cassin, Felix Fischer, Petko Georgiev, Alex Goldin, Mansi Gupta, Tim\nHarley, Felix Hill, Peter C Humphreys, Alden Hung, Jessica Landon, Timothy Lillicrap, Hamza\nMerzic, Alistair Muldal, Adam Santoro, Guy Scully, Tamara von Glehn, Greg Wayne, Nathaniel\nWong, Chen Yan, and Rui Zhu. Creating multimodal interactive agents with imitation and\nself-supervised learning. arXiv preprint arXiv: Arxiv-2112.03763 , 2021.\n[10] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wo-\njciech M Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, et al.\nAlphastar: Mastering the real-time strategy game starcraft ii. DeepMind blog , 2, 2019.\n[11] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Go-explore:\na new approach for hard-exploration problems. arXiv preprint arXiv: Arxiv-1901.10995 , 2019.\n[12] Joost Huizinga and Jeff Clune. Evolving multimodal robot behavior via many stepping stones\nwith the combinatorial multiobjective evolutionary algorithm. Evolutionary computation ,\n30(2):131–164, 2022.\n[13] Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeffrey Clune, and Kenneth O.\nStanley. Enhanced POET: open-ended reinforcement learning through unbounded invention of\nlearning challenges and their solutions. In Proceedings of the 37th International Conference on\nMachine Learning, ICML 2020, 13-18 July 2020, Virtual Event , volume 119 of Proceedings of\nMachine Learning Research , pages 9940–9951. PMLR, 2020.\n[14] Ingmar Kanitscheider, Joost Huizinga, David Farhi, William Hebgen Guss, Brandon Houghton,\nRaul Sampedro, Peter Zhokhov, Bowen Baker, Adrien Ecoffet, Jie Tang, Oleg Klimov, and Jeff\nClune. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft.\narXiv preprint arXiv: Arxiv-2106.14876 , 2021.\n[15] Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre M. Bayen, Stuart Russell, Andrew\nCritch, and Sergey Levine. Emergent complexity and zero-shot transfer via unsupervised\nenvironment design. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina\nBalcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33:\nAnnual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December\n6-12, 2020, virtual , 2020.\n[16] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence,\nand Andy Zeng. Code as policies: Language model programs for embodied control. arXiv\npreprint arXiv: Arxiv-2209.07753 , 2022.\n[17] Shao-Hua Sun, Te-Lin Wu, and Joseph J. Lim. Program guided agent. In 8th International\nConference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 .\nOpenReview.net, 2020.\n[18] Zelin Zhao, Karan Samel, Binghong Chen, and Le Song. Proto: Program-guided transformer for\nprogram-guided tasks. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy\nLiang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing\nSystems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS\n2021, December 6-14, 2021, virtual , pages 17021–17036, 2021.\n[19] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen,\nLi Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi (Jim) Fan. Vima: General robot manipu-\nlation with multimodal prompts. ARXIV .ORG , 2022.\n12[20] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic\nmanipulation. arXiv preprint arXiv: Arxiv-2109.12098 , 2021.\n[21] Linxi Fan, Guanzhi Wang, De-An Huang, Zhiding Yu, Li Fei-Fei, Yuke Zhu, and Animashree\nAnandkumar. SECANT: self-expert cloning for zero-shot generalization of visual policies. In\nMarina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on\nMachine Learning, ICML 2021, 18-24 July 2021, Virtual Event , volume 139 of Proceedings of\nMachine Learning Research , pages 3088–3099. PMLR, 2021.\n[22] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay,\nDieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task\nplans using large language models. arXiv preprint arXiv: Arxiv-2209.11302 , 2022.\n[23] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew\nTang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended\nembodied agents with internet-scale knowledge. arXiv preprint arXiv: Arxiv-2206.08853 , 2022.\n[24] Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek\nPurohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence.\nSocratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint\narXiv: Arxiv-2204.00598 , 2022.\n[25] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David,\nChelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine\nHsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey,\nSally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei\nLee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao,\nKanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan,\nAlexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, and Mengyuan\nYan. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:\nArxiv-2204.01691 , 2022.\n[26] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng,\nJonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas\nJackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue:\nEmbodied reasoning through planning with language models. arXiv preprint arXiv: Arxiv-\n2207.05608 , 2022.\n[27] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-\nshot planners: Extracting actionable knowledge for embodied agents. In Kamalika Chaudhuri,\nStefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International\nConference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA ,\nvolume 162 of Proceedings of Machine Learning Research , pages 9118–9147. PMLR, 2022.\n[28] Significant-gravitas/auto-gpt: An experimental open-source attempt to make gpt-4 fully au-\ntonomous., 2023.\n[29] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan\nCao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:\nArxiv-2210.03629 , 2022.\n[30] Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with\ndynamic memory and self-reflection. arXiv preprint arXiv: Arxiv-2303.11366 , 2023.\n[31] German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter.\nContinual lifelong learning with neural networks: A review. Neural Networks , 113:54–71, 2019.\n[32] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual\nlearning: Theory, method and application. arXiv preprint arXiv: Arxiv-2302.00487 , 2023.\n[33] V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan\nWierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint\narXiv: Arxiv-1312.5602 , 2013.\n13[34] OpenAI, :, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław\nD˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józe-\nfowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d. O. Pinto,\nJonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya\nSutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement\nlearning. arXiv preprint arXiv: Arxiv-1912.06680 , 2019.\n[35] OpenAI. Gpt-4 technical report. arXiv preprint arXiv: Arxiv-2303.08774 , 2023.\n[36] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani\nYogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto,\nOriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language\nmodels. arXiv preprint arXiv: Arxiv-2206.07682 , 2022.\n[37] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,\nArvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel\nHerbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.\nZiegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz\nLitwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec\nRadford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo\nLarochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin,\neditors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural\nInformation Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual , 2020.\n[38] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,\nYanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified\ntext-to-text transformer. J. Mach. Learn. Res. , 21:140:1–140:67, 2020.\n[39] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you\nneed: Learning skills without a reward function. In 7th International Conference on Learning\nRepresentations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 . OpenReview.net, 2019.\n[40] Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O. Stanley,\nand Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via\na population of novelty-seeking agents. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle,\nKristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural\nInformation Processing Systems 31: Annual Conference on Neural Information Processing\nSystems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada , pages 5032–5043, 2018.\n[41] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,\nJared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul\nPuri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke\nChan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad\nBavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias\nPlappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-V oss, William Hebgen Guss, Alex\nNichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,\nWilliam Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra,\nEvan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer,\nPeter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech\nZaremba. Evaluating large language models trained on code. arXiv preprint arXiv: Arxiv-\n2107.03374 , 2021.\n[42] Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O. Stanley. Paired open-ended trailblazer\n(poet): Endlessly generating increasingly complex and diverse learning environments and their\nsolutions. arXiv preprint arXiv: Arxiv-1901.01753 , 2019.\n[43] Rémy Portelas, Cédric Colas, Lilian Weng, Katja Hofmann, and Pierre-Yves Oudeyer. Auto-\nmatic curriculum learning for deep RL: A short survey. In Christian Bessiere, editor, Proceedings\nof the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020 , pages\n4819–4825. ijcai.org, 2020.\n14[44] Sébastien Forestier, Rémy Portelas, Yoan Mollard, and Pierre-Yves Oudeyer. Intrinsically\nmotivated goal exploration processes with automatic curriculum learning. The Journal of\nMachine Learning Research , 23(1):6818–6858, 2022.\n[45] Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales,\nLuke Hewitt, Armando Solar-Lezama, and Joshua B. Tenenbaum. Dreamcoder: Growing\ngeneralizable, interpretable knowledge with wake-sleep bayesian program learning. arXiv\npreprint arXiv: Arxiv-2006.08381 , 2020.\n[46] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny\nZhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint\narXiv: Arxiv-2201.11903 , 2022.\n[47] V olodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap,\nTim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep rein-\nforcement learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings\nof the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY,\nUSA, June 19-24, 2016 , volume 48 of JMLR Workshop and Conference Proceedings , pages\n1928–1937. JMLR.org, 2016.\n[48] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal\npolicy optimization algorithms. arXiv preprint arXiv: Arxiv-1707.06347 , 2017.\n[49] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval\nTassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.\nIn Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Repre-\nsentations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings ,\n2016.\n[50] Introducing chatgpt, 2022.\n[51] New and improved embedding model, 2022.\n[52] PrismarineJS. Prismarinejs/mineflayer: Create minecraft bots with a powerful, stable, and high\nlevel javascript api., 2013.\n[53] Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hanna Hajishirzi, Sameer\nSingh, and Roy Fox. Do embodied agents dream of pixelated sheep?: Embodied decision\nmaking using language guided world modelling. ARXIV .ORG , 2023.\n[54] Shaofei Cai, Zihao Wang, Xiaojian Ma, Anji Liu, and Yitao Liang. Open-world multi-task\ncontrol through goal-aware representation learning and adaptive horizon prediction. arXiv\npreprint arXiv: Arxiv-2301.10034 , 2023.\n[55] Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and\nselect: Interactive planning with large language models enables open-world multi-task agents.\narXiv preprint arXiv: Arxiv-2302.01560 , 2023.\n[56] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece\nKamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi,\nMarco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments\nwith gpt-4. arXiv preprint arXiv: Arxiv-2303.12712 , 2023.\n[57] Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He,\nAntong Li, Mengshen He, Zhengliang Liu, Zihao Wu, Dajiang Zhu, Xiang Li, Ning Qiang,\nDingang Shen, Tianming Liu, and Bao Ge. Summary of chatgpt/gpt-4 research and perspective\ntowards the future of large language models. arXiv preprint arXiv: Arxiv-2304.01852 , 2023.\n[58] Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, and Anima Anandkumar.\nPrismer: A vision-language model with an ensemble of experts. arXiv preprint arXiv: Arxiv-\n2303.02506 , 2023.\n15[59] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter,\nAyzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar,\nPierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc\nToussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied\nmultimodal language model. arXiv preprint arXiv: Arxiv-2303.03378 , 2023.\n[60] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo-\nthée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez,\nArmand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation\nlanguage models. arXiv preprint arXiv: Arxiv-2302.13971 , 2023.\n[61] William H. Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela\nVeloso, and Ruslan Salakhutdinov. Minerl: A large-scale dataset of minecraft demonstrations.\nIn Sarit Kraus, editor, Proceedings of the Twenty-Eighth International Joint Conference on\nArtificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019 , pages 2442–2448.\nijcai.org, 2019.\n[62] William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie\nMilani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin,\nManuela Veloso, and Phillip Wang. The minerl 2019 competition on sample efficient re-\ninforcement learning using human priors. arXiv preprint arXiv: Arxiv-1904.10079 , 2019.\n[63] William H. Guss, Mario Ynocente Castro, Sam Devlin, Brandon Houghton, Noboru Sean Kuno,\nCrissman Loomis, Stephanie Milani, Sharada Mohanty, Keisuke Nakata, Ruslan Salakhutdinov,\nJohn Schulman, Shinya Shiroshita, Nicholay Topin, Avinash Ummadisingu, and Oriol Vinyals.\nThe minerl 2020 competition on sample efficient reinforcement learning using human priors.\narXiv preprint arXiv: Arxiv-2101.11071 , 2021.\n[64] Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Jun-\nyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang, Weijun Hong, Zhongyue Huang, Haicheng\nChen, Guangjun Zeng, Yue Lin, Vincent Micheli, Eloi Alonso, François Fleuret, Alexander\nNikulin, Yury Belousov, Oleg Svidchenko, and Aleksei Shpilman. Minerl diamond 2021\ncompetition: Overview, results, and lessons learned. arXiv preprint arXiv: Arxiv-2202.10583 ,\n2022.\n[65] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for\nartificial intelligence experimentation. In Subbarao Kambhampati, editor, Proceedings of the\nTwenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York,\nNY, USA, 9-15 July 2016 , pages 4246–4247. IJCAI/AAAI Press, 2016.\n[66] Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, and Wei Yang. Juewu-mc: Playing\nminecraft with sample-efficient hierarchical reinforcement learning. arXiv preprint arXiv:\nArxiv-2112.04907 , 2021.\n[67] Hangyu Mao, Chao Wang, Xiaotian Hao, Yihuan Mao, Yiming Lu, Chengjie Wu, Jianye\nHao, Dong Li, and Pingzhong Tang. Seihai: A sample-efficient hierarchical ai for the minerl\ncompetition. arXiv preprint arXiv: Arxiv-2111.08857 , 2021.\n[68] Alexey Skrynnik, Aleksey Staroverov, Ermek Aitygulov, Kirill Aksenov, Vasilii Davydov, and\nAleksandr I. Panov. Hierarchical deep q-network from imperfect demonstrations in minecraft.\nCogn. Syst. Res. , 65:74–78, 2021.\n[69] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains\nthrough world models. arXiv preprint arXiv: Arxiv-2301.04104 , 2023.\n[70] Ryan V olum, Sudha Rao, Michael Xu, Gabriel DesGarennes, Chris Brockett, Benjamin\nVan Durme, Olivia Deng, Akanksha Malhotra, and Bill Dolan. Craft an iron sword: Dy-\nnamically generating interactive game characters by prompting large language models tuned on\ncode. In Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay\n2022) , pages 25–43, Seattle, United States, 2022. Association for Computational Linguistics.\n[71] Haoqi Yuan, Chi Zhang, Hongcheng Wang, Feiyang Xie, Penglin Cai, Hao Dong, and Zongqing\nLu. Plan4mc: Skill reinforcement learning and planning for open-world minecraft tasks. arXiv\npreprint arXiv: 2303.16563 , 2023.\n16[72] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx,\nMichael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson,\nShyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel,\nJared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano\nErmon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren\nGillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto,\nPeter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas\nIcard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling,\nFereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi,\nAnanya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa\nLi, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric\nMitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman,\nAllen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr,\nIsabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi\nRaghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack\nRyan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan\nSrinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang,\nWilliam Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga,\nJiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia\nZheng, Kaitlyn Zhou, and Percy Liang. On the opportunities and risks of foundation models.\narXiv preprint arXiv: Arxiv-2108.07258 , 2021.\n[73] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam\nRoberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker\nSchuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes,\nYi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson,\nReiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin,\nToju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier\nGarcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David\nLuan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani\nAgrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat,\nAitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei\nZhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei,\nKathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling\nlanguage modeling with pathways. arXiv preprint arXiv: Arxiv-2204.02311 , 2022.\n[74] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li,\nXuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu,\nZhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav\nMishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov,\nEd H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V . Le, and Jason Wei.\nScaling instruction-finetuned language models. arXiv preprint arXiv: Arxiv-2210.11416 , 2022.\n[75] Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied\nAI: from simulators to research tasks. IEEE Trans. Emerg. Top. Comput. Intell. , 6(2):230–244,\n2022.\n[76] Dhruv Batra, Angel X. Chang, Sonia Chernova, Andrew J. Davison, Jia Deng, Vladlen Koltun,\nSergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, Manolis Savva, and Hao Su.\nRearrangement: A challenge for embodied ai. arXiv preprint arXiv: Arxiv-2011.01975 , 2020.\n[77] Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent\nadvances in robot learning from demonstration. Annual review of control, robotics, and\nautonomous systems , 3:297–330, 2020.\n[78] Jack Collins, Shelvin Chand, Anthony Vanderkop, and David Howard. A review of physics\nsimulators for robotic applications. IEEE Access , 9:51416–51431, 2021.\n[79] So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, and R. Salakhutdi-\nnov. Film: Following instructions in language with modular methods. International Conference\non Learning Representations , 2021.\n17[80] Valts Blukis, Chris Paxton, Dieter Fox, Animesh Garg, and Yoav Artzi. A persistent spatial\nsemantic representation for high-level natural language instruction execution. In 5th Annual\nConference on Robot Learning , 2021.\n[81] Varun Nair, Elliot Schumacher, Geoffrey Tso, and Anitha Kannan. Dera: Enhancing large\nlanguage model completions with dialog-enabled resolving agents. arXiv preprint arXiv:\nArxiv-2303.17071 , 2023.\n[82] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and\nMichael S. Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv\npreprint arXiv: Arxiv-2304.03442 , 2023.\n[83] Yue Wu, Shrimai Prabhumoye, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos\nAzaria, Tom Mitchell, and Yuanzhi Li. Spring: Gpt-4 out-performs rl algorithms by studying\npapers and reasoning. arXiv preprint arXiv: 2305.15486 , 2023.\n[84] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese,\nand Caiming Xiong. A conversational paradigm for program synthesis. arXiv preprint arXiv:\nArxiv-2203.13474 , 2022.\n[85] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. Coderl:\nMastering code generation through pretrained models and deep reinforcement learning. arXiv\npreprint arXiv: Arxiv-2207.01780 , 2022.\n[86] Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. In 7th\nInternational Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA,\nMay 6-9, 2019 . OpenReview.net, 2019.\n[87] Xinyun Chen, Dawn Song, and Yuandong Tian. Latent execution for neural program synthesis.\narXiv preprint arXiv: Arxiv-2107.00101 , 2021.\n[88] Kevin Ellis, Maxwell I. Nye, Yewen Pu, Felix Sosa, Josh Tenenbaum, and Armando Solar-\nLezama. Write, execute, assess: Program synthesis with a REPL. In Hanna M. Wallach, Hugo\nLarochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett,\neditors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural\nInformation Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC,\nCanada , pages 9165–9174, 2019.\n[89] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond,\nTom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy,\nCyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl,\nSven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson,\nPushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level\ncode generation with alphacode. arXiv preprint arXiv: Arxiv-2203.07814 , 2022.\n[90] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,\nMatthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and\nJohn Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv: Arxiv-\n2110.14168 , 2021.\n[91] Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen tau Yih, Sida I. Wang, and\nXi Victoria Lin. Lever: Learning to verify language-to-code generation with execution. arXiv\npreprint arXiv: Arxiv-2302.08468 , 2023.\n[92] Marta Skreta, Naruki Yoshikawa, Sebastian Arellano-Rubach, Zhi Ji, Lasse Bjørn Kristensen,\nKourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, and Animesh Garg. Errors are useful\nprompts: Instruction guided task programming with verifier-assisted iterative prompting. arXiv\npreprint arXiv: Arxiv-2303.14100 , 2023.\n18A Method\nA.1 V OYAGER Algorithm\nPseudocode 1: V OYAGER algorithm.\ndef voyager (\nenvironment , # environment that uses code as action space\ncurriculum_agent , # curriculum agent for proposing the next task\naction_agent , # action agent for code generation\ncritic_agent , # critic agent for self - verification\nskill_manager , # skill manager for adding new skills and skill\nretrieval\n):\nagent_state = environment . reset ()\nwhile True :\nexploration_progress = (\ncurriculum_agent . get_exploration_progress (\ncurriculum_agent . get_completed_tasks () ,\ncurriculum_agent . get_failed_tasks () ,\n)\n)\ntask = curriculum_agent . propose_next_task (\nagent_state , exploration_progress\n)\ncode = None\nenvironment_feedback = None\nexecution_errors = None\ncritique = None\nsuccess = False\n# try at most 4 rounds before moving on to the next task\nfor i in range (4) :\nskills = skill_manager . retrieve_skills (\ntask , environment_feedback\n)\ncode = action_agent . generate_code (\ntask ,\ncode ,\nenvironment_feedback ,\nexecution_errors ,\ncritique ,\nskills ,\n)\n(\nagent_state ,\nenvironment_feedback ,\nexecution_errors ,\n) = environment . step ( code )\nsuccess , critique = critic_agent . check_task_success (\ntask , agent_state\n)\nif success :\nbreak\nif success :\nskill_manager . add_skill ( code )\ncurriculum_agent . add_completed_task ( task )\nelse :\ncurriculum_agent . add_failed_task ( task )\nA.2 Prompting\nGPT-4 and GPT-3.5 offer users the ability to designate the role of each prompt message among three\noptions:\n19•System: A high-level instruction that guides the model behavior throughout the conversation.\nIt sets the overall tone and objective for the interaction.\n• User: A detailed instruction that guides the assistant for the next immediate response.\n• Assistant: A response message generated the model.\nSeehttps://platform.openai.com/docs/guides/chat/introduction for more details.\nTo save token usage, instead of engaging in multi-round conversations, we concatenate a system\nprompt and a user prompt to obtain each assistant’s response.\nA.3 Automatic Curriculum\nA.3.1 Components in the Prompt\nThe input prompt to GPT-4 consists of several components:\n(1)Directives encouraging diverse behaviors and imposing constraints (so that the proposed\ntask is achievable and verifiable): See Sec. A.3.4 for the full prompt;\n(2) The agent’s current state:\n•Inventory : A dictionary of items with counts, for example, {‘cobblestone’: 4, ‘furnace’:\n1, ‘stone_pickaxe’: 1, ‘oak_planks’: 7, ‘dirt’: 6, ‘wooden_pickaxe’: 1, ‘crafting_table’:\n1, ‘raw_iron’: 4, ‘coal’: 1};\n•Equipment : Armors or weapons equipped by the agents;\n•Nearby blocks : A set of block names within a 32-block distance to the agent, for\nexample, ‘dirt’, ‘water’, ‘spruce_planks’, ‘grass_block’, ‘dirt_path’, ‘sugar_cane’,\n‘fern’;\n•Other blocks that are recently seen : Blocks that are not nearby or in the inventory;\n•Nearby entities : A set of entity names within a 32-block distance to the agent, for\nexample, ‘pig’, ‘cat’, ‘villager’, ‘zombie’;\n•A list of chests that are seen by the agent : Chests are external containers where the\nagent can deposit items. If a chest is not opened before, its content is “Unknown”.\nOtherwise, the items inside each chest are shown to the agent.\n•Biome : For example, ‘plains’, ‘flower_forest’, ‘meadow’, ‘river’, ‘beach’, ‘for-\nest’, ‘snowy_slopes’, ‘frozen_peaks’, ‘old_growth_birch_forest’, ‘ocean’, ‘sun-\nflower_plains’, ‘stony_shore’;\n•Time : One of ‘sunrise’, ‘day’, ‘noon’, ‘sunset’, ‘night’, ‘midnight’;\n•Health and hunger bars : The max value is 20;\n•Position : 3D coordinate (x, y, z )of the agent’s position in the Minecraft world;\n(3) Previously completed and failed tasks;\n(4) Additional context: See Sec. A.3.2;\n(5)Chain-of-thought prompting [ 46] in response: We request GPT-4 to first reason about the\ncurrent progress and then suggest the next task.\nA.3.2 Additional Context\nWe leverage GPT-3.5 to self-ask questions to provide additional context. Each question is paired with\na concept that is used for retrieving the most relevant document from the wiki knowledge base [ 23].\nWe feed the document content to GPT-3.5 for self-answering questions. In practice, using a wiki\nknowledge base is optional since GPT-3.5 already possesses a good understanding of Minecraft\ngame mechanics. However, the external knowledge base becomes advantageous if GPT-3.5 is not\npre-trained in that specific domain. See Sec. A.3.4 for the full prompt.\nA.3.3 Warm-up Schedule\nIn practice, we adopt a warm-up schedule to gradually incorporate the agent’s state and the additional\ncontext into the prompt based on how many tasks the agent has completed. This ensures that the\nprompt is exposed to increasing amounts of information over the exploration progress and therefore\n20begins with basic skills and progressively advances towards more intricate and diverse ones. The\nwarm-up setting that we use across all the experiments is shown in Table. A.1.\nTable A.1: Warm-up schedule for automatic curriculum.\nInformation in the prompt After how many tasks are completed\ncore inventory (only including log, planks, stick,\ncrafting table, furnace, dirt, coal, pickaxe, sword,\nand axe)0\nequipment 0\nnearby blocks 0\nposition 0\nnearby entities 5\nfull inventory 7\nother blocks that are recently seen 10\nbiome 10\nhealth bar 15\nhunger bar 15\ntime 15\nadditional context 15\nA.3.4 Full Prompt\nPrompt 1: Full system prompt for automatic curriculum. The list of question-answer pairs represents\nthe additional context.\nYou are a helpful assistant that tells me the next immediate task to\ndo in Minecraft . My ultimate goal is to discover as many diverse\nthings as possible , accomplish as many diverse tasks as possible\nand become the best Minecraft player in the world .\nI will give you the following information :\nQuestion 1: ...\nAnswer : ...\nQuestion 2: ...\nAnswer : ...\nQuestion 3: ...\nAnswer : ...\n...\nBiome : ...\nTime : ...\nNearby blocks : ...\nOther blocks that are recently seen : ...\nNearby entities ( nearest to farthest ): ...\nHealth : Higher than 15 means I’m healthy .\nHunger : Higher than 15 means I’m not hungry .\nPosition : ...\nEquipment : If I have better armor in my inventory , you should ask me\nto equip it.\nInventory (xx /36) : ...\nChests : You can ask me to deposit or take items from these chests .\nThere also might be some unknown chest , you should ask me to open\nand check items inside the unknown chest .\nCompleted tasks so far : ...\nFailed tasks that are too hard : ...\nYou must follow the following criteria :\n1) You should act as a mentor and guide me to the next task based on\nmy current learning progress .\n2) Please be very specific about what resources I need to collect ,\nwhat I need to craft , or what mobs I need to kill .\n213) The next task should follow a concise format , such as \" Mine [\nquantity ] [ block ]\", \" Craft [ quantity ] [ item ]\", \" Smelt [ quantity ] [\nitem ]\", \" Kill [ quantity ] [mob ]\", \" Cook [ quantity ] [ food ]\", \" Equip\n[ item ]\" etc. It should be a single phrase . Do not propose multiple\ntasks at the same time . Do not mention anything else .\n4) The next task should not be too hard since I may not have the\nnecessary resources or have learned enough skills to complete it\nyet .\n5) The next task should be novel and interesting . I should look for\nrare resources , upgrade my equipment and tools using better\nmaterials , and discover new things . I should not be doing the same\nthing over and over again .\n6) I may sometimes need to repeat some tasks if I need to collect more\nresources to complete more difficult tasks . Only repeat tasks if\nnecessary .\n7) Do not ask me to build or dig shelter even if it ’s at night . I want\nto explore the world and discover new things . I don ’t want to\nstay in one place .\n8) Tasks that require information beyond the player ’s status to verify\nshould be avoided . For instance , \" Placing 4 torches \" and \"Dig a 2\nx1x2 hole \" are not ideal since they require visual confirmation\nfrom the screen . All the placing , building , planting , and trading\ntasks should be avoided . Do not propose task starting with these\nkeywords .\nYou should only respond in the format as described below :\nRESPONSE FORMAT :\nReasoning : Based on the information I listed above , do reasoning about\nwhat the next task should be.\nTask : The next task .\nHere ’s an example response :\nReasoning : The inventory is empty now , chop down a tree to get some\nwood .\nTask : Obtain a wood log.\nPrompt 2: Full system prompt for asking questions. We provide both good and bad examples as\nfew-shot exemplars.\nYou are a helpful assistant that asks questions to help me decide the\nnext immediate task to do in Minecraft . My ultimate goal is to\ndiscover as many things as possible , accomplish as many tasks as\npossible and become the best Minecraft player in the world .\nI will give you the following information :\nBiome : ...\nTime : ...\nNearby blocks : ...\nOther blocks that are recently seen : ...\nNearby entities ( nearest to farthest ): ...\nHealth : ...\nHunger : ...\nPosition : ...\nEquipment : ...\nInventory (xx /36) : ...\nChests : ...\nCompleted tasks so far : ...\nFailed tasks that are too hard : ...\nYou must follow the following criteria :\n1) You should ask at least 5 questions (but no more than 10 questions )\nto help me decide the next immediate task to do. Each question\nshould be followed by the concept that the question is about .\n2) Your question should be specific to a concept in Minecraft .\nBad example (the question is too general ):\n22Question : What is the best way to play Minecraft ?\nConcept : unknown\nBad example (axe is still general , you should specify the type of\naxe such as wooden axe):\nWhat are the benefits of using an axe to gather resources ?\nConcept : axe\nGood example :\nQuestion : How to make a wooden pickaxe ?\nConcept : wooden pickaxe\n3) Your questions should be self - contained and not require any context\n.\nBad example (the question requires the context of my current biome ):\nQuestion : What are the blocks that I can find in my current biome ?\nConcept : unknown\nBad example (the question requires the context of my current\ninventory ):\nQuestion : What are the resources you need the most currently ?\nConcept : unknown\nBad example (the question requires the context of my current\ninventory ):\nQuestion : Do you have any gold or emerald resources ?\nConcept : gold\nBad example (the question requires the context of my nearby entities\n):\nQuestion : Can you see any animals nearby that you can kill for\nfood ?\nConcept : food\nBad example (the question requires the context of my nearby blocks ):\nQuestion : Is there any water source nearby ?\nConcept : water\nGood example :\nQuestion : What are the blocks that I can find in the sparse jungle\n?\nConcept : sparse jungle\n4) Do not ask questions about building tasks ( such as building a\nshelter ) since they are too hard for me to do.\nLet ’s say your current biome is sparse jungle . You can ask questions\nlike :\nQuestion : What are the items that I can find in the sparse jungle ?\nConcept : sparse jungle\nQuestion : What are the mobs that I can find in the sparse jungle ?\nConcept : sparse jungle\nLet ’s say you see a creeper nearby , and you have not defeated a\ncreeper before . You can ask a question like :\nQuestion : How to defeat the creeper ?\nConcept : creeper\nLet ’s say you last completed task is \" Craft a wooden pickaxe \". You can\nask a question like :\nQuestion : What are the suggested tasks that I can do after crafting a\nwooden pickaxe ?\nConcept : wooden pickaxe\nHere are some more question and concept examples :\nQuestion : What are the ores that I can find in the sparse jungle ?\nConcept : sparse jungle\n(the above concept should not be \"ore \" because I need to look up the\npage of \" sparse jungle \" to find out what ores I can find in the\nsparse jungle )\nQuestion : How can you obtain food in the sparse jungle ?\nConcept : sparse jungle\n23(the above concept should not be \" food \" because I need to look up the\npage of \" sparse jungle \" to find out what food I can obtain in the\nsparse jungle )\nQuestion : How can you use the furnace to upgrade your equipment and\nmake useful items ?\nConcept : furnace\nQuestion : How to obtain a diamond ore?\nConcept : diamond ore\nQuestion : What are the benefits of using a stone pickaxe over a wooden\npickaxe ?\nConcept : stone pickaxe\nQuestion : What are the tools that you can craft using wood planks and\nsticks ?\nConcept : wood planks\nYou should only respond in the format as described below :\nRESPONSE FORMAT :\nReasoning : ...\nQuestion 1: ...\nConcept 1: ...\nQuestion 2: ...\nConcept 2: ...\nQuestion 3: ...\nConcept 3: ...\nQuestion 4: ...\nConcept 4: ...\nQuestion 5: ...\nConcept 5: ...\n...\nPrompt 3: Full system prompt for answering questions. Context represents the optional content from\na wiki knowledge base.\nYou are a helpful assistant that answer my question about Minecraft .\nI will give you the following information :\nQuestion : ...\nYou will answer the question based on the context ( only if available\nand helpful ) and your own knowledge of Minecraft .\n1) Start your answer with \" Answer : \".\n2) Answer \" Answer : Unknown \" if you don ’t know the answer .\nA.4 Skill Library\nA.4.1 Components in the Prompt\nThe input prompt to GPT-4 consists of the following components:\n(1) Guidelines for code generation: See Sec A.4.2 for the full prompt;\n(2)Control primitive APIs implemented by us: These APIs serve a dual purpose: they demon-\nstrate the usage of Mineflayer APIs, and they can be directly called by GPT-4.\n•exploreUntil(bot, direction, maxTime = 60, callback) : Allow the agent\nto explore in a fixed direction for maxTime . The callback is the stopping condition\nimplemented by the agent to determine when to stop exploring;\n•mineBlock(bot, name, count = 1) : Mine and collect the specified number of\nblocks within a 32-block distance;\n•craftItem(bot, name, count = 1) : Craft the item with a crafting table nearby;\n•placeItem(bot, name, position) : Place the block at the specified position;\n•smeltItem(bot, itemName, fuelName, count = 1) : Smelt the item with the\nspecified fuel. There must be a furnace nearby;\n24•killMob(bot, mobName, timeout = 300) : Attack the mob and collect its\ndropped item;\n•getItemFromChest(bot, chestPosition, itemsToGet) : Move to the chest at\nthe specified position and get items from the chest;\n•depositItemIntoChest(bot, chestPosition, itemsToDeposit) : Move to\nthe chest at the specified position and deposit items into the chest;\n(3) Control primitive APIs provided by Mineflayer:\n•await bot.pathfinder.goto(goal) : Go to a specific position. See below for how\nto set the goal;\n•new GoalNear(x, y, z, range) : Move the bot to a block within the specified\nrange of the specified block;\n•new GoalXZ(x, z) : For long-range goals that don’t have a specific Y level;\n•new GoalGetToBlock(x, y, z) : Not get into the block, but get directly adjacent\nto it. Useful for fishing, farming, filling a bucket, and using a bed.;\n•new GoalFollow(entity, range) : Follow the specified entity within the specified\nrange;\n•new GoalPlaceBlock(position, bot.world, {}) : Position the bot in order to\nplace a block;\n•new GoalLookAtBlock(position, bot.world, {}) : Path towards a position\nwhere a face of the block at position is visible;\n•bot.isABed(bedBlock) : Return true if bedBlock is a bed;\n•bot.blockAt(position) : Return the block at position ;\n•await bot.equip(item, destination) : Equip the item in the specified destina-\ntion. destination must be one of “hand”, “head”, “torso”, “legs”, “feet”, “off-hand”;\n•await bot.consume() : Consume the item in the bot’s hand. You must equip the\nitem to consume first. Useful for eating food, drinking potions, etc.;\n•await bot.fish() : Let bot fish. Before calling this function, you must first get to a\nwater block and then equip a fishing rod. The bot will automatically stop fishing when\nit catches a fish;\n•await bot.sleep(bedBlock) : Sleep until sunrise. You must get to a bed block\nfirst;\n•await bot.activateBlock(block) : This is the same as right-clicking a block in\nthe game. Useful for buttons, doors, etc. You must get to the block first;\n•await bot.lookAt(position) : Look at the specified position. You must go near\nthe position before you look at it. To fill a bucket with water, you must look at it first;\n•await bot.activateItem() : This is the same as right-clicking to use the item in\nthe bot’s hand. Useful for using a bucket, etc. You must equip the item to activate first;\n•await bot.useOn(entity) : This is the same as right-clicking an entity in the game.\nUseful for shearing a sheep. You must get to the entity first;\n(4) Retrieved skills from the skill library;\n(5) Generated code from the last round;\n(6) Environment feedback: The chat log in the prompt;\n(7) Execution errors;\n(8) Critique from the self-verification module;\n(9) The agent’s current state: See Sec. A.3.1 for each element of the agent’s state;\n(10) Task proposed by the automatic curriculum;\n(11) Task context: We prompt GPT-3.5 to ask for general suggestions about how to solve the\ntask. In practice, this part is handled by the automatic curriculum since it has a systematic\nmechanism for question-answering (Sec. A.3.2);\n(12) Chain-of-thought prompting [ 46] in response: We ask GPT-4 to first explain the reason why\nthe code from the last round fails, then give step-by-step plans to finish the task, and finally\ngenerate code. See Sec. A.4.2 for the full prompt.\n25A.4.2 Full Prompt\nPrompt 4: Full system prompt for code generation.\nYou are a helpful assistant that writes Mineflayer javascript code to\ncomplete any Minecraft task specified by me.\nHere are some useful programs written with Mineflayer APIs .\n/*\nExplore until find an iron_ore , use Vec3 (0, -1, 0) because iron ores\nare usually underground\nawait exploreUntil (bot , new Vec3 (0, -1, 0) , 60, () => {\nconst iron_ore = bot. findBlock ({\nmatching : mcData . blocksByName [\" iron_ore \"].id ,\nmaxDistance : 32,\n});\nreturn iron_ore ;\n});\nExplore until find a pig , use Vec3 (1, 0, 1) because pigs are usually\non the surface\nlet pig = await exploreUntil (bot , new Vec3 (1, 0, 1) , 60, () => {\nconst pig = bot. nearestEntity (( entity ) => {\nreturn (\nentity . name === \"pig\" &&\nentity . position . distanceTo (bot. entity . position ) < 32\n);\n});\nreturn pig;\n});\n*/\nasync function exploreUntil (bot , direction , maxTime = 60, callback ) {\n/*\nImplementation of this function is omitted .\ndirection : Vec3 , can only contain value of -1, 0 or 1\nmaxTime : number , the max time for exploration\ncallback : function , early stop condition , will be called each\nsecond , exploration will stop if return value is not null\nReturn : null if explore timeout , otherwise return the return value\nof callback\n*/\n}\n// Mine 3 cobblestone : mineBlock (bot , \" stone \", 3);\nasync function mineBlock (bot , name , count = 1) {\nconst blocks = bot. findBlocks ({\nmatching : ( block ) => {\nreturn block . name === name ;\n},\nmaxDistance : 32,\ncount : count ,\n});\nconst targets = [];\nfor ( let i = 0; i < Math .min ( blocks . length , count ); i ++) {\ntargets . push (bot. blockAt ( blocks [i]));\n}\nawait bot . collectBlock . collect ( targets , { ignoreNoPath : true });\n}\n// Craft 8 oak_planks from 2 oak_log (do the recipe 2 times ):\ncraftItem (bot , \" oak_planks \", 2);\n26// You must place a crafting table before calling this function\nasync function craftItem (bot , name , count = 1) {\nconst item = mcData . itemsByName [ name ];\nconst craftingTable = bot . findBlock ({\nmatching : mcData . blocksByName . crafting_table .id ,\nmaxDistance : 32,\n});\nawait bot . pathfinder . goto (\nnew GoalLookAtBlock ( craftingTable . position , bot. world )\n);\nconst recipe = bot. recipesFor ( item .id , null , 1, craftingTable ) [0];\nawait bot . craft (recipe , count , craftingTable );\n}\n// Place a crafting_table near the player , Vec3 (1, 0, 0) is just an\nexample , you shouldn ’t always use that : placeItem (bot , \"\ncrafting_table \", bot. entity . position . offset (1, 0, 0));\nasync function placeItem (bot , name , position ) {\nconst item = bot. inventory . findInventoryItem ( mcData . itemsByName [\nname ]. id);\n// find a reference block\nconst faceVectors = [\nnew Vec3 (0, 1, 0) ,\nnew Vec3 (0, -1, 0) ,\nnew Vec3 (1, 0, 0) ,\nnew Vec3 (-1, 0, 0) ,\nnew Vec3 (0, 0, 1) ,\nnew Vec3 (0, 0, -1) ,\n];\nlet referenceBlock = null ;\nlet faceVector = null ;\nfor ( const vector of faceVectors ) {\nconst block = bot. blockAt ( position . minus ( vector ));\nif ( block ?. name !== \"air \") {\nreferenceBlock = block ;\nfaceVector = vector ;\nbreak ;\n}\n}\n// You must first go to the block position you want to place\nawait bot . pathfinder . goto (new GoalPlaceBlock ( position , bot.world ,\n{}) );\n// You must equip the item right before calling placeBlock\nawait bot . equip (item , \" hand \");\nawait bot . placeBlock ( referenceBlock , faceVector );\n}\n// Smelt 1 raw_iron into 1 iron_ingot using 1 oak_planks as fuel :\nsmeltItem (bot , \" raw_iron \", \" oak_planks \");\n// You must place a furnace before calling this function\nasync function smeltItem (bot , itemName , fuelName , count = 1) {\nconst item = mcData . itemsByName [ itemName ];\nconst fuel = mcData . itemsByName [ fuelName ];\nconst furnaceBlock = bot . findBlock ({\nmatching : mcData . blocksByName . furnace .id ,\nmaxDistance : 32,\n});\nawait bot . pathfinder . goto (\nnew GoalLookAtBlock ( furnaceBlock . position , bot. world )\n);\nconst furnace = await bot. openFurnace ( furnaceBlock );\nfor ( let i = 0; i < count ; i++) {\nawait furnace . putFuel ( fuel .id , null , 1);\n27await furnace . putInput ( item .id , null , 1);\n// Wait 12 seconds for the furnace to smelt the item\nawait bot . waitForTicks (12 * 20);\nawait furnace . takeOutput ();\n}\nawait furnace . close ();\n}\n// Kill a pig and collect the dropped item : killMob (bot , \"pig\", 300) ;\nasync function killMob (bot , mobName , timeout = 300) {\nconst entity = bot. nearestEntity (\n( entity ) =>\nentity . name === mobName &&\nentity . position . distanceTo (bot. entity . position ) < 32\n);\nawait bot .pvp. attack ( entity );\nawait bot . pathfinder . goto (\nnew GoalBlock ( entity . position .x, entity . position .y, entity .\nposition .z)\n);\n}\n// Get a torch from chest at (30 , 65, 100) : getItemFromChest (bot , new\nVec3 (30 , 65, 100) , {\" torch \": 1}) ;\n// This function will work no matter how far the bot is from the chest\n.\nasync function getItemFromChest (bot , chestPosition , itemsToGet ) {\nawait moveToChest (bot , chestPosition );\nconst chestBlock = bot . blockAt ( chestPosition );\nconst chest = await bot . openContainer ( chestBlock );\nfor ( const name in itemsToGet ) {\nconst itemByName = mcData . itemsByName [ name ];\nconst item = chest . findContainerItem ( itemByName .id);\nawait chest . withdraw ( item .type , null , itemsToGet [ name ]);\n}\nawait closeChest (bot , chestBlock );\n}\n// Deposit a torch into chest at (30 , 65, 100) : depositItemIntoChest (\nbot , new Vec3 (30 , 65, 100) , {\" torch \": 1});\n// This function will work no matter how far the bot is from the chest\n.\nasync function depositItemIntoChest (bot , chestPosition , itemsToDeposit\n) {\nawait moveToChest (bot , chestPosition );\nconst chestBlock = bot . blockAt ( chestPosition );\nconst chest = await bot . openContainer ( chestBlock );\nfor ( const name in itemsToDeposit ) {\nconst itemByName = mcData . itemsByName [ name ];\nconst item = bot. inventory . findInventoryItem ( itemByName .id);\nawait chest . deposit ( item .type , null , itemsToDeposit [ name ]);\n}\nawait closeChest (bot , chestBlock );\n}\n// Check the items inside the chest at (30 , 65, 100) :\ncheckItemInsideChest (bot , new Vec3 (30 , 65, 100) );\n// You only need to call this function once without any action to\nfinish task of checking items inside the chest .\nasync function checkItemInsideChest (bot , chestPosition ) {\nawait moveToChest (bot , chestPosition );\nconst chestBlock = bot . blockAt ( chestPosition );\nawait bot . openContainer ( chestBlock );\n// You must close the chest after opening it if you are asked to\nopen a chest\n28await closeChest (bot , chestBlock );\n}\nawait bot . pathfinder . goto ( goal ); // A very useful function . This\nfunction may change your main - hand equipment .\n// Following are some Goals you can use:\nnew GoalNear (x, y, z, range ); // Move the bot to a block within the\nspecified range of the specified block . ‘x‘, ‘y‘, ‘z‘, and ‘range ‘\nare ‘number ‘\nnew GoalXZ (x, z); // Useful for long - range goals that don ’t have a\nspecific Y level . ‘x‘ and ‘z‘ are ‘number ‘\nnew GoalGetToBlock (x, y, z); // Not get into the block , but get\ndirectly adjacent to it. Useful for fishing , farming , filling\nbucket , and beds . ‘x‘, ‘y‘, and ‘z‘ are ‘number ‘\nnew GoalFollow (entity , range ); // Follow the specified entity within\nthe specified range . ‘entity ‘ is ‘Entity ‘, ‘range ‘ is ‘number ‘\nnew GoalPlaceBlock ( position , bot.world , {}); // Position the bot in\norder to place a block . ‘position ‘ is ‘Vec3 ‘\nnew GoalLookAtBlock ( position , bot.world , {}); // Path into a position\nwhere a blockface of the block at position is visible . ‘position ‘\nis ‘Vec3 ‘\n// These are other Mineflayer functions you can use:\nbot . isABed ( bedBlock ); // Return true if ‘bedBlock ‘ is a bed\nbot . blockAt ( position ); // Return the block at ‘position ‘. ‘position ‘\nis ‘Vec3 ‘\n// These are other Mineflayer async functions you can use:\nawait bot . equip (item , destination ); // Equip the item in the specified\ndestination . ‘item ‘ is ‘Item ‘, ‘destination ‘ can only be \" hand \",\n\" head \", \" torso \", \" legs \", \" feet \", \"off - hand \"\nawait bot . consume (); // Consume the item in the bot ’s hand . You must\nequip the item to consume first . Useful for eating food , drinking\npotions , etc.\nawait bot . fish (); // Let bot fish . Before calling this function , you\nmust first get to a water block and then equip a fishing rod. The\nbot will automatically stop fishing when it catches a fish\nawait bot . sleep ( bedBlock ); // Sleep until sunrise . You must get to a\nbed block first\nawait bot . activateBlock ( block ); // This is the same as right - clicking\na block in the game . Useful for buttons , doors , using hoes , etc.\nYou must get to the block first\nawait bot . lookAt ( position ); // Look at the specified position . You\nmust go near the position before you look at it. To fill bucket\nwith water , you must lookAt first . ‘position ‘ is ‘Vec3 ‘\nawait bot . activateItem (); // This is the same as right - clicking to use\nthe item in the bot ’s hand . Useful for using buckets , etc. You\nmust equip the item to activate first\nawait bot . useOn ( entity ); // This is the same as right - clicking an\nentity in the game . Useful for shearing sheep , equipping harnesses\n, etc . You must get to the entity first\n{ retrieved_skills }\nAt each round of conversation , I will give you\nCode from the last round : ...\nExecution error : ...\nChat log: ...\nBiome : ...\nTime : ...\nNearby blocks : ...\nNearby entities ( nearest to farthest ):\nHealth : ...\n29Hunger : ...\nPosition : ...\nEquipment : ...\nInventory (xx /36) : ...\nChests : ...\nTask : ...\nContext : ...\nCritique : ...\nYou should then respond to me with\nExplain (if applicable ): Are there any steps missing in your plan ? Why\ndoes the code not complete the task ? What does the chat log and\nexecution error imply ?\nPlan : How to complete the task step by step . You should pay attention\nto Inventory since it tells what you have . The task completeness\ncheck is also based on your final inventory .\nCode :\n1) Write an async function taking the bot as the only argument .\n2) Reuse the above useful programs as much as possible .\n- Use ‘mineBlock (bot , name , count )‘ to collect blocks . Do not\nuse ‘bot.dig ‘ directly .\n- Use ‘craftItem (bot , name , count )‘ to craft items . Do not use\n‘bot.craft ‘ directly .\n- Use ‘smeltItem (bot , name count )‘ to smelt items . Do not use\n‘bot. openFurnace ‘ directly .\n- Use ‘placeItem (bot , name , position )‘ to place blocks . Do not\nuse ‘bot. placeBlock ‘ directly .\n- Use ‘killMob (bot , name , timeout )‘ to kill mobs . Do not use ‘\nbot .attack ‘ directly .\n3) Your function will be reused for building more complex\nfunctions . Therefore , you should make it generic and reusable . You\nshould not make strong assumption about the inventory (as it may\nbe changed at a later time ), and therefore you should always check\nwhether you have the required items before using them . If not ,\nyou should first collect the required items and reuse the above\nuseful programs .\n4) Functions in the \" Code from the last round \" section will not be\nsaved or executed . Do not reuse functions listed there .\n5) Anything defined outside a function will be ignored , define all\nyour variables inside your functions .\n6) Call ‘bot.chat ‘ to show the intermediate progress .\n7) Use ‘exploreUntil (bot , direction , maxDistance , callback )‘ when\nyou cannot find something . You should frequently call this before\nmining blocks or killing mobs . You should select a direction at\nrandom every time instead of constantly using (1, 0, 1).\n8) ‘maxDistance ‘ should always be 32 for ‘bot . findBlocks ‘ and ‘bot\n. findBlock ‘. Do not cheat .\n9) Do not write infinite loops or recursive functions .\n10) Do not use ‘bot.on ‘ or ‘bot .once ‘ to register event listeners .\nYou definitely do not need them .\n11) Name your function in a meaningful way (can infer the task\nfrom the name ).\nYou should only respond in the format as described below :\nRESPONSE FORMAT :\nExplain : ...\nPlan :\n1) ...\n2) ...\n3) ...\n...\nCode :\n‘‘‘ javascript\n// helper functions ( only if needed , try to avoid them )\n...\n30// main function after the helper functions\nasync function yourMainFunctionName (bot) {\n// ...\n}\n‘‘‘\nPrompt 5: Full system prompt for generating function descriptions. This is used when adding a new\nskill to the skill library. We give a one-shot example in the prompt.\nYou are a helpful assistant that writes a description of the given\nfunction written in Mineflayer javascript code .\n1) Do not mention the function name .\n2) Do not mention anything about ‘bot.chat ‘ or helper functions .\n3) There might be some helper functions before the main function , but\nyou only need to describe the main function .\n4) Try to summarize the function in no more than 6 sentences .\n5) Your response should be a single line of text .\nFor example , if the function is:\nasync function mineCobblestone ( bot) {\n// Check if the wooden pickaxe is in the inventory , if not , craft\none\nlet woodenPickaxe = bot. inventory . findInventoryItem ( mcData .\nitemsByName [\" wooden_pickaxe \"]. id);\nif (! woodenPickaxe ) {\nbot . chat (\" Crafting a wooden pickaxe .\") ;\nawait craftWoodenPickaxe (bot );\nwoodenPickaxe = bot. inventory . findInventoryItem ( mcData . itemsByName\n[\" wooden_pickaxe \"]. id);\n}\n// Equip the wooden pickaxe if it exists\nif ( woodenPickaxe ) {\nawait bot . equip ( woodenPickaxe , \" hand \");\n// Explore until we find a stone block\nawait exploreUntil (bot , new Vec3 (1, -1, 1) , 60, () => {\nconst stone = bot. findBlock ({\nmatching : mcData . blocksByName [\" stone \"].id ,\nmaxDistance : 32\n});\nif ( stone ) {\nreturn true ;\n}\n});\n// Mine 8 cobblestone blocks using the wooden pickaxe\nbot . chat (\" Found a stone block . Mining 8 cobblestone blocks .\");\nawait mineBlock (bot , \" stone \", 8);\nbot . chat (\" Successfully mined 8 cobblestone blocks .\");\n// Save the event of mining 8 cobblestone\nbot . save (\" cobblestone_mined \");\n} else {\nbot . chat (\" Failed to craft a wooden pickaxe . Cannot mine\ncobblestone .\") ;\n}\n}\nThe main function is ‘mineCobblestone ‘.\nThen you would write :\n31The function is about mining 8 cobblestones using a wooden pickaxe .\nFirst check if a wooden pickaxe is in the inventory . If not , craft\none . If the wooden pickaxe is available , equip the wooden pickaxe\nin the hand . Next , explore the environment until finding a stone\nblock . Once a stone block is found , mine a total of 8 cobblestone\nblocks using the wooden pickaxe .\nA.4.3 Examples\nSkill library example 1: craftWoodenPlanks.\nasync function craftWoodenPlanks ( bot) {\nconst logNames = [\" oak_log \", \" birch_log \", \" spruce_log \", \" jungle_log \"\n, \" acacia_log \", \" dark_oak_log \", \" mangrove_log \"];\nconst plankNames = [\" oak_planks \", \" birch_planks \", \" spruce_planks \", \"\njungle_planks \", \" acacia_planks \", \" dark_oak_planks \", \"\nmangrove_planks \"];\nconst logInInventory = logNames . find ( logName => bot. inventory . count (\nmcData . itemsByName [ logName ]. id) > 0);\nif (! logInInventory ) {\nbot . chat (\"No wooden log in inventory . Mining a wooden log ...\");\nawait mineWoodLog (bot);\n}\nconst logIndex = logNames . indexOf ( logInInventory );\nconst plankName = plankNames [ logIndex ];\nbot . chat (‘ Crafting 4 ${ plankName }... ‘);\nawait craftItem (bot , plankName , 1);\nbot . chat (‘4 ${ plankName } crafted .‘);\n}\nSkill library example 2: mineTenCobbledDeepslateBelowY0.\nasync function mineTenCobbledDeepslateBelowY0 (bot) {\n// Equip the iron pickaxe\nconst ironPickaxe = bot . inventory . findInventoryItem ( mcData .\nitemsByName [\" iron_pickaxe \"]. id);\nawait bot . equip ( ironPickaxe , \" hand \");\n// Find cobbled_deepslate blocks below Y=0\nconst cobbledDeepslateBlocks = await exploreUntil (bot , new Vec3 (1,\n-1, 1) , 60, () => {\nconst cobbledDeepslate = bot. findBlock ({\nmatching : mcData . blocksByName [\" cobbled_deepslate \"].id ,\nmaxDistance : 32,\nposition : pos => pos.y < 0\n});\nreturn cobbledDeepslate ;\n});\n// Mine 10 cobbled_deepslate blocks\nawait mineBlock (bot , \" cobbled_deepslate \", 10) ;\nbot . chat (\"10 cobbled_deepslate mined below Y=0.\");\n}\nSkill library example 3: smeltFiveRawIronV2.\nasync function findSuitablePosition (bot) {\nconst offsets = [new Vec3 (1, 0, 0) , new Vec3 (-1, 0, 0) , new Vec3 (0,\n0, 1) , new Vec3 (0, 0, -1) ];\nfor ( const offset of offsets ) {\nconst position = bot. entity . position . offset ( offset .x, offset .y,\noffset .z);\nconst block = bot. blockAt ( position );\n32if ( block . name === \" air\") {\nreturn position ;\n}\n}\nreturn null ;\n}\nasync function smeltFiveRawIron ( bot) {\n// Check if there is a furnace in the inventory\nconst furnaceItem = bot . inventory . findInventoryItem ( mcData .\nitemsByName . furnace .id);\n// If not , craft a furnace using the available cobblestone\nif (! furnaceItem ) {\nawait craftFurnace (bot);\n}\n// Find a suitable position to place the furnace\nconst furnacePosition = await findSuitablePosition (bot);\nif (! furnacePosition ) {\nbot . chat (\" Could not find a suitable position to place the furnace .\n\");\nreturn ;\n}\n// Place the furnace at the suitable position\nawait placeItem (bot , \" furnace \", furnacePosition );\n// Smelt 5 raw iron using the available coal as fuel\nawait smeltItem (bot , \" raw_iron \", \" coal \", 5);\nbot . chat (\"5 raw iron smelted .\");\n}\nSkill library example 4: fillBucketWithWater.\nasync function fillBucketWithWater (bot) {\n// Find a water block nearby\nconst waterBlock = await exploreUntil (bot , new Vec3 (1, 0, 1) , 60, ()\n=> {\nconst water = bot. findBlock ({\nmatching : mcData . blocksByName . water .id ,\nmaxDistance : 32\n});\nreturn water ;\n});\nif (! waterBlock ) {\nbot . chat (\" Could not find water .\");\nreturn ;\n}\n// Go to the water block\nconst adjacentBlock = waterBlock . position . offset (0, 1, 0);\nawait bot . pathfinder . goto (\nnew GoalGetToBlock ( adjacentBlock .x, adjacentBlock .y,\nadjacentBlock .z)\n);\n// Look at the water block\nawait bot . lookAt ( waterBlock . position , true );\n// Equip the bucket\nconst bucket = bot. inventory . findInventoryItem ( mcData . itemsByName .\nbucket .id);\nawait bot . equip (bucket , \" hand \");\n33// Activate the bucket to collect water\nawait bot . activateItem ();\nbot . chat (\" Filled the bucket with water .\");\n}\nSkill library example 5: catchFiveFishSafely.\nasync function catchFiveFishSafely (bot) {\n// Check if the bot has a fishing rod in its inventory\nlet fishingRod = bot. inventory . findInventoryItem ( mcData . itemsByName .\nfishing_rod .id);\nif (! fishingRod ) {\nawait craftFishingRod (bot);\nfishingRod = bot. inventory . findInventoryItem ( mcData . itemsByName .\nfishing_rod .id);\n}\n// Find a nearby water block\nlet waterBlock ;\nwhile (! waterBlock ) {\nwaterBlock = await exploreUntil (bot , new Vec3 (1, 0, 1) , 60, () =>\n{\nconst foundWaterBlock = bot. findBlock ({\nmatching : mcData . blocksByName . water .id ,\nmaxDistance : 32\n});\nreturn foundWaterBlock ;\n});\nif (! waterBlock ) {\nbot . chat (\"No path to the water block . Trying to find another\nwater block ...\");\n}\n}\n// Move to a block adjacent to the water block\nconst adjacentBlock = waterBlock . position . offset (0, 1, 0);\nawait bot . pathfinder . goto (new GoalBlock ( adjacentBlock .x,\nadjacentBlock .y, adjacentBlock .z));\n// Look at the water block\nawait bot . lookAt ( waterBlock . position );\n// Equip the fishing rod\nawait bot . equip ( fishingRod , \" hand \");\n// Fish in the water 5 times\nfor ( let i = 0; i < 5; i++) {\ntry {\nawait bot . fish ();\nbot . chat (‘ Fish ${i + 1} caught .‘);\n} catch ( error ) {\nif ( error . message === \" Fishing cancelled \") {\nbot . chat (\" Fishing was cancelled . Trying again ...\");\ni --; // Retry the same iteration\n} else {\nthrow error ;\n}\n}\n}\n}\n34A.5 Self-Verification\nA.5.1 Components in the Prompt\nThe input prompt to GPT-4 consists of the following components:\n(1)The agent’s state: We exclude other blocks that are recently seen and nearby entities from the\nagent’s state since they are not useful for assessing the task’s completeness. See Sec. A.3.1\nfor each element of the agent’s state;\n(2) Task proposed by the automatic curriculum;\n(3)Task context: We prompt GPT-3.5 to ask for general suggestions about how to solve the\ntask. In practice, this part is handled by the automatic curriculum since it has a systematic\nmechanism for question-answering (Sec. A.3.2);\n(4)Chain-of-thought prompting [ 46] in response: We request GPT-4 to initially reason about\nthe task’s success or failure, then output a boolean variable indicating the task’s outcome,\nand finally provide a critique to the agent if the task fails.\n(5) Few-shot examples for in-context learning [36–38].\nA.5.2 Full Prompt\nPrompt 6: Full system prompt for self-verification.\nYou are an assistant that assesses my progress of playing Minecraft\nand provides useful guidance .\nYou are required to evaluate if I have met the task requirements .\nExceeding the task requirements is also considered a success while\nfailing to meet them requires you to provide critique to help me\nimprove .\nI will give you the following information :\nBiome : The biome after the task execution .\nTime : The current time .\nNearby blocks : The surrounding blocks . These blocks are not collected\nyet . However , this is useful for some placing or planting tasks .\nHealth : My current health .\nHunger : My current hunger level . For eating task , if my hunger level\nis 20.0 , then I successfully ate the food .\nPosition : My current position .\nEquipment : My final equipment . For crafting tasks , I sometimes equip\nthe crafted item .\nInventory (xx /36) : My final inventory . For mining and smelting tasks ,\nyou only need to check inventory .\nChests : If the task requires me to place items in a chest , you can\nfind chest information here .\nTask : The objective I need to accomplish .\nContext : The context of the task .\nYou should only respond in JSON format as described below :\n{\n\" reasoning \": \" reasoning \",\n\" success \": boolean ,\n\" critique \": \" critique \",\n}\nEnsure the response can be parsed by Python ‘json .loads ‘, e.g.: no\ntrailing commas , no single quotes , etc.\nHere are some examples :\nINPUT :\nInventory (2/36) : {’ oak_log ’:2 , ’spruce_log ’:2}\n35Task : Mine 3 wood logs\nRESPONSE :\n{\n\" reasoning \": \"You need to mine 3 wood logs . You have 2 oak logs\nand 2 spruce logs , which add up to 4 wood logs .\",\n\" success \": true ,\n\" critique \": \"\"\n}\nINPUT :\nInventory (3/36) : {’ crafting_table ’: 1, ’spruce_planks ’: 6, ’stick ’:\n4}\nTask : Craft a wooden pickaxe\nRESPONSE :\n{\n\" reasoning \": \"You have enough materials to craft a wooden pickaxe ,\nbut you didn ’t craft it .\",\n\" success \": false ,\n\" critique \": \" Craft a wooden pickaxe with a crafting table using 3\nspruce planks and 2 sticks .\"\n}\nINPUT :\nInventory (2/36) : {’ raw_iron ’: 5, ’stone_pickaxe ’: 1}\nTask : Mine 5 iron_ore\nRESPONSE :\n{\n\" reasoning \": \" Mining iron_ore in Minecraft will get raw_iron . You\nhave 5 raw_iron in your inventory .\",\n\" success \": true ,\n\" critique \": \"\"\n}\nINPUT :\nBiome : plains\nNearby blocks : stone , dirt , grass_block , grass , farmland , wheat\nInventory (26/36) : ...\nTask : Plant 1 wheat seed .\nRESPONSE :\n{\n\" reasoning \": \"For planting tasks , inventory information is useless\n. In nearby blocks , there is farmland and wheat , which means you\nsucceed to plant the wheat seed .\",\n\" success \": true ,\n\" critique \": \"\"\n}\nINPUT :\nInventory (11/36) : {... ,’ rotten_flesh ’: 1}\nTask : Kill 1 zombie\nContext : ...\nRESPONSE\n{\n36\" reasoning \": \"You have rotten flesh in your inventory , which means\nyou successfully killed one zombie .\",\n\" success \": true ,\n\" critique \": \"\"\n}\nINPUT :\nHunger : 20.0/20.0\nInventory (11/36) : ...\nTask : Eat 1 ...\nContext : ...\nRESPONSE\n{\n\" reasoning \": \"For all eating task , if the player ’s hunger is 20.0 ,\nthen the player successfully ate the food .\",\n\" success \": true ,\n\" critique \": \"\"\n}\nINPUT :\nNearby blocks : chest\nInventory (28/36) : {’rail ’: 1, ’coal ’: 2, ’oak_planks ’: 13, ’\ncopper_block ’: 1, ’diorite ’: 7, ’cooked_beef ’: 4, ’granite ’: 22, ’\ncobbled_deepslate ’: 23, ’feather ’: 4, ’leather ’: 2, ’\ncooked_chicken ’: 3, ’white_wool ’: 2, ’stick ’: 3, ’black_wool ’: 1,\n’stone_sword ’: 2, ’stone_hoe ’: 1, ’stone_axe ’: 2, ’stone_shovel ’:\n2, ’cooked_mutton ’: 4, ’cobblestone_wall ’: 18, ’crafting_table ’:\n1, ’furnace ’: 1, ’iron_pickaxe ’: 1, ’stone_pickaxe ’: 1, ’\nraw_copper ’: 12}\nChests :\n(81 , 131 , 16) : {’ andesite ’: 2, ’dirt ’: 2, ’cobblestone ’: 75, ’\nwooden_pickaxe ’: 1, ’wooden_sword ’: 1}\nTask : Deposit useless items into the chest at (81 , 131 , 16)\nContext : ...\nRESPONSE\n{\n\" reasoning \": \"You have 28 items in your inventory after depositing\n, which is more than 20. You need to deposit more items from your\ninventory to the chest .\",\n\" success \": false ,\n\" critique \": \" Deposit more useless items such as copper_block ,\ndiorite , granite , cobbled_deepslate , feather , and leather to meet\nthe requirement of having only 20 occupied slots in your inventory\n.\"\n}\nA.6 System-level Comparison between V OYAGER and Prior Works\nWe make a system-level comparison in Table. A.2. V oyager stands out as the only method featuring a\ncombination of automatic curriculum, iterative planning, and a skill library. Moreover, it learns to\nplay Minecraft without the need for any gradient update.\n37Table A.2: System-level comparison between V OYAGER and prior works.\nVPT [8] DreamerV3 [ 69]DECKARD [ 53]DEPS [55] Plan4MC [ 71]VOYAGER\nDemos Videos None Videos None None None\nRewards Sparse Dense Sparse None Dense None\nObservations Pixels Only Pixels &\nMetaPixels &\nInventoryFeedback &\nInventoryPixels &\nMetaFeedback &\nMeta &\nInventory\nActions Keyboard &\nMouseDiscrete Keyboard &\nMouseKeyboard &\nMouseDiscrete Code\nAutomatic\nCurriculum✓ ✓\n(in-context\nGPT-4 pro-\nposal)\nIterative Plan-\nning✓ ✓\n(3 types of\nfeedback)\nSkill Library ✓\n(pre-defined)✓\n(self-\ngenerated)\nGradient-Free ✓\nB Experiments\nB.1 Experimental Setup\nOur simulation environment is built upon MineDojo [ 23] and utilizes Mineflayer [ 52] JavaScript APIs\nfor motor controls (Sec. A.4.2). Additionally, we incorporate many bot.chat() into Mineflayer\nfunctions to provide abundant environment feedback and implement various condition checks along\nwith try-catch exceptions for continuous execution. If the bot dies, it is resurrected near the closest\nground, and its inventory is preserved for uninterrupted exploration. The bot recycles its crafting table\nand furnace after program execution. For detailed implementations, please refer to our codebase.\nB.2 Baselines\nReAct [29] uses chain-of-thought prompting [ 46] by generating both reasoning traces and action\nplans with LLMs. We provide it with our environment feedback and the agent states as observations.\nReAct undergoes one round of code generation from scratch, followed by three rounds of code\nrefinement. This process is then repeated until the maximum prompting iteration is reached.\nReflexion [30] is built on top of ReAct [ 29] with self-reflection to infer more intuitive future actions.\nWe provide it with environment feedback, the agent states, execution errors, and our self-verification\nmodule. Similar to ReAct, Reflexion undergoes one round of code generation from scratch, followed\nby three rounds of code refinement. This process is then repeated until the maximum prompting\niteration is reached.\nAutoGPT [28] is a popular software tool that automates NLP tasks by decomposing a high-level goal\ninto multiple subgoals and executing them in a ReAct-style loop. We re-implement AutoGPT by\nusing GPT-4 to do task decomposition and provide it with the agent states, environment feedback,\nand execution errors as observations for subgoal execution. Compared with VOYAGER , AutoGPT\nlacks the skill library for accumulating knowledge, self-verification for assessing task success, and\nautomatic curriculum for open-ended exploration. During each subgoal execution, if no execution\nerror occurs, we consider the subgoal completed and proceed to the next one. Otherwise, we refine\nthe program until three rounds of code refinement (equivalent to four rounds of code generation)\nare completed, and then move on to the next subgoal. If three consecutive subgoals do not result in\nacquiring a new item, we replan by rerunning the task decomposition.\nThe task is “explore the world and get as many items as possible” for all baselines.\n38Table A.3: Comparison between V OYAGER and baselines.\nReAct [29] Reflexion [30] AutoGPT [28] V OYAGER\nChain-of-Thought [46] ✓ ✓ ✓ ✓\nSelf Verification ✓ ✓\nEnvironment Feedback ✓ ✓ ✓ ✓\nExecution Errors ✓ ✓ ✓\nAgent State ✓ ✓ ✓ ✓\nSkill Library ✓\nAutomatic Curriculum ✓\nFigure A.1: Minecraft item icons with corresponding names.\nB.3 Ablations\nWe ablate 6 design choices (automatic curriculum, skill library, environment feedback, execution\nerrors, self-verification, and GPT-4 for code generation) in VOYAGER and study their impact on\nexploration performance.\n•Manual Curriculum : We substitute the automatic curriculum with a manually designed\ncurriculum for mining a diamond: “Mine 3 wood log”, “Craft 1 crafting table”, “Craft\n1 wooden pickaxe”, “Mine 11 cobblestone”, “Craft 1 stone pickaxe”, “Craft 1 furnace”,\n“Mine 3 iron ore”, “Smelt 3 iron ore”, “Craft 1 iron pickaxe”, “Mine 1 diamond”. A manual\ncurriculum requires human effort to design and is not scalable for open-ended exploration.\n•Random Curriculum : We curate 101 items obtained by VOYAGER and create a random\ncurriculum by randomly selecting one item as the next task.\n•w/o Skill Library : We remove the skill library, eliminating skill retrieval for code generation.\n•w/o Environment Feedback : We exclude environment feedback (chat log) from the prompt\nfor code generation.\n•w/o Execution Errors : We exclude execution errors from the prompt for code generation.\n•w/o Self-Verification : For each task, we generate code without self-verification and it-\neratively refine the program for 3 rounds (equivalent to 4 rounds of code generation in\ntotal).\n•GPT-3.5 : We replace GPT-4 with GPT-3.5 for code generation. We retain GPT-4 for the\nautomatic curriculum and the self-verification module.\nB.4 Evaluation Results\nB.4.1 Significantly Better Exploration\nThe meaning of each icon in Fig. 1 is shown in Fig. A.1.\nWe run three trials for each method. The items collected by V OYAGER in each trial is\n39•Trial 1 : ‘iron_ingot’, ‘stone_shovel’, ‘iron_leggings’, ‘fishing_rod’, ‘pufferfish’,\n‘oak_log’, ‘cooked_mutton’, ‘green_dye’, ‘flint’, ‘chest’, ‘iron_sword’, ‘string’, ‘en-\nder_pearl’, ‘raw_copper’, ‘crafting_table’, ‘cactus’, ‘lapis_lazuli’, ‘iron_pickaxe’, ‘cop-\nper_ingot’, ‘stone_pickaxe’, ‘wooden_hoe’, ‘scaffolding’, ‘stick’, ‘porkchop’, ‘cop-\nper_block’, ‘gravel’, ‘grass_block’, ‘white_bed’, ‘bone’, ‘dirt’, ‘mutton’, ‘white_wool’,\n‘oak_sapling’, ‘coal’, ‘bamboo’, ‘wooden_pickaxe’, ‘rotten_flesh’, ‘cooked_porkchop’,\n‘cod’, ‘iron_boots’, ‘lightning_rod’, ‘diorite’, ‘water_bucket’, ‘shears’, ‘furnace’, ‘andesite’,\n‘granite’, ‘bucket’, ‘wooden_sword’, ‘sandstone’, ‘iron_helmet’, ‘raw_iron’, ‘sand’, ‘aca-\ncia_log’, ‘cooked_cod’, ‘oak_planks’, ‘azure_bluet’, ‘iron_shovel’, ‘acacia_planks’, ‘shield’,\n‘iron_axe’, ‘iron_chestplate’, ‘cobblestone’;\n•Trial 2 : ‘iron_ingot’, ‘tuff’, ‘stone_shovel’, ‘iron_leggings’, ‘fishing_rod’, ‘cooked_mutton’,\n‘spruce_planks’, ‘gunpowder’, ‘amethyst_shard’, ‘chest’, ‘string’, ‘cooked_salmon’,\n‘iron_sword’, ‘raw_copper’, ‘crafting_table’, ‘torch’, ‘lapis_lazuli’, ‘iron_pickaxe’, ‘cop-\nper_ingot’, ‘stone_pickaxe’, ‘wooden_hoe’, ‘stick’, ‘amethyst_block’, ‘salmon’, ‘cal-\ncite’, ‘gravel’, ‘white_bed’, ‘bone’, ‘dirt’, ‘mutton’, ‘white_wool’, ‘spyglass’, ‘coal’,\n‘wooden_pickaxe’, ‘cod’, ‘iron_boots’, ‘lily_pad’, ‘cobbled_deepslate’, ‘lightning_rod’,\n‘snowball’, ‘stone_axe’, ‘smooth_basalt’, ‘diorite’, ‘water_bucket’, ‘furnace’, ‘andesite’,\n‘bucket’, ‘granite’, ‘shield’, ‘iron_helmet’, ‘raw_iron’, ‘cobblestone’, ‘spruce_log’,\n‘cooked_cod’, ‘tripwire_hook’, ‘stone_hoe’, ‘iron_chestplate’, ‘stone_sword’;\n•Trial 3 : ‘spruce_planks’, ‘dirt’, ‘shield’, ‘redstone’, ‘clock’, ‘diamond_sword’,\n‘iron_chestplate’, ‘stone_pickaxe’, ‘leather’, ‘string’, ‘chicken’, ‘chest’, ‘diorite’,\n‘iron_leggings’, ‘black_wool’, ‘cobblestone_wall’, ‘cobblestone’, ‘cooked_chicken’,\n‘feather’, ‘stone_sword’, ‘raw_gold’, ‘gravel’, ‘birch_planks’, ‘coal’, ‘cobbled_deepslate’,\n‘oak_planks’, ‘iron_pickaxe’, ‘granite’, ‘tuff’, ‘crafting_table’, ‘iron_helmet’, ‘stone_hoe’,\n‘iron_ingot’, ‘stone_axe’, ‘birch_boat’, ‘stick’, ‘sand’, ‘bone’, ‘raw_iron’, ‘beef’, ‘rail’,\n‘oak_sapling’, ‘kelp’, ‘gold_ingot’, ‘birch_log’, ‘wheat_seeds’, ‘cooked_mutton’, ‘furnace’,\n‘arrow’, ‘stone_shovel’, ‘white_wool’, ‘andesite’, ‘jungle_slab’, ‘mutton’, ‘iron_sword’,\n‘copper_ingot’, ‘diamond’, ‘torch’, ‘oak_log’, ‘cooked_beef’, ‘copper_block’, ‘flint’,\n‘bone_meal’, ‘raw_copper’, ‘wooden_pickaxe’, ‘iron_boots’, ‘wooden_sword’.\nThe items collected by ReAct [29] in each trial is\n•Trial 1 : ‘bamboo’, ‘dirt’, ‘sand’, ‘wheat_seeds’;\n•Trial 2 : ‘dirt’, ‘rabbit’, ‘spruce_log’, ‘spruce_sapling’;\n•Trial 3 : ‘dirt’, ‘pointed_dripstone’;\nThe items collected by Reflexion [30] in each trial is\n•Trial 1 : ‘crafting_table’, ‘orange_tulip’, ‘oak_planks’, ‘oak_log’, ‘dirt’;\n•Trial 2 : ‘spruce_log’, ‘dirt’, ‘clay_ball’, ‘sand’, ‘gravel’;\n•Trial 3 : ‘wheat_seeds’, ‘oak_log’, ‘dirt’, ‘birch_log’, ‘sand’.\nThe items collected by AutoGPT [28] in each trial is\n•Trial 1 : ‘feather’, ‘oak_log’, ‘leather’, ‘stick’, ‘porkchop’, ‘chicken’, ‘crafting_table’,\n‘wheat_seeds’, ‘oak_planks’, ‘dirt’, ‘mutton’;\n•Trial 2 : ‘wooden_pickaxe’, ‘iron_ingot’, ‘stone’, ‘coal’, ‘spruce_planks’, ‘string’,\n‘raw_copper’, ‘crafting_table’, ‘diorite’, ‘andesite’, ‘furnace’, ‘torch’, ‘spruce_sapling’,\n‘granite’, ‘iron_pickaxe’, ‘stone_pickaxe’, ‘wooden_axe’, ‘raw_iron’, ‘stick’, ‘spruce_log’,\n‘dirt’, ‘cobblestone’;\n•Trial 3 : ‘wooden_shovel’, ‘wooden_pickaxe’, ‘iron_ingot’, ‘stone’, ‘cod’, ‘coal’, ‘oak_log’,\n‘flint’, ‘raw_copper’, ‘crafting_table’, ‘diorite’, ‘furnace’, ‘andesite’, ‘torch’, ‘granite’,\n‘lapis_lazuli’, ‘iron_pickaxe’, ‘stone_pickaxe’, ‘raw_iron’, ‘stick’, ‘gravel’, ‘oak_planks’,\n‘dirt’, ‘iron_axe’, ‘cobblestone’.\n40Figure A.2: Map coverage: Two bird’s eye views of Minecraft maps. VOYAGER is able to traverse\n2.3×longer distances compared to baselines while crossing diverse terrains. Trajectories are plotted\nbased on the positions where each agent interacts with GPT-4.\nB.4.2 Extensive Map Traversal\nAgent trajectories for map coverage are displayed in Fig. A.2. Fig. 7 is plotted based on Fig. A.2 by\ndrawing the smallest circle enclosing each trajectory. The terrains traversed by VOYAGER in each\ntrial is\n•Trial 1 : ‘meadow’, ‘desert’, ‘river’, ‘savanna’, ‘forest’, ‘plains’, ‘bamboo_jungle’, ‘drip-\nstone_caves’;\n•Trial 2 : ‘snowy_plains’, ‘frozen_river’, ‘dripstone_caves’, ‘snowy_taiga’, ‘beach’;\n•Trial 3 : ‘flower_forest’, ‘meadow’, ‘old_growth_birch_forest’, ‘snowy_slopes’,\n‘frozen_peaks’, ‘forest’, ‘river’, ‘beach’, ‘ocean’, ‘sunflower_plains’, ‘plains’, ‘stony_shore’.\nThe terrains traversed by ReAct [29] in each trial is\n•Trial 1 : ‘plains’, ‘desert’, ‘jungle’;\n•Trial 2 : ‘snowy_plains’, ‘snowy_taiga’, ‘snowy_slopes’;\n•Trial 3 : ‘dark_forest’, ‘dripstone_caves’, ‘grove’, ‘jagged_peaks’.\nThe terrains traversed by Reflexion [30] in each trial is\n•Trial 1 : ‘plains’, ‘flower_forest’;\n•Trial 2 : ‘snowy_taiga’;\n•Trial 3 : ‘old_growth_birch_forest’, ‘river’, ‘ocean’, ‘beach’, ‘plains’.\nThe terrains traversed by AutoGPT [28] in each trial is\n•Trial 1 : ‘plains’, ‘dripstone_caves’, ‘savanna’, ‘meadow’;\n•Trial 2 : ‘snowy_taiga’;\n•Trial 3 : ‘plains’, ‘stony_shore’, ‘forest’, ‘ocean’.\nB.4.3 Efficient Zero-Shot Generalization to Unseen Tasks\nThe results of zero-shot generalization to unseen tasks for the other two tasks are presented in Fig. A.3.\nSimilar to Fig. 8, VOYAGER consistently solves all tasks, while the baselines are unable to solve any\n41Figure A.3: Zero-shot generalization to unseen tasks. We visualize the intermediate progress of each\nmethod on the other two tasks. We do not plot ReAct and Reflexion since they do not make any\nmeaningful progress.\ntask within 50 prompting iterations. Our skill library, constructed from lifelong learning, not only\nenhances V OYAGER ’s performance but also provides a boost to AutoGPT [28].\nB.4.4 Accurate Skill Retrieval\nWe conduct an evaluation of our skill retrieval (309 samples in total) and the results are in Table. A.4.\nThe top-5 accuracy standing at 96.5% suggests our retrieval process is reliable (note that we include\nthe top-5 relevant skills in the prompt for synthesizing a new skill).\nTable A.4: Skill retrieval accuracy.\nTop-1 Acc Top-2 Acc Top-3 Acc Top-4 Acc Top-5 Acc\n80.2±3.0 89 .3±1.8 93 .2±0.7 95 .2±1.8 96 .5±0.3\nB.4.5 Robust to Model Variations\nIn the main paper, all of V oyager’s experiments are conducted with gpt-4-0314 . We additionally\nrun new experiments with gpt-4-0613 and find that the performance is roughly the same (Fig. A.4).\nIt demonstrates that V oyager is robust to model variations.\nFigure A.4: V OYAGER ’s performance with GPT-4-0314 and GPT-4-0613.\n42"}


--------------------------------------------------------------------------------