├── .gitignore
├── README.md
├── ai-agent-from-scratch.md
├── auto-email-response-outreach.md
├── building-a-speedy-resilient-web-scraper-for-rag-ai-part1-preparing.md
├── building-a-speedy-resilient-web-scraper-for-rag-ai-part2-scaling-up.md
├── crawl-agent-with-autogen.md
├── crewai-spider-research-agent.md
├── extracting-contacts.md
├── images
    ├── anti_bot
    │   ├── abrahamjuliot_github_io_creepjs.png
    │   ├── bot_detector_rebrowser_net.png
    │   ├── bot_sannysoft_com.png
    │   ├── demo_fingerprint_com_playground.png
    │   ├── deviceandbrowserinfo_com_are_you_a_bot.png
    │   ├── deviceandbrowserinfo_com_info_device.png
    │   └── www_browserscan_net_bot_detection.png
    ├── spider-logo-github-dark.png
    └── spider-logo-github-light.png
├── langchain-groq.md
├── proxy-mode.md
├── spider-api.md
└── website-archiving.md


/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | <a href="https://spider.cloud" target="_blank">
 2 |   <picture>
 3 |     <source media="(prefers-color-scheme: dark)" srcset="images/spider-logo-github-dark.png" style="max-width: 100%; width: 300px; margin-bottom: 20px">
 4 |     <img alt="Spider Logo" src="/images/spider-logo-github-light.png" width="300px">
 5 |   </picture>
 6 | </a>
 7 | 
 8 | # Spider Web Crawling and Scraping Guides
 9 | 
10 | This repo contains a collection of guides on how to effectively use the Spider service to crawl or scrape. Contributors are welcome! 😁
11 | 
12 | ## Collection
13 | 
14 | - [Using the Spider API](spider-api.md)
15 | - [How to Use Proxy Mode](proxy-mode.md)
16 | - [LangChain + Groq + Spider = 🚀 (Integration Guide)](langchain-groq.md)
17 | - [CrewAI Spider Stock Research](crewai-spider-research-agent.md)
18 | - [Extracting Contacts](extracting-contacts.md)
19 | - [Automated Cold Email Outreach Using Spider](auto-email-response-outreach.md)
20 | - [How to Archive Full Website](website-archiving.md)
21 | - Building A Speedy Resilient Web Scraper for RAG AI ([Part 1](building-a-speedy-resilient-web-scraper-for-rag-ai-part1-preparing.md), [Part 2](building-a-speedy-resilient-web-scraper-for-rag-ai-part2-scaling-up.md))
22 | - [Agents from Scratch](ai-agent-from-scratch.md)
23 | 
24 | ## Anti-Bot Detection
25 | 
26 | Spider, combined with the [`headless-browser`](https://github.com/spider-rs/headless-browser) repo, achieves **full stealth** against leading bot detection services — even when running fully headless.
27 | 
28 | Our techniques make Spider the most powerful crawling stack available today, providing an invisible footprint while scraping at scale.
29 | 
30 | Below are some screenshots proving Spider's stealth against major bot detectors:
31 | 
32 | | Detector                                 | Screenshot                                                                                               |
33 | | :--------------------------------------- | :------------------------------------------------------------------------------------------------------- |
34 | | BrowserScan.net Bot Detection            | ✅ [View Screenshot](images/anti_bot/www_browserscan_net_bot_detection.png)        |
35 | | Bot Detector Rebrowser                   | ✅ [View Screenshot](images/anti_bot/bot_detector_rebrowser_net.png)                 |
36 | | SammySoft Bot Ecom                       | ✅ [View Screenshot](images/anti_bot/bot_sannysoft_com.png)                         |
37 | | Device and Browser Info (Are You a Bot?) | ✅ [View Screenshot](images/anti_bot/deviceandbrowserinfo_com_are_you_a_bot.png) |
38 | | Fingerprint Ecom Playground              | ✅ [View Screenshot](images/anti_bot/demo_fingerprint_com_playground.png)            |
39 | | Device and Browser Info - Device Test    | ✅ [View Screenshot](images/anti_bot/deviceandbrowserinfo_com_info_device.png)       |
40 | | Creepjs - Device Test                    | ✅ [View Screenshot](images/anti_bot/abrahamjuliot_github_io_creepjs.png)         |
41 | 
42 | Spider is designed for **extreme evasion**, **high concurrency**, and **human-like behavior**, allowing you to dominate even the most protected websites.
43 | 
44 | ## Contribute
45 | 
46 | We're happy to accept requests in the issue tracker, improvements to the content, and additional guides.
47 | 


--------------------------------------------------------------------------------
/ai-agent-from-scratch.md:
--------------------------------------------------------------------------------
  1 | # Guide - Build an AI Agent from Scratch
  2 | 
  3 | AI agents are revolutionizing how we process and interact with information. By combining language models with web search capabilities, we can create assistants that not only understand our queries but can actively research and provide comprehensive answers. This guide will show you how to harness this power.
  4 | 
  5 | ## Setup
  6 | 
  7 | First, let's set up our environment and install the necessary dependencies.
  8 | 
  9 | ### Install Required Packages
 10 | 
 11 | Install the required packages using pip:
 12 | 
 13 | ```bash
 14 | pip install python-dotenv openai spider-client colorama
 15 | ```
 16 | 
 17 | - `python-dotenv`: Manages environment variables
 18 | - `openai`: Interfaces with OpenAI's powerful language models
 19 | - `spider-client`: Scraping, crawling and web searching (all of [Spiders](https://spider.cloud/) capabilities)
 20 | - `colorama`: Adds color to our console output for better readability
 21 | 
 22 | ### Environment Variables
 23 | 
 24 | Create a `.env` file in your project root and add your API keys:
 25 | 
 26 | ```bash
 27 | OPENAI_API_KEY=<your_openai_api_key_here>
 28 | SPIDER_API_KEY=<your_spider_api_key_here>
 29 | ```
 30 | 
 31 | ## Building the AI Research Agent
 32 | 
 33 | Let's break down the process of building our AI agent into steps.
 34 | 
 35 | ### Step 1: Import Dependencies and Set Up
 36 | 
 37 | ```python
 38 | import os
 39 | from dotenv import load_dotenv
 40 | import openai
 41 | from spider import Spider
 42 | from typing import List, Dict, Any
 43 | from colorama import init, Fore
 44 | 
 45 | 
 46 | init(autoreset=True)
 47 | load_dotenv()
 48 | 
 49 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
 50 | SPIDER_API_KEY = os.getenv("SPIDER_API_KEY")
 51 | ```
 52 | 
 53 | This section sets the stage for our agent. We're importing necessary libraries and loading our environment variables. The use of `colorama` will make our console output more visually appealing and easier to read.
 54 | 
 55 | ### Step 2: Create the AIResearchAgent Class
 56 | 
 57 | The `AIResearchAgent` class is the core of our AI assistant. It encapsulates all the functionality we'll be building, providing a clean and organized structure for our code.
 58 | 
 59 | ```python
 60 | class AIResearchAgent:
 61 |     def __init__(self, openai_api_key: str, spider_api_key: str):
 62 |         self.openai_client = openai.OpenAI(api_key=openai_api_key)
 63 |         self.spider_client = Spider(spider_api_key)
 64 | ```
 65 | 
 66 | This initializer sets up our connections to the OpenAI and Spider APIs, preparing our agent for action.
 67 | 
 68 | ### Step 3: Implement Web Search Functionality
 69 | 
 70 | Web search is a crucial capability of our agent. By leveraging Spider's API, we can fetch relevant information from across the internet, providing our agent with up-to-date data to work with. And thanks to spider's speed, we don't have to wait ages for this data to be returned.
 71 | 
 72 | ```python
 73 | def search(self, query: str, limit: int = 5) -> List[Dict[str, Any]]:
 74 |     """Perform a web search using Spider."""
 75 |     params = {"limit": limit, "fetch_page_content": False}
 76 |     print(f"{Fore.GREEN}Searching for: {query}")
 77 |     results = self.spider_client.search(query, params)
 78 |     return results
 79 | ```
 80 | 
 81 | This method allows our agent to cast a wide net across the web, gathering diverse information to inform its responses.
 82 | 
 83 | ### Step 4: Implement OpenAI Request Helper
 84 | 
 85 | ```python
 86 | def openai_request(self, system_content: str, user_content: str) -> str:
 87 |     """Helper method to make OpenAI API requests."""
 88 |     response = self.openai_client.chat.completions.create(
 89 |         model="gpt-4o",
 90 |         messages=[
 91 |             {"role": "system", "content": system_content},
 92 |             {"role": "user", "content": user_content}
 93 |         ]
 94 |     )
 95 |     return response.choices[0].message.content
 96 | ```
 97 | 
 98 | This helper method streamlines our interactions with OpenAI's API, abstracting the complexities of API calls and allowing us to focus on the core functionality of our agent.
 99 | 
100 | ### Step 5: Implement Text Summarization (this method is not used in the code below, but you can easily implement it by calling it before the `combined_summary` variable defined in the research method)
101 | 
102 | Summarization is a powerful feature that allows our agent to distill large amounts of information into concise, digestible chunks. This is particularly useful when dealing with lengthy web content. We don't use this function, but it is here for you as a little "task", if you want to implement it to our agent.
103 | 
104 | ```python
105 | def summarize(self, text: str) -> str:
106 |     """Summarize the given text using OpenAI."""
107 |     print(f"{Fore.BLUE}Summarizing...", text)
108 |     return self.openai_request(
109 |         "You are a helpful assistant that summarizes text.",
110 |         f"Summarize this text in 2-3 sentences: {text}"
111 |     )
112 | ```
113 | 
114 | This method uses OpenAI to summarize text.
115 | 
116 | ### Step 6: Implement Answer Evaluation
117 | 
118 | ```python
119 | def evaluate(self, question: str, summary: str) -> str:
120 |     """Evaluate if the summary answers the question."""
121 |     print(f"{Fore.MAGENTA}Evaluating...")
122 |     evaluation = self.openai_request(
123 |         "You are an AI research assistant. Your task is to evaluate if the given summary answers the user's question.",
124 |         f"Question: {question}\n\nSummary:\n{summary}\n\nDoes this summary answer the question? If it does, write exactly: 'does answer the question'. If not, explain why."
125 |     )
126 |     print(f"{Fore.MAGENTA}Evaluation: {evaluation}")
127 |     return evaluation
128 | ```
129 | 
130 | This method adds a layer of intelligence to our agent. By evaluating whether a summary answers the original question, our agent can determine if it needs to continue searching or if it has found a satisfactory answer. This is the core and what makes this a [level 3](https://arxiv.org/pdf/2405.06643#:~:text=Inspired%20by%20the%206%20levels,%2FRL%2Dbased%20AI%2C%20with) agent.  
131 | 
132 | ### Step 7: Implement Search Query Formation
133 | 
134 | Forming effective search queries is an art, and the users query might not always be formed as a search query:
135 | - User query: What is the wheater is Boston?
136 | - Search query: Boston weather
137 | This method leverages OpenAI's language understanding to create queries that are more likely to yield relevant results.
138 | 
139 | ```python
140 | def form_search_query(self, user_query: str) -> str:
141 |     """Form a search query from the user's input."""
142 |     search_query = self.openai_request(
143 |         "You are an AI research assistant. Your task is to form an effective search query based on the user's question.",
144 |         f"User's question: {user_query}\n\nPlease provide a concise and effective search query to find relevant information."
145 |     )
146 |     return search_query
147 | ```
148 | 
149 | By refining user queries, our agent can perform more targeted and efficient web searches.
150 | 
151 | ### Step 8: Implement Final Answer Formation
152 | 
153 | This is where our agent truly shines. By using the information it has gathered (and evaluated to be sufficient in asnwering the user query), it can form comprehensive answers to complex questions.
154 | 
155 | ```python
156 | def form_final_answer(self, user_query: str, summary: str) -> str:
157 |     """Form a final answer based on the user's query and the summary."""
158 |     final_answer = self.openai_request(
159 |         "You are an AI research assistant. Your task is to form a comprehensive answer to the user's question based on the provided summary.",
160 |         f"User's question: {user_query}\n\nSummary of research:\n{summary}\n\nPlease provide a comprehensive answer to the user's question based on this information."
161 |     )
162 |     print(f"{Fore.GREEN}Formed final answer.")
163 |     return final_answer
164 | ```
165 | 
166 | This method demonstrates the agent's ability to understand context, synthesize information, and communicate clearly.
167 | 
168 | ### Step 9: Implement Question Refinement
169 | 
170 | ```python
171 | def refine_question(self, original_question: str, evaluation: str) -> str:
172 |     """Refine the search question based on the evaluation."""
173 |     print(f"{Fore.CYAN}Refining...")
174 |     return self.openai_request(
175 |         "You are an AI research assistant. Your task is to refine a search query based on the original question and the evaluation of previous search results.",
176 |         f"Original question: {original_question}\n\nEvaluation of previous results: {evaluation}\n\nPlease provide a refined search query to find more relevant information."
177 |     )
178 | ```
179 | 
180 | The ability to refine questions based on previous results is what makes our agent truly adaptive. This iterative approach allows the agent to hone in on the most relevant information, improving its research capabilities with each iteration.
181 | 
182 | ### Step 10: Implement the Main Research Loop
183 | 
184 | Now we come to the heart of our AI agent - the main research loop. This is where all the pieces come together to create a powerful, autonomous research assistant.
185 | 
186 | ```python
187 | def research(self, user_query: str, max_iterations: int = 5) -> str:
188 |     """Perform research on the given question."""
189 |     print(f"{Fore.BLUE}Starting research for: {user_query}")
190 |     
191 |     for iteration in range(max_iterations):
192 |         print(f"{Fore.YELLOW}Iteration {iteration + 1}/{max_iterations}")
193 | 
194 |         search_query = self.form_search_query(user_query)
195 |         search_results = self.search(search_query)
196 |         # OPTIONAL: call the summarize method here to summarize the search results
197 |         combined_summary = "\n".join([result['description'] for result in search_results['content']])
198 |         evaluation = self.evaluate(user_query, combined_summary)
199 | 
200 |         if "does answer the question" in evaluation.lower():
201 |             final_answer = self.form_final_answer(user_query, combined_summary)
202 |             return f"{Fore.GREEN}Final Answer:\n{final_answer}\n\nBased on:\n{combined_summary}"
203 | 
204 |         user_query = self.refine_question(user_query, evaluation)
205 |         
206 |     return f"{Fore.RED}Couldn't find a satisfactory answer after {max_iterations} iterations. Last summary:\n{combined_summary}"
207 | ```
208 | 
209 | This method orchestrates the entire research process, from forming initial queries to delivering final answers. It showcases the agent's ability to:
210 | 
211 | - Form effective search queries
212 | - Evaluate the relevance of search results
213 | - Refine and give feedback on its approach based on intermediate results
214 | - Synthesize information into a coherent final answer
215 | 
216 | ### Step 11: Implement the Main Function
217 | 
218 | Finally, let's create an interactive interface for users to engage with our AI research agent:
219 | 
220 | ```python
221 | def main():
222 |     agent = AIResearchAgent(OPENAI_API_KEY, SPIDER_API_KEY)
223 |     while True:
224 |         user_input = input("What would you like to research? (Type 'exit' to quit): ")
225 |         if user_input.lower() == 'exit':
226 |             break
227 |         result = agent.research(user_input)
228 |         print(result)
229 | 
230 | if __name__ == "__main__":
231 |     main()
232 | ```
233 | 
234 | This main function brings everything together, allowing users to interact directly with the AI agent and experience its research capabilities firsthand.
235 | 
236 | ## Conclusion
237 | 
238 | You now have a fully functional AI research agent that can:
239 | 
240 | - Form web searches
241 | - Evaluate if the search results are sufficient
242 | - Give feedback to itself, to improve the search query if the serch results were insufficient
243 | - Form final answer based on the search results gathered
244 | 
245 | ## Complete Code
246 | 
247 | You can find the complete code for this guide down below:
248 | 
249 | ```python
250 | import os
251 | from dotenv import load_dotenv
252 | import openai
253 | from spider import Spider
254 | from typing import List, Dict, Any
255 | from colorama import init, Fore
256 | 
257 | 
258 | init(autoreset=True)
259 | load_dotenv()
260 | 
261 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
262 | SPIDER_API_KEY = os.getenv("SPIDER_API_KEY")
263 | 
264 | class AIResearchAgent:
265 |     def __init__(self, openai_api_key: str, spider_api_key: str):
266 |         self.openai_client = openai.OpenAI(api_key=openai_api_key)
267 |         self.spider_client = Spider(spider_api_key)
268 | 
269 |     def search(self, query: str, limit: int = 5) -> List[Dict[str, Any]]:
270 |         """Perform a web search using Spider."""
271 |         params = {"limit": limit, "fetch_page_content": False}
272 |         print(f"{Fore.GREEN}Searching for: {query}")
273 |         results = self.spider_client.search(query, params)
274 |         return results
275 | 
276 |     def _openai_request(self, system_content: str, user_content: str) -> str:
277 |         """Helper method to make OpenAI API requests."""
278 |         response = self.openai_client.chat.completions.create(
279 |             model="gpt-4o",
280 |             messages=[
281 |                 {"role": "system", "content": system_content},
282 |                 {"role": "user", "content": user_content}
283 |             ]
284 |         )
285 |         return response.choices[0].message.content
286 | 
287 |     def summarize(self, text: str) -> str:
288 |         """Summarize the given text using OpenAI."""
289 |         print(f"{Fore.BLUE}Summarizing...")
290 |         return self._openai_request(
291 |             "You are a helpful assistant that summarizes text.",
292 |             f"Summarize this text in 2-3 sentences: {text}"
293 |         )
294 | 
295 |     def evaluate(self, question: str, summary: str) -> str:
296 |         """Evaluate if the summary answers the question."""
297 |         print(f"{Fore.MAGENTA}Evaluating...")
298 |         evaluation = self._openai_request(
299 |             "You are an AI research assistant. Your task is to evaluate if the given summary answers the user's question.",
300 |             f"Question: {question}\n\nSummary:\n{summary}\n\nDoes this summary answer the question? If it does, write exactly: 'does answer the question'. If not, explain why."
301 |         )
302 |         return evaluation
303 | 
304 |     def form_search_query(self, user_query: str) -> str:
305 |         """Form a search query from the user's input."""
306 |         search_query = self._openai_request(
307 |             "You are an AI research assistant. Your task is to form an effective search query based on the user's question.",
308 |             f"User's question: {user_query}\n\nPlease provide a concise and effective search query to find relevant information."
309 |         )
310 |         return search_query
311 | 
312 |     def form_final_answer(self, user_query: str, summary: str) -> str:
313 |         """Form a final answer based on the user's query and the summary."""
314 |         final_answer = self._openai_request(
315 |             "You are an AI research assistant. Your task is to form a comprehensive answer to the user's question based on the provided summary.",
316 |             f"User's question: {user_query}\n\nSummary of research:\n{summary}\n\nPlease provide a comprehensive answer to the user's question based on this information."
317 |         )
318 |         print(f"{Fore.GREEN}Formed final answer.")
319 |         return final_answer
320 | 
321 |     def refine_question(self, original_question: str, evaluation: str) -> str:
322 |         """Refine the search question based on the evaluation."""
323 |         print(f"{Fore.CYAN}Refining...")
324 |         return self._openai_request(
325 |             "You are an AI research assistant. Your task is to refine a search query based on the original question and the evaluation of previous search results.",
326 |             f"Original question: {original_question}\n\nEvaluation of previous results: {evaluation}\n\nPlease provide a refined search query to find more relevant information."
327 |         )
328 | 
329 |     def research(self, user_query: str, max_iterations: int = 5) -> str:
330 |         """Perform research on the given question."""
331 |         print(f"{Fore.BLUE}Starting research for: {user_query}")
332 |         
333 |         for iteration in range(max_iterations):
334 |             print(f"{Fore.YELLOW}Iteration {iteration + 1}/{max_iterations}")
335 |             
336 |             search_query = self.form_search_query(user_query)
337 |             search_results = self.search(search_query)
338 |             # OPTIONAL: call the summarize method here to summarize the search results
339 |             combined_summary = "\n".join([result['description'] for result in search_results['content']])
340 |             evaluation = self.evaluate(user_query, combined_summary)
341 |             
342 |             if "does answer the question" in evaluation.lower():
343 |                 final_answer = self.form_final_answer(user_query, combined_summary)
344 |                 return f"{Fore.GREEN}Final Answer:\n{final_answer}\n\nBased on:\n{combined_summary}"
345 |             
346 |             user_query = self.refine_question(user_query, evaluation)
347 |         
348 |         return f"{Fore.RED}Couldn't find a satisfactory answer after {max_iterations} iterations. Last summary:\n{combined_summary}"
349 | 
350 | def main():
351 |     agent = AIResearchAgent(OPENAI_API_KEY, SPIDER_API_KEY)
352 | 
353 |     while True:
354 |         user_input = input("What would you like to research? (Type 'exit' to quit): ")
355 |         if user_input.lower() == 'exit':
356 |             break
357 | 
358 |         result = agent.research(user_input)
359 |         print(result)
360 | 
361 | if __name__ == "__main__":
362 |     main()
363 | ```
364 | 
365 | If you liked this guide, consider checking out Spider on Twitter and follow me (the author):
366 | - **Spider Twitter:** [spider_rust](https://x.com/spider_rust)
367 | - **William Espegren Twitter:** [@WilliamEspegren](https://x.com/WilliamEspegren)
368 | 


--------------------------------------------------------------------------------
/auto-email-response-outreach.md:
--------------------------------------------------------------------------------
  1 | # Automated Cold Email Outreach Using Spider
  2 | 
  3 | This guide will show you how to automate the process of cold email outreach by extracting email content, identifying the company behind the email, searching for their website, and crafting a personalized email using the LLM-ready data returned from the website by Spider.
  4 | 
  5 | ## Retrieve Email
  6 | 
  7 | For this guide, we will not cover how to get the contents of the email, as it varies between different services. Instead, we will have a variable with the email content in it.
  8 | 
  9 | ```python
 10 | email = '''
 11 | Thank you for your email Gilbert,
 12 | 
 13 | I have looked into YourBusinessName and it seems to suit some of our customers' requests, but not really so many that make it profitable for us to invest time and money integrating it into our current services. If you have any use cases in mind that suit our company, I might propose an idea to the others.
 14 | 
 15 | Best,
 16 | Matilda
 17 | 
 18 | SEO expert at Spider.cloud
 19 | '''
 20 | ```
 21 | 
 22 | ## Setup OpenAI
 23 | 
 24 | Get OpenAI setup and running in a few minutes with the following steps:
 25 | 
 26 | 1. Create an account and get an API Key on [OpenAI](https://openai.com/).
 27 | 
 28 | 2. Install OpenAI and set up the API key in your project as an environment variable. This approach prevents you from hardcoding the key in your code.
 29 | 
 30 | ```bash
 31 | pip install openai
 32 | ```
 33 | 
 34 | In your terminal:
 35 | 
 36 | ```bash
 37 | export OPENAI_API_KEY=<your-api-key-here>
 38 | ```
 39 | 
 40 | Alternatively, you can use the `dotenv` package to load the environment variables from a `.env` file. Create a `.env` file in your project root and add the following:
 41 | 
 42 | ```bash
 43 | OPENAI_API_KEY=<your-api-key-here>
 44 | ```
 45 | 
 46 | Then, in your Python code:
 47 | 
 48 | ```python
 49 | from dotenv import load_dotenv
 50 | from openai import OpenAI
 51 | import os
 52 | 
 53 | load_dotenv()
 54 | 
 55 | client = OpenAI(
 56 |     api_key=os.environ.get("OPENAI_API_KEY"),
 57 | )
 58 | ```
 59 | 
 60 | 3. Test OpenAI to see if things are working correctly:
 61 | 
 62 | ```python
 63 | import os
 64 | from openai import OpenAI
 65 | 
 66 | client = OpenAI(
 67 |     api_key=os.environ.get("OPENAI_API_KEY"),
 68 | )
 69 | 
 70 | chat_completion = client.chat.completions.create(
 71 |     model="gpt-3.5-turbo",
 72 |     messages=[
 73 |         {
 74 |             "role": "user",
 75 |             "content": "What are large language models?",
 76 |         }
 77 |     ]
 78 | )
 79 | ```
 80 | 
 81 | ## Setup Spider & Langchain
 82 | 
 83 | Getting started with the API is simple and straightforward. After you get your [secret key](https://spider.cloud/api-keys), you can use the Spider LangChain document loader. We won't rehash the full setup guide for Spider here, but if you want to use the API directly, you can check out the [Spider API Guide](https://spider.cloud/guides/spider-api) to learn more. Let's move on.
 84 | 
 85 | Install the Spider Python client library and langChain:
 86 | 
 87 | ```bash
 88 | pip install spider-client langchain langchain-community
 89 | ```
 90 | 
 91 | Then import the `SpiderLoader` from the document loaders module:
 92 | 
 93 | ```python
 94 | from langchain_community.document_loaders import SpiderLoader
 95 | ```
 96 | 
 97 | Let's set up the Spider API for our example use case:
 98 | 
 99 | ```python
100 | def load_markdown_from_url(urls):
101 |     loader = SpiderLoader(
102 |         url=urls,
103 |         mode="crawl",
104 |         params={
105 |             "return_format": "markdown",
106 |             "proxy_enabled": False,
107 |             "request": "http",
108 |             "request_timeout": 60,
109 |             "limit": 1,
110 |         },
111 |     )
112 |     data = loader.load()
113 | ```
114 | 
115 | Set the mode to `crawl` and use the `return_format` parameter to specify we want markdown content. The rest of the parameters are optional.
116 | 
117 | ### Reminder
118 | 
119 | Spider handles automatic concurrency handling and IP rotation to make it simple to scrape multiple URLs at once. The more credits you have or usage available allows for a higher concurrency limit. Make sure you have enough credits if you choose to crawl more than one page.
120 | 
121 | For now, we'll turn off the proxy and move on to setting up LangChain.
122 | 
123 | ## Puzzling the pieces together
124 | 
125 | Now that we have everything installed and working, we should start with connecting the different pieces together.
126 | 
127 | First we need to extract the company name from the email:
128 | 
129 | ```python
130 | import os
131 | from openai import OpenAI
132 | 
133 | email_content = '''
134 | Thank you for your email Gilbert,
135 | 
136 | I have looked into yourAutomatedCRM and it seems to suit some of our customers' requests, but not really so many that makes it profitable for us to invest time and money integrating it into our current services. If you have any use cases in mind that suit our company, I might be able to propose an idea to the others.
137 | 
138 | Best,
139 | Matilda
140 | 
141 | SEO expert at Spider.cloud
142 | '''
143 | 
144 | # Initialize OpenAI client
145 | client = OpenAI(
146 |     api_key=os.environ.get("OPENAI_API_KEY")
147 | )
148 | 
149 | def extract_company_name(email):
150 |     # Define messages
151 |     messages = [{"role": "user", "content": f'Extract the company name and return ONLY the company name from the sender of this email: """{email_content}"""'}]
152 | 
153 |     # Call OpenAI API
154 |     completion = client.chat.completions.create(
155 |         model="gpt-3.5-turbo",
156 |         messages=messages,
157 |     )
158 | 
159 |     return completion.choices[0].message.content
160 | 
161 | company_name = extract_company_name(email_content)
162 | print(company_name)
163 | ```
164 | 
165 | ### Finding the company's official website
166 | 
167 | By using Spider's built in AI scraping tools, we can specify our own prompt in our Spider API request.
168 | 
169 | "Return the official website of the company for `company-name`" on a bing search suits really well for this guide, since we want the url for the company's website.'
170 | 
171 | ``` python
172 | import requests, os
173 | 
174 | headers = {
175 |     'Authorization': os.environ["SPIDER_API_KEY"],
176 |     'Content-Type': 'application/json',
177 | }
178 | 
179 | json_data = {
180 |     "limit":1,
181 |     "gpt_config":{
182 |         "prompt":f'Return the official website of the company for {company_name}',
183 |         "model":"gpt-4o",
184 |         "max_tokens":4096,
185 |         "temperature":0.54,
186 |         "top_p":0.17,
187 |     },
188 |     "url":"https://www.bing.com/search?q=spider.cloud"
189 | }
190 | 
191 | response = requests.post('https://api.spider.cloud/crawl', 
192 |   headers=headers, 
193 |   json=json_data
194 | )
195 | 
196 | company_url = response.json()[0]['metadata']['extracted_data']
197 | ```
198 | 
199 | ## Explanation of the `gpt_config`
200 | The gpt_config in the Spider API request specifies the configuration for the GPT model used to do actions on the scraped data. It includes parameters such as:
201 | 
202 | - prompt: The prompt provided to the model (string or a list of strings)
203 | - model: The specific GPT model to use.
204 | - max_tokens: The maximum number of tokens to generate.
205 | - temperature: Controls the randomness of the output (higher values make output more random).
206 | - top_p: Controls the diversity of the output (higher values make output more diverse).
207 | 
208 | These settings ensure that the API generates coherent and contextually appropriate responses based on the scraped data.
209 | 
210 | ### Crafting super personalized email based on the websites content
211 | ```python
212 | from langchain_text_splitters import RecursiveCharacterTextSplitter
213 | from langchain_chroma import Chroma
214 | from langchain_openai import OpenAIEmbeddings
215 | from langchain import hub
216 | from langchain_core.output_parsers import StrOutputParser
217 | from langchain_core.runnables import RunnablePassthrough
218 | from langchain_openai import ChatOpenAI
219 | from langchain_community.document_loaders import SpiderLoader
220 | 
221 | company_url = 'https://spider.cloud'
222 | 
223 | def filter_metadata(doc):
224 |     # Filter out or replace None values in metadata
225 |     doc.metadata = {k: (v if v is not None else "") for k, v in doc.metadata.items()}
226 |     return doc
227 | 
228 | def load_markdown_from_url(urls):
229 |     loader = SpiderLoader(
230 |         # env="your-api-key-here", # if no API key is provided it looks for SPIDER_API_KEY in env
231 |         url=urls,
232 |         mode="crawl",  
233 |         params={
234 |             "return_format": "markdown",
235 |             "proxy_enabled": False,
236 |             "request": "http",  
237 |             "request_timeout": 60,
238 |             "limit": 1,
239 |         },
240 |     )
241 |     data = loader.load()
242 |     return data
243 | 
244 | docs = load_markdown_from_url(company_url)
245 | 
246 | # Filter metadata in documents
247 | docs = [filter_metadata(doc) for doc in docs]
248 | 
249 | text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
250 | splits = text_splitter.split_documents(docs)
251 | vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
252 | 
253 | # Retrieve and generate using the relevant snippets of the blog.
254 | retriever = vectorstore.as_retriever()
255 | prompt = hub.pull("rlm/rag-prompt")
256 | 
257 | def format_docs(docs):
258 |     return "\n\n".join(doc.page_content for doc in docs)
259 | 
260 | llm = ChatOpenAI(model="gpt-4o")
261 | 
262 | rag_chain = (
263 |     {"context": retriever | format_docs, "question": RunnablePassthrough()}
264 |     | prompt
265 |     | llm
266 |     | StrOutputParser()
267 | )
268 | print(rag_chain.invoke(f'Craft a super personalized email answering {company_name}, answering their response to our cold outreach campaigne. Their email: """{email_content}"""'))
269 | ```
270 | 
271 | And the results should be something like this:
272 | 
273 | ```markdown
274 | ### output
275 | 
276 | Hi Matilda,
277 | 
278 | Thank you for considering AutomatedCRM. Given Spider.cloud's needs for efficient and large-scale data collection, our CRM can integrate seamlessly with tools like Spider, providing robust, high-speed data extraction and management. I'd love to discuss specific use cases where this integration could significantly enhance your current offerings.
279 | 
280 | Best,
281 | Gilbert
282 | ```
283 | 
284 | ## Why do we use `hub.pull("rlm/rag-prompt")?`
285 | 
286 | We chose hub.pull("rlm/rag-prompt") for this use case because it provides a robust and flexible template for prompt construction, specifically designed for retrieval-augmented generation (RAG) tasks. This helps in creating contextually relevant and highly personalized responses by leveraging the extracted and processed data returned from Spider.
287 | 
288 | ## Complete code
289 | 
290 | That is it, now we have a fully automated email cold outreach with Spider, that responds with emails with knowledge about the companys website.
291 | 
292 | Here is the full code:
293 | 
294 | ```python
295 | import requests, os
296 | from openai import OpenAI
297 | from langchain_text_splitters import RecursiveCharacterTextSplitter
298 | from langchain_chroma import Chroma
299 | from langchain_openai import OpenAIEmbeddings
300 | from langchain import hub
301 | from langchain_core.output_parsers import StrOutputParser
302 | from langchain_core.runnables import RunnablePassthrough
303 | from langchain_openai import ChatOpenAI
304 | from langchain_community.document_loaders import SpiderLoader
305 | 
306 | email_content = '''
307 | Thank you for your email Gilbert,
308 | 
309 | I have looked into yourAutomatedCRM and it seems to suit some of our customers' requests, but not really so many that makes it profitable for us to invest time and money integrating it into our current services. If you have any use cases in mind that suit our company, I might be able to propose an idea to the others.
310 | 
311 | Best,
312 | Matilda
313 | 
314 | SEO expert at Spider.cloud
315 | '''
316 | 
317 | # Initialize OpenAI client
318 | client = OpenAI(
319 |     api_key=os.environ.get("OPENAI_API_KEY")
320 | )
321 | 
322 | def extract_company_name(email):
323 |     # Define messages
324 |     messages = [{"role": "user", "content": f'Extract the company name and return ONLY the company name from the sender of this email: """{email_content}"""'}]
325 | 
326 |     # Call OpenAI API
327 |     completion = client.chat.completions.create(
328 |         model="gpt-3.5-turbo",
329 |         messages=messages,
330 |     )
331 | 
332 |     return completion.choices[0].message.content
333 | 
334 | company_name = extract_company_name(email_content)
335 | 
336 | headers = {
337 |     'Authorization': os.environ["SPIDER_API_KEY"],
338 |     'Content-Type': 'application/json',
339 | }
340 | 
341 | json_data = {
342 |     "limit":1,
343 |     "gpt_config":{
344 |         "prompt":f'Return the official website of the company for {company_name}',
345 |         "model":"gpt-4o",
346 |         "max_tokens":4096,
347 |         "temperature":0.54,
348 |         "top_p":0.17,
349 |     },
350 |     "url":"https://www.bing.com/search?q=spider.cloud"
351 | }
352 | 
353 | response = requests.post('https://api.spider.cloud/crawl', 
354 |   headers=headers, 
355 |   json=json_data
356 | )
357 | 
358 | company_url = response.json()[0]['metadata']['extracted_data']
359 | 
360 | def filter_metadata(doc):
361 |     # Filter out or replace None values in metadata
362 |     doc.metadata = {k: (v if v is not None else "") for k, v in doc.metadata.items()}
363 |     return doc
364 | 
365 | def load_markdown_from_url(urls):
366 |     loader = SpiderLoader(
367 |         # env="your-api-key-here", # if no API key is provided it looks for SPIDER_API_KEY in env
368 |         url=urls,
369 |         mode="crawl",  
370 |         params={
371 |             "return_format": "markdown",
372 |             "proxy_enabled": False,
373 |             "request": "http",  
374 |             "request_timeout": 60,
375 |             "limit": 1,
376 |         },
377 |     )
378 |     data = loader.load()
379 |     return data
380 | 
381 | docs = load_markdown_from_url(company_url)
382 | 
383 | # Filter metadata in documents
384 | docs = [filter_metadata(doc) for doc in docs]
385 | 
386 | text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
387 | splits = text_splitter.split_documents(docs)
388 | vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
389 | 
390 | # Retrieve and generate using the relevant snippets of the blog.
391 | retriever = vectorstore.as_retriever()
392 | prompt = hub.pull("rlm/rag-prompt")
393 | 
394 | def format_docs(docs):
395 |     return "\n\n".join(doc.page_content for doc in docs)
396 | 
397 | llm = ChatOpenAI(model="gpt-4o")
398 | 
399 | rag_chain = (
400 |     {"context": retriever | format_docs, "question": RunnablePassthrough()}
401 |     | prompt
402 |     | llm
403 |     | StrOutputParser()
404 | )
405 | print(rag_chain.invoke(f'Craft a super personalized email answering {company_name}, answering their response to our cold outreach campaigne. Their email: """{email_content}"""'))
406 | ```
407 | 
408 | If you liked this guide, consider checking out me and Spider on Twitter:
409 | - **Author Twitter:** [WilliamEspegren](https://x.com/WilliamEspegren)
410 | - **Spider Twitter:** [spider_rust](https://x.com/spider_rust)


--------------------------------------------------------------------------------
/building-a-speedy-resilient-web-scraper-for-rag-ai-part1-preparing.md:
--------------------------------------------------------------------------------
  1 | # Building A Speedy Resilient Web Scraper for RAG AI: Part 1 — Preparing
  2 | 
  3 | > “Building a RAG AI is easy; building a RAG AI at scale is incredibly difficult.” — Jerry Liu of LlamaIndex.
  4 | 
  5 | As part of my ongoing series on how to build a scalable RAG AI, today I’ll be diving deep into web scraping, the automated extracting of data from web pages. Many RAG AIs need at least a little bit of data gathered from various sites around the web. Typical examples include gathering data for customer support chatbots, generating personalized recommendations, and extracting information for market analysis.
  6 | 
  7 | For my project, [The Journey Sage Finder](https://lowryonleadership.com/2024/04/02/the-journey-sage-finder/), I needed to scrape [thejourneysage.com](https://thejourneysage.com/) for answers to common questions. This required only a single page worth of data. [My Virtual College Advisor](https://lowryonleadership.com/2024/05/27/inside-the-virtual-college-advisor-a-deep-dive-into-rag-ai-and-agent-technology/) required a much more extensive scrape of over 6,200 college and university websites comprising well over a million unique web pages.
  8 | 
  9 | What worked fine for the one page The Journey Sage failed spectacularly when trying to bring it to scale. In the hopes of saving the next person some time, I’m sharing much of what I encountered on this journey.
 10 | 
 11 | ## Upfront Preparation
 12 | 
 13 | Any large-scale scraping requires some upfront preparation. Upfront, you should think about:
 14 | 
 15 | 1. What data you need.
 16 | 2. What tools to use.
 17 | 3. Testing on a single website first.
 18 | 4. Scale Up! Have a plan for problems that may arise.
 19 | 5. Error First Thinking. Expect some cleanup.
 20 | 
 21 | In this post, we will cover the first three items: understanding what data you need, choosing your tools, and testing on a single website. The next post will cover scaling up your operations and adopting an error-first thinking approach.
 22 | 
 23 | ## Understand What Data You Need
 24 | 
 25 | When gathering data for a RAG AI, less is often more. Extraneous data erodes the quality of the semantic searches and thus the answers generated. Each extra bit of data stored adds to storage costs and overhead for embedding and retrieval. Good quality data in as small a package as you can get, with no extraneous data, is what you are shooting for.
 26 | 
 27 | When scraping, if there’s one piece of advice I’d stress above all others, it’s that **keeping the data to just what you really need** is vital.
 28 | 
 29 | A typical website has huge amounts of data used for styling, headers, footers, and structure. For instance, if you look at the HTML of the homepage for my [website](https://www.lowryonleadership.com), you see 934 lines of HTML, but really only about 16 lines you’d want to use in a RAG. That’s 2% of the page that has useful data for your RAG AI, while 98% is formatting, navigation, and the like. Think hard about what data you truly need.
 30 | 
 31 | ## Choose Your Tools
 32 | 
 33 | You will need some way to handle:
 34 | 
 35 | ### Scraping the websites
 36 |    
 37 | I tried a number of different options before settling on [Spider](https://spider.cloud/). I started just calling web pages directly. After all, I reasoned, all a browser really does is make a call and get the HTML back. I should be able to do so as well. This worked fine for a simple WordPress site, but fell apart quickly at scale. Calls tended to be rejected by web servers and I got a number of certificate errors. I then tried Chromium, Google’s web browser foundation. This worked but had many intricacies, a steep learning curve, numerous errors, and was a bit slow.
 38 | 
 39 | Spider does require a small fee, but is super quick, easy to use, and has good customer support. Importantly, it runs headless, meaning that my system doesn’t open up a browser for each web page accessed. It also fits seamlessly with Llama Index, which I use extensively.
 40 | 
 41 | ### Benefits of Spider:
 42 | 
 43 | - Super fast.
 44 | - API easy to use.
 45 | - Support for streaming and asynchronous processing.
 46 | - Headless.
 47 | - Low memory.
 48 | 
 49 | I will talk more about these benefits when I discuss scaling up in the next post.
 50 | 
 51 | ## Cleaning the Returned Data
 52 | 
 53 | For cleaning the HTML, I chose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) for its great documentation and reputation. It effectively stripped out unnecessary elements, leaving only the 2% of data needed for my RAG AI.
 54 | 
 55 | Originally, I had saved the headers, footers, and menus. My thinking was that this contained valuable information. **What it really contained was a lot of noise**. For instance, let’s say I’m scraping school websites and the side menu lists all of the academic departments. The problem is that a semantic search on “Economics” gets a hit on dozens or hundreds of pages that actually are about other departments. While those pages will likely get weeded out by the reranker, **it’s still unnecessary overhead which can only erode accuracy and performance**.
 56 | 
 57 | I ended up culling pretty much everything from the page except the headers and the main text. To be specific, I removed everything with the following tags: ‘script’, ‘style’, ‘nav’, ‘footer’, ‘header’, ‘aside’, ‘form’ as well as the following classes: ‘menu’, ‘sidebar’, ‘ad-section’, ‘navbar’, ‘modal’, ‘footer’, ‘masthead’, ‘comment’, ‘widget’.
 58 | 
 59 | Whereas I tried half a dozen different scrapers before settling on Spider, the only thing I tried before using Beautiful Soup was some custom-built code to strip styling. Beautiful Soup has produced high-quality output for me and is quick enough for my purposes, but I can’t attest to how well it works compared to any similar utility.
 60 | 
 61 | ## Testing on a Single Website
 62 | 
 63 | Before scaling up, it’s crucial to test your scraper on a single website. This allows you to identify and resolve issues without dealing with the complexity of multiple sites. Here’s how you can do it:
 64 | 
 65 | 1. **Select a Representative Site:** Choose a website that is representative of the type of data and structure you expect to encounter in your larger scrape.
 66 | 2. **Run Your Scraper:** Execute your scraper on this site and carefully analyze the data you retrieve.
 67 | 3. **Adjust Your Process:** Make any necessary adjustments to your scraping and cleaning process based on the results.
 68 | 4. **Validate Your Data:** Ensure that the data you have gathered is accurate, relevant, and clean.
 69 | 
 70 | This step helps you fully understand the data before ramping up to multiple websites, reducing the risk of encountering major issues later.
 71 | 
 72 | ## Storing the Returned Data
 73 | 
 74 | In this example, I’m using MongoDB. For smaller implementations, I actually use local flat files with a SQLite database for coordination. I’m not sure this choice matters too much in this phase. It can have enormous impacts on performance in the vector search phase, but that’s another post…
 75 | 
 76 | Instead of just taking the cleaned data and storing it, I suggest adding a few additional steps:
 77 | 
 78 | ### 1. Hashing the content to check for duplicates.
 79 | 
 80 | To my surprise, many sites have a large amount of duplicate content. Sometimes this is different URLs pointing to the same page, other times it’s basic information repeated. Depending on the website, I was finding anywhere between 0% duplicates and 70% (!!!) duplicates. At first, I had a process go through and clean out duplicates, but the issue was so pervasive I decided that I needed to stop it at the source instead of after the fact. I added a hash to each website saved. If the hash already existed, I didn’t save the duplicate page.
 81 | 
 82 | To reiterate, keeping just the data you need is vital to accuracy and performance. Duplicates in the database mean that when your semantic search retrieves the top 5 websites, it might retrieve 5 exact copies.
 83 | 
 84 | Note: Don’t include the URL in the data that is hashed or multiple URLs with the same content won’t be detected as duplicates.
 85 | 
 86 | ### 2. Make sure to add basic sanity checks on each page.
 87 | 
 88 | In my scraping, about 4% of the returned web pages had very little data once cleaned. Usually, this is because the web page had a header and a picture on it.
 89 | 
 90 | This **4% caused 90% of my troubles during semantic searches**. For example, a page that said “Economics Department” and had a picture of the department staff would score very high on a semantic search for “Economics.” Ironically, it would score better than a page that described the Economics Department in detail because the extra detail, while important to anyone querying the topic, made the semantic value of the page differ from the base term itself.
 91 | 
 92 | Once those 4% were removed, average [evaluation scores](https://lowryonleadership.com/2024/05/30/evaluating-vector-search-performance-on-a-rag-ai-a-detailed-look/) went from 4.6 to 9.5 (scale of 1–10. Top 3 looked at for relevance to query). That’s a huge change by just getting rid of a small amount of data.
 93 | 
 94 | ### 3. Size Matters.
 95 | Another major issue was large pages. These tended to be large PDFs. Sometimes these would be 100MB or more. These PDFs are not valuable for my use case. I automatically ignore any document over 1MB. Ultimately, for this project, I decided to also ignore everything in the cleaned data beyond the first 7,000 tokens. Why?
 96 | 
 97 | At first, I was doing the [standard chunking](https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d) of larger pages into smaller segments. Depending on your situation, this may or may not be wise:
 98 | 
 99 | a. In [The Journey Sage Finder](https://lowryonleadership.com/2024/04/02/the-journey-sage-finder/), which sometimes dealt with long videos that handled multiple different topics, this sort of chunking was important to retain information of the topics discussed in the video. It improved semantic search results as each chunk had its own embedding and was thus searched separately.
100 | 
101 | b. For [My Virtual College Advisor](https://lowryonleadership.com/2024/05/27/inside-the-virtual-college-advisor-a-deep-dive-into-rag-ai-and-agent-technology/), this chunking was not helpful. Tests using the evaluator showed slightly lower evaluation scores when chunking. This is because school web pages tend to be single subject, as opposed to long videos that might cover several subjects. The first 7,000 tokens have an extremely high likelihood of capturing the semantic meaning of the entire page. Indeed, the longer the page, the more likely it will veer off topic enough to make the page less relevant. So I decided I wouldn’t bother to process, chunk and save this extra data since it doesn’t improve query accuracy.
102 | 
103 | ## Conclusion
104 | 
105 | In this first part, we have laid the groundwork for building a robust web scraper for your RAG AI. By understanding your data needs, choosing the right tools, and testing on a single website, you can ensure a smooth and efficient scraping process. In the next part, we delve into scaling up your scraping operations, tackling issues of performance, and maintaining resilience. Read [Part 2](building-a-speedy-resilient-web-scraper-for-rag-ai-part2-scaling-up.md), where we discuss scaling up.
106 | 
107 | - Author: Troy Lowry
108 | - Twitter: [@Troyusrex](https://x.com/Troyusrex)
109 | - Read more: [https://lowryonleadership.com](https://lowryonleadership.com)


--------------------------------------------------------------------------------
/building-a-speedy-resilient-web-scraper-for-rag-ai-part2-scaling-up.md:
--------------------------------------------------------------------------------
 1 | # Building a Fast and Resilient Web Scraper for Your RAG AI: Part 2 — Scaling Up
 2 | 
 3 | In the first part of this [series](building-a-speedy-resilient-web-scraper-for-rag-ai-part1-preparing.md), we covered understanding what data you need, choosing your tools, and testing your scraper on a single website. Now, we will delve into the challenges of scaling up your web scraping operations and adopting an error-first thinking approach. We also discuss in depth the importance of limiting the blast radius of any errors.
 4 | 
 5 | You can find the complete code for this project on [GitHub](https://github.com/Troyusrex2/RAG-AI-Scaling).
 6 | 
 7 | ## Scale Up! Have a Plan For Problems
 8 | 
 9 | You WILL encounter problems. Scraping one web page is easy; scraping a million is anything but. Websites vary tremendously and what worked on the first 100 websites might fail on the 101st. It might also work on one website for 100 days and fail on the 101st. This is especially true for websites that use technologies like React or Drupal to supplement their HTML. Even with a great tool like [Spider](https://spider.cloud/), errors are a fact of scraping.
10 | 
11 | A few things to keep in mind:
12 | 
13 | 1. **Errors:** Every web scraping program I tried gave me errors occasionally; Spider is no different. What is different is that Spider’s customer support helped me handle most issues quickly. Outright errors, where an exception was thrown, were okay because at least the problem was known and could be handled by error trapping.
14 | 
15 | 2. **Stopping:** Sometimes the API I was calling simply failed to return any data.
16 | 
17 | 3. **Few or no pages returned:** Having some pages on a site return with a good amount of data and others not at all is especially problematic for a RAG AI where incomplete data will make for incomplete results. Scraping sites made up of many sub-sites, which is especially prevalent at organizations such as universities that are made up of many quasi-independent sub-entities, often using different web technologies, is especially prone to this. To handle this, any website that returns fewer than 500 pages is flagged for review.
18 | 
19 | 4. **Pages returned, but with little or no data:** While sometimes an error, this is more often a picture-heavy site that has little text with the pictures. Because of this, I only consider pages with at least 75 words of text worth saving. Depending on what your RAG AI is doing, your cutoff may differ.
20 |    
21 | 5. **Spinning forever:** Like the ‘stopping’ above, this is where the system waits for the API to return data but for whatever reason, the data never returns. This is made worse by the fact that processing web pages is a highly variable process, depending not just on the complexity of the web page being scraped, but the latency of the internet and my machines. With other scraping APIs, some pages would take minutes. Spider is quick enough that if nothing is returned after 2 minutes it’s fine to assume a problem and move on.
22 | 
23 | 6. **Out of memory:** Scraping can be a memory-intensive exercise. My previous scraper, using Chromium directly, would run out of memory on my 32GB i9 machine. In contrast, **I run Spider on a series of AWS t2.nanos, which have 0.5GB of memory and 1 CPU**. This is half the memory of a $35 Raspberry Pi (!!!). It runs out of memory approximately once every 300 websites (averaging 500 pages per website). When running a t2.small, with 2 GB of memory, I’ve never run out of memory. So I simply have it so that if a site runs out of memory, the website is later picked up by a t2.small and reprocessed. The t2.nano is about ¼ the price of the t2.small (much less when free tier services are factored in).
24 |    
25 | ## Error First Thinking. Expect Some Cleanup
26 | 
27 | Since errors are inevitable, the important thing is to know when they happen and [handle them gracefully](https://lowryonleadership.com/2024/04/05/dannys-law-resilience-is-built-by-facing-adversity-not-by-avoiding-it/).
28 | 
29 | To ease handling, I assume each webpage is an error until proven otherwise. This dramatically improves my processing as it encouraged me to built in the following features:
30 | 
31 | 1. **Error logging/Trapping:** All good systems include error logging and trapping. I built this in from the beginning. In this case, it was only the start of a larger fault-tolerant architecture.
32 |    
33 | 2. **On Error, flag it and continue to the next:** I’m a big fan of Toyota’s [stop-the-line](https://businessmap.io/blog/stop-the-line) processing where when errors happen, the entire processing line stops and errors get resolved before any other processing is allowed to occur. That said, **I’m a bigger fan of sleep**. Stopping all processing when an issue occurs meant I woke up many mornings to completely stopped processing, often having lost 40 threads worth of processing for many hours.
34 |    
35 | Back before the speed of Spider, I projected I would need 4 months of processing time for 40 concurrent threads to complete my scraping. Losing 6 or 8 hours meant delaying launch by another day. This prompted me to just flag any errors and move on, allowing processing to continue. This did backfire a few times when an error would pop up that would propagate through the threads, causing lots of lost work and cleanup.
36 | 
37 | 1. **Have a separate process to track and handle errors:** Delaying the review of errors until the morning didn’t mean ignoring them, far from it. It did mean I had to have time set aside to identify and work through all the errors from the previous night. Any errors not resolved, and the root cause mitigated, will just pop up again and again until it is resolved, so it’s vitally important to have these looked at.
38 |    
39 | 2. **Be careful about concurrency: Be careful about concurrency:** When multiple processes or threads are running, ensure that any website is only worked on by one process. I used a simple processing flag in MongoDB to handle this.
40 |    
41 | 3. **Limit Blast Radius, so when problems happen they are isolated:** This one is so important, it deserves its own section:
42 | 
43 | ## Separate Processing of Requests — Limited Blast Radius
44 | 
45 | Originally, I had a nice multi-threaded scraper running on my beefy server. Again and again, I was hitting unexpected issues that would bring the entire system down. Be it a memory issue or a threading problem, issues I expected would be trapped and isolated instead impacted the entire system, meaning that an error on one thread would compromise the entire system. This meant that my beefy server that could scrape 40 websites at a time was actually a liability instead of an asset. Having architected many production systems, there are ways to architect around this using clusters, microservices, and the like, but this is a background process run infrequently and an outage doesn’t have the impact of a customer-facing system facing a failure. I needed a simpler, cheaper solution.
46 | 
47 | As previously mentioned, one problem with most of the other scrapers I tried was that they would bring up a browser to display every webpage being scraped. Being able to scrape a site without bringing up a browser is called running “headless.” I certainly tried running Chromium headless and had some success, but I encountered many new additional errors attempting to run headless. In particular, sites seemed more able to detect that I was scraping and would prevent me from accessing the site.
48 | 
49 | Running with the browser appearing, at first, was a benefit as I could see firsthand what the system was doing. It became a liability because all of those browser windows popping up on my server made doing anything else with the server difficult as it would occasionally grab focus of the system. There’s nothing quite like typing code and then being thrust onto a website mid-sentence. Playing a game on my computer while running the scraping was completely out of the question.
50 | 
51 | A bigger issue with having the browser pop up was that I could not use **low-cost cloud computing** to do this work as the lowest-cost cloud services don’t have a user interface. If I could find a headless, low-memory option, I could use these cloud services and then I could just “throw servers” at the problem. But without a headless option, that would be very expensive. Worse, managing a CLI-only t2.nano is simplicity. Managing a Windows GUI server is far more complex. Spider being able to run well on these machines was a game-changer.
52 | 
53 | As I mentioned above, these tiny servers would occasionally hit errors. Unfortunately, scraping websites will never be a “fire and forget” exercise. It will take constant oversight as technology changes to make sure everything keeps running correctly and quick intervention when it doesn’t. The internet and targeted websites are just changing so much that constant vigilance is required.
54 | 
55 | The fact that Spider ran headless with much lower memory requirements meant I could spin up 15 low-cost t2.nanos and have each run the scraper single-threaded. These nanos are actually on AWS’ free tier. My total cloud server cost to handle 1.2 million web pages was less than the cost of a cappuccino at Starbucks! My Spider costs did run several hundred dollars, but far less than the $1,000 in AWS costs I had originally planned and would have needed had I continued with the Chromium route.
56 | 
57 | ## Conclusion
58 | 
59 | Building a resilient, fault-tolerant web scraper is crucial for the success of a scalable RAG AI. The journey from scraping a few pages to handling millions is fraught with challenges, but with the right tools and strategies, it becomes manageable. The key is to plan, choose your tools wisely, and be prepared for the inevitable issues that will arise.
60 | 
61 | Having my “spider legion” of 15 AWS t2.nano servers, backed by a single AWS t2.small server for the very rare high-memory website, I was able to complete in a week what I had expected would take four months (if all went well!). Spider’s low memory overhead combined with it being headless meant I could run it massively in parallel. This setup allowed for efficient, cost-effective scraping at scale.
62 | 
63 | ## Key Takeaways:
64 | 
65 | - **Understand Your Data Needs:** Always aim to collect only the essential data to maintain high-quality semantic searches.
66 | - **Choose Reliable Tools:** Tools like Spider for scraping can significantly streamline your process.
67 | - **Plan for Errors:** Implement robust error logging and handling mechanisms to ensure your scraping process is resilient.
68 | - **Optimize Storage:** Use strategies like hashing to eliminate duplicates and store only the necessary data.
69 | - **Parallel Processing:** A tool like Spider allows you to use cloud services to run multiple scraping instances in parallel, which can drastically reduce the time required for large-scale scraping projects.
70 |   
71 | By leveraging these strategies, you can build a scalable, efficient, and resilient web scraping system that forms a robust foundation for your RAG AI. As the landscape of web technologies continues to evolve, staying adaptive and prepared for new challenges will be essential for ongoing success.
72 | 
73 | You can find the complete code for this project on Troy's [GitHub](https://github.com/Troyusrex2/RAG-AI-Scaling).
74 | 
75 | - Author: Troy Lowry
76 | - Twitter: [@Troyusrex](https://x.com/Troyusrex)
77 | - Read more: [https://lowryonleadership.com](https://lowryonleadership.com)


--------------------------------------------------------------------------------
/crawl-agent-with-autogen.md:
--------------------------------------------------------------------------------
  1 | # Scrape & Crawl Agent with Microsoft's Autogen
  2 | 
  3 | This guide will show you how to set up an [Autogen](https://www.microsoft.com/en-us/research/project/autogen/) agent to scrape and crawl any website using the Spider API.
  4 | 
  5 | ## Setup OpenAI
  6 | 
  7 | Get OpenAI setup and running in a few minutes with the following steps:
  8 | 
  9 | 1. Create an account and get an API Key on [OpenAI](https://openai.com/).
 10 | 
 11 | 2. Install OpenAI and set up the API key in your project as an environment variable. This approach prevents you from hardcoding the key in your code.
 12 | 
 13 | ```bash
 14 | pip install openai
 15 | ```
 16 | 
 17 | In your terminal:
 18 | 
 19 | ```bash
 20 | export OPENAI_API_KEY=<your-api-key-here>
 21 | ```
 22 | 
 23 | Alternatively, you can use the `dotenv` package to load the environment variables from a `.env` file. Create a `.env` file in your project root and add the following:
 24 | 
 25 | ```bash
 26 | OPENAI_API_KEY=<your-api-key-here>
 27 | ```
 28 | 
 29 | Then, in your Python code:
 30 | 
 31 | ```python
 32 | from dotenv import load_dotenv
 33 | from openai import OpenAI
 34 | import os
 35 | 
 36 | load_dotenv()
 37 | 
 38 | client = OpenAI(
 39 |     api_key=os.environ.get("OPENAI_API_KEY"),
 40 | )
 41 | ```
 42 | 
 43 | 3. Test OpenAI to see if things are working correctly:
 44 | 
 45 | ```python
 46 | import os
 47 | from openai import OpenAI
 48 | 
 49 | client = OpenAI(
 50 |     api_key=os.environ.get("OPENAI_API_KEY"),
 51 | )
 52 | 
 53 | chat_completion = client.chat.completions.create(
 54 |     model="gpt-3.5-turbo",
 55 |     messages=[
 56 |         {
 57 |             "role": "user",
 58 |             "content": "What are large language models?",
 59 |         }
 60 |     ]
 61 | )
 62 | ```
 63 | 
 64 | ## Setup Spider & Autogen
 65 | 
 66 | Getting started with the API is simple and straightforward. After you get your [secret key](https://spider.cloud/api-keys) you can follow this guide. We won't rehash the full setup guide for Spider here, but if you want to use the API directly, you can check out the [Spider API Guide](https://spider.cloud/guides/spider-api) to learn more. Let's move on.
 67 | 
 68 | Install the Spider Python client library and autogen:
 69 | 
 70 | ```bash
 71 | pip install spider-client pyautogen
 72 | ```
 73 | 
 74 | Now we need to setup the Autogen LLM configuration.
 75 | 
 76 | ```python
 77 | import os
 78 | 
 79 | config_list = [
 80 |     {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")},
 81 | ]
 82 | ```
 83 | 
 84 | And we need to set the Spider API key:
 85 | 
 86 | ```python
 87 | spider_api_key = os.getenv("SPIDER_API_KEY")
 88 | ```
 89 | 
 90 | ## Creating Scrape & Crawl Functions
 91 | 
 92 | We first need to import spider so that we can call the API to be able to scrape and crawl.
 93 | 
 94 | ```python
 95 | from spider import Spider
 96 | ```
 97 | 
 98 | ### Defining functions for the agents
 99 | 
100 | Now we need to define the scrape and crawl function that the agent will call. We will use the python Spider SDK for this and set the default `return_format` to `markdown` to retrieve LLM-ready data.
101 | 
102 | ```python
103 | from typing_extensions import Annotated
104 | from typing import List, Dict, Any
105 | 
106 | def scrape_page(url: Annotated[str, "The URL of the web page to scrape"], params: Annotated[dict, "Dictionary of additional params."] = None) -> Annotated[Dict[str, Any], "Scraped content"]:
107 |     # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables
108 |     client = Spider(spider_api_key)
109 | 
110 |     if params == None:
111 |         params = {
112 |             "return_format": "markdown"
113 |         }
114 | 
115 |     scraped_data = client.scrape_url(url, params)
116 |     return scraped_data[0]
117 | 
118 | def crawl_page(url: Annotated[str, "The url of the domain to be crawled"], params: Annotated[dict, "Dictionary of additional params."] = None) -> Annotated[List[Dict[str, Any]], "Scraped content"]:
119 |     # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables
120 |     client = Spider(spider_api_key)
121 | 
122 |     if params == None:
123 |         params = {
124 |             "return_format": "markdown"
125 |         }
126 | 
127 |     crawled_data = client.crawl_url(url, params)
128 |     return crawled_data
129 | ```
130 | 
131 | Now that we have the functions defined, we need to create the scrape & crawl agents, and let them know that they can use the functions to scrape & crawl any website.
132 | 
133 | Here is also when we use the `config_list` we defined at the top of this guide.
134 | 
135 | ```python
136 | from autogen import ConversableAgent
137 | 
138 | # Create web scraper agent.
139 | scraper_agent = ConversableAgent(
140 |     "WebScraper",
141 |     llm_config={"config_list": config_list},
142 |     system_message="You are a web scraper and you can scrape any web page to retrieve its contents."
143 |     "Returns 'TERMINATE' when the scraping is done.",
144 | )
145 | 
146 | # Create web crawler agent.
147 | crawler_agent = ConversableAgent(
148 |     "WebCrawler",
149 |     llm_config={"config_list": config_list},
150 |     system_message="You are a web crawler and you can crawl any page with deeper crawling following subpages."
151 |     "Returns 'TERMINATE' when the scraping is done.",
152 | )
153 | ```
154 | 
155 | ### How do we tell the agents to do things?
156 | 
157 | To be able to chat and make these agents actually do something, we need a `UserProxyAgent` that can communicate with the other agents:
158 | 
159 | You can read more about the [UserProxyAgent here](https://microsoft.github.io/autogen/docs/reference/agentchat/user_proxy_agent/).
160 | 
161 | ```python
162 | user_proxy_agent = ConversableAgent(
163 |     "UserProxy",
164 |     llm_config=False,  # No LLM for this agent.
165 |     human_input_mode="NEVER",
166 |     code_execution_config=False,  # No code execution for this agent.
167 |     is_termination_msg=lambda x: x.get("content", "") is not None and "terminate" in x["content"].lower(),
168 |     default_auto_reply="Please continue if not finished, otherwise return 'TERMINATE'.",
169 | )
170 | ```
171 | 
172 | ### Registering the functions
173 | 
174 | Now when we have the agents and the `user_proxy_agent` we can officially register the functions with the correct agents, and the agents with the user_proxy_agent using `register_function` provided from Autogen.
175 | 
176 | ```python
177 | from autogen import register_function
178 | 
179 | register_function(
180 |     scrape_page,
181 |     caller=scraper_agent,
182 |     executor=user_proxy_agent,
183 |     name="scrape_page",
184 |     description="Scrape a web page and return the content.",
185 | )
186 | 
187 | register_function(
188 |     crawl_page,
189 |     caller=crawler_agent,
190 |     executor=user_proxy_agent,
191 |     name="crawl_page",
192 |     description="Crawl an entire domain, following subpages and return the content.",
193 | )
194 | ```
195 | 
196 | Now we have officially linked all the agents together and can try talking to user_proxy_agent.
197 | 
198 | ## Using the agents
199 | 
200 | We can start the conversation with user_proxy_agent and say that we either want to crawl or scrape a specific website.
201 | 
202 | Then we can summarize the scraped and crawled page with Autogen's built in summary_method. We use reflection_with_llm to create a summary based on the conversation history, AKA the scraped or crawled content.
203 | 
204 | ```python
205 | # Scrape page
206 | scraped_chat_result = user_proxy_agent.initiate_chat(
207 |     scraper_agent,
208 |     message="Can you scrape william-espegren.com for me?",
209 |     summary_method="reflection_with_llm",
210 |     summary_args={
211 |         "summary_prompt": """Summarize the scraped content"""
212 |     },
213 | )
214 | 
215 | # Crawl page
216 | crawled_chat_result = user_proxy_agent.initiate_chat(
217 |     crawler_agent,
218 |     message="Can you crawl william-espegren.com for me, I want the whole domains information?",
219 |     summary_method="reflection_with_llm",
220 |     summary_args={
221 |         "summary_prompt": """Summarize the crawled content"""
222 |     },
223 | )
224 | ```
225 | 
226 | The output is stored in the summary:
227 | 
228 | ```python
229 | print(scraped_chat_result.summary)
230 | print(crawled_chat_result.summary)
231 | ```
232 | 
233 | ## Full code
234 | 
235 | Now we have two agents: one that scrapes a page and one that crawls a page following subpages. These two agents can you use in combination with your other Autogen agents.
236 | 
237 | ```python
238 | import os
239 | from spider import Spider
240 | from typing_extensions import Annotated
241 | from typing import List, Dict, Any
242 | from autogen import ConversableAgent
243 | from autogen import register_function
244 | 
245 | config_list = [
246 |     {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")},
247 | ]
248 | 
249 | spider_api_key = os.getenv("SPIDER_API_KEY")
250 | 
251 | def scrape_page(url: Annotated[str, "The URL of the web page to scrape"], params: Annotated[dict, "Dictionary of additional params."] = None) -> Annotated[Dict[str, Any], "Scraped content"]:
252 |     # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables
253 |     client = Spider(spider_api_key)
254 | 
255 |     if params == None:
256 |         params = {
257 |             "return_format": "markdown"
258 |         }
259 | 
260 |     scraped_data = client.scrape_url(url, params)
261 |     return scraped_data[0]
262 | 
263 | def crawl_page(url: Annotated[str, "The url of the domain to be crawled"], params: Annotated[dict, "Dictionary of additional params."] = None) -> Annotated[List[Dict[str, Any]], "Scraped content"]:
264 |     # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables
265 |     client = Spider(spider_api_key)
266 | 
267 |     if params == None:
268 |         params = {
269 |             "return_format": "markdown"
270 |         }
271 | 
272 |     crawled_data = client.crawl_url(url, params)
273 |     return crawled_data
274 | 
275 | # Create web scraper agent.
276 | scraper_agent = ConversableAgent(
277 |     "WebScraper",
278 |     llm_config={"config_list": config_list},
279 |     system_message="You are a web scraper and you can scrape any web page to retrieve its contents."
280 |     "Returns 'TERMINATE' when the scraping is done.",
281 | )
282 | 
283 | # Create web crawler agent.
284 | crawler_agent = ConversableAgent(
285 |     "WebCrawler",
286 |     llm_config={"config_list": config_list},
287 |     system_message="You are a web crawler and you can crawl any page with deeper crawling following subpages."
288 |     "Returns 'TERMINATE' when the scraping is done.",
289 | )
290 | 
291 | user_proxy_agent = ConversableAgent(
292 |     "UserProxy",
293 |     llm_config=False,  # No LLM for this agent.
294 |     human_input_mode="NEVER",
295 |     code_execution_config=False,  # No code execution for this agent.
296 |     is_termination_msg=lambda x: x.get("content", "") is not None and "terminate" in x["content"].lower(),
297 |     default_auto_reply="Please continue if not finished, otherwise return 'TERMINATE'.",
298 | )
299 | 
300 | register_function(
301 |     scrape_page,
302 |     caller=scraper_agent,
303 |     executor=user_proxy_agent,
304 |     name="scrape_page",
305 |     description="Scrape a web page and return the content.",
306 | )
307 | 
308 | register_function(
309 |     crawl_page,
310 |     caller=crawler_agent,
311 |     executor=user_proxy_agent,
312 |     name="crawl_page",
313 |     description="Crawl an entire domain, following subpages and return the content.",
314 | )
315 | 
316 | # Scrape page
317 | scraped_chat_result = user_proxy_agent.initiate_chat(
318 |     scraper_agent,
319 |     message="Can you scrape william-espegren.com for me?",
320 |     summary_method="reflection_with_llm",
321 |     summary_args={
322 |         "summary_prompt": """Summarize the scraped content"""
323 |     },
324 | )
325 | 
326 | print(scraped_chat_result.summary)
327 | ```
328 | 
329 | If you liked this guide, consider checking out me and Spider on Twitter:
330 | 
331 | - **Author Twitter:** [WilliamEspegren](https://x.com/WilliamEspegren)
332 | - **Spider Twitter:** [spider_rust](https://x.com/spider_rust)


--------------------------------------------------------------------------------
/crewai-spider-research-agent.md:
--------------------------------------------------------------------------------
  1 | # Stock Research Assistant Using crewAI and Spider
  2 | 
  3 | This guide will show you the power of using Spider with AI agents. Specifically, we're going to be using crewAI, a popular agent framework to scaffold our agent and orchestrate our research work flow. CrewAI has a great stock researcher guide on their site, so this guide is not to replace that, but to show you how to use Spider as an additional research tool.
  4 | 
  5 | ## Install and setup crewAI
  6 | 
  7 | ```shell
  8 | pip install crewai
  9 | ```
 10 | 
 11 | Then, we'll install additional tool dependencies
 12 | 
 13 | ```shell
 14 | pip install 'crewai[tools]'
 15 | ```
 16 | 
 17 | Then setup environment variables for our OpenAI api key, and the model string we're going to use. For our simple stock research example, we'll use `gpt-4-turbo` as it has a context size of 128K, plenty for our research agents to use. An alternative is to mix the models and an assign a different model to each of the agents. You can find more information on setting different [LLM configurations in the documentation](https://docs.crewai.com/how-to/LLM-Connections/#connect-crewai-to-llms).
 18 | 
 19 | ```py
 20 | import os
 21 | os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
 22 | os.environ["OPENAI_MODEL_NAME"] = "gpt-4-turbo"
 23 | ```
 24 | 
 25 | ### Setup Serper.dev for Google Search
 26 | 
 27 | We're going to use [serper.dev](https://serper.dev/) as our search tool in crewAI. Make sure to create an account and grab your api key.
 28 | 
 29 | ```py
 30 | import os
 31 | os.environ["SERPER_API_KEY"] = "Your Key"
 32 | from crewai_tools import SerperDevTool
 33 | search_tool = SerperDevTool()
 34 | ```
 35 | 
 36 | Refer to crew's [documentation](https://docs.crewai.com/how-to/Creating-a-Crew-and-kick-it-off/) for more information
 37 | 
 38 | ## Create a custom scrape tool
 39 | 
 40 | Before we setup our agents, we need to create a custom tool for scraping and summarizing content that the agents will be using for our report. We'll use Spider as our crawler and scraping tool. In case you're wondering, crewAI does have it's own scraping tool, but it's not as robust and fast as Spider.
 41 | 
 42 | ### Setup Spider
 43 | 
 44 | Getting started with the API is simple and straight forward. After you get your [secret key](https://spider.cloud/api-keys)
 45 | you can use the Spider LangChain document loader. We won't rehash the full setup guide for Spider here if you want to use the API directly, you can check out the [Spider API Guide](https://spider.cloud/guides/spider-api) to learn more. Let's move on.
 46 | 
 47 | Install the Spider Python client library:
 48 | 
 49 | ```bash
 50 | pip install spider-client
 51 | ```
 52 | 
 53 | Setup the api key in the env file:
 54 | 
 55 | ```bash
 56 | SPIDER_API_KEY=<your-api-key-here>
 57 | ```
 58 | 
 59 | Then import the `SpiderLoader` from the document loaders module:
 60 | 
 61 | ```python
 62 | from langchain_community.document_loaders import SpiderLoader
 63 | ```
 64 | 
 65 | First, we'll code up our Spider crawler like so as a data loader for LangChain:
 66 | 
 67 | ```py
 68 | 
 69 | def load_markdown_from_urls(url):
 70 |     loader = SpiderLoader(
 71 |         url=url,
 72 |         mode="crawl",
 73 |         params={
 74 |             "return_format": "markdown",
 75 |             "proxy_enabled": False,
 76 |             "request": "http",
 77 |             "request_timeout": 60,
 78 |             "limit": 1,
 79 |             "cache": False,
 80 |         },
 81 |     )
 82 |     data = loader.load()
 83 | 
 84 |     if data:
 85 |         return data
 86 |     else:
 87 |         return None
 88 | 
 89 | ```
 90 | 
 91 | To learn more about how to setup Spider, follow [this guide](https://spider.cloud/guides/spider-api).
 92 | 
 93 | Install LangChain if you haven't yet:
 94 | 
 95 | ```bash
 96 | pip install langchain
 97 | ```
 98 | 
 99 | [LangChain](https://www.langchain.com/) is a powerful LLM orchestration API framework, but in this example we'll use it more simply to put together our prompt and run a chain. To see LangChain's features and capabilities, check out their API documentation [here](https://python.langchain.com/).
100 | 
101 | Now, we continue to code up the rest of the summarization code, which we'll use LangChain for. We'll add the crewAI `tool` decorator so that our tool function can be utilized by our agents.
102 | 
103 | ```py
104 | from crewai_tools import tool
105 | 
106 | @tool("scrape_and_summarize")
107 | def scrape_and_summarize(urls: List[str]) -> str:
108 |     """Scrape website content based on one or more urls and summarize each based on the objective of the goal. Scrape up to 5 URLs at a time. Do not scrape or summarize PDF content types."""
109 | 
110 |     url_join_str = ",".join(urls)
111 |     content_docs = load_markdown_from_urls(url_join_str)
112 | 
113 |     llm = ChatOpenAI(model="gpt-4-turbo")
114 | 
115 |     document_prompt = PromptTemplate(
116 |         input_variables=["page_content"], template="{page_content}"
117 |     )
118 |     document_variable_name = "context"
119 |     prompt = PromptTemplate.from_template(
120 |         "Objective: Summarize this content in bullet points highlighting important insights. Be comprehensive, yet concise: {context}"
121 |     )
122 |     llm_chain = LLMChain(llm=llm, prompt=prompt)
123 |     stuff_chain = StuffDocumentsChain(
124 |         llm_chain=llm_chain,
125 |         document_prompt=document_prompt,
126 |         document_variable_name=document_variable_name,
127 |     )
128 |     output = stuff_chain.invoke(content_docs)["output_text"]
129 |     return output
130 | 
131 | ```
132 | 
133 | We won't go into details on how to setup our summarization chain, but the gist is that we take our documents (scraped content from all the URLs passed in) and stuff them into an llm chain for summarization. The output would then be fed back into the agents to use. Notice we also use `gpt-4-turbo` here as content could fill up the context window. Feel free to experiment with different LangChain prompt and chains in summarizing content.
134 | 
135 | ## Setup crewAI
136 | 
137 | The guys over at crewAI have done a fantastic job with the framework that makes setting up agents a breeze. We'll take a page from their [documentation](https://docs.crewai.com/how-to/Creating-a-Crew-and-kick-it-off/) in configuring our agents for a research use case. First, we need to figure out the make up of our "crew" members.
138 | 
139 | - Senior researcher: Our main research agent that will kick off the research and use tools like scrape and Google search to find web articles.
140 | - Writer: Our writer that will write the article
141 | 
142 | Now, next we have to define the tasks for each agent to do.
143 | 
144 | - Research fundamentals task: This task will highlight the strengths and weaknesses of the company we want to research
145 | - Research technicals task: In addition to fundamentals, we want to research the technicals of stock price. Obviously, this isn't needed but let's pretend we're a medium-long term trader who looks at technical analysis to know when is a good time to buy a stock.
146 | - Write task: Finally, we define the actual task for writing the article. We're going to focus on recent news about the company, overall market outlook, and then the industry in which the company operates in.
147 | 
148 | ## Setup our crews
149 | 
150 | ```py
151 | from crewai import Agent
152 | 
153 | researcher = Agent(
154 |     role="Senior Stock Researcher",
155 |     goal="Stock researcher for company or ticker: {company}",
156 |     verbose=True,
157 |     memory=True,
158 |     backstory="Driven by researching the next upcoming company stock that would make a good purchase",
159 |     tools=[search_tool, scrape_and_summarize],
160 |     allow_delegation=True,
161 | )
162 | 
163 | writer = Agent(
164 |     role="Writer",
165 |     goal="Blog writer for company stock {company}",
166 |     verbose=True,
167 |     memory=True,
168 |     backstory=(
169 |         "A writer for many popular business magazines and journals covering companies and business."
170 |     ),
171 |     tools=[search_tool, scrape_and_summarize],
172 |     allow_delegation=False,
173 | )
174 | ```
175 | 
176 | Pretty straightforward setup. We give each agent the ability to search and scrape content because each may want to conduct those intermediary tasks on their own. We allow the researcher to delegate tasks to the writer if they choose to. We also enable memory usage so that each agent retains information during and a across executions.
177 | 
178 | ## Setup our tasks
179 | 
180 | ```py
181 | from crewai import Task
182 | 
183 | research_fundamentals_task = Task(
184 |     description="Research stock for {company} based on the fundamentals of the company for 2024 and beyond."
185 |     "Focus on identifying the strengths and weaknesses for the given company and provide reasons for why the stock is a good or bad buy. Scrape search results by passing in urls to the scrape_and_summarize tool."
186 |     "Based on the scraped content, your final report should clearly articulate the key points,"
187 |     "its market opportunities, and potential risks of buying the stock. ONLY use scraped content from our search results for the report.",
188 |     expected_output="A comprehensive 4-6 paragraphs long report on company stock.",
189 |     tools=[search_tool, scrape_and_summarize],
190 |     agent=researcher,
191 | )
192 | 
193 | research_technicals_task = Task(
194 |     description="Research the technicals of stock chart for {company} for 2024 and beyond."
195 |     "Focus on identifying the strengths and weaknesses of what the charts and price are saying for the given company and provide reasons for why the stock is a good or bad buy based on this perspective. Scrape search results by passing in urls to the scrape_and_summarize tool."
196 |     "Based on the scraped content, your final report should clearly articulate the key points. ONLY use scraped content from our search results for the report.",
197 |     expected_output="A comprehensive 4-6 paragraphs long report on company stock.",
198 |     tools=[search_tool, scrape_and_summarize],
199 |     agent=researcher,
200 | )
201 | 
202 | write_task = Task(
203 |     description=(
204 |         "Compose an insightful article on {company}."
205 |         "Focus on recent news about the company, fundamental analysis like strengths and weaknesses, it's overall market outlook and the industry in which the company operates. Also, write an overview of it's stock from a technical analysis perspective."
206 |         "This article should be easy to understand, engaging, and positive. ONLY use scraped content from our search results for the report."
207 |     ),
208 |     expected_output="A 6-8 paragraph article on {company}, formatted as markdown.",
209 |     tools=[search_tool, scrape_and_summarize],
210 |     agent=writer,
211 |     async_execution=False,
212 |     output_file="COST-post.md",
213 |     human_input=True,
214 | )
215 | ```
216 | 
217 | You can see we add some instructions on how to use the scrape tool and to only use content that was scraped so that the agents do not make something up or rely on the LLM's internal knowledge. We set the expected output and filename for our article formatted as markdown. Human input is set to `True` in case we want to review the output and make changes to the final article.
218 | 
219 | Next, we'll finalize the crew and kick things off.
220 | 
221 | ```py
222 | from crewai import Process
223 | from crewai import Crew
224 | crew = Crew(
225 |     agents=[researcher, writer],
226 |     tasks=[research_fundamentals_task, research_technicals_task, write_task],
227 |     process=Process.sequential,
228 |     memory=True,
229 |     cache=True,
230 |     max_rpm=100,
231 |     share_crew=False,
232 |     output_log_file="crewai_spider.log",
233 | )
234 | ```
235 | 
236 | Here the settings are pretty self explanatory. The process will be sequential as researching and writing a blog post has a similar workflow. We'll set memory and cache to `True` for performance, especially during testing to save on LLM costs.
237 | 
238 | Then kickoff the crew!
239 | 
240 | ```py
241 | result = crew.kickoff(inputs={"company": "Costco Wholesale"})
242 | print(result)
243 | ```
244 | 
245 | Here we define the inputs of our crew agents. For our stock research it will be the company name. Notice that the `company` key is used throughout our previous code examples.
246 | 
247 | Example output of an agent performing Google search for `Costco Wholesale` stock analysis:
248 | 
249 | ```shell
250 | > Entering new CrewAgentExecutor chain...
251 | I need to gather recent and relevant information regarding Costco Wholesale's business fundamentals, focusing on its strengths, weaknesses, market opportunities, and potential risks related to its stock. I will start by searching for recent articles, analyst reports, and financial news related to Costco Wholesale's performance and projections for 2024 and beyond.
252 | 
253 | Action: Search the internet
254 | Action Input: {"search_query": "Costco Wholesale stock analysis 2024"}
255 | 
256 | 
257 | Search results: Title: Costco's Stock Is Expensive, But It Could Quickly Go Higher If This ...
258 | Link: https://www.fool.com/investing/2024/04/04/costcos-stock-is-expensive-but-it-could-quickly-go/
259 | Snippet: ... Price. $779.04. Price as of May 9, 2024, 4:00 p.m. ET. Is a price hike to Costco's membership coming soon? Costco Wholesale (COST 2.05%) is one ...
260 | ---
261 | Title: Costco (COST) Q2 2024 earnings - CNBC
262 | Link: https://www.cnbc.com/2024/03/07/costco-cost-q2-2024-earnings.html
263 | Snippet: Costco on Thursday missed Wall Street's revenue expectations for its holiday quarter, despite reporting year-over-year sales growth.
264 | ---
265 | Title: Costco Wholesale Corporation Reports Second Quarter and Year-to ...
266 | Link: https://investor.costco.com/news/news-details/2024/Costco-Wholesale-Corporation-Reports-Second-Quarter-and-Year-to-Date-Operating-Results-for-Fiscal-2024-and-February-Sales-Results/default.aspx
267 | Snippet: Costco Wholesale Corporation Reports Second Quarter and Year-to-Date Operating Results for Fiscal 2024 and February Sales Results ; ASSETS.
268 | ---
269 | Title: Costco Wholesale (COST) Stock Forecast and Price Target 2024
270 | Link: https://www.marketbeat.com/stocks/NASDAQ/COST/price-target/
271 | Snippet: The average twelve-month price prediction for Costco Wholesale is $694.48 with a high price target of $870.00 and a low price target of $550.00. Learn more on ...
272 | ---
273 | Title: Costco Profit Tops Expectations But Revenue Growth Disappoints
274 | Link: https://www.investopedia.com/costco-q2-fy2024-earnings-8605941
275 | Snippet: Costco reported better-than-expected earnings for the second quarter of fiscal 2024, but revenue growth was lower than analysts had anticipated.
276 | ---
277 | Title: Costco Wholesale Stock Has 11% Upside, According to 1 Wall ...
278 | Link: https://www.fool.com/investing/2024/04/04/costco-wholesale-stock-upside-wall-street-analyst/
279 | Snippet: ... Price. $787.19. Price as of May 10, 2024, 4:00 p.m. ET. This analyst is growing cautious about the company's prospects in the near term. Costco ...
280 | ---
281 | ```
282 | 
283 | And an agent reviewing the search results and executing our scrape tool.
284 | 
285 | ```shell
286 | Thought:
287 | The search results provided multiple sources with insights into Costco's financial performance and stock analysis for 2024. To better understand Costco Wholesale's business fundamentals, strengths, weaknesses, market opportunities, and potential risks for its stock, I will scrape and summarize content from the top relevant URLs.
288 | 
289 | Action: scrape_and_summarize
290 | Action Input: {"urls": ["https://www.fool.com/investing/2024/04/04/costcos-stock-is-expensive-but-it-could-quickly-go/", "https://www.cnbc.com/2024/03/07/costco-cost-q2-2024-earnings.html", "https://investor.costco.com/news/news-details/2024/Costco-Wholesale-Corporation-Reports-Second-Quarter-and-Year-to-Date-Operating-Results-for-Fiscal-2024-and-February-Sales-Results/default.aspx", "https://www.marketbeat.com/stocks/NASDAQ/COST/price-target/", "https://www.investopedia.com/costco-q2-fy2024-earnings-8605941"]}
291 | ```
292 | 
293 | And finally we have the output for our article report of our stock analysis for `Costco Wholesale`:
294 | 
295 | ```markdown
296 | # Comprehensive Analysis of Costco Wholesale in 2024
297 | 
298 | Costco Wholesale continues to excel in the competitive retail market, showcasing strong
299 | financial results and strategic growth initiatives as of 2024. The company reported a
300 | significant 9.4% year-over-year increase in net sales reaching $23.48 billion in March,
301 | with e-commerce experiencing an 18.4% surge, reflecting Costco's adeptness in integrating
302 | digital solutions into its business model.
303 | 
304 | ## Enhanced Market Position and Membership Growth
305 | 
306 | Costco's strategic positioning is reinforced by its extensive global presence,
307 | with 876 warehouses worldwide. The company has been focusing on both enhancing
308 | the in-store experience and expanding its digital footprint. Notably, membership
309 | dynamics have been positively influenced by stricter card checking, which has
310 | led to increased sign-ups and sustained membership loyalty,
311 | a critical factor in Costco's recurring revenue model.
312 | 
313 | ## Financial Stability and Shareholder Returns
314 | 
315 | Despite a slight shortfall in expected quarterly revenue, with $58.44 billion
316 | reported against projections of $59.16 billion, Costco's financial health
317 | remains robust, evidenced by a net income of $1.74 billion. The company’s
318 | commitment to shareholder returns is evident from the recent increase in
319 | its quarterly dividend from $1.02 to $1.16 per share, showcasing
320 | its financial confidence and commitment to returning value to its investors.
321 | 
322 | ## Navigating Challenges
323 | 
324 | Costco is not without its challenges; the company acknowledges potential impacts
325 | from broader economic conditions and competitive pressures. The
326 | ongoing geopolitical tensions, such as the Ukraine conflict, could pose
327 | supply chain risks. However, Costco's diversified global operations
328 | provide a buffer against localized economic disruptions.
329 | 
330 | ## Detailed Technical Stock Analysis
331 | 
332 | Turning to technical analysis, Costco's stock has demonstrated a strong
333 | upward trend, with a recent all-time high of $787.45. The stock's 52-week
334 | range from $476.75 to $787.19 illustrates significant volatility but
335 | also robust recovery and growth. Technical indicators suggest continued
336 | bullish trends, although the Relative Strength Index (RSI) nearing 70
337 | points to potential overbought conditions, hinting at possible short-term
338 | pullbacks. Investors should look for support levels at around $750 as
339 | potential buying points during dips.
340 | 
341 | ## Industry Perspective
342 | 
343 | Within the Retail - Discount & Variety Industry, Costco maintains a competitive
344 | advantage through its bulk-selling membership model, which is distinct from
345 | competitors like TJX and Target. This model has consistently driven high
346 | volume sales and customer retention through economic cycles, positioning
347 | Costco favorably against industry peers.
348 | 
349 | ## Forward-Looking Statements
350 | 
351 | Costco's forward-looking strategies include increasing its investment in
352 | technology and infrastructure to further enhance its e-commerce capabilities
353 | and improve operational efficiencies. The company's proactive approach in
354 | adapting to consumer trends and technological advancements bodes well for
355 | its sustained growth.
356 | 
357 | ## Conclusion
358 | 
359 | In summary, Costco's performance in 2024 has been marked by impressive sales
360 | growth, strategic market positioning, and strong financial health. While
361 | mindful of market risks, the company's ongoing investments in digital
362 | transformation and global expansion are likely to foster continued growth.
363 | Investors and stakeholders should remain optimistic about Costco's market
364 | trajectory while staying cautious of external economic and geopolitical
365 | factors that could influence the retail sector.
366 | ```
367 | 
368 | See the [full code here](https://gist.github.com/gbertb/c04216260cfa70583462cbf2f3a0260b).
369 | 
370 | So now we got a full working code that demonstrates the power of using Spider as a scraping tool used by our agents. Thanks for following along! Stay tuned for more guides.
371 | 


--------------------------------------------------------------------------------
/extracting-contacts.md:
--------------------------------------------------------------------------------
 1 | # Extract Contacts
 2 | 
 3 | ## Contents
 4 | 
 5 | Extracting data from websites using AI to consistently get contact information.
 6 | 
 7 | Our system handles:
 8 | - gathering the data with crawls
 9 | - advanced filtering pages
10 | - AI enhanced data extracting with OpenAI and Open-Source models
11 | - contact management (coming soon)
12 | 
13 | ## Seamless extracting any contact any website
14 | 
15 | Extracting contacts from a website used to be a very difficult challenge involving many steps that would change often. The challenges typically faced involve being able to get the data from a website without being blocked and setting up query selectors for the information you need using javascript. This would often break in two folds - the data extracting with a correct stealth technique or the css selector breaking as they update the website HTML code. Now we toss those two hard challenges away - one of them spider takes care of and the other the advancement in AI to process and extract information.
16 | 
17 | ## UI (Extracting Contacts)
18 | 
19 | You can use the UI on the dashboard to extract contacts after you crawled a page. Go to the page you 
20 | want to extract and click on the horizontal dropdown menu to display an option to extract the contact.
21 | The crawl will get the data first to see if anything new has changed. Afterwards if a contact was found usually within 10-60 seconds you will get a notification that the extraction is complete with the data.
22 | 
23 | ![Extracting contacts with the spider app](https://spider.cloud/img/app/extract-contacts.png)
24 | 
25 | After extraction if the page has contact related data you can view it with a grid in the app.
26 | 
27 | ![The menu displaying the found contacts after extracting with the spider app](https://spider.cloud/img/app/extract-contacts-found.png)
28 | 
29 | The grid will display the name, email, phone, title, and host(website found) of the contact(s).
30 | 
31 | ![Grid display of all the contact information found for the web page](https://spider.cloud/img/app/extract-contacts-grid.png)
32 | 
33 | ## API Extracting Usage
34 | 
35 | The endpoint `/pipeline/extract-contacts` provides the ability to extract all contacts from a website concurrently.
36 | 
37 | ### API Extracting Example
38 | 
39 | To extract contacts from a website you can follow the example below. All params are optional except `url`. Use the `prompt` param to adjust the way the AI handles the extracting. If you use the param `store_data` or if the website already exist in the dashboard the contact data will be saved with the page.
40 | 
41 | ```py
42 | import requests, os, json
43 | 
44 | headers = {
45 |     'Authorization': os.environ["SPIDER_API_KEY"],
46 |     'Content-Type': 'application/json',
47 | }
48 | 
49 | json_data = {"limit":1,"url":"http://www.example.com/contacts", "model": "gpt-4-1106-preview", "prompt": "A custom prompt to tailor the extracting."}
50 | 
51 | response = requests.post('https://api.spider.cloud/v1/pipeline/extract-contacts', 
52 |   headers=headers, 
53 |   json=json_data,
54 |   stream=True)
55 | 
56 | for line in response.iter_lines():
57 |   if line:
58 |       print(json.loads(line))
59 | ```
60 | 
61 | ### Pipelines Combo
62 | 
63 | Pipelines bring a whole new entry to workflows for data curation, if you combine the API endpoints to only use the extraction on pages you know may have contacts can save credits on the system. One way would be to perform gathering all the links first with the `/links` endpoint. After getting the links for the pages use `/pipeline/filter-links` with a custom prompt that can use AI to reduce the noise of the links to process before `/pipline/extract-contacts`.
64 | 


--------------------------------------------------------------------------------
/images/anti_bot/abrahamjuliot_github_io_creepjs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spider-rs/web-crawling-guides/959d9d1686893f0c37008c6c31701dbe3186fa11/images/anti_bot/abrahamjuliot_github_io_creepjs.png


--------------------------------------------------------------------------------
/images/anti_bot/bot_detector_rebrowser_net.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spider-rs/web-crawling-guides/959d9d1686893f0c37008c6c31701dbe3186fa11/images/anti_bot/bot_detector_rebrowser_net.png


--------------------------------------------------------------------------------
/images/anti_bot/bot_sannysoft_com.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spider-rs/web-crawling-guides/959d9d1686893f0c37008c6c31701dbe3186fa11/images/anti_bot/bot_sannysoft_com.png


--------------------------------------------------------------------------------
/images/anti_bot/demo_fingerprint_com_playground.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spider-rs/web-crawling-guides/959d9d1686893f0c37008c6c31701dbe3186fa11/images/anti_bot/demo_fingerprint_com_playground.png


--------------------------------------------------------------------------------
/images/anti_bot/deviceandbrowserinfo_com_are_you_a_bot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spider-rs/web-crawling-guides/959d9d1686893f0c37008c6c31701dbe3186fa11/images/anti_bot/deviceandbrowserinfo_com_are_you_a_bot.png


--------------------------------------------------------------------------------
/images/anti_bot/deviceandbrowserinfo_com_info_device.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spider-rs/web-crawling-guides/959d9d1686893f0c37008c6c31701dbe3186fa11/images/anti_bot/deviceandbrowserinfo_com_info_device.png


--------------------------------------------------------------------------------
/images/anti_bot/www_browserscan_net_bot_detection.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spider-rs/web-crawling-guides/959d9d1686893f0c37008c6c31701dbe3186fa11/images/anti_bot/www_browserscan_net_bot_detection.png


--------------------------------------------------------------------------------
/images/spider-logo-github-dark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spider-rs/web-crawling-guides/959d9d1686893f0c37008c6c31701dbe3186fa11/images/spider-logo-github-dark.png


--------------------------------------------------------------------------------
/images/spider-logo-github-light.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spider-rs/web-crawling-guides/959d9d1686893f0c37008c6c31701dbe3186fa11/images/spider-logo-github-light.png


--------------------------------------------------------------------------------
/langchain-groq.md:
--------------------------------------------------------------------------------
  1 | # LangChain + Groq + Spider = 🚀 (Integration Guide)
  2 | 
  3 | This guide will show you the power of combining LangChain, with the fastest inference available using Groq and the fastest crawler API available. We will crawl multiple URLs and then pass the contents to an LLM to summarize using the Meta's new [LLama 3](https://llama.meta.com/llama3/) model.
  4 | 
  5 | ## Setup Groq
  6 | 
  7 | Get Groq setup and running in a few minutes with the following steps:
  8 | 
  9 | 1. Create an API Key [here](https://console.groq.com/keys)
 10 | 
 11 | 2. Install Groq and setup the api key in your project as an environment variable. Simple approach that prevents you from hardcoding the key in your code.
 12 | 
 13 | ```bash
 14 | pip install groq
 15 | ```
 16 | 
 17 | In your terminal:
 18 | 
 19 | ```bash
 20 | export GROQ_API_KEY=<your-api-key-here>
 21 | ```
 22 | 
 23 | Alternatively, you can use the `dotenv` package to load the environment variables from a `.env` file. Create a `.env` file in your project root and add the following:
 24 | 
 25 | ```bash
 26 | GROQ_API_KEY=<your-api-key-here>
 27 | ```
 28 | 
 29 | Then in your Python code:
 30 | 
 31 | ```py
 32 | from dotenv import load_dotenv
 33 | import os
 34 | 
 35 | load_dotenv()
 36 | 
 37 | client = Groq(
 38 |     api_key=os.environ.get("GROQ_API_KEY"),
 39 | )
 40 | ```
 41 | 
 42 | 3. Test groq and see if things are working correctly:
 43 | 
 44 | ```python
 45 | import os
 46 | 
 47 | from groq import Groq
 48 | 
 49 | client = Groq(
 50 |     api_key=os.environ.get("GROQ_API_KEY"),
 51 | )
 52 | 
 53 | chat_completion = client.chat.completions.create(
 54 |     messages=[
 55 |         {
 56 |             "role": "user",
 57 |             "content": "What are large language models?",
 58 |         }
 59 |     ],
 60 |     model="llama3-8b-8192",
 61 | )
 62 | ```
 63 | 
 64 | We'll be using the Llama 3 8B model for this example–very fast! This model should give us about 800 tokens per second or more. The larger 70B model will give you around 150 tokens per second.
 65 | 
 66 | ## Setup Spider
 67 | 
 68 | Getting started with the API is simple and straight forward. After you get your [secret key](/api-keys)
 69 | you can use the Spider LangChain document loader. We won't rehash the full setup guide for Spider here if you want to use the API directly, you can check out the [Spider API Guide](spider-api) to learn more. Let's move on.
 70 | 
 71 | Install the Spider Python client library:
 72 | 
 73 | ```bash
 74 | pip install spider-client
 75 | ```
 76 | 
 77 | Setup the api key in the env file:
 78 | 
 79 | ```bash
 80 | SPIDER_API_KEY=<your-api-key-here>
 81 | ```
 82 | 
 83 | Then import the `SpiderLoader` from the document loaders module:
 84 | 
 85 | ```python
 86 | from langchain_community.document_loaders import SpiderLoader
 87 | ```
 88 | 
 89 | Let's setup the Spider API for our example use case:
 90 | 
 91 | ```python
 92 | def load_markdown_from_url(urls):
 93 |     loader = SpiderLoader(
 94 |         url=urls,
 95 |         mode="crawl",
 96 |         params={
 97 |             "return_format": "markdown",
 98 |             "proxy_enabled": False,
 99 |             "request": "http",
100 |             "request_timeout": 60,
101 |             "limit": 1,
102 |         },
103 |     )
104 |     data = loader.load()
105 | ```
106 | 
107 | Set the mode to `crawl` and we use the `return_format` parameter to specify we want to markdown content. The rest of the parameters are optional.
108 | 
109 | ### Reminder
110 | 
111 | Spider handles automatic concurrency handling and ip rotation to make it simple to scrape multiple urls at once.
112 | The more credits you have or usage available allows for a higher concurrency limit. So make sure you have enough credits if you choose to cawl more than one page.
113 | 
114 | For now, we'll turn off the proxy and move on to setting up LangChain.
115 | 
116 | ## Setup LangChain
117 | 
118 | Install LangChain if you haven't yet:
119 | 
120 | ```bash
121 | pip install langchain
122 | ```
123 | 
124 | Let's install the Groq LangChain package
125 | 
126 | ```bash
127 | pip install langchain-groq
128 | ```
129 | 
130 | [LangChain](https://www.langchain.com/) is a powerful LLM orchestration API framework, but in this example we'll use it more simply to put together our prompt and run a chain. To see LangChain's features and capabilities, check out their API documentation [here](https://python.langchain.com/).
131 | 
132 | Let's setup LangChain with a simple chain using the chat prompt template:
133 | 
134 | ```python
135 | from langchain_core.prompts import ChatPromptTemplate
136 | markdown_content = "Ancana: Marketplace to buy managed vacation   homes through fractional ownership | Y Combinator...."
137 | system = "You are a helpful assistant. Please summarize this company in bullet points."
138 | prompt = ChatPromptTemplate.from_messages(
139 |     [("system", system), ("human", "{markdown_text}")]
140 | )
141 | 
142 | chain = prompt | chat
143 | 
144 | urls = [
145 |     "https://www.ycombinator.com/companies/airbnb",
146 |     "https://www.ycombinator.com/companies/ancana",
147 | ]
148 | 
149 | summary = chain.invoke({"markdown_text": markdown_content})
150 | print(summary.content)
151 | ```
152 | 
153 | And the results should be something like this:
154 | 
155 | ```markdown
156 | ## output
157 | 
158 | **Ancana:**
159 | 
160 | - Founded in 2019
161 | - Marketplace for buying and owning fractional shares of vacation homes
162 | - Allows users to purchase a share of a property and split expenses with other co-owners
163 | - Offers features like property management, furnishing, and maintenance
164 | - Founded by Andres Barrios and Ryan Black
165 | ```
166 | 
167 | Great, tt's looking good now! The goal now is we want to crawl two startup page descriptions from Y-Combinator's site, summarize them and have the LLM tell me the difference between the two.
168 | 
169 | But first we need to combine the markdown content from both companies and format it so that the LLM can easily understand it.
170 | 
171 | ```py
172 | def concat_markdown_content(markdown_content):
173 |     final_content = ""
174 |     separator = "=" * 40
175 | 
176 |     for content_dict in markdown_content:
177 |         url = content_dict["url"]
178 |         markdown_text = content_dict["markdown_text"]
179 |         title = content_dict["title"]
180 |         final_content += f"{separator}\nURL: {url}\nPage Title: {title}\nMarkdown:\n{separator}\n{markdown_text}\n"
181 | 
182 |     return final_content
183 | ```
184 | 
185 | By adding a demarcation and some metadata, the LLM should have enough context to summarize and compare the two companies.
186 | 
187 | Let's also update the system prompt to help us achieve our goal.
188 | 
189 | ```python
190 | system = "You are a helpful assistant. Please summarize the markdown content for Ancana and AirBNB provided in the context in bullet points separately. And then provide a summary of how they are similar and different."
191 | ```
192 | 
193 | And that's all! Check out the full code and we'll update this guide to link out to a notebook, so you can easily play with it.
194 | 
195 | ## Full code
196 | 
197 | ```python
198 | from langchain_community.document_loaders import SpiderLoader
199 | from langchain_core.prompts import ChatPromptTemplate
200 | from langchain_groq import ChatGroq
201 | from dotenv import load_dotenv
202 | import os
203 | import time
204 | 
205 | load_dotenv()
206 | 
207 | chat = ChatGroq(
208 |     temperature=0,
209 |     groq_api_key=os.environ.get("GROQ_API_KEY"),
210 |     model_name="llama3-70b-8192",
211 | )
212 | 
213 | 
214 | def load_markdown_from_urls(url):
215 |     loader = SpiderLoader(
216 |         url=url,
217 |         mode="crawl",
218 |         params={
219 |             "return_format": "markdown",
220 |             "proxy_enabled": False,
221 |             "request": "http",
222 |             "request_timeout": 60,
223 |             "limit": 1,
224 |             "cache": False,
225 |         },
226 |     )
227 |     data = loader.load()
228 | 
229 |     if data:
230 |         return data
231 |     else:
232 |         return None
233 | 
234 | 
235 | def concat_markdown_content(markdown_content):
236 |     final_content = ""
237 |     separator = "=" * 40
238 | 
239 |     for content_dict in markdown_content:
240 |         url = content_dict["url"]
241 |         markdown_text = content_dict["markdown_text"]
242 |         title = content_dict["title"]
243 |         final_content += f"{separator}\nURL: {url}\nPage Title: {title}\nMarkdown:\n{separator}\n{markdown_text}\n"
244 | 
245 |     return final_content
246 | 
247 | 
248 | system = "You are a helpful assistant. Please summarize the markdown content for Ancana and AirBNB provided in the context in bullet points separately. And then provide a summary of how they are similar and different."
249 | prompt = ChatPromptTemplate.from_messages(
250 |     [("system", system), ("human", "{markdown_text}")]
251 | )
252 | 
253 | chain = prompt | chat
254 | 
255 | start_time = time.time()
256 | urls = [
257 |     "https://www.ycombinator.com/companies/airbnb",
258 |     "https://www.ycombinator.com/companies/ancana",
259 | ]
260 | 
261 | url_join = ",".join(urls)
262 | markdown_contents = load_markdown_from_urls(url_join)
263 | all_contents = []
264 | 
265 | for content in markdown_contents:
266 |     if content.page_content:
267 |         all_contents.append(
268 |             {
269 |                 "markdown_text": content.page_content,
270 |                 "url": content.metadata["url"],
271 |                 "title": content.metadata["title"],
272 |             }
273 |         )
274 | 
275 |     concatenated_markdown = concat_markdown_content(all_contents)
276 | 
277 | summary = chain.invoke({"markdown_text": concatenated_markdown})
278 | end_time = time.time()
279 | elapsed_time = end_time - start_time
280 | print(f"Elapsed time: {elapsed_time:.2f} seconds")
281 | print(summary.content)
282 | ```
283 | 
284 | ```markdown
285 | # Output
286 | 
287 | Here are the bullet points summarizing Ancana:
288 | 
289 | **Ancana**
290 | 
291 | - Marketplace to buy managed vacation homes through fractional ownership
292 | - Allows individuals to own a share of a vacation home (e.g. 1/4 or 1/8) and split expenses with other co-owners
293 | - Properties are furnished and managed by Ancana, with costs split amongst co-owners
294 | - Owners have access to the property for a set period of time (e.g. 3 months per year)
295 | - Founded in 2019, based in Mérida, Mexico
296 | - Team size: 7
297 | - Founders: Andres Barrios, Ryan Black
298 | 
299 | And here are the bullet points summarizing Airbnb:
300 | 
301 | **Airbnb:**
302 | 
303 | - Online marketplace for booking unique accommodations around the world
304 | - Founded in 2008 and based in San Francisco, California
305 | - Allows users to list and discover unique spaces, from apartments to castles
306 | - Offers a trusted community marketplace for people to monetize their extra space
307 | - Has over 33,000 cities and 192 countries listed on the platform
308 | - Team size: 6,132
309 | 
310 | **Similarities:**
311 | 
312 | - Both Ancana and Airbnb operate in the travel and accommodation space
313 | - Both platforms allow users to book and stay in unique properties
314 | - Both companies focus on providing a trusted and community-driven marketplace for users
315 | 
316 | **Differences:**
317 | 
318 | - Ancana focuses on fractional ownership of vacation homes, while Airbnb is a booking platform for short-term rentals
319 | - Ancana properties are fully managed and furnished, while Airbnb listings can vary in terms of amenities and services
320 | - Ancana is a newer company with a smaller team size, while Airbnb is a more established company with a larger team and global presence
321 | ```
322 | 
323 | When benchmarking this example we got an elapsed time of around 7.58 seconds (Cache disabled) total.
324 | 
325 | So now we got a full working code that demonstrates the power of Groq and Spider, coupled with LangChain to format our prompts together. Thanks for following along! Stay tuned for more guides.
326 | 


--------------------------------------------------------------------------------
/proxy-mode.md:
--------------------------------------------------------------------------------
 1 | # Getting started with Proxy Mode
 2 | 
 3 | ## What is the proxy mode?
 4 | 
 5 | Spider also offers a proxy front-end to the API. This can make integration with third-party tools easier. The Proxy mode only changes the way you access Spider. The Spider API will then handle requests just like any standard request.
 6 | 
 7 | Request cost, return code and default parameters will be the same as a standard no-proxy request.
 8 | 
 9 | We recommend disabling JavaScript rendering in proxy mode, which is enabled by default. The following credentials and configurations are used to access the proxy mode:
10 | 
11 | - **HTTP address**: `proxy.spider.cloud:8888`
12 | - **HTTPS address**: `proxy.spider.cloud:8889`
13 | - **Username**: `YOUR-API-KEY`
14 | - **Password**: `PARAMETERS`
15 | 
16 | Important: Replace `PARAMETERS` with our supported API parameters. If you don't know what to use, you can begin by using `render_js=False`. If you want to use multiple parameters, use `&` as a delimiter, example: `render_js=False&premium_proxy=True`.
17 | 
18 | As an alternative, you can use URLs like the following:
19 | 
20 | ```json
21 | {
22 | 	"url": "http://proxy.spider.cloud:8888",
23 | 	"username": "YOUR-API-KEY",
24 | 	"password": "render_js=False&premium_proxy=True"
25 | }
26 | ```
27 | 
28 | ## Spider Proxy Features
29 | 
30 | - **Premium proxy rotations**: no more headaches dealing with IP blocks
31 | - **Cost-effective**: 1 credit per base request and two credits for premium proxies.
32 | - **Full concurrency**: crawl thousands of pages in seconds, yes that isn't a typo!
33 | - **Caching**: repeated page crawls ensures a speed boost in your crawls
34 | - **Optimal response format**: Get clean and formatted markdown, HTML, or text for LLM and AI agents
35 | - **Avoid anti-bot detection**: measures that further lower the chances of crawls being blocked
36 | - [And many more](/docs/api)
37 | 
38 | ## HTTP Proxies built to scale
39 | 
40 | At the time of writing, we now have http and https proxying capabilities to leverage Spider to gather data.
41 | 
42 | ## Proxy Usage
43 | 
44 | Getting started with the proxy is simple and straightforward. After you get your [secret key](/api-keys)
45 | you can access our instance directly.
46 | 
47 | ```py
48 | import requests, os, logging
49 | 
50 | # Set debug level logging
51 | logging.basicConfig(level=logging.DEBUG)
52 | 
53 | # Proxy configuration
54 | proxies = {
55 |     'http': f"http://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8888",
56 |     'https': f"https://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8889"
57 | }
58 | 
59 | # Function to make a request through the proxy
60 | def get_via_proxy(url):
61 |     try:
62 |         response = requests.get(url, proxies=proxies)
63 |         response.raise_for_status()
64 |         logging.info('Response HTTP Status Code: ', response.status_code)
65 |         logging.info('Response HTTP Response Body: ', response.content)
66 |         return response.text
67 |     except requests.exceptions.RequestException as e:
68 |         logging.error(f"Error: {e}")
69 |         return None
70 | 
71 | # Example usage
72 | if __name__ == "__main__":
73 |      get_via_proxy("https://www.choosealicense.com")
74 |      get_via_proxy("https://www.choosealicense.com/community")
75 | ```
76 | 
77 | ## Request Cost Structure
78 | 
79 | | Request Type        | Cost (Credits) |
80 | | ------------------- | -------------- |
81 | | Base request (HTTP) | 1 credit       |
82 | | Premium proxies     | 2 credits      |
83 | | Chrome              | 4 credits      |
84 | | Smart-mode          | 1 - 4 credits  |
85 | 
86 | 
87 | ### Coming soon
88 | 
89 | Some of the params and socks5 are not available at the time of writing.
90 | 
91 | 1. JS rendering.
92 | 1. Transforming to markdown etc.
93 | 1. Readability.
94 | 


--------------------------------------------------------------------------------
/spider-api.md:
--------------------------------------------------------------------------------
 1 | # Spider API Overview
 2 | 
 3 | ## Spider API Features
 4 | 
 5 | - **Premium proxy rotations**: no more headaches dealing with IP blocks
 6 | - **Cost-effective**: crawl many pages for a fraction of the cost of what other services charge
 7 | - **Full concurrency**: crawl thousands of pages in seconds, yes that isn't a typo!
 8 | - **Smart mode**: ensure fast crawling while considering javascript rendered pages into account using headless Chrome
 9 | - **Caching**: repeated page crawls ensures a speed boost in your crawls
10 | - **Optimal response format**: Get clean and formatted markdown, HTML, or text for LLM and AI agents
11 | - **Scrape with AI (Beta)**: do custom browser scripting and data extraction using the latest AI models
12 | - **Avoid anti-bot detection**: measures that further lowers the chances of crawls being blocked
13 | - [And many more](/docs/api)
14 | 
15 | 
16 | ## API built to scale
17 | 
18 | Welcome to the fastest web crawler API. We want you to experience the full potential of our platform, which is why we have designed our API to be highly scalable and efficient. 
19 | 
20 | Our platform is designed to effortlessly manage thousands of requests per second, thanks to our elastically scalable system architecture and the Open-Source [spider](https://github.com/spider-rs/spider) project. We deliver consistent latency times ensuring processing for all responses. 
21 | 
22 | For an in-depth understanding of the request parameters supported, we invite you to explore our comprehensive API documentation. At present, we do not provide client-side libraries, as our API has been crafted with simplicity in mind for straightforward usage. However, we are open to expanding our offerings in the future to enhance user convenience.
23 | 
24 | Dive into our [documentation](/docs/api) to get started and unleash the full potential of our web crawler today.
25 | 
26 | ## API usage
27 | 
28 | Getting started with the API is simple and straight forward. After you get your [secret key](/api-keys)
29 | you can access our instance directly. Or if you prefer, you can use our client SDK libraries for [python](https://github.com/spider-rs/spider-clients/tree/main/python) and [javascript](https://github.com/spider-rs/spider-clients/tree/main/javascript). The crawler is highly configurable through the params to fit all needs and use cases when using the API directly or client libraries.
30 | 
31 | ## Use Spider in LangChain or LlamaIndex
32 | ### LangChain
33 | - [Documentation](https://python.langchain.com/v0.1/docs/integrations/document_loaders/spider/) using Spider as a data loader (Python)
34 | - Javascript coming soon!
35 | 
36 | ### LlamaIndex
37 | - [Documentation](https://docs.llamaindex.ai/en/stable/examples/data_connectors/WebPageDemo/?h=spider#using-spider-reader) for using Spider as a web page reader
38 | 
39 | ## Crawling one page
40 | 
41 | Most cases you probally just want to crawl one page. Even if you only need one page, our system performs fast enough to lead the race.
42 | The most straight forward way to make sure you only crawl a single page is to set the [budget limit](/account/settings) with a wild card value or `*` to 1.
43 | You can also pass in the param `limit` in the JSON body with the limit of pages.
44 | 
45 | ## Crawling multiple pages
46 | 
47 | When you crawl multiple pages, the concurrency horsepower of the spider kicks in. You might wonder why and how one request may take (x)ms to come back, and 100 requests take about the same time! That's because the built-in isolated concurrency allows for crawling thousands to millions of pages in no time. It's the only current solution that can handle large websites with over 100k pages within a minute or two (sometimes even in a blink or two). By default, we do not add any limits to crawls unless specified.
48 | 
49 | 
50 | ## Optimized response format for large language models (LLM) and AI agents
51 | 
52 | Get the response in markdown that is clean, easy to parse, and save on token cost when using LLMs.
53 | ```py
54 | 
55 | json_data = {""return_format":"markdown","url":"http://www.example.com", "limit": 5}
56 | 
57 | response = requests.post('https://api.spider.cloud/v1/crawl', 
58 |   headers=headers, 
59 |   json=json_data)
60 | ```
61 | 
62 | Or if you prefer, you can get the response in `raw` HTML format or `text` only.
63 | 
64 | ## Planet scale crawling
65 | 
66 | If you plan on processing crawls that have over 200 pages, we recommend streaming the request from the client instead of parsing the entire payload once finished. We have an example of this with Python on the API docs page, also shown below.
67 | 
68 | ```py
69 | import requests, os, json
70 | 
71 | headers = {
72 |     'Authorization': os.environ["SPIDER_API_KEY"],
73 |     'Content-Type': 'application/json',
74 | }
75 | 
76 | json_data = {"limit":250,"url":"http://www.example.com"}
77 | 
78 | response = requests.post('https://api.spider.cloud/v1/crawl', 
79 |   headers=headers, 
80 |   json=json_data,
81 |   stream=True)
82 | 
83 | for line in response.iter_lines():
84 |   if line:
85 |       print(json.loads(line))
86 | ```
87 | 
88 | ### Automatic configuration
89 | 
90 | Spider handles automatic concurrency handling, proxy rotations (if enabled), anti-bot measures, and more. 
91 | 
92 | ## Get started with using the API
93 | 1. [Add credits](/credits/new) to your account balance. The more credits you have or usage available allows for a higher concurrency limit.
94 | 2. [Create](/api-keys) an API key
95 | 
96 | Thanks for using Spider! We are excited to see what you build with our API. If you have any questions or need help, please contact us through the feedback form.


--------------------------------------------------------------------------------
/website-archiving.md:
--------------------------------------------------------------------------------
 1 | # How to Archive Better with Spider
 2 | 
 3 | Spider offers advanced features that can optimize your website archiving process. Here’s how you can leverage these tools to ensure a comprehensive and efficient archive.
 4 | 
 5 | ## Utilize Full Resource Storing
 6 | 
 7 | Spider allows you to enable Full Resource storing in the settings or website configuration. This feature ensures that all the elements of a website, including images, scripts, and stylesheets, are captured. Here’s how to enable it:
 8 | 
 9 | 1. **Access Settings**: Log into your Spider account and navigate to the settings panel.
10 | 2. **Enable Full Resource Storing**: Find the option for Full Resource storing and toggle it on. This will ensure that every resource on the website is archived.
11 | 3. **Configure Website Specific Settings**: For specific websites, you can configure these settings within the website configuration screen. This customization can help tailor the archiving process according to the needs of each site.
12 | 
13 | ### Schedule Regular Crawls
14 | 
15 | Regularly scheduled crawls can help maintain up-to-date archives of your websites. Here’s how to set them up:
16 | 
17 | 1. **Navigate to Crawl Scheduler**: Within Spider, go to the Crawl Scheduler section from the dashboard.
18 | 2. **Set Frequency and Timing**: Choose how often you want the crawl to take place (daily, weekly, monthly) and set the specific timing according to your preference.
19 | 3. **Select Websites for Crawling**: Choose the websites you wish to include in the regular crawling schedule.
20 | 4. **Activate Scheduling**: Save your settings to activate the scheduled crawls.
21 | 
22 | ### Leverage Incremental Crawling
23 | 
24 | Incremental crawling helps in capturing only the changes made since the last crawl, saving time and storage space. Here’s how to use it:
25 | 
26 | 1. **Enable Incremental Crawling**: In the website configuration panel, enable the incremental crawling feature.
27 | 2. **Configure Change Detection**: Set up criteria for change detection, such as monitoring file modifications or specific webpage elements.
28 | 3. **Run Incremental Crawls**: Once setup, Spider will only save the changes, making the archive process more efficient.
29 | 
30 | ### Download and Store Data Effectively
31 | 
32 | Efficient data downloading and storage ensure that you can access your archived data when needed. Here’s how to manage it effectively:
33 | 
34 | 1. **Run Crawls and Download Data**: After running your scheduled or manual crawls, go to the download section to obtain your archived data.
35 | 2. **Organize Downloads**: Store the downloaded data in a well-organized file system. Consider using a dedicated folder structure that categorizes data by date or website.
36 | 3. **Backup Archives**: Regularly back up your downloaded archives to an external storage solution or cloud-based service to prevent data loss.
37 | 
38 | ### Monitor and Review Archives
39 | 
40 | Regular monitoring and reviewing of your archives help in maintaining their integrity. Here’s how to perform these tasks:
41 | 
42 | 1. **Periodic Reviews**: Schedule regular reviews of your archived data to ensure everything is captured accurately.
43 | 2. **Compare and Validate**: Compare current live websites with your archives to validate the completeness of the stored data.
44 | 3. **Issue Tracking**: Keep an issue tracker for any discrepancies or missing elements within your archives, and address them promptly using Spider’s tools.
45 | 
46 | By leveraging these advanced features and best practices, you can enhance the effectiveness of your website archiving process with Spider, ensuring that your data is stored accurately and efficiently for future reference.


--------------------------------------------------------------------------------