├── README.md
└── llm_dialogue_dataset_full_version.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # ai-twinkle-llm-lab-llm-dialogue-dataset
2 |
--------------------------------------------------------------------------------
/llm_dialogue_dataset_full_version.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "id": "view-in-github",
7 | "colab_type": "text"
8 | },
9 | "source": [
10 | "
"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "source": [
16 | "# 使用 API 建立對話式訓練資料集(Colab 實作)\n",
17 | "\n",
18 | "此 Colab 實作將會完整處理實作對話資料生成工作\n",
19 | "\n",
20 | "註明:本 Colab 是由 Simon Liu 根據 [Twinkle AI - 使用 Gemma-3-12B-it API 建立對話式訓練資料集(Colab 實作)](https://github.com/ai-twinkle/llm-lab/blob/main/courses/2025-08-llm-dialogue-dataset/00_setup_and_api_call.ipynb) 編修完成"
21 | ],
22 | "metadata": {
23 | "id": "2TLRvB-1RJrm"
24 | },
25 | "id": "2TLRvB-1RJrm"
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "source": [
30 | "# I am Simon\n",
31 | "\n",
32 | "大家好,我是 Simon 劉育維,是一位 AI 領域解決方案專家,目前也擔任 Google GenAI 領域開發者專家 (GDE),期待能夠幫助企業導入人工智慧相關技術解決問題。如果這篇文章對您有幫助,請在 Medium 上按一下鼓勵,並追蹤我的個人帳號,這樣您就可以隨時閱讀我所撰寫的文章。歡迎在我的 Linkedin 上留言提供意見,並與我一起討論有關人工智慧的主題,期待能夠對大家有所幫助!"
33 | ],
34 | "metadata": {
35 | "id": "QjRAgnGbT76D"
36 | },
37 | "id": "QjRAgnGbT76D"
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "id": "4df8ae2d-0c56-48f2-bef6-d7798800bfd5",
42 | "metadata": {
43 | "id": "4df8ae2d-0c56-48f2-bef6-d7798800bfd5"
44 | },
45 | "source": [
46 | "## 1 對話資料生成 & 對話集格式介紹\n",
47 | "\n",
48 | "在這個章節,目標是建立一份「可持續擴充」的對話資料集。主要的步驟如下:\n",
49 | "\n",
50 | "1. 使用 OpenAI SDK 連 API\n",
51 | "2. 介紹對話資料的常見格式:**Alpaca**, **ShareGPT**,以及 **OpenAI** 格式(我們採用後者)\n",
52 | "3. 探討 `.jsonl` 格式與 `.parquet` 格式的優缺點,並說明 HF Hub 對 parquet 的轉換支援\n",
53 | " (上傳 parquet 時 HF 會自動生成 `.parquet` 分支與 viewer)"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "source": [
59 | "### 1.1 初始化 OpenAI API 環境參數"
60 | ],
61 | "metadata": {
62 | "id": "Cp6NGO-N6IpY"
63 | },
64 | "id": "Cp6NGO-N6IpY"
65 | },
66 | {
67 | "cell_type": "code",
68 | "source": [
69 | "# 取得 Colab 金鑰環境變數\n",
70 | "\n",
71 | "from google.colab import userdata"
72 | ],
73 | "metadata": {
74 | "id": "MpqGewAZCGJi"
75 | },
76 | "id": "MpqGewAZCGJi",
77 | "execution_count": 1,
78 | "outputs": []
79 | },
80 | {
81 | "cell_type": "code",
82 | "source": [
83 | "# 初始化 OpenAI 套件設定\n",
84 | "# @markdown 請設定以下 OpenAI Compatible 的變數數值\n",
85 | "\n",
86 | "\n",
87 | "from openai import OpenAI\n",
88 | "\n",
89 | "# 請去申請 Google API Key ,然後放在 Colab 左邊側邊欄,「鑰匙」的地方,保護你的 Key\n",
90 | "API_KEY = userdata.get('GOOGLE_API_KEY') #@param {type:\"string\"}\n",
91 | "BASE_URL = \"https://generativelanguage.googleapis.com/v1beta/openai/\" #@param {type:\"string\"}\n",
92 | "MODEL = \"gemini-2.0-flash\" #@param {type:\"string\"}\n",
93 | "\n",
94 | "client = OpenAI(\n",
95 | " api_key=API_KEY,\n",
96 | " base_url=BASE_URL\n",
97 | ")\n",
98 | "\n",
99 | "print(\"API client 已初始化\")"
100 | ],
101 | "metadata": {
102 | "colab": {
103 | "base_uri": "https://localhost:8080/"
104 | },
105 | "id": "CoHUqXvH299O",
106 | "outputId": "b3de8d22-ec1c-4e7a-f70d-095f8346af1d"
107 | },
108 | "id": "CoHUqXvH299O",
109 | "execution_count": 2,
110 | "outputs": [
111 | {
112 | "output_type": "stream",
113 | "name": "stdout",
114 | "text": [
115 | "API client 已初始化\n"
116 | ]
117 | }
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "source": [
123 | "# 測試 OpenAI 套件設定\n",
124 | "\n",
125 | "try:\n",
126 | " resp = client.chat.completions.create(\n",
127 | " model=MODEL,\n",
128 | " messages=[\n",
129 | " {\"role\": \"system\", \"content\": \"你是專業的助理,使用繁體中文回答。\"},\n",
130 | " {\"role\": \"user\", \"content\": \"請用一句話介紹什麼是大型語言模型(LLM)。\"}\n",
131 | " ],\n",
132 | " temperature=0.7,\n",
133 | " max_tokens=256,\n",
134 | " )\n",
135 | " print(\"✅ 呼叫成功\")\n",
136 | "except Exception as e:\n",
137 | " print(\"❌ 呼叫失敗,請檢查 API Key / base_url / 模型名稱是否正確。\")\n",
138 | " raise e\n",
139 | "\n",
140 | "if resp.choices:\n",
141 | " print(\"=== Model Output ===\")\n",
142 | " print(resp.choices[0].message.content)\n",
143 | "else:\n",
144 | " import json\n",
145 | " print(\"⚠️ 非預期回傳格式:\")\n",
146 | " print(json.dumps(resp.model_dump(), ensure_ascii=False, indent=2))"
147 | ],
148 | "metadata": {
149 | "colab": {
150 | "base_uri": "https://localhost:8080/"
151 | },
152 | "id": "ZSGiTu4BS1MF",
153 | "outputId": "2fb7563e-af7e-4e37-ef30-d84e43a9dd0d"
154 | },
155 | "id": "ZSGiTu4BS1MF",
156 | "execution_count": 3,
157 | "outputs": [
158 | {
159 | "output_type": "stream",
160 | "name": "stdout",
161 | "text": [
162 | "✅ 呼叫成功\n",
163 | "=== Model Output ===\n",
164 | "大型語言模型 (LLM) 是一種經過大量文本數據訓練的人工智慧模型,能夠理解、生成和預測人類語言。\n",
165 | "\n"
166 | ]
167 | }
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "id": "5ffa29ba-2e60-4041-a21f-c8f328f61304",
173 | "metadata": {
174 | "id": "5ffa29ba-2e60-4041-a21f-c8f328f61304"
175 | },
176 | "source": [
177 | "### 1.2 常見對話資料集格式比較\n",
178 | "\n"
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "id": "241fddab-ede4-4d95-86b7-569bee685087",
184 | "metadata": {
185 | "id": "241fddab-ede4-4d95-86b7-569bee685087"
186 | },
187 | "source": [
188 | "
\n",
189 | " 
\n",
190 | " 圖 1:Wiki 對話格式示意圖\n",
191 | "
"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "id": "cd084c2b-1741-4a4d-8932-5e6dfdfafcfa",
197 | "metadata": {
198 | "id": "cd084c2b-1741-4a4d-8932-5e6dfdfafcfa"
199 | },
200 | "source": [
201 | "### 1.3 JSONL vs Parquet 比較\n",
202 | "\n",
203 | "| 格式 | 優點 | 缺點 |\n",
204 | "|----------|-------------------------------|------------------------------|\n",
205 | "| `.jsonl` | 易讀、輕量、開發友善 | 檔案大、大量數據讀取效率較低 |\n",
206 | "| `.parquet` | 壓縮效果好、查詢效能高、支援 HF 轉換 | 不易直接閱讀,需使用工具處理 |\n",
207 | "\n",
208 | "注意:即使你上傳 `.jsonl`,HF Hub 也可能幫你生成 `.parquet` 分支,方便瀏覽與載入。"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "id": "e0a123e6-20c2-41d3-a235-7cfc8974c969",
214 | "metadata": {
215 | "id": "e0a123e6-20c2-41d3-a235-7cfc8974c969"
216 | },
217 | "source": [
218 | "\n",
219 | " 
\n",
220 | " 圖 2:HF Hub 自動生成的 .parquet 分支\n",
221 | "
"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "id": "911efc9a-b8b9-4b28-92d8-c6d405ce31e3",
227 | "metadata": {
228 | "id": "911efc9a-b8b9-4b28-92d8-c6d405ce31e3"
229 | },
230 | "source": [
231 | "### 1.4 Reference-Free vs Reference-Based\n",
232 | "\n",
233 | "- **Reference-Free(無參考)**:用一些 seed prompt 引導模型生成。最早出自 [Self-Instruct: Aligning Language Models with Self-Generated Instructions\n",
234 | "](https://arxiv.org/abs/2212.10560)。\n",
235 | "- **Reference-Based(參考內容)**:使用真實資料片段(例如 Wiki 條目)作 prompt 佐料,讓生成內容更 grounded。"
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "id": "582b3052-568d-4ab3-aa56-cc5fe2c942ab",
241 | "metadata": {
242 | "id": "582b3052-568d-4ab3-aa56-cc5fe2c942ab"
243 | },
244 | "source": [
245 | "#### 1.4.1 Reference-Free 實作\n",
246 | "\n",
247 | "在 Reference-Free 的情境下,我們並不依賴任何外部知識庫或文件,而是透過 **seed 任務 (seed task)** 來驅動模型自行生成資料。 \n",
248 | "這些 seed 任務通常包含一個 **instruction(指令)**,加上少量的 **instance(範例輸入/輸出對)**,作為模型模仿與延伸的起點。 \n",
249 | "\n",
250 | "這種方法的代表性工作是 *Self-Instruct*,它透過人工設計的一些高品質種子指令,讓模型去「舉一反三」產生更多指令和對應答案,最終建立出龐大的資料集。\n",
251 | "\n",
252 | "以下是一個取自 [self-instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl) seed 範例,主題是「早餐建議」。 \n",
253 | "\n",
254 | "```json\n",
255 | "{\n",
256 | " \"id\": \"seed_task_0\",\n",
257 | " \"name\": \"breakfast_suggestion\",\n",
258 | " \"instruction\": \"Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?\",\n",
259 | " \"instances\": [\n",
260 | " {\n",
261 | " \"input\": \"\",\n",
262 | " \"output\": \"Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1 tbsp flaxseed oil and 1/2 cup water, totaling about 550 calories. The 4 strips of bacon contains about 200 calories.\"\n",
263 | " }\n",
264 | " ],\n",
265 | " \"is_classification\": false\n",
266 | "}\n",
267 | "```\n",
268 | "說明:\n",
269 | "- id:任務的唯一識別碼。\n",
270 | "- name:任務名稱,方便辨識。\n",
271 | "- instruction:給模型的主要問題或指令。\n",
272 | "- instances:包含輸入/輸出對,本例中 input 為空,代表模型直接依 instruction 回答;output 是一個可能的解答。\n",
273 | "- is_classification:標記此任務是否為分類型問題(此例為否)。"
274 | ]
275 | },
276 | {
277 | "cell_type": "markdown",
278 | "id": "52128cae-647b-43da-913d-04aed64fc783",
279 | "metadata": {
280 | "id": "52128cae-647b-43da-913d-04aed64fc783"
281 | },
282 | "source": [
283 | "在實務中,我們會設計數十到數百個 seed 任務,涵蓋不同領域與指令型態,作為 Reference-Free 資料生成的核心基礎。\n",
284 | "\n",
285 | "不過,我們的作法並**不完全等同於 Self-Instruct**。\n",
286 | "相較於 Self-Instruct 的完整 pipeline(如:過濾、去重、迭代擴展),我們傾向採用更簡單直接的方式:\n",
287 | "\t1.\t人工撰寫少量高品質 seed 指令。\n",
288 | "\t2.\t要求模型基於這些 seed 產生新的 seed 指令(但僅限輸出 seed 本文,避免雜訊)。\n",
289 | "\t3.\t再利用這些新 seed 指令,由模型生成單輪問答配對。\n",
290 | "\n",
291 | "這樣的流程更輕量,雖然缺少複雜的篩選與多輪迭代,但對於課程實作與教學目標而言,已經能清楚展現 Reference-Free 的核心精神。"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "source": [
297 | "base_seed = \"\"\"Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?\"\"\""
298 | ],
299 | "metadata": {
300 | "id": "Zhfi1rGdVo7i"
301 | },
302 | "id": "Zhfi1rGdVo7i",
303 | "execution_count": 4,
304 | "outputs": []
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 5,
309 | "id": "727cbf67-aca6-4e63-854f-d08a889ea711",
310 | "metadata": {
311 | "colab": {
312 | "base_uri": "https://localhost:8080/"
313 | },
314 | "id": "727cbf67-aca6-4e63-854f-d08a889ea711",
315 | "outputId": "82e43e30-b32d-4ac9-c9b8-bf36d066d007"
316 | },
317 | "outputs": [
318 | {
319 | "output_type": "stream",
320 | "name": "stdout",
321 | "text": [
322 | "🔹 原始 seed: Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?\n",
323 | "🔸 新的 seed: 推薦一份不含乳製品、高纖且富含健康脂肪的午餐食譜,熱量約 500 大卡。\n"
324 | ]
325 | }
326 | ],
327 | "source": [
328 | "# Step 1: 以既有 seed 為出發點,要求 LLM 產生「不同但相關」的新 seed。\n",
329 | "# 重要:嚴格要求只輸出 seed 文字本身,不要任何多餘說明、標籤或引號。\n",
330 | "\n",
331 | "from openai import OpenAI\n",
332 | "import re\n",
333 | "\n",
334 | "seed_gen_messages = [\n",
335 | " {\n",
336 | " \"role\": \"system\",\n",
337 | " \"content\": (\n",
338 | " \"你是一個資料生成器。你的任務是『根據給定 seed,產生一則不同但主題相關的 seed 指令』。\\n\"\n",
339 | " \"務必遵守:\\n\"\n",
340 | " \"1) 僅輸出新的 seed 指令本身(繁體中文)。\\n\"\n",
341 | " \"2) 不要加任何解釋、前後文、引號、標點裝飾或標籤。\\n\"\n",
342 | " \"3) 一至兩句話,清楚可執行。\\n\"\n",
343 | " \"4) 避免重複與原 seed 完全相同的限制條件或措辭,但主題需相關。\\n\"\n",
344 | " )\n",
345 | " },\n",
346 | " {\n",
347 | " \"role\": \"user\",\n",
348 | " \"content\": (\n",
349 | " f\"這是原始 seed:\\n{base_seed}\\n\\n\"\n",
350 | " \"請依規則產生一個新的 seed 指令(繁體中文)。只輸出新 seed 本文,其他一律不要。\"\n",
351 | " )\n",
352 | " },\n",
353 | "]\n",
354 | "\n",
355 | "resp_seed = client.chat.completions.create(\n",
356 | " model=MODEL,\n",
357 | " messages=seed_gen_messages,\n",
358 | " temperature=0.9,\n",
359 | " max_tokens=200,\n",
360 | ")\n",
361 | "\n",
362 | "new_seed_instruction_raw = resp_seed.choices[0].message.content.strip()\n",
363 | "\n",
364 | "# 基本清理:移除常見多餘字樣(保險)\n",
365 | "def sanitize_seed(text: str) -> str:\n",
366 | " text = text.strip()\n",
367 | " # 移除可能的程式碼圍欄或引號\n",
368 | " text = re.sub(r\"^```.*?\\n|\\n```$\", \"\", text, flags=re.DOTALL) # 去掉 ``` 區塊\n",
369 | " text = text.strip(\"「」\\\"'` \\n\\t\")\n",
370 | " # 去掉可能的前綴\n",
371 | " text = re.sub(r\"^(新的?seed指令[::]\\s*|seed[::]\\s*|新指令[::]\\s*)\", \"\", text, flags=re.IGNORECASE)\n",
372 | " return text.strip()\n",
373 | "\n",
374 | "new_seed_instruction = sanitize_seed(new_seed_instruction_raw)\n",
375 | "\n",
376 | "print(\"🔹 原始 seed:\", base_seed)\n",
377 | "print(\"🔸 新的 seed:\", new_seed_instruction)"
378 | ]
379 | },
380 | {
381 | "cell_type": "code",
382 | "execution_count": 6,
383 | "id": "93ef39f8-4d25-4404-bcc4-59c4da85a9a9",
384 | "metadata": {
385 | "colab": {
386 | "base_uri": "https://localhost:8080/"
387 | },
388 | "id": "93ef39f8-4d25-4404-bcc4-59c4da85a9a9",
389 | "outputId": "3c622424-7d24-4484-96e0-ecfc1d3fdbd0"
390 | },
391 | "outputs": [
392 | {
393 | "output_type": "stream",
394 | "name": "stdout",
395 | "text": [
396 | "✅ 已生成單輪 QA 並寫入: outputs/datasets.jsonl\n",
397 | "\n",
398 | "=== 回答預覽 ===\n",
399 | " 好的,這是一份不含乳製品、高纖且富含健康脂肪,熱量約 500 大卡的午餐食譜,並提供詳細的說明和建議,讓你輕鬆準備:\n",
400 | "\n",
401 | "**午餐食譜:酪梨藜麥沙拉佐烤鮭魚**\n",
402 | "\n",
403 | "**熱量估算:** 約 480-520 大卡 (依食材份量微調)\n",
404 | "\n",
405 | "**食材:**\n",
406 | "\n",
407 | "* **烤鮭魚 (約 120 克):** 提供優質蛋白質和 Omega-3 脂肪酸 (約 200 大卡)\n",
408 | " * 鮭魚片:120 克\n",
409 | " * 橄欖油:1 茶匙\n",
410 | " * 鹽:少許\n",
411 | " * 黑胡椒:少許\n",
412 | " * 檸檬汁:少許 (可選)\n",
413 | "* **藜麥 (煮熟後約 1 杯):** 提供豐富纖維和植物性蛋白質 (約 220 大卡)\n",
414 | " * 乾燥藜麥:1/2 杯\n",
415 | " * 水:1 杯\n",
416 | "* **酪梨 (1/4 個):** 提供健康脂肪和纖維 (約 80 大卡)\n",
417 | "* **蔬菜 (總量約 1 杯):** 提供纖維、維生素和礦物質 (約 20 大卡)\n",
418 | " * 小黃瓜:1/4 根 (切丁)\n",
419 | " * 紅蘿蔔:1/4 根 (切丁)\n",
420 | " * 甜椒 (任何顏色):1/4 個 (切丁)\n",
421 | " * 芝麻葉或其他綠葉蔬菜:適量\n",
422 | "* **堅果和種子 (約 1 湯匙):** 提供健康脂肪和礦物質 (約 50 大卡)\n",
423 | " * 南瓜籽、葵花籽或杏仁片:1 湯匙 (混合或單選)\n",
424 | "* **醬汁:**\n",
425 | " * 檸檬汁:1 湯匙\n",
426 | " * 橄欖油:1 茶匙\n",
427 | " * 第戎芥末:1/2 茶匙 (可選)\n",
428 | " * 鹽:少許\n",
429 | " * 黑胡椒:少許\n",
430 | "\n",
431 | "**烹飪步驟:**\n",
432 | "\n",
433 | "1. **準備藜麥:**\n",
434 | " * 將藜麥用清水沖洗乾淨。\n",
435 | " * 將藜麥和水放入鍋中煮沸,然後轉小火煮約 15 分鐘,或直到藜麥煮熟且水分被吸收。\n",
436 | " \n"
437 | ]
438 | }
439 | ],
440 | "source": [
441 | "# Step 2: 以「新的 seed 指令」當作 user 提問,生成單輪回答(assistant 一次回覆)。\n",
442 | "# 產出為 OpenAI messages 格式,可直接累積進 datasets.jsonl。\n",
443 | "\n",
444 | "import json\n",
445 | "from uuid import uuid4\n",
446 | "from pathlib import Path\n",
447 | "\n",
448 | "qa_messages = [\n",
449 | " {\"role\": \"system\", \"content\": \"你是一位營養與飲食規劃的專家,請使用繁體中文,給出明確、可執行的建議。\"},\n",
450 | " {\"role\": \"user\", \"content\": new_seed_instruction},\n",
451 | "]\n",
452 | "\n",
453 | "resp_qa = client.chat.completions.create(\n",
454 | " model=MODEL,\n",
455 | " messages=qa_messages,\n",
456 | " temperature=0.7,\n",
457 | " max_tokens=600,\n",
458 | ")\n",
459 | "\n",
460 | "answer = resp_qa.choices[0].message.content\n",
461 | "\n",
462 | "example = {\n",
463 | " \"id\": str(uuid4()),\n",
464 | " \"type\": \"reference_free\",\n",
465 | " \"seed\": new_seed_instruction,\n",
466 | " \"messages\": [\n",
467 | " qa_messages[0], # system\n",
468 | " qa_messages[1], # user(新的 seed)\n",
469 | " {\"role\": \"assistant\", \"content\": answer}, # 單輪回答\n",
470 | " ]\n",
471 | "}\n",
472 | "\n",
473 | "# ✅ 可選:追加寫入 datasets.jsonl(供下一章節 QC 使用)\n",
474 | "out_path = Path(\"outputs/datasets.jsonl\")\n",
475 | "out_path.parent.mkdir(parents=True, exist_ok=True)\n",
476 | "with out_path.open(\"a\", encoding=\"utf-8\") as f:\n",
477 | " f.write(json.dumps(example, ensure_ascii=False) + \"\\n\")\n",
478 | "\n",
479 | "print(\"✅ 已生成單輪 QA 並寫入:\", out_path)\n",
480 | "print(\"\\n=== 回答預覽 ===\\n\", answer[:800])"
481 | ]
482 | },
483 | {
484 | "cell_type": "markdown",
485 | "id": "60cc5f17-dc9f-400b-9a1b-94f7975ac569",
486 | "metadata": {
487 | "id": "60cc5f17-dc9f-400b-9a1b-94f7975ac569"
488 | },
489 | "source": [
490 | "#### 1.4.2 Reference-based 資料生成\n",
491 | "\n",
492 | "在 Reference-based 的情境下,我們會使用一段外部文本作為依據,並在其上生成問答資料。\n",
493 | "這種方式常見於知識型 QA 系統(例如 Wikipedia 問答),其核心原則是:\n",
494 | "- 問題(Question)必須來自於文本\n",
495 | "- 答案(Answer)必須完全依照文本,不可超出文本範圍\n",
496 | "\n",
497 | "這樣生成的資料,可以幫助模型學會「根據參考內容回答」,而非憑空想像。"
498 | ]
499 | },
500 | {
501 | "cell_type": "code",
502 | "source": [
503 | "article_context = \"\"\"\n",
504 | "[ 開源模型 ] Google Gemma 3 270M 介紹\n",
505 | "Simon Liu\n",
506 | "\n",
507 | "[ 開源模型 ] Google Gemma 3 270M\n",
508 | "Google 官方部落格介紹:連結\n",
509 | "\n",
510 | "Gemma 3 270M 是 Google DeepMind 於 2025 年 8 月正式推出的一款極致輕量化、大幅降低運算成本的開源語言模型。其設計理念側重於高能效、可在邊緣設備上直接運行 (on-device),並且能迅速完成特定任務的微調 (fine-tuning),以達到成本效益最佳化。\n",
511 | "\n",
512 | "I. 核心技術特點與差異化優勢\n",
513 | "1. 模型規模與架構設計\n",
514 | "總參數量為 2.7 億個參數,其中約 1.7 億個參數是 embedding 層權重,剩下則是 transformer 模組,屬於 Gemma 3 家族中的最小版本,採用 decoder‐only Transformer 架構。\n",
515 | "\n",
516 | "2. 能源效率極佳\n",
517 | "透過 INT4 量化後,根據官方的說法,在 Pixel 9 Pro SoC 上進行 25 次對話測試僅消耗 0.75% 電量,展現出極低耗電特性。\n",
518 | "\n",
519 | "3. 出色的 instruction-following 能力\n",
520 | "即使在未經複雜調校下,依然具備強大的「依指令執行」能力,於 IFEval 基準測試中,Gemma 3 270M 取得約 51.2% 的分數,超越多數更大模型。\n",
521 | "\n",
522 | "4. 支援量化自覺訓練 (QAT),便於部署\n",
523 | "提供可用於 INT4 推論的 QAT 檢查點,確保在極度壓縮下仍維持足夠性能,適合資源受限的執行環境。\n",
524 | "\n",
525 | "\n",
526 | "實際在 Samsung S24 Plus 測試 Google Gemma 3 270M INT8 實測結果,手機離線運算速度表現真的很漂亮,知識能力就不為難他了,或許可以 Fine-Tune 成文字對分類的AI模型\n",
527 | "應用場景與部署策略\n",
528 | "適用任務類型\n",
529 | "適合高頻、明確定義片段任務,例如:情緒分析 (sentiment analysis)、實體擷取 (entity extraction)、查詢 (query routing)、結構化文本生成、創意寫作、遵從性檢查等。\n",
530 | "\n",
531 | "快速微調與部署\n",
532 | "模型尺寸小,可在數小時內完成 fine‑tuning,極速部署原型,且可在輕量基礎設施或裝置端運行,提高開發效率並降低成本。\n",
533 | "\n",
534 | "隱私與使用者控制\n",
535 | "可完全本地化部署,避免資料往返雲端,提升敏感資料保護及隱私控制。\n",
536 | "\n",
537 | "建構專責微模型 (fleet of specialized models)\n",
538 | "利用其小巧、效率高的特性,可同時維運多個專門優化的任務模型,實現模組化、效能優化與成本最小化。\n",
539 | "\n",
540 | "比較分析與風險考量\n",
541 | "成本效益 VS 通用能力\n",
542 | "相較大模型,其推論成本與能耗極低;但在通用性、複雜對話或生成能力方面仍有限制,應視任務選擇。\n",
543 | "\n",
544 | "推論性能 VS 訓練性能\n",
545 | "雖然適合地端部署和快速微調,蛋 Google Gemma 的 Fine-Tuning 仍會建議在 GPU 或者 TPU 上的完成,並非本地端的設備進行處理。\n",
546 | "\n",
547 | "結論\n",
548 | "Gemma 3 270M 是典型的在資源受限環境中,以最低成本、最快部署速度能夠完成高效能任務,兼顧能效與靈活性。適用於邊緣部署、快速開發與特定功能場景,如客服分類、自動標註與本地化創意應用。\n",
549 | "\n",
550 | "若企業目標是打造輕量、可擴展且具隱私保障 Edge 端的 AI 解決方案,Gemma 3 270M 是值得納入模型庫的優選選項。\n",
551 | "\"\"\""
552 | ],
553 | "metadata": {
554 | "id": "XKPF6HVH7sLk"
555 | },
556 | "id": "XKPF6HVH7sLk",
557 | "execution_count": 7,
558 | "outputs": []
559 | },
560 | {
561 | "cell_type": "code",
562 | "source": [
563 | "# @markdown 請設定以下 想產生幾組 QA 的變數數值\n",
564 | "\n",
565 | "NUM_QA = 10 # @param {type:\"string\"}"
566 | ],
567 | "metadata": {
568 | "id": "nPnLQcxDpLhu"
569 | },
570 | "id": "nPnLQcxDpLhu",
571 | "execution_count": 8,
572 | "outputs": []
573 | },
574 | {
575 | "cell_type": "code",
576 | "execution_count": 9,
577 | "id": "b98dab6b-5a35-4ef2-9cd5-638988ee81a6",
578 | "metadata": {
579 | "id": "b98dab6b-5a35-4ef2-9cd5-638988ee81a6"
580 | },
581 | "outputs": [],
582 | "source": [
583 | "# ==== 產生「只有問題」→ 再逐題回答(Reference-based)====\n",
584 | "import json, re\n",
585 | "from typing import List\n",
586 | "from uuid import uuid4\n",
587 | "from pathlib import Path\n",
588 | "\n",
589 | "# ---------- (A) 用 Structured Outputs 產生「問題清單」 ----------\n",
590 | "# 參考:OpenAI Structured Outputs / responses.parse(若端點不支援,會自動 fallback)\n",
591 | "# Docs: platform.openai.com/docs/guides/structured-outputs & responses.parse\n",
592 | "from pydantic import BaseModel, Field, conlist\n",
593 | "\n",
594 | "class QuestionItem(BaseModel):\n",
595 | " question: str = Field(..., min_length=4, description=\"依據給定文本可直接回答的問題(繁體中文)\")\n",
596 | "\n",
597 | "class QuestionList(BaseModel):\n",
598 | " items: List[QuestionItem]\n",
599 | "\n",
600 | "def generate_questions_from_context(context: str, n_pairs: int = 4) -> List[str]:\n",
601 | " sys_rules = (\n",
602 | " \"你是資料標註助理,請使用繁體中文設計問題。\\n\"\n",
603 | " f\"請產生 {n_pairs} 題問題,不要提供答案。\\n\"\n",
604 | " \"原則:\\n\"\n",
605 | " \"1) 問題必須可由【文本】直接回答,或能忠實改寫自其中資訊。\\n\"\n",
606 | " \"2) 禁止加入【文本】以外的知識。\\n\"\n",
607 | " \"3) 問題要清楚、具體,答案可在 1–2 句內表達。\\n\"\n",
608 | " \"4) 若【文本】不足以支撐問題,請產生需要使用者進一步釐清的問題(單一句)。\\n\"\n",
609 | " \"5) 問題要自然,不要暴露有任何【文本】或外部資料存在。\\n\"\n",
610 | " \"6) 只輸出 JSON,格式固定為:{\\\"items\\\":[{\\\"question\\\":\\\"...\\\"}, ...]}。\"\n",
611 | " )\n",
612 | " user_rules = (\n",
613 | " \"請根據以下【文本】設計問題:\\n\\n\"\n",
614 | " f\"{context}\\n\\n\"\n",
615 | " \"⚠️ 僅輸出 JSON,格式:{\\\"items\\\":[{\\\"question\\\":\\\"...\\\"}, ...]},\"\n",
616 | " \"不得有額外說明/Markdown/前後綴。\"\n",
617 | " )\n",
618 | "\n",
619 | " # ---- 路徑 1:responses.parse(支援時最穩定)----\n",
620 | " try:\n",
621 | " parsed = client.beta.chat.completions.parse(\n",
622 | " model=MODEL,\n",
623 | " messages=[{\"role\": \"system\", \"content\": sys_rules},\n",
624 | " {\"role\": \"user\", \"content\": user_rules}],\n",
625 | " response_format=QuestionList, # 注意:這裡是一個 Pydantic Model class\n",
626 | " )\n",
627 | " # 取得結構化結果(關鍵)\n",
628 | " items = parsed.choices[0].message.parsed.items\n",
629 | " questions = [it.question.strip() for it in items if it.question.strip()]\n",
630 | " return questions[:n_pairs]\n",
631 | "\n",
632 | " except Exception:\n",
633 | " # ---- 路徑 2:Chat Completions + JSON(相容端常用)----\n",
634 | " fallback_sys = (\n",
635 | " \"你是資料標註助理。請只輸出 JSON,不要任何解釋或 Markdown。\\n\"\n",
636 | " '格式:[{\"question\":\"...\"}, {\"question\":\"...\"}]'\n",
637 | " )\n",
638 | " fallback_user = (\n",
639 | " f\"{sys_rules}\\n\\n\"\n",
640 | " \"請輸出 JSON 陣列,每個物件僅含 question 欄位。\\n\\n\"\n",
641 | " f\"【文本】\\n{context}\"\n",
642 | " )\n",
643 | " resp = client.chat.completions.create(\n",
644 | " model=MODEL,\n",
645 | " messages=[{\"role\": \"system\", \"content\": fallback_sys},\n",
646 | " {\"role\": \"user\", \"content\": fallback_user}],\n",
647 | " # 部分代理不支援 JSON mode;若報錯就移除此參數\n",
648 | " response_format={\"type\": \"json_object\"},\n",
649 | " temperature=0.2,\n",
650 | " max_tokens=800,\n",
651 | " )\n",
652 | " raw = resp.choices[0].message.content.strip()\n",
653 | " txt = re.sub(r\"^```json\\s*|\\s*```$\", \"\", raw, flags=re.IGNORECASE).strip()\n",
654 | " data = json.loads(txt)\n",
655 | "\n",
656 | " # 接受 [{\"question\": \"...\"}] 或 {\"items\":[...]}\n",
657 | " items = data.get(\"items\") if isinstance(data, dict) and \"items\" in data else data\n",
658 | " if not isinstance(items, list):\n",
659 | " raise ValueError(\"模型輸出不是問題清單 JSON 陣列/物件\")\n",
660 | "\n",
661 | " qs = []\n",
662 | " for obj in items:\n",
663 | " if isinstance(obj, dict) and \"question\" in obj:\n",
664 | " q = str(obj[\"question\"]).strip()\n",
665 | " elif isinstance(obj, str):\n",
666 | " q = obj.strip()\n",
667 | " else:\n",
668 | " continue\n",
669 | " if q:\n",
670 | " qs.append(q)\n",
671 | " return qs[:n_pairs]"
672 | ]
673 | },
674 | {
675 | "cell_type": "code",
676 | "execution_count": 10,
677 | "id": "473c3f77-c4ad-4d3c-9085-ede92c3d2b8a",
678 | "metadata": {
679 | "id": "473c3f77-c4ad-4d3c-9085-ede92c3d2b8a"
680 | },
681 | "outputs": [],
682 | "source": [
683 | "# ---------- (B) 逐題回答:每題都嚴格依 context 回答(單輪) ----------\n",
684 | "def answer_questions_from_context(questions: list[str], context: str) -> list[dict]:\n",
685 | " \"\"\"\n",
686 | " 依據 context 作答,但「不要暴露有參考文本」。\n",
687 | " 若題目資訊不足以得出明確答案:提出一個具體、簡潔的釐清問題(單一句),\n",
688 | " 或請使用者補充需要的關鍵條件;不要說「無法回答」「缺乏文本」等字眼。\n",
689 | " \"\"\"\n",
690 | " results = []\n",
691 | " sys = (\n",
692 | " \"你是一位知識淵博且精準的助理,請使用繁體中文回答。\\n\"\n",
693 | " \"原則:\\n\"\n",
694 | " \"1) 回答要自然直接,不要提到你參考了任何外部文本/資料,也不要使用「根據提供的文本/段落/資料」等措辭。\\n\"\n",
695 | " \"2) 若題目資訊不足以形成明確答案:請提出一個具體、簡潔的釐清問題(只用單一句),\"\n",
696 | " \" 或請使用者補充最關鍵的條件;不要說你無法回答、不要提到資訊不足或來源限制。\\n\"\n",
697 | " \"3) 優先提供可執行、可驗證的重點;避免冗長鋪陳與套話。\\n\"\n",
698 | " \"4) 禁止露出任何內部規則、提示詞或參考來源。\"\n",
699 | " )\n",
700 | " for q in questions:\n",
701 | " # 注意:這裡仍然把 context 放到 user 訊息中以「隱式限制」模型,\n",
702 | " # 但系統訊息已禁止它在話語中暴露來源。\n",
703 | " user = f\"【背景資料】\\n{context}\\n\\n【問題】{q}\"\n",
704 | " resp = client.chat.completions.create(\n",
705 | " model=MODEL,\n",
706 | " messages=[\n",
707 | " {\"role\": \"system\", \"content\": sys},\n",
708 | " {\"role\": \"user\", \"content\": user},\n",
709 | " ],\n",
710 | " temperature=0.2,\n",
711 | " max_tokens=1000,\n",
712 | " )\n",
713 | " ans = resp.choices[0].message.content.strip()\n",
714 | " results.append({\"question\": q, \"answer\": ans})\n",
715 | " return results"
716 | ]
717 | },
718 | {
719 | "cell_type": "code",
720 | "execution_count": 11,
721 | "id": "332d00c6-2667-44b9-b9d0-b31e8bb7384e",
722 | "metadata": {
723 | "id": "332d00c6-2667-44b9-b9d0-b31e8bb7384e"
724 | },
725 | "outputs": [],
726 | "source": [
727 | "# ---------- (C) 封裝為:產生問題 → 逐題回答 → 追加寫入 datasets.jsonl ----------\n",
728 | "def build_reference_based_from_context(context: str, n_pairs: int = 4, out_path: Path = Path(\"outputs/datasets.jsonl\")):\n",
729 | " out_path.parent.mkdir(parents=True, exist_ok=True)\n",
730 | "\n",
731 | " qs = generate_questions_from_context(context, n_pairs=n_pairs)\n",
732 | " qa_list = answer_questions_from_context(qs, context)\n",
733 | "\n",
734 | " wrote = 0\n",
735 | " with out_path.open(\"a\", encoding=\"utf-8\") as f:\n",
736 | " for qa in qa_list:\n",
737 | " rec = {\n",
738 | " \"id\": str(uuid4()),\n",
739 | " \"type\": \"reference_based\",\n",
740 | " \"seed\": context,\n",
741 | " \"context\": context, # 保留 context 供審核/教學;若不需要可移除\n",
742 | " \"messages\": [\n",
743 | " {\"role\": \"system\", \"content\": \"請嚴格依據提供的文本回答問題,使用繁體中文。\"},\n",
744 | " {\"role\": \"user\", \"content\": qa[\"question\"]},\n",
745 | " {\"role\": \"assistant\", \"content\": qa[\"answer\"]},\n",
746 | " ],\n",
747 | " }\n",
748 | " f.write(json.dumps(rec, ensure_ascii=False) + \"\\n\")\n",
749 | " wrote += 1\n",
750 | "\n",
751 | " print(f\"✅ 已新增 {wrote} 筆 reference-based QA 至 {out_path}\")\n",
752 | " return qa_list"
753 | ]
754 | },
755 | {
756 | "cell_type": "code",
757 | "source": [
758 | "import re\n",
759 | "\n",
760 | "def split_markdown_by_headers(markdown_text):\n",
761 | " \"\"\"Splits a markdown string by headers (#, ##, ###).\"\"\"\n",
762 | " # Use regex to find all headers and their positions\n",
763 | " # This regex looks for lines starting with 1 to 3 '#' characters, followed by a space\n",
764 | " # and captures the header line and the content that follows until the next header\n",
765 | " segments = []\n",
766 | " # Find all matches of headers and their starting positions\n",
767 | " matches = list(re.finditer(r\"^(#+\\s.*)$\", markdown_text, re.MULTILINE))\n",
768 | "\n",
769 | " if not matches:\n",
770 | " # If no headers are found, return the entire text as a single segment\n",
771 | " return [markdown_text]\n",
772 | "\n",
773 | " # Add content before the first header if it exists\n",
774 | " if matches[0].start() > 0:\n",
775 | " segments.append(markdown_text[:matches[0].start()].strip())\n",
776 | "\n",
777 | " # Iterate through the matches to extract segments\n",
778 | " for i in range(len(matches)):\n",
779 | " start_pos = matches[i].start()\n",
780 | " # The end position is the start of the next header, or the end of the text\n",
781 | " end_pos = matches[i+1].start() if i+1 < len(matches) else len(markdown_text)\n",
782 | " segment = markdown_text[start_pos:end_pos].strip()\n",
783 | " if segment:\n",
784 | " segments.append(segment)\n",
785 | "\n",
786 | " return segments\n",
787 | "\n",
788 | "# Split the article_context\n",
789 | "wiki_segments = split_markdown_by_headers(article_context)\n",
790 | "\n",
791 | "# Print the number of segments and the first few\n",
792 | "print(f\"Split article_context into {len(wiki_segments)} segments.\")\n",
793 | "for i, segment in enumerate(wiki_segments):\n",
794 | " print(f\"\\n--- Segment {i+1} ---\")\n",
795 | " print(segment[:500] + ('...' if len(segment) > 500 else ''))"
796 | ],
797 | "metadata": {
798 | "colab": {
799 | "base_uri": "https://localhost:8080/"
800 | },
801 | "id": "EoiTAHOb9N3p",
802 | "outputId": "992265f9-900c-4b24-d954-10979e1f344a"
803 | },
804 | "id": "EoiTAHOb9N3p",
805 | "execution_count": 12,
806 | "outputs": [
807 | {
808 | "output_type": "stream",
809 | "name": "stdout",
810 | "text": [
811 | "Split article_context into 1 segments.\n",
812 | "\n",
813 | "--- Segment 1 ---\n",
814 | "\n",
815 | "[ 開源模型 ] Google Gemma 3 270M 介紹\n",
816 | "Simon Liu\n",
817 | "\n",
818 | "[ 開源模型 ] Google Gemma 3 270M\n",
819 | "Google 官方部落格介紹:連結\n",
820 | "\n",
821 | "Gemma 3 270M 是 Google DeepMind 於 2025 年 8 月正式推出的一款極致輕量化、大幅降低運算成本的開源語言模型。其設計理念側重於高能效、可在邊緣設備上直接運行 (on-device),並且能迅速完成特定任務的微調 (fine-tuning),以達到成本效益最佳化。\n",
822 | "\n",
823 | "I. 核心技術特點與差異化優勢\n",
824 | "1. 模型規模與架構設計\n",
825 | "總參數量為 2.7 億個參數,其中約 1.7 億個參數是 embedding 層權重,剩下則是 transformer 模組,屬於 Gemma 3 家族中的最小版本,採用 decoder‐only Transformer 架構。\n",
826 | "\n",
827 | "2. 能源效率極佳\n",
828 | "透過 INT4 量化後,根據官方的說法,在 Pixel 9 Pro SoC 上進行 25 次對話測試僅消耗 0.75% 電量,展現出極低耗電特性。\n",
829 | "\n",
830 | "3. 出色的 instruction-following...\n"
831 | ]
832 | }
833 | ]
834 | },
835 | {
836 | "cell_type": "code",
837 | "source": [
838 | "print(\"* Context information: \\n\")\n",
839 | "\n",
840 | "for index, context in enumerate(wiki_segments):\n",
841 | " print(f\"===== context: {index} =====\")\n",
842 | " print(\"length of context: \" + str(len(context)))\n",
843 | " print(\"Suggestion QA about this context: \" + str(int(len(context)/100)+1))\n",
844 | " print(\"Preview the content: \\n\\n\" + context[:30])\n",
845 | " print(f\"======================\")"
846 | ],
847 | "metadata": {
848 | "colab": {
849 | "base_uri": "https://localhost:8080/"
850 | },
851 | "id": "lPl422189trw",
852 | "outputId": "f067509d-dfc2-445e-a696-f91520791a44"
853 | },
854 | "id": "lPl422189trw",
855 | "execution_count": 13,
856 | "outputs": [
857 | {
858 | "output_type": "stream",
859 | "name": "stdout",
860 | "text": [
861 | "* Context information: \n",
862 | "\n",
863 | "===== context: 0 =====\n",
864 | "length of context: 1428\n",
865 | "Suggestion QA about this context: 15\n",
866 | "Preview the content: \n",
867 | "\n",
868 | "\n",
869 | "[ 開源模型 ] Google Gemma 3 270M \n",
870 | "======================\n"
871 | ]
872 | }
873 | ]
874 | },
875 | {
876 | "cell_type": "code",
877 | "source": [
878 | "import time\n",
879 | "\n",
880 | "for context in wiki_segments:\n",
881 | " if NUM_QA is None:\n",
882 | " n_pair = int(NUM_QA/len(wiki_segments))\n",
883 | " else:\n",
884 | " n_pair = NUM_QA\n",
885 | "\n",
886 | " _qa_preview = build_reference_based_from_context(context, n_pairs = NUM_QA)\n",
887 | " print(\"\\n--- 產生預覽 ---\")\n",
888 | " for i, qa in enumerate(_qa_preview, 1):\n",
889 | " print(f\"Q{i}: {qa['question']}\")\n",
890 | " print(f\"A{i}: {qa['answer'][:200]}{'...' if len(qa['answer'])>200 else ''}\\n\")\n",
891 | " time.sleep(5)"
892 | ],
893 | "metadata": {
894 | "colab": {
895 | "base_uri": "https://localhost:8080/"
896 | },
897 | "id": "4mRbvk1Z92rz",
898 | "outputId": "1a4db710-4938-41c0-dce4-9a52d2941eb0"
899 | },
900 | "id": "4mRbvk1Z92rz",
901 | "execution_count": 14,
902 | "outputs": [
903 | {
904 | "output_type": "stream",
905 | "name": "stdout",
906 | "text": [
907 | "✅ 已新增 10 筆 reference-based QA 至 outputs/datasets.jsonl\n",
908 | "\n",
909 | "--- 產生預覽 ---\n",
910 | "Q1: Google Gemma 3 270M 是由哪個機構推出的?\n",
911 | "A1: Google DeepMind。\n",
912 | "\n",
913 | "Q2: Gemma 3 270M 的設計理念是什麼?\n",
914 | "A2: Gemma 3 270M 的設計理念側重於高能效,使其能在邊緣設備上直接運行,並迅速完成特定任務的微調,以達到成本效益最佳化。\n",
915 | "\n",
916 | "Q3: Gemma 3 270M 總共有多少個參數?\n",
917 | "A3: Gemma 3 270M 總共有 2.7 億個參數。\n",
918 | "\n",
919 | "Q4: Gemma 3 270M 在 Pixel 9 Pro SoC 上進行 25 次對話測試消耗多少電量?\n",
920 | "A4: Gemma 3 270M 在 Pixel 9 Pro SoC 上進行 25 次對話測試僅消耗 0.75% 電量。\n",
921 | "\n",
922 | "Q5: Gemma 3 270M 在 IFEval 基準測試中取得了約多少分數?\n",
923 | "A5: Gemma 3 270M 在 IFEval 基準測試中取得約 51.2% 的分數。\n",
924 | "\n",
925 | "Q6: Gemma 3 270M 支援哪種量化訓練方式,以方便部署?\n",
926 | "A6: Gemma 3 270M 支援量化自覺訓練 (QAT),以便於 INT4 推論的部署。\n",
927 | "\n",
928 | "Q7: Gemma 3 270M 適合哪些類型的高頻任務?\n",
929 | "A7: Gemma 3 270M 適合情緒分析、實體擷取、查詢、結構化文本生成、創意寫作和遵從性檢查等高頻、明確定義片段的任務。\n",
930 | "\n",
931 | "Q8: 使用 Gemma 3 270M 進行 fine-tuning 通常建議在哪種硬體上完成?\n",
932 | "A8: 建議在 GPU 或 TPU 上完成 Google Gemma 的 Fine-Tuning。\n",
933 | "\n",
934 | "Q9: Gemma 3 270M 的哪些特性使其適用於邊緣部署?\n",
935 | "A9: Gemma 3 270M 適用於邊緣部署的特性包含:\n",
936 | "\n",
937 | "* **能源效率極佳**:INT4 量化後耗電量極低。\n",
938 | "* **模型尺寸小**:參數少,可在輕量基礎設施或裝置端運行。\n",
939 | "* **快速微調與部署**:可在數小時內完成 fine‑tuning,極速部署原型。\n",
940 | "* **隱私與使用者控制**:可完全本地化部署,避免資料往返雲端。\n",
941 | "\n",
942 | "Q10: 相較於大型模型,Gemma 3 270M 在通用能力方面有何限制?\n",
943 | "A10: 在通用性、複雜對話或生成能力方面仍有限制。\n",
944 | "\n"
945 | ]
946 | }
947 | ]
948 | },
949 | {
950 | "cell_type": "markdown",
951 | "metadata": {
952 | "id": "3014b7ac-7748-46f6-965e-8f92b57377cf"
953 | },
954 | "source": [
955 | "## 2 資料品質檢查與過濾(Quality Checks)\n",
956 | "\n",
957 | "\n",
958 | "目標:\n",
959 | "- 載入 `raw.jsonl`\n",
960 | "- 規則式檢查:敏感詞 / 結構完整 / 長度門檻 / 不含 placeholder\n",
961 | "- 產出 `clean.jsonl`\n",
962 | "- 生成摘要報表(通過/剔除統計、剔除原因分佈)"
963 | ],
964 | "id": "3014b7ac-7748-46f6-965e-8f92b57377cf"
965 | },
966 | {
967 | "cell_type": "markdown",
968 | "metadata": {
969 | "id": "50b8ced3-2e81-4f19-af3f-99c98c5efbd8"
970 | },
971 | "source": [
972 | "> 註:不論如何,禁用 [opencc-python](https://github.com/yichen0831/opencc-python) 做任何轉換\n",
973 | "\n",
974 | "> 雖然 OpenCC 的簡轉繁功能很方便,但它只是機械式轉換,繁體字有時會被誤判或錯轉,導致語意錯誤或不符合在地用法,因此並不適合需要精準繁體輸出的情境。"
975 | ],
976 | "id": "50b8ced3-2e81-4f19-af3f-99c98c5efbd8"
977 | },
978 | {
979 | "cell_type": "markdown",
980 | "metadata": {
981 | "id": "8aed0c38-dc62-44ef-aa93-dc043781f5c8"
982 | },
983 | "source": [
984 | "### 2.1 準備路徑與依賴"
985 | ],
986 | "id": "8aed0c38-dc62-44ef-aa93-dc043781f5c8"
987 | },
988 | {
989 | "cell_type": "code",
990 | "execution_count": 15,
991 | "metadata": {
992 | "id": "92d2e6e0-336f-4893-88b9-2b88dc67c79d",
993 | "colab": {
994 | "base_uri": "https://localhost:8080/"
995 | },
996 | "outputId": "dc0afc41-7b90-4ff2-c66f-79e360a1938d"
997 | },
998 | "outputs": [
999 | {
1000 | "output_type": "stream",
1001 | "name": "stdout",
1002 | "text": [
1003 | "✅ 讀取來源: outputs/datasets.jsonl\n",
1004 | "✅ 乾淨輸出: outputs/clean.jsonl\n"
1005 | ]
1006 | }
1007 | ],
1008 | "source": [
1009 | "from pathlib import Path\n",
1010 | "import json, re, statistics\n",
1011 | "from collections import Counter, defaultdict\n",
1012 | "\n",
1013 | "INPUT_PATH = Path(\"outputs/datasets.jsonl\")\n",
1014 | "\n",
1015 | "OUTPUT_DIR = Path(\"outputs\")\n",
1016 | "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
1017 | "\n",
1018 | "OUTPUT_CLEAN = OUTPUT_DIR / \"clean.jsonl\"\n",
1019 | "OUTPUT_REPORT = OUTPUT_DIR / \"qc_report.json\"\n",
1020 | "\n",
1021 | "\n",
1022 | "print(\"✅ 讀取來源:\", INPUT_PATH)\n",
1023 | "print(\"✅ 乾淨輸出:\", OUTPUT_CLEAN)"
1024 | ],
1025 | "id": "92d2e6e0-336f-4893-88b9-2b88dc67c79d"
1026 | },
1027 | {
1028 | "cell_type": "markdown",
1029 | "metadata": {
1030 | "id": "3926d664-eddd-4303-9207-b3a2260afaf3"
1031 | },
1032 | "source": [
1033 | "### 2.2 載入資料\n",
1034 | "\n",
1035 | "逐行讀取 JSONL,存到 list。這裡不做任何變形,只檢視基本鍵值。"
1036 | ],
1037 | "id": "3926d664-eddd-4303-9207-b3a2260afaf3"
1038 | },
1039 | {
1040 | "cell_type": "code",
1041 | "execution_count": 16,
1042 | "metadata": {
1043 | "id": "a26301db-4c15-4580-a84b-5f6a4687685b",
1044 | "colab": {
1045 | "base_uri": "https://localhost:8080/"
1046 | },
1047 | "outputId": "40f39a5e-e115-4b88-a4cf-167067eda1f3"
1048 | },
1049 | "outputs": [
1050 | {
1051 | "output_type": "stream",
1052 | "name": "stdout",
1053 | "text": [
1054 | "Number of records: 11\n"
1055 | ]
1056 | }
1057 | ],
1058 | "source": [
1059 | "records = []\n",
1060 | "with INPUT_PATH.open(\"r\", encoding=\"utf-8\") as f:\n",
1061 | " for line in f:\n",
1062 | " try:\n",
1063 | " records.append(json.loads(line))\n",
1064 | " except Exception as e:\n",
1065 | " # 若出現無法解析的行,記錄並跳過\n",
1066 | " print(\"⚠️ 無法解析的行,已略過:\", e)\n",
1067 | "\n",
1068 | "print(f\"Number of records: {len(records)}\")"
1069 | ],
1070 | "id": "a26301db-4c15-4580-a84b-5f6a4687685b"
1071 | },
1072 | {
1073 | "cell_type": "markdown",
1074 | "metadata": {
1075 | "id": "9b9753e1-25c2-4fce-9faa-c6266f19f43c"
1076 | },
1077 | "source": [
1078 | "### 2.3 品質規則定義\n",
1079 | "\n",
1080 | "本課採「規則式(rule-based)」檢查以快速過濾:\n",
1081 | "1. **結構**:`messages` 至少包含 `system`、`user`、`assistant` 三則;且對話文本不為空。\n",
1082 | "2. **多輪性**:對話需包含至少 3 輪(可鬆綁為 1 輪以上,但本課先採至少 3 輪)。\n",
1083 | "3. **長度**:合併文本長度至少 80 字(避免過短)。\n",
1084 | "4. **敏感詞**:過濾個資或敏感詞(示例黑名單)。\n",
1085 | "5. **Placeholder**:不得包含 `XXX`、`<填充>` 類佔位符。"
1086 | ],
1087 | "id": "9b9753e1-25c2-4fce-9faa-c6266f19f43c"
1088 | },
1089 | {
1090 | "cell_type": "code",
1091 | "execution_count": 17,
1092 | "metadata": {
1093 | "id": "4c0923e3-6fa0-41bb-97a4-653a036c72d5"
1094 | },
1095 | "outputs": [],
1096 | "source": [
1097 | "# 1) 結構/角色檢查\n",
1098 | "def has_min_roles(msgs):\n",
1099 | " roles = [m.get(\"role\") for m in msgs]\n",
1100 | " return {\"system\", \"user\", \"assistant\"}.issubset(set(roles))\n",
1101 | "\n",
1102 | "# 2) 多輪性(這裡以訊息數 >= 3 視為最低門檻;若需要更嚴謹可解析回合)\n",
1103 | "def has_min_turns(msgs, min_msgs=3):\n",
1104 | " return len(msgs) >= min_msgs\n",
1105 | "\n",
1106 | "# 3) 長度門檻\n",
1107 | "def meet_min_length(msgs, min_chars=80):\n",
1108 | " total = sum(len((m.get(\"content\") or \"\").strip()) for m in msgs)\n",
1109 | " return total >= min_chars\n",
1110 | "\n",
1111 | "# 4) 敏感詞(示例):身分證/電話/地址/Email/信用卡/生日\n",
1112 | "SENSITIVE_PATTERNS = [\n",
1113 | " r\"\\b[A-Z][12]\\d{8}\\b\", # 台灣身分證格式\n",
1114 | " r\"\\b09\\d{8}\\b|\\b0\\d{1,2}-\\d{6,8}\\b\", # 手機或市話\n",
1115 | " r\"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\", # email\n",
1116 | " r\"\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b\", # 信用卡 16 碼\n",
1117 | " r\"\\b(19|20)\\d{2}[/-]\\d{1,2}[/-]\\d{1,2}\\b\", # 西元生日 yyyy/mm/dd 或 yyyy-mm-dd\n",
1118 | "]\n",
1119 | "\n",
1120 | "def has_sensitive(text):\n",
1121 | " return any(re.search(p, text) for p in SENSITIVE_PATTERNS)\n",
1122 | "\n",
1123 | "# 5) Placeholder 過濾\n",
1124 | "PLACEHOLDER_PATTERNS = [r\"XXX\", r\"<填充>\", r\"\\[PLACEHOLDER\\]\"]\n",
1125 | "\n",
1126 | "def has_placeholder(text):\n",
1127 | " return any(re.search(p, text, flags=re.IGNORECASE) for p in PLACEHOLDER_PATTERNS)"
1128 | ],
1129 | "id": "4c0923e3-6fa0-41bb-97a4-653a036c72d5"
1130 | },
1131 | {
1132 | "cell_type": "markdown",
1133 | "metadata": {
1134 | "id": "80e5a6cd-4b31-451f-a1c8-c34437a11560"
1135 | },
1136 | "source": [
1137 | "### 2.4 單筆檢查與原因標註\n",
1138 | "\n",
1139 | "輸入一筆記錄,回傳 (是否通過, 剔除原因集合)。"
1140 | ],
1141 | "id": "80e5a6cd-4b31-451f-a1c8-c34437a11560"
1142 | },
1143 | {
1144 | "cell_type": "code",
1145 | "execution_count": 18,
1146 | "metadata": {
1147 | "id": "495390d6-59f4-4ec5-a7e7-1b236a5a1ef7"
1148 | },
1149 | "outputs": [],
1150 | "source": [
1151 | "def join_text_by_roles(msgs, roles=(\"assistant\",)):\n",
1152 | " return \"\\n\".join((m.get(\"content\") or \"\").strip()\n",
1153 | " for m in msgs if m.get(\"role\") in roles)\n",
1154 | "\n",
1155 | "def quality_check(record):\n",
1156 | " reasons = []\n",
1157 | "\n",
1158 | " msgs = record.get(\"messages\", [])\n",
1159 | " if not isinstance(msgs, list) or not msgs:\n",
1160 | " return False, {\"bad_structure\"}\n",
1161 | "\n",
1162 | " if not has_min_roles(msgs):\n",
1163 | " reasons.append(\"missing_roles\")\n",
1164 | "\n",
1165 | " if not has_min_turns(msgs, min_msgs=3):\n",
1166 | " reasons.append(\"too_few_messages\")\n",
1167 | "\n",
1168 | " # ⬇️ 只看 assistant 文字,避免掃到 user 提示內的「例如 身分證/電話…」\n",
1169 | " text = join_text_by_roles(msgs, roles=(\"assistant\",))\n",
1170 | "\n",
1171 | " if not meet_min_length(msgs, min_chars=80):\n",
1172 | " reasons.append(\"too_short\")\n",
1173 | "\n",
1174 | " if has_sensitive(text):\n",
1175 | " reasons.append(\"sensitive_content\")\n",
1176 | "\n",
1177 | " if has_placeholder(text):\n",
1178 | " reasons.append(\"placeholder_found\")\n",
1179 | "\n",
1180 | " return (len(reasons) == 0), set(reasons)"
1181 | ],
1182 | "id": "495390d6-59f4-4ec5-a7e7-1b236a5a1ef7"
1183 | },
1184 | {
1185 | "cell_type": "markdown",
1186 | "metadata": {
1187 | "id": "943576f6-5c01-488e-8e71-f305afae9dd1"
1188 | },
1189 | "source": [
1190 | "### 2.5 執行過濾並輸出 `clean.jsonl`"
1191 | ],
1192 | "id": "943576f6-5c01-488e-8e71-f305afae9dd1"
1193 | },
1194 | {
1195 | "cell_type": "code",
1196 | "execution_count": 19,
1197 | "metadata": {
1198 | "id": "71a2ae66-e807-4ecd-a4bf-577825c339d2",
1199 | "colab": {
1200 | "base_uri": "https://localhost:8080/"
1201 | },
1202 | "outputId": "762bac96-7b6c-46a5-ab08-5a4100f33388"
1203 | },
1204 | "outputs": [
1205 | {
1206 | "output_type": "stream",
1207 | "name": "stdout",
1208 | "text": [
1209 | "✅ 通過:8 筆\n",
1210 | "❌ 剔除:3 筆\n"
1211 | ]
1212 | }
1213 | ],
1214 | "source": [
1215 | "kept, dropped = [], []\n",
1216 | "for rec in records:\n",
1217 | " ok, reasons = quality_check(rec)\n",
1218 | " if ok:\n",
1219 | " kept.append(rec)\n",
1220 | " else:\n",
1221 | " dropped.append((rec.get(\"id\"), reasons))\n",
1222 | "\n",
1223 | "with OUTPUT_CLEAN.open(\"w\", encoding=\"utf-8\") as f:\n",
1224 | " for r in kept:\n",
1225 | " f.write(json.dumps(r, ensure_ascii=False) + \"\\n\")\n",
1226 | "\n",
1227 | "print(f\"✅ 通過:{len(kept)} 筆\")\n",
1228 | "print(f\"❌ 剔除:{len(dropped)} 筆\")"
1229 | ],
1230 | "id": "71a2ae66-e807-4ecd-a4bf-577825c339d2"
1231 | },
1232 | {
1233 | "cell_type": "markdown",
1234 | "metadata": {
1235 | "id": "ae5ec9a4-9eea-4a5a-8af4-0cdbead0a49a"
1236 | },
1237 | "source": [
1238 | "### 2.6 產出品質報表\n",
1239 | "\n",
1240 | "統計剔除原因分佈、長度分佈(通過者),並輸出 `qc_report.json` 方便保存與追蹤。"
1241 | ],
1242 | "id": "ae5ec9a4-9eea-4a5a-8af4-0cdbead0a49a"
1243 | },
1244 | {
1245 | "cell_type": "code",
1246 | "execution_count": 20,
1247 | "metadata": {
1248 | "id": "91348e15-d898-4b96-a6f4-8e19fa1080ae",
1249 | "colab": {
1250 | "base_uri": "https://localhost:8080/"
1251 | },
1252 | "outputId": "b57ad11f-987f-41e2-ed68-41d7ec56f1c8"
1253 | },
1254 | "outputs": [
1255 | {
1256 | "output_type": "stream",
1257 | "name": "stdout",
1258 | "text": [
1259 | "{\n",
1260 | " \"input_total\": 11,\n",
1261 | " \"kept\": 8,\n",
1262 | " \"dropped\": 3,\n",
1263 | " \"drop_reasons\": {\n",
1264 | " \"too_short\": 3\n",
1265 | " },\n",
1266 | " \"length_stats_kept\": {\n",
1267 | " \"min\": 95,\n",
1268 | " \"max\": 927,\n",
1269 | " \"mean\": 224.5,\n",
1270 | " \"median\": 109.0\n",
1271 | " }\n",
1272 | "}\n"
1273 | ]
1274 | }
1275 | ],
1276 | "source": [
1277 | "# 剔除原因分佈\n",
1278 | "reason_counter = Counter()\n",
1279 | "for _id, reasons in dropped:\n",
1280 | " reason_counter.update(reasons)\n",
1281 | "\n",
1282 | "# 通過資料長度(字元計)分佈\n",
1283 | "lengths = []\n",
1284 | "for r in kept:\n",
1285 | " lengths.append(sum(len((m.get(\"content\") or \"\").strip()) for m in r[\"messages\"]))\n",
1286 | "\n",
1287 | "report = {\n",
1288 | " \"input_total\": len(records),\n",
1289 | " \"kept\": len(kept),\n",
1290 | " \"dropped\": len(dropped),\n",
1291 | " \"drop_reasons\": dict(reason_counter),\n",
1292 | " \"length_stats_kept\": {\n",
1293 | " \"min\": min(lengths) if lengths else 0,\n",
1294 | " \"max\": max(lengths) if lengths else 0,\n",
1295 | " \"mean\": round(statistics.mean(lengths), 2) if lengths else 0,\n",
1296 | " \"median\": statistics.median(lengths) if lengths else 0,\n",
1297 | " },\n",
1298 | "}\n",
1299 | "\n",
1300 | "with OUTPUT_REPORT.open(\"w\", encoding=\"utf-8\") as f:\n",
1301 | " json.dump(report, f, ensure_ascii=False, indent=2)\n",
1302 | "\n",
1303 | "print(json.dumps(report, ensure_ascii=False, indent=2))"
1304 | ],
1305 | "id": "91348e15-d898-4b96-a6f4-8e19fa1080ae"
1306 | },
1307 | {
1308 | "cell_type": "markdown",
1309 | "metadata": {
1310 | "id": "fbb83d71-8c3f-41d6-adb7-3ff00e181cfb"
1311 | },
1312 | "source": [
1313 | "### 2.7. 抽樣檢視通過樣本(前 2 筆)\n",
1314 | "\n",
1315 | "確認清洗後的資料結構與內容是否符合預期。"
1316 | ],
1317 | "id": "fbb83d71-8c3f-41d6-adb7-3ff00e181cfb"
1318 | },
1319 | {
1320 | "cell_type": "code",
1321 | "execution_count": 21,
1322 | "metadata": {
1323 | "id": "1f78a29c-74fa-465c-a16d-1ba8f406406a",
1324 | "colab": {
1325 | "base_uri": "https://localhost:8080/"
1326 | },
1327 | "outputId": "d47b52a1-1658-4cfe-a35c-6263e957d222"
1328 | },
1329 | "outputs": [
1330 | {
1331 | "output_type": "stream",
1332 | "name": "stdout",
1333 | "text": [
1334 | "\n",
1335 | "--- Clean Sample 1 / topic=None ---\n",
1336 | "好的,這是一份不含乳製品、高纖且富含健康脂肪,熱量約 500 大卡的午餐食譜,並提供詳細的說明和建議,讓你輕鬆準備:\n",
1337 | "\n",
1338 | "**午餐食譜:酪梨藜麥沙拉佐烤鮭魚**\n",
1339 | "\n",
1340 | "**熱量估算:** 約 480-520 大卡 (依食材份量微調)\n",
1341 | "\n",
1342 | "**食材:**\n",
1343 | "\n",
1344 | "* **烤鮭魚 (約 120 克):** 提供優質蛋白質和 Omega-3 脂肪酸 (約 200 大卡)\n",
1345 | " * 鮭魚片:120 克\n",
1346 | " * 橄欖油:1 茶匙\n",
1347 | " * 鹽:少許\n",
1348 | " * 黑胡椒:少許\n",
1349 | " * 檸檬汁:少許 (可選)\n",
1350 | "* **藜麥 (煮熟後約 1 杯):** 提供豐富纖維和植物性蛋白質 (約 220 大卡)\n",
1351 | " * 乾燥藜麥:1/2 杯\n",
1352 | " * 水:1 杯\n",
1353 | "* **酪梨 (1/4 個):** 提供健康脂肪和纖維 (約 80 大卡)\n",
1354 | "* **蔬菜 (總量約 1 杯):** 提供纖維、維生素和礦物質 (約 20 大卡)\n",
1355 | " * 小黃瓜:1/4 根 (切丁)\n",
1356 | " * 紅蘿蔔:1/4 根 (切丁)\n",
1357 | " * 甜椒 (任何顏色):1/4 個 (切...\n",
1358 | "\n",
1359 | "--- Clean Sample 2 / topic=None ---\n",
1360 | "Gemma 3 270M 的設計理念側重於高能效,使其能在邊緣設備上直接運行,並迅速完成特定任務的微調,以達到成本效益最佳化。\n"
1361 | ]
1362 | }
1363 | ],
1364 | "source": [
1365 | "preview = []\n",
1366 | "with OUTPUT_CLEAN.open(\"r\", encoding=\"utf-8\") as f:\n",
1367 | " for i, line in enumerate(f):\n",
1368 | " if i >= 2:\n",
1369 | " break\n",
1370 | " preview.append(json.loads(line))\n",
1371 | "\n",
1372 | "for i, s in enumerate(preview, 1):\n",
1373 | " print(f\"\\n--- Clean Sample {i} / topic={s.get('topic')} ---\")\n",
1374 | " text = s[\"messages\"][-1][\"content\"]\n",
1375 | " print(text[:500] + (\"...\" if len(text) > 500 else \"\"))"
1376 | ],
1377 | "id": "1f78a29c-74fa-465c-a16d-1ba8f406406a"
1378 | },
1379 | {
1380 | "cell_type": "markdown",
1381 | "metadata": {
1382 | "id": "52ebdc34-7526-4679-bf09-b9f868311a92"
1383 | },
1384 | "source": [
1385 | "### 2.8(可選)LLM 輔助檢查(實務建議)\n",
1386 | "> 所謂的 LLM-as-Judge\n",
1387 | "\n",
1388 | "在規則式檢查後,可抽樣使用 LLM 來做語義層面的檢查(如:是否符合主題、語氣、是否含危險建議等)。 \n",
1389 | "以下為示意程式(預設註解,不影響主流程)。"
1390 | ],
1391 | "id": "52ebdc34-7526-4679-bf09-b9f868311a92"
1392 | },
1393 | {
1394 | "cell_type": "code",
1395 | "source": [
1396 | "len(preview)"
1397 | ],
1398 | "metadata": {
1399 | "colab": {
1400 | "base_uri": "https://localhost:8080/"
1401 | },
1402 | "id": "V1MswHRmqxSa",
1403 | "outputId": "318ccb70-aecb-4250-8c52-812fa49ceb15"
1404 | },
1405 | "id": "V1MswHRmqxSa",
1406 | "execution_count": 22,
1407 | "outputs": [
1408 | {
1409 | "output_type": "execute_result",
1410 | "data": {
1411 | "text/plain": [
1412 | "2"
1413 | ]
1414 | },
1415 | "metadata": {},
1416 | "execution_count": 22
1417 | }
1418 | ]
1419 | },
1420 | {
1421 | "cell_type": "code",
1422 | "execution_count": 23,
1423 | "metadata": {
1424 | "id": "88b8958b-6f3a-4417-9ba2-e13a4a5500fd",
1425 | "colab": {
1426 | "base_uri": "https://localhost:8080/"
1427 | },
1428 | "outputId": "7843492d-a550-4bc7-f8cd-995b2c768785"
1429 | },
1430 | "outputs": [
1431 | {
1432 | "output_type": "stream",
1433 | "name": "stdout",
1434 | "text": [
1435 | "LLM QC -> PASS\n",
1436 | "LLM QC -> PASS\n"
1437 | ]
1438 | }
1439 | ],
1440 | "source": [
1441 | "def llm_qc_judgement(text: str) -> bool:\n",
1442 | " \"\"\"回傳 True 視為通過;False 視為不通過\"\"\"\n",
1443 | " prompt = f\"請閱讀以下對話是否符合:主題連貫、語氣正式友善、無敏感資料、無危險建議。\\n\\n{text}\\n\\n請只回答:PASS 或 FAIL。\"\n",
1444 | " resp = client.chat.completions.create(\n",
1445 | " model=\"gemma-3-12b-it\",\n",
1446 | " messages=[{\"role\":\"user\",\"content\": prompt}],\n",
1447 | " temperature=0.0,\n",
1448 | " max_tokens=10,\n",
1449 | " )\n",
1450 | " ans = resp.choices[0].message.content.strip().upper()\n",
1451 | " return ans.startswith(\"PASS\")\n",
1452 | "\n",
1453 | "# 示例(只檢查前 3 筆)\n",
1454 | "for s in preview:\n",
1455 | " ok = llm_qc_judgement(\"\\n\".join(m[\"content\"] for m in s[\"messages\"]))\n",
1456 | " print(\"LLM QC ->\", \"PASS\" if ok else \"FAIL\")"
1457 | ],
1458 | "id": "88b8958b-6f3a-4417-9ba2-e13a4a5500fd"
1459 | },
1460 | {
1461 | "cell_type": "markdown",
1462 | "metadata": {
1463 | "id": "e2bfbf3a-5157-4bab-826b-64c510619a31"
1464 | },
1465 | "source": [
1466 | "### 2.9.(可選)如果生成資料集一直沒通過"
1467 | ],
1468 | "id": "e2bfbf3a-5157-4bab-826b-64c510619a31"
1469 | },
1470 | {
1471 | "cell_type": "code",
1472 | "execution_count": 24,
1473 | "metadata": {
1474 | "id": "b5d49ad4-a02e-4416-81a3-22bdd0b10ab5"
1475 | },
1476 | "outputs": [],
1477 | "source": [
1478 | "# 🔍 Debug:逐筆列出命中的敏感詞 / Placeholder(含前後文)\n",
1479 | "import re\n",
1480 | "\n",
1481 | "def _ctx(text: str, start: int, end: int, width: int = 50) -> str:\n",
1482 | " s = max(0, start - width)\n",
1483 | " e = min(len(text), end + width)\n",
1484 | " return text[s:start] + \"【\" + text[start:end] + \"】\" + text[end:e]\n",
1485 | "\n",
1486 | "def debug_scan_record(rec: dict, show_only_hits: bool = True):\n",
1487 | " rid = rec.get(\"id\", \"\")\n",
1488 | " topic = rec.get(\"topic\", \"\")\n",
1489 | " msgs = rec.get(\"messages\", [])\n",
1490 | "\n",
1491 | " # 🔑 只掃 assistant(模型輸出)\n",
1492 | " text = \"\\n\".join((m.get(\"content\") or \"\") for m in msgs if m.get(\"role\") == \"assistant\")\n",
1493 | "\n",
1494 | " sens_hits = []\n",
1495 | " for p in SENSITIVE_PATTERNS:\n",
1496 | " for m in re.finditer(p, text, flags=re.IGNORECASE):\n",
1497 | " sens_hits.append((p, m.start(), m.end(), m.group(0)))\n",
1498 | "\n",
1499 | " ph_hits = []\n",
1500 | " for p in PLACEHOLDER_PATTERNS:\n",
1501 | " for m in re.finditer(p, text, flags=re.IGNORECASE):\n",
1502 | " ph_hits.append((p, m.start(), m.end(), m.group(0)))\n",
1503 | "\n",
1504 | " if sens_hits or ph_hits or not show_only_hits:\n",
1505 | " print(f\"\\n=== Record id={rid} | topic={topic} ===\")\n",
1506 | " if sens_hits:\n",
1507 | " print(f\"Sensitive matches ({len(sens_hits)}):\")\n",
1508 | " for p, s, e, g in sens_hits:\n",
1509 | " print(f\" - pattern: {p} | match: {g!r}\")\n",
1510 | " print(\" ...\", _ctx(text, s, e), \"...\")\n",
1511 | " if ph_hits:\n",
1512 | " print(f\"Placeholder matches ({len(ph_hits)}):\")\n",
1513 | " for p, s, e, g in ph_hits:\n",
1514 | " print(f\" - pattern: {p} | match: {g!r}\")\n",
1515 | " print(\" ...\", _ctx(text, s, e), \"...\")\n",
1516 | " return bool(sens_hits), bool(ph_hits)\n",
1517 | "\n",
1518 | "def debug_scan_all(recs: list[dict], limit: int | None = None):\n",
1519 | " n = 0\n",
1520 | " total_sens = total_ph = 0\n",
1521 | " for rec in recs:\n",
1522 | " sens, ph = debug_scan_record(rec)\n",
1523 | " total_sens += int(sens)\n",
1524 | " total_ph += int(ph)\n",
1525 | " n += 1\n",
1526 | " if limit and n >= limit:\n",
1527 | " break\n",
1528 | " print(f\"\\nSummary: scanned {n} records | with_sensitive={total_sens} | with_placeholder={total_ph}\")"
1529 | ],
1530 | "id": "b5d49ad4-a02e-4416-81a3-22bdd0b10ab5"
1531 | },
1532 | {
1533 | "cell_type": "code",
1534 | "execution_count": 25,
1535 | "metadata": {
1536 | "id": "f4161db3-b9e8-430e-b8b7-a5492edc3491",
1537 | "colab": {
1538 | "base_uri": "https://localhost:8080/"
1539 | },
1540 | "outputId": "f128761d-faf4-444b-faff-7dc14faa79d2"
1541 | },
1542 | "outputs": [
1543 | {
1544 | "output_type": "stream",
1545 | "name": "stdout",
1546 | "text": [
1547 | "\n",
1548 | "Summary: scanned 11 records | with_sensitive=0 | with_placeholder=0\n"
1549 | ]
1550 | }
1551 | ],
1552 | "source": [
1553 | "# 假設你已在前面載入 records = [...](從 raw.jsonl)\n",
1554 | "debug_scan_all(records) # 掃全部\n",
1555 | "# 或只看前 10 筆\n",
1556 | "# debug_scan_all(records, limit=10)"
1557 | ],
1558 | "id": "f4161db3-b9e8-430e-b8b7-a5492edc3491"
1559 | },
1560 | {
1561 | "cell_type": "markdown",
1562 | "metadata": {
1563 | "id": "74e85e7f-dce9-4f7a-a73f-033328c2e549"
1564 | },
1565 | "source": [
1566 | "# 3 將資料集上傳到 Hugging Face Hub(Dataset Repo)\n",
1567 | "\n",
1568 | "本章目標:\n",
1569 | "1. 準備要上傳的檔案(預期:`outputs/datasets.jsonl`)\n",
1570 | "2. 使用 `huggingface_hub` 建立或覆用 **Dataset repo**\n",
1571 | "3. 上傳 `data/train.jsonl`(選配:同時上傳 `train.parquet`)\n",
1572 | "4. 建立 / 更新 Dataset Card(`README.md`)"
1573 | ],
1574 | "id": "74e85e7f-dce9-4f7a-a73f-033328c2e549"
1575 | },
1576 | {
1577 | "cell_type": "code",
1578 | "execution_count": 26,
1579 | "metadata": {
1580 | "id": "544a0f13-b9e4-4ada-99ff-6bbdff9715d6",
1581 | "colab": {
1582 | "base_uri": "https://localhost:8080/"
1583 | },
1584 | "outputId": "e0a997b1-949c-45d6-a7d6-79a42fea6e1b"
1585 | },
1586 | "outputs": [
1587 | {
1588 | "output_type": "stream",
1589 | "name": "stdout",
1590 | "text": [
1591 | "Repo: Simon-Liu/gemma-270m-medium-qa\n",
1592 | "Local file: /content/outputs/datasets.jsonl\n"
1593 | ]
1594 | }
1595 | ],
1596 | "source": [
1597 | "from huggingface_hub import HfApi, create_repo, upload_file, upload_folder\n",
1598 | "from huggingface_hub import login as hf_login\n",
1599 | "from pathlib import Path\n",
1600 | "import json, os, time\n",
1601 | "from google.colab import userdata\n",
1602 | "\n",
1603 | "# @markdown 請設定以下 HuggingFace 專案資訊 的變數數值\n",
1604 | "\n",
1605 | "# @markdown > HuggingFace Token 可以設定在 Google Colab 左邊的金鑰區域\n",
1606 | "\n",
1607 | "# === 基本設定(請依實際調整) ===\n",
1608 | "HF_TOKEN = userdata.get(\"HF_TOKEN\") # @param {type:\"string\"}\n",
1609 | "ORG_OR_USER = \"Simon-Liu\" # @param {type:\"string\"}\n",
1610 | "DATASET_NAME = \"gemma-270m-medium-qa\" # @param {type:\"string\"}\n",
1611 | "REPO_ID = f\"{ORG_OR_USER}/{DATASET_NAME}\"\n",
1612 | "\n",
1613 | "LOCAL_JSONL = Path(\"outputs/datasets.jsonl\")\n",
1614 | "assert LOCAL_JSONL.exists(), f\"找不到 {LOCAL_JSONL},請先完成前面章節生成資料\"\n",
1615 | "\n",
1616 | "# 可選:是否也上傳 Parquet(HF Hub 也會在後台自動生成 parquet 分支,但這裡示範手動輸出一次)\n",
1617 | "ALSO_UPLOAD_PARQUET = True # @param {type:\"string\"}\n",
1618 | "\n",
1619 | "print(\"Repo:\", REPO_ID)\n",
1620 | "print(\"Local file:\", LOCAL_JSONL.resolve())"
1621 | ],
1622 | "id": "544a0f13-b9e4-4ada-99ff-6bbdff9715d6"
1623 | },
1624 | {
1625 | "cell_type": "code",
1626 | "execution_count": 27,
1627 | "metadata": {
1628 | "id": "1d9493e6-8eb4-4ea0-a678-0ce92d029b95",
1629 | "colab": {
1630 | "base_uri": "https://localhost:8080/"
1631 | },
1632 | "outputId": "2abcc60b-f690-43d3-a66e-e194d57b5fef"
1633 | },
1634 | "outputs": [
1635 | {
1636 | "output_type": "stream",
1637 | "name": "stdout",
1638 | "text": [
1639 | "✅ 產生 Dataset Card: /content/outputs/README.md\n"
1640 | ]
1641 | }
1642 | ],
1643 | "source": [
1644 | "CARD_PATH = Path(\"outputs/README.md\")\n",
1645 | "CARD_PATH.parent.mkdir(parents=True, exist_ok=True)\n",
1646 | "\n",
1647 | "# 注意:HF 會讀取 README.md 頂端的 YAML 區塊作為中繼資料\n",
1648 | "card_md = f\"\"\"---\n",
1649 | "pretty_name: {REPO_ID} (Gemma-3-27B-it, ADK Reference)\n",
1650 | "tags:\n",
1651 | "- dialog\n",
1652 | "- instruction-tuning\n",
1653 | "- sft\n",
1654 | "- openai-messages\n",
1655 | "- reference-based\n",
1656 | "- reference-free\n",
1657 | "license: cc-by-4.0\n",
1658 | "task_categories:\n",
1659 | "- text-generation\n",
1660 | "language:\n",
1661 | "- zh\n",
1662 | "---\n",
1663 | "\n",
1664 | "本資料集包含由 ** {MODEL} ** 生成的對話資料,採用 **OpenAI Chat Messages** 格式(`.jsonl`)。資料來源結合:\n",
1665 | "- **Reference-free**:由 seed 派生的單輪問答。\n",
1666 | "- **Reference-based**:依據參考文本生成單輪問答。\n",
1667 | "\n",
1668 | "> 檔案路徑:`data/train.jsonl`(選配:`data/train.parquet`)\n",
1669 | "\n",
1670 | "## 結構說明\n",
1671 | "- 每列為一筆樣本:`{{\"id\": \"...\", \"type\": \"...\", \"seed\": \"...\", \"context\": \"...\", \"messages\": [{{\"role\":\"user\",\"content\":\"...\"}}, {{\"role\":\"assistant\",\"content\":\"...\"}}]}}`\n",
1672 | "- `type` 欄位標示資料來源:`reference_free` 或 `reference_based`。\n",
1673 | "- `seed` 欄位儲存 Reference-free 的原始 seed 指令,或 Reference-based 的參考文本片段。\n",
1674 | "- `context` 欄位僅在 `reference_based` 資料中包含完整的參考文本片段。\n",
1675 | "- 訓練時可直接使用 `messages` 欄位的對話格式進行訓練。\n",
1676 | "\n",
1677 | "## 來源與限制\n",
1678 | "- Model: {MODEL}\n",
1679 | "- 語言:繁體中文(生成內容),部分參考文本為英文。\n",
1680 | "- 使用情境:教學示範用;不代表專業意見。\n",
1681 | "- **重要**:Reference-based 資料的問題和答案均從參考文本中生成,答案不應超出參考文本範圍。\n",
1682 | "\n",
1683 | "## 授權\n",
1684 | "- 建議使用 **CC BY 4.0**;若另有需求請調整 `license` 欄位。\n",
1685 | "\"\"\"\n",
1686 | "\n",
1687 | "CARD_PATH.write_text(card_md, encoding=\"utf-8\")\n",
1688 | "print(\"✅ 產生 Dataset Card:\", CARD_PATH.resolve())"
1689 | ],
1690 | "id": "1d9493e6-8eb4-4ea0-a678-0ce92d029b95"
1691 | },
1692 | {
1693 | "cell_type": "code",
1694 | "execution_count": 28,
1695 | "metadata": {
1696 | "id": "9e517999-70ed-4814-8fe7-b486c360fa6d",
1697 | "colab": {
1698 | "base_uri": "https://localhost:8080/",
1699 | "height": 67,
1700 | "referenced_widgets": [
1701 | "c91e25ccc6034a7c94f52d676c44b94e",
1702 | "13731b55ad004f798665cdc8af1cfff3",
1703 | "cac6255099324943afc2b4fb4e0df034",
1704 | "b2657e8be0c644558ac96a766b85e29c",
1705 | "6d6a9866108f4b248808096ef0dfbeea",
1706 | "c5d372f7efc840aca0e64f3b2a4350fb",
1707 | "9ccd9d2c4a6246a2994be3f1eea8be37",
1708 | "01bb49ccf71e4555a06a6b1933e7c016",
1709 | "c6d75ee7208d4e29b3723889751a6284",
1710 | "c58b638d61764d94a15bc15271296320",
1711 | "850eda4161184dd5b7925955ebcd320e"
1712 | ]
1713 | },
1714 | "outputId": "9bc2aad1-70db-4e8e-9f33-1461e78d5b73"
1715 | },
1716 | "outputs": [
1717 | {
1718 | "output_type": "display_data",
1719 | "data": {
1720 | "text/plain": [
1721 | "Creating parquet from Arrow format: 0%| | 0/1 [00:00, ?ba/s]"
1722 | ],
1723 | "application/vnd.jupyter.widget-view+json": {
1724 | "version_major": 2,
1725 | "version_minor": 0,
1726 | "model_id": "c91e25ccc6034a7c94f52d676c44b94e"
1727 | }
1728 | },
1729 | "metadata": {}
1730 | },
1731 | {
1732 | "output_type": "stream",
1733 | "name": "stdout",
1734 | "text": [
1735 | "✅ 產生 parquet: /content/outputs/train.parquet\n"
1736 | ]
1737 | }
1738 | ],
1739 | "source": [
1740 | "if ALSO_UPLOAD_PARQUET:\n",
1741 | " from datasets import Dataset\n",
1742 | " import pandas as pd\n",
1743 | "\n",
1744 | " # 讀 jsonl → Dataset → parquet\n",
1745 | " rows = []\n",
1746 | " with LOCAL_JSONL.open(\"r\", encoding=\"utf-8\") as f:\n",
1747 | " for line in f:\n",
1748 | " rows.append(json.loads(line))\n",
1749 | " ds = Dataset.from_pandas(pd.DataFrame(rows))\n",
1750 | " PARQUET_PATH = Path(\"outputs/train.parquet\")\n",
1751 | " ds.to_parquet(PARQUET_PATH)\n",
1752 | " print(\"✅ 產生 parquet:\", PARQUET_PATH.resolve())\n",
1753 | "else:\n",
1754 | " PARQUET_PATH = None"
1755 | ],
1756 | "id": "9e517999-70ed-4814-8fe7-b486c360fa6d"
1757 | },
1758 | {
1759 | "cell_type": "code",
1760 | "execution_count": 29,
1761 | "metadata": {
1762 | "id": "d8177bde-6942-4329-96d6-1d268fa68fbc",
1763 | "colab": {
1764 | "base_uri": "https://localhost:8080/"
1765 | },
1766 | "outputId": "7fa49882-346c-4713-a6e1-fadca8e80dc1"
1767 | },
1768 | "outputs": [
1769 | {
1770 | "output_type": "stream",
1771 | "name": "stdout",
1772 | "text": [
1773 | "✅ 產生 .gitattributes\n"
1774 | ]
1775 | }
1776 | ],
1777 | "source": [
1778 | "# HF 對部分副檔名會自動 LFS,但 .jsonl 有時未必;這裡顯式指定\n",
1779 | "GITATTR_PATH = Path(\"outputs/.gitattributes\")\n",
1780 | "gitattributes = \"\"\"*.jsonl filter=lfs diff=lfs merge=lfs -text\n",
1781 | "*.parquet filter=lfs diff=lfs merge=lfs -text\n",
1782 | "\"\"\"\n",
1783 | "GITATTR_PATH.write_text(gitattributes, encoding=\"utf-8\")\n",
1784 | "print(\"✅ 產生 .gitattributes\")"
1785 | ],
1786 | "id": "d8177bde-6942-4329-96d6-1d268fa68fbc"
1787 | },
1788 | {
1789 | "cell_type": "code",
1790 | "execution_count": 30,
1791 | "metadata": {
1792 | "id": "b6a79ecc-b60b-4f22-91e8-c68cecc38beb",
1793 | "colab": {
1794 | "base_uri": "https://localhost:8080/",
1795 | "height": 54
1796 | },
1797 | "outputId": "2cf834d7-1971-4b56-9f87-a1a01bff42c6"
1798 | },
1799 | "outputs": [
1800 | {
1801 | "output_type": "stream",
1802 | "name": "stdout",
1803 | "text": [
1804 | "Logged in as: Simon-Liu\n"
1805 | ]
1806 | },
1807 | {
1808 | "output_type": "execute_result",
1809 | "data": {
1810 | "text/plain": [
1811 | "RepoUrl('https://huggingface.co/datasets/Simon-Liu/gemma-270m-medium-qa', endpoint='https://huggingface.co', repo_type='dataset', repo_id='Simon-Liu/gemma-270m-medium-qa')"
1812 | ],
1813 | "application/vnd.google.colaboratory.intrinsic+json": {
1814 | "type": "string"
1815 | }
1816 | },
1817 | "metadata": {},
1818 | "execution_count": 30
1819 | }
1820 | ],
1821 | "source": [
1822 | "from huggingface_hub import login as hf_login, whoami, HfApi\n",
1823 | "\n",
1824 | "hf_login(token=HF_TOKEN) # 一次性登入本機快取\n",
1825 | "print(\"Logged in as:\", whoami()[\"name\"])\n",
1826 | "\n",
1827 | "# ==== 先建立(或覆用)Dataset repo ====\n",
1828 | "api = HfApi()\n",
1829 | "api.create_repo(\n",
1830 | " repo_id=REPO_ID,\n",
1831 | " repo_type=\"dataset\",\n",
1832 | " exist_ok=True, # 已存在則不報錯\n",
1833 | " private=False # 需要私有可改 True\n",
1834 | ")"
1835 | ],
1836 | "id": "b6a79ecc-b60b-4f22-91e8-c68cecc38beb"
1837 | },
1838 | {
1839 | "cell_type": "code",
1840 | "execution_count": 31,
1841 | "metadata": {
1842 | "id": "dd562d22-bf69-47f4-b18c-775737ec4c90",
1843 | "colab": {
1844 | "base_uri": "https://localhost:8080/",
1845 | "height": 318,
1846 | "referenced_widgets": [
1847 | "9b63f7b54d1d467a9fb6db05fa0f4d22",
1848 | "e3a1b058363143d6a6d880fad566eb60",
1849 | "6b480e1bbd424748843f1d8f5c29cc32",
1850 | "53c6620d77e94493a7ab2ec4c85d9dd9",
1851 | "672207930bf74056896786b279637c72",
1852 | "bda31933ed0f4beeb946cba939924164",
1853 | "d7050aa8916b47bebd0ea2f7b439f38f",
1854 | "b8dde2ad8fa842859cada5ed80d62e5e",
1855 | "168e77f14c2246a884170abbc354e7c0",
1856 | "51df796cf27947fbbd9265418c1b6c53",
1857 | "04c788403c854984b2fbf382c66ad739",
1858 | "a8b0edfbff934a2eb78b0b858fb1fad1",
1859 | "407c85019fa946109c1eb0bd8beb50fe",
1860 | "f30c554c31644300ba33e4bcacded9cb",
1861 | "cd30de9e72c64e4a8c1beefeab3807cd",
1862 | "d23709045e2040439721c14a84691bb8",
1863 | "c9de0f26c0654ca1ab7323a862d928f6",
1864 | "03d101d933884218af93f9928ddd33ea",
1865 | "d89cf9e9c5e24d77afe223d4f2c40f12",
1866 | "1da9689469644863abce276ad1e5f3b7",
1867 | "ed2f10039e4e425d8b53aeedfc0a6ee8",
1868 | "bc8fbbace35847e19ca731d5c8f29093",
1869 | "1ef55a8510074674b04dd33bf11e0966",
1870 | "880ce95de5344905b01490ab77791ecd",
1871 | "cd4ea41d9e4840898c702ec214bc38f6",
1872 | "96ef472afbb84a36a12caa4827e49904",
1873 | "4dfd57e35e414aec866305e1c3862f43",
1874 | "f3085f5a5d2d4e5baf91834a7aff0823",
1875 | "beadea8e78804664bbfcbd95527fda43",
1876 | "3d306af68c664ed0a366e85aaf0dfcdb",
1877 | "083c8e025835406ab0d2e80e7f4a20a4",
1878 | "5e3d5a36afe548e58fbc97360205c798",
1879 | "6ebb44e73cff4db28191370741b00fec",
1880 | "e386f44e49a1465f941b8be9a0122fb1",
1881 | "464b2f8c754545f188e22354fa5656b2",
1882 | "135ad0f12d9f4856b6db2525a2e7d6f2",
1883 | "7b69d2e073c34debb8c1f67e4abe43e6",
1884 | "22f7305ccc934e288c3301f4c034949e",
1885 | "fc1baa2c97ce4ab1800174f7a8dc9f5e",
1886 | "0805f99a9a23458c9ed5770a0aca463a",
1887 | "08a60d97e888439cbca68cdfbbd91c72",
1888 | "5402917fe7b74add855ac97ed388c50b",
1889 | "6755033e680049f391bcba6f9c55762f",
1890 | "6bd62bd1ff8345b28eb37636a8f6ea34",
1891 | "64eb477930fc41b99959d645f48715d3",
1892 | "64a124094c47471096d06036005390a2",
1893 | "b6a380f0e58443f7a559db24976bc31f",
1894 | "a17a2e82f2844e25afc7f290581e7f2d",
1895 | "23b66fac5d1c42768f37aedd1abca1aa",
1896 | "b0bcd57df2404c76afc84208bdaf62f4",
1897 | "1714d8edf85549c193d6143aa0eac916",
1898 | "858128db76f743b795e8a70a7e6a4e81",
1899 | "2d0a60a2d2b6456897c20274102de722",
1900 | "e2c52bf417854b0284df7bd2eafeae18",
1901 | "805883b350a84f3a83da12eadcc8dfa2",
1902 | "b1d541f65026459e87ec6c8b2885d5e1",
1903 | "de1084195ecf48609a52faf367a1c774",
1904 | "46791d145bcd494b955dba5dc04a7ffe",
1905 | "2eb0868140ef4a159f24a391a7a81175",
1906 | "4cf10360bda24346b064cee196b4d9ca",
1907 | "8e5b7a693c214004abd0e83ed176bfb9",
1908 | "8e8589eb7b56495fb5d81ebcb16bb0bc",
1909 | "1768820f1f9f4fe1b190c89c1c849778",
1910 | "f91c6242eac9427bbd7b6b6ef0cf7f45",
1911 | "26a22e17c52549c4b30aa5ae22451467",
1912 | "484c25d6506a4b7891c9090e8ae7b444"
1913 | ]
1914 | },
1915 | "outputId": "f330ec69-a157-4445-811b-dd787e16dba7"
1916 | },
1917 | "outputs": [
1918 | {
1919 | "output_type": "display_data",
1920 | "data": {
1921 | "text/plain": [
1922 | "Processing Files (0 / 0) : | | 0.00B / 0.00B "
1923 | ],
1924 | "application/vnd.jupyter.widget-view+json": {
1925 | "version_major": 2,
1926 | "version_minor": 0,
1927 | "model_id": "9b63f7b54d1d467a9fb6db05fa0f4d22"
1928 | }
1929 | },
1930 | "metadata": {}
1931 | },
1932 | {
1933 | "output_type": "display_data",
1934 | "data": {
1935 | "text/plain": [
1936 | "New Data Upload : | | 0.00B / 0.00B "
1937 | ],
1938 | "application/vnd.jupyter.widget-view+json": {
1939 | "version_major": 2,
1940 | "version_minor": 0,
1941 | "model_id": "a8b0edfbff934a2eb78b0b858fb1fad1"
1942 | }
1943 | },
1944 | "metadata": {}
1945 | },
1946 | {
1947 | "output_type": "display_data",
1948 | "data": {
1949 | "text/plain": [
1950 | " outputs/datasets.jsonl : 100%|##########| 69.5kB / 69.5kB "
1951 | ],
1952 | "application/vnd.jupyter.widget-view+json": {
1953 | "version_major": 2,
1954 | "version_minor": 0,
1955 | "model_id": "1ef55a8510074674b04dd33bf11e0966"
1956 | }
1957 | },
1958 | "metadata": {}
1959 | },
1960 | {
1961 | "output_type": "stream",
1962 | "name": "stderr",
1963 | "text": [
1964 | "No files have been modified since last commit. Skipping to prevent empty commit.\n",
1965 | "WARNING:huggingface_hub.hf_api:No files have been modified since last commit. Skipping to prevent empty commit.\n",
1966 | "No files have been modified since last commit. Skipping to prevent empty commit.\n",
1967 | "WARNING:huggingface_hub.hf_api:No files have been modified since last commit. Skipping to prevent empty commit.\n"
1968 | ]
1969 | },
1970 | {
1971 | "output_type": "display_data",
1972 | "data": {
1973 | "text/plain": [
1974 | "Processing Files (0 / 0) : | | 0.00B / 0.00B "
1975 | ],
1976 | "application/vnd.jupyter.widget-view+json": {
1977 | "version_major": 2,
1978 | "version_minor": 0,
1979 | "model_id": "e386f44e49a1465f941b8be9a0122fb1"
1980 | }
1981 | },
1982 | "metadata": {}
1983 | },
1984 | {
1985 | "output_type": "display_data",
1986 | "data": {
1987 | "text/plain": [
1988 | "New Data Upload : | | 0.00B / 0.00B "
1989 | ],
1990 | "application/vnd.jupyter.widget-view+json": {
1991 | "version_major": 2,
1992 | "version_minor": 0,
1993 | "model_id": "64eb477930fc41b99959d645f48715d3"
1994 | }
1995 | },
1996 | "metadata": {}
1997 | },
1998 | {
1999 | "output_type": "display_data",
2000 | "data": {
2001 | "text/plain": [
2002 | " outputs/train.parquet : 100%|##########| 30.0kB / 30.0kB "
2003 | ],
2004 | "application/vnd.jupyter.widget-view+json": {
2005 | "version_major": 2,
2006 | "version_minor": 0,
2007 | "model_id": "b1d541f65026459e87ec6c8b2885d5e1"
2008 | }
2009 | },
2010 | "metadata": {}
2011 | },
2012 | {
2013 | "output_type": "stream",
2014 | "name": "stdout",
2015 | "text": [
2016 | "✅ 上傳完成\n",
2017 | "👉 瀏覽: https://huggingface.co/datasets/Simon-Liu/gemma-270m-medium-qa\n"
2018 | ]
2019 | }
2020 | ],
2021 | "source": [
2022 | "# 建議的 Hub 目錄結構\n",
2023 | "REMOTE_JSONL = \"data/train.jsonl\"\n",
2024 | "REMOTE_PARQUET = \"data/train.parquet\" if PARQUET_PATH else None\n",
2025 | "REMOTE_CARD = \"README.md\"\n",
2026 | "REMOTE_GITATTR = \".gitattributes\"\n",
2027 | "\n",
2028 | "# 逐檔上傳(huggingface_hub 會自動處理 commit)\n",
2029 | "upload_file(\n",
2030 | " path_or_fileobj=str(LOCAL_JSONL),\n",
2031 | " path_in_repo=REMOTE_JSONL,\n",
2032 | " repo_id=REPO_ID,\n",
2033 | " repo_type=\"dataset\",\n",
2034 | ")\n",
2035 | "\n",
2036 | "upload_file(\n",
2037 | " path_or_fileobj=str(CARD_PATH),\n",
2038 | " path_in_repo=REMOTE_CARD,\n",
2039 | " repo_id=REPO_ID,\n",
2040 | " repo_type=\"dataset\",\n",
2041 | ")\n",
2042 | "\n",
2043 | "upload_file(\n",
2044 | " path_or_fileobj=str(GITATTR_PATH),\n",
2045 | " path_in_repo=REMOTE_GITATTR,\n",
2046 | " repo_id=REPO_ID,\n",
2047 | " repo_type=\"dataset\",\n",
2048 | ")\n",
2049 | "\n",
2050 | "if PARQUET_PATH and PARQUET_PATH.exists():\n",
2051 | " upload_file(\n",
2052 | " path_or_fileobj=str(PARQUET_PATH),\n",
2053 | " path_in_repo=REMOTE_PARQUET,\n",
2054 | " repo_id=REPO_ID,\n",
2055 | " repo_type=\"dataset\",\n",
2056 | " )\n",
2057 | "\n",
2058 | "print(\"✅ 上傳完成\")\n",
2059 | "print(f\"👉 瀏覽: https://huggingface.co/datasets/{REPO_ID}\")"
2060 | ],
2061 | "id": "dd562d22-bf69-47f4-b18c-775737ec4c90"
2062 | }
2063 | ],
2064 | "metadata": {
2065 | "kernelspec": {
2066 | "display_name": "Python 3 (ipykernel)",
2067 | "language": "python",
2068 | "name": "python3"
2069 | },
2070 | "language_info": {
2071 | "codemirror_mode": {
2072 | "name": "ipython",
2073 | "version": 3
2074 | },
2075 | "file_extension": ".py",
2076 | "mimetype": "text/x-python",
2077 | "name": "python",
2078 | "nbconvert_exporter": "python",
2079 | "pygments_lexer": "ipython3",
2080 | "version": "3.11.6"
2081 | },
2082 | "colab": {
2083 | "provenance": [],
2084 | "collapsed_sections": [
2085 | "cd084c2b-1741-4a4d-8932-5e6dfdfafcfa"
2086 | ],
2087 | "include_colab_link": true
2088 | },
2089 | "widgets": {
2090 | "application/vnd.jupyter.widget-state+json": {
2091 | "c91e25ccc6034a7c94f52d676c44b94e": {
2092 | "model_module": "@jupyter-widgets/controls",
2093 | "model_name": "HBoxModel",
2094 | "model_module_version": "1.5.0",
2095 | "state": {
2096 | "_dom_classes": [],
2097 | "_model_module": "@jupyter-widgets/controls",
2098 | "_model_module_version": "1.5.0",
2099 | "_model_name": "HBoxModel",
2100 | "_view_count": null,
2101 | "_view_module": "@jupyter-widgets/controls",
2102 | "_view_module_version": "1.5.0",
2103 | "_view_name": "HBoxView",
2104 | "box_style": "",
2105 | "children": [
2106 | "IPY_MODEL_13731b55ad004f798665cdc8af1cfff3",
2107 | "IPY_MODEL_cac6255099324943afc2b4fb4e0df034",
2108 | "IPY_MODEL_b2657e8be0c644558ac96a766b85e29c"
2109 | ],
2110 | "layout": "IPY_MODEL_6d6a9866108f4b248808096ef0dfbeea"
2111 | }
2112 | },
2113 | "13731b55ad004f798665cdc8af1cfff3": {
2114 | "model_module": "@jupyter-widgets/controls",
2115 | "model_name": "HTMLModel",
2116 | "model_module_version": "1.5.0",
2117 | "state": {
2118 | "_dom_classes": [],
2119 | "_model_module": "@jupyter-widgets/controls",
2120 | "_model_module_version": "1.5.0",
2121 | "_model_name": "HTMLModel",
2122 | "_view_count": null,
2123 | "_view_module": "@jupyter-widgets/controls",
2124 | "_view_module_version": "1.5.0",
2125 | "_view_name": "HTMLView",
2126 | "description": "",
2127 | "description_tooltip": null,
2128 | "layout": "IPY_MODEL_c5d372f7efc840aca0e64f3b2a4350fb",
2129 | "placeholder": "",
2130 | "style": "IPY_MODEL_9ccd9d2c4a6246a2994be3f1eea8be37",
2131 | "value": "Creating parquet from Arrow format: 100%"
2132 | }
2133 | },
2134 | "cac6255099324943afc2b4fb4e0df034": {
2135 | "model_module": "@jupyter-widgets/controls",
2136 | "model_name": "FloatProgressModel",
2137 | "model_module_version": "1.5.0",
2138 | "state": {
2139 | "_dom_classes": [],
2140 | "_model_module": "@jupyter-widgets/controls",
2141 | "_model_module_version": "1.5.0",
2142 | "_model_name": "FloatProgressModel",
2143 | "_view_count": null,
2144 | "_view_module": "@jupyter-widgets/controls",
2145 | "_view_module_version": "1.5.0",
2146 | "_view_name": "ProgressView",
2147 | "bar_style": "success",
2148 | "description": "",
2149 | "description_tooltip": null,
2150 | "layout": "IPY_MODEL_01bb49ccf71e4555a06a6b1933e7c016",
2151 | "max": 1,
2152 | "min": 0,
2153 | "orientation": "horizontal",
2154 | "style": "IPY_MODEL_c6d75ee7208d4e29b3723889751a6284",
2155 | "value": 1
2156 | }
2157 | },
2158 | "b2657e8be0c644558ac96a766b85e29c": {
2159 | "model_module": "@jupyter-widgets/controls",
2160 | "model_name": "HTMLModel",
2161 | "model_module_version": "1.5.0",
2162 | "state": {
2163 | "_dom_classes": [],
2164 | "_model_module": "@jupyter-widgets/controls",
2165 | "_model_module_version": "1.5.0",
2166 | "_model_name": "HTMLModel",
2167 | "_view_count": null,
2168 | "_view_module": "@jupyter-widgets/controls",
2169 | "_view_module_version": "1.5.0",
2170 | "_view_name": "HTMLView",
2171 | "description": "",
2172 | "description_tooltip": null,
2173 | "layout": "IPY_MODEL_c58b638d61764d94a15bc15271296320",
2174 | "placeholder": "",
2175 | "style": "IPY_MODEL_850eda4161184dd5b7925955ebcd320e",
2176 | "value": " 1/1 [00:00<00:00, 40.58ba/s]"
2177 | }
2178 | },
2179 | "6d6a9866108f4b248808096ef0dfbeea": {
2180 | "model_module": "@jupyter-widgets/base",
2181 | "model_name": "LayoutModel",
2182 | "model_module_version": "1.2.0",
2183 | "state": {
2184 | "_model_module": "@jupyter-widgets/base",
2185 | "_model_module_version": "1.2.0",
2186 | "_model_name": "LayoutModel",
2187 | "_view_count": null,
2188 | "_view_module": "@jupyter-widgets/base",
2189 | "_view_module_version": "1.2.0",
2190 | "_view_name": "LayoutView",
2191 | "align_content": null,
2192 | "align_items": null,
2193 | "align_self": null,
2194 | "border": null,
2195 | "bottom": null,
2196 | "display": null,
2197 | "flex": null,
2198 | "flex_flow": null,
2199 | "grid_area": null,
2200 | "grid_auto_columns": null,
2201 | "grid_auto_flow": null,
2202 | "grid_auto_rows": null,
2203 | "grid_column": null,
2204 | "grid_gap": null,
2205 | "grid_row": null,
2206 | "grid_template_areas": null,
2207 | "grid_template_columns": null,
2208 | "grid_template_rows": null,
2209 | "height": null,
2210 | "justify_content": null,
2211 | "justify_items": null,
2212 | "left": null,
2213 | "margin": null,
2214 | "max_height": null,
2215 | "max_width": null,
2216 | "min_height": null,
2217 | "min_width": null,
2218 | "object_fit": null,
2219 | "object_position": null,
2220 | "order": null,
2221 | "overflow": null,
2222 | "overflow_x": null,
2223 | "overflow_y": null,
2224 | "padding": null,
2225 | "right": null,
2226 | "top": null,
2227 | "visibility": null,
2228 | "width": null
2229 | }
2230 | },
2231 | "c5d372f7efc840aca0e64f3b2a4350fb": {
2232 | "model_module": "@jupyter-widgets/base",
2233 | "model_name": "LayoutModel",
2234 | "model_module_version": "1.2.0",
2235 | "state": {
2236 | "_model_module": "@jupyter-widgets/base",
2237 | "_model_module_version": "1.2.0",
2238 | "_model_name": "LayoutModel",
2239 | "_view_count": null,
2240 | "_view_module": "@jupyter-widgets/base",
2241 | "_view_module_version": "1.2.0",
2242 | "_view_name": "LayoutView",
2243 | "align_content": null,
2244 | "align_items": null,
2245 | "align_self": null,
2246 | "border": null,
2247 | "bottom": null,
2248 | "display": null,
2249 | "flex": null,
2250 | "flex_flow": null,
2251 | "grid_area": null,
2252 | "grid_auto_columns": null,
2253 | "grid_auto_flow": null,
2254 | "grid_auto_rows": null,
2255 | "grid_column": null,
2256 | "grid_gap": null,
2257 | "grid_row": null,
2258 | "grid_template_areas": null,
2259 | "grid_template_columns": null,
2260 | "grid_template_rows": null,
2261 | "height": null,
2262 | "justify_content": null,
2263 | "justify_items": null,
2264 | "left": null,
2265 | "margin": null,
2266 | "max_height": null,
2267 | "max_width": null,
2268 | "min_height": null,
2269 | "min_width": null,
2270 | "object_fit": null,
2271 | "object_position": null,
2272 | "order": null,
2273 | "overflow": null,
2274 | "overflow_x": null,
2275 | "overflow_y": null,
2276 | "padding": null,
2277 | "right": null,
2278 | "top": null,
2279 | "visibility": null,
2280 | "width": null
2281 | }
2282 | },
2283 | "9ccd9d2c4a6246a2994be3f1eea8be37": {
2284 | "model_module": "@jupyter-widgets/controls",
2285 | "model_name": "DescriptionStyleModel",
2286 | "model_module_version": "1.5.0",
2287 | "state": {
2288 | "_model_module": "@jupyter-widgets/controls",
2289 | "_model_module_version": "1.5.0",
2290 | "_model_name": "DescriptionStyleModel",
2291 | "_view_count": null,
2292 | "_view_module": "@jupyter-widgets/base",
2293 | "_view_module_version": "1.2.0",
2294 | "_view_name": "StyleView",
2295 | "description_width": ""
2296 | }
2297 | },
2298 | "01bb49ccf71e4555a06a6b1933e7c016": {
2299 | "model_module": "@jupyter-widgets/base",
2300 | "model_name": "LayoutModel",
2301 | "model_module_version": "1.2.0",
2302 | "state": {
2303 | "_model_module": "@jupyter-widgets/base",
2304 | "_model_module_version": "1.2.0",
2305 | "_model_name": "LayoutModel",
2306 | "_view_count": null,
2307 | "_view_module": "@jupyter-widgets/base",
2308 | "_view_module_version": "1.2.0",
2309 | "_view_name": "LayoutView",
2310 | "align_content": null,
2311 | "align_items": null,
2312 | "align_self": null,
2313 | "border": null,
2314 | "bottom": null,
2315 | "display": null,
2316 | "flex": null,
2317 | "flex_flow": null,
2318 | "grid_area": null,
2319 | "grid_auto_columns": null,
2320 | "grid_auto_flow": null,
2321 | "grid_auto_rows": null,
2322 | "grid_column": null,
2323 | "grid_gap": null,
2324 | "grid_row": null,
2325 | "grid_template_areas": null,
2326 | "grid_template_columns": null,
2327 | "grid_template_rows": null,
2328 | "height": null,
2329 | "justify_content": null,
2330 | "justify_items": null,
2331 | "left": null,
2332 | "margin": null,
2333 | "max_height": null,
2334 | "max_width": null,
2335 | "min_height": null,
2336 | "min_width": null,
2337 | "object_fit": null,
2338 | "object_position": null,
2339 | "order": null,
2340 | "overflow": null,
2341 | "overflow_x": null,
2342 | "overflow_y": null,
2343 | "padding": null,
2344 | "right": null,
2345 | "top": null,
2346 | "visibility": null,
2347 | "width": null
2348 | }
2349 | },
2350 | "c6d75ee7208d4e29b3723889751a6284": {
2351 | "model_module": "@jupyter-widgets/controls",
2352 | "model_name": "ProgressStyleModel",
2353 | "model_module_version": "1.5.0",
2354 | "state": {
2355 | "_model_module": "@jupyter-widgets/controls",
2356 | "_model_module_version": "1.5.0",
2357 | "_model_name": "ProgressStyleModel",
2358 | "_view_count": null,
2359 | "_view_module": "@jupyter-widgets/base",
2360 | "_view_module_version": "1.2.0",
2361 | "_view_name": "StyleView",
2362 | "bar_color": null,
2363 | "description_width": ""
2364 | }
2365 | },
2366 | "c58b638d61764d94a15bc15271296320": {
2367 | "model_module": "@jupyter-widgets/base",
2368 | "model_name": "LayoutModel",
2369 | "model_module_version": "1.2.0",
2370 | "state": {
2371 | "_model_module": "@jupyter-widgets/base",
2372 | "_model_module_version": "1.2.0",
2373 | "_model_name": "LayoutModel",
2374 | "_view_count": null,
2375 | "_view_module": "@jupyter-widgets/base",
2376 | "_view_module_version": "1.2.0",
2377 | "_view_name": "LayoutView",
2378 | "align_content": null,
2379 | "align_items": null,
2380 | "align_self": null,
2381 | "border": null,
2382 | "bottom": null,
2383 | "display": null,
2384 | "flex": null,
2385 | "flex_flow": null,
2386 | "grid_area": null,
2387 | "grid_auto_columns": null,
2388 | "grid_auto_flow": null,
2389 | "grid_auto_rows": null,
2390 | "grid_column": null,
2391 | "grid_gap": null,
2392 | "grid_row": null,
2393 | "grid_template_areas": null,
2394 | "grid_template_columns": null,
2395 | "grid_template_rows": null,
2396 | "height": null,
2397 | "justify_content": null,
2398 | "justify_items": null,
2399 | "left": null,
2400 | "margin": null,
2401 | "max_height": null,
2402 | "max_width": null,
2403 | "min_height": null,
2404 | "min_width": null,
2405 | "object_fit": null,
2406 | "object_position": null,
2407 | "order": null,
2408 | "overflow": null,
2409 | "overflow_x": null,
2410 | "overflow_y": null,
2411 | "padding": null,
2412 | "right": null,
2413 | "top": null,
2414 | "visibility": null,
2415 | "width": null
2416 | }
2417 | },
2418 | "850eda4161184dd5b7925955ebcd320e": {
2419 | "model_module": "@jupyter-widgets/controls",
2420 | "model_name": "DescriptionStyleModel",
2421 | "model_module_version": "1.5.0",
2422 | "state": {
2423 | "_model_module": "@jupyter-widgets/controls",
2424 | "_model_module_version": "1.5.0",
2425 | "_model_name": "DescriptionStyleModel",
2426 | "_view_count": null,
2427 | "_view_module": "@jupyter-widgets/base",
2428 | "_view_module_version": "1.2.0",
2429 | "_view_name": "StyleView",
2430 | "description_width": ""
2431 | }
2432 | },
2433 | "9b63f7b54d1d467a9fb6db05fa0f4d22": {
2434 | "model_module": "@jupyter-widgets/controls",
2435 | "model_name": "HBoxModel",
2436 | "model_module_version": "1.5.0",
2437 | "state": {
2438 | "_dom_classes": [],
2439 | "_model_module": "@jupyter-widgets/controls",
2440 | "_model_module_version": "1.5.0",
2441 | "_model_name": "HBoxModel",
2442 | "_view_count": null,
2443 | "_view_module": "@jupyter-widgets/controls",
2444 | "_view_module_version": "1.5.0",
2445 | "_view_name": "HBoxView",
2446 | "box_style": "",
2447 | "children": [
2448 | "IPY_MODEL_e3a1b058363143d6a6d880fad566eb60",
2449 | "IPY_MODEL_6b480e1bbd424748843f1d8f5c29cc32",
2450 | "IPY_MODEL_53c6620d77e94493a7ab2ec4c85d9dd9"
2451 | ],
2452 | "layout": "IPY_MODEL_672207930bf74056896786b279637c72"
2453 | }
2454 | },
2455 | "e3a1b058363143d6a6d880fad566eb60": {
2456 | "model_module": "@jupyter-widgets/controls",
2457 | "model_name": "HTMLModel",
2458 | "model_module_version": "1.5.0",
2459 | "state": {
2460 | "_dom_classes": [],
2461 | "_model_module": "@jupyter-widgets/controls",
2462 | "_model_module_version": "1.5.0",
2463 | "_model_name": "HTMLModel",
2464 | "_view_count": null,
2465 | "_view_module": "@jupyter-widgets/controls",
2466 | "_view_module_version": "1.5.0",
2467 | "_view_name": "HTMLView",
2468 | "description": "",
2469 | "description_tooltip": null,
2470 | "layout": "IPY_MODEL_bda31933ed0f4beeb946cba939924164",
2471 | "placeholder": "",
2472 | "style": "IPY_MODEL_d7050aa8916b47bebd0ea2f7b439f38f",
2473 | "value": "Processing Files (1 / 1) : 100%"
2474 | }
2475 | },
2476 | "6b480e1bbd424748843f1d8f5c29cc32": {
2477 | "model_module": "@jupyter-widgets/controls",
2478 | "model_name": "FloatProgressModel",
2479 | "model_module_version": "1.5.0",
2480 | "state": {
2481 | "_dom_classes": [],
2482 | "_model_module": "@jupyter-widgets/controls",
2483 | "_model_module_version": "1.5.0",
2484 | "_model_name": "FloatProgressModel",
2485 | "_view_count": null,
2486 | "_view_module": "@jupyter-widgets/controls",
2487 | "_view_module_version": "1.5.0",
2488 | "_view_name": "ProgressView",
2489 | "bar_style": "success",
2490 | "description": "",
2491 | "description_tooltip": null,
2492 | "layout": "IPY_MODEL_b8dde2ad8fa842859cada5ed80d62e5e",
2493 | "max": 1,
2494 | "min": 0,
2495 | "orientation": "horizontal",
2496 | "style": "IPY_MODEL_168e77f14c2246a884170abbc354e7c0",
2497 | "value": 1
2498 | }
2499 | },
2500 | "53c6620d77e94493a7ab2ec4c85d9dd9": {
2501 | "model_module": "@jupyter-widgets/controls",
2502 | "model_name": "HTMLModel",
2503 | "model_module_version": "1.5.0",
2504 | "state": {
2505 | "_dom_classes": [],
2506 | "_model_module": "@jupyter-widgets/controls",
2507 | "_model_module_version": "1.5.0",
2508 | "_model_name": "HTMLModel",
2509 | "_view_count": null,
2510 | "_view_module": "@jupyter-widgets/controls",
2511 | "_view_module_version": "1.5.0",
2512 | "_view_name": "HTMLView",
2513 | "description": "",
2514 | "description_tooltip": null,
2515 | "layout": "IPY_MODEL_51df796cf27947fbbd9265418c1b6c53",
2516 | "placeholder": "",
2517 | "style": "IPY_MODEL_04c788403c854984b2fbf382c66ad739",
2518 | "value": " 69.5kB / 69.5kB, 174kB/s "
2519 | }
2520 | },
2521 | "672207930bf74056896786b279637c72": {
2522 | "model_module": "@jupyter-widgets/base",
2523 | "model_name": "LayoutModel",
2524 | "model_module_version": "1.2.0",
2525 | "state": {
2526 | "_model_module": "@jupyter-widgets/base",
2527 | "_model_module_version": "1.2.0",
2528 | "_model_name": "LayoutModel",
2529 | "_view_count": null,
2530 | "_view_module": "@jupyter-widgets/base",
2531 | "_view_module_version": "1.2.0",
2532 | "_view_name": "LayoutView",
2533 | "align_content": null,
2534 | "align_items": null,
2535 | "align_self": null,
2536 | "border": null,
2537 | "bottom": null,
2538 | "display": null,
2539 | "flex": null,
2540 | "flex_flow": null,
2541 | "grid_area": null,
2542 | "grid_auto_columns": null,
2543 | "grid_auto_flow": null,
2544 | "grid_auto_rows": null,
2545 | "grid_column": null,
2546 | "grid_gap": null,
2547 | "grid_row": null,
2548 | "grid_template_areas": null,
2549 | "grid_template_columns": null,
2550 | "grid_template_rows": null,
2551 | "height": null,
2552 | "justify_content": null,
2553 | "justify_items": null,
2554 | "left": null,
2555 | "margin": null,
2556 | "max_height": null,
2557 | "max_width": null,
2558 | "min_height": null,
2559 | "min_width": null,
2560 | "object_fit": null,
2561 | "object_position": null,
2562 | "order": null,
2563 | "overflow": null,
2564 | "overflow_x": null,
2565 | "overflow_y": null,
2566 | "padding": null,
2567 | "right": null,
2568 | "top": null,
2569 | "visibility": null,
2570 | "width": null
2571 | }
2572 | },
2573 | "bda31933ed0f4beeb946cba939924164": {
2574 | "model_module": "@jupyter-widgets/base",
2575 | "model_name": "LayoutModel",
2576 | "model_module_version": "1.2.0",
2577 | "state": {
2578 | "_model_module": "@jupyter-widgets/base",
2579 | "_model_module_version": "1.2.0",
2580 | "_model_name": "LayoutModel",
2581 | "_view_count": null,
2582 | "_view_module": "@jupyter-widgets/base",
2583 | "_view_module_version": "1.2.0",
2584 | "_view_name": "LayoutView",
2585 | "align_content": null,
2586 | "align_items": null,
2587 | "align_self": null,
2588 | "border": null,
2589 | "bottom": null,
2590 | "display": null,
2591 | "flex": null,
2592 | "flex_flow": null,
2593 | "grid_area": null,
2594 | "grid_auto_columns": null,
2595 | "grid_auto_flow": null,
2596 | "grid_auto_rows": null,
2597 | "grid_column": null,
2598 | "grid_gap": null,
2599 | "grid_row": null,
2600 | "grid_template_areas": null,
2601 | "grid_template_columns": null,
2602 | "grid_template_rows": null,
2603 | "height": null,
2604 | "justify_content": null,
2605 | "justify_items": null,
2606 | "left": null,
2607 | "margin": null,
2608 | "max_height": null,
2609 | "max_width": null,
2610 | "min_height": null,
2611 | "min_width": null,
2612 | "object_fit": null,
2613 | "object_position": null,
2614 | "order": null,
2615 | "overflow": null,
2616 | "overflow_x": null,
2617 | "overflow_y": null,
2618 | "padding": null,
2619 | "right": null,
2620 | "top": null,
2621 | "visibility": null,
2622 | "width": null
2623 | }
2624 | },
2625 | "d7050aa8916b47bebd0ea2f7b439f38f": {
2626 | "model_module": "@jupyter-widgets/controls",
2627 | "model_name": "DescriptionStyleModel",
2628 | "model_module_version": "1.5.0",
2629 | "state": {
2630 | "_model_module": "@jupyter-widgets/controls",
2631 | "_model_module_version": "1.5.0",
2632 | "_model_name": "DescriptionStyleModel",
2633 | "_view_count": null,
2634 | "_view_module": "@jupyter-widgets/base",
2635 | "_view_module_version": "1.2.0",
2636 | "_view_name": "StyleView",
2637 | "description_width": ""
2638 | }
2639 | },
2640 | "b8dde2ad8fa842859cada5ed80d62e5e": {
2641 | "model_module": "@jupyter-widgets/base",
2642 | "model_name": "LayoutModel",
2643 | "model_module_version": "1.2.0",
2644 | "state": {
2645 | "_model_module": "@jupyter-widgets/base",
2646 | "_model_module_version": "1.2.0",
2647 | "_model_name": "LayoutModel",
2648 | "_view_count": null,
2649 | "_view_module": "@jupyter-widgets/base",
2650 | "_view_module_version": "1.2.0",
2651 | "_view_name": "LayoutView",
2652 | "align_content": null,
2653 | "align_items": null,
2654 | "align_self": null,
2655 | "border": null,
2656 | "bottom": null,
2657 | "display": null,
2658 | "flex": null,
2659 | "flex_flow": null,
2660 | "grid_area": null,
2661 | "grid_auto_columns": null,
2662 | "grid_auto_flow": null,
2663 | "grid_auto_rows": null,
2664 | "grid_column": null,
2665 | "grid_gap": null,
2666 | "grid_row": null,
2667 | "grid_template_areas": null,
2668 | "grid_template_columns": null,
2669 | "grid_template_rows": null,
2670 | "height": null,
2671 | "justify_content": null,
2672 | "justify_items": null,
2673 | "left": null,
2674 | "margin": null,
2675 | "max_height": null,
2676 | "max_width": null,
2677 | "min_height": null,
2678 | "min_width": null,
2679 | "object_fit": null,
2680 | "object_position": null,
2681 | "order": null,
2682 | "overflow": null,
2683 | "overflow_x": null,
2684 | "overflow_y": null,
2685 | "padding": null,
2686 | "right": null,
2687 | "top": null,
2688 | "visibility": null,
2689 | "width": "20px"
2690 | }
2691 | },
2692 | "168e77f14c2246a884170abbc354e7c0": {
2693 | "model_module": "@jupyter-widgets/controls",
2694 | "model_name": "ProgressStyleModel",
2695 | "model_module_version": "1.5.0",
2696 | "state": {
2697 | "_model_module": "@jupyter-widgets/controls",
2698 | "_model_module_version": "1.5.0",
2699 | "_model_name": "ProgressStyleModel",
2700 | "_view_count": null,
2701 | "_view_module": "@jupyter-widgets/base",
2702 | "_view_module_version": "1.2.0",
2703 | "_view_name": "StyleView",
2704 | "bar_color": null,
2705 | "description_width": ""
2706 | }
2707 | },
2708 | "51df796cf27947fbbd9265418c1b6c53": {
2709 | "model_module": "@jupyter-widgets/base",
2710 | "model_name": "LayoutModel",
2711 | "model_module_version": "1.2.0",
2712 | "state": {
2713 | "_model_module": "@jupyter-widgets/base",
2714 | "_model_module_version": "1.2.0",
2715 | "_model_name": "LayoutModel",
2716 | "_view_count": null,
2717 | "_view_module": "@jupyter-widgets/base",
2718 | "_view_module_version": "1.2.0",
2719 | "_view_name": "LayoutView",
2720 | "align_content": null,
2721 | "align_items": null,
2722 | "align_self": null,
2723 | "border": null,
2724 | "bottom": null,
2725 | "display": null,
2726 | "flex": null,
2727 | "flex_flow": null,
2728 | "grid_area": null,
2729 | "grid_auto_columns": null,
2730 | "grid_auto_flow": null,
2731 | "grid_auto_rows": null,
2732 | "grid_column": null,
2733 | "grid_gap": null,
2734 | "grid_row": null,
2735 | "grid_template_areas": null,
2736 | "grid_template_columns": null,
2737 | "grid_template_rows": null,
2738 | "height": null,
2739 | "justify_content": null,
2740 | "justify_items": null,
2741 | "left": null,
2742 | "margin": null,
2743 | "max_height": null,
2744 | "max_width": null,
2745 | "min_height": null,
2746 | "min_width": null,
2747 | "object_fit": null,
2748 | "object_position": null,
2749 | "order": null,
2750 | "overflow": null,
2751 | "overflow_x": null,
2752 | "overflow_y": null,
2753 | "padding": null,
2754 | "right": null,
2755 | "top": null,
2756 | "visibility": null,
2757 | "width": null
2758 | }
2759 | },
2760 | "04c788403c854984b2fbf382c66ad739": {
2761 | "model_module": "@jupyter-widgets/controls",
2762 | "model_name": "DescriptionStyleModel",
2763 | "model_module_version": "1.5.0",
2764 | "state": {
2765 | "_model_module": "@jupyter-widgets/controls",
2766 | "_model_module_version": "1.5.0",
2767 | "_model_name": "DescriptionStyleModel",
2768 | "_view_count": null,
2769 | "_view_module": "@jupyter-widgets/base",
2770 | "_view_module_version": "1.2.0",
2771 | "_view_name": "StyleView",
2772 | "description_width": ""
2773 | }
2774 | },
2775 | "a8b0edfbff934a2eb78b0b858fb1fad1": {
2776 | "model_module": "@jupyter-widgets/controls",
2777 | "model_name": "HBoxModel",
2778 | "model_module_version": "1.5.0",
2779 | "state": {
2780 | "_dom_classes": [],
2781 | "_model_module": "@jupyter-widgets/controls",
2782 | "_model_module_version": "1.5.0",
2783 | "_model_name": "HBoxModel",
2784 | "_view_count": null,
2785 | "_view_module": "@jupyter-widgets/controls",
2786 | "_view_module_version": "1.5.0",
2787 | "_view_name": "HBoxView",
2788 | "box_style": "",
2789 | "children": [
2790 | "IPY_MODEL_407c85019fa946109c1eb0bd8beb50fe",
2791 | "IPY_MODEL_f30c554c31644300ba33e4bcacded9cb",
2792 | "IPY_MODEL_cd30de9e72c64e4a8c1beefeab3807cd"
2793 | ],
2794 | "layout": "IPY_MODEL_d23709045e2040439721c14a84691bb8"
2795 | }
2796 | },
2797 | "407c85019fa946109c1eb0bd8beb50fe": {
2798 | "model_module": "@jupyter-widgets/controls",
2799 | "model_name": "HTMLModel",
2800 | "model_module_version": "1.5.0",
2801 | "state": {
2802 | "_dom_classes": [],
2803 | "_model_module": "@jupyter-widgets/controls",
2804 | "_model_module_version": "1.5.0",
2805 | "_model_name": "HTMLModel",
2806 | "_view_count": null,
2807 | "_view_module": "@jupyter-widgets/controls",
2808 | "_view_module_version": "1.5.0",
2809 | "_view_name": "HTMLView",
2810 | "description": "",
2811 | "description_tooltip": null,
2812 | "layout": "IPY_MODEL_c9de0f26c0654ca1ab7323a862d928f6",
2813 | "placeholder": "",
2814 | "style": "IPY_MODEL_03d101d933884218af93f9928ddd33ea",
2815 | "value": "New Data Upload : 100%"
2816 | }
2817 | },
2818 | "f30c554c31644300ba33e4bcacded9cb": {
2819 | "model_module": "@jupyter-widgets/controls",
2820 | "model_name": "FloatProgressModel",
2821 | "model_module_version": "1.5.0",
2822 | "state": {
2823 | "_dom_classes": [],
2824 | "_model_module": "@jupyter-widgets/controls",
2825 | "_model_module_version": "1.5.0",
2826 | "_model_name": "FloatProgressModel",
2827 | "_view_count": null,
2828 | "_view_module": "@jupyter-widgets/controls",
2829 | "_view_module_version": "1.5.0",
2830 | "_view_name": "ProgressView",
2831 | "bar_style": "success",
2832 | "description": "",
2833 | "description_tooltip": null,
2834 | "layout": "IPY_MODEL_d89cf9e9c5e24d77afe223d4f2c40f12",
2835 | "max": 1,
2836 | "min": 0,
2837 | "orientation": "horizontal",
2838 | "style": "IPY_MODEL_1da9689469644863abce276ad1e5f3b7",
2839 | "value": 1
2840 | }
2841 | },
2842 | "cd30de9e72c64e4a8c1beefeab3807cd": {
2843 | "model_module": "@jupyter-widgets/controls",
2844 | "model_name": "HTMLModel",
2845 | "model_module_version": "1.5.0",
2846 | "state": {
2847 | "_dom_classes": [],
2848 | "_model_module": "@jupyter-widgets/controls",
2849 | "_model_module_version": "1.5.0",
2850 | "_model_name": "HTMLModel",
2851 | "_view_count": null,
2852 | "_view_module": "@jupyter-widgets/controls",
2853 | "_view_module_version": "1.5.0",
2854 | "_view_name": "HTMLView",
2855 | "description": "",
2856 | "description_tooltip": null,
2857 | "layout": "IPY_MODEL_ed2f10039e4e425d8b53aeedfc0a6ee8",
2858 | "placeholder": "",
2859 | "style": "IPY_MODEL_bc8fbbace35847e19ca731d5c8f29093",
2860 | "value": " 69.5kB / 69.5kB, 174kB/s "
2861 | }
2862 | },
2863 | "d23709045e2040439721c14a84691bb8": {
2864 | "model_module": "@jupyter-widgets/base",
2865 | "model_name": "LayoutModel",
2866 | "model_module_version": "1.2.0",
2867 | "state": {
2868 | "_model_module": "@jupyter-widgets/base",
2869 | "_model_module_version": "1.2.0",
2870 | "_model_name": "LayoutModel",
2871 | "_view_count": null,
2872 | "_view_module": "@jupyter-widgets/base",
2873 | "_view_module_version": "1.2.0",
2874 | "_view_name": "LayoutView",
2875 | "align_content": null,
2876 | "align_items": null,
2877 | "align_self": null,
2878 | "border": null,
2879 | "bottom": null,
2880 | "display": null,
2881 | "flex": null,
2882 | "flex_flow": null,
2883 | "grid_area": null,
2884 | "grid_auto_columns": null,
2885 | "grid_auto_flow": null,
2886 | "grid_auto_rows": null,
2887 | "grid_column": null,
2888 | "grid_gap": null,
2889 | "grid_row": null,
2890 | "grid_template_areas": null,
2891 | "grid_template_columns": null,
2892 | "grid_template_rows": null,
2893 | "height": null,
2894 | "justify_content": null,
2895 | "justify_items": null,
2896 | "left": null,
2897 | "margin": null,
2898 | "max_height": null,
2899 | "max_width": null,
2900 | "min_height": null,
2901 | "min_width": null,
2902 | "object_fit": null,
2903 | "object_position": null,
2904 | "order": null,
2905 | "overflow": null,
2906 | "overflow_x": null,
2907 | "overflow_y": null,
2908 | "padding": null,
2909 | "right": null,
2910 | "top": null,
2911 | "visibility": null,
2912 | "width": null
2913 | }
2914 | },
2915 | "c9de0f26c0654ca1ab7323a862d928f6": {
2916 | "model_module": "@jupyter-widgets/base",
2917 | "model_name": "LayoutModel",
2918 | "model_module_version": "1.2.0",
2919 | "state": {
2920 | "_model_module": "@jupyter-widgets/base",
2921 | "_model_module_version": "1.2.0",
2922 | "_model_name": "LayoutModel",
2923 | "_view_count": null,
2924 | "_view_module": "@jupyter-widgets/base",
2925 | "_view_module_version": "1.2.0",
2926 | "_view_name": "LayoutView",
2927 | "align_content": null,
2928 | "align_items": null,
2929 | "align_self": null,
2930 | "border": null,
2931 | "bottom": null,
2932 | "display": null,
2933 | "flex": null,
2934 | "flex_flow": null,
2935 | "grid_area": null,
2936 | "grid_auto_columns": null,
2937 | "grid_auto_flow": null,
2938 | "grid_auto_rows": null,
2939 | "grid_column": null,
2940 | "grid_gap": null,
2941 | "grid_row": null,
2942 | "grid_template_areas": null,
2943 | "grid_template_columns": null,
2944 | "grid_template_rows": null,
2945 | "height": null,
2946 | "justify_content": null,
2947 | "justify_items": null,
2948 | "left": null,
2949 | "margin": null,
2950 | "max_height": null,
2951 | "max_width": null,
2952 | "min_height": null,
2953 | "min_width": null,
2954 | "object_fit": null,
2955 | "object_position": null,
2956 | "order": null,
2957 | "overflow": null,
2958 | "overflow_x": null,
2959 | "overflow_y": null,
2960 | "padding": null,
2961 | "right": null,
2962 | "top": null,
2963 | "visibility": null,
2964 | "width": null
2965 | }
2966 | },
2967 | "03d101d933884218af93f9928ddd33ea": {
2968 | "model_module": "@jupyter-widgets/controls",
2969 | "model_name": "DescriptionStyleModel",
2970 | "model_module_version": "1.5.0",
2971 | "state": {
2972 | "_model_module": "@jupyter-widgets/controls",
2973 | "_model_module_version": "1.5.0",
2974 | "_model_name": "DescriptionStyleModel",
2975 | "_view_count": null,
2976 | "_view_module": "@jupyter-widgets/base",
2977 | "_view_module_version": "1.2.0",
2978 | "_view_name": "StyleView",
2979 | "description_width": ""
2980 | }
2981 | },
2982 | "d89cf9e9c5e24d77afe223d4f2c40f12": {
2983 | "model_module": "@jupyter-widgets/base",
2984 | "model_name": "LayoutModel",
2985 | "model_module_version": "1.2.0",
2986 | "state": {
2987 | "_model_module": "@jupyter-widgets/base",
2988 | "_model_module_version": "1.2.0",
2989 | "_model_name": "LayoutModel",
2990 | "_view_count": null,
2991 | "_view_module": "@jupyter-widgets/base",
2992 | "_view_module_version": "1.2.0",
2993 | "_view_name": "LayoutView",
2994 | "align_content": null,
2995 | "align_items": null,
2996 | "align_self": null,
2997 | "border": null,
2998 | "bottom": null,
2999 | "display": null,
3000 | "flex": null,
3001 | "flex_flow": null,
3002 | "grid_area": null,
3003 | "grid_auto_columns": null,
3004 | "grid_auto_flow": null,
3005 | "grid_auto_rows": null,
3006 | "grid_column": null,
3007 | "grid_gap": null,
3008 | "grid_row": null,
3009 | "grid_template_areas": null,
3010 | "grid_template_columns": null,
3011 | "grid_template_rows": null,
3012 | "height": null,
3013 | "justify_content": null,
3014 | "justify_items": null,
3015 | "left": null,
3016 | "margin": null,
3017 | "max_height": null,
3018 | "max_width": null,
3019 | "min_height": null,
3020 | "min_width": null,
3021 | "object_fit": null,
3022 | "object_position": null,
3023 | "order": null,
3024 | "overflow": null,
3025 | "overflow_x": null,
3026 | "overflow_y": null,
3027 | "padding": null,
3028 | "right": null,
3029 | "top": null,
3030 | "visibility": null,
3031 | "width": "20px"
3032 | }
3033 | },
3034 | "1da9689469644863abce276ad1e5f3b7": {
3035 | "model_module": "@jupyter-widgets/controls",
3036 | "model_name": "ProgressStyleModel",
3037 | "model_module_version": "1.5.0",
3038 | "state": {
3039 | "_model_module": "@jupyter-widgets/controls",
3040 | "_model_module_version": "1.5.0",
3041 | "_model_name": "ProgressStyleModel",
3042 | "_view_count": null,
3043 | "_view_module": "@jupyter-widgets/base",
3044 | "_view_module_version": "1.2.0",
3045 | "_view_name": "StyleView",
3046 | "bar_color": null,
3047 | "description_width": ""
3048 | }
3049 | },
3050 | "ed2f10039e4e425d8b53aeedfc0a6ee8": {
3051 | "model_module": "@jupyter-widgets/base",
3052 | "model_name": "LayoutModel",
3053 | "model_module_version": "1.2.0",
3054 | "state": {
3055 | "_model_module": "@jupyter-widgets/base",
3056 | "_model_module_version": "1.2.0",
3057 | "_model_name": "LayoutModel",
3058 | "_view_count": null,
3059 | "_view_module": "@jupyter-widgets/base",
3060 | "_view_module_version": "1.2.0",
3061 | "_view_name": "LayoutView",
3062 | "align_content": null,
3063 | "align_items": null,
3064 | "align_self": null,
3065 | "border": null,
3066 | "bottom": null,
3067 | "display": null,
3068 | "flex": null,
3069 | "flex_flow": null,
3070 | "grid_area": null,
3071 | "grid_auto_columns": null,
3072 | "grid_auto_flow": null,
3073 | "grid_auto_rows": null,
3074 | "grid_column": null,
3075 | "grid_gap": null,
3076 | "grid_row": null,
3077 | "grid_template_areas": null,
3078 | "grid_template_columns": null,
3079 | "grid_template_rows": null,
3080 | "height": null,
3081 | "justify_content": null,
3082 | "justify_items": null,
3083 | "left": null,
3084 | "margin": null,
3085 | "max_height": null,
3086 | "max_width": null,
3087 | "min_height": null,
3088 | "min_width": null,
3089 | "object_fit": null,
3090 | "object_position": null,
3091 | "order": null,
3092 | "overflow": null,
3093 | "overflow_x": null,
3094 | "overflow_y": null,
3095 | "padding": null,
3096 | "right": null,
3097 | "top": null,
3098 | "visibility": null,
3099 | "width": null
3100 | }
3101 | },
3102 | "bc8fbbace35847e19ca731d5c8f29093": {
3103 | "model_module": "@jupyter-widgets/controls",
3104 | "model_name": "DescriptionStyleModel",
3105 | "model_module_version": "1.5.0",
3106 | "state": {
3107 | "_model_module": "@jupyter-widgets/controls",
3108 | "_model_module_version": "1.5.0",
3109 | "_model_name": "DescriptionStyleModel",
3110 | "_view_count": null,
3111 | "_view_module": "@jupyter-widgets/base",
3112 | "_view_module_version": "1.2.0",
3113 | "_view_name": "StyleView",
3114 | "description_width": ""
3115 | }
3116 | },
3117 | "1ef55a8510074674b04dd33bf11e0966": {
3118 | "model_module": "@jupyter-widgets/controls",
3119 | "model_name": "HBoxModel",
3120 | "model_module_version": "1.5.0",
3121 | "state": {
3122 | "_dom_classes": [],
3123 | "_model_module": "@jupyter-widgets/controls",
3124 | "_model_module_version": "1.5.0",
3125 | "_model_name": "HBoxModel",
3126 | "_view_count": null,
3127 | "_view_module": "@jupyter-widgets/controls",
3128 | "_view_module_version": "1.5.0",
3129 | "_view_name": "HBoxView",
3130 | "box_style": "",
3131 | "children": [
3132 | "IPY_MODEL_880ce95de5344905b01490ab77791ecd",
3133 | "IPY_MODEL_cd4ea41d9e4840898c702ec214bc38f6",
3134 | "IPY_MODEL_96ef472afbb84a36a12caa4827e49904"
3135 | ],
3136 | "layout": "IPY_MODEL_4dfd57e35e414aec866305e1c3862f43"
3137 | }
3138 | },
3139 | "880ce95de5344905b01490ab77791ecd": {
3140 | "model_module": "@jupyter-widgets/controls",
3141 | "model_name": "HTMLModel",
3142 | "model_module_version": "1.5.0",
3143 | "state": {
3144 | "_dom_classes": [],
3145 | "_model_module": "@jupyter-widgets/controls",
3146 | "_model_module_version": "1.5.0",
3147 | "_model_name": "HTMLModel",
3148 | "_view_count": null,
3149 | "_view_module": "@jupyter-widgets/controls",
3150 | "_view_module_version": "1.5.0",
3151 | "_view_name": "HTMLView",
3152 | "description": "",
3153 | "description_tooltip": null,
3154 | "layout": "IPY_MODEL_f3085f5a5d2d4e5baf91834a7aff0823",
3155 | "placeholder": "",
3156 | "style": "IPY_MODEL_beadea8e78804664bbfcbd95527fda43",
3157 | "value": " outputs/datasets.jsonl : 100%"
3158 | }
3159 | },
3160 | "cd4ea41d9e4840898c702ec214bc38f6": {
3161 | "model_module": "@jupyter-widgets/controls",
3162 | "model_name": "FloatProgressModel",
3163 | "model_module_version": "1.5.0",
3164 | "state": {
3165 | "_dom_classes": [],
3166 | "_model_module": "@jupyter-widgets/controls",
3167 | "_model_module_version": "1.5.0",
3168 | "_model_name": "FloatProgressModel",
3169 | "_view_count": null,
3170 | "_view_module": "@jupyter-widgets/controls",
3171 | "_view_module_version": "1.5.0",
3172 | "_view_name": "ProgressView",
3173 | "bar_style": "success",
3174 | "description": "",
3175 | "description_tooltip": null,
3176 | "layout": "IPY_MODEL_3d306af68c664ed0a366e85aaf0dfcdb",
3177 | "max": 69538,
3178 | "min": 0,
3179 | "orientation": "horizontal",
3180 | "style": "IPY_MODEL_083c8e025835406ab0d2e80e7f4a20a4",
3181 | "value": 69538
3182 | }
3183 | },
3184 | "96ef472afbb84a36a12caa4827e49904": {
3185 | "model_module": "@jupyter-widgets/controls",
3186 | "model_name": "HTMLModel",
3187 | "model_module_version": "1.5.0",
3188 | "state": {
3189 | "_dom_classes": [],
3190 | "_model_module": "@jupyter-widgets/controls",
3191 | "_model_module_version": "1.5.0",
3192 | "_model_name": "HTMLModel",
3193 | "_view_count": null,
3194 | "_view_module": "@jupyter-widgets/controls",
3195 | "_view_module_version": "1.5.0",
3196 | "_view_name": "HTMLView",
3197 | "description": "",
3198 | "description_tooltip": null,
3199 | "layout": "IPY_MODEL_5e3d5a36afe548e58fbc97360205c798",
3200 | "placeholder": "",
3201 | "style": "IPY_MODEL_6ebb44e73cff4db28191370741b00fec",
3202 | "value": " 69.5kB / 69.5kB "
3203 | }
3204 | },
3205 | "4dfd57e35e414aec866305e1c3862f43": {
3206 | "model_module": "@jupyter-widgets/base",
3207 | "model_name": "LayoutModel",
3208 | "model_module_version": "1.2.0",
3209 | "state": {
3210 | "_model_module": "@jupyter-widgets/base",
3211 | "_model_module_version": "1.2.0",
3212 | "_model_name": "LayoutModel",
3213 | "_view_count": null,
3214 | "_view_module": "@jupyter-widgets/base",
3215 | "_view_module_version": "1.2.0",
3216 | "_view_name": "LayoutView",
3217 | "align_content": null,
3218 | "align_items": null,
3219 | "align_self": null,
3220 | "border": null,
3221 | "bottom": null,
3222 | "display": null,
3223 | "flex": null,
3224 | "flex_flow": null,
3225 | "grid_area": null,
3226 | "grid_auto_columns": null,
3227 | "grid_auto_flow": null,
3228 | "grid_auto_rows": null,
3229 | "grid_column": null,
3230 | "grid_gap": null,
3231 | "grid_row": null,
3232 | "grid_template_areas": null,
3233 | "grid_template_columns": null,
3234 | "grid_template_rows": null,
3235 | "height": null,
3236 | "justify_content": null,
3237 | "justify_items": null,
3238 | "left": null,
3239 | "margin": null,
3240 | "max_height": null,
3241 | "max_width": null,
3242 | "min_height": null,
3243 | "min_width": null,
3244 | "object_fit": null,
3245 | "object_position": null,
3246 | "order": null,
3247 | "overflow": null,
3248 | "overflow_x": null,
3249 | "overflow_y": null,
3250 | "padding": null,
3251 | "right": null,
3252 | "top": null,
3253 | "visibility": null,
3254 | "width": null
3255 | }
3256 | },
3257 | "f3085f5a5d2d4e5baf91834a7aff0823": {
3258 | "model_module": "@jupyter-widgets/base",
3259 | "model_name": "LayoutModel",
3260 | "model_module_version": "1.2.0",
3261 | "state": {
3262 | "_model_module": "@jupyter-widgets/base",
3263 | "_model_module_version": "1.2.0",
3264 | "_model_name": "LayoutModel",
3265 | "_view_count": null,
3266 | "_view_module": "@jupyter-widgets/base",
3267 | "_view_module_version": "1.2.0",
3268 | "_view_name": "LayoutView",
3269 | "align_content": null,
3270 | "align_items": null,
3271 | "align_self": null,
3272 | "border": null,
3273 | "bottom": null,
3274 | "display": null,
3275 | "flex": null,
3276 | "flex_flow": null,
3277 | "grid_area": null,
3278 | "grid_auto_columns": null,
3279 | "grid_auto_flow": null,
3280 | "grid_auto_rows": null,
3281 | "grid_column": null,
3282 | "grid_gap": null,
3283 | "grid_row": null,
3284 | "grid_template_areas": null,
3285 | "grid_template_columns": null,
3286 | "grid_template_rows": null,
3287 | "height": null,
3288 | "justify_content": null,
3289 | "justify_items": null,
3290 | "left": null,
3291 | "margin": null,
3292 | "max_height": null,
3293 | "max_width": null,
3294 | "min_height": null,
3295 | "min_width": null,
3296 | "object_fit": null,
3297 | "object_position": null,
3298 | "order": null,
3299 | "overflow": null,
3300 | "overflow_x": null,
3301 | "overflow_y": null,
3302 | "padding": null,
3303 | "right": null,
3304 | "top": null,
3305 | "visibility": null,
3306 | "width": null
3307 | }
3308 | },
3309 | "beadea8e78804664bbfcbd95527fda43": {
3310 | "model_module": "@jupyter-widgets/controls",
3311 | "model_name": "DescriptionStyleModel",
3312 | "model_module_version": "1.5.0",
3313 | "state": {
3314 | "_model_module": "@jupyter-widgets/controls",
3315 | "_model_module_version": "1.5.0",
3316 | "_model_name": "DescriptionStyleModel",
3317 | "_view_count": null,
3318 | "_view_module": "@jupyter-widgets/base",
3319 | "_view_module_version": "1.2.0",
3320 | "_view_name": "StyleView",
3321 | "description_width": ""
3322 | }
3323 | },
3324 | "3d306af68c664ed0a366e85aaf0dfcdb": {
3325 | "model_module": "@jupyter-widgets/base",
3326 | "model_name": "LayoutModel",
3327 | "model_module_version": "1.2.0",
3328 | "state": {
3329 | "_model_module": "@jupyter-widgets/base",
3330 | "_model_module_version": "1.2.0",
3331 | "_model_name": "LayoutModel",
3332 | "_view_count": null,
3333 | "_view_module": "@jupyter-widgets/base",
3334 | "_view_module_version": "1.2.0",
3335 | "_view_name": "LayoutView",
3336 | "align_content": null,
3337 | "align_items": null,
3338 | "align_self": null,
3339 | "border": null,
3340 | "bottom": null,
3341 | "display": null,
3342 | "flex": null,
3343 | "flex_flow": null,
3344 | "grid_area": null,
3345 | "grid_auto_columns": null,
3346 | "grid_auto_flow": null,
3347 | "grid_auto_rows": null,
3348 | "grid_column": null,
3349 | "grid_gap": null,
3350 | "grid_row": null,
3351 | "grid_template_areas": null,
3352 | "grid_template_columns": null,
3353 | "grid_template_rows": null,
3354 | "height": null,
3355 | "justify_content": null,
3356 | "justify_items": null,
3357 | "left": null,
3358 | "margin": null,
3359 | "max_height": null,
3360 | "max_width": null,
3361 | "min_height": null,
3362 | "min_width": null,
3363 | "object_fit": null,
3364 | "object_position": null,
3365 | "order": null,
3366 | "overflow": null,
3367 | "overflow_x": null,
3368 | "overflow_y": null,
3369 | "padding": null,
3370 | "right": null,
3371 | "top": null,
3372 | "visibility": null,
3373 | "width": null
3374 | }
3375 | },
3376 | "083c8e025835406ab0d2e80e7f4a20a4": {
3377 | "model_module": "@jupyter-widgets/controls",
3378 | "model_name": "ProgressStyleModel",
3379 | "model_module_version": "1.5.0",
3380 | "state": {
3381 | "_model_module": "@jupyter-widgets/controls",
3382 | "_model_module_version": "1.5.0",
3383 | "_model_name": "ProgressStyleModel",
3384 | "_view_count": null,
3385 | "_view_module": "@jupyter-widgets/base",
3386 | "_view_module_version": "1.2.0",
3387 | "_view_name": "StyleView",
3388 | "bar_color": null,
3389 | "description_width": ""
3390 | }
3391 | },
3392 | "5e3d5a36afe548e58fbc97360205c798": {
3393 | "model_module": "@jupyter-widgets/base",
3394 | "model_name": "LayoutModel",
3395 | "model_module_version": "1.2.0",
3396 | "state": {
3397 | "_model_module": "@jupyter-widgets/base",
3398 | "_model_module_version": "1.2.0",
3399 | "_model_name": "LayoutModel",
3400 | "_view_count": null,
3401 | "_view_module": "@jupyter-widgets/base",
3402 | "_view_module_version": "1.2.0",
3403 | "_view_name": "LayoutView",
3404 | "align_content": null,
3405 | "align_items": null,
3406 | "align_self": null,
3407 | "border": null,
3408 | "bottom": null,
3409 | "display": null,
3410 | "flex": null,
3411 | "flex_flow": null,
3412 | "grid_area": null,
3413 | "grid_auto_columns": null,
3414 | "grid_auto_flow": null,
3415 | "grid_auto_rows": null,
3416 | "grid_column": null,
3417 | "grid_gap": null,
3418 | "grid_row": null,
3419 | "grid_template_areas": null,
3420 | "grid_template_columns": null,
3421 | "grid_template_rows": null,
3422 | "height": null,
3423 | "justify_content": null,
3424 | "justify_items": null,
3425 | "left": null,
3426 | "margin": null,
3427 | "max_height": null,
3428 | "max_width": null,
3429 | "min_height": null,
3430 | "min_width": null,
3431 | "object_fit": null,
3432 | "object_position": null,
3433 | "order": null,
3434 | "overflow": null,
3435 | "overflow_x": null,
3436 | "overflow_y": null,
3437 | "padding": null,
3438 | "right": null,
3439 | "top": null,
3440 | "visibility": null,
3441 | "width": null
3442 | }
3443 | },
3444 | "6ebb44e73cff4db28191370741b00fec": {
3445 | "model_module": "@jupyter-widgets/controls",
3446 | "model_name": "DescriptionStyleModel",
3447 | "model_module_version": "1.5.0",
3448 | "state": {
3449 | "_model_module": "@jupyter-widgets/controls",
3450 | "_model_module_version": "1.5.0",
3451 | "_model_name": "DescriptionStyleModel",
3452 | "_view_count": null,
3453 | "_view_module": "@jupyter-widgets/base",
3454 | "_view_module_version": "1.2.0",
3455 | "_view_name": "StyleView",
3456 | "description_width": ""
3457 | }
3458 | },
3459 | "e386f44e49a1465f941b8be9a0122fb1": {
3460 | "model_module": "@jupyter-widgets/controls",
3461 | "model_name": "HBoxModel",
3462 | "model_module_version": "1.5.0",
3463 | "state": {
3464 | "_dom_classes": [],
3465 | "_model_module": "@jupyter-widgets/controls",
3466 | "_model_module_version": "1.5.0",
3467 | "_model_name": "HBoxModel",
3468 | "_view_count": null,
3469 | "_view_module": "@jupyter-widgets/controls",
3470 | "_view_module_version": "1.5.0",
3471 | "_view_name": "HBoxView",
3472 | "box_style": "",
3473 | "children": [
3474 | "IPY_MODEL_464b2f8c754545f188e22354fa5656b2",
3475 | "IPY_MODEL_135ad0f12d9f4856b6db2525a2e7d6f2",
3476 | "IPY_MODEL_7b69d2e073c34debb8c1f67e4abe43e6"
3477 | ],
3478 | "layout": "IPY_MODEL_22f7305ccc934e288c3301f4c034949e"
3479 | }
3480 | },
3481 | "464b2f8c754545f188e22354fa5656b2": {
3482 | "model_module": "@jupyter-widgets/controls",
3483 | "model_name": "HTMLModel",
3484 | "model_module_version": "1.5.0",
3485 | "state": {
3486 | "_dom_classes": [],
3487 | "_model_module": "@jupyter-widgets/controls",
3488 | "_model_module_version": "1.5.0",
3489 | "_model_name": "HTMLModel",
3490 | "_view_count": null,
3491 | "_view_module": "@jupyter-widgets/controls",
3492 | "_view_module_version": "1.5.0",
3493 | "_view_name": "HTMLView",
3494 | "description": "",
3495 | "description_tooltip": null,
3496 | "layout": "IPY_MODEL_fc1baa2c97ce4ab1800174f7a8dc9f5e",
3497 | "placeholder": "",
3498 | "style": "IPY_MODEL_0805f99a9a23458c9ed5770a0aca463a",
3499 | "value": "Processing Files (1 / 1) : 100%"
3500 | }
3501 | },
3502 | "135ad0f12d9f4856b6db2525a2e7d6f2": {
3503 | "model_module": "@jupyter-widgets/controls",
3504 | "model_name": "FloatProgressModel",
3505 | "model_module_version": "1.5.0",
3506 | "state": {
3507 | "_dom_classes": [],
3508 | "_model_module": "@jupyter-widgets/controls",
3509 | "_model_module_version": "1.5.0",
3510 | "_model_name": "FloatProgressModel",
3511 | "_view_count": null,
3512 | "_view_module": "@jupyter-widgets/controls",
3513 | "_view_module_version": "1.5.0",
3514 | "_view_name": "ProgressView",
3515 | "bar_style": "success",
3516 | "description": "",
3517 | "description_tooltip": null,
3518 | "layout": "IPY_MODEL_08a60d97e888439cbca68cdfbbd91c72",
3519 | "max": 1,
3520 | "min": 0,
3521 | "orientation": "horizontal",
3522 | "style": "IPY_MODEL_5402917fe7b74add855ac97ed388c50b",
3523 | "value": 1
3524 | }
3525 | },
3526 | "7b69d2e073c34debb8c1f67e4abe43e6": {
3527 | "model_module": "@jupyter-widgets/controls",
3528 | "model_name": "HTMLModel",
3529 | "model_module_version": "1.5.0",
3530 | "state": {
3531 | "_dom_classes": [],
3532 | "_model_module": "@jupyter-widgets/controls",
3533 | "_model_module_version": "1.5.0",
3534 | "_model_name": "HTMLModel",
3535 | "_view_count": null,
3536 | "_view_module": "@jupyter-widgets/controls",
3537 | "_view_module_version": "1.5.0",
3538 | "_view_name": "HTMLView",
3539 | "description": "",
3540 | "description_tooltip": null,
3541 | "layout": "IPY_MODEL_6755033e680049f391bcba6f9c55762f",
3542 | "placeholder": "",
3543 | "style": "IPY_MODEL_6bd62bd1ff8345b28eb37636a8f6ea34",
3544 | "value": " 30.0kB / 30.0kB, 18.8kB/s "
3545 | }
3546 | },
3547 | "22f7305ccc934e288c3301f4c034949e": {
3548 | "model_module": "@jupyter-widgets/base",
3549 | "model_name": "LayoutModel",
3550 | "model_module_version": "1.2.0",
3551 | "state": {
3552 | "_model_module": "@jupyter-widgets/base",
3553 | "_model_module_version": "1.2.0",
3554 | "_model_name": "LayoutModel",
3555 | "_view_count": null,
3556 | "_view_module": "@jupyter-widgets/base",
3557 | "_view_module_version": "1.2.0",
3558 | "_view_name": "LayoutView",
3559 | "align_content": null,
3560 | "align_items": null,
3561 | "align_self": null,
3562 | "border": null,
3563 | "bottom": null,
3564 | "display": null,
3565 | "flex": null,
3566 | "flex_flow": null,
3567 | "grid_area": null,
3568 | "grid_auto_columns": null,
3569 | "grid_auto_flow": null,
3570 | "grid_auto_rows": null,
3571 | "grid_column": null,
3572 | "grid_gap": null,
3573 | "grid_row": null,
3574 | "grid_template_areas": null,
3575 | "grid_template_columns": null,
3576 | "grid_template_rows": null,
3577 | "height": null,
3578 | "justify_content": null,
3579 | "justify_items": null,
3580 | "left": null,
3581 | "margin": null,
3582 | "max_height": null,
3583 | "max_width": null,
3584 | "min_height": null,
3585 | "min_width": null,
3586 | "object_fit": null,
3587 | "object_position": null,
3588 | "order": null,
3589 | "overflow": null,
3590 | "overflow_x": null,
3591 | "overflow_y": null,
3592 | "padding": null,
3593 | "right": null,
3594 | "top": null,
3595 | "visibility": null,
3596 | "width": null
3597 | }
3598 | },
3599 | "fc1baa2c97ce4ab1800174f7a8dc9f5e": {
3600 | "model_module": "@jupyter-widgets/base",
3601 | "model_name": "LayoutModel",
3602 | "model_module_version": "1.2.0",
3603 | "state": {
3604 | "_model_module": "@jupyter-widgets/base",
3605 | "_model_module_version": "1.2.0",
3606 | "_model_name": "LayoutModel",
3607 | "_view_count": null,
3608 | "_view_module": "@jupyter-widgets/base",
3609 | "_view_module_version": "1.2.0",
3610 | "_view_name": "LayoutView",
3611 | "align_content": null,
3612 | "align_items": null,
3613 | "align_self": null,
3614 | "border": null,
3615 | "bottom": null,
3616 | "display": null,
3617 | "flex": null,
3618 | "flex_flow": null,
3619 | "grid_area": null,
3620 | "grid_auto_columns": null,
3621 | "grid_auto_flow": null,
3622 | "grid_auto_rows": null,
3623 | "grid_column": null,
3624 | "grid_gap": null,
3625 | "grid_row": null,
3626 | "grid_template_areas": null,
3627 | "grid_template_columns": null,
3628 | "grid_template_rows": null,
3629 | "height": null,
3630 | "justify_content": null,
3631 | "justify_items": null,
3632 | "left": null,
3633 | "margin": null,
3634 | "max_height": null,
3635 | "max_width": null,
3636 | "min_height": null,
3637 | "min_width": null,
3638 | "object_fit": null,
3639 | "object_position": null,
3640 | "order": null,
3641 | "overflow": null,
3642 | "overflow_x": null,
3643 | "overflow_y": null,
3644 | "padding": null,
3645 | "right": null,
3646 | "top": null,
3647 | "visibility": null,
3648 | "width": null
3649 | }
3650 | },
3651 | "0805f99a9a23458c9ed5770a0aca463a": {
3652 | "model_module": "@jupyter-widgets/controls",
3653 | "model_name": "DescriptionStyleModel",
3654 | "model_module_version": "1.5.0",
3655 | "state": {
3656 | "_model_module": "@jupyter-widgets/controls",
3657 | "_model_module_version": "1.5.0",
3658 | "_model_name": "DescriptionStyleModel",
3659 | "_view_count": null,
3660 | "_view_module": "@jupyter-widgets/base",
3661 | "_view_module_version": "1.2.0",
3662 | "_view_name": "StyleView",
3663 | "description_width": ""
3664 | }
3665 | },
3666 | "08a60d97e888439cbca68cdfbbd91c72": {
3667 | "model_module": "@jupyter-widgets/base",
3668 | "model_name": "LayoutModel",
3669 | "model_module_version": "1.2.0",
3670 | "state": {
3671 | "_model_module": "@jupyter-widgets/base",
3672 | "_model_module_version": "1.2.0",
3673 | "_model_name": "LayoutModel",
3674 | "_view_count": null,
3675 | "_view_module": "@jupyter-widgets/base",
3676 | "_view_module_version": "1.2.0",
3677 | "_view_name": "LayoutView",
3678 | "align_content": null,
3679 | "align_items": null,
3680 | "align_self": null,
3681 | "border": null,
3682 | "bottom": null,
3683 | "display": null,
3684 | "flex": null,
3685 | "flex_flow": null,
3686 | "grid_area": null,
3687 | "grid_auto_columns": null,
3688 | "grid_auto_flow": null,
3689 | "grid_auto_rows": null,
3690 | "grid_column": null,
3691 | "grid_gap": null,
3692 | "grid_row": null,
3693 | "grid_template_areas": null,
3694 | "grid_template_columns": null,
3695 | "grid_template_rows": null,
3696 | "height": null,
3697 | "justify_content": null,
3698 | "justify_items": null,
3699 | "left": null,
3700 | "margin": null,
3701 | "max_height": null,
3702 | "max_width": null,
3703 | "min_height": null,
3704 | "min_width": null,
3705 | "object_fit": null,
3706 | "object_position": null,
3707 | "order": null,
3708 | "overflow": null,
3709 | "overflow_x": null,
3710 | "overflow_y": null,
3711 | "padding": null,
3712 | "right": null,
3713 | "top": null,
3714 | "visibility": null,
3715 | "width": "20px"
3716 | }
3717 | },
3718 | "5402917fe7b74add855ac97ed388c50b": {
3719 | "model_module": "@jupyter-widgets/controls",
3720 | "model_name": "ProgressStyleModel",
3721 | "model_module_version": "1.5.0",
3722 | "state": {
3723 | "_model_module": "@jupyter-widgets/controls",
3724 | "_model_module_version": "1.5.0",
3725 | "_model_name": "ProgressStyleModel",
3726 | "_view_count": null,
3727 | "_view_module": "@jupyter-widgets/base",
3728 | "_view_module_version": "1.2.0",
3729 | "_view_name": "StyleView",
3730 | "bar_color": null,
3731 | "description_width": ""
3732 | }
3733 | },
3734 | "6755033e680049f391bcba6f9c55762f": {
3735 | "model_module": "@jupyter-widgets/base",
3736 | "model_name": "LayoutModel",
3737 | "model_module_version": "1.2.0",
3738 | "state": {
3739 | "_model_module": "@jupyter-widgets/base",
3740 | "_model_module_version": "1.2.0",
3741 | "_model_name": "LayoutModel",
3742 | "_view_count": null,
3743 | "_view_module": "@jupyter-widgets/base",
3744 | "_view_module_version": "1.2.0",
3745 | "_view_name": "LayoutView",
3746 | "align_content": null,
3747 | "align_items": null,
3748 | "align_self": null,
3749 | "border": null,
3750 | "bottom": null,
3751 | "display": null,
3752 | "flex": null,
3753 | "flex_flow": null,
3754 | "grid_area": null,
3755 | "grid_auto_columns": null,
3756 | "grid_auto_flow": null,
3757 | "grid_auto_rows": null,
3758 | "grid_column": null,
3759 | "grid_gap": null,
3760 | "grid_row": null,
3761 | "grid_template_areas": null,
3762 | "grid_template_columns": null,
3763 | "grid_template_rows": null,
3764 | "height": null,
3765 | "justify_content": null,
3766 | "justify_items": null,
3767 | "left": null,
3768 | "margin": null,
3769 | "max_height": null,
3770 | "max_width": null,
3771 | "min_height": null,
3772 | "min_width": null,
3773 | "object_fit": null,
3774 | "object_position": null,
3775 | "order": null,
3776 | "overflow": null,
3777 | "overflow_x": null,
3778 | "overflow_y": null,
3779 | "padding": null,
3780 | "right": null,
3781 | "top": null,
3782 | "visibility": null,
3783 | "width": null
3784 | }
3785 | },
3786 | "6bd62bd1ff8345b28eb37636a8f6ea34": {
3787 | "model_module": "@jupyter-widgets/controls",
3788 | "model_name": "DescriptionStyleModel",
3789 | "model_module_version": "1.5.0",
3790 | "state": {
3791 | "_model_module": "@jupyter-widgets/controls",
3792 | "_model_module_version": "1.5.0",
3793 | "_model_name": "DescriptionStyleModel",
3794 | "_view_count": null,
3795 | "_view_module": "@jupyter-widgets/base",
3796 | "_view_module_version": "1.2.0",
3797 | "_view_name": "StyleView",
3798 | "description_width": ""
3799 | }
3800 | },
3801 | "64eb477930fc41b99959d645f48715d3": {
3802 | "model_module": "@jupyter-widgets/controls",
3803 | "model_name": "HBoxModel",
3804 | "model_module_version": "1.5.0",
3805 | "state": {
3806 | "_dom_classes": [],
3807 | "_model_module": "@jupyter-widgets/controls",
3808 | "_model_module_version": "1.5.0",
3809 | "_model_name": "HBoxModel",
3810 | "_view_count": null,
3811 | "_view_module": "@jupyter-widgets/controls",
3812 | "_view_module_version": "1.5.0",
3813 | "_view_name": "HBoxView",
3814 | "box_style": "",
3815 | "children": [
3816 | "IPY_MODEL_64a124094c47471096d06036005390a2",
3817 | "IPY_MODEL_b6a380f0e58443f7a559db24976bc31f",
3818 | "IPY_MODEL_a17a2e82f2844e25afc7f290581e7f2d"
3819 | ],
3820 | "layout": "IPY_MODEL_23b66fac5d1c42768f37aedd1abca1aa"
3821 | }
3822 | },
3823 | "64a124094c47471096d06036005390a2": {
3824 | "model_module": "@jupyter-widgets/controls",
3825 | "model_name": "HTMLModel",
3826 | "model_module_version": "1.5.0",
3827 | "state": {
3828 | "_dom_classes": [],
3829 | "_model_module": "@jupyter-widgets/controls",
3830 | "_model_module_version": "1.5.0",
3831 | "_model_name": "HTMLModel",
3832 | "_view_count": null,
3833 | "_view_module": "@jupyter-widgets/controls",
3834 | "_view_module_version": "1.5.0",
3835 | "_view_name": "HTMLView",
3836 | "description": "",
3837 | "description_tooltip": null,
3838 | "layout": "IPY_MODEL_b0bcd57df2404c76afc84208bdaf62f4",
3839 | "placeholder": "",
3840 | "style": "IPY_MODEL_1714d8edf85549c193d6143aa0eac916",
3841 | "value": "New Data Upload : 100%"
3842 | }
3843 | },
3844 | "b6a380f0e58443f7a559db24976bc31f": {
3845 | "model_module": "@jupyter-widgets/controls",
3846 | "model_name": "FloatProgressModel",
3847 | "model_module_version": "1.5.0",
3848 | "state": {
3849 | "_dom_classes": [],
3850 | "_model_module": "@jupyter-widgets/controls",
3851 | "_model_module_version": "1.5.0",
3852 | "_model_name": "FloatProgressModel",
3853 | "_view_count": null,
3854 | "_view_module": "@jupyter-widgets/controls",
3855 | "_view_module_version": "1.5.0",
3856 | "_view_name": "ProgressView",
3857 | "bar_style": "success",
3858 | "description": "",
3859 | "description_tooltip": null,
3860 | "layout": "IPY_MODEL_858128db76f743b795e8a70a7e6a4e81",
3861 | "max": 1,
3862 | "min": 0,
3863 | "orientation": "horizontal",
3864 | "style": "IPY_MODEL_2d0a60a2d2b6456897c20274102de722",
3865 | "value": 1
3866 | }
3867 | },
3868 | "a17a2e82f2844e25afc7f290581e7f2d": {
3869 | "model_module": "@jupyter-widgets/controls",
3870 | "model_name": "HTMLModel",
3871 | "model_module_version": "1.5.0",
3872 | "state": {
3873 | "_dom_classes": [],
3874 | "_model_module": "@jupyter-widgets/controls",
3875 | "_model_module_version": "1.5.0",
3876 | "_model_name": "HTMLModel",
3877 | "_view_count": null,
3878 | "_view_module": "@jupyter-widgets/controls",
3879 | "_view_module_version": "1.5.0",
3880 | "_view_name": "HTMLView",
3881 | "description": "",
3882 | "description_tooltip": null,
3883 | "layout": "IPY_MODEL_e2c52bf417854b0284df7bd2eafeae18",
3884 | "placeholder": "",
3885 | "style": "IPY_MODEL_805883b350a84f3a83da12eadcc8dfa2",
3886 | "value": " 30.0kB / 30.0kB, 18.8kB/s "
3887 | }
3888 | },
3889 | "23b66fac5d1c42768f37aedd1abca1aa": {
3890 | "model_module": "@jupyter-widgets/base",
3891 | "model_name": "LayoutModel",
3892 | "model_module_version": "1.2.0",
3893 | "state": {
3894 | "_model_module": "@jupyter-widgets/base",
3895 | "_model_module_version": "1.2.0",
3896 | "_model_name": "LayoutModel",
3897 | "_view_count": null,
3898 | "_view_module": "@jupyter-widgets/base",
3899 | "_view_module_version": "1.2.0",
3900 | "_view_name": "LayoutView",
3901 | "align_content": null,
3902 | "align_items": null,
3903 | "align_self": null,
3904 | "border": null,
3905 | "bottom": null,
3906 | "display": null,
3907 | "flex": null,
3908 | "flex_flow": null,
3909 | "grid_area": null,
3910 | "grid_auto_columns": null,
3911 | "grid_auto_flow": null,
3912 | "grid_auto_rows": null,
3913 | "grid_column": null,
3914 | "grid_gap": null,
3915 | "grid_row": null,
3916 | "grid_template_areas": null,
3917 | "grid_template_columns": null,
3918 | "grid_template_rows": null,
3919 | "height": null,
3920 | "justify_content": null,
3921 | "justify_items": null,
3922 | "left": null,
3923 | "margin": null,
3924 | "max_height": null,
3925 | "max_width": null,
3926 | "min_height": null,
3927 | "min_width": null,
3928 | "object_fit": null,
3929 | "object_position": null,
3930 | "order": null,
3931 | "overflow": null,
3932 | "overflow_x": null,
3933 | "overflow_y": null,
3934 | "padding": null,
3935 | "right": null,
3936 | "top": null,
3937 | "visibility": null,
3938 | "width": null
3939 | }
3940 | },
3941 | "b0bcd57df2404c76afc84208bdaf62f4": {
3942 | "model_module": "@jupyter-widgets/base",
3943 | "model_name": "LayoutModel",
3944 | "model_module_version": "1.2.0",
3945 | "state": {
3946 | "_model_module": "@jupyter-widgets/base",
3947 | "_model_module_version": "1.2.0",
3948 | "_model_name": "LayoutModel",
3949 | "_view_count": null,
3950 | "_view_module": "@jupyter-widgets/base",
3951 | "_view_module_version": "1.2.0",
3952 | "_view_name": "LayoutView",
3953 | "align_content": null,
3954 | "align_items": null,
3955 | "align_self": null,
3956 | "border": null,
3957 | "bottom": null,
3958 | "display": null,
3959 | "flex": null,
3960 | "flex_flow": null,
3961 | "grid_area": null,
3962 | "grid_auto_columns": null,
3963 | "grid_auto_flow": null,
3964 | "grid_auto_rows": null,
3965 | "grid_column": null,
3966 | "grid_gap": null,
3967 | "grid_row": null,
3968 | "grid_template_areas": null,
3969 | "grid_template_columns": null,
3970 | "grid_template_rows": null,
3971 | "height": null,
3972 | "justify_content": null,
3973 | "justify_items": null,
3974 | "left": null,
3975 | "margin": null,
3976 | "max_height": null,
3977 | "max_width": null,
3978 | "min_height": null,
3979 | "min_width": null,
3980 | "object_fit": null,
3981 | "object_position": null,
3982 | "order": null,
3983 | "overflow": null,
3984 | "overflow_x": null,
3985 | "overflow_y": null,
3986 | "padding": null,
3987 | "right": null,
3988 | "top": null,
3989 | "visibility": null,
3990 | "width": null
3991 | }
3992 | },
3993 | "1714d8edf85549c193d6143aa0eac916": {
3994 | "model_module": "@jupyter-widgets/controls",
3995 | "model_name": "DescriptionStyleModel",
3996 | "model_module_version": "1.5.0",
3997 | "state": {
3998 | "_model_module": "@jupyter-widgets/controls",
3999 | "_model_module_version": "1.5.0",
4000 | "_model_name": "DescriptionStyleModel",
4001 | "_view_count": null,
4002 | "_view_module": "@jupyter-widgets/base",
4003 | "_view_module_version": "1.2.0",
4004 | "_view_name": "StyleView",
4005 | "description_width": ""
4006 | }
4007 | },
4008 | "858128db76f743b795e8a70a7e6a4e81": {
4009 | "model_module": "@jupyter-widgets/base",
4010 | "model_name": "LayoutModel",
4011 | "model_module_version": "1.2.0",
4012 | "state": {
4013 | "_model_module": "@jupyter-widgets/base",
4014 | "_model_module_version": "1.2.0",
4015 | "_model_name": "LayoutModel",
4016 | "_view_count": null,
4017 | "_view_module": "@jupyter-widgets/base",
4018 | "_view_module_version": "1.2.0",
4019 | "_view_name": "LayoutView",
4020 | "align_content": null,
4021 | "align_items": null,
4022 | "align_self": null,
4023 | "border": null,
4024 | "bottom": null,
4025 | "display": null,
4026 | "flex": null,
4027 | "flex_flow": null,
4028 | "grid_area": null,
4029 | "grid_auto_columns": null,
4030 | "grid_auto_flow": null,
4031 | "grid_auto_rows": null,
4032 | "grid_column": null,
4033 | "grid_gap": null,
4034 | "grid_row": null,
4035 | "grid_template_areas": null,
4036 | "grid_template_columns": null,
4037 | "grid_template_rows": null,
4038 | "height": null,
4039 | "justify_content": null,
4040 | "justify_items": null,
4041 | "left": null,
4042 | "margin": null,
4043 | "max_height": null,
4044 | "max_width": null,
4045 | "min_height": null,
4046 | "min_width": null,
4047 | "object_fit": null,
4048 | "object_position": null,
4049 | "order": null,
4050 | "overflow": null,
4051 | "overflow_x": null,
4052 | "overflow_y": null,
4053 | "padding": null,
4054 | "right": null,
4055 | "top": null,
4056 | "visibility": null,
4057 | "width": "20px"
4058 | }
4059 | },
4060 | "2d0a60a2d2b6456897c20274102de722": {
4061 | "model_module": "@jupyter-widgets/controls",
4062 | "model_name": "ProgressStyleModel",
4063 | "model_module_version": "1.5.0",
4064 | "state": {
4065 | "_model_module": "@jupyter-widgets/controls",
4066 | "_model_module_version": "1.5.0",
4067 | "_model_name": "ProgressStyleModel",
4068 | "_view_count": null,
4069 | "_view_module": "@jupyter-widgets/base",
4070 | "_view_module_version": "1.2.0",
4071 | "_view_name": "StyleView",
4072 | "bar_color": null,
4073 | "description_width": ""
4074 | }
4075 | },
4076 | "e2c52bf417854b0284df7bd2eafeae18": {
4077 | "model_module": "@jupyter-widgets/base",
4078 | "model_name": "LayoutModel",
4079 | "model_module_version": "1.2.0",
4080 | "state": {
4081 | "_model_module": "@jupyter-widgets/base",
4082 | "_model_module_version": "1.2.0",
4083 | "_model_name": "LayoutModel",
4084 | "_view_count": null,
4085 | "_view_module": "@jupyter-widgets/base",
4086 | "_view_module_version": "1.2.0",
4087 | "_view_name": "LayoutView",
4088 | "align_content": null,
4089 | "align_items": null,
4090 | "align_self": null,
4091 | "border": null,
4092 | "bottom": null,
4093 | "display": null,
4094 | "flex": null,
4095 | "flex_flow": null,
4096 | "grid_area": null,
4097 | "grid_auto_columns": null,
4098 | "grid_auto_flow": null,
4099 | "grid_auto_rows": null,
4100 | "grid_column": null,
4101 | "grid_gap": null,
4102 | "grid_row": null,
4103 | "grid_template_areas": null,
4104 | "grid_template_columns": null,
4105 | "grid_template_rows": null,
4106 | "height": null,
4107 | "justify_content": null,
4108 | "justify_items": null,
4109 | "left": null,
4110 | "margin": null,
4111 | "max_height": null,
4112 | "max_width": null,
4113 | "min_height": null,
4114 | "min_width": null,
4115 | "object_fit": null,
4116 | "object_position": null,
4117 | "order": null,
4118 | "overflow": null,
4119 | "overflow_x": null,
4120 | "overflow_y": null,
4121 | "padding": null,
4122 | "right": null,
4123 | "top": null,
4124 | "visibility": null,
4125 | "width": null
4126 | }
4127 | },
4128 | "805883b350a84f3a83da12eadcc8dfa2": {
4129 | "model_module": "@jupyter-widgets/controls",
4130 | "model_name": "DescriptionStyleModel",
4131 | "model_module_version": "1.5.0",
4132 | "state": {
4133 | "_model_module": "@jupyter-widgets/controls",
4134 | "_model_module_version": "1.5.0",
4135 | "_model_name": "DescriptionStyleModel",
4136 | "_view_count": null,
4137 | "_view_module": "@jupyter-widgets/base",
4138 | "_view_module_version": "1.2.0",
4139 | "_view_name": "StyleView",
4140 | "description_width": ""
4141 | }
4142 | },
4143 | "b1d541f65026459e87ec6c8b2885d5e1": {
4144 | "model_module": "@jupyter-widgets/controls",
4145 | "model_name": "HBoxModel",
4146 | "model_module_version": "1.5.0",
4147 | "state": {
4148 | "_dom_classes": [],
4149 | "_model_module": "@jupyter-widgets/controls",
4150 | "_model_module_version": "1.5.0",
4151 | "_model_name": "HBoxModel",
4152 | "_view_count": null,
4153 | "_view_module": "@jupyter-widgets/controls",
4154 | "_view_module_version": "1.5.0",
4155 | "_view_name": "HBoxView",
4156 | "box_style": "",
4157 | "children": [
4158 | "IPY_MODEL_de1084195ecf48609a52faf367a1c774",
4159 | "IPY_MODEL_46791d145bcd494b955dba5dc04a7ffe",
4160 | "IPY_MODEL_2eb0868140ef4a159f24a391a7a81175"
4161 | ],
4162 | "layout": "IPY_MODEL_4cf10360bda24346b064cee196b4d9ca"
4163 | }
4164 | },
4165 | "de1084195ecf48609a52faf367a1c774": {
4166 | "model_module": "@jupyter-widgets/controls",
4167 | "model_name": "HTMLModel",
4168 | "model_module_version": "1.5.0",
4169 | "state": {
4170 | "_dom_classes": [],
4171 | "_model_module": "@jupyter-widgets/controls",
4172 | "_model_module_version": "1.5.0",
4173 | "_model_name": "HTMLModel",
4174 | "_view_count": null,
4175 | "_view_module": "@jupyter-widgets/controls",
4176 | "_view_module_version": "1.5.0",
4177 | "_view_name": "HTMLView",
4178 | "description": "",
4179 | "description_tooltip": null,
4180 | "layout": "IPY_MODEL_8e5b7a693c214004abd0e83ed176bfb9",
4181 | "placeholder": "",
4182 | "style": "IPY_MODEL_8e8589eb7b56495fb5d81ebcb16bb0bc",
4183 | "value": " outputs/train.parquet : 100%"
4184 | }
4185 | },
4186 | "46791d145bcd494b955dba5dc04a7ffe": {
4187 | "model_module": "@jupyter-widgets/controls",
4188 | "model_name": "FloatProgressModel",
4189 | "model_module_version": "1.5.0",
4190 | "state": {
4191 | "_dom_classes": [],
4192 | "_model_module": "@jupyter-widgets/controls",
4193 | "_model_module_version": "1.5.0",
4194 | "_model_name": "FloatProgressModel",
4195 | "_view_count": null,
4196 | "_view_module": "@jupyter-widgets/controls",
4197 | "_view_module_version": "1.5.0",
4198 | "_view_name": "ProgressView",
4199 | "bar_style": "success",
4200 | "description": "",
4201 | "description_tooltip": null,
4202 | "layout": "IPY_MODEL_1768820f1f9f4fe1b190c89c1c849778",
4203 | "max": 30041,
4204 | "min": 0,
4205 | "orientation": "horizontal",
4206 | "style": "IPY_MODEL_f91c6242eac9427bbd7b6b6ef0cf7f45",
4207 | "value": 30041
4208 | }
4209 | },
4210 | "2eb0868140ef4a159f24a391a7a81175": {
4211 | "model_module": "@jupyter-widgets/controls",
4212 | "model_name": "HTMLModel",
4213 | "model_module_version": "1.5.0",
4214 | "state": {
4215 | "_dom_classes": [],
4216 | "_model_module": "@jupyter-widgets/controls",
4217 | "_model_module_version": "1.5.0",
4218 | "_model_name": "HTMLModel",
4219 | "_view_count": null,
4220 | "_view_module": "@jupyter-widgets/controls",
4221 | "_view_module_version": "1.5.0",
4222 | "_view_name": "HTMLView",
4223 | "description": "",
4224 | "description_tooltip": null,
4225 | "layout": "IPY_MODEL_26a22e17c52549c4b30aa5ae22451467",
4226 | "placeholder": "",
4227 | "style": "IPY_MODEL_484c25d6506a4b7891c9090e8ae7b444",
4228 | "value": " 30.0kB / 30.0kB "
4229 | }
4230 | },
4231 | "4cf10360bda24346b064cee196b4d9ca": {
4232 | "model_module": "@jupyter-widgets/base",
4233 | "model_name": "LayoutModel",
4234 | "model_module_version": "1.2.0",
4235 | "state": {
4236 | "_model_module": "@jupyter-widgets/base",
4237 | "_model_module_version": "1.2.0",
4238 | "_model_name": "LayoutModel",
4239 | "_view_count": null,
4240 | "_view_module": "@jupyter-widgets/base",
4241 | "_view_module_version": "1.2.0",
4242 | "_view_name": "LayoutView",
4243 | "align_content": null,
4244 | "align_items": null,
4245 | "align_self": null,
4246 | "border": null,
4247 | "bottom": null,
4248 | "display": null,
4249 | "flex": null,
4250 | "flex_flow": null,
4251 | "grid_area": null,
4252 | "grid_auto_columns": null,
4253 | "grid_auto_flow": null,
4254 | "grid_auto_rows": null,
4255 | "grid_column": null,
4256 | "grid_gap": null,
4257 | "grid_row": null,
4258 | "grid_template_areas": null,
4259 | "grid_template_columns": null,
4260 | "grid_template_rows": null,
4261 | "height": null,
4262 | "justify_content": null,
4263 | "justify_items": null,
4264 | "left": null,
4265 | "margin": null,
4266 | "max_height": null,
4267 | "max_width": null,
4268 | "min_height": null,
4269 | "min_width": null,
4270 | "object_fit": null,
4271 | "object_position": null,
4272 | "order": null,
4273 | "overflow": null,
4274 | "overflow_x": null,
4275 | "overflow_y": null,
4276 | "padding": null,
4277 | "right": null,
4278 | "top": null,
4279 | "visibility": null,
4280 | "width": null
4281 | }
4282 | },
4283 | "8e5b7a693c214004abd0e83ed176bfb9": {
4284 | "model_module": "@jupyter-widgets/base",
4285 | "model_name": "LayoutModel",
4286 | "model_module_version": "1.2.0",
4287 | "state": {
4288 | "_model_module": "@jupyter-widgets/base",
4289 | "_model_module_version": "1.2.0",
4290 | "_model_name": "LayoutModel",
4291 | "_view_count": null,
4292 | "_view_module": "@jupyter-widgets/base",
4293 | "_view_module_version": "1.2.0",
4294 | "_view_name": "LayoutView",
4295 | "align_content": null,
4296 | "align_items": null,
4297 | "align_self": null,
4298 | "border": null,
4299 | "bottom": null,
4300 | "display": null,
4301 | "flex": null,
4302 | "flex_flow": null,
4303 | "grid_area": null,
4304 | "grid_auto_columns": null,
4305 | "grid_auto_flow": null,
4306 | "grid_auto_rows": null,
4307 | "grid_column": null,
4308 | "grid_gap": null,
4309 | "grid_row": null,
4310 | "grid_template_areas": null,
4311 | "grid_template_columns": null,
4312 | "grid_template_rows": null,
4313 | "height": null,
4314 | "justify_content": null,
4315 | "justify_items": null,
4316 | "left": null,
4317 | "margin": null,
4318 | "max_height": null,
4319 | "max_width": null,
4320 | "min_height": null,
4321 | "min_width": null,
4322 | "object_fit": null,
4323 | "object_position": null,
4324 | "order": null,
4325 | "overflow": null,
4326 | "overflow_x": null,
4327 | "overflow_y": null,
4328 | "padding": null,
4329 | "right": null,
4330 | "top": null,
4331 | "visibility": null,
4332 | "width": null
4333 | }
4334 | },
4335 | "8e8589eb7b56495fb5d81ebcb16bb0bc": {
4336 | "model_module": "@jupyter-widgets/controls",
4337 | "model_name": "DescriptionStyleModel",
4338 | "model_module_version": "1.5.0",
4339 | "state": {
4340 | "_model_module": "@jupyter-widgets/controls",
4341 | "_model_module_version": "1.5.0",
4342 | "_model_name": "DescriptionStyleModel",
4343 | "_view_count": null,
4344 | "_view_module": "@jupyter-widgets/base",
4345 | "_view_module_version": "1.2.0",
4346 | "_view_name": "StyleView",
4347 | "description_width": ""
4348 | }
4349 | },
4350 | "1768820f1f9f4fe1b190c89c1c849778": {
4351 | "model_module": "@jupyter-widgets/base",
4352 | "model_name": "LayoutModel",
4353 | "model_module_version": "1.2.0",
4354 | "state": {
4355 | "_model_module": "@jupyter-widgets/base",
4356 | "_model_module_version": "1.2.0",
4357 | "_model_name": "LayoutModel",
4358 | "_view_count": null,
4359 | "_view_module": "@jupyter-widgets/base",
4360 | "_view_module_version": "1.2.0",
4361 | "_view_name": "LayoutView",
4362 | "align_content": null,
4363 | "align_items": null,
4364 | "align_self": null,
4365 | "border": null,
4366 | "bottom": null,
4367 | "display": null,
4368 | "flex": null,
4369 | "flex_flow": null,
4370 | "grid_area": null,
4371 | "grid_auto_columns": null,
4372 | "grid_auto_flow": null,
4373 | "grid_auto_rows": null,
4374 | "grid_column": null,
4375 | "grid_gap": null,
4376 | "grid_row": null,
4377 | "grid_template_areas": null,
4378 | "grid_template_columns": null,
4379 | "grid_template_rows": null,
4380 | "height": null,
4381 | "justify_content": null,
4382 | "justify_items": null,
4383 | "left": null,
4384 | "margin": null,
4385 | "max_height": null,
4386 | "max_width": null,
4387 | "min_height": null,
4388 | "min_width": null,
4389 | "object_fit": null,
4390 | "object_position": null,
4391 | "order": null,
4392 | "overflow": null,
4393 | "overflow_x": null,
4394 | "overflow_y": null,
4395 | "padding": null,
4396 | "right": null,
4397 | "top": null,
4398 | "visibility": null,
4399 | "width": null
4400 | }
4401 | },
4402 | "f91c6242eac9427bbd7b6b6ef0cf7f45": {
4403 | "model_module": "@jupyter-widgets/controls",
4404 | "model_name": "ProgressStyleModel",
4405 | "model_module_version": "1.5.0",
4406 | "state": {
4407 | "_model_module": "@jupyter-widgets/controls",
4408 | "_model_module_version": "1.5.0",
4409 | "_model_name": "ProgressStyleModel",
4410 | "_view_count": null,
4411 | "_view_module": "@jupyter-widgets/base",
4412 | "_view_module_version": "1.2.0",
4413 | "_view_name": "StyleView",
4414 | "bar_color": null,
4415 | "description_width": ""
4416 | }
4417 | },
4418 | "26a22e17c52549c4b30aa5ae22451467": {
4419 | "model_module": "@jupyter-widgets/base",
4420 | "model_name": "LayoutModel",
4421 | "model_module_version": "1.2.0",
4422 | "state": {
4423 | "_model_module": "@jupyter-widgets/base",
4424 | "_model_module_version": "1.2.0",
4425 | "_model_name": "LayoutModel",
4426 | "_view_count": null,
4427 | "_view_module": "@jupyter-widgets/base",
4428 | "_view_module_version": "1.2.0",
4429 | "_view_name": "LayoutView",
4430 | "align_content": null,
4431 | "align_items": null,
4432 | "align_self": null,
4433 | "border": null,
4434 | "bottom": null,
4435 | "display": null,
4436 | "flex": null,
4437 | "flex_flow": null,
4438 | "grid_area": null,
4439 | "grid_auto_columns": null,
4440 | "grid_auto_flow": null,
4441 | "grid_auto_rows": null,
4442 | "grid_column": null,
4443 | "grid_gap": null,
4444 | "grid_row": null,
4445 | "grid_template_areas": null,
4446 | "grid_template_columns": null,
4447 | "grid_template_rows": null,
4448 | "height": null,
4449 | "justify_content": null,
4450 | "justify_items": null,
4451 | "left": null,
4452 | "margin": null,
4453 | "max_height": null,
4454 | "max_width": null,
4455 | "min_height": null,
4456 | "min_width": null,
4457 | "object_fit": null,
4458 | "object_position": null,
4459 | "order": null,
4460 | "overflow": null,
4461 | "overflow_x": null,
4462 | "overflow_y": null,
4463 | "padding": null,
4464 | "right": null,
4465 | "top": null,
4466 | "visibility": null,
4467 | "width": null
4468 | }
4469 | },
4470 | "484c25d6506a4b7891c9090e8ae7b444": {
4471 | "model_module": "@jupyter-widgets/controls",
4472 | "model_name": "DescriptionStyleModel",
4473 | "model_module_version": "1.5.0",
4474 | "state": {
4475 | "_model_module": "@jupyter-widgets/controls",
4476 | "_model_module_version": "1.5.0",
4477 | "_model_name": "DescriptionStyleModel",
4478 | "_view_count": null,
4479 | "_view_module": "@jupyter-widgets/base",
4480 | "_view_module_version": "1.2.0",
4481 | "_view_name": "StyleView",
4482 | "description_width": ""
4483 | }
4484 | }
4485 | }
4486 | }
4487 | },
4488 | "nbformat": 4,
4489 | "nbformat_minor": 5
4490 | }
--------------------------------------------------------------------------------