└── README.md

/README.md:
--------------------------------------------------------------------------------
# Collection of large datasets for Vietnamese LLM Finetuning

These datasets were created or translated into Vietnamese by our team to support various natural language processing tasks. We hope they are helpful!

### Table of Contents

## Image-to-Text Datasets

- [Vietnamese-ShareGPT4V](https://huggingface.co/datasets/5CD-AI/Vietnamese-Lin-Chen-ShareGPT4V-gg-translated)
  - Tasks: Visual Question Answering, Question Answering
  - Languages: Vietnamese, English
  - Size: 102k rows
  - Description: This dataset comprises questions and answers, translated into Vietnamese, suitable for visual question answering and general question answering tasks.

- [Vietnamese-LLaVA-Instruct-150K-gg-translated](https://huggingface.co/datasets/5CD-AI/Vietnamese-LLaVA-Instruct-150K-gg-translated)
  - Tasks: Visual Question Answering, Question Answering
  - Languages: Vietnamese, English
  - Size: 150k rows
  - Description: This dataset comprises questions and answers, translated into Vietnamese, suitable for visual question answering and general question answering tasks.

- [Vietnamese-yfcc15m-OpenAICLIP](https://huggingface.co/datasets/5CD-AI/Vietnamese-yfcc15m-OpenAICLIP)
  - Tasks: Image-to-Text, Text-to-Image, Visual Question Answering
  - Languages: Vietnamese, English
  - Size: 15.4M rows
  - Description: This large dataset provides images and corresponding text descriptions, translated into Vietnamese, suitable for various image and text tasks including visual question answering.

## Chat Datasets

- [sendo_vietnamese_multiturn_gemini_50k](https://huggingface.co/datasets/5CD-AI/sendo_vietnamese_multiturn_gemini_50k)
  - Languages: Vietnamese
  - Size: 50k rows
  - Description: A Vietnamese multi-turn chat dataset designed for conversation-based models.

- [travel-multi-turn-chat-gemini](https://huggingface.co/datasets/5CD-AI/travel-multi-turn-chat-gemini)
  - Languages: Vietnamese
  - Size: 34.4k rows
  - Description: A multi-turn chat dataset focusing on travel-related conversations, suitable for training conversational models.

- [tiki-multi-turn-chat-gemini-vietnamese-50k](https://huggingface.co/datasets/5CD-AI/tiki-multi-turn-chat-gemini-vietnamese-50k)
  - Languages: Vietnamese
  - Size: 50k rows
  - Description: A multi-turn chat dataset with conversations sourced from the Tiki platform, translated into Vietnamese.

- [viet-ecommerce-alpaca](https://huggingface.co/datasets/5CD-AI/viet-ecommerce-alpaca)
  - Languages: Vietnamese
  - Size: 69.3k rows
  - Description: A dataset related to e-commerce activities in Vietnamese.

- [Vietnamese-argilla-OpenHermesPreferences-66k-gg-translated](https://huggingface.co/datasets/5CD-AI/Vietnamese-argilla-OpenHermesPreferences-66k-gg-translated)
  - Tasks: Text Generation, Question Answering
  - Languages: Vietnamese, English
  - Size: 66k rows
  - Description: A dataset translated into Vietnamese, focusing on text generation and question-answering tasks.

## CoT Datasets

- [Vietnamese-nampdn-ai-tiny-webtext-gg-translated](https://huggingface.co/datasets/5CD-AI/Vietnamese-nampdn-ai-tiny-webtext-gg-translated)
  - Tasks: Question Answering, Text Generation
  - Languages: Vietnamese, English
  - Size: 1.85M rows
  - Description: A collection of web text data translated into Vietnamese, suitable for question-answering and text generation tasks.

- [Vietnamese-1m5-kaist-CoT-gg-translated-unrefined](https://huggingface.co/datasets/5CD-AI/Vietnamese-1m5-kaist-CoT-gg-translated-unrefined)
  - Tasks: Question Answering, Text Generation
  - Languages: Vietnamese, English
  - Size: 1.5M rows
  - Description: A collection of KAIST CoT data translated into Vietnamese, suitable for question-answering and text-generation tasks.

- [Vietnamese-mabryCodes-tiny-cot-alpaca-gg-translated](https://huggingface.co/datasets/5CD-AI/Vietnamese-mabryCodes-tiny-cot-alpaca-gg-translated)
  - Tasks: Question Answering, Text Generation
  - Languages: Vietnamese, English
  - Size: 500k rows
  - Description: A collection of high-quality CoT data translated into Vietnamese, suitable for question-answering and text-generation tasks.

## DPO Datasets

- [Vietnamese-beyond-rlhf-reward-single-round-gg-translated](https://huggingface.co/datasets/5CD-AI/Vietnamese-beyond-rlhf-reward-single-round-gg-translated)
  - Tasks: Question Answering, Text Generation
  - Languages: Vietnamese, English
  - Size: 20k rows
  - Description: A collection of DPO data translated into Vietnamese, suitable for question-answering and text-generation tasks.

- [Vietnamese-Intel-orca_dpo_pairs-gg-translated](https://huggingface.co/datasets/5CD-AI/Vietnamese-Intel-orca_dpo_pairs-gg-translated)
  - Tasks: Question Answering, Text Generation
  - Languages: Vietnamese, English
  - Size: 13k rows
  - Description: A collection of DPO data translated into Vietnamese, suitable for question-answering and text-generation tasks.
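
Any dataset linked in this README can be inspected before committing to a full download. The sketch below is one possible way to do that, not part of the original repository: it assumes the third-party Hugging Face `datasets` package is installed, and that the dataset exposes a split named `train` (an assumption; check the dataset page). The helper names `repo_id_from_url` and `preview_rows` are hypothetical and introduced only for illustration.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Extract the '<owner>/<name>' repo ID from a Hugging Face dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path shape: datasets/<owner>/<name>
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def preview_rows(url: str, n: int = 3, split: str = "train"):
    """Stream the first n rows of a dataset without downloading all of it.

    Assumption: the repo has a split called `split`; "train" is a guess,
    not something this README states.
    """
    # Imported lazily because `datasets` is a third-party dependency.
    from datasets import load_dataset

    stream = load_dataset(repo_id_from_url(url), split=split, streaming=True)
    return list(stream.take(n))
```

For example, `preview_rows("https://huggingface.co/datasets/5CD-AI/Vietnamese-Intel-orca_dpo_pairs-gg-translated")` would return the first three records as dictionaries, which is a quick way to see the column names before wiring a dataset into a fine-tuning script. Streaming is used here because several of the listed datasets have over a million rows.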

## Math Datasets

- [Vietnamese-395k-meta-math-MetaMathQA-gg-translated](https://huggingface.co/datasets/5CD-AI/Vietnamese-395k-meta-math-MetaMathQA-gg-translated)
  - Tasks: Question Answering
  - Languages: Vietnamese, English
  - Size: 395k rows
  - Tags: math, math-qa, meta-math
  - Description: A large dataset containing math-related questions translated into Vietnamese, designed for question-answering tasks.

- [Vietnamese-nvidia-OpenMathInstruct-1-50k-gg-translated](https://huggingface.co/datasets/5CD-AI/Vietnamese-nvidia-OpenMathInstruct-1-50k-gg-translated)
  - Tasks: Question Answering
  - Languages: Vietnamese, English
  - Size: 50k rows
  - Tags: math, math-qa, meta-math
  - Description: A large dataset containing math-related questions translated into Vietnamese, designed for question-answering tasks.

- [Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated](https://huggingface.co/datasets/5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated)
  - Tasks: Question Answering
  - Languages: Vietnamese, English
  - Size: 200k rows
  - Description: This dataset contains math word problems translated into Vietnamese, suitable for question-answering tasks.

--------------------------------------------------------------------------------