smoltalk-chinese
License: Apache-2.0
Steward: OpenCSG
Task: LLM
Release Date: 1/22/2026
Format: parquet
Size: 879.81 MB
Description
SmolTalk-Chinese is a high-quality multi-task dataset for Chinese conversational scenarios, covering 19 task types (e.g., advice-seeking, code generation, daily chat). The full dataset is available at www.opencsg.com. It provides data for training and evaluating few-shot dialogue models, is intended to improve models' Chinese conversational capabilities, and supports downstream tasks such as dialogue system development and LLM fine-tuning.
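Since the dataset is distributed as parquet, a minimal sketch of reading a local copy might look like the following. It assumes the files have already been downloaded to a directory of your choosing and that pandas is installed with a parquet engine such as pyarrow. The function names are hypothetical helpers, not part of any OpenCSG tooling, and no column names are assumed because the card does not state the schema.

```python
from pathlib import Path


def find_parquet_files(root):
    """Return every .parquet file under `root`, sorted for a stable order."""
    return sorted(Path(root).rglob("*.parquet"))


def load_dataset(root):
    """Read all parquet files under `root` into a single DataFrame.

    `root` is whatever local directory holds the downloaded files;
    the layout inside it is not assumed.
    """
    files = find_parquet_files(root)
    if not files:
        raise FileNotFoundError(f"no .parquet files found under {root}")

    # Imported lazily so that file discovery works without pandas installed.
    import pandas as pd  # needs a parquet engine, e.g. pyarrow

    frames = [pd.read_parquet(path) for path in files]
    return pd.concat(frames, ignore_index=True)
```

Calling `load_dataset("path/to/smoltalk-chinese")` (a placeholder path) and then inspecting `df.columns` and `df.head()` is a reasonable first step for discovering the actual schema before building a training or evaluation pipeline.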
Specifics
Considerations
Restrictions/Special Constraints
This dataset is intended solely for academic research and non-commercial use. Users must attribute the dataset to "SmolTalk-Chinese (provided by opencsg)" in any publications or derivative works. Additionally, the use of this dataset must comply with the open-source license terms associated with it.
Forbidden Usage
The following uses are forbidden: commercial profit-making activities (e.g., integrating the dataset into paid products or services); generating harmful, malicious, discriminatory, or illegal content; using the dataset to infringe upon the legitimate rights and interests of individuals or organizations; and any usage that violates applicable laws, regulations, or ethical guidelines.