smoltalk-chinese

License:

Apache-2.0

Steward:

OpenCSG

Task: LLM

Release Date: 1/22/2026

Format: parquet

Size: 879.81 MB


Description

SmolTalk-Chinese is a high-quality multi-task dataset for Chinese conversational scenarios, covering 19 task types (e.g., advice-seeking, code generation, daily chat). The full dataset is available at www.opencsg.com. It provides data for training and evaluating few-shot dialogue models, is intended to improve models’ Chinese conversational capabilities, and supports downstream work such as dialogue system development and LLM fine-tuning.
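Since the dataset is distributed as parquet and spans multiple task types, a quick first check is to tally records per task. The following is a minimal sketch only: the column name `task_type`, the helper names `count_tasks` and `load_records`, and the sample rows are hypothetical, because this card does not document the schema. Inspect the actual columns before relying on it.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping

# Hypothetical column name: the real schema is not described in this card.
TASK_COLUMN = "task_type"


def count_tasks(records: Iterable[Mapping[str, object]],
                column: str = TASK_COLUMN) -> Counter:
    """Count how many records carry each task label, skipping missing labels."""
    return Counter(
        str(record[column])
        for record in records
        if record.get(column) is not None
    )


def load_records(path: str) -> List[Dict[str, object]]:
    """Read a local parquet file into a list of row dicts.

    Requires pandas plus a parquet engine (pyarrow or fastparquet).
    """
    import pandas as pd  # imported lazily so count_tasks works without pandas

    return pd.read_parquet(path).to_dict(orient="records")


if __name__ == "__main__":
    # Illustrative rows only, not taken from the dataset.
    sample = [
        {"task_type": "code generation"},
        {"task_type": "daily chat"},
        {"task_type": "code generation"},
    ]
    for task, n in count_tasks(sample).most_common():
        print(task, n)
```

To run this against a downloaded file, pass the rows from `load_records("<local file>.parquet")` to `count_tasks`, substituting whatever column actually holds the task label.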

Specifics

Licensing

Apache License 2.0 (Apache-2.0)

https://spdx.org/licenses/Apache-2.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended solely for academic research and non-commercial use. Users must attribute the dataset to "SmolTalk-Chinese (provided by opencsg)" in any publications or derivative works. Use of the dataset must also comply with the terms of its open-source license.

Forbidden Usage

Commercial profit-making activities (e.g., integrating the dataset into paid products or services); generating harmful, malicious, discriminatory, or illegal content; using the dataset to infringe upon the legitimate rights and interests of individuals or organizations; any usage that violates applicable laws, regulations, or ethical guidelines.

Metadata