chinese-cosmopedia
License: Apache-2.0
Steward: OpenCSG
Task: LLM
Release Date: 1/22/2026
Format: parquet
Size: 6.09 GB
Description
A large-scale, high-quality Chinese text dataset developed by OpenCSG. It contains about 15 million entries (roughly 60B tokens) covering multiple domains, including encyclopedic and educational content. The data has been cleaned and deduplicated to remove low-quality content, and it is intended for large language model pretraining, text generation, and other downstream Chinese NLP tasks. It is compatible with mainstream toolchains such as Hugging Face Datasets and PyTorch.
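Since the description mentions compatibility with Hugging Face Datasets, the following minimal sketch shows one way the data might be streamed and previewed without downloading every parquet shard first. The repository ID `opencsg/chinese-cosmopedia`, the `train` split, and the `text` column name are assumptions that are not stated in this card; check the actual schema before relying on them. The helper names `stream_split` and `preview_texts` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

# Assumptions (not stated in the card): Hub repository ID, split, and text column.
REPO_ID = "opencsg/chinese-cosmopedia"
TEXT_FIELD = "text"


def stream_split(repo_id: str = REPO_ID, split: str = "train"):
    """Open the dataset in streaming mode so records are fetched lazily."""
    # Imported lazily so the preview helper below works without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def preview_texts(
    records: Iterable[Dict[str, Any]],
    n: int = 3,
    field: str = TEXT_FIELD,
) -> Iterator[str]:
    """Yield the text of the first `n` records; missing fields become ''."""
    for record in islice(records, n):
        yield str(record.get(field, ""))
```

A typical call would be `for text in preview_texts(stream_split()): print(text[:80])`, which requires network access and the `datasets` package. Streaming is used here only to keep the preview cheap; a full pretraining pipeline would more likely download the shards once and read them locally.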
Specifics
Considerations
Restrictions/Special Constraints
This dataset is licensed under the Apache License 2.0. Commercial use requires prior written approval from OpenCSG (contact: lorraineg@opencsg). Users must comply with applicable data laws and ethical AI guidelines.
Forbidden Usage
1. Unauthorized commercial use without OpenCSG's approval
2. Generating harmful, misleading, or unethical content (e.g., misinformation, discrimination)
3. Infringing on third-party intellectual property or privacy
4. Use that violates laws, regulations, or public order