chinese-cosmopedia

License: Apache-2.0

Steward: OpenCSG

Task: LLM

Release Date: 1/22/2026

Format: parquet

Size: 6.09 GB


Description

A large-scale, high-quality Chinese text dataset developed by OpenCSG. It contains about 15 million entries (roughly 60B tokens) spanning multiple domains, including encyclopedic and educational content. The data has been cleaned and deduplicated to remove low-quality content. It is intended for large language model pretraining, text generation, and other downstream Chinese NLP tasks, and it works with mainstream toolchains such as Hugging Face Datasets and PyTorch.
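Since the description mentions compatibility with Hugging Face Datasets, the following is a minimal sketch of how one might preview a few records without downloading the whole dataset. The repository ID `opencsg/chinese-cosmopedia` and the `train` split name are assumptions not stated on this page, and `stream_records` and `preview` are hypothetical helper names introduced for illustration. The sketch makes no assumption about column names and simply prints whatever fields each record carries.

```python
from itertools import islice

# Assumption: the Hub repository ID is not given on this page.
REPO_ID = "opencsg/chinese-cosmopedia"


def stream_records(repo_id=REPO_ID, split="train"):
    """Lazily stream records instead of downloading the full dataset.

    Assumes a split named "train" exists. Requires `pip install datasets`.
    """
    from datasets import load_dataset  # imported lazily so preview() works without it

    return load_dataset(repo_id, split=split, streaming=True)


def preview(records, n=3, max_chars=80):
    """Return the first n records, truncating string values for display."""
    out = []
    for rec in islice(records, n):
        out.append(
            {k: (v[:max_chars] if isinstance(v, str) else v) for k, v in rec.items()}
        )
    return out


if __name__ == "__main__":
    # Network access is needed only for this part.
    for row in preview(stream_records()):
        print(row)
```

Streaming is used here because it lets the first few rows be inspected before committing to a full download; for pretraining runs, a regular (non-streaming) load or a direct read of the parquet files may be more appropriate.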

Specifics

Licensing

Apache License 2.0 (Apache-2.0)

https://spdx.org/licenses/Apache-2.0.html

Considerations

Restrictions/Special Constraints

This dataset is licensed under the Apache License 2.0. Commercial use requires prior written approval from OpenCSG (contact: lorraineg@opencsg). Users must comply with applicable data laws and ethical AI guidelines.

Forbidden Usage

1. Unauthorized commercial use without OpenCSG's approval
2. Generating harmful/misleading/unethical content (e.g., misinformation, discrimination)
3. Infringing on third-party intellectual property/privacy
4. Use violating laws, regulations, or public order

Metadata