Tokens reshape AI industry ecosystem - CHINESE SOCIAL SCIENCES NET

Author: LI YONGJIE and FU JINLIN
Source: Chinese Social Sciences Today, 2026-05-22

In late March 2026, the China National Committee for Terminology in Science and Technology officially established ciyuan as the standardized Chinese term for token. Photo: TUCHONG

When users engage in a text-based conversation with AI, a familiar scene unfolds: Rather than producing a complete paragraph all at once, the system outputs a stream of characters that flicker onto the screen one after another. The process resembles a word-chain game. In reality, however, intensive computation is taking place in the background as these characters appear. Machines do not directly process the “characters” or “words” as humans perceive them. Instead, they process sequences of tokens and then decode them into natural language.

In late March 2026, the China National Committee for Terminology in Science and Technology officially established ciyuan as the standardized Chinese term for token. Recently, several scholars interviewed by CSST noted that ciyuan—a technical term once largely confined to the back end of AI systems—is now entering public discussions on topics such as the consumer market for AI agents and competition over AI autonomy.

From ‘digital atom’ to ‘smart currency’

“From a technical perspective, a token is the smallest discrete unit of information processed by large language models (LLMs),” said Huang Xuanjing, a professor from the College of Computer Science and Artificial Intelligence at Fudan University. When processing text, LLMs segment the input into a series of basic units—tokens. A token, for example, may be a complete word, a root or affix, a single Chinese character, or even a symbol. The Chinese translation “ciyuan” derives its meaning accordingly: “Ci” points to its close connection with language and writing, while “yuan” signifies a “basic unit” or “atom.”
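The segmentation Huang describes can be illustrated with a minimal sketch. It assumes a tiny hand-written vocabulary (`TOY_VOCAB`) and a greedy longest-match rule, both hypothetical and chosen only for illustration; production LLM tokenizers learn their vocabularies from data with subword algorithms and differ in many details.

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation.

    Characters that match no vocabulary entry become single-character tokens.
    """
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary: whole words, affixes, a Chinese word, a symbol.
TOY_VOCAB = {"token", "ization", "un", "break", "able", "词元", "!"}

print(tokenize("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(tokenize("unbreakable!", TOY_VOCAB))  # ['un', 'break', 'able', '!']
print(tokenize("词元化", TOY_VOCAB))         # ['词元', '化']
```

In this toy run, a token is variously a word stem, an affix, a two-character Chinese word, a single Chinese character, or a symbol, which mirrors the range of units listed above.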

As AI advances more deeply into fields such as image, audio, and video processing, the concept of the token in multimodal fusion has long moved beyond the scope of text alone. Huang noted that in multimodal LLMs, text, images, and sound are ultimately represented in a unified way as sequences of tokens, which together enter the model’s computational workflow. In this sense, the token has expanded from its original meaning as the “basic unit of text” to become the universal basic unit through which AI understands and generates information—a bridge between the human world and machine intelligence.

According to Zhao Kuo, an associate professor from the School of Intelligent Systems Science and Engineering at Jinan University in Guangdong Province, tokens enable AI to achieve unified perception and reconstruction of the complex world by probabilistically combining “digital atoms.”

As tokens become the underlying logic by which machines deconstruct the world and engage in creative work, these pulsating “digital atoms” also leap off the screen and become a form of “smart currency” in the marketplace. Liu Liehong, head of the National Data Bureau, proposed at the China Development Forum 2026 that tokens are not only the value anchor of the intelligent era but also the settlement unit connecting technological supply with business demand. According to the 2024–2025 China AI Large Model Market Status and Development Trends Research Report released by iiMedia Research, China’s LLM market reached 29.416 billion yuan in 2024 and is expected to exceed 70 billion yuan by 2026. Against this vast market backdrop, major LLM providers now generally base API pricing on the unit of “per million tokens.”
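To make the "per million tokens" billing unit concrete, the following sketch computes the cost of a single API call. The prices and the split between input and output rates are assumptions made for illustration, not figures from any provider or from the report cited above.

```python
def api_cost(input_tokens, output_tokens,
             input_price_per_million, output_price_per_million):
    """Return the cost of one call when prices are quoted per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Hypothetical prices in yuan per million tokens (assumed, not real quotes).
INPUT_PRICE = 2.0
OUTPUT_PRICE = 8.0

cost = api_cost(1_500_000, 250_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"{cost:.2f} yuan")  # 5.00 yuan
```

Under this scheme the bill depends only on token counts and the quoted rates, which is why the token can serve as a settlement unit between providers and customers.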

Xu Fei, executive vice president of Fuyao University of Science and Technology, noted in a recently published article that the standardization and generalization of ciyuan will restructure the AI industry ecosystem, give rise to a new economic form—the “token economy”—and generate entirely new professions and industrial sectors.

Competitiveness of token system

In 2023, the concept of “new quality computing infrastructure” was proposed for the first time. At its core, the concept integrates computing power, data, storage, and security technologies to support the development of new quality productive forces. Three years later, the phrases “create new forms of smart economy” and “launch new infrastructure projects on hyper-scale intelligent computing clusters and coordinated development of computing capacity and electricity supply” were formally written into the 2026 Government Work Report. Behind this shift is the deep integration of AI into various economic and social domains. Computing power, as a new driver for cultivating new quality productive forces, plays a key role in building AI ecosystems through infrastructure layout. The scholars interviewed suggested that the development of token systems, as a core link in LLMs, is a crucial pillar for building new quality computing infrastructure.

“If the overall capacity and level of linguistic intelligence processing are regarded as a kind of ‘new quality computing power,’ then the research and development of software and hardware related to LLMs are among the core components of building new quality computing infrastructure,” said Shi Jianjun, a professor from the Institute of Language Sciences at Shanghai International Studies University. The token system, Shi explained, affects model autonomy, inference speed, output quality, and value orientation, so its importance is self-evident.

Zhao noted that current global LLM token systems exhibit four distinctive features. First, the technological path is becoming more diversified. In the text domain, subword algorithms remain dominant, while in the multimodal domain, the field is gradually shifting toward discrete mapping through vector quantization. Second, standardization is accelerating. In China, the terminological definition and the status of ciyuan as a settlement unit have been clearly established. Internationally, efforts are underway to address fragmentation by expanding token vocabularies, indicating that standardization has entered a critical stage. Third, ecosystem compatibility continues to improve. Open-source AI development platforms and tool ecosystems, with Hugging Face as a prominent example, are maturing, and mainstream models and frameworks have achieved interoperability among vocabularies. Fourth, commercial value is becoming increasingly prominent. Tokens have become the core billing unit for LLM services. At the same time, this shift has triggered a series of new industry challenges, including copyright protection, privacy and security, and cross-language rule alignment.
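The "discrete mapping through vector quantization" mentioned in the first feature can be sketched as a nearest-neighbor lookup: each continuous feature vector (for example, one extracted from an image patch) is replaced by the index of its closest codebook entry, and that index serves as the token. The two-dimensional codebook and patch features below are invented for illustration; real systems learn much larger codebooks in high-dimensional spaces.

```python
def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


# Hypothetical 4-entry codebook and patch features (assumed values).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patch_features = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.95, 0.9)]

print(quantize(patch_features, CODEBOOK))  # [0, 1, 2, 3]
```

The resulting list of integers is a token sequence in the same sense as a tokenized sentence, which is what lets text, images, and sound share one computational workflow.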

Huang stated that China has developed strong capacity for independent innovation in token systems, though there is still room for optimization in underlying computing chips and the basic software ecosystem. To assess that progress and the remaining gaps, she pointed to four key dimensions. The first is daily token invocation volume: As of March 2026, average daily token invocation volume had exceeded 140 trillion, underscoring that the token economy has become the core logic of AI commercialization as LLMs are deployed in industrial applications. The second is segmentation efficiency, where domestic LLMs have achieved outstanding results: Models such as DeepSeek use tokenizers optimized for the Chinese language, greatly improving encoding efficiency. The third is context window size. Mainstream models in China and abroad have now expanded their context windows to hundreds of thousands or even millions of tokens, with domestic models keeping pace with the international frontier. The fourth is token processing throughput and cost, which have become key directions for breakthroughs in China’s token systems, as algorithmic optimization in domestic LLMs has improved token processing efficiency.

Zhao added that domestic LLMs have already taken the lead in Chinese tokenization efficiency. By optimizing vocabulary structures, domestic models consume fewer tokens when processing Chinese, resulting in lower inference costs and faster responses. In addition, in vertical domains such as government affairs and finance, China’s token systems are better suited to local linguistic contexts, offering stronger targeting and closer integration in practical applications.
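Why vocabulary structure affects token consumption can be shown with a rough sketch. It assumes two toy vocabularies and a tokenizer that falls back to one token per UTF-8 byte for characters it does not know; the vocabularies, the sample text, and the fallback rule are illustrative assumptions, and real ratios depend on the model and the text.

```python
def count_tokens(text, vocab):
    """Count tokens under greedy longest-match segmentation.

    Characters with no vocabulary match cost one token per UTF-8 byte.
    """
    max_len = max(len(piece) for piece in vocab)
    count = 0
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in vocab:
                count += 1
                i += size
                break
        else:
            # No entry matched: fall back to the character's raw bytes.
            count += len(text[i].encode("utf-8"))
            i += 1
    return count


# Hypothetical vocabularies: one without Chinese entries, one with them.
GENERIC_VOCAB = {"the", "token", " "}
CHINESE_AWARE_VOCAB = GENERIC_VOCAB | {"人工", "智能", "词元", "人工智能"}

text = "人工智能词元"
generic_count = count_tokens(text, GENERIC_VOCAB)      # 6 characters x 3 bytes
aware_count = count_tokens(text, CHINESE_AWARE_VOCAB)  # "人工智能" + "词元"

print(generic_count, aware_count)  # 18 2
```

At a fixed price per token, the same Chinese sentence costs proportionally less under the second vocabulary, and fewer tokens also means fewer generation steps.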

However, gaps remain in the development of China’s token systems. Open-source tokenization tools still lag behind the international state of the art. Multimodal annotation systems and cross-modal benchmarks remain incomplete. Corpora for local dialects and low-resource languages are relatively scarce. The development of relevant datasets and the supporting tool ecosystem still requires sustained advancement.

Supporting AI industry

Tokens provide a new perspective for understanding the world and also lay the foundation for promoting the high-quality development of AI in industrial transformation. Scholars believe that China’s token systems must continue to break through technical bottlenecks while also focusing on the cultivation of new quality productive forces and strategically planning for high-quality development.

Regarding the construction of data infrastructure for token development, Zhao proposed that policy implementation should focus on three aspects. First, a unified data standards system should be established, with improved standards for collection and management as well as specifications for cross-domain annotation and quality assessment, thereby ensuring multi-scenario and multilingual coverage of datasets. Second, privacy protection safeguards should be strengthened, the application of privacy-preserving computation should be deepened, and the secure circulation of data should be ensured. Third, sharing mechanisms should be built through the national data trading platform to break down cross-departmental and cross-domain data barriers.

Zhao noted that high-quality data is a critical foundation for upgrading token technology. Rich corpora can optimize subword vocabularies and improve the ability to generalize across low-frequency words. Cross-modal annotation facilitates precise token alignment. Low-resource language corpora can expand coverage. To this end, efforts should focus on building multimodal corpora in key areas such as industrial manufacturing, finance, and healthcare, while increasing policy and financial support for open-source tools and alignment benchmarks. A regular quality assessment system should also be established, deeper integration between data and token technology should be promoted, and an autonomous and controllable industrial ecosystem should be built.

Editor: Yu Hui
