National language corpus: unlocking cultural treasures in the digital era


The “Outline for the Development of National Philosophy Discipline and Social Sciences During the 14th Five-Year Plan Period (2021-2025)” calls for promoting the application of big data, cloud computing, artificial intelligence, and other technologies in the humanities and social sciences. It aims to encourage cross-fertilization and integrative innovation between the social sciences and the natural sciences, further enrich the content of the disciplines, and introduce new research methods and technical approaches. For a country, language is an important historical, cultural, and practical resource. A corpus is a database, designed specifically for research, that brings together a large amount of language as it is actually used. It carries the fundamental information of a nation's language and culture and records their historical development.

Corpus construction

Many countries regard corpus construction as an important foundational project and have established national corpora. For example, construction of the British National Corpus (BNC) began in 1991 and its first version was completed in 1994; the second and third editions followed in 2001 and 2007, with a scale of 100 million words. The American National Corpus (ANC) entered planning in 1998 and released its first version in 2003 (11.1 million words), followed by a second edition in 2005 (22 million words). After 2006, the focus shifted to the construction of the open corpus (OANC) and the manually annotated sub-corpus (MASC). The design of the ANC closely mirrors that of the BNC, with its synchronic section also intended to comprise 100 million words. This project is still in progress. In 1998, the South Korean government launched the “21st Century Sejong Plan” project to construct the Korean National Corpus (KNC) with a volume of 200 million words (eojul), which has now been completed. In addition, Russia, Hungary, Thailand, Estonia, and other countries have constructed and released their own national corpora. These national corpora are all balanced and have undergone segmentation, annotation, and other processing, and they have played a positive role in promoting research on their respective languages.

This illustrates that a national corpus is a significant cultural initiative established and supervised by national-level institutions or designated agencies, adhering to national standards and focusing on the national lingua franca. A national corpus should be large in scale, carefully balanced, comprehensive in coverage, dynamically updated, richly annotated, diverse in its applications, openly shared, and easy to use. It should truly reflect the overall usage and development of the national common language. The construction of such a national corpus in China has become urgent.

The construction of Chinese corpora began in the 1970s and has produced multiple Chinese corpora of various scales, including those built independently by universities and research institutes. These corpora have played a positive role in the education and research of the national common language. However, many of them were built for temporary, local, short-term, or single-purpose needs. As a result, they lack long-term consideration and comprehensive design, and they fail to fully reflect the current state of national common language usage.

Major issues

The main issues that need to be addressed are as follows. Firstly, corpus sampling is imbalanced, with written language overrepresented and spoken language underrepresented. For example, in one Chinese corpus widely used in the field, over 70% of the contemporary data consists of written materials from newspapers and periodicals, while spoken language materials account for less than 0.3%. Some corpora include only microblog texts from a specific year as spoken language materials, and some large-scale corpora include no spoken language materials at all. From an academic perspective, spoken language corpora are indispensable for reflecting the real conditions of language use and for directly capturing the individual character of speakers' language. Well-established corpora in countries with a strong tradition of corpus linguistics contain a significant proportion of spoken language materials, consistent with this principle. For instance, the BNC comprises 90% written language and 10% spoken language, while the first edition of the ANC, with 11 million words, includes 8 million words of written language and 3 million words of spoken language.

Secondly, sample sizes are uncontrolled, which limits the range of texts that a corpus of a given size can cover. Some corpora do not control for sample size, and this undermines their balance and representativeness. For example, certain corpora include the complete works of contemporary authors without any size limit. This violates the principle that a balanced corpus should avoid an excessive concentration of works by the same author, especially works that are too long or too numerous. The BNC, for instance, extracts samples of up to 45,000 words from different parts of a single author’s works.
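The sample-size control described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the BNC's actual procedure: the function name cap_author_samples is hypothetical, tokenization is naive whitespace splitting (unsuitable for unsegmented Chinese text), and words are taken from the start of each work rather than from different parts of it.

```python
from collections import defaultdict

MAX_WORDS_PER_AUTHOR = 45_000  # the per-author cap cited in the text for the BNC


def cap_author_samples(documents, max_words=MAX_WORDS_PER_AUTHOR):
    """Return (author, tokens) samples so that no author exceeds max_words.

    documents: iterable of (author, text) pairs.
    Assumption: whitespace tokenization; a real Chinese corpus would need
    word segmentation first.
    """
    used = defaultdict(int)  # words already sampled per author
    samples = []
    for author, text in documents:
        remaining = max_words - used[author]
        if remaining <= 0:
            continue  # this author's quota is exhausted; skip further works
        tokens = text.split()[:remaining]  # truncate the work to fit the quota
        if tokens:
            used[author] += len(tokens)
            samples.append((author, tokens))
    return samples
```

With such a cap in place, a prolific author contributes no more than a fixed share, and the remaining space goes to other authors and text types.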

Thirdly, regular plans for updating corpora are lacking, which poses challenges for scholars conducting research based on diachronic balanced corpora. The construction of diachronic corpora requires early design and planning, along with long-term follow-up. Most corpora currently in operation do not adequately address this aspect. National corpora should establish long-term plans for regular updates. For example, the ANC plans to increase its size by 10% every five years, in addition to its existing 100 million words of synchronic corpus.
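To make the scale of such an update plan concrete, the following sketch projects corpus size over successive five-year updates. It assumes that each 10% increment is applied to the current size (compounding); the text does not say whether the increment is measured against the current size or the original 100 million words, and the function name projected_size is hypothetical.

```python
def projected_size(base_words, periods, growth_percent=10):
    """Corpus size after `periods` five-year updates.

    Assumption: each update adds growth_percent of the current size.
    """
    size = base_words
    for _ in range(periods):
        # Integer arithmetic keeps word counts exact (no float rounding).
        size += size * growth_percent // 100
    return size


for n in range(4):
    print(n * 5, "years:", projected_size(100_000_000, n))
```

Under the compounding assumption, a 100-million-word corpus grows to 110 million words after five years and 121 million after ten.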

Fourthly, media formats in corpus data are limited, with the majority of existing large corpora in China being in text format. Corpora in multimedia formats are generally smaller in scale and have limited sources, making it challenging to reflect the full landscape of Chinese language use, especially in the context of vibrant spoken language. Research based on “multimedia, multimodal” corpora is gaining prominence at the international research frontier, but the construction of national corpora in multimedia and multimodal formats lags behind. 

Fifthly, corpus application systems often lack sufficient functionality. If the application system of a corpus lacks rich features, it cannot provide users with the services they need, and the practical value of building the corpus is greatly diminished. Many foreign corpora have powerful application platforms with a range of functions, including concordance retrieval, frequency statistics, collocation discovery, and comparative analysis. Web-based corpus platforms such as CQPweb and Sketch Engine, which can use computer clusters for complex calculations and offer diverse corpus application functions, represent the mainstream direction for future development. In contrast, most domestic corpora offer only concordance retrieval, and only a few provide basic statistical word lists, falling far short of what in-depth linguistic research requires. Particularly in the integration, query, and analysis of multimedia and multimodal corpus data, there is still a long way to go from theoretical exploration to practical application software.
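The three basic functions named above (concordance retrieval, frequency statistics, and collocation discovery) can be illustrated with a small sketch. It does not reproduce how CQPweb or Sketch Engine work; the function names are hypothetical, the input is assumed to be an already segmented list of tokens, and collocates are ranked by raw co-occurrence count rather than by a statistical association measure.

```python
from collections import Counter


def concordance(tokens, keyword, width=3):
    """Keyword-in-context lines: up to `width` tokens on each side of a hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def frequency_list(tokens, top=10):
    """The `top` most frequent tokens with their counts."""
    return Counter(tokens).most_common(top)


def collocates(tokens, keyword, window=2):
    """Count tokens that occur within `window` positions of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


tokens = "the corpus records the language and the corpus serves research".split()
print(concordance(tokens, "corpus", width=2))
print(frequency_list(tokens, top=2))
print(collocates(tokens, "corpus", window=1).most_common(1))
```

A production platform would add indexing, annotation-aware queries, and association measures, but the three functions correspond to the minimum that the text says users should expect.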

Going forward

In the long run, if corpora fail to meet the practical needs of language investigation and research, they will become obstacles to the development of disciplines, scientific research, and international collaboration. The value and significance of national corpus construction are reflected primarily in three aspects. Firstly, a national corpus can comprehensively depict the overall usage and development of the national lingua franca; it is a manifestation of national soft power and an essential resource that has yet to be built. Secondly, the construction of a national corpus helps fill the gap in academia for large-scale, dynamically balanced corpora of the national common language, thus better serving language research. Thirdly, national corpus construction will catalyze a range of research. Within linguistics, this includes large-scale descriptive grammar research on the national common language, multi-perspective investigations into language life, diverse studies on language development and evolution, and interactive research on language ontology and language information processing. Beyond linguistics, it can also contribute to digital humanities, public sentiment observation, and other areas in literature, history, philosophy, and the social sciences.

Against this background, the Institute of Linguistics at the Chinese Academy of Social Sciences has initiated the construction of a national corpus. The project is well positioned to take advantage of favorable timing, rich accumulated resources, and mastery of new technologies for data acquisition and processing. However, we are keenly aware that this new type of corpus, launched in a new era of development, carries a new mission and faces new challenges. We are entrusted with the linguistic and written treasures of the great Chinese cultural heritage, which boasts a civilizational history spanning over 5,000 years. How to draw on the mature experience of international corpus construction while remaining firmly grounded in our national language and culture is a significant challenge. Establishing a corpus classification system based on the characteristics of the Chinese language, and comprehensively integrating modern linguistic achievements with our cultural characteristics, requires a thoughtful approach.

In this context, we are confronted with the theoretical task of systematically researching and establishing principles for corpus construction. Under the new principles and standards, we will face substantial operational tasks, such as reintegrating existing resources and collecting and organizing applicable resources. There is also the task of cultivating interdisciplinary talent with composite skills. As a “dynamic” corpus, it needs to support collaborative construction among multiple institutions and users, manage the workflow of data collection and compilation, and update its content dynamically. It should also serve word-list and statistical queries under complex conditions with high concurrency and low latency. This goal places high demands on corpus indexing and querying technologies, as well as on the construction of corpus application platforms.
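One common way to keep corpus queries fast is to look words up in a prebuilt index rather than scanning the texts for each request. The sketch below shows a minimal positional inverted index. It is offered only as an illustration of the kind of indexing technology the paragraph refers to, not as the design of the planned platform; the function names and the in-memory dictionary are assumptions.

```python
from collections import defaultdict


def build_index(documents):
    """Map each token to a list of (doc_id, position) postings.

    documents: dict of doc_id -> list of tokens (already segmented).
    """
    index = defaultdict(list)
    for doc_id, tokens in documents.items():
        for pos, tok in enumerate(tokens):
            index[tok].append((doc_id, pos))
    return index


def lookup(index, token):
    """Return all postings for a token without scanning the documents."""
    return index.get(token, [])


docs = {"d1": ["语言", "是", "资源"], "d2": ["语料库", "记录", "语言"]}
index = build_index(docs)
print(lookup(index, "语言"))
```

Because the index is built once and each lookup is a single dictionary access, many concurrent queries can be answered without rereading the corpus; a dynamically updated corpus would additionally need to add postings as new texts arrive.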

Therefore, with confidence in advancing the teaching and research of the national common language, we aim to construct a Chinese National Corpus that is large in scale, well balanced, comprehensive, dynamically updated, richly annotated, versatile, openly shared, and user-friendly. This corpus will provide stronger support for the education and research of the national common language.


Zhang Bojiang (director) and Zhang Yongwei (associate research fellow) are from the Institute of Linguistics at the Chinese Academy of Social Sciences.




