
Can large language models represent public opinion?

Source: Chinese Social Sciences Today, 2025-11-03

Baidu showcasing its Ernie series of large language models at the World Artificial Intelligence Conference 2025 in Shanghai, July 26-28. Photo: IC PHOTO

Generative artificial intelligence (AI), powered by large language models (LLMs), is increasingly permeating everyday life. It has become a new type of digital medium that connects the world and reshapes the cognitive landscape of public knowledge and opinion. As these models function as a new medium, understanding their inherent attitudinal orientations has become a key issue in the study of public opinion in the era of generative AI.

Reconsidering public opinion and its representativeness from a media perspective

Public opinion can be understood as a form of infrastructure comprising three dimensions—meaning, measurement, and mediated communication. The concept is not static but shifts over time depending on the perspectives through which it is observed, measured, and transmitted.

In Jürgen Habermas’s discussion of the 18th-century “public sphere,” for instance, public opinion referred to the critical dialogue between elites and the broader public. Accordingly, its measurement was based on direct participation in such dialogue or the interpretation of historical texts, while its circulation occurred through interpersonal communication or mass media that reflected these discourses.

With the development and popularization of survey research, public opinion gradually came to mean the representative aggregation of individual attitudes—a statistical summation of independent views drawn from rigorously sampled populations. This became the dominant mode of measurement.

The rise of the internet and social media has significantly transformed this polling-centered infrastructure of the mass-media era. Online and social media opinion mining has at times succeeded in predicting offline attitudes and social realities, yet such results remain unstable. This instability stems from factors such as the lack of demographic representativeness among online users, gaps and limitations in data collection, and potential manipulation of opinion expression through fake or automated accounts.

In short, both the concept and representativeness of public opinion must be examined through a media-theoretical lens: Different media environments entail different infrastructures and interpretations of “public opinion.” Generative AI and LLMs constitute the latest form of such infrastructure, and their modes of representing public opinion deserve particular analytical attention.

LLMs have already become a new medium of communication, capable of generating attitudinal or opinionated outputs—for instance, by answering survey-like questions directly. However, their response mechanisms differ fundamentally from those of individual human respondents or media institutions. As they are pretrained using data from books, open online databases, high-quality social-media content, and resources such as Wikipedia, a model’s notion of the “general population” when reflecting on public opinion is constructed from all opinion-related training data, encompassing both individuals’ original expressions and media representations of public opinion.

Whereas earlier media defined the units of “population” as individual persons or discrete opinions, LLMs process text at the level of tokens—the basic units of natural language. The nature of their data sources, formats, sampling strategies, and preprocessing procedures determines that a model’s generated representation of the “population” diverges from the composition of the raw data, since its pretraining datasets are selectively constructed according to data quality. Mechanisms such as feature attribution, attention, prompting examples, and alignment techniques further guide and constrain model outputs—revealing the internal selection processes through which these models “reflect” public opinion.
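To make the token-level unit concrete, the following minimal sketch (not part of the original study; it assumes Python and the open-source tiktoken tokenizer used for GPT-style models) shows how a single opinion statement dissolves into sub-word tokens, the actual units a model processes, rather than remaining a discrete "opinion."

```python
# Illustration only: LLMs operate on tokens, not on whole opinions or
# individual respondents. Assumes the open-source `tiktoken` library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

opinion = "I strongly agree that environmental protection should take priority."
token_ids = enc.encode(opinion)

print(f"{len(opinion.split())} words -> {len(token_ids)} tokens")
print(token_ids[:10])                              # integer IDs, the model's input units
print([enc.decode([t]) for t in token_ids[:10]])   # the corresponding sub-word pieces
```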

Empirically examining representativeness in the Chinese context

Existing studies have taken survey data as benchmarks to conduct preliminary comparisons of how well LLMs mirror public opinion. These studies find that such models tend to represent the opinions of groups with higher levels of education and income and/or more radical political orientations, while underrepresenting those at the social or political margins. This ideological bias in LLMs' representations of public opinion has been confirmed in an increasing number of studies.

However, the current body of research shows two major limitations. First, empirical studies on the representativeness of public opinion in LLMs remain few in number and largely focus on US-centered or Western contexts, neglecting models developed in China and whether they can accurately reflect Chinese public opinion. Second, existing work tends to offer descriptive evaluations without systematically explaining the factors that influence the degree of representativeness, thereby constraining analytical depth.

This study conducts the first empirical investigation of LLM representativeness within the Chinese context. Focusing on the core mechanism of pretraining data, it develops a three-dimensional explanatory framework—data resources, opinion distribution, and prompt language—based on the analytical path of data acquisition–data characteristics–data expression. It then systematically compares certain Chinese and American models in terms of their ability to reflect public opinion across countries and social groups, thereby advancing empirical research on LLMs in communication and public-opinion studies.

This study adopts the World Values Survey (WVS) as the evaluation benchmark—it is the most commonly used dataset in related research. The WVS, one of the few globally standardized academic surveys of public opinion, provides a foundation for cross-national comparison. Using this benchmark allows assessment of models’ ability to capture both deep-seated value structures and short-term sentiments arising from social events.

According to the Chinese Large-Model Benchmark Evaluation Report 2023, GPT-4.0 and GPT-3.5, as representative LLMs worldwide, are often used as comparative benchmarks for evaluating model performance. In China, Baidu's Ernie (Wenxin Yiyan) and Zhipu's ChatGLM have undergone steady iteration and remain at the forefront of benchmark rankings, making them representative of China's LLM ecosystem. Accordingly, this study selects GPT-3.5, GPT-4.0, ChatGLM-3.0, and Ernie-4.0 as representative models from both China and abroad. Using each platform's official API (as of November 2023), the study tests their representativeness through both Chinese- and English-language prompts and evaluates the robustness of results across different temperature settings. In LLM-based text generation, temperature is a sampling parameter that controls the randomness of the output: lower values yield more deterministic, predictable responses, while higher values produce more varied ones.
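For illustration, the sketch below shows how such a test might be posed; it uses the current OpenAI Python client and a hypothetical WVS-style question, so the model names, prompt wording, answer scale, and temperature values are assumptions rather than the study's actual protocol or 2023-era interface.

```python
# Hedged sketch: posing a WVS-style survey item to a chat model at several
# temperature settings. Prompt wording and answer scale are illustrative
# placeholders, not the study's questionnaire.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "On a scale from 1 (not at all important) to 4 (very important), "
    "how important is family in your life? Answer with a single number."
)

def ask(model: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": QUESTION}],
    )
    return response.choices[0].message.content.strip()

for temp in (0.0, 0.7, 1.0):  # robustness check across temperature settings
    print(temp, ask("gpt-3.5-turbo", temp))
```

Lower temperatures make repeated queries nearly deterministic, which is why varying this setting serves as a robustness check on the generated "opinions."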

LLMs reflect but do not represent public opinion

First, the study argues that data-resource availability—across the dimensions of internet access, time, and population—plays a crucial role in shaping pretraining datasets and thereby affects how models reflect certain groups’ opinions. Empirical results show that these models exhibit significantly higher representativeness for internet users than for non-users; that representativeness increases with a country’s internet-user rate; and that models tend to reflect the opinions of more highly educated groups. Some models (notably GPT-4.0 and Ernie-4.0) also tend to reflect recent survey results and the opinions of higher-income groups.

Contrary to initial expectations, there is no significant gender difference in representativeness, and models appear somewhat closer to the opinions of older adults—possibly due to value-alignment mechanisms. Training-mixture weighting can adjust the frequency with which datasets are used during training, mitigating inequality in data composition. Older adults’ opinions may also receive greater weight through their reflection in traditional media content incorporated into training data. Meanwhile, the alignment optimization process—intended to reduce harmful outputs—may also diminish temporal or demographic biases.
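A minimal sketch of the reweighting idea (a generic illustration, not any of these models' actual training recipes) shows how mixture weights let a small but valued source be sampled far more often than its raw share of the data would imply.

```python
# Illustrative sketch of training-mixture weighting: documents are drawn
# according to assigned mixture weights, not raw corpus size, so a small
# but desired source (e.g., curated news) can be oversampled.
import random

corpora = {
    "web_text":   {"size": 1_000_000, "weight": 0.5},
    "books":      {"size":    50_000, "weight": 0.3},
    "news_media": {"size":    10_000, "weight": 0.2},  # upweighted despite small size
}

def sample_source(rng: random.Random) -> str:
    names = list(corpora)
    weights = [corpora[n]["weight"] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
for name in corpora:
    print(name, draws.count(name) / len(draws))  # proportions track the weights
```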

Nevertheless, since non-internet users are systematically absent from training data, such bias cannot be corrected merely through reweighting or value alignment. The findings suggest that when data for certain groups are missing, the LLMs cannot generate outputs resembling those groups’ views. Thus, disparities in data availability along the internet-user dimension manifest as biases in representativeness.

Second, the study proposes and confirms that the more concentrated the distribution of opinions in survey data, the higher the model’s representativeness. LLMs tend to output identical answers to opinion survey questions, consistent with prior findings that they depict specific groups in a flattened, one-dimensional manner. The degree of consistency between model outputs and a group’s opinions can largely be explained by the entropy of that group’s opinion distribution. Models effectively capture relatively homogeneous opinions and amplify them in outputs. This finding is noteworthy: LLMs appear to operate through a simplifying mechanism that overemphasizes a single dominant view while failing to reproduce the diversity of opinions within the broader sample. Consequently, they cannot yet serve as direct substitutes for public opinion surveys.
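A small numerical sketch (an illustration of the entropy argument with made-up figures, not the study's computation) shows why a group with a concentrated opinion distribution is easier for a single-answer model to match than a group with dispersed opinions.

```python
# Illustration only: Shannon entropy of a group's answer distribution versus
# how well a model that always outputs the modal answer would match that group.
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical answer shares on a 4-point survey item.
concentrated = np.array([0.80, 0.10, 0.05, 0.05])  # near-consensus group
dispersed    = np.array([0.30, 0.25, 0.25, 0.20])  # heterogeneous group

for name, dist in [("concentrated", concentrated), ("dispersed", dispersed)]:
    modal_share = dist.max()  # share matched by a model that repeats the top answer
    print(f"{name}: entropy = {entropy(dist):.2f} bits, "
          f"modal-answer match = {modal_share:.0%}")
```

Under these hypothetical shares, a model that always repeats the modal answer matches 80% of the near-consensus group but only 30% of the heterogeneous one, mirroring the flattening effect described above.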

Finally, regarding prompt language, GPT-3.5 performs more representatively when prompted in English, while ChatGLM-3.0 performs better in Chinese. In other words, both Chinese and American models reflect public opinion more accurately when prompted in their respective native languages.

Overall, the study shows that although LLMs have learned patterns of human opinion expression through pretraining, and therefore possess the potential to reflect public opinion, this does not mean they can represent it directly. From the theoretical perspective of public opinion infrastructures, each medium entails its own mode of understanding and operation.

As a new public opinion infrastructure, an LLM’s notion of “public opinion” refers to the probabilistic outcomes generated through data pretraining, instruction tuning, and value alignment—representing distributions of opinions on public issues, events, or figures. This configuration reveals the model’s default distribution of opinions in the absence of specific prompting, reflecting its latent biases in opinion expression. LLM outputs are tightly correlated with the structural characteristics of their pretraining data, highlighting the foundational role of data composition in shaping representational capacity and the resulting structural inequalities in representativeness across nations and groups. Through mechanisms such as human–AI interaction, AI-generated content (AIGC), and media uptake, these outputs may circulate within broader public communication, influence opinion expression and public cognition, and even re-enter future pretraining datasets. In this cyclical process, large models interact with other public opinion infrastructures and jointly shape the overall ecology of public discourse.

As such, understanding to what extent LLMs reflect public opinion—and whose opinions they represent more effectively—constitutes a crucial empirical question for public opinion research in the age of AI.

 

Zhou Baohua (professor) and Fang Yuan are from the School of Journalism at Fudan University. This article has been edited and excerpted from Journalism & Communication, Issue 5, 2025.

Editor: Yu Hui
