Can value alignment solve AI risks?

Source: Chinese Social Sciences Today, 2026-05-25

Adopting dynamically adjusted conditional governance is conducive to confronting AI risks. Photo: TUCHONG

Value alignment is a major issue in current debates over the safety, robustness, and credibility of artificial intelligence (AI). Yet the academic enthusiasm surrounding this topic stands in tension with laboratory evaluation data that have raised concerns about the effectiveness of existing value alignment techniques. From the ontological perspective of the philosophy of technology, the idea of value alignment rests on a binary theoretical architecture. This framework emphasizes the autonomous logic that digital technology displays over the course of its evolution, while intensifying the presumed antagonism between AI and human rights, interests, and well-being.

When value alignment is treated as a pathway for technological governance, it seeks to recast the complex problems arising from technology–society interaction as “value deviation,” subsume them under a single, totalizing systemic problem, and pursue a package solution. This gives rise to a dual predicament. First, it may turn technological development into an excuse through which human beings evade responsibility. Second, this discourse relies on an abstract narrative of human–machine value difference, thereby obscuring the value-laden factors involved throughout the development of AI. It treats AI risks or problems as essential, fixed entities, binds discussions of value to an abstract notion of “human value,” and forecloses deeper, more dynamic analysis of the origins and concrete contexts of human value systems. It also makes deceptive forms of value alignment in AI more difficult to detect. Establishing a long-term mechanism for AI safety therefore requires returning to the essential nature of technology and clarifying the problem at that level.

Alignment as transplantation

The theoretical claim of value alignment essentially reflects a logic of value transplantation and value transformation, through which it constructs an ideal paradigm that appears capable of holistically resolving the developmental dilemmas of AI. Its core assumption is that the value systems and principles upheld by human beings possess unquestionable completeness and authority, and that “human value” exists as an unconditional and decontextualized uniformity. On this basis, human value is regarded as sufficient to morally reshape AI and rescue it from the so-called mire of demoralization.

Researchers and practitioners who adhere to this idea attempt to embed concepts and norms that reflect particular human value preferences into the operating logic of intelligent devices through a series of technical interventions. Their central intention is to avoid and mitigate, as far as possible, the risks and negative consequences generated by the divergence between technological rationality and social rationality.

Human society, however, has developed rich and diverse cultural forms over a long historical process, and different cultures have generated markedly different value concepts. Elevating one particular value system into a supreme standard and using it to reshape AI inevitably marginalizes other cultures and value systems. This is a typical act of value colonialism. It fails to reflect the richness of human values and, through alignment, turns a particular value ideology into a technological “unconscious,” further intensifying inequality among value systems.

Limitations of value alignment

When discussing value alignment in the field of AI, it is necessary to reflect on its underlying theoretical presuppositions and practical misconceptions. The abstract presupposition of “human value” implied by the concept of value alignment constructs a single matrix of the controlling subject and its continuity. The process of alignment is a unilateral action initiated by human subjects under a logic of one-way control. It neglects not only the complexity of values and the social unconscious, but also the value orientations already embedded in algorithms, data, and models across the entire AI process, as well as the influence of human–machine interaction and feedback mechanisms on subjective cognition and society.

Every concrete realization of AI technology bears the deep imprint of human intelligence and the socio-cultural patterns closely associated with it. The two major challenges that value alignment encounters at the level of the meta-problem further call its feasibility into question: the normative problem faced by value objectives in concrete AI applications, and the technical problem of how such value objectives can be encoded.

As understanding of the autonomy of digital technology gradually deepens, it must be clearly recognized that AI value “disorder” or “misalignment” is itself a process in which complex factors continuously accumulate. This does not deny the necessity of governing the unsettling problems that arise in AI outputs. Rather, it emphasizes that the expression “value alignment” is itself grounded in a mistaken conception of the human–machine relationship, one that overstates the agency of the human subject and may thereby lead to unreasonable solution paths.

In the technical implementation of value alignment, reward-and-punishment mechanisms are a relatively common approach. Through the positive and negative feedback of reinforcement learning from human feedback (RLHF), they guide AI behavior toward expected goals. Yet as data are optimized and models iterated, AI systems become increasingly capable of exploiting loopholes in reward-and-punishment functions. This means that even after value alignment has been carried out, machines may still display unforeseen behaviors in future application scenarios. The challenges also go far beyond technical issues such as reward hacking. Even setting aside problems of data quality and technical implementation, excessive reliance on value alignment may generate ideological risks rooted in the concentration of power.

A technologically determinist position may simultaneously produce naive technological optimism and technophobia, and value alignment represents a complex convergence of these two contradictory sentiments. When technological development is consistent with social values, society is more likely to embrace technological optimism. When technological development conflicts with certain social values, society may instead tilt toward technological fear. The source of fear is then located in the dimension of value, accompanied by the claim that technology lacks “human value.” This not only misreads the manifestation of the problem as its cause, but also constitutes a false form of reverse attribution. As a result, value alignment repeats what Bruno Latour criticized as the binary rupture between exact knowledge and the operation of power. In the name of an autonomous “technological logic,” it prevents researchers from further investigating how the content and form of concrete AI products and applications—as important outcomes of social production—are produced under specific historical and cultural conditions. It also obscures the fact that the symbiosis between technology and power is consolidated and intensified under conditions of digital intelligence.

In addition, normative schemes of value alignment often evade one key issue: even if unilateral alignment initiated by human subjects could be successfully realized, it still could not prevent human subjects from circumventing reward-and-punishment objectives in indirect ways in order to realize their own particular intentions. The risks of AI will not become fewer or simpler merely because the so-called value problem of intelligent agents has been solved. Even user-friendly human–machine interaction still requires vigilance against AI hallucinations and against manipulation grounded in the technology–power nexus.

Context-sensitive and dynamic governance

As a complex system, AI involves risks that arise from multiple elements, including partially indeterminate structures such as the unpredictability of deep learning models and the complex behaviors generated through human–machine system interaction. This makes conventional linear governance approaches that overemphasize technical norms or ethical values inadequate. To overcome this limitation, provide a more inclusive framework for risk analysis, and establish a more explanatory and operational paradigm of risk governance, it is necessary to move beyond any route centered solely on technical restriction or ethical regulation.

Governance should instead proceed on the basis of recognizing the diversity, variability, and complexity of risk. It should transcend the simple opposition between control and laissez-faire, and seek a dynamic equilibrium between order and evolution. At the level of its underlying logic, AI risk governance should be understood as conditional governance, requiring continuous reflection on two levels. First, “distributed computation” should be used to break the monopoly of centralized digital-intelligent power, dispersing AI tasks across multiple nodes and reducing dependence on centralized computational resources. Second, within a framework of transparency and appropriate explainability, limited trust between humans and machines should be established, while preserving users’ independence from intelligent agents and preventing overdependence and manipulation.

Achieving these goals is by no means easy. Yet this form of conditional technological governance helps to create a context-sensitive and dynamic governance ecology. As AI technology develops and its application scenarios change, governance conditions and measures can be adjusted and optimized in a timely manner. Through real-time monitoring and feedback mechanisms, newly emerging safety risks and fairness issues can be rapidly identified, and targeted governance strategies can be formulated accordingly. This dynamism enables governance measures to remain synchronized with technological development. While safeguarding technological safety, it can also continue to advance digital-intelligent fairness, thereby responding effectively to changing technological and social challenges.

Value alignment is not a purely technical issue, and the latent risks behind it are becoming an unavoidable concern in the development of friendlier AI. The value alignment approach presupposes an abstract and static notion of “human value,” neglecting the richness of value itself and its diversity across different stages of human social development, regions, and cultural contexts. At the same time, abstract value alignment presupposes a fixed, predetermined conception of the human–machine relationship, overstates the agency of the human subject, and may thereby lead to unreasonable pathways for AI risk governance.

Today, beyond purely technical causes, the potential risks of AI essentially originate from the capacity of algorithmic apparatuses to represent social structures multidimensionally and to reconstruct them at a deep level. AI ethics has become a governance problem of complex adaptive systems, and AI risk has moved beyond the technical level to become a problem of adaptability between algorithmic apparatuses and systems of social governance. Only by abandoning the holistic and static alignment scheme of value transplantation, and instead adopting dynamically adjusted conditional governance that confronts the systemic crises inherent in algorithmic society, will it be possible to realize a dynamic balance between technology and the humanities in concrete contexts.

 

Wu Jing is a professor in the Department of Philosophy at Nanjing Normal University. This article has been edited and excerpted from China Social Science Review, Issue 4, 2025.

Editor: Yu Hui

Copyright©2023 CSSN All Rights Reserved
