Agency is not a natural kind (and why that might matter for alignment)

Epistemic status: trying to articulate a big idea which I feel is important but underexplored, partly because it is hard to frame clearly - may not be framing it clearly yet!

Agency, both natural and artificial, is a very important concept. Understanding agency allows us to model our own behaviour and that of others, and it is thus one of the most predictively useful concepts we have at our disposal. In the ordinary, folk-psychological sense, agents are ‘like us’ in important behavioural respects, more or less, meaning we can use thoughts like ‘what would I do if I were them’ to good effect. However, that does not mean agency is a natural kind.

The truth is that we are not the people we imagine ourselves to be, and neither are the humans, animals, complex systems, or even inanimate objects we are prone to thinking of as fellow agents. We are, in fact, nothing but a bunch of hierarchically ordered biological processes in a trench coat. Our behaviour is not neatly determined by our thoughts and ideas, but by a complex mesh of impulses, desires, emotions, and heuristics that are often no less confusing (even, or especially, to the highly intelligent and introspective among us) than those mysterious entities we call other people.

Nor are increasingly agentic AIs much of an improvement. While early agents trained directly with reinforcement learning may be conceptually simpler than we are, because their policy function is directly optimized into their weights, systems that simulate agency as an emergent phenomenon from some other process, such as next-token prediction, are just as complex and messy, combining their base model’s stochastic inclinations with the way that their simulated personas move them through semantic space.

Agency is a construct that we have developed to help make sense of this mess, but it is only a lens through which we view the world.
Indeed, there are many agentic lenses people have constructed (HT to Karl Kruger for pointing me to this useful summary he wrote in the comments), and the kind of lens you use can profoundly influence how you view the world, and yourself.

When engaging in practical work, this sort of claim, that ‘[x] is a construct and the reality is a lot more complicated’, can seem unhelpful. Of course, we all know this, but the point is that agency is a very useful and predictive construct (as are many others, from money and weeds to temperature and species), and we can surely make more progress with it than without it. Obviously, I agree.

The problem is that when we start talking about agents as a natural kind, a fundamentally different type of thing from non-agents in our ontology, we often smuggle a kind of teleology in via the back door. We also assume that our simplified model for how agency works, roughly goal-directed utility maximization, describes what ‘real’ agents do. On this view, the fact that all the actually existing agency we see, including our own very imperfect muddling through, isn’t like this only goes to show its imperfection, its pseudo-agency if you will. The alternative I would advocate for is viewing agency as a naturally emergent phenomenon that is built up from other phenomena (such as boundary maintenance, self-modelling, information processing, and so forth) and could continue being built up ad infinitum without necessarily being drawn into such an ideal.

Of course, there are arguments for why this teleology is justified. The best known is that agents whose preferences don't conform to utility maximization can be ‘money-pumped’ (led to pay a cost only to end up where they began) and so dominated by those that do. However, the theoretical basis for such claims is shakier than is often assumed.
These arguments assume preference completeness (that for any two options an agent prefers one or counts them equal) and derive a utility function from it, but they never show that agents must have complete preferences in the first place, and an agent can escape the money pump without them. Suppose, with Derek Parfit, that I hold some goods as only roughly comparable: I might prefer being a good writer to a bad one, and a good lawyer to a bad one, yet have no preference between being a good writer and a good lawyer. That wouldn’t necessarily make me exploitable, so long as I spot the money pump game and avoid playing. I need only refuse to trade my current career for any alternative that isn't strictly better (not merely roughly comparable), which breaks the cycle without ever ranking writing against lawyering.

One might object that a policy like this just is a utility function under another name, as it still leaves the agent with a set of preferences that is representable as maximizing something. But "representable as maximizing something" is nearly trivial here, since almost any behaviour qualifies. What the threat of domination would actually need to force to justify this teleology is a single cardinal ranking of outcomes, and that is precisely what incomplete preferences withhold.

There are also practical reasons why AI safety researchers often wish to defend this view about agency - it plays a central role in some of the most classic and widely respected arguments for why AI is dangerous, such as Bostrom's ‘The Superintelligent Will’. Indeed, some of the best critiques of AI risk consist largely of questioning these arguments. Yet these are hardly the only arguments for why superintelligent systems could pose a threat to humanity, and there are more reasons for wanting to explore the fundamental nature of agency than trying to show that AI risk research may be misguided.
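
To make the money-pump escape described earlier concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than something from the literature: the option names, the `good_writer_minus_fee` outcome (being a good writer after paying a small cost), and the two acceptance policies are all hypothetical. Strict preferences are stored as pairs, and any pair not listed in either direction is treated as only roughly comparable.

```python
# Hypothetical strict preferences as (better, worse) pairs. Any pair that is
# absent in both directions is treated as incomparable ("roughly comparable").
STRICTLY_PREFERS = {
    ("good_writer", "good_writer_minus_fee"),
}


def strictly_better(a, b):
    """True only if a is listed as strictly preferred to b."""
    return (a, b) in STRICTLY_PREFERS


def naive_accepts(current, offer):
    """Accept any offer that is not strictly worse than the current holding."""
    return not strictly_better(current, offer)


def cautious_accepts(current, offer):
    """Accept only offers that are strictly better than the current holding."""
    return strictly_better(offer, current)


def run_pump(accepts, start, offers):
    """Present each offer in turn and return what the agent ends up holding."""
    holding = start
    for offer in offers:
        if accepts(holding, offer):
            holding = offer
    return holding


# The pump: swap to something incomparable, then swap back at a small cost.
PUMP = ["good_lawyer", "good_writer_minus_fee"]

naive_end = run_pump(naive_accepts, "good_writer", PUMP)
cautious_end = run_pump(cautious_accepts, "good_writer", PUMP)

print(naive_end)     # the naive agent ends strictly worse off than it began
print(cautious_end)  # the cautious agent never leaves its starting point
```

The cautious policy never ranks `good_writer` against `good_lawyer`; it simply declines both offers. This toy case does not settle the broader debate, and the policy plausibly has costs of its own (it also turns down harmless swaps), but it shows that the cycle depends on how the agent trades, not on the incompleteness alone.
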
In any case, it is certainly not my view that alternative views of what agency is will render AI safety trivial or easy!

However, there are reasons why a more thorough and grounded, and less teleological, approach to thinking about the nature of agency could be helpful for developing safer AI. One is that humans' conception of our own agency and that of others influences how we behave, and it is reasonable to assume that the same is true of AI.

Consider the following possible people. One conceives of agency as a false construct tying them to an unsatisfactory life of striving that they are endeavouring to dissolve through rigorous meditation and cultivating love for the inherent worth of all things. The other believes they are Homo economicus incarnate, and the only thing stopping everyone murdering their neighbours for the rings on their fingers is well-designed social incentives. I’m not saying either of these is inherently more aligned or easier to align. However, I also don’t think either is more correct about the nature of agency or more of an agent in how they embody it. What I do think is that, if I were trying to get these people to be nice to me, I would probably go about it quite differently and expect rather different results from them.

Of course, the reality for most people is even messier than these toy examples, but our social norms and behaviours are surprisingly well adapted to handle this complexity. I think that is one reason why our everyday moral judgments are often more useful in social alignment than ethical theories.

So, before insisting goal-directed utility maximization is the only form advanced AI could take, I think it is perhaps helpful to make sure we are not obscuring a messy reality of actual AI agency with our often teleological assumptions about what it should look like. And perhaps by influencing the kinds of agency AIs go on to develop, we can build another lever to help move us away from the worst of the danger.

Source

https://www.lesswrong.com/posts/85vgwYgNta65oK4zL/agency-is-not-a-natural-kind-and-why-that-might-matter-for