Wednesday, September 17, 2025

#IJCAI2025 distinguished paper: Combining MORL with restraining bolts to be taught normative behaviour


Picture supplied by the authors – generated utilizing Gemini.

For many people, synthetic intelligence (AI) has change into a part of on a regular basis life, and the speed at which we assign beforehand human roles to AI techniques exhibits no indicators of slowing down. AI techniques are the essential substances of many applied sciences — e.g., self-driving automobiles, sensible city planning, digital assistants — throughout a rising variety of domains. On the core of many of those applied sciences are autonomous brokers — techniques designed to behave on behalf of people and make selections with out direct supervision. With a view to act successfully in the actual world, these brokers should be able to finishing up a variety of duties regardless of presumably unpredictable environmental situations, which frequently requires some type of machine studying (ML) for attaining adaptive behaviour.

Reinforcement studying (RL) [6] stands out as a robust ML approach for coaching brokers to realize optimum behaviour in stochastic environments. RL brokers be taught by interacting with their surroundings: for each motion they take, they obtain context-specific rewards or penalties. Over time, they be taught behaviour that maximizes the anticipated rewards all through their runtime.

Picture supplied by the authors – generated utilizing Gemini.

RL brokers can grasp all kinds of complicated duties, from profitable video video games to controlling cyber-physical techniques equivalent to self-driving automobiles, usually surpassing what professional people are able to. This optimum, environment friendly behaviour, nevertheless, if left fully unconstrained, could turn into off-putting and even harmful to the people it impacts. This motivates the substantial analysis effort in protected RL, the place specialised strategies are developed to make sure that RL brokers meet particular security necessities. These necessities are sometimes expressed in formal languages like linear temporal logic (LTL), which extends classical (true/false) logic with temporal operators, permitting us to specify situations like “one thing that should at all times maintain”, or “one thing that should finally happen”. By combining the adaptability of ML with the precision of logic, researchers have developed highly effective strategies for coaching brokers to behave each successfully and safely.

Nonetheless, security isn’t all the things. Certainly, as RL-based brokers are more and more given roles that both substitute or intently work together with people, a brand new problem arises: making certain their conduct can also be compliant with the social, authorized and moral norms that construction human society, which frequently transcend easy constraints guaranteeing security. For instance, a self-driving automotive may completely comply with security constraints (e.g. avoiding collisions), but nonetheless undertake behaviors that, whereas technically protected, violate social norms, showing weird or impolite on the highway, which could trigger different (human) drivers to react in unsafe methods.

Norms are usually expressed as obligations (“you should do it”), permissions (“you’re permitted to do it”) and prohibitions (“you’re forbidden from doing it”), which aren’t statements that may be true or false, like classical logic formulation. As a substitute, they’re deontic ideas: they describe what is true, improper, or permissible — ultimate or acceptable behaviour, as a substitute of what’s truly the case. This nuance introduces a number of tough dynamics to reasoning about norms, which many logics (equivalent to LTL) battle to deal with. Even every-day normative techniques like driving rules can function such issues; whereas some norms could be quite simple (e.g., by no means exceed 50 kph inside metropolis limits), others could be extra complicated, as in:

  1. All the time keep 10 meters between your automobile and the automobiles in entrance of and behind you.
  2. If there are lower than 10 meters between you and the automobile behind you, you must decelerate to place more room between your self and the automobile in entrance of you.

(2) is an instance of a contrary-to-duty obligation (CTD), an obligation you should comply with particularly in a state of affairs the place one other main obligation (1) has already been violated to, e.g., compensate or scale back injury. Though studied extensively within the fields of normative reasoning and deontic logic, such norms could be problematic for a lot of fundamental protected RL strategies based mostly on imposing LTL constraints, as was mentioned in [4].

Nonetheless, there are approaches for protected RL that present extra potential. One notable instance is the Restraining Bolt approach, launched by De Giacomo et al. [2]. Named after a tool used within the Star Wars universe to curb the conduct of droids, this technique influences an agent’s actions to align with specified guidelines whereas nonetheless permitting it to pursue its targets. That’s, the restraining bolt modifies the conduct an RL agent learns in order that it additionally respects a set of specs. These specs, expressed in a variant of LTL (LTLf [3]), are every paired with its personal reward. The central concept is straightforward however highly effective: together with the rewards the agent receives whereas exploring the surroundings, we add a further reward every time its actions fulfill the corresponding specification, nudging it to behave in ways in which align with particular person security necessities. The task of particular rewards to particular person specs permits us to mannequin extra sophisticated dynamics like, e.g., CTD obligations, by assigning one reward for obeying the first obligation, and a distinct reward for obeying the CTD obligation.

Nonetheless, points with modeling norms persist; for instance, many (if not most) norms are conditional. Contemplate the duty stating “if pedestrians are current at a pedestrian crossing, THEN the close by automobiles should cease”. If an agent had been rewarded each time this rule was glad, it will additionally obtain rewards in conditions the place the norm isn’t truly in drive. It’s because, in logic, an implication holds additionally when the antecedent (“pedestrians are current”) is fake. Because of this, the agent is rewarded every time pedestrians should not round, and may be taught to extend its runtime with a purpose to accumulate these rewards for successfully doing nothing, as a substitute of effectively pursuing its meant process (e.g., reaching a vacation spot). In [5] we confirmed that there are situations the place an agent will both ignore the norms, or be taught this “procrastination” conduct, regardless of which rewards we select. Because of this, we launched Normative Restraining Bolts (NRBs), a step ahead towards imposing norms in RL brokers. In contrast to the unique Restraining Bolt, which inspired compliance by offering further rewards, the normative model as a substitute punishes norm violations. This design is impressed by the Andersonian view of deontic logic [1], which treats obligations as guidelines whose violation essentially triggers a sanction. Thus, the framework not depends on reinforcing acceptable conduct, however as a substitute enforces norms by guaranteeing that violations carry tangible penalties. Whereas efficient for managing intricate normative dynamics like conditional obligations, contrary-to-duties, and exceptions to norms, NRBs depend on trial-and-error reward tuning to implement norm adherence, and subsequently could be unwieldy, particularly when attempting to resolve conflicts between norms. Furthermore, they require retraining to accommodate norm updates, and don’t lend themselves to ensures that optimum insurance policies decrease norm violations.

Our contribution

Constructing on NRBs, we introduce Ordered Normative Restraining Bolts (ONRBs), a framework for guiding reinforcement studying brokers to adjust to social, authorized, and moral norms whereas addressing the constraints of NRBs. On this method, every norm is handled as an goal in a multi-objective reinforcement studying (MORL) drawback. Reformulating the issue on this manner permits us to:

  • Show that when norms don’t battle, an agent who learns optimum behaviour will decrease norm violations over time.
  • Specific relationships between norms by way of a rating system describing which norm needs to be prioritized when a battle happens.
  • Use MORL strategies to algorithmically decide the mandatory magnitude of the punishments we assign such that it’s guarantied that as long as an agent learns optimum behaviour, norms shall be violated as little as attainable, prioritizing the norms with the best rank.
  • Accommodate modifications in our normative techniques by “deactivating” or “reactivating” particular norms.

We examined our framework in a grid-world surroundings impressed by technique video games, the place an agent learns to gather assets and ship them to designated areas. This setup permits us to reveal the framework’s means to deal with the complicated normative situations we famous above, together with direct prioritization of conflicting norms and norm updates. As an example, the determine under

shows how the agent handles norm conflicts, when it’s each obligated to (1) keep away from the harmful (pink) areas, and (2) attain the market (blue) space by a sure deadline, supposing that the second norm takes precedence. We are able to see that it chooses to violate (1) as soon as, as a result of in any other case it will likely be caught at first of the map, unable to satisfy (2). Nonetheless, when given the likelihood to violate (1) as soon as extra, it chooses the compliant path, regardless that the violating path would permit it to gather extra assets, and subsequently extra rewards from the surroundings.

In abstract, by combining RL with logic, we will construct AI brokers that don’t simply work, they work proper.

This work gained a distinguished paper award at IJCAI 2025. Learn the paper in full: Combining MORL with restraining bolts to be taught normative behaviour, Emery A. Neufeld, Agata Ciabattoni and Radu Florin Tulcan.

Acknowledgements

This analysis was funded by the Vienna Science and Expertise Fund (WWTF) venture ICT22-023 and the Austrian Science Fund (FWF) 10.55776/COE12 Cluster of Excellence Bilateral AI.

References

[1] Alan Ross Anderson. A discount of deontic logic to alethic modal logic. Thoughts, 67(265):100–103, 1958.

[2] Giuseppe De Giacomo, Luca Iocchi, Marco Favorito, and Fabio Patrizi. Foundations for restraining bolts: Reinforcement studying with LTLf/LDLf restraining specs. In Proceedings of the worldwide convention on automated planning and scheduling, quantity 29, pages 128–136, 2019.

[3] Giuseppe De Giacomo and Moshe Y Vardi. Linear temporal logic and linear dynamic logic on finite traces. In IJCAI, quantity 13, pages 854–860, 2013.

[4] Emery Neufeld, Ezio Bartocci, and Agata Ciabattoni. On normative reinforcement studying through protected reinforcement studying. In PRIMA 2022, 2022.

[5] Emery A Neufeld, Agata Ciabattoni, and Radu Florin Tulcan. Norm compliance in reinforcement studying brokers through restraining bolts. In Authorized Information and Info Techniques JURIX 2024, pages 119–130. IOS Press, 2024.

[6] Richard S. Sutton and Andrew G. Barto. Reinforcement studying – an introduction. Adaptive computation and machine studying. MIT Press, 1998.



Agata Ciabattoni
is a Professor at TU Wien.



Emery Neufeld
is a postdoctoral researcher at TU Wien.

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles

PHP Code Snippets Powered By : XYZScripts.com