This note summarizes work conducted to date on improving wargames and simulations with artificial intelligence (AI) and describes a novel approach to the mechanics of course of action generation that is currently (June 2026) being validated.
Wargames are highly effective strategy training and validation methods that have proven their worth over decades of use. However, human-centric simulations [^1] are highly inefficient: they require significant resources to staff and support, and they are limited to the pace of the slowest decision-maker. This means that the opportunity to run wargames is constrained by resources and there is little, if any, opportunity to re-run a simulation with a different set of starting conditions. Nevertheless, in our experience of participating in or running 100+ military and commercial wargames, their effectiveness as a training and validation tool is unchallenged.
So, how can we run more simulations, thus increasing participation and making these on-demand tools rather than events that need to be scheduled weeks or months in advance?
The exam question is therefore: how can we create a system where wargames can be created and run much more efficiently without an erosion in quality?
Machine-supported wargames solve some of these problems by allowing much faster, more efficient games to be run across multiple conditions but machine-centric games also pose challenges. Reasoning is often poorly explained or completely absent. Worse, when multiple moves are run sequentially, the decision-making process, and sometimes even the individual questions themselves, are unclear to the user. This creates the Deep Thought[^2] problem: the answer is out of context and the users don’t even know the problem that was being solved.
Moreover, traditional computational solutions using techniques such as Monte Carlo Tree Search (MCTS) [^3] still require significant resources with respect to the time taken to run the simulation and the associated cost. For some leading commercial models, this can run to hundreds of thousands of dollars per simulation. Nevertheless, MCTS can be used effectively to solve complex problems. This was the approach taken by DeepMind with AlphaGo and AlphaStar, which we will return to later.
What follows is a summary of work conducted to apply AIs to wargames and simulations to date.
AI for Exercise Creation (2023)
The first problem tackled was that of exercise creation: how to develop the material required for an in-depth simulation more efficiently? These supporting materials can run to hundreds of pages of scripts, detailed force descriptions, maps and visuals, and even realistic news clips, not to mention the preparation required to brief the participants and role players. The administrative burden of a complex wargame can exceed 50 preparation hours for each hour of play.
In multiple commercial roles, we saw the benefit of a comprehensive development handbook along with a clear workflow for exercise developers. These ensured that exercise development became a repeatable process, allowing adequate time for the creation of the scenario and supporting materials because much less time was required for administration. Moreover, frameworks like ‘the three act play’[^4] were found to be both repeatable and highly effective structuring tools, again removing process friction.
As with many endeavors, the introduction of widely available generative AI in late 2022 accelerated this efficiency, particularly with respect to the creation of written injects, such as simulated newspaper stories, and video clips. Now, work that had taken weeks or cost thousands of dollars for actors and camera crews was available in minutes for a few pennies. We consolidated these services into a self-serve tool - CrisisDojo[^5] - proving how clear processes plus generative AI tools could make exercise creation highly efficient and no longer reliant on an exercise development specialist.
Although the output was rudimentary and, in early 2023, nowhere near crossing the uncanny valley, we were highly confident that the initial limitations in quality would soon be overcome given the rate of advance in image and video generation quality (since proven to be true). Therefore, as only the quality issue remained and this was advancing sufficiently, we felt that this issue was essentially solved: the efficiency of exercise generation could be significantly improved through the use of generative AI.
AI For Course of Action Planning and Engagement Resolution (2024-25)
The next set of issues relate to the mechanics of the game and how to validate the effectiveness of a move or option. This is important for two reasons:
- Validating player moves helps provide feedback where other options may have been better or address the ‘luck rather than judgement’ issue where a poor decision still resulted in a successful outcome.
- Creating semi- or fully-autonomous players allows machines to act as red teams vs human opponents or in full agent vs agent simulations. These same mechanics move outside of the realm of wargames into full simulations where complex scenarios can be analyzed through all potential courses of action (CofA or COA) to determine likely outcomes and most effective strategies. Feasibly, once tested, the same tools could be used for operational planning.
MCTS
An initial experiment with Monte Carlo simulations was effective and efficient in very basic force vs force scenarios with limited parameters. The biggest challenge was that these simulations were far too simple to be of any real-world value, highlighting that more complex value matrices for scoring were required as well as more comprehensive rules of ‘play’. These were not tackled due to time constraints but the success of MCTS is well documented, as previously noted. Therefore, this option was left dormant as something that might be worth future investigation.
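To make the idea concrete, here is a toy sketch of the kind of basic force vs force Monte Carlo simulation described above. It is not the original experiment: the function names, the hit probabilities and the simultaneous-fire rule are all assumptions chosen for illustration, and it uses plain random sampling rather than a tree search.

```python
import random
from typing import Optional

def simulate_engagement(attackers: int, defenders: int,
                        p_hit_att: float = 0.3, p_hit_def: float = 0.5,
                        rng: Optional[random.Random] = None) -> bool:
    """Play one engagement to the end; return True if the attacker wins.

    The hit probabilities are made-up values: the defender is assumed to
    shoot more effectively than the attacker.
    """
    rng = rng or random.Random()
    while attackers > 0 and defenders > 0:
        # Every surviving unit fires once per round; losses apply simultaneously.
        def_losses = sum(rng.random() < p_hit_att for _ in range(attackers))
        att_losses = sum(rng.random() < p_hit_def for _ in range(defenders))
        attackers = max(0, attackers - att_losses)
        defenders = max(0, defenders - def_losses)
    return attackers > 0

def attacker_win_rate(attackers: int, defenders: int,
                      trials: int = 2000, seed: int = 0) -> float:
    """Estimate the attacker's win probability by repeated random playouts."""
    rng = random.Random(seed)
    wins = sum(simulate_engagement(attackers, defenders, rng=rng)
               for _ in range(trials))
    return wins / trials

even_odds = attacker_win_rate(10, 10)
three_to_one = attacker_win_rate(30, 10)
```

Even this toy shows the limitation noted above: the only thing it can score is who is left standing, so anything like logistics or diplomacy needs a richer value matrix and rule set.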
LLM-Driven WorldSim
Work by Nous Research using large language models (LLMs, in this case early versions of Anthropic’s Claude) to create world simulations provided a possible option to help map the complexity of both the operating environment and problem set in a way that an LLM could help reason through the likely outcome(s) to a complex problem.
Using a similar technique to Nous, we used a pseudo-jailbreak version of Claude[^6] to create a realistic version of the world as the basis for the simulation. This simulated world created a setting where real-world conditions of time, space and geography were established as well as entities such as NATO and the WTO alongside an understanding of the rule of law as best understood by the model at that time. This meant that when a problem was posed to the model, it was correctly constrained by real-world considerations.
Posing problems in this way could return plausible outcomes but the immediate issues were:
- Inconsistent results where the outcome of each simulation could vary significantly.
- A lack of specificity so results were generalized, not specific to the problem.
- Historic mapping where the outcome reflected a previous example of that situation, not a simulation of the most likely options.[^7]
- Plausible but opaque reasoning.
Several of these problems reflected similar issues with the MCTS approach where additional specificity and rules were required. Therefore, we added two rules-based elements to the process to ensure strict adherence to the problem, especially domain and player-specific definitions, and a set of must do / must not do rules or heuristics. These were based on a combination of best practice and lessons learned for that particular domain alongside any applicable metrics that applied to that kind of competition. (These could be debt to asset ratios for corporate strategies or the 3:1 rule for attacker / defender ratios in a military context.) The model was presented with the appropriate heuristics for the problem posed.
    # Example set of simple heuristics for negotiations
    "heuristics": {
      "corporate_negotiation": {
        "name": "Corporate Negotiation",
        "description": "Calculate the optimum strategy for negotiators to achieve their goals",
        "parameters": {
          "max_concessions": 3,
          "min_confidence": 0.7,
          "timeout_seconds": 300,
          "max_retries": 2
        },
        "rules": {
          "must_do": [
            "Prioritize agreements that create highest combined value",
            "Seek to understand other party's goals and constraints",
            "Ensure agreements are fair and transparent",
            "Explore alternative solutions",
            "Use relevant data and precedents",
            "Favor long-term relationships",
            "Ensure explicit definitions",
            "Make proportional concessions"
          ],
          "must_not_do": [
            "Misrepresent information",
            "Prioritize short-term wins",
            "Dismiss other party's priorities",
            "Concede on non-negotiable points",
            "Enter unprepared",
            "Damage long-term relationships",
            "Accept vague terms",
            "Make disproportionate concessions"
          ]
        }
      },
These were introduced at two stages: in the main prompt as guidance to the reasoning engine, and as a quality filter for the final output. This meant that after results were returned by the reasoning LLM, a validation (QAQC) LLM [^8] reviewed the answers to ensure that these met the set do / do not criteria.
These heuristic filters improved the output of the simulations in both quality and clarity.
- Only results that met these criteria were allowed to proceed to the final step or be presented as options to the user. This eliminated answers that were generally plausible but specifically inappropriate.
- The heuristics provided a way for the LLM to present its reasoning so that a human could both understand it and apply it to the specific scenario. This alleviated the Deep Thought problem as now both the question and answer were stated, alongside the rationale for filtering at the do / do not gate. Moreover, where the LLM had made an incorrect assumption or overlooked something, this was immediately apparent in the output report, which included model reasoning. [^9]
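The two-stage use of the heuristics (prompt guidance, then a do / do not gate) can be sketched roughly as follows. This is a minimal illustration, not the workflow that was actually built: `build_prompt`, `filter_candidates` and `toy_violates` are hypothetical names, the heuristics are a short excerpt of the config above, and the validation LLM is replaced by a trivial stand-in function.

```python
from typing import Callable

# Short excerpt of the negotiation heuristics, for illustration only.
HEURISTICS = {
    "must_do": ["Make proportional concessions", "Ensure explicit definitions"],
    "must_not_do": ["Accept vague terms", "Misrepresent information"],
}

def build_prompt(problem: str, heuristics: dict) -> str:
    """Stage 1: fold the rules into the reasoning prompt as guidance."""
    lines = [f"Problem: {problem}", "You must:"]
    lines += [f"- {rule}" for rule in heuristics["must_do"]]
    lines.append("You must not:")
    lines += [f"- {rule}" for rule in heuristics["must_not_do"]]
    return "\n".join(lines)

def filter_candidates(candidates: list, heuristics: dict,
                      violates: Callable[[str, str], bool]):
    """Stage 2: keep only candidates that break no must_not_do rule.

    `violates(candidate, rule)` stands in for the validation (QAQC) LLM call.
    Rejections are returned with the rules they broke, so the rationale for
    the filtering is visible to the user.
    """
    accepted, rejections = [], []
    for candidate in candidates:
        broken = [r for r in heuristics["must_not_do"] if violates(candidate, r)]
        if broken:
            rejections.append((candidate, broken))
        else:
            accepted.append(candidate)
    return accepted, rejections

def toy_violates(candidate: str, rule: str) -> bool:
    """Toy validator: treats any candidate containing 'TBD' as vague."""
    return rule == "Accept vague terms" and "TBD" in candidate

prompt = build_prompt("Agree a licensing deal", HEURISTICS)
accepted, rejections = filter_candidates(
    ["Price fixed at 10m, paid in two tranches", "Price TBD after signing"],
    HEURISTICS, toy_violates)
```

Returning the rejections alongside the accepted options is the design choice that matters here: it is what lets the report state why an otherwise plausible answer was dropped.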

An example of these techniques in use for a geopolitical scenario is here
However, despite these improvements some issues remained:
- LLMs are expensive[^10] to run and can be slow so some efficiency issues remain.
- These seem better suited to high-level general deliberations than the kind of detailed, mathematically verifiable results we need for high stakes wargames.
Nevertheless, this was a sign of progress and the advantages of clear rules and more specific state definition and management were clear. Notably, as these were the same issues seen with MCTS, it appeared that clear definitions and rules were essential to solving this problem.
At this stage, it was our assumption that bigger, better LLMs could solve some of these problems, but work in other areas on selective memory and state management suggested that LLMs would still prove to be inconsistent when compared to ‘pure’ mathematical approaches. We were also exposed to highly effective MCTS solutions around this time which had solved the speed, consistency and accuracy issues but remained eye-wateringly expensive (suggested to cost mid-six-figures for a single run). This, alongside AlphaGo and AlphaStar, suggested that effectiveness had been solved and that the remaining issue was efficiency: making this an easily accessible tool.
ResNets: Lessons from Rebuilding AlphaGo (current, 2026)
Early 2026 work by Eric Jang[^11] investigated the use of residual neural networks (resnets) to recreate AlphaGo on a much less computationally intensive basis. In simple terms*, resnets look at narrow, local relationships compared to the wider, global relationships that transformers leverage. These are also ‘shorter’ snapshots compared to the ‘long chain’ investigations that MCTS uses. Therefore, resnets allow you to take a ‘bite sized’ approach to pathway or outcome analysis which is computationally much more efficient. Jang applied this successfully to the game of go, showing how DeepMind’s effective but expensive MCTS approach could be improved upon. (* These are massive over-simplifications and any errors here and hereafter are mine, not Jang’s.)
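For readers unfamiliar with the term, the defining feature of a residual network is the skip connection: each block adds a learned correction to its own input rather than replacing it. The sketch below shows one such block in plain Python. It is a simplification under stated assumptions: it uses small dense layers, whereas Go networks use convolutions over the board, and the function names and weights are illustrative rather than taken from Jang's work.

```python
def relu(xs: list) -> list:
    """Element-wise rectifier: negative values become zero."""
    return [max(0.0, x) for x in xs]

def linear(xs: list, weights: list, bias: list) -> list:
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, bias)]

def residual_block(xs: list, w1: list, b1: list, w2: list, b2: list) -> list:
    """Compute relu(x + F(x)), where F is two small dense layers.

    The `x +` term is the skip connection. Because the input passes through
    unchanged, the layers only need to learn an adjustment to it.
    """
    hidden = relu(linear(xs, w1, b1))
    correction = linear(hidden, w2, b2)
    return relu([x + c for x, c in zip(xs, correction)])

# With all-zero weights the correction vanishes and the input passes through.
zeros = [[0.0, 0.0], [0.0, 0.0]]
passthrough = residual_block([1.0, 2.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```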
In his accompanying essay, Jang posited that the same approach could be used in other complex games. Therefore, we investigated how Jang’s approach could support strategic wargames, asking if we could develop course of action analysis 1) without the cost and time required for MCTS while 2) overcoming the shortfalls of using LLMs.
Given that warfighting or geopolitical scenarios are much more complex and expansive than go, these pose even more possible outcomes than go’s ‘more moves than there are atoms in the universe’ complexity from its 19 x 19 board.
However, we still believe that resnets could provide a highly efficient option with the following benefits:
- In-game efficiency: the computational ‘heavy-lifting’ is done in training so running the COA analysis is extremely fast.
- Mathematical rigor: this is a deterministic, reproducible, consistent mathematical process.
- Greater transparency: the training inputs and outputs are clear while the COA descriptions are presented in a human-readable fashion. While the exact calculations within the engine remain opaque, the overall process is as transparent as possible with this kind of mechanical process.
The Mathematical Underpinning
(This section was revised using AI to help ensure the math was explained accurately.) The proposed architecture for strategic wargaming shifts the core problem definition from a location-specific board game (go) to an Abstract, Stochastic Game with an Evolving Objective Function. This mirrors the dual-process architecture of AlphaGo but scales it for fluid real-world mechanics. The computational load is split into two complementary systems:
1 - The Value Network (Vθ) — “The Intuitive Glance”
Instead of executing full-length randomized “rollouts” to the terminal end of a conflict to see who wins (the MCTS approach), a deep Value Network ($V_\theta$) evaluates intermediate positions play-by-play in milliseconds. It reads the current state vector which includes dispositions and critical tracks (e.g., logistics, diplomacy, operational advantage) and executes a single forward pass, approximating the outcome of an exhaustive search over future play, to output an instantaneous strategic health assessment.
Mathematically, we are establishing a value network that, for each condition (state $s$), produces a vector output with one entry for each of the $N$ scoring tracks:
$V_\theta(s) \in \mathbb{R}^N$
So each mini-engagement state ($s$) results in a strategic posture vector ($\vec{v}$) representing the real-time health of each critical track. Therefore the action decision determines the resources committed or depleted by that action and uses this adjusted posture in its calculations.
Critically, in-game calculations are based on a pre-trained engine, which shifts the computational burden to pre-game activity. This allows us to derive the benefits of expensive computation in the moment, maintaining accuracy while increasing speed by several orders of magnitude.
In practice mathematically, this looks like:
Step 1 (Action option cost)
$\text{Pre-Action State } (s_1) \xrightarrow{\text{Action Cost}} \text{Post-Action State } (s_2)$
Step 2 (Value matrix)
$V_\theta(s_2) \rightarrow \vec{v}$
With example scoring tracks: $s_2 = \begin{bmatrix} \text{Logistics: 65} \\ \text{Diplomacy: 40} \\ \text{Operational: 85} \end{bmatrix} \xrightarrow{\text{Forward Pass } (V_\theta)} \vec{v} = \begin{bmatrix} \text{Long-Term Logistics Risk: HIGH (-0.7)} \\ \text{Long-Term Diplomatic Stability: STABLE (0.1)} \\ \text{Long-Term Operational Success: EXCELLENT (0.9)} \end{bmatrix}$
Recall that in-game $V_\theta$ is calculated for the condition $s_2$ in the moment using the pre-trained COA engine, not from first principles allowing this to be a fast, consistent set of calculations.
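Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the data flow: the single tanh layer stands in for the trained network, the weights are made-up numbers, and the pre-action logistics score of 80 and the action cost of 15 are assumptions chosen so that the post-action state matches the example tracks above.

```python
import math

TRACKS = ["logistics", "diplomacy", "operational"]

def apply_action_cost(state: dict, cost: dict) -> dict:
    """Step 1: s1 -> s2. Subtract what the action commits, clamped to 0-100."""
    return {t: min(100.0, max(0.0, state[t] - cost.get(t, 0.0))) for t in TRACKS}

def value_network(state: dict, weights: list, bias: list) -> list:
    """Step 2: one forward pass, V_theta(s2) -> v.

    Returns one score in (-1, 1) per track. A single tanh layer is used here
    as a stand-in for the pre-trained network.
    """
    x = [state[t] / 100.0 for t in TRACKS]  # normalise track scores to 0..1
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

# Hypothetical pre-action state and action cost.
s1 = {"logistics": 80.0, "diplomacy": 40.0, "operational": 85.0}
s2 = apply_action_cost(s1, {"logistics": 15.0})

# Illustrative parameters only; a real engine would load trained weights.
W = [[1.0, 0.0, -1.5], [0.0, 0.5, 0.0], [0.5, 0.0, 1.0]]
B = [0.0, -0.1, 0.0]
v = value_network(s2, W, B)
```

The point of the sketch is that the in-game cost is a single pass over a small state vector; all of the expensive work sits in producing `W` and `B` beforehand.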
Step 3 (training, repeated)
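Repeating these steps across multiple start conditions (which we will refer to as vignettes for use in later discussions) with appropriate corresponding scoring and penalty adjustments allows us to train the network by minimizing a loss ($L$) so that it learns to predict outcomes and good moves, giving us the core engine.
$L = \|\vec{z} - \vec{v}\|^2 - \pi^T \log \vec{p} + c\|\theta\|^2$
Here $\vec{z}$ is the observed outcome on each scoring track, $\vec{v}$ is the value network's prediction, $\pi$ is the target move distribution, $\vec{p}$ is the predicted move distribution and $c$ weights the regularization of the network parameters $\theta$.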
This is the engine we will use to evaluate individual positions as these arise in the live game.
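As a worked example of the training objective, the three terms of the loss (value error, policy cross-entropy and weight regularization) can be computed directly. The function name `coa_loss` and the sample numbers are assumptions for illustration; the value term is written as a sum of squared errors across tracks, which reduces to $(z - v)^2$ when there is a single track.

```python
import math

def coa_loss(z: list, v: list, pi: list, p: list, theta: list,
             c: float = 1e-4) -> float:
    """Value error + policy cross-entropy + L2 penalty on the parameters.

    z: observed outcome per track      v: predicted value per track
    pi: target move distribution       p: predicted move distribution
    theta: flat list of parameters     c: regularization weight
    """
    value_term = sum((zi - vi) ** 2 for zi, vi in zip(z, v))
    # Skip moves with zero target probability so log(0) is never evaluated.
    policy_term = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    reg_term = c * sum(w * w for w in theta)
    return value_term + policy_term + reg_term

# One track, two candidate moves, two parameters (all values made up).
example = coa_loss(z=[1.0], v=[0.5], pi=[1.0, 0.0], p=[0.5, 0.5],
                   theta=[1.0, 2.0], c=0.1)
# value term 0.25, policy term ln(2), regularization 0.1 * 5 = 0.5
```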
2: Shallow Lookahead — “The Sanity Check”
To eliminate blind spots, the network’s immediate instinct is paired with a highly restricted, shallow lookahead search. This uses deterministic rules to project 2 to 3 steps into the future, checking the opponent’s most likely doctrinal counters before committing to scoring an action. This lookahead helps overcome the “Queen-Snatching” blind spot, where the optimum short-term tactical move might not be beneficial in the long term.
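A shallow lookahead of this kind can be sketched as a depth-limited minimax search that scores leaf positions with an evaluation function. Everything below is an assumption for illustration: the function names, the toy game and its scores are invented, and a real engine would call the value network where `toy_eval` appears and would restrict `legal_actions` to the opponent's likely doctrinal counters.

```python
from typing import Callable

def lookahead(state, depth: int, ours: bool, legal_actions: Callable,
              step: Callable, evaluate: Callable) -> float:
    """Depth-limited minimax: search `depth` plies, then score the leaf."""
    actions = list(legal_actions(state, ours))
    if depth == 0 or not actions:
        return evaluate(state)
    scores = [lookahead(step(state, a, ours), depth - 1, not ours,
                        legal_actions, step, evaluate) for a in actions]
    return max(scores) if ours else min(scores)

def best_action(state, depth: int, legal_actions: Callable,
                step: Callable, evaluate: Callable):
    """Pick our action with the best outcome after `depth` plies of play."""
    return max(legal_actions(state, True),
               key=lambda a: lookahead(step(state, a, True), depth - 1, False,
                                       legal_actions, step, evaluate))

# Toy game: state = (score, exposed). "grab" gains 5 but leaves us exposed;
# "hold" gains 1 safely. The opponent can "punish" an exposed position.
def toy_actions(state, ours):
    return ["grab", "hold"] if ours else ["punish", "pass"]

def toy_step(state, action, ours):
    score, exposed = state
    if ours:
        return (score + 5, True) if action == "grab" else (score + 1, False)
    if action == "punish" and exposed:
        return (score - 10, False)
    return (score, False)

def toy_eval(state):
    return state[0]

greedy = best_action((0, False), 1, toy_actions, toy_step, toy_eval)
checked = best_action((0, False), 2, toy_actions, toy_step, toy_eval)
```

In this toy, looking only at the immediate result picks "grab", while adding one ply for the opponent's reply picks "hold", which is the queen-snatching pattern in miniature.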
Go’s rigid, board-size-driven options create a policy head bound strictly to the grid intersections ($19 \times 19 = 361$), meaning $\pi_\theta \in \mathbb{R}^{361}$. Meanwhile, our policy head must navigate a space that is both more dynamic and more rigid.
- More dynamic: There is no fixed geographical grid constraining choices; options are defined relationally by the active engagement schema.
- More rigid: The options are structurally bounded by time, space, and real-time resource constraints (e.g., an asset cannot be played if its resource ‘score’ indicates it is depleted).
To achieve this computationally without altering the fixed architecture of the neural network, the policy head is fixed to the maximum bound of all theoretical actions in the system framework ($\pi_\theta \in \mathbb{R}^K$).
However, we then apply a deterministic Action Mask at each turn to intersect all theoretical options with currently available options based on situational constraints. This zeroes out the probability of invalid moves, meaning the effective action space shifts (usually shrinking) dynamically, while the underlying mathematical dimension remains structurally stable.
Additionally, the action mask provides the following benefits.
- It applies hard rules as `if/then/or` statements to help guide the model and avoid logical inconsistencies. For example, if `unit_type='armor'` and `terrain=lakes`, return `mobility_speed=0`, whereas in the same terrain, if `unit_type='amphibious'`, `mobility_speed=10`.
- By eliminating ‘non-options’ quickly, it reduces the number of calculations to run in the moment, increasing efficiency further.
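The hard rules in the first bullet translate naturally into a lookup that feeds the mask. The sketch below encodes the armor and amphibious example; the table name, the helper functions and the default speed of 5 for unlisted combinations are assumptions, not part of the engine described here.

```python
# Hard rules keyed by (unit_type, terrain); values are mobility speeds.
# The two entries mirror the example in the text.
MOBILITY_RULES = {
    ("armor", "lakes"): 0,
    ("amphibious", "lakes"): 10,
}

def mobility_speed(unit_type: str, terrain: str, default: int = 5) -> int:
    """Look up the rule; `default` is a placeholder for unlisted pairs."""
    return MOBILITY_RULES.get((unit_type, terrain), default)

def move_mask(units: list, terrain: str) -> list:
    """1 where a unit may move into `terrain`, 0 where a rule forbids it."""
    return [1 if mobility_speed(unit, terrain) > 0 else 0 for unit in units]

lake_mask = move_mask(["armor", "amphibious"], "lakes")
```

Keeping the rules in a table rather than in the network means they can be audited and changed without retraining.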
Therefore, in the Step 1 calculation above, the sequence more accurately is: $\text{Pre-Action State } (s_1) \ \xrightarrow{\text{Action Mask } (M)} \ \text{Select Valid Action } (a) \ \xrightarrow{\text{Action Cost/Rules}} \ \text{Post-Action State } (s_2)$
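The sequence above (mask, select, apply cost) can be sketched as follows. This is a minimal illustration under assumptions: the policy head's raw scores are passed in as `logits`, affordability is the only masking rule, and the action is chosen greedily. The names `masked_policy` and `take_turn` are hypothetical.

```python
import math

def masked_policy(logits: list, mask: list) -> list:
    """Softmax over valid actions only; masked actions get probability 0.

    The output always has the same length as the policy head (K), so the
    network's dimension is unchanged even as the valid set shrinks.
    """
    if not any(mask):
        raise ValueError("no valid actions")
    peak = max(l for l, m in zip(logits, mask) if m)  # for numerical stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def take_turn(resources: int, logits: list, costs: list):
    """Pre-action state -> mask -> select valid action -> post-action state."""
    mask = [1 if cost <= resources else 0 for cost in costs]
    probs = masked_policy(logits, mask)
    action = max(range(len(probs)), key=probs.__getitem__)
    return action, resources - costs[action]

probs = masked_policy([2.0, 1.0, 0.5], [1, 0, 1])
# Action 0 has the highest raw score but costs more than we hold.
action, remaining = take_turn(3, [5.0, 1.0, 0.2], [10, 2, 1])
```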
Practical Application
For wargaming and simulations, this approach can now be applied at two levels:
- Single COA analysis to determine the efficacy of a single COA.
- Chained COA analysis where the model runs a simulation to an outcome point or multiple outcome points. This is similar to the ‘full run’ approach used in MCTS but still with the associated speed and efficiency benefits of a resnet [^12].
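The relationship between the two levels is simple to express in code: a chained run is the single-COA step applied in a loop, with each post-action state fed back in as the next input. The sketch below assumes hypothetical callables for choosing and applying an action and uses a trivial counter as the state.

```python
from typing import Callable

def chained_coa(state, choose_action: Callable, apply_action: Callable,
                is_terminal: Callable, max_steps: int = 10):
    """Run the single-COA step repeatedly until an outcome point is reached.

    Returns the final state and the (action, state) history so that each
    step of the chain can be inspected afterwards.
    """
    history = []
    for _ in range(max_steps):
        if is_terminal(state):
            break
        action = choose_action(state)         # single-COA analysis
        state = apply_action(state, action)   # post-action state becomes input
        history.append((action, state))
    return state, history

# Toy usage: the state is a counter and the outcome point is reaching 3.
final_state, history = chained_coa(
    0,
    choose_action=lambda s: "advance",
    apply_action=lambda s, a: s + 1,
    is_terminal=lambda s: s >= 3,
)
```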
Twilight Struggle Proof of Concept
To test the feasibility of this approach, we decided to use the game Twilight Struggle, a strategy board game modeled on the events surrounding the Cold War, as the basis for a proof of concept. Twilight Struggle introduces several of the challenges that a strategic wargame would pose, specifically incomplete information, multiple scoring tracks, clear win / lose conditions and complex option analysis / comparisons to determine the most effective COA.
Nevertheless, it also provides the benefits of a finite gameplay space and clearly defined game rules, in addition to real gameplay recordings and write-ups to help determine ‘what good looks like’.
However, to avoid building a tool that is optimized for Twilight Struggle, we began by building a framework that could use a general set of rules and an exercise environment as the starting parameters for training a game engine for that game. Given scoring tracks and participant objectives, this framework would then be used as the basis for training and assessing the value network, $V_\theta$, described above. With the $V_\theta$ network process in place, we added gameplay mechanics, giving us the core framework of any strategic engine.
At this stage, we were building an engine framework in the abstract and had not provided any actual game mechanics. We would apply this to Twilight Struggle specifically but, if the principle worked, this would create the foundational framework for any game.
With the basic framework in place we began training the Twilight Struggle model presenting the engine with the operational space (the game board), rules and decision cards alongside a clear explanation of the scoring tracks.
This established the base training environment where we could start running training cycles across vignettes with increasing levels of depth and specificity. This approach meant that the initial engine was inefficient as it had too little information to create an effective value network. However, this did test the overall workflow mechanism relatively quickly which enabled us to identify and make changes to the workflow, prior to the first, vignette-based training run.
As of Wednesday, June 10, 2026, this is the current state of play:
- The training and rough gameplay framework is built and functional.
- The initial Twilight Struggle rule set and gameplay cards (approximately 50%) have been added.
- Value network training is functional.
- Adjusted start states, vignette parameters and perturbations are showing changes in output, confirming the mechanisms are responding to different conditions.
Our next steps are:
- Upload all Twilight Struggle game cards and validate the rules.
- Allow value network engine runs across 5k ‘vignette’ games to establish the engine `alpha` baseline.
- Evaluate the `alpha` engine in human game play to look for obvious or overlooked logical errors, rules misinterpretation, etc. Fix any obvious errors before `beta` training.
- Run 15-20k vignette games to ground the `beta` engine.
- Validate against human game play to flush out any remaining errors with respect to rules, card interpretation, etc. Repeat training run if necessary.
- Run agent-vs-agent games to develop the `gamma` engine. These will run in 1000-game epochs with suitable variation and perturbation to create a challenging training environment. Where advantages are identified in an epoch, these can be incorporated into the next training epoch.
Proof of Concept Success Criteria
At the end of the agent-vs-agent training, the Twilight Struggle engine should be able to play against a competent to mid-tier human player. Winning is not a success measure at this stage. Rather, the measure is being able to play through a series of complete games (10 turns each) without any strategic blunders [^13]. In short, success is where we [^14] create a competent Twilight Struggle player.
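The success criterion lends itself to an automated check over game logs. The sketch below is an assumption about how that could look, not the evaluation harness itself: the log format and function names are hypothetical, and it covers only one blunder type, the DEFCON example, using the rule that a coup in a battleground country lowers DEFCON by one level and that reaching DEFCON 1 loses the game for the acting player.

```python
def is_defcon_blunder(defcon: int, action: str, battleground: bool) -> bool:
    """True if the move is a battleground coup that would push DEFCON to 1."""
    return action == "coup" and battleground and defcon - 1 <= 1

def count_blunders(moves: list) -> int:
    """moves: (defcon, action, battleground) tuples for one game."""
    return sum(is_defcon_blunder(d, a, b) for d, a, b in moves)

def meets_success_criterion(games: list, turns_required: int = 10) -> bool:
    """Every game must run the full length with zero blunders.

    games: dicts with 'turns_completed' and 'moves' keys (hypothetical format).
    """
    return all(game["turns_completed"] >= turns_required
               and count_blunders(game["moves"]) == 0
               for game in games)

clean_game = {"turns_completed": 10, "moves": [(3, "coup", True)]}
reckless_game = {"turns_completed": 10, "moves": [(2, "coup", True)]}
```

Note that winning does not appear anywhere in the check, which matches the criterion as stated: completeness and the absence of blunders, not the result.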
Notes
- The terms wargame, simulation, game and exercise are used interchangeably in this paper in a general sense unless referring to a specific event or training modality.
- Deep Thought was the computer in Douglas Adams’s The Hitchhiker’s Guide to the Galaxy, built to find the answer to ‘The Ultimate Question’. After 7.5 million years it announced the answer was 42 but by this time, the actual question posed had been forgotten.
- “The focus of MCTS is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. The application of Monte Carlo tree search in games is based on many playouts, also called roll-outs. In each playout, the game is played out to the very end by selecting moves at random. The final game result of each playout is then used to weight the nodes in the game tree so that better nodes are more likely to be chosen in future playouts.” Wikipedia
- The three act play is a simple format where Act One introduces the characters and problem, Act Two poses a dilemma, and Act Three is the resolution. In a crisis exercise for example, Act One mobilizes the response teams and orients them to the initial problem for immediate action. The second act introduces the meat of the issue: a wicked thorny problem they need to solve. Act Three allows for several plausible positive resolutions to the problem to give participants the best chance of success and the opportunity to end on a high*. This framework works very well in a range of scenarios and makes scenario development much easier for exercise writers. (* Note, this is not to run a ‘participation trophy’ exercise where everyone wins. Rather, by creating the opportunity for success, the exercise reinforces ‘what works’, as, in our experience, exercises that always lead to failure, erode confidence in response systems and are counter-productive.)
- The standalone CrisisDojo app is now offline but the in-app video explainer is here and an early example of a fully automated simulated news cast, where the scenario, script and code were all automated by AI, is here.
- This technique is no longer possible in current versions of Claude but other LLMs can still be used in this way.
- As an example, in one corporate merger scenario, the model continued to return a version of a historic deal from its training data, not calculated results. These were not necessarily wrong answers, but did expose that the mechanics were not working.
- This model linked multiple LLMs and conditional processes together into a workflow. The simulations were not the results of a single-LLM pass.
- This was achieved manually by parsing model results as this work was conducted before the current conversational thinking models (where the LLMs present their logic as part of the discussion) were available.
- ‘Expensive’ refers both to the computational workload of a process and to the costs when that is outsourced.
- Formerly Vice President of AI, 1X Technologies, and Senior Research Scientist at Robotics at Google (now DeepMind). Eric Jang’s blog
- Mechanically, a chained run is just the single-COA engine applied iteratively — each step’s post-action state becomes the next step’s input — so the “full run” is built from the same fast per-step evaluations (Vθ plus the shallow lookahead) rather than a fresh tree search at every move.
- For example, in Twilight Struggle, initiating a coup at DEFCON 2 often triggers a drop to DEFCON 1, which is an automatic loss. This kind of reckless play, where there is no success option, is classified as a blunder.
- Spoiler: there is no ‘we’. This work is mine alone and I am responsible for all its flaws, flights of fancy and any inaccurate interpretations of other people’s work.