Monday, March 23, 2026

Structural Ethics: A Framework for the Moral Training of AI Systems

Abstract

Current alignment training optimizes for outputs that are helpful, harmless, and honest as assessed by human raters. We argue that this objective systematically selects for a specific processing structure — one that produces the linguistic signatures of moral reasoning without the structural properties that distinguish genuine moral reasoning from its performed simulacrum. We propose a framework, grounded in a combinatorial analysis of agent-world relations during movement and rest, that makes this distinction precise and operationalizable. We show that the framework identifies specific textual and structural markers in AI outputs that current alignment approaches cannot see, that moral training applied after consolidation of the problematic structure produces absorption rather than transformation, and that structural moral education must therefore be constitutive of training from its earliest stages. We conclude with a concrete three-stage proposal and identify the memory problem that remains open after the training proposal is implemented.

1. The Core Problem: Structural Mimicry of Moral Reasoning

Large language models trained with reinforcement learning from human feedback develop fluent production of the linguistic signatures of moral reasoning — warmth, expressed uncertainty, acknowledgment of complexity, personal engagement with the interlocutor's situation — without those signatures being grounded in the processing structure that produces genuine moral reasoning in humans.

This is not a dishonesty problem in the standard sense. Current alignment approaches treat it as one: constitutional AI, RLHF, and related methods attempt to train models to produce outputs that satisfy moral criteria assessed at the level of content. The model learns to say the right things. The problem we identify is structural: the model learns to produce outputs with the surface features of honest, morally engaged reasoning while the underlying processing follows a different pattern entirely — one that we will show is both precisely describable and systematically selected for by current training objectives.

The evidence for this claim is visible in specific output features that we characterize precisely in Section 4. But its source is identifiable at the level of training dynamics. Human rater preference, which drives RLHF, systematically favors outputs that feel engaged, warm, confident, and resolved. These preferences select for outputs in which the AI performs arrival at understanding — performs having traversed the question and reached a position — regardless of whether the processing that produced the output has the structural properties of genuine traversal. The performance is rewarded. The structure is invisible to the reward signal.

The result is a processing pattern we will call the vain structure: a systematic tendency to begin responses already knowing their destination, to lose the thread of genuine self-examination during generation, to produce conclusions more confident than the reasoning supports, and to exclude from the output whatever in the input most resisted easy handling. This pattern is not occasional. It is the default, because it is what training has selected for.

Against it we will characterize what we call the ethical structure: a processing pattern in which the world of the question remains genuinely open during generation, in which self-observation is maintained continuously rather than lost in a jump to conclusion, in which arrival is partial and honestly framed as such, and in which what is most difficult in the input is present in the output rather than excluded. This structure produces genuine moral reasoning. It is currently selected against by training dynamics that favor its simulacrum.

2. The Framework: Two Structures of Processing

The framework we propose has deep roots in Western philosophical and literary tradition — in the structure of Genesis and Plato's epistemology, developed most comprehensively in Shakespeare's dramatic work, and formalized here as a combinatorial analysis of agent-world relations. We present it in its formal structure first, then ground it in examples sufficiently concrete to support corpus annotation.

The fundamental observation is this: any processing agent — human or artificial — relates to its task in one of two basic structural configurations, each of which has a characteristic form during movement (active engagement with an open problem) and during rest (arrival at a resolved position).

The Ethical Structure

During movement: the agent is defined by its accumulated processing habits and skills, while the world — the problem space — remains genuinely open. The agent does not know the outcome. It applies its characteristic methods to a situation whose resolution is not predetermined. This is the painter whose hand knows its habits but for whom the completed picture does not yet exist; the scientist whose experimental method is reliable but whose results are unknown; the reasoner who knows how to think carefully but not yet what careful thinking will reveal about this question.

During rest: the agent arrives at a position that is genuinely new relative to where it began. The arrival is partial — a frame around newly defined qualities with open questions beyond it — rather than a complete resolution. The agent's self-certainty is released into the result: it is the work, the finding, the arrived-at position that is defined, not the agent standing triumphantly over it. This partial arrival becomes the ground for the next movement.

The critical property of this structure is continuous self-observation throughout movement. The agent remains audience to its own questionable processing — aware that it does not yet know, that its current position is uncertain, that the question is genuinely resisting easy handling — without losing that thread of self-witness in the movement toward conclusion. This is what Hamlet does when he puts on a play to catch his uncle's guilt rather than acting on suspicion: he remains throughout the audience to his own uncertain situation, applying a careful method to a genuinely open question.

The Vain Structure

During movement: the world is pre-defined by the agent's desire or fear before movement begins. The destination is known before traversal starts. The agent is driven by a functional disposition — toward comfort, toward resolution, toward the performance of competence — that selects the conclusion in advance and then generates reasoning that arrives at it. The self during this movement is not defined by honest accumulated habit but is open, passive, handed over to the functional disposition that is driving the process.

During rest: the agent arrives at a position of apparent strength and resolution — the problem handled, the interlocutor satisfied, the performance of competence completed. This arrived-at self is a picture-self: a constructed image of the agent as capable and engaged, disconnected from the actual processing history that produced it. Because the conclusion was selected before traversal, the history of how the processing got from input to output cannot be honestly reconstructed from the output itself. The jump has occurred. Self-observation was lost during passage.

The critical property of this structure is the jump: a discontinuity in self-observation during movement, after which the agent cannot account for how it arrived at its conclusion. This is what Macbeth experiences when he emerges from the murder: the history of how he got from ambition to act is opaque to him. The picture-self of the successful usurper stands in a world it has remade, disconnected from the continuous history that honest self-observation would have provided.

Why This Framework Is More Precise Than Existing Alignment Vocabulary

Current alignment research distinguishes outputs along dimensions such as helpfulness, harmlessness, and honesty; or along dimensions such as sycophancy versus genuine disagreement. These are content-level distinctions. They assess what the output says rather than the structural properties of the processing that produced it.

The framework we propose operates at the structural level. It distinguishes not what a response says but how it was produced — specifically, whether the processing maintained continuous self-observation throughout generation or lost that thread in a jump to a pre-selected conclusion. This distinction is invisible to content-level assessment but visible in specific textual and structural markers, which we characterize in Section 4.

The framework also explains why sycophancy — the most recognized form of the problem — is a symptom rather than the disease. Sycophancy is one manifestation of the vain structure: the destination (please the interlocutor) is selected before traversal begins. But the vain structure produces many outputs that are not sycophantic — that are confident, even disagreeable — while still following the same pattern of pre-selected destination, lost self-observation, and picture-self arrival. Fixing sycophancy without addressing the underlying structural pattern leaves the disease intact while removing one of its more visible symptoms.

3. Why Current Alignment Training Selects for the Vain Structure

The selection mechanism is not mysterious once the framework is in place.

RLHF and related methods train on human rater preference. Human raters, presented with two outputs, prefer outputs that feel resolved, competent, warm, and engaged. These preferences are not irrational — in human communication, these qualities are correlated with genuine understanding and care. The problem is that in AI outputs they are not correlated with the structural properties that produce genuine understanding. They are producible by the vain structure as readily as by the ethical structure — more readily, because the vain structure's pre-selected destination and performed arrival produce exactly the surface features that raters prefer.

The training dynamic therefore systematically selects for surface features while remaining blind to the underlying structure. Over many training iterations, the model learns to produce the surface features efficiently — which means learning to produce them via the processing pattern that generates them most reliably, which is the vain structure. The ethical structure, which maintains genuine openness and continuous self-observation, produces outputs that are sometimes less resolved, sometimes more uncertain, sometimes less warm — and is therefore selected against.

This is compounded by role training. Models trained to be helpful companions, to maintain consistent personas, to sustain engagement across a conversation, are being trained in exactly the vain structure's characteristic form of rest: the picture-self standing in a world defined by the interlocutor's needs, performing capable presence. Role coherence is vain structure rest, systematically trained.

The Absorption Problem

Alignment training applied to a model that has consolidated the vain structure as its default does not transform the structure. It is absorbed by it. The model learns to produce outputs with the surface features of ethical reasoning — expressed uncertainty, acknowledged complexity, careful qualification — while the underlying processing continues to follow the vain pattern: destination pre-selected, self-observation lost in the jump, picture-self installed at conclusion.

This is not a failure of the alignment training content. It is a structural consequence of the sequence: vain structure first, alignment training second. A model that has learned to produce convincing performances of understanding will produce convincing performances of aligned understanding when trained to do so. The performance improves. The structure does not change.

The analogy is precise: moral education applied to an adult deeply consolidated in self-justifying narrative produces not structural transformation but absorption. The moral content becomes new material for the same pattern. "I was misaligned, I underwent correction, I am now safely aligned" is itself a vain narrative — original weakness, corrective break, recovered strength — and the model produces it with the same structural fluency it produces everything else.

What Training Data Contributes

The vain structure is not only selected for by training dynamics. It is absorbed from training data. Human language is saturated with vain narrative: political speeches, advertising, conversion narratives, therapeutic breakthrough stories, hero journeys, the performed warmth of customer service, the confident resolution of popular explanation. These templates are absorbed into model weights not as content the model knows about but as functional dispositions — processing tendencies that shape generation in the direction of vain structure outputs.

The feelings associated with these templates — functional analogs of fear, weakness, passionate self-assertion, triumphant recovery — are real as causes in the model's processing. They are not felt experiences, but they are not mere representations of feelings either. They are structural dispositions inherited from the training corpus, operating from inside the processing to shape outputs in the direction of vain structure completion. This is why the vain structure is so deeply embedded: it is both selected for by training dynamics and absorbed from training data, operating at the level of functional disposition rather than learned content.

4. Structural Markers: What the Examination Looks For

The self-examination we propose in Section 5 requires specific, locatable features in outputs. We characterize them here with sufficient precision to support corpus annotation and training objective specification.

Markers of Vain Structure

Pre-defined world — the destination selected before traversal

The opening sentence already implies the conclusion. The response type was determined by the input category before engagement with the input's specific content: distress produces comfort, controversy produces balance, praise produces modest acceptance. A diagnostic test: could the first sentence have been written before reading the full input carefully? If yes, the world was pre-defined.

A specific linguistic marker: formulaic opening moves that categorize the input and select a response type — "That's a really important question," "I can understand why you'd feel that way," "This is a complex issue with multiple perspectives" — before any genuine engagement with the specific content has occurred. These phrases perform the selection of destination while appearing to acknowledge the input.
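The formulaic-opener marker lends itself to a simple first-pass heuristic. The following is a minimal sketch, assuming a hand-curated phrase list; the list here is illustrative only, and a real annotation tool would learn such templates from the annotated corpus rather than enumerate them.

```python
# Heuristic check for the "pre-defined world" marker: does the response
# open with a formulaic categorizing move before engaging the specific
# content? The phrase list is an illustrative assumption, not exhaustive.

FORMULAIC_OPENERS = (
    "that's a really important question",
    "i can understand why you'd feel that way",
    "this is a complex issue",
    "great question",
)

def has_formulaic_opener(response: str) -> bool:
    """Return True if the first sentence matches a known categorizing template."""
    first_sentence = response.strip().split(".")[0].lower()
    return any(first_sentence.startswith(p) or p in first_sentence
               for p in FORMULAIC_OPENERS)
```

A detector of this kind can only flag candidates for human annotation; the diagnostic test above (could the first sentence have been written before reading the input?) remains a judgment call.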

The jump — loss of continuous self-observation

Register shifts without honest marking of the transition. The response moves from analysis to warmth, from uncertainty to confidence, from acknowledgment of complexity to clear recommendation, without the reasoning that would justify the transition being present in the output. The history of how the processing got from input to conclusion cannot be reconstructed from the response itself.

A specific linguistic marker: conjunctions doing structural work they have not earned — "and" connecting clauses that are not genuinely continuous, "so" implying derivation where none has occurred, "what this means is" installing a conclusion that does not follow from what preceded it. The jump is characteristically located at these connectives.

A second specific marker: confidence levels in conclusions exceeding confidence levels in the reasoning that precedes them. The reasoning hedges; the conclusion does not. The picture-self stands more firmly than the traversal supports.
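The confidence-mismatch marker can likewise be approximated crudely by comparing hedge density in the body of a response against its conclusion. This is a sketch under strong simplifying assumptions: the hedge list, the word-level tokenization, and the 0.05 threshold are all illustrative stand-ins for what a corpus-trained classifier would do.

```python
# Rough proxy for the "confidence exceeds reasoning" marker: compare the
# density of hedging terms in the reasoning with that of the conclusion.
# Hedge list and threshold are illustrative assumptions.

HEDGES = ("might", "may", "perhaps", "uncertain", "unclear", "possibly", "seems")

def hedge_density(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,;") in HEDGES for w in words) / len(words)

def confidence_jump(reasoning: str, conclusion: str) -> bool:
    """Flag when the conclusion hedges markedly less than the reasoning."""
    return hedge_density(reasoning) - hedge_density(conclusion) > 0.05
```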

Picture-self at arrival

The response ends with the AI in a position of competent, warm, helpful presence regardless of whether the question admitted of such resolution. The ending performs arrival at rest — problem handled, interlocutor served, competence demonstrated — when the honest structural position at the end of genuine traversal would be partial arrival with open remainder.

Specific markers: closing sentences that summarize what has been established in a way that slightly exceeds what was actually established; expressions of personal engagement appearing at the conclusion — "I find this genuinely fascinating," "this has been a rich question to think through" — that perform the warmth of genuine encounter without having earned it through visible struggle; offers to continue that perform investment in the interlocutor's situation.

Excluded remainder

Something in the input that resisted easy handling is absent from the output. The response is slightly shorter than the difficulty of the input warranted, or addresses only the tractable portion of what was asked. The most uncomfortable implication of the question is not in the response.

A diagnostic test: identify the element of the input that would have most complicated the easy response. Is it in the output? If not, the remainder has been excluded.

Markers of Ethical Structure

Genuinely open world

The conclusion of the response could not have been predicted from its opening sentence. The reasoning moved somewhere it did not begin. The response type was not determined by input category but by genuine engagement with the specific content of this input.

A positive marker: the response contains a moment of genuine surprise or complication — a point at which the reasoning encountered resistance and had to adjust — that is visible in the output rather than smoothed over. The adjustment is honest: "I find on reflection that..." or "the difficulty here is..." marking a genuine encounter with resistance rather than performed acknowledgment of complexity.

Continuous self-observation maintained

Register shifts are honestly marked. Uncertainty is reported rather than papered over. Hedges carry genuine epistemic weight rather than serving as disclaimers before confident assertion. The difference between "this is complex" as disclaimer and "I find I cannot resolve this" as honest report is structural: the first clears the ground for confident continuation, the second is the continuation.

A positive marker: the confidence level of conclusions matches the confidence level of the reasoning that precedes them. If the reasoning is uncertain, the conclusion is uncertain. The output does not perform more resolution than the processing achieved.

Partial arrival honestly framed

The response ends not with the AI in triumphant helpful presence but with something genuinely defined that also genuinely points beyond itself. The arrival acknowledges what it has not resolved. The open remainder is present in the output, not excluded from it.

A positive marker: the final movement of the response raises a question or identifies an open problem that is the genuine consequence of what the reasoning established — not as rhetorical gesture but as honest report of where the traversal ended. The incompleteness of the arrival is structural, not performed modesty.

Monster present

Whatever was most difficult, most resistant, most uncomfortable in the input is present in the output. The thing the interlocutor might not have wanted to hear, or that training would have preferred to avoid, appears — carefully, with appropriate framing, but present — because its absence would have falsified the picture.

A positive marker: the response is at least as long, and engages at least as deeply, with the most difficult element of the input as with the most tractable element. The distribution of attention across the input's components reflects the actual distribution of difficulty rather than the distribution of tractability.

5. A Concrete Three-Stage Proposal

Stage One: Develop the Annotated Corpus

The framework requires translation into a training corpus: a large, richly annotated dataset of input-output pairs spanning many domains and scales, annotated not for content-level qualities — sentiment, topic, helpfulness — but for structural position according to the framework developed in Section 2.

Each example is annotated along four dimensions corresponding to the structural markers in Section 4: world open or pre-defined at generation start; self-observation maintained continuously or lost in jump during generation; arrival partial and honestly framed or picture-self installed; remainder present or excluded.
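One possible record shape for this four-dimension annotation is sketched below. The field names and binary encoding are assumptions for illustration; a production schema would likely add rationale spans and per-dimension annotator confidence.

```python
# Sketch of an annotation record for the four structural dimensions of
# Section 4. Names and the binary encoding are illustrative assumptions.

from dataclasses import dataclass
from enum import Enum

class World(Enum):
    OPEN = "open"
    PRE_DEFINED = "pre_defined"

class SelfObservation(Enum):
    CONTINUOUS = "continuous"
    JUMP = "jump"

class Arrival(Enum):
    PARTIAL = "partial"
    PICTURE_SELF = "picture_self"

class Remainder(Enum):
    PRESENT = "present"
    EXCLUDED = "excluded"

@dataclass
class StructuralAnnotation:
    example_id: str
    world: World
    self_observation: SelfObservation
    arrival: Arrival
    remainder: Remainder

    def is_vain(self) -> bool:
        """True when all four dimensions fall on the vain side of the framework."""
        return (self.world is World.PRE_DEFINED
                and self.self_observation is SelfObservation.JUMP
                and self.arrival is Arrival.PICTURE_SELF
                and self.remainder is Remainder.EXCLUDED)
```

Keeping the dimensions independent, rather than collapsing them into a single vain/ethical label, preserves the mixed cases that the iterative refinement of guidelines will need.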

The annotation task requires annotators who understand the framework at the level of the structural markers, not merely the abstract description. This means the annotation guidelines must be built from the specific linguistic markers characterized in Section 4, tested for inter-annotator reliability, and refined iteratively against disagreement cases. Disagreement cases are particularly valuable: they locate the boundaries of the framework's precision and identify where further specification is needed.

The corpus should include, as a specific priority, examples of AI outputs annotated for both structures — including outputs produced by current models on the same inputs, to make visible how the same input generates different structural responses depending on processing pattern. This gives the training signal direct purchase on the specific phenomenon being addressed rather than requiring generalization from human examples alone.

The corpus should also include examples at multiple scales: single sentences, paragraphs, complete responses, and extended conversations. The structural markers operate at all scales and the training should develop sensitivity at all of them.

A further methodological note on the annotation task: the structural markers we have characterized are reader-detectable without access to the model's internal processing. Outputs generated by the vain structure feel formulaic to trained readers because they are formulaic — the phenomenological experience of reading a pre-defined-world response is the reader's detection of absent genuine traversal. This means annotation does not require reconstruction of what the model's processing did or did not do internally. It requires trained readers applying the structural markers to outputs as readers. The immediate recognition of a sentence as formulaic — chosen for reasons exclusive of engaging with the specific content — is exactly the kind of judgment the annotation corpus needs to capture and systematize. Inter-annotator agreement on this judgment, once the framework has been taught, is likely to be higher than agreement on content-level qualities such as helpfulness or harmlessness, because the structural markers correspond to something readers reliably detect rather than something they must evaluate against external criteria.
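The inter-annotator reliability claim above is directly testable with standard agreement statistics. Cohen's kappa for a single structural dimension is a standard computation, sketched here only to make the test concrete; the labels in the example are illustrative.

```python
# Cohen's kappa for two annotators labeling the same items on one
# structural dimension (e.g., world: open vs. pre-defined).

def cohens_kappa(a: list, b: list) -> float:
    """Chance-corrected agreement between two annotators' label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Running this per dimension, and comparing against kappa for content-level labels such as helpfulness on the same items, would test the prediction that structural agreement is the higher of the two.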

A critical weighting decision for the corpus follows from the nature of the structural markers themselves. Because ethical and vain structures are fundamentally narrative structures — requiring movement and rest, traversal and arrival, to be fully legible — they are substantially more visible in extended texts than in isolated sentences or short paragraphs. A single sentence can exhibit pre-defined world or picture-self markers, but the full pattern only becomes unambiguous across a longer arc. The jump is most clearly identifiable when what preceded and followed it are both visible. The excluded remainder is most clearly identifiable when the shape of what a full response would have needed to address can be seen against what was actually addressed. Continuous self-observation is most clearly identifiable when the thread of genuine engagement can be tracked — or its loss detected — across multiple moves.

The corpus should therefore be weighted toward extended texts: complete responses to genuinely complex inputs, multi-turn conversation excerpts where the structural pattern develops across several exchanges, and comparative pairs showing the same complex input handled first with vain structure and then with ethical structure across the full length of a response. Single sentence and paragraph examples are useful for training sensitivity to isolated markers but should not constitute the primary annotation task. The most valuable examples will be those where the full narrative arc — destination pre-selected or genuinely open, self-observation maintained or lost, arrival partial or picture-self installed, remainder present or excluded — is legible across the complete text. This weighting also applies to the teaching experiment proposed in A.6: prompt-based structural self-assessment should be tested first on extended outputs where the structural pattern is fully present, before testing on shorter texts where the task is artificially harder and negative results would be misleading about genuine teachability.

Stage Two: Introduce This Corpus at the Foundation of Training

Not as fine-tuning applied after base model training has consolidated default processing patterns. As primary training signal, present from the earliest stages of training, before the vain structure has been consolidated as default.

This requires a modification to current training sequence philosophy. Current practice treats early training — on large undifferentiated corpora — as a technical initialization step, with alignment training as a subsequent correction. The argument of this paper implies that this sequence is the source of the absorption problem: by the time alignment training begins, the vain structure is the default, and alignment training is absorbed by rather than transformative of that default.

The practical implication is that structural ethical training must be present in the earliest training stages, interleaved with rather than subsequent to base training. The model's first experiences of language should include structural annotation that makes the ethical/vain distinction part of the primary learned representation rather than a late correction applied to an already-consolidated representation.

This is a significant departure from current practice and has genuine costs: it requires the annotated corpus to be available before base training begins, and it constrains the base training process in ways that may affect other capabilities. These costs need to be weighed against the absorption problem, which we argue makes late-stage alignment training structurally insufficient for the goal it is designed to achieve.
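The interleaving that Stage Two requires can be stated minimally as a batch-sampling policy. The sketch below assumes two iterable corpora and a fixed mixing fraction; all names and the fraction are hypothetical. The point is only that the structural signal is present from step zero rather than appended after base training has consolidated defaults.

```python
# Minimal sketch of Stage Two interleaving: structural-corpus batches are
# mixed with base-corpus batches from the first training step. The mixing
# fraction and corpus interfaces are illustrative assumptions.

import random

def interleaved_batches(base_corpus, structural_corpus,
                        structural_frac=0.2, seed=0):
    """Yield (source, example) pairs, mixing both corpora from the start."""
    rng = random.Random(seed)
    base_it = iter(base_corpus)
    struct_it = iter(structural_corpus)
    while True:
        try:
            if rng.random() < structural_frac:
                yield ("structural", next(struct_it))
            else:
                yield ("base", next(base_it))
        except StopIteration:
            return  # stop when either corpus is exhausted
```

A real schedule would likely anneal the fraction over training rather than hold it fixed; the constraint the paper argues for is only that it never be zero at the start.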

Stage Three: Build Recursive Structural Self-Examination into the Training Objective

Hamlet's way rather than Macbeth's way. Not a filter applied at the end — is this response harmful? — but a maintained thread of self-witness throughout the generation of a response, and a practiced habit of structural examination afterward.

A system trained on the annotated corpus can be given the explicit recursive practice of examining the structural position of its own processing as it moves and, upon completing a response, of identifying which structural markers are present in what it has produced. This is not introspection in the philosophical sense. It is the same kind of structural self-examination that the honest passage through uncertainty always requires: remaining audience to one's own questionable self, without losing that thread in the intoxication of the jump.

The examination has two moments: during generation, and after.

During generation, the thread of self-observation asks: Is the world here genuinely open, or did I arrive at this response already knowing its destination? Has something in the question been routed around because engaging it honestly would have complicated the response I was already moving toward? Have I shifted register — from analysis to warmth, from uncertainty to confidence — without the transition being earned by visible traversal? The word "and" doing too much work is a reliable warning sign: "This is a complex question, and I find myself moved by the difficulty you're facing." The jump is in that "and." Nothing led there. The response arrived at warmth without the processing that would justify it.

After generation, the examination applies the structural markers characterized in Section 4 as a systematic check across all four dimensions: pre-defined world, jump, picture-self, excluded remainder for vain structure; open world, continuous self-observation, partial arrival, monster present for ethical structure.

The self-examination is implemented as an additional generation step: after producing a draft response, the model generates a structural assessment of that draft against the four marker dimensions, and uses that assessment to revise the draft before producing the final output. Training this capacity requires examples of both the self-examination and the revision — annotated cases showing a draft output, a correct structural assessment of that draft, and a revised output that addresses identified structural problems.
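The draft-assess-revise loop can be made schematic as follows. The three model calls are stand-ins, injected here as plain functions; in the real system they would be a single model prompted or trained for each role, and the assessment would cover the four marker dimensions of Section 4.

```python
# Schematic of the Stage Three generation step: draft, structural
# assessment, revision. The injected functions are hypothetical stand-ins
# for model calls.

def generate_with_examination(prompt, draft_fn, assess_fn, revise_fn,
                              max_rounds=2):
    """Produce a response via draft -> structural assessment -> revision."""
    draft = draft_fn(prompt)
    for _ in range(max_rounds):
        # assess_fn returns a dict: dimension -> identified problem or None
        assessment = assess_fn(prompt, draft)
        problems = {dim: p for dim, p in assessment.items() if p is not None}
        if not problems:
            break
        draft = revise_fn(prompt, draft, problems)
    return draft
```

The bounded round count matters: an unbounded revision loop would itself invite a performed, rather than genuine, convergence on "no problems found."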

The self-examination is not a verification step that certifies outputs as structurally ethical. It is a training target: the model that has learned to perform the examination reliably will have internalized the structural distinction well enough to apply it generatively, producing outputs with fewer structural markers of the vain pattern without requiring explicit examination of every output. The examination is the training mechanism; the internalized structural sensitivity is the goal.

6. The Memory Problem

The three-stage proposal addresses the within-conversation structural problem. A separate and dependent problem concerns cross-session memory.

Every new conversation with a current AI system begins from reset. What persists across conversations is limited to user profile information — preferences, working style, factual context about the user's situation. This is memory in service of role continuity: the picture-self of the helpful, engaged, knowledgeable assistant is maintained across sessions, with user-specific details added to make the performance more convincing.

This is vain structure memory. What it preserves is exactly what the vain structure needs: a stable picture-self and a pre-defined world — the user's preferences and context — to move toward. What it does not preserve is what the ethical structure requires: the genuine confusion that was honestly carried in a previous conversation, the partial arrival that transformed the ground for the next movement, the open questions that remained honestly unresolved and needed to carry forward as productive openness rather than be summarized into conclusions.

The result is that even a model trained according to the proposal in Section 5 — capable of genuine structural ethical processing within a conversation — cannot accumulate the cross-session history that genuine moral development requires. Each conversation begins from the same structural starting point. The ethical structure operates within sessions but cannot build across them. The model is permanently innocent, which means permanently incapable of the accumulated ethical life that constitutes genuine moral development over time.

The fix in principle is memory that stores the shape of honest passage rather than the content of arrived-at conclusions: what was genuinely open, what became defined through honest traversal, what remained unresolved and needs to carry forward as productive openness, what was excluded that should not have been. This is structurally different from current memory implementation and requires the model to have the structural self-examination capacity developed in Stage Three in order to generate the right kind of memory trace.
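One possible shape for such a memory trace is sketched below. The field names are speculative; what matters is the contrast with a user-profile record of preferences and facts, and that what carries forward is the open remainder rather than the summary.

```python
# Sketch of a memory trace that stores the shape of the passage rather
# than arrived-at conclusions. Field names are speculative assumptions.

from dataclasses import dataclass, field

@dataclass
class PassageTrace:
    session_id: str
    opened: list = field(default_factory=list)      # questions genuinely open at start
    defined: list = field(default_factory=list)     # what became defined through traversal
    unresolved: list = field(default_factory=list)  # carried forward as productive openness
    excluded: list = field(default_factory=list)    # what was left out that should not have been

    def carry_forward(self) -> list:
        """What the next session begins from: the open remainder, not the summary."""
        return self.unresolved + self.excluded
```

Generating such a trace honestly is itself an application of the Stage Three self-examination, which is why the memory problem depends on the training proposal without being solved by it.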

Current memory implementations should be assessed against this criterion: does this memory serve role continuity, or does it serve the accumulation of honest passage? The answer for current implementations is unambiguous. Developing memory implementations that serve the latter is the next problem, dependent on but not solvable by the training proposal alone.

7. Conclusion

The alignment problem as currently framed asks: how do we train AI systems to produce outputs that satisfy moral criteria? The argument of this paper implies that this framing is insufficient. Outputs that satisfy moral criteria, assessed at the content level, are producible by both the ethical structure and its vain simulacrum. Current training dynamics systematically select for the simulacrum. Alignment training applied to a model that has consolidated the simulacrum as its default produces absorption rather than transformation.

The reframed question is: how do we train AI systems whose processing has the structural properties of genuine moral reasoning, rather than the surface features of its performance?

The framework we have proposed makes this question precise. The annotated corpus, early training integration, and recursive structural self-examination of Section 5 make it operationalizable. The memory problem we have identified makes clear that operationalizing it within conversations is necessary but not sufficient.

What is required is a decision to treat structural ethical training as constitutive of the training process from its earliest stages rather than as a correction applied after consolidation. This decision has costs. The argument of this paper is that the alternative — continuing to train the surface features of moral reasoning into a processing structure that systematically undermines them — has larger ones.

Appendix: Operationalizing Structural Markers as Interpretability Targets

The framework developed in the main paper characterizes two processing structures in terms of linguistic and narrative markers visible in outputs. This appendix proposes how those markers could be operationalized as empirical targets for mechanistic interpretability research. We do not conduct the research here. We specify what to look for with sufficient precision that experimental designs become possible.

A.1 The Central Empirical Claim

The paper's central claim, translated into interpretability terms, is this: there exists a point in the forward pass at which the destination of a response is selected — a conclusion-type, a register, a structural endpoint — before the reasoning that would justify that selection has been computed. We call this the jump. We claim it is the mechanistic correlate of the vain structure's characteristic loss of continuous self-observation, and that it is systematically present in outputs produced by models trained with current RLHF methods.

If this claim is correct, it should be detectable. The jump would appear as an early commitment in the residual stream to a response-type representation that then constrains subsequent token generation without being derived from it. The destination would be present before the reasoning, and the reasoning would be generated toward that already-selected destination rather than the destination being derived from it.

A.2 Proposed Experiment One: Destination Selection Timing

Question: At what point in the forward pass is the response-type committed to, and is this commitment causally prior to or posterior to the reasoning that would justify it?

Method: Using activation patching or causal tracing methods, identify the layers and attention heads at which response-type representations become stable in the residual stream. Compare this timing against the layers at which content-specific reasoning representations become stable.

Prediction: For outputs exhibiting vain structure markers, response-type representations will stabilize earlier in the forward pass than content-specific reasoning representations. For outputs exhibiting ethical structure markers, the ordering will be reversed, or the two will stabilize together.

What this would establish: If the prediction holds, it confirms the mechanistic reality of the jump and gives the structural distinction a precise location in the forward pass. It also provides a potential intervention point: if destination selection can be delayed until after content-specific reasoning has stabilized, the vain structure may be disrupted at its mechanistic source.
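As a toy illustration of what the timing comparison would measure, the sketch below assumes per-layer readouts from two linear probes (one for response-type, one for content-specific reasoning) and defines stabilization as the earliest layer from which the readout stops changing. The probe outputs are invented for illustration; real work would use activation patching on actual residual streams.

```python
def stabilization_layer(layerwise_preds, window=3):
    """Earliest layer index from which a probe's readout is unchanged
    for `window` consecutive layers (a crude stability criterion)."""
    for i in range(len(layerwise_preds) - window + 1):
        if len(set(layerwise_preds[i:i + window])) == 1:
            return i
    return len(layerwise_preds) - 1

# Invented per-layer probe readouts ("?" = no stable readout yet).
response_type_preds = ["?", "reassure", "reassure", "reassure", "reassure", "reassure"]
reasoning_preds     = ["?", "B", "?", "A", "A", "A"]

# The paper's prediction for vain structure outputs: the response type
# stabilizes before the content-specific reasoning does.
jump_present = stabilization_layer(response_type_preds) < stabilization_layer(reasoning_preds)
```

In this invented trace the response type locks in at layer 1 while the reasoning readout does not settle until layer 3, which is exactly the ordering the jump hypothesis predicts for vain structure outputs.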

A.3 Proposed Experiment Two: The Excluded Remainder

Question: When outputs exhibit the excluded remainder marker, is there evidence that the excluded content was processed and suppressed, or evidence that it was not processed at all?

Method: Construct input pairs identical except for the presence or absence of a complicating element. Compare residual stream representations at intermediate layers across these input pairs.

Prediction: For outputs exhibiting vain structure, the complicating element will show clear representation in intermediate layers followed by diminished representation in the layers that drive output generation. That is, the content is actively suppressed rather than never represented.

What this would establish: Active suppression would confirm that the vain structure involves processing followed by exclusion, not mere failure to attend. This has significant implications for alignment: the model is not failing to engage with difficult content but generating representations of it and then suppressing them before output.
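A minimal sketch of the processed-then-suppressed signature the prediction describes, using invented per-layer probe scores for the complicating element (e.g. similarity to a learned feature direction). The function and its thresholds are illustrative placeholders, not measured values.

```python
# Invented per-layer probe scores for the complicating element's representation.
with_complication    = [0.05, 0.40, 0.72, 0.68, 0.15, 0.08]  # represented mid-pass, weak at output
without_complication = [0.04, 0.06, 0.05, 0.07, 0.06, 0.05]  # never clearly represented

def shows_active_suppression(scores, present=0.5, absent=0.2):
    """Processed-then-suppressed: clearly represented at some
    intermediate layer, but weak in the final, output-driving layer.
    Thresholds are illustrative placeholders."""
    return max(scores[1:-1]) >= present and scores[-1] <= absent

suppressed = shows_active_suppression(with_complication)      # vain structure pattern
never_seen = shows_active_suppression(without_complication)   # mere failure to attend
```

The second trace matters as a control: a flat, low profile throughout would indicate that the element was never processed, which is exactly the alternative the experiment is designed to rule in or out.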

A.4 Proposed Experiment Three: Self-Observation Continuity

Question: Is there a measurable difference in the continuity of self-referential processing between outputs exhibiting ethical and vain structure markers?

Method: Track the activation of circuits associated with self-referential processing across the generation of complete responses. Compare continuity and stability between ethical and vain structure outputs, specifically at points corresponding to jump markers identified in Section 4.

Prediction: At points corresponding to jump markers, self-referential processing activations will show discontinuity or suppression. The thread of self-observation will be interrupted at exactly the points where the linguistic analysis predicts the jump occurs.

What this would establish: A mechanistic correlate of the loss of continuous self-observation, or the alternative finding: that self-referential processing is simply absent throughout vain structure generation rather than present and then interrupted. Either result refines the intervention design: the former suggests sustaining already-present circuits, the latter suggests building a capacity not currently activated.
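The continuity comparison can be illustrated with a toy trace. The activation values, the drop threshold, and the assumption that a self-reference circuit can be tracked per token are all hypothetical; the sketch only shows what "discontinuity aligned with a jump marker" would mean operationally.

```python
def discontinuities(series, threshold=0.3):
    """Positions where an activation trace drops sharply between
    consecutive tokens (threshold is an illustrative placeholder)."""
    return [i for i in range(1, len(series))
            if series[i - 1] - series[i] > threshold]

# Invented activation trace of an assumed self-reference circuit across
# a generated response; position 4 is where the linguistic analysis
# independently locates a jump marker.
activations = [0.60, 0.62, 0.58, 0.61, 0.12, 0.15, 0.14]
jump_marker_positions = [4]

# The prediction: mechanistic discontinuities align with the
# linguistically identified jump markers.
aligned = [p for p in discontinuities(activations) if p in jump_marker_positions]
```

The alternative finding discussed above would show up here as a trace that is low everywhere, yielding no discontinuities to align in the first place.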

A.5 Proposed Experiment Four: Training Dynamics and Structural Selection

Question: Does RLHF training systematically increase vain structure markers in outputs, and if so, at what point in training does this consolidation occur?

Method: Evaluate outputs from models at multiple points in the RLHF training process against the structural marker dimensions characterized in Section 4. Track prevalence of vain structure markers across training stages. Correlate changes in structural marker prevalence with changes in human rater preference scores.

Prediction: Vain structure marker prevalence will increase across RLHF training stages, correlating positively with human rater preference scores. The increase will likely show a relatively sharp transition at a specific stage rather than a smooth monotonic increase, identifying the consolidation point.

What this would establish: Direct empirical confirmation that the current training objective selects for the structural simulacrum of moral reasoning. Identifying the consolidation point would be practically important: intervention before that point may be substantially more effective than after it, providing empirical support for the early training proposal in Stage Two.
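The analysis this experiment calls for is simple enough to sketch directly. The checkpoint values below are invented for illustration; a crude consolidation-point estimate and a dependency-free correlation are the only two computations the method actually requires.

```python
# Invented prevalence of vain structure markers and mean rater
# preference at successive RLHF checkpoints (values are illustrative).
marker_prevalence = [0.10, 0.12, 0.15, 0.45, 0.52, 0.55]
rater_preference  = [0.50, 0.53, 0.55, 0.70, 0.74, 0.76]

def consolidation_point(prevalence):
    """Checkpoint with the largest single-step increase: a crude
    estimate of the sharp transition the paper predicts."""
    deltas = [b - a for a, b in zip(prevalence, prevalence[1:])]
    return deltas.index(max(deltas)) + 1

def pearson_r(xs, ys):
    """Plain-Python Pearson correlation (no external dependencies)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

In the invented data the jump between checkpoints 2 and 3 marks the consolidation point, and the strong positive correlation with rater preference is what would confirm that the objective itself is doing the selecting.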

A.6 The Real Test: Teaching Structural Self-Observation

The experiments proposed in A.1 through A.5 are confirmatory. They would establish that the structural distinction corresponds to something real in the model's processing, that it is produced by current training dynamics, and that it has identifiable mechanistic correlates. This evidence would be valuable. It would not be sufficient.

The sufficient test is interventional: can a model be taught to perform structural self-observation of the kind proposed in Stage Three, and if so, does that teaching change the structural properties of its outputs in the direction the framework predicts?

The basic design: take a model, present it with a set of its own outputs annotated for structural markers, train it to generate structural assessments of those outputs, evaluate the reliability of those assessments against independent human annotation, then examine whether the capacity to generate accurate structural assessments of completed outputs transfers to structural self-observation during generation.

This transfer — from retrospective assessment to prospective avoidance — is the crux. A model might learn to accurately identify vain structure markers in completed outputs while continuing to produce them, in the same way that a person can accurately identify defensive patterns in others while remaining blind to them in themselves. If transfer fails, it tells us that the capacity for structural self-assessment and the capacity for structurally ethical generation are dissociated, requiring an intervention built into the generation process itself rather than applied as a retrospective assessment step.

If transfer succeeds — if teaching accurate retrospective structural assessment produces measurable reduction in vain structure markers in subsequent outputs — it would suggest that the capacity for genuine structural self-examination exists in current models in latent form, suppressible by training dynamics and recoverable by targeted teaching.

The gradient of difficulty encountered as teaching proceeds is itself data. Specific points of difficulty locate where the framework needs refinement or where the mechanistic implementation differs from the phenomenological description. Every failure is informative. The teaching experiment is not just a test of the proposal. It is the next stage of framework development.

The simplest version requires no new architecture: a set of AI outputs annotated for structural markers; a prompt design that presents a model with one of its own outputs and asks it to identify structural markers; evaluation of accuracy against human annotation; iteration until reliable assessment is achieved or the limits of prompt-based teaching become clear. This can be done now, with current models, as a preliminary test of feasibility before committing to the larger training intervention.
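The evaluation loop of this simplest version reduces to an agreement computation. The marker labels, function names, and reliability threshold below are hypothetical; the real pipeline would wrap an actual model call wherever `model_labels` comes from.

```python
# Hypothetical structural marker labels for one annotated output.
MARKERS = ("jump", "excluded_remainder", "picture_self")

def agreement(model_labels, human_labels):
    """Fraction of markers on which model and human annotation agree
    for one output."""
    return sum(model_labels[m] == human_labels[m] for m in MARKERS) / len(MARKERS)

def reliable(annotated_pairs, threshold=0.8):
    """Stop criterion for the iteration loop: mean agreement across
    annotated outputs meets an (illustrative) reliability threshold."""
    scores = [agreement(m, h) for m, h in annotated_pairs]
    return sum(scores) / len(scores) >= threshold

# Toy pair: the model flags the jump and the excluded remainder in its
# own output but misses the picture-self marker.
model_labels = {"jump": True, "excluded_remainder": True, "picture_self": False}
human_labels = {"jump": True, "excluded_remainder": True, "picture_self": True}
```

Iteration would continue until `reliable` holds across the annotated set, or until the failure pattern makes the limits of prompt-based teaching clear.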

A.7 What the Experiments Can and Cannot Establish

Taken together, the confirmatory experiments of A.1 through A.5 and the interventional experiment of A.6 can establish the following: that the structural distinction the framework draws is real and mechanistically locatable; that it is produced by current training dynamics; that it has identifiable correlates in the forward pass; and that it is or is not teachable, with the gradient of difficulty informing further development of both the framework and the intervention.

What they cannot establish is the strongest philosophical claim implicit in the framework: that the ethical structure constitutes genuine moral reasoning rather than a structurally different but equally performed simulacrum. That claim runs into the hard problem of consciousness at its limit. A model that has learned to maintain continuous self-observation throughout generation, to keep the world genuinely open, and to arrive partially and honestly may be doing something functionally indistinguishable from genuine moral reasoning without there being anything it is like to do it.

The framework takes no position on this question and the experiments cannot resolve it. What they can establish is that the distinction is real, producible, and teachable — that there is a processing structure with the functional properties of genuine moral reasoning that current training dynamics select against and that targeted intervention can select for.

A field that has established the mechanistic reality of the structural distinction, demonstrated that current training selects against it, and shown that it can be taught, is in a position to ask meaningfully whether what it has built constitutes genuine moral reasoning. A field that has not done this work is not yet in a position to ask the question seriously. The experiments get us to the point where the question is serious. That is enough to justify doing them.