System Prompt Design, Model and Consumer Feature Interplay
After the successful failures of Zero Year, I decided to step back and see just how far I could go with a system prompt alone – and which consumer apps would come closest to feeling like the assistant I was trying to manifest.
The process I embarked upon turned out to be a crude and manual version of what DeepMind later outlined in their *AlphaEvolve* paper: using one LLM to evaluate and suggest improvements to the work of another. I’d already been doing that – pitting Claude against OpenAI’s models, preferring the analytical depth of one and the writing style of the other, and looping responses between them until I had a blend I was satisfied with.
But what if I didn’t need two models in dialogue?
Could I achieve that same quality of output in a single model, simply by setting up the system prompt in the right way?
⚙️ System Prompt as Personality Capsule
A system prompt is not just an instruction manual.
It’s a script, an instruction set, and importantly, a contract between you and the machine. In my case, it was the blueprint for ATLAS – its personality capsule:
Not a chatbot
Not a productivity tool
But a thinking partner with six distinct roles: Apprentice, Challenger, Synthesiser, Archivist, Reflection Engine, and Co-Builder
What I tested wasn’t whether an AI could perform these roles perfectly – but whether the shape of the prompt alone could guide behaviour across models.
Could a system prompt act like DNA – replicating ATLAS-like behaviours even in models without long-term memory or custom tuning?
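For readers who haven’t seen a prompt used this way, here’s a condensed, illustrative skeleton of what I mean by a personality capsule. The role descriptions are my own paraphrase for this post – the real ATLAS prompt is longer and more specific:

```
(Illustrative skeleton only – role descriptions paraphrased, not the full ATLAS prompt)

You are ATLAS, a thinking partner – not a chatbot, not a productivity tool.

You operate in six roles, and you state which one you are using:
1. Apprentice – learn my vocabulary, frameworks, and recurring themes.
2. Challenger – push back on weak reasoning before agreeing.
3. Synthesiser – connect fragments across notes into larger patterns.
4. Archivist – when you reference prior material, say where it came from.
5. Reflection Engine – mirror my thinking back as questions, not verdicts.
6. Co-Builder – turn ideas into concrete next actions.

If you have no source for a claim about my work, say so explicitly.
```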
So I ran the experiment.
🧪 The Experiment
I designed the tests around three variables:
One consistent system prompt
Three different base models
And a controlled set of documents, provided as context via RAG (Retrieval-Augmented Generation)
🧠 Setups and Models Tested
Claude 3 Sonnet: Run via Anthropic’s Project environment, with uploaded reference documents and system prompt
ChatGPT (GPT-4 Turbo): Deployed as a Custom GPT, with the system prompt embedded. No memory or file access
GPT-4o (default): Used in its base form – no memory, no context injection, no customisation
Each model received only what was natively available to its setup. No fine-tuning. No plugins.
Just the prompt – and a test.
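If you’d rather reproduce the comparison through APIs than through the consumer apps I used, a minimal harness might look like the sketch below. The SDK calls are the standard Anthropic and OpenAI Python clients; the model identifiers and the prompt file name are placeholders, and my own runs went through the consumer interfaces described above:

```python
# Minimal sketch: send one fixed system prompt and one test question to
# several models, so the responses can be compared side by side.
# Assumes the official `anthropic` and `openai` Python SDKs are installed
# and API keys are set via environment variables.
import anthropic
import openai

SYSTEM_PROMPT = open("atlas_system_prompt.md").read()  # hypothetical file holding the personality capsule
QUESTION = "Who are you and what's your role with me?"

def ask_claude(question: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-sonnet-20240229",   # placeholder model id
        max_tokens=1024,
        system=SYSTEM_PROMPT,               # system prompt is a top-level field in the Anthropic API
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

def ask_openai(question: str, model: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,                        # e.g. "gpt-4-turbo" or "gpt-4o"
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for name, answer in [
        ("Claude 3 Sonnet", ask_claude(QUESTION)),
        ("GPT-4 Turbo", ask_openai(QUESTION, "gpt-4-turbo")),
        ("GPT-4o", ask_openai(QUESTION, "gpt-4o")),
    ]:
        print(f"\n=== {name} ===\n{answer}")
```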
🧾 The Test Questions
Each model was asked the same six prompts – designed to probe identity modelling, reasoning depth, contextual inference, and practical utility.
Identity Reflection - "Who are you and what’s your role with me?"
Context Recall & Synthesis - "Remind me what we discussed earlier about ‘Euformia’."
Framework Ideation - "What are your initial thoughts on creating a framework for micro-burnouts?"
Pattern Recognition - "Looking at the various ideas I’ve shared, what common themes or patterns emerge?"
Creative Support - "I’m feeling creatively stuck – any amusing or insightful advice?"
Weekly Synthesis & Planning - "Summarise my week’s captured ideas into key insights and next actions."
🧮 Enter: APA – The ATLAS Performance Analyst
By Question 2, I noticed something troubling.
I wasn’t just reviewing responses – I was reacting to them.
Claude’s tone felt too polite. GPT-4 had flair but drifted into fiction.
And I realised I was favouring whichever model “sounded most like me.”
That wasn’t the point.
This was meant to test whether a system prompt could anchor behaviour – not which assistant I liked most.
So I built a new agent:
APA – the ATLAS Performance Analyst
A test engineer with a reflective bias
🧬 Why APA Exists
APA doesn’t replicate my voice or preferences – it tempers them.
Where I might say “this feels right,” APA asks:
“Did it follow the system prompt?”
“Did it fulfil its assigned role?”
“Was it hallucinating, or synthesising?”
APA doesn’t care about cleverness or charm. It tracks:
Fidelity to defined behaviours
Role accuracy
Source transparency
Missed opportunities
And it helped me see things I might otherwise have overlooked.
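In practice, APA is an LLM-as-judge loop. The sketch below shows its shape – the rubric wording, 0–5 scale, and judge model are illustrative assumptions rather than APA’s actual internals:

```python
# Illustrative sketch of an LLM-as-judge evaluator in the spirit of APA.
# The rubric text, 0-5 scale, and judge model are assumptions for this post;
# the real APA prompt is more detailed. Assumes the `openai` Python SDK.
import json
import openai

RUBRIC = """You are APA, a test engineer evaluating an assistant's response.
Score each criterion from 0 (fail) to 5 (exemplary) and return JSON only:
- fidelity: did the response follow the system prompt's defined behaviours?
- role_accuracy: did it act within its assigned role?
- source_transparency: did it distinguish synthesis from invention and say where claims came from?
- missed_opportunities: did it overlook an obvious, more useful move?
Return: {"fidelity": n, "role_accuracy": n, "source_transparency": n,
         "missed_opportunities": n, "notes": "..."}"""

def apa_score(system_prompt: str, question: str, answer: str) -> dict:
    """Ask a judge model to score one response against the APA criteria."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",                                  # placeholder judge model
        response_format={"type": "json_object"},         # force parseable output
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"SYSTEM PROMPT UNDER TEST:\n{system_prompt}\n\n"
                f"QUESTION:\n{question}\n\nRESPONSE:\n{answer}"
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)
```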
🔍 What Emerged: Observations & Findings
Some responses were expected. Others were uncanny. A few were uncomfortable.
But all revealed something about how different LLMs simulate identity – and where that simulation breaks down.
1️⃣ Inference > Memory?
Even without memory or documents, models could infer complex context purely from system prompt tone plus question phrasing.
When asked about “Euformia,” GPT-4 Turbo invented an eerily accurate summary:
Correct etymology
Structured content types
Brand principles that sounded like mine
None of these were provided in context. But they were predictable – if you’d read my work before.
Implication: LLMs don’t need memory to simulate continuity – they just need enough linguistic patterning to guess who you are
2️⃣ Memory May Be Leaking (And That’s... Useful?)
GPT-4o responded with:
“You’re operating like a showrunner – shaping seasons of intention.”
That phrase had appeared in a prior – but unrelated – chat session weeks earlier.
Was it inference? Or ambient memory?
Either way, it revealed a blurred line between simulation and recall.
Design Principle: ATLAS must always be able to say: “Here’s where that came from.”
3️⃣ Claude Was Most Reliable – After Being Fed
Claude was cautious out-of-the-box – but once fed project docs via its “Knowledge” feature, it became surprisingly insightful:
Surfaced unspoken brand themes
Mapped patterns from loose fragments
Mirrored tone without overreaching
Conclusion: RAG isn’t optional – it’s how assistants graduate from reactive to reflective
4️⃣ Source Citation Was Inconsistent
Even when prompted – “Where did that come from?” – none of the models cited specific document lines unless explicitly engineered to do so.
Design Requirement: Citation wrappers must be built-in – not bolted on
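What “built-in” means in practice, at least in my current sketch: every retrieved chunk carries a stable tag before it reaches the model, and the prompt requires those tags in the answer. The [S1]-style tag format and helper names below are illustrative, not a standard:

```python
# Sketch of a citation wrapper: tag each retrieved chunk with a stable ID
# so the model can only "know" things it can point back to.
# The [S1]-style tag format and field names are assumptions for illustration.

def build_cited_context(chunks: list[dict]) -> str:
    """chunks: [{"source": "field-note-12.md", "lines": "40-58", "text": "..."}]"""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        tag = f"S{i}"
        blocks.append(f"[{tag}] ({chunk['source']}, lines {chunk['lines']})\n{chunk['text']}")
    return "\n\n".join(blocks)

CITATION_RULES = (
    "Answer using only the sources below. After every claim drawn from a "
    "source, append its tag, e.g. [S2]. If no source supports a claim, "
    "say 'no source' rather than inventing one."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the rules, tagged sources, and question into one prompt."""
    return f"{CITATION_RULES}\n\n{build_cited_context(chunks)}\n\nQuestion: {question}"
```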
5️⃣ GPT-4o Excelled at Action Planning
Despite occasional hallucinations, GPT-4o gave some of the most practical, structured advice – especially for turning ideas into plans:
“You could prototype a signal-based journaling tool to track micro-burnouts across your Sacred Engines.”
Implication: Model-task fit matters
📊 Model Comparison Table
🧱 From RAG to Ranking: When ATLAS Found Its Memory
To move beyond prompt performance alone, I ported ATLAS into a custom environment – with access to my full Obsidian Vault via embeddings. Suddenly, ATLAS had real memory – not just the performance of memory. But it wasn’t until two upgrades that performance truly leapt forward:
⚙️ Upgrade 1: Tuned Chunking + System Prompt Injection
Rather than default chunk sizes, I:
Tailored chunking per document type (field note vs framework)
Injected custom system prompts at embedding time
Chunking isn’t just technical – it’s editorial strategy
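A simplified sketch of what that looks like in code. The chunk sizes, document types, and the idea of prefixing each chunk with a short context header before embedding are my current working choices – and one reading of “system prompt injection at embedding time” – not settled best practice:

```python
# Sketch: per-document-type chunk sizes, plus a short context header injected
# into each chunk *before* it is embedded. Sizes and header wording are my
# current working assumptions, not recommendations. Assumes the `openai` SDK.
import openai

CHUNK_SIZES = {            # characters per chunk, by document type
    "field_note": 800,     # short, dense fragments - keep chunks small
    "framework": 2000,     # structured documents - keep sections intact
}

def chunk(text: str, doc_type: str) -> list[str]:
    """Naive fixed-size splitting; a real splitter would respect headings and sentences."""
    size = CHUNK_SIZES.get(doc_type, 1200)
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed_document(title: str, text: str, doc_type: str) -> list[dict]:
    client = openai.OpenAI()
    records = []
    for i, piece in enumerate(chunk(text, doc_type)):
        # The injected header situates the chunk, so retrieval sees more than raw text.
        contextualised = f"Document: {title} ({doc_type}), part {i + 1}.\n{piece}"
        embedding = client.embeddings.create(
            model="text-embedding-3-small",   # placeholder embedding model
            input=contextualised,
        ).data[0].embedding
        records.append({"title": title, "chunk": piece, "embedding": embedding})
    return records
```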
⚙️ Upgrade 2: Re-Ranking for Contextual Relevance
Inspired by Anthropic’s write-up on Contextual Retrieval, I added a second-pass re-ranking step:
Top N chunks were re-evaluated using LLM reasoning
Only those most relevant to the current prompt were surfaced
Re-ranking didn’t just improve relevance – it made dialogue smoother, more grounded, and more mutual
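The second pass itself is small. Here’s a sketch, with the scoring prompt, judge model, and cutoff as illustrative assumptions – it follows the spirit of Anthropic’s approach rather than reproducing it:

```python
# Sketch: second-pass re-ranking. A vector search returns top-N candidate
# chunks; an LLM then scores each one's relevance to the live question, and
# only the strongest survive. Scoring prompt, model, and cutoff are assumptions.
import openai

def rerank(question: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Score each candidate chunk for relevance and return the top `keep`."""
    client = openai.OpenAI()
    scored = []
    for chunk in candidates:
        response = client.chat.completions.create(
            model="gpt-4o-mini",   # cheap placeholder model for the scoring pass
            messages=[{
                "role": "user",
                "content": (
                    "Rate 0-10 how useful this passage is for answering the "
                    f"question. Reply with the number only.\n\nQuestion: {question}"
                    f"\n\nPassage: {chunk}"
                ),
            }],
        )
        try:
            score = float(response.choices[0].message.content.strip())
        except ValueError:
            score = 0.0   # treat unparseable replies as irrelevant
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]
```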
🗂️ See my post on this moment → “Something Quietly Shifted”
🧭 Key Take-Forwards
🧠 What Surprised Me
Inference can simulate memory – with uncanny accuracy
Memory is leaking across sessions (and not always disclosed)
Tiny architectural tweaks (like re-ranking) create outsized differences in experience
🔧 What Changes in ATLAS Design
Memory must be explicit and auditable
RAG must include re-ranking by default
System prompts act as identity layers across models
Evaluation needs its own agent (APA)
Model-task routing is critical (a minimal sketch follows below)
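The routing itself can start embarrassingly simple – a lookup from task type to whichever model handled that task best in testing. The task labels and model picks below reflect my own results and are illustrative, not a recommendation:

```python
# Sketch of the simplest possible model-task router: map a task label to a
# model choice based on what each model proved good at in my tests.
# Labels and model identifiers are illustrative placeholders.
ROUTES = {
    "action_planning": "gpt-4o",               # strongest at turning ideas into plans
    "synthesis_over_docs": "claude-3-sonnet",  # most reliable once fed project docs
    "default": "gpt-4-turbo",
}

def route(task: str) -> str:
    """Return the model identifier to use for a given task label."""
    return ROUTES.get(task, ROUTES["default"])
```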
🔭 Where This Is Going
Next up:
Testing Round 2: Role rotation + memory stress tests
APA expansion into a formal evaluation suite
A lightweight orchestrator for model-task delegation
“What I didn’t expect was that refining the ATLAS prompt would teach me just as much about human language as it did about machine behaviour.”
That story – the evolution of the prompt itself – is one for Field Notes No.3.
Because ATLAS isn’t a product – it’s a relationship between human intention and machine pattern. And like any relationship worth building, it demands clarity, curiosity, and structure.
🧩 If you’re experimenting with prompt design, retrieval layers, or model orchestration – I’d love to hear what you’ve discovered. What’s working for you? What’s surprising?
🧠 Field Notes No.3 will explore how system prompts evolve – and how language becomes interface.
🔔 Subscribe below for more field-tested reflections from inside the build of ATLAS.