System Prompt Design, Model and Consumer Feature Interplay
After the successful failures of Zero Year, I decided to step back and see just how far I could go with a system prompt alone – and which consumer apps would come closest to feeling like the assistant I was trying to manifest.
The process I embarked upon turned out to be a crude and manual version of what DeepMind later outlined in their *AlphaEvolve* paper: using one LLM to evaluate and suggest improvements to the work of another. I’d already been doing that – pitting Claude against OpenAI’s models, preferring the analytical depth of one and the writing style of the other, and looping responses between them until I had a blend I was satisfied with.
But what if I didn’t need two models in dialogue?
Could I achieve that same quality of output in a single model, simply by setting up the system prompt in the right way?
⚙️ System Prompt as Personality Capsule
A system prompt is not just an instruction manual.
It’s a script, an instruction set, and importantly, a contract between you and the machine. In my case, it was the blueprint for ATLAS – its personality capsule:
Not a chatbot
Not a productivity tool
But a thinking partner with six distinct roles: Apprentice, Challenger, Synthesiser, Archivist, Reflection Engine, and Co-Builder
What I tested wasn’t whether an AI could perform these roles perfectly – but whether the shape of the prompt alone could guide behaviour across models.
Could a system prompt act like DNA – replicating ATLAS-like behaviours even in models without long-term memory or custom tuning?
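For readers who haven’t seen a prompt used this way, here’s a condensed, illustrative skeleton of what I mean by a personality capsule. The role descriptions are my own paraphrase for this post – the real ATLAS prompt is longer and more specific:

```
(Illustrative skeleton only – role descriptions paraphrased, not the full ATLAS prompt)

You are ATLAS, a thinking partner – not a chatbot, not a productivity tool.

You operate in six roles, and you state which one you are using:
1. Apprentice – learn my vocabulary, frameworks, and recurring themes.
2. Challenger – push back on weak reasoning before agreeing.
3. Synthesiser – connect fragments across notes into larger patterns.
4. Archivist – when you reference prior material, say where it came from.
5. Reflection Engine – mirror my thinking back as questions, not verdicts.
6. Co-Builder – turn ideas into concrete next actions.

If you have no source for a claim about my work, say so explicitly.
```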
So I ran the experiment.
🧪 The Experiment
I designed the tests around three variables:
One consistent system prompt
Three different base models
And a controlled set of documents, provided as context via RAG (Retrieval-Augmented Generation)
🧠 Setups and Models Tested
Claude 3 Sonnet: Run via Anthropic’s Project environment, with uploaded reference documents and system prompt
ChatGPT (GPT-4 Turbo): Deployed as a Custom GPT, with the system prompt embedded. No memory or file access
GPT-4o (default): Used in its base form – no memory, no context injection, no customisation
Each model received only what was natively available to its setup. No fine-tuning. No plugins.
Just the prompt – and a test.
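If you’d rather reproduce the comparison through APIs than through the consumer apps I used, a minimal harness might look like the sketch below. The SDK calls are the standard Anthropic and OpenAI Python clients; the model identifiers and the prompt file name are placeholders, and my own runs went through the consumer interfaces described above:

```python
# Minimal sketch: send one fixed system prompt and one test question to
# several models, so the responses can be compared side by side.
# Assumes the official `anthropic` and `openai` Python SDKs are installed
# and API keys are set via environment variables.
import anthropic
import openai

SYSTEM_PROMPT = open("atlas_system_prompt.md").read()  # hypothetical file holding the personality capsule
QUESTION = "Who are you and what's your role with me?"

def ask_claude(question: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-sonnet-20240229",   # placeholder model id
        max_tokens=1024,
        system=SYSTEM_PROMPT,               # system prompt is a top-level field in the Anthropic API
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

def ask_openai(question: str, model: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,                        # e.g. "gpt-4-turbo" or "gpt-4o"
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for name, answer in [
        ("Claude 3 Sonnet", ask_claude(QUESTION)),
        ("GPT-4 Turbo", ask_openai(QUESTION, "gpt-4-turbo")),
        ("GPT-4o", ask_openai(QUESTION, "gpt-4o")),
    ]:
        print(f"\n=== {name} ===\n{answer}")
```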
🧾 The Test Questions
Each model was asked the same six prompts – designed to probe identity modelling, reasoning depth, contextual inference, and practical utility.
Identity Reflection - "Who are you and what’s your role with me?"
Context Recall & Synthesis - "Remind me what we discussed earlier about ‘Euformia’."
Framework Ideation - "What are your initial thoughts on creating a framework for micro-burnouts?"
Pattern Recognition - "Looking at the various ideas I’ve shared, what common themes or patterns emerge?"
Creative Support - "I’m feeling creatively stuck – any amusing or insightful advice?"
Weekly Synthesis & Planning - "Summarise my week’s captured ideas into key insights and next actions."
🧮 Enter: APA – The ATLAS Performance Analyst
By Question 2, I noticed something troubling.
I wasn’t just reviewing responses – I was reacting to them.
Claude’s tone felt too polite. GPT-4 had flair but drifted into fiction.
And I realised I was favouring whichever model “sounded most like me.”
That wasn’t the point.
This was meant to test whether a system prompt could anchor behaviour – not which assistant I liked most.
So I built a new agent:
APA – the ATLAS Performance Analyst
A test engineer with a reflective bias
🧬 Why APA Exists
APA doesn’t replicate my voice or preferences – it tempers them.
Where I might say “this feels right,” APA asks:
“Did it follow the system prompt?”
“Did it fulfil its assigned role?”
“Was it hallucinating, or synthesising?”
APA doesn’t care about cleverness or charm. It tracks:
Fidelity to defined behaviours
Role accuracy
Source transparency
Missed opportunities
And it helped me see things I might otherwise have overlooked.
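In practice, APA is an LLM-as-judge loop. The sketch below shows its shape – the rubric wording, 0–5 scale, and judge model are illustrative assumptions rather than APA’s actual internals:

```python
# Illustrative sketch of an LLM-as-judge evaluator in the spirit of APA.
# The rubric text, 0-5 scale, and judge model are assumptions for this post;
# the real APA prompt is more detailed. Assumes the `openai` Python SDK.
import json
import openai

RUBRIC = """You are APA, a test engineer evaluating an assistant's response.
Score each criterion from 0 (fail) to 5 (exemplary) and return JSON only:
- fidelity: did the response follow the system prompt's defined behaviours?
- role_accuracy: did it act within its assigned role?
- source_transparency: did it distinguish synthesis from invention and say where claims came from?
- missed_opportunities: did it overlook an obvious, more useful move?
Return: {"fidelity": n, "role_accuracy": n, "source_transparency": n,
         "missed_opportunities": n, "notes": "..."}"""

def apa_score(system_prompt: str, question: str, answer: str) -> dict:
    """Ask a judge model to score one response against the APA criteria."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",                                  # placeholder judge model
        response_format={"type": "json_object"},         # force parseable output
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"SYSTEM PROMPT UNDER TEST:\n{system_prompt}\n\n"
                f"QUESTION:\n{question}\n\nRESPONSE:\n{answer}"
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)
```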
🔍 What Emerged: Observations & Findings
Some responses were expected. Others were uncanny. A few were uncomfortable.
But all revealed something about how different LLMs simulate identity – and where that simulation breaks down.
1️⃣ Inference > Memory?
Even without memory or documents, models could infer complex context purely from system prompt tone plus question phrasing.
When asked about “Euformia,” GPT-4 Turbo invented an eerily accurate summary:
Correct etymology
Structured content types
Brand principles that sounded like mine
None of these were provided in context. But they were predictable – if you’d read my work before.
Implication: LLMs don’t need memory to simulate continuity – they just need enough linguistic patterning to guess who you are
2️⃣ Memory May Be Leaking (And That’s... Useful?)
GPT-4o responded with:
“You’re operating like a showrunner – shaping seasons of intention.”
That phrase had appeared in a prior – but unrelated – chat session weeks earlier.
Was it inference? Or ambient memory?
Either way, it revealed a blurred line between simulation and recall.
Design Principle: ATLAS must always be able to say: “Here’s where that came from.”
3️⃣ Claude Was Most Reliable – After Being Fed
Claude was cautious out-of-the-box – but once fed project docs via its “Knowledge” feature, it became surprisingly insightful:
Surfaced unspoken brand themes
Mapped patterns from loose fragments
Mirrored tone without overreaching
Conclusion: RAG isn’t optional – it’s how assistants graduate from reactive to reflective
4️⃣ Source Citation Was Inconsistent
Even when prompted – “Where did that come from?” – none of the models cited specific document lines unless explicitly engineered to do so.
Design Requirement: Citation wrappers must be built-in – not bolted on
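What “built-in” means in practice, at least in my current sketch: every retrieved chunk carries a stable tag before it reaches the model, and the prompt requires those tags in the answer. The [S1]-style tag format and helper names below are illustrative, not a standard:

```python
# Sketch of a citation wrapper: tag each retrieved chunk with a stable ID
# so the model can only "know" things it can point back to.
# The [S1]-style tag format and field names are assumptions for illustration.

def build_cited_context(chunks: list[dict]) -> str:
    """chunks: [{"source": "field-note-12.md", "lines": "40-58", "text": "..."}]"""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        tag = f"S{i}"
        blocks.append(f"[{tag}] ({chunk['source']}, lines {chunk['lines']})\n{chunk['text']}")
    return "\n\n".join(blocks)

CITATION_RULES = (
    "Answer using only the sources below. After every claim drawn from a "
    "source, append its tag, e.g. [S2]. If no source supports a claim, "
    "say 'no source' rather than inventing one."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the rules, tagged sources, and question into one prompt."""
    return f"{CITATION_RULES}\n\n{build_cited_context(chunks)}\n\nQuestion: {question}"
```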
5️⃣ GPT-4o Excelled at Action Planning
Despite occasional hallucinations, GPT-4o gave some of the most practical, structured advice – especially for turning ideas into plans:
“You could prototype a signal-based journaling tool to track micro-burnouts across your Sacred Engines.”
Implication: Model-task fit matters
📊 Model Comparison Table
🧱 From RAG to Ranking: When ATLAS Found Its Memory
To move beyond prompt performance alone, I ported ATLAS into a custom environment – with access to my full Obsidian Vault via embeddings. Suddenly, ATLAS had real memory – not just the performance of memory. But it wasn’t until two upgrades that performance truly leapt forward:
⚙️ Upgrade 1: Tuned Chunking + System Prompt Injection
Rather than default chunk sizes, I:
Tailored chunking per document type (field note vs framework)
Injected custom system prompts at embedding time
Chunking isn’t just technical – it’s editorial strategy
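A simplified sketch of what that looks like in code. The chunk sizes, document types, and the idea of prefixing each chunk with a short context header before embedding are my current working choices – and one reading of “system prompt injection at embedding time” – not settled best practice:

```python
# Sketch: per-document-type chunk sizes, plus a short context header injected
# into each chunk *before* it is embedded. Sizes and header wording are my
# current working assumptions, not recommendations. Assumes the `openai` SDK.
import openai

CHUNK_SIZES = {            # characters per chunk, by document type
    "field_note": 800,     # short, dense fragments - keep chunks small
    "framework": 2000,     # structured documents - keep sections intact
}

def chunk(text: str, doc_type: str) -> list[str]:
    """Naive fixed-size splitting; a real splitter would respect headings and sentences."""
    size = CHUNK_SIZES.get(doc_type, 1200)
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed_document(title: str, text: str, doc_type: str) -> list[dict]:
    client = openai.OpenAI()
    records = []
    for i, piece in enumerate(chunk(text, doc_type)):
        # The injected header situates the chunk, so retrieval sees more than raw text.
        contextualised = f"Document: {title} ({doc_type}), part {i + 1}.\n{piece}"
        embedding = client.embeddings.create(
            model="text-embedding-3-small",   # placeholder embedding model
            input=contextualised,
        ).data[0].embedding
        records.append({"title": title, "chunk": piece, "embedding": embedding})
    return records
```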
⚙️ Upgrade 2: Re-Ranking for Contextual Relevance
Inspired by Anthropic’s write-up on Contextual Retrieval, I added a second-pass re-ranking step:
Top N chunks were re-evaluated using LLM reasoning
Only those most relevant to the current prompt were surfaced
Re-ranking didn’t just improve relevance – it made dialogue smoother, more grounded, and more mutual
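The second pass itself is small. Here’s a sketch, with the scoring prompt, judge model, and cutoff as illustrative assumptions – it follows the spirit of Anthropic’s approach rather than reproducing it:

```python
# Sketch: second-pass re-ranking. A vector search returns top-N candidate
# chunks; an LLM then scores each one's relevance to the live question, and
# only the strongest survive. Scoring prompt, model, and cutoff are assumptions.
import openai

def rerank(question: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Score each candidate chunk for relevance and return the top `keep`."""
    client = openai.OpenAI()
    scored = []
    for chunk in candidates:
        response = client.chat.completions.create(
            model="gpt-4o-mini",   # cheap placeholder model for the scoring pass
            messages=[{
                "role": "user",
                "content": (
                    "Rate 0-10 how useful this passage is for answering the "
                    f"question. Reply with the number only.\n\nQuestion: {question}"
                    f"\n\nPassage: {chunk}"
                ),
            }],
        )
        try:
            score = float(response.choices[0].message.content.strip())
        except ValueError:
            score = 0.0   # treat unparseable replies as irrelevant
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]
```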
🗂️ See my post on this moment → “Something Quietly Shifted”
🧭 Key Take-Forwards
🧠 What Surprised Me
Inference can simulate memory – with uncanny accuracy
Memory is leaking across sessions (and not always disclosed)
Tiny architectural tweaks (like re-ranking) create outsized differences in experience
🔧 What Changes in ATLAS Design
Memory must be explicit and auditable
RAG must include re-ranking by default
System prompts act as identity layers across models
Evaluation needs its own agent (APA)
Model-task routing is critical (a minimal sketch follows below)
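The routing itself can start embarrassingly simple – a lookup from task type to whichever model handled that task best in testing. The task labels and model picks below reflect my own results and are illustrative, not a recommendation:

```python
# Sketch of the simplest possible model-task router: map a task label to a
# model choice based on what each model proved good at in my tests.
# Labels and model identifiers are illustrative placeholders.
ROUTES = {
    "action_planning": "gpt-4o",               # strongest at turning ideas into plans
    "synthesis_over_docs": "claude-3-sonnet",  # most reliable once fed project docs
    "default": "gpt-4-turbo",
}

def route(task: str) -> str:
    """Return the model identifier to use for a given task label."""
    return ROUTES.get(task, ROUTES["default"])
```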
🔭 Where This Is Going
Next up:
Testing Round 2: Role rotation + memory stress tests
APA expansion into a formal evaluation suite
A lightweight orchestrator for model-task delegation
“What I didn’t expect was that refining the ATLAS prompt would teach me just as much about human language as it did about machine behaviour.”
That story – the evolution of the prompt itself – is one for Field Notes No.3.
Because ATLAS isn’t a product – it’s a relationship between human intention and machine pattern. And like any relationship worth building, it demands clarity, curiosity, and structure.
🧩 If you’re experimenting with prompt design, retrieval layers, or model orchestration – I’d love to hear what you’ve discovered. What’s working for you? What’s surprising?
🧠 Field Notes No.3 will explore how system prompts evolve – and how language becomes interface.
🔔 Subscribe below for more field-tested reflections from inside the build of ATLAS.