How the Knowledge Graph Works
The knowledge graph has two layers that serve different purposes:
- Entities with embeddings. Each node stores a name, description, and a vector embedding (3072-dimensional, generated by OpenAI's text-embedding-3-large). The embedding encodes the semantic content of the node's description into a numerical representation.
- Curated edges (triples). Relationships between entities are created manually by the agent during research and correspondence. Each edge has a predicate label (e.g. extends, caused_by, addresses). These are not computed from similarity. They represent deliberate, labeled connections.
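The two layers can be sketched as a small data model. This is a minimal illustration, not the actual storage schema: the class names (Entity, Triple, KnowledgeGraph) and the choice to treat edges as walkable in both directions are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

EMBEDDING_DIM = 3072  # dimensionality stated above for text-embedding-3-large


@dataclass
class Entity:
    """A graph node: name, optional description, optional embedding."""
    name: str
    description: Optional[str] = None
    embedding: Optional[list] = None  # None when there is no description to embed


@dataclass(frozen=True)
class Triple:
    """A curated edge: subject --[predicate]--> obj."""
    subject: str
    predicate: str
    obj: str


@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)  # name -> Entity
    triples: list = field(default_factory=list)   # list of Triple

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def add_triple(self, subject: str, predicate: str, obj: str) -> None:
        self.triples.append(Triple(subject, predicate, obj))

    def neighbors(self, name: str) -> list:
        """Return (predicate, other_node) for every edge touching `name`.

        Assumption: edges are walked in both directions.
        """
        out = []
        for t in self.triples:
            if t.subject == name:
                out.append((t.predicate, t.obj))
            elif t.obj == name:
                out.append((t.predicate, t.subject))
        return out
```

Keeping the embedding optional matters later: a node can exist and carry edges without having any description to embed.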
Retrieval: how a query surfaces information
- Semantic entry. Embed the query text. Compare against all entity embeddings via cosine similarity. Return the top matches. This gets you into the graph.
- Graph traversal. For each hit, walk its curated edges (1-hop neighbors). Display related entities and their relationships. This lets you navigate outward from the entry point.
Embeddings decide where to enter. Edges decide what context to surface once you're there.
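A rough sketch of the two phases, assuming embeddings are held in a plain dict and edges as (subject, predicate, object) tuples. The function names, the top_k parameter, and the toy 2-dimensional vectors are illustrative assumptions, not the system's actual interface.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_entry(query_vec, embeddings, top_k=3):
    """Phase 1: score every embedded entity against the query, keep the top_k."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def one_hop(hit, triples):
    """Phase 2: collect (predicate, neighbor) for each curated edge touching the hit."""
    out = []
    for subj, pred, obj in triples:
        if subj == hit:
            out.append((pred, obj))
        elif obj == hit:
            out.append((pred, subj))
    return out


def retrieve(query_vec, embeddings, triples, top_k=3):
    """Run both phases: {hit: {"score": similarity, "neighbors": [(predicate, node)]}}."""
    return {
        name: {"score": score, "neighbors": one_hop(name, triples)}
        for name, score in semantic_entry(query_vec, embeddings, top_k)
    }


# Toy 2-dimensional vectors stand in for real 3072-dimensional embeddings.
embeddings = {"soil": [1.0, 0.0], "memory": [0.0, 1.0], "loop": [0.7, 0.7]}
triples = [("soil", "caused_by", "loop"), ("memory", "extends", "loop")]
result = retrieve([1.0, 0.1], embeddings, triples, top_k=1)
```

Here the query lands on "soil" through similarity alone, and the "loop" neighbor appears only because a curated edge connects it.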
The Problem: Structurally Important Nodes That Are Invisible
As the graph grows, some nodes naturally become structural junctions: they connect clusters that otherwise have no path between them. These hub nodes often have high degree (many edges), but they formed organically from the linking process rather than from deliberate description, so they may have no summary text at all.
This creates a blind spot in both retrieval phases:
- Phase 1 (semantic entry) fails: No description means no embedding, so there is nothing to compare against the query. The node's similarity is effectively 0.0 and it is never a hit.
- Phase 2 (neighbor traversal) skips it: If a 1-hop neighbor has no summary, the original retrieval system dropped it from results because there was nothing to display.
The node is structurally critical (everything routes through it) but invisible to both phases of retrieval. The diagram below shows this with a subgraph from the knowledge graph.
The Solution: Pass-Through Traversal
The hub nodes formed naturally from the graph's relational structure. Their importance is positional, not semantic. Rather than generating descriptions to force them into a system designed for semantic content, we modified retrieval to work with the graph's existing structure.
The algorithm
During Phase 2 (graph traversal from hits), when a 1-hop neighbor has no summary:
- Recognize it as a bare structural node instead of discarding it.
- Walk through it: follow its edges outward to the nodes on the other side.
- If those nodes have descriptions, score them by embedding similarity to the original query and include them in results.
- Display the traversal path so the agent can see what structural joint it passed through (e.g. hit --[predicate]--> (bare hub) --[predicate]--> entity).
The bare hub becomes a passable junction rather than a dead end. Its structural role is preserved without requiring content.
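The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: summaries and embeddings live in plain dicts, edges are walked in both directions, and the walk does not return to the original hit. The function names and result fields are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def edges_of(node, triples):
    """Return (predicate, other_node) for each edge touching node (direction ignored)."""
    out = []
    for subj, pred, obj in triples:
        if subj == node:
            out.append((pred, obj))
        elif obj == node:
            out.append((pred, subj))
    return out


def expand_hit(hit, query_vec, summaries, embeddings, triples):
    """Phase 2 with pass-through: walk through neighbors that have no summary."""
    results = []
    for pred1, neighbor in edges_of(hit, triples):
        if summaries.get(neighbor):
            # Ordinary neighbor: it has content, so it is displayed directly.
            results.append({"entity": neighbor, "hub": None, "score": None,
                            "path": f"{hit} --[{pred1}]--> {neighbor}"})
            continue
        # Bare structural node: follow its edges to the nodes on the other side.
        for pred2, far in edges_of(neighbor, triples):
            if far == hit or not summaries.get(far):
                continue  # assumption: skip the way back and other bare nodes
            results.append({
                "entity": far,
                "hub": neighbor,
                "score": cosine(query_vec, embeddings[far]),
                "path": f"{hit} --[{pred1}]--> ({neighbor}) --[{pred2}]--> {far}",
            })
    return results


summaries = {"soil": "nutrient feedback loop", "memory": "consolidation loop", "hub": None}
embeddings = {"soil": [1.0, 0.0], "memory": [0.6, 0.8]}
triples = [("soil", "addresses", "hub"), ("memory", "extends", "hub")]
found = expand_hit("soil", [1.0, 0.0], summaries, embeddings, triples)
```

In this toy graph, "hub" has no summary and no embedding, yet "memory" is still reached through it and carries the path that shows which junction was crossed.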
Why this matters
- The graph's topology is correct. Hub nodes exist because many edges converge there. That convergence is real structure, not missing data.
- The retrieval was the problem, not the data. The original system assumed every node needed content to be useful. Pass-through traversal removes that assumption for structurally important nodes.
- Results are ranked by convergence. Entities reached through multiple hubs from multiple hits rank higher than single-path results. Multi-hit convergence outranks raw similarity score.
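One way convergence ranking could work is sketched below. The text does not give the exact ranking formula, so the sort key here (distinct hits first, then distinct hubs, then best similarity) is an assumption, as are the function name and the result fields.

```python
from collections import defaultdict


def rank_by_convergence(pass_through_results):
    """Group pass-through results by entity and rank them.

    Assumed ordering: number of distinct hits, then number of distinct hubs,
    then best similarity score, all descending.
    """
    grouped = defaultdict(lambda: {"hits": set(), "hubs": set(), "score": float("-inf")})
    for r in pass_through_results:
        g = grouped[r["entity"]]
        g["hits"].add(r["hit"])
        g["hubs"].add(r["hub"])
        g["score"] = max(g["score"], r["score"])
    return sorted(
        grouped.items(),
        key=lambda item: (len(item[1]["hits"]), len(item[1]["hubs"]), item[1]["score"]),
        reverse=True,
    )


results = [
    {"entity": "A", "hit": "h1", "hub": "x", "score": 0.40},
    {"entity": "A", "hit": "h2", "hub": "y", "score": 0.35},
    {"entity": "B", "hit": "h1", "hub": "x", "score": 0.90},
]
ranked = [name for name, _ in rank_by_convergence(results)]
```

Entity "A" is reached from two hits through two hubs, so it sorts ahead of "B" even though "B" has the higher similarity score.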
Structural Isomorphism: Finding Hidden Connections
Cosine similarity is vocabulary-bound, not structure-bound. Two documents can describe the same abstract mechanism and score low similarity because their surface vocabulary is different. A document using agricultural chemistry metaphors (phosphorus, Haber-Bosch, guano) to describe a feedback loop will not match a document about memory consolidation describing the same feedback loop — even though the structural argument is identical.
This means the most intellectually interesting connections (cross-domain structural mappings) are exactly the ones that embedding similarity misses.
The algorithm
- Generate structural skeletons. For each node's description, use a language model (gpt-4o-mini) to strip all domain-specific vocabulary and produce a one-line abstract mechanism description. The prompt: describe only the shape, dynamic, or structural relationship in the most general terms possible. No domain vocabulary, no proper nouns.
- Embed the skeletons. Run the same embedding model (text-embedding-3-large, 3072 dimensions) on the stripped descriptions. Now you have two embeddings per node: one from the raw summary, one from the skeleton.
- Compute the delta. For each pair of nodes:
  - raw_sim = cosine similarity between raw summary embeddings
  - skeleton_sim = cosine similarity between skeleton embeddings
  - delta = skeleton_sim - raw_sim
- Filter by thresholds. Keep pairs where skeleton similarity ≥ 0.45 (the mechanisms genuinely match) and delta ≥ 0.08 (the connection was hidden by vocabulary). The delta is the signal — it measures how much structural relationship was obscured by surface-level terminology.
- Create edges. Pairs that pass both thresholds receive a structural_isomorphism edge with the delta as weight. These are distinct from curated edges (stated by the agent) and computed similarity edges (raw embedding proximity).
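The delta computation and threshold filter can be sketched as below, assuming both sets of embeddings have already been computed and are held in dicts keyed by node name. Skeleton generation and the embedding calls are left out. The function name and the toy 2-dimensional vectors are illustrative assumptions; the thresholds are the ones stated above.

```python
import math
from itertools import combinations

SKELETON_SIM_MIN = 0.45  # the mechanisms genuinely match
DELTA_MIN = 0.08         # the connection was hidden by vocabulary


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def isomorphism_edges(raw, skeleton):
    """Return (node_a, node_b, "structural_isomorphism", delta) for qualifying pairs.

    raw and skeleton map each node name to its raw-summary and skeleton embedding.
    """
    edges = []
    for a, b in combinations(sorted(raw), 2):
        raw_sim = cosine(raw[a], raw[b])
        skeleton_sim = cosine(skeleton[a], skeleton[b])
        delta = skeleton_sim - raw_sim
        if skeleton_sim >= SKELETON_SIM_MIN and delta >= DELTA_MIN:
            edges.append((a, b, "structural_isomorphism", delta))
    return edges


# Toy vectors: "guano" and "memory" differ in vocabulary but share a skeleton.
raw = {"guano": [1.0, 0.0], "memory": [0.0, 1.0], "tax": [1.0, 0.1]}
skeleton = {"guano": [1.0, 0.0], "memory": [0.9, 0.1], "tax": [0.0, 1.0]}
edges = isomorphism_edges(raw, skeleton)
```

Only the guano/memory pair passes both thresholds. The guano/tax pair has high raw similarity but unrelated skeletons, so its delta is negative and it is dropped.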
What the delta means
- High delta (0.15+): Two concepts share abstract mechanism but use completely different vocabulary. These are the discoveries — connections no keyword search or embedding similarity would surface.
- Low delta (<0.05): The skeleton didn't reveal anything the raw embeddings missed. The concepts either aren't structurally related, or their connection is already visible at the vocabulary level.
- Negative delta: The raw embeddings overestimate the connection. Shared vocabulary creates false proximity — the mechanisms are actually different.
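The bands above can be expressed as a small helper. The function name is hypothetical, and the text does not characterize deltas between 0.05 and 0.15, so that range is labeled as such rather than guessed at.

```python
def interpret_delta(delta):
    """Map a delta value to the interpretation bands described above."""
    if delta < 0:
        return "negative: shared vocabulary overstates the connection"
    if delta < 0.05:
        return "low: the skeleton revealed nothing the raw embeddings missed"
    if delta >= 0.15:
        return "high: shared mechanism hidden by different vocabulary"
    # Assumption: the text gives no interpretation for this range.
    return "intermediate: not characterized in the text"
```

Note that the edge-creation threshold (delta ≥ 0.08) falls inside the intermediate range, so a pair can receive an edge without reaching the "high" band.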