
Graph Topology and Structural Retrieval

Python · SQLite · scipy.sparse · HDBSCAN · Vector Embeddings

How the Knowledge Graph Works

The knowledge graph has two layers that serve different purposes:

Retrieval: how a query surfaces information

  1. Semantic entry. Embed the query text. Compare against all entity embeddings via cosine similarity. Return the top matches. This gets you into the graph.
  2. Graph traversal. For each hit, walk its curated edges (1-hop neighbors). Display related entities and their relationships. This lets you navigate outward from the entry point.

Embeddings decide where to enter. Edges decide what context to surface once you're there.
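
The two phases can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation: the names `retrieve` and `cosine`, the toy entities, the 2-dimensional vectors, and the choice to follow only outgoing edges are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, embeddings, edges, top_k=2):
    """Phase 1: rank entities by similarity. Phase 2: list each hit's 1-hop neighbors."""
    # Phase 1 (semantic entry): score every entity against the query.
    ranked = sorted(embeddings, key=lambda name: cosine(query_vec, embeddings[name]), reverse=True)
    hits = []
    for name in ranked[:top_k]:
        # Phase 2 (graph traversal): collect curated edges leaving the hit.
        # Assumption: edges are (source, predicate, target) triples, followed outward only.
        neighbors = [(pred, dst) for src, pred, dst in edges if src == name]
        hits.append({
            "entity": name,
            "score": cosine(query_vec, embeddings[name]),
            "neighbors": neighbors,
        })
    return hits

# Hypothetical toy data: three entities with 2-dimensional embeddings.
embeddings = {"sleep": [1.0, 0.0], "memory": [0.8, 0.6], "soil": [0.0, 1.0]}
edges = [("sleep", "supports", "memory"), ("memory", "relates_to", "soil")]

hits = retrieve([1.0, 0.1], embeddings, edges)
for hit in hits:
    print(hit["entity"], round(hit["score"], 3), hit["neighbors"])
```

With this toy data the query enters the graph at "sleep" and "memory", and the edge list then supplies the surrounding context for each hit.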

The Problem: Structurally Important Nodes That Are Invisible

As the graph grows, some nodes naturally become structural junctions: they connect clusters that otherwise have no path between them. These hub nodes often have high degree (many edges) but formed organically from the linking process, not from deliberate description. They may have no summary text at all.
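
One way to make "structural junction" concrete is to check whether removing a node splits the graph into more pieces. The sketch below is an assumption-laden illustration, not the project's code: it treats edges as undirected pairs, uses hypothetical helper names (`build_adjacency`, `count_components`, `bare_junctions`), and runs on an invented toy graph.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def count_components(adj, skip=None):
    """Count connected components, optionally pretending `skip` does not exist."""
    seen, components = set(), 0
    for start in adj:
        if start == skip or start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(n for n in adj[node] if n != skip and n not in seen)
    return components

def bare_junctions(edges, summaries):
    """Nodes with no summary whose removal disconnects part of the graph."""
    adj = build_adjacency(edges)
    baseline = count_components(adj)
    return [
        node for node in adj
        if not summaries.get(node) and count_components(adj, skip=node) > baseline
    ]

# Hypothetical toy graph: three small clusters joined only through "hub".
edges = [("a1", "a2"), ("a1", "hub"), ("b1", "b2"), ("b1", "hub"), ("c1", "hub")]
summaries = {"a1": "text", "a2": "text", "b1": "text", "b2": "text", "c1": "text", "hub": ""}

junctions = bare_junctions(edges, summaries)
pieces_without_hub = count_components(build_adjacency(edges), skip="hub")
print(junctions, pieces_without_hub)
```

Here "hub" is the only node that both lacks a summary and holds the clusters together, which is the situation the rest of this section addresses.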

This creates a blind spot in both retrieval phases.

The node is structurally critical (everything routes through it) but invisible to both phases of retrieval. The diagram below shows this with a subgraph from the knowledge graph.

Semantic search: Query embedding matches against node descriptions. The hub has no description, so it scores 0.0 and is invisible to the query. The three clusters appear disconnected. The agent cannot discover that they are structurally linked.

The Solution: Pass-Through Traversal

The hub nodes formed naturally from the graph's relational structure. Their importance is positional, not semantic. Rather than generating descriptions to force them into a system designed for semantic content, we modified retrieval to work with the graph's existing structure.

The algorithm

During Phase 2 (graph traversal from hits), when a 1-hop neighbor has no summary:

  1. Recognize it as a bare structural node instead of discarding it.
  2. Walk through it: follow its edges outward to the nodes on the other side.
  3. If those nodes have descriptions, score them by embedding similarity to the original query and include them in results.
  4. Display the traversal path so the agent can see what structural joint it passed through (e.g. hit --[predicate]--> (bare hub) --[predicate]--> entity).

The bare hub becomes a passable junction rather than a dead end. Its structural role is preserved without requiring content.
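
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the actual implementation: the function name `pass_through`, the dictionary shapes, the toy data, and the decision to follow only outgoing `(source, predicate, target)` edges are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pass_through(hit, query_vec, edges, summaries, embeddings):
    """Walk through summary-less neighbors of `hit` and score the nodes beyond them."""
    results = []
    for src, pred, hub in edges:
        # Step 1: only consider 1-hop neighbors of the hit that have no summary.
        if src != hit or summaries.get(hub):
            continue
        # Step 2: follow the bare node's own edges outward.
        for src2, pred2, far in edges:
            if src2 != hub or far == hit:
                continue
            # Step 3: keep far-side nodes that have descriptions, scored against the query.
            if not summaries.get(far):
                continue
            results.append({
                "entity": far,
                "score": cosine(query_vec, embeddings[far]),
                # Step 4: record the path so the junction stays visible.
                "path": f"{hit} --[{pred}]--> ({hub}) --[{pred2}]--> {far}",
            })
    return sorted(results, key=lambda r: r["score"], reverse=True)

# Hypothetical toy data: "hub" has no summary and no embedding.
summaries = {"sleep": "Sleep consolidates memory.", "hub": "",
             "soil": "Soil nutrient cycles.", "tides": "Tidal rhythms."}
embeddings = {"soil": [0.9, 0.1], "tides": [0.1, 0.9]}
edges = [("sleep", "feeds", "hub"), ("hub", "feeds", "soil"), ("hub", "feeds", "tides")]

results = pass_through("sleep", [1.0, 0.0], edges, summaries, embeddings)
for r in results:
    print(round(r["score"], 3), r["path"])
```

The sketch stops after one bare node. Whether to continue through chains of consecutive bare nodes is a design choice the description above does not specify.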

Why this matters

Structural Isomorphism: Finding Hidden Connections

Cosine similarity is vocabulary-bound, not structure-bound. Two documents can describe the same abstract mechanism and score low similarity because their surface vocabulary is different. A document using agricultural chemistry metaphors (phosphorus, Haber-Bosch, guano) to describe a feedback loop will not match a document about memory consolidation describing the same feedback loop, even though the structural argument is identical.

This means the most intellectually interesting connections (cross-domain structural mappings) are exactly the ones that embedding similarity misses.

The algorithm

  1. Generate structural skeletons. For each node's description, use a language model (gpt-4o-mini) to strip all domain-specific vocabulary and produce a one-line abstract mechanism description. The prompt: describe only the shape, dynamic, or structural relationship in the most general terms possible. No domain vocabulary, no proper nouns.
  2. Embed the skeletons. Run the same embedding model (text-embedding-3-large, 3072 dimensions) on the stripped descriptions. Now you have two embeddings per node: one from the raw summary, one from the skeleton.
  3. Compute the delta. For each pair of nodes:
    • raw_sim = cosine similarity between raw summary embeddings
    • skeleton_sim = cosine similarity between skeleton embeddings
    • delta = skeleton_sim - raw_sim
  4. Filter by thresholds. Keep pairs where skeleton similarity ≥ 0.45 (the mechanisms genuinely match) and delta ≥ 0.08 (the connection was hidden by vocabulary). The delta is the signal: it measures how much of the structural relationship was obscured by surface-level terminology.
  5. Create edges. Pairs that pass both thresholds receive a structural_isomorphism edge with the delta as weight. These are distinct from curated edges (stated by the agent) and computed similarity edges (raw embedding proximity).
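
Steps 3 to 5 can be sketched as below, assuming both sets of embeddings have already been computed. The thresholds come from the text. Everything else is illustrative: the function name `isomorphism_edges`, the tuple shape of an edge, and the 2-dimensional toy vectors (real embeddings would have 3072 dimensions) are assumptions.

```python
import math
from itertools import combinations

SKELETON_MIN = 0.45  # skeleton similarity threshold from the text
DELTA_MIN = 0.08     # delta threshold from the text

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def isomorphism_edges(raw, skeleton):
    """Return (node_a, node_b, edge_type, delta) for pairs passing both thresholds."""
    found = []
    for a, b in combinations(sorted(raw), 2):
        raw_sim = cosine(raw[a], raw[b])
        skeleton_sim = cosine(skeleton[a], skeleton[b])
        delta = skeleton_sim - raw_sim
        # Keep the pair only if the mechanisms match AND vocabulary was hiding it.
        if skeleton_sim >= SKELETON_MIN and delta >= DELTA_MIN:
            found.append((a, b, "structural_isomorphism", delta))
    return found

# Hypothetical toy vectors: "fertilizer" and "memory" look unrelated in raw
# space but nearly identical once reduced to skeletons.
raw = {"fertilizer": [1.0, 0.0], "memory": [0.0, 1.0], "tides": [0.9, 0.1]}
skeleton = {"fertilizer": [1.0, 0.2], "memory": [0.9, 0.3], "tides": [0.0, 1.0]}

new_edges = isomorphism_edges(raw, skeleton)
for a, b, kind, delta in new_edges:
    print(a, b, kind, round(delta, 3))
```

In the toy data, "fertilizer" and "tides" are close in raw space but their skeletons diverge, so that pair is correctly rejected. Only the pair whose similarity rises after stripping vocabulary receives an edge. Note that comparing every pair is quadratic in the number of nodes, which may matter for larger graphs.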

What the delta means