Clustering and Agent Queries

EigenLake can group matching records into clusters directly from an index. This is useful for operational questions such as:

show me recent battery failures

The low-level API is explicit: you provide filters, limits, and clustering options. Agent mode sits one level above that: it inspects the natural-language query, builds schema-aware filters for common cases such as recent failures, infers useful summary text fields, and decides whether to run clustering or return filtered records.

Index Schema

Agent mode can only infer filters for fields that exist in the index schema. For an automotive failure analysis demo, define filterable fields such as system, status, and created_at, plus descriptive string fields for summaries:

from eigenlake import schema as s

schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("vehicle_id", s.string(required=True, filterable=True))
    .add("model", s.string(filterable=True))
    .add("system", s.string(filterable=True, enum=["battery", "charging", "brake", "powertrain"]))
    .add("status", s.string(filterable=True, enum=["ok", "warning", "failure"]))
    .add("severity", s.string(filterable=True, enum=["low", "medium", "high", "critical"]))
    .add("fault_code", s.string(filterable=True))
    .add("symptom", s.string(filterable=False))
    .add("repair_note", s.string(filterable=False))
    .add("created_at", s.datetime(filterable=True))
    .build()
)

# `client` is assumed to be an already-configured EigenLake client.
idx = client.indexes.create_or_get(
    namespace="demo-automotive",
    index="vehicle-failures",
    dimensions=128,
    schema=schema,
    index_options=index_options,
)

Use stable, application-level IDs for records. If you do not have UUIDs, use the SDK's id field with your own string ID, or set record_id_property when creating the index.
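
If the source data has no UUIDs, one option is to derive a stable ID from fields that already identify the event. The sketch below uses Python's standard uuid.uuid5. The helper name stable_record_id, the namespace string, and the choice of key fields (vehicle_id, fault_code, created_at) are assumptions for this demo, not EigenLake requirements.

```python
import uuid

# Hypothetical namespace for this demo; any fixed UUID works as long as it never changes.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "vehicle-failures.example.com")


def stable_record_id(vehicle_id: str, fault_code: str, created_at: str) -> str:
    """Return the same ID every time for the same logical event."""
    key = "|".join([vehicle_id, fault_code, created_at])
    return str(uuid.uuid5(DEMO_NAMESPACE, key))


record_id = stable_record_id("VIN-0001", "P0A80", "2024-05-01T12:00:00+00:00")
```

Because the ID depends only on the inputs, re-ingesting the same event yields the same ID instead of a new one.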

Low-Level Clustering

Use idx.search.cluster(...) when you already know the filter and clustering settings.

recent_failure_filter = {
    "system": {"$eq": "battery"},
    "status": {"$in": ["failure"]},
    "created_at": {"$gte": "<recent-start-iso8601>"},
}

clusters = idx.search.cluster(
    filter=recent_failure_filter,
    limit=1000,
    algorithm="kmeans",
    auto_tune=True,
    min_clusters=2,
    max_clusters=6,
    distance_metric="cosine",
    representatives_per_cluster=2,
)

for cluster in clusters["clusters"]:
    print(cluster["cluster_id"], cluster["count"], cluster["summary"])
    for representative in cluster["representatives"]:
        print("  ", representative["uuid"], representative["properties"])
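
The "<recent-start-iso8601>" placeholder in the filter above stands for an ISO 8601 timestamp. A minimal way to compute it with the standard library is sketched here. The helper name recent_start_iso8601, the 7-day window, and the use of UTC are assumptions; the timestamp format the API expects beyond "ISO 8601" is not specified in this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def recent_start_iso8601(days: int = 7, now: Optional[datetime] = None) -> str:
    """Return an ISO 8601 UTC timestamp `days` before `now` (default: current time)."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=days)).isoformat(timespec="seconds")


# Same filter shape as above, with the placeholder filled in.
recent_failure_filter = {
    "system": {"$eq": "battery"},
    "status": {"$in": ["failure"]},
    "created_at": {"$gte": recent_start_iso8601(days=7)},
}
```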

Typical response fields:

backend: currently lambda in production deployments
algorithm: kmeans or dbscan
distance_metric: cosine or euclidean
parameters: selected clustering parameters such as num_clusters, DBSCAN eps, or min_samples
tuning: hyperparameter tuning details when auto_tune=True or DBSCAN chooses eps
records_clustered: number of records included after filtering
clusters: cluster summaries, counts, centroids, representative IDs, and representative records

If num_clusters is omitted, the API chooses a small default based on the number of matching records.

For density-based clustering, use DBSCAN:

clusters = idx.search.cluster(
    filter=recent_failure_filter,
    algorithm="dbscan",
    dbscan_min_samples=4,
    dbscan_eps=None,  # let EigenLake tune eps from vector distances
)

For k-means, auto_tune=True evaluates cluster counts in [min_clusters, max_clusters] and picks the best silhouette score. For DBSCAN, omitting dbscan_eps evaluates candidate radius values from nearest-neighbor distance quantiles.
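
To make the k-means selection rule concrete, the sketch below computes a mean silhouette score for several candidate labelings and keeps the cluster count with the highest score. It is an illustration of the idea only, not EigenLake's implementation: it uses Euclidean distance, takes the labelings as given rather than running k-means, and the names silhouette_score and pick_best_k are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def silhouette_score(points: List[Point], labels: List[int]) -> float:
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters: Dict[int, List[int]] = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i: int, members: List[int]) -> float:
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        a = mean_dist(i, own)  # cohesion: mean distance within the own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != label)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def pick_best_k(points: List[Point], labelings: Dict[int, List[int]]) -> int:
    """Return the candidate cluster count whose labeling scores highest."""
    return max(labelings, key=lambda k: silhouette_score(points, labelings[k]))


# Two well-separated groups; the k=3 labeling splits one of them needlessly.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labelings = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 2, 1, 1, 1]}
best_k = pick_best_k(points, labelings)
```

Here the k=2 labeling wins because splitting a tight group lowers the separation term for its members.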

Agent Mode

Use idx.agent.query(...) when the caller gives a natural-language request and you want EigenLake to choose the action.

result = idx.agent.query("show me recent battery failures")

print(result["action"])
print(result["filter"])

for cluster in result["clusters"]:
    print(cluster["count"], cluster["summary"])

Agent mode also accepts a clustering algorithm and tuning options:

result = idx.agent.query(
    "show me recent battery failures",
    algorithm="dbscan",
    dbscan_min_samples=4,
)

For this schema, the agent infers a filter similar to:

{
    "status": {"$in": ["failure"]},
    "created_at": {"$gte": "<recent-start-iso8601>"},
    "system": {"$eq": "battery"},
}

It also infers summary fields such as fault_code, symptom, and repair_note.
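
As a rough mental model of schema-aware inference, the sketch below matches query words against the enum values of filterable fields. It is a simplified illustration, not the agent's actual logic: ENUM_FIELDS is a hand-written stand-in for the schema, infer_filter is a hypothetical name, the plural handling is naive, and the operators it emits may differ from what the agent returns (for example, the agent output above uses $in for status).

```python
from typing import Dict, List

# Hypothetical, simplified view of the schema: filterable field -> allowed enum values.
ENUM_FIELDS: Dict[str, List[str]] = {
    "system": ["battery", "charging", "brake", "powertrain"],
    "status": ["ok", "warning", "failure"],
}


def infer_filter(query: str, recent_start: str) -> dict:
    """Build a filter from enum values mentioned in the query (naive word matching)."""
    words = query.lower().split()
    # Also try each word without trailing "s" so "failures" matches "failure".
    tokens = set(words) | {w.rstrip("s") for w in words}
    inferred: dict = {}
    for field, values in ENUM_FIELDS.items():
        hits = [v for v in values if v in tokens]
        if len(hits) == 1:
            inferred[field] = {"$eq": hits[0]}
        elif hits:
            inferred[field] = {"$in": hits}
    if "recent" in tokens:
        inferred["created_at"] = {"$gte": recent_start}
    return inferred


inferred = infer_filter("show me recent battery failures", "<recent-start-iso8601>")
```

The point of the sketch is the dependency on the schema: a word such as "battery" can only become a filter because the system field declares it as an enum value.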

In mode="auto", the agent currently uses simple, deterministic query hints. Queries containing clustering or failure-analysis language are routed to clustering. Other queries are routed to filtered record retrieval.

You can force behavior:

idx.agent.query("recent failures", mode="cluster")
idx.agent.query("recent failures", mode="filter")
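
The routing described above can be pictured as a keyword check with an explicit override. This is a sketch under stated assumptions: the hint list CLUSTER_HINTS and the function name route are hypothetical, and the hints EigenLake actually uses are not documented here.

```python
# Hypothetical hint words; the real list used by the agent is not documented here.
CLUSTER_HINTS = ("cluster", "group", "pattern", "failure")


def route(query: str, mode: str = "auto") -> str:
    """Return "cluster" or "filter" for a query, honoring an explicit mode."""
    if mode in ("cluster", "filter"):
        return mode  # a forced mode always wins over query hints
    if mode != "auto":
        raise ValueError(f"unknown mode: {mode!r}")
    text = query.lower()
    return "cluster" if any(hint in text for hint in CLUSTER_HINTS) else "filter"


action = route("show me recent battery failures")
```

Because the rule is deterministic, the same query text always takes the same path unless mode is set explicitly.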

Advanced overrides are still available when a schema uses unusual field names:

idx.agent.query(
    "show me recent failures",
    failure_field="outcome",
    recent_days=30,
    text_fields=["description", "resolution"],
)

Status

Clustering is synchronous today. The API returns only after Lambda clustering completes.

def clustering_status(result: dict) -> dict:
    """Summarize a clustering response; the call is synchronous, so status is always "completed"."""
    return {
        "status": "completed",
        "backend": result.get("backend"),
        "records_clustered": result.get("records_clustered"),
        "cluster_count": len(result.get("clusters") or []),
    }

print(clustering_status(clusters))

There is no queued job ID yet for clustering. If backend is lambda, the API invoked Lambda and waited for the response.
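
Because the call blocks until the result is ready, an application that needs to stay responsive can run it in a worker thread. The sketch below shows the pattern with the standard library. run_clustering is a stand-in for a blocking idx.search.cluster(...) call and returns a made-up result; whether the SDK client is safe to share across threads is an assumption you should verify.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_clustering(**kwargs) -> dict:
    """Stand-in for a blocking idx.search.cluster(**kwargs) call (hypothetical result)."""
    time.sleep(0.1)  # simulate waiting for the backend
    return {"backend": "lambda", "records_clustered": 0, "clusters": []}


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(run_clustering, algorithm="kmeans")
    # The caller is free to do other work here while the request is in flight.
    result = future.result(timeout=60)  # waits until the call returns or 60 s pass
```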

Future compute backends may include GPUs, Spark, and automatic compute selection based on request requirements.

Full Demo

See the support-ticket clustering notebook in the examples repository:

https://github.com/EigenLake-Org/eigenlake-clustering-demos/blob/main/notebooks/query_failures_clustering_demo.ipynb

The notebook indexes the Kaggle customer-support ticket dataset with Gemini embeddings, then runs both low-level clustering and agent clustering. The core clustering cells look like this after records are inserted:

critical_technical_filter = {
    "category": {"$eq": "Technical issue"},
    "priority": {"$eq": "Critical"},
}

kmeans_clusters = idx.search.cluster(
    filter=critical_technical_filter,
    limit=200,
    algorithm="kmeans",
    auto_tune=True,
    min_clusters=2,
    max_clusters=6,
    representatives_per_cluster=2,
)

print(kmeans_clusters["backend"])      # lambda
print(kmeans_clusters["parameters"])
print(kmeans_clusters["tuning"])

for cluster in kmeans_clusters["clusters"]:
    print(cluster["cluster_id"], cluster["count"], cluster["summary"])

Agent mode can run the same analysis from a natural-language request:

agent_result = idx.agent.query(
    "cluster critical technical issue tickets",
    mode="cluster",
    algorithm="dbscan",
    dbscan_min_samples=3,
)

print(agent_result["backend"])       # lambda
print(agent_result["noise_count"])
print([cluster["count"] for cluster in agent_result["clusters"]])
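
With DBSCAN, a quick sanity check is the share of records that ended up as noise. The helper below is a small sketch that assumes noise_count counts records assigned to no cluster and that each cluster's count excludes noise; noise_fraction is a hypothetical name, and the sample dictionary is made up for illustration.

```python
def noise_fraction(result: dict) -> float:
    """Return noise records as a fraction of all records in the result."""
    noise = result.get("noise_count") or 0
    clustered = sum(cluster["count"] for cluster in result.get("clusters") or [])
    total = noise + clustered
    return noise / total if total else 0.0


# Made-up result with the same field names as the agent response above.
sample_result = {"noise_count": 5, "clusters": [{"count": 10}, {"count": 5}]}
fraction = noise_fraction(sample_result)
```

A high fraction usually suggests loosening the density settings, for example a smaller dbscan_min_samples.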