Clustering and Agent Queries
EigenLake can group matching records into clusters directly from an index. This is useful for operational questions such as:
show me recent battery failures
The low-level API is explicit: you provide filters, limits, and clustering options. Agent mode sits one level above that: it inspects the natural language query, builds schema-aware filters for common cases such as recent failures, infers useful summary text fields, and decides whether to run clustering or return filtered records.
Index Schema
Agent mode can only infer filters for fields that exist in the index schema. For an automotive failure analysis demo, define filterable fields such as system, status, and created_at, plus descriptive string fields for summaries:
from eigenlake import schema as s

schema, index_options = (
    s.SchemaBuilder(additional_properties=False)
    .add("vehicle_id", s.string(required=True, filterable=True))
    .add("model", s.string(filterable=True))
    .add("system", s.string(filterable=True, enum=["battery", "charging", "brake", "powertrain"]))
    .add("status", s.string(filterable=True, enum=["ok", "warning", "failure"]))
    .add("severity", s.string(filterable=True, enum=["low", "medium", "high", "critical"]))
    .add("fault_code", s.string(filterable=True))
    .add("symptom", s.string(filterable=False))
    .add("repair_note", s.string(filterable=False))
    .add("created_at", s.datetime(filterable=True))
    .build()
)

# `client` is assumed to be an already-configured EigenLake client.
idx = client.indexes.create_or_get(
    namespace="demo-automotive",
    index="vehicle-failures",
    dimensions=128,
    schema=schema,
    index_options=index_options,
)
Use stable, application-level IDs for records. If you do not have UUIDs, use the SDK's id field with your own string ID, or set record_id_property when creating the index.
Low-Level Clustering
Use idx.search.cluster(...) when you already know the filter and clustering settings.
recent_failure_filter = {
    "system": {"$eq": "battery"},
    "status": {"$in": ["failure"]},
    "created_at": {"$gte": "<recent-start-iso8601>"},
}

clusters = idx.search.cluster(
    filter=recent_failure_filter,
    limit=1000,
    algorithm="kmeans",
    auto_tune=True,
    min_clusters=2,
    max_clusters=6,
    distance_metric="cosine",
    representatives_per_cluster=2,
)

for cluster in clusters["clusters"]:
    print(cluster["cluster_id"], cluster["count"], cluster["summary"])
    for representative in cluster["representatives"]:
        print(" ", representative["uuid"], representative["properties"])
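The "<recent-start-iso8601>" placeholder in the filter has to be replaced with a real timestamp before the call. The following is a minimal sketch of one way to build it, using only the standard library. The helper name build_recent_failure_filter and its now parameter are hypothetical, and the sketch assumes the index accepts a UTC ISO 8601 string with a "+00:00" offset for created_at.

```python
from datetime import datetime, timedelta, timezone


def build_recent_failure_filter(system, recent_days=30, now=None):
    """Build a filter dict in the same shape as recent_failure_filter."""
    # `now` is injectable so the cutoff is reproducible in tests.
    now = now or datetime.now(timezone.utc)
    recent_start = now - timedelta(days=recent_days)
    return {
        "system": {"$eq": system},
        "status": {"$in": ["failure"]},
        # An ISO 8601 string stands in for the "<recent-start-iso8601>" placeholder.
        "created_at": {"$gte": recent_start.isoformat()},
    }


recent_failure_filter = build_recent_failure_filter("battery", recent_days=14)
print(recent_failure_filter["created_at"])
```

Computing the cutoff at call time keeps "recent" relative to the moment of the query rather than to a hard-coded date.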
Typical response fields:
backend: currently lambda in production deployments
algorithm: kmeans or dbscan
distance_metric: cosine or euclidean
parameters: selected clustering parameters such as num_clusters, DBSCAN eps, or min_samples
tuning: hyperparameter tuning details when auto_tune=True or DBSCAN chooses eps
records_clustered: number of records included after filtering
clusters: cluster summaries, counts, centroids, representative IDs, and representative records
If num_clusters is omitted, the API chooses a small default based on the number of matching records.
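To make the response shape concrete, here is a small sketch that turns a response into one summary line per cluster, largest first. The helper summarize_response is hypothetical, and sample_response is a hand-written dict that only mimics the fields described above; its values, including the integer cluster IDs, are illustrative and not real API output.

```python
def summarize_response(response):
    """Return one line per cluster, largest cluster first."""
    clusters = sorted(
        response.get("clusters") or [],
        key=lambda c: c["count"],
        reverse=True,
    )
    # Fall back to summing cluster counts if records_clustered is missing.
    total = response.get("records_clustered") or sum(c["count"] for c in clusters)
    lines = []
    for cluster in clusters:
        share = cluster["count"] / total if total else 0.0
        lines.append(
            f'{cluster["cluster_id"]}: {cluster["count"]} records '
            f'({share:.0%}) - {cluster["summary"]}'
        )
    return lines


# Hand-written sample in the documented shape; values are illustrative only.
sample_response = {
    "backend": "lambda",
    "algorithm": "kmeans",
    "distance_metric": "cosine",
    "parameters": {"num_clusters": 2},
    "records_clustered": 10,
    "clusters": [
        {"cluster_id": 0, "count": 3, "summary": "charging port fault"},
        {"cluster_id": 1, "count": 7, "summary": "cell overheating"},
    ],
}

for line in summarize_response(sample_response):
    print(line)
```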
For density-based clustering, use DBSCAN:
clusters = idx.search.cluster(
    filter=recent_failure_filter,
    algorithm="dbscan",
    dbscan_min_samples=4,
    dbscan_eps=None,  # let EigenLake tune eps from vector distances
)
For k-means, auto_tune=True evaluates every cluster count in [min_clusters, max_clusters] and picks the count with the best silhouette score. For DBSCAN, omitting dbscan_eps makes EigenLake evaluate candidate radius values taken from nearest-neighbor distance quantiles.
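The idea behind silhouette-based tuning can be sketched in plain Python. This is not EigenLake's implementation, which is not shown here; it is a toy version that assumes Euclidean distance, a deterministic farthest-point initialization, and a tiny 2-D dataset. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    """Tiny deterministic k-means: farthest-point init, then Lloyd iterations."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid is the point farthest from all current centroids.
        centroids.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    scores = []
    for i, p in enumerate(points):
        dist_by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                dist_by_cluster.setdefault(labels[j], []).append(euclidean(p, q))
        own = dist_by_cluster.pop(labels[i], None)
        if not own or not dist_by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in dist_by_cluster.values())  # nearest other
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def pick_cluster_count(points, min_clusters, max_clusters):
    """Score each candidate k and return the best k plus all scores."""
    scored = {
        k: silhouette(points, kmeans(points, k))
        for k in range(min_clusters, max_clusters + 1)
    }
    return max(scored, key=scored.get), scored


# Three well-separated blobs, so k=3 should score highest.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.0),
    (10.0, 0.0), (10.1, 0.2), (10.2, 0.0),
]
best_k, scores = pick_cluster_count(points, 2, 6)
print(best_k)
```

The silhouette score rewards clusterings where points sit close to their own cluster and far from the nearest other cluster, which is why both under-splitting and over-splitting the three blobs score lower here.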
Agent Mode
Use idx.agent.query(...) when the caller gives a natural language request and you want EigenLake to choose the action.
result = idx.agent.query("show me recent battery failures")
print(result["action"])
print(result["filter"])
for cluster in result["clusters"]:
    print(cluster["count"], cluster["summary"])
You can also pass a clustering algorithm and tuning options to agent mode:
result = idx.agent.query(
    "show me recent battery failures",
    algorithm="dbscan",
    dbscan_min_samples=4,
)
For this schema, the agent infers a filter similar to:
{
    "status": {"$in": ["failure"]},
    "created_at": {"$gte": "<recent-start-iso8601>"},
    "system": {"$eq": "battery"},
}
It also infers summary fields such as fault_code, symptom, and repair_note.
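To illustrate what schema-aware inference could look like, the sketch below matches query words against the enum values declared for filterable fields and adds a created_at cutoff when the query says "recent". This is a guess at the general approach, not the agent's actual logic. SCHEMA_ENUMS, infer_filter, and the naive plural handling are all hypothetical, and the sketch emits "$in" for every enum match where the agent's filter uses "$eq" for system.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical, simplified view of the schema: field name -> allowed enum values.
SCHEMA_ENUMS = {
    "system": ["battery", "charging", "brake", "powertrain"],
    "status": ["ok", "warning", "failure"],
    "severity": ["low", "medium", "high", "critical"],
}


def infer_filter(query, schema_enums, recent_days=30, now=None):
    """Map words in the query onto enum values and a created_at cutoff."""
    words = query.lower().split()
    inferred = {}
    for field, values in schema_enums.items():
        # Match singular or naive plural forms, e.g. "failures" -> "failure".
        hits = [v for v in values if v in words or v + "s" in words]
        if hits:
            inferred[field] = {"$in": hits}
    if "recent" in words:
        now = now or datetime.now(timezone.utc)
        cutoff = now - timedelta(days=recent_days)
        inferred["created_at"] = {"$gte": cutoff.isoformat()}
    return inferred


print(infer_filter("show me recent battery failures", SCHEMA_ENUMS))
```

Because matching is driven by the declared enum values, a field that is missing from the schema can never appear in the inferred filter, which mirrors the constraint described in the Index Schema section.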
In mode="auto", the agent currently uses simple, deterministic query hints. Queries containing clustering or failure-analysis language are routed to clustering. Other queries are routed to filtered record retrieval.
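Deterministic hint-based routing of this kind can be sketched in a few lines. The hint words in CLUSTER_HINTS and the function route_query are assumptions made for illustration; the actual hint list and the values reported in result["action"] are not documented here.

```python
# Assumed hint words; the real list used by EigenLake is not documented here.
CLUSTER_HINTS = ("cluster", "group", "pattern", "failure", "root cause")


def route_query(query, mode="auto"):
    """Return "cluster" or "filter" for a natural language query."""
    if mode in ("cluster", "filter"):
        return mode  # an explicit mode always wins over hints
    text = query.lower()
    # Substring match, so "failures" also triggers the "failure" hint.
    return "cluster" if any(hint in text for hint in CLUSTER_HINTS) else "filter"


print(route_query("show me recent battery failures"))
print(route_query("list vehicles with model X"))
```

A keyword check like this is predictable and easy to debug, at the cost of missing queries that describe clustering without using any of the hint words.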
You can force behavior:
idx.agent.query("recent failures", mode="cluster")
idx.agent.query("recent failures", mode="filter")
Advanced overrides are still available when a schema uses unusual field names:
idx.agent.query(
    "show me recent failures",
    failure_field="outcome",
    recent_days=30,
    text_fields=["description", "resolution"],
)
Status
Clustering is synchronous today. The API returns only after Lambda clustering completes.
def clustering_status(result: dict) -> dict:
    return {
        "status": "completed",
        "backend": result.get("backend"),
        "records_clustered": result.get("records_clustered"),
        "cluster_count": len(result.get("clusters") or []),
    }

print(clustering_status(clusters))
There is no queued job ID yet for clustering. If backend is lambda, the API invoked Lambda and waited for the response.
Future compute backends may include GPUs, Spark, and automatic compute selection based on request requirements.
Full Demo
See the support-ticket clustering notebook in the examples repository:
https://github.com/EigenLake-Org/eigenlake-clustering-demos/blob/main/notebooks/query_failures_clustering_demo.ipynb
The notebook indexes the Kaggle customer-support ticket dataset with Gemini embeddings, then runs both low-level clustering and agent clustering. The core clustering cells look like this after records are inserted:
critical_technical_filter = {
    "category": {"$eq": "Technical issue"},
    "priority": {"$eq": "Critical"},
}

kmeans_clusters = idx.search.cluster(
    filter=critical_technical_filter,
    limit=200,
    algorithm="kmeans",
    auto_tune=True,
    min_clusters=2,
    max_clusters=6,
    representatives_per_cluster=2,
)

print(kmeans_clusters["backend"])  # lambda
print(kmeans_clusters["parameters"])
print(kmeans_clusters["tuning"])
for cluster in kmeans_clusters["clusters"]:
    print(cluster["cluster_id"], cluster["count"], cluster["summary"])
Agent mode can run the same analysis from a natural-language request:
agent_result = idx.agent.query(
    "cluster critical technical issue tickets",
    mode="cluster",
    algorithm="dbscan",
    dbscan_min_samples=3,
)
print(agent_result["backend"]) # lambda
print(agent_result["noise_count"])
print([cluster["count"] for cluster in agent_result["clusters"]])
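With DBSCAN, noise_count is worth checking before trusting the clusters: a large share of noise points can mean eps is too small or min_samples too high. The sketch below computes that share. The helper noise_share is hypothetical, the sample values are made up, and the calculation assumes each cluster's count excludes noise points.

```python
def noise_share(result):
    """Fraction of records DBSCAN left unassigned (noise)."""
    noise = result.get("noise_count") or 0
    clustered = sum(cluster["count"] for cluster in result.get("clusters") or [])
    total = noise + clustered
    return noise / total if total else 0.0


# Illustrative values only, not output from the notebook.
sample_result = {"noise_count": 5, "clusters": [{"count": 12}, {"count": 3}]}
print(f"{noise_share(sample_result):.0%} of records were noise")
```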