YAMS Performance Benchmark Report

Generated: 2026-06-09
Last Updated: 2026-06-19
YAMS Version: 0.17.0
Build Configuration: Release microbenchmarks (meson compile -C build/release); live-mirror suite (meson compile -C builddir)
Host: Apple M4 Max, macOS, Clang 21, 16 cores

Contents

Executive Summary
Latest Local Refresh
Historical Comparisons
Core Microbenchmarks
KG Edge/Entity Insert Microbenchmarks
Simeon Lexical Rescoring Observations
Multi-Client Benchmarks
Storage Backends

Executive Summary

Coverage: API, metadata, core, IPC, tree, WriteCoordinator, multi-client, local storage, grep, search microbenchmarks, and live-mirror ingestion/retrieval.
Medium-document ingest (api_benchmarks, clean release): 363.6 ops/s p50.
IPC streaming (clean release): StreamingFramer_32x10 22,857 ops/s, UnaryFramer_8KB 233,046 ops/s — above the May 9 baselines.
Multi-client ingest: clean through 16 clients; the 32-client run recorded 416 add failures.
Mixed read/write search p95: 525 ms in the 4-client case, 41 ms in the 16-client preseeded case.
Grep microbenchmarks: literal search is now close to std::find; newline scan still trails memchr on this host.
High-entropy Zstd L3 (1 MiB): 0.11 ms p50, 9,308 ops/s (incompressible input, ratio ~1.0).
Live-mirror suite (June 19, M4 Max, Simeon embeddings): 500-document ingestion completed in ~983 ms median (~509 docs/s), down from the June 18 baseline of 1,086 ms (460.4 docs/s) after ingestion was centralized onto dedicated writers (MetadataInsertWriter + ContentIndexWriter). Retrieval over 200 queries holds MRR 1.0 and Recall@10 0.1628.

Latest Local Refresh

Throughput values use p50 latency where available.

Live-Mirror Ingestion And Retrieval (2026-06-19)

Validated artifact: /tmp/yams_live_mirror_suite_20260619_162112 (suite ingestion run 1,059 ms). Ingestion repeats (steady-state, poll 10 ms): 1,017 / 936 / 983 ms → median ~983 ms.
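The median and the docs/s figure follow directly from the three steady-state repeats and the 500-document corpus. A quick arithmetic check in plain Python (the variable names are illustrative and are not part of the suite):

```python
from statistics import median

# Steady-state ingestion repeats reported above, in milliseconds.
repeat_ms = [1017, 936, 983]
corpus_docs = 500  # corpus size used by the live-mirror suite

median_ms = median(repeat_ms)
docs_per_s = corpus_docs / (median_ms / 1000.0)

print(f"median wall time: {median_ms} ms")
print(f"throughput: {docs_per_s:.1f} docs/s")
```

The same division applied to the June 18 wall time of 1,086 ms gives the 460.4 docs/s baseline.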

Configuration:

Build dir: builddir
Dataset: synthetic live-mirror workload
Corpus: 500 documents
Retrieval queries: 200
Top K: 10
Embeddings: real Simeon (YAMS_EMBED_BACKEND=simeon, YAMS_BENCH_FORCE_MOCK_EMBEDDINGS=0)
Core systems enabled: plugin discovery, Glint, semantic graph/topology, post-ingest pipeline
Change since last refresh: the write path is centralized onto dedicated-thread writers (MetadataInsertWriter coalesces document inserts; ContentIndexWriter moves content-index commits off the shared io_context). The synchronous store path is now well below the embedding/KG enrichment floor, so wall time is enrichment-bound (embedding_generation / kg_extraction ≈ wall; metadata_storage ≈ 0.6× wall).

Workload	Result	Notes
Ingestion wall time	~983 ms median	`ingestion_e2e.json`; suite run 1,059 ms (vs 1,086 ms June 18)
Ingestion throughput	~509 docs/s	500-document run (vs 460.4 docs/s June 18)
Retrieval MRR	1.0000	`stage_trace.jsonl` `hybrid_summary`
Retrieval MAP	1.0000	Hybrid search
Retrieval nDCG@10	1.0000	Hybrid search
Retrieval Precision@10	1.0000	Hybrid search
Retrieval Recall@10	0.1628	Synthetic qrels are broad; exact grep baseline shares this recall ceiling
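Perfect Precision@10 alongside a low Recall@10 is expected when each query has many more than 10 relevant documents: recall at K cannot exceed K divided by the number of relevant documents. The sketch below uses the standard definitions of both metrics on an invented query with 60 relevant documents; the qrel size and document IDs are assumptions for illustration, not values taken from the suite.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


# Hypothetical query: 60 relevant documents, and a ranking whose
# first 10 hits are all relevant.
relevant = {f"doc{i}" for i in range(60)}
ranked = [f"doc{i}" for i in range(10)] + ["other1", "other2"]

p10 = precision_at_k(ranked, relevant, 10)
r10 = recall_at_k(ranked, relevant, 10)
print(f"P@10 = {p10:.4f}, R@10 = {r10:.4f}")
```

Under this reading, a Recall@10 of 0.1628 with full precision would correspond to roughly 60 relevant documents per query on average, which matches the note that the synthetic qrels are broad.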

Post-ingest timing signal from the suite ingestion run (1,059 ms):

Phase	Total	Calls	Notes
`process_batch`	549.3 ms	128	Full post-ingest batch processing (was 703.6 ms)
`commit_batch_results`	385.9 ms	128	Content/FTS metadata writes (was 422.1 ms)
`commit_content_index`	383.3 ms	128	Now on the `ContentIndexWriter` thread (was 420.2 ms)
`dispatch_successes`	27.1 ms	128	DispatchPlan + io_context decongestion (was 194.8 ms)
`prepare_metadata`	135.1 ms	128	Extraction-result preparation

The retrieval stage trace contains a valid hybrid_summary event. The xctrace hot-zone export is usable for coarse direction (total samples=95215000000), with top visible samples in prune classification, SQLite VM execution, search internals, WAL checksum, and Simeon NEON dot product. Treat those samples as directional, not as a replacement for focused phase timings.

API And Metadata

Refreshed 2026-06-19 from a clean release build (build/release, --with-tests; a stale -DTRACY_ENABLE flag that blocked the benchmark compile was cleared). These microbenchmarks exercise ContentStore / MetadataRepository paths that are largely independent of the June 2026 ingestion-pipeline centralization work; the gains versus the June 9 run mostly reflect the clean (non-instrumented) build rather than that work.

Benchmark	p50 Latency	p50 Throughput	Notes
`Ingestion_SmallDocument`	0.20 ms	5,068 ops/s	1 KB document
`Ingestion_MediumDocument`	2.75 ms	363.6 ops/s	100 KB document
`Metadata_SingleUpdate`	6.36 ms	15,711 ops/s	100 updates/iteration over 1,000 docs
`Metadata_BulkUpdate(500)`	2.14 ms	234,048 ops/s	500 metadata entries/batch

IPC Streaming

Benchmark	p50 Latency	p50 Throughput	Notes
`StreamingFramer_32x10_256B`	0.48 ms	22,857 ops/s	10 chunks, 32 results/chunk
`StreamingFramer_64x6_512B`	1.01 ms	6,908 ops/s	6 chunks, 64 results/chunk
`UnaryFramer_Success_8KB`	< 0.01 ms	233,046 ops/s	8 KB payload

Tree Builder

Files	Latency	Throughput	Notes
100	28.5 ms	3,505/s	512-byte files
500	98.1 ms	5,098/s	512-byte files
1,000	200.2 ms	4,994/s	512-byte files
5,000	955.6 ms	5,232/s	512-byte files
10,000	1,935.1 ms	5,167/s	linear scaling still holds
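One way to read the "linear scaling" note is to convert each row into a per-file cost: if latency grows linearly with file count, that cost should stay roughly constant. A small check using only the numbers in the table (the helper name is illustrative):

```python
# (file count, latency in ms) copied from the Tree Builder table.
runs = [(100, 28.5), (500, 98.1), (1000, 200.2), (5000, 955.6), (10000, 1935.1)]


def per_file_us(files, latency_ms):
    """Average cost per file in microseconds."""
    return latency_ms * 1000.0 / files


for files, latency_ms in runs:
    throughput = files / (latency_ms / 1000.0)
    print(f"{files:>6} files: {per_file_us(files, latency_ms):6.1f} us/file, "
          f"{throughput:7.0f} files/s")

# From 500 files upward the per-file cost stays in a narrow band; the
# 100-file run shows a higher per-file cost.
steady = [per_file_us(f, ms) for f, ms in runs if f >= 500]
spread = max(steady) / min(steady)
print(f"max/min per-file cost for >= 500 files: {spread:.3f}")
```

For the runs of 500 files and more, the highest and lowest per-file costs differ by less than 5%.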

WriteCoordinator

Phase	Elapsed	Metadata	Relationships	Nodes	Edges	Max Apply
Cold ingest, 100 files	48.3 ms	46	0	100	50	9 ms
Version churn, 3 iter	8.7 ms	1,108	79	1,400	700	4,706 ms
Final totals	-	1,833	96	1,600	800	4,706 ms

Historical Comparisons

Headline API And IPC Throughput

Benchmark	Jan 2026	Apr 30	May 9	June 9 Local	June 19 (clean)	June 9 -> June 19
`Ingestion_SmallDocument`	2,821 ops/s	4,896 ops/s	3,550 ops/s	3,163 ops/s	5,068 ops/s	+60.2%
`Ingestion_MediumDocument`	57 ops/s	336 ops/s	129 ops/s	254.5 ops/s	363.6 ops/s	+42.9%
`Metadata_SingleUpdate`	13,966 ops/s	14,022 ops/s	12,101 ops/s	6,723 ops/s	15,711 ops/s	+133.7%
`Metadata_BulkUpdate(500)`	51,341 ops/s	196,852 ops/s	102,742 ops/s	161,429 ops/s	234,048 ops/s	+45.0%
`IPC StreamingFramer_32x10`	3,732 ops/s	20,976 ops/s	5,837 ops/s	17,739 ops/s	22,857 ops/s	+28.9%
`IPC UnaryFramer_8KB`	10,088 ops/s	221,453 ops/s	15,889 ops/s	183,908 ops/s	233,046 ops/s	+26.7%

The June 9 column comes from an instrumented, slower local run, while the June 19 column is a clean release rebuild on the same host. Most of the June 9 → June 19 gain therefore reflects build state rather than a specific code change.
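The last column of the table is the relative change between the June 9 and June 19 throughput values. Recomputing it from the two columns, as a sketch with an illustrative helper name:

```python
# (benchmark, June 9 ops/s, June 19 ops/s) copied from the table above.
rows = [
    ("Ingestion_SmallDocument", 3163, 5068),
    ("Ingestion_MediumDocument", 254.5, 363.6),
    ("Metadata_SingleUpdate", 6723, 15711),
    ("Metadata_BulkUpdate(500)", 161429, 234048),
    ("IPC StreamingFramer_32x10", 17739, 22857),
    ("IPC UnaryFramer_8KB", 183908, 233046),
]


def pct_gain(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


deltas = {name: pct_gain(old, new) for name, old, new in rows}
for name, delta in deltas.items():
    print(f"{name:28s} {delta:+.1f}%")
```

All six recomputed values round to the percentages listed in the table.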

Previous Debug Refresh (M4, 2026-04-08)

Benchmark	Throughput	Delta vs Apr 7	Notes
`Ingestion_SmallDocument`	3,378 ops/s	+15.7%	1 KB document
`Ingestion_MediumDocument`	106 ops/s	+8.2%	100 KB document
`Metadata_SingleUpdate`	12,038 ops/s	+26.4%	1,000 docs
`Metadata_BulkUpdate(500)`	150,875 ops/s	+8.7%	500 updates/batch
`StreamingFramer_32x10_256B`	4,680 ops/s	+4.9%	IPC streaming
`StreamingFramer_64x6_512B`	1,780 ops/s	+5.6%	IPC streaming
`UnaryFramer_Success_8KB`	13,158 ops/s	+9.5%	IPC unary

Previous Release Refresh (M3, 2026-02-12)

Benchmark	Latency	Throughput
`Ingestion_SmallDocument`	0.23 ms	4,329 ops/s
`Ingestion_MediumDocument`	3.26 ms	307 ops/s
`Metadata_SingleUpdate`	6.57 ms	15,232 ops/s
`Metadata_BulkUpdate`	2.75 ms	181,818 ops/s
`StreamingFramer_32x10_256B`	0.66 ms	16,579 ops/s
`StreamingFramer_64x6_512B`	1.27 ms	5,531 ops/s
`UnaryFramer_Success_8KB`	0.02 ms	50,000 ops/s

Previous Multi-Client Results

Test	Clients	Throughput	Add p50	Add p95	Notes
Baseline single client	1	83.2 docs/s	11.2 ms	11.3 ms	Apr 8 debug
Concurrent pure ingest	4	244.0 docs/s	11.2 ms	11.3 ms	Apr 8 debug
Mixed read/write	16	200.2 ops/s	15.9 ms	82.6 ms	0 failures
Mixed read/write clean tier	68	278.8 ops/s	10.9 ms	502.3 ms	17/17/17/17 layout
Connection contention	32 burst	1,524.8 ops/s	11.5 ms	38.6 ms	0 retry-after responses

Previous Scaling Validation

Clients	Aggregate Throughput	Per-client Throughput	Efficiency	Memory
1	47.3 docs/s	47.4 docs/s	100.0%	80.8 MB
2	95.3 docs/s	47.7 docs/s	100.7%	83.1 MB
4	192.3 docs/s	48.1 docs/s	101.6%	87.2 MB
8	380.1 docs/s	47.6 docs/s	100.4%	88.8 MB
16	762.1 docs/s	47.8 docs/s	100.7%	91.3 MB
32	1,468.3 docs/s	46.1 docs/s	97.0%	97.3 MB
64	2,937.3 docs/s	47.5 docs/s	97.0%	107.5 MB
80	3,703.1 docs/s	48.6 docs/s	97.8%	115.3 MB

Core Microbenchmarks

Benchmark	p50 Latency	p50 Throughput	Notes
`Hashing_SHA256_1KB`	< 0.01 ms	1,777,778 ops/s	tiny input overhead dominates
`Hashing_SHA256_1MB`	0.30 ms	3,286 ops/s	about 3.29 GB/s
`Chunking_Rabin_1MB`	1.22 ms	69,647 chunks/s	94 chunks/iteration
`Compression_Zstd_10KB_Text_L3`	< 0.01 ms	1,000,000 ops/s	71-byte output
`Compression_Zstd_1MB_Text_L9`	0.21 ms	4,742 ops/s	158-byte output
`Compression_Zstd_1MB_HighEntropy_L3`	0.11 ms	9,308 ops/s	not smaller than input
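The byte-rate note on Hashing_SHA256_1MB depends on whether the 1 MB input is 10^6 or 2^20 bytes, which the table does not state. The sketch below converts the reported ops/s under both assumptions (the helper name is illustrative):

```python
def byte_rate(ops_per_s, input_bytes):
    """Return (GB/s, GiB/s) for a benchmark that processes input_bytes per op."""
    bytes_per_s = ops_per_s * input_bytes
    return bytes_per_s / 10**9, bytes_per_s / 2**30


ops_per_s = 3286  # Hashing_SHA256_1MB p50 throughput from the table

# The actual input size is an assumption in both cases.
for label, size in (("10^6-byte input", 10**6), ("2^20-byte input", 2**20)):
    gb, gib = byte_rate(ops_per_s, size)
    print(f"{label}: {gb:.2f} GB/s, {gib:.2f} GiB/s")
```

With a decimal megabyte the rate is about 3.29 GB/s; with a 2^20-byte input it is about 3.21 GiB/s.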

Grep Algorithmic Microbenchmarks

Case	Current Path	Comparator	Result
Short pattern, 1 MiB	BMH p50 0.63 ms	`std::find` p50 0.65 ms	BMH is faster
Medium pattern, 1 MiB	BMH p50 0.60 ms	`std::find` p50 0.60 ms	Tie
Long pattern, 1 MiB	BMH p50 0.62 ms	`std::find` p50 0.61 ms	std::find is faster
Newlines, 10 MiB	SIMD p50 1.06 ms	`memchr` p50 0.92 ms	memchr is faster
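For readers unfamiliar with the literal-search path, here is a minimal sketch of the Boyer-Moore-Horspool algorithm, assuming that is what "BMH" denotes in the table. It illustrates the general technique only and is not the YAMS implementation; the function name is hypothetical.

```python
def bmh_find(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Bad-character table: how far the window may shift, keyed by the
    # haystack byte aligned with the window's last position. Bytes that do
    # not occur in needle[:-1] allow a full shift of m.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i

    pos = 0
    while pos <= n - m:
        # A slice comparison stands in for the usual right-to-left byte loop.
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift[haystack[pos + m - 1]]
    return -1


text = b"metadata_storage and content_index commit"
print(bmh_find(text, b"content_index"))
print(bmh_find(text, b"missing"))
```

The skip table lets the window jump by up to the pattern length per step. In the table above the BMH and std::find timings are within a few hundredths of a millisecond of each other at every pattern length.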

Multi-Client Benchmarks

Vectors/model loading disabled. In-process daemon harness. ASAN+coverage lane.

Test	Clients	Throughput	Latency Snapshot	Failures
Baseline single client	1	426 docs/s	add p50 1.27 ms, p95 1.33 ms	0
Concurrent pure ingest	4	781 docs/s	add p50 1.19 ms, p95 1.38 ms	0
Mixed read/write	4	54 ops/s	add p95 2.71 ms, search p95 525.55 ms, list p95 6.30 ms	0
16-client mixed ops	16	1,089 ops/s	add p95 6.45 ms, search p95 41.13 ms, list p95 14.00 ms	0
Connection contention	16 burst	1,326 ops/s	op p50 5.29 ms, p95 81.13 ms	0; retry-after 0

Scaling Curve

Clients	Aggregate Throughput	Per-client Throughput	Efficiency	Memory	Failures
1	428.8 docs/s	429.6 docs/s	100.0%	417.4 MB	0
2	781.0 docs/s	398.5 docs/s	91.1%	532.2 MB	0
4	1,644.9 docs/s	422.7 docs/s	95.9%	561.5 MB	0
8	3,101.7 docs/s	399.1 docs/s	90.4%	599.1 MB	0
16	4,421.4 docs/s	295.2 docs/s	64.5%	625.2 MB	0
32	1,225.4 docs/s	62.2 docs/s	8.9%	719.8 MB	416
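The Efficiency column appears consistent with aggregate throughput divided by perfect linear scaling of the single-client rate. That formula is inferred from the numbers rather than stated by the harness, so treat the sketch below as an assumption:

```python
# (clients, aggregate docs/s) copied from the Scaling Curve table.
curve = [(1, 428.8), (2, 781.0), (4, 1644.9), (8, 3101.7), (16, 4421.4), (32, 1225.4)]

baseline = curve[0][1]  # single-client aggregate throughput


def efficiency(clients, aggregate):
    """Aggregate throughput as a percentage of ideal linear scaling."""
    return aggregate / (clients * baseline) * 100.0


for clients, aggregate in curve:
    print(f"{clients:>2} clients: {efficiency(clients, aggregate):5.1f}%")
```

The recomputed values match the table to within about 0.1 percentage point. They show the same picture: scaling stays above 90% through 8 clients, drops at 16, and collapses at 32, where the add failures appear.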

KG Edge/Entity Insert Microbenchmarks

New (2026-06-24) — kg_edge_insert_bench executable. Covers the addEdgesUnique, addEdges (no dedup), upsertNodes, and WriteBatch insert paths at varying batch sizes. These metrics form the baseline for the KGWriteBuffer optimization (#2).

Edge Insert Throughput

Benchmark	Batch Size	p50 Throughput	Notes
`AddEdgesUnique_SingleTx`	100	194,367 edges/s	dedup in single transaction
`AddEdgesUnique_SingleTx`	500	196,781 edges/s
`AddEdgesUnique_SingleTx`	1000	194,126 edges/s
`AddEdgesUnique_OneByOne`	100	14,627 edges/s	per-edge insert + dedup (13.3× slower than bulk)
`AddEdgesUnique_OneByOne`	500	13,717 edges/s
`AddEdges_NoDedup_Bulk`	100	123,609 edges/s	no dedup check (upper bound)
`AddEdges_NoDedup_Bulk`	500	121,065 edges/s
`AddEdges_NoDedup_Bulk`	1000	115,969 edges/s
`WriteBatch_Edges`	100	192,270 edges/s	explicit WriteBatch path
`WriteBatch_Edges`	500	188,565 edges/s
`WriteBatch_Edges`	1000	197,334 edges/s

Node Upsert Throughput

Benchmark	Batch Size	p50 Throughput	Notes
`UpsertNodes_Bulk`	100	14,689 nodes/s	bulk transaction
`UpsertNodes_Bulk`	500	9,643 nodes/s
`UpsertNodes_Bulk`	1000	11,056 nodes/s
`UpsertNodes_OneByOne`	100	15,357 nodes/s	individual upserts (small-batch parity with bulk)
`UpsertNodes_OneByOne`	500	10,537 nodes/s

Target for KGWriteBuffer (#2)

The one-by-one insert paths simulate the current ingest behavior where post-ingest KG extraction yields per-document edges/entities without batching. The KGWriteBuffer should bring one-by-one throughput close to the bulk path by accumulating writes in memory and flushing in batch.

Expected improvement: 2-5× edges/s in the one-by-one path, approaching the AddEdges_NoDedup_Bulk upper bound.
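The accumulate-then-flush idea can be sketched as follows. This is a hypothetical illustration in Python with an in-memory SQLite table; the class name, schema, and threshold are assumptions, and it is not the planned KGWriteBuffer design.

```python
import sqlite3


class EdgeWriteBuffer:
    """Hypothetical buffer: collects edges in memory, writes them in one transaction."""

    def __init__(self, conn, flush_threshold=500):
        self.conn = conn
        self.flush_threshold = flush_threshold
        self.pending = []
        self.flushes = 0

    def add(self, src, dst, relation):
        self.pending.append((src, dst, relation))
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One transaction per batch; INSERT OR IGNORE relies on the UNIQUE
        # constraint to drop duplicate edges.
        with self.conn:
            self.conn.executemany(
                "INSERT OR IGNORE INTO edges (src, dst, relation) VALUES (?, ?, ?)",
                self.pending,
            )
        self.pending.clear()
        self.flushes += 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges (src INTEGER, dst INTEGER, relation TEXT, "
    "UNIQUE (src, dst, relation))"
)

buf = EdgeWriteBuffer(conn, flush_threshold=100)
for doc in range(250):
    buf.add(doc, doc + 1, "references")  # one edge per document
    buf.add(doc, doc + 1, "references")  # duplicate, dropped at flush time
buf.flush()  # write any final partial batch

edge_count = conn.execute("SELECT COUNT(*) FROM edges").fetchone()[0]
print(edge_count, buf.flushes)
```

Callers still hand over edges one at a time, but the database sees a handful of batched transactions instead of one per edge. The one-by-one versus single-transaction rows above measure that same difference.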

Simeon Lexical Rescoring Observations

Captured from the live-mirror suite retrieval stage trace. The Simeon lexical backend (SimeonLexicalBackend::score()) runs per-component after FTS5 candidate retrieval.

Observed (June 19 suite, M4 Max): The Simeon NEON dot product appears in xctrace hot-zone samples, confirming SIMD acceleration is active. The hybrid_summary event in the stage trace captures fusion pipeline timing but does not break out the Simeon lexical component separately from text scoring.

Gap for hot-term cache (#3): No per-term score() latency differentiation exists in the current baseline. The hot-term posting list cache optimization needs:

P50/P95 score() latency for the 100 most frequent (lowest-IDF) terms, e.g. “function”, “class”, “return”
P50/P95 score() latency for the 100 rarest (highest-IDF) terms (rare symbols)
Cache hit rate under the live-mirror query workload

These will be instrumented in task #3 via SearchEngineConfig::includeComponentTiming and new SearchEngine::Statistics counters (simeonLexicalCacheHits, simeonLexicalScoreMicros).
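Once per-term timings exist, the three numbers listed above can be derived from raw samples as sketched below. The latency values are made up for illustration, and the helper names are hypothetical; only the nearest-rank percentile and hit-rate arithmetic are the point.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def hit_rate(hits, lookups):
    """Cache hits as a fraction of lookups (0.0 when there were none)."""
    return hits / lookups if lookups else 0.0


# Made-up score() latencies in microseconds, for illustration only.
frequent_term_us = [120, 135, 128, 410, 131, 126, 133, 129, 127, 900]
rare_term_us = [12, 15, 11, 14, 13, 12, 40, 13, 12, 11]

for label, samples in (("frequent", frequent_term_us), ("rare", rare_term_us)):
    print(f"{label}: p50={percentile(samples, 50)} us, p95={percentile(samples, 95)} us")

print(f"cache hit rate: {hit_rate(150, 200):.2%}")
```

Reporting P95 alongside P50 matters here because a posting-list cache mainly removes the slow tail for frequent terms, which a median alone would hide.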

Storage Backends

See Storage Backends for local vs S3-compatible backend comparisons. docs/benchmarks/README.md