Deduplicate once.
Every future corpus remembers.

Kociol is a Rust deduplication engine with a persistent cross-corpus index. Ingest CommonCrawl today — tomorrow's dataset is automatically checked against it. Exact and near-duplicate detection in a single pass. No re-scanning. Ever.

Built for ML infrastructure teams processing 50 GB+ of text across multiple training runs. If you're running stateless dedup scripts and discarding the index after each corpus, Kociol eliminates that problem permanently.

>112 MB/s
Peak exact dedup throughput
100%
Near-dup recall at Jaccard ≥ 0.85
+37%
Extra dups vs. exact-only
0
Corpora re-scanned when checking new data

Cross-corpus persistence.
No other tool does this.

Every Python dedup script, every one-off MinHash run, every DataTrove pipeline — they all throw the index away when they finish. Kociol keeps it.

Run CC-100 on Monday. Deduplicate against it on Friday.

Kociol's sharded database accumulates fingerprints across every corpus it processes. Ingest CommonCrawl today — every document from any future corpus is automatically checked against it without re-scanning anything. The IDF table absorbs language statistics from every run, making near-duplicate detection more accurate over time. No re-indexing. No re-scanning.
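To make the idea concrete, here is a minimal stdlib-only sketch of a persistent sharded fingerprint index. Everything in it is an assumption for illustration: `DefaultHasher` stands in for xxHash3-128, the shard count and the `shard_*.idx` text-file layout are invented, and the real engine's bloom pre-check and batch API are omitted. The one property it does demonstrate is the real one: fingerprints written during one run are still there when a later run opens the same index.

```rust
use std::collections::HashSet;
use std::fs::{self, OpenOptions};
use std::hash::{Hash, Hasher};
use std::io::{BufRead, BufReader, Write};
use std::path::{Path, PathBuf};

const NUM_SHARDS: u64 = 16; // illustrative; real shard counts would be tuned

/// Stand-in 64-bit fingerprint; the real engine uses xxHash3-128.
fn fingerprint(text: &str) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}

/// Route a fingerprint to one of NUM_SHARDS on-disk shard files.
fn shard_path(root: &Path, fp: u64) -> PathBuf {
    root.join(format!("shard_{:02}.idx", fp % NUM_SHARDS))
}

/// Check-and-insert against the persistent shard. Returns true if the
/// chunk is new; the fingerprint survives on disk for every later run.
fn ingest(root: &Path, text: &str) -> std::io::Result<bool> {
    let fp = fingerprint(text);
    let path = shard_path(root, fp);
    let seen: HashSet<u64> = match fs::File::open(&path) {
        Ok(f) => BufReader::new(f)
            .lines()
            .filter_map(|l| l.ok()?.parse().ok())
            .collect(),
        Err(_) => HashSet::new(), // shard not created yet
    };
    if seen.contains(&fp) {
        return Ok(false); // duplicate of a chunk from any previous corpus
    }
    let mut f = OpenOptions::new().create(true).append(true).open(&path)?;
    writeln!(f, "{}", fp)?;
    Ok(true)
}

fn main() -> std::io::Result<()> {
    let root = std::env::temp_dir().join("kociol_sketch");
    let _ = fs::remove_dir_all(&root);
    fs::create_dir_all(&root)?;

    // "Monday": ingest the first corpus.
    println!("{}", ingest(&root, "the quick brown fox")?); // true
    // "Friday": a later corpus re-opens the same on-disk index.
    println!("{}", ingest(&root, "the quick brown fox")?); // false
    println!("{}", ingest(&root, "an unseen paragraph")?); // true
    Ok(())
}
```

The Friday lookup costs one shard read, not a re-scan of Monday's corpus — that cost model is the point of the persistence layer.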

If you're running a single corpus once and discarding the index, Kociol is not the right fit. Tools like DataTrove or standalone MinHash scripts handle that case well. Kociol is for teams running multiple corpora over time who want deduplication to compound — each run making the next one more accurate, with zero re-scanning cost.


Started as genomic deduplication. Pivoted to AI training data.

Version 8 was a genomic sequencing deduplication engine — detecting duplicate reads in FASTQ/BAM data using a similar sharded index architecture. The genomic space has well-established tooling (Picard, samtools markdup) that is deeply integrated into existing pipelines. The marginal improvement over those tools did not justify displacing them.

The core algorithmic work — persistent sharded bloom+index, MinHash LSH with IDF weighting, batch-parallel fingerprinting — is more valuable applied to unstructured text corpora, where no equivalent persistent cross-corpus solution exists at production scale.

CommonCrawl, The Pile, ROOTS, and every major AI training corpus uses stateless dedup scripts that discard the index after each run. The problem is not the algorithm — it is the missing persistence layer. That is what Kociol solves.

v9 was a clean rewrite targeting AI training pipelines specifically: paragraph-level chunking, JSONL/plain text ingest, Python wheel, C FFI. The genomic index format (v8) is archived; v9 shares none of its code.


Two dedup layers.
One persistent index.

Kociol runs both exact and near-duplicate detection in a single pipeline pass, writing all fingerprints to a sharded on-disk database that persists across corpus runs. Embeds in any language via Rust API, Python wheel, or C FFI.

Layer 1 — Exact Deduplication

NFKC normalization + case-fold + whitespace collapse, then xxHash3-128 fingerprint. Catches Unicode variants, case differences, and formatting noise that byte comparison misses. Bloom filter pre-check keeps disk reads near zero for already-seen content.

WikiText-103 ingest 50.8 MB/s
CommonCrawl 100GB ingest 112.3 MB/s
Filter throughput 104 MB/s
False positive rate < 0.001%
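The normalize-then-hash pipeline and the bloom pre-check can be sketched in a few dozen lines of stdlib Rust. This is an illustration under stated assumptions, not Kociol's implementation: `DefaultHasher` stands in for xxHash3-128, NFKC normalization is omitted (it needs a Unicode crate), and the toy bloom filter uses a fixed 64 Kibit array with two probes rather than tuned parameters.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Case-fold and collapse runs of whitespace. The real layer also
/// applies NFKC first, so Unicode variants collapse as well.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase()
}

/// Stand-in 64-bit fingerprint; the real engine uses xxHash3-128.
fn fingerprint(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    normalize(text).hash(&mut h);
    h.finish()
}

/// Toy bloom filter: two probes into a fixed bit array. A negative
/// answer is definite, so unseen content skips the disk lookup.
struct Bloom {
    bits: Vec<u64>,
}

impl Bloom {
    fn new() -> Self {
        Bloom { bits: vec![0; 1 << 10] } // 64 Kibit, illustrative size
    }
    fn probes(fp: u64) -> [u64; 2] {
        [fp, fp.rotate_left(32) ^ 0x9e37_79b9_7f4a_7c15]
    }
    fn insert(&mut self, fp: u64) {
        for p in Self::probes(fp) {
            let i = (p as usize) % (self.bits.len() * 64);
            self.bits[i / 64] |= 1u64 << (i % 64);
        }
    }
    fn maybe_contains(&self, fp: u64) -> bool {
        Self::probes(fp).iter().all(|&p| {
            let i = (p as usize) % (self.bits.len() * 64);
            self.bits[i / 64] & (1u64 << (i % 64)) != 0
        })
    }
}

fn main() {
    // Case and whitespace noise collapse to one fingerprint.
    assert_eq!(fingerprint("Hello   World"), fingerprint("hello world"));

    // Bloom pre-check: inserted content always answers "maybe seen".
    let mut bloom = Bloom::new();
    bloom.insert(fingerprint("some paragraph"));
    assert!(bloom.maybe_contains(fingerprint("SOME  paragraph")));
    println!("ok");
}
```

Because the bloom filter can only err toward false positives, a "not seen" answer is trustworthy on its own — which is what keeps disk reads near zero on fresh content.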

Layer 2 — Near-Duplicate Detection

MinHash LSH with IDF-weighted character 5-gram shingling. Catches reformatted, lightly edited, and paraphrased duplicates that exact hashing misses entirely. The IDF table persists and improves with every corpus run — the longer you run Kociol, the more accurately it weights distinctive vs. common phrases.

Recall at Jaccard ≥ 0.85 100%
Additional dups vs. exact-only +37%
Cross-corpus false positives 0%

Near-dedup ingest: 64–150 MB/s with the batch-parallelized implementation. Paragraph-level granularity with full disk persistence.
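A compact sketch of the MinHash LSH scheme, under explicit simplifications: seeded `DefaultHasher` calls stand in for real MinHash permutations, the 64-slot signature and 16-band layout are illustrative choices, and the persistent IDF weighting is omitted entirely (every shingle counts equally here). It shows the mechanics the prose describes: character 5-gram shingling, per-permutation minimum hashes, band keys for bucketing, and Jaccard estimation from signature agreement.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

const NUM_HASHES: usize = 64; // signature length, illustrative
const BANDS: usize = 16;      // 16 bands x 4 rows, illustrative

fn hash_with_seed(s: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    s.hash(&mut h);
    h.finish()
}

/// Character 5-gram shingles. The real layer weights these by a
/// persistent IDF table; here every shingle counts equally.
fn shingles(text: &str) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() < 5 {
        return [text.to_string()].into_iter().collect();
    }
    (0..=chars.len() - 5)
        .map(|i| chars[i..i + 5].iter().collect())
        .collect()
}

/// MinHash signature: for each seeded "permutation", keep the
/// minimum hash over the document's shingle set.
fn signature(text: &str) -> Vec<u64> {
    let sh = shingles(text);
    (0..NUM_HASHES as u64)
        .map(|seed| sh.iter().map(|s| hash_with_seed(s, seed)).min().unwrap())
        .collect()
}

/// LSH band keys: two docs become candidates if any band key matches.
fn band_keys(sig: &[u64]) -> Vec<u64> {
    sig.chunks(NUM_HASHES / BANDS)
        .map(|band| {
            let mut h = DefaultHasher::new();
            band.hash(&mut h);
            h.finish()
        })
        .collect()
}

/// Estimated Jaccard similarity = fraction of agreeing signature slots.
fn est_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}

fn main() {
    let doc = "The quick brown fox jumps over the lazy dog near the river bank.";
    let edited = "The quick brown fox jumps over the lazy dog near the river bend.";
    let other = "Completely different text about deduplication engines in Rust.";

    let (s1, s2, s3) = (signature(doc), signature(edited), signature(other));
    let (b1, b2) = (band_keys(&s1), band_keys(&s2));
    let shared = b1.iter().zip(b2.iter()).filter(|(a, b)| a == b).count();

    println!("near-dup est {:.2}", est_jaccard(&s1, &s2));
    println!("unrelated est {:.2}", est_jaccard(&s1, &s3));
    println!("shared bands {}", shared);
}
```

The lightly edited pair scores a high estimated Jaccard and collides in shared LSH bands, while the unrelated document scores near zero — exactly the separation that lets the layer catch edits that exact hashing misses.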

// Rust API — exact dedup
let db = PatternDb::open_or_create(config)?;
let results = db.store_batch(&chunks)?; // returns (hash, is_new) per chunk

// Near-dedup — persistent across corpus runs
let near_db = NearDb::open_or_create(config)?;
let candidates = near_db.ingest_batch_uniform(&texts)?; // returns duplicate doc IDs

// C FFI — embed in Python, Go, or any language
void* db = kociol_db_open("/data/corpus.kdb");
int  is_new = kociol_db_ingest(db, text);

Runs as a standalone binary or Python wheel. Integrates with any workflow orchestrator — Airflow, Ray, Prefect, or custom pipelines — via Python or C FFI interface.


Measured on real corpora.

Exact Dedup — Large Scale

WikiText 10GB 66.5 MB/s
CommonCrawl 100GB 112.3 MB/s
Filter throughput 104 MB/s

Ryzen 7 5700X, 8-core

Exact Dedup — Small Corpus

AG News (34 MB) 27.5 MB/s
WikiText-103 (270 MB) 50.8 MB/s

Ryzen 5 4500U, 6-core laptop

Near-Dedup — Ingest

Throughput range 64–150 MB/s
Near-dup rate (synthetic 30%) 36.9%
Cross-corpus false positives 0%

Batch-parallelized implementation

Benchmark methodology: Tested on Ryzen 7 5700X (8-core) at 100GB CommonCrawl scale. Scales linearly with core count. Small-corpus results on Ryzen 5 4500U laptop included for reference. Paragraph-level granularity with full disk persistence throughout. Benchmark scripts and dataset preparation available to evaluation customers.

What's built. What's next.

The core pipeline is production-ready. These are the next engineering milestones.

Shipped

  • Exact dedup: NFKC normalize → xxHash3-128 → persistent sharded index
  • Near-dedup: MinHash LSH, IDF-weighted, batch-parallelized (64–150 MB/s)
  • Cross-corpus persistence: validated at 100GB scale
  • Python wheel (cp38 abi3, manylinux)
  • C FFI interface
  • 73/73 unit tests passing

In progress / planned

  • Layer 3 centroid store — cluster centroids form naturally as docs accumulate, enabling O(centroids) pre-filter before full LSH lookup
  • Distributed shard support — multi-node deployment for corpora exceeding single-machine RAM
  • Semantic near-dedup — embedding-based similarity as an optional Layer 4 above MinHash
  • WASM target — run the exact dedup layer in-browser or in edge compute

What does it cost to build this internally?

Estimated for a senior ML infrastructure engineer at a US AI lab. Conservative figures based on published salary data and engineering blog post timelines.

Build internally
Engineer cost (6–12 months) $100K–$350K
QA & integration work $30K–$80K
Ongoing maintenance/year $40K–$80K
Cross-corpus persistence +$50K–$100K
Year 1 total $220K–$610K

Does not include the cost of getting the architecture wrong once and rebuilding.

License Kociol
60–90 day evaluation on your corpora
Perpetual production license
Source code in escrow from day one
Integration support included
Annual support renewal available
Evaluation fee credited toward license
Write to discuss
Licensing is structured around corpus scale and deployment model. Kociol has no GPL dependencies — full dependency license inventory available to evaluation customers. Contracts are governed by Norwegian law by default; US or EU governing law available on request.

Built and maintained by one person.
Engineered to last without them.

Infrastructure IP from a solo founder carries vendor risk. Kociol is designed to eliminate it.

Daniel Maslany — Norway

Kociol is built and maintained by Daniel Maslany, based in Norway. The engine is written in Rust (~4,000 lines), ships with 73/73 unit tests passing, and has been benchmarked against WikiText-103, AG News, and CommonCrawl-scale corpora up to 100GB. All benchmark scripts and methodology are available to evaluation customers — the numbers are reproducible. Source code is held in escrow from day one of a Commercial license: if the vendor ceases operations, the full codebase transfers to you automatically.

Rust 73/73 tests Source escrow No GPL deps Norway / EU

Get in touch.

Describe your corpus scale and pipeline stack. You'll receive a response within one business day: whether Kociol is a fit, and what an evaluation would look like.

Async, no sales calls required Response within 1 business day Evaluation scoped to your actual corpora
Get in touch