Deduplicate once.
Every future corpus remembers.

Kociol is a Rust deduplication engine with a persistent cross-corpus index. Ingest CommonCrawl today — tomorrow's dataset is automatically checked against it. Exact and near-duplicate detection in a single pass. No re-scanning. Ever.

Built for ML infrastructure teams processing 50 GB+ of text across multiple training runs. If you're running stateless dedup scripts and discarding the index after each corpus, Kociol eliminates that problem permanently.

>112 MB/s
Peak exact dedup throughput
100%
Near-dup recall at Jaccard ≥ 0.85
+37%
Extra dups vs. exact-only
0
Corpora re-scanned when checking new data

Cross-corpus persistence.
No other tool does this.

Every Python dedup script, every one-off MinHash run, every DataTrove pipeline — they all throw the index away when they finish. Kociol keeps it.

Run CC-100 on Monday. Deduplicate against it on Friday.

Kociol's sharded database accumulates fingerprints across every corpus it processes. Ingest CommonCrawl today — every document from any future corpus is automatically checked against it without re-scanning anything. The IDF table absorbs language statistics from every run, making near-duplicate detection more accurate over time. No re-indexing. No re-scanning.
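To make the idea concrete, here is a minimal stdlib-only sketch of a persistent sharded fingerprint index. Everything in it is an assumption for illustration: `DefaultHasher` stands in for xxHash3-128, the shard count and the `shard_*.idx` text-file layout are invented, and the real engine's bloom pre-check and batch API are omitted. The one property it does demonstrate is the real one: fingerprints written during one run are still there when a later run opens the same index.

```rust
use std::collections::HashSet;
use std::fs::{self, OpenOptions};
use std::hash::{Hash, Hasher};
use std::io::{BufRead, BufReader, Write};
use std::path::{Path, PathBuf};

const NUM_SHARDS: u64 = 16; // illustrative; real shard counts would be tuned

/// Stand-in 64-bit fingerprint; the real engine uses xxHash3-128.
fn fingerprint(text: &str) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}

/// Route a fingerprint to one of NUM_SHARDS on-disk shard files.
fn shard_path(root: &Path, fp: u64) -> PathBuf {
    root.join(format!("shard_{:02}.idx", fp % NUM_SHARDS))
}

/// Check-and-insert against the persistent shard. Returns true if the
/// chunk is new; the fingerprint survives on disk for every later run.
fn ingest(root: &Path, text: &str) -> std::io::Result<bool> {
    let fp = fingerprint(text);
    let path = shard_path(root, fp);
    let seen: HashSet<u64> = match fs::File::open(&path) {
        Ok(f) => BufReader::new(f)
            .lines()
            .filter_map(|l| l.ok()?.parse().ok())
            .collect(),
        Err(_) => HashSet::new(), // shard not created yet
    };
    if seen.contains(&fp) {
        return Ok(false); // duplicate of a chunk from any previous corpus
    }
    let mut f = OpenOptions::new().create(true).append(true).open(&path)?;
    writeln!(f, "{}", fp)?;
    Ok(true)
}

fn main() -> std::io::Result<()> {
    let root = std::env::temp_dir().join("kociol_sketch");
    let _ = fs::remove_dir_all(&root);
    fs::create_dir_all(&root)?;

    // "Monday": ingest the first corpus.
    println!("{}", ingest(&root, "the quick brown fox")?); // true
    // "Friday": a later corpus re-opens the same on-disk index.
    println!("{}", ingest(&root, "the quick brown fox")?); // false
    println!("{}", ingest(&root, "an unseen paragraph")?); // true
    Ok(())
}
```

The Friday lookup costs one shard read, not a re-scan of Monday's corpus — that cost model is the point of the persistence layer.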

If you're running a single corpus once and discarding the index, Kociol is not the right fit. Tools like DataTrove or standalone MinHash scripts handle that case well. Kociol is for teams running multiple corpora over time who want deduplication to compound — each run making the next one more accurate, with zero re-scanning cost.


Started as genomic deduplication. Pivoted to AI training data.

Version 8 was a genomic sequencing deduplication engine — detecting duplicate reads in FASTQ/BAM data using a similar sharded index architecture. The genomic space has well-established tooling (Picard, samtools markdup) that is deeply integrated into existing pipelines. The marginal improvement over those tools did not justify displacing them.

The core algorithmic work — persistent sharded bloom+index, MinHash LSH with IDF weighting, batch-parallel fingerprinting — is more valuable applied to unstructured text corpora, where no equivalent persistent cross-corpus solution exists at production scale.

CommonCrawl, The Pile, ROOTS, and every major AI training corpus uses stateless dedup scripts that discard the index after each run. The problem is not the algorithm — it is the missing persistence layer. That is what Kociol solves.

v9 was a clean rewrite targeting AI training pipelines specifically: paragraph-level chunking, JSONL/plain text ingest, Python wheel, C FFI. The genomic index format (v8) is archived; v9 shares none of its code.


Two dedup layers.
One persistent index.

Kociol runs both exact and near-duplicate detection in a single pipeline pass, writing all fingerprints to a sharded on-disk database that persists across corpus runs. Embeds in any language via Rust API, Python wheel, or C FFI.

Layer 1 — Exact Deduplication

NFKC normalization + case-fold + whitespace collapse, then xxHash3-128 fingerprint. Catches Unicode variants, case differences, and formatting noise that byte comparison misses. Bloom filter pre-check keeps disk reads near zero for already-seen content.

WikiText-103 ingest 50.8 MB/s
CommonCrawl 100GB ingest 112.3 MB/s
Filter throughput 104 MB/s
False positive rate < 0.001%
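The normalize-then-hash pipeline and the bloom pre-check can be sketched in a few dozen lines of stdlib Rust. This is an illustration under stated assumptions, not Kociol's implementation: `DefaultHasher` stands in for xxHash3-128, NFKC normalization is omitted (it needs a Unicode crate), and the toy bloom filter uses a fixed 64 Kibit array with two probes rather than tuned parameters.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Case-fold and collapse runs of whitespace. The real layer also
/// applies NFKC first, so Unicode variants collapse as well.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase()
}

/// Stand-in 64-bit fingerprint; the real engine uses xxHash3-128.
fn fingerprint(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    normalize(text).hash(&mut h);
    h.finish()
}

/// Toy bloom filter: two probes into a fixed bit array. A negative
/// answer is definite, so unseen content skips the disk lookup.
struct Bloom {
    bits: Vec<u64>,
}

impl Bloom {
    fn new() -> Self {
        Bloom { bits: vec![0; 1 << 10] } // 64 Kibit, illustrative size
    }
    fn probes(fp: u64) -> [u64; 2] {
        [fp, fp.rotate_left(32) ^ 0x9e37_79b9_7f4a_7c15]
    }
    fn insert(&mut self, fp: u64) {
        for p in Self::probes(fp) {
            let i = (p as usize) % (self.bits.len() * 64);
            self.bits[i / 64] |= 1u64 << (i % 64);
        }
    }
    fn maybe_contains(&self, fp: u64) -> bool {
        Self::probes(fp).iter().all(|&p| {
            let i = (p as usize) % (self.bits.len() * 64);
            self.bits[i / 64] & (1u64 << (i % 64)) != 0
        })
    }
}

fn main() {
    // Case and whitespace noise collapse to one fingerprint.
    assert_eq!(fingerprint("Hello   World"), fingerprint("hello world"));

    // Bloom pre-check: inserted content always answers "maybe seen".
    let mut bloom = Bloom::new();
    bloom.insert(fingerprint("some paragraph"));
    assert!(bloom.maybe_contains(fingerprint("SOME  paragraph")));
    println!("ok");
}
```

Because the bloom filter can only err toward false positives, a "not seen" answer is trustworthy on its own — which is what keeps disk reads near zero on fresh content.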

Layer 2 — Near-Duplicate Detection

MinHash LSH with IDF-weighted character 5-gram shingling. Catches reformatted, lightly edited, and paraphrased duplicates that exact hashing misses entirely. The IDF table persists and improves with every corpus run — the longer you run Kociol, the more accurately it weights distinctive vs. common phrases.

Recall at Jaccard ≥ 0.85 100%
Additional dups vs. exact-only +37%
Cross-corpus false positives 0%

Near-dedup ingest: 64–150 MB/s with the batch-parallelized implementation. Paragraph-level granularity with full disk persistence.
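A compact sketch of the MinHash LSH scheme, under explicit simplifications: seeded `DefaultHasher` calls stand in for real MinHash permutations, the 64-slot signature and 16-band layout are illustrative choices, and the persistent IDF weighting is omitted entirely (every shingle counts equally here). It shows the mechanics the prose describes: character 5-gram shingling, per-permutation minimum hashes, band keys for bucketing, and Jaccard estimation from signature agreement.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

const NUM_HASHES: usize = 64; // signature length, illustrative
const BANDS: usize = 16;      // 16 bands x 4 rows, illustrative

fn hash_with_seed(s: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    s.hash(&mut h);
    h.finish()
}

/// Character 5-gram shingles. The real layer weights these by a
/// persistent IDF table; here every shingle counts equally.
fn shingles(text: &str) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() < 5 {
        return [text.to_string()].into_iter().collect();
    }
    (0..=chars.len() - 5)
        .map(|i| chars[i..i + 5].iter().collect())
        .collect()
}

/// MinHash signature: for each seeded "permutation", keep the
/// minimum hash over the document's shingle set.
fn signature(text: &str) -> Vec<u64> {
    let sh = shingles(text);
    (0..NUM_HASHES as u64)
        .map(|seed| sh.iter().map(|s| hash_with_seed(s, seed)).min().unwrap())
        .collect()
}

/// LSH band keys: two docs become candidates if any band key matches.
fn band_keys(sig: &[u64]) -> Vec<u64> {
    sig.chunks(NUM_HASHES / BANDS)
        .map(|band| {
            let mut h = DefaultHasher::new();
            band.hash(&mut h);
            h.finish()
        })
        .collect()
}

/// Estimated Jaccard similarity = fraction of agreeing signature slots.
fn est_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}

fn main() {
    let doc = "The quick brown fox jumps over the lazy dog near the river bank.";
    let edited = "The quick brown fox jumps over the lazy dog near the river bend.";
    let other = "Completely different text about deduplication engines in Rust.";

    let (s1, s2, s3) = (signature(doc), signature(edited), signature(other));
    let (b1, b2) = (band_keys(&s1), band_keys(&s2));
    let shared = b1.iter().zip(b2.iter()).filter(|(a, b)| a == b).count();

    println!("near-dup est {:.2}", est_jaccard(&s1, &s2));
    println!("unrelated est {:.2}", est_jaccard(&s1, &s3));
    println!("shared bands {}", shared);
}
```

The lightly edited pair scores a high estimated Jaccard and collides in shared LSH bands, while the unrelated document scores near zero — exactly the separation that lets the layer catch edits that exact hashing misses.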

// Rust API — exact dedup
let db = PatternDb::open_or_create(config)?;
let results = db.store_batch(&chunks)?; // returns (hash, is_new) per chunk

// Near-dedup — persistent across corpus runs
let near_db = NearDb::open_or_create(config)?;
let candidates = near_db.ingest_batch_uniform(&texts)?; // returns duplicate doc IDs

// C FFI — embed in Python, Go, or any language
void* db = kociol_db_open("/data/corpus.kdb");
int  is_new = kociol_db_ingest(db, text);

Runs as a standalone binary or Python wheel. Integrates with any workflow orchestrator — Airflow, Ray, Prefect, or custom pipelines — via Python or C FFI interface.


Measured on real corpora.

Exact Dedup — Large Scale

WikiText 10GB 66.5 MB/s
CommonCrawl 100GB 112.3 MB/s
Filter throughput 104 MB/s

Ryzen 7 5700X, 8-core

Exact Dedup — Small Corpus

AG News (34 MB) 27.5 MB/s
WikiText-103 (270 MB) 50.8 MB/s

Ryzen 5 4500U, 6-core laptop

Near-Dedup — Ingest

Throughput range 64–150 MB/s
Near-dup rate (synthetic 30%) 36.9%
Cross-corpus false positives 0%

Batch-parallelized implementation

Benchmark methodology: Tested on Ryzen 7 5700X (8-core) at 100GB CommonCrawl scale. Scales linearly with core count. Small-corpus results on Ryzen 5 4500U laptop included for reference. Paragraph-level granularity with full disk persistence throughout. Benchmark scripts and dataset preparation available to evaluation customers.

What's built. What's next.

The core pipeline is production-ready. These are the next engineering milestones.

Shipped

  • Exact dedup: NFKC normalize → xxHash3-128 → persistent sharded index
  • Near-dedup: MinHash LSH, IDF-weighted, batch-parallelized (64–150 MB/s)
  • Cross-corpus persistence: validated at 100GB scale
  • Python wheel (cp38 abi3, manylinux)
  • C FFI interface
  • 73/73 unit tests passing

In progress / planned

  • Layer 3 centroid store — cluster centroids form naturally as docs accumulate, enabling O(centroids) pre-filter before full LSH lookup
  • Distributed shard support — multi-node deployment for corpora exceeding single-machine RAM
  • Semantic near-dedup — embedding-based similarity as an optional Layer 4 above MinHash
  • WASM target — run the exact dedup layer in-browser or in edge compute

What does it cost to build this internally?

Estimated for a senior ML infrastructure engineer at a US AI lab. Conservative figures based on published salary data and engineering blog post timelines.

Build internally
Engineer cost (6–12 months) $100K–$350K
QA & integration work $30K–$80K
Ongoing maintenance/year $40K–$80K
Cross-corpus persistence +$50K–$100K
Year 1 total $220K–$610K

Does not include the cost of getting the architecture wrong once and rebuilding.

License Kociol
60–90 day evaluation on your corpora
Perpetual production license
Source code in escrow from day one
Integration support included
Annual support renewal available
Evaluation fee credited toward license
Write to discuss
Licensing is structured around corpus scale and deployment model. Kociol has no GPL dependencies — full dependency license inventory available to evaluation customers. Contracts are governed by Norwegian law by default; US or EU governing law available on request.

Built and maintained by one person.
Engineered to last without them.

Infrastructure IP from a solo founder carries vendor risk. Kociol is designed to eliminate it.

Daniel Maslany — Norway

Kociol is built and maintained by Daniel Maslany, based in Norway. The engine is written in Rust (~4,000 lines), ships with 73/73 unit tests passing, and has been benchmarked against WikiText-103, AG News, and CommonCrawl-scale corpora up to 100GB. All benchmark scripts and methodology are available to evaluation customers — the numbers are reproducible. Source code is held in escrow from day one of a Commercial license: if the vendor ceases operations, the full codebase transfers to you automatically.

Rust 73/73 tests Source escrow No GPL deps Norway / EU

Get in touch.

Describe your corpus scale and pipeline stack. You'll receive a response within one business day: whether Kociol is a fit, and what an evaluation would look like.

Async, no sales calls required Response within 1 business day Evaluation scoped to your actual corpora
Get in touch