GitHub - 0xNaN/ngrep: grep plus word embeddings

https://github.com/0xNaN/ngrep • Mar 17, 2026 21:39

Extracto

grep plus word embeddings. Contribute to 0xNaN/ngrep development by creating an account on GitHub.

Resumen

Resumen Principal

ngrep es una innovadora Prueba de Concepto (PoC) que redefine las expresiones regulares al incorporar la semántica del lenguaje natural a través de Word Embeddings. Este enfoque pionero permite buscar texto no solo por coincidencias literales de cadenas, sino por el significado intrínseco de las palabras, abriendo un nuevo paradigma en la recuperación y análisis de información. La herramienta enriquece el lenguaje tradicional de las expresiones regulares con un nuevo y potente operador neuronal ~, que es el pilar de su capacidad de búsqueda por **similitud

Contenido

^{ngrep matching a paragraph from the book Flatland: A Romance of Many Dimensions}

ngrep is an experimental PoC for finding text by meaning rather than literal matching. The core question is simple: what happens when regular expressions are enriched with a small bit of semantics via Word Embeddings?

ngrep explores that idea with a familiar grep interface, extending regex with a new neural operator ~ while keeping the rest of the language intact. It supports the major regex features you already use (including lookarounds like negative lookahead), so you can combine semantic and literal patterns in one expression, built on top of the fantastic 🦀 fancy-regex.

The `~` operator

The ~ operator matches by semantic similarity, using neural Word Embeddings (yes, 2010s nostalgia). For example, ~(fruit)+ matches any token whose embedding is close to that of fruit:

Syntax and Parameters

You can refine the search using the following syntax:

~(word) e.g: ~(car) Match word using the default similarity threshold (from --threshold or the model config).
~(word;threshold) e.g: ~(car;0.3) Overrides the threshold for this specific match.
~(word1::word2) e.g: ~(car::bike) Uses the average embedding of word1 and word2 as the target.

~() matches only a single token, use ~()+ to capture full words or phrases
Word Embeddings are inherently imprecise, expect to tune the threshold and combine multiple ~ operators to get stable results

Install

From the repository root:

cargo install --path .
ngrep --help

Build (no install)

If you prefer to build locally without installing, run:

cargo build --release
./target/release/ngrep --help

Then use ./target/release/ngrep in the commands below instead of ngrep.

After ngrep is available you have to import some Word Embeddings model to start matching. Follow these steps to download and import the English FastText embeddings:

wget -qO- https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz | gunzip > cc.en.300.vec
ngrep import cc.en.300.vec ften

Match with:

echo 'a standard example is: hello world' | ngrep '~(hey)+ ~(planet)+'

Alternatively you can import any embeddings in the txt format and configure the default model with ngrep config. You can import multiple models and switch between them with ngrep config or --model, but only one model is used at a time for matching:

FastText Word vectors for 157 languages
Wikipedia2Vec with ENTITY vectors: Offers multiple dimensionality options (100, 300, or 500) and provides language-specific pretrained models that support entity vectors via ENTITY/<topic> tokens.
GloVe: Global Vectors for Word Representation: Other than being high quality, notably offers 50-dimensional vectors that are very fast to load (though slightly less precise).

A note on performance

ngrep's current focus is exploration, not performance. It does not preload or cache vectors, performs frequent disk access, and ~ matches are not compiled into standard regex. This is a deliberate choice to keep the implementation simple while the idea is being explored.

To give you a glimpse of the current performance, it takes about ~18s to find the most common ways to refer to a big animal in the book Moby-Dick on MacBook Pro M4 (1.2MB of text, 22K lines, English FastText 300d):

wget -q https://raw.githubusercontent.com/massimo-nazaria/bash-textgen/refs/heads/main/moby-dick.txt
time ngrep -o '~(big)+ \b~(animal;0.35)+\b' moby-dick.txt | sort | uniq -c | sort -rn
   7 great whale
   5 great whales
   3 large whale
   3 great monster
   2 great fish
   1 tremendous whale
   1 small fish
   1 small cub
   1 little cannibal
   1 large herd
   1 huge reptile
   1 huge elephant
   1 great hunting
   1 great dromedary
   1 gigantic fish
   1 gigantic creature
   1 enormous creatures
   1 enormous creature
   1 big whale

real	0m17.291s
user	0m16.577s
sys	    0m0.726s