CISV
Cisv is a CSV parser on steroids... literally. It's a high-performance CSV parser/writer leveraging SIMD instructions and zero-copy memory mapping, available both as a Node.js native addon and as a standalone CLI tool with extensive configuration options.
I wrote about the basics in a blog post; you can read it here: https://sanixdk.xyz/blogs/how-i-accidentally-created-the-fastest-csv-parser-ever-made.
PERFORMANCE
- 469,968 MB/s throughput on 2M row CSV files (AVX-512)
- 10-100x faster than popular CSV parsers
- Zero-copy memory-mapped I/O with kernel optimizations
- SIMD accelerated with AVX-512/AVX2 auto-detection
- Dynamic lookup tables for configurable parsing
CLI BENCHMARKS WITH DOCKER
$ docker build -t cisv-benchmark .
To run them, pinning CPU and memory resources for the container:
$ docker run --rm \
    --cpus="2.0" \
    --memory="4g" \
    --memory-swap="4g" \
    --cpu-shares=1024 \
    --security-opt seccomp=unconfined \
    cisv-benchmark
BENCHMARKS
Benchmark comparison against existing popular tools; see the CI pipeline run: https://github.com/Sanix-Darker/cisv/actions/runs/17194915214/job/48775516036
SYNCHRONOUS RESULTS
Library | Speed (MB/s) | Avg Time (ms) | Operations/sec |
---|---|---|---|
cisv (sync) | 30.04 | 0.02 | 64936 |
csv-parse (sync) | 13.35 | 0.03 | 28870 |
papaparse (sync) | 25.16 | 0.02 | 54406 |
SYNCHRONOUS RESULTS (WITH DATA ACCESS)
Library | Speed (MB/s) | Avg Time (ms) | Operations/sec |
---|---|---|---|
cisv (sync) | 31.24 | 0.01 | 67543 |
csv-parse (sync) | 15.42 | 0.03 | 33335 |
papaparse (sync) | 25.49 | 0.02 | 55107 |
ASYNCHRONOUS RESULTS
Library | Speed (MB/s) | Avg Time (ms) | Operations/sec |
---|---|---|---|
cisv (async/stream) | 61.31 | 0.01 | 132561 |
papaparse (async/stream) | 19.24 | 0.02 | 41603 |
neat-csv (async/promise) | 9.09 | 0.05 | 19655 |
ASYNCHRONOUS RESULTS (WITH DATA ACCESS)
Library | Speed (MB/s) | Avg Time (ms) | Operations/sec |
---|---|---|---|
cisv (async/stream) | 24.59 | 0.02 | 53160 |
papaparse (async/stream) | 21.86 | 0.02 | 47260 |
neat-csv (async/promise) | 9.38 | 0.05 | 20283 |
INSTALLATION
NODE.JS PACKAGE
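npm install cisv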
CLI TOOL (FROM SOURCE)
git clone https://github.com/sanix-darker/cisv
cd cisv
make cli
sudo make install-cli
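To confirm the CLI is on your PATH, the version flag documented under CLI USAGE below can be used:
cisv --version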
BUILD FROM SOURCE (NODE.JS ADDON)
npm install -g node-gyp
make build
QUICK START
NODE.JS
const { cisvParser } = require('cisv');

// Basic usage
const parser = new cisvParser();
const rows = parser.parseSync('./data.csv');

// With configuration (optional)
const tsv_parser = new cisvParser({
    delimiter: '\t',
    quote: "'",
    trim: true
});
const tsv_rows = tsv_parser.parseSync('./data.tsv');
CLI
# Basic parsing
cisv data.csv

# Parse TSV file
cisv -d $'\t' data.tsv

# Parse with custom quote and trim
cisv -q "'" -t data.csv

# Skip comment lines
cisv -m '#' config.csv
CONFIGURATION OPTIONS
Parser Configuration
const parser = new cisvParser({
    // Field delimiter character (default: ',')
    delimiter: ',',
    // Quote character (default: '"')
    quote: '"',
    // Escape character (null for RFC4180 "" style, default: null)
    escape: null,
    // Comment character to skip lines (default: null)
    comment: '#',
    // Trim whitespace from fields (default: false)
    trim: true,
    // Skip empty lines (default: false)
    skipEmptyLines: true,
    // Use relaxed parsing rules (default: false)
    relaxed: false,
    // Skip lines with parse errors (default: false)
    skipLinesWithError: true,
    // Maximum row size in bytes (0 = unlimited, default: 0)
    maxRowSize: 1048576,
    // Start parsing from line N (1-based, default: 1)
    fromLine: 10,
    // Stop parsing at line N (0 = until end, default: 0)
    toLine: 1000
});
Dynamic Configuration
// Set configuration after creation
parser.setConfig({
    delimiter: ';',
    quote: "'",
    trim: true
});

// Get current configuration
const config = parser.getConfig();
console.log(config);
API REFERENCE
TYPESCRIPT DEFINITIONS
interface CisvConfig {
    delimiter?: string;
    quote?: string;
    escape?: string | null;
    comment?: string | null;
    trim?: boolean;
    skipEmptyLines?: boolean;
    relaxed?: boolean;
    skipLinesWithError?: boolean;
    maxRowSize?: number;
    fromLine?: number;
    toLine?: number;
}

interface ParsedRow extends Array<string> {}

interface ParseStats {
    rowCount: number;
    fieldCount: number;
    totalBytes: number;
    parseTime: number;
    currentLine: number;
}

interface TransformInfo {
    cTransformCount: number;
    jsTransformCount: number;
    fieldIndices: number[];
}

class cisvParser {
    constructor(config?: CisvConfig);
    parseSync(path: string): ParsedRow[];
    parse(path: string): Promise<ParsedRow[]>;
    parseString(csv: string): ParsedRow[];
    write(chunk: string | Buffer): void;
    end(): void;
    getRows(): ParsedRow[];
    clear(): void;
    setConfig(config: CisvConfig): void;
    getConfig(): CisvConfig;
    transform(fieldIndex: number, type: string | Function): this;
    removeTransform(fieldIndex: number): this;
    clearTransforms(): this;
    getStats(): ParseStats;
    getTransformInfo(): TransformInfo;
    destroy(): void;
    static countRows(path: string): number;
    static countRowsWithConfig(path: string, config?: CisvConfig): number;
}
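The stats and lifecycle methods are easy to miss; here is a minimal example using getStats() and destroy(), built only on the definitions above (the file name is a placeholder):

import { cisvParser } from "cisv";

// Parse a file, then inspect parser statistics.
const parser = new cisvParser({ trim: true });
const rows = parser.parseSync("data.csv");

const stats = parser.getStats();
console.log(`rows=${stats.rowCount}, fields=${stats.fieldCount}, bytes=${stats.totalBytes}`);

// Count rows without materializing them.
const total = cisvParser.countRows("data.csv");
console.log(`countRows: ${total}`);

// Release native resources when finished.
parser.destroy();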
BASIC PARSING
import { cisvParser } from "cisv";

// Default configuration (standard CSV)
const parser = new cisvParser();
const rows = parser.parseSync('data.csv');

// Custom configuration (TSV with single quotes)
const tsvParser = new cisvParser({
    delimiter: '\t',
    quote: "'"
});
const tsvRows = tsvParser.parseSync('data.tsv');

// Parse a specific line range
const rangeParser = new cisvParser({
    fromLine: 100,
    toLine: 1000
});
const subset = rangeParser.parseSync('large.csv');

// Skip comments and empty lines
const cleanParser = new cisvParser({
    comment: '#',
    skipEmptyLines: true,
    trim: true
});
const cleanData = cleanParser.parseSync('config.csv');
STREAMING
import { cisvParser } from "cisv";
import fs from 'fs';

const streamParser = new cisvParser({
    delimiter: ',',
    trim: true
});

const stream = fs.createReadStream('huge-file.csv');
stream.on('data', chunk => streamParser.write(chunk));
stream.on('end', () => {
    streamParser.end();
    const results = streamParser.getRows();
    console.log(`Parsed ${results.length} rows`);
});
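If you prefer Node's stream utilities, the write()/end() API above can also be driven through a Writable sink. This is a sketch built only on the documented methods, not a helper shipped by cisv:

import { cisvParser } from "cisv";
import fs from 'fs';
import { Writable } from 'stream';
import { pipeline } from 'stream/promises';

const parser = new cisvParser();

// Feed each chunk into the parser; flush on stream end.
const sink = new Writable({
    write(chunk, _enc, callback) {
        parser.write(chunk);
        callback();
    },
    final(callback) {
        parser.end();
        callback();
    }
});

await pipeline(fs.createReadStream('huge-file.csv'), sink);
console.log(`Parsed ${parser.getRows().length} rows`);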
DATA TRANSFORMATION
const parser = new cisvParser();

// Built-in C transforms (optimized)
parser
    .transform(0, 'uppercase')      // Column 0 to uppercase
    .transform(1, 'lowercase')      // Column 1 to lowercase
    .transform(2, 'trim')           // Column 2 trim whitespace
    .transform(3, 'to_int')         // Column 3 to integer
    .transform(4, 'to_float')       // Column 4 to float
    .transform(5, 'base64_encode')  // Column 5 to base64
    .transform(6, 'hash_sha256');   // Column 6 to SHA256

// Transform by field name
parser.transform('name', 'uppercase');

// Custom row transform
parser.transformRow((row, rowObj) => { console.log(row); });

// Custom JavaScript transforms
parser.transform(7, value => new Date(value).toISOString());

// Apply to all fields
parser.transform(-1, value => value.replace(/[^\w\s]/gi, ''));

const transformed = parser.parseSync('data.csv');
ROW COUNTING
import { cisvParser } from "cisv";

// Fast row counting without parsing
const count = cisvParser.countRows('large.csv');

// Count with specific configuration
const tsvCount = cisvParser.countRowsWithConfig('data.tsv', {
    delimiter: '\t',
    skipEmptyLines: true,
    fromLine: 10,
    toLine: 1000
});
CLI USAGE
PARSING OPTIONS
cisv [OPTIONS] [FILE]

General Options:
  -h, --help             Show help message
  -v, --version          Show version
  -o, --output FILE      Write to FILE instead of stdout
  -b, --benchmark        Run benchmark mode

Configuration Options:
  -d, --delimiter DELIM  Field delimiter (default: ,)
  -q, --quote CHAR       Quote character (default: ")
  -e, --escape CHAR      Escape character (default: RFC4180 style)
  -m, --comment CHAR     Comment character (default: none)
  -t, --trim             Trim whitespace from fields
  -r, --relaxed          Use relaxed parsing rules
  --skip-empty           Skip empty lines
  --skip-errors          Skip lines with parse errors
  --max-row SIZE         Maximum row size in bytes
  --from-line N          Start from line N (1-based)
  --to-line N            Stop at line N

Processing Options:
  -s, --select COLS      Select columns (comma-separated indices)
  -c, --count            Show only row count
  --head N               Show first N rows
  --tail N               Show last N rows
EXAMPLES
# Parse TSV file
cisv -d $'\t' data.tsv

# Parse CSV with semicolon delimiter and single quotes
cisv -d ';' -q "'" european.csv

# Skip comment lines starting with #
cisv -m '#' config.csv

# Trim whitespace and skip empty lines
cisv -t --skip-empty messy.csv

# Parse lines 100-1000 only
cisv --from-line 100 --to-line 1000 large.csv

# Select specific columns
cisv -s 0,2,5,7 data.csv

# Count rows with specific configuration
cisv -c -d $'\t' --skip-empty data.tsv

# Benchmark with custom delimiter
cisv -b -d ';' european.csv
WRITING
cisv write [OPTIONS]
Options:
-g, --generate N Generate N rows of test data
-o, --output FILE Output file
-d, --delimiter DELIM Field delimiter
-Q, --quote-all Quote all fields
-r, --crlf Use CRLF line endings
-n, --null TEXT Null representation
-b, --benchmark Benchmark mode
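For example, combining the flags above to generate test data (the output file name is arbitrary):

# Generate 1M rows of semicolon-delimited test data, all fields quoted
cisv write -g 1000000 -d ';' -Q -o test.csv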
BENCHMARKS
PARSER PERFORMANCE (273 MB, 5M ROWS)
Parser | Speed (MB/s) | Time (ms) | Relative |
---|---|---|---|
cisv | 7,184 | 38 | 1.0x (fastest) |
rust-csv | 391 | 698 | 18x slower |
xsv | 650 | 420 | 11x slower |
csvkit | 28 | 9,875 | 260x slower |
NODE.JS LIBRARY BENCHMARKS
Library | Speed (MB/s) | Operations/sec | Configuration Support |
---|---|---|---|
cisv | 61.24 | 136,343 | Full |
csv-parse | 15.48 | 34,471 | Partial |
papaparse | 25.67 | 57,147 | Partial |
(More benchmark details are available in the release pipelines.)
RUNNING BENCHMARKS
# CLI benchmarks
make clean && make cli && make benchmark-cli

# Node.js benchmarks
npm run benchmark

# Benchmark with custom configuration
cisv -b -d ';' -q "'" --trim european.csv
TECHNICAL ARCHITECTURE
- SIMD Processing: AVX-512 (64-byte vectors) or AVX2 (32-byte vectors) for parallel processing
- Dynamic Lookup Tables: generated per configuration for optimal state transitions (see the sketch after this list)
- Memory Mapping: direct kernel-to-userspace zero-copy with mmap()
- Optimized Buffering: 1MB ring buffer sized for L3 cache efficiency
- Compiler Optimizations: LTO and architecture-specific tuning with -march=native
- Configurable Parsing: RFC 4180 compliant with extensive customization options
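The C internals are not reproduced here, but the dynamic lookup-table idea can be sketched in JavaScript: each configuration gets a 256-entry table mapping a byte to its character class, so the hot loop replaces a chain of comparisons with a single indexed load. A conceptual sketch, not cisv's actual code:

// Character classes for the parser's state machine.
const CH_ORDINARY = 0, CH_DELIM = 1, CH_QUOTE = 2, CH_NEWLINE = 3, CH_COMMENT = 4;

// Build a per-configuration table once; the parse loop then
// classifies each byte with one table load.
function buildLookupTable({ delimiter = ',', quote = '"', comment = null } = {}) {
    const table = new Uint8Array(256).fill(CH_ORDINARY);
    table[delimiter.charCodeAt(0)] = CH_DELIM;
    table[quote.charCodeAt(0)] = CH_QUOTE;
    table['\n'.charCodeAt(0)] = CH_NEWLINE;
    if (comment) table[comment.charCodeAt(0)] = CH_COMMENT;
    return table;
}

const table = buildLookupTable({ delimiter: ';', comment: '#' });
const buf = Buffer.from('a;b;c\n# skipped\nd;e;f\n');
for (const byte of buf) {
    const cls = table[byte]; // one load instead of several branches
    // ...state-machine transitions on `cls` go here...
}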
FEATURES (PROS)
- RFC 4180 compliant with configurable extensions
- Handles quoted fields with embedded delimiters
- Support for multiple CSV dialects (TSV, PSV, etc.)
- Comment line support
- Field trimming and empty line handling
- Line range parsing for large files
- Streaming API for unlimited file sizes
- Safe fallback for non-x86 architectures
- High-performance CSV writer with SIMD optimization
- Row counting without full parsing
LIMITATIONS
- Linux/Unix only (optimized for x86_64 CPUs)
- Windows support is planned for a future release
LICENSE
MIT © sanix-darker
ACKNOWLEDGMENTS
Inspired by: