GridGulp

License: MIT

Gemma-first spreadsheet schematization for Excel, CSV, TSV, and text files.

What Changed

GridGulp now supports two explicit execution modes:

Only Gemma-family models are accepted for LLM execution.

Installation

pip install gridgulp

For local development:

uv pip install -e ".[dev]"

Configuration

GridGulp loads config in this order:

  1. gridgulp.toml
  2. GRIDGULP_* environment variables
  3. CLI flags
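
The load order above can be pictured as a layered merge. The sketch below assumes that later sources override earlier ones, which is the usual convention and matches the GRIDGULP_DETECTION_STRATEGY override used in the Development Workflow section. The helpers `env_layer` and `merge_layers` are hypothetical and not part of GridGulp, and only top-level keys are handled.

```python
PREFIX = "GRIDGULP_"


def env_layer(environ):
    """Collect GRIDGULP_* variables as lower-cased top-level keys (assumed mapping)."""
    return {
        key[len(PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }


def merge_layers(*layers):
    """Merge dicts left to right; later layers win (assumed precedence)."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


toml_layer = {"detection_strategy": "gemma_first"}  # stands in for gridgulp.toml
env = env_layer({"GRIDGULP_DETECTION_STRATEGY": "heuristic_only", "HOME": "/root"})
cli_layer = {}  # no CLI flags given in this example

config = merge_layers(toml_layer, env, cli_layer)
print(config["detection_strategy"])  # heuristic_only
```

Unrelated variables such as HOME are ignored because they lack the GRIDGULP_ prefix.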

Start from gridgulp.toml.example.

Minimal Gemma-first config:

detection_strategy = "gemma_first"

[gemma]
enabled = true
base_url = "http://localhost:11434/v1"
model_id = "gemma4:26b"
timeout_seconds = 60
max_retries = 2
max_tokens = 1200

[vision]
enabled = true
mode = "fallback_only"
max_proposal_crops = 4
crop_margin_cells = 1

CLI

Validate your config first:

gridgulp doctor

Validate multimodal support explicitly against Ollama:

gridgulp --gemma-base-url http://localhost:11434/v1 --gemma-model-id gemma4:26b doctor

Schematize a file:

gridgulp schematize path/to/report.xlsx

Emit JSON:

gridgulp schematize path/to/report.xlsx --json

Use the lower-level table detector when you need raw table boundaries:

gridgulp detect path/to/report.xlsx --save-debug-images .gridgulp-debug

Process a directory:

gridgulp schematize path/to/reports --recursive
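
When driving GridGulp from Python instead of the CLI, the directory walk has to be done by hand. The sketch below is one way to gather candidate files. The extension set is an assumption derived from "Excel, CSV, TSV, and text files", and `collect_files` is a hypothetical helper, not a GridGulp function.

```python
import tempfile
from pathlib import Path

# Assumed extension set; adjust to the formats your GridGulp version reads.
EXTENSIONS = {".xlsx", ".xls", ".csv", ".tsv", ".txt"}


def collect_files(root, recursive=False):
    """Return sorted spreadsheet-like files under root."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in EXTENSIONS
    )


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in ("a.csv", "notes.md", "nested/b.xlsx"):
        (root / name).touch()
    top_level = [p.name for p in collect_files(root)]
    everything = [p.name for p in collect_files(root, recursive=True)]

print(top_level)   # ['a.csv']
print(everything)  # ['a.csv', 'b.xlsx']
```

Each returned path could then be passed to `schematize_sync`, as shown in the Python API section.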

Python API

Gemma-first schematization:

from gridgulp import GridGulp
from gridgulp.config import Config

# Load settings from the TOML file described under Configuration.
config = Config.from_toml("gridgulp.toml")
gg = GridGulp(config=config)

# Run schematization synchronously on a single workbook.
result = gg.schematize_sync("sales_report.xlsx")
print(result.total_entities)
for entity in result.entities:
    # One line per entity, followed by one indented line per field.
    print(entity.suggested_name, entity.record_count_estimate)
    for field in entity.fields:
        print(" ", field.id, field.semantic_type, field.physical_type)

Lower-level table debugging:

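from gridgulp import GridGulp
from gridgulp.config import Config, DetectionStrategy

# Use only the deterministic detector, without the Gemma-first strategy.
config = Config(detection_strategy=DetectionStrategy.HEURISTIC_ONLY)
gg = GridGulp(config=config)

# Returns the lower-level detection result rather than a workbook schema.
result = gg.detect_tables_sync("sales_report.xlsx")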

Schema Output

The primary output is WorkbookSchemaResult, which includes:

DetectionResult remains available as a lower-level primitive for debugging table boundaries.
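
As a reading aid, the attributes used in the Python API example suggest a shape roughly like the one below. This is a sketch built only from those attribute names. The class names are hypothetical, the real GridGulp classes will carry more fields, computing `total_entities` as a simple length is an assumption, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FieldSketch:
    id: str
    semantic_type: str
    physical_type: str


@dataclass
class EntitySketch:
    suggested_name: str
    record_count_estimate: int
    fields: list[FieldSketch] = field(default_factory=list)


@dataclass
class WorkbookSchemaSketch:
    entities: list[EntitySketch] = field(default_factory=list)

    @property
    def total_entities(self) -> int:
        # Assumed to be the number of entities; the real value may differ.
        return len(self.entities)


# Invented sample values, for illustration only.
sketch = WorkbookSchemaSketch(
    entities=[
        EntitySketch(
            suggested_name="sales",
            record_count_estimate=120,
            fields=[FieldSketch("f1", "currency", "float")],
        )
    ]
)
print(sketch.total_entities)  # 1
```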

Detection And Schematization Architecture

For gemma_first, GridGulp runs:

  1. file typing and file reading
  2. deterministic proposal generation from the existing detector stack
  3. structured-text prompt construction from sheet maps, boundary context, merged cells, multi-row header hints, and semantic row patterns
  4. Gemma text reasoning over final keep, merge, split, add, and reject decisions
  5. multimodal Gemma fallback with rendered PNG overview/crops when the text pass is low-confidence or visually ambiguous
  6. post-processing into TableInfo objects with clipped ranges, deduplicated overlaps, header metadata, and semantic summaries
  7. canonical schematization into regions, entities, fields, records, and relationships
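
The control flow of steps 4 to 6 can be sketched as follows. This is a simplified illustration, not GridGulp's implementation: the `Proposal` class, the 0.6 confidence threshold, and the stub passes are all hypothetical, and overlapping proposals are simply dropped in favor of the most confident one rather than merged or split.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Proposal:
    # Inclusive, zero-based cell bounds (hypothetical representation).
    top: int
    left: int
    bottom: int
    right: int
    confidence: float


def clip(p, n_rows, n_cols):
    """Clamp a proposal to the sheet bounds."""
    return replace(
        p,
        top=max(p.top, 0),
        left=max(p.left, 0),
        bottom=min(p.bottom, n_rows - 1),
        right=min(p.right, n_cols - 1),
    )


def overlaps(a, b):
    """True when two rectangles share at least one cell."""
    return not (
        a.bottom < b.top or b.bottom < a.top or a.right < b.left or b.right < a.left
    )


def dedupe(proposals):
    """Keep the most confident proposal among any overlapping group."""
    kept = []
    for p in sorted(proposals, key=lambda q: q.confidence, reverse=True):
        if not any(overlaps(p, k) for k in kept):
            kept.append(p)
    return kept


def run_passes(proposals, text_pass, vision_pass, threshold=0.6):
    """Run the text pass; fall back to the vision pass on low confidence."""
    decided = text_pass(proposals)
    if any(p.confidence < threshold for p in decided):
        decided = vision_pass(decided)
    return decided


# Stub passes standing in for the Gemma text and multimodal calls.
def text_stub(ps):
    return list(ps)


def vision_stub(ps):
    return [replace(p, confidence=max(p.confidence, 0.7)) for p in ps]


proposals = [
    Proposal(0, 0, 10, 5, 0.9),
    Proposal(8, 0, 20, 5, 0.4),   # low confidence, triggers the fallback
    Proposal(30, 0, 45, 3, 0.8),  # extends past the last row of the sheet
]
decided = run_passes(proposals, text_stub, vision_stub)
final = dedupe(clip(p, n_rows=40, n_cols=6) for p in decided)
print(len(final))  # 2
```

The second proposal is dropped because it overlaps the more confident first one, and the third is clipped to row 39.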

Supported Formats

.xlsb is detected but not supported for reading.

Development Workflow

gridgulp doctor
gridgulp schematize examples/spreadsheets/simple/basic_table.csv --json
pytest -q tests/test_detection_agent.py

To compare Gemma-first schema output against the deterministic baseline, switch detection_strategy to heuristic_only in config or via env:

GRIDGULP_DETECTION_STRATEGY=heuristic_only gridgulp schematize file.xlsx --json
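
Once both runs have produced JSON, a structural diff makes the comparison easier to read. The sketch below flattens two JSON documents into path/value pairs and reports the paths that differ. It makes no assumption about GridGulp's JSON schema, and the two sample payloads are neutral placeholders rather than real output.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {path: leaf} pairs."""
    if isinstance(value, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(value)]
    else:
        return {prefix: value}
    flat = {}
    for path, child in items:
        flat.update(flatten(child, path))
    return flat


def diff_json(a_text, b_text):
    """Map each differing path to its (first, second) values; None means absent."""
    a = flatten(json.loads(a_text))
    b = flatten(json.loads(b_text))
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(a.keys() | b.keys())
        if a.get(path) != b.get(path)
    }


# Placeholder payloads; substitute the saved output of the two runs.
gemma_json = '{"a": {"b": 1}, "c": [1, 2]}'
heuristic_json = '{"a": {"b": 1}, "c": [1, 3], "d": true}'

changes = diff_json(gemma_json, heuristic_json)
print(changes)  # {'c[1]': (2, 3), 'd': (None, True)}
```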

Docs