lwoodard/Summerize

Levi Woodard 54629aecad Initial push to gitea

2026-05-10 13:37:17 -06:00

publish

Go CLI that turns a local audio/video recording of a church service into:

A markdown summary (--summerize)
A 60–90s social-media hook clip cut from the source (--clip)
(Future) A post to Spotify for Podcasters (--post, currently a stub)

The repo directory is still summerize/ (historical), but the module and binary are both publish.

Pipeline (one pass, shared by all modes)

input  ──ffmpeg──►  16kHz mono WAV  ──whisper.cpp -oj──►  []Segment{Start,End,Text}
                                                                │
                                              ┌─────────────────┴─────────────────┐
                                              │                                   │
                              PlainText(segs)                 FormatForLLM(segs)
                                              │                                   │
                                              ▼                                   ▼
                                        Summarizer                          clip.Pick
                                  (Anthropic API or                  (same Summarizer,
                                   shelled-out claude CLI)            different prompt → JSON)
                                              │                                   │
                                              ▼                                   ▼
                                       markdown summary               ffmpeg cut [start,end]
                                              │
                                              └─► output.MarkdownToSpotifyHTML
                                                  (b/i/a/ul/ol/li/p subset
                                                   that Spotify show notes accept)

Whisper output is cached at <input>.segments.json. Subsequent runs on the same input (different modes, prompts, or clip parameters) skip whisper entirely.

Layout

main.go                           flat flagset, mode dispatch, orchestration
prompts/church-service.md         default --summerize prompt (go:embed)
prompts/clip-selector.md          default --clip prompt; templated with
                                  {{MIN_SECONDS}} / {{MAX_SECONDS}} (go:embed)
internal/audio/audio.go           ffmpeg → 16kHz mono PCM WAV
internal/transcribe/
  transcribe.go                   Transcriber interface, Segment type
  segments.go                     Segment, PlainText, FormatForLLM, mm:ss helper
  whispercpp.go                   shells out to whisper-cli with -oj; parses JSON
internal/summarize/
  summarize.go                    Summarizer interface
  anthropic.go                    direct Messages API via net/http
                                  (no SDK dep; reads ANTHROPIC_API_KEY)
  claudecli.go                    `claude -p <prompt>` with transcript on stdin
internal/clip/
  clip.go                         Selection, Pick (LLM JSON parse), Extract (ffmpeg)
  clip_test.go                    JSON object extraction edge cases
internal/output/
  spotify.go                      markdown → Spotify-safe HTML
  spotify_test.go
  clipboard.go                    wl-copy / xclip / pbcopy
Makefile                          build/install/link/doctor/uninstall targets
scripts/install.sh                interactive setup (OS + GPU detect → deps,
                                  whisper.cpp build, model download, link)

Zero external Go dependencies. Stdlib only.
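As a rough picture of what the helpers listed for internal/transcribe/segments.go might look like, here is a stdlib-only sketch. The names PlainText, FormatForLLM, and Segment come from the layout above; the field types, the mmss helper name, and the exact line format handed to the LLM are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Segment holds one whisper segment. Field types are assumed.
type Segment struct {
	Start, End float64 // seconds from the start of the recording
	Text       string
}

// mmss renders seconds as mm:ss, truncating fractional seconds.
// Minutes are not wrapped at 60, so 3725s renders as 62:05.
func mmss(sec float64) string {
	total := int(sec)
	return fmt.Sprintf("%02d:%02d", total/60, total%60)
}

// PlainText joins the trimmed segment texts with single spaces,
// dropping segments that are empty after trimming.
func PlainText(segs []Segment) string {
	parts := make([]string, 0, len(segs))
	for _, s := range segs {
		if t := strings.TrimSpace(s.Text); t != "" {
			parts = append(parts, t)
		}
	}
	return strings.Join(parts, " ")
}

// FormatForLLM emits one timestamped line per segment so a model can
// refer back to time ranges. The "[mm:ss-mm:ss] text" layout is an
// assumption, not the project's actual format.
func FormatForLLM(segs []Segment) string {
	var b strings.Builder
	for _, s := range segs {
		fmt.Fprintf(&b, "[%s-%s] %s\n", mmss(s.Start), mmss(s.End), strings.TrimSpace(s.Text))
	}
	return b.String()
}

func main() {
	segs := []Segment{
		{Start: 0, End: 2, Text: " Good morning."},
		{Start: 2, End: 75.4, Text: "Please open to John 3. "},
	}
	fmt.Println(PlainText(segs))
	fmt.Print(FormatForLLM(segs))
}
```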

CLI surface

publish [mode...] [flags] <input>

modes (combine freely; defaults to --summerize):
  --summerize          write a markdown summary
  --clip               cut a 60-90s social hook clip
  --post               post to Spotify (not implemented yet)

Modes share whisper output, so publish --summerize --clip sermon.mp4 only transcribes once.
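One way to get this behavior with the standard flag package is to register each mode as a boolean on a single FlagSet and fall back to --summerize when none is set. The sketch below is illustrative only; the modes struct and the parseModes function are hypothetical names, not the ones in main.go.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// modes records which boolean mode flags were requested.
type modes struct {
	Summerize, Clip, Post bool
}

// parseModes (hypothetical) parses mode flags from args and returns the
// selected modes plus the single positional input path. With no mode
// flag given, it defaults to Summerize.
func parseModes(args []string) (modes, string, error) {
	fs := flag.NewFlagSet("publish", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage noise out of this sketch
	var m modes
	fs.BoolVar(&m.Summerize, "summerize", false, "write a markdown summary")
	fs.BoolVar(&m.Clip, "clip", false, "cut a social hook clip")
	fs.BoolVar(&m.Post, "post", false, "post to Spotify (not implemented)")
	if err := fs.Parse(args); err != nil {
		return modes{}, "", err
	}
	if fs.NArg() != 1 {
		return modes{}, "", fmt.Errorf("expected exactly one input, got %d", fs.NArg())
	}
	if !m.Summerize && !m.Clip && !m.Post {
		m.Summerize = true
	}
	return m, fs.Arg(0), nil
}

func main() {
	// Fixed argument list so the sketch runs the same way everywhere.
	m, input, err := parseModes([]string{"--summerize", "--clip", "sermon.mp4"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("input=%s modes=%+v\n", input, m)
}
```

Note that the stdlib flag package stops parsing at the first non-flag argument, which matches the `publish [mode...] [flags] <input>` ordering shown above.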

Shared flags

flag	purpose	default
`--summarizer`	`claude-cli` or `claude-api`	`claude-cli`
`--model`	model name (Anthropic API path defaults to `claude-sonnet-4-6`)	empty
`--prompt-summary`	override summary prompt path	bundled
`--prompt-clip`	override clip-selector prompt path	bundled
`--whisper-bin`	whisper.cpp binary; auto-detects best backend (see "Backend auto-detect" below)	auto
`--whisper-model`	path to ggml model	`~/.cache/whisper.cpp/ggml-base.en.bin`
`--whisper-lang`	force language code	auto-detect
`--whisper-threads`	thread count	library default
`--segments`	segments JSON cache path	`<input>.segments.json`
`--keep-transcript`	also write `<input>.transcript.txt`	off
`--keep-wav`	keep the normalized WAV instead of tempdir	off
`-v`	verbose progress to stderr	off

--summerize flags

flag	purpose	default
`--prompt`	producer's notes (any pre-written framing, title, key points) that anchor the summary	empty
`--md PATH`	markdown output; `-` = stdout, `""` = disable	`<input>.summary.md`
`--spotify PATH`	Spotify HTML output; `-` = stdout	disabled
`--copy`	copy Spotify HTML to clipboard	off

When --prompt is set, the value is prepended to the user message as a "Producer's notes" block above the transcript. The bundled prompt instructs the LLM to treat producer's notes as authoritative for titles, speaker names, framing, and key points, then use the transcript to expand and enrich them. Use this when the Spotify show notes you've already drafted should drive the summary's framing rather than the LLM inferring everything from scratch.

For longer notes, use shell expansion: --prompt "$(cat notes.md)".

Note: --prompt-summary (system prompt template path) and --prompt (user notes content) are different flags. The former replaces the system prompt; the latter adds content to the user message, above the transcript.
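To make the split between the two flags concrete, here is a minimal sketch of how a caller might assemble the user message. The function name buildUserMessage and the "Transcript:" label are assumptions; only the "Producer's notes" block above the transcript is described in this README.

```go
package main

import (
	"fmt"
	"strings"
)

// buildUserMessage (hypothetical) places the producer's notes, if any,
// above the transcript. The system prompt is handled separately and is
// not touched here.
func buildUserMessage(notes, transcript string) string {
	var b strings.Builder
	if n := strings.TrimSpace(notes); n != "" {
		b.WriteString("Producer's notes:\n")
		b.WriteString(n)
		b.WriteString("\n\n")
	}
	b.WriteString("Transcript:\n") // label is an assumption
	b.WriteString(transcript)
	return b.String()
}

func main() {
	msg := buildUserMessage("Title: Grace in the Storm\nSpeaker: guest pastor", "Good morning, church.")
	fmt.Println(msg)
}
```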

--clip flags

flag	purpose	default
`--min`	minimum clip length (seconds)	60
`--max`	maximum clip length (seconds)	90
`--out PATH`	clip output path	`<input>.clip<ext>` (`.clip.m4a` for audio)
`--copy-codec`	ffmpeg `-c copy` (fast, keyframe-aligned) — skips the 9:16 portrait crop, since stream copy can't apply video filters	off
`--dry-run`	print the picked window but don't run ffmpeg	off

Unless --copy-codec is set, video clips are re-encoded as 1080×1920 portrait (9:16) with a center crop, capped at 1 GiB via ffmpeg's -fs. The crop filter is crop=min(iw,ih*9/16):min(ih,iw*16/9), so any source aspect (16:9, 4:3, 1:1, or already-portrait) yields the largest 9:16 sub-rectangle without distortion. See portraitFilter and MaxClipBytes in internal/clip/clip.go.
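The arithmetic behind that crop expression can be checked with a few lines of Go. This is only a worked example of the formula, using integer division as an approximation of ffmpeg's expression evaluation; portraitCrop and minInt are names made up for the sketch.

```go
package main

import "fmt"

// minInt returns the smaller of a and b.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// portraitCrop applies min(iw, ih*9/16) : min(ih, iw*16/9) to a source
// of iw x ih pixels and returns the crop width and height.
func portraitCrop(iw, ih int) (w, h int) {
	w = minInt(iw, ih*9/16)
	h = minInt(ih, iw*16/9)
	return w, h
}

func main() {
	sources := [][2]int{
		{1920, 1080}, // 16:9 landscape
		{1440, 1080}, // 4:3
		{1000, 1000}, // 1:1
		{1080, 1920}, // already portrait
	}
	for _, s := range sources {
		w, h := portraitCrop(s[0], s[1])
		fmt.Printf("%dx%d -> crop %dx%d\n", s[0], s[1], w, h)
	}
}
```

For landscape and square sources the height is kept and the width shrinks; an already-portrait 9:16 source passes through unchanged.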

Conventions / non-obvious choices

Spelling: summerize is intentional. It's the original name of the project and the user's preferred spelling. Use summerize (e.g. for --summerize) rather than auto-correcting to summarize in user-facing surfaces. Internal Go package internal/summarize keeps the standard spelling.
Pluggable Summarizer is shared between modes. --clip reuses the same Summarizer interface; the only difference is the prompt and the expectation of JSON output. If you add a new mode that talks to an LLM, plug it in there.
Summarizer.Summarize takes the user content verbatim. No implicit "Transcript:" prefix or other framing. Callers (doSummerize in main.go, clip.Pick) build the full user message themselves — that's how --prompt (producer's notes) prepends a "Producer's notes:" block above the transcript without the message getting mislabeled.
Whisper output is the source of truth. All text-only consumers go through transcribe.PlainText(segs); we don't run whisper twice.
JSON parsing for clip selection is defensive. clip.extractJSONObject walks balanced braces (skipping strings) so the model can wrap its answer in prose despite the prompt asking for raw JSON.
Clip extraction defaults to re-encode. Frame-accurate cuts matter for short social hooks; --copy-codec trades that for speed.
Anthropic API call uses net/http directly. Adding the SDK was tempting, but the request is one POST and avoiding the dep keeps go.sum empty.
prepareWAV cleanup is owned by the caller. It returns a func() you must defer. Don't call os.RemoveAll on the wav path yourself.
No subcommands. The CLI is one flat flagset. Modes are boolean flags so multiple can run in one invocation and share state.

Build / install

Fresh machine (recommended) — clone, then run the interactive installer. It detects OS + GPU, builds whisper.cpp with the right backend, downloads a ggml model, and links publish + whisper-cli-<backend> into ~/.local/bin:

git clone <repo-url> ~/Git\ Repos/summerize
cd ~/Git\ Repos/summerize
make install        # interactive
make doctor         # just print detected platform/GPU/dependencies

Re-runnable; each step is idempotent and skippable. The script supports Arch (pacman), Debian/Ubuntu (apt), Fedora (dnf), and macOS (brew); for unknown distros it prints the package list and skips the install command.

Already built once — just rebuild:

go build -o publish .
# or
make link          # rebuilds + (re)points ~/.local/bin/publish at the repo

The symlink at ~/.local/bin/publish is the canonical install location; rebuilds update in place via the symlink.

External dependencies (runtime)

tool	required for	install
`ffmpeg`	always (audio extraction + clip cut)	`pacman -S ffmpeg`
`whisper-cli` (whisper.cpp)	transcription	`pacman -S whisper.cpp` for CPU; for GPU acceleration see "GPU builds" below
ggml whisper model	transcription	`curl -LO https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin` into `~/.cache/whisper.cpp/`
`claude` CLI	`--summarizer claude-cli` (default)	already installed (Claude Code)
`ANTHROPIC_API_KEY`	`--summarizer claude-api`	env var
`wl-copy` / `xclip` / `pbcopy`	`--copy` flag (Spotify HTML to clipboard)	wl-copy ships with wayland on omarchy

Backend auto-detect

When --whisper-bin is not set, resolveBin in internal/transcribe/whispercpp.go picks a backend at runtime:

CUDA — if ~/.local/bin/whisper-cli-cuda (or whisper-cli-cuda on PATH) exists and nvidia-smi -L exits 0.
ROCm — if whisper-cli-rocm exists and rocminfo exits 0.
Vulkan — if whisper-cli-vulkan exists and vulkaninfo --summary exits 0.
CPU fallback — first of whisper-cli / whisper-cpp / main on PATH.

Each probe is gated on a 5s timeout. The chosen backend is logged on a single stderr line (whisper: using CUDA backend (/path)); -v adds diagnostics about which probes were skipped or failed. The convention is one whisper.cpp checkout per host with a per-backend symlink in ~/.local/bin/whisper-cli-<backend>, so the same publish binary works across machines without machine-specific flags.

CUDA build (RTX 3070 Ti / desktop)

sudo pacman -S --needed cuda          # ~3GB; installs to /opt/cuda
git clone --depth=1 https://github.com/ggerganov/whisper.cpp ~/Git\ Repos/whisper.cpp
cd ~/Git\ Repos/whisper.cpp
PATH=/opt/cuda/bin:$PATH cmake -B build \
    -DGGML_CUDA=1 \
    -DCMAKE_CUDA_ARCHITECTURES=86 \
    -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-15
PATH=/opt/cuda/bin:$PATH cmake --build build -j$(nproc) --config Release
ln -sf "$PWD/build/bin/whisper-cli" ~/.local/bin/whisper-cli-cuda

CUDA 13.2 caps the host compiler at GCC 15; system gcc is 16, so the -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-15 line is required (the gcc15 package ships g++-15 alongside the default toolchain). sm_86 matches the RTX 3070 Ti compute capability — adjust if the GPU changes.

CUDA smoke test — these stderr lines should appear in any run:

whisper_init_with_params_no_state: use gpu    = 1
ggml_cuda_init: found 1 CUDA devices ...
whisper_backend_init_gpu: using CUDA0 backend

ROCm build (Framework 16 / Radeon RX 7700S)

The 7700S is RDNA3 (gfx1102). ROCm 6.x supports it.

sudo pacman -S --needed rocm-hip-sdk rocm-hip-runtime hipblas rocblas
git clone --depth=1 https://github.com/ggerganov/whisper.cpp ~/Git\ Repos/whisper.cpp
cd ~/Git\ Repos/whisper.cpp
HIPCXX=/opt/rocm/llvm/bin/clang++ cmake -B build-rocm \
    -DGGML_HIP=1 \
    -DAMDGPU_TARGETS=gfx1102 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build-rocm -j$(nproc)
ln -sf "$PWD/build-rocm/bin/whisper-cli" ~/.local/bin/whisper-cli-rocm

If ROCm doesn't recognize gfx1102 (older ROCm releases), set HSA_OVERRIDE_GFX_VERSION=11.0.0 in the shell before invoking publish to spoof gfx1100 — same RDNA3 ISA, supported kernels.

ROCm smoke test — look for ggml_cuda_init (HIP reuses the CUDA backend naming in whisper.cpp) plus a ROCm device line on stderr.

Vulkan build (universal GPU fallback)

Vulkan is the easiest cross-vendor path; it uses any GPU with a working Vulkan driver (Mesa RADV for AMD, Mesa ANV for Intel, the Nvidia proprietary driver, etc.).

sudo pacman -S --needed vulkan-headers vulkan-icd-loader shaderc
cd ~/Git\ Repos/whisper.cpp
cmake -B build-vulkan -DGGML_VULKAN=1 -DCMAKE_BUILD_TYPE=Release
cmake --build build-vulkan -j$(nproc)
ln -sf "$PWD/build-vulkan/bin/whisper-cli" ~/.local/bin/whisper-cli-vulkan

Slower than native CUDA/ROCm but works on machines where the vendor toolchain is too painful to install. Useful as a portable fallback for laptops with iGPUs.

Metal build (Apple Silicon)

make install handles this automatically; the manual recipe is short because cmake on macOS picks up Metal by default — no special flag.

Prerequisites:

xcode-select --install                                    # one-time
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install go cmake ffmpeg

Build:

git clone --depth=1 https://github.com/ggerganov/whisper.cpp ~/Git\ Repos/whisper.cpp
cd ~/Git\ Repos/whisper.cpp
cmake -B build-metal -DCMAKE_BUILD_TYPE=Release
cmake --build build-metal -j$(sysctl -n hw.ncpu)
ln -sf "$PWD/build-metal/bin/whisper-cli" ~/.local/bin/whisper-cli-metal

The resolver special-cases Darwin: if whisper-cli-metal exists it's used immediately (no probe — Metal is always available on macOS). On a Mac without that symlink, the CPU fallback finds whisper-cli / whisper-cpp from brew (which is itself Metal-enabled by default), so a plain brew install whisper-cpp is a workable lazy path. It just shows "CPU backend" in the publish log line even though whisper.cpp is in fact running Metal kernels.

Metal smoke test — these stderr lines should appear in any run:

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 ...
whisper_backend_init_gpu: using Metal backend

Tests

go test ./...

Covered:

internal/output/spotify_test.go — markdown→Spotify-HTML conversion, escaping
internal/clip/clip_test.go — JSON object extraction, including prose-wrapped and fence-wrapped model output

There are no integration tests for whisper or the LLM calls — those depend on external binaries and remote APIs.

Future work

--post: post the markdown summary as a Spotify for Podcasters episode description. Requires the Spotify show-notes API or a Spotify for Podcasters upload integration. Reuse output.MarkdownToSpotifyHTML, since the show-notes editor accepts that subset.
Multi-clip output: pick the top-N hooks instead of one. The current Selection would become []Selection and the prompt would request an array.
Faster --summarizer for short transcripts: default to Haiku for very short inputs to save on API costs.
