Elasticsearch and FSCrawler

Great-to-Haves — Handled

Elasticsearch with FSCrawler handled these well for my needs.

  • Low-touch / unattended ingestion
  • Sensible, configurable defaults (a settings sketch follows this list)
  • Helpful, accurate dry runs
  • Non-catastrophic re-runs (i.e. smart enough to minimize overwriting or duplicating existing entries)
  • Clever de-duping
  • Customizable / scriptable input and output handling
  • File metadata capture
  • Full-text indexing of file content
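
Several of these come down to the job’s _settings.yaml. A minimal sketch of such a config, not a canonical one: the source path, update rate, and excludes below are illustrative assumptions, and the node URL assumes a local dev instance.

mkdir -p ~/.fscrawler/job_name
cat > ~/.fscrawler/job_name/_settings.yaml <<'EOF'
name: "job_name"
fs:
  url: "/home/me/media"            # hypothetical directory to crawl
  update_rate: "15m"               # re-scan interval when looping
  excludes:
  - "*/~*"                         # skip editor temp files
  index_content: true              # extract and index full text
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"   # assumes a local dev node
EOF

With that in place, the job name on the command line picks up the matching directory under ~/.fscrawler.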

“Clever de-duping” is TBD, and starting with rsync or rclone helps there. Elasticsearch runs lean enough, and is straightforward enough to configure, for my local dev environment. Defaults are sensible, and rebuilding an index is a single pass:

fscrawler job_name --loop 1 --restart

Here --loop 1 runs one pass and exits, while --restart clears the job’s status file and re-scans from scratch. Together these help balance up-front config time against GIGO and “we’ll take care of it in post-production.”
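
Until the de-duping gets cleverer, a pre-pass over the source tree keeps byte-identical copies out of the index. A hedged sketch with hypothetical paths; --by-hash tells rclone dedupe to match file contents rather than duplicate names:

rsync -av --ignore-existing /mnt/usb/photos/ ~/media/photos/   # consolidate without clobbering
rclone dedupe --by-hash --dedupe-mode newest ~/media           # keep the newest of identical files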

Eyeballing the ingestion and indexing processes is fine for checking initial results, tweaking, and discovering more as index searches return results. Getting media consolidated and indexed locally was one set of goals met. Locating assets I needed for other work was another win, and Kibana surfaced more follow-up opportunities than log viewing alone did.
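
Searching is a plain Elasticsearch query against the index FSCrawler creates (named after the job by default). A sketch against the local node assumed above; the search term and source fields are just examples:

curl -s 'http://127.0.0.1:9200/job_name/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": { "match": { "content": "storyboard" } },
    "_source": ["file.filename", "file.extension", "meta.author"]
  }'

The same query body runs from Kibana’s Dev Tools console, which is where those follow-up opportunities tended to show up.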
