- Pattern matching — glob, regex, or date-templated globs.
- Ordering — latest-by-mtime, FIFO, lexicographic, or process-all.
- Wait policy — retry up to N times with configurable interval; act
on deadline miss with
alert/skip/fail. - Atomic staging — every matched file is renamed into a
.processing/folder before any byte is read; on success it moves to a date-templated.archive/folder, on failure to.failed/. - Format parsers — CSV, JSON, and JSONL ship in Phase 3; Parquet, Avro, ORC, TSV, and XML throw an explicit scope-defer error rather than silently mishandle binary blobs.
Connector scope
Phase 3 implements the file-handling LOGIC for thelocal_fs
connector. The IR schema validates seven connector kinds (s3, gcs,
azure_blob, sftp, ftp, local_fs, http_pull) but only
local_fs has a runtime executor today; the others throw a clear
scope-defer error. Cloud-connector executors land in a dedicated
follow-up phase that focuses on cloud-SDK integration + credentials +
retry semantics — orthogonal to the file-handling logic shipped here.
Pattern types
Date-template tokens
| Token | Meaning |
|---|---|
{YYYY} | 4-digit year |
{MM} | 2-digit month |
{DD} | 2-digit day |
{HH} | 2-digit hour (24h) |
{mm} | 2-digit minute |
{ss} | 2-digit second |
{YYYY-MM-DD} | shorthand combo |
{YYYYMMDD} | compact combo |
timezone
(IANA), defaulting to UTC. After expansion, the value is compiled as a
glob (literal text + * and ? wildcards).
Ordering
| Mode | Behavior |
|---|---|
latest_by_mtime | mtime descending; newest first |
fifo_by_name | lexicographic ascending |
lexicographic | alias of fifo_by_name |
process_all | preserve readdir order |
Wait policy
maxRetries times, sleeping
intervalMinutes between attempts. On exhaustion:
onMissAction | Behavior |
|---|---|
alert | Log a warn; return rowsOut: 0; downstream sees the empty result |
skip | Log info only; return rowsOut: 0 |
fail | Throw file arrival deadline missed; the run fails |
Atomic staging
stagingFolder BEFORE any byte is
read. On clean parse, files move to successFolder (date templates are
expanded at archive time, in UTC by default). On parse failure, they
move to failureFolder and the upstream error rethrows.
stagingFolder, successFolder, and failureFolder may be relative
to path or absolute. If a file with the same basename already exists
at the destination, a .dup-<unix-ms> suffix is appended.
Crash recovery
If a run crashes mid-parse, files remain instagingFolder. A
follow-up improvement will rescan that folder first on the next run;
today the operator should manually move files back if needed.
Formats
| Format | Phase 3 status |
|---|---|
csv | implemented (header + comma-split; blank trailing lines ignored) |
json | implemented (array → rows; object → single row) |
jsonl | implemented (line-delimited; parse error reports the line number) |
parquet / avro / orc / tsv / xml | scope-defer error at runtime |
Error model
| Where | Trigger | Effect |
|---|---|---|
resolvePattern | malformed regex | run aborts with the original pattern |
discoverFiles | dir not found | ENOENT propagates (operator-visible) |
waitForFiles | exhausted + onMissAction=fail | run fails with file arrival deadline missed |
stageFiles | rename error | rethrows; downstream archive routes to failureFolder |
parseByFormat | malformed input | run fails; files moved to failureFolder |
| Unsupported format | parquet etc. | scope-defer error |
Related
- Pipeline IR — the typed source-node schema.
- AI Pipeline Authoring — chat surface that produces source-node IR.
- MCP Flow Server — external-agent surface.