fix: keep _index.json total_pages in sync after recovering pages

The reprocess pass added chunks for pages beyond the original total_pages but
never updated the field, so doc-page navigation treated documents as ending
early (jumping to the next document mid-doc) and the page counter was wrong.
Now raise total_pages to the highest chunk page after each successful
integration; the value is never lowered.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Luiz Gustavo 2026-05-21 14:32:55 -03:00
parent fe19bb9c57
commit ebc6fa41e9


@@ -321,6 +321,10 @@ def process_one_page(doc_id: str, page_num: int) -> tuple[bool, int]:
     except Exception as e:
         print(f" [ERR ] {doc_id} p{page_num:03d} — integrate: {e}", flush=True)
         return (False, 0)
+    # Keep total_pages in sync with the real max page (recovered pages extend it)
+    max_page = max((c.get("page", 0) for c in idx.get("chunks") or []), default=0)
+    if max_page > idx.get("total_pages", 0):
+        idx["total_pages"] = max_page
     idx_path.write_text(json.dumps(idx, indent=2, ensure_ascii=False), encoding="utf-8")
     print(f" [OK ] {doc_id} p{page_num:03d}{n} chunks", flush=True)
     return (True, n)
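
The sync step in this hunk can be tried on its own with a small sketch like the one below. It is not part of the commit: the helper name `sync_total_pages`, the sample index contents, and the temporary file are assumptions made for illustration. Only the max-page comparison and the `write_text` call mirror the diff.

```python
import json
import tempfile
from pathlib import Path


def sync_total_pages(idx: dict) -> bool:
    """Raise idx["total_pages"] to the highest chunk page.

    Hypothetical helper (not in the commit). Returns True if the field changed.
    """
    # Same expression as the diff: tolerate a missing or null "chunks" list.
    max_page = max((c.get("page", 0) for c in idx.get("chunks") or []), default=0)
    if max_page > idx.get("total_pages", 0):
        idx["total_pages"] = max_page
        return True
    return False


# Assumed sample index: 3 pages recorded, but a recovered chunk sits on page 5.
idx = {"total_pages": 3, "chunks": [{"page": 1}, {"page": 3}, {"page": 5}]}
changed = sync_total_pages(idx)

# Round-trip through a temporary _index.json, written the same way as in the diff.
with tempfile.TemporaryDirectory() as tmp:
    idx_path = Path(tmp) / "_index.json"
    idx_path.write_text(json.dumps(idx, indent=2, ensure_ascii=False), encoding="utf-8")
    reloaded = json.loads(idx_path.read_text(encoding="utf-8"))
```

Because the comparison is strictly greater-than, the field only grows: an index whose total_pages already exceeds every chunk page is left untouched, which matches the behaviour the commit message describes.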