Data & runtime — PDF / Excel / Python / APT
The tools for pulling data out of documents and processing it: PDF extraction,
Excel I/O, a persistent Python runtime, and package management. All of them run
on the Leif host: paths are Leif-workspace paths, and the apt_* tools install
packages on Leif. To work with a file that lives on another host, copy it to
Leif first (see Shell & file tools).
PDF
read_pdf(file_path, start_page=None, end_page=None, max_chars=100000)
extract_pdf_tables(file_path, page=None)
search_pdf(file_path, search_term, case_sensitive=False, max_results=50)
get_pdf_info(file_path)
read_pdf extracts text, capped at max_chars and optionally limited to a page
range; extract_pdf_tables extracts tabular data (omit page to cover all pages);
search_pdf finds a term without reading the whole document; get_pdf_info
returns metadata such as the page count. Useful for vendor price lists and
quotes that arrive as PDFs.
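When a document is too long to fit under max_chars in one read_pdf call, one option is to read it in page ranges. The sketch below only plans those ranges; page_ranges is a hypothetical helper, not one of the tools. It assumes pages are numbered from 1 and that start_page and end_page are inclusive, neither of which this page confirms, and it assumes the total comes from get_pdf_info.

```python
def page_ranges(total_pages, pages_per_call):
    """Split pages 1..total_pages into consecutive inclusive (start, end) ranges.

    Assumption: 1-based, inclusive page numbers. Adjust if read_pdf
    turns out to count pages differently.
    """
    if total_pages < 0 or pages_per_call < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_call >= 1")
    return [
        (start, min(start + pages_per_call - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_call)
    ]


# Example: a 23-page price list read 10 pages at a time.
ranges = page_ranges(23, 10)
print(ranges)  # [(1, 10), (11, 20), (21, 23)]
```

Each (start, end) pair would then be passed as start_page and end_page in a separate read_pdf call.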
Excel
read_excel(path, sheet_name=0, header_row=0)
write_excel(path, data, sheet_name="Sheet1")
append_to_excel(path, data, sheet_name=None)
convert_excel_to_csv(excel_path, csv_path=None, sheet_name=0)
search_excel(path, search_term, sheet_name=None, case_sensitive=False)
get_excel_info(path)
sheet_name accepts an index (0) or a name. For write_excel /
append_to_excel, data is a list of row objects (dicts). convert_excel_to_csv
is handy for getting a workbook into the CSV shape the pricing importer wants —
though the import itself still reads from nvrbackup (see
Pricing App File Import).
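As an illustration of the "list of row objects" shape, here is a minimal sketch that turns a header plus plain records into one dict per row. to_row_dicts is a hypothetical helper and the column names are invented for the example. It also assumes the dict keys end up as column headers in the sheet, which this page does not state.

```python
def to_row_dicts(header, records):
    """Pair each record with the header to produce one dict per row."""
    rows = []
    for record in records:
        if len(record) != len(header):
            raise ValueError(f"expected {len(header)} values, got {len(record)}")
        rows.append(dict(zip(header, record)))
    return rows


# Hypothetical columns and values, for illustration only.
header = ["sku", "description", "unit_price"]
records = [
    ("A-100", "Widget", 4.25),
    ("B-200", "Gasket", 0.80),
]

data = to_row_dicts(header, records)
print(data[0])  # {'sku': 'A-100', 'description': 'Widget', 'unit_price': 4.25}
```

The resulting data list is the kind of value that would be passed as the data argument to write_excel or append_to_excel.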
Python runtime
A persistent Python environment — variables defined in one
execute_python call survive into the next unless you clear them.
execute_python(code, clear_namespace=False, timeout=None)
get_python_namespace()
install_python_package(package)
list_python_packages()
validate_script(file_path, run_checks=True)
clear_namespace=True resets the environment for a clean run;
get_python_namespace shows what’s currently defined. Install missing deps with
install_python_package. validate_script syntax-checks a file before you run
it. This is the escape hatch for any data wrangling the dedicated tools don’t
cover.
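The persistence behaviour can be pictured with a toy model in plain Python. This is not how execute_python is implemented; PersistentNamespace is a hypothetical class that only mimics the documented semantics, and it assumes clear_namespace=True wipes the environment before the new code runs.

```python
class PersistentNamespace:
    """Toy model: one shared dict that survives across execute() calls."""

    def __init__(self):
        self._ns = {}

    def execute(self, code, clear_namespace=False):
        # Assumption: clearing happens before the code runs.
        if clear_namespace:
            self._ns = {}
        exec(code, self._ns)

    def names(self):
        # Hide entries Python adds itself, such as __builtins__.
        return sorted(k for k in self._ns if not k.startswith("__"))


runtime = PersistentNamespace()
runtime.execute("prices = [4.25, 0.80]")
runtime.execute("total = sum(prices)")  # 'prices' is still defined here
after_two_calls = runtime.names()
print(after_two_calls)  # ['prices', 'total']

runtime.execute("x = 1", clear_namespace=True)  # start from a clean slate
after_clear = runtime.names()
print(after_clear)  # ['x']
```

In the real tools, get_python_namespace plays the role of names() here.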
Local packages (APT)
apt_install(packages, update_cache=True)
apt_search(search_term)
apt_install installs on the Leif host; packages is a list of package names.
Related pages
- Shell & file tools — getting files onto the host these tools read
- web — fetching the documents to process
- sheets — Google Sheets (not Excel)
- Pricing App File Import — where parsed CSVs ultimately go