pyref Contributor Quickstart Guide

pyref is a Python library for reducing and analyzing polarized resonant soft X-ray reflectivity (PRSoXR) data collected at ALS Beamline 11.0.1.2, Lawrence Berkeley National Laboratory. The library couples a Python interface with a Rust backend via PyO3 bindings to achieve the parallel throughput required for large beamtime datasets. It is organized into three primary components: IO, which handles raw data ingestion and cataloging; Reduction, which reduces 2D detector images into 1D reflectivity profiles; and Fitting, which fits reduced profiles against optical models. Each component is designed to be independently extensible.

Technical Jargon

Fundamental Concepts

IO Module

The IO module is responsible for ingesting raw FITS files, cataloging their contents into a structured SQLite database, and exposing the resulting data through a lazy interface that returns pandas or polars DataFrames on demand. By default, the module expects the FITS file conventions and directory structures produced by ALS Beamline 11.0.1.2. New beamline formats can be added by extending the IO layer.

All performance-critical IO operations are implemented in Rust and exposed to Python via PyO3 bindings. Rust is responsible for parallel FITS file reading, header card extraction, image data loading, filename parsing, directory traversal, Diesel-managed SQLite catalog construction, and zarr archive management. Python is responsible for the user-facing query interface, DataFrame construction from catalog results, and any logic that does not require parallel throughput. This boundary is a design constraint, not a guideline: do not implement parallelism in Python and do not implement user-facing query logic in Rust.

The catalog database is managed by Diesel with the SQLite backend. The schema is defined in src/schema.rs and all database interactions must go through Diesel’s type-checked query builder. Raw SQL is prohibited except inside Diesel migration files. SQLite foreign key enforcement is off by default; the Rust connection initializer must execute PRAGMA foreign_keys = ON on every new connection before any other statement.
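The pragma requirement can be demonstrated from Python's stdlib sqlite3 module (the real initializer lives in the Rust IO layer; this is an illustration of the SQLite behavior, not pyref code):

```python
import sqlite3

# SQLite ships with foreign key enforcement disabled; every new
# connection must opt in before any other statement runs.
conn = sqlite3.connect(":memory:")
assert conn.execute("PRAGMA foreign_keys").fetchone()[0] == 0  # off by default

conn.execute("PRAGMA foreign_keys = ON")
assert conn.execute("PRAGMA foreign_keys").fetchone()[0] == 1

conn.execute("CREATE TABLE beamtimes (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE scans (id INTEGER PRIMARY KEY, "
    "beamtime_id INTEGER REFERENCES beamtimes(id))"
)
# With enforcement on, inserting a scan for a nonexistent beamtime fails
# loudly instead of silently orphaning the row.
try:
    conn.execute("INSERT INTO scans (id, beamtime_id) VALUES (1, 99)")
    raised = False
except sqlite3.IntegrityError:
    raised = True
assert raised
```

Without the pragma, the same INSERT would succeed and leave a dangling foreign key, which is exactly the silent corruption the initializer rule prevents.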

The IO module is accessed via pyref.io and the cataloging subsystem via pyref.io.catalog.

Reduction Module

The Reduction module converts per-frame 2D detector images into normalized, stitched 1D reflectivity profiles. Reduction proceeds in three sequential stages.

The first stage localizes the beamspot in each 2D detector image, integrates the ROI intensity, and subtracts the background estimated from a dark region of the detector. The second stage normalizes the extracted beam intensity against I0, beam current, exposure time, and the upstream gold mesh absorption current (the AI 3 Izero channel, stored in the catalog as ai3_izero). The third stage identifies the scan domain (fixed energy or fixed angle), classifies each frame as an I0 point, stitch point, or overlap point, computes per-stitch scaling corrections from the weighted mean of overlap regions, and assembles the stitched profile.

The Reduction module is accessed via pyref.reduction. The primary user-facing class is PrsoxrLoader.

Fitting Module

The Fitting module fits reduced reflectivity profiles against optical layer-stack models. The primary backend is refnx, which implements the 4x4 transfer matrix method for anisotropic and resonant systems. The module is responsible for model definition, parameter specification, constraint enforcement, objective function construction, fitting algorithm selection, and output formatting. The fitting module is accessed via pyref.fitting.

Reduction Subtleties

Uncertainty Quantification

Each frame is a photon-counting measurement. The raw intensity in each detector pixel follows a Poisson distribution to first approximation, giving a per-pixel standard deviation equal to the square root of the pixel count. Two classes of non-Poissonian noise are also present and must be accounted for.

Systematic noise originates from the detector itself (readout noise, dark current, stray light) and is characterized using a dark region of the detector image that is far from the beamspot. The mean and variance of this dark region are used to estimate the per-pixel systematic noise floor, which is subtracted from the ROI intensity and propagated into the final uncertainty.

Random non-Poissonian noise is characterized by the Fano factor, defined as the ratio of the observed variance in I0 measurements to the hypothesized Poissonian variance at the same intensity level. The Fano factor is computed as a function of incident energy from the ensemble of I0 frames within a scan, and is applied as a multiplicative scale on the Poissonian uncertainty for all frames at that energy. Agents implementing or modifying the uncertainty pipeline must propagate both the dark-region contribution and the Fano-scaled Poissonian contribution in quadrature at every reduction step. Silent variance truncation or implicit dtype coercion that reduces numerical precision is a correctness bug.
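A minimal numpy sketch of the two steps above. Function names are illustrative, not pyref API, and it assumes the Fano factor multiplies the Poissonian variance (its standard definition); if the pipeline instead scales the standard deviation directly, the Poisson term becomes fano**2 * roi_counts:

```python
import numpy as np

def fano_factor(i0_counts):
    """Observed variance of the I0 ensemble over the Poissonian variance (= mean)."""
    i0 = np.asarray(i0_counts, dtype=np.float64)
    return i0.var(ddof=1) / i0.mean()

def frame_uncertainty(roi_counts, fano, dark_sigma):
    """Fano-scaled Poisson variance and dark-region variance combined in quadrature."""
    return np.sqrt(fano * roi_counts + dark_sigma**2)

rng = np.random.default_rng(7)
i0_frames = rng.poisson(10_000, size=200)   # I0 ensemble at one energy
f = fano_factor(i0_frames)                  # ~1 for purely Poissonian noise
sigma = frame_uncertainty(40_000.0, fano=f, dark_sigma=15.0)
```

Everything stays float64 throughout, and both variance terms survive into the final sigma, per the quadrature rule above.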

Beamspot Localization

Beamspot localization is applied to each frame independently. The algorithm proceeds in the following fixed order; agents must not reorder or skip steps without explicit justification.

First, camera edge artifacts are removed by zeroing or masking a fixed border of pixels around the image perimeter. Second, a row-by-row background subtraction is applied: for each row, the median of a set of dark columns (columns known to be outside the beamspot region) is subtracted from all pixels in that row. Third, a column-by-column background subtraction is applied analogously. Fourth, a Gaussian filter is applied to suppress residual high-frequency noise. Fifth, a 2D peak fitting routine locates the beamspot centroid, integrated intensity, and fit standard deviation. The background intensity and its uncertainty are extracted from a designated dark region of the post-subtraction image.

Failed detections occur when the peak fitter cannot identify a credible Gaussian peak above the noise floor. A detection is considered failed when the fitted peak amplitude is less than a configurable multiple of the dark region standard deviation, or when the fitted centroid falls outside the detector boundary. Failed detections must be flagged in the BeamFinding Table and must not silently propagate NaN or zero values into the Reflectivity Table. Beamspot drift across a scan is expected and is not itself a failure condition; drift is characterized by fitting a linear model to the centroid coordinates as a function of Q or theta. Frames where the centroid deviates from the linear trend by more than a configurable threshold are flagged separately.
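The drift check described above reduces to a linear fit plus a residual threshold. A sketch, assuming the threshold is expressed as a multiple of the residual standard deviation (the function name and signature are illustrative):

```python
import numpy as np

def flag_drift_anomalies(q, centroids, n_sigma=3.0):
    """Fit a linear drift model centroid(q) and flag residual outliers."""
    q = np.asarray(q, dtype=np.float64)
    c = np.asarray(centroids, dtype=np.float64)
    slope, intercept = np.polyfit(q, c, 1)
    resid = c - (slope * q + intercept)
    return np.abs(resid) > n_sigma * resid.std(ddof=2)

q = np.linspace(0.01, 0.2, 20)
cent = 512.0 + 30.0 * q        # smooth linear drift: expected, not a failure
cent[7] += 25.0                # one frame jumps off the trend
flags = flag_drift_anomalies(q, cent)
```

Smooth drift produces near-zero residuals and no flags; only the frame that departs from the trend is marked for inspection.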

Scan Type and Domain Identification

The scan domain is determined by inspecting the motor trajectory across all frames in a scan. The classification procedure is as follows.

A scan is classified as a fixed-energy reflectivity scan when a leading block of frames has Sample Theta = 0 and a constant beamline energy (these are the I0 frames), followed by frames in which Sample Theta and CCD Theta increase monotonically (subject to stitch reversals). A scan is classified as a fixed-angle reflectivity scan when either a leading block of frames has Sample Theta = 0 and a varying beamline energy (I0 frames collected as a function of energy), or the scan contains no I0 block at all and beamline energy varies monotonically throughout. A multi-profile scan is not a distinct scan type but is a repetition of the above patterns within a single experimental scan: the instrument completes one full fixed-energy or fixed-angle sweep, then changes the fixed parameter (energy or angle) and repeats the sweep. Multi-profile scans are decomposed into their constituent profiles during reduction, each profile being treated as an independent fixed-energy or fixed-angle scan.

Once the domain is identified, stitch points are located by finding frames where the independent variable decreases relative to the preceding frame. Overlap points are the initial frames of a new stitch whose independent variable values fall within the range already covered by the preceding stitch. The scaling correction for each stitch is the weighted mean of the reflectivity values at the overlap points, where the weights are the inverse squared uncertainties of those frames.
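The stitch-boundary rule and the overlap correction can be sketched as follows. Treating the correction as a ratio of inverse-variance weighted means over the overlap points is one plausible reading of the rule above; the authoritative implementation is the reduction pipeline itself:

```python
import numpy as np

def stitch_starts(x):
    """Frames where the independent variable decreases start a new stitch."""
    return np.flatnonzero(np.diff(np.asarray(x, dtype=np.float64)) < 0) + 1

def inverse_variance_mean(values, sigmas):
    """Weighted mean using inverse squared uncertainties as weights."""
    w = 1.0 / np.asarray(sigmas, dtype=np.float64) ** 2
    return np.sum(w * np.asarray(values, dtype=np.float64)) / np.sum(w)

def overlap_scale(prev_vals, prev_sig, new_vals, new_sig):
    """Scale for a new stitch: ratio of weighted means over the overlap points."""
    return (inverse_variance_mean(prev_vals, prev_sig)
            / inverse_variance_mean(new_vals, new_sig))
```

Note the weights are inverse variances, never unweighted averages, matching the correctness constraints later in this document.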

Filename Parsing

The cataloging system parses FITS filenames to extract the sample name, zero or more tags, the scan number, and the frame number. The parsing contract is as follows and must be implemented exactly as specified.

The frame number is always the five-digit zero-padded integer to the right of the last hyphen in the filename stem (before the .fits extension). The scan number is always the five-digit zero-padded integer immediately to the left of that hyphen. The remainder of the stem to the left of the scan number is the concatenation of the sample name and any tags, optionally separated by underscores or hyphens. Because no separator is guaranteed between the sample name and the scan number, the scan number anchor is the five digits immediately left of the hyphen; the parser must split there first before attempting to tokenize the sample name and tags. The following filename patterns are all valid and must be handled without special-casing individual formats.

<sample_name>_<tag1>_<tag2>_<scan_number>-<frame_number>.fits
<sample_name>-<tag1>-<tag2>-<scan_number>-<frame_number>.fits
<sample_name>_<scan_number>-<frame_number>.fits
<sample_name><scan_number>-<frame_number>.fits
<sample_name><tag1><tag2><scan_number>-<frame_number>.fits
<sample_name>_<tag1>-<tag2>_<tag3>_<scan_number>-<frame_number>.fits

The number of tags is unbounded. Tags may contain alphanumeric characters and hyphens. Parsing failures must be logged and the offending file flagged in the File Table rather than silently skipped or allowed to panic.
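A minimal Python sketch of this contract (the production parser lives in the Rust IO layer). It anchors on the trailing <scan>-<frame> digit pair first, exactly as specified, then tokenizes the remainder. The underscore-only sample/tag split is a deliberate simplification: hyphen-separated and separator-free layouts cannot be split unambiguously without a known sample list, so those stems come back as a single sample token here:

```python
import re
from pathlib import Path

# Anchor: <scan: 5 digits>-<frame: 5 digits> at the end of the stem.
_STEM_RE = re.compile(r"^(?P<rest>.*?)(?P<scan>\d{5})-(?P<frame>\d{5})$")

def parse_fits_name(path):
    stem = Path(path).stem
    m = _STEM_RE.match(stem)
    if m is None:
        # Production code flags the file in the File Table instead of raising.
        raise ValueError(f"unparseable FITS filename: {path}")
    # Drop the optional separator between the name/tags and the scan number.
    rest = m.group("rest").rstrip("_-")
    tokens = [t for t in rest.split("_") if t]
    return {
        "sample": tokens[0] if tokens else rest,
        "tags": tokens[1:],
        "scan": int(m.group("scan")),
        "frame": int(m.group("frame")),
    }
```

Because the regex is end-anchored, a digit inside the sample name (e.g. film2) cannot be mistaken for part of the scan number.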

Directory Layout Traversal

Two directory layouts are supported. The cataloging system must detect which layout is present by inspection and handle both without user configuration.

The first layout places each scan in its own instrument subdirectory within a date-grouped scan directory. Detection criterion: the beamtime root contains one or more date directories, each of which contains one or more scan directories (named with a scan number prefix), each of which contains an instrument subdirectory named either CCD or Axis Photonique. FITS files live inside the instrument subdirectory.

<beamtime_root>/
    <date_dir>/
        CCD Scan <scan_number>/
            CCD/
                <sample>_<tags>_<scan>-<frame>.fits
            <sample>_<tags>_<scan>-AI.txt
        CCD Scan <scan_number>/
            Axis Photonique/
                <sample>_<tags>_<scan>-<frame>.fits
            <sample>_<tags>_<scan>-<frame>_AI.txt

The second layout places all FITS files in a single flat instrument directory directly under the beamtime root. Detection criterion: the beamtime root contains a directory named CCD or Axis Photonique that holds FITS files from multiple scan numbers.

<beamtime_root>/
    CCD/
        <sample1>_<tags>_<scan1>-<frame>.fits
        <sample2>_<tags>_<scan2>-<frame>.fits
    <sample1>_<tags>_<scan1>-AI.txt
    <sample2>_<tags>_<scan2>-AI.txt

AI text files are supplementary and are not required for cataloging. If present, they should be associated with their scan by matching the scan number extracted from the filename. If neither layout is detected, the cataloging system must emit a structured error identifying the unrecognized layout rather than silently producing an empty catalog.

I/O Operations and Cataloging

Connecting individual frames back to their originating sample, scan, and beamtime requires meticulous bookkeeping that is impractical to maintain manually across a full beamtime. pyref provides an automated cataloging system that ingests a beamtime directory and populates a Diesel-managed SQLite database encoding this hierarchy. The database is the structural backbone for all downstream reduction and fitting workflows. Users interact with it primarily through the lazy DataFrame interface exposed by pyref.io, which allows them to filter by sample name, tag, energy, or angle and receive a polars or pandas DataFrame containing the relevant frame metadata and image retrieval handles.

Catalog and Cache Storage

Default: local user data directory

By default, pyref maintains a single persistent catalog that accumulates every beamtime the user has ever ingested. The catalog and its associated zarr cache live in the platform-appropriate user data directory, resolved at runtime by the Rust IO layer using the directories crate:

Platform   Default catalog path
Linux      $XDG_DATA_HOME/pyref/catalog.db (falls back to ~/.local/share/pyref/catalog.db)
macOS      ~/Library/Application Support/pyref/catalog.db
Windows    %APPDATA%\pyref\catalog.db

The zarr archive for each beamtime is stored on local disk under the same platform data directory as the catalog, not under the system cache directory: <data_dir>/pyref/.cache/<beamtime_hash>/beamtime.zarr, where <data_dir> is the same root as in the table above ($XDG_DATA_HOME or ~/.local/share, ~/Library/Application Support, or %APPDATA% as appropriate) and <beamtime_hash> is a stable SHA-256 digest of the beamtime root path recorded at ingestion time. Example on macOS: ~/Library/Application Support/pyref/.cache/<beamtime_hash>/beamtime.zarr. The zarr tree is local-only; NAS-backed FITS are used for ingestion and re-ingestion, not for routine image reads after ingest.

Optional environment overrides: PYREF_CATALOG_DB (absolute path to catalog.db) and PYREF_CACHE_ROOT (parent of <beamtime_hash>/beamtime.zarr directories). Parallel FITS reads during ingest honor PYREF_INGEST_WORKER_THREADS or PYREF_INGEST_RESOURCE_FRACTION when explicit kwargs or TUI config fields are unset.
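The resolution order can be sketched in Python. The actual resolution is performed by the Rust IO layer via the directories crate and is fully platform-aware; the Linux-style fallback here is illustrative only:

```python
import hashlib
import os
from pathlib import Path

def catalog_db_path():
    """PYREF_CATALOG_DB wins; otherwise fall back to the platform data dir."""
    override = os.environ.get("PYREF_CATALOG_DB")
    if override:
        return Path(override)
    # Illustrative Linux fallback ($XDG_DATA_HOME, else ~/.local/share).
    data_home = os.environ.get("XDG_DATA_HOME", str(Path.home() / ".local/share"))
    return Path(data_home) / "pyref" / "catalog.db"

def beamtime_zarr_path(beamtime_root, cache_root=None):
    """Stable SHA-256 digest of the beamtime root path keys the zarr cache."""
    digest = hashlib.sha256(str(beamtime_root).encode()).hexdigest()
    root = Path(cache_root
                or os.environ.get("PYREF_CACHE_ROOT")
                or catalog_db_path().parent / ".cache")
    return root / digest / "beamtime.zarr"
```

Because the digest is computed from the recorded root path string, the same beamtime always maps to the same cache directory across runs.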

Ingestion is a single pipeline: catalog metadata (Diesel/SQLite) and zarr array writes happen in one pass. There is no separate user-visible “metadata only” phase followed by a later image materialization step.

The raw FITS files on the NAS are only required during initial ingestion and re-ingestion. After ingestion, reduction and browsing workflows operate from the local catalog and zarr store. If the NAS is unavailable, previously ingested beamtimes remain fully accessible from local storage.

Path aliasing for NAS-sourced data

Because NAS mount points differ across machines (e.g., /Volumes/beamdata on macOS vs. /mnt/beamdata on Linux), paths stored in beamtimes.path and files.path are recorded as logical URIs of the form nas://<label>/<relative_path> rather than absolute filesystem paths. The label component is a short user-assigned name for the NAS volume (e.g., als-data). Physical mount point resolution is handled by the path_aliases table in the catalog, which stores (label, physical_path) pairs for the current machine.

On a new machine, the user registers the mount point once:

pyref config set-mount als-data /mnt/beamdata

This inserts or updates the row in path_aliases. All path resolution in the Rust IO layer goes through this table before any filesystem operation. If a label has no registered physical path, the IO layer must return a structured UnresolvedAlias error that names the missing label explicitly, rather than a generic file-not-found error. The path aliasing layer is transparent to Diesel queries; aliases are resolved by the IO layer before constructing filesystem paths, never inside SQL.
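The resolution contract can be sketched in Python (the production resolver is in the Rust IO layer; the exception name follows the UnresolvedAlias error described above, and the dict stands in for the in-memory path_aliases cache):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

class UnresolvedAlias(KeyError):
    """A nas:// label with no registered mount point on this machine."""

def resolve_nas_uri(uri, aliases):
    """Map nas://<label>/<relative_path> to a physical path via path_aliases."""
    parsed = urlparse(uri)
    if parsed.scheme != "nas":
        raise ValueError(f"not a nas:// URI: {uri}")
    label = parsed.netloc
    try:
        mount = aliases[label]
    except KeyError:
        # Structured error naming the missing label, not a generic not-found.
        raise UnresolvedAlias(
            f"no physical path registered for NAS label '{label}'; "
            f"run: pyref config set-mount {label} <mount_point>"
        ) from None
    return PurePosixPath(mount) / parsed.path.lstrip("/")
```

Resolution happens before any filesystem call, never inside SQL, matching the boundary stated above.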

The zarr cache path stored in beamtimes is an absolute local path and is never aliased, since it lives on the local machine by definition. Ingestion records both the NAS logical URI (for FITS provenance) and the resolved local zarr path (for image retrieval) in beamtimes as separate columns.

Alternative: shared catalog on a network drive or mounted filesystem

For groups where multiple users share a beamtime dataset and want a common catalog, pyref supports an explicit catalog path override. The user specifies a path to a directory that is visible to all machines, typically a network share or a FUSE-mounted filesystem:

pyref config set-catalog /mnt/shared/pyref/catalog.db

When using a shared catalog, the following constraints apply and must be enforced by the IO layer.

SQLite over NFS is unsafe for concurrent writes due to unreliable advisory file locking. If the configured catalog path resolves to a network filesystem (detected by comparing the device ID of the catalog file against known local device IDs, or by an explicit --network flag acknowledged by the user at config time), the IO layer must open the connection in WAL mode with a generous busy timeout and must warn the user that concurrent write access from multiple machines is not supported. Concurrent read access is safe in WAL mode. The recommended usage pattern for shared catalogs is that one designated machine performs all ingestion (writes) and all other machines open the catalog read-only.
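A sketch of the connection policy using stdlib sqlite3 (the real connection management goes through Diesel in Rust; the function name and defaults are illustrative):

```python
import sqlite3

def open_shared_catalog(path, read_only=False, busy_timeout_ms=30_000):
    """WAL journal mode plus a generous busy timeout for shared catalogs.

    Concurrent multi-machine writes remain unsupported: one designated
    machine writes, every other machine opens read-only.
    """
    uri = f"file:{path}?mode={'ro' if read_only else 'rwc'}"
    conn = sqlite3.connect(uri, uri=True, timeout=busy_timeout_ms / 1000)
    if not read_only:
        # journal_mode persists in the database file once set.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        if mode != "wal":
            raise RuntimeError(f"could not enable WAL on {path}: got {mode}")
    conn.execute(f"PRAGMA busy_timeout = {busy_timeout_ms}")
    return conn
```

WAL mode allows concurrent readers alongside a single writer on a local filesystem; on a network filesystem even that single writer must be confined to one machine.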

The zarr cache may also be redirected to a shared location using a separate config key:

pyref config set-cache /mnt/shared/pyref/zarr

A zarr archive on a fast local network share (e.g., 10GbE NFS or SMB) is acceptable for read-heavy workloads such as browsing and fitting. It is not acceptable as the primary location for zarr writes during ingestion, which should always target local storage first and be copied to the shared location afterwards if shared access is desired.

path_aliases table

This table lives in the catalog database and is machine-local in semantics, even when the catalog is on a shared drive. It stores one row per registered NAS label for the current machine. The Rust IO layer reads this table on startup and caches the mappings in memory for the duration of the process. Agents must never read path_aliases directly from Python; path resolution is an IO-layer concern exposed through the pyref.io interface.

Column          Type     Description
id              Integer  Primary key.
label           Text     Short user-assigned NAS label (e.g., als-data). Unique per catalog.
physical_path   Text     Absolute filesystem path to the mount point on this machine.
registered_at   Text     ISO 8601 timestamp of last registration.

Cataloging System

The cataloging system ingests a beamtime directory and populates a Diesel-managed SQLite database that serves as the structural backbone for all downstream reduction and fitting workflows. Its primary purpose is to resolve individual FITS frames back to their originating sample, scan, and beamtime, and to expose this hierarchy as a queryable, lazily accessible interface. The schema is defined in src/schema.rs and must be kept in sync with the Diesel migration files under migrations/. The most common access pattern is retrieving all frames associated with a given sample and tag combination, grouped by energy or angle, for assembly into a stitched reflectivity profile.

The profiles table is the primary user-facing entry point. Users browse profiles, not scans. Scans are administrative provenance; profiles are the scientific deliverable.

Header Card Tiering

The 115 FITS primary HDU cards per frame are split into two tiers at ingestion time. Eleven cards that directly drive scan classification, beamspot localization, normalization, and profile identity are promoted to first-class typed columns on the frames table: sample_x, sample_y, sample_z, sample_theta, ccd_theta, beamline_energy, epu_polarization, exposure, ring_current, ai3_izero, and beam_current. All remaining cards are stored in the frame_header_values EAV table, keyed through the header_cards registry. The header_cards table is populated automatically on first ingestion from whatever cards are present in the FITS files; subsequent beamtimes with new or renamed channels append rows to this table without requiring a schema migration. If a card that was previously treated as non-critical needs to be queried as a first-class column, the correct remedy is a Diesel migration that adds the column to frames and backfills it from frame_header_values, not a workaround join.
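The EAV lookup path for a non-promoted card looks like the following sketch. Table and column names are minimal illustrations (the authoritative schema is src/schema.rs), and the card name used here is a hypothetical example, not a guaranteed ALS card:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE header_cards (
        id INTEGER PRIMARY KEY, card_name TEXT UNIQUE,
        display_name TEXT, category TEXT);
    CREATE TABLE frame_header_values (
        frame_id INTEGER, card_id INTEGER REFERENCES header_cards(id),
        value DOUBLE);
""")
conn.execute("INSERT INTO header_cards VALUES "
             "(1, 'Horizontal Exit Slit Size', 'Exit slit', 'motor')")
conn.execute("INSERT INTO frame_header_values VALUES (42, 1, 150.0)")

# Resolve a non-promoted card for one frame through the registry.
row = conn.execute("""
    SELECT v.value FROM frame_header_values v
    JOIN header_cards c ON c.id = v.card_id
    WHERE v.frame_id = ? AND c.card_name = ?
""", (42, "Horizontal Exit Slit Size")).fetchone()
assert row == (150.0,)
```

If this join appears in a hot query path, that is the signal to promote the card via a Diesel migration rather than to optimize the join.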

beamtimes

Root of the catalog hierarchy. Stores two path columns: nas_uri, which is the logical nas://<label>/<relative_path> URI pointing to the original FITS data on the NAS, and zarr_path, which is the absolute local filesystem path to the beamtime’s zarr archive at <data_dir>/pyref/.cache/<beamtime_hash>/beamtime.zarr (same <data_dir> convention as the default catalog path). The nas_uri is used for provenance and during ingestion and re-ingestion; all post-ingestion image retrieval goes through zarr_path. The date parsed from the beamtime directory name is also stored here. All other tables carry a foreign key to this table.

samples

One row per unique sample name within a beamtime. Stores the sample name and the median sample_x, sample_y, and sample_z stage positions computed across all frames attributed to that name. Stage positions are nominally fixed per sample; frames that deviate beyond a configurable tolerance are flagged mislabeled_sample in frames.quality_flag rather than creating a second sample row.

tags

One row per unique tag slug parsed from FITS filenames. Tags carry no intrinsic meaning to the catalog. The many-to-many relationship between tags and files is resolved through the file_tags junction table.

file_tags

Junction table linking files to tags. Allows many tags to map to many files without duplication.

files

One row per FITS file ingested. Stores the absolute path, bare filename, scan number, frame number, sample ID, and beamtime ID. This is the canonical provenance reference for raw file locations and the join target for tag resolution. Image data is not retrieved through this table; it is accessed via the zarr keys on frames.

scans

One row per scan. Stores the scan number, scan type (fixed_energy or fixed_angle), start and end timestamps, sample ID, and beamtime ID. Scan type is determined during ingestion from the motor trajectory analysis and recorded here as a first-class attribute so downstream reduction does not recompute it.

header_cards

Registry of FITS header card names discovered during ingestion. One row per unique card name. Each row stores the raw card name as it appears in the FITS header, a human-readable display name, and a category label (motor, ai, camera, or metadata). This table is the lookup key for the frame_header_values EAV table. Agents must not hard-code card name strings outside of this table and the first-class column definitions on frames.

frames

One row per frame per scan. Stores the eleven first-class header card values as typed Double columns, plus the zarr retrieval keys (zarr_group_key and zarr_frame_index) needed to fetch the detector image. The monolithic zarr archive is located at beamtimes.zarr_path for the parent beamtime. Within the archive, scan number is the group key, frame number is the dataset index within that group, and each group stores two datasets per frame named raw and processed. The raw dataset is the image as extracted from the FITS file; the processed dataset is the image after edge artifact removal, row-wise and column-wise background subtraction, and Gaussian filtering. Both share the same frame index. Each row also carries FKs to files and scans, providing the full provenance chain from pixel to beamtime.

frame_header_values

EAV store for all FITS header cards not promoted to first-class columns on frames. All values are stored as Double. The card name is resolved through header_cards. This table is append-only after initial ingestion; values are never updated in place.

profiles

The primary user-facing table. One row per reduced reflectivity profile, where a profile is a single continuous 1D curve assembled from one or more stitches. Multi-profile scans produce multiple rows sharing the same scan_id, distinguished by profile_index. Each row stores the profile type (fixed_energy or fixed_angle), the value of the fixed parameter (energy in eV for fixed-angle profiles, theta in degrees for fixed-energy profiles), epu_polarization, and the median stage position (sample_x, sample_y, sample_z) over all member frames. The stage position columns here are denormalized from frames for query convenience; the authoritative per-frame positions remain in frames.

profile_frames

Junction table mapping profiles to their constituent frames, with a frame_role column classifying each frame’s function in the reduction pipeline. Valid roles are i0, stitch, overlap, and reflectivity. I0 frames appear here multiple times when they serve as the normalization reference for more than one profile in a multi-profile scan; this is by design and is the correct resolution of the shared-I0 problem. Agents must not attempt to enforce uniqueness on frame_id in this table.

beam_finding

Per-frame output of the beamspot localization pipeline. Stores the preprocessing parameters applied (edge removal flag, dark column and row counts for background subtraction, Gaussian kernel sigma), the peak fitting result (centroid row and column, ROI intensity, fit standard deviation), the dark region statistics (mean and standard deviation), and a detection_flag with values ok, beam_detection_failed, or beam_drift_anomaly. Each row carries a FK to frames. Frames with detection_flag = beam_detection_failed must not appear in reflectivity.

stitch_corrections

Per-stitch correction factors computed during the normalization and stitching pipeline. One row per stitch segment within a profile. Stores the fano_factor (always non-null; 1.0 when no Fano correction was applied), the overlap_scale_factor (null for the first stitch, which has no preceding stitch), the i0_normalization_value, and i0_source_scan_id (null when I0 comes from frames within the current profile; set to the external scan’s ID for fixed-angle profiles that borrow I0 from a separate scan). Each row carries a FK to profiles.

reflectivity

Frame-level reduced reflectivity data after normalization and stitching. One row per reduced frame. Stores Q (inverse angstroms), theta (degrees), energy (eV), normalized intensity, propagated one-sigma uncertainty, and frame_type (i0, stitch, overlap, or reflectivity). Carries FKs to profiles, frames, and beam_finding. Frames flagged beam_detection_failed must not appear here. Stitched profile assembly and parquet export are performed by the packaging utility in pyref.reduction, which joins this table against stitch_corrections filtered by profile_id.

Data Quality Flagging

Several conditions arising during cataloging and reduction require flagging rather than silent failure or hard error. The following flags are first-class catalog attributes stored in typed text columns, not log messages, and must be queryable from the DataFrame interface. Flag values use lowercase snake_case to match the string literals defined in schema.rs.

Frames attributed to a sample name whose inferred stage position deviates from the representative position for that sample by more than a configurable threshold are flagged mislabeled_sample in frames.quality_flag. The threshold is configurable per beamtime but should default to a value that tolerates sub-millimeter drift while rejecting stage position changes consistent with a deliberate sample move.

Frames for which the beamspot localization algorithm fails to identify a credible peak are flagged beam_detection_failed in beam_finding.detection_flag. These frames must not appear in reflectivity and must not be silently zeroed or filled.

Frames whose fitted beamspot centroid deviates from the linear trend model across the scan by more than a configurable multiple of the fit residual standard deviation are flagged beam_drift_anomaly in beam_finding.detection_flag. These frames may still appear in reflectivity but are surfaced to the user for manual inspection before the packaging utility includes them in a parquet export.

Files that fail the filename parser are flagged parse_failure in files.parse_flag. Files whose directory layout does not match either supported pattern are flagged at the beamtime level via a structured error returned to the caller rather than a catalog row.

Profile Packaging Utility

The packaging utility assembles a stitched reflectivity profile from the reflectivity and stitch_corrections tables and exports it as a flat parquet file. It is a standalone tool in pyref.reduction that operates on catalog query results and is not part of the catalog itself. The primary input is a profile_id; the utility joins reflectivity against stitch_corrections filtered by that ID to assemble the full stitched curve. The output parquet schema is fixed and contains one row per reduced frame with the following columns at minimum: q (inverse angstroms), theta (degrees), energy (eV), intensity (normalized, dimensionless), uncertainty (one-sigma), frame_type, scan_number, sample_name, and overlap_scale_factor. The utility must reject frames with beam_finding.detection_flag = beam_detection_failed and must warn the user before including frames with beam_drift_anomaly. Agents must not add or remove columns from the output schema without updating this specification.

Fitting Module Architecture

The Fitting module wraps refnx to provide a domain-specific interface for PRSoXR model construction and optimization. The 4x4 transfer matrix method implemented in refnx handles anisotropic dielectric tensors and resonant scattering contrast, which are the physically relevant cases for this beamline. Agents working in pyref.fitting are expected to understand the transfer matrix formalism and the relationship between the model layer stack (thickness, roughness, optical constants) and the computed reflectivity curve.

The module boundary is as follows. Model construction (layer stack definition, optical constant assignment, parameter bounds, and inter-parameter constraints) lives in pyref.fitting. Numerical optimization and MCMC sampling are delegated entirely to refnx and must not be reimplemented. Objective function construction follows refnx conventions. Fitting output is returned as a structured result object containing the optimized parameters, their uncertainties, the fitted reflectivity curve, and the residuals, and must be serializable to parquet or HDF5 for archival.

Numerical Precision and Scientific Correctness

This library processes experimental data where uncertainty quantification is not decorative but determines the validity of downstream physical conclusions. The following constraints are non-negotiable.

Uncertainty propagation must be exact to first order at every reduction step. Any operation that discards, truncates, or implicitly zeros an uncertainty is a correctness bug. Weighted means must use inverse-variance weights throughout; unweighted means are not acceptable substitutes in any reduction context.
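For example, the I0 normalization step R = I / I0 propagates exactly to first order as relative uncertainties added in quadrature; a minimal sketch (function name illustrative):

```python
import numpy as np

def normalize(intensity, sigma_i, i0, sigma_i0):
    """R = I / I0 with exact first-order uncertainty propagation."""
    r = intensity / i0
    # Relative uncertainties add in quadrature for a quotient.
    sigma_r = np.abs(r) * np.sqrt((sigma_i / intensity) ** 2
                                  + (sigma_i0 / i0) ** 2)
    return r, sigma_r

r, s = normalize(2000.0, np.sqrt(2000.0), 40_000.0, np.sqrt(40_000.0))
```

Dropping the I0 term here, or returning sigma_i / i0 alone, is precisely the kind of silent variance truncation this section prohibits.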

Silent dtype coercion is prohibited. Operations that would coerce float64 to float32, or integer counts to floating point without explicit intent, must be made explicit or prevented. The catalog and parquet outputs must preserve float64 precision for all physical quantities.

Frame provenance must be preserved end-to-end. Every row in reflectivity must be traceable back to the originating FITS file and zarr frame index through the FK chain reflectivity -> frames -> files. Any reduction step that aggregates without recording the constituent frame IDs breaks this chain and is unacceptable.

Fano factor computation and application must be documented in stitch_corrections for every scan. A scan processed without a Fano correction must record fano_factor = 1.0 rather than leaving the field null.

General

General Structure

This codebase is maintained by contributors with physics PhDs and extensive backgrounds in scientific and engineering software, including numerical computing, data analysis, instrumentation, simulation, and research-grade reproducibility. Maintainers are highly mathematically literate, comfortable with linear algebra and statistics, and expect rigorous numerics with explicit type handling—silent coercion and imprecise computations are not acceptable.

Operating principles

Communication and documentation outside code

Public API documentation (language-agnostic)

Module and package documentation

Tooling, skills, and continued learning

Task shape (goal, context, constraints, completion)

Quality bar for agent output

Python

Python

The following applies to Python work in this repository: scientific and general-purpose code, with emphasis on clear structure, reproducible tooling, and documentation that matches how the team uses Cursor (skills, subagents, and editor rules).

Conventions

Tooling

Use uv for environments, runs, and dependency changes. Pair it with the Astral stack as configured in this project.

If a uv subcommand differs by version, use uv --help or the uv docs.

Testing

Cursor: skills

Load these skills by name when the task matches (each skill’s own SKILL.md and references hold the full detail). Installed skills usually live under .cursor/skills/ (or your editor’s equivalent).

| Skill | Use it for |
| --- | --- |
| general-python | Hub: uv / ruff / ty workflow, builtins and collections, functions and classes, dataclasses, typing boundaries, pytest, scientific defaults, and pointers to the other skills. |
| numpy-scientific | NumPy: dtypes, views vs copies, broadcasting, ufuncs and reductions, linalg / einsum, Generator, I/O, interop with tables and plotting. |
| dataframes | pandas and Polars: when to use which, indexing, joins, lazy execution, I/O, nulls, Arrow interop. |
| numpy-docstrings | Numpydoc-style docstrings: section order, semantics (what belongs in docstrings vs types vs tests), anti-patterns, Parameters / Returns / Examples / classes / modules. |
| matplotlib-scientific | Publication-style Matplotlib: OO API, axes and legends, layout, export, journal widths, optional SciencePlots. |
| lab-instrumentation | PyVISA / VISA sessions, sockets vs VISA, hardware abstraction, input validation before I/O, testing without hardware, PDF extraction for datasheets and manuals. |

Cursor: subagents

Delegate by subagent name when a focused pass is better than inline editing. Subagents usually live under .cursor/agents/ (or your editor’s equivalent).

| Subagent | Use it for |
| --- | --- |
| python-reviewer | Reviewing changes: uv hygiene, typing, numerics footguns, tests, docstring quality. |
| python-types | Deep typing for ty: annotations, PEP 695-style generics, exhaustive match, fixing checker output. |
| python-refactor | Structure: unclear multi-value returns, composition vs inheritance, oversized functions or classes, deterministic boundaries. |

Cursor: rules

External references

Rust

This workspace uses the dotagent Rust stack.

Python - Jupyter

This workspace extends Python with Jupyter notebook expectations.

General Guidelines

This project makes heavy use of Jupyter notebooks, primarily for scientific work but sometimes for general-purpose tasks. Notebooks should be well documented. Keep in mind that a notebook allows a lot of flexibility, so write the code in a way that is easy to understand and maintain.

Always use the Jupyter tooling to create and edit notebooks; never generate notebook JSON directly.

As a general rule, we will use notebooks for one of the following purposes:

Lightweight Exploration Notebooks

Here we will have a few cells that build up the idea or concept, test it with some data, and then present the results in a clean, accessible way. Use matplotlib for static plotting, or hvplot/altair/plotly for interactive plotting.

These notebooks are not meant to be robust, so do not spend effort making them so. Write the minimal code to get the job done, then move on to the next notebook.

Prototyping Notebooks

This is where we will prototype a robust solution to a problem. Use many small cells so that each chunk of a complex workflow can be tested and validated as it is written. Eventually, this code will move into a final script or library.

These notebooks are not designed to showcase the code, but to test and validate each chunk of a complex workflow. Do not focus on making them look nice; instead, atomize the code into small cells that can be tested and validated individually. Testing and validation might be done with simple displays, plots, or other checks, but assert statements are not necessary and should be avoided where possible in favor of displaying the results to the user.

Demonstration Notebooks

These notebooks are designed to showcase a workflow built on production-ready code. Ideally, after a library is complete and ready to use, the user can import the library and treat the notebook as a production environment. They should have a minimal number of cells, mixing documentation with examples of how to use the library.

These notebooks are designed to be robust. They should mix in a healthy amount of markdown documentation and explanation without being heavy-handed. Keep in mind that the goal is for a user to be able to copy them into their own notebooks and know how to use the library.

Use of Cells

Python - PyO3

This workspace extends Python with PyO3 / Maturin integration expectations.

General Guidelines

We are using the PyO3 library to create a Python extension for our project. The ONLY reason we are doing this is to speed up execution of critical code. As such, use PyO3 only for code that is performance critical, and avoid it for code that is not. Generally, I/O orchestration is easier to implement in Python, while multi-threaded operations are faster in Rust.

The Python GIL (global interpreter lock) limits Python's ability to be truly multi-threaded. Rust code, however, can release the GIL, leading to a significant speedup. As such, do not multi-thread in Python; pass tasks into Rust for parallel processing and return the results to Python.

Tooling

uv is still king within the project and should be used for all Python needs. However, maturin is the build system for the project; see the maturin documentation for details, and ensure that maturin is installed in the dev dependency group. See the uv documentation for running maturin as a tool. In general, we prefer a maturin-native layout in the following format:

.
├── pyproject.toml
├── Cargo.toml
├── python/
│   ├── <module_name>/
│   │   ├── __init__.py
│   │   └── ...
├── src/
│   ├── lib.rs
│   └── bindings.rs

Ensure that the python/ directory is a valid python package, and that the src/ directory is a valid rust crate.
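A minimal pyproject.toml sketch for this layout might look like the following (the module name is a placeholder for illustration; check the maturin documentation for your installed version):

```toml
[build-system]
requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"

[tool.maturin]
# Point maturin at the python/ package directory of the mixed layout.
python-source = "python"
# Placeholder: match this to the #[pymodule] name exported by src/lib.rs.
module-name = "pyref._core"
features = ["pyo3/extension-module"]
```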

Rust - TUI

This workspace extends Rust with TUI application expectations.

Rust - PyO3

This workspace extends Rust with PyO3 / Maturin extension expectations.

Learned User Preferences

Learned Workspace Facts