Test Suite Performance Analysis #

Date: 2026-03-30 Run: poe test:full (sequential, no -n auto) Duration: 8h 48m 45s (31,725s) Results: 17,463 passed, 43 failed, 107 skipped, 9 xfailed / 17,621 collected

1. Test Results: Are Things OK? #

Almost entirely OK. The 43 failures break down as:

Failure Group	Count	Cause	Status
`test_changelog.py`	1	Invalid changelog section type	Minor bug, non-critical
`test_context_thread_safety.py`	1	Concurrent topocentric contexts	Real bug, likely race condition
`test_extended_asteroids.py`	25	Asteroids outside SPK range (1920-2080) tested over -5000 to +5000	Known limitation (documented in CLAUDE.md)
`test_compare_leb_asteroids.py`	15	Same asteroid SPK range issue	Known limitation
`test_heliocentric_positions.py`	1	Earth helio vs Sun geocentric mismatch	Precision bug

The 40 asteroid failures are expected — JPL SPK21 coverage is 1920-2080 CE, but extended tier tests span -5000 to +5000. Outside SPK range, Keplerian fallback produces catastrophically wrong results. These tests should be filtered or skipped.

4,664,832 warnings were emitted (mostly Skyfield RuntimeWarning from relativity.py at extreme dates).

2. Performance Bottlenecks #

Bottleneck #1: `CompareHelper` Resets State on Every Call in Inner Loops #

The single largest time sink. The LEB comparison tests iterate over hundreds of dates per test, calling both compare.skyfield() and compare.leb() at each step:

# test_extended_lunar.py — runs for each of 6 ecliptic bodies × 6 test classes
for jd in ext_dates_500:  # 500 dates!
    ref, _ = compare.skyfield(ephem.swe_calc_ut, jd, body_id, SEFLG_SPEED)
    leb, _ = compare.leb(ephem.swe_calc_ut, jd, body_id, SEFLG_SPEED)

Each compare.skyfield() call:

Sets _LEB_FILE = None, _LEB_READER = None
Calls set_precision_tier(tier) — may trigger BSP reload
Calls set_calc_mode("skyfield")
Executes full Skyfield computation (DE441 for extended tier)
Calls set_calc_mode(None) in finally block

Each compare.leb() call:

Sets _LEB_FILE = path, _LEB_READER = None — forces LEB file re-open
Calls set_calc_mode("auto")
Executes LEB computation (reader re-created from scratch)
Cleanup: _LEB_FILE = None, _LEB_READER = None, set_calc_mode(None)

Impact on test_extended_lunar.py alone:

6 bodies × 6 test classes × 500 dates × 2 calls = 36,000 swe_calc_ut invocations
Each with full mode switching + LEB reader re-creation
Estimated: 3-5 hours for this single test file

The _LEB_READER = None on every call is the critical issue — it forces open_leb() (file open + header parse + index load) on every iteration of the inner loop.

Bottleneck #2: `reset_ephemeris_state` Autouse Fixture Destroys Caches #

This fixture runs before and after every single test (17,621 times):

@pytest.fixture(autouse=True)
def reset_ephemeris_state():
    ephem.swe_set_sid_mode(SE_SIDM_FAGAN_BRADLEY)
    clear_caches()  # <-- NUKES ALL LRU CACHES
    yield
    # ... restore state

clear_caches() empties:

Nutation LRU cache
Obliquity LRU cache
Time UT1 conversion LRU cache
Time TT conversion LRU cache
Observer-at cache

Impact: Sequential date computations within a test benefit from cache hits (same nutation/obliquity for nearby dates). But the cache is wiped between every test, destroying locality. For the comparison tests that iterate over 200-500 dates, the first few calls after a cache clear are significantly slower.

Bottleneck #3: Skyfield DE441 Is Intrinsically Slow #

Extended tier tests use DE441, which covers 10,000 years (-13200 to +17191). Each swe_calc_ut via Skyfield involves:

JD → TT conversion (with Delta-T lookup)
SPK segment interpolation (DE441 has large segments)
Nutation computation (Meeus polynomial series)
Aberration, gravitational deflection corrections
Full ecliptic/equatorial coordinate pipeline

Estimated ~10-50ms per Skyfield call. With ~100,000+ total Skyfield calls across the suite, this accounts for hours.

Bottleneck #4: No Parallelization #

poe test:full runs bare pytest without -n auto. All 17,621 tests execute sequentially on a single CPU core. The machine has multiple cores available (98% CPU utilization on one core).

Bottleneck #5: Redundant Computation in Test Structure #

The comparison tests split each body into 6 separate test classes (longitude, latitude, distance, speed_lon, speed_lat, speed_dist). Each class iterates over the same dates calling the same function with the same arguments, but only checks one component of the 6-tuple result.

For test_extended_lunar.py:

6 classes × 6 bodies = 36 tests
Each test calls swe_calc_ut() 500×2 = 1,000 times
But all 6 classes compute identical results — they just assert on different indices

A single test verifying all 6 components would reduce Skyfield calls by 6×.

3. Estimated Time Distribution #

Test File/Area	# Tests	Dates/Test	swe_calc_ut Calls	Est. Time
`test_extended_lunar.py`	36	500	36,000	3-5h
`test_leb_precision.py`	~2,400	150-200	~480,000	1-2h
`test_extended_velocities.py`	~60	200	24,000	1-2h
`test_extended_ancient/future/sidereal.py`	~100	100-200	~40,000	30m-1h
`test_extended_distances.py`	~35	200	14,000	15-30m
`test_flag_pair_robustness.py`	940	5	9,400	~10m
`test_flag_combinations_stress.py`	892	varies	~5,000	~10m
All other tests (~13,000)	13,000	1-9	~50,000	~30m
Total	17,621		~650,000+	~8h 49m

The LEB compare/extended tests dominate: ~80% of total time in ~300 tests.

4. Recommended Fixes (Priority Order) #

P0: Fix `CompareHelper` to Keep LEB Reader Open #

Impact: 2-5× speedup on compare tests (saves ~4h)

Stop resetting _LEB_READER = None on every call. The reader should stay open for the duration of the test:

def leb(self, fn, *args, **kwargs):
    ephem.state._LEB_FILE = self.leb_path
    # DON'T set _LEB_READER = None — let it persist
    ephem.set_calc_mode("auto")
    try:
        return fn(*args, **kwargs)
    finally:
        ephem.set_calc_mode(None)

The reader cleanup should happen once in teardown(), not in every call.

P1: Consolidate Test Classes to Avoid Redundant Computation #

Impact: ~6× fewer Skyfield calls for compare tests (saves ~3-5h)

Instead of 6 separate test classes each iterating 500 dates:

# BEFORE: 6 tests × 500 dates × 2 calls = 6,000 swe_calc_ut per body
class TestLongitude: ...   # iterates 500 dates
class TestLatitude: ...    # iterates same 500 dates (redundant!)
class TestDistance: ...    # iterates same 500 dates (redundant!)
class TestSpeedLon: ...   # iterates same 500 dates (redundant!)
class TestSpeedLat: ...   # iterates same 500 dates (redundant!)
class TestSpeedDist: ...  # iterates same 500 dates (redundant!)

Consolidate into one test per body that checks all 6 components:

# AFTER: 1 test × 500 dates × 2 calls = 1,000 swe_calc_ut per body
class TestExtLunarPrecision:
    def test_all_components(self, compare, ext_dates_500, body_id, body_name):
        for jd in ext_dates_500:
            ref, _ = compare.skyfield(...)
            leb, _ = compare.leb(...)
            # Check all 6 components from same call
            assert lon_err < tol_lon
            assert lat_err < tol_lat
            assert dist_err < tol_dist
            assert speed_lon_err < tol_speed_lon
            assert speed_lat_err < tol_speed_lat
            assert speed_dist_err < tol_speed_dist

P2: Run Tests in Parallel #

Impact: 4-8× speedup with -n auto

Change poe test:full to use pytest-xdist:

"test:full" = "pytest -n auto"

This is already available (pytest-xdist>=3.5.0 is installed) and used by test:full:fast.

P3: Don’t `clear_caches()` in `reset_ephemeris_state` by Default #

Impact: 1.5-2× speedup globally

The clear_caches() call before every test destroys LRU cache locality. Options:

Remove it from the autouse fixture; only tests that actually change ephemeris files need it.
Or make the full reset a non-autouse fixture, applying it only to tests that modify global state.

P4: Skip Asteroid Tests Outside SPK Range #

Impact: eliminates 40 known failures + wasted computation

Add explicit skips or filter dates to 1920-2080 for asteroid bodies in extended tier tests. The current failures are expected and waste time computing meaningless results.

P5: Reduce Date Counts for Extended Tier #

Impact: proportional time reduction

500 dates per test for extended lunar tests is excessive for regression detection. 100-200 dates would catch the same precision issues with 2.5-5× fewer Skyfield calls.

5. Expected Outcome #

Fix	Time Saved	Cumulative
P0: Keep LEB reader open	~2-3h	~6h
P1: Consolidate test classes	~3-5h	~2-3h
P2: Parallel execution (`-n auto`)	4-8× on remaining	~20-40 min
P3: Keep caches across tests	~15-30% further	~15-30 min

Realistic target with P0+P1+P2: from 8h 49m → ~20-40 minutes.