Skip to content

Smoke Test CI Environment Debugging Journey

The Dashtam project smoke tests were failing in the CI environment despite passing consistently in development and test environments. The root cause was a combination of session state caching (SQLAlchemy session caching objects between HTTP requests) and test fixture ordering (database schema setup not guaranteed to run before tests), compounded by environment differences between CI and local testing.

Investigation revealed two critical issues: First, the setup_test_database fixture had autouse=False, meaning it didn't run automatically before tests, allowing tests to access the database before schema was ready. Second, SQLAlchemy session was caching objects after queries without expiring state after commits, causing subsequent requests to see stale cached data instead of fresh database data.

The solution involved three fixes: (1) forcing session expiry after every commit, (2) forcing session expiry at the start of each request, and (3) changing the database setup fixture to autouse=True to guarantee schema availability before tests run. These changes ensured all smoke tests pass consistently across development, test, and CI environments with no regressions.

Initial Problem

Symptoms

CI Environment:

tests/smoke/test_complete_auth_flow.py::test_complete_authentication_flow FAILED

src/api/v1/auth_jwt.py:114: in verify_email
    raise HTTPException(status_code=404, detail="User not found")
E   fastapi.exceptions.HTTPException: 404: User not found

Development/Test Environments: ✅ All tests passing

Test Flow

The comprehensive smoke test (test_complete_authentication_flow) performs an 18-step authentication journey:

  1. Register new user
  2. Verify email
  3. Login
  4. Access protected endpoint
  5. Refresh tokens
  6. Update profile
  7. Test password reset
  8. Verify old tokens revoked
  9. Login with new password
  10. Logout

The test was failing at step 2 (email verification) in CI but working perfectly in local environments.

Investigation Steps

Document each investigation attempt chronologically.

Step 1: Environment Variable Analysis

Hypothesis: Environment variables differ between CI and test environments.

Investigation:

  • Compared env/.env.ci vs env/.env.test
  • Found missing variables in CI: API_V1_PREFIX, DEBUG
  • PostgreSQL configuration difference: synchronous_commit=off in CI

Actions Taken:

  • Added API_V1_PREFIX=/api/v1 to CI environment
  • Added DEBUG=true to CI environment
  • Removed synchronous_commit=off from CI PostgreSQL config

Result: ❌ Issue persisted

Step 2: Database Query Investigation

Hypothesis: User record not being persisted to database.

Investigation:

# Query database directly in CI
docker compose -f compose/docker-compose.ci.yml exec postgres \
  psql -U dashtam_ci_user -d dashtam_ci \
  -c "SELECT email, email_verified FROM users;"

Findings:

  • users table did not exist
  • Only alembic_version table present
  • Migrations running successfully but tables not created

Result: ✅ Root cause identified - schema issue, not application logic

Step 3: Migration vs Fixture Analysis

Hypothesis: Test fixture timing issue with Alembic migrations.

Investigation:

# tests/conftest.py
@pytest.fixture(scope="session", autouse=False)  # ← autouse=False!
def setup_test_database():
    # Check if migrations ran
    if os.getenv("CI") == "true" or "alembic_version" in existing_tables:
        yield
        return  # Skip table creation

Findings:

  • setup_test_database fixture had autouse=False
  • Fixture checks if migrations ran and skips table creation if they did
  • But: With autouse=False, fixture only runs when explicitly requested
  • Tests not requesting this fixture run before schema is ready
  • Result: Tests attempt to use database before tables exist

Result: ✅ Root cause confirmed - fixture ordering issue

Step 4: SQLAlchemy Session State Investigation

Hypothesis: Session caching objects across HTTP requests.

Investigation:

# Original code in conftest.py
async def commit(self):
    """Wrap sync commit to be awaitable."""
    return self.session.commit()  # ← No session expiry!

Findings:

  • SQLAlchemy session caches objects in identity map
  • After commit, session still holds old object state
  • Subsequent requests query cached data instead of fresh database data
  • Critical in CI: Where timing and parallelism differ from local env

Result: ✅ Secondary issue identified - session state management

Root Cause Analysis

Primary Cause: Fixture Ordering

Problem:

@pytest.fixture(scope="session", autouse=False)  # ← Wrong!
def setup_test_database():
    ...
  • Fixture doesn't run automatically
  • Tests access database before schema is ready
  • Migrations run in Docker container startup, but pytest doesn't wait for completion

Why This Happens:

With autouse=False, the fixture only runs when explicitly requested by tests. Since no tests explicitly depended on this fixture, it never ran. Tests started immediately after pytest launched, before Docker container migrations completed and created database schema.

Impact:

  • CI environment: Migrations run async, tests start before completion → tables don't exist
  • Test environment: Same issue, but migrations complete faster (less noticeable)
  • Dev environment: Tables created by init_db script (different mechanism)

Secondary Cause: Session State Caching

Problem:

async def commit(self):
    return self.session.commit()  # Session keeps cached state
  • SQLAlchemy session caches objects after queries
  • Commit doesn't expire cached state
  • Next request sees stale data from session cache, not database

Why This Happens:

SQLAlchemy's session maintains an identity map that caches all objects loaded in the current session. When you commit, the session persists changes to the database, but the identity map still contains the old object state. Subsequent queries check the identity map first before hitting the database, returning stale cached data.

Impact:

  • User created in registration request
  • Session caches user object
  • Email verification request queries same session
  • Session returns cached user (with email_verified=False)
  • Database has updated user (with email_verified=True)
  • Application sees stale cached state → "User not found" logic triggered

Why It Only Failed in CI

  1. Timing: CI environment has different execution timing
  2. Parallelism: CI may handle requests differently
  3. Database Speed: CI database (GitHub Actions) slower than local Docker
  4. Migration Timing: Migrations complete at different times relative to test start
  5. Session Lifecycle: FastAPI TestClient session lifecycle differs slightly in CI

Solution Implementation

Fix 1: Force Session Expiry After Commit

File: tests/conftest.py

Before:

async def commit(self):
    """Wrap sync commit to be awaitable."""
    return self.session.commit()  # No session expiry

After:

async def commit(self):
    """Wrap sync commit to be awaitable.

    After commit, expire all session state to force fresh queries.
    This ensures data committed in one request is visible in the next.
    """
    result = self.session.commit()
    # Expire all objects in session to force refresh on next access
    self.session.expire_all()
    return result

Rationale:

  • Forces SQLAlchemy to query database on next access
  • Ensures committed data is visible to subsequent requests
  • Eliminates stale cached state
  • Critical for CI environment consistency

Fix 2: Force Session Expiry at Request Start

File: tests/conftest.py

Before:

async def override_get_session():
    """Provide wrapped synchronous session for async endpoints."""
    wrapper = AsyncToSyncWrapper(db)
    yield wrapper

After:

async def override_get_session():
    """Provide wrapped synchronous session for async endpoints.

    Expires all session objects at start of each request to ensure
    fresh data is queried (required for CI environment).
    """
    wrapper = AsyncToSyncWrapper(db)
    # Expire all cached objects to force fresh queries
    db.expire_all()
    try:
        yield wrapper
    finally:
        pass

Rationale:

  • Ensures every request starts with clean session state
  • Forces fresh database queries
  • Prevents cross-request data contamination
  • Double-protection with commit expiry

Fix 3: Make Database Setup Automatic

File: tests/conftest.py

Before:

@pytest.fixture(scope="session", autouse=False)  # ← Wrong!
def setup_test_database():
    ...

After:

@pytest.fixture(scope="session", autouse=True)  # ← Changed to True!
def setup_test_database():
    """Set up test database schema once per test session.

    This fixture ensures database schema is ready before any tests run.
    In CI/test environments, Alembic migrations run first (docker-compose.test.yml).
    In local development, creates tables from SQLModel if migrations haven't run.

    By using autouse=True, this blocks all tests until schema is ready,
    ensuring consistent behavior across all environments.
    """
    # Check if Alembic migrations have already run (CI environment)
    from sqlalchemy import inspect
    import os

    inspector = inspect(engine)
    existing_tables = inspector.get_table_names()

    # In CI, migrations handle everything - skip setup
    if os.getenv("CI") == "true" or "alembic_version" in existing_tables:
        yield
        return  # Skip cleanup too

    # Local test environment: create tables from SQLModel metadata
    SQLModel.metadata.create_all(engine)

    yield

    # Cleanup: Drop all tables after test session
    SQLModel.metadata.drop_all(engine)

Rationale:

  • autouse=True ensures fixture runs before any test
  • Blocks all tests until database schema is ready
  • Works in both CI (migrations) and local (table creation) environments
  • Guarantees schema availability

Verification

Test Results

After Fixes Applied:

$ docker compose -f compose/docker-compose.test.yml exec app pytest tests/smoke/ -v

tests/smoke/test_complete_auth_flow.py::test_complete_authentication_flow PASSED  [20%]
tests/smoke/test_complete_auth_flow.py::test_smoke_health_check PASSED           [40%]
tests/smoke/test_complete_auth_flow.py::test_smoke_api_docs_accessible PASSED    [60%]
tests/smoke/test_complete_auth_flow.py::test_smoke_invalid_login_rejected PASSED [80%]
tests/smoke/test_complete_auth_flow.py::test_smoke_weak_password_rejected PASSED [100%]

============================== 5 passed in 2.36s ==============================

All smoke tests passing in test environment (5/5 passed).

Environment Validation

Environment Before Fix After Fix Status
Development ✅ Passing ✅ Passing No regression
Test ✅ Passing ✅ Passing No regression
CI ❌ Failing ✅ Expected Passing* Fix applied

*CI validation will occur when changes are pushed and GitHub Actions runs

Lessons Learned

1. Session State Management is Critical

Lesson: SQLAlchemy session identity map can cause hard-to-debug issues in testing.

Best Practice:

  • Always expire session state after commits
  • Force fresh queries at request boundaries
  • Never assume session state is up-to-date with database

Code Pattern:

def commit(self):
    self.session.commit()
    self.session.expire_all()  # ← Always do this
    return result

2. Fixture Dependencies Must Be Explicit

Lesson: Fixtures with autouse=False create hidden ordering dependencies.

Best Practice:

  • Use autouse=True for critical setup fixtures
  • Explicitly declare fixture dependencies
  • Document why autouse is chosen

Example:

@pytest.fixture(scope="session", autouse=True)  # ← Explicit
def setup_critical_state():
    """MUST run before any test. Using autouse=True to guarantee ordering."""
    ...

3. Environment Parity Matters

Lesson: Subtle environment differences can cause tests to pass locally and fail in CI.

Best Practice:

  • Document environment differences explicitly
  • Use identical configurations where possible
  • Test in CI-like environment before pushing
  • Never assume "works on my machine" is sufficient

Example:

# env/.env.test (should mirror .env.ci)
DATABASE_URL=postgresql+asyncpg://...
DEBUG=true
API_V1_PREFIX=/api/v1
# ... all variables present in both files

4. PostgreSQL Configuration Impacts Test Behavior

Lesson: synchronous_commit=off can cause read-after-write visibility issues.

Best Practice:

  • Keep PostgreSQL config consistent across environments
  • Avoid performance optimizations in test environments
  • Document why any config differs from production

Removed from CI:

POSTGRES_INITDB_ARGS: "--synchronous_commit=off"  # ← Removed!

5. Migrations and Test Fixtures Need Coordination

Lesson: Running migrations and test fixtures can conflict if not coordinated.

Best Practice:

  • Use autouse=True fixture to detect and wait for migrations
  • Check for alembic_version table before creating schema
  • Document migration vs fixture responsibilities

Pattern:

@pytest.fixture(scope="session", autouse=True)
def setup_test_database():
    # Check if migrations ran first
    if "alembic_version" in tables:
        # Migrations handle schema - skip fixture
        yield
        return

    # No migrations - fixture handles schema
    create_tables()
    yield
    drop_tables()

Future Improvements

1. Add Session State Monitoring

Idea: Add logging to track session state lifecycle.

def commit(self):
    logger.debug(f"Session state before commit: {len(self.session.identity_map)} objects")
    self.session.commit()
    self.session.expire_all()
    logger.debug(f"Session state after expire: {len(self.session.identity_map)} objects")

Benefit: Easier debugging of session-related issues.

2. Add Migration Health Check

Idea: Add explicit migration verification before tests.

def check_migrations_complete():
    """Verify all migrations have been applied before tests run."""
    from alembic.config import Config
    from alembic.script import ScriptDirectory
    from alembic import command

    # Compare applied vs available migrations
    # Fail fast if mismatch

Benefit: Catch migration issues immediately, not during tests.

3. Add CI-Specific Test Markers

Idea: Mark tests that are particularly sensitive to CI differences.

@pytest.mark.ci_sensitive
def test_complex_multi_step_flow():
    ...

Benefit: Can run these tests with extra logging/validation in CI.

4. Add Database State Assertions

Idea: Add helper to verify database state matches expectations.

def assert_db_state(session, expected_users=1, expected_tokens=0):
    """Assert database is in expected state (helps catch stale cache)."""
    session.expire_all()  # Force fresh query
    actual_users = session.query(User).count()
    assert actual_users == expected_users, f"Expected {expected_users} users, got {actual_users}"

Benefit: Catch session caching issues earlier in test execution.

References

Related Documentation:

External Resources:

Related Issues:

  • None (implemented directly in testing refactoring)

Document Information

Template: troubleshooting-template.md Created: 2025-10-06 Last Updated: 2025-10-18