Files
LibreChat/api/server
Danny Avila c27d6b85a4 🤫 refactor: Silent MCP OAuth Refresh on Mid-Session 401 (#13369)
* 🤫 fix: Silent MCP OAuth Refresh on Mid-Session 401

Avoids the hourly interactive re-auth prompt when an MCP server
(e.g. Azure Entra ID) returns 401 mid-session by attempting a refresh
token exchange first, and only falling back to the interactive OAuth
flow when no refresh token is stored or the refresh server rejects it.

Resolves #13364.

* fix: Use distinct flow type for silent token refresh to avoid cache hit

Addresses the Codex review on PR #13369: `attemptSilentTokenRefresh` was
reusing the `'mcp_get_tokens'` flow type, so
`FlowStateManager.createFlowWithHandler` would short-circuit and return
the same tokens cached by an earlier `getOAuthTokens` call — the very
tokens the server just rejected — without executing the forced-refresh
handler.

Switch silent refresh to the distinct `'mcp_force_refresh_tokens'` flow
type so coalescing still works but stale `mcp_get_tokens` cache entries
are not reused. After a successful refresh, invalidate the
`mcp_get_tokens` flow cache so the next `getOAuthTokens` call reads the
freshly persisted tokens from storage rather than the stale cached
value.

Add a regression test that simulates the real
`FlowStateManager.createFlowWithHandler` cache-hit behavior for
`mcp_get_tokens` and verifies the silent refresh handler still runs and
returns the freshly refreshed tokens.

* fix: Address Codex round-2 review on silent MCP OAuth refresh

Three follow-up findings from Codex on PR #13369:

1. The new `mcp_force_refresh_tokens` flow type was itself cached by
   `FlowStateManager.createFlowWithHandler`, so a subsequent 401 within
   the refreshed token's `expires_at` could re-serve the just-rejected
   token without ever re-running the refresh handler.

2. The factory's `oauthRequired` listener was removed immediately after
   the initial `attemptToConnect` succeeded, so a real mid-session 401
   emitted by `MCPConnection.connectClient` during transport recovery
   had no listener — the OAuth handled-promise would simply time out
   instead of triggering the silent refresh.

3. Routing the silent refresh through a distinct flow type broke
   coalescing with the `mcp_get_tokens` lock used by `getOAuthTokens`,
   letting two paths concurrently redeem the same stored refresh token.
   For providers that rotate refresh tokens (e.g. Azure Entra) the
   second redemption is rejected, kicking the user back into interactive
   OAuth despite a successful refresh elsewhere.

Resolution:

- Drop `FlowStateManager` from the silent-refresh path entirely. Replace
  with a process-local `inflightSilentRefreshes` Map keyed by
  `userId:serverName` that holds only the in-flight Promise (no cached
  result), so every fresh 401 after settlement triggers a fresh
  redemption while concurrent 401s for the same user/server still share
  one redemption.
- Stop calling `cleanupOAuthHandlers()` on successful initial connect,
  keeping the OAuth handler attached for the connection's lifetime so
  mid-session 401s actually reach `attemptSilentTokenRefresh`.
- Add a regression test reproducing the stale-cache scenario by faking
  the `mcp_get_tokens` cache hit and asserting silent refresh still runs
  against storage and returns the fresh tokens.
- Add a coalescing test asserting two concurrent oauthRequired events
  for the same user/server result in a single `forceRefreshTokens` call.
- Clear `inflightSilentRefreshes` in `beforeEach` to prevent
  cross-test leakage; switch the silent-refresh test mocks to
  `mockResolvedValueOnce` / `mockImplementationOnce` so leftover mock
  state cannot leak into later test cases.

Acknowledged remaining gap: the silent refresh still races
`getOAuthTokens`'s `mcp_get_tokens` flow when both run concurrently
(narrow window when an existing connection's local `expires_at` is
still valid but the server invalidated the token, and a new connection
is being created in parallel). The race is self-healing on the next
401 and documented inline.

* fix: Address Codex round-3 review on silent MCP OAuth refresh

Three more findings from Codex on PR #13369:

1. The in-flight silent-refresh promise was unbounded. If
   `forceRefreshTokens()` ever hung (slow provider, dropped TCP), the
   `inflightSilentRefreshes` lock stayed occupied forever and every
   later 401 for the same user/server joined the stuck promise instead
   of starting a fresh attempt or falling back to interactive OAuth.

2. The interactive-OAuth fallback didn't invalidate the
   `mcp_get_tokens` flow cache after persisting fresh tokens. For
   providers that don't issue refresh tokens (so silent refresh
   returns null), the old cache could still feed stale access tokens
   to the next `getOAuthTokens` call until its TTL expired — causing
   an immediate reconnect with the same just-rejected token.

3. When silent refresh failed, the handler fell through to
   `handleOAuthRequired()` whose recent-completion fast path can
   reuse a COMPLETED `mcp_oauth` flow within `PENDING_STALE_MS`. Those
   cached tokens are exactly the ones the server just rejected, so
   the connection would keep adopting them and looping on 401s until
   the cache aged out.

Resolution:

- Wrap `runSilentRefresh()` with a 60-second `withTimeout` (well under
  `connectClient`'s 120s OAuth timeout). On timeout the `.catch`
  resolves to null and the `finally` clears the in-flight entry, so
  the next 401 starts fresh and falls through to interactive OAuth.
- Extract two helpers — `invalidateGetTokensFlow` and
  `invalidateCompletedOAuthFlow` — and call them from the right
  branches: clear `mcp_get_tokens` after silent-refresh success AND
  after interactive-OAuth `storeTokens`; clear the COMPLETED
  `mcp_oauth` state (plus its CSRF mapping) before falling through to
  interactive OAuth so the fast-reuse path can't re-serve the
  rejected tokens.
- Add three regression tests: hung refresh release-the-lock under
  fake timers, completed-OAuth cache invalidation pre-fallback, and
  `mcp_get_tokens` invalidation after interactive token store.

* fix: Address Codex round-4 review on silent MCP OAuth refresh

Three more findings from Codex on PR #13369:

1. (P1) The silent-refresh in-flight lock keyed only by
   `userId:serverName`. In multi-tenant setups where two tenants share a
   userId (e.g. username-based IDs) and the same MCP server name, a
   concurrent mid-session 401 from tenant B would join tenant A's
   in-flight refresh and adopt tenant A's freshly minted tokens onto a
   tenant-B connection — a cross-tenant credential leak.

2. (P2) `invalidateGetTokensFlow` deleted the `mcp_get_tokens` flow
   state regardless of its status. When another connection was
   currently in `getOAuthTokens()` (PENDING flow) and joiners were
   monitoring it, the unconditional delete made those waiters see
   "Flow state not found" and unnecessarily fall back to interactive
   OAuth — even though fresh tokens were already being written.

3. (P2) The 60s `withTimeout` wrapping `runSilentRefresh()` only races
   the promise; it does not cancel the underlying `forceRefreshTokens`
   /  refresh-token HTTP request. If the request returned after a
   subsequent interactive OAuth had stored newer tokens, the late
   completion would `storeTokens` over the newer state. This requires
   a provider that doesn't rotate refresh tokens AND a refresh slower
   than 60s AND a successful interactive OAuth in that window — narrow
   but real.

Resolution:

- Capture `getTenantId()` into a new `factory.tenantId` field at
  factory construction time (before the OAuth handler closes over it
  outside the original request's async context) and include it in the
  silent-refresh lock key as `tenantId:userId:serverName`.
- `invalidateGetTokensFlow` now calls `getFlowState` first and only
  deletes when `status === 'COMPLETED'`. PENDING lookups are left
  alone so concurrent `getOAuthTokens` waiters via `monitorFlow` can
  still settle.
- For (3), document the race as a known limitation inline. Fully
  closing it requires threading an `AbortSignal` through
  `MCPTokenStorage.forceRefreshTokens` and the OAuth refresh handler
  to skip the late `storeTokens` after timeout — out of scope for this
  PR's surgical change.
- Add `getTenantId` to the `MCPOAuthConnectionEvents` test's
  `@librechat/data-schemas` mock so the factory constructor doesn't
  blow up under that suite.
- Add three regression tests: per-tenant lock isolation, PENDING-state
  preservation under `invalidateGetTokensFlow`, and (reused) the
  existing interactive-store invalidation test now driven through
  `getFlowState` returning the COMPLETED state.

* fix: Address silent MCP OAuth refresh review

Restore captured tenant context around token storage and OAuth fallback paths so mid-session callbacks do not lose tenant scope.

Thread AbortSignal through forced refresh and OAuth token requests, cap silent refresh by the connection OAuth timeout, and prevent timed-out refreshes from writing stale credentials after fallback.

Complete pending mcp_get_tokens flows with fresh tokens, add missing FlowState createdAt test fixtures, and cover the new tenant/abort/cache behaviors.

* fix: Tighten tenant-scoped MCP token refresh

Cap silent refresh by both the factory connect timeout and the connection OAuth wait timeout so fallback OAuth wins before the outer connect attempt expires.

Tenant-scope mcp_get_tokens flow ids for both token lookup and refresh invalidation, preventing cross-tenant flow completion or cache deletion when tenants share user ids and server names.

Add regression tests for the omitted initTimeout budget and tenant-prefixed token flow locks.

* fix: Reserve MCP OAuth fallback budget

* fix: Harden MCP OAuth refresh races

* fix: Keep MCP OAuth fallback route-compatible

* test: Add SDK MCP OAuth refresh repro

* fix: Address MCP OAuth refresh review findings

* fix: Address MCP OAuth tenant review findings

* fix: Close MCP OAuth route tenant gaps

* fix: Preserve MCP OAuth refresh flow guards

* fix: Avoid reprocessing MCP OAuth reauth config

* fix: Release timed-out MCP refresh locks

* fix: Release MCP OAuth request callbacks

* fix: Tenant-scope remaining MCP OAuth flow lookups

* ci: Sort imports in MCP OAuth test suites
2026-06-10 13:12:42 -04:00
..