Troubleshooting
AI agent connection refused? Here's why (and how to fix it)
You're shipping an agent feature, you hit run, and the very first request fails with `connect ECONNREFUSED`. If you're debugging an AI agent connection refused error, you're in one of the most common startup failure paths in cloud agent infrastructure.
The painful part is that ECONNREFUSED looks generic. It does not tell you whether the machine is sleeping, the port is closed, the service crashed, or your route points to the wrong address. Teams end up adding random sleeps and retries instead of fixing the real cause.
This guide breaks the issue into root causes and gives production-safe fixes you can actually implement: auto-wake logic, health checks, port discovery, and bounded retries. If you want lifecycle context first, review how in10nt instances start and stop and agent startup sequencing.
What does "connection refused" mean for AI agents?
At the TCP level, ECONNREFUSED means your request reached a network endpoint, but no process accepted the connection on that port. The network path existed; the listening service did not.
Typical Node runtime error:
```
Error: connect ECONNREFUSED 172.19.8.44:8080
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:...)
```

Common variants you will see in logs include "AI agent ECONNREFUSED", "AI agent connection error", and "agent instance connection refused". This differs from timeout errors (`ETIMEDOUT`), where no response arrives before the deadline, and from DNS failures (`ENOTFOUND`), where host resolution fails before any connection is attempted.
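Since each of these error codes calls for a different fix, it helps to branch on `err.code` at the catch site instead of treating all connection failures alike. A minimal sketch (the `classifyConnectError` name is illustrative, not a library API):

```typescript
// Map Node's error codes to distinct failure modes so logs point at the real cause.
type Diagnosis = 'not-listening' | 'timeout' | 'dns' | 'unknown'

function classifyConnectError(err: unknown): Diagnosis {
  const code = (err as { code?: string })?.code
  switch (code) {
    case 'ECONNREFUSED': return 'not-listening' // endpoint reached, no process on port
    case 'ETIMEDOUT':    return 'timeout'       // no response before the deadline
    case 'ENOTFOUND':    return 'dns'           // host resolution failed
    default:             return 'unknown'
  }
}
```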
Authoritative references: Node.js common system errors and Fly Machines docs.
Root causes of AI agent connection refused
1) Machine not started yet
Most agent workloads run on demand. If the machine is stopped to save cost, the first request can land before startup completes. The client tries to connect to a port that is not listening yet and gets ECONNREFUSED.
- First request after idle period fails, second or third succeeds
- Log stream shows machine state transition from stopped -> starting -> running
- Failures cluster around cold starts
If this is your pattern, compare your startup timings against cold start guidance for agents.
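One way to confirm the pattern is to tag each failure with the idle gap that preceded it; refusals with a large idle gap point at cold starts. A minimal sketch (names are illustrative):

```typescript
// Track time since the last successful request so refusals can be
// correlated with idle periods (large idleMs + refusal => likely cold start).
let lastSuccessAt = Date.now()

function recordOutcome(ok: boolean): { ok: boolean; idleMs: number } {
  const idleMs = Date.now() - lastSuccessAt
  if (ok) lastSuccessAt = Date.now()
  return { ok, idleMs } // emit this alongside the error code in your logs
}
```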
2) Port not exposed or not reachable
Your app may be healthy but inaccessible because the port is not in your exposed port config, or the reverse proxy is not routing to that port. This is common when teams assume defaults across different frameworks.
- Instance created without expected `openPorts`
- Service bound to `localhost` only instead of `0.0.0.0`
- Proxy route targets wrong internal port
Use port proxying patterns in in10nt to keep ports explicit and testable.
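A quick way to rule out the `localhost`-only binding is to make the listen host explicit. A minimal Node sketch (the port and handler are placeholders):

```typescript
// Bind the service to all interfaces so the proxy can reach it.
// Binding to 127.0.0.1 makes the port invisible from outside the machine,
// producing ECONNREFUSED at the proxy even though the process is healthy.
import http from 'node:http'

const PORT = Number(process.env.PORT ?? 8080)

const server = http.createServer((_req, res) => {
  res.writeHead(200, { 'content-type': 'application/json' })
  res.end(JSON.stringify({ ok: true }))
})

// '0.0.0.0' accepts connections on every interface; 'localhost' would not.
server.listen(PORT, '0.0.0.0', () => {
  console.log(`listening on 0.0.0.0:${PORT}`)
})
```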
3) Wrong internal address or stale machine target
Another root cause is routing drift. You may connect to a stale machine ID, a public host from a private path, or a cached address that changed after restart. In those cases the request can land on an endpoint where your service is not active.
Validate instance ID, current machine metadata, and routing source of truth before connecting. Fly networking behavior and private addressing are documented in Fly private networking.
4) Service is not listening yet (or crashed)
Machine state "running" is not equal to service readiness. The process can still be installing dependencies, compiling, loading configuration, or failing startup entirely.
```
ss -lntp | grep 8080
# empty output means nothing is listening on port 8080
```

Correlate run logs with process state using real-time log streaming, and keep readiness checks separate from liveness checks.
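To keep those two checks separate, serve them as distinct endpoints: liveness says the process exists, readiness says its dependencies are loaded. A sketch with a simulated warm-up step (the endpoint paths and `warmUp` are illustrative):

```typescript
import http from 'node:http'

let depsReady = false

// Simulated dependency warm-up (e.g. config load, model download).
async function warmUp(): Promise<void> {
  await new Promise((r) => setTimeout(r, 50))
  depsReady = true
}

const server = http.createServer((req, res) => {
  if (req.url === '/livez') {
    res.writeHead(200); res.end('alive')          // process is up
  } else if (req.url === '/readyz') {
    res.writeHead(depsReady ? 200 : 503)          // 503 until deps are validated
    res.end(depsReady ? 'ready' : 'warming up')
  } else {
    res.writeHead(404); res.end()
  }
})
```

Routing traffic on `/readyz` instead of `/livez` is what turns "machine is running" into "service can actually serve".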
Solutions that work in production
Auto-wake machines before first connect
```typescript
// getState, startInstance, and waitForState are your platform's lifecycle helpers
async function ensureRunning(instanceId: string) {
  const state = await getState(instanceId)
  if (state !== 'running') {
    await startInstance(instanceId)
    await waitForState(instanceId, 'running', 20000)
  }
}
```

Call this before network requests. It removes a large class of transient AI agent connection refused failures caused by sleeping infrastructure.
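`ensureRunning` assumes a `waitForState` helper from your platform SDK. As a sketch of that missing piece, a polling loop might look like this (the injected `getState` parameter is an assumption for testability, not an in10nt API):

```typescript
// Poll getState until the target state appears or the deadline passes.
async function waitForState(
  instanceId: string,
  target: string,
  timeoutMs: number,
  getState: (id: string) => Promise<string>, // injected so the loop is testable
  pollMs = 500,
): Promise<void> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    if ((await getState(instanceId)) === target) return
    await new Promise((r) => setTimeout(r, pollMs))
  }
  throw new Error(`instance ${instanceId} did not reach '${target}' in ${timeoutMs}ms`)
}
```

The explicit deadline matters: an unbounded loop would hide a machine that never starts, which is exactly the misconfiguration retries should surface.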
Use health checks, not blind sleeps
```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms))

async function waitForHealth(url: string) {
  const delays = [200, 400, 800, 1200, 2000]
  for (const d of delays) {
    try {
      const r = await fetch(url, { signal: AbortSignal.timeout(1500) })
      if (r.ok) return
    } catch {}
    await sleep(d)
  }
  throw new Error('service never became healthy')
}
```

A health endpoint should validate critical dependencies, not only process existence. See health check design for agents.
Discover active ports dynamically
```typescript
// getInstance, getPortFromConfig, and probe are illustrative helpers
const meta = await getInstance(instanceId)
const port = meta.openPorts?.[0] ?? await getPortFromConfig(instanceId)
if (!port) throw new Error('no reachable port')
await probe(`https://api.in10nt.dev/instances/${instanceId}/ports/${port}/health`)
```

Avoid hard-coded ports in multiple services. Keep a single source of truth with runtime verification. This is especially important for shared environments and template-based agents.
Implement bounded retry with exponential backoff
```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms))

async function withRetry<T>(op: () => Promise<T>) {
  const delays = [250, 500, 1000, 2000, 3000]
  let last: unknown
  for (const delay of delays) {
    try {
      return await op()
    } catch (err) {
      last = err
      await sleep(delay)
    }
  }
  throw last
}
```

Retries are for transient startup windows, not permanent misconfiguration. If all attempts fail, surface diagnostics immediately.
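A fixed delay schedule works, but when many clients hit the same waking machine, synchronized retries can spike load at exactly the wrong moment. A sketch of the same bounded-retry idea with full jitter (the function name and parameters are illustrative, not an in10nt API):

```typescript
// Bounded exponential backoff with full jitter: each delay is drawn
// uniformly from [0, cap), which spreads out simultaneous retriers.
async function withJitteredRetry<T>(
  op: () => Promise<T>,
  baseMs = 250,
  attempts = 5,
): Promise<T> {
  let last: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await op()
    } catch (err) {
      last = err
      const cap = baseMs * 2 ** i             // exponential ceiling per attempt
      const delay = Math.random() * cap       // full jitter within the ceiling
      await new Promise((r) => setTimeout(r, delay))
    }
  }
  throw last
}
```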
How in10nt handles this for you
in10nt provides automatic machine wake-up, built-in readiness checks, port proxying, and retry-friendly request handling. Instead of custom orchestration in every service, you can rely on managed instance lifecycle APIs and focus on task logic.
- `POST /instances` to create an isolated runtime
- `POST /instances/:id/run` to execute agent work
- `GET /instances/:id/stream/logs` for startup diagnostics
- `GET /instances/:id/filesystem` for workspace inspection
Before and after implementation
```typescript
// before: direct connect, fails on cold start
await fetch(`http://${host}:8080/run`, { method: 'POST' })

// after: in10nt instance flow
const instance = await api.createInstance({ openPorts: [8080] })
await api.run(instance.id, { task: 'execute workflow safely' })
```

Best practices checklist
- Check machine state before first connection
- Use readiness endpoints and poll until healthy
- Verify active ports from instance metadata
- Use bounded retries with explicit timeouts
- Track refusal rates and cold-start latency over time
- Keep internal routing metadata fresh and centralized
Related reading: connection retry strategy for AI agents, instance routing strategy, and operational runbooks for agent reliability.
Conclusion
AI agent connection refused errors are usually predictable lifecycle or configuration issues. Once you enforce wake-up checks, health probing, dynamic port discovery, and bounded retries, ECONNREFUSED becomes rare and easier to diagnose. in10nt automates most of this path so teams can ship agent features without writing platform glue code first.