Troubleshooting
AI agent connection refused? Here's why (and how to fix it)
You're shipping an agent feature, you hit run, and the very first request fails with `connect ECONNREFUSED`. If you're debugging an AI agent connection refused error, you're in one of the most common startup failure paths in cloud agent infrastructure.
The painful part is that ECONNREFUSED looks generic. It does not tell you whether the machine is sleeping, the port is closed, the service crashed, or your route points to the wrong address. Teams end up adding random sleeps and retries instead of fixing the real cause.
This guide breaks the issue into root causes and gives production-safe fixes you can actually implement: auto-wake logic, health checks, port discovery, and bounded retries. If you want lifecycle context first, review how in10nt instances start and stop and agent startup sequencing.
What does "connection refused" mean for AI agents?
At the TCP level, ECONNREFUSED means your request reached a network endpoint, but no process accepted the connection on that port. The network path existed; the listening service did not.
Typical Node runtime error:
```
Error: connect ECONNREFUSED 172.19.8.44:8080
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:...)
```

Common variants you will see in logs include "AI agent ECONNREFUSED", "AI agent connection error", and "agent instance connection refused". This differs from timeout errors (`ETIMEDOUT`), where no response arrives before the deadline, and from DNS failures (`ENOTFOUND`), where host resolution fails before any connection is attempted.
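Since each of these error codes calls for a different fix, it helps to branch on `err.code` at the catch site instead of treating all connection failures alike. A minimal sketch (the `classifyConnectError` name is illustrative, not a library API):

```typescript
// Map Node's error codes to distinct failure modes so logs point at the real cause.
type Diagnosis = 'not-listening' | 'timeout' | 'dns' | 'unknown'

function classifyConnectError(err: unknown): Diagnosis {
  const code = (err as { code?: string })?.code
  switch (code) {
    case 'ECONNREFUSED': return 'not-listening' // endpoint reached, no process on port
    case 'ETIMEDOUT':    return 'timeout'       // no response before the deadline
    case 'ENOTFOUND':    return 'dns'           // host resolution failed
    default:             return 'unknown'
  }
}
```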
Authoritative references: Node.js common system errors and Fly Machines docs.
Root causes of AI agent connection refused
1) Machine not started yet
Most agent workloads run on demand. If the machine is stopped to save cost, the first request can land before startup completes. The client tries to connect to a port that is not listening yet and gets ECONNREFUSED.
- First request after idle period fails, second or third succeeds
- Log stream shows machine state transition from stopped -> starting -> running
- Failures cluster around cold starts
If this is your pattern, compare your startup timings against cold start guidance for agents.
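One way to confirm the pattern is to tag each failure with the idle gap that preceded it; refusals with a large idle gap point at cold starts. A minimal sketch (names are illustrative):

```typescript
// Track time since the last successful request so refusals can be
// correlated with idle periods (large idleMs + refusal => likely cold start).
let lastSuccessAt = Date.now()

function recordOutcome(ok: boolean): { ok: boolean; idleMs: number } {
  const idleMs = Date.now() - lastSuccessAt
  if (ok) lastSuccessAt = Date.now()
  return { ok, idleMs } // emit this alongside the error code in your logs
}
```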
2) Port not exposed or not reachable
Your app may be healthy but inaccessible because the port is not in your exposed port config, or the reverse proxy is not routing to that port. This is common when teams assume defaults across different frameworks.
- Instance created without expected `openPorts`
- Service bound to `localhost` only instead of `0.0.0.0`
- Proxy route targets wrong internal port
Use port proxying patterns in in10nt to keep ports explicit and testable.
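A quick way to rule out the `localhost`-only binding is to make the listen host explicit. A minimal Node sketch (the port and handler are placeholders):

```typescript
// Bind the service to all interfaces so the proxy can reach it.
// Binding to 127.0.0.1 makes the port invisible from outside the machine,
// producing ECONNREFUSED at the proxy even though the process is healthy.
import http from 'node:http'

const PORT = Number(process.env.PORT ?? 8080)

const server = http.createServer((_req, res) => {
  res.writeHead(200, { 'content-type': 'application/json' })
  res.end(JSON.stringify({ ok: true }))
})

// '0.0.0.0' accepts connections on every interface; 'localhost' would not.
server.listen(PORT, '0.0.0.0', () => {
  console.log(`listening on 0.0.0.0:${PORT}`)
})
```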
3) Wrong internal address or stale machine target
Another root cause is routing drift. You may connect to a stale machine ID, a public host from a private path, or a cached address that changed after restart. In those cases the request can land on an endpoint where your service is not active.
Validate instance ID, current machine metadata, and routing source of truth before connecting. Fly networking behavior and private addressing are documented in Fly private networking.
4) Service is not listening yet (or crashed)
Machine state "running" is not equal to service readiness. The process can still be installing dependencies, compiling, loading configuration, or failing startup entirely.
```
ss -lntp | grep 8080
# empty output means nothing is listening on port 8080
```

Correlate run logs with process state using real-time log streaming, and keep readiness checks separate from liveness checks.
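To keep those two checks separate, serve them as distinct endpoints: liveness says the process exists, readiness says its dependencies are loaded. A sketch with a simulated warm-up step (the endpoint paths and `warmUp` are illustrative):

```typescript
import http from 'node:http'

let depsReady = false

// Simulated dependency warm-up (e.g. config load, model download).
async function warmUp(): Promise<void> {
  await new Promise((r) => setTimeout(r, 50))
  depsReady = true
}

const server = http.createServer((req, res) => {
  if (req.url === '/livez') {
    res.writeHead(200); res.end('alive')          // process is up
  } else if (req.url === '/readyz') {
    res.writeHead(depsReady ? 200 : 503)          // 503 until deps are validated
    res.end(depsReady ? 'ready' : 'warming up')
  } else {
    res.writeHead(404); res.end()
  }
})
```

Routing traffic on `/readyz` instead of `/livez` is what turns "machine is running" into "service can actually serve".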
Solutions that work in production
Auto-wake machines before first connect
```typescript
// getState, startInstance, and waitForState are your platform's lifecycle helpers
async function ensureRunning(instanceId: string) {
  const state = await getState(instanceId)
  if (state !== 'running') {
    await startInstance(instanceId)
    await waitForState(instanceId, 'running', 20000)
  }
}
```

Call this before network requests. It removes a large class of transient AI agent connection refused failures caused by sleeping infrastructure.
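`ensureRunning` assumes a `waitForState` helper from your platform SDK. As a sketch of that missing piece, a polling loop might look like this (the injected `getState` parameter is an assumption for testability, not an in10nt API):

```typescript
// Poll getState until the target state appears or the deadline passes.
async function waitForState(
  instanceId: string,
  target: string,
  timeoutMs: number,
  getState: (id: string) => Promise<string>, // injected so the loop is testable
  pollMs = 500,
): Promise<void> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    if ((await getState(instanceId)) === target) return
    await new Promise((r) => setTimeout(r, pollMs))
  }
  throw new Error(`instance ${instanceId} did not reach '${target}' in ${timeoutMs}ms`)
}
```

The explicit deadline matters: an unbounded loop would hide a machine that never starts, which is exactly the misconfiguration retries should surface.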
Use health checks, not blind sleeps
```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms))

async function waitForHealth(url: string) {
  const delays = [200, 400, 800, 1200, 2000]
  for (const d of delays) {
    try {
      const r = await fetch(url, { signal: AbortSignal.timeout(1500) })
      if (r.ok) return
    } catch {}
    await sleep(d)
  }
  throw new Error('service never became healthy')
}
```

A health endpoint should validate critical dependencies, not only process existence. See health check design for agents.
Discover active ports dynamically
```typescript
// getInstance, getPortFromConfig, and probe are illustrative helpers
const meta = await getInstance(instanceId)
const port = meta.openPorts?.[0] ?? await getPortFromConfig(instanceId)
if (!port) throw new Error('no reachable port')
await probe(`https://api.in10nt.dev/instances/${instanceId}/ports/${port}/health`)
```

Avoid hard-coded ports in multiple services. Keep a single source of truth with runtime verification. This is especially important for shared environments and template-based agents.
Implement bounded retry with exponential backoff
```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms))

async function withRetry<T>(op: () => Promise<T>) {
  const delays = [250, 500, 1000, 2000, 3000]
  let last: unknown
  for (const delay of delays) {
    try {
      return await op()
    } catch (err) {
      last = err
      await sleep(delay)
    }
  }
  throw last
}
```

Retries are for transient startup windows, not permanent misconfiguration. If all attempts fail, surface diagnostics immediately.
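A fixed delay schedule works, but when many clients hit the same waking machine, synchronized retries can spike load at exactly the wrong moment. A sketch of the same bounded-retry idea with full jitter (the function name and parameters are illustrative, not an in10nt API):

```typescript
// Bounded exponential backoff with full jitter: each delay is drawn
// uniformly from [0, cap), which spreads out simultaneous retriers.
async function withJitteredRetry<T>(
  op: () => Promise<T>,
  baseMs = 250,
  attempts = 5,
): Promise<T> {
  let last: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await op()
    } catch (err) {
      last = err
      const cap = baseMs * 2 ** i             // exponential ceiling per attempt
      const delay = Math.random() * cap       // full jitter within the ceiling
      await new Promise((r) => setTimeout(r, delay))
    }
  }
  throw last
}
```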
How in10nt handles this for you
in10nt provides automatic machine wake-up, built-in readiness checks, port proxying, and retry-friendly request handling. Instead of custom orchestration in every service, you can rely on managed instance lifecycle APIs and focus on task logic.
- `POST /instances` to create an isolated runtime
- `POST /instances/:id/run` to execute agent work
- `GET /instances/:id/stream/logs` for startup diagnostics
- `GET /instances/:id/filesystem` for workspace inspection
Before and after implementation
```typescript
// before: direct connect, fails on cold start
await fetch(`http://${host}:8080/run`, { method: 'POST' })

// after: in10nt instance flow
const instance = await api.createInstance({ openPorts: [8080] })
await api.run(instance.id, { task: 'execute workflow safely' })
```

Best practices checklist
- Check machine state before first connection
- Use readiness endpoints and poll until healthy
- Verify active ports from instance metadata
- Use bounded retries with explicit timeouts
- Track refusal rates and cold-start latency over time
- Keep internal routing metadata fresh and centralized
Related reading: connection retry strategy for AI agents, instance routing strategy, and operational runbooks for agent reliability.
Conclusion
AI agent connection refused errors are usually predictable lifecycle or configuration issues. Once you enforce wake-up checks, health probing, dynamic port discovery, and bounded retries, ECONNREFUSED becomes rare and easier to diagnose. in10nt automates most of this path so teams can ship agent features without writing platform glue code first.