Acme Robotics

Architecture Decision Records

4 ADRs that show how we think. Every significant decision documented before implementation, with trade-offs clearly stated, alternatives rejected with reasons, and consequences honestly accepted.

Selected from 24 ADRs in the codebase
- ADR-0003: Fleet Coordination
- ADR-0008: Obstacle Detection
- ADR-0015: Battery Management
- ADR-0021: Incident Response

Why ADRs Matter

Architecture Decision Records are how this team builds without losing its mind. Warehouse automation involves dozens of interdependent systems operating in real-time, often in degraded conditions. When a robot stops, you need to know not just what it's doing but why it was designed to do that - what alternatives were ruled out, and what failure modes were accepted.

24 ADRs across the fleet management stack. These aren't form-filling exercises - they're working documents that junior engineers read before touching any subsystem, and that post-incident reviews reference when something goes wrong. Decisions don't live in Slack threads or the heads of senior engineers. They live here.

What you're reading below are 4 curated selections - chosen to show range: communications protocol (0003), real-time perception (0008), predictive operations (0015), and a safety decision triggered by a near-miss incident (0021). The other 20 cover everything from fleet task scheduling to database schema versioning to OTA update rollout strategies.


ADR-0003 Accepted March 2024

Fleet Coordination Protocol: MQTT over WebSocket or Polling

Every robot in the warehouse needs to receive task assignments, report position and status, and acknowledge commands - all in near-real-time, across an environment where WiFi coverage is uneven and connections drop frequently. The choice of coordination protocol shapes everything else: how the fleet controller routes tasks, how quickly robots adapt to changing conditions, and how gracefully the system degrades when the network misbehaves.

The Problem

Warehouse WiFi is not office WiFi. Racking systems create RF shadows, forklifts cause intermittent interference, and robots moving at speed may lose and regain signal every few seconds. A protocol designed for stable connections will fail constantly in this environment. We needed something built for exactly this kind of unreliable transport.

At peak operation, 40 robots each publish position updates every 500ms - that's 80 messages per second just for telemetry, before task assignments and acknowledgements. The protocol also needed to handle a supervisor dashboard consuming live fleet status without polling every robot individually.

The Decision

MQTT with QoS level 1 (at-least-once delivery), using an on-premises broker. Each robot maintains a persistent MQTT session. The fleet controller and dashboard subscribe to topic hierarchies rather than querying individual robots. Broker runs on dedicated hardware in the server room, isolated from the warehouse WiFi segment.

Robot Fleet (40 units)
    │  MQTT publish warehouse/robot/{id}/telemetry
    │  MQTT publish warehouse/robot/{id}/status
    ▼
MQTT Broker (on-prem, server room)  ←── retained messages on reconnect
    ├──► Fleet Controller (subscribed to warehouse/#)
    │        └── publishes warehouse/robot/{id}/commands
    └──► Supervisor Dashboard (subscribed to warehouse/+/telemetry)

WiFi drop ──► robot reconnects ──► broker delivers queued commands
          ──► robot resumes without fleet controller restart
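The topic hierarchy above is concrete enough to pin down in code. Here is a minimal sketch of per-robot topic and payload construction; the function and field names are illustrative, not taken from the real fleet codebase, and the actual publish would go through an MQTT client library (e.g. paho-mqtt) with qos=1:

```python
import json
import time

# Topic hierarchy from the ADR. Names below are illustrative assumptions.

def telemetry_topic(robot_id: str) -> str:
    return f"warehouse/robot/{robot_id}/telemetry"

def command_topic(robot_id: str) -> str:
    return f"warehouse/robot/{robot_id}/commands"

def telemetry_payload(robot_id: str, x: float, y: float, battery_soc: float) -> str:
    # Published every 500ms per robot; 40 robots -> 80 msg/s fleet-wide.
    return json.dumps({
        "robot_id": robot_id,
        "x": x,
        "y": y,
        "battery_soc": battery_soc,
        "ts": time.time(),
    })

# With a real client the publish would look roughly like:
#   client.publish(telemetry_topic("r07"), payload, qos=1)
# QoS 1 is at-least-once: consumers must tolerate duplicates.
```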

Alternatives Considered

| Option | Why Rejected |
| --- | --- |
| REST polling (robot pulls tasks every N seconds) | High latency for task dispatch. Polling interval is a tradeoff between responsiveness and server load. Doesn't propagate urgent commands (e-stop, reroute) fast enough. Scales poorly: 40 robots polling every second is 40 req/s at idle. |
| WebSocket (persistent bidirectional TCP) | WebSocket connections don't have built-in session resumption. On WiFi drop, the connection is lost and state must be re-established from scratch. Requires application-layer reconnect logic and message queuing on both ends. Also requires managing 40 persistent server-side connections. |
| WebRTC data channels | Designed for peer-to-peer media, not server-mediated fleet coordination. No broker model. Overkill for structured telemetry. |
| Cloud MQTT (AWS IoT Core, etc.) | Adds an internet dependency to an on-premises operation. An outage or connectivity issue at the facility would take down the entire fleet. Latency over the internet is unacceptable for real-time task dispatch. |

Consequences

| Type | Detail |
| --- | --- |
| Positive | Robots reconnect seamlessly after WiFi drops. Fleet controller receives all telemetry without polling. Supervisor dashboard is always current. Broker decouples robots from the fleet controller: either can restart independently. QoS 1 guarantees commands are delivered even through brief outages. |
| Negative | MQTT broker is a new infrastructure dependency to operate and monitor. On-prem hardware requires maintenance. Topic hierarchy design must be done carefully; a poorly structured hierarchy becomes hard to query. QoS 1 means occasional duplicate message delivery that consumers must handle idempotently. |
| Mitigations | Broker runs in an HA pair with automatic failover. Fleet controller and robot clients are idempotent by design. Topic structure documented and version-controlled as part of this ADR. |
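The QoS 1 duplicate-delivery obligation noted above is typically met with a bounded seen-set on the consumer side. A minimal sketch, assuming each command carries a unique msg_id (an assumption; the ADR doesn't specify the dedup scheme):

```python
from collections import OrderedDict

class IdempotentHandler:
    """Drops QoS 1 re-deliveries by message id.

    Hypothetical sketch: assumes every command message carries a
    unique `msg_id`. Memory is bounded by evicting the oldest ids.
    """
    def __init__(self, window: int = 1024):
        self.window = window
        self._seen = OrderedDict()
        self.applied = []

    def handle(self, msg_id: str, command: str) -> bool:
        if msg_id in self._seen:
            return False                      # duplicate delivery: ignore
        self._seen[msg_id] = True
        if len(self._seen) > self.window:
            self._seen.popitem(last=False)    # evict oldest, bound memory
        self.applied.append(command)          # apply the command once
        return True
```

A robot that receives the same dock-assignment twice after a reconnect applies it once and acknowledges both deliveries.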

Why This ADR Matters

Protocol choice is one of those decisions that looks trivial until a robot gets stuck in a WiFi shadow and the whole approach falls apart. MQTT's design assumption - that the network is unreliable - matches warehouse reality. Polling and WebSocket both assume stable connections and require significant application-level work to compensate when they're not. We decided to use the tool built for the job.


ADR-0008 Accepted May 2024

Obstacle Detection Pipeline: Edge Compute over Cloud Inference

Each robot carries depth cameras and lidar. The raw sensor data needs to be processed into actionable obstacle classifications - "person crossing path," "pallet dropped in aisle," "forklift approaching" - fast enough that the robot can stop or reroute before an incident occurs. The question: where does that inference run?

The Problem

End-to-end latency is the constraint. A robot moving at 1.2 m/s needs at least 400ms to come to a full stop from detection. That means obstacle classification must complete within 100ms of the sensor data being captured, leaving margin for braking. Sending frames to a cloud service and waiting for a response introduces round-trip latency that makes this impossible on any real network. Even a local server adds unnecessary hops.
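The budget above implies a concrete stopping distance. A quick back-of-envelope check, assuming constant deceleration during braking (an assumption; the ADR states only the 400ms stop time):

```python
# Worst-case stopping distance for the stated latency budget.
SPEED = 1.2            # m/s, robot travel speed (from the ADR)
DETECT_BUDGET = 0.100  # s, classification must finish within this
BRAKE_TIME = 0.400     # s, detection to full stop

detect_dist = SPEED * DETECT_BUDGET         # full speed while classifying
brake_dist = SPEED * BRAKE_TIME / 2         # constant-deceleration braking
total_stop_dist = detect_dist + brake_dist  # roughly 0.36 m worst case

print(f"worst-case stopping distance: {total_stop_dist:.2f} m")
```

Every millisecond of inference latency adds 1.2 mm of travel at full speed, which is why a cloud round trip (easily 100ms or more on its own) consumes the entire budget.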

The fleet also operates in a facility with variable WiFi. An obstacle detection pipeline that requires network connectivity is a pipeline that fails precisely when the robot is navigating the congested areas where obstacles are most likely.

The Decision

Inference runs on the robot's onboard compute module (NVIDIA Jetson). Each robot carries a quantized detection model that classifies obstacles from fused camera and lidar data entirely locally. The fleet controller receives only classified events ("obstacle detected at grid B7, type: person") not raw frames. The cloud receives aggregated safety logs, not a real-time feed.

Sensor Layer (on robot)
    Depth Camera + Lidar
        │
        ▼
Onboard Compute (Jetson, on robot)  [target: <100ms end-to-end]
    Sensor fusion ──► Obstacle detection model (quantized)
                  ──► Classification: person | forklift | static | unknown
        │ obstacle event                     │ < threshold: continue
        ▼                                    ▼
    Motion Controller (stop/reroute)    Normal navigation
        └──► MQTT publish warehouse/robot/{id}/safety-event
                 └──► Fleet Controller (log + reroute adjacent robots)
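The post-classification decision step in the pipeline above can be sketched as a small pure function. The 0.5 score threshold and the stop-versus-reroute mapping are illustrative assumptions; the ADR specifies only the pipeline shape:

```python
from typing import Optional, Tuple

# Hypothetical decision policy after the quantized model emits a
# (type, score) pair. Threshold and action mapping are assumptions.

def on_classification(obstacle_type: str, score: float,
                      threshold: float = 0.5) -> Tuple[str, Optional[dict]]:
    """Return (action, safety_event).

    The event, when present, would be published to
    warehouse/robot/{id}/safety-event for the fleet controller.
    """
    if score < threshold:
        return ("continue", None)   # below threshold: normal navigation
    event = {"type": obstacle_type, "score": score}
    if obstacle_type == "person":
        return ("stop", event)      # people always trigger a hard stop
    return ("reroute", event)       # forklift/static/unknown: plan around
```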

Alternatives Considered

| Option | Why Rejected |
| --- | --- |
| Cloud inference (frames sent to API, classified remotely) | Round-trip latency makes sub-100ms detection impossible. Network dependency means detection fails during congestion, exactly when obstacles are most likely. Data volume is also prohibitive: 40 robots at 30fps is not a viable upload stream. |
| Central on-prem inference server (frames sent over LAN) | LAN latency is lower than cloud, but still adds 20-40ms plus queuing when multiple robots submit simultaneously. Creates a single point of failure for all robot safety systems. A server restart disables obstacle detection fleet-wide. |
| Rule-based proximity detection only (no ML) | Lidar proximity thresholds can stop a robot near any object, but they can't distinguish a person from a structural column, so false-positive stops would make the fleet too slow to be useful. Classification is required for the rerouting logic to make sensible decisions. |

Consequences

| Type | Detail |
| --- | --- |
| Positive | Detection latency consistently under 80ms in testing. Zero network dependency for the safety-critical path. Per-robot fault isolation: one robot's compute failing doesn't affect others. Model updates can be validated on individual units before fleet-wide rollout. |
| Negative | Each robot carries more expensive compute hardware. Model updates require OTA deployment to 40 units with rollback capability. Quantization reduces model accuracy compared to full-precision cloud inference. Edge models may drift from cloud-trained baselines if retraining cadence isn't maintained. |
| Mitigations | OTA update system with staged rollout and automatic rollback (see ADR-0017). Accuracy benchmarks run on each model update before production deployment. Cloud inference retained as a non-real-time auditing path for safety log review. |

Why This ADR Matters

Safety decisions that depend on network availability are not safety decisions - they're gambles. Moving inference to the edge was the only architecture that could meet the latency requirement unconditionally. The added hardware cost per robot is a direct cost of operating safely in a mixed human-robot environment.


ADR-0015 Accepted August 2024

Battery Management Strategy: Predictive Charging with Scheduled Fallback

A robot that runs out of battery mid-task is a blocked aisle and a missed delivery commitment. A robot that charges too conservatively is idle capacity. Charging strategy directly affects fleet throughput, and with 40 robots sharing a fixed number of charging docks, the scheduling problem is non-trivial.

The Problem

Scheduled charging (charge at fixed times, e.g., every shift break) is simple to implement but ignores actual usage. A robot that ran light tasks will be pulled off the floor unnecessarily. A robot that ran heavy routes may not make it to the next scheduled window. The variance in per-robot workload during a shift is high enough that a fixed schedule wastes significant capacity.

At the same time, predictive charging requires confidence in the model. An overconfident prediction that keeps a robot on the floor too long is worse than a conservative schedule that charges it a bit early. The failure mode matters.

The Decision

Predictive charging as the primary path, with scheduled charging as an automatic fallback. Each robot reports battery voltage, discharge rate, temperature, and estimated remaining range. The fleet controller runs a charge-need scoring function and dispatches robots to docks before they hit a critical threshold. If the predictive model hasn't updated recently or the confidence interval is wide, the robot falls back to the scheduled policy for that cycle.

Every 60 seconds:
    Robot publishes battery telemetry
        voltage, discharge_rate, temperature, estimated_range_meters
        │
        ▼
Fleet Controller: charge-need scoring
    ├── model confidence HIGH?
    │     │ YES: predictive dispatch
    │     │   ──► score = f(current_soc, discharge_rate, queue_depth, dock_availability)
    │     │   ──► if score > threshold: assign to nearest available dock
    │     │
    │     └── NO: scheduled fallback
    │           ──► charge at next shift break regardless of score
    └── SoC < 15% emergency override
          ──► immediate dock assignment (bypasses all scoring, highest priority)
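The dispatch logic above reduces to a small decision function. A sketch under stated assumptions: the scoring weights and the 0.6 threshold are made up for illustration, since the ADR says only that the parameters are configurable without a code deploy:

```python
# Hypothetical charge-need decision. Weights and threshold are
# illustrative assumptions, not the production parameters.

def charge_decision(soc: float, discharge_rate: float, queue_depth: int,
                    dock_free: bool, model_confident: bool,
                    threshold: float = 0.6) -> str:
    # Emergency override: always active, bypasses all scoring.
    if soc < 0.15:
        return "emergency_dock"
    # Fallback: stale model or wide confidence interval -> scheduled policy.
    if not model_confident:
        return "charge_at_shift_break"
    # Predictive path: score = f(current_soc, discharge_rate, queue_depth, ...)
    score = (1 - soc) * 0.6 \
        + min(discharge_rate, 1.0) * 0.3 \
        + min(queue_depth, 10) / 10 * 0.1
    if score > threshold and dock_free:
        return "dispatch_to_dock"
    return "stay_on_floor"
```

Note that the emergency override is checked before the confidence branch, so a low-battery robot docks even when the predictive model is healthy and scoring it low.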

Alternatives Considered

| Option | Why Rejected |
| --- | --- |
| Fixed scheduled charging only | Wastes throughput: robots are pulled off the floor regardless of actual need. Doesn't adapt to shift variability. Peak-demand shifts will still run robots dry between windows. |
| Pure predictive (no fallback) | A model that silently degrades, whether from sensor drift, unusual workload, or edge cases in the scoring function, would allow robots to run down to empty. No floor on risk is unacceptable in a shared human-robot environment. |
| Charge whenever docks are free (opportunistic) | Inefficient use of dock capacity. Robots interrupt tasks unnecessarily when charging wasn't needed. Complicates task scheduling since dock availability becomes unpredictable. |
| Swappable battery packs | Evaluated for phase 2. Requires a different robot hardware design, trained swap personnel, and battery inventory management. Operational complexity exceeds benefit at current scale. |

Consequences

| Type | Detail |
| --- | --- |
| Positive | 8% improvement in fleet throughput versus fixed schedule (pilot data). Robots charge when needed, not on a clock. Fallback policy prevents model failures from causing battery emergencies. Fleet controller has full visibility into charge state across all units. |
| Negative | Scoring function requires ongoing tuning as workload patterns change. Two policies running simultaneously creates more operational complexity than a single rule. Predictive model adds a retraining dependency that must be maintained. |
| Mitigations | Scoring function parameters configurable without a code deploy. Fallback policy acts as a circuit breaker when model confidence drops. Emergency SoC override is always active regardless of policy state. |

Why This ADR Matters

The hybrid approach is the honest answer to "we want to be smart about charging but we can't afford to be wrong." Pure prediction optimizes throughput but has no floor on risk. Pure schedule is safe but wasteful. The fallback policy isn't an admission that the model doesn't work - it's an acknowledgment that any model can fail, and the consequences of a battery emergency are worse than the cost of a conservative charge cycle.


ADR-0021 Accepted November 2024

Incident Response: Fail-Stop over Fail-Safe

When a robot encounters a condition it cannot resolve - conflicting sensor readings, navigation deadlock, unexpected obstacle classification, communication timeout with fleet controller - it has two options: keep moving in a degraded state, or stop completely and wait for human intervention. This ADR documents the decision to always stop.

Context for this decision: In October 2024, a robot running with a degraded lidar (one of four sensors reporting noise) continued operating at reduced confidence because the fail-safe logic determined it could still navigate "safely enough." The robot clipped a shelving unit in an area it had navigated correctly hundreds of times before. No injury, minor product damage, but the near-miss triggered a full review of the failure response policy.

The Problem

Fail-safe designs - where a system continues operating in a degraded mode - are attractive because they maximize uptime. A robot that stops every time something is uncertain is a robot that blocks aisles and requires frequent human intervention. But the October incident showed the limit of that reasoning: when a robot navigates with reduced confidence in a human-occupied environment, the consequences of a wrong decision are not bounded by the degradation level. A robot that is 80% confident it can navigate safely is not 80% safe.

The Decision

Fail-stop. Any condition the robot cannot resolve within its defined operational parameters results in a full stop and a human-required alert. The robot does not attempt degraded navigation, rerouting under uncertainty, or self-recovery beyond a defined retry window. Uptime is a secondary concern to predictability. A stopped robot is a known, bounded problem. A robot navigating with silent degradation is not.

Fault Detected
    (sensor disagreement, nav deadlock, comm timeout, classification failure)
        │
        ▼
Is fault self-recoverable within defined retry window?
    ├── YES (e.g., brief MQTT drop, transient sensor spike)
    │     ──► retry up to N times with backoff
    │     ──► if resolved: resume task, log event
    │     ──► if unresolved: fall through to STOP
    └── NO (or retry exhausted)
          ▼
FULL STOP (motors off, brakes engaged, position held)
    ├──► MQTT publish warehouse/robot/{id}/fault (immediate)
    ├──► Dashboard alert + audio alarm in supervisor station
    └──► Robot requires manual clearance before resuming

NO degraded navigation. NO partial-confidence rerouting. NO silent recovery.
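The retry-then-stop control flow above fits in a few lines. A minimal sketch, with `recover` standing in for whatever check clears the transient condition; the retry count and backoff base are illustrative, not the production values:

```python
import time
from typing import Callable

def handle_fault(recover: Callable[[], bool], recoverable: bool,
                 max_retries: int = 3, base_delay: float = 0.5) -> str:
    """Fail-stop skeleton: retry transients, otherwise FULL STOP.

    `recover` returns True once the transient condition has cleared.
    Parameters are illustrative assumptions, not production values.
    """
    if recoverable:
        for attempt in range(max_retries):
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
            if recover():
                return "resume"       # resolved: resume task, log event
    # Not self-recoverable, or retries exhausted: FULL STOP.
    # Real robot: motors off, brakes engaged, publish to
    # warehouse/robot/{id}/fault, then await manual clearance.
    return "full_stop"
```

The key property is that every path out of this function is either a clean resume or a full stop; there is no branch that keeps the robot moving in a degraded state.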

Alternatives Considered

| Option | Why Rejected |
| --- | --- |
| Fail-safe continued operation at reduced speed | The October incident. Reduced speed does not prevent contact with a misidentified obstacle. A robot that is uncertain about its environment should not be moving in that environment, regardless of speed. |
| Confidence-weighted degraded navigation | Requires defining "safe enough" thresholds for every possible degraded state. The threshold-setting problem is unsolvable without accepting arbitrary risk. A 70% confidence threshold is not derived from physics; it's a guess. |
| Autonomous rerouting under uncertainty | Rerouting under sensor uncertainty compounds the problem: the robot may navigate into an area where its degraded sensors perform even worse. Self-recovery that moves through the environment is not safe recovery. |
| Alert and continue (notify supervisor but keep moving) | Supervisor dashboards have 30-60 second notification-to-response latency in practice. At 1.2 m/s, a robot travels 36-72 meters before the supervisor can intervene. Not acceptable. |

Operational Impact

Fail-stop increases the frequency of human interventions compared to the previous fail-safe policy. The data from the first 30 days after adoption:

| Metric | Before (Fail-Safe) | After (Fail-Stop) |
| --- | --- | --- |
| Unplanned stops per shift | 1.2 average | 3.4 average |
| Human interventions per shift | 0.3 average | 3.4 average |
| Near-miss incidents | 3 in 90 days | 0 in 30 days |
| Fleet throughput reduction | 0% (baseline) | ~4% (acceptable) |

Consequences

| Type | Detail |
| --- | --- |
| Positive | Zero near-miss incidents in 30 days post-adoption versus 3 in the prior 90 days. Robot behavior is fully predictable when faults occur: it stops and alerts, nothing else. Fault logging is complete because every stop is a recorded event. Human operators can trust that a robot in motion has met all operational parameters. |
| Negative | More frequent human interventions per shift. Approximately 4% throughput reduction compared to the fail-safe baseline. Operations team required additional training on common fault patterns to reduce intervention time. Some warehouse operators initially pushed back on the frequency of stops. |
| Mitigations | Fault classification improvements reduced "unnecessary" stops by 40% within 60 days (transient conditions now handled by retry logic before triggering fail-stop). Intervention runbooks created for the 5 most common fault types. Operations team accepted the tradeoff after reviewing near-miss data. |
| Accepted cost | Fail-stop costs throughput. We accepted that tradeoff explicitly. The alternative is a system where "safe enough to keep moving" is a judgment made by a model under uncertainty, in a warehouse with people in it. That judgment will eventually be wrong. We prefer to be predictably stopped over unpredictably risky. |