Many current IoT communication approaches try to answer the basic addressing question with traditional network techniques. That means that the device either gets a public network address or it becomes part of a virtual network and then listens for incoming traffic using that address, acting like a server. In this section, we document various architectural approaches that we have seen, highlight their characteristics, and then propose an alternative that is suitable for many IoT scenarios.
NAT-based device network
This architectural approach uses network address translation (NAT)65 to expose internal devices, which usually have private IP addresses, to the outside world: a port is reserved on the edge device and mapped to the private IP address and port of the device. The following diagram illustrates this approach.
Figure . NAT-based device network
The previous figure shows a device with an internal IPv4 address (192.168.1.112) that is listening on port 8088. The device is exposed to the outside world through the public IP address of the edge device, using port 721, and the DNS entry associated with that public address is device.mynetwork.com. Clients accessing device.mynetwork.com on port 721 are routed directly to the internal device.
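To make the mapping concrete, the following Python sketch emulates the port mapping from the figure as a user-space TCP relay: connections arriving on port 721 of the edge device are forwarded to the internal device at 192.168.1.112:8088. This is only an illustration of the mapping concept; a real deployment would configure the mapping in the router's NAT table rather than in application code, and binding to a port below 1024 requires elevated privileges.

import socket
import threading

EDGE_PORT = 721                        # port reserved on the edge device (privileged port)
DEVICE_ADDR = ("192.168.1.112", 8088)  # internal device from the figure

def pump(src, dst):
    # Copy bytes from src to dst until either side closes, then tear down the pair.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()

def handle(client):
    # Map the inbound connection on port 721 to the device's private address and port.
    upstream = socket.create_connection(DEVICE_ADDR)
    threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pump, args=(upstream, client), daemon=True).start()

def main():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", EDGE_PORT))
    listener.listen()
    while True:
        client, _ = listener.accept()
        threading.Thread(target=handle, args=(client,), daemon=True).start()

if __name__ == "__main__":
    main()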
This approach has been used in many traditional networks, and depending on the scenario, it can still work today. However, we have found that it is typically limited by the number of devices it can support (roughly 65,000, bounded by the number of available ports per public IP address), by the need for devices to be statically located (not moving), and by the fact that every exposed device needs to act like a server (receiving, parsing, and answering arbitrary requests from clients), which increases its attack surface for malicious abuse.
IPv6 direct-addressing device network
With the rollout of IPv6, it is natural to think about giving every device in an IoT solution its own publicly routable IP address so that it can connect to peers, services in the system, or other systems. The following diagram conceptually depicts this model, which we have seen many times.
Figure . IPv6 direct-addressing device network
We mentioned the drawbacks of this approach at the start of this section: the device gets a public network address (or joins a virtual network) and then listens for incoming traffic on that address, acting like a server. Whether the device is reached through an IPv4 NAT mapping or a directly routable IPv6 address, it needs to act like a server, and with the implicit direct-connectivity model, it must be stationary to avoid connection loss, or it must employ application-layer measures that can handle this scenario.
NAT-based, PAN device network
For personal area network (PAN) devices that are power constrained, mostly wirelessly connected, and often not IP-based, a common approach to bridging the last few feet of connectivity is to use a hub device, wired to the main network, that bridges to the devices on the local PAN. The following figure illustrates this approach.
Figure . NAT-based, PAN devices network
Even though a hub translates between IP and the various PAN protocols, the problem space is the same as with other NAT-based device networks that we described.
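As a rough sketch of the hub's translation role, the following Python outline accepts commands over IP and hands them to the local PAN radio. The PAN side is represented by a hypothetical send_pan_command() stub, because the actual call depends entirely on the radio and protocol in use; the port and the message format are likewise assumptions made for illustration.

import json
import socket

HUB_PORT = 8900  # illustrative port that the hub exposes on the wired network

def send_pan_command(device_id, command):
    # Hypothetical stand-in for the PAN radio driver (ZigBee, BLE, and so on).
    # A real hub would hand the command to the vendor's PAN stack here.
    print(f"radio -> {device_id}: {command}")

def serve():
    # Accept IP connections and relay newline-delimited JSON commands to PAN devices.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", HUB_PORT))
    listener.listen()
    while True:
        conn, _ = listener.accept()
        with conn, conn.makefile("r", encoding="utf-8") as lines:
            for line in lines:
                msg = json.loads(line)
                send_pan_command(msg["device"], msg["command"])

if __name__ == "__main__":
    serve()

Because the hub still acts like a server on the IP side, this sketch shares the same exposure and scalability concerns as the NAT-based designs above.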
Generic concerns with direct addressing
All of the previous architectures that provide direct addressability for devices share common concerns. Because each device is publicly addressable, it needs to handle inbound commands itself, taking care of all application-layer responsibilities, such as hosting the server that accepts inbound connections, interpreting commands, queuing requests, and so on. Many devices in large-scale deployments have limited resources, which constrains the number of socket connections they can handle and leaves them open to simple denial-of-service (DoS) attacks.66 In this approach, the devices would also have to handle the authentication of users for command and control, using the already scarce sockets, memory, and compute power to call out to a service or connect to a database and handle its responses and I/O.
Service-assisted communication
Another approach to connecting a large number of devices to a central service within a system is to have each device connect to a well-known service (called a gateway) and then use that service to tunnel commands to the devices. The goal of this approach is to establish trustworthy and bi-directional communication paths between control systems and special-purpose devices that are deployed in untrusted physical space. To that end, the following principles are established:
Security trumps all other capabilities. If a capability cannot be implemented securely, it must not be implemented. Threats are identified and either mitigated or accepted.
Devices do not accept unsolicited network information. All connections and routes are established in an outbound-only fashion.
Devices connect to or establish routes toward well-known services only. If devices need to feed information to or receive commands from a multitude of services, they are peered with a gateway that takes care of routing information downstream and that ensures commands are accepted only from authorized parties before routing them to the devices.
The communication path between the device and the service or gateway is secured at the application protocol layer, mutually authenticating the device to the service or gateway and vice versa. Because, as discussed earlier in Connectivity, the application does not normally concern itself with the lower layers of the network stack, device applications do not trust the underlying link layer.
System-level authorization and authentication must be based on per-device identities. One device, one identity ensures that you have granular control over which devices can access the system, provide data, and receive commands.
Access credentials and permissions must be revocable. In case of device abuse, the system must be able to respond quickly by removing the device as an authorized part of the system. (A minimal sketch of a per-device registry with revocation follows this list.)
Bi-directional communication for devices may be facilitated by an intermediate store. Devices that connect only sporadically, due to power or connectivity constraints, can be supported by holding commands and notifications for them in a queue or mailbox structure until they connect to retrieve them.
Application payload data may be separately secured. This is to protect transit through gateways to any particular service.
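The per-device identity and revocation principles above can be made concrete with a small sketch. The following Python code is purely illustrative and uses assumed names (DeviceRegistry, enroll, revoke, authenticate); a production system would back this with a secured identity store rather than in-memory structures.

import hmac
import secrets

class DeviceRegistry:
    # Illustrative per-device identity store with revocable credentials.

    def __init__(self):
        self._keys = {}       # device_id -> per-device shared key
        self._revoked = set()

    def enroll(self, device_id):
        # Issue a fresh key for exactly one device (one device, one identity).
        key = secrets.token_bytes(32)
        self._keys[device_id] = key
        return key

    def revoke(self, device_id):
        # Immediately remove the device as an authorized part of the system.
        self._revoked.add(device_id)

    def authenticate(self, device_id, presented_key):
        # Accept the device only if its key matches and has not been revoked.
        key = self._keys.get(device_id)
        if key is None or device_id in self._revoked:
            return False
        return hmac.compare_digest(key, presented_key)

# Usage: enroll a device, authenticate it, then revoke it after suspected abuse.
registry = DeviceRegistry()
key = registry.enroll("pump-0042")
assert registry.authenticate("pump-0042", key)
registry.revoke("pump-0042")
assert not registry.authenticate("pump-0042", key)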
Figure . Service-assisted communication pattern
From the previous illustration, we can derive the following set of attributes:
Device. The device acts like a client; it connects to the gateway and does not listen for unsolicited traffic. The device connects to an external gateway by creating and maintaining an outbound TCP socket across a NAT boundary or by establishing a bi-directional UDP route, potentially using mechanisms such as Session Traversal Utilities for NAT (STUN), which facilitates the detection of a NAT and the discovery of the network's public IP address and port mapping, or Traversal Using Relays around NAT (TURN), which adds a relay for cases in which a direct route cannot be established.
Connection. The connection is routed through the edge device, usually a router. Because the connection is outbound, the port mapping is performed automatically. By only relying on outbound connectivity, the NAT/Firewall device at the edge of the local network will never have to be opened up for any unsolicited inbound traffic.
The outbound connection or route is maintained by either client or gateway in a fashion that intermediaries such as NATs will not drop due to inactivity. That means that either side might send some form of a keep-alive packet periodically, or send a payload packet periodically that then doubles as a keep-alive packet. Under most circumstances it will be preferable for the device to send keep-alive traffic as it is the originator of the connection or route, and it can and should react to a failure by establishing a new one.
Because TCP connections are an endpoint concept, a connection is only declared dead once the route is considered collapsed, and detecting that collapse requires packet flow. A device and its gateway may therefore sit idle for quite a while, believing that the route and connection are still intact, before the lack of acknowledgement of the next packet reveals that assumption to be incorrect. This tension calls for a tradeoff decision.
Carrier-grade NATs (CGNs) employed by mobile network operators permit very long periods of connection inactivity and mobile devices that get direct IPv6 address allocations are not forced through a NAT at all. The push notification mechanisms employed by all popular smartphone platforms use this to dramatically reduce the power consumption of the devices by maintaining the route very infrequently—every 20 minutes or more—so the devices can remain in sleep mode with most systems turned off while idly waiting for payload traffic. The downside of infrequent keep-alive traffic is that the time it takes to detect a bad route is, at worst, as long as the keep-alive interval. Ultimately, it is a tradeoff between battery-power and traffic-volume cost (on metered subscriptions) and acceptable latency for commands and notifications in case of failures. The device can actively detect potential issues and abandon the connection and create a new one when, for instance, it hops to a different network or when it recovers from signal loss.
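As one possible shape for this behavior, the following Python sketch maintains a single outbound connection, sends a one-byte keep-alive whenever the connection has been idle for the configured interval, and re-establishes the connection when a send fails. The gateway address, interval, and framing are assumptions made for illustration, not part of any particular product.

import queue
import socket
import time

GATEWAY_ADDR = ("gateway.example.com", 5671)  # illustrative gateway endpoint
KEEPALIVE_INTERVAL = 20 * 60                  # seconds; the tradeoff discussed above
outgoing = queue.Queue()                      # payloads produced elsewhere on the device

def connect():
    # Create the outbound connection; this also creates the NAT mapping.
    while True:
        try:
            return socket.create_connection(GATEWAY_ADDR, timeout=30)
        except OSError:
            time.sleep(5)  # a real device would use jittered exponential backoff

def run():
    conn = connect()
    while True:
        try:
            # Wait for payload traffic, but never longer than the keep-alive interval.
            packet = outgoing.get(timeout=KEEPALIVE_INTERVAL)
        except queue.Empty:
            packet = b"\x00"  # idle: a one-byte keep-alive doubles as a liveness probe
        try:
            conn.sendall(packet)  # payload traffic also refreshes the NAT mapping
        except OSError:
            # The route or connection collapsed; as the originator, the device
            # reacts by establishing a new one. (A real device would also
            # requeue or persist the packet it was about to send.)
            conn.close()
            conn = connect()

if __name__ == "__main__":
    run()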
The connection from the device to the gateway is protected end-to-end, independently of any underlying link-level protection measures. The gateway authenticates with the device and the device authenticates with the gateway, so neither is anonymous to the other. In the simplest case, this can be done by exchanging a previously shared key. In more capable devices, as we see quite often, it can also be done via an X.509 certificate exchange as performed by Transport Layer Security (TLS), or via a TLS handshake with server authentication only, where the device later supplies credentials or an authorization token at the application level. The privacy and integrity protection of the route is also established end-to-end, ideally as a byproduct of the authentication handshake, so that a potential attacker cannot waste cryptographic resources on either side without producing proof of authorization.
Today, TLS/DTLS and Secure Shell (SSH) dominate as application-level connection security protocols. SSH is popular, but it lacks a standard session-resumption gesture. TLS supports both the X.509 certificate-exchange model and a simplified model (TLS-PSK) that uses previously shared keys. Removing support for X.509 certificate handling and wire-level exchange reduces the footprint of the TLS library, and by reducing the supported algorithms (for example, supporting only AES-256 and SHA-256), it’s feasible to use this protocol on compute- and memory-constrained devices while remaining compatible with other application layer protocols that rely on TLS. The result of all this is a secure peer connection between the device and a gateway that only the gateway can feed.
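As an illustration of the certificate-based variant, the following sketch uses Python's standard ssl module to set up a mutually authenticated TLS connection from the device to the gateway: the device verifies the gateway's certificate against a trust anchor and presents its own X.509 certificate and key. The host name and file names are placeholders; a TLS-PSK variant would substitute a previously shared key for the certificate configuration.

import socket
import ssl

GATEWAY_HOST = "gateway.example.com"  # placeholder host name
GATEWAY_PORT = 5671                   # placeholder port

def open_secure_channel():
    # Trust anchor used to verify the gateway's certificate.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile="gateway-ca.pem")
    # The device's own certificate and private key; presenting them lets the
    # gateway authenticate this specific device (one device, one identity).
    context.load_cert_chain(certfile="device.pem", keyfile="device.key")
    context.minimum_version = ssl.TLSVersion.TLSv1_2

    raw = socket.create_connection((GATEWAY_HOST, GATEWAY_PORT))
    # The handshake mutually authenticates both ends and yields the keys that
    # protect the privacy and integrity of everything sent afterwards.
    return context.wrap_socket(raw, server_hostname=GATEWAY_HOST)

# Usage: everything written to this socket is protected end-to-end, independent
# of any link-layer security underneath.
# channel = open_secure_channel()
# channel.sendall(b'{"telemetry": "..."}')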
Edge security. Because there are no ports open to listen on the edge device, the attack surface on the local network and its devices is minimized.
Gateway. The connection is accepted by a gateway: a service hosted in an environment that is defensible against external threats, either at the edge of the internal network or in the cloud. It provides a well-defined endpoint and API for clients to connect to and communicate with, effectively acting as a proxy for the device. Eventual peer-to-peer connections inside the network are acceptable, but only if the gateway permits them and facilitates a secure handshake between the peers.
When an authorized client wants to send a command (or a reply to a previous request) to a device, it does so by sending the command to the gateway, which can expose one or several different APIs and protocol surfaces that are translated to the primary bi-directional protocol used by the device. Because the gateway is a layer of abstraction, it provides the device with a stable address, location transparency, and location hiding.
Because this gateway forms an abstraction toward the device, the device could be limited to speaking AMQP, MQTT, or some proprietary protocol, and yet have a full HTTP/REST interface projection at the gateway, with the gateway taking care of the required translation and also of enrichment, where responses from the device can be augmented with reference data. The device can connect from any context, and it can even switch contexts, yet its projection into the gateway and its address remain completely stable. The gateway can also be federated with external identity and authorization services, so that only callers acting on behalf of particular users or systems can invoke particular device functions. The gateway therefore provides basic network defense, API virtualization, and authorization services combined into one. This approach gets even better when it includes, or is based on, an intermediary messaging infrastructure that provides a scalable queuing model for both ingress (device-to-cloud) and egress (cloud-to-device) traffic.
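To make the protocol projection concrete, here is a minimal, purely illustrative sketch of such a gateway surface built on Python's standard http.server: an authorized client POSTs a command to /devices/<id>/commands, and the gateway places it on that device's queue, from which the device-facing side of the gateway would deliver it over the device's own protocol (MQTT, AMQP, or similar). The URL layout, queue, and handler names are assumptions, not a real product API, and authorization is reduced to a comment.

import json
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer
from queue import Queue

# Per-device command queues; the device-facing side of the gateway (not shown)
# would drain these over the device's own protocol (MQTT, AMQP, and so on).
device_queues = defaultdict(Queue)

class GatewayHandler(BaseHTTPRequestHandler):
    # HTTP/REST projection of the devices behind the gateway (illustrative only).

    def do_POST(self):
        # Expected path: /devices/<device_id>/commands
        parts = self.path.strip("/").split("/")
        if len(parts) != 3 or parts[0] != "devices" or parts[2] != "commands":
            self.send_error(404)
            return
        device_id = parts[1]
        length = int(self.headers.get("Content-Length", 0))
        command = json.loads(self.rfile.read(length) or b"{}")
        # Translation point: authorize the caller, enrich the command with
        # reference data if needed, then hand it to the device-facing protocol.
        device_queues[device_id].put(command)
        self.send_response(202)  # accepted for asynchronous delivery
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), GatewayHandler).serve_forever()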
Without this intermediary infrastructure, this approach would still suffer from the issue that devices must be online and available to receive commands and notifications when the control system sends them. With a per-device queue or per-device subscription on a publish/subscribe infrastructure, the control system can drop a command at any time, and the device can pick it up whenever it is online. If the queue provides time-to-live expiration alongside a dead-lettering mechanism for such expired messages, the control system can also know immediately when a message has not been picked up and processed by the device in the allotted time.
The queue also ensures that the device can never be overtaxed with commands or notifications. The device maintains one connection into the gateway, and it fetches commands and notifications on its own schedule. Any backlog forms in the gateway and can be handled there accordingly. The gateway can start rejecting commands on the device's behalf if the backlog grows beyond a threshold or the expiration mechanism cited earlier kicks in, and the control system is notified that the command cannot be processed at this time.
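The following sketch illustrates the per-device queue behavior described in the previous two paragraphs: commands carry a time-to-live, expired commands are moved to a dead-letter list so the sender learns about them, and new commands are rejected once the backlog exceeds a threshold. All names and the in-memory storage are illustrative stand-ins for a real messaging infrastructure.

import time
from collections import deque

MAX_BACKLOG = 100  # illustrative threshold beyond which commands are rejected

class DeviceMailbox:
    # Illustrative per-device queue with time-to-live expiration, dead-lettering,
    # and backlog rejection on the device's behalf.

    def __init__(self):
        self.pending = deque()   # (payload, absolute expiration time)
        self.dead_letter = []    # expired commands, surfaced to the sender

    def drop_command(self, payload, ttl_seconds):
        # Called by the control system at any time; returns False when the
        # backlog is too deep, so the sender is notified immediately.
        self._expire()
        if len(self.pending) >= MAX_BACKLOG:
            return False
        self.pending.append((payload, time.monotonic() + ttl_seconds))
        return True

    def fetch(self):
        # Called by the device on its own schedule whenever it is online.
        self._expire()
        return self.pending.popleft()[0] if self.pending else None

    def _expire(self):
        # Move expired commands to the dead-letter list so their senders learn
        # that they were not processed in the allotted time.
        now = time.monotonic()
        still_valid = deque()
        while self.pending:
            payload, deadline = self.pending.popleft()
            (still_valid if deadline > now else self.dead_letter).append((payload, deadline))
        self.pending = still_valid

# Usage: the control system drops a command whenever it likes; the device picks
# it up the next time it connects, unless the command expired first.
mailbox = DeviceMailbox()
mailbox.drop_command({"op": "reboot"}, ttl_seconds=60.0)
command = mailbox.fetch()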
On the ingress side (from the gateway's perspective), using a queue has the same kind of advantages for the back-end systems. If devices are connected at scale and input from the devices arrives in bursts or has significant spikes around certain hours of the day, such as with telematics systems in passenger cars during rush hour, having the gateway absorb the traffic spikes keeps the back-end system robust. The ingestion queue also allows telemetry and other data to be held temporarily when the back-end systems or their dependencies are taken down for servicing or suffer from any kind of service degradation.