This topic applies to Armor Complete and Armor Anywhere users.
This document outlines reference architectures and configuration details for business continuity and disaster recovery (BCDR) and scalability scenarios using Armor's Log Relay product. It provides architectural diagrams and guidance for configuring Log Relay in a highly available, scalable configuration.
High Availability (HA) is handled differently depending on the mechanism used to ingest logs. For most log source types, this involves provisioning additional Log Relays in an active/active configuration. This is also how one would scale to increase overall throughput.
For other log sources (specifically, those where the Log Relay polls data from an external source, as opposed to data being sent to the Log Relay), resiliency should be configured at the log source. Recovery scenarios are detailed in the Polling Sources section.
Syslog Sources (TCP and UDP)
To make Log Relays highly available for syslog sources, a load balancer should be configured to distribute logs across a pool of similarly configured Log Relays. This may be configured in an active/active or active/passive scenario, depending on the capabilities of your load balancer. There is no advantage to an active/passive configuration, whereas an evenly distributed active/active configuration increases overall throughput.
Figure 1 – High-availability and scaling configuration diagram for Log Relay syslog ingestion
As with any clustering scenario, it is recommended that Log Relay nodes be provisioned across at least three (3) different availability zones, and that total throughput should not be provisioned in excess of the capacity of your accepted failure condition.
For example, if you have three (3) nodes provisioned, and your accepted failure condition allows for up to one (1) node to fail, overall throughput should be capped at the capacity of the two (2) working nodes. Additional capacity and scaling planning guidance is offered in the Scaling section.
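The capacity rule above can be written as a short calculation. The following is a minimal sketch in Python; the function name and the per-node figure of 5000 events per second are illustrative assumptions, not published Log Relay limits.

```python
def usable_throughput(node_count: int, tolerated_failures: int, per_node_capacity: int) -> int:
    """Return the throughput ceiling that survives the accepted failure condition."""
    if tolerated_failures < 0 or tolerated_failures >= node_count:
        raise ValueError("tolerated_failures must be between 0 and node_count - 1")
    surviving_nodes = node_count - tolerated_failures
    return surviving_nodes * per_node_capacity


# Three nodes, one allowed to fail; 5000 events/second per node is a hypothetical figure.
print(usable_throughput(3, 1, 5000))  # 10000
```

Substituting a measured per-node capacity for the placeholder gives the cap to plan against.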
Notes on UDP Load Balancing
In many cases, UDP syslog streams are preferable to TCP. One example is Cisco routing devices, which (by default, and in some cases unavoidably) will stop passing traffic if an outbound TCP syslog connection fails. While the Log Relay supports both TCP and UDP for most log source types, the load balancer must also support UDP load balancing.
Some UDP load balancers use session pinning to prevent duplicate deliveries of packets, which in the case of high-availability is acceptable, but may not have the desired effect when load balancing for scaling. This is covered in more detail in the Scaling section.
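To see why session pinning can limit scaling, consider a toy model in which the load balancer picks a node from a stable hash of the source address. This is only a sketch: real load balancers use their own algorithms, and the node names and source address below are hypothetical.

```python
import zlib


def pinned_node(source_ip: str, nodes: list[str]) -> str:
    """Pick a node from a stable hash of the source address (a toy model of pinning)."""
    return nodes[zlib.crc32(source_ip.encode()) % len(nodes)]


nodes = ["relay-a", "relay-b", "relay-c"]  # hypothetical node names

# Every packet from one source lands on the same node, however many it sends.
targets = {pinned_node("10.0.0.5", nodes) for _ in range(1000)}
print(len(targets))  # 1
```

Under this model, a single high-volume sender never spreads across the pool, which is fine for failover but does not add throughput for that sender.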
When configuring health checks on your load balancer, a variety of options are available, but the simplest is a port check. Note that some load balancers (such as the AWS Network Load Balancer) only support port checks using TCP. It is recommended that each pool perform a health check on each target port in use.
Additionally, you can configure health checks on the log relay service in each of the nodes, monitor for pipeline failures in the application log, or integrate with ARMOR’s API to assert the health of a member node.
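A TCP port check of the kind described above can be sketched in a few lines of Python. This only illustrates what such a check does; it is not how any particular load balancer implements it, and the function names are assumptions. It cannot verify UDP listeners, because UDP has no connection handshake.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unhealthy_ports(host: str, ports: list[int]) -> list[int]:
    """Check every target port in use and return those that refused a connection."""
    return [port for port in ports if not port_is_open(host, port)]
```

For example, `unhealthy_ports("10.0.1.10", [10435])` would check the NGINX listener on a node at a hypothetical address and return an empty list if the port accepts connections.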
TLS Syslog Sources
Some log source devices require that log messages be sent over an encrypted channel such as TLS syslog. Most load balancers have the option of passing through TLS connectivity or terminating it at the load balancer. It is recommended that TLS be passed through and terminated at the Log Relay unless your network topology or other circumstances require that it be terminated at the load balancer.
Polling Sources
High availability for polling sources (where the Log Relay pulls data from a source as opposed to having it pushed to the Log Relay) should be guaranteed at the source. The Log Relay has built-in recovery behavior: if the source is unavailable, it will continue to retry, and if the network connection between the Log Relay and the ARMOR ingestion point is interrupted, logs will continue to be stored until the connection is restored.
Figure 2 – Basic flow diagram for logs originating in an Amazon S3 bucket
If the Amazon S3 bucket becomes unavailable, the Log Relay will continue to retry with an exponential back-off that maxes out at three hundred (300) seconds. If the network connection between the Log Relay and ARMOR is interrupted, the Log Relay will continue to store logs until its persistence limit is reached (by default, one (1) GB per pipeline). Retrying delivery to ARMOR follows an identical back-off pattern.
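The retry timing can be pictured with a small generator. Only the 300-second ceiling comes from the text above; the one-second starting delay and the doubling factor are assumptions made for illustration, and the Log Relay's actual values may differ.

```python
from itertools import islice
from typing import Iterator

MAX_DELAY_SECONDS = 300.0  # ceiling stated in the text


def backoff_delays(initial: float = 1.0, factor: float = 2.0,
                   ceiling: float = MAX_DELAY_SECONDS) -> Iterator[float]:
    """Yield retry delays that grow geometrically until they reach the ceiling."""
    delay = initial
    while True:
        yield min(delay, ceiling)
        delay = min(delay * factor, ceiling)


# initial=1.0 and factor=2.0 are assumed values, not documented defaults.
print(list(islice(backoff_delays(), 11)))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0, 300.0]
```

With these assumed values, the delay reaches the ceiling on the tenth attempt and stays there until the source or the connection recovers.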
Log Relay Queue Persistence
In most cases, and in this example with AWS S3 buckets, each separately configured S3 bucket is a unique pipeline with its own queue and persistence limit. When the persistence limit is reached, the Log Relay will stop ingesting logs from the source until it has resumed sending logs to ARMOR.
It is important to guarantee two things: that the Log Relay has sufficient disk space to accommodate the maximum queue size multiplied by the number of queues/pipelines, and that the storage is fast enough to process both the stored logs and newly incoming logs as backpressure is relieved.
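The disk-space requirement is a simple multiplication, sketched below. The function name is illustrative, and the sketch assumes decimal gigabytes because the text does not say whether the 1 GB default is decimal or binary.

```python
GB = 10 ** 9  # assumes decimal gigabytes; the source does not specify GB vs GiB


def queue_storage_bytes(pipeline_count: int, limit_per_pipeline: int = 1 * GB) -> int:
    """Worst-case disk usage if every pipeline queue fills to its persistence limit."""
    if pipeline_count < 0:
        raise ValueError("pipeline_count must not be negative")
    return pipeline_count * limit_per_pipeline


# Four S3 buckets means four pipelines, each with the default 1 GB limit.
print(queue_storage_bytes(4) / GB)  # 4.0
```

This figure covers queue data only; the operating system, application logs, and any safety margin come on top of it.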
External Source Persistence
Similar to the above storage requirements, it is important to ensure the retention policy of the source device is at least as long as your Recovery Point Objective (RPO). For example, if your recovery point objective is anywhere in the past seven (7) days, the S3 retention policy should be set to at least seven (7) days.
Scaling
Some devices may generate too many logs for a single Log Relay to handle. This is especially true if a log aggregator is used to collect logs from various sources and send them to the Log Relay. For these scenarios we recommend using a Log Relay cluster and scaling horizontally. While vertical scaling is possible, the configuration has been optimized for horizontal scaling, so we recommend using a load balancer and cluster to scale.
Scaling Total Throughput
When scaling to increase the total throughput (available evenly for all log source types to consume), the architecture is identical to the high-availability architecture as shown above in Figure 1. When adding nodes to a cluster it is important to consider the BCDR implications as outlined in the Syslog Sources section.
Scaling Specific Device Types
When scaling to increase the throughput of a specific device type, you can leverage the type/port-binding relationship to configure specific listener groups/pools in your load balancer.
For example, if your NGINX servers are generating a disproportionately large volume of logs, you could bind the listener group for port 10435 (the NGINX-specific port) to one pool of Log Relay nodes and the remaining listeners to a shared pool of Log Relay nodes. This is illustrated below.
Figure 3 – Flow diagram illustrating scaling throughput for a specific device type
In the diagram above, the high-volume log source (in blue) is configured to distribute its logs across multiple nodes. The low-volume log source (in green), and potentially others, are still sent through the load balancer for HA/failover scenarios, but default to a single-node pool.
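The port-to-pool binding described in this section can be modeled as a lookup table. This is a sketch of the routing idea only, not a load balancer configuration: port 10435 comes from the text, while the node names and port 5514 are hypothetical.

```python
NGINX_PORT = 10435  # NGINX-specific port named in the text

# Listener ports that get their own pool of nodes (node names are hypothetical).
DEDICATED_POOLS = {
    NGINX_PORT: ["relay-nginx-1", "relay-nginx-2", "relay-nginx-3"],
}

# Every other listener falls back to the shared pool (hypothetical node name).
SHARED_POOL = ["relay-shared-1"]


def pool_for_port(listener_port: int) -> list[str]:
    """Return the pool of Log Relay nodes bound to a listener port."""
    return DEDICATED_POOLS.get(listener_port, SHARED_POOL)


print(pool_for_port(10435))  # ['relay-nginx-1', 'relay-nginx-2', 'relay-nginx-3']
print(pool_for_port(5514))   # ['relay-shared-1'] (5514 is a made-up port)
```

In a real deployment the same mapping would be expressed as listener and target-group rules in the load balancer itself.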