References
Related Documents
The following documentation is available for DANZ Monitoring Fabric:
Prometheus is an open source monitoring and alerting toolkit. It collects and stores metrics from different sources in a time-series database. Prometheus offers a powerful query language that allows users to analyze and visualize the collected data in real time. With its robust alerting system, Prometheus can also notify users of potential issues, helping with their timely resolution.
Starting with the DMF 8.5.0 release, a Prometheus server can scrape metrics from a DMF (DANZ Monitoring Fabric) deployment. The DMF Controller exposes interface counters, CPU usage, memory usage, sensor states, and disk usage statistics from all the devices it manages, including the Controllers, through a single Prometheus endpoint.
In a DMF deployment with an active/standby Controller cluster, each Controller collects metrics from all the devices it manages as well as from both Controller nodes. It then exposes them via the telemetry endpoint /api/v1/metrics/prometheus. This is an authenticated endpoint that listens on port 8443 and supports both the Prometheus and OpenMetrics exposition formats.
Even though each Controller is capable of serving the telemetry information, it is recommended to use the cluster's virtual IP (VIP) in the Prometheus configuration to achieve seamless continuity in the event of a Controller failover.
No additional configuration is necessary on the DMF Controller to enable metric collection. However, to allow the Prometheus service to access the new HTTP telemetry endpoint, a user access-token needs to be created. An admin user can choose an existing user or create a dedicated user with the correct privileges and generate a token so that Prometheus can fetch metrics from the fabric. The following sections describe the necessary configuration.
The group that the user belongs to needs to have sufficient permission to query the Prometheus endpoint. The following table summarizes the behavior.
Group | Behavior |
---|---|
admin | An access-token generated for a user in the admin group will have access to the telemetry endpoint. |
read-only | An access-token for a user in the read-only group will not have access to the telemetry endpoint. |
Any custom group | An access-token for a user in a group with TELEMETRY or DEFAULT permission, but not with DEFAULT/SENSITIVE permission, will have access. |
To set telemetry permission for a custom group, use the following commands:
dmf-controller(config)# group group_name
dmf-controller(config-group)# permission category:TELEMETRY privilege read-only
dmf-controller(config-group)# associate user username
Generate an access token for the user using the following command.
dmf-controller(config)# user username
dmf-controller(config-user)# access-token <descriptive name for the access-token>
access-token : ZxyHXL0QyOhDUogT8wjZj7ouSiVtWNB3
The following configuration needs to be added to the Prometheus server to periodically fetch metrics from the Controller's /api/v1/metrics/prometheus endpoint.
scrape_configs:
  - job_name: <job_name>
    scheme: https
    authorization:
      type: Bearer
      credentials: <access-token>
    metrics_path: /api/v1/metrics/prometheus
    scrape_interval: <interval>
    static_configs:
      - targets:
          - <vip>:8443
    tls_config:
      insecure_skip_verify: true
The table below depicts the recommended configurations for this feature.
Configuration | Value |
---|---|
Credential | The corresponding access token created on the Controller |
Scrape Interval | The minimum supported interval is 10s |
Target | Use the VIP of the DMF Controller cluster |
TLS | If a self-signed certificate is used on the Controller, add insecure_skip_verify: true |
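As a quick sanity check, the scrape configuration can be validated before starting or reloading Prometheus. The following is a minimal sketch that assumes the configuration above was saved to a file named prometheus.yml on the Prometheus server:

# Validate the Prometheus configuration file syntax (prometheus.yml is an assumed file name).
promtool check config prometheus.yml

After a successful check, restart or reload Prometheus according to how it is deployed in your environment.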
Please refer to the configuration guidelines of the specific Prometheus version you are using in production.
The interface metrics are reported using the switch interface name (e.g., et1) rather than the user-configured name associated with the role of an interface (e.g., filter1).

The name of a managed device is used in the device_name label for all the metrics corresponding to it. In the case of a Controller, the configured hostname is used in the device_name label. Thus, these names are expected to be unique in a specific DMF deployment.

Prometheus adds a few metrics of its own for every scrape, such as scrape_duration_seconds, scrape_samples_post_metric_relabeling, scrape_samples_scraped, and scrape_series_added. These metrics do not collect any data from the DMF fabric and can be ignored.

The status of a component is reported as a StateSet metric. Each possible state is reported as a separate metric. The current state is reported with value 1 and the other states with value 0. The state itself is reported as a label with the same name as that of the metric. For example, device_psu_oper_status displays multiple metrics for the operational status of a PSU (Power Supply Unit). If a PSU psu1 is in the failed state, the metric device_psu_oper_status{name="psu1", device_psu_oper_status="failed"} reports value 1. At the same time, the metric device_psu_oper_status{name="psu1", device_psu_oper_status="ok"} appears with value 0 for the other state ok.
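For illustration only (not part of the DMF configuration), a PromQL query like the following can be run against the Prometheus server that scrapes the fabric to list the PSUs currently reporting the failed state; the Prometheus host name and port are placeholders:

# Instant query via the Prometheus HTTP API; prometheus.example.com:9090 is a placeholder.
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=device_psu_oper_status{device_psu_oper_status="failed"} == 1'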
If no metric is collected or no change in the metrics is visible in Prometheus for a few minutes, follow these troubleshooting steps:

Verify that the telemetry endpoint is reachable and the access token is valid by querying the endpoint directly:

curl -k -H "Authorization: Bearer <token>" https://<vip>:8443/api/v1/metrics/prometheus

Check the state of the telemetry connections on the Controller using the show telemetry connection command. Check the Telemetry Collector section of the DANZ Monitoring Fabric User Guide for more details.

The following table shows the details of each metric exposed by the DMF fabric. The supported device type columns (Ctrl, SWL, EOS, SN, RN) show whether a metric is generally supported on that device type. However, some specific platforms or hardware might not report a specific metric.
Metric | Description | Ctrl | SWL | EOS | SN | RN |
---|---|---|---|---|---|---|
device_cgroup_cpu_percentage | The normalized CPU utilization percentage by a control group | Y | Y | N | Y | Y |
device_cgroup_memory_bytes | The memory used by a control group in bytes | Y | Y | N | Y | Y |
device_config_info | The informational metrics of a managed device | N | Y | Y | Y | Y |
device_cpu_utilization_percentage | The percentage utilization of a CPU on a device of the fabric | Y | Y | Y | Y | Y |
device_fan_oper_status | The current operational status of a fan | Y | Y | Y | Y | Y |
device_fan_rpm | The current rate of rotation of a fan | Y | Y | Y | Y | Y |
device_fan_speed_percentage | The percentage of the max rotation capacity that a fan is currently rotating | N | Y | N | N | N |
device_interface_in_broadcast_packets_total | The total number of broadcast packets received on the interface of a device | Y | Y | Y | Y | Y |
device_interface_in_discards_packets_total | The number of discarded inbound packets by the interface of a device even though no errors had been detected | Y | Y | Y | Y | Y |
device_interface_in_errors_packets_total | The number of inbound packets discarded at the interface of a device for errors | Y | Y | Y | Y | Y |
device_interface_in_fcs_errors_packets_total | The number of received packets on the interface of a device with erroneous frame check sequence (FCS) | Y | Y | Y | Y | Y |
device_interface_in_multicast_packets_total | The total number of multicast packets received on the interface of a device | Y | Y | Y | Y | Y |
device_interface_in_octets_total | The total number of octets received on the interface of a device | Y | Y | Y | Y | Y |
device_interface_in_packets_total | The total number of packets received on the interface of a device | Y | Y | Y | Y | Y |
device_interface_in_unicast_packets_total | The total number of unicast packets received on the interface of a device | Y | Y | Y | Y | Y |
device_interface_operational_status | The operational state of an interface of a device | Y | Y | Y | Y | Y |
device_interface_out_broadcast_packets_total | The total number of broadcast packets transmitted from the interface of a device | Y | Y | Y | Y | Y |
device_interface_out_discards_packets_total | The number of discarded outbound packets at the interface of a device even though no errors had been detected. | Y | Y | Y | Y | Y |
device_interface_out_errors_packets_total | The number of outbound packets that could not be transmitted from the interface of a device because of errors | Y | Y | Y | Y | Y |
device_interface_out_multicast_packets_total | The total number of multicast packets transmitted from the interface of a device | Y | Y | Y | Y | Y |
device_interface_out_octets_total | The total number of octets transmitted from the interface of a device | Y | Y | Y | Y | Y |
device_interface_out_packets_total | The total number of packets transmitted from the interface of a device | Y | Y | Y | Y | Y |
device_interface_out_unicast_packets_total | The total number of unicast packets transmitted from the interface of a device | Y | Y | Y | Y | Y |
device_memory_available_bytes | The current available memory at a device of the fabric | Y | Y | Y | Y | Y |
device_memory_total_bytes | The total memory of this device | Y | Y | N | Y | Y |
device_memory_utilized_bytes | The current memory utilization of a device of the fabric | Y | Y | Y | Y | Y |
device_mount_point_space_available_megabytes | The amount of unused space on the filesystem | Y | Y | N | Y | Y |
device_mount_point_space_usage_percentage | The used space in percentage | Y | Y | N | Y | Y |
device_mount_point_space_utilized_megabytes | The amount of space currently in use on the filesystem | Y | Y | N | Y | Y |
device_mount_point_total_space_megabytes | The total size of the initialized filesystem. | Y | Y | N | Y | Y |
device_psu_capacity_watts | The maximum power capacity of a power supply | N | N | Y | N | N |
device_psu_input_current_amps | The input current drawn by a power supply | Y | Y | Y | Y | Y |
device_psu_input_power_watts | The input power to a power supply | Y | Y | N | Y | Y |
device_psu_input_voltage_volts | The input voltage to a power supply | Y | Y | Y | Y | Y |
device_psu_oper_status | The current operational status of a power supply unit | Y | Y | Y | Y | Y |
device_psu_output_current_amps | The output current supplied by a power supply | N | Y | Y | N | N |
device_psu_output_power_watts | The output power of a power supply | N | Y | Y | N | N |
device_psu_output_voltage_volts | The output voltage supplied by a power supply | N | Y | Y | N | N |
device_thermal_oper_status | The current operational status of a thermal sensor | Y | Y | N | Y | Y |
device_thermal_temperature_celsius | The current temperature reported by a thermal sensor | Y | Y | Y | Y | Y |
* Ctrl = Controller, SWL = A switch running Switch Light, EOS = A switch running Arista EOS, SN = Service Node, RN = Recorder Node.
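As an illustration of how the counter metrics above are typically consumed (an example query, not a DMF requirement), the inbound bit rate of fabric interfaces can be derived from device_interface_in_octets_total with a PromQL rate expression; the Prometheus host name and port are placeholders:

# Average inbound bits per second per interface over the last 5 minutes.
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=rate(device_interface_in_octets_total[5m]) * 8'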
The DANZ Monitoring Fabric (DMF) Controller in Azure feature supports the operation of the Arista Networks DMF Controller on the Microsoft Azure platform and uses the Azure CLI or the Azure portal to launch the Virtual Machine (VM) running the DMF Controller.
The DMF Controller in Azure feature enables the registration of VM deployments in Azure and supports auto-firstboot using Azure userData or customData.
Configure Azure VMs auto-firstboot using customData or userData. There is no data merging from these sources, so provide the data via customData or userData, but not both.

Arista Networks recommends using customData as it provides a better security posture, because it is available only during VM provisioning and requires sudo access to mount the virtual CDROM.

userData is less secure because it is available via the Instance Metadata Service (IMDS) after provisioning and can be queried from the VM without any authorization restrictions.
If sshKey is configured for the admin account during Azure VM provisioning along with the auto-firstboot parameters, it is also configured for the admin user of the DMF Controllers.

The following table lists the details of the first boot parameters for the auto-firstboot configuration.
Firstboot Parameters - Required Parameters
Key | Description | Valid Values |
---|---|---|
admin_password | This is the password to set for the admin user. When joining an existing cluster, this must be the admin password of the existing cluster node. | string |
recovery_password | This is the password to set for the recovery user. | string |
Additional Parameters
Key | Description | Required | Valid Values | Default Value |
---|---|---|---|---|
hostname | This is the hostname to set for the appliance. | no | string | configured from Azure Instance Metadata Service |
cluster_name | This is the name to set for the cluster. | no | string | Azure-DMF-Cluster |
cluster_to_join | This is the IP which firstboot will use to join an existing cluster. Omitting this parameter implies that firstboot will create a new cluster. Note: If this parameter is present, ntp_servers, cluster_name, and cluster_description will be ignored. The existing cluster node will provide these values after joining. | no | IP Address String |
cluster_description | This is the description to set for the cluster. | no | string |
Networking Parameters
Key | Description | Required | Valid Values | Default Value |
---|---|---|---|---|
ip_stack | What IP protocols should be set up for the appliance management NIC. | no | enum: ipv4, ipv6, dual-stack | ipv4 |
ipv4_method | How to setup IPv4 for the appliance management NIC. | no | enum: auto, manual | auto |
ipv4_address | The static IPv4 address used for the appliance management NIC. | only if ipv4-method is set to manual | IPv4 Address String | |
ipv4_prefix_length | The prefix length for the IPv4 address subnet to use for the appliance management NIC. | only if ipv4-method is set to manual | 0..32 | |
ipv4_gateway | The static IPv4 gateway to use for the appliance management NIC. | no | IPv4 Address String | |
ipv6_method | How to set up IPv6 for the appliance management NIC. | no | enum: auto, manual | auto |
ipv6_address | The static IPv6 address to use for the appliance management NIC. | only if ipv6-method is set to manual | IPv6 Address String | |
ipv6_prefix_length | The prefix length for the IPv6 address subnet to use for the appliance management NIC. | only if ipv6-method is set to manual | 0..128 | |
ipv6_gateway | The static IPv6 gateway to use for the appliance management NIC. | no | IPv6 Address String | |
dns_servers | The DNS servers for the cluster to use | no | List of IP address strings | |
dns_search_domains | The DNS search domains for the cluster to use. | no | List of the host names or FQDN strings | |
ntp_servers | The NTP servers for the cluster to use. | no | List of the host names or FQDN strings |
The following example shows the minimum required parameters:
{
"admin_password": "admin_user_password",
"recovery_password": "recovery_user_password"
}
Full List of Parameters
{
"admin-password": "admin_user_password",
"recovery_password": "recovery_user_password",
"hostname": "hostname",
"cluster_name": "cluster name",
"cluster_description": "cluster description",
"ip_stack": "dual-stack",
"ipv4_method": "manual",
"ipv4_address": "10.0.0.3",
"ipv4_prefix-length": "24",
"ipv4_gateway": "10.0.0.1",
"ipv6_method": "manual",
"ipv6_address": "be:ee::1",
"ipv6_prefix-length": "64",
"ipv6_gateway": "be:ee::100",
"dns_servers": [
"10.0.0.101",
"10.0.0.102"
],
"dns_search_domains": [
"dns-search1.com",
"dns-search2.com"
],
"ntp_servers": [
"1.ntp.server.com",
"2.ntp.server.com"
]
}
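As a hedged sketch, the JSON above could be supplied as customData when launching the VM with the Azure CLI; the resource group, VM name, image reference, and file name below are placeholders, and a real deployment typically needs additional networking and sizing options:

# Create the Controller VM and pass the firstboot parameters file as custom data.
# All names below are placeholders; adjust to your deployment.
az vm create \
  --resource-group dmf-rg \
  --name dmf-controller-0-vm \
  --image <dmf-controller-image-id> \
  --custom-data ./firstboot.json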
The following limitations apply to the DANZ Monitoring Fabric (DMF) Controller in Microsoft Azure.
There is no support for any features specific to Azure-optimized Ubuntu Linux, including Accelerated Networking.
The DMF Controllers in Azure are only supported on Gen-1 VMs.
The DMF Controllers in Azure do not support adding the virtual IP address for the cluster.
There is no support for capture interfaces in Azure.
DMF ignores the Azure username and password fields.
There is no support for static IP address assignment that differs from what is configured on the Azure NIC.
The DMF Controllers are rebooted if the static IP on the NIC is updated.
Please refer to the following resources for more information.
Azure user data details: https://learn.microsoft.com/en-us/azure/virtual-machines/user-data
Azure custom data details: https://learn.microsoft.com/en-us/azure/virtual-machines/custom-data
Azure Gen1 vs Gen2 VMs: https://learn.microsoft.com/en-us/azure/virtual-machines/generation-2
Azure optimized Ubuntu Linux features: https://ubuntu.com/blog/microsoft-and-canonical-increase-velocity-with-azure-tailored-kernel
Azure NIC assignment behavior: https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/reset-network-interface-azure-linux-vm
Azure DMF Controller VMs can be accessed via an ssh login.

For the Controllers to be in a healthy state, systemctl should report the running state without any failed units, as shown in the following example:
dmf-controller-0-vm> debug bash
admin@dmf-controller-0-vm:~$ sudo systemctl status
dmf-controller-0-vm
State: running
Jobs: 0 queued
Failed: 0 units
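If the state is reported as degraded instead of running, the failed units can be listed with a standard systemd command (generic Linux tooling, not a DMF-specific command):

admin@dmf-controller-0-vm:~$ sudo systemctl --failed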
There are three possible failure modes:

The VM fails Azure registration.
auto-firstboot fails due to a transient error or a bug.
auto-firstboot parameter validation fails.

Debug these failures using the firstboot logs after manually booting the VM and logging in via ssh. Firstboot logs can be accessed using debug bash as shown below:
dmf-controller-0-vm> debug bash
admin@dmf-controller-0-vm:~$ less /var/log/floodlight/firstboot/firstboot.log
For auto-firstboot parameter validation failures, the validation results can be viewed in the same way:

dmf-controller-0-vm> debug bash
admin@dmf-controller-0-vm:~$ less /var/lib/floodlight/firstboot/validation-results.json
Removing the default IPv4 or IPv6 permit entry before adding a specific permit rule in the sync access list will permanently break communication between the active and standby Controllers. Follow the procedure outlined in this appendix to recover from a split Controller HA cluster.
Check the cluster status using the show controller details command in the CLI.
DMF-CTL2(config-controller-access-list)# show controller details
Cluster Name : DMF-7050
Cluster UID : a5de38214971de42aa7b51b96ac7345f4f228b20
Cluster Virtual IP : 10.240.130.18
Redundancy Status : redundant
Redundancy Description : Cluster is Redundant
Last Role Change Time : 2022-11-05 00:56:04.862000 UTC
Cluster Uptime : 2 months, 1 week
# IP            Hostname @ Node Id Domain Id State   Status    Uptime
-|-------------|--------|-|-------|---------|-------|---------|---------------|
1 192.168.39.44 DMF-CTL2 * 16181             active  connected 2 weeks, 2 days
2 192.168.55.11 DMF-CTL1   23955   1         standby connected 2 weeks, 2 days
~~~~~~~~~~~~~~~~~~~~~~~~~~~ Failover History ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# New Active Time completed                  Node  Reason                Description
-|----------|------------------------------|-----|---------------------|------------------|
1 22049      2022-11-05 00:55:35.994000 UTC 22049 cluster-config-change Changed connection state: cluster configuration changed
Procedure

Overwrite data to securely erase the data stored on the DMF appliance. This section describes how to do this using the Dell LifeCycle Controller. There is an erase API for the DMF Recorder Node, but it does not securely remove the data from its disk. Instead, it unlinks files in the Index and Packet partitions so the file system can reclaim the space for future packets and indices. This approach was a design decision because unlinking files is faster than overwriting data. To securely erase the data and prevent anyone from accessing it, use the following procedure.
This appendix describes how to install and configure a VM for the DANZ Monitoring Fabric Controller.
VMware vMotion is supported for Controller pairs. This support is offered only for vCenter 7.0.2.
The Controller VM image is provided in qcow2 format (.qcow2 file).

This appendix provides details on how to create a bootable USB with Switch Light OS.
Procedure
Procedure
To build a USB boot image using Windows, complete the following steps.
Procedure
This appendix provides information on how to uninstall the OS from a switch.
Procedure
To revert from SWLOS to EOS, complete the following steps:
Procedure
To revert from DMF (EOS) to UCN EOS, complete the following steps:

Some switch platforms may have a preexisting operating system (OS) installed. When installing Switch Light OS on top of an existing OS, there is a chance of failure. To avoid this issue, first uninstall any existing OS on the switch.
For example, to use a Dell switch with Force 10 OS (FTOS) pre-installed, remove FTOS before installing Switch Light OS. If FTOS is not removed first, the Switch Light OS installation may fail.
When you boot the switch, if only ONIE options are listed in the switch GNU GRUB boot menu, the switch does not have an existing OS installed. The following example shows the prompts that indicate no OS is installed on the switch.
GNU GRUB version 2.02~beta2+e4a1fe391
+----------------------------------------------------------------------------+
|*ONIE: Install OS |
| ONIE: Rescue |
| ONIE: Uninstall OS |
| ONIE: Update ONIE |
| ONIE: Embed ONIE |
| |
+----------------------------------------------------------------------------+
If the switch boot menu looks like this, skip this section and proceed directly to the following section to install Switch Light OS.
Procedure
To delete FTOS from a Dell switch, complete the following steps:
For detailed instructions on updating ONIE (Open Network Install Environment) on the Dell switches, please refer to the Firmware Upgrade for Dell Switches document at https://www.arista.com/assets/data/pdf/Dell-Switches-Firmware-Upgrade-Manual.pdf.
ONIE is a small operating system typically pre-installed as firmware on bare metal network switches. It provides an environment for automated provisioning.
For detailed instructions on upgrading the CPLD on the Dell switches, please refer to the Firmware Upgrade for Dell Switches document at https://www.arista.com/assets/data/pdf/Dell-Switches-Firmware-Upgrade-Manual.pdf.