Upgrading CloudVision Portal (CVP)

Note: Before upgrading CVP, refer to the latest release notes and upgrade procedures available on the Arista Software Download page.

Devices under management must:

  • be running a supported EOS version
  • have a supported TerminAttr version installed
  • have the TerminAttr agent enabled and successfully streaming telemetry to CVP.

The following steps can be taken at any point on an existing cluster as part of preparing for an upgrade to the current version:

  1. Upgrade existing CVP clusters to the latest CVP release.
  2. Upgrade all EOS devices under management to the supported release train.
  3. For devices running EOS releases prior to 4.20, ensure that the eAPI unix domain socket is enabled with the following configuration:
    management api http-commands
     protocol unix-socket
  4. Install supported TerminAttr on all EOS devices under management.
  5. Enable state streaming from all EOS devices under management by applying the SYS_StreamingTelemetry configlet and pushing the required configuration to all devices.
  6. Ensure that all devices are successfully streaming to the CVP cluster (see the spot-check example below).
  7. Ensure that all devices are in image and config compliance.
  8. Complete regular backups. Complete a final backup prior to upgrade.
  9. Ensure that all tasks are in a terminal state (Success, Failed, or Canceled).
  10. Ensure that all Change Controls are in a terminal state.
    Note: After the cluster is upgraded to the latest CVP release, systems running unsupported TerminAttr versions fail to connect to the CVP cluster. These devices must first be upgraded to a supported TerminAttr version by re-onboarding them from the CloudVision UI. You cannot roll back a device to a point in time before it was running the supported TerminAttr version.

    The upgrade from a previous CVP release train to the current release train includes data migrations that can take several hours on larger-scale systems.
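
    The following EOS CLI commands can be used as a quick spot check on a device (see items 3 and 6 above). This is an illustrative sketch; the exact output fields vary by EOS and TerminAttr release:

      switch# show management api http-commands
      switch# show daemon TerminAttr

    The first command confirms that the unix-socket protocol is enabled and running; the second confirms that the TerminAttr agent is configured, enabled, and running.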

Upgrades

Upgrades do not require that the VMs be redeployed, and do not result in the loss of logs.

The CVP cluster must be functional and running to successfully complete an upgrade. As a precaution against the loss of CVP data, it is recommended that you back up the CVP data before performing an upgrade (see the example below). To upgrade CVP to the current release, you must first upgrade CVP to a release that supports an upgrade to the current release. For more information, refer to the CVP release notes on the Arista Software Download page.
Note: CentOS updates (yum update commands) outside of CVP upgrades are not supported.
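
A backup can be taken from the primary node before starting the upgrade. The sketch below assumes the standard cvpi backup workflow, with backup files written to /data/cvpbackup; verify the exact command and paths against the release notes for your version:

    [cvp@primary ~]$ cvpi backup cvp
    [cvp@primary ~]$ ls /data/cvpbackup/
    cvp.<timestamp>.tgz  cvp.eosimages.<timestamp>.tgz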

Verifying the Health of CVP before Performing Upgrades

Upgrades should only be performed on healthy and fully functional CVP systems. Before performing the upgrade, verify that the CVP system is healthy.

Complete the following steps to verify the health of CVP.

  1. Log in to the Linux shell of the primary node as the cvp user.
  2. Execute the cvpi status all command on your CVP:

    This shows the status of all CVP components.

  3. Confirm that all CVP components are running.
  4. Log into the CVP system to check functionality.

    Once you have verified the health of your CVP installation, you can begin the upgrade process.
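
    For example, on a healthy system the check from steps 2 and 3 looks like the following (component counts and totals vary by release and cluster size; this output is illustrative):

      [cvp@primary ~]$ cvpi status all
      Executing command. This may take some time...
      Completed 215/215 discovered actions
      primary    components total:112 running:104 disabled:8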

Upgrading from version 2018.1.2 (or later)

Use this procedure to perform a fast upgrade of CVP to the current version.

Prerequisites:

Before you begin the upgrade procedure, make sure that you have:
  • Verified the health of your CVP installation (see Verifying the Health of CVP before Performing Upgrades).
  • Verified that you are running version 2018.1.2 or later.

Complete the following steps to perform the upgrade.

  1. SSH as root into the primary node.
  2. Run these commands:
    1. rm -rf /tmp/upgrade (removes data from old upgrades, if present)
    2. mkdir /data/upgrade
    3. ln -s /data/upgrade /tmp/upgrade
    4. Use scp or wget to copy cvp-upgrade-<version>.tgz to the /data/upgrade directory.
  3. Run the su cvpadmin command to trigger the shell.
  4. Select the upgrade option from the shell.
    Note: On a multi-node cluster, upgrade can be performed only on the primary node. Upgrading to the current version may take up to 30 minutes.
    Note: If an issue occurs during an upgrade, you will be prompted to continue the upgrade once the issue is resolved.
    Note: Upgrading to 2021.1.0 and newer requires the configuration of a Kubernetes cluster network. You will be prompted during the upgrade to enter the private IP range for the Kubernetes cluster network. For this reason, a separate, unused network range should be provided when configuring CVP.

    Users will see this prompt while running the upgrade:

    This upgrade requires to configure kubernetes cluster network. 
    Please enter private ip range for kubernetes cluster network : 
    

    The cvpi env command shows the Kubernetes cluster-related parameters (see the example at the end of this section). KUBE_POD_NETWORK and KUBE_SERVICE_NETWORK are the two subnetworks derived from KUBE_CLUSTER_NETWORK. KUBE_CLUSTER_DNS is the second IP address from KUBE_SERVICE_NETWORK.

    Note: KUBE_CLUSTER_NETWORK is the Kubernetes private IP range and must not conflict with CVP node IPs, device interface IPs, cluster interface IPs, or switch IPs. In addition, do not use link-local addresses, the subnet reserved for loopback purposes, or any multicast IP addresses. The subnet length for KUBE_CLUSTER_NETWORK must be less than or equal to 20.
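
    After the upgrade completes, the resulting settings can be reviewed with cvpi env. The values below are illustrative placeholders only; the derived subnetworks depend on the range entered at the prompt, and the exact output format may differ between releases:

    [cvp@primary ~]$ cvpi env | grep KUBE
    KUBE_CLUSTER_NETWORK=10.42.0.0/16    <- private range entered at the prompt
    KUBE_POD_NETWORK=...                 <- derived from KUBE_CLUSTER_NETWORK
    KUBE_SERVICE_NETWORK=...             <- derived from KUBE_CLUSTER_NETWORK
    KUBE_CLUSTER_DNS=...                 <- second IP address of KUBE_SERVICE_NETWORK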

CVP Node RMA

Use this procedure to replace any node of a multi-node cluster. Replacing a node of a multi-node cluster involves removing the node you want to replace, waiting for the remaining cluster nodes to recover, powering on the replacement node, and applying the cluster configuration to the new node.

When you replace cluster nodes, you must replace only one node at a time. If you plan to replace more than one node of a cluster, you must complete the entire procedure for each node to be replaced.

When replacing a node, the CloudVision VM that comes with the new CVA might not be the same version as the one running on the other nodes. For more information on redeploying with the correct version, refer to: https://www.arista.com/en/qsg-cva-200cv-250cv/cva-200cv-250cv-redeploy-cvp-vm-tool

Check that the XML file is similar to the one on the other appliances. This can be checked using the virsh dumpxml cvp command.

Note: It is recommended that you save the CVP cluster configuration to a temporary file, or write down the configuration on a worksheet. The configuration can be found in /cvpi/cvp-config.yaml.
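
For example, the configuration can be copied off-box before powering down the node (the destination host and path shown here are arbitrary placeholders):

    [root@node1 ~]# scp /cvpi/cvp-config.yaml admin@<backup-host>:/tmp/cvp-config-node1.yaml
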
  1. Power off the node you want to replace (primary, secondary, or tertiary).
  2. Remove the node to be replaced.
  3. Allow all components of the remaining nodes to recover.

    The remaining nodes need to be up and settled before continuing to step 4.

  4. Use the cvpi status all command to ensure that the remaining nodes are healthy. Some services will be reported as “NOT RUNNING” because not all pods for those services are online. This is expected while a node is offline.
    [root@node2 ~]# cvpi status all

    Executing command. This may take some time...
    Completed 227/227 discovered actions

    secondary  components total:147 running:108 disabled:12 not running:27
    tertiary   components total:112 running:103 disabled:9
    primary    NODE DOWN

    Action Output
    -------------

    COMPONENT                 ACTION  NODE       STATUS       ERROR

    aaa                       status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    ambassador                status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    apiserver                 status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    audit                     status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    clickhouse                status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    cloudmanager              status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    coredns                   status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    device-interaction        status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    elasticsearch-recorder    status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    elasticsearch-server      status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    enroll                    status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    flannel                   status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    ingest                    status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    inventory                 status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    kafka                     status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    label                     status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    local-provider            status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    nginx-app                 status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    prometheus-node-exporter  status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    prometheus-server         status  secondary  NOT RUNNING  Only 0/1 pod(s) ready
    radius-provider           status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    script-executor           status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    script-executor-v2        status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    service-clover            status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    snapshot                  status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    tacacs-provider           status  secondary  NOT RUNNING  Only 2/3 pod(s) ready
    task                      status  secondary  NOT RUNNING  Only 2/3 pod(s) ready

    
  5. Power on the replacement node.
  6. Log in as cvpadmin.
  7. Enter the cvp cluster configuration.
    CentOS Linux 7 (Core)
    Kernel 3.10.0-957.1.3.el7.x86_64 on an x86_64
    
    localhost login: cvpadmin
    Last login: Fri Mar 15 12:24:45 on ttyS0
    Changing password for user root.
    New password:
    Retype new password:
    passwd: all authentication tokens updated successfully.
    Enter a command
    [q]uit [p]rint [s]inglenode [m]ultinode [r]eplace [u]pgrade
    >r
    Please enter minimum configuration to connect to the other peers
    *Ethernet interface for the cluster network: eth0
    *IP address of eth0: 172.31.0.216
    *Netmask of eth0: 255.255.0.0
    *Default route: 172.31.0.1
    *IP address of one of the two active cluster nodes: 172.31.0.161
     Root password of 172.31.0.161:
  8. Wait for the RMA process to complete. No action is required.
    Root password of 172.31.0.161: 
    External interfaces, ['eth1'], are discovered under /etc/sysconfig/network-scripts
    These interfaces are not managed by CVP.
    Please ensure that the configurations for these interfaces are correct.
    Otherwise, actions from the CVP shell may fail.
    Running : /bin/sudo /sbin/service network restart
    [334.001886] vmxnet3 0000:0b:00.0 eth0: intr type 3, mode 0, 9 vectors allocated
    [334.004577] vmxnet3 0000:0b:00.0 eth0: NIC Link is Up 10000 Mbps
    [334.006315] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
    [334.267535] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
    [348.252323] vmxnet3 0000:13:00.0 eth1: intr type 3, mode 0, 9 vectors allocated
    [348.254925] vmxnet3 0000:13:00.0 eth1: NIC Link is Up 10000 Mbps
    [348.256504] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
    [348.258035] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
    Fetching version information
    Run cmd: sudo -u cvp -- ssh 172.31.0.156 cat /cvpi/property/version.txt 0.18
    Fetching version information
    Run cmd: sudo -u cvp -- ssh 172.31.0.216 cat /cvpi/property/version.txt 10.19
    Fetching version information
    Run cmd: sudo -u cvp -- ssh 172.31.0.161 cat /cvpi/property/version.txt 0.16
    Running : cvpConfig.py tool...
    [392.941983] vmxnet3 0000:0b:00.0 eth0: intr type 3, mode 0, 9 vectors allocated
    [392.944739] vmxnet3 0000:0b:00.0 eth0: NIC Link is Up 10000 Mbps
    [392.946388] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
    [393.169460] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
    [407.229180] vmxnet3 0000:13:00.0 eth1: intr type 3, mode 0, 9 vectors allocated
    [407.232306] vmxnet3 0000:13:00.0 eth1: NIC Link is Up 10000 Mbps
    [407.233940] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
    [407.235728] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
    [408.447642] Ebtables v2.0 unregistered
    [408.935626] ip_tables: (C) 2000-2006 Netfilter Core Team
    [408.956578] ip6_tables: (C) 2000-2006 Netfilter Core Team
    [408.982927] Ebtables v2.0 registered
    [409.029603] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
    Stopping: ntpd
    Running : /bin/sudo /sbin/service ntpd stop
    Running : /bin/sudo /bin/systemctl is-active ntpd
    Starting: ntpd
    Running : /bin/sudo /bin/systemctl start ntpd.service
    Waiting for all components to start. This may take few minutes.
    Run cmd: su - cvp -c '/cvpi/bin/cvpi -v=3 status zookeeper' 0.45
    Run cmd: su - cvp -c '/cvpi/bin/cvpi -v=3 status zookeeper' 0.33
    Checking if third party applications exist
    Run cmd: su - cvp -c '/cvpi/zookeeper/bin/zkCli.sh ls /apps | tail -1' 0.72
    Running : cvpConfig.py tool...
    Stopping: cvpi-check
    Running : /bin/sudo /sbin/service cvpi-check stop
    Running : /bin/sudo /bin/systemctl is-active cvpi-check
    Starting: cvpi-check
    Running : /bin/sudo /bin/systemctl start cvpi-check.service
  9. Continue waiting for the RMA process to complete. No action is required.
    [Fri Mar 15 20:26:28 UTC 2019] :
    Executing command. This may take some time...
    
    (E) => Enabled
    (D) => Disabled
    (?) => Zookeeper Down
    
    Action Output
    -------------
    COMPONENT  ACTION   NODE      STATUS  ERROR
    hadoop     cluster  tertiary  (E)     DONE
    hbase      cluster  tertiary  (E)     DONE
    Executing command. This may take some time...

    (E) => Enabled
    (D) => Disabled
    (?) => Zookeeper Down

    Action Output
    -------------
    COMPONENT         ACTION  NODE       STATUS  ERROR
    aerisdiskmonitor  config  primary    (E)     DONE
    aerisdiskmonitor  config  secondary  (E)     DONE
    aerisdiskmonitor  config  tertiary   (E)     DONE
    apiserver         config  primary    (E)     DONE
    apiserver         config  secondary  (E)     DONE
    apiserver         config  tertiary   (E)     DONE
    cvp-backend       config  primary    (E)     DONE
    cvp-backend       config  secondary  (E)     DONE
    cvp-backend       config  tertiary   (E)     DONE
    cvp-frontend      config  primary    (E)     DONE
    cvp-frontend      config  secondary  (E)     DONE
    cvp-frontend      config  tertiary   (E)     DONE
    geiger            config  primary    (E)     DONE
    geiger            config  secondary  (E)     DONE
    geiger            config  tertiary   (E)     DONE
    hadoop            config  primary    (E)     DONE
    hadoop            config  secondary  (E)     DONE
    hadoop            config  tertiary   (E)     DONE
    hbase             config  primary    (E)     DONE
    hbase             config  secondary  (E)     DONE
    hbase             config  tertiary   (E)     DONE
    kafka             config  primary    (E)     DONE
    kafka             config  secondary  (E)     DONE
    kafka             config  tertiary   (E)     DONE
    zookeeper         config  primary    (E)     DONE
    zookeeper         config  secondary  (E)     DONE
    zookeeper         config  tertiary   (E)     DONE
    Executing command. This may take some time...
    secondary 89/89 components running
    primary 78/78 components running
    Executing command. This may take some time...
    COMPONENT  ACTION  NODE  STATUS  ERROR
    Including: /cvpi/tls/certs/cvp.crt
    Including: /cvpi/tls/certs/cvp.key
    Including: /etc/cvpi/cvpi.key
    Including: /cvpi/tls/certs/kube-cert.pem
    Including: /data/journalnode/mycluster/current/VERSION
    Including: /data/journalnode/mycluster/current/last-writer-epoch
    Including: /data/journalnode/mycluster/current/last-promised-epoch
    Including: /data/journalnode/mycluster/current/paxos
    Including: /cvpi/tls/certs/ca.crt
    Including: /cvpi/tls/certs/ca.key
    Including: /cvpi/tls/certs/server.crt
    Including: /cvpi/tls/certs/server.key
    mkdir -p /cvpi/tls/certs
    mkdir -p /data/journalnode/mycluster/current
    mkdir -p /cvpi/tls/certs
    mkdir -p /etc/cvpi
    mkdir -p /cvpi/tls/certs
    mkdir -p /cvpi/tls/certs
    mkdir -p /cvpi/tls/certs
    mkdir -p /data/journalnode/mycluster/current
    mkdir -p /cvpi/tls/certs
    mkdir -p /data/journalnode/mycluster/current
    mkdir -p /data/journalnode/mycluster/current
    mkdir -p /cvpi/tls/certs
    Copying: /etc/cvpi/cvpi.key from secondary
    rsync -rtvp 172.31.0.161:/etc/cvpi/cvpi.key /etc/cvpi
    Copying: /cvpi/tls/certs/cvp.crt from secondary
    rsync -rtvp 172.31.0.161:/cvpi/tls/certs/cvp.crt /cvpi/tls/certs
    Copying: /cvpi/tls/certs/server.key from secondary
    rsync -rtvp 172.31.0.161:/cvpi/tls/certs/server.key /cvpi/tls/certs
    Copying: /cvpi/tls/certs/ca.crt from secondary
    rsync -rtvp 172.31.0.161:/cvpi/tls/certs/ca.crt /cvpi/tls/certs
    Copying: /cvpi/tls/certs/cvp.key from secondary
    rsync -rtvp 172.31.0.161:/cvpi/tls/certs/cvp.key /cvpi/tls/certs
    Copying: /cvpi/tls/certs/ca.key from secondary
    rsync -rtvp 172.31.0.161:/cvpi/tls/certs/ca.key /cvpi/tls/certs
    Copying: /data/journalnode/mycluster/current/last-writer-epoch from secondary
    rsync -rtvp 172.31.0.161:/data/journalnode/mycluster/current/last-writer-epoch /data/journalnode/mycluster/current
    Copying: /cvpi/tls/certs/kube-cert.pem from secondary
    Copying: /cvpi/tls/certs/server.crt from secondary
    rsync -rtvp 172.31.0.161:/cvpi/tls/certs/server.crt /cvpi/tls/certs
    Copying: /data/journalnode/mycluster/current/VERSION from secondary
    rsync -rtvp 172.31.0.161:/data/journalnode/mycluster/current/VERSION /data/journalnode/mycluster/current
    Copying: /data/journalnode/mycluster/current/paxos from secondary
    rsync -rtvp 172.31.0.161:/data/journalnode/mycluster/current/paxos /data/journalnode/mycluster/current
    Copying: /data/journalnode/mycluster/current/last-promised-epoch from secondary
    rsync -rtvp 172.31.0.161:/data/journalnode/mycluster/current/last-promised-epoch /data/journalnode/mycluster/current
    rsync -rtvp 172.31.0.161:/cvpi/tls/certs/kube-cert.pem /cvpi/tls/certs
    Starting: cvpi-config
    Running : /bin/sudo /bin/systemctl start cvpi-config.service
    Starting: cvpi
    Running : /bin/sudo /bin/systemctl start cvpi.service
    Running : /bin/sudo /bin/systemctl start cvpi-watchdog.timer
    Running : /bin/sudo /bin/systemctl enable docker
    Running : /bin/sudo /bin/systemctl start docker
    Running : /bin/sudo /bin/systemctl enable kube-cluster.path
  10. Enter "q" to quit the process after the "RMA process is complete!" message is displayed.
    Waiting for all components to start. This may take few minutes.
    [560.918749] FS-Cache: Loaded
    [560.978183] FS-Cache: Netfs 'nfs' registered for caching
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 48.20
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.73
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 7.77
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.55
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.23
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.64
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.59
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.07
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.70
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.51
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.57
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.40
    Run cmd: su - cvp -c '/cvpi/bin/cvpi status all --cluster' 2.24
    Waiting for all components to start. This may take few minutes.
    Run cmd: su - cvp -c '/cvpi/bin/cvpi -v=3 status all' 9.68
    RMA process is complete!
    [q]uit [p]rint [e]dit [v]erify [s]ave [a]pply [h]elp ve[r]bose
    >q
  11. Use the cvpi status all command to ensure that the cluster is healthy.
    [cvp@cvp87 ~]$ cvpi status all
    
    
    Executing command. This may take some time...
    Completed 215/215 discovered actions
    primary 	components total:112 running:104 disabled:8
    secondary 	components total:122 running:114 disabled:8
    tertiary 	components total:97 running:91 disabled:6

    When a node is RMA'd, the other nodes replicate their state via HDFS to the new node. You can track this in real time by issuing the following command:

    watch -n 30 "hdfs dfsadmin -report | grep 'Under replicated'"

    Once the count of "Under replicated" blocks hits 0, data synchronization to the new node is complete.

    The disk usage on the new node will also grow as the blocks are replicated, and once the operation has finished successfully the replaced node will have disk space utilization similar to that of the other nodes.
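
    Disk usage can be compared across the nodes while replication progresses; a minimal sketch, assuming passwordless SSH between cluster nodes and using node1, node2, and node3 as placeholder hostnames:

    [cvp@node1 ~]$ for node in node1 node2 node3; do ssh $node df -h /data; done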

CVP / EOS Dependencies

To ensure that CVP can provide a base level of management, all EOS devices must be running EOS version 4.17.3F or later. To ensure device compatibility, seek advice on supported EOS versions from your Arista account team.

CVP should not require any additional EOS upgrades to support the standard features and functions in later versions of the appliance. Newer features and enhancements to CVP may not be available for devices on older code versions.

Refer to the latest Release Notes for additional upgrade/downgrade guidance.


Upgrade CV-CUE As Part of a CV Upgrade

During a CV upgrade, services go through the following steps:

  1. Services or service containers (such as CV-CUE) are stopped.
  2. Existing container images are deleted.
  3. New component RPMs are installed.
  4. The server is rebooted and all services are started again.

    A service on CV is upgraded only if its version differs from the pre-upgrade version (CV stores its pre-upgrade state to determine this). The wifimanager component follows a similar process. When CV boots up after an upgrade, wifimanager starts and upgrades only if the CV upgrade has resulted in a new wifimanager version. The following actions precede every wifimanager start operation:

    1. load: Loads the wifimanager container image into docker when CV boots up for the first time after an upgrade.
    2. init: Initializes wifimanager before it starts. The wifimanager init is versioned, for example init-8.8.0-01. The init-<version> handler initiates a wifimanager upgrade if needed. Thus, if the wifimanager version has not changed after the CV upgrade, the wifimanager upgrade is not invoked; if the wifimanager version has changed, a wifimanager upgrade is performed before wifimanager starts.
    Note: Load and init are internal actions to the wifimanager start operation; they are not run separately. The CV-CUE service might take longer to start than other CV services.