Juniper NETWORKS Telemetry In Junos for AI ML Workloads Software User Guide

August 20, 2024
JUNIPER NETWORKS

Telemetry in Junos for AI/ML Workloads
Author: Shalini Mukherjee

Introduction

As AI cluster traffic requires lossless networks with high throughput and low latency, a critical element of the AI network is the collection of monitoring data. Junos Telemetry enables granular monitoring of key performance indicators, including thresholds and counters for congestion management and traffic load balancing. gRPC sessions support the streaming of telemetry data. gRPC is a modern, open-source, high performance framework that is built on HTTP/2 transport. It empowers native bidirectional streaming capabilities and includes flexible custom-metadata in request headers. The initial step in telemetry is to know what data is to be collected. We can then analyze this data in various formats. Once we collect the data, it is important to present it in a format that is easy to monitor, make decisions and improve the service being offered. In this paper, we use a telemetry stack consisting of Telegraf, InfluxDB, and Grafana. This telemetry stack collects data using a push model. Traditional pull models are resource-intensive, require manual intervention, and could include information gaps in the data they collect. Push models overcome these limitations by delivering data asynchronously. They enrich the data by using user-friendly tags and names. Once the data is in a more readable format, we store it in a database and use it in an interactive visualization web application for analyzing the network. Figure. 1 shows us how this stack is designed for efficient data collection, storage, and visualization, from network devices pushing data to the collector to the data being displayed on dashboards for analysis.

TIG Stack

We used an Ubuntu server to install all the software including the TIG stack.

Telegraf
To collect data, we use Telegraf on an Ubuntu server running 22.04.2. The Telegraf version running in this demo is 1.28.5.
Telegraf is a plugin driven server agent for collecting and reporting metrics. It uses processor plugins to enrich and normalize the data. The output plugins are used to send this data to various data stores. In this document we use two plugins: one for openconfig sensors and the other for Juniper native sensors.
InfluxDB
To store the data in a time series database, we use InfluxDB. The output plugin in Telegraf sends the data to InfluxDB, which stores it in a highly efficient manner. We are using V1.8 as there is no CLI present for V2 and above.
Grafana
Grafana is used to visualize this data. Grafana pulls the data from InfluxDB and allows users to create rich and interactive dashboards. Here, we are running version 10.2.2.

Configuration On The Switch

To implement this stack, we first need to configure the switch as shown in Figure 2. We have used port 50051. Any port can be used here. Log in to the QFX switch and add the following configuration.

Note: This configuration is for labs/POCs as the password is transmitted in clear text. Use SSL to avoid this.

Environment

Nginx
This is needed if you are unable to expose the port on which Grafana is hosted. The next step is to install nginx on the Ubuntu server to serve as a reverse proxy agent. Once nginx is installed, add the lines shown in Figure 4 to the “default” file and move the file from /etc/nginx to /etc/nginx/sites- enabled.

Ensure that the firewall is adjusted to give full access to the nginx service as shown in Figure 5.

Once nginx is installed and the required changes are made, we should be able to access Grafana from a web browser by using the IP address of the Ubuntu server where all the software is installed.
There is a small glitch in Grafana that does not let you reset the default password. Use these steps if you run into this issue.
Steps to be performed on the Ubuntu server to set the password in Grafana:

  • Go to /var/lib/grafana/grafana.db

  • Install sqllite3
    o sudo apt install sqlite3

  • Run this command on your terminal
    o sqlite3 grafana.db

  •  Sqlite command prompt opens; run the following query:

    delete from user where login=’admin’

  • Restart grafana and type admin as username and password. It prompts for a new password.

Once all the software is installed, create the config file in Telegraf which will help pull the telemetry data from the switch and push it to InfluxDB.

Openconfig Sensor Plugin

On the Ubuntu server, edit the /etc/telegraf/telegraf.conf file to add all the required plugins and sensors. For the openconfig sensors, we use the gNMI plugin shown in Figure 6. For demo purposes, add the hostname as “spine1”, the port number “50051” that is used for gRPC, the username and password of the switch, and the number of seconds for redial in case of failure.
In the subscription stanza, add a unique name, “cpu” for this particular sensor, the sensor path, and the time interval for grabbing this data from the switch. Add the same plugin inputs.gnmi and inputs.gnmi.subscription for all the open config sensors. (Figure 6)

Native Sensor Plugin

This is a Juniper telemetry interface plugin used for native sensors. In the same telegraf.conf file, add the native sensor plugin inputs.jti_openconfig_telemetry where the fields are almost the same as openconfig. Use a unique client ID for every sensor; here, we use “telegraf3”. The unique name used here for this sensor is “mem” (Figure 7).

Lastly, add an output plugin outputs.influxdb to send this sensor data to InfluxDB. Here, the database is named “telegraf” with username as “influx” and password “influxdb” (Figure 8).

Once you’ve edited the telegraf.conf file, restart the telegraf service. Now, check in the InfluxDB CLI to make sure if measurements are created for all the unique sensors. Type “influx” to enter the InfluxDB CLI.

As seen in Figure. 9, enter the influxDB prompt and use the database “telegraf”. All the unique names given to the sensors are listed as measurements.
To see the output of any one measurement, just to make sure the telegraf file is correct and the sensor is working, use the command “select * from cpu limit 1” as shown in Figure 10.

Every time changes are made to the telegraf.conf file, make sure to stop InfluxDB, restart Telegraf, and then start InfluxDB.
Log on to Grafana from the browser and create dashboards after ensuring that the data is being collected correctly.
Go to Connections > InfuxDB > Add new data source.

  1. Give a name to this data source. In this demo it is “test-1”.

  2.  Under the HTTP stanza, use the Ubuntu server IP and 8086 port.

  3. In the InfluxDB details, use the same database name, “telegraf,” and provide the username and password of the Ubuntu server.

  4. Click Save & test. Ensure that you see the message, “successful”.

  5. Once the data source is successfully added, go to Dashboards and click New. Let us create a few dashboards that are essential for AI/ML workloads in editor mode.

Examples Of Sensor Graphs

The following are examples of some major counters that are essential for monitoring an AI/ML network.
Percentage utilization for an ingress interface et-0/0/0 on spine-1

  • Select the data source as test-1.
  • In the FROM section, select the measurement as “interface”. This is the unique name used for this sensor path.
  • In the WHERE section, select device::tag, and in the tag value, select the hostname of the switch, that is, spine1.
  • In the SELECT section, choose the sensor branch that you want to monitor; in this case choose “field(/interfaces/interface[if_name=’et-0/0/0’]/state/counters/if_in_1s_octets)”. Now in the same section, click on “+” and add this calculation math (/50000000000 * 100). We are basically calculating the percentage utilization of a 400G interface.
  • Make sure the FORMAT is “time-series,” and name the graph in the ALIAS section.

Peak buffer occupancy for any queue

  • Select the data source as test-1.
  • In the FROM section, select the measurement as “buffer.”
  • In the WHERE section, there are three fields to fill. Select device::tag, and in the tag value select the hostname of the switch (i.e. spine-1); AND select /cos/interfaces/interface/@name::tag and select the interface (i.e. et- 0/0/0); AND select the queue as well, /cos/interfaces/interface/queues/queue/@queue::tag and choose the queue number 4.
  • In the SELECT section, choose the sensor branch that you want to monitor; in this case choose “field(/cos/interfaces/interface/queues/queue/PeakBufferOccupancy).”
  • Make sure the FORMAT is “time-series” and name the graph in the ALIAS section.

You can collate data for multiple interfaces on the same graph as seen in Figure 17 for et-0/0/0, et-0/0/1, et-0/0/2 etc.

PFC and ECN mean derivative

For finding the mean derivative (the difference in value within a time range), use the raw query mode.
This is the influx query that we have used to find the mean derivative between two PFC values on et-0/0/0 of Spine-1 in a sec.
SELECT derivative(mean(“/interfaces/interface[if_name=’et-0/0/0′]/state/pfc- counter/tx_pkts”), 1s) FROM “interface” WHERE (“device”::tag = ‘Spine-1’) AND $timeFilter GROUP BY time($interval)

SELECT derivative(mean(“/interfaces/interface[if_name=’et-0/0/8′]/state/error- counters/ecn_ce_marked_pkts”), 1s) FROM “interface” WHERE (“device”::tag = ‘Spine-1’) AND $timeFilter GROUP BY time($interval)

Input resource errors mean derivative

The raw query for resource errors mean derivative is:
SELECT derivative(mean(“/interfaces/interface[if_name=’et-0/0/0′]/state/error- counters/if_in_resource_errors”), 1s) FROM “interface” WHERE (“device”::tag = ‘Spine-1’) AND $timeFilter GROUP BY time($interval)

Tail drops mean derivative

The raw query for tail drops mean derivative is:
SELECT derivative(mean(“/cos/interfaces/interface/queues/queue/tailDropBytes”), 1s) FROM “buffer” WHERE (“device”::tag = ‘Leaf-1’ AND “/cos/interfaces/interface/@name”::tag = ‘et-0/0/0’ AND “/cos/interfaces/interface/queues/queue/@queue”::tag = ‘4’) AND $timeFilter GROUP BY time($__interval) fill(null)
CPU utilization

  • Select the data source as test-1.
  • In the FROM section, select the measurement as “newcpu”
  • In the WHERE, there are three fields to fill. Select device::tag and in the tag value select the hostname of the switch (i.e. spine-1). AND in /components/component/properties/property/name:tag, and select cpuutilization-total AND in name::tag select RE0.
  • In the SELECT section, choose the sensor branch that you want to monitor. In this case, choose “field(state/value)”.

The raw query for finding the non-negative derivative of tail drops for multiple switches on multiple interfaces in bits/sec.
SELECT non_negative_derivative(mean(“/cos/interfaces/interface/queues/queue/tailDropBytes”), 1s)*8 FROM “buffer” WHERE (device::tag =~ /^Spine-[1-2]$/) and (“/cos/interfaces/interface/@name”::tag =~ /et-0\/0\/[0-9]/ or “/cos/interfaces/interface/@name”::tag=~/et-0\/0\/1[0-5]/) AND $timeFilter GROUP BY time($__interval),device::tag fill(null)

These were some of the examples of the graphs that can be created for monitoring an AI/ML network.

Summary

This paper illustrates the method of pulling telemetry data and visualizing it by creating graphs. This paper specifically talks about AI/ML sensors, both native and openconfig but the setup can be used for all kinds of sensors. We have also included solutions for multiple issues that you might face while creating the setup. The steps and outputs depicted in this paper are specific to the versions of the TIG stack mentioned earlier. It is subject to change depending on the version of the software, the sensors and the Junos version.

References

Juniper Yang Data Model Explorer for all sensor options
https://apps.juniper.net/ydm-explorer/
Openconfig forum for openconfig sensors
https://www.openconfig.net/projects/models/

Corporate and Sales Headquarters
Juniper Networks, Inc.
1133 Innovation Way
Sunnyvale, CA 94089 USA
Phone: 888. JUNIPER (888.586.4737)
or +1.408.745.2000
Fax: +1.408.745.2100
www.juniper.net
APAC and EMEA Headquarters
Juniper Networks International B.V.
Boeing Avenue 240
1119 PZ Schiphol-Rijk
Amsterdam, The Netherlands
Phone: +31.207.125.700
Fax: +31.207.125.701
Copyright 2023 Juniper Networks. Inc. Ail rights reserved. Juniper Networks, the Juniper Networks logo, Juniper, Junos, and other trademarks are registered trademarks of Juniper Networks. inc. and/or its affiliates in the United States and other countries. Other names may be trademarks of their respective owners. Juniper Networks assumes no responsibility for any inaccuracies in this document. Juniper Networks reserves the right to change. modify. transfer, or otherwise revise this publication without notice.
Send feedback to: design-center-comments@juniper.net V1.0/240807/ejm5-telemetry-junos-ai-ml

References

Read User Manual Online (PDF format)

Read User Manual Online (PDF format)  >>

Download This Manual (PDF format)

Download this manual  >>

JUNIPER NETWORKS User Manuals

Related Manuals