Juniper NETWORKS Telemetry In Junos for AI ML Workloads Software User Guide

: August 20, 2024
: JUNIPER NETWORKS

Table of Contents

Introduction
TIG Stack
Configuration On The Switch
Environment
Openconfig Sensor Plugin
Native Sensor Plugin
Examples Of Sensor Graphs
Summary
References

Telemetry in Junos for AI/ML Workloads
Author: Shalini Mukherjee

Introduction

AI cluster traffic requires lossless networks with high throughput and low latency, so the collection of monitoring data is a critical element of the AI network. Junos Telemetry enables granular monitoring of key performance indicators, including thresholds and counters for congestion management and traffic load balancing. Telemetry data is streamed over gRPC sessions. gRPC is a modern, open-source, high-performance framework built on HTTP/2 transport. It supports native bidirectional streaming and allows flexible custom metadata in request headers.

The first step in telemetry is to decide what data to collect. We can then analyze this data in various formats. Once the data is collected, it must be presented in a format that makes it easy to monitor the network, make decisions, and improve the service being offered. In this paper, we use a telemetry stack consisting of Telegraf, InfluxDB, and Grafana.

This telemetry stack collects data using a push model. Traditional pull models are resource-intensive, require manual intervention, and can leave information gaps in the data they collect. Push models overcome these limitations by delivering data asynchronously. They enrich the data by using user-friendly tags and names. Once the data is in a more readable format, we store it in a database and use it in an interactive visualization web application for analyzing the network. Figure 1 shows how this stack is designed for efficient data collection, storage, and visualization, from network devices pushing data to the collector to the data being displayed on dashboards for analysis.

TIG Stack

We used an Ubuntu server to install all the software including the TIG stack.

Telegraf
To collect data, we use Telegraf on an Ubuntu server running version 22.04.2. The Telegraf version running in this demo is 1.28.5.
Telegraf is a plugin-driven server agent for collecting and reporting metrics. It uses processor plugins to enrich and normalize the data, and output plugins to send this data to various data stores. In this document we use two input plugins: one for OpenConfig sensors and the other for Juniper native sensors.
InfluxDB
To store the data in a time series database, we use InfluxDB. The output plugin in Telegraf sends the data to InfluxDB, which stores it efficiently. We use version 1.8 because the CLI used in this paper is not present in version 2 and above.
Grafana
Grafana is used to visualize this data. Grafana pulls the data from InfluxDB and allows users to create rich, interactive dashboards. Here, we are running version 10.2.2.

Configuration On The Switch

To implement this stack, we first need to configure the switch as shown in Figure 2. We have used port 50051, but any port can be used here. Log in to the QFX switch and add the following configuration.

Note: This configuration is intended for labs and POCs because the password is transmitted in clear text. Use SSL to avoid this.

Environment

Nginx
This is needed if you are unable to expose the port on which Grafana is hosted. Install nginx on the Ubuntu server to serve as a reverse proxy. Once nginx is installed, add the lines shown in Figure 4 to the “default” file and move the file from /etc/nginx to /etc/nginx/sites-enabled.

Ensure that the firewall is adjusted to give full access to the nginx service as shown in Figure 5.

Once nginx is installed and the required changes are made, we should be able to access Grafana from a web browser by using the IP address of the Ubuntu server where all the software is installed.
There is a small glitch in Grafana that does not let you reset the default password. Use these steps if you run into this issue.
Steps to be performed on the Ubuntu server to set the password in Grafana:

Go to the directory /var/lib/grafana, which contains the grafana.db file.
Install sqlite3:
    sudo apt install sqlite3
Run this command on your terminal:
    sqlite3 grafana.db
The sqlite command prompt opens; run the following query:
    delete from user where login='admin';
Restart Grafana and log in with admin as both the username and the password. Grafana then prompts for a new password.

Once all the software is installed, create the Telegraf config file, which collects the telemetry data from the switch and pushes it to InfluxDB.

Openconfig Sensor Plugin

On the Ubuntu server, edit the /etc/telegraf/telegraf.conf file to add all the required plugins and sensors. For the OpenConfig sensors, we use the gNMI plugin shown in Figure 6. For demo purposes, add the hostname as “spine1”, the port number “50051” that is used for gRPC, the username and password of the switch, and the number of seconds for redial in case of failure.
In the subscription stanza, add a unique name (“cpu” for this particular sensor), the sensor path, and the time interval for collecting this data from the switch. Add the same plugin stanzas, inputs.gnmi and inputs.gnmi.subscription, for all the OpenConfig sensors.

Native Sensor Plugin

This is a Juniper telemetry interface plugin used for native sensors. In the same telegraf.conf file, add the native sensor plugin inputs.jti_openconfig_telemetry, where the fields are almost the same as for OpenConfig. Use a unique client ID for every sensor; here, we use “telegraf3”. The unique name used here for this sensor is “mem” (Figure 7).

Lastly, add an output plugin, outputs.influxdb, to send this sensor data to InfluxDB. Here, the database is named “telegraf”, with the username “influx” and the password “influxdb” (Figure 8).

Once you’ve edited the telegraf.conf file, restart the telegraf service. Then check in the InfluxDB CLI that measurements have been created for all the unique sensors. Type “influx” to enter the InfluxDB CLI.

As seen in Figure 9, enter the InfluxDB prompt and use the database “telegraf”. All the unique names given to the sensors are listed as measurements.
To confirm that the telegraf.conf file is correct and a sensor is working, view the output of any one measurement with the command “select * from cpu limit 1”, as shown in Figure 10.
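
The same check can also be scripted against the HTTP query endpoint of InfluxDB 1.x rather than typed into the CLI. The following is only a minimal sketch and was not part of the original setup: the host name "ubuntu-server" and the helper build_query_url are hypothetical, and it assumes InfluxDB is listening on port 8086 with the database “telegraf”.

```python
from urllib.parse import urlencode


def build_query_url(host: str, database: str, query: str, port: int = 8086) -> str:
    """Build a URL for the InfluxDB 1.x /query endpoint (helper name is hypothetical)."""
    # The endpoint takes the database name as "db" and the InfluxQL text as "q".
    params = urlencode({"db": database, "q": query})
    return f"http://{host}:{port}/query?{params}"


# "ubuntu-server" is a placeholder for the address of the server running InfluxDB.
url = build_query_url("ubuntu-server", "telegraf", "select * from cpu limit 1")
print(url)
```

Fetching this URL returns the query result as JSON. If authentication is enabled on InfluxDB, credentials must be supplied with the request as well.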

Every time changes are made to the telegraf.conf file, make sure to stop InfluxDB, restart Telegraf, and then start InfluxDB.
Log on to Grafana from the browser and create dashboards after ensuring that the data is being collected correctly.
Go to Connections > InfluxDB > Add new data source.

Give a name to this data source. In this demo it is “test-1”.
Under the HTTP stanza, use the Ubuntu server IP and 8086 port.
In the InfluxDB details, use the same database name, “telegraf”, and provide the username and password of the Ubuntu server.
Click Save & test. Ensure that you see the message, “successful”.
Once the data source is successfully added, go to Dashboards and click New. Let us create a few dashboards that are essential for AI/ML workloads in editor mode.

Examples Of Sensor Graphs

The following are examples of some major counters that are essential for monitoring an AI/ML network.
Percentage utilization for an ingress interface et-0/0/0 on spine-1

Select the data source as test-1.
In the FROM section, select the measurement as “interface”. This is the unique name used for this sensor path.
In the WHERE section, select device::tag, and in the tag value, select the hostname of the switch, that is, spine1.
In the SELECT section, choose the sensor branch that you want to monitor; in this case, choose “field(/interfaces/interface[if_name='et-0/0/0']/state/counters/if_in_1s_octets)”. In the same section, click “+” and add the math calculation (/50000000000 * 100). This gives the percentage utilization of a 400G interface, because 400 Gbps corresponds to 50,000,000,000 bytes per second.
Make sure the FORMAT is “time-series,” and name the graph in the ALIAS section.
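
The math step above can be checked by hand. The sketch below reproduces the same arithmetic in Python; the function name and the sample value are illustrative only, and it assumes, as the counter name suggests, that if_in_1s_octets reports the octets received in one second.

```python
LINK_SPEED_BITS = 400 * 10**9            # 400G interface, in bits per second
LINK_SPEED_BYTES = LINK_SPEED_BITS // 8  # 50,000,000,000 bytes per second


def utilization_percent(octets_per_second: float,
                        link_bytes_per_second: int = LINK_SPEED_BYTES) -> float:
    """Same calculation as the Grafana math step: value / 50000000000 * 100."""
    return octets_per_second / link_bytes_per_second * 100


# Illustrative sample: 12.5 GB received in one second on a 400G link.
print(utilization_percent(12_500_000_000))  # 25.0
```

For an interface of a different speed, the divisor changes accordingly; for example, a 100G interface would use 12,500,000,000 bytes per second.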

Peak buffer occupancy for any queue

Select the data source as test-1.
In the FROM section, select the measurement as “buffer”.
In the WHERE section, there are three fields to fill. Select device::tag, and in the tag value select the hostname of the switch (i.e. spine-1); AND select /cos/interfaces/interface/@name::tag and select the interface (i.e. et-0/0/0); AND select the queue as well, /cos/interfaces/interface/queues/queue/@queue::tag, and choose the queue number 4.
In the SELECT section, choose the sensor branch that you want to monitor; in this case, choose “field(/cos/interfaces/interface/queues/queue/PeakBufferOccupancy)”.
Make sure the FORMAT is “time-series” and name the graph in the ALIAS section.

You can collate data for multiple interfaces on the same graph as seen in Figure 17 for et-0/0/0, et-0/0/1, et-0/0/2 etc.

PFC and ECN mean derivative

To find the mean derivative (the change in value per unit of time), use the raw query mode.
The following Influx query finds the per-second mean derivative of the PFC counter on et-0/0/0 of Spine-1:
SELECT derivative(mean("/interfaces/interface[if_name='et-0/0/0']/state/pfc-counter/tx_pkts"), 1s) FROM "interface" WHERE ("device"::tag = 'Spine-1') AND $timeFilter GROUP BY time($interval)

The equivalent query for ECN-marked packets on et-0/0/8 of Spine-1 is:
SELECT derivative(mean("/interfaces/interface[if_name='et-0/0/8']/state/error-counters/ecn_ce_marked_pkts"), 1s) FROM "interface" WHERE ("device"::tag = 'Spine-1') AND $timeFilter GROUP BY time($interval)
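
To make the derivative step concrete, the sketch below shows in Python roughly what derivative(..., 1s) does with the per-interval mean values: it divides the change between consecutive values by the elapsed time in seconds. The function name and the sample counter values are invented for illustration and do not come from a real switch.

```python
def derivative_per_second(points):
    """Rate of change per second between consecutive (timestamp_s, value) points."""
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        # Change in value divided by elapsed seconds, reported at the later timestamp.
        rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates


# Illustrative per-interval means of a cumulative packet counter, 10 seconds apart.
samples = [(0, 100.0), (10, 150.0), (20, 150.0), (30, 400.0)]
print(derivative_per_second(samples))  # [(10, 5.0), (20, 0.0), (30, 25.0)]
```

Because the counters are cumulative, the derivative turns an ever-growing total into a packets-per-second rate, which is easier to read on a graph.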

Input resource errors mean derivative

The raw query for the mean derivative of input resource errors is:
SELECT derivative(mean("/interfaces/interface[if_name='et-0/0/0']/state/error-counters/if_in_resource_errors"), 1s) FROM "interface" WHERE ("device"::tag = 'Spine-1') AND $timeFilter GROUP BY time($interval)

Tail drops mean derivative

The raw query for the mean derivative of tail drops is:
SELECT derivative(mean("/cos/interfaces/interface/queues/queue/tailDropBytes"), 1s) FROM "buffer" WHERE ("device"::tag = 'Leaf-1' AND "/cos/interfaces/interface/@name"::tag = 'et-0/0/0' AND "/cos/interfaces/interface/queues/queue/@queue"::tag = '4') AND $timeFilter GROUP BY time($__interval) fill(null)

CPU utilization

Select the data source as test-1.
In the FROM section, select the measurement as “newcpu”.
In the WHERE section, there are three fields to fill. Select device::tag, and in the tag value select the hostname of the switch (i.e. spine-1); AND select /components/component/properties/property/name::tag and select cpu-utilization-total; AND select name::tag and select RE0.
In the SELECT section, choose the sensor branch that you want to monitor. In this case, choose “field(state/value)”.

The following raw query finds the non-negative derivative of tail drops, in bits per second, for multiple switches and multiple interfaces:
SELECT non_negative_derivative(mean("/cos/interfaces/interface/queues/queue/tailDropBytes"), 1s)*8 FROM "buffer" WHERE (device::tag =~ /^Spine-[1-2]$/) and ("/cos/interfaces/interface/@name"::tag =~ /et-0\/0\/[0-9]/ or "/cos/interfaces/interface/@name"::tag =~ /et-0\/0\/1[0-5]/) AND $timeFilter GROUP BY time($__interval),device::tag fill(null)
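
The sketch below illustrates, under simplifying assumptions, what this query computes for one series: the per-second rate of change of the tailDropBytes counter, with negative rates (for example after a counter reset) left out, and the result multiplied by 8 to convert bytes to bits. The function name and sample values are made up for illustration.

```python
def non_negative_rate_bits(points):
    """Per-second rate between consecutive (timestamp_s, bytes) points, in bits/sec.

    Negative rates are skipped, mirroring non_negative_derivative.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        rate = (v1 - v0) / (t1 - t0)
        if rate >= 0:
            rates.append((t1, rate * 8))  # bytes per second -> bits per second
    return rates


# Illustrative byte counter sampled every 10 seconds; it drops at t=20 (a reset).
samples = [(0, 1000.0), (10, 2000.0), (20, 500.0), (30, 1500.0)]
print(non_negative_rate_bits(samples))  # [(10, 800.0), (30, 800.0)]
```

Skipping the negative value avoids a large downward spike on the graph when a counter restarts from a lower value.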

These are some examples of the graphs that can be created for monitoring an AI/ML network.

Summary

This paper illustrates a method of collecting telemetry data and visualizing it by creating graphs. It focuses on AI/ML sensors, both native and OpenConfig, but the setup can be used for all kinds of sensors. We have also included solutions for several issues that you might face while creating the setup. The steps and outputs depicted in this paper are specific to the versions of the TIG stack mentioned earlier. They are subject to change depending on the version of the software, the sensors, and the Junos version.

References

Juniper Yang Data Model Explorer for all sensor options
https://apps.juniper.net/ydm-explorer/
OpenConfig forum for OpenConfig sensors
https://www.openconfig.net/projects/models/

Copyright 2023 Juniper Networks, Inc. All rights reserved. Juniper Networks, the Juniper Networks logo, Juniper, Junos, and other trademarks are registered trademarks of Juniper Networks, Inc. and/or its affiliates in the United States and other countries. Other names may be trademarks of their respective owners. Juniper Networks assumes no responsibility for any inaccuracies in this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.
V1.0/240807/ejm5-telemetry-junos-ai-ml