Verizon Security Information & Event Management Software User Guide

: October 27, 2023
: Verizon

Table of Contents

Introduction
What is Event Management
Active Monitoring
Alarm Creation
Passive Monitoring
Meraki
From Event to Incident Ticket
Alarm List and Thresholds
Ticket Priority Definitions
Alarm Correlation

Event Management for Managed WAN/LAN

Introduction

The purpose of this presentation is to provide a high-level overview of the process by which an event triggers the creation of a proactive incident ticket.
It is a generic overview, so exceptions and custom arrangements are not covered.
Please refer to the appendix at the end of the presentation for an explanation of terms.

What is Event Management

Event Management Definition
An event can be defined as any detectable occurrence that has significance for the delivery of IT services. Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool.
Event Management by Verizon
Verizon uses SMARTS as the event monitoring tool for Managed WAN/LAN, together with M3 for Meraki devices. SMARTS uses two methods to detect service interruptions:

Active Monitoring: Pollers on SMARTS are configured to poll (SNMP & ICMP walk) managed devices every 3 minutes.
Passive Monitoring: Managed devices are configured to send an alert (SNMP trap) each time a specific fault occurs.

Active Monitoring

Figure 1

1. Polling: SMARTS is configured to poll the device (equipment) every 3 minutes.
2. Fault detection: When SMARTS does not receive an answer from the polled device, SMARTS marks the device as “DOWN”. SMARTS starts the second polling cycle (“smoothing” cycle) of 3 minutes and 35 seconds to confirm the device is unresponsive. The extended time has been implemented to wait for automatic network recovery.
3. Event sent to Automation (IMPACT): If SMARTS does not receive an answer from the second polling, SMARTS sends the fault alert to IMPACT.
4. IMPACT: Upon receiving an alert, IMPACT queries ESP (Managed Device Inventory Database) against the entity name to retrieve information such as: Circuit ID, Customer name, Product, Service desk, NOC, etc. This information is used to populate the alarm and to create the ticket within Verizon’s Enterprise Ticket Management System (ETMS).
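
The decision logic in the steps above can be sketched in a few lines of Python. This is an illustration only, not SMARTS or IMPACT code: the function names, the status labels, and the sample inventory record are all hypothetical, and the ESP lookup is stood in for by a plain dictionary.

```python
def two_cycle_check(first_poll_ok: bool, second_poll_ok: bool) -> str:
    """Outcome of the regular poll followed by the "smoothing" poll.

    Hypothetical helper; the returned labels are illustrative.
    """
    if first_poll_ok:
        return "UP"
    # First poll failed: the device is marked DOWN and the smoothing
    # cycle gives the network time to recover on its own.
    if second_poll_ok:
        return "RECOVERED"
    return "ALERT_TO_IMPACT"


def enrich_alert(entity_name: str, inventory: dict) -> dict:
    """Add inventory details to an alert, keyed by entity name.

    The dictionary is a stand-in for the ESP query described above.
    """
    record = inventory.get(entity_name, {})
    return {"entity": entity_name, **record}


# Made-up inventory record for illustration.
inventory = {"router-01": {"circuit_id": "C-0001", "customer": "Example Corp"}}

if two_cycle_check(False, False) == "ALERT_TO_IMPACT":
    alarm = enrich_alert("router-01", inventory)
```

An alert is only forwarded when both polls fail, which is why a device that answers during the smoothing cycle never produces an alarm.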

Alarm Creation

How long does it take to create an alarm?

The first polling cycle detects the fault in 3 minutes
The second “smoothing” cycle confirms the fault in 3 minutes and 35 seconds
IMPACT receives the alert and collects additional information to create the alarm in a few seconds

Total time: 3 min + 3 min 35 sec + a few seconds ≈ 7 minutes
Additional alarm criteria:

The alarm creation process (from the polling mechanism) can be interrupted at any time if the device starts answering back to the polling.
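
The total alarm creation time in this section can be checked with a short calculation. The two cycle lengths come from the text; the 10 seconds used for IMPACT processing is an assumed stand-in for "a few seconds".

```python
from datetime import timedelta

first_cycle = timedelta(minutes=3)                  # first polling cycle
smoothing_cycle = timedelta(minutes=3, seconds=35)  # second "smoothing" cycle
impact_processing = timedelta(seconds=10)           # assumption: "a few seconds"

total = first_cycle + smoothing_cycle + impact_processing
total_minutes = total.total_seconds() / 60

print(total)                 # 0:06:45
print(round(total_minutes))  # 7
```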

Passive Monitoring

Figure 2

1. SNMP Traps: Traps are sent by the devices to SMARTS each time specific events occur.
The following default traps are configured in the customer premises devices:

Interface up/down (a trap is sent each time the state of an interface changes)
Cold/Warm Start-Up (a trap is sent each time a device starts up, so SMARTS knows when a device reboots, e.g. after a manual reset or loss of power)

2. IMPACT: Upon the repetitive occurrence of specific traps (for example, if a device sends an interface up/down trap 4 times in 4 hours), IMPACT creates “unstable” alarms. Whether an Interface unstable alarm also automatically results in a proactive incident ticket depends on the management center.
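
The repetition rule described above (for example, 4 interface up/down traps in 4 hours) is a sliding-window count. The sketch below shows one way to express it. The class name and its defaults are hypothetical, it assumes traps arrive in timestamp order, and it does not represent IMPACT's actual implementation.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapDetector:
    """Flags a device as "unstable" when enough traps fall inside a window."""

    def __init__(self, threshold: int = 4, window: timedelta = timedelta(hours=4)):
        self.threshold = threshold
        self.window = window
        self._times = deque()

    def record_trap(self, ts: datetime) -> bool:
        """Record one trap; return True if the unstable threshold is reached."""
        self._times.append(ts)
        # Discard traps that are older than the window relative to this one.
        while ts - self._times[0] > self.window:
            self._times.popleft()
        return len(self._times) >= self.threshold


start = datetime(2023, 10, 27, 8, 0)
detector = FlapDetector()
results = [detector.record_trap(start + timedelta(hours=h)) for h in range(4)]
print(results)  # [False, False, False, True]
```

Traps spread further apart never accumulate: with one trap every 2 hours, at most three traps fall inside any 4-hour window, so no unstable alarm is raised.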

Meraki

Meraki devices are not managed by SMARTS but by M3, a monitoring system developed internally by Verizon. This system polls the Cloud Controller (i.e. Dashboard) at 3-minute intervals, captures availability and related data, and communicates alarm conditions to IMPACT.
The Meraki Cloud Controller polls the CPE every 5 minutes.
M3 can interact with the Meraki cloud in one of two ways: SNMP or a REST API. In the initial release of M3, the API was used for provisioning and SNMP was used for monitoring. That approach has been replaced with one that uses the API exclusively; for new activations, SNMP is not used.
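
Because two polling loops are chained here, a rough upper bound on how long a CPE fault can go unseen by M3 is the sum of the two intervals. The sketch below assumes the loops are not synchronized and that each poll reports status instantly. It covers polling delay only and says nothing about any confirmation or processing steps that follow; the function name is hypothetical.

```python
from datetime import timedelta

CLOUD_POLLS_CPE_EVERY = timedelta(minutes=5)  # from the text
M3_POLLS_CLOUD_EVERY = timedelta(minutes=3)   # from the text


def worst_case_polling_delay(cloud_interval: timedelta, m3_interval: timedelta) -> timedelta:
    """Upper bound: the fault occurs just after a cloud poll, and the cloud
    status updates just after an M3 poll."""
    return cloud_interval + m3_interval


delay = worst_case_polling_delay(CLOUD_POLLS_CPE_EVERY, M3_POLLS_CLOUD_EVERY)
print(delay)  # 0:08:00
```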

From Event to Incident Ticket

Figure 4

A ticket is created 9-13 minutes after the initial network event.
Automated troubleshooting commences immediately after the creation of the proactive incident ticket. This so-called ‘triage’ phase is published on the VEC Portal and via eBonding.
Automated triage enables faster resolution because the ticket is automatically transferred to the NOC if technicians need to run further diagnostics. NOC technicians can also use the triage output as diagnostic data.

Figure 5

Alarm List and Thresholds

Product	Incident Type	Priority	Description
MS WAN	BGP Service Down	1	The BGP Service and all of the BGP sessions

Notes:

Depending on management center and certain criteria these might be opened as Pri 2
The alarm clears after 10 minutes without linkup/linkdown events
The alarm clears after 24 hours without Coldstart/Warmstart events
The alarm clears after 4 hours without linkup/linkdown traps

Each type of product has a different set of alarms; the product group is shown in the first column.

Product	Incident Type	Priority	Description
MS WAN	BGP Protocol Endpoint Disabled	4	The interface is administratively

Notes:

Depending on management center and certain criteria these might be opened as Pri 2
The alarm clears after 10 minutes without linkup/linkdown events
The alarm clears after 24 hours without Coldstart/Warmstart events
The alarm clears after 4 hours without linkup/linkdown traps
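
The clearing rules in these notes all follow the same pattern: an alarm clears once a quiet period has passed with no further matching events. A minimal sketch of that check is shown below. The three durations come from the notes, but the dictionary keys are hypothetical labels, since the mapping from each note to a specific alarm row is not visible in this excerpt.

```python
from datetime import datetime, timedelta

# Quiet periods from the notes; the key names are illustrative only.
QUIET_PERIODS = {
    "linkup_linkdown_events": timedelta(minutes=10),
    "coldstart_warmstart_events": timedelta(hours=24),
    "linkup_linkdown_traps": timedelta(hours=4),
}


def alarm_cleared(last_event: datetime, now: datetime, quiet_period: timedelta) -> bool:
    """True once the quiet period has elapsed since the last matching event."""
    return now - last_event >= quiet_period


last_seen = datetime(2023, 10, 27, 8, 0)
after_one_hour = last_seen + timedelta(hours=1)

print(alarm_cleared(last_seen, after_one_hour, QUIET_PERIODS["linkup_linkdown_events"]))  # True
print(alarm_cleared(last_seen, after_one_hour, QUIET_PERIODS["linkup_linkdown_traps"]))   # False
```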

Meraki Alarm List

Product	Incident Type	Priority	Description
MS WLAN	Appliance Down	1	The cloud controlled appliance is unreachable from

Ticket Priority Definitions

Ticket Type	Priority	Description
Outage	1	Service is unusable, complete loss of service. The service is released for testing without restriction.
Degraded	2	Service is experiencing intermittent issues or is degraded and is not released for testing without restriction.
Service Risk	3	Quality issues that threaten the performance of the service.
Assistance Request	4	Non-service impacting issues requiring investigation, resolution or other action.

These are the standard ticket priority definitions used within Verizon.
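
For anyone consuming these priorities in scripts or reports, the four levels map naturally onto an ordered enumeration. The sketch below is illustrative: the class name and the sample tickets are hypothetical, and only the names and numbers come from the table.

```python
from enum import IntEnum


class TicketPriority(IntEnum):
    """Priority levels from the table; a lower number is more urgent."""
    OUTAGE = 1
    DEGRADED = 2
    SERVICE_RISK = 3
    ASSISTANCE_REQUEST = 4


# Made-up tickets, sorted so the most urgent comes first.
tickets = [
    ("T-3", TicketPriority.SERVICE_RISK),
    ("T-1", TicketPriority.OUTAGE),
    ("T-2", TicketPriority.DEGRADED),
]
ordered = [ticket_id for ticket_id, _ in sorted(tickets, key=lambda t: t[1])]
print(ordered)  # ['T-1', 'T-2', 'T-3']
```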

Alarm Correlation

When alarms are presented to IMPACT, a correlation key is applied based on shortname and location identifier. Alarms with the same key are added to the same event and ticket. This key remains active for 15 minutes for Hub locations or 2 hours for remote locations.

After the timer expires, a new alarm creates a new event and goes through the full wait time, backend queries, and so on. A pre-existing ticket check then moves the alarm to the earlier event/ticket if an open event/ticket is found for the same shortname and location identifier.
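
The keyed correlation described in this section can be sketched as a lookup table with an expiry time per key. This is a simplified illustration, not IMPACT's implementation: the class and method names are hypothetical, the timer is assumed to start at the first alarm for a key, and the pre-existing ticket check that runs after expiry is left out.

```python
from datetime import datetime, timedelta

# Key lifetimes from the text.
KEY_TTL = {"hub": timedelta(minutes=15), "remote": timedelta(hours=2)}


class Correlator:
    """Groups alarms that share a (shortname, location) key while it is active."""

    def __init__(self):
        self._active = {}   # key -> (event_id, expiry time)
        self._next_id = 1

    def assign(self, shortname: str, location_id: str, site_type: str, ts: datetime) -> int:
        """Return the event ID this alarm belongs to."""
        key = (shortname, location_id)
        entry = self._active.get(key)
        if entry is not None and ts <= entry[1]:
            return entry[0]          # key still active: same event and ticket
        event_id = self._next_id     # key missing or expired: new event
        self._next_id += 1
        self._active[key] = (event_id, ts + KEY_TTL[site_type])
        return event_id


t0 = datetime(2023, 10, 27, 8, 0)
correlator = Correlator()
hub_events = [
    correlator.assign("acme", "LOC1", "hub", t0),
    correlator.assign("acme", "LOC1", "hub", t0 + timedelta(minutes=10)),
    correlator.assign("acme", "LOC1", "hub", t0 + timedelta(minutes=20)),
]
print(hub_events)  # [1, 1, 2]
```

The third alarm arrives after the 15-minute hub timer has expired, so it opens a new event; for a remote location, the same 20-minute gap would still fall inside the 2-hour window.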

Appendix
API
Application Programming Interface, a software intermediary that allows two applications to talk to each other.
CPE
Customer Premises Equipment
ESP
This is the primary database for Managed Services Customers. All information pertaining to the management and monitoring of Managed Services devices/services is stored in ESP.
M3
Verizon internally developed monitoring system.
IMPACT
Integrated Management Platform for Advanced Communications Technologies is an application that provides surveillance, alarm topology augmentation, correlation, ticketing, and automation capabilities for the Verizon network.
NOC
Network Operations Center
SMARTS
Part of the EMC Service Assurance Suite and delivers critical management insights for applications and services. Responsible for sending alerts to IMPACT each time a fault is detected.