Verizon Security Information & Event Management Software User Guide

October 27, 2023
Verizon

Event Management for Managed WAN/LAN

Introduction

The purpose of this presentation is to provide a high-level overview of the process by which an event triggers the creation of a proactive incident ticket.
It is a generic overview; exceptions and custom arrangements are not covered.
Please refer to the appendix at the end of the presentation for an explanation of terms.

What is Event Management?

Event Management Definition
An event can be defined as any detectable occurrence that has significance for the delivery of IT services. Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool.
Event Management by Verizon
Verizon uses SMARTS as the event monitoring tool for Managed WAN/LAN, together with M3 for Meraki devices. SMARTS uses two methods to detect service interruptions:

  1. Active Monitoring: Pollers on SMARTS are configured to poll (SNMP & ICMP walk) managed devices every 3 minutes.
  2. Passive Monitoring: Managed devices are configured to send an alert (SNMP trap) each time a specific fault occurs.

Active Monitoring

[Figure 1]

  1. Polling: SMARTS is configured to poll the device (equipment) every 3 minutes.
  2. Fault detection: When SMARTS does not receive an answer from the polled device, it marks the device as “DOWN” and starts a second polling cycle (the “smoothing” cycle) of 3 minutes and 35 seconds to confirm that the device is unresponsive. The extended time allows for automatic network recovery.
  3. Event sent to automation (IMPACT): If SMARTS does not receive an answer to the second poll, it sends the fault alert to IMPACT.
  4. IMPACT: Upon receiving an alert, IMPACT queries ESP (Managed Device Inventory Database) against the entity name to retrieve information such as Circuit ID, Customer name, Product, Service desk, NOC, etc. This information is used to populate the alarm and to create the ticket within Verizon’s Enterprise Ticket Management System (ETMS).
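
The polling and smoothing steps above can be pictured as a small per-device state machine. The following is a minimal sketch only, not the SMARTS implementation: the state names "UP", "DOWN" and "ALERT", the PollResult type and the next_state function are hypothetical names chosen for illustration. Only the two intervals come from this guide.

```python
from dataclasses import dataclass

POLL_INTERVAL_S = 180        # regular polling cycle: 3 minutes
SMOOTHING_INTERVAL_S = 215   # smoothing cycle: 3 minutes 35 seconds


@dataclass
class PollResult:
    state: str            # "UP", "DOWN" (awaiting confirmation) or "ALERT"
    next_poll_in_s: int   # delay before the next poll of this device


def next_state(current: str, answered: bool) -> PollResult:
    """Advance the (hypothetical) per-device state after one poll."""
    if answered:
        # Any answer cancels a pending fault and returns to normal polling.
        return PollResult("UP", POLL_INTERVAL_S)
    if current == "UP":
        # First missed poll: mark the device DOWN and start the smoothing cycle.
        return PollResult("DOWN", SMOOTHING_INTERVAL_S)
    # Still no answer after the smoothing cycle: the alert goes to IMPACT.
    return PollResult("ALERT", POLL_INTERVAL_S)
```

In this model a device that answers during the smoothing cycle goes straight back to "UP", so no alert is raised for a short interruption.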

Alarm Creation

How long does it take to create an alarm?

  • The first polling cycle detects the fault in 3 minutes
  • The second “smoothing” cycle confirms the fault in 3 minutes and 35 seconds
  • IMPACT receives the alert and collects additional information to create the alarm in a few seconds

Total time: 3 min + 3 min 35 sec + a few seconds ≈ 7 minutes
Additional alarm criteria:

  • The alarm creation process (from the polling mechanism) can be interrupted at any time if the device starts answering back to the polling.
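
The timing above can be checked with a few lines of arithmetic. This is a small sketch; the 10-second figure for IMPACT processing is an assumption standing in for "a few seconds" and is not stated in this guide.

```python
from datetime import timedelta

first_poll = timedelta(minutes=3)                # first polling cycle
smoothing = timedelta(minutes=3, seconds=35)     # smoothing cycle
impact_processing = timedelta(seconds=10)        # assumption: "a few seconds"

total = first_poll + smoothing + impact_processing
total_seconds = int(total.total_seconds())       # 405 seconds, about 6.75 minutes
```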

Passive Monitoring

[Figure 2]

  1. SNMP traps: Traps are sent by the devices to SMARTS each time specific events occur. The following default traps are configured in the customer premises devices:
       • Interface up/down (a trap is sent each time the state of an interface changes)
       • Cold/Warm Start-Up (a trap is sent each time a device starts up, so SMARTS knows when a device reboots, e.g. after a manual reset or loss of power)
  2. IMPACT: Upon the repeated occurrence of specific traps (for example, if a device sends an interface up/down trap 4 times in 4 hours), IMPACT creates “unstable” alarms. Whether an Interface Unstable alarm also automatically results in a proactive incident ticket depends on the management center.
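
Detecting repeated traps within a time span is commonly done with a rolling window. The sketch below shows that general technique using the example figures from the text (4 traps in 4 hours). The TrapWindow class and its method names are hypothetical, and the guide does not describe how IMPACT actually counts traps.

```python
from collections import deque


class TrapWindow:
    """Count traps for one interface inside a rolling time window."""

    def __init__(self, threshold: int, window_s: int) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._times = deque()   # timestamps (seconds) of recent traps

    def record(self, timestamp_s: float) -> bool:
        """Record one trap; return True if the unstable condition is met."""
        self._times.append(timestamp_s)
        # Drop traps that are older than the rolling window.
        while self._times and timestamp_s - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) >= self.threshold


# Example from the text: 4 interface up/down traps within 4 hours.
window = TrapWindow(threshold=4, window_s=4 * 3600)
hourly = [window.record(t) for t in (0, 3600, 7200, 10800)]
```

Here `hourly` ends with `True`, because the fourth trap arrives while the first three are still inside the window.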

Meraki

Management of Meraki devices is not performed by SMARTS but by a monitoring system developed internally by Verizon, called M3. This system polls the Cloud Controller (i.e. Dashboard) at 3 minute intervals, captures availability and related data, and communicates alarm conditions to IMPACT.
Meraki Cloud controller polls CPE every 5 minutes.
M3 can interact with the Meraki cloud in one of two ways: SNMP or a REST API. In the initial release of M3 the API was used for provisioning, and SNMP was used for monitoring. That approach has been replaced with one that uses the API exclusively, and SNMP is not used for new activations.
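
Because two polling stages are chained, a rough upper bound on the time before M3 can see a CPE failure follows from the two intervals. The sketch below rests on assumptions not stated in this guide: each stage notices the failure on its first poll after the failure occurs, and the two polling cycles are not synchronized. The function name is hypothetical.

```python
M3_POLL_INTERVAL_MIN = 3           # M3 polls the Cloud Controller
CONTROLLER_POLL_INTERVAL_MIN = 5   # Cloud Controller polls the CPE


def worst_case_visibility_min(controller_min: int, m3_min: int) -> int:
    """Upper bound, in minutes, before M3 can observe a CPE failure.

    Assumes the failure happens just after a controller poll and the
    controller update lands just after an M3 poll.
    """
    return controller_min + m3_min


bound = worst_case_visibility_min(CONTROLLER_POLL_INTERVAL_MIN, M3_POLL_INTERVAL_MIN)
```

Under these assumptions the bound is 8 minutes; any confirmation logic in the Dashboard or in M3 would add to it.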

From Event to Incident Ticket

[Figure 4]

A ticket is created 9-13 minutes after the initial network event.
Automated troubleshooting commences immediately after the creation of the proactive incident ticket. This is the so-called ‘triage’ phase, which is published on the VEC Portal and via eBonding.
Automated triage troubleshooting enables faster resolution because the ticket is automatically transferred to the NOC if technicians need to carry out further diagnostics. The NOC technicians can also use the triage output as diagnostic data.

[Figure 5]

Alarm List and Thresholds

Product| Incident Type| Priority| Description
MS WAN| BGP Service Down| 1| The BGP Service and all of the BGP sessions associated with this service are down
MS WAN| BGP Session Down / Disconnected| 1| The BGP Session is not established without a known root cause
MS WAN| Host / Card / Node Down| 1¹| The device is unresponsive to SNMP polling
MS WAN| Interface Down| 1¹| Interface is down
MS WAN| Interface Unstable| 1¹| 5 linkup/linkdown traps have been received within a 10-minute rolling window ²
MS WAN| Network Connection Down| 1| The network connection is down
MS WAN| Network Connection Unstable| 1| 5 linkup/linkdown traps have been received within a 10-minute rolling window
MS WAN| Router / Firewall / Switch Down| 1¹| The device is unresponsive to SNMP polling
MS WAN| Firewall / Router / Switch / Host / swatch Unstable| 2| At least 2 Coldstart/Warmstart traps have been received within the past 24 hours
MS WAN| Interface Chronic Unstable| 2| The interface has dropped at least 16 times over a 4-hour period ⁴
MS WAN| OSPF Network AuthTypeMismatch / AuthKeyMismatch| 2| Misconfiguration symptoms exist on this OSPF Network
MS WAN| OSPF Network DRElectionFailure| 2| The designated router has not been elected
MS WAN| OSPFInterface Down| 2| Two or more OSPF neighbor relationships exist on the interface and all of them are down
MS WAN| OSPFNeighborRelationship Down| 2| The OSPF link between neighboring endpoints is down
MS WAN| OSPFNeighborRelationship NeighborStateAlarm| 2| Connectivity between the OSPF Neighbors has been impaired by connectivity failures in Layer 2 or 3

Notes:

  1. Depending on the management center and certain criteria, these might be opened as Pri 2
  2. The alarm clears after 10 minutes without linkup/linkdown events
  3. The alarm clears after 24 hours without Coldstart/Warmstart events
  4. The alarm clears after 4 hours without linkup/linkdown traps

Each type of product has a different set of alarms; the product group is shown in the first column.
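
The clearing rules in the notes above amount to a quiet-period check per alarm type. A minimal sketch follows, assuming the three quiet periods map to the alarm types named in the dictionary keys (an inference from the thresholds in the table, with "Device Unstable" used as shorthand for the Coldstart/Warmstart alarm). The function and constant names are hypothetical.

```python
# Quiet period (seconds) after which an unstable alarm clears.
CLEAR_WINDOW_S = {
    "Interface Unstable": 10 * 60,             # note 2: linkup/linkdown events
    "Device Unstable": 24 * 3600,              # note 3: Coldstart/Warmstart events
    "Interface Chronic Unstable": 4 * 3600,    # note 4: linkup/linkdown traps
}


def is_cleared(alarm_type: str, last_event_s: float, now_s: float) -> bool:
    """Return True once the quiet period for this alarm type has elapsed."""
    return now_s - last_event_s >= CLEAR_WINDOW_S[alarm_type]
```

For example, `is_cleared("Interface Unstable", last_event_s=0, now_s=600)` is true, while the same call with `now_s=599` is not.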

Product| Incident Type| Priority| Description
MS WAN| BGP Protocol Endpoint Disabled| 4| The interface is administratively down and the BGP session for this endpoint is reporting an improper state
MS WAN| BGP Protocol Endpoint Remote AS Mismatch / IBGP Peer Missing| 4| The BGP session for this endpoint is reporting an improper state
MS WAN| BGP Protocol Endpoint Remote System Not Running BGP / Remote Speaker Not Configured| 4| The BGP session for this endpoint is reporting an improper state
MS WAN| Interface Disabled| 4| The interface is administratively down (manually disabled)
MS WAN| OSPF Interface Disabled| 4| The interface is administratively down and at least one OSPF adjacency is reporting an improper state
MS WAN| OSPF Neighbor End Point Unknown Nbma Neighbor| 4| The NBMA Neighbor for this OSPF Neighbor Endpoint is not present in the topology
MS WAN| OSPF Network All Router Priorities Zero| 4| OSPF Network routers unable to be prioritized
MS WAN| OSPF Network Mismatch / Duplicate Router ID| 4| Misconfiguration symptoms exist on this OSPF Network
MS WAN| OSPF Virtual Neighbor Endpoint Unknown Virtual Neighbor| 4| Misconfiguration symptoms exist on this OSPF Network
MS WAN – lite| Firewall / Router Unstable| 2| 2 Coldstart/Warmstart events have been recorded within 24 hours ³
MS WAN – lite| Host / Node Down| 2| The device is unresponsive to SNMP polling
MS WAN – lite| Interface Down| 2| Interface is down
MS WAN – lite| Network Connection Down| 2| The network connection is down
MS WAN – lite| Router / Firewall Down| 2| The router or firewall is unresponsive to SNMP polling

Notes:

  1. Depending on the management center and certain criteria, these might be opened as Pri 2
  2. The alarm clears after 10 minutes without linkup/linkdown events
  3. The alarm clears after 24 hours without Coldstart/Warmstart events
  4. The alarm clears after 4 hours without linkup/linkdown traps

Meraki Alarm List

Product| Incident Type| Priority| Description
MS WLAN| Appliance Down| 1| The cloud controlled appliance is unreachable from the Meraki Dashboard
MS WLAN| Authentication Failure| 1| This indicates a failure between M3 and the Meraki dashboard (meraki.com)
MS WLAN| Dashboard Down| 1| Communication lost with the Cisco Meraki Cloud Controller
MS WLAN| License Expiration| 1,2,4| The license is expiring or has expired as indicated in the alarm text. Ticket priorities are as follows: 60 days = P4, 30 days = P2, 0 days = P2, -30 days = P1
MS WLAN| AP / Switch Down| 2| The cloud controlled device is unreachable from the Meraki Dashboard
MS WLAN| Interface Down| 2| A managed interface on the MX is down
MS WLAN| LTE Backup Not Ready| 2| LTE connection status (cellular Status) is ‘connecting’ for 2 M3 polling cycles
MS WLAN| LTE Backup Not Available| 2| The USB cellular modem should be present but is not
MS WLAN| On LTE Backup| 2| LTE connection (cellular Status) is active
MS WLAN| Admin Added| 4| An administrative user was added to the Meraki organization’s local user database
MS WLAN| Admin Deleted| 4| An administrative user was deleted from the Meraki organization’s local user database
MS WLAN| AP / Appliance / Switch Removed| 4| The device indicated was removed from the dashboard
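
The License Expiration row lists a priority at four points in time (60, 30, 0 and -30 days relative to expiry). The sketch below turns those points into a lookup. It assumes each priority applies from its threshold until the next one is reached, and that no ticket exists more than 60 days before expiry. Both assumptions, and the function name, are illustrative and not taken from this guide.

```python
from typing import Optional


def license_ticket_priority(days_until_expiry: int) -> Optional[int]:
    """Map days until license expiry (negative = already expired) to a priority."""
    if days_until_expiry <= -30:
        return 1      # expired for 30 days or more
    if days_until_expiry <= 30:
        return 2      # covers both the 30-day and the 0-day thresholds
    if days_until_expiry <= 60:
        return 4      # early warning
    return None       # assumption: no ticket more than 60 days ahead
```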

Ticket Priority Definitions

Ticket Type| Priority| Description
Outage| 1| Service is unusable, complete loss of service. The service is released for testing without restriction.
Degraded| 2| Service is experiencing intermittent issues or is degraded and is not released for testing without restriction.
Service Risk| 3| Quality issues that threaten the performance of the service.
Assistance Request| 4| Non-service impacting issues requiring investigation, resolution or other action.

These are the standard ticket priority definitions used within Verizon.

Alarm Correlation

When alarms are presented to IMPACT, a correlation key is applied based on shortname and location identifier. Alarms with the same key will be added to the same event and ticket. This key remains active for either 15 minutes for Hub locations or for 2 hours for remote locations.

After the timer expires, new alarms create new events and go through all of the wait time, backend queries, etc. A pre-existing ticket check then moves the alarm to a previous event/ticket if an open event/ticket is found against the same shortname and location identifier.
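
The time-limited correlation key can be sketched as a dictionary keyed on shortname and location identifier. This is an illustrative model only: the Correlator class, its method and the integer event IDs are hypothetical, and the pre-existing ticket check that runs after the timer expires is deliberately left out.

```python
HUB_WINDOW_S = 15 * 60        # key lifetime for hub locations
REMOTE_WINDOW_S = 2 * 3600    # key lifetime for remote locations


class Correlator:
    """Group alarms that share a (shortname, location) key while it is active."""

    def __init__(self) -> None:
        self._events = {}     # key -> (event_id, key_expires_at_s)
        self._next_id = 1

    def assign(self, shortname: str, location_id: str,
               is_hub: bool, now_s: float) -> int:
        """Return the event ID this alarm is added to."""
        key = (shortname, location_id)
        entry = self._events.get(key)
        if entry is not None and now_s < entry[1]:
            return entry[0]   # key still active: same event and ticket
        # No active key: open a new event and start the timer.
        window = HUB_WINDOW_S if is_hub else REMOTE_WINDOW_S
        event_id = self._next_id
        self._next_id += 1
        self._events[key] = (event_id, now_s + window)
        return event_id
```

In this model two alarms for the same hub location 14 minutes apart share one event, while an alarm arriving 15 minutes or more after the first opens a new one.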

Appendix
API
Application Programming Interface, a software intermediary that allows two applications to talk to each other.
CPE
Customer Premises Equipment
ESP
This is the primary database for Managed Services Customers. All information pertaining to the management and monitoring of Managed Services devices/services is stored in ESP.
M3
Verizon internally developed monitoring system.
IMPACT
Integrated Management Platform for Advanced Communications Technologies, an application that provides surveillance, alarm topology augmentation, correlation, ticketing, and automation capabilities for the Verizon network.
NOC
Network Operations Center
SMARTS
Part of the EMC Service Assurance Suite; it delivers critical management insights for applications and services. It is responsible for sending alerts to IMPACT each time a fault is detected.

June 2021
Verizon Public
