Supporting Work-From-Home Users with ThousandEyes

A powerful network monitoring solution that takes your troubleshooting up a notch.

The Problem

If you’re supporting a large remote workforce, you’ve probably had an executive call on Monday morning to say: “I joined a meeting over the weekend and audio quality was terrible. Can you make sure it doesn’t happen again?”

Pre-pandemic, it may have been acceptable for IT to dismiss work-from-home problems without much digging. Troubleshooting was a crapshoot because there were so many variables and so little visibility into the traffic path. What device was the user logged in on? Did they have reasonable wireless coverage? Who was their ISP? Better yet, was their ISP having problems, or was the service they were trying to connect to even up? This was less a can of worms and more like Pandora’s Box made into a Rubik’s Cube. So, as with most difficult problems when tools are limited, it was easy to attribute it to something uncontrollable. ThousandEyes is an excellent tool for combating this.

The Challenges Solved by ThousandEyes

Today, with work-from-home now the new normal for large numbers of workers, end users – from front line to executive level – expect the same experience at home that they get at the office.

Traditional network monitoring provided a good grasp of in-house corporate infrastructure, but effectively treated the internet as the ultimate “black box” – it either worked right or it didn’t, with little attention paid to the “why” or “how” of a problem and no useful information for correcting it. So, what’s to be done now that a large share of your user base relies on it day-to-day to get their jobs done?

In this article I’ll address that question by examining one of my favorite ThousandEyes use cases: using it to demystify the internet “black box” in order to proactively support end users.

Before we get into that, let’s look at what ThousandEyes is and how it works.

What is ThousandEyes, anyway?

“The internet is not a single network. For many, the internet has become a “black box” that is too complex to manage, too big to maintain visibility, and too vast to monitor. It is unpredictable, and composed of thousands of independently managed service providers, any of which can impact the experience of users connecting to an application or site. Even if the enterprises may not directly control the internet, they are ultimately still responsible for the reachability of their service and user experience.”

Courtesy of Cisco

ThousandEyes is a “next gen” network monitoring tool, focused on providing granular monitoring across the internet. Think of it as our guide to working through Pandora’s Rubik’s Cube, helping map specific traffic flow across the internet for your end users and devices.  I’ll elaborate on that a little later.

ThousandEyes monitoring is accomplished by three different agent types:

  1. Enterprise Agents – Installed on a virtual machine or a container – including a container on a Cisco Switch.
  2. Endpoint Agents – Installed on PCs, most useful for the work-from-home scenario. These are very lightweight but more limited than Enterprise Agents.
  3. Cloud Agents – These are basically Enterprise Agents installed on Cisco’s infrastructure – to give perspectives from places you can’t place your own Enterprise Agents.

We’ll be focusing primarily on Endpoint Agents for our current discussion, with a light review of Enterprise Agents. Cloud Agents are out of scope.

Endpoint Agents for ThousandEyes

The endpoint agent is installed on your end-user PCs. It’s lightweight, and out of the box it gathers data on the PC itself, the local network (including wireless details), and VPN usage. Customized tests are then typically configured to provide monitoring.

Tests fall into two types on endpoint agents:

  • Scheduled Tests – to test the ability of the agent to ping hosts, open TCP connections, or load a web page successfully. Scheduled tests also pull underlying network data; more on that below.
  • Browser Session Tests – to monitor the performance of loading a web page. This is done passively when a user visits domains the administrator has specified, which avoids the overhead of scheduling a page load for sites users are likely to visit anyway.

As agents are added and report back to ThousandEyes, they’re geolocated by IP address and added to a topological diagram.
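To make the idea of a scheduled TCP test concrete, here is a minimal sketch of what "open a TCP connection and time it" can look like. It is not how the ThousandEyes agent is implemented; the function name `tcp_connect_ms` and the default timeout are assumptions made for illustration.

```python
import socket
import time
from typing import Optional


def tcp_connect_ms(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if the connection fails.

    Illustrative only: a real agent also records path and loss data.
    """
    start = time.perf_counter()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return None
```

Run on a timer against a service your users depend on (for example port 443 of whatever hostname your mail service uses), a `None` result would count as a failed round and a number would feed a latency chart.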

Adding New Scheduled Tests in ThousandEyes

Let’s start with Scheduled Tests and cover how to configure them. The first thing we’ll do is set up a test to open an HTTPS connection to Outlook on Office 365:

This will automatically push out to all agents once you select “Add New Test.”

In reality I added this test weeks ago. These are the current results:

This is the last two weeks of data. Let’s zoom in on a few of the details:

Clearly, the section of dense red (Agents with errors) alongside the drop of the dark green (availability dropping) indicates an outage. All ThousandEyes tests gather “lower level” data automatically. The most obvious example is that a traceroute is run with this web test. As such, we have a network path for every host running the test. This is helpful for identifying regional or ISP-level outages. ThousandEyes makes a fantastic visual of this data as it’s gathered. See below:

It’s difficult to get the full picture in a blog, but this diagram can be drilled into and reorganized in real time to isolate problem areas. You can get incredibly granular with this. For some insight, each number within the circles you see above represents “jumps” (hops) you can drill down to along the route of the packet.
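The availability figure from the timeline discussed above can be thought of as "what share of agents succeeded in each test round." The sketch below shows that arithmetic under assumed inputs; the tuple layout and the function name `availability_by_round` are hypothetical and are not the ThousandEyes data model.

```python
from collections import defaultdict


def availability_by_round(results):
    """Map each test round to the percentage of agents whose test succeeded.

    `results` is an iterable of (round_id, agent_id, succeeded) tuples.
    This layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for round_id, _agent_id, succeeded in results:
        totals[round_id] += 1
        if succeeded:
            successes[round_id] += 1
    # Every round in `totals` has at least one result, so no division by zero.
    return {r: 100.0 * successes[r] / totals[r] for r in totals}
```

A sharp drop across many agents in the same round points away from any single user's home network and toward the service or a shared provider.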

I also have a variety of other tests running on Endpoint Agents towards Microsoft services:

For brevity, I won’t dive into them individually.

Endpoint Agent Browser Sessions

As mentioned above, one of the brilliant things about the Endpoint Agent is that business domains can be monitored passively. Rather than having the agent open an “invisible browser” itself and pull down a website to test granular web performance, it works on the assumption that if the website is vital to the business, you’ll be browsing there yourself pretty regularly. The agent has an integration with Chromium-based browsers (Chrome and Edge) that will monitor web performance as the user browses the page.

The first step is to set up the domains we want to monitor, and the IP space we want to monitor from.

Our domains:

And for simplicity, we monitor from everywhere, as most of our employees are remote anyway. An easy example of where this might not be ideal is if you’re running Enterprise Agents at corporate and don’t want to gather a bunch of extra data from users in the office.
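Conceptually, passive monitoring comes down to a check like "does the page the user just opened belong to a monitored domain?" The following is a rough sketch of that check, not the agent's actual matching logic; the function name `is_monitored` and the subdomain rule are assumptions.

```python
from urllib.parse import urlsplit


def is_monitored(url, monitored_domains):
    """Return True if the URL's host equals, or is a subdomain of, a monitored domain.

    Hypothetical matching rule used only to illustrate the idea.
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in monitored_domains
    )
```

Matching on a leading dot keeps look-alike hosts (such as `notoffice.com` when `office.com` is monitored) from being picked up by accident.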

We’ll focus on the same timeframe as we did above – where Microsoft had problems – January 25th:

My first reaction to seeing this – being a network engineer – was “what the heck is an experience score?” It turns out traditional metrics such as delay or jitter aren’t as valuable when measuring an end-user experience, so ThousandEyes built a new one. If you’re curious, here is how it’s calculated.

Enterprise Agents

So we’ve seen how to gather data from our endpoint clients. Now let’s briefly introduce Enterprise Agents and cross-reference data. Enterprise Agents are a big topic, as they can measure far more situations than the deliberately-lightweight Endpoint Agent.

Enterprise Agents can also perform these tests:

  • UDP Tests
    • Bandwidth tests (Use carefully!)
  • Agent-to-Agent Tests
    • How is the performance between two enterprise agents?
  • Layer 2 discovery & monitoring
    • Discover, auto-diagram, and monitor inside your network
  • BGP Tests – Particularly path changes which indicate instability
  • Web page load tests – How fast does a page load, and how are individual web objects loading?
  • Web page transaction tests – This is Application Performance Monitoring (APM). The ThousandEyes Recorder will record your mouse clicks and data entry on any website, and then replay it on schedule to test performance.

For this particular topic, let’s look at BGP changes, as it’s the one thing the Endpoint Agent didn’t give us:

Microsoft had a series of BGP advertisement changes during this timeframe and experienced significant route table instability.
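One simple way to reason about route instability is to count how often the AS path toward a prefix changes between successive observations. The sketch below does exactly that on assumed input; the function name `count_path_changes` is hypothetical, and the AS numbers in the usage note come from the range reserved for documentation.

```python
def count_path_changes(as_paths):
    """Count how often an observed AS path differs from the previous sample.

    `as_paths` is a time-ordered sequence of AS paths, each a sequence of
    AS numbers. Illustrative only; a real BGP test tracks far more detail.
    """
    changes = 0
    previous = None
    for path in as_paths:
        current = tuple(path)
        if previous is not None and current != previous:
            changes += 1
        previous = current
    return changes
```

For example, the samples `[64500, 64501, 64510]`, `[64500, 64501, 64510]`, `[64500, 64502, 64510]`, `[64500, 64501, 64510]` contain two path changes. A stable prefix would show zero over the same window.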

What we learned from this incident is that Microsoft themselves had the problem – not our end users. Our monitoring with ThousandEyes enabled us to send out a proactive notification to our users that Microsoft had instability and to be patient during the outage, rather than opening any tickets.

But what if it is one of our users?

The example above gave a great way of correlating data, which is why I used it as the primary example. However, we have had three end-user issues in the past few weeks that were reported as “My computer isn’t working right.” Two turned out to be network issues, while one was resolvable with minimal effort. Individual PCs can be pulled up for performance metrics. Let’s explore some individual scenarios.

Scenario 1 – Employee having continued audio problems on conference calls

Checking the image above we see:

  • Memory – 63% used. No problems there.
  • CPU Load – 4.2% used. No problems there.
  • Network access – 144 Mbps, 79% signal. Not ideal.
  • Gateway loss – 0.4% – this is the problem.

In this particular case, we found out that the employee was using the free wireless provided by their landlord, and the performance of that wireless is quite poor where their work computer is located.

Solution: Recommended that the employee purchase their own internet service.
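The triage in Scenario 1 amounts to comparing each endpoint metric against a limit and seeing which one stands out. Here is a small sketch of that comparison. Every threshold in it is a hypothetical value chosen for illustration, not a ThousandEyes default, and the metric names are made up for this example.

```python
# Hypothetical limits chosen for illustration only.
THRESHOLDS = {
    "memory_pct": ("max", 90.0),
    "cpu_pct": ("max", 85.0),
    "wifi_signal_pct": ("min", 70.0),
    "gateway_loss_pct": ("max", 0.0),  # any loss to the local gateway is suspicious
}


def flag_metrics(sample):
    """Return the names of metrics in `sample` that breach their assumed limit."""
    flagged = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = sample[name]
        too_high = kind == "max" and value > limit
        too_low = kind == "min" and value < limit
        if too_high or too_low:
            flagged.append(name)
    return flagged
```

With the values from this scenario (63% memory, 4.2% CPU, 79% signal, 0.4% gateway loss) and these assumed limits, only the gateway loss is flagged, which matches the diagnosis above.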

Scenario 2 – Employee complained of very poor application performance

During the problem period, we see a large spike in VPN latency. None of our other users were complaining of VPN problems, nor did they see this spike. This took further investigation.

It turns out, this user wasn’t connected to our VPN. A customer had provided us VPN access for reaching one of their ticketing applications, and it had full-tunnel enabled. As such, the entirety of this end-user’s traffic, including that which wasn’t related to this end-customer, was being tunneled through the end-customer’s network. When the end-customer had a network problem, it impacted our user.

Solution: We instructed the user to only connect to that VPN when accessing that one application.
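The difference between full-tunnel and split-tunnel VPNs in this scenario can be shown with a routing check: a full tunnel routes every destination (0.0.0.0/0) through the VPN, while a split tunnel routes only specific prefixes. This is a minimal sketch; the function name `uses_tunnel` is hypothetical, and the addresses in the usage note are from documentation ranges.

```python
import ipaddress


def uses_tunnel(destination, tunnel_routes):
    """Return True if `destination` falls inside any prefix routed into the VPN."""
    addr = ipaddress.ip_address(destination)
    return any(addr in ipaddress.ip_network(prefix) for prefix in tunnel_routes)
```

With a full tunnel (`["0.0.0.0/0"]`), unrelated traffic to `203.0.113.10` goes through the customer's network. With a split tunnel that only covers the ticketing application (say `["198.51.100.0/24"]`), that same traffic stays on the user's own connection.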

Scenario 3 – An end-user was complaining of poor web browsing performance on their computer

One look at the memory usage tells us this is a systems problem – in this case, the user had a runaway process, and our desktop support engineers got in and resolved the problem. No network issue this time!

Solution: Local device issue, no network troubles found.

Another Potential Use Case

Residential ISPs are not always known for their reliability. If a residential ISP is having problems that impact a group of users, ThousandEyes can export the data in a form that people without ThousandEyes access can view, so you can hand it to the ISP as evidence and ask for a repair.
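Spotting an ISP-level problem is essentially a grouping exercise: if several different users on the same provider see loss at the same time, the provider becomes the prime suspect. The sketch below illustrates that grouping with assumed inputs; the tuple layout, the `min_users` cutoff, and the function name are all hypothetical.

```python
from collections import defaultdict


def isps_with_shared_problems(samples, min_users=2):
    """Return ISPs where at least `min_users` distinct users saw packet loss.

    `samples` is an iterable of (user, isp, loss_pct) tuples.
    This layout is an assumption made for illustration.
    """
    affected = defaultdict(set)
    for user, isp, loss_pct in samples:
        if loss_pct > 0:
            affected[isp].add(user)
    return sorted(isp for isp, users in affected.items() if len(users) >= min_users)
```

A single affected user on a provider more likely points to that user's home setup, which is why the sketch requires more than one user before naming the ISP.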

Conclusion

ThousandEyes can greatly decrease time-to-detection, time-to-root-cause, and time-to-resolution for remote users, saving a great deal of IT manpower and reducing lost productivity for the remote employee.

Looking to deploy a solution like ThousandEyes? Contact us here to get started.

Thinking Outside the Box with Cisco DNA Center

What other applications does DNA Center have?

Cisco’s DNA Center appliance is generally talked about in the context of SD-Access (SDA), but SDA is a complex technology that involves significant planning and re-architecture to deploy.  DNA Center is not just SDA, though – it has multiple features that can be used on day 1 that can cut down on administrative tasks and reduce the likelihood of errors or omissions.  From conversations with our customers, the most asked-for capability is software image management and automatic deployment, and that is something that DNA Center handles extremely well compared to many other solutions out there.

Wait…I can manage software updates with DNA?

Managing software on network devices can be a substantial time burden, especially in businesses with heavy compliance requirements that call for regular software updates. Add to this the increasing size of network device images – pretty much all the major switch and router vendors’ products now have image sizes ranging from hundreds of megabytes to several gigabytes – and software management can take up a significant chunk of an IT department’s time. One of our customers is interested in DNA Center for this specific purpose – with 500+ switches, being able to automate software deployment saves several weeks of engineer time over the course of a year.

That may leave you asking…

So, what devices can I manage? 

DNA Center can manage software for any current production Cisco router, switch, or wireless controller.  Additionally, some previous-generation hardware is also supported.  Of this hardware, the Catalyst 2960X and XR switches as well as the Catalyst 3650/3850 switches are the most commonly used with DNA Center. Now let’s talk about how DNA Center does this.

Neat! Now, tell me how to do it. 

First, be sure that every device you want to manage is imported into DNA Center.  Once that’s done, the image repository screen will automatically populate itself with available software image versions by device type.

Here’s an example:

From here, select the device family to see details. Once you’ve decided on the version you want to use, click on the star icon, and DNAC will mark that as the golden image (aka the image you want to deploy). If not already present on the appliance, the image will also be downloaded.

Next, go to Provision > Network Devices > Inventory to start the update process.  From here, select the devices you want to update, then click on Actions > Software Image > Update Image.  You’ll be given the option to either distribute the new images immediately or to wait until a specific time to start the process.  Different upgrade jobs can be configured for different device groups as well.

Here, I’ve set DNAC to distribute images on Saturday the 19th at 1pm local time for all my sites. This process is just the file copy, so no changes are made to the devices at this time. The file copy process is tolerant of slow WAN connections, though not of poor-quality ones. We’ve tested this process in our lab and found that it’ll happily work even over a 64k connection (though it’ll take quite a while), whereas poor-quality connectivity will cause it to fail. Finally, once the image is copied to the target devices, a hash check is performed to ensure the image hasn’t been corrupted.
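For readers unfamiliar with post-copy hash checks, the idea is to compute a digest of the copied file and compare it with the published value. The sketch below shows the general technique in Python. The article does not say which hash algorithm DNA Center uses, so the SHA-512 default here is an assumption, and the function names are hypothetical.

```python
import hashlib


def file_digest(path, algorithm="sha512", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_intact(path, expected_hex, algorithm="sha512"):
    """Return True if the file's digest matches the expected (published) value."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Reading in chunks matters here because, as noted earlier, images can run to several gigabytes.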

The next step is to activate the image.  Activation here means ‘install the image and reboot the device’.

Like the distribution process, DNAC can either install immediately or wait until a scheduled time. Note that for IOS XE devices, this process will do a full install of the image vs. just copying the .bin file over. Once the software activation is complete, the devices will show their status in the inventory screen. As you can see, DNA Center’s software image management capability can save substantial time when updating software and help ensure that no devices fail to receive updates through error or omission.

Prepared by: Chris Crotteau