.NET Geeks
-Unleash Your Inner Geek!!!

Hardware, Network & Application Monitoring in a SharePoint Environment

 “It’s better to be proactive than reactive”

Performance and availability are very important aspects of a well-running Microsoft Office SharePoint Server (MOSS) 2007 environment. But how do you know whether, and when, there is an issue? And once you have determined there is an issue, how do you determine exactly where it is?

First, an understanding of the MOSS farm’s architectural limitations should inform the troubleshooting methodology used to isolate the bottleneck and decide what measures are necessary to eliminate it.

Troubleshooting

Here are some initial questions to ask:

1. What is the server hardware monitoring solution in place within the organization? e.g. HP Compaq Insight, Dell OpenManage or IBM Director

2. What network monitoring tools are being utilized? e.g. CA Unicenter or HP OpenView

3. What is the software/application monitoring solution used? e.g. Microsoft Operations Manager (MOM) or NetIQ AppManager

4. What needs to be monitored?

As mentioned, knowing there is a problem is the first step in the process. Here are some cases where a methodical process could be implemented to determine the issue:

1. A planned outage or load on the system

2. An outage or slowness of the system for a brief period

3. The system has an outage, or displays symptoms of slowness, with no noticeable pattern

4. The system experiences an outage or slowness in a regular, repeating pattern

However, there might be issues in the system that are only apparent to end users within the environment. Users know there is an issue because the system does not operate as expected, and they will generally report this to the help desk. Regularly monitoring support calls over a period of time can be a good indicator that an issue is present. Another way issues can be discovered is through system monitoring. This will be discussed in more detail below.
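As a rough illustration of watching support calls over a period, the sketch below flags days whose call count stands well above the recent average. The function name, the seven-day window, the two-standard-deviation cutoff and the sample counts are all assumptions made for this example, not values from any help desk product.

```python
from statistics import mean, stdev

def flag_call_spikes(daily_calls, window=7, sigmas=2.0):
    """Return indexes of days whose call count exceeds the trailing
    window's mean by more than `sigmas` standard deviations."""
    flagged = []
    for i in range(window, len(daily_calls)):
        history = daily_calls[i - window:i]
        limit = mean(history) + sigmas * stdev(history)
        if daily_calls[i] > limit:
            flagged.append(i)
    return flagged

# Hypothetical daily counts of SharePoint-related help desk calls.
calls = [4, 5, 3, 4, 6, 5, 4, 5, 4, 15, 5]
print(flag_call_spikes(calls))  # prints [9]
```

A flagged day is only a prompt to start looking; the call log itself does not say which part of the farm is at fault.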

After you have determined there is a problem, the next step is to find the cause. Solid troubleshooting techniques and processes will prove priceless and will save time when problems are discovered. The following steps are recommended when issues are discovered:

1. Check the monitoring logs from the hardware monitor for hardware issues

2. Check the network monitoring logs for a traffic load or outage related issue

3. Look into the Windows event logs on the SharePoint server(s) for problems that may have occurred

4. Look at the database logs for any load or outage that may have resulted in this issue

5. Check the domain controllers to see if authentication was the point that caused the occurrence

6. Review the SharePoint logs
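The ordering of the steps above can be sketched as a simple checklist runner that works from the hardware layer upward and stops at the first layer reporting a problem. Everything here is hypothetical: the check functions are placeholders that a real setup would replace with queries against each tool's logs, and the sample finding is invented for the example.

```python
from typing import Callable, List, Optional, Tuple

# Each check returns a short finding, or None when that layer looks healthy.
Check = Callable[[], Optional[str]]

def first_finding(checks: List[Tuple[str, Check]]) -> Optional[Tuple[str, str]]:
    """Run the checks in order and stop at the first layer with a problem."""
    for layer, check in checks:
        finding = check()
        if finding is not None:
            return layer, finding
    return None

# Placeholder checks in the same order as the recommended steps.
checks = [
    ("hardware monitor", lambda: None),
    ("network monitor", lambda: None),
    ("SharePoint server Windows logs", lambda: None),
    ("database logs", lambda: "heavy load during the reported slowdown"),
    ("domain controllers", lambda: None),
    ("SharePoint logs", lambda: None),
]
print(first_finding(checks))
```

Stopping at the first finding keeps the investigation focused on the lowest layer that shows a fault before moving on to the layers that depend on it.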

Once you have a clear understanding of where the issue is, you can proceed to eradicate it. The solution could range from something as simple as replacing a piece of hardware in one of the servers to adding servers to a cluster. Other solutions include adding network bandwidth, adding another domain controller to handle more authentication requests, or adding another web front-end server to the farm.

Monitoring

Monitoring an environment is broken up into three major components: hardware, network and software/application monitoring. As with any monitoring solution, if it is not utilized correctly it is of no use. Once metrics are determined and thresholds are set, make sure an individual is responsible for taking action when one of the thresholds is reached. In addition, notification must be configured to alert a responsible resource when an issue occurs so that a resolution is triggered.

Utilize reporting. Make sure you are keeping track of trends in the three major system components. Make sure the appropriate information is being reviewed and observed. If there are thresholds set for certain services, make sure these specific thresholds are being captured and reported on so that intervention will occur when needed. Doing baseline load testing on the MOSS “farm” and knowing where the load causes performance degradation is the best way to determine the appropriate monitoring thresholds for each piece of the MOSS environment.
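One way to turn a baseline load test into thresholds is to find the load level at which response time first becomes unacceptable and set the warning level some distance below it. The sketch below assumes the test produced pairs of concurrent connections and average response time; the sample figures, the 3-second limit and the 50 percent warning fraction are illustrative assumptions, not measurements from a real farm.

```python
def degradation_point(samples, max_response_s):
    """Return the lowest load whose response time exceeds the limit, or None."""
    for load, response in sorted(samples):
        if response > max_response_s:
            return load
    return None

def thresholds_from_baseline(samples, max_response_s, warn_fraction=0.5):
    """Derive warning/error thresholds from (load, response_seconds) samples."""
    point = degradation_point(samples, max_response_s)
    if point is None:
        return None  # the test never pushed the farm past the limit
    return {"warning": int(point * warn_fraction), "error": point}

# Hypothetical load test results: (concurrent connections, avg response in s).
baseline = [(250, 0.8), (500, 1.1), (1000, 1.9), (1500, 2.6), (2000, 4.2), (2500, 9.0)]
print(thresholds_from_baseline(baseline, max_response_s=3.0))
# prints {'warning': 1000, 'error': 2000}
```

If the function returns None, the load test did not reach the point of degradation and should be rerun at higher load before thresholds are chosen.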

Suggestion: look to implement solutions that work with one another. An example would be the CIM management pack for MOM consolidator. This allows for a more unified monitoring environment that is easier to manage.

Know the limits of the network infrastructure. Make sure the more traffic-intensive services and/or logical applications (e.g. the SharePoint farm) are put in their own VLAN or network segment.

Some examples of counters to monitor, and their thresholds, are as follows:

System Monitor Counter                                  Threshold
Memory: % Committed Bytes in Use                        Greater than 80 percent
Memory: Available MBytes                                Less than 50 MB
Web Service: Connection Attempts/sec                    Greater than 500 attempts per second
Processor: % Processor Time: _Total (CPU Utilization)   Greater than 80 percent
Current Connections–Warning                             1000 connections
Current Connections–Error                               2000 connections
Disk Usage                                              Less than 10 percent
System: Processor Queue Length                          Greater than 10 threads
Memory: Pages/sec                                       Greater than 220 pages per second
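As a minimal sketch of how the thresholds listed above might be checked, the code below compares one set of counter readings against them. It does not collect the counters itself; the `sample` dictionary stands in for values exported from System Monitor or a monitoring tool, and its readings are made up. Treating the Current Connections levels as "at or above" is an assumption, and Disk Usage is left out because the list does not say whether it refers to free or used space.

```python
import operator

# (counter name, comparison, limit) for the counters listed above.
THRESHOLDS = [
    ("Memory: % Committed Bytes in Use", operator.gt, 80),
    ("Memory: Available MBytes", operator.lt, 50),
    ("Web Service: Connection Attempts/sec", operator.gt, 500),
    ("Processor: % Processor Time: _Total", operator.gt, 80),
    ("System: Processor Queue Length", operator.gt, 10),
    ("Memory: Pages/sec", operator.gt, 220),
]

def breached(sample):
    """Return the names of counters in `sample` that cross their threshold."""
    return [
        name
        for name, compare, limit in THRESHOLDS
        if name in sample and compare(sample[name], limit)
    ]

def connection_severity(current_connections):
    """Map Current Connections to a severity (assumes 'at or above')."""
    if current_connections >= 2000:
        return "error"
    if current_connections >= 1000:
        return "warning"
    return "ok"

# Hypothetical readings from one web front-end server.
sample = {
    "Memory: Available MBytes": 42,
    "Processor: % Processor Time: _Total": 91,
    "Memory: Pages/sec": 120,
}
print(breached(sample))
print(connection_severity(1250))  # prints warning
```

A single reading over a threshold is often just a momentary spike, so in practice a check like this would usually be applied to values averaged over a sampling interval.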


Posted Apr 12 2007, 12:54 PM by cooperfdiv