SharePoint ULS Log analysis using ELK (ElasticSearch LogStash and Kibana)

SharePoint ULS Log analysis using ELK (ElasticSearch LogStash and Kibana)

A Solution to Multi-Tenant systems Log Access.

By: George Gergues



SharePoint is a large platform that is always growing, and changing, and as with large application platforms that hosts many components, the complexity is always manifested in the platform log (ULS Logs) and log management. There are several ways to trace and debug issues using those logs, and tooling in this area has not been keeping up with the speed SharePoint. The other major complexity is multi tenancy, as the cost to own and operate a single SharePoint farm has been on the rise, many companies started to offer a multitenant farm system, mainly Office 365 by Microsoft, along with other hosting vendors in the same domain. Multitenancy reduces the cost of owning into a cost to lease or operate, but takes away platform specific facilities like Audit Log and platform log, as those log containers are not partitioned at the same tenancy level but are singletons per host in the case of ULS Log or per Content Database in the case of Audit Log table.


The Solution

In this posting we are proposing the use of the ELK platform (ElasticSearch, LogStash and Kibana) as a tool to ingest the ULS logs, leverage the fast index and search capabilities and make use of the event correlation features that come by aggregating logs and other factors.

Such tool used by system engineers and hosting staff can be very useful in responding to first level support incidents, or to relay the events to the end user without compromising security.

Document level links: The system engineer can share a document level link, to the event to show the full details of the event, those are mainly

Also this tool can show trending of events and repeated problems or patterns that might be cyclical in nature and provide better picture of farm level performance and common problems.

With little development, you can expose those search portal to the end users to directly perform those tenant level searches without exposing the full log to the end user or tenant admin.

The next few sections will show in detail how to build and customize such system.

What is ELK

ELK = [ E + L + K ]

ELK is the combination of three open source solutions to three problems and merged together, they form the best solution for large log and large dataset analytics. All three systems operate using REST.

E: ElasticSearch:

Index and search server base on the Lucene open source project, storing documents based items (JSON objects), the storage is clusterable and highly redundant through shards and multiple distributed nodes. Elastic Search is the Java ported version of Lucene and runs inside a JVM.

L: LogStash

is a Log capture and processing framework that works in a Capture-> Process -> Store cycle per single event on the log source. It is open source project that evolved a lot in the past year to operate with a long list filters, modules and plugins. LogStash is very powerful when used to construct schema JSON documents from unstructured text inside log files using GROK filters. LogStash is written mainly in Ruby and running inside JVM using JRuby.

K: Kibana

is a “node.js” data visualization layer that is tightly integrated with the elasticsearch index and can build charts and dashboards that represents data in the elastic search index very fast. It is making use of the fast REST api and is very responsive inside the browser even with large datasets. The Kibana configuration is stored as a JSON document in the elastic search index, along with all the charting and dashboard scripts that it produces.

The Challenge

  1. Multitenancy

    As mentioned in the introduction, with multitenancy comes the lack of access to platform ULS logs due to security restrictions to OS Filesystem level. The only way SharePoint shows application and system exceptions is via a custom application error pages, and they are very obscure, normally assigned a session or transaction id ( uuid ) named (correlation Id)and that is the only piece of information the user gets along with the timestamp.

Example of Correlation Id “53fed7f1-cf35-1253-0000-000050f7b00c”

Cannot give all tenants access to the shared logs as they main contain information regarding custom application or extensions, it might also expose other tenants content or exceptions depending on the debug level.

  1. Log Size and Log Cycling

    The other problem is Log Sizes and Log cycling. As with any large enterprise deployment, you are required to maintain some weeks of log history, either on the same system or offline. The minute those files leave the system, they are harder to manage and /or correlate.

  2. Multi Log Events

    Some events can span the end of one log file and continue onto the next log, and some of the current tools like ULS viewer could not handle easily. With such system, the log parsing is already done, and the queries are much faster.            

  3. Aggregate Log Sources

    Using such system, an engineer would have the ability to shows events from multiple aggregate sources, and to analyze events that may have started and caused by other dependencies, while manifested only at SharePoint where the error occurs. Some easy log aggregation candidates would include the IIS server logs per host along with windows event logs and SQL server logs.

ULS Logs

The Unified Logging System (ULS log) is the standard logging format used for SharePoint farms.

Location is Configurable by admin but default installation point to the SharePoint hive C:\Program files\Common Files\Microsoft Shared\Web Server Extensions\14 or 15 or 16 \LOGS\*.log

Process: Host
Executable running and causing the event.

TID: Process thread Id on that host.

Area: Component or application generating the event.

Event ID: Internal category Id per application.

The files rotate every 30 minutes by default.

SharePoint Application Error Page

The SharePoint custom error page for 2010, 2013 and Office 365 looks like this.

Only a correlation Id is visible to the users and they are to report it to the support technician.

This can be very frustrating as a developer as you need to wait for hours and in some cases days before you get an answer .

Setting up ELK

  1. Install Java Runtime (JRE)
  2. Download the packages
    (zip files) for each of the following and place each on a separate folder. Logstash , ElasticSearch, and Kibana.
  3. Edit the configuration file for each package for your environment (Development configuration )

For simplicity we opted to use the file input module to perform in our limited development environment, yet you can steam the log files using from multiple source using any plugin, LumberJack is the most commonly used log streaming plugin.




Details on the GROK filter syntax

    The Grok filter is the main core component that tries to understand the schema of your unstructured log text and generates a temporary schema based object to be stored, indexed and later on queried upon.


There are a few formats for the log file and the date timestamp variations that puts a bit of a burden to collect construct a pattern to match all the value.


Sample Pattern

SP_ULS_FMT1 (?<sptimestamp>%{MONTHNUM}/%{MONTHDAY}/%{YEAR}%{HOUR}:%{MINUTE}:%{SECOND}\*?*)%{SPACE}%{PROG:sp_process}\(%{BASE16NUM:sp_pid}\)%{SPACE}%{BASE16NUM:sp_tid}%{SPACE}\t+%{DATA:sp_area}%{SPACE}\t+%{DATA:sp_category}%{SPACE}\t+%{DATA:sp_eventid}\t+%{SPACE}%{WORD:severity}%{SPACE}%{DATA:sp_eventmessage}%{SPACE}%{UUID:correlationid}%{SPACE}

You can see the full configuration script on the GitHub repo

Building the Dashboard: Kibana Visualization

The Kibana configuration is fairly simple and all constructed through the UI. The Configuration of each of those elements is stored in the elasticSearch storage as an index named (.kibana).

The dashboard is made of some smaller components, mainly charts, and widgets, that target a specific measurement against the index.

The raw data of each element can be displayed and extracted as csv file.

A Severity chart, is simply a vector of the unique elements in the field SP_SEVERITY describing the event severity level.

The whole dashboard is dynamically rendered and filtered via the queries you select


This sample shows the event correlation id with showing the severity level and the total count of events that happened along with the components







Full Configuration scripts

You can find the full configuration scripts to get you started here.


ELK inside JVM instances (Java Virtual Machine), so you are bound by the limits of that particular instance. As with any Java based process you can tweak
free memory and heap allocation parameters to get the best throughput, yet the optimal performance will be achieved by using clusters of the service, and that is where shard storage shine.


  1. Aggregate Log sources in dashboards

As a next natural enhancement for this project, IIS Logs on all servers, windows log and SQL Server logs as well as firewall and or load balancers.

  1. Security

    1. You can start implementing some security over this system, not that in our sample system. Initially users don’t have access to the Search API directly but only via Kibana (the visualization layer) this is the first measure you can take.
    2. You can also implement an IIS or Apache proxy to implement authentication and authorization with any access control.
    3. You can use one of the commercial products (Shield and Marvel) from Elastic.Co
  2. Scalability

    This system is designed to be very highly distributed and available, with multiple node and clusters, (only keep the elastic nodes in the same geography to reduce latency). But you can have multiple hosts the logstash role, kibana role or all of them.


  1. Elasticsearch :
  2. LogStash :
  3. Kibana :
  4. ELK guides :
  5. Guides on using Kibana :
  6. Grok Debug Tool :
  7. The full configuration scripts:

SharePoint audit in action

The Triangle SharePoint User Group  ( TriSPUG) meeting for Tuesday  03JUN2014 at the Microsoft building in Durham .
I finally finished the presentation and the slides and the code samples on codeplex.


Please find attached

The code samples are here


Let me know if you have any questions.


Best of luck

SharePoint audit in action



InfoPath SharePoint FormServer error 5566

The Error code 5566 is very common, and if you get that error

“ An error occurred querying a data source.

Click OK to resume filling out the form. You may want to check your form data for errors.
 Hide error details
 System.Xml.XmlException: There are multiple root elements. Line 2, position 2 ……………………..  “

Code 5566 is a very common error when performing cross web services calls

The Problem is more of a server architecture issue ( on a single sever farm configuration you may not have those issues)

The Root causes

  1. Name resolution
  2. Certificate validation errors
  3. UAG or any Url Filter or traffic parsing engines (F5 Big-IP and the like.)

The cause can be one or all of the above.

Simply to understand the problem, you need to understand how InfoPath handles this type of traffic.

  1. Client ( C  ) requests a form operation from form server ( S )
  2. S read the template from the same server or the document library or storage .
  3. S builds a temp map in memory for the current user of the form rules and code for the duration of the session.
  4. S Execute the operation (read, update, or new) form.
  5. C render on InfoPath Client or Browser (thin client )
  6. S terminates session.

Where things break

At steps 1, 2, 3 and 4

Problem Solution
1 [1] C resolve the server as [IP x.y.z.w] but Sresolves as different IP and server encounter a template or form load error but does not report it to the user Try to browse the data connection urls from the server itself  and check if you encounter any problem , resolve accordingly(In some cases internal DNS record does not match the proper configuration  use hosts file entry to manually force the session to the same server)
2 [1] If you are behind a proxy or load balanced farm Try to configure it so that the server sessions are bound to a single server for the same client.
3 [1] If you are using a public name and internal name using AAM Make sure you are resolving the correct IP inside and outside the proxy/firewall see Item 1
4 [2] S can’t load the form or the template That should not cause 5566 but it will be more descriptive If you are using a proper proxy configuration , but in some cases where the proxy configuration is not correct this will show as error 5566
5 [3] If you have dynamic links for services that gets compiled at load You need to debug this by loading this form on the same server.
6 [3] If you are using SSL certs Make sure your server can validate the certificate or disable certificate validation.
7 [4] If read new or  update  operation Check the on load rules and see if those generate certain other web services or list connections that cause this issue and handle as in item 1 above

Best of Luck


SharePoint 2010 Services crashes post windows OS Upgrade

Attempted to upgrade one of my development machines from Windows 2008 Server Standard server into Windows 2008 Server Standards R2

And got a slew of errors (but thank God for VM snapshots )

Here are the errors and the fix.

OWSTimer.exe Timer service crashing.


System.IO.FileNotFoundException Occurred in SPUCHostService.exe

System.IO.FileNotFoundException Occurred in SPUCWorkerProcessProxy.exe



System.IO.FileNotFoundException Occurred in WebAnalyticsService.exe

System.IO.FileNotFoundException Occurred in OWSTimer.exe






It is mainly a version incompatibility with the Windows Identity Foundation


You need to install this version, of Windows Identity framework ( the one that matches the new OS level)

Windows6.1-KB974405-x64 downloads from TechNet.


You can force install this package using the SharePoint ISO or deployment media , (rerun the Prerequisite installer)



(If you have multiple Servers in the farm .you will need to do the same on each member server no matter what the role is .)


[Run the SharePoint Products Configuration wizard]


Also make sure you run the windows updates and install all the needed updates.

Best of luck.

SharePoint Issues post service pack /Cumulative updates or if you add or remove a member to the SharePoint Farm

SharePoint Issues post service pack /Cumulative updates or if you add or remove a member to the SharePoint Farm

I have had those in the SharePoint 2007 years but didn’t expect that with SharePoint 2010 the DEC 2011 cumulative update.

But here is a way to fix it

(run from the command prompt at the 14\bin\ as an administrator) or better, run the SP management PowerShell console as administrator and run the following command)

Note: the command is harmless but the farm might not be accessible during execution.

It performs a forced upgrade and that should clear all errors on Central Administration or any others encountered during the configuration wizard upgrade.

psconfig -cmd upgrade -inplace b2b -force -cmd applicationcontent -install -cmd installfeatures

This phase may take a while (depending on how large it your databases)

Make sure you are able to get to Central administration after this task completes

This has been a life saver in many cases

Best of luck