Tuesday, February 4, 2014
Today I observed a moment of silence as I shut down our longest-running Amazon Web Services EC2 instance. We started this instance on the evening of October 29, 2008, when we were first getting acquainted with AWS, and stopped it today, February 4, 2014. That's 5 years, 3 months, 6 days.
Tuesday, November 19, 2013
Crunching Billions of Log Lines With LogParser
Yesterday three different people came to me asking which browsers, and how many of each, our "users" are using to access our content. Since our products load on about 50 - 80 million pages a day we have a pretty good sample size. And a hell of a lot of log files to collect and analyze. Since I already summarize this info on each server daily it was rather simple to gather these summaries and tally them for the month.
These calls are from one of our services that handles an average of about 50,000,000 calls per day. Of course these calls are spread across several web servers, so I took the daily summary from each server & combined them per day, then took these daily summaries and combined them for last month (October 2013). In total this represents only about 5-8% of our total daily calls, but this particular service is the first called by many of our products so it is the best place from which to gather info like user agent distribution.
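As a rough illustration of how the monthly tally works, the per-server daily summary CSVs can be rolled up with logparser itself. This is a minimal sketch with hypothetical file paths, not my exact production command, but the input columns (UserAgent, UAHits) match the daily summary report shown later in this post:
logparser -i:CSV -o:CSV
 "SELECT UserAgent,
     SUM(UAHits) AS MonthlyHits,
     MUL(PROPSUM(UAHits), 100) AS MonthlyPercent
  INTO D:\Reports\UserAgent_Monthly_Summary.csv
  FROM D:\Reports\Daily\UserAgent_Summary_*.csv
  GROUP BY UserAgent
  ORDER BY MonthlyHits DESC"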
UserAgent | UAHits | UAPercent |
Internet Explorer | 516,408,427 | 34.05% |
Chrome | 318,859,924 | 21.02% |
Firefox | 262,120,296 | 17.28% |
Apple IOS | 165,269,836 | 10.90% |
Safari | 136,577,103 | 9.01% |
Android | 77,221,373 | 5.09% |
Other Mobile | 10,372,620 | 0.68% |
Other User Agents | 9,694,239 | 0.64% |
Search Bot | 5,097,159 | 0.34% |
Opera | 3,938,622 | 0.26% |
Monitoring | 5,527,439 | 0.36% |
IEMobile | 2,698,171 | 0.18% |
BlackBerry | 1,161,637 | 0.08% |
No User Agent | 1,119,479 | 0.07% |
Monitoring2 | 307,882 | 0.02% |
Gaming Device | 152,633 | 0.01% |
CMS | 44,012 | 0.00% |
wget | 6,034 | 0.00% |
curl | 1,784 | 0.00% |
Total Hits | 1,516,579,920 |
For our users Internet Explorer reigns supreme, but the percentage of hits for IE is down quite a bit from my sample last spring, where it represented just over 38% of the total. Since then iOS slipped a little from 11.3% to 10.9%, and Android rose from 3.12% to 5.09%. In total nearly 17% of our users access our content with "mobile" devices (phones, tablets, etc.). I suspect this is a little lower than the average in some corners of the Internet, but since the majority of our users access our content during the day on weekdays (which makes me question their productivity while at work....) it's no surprise that mobile is fairly low and desktop browsers are higher.
I've written much about my love affair with Microsoft's logparser, and that love continues. All of these log lines - over 1.5 billion of them - were crunched with logparser using only the CPUs of the servers that served the content, which are running 24/7 anyway. The bottom line is this info was gathered (along with a lot of other info) for free! That's right, instead of spending thousands or even tens of thousands of dollars for fancy third-party log analyzing tools, I've leveraged available CPU time, free software (logparser itself) and a little ingenuity to collect a great deal of useful information from a rather large amount of log files.
Friday, May 17, 2013
Improving Web Services (Part II)
Last year I wrote about how we improved web server performance with some fairly small changes, and about how I keep an eye on these metrics with reports I create by analyzing my web server logs with Microsoft's LogParser. This is a follow-up to those posts.
Recently we did an upgrade to our platform. One of the "improvements" was that our amazing DBA (he is truly amazing!) tightened up some of the SQL stored procedures used for returning dynamic data to our video players (playlists, video metadata, etc.). These "player services" get hit around 300,000,000 to 400,000,000 times per day, so even a small improvement can have far-reaching impact.
As I'm sure is common across much of the web, our traffic is lower at certain times of day, so I leverage the lower CPU load in the middle of the night to crunch the web server logs across my fleet of web servers. As this RRDTool graph shows, CPU load is considerably lower overnight, except when the server is processing its own log file analysis, which takes about an hour or so on each server. It's also worth noting that average response times are not negatively affected during this time - I know, because I keep a close eye on that!
Among the various pieces of data gleaned by this log processing is the time (in milliseconds) each response takes, as recorded by the server. This is very valuable information because I can definitively know the impact of various factors: systems deployments (such as the one that spurred this post...), performance under various load conditions (peak times vs. slow times), and performance during operations or maintenance windows (crunching logs, system updates, patches, etc.). Last but not least, when people come to me saying anecdotally "customers are saying our system is slow...", I can show them with certainty, both historically and at any point in time (I have some really good methods for running ad hoc reports to get up-to-the-minute stats), how our system is performing or has performed.
So any time we roll out a change of any kind I look at the data to understand the impact(s), if any. After this deployment of the new and improved SQL stored procedures I'm seeing approximately a 30% decrease in response times. That's a huge improvement!
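For the curious, the per-day comparison behind a claim like that is just a logparser aggregation over the IIS time-taken field. Here's a minimal sketch with hypothetical paths (my actual reports are produced per server by the nightly jobs):
logparser -i:W3C -o:CSV
 "SELECT date,
     COUNT(*) AS Requests,
     AVG(time-taken) AS AvgTime,
     MAX(time-taken) AS MaxTime
  INTO D:\Reports\AvgTimeByDay.csv
  FROM D:\LogFiles\IIS\W3SVC1\u_ex*.log
  GROUP BY date
  ORDER BY date"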
Besides faster loading on the client side, this is also causing noticeably lower load on both the front-end web servers and the database servers. That means we have more available capacity, or head room, with the same number of servers, or I could potentially shut down some of our AWS EC2 instances to save money. We have now set the bar even higher for the performance of our systems, and any future "improvements" or modifications can be accurately measured against it.
I love the fact that I have such good insight into these systems and can measure any impact of changes or varying load with great accuracy!
Sunday, March 31, 2013
Using Log Parser to Extract User Agents From Web Log Files
I've been meaning to write a follow-up (or two...) to my I LOVE LogParser post from a few months ago. The time has finally arrived.
At this point I must give credit where credit is due. The LogParser queries below are by no means unique, nor are they original. The best one I found was a little dated, so I took it, made it a little prettier, and adapted it for three distinct outputs. First is my daily User Agents Summary report (below). This is a high-level report showing the types and distribution of browsers hitting our sites. While others publish similar information regularly, this applies directly to us and our products, which gives us good, reliable information we can convey to our partners and use to decide where to focus development and QA time.
Every day I collect and analyze (at least at a high level) somewhere around 1/2 a billion (yes, billion) web server log lines. And about that many more from a couple CDNs. Needless to say that's quite a bit of information. Besides the mundane stuff like number of hits, bandwidth served, hits per file type, etc. I've recently buckled down and written a few pretty good scripts with LogParser to extract and count User Agents.
I know even this isn't all that sexy or sophisticated, and that numerous companies selling analytics have already solved this, but since I have the data right at my fingertips, why should I pay someone else tens of thousands of dollars to analyze my logs and give me a pretty web interface? Yeah, I'll admit that would be nice, but for what I'm after I'm able to get what I need with just a little elbow grease.
This pursuit actually began several months ago when my boss came to me and asked how many and what types of Android devices were hitting our services. Since our product runs on numerous sites around the web we get all kinds of traffic. And, of course, many people (our partners, etc.) say our product has to run on this Android device or that Android device. But with so many devices out there, all running so many different OS versions, it's crazy. This image (from this phandroid.com article) shows it quite well.
Figure 1 - Android Fragmentation.
The numbers in this summary report and others in this post come from a single web server (I have lots more) for one day (March 28, 2013 UTC). So, this is current as of this posting. (See below for the exact LogParser query I'm using for this summary report.)
UserAgent | UAHits | UAPercent |
Internet Explorer | 2,611,065 | 38.03% |
Chrome | 1,322,187 | 19.26% |
Firefox | 1,241,372 | 18.08% |
Apple IOS | 775,693 | 11.30% |
Safari | 620,618 | 9.04% |
Android | 214,479 | 3.12% |
Other Mobile | 25,171 | 0.37% |
Opera | 19,262 | 0.28% |
Other User Agents | 15,342 | 0.22% |
IEMobile | 6,188 | 0.09% |
No User Agent | 5,061 | 0.07% |
BlackBerry | 4,481 | 0.07% |
Search Bot | 3,407 | 0.05% |
Gaming Device | 590 | 0.01% |
Total Hits | 6,864,916 |
This summary is great and very useful, but we certainly need some detail. Since the detail report is over 100 lines long I'm only going to show about a dozen lines here. Again, this detail is great for us so we know which versions of which browsers are being used to access our content at any given point.
UserAgent | UAHits | UAPercent |
IE 9 | 1,170,034 | 16.99 |
Firefox 19 | 968,039 | 14.06 |
Chrome 25 | 846,618 | 12.30 |
IE 8 | 815,316 | 11.84 |
Safari 5 | 622,317 | 9.04 |
iPad | 553,625 | 8.04 |
IE 10 | 331,290 | 4.81 |
Chrome 26 | 289,195 | 4.20 |
IE 7 | 286,986 | 4.17 |
iPhone | 234,653 | 3.41 |
Android 4.1 | 72,227 | 1.05 |
Android 2 | 64,688 | 0.94 |
Android 4.0 | 54,877 | 0.80 |
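The query behind this detail report isn't reprinted here, but it follows the same nested-CASE pattern as the summary query at the end of this post, just keyed on version-specific substrings (remember that IIS logs spaces in the User-Agent field as '+'). A heavily trimmed, hypothetical sketch:
select case strcnt(cs(User-Agent),'MSIE+9.0') when 1 THEN 'IE 9'
 else case strcnt(cs(User-Agent),'MSIE+8.0') when 1 THEN 'IE 8'
 else case strcnt(cs(User-Agent),'Firefox/19') when 1 THEN 'Firefox 19'
 else case strcnt(cs(User-Agent),'Chrome/25') when 1 THEN 'Chrome 25'
 ELSE 'Other' End End End End
AS UserAgent, count(cs(User-Agent)) AS UAHits, MUL(PROPCOUNT(*),100) AS UAPercent
INTO D:\Reports\UserAgent_Detail.csv
FROM D:\Logs\*.log
GROUP BY UserAgent ORDER BY UAHits DESC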
Finally, the thing I was really after - what type of Android devices are being used to access our content? Just like the detail report this is only a partial list.
UserAgent | UAHits | PercentOfAndroid |
Other Android Devices | 52,470 | 24.11 |
Samsung Galaxy S III | 25,903 | 11.90 |
Motorola Droid RAZR | 16,558 | 7.61 |
Samsung Galaxy Tab | 12,285 | 5.64 |
Samsung Galaxy S II | 9,589 | 4.41 |
Samsung Galaxy Nexus 7 | 6,772 | 3.11 |
Samsung Galaxy Nexus | 6,458 | 2.97 |
Samsung Galaxy Note II | 6,270 | 2.88 |
HTC EVO | 6,177 | 2.84 |
B&N Nook | 4,271 | 1.96 |
Motorola Droid Bionic | 3,736 | 1.72 |
Asus Transformer Pad | 3,279 | 1.51 |
Motorola Droid RAZR M | 3,140 | 1.44 |
Motorola Droid X | 2,479 | 1.14 |
Motorola Droid 4 | 2,308 | 1.06 |
Motorola Xoom | 2,278 | 1.05 |
Kindle Fire | 1,498 | 0.69 |
Kindle Fire HD | 728 | 0.33 |
It's no surprise that there are dozens and dozens of Android devices all browsing the web and hitting sites like ours. One little surprise is that Barnes & Noble's Nook registered higher than the Kindle Fire. So many devices, so little time.
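Likewise, the device-level query isn't shown here, but the approach is the same: restrict to Android hits, then bucket on device-specific substrings of the User-Agent. A tiny, hypothetical sketch (the actual model tokens, like 'GT-I9300', have to be dug out of your own logs):
select case strcnt(cs(User-Agent),'GT-I9300') when 1 THEN 'Samsung Galaxy S III'
 else case strcnt(cs(User-Agent),'Nexus+7') when 1 THEN 'Nexus 7'
 else case strcnt(cs(User-Agent),'Xoom') when 1 THEN 'Motorola Xoom'
 ELSE 'Other Android Devices' End End End
AS UserAgent, count(*) AS UAHits, MUL(PROPCOUNT(*),100) AS PercentOfAndroid
INTO D:\Reports\Android_Devices.csv
FROM D:\Logs\*.log
WHERE strcnt(cs(User-Agent),'Android') >= 1
GROUP BY UserAgent ORDER BY UAHits DESC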
Here's the Log Parser query I'm using for the User Agent summary (above).
select case strcnt(cs(user-agent),'Android') when 1 THEN 'Android'
else case strcnt(cs(user-agent),'BlackBerry') when 1 THEN 'BlackBerry'
else case strcnt(cs(user-agent),'iPad') when 1 THEN 'Apple IOS' when 2 THEN 'Apple IOS'
else case strcnt(cs(user-agent),'iPhone') when 1 THEN 'Apple IOS' when 2 THEN 'Apple IOS'
else case strcnt(cs(user-agent),'iPod') when 1 THEN 'Apple IOS' when 2 THEN 'Apple IOS'
else case strcnt(cs(user-agent),'Opera') when 1 THEN 'Opera'
else case strcnt(cs(user-agent),'Chrome') when 1 THEN 'Chrome'
else case strcnt(cs(user-agent),'Safari') when 1 THEN 'Safari'
else case strcnt(cs(user-agent),'IEMobile') when 1 THEN 'IEMobile'
else case strcnt(cs(user-agent),'MSIE') when 1 THEN 'Internet Explorer'
else case strcnt(cs(user-agent),'Firefox') when 1 THEN 'Firefox'
else case strcnt(cs(user-agent),'Googlebot') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'Yahoo!+Slurp') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'bingbot') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'Yandex') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'Baiduspider') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'loc.gov') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'crawler@alexa.com') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'Mobile') when 1 THEN 'Other Mobile'
else case strcnt(cs(user-agent),'PlayStation') when 1 THEN 'Gaming Device'
else case strcnt(cs(user-agent),'Nintendo') when 1 THEN 'Gaming Device'
else case strcnt(cs(user-agent),'curl') when 1 THEN 'curl'
else case strcnt(cs(user-agent),'wget') when 1 THEN 'wget'
else case strcnt(cs(user-agent),'-') when 1 THEN 'No User Agent'
ELSE 'Other User Agents' End End End End End End End End End End End End End End End End End End End End End End End End
AS UserAgent, count(cs(User-Agent)) AS UAHits, MUL(PROPCOUNT(*),100) AS UAPercent
INTO D:\Reports\UserAgent_Summary.csv
FROM D:\Logs\
GROUP BY UserAgent ORDER BY UAHits DESC
Friday, October 5, 2012
AWS VPC VPN with SonicWALL NSA and PRO Series Firewalls
Recently Amazon announced, (see also) "You can now create Hardware VPN connections to your VPC using static routing." This is great news, as it greatly expands the types of devices from which a point-to-point IPSec VPN can be created to your Virtual Private Cloud. Previously only dynamic routing was supported, which required BGP and a device that supports it (like a Cisco ISR). With static routing, devices like Cisco ASA 5500 firewalls, and even Microsoft Windows Server 2008 R2 (or later), can be used. And, as I finally got working, SonicWALL firewalls (I connected with an NSA 2400, but I'm sure others will work as well).
Here's what I did to get my statically routed point-to-point IPSec VPN set up between my Amazon Virtual Private Cloud (VPC) and a SonicWALL NSA 2400.
First, create a VPC. Here is a great step-by-step guide to create a VPC: How to Create an Amazon VPC.
In the VPC Management Console click on VPN Connections, select your VPN (you may only have one), then click Download Configuration. Next to Vendor select Generic, then Download.
This file contains all the critical information you'll need, like pre-shared keys, IP addresses, etc.
Connect to your SonicWALL's web interface and perform the following.
Step 1 - Create Address Object
Go to Network, select Address Objects. In the Address Objects section, click the Add button and configure with these settings:
- Name: VPC LAN (this is arbitrary)
- Zone Assignment: VPN
- Type: Network
- Network: the network portion of the VPC CIDR (see the example below)
- Netmask: the subnet mask portion of the VPC CIDR
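For example, with a hypothetical VPC CIDR of 10.0.0.0/16 (yours is in the downloaded configuration file), these two values would be:
- Network: 10.0.0.0
- Netmask: 255.255.0.0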
Step 2 - Create New VPN Policy
Go to VPN > Settings and add a new policy, using the following information:
- General Tab
- Authentication Method: IKE using Preshared Secret
- Name: Any name you choose
- IPsec Primary Gateway: IP address from downloaded config
- IPsec Secondary Gateway: Secondary IP address from config
- Shared Secret: Shared secret from config
- Network Tab
- Local Networks: Select appropriate setting for your environment
- Destination Networks: VPC LAN from previous step
- Proposals Tab
- IKE (Phase 1) Proposal:
- Exchange: Main Mode
- DH Group: Group 2
- Encryption: AES-128
- Authentication: SHA1
- Life Time: 28800
- IPsec (Phase 2) Proposal:
- Protocol: ESP
- Encryption: AES-128
- Authentication: SHA1
- DH Group: Group 2
- Life Time: 28800
- Advanced Tab
- Set as required for your environment.
Once all the settings are correct you should be able to see the tunnel status in both your SonicWALL and AWS Console. Test connections over the tunnel using ICMP ping or other methods.
VPN Status from SonicWALL
VPN Status from AWS Console
One Guy, Two Blog Posts
I'm not much into self-aggrandizement, but WTH, here goes.
This evening while watching a disappointing baseball game I wrote a little about what I do at work, which includes overseeing a network of systems that served over 58,000,000 page views yesterday (a new single-day record for us with much more traffic soon...). To load our content on this many pages my systems answered somewhere around 1 billion calls in 24 hours. This is a pretty good overview of what I do while sitting behind a computer all day, and what I've been working on for the last four years.
A few of my fellow employees used to work for a quasi-competitor of ours and went to lunch recently with some of their former colleagues who, when they found out our company only has one "IT guy", were amazed. Who is that masked man, anyway....
For those fellow geeks out there this is a little more insight into my world.
Enjoy!
I LOVE LogParser
Recently I wrote about some code improvements we deployed that had a huge impact, causing our web service calls to be answered much quicker and saving us a fair amount of money each month. In that post I talked about using Cacti to graph server statistics (like CPU and IIS stats), and service response times. That post also included this spreadsheet showing how we lowered service response times considerably.
Figure 1 - Spreadsheet comparing average load time before and after code deployment.
This post is about how I use Microsoft's LogParser to extract very useful information from my IIS web server logs. About LogParser Microsoft says:
Log parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®. You tell Log Parser what information you need and how you want it processed. The results of your query can be custom-formatted in text based output, or they can be persisted to more specialty targets like SQL, SYSLOG, or a chart.
Most software is designed to accomplish a limited number of specific tasks. Log Parser is different... the number of ways it can be used is limited only by the needs and imagination of the user. The world is your database with Log Parser.
I use LogParser for a variety of tasks, but this is all about getting some good summary data from my IIS logs. (See Log Parser Rocks! More than 50 Examples! for some great examples of using LogParser - I've used this as a reference a lot!)
Overnight, when load on my IIS servers is considerably lower than during the day, I have scheduled jobs that summarize information from the logs, then zip them & write them to S3 for longer-term storage. For a little perspective, my IIS servers handle somewhere around a cumulative total of 500,000,000 to 600,000,000 (yes, over 1/2 a BILLION) requests per day. Since every request is logged (and I log every possible parameter in IIS) that's over 1/2 a billion log lines in total. Each and every day. Each of my IIS servers logs between 10-25 million requests per day - these log files come in at about 5-10 GB per server per day.
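The scheduling side of this is nothing fancy; a bare-bones sketch with the built-in Task Scheduler might look something like the following (the task name, start time and script path are hypothetical - my actual batch files do the summarizing, zipping and S3 upload):
schtasks /create /tn "Nightly IIS Log Summary" /sc DAILY /st 01:30 /tr "D:\Scripts\SummarizeIISLogs.cmd"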
Figure 2 - Windows Explorer view showing rather large IIS log files on one server.
I guess I should address the question of why I log all requests and why I spend the time to analyze the logs. Besides the fact that I'm a bit of a data junkie, and that it gives me bragging rights that I collect so much data, the main reason is that it gives me a lot of very useful information. These summaries tell me the number of requests, bandwidth served, average response time (both by hour and by minute - see spreadsheet above), the number and distribution of server responses (i.e. HTTP responses like 200, 302, 404, 500, 503, etc.), website referrers and much more. Then I use this summary data to compare servers - I can compare them to one another for a day, and over time. Of major significance is the fact that I can get a fairly quick and granular view of the impact to our systems of changes like code deployments (see my post, "Improving Web Server Performance in a Big Way With Small Changes" for more on this).
I'm not going to bore you with all my logparser commands. After all, I'm not doing anything special & what most of them do has been well documented elsewhere. I am going to show a couple of things I do with logparser, both in my daily summarization of logs and on-demand to view the immediate effect of changes.
Note: I'm using line breaks in the following commands for readability.
Hits By Hour
logparser -i:W3C -o:CSV
"SELECT TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time), 3600)) AS DateTime,
count(*) as Requests,
MUL(PROPCOUNT(*), 100) AS Percentage,
SUM(sc-bytes) AS TotalBytesSent,
MIN(time-taken) AS MinTime,
MAX(time-taken) AS MaxTime,
AVG(time-taken) AS AvgTime,
Div(TotalBytesSent,Requests) AS BytesPerReq
INTO D:\LogFiles\IIS\Reports\%IISTYPE%\%IISTYPE%_%IISDT%_%COMPUTERNAME%_%RTIME%_HitsByHourSummary.csv
FROM D:\LogFiles\IIS\%IISSITE%\ex%IISDT%*.log
GROUP BY DateTime ORDER BY DateTime"
Hits By Minute
logparser -i:W3C -o:CSV
"SELECT TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time), 60)) AS DateTime,
count(*) as Requests,
MUL(PROPCOUNT(*), 100) AS Percentage,
SUM(sc-bytes) AS TotalBytesSent,
MIN(time-taken) AS MinTime,
MAX(time-taken) AS MaxTime,
AVG(time-taken) AS AvgTime,
Div(TotalBytesSent,Requests) AS BytesPerReq
INTO D:\LogFiles\IIS\Reports\%IISTYPE%\%IISTYPE%_%IISDT%_%COMPUTERNAME%_%RTIME%_HitsByMinuteSummary.csv
FROM D:\LogFiles\IIS\%IISSITE%\ex%IISDT%*.log
GROUP BY DateTime ORDER BY DateTime"
Breakdown
- logparser -i:W3C -o:CSV - this tells logparser the input file is W3C and to output results as CSV
- SELECT TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time), 60)) AS DateTime - this little gem breaks down the results per minute (using 60) or per hour (3600); you could specify any number of seconds for this - at times when I'm feeling randy I'll break it out into 5 minute intervals with 300
- count(*) as Requests - this gives me a total count of the log lines
- MUL(PROPCOUNT(*), 100) AS Percentage - assigns a percentage of total calls to the specified time period
- SUM(sc-bytes) AS TotalBytesSent - total bytes sent (server to client) from server
- MIN(time-taken) AS MinTime - lowest time (in milliseconds) for a response
- MAX(time-taken) AS MaxTime - highest time (in milliseconds) for a response
- AVG(time-taken) AS AvgTime - average time (in milliseconds) for a response (this is the one I'm really after with this command - more on this in a minute...)
- Div(TotalBytesSent,Requests) AS BytesPerReq - bytes per request or how much data was sent to the client on average for each request (another very useful one)
- INTO D:\LogFiles\IIS\Reports\%IISTYPE%\%IISTYPE%_%IISDT%_%COMPUTERNAME%_%RTIME%_HitsByMinuteSummary.csv - into summary file (variables are used in the scheduled job and should be self explanatory. One to note, however, is the %IISDT% [DT is for date] variable which I've documented in Get Yesterday's date in MS DOS Batch file)
- FROM D:\LogFiles\IIS\%IISSITE%\ex%IISDT%*.log - from file (once again variables used in batch process)
- GROUP BY DateTime ORDER BY DateTime - finally group by and order by clauses
Here's a look at the results of the hits by hour query, with a summary showing the particularly useful average time per request.
Figure 3 - results of logparser hits by hour query.
Wait, There's More
I use the hits by minute query both for summary data and to take a near real-time look at a server's performance. This is particularly useful for checking the impact of load or code changes. In fact, just today we had to push out some changes to a couple of SQL stored procedures, and to determine whether these changes had a negative (or positive) impact on the web servers I ran the following:
logparser -i:W3C -o:TSV "SELECT TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time), 60)) AS DateTime, count(*) as Requests, SUM(sc-bytes) AS TotalBytesSent, Div(TotalBytesSent,Requests) AS BytesPerReq, MIN(time-taken) AS MinTime, MAX(time-taken) AS MaxTime, AVG(time-taken) AS AvgTime FROM D:\LogFiles\IIS\W3SVC1\u_ex%date:~12,2%%date:~4,2%%date:~7,2%*.log GROUP BY DateTime ORDER BY DateTime"
NOTES
- First, notice I'm outputting it as a TSV and not specifying an output file (with the INTO command). This is because I want the results displayed at the command prompt where I'm running it, and want them tab separated to make them easier to view.
- Second, and I'm quite proud of this, is "u_ex%date:~12,2%%date:~4,2%%date:~7,2%*.log". These variables parse today's date and format it in the default YYMMDD format used by IIS log file names. See the results of the echo u_ex%date:~12,2%%date:~4,2%%date:~7,2%*.log command below.
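For example, assuming the default US short date format where %date% expands to something like "Fri 10/05/2012" (an assumption - other regional date settings need different substring offsets), the expansion looks like this:
C:\>echo u_ex%date:~12,2%%date:~4,2%%date:~7,2%*.log
u_ex121005*.log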
Figure 4 - results of echo u_ex%date:~12,2%%date:~4,2%%date:~7,2%*.log
So, when we pushed the code change today I ran the above command on each of my front-end servers to determine whether there was a negative (or positive) impact. By breaking the results, particularly the average time per response, out by minute I could see whether there was any impact immediately after the code was pushed. There wasn't - at least that's what I determined after a little while of thinking there had been a negative impact from the code push....
Notice the lines at 13:25 and 13:39 where the average time per response increased to 149 and 273 respectively. It turns out that during the minute or so logparser was running (even though task manager didn't show much of a noticeable CPU load hit) the average response time climbed quite a bit.
Improving Web Server Performance in a Big Way With Small Changes
A week ago today we pushed out some code to our IIS and MS SQL servers that has had a huge impact. In a good way - a very good way.
First, a little background. I work for an online video company that has "widgets" and video "players" that load on thousands of websites millions of times a day. Each time one of these entities loads numerous calls get made to our systems for both static and dynamic content. One of the highest volume services is what we call "player services," where calls are made to display playlists, video meta data, etc. based on our business rules. For example, can a particular video be displayed on this website, etc. Anyway, these player services get hit a lot in a given day, about 350,000,000 times (or more) a day. That's over 1/3 of a billion (yes, billion, with a B) times a day to just this one service (this represents about 1/3 of the total hits to the systems I oversee. You do the math...).
We do have a scalable infrastructure which has steadily been growing over the past few years. In fact, we've grown tremendously as you can see by this Quantcast graph.
Figure 1 - Traffic growth over the past few years.
Over the years we've stumbled a few times, but we have learned as we've grown. (I remember when we had our first Drudge link & it caused all kinds of mayhem. Recently we had five links on Drudge simultaneously and barely felt it.) We've also had many breakthroughs that have made things more efficient and/or saved us money. This one did both.
We have fairly predictable daily traffic patterns where weekdays are heavier than weekends, and daytime is heavier than nighttime. Using a couple of my favorite tools (RRDTool & Cacti) I am able to graph all kinds of useful information over time to know how my systems are performing. The graph in figure 2 shows the number of web requests per second to a particular server. It also shows the relative traffic pattern over the course of a week.
Figure 2 - Cacti RRDTool graph showing daily traffic pattern over one week for one server.
I use a great little Cacti plug-in called "CURL BWTEST Response Time Graph Template" to monitor response times for various calls. As figure 3 shows, our average service response times would climb during peak traffic times each day. I monitor these closely every day, and if the response times get too high I can add more servers to bring them back down to an acceptable level.
Figure 3 - Cacti CURL graph showing service call response time.
Here's the exciting part, finally. As you can see in figure 3 we no longer have increasing response times during the day when our load increases. How did we do it, you ask? Well, a couple of really sharp guys on our team, our DBA and a .NET developer, worked on several iterations of changes to both SQL stored procedures and front-end code that now deliver the requested data quicker and more efficiently. Believe it or not this took a little work. Earlier I alluded to stumbling a few times as we've grown, and that happened here. A couple weeks ago we made a few additions to our code and deployed it. The first time we basically took down our whole system for a few minutes until we could back it out. Not only did that scare the crap out of us it made us look bad.
So we went back to the drawing board and worked on the "new and improved" addition to our code. After a couple more false starts we finally cracked this one. Without boring you with the details our DBA made the stored procedures much more efficient and our front-end developer made some tweaks to his code & the combination was dynamite.
As figure 3 shows, we aren't suffering from increased response times for calls under heavier load conditions. This in and of itself is fantastic; after all, quicker responses equals quicker load times for our widgets and players which equals a better user experience, which ultimately should translate to increased revenue. But this isn't the only benefit of this "update." The next improvement is the fact that overall the responses are quicker. Much quicker. As you can see in figure 4 on Sept. 14 (before the code deployment) the average response time (broken out per hour) increased dramatically during peak traffic times and averaged a little over 100 ms for the day. After the deployment, looking at data from Oct. 4, not only did the average response time stay flat during the higher traffic times, but it is down considerably overall. Responses now average 25 ms at all hours of the day, even under heavy load. This is a great double whammy improvement! In fact, Oct. 4 was a much heavier load day than 9/14, so not only were the responses quicker, the server handled way more traffic.
Figure 4 - Spreadsheet comparing average load time before and after code deployment.
So far I've discussed the benefits on the front-end web servers, but we've also seen a dramatic improvement on the back-end database servers that service the web servers. Figure 5 shows the CPU utilization of one of our DB servers over the past four weeks; since the deployment a week ago it has averaged almost half of what it did before. This enabled me to shut down a couple of database servers to save quite a bit of money. I've also been able to shut down several front-end servers.
Figure 5 - Cacti graph of database server CPU utilization over 4 weeks.
Due to this upgrade I have been able to shut down a total of 12 servers, saving us over $5000 per month; made the calls return quicker, causing our widgets and players to load faster - at all hours of the day; and made our infrastructure even more scalable. Win, win, win!
For more information, and if you're feeling especially geeky see my post, "I LOVE LogParser" for details on how I am able to use Microsoft's logparser to summarize hundreds of millions of IIS log lines each day, and on-demand when needed to keep a good pulse on just what my servers are doing.
Wednesday, October 3, 2012
Enable Quick Launch in Windows 7, Windows 2008 and 2008 R2
Call me old fashioned. Say I'm stuck in the past. Whatever. I just don't like a lot of the things Microsoft has done to the Windows interface/desktop over the years. Every time I get a new computer or start up a new server certain things have to be done to make it usable, i.e. not annoying to me. One of those things is to enable the Quick Launch bar.
So to add Quick Launch to any Windows 7 or Windows 2008 server do the following.
- Right-click an empty area of the taskbar, select Toolbars, then click New toolbar.
- In the dialog box, enter %AppData%\Microsoft\Internet Explorer\Quick Launch, then click Select Folder.
Now you can customize Quick Launch by displaying small icons, no labels, moving it, etc.
Monday, October 1, 2012
Windows Command Line WhoIs
I regularly find myself trying to find the owner of a domain or needing other information, like authoritative name servers. A few command line whois.exe programs exist out there for Windows, but the one I like best is the one by Mark Russinovich at Sysinternals. You can visit the previous link to download it, or use wget http://download.sysinternals.com/files/WhoIs.zip at the command line (assuming you have wget for Windows).
One little trick I like to do is place whois.exe in the Windows system32 directory (c:\windows\system32 for example), because this directory is normally in your system path. This way whois can be executed in a command prompt from any directory.
Here's a look at whois microsoft.com: