
Is WebAssembly the dawn of a new age of web performance?

This post was contributed by Intechnica Performance Architect Cristian Vanti. Check out another of his posts, “Performance Testing is not a luxury, it’s a necessity”.

Even though the internet was created several years earlier, I think that the birth of the World Wide Web as we know it coincides with the release of the Mosaic browser in 1993.

In the past 22 years, everything to do with the web has changed incredibly fast and very few things have resisted change during this time. I can think of only three examples of this:

  • IPv4 (born in 1981)
  • HTTP 1.1 (born in 1997)
  • Javascript (born in 1995)

The first was superseded years ago, although IPv6 still hasn’t been fully adopted despite several pushes.

In February, HTTP/2 was finally formally approved, and it should quickly replace version 1.1 after 18 years.

Yet Javascript, after 20 years, is still the only language universally used in web browsers. There were some attempts to replace it with Java applets, Flash or Silverlight, but none of them ever threatened Javascript’s position. On the contrary, it has started to conquer the servers as well (prime example: Node.js).

On the server side, a plethora of different languages has been created to simplify the development of web applications. On the front end, Javascript has been the only real option.

On 17th June 2015, Google, Microsoft and Mozilla jointly announced WebAssembly. This could be a turning point for front end development, for several reasons.

Firstly, while there have been several attempts to replace Javascript, each one was backed by a single player. This time the three main browser makers have joined forces.

Secondly, they decided not to replace Javascript in a disruptive way, but rather to put a new binary format, a sort of bytecode, alongside it. Users will not see any difference; everything will continue to work the same way for whoever wants to stay with Javascript, but a huge opportunity has been created for those who want to build faster applications.

Thirdly, the performance improvement that WebAssembly could bring is impossible to achieve by any other means.

And lastly, WebAssembly is a brilliant solution, something so simple but so powerful, something that should have been invented years ago.

WebAssembly is simply a binary format for Javascript. It isn’t a real bytecode: it is a binary encoding of the Javascript Abstract Syntax Tree (AST), the product of the first step of Javascript parsing, nothing more. It is not a new framework, not a new language, not a new source of vulnerabilities. It is not another virtual machine, but still the good old Javascript one.

In this way the web server will not send plain Javascript text, but the result of that first parsing step in a binary format. The benefits will be a more compact payload and less work for the browser’s compiler.

But the full potential comes from the use of asm.js, a highly optimizable subset of Javascript that Mozilla created some time ago and that already runs in all the biggest browsers. asm.js code is only slightly slower than C code, giving CPU-intensive applications a great opportunity. Moreover, there are already cross-compilers that can take other languages (C, C++, Java, C#, etc.) and produce asm.js code. This means it has already been possible to compile game engines to asm.js, and the same will happen for heavy desktop applications like CAD tools or image editors.
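To make this concrete, here is a minimal sketch of what asm.js looks like (my own illustrative example, not taken from the announcement). Types are declared through ordinary Javascript coercions such as "x | 0" for 32-bit integers, so the code still runs in any browser, while engines that recognise the "use asm" directive can compile it ahead of time to near-native speed.

function MiniMath(stdlib, foreign, heap) {
  "use asm";

  function add(x, y) {
    x = x | 0;           // declare x as a 32-bit integer
    y = y | 0;           // declare y as a 32-bit integer
    return (x + y) | 0;  // the result is a 32-bit integer too
  }

  return { add: add };
}

// Still plain Javascript: it works everywhere, just faster where asm.js is optimised.
var math = MiniMath(window, null, new ArrayBuffer(0x10000));
console.log(math.add(2, 3)); // 5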

Silently asm.js and WebAssembly are leading us to a new internet age.

The Twitter Performance Revamp

So, Twitter have attempted to improve performance of their site again.

Interestingly, they’ve decided to back away from the JS-driven #! (hashbang) approach that caused so much fuss last year, and move back to an initial-payload method (i.e. they’re sending the content with the original response, instead of sending back a very lightweight response and loading the content with JavaScript).

Their reasons for doing so all boil down to “We think it makes the site more performant”. They’ve based their measurement of “How fast is my site?” on the time until the first tweet on any page is rendered, and with some (probably fairly involved) testing, they’ve decided that the initial-payload method gives them the fastest “first tweet” time.

The rest of their approach involves grouping their JavaScript into modules, and serving modules in bundles based on the required functionality. This should theoretically allow them to only deliver content (JavaScript, in this case) that they need, and only when they need it.

I have a few issues with the initial-payload approach, from a performance optimisation standpoint (ironic, huh?), and a few more with their reasons for picking this route.

Let’s start right at the beginning, with the first request from a user who’s never been to your site before (or any site at all, for that matter). They’ve got a completely empty cache, so they need to load all your assets. These users probably account for between 40 and 60 percent of your traffic.

For an empty-cache request to an arbitrary tweet, the user has to load 18 separate items of content, plus perform a POST request to “/scribe”, which seems to be some form of tracking. Interestingly, this POST request is the second request fired from the page, and it blocks the load of static content, which is obviously fairly bad.

A Firebug net panel break-down of the load times from the linked tweet (click to view full size)

Interestingly, most of the content loaded by the page is not actually rendered in the initial page view; it’s additional content needed for functionality that isn’t used from the get-go. The response is also not minified; there’s a lot of whitespace in there that could quite easily be removed, either on the fly (which is admittedly CPU intensive) or at application deploy time.

And there we stumble upon my issue with Twitter’s reasons for taking the initial-payload approach to content delivery: they’re doing it wrong.

They are sending more than just the initially required content with the payload; in fact they’re sending more than 10 modal popouts, each of which is obviously only used in one specific circumstance. Quite simply, this is additional data that does not need to be transferred to the client yet.

I’ve un-hidden all the additional content from the linked tweet page in this screenshot, there are around 10 popups. I particularly like the SMS codes one, I wonder what percentage of Twitter users have ever seen that?

The worst thing is, if someone comes to the new Twitter page without JavaScript enabled, they’re going to get a payload that includes a bunch of modal popouts that they cannot use without a postback and a new payload. So people who have JS enabled are getting data in the initial payload that could be loaded (and cached) on the fly after the page is done rendering, while people without JS enabled are getting data in the initial payload that they simply can’t make use of.

How would I handle this situation, in an ideal world?

Well, first I’d ditch the arbitrary “first-tweet” metric. If they were really worried about the time until the first tweet is rendered, they’d be sending the first tweet, and the first tweet only, with the initial payload. That’s a really easy way to hit your performance target: make an arbitrary one, and arbitrarily hit it. My performance target would be the time until the page render is complete, i.e. the time from the initial request to the server right up to the point where the background is done rendering and the page is fully functional, so that the user can achieve the action they came to the page to do. If the page displays a single Tweet, it is fully functional when the Tweet is readable; if it is a sign-up page, it is fully functional when the user can begin to fill out the sign-up form, and so on. This forces the developers to think about what really needs to go with the initial payload, and what content can be dynamically loaded after rendering is complete.

I would then look at the fastest way to get to a fully functional page, as soon as possible. In my opinion, the way to do this is to have a basic payload, which is as small as possible but large enough to give the end-user an indication that something is happening, and then dynamically load the minimum amount of data that will fill the page content. After the page content is retrieved and rendered, that’s the time to start adding things like popouts, adverts, fancy backgrounds, etc., things that add to the user experience, but that are not needed in the initial load in 100% of cases.

I’ve written a really simple application to test how fast I can feasibly load a small amount of data after the initial payload, and I estimate that by using the bare minimum of JavaScript (a raw XHR request, no jQuery, no imports) at the bottom of the BODY tag, I can cut parsing down to around 10ms, and return a full page in around 100ms locally (obviously the additional content loads are not counted towards this, I’d probably not load the images automatically if this were a more comprehensive PoC).

A break-down of the mockup I made, which loads the tweet content from the original linked tweet. It’s a contrived example, but it’s also a good indicator for the sort of speed you can get when you really think about optimising a page.

Add in some static assets and render time, and you’re still looking at a very fast page turnaround, as well as a very low-bandwidth cost.
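For illustration, here is roughly the shape of that mockup (a simplified sketch of my own; the “/tweet-content” URL, the JSON field names and the “tweet” placeholder element are invented for the example). The initial payload contains little more than a placeholder div, and a raw XHR at the very bottom of the BODY tag pulls in just the content the visitor came for.

// Placed in a script at the very bottom of the BODY tag: a raw XHR,
// no jQuery, no imports. "/tweet-content" and the JSON fields are
// hypothetical names for this sketch.
var xhr = new XMLHttpRequest();
xhr.open('GET', '/tweet-content', true);
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4 && xhr.status === 200) {
    var data = JSON.parse(xhr.responseText);
    // Render only the content the visitor came for; popouts, adverts and
    // other extras can be fetched after this point.
    document.getElementById('tweet').textContent = data.author + ': ' + data.text;
  }
};
xhr.send();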

The idea that Twitter will load all their JS in modules is not a completely bad one. Obviously there will be functionality that has its own JS scripts that can be loaded separately, and loading them when not needed is fairly daft. However, it really depends on numbers; with each separate JS payload you’re looking at a few hundred milliseconds of wait time (if you’re not using a cached version of the script), so you’ll want to group the JS into as few packages as possible. If you can avoid loading packages until they’re really needed (i.e. if you have lightbox functionality, you could feasibly load the JS that drives the functionality while the lightbox is fading into view) then you can really improve the perceived performance of your site. The iPhone is a good example of this; the phone itself is not particularly performant, but by adding animations that run while the phone is loading new content (such as while switching between pages of apps on the homescreen) the perceived speed of the phone is very high, as the user doesn’t notice the delay.
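As a sketch of that lazy-loading idea (the bundle path, the “photo” element and the openLightbox function are all hypothetical names, not Twitter’s code), a script can be injected the first time the feature is triggered, while the opening animation hides the wait:

// Inject a script bundle on demand; the names below are invented for the example.
function loadScript(src, done) {
  var s = document.createElement('script');
  s.src = src;
  s.onload = done;
  document.body.appendChild(s);
}

var lightboxReady = false;
document.getElementById('photo').addEventListener('click', function () {
  if (lightboxReady) {
    openLightbox(); // already downloaded on a previous click
    return;
  }
  // The bundle downloads while the lightbox fades into view,
  // so the user never perceives the extra round trip.
  loadScript('/js/lightbox-bundle.js', function () {
    lightboxReady = true;
    openLightbox(); // provided by the bundle (hypothetical)
  });
});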

The final thing I would look at is JavaScript templates. John Resig wrote a very small, very simple JS template function that can work very well as a micro view engine; my testing indicates it can be parsed in around 1ms on my high-spec laptop. Putting this inline in the initial payload, along with a small JS template that can be used to render data returned from a web service, allows a very fast render turnaround. Admittedly, on browsers with slow JS parsing this will give a worse response turnaround, but that’s where progressive enhancement comes into play. A quick browser sniff would allow you to pass a parameter to your service telling it to return raw HTML instead of a JSON string, which would allow faster rendering where JS is your bottleneck.
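To show the shape of that approach (this is not Resig’s actual function, just a much smaller stand-in; the template markup, field names and sample data are invented), an inline template plus a JSON response can be turned into markup in a couple of lines:

// A tiny stand-in for a micro view engine: swaps {field} tokens for values.
function render(template, data) {
  return template.replace(/\{(\w+)\}/g, function (whole, key) {
    return data[key] != null ? data[key] : '';
  });
}

// Template shipped inline with the initial payload (markup invented for the example).
var tweetTemplate = '<p><strong>{author}</strong> {text}</p>';

// Data as it might come back from a JSON web service.
var response = { author: 'example_user', text: 'Minimal payloads render fast.' };

document.getElementById('tweet').innerHTML = render(tweetTemplate, response);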

The key thing to take from this is that if you are looking for extreme performance gains you should look to minimise the initial payload of your application. Whether you should do this via JavaScript with a JS view engine and JSON web services, or via a minified HTML initial payload, really depends on your user base; there’s never a one-size-fits-all approach in these situations.

Using JS as the main driver for your website is a fairly new approach that really shows how key JavaScript is as the language of the web. It’s an approach that requires some careful tweaking, as well as a very in-depth knowledge of JS, user experience, and performance optimisation; it’s also an approach that, when done correctly, can really drive your website performance into the stratosphere. Why a website like Twitter, with no complex logic and a vested interest in being as performant as possible, would take a step back towards the bad old days of “the response is the full picture” is a question I did not expect to have to ask myself.

Visit Ed’s blog at http://ed-j.co.uk/

Performance Nightmare: HM Revenue & Customs Tax calculator crashes on first day

HM Revenue & Customs (HMRC)

Transparency of data is no good when people can’t access the site

Setting the scene: As part of a plan to improve the transparency of UK government spending, HM Revenue & Customs (HMRC) launched an online tool for calculating the amount of tax paid based on earnings, and where this money is spent by the government. The tool was launched on Monday 28th May 2012 on the HMRC website, as well as in app form for iOS and Android devices. The calculator is supposedly “a big step towards a more transparent, 21st century system”, according to exchequer secretary to the Treasury David Gauke.

Performance Nightmare: In an all too familiar story for online government tools with the remit of “providing transparency” to the public, the tax calculator website crashed within hours of launching. The site was unavailable to users as HMRC admitted that the issues were “purely” down to “phenomenal demand”, as 400,000 people tried to access the website in its first morning. Instead, many were greeted with the error message “Sorry, the HMRC Tax Calculator is currently not available. We apologise for any inconvenience caused. Please try again later.”

This is not the first time that public interest in a new online tool, touted as promoting government transparency, has fallen at the first hurdle. As we wrote about in a previous post, back in February 2011 the police website’s crime map tool crashed catastrophically under the strain of 18 million hits an hour on its first day, and both the US (1940) and UK (1901) census websites, launched ten years apart from each other, were crippled on their first day by overwhelming demand. This new story of the HMRC service crashing is already gaining attention in mainstream media, just as the examples listed above did, which does not give a positive impression of the quality of these services.

Source

Be sure to check back (or subscribe) to the blog to see more Performance Nightmares as they are reported!

Intechnica are a full service digital agency offering performance assurance and complex application development. We solve performance problems!

Performance Nightmare: Nasdaq & the Facebook IPO

Nasdaq OMX Group Inc.

Facebook investors didn’t “like” it when glitches hindered trading on opening day

Setting the scene: On February 1st 2012, on its eighth birthday, Facebook filed for an initial public offering (IPO), and was soon predicted to be valued at $100 billion (that’s four times the value of Google when it went public in 2004). Trading opened on Friday 18th May 2012, with shares opening for $38 (£24) but selling for $42.

Performance Nightmare: Due to a technical glitch on Nasdaq.com, the Facebook IPO launch was delayed by half an hour. Then, traders suffered a lack of visibility on order changes, and problems cancelling orders or even accessing the site, allegedly due to a “high-volume rush”. This hiccup affected 30 million shares and may have cost investors $100 million, as in some cases the price of shares had already dropped by the time orders had eventually been processed – hours after being placed, and in some cases “cancelled”. The level of demand for Facebook shares also affected the financial site etrade.com.

Indeed, it seemed like neither Nasdaq nor etrade were adequately prepared for the biggest, most public IPO opening in history; Nasdaq acknowledged “design problems” with its technology and vowed to improve it for future IPO openings. The fallout is looking to be costly; Nasdaq has already agreed to reimburse investors as much as $13 million for the blunders, and a multi-million dollar lawsuit has been filed against them by representatives of investors; Facebook itself along with Mark Zuckerberg are also being sued for Facebook shares allegedly being overvalued initially. The glitches were also blamed for share prices dropping below the expected level after several days of trading.

Picture: Sean MacEntee

Source / Another source / Yet another source

Performance Nightmare: Rhythm and Vines Music Festival

Rhythm and Vines Music Festival

Performance is especially critical in transactional websites targeted at social media savvy customers

Setting the scene: The Rhythm and Vines Music Festival (established 2003) is an annual music festival in Gisborne, New Zealand, traditionally taking place over New Year. It started as a small New Year’s Eve celebration featuring Kiwi musical acts and was attended by 1,800 people in its first year. This expanded to a three-day international festival, with 25,000 people attending in 2010. In the 2011 pre-sales period, 4,000 tickets were sold.

Performance Nightmare: When pre-sales opened for the 2012 festival, demand doubled from the previous year, bringing down the ticketing website. The website ran into trouble soon after pre-sales opened, and was struck down again by overwhelming demand soon after relaunching. This was made worse for users by the fact that the website seemed to process their orders and took their money, but would not confirm their purchases, instead displaying an error message. The festival organisers tried to keep customers up to date on the status of the site via Facebook and Twitter, but both channels were hit by hundreds of negative comments from frustrated users, many of whom were confused about the status of their payments. The festival’s sales manager announced that these orders were being processed manually, which has to be a big strain on time and resources.

Systems need to scale with increasing demand. Photo: http://www.facebook.com/rhythm.vines

Source

Performance Nightmare: UK Border Agency

UK Border Agency

IT systems failures can have a real life impact on the people who depend on them

Setting the scene: In the lead up to the London 2012 Olympics, London Heathrow (the busiest airport in the world by international passenger traffic) began to feel the strain of people coming into the country. Queues of up to three hours were reported at airport immigration lines. Union strikes and new rules requiring foreign nationals from outside the EU to obtain a biometric residence permit are said to have compounded the problem. The identity card system holds biometric records for over 600,000 foreign nationals living in the UK.

Performance Nightmare: The computer system tasked with processing the biometric permits collapsed on 3rd April 2012, bringing the entire Croydon branch to a standstill, and was deemed in the media “a mess” and “not fit for purpose”. It was reported that the sheer volume of applicants was to blame for the failure, and that the system was not built to cope with the new demands imposed on it. It was even suggested that global business managers were being put off doing business in the UK as a result of the inconvenience, as many people queuing for biometric permits to stay in Britain were left in limbo indefinitely and forced to resubmit their applications as a result of the technical issue, which shut down the system for two weeks.

Source

Performance Nightmare: Dixons Retail Group

Dixons, Currys & PC World

As e-commerce sites become more critical to business, so should the focus on how those websites perform

Setting the scene: Dixons Retail Group is a consumer electronic retailer owning the stores and websites of Dixons, Currys and PC World. Dixons Retail last year announced plans to close more than 100 UK stores, so e-commerce is a key revenue stream for the company.

Performance Nightmare: Shortly after the start of Dixons’ annual sale on Monday 23rd April 2012, Dixons.co.uk, PCWorld.co.uk and Currys.co.uk all went offline for 17 hours. This in itself was bad enough, as customers were quick to go to competitor sites to get their wares, but the problem was compounded by a lack of communication with customers over what exactly the problem was. Error messages changed throughout the day, blaming high traffic (which would make sense considering a sale had just started, and could have been planned for ahead of time with performance best practice), then “maintenance”, before a spokesman placed the blame on roadwork damage to cables. There was also speculation online that the sites were a victim of hackers, although that is probably a conclusion jumped to due to no definitive reason emerging. Neither the Dixons nor the Currys & PC World Twitter and Facebook accounts offered any explanation or apology for the outages, despite a flood of customer complaints through social platforms.

The initial reason given for the outage was high traffic.

Source

15 Web Performance Nightmares, and the damage they caused

In a new section of the Intechnica blog, titled “Performance Nightmares”, we’re going to take a closer look at some of the most notorious, high-profile, brand-damaging website performance failures in the history of the internet. Now, the internet is a huge place, and every day, all sorts of websites struggle with performance issues of all kinds; not just the big websites, but also many smaller sites. A quick search on Twitter for “slow website” or “website down” shows the scope of this.

https://twitter.com/lukey868/status/191857279111405568
https://twitter.com/Frewps/status/191704594269736961

Of course, the cause of each of these problems could be all sorts of things, but these comments are entirely public, and potentially hundreds or even thousands of people could read these negative tweets. But bad publicity is not the only negative effect of slow or failing websites, as a previous post on this blog tells. The business impact can be very damaging too. So to kick off “Performance Nightmares”, let’s jump right in with 15 examples spanning the last 11 years. I’ve split them into four categories: Marketing Oversights, Overwhelming Public Interest, Unforeseeable Events and Technical Hiccups. Be sure to check back (or subscribe) to the blog to see more Performance Nightmares as they are reported!

Marketing Oversights

The goal of any marketing campaign is to generate awareness and interest in a brand or product, and drive customers to find out more, ultimately converting leads to sales. In the past, the more interest you directed towards your product the better, as long as supply met demand. But in the following cases, so much more demand was created than expected that the websites fell down, locking out potential customers (not to mention regular and existing customers). While the companies often say that “it’s a good problem to have” or “we were too successful for our own good”, the under-performing website has shot them in the foot, and customers old and new end up frustrated. Here are a few high profile examples…

1. Nectar

When directing people to a website for a special offer, it’s worth checking they will all be able to access it

Setting the scene: When it launched in 2002, loyalty card scheme Nectar pushed hard with TV adverts and email marketing directed at over 10 million households, driving people to their newly launched service. While they had phone lines and direct mail as means of collecting registrations, Nectar attempted to save costs by offering a rewards incentive to people registering online via the website.

Performance Nightmare: While Nectar prepared for this by increasing their server capacity six-fold, a peak of 10,000 visitors in one hour was enough to bring the site down for three days. Nectar cited the complexity of the registration process (security & encryption) as a bottleneck.

Source

2. Glastonbury Festival

Website crashes cause people to express real disappointment

Glastonbury: Messy business

Setting the scene: As an iconic music festival of the past 42 years, Glastonbury needs no introduction. In fact, its fame and popularity on the global music calendar has been as much a curse as a gift for some time: with demand greatly outstripping the availability of tickets, £2 million was spent in 2002 on a giant fence to keep out people who did not have a ticket, and festival goers long complained of having to pay inflated prices to ticket touts online. The problem was compounded in 2005 when information leaked about acts like Oasis and Paul McCartney being set to headline.

Performance Nightmare: The ticketing website got two million impressions in the first five minutes, overloading the system and resulting in disappointment for many people seeking tickets. Bad news spreads fast, with the BBC being flooded with emails about the service, and reports of people selling t-shirts displaying the error message shown by the website.

Source

3. Dr Pepper

Offer a free drink to 300 million people… what’s the worst that can happen? 

Setting the scene: This is almost the poster boy for a marketing campaign that did not take the limits of a website’s performance into account. Dr Pepper promised that, if Guns n’ Roses released the “Chinese Democracy” album in 2008, they would give everyone in the US a free Dr Pepper. When the album was released, Dr Pepper made good on their offer… limiting it to just one day. There are 300 million people in the United States. I think you can see where this is going.

Performance Nightmare: If you guessed that the site was overwhelmed and crashed, you’d be right. The traffic spiked dramatically, forcing Dr Pepper to add more server capacity and extend the offer by a day… which probably would have been a good idea in the first place.

Source

4. Reiss

The gift of free publicity can be a double-edged sword for web performance

Setting the scene: The internet has had a massive effect on the fashion industry, from e-commerce retail through to trend setting and social media. Since her engagement and the globally covered event of her marriage to Prince William, Kate Middleton has become a fashion icon. And when you have an event as watched by the world as the first meeting of the US Presidential first family with “Kate & Wills”, in May 2011, every fashion-hunter had their eye on what the duchess chose to wear. This was great publicity for the designer in question, Reiss…

Performance Nightmare: … Until the sheer volume of interest crashed their website for two and a half hours. While it probably wasn’t a formal marketing campaign as such, the exposure of the brand via Kate Middleton meeting the Obamas is a major fashion event for many, and such a high profile endorsement draws more traffic than perhaps any traditional marketing campaign, something the website was unable to cope with.

Source

5. Ticketmaster, Ticketline, See Tickets, The Ticket Factory

Some events are in such high demand, they can take out multiple sites in one day

Setting the scene: Take That were one of the biggest boy bands of the 90’s, and when they returned for a nationwide tour, women in their mid-20s would trample over their own fathers to get tickets. When the band announced a huge tour with the full line up, including Robbie Williams, the response was frenzied among fans. Tickets were stocked by many major websites, including Ticketmaster, Ticketline, See Tickets and The Ticket Factory.

Performance Nightmare: The demand was so high for tickets that fans flooded and crashed all four sites mentioned above. Would-be ticket buyers were forced to wait, in some cases all day, for their order to be processed successfully (if they could get on the sites at all), and the slow running of the websites continued even after the tickets were all gone. Considering the popularity of Take That, it should be no surprise that this negative experience was widely shared and reported on in the mainstream media. Maybe those Take That fans should have had a little patience.

Source

6. Paddy Power

You think your website will perform when it matters… Wanna bet? 

Don’t let your site fall at the first hurdle

Setting the scene: The Grand National is the biggest betting event of the year in the UK, with it actually being the only betting event of the year for many. Business is at its peak for bookies and betting websites, with the British public spending £80 million in bets on each year’s Grand National. With fierce competition between betting websites to get the business of both regular and once-a-year bettors, many offer special deals, such as Paddy Power’s “Five Places” payout offer.

Performance Nightmare: So great was the demand for bets on Paddy Power, higher than on any other day in its history in fact, that the website came crashing down mere minutes before the race. The site was down just 20 minutes, coming back up 15 minutes before the race, but in such a short time frame this was a costly hiccup, when there are many other betting websites to choose from. Studies show that, at busy times, 75% of customers will move on to a competitor’s website rather than suffer delays, and this can only be compounded in such a time-sensitive case as getting good odds on the Grand National. 88% of people won’t come back to a website after a bad experience; Paddy Power has since offered a free bet to all its customers as damage control.

Source

Overwhelming Public Interest

Clearly there is a costly disconnect between marketing efforts and website performance considerations, but sometimes simple public interest in a product or service can back a website into a corner. Government and public service websites are increasingly becoming essential resources for the general public, and especially at service launches or at times of peak interest, these key web applications need to be able to scale – but sometimes don’t…

7. Swine Flu Pandemic

Curiosity killed this website of high public interest

Setting the scene: Back in July 2009, the UK was caught up in the supposed “Swine Flu Pandemic”. Some reports went as far as to say that up to 60,000 could die of swine flu. To help ease the strain on the medical sector, the government decided to launch a website with a check list of the symptoms of swine flu, giving appropriate advice to those who had them, while putting those without at ease.

Performance Nightmare: With media scaremongering at a high, the website received 2,600 hits per second, or 9.3 million hits per hour, just two hours after launching. Unsurprisingly, the website crashed temporarily, although it was quickly restored; this was put down to most people visiting out of “curiosity” and quickly leaving the site after deciding on their diagnosis.

Source

8. UK Police Crime Maps

Even with higher than expected demand, websites can still be guilty of not being built to scale 

Setting the scene: In a move to increase transparency in crime statistics, the UK government launched a website in February 2011 allowing members of the public to get access to information about crime rates in their areas via markers on an interactive map. This received mainstream news coverage, with questions being raised about the accuracy & impact of the reports; for example, the effect on insurance rates or house prices.

Performance Nightmare: The crime maps were of such great public interest that the site received 18 million hits an hour on its first day, bringing it tumbling down within a few hours in a very public fashion. While it might seem reasonable for a website to fall down under 18 million hits an hour, it was clear that the site simply wasn’t designed to perform at any kind of scale, despite using Amazon EC2 machines to spin up extra capacity; “you still need to build a site that scales without needing 1000’s of servers”.

Source

9. Census (1901 UK & 1940 USA)

Learn from other people’s mistakes, and make sure you can scale up enough

Data entry for the 1940 census web archive… sort of

Setting the scene: Although separated by 39 years and the Atlantic Ocean, these two census reports caused a very similar problem to their respective websites. As the census information came into the public domain (in 2002 for the UK, 2012 for the States), each government commissioned websites to host the historical data, which was placed into databases and images scanned in for downloading. Both sites expected a high level of interest and part of their remit was to cope with the high level of load (the US census was expected to support 10 million hits a day, while the UK census was required to cater for 1.2 million users per day).

Performance Nightmare: Demand for each service was so overwhelming that it exceeded both predefined targets set to the websites, bringing them down within hours. To start with the 2002 failure of the 1901 UK census, the website hit its 1.2 million hit limit within just 3 hours, and the site was closed in an attempt to investigate means to make it scale. It was closed in January and eventually reopened in August, with full functionality being restored in November. Ten years later, the US government apparently didn’t heed this lesson in scaling a census website, as the site hit 22.5 million hits within 3 hours of launching. Again, despite being hosted in Amazon’s AWS cloud, the site didn’t scale to meet the demand, and the site was forced to restrict its functionality when it came back online the next day.

Source

10. London 2012 Olympics

Web performance can be more like a marathon than a sprint

Setting the scene: The Olympics is often a cause of controversy, given the sheer level of interest it generates. Cities all over the world clamour to host the global event, as it draws in tourism and revenue, but with that come social, economic and logistical challenges, as even advanced cities prepare to welcome a sudden spike in population.

Performance Nightmare: There is almost too much to write about this one. In April 2011, a window to buy 6.6 million tickets through a public ballot came to an end; as it was not a first-come, first-served basis, many people waited until the last minute to decide on what tickets to bid on. The website was slowed to a crawl late on the last day under heavy load, forcing the six-week window to be extended by several hours. A few months later, the Olympics ticket resale site opened, allowing people to buy and sell official tickets with each other, but this also failed to cope with the strain of demand, slowing to a crawl. More problems arose in December and January, with more ticket website outages and cases of events being oversold.

Source

11. UCAS

Increasing adoption of the internet in general can impact your service performance

Setting the scene: Compared to post or call centres, the internet is a cost-effective way to collect information from lots of people at once, and with more people having access than ever to internet services, it makes sense to expect the public to use them. Indeed, UCAS now uses its website to allow students to book places on courses with vacancies during the clearing stage of university applications.

Performance Nightmare: In 2011, 185,000 students were chasing just 29,000 unfilled course places. The number of hopeful students logging into the UCAS clearing site quadrupled from the previous year. UCAS was forced to shut down the site for over an hour to cope with the volume of incoming traffic, as students were dependent on the service to find out the status of their applications.

Source

12. Floodline

Sometimes it’s not the volume of traffic, but what the traffic is doing that causes problems

Disclaimer: Not a realistic danger of a flooded website

Setting the scene: The UK Environment Agency’s National Floodline was set up in 2002 to provide instant information, via a call centre or over the web, about potential flood dangers across the UK. However, heavy rainfall over Christmas and New Year 2002/2003 caused a surge of activity across both channels.

Performance Nightmare: The sudden demand for information made the website unavailable for many. As the risk of flooding rose, phone enquiries climbed to a peak of 32,650 calls a day, and as people failed to get through, many turned to the website, where they would execute complicated searches in order to establish the impact of flooding in their area. At the peak, on 2 January, 23,350 people were hitting the site, and while the site was built to support a high number of users (and had successfully done so in the past), it was the complexity of the searches that was the main cause of bottlenecks. As the Environment Minister told a parliamentary committee, the website crash (which took the site out for several days) was not helped by the fact that so many people were at home over that period “and had little else to do except surf the net and look for flood information”.

Source

Technical Hiccups

While website performance problems are only brought to light when the site in question needs to perform well more than ever, and site owners find themselves “victims of their own success” when a marketing push or genuine public interest floods their website, there are also times where a glitch or error can bring on a Performance Nightmare. From hardware failure through to human error, such instances have proven to cause serious problems ranging from bad PR through to legal woes, and of course have cost their victims a lot of money.

13. Tesco

Customers needing a service won’t hesitate to go elsewhere 

Setting the scene: Online grocery site Tesco.com is a service used to order shopping for home delivery in the UK. Many people use it for their weekly grocery shop, as part of a busy lifestyle or perhaps through being unable to physically get to and from a supermarket. Tesco makes an estimated £255 million a year through online sales.

Performance Nightmare: In September 2011, the Tesco online service was halted for 2 hours by “technical glitches”. Disgruntled customers, who in some cases depend on getting specific delivery slots, were quick to go elsewhere with their custom, as many other UK supermarkets now offer an online delivery service.

Source

14. TD Waterhouse

A case of the financial impact being all too apparent

Setting the scene: TD Waterhouse, now known in the US as TD Ameritrade and elsewhere as TD Direct Investing, is an individual investment services company. Customers use its online service to order stocks and shares. As of 2001, it was the second largest discount broker in the US.

Performance Nightmare: The stock broker’s website suffered significant outages, which prevented customer orders from being processed on 33 different trade days spanning from November 1997 through to April 2000. The outages lasted up to 1 hour 51 minutes. This, along with TD Waterhouse’s failure to advise customers about alternative order methods, plus a general lack of customer service around the matter, caused the New York Stock Exchange to fine TD Waterhouse $225,000. The company put the outages down to “software issues”. The Securities and Exchange Commission released a report in January 2001 calling on brokerage firms to improve areas such as performance.

Source

15. JP Morgan Chase

Communication and prompt action are key when customers suffer from a web performance failure

Setting the scene: American bank JP Morgan Chase, which as of 2010 had $2 trillion in assets, provides an online banking service for its customers to manage their accounts and make transactions.

Performance Nightmare: On 14th September 2010, Chase bank’s online service went down “sometime overnight”, causing inconvenience for customers, who took to Twitter to vent. One user was quoted as tweeting: “Dear Chase Bank, I have about 10 million expense reports to do, please get your act together so I can see my transactions online!” While occasional online bank outages aren’t rare, in this case the outage lasted around 18 hours.

Source

Got a contribution to the list? Leave it in the comments below!

Want to avoid a web performance nightmare of your own? Check out Intechnica’s Event Performance Management service!