Archive for March, 2008

Google/YouTube

Sunday, March 2nd, 2008

Last weekend there was a widely reported outage of the YouTube video-sharing website. It happened just after Internet Service Providers (ISPs) in Pakistan had been instructed to prevent their customers from accessing the site.

So, why did this cause a problem for Internet users around the rest of the world?

To start off, here’s a short introduction to how computers on the Internet know how to reach each other. Every computer has an Internet Protocol address (IP address). This is usually written as four numbers separated by dots, e.g. 192.168.12.34, but at the computer level it is a single 32-bit number between zero and about four billion. Each ISP advertises the addresses it is responsible for by giving the lowest number in the range together with the length of the range, expressed as a prefix length: the number of leading bits that every address in the range shares. So, for example, I might say I am responsible for everything from 192.168.12.0 to 192.168.12.255, which is written 192.168.12.0/24 (the first 24 bits are fixed, leaving 8 bits, or 256 addresses).
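To make the arithmetic concrete, here is a minimal Python sketch that turns a dotted address into the single number the computer uses, and works out which addresses a range covers from its lowest address and prefix length. The helper names `dotted_to_int` and `prefix_bounds` are made up for illustration; the result is cross-checked against Python’s standard `ipaddress` module.

```python
import ipaddress


def dotted_to_int(dotted: str) -> int:
    """Turn '192.168.12.34' into the single 32-bit number the computer uses."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    # Each dotted part is one byte; shift each into place and combine.
    return (a << 24) | (b << 16) | (c << 8) | d


def prefix_bounds(lowest: str, prefix_len: int) -> tuple:
    """Lowest and highest address covered by a range given as lowest address + prefix length."""
    start = dotted_to_int(lowest)
    size = 2 ** (32 - prefix_len)  # a /24 fixes 24 bits and leaves 8 free: 256 addresses
    return start, start + size - 1


print(dotted_to_int("192.168.12.34"))  # 3232238626
start, end = prefix_bounds("192.168.12.0", 24)
print(ipaddress.IPv4Address(start), ipaddress.IPv4Address(end))  # 192.168.12.0 192.168.12.255

# The standard library agrees on the size of the range.
print(ipaddress.IPv4Network("192.168.12.0/24").num_addresses)  # 256
```

Each step of one in the prefix length halves the range, so a /23 would cover 512 addresses and a /25 only 128.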

This announcement of which addresses an ISP is responsible for is known as a route advertisement, and it is sent using a protocol called the Border Gateway Protocol (BGP). ISPs speak BGP to each other so that they all know who is responsible for which range. Usually only one ISP is responsible for a particular range of addresses, but it is possible to punch a hole in that range by advertising a smaller range inside it. When deciding where to send traffic for a particular computer, the narrower range always takes precedence over the larger one; this is known as longest-prefix matching.
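The “narrowest range wins” rule can be sketched in a few lines of Python. This assumes a tiny hypothetical routing table (the ISP names and ranges are invented for illustration), and it uses a simple linear scan; real routers use much faster lookup structures, but the choice they make is the same.

```python
import ipaddress
from typing import Optional

# Hypothetical routing table: advertised range -> who advertised it.
ROUTES = {
    ipaddress.IPv4Network("192.168.12.0/24"): "ISP A",
    # ISP B punches a hole in ISP A's range by advertising the upper half of it.
    ipaddress.IPv4Network("192.168.12.128/25"): "ISP B",
}


def choose_route(destination: str) -> Optional[str]:
    """Return who advertised the narrowest range containing the destination, or None."""
    addr = ipaddress.IPv4Address(destination)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    # A longer prefix length means a narrower range, so it takes precedence.
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


print(choose_route("192.168.12.34"))   # ISP A
print(choose_route("192.168.12.200"))  # ISP B
print(choose_route("10.0.0.1"))        # None
```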

For the range of addresses relevant to last week’s problem, YouTube advertises a range of about a thousand addresses (1,024). To block YouTube, one of the ISPs in Pakistan claimed to be responsible for 256 of those addresses. This claim was only meant to be used within that ISP, and it would not have caused a problem had it stayed there, but an engineer made a mistake and the route advertisement was passed on to other ISPs until most of the world saw it. Because it covered a smaller range of addresses (a “more specific” route advertisement), everyone started sending YouTube traffic to the ISP in Pakistan. That ISP, of course, didn’t know what to do with it, so the traffic was dropped and nobody could reach YouTube.
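The same rule explains the outage. In this minimal sketch the two prefixes are the ones reported in public analyses of the incident: 208.65.152.0/22 for YouTube’s 1,024 addresses and 208.65.153.0/24 for the 256-address claim. The destination addresses are arbitrary examples, and `next_hop` is a hypothetical helper.

```python
import ipaddress
from typing import Optional

# (advertised range, who advertised it); prefixes as publicly reported after the incident
ANNOUNCEMENTS = [
    (ipaddress.IPv4Network("208.65.152.0/22"), "YouTube"),       # 1,024 addresses
    (ipaddress.IPv4Network("208.65.153.0/24"), "Pakistan ISP"),  # 256 of those addresses
]


def next_hop(destination: str, table) -> Optional[str]:
    """Longest-prefix match over a list of (network, advertiser) pairs."""
    addr = ipaddress.IPv4Address(destination)
    candidates = [(net, who) for net, who in table if addr in net]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0].prefixlen)[1]


hijacked, legitimate = ANNOUNCEMENTS[1][0], ANNOUNCEMENTS[0][0]
print(hijacked.subnet_of(legitimate))             # True: the smaller range sits inside YouTube's
print(next_hop("208.65.153.100", ANNOUNCEMENTS))  # Pakistan ISP: the more specific route wins
print(next_hop("208.65.152.10", ANNOUNCEMENTS))   # YouTube: outside the hijacked range
```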

YouTube tried to fix this by advertising even more specific ranges of 128 addresses each, but most ISPs ignore advertisements covering fewer than 256 addresses, because accepting them would let the number of routes everyone has to keep track of grow far too large. For most of the Internet, then, the Pakistan ISP’s 256-address route still won. In the end, it took the Pakistan ISP’s own upstream provider blocking the route advertisements for everything to start working again. This happened within a couple of hours, which may have been a long time for YouTube to be unreachable, but isn’t bad when you consider the distances between all the parties involved.
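Here is a sketch of why the counter-move only partly worked. It assumes YouTube’s 128-address advertisements were the two halves of the hijacked range, and that a filtering ISP simply drops anything smaller than 256 addresses (longer than a /24). The `MAX_PREFIXLEN` policy value and the helper names are illustrative, not any particular router’s configuration.

```python
import ipaddress
from typing import Optional

MAX_PREFIXLEN = 24  # assumed filter policy: ignore ranges smaller than 256 addresses

hijacked = ipaddress.IPv4Network("208.65.153.0/24")
TABLE = [
    (ipaddress.IPv4Network("208.65.152.0/22"), "YouTube"),
    (hijacked, "Pakistan ISP"),
]
# YouTube's counter-advertisements: the hijacked range split into two 128-address halves.
TABLE += [(half, "YouTube") for half in hijacked.subnets(new_prefix=25)]


def filtered(table, max_len: int):
    """Keep only advertisements no more specific than max_len bits."""
    return [(net, who) for net, who in table if net.prefixlen <= max_len]


def next_hop(destination: str, table) -> Optional[str]:
    """Longest-prefix match over a list of (network, advertiser) pairs."""
    addr = ipaddress.IPv4Address(destination)
    candidates = [(net, who) for net, who in table if addr in net]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0].prefixlen)[1]


dest = "208.65.153.100"
print(next_hop(dest, TABLE))                           # YouTube: an ISP that accepts the halves
print(next_hop(dest, filtered(TABLE, MAX_PREFIXLEN)))  # Pakistan ISP: the halves were filtered out
```

The filter is sensible on its own, since it keeps routing tables manageable, but here it meant YouTube’s fix only helped at ISPs that did not apply it.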

Whilst the ISP in Pakistan should not have advertised this address range, the larger ISP that provides its service should not have believed it either. Since accidents happen, checking a customer’s advertisements is exactly the safeguard meant to catch them, so the blame lies with both of them. However, the mechanisms for deciding which route advertisements to believe are far from perfect, and a large degree of trust between ISPs is involved.
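One such check is an upstream ISP comparing what a customer advertises against the ranges it knows that customer holds. A minimal sketch, assuming the upstream keeps a simple list of each customer’s ranges; 192.0.2.0/24 is a documentation range standing in for the Pakistan ISP’s real addresses, and `accept_from_customer` is a hypothetical name.

```python
import ipaddress

# Hypothetical record of the ranges this customer is known to hold
# (192.0.2.0/24 is a documentation range used purely as a placeholder).
CUSTOMER_RANGES = [ipaddress.IPv4Network("192.0.2.0/24")]


def accept_from_customer(announced: str, allowed=CUSTOMER_RANGES) -> bool:
    """Believe an advertisement only if it falls inside a range the customer holds."""
    net = ipaddress.IPv4Network(announced)
    return any(net.subnet_of(own) for own in allowed)


print(accept_from_customer("192.0.2.0/24"))     # True: the customer's own range
print(accept_from_customer("192.0.2.128/25"))   # True: part of its own range
print(accept_from_customer("208.65.153.0/24"))  # False: someone else's addresses, so dropped
```

The check itself is trivial; the hard part is keeping each customer’s list accurate, which is where the trust described above comes in.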

The problem is well known, and work on better ways of authenticating route announcements has been ongoing for some years, but that is still some way from being a reality. In the meantime, ISPs that had been less strict about what they believe from other ISPs are tightening up, and if things do go wrong again, rest assured that engineers at most of the larger ISPs are talking to each other to spot any problems and fix them quickly.