Overview
- Introduction
- Squid caching architecture
- Geographical DNS
- Purging
- Network design
Wikimedia clusters
Wikimedia Sysadmins
Squid caching architecture
Most content is served from Tampa, Florida, with the exception of some Asian-language wikis, which are served from Seoul, South Korea.
Two-tier setup
- Content is served by Apache servers, through 1st tier Squid HTTP caches in the same cluster
- Possibly forwarded from and cached by 2nd tier Squids from another cluster
- Cache misses:
- Squid caches ask their adjacent neighbours first using the HTCP inter-cache protocol...
- ...and only contact the parent caches if the object is not locally available
- Many Squids run diskless
Retrieving from upstream is often faster than waiting for (slow) disks
- Text (wiki) content, and static (image) content are separate Squid groups
Allows for better optimization of different content types, and some cache locality
- Load balancing is done by Linux Virtual Server (LVS)
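The cache-miss path described above (ask sibling caches first, fall back to the parent only if no sibling has the object) can be sketched as a toy in-memory simulation. This is not Squid code: the `Cache` class, the `fetch` function and the `parent_fetch` callback are hypothetical names, and the `has` method merely stands in for an HTCP query.

```python
from typing import Callable, Dict, List, Tuple


class Cache:
    """Toy in-memory cache standing in for one diskless Squid (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, bytes] = {}

    def has(self, url: str) -> bool:
        # Stands in for an HTCP query: "do you hold this object?"
        return url in self.store


def fetch(
    url: str,
    local: Cache,
    siblings: List[Cache],
    parent_fetch: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (body, source name), trying local, then siblings, then the parent."""
    if local.has(url):
        return local.store[url], local.name
    for sibling in siblings:
        if sibling.has(url):
            body = sibling.store[url]
            # Simplification: always keep a local copy of what a sibling served.
            local.store[url] = body
            return body, sibling.name
    # No neighbour has the object: only now contact the parent cache.
    body = parent_fetch(url)
    local.store[url] = body
    return body, "parent"
```

The point of the ordering is that a sibling hit never generates traffic towards the parent tier; the parent callback is invoked only on a miss across all neighbours.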
How to distribute users?
Geographically distributed clusters are nice, but how to distribute users over them?
- Randomly, using DNS Round-robin
Even load, but sends users to servers far away
- HTTP forwarding
Defeats much of the purpose
- Return DNS results based on geographic or network topology information
Sends users to the closest and hopefully fastest cluster
Geographical DNS
Observations:
- Most users use a DNS resolver relatively close to them
- IP address to location mapping lists are available that are at least 90-95% accurate
- Maxmind GeoIP
- countries.nerd.dk DNSBL
- Geographically close often also means close in terms of network topology
So why not have the Wikimedia DNS servers determine the location of the querying DNS resolvers, and give out answers based on that?
- DNS purists don't like it, but it works well. Many big multi-cluster sites use something similar.
- Several commercial offerings: Akamai et al., load balancing vendors, Content Delivery Networks
- An unofficial patch for Bind exists, GeoDNS.
But I don't like Bind, and I like to reinvent the wheel... ;-)
PowerDNS and Geobackend
Therefore:
- PowerDNS, a modern DNS server:
- multithreaded core handling the complex DNS logic
- many backends delivering the actual data using arbitrary methods
- Geobackend, a PowerDNS backend written by me that returns DNS answers using arbitrary mappings, e.g. geographical ones.
- Reads an RBLDNS-style zonefile (countries.nerd.dk) as IP map
- Uses a flat file director map to map arbitrary numbers (e.g. countries) to DNS CNAMEs.
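The two-step lookup described above (resolver IP to a number via the IP map, then number to CNAME via the director map) can be sketched roughly as follows. This is not the Geobackend source and does not parse its real file formats: the networks, numeric codes, CNAME targets and function names are all made-up placeholders for illustration.

```python
import ipaddress
from typing import List, Optional, Tuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Hypothetical IP map: network -> arbitrary number (e.g. a country code).
# The networks are documentation ranges, not real assignments.
IP_MAP: List[Tuple[Network, int]] = [
    (ipaddress.ip_network("192.0.2.0/24"), 528),
    (ipaddress.ip_network("198.51.100.0/24"), 410),
]

# Hypothetical director map: number -> CNAME target.
DIRECTOR_MAP = {
    528: "europe.example.org.",
    410: "asia.example.org.",
}
DEFAULT_TARGET = "us.example.org."  # assumed fallback for unmapped resolvers


def lookup_code(resolver_ip: str, ip_map: List[Tuple[Network, int]]) -> Optional[int]:
    """Return the code of the most specific network containing the resolver IP."""
    addr = ipaddress.ip_address(resolver_ip)
    best: Optional[Tuple[Network, int]] = None
    for net, code in ip_map:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, code)
    return best[1] if best else None


def resolve_cname(resolver_ip: str) -> str:
    """Pick the CNAME to return, based on the querying resolver's address."""
    code = lookup_code(resolver_ip, IP_MAP)
    return DIRECTOR_MAP.get(code, DEFAULT_TARGET)
```

Note that the decision is keyed on the address of the querying DNS resolver, not the end user, which is why the observation that most users sit close to their resolver matters.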
Alternative methods
Some alternative geographical balancing options could be:
- BGP DNS, where the distributed DNS servers have a BGP "view" of the local network routing table
- BGP delivers the cheapest, but not necessarily "best" route
- Needs cooperation of ISPs; BGP access everywhere
- Network latency measurements based on previous traffic
- Difficult to implement and get right reliably
- BGP Anycast instead of DNS - let the network sort it out
- Also needs ISP cooperation everywhere, and anycast isn't great for TCP
A combination of all these options might work.
Object Purging
Old method
HTTP PURGE requests sent over TCP connections from every MediaWiki server to every Squid
This didn't scale well:
- Multiplicative number of purges: # Squids x # Requests
- Overhead because of TCP, latency, timeouts
- Need for connectivity between (internal) MediaWiki servers to all (external) Squids
- Many open sockets / file handles
First idea: have MediaWiki contact a single external daemon, which takes care of the rest.
A little better, but mostly the same problems, moved.
Object Purging using Multicast
Object purging is simply a broadcast or multicast message! Let the network take care of it.
- MediaWiki sends a single purge message
- The network delivers it to all interested hosts (only)
But... Only one-way communication possible
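To make the difference in sender-side load concrete, here is a small arithmetic sketch. The function names and the figures in the usage note are hypothetical and chosen only for illustration; they are not measurements from the Wikimedia clusters.

```python
def unicast_purge_messages(num_squids: int, num_purges: int) -> int:
    """Old method: every purge is sent separately to every Squid."""
    return num_squids * num_purges


def multicast_purge_messages(num_purges: int) -> int:
    """Multicast: the sender emits one message per purge; the network fans it out."""
    return num_purges
```

With, say, 50 Squids and 100 purges, `unicast_purge_messages(50, 100)` gives 5000 sender-side requests, while `multicast_purge_messages(100)` gives 100, independent of how many Squids are listening.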
HTCP
- Already using the HTCP inter-cache protocol... which is UDP-based!
- The HTCP specification includes a purge method: HTCP CLR
- ...but HTCP CLR wasn't implemented in Squid
- A buggy, non-working patch was found on the Internet
Got it working after heavy modification
- Squid's HTCP code was rather immature, despite being the reference implementation
RFC non-compliance, memory leaks, inefficiencies...
Multicast between clusters
- Multicast on our internal networks is no problem, but...
- Multicast routing between our clusters over the Internet is
Solution: convert multicast packet to unicast, send over the Internet, and reconvert to multicast again.
A simple Python script accomplishes this.
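The original script is not shown here, so the following is only a minimal sketch of how such a relay could look, using the standard `socket` module. All function names are hypothetical, and any multicast group, port or relay host used with it would be placeholders.

```python
import socket
import struct
from typing import Tuple


def open_multicast_listener(group: str, port: int) -> socket.socket:
    """Open a UDP socket that has joined the given IPv4 multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def open_multicast_sender(ttl: int = 1) -> socket.socket:
    """Open a UDP socket suitable for re-emitting datagrams to a multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
    return sock


def relay_once(
    in_sock: socket.socket,
    out_sock: socket.socket,
    dest: Tuple[str, int],
    bufsize: int = 65535,
) -> bytes:
    """Receive one datagram on in_sock and forward it unchanged to dest."""
    data, _sender = in_sock.recvfrom(bufsize)
    out_sock.sendto(data, dest)
    return data
```

Under these assumptions, the relay in the originating cluster would loop over `relay_once` with a multicast listener as `in_sock`, a plain UDP socket as `out_sock` and the remote relay's unicast address as `dest`. The relay in the remote cluster would do the reverse: a plain UDP socket bound to the agreed port as `in_sock`, a multicast sender as `out_sock`, and the local multicast group as `dest`.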
The Wikimedia Network
Florida, currently:
- 2 VLANs, external and internal
- Failover setup:
- Two uplinks to ISP, using BGP
- HSRP
- 2 layer 3 switches (Cisco Catalyst 3560G)
- 4 access switches (Cisco Catalyst 2948G, Netgear)
Florida, soon:
- One core switch/router
- Older switches used for "access ports" and emergency backup
- Multihomed network
Foundry BigIron RX-8