A System for Using Multiple Web Site Mirrors:
How to increase bandwidth and improve response time on a worldwide scale
Benjamin O’Connor
6.033 Design Project #1
Prof. T. Saltzer (TR1)
Abstract:
A popular website with an international audience needs a distributed array of servers to transfer information quickly and efficiently in large volumes. This design provides a solution of relatively low complexity with redundancy and scalability issues also taken into account. This solution redirects web clients from an entry point to a particular mirrored web server automatically based on server loads and calculated communications speed to the client. Each client is treated individually and is directed to the particular Data Server best suited for it. Taking into account the current workloads of each potential server allows servers to maintain balanced, reasonable workloads. Measuring the communication response times to clients provides for the quickest possible server connection.
Operational Overview
The distribution of load and traffic in this proposed system arises from separating the host that receives a client's initial request (a Hit Server) from the host that handles all subsequent requests from that client (a Data Server). The hostname www.acme.com resolves to one of at least two Hit Servers via a round-robin DNS scheme. All of the Hit Servers are exact mirrors of each other. The Hit Servers maintain open TCP channels to all of the organization’s Data Servers, which are placed strategically around the globe. When a Hit Server is accessed, it immediately sends query messages to all Data Servers over these open TCP channels to determine where this particular client should be redirected to obtain the best performance.
One response from each Data Server tells our Hit Server that Data Server's current workload. Another response tells our Hit Server the round-trip packet transfer time from that Data Server to the client. Using these loads and response times, the Hit Server determines which Data Server is the best match for our client. The client is then redirected, via a standard HTTP redirect, to that Data Server, where all subsequent requests are submitted. The open TCP channels also allow the Hit Servers to monitor the status and availability of our Data Servers, and provide immediate notification in the case of a network or machine outage. Similarly, the Hit Servers monitor each other to handle network and machine outages that may occur in their own cluster. (See Fig. 1)
Figure 1. System Overview
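The query step in the overview can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `query_data_servers`, the `query` callable (standing in for the SpeedTo/MyLoad exchange over an open TCP channel), and the 0.5 second default timeout are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def query_data_servers(servers, client, query, timeout_s=0.5):
    """Ask every Data Server about one client, in parallel.

    `query(server, client)` is a hypothetical callable that returns a
    (response_time, workload) pair for that server. Servers that raise
    an error or do not answer within `timeout_s` are left out of the
    result, so no client would be redirected to them.
    """
    if not servers:
        return {}
    pool = ThreadPoolExecutor(max_workers=len(servers))
    futures = {pool.submit(query, s, client): s for s in servers}
    # Wait for the answers, but never longer than the timeout.
    done, _late = wait(futures, timeout=timeout_s)
    # Do not block on stragglers; their answers are simply ignored.
    pool.shutdown(wait=False)
    results = {}
    for fut in done:
        if fut.exception() is None:
            results[futures[fut]] = fut.result()
    return results
```

Querying in parallel keeps the delay seen by the client close to the slowest acceptable response rather than the sum of all responses.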
Theory of Operation
The first and most rudimentary method used in this system is the round-robin DNS assignment of multiple Hit Servers. As clients attempt to access the acme web site, the fully qualified domain name www.acme.com resolves to different network addresses, so consecutive connections are directed to different machines. This balances the workload on our Hit Servers. It also increases redundancy by ensuring a backup if one Hit Server should fail. These Hit Servers do not necessarily have to be located together, although placing them in separate places around the world would not give us a great performance gain, since connections to them are assigned in rotation and not by any speed heuristic.
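The rotation can be illustrated with a toy resolver. This is only a sketch of the idea: the class name `RoundRobinResolver` is hypothetical, the addresses come from a range reserved for documentation, and real DNS servers and caching resolvers behave less uniformly than this.

```python
from itertools import cycle


class RoundRobinResolver:
    """Hand out Hit Server addresses in strict rotation."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one Hit Server address is required")
        self._rotation = cycle(list(addresses))

    def resolve(self):
        # Each lookup returns the next address in the rotation.
        return next(self._rotation)


# Two hypothetical Hit Servers behind www.acme.com.
resolver = RoundRobinResolver(["192.0.2.10", "192.0.2.11"])
first_four = [resolver.resolve() for _ in range(4)]
```

Note that the rotation knows nothing about where the client is, which is why spreading Hit Servers around the world buys little.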
In order for the Hit Servers to redirect to an appropriate Data Server, a list of all Data Servers in the organization resides on each Hit Server. On each request, the Hit Server iterates down this list of connected Data Servers. To each Data Server, the Hit Server sends two messages and accepts two responses. The SpeedTo(client) message evokes a response containing the round-trip time of a packet sent from that Data Server to our client. The MyLoad() message requests that the Data Server respond with a measure of its current workload, a function of current connection availability as well as processor usage. If no response is received after a reasonable but small amount of time, the corresponding Data Server is assumed not to be functioning, and no clients are redirected to it. Given the responses to these two queries for each Data Server, the Hit Server determines which one our client is redirected to. Packet response time takes precedence in the decision, with a threshold value set on the acceptable server load. For example:
Data Server    | Response Time (out of 1K) | Workload (out of 1K)
Data Server #1 | 500                       | 500
Data Server #2 | 100                       | 100
Data Server #3 | 300                       | 100
Data Server #4 | 400                       | 700
Data Server #5 | 200                       | 950
Figure 2.
In this case, it is obvious that our client will be redirected to Data Server #2, which has the lowest response time and a workload no higher than any other server's. The heuristic is as follows: consider only the best half of our Servers, ranked by response time. Among these, redirect our client to the Data Server with the best (lowest) average of Response Time and Workload. Averages within a certain percentage of each other (10% is reasonable) count as a tie. In the case of a tie, take the Server with the best workload.
Data Server    | Response Time (out of 1K) | Workload (out of 1K)
Data Server #1 | 50                        | 200
Data Server #2 | 100                       | 160
Data Server #3 | 200                       | 500
Data Server #4 | 500                       | 750
Data Server #5 | 850                       | 102
Figure 3.
In this case, Servers 4 and 5 are not considered due to their slow response times. Server 1 achieves an average of 125. Server 2 achieves an average of 130. Server 3, with an average of 350, is out of contention. Servers 1 and 2 count as a tie because their averages are within the 10% threshold. Therefore, our client will be redirected to Server 2, as it has the lower workload of the two.
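The heuristic can be written down compactly. The sketch below is one reading of the rules above, with two assumptions made explicit: the "best half" is rounded up (so 3 of 5 servers are considered, as in Figure 3), and a tie means an average within 10% of the best average. The function name `choose_data_server` is hypothetical.

```python
import math


def choose_data_server(stats, tie_tolerance=0.10):
    """Pick a Data Server for one client.

    `stats` maps a server name to (response_time, workload), both on
    the 0..1000 scale used in the figures, where lower is better.
    """
    if not stats:
        raise ValueError("no Data Servers responded")
    # Keep only the best half by response time (rounded up: assumption).
    by_speed = sorted(stats, key=lambda s: stats[s][0])
    candidates = by_speed[:math.ceil(len(by_speed) / 2)]
    # Average of response time and workload for each candidate.
    average = {s: sum(stats[s]) / 2 for s in candidates}
    best = min(average.values())
    # Candidates within the tolerance of the best average are tied.
    tied = [s for s in candidates if average[s] <= best * (1 + tie_tolerance)]
    # Break ties by taking the lowest workload.
    return min(tied, key=lambda s: stats[s][1])


figure_2 = {1: (500, 500), 2: (100, 100), 3: (300, 100),
            4: (400, 700), 5: (200, 950)}
figure_3 = {1: (50, 200), 2: (100, 160), 3: (200, 500),
            4: (500, 750), 5: (850, 102)}
```

With the data from Figures 2 and 3, `choose_data_server` returns Server 2 in both cases, matching the worked examples.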
Workload and connection status are both constantly monitored for all Data Servers. High-workload warnings are logged, and a System Administrator is notified of high workloads or of extremely long or timed-out network responses. In this way, decisions can be made as to which geographic regions might need additional Data Servers.
Adding another Data Server to our network would not be an extremely complicated operation. A new server would have to be placed on the network, and its configuration copied from an existing one. In addition, the master list of Data Servers would have to be updated on the Hit Servers in order for clients to be redirected to this new Data Server. Our Data Servers could run any web server software. One configuration exists for all of them, as they are all mirrors of each other. Besides the HTTP server software, a small process runs on each Data Server. This process maintains the TCP connections with the Hit Servers and responds to the query messages. It also spawns the pings to our prospective client that are needed to measure the network round-trip time.
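The small process on each Data Server might dispatch queries along these lines. This is only a sketch: the text names the SpeedTo(client) and MyLoad() messages but not their wire format, so the one-line text format used here, the reply strings, and the `measure_rtt` and `measure_load` callables are all assumptions.

```python
def handle_query(message, measure_rtt, measure_load):
    """Answer one query line received from a Hit Server.

    Hypothetical wire format: "SpeedTo <client-address>" or "MyLoad".
    `measure_rtt(client)` would ping the client and return a round-trip
    score; `measure_load()` would return the current workload score.
    """
    parts = message.strip().split()
    if len(parts) == 2 and parts[0] == "SpeedTo":
        client = parts[1]
        return f"Speed {client} {measure_rtt(client)}"
    if parts == ["MyLoad"]:
        return f"Load {measure_load()}"
    # Anything else is rejected rather than guessed at.
    return "Error unknown-message"
```

Passing the two measurement functions in as parameters keeps the dispatch logic testable without a network.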
Implementation Considerations
1.) System Extensibility:
As we have already seen, a new Data Server can easily be added to our system. However, once a certain number of Data Servers exist on our network, the message-handling overhead on our Hit Servers will grow. To remedy this, we can easily add more Hit Servers to our DNS round robin to further distribute clients. Beyond a certain number of Data Servers, though, the number of message responses concerning just one client would itself become a hindrance. The pings from all of our Data Servers to our prospective client machine would also become a hindrance, and may slow down our clients. This is the upper limit of the system. However, it would only be reached once a relatively large number of Data Servers (more than 50) are participating.
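The growth can be made concrete with a little arithmetic. The sketch below counts traffic for a single client hit, assuming two queries and two responses per Data Server as described in the Theory of Operation, and (an assumption) one ping probe per Data Server. The function name `per_client_overhead` is hypothetical.

```python
def per_client_overhead(num_data_servers, probes_per_ping=1):
    """Count the traffic caused by one client hit."""
    queries = 2 * num_data_servers    # SpeedTo + MyLoad to each Data Server
    responses = 2 * num_data_servers  # one reply to each query
    pings = probes_per_ping * num_data_servers
    return {
        "hit_server_messages": queries + responses,
        "pings_to_client": pings,
    }


at_fifty = per_client_overhead(50)
```

Under these assumptions, 50 Data Servers mean 200 Hit Server messages and 50 pings aimed at the client for every initial request.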
2.) Overhead:
This solution is geared towards providing concurrency while not increasing overhead much on our Data Servers. The decision-making and message-generating overhead of the system is all absorbed by our Hit Servers. This is desirable, since our Data Servers will be running computationally intensive server applications and we want to lessen perceived slowness for our customers. In contrast, consider a completely distributed system with no Hit Servers, where a request is fed at random to any one of our Data Servers and is then forwarded to another appropriate Data Server based on similar decisions. Such a system concentrates the added overhead on the critical Data Servers. Our design does add slight overhead to the network and to our client through the need to measure packet round-trip times. This overhead increases linearly with the number of Data Servers we have, but should not be a problem for our high-speed network or the high-speed networks of our clients.
3.) Failure and Disaster Recovery:
The Hit Server cluster is the heart of our intelligent redirection system. Having more Hit Servers in the system not only increases efficiency, but also protects against failure. Should one fail, the others are exact mirrors and will perform the same function. The constant monitoring of Data Servers’ availability and workload also gives us the upper hand in any disaster situation. It is immediately known when a Data Server fails because of a hardware or network failure. After a failure, the affected Data Server is no longer available to receive client redirects. This will not stop the operation of our web site; however, if loads are already high on the remaining servers, clients may notice increased latency and slowness. This can be mitigated by provisioning enough Data Servers that we have sufficient capacity even with one out of service.
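A rough capacity check illustrates the point. The sketch below assumes, purely for illustration, that workload scores (out of 1K) are additive and that a failed server's load spreads evenly over the survivors. Neither assumption comes from the design itself, and both function names are hypothetical.

```python
def workloads_after_failure(workloads, failed):
    """Spread the failed server's workload evenly over the survivors."""
    survivors = {s: w for s, w in workloads.items() if s != failed}
    if not survivors:
        raise ValueError("no Data Servers left")
    share = workloads[failed] / len(survivors)
    return {s: w + share for s, w in survivors.items()}


def survives_any_single_failure(workloads, limit=1000):
    """True if no survivor exceeds `limit` whichever one server fails."""
    return all(
        max(workloads_after_failure(workloads, failed).values()) <= limit
        for failed in workloads
    )


healthy = {"a": 600, "b": 700, "c": 500}
overloaded = {"a": 900, "b": 900}
```

Here `healthy` tolerates the loss of any one server, while `overloaded` does not: losing either server would push the other to 1800 out of 1K.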