A System for Using Multiple Web Site Mirrors:

 

How to increase bandwidth and improve response time on a worldwide scale

Benjamin O’Connor

6.033 Design Project #1

March 18th 1999

Prof. T. Saltzer (TR1)

 

 

Abstract:

 

          A popular website with an international audience needs a distributed array of servers to transfer information quickly and efficiently in large volumes.  This design provides a solution of relatively low complexity with redundancy and scalability issues also taken into account.  This solution redirects web clients from an entry point to a particular mirrored web server automatically based on server loads and calculated communications speed to the client.  Each client is treated individually and is directed to the particular Data Server best suited for it.  Taking into account the current workloads of each potential server allows servers to maintain balanced, reasonable workloads.  Measuring the communication response times to clients provides for the quickest possible server connection.


Operational Overview

 

            The distribution of load and traffic in this proposed system arises from separating the host that receives a client’s initial request (a Hit Server) from the host that handles all subsequent requests from that client (a Data Server).  The hostname www.acme.com resolves to one of at least two Hit Servers via a round-robin DNS scheme.  All of the Hit Servers are exact mirrors of each other.  The Hit Servers maintain open TCP channels to all of the organization’s Data Servers, which are placed strategically around the globe.  When a Hit Server is accessed, it immediately queries all Data Servers over these open TCP channels to determine where this particular client should be redirected to obtain the best performance.

One response from each Data Server tells the Hit Server that Data Server’s current workload.  Another response tells the Hit Server the round-trip packet transfer time from that Data Server to the client.  Using these loads and response times, the Hit Server determines which Data Server is the best match for our client.  The client is then redirected via a standard HTTP redirect to that Data Server, where all subsequent requests are submitted.  The open TCP channels also allow the Hit Servers to monitor the status and availability of our Data Servers and provide immediate notification of a network or machine outage.  Similarly, the Hit Servers monitor each other to handle network and machine outages that may occur in their cluster.  (See Fig. 1)

Figure 1.  System Overview
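The final step of the overview, the HTTP redirect, can be sketched in a few lines of Python. This is a minimal illustration and not a definitive implementation: the hostname ds2.acme.com, the function names, the port number, and the choice of status code 302 are all assumptions, and the selection step is stubbed out.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def redirect_location(data_server_host, request_path):
    """Build the URL the client is sent to, preserving the requested path."""
    if not request_path.startswith("/"):
        request_path = "/" + request_path
    return "http://" + data_server_host + request_path


def choose_data_server(client_ip):
    # Placeholder for the selection step. A real Hit Server would query
    # the Data Servers here; the hostname below is hypothetical.
    return "ds2.acme.com"


class HitServerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        client_ip = self.client_address[0]
        target = redirect_location(choose_data_server(client_ip), self.path)
        self.send_response(302)  # temporary redirect to the chosen mirror
        self.send_header("Location", target)
        self.end_headers()


def serve(port=8080):
    # The port is arbitrary; a production Hit Server would listen on 80.
    HTTPServer(("", port), HitServerHandler).serve_forever()
```

Calling serve() starts the sketch; every GET request then receives a redirect whose Location header points at the same path on the chosen Data Server.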


Theory of Operation

            The first and most rudimentary method used in this system is the round-robin DNS assignment of multiple Hit Servers.  As clients attempt to access the acme web site, the fully qualified domain name www.acme.com resolves to different IP addresses, so consecutive connections are directed to different machines.  This balances the workload on our Hit Servers.  It also increases redundancy by ensuring a backup if one Hit Server should fail.  These Hit Servers do not necessarily have to be located together, although placing them in separate places around the world would not give us a great performance gain, since connections to them are assigned in rotation and not by any speed heuristic.
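The rotation itself can be illustrated with a short Python sketch. It models only the effect seen by clients (successive lookups receive successive addresses) and not a real DNS server, which would typically rotate the order of the records it returns. The addresses are hypothetical and taken from a range reserved for documentation.

```python
from itertools import cycle

# Hypothetical Hit Server addresses (documentation-only address range).
HIT_SERVERS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]


def make_resolver(addresses):
    """Return a function that hands out addresses in strict rotation."""
    rotation = cycle(addresses)

    def resolve(hostname):
        # Every lookup of the site name gets the next address in the list,
        # regardless of which client asked or where it is located.
        return next(rotation)

    return resolve


resolve = make_resolver(HIT_SERVERS)
answers = [resolve("www.acme.com") for _ in range(4)]
print(answers)
# ['192.0.2.10', '192.0.2.11', '192.0.2.12', '192.0.2.10']
```

Note that the client's location plays no part in the answer, which is why spreading the Hit Servers around the world gains little.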

            In order for the Hit Servers to redirect clients to an appropriate Data Server, a list of all Data Servers in the organization will reside on each Hit Server.  On each request, the Hit Server will iterate down this list of connected Data Servers.  For each Data Server, the Hit Server will send two messages and accept two responses.  The SpeedTo(client) message sent to a Data Server will evoke a response containing the round-trip time of a packet sent from that Data Server to our client.  The MyLoad() message will request that the Data Server respond with a measure of its current workload, which combines current connection availability and processor usage.  If no response is received after a reasonable but small amount of time, it will be assumed that the corresponding Data Server is not functioning, and no clients will be redirected to it.  Given the responses to these two queries for each Data Server, the Hit Server will determine which one our client is redirected to.  Packet response time will take precedence in the decision, with a threshold value set on acceptable server load.  For example:

Data Server       Response Time (out of 1K)   Workload (out of 1K)
Data Server #1    500                         500
Data Server #2    100                         100
Data Server #3    300                         100
Data Server #4    400                         700
Data Server #5    200                         950

Figure 2.

 

In this case, it is obvious that our client will be redirected to Data Server #2, which has the fastest response time and a workload no other server beats.  The heuristic is as follows: consider only the best half of our Servers ranked by response time.  Among these, redirect our client to the Data Server with the best (lowest) average of Response Time and Workload.  Averages within a certain percentage tolerance of each other (10% is reasonable) count as a tie.  In the case of a tie, take the Server with the lowest workload.

 

Data Server       Response Time (out of 1K)   Workload (out of 1K)
Data Server #1    50                          200
Data Server #2    100                         160
Data Server #3    200                         500
Data Server #4    500                         750
Data Server #5    850                         102

Figure 3.

 

In this case, Servers 4 and 5 are not considered due to their slow response times.  Server 1 achieves an average of 125, Server 2 an average of 130, and Server 3 an average of 350.  Servers 1 and 2 count as tied because their averages are within the 10% threshold of each other.  Therefore, our client will be redirected to Server 2, as it has the lower workload of the two (160 versus 200).
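The selection heuristic can be written out as a short Python sketch and checked against the values in Figures 2 and 3. Two details are assumptions not fixed by the text: the "best half" is rounded up (so three of five servers are kept, matching the Figure 3 example), and a tie means an average within 10% of the best average. The separate threshold on server load is not modeled here, and the function name is illustrative.

```python
import math


def choose_server(stats, tie_tolerance=0.10):
    """Pick a Data Server from {name: (response_time, workload)}.

    Servers that timed out should be left out of `stats` by the caller.
    """
    if not stats:
        return None
    # Step 1: keep the faster half by response time (rounded up: assumption).
    by_speed = sorted(stats, key=lambda name: stats[name][0])
    candidates = by_speed[:math.ceil(len(by_speed) / 2)]

    # Step 2: score each candidate by the mean of response time and workload.
    score = {name: (stats[name][0] + stats[name][1]) / 2 for name in candidates}
    best = min(score.values())

    # Step 3: anything within the tolerance of the best score counts as tied,
    # and ties go to the server with the lowest workload.
    tied = [name for name in candidates
            if score[name] <= best * (1 + tie_tolerance)]
    return min(tied, key=lambda name: stats[name][1])


figure_2 = {1: (500, 500), 2: (100, 100), 3: (300, 100),
            4: (400, 700), 5: (200, 950)}
figure_3 = {1: (50, 200), 2: (100, 160), 3: (200, 500),
            4: (500, 750), 5: (850, 102)}
print(choose_server(figure_2), choose_server(figure_3))  # 2 2
```

Both examples select Data Server #2: outright in Figure 2, and through the workload tie-break in Figure 3.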

              Workload and connection status are constantly monitored for all Data Servers.  High-workload warnings are logged, and a System Administrator is notified in the case of high workloads or extremely long or timed-out network responses.  In this way, decisions can be made as to which geographic regions might need additional Data Servers.

Adding another Data Server to our network would not be a complicated operation.  A new server would have to be placed on the network and its configuration copied from an existing one.  In addition, the master list of Data Servers would have to be updated on the Hit Servers in order for clients to be redirected to the new Data Server.  Our Data Servers could run any web server software.  One configuration exists for all of them, as they are all mirrors of each other.  Besides the HTTP server software, a small process runs on each Data Server.  This process maintains the TCP connection with the Hit Servers and responds to the query messages.  It also spawns the pings to our prospective client that are needed to measure the network round-trip time.
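The small process on each Data Server could be organized along the following lines. This Python sketch shows only the query dispatch. The text-based wire format ("SPEEDTO <client>", "MYLOAD"), the equal weighting of connection usage and processor usage, and all function names are assumptions made for illustration. The ping and the system counters are passed in as functions and stubbed in the example.

```python
def workload_score(active_connections, max_connections, cpu_fraction):
    """Combine connection usage and CPU usage into a score out of 1000."""
    connection_fraction = min(active_connections / max_connections, 1.0)
    cpu_fraction = min(max(cpu_fraction, 0.0), 1.0)
    # Equal weighting of the two factors is an assumption.
    return round(1000 * (connection_fraction + cpu_fraction) / 2)


def handle_query(line, measure_rtt, current_load):
    """Answer one query line received from a Hit Server.

    measure_rtt(client) and current_load() are supplied by the caller so
    the dispatch logic stays independent of how pings are actually sent.
    """
    parts = line.split()
    if len(parts) == 2 and parts[0] == "SPEEDTO":
        return "SPEED %s %d" % (parts[1], measure_rtt(parts[1]))
    if parts == ["MYLOAD"]:
        return "LOAD %d" % current_load()
    return "ERROR unknown query"


# Stubbed measurements stand in for a real ping and real system counters.
reply_speed = handle_query("SPEEDTO 198.51.100.7",
                           lambda client: 120, lambda: 0)
reply_load = handle_query("MYLOAD", lambda client: 0,
                          lambda: workload_score(300, 1000, 0.5))
print(reply_speed)  # SPEED 198.51.100.7 120
print(reply_load)   # LOAD 400
```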


Implementation Considerations

 

1.) System Extensibility:

            As we have already seen, a new Data Server can easily be added to our system.  As the number of Data Servers grows, so does the message-handling overhead on our Hit Servers.  To remedy this, we can add more Hit Servers to our DNS round robin to further distribute clients.  Beyond a certain number of Data Servers, however, the number of message responses concerning just one client would become a hindrance.  The pings from all of our Data Servers to our prospective client machine would also become a hindrance, and may slow down our clients.  This is the upper limit of the system.  It would only be reached once a relatively large number of Data Servers (more than 50) are participating.
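A quick calculation shows how this traffic grows. The Python sketch below counts the messages caused by a single client hit, assuming two queries and two responses per Data Server as described earlier, and assuming (since the text does not say) that each Data Server sends one ping to the client.

```python
def per_request_traffic(data_servers, pings_per_server=1):
    """Count the messages generated by one client hit."""
    queries = 2 * data_servers      # SpeedTo + MyLoad to each Data Server
    responses = 2 * data_servers    # one reply per query
    pings = pings_per_server * data_servers  # probes that reach the client
    return {"hit_server_messages": queries + responses,
            "client_pings": pings}


small = per_request_traffic(5)
large = per_request_traffic(50)
print(small)  # {'hit_server_messages': 20, 'client_pings': 5}
print(large)  # {'hit_server_messages': 200, 'client_pings': 50}
```

At 50 Data Servers, each hit costs the Hit Server 200 messages and subjects the client to 50 pings, which illustrates why this region is treated as the practical ceiling.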

 

2.) Overhead:

            This solution is geared towards providing concurrency without adding much overhead to our Data Servers.  The decision-making and message-generating overhead of the system is absorbed by our Hit Servers.  This is desirable, since our Data Servers will be running computationally intensive server applications and we want to lessen perceived slowness for our customers.  In contrast, consider a completely distributed system with no Hit Servers, where a request is fed at random to any one of our Data Servers and is then forwarded to another appropriate Data Server based on similar decisions.  Such a system concentrates the added overhead on the critical Data Servers.  In our design, slight overhead is added to the network and our client by the need to measure packet round-trip times.  This overhead increases linearly with the number of Data Servers we have, but should not be a problem for our high-speed network or the high-speed networks of our clients.

 

3.) Failure and Disaster Recovery:

            The Hit Server cluster is the heart of our intelligent redirection system.  Having more Hit Servers in the system not only increases efficiency, but also protects against failure.  Should one fail, the others are exact mirrors and will perform the same function.  The constant monitoring of Data Servers’ availability and workload also gives us the upper hand in any disaster situation.  It is immediately known when a Data Server fails because of a hardware or network failure.  After a failure, the affected Data Server is not available to receive any client redirects.  This will not stop the operation of our web site; however, if loads are already high on the remaining servers, clients may notice increased latency and slowness.  This can be mitigated by adding enough Data Servers that we have sufficient capacity even with a failure.
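The exclusion of failed Data Servers can be sketched as a parallel query with a deadline. This is one possible arrangement and not necessarily how the open TCP channels would be serviced in practice. The thread pool, the function names, and the timeout values are assumptions, and the queries are stand-in functions, not real network calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def poll_data_servers(query_functions, timeout_seconds):
    """Query every Data Server in parallel; drop any that miss the deadline.

    query_functions maps a server name to a zero-argument callable that
    returns (response_time, workload). Both are stand-ins for real queries.
    """
    pool = ThreadPoolExecutor(max_workers=max(len(query_functions), 1))
    futures = {pool.submit(fn): name for name, fn in query_functions.items()}
    done, _late = wait(list(futures), timeout=timeout_seconds)
    pool.shutdown(wait=False)  # do not block on servers that never answer

    alive = {}
    for future in done:
        if future.exception() is None:  # a query that raised counts as failed
            alive[futures[future]] = future.result()
    failed = sorted(set(query_functions) - set(alive))
    return alive, failed


def slow_server():
    time.sleep(0.5)  # simulates a Data Server that answers too late
    return (100, 100)


alive, failed = poll_data_servers(
    {"ds1": lambda: (50, 200), "ds2": slow_server}, timeout_seconds=0.1)
print(alive, failed)  # {'ds1': (50, 200)} ['ds2']
```

Only the servers in `alive` would be passed on to the selection step, and the `failed` list is what would be logged and reported to a System Administrator.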