Thursday, August 13, 2009

Performance of HTTP polling duplex server-side channel in Microsoft Silverlight 3

Introduction

Silverlight supports a communication protocol that allows a server to asynchronously send messages to a Silverlight client (“push”) using Windows Communication Foundation (WCF) services. This feature is based on a polling duplex (HTTP long polling, or Comet-style) protocol, and ships as two DLLs in the Microsoft Silverlight SDK, both named System.ServiceModel.PollingDuplex.dll – one for the Silverlight client and one for the server. More information can be found in MSDN help.

In Silverlight 3, the usability of this feature was greatly improved, and integrated with the Add Service Reference tool in Visual Studio, making it very easy to write client-side code to consume messages “pushed” from the server.

However, writing the server-side “push” service in a way that performs well for a large number of clients, and assessing the viability of the polling duplex protocol for any given application, remained a challenge. This post and the accompanying sample attempt to help you decide whether this technology is suitable for the requirements of your scenario.

Scenarios

Scenarios that were considered for measuring polling duplex performance are based on the pub/sub architecture. Clients can contact the server to subscribe to events associated with a specific topic. When an event is published for that topic, the server sends out notification about that event to all clients who subscribed to the topic.
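
To make the shape of such a service concrete, here is a minimal sketch of the service and callback contracts this architecture implies; the interface, operation, and parameter names are illustrative, not necessarily those of the accompanying sample:

using System.ServiceModel;

[ServiceContract(CallbackContract = typeof(IPubSubCallback))]
public interface IPubSubService
{
    // Register the caller as a subscriber of the given topic.
    [OperationContract(IsOneWay = true)]
    void Subscribe(string topic);

    // Publish an event; the server fans it out to the topic's subscribers.
    [OperationContract(IsOneWay = true)]
    void Publish(string topic, string content);
}

public interface IPubSubCallback
{
    // Invoked by the server to "push" a notification to the client.
    [OperationContract(IsOneWay = true)]
    void Notify(string topic, string content);
}

The WCF code tuning section below revisits the callback contract; replacing the typed Notify parameters with a raw Message enables an important optimization.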

Within this pub/sub architecture, two classes of scenarios were considered: broadcast and collaboration.

Broadcast Scenario

In this scenario, many clients subscribe to the same topic. When an event is published to that topic, all clients receive a notification.

[Figure: broadcast scenario]

For example, if 3,000 clients are connected, such a scenario could involve all 3,000 clients monitoring the price of a certain stock, or the progress of a certain sporting event, through a Silverlight application. As soon as the stock quote or the sports score changes, the same notification is broadcast to all 3,000 clients.

Collaboration Scenario

In this scenario, several topics exist. Each topic has several clients acting as both subscribers and publishers. When one of the clients publishes an event to a topic, the server notifies all remaining clients subscribed to that topic.

[Figure: collaboration scenario]

For example, if 3,000 clients are connected, such a scenario could involve 1,500 one-on-one technical support chats between a customer and a technician through a Silverlight chat client, or 1,000 real-time games with 3 participants each, or 1,500 collaborations with 2 people editing the same diagram or data grid in real-time. In all these cases, the action of one person immediately gets “pushed” to their collaborator(s), but does not affect the other clients. In reality, many scenarios will fall in between the two extremes of broadcast and 2-client collaboration.

Scalability vs performance

The current implementation of the polling duplex protocol in Silverlight 3 requires client affinity to a particular physical server machine for the lifetime of the WCF channel (WCF proxy). Moreover, the server maintains in-memory state for the duration of the session with the client. If you are using a load balancer that cannot guarantee client affinity to a particular backend, or if your hosting infrastructure cannot guarantee that the service will keep running on the same machine, the protocol will fail. In practice, this means the Silverlight polling duplex protocol in its current form does not scale out well in load-balanced web farms.

You can work around the backend affinity limitation by performing load balancing manually at the application level. For example, suppose that for your scenario, based on this document or your own measurements, you discover that you can only support 3,000 clients but need to support 6,000. You can set up two servers with explicitly different domain names (e.g. at service1.contoso.com and service2.contoso.com), and the client can select which one to connect to either randomly, based on a hash of a known value (e.g. the topic name), or by calling a “discovery service” that returns the address of the duplex service to connect to.
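
For the hash-based variant, a minimal client-side sketch could look as follows (the host names come from the example above; the service path and the topicName variable are assumptions):

// Pick one of two explicitly named servers by hashing a known value
// (here the topic name), so a given topic always maps to the same server.
string[] hosts = { "service1.contoso.com", "service2.contoso.com" };
int index = (topicName.GetHashCode() & 0x7fffffff) % hosts.Length;
Uri address = new Uri("http://" + hosts[index] + "/PubSubService.svc");

Hashing on the topic name has the added benefit that all clients collaborating on the same topic land on the same server, which the server's in-memory pub/sub state requires.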

There is no easy way to work around the problem of the server maintaining in-memory session state at the moment. We are actively working on addressing this problem in future releases.

Taking these scalability considerations into account, it is still important to understand performance characteristics of the protocol implementation to assess its suitability for a particular scenario. Performance is the main topic of this post.

Tuning

To optimize the performance of the Polling Duplex component, certain settings in IIS, ASP.NET, the .NET Framework, and WCF need to be adjusted. All measurements in this document assume that these modifications have been made. Some of the settings discussed below need to be changed when a system is put into production; such differences are discussed where appropriate. Please refer to the accompanying sample for additional tuning information.

WCF configuration tuning

The polling duplex protocol is implemented on the server side as a WCF binding that provides session channels. Each session channel corresponds to a single client connection. The number of session channels a WCF service can support concurrently is throttled by the ServiceThrottlingBehavior.MaxConcurrentSessions service behavior. In order to measure the maximum number of connections the server can support, this throttle needs to be increased. For the purpose of this measurement, we increased the throttle to Int32.MaxValue using WCF configuration:

<serviceThrottling maxConcurrentSessions="2147483647"/>

When the system is put into production, you would want to set the value of maxConcurrentSessions to match the maximum number of clients your service can support concurrently.

Several customizations were made to the settings of the PollingDuplexBindingElement for the purposes of these performance measurements:

<binding name="PubSub">
    <binaryMessageEncoding/>
    <pollingDuplex maxPendingSessions="2147483647"
                   maxPendingMessagesPerSession="2147483647"
                   inactivityTimeout="02:00:00"
                   serverPollTimeout="00:05:00"/>
    <httpTransport/>
</binding>

The value of inactivityTimeout controls the maximum time without any activity on the channel before the channel is faulted. The value has been set to 2 hours to avoid a situation where a channel is faulted due to infrequent message exchanges in a test variation. In production, you should set this value to exceed the expected duration of a client connection, which is application specific. Regardless of the value, though, the client code should take appropriate fault-tolerance measures, such as re-establishing the connection when the channel is faulted due to inactivity, as the sketch below illustrates.
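
A minimal sketch of such fault tolerance on the client follows, assuming a hypothetical PubSubClient proxy (with a generated SubscribeAsync method) produced by Add Service Reference:

PubSubClient client;

void Connect()
{
    client = new PubSubClient();
    client.InnerChannel.Faulted += (sender, e) =>
    {
        // The channel faulted (e.g. inactivityTimeout elapsed without
        // traffic): discard the old proxy and re-establish the connection.
        client.Abort();
        Connect();
    };
    client.SubscribeAsync("stock-quotes");
}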

The value of serverPollTimeout controls the maximum time the server will hold onto the client’s long poll HTTP request before responding. If that time elapses without the server having an application message to push back to the client, the server sends back an empty HTTP OK response (causing the client to re-issue a new long poll). For the purpose of this measurement, the value has been set to 5 minutes to minimize the frequency, and therefore the cost, of empty polls. The default value of this setting is 15 seconds. In production, in addition to performance considerations, one should account for proxy servers, which may limit the duration of outstanding HTTP requests.

The value of maxPendingSessions throttles the number of new sessions that wait to be accepted on the server. This situation can occur when the speed at which new sessions are established at the server (new clients connect) exceeds the server’s ability to accept them. This value has been increased to Int32.MaxValue from the default of 4 to allow for the client connection pattern implemented in our performance code, where all the clients attempt to connect at once at the beginning of the test. In a typical production scenario, new client connections are spread more evenly in time, in which case the default value of 4 should be adequate.

The value of maxPendingMessagesPerSession throttles the number of new messages from a client (sent over a particular session) that wait to be accepted by the server. Similarly to maxPendingSessions, this situation can occur when the speed at which new messages arrive at the server exceeds the server’s ability to accept them. The value has been increased to Int32.MaxValue (the default is 8) to allow for the initial wave of messages given the client connection pattern in the test. In a typical production scenario, messages from the client can be expected to be spread more evenly in time, in which case the default value of 8 should be appropriate.
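
For reference, the same custom binding can be constructed in code on the server; this sketch assumes the PollingDuplexBindingElement properties mirror the XML attributes above, following the usual WCF configuration convention:

// Code equivalent of the custom binding configuration shown earlier.
CustomBinding binding = new CustomBinding(
    new BinaryMessageEncodingBindingElement(),
    new PollingDuplexBindingElement()
    {
        MaxPendingSessions = int.MaxValue,
        MaxPendingMessagesPerSession = int.MaxValue,
        InactivityTimeout = TimeSpan.FromHours(2),
        ServerPollTimeout = TimeSpan.FromMinutes(5)
    },
    new HttpTransportBindingElement());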

The PollingDuplexHttpBinding standard binding uses binary encoding (newly added in Silverlight 3) by default. The custom binding used in the performance measurements also uses binary encoding:

<binaryMessageEncoding/>

Binary encoding has several performance benefits compared to text encoding, which are outlined in the Improving the performance of web services in Silverlight 3 Beta post.

ASP.NET and IIS7 configuration tuning

In .NET Framework 3.5 SP1, WCF introduced a new asynchronous HTTP handler for IIS7 which allows for better scalability of WCF services by not blocking worker threads for the duration of high latency service operations. This feature is described in detail in Wenlong Dong’s blog post about asynchronous WCF HTTP Handlers. In order to realize the full potential of asynchronous HTTP handlers, the concurrent HTTP request quota (MaxConcurrentRequestsPerCPU registry setting) also needs to be adjusted, which is described in more detail in Thomas Marquardt’s post about threading in IIS 6.0 and IIS 7.0.

Registration of the asynchronous HTTP handler for WCF, as well as adjustment of the quota for concurrent HTTP requests, can be performed with Wenlong Dong’s WcfAsyncWebTool:

WcfAsyncWebTool.exe /ia /t 20000

For the purpose of this test, we set the quota to 20000 to ensure it does not become the limiting factor in measuring the maximum number of clients the server can accommodate. In production, this limit should take into account both the sustainable number of clients (determined from performance measurements of the actual scenario) and an acceptable working set.

Following the setting of the quota for concurrent HTTP requests to 20000, the quotas for the thread pool worker threads and IO threads also need to be adjusted through .NET Configuration in machine.config:

<processModel maxWorkerThreads="20000"
              maxIoThreads="20000"
              minWorkerThreads="10000"/>

These settings ensure there is an adequate supply of threads to handle the HTTP requests IIS7 will accept given the concurrent HTTP request quota, as well as for the service to send bursts of asynchronous notifications to clients.

WCF code tuning

WCF makes it easy to author a well-performing RPC service, but the requirements and messaging patterns of a pub/sub service are sufficiently different from RPC to require a few performance optimizations in code.

One consideration is that a pub/sub service often sends out multiple identical messages to several clients, in particular in the broadcast scenario. Given that a substantial portion of the cost of sending a message is serializing its content, it is worthwhile to pre-serialize a message once and then send a copy of it multiple times. In order to accomplish this, the callback contract of a pub/sub service should take a Message as a parameter as opposed to typed parameters. This allows the message to be pre-serialized and converted to a MessageBuffer using TypedMessageConverter. Then, for every notification to be sent, the MessageBuffer can be used to create a clone of the Message without incurring the serialization cost.
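
A minimal sketch of this technique follows, with the callback contract’s Notify operation now taking a raw Message; the NotificationData message contract and the action URI are illustrative, and the notification and subscribers variables are assumed to be in scope:

// Serialize the notification content once...
TypedMessageConverter converter = TypedMessageConverter.Create(
    typeof(NotificationData), "urn:PubSub/Notify");
Message prototype = converter.ToMessage(notification);
MessageBuffer buffer = prototype.CreateBufferedCopy(int.MaxValue);

// ...then send a cheap clone to every subscriber without re-serializing.
foreach (IPubSubCallback subscriber in subscribers)
{
    subscriber.Notify(buffer.CreateMessage());
}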

Sending a large number of notifications from the server is a high latency operation. In order to optimize thread use, sending the notifications to clients in a loop should use the asynchronous methods of the callback contract, or enqueue the synchronous invocations on thread pool threads. We did not measure a meaningful difference in performance between these two methods.
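
For example, the callback contract can expose the standard Begin/End asynchronous pattern (names illustrative), so that a burst of notifications does not block a thread per client:

// Asynchronous version of the callback operation.
[OperationContract(IsOneWay = true, AsyncPattern = true, Action = "*")]
IAsyncResult BeginNotify(Message message, AsyncCallback callback, object state);
void EndNotify(IAsyncResult result);

// Sending a burst asynchronously:
foreach (IPubSubCallback subscriber in subscribers)
{
    subscriber.BeginNotify(buffer.CreateMessage(),
        ar => ((IPubSubCallback)ar.AsyncState).EndNotify(ar), subscriber);
}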

The concurrency mode of the WCF service should be set to ConcurrencyMode.Multiple, and the instance context mode to InstanceContextMode.Single. This requires explicit synchronization code to be added around access to critical resources (e.g. shared data structures), but the extra effort pays off in reduced contention among concurrent requests to the service.
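
A sketch of the resulting service class declaration, with the kind of explicit synchronization involved (the topic dictionary is illustrative):

using System.Collections.Generic;
using System.ServiceModel;

[ServiceBehavior(InstanceContextMode = InstanceContextMode.Single,
                 ConcurrencyMode = ConcurrencyMode.Multiple)]
public class PubSubService : IPubSubService
{
    readonly object sync = new object();
    readonly Dictionary<string, List<IPubSubCallback>> topics =
        new Dictionary<string, List<IPubSubCallback>>();

    public void Subscribe(string topic)
    {
        IPubSubCallback callback =
            OperationContext.Current.GetCallbackChannel<IPubSubCallback>();
        lock (sync) // explicit synchronization around the shared dictionary
        {
            List<IPubSubCallback> subscribers;
            if (!topics.TryGetValue(topic, out subscribers))
            {
                topics[topic] = subscribers = new List<IPubSubCallback>();
            }
            subscribers.Add(callback);
        }
    }

    public void Publish(string topic, string content)
    {
        // Pre-serialize the notification once and send a clone to each
        // subscriber, as in the earlier sketches; omitted for brevity.
    }
}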

All of these optimizations are demonstrated in the reference implementation of a pub/sub WCF service using the HTTP polling duplex protocol.

Results

The following server configuration was used in all measurements: 4-proc Intel Xeon 2.66GHz, 4GB RAM, Windows Server 2008 SP1 64bit with .NET Framework 3.5 SP1 and IIS7.

Clients were run on machines other than the server. The number of machines and clients running on each machine was adjusted to achieve the point of saturation of the server.

Message format and content were the same in all measurements. The message consisted of 20 short strings. Although we don’t have formal results as a function of message size, ad-hoc measurements indicated the results were not affected in a meaningful way by message sizes between 1 and 100 strings.

Broadcast scenario results

The broadcast scenario measurement was based on the following script:

  1. M clients subscribed to a single topic on the server.
  2. The server started generating messages to be published to the topic every P seconds. Upon publishing of a message, M notifications were sent to M clients subscribed to the topic (one per client), which we called a “burst”.
  3. Several runs of the test were performed for increasing numbers of M, up to the point where the server was unable to finish sending all messages in a burst before the next burst was due to be sent. The largest value of M at which the server was able to send all messages of each of at least 100 consecutive bursts before the next burst became due was considered the result of the test for a given burst frequency P.

Results of this measurement are shown on the chart below:

[Chart: broadcast scenario results]

As an example of interpreting this data: a single server using the HTTP polling duplex protocol from Microsoft Silverlight 3 can support sending notifications to 5,000 connected clients every 10 seconds.

Collaboration scenario results

The collaboration scenario measurement was based on the following script, simulating a server supporting multiple chat rooms with several participants each:

  1. T topics (chat rooms) are created on the server.
  2. N distinct clients subscribe to every one of the T topics (the total number of clients connected to the server is then N * T).
  3. One of the clients subscribed to any given topic publishes a single message to that topic. This happens simultaneously for all topics. The server broadcasts the message back to the N-1 other clients subscribed to that topic immediately after receiving the publish message.
  4. After all N-1 clients to whom notifications were sent have received them, no activity occurs in the chat room (topic) for D seconds. The time between when the publish message was sent by the publisher to a topic and when it was received by a subscriber to that topic is captured; we call this metric the latency. Each publish event generates N-1 data points, one for every subscriber to the topic other than the publisher.
  5. Every topic (chat room) repeats the #3-#4 cycle independently (simultaneously) over a minimum of 100 iterations.
  6. The latencies gathered across all iterations, topics, and subscribers are statistically analyzed.

The results for a few distinct sets of defining parameters (number of topics T, number of subscribers per topic N, delay between publications D) are presented below. For each of the variations we have measured the mean, median, and standard deviation of the notification latency.

Topics T | Subscribers per topic N | Total clients N * T | Delay D [s] | Mean latency [ms] | Median latency [ms] | Stdev [ms]
---------|-------------------------|---------------------|-------------|-------------------|---------------------|-----------
100      | 5                       | 500                 | 1           | 91                | 63                  | 122
500      | 2                       | 1000                | 15          | 4                 | 0                   | 12
500      | 3                       | 1500                | 15          | 108               | 94                  | 73
800      | 3                       | 2400                | 15          | 2743              | 1328                | 3195
1000     | 2                       | 2000                | 5           | 496               | 498                 | 288
1000     | 2                       | 2000                | 15          | 6                 | 0                   | 17
2000     | 2                       | 4000                | 15          | 25                | 0                   | 99

Whenever the median latency shows 0 ms, it indicates that the latency of over 50% of the data points was below the smallest time span we could measure.

The data indicates a single server can support 2,000 simultaneous chat rooms with 2 participants each and a 15 second delay between publications with a 25 ms mean latency (0 ms median latency), which should satisfy the latency requirements of most UI-driven scenarios. At the same time, the data shows that the latency gets out of hand with 800 chat rooms with 3 participants each and a 15 second delay between publications.

21 comments:

  1. Very good article. Thank you for writing. This provides some good evidence about the suitability of full duplex in several scenarios. This question comes up often when talking about Silverlight client-server models, but it is hard to answer well without having the numbers to back it up. Based upon these figures it seems as though a pub/sub architecture would actually be viable for some applications, which was not what I expected. Pessimistically, I had assumed that the performance hit would be too great. However, 1000+ simultaneously connected users is a large threshold for most simple web applications.

    Thanks!

  2. When you calculate the number of clients in the broadcast scenario, is that per value subscribed to or per user? For the stock price example you give, a user would likely be subscribed to changes in a portfolio of stocks (20 or so), not just 1. So if you have 1,000 users then there would be 20,000 price subscriptions. Is there a way to group all 20 subscriptions into just 1 per client to increase server capacity?

  3. The broadcast scenario assumes the existence of a single topic (e.g. a stock quote) all clients are subscribed to. The number of subscribers in the broadcast scenario is to be understood as the total number of clients connected to the server simultaneously, which require simultaneous notifications. Please note, however, that from the perspective of the polling duplex protocol, the content of the notifications sent to clients is less important than the number of clients and the notification frequency - the messages themselves may be identical or they may differ (assuming they remain within a size ballpark, of course). One could model the multiple ticker subscriptions using separate topics (in the pub/sub sense) for every ticker. One could also model this scenario using a single topic per unique combination of stock tickers. The latter approach would most likely perform better with the polling duplex protocol, as it requires fewer notifications to be sent to every client (at most as many as there are clients).

  4. Hi Tomasz, I have tried your application (pubsub) on Server 2008, 4 GB RAM, IIS7 with the changes you describe in your blog, but it only opens 10 clients; the 11th does not work. Please provide a solution for this problem. Thanks

  5. It looks like you may be affected by the ServiceThrottlingBehavior.MaxConcurrentSessions throttle, default value of which is 10. One of the configuration changes described in this blog suggests to increase the throttle using the respective configuration setting:

    <serviceThrottling maxConcurrentSessions="2147483647"/>

    Have you applied this change to your service?

  6. I have already set these settings. I am checking this by opening more than 10 IE windows, and the 11th does not work.

  7. Hi, I also checked this by creating a service object in a loop. On the 11th, the service object fails to create; it gives a communication exception.

  8. Would you recommend enabling compression ("enable dynamic content compression") and adding <add mimeType="application/soap+msbin1" enabled="true" /> to ApplicationHost.config?

  9. You should measure the effect of dynamic content compression vs. binary encoding in WCF in your particular scenario, but as a rule of thumb dynamic content compression will yield substantially smaller payload sizes than WCF's binary encoding at the cost of substantially higher CPU utilization.

  10. Regarding the limit of the number of clients that could be connected to the pub/sub service from the sample, this is a client side issue, not a server side limitation (assuming you have bumped up the MaxConcurrentSessions correctly). You can validate this by opening additional browser windows from a different client machine, or using a different browser brand from the same machine (e.g. IE and Chrome). Which version of IE are you using?

  11. Regarding compression above, I meant compression + binary. With binary only, a 5000-row datagrid is 5 MB and takes 20 seconds to download/render, but when I enable binary + compression, the same 5000-row report is 300 KB and takes 3 seconds to render. Is there any reason we should not enable it for few users but large reports?
    slyi

  12. I don't recommend enabling WCF binary encoding if you already opt into dynamic compression in IIS. The latter is a high-cost high-yield solution. Binary encoding in addition to IIS compression will yield marginal improvements in terms of bandwidth savings. Have you checked your numbers with IIS compression alone and text encoding? You can find some considerations for enabling dynamic IIS compression at http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/25d2170b-09c0-45fd-8da4-898cf9a7d568.mspx?mfr=true.

  13. I'm curious as to how you obtained these numbers. How did you write the test client (clearly not Silverlight) and get it to use the polling duplex binding?

  14. You are right, we did not use Silverlight on the client to measure server side performance. We have compiled a .NET 3.5 version of the client side of the polling duplex protocol. While doing so, we have turned off the optimization whereby the client will only issue a single poll at a time per scheme/host/port combination at the scope of an AppDomain. (The goal of this optimization in the first place is to reduce the impact of the polling duplex protocol on the connection limit imposed by virtually all common browsers). This allowed us to simulate thousands of Silverlight clients using a single .NET 3.5 process running on a single machine. We then scaled the number of clients to reach saturation point on the server.

  15. Thanks for your response. Is there any way I can get a hold of this testing code?

    Adam

  16. Unfortunately the .NET 3.5 version of the client side protocol is currently not available publicly. We are planning to release a sample (in source code form) that demonstrates the consumption of the protocol from .NET 3.5, I will make sure to blog about it once it is available.

  17. Very interesting article! I read it some weeks ago, and the performance you showed here indicated that this technique would be enough for our needs. Unfortunately, when we developed a test to see if everything performed well enough, we ran into some issues. I would really appreciate it if you could give me some hints about what can be wrong.

    I've developed a SL3-client that uses pollingduplex to connect to a service. The SL-client is the subscribing client. We also have a publishing client that connects to the service using netTCP binding (over internet). The service is hosted on a VPS with WS2008R2 and IIS7.

    The problem is really poor performance! We have tried the service with 30 - 40 clients and it more or less stops responding when trying to connect with more clients. Not stops completely but it's really slow. The task manager on the server shows that the 'IIS worker process' (w3wp.exe) takes up 100% CPU usage.

    The strange thing is that this is caused only by connecting the clients to the service. No publishing occurs.

    We're running the service configured with 'InstanceContextMode.PerCall' and 'ConcurrencyMode.Single'. We're using throttling as well.

    Do you have any idea what might cause this poor performance?

    Another question: In your example you've configured the service with 'InstanceContextMode.Single' and 'ConcurrencyMode.Multiple'. Why would it be better from performance point of view to run the service as a singleton instead of as PerCall?

    Thanks!

    /Kristian

  18. QUOTE ==> We are planning to release a sample (in source code form) that demonstrates the consumption of the protocol from .NET 3.5, I will make sure to blog about it once it is available. <== END QUOTE

    Any news about that non-Silverlight client sample that would support this long polling duplex protocol? (Not the one posted on code.msdn, which is completely different.)

    Thanks!
    Eric

  19. QUOTE ==> We are planning to release a sample (in source code form) that demonstrates the consumption of the protocol from .NET 3.5, I will make sure to blog about it once it is available. <== END QUOTE

    I think that many people out there would like to have this sample...
    Please, if you have any news, let us know...

    Thanks,
    Michael

  20. Excuse me if this is a stupid question: what is the bottleneck, is it simply hardware? So, for instance, broadcast push would be maxed at 1400 users every 2 seconds. To increase that limit significantly, would I have to look at a much better server, or load balancing, or reducing the publish frequency? (This option isn't really any good for my application.)

    Stupid question no. 2: if I had 2 services publishing data out every 2 seconds, would I expect the user limit to drop to 700 per service?

  21. Hi

    I have gone through your article. Currently I'm adding a pub/sub architecture to my system using polling duplex in Silverlight with WCF. The error which kills me in the publish flow is

    "The communication object, System.ServiceModel.Channels.ServiceChannel, cannot be used for communication because channel got aborted."


    It would be very helpful if I could get a solution for the same.

    Thanks
    Praveen

