Sunday, December 21, 2014

Traffic Policing and Shaping - Part 2

Traffic Policing and Shaping Implementation

The main differences between traffic policing and traffic shaping are as follows. Generally speaking, traffic policing takes the harder line against traffic that violates the configured limit: traffic that exceeds a policing limit is typically dropped. Traffic shaping, on the other hand, takes the softer approach of "smoothing" violating traffic by buffering it and attempting to send it when spare bandwidth exists.
Both traffic policing and shaping are implemented using a token bucket mechanism. It is important to have a general idea of how the token bucket mechanism works in order to understand how traffic decisions are made by these two features.
A token bucket has three components: a mean rate, a burst size, and a time interval (Tc). The mean rate (also referred to as the Committed Information Rate, or CIR) specifies how much data can be transmitted per unit of time on average, and is typically expressed in bits per second. The burst size (also referred to as Bc) specifies the amount of traffic allowed to "burst" over the configured mean rate, and is typically expressed in bits (shaping) or bytes (policing) per burst. The time interval, also called the measurement interval, specifies the time quantum in seconds per burst.
The tokens added to the bucket represent the traffic the device is allowed to transmit. The policer or shaper adds tokens to the bucket according to the configured settings; when traffic is ready to be processed, the bucket is checked for enough tokens to pass that amount of traffic. If enough tokens exist, the traffic is allowed; if not, a secondary action is taken: policing typically drops the excess traffic, while shaping typically buffers it. Traffic policing and shaping also differ in how they "fill" the token bucket.
With traffic policing, the bucket is initially filled with tokens equal to the normal (committed) burst value configured in bytes; as this is the total capacity of the bucket, no additional tokens are added until the bucket is first drawn from when the first packet is processed. Upon receipt of the second packet, the token bucket is refilled based on the following formula:
(Time between packets * policer rate)/8 bits per byte = Number of tokens to add
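For example, if 20 ms elapse between two packets on a policer configured at 8,000,000 bps, then (0.020 × 8,000,000)/8 = 20,000 tokens (bytes) are added, capped at the bucket's capacity.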
With traffic shaping, instead of the bucket being filled when each packet is received, it is filled continuously at each timed interval based on the configured rate in bits per second. The timed interval is calculated with the following formula:

Tc = Bc/CIR (in seconds)
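To make the mechanics concrete, here is a minimal Python sketch of a single-rate policer of the kind described above. The class and its names are illustrative, not any vendor's implementation; the refill step follows the inter-packet formula given earlier:

import time

class TokenBucketPolicer:
    """Single-rate policer: the bucket holds at most Bc tokens (bytes)."""

    def __init__(self, cir_bps, bc_bytes):
        self.cir_bps = cir_bps               # committed information rate, bits/sec
        self.bc = bc_bytes                   # bucket capacity (committed burst), bytes
        self.tokens = float(bc_bytes)        # bucket starts full
        self.last_arrival = time.monotonic()

    def conforms(self, packet_bytes):
        """Refill on packet arrival, then decide: transmit (True) or drop (False)."""
        now = time.monotonic()
        elapsed = now - self.last_arrival
        self.last_arrival = now
        # (time between packets * policer rate) / 8 bits per byte = tokens to add
        self.tokens = min(self.bc, self.tokens + elapsed * self.cir_bps / 8)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True                      # conform: transmit
        return False                         # exceed: drop

policer = TokenBucketPolicer(cir_bps=1_000_000, bc_bytes=31250)
print(policer.conforms(1500))                # True while the burst allowance lasts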

Friday, December 12, 2014

Queue Scheduling Mechanisms

We will introduce several common queue-scheduling mechanisms here.

I. FIFO queuing

As shown in the figure above, FIFO (First In, First Out) determines the forwarding order of packets by their arrival time. On a firewall, the resources assigned to users' data traffic depend on the arrival time of packets and the current load on the network. Best-effort services use the FIFO queuing policy.
If a firewall interface has only a single FIFO-based output/input queue, malicious applications can occupy all network resources and seriously impair mission-critical traffic.
Within each queue, packets are sent in FIFO order by default.

II. PQ (Priority Queuing)

Priority queuing is designed for mission-critical applications, whose key characteristic is that they require preferential service during congestion in order to reduce response delay. PQ can flexibly assign priority according to network protocol (e.g., IP or IPX), the interface on which packets arrive, packet length, source/destination IP address, and so on. Priority queuing classifies packets into four types: top, middle, normal, and bottom, in descending order of priority. By default, traffic enters the normal queue.
When dispatching queues, PQ strictly complies with the priority order from high to low, sending packets from the highest-priority queue first; only when that queue is empty does PQ begin to send packets from a lower-priority queue. The system thus places mission-critical packets in a higher-priority queue and ordinary traffic in a lower-priority queue, guaranteeing that mission-critical packets are sent first while ordinary packets are sent in the intervals left free by mission-critical traffic.
The disadvantage of PQ is that packets in the lower-priority queues are starved whenever the higher-priority queues hold packets for extended periods.
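As a minimal illustration (not any vendor's implementation), the dispatch loop below captures this strict-priority behavior, including the starvation risk noted above:

from collections import deque

# Four queues in descending priority, matching the classification above.
queues = {name: deque() for name in ("top", "middle", "normal", "bottom")}

def pq_dequeue():
    """Strict priority: always serve the highest-priority non-empty queue."""
    for name in ("top", "middle", "normal", "bottom"):
        if queues[name]:
            return queues[name].popleft()
    return None                              # all queues empty

queues["normal"].append("pkt-A")             # default queue
queues["top"].append("pkt-B")
print(pq_dequeue())                          # pkt-B is dispatched first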

III. CQ (Custom Queuing)

CQ classifies packets into 17 classes according to configured rules (one class per queue). Packets enter the custom queue corresponding to their class, where they are handled with a FIFO policy.
Of these 17 queues, queue 0 is a system queue (not shown) that is not user-configurable; queues 1 through 16 are user queues, as shown in the figure above. Users can define the traffic classification rules and assign each of the 16 user queues a proportion of the interface bandwidth. During dispatching, packets in the system queue are sent first, until that queue is empty. Then, by polling, packets are taken from user queues 1-16 according to the preconfigured bandwidth proportions and sent out. In this way, different applications can be assigned different amounts of bandwidth, which both ensures that mission-critical applications get more bandwidth and prevents ordinary applications from being starved entirely. By default, traffic enters queue 1.
Another advantage of custom queuing is that bandwidth can be allocated according to how busy each application is, which suits applications with specific bandwidth requirements. Although the 16 user queues are dispatched by polling, the service time for each queue is not fixed: when certain classes have no packets waiting, the CQ dispatching mechanism automatically increases the bandwidth available to the classes that do.
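A minimal sketch of this polling behavior, with illustrative per-queue byte counts, might look like the following; note how an empty queue simply yields its turn, which is the automatic bandwidth redistribution described above:

from collections import deque

system_queue = deque()                       # queue 0: always drained first
user_queues = [deque() for _ in range(16)]   # queues 1..16
byte_counts = [1500] * 16                    # per-round byte quota (illustrative)

def cq_round(send):
    """One polling round: drain the system queue, then let each user queue
    send up to its configured byte count."""
    while system_queue:
        send(system_queue.popleft())
    for q, quota in zip(user_queues, byte_counts):
        sent = 0
        while q and sent < quota:
            pkt = q.popleft()
            sent += len(pkt)
            send(pkt)

user_queues[0].append(b"\x00" * 600)         # a 600-byte packet in queue 1
cq_round(lambda pkt: print(len(pkt), "bytes sent"))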

IV. WFQ (Weighted Fair Queuing)

Before going further into Weighted Fair Queuing, FQ (Fair Queuing) should be introduced first. FQ is designed to share network resources fairly, trying to reduce the delay and jitter of all flows to their optimum levels. It takes multiple aspects into consideration and has the following features:
  • Different queues get a fair opportunity to be dispatched, balancing the delay of each flow overall.
  • Short packets and long packets are treated fairly when dequeuing: if long packets in one queue and short packets in another queue are waiting to be sent at the same time, the short packets also get attention and, statistically, are treated preferentially, reducing the overall inter-packet jitter of every flow.
Compared with FQ, WFQ also takes priority into account when calculating the dispatching order of packets. Statistically, WFQ gives high-priority traffic precedence over low-priority traffic in dispatching. WFQ automatically classifies traffic according to its "session" information (protocol type, source/destination TCP or UDP port numbers, source/destination IP addresses, the precedence bits of the ToS field, etc.) and tries to provide enough queues so that each flow is placed evenly into its own queue, balancing the delay of every flow on the whole. When dequeuing, WFQ assigns each flow's share of the egress interface bandwidth according to its precedence: the larger the precedence value, the more bandwidth the flow obtains.
For example, if there are five flows on the interface with precedence values 0, 1, 2, 3, and 4, the total bandwidth quota is the sum of (precedence + 1) over all flows, i.e.
1+2+3+4+5=15
The bandwidth proportion for each flow is (precedence + 1)/(total quota), so the flows receive 1/15, 2/15, 3/15, 4/15, and 5/15 of the bandwidth, respectively.
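This share calculation takes only a few lines; the link rate below is an illustrative assumption, and this is the arithmetic only, not a full WFQ implementation:

precedences = [0, 1, 2, 3, 4]
total = sum(p + 1 for p in precedences)             # 1+2+3+4+5 = 15
shares = [(p + 1) / total for p in precedences]     # 1/15 ... 5/15
link_bps = 1_000_000                                # illustrative link rate
print([round(s * link_bps) for s in shares])        # per-flow bandwidth, bps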
Because WFQ can balance the delay and jitter of every flow when congestion occurs, it is applied effectively in certain scenarios. For instance, assured services using RSVP (Resource Reservation Protocol) generally use WFQ as the dispatching policy, and traffic shaping (TS) uses WFQ to dispatch buffered packets.

V. CBQ (Class-Based Queuing)

CBQ is an extension of WFQ that supports user-defined classes. CBQ allocates an independent FIFO reserved queue for each user-defined class to buffer data of that class. In case of network congestion, CBQ matches outgoing packets against the user-defined class rules and places them into the corresponding queues. Before packets enter a queue, the congestion avoidance mechanism (namely, whether tail drop or weighted random early detection, WRED, is used) and the bandwidth restriction are checked. When packets leave their queues, WFQ is performed across the per-class queues.
CBQ provides an emergency queue for urgent packets; it is dispatched FIFO and has no bandwidth restriction. If CBQ performed WFQ across the queues of all classes, delay-sensitive traffic such as voice packets might not be transmitted in time. For this reason PQ is added to CBQ, producing LLQ (Low Latency Queuing), which provides a strict priority (SP) mechanism for delay-sensitive flows such as voice.
LLQ combines the SP mechanism with CBQ. When defining a class, the user can designate it as subject to the SP mechanism; such a class is known as a priority class. All packets of all priority classes enter the same priority queue. Bandwidth restrictions are checked before packets enter the queues. When packets are dequeued, those in the priority queue are transmitted first; the packets in the queues of the other classes follow, dispatched in weighted fair mode.
So that packets in the other queues are not delayed too long, a maximum available bandwidth can be specified for each priority class when using LLQ. This bandwidth value is used to police the class's traffic during congestion: if no congestion occurs, a priority class may use more than its allocated bandwidth, but if congestion occurs, any of the class's packets exceeding the allocation are discarded. LLQ can also specify a burst size.
When matching rules against packets, the system always tries the priority classes first and then the other classes. Multiple priority classes are matched in configuration order, and the same applies to the other classes; multiple rules within a class are likewise matched in configuration order.
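The dispatch order can be sketched as follows. This is a minimal illustration only: the class names, weights, and the weighted rotation standing in for true WFQ are all assumptions, and the bandwidth check is abstracted into a caller-supplied test:

from collections import deque
import itertools

priority_q = deque()                         # the SP class, e.g., voice
class_queues = {"gold": deque(), "silver": deque(), "default": deque()}
weights = {"gold": 3, "silver": 2, "default": 1}
# Crude stand-in for WFQ: serve the classes in a weighted rotation.
rotation = itertools.cycle(
    [cls for cls, w in weights.items() for _ in range(w)])

def llq_dequeue(congested, within_allocation):
    """Serve the priority queue first; under congestion, priority packets
    that exceed the class's allocated bandwidth are discarded."""
    while priority_q:
        pkt = priority_q.popleft()
        if not congested or within_allocation(pkt):
            return pkt
        # else: exceeding priority packet is dropped while congested
    for _ in range(sum(weights.values())):
        cls = next(rotation)
        if class_queues[cls]:
            return class_queues[cls].popleft()
    return None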

VI. RTP (Real-time Transport Protocol) priority queuing

RTP priority queuing is designed to solve the QoS problems of real-time services (including audio and video). Its principle is to place RTP packets carrying audio or video into a high-priority queue and send them first, minimizing delay and jitter and ensuring the quality of these delay-sensitive services.
As shown in the figure above, an RTP packet is sent to a high-priority queue. Here an RTP packet is a UDP packet with an even port number within a user-configured range. RTP priority queuing can be used together with any other queuing mechanism (e.g., FIFO, PQ, CQ, WFQ, or CBQ), and it always has the highest priority. Since the LLQ component of CBQ can also serve real-time traffic, it is recommended not to use RTP priority queuing together with CBQ.

QoS Scheduling in WiMAX Networks

To offer an efficient QoS support to the end user, a WiMAX equipment vendor needs to design and implement a set of protocol components that are left open by the standard. These include traffic policing, traffic shaping, connection admission control (CAC), and packet scheduling.

Due to the highly variable nature of multimedia flows, traffic shaping and traffic policing are required at the SS to ensure efficient and fair utilization of network resources. At connection setup, the application requests network resources according to its characteristics and the required level of service guarantees. A traffic shaper is needed to ensure that the generated traffic actually conforms to the prenegotiated traffic specification. However, traffic shaping alone may not guarantee conformance between the incoming traffic and the service requirements. This is handled by a traffic policer, which compares user data traffic against the QoS attributes of the corresponding service and takes action accordingly, for example by rejecting or penalizing non-conformant flows.

QoS profiles for SSs are usually expressed in terms of Committed Information Rate (CIR) and Maximum Information Rate (MIR) for the various QoS classes. The CIR (defined for nrtPS and rtPS traffic) is the information transfer rate that the WiMAX system is committed to carrying under normal conditions. The MIR (defined for nrtPS and BE QoS types) is the maximum information rate that the system will allow the connection. Both QoS parameters are averaged over a given time interval.

To guarantee that the newly admitted traffic does not result in network overload or service degradation for existing traffic, a (centralized) CAC scheme also has to be provided.

Although all the aforementioned components are necessary to provide an efficient level of QoS support, the core of the task resides in the scheduling algorithm. An efficient scheduling algorithm is the essential condition for providing QoS guarantees, and it plays an essential role in determining network performance. Moreover, the traffic shaper, policer, and CAC mechanisms are tightly coupled with the scheduler employed. The rest of this section is therefore devoted to this issue.

Although scheduling is not specified in the standard, system designers can exploit the rich existing literature on scheduling in wireless ATM, from which WiMAX has inherited many features. While this means designers need not start from scratch, existing schemes must be adapted to the peculiar features (e.g., traffic classes, frame structure) of the IEEE 802.16 standard.

As an example, the IEEE 802.16 scheduling model can be seen as an outcome of the research carried out on hierarchical scheduling. It is rooted in the need to limit MAC exchange overhead by letting the BS handle all the connections of each SS as one aggregated flow. As explained in the previous section, the SSs request bandwidth on a per-connection basis according to the standard; the BS, however, grants bandwidth to each individual SS, so that resources are allocated to the aggregate of active flows at each SS. Each SS is then in charge of distributing the granted bandwidth among its active flows, which it can do efficiently because it has complete knowledge of its queue status. This, however, requires a scheduler at each SS, increasing the complexity (and consequently the cost) of the SS equipment. A detailed operational scheme is depicted in the figure below, outlining the role played by each component and the request/grant mechanism at the basis of WiMAX QoS support.


Figure 1: Graphic representation of hierarchical scheduling.

Schedulers work across multiple connections to ensure the negotiated throughputs, delay bounds, and loss rates. The task of a scheduling algorithm is to select which connection is to be served next, a selection based on the QoS requirements of each connection. An efficient scheduling algorithm must be provided at the BS to guarantee proper performance. To better explain the scheduler's role, let us first assume that the BS performs the scheduling functions on a per-connection basis. To schedule packets correctly, information such as the number of pending connections, their reserved throughputs, and the status of the session queues is needed. While this information is easily accessible for downlink connections, the SSs must send their bandwidth requests and queue status to the BS for the uplink. This has a twofold effect: on the one hand it increases signalling overhead, while on the other it provides the BS with information that may not be up to date (e.g., due to contention delays, and so on).

In the downlink, the scheduler has complete knowledge of the queue status and thus may use classical scheduling schemes such as weighted round robin (WRR) or weighted fair queueing (WFQ). Priority-oriented fairness features are also important in providing differentiated services in WiMAX networks: through priority, different traffic flows can be treated almost as if isolated while sharing the same radio resource. However, due to the nature of WiMAX TDD systems, the BS scheduler is non-work-conserving, as the output link can be idle even while packets wait in some queues. Indeed, after the downlink flows are served in their devoted subframe, no additional downlink flow can be served until the end of the subsequent uplink subframe.

Scheduling uplink flows is more complex because the input queues are located at the SSs and are hence separated from the BS. Uplink connections work on a request/grant basis: using bandwidth requests, the uplink packet scheduler can retrieve the status of the queues and the bandwidth parameters. The literature is not rich in QoS scheduling schemes specifically designed for WiMAX networks; in the following, we briefly describe the most relevant works addressing this topic, to the best of the authors' knowledge.

One proposed architecture divides the scheduling process into two parts. The first, executed by the uplink scheduler inside the BS, grants resources to the SSs in response to bandwidth requests by means of a classical WRR. At each SS, bandwidth assignments are then computed starting from the highest-priority class (i.e., UGS flows) and going down through rtPS, nrtPS, and BE, so that strict priority among service classes is guaranteed. The scheduling schemes employed for the various classes differ: classical WFQ is used for UGS and rtPS, a simpler WRR for the nrtPS service class, and BE traffic is served by a simple FIFO policy. By means of this prioritized approach (which somewhat resembles multiclass priority fair queueing), the proposed architecture guarantees a good performance level to the UGS and rtPS classes, to the detriment of lower-priority traffic (i.e., nrtPS and BE flows).
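By way of illustration, a generic packet-based WRR can be sketched in a few lines. The class weights are assumptions, and in the architecture above the BS-side version would operate on bandwidth requests and grants rather than directly on packets:

from collections import deque

queues = {"UGS": deque(), "rtPS": deque(), "nrtPS": deque(), "BE": deque()}
weights = {"UGS": 4, "rtPS": 3, "nrtPS": 2, "BE": 1}

def wrr_round(send):
    """One WRR round: each class may send up to `weight` packets per round."""
    for cls, weight in weights.items():
        for _ in range(weight):
            if not queues[cls]:
                break
            send(queues[cls].popleft())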

Another work evaluates, via simulation, the performance of an IEEE 802.16 system using the class of latency-rate scheduling algorithms, in which a minimum reserved rate is the basic QoS parameter negotiated by a connection within a scheduling service. Specifically, within this class, the authors selected deficit round robin (DRR) as the downlink scheduler implemented in the BS, as it combines the ability to provide fair queueing in the presence of variable-length packets with simplicity of implementation. DRR requires a minimum rate to be reserved for each packet flow being scheduled; therefore, although not required by the IEEE 802.16 standard, BE connections should also be guaranteed a minimum rate. This can be exploited both to avoid BE traffic starvation in overloaded scenarios and to let BE traffic take advantage of excess bandwidth not reserved for the other scheduling services. On the other hand, DRR assumes that the size of the head-of-line packet in each queue is known, so it cannot be used by the BS to schedule transmissions in the uplink direction: for the uplink, the BS can only estimate the overall backlog of each connection, not the size of each backlogged packet. The authors therefore selected WRR as the uplink scheduler; like DRR, WRR belongs to the class of latency-rate scheduling algorithms. Finally, DRR is implemented in the SS scheduler, because the SS knows the sizes of the head-of-line packets in its queues.
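Because DRR's reliance on head-of-line packet sizes is central to the argument above, a minimal sketch may help; the flow names and quantum are illustrative:

from collections import deque

class DRRScheduler:
    """Deficit round robin over per-flow FIFO queues of variable-length packets."""

    def __init__(self, flows, quantum):
        self.queues = {f: deque() for f in flows}
        self.quantum = quantum                   # bytes of credit added per round
        self.deficit = {f: 0 for f in flows}

    def dispatch_round(self, send):
        for flow, q in self.queues.items():
            if not q:
                self.deficit[flow] = 0           # idle flows accrue no credit
                continue
            self.deficit[flow] += self.quantum
            # Send while the head-of-line packet fits the accumulated deficit;
            # this is why DRR must know the head-of-line packet's size.
            while q and len(q[0]) <= self.deficit[flow]:
                pkt = q.popleft()
                self.deficit[flow] -= len(pkt)
                send(pkt)

drr = DRRScheduler(["flow1", "flow2"], quantum=1500)
drr.queues["flow1"].append(b"\x00" * 1200)
drr.dispatch_round(lambda pkt: print(len(pkt), "bytes sent"))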

Thursday, December 11, 2014

Traffic Policing versus Shaping - Part 1

Traffic Policing versus Shaping

There are two methods to limit the amount of traffic originating from an interface: policing and shaping. When an interface is policed outbound, traffic exceeding the configured threshold is dropped (or remarked to a lower class of service). Shaping, on the other hand, buffers excess (burst) traffic to transmit during non-burst periods. Shaping has the potential to make more efficient use of bandwidth at the cost of additional overhead on the router.
All this is just dandy, but doesn't mean much until you see its effects on real traffic. Consider the following lab topology:
Figure: Lab topology for the Iperf tests (iperf_setup.png)
We'll be using Iperf on the client (192.168.10.2) to generate TCP traffic to the server (192.168.20.2). In the middle is R1, a Cisco 3725. Its F0/1 interface will be configured for policing or shaping outbound to the server.

Iperf

Iperf is able to test the bandwidth available across a link by generating TCP or UDP streams and benchmarking the throughput of each. To illustrate the effects of policing and shaping, we'll configure Iperf to generate four TCP streams, which we can monitor individually. To get a feel for how Iperf works, let's do a dry run before applying any QoS policies. Below is the output from Iperf on the client end after running unrestricted across a 100 Mbit link:
Client$ iperf -c 192.168.20.2 -t 30 -P 4
------------------------------------------------------------
Client connecting to 192.168.20.2, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[1916] local 192.168.10.2 port 1908 connected with 192.168.20.2 port 5001
[1900] local 192.168.10.2 port 1909 connected with 192.168.20.2 port 5001
[1884] local 192.168.10.2 port 1910 connected with 192.168.20.2 port 5001
[1868] local 192.168.10.2 port 1911 connected with 192.168.20.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[1900]  0.0-30.0 sec  84.6 MBytes  23.6 Mbits/sec
[1884]  0.0-30.0 sec  84.6 MBytes  23.6 Mbits/sec
[1868]  0.0-30.0 sec  84.6 MBytes  23.6 Mbits/sec
[1916]  0.0-30.0 sec  84.6 MBytes  23.6 Mbits/sec
[SUM]  0.0-30.0 sec   338 MBytes  94.5 Mbits/sec
Iperf is run with several options:
  • -c - Runs in client mode, connecting to the server at the specified IP address
  • -t - The time to run, in seconds
  • -P - The number of parallel connections to establish
We can see that Iperf is able to effectively saturate the link at around 95 Mbps, with each stream consuming a roughly equal share of the available bandwidth.

Policing

Our first test will measure the throughput from the client to the server when R1 has been configured to police traffic to 1 Mbit. To do this we'll need to create the appropriate QoS policy and apply it outbound to F0/1:
policy-map Police
 class class-default
  police cir 1000000
!
interface FastEthernet0/1
 service-policy output Police
We can then inspect our applied policy with show policy-map interface. F0/1 is being policed to 1 Mbit with a 31250-byte burst:
R1# show policy-map interface
 FastEthernet0/1

  Service-policy output: Police

    Class-map: class-default (match-any)
      2070 packets, 2998927 bytes
      5 minute offered rate 83000 bps, drop rate 0 bps
      Match: any
      police:
          cir 1000000 bps, bc 31250 bytes
        conformed 1394 packets, 1992832 bytes; actions:
          transmit
        exceeded 673 packets, 1005594 bytes; actions:
          drop
        conformed 57000 bps, exceed 30000 bps
Repeating the same Iperf test now yields very different results:
Client$ iperf -c 192.168.20.2 -t 30 -P 4
------------------------------------------------------------
Client connecting to 192.168.20.2, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[1916] local 192.168.10.2 port 1922 connected with 192.168.20.2 port 5001
[1900] local 192.168.10.2 port 1923 connected with 192.168.20.2 port 5001
[1884] local 192.168.10.2 port 1924 connected with 192.168.20.2 port 5001
[1868] local 192.168.10.2 port 1925 connected with 192.168.20.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[1884]  0.0-30.2 sec   520 KBytes   141 Kbits/sec
[1916]  0.0-30.6 sec  1.13 MBytes   311 Kbits/sec
[1900]  0.0-30.5 sec   536 KBytes   144 Kbits/sec
[1868]  0.0-30.5 sec   920 KBytes   247 Kbits/sec
[SUM]  0.0-30.6 sec  3.06 MBytes   841 Kbits/sec
Notice that although we've allowed for up to 1 Mbps of traffic, Iperf has only achieved 841 Kbps. Also notice that, unlike our prior test, each flow does not receive an equal proportion of the available bandwidth. This is because policing (as configured) does not recognize individual flows; it merely drops packets whenever they threaten to exceed the configured threshold.
Using Wireshark's IO graphing feature on a capture obtained at the server, we can observe the apparently random nature of the flows. The black line measures the aggregate throughput, and the colored lines each represent an individual TCP flow.
Figure: IO graph of the policed transfer (policed_1Mbit.png)

Shaping

In contrast to policing, we'll see that shaping handles traffic in a very organized, predictable manner. First we'll need to configure a QoS policy on R1 to shape traffic to 1 Mbit. When applying the Shape policy outbound on F0/1, be sure to remove the Police policy first with no service-policy output Police.
policy-map Shape
 class class-default
  shape average 1000000
!
interface FastEthernet0/1
 service-policy output Shape
Immediately after starting our Iperf test a third time we can see that shaping is taking place:
R1# show policy-map interface
 FastEthernet0/1

  Service-policy output: Shape

    Class-map: class-default (match-any)
      783 packets, 1050468 bytes
      5 minute offered rate 0 bps, drop rate 0 bps
      Match: any
      Traffic Shaping
           Target/Average   Byte   Sustain   Excess    Interval  Increment
             Rate           Limit  bits/int  bits/int  (ms)      (bytes)
          1000000/1000000   6250   25000     25000     25        3125

        Adapt  Queue     Packets   Bytes     Packets   Bytes     Shaping
        Active Depth                         Delayed   Delayed   Active
        -      69        554       715690    491       708722    yes
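These figures follow the token bucket relationship Tc = Bc/CIR: 25,000 bits / 1,000,000 bps gives the 25 ms interval, and the 3125-byte increment is simply 25,000 bits divided by 8 bits per byte.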
This last test concludes with very consistent results:
Client$ iperf -c 192.168.20.2 -t 30 -P 4
------------------------------------------------------------
Client connecting to 192.168.20.2, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[1916] local 192.168.10.2 port 1931 connected with 192.168.20.2 port 5001
[1900] local 192.168.10.2 port 1932 connected with 192.168.20.2 port 5001
[1884] local 192.168.10.2 port 1933 connected with 192.168.20.2 port 5001
[1868] local 192.168.10.2 port 1934 connected with 192.168.20.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[1916]  0.0-30.4 sec   896 KBytes   242 Kbits/sec
[1868]  0.0-30.5 sec   896 KBytes   241 Kbits/sec
[1884]  0.0-30.5 sec   896 KBytes   241 Kbits/sec
[1900]  0.0-30.5 sec   896 KBytes   241 Kbits/sec
[SUM]  0.0-30.5 sec  3.50 MBytes   962 Kbits/sec
With shaping applied, Iperf is able to squeeze 962 Kbps out of the 1 Mbps link, a 14% gain over policing. Keep in mind, however, that the gain measured here is incidental and highly dependent on conditions; real-world results will vary. Also notice that each stream receives a fair share of bandwidth. This even distribution is best illustrated graphically through an IO graph of a second capture:
Figure: IO graph of the shaped transfer (shaped_1Mbit.png)