Powered By Blogger

Thursday, June 28, 2018

Linux Scheduler Evolution

The Linux scheduler is a priority based scheduler that schedules tasks based upon their static and dynamic priorities. When these priorities are combined they form a task's goodness . Each time the Linux scheduler runs, every task on the run queue is examined and its goodness value is computed. The task with the highest goodness is chosen to run next.
When there are cpu bound tasks running in the system, the Linux scheduler may not be called for intervals of up to .40 seconds. This means that the currently running task has the CPU to itself for periods of up to .40 seconds (how long depends upon the task's priority and whether it blocks or not). This is good for throughput because there are few computationally uneccessary context switches. However it can kill interactivity because Linux only reschedules when a task blocks or when the task's dynamic priority (counter) reaches zero. Thus under Linux's default priority based scheduling method, long scheduling latencies can occur.
Looking at the scheduling latency in finer detail, the Linux scheduler makes use of a timer that interrupts every 10 msec. This timer erodes the currently running task's dynamic priority (decrements its counter). A task's counter starts out at the same value its priority contains. Once its dynamic priority (counter) has eroded to 0 it is again reset to that of its static priority (priority). It is only after the counter reaches 0 that a call to schedule() is made. Thus a task with the default priority of 20 may run for .200 secs (200 msecs) before any other task in the system gets a chance to run. A task at priority 40 (the highest priority allowed) can run for .400 secs without any scheduling occurring as long as it doesn't block or yield.
Linux scheduler has been gone through some big improvements since kernel version 2.4. There were a lot of complaints about the interactivity of the scheduler in kernel 2.4. During this version, the scheduler was implemented with one running queue for all available processors. At every scheduling, this queue was locked and every task on this queue got its timeslice update. This implementation caused poor performance in all aspects. The scheduler algorithm and supporting code went through a large rewrite early in the 2.5 kernel development series. The new scheduler was arisen to achieveO(1 ) run-time regardless number of runnable tasks in the system. To achieve this, each processor has its own running queue. This helps a lot in reducing lock contention. The priority array was introduced which used active array and expired array to keep track running tasks in the system. TheO(1 ) running time is primarily drawn from this new data structure. The scheduler puts all expired processes into expired array. When there is no active process available in active array, it swaps active array with expired array, which makes active array becomes expired array and expired array becomes active array. There were some twists made into this scheduler to optimize further by putting expired task back to active array instead of expired array in some cases.O(1 ) scheduler uses a heuristic calculation to update dynamic priority of tasks based on their interactivity (I/O bound versus CPU bound) The industry was happy with this new scheduler until Con Kolivas introduced his new scheduler named Rotating Staircase Deadline (RSDL) and then later Staircase Deadline (SD). His new schedulers proved the fact that fair scheduling among processes can be achieved without any complex computation. His scheduler was designed to run inO(n ) but its performance exceeded the currentO(1 ) scheduler.
The result achieved from SD scheduler surprised all kernel developers and designers. The fair scheduling approach in SD scheduler encouraged Igno Molnar to re-implement the new Linux scheduler named Completely Fair Scheduler (CFS). CFS scheduler was a big improvement over the existing scheduler not only in its performance and interactivity but also in simplifying the scheduling logic and putting more modularized code into the scheduler. CFS scheduler was merged into mainline version 2.6.23. Since then, there have been some minor improvements made to CFS scheduler in some areas such as optimization, load balancing and group scheduling feature.


  • An O(n) scheduler - Goes through the entire “ global runqueue” to determine the next task to be run. This is an O(n) algorithm where 'n' is the number of processes. The time taken was proportional to the number of active processes in the system

  • A Global runqueue - All CPUs had to wait for other CPUs to finish execution. A Global runqueue for all processors in a symmetric multiprocessing system (SMP). This meant a task could be scheduled on any processor -- which can be good for load balancing but bad for memory caches. For example, suppose a task executed on CPU-1, and its data was in that processor's cache. If the task got rescheduled to CPU-2, its data would need to be invalidated in CPU-1 and brought into CPU-2 .

  • This lead to large performance hits during heavy workload

Kernel 2.4 Scheduler Policies:
  • SCHED_FIFO - A First-In, First-Out real-time process. When the scheduler assigns the CPU to the process, it leaves the process descriptor in its current position in the runqueue list. If no other higher-priority realtime process is runnable, the process will continue to use the CPU as long as it wishes, even if other real-time processes having the same priority are runnable

  • SCHED_RR - A Round Robin real-time process. When the scheduler assigns the CPU to the process, it puts the process descriptor at the end of the runqueue list. This policy ensures a fair assignment of CPU time to all SCHED_RR real-time processes that have the same priority

  • SCHED_OTHER - A conventional, time-shared process. The policy field also encodes a SCHED_YIELD binary flag. This flag is set when the process invokes the sched_ yield( ) system call (a way of voluntarily relinquishing the processor without the need to start an I/O operation or go to sleep. The scheduler puts the process descriptor at the bottom of the runqueue list.

O(1) Algorithm ( Constant time algorithm )
  • Choose the task on the highest priority list to execute
  • To make this process more efficient, a bitmap is used to define when tasks are on a given priority list
  • On most architectures, a find-first-bit-set instruction is used to find the highest priority bit set in one of five 32-bit words (for the 140 priorities
  • The time it takes to find a task to execute depends not on the number of active tasks but instead on the number of priorities
  • This makes the 2.6 scheduler an O(1) process because the time to schedule is both fixed and deterministic regardless of the number of active tasks


  • The 2.6 scheduler was designed and implemented by Ingo Molnar. His motivation in working on the new scheduler was to create a completely O(1) scheduler for wakeup, context-switch, and timer interrupt overhead
  • One of the issues that triggered the need for a new scheduler was the use of Java virtual machines (JVMs). The Java programming model uses many threads of execution, which results in lots of overhead for scheduling in an O(n) scheduler
  • Each CPU has a runqueue made up of 140 priority lists that are serviced in FIFO order. Tasks that are scheduled to execute are added to the end of their respective runqueue's priority list
  • Each task has a time slice that determines how much time it's permitted to execute
  • The first 100 priority lists of the runqueue are reserved for real-time tasks, and the last 40 are used for user tasks (MAX_RT_PRIO=100 and MAX_PRIO=140)
  • In addition to the CPU's runqueue, which is called the active runqueue, there's also an expired runqueue
  • When a task on the active runqueue uses all of its time slice, it's moved to the expired runqueue. During the move, its time slice is recalculated (and so is its priority)
  • If no tasks exist on the active runqueue for a given priority, the pointers for the active and expired runqueues are swapped, thus making the expired priority list the active one
Scheduler

Kernel 2.6 Scheduler Policies:
  • SCHED_NORMAL - A conventional, time-shared process (used to be called SCHED_OTHER), for normal tasks
  1. Each task assigned a “Nice” value
  2. PRIO = MAX_RT_PRIO + NICE + 20
  3. Assigned a time slice
  4. Tasks at the same prio(rity) are round-robined
  5. Ensures Priority + Fairness

  • SCHED_FIFO - A First-In, First-Out real-time process
  1. Run until they relinquish the CPU voluntarily
  2. Priority levels maintained
  3. Not pre-empted !!

  • SCHED_RR - A Round Robin real-time process
  1. Assigned a timeslice and run till the timeslice is exhausted.
  2. Once all RR tasks of a given prio(rity) level exhaust their timeslices, their timeslices are refilled and they continue running
  3. Prio(rity) levels are maintained

  • SCHED_BATCH - for "batch" style execution of processes
  1. For computing-intensive tasks
  2. Timeslices are long and processes are round robin scheduled
  3. lowest priority tasks are batch-processed (nice +19)

  • SCHED_IDLE - for running very low priority background job
  1. nice value has no influence for this policy
  2. extremely low priority (lower than +19 nice)

Completely Fair Scheduler (CFS)
  • The main idea behind the CFS is to maintain balance (fairness) in providing processor time to tasks. This means processes should be given a fair amount of the processor. When the time for tasks is out of balance (meaning that one or more tasks are not given a fair amount of time relative to others), then those out-of-balance tasks should be given time to execute.
    To determine the balance, the CFS maintains the amount of time provided to a given task in what's called the virtual runtime. The smaller a task's virtual runtime—meaning the smaller amount of time a task has been permitted access to the processor—the higher its need for the processor. The CFS also includes the concept of sleeper fairness to ensure that tasks that are not currently runnable (for example, waiting for I/O) receive a comparable share of the processor when they eventually need it.
    But rather than maintain the tasks in a run queue, as has been done in prior Linux schedulers, the CFS maintains a time-ordered red-black tree (see Figure below). A red-black tree is a tree with a couple of interesting and useful properties. First, it's self-balancing, which means that no path in the tree will ever be more than twice as long as any other. Second, operations on the tree occur in O(log n) time (where n is the number of nodes in the tree). This means that you can insert or delete a task quickly and efficiently.
cfs

Wednesday, June 6, 2018

Static and dynamic libraries

Linux Library Types:

There are two Linux C/C++ library types which can be created:
  1. Static libraries (.a): Library of object code which is linked with, and becomes part of the application.
  2. Dynamically linked shared object libraries (.so): There is only one form of this library but it can be used in two ways.
    1. Dynamically linked at run time. The libraries must be available during compile/link phase. The shared objects are not included into the executable component but are tied to the execution.
    2. Dynamically loaded/unloaded and linked during execution (i.e. browser plug-in) using the dynamic linking loader system functions.

Library naming conventions:

Libraries are typically named with the prefix "lib". This is true for all the C standard libraries. When linking, the command line reference to the library will not contain the library prefix or suffix.

Consider the following compile and link command: gcc src-file.c -lm -lpthread 

The libraries referenced in this example for inclusion during linking are the math library ("m") and the thread library ("pthread"). They are found in /usr/lib/libm.a and /usr/lib/libpthread.a.

Static Libraries: (.a)
How to generate a static library (object code archive file):
  • Compile: cc -Wall -c ctest1.c ctest2.c 
    Compiler options:
    • -Wall: include warnings. See man page for warnings specified.
  • Create library "libctest.a": ar -cvq libctest.a ctest1.o ctest2.o
  • List files in library: ar -t libctest.a
  • Linking with the library:
    • cc -o executable-name prog.c libctest.a
    • cc -o executable-name prog.c -L/path/to/library-directory -lctest

Dynamically Linked "Shared Object" Libraries: (.so)

How to generate a shared object: (Dynamically linked object library file.) Note that this is a two step process.
  1. Create object code
  2. Create library
  3. Optional: create default version using a symbolic link.
Library creation example:

    gcc -Wall -fPIC -c *.c
    gcc -shared -Wl,-soname,libctest.so.1 -o libctest.so.1.0   *.o
    mv libctest.so.1.0 /opt/lib
    ln -sf /opt/lib/libctest.so.1.0 /opt/lib/libctest.so.1
    ln -sf /opt/lib/libctest.so.1.0 /opt/lib/libctest.soCompiler 
  • -Wall: include warnings. See man page for warnings specified.
  • -fPIC: Compiler directive to output position independent code, a characteristic required by shared libraries. Also see "-fpic".
  • -shared: Produce a shared object which can then be linked with other objects to form an executable.
  • -Wl,options: Pass options to linker. 
    In this example the options to be passed on to the linker are: "-soname libctest.so.1". The name passed with the "-o" option is passed to gcc.
  • Option -o: Output of operation. In this case the name of the shared object to be output will be "libctest.so.1.0"
Library Links:
  • The link to /opt/lib/libctest.so allows the naming convention for the compile flag -lctest to work.
  • The link to /opt/lib/libctest.so.1 allows the run time binding to work. See dependency below.
Compile main program and link with shared object library:

gcc -Wall -L/opt/lib prog.c -lctest -o prog
Library Path:
In order for an executable to find the required libraries to link with during run time, one must configure the system so that the libraries can be found. Methods available: (Do at least one of the following)
  1. Add library directories to be included during dynamic linking to the file /etc/ld.so.conf
    Sample: /etc/ld.so.conf
    /usr/X11R6/lib
    /usr/lib
    ...
    ..
    /usr/lib/sane
    /usr/lib/mysql
    /opt/lib
              
    Add the library path to this file and then execute the command (as root) ldconfig to configure the linker run-time bindings. 
    You can use the "-f file-name" flag to reference another configuration file if you are developing for different environments. 
    See man page for command ldconfig.
    OR
  2. Add specified directory to library cache: (as root) 
    ldconfig -n /opt/lib 
    Where /opt/lib is the directory containing your library libctest.so 
    (When developing and just adding your current directory: ldconfig -n . Link with -L.)
    This will NOT permanently configure the system to include this directory. The information will be lost upon system reboot.
    OR
  3. Specify the environment variable LD_LIBRARY_PATH to point to the directory paths containing the shared object library. This will specify to the run time loader that the library paths will be used during execution to resolve dependencies. 
    (Linux/Solaris: LD_LIBRARY_PATH, SGI: LD_LIBRARYN32_PATH, AIX: LIBPATH, Mac OS X: DYLD_LIBRARY_PATH, HP-UX: SHLIB_PATH)
    Example (bash shell): export LD_LIBRARY_PATH=/opt/lib:$LD_LIBRARY_PATH or add to your ~/.bashrc file:
    ...
    if [ -d /opt/lib ];
    then
       LD_LIBRARY_PATH=/opt/lib:$LD_LIBRARY_PATH
    fi
    
    ...
    
    export LD_LIBRARY_PATH
          

    This instructs the run time loader to look in the path described by the environment variable LD_LIBRARY_PATH, to resolve shared libraries. This will include the path /opt/lib.
Library paths used should conform to the "Linux Standard Base" directory structure.

Dynamic loading and un-loading of shared libraries using libdl:
01#include <stdio.h>
02#include <dlfcn.h>
03#include "ctest.h"
04 
05int main(int argc, char **argv)
06{
07   void *lib_handle;
08   double (*fn)(int *);
09   int x;
10   char *error;
11 
12   lib_handle = dlopen("/opt/lib/libctest.so", RTLD_LAZY);
13   if (!lib_handle)
14   {
15      fprintf(stderr, "%s\n", dlerror());
16      exit(1);
17   }
18 
19   fn = dlsym(lib_handle, "ctest1");
20   if ((error = dlerror()) != NULL) 
21   {
22      fprintf(stderr, "%s\n", error);
23      exit(1);
24   }
25 
26   (*fn)(&x);
27   printf("Valx=%d\n",x);
28 
29   dlclose(lib_handle);
30   return 0;
31}

gcc -rdynamic -o progdl progdl.c -ldl
Explanation:
  • dlopen("/opt/lib/libctest.so", RTLD_LAZY); 
    Open shared library named "libctest.so". 
    The second argument indicates the binding. See include file dlfcn.h
    Returns NULL if it fails. 
    Options:
    • RTLD_LAZY: If specified, Linux is not concerned about unresolved symbols until they are referenced.
    • RTLD_NOW: All unresolved symbols resolved when dlopen() is called.
    • RTLD_GLOBAL: Make symbol libraries visible.
  • dlsym(lib_handle, "ctest1"); 
    Returns address to the function which has been loaded with the shared library.. 
    Returns NULL if it fails. 
    Note: When using C++ functions, first use nm to find the "mangled" symbol name or use the extern "C" construct to avoid name mangling. 
    i.e. extern "C" void function-name();