From: Mikael Pettersson <mikpe@csd.uu.se>

This patch updates perfctr's documentation:
- adds new Implementation Notes section to the x86 documentation
- some minor fixes in the x86 documentation
- adds new documentation on the per-process perfctrs
- adds new overview documentation

Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/Documentation/perfctr/low-level-x86.txt |  122 +++++++-
 25-akpm/Documentation/perfctr/overview.txt      |  129 ++++++++
 25-akpm/Documentation/perfctr/virtual.txt       |  355 ++++++++++++++++++++++++
 3 files changed, 602 insertions(+), 4 deletions(-)

diff -puN Documentation/perfctr/low-level-x86.txt~perfctr-documentation-update Documentation/perfctr/low-level-x86.txt
--- 25/Documentation/perfctr/low-level-x86.txt~perfctr-documentation-update	2004-07-12 00:04:07.771309056 -0700
+++ 25-akpm/Documentation/perfctr/low-level-x86.txt	2004-07-12 00:04:07.776308296 -0700
@@ -8,6 +8,13 @@ This document only describes x86-specifi
 For detailed hardware control register layouts, see
 the manufacturers' documentation.
 
+Contents
+========
+- Supported processors
+- Contents of <asm-i386/perfctr.h>
+- Processor-specific Notes
+- Implementation Notes
+
 Supported processors
 ====================
 - Intel P5, P5MMX, P6, P4.
@@ -17,8 +24,8 @@ Supported processors
 - VIA C3. (bad P6 clone)
 - Any generic x86 with a TSC.
 
-Contents of <asm-$ARCH/perfctr.h>
-=================================
+Contents of <asm-i386/perfctr.h>
+================================
 
 "struct perfctr_sum_ctrs"
 -------------------------
@@ -116,7 +123,7 @@ On HT P4s, resource conflicts can occur 
 (T0 and T1) in a processor share the same perfctr registers.
 To prevent conflicts, only thread 0 in each processor is allowed
 to access the counters. perfctr_cpus_forbidden_mask contains the
-smp_processor_id()'s of each processor's thread 1, and it is the
+smp_processor_id()s of each processor's thread 1, and it is the
 responsibility of the high-level driver to ensure that it never
 accesses the perfctr state from a forbidden thread.
 
@@ -155,7 +162,7 @@ The hardware stores both counters' contr
 control register, the CESR MSR. The evntsel values are
 limited to 16 bits each, and are combined by the low-level
 driver to form the value for the CESR. Apart from that,
-the evntsel values direct images of the CESR.
+the evntsel values are direct images of the CESR.
 
 Bits 0xFE00 in an evntsel value are reserved.
 At least one evntsel CPL bit (0x00C0) must be set.
@@ -244,3 +251,110 @@ If p4.pebs_enable is non-zero:
 - at least one of bits 10, 9, 2, 1, or 0 must be set
 - in p4.pebs_matrix_vert, all bits except 1 and 0 must be clear,
   and at least one of bits 1 and 0 must be set
+
+Implementation Notes
+====================
+
+Caching
+-------
+Each 'struct perfctr_cpu_state' contains two cache-related fields:
+- 'id': a unique identifier for the control data contents
+- 'isuspend_cpu': the identity of the CPU on which a state containing
+  interrupt-mode counters was last suspended
+
+To this the driver adds a per-CPU cache, recording:
+- the 'id' of the control data currently in that CPU
+- the current contents of each control register
+
+When perfctr_cpu_update_control() has validated the new control data,
+it also updates the id field.
+
+The driver's internal 'write_control' function, called from the
+perfctr_cpu_resume() API function, first checks if the state's id
+matches that of the CPU's cache, and if so, returns. Otherwise
+it checks each control register in the state and updates those
+that do not match the cache. Finally, it writes the state's id
+to the cache. Tests on various x86 processor types have shown that
+MSR writes are very expensive: the purpose of these cache checks
+is to avoid MSR writes whenever possible.
+
+Unlike accumulation-mode counters, interrupt-mode counters must be
+physically stopped when suspended, primilarly to avoid overflow
+interrupts in contexts not expecting them, and secondarily to avoid
+increments to the counters themselves (see below).
+
+When suspending interrupt-mode counters, the driver:
+- records the CPU identity in the per-CPU cache
+- stops each interrupt-mode counter by disabling its control register
+- lets the cache and state id values remain the same
+
+Later, when resuming interrupt-mode counters, the driver:
+- if the state and cache id values match:
+  * the cache id is cleared, to force a reload of the control
+    registers stopped at suspend (see below)
+  * if the state's "suspend" CPU identity matches the current CPU,
+    the counter registers are still valid, and the procedure returns
+- if the procedure did not return above, it then loops over each
+  interrupt-mode counter:
+  * the counter's control register is physically disabled, unless
+    the cache indicates that it already is disabled; this is necessary
+    to prevent premature events and overflow interrupts if the CPU's
+    registers previously belonged to some other state
+  * then the counter register itself is restored
+After this interrupt-mode specific resume code is complete, the
+driver continues by calling 'write_control' as described above.
+The state and cache ids will not match, forcing write_control to
+reload the disabled interrupt-mode control registers.
+
+Call-site Backpatching
+----------------------
+The x86 family of processors is quite diverse in how their
+performance counters work and are accessed. There are three
+main designs (P5, P6, and P4) with several variations.
+To handle this the processor type detection and initialisation
+code sets up a number of function pointers to point to the
+correct procedures for the actual CPU type.
+
+Calls via function pointers are more expensive than direct calls,
+so the driver actually performs direct calls to wrappers that
+backpatch the original call sites to instead call the actual
+CPU-specific functions in the future.
+
+Unsynchronised code backpatching in SMP systems doesn't work
+on Intel P6 processors due to an erratum, so the driver performs
+a "finalise backpatching" step after the CPU-specific function
+pointers have been set up. This step invokes the API procedures
+on a temporary state object, set up to force every backpatchable
+call site to be invoked and adjusted.
+
+Several low-level API procedures are called in the context-switch
+path by the per-process perfctrs kernel extension, which motivates
+the efforts to reduce runtime overheads as much as possible.
+
+Overflow Interrupts
+-------------------
+The x86 hardware enables overflow interrupts via the local
+APIC's LVTPC entry, which is only present in P6/K7/K8/P4.
+
+The low-level driver supports overflow interrupts as follows:
+- It reserves a local APIC vector, 0xee, as LOCAL_PERFCTR_VECTOR.
+- It adds a local APIC exception handler to entry.S, which
+  invokes the driver's smp_perfctr_interrupt() procedure.
+- It adds code to i8259.c to bind the LOCAL_PERFCTR_VECTOR
+  interrupt gate to the exception handler in entry.S.
+- During processor type detection, it records whether the
+  processor supports the local APIC, and sets up function pointers
+  for the suspend and resume operations on interrupt-mode counters.
+- When the low-level driver is activated, it enables overflow
+  interrupts by writing LOCAL_PERFCTR_VECTOR to each CPU's APIC_LVTPC.
+- Overflow interrupts now end up in smp_perfctr_interrupt(), which
+  ACKs the interrupt and invokes the interrupt handler installed
+  by the high-level service/driver.
+- When the low-level driver is deactivated, it disables overflow
+  interrupts by masking APIC_LVTPC in each CPU. It then releases
+  the local APIC back to the NMI watchdog.
+
+At compile-time, the low-level driver indicates overflow interrupt
+support by enabling CONFIG_PERFCTR_INTERRUPT_SUPPORT. If the feature
+is also available at runtime, it sets the PERFCTR_FEATURE_PCINT flag
+in the perfctr_info object.
diff -puN /dev/null Documentation/perfctr/overview.txt
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ 25-akpm/Documentation/perfctr/overview.txt	2004-07-12 00:04:07.777308144 -0700
@@ -0,0 +1,129 @@
+$Id: perfctr-documentation-update.patch,v 1.1 2004/07/12 05:41:57 akpm Exp $
+
+AN OVERVIEW OF PERFCTR
+======================
+The perfctr package adds support to the Linux kernel for using
+the performance-monitoring counters found in many processors.
+
+Perfctr is internally organised in three layers:
+
+- The low-level drivers, one for each supported architecture.
+  Currently there are two, one for 32 and 64-bit x86 processors,
+  and one for 32-bit PowerPC processors.
+
+  low-level-api.txt documents the model of the performance counters
+  used in this package, and the internal API to the low-level drivers.
+
+  low-level-{x86,ppc}.txt provide documentation specific for those
+  architectures and their low-level drivers.
+
+- The high-level services.
+  There is currently one, a kernel extension adding support for
+  virtualised per-process performance counters.
+  See virtual.txt for documentation on this kernel extension.
+
+  [There used to be a second high-level service, a simple driver
+  to control and access all performance counters in all processors.
+  This driver is currently removed, pending an acceptable new API.]
+
+- The top-level, which performs initialisation and implements
+  common procedures and system calls.
+
+Rationale
+---------
+The perfctr package solves three problems:
+
+- Hardware invariably restricts programming of the performance
+  counter registers to kernel-level code, and sometimes also
+  restricts reading the counters to kernel-level code.
+
+  Perfctr adds APIs allowing user-space code access the counters.
+  In the case of the per-process counters kernel extension,
+  even non-privileged processes are allowed access.
+
+- Hardware often limits the precision of the hardware counters,
+  making them unsuitable for storing total event counts.
+
+  The counts are instead maintained as 64-bit values in software,
+  with the hardware counters used to derive increments over given
+  time periods.
+
+- In a non-modified kernel, the thread state does not include the
+  performance monitoring counters, and the context switch code
+  does not save and restore them. In this situation the counters
+  are system-wide, making them unreliable and inaccurate when used
+  for monitoring specific processes or specific segments of code.
+
+  The per-process counters kernel extension treats the counter state as
+  part of the thread state, solving the reliability and accuracy problems.
+
+Non-goals
+---------
+Providing high-level interfaces that abstract and hide the
+underlying hardware is a non-goal. Such abstractions can
+and should be implemented in user-space, for several reasons:
+
+- The complexity and variability of the hardware means that
+  any abstraction would be inaccurate. There would be both
+  loss of functionality, and presence of functionality which
+  isn't supportable on any given processor. User-space tools
+  and libraries can implement this, on top of the processor-
+  specific interfaces provided by the kernel.
+
+- The implementation of such an abstraction would be large
+  and complex. (Consider ESCR register assignment on P4.)
+  Performing complex actions in user-space simplifies the
+  kernel, allowing it to concentrate on validating control
+  data, managing processes, and driving the hardware.
+  (C.f. the role of compilers.)
+
+- The abstraction is purely a user-convenience thing. The
+  kernel-level components have no need for it.
+
+Common System Calls
+===================
+This lists those system calls that are not tied to
+a specific high-level service/driver.
+
+Querying CPU and Driver Information
+-----------------------------------
+int err = sys_perfctr_info(struct perfctr_info *info,
+			   struct perfctr_cpu_mask *cpus,
+			   struct perfctr_cpu_mask *forbidden);
+
+This operation retrieves information from the kernel about
+the processors in the system.
+
+If non-NULL, '*info' will be updated with information about the
+capabilities of the processor and the low-level driver.
+
+If non-NULL, '*cpus' will be updated with a bitmask listing the
+set of processors in the system. The size of this bitmask is not
+statically known, so the protocol is:
+
+1. User-space initialises cpus->nrwords to the number of elements
+   allocated for cpus->mask[].
+2. The kernel reads cpus->nrwords, and then writes the required
+   number of words to cpus->nrwords.
+3. If the required number of words is less than the original value
+   of cpus->nrwords, then an EOVERFLOW error is signalled.
+4. Otherwise, the kernel converts its internal cpumask_t value
+   to the external format and writes that to cpus->mask[].
+
+If non-NULL, '*forbidden' will be updated with a bitmask listing
+the set of processors in the system on which users must not try
+to use performance counters. This is currently only relevant for
+hyper-threaded Pentium 4/Xeon systems. The protocol is the same
+as for '*cpus'.
+
+Notes:
+- The internal representation of a cpumask_t is as an array of
+  unsigned long. This representation is unsuitable for user-space,
+  because it is not binary-compatible between 32 and 64-bit
+  variants of a big-endian processor. The 'struct perfctr_cpu_mask'
+  type uses an array of unsigned 32-bit integers.
+- The protocol for retrieving a 'struct perfctr_cpu_mask' was
+  designed to allow user-space to quickly determine the correct
+  size of the 'mask[]' array. Other system calls use weaker protocols,
+  which force user-space to guess increasingly larger values in a
+  loop, until finally an acceptable value was guessed.
diff -puN /dev/null Documentation/perfctr/virtual.txt
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ 25-akpm/Documentation/perfctr/virtual.txt	2004-07-12 00:04:07.780307688 -0700
@@ -0,0 +1,355 @@
+$Id: perfctr-documentation-update.patch,v 1.1 2004/07/12 05:41:57 akpm Exp $
+
+VIRTUAL PER-PROCESS PERFORMANCE COUNTERS
+========================================
+This document describes the virtualised per-process performance
+counters kernel extension. See "General Model" in low-level-api.txt
+for the model of the processor's performance counters.
+
+Contents
+========
+- Summary
+- Design & Implementation Notes
+  * State
+  * Thread Management Hooks
+  * Synchronisation Rules
+  * The Pseudo File System
+- API For User-Space
+  * Opening/Creating the State
+  * Updating the Control
+  * Unlinking the State
+  * Reading the State
+  * Resuming After Handling Overflow Signal
+  * Reading the Counter Values
+- Limitations / TODO List
+
+Summary
+=======
+The virtualised per-process performance counters facility
+(virtual perfctrs) is a kernel extension which extends the
+thread state to record perfctr settings and values, and augments
+the context-switch code to save perfctr values at suspends and
+restore them at resumes. This "virtualises" the performance
+counters in much the same way as the kernel already virtualises
+general-purpose and floating-point registers.
+
+Virtual perfctrs also adds an API allowing non-privileged
+user-space processes to set up and access their perfctrs.
+
+As this facility is primarily intended to support developers
+of user-space code, both virtualisation and allowing access
+from non-privileged code are essential features.
+
+Design & Implementation Notes
+=============================
+
+State
+-----
+The state of a thread's perfctrs is packaged up in an object of
+type 'struct vperfctr'. It consists of CPU-dependent state, a
+sampling timer, and some auxiliary administrative data. This is
+an independent object, with its own lifetime and access rules.
+
+The state object is attached to the thread via a pointer in its
+thread_struct. While attached, the object records the identity
+of its owner thread: this is used for user-space API accesses
+from threads other than the owner.
+
+The state is separate from the thread_struct for several resons:
+- It's potentially large, hence it's allocated only when needed.
+- It can outlive its owner thread. The state can be opened as
+  a pseudo file: as long as that file is live, so is the object.
+- It can be mapped, via mmap() on the pseudo file's descriptor.
+  To facilitate this, a full page is allocated and reserved.
+
+Thread Management Hooks
+-----------------------
+Virtual perfctrs hooks into several thread management events:
+
+- exit_thread(): Calls perfctr_exit_thread() to stop the counters
+  and detach the thread's vperfctr object.
+
+- copy_thread(): Calls perfctr_copy_thread() to initialise
+  the child's vperfctr pointer. Currently the settings are
+  not inherited from parent to child, so the pointer is set
+  to NULL in the child's thread_struct.
+
+- switch_to():
+  * Calls perfctr_suspend_thread() on the previous thread, to
+    suspend its counters.
+  * Calls perfctr_resume_thread() on the next thread, to resume
+    its counters. Also resets the sampling timer (see below).
+
+- update_process_times(): Calls perfctr_sample_thread(), which
+  decrements the sampling timer and samples the counters if the
+  timer reaches zero.
+
+  Sampling is normally only done at switch_to(), but if too much
+  time passes before the next switch_to(), a hardware counter may
+  increment by more than its range (usually 2^32). If this occurs,
+  the difference from its start value will be incorrect, causing
+  its updated sum to also be incorrect. The sampling timer is used
+  to prevent this problem, which has been observed on SMP machines,
+  and on high clock frequency UP machines.
+
+- set_cpus_allowed(): Calls perfctr_set_cpus_allowed() to detect
+  attempts to migrate the thread to a "forbidden" CPU, in which
+  case a flag in the vperfctr object is set. perfctr_resume_thread()
+  checks this flag, and if set, marks the counters as stopped and
+  sends a SIGILL to the thread.
+
+  The notion of forbidden CPUs is a workaround for a design flaw
+  in hyper-threaded Pentium 4s and Xeons. See low-level-x86.txt
+  for details.
+
+To reduce overheads, these hooks are implemented as inline functions
+that check if the thread is using perfctrs before calling the code
+that implements the behaviour. The hooks also reduce to no-ops if
+CONFIG_PERFCTR_VIRTUAL is disabled.
+
+Synchronisation Rules
+---------------------
+There are four types of accesses to a thread's perfctr state:
+
+1. Thread management events (see above) done by the thread itself.
+   Suspend, resume, and sample are lock-less.
+
+2. API operations done by the thread itself.
+   These are lock-less, except when an individual operation
+   has specific synchronisation needs. For instance, preemption
+   is often disabled to prevent accesses due to context switches.
+
+3. API operations done by a different thread ("monitor thread").
+   The owner thread must be suspended for the duration of the operation.
+   This is ensured by requiring that the monitor thread is ptrace()ing
+   the owner thread, and that the owner thread is in TASK_STOPPED state.
+
+4. set_cpus_allowed().
+   The kernel does not lock the target during set_cpus_allowed(),
+   so it can execute concurrently with the owner thread or with
+   some monitor thread. In particular, the state may be deallocated.
+
+   To solve this problem, both perfctr_set_cpus_allowed() and the
+   operations that can change the owner thread's perfctr pointer
+   (creat, unlink, exit) perform a task_lock() on the owner thread
+   before accessing the perfctr pointer.
+
+   When concurrent set_cpus_allowed() isn't a problem (because the
+   architecture doesn't have a notion of forbidden CPUs), atomicity
+   of updates to the thread's perfctr pointer is ensured by disabling
+   preemption.
+
+The Pseudo File System
+----------------------
+The perfctr state is accessed from user-space via a file descriptor.
+
+The main reason for this is to enable mmap() on the file descriptor,
+which gives read-only access to the state.
+
+The file descriptor is a handle to the perfctr state object. This
+allows a very simple implementation of the user-space 'perfex'
+program, which runs another program with given perfctr settings
+and reports their final values. Without this handle, monitoring
+applications like perfex would have to be implemented like debuggers
+in order to catch the target thread's exit and retrieve the counter
+values before the exit completes and the state disappears.
+
+The file for a perfctr state object belongs to the vperfctrs pseudo
+file system. Files in this file system support only a few operations:
+- mmap()
+- release() decrements the perfctr object's reference count and
+  deallocates the object when no references remain
+- the listing of a thread's open file descriptors identifies
+  perfctr state file descriptors as belonging to "vperfctrfs"
+The implementation is based on the code for pipefs.
+
+In previous versions of the perfctr package, the file descriptors
+for perfctr state objects also supported the API's ioctl() method.
+
+API For User-Space
+==================
+
+Opening/Creating the State
+--------------------------
+int fd = sys_vperfctr_open(int tid, int creat);
+
+'tid' must be the id of a thread, or 0 which is interpreted as an
+alias for the current thread.
+
+This operation returns an open file descriptor which is a handle
+on the thread's perfctr state object.
+
+If 'creat' is non-zero and the object did not exist, then it is
+created and attached to the thread. The newly created state object
+is inactive, with all control fields disabled and all counters
+having the value zero. If 'creat' is non-zero and the object
+already existed, then an EEXIST error is signalled.
+
+If 'tid' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Notes:
+- The access rule in the non-self case is the same as for the
+  ptrace() system call. It ensures that no other thread, including
+  the target thread itself, can access or change the target thread's
+  perfctr state during the operation.
+- An open file descriptor for a perfctr state object counts as a
+  reference to that object; even if detached from its thread the
+  object will not be deallocated until the last reference is gone.
+- The file descriptor can be passed to mmap(), for low-overhead
+  counter sampling. See "READING THE COUNTER VALUES" for details.
+- The file descriptor can be passed to another thread. Accesses
+  from threads other than the owner are permitted as long as they
+  posses the file descriptor and use ptrace() for synchronisation.
+
+Updating the Control
+--------------------
+int err = sys_vperfctr_control(int fd, const struct vperfctr_control *control);
+
+'fd' must be the return value from a call to sys_vperfctr_open(),
+The perfctr object must still be attached to its owner thread.
+
+This operation stops and samples any currently running counters in
+the thread, and then updates the control settings. If the resulting
+state has any enabled counters, then the counters are restarted.
+
+Before restarting, the counter sums are reset to zero. However,
+if a counter's bit is set in the control object's 'preserve'
+bitmask field, then that counter's sum is not reset. The TSC's
+sum is only reset if the TSC is disabled in the new state.
+
+If any of the programmable counters are enabled, then the thread's
+CPU affinity mask is adjusted to exclude the set of forbidden CPUs.
+
+If the control data activates any interrupt-mode counters, then
+a signal (specified by the 'si_signo' control field) will be sent
+to the owner thread after an overflow interrupt. The documentation
+for sys_vperfctr_iresume() describes this mechanism.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+The perfctr state object denoted by 'fd' must still be attached
+to its owner thread.
+
+Notes:
+- It is strongly recommended to memset() the vperfctr_control object
+  to all-bits-zero before setting the fields of interest.
+- Stopping the counters is done by invoking the control operation
+  with a control object that activates neither the TSC nor any PMCs.
+
+Unlinking the State
+-------------------
+int err = sys_vperfctr_unlink(int fd);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+
+This operation stops and samples the thread's counters, and then
+detaches the perfctr state object from the thread. If the object
+already had been detached, then no action is performed.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Reading the State
+-----------------
+int err = sys_vperfctr_read(int fd, struct perfctr_sum_ctrs *sum,
+			    struct vperfctr_control *control);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+
+This operation copies data from the perfctr state object to
+user-space. If 'sum' is non-NULL, then the counter sums are
+written to it. If 'control' is non-NULL, then the control data
+is written to it.
+
+If the perfctr state object is attached to the current thread,
+then the counters are sampled and updated first.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Notes:
+- An alternate and faster way to retrieve the counter sums is described
+  below. This system call can be used if the hardware does not permit
+  user-space reads of the counters.
+
+Resuming After Handling Overflow Signal
+---------------------------------------
+int err = sys_vperfctr_iresume(int fd);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+The perfctr object must still be attached to its owner thread.
+
+When an interrupt-mode counter has overflowed, the counters
+are sampled and suspended (TSC remains active). Then a signal,
+as specified by the 'si_signo' control field, is sent to the
+owner thread: the associated 'struct siginfo' has 'si_code'
+equal to 'SI_PMC_OVF', and 'si_pmc_ovf_mask' equal to the set
+of overflown counters.
+
+The counters are suspended to avoid generating new performance
+counter events during the execution of the signal handler, but
+the previous settings are saved. Calling sys_vperfctr_iresume()
+restores the previous settings and resumes the counters. Doing
+this is optional.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Reading the Counter Values
+--------------------------
+The value of a counter is computed from three components:
+
+	value = sum + (now - start);
+
+Two of these (sum and start) reside in the kernel's state object,
+and the third (now) is the contents of the hardware counter.
+To perform this computation in user-space requires access to
+the state object. This is achieved by passing the file descriptor
+from sys_vperfctr_open() to mmap():
+
+	volatile const struct vperfctr_state *kstate;
+	kstate = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
+
+Reading the three components is a non-atomic operation. If the
+thread is scheduled during the operation, the three values will
+not be consistent and the wrong result will be computed.
+To detect this situation, user-space should check the kernel
+state's TSC start value before and after the operation, and
+retry the operation in case of a mismatch.
+
+The algorithm for retrieving the value of counter 'i' is:
+
+	tsc0 = kstate->cpu_state.tsc_start;
+	for(;;) {
+		rdpmcl(kstate->cpu_state.pmc[i].map, now);
+		start = kstate->cpu_state.pmc[i].start;
+		sum = kstate->cpu_state.pmc[i].sum;
+		tsc1 = kstate->cpu_state.tsc_start;
+		if (likely(tsc1 == tsc0))
+			break;
+		tsc0 = tsc1;
+	}
+	return sum + (now - start);
+
+The algorithm for retrieving the value of the TSC is similar,
+as is the algorithm for retrieving the values of all counters.
+
+Notes:
+- Since the state's TSC time-stamps are used, the algorithm requires
+  that user-space enables TSC sampling.
+- The algorithm requires that the hardware allows user-space reads
+  of the counter registers. If this property isn't statically known
+  for the architecture, user-space should retrieve the kernel's
+  'struct perfctr_info' object and check that the PERFCTR_FEATURE_RDPMC
+  flag is set.
+
+Limitations / TODO List
+=======================
+- Perfctr settings are not inherited from parent to child at fork().
+  The issue is not fork() but propagating final counts from children
+  to parents, and allowing user-space to distinguish "self" counts
+  from "children" counts.
+  An implementation of this feature is being planned.
+- Buffering of overflow samples is not implemented. So far, not a
+  single user has requested it.
_