From: Mikael Pettersson <mikpe@csd.uu.se>

This patch adds documentation for perfctr's low-level drivers in
Documentation/perfctr/.  The internal API between perfctr's low-level and
high-level drivers is described, as are the architecture-specific data
structures users use to control and inspect the counters.

Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
DESC
perfctr documentation update
EDESC
From: Mikael Pettersson <mikpe@csd.uu.se>

This patch updates perfctr's documentation:
- adds new Implementation Notes section to the x86 documentation
- some minor fixes in the x86 documentation
- adds new documentation on the per-process perfctrs
- adds new overview documentation

Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/Documentation/perfctr/low-level-api.txt   |  216 +++++++++++++
 25-akpm/Documentation/perfctr/low-level-ppc32.txt |  164 ++++++++++
 25-akpm/Documentation/perfctr/low-level-x86.txt   |  360 ++++++++++++++++++++++
 25-akpm/Documentation/perfctr/overview.txt        |  129 +++++++
 25-akpm/Documentation/perfctr/virtual.txt         |  355 +++++++++++++++++++++
 5 files changed, 1224 insertions(+)

diff -puN /dev/null Documentation/perfctr/low-level-api.txt
--- /dev/null	Thu Apr 11 07:25:15 2002
+++ 25-akpm/Documentation/perfctr/low-level-api.txt	Mon Aug 16 15:57:30 2004
@@ -0,0 +1,216 @@
+$Id: low-level-api.txt,v 1.1 2004/07/02 18:57:05 mikpe Exp $
+
+PERFCTR LOW-LEVEL DRIVERS API
+=============================
+
+This document describes the common low-level API.
+See low-level-$ARCH.txt for architecture-specific documentation.
+
+General Model
+=============
+The model is that of a processor with:
+- A non-programmable clock-like counter, the "TSC".
+  The TSC frequency is assumed to be constant, but it is not
+  assumed to be identical to the core frequency.
+  The TSC may be absent.
+- A set of programmable counters, the "perfctrs" or "pmcs".
+  Control data may be per-counter, global, or both.
+  The counters are not assumed to be interchangeable.
+
+  A normal counter that simply counts events is referred to
+  as an "accumulation-mode" or "a-mode" counter. Its total
+  count is computed by adding the counts for the individual
+  periods during which the counter is active. Two per-counter
+  state variables are used for this: "sum", which is the
+  total count up to but not including the current period,
+  and "start", which records the value of the hardware counter
+  at the start of the current period. At the end of a period,
+  the hardware counter's value is read again, and the increment
+  relative to the start value is added to the sum. This strategy
+  is used because it avoids a number of hardware problems.
+
+  A counter that has been programmed to generate an interrupt
+  on overflow is referred to as an "interrupt-mode" or "i-mode"
+  counter. I-mode counters are initialised to specific values,
+  and after overflowing are reset to their (re)start values.
+  The total event count is available just as for a-mode counters.
+
+  The set of counters may be empty, in which case only the
+  TSC (which must be present) can be sampled.
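The a-mode accounting described above can be sketched as follows. This is a minimal illustration, not the driver's code: the `sketch_` names are hypothetical, and it assumes that only the low 32 bits of the hardware counter are read.

```c
#include <stdint.h>

/* Hypothetical per-counter state, mirroring the "start" and "sum"
   variables described in the text. */
struct sketch_ctr {
	uint32_t start;	/* hardware value at the start of the current period */
	uint64_t sum;	/* total count up to, but excluding, the current period */
};

/* Begin a period: remember the current hardware value. */
static void sketch_ctr_resume(struct sketch_ctr *c, uint32_t hw_now)
{
	c->start = hw_now;
}

/* End a period: add the increment since "start" to the sum.
   Unsigned 32-bit subtraction yields the right increment even if
   the hardware counter wrapped around once during the period. */
static void sketch_ctr_suspend(struct sketch_ctr *c, uint32_t hw_now)
{
	c->sum += (uint32_t)(hw_now - c->start);
}
```

Because the hardware counter is only read, never written, the scheme works the same whether or not the hardware allows the counter to be stopped or reinitialised.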
+
+Contents of <asm-$ARCH/perfctr.h>
+=================================
+
+"struct perfctr_sum_ctrs"
+-------------------------
+struct perfctr_sum_ctrs {
+	unsigned long long tsc;
+	unsigned long long pmc[..];	/* one per counter */
+};
+
+Architecture-specific container for counter values.
+Used in the kernel/user API, but not by the low-level drivers.
+
+"struct perfctr_cpu_control"
+----------------------------
+This struct includes at least the following fields:
+
+	unsigned int tsc_on;
+	unsigned int nractrs;		/* # of a-mode counters */
+	unsigned int nrictrs;		/* # of i-mode counters */
+	unsigned int pmc_map[..];	/* one per counter: virt-to-phys mapping */
+	unsigned int evntsel[..];	/* one per counter: hw control data */
+	int ireset[..];			/* one per counter: i-mode (re)start value */
+
+Architecture-specific container for control data.
+Used both in the kernel/user API and by the low-level drivers
+(embedded in "struct perfctr_cpu_state").
+
+"tsc_on" is non-zero if the TSC should be sampled.
+
+"nractrs" is the number of a-mode counters, corresponding to
+elements 0..nractrs-1 in the per-counter arrays.
+
+"nrictrs" is the number of i-mode counters, corresponding to
+elements nractrs..nractrs+nrictrs-1 in the per-counter arrays.
+
+"nractrs+nrictrs" is the total number of counters to program
+and sample. A-mode and i-mode counters are separated in order
+to allow quick enumeration of either set, which is needed in
+some low-level driver operations.
+
+"pmc_map[]" maps each counter to its corresponding hardware counter
+identification. No two counters may map to the same hardware counter.
+This mapping is present because the hardware may have asymmetric
+counters or other addressing quirks, which means that a counter's index
+may not suffice to address its hardware counter.
+
+"evntsel[]" contains the per-counter control data. Architecture-specific
+global control data, if any, is placed in architecture-specific fields.
+
+"ireset[]" contains the (re)start values for the i-mode counters.
+Only indices nractrs..nractrs+nrictrs-1 are used.
+
+"struct perfctr_cpu_state"
+--------------------------
+This struct includes at least the following fields:
+
+	unsigned int cstatus;
+	unsigned int tsc_start;
+	unsigned long long tsc_sum;
+	struct {
+		unsigned int map;
+		unsigned int start;
+		unsigned long long sum;
+	} pmc[..];	/* one per counter; the size is not part of the user ABI */
+#ifdef __KERNEL__
+	struct perfctr_cpu_control control;
+#endif
+
+This type records the state and control data for a collection
+of counters. It is used by many low-level operations, and may
+be exported to user-space via mmap().
+
+"cstatus" is a re-encoding of control.tsc_on/nractrs/nrictrs,
+used because it reduces overheads in key low-level operations.
+Operations on cstatus values include:
+- unsigned int perfctr_mk_cstatus(unsigned int tsc_on, unsigned int nractrs, unsigned int nrictrs);
+  Construct a cstatus value.
+- unsigned int perfctr_cstatus_enabled(unsigned int cstatus);
+  Check if any part (tsc_on, nractrs, nrictrs) of the cstatus is non-zero.
+- int perfctr_cstatus_has_tsc(unsigned int cstatus);
+  Check if the tsc_on part of the cstatus is non-zero.
+- unsigned int perfctr_cstatus_nrctrs(unsigned int cstatus);
+  Retrieve nractrs+nrictrs from the cstatus.
+- unsigned int perfctr_cstatus_has_ictrs(unsigned int cstatus);
+  Check if the nrictrs part of cstatus is non-zero.
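One way such a re-encoding could look is sketched below. The bit positions are an assumption made for illustration and are not taken from this document; the `sketch_` names are hypothetical stand-ins for the perfctr_* operations listed above.

```c
/* Illustrative packing only; the bit positions are assumptions:
   bit 31 = tsc_on, bits 16.. = nrictrs,
   bits 8.. = nractrs+nrictrs, bits 0.. = nractrs. */
static unsigned int sketch_mk_cstatus(unsigned int tsc_on,
				      unsigned int nractrs,
				      unsigned int nrictrs)
{
	return ((tsc_on ? 1u : 0u) << 31)
		| (nrictrs << 16)
		| ((nractrs + nrictrs) << 8)
		| nractrs;
}

/* Non-zero if anything at all is enabled. */
static unsigned int sketch_cstatus_enabled(unsigned int cstatus)
{
	return cstatus != 0;
}

/* Non-zero if the TSC should be sampled. */
static int sketch_cstatus_has_tsc(unsigned int cstatus)
{
	return (int)((cstatus >> 31) & 1u);
}

/* Total number of counters, precomputed at construction time. */
static unsigned int sketch_cstatus_nrctrs(unsigned int cstatus)
{
	return (cstatus >> 8) & 0x7Fu;
}

/* Non-zero if there is at least one i-mode counter. */
static unsigned int sketch_cstatus_has_ictrs(unsigned int cstatus)
{
	return cstatus & (0x7Fu << 16);
}
```

Storing nractrs+nrictrs in its own field means that a loop over all counters needs one shift and one mask rather than an addition of two extracted fields.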
+
+"tsc_start" and "tsc_sum" record the state of the TSC.
+
+"pmc[]" contains the per-counter state, in the "start" and "sum"
+fields. The "map" field contains the corresponding hardware counter
+identification, from the counter's entry in "control.pmc_map[]";
+it is copied into pmc[] to reduce overheads in key low-level operations.
+
+"control" contains the control data which determines the
+behaviour of the counters.
+
+User-space overflow signal handler items
+----------------------------------------
+After a counter has overflowed, a user-space signal handler may
+be invoked with a "struct siginfo" identifying the source of the
+signal and the set of overflowed counters.
+
+#define SI_PMC_OVF	..
+
+Value to be stored in "si.si_code".
+
+#define si_pmc_ovf_mask	..
+
+Field in which to store a bit-mask of the overflowed counters.
+
+Kernel-internal API
+-------------------
+
+/* Driver init/exit.
+   perfctr_cpu_init() performs hardware detection and may fail. */
+extern int perfctr_cpu_init(void);
+extern void perfctr_cpu_exit(void);
+
+/* CPU type name. Set if perfctr_cpu_init() was successful. */
+extern char *perfctr_cpu_name;
+
+/* Hardware reservation. A high-level driver must reserve the
+   hardware before it may use it, and release it afterwards.
+   "service" is a unique string identifying the high-level driver.
+   perfctr_cpu_reserve() returns NULL on success; if another
+   high-level driver has reserved the hardware, then that
+   driver's "service" string is returned. */
+extern const char *perfctr_cpu_reserve(const char *service);
+extern void perfctr_cpu_release(const char *service);
+
+/* PRE: state has no running interrupt-mode counters.
+   Check that the new control data is valid.
+   Update the low-level driver's private control data.
+   is_global should be zero for per-process counters and non-zero
+   for global-mode counters.
+   Returns a negative error code if the control data is invalid. */
+extern int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global);
+
+/* Stop i-mode counters. Update sums and start values.
+   Read a-mode counters. Subtract start values, accumulate into sums.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_suspend(struct perfctr_cpu_state *state);
+
+/* Reset i-mode counters to their start values.
+   Write control registers.
+   Read a-mode counters and update their start values.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_resume(struct perfctr_cpu_state *state);
+
+/* Perform an efficient combined suspend/resume operation.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_sample(struct perfctr_cpu_state *state);
+
+/* The type of a perfctr overflow interrupt handler.
+   It will be called in IRQ context, with preemption disabled. */
+typedef void (*perfctr_ihandler_t)(unsigned long pc);
+
+/* Install a perfctr overflow interrupt handler.
+   Should be called after perfctr_cpu_reserve() but before
+   any counter state has been activated. */
+extern void perfctr_cpu_set_ihandler(perfctr_ihandler_t);
+
+/* PRE: The state has been suspended and sampled by perfctr_cpu_suspend().
+   Should be called from the high-level driver's perfctr_ihandler_t,
+   and preemption must not have been enabled.
+   Identify which counters have overflowed, reset their start values
+   from ireset[], and perform any necessary hardware cleanup.
+   Returns a bit-mask of the overflowed counters. */
+extern unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state*);
+
+/* Call perfctr_cpu_ireload() just before perfctr_cpu_resume() to
+   bypass internal caching and force a reload of the i-mode pmcs.
+   This ensures that perfctr_cpu_identify_overflow()'s state changes
+   are propagated to the hardware. */
+extern void perfctr_cpu_ireload(struct perfctr_cpu_state*);
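The return convention of perfctr_cpu_reserve() and perfctr_cpu_release() described earlier in this listing can be sketched as follows. This is a minimal illustration under stated assumptions: the `sketch_` names are hypothetical, there is a single owner record, and the locking a real driver would need is omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical owner record; NULL means the hardware is free. */
static const char *sketch_owner;

/* Returns NULL on success. If the hardware is already reserved,
   returns the current owner's "service" string instead. */
static const char *sketch_cpu_reserve(const char *service)
{
	if (sketch_owner != NULL)
		return sketch_owner;
	sketch_owner = service;
	return NULL;
}

/* Releases the hardware, but only if "service" is the current owner. */
static void sketch_cpu_release(const char *service)
{
	if (sketch_owner != NULL && strcmp(sketch_owner, service) == 0)
		sketch_owner = NULL;
}
```

Returning the owner's string on failure lets the caller report which high-level driver currently holds the hardware.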
diff -puN /dev/null Documentation/perfctr/low-level-ppc32.txt
--- /dev/null	Thu Apr 11 07:25:15 2002
+++ 25-akpm/Documentation/perfctr/low-level-ppc32.txt	Mon Aug 16 15:57:30 2004
@@ -0,0 +1,164 @@
+$Id: low-level-ppc32.txt,v 1.1 2004/07/02 18:57:05 mikpe Exp $
+
+PERFCTRS PPC32 LOW-LEVEL API
+============================
+
+See low-level-api.txt for the common low-level API.
+This document only describes ppc32-specific behaviour.
+For detailed hardware control register layouts, see
+the manufacturers' documentation.
+
+Supported processors
+====================
+- PowerPC 604, 604e, 604ev.
+- PowerPC 750/740, 750CX, 750FX, 750GX.
+- PowerPC 7400, 7410, 7451/7441, 7457/7447.
+- Any generic PowerPC with a timebase register.
+
+Contents of <asm-$ARCH/perfctr.h>
+=================================
+
+"struct perfctr_sum_ctrs"
+-------------------------
+struct perfctr_sum_ctrs {
+	unsigned long long tsc;
+	unsigned long long pmc[8];
+};
+
+The pmc[] array has room for 8 counters.
+
+"struct perfctr_cpu_control"
+----------------------------
+struct perfctr_cpu_control {
+	unsigned int tsc_on;
+	unsigned int nractrs;		/* # of a-mode counters */
+	unsigned int nrictrs;		/* # of i-mode counters */
+	unsigned int pmc_map[8];
+	unsigned int evntsel[8];	/* one per counter */
+	int ireset[8];			/* [0,0x7fffffff], for i-mode counters */
+	struct {
+		unsigned int mmcr0;	/* sans PMC{1,2}SEL */
+		unsigned int mmcr2;	/* only THRESHMULT */
+		/* IABR/DABR/BAMR not supported */
+	} ppc;
+	unsigned int _reserved1;
+	unsigned int _reserved2;
+	unsigned int _reserved3;
+	unsigned int _reserved4;
+};
+
+The per-counter arrays have room for 8 elements.
+
+ireset[] values must be non-negative, since overflow occurs on
+the non-negative-to-negative transition.
+
+The ppc sub-struct contains PowerPC-specific control data:
+- mmcr0: global control data for the MMCR0 SPR; the event
+  selectors for PMC1 and PMC2 are in evntsel[], not in mmcr0
+- mmcr2: global control data for the MMCR2 SPR; only the
+  THRESHMULT field can be specified
+
+"struct perfctr_cpu_state"
+--------------------------
+struct perfctr_cpu_state {
+	unsigned int cstatus;
+	struct {	/* k1 is opaque in the user ABI */
+		unsigned int id;
+		int isuspend_cpu;
+	} k1;
+	/* The two tsc fields must be inlined. Placing them in a
+	   sub-struct causes unwanted internal padding on x86-64. */
+	unsigned int tsc_start;
+	unsigned long long tsc_sum;
+	struct {
+		unsigned int map;
+		unsigned int start;
+		unsigned long long sum;
+	} pmc[8];	/* the size is not part of the user ABI */
+#ifdef __KERNEL__
+	unsigned int ppc_mmcr[3];
+	struct perfctr_cpu_control control;
+#endif
+};
+
+The k1 sub-struct is used by the low-level driver for
+caching purposes. "id" identifies the control data, and
+"isuspend_cpu" identifies the CPU on which the i-mode
+counters were last suspended.
+
+The pmc[] array has room for 8 elements.
+
+ppc_mmcr[] is computed from control by the low-level driver,
+and provides the data for the MMCR0, MMCR1, and MMCR2 SPRs.
+
+User-space overflow signal handler items
+----------------------------------------
+#ifdef __KERNEL__
+#define SI_PMC_OVF	(__SI_FAULT|'P')
+#else
+#define SI_PMC_OVF	('P')
+#endif
+#define si_pmc_ovf_mask	_sifields._pad[0]
+
+Kernel-internal API
+-------------------
+
+In perfctr_cpu_update_control(), the is_global parameter
+is ignored. (It is only relevant for x86.)
+
+CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK is never defined.
+(It is only relevant for x86.)
+
+Overflow interrupt handling is not yet implemented.
+
+Processor-specific Notes
+========================
+
+General
+-------
+pmc_map[] contains a counter number, an integer between 0 and 5.
+It never contains an SPR number.
+
+Basic operation (the strategy for a-mode counters, caching
+control register contents, recording "suspend CPU" for i-mode
+counters) is the same as in the x86 driver.
+
+PowerPC 604/750/74xx
+--------------------
+These processors use similar hardware layouts, differing
+mainly in the number of counter and control registers.
+The set of available events differs greatly, but that only
+affects users, not the low-level driver itself.
+
+The hardware has 2 (604), 4 (604e/750/7400/7410), or 6
+(745x) counters (PMC1 to PMC6), and 1 (604), 2 (604e/750),
+or 3 (74xx) control registers (MMCR0 to MMCR2).
+
+MMCR0 contains global control bits, and the event selection
+fields for PMC1 and PMC2. MMCR1 contains event selection fields
+for PMC3-PMC6. MMCR2 contains the THRESHMULT flag, which
+specifies how MMCR0[THRESHOLD] should be scaled.
+
+In control.ppc.mmcr0, the PMC1SEL and PMC2SEL fields (0x00001FFF)
+are reserved. The PMXE flag (0x04000000) may only be set when
+the driver supports overflow interrupts.
+
+If FCECE or TRIGGER is set in MMCR0 on a 74xx processor, then
+MMCR0 can change asynchronously. The driver handles this, at
+the cost of some additional work in perfctr_cpu_suspend().
+Not setting these flags avoids that overhead.
+
+In control.ppc.mmcr2, only the THRESHMULT flag (0x80000000)
+may be set, and only on 74xx processors.
+
+The SIA (sampled instruction address) register is not used.
+The SDA (sampled data address) register is 604/604e-only,
+and is not used. The BAMR (breakpoint address mask) register
+is not used, but it is cleared by the driver.
+
+Generic PowerPC with timebase
+-----------------------------
+The driver supports any PowerPC as long as it has a timebase
+register, and the TB frequency is available via Open Firmware.
+In this case, the only valid usage mode is with tsc_on == 1
+and nractrs == nrictrs == 0 in the control data.
diff -puN /dev/null Documentation/perfctr/low-level-x86.txt
--- /dev/null	Thu Apr 11 07:25:15 2002
+++ 25-akpm/Documentation/perfctr/low-level-x86.txt	Mon Aug 16 15:57:33 2004
@@ -0,0 +1,360 @@
+$Id: low-level-x86.txt,v 1.1 2004/07/02 18:57:05 mikpe Exp $
+
+PERFCTRS X86 LOW-LEVEL API
+==========================
+
+See low-level-api.txt for the common low-level API.
+This document only describes x86-specific behaviour.
+For detailed hardware control register layouts, see
+the manufacturers' documentation.
+
+Contents
+========
+- Supported processors
+- Contents of <asm-i386/perfctr.h>
+- Processor-specific Notes
+- Implementation Notes
+
+Supported processors
+====================
+- Intel P5, P5MMX, P6, P4.
+- AMD K7, K8. (P6 clones, with some changes)
+- Cyrix 6x86MX, MII, and III. (good P5 clones)
+- Centaur WinChip C6, 2, and 3. (bad P5 clones)
+- VIA C3. (bad P6 clone)
+- Any generic x86 with a TSC.
+
+Contents of <asm-i386/perfctr.h>
+================================
+
+"struct perfctr_sum_ctrs"
+-------------------------
+struct perfctr_sum_ctrs {
+	unsigned long long tsc;
+	unsigned long long pmc[18];
+};
+
+The pmc[] array has room for 18 counters.
+
+"struct perfctr_cpu_control"
+----------------------------
+struct perfctr_cpu_control {
+	unsigned int tsc_on;
+	unsigned int nractrs;		/* # of a-mode counters */
+	unsigned int nrictrs;		/* # of i-mode counters */
+	unsigned int pmc_map[18];
+	unsigned int evntsel[18];	/* one per counter, even on P5 */
+	struct {
+		unsigned int escr[18];
+		unsigned int pebs_enable;	/* for replay tagging */
+		unsigned int pebs_matrix_vert;	/* for replay tagging */
+	} p4;
+	int ireset[18];			/* < 0, for i-mode counters */
+	unsigned int _reserved1;
+	unsigned int _reserved2;
+	unsigned int _reserved3;
+	unsigned int _reserved4;
+};
+
+The per-counter arrays have room for 18 elements.
+
+ireset[] values must be negative, since overflow occurs on
+the negative-to-non-negative transition.
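As a worked example of the rule above: to request an overflow after N events, an x86 i-mode counter is started at -N, so that the N-th event takes it from -1 to 0. The helper below is a hypothetical sketch, not part of the perfctr API, and assumes a 32-bit int.

```c
#include <limits.h>

/* Compute an x86-style ireset[] value so that the counter overflows
   after "nevents" events. Returns 0 (never a valid ireset value,
   since valid values are negative) if nevents is out of range. */
static int sketch_x86_ireset(unsigned int nevents)
{
	if (nevents == 0 || nevents > (unsigned int)INT_MAX + 1u)
		return 0;
	return (int)(-(long long)nevents);
}
```

For instance, sampling every 100000 events would use an ireset[] value of -100000.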
+
+The p4 sub-struct contains P4-specific control data:
+- escr[]: the control data to write to the ESCR register
+  associated with the counter
+- pebs_enable: the control data to write to the PEBS_ENABLE MSR
+- pebs_matrix_vert: the control data to write to the
+  PEBS_MATRIX_VERT MSR
+
+"struct perfctr_cpu_state"
+--------------------------
+struct perfctr_cpu_state {
+	unsigned int cstatus;
+	struct {	/* k1 is opaque in the user ABI */
+		unsigned int id;
+		int isuspend_cpu;
+	} k1;
+	/* The two tsc fields must be inlined. Placing them in a
+	   sub-struct causes unwanted internal padding on x86-64. */
+	unsigned int tsc_start;
+	unsigned long long tsc_sum;
+	struct {
+		unsigned int map;
+		unsigned int start;
+		unsigned long long sum;
+	} pmc[18];	/* the size is not part of the user ABI */
+#ifdef __KERNEL__
+	struct perfctr_cpu_control control;
+	unsigned int p4_escr_map[18];
+#endif
+};
+
+The k1 sub-struct is used by the low-level driver for
+caching purposes. "id" identifies the control data, and
+"isuspend_cpu" identifies the CPU on which the i-mode
+counters were last suspended.
+
+The pmc[] array has room for 18 elements.
+
+p4_escr_map[] is computed from control by the low-level driver,
+and provides the MSR number for the counter's associated ESCR.
+
+User-space overflow signal handler items
+----------------------------------------
+#ifdef __KERNEL__
+#define SI_PMC_OVF	(__SI_FAULT|'P')
+#else
+#define SI_PMC_OVF	('P')
+#endif
+#define si_pmc_ovf_mask	_sifields._pad[0]
+
+Kernel-internal API
+-------------------
+
+In perfctr_cpu_update_control(), the is_global parameter controls
+whether monitoring the other thread (T1) on HT P4s is permitted
+or not. On other processors the parameter is ignored.
+
+SMP kernels define CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK and
+"extern cpumask_t perfctr_cpus_forbidden_mask;".
+On HT P4s, resource conflicts can occur because both threads
+(T0 and T1) in a processor share the same perfctr registers.
+To prevent conflicts, only thread 0 in each processor is allowed
+to access the counters. perfctr_cpus_forbidden_mask contains the
+smp_processor_id()s of each processor's thread 1, and it is the
+responsibility of the high-level driver to ensure that it never
+accesses the perfctr state from a forbidden thread.
+
+Overflow interrupt handling requires local APIC support in the kernel.
+
+Processor-specific Notes
+========================
+
+General
+-------
+pmc_map[] contains a counter number, as used by the RDPMC instruction.
+It never contains an MSR number.
+
+Counters are 32, 40, or 48 bits wide. The driver only ever
+reads the low 32 bits. This avoids performance issues and
+errata on some processors.
+
+Writing to counters or their control registers tends to be
+very expensive. This is why a-mode counters only use read
+operations on the counter registers. Caching of control
+register contents is done to avoid writing them. "Suspend CPU"
+is recorded for i-mode counters to avoid writing the counter
+registers when the counters are resumed (their control
+registers must be written at both suspend and resume, however).
+
+Some processors are unable to stop the counters (Centaur/VIA),
+and some are unable to reinitialise them to arbitrary values (P6).
+Storing the counters' total counts in the hardware counters
+would break as soon as context-switches occur. This is another
+reason why the accumulate-differences method for maintaining the
+counter values is used.
+
+Intel P5
+--------
+The hardware stores both counters' control data in a single
+control register, the CESR MSR. The evntsel values are
+limited to 16 bits each, and are combined by the low-level
+driver to form the value for the CESR. Apart from that,
+the evntsel values are direct images of the CESR.
+
+Bits 0xFE00 in an evntsel value are reserved.
+At least one evntsel CPL bit (0x00C0) must be set.
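The Intel P5 rules above can be sketched as a check followed by a combine step. The `sketch_` names are hypothetical, and placing counter 0's control data in the low half of the CESR image is an assumption of this sketch.

```c
#include <stdint.h>

#define SKETCH_P5_RESERVED	0xFE00u	/* reserved evntsel bits */
#define SKETCH_P5_CPL		0x00C0u	/* CPL bits; at least one must be set */

/* Returns 0 if the evntsel value satisfies the Intel P5 rules, else -1. */
static int sketch_p5_check_evntsel(uint32_t evntsel)
{
	if (evntsel & ~0xFFFFu)			/* limited to 16 bits */
		return -1;
	if (evntsel & SKETCH_P5_RESERVED)
		return -1;
	if (!(evntsel & SKETCH_P5_CPL))
		return -1;
	return 0;
}

/* Combine two 16-bit evntsel values into one CESR image.
   Assumption: counter 0's control data occupies the low half. */
static uint32_t sketch_p5_mk_cesr(uint32_t evntsel0, uint32_t evntsel1)
{
	return (evntsel0 & 0xFFFFu) | ((evntsel1 & 0xFFFFu) << 16);
}
```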
+
+For Cyrix' P5 clones, evntsel bits 0xFA00 are reserved.
+
+For Centaur's P5 clones, evntsel bits 0xFF00 are reserved.
+It has no CPL bits to set. The TSC is broken and cannot be used.
+
+Intel P6
+--------
+The evntsel values are mapped directly onto the counters'
+EVNTSEL control registers.
+
+The global enable bit (22) in EVNTSEL0 must be set. That bit is
+reserved in EVNTSEL1.
+
+Bits 21 and 19 (0x00280000) in each evntsel are reserved.
+
+For an i-mode counter, bit 20 (0x00100000) of its evntsel must be
+set. For a-mode counters, that bit must not be set.
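The Intel P6 evntsel rules above can be collected into one validation routine. This is a sketch under stated assumptions: the `sketch_` names and the -1/0 return convention are hypothetical, and `idx` is taken to be the index (0 or 1) of the EVNTSEL register being programmed.

```c
#include <stdint.h>

#define SKETCH_P6_ENABLE	0x00400000u	/* bit 22: global enable */
#define SKETCH_P6_RESERVED	0x00280000u	/* bits 21 and 19 */
#define SKETCH_P6_INT		0x00100000u	/* bit 20: interrupt on overflow */

/* Returns 0 if evntsel is acceptable for EVNTSEL<idx>, else -1.
   is_imode is non-zero for an i-mode counter. */
static int sketch_p6_check_evntsel(unsigned int idx, uint32_t evntsel,
				   int is_imode)
{
	if (evntsel & SKETCH_P6_RESERVED)
		return -1;
	if (idx == 0) {
		/* the global enable bit must be set in EVNTSEL0 */
		if (!(evntsel & SKETCH_P6_ENABLE))
			return -1;
	} else if (evntsel & SKETCH_P6_ENABLE) {
		/* the same bit is reserved in EVNTSEL1 */
		return -1;
	}
	if (is_imode) {
		if (!(evntsel & SKETCH_P6_INT))
			return -1;
	} else if (evntsel & SKETCH_P6_INT) {
		return -1;
	}
	return 0;
}
```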
+
+Hardware quirk: Counters are 40 bits wide, but writing to a
+counter only writes the low 32 bits: remaining bits are
+sign-extended from bit 31.
+
+AMD K7/K8
+---------
+Similar to Intel P6. The main difference is that each evntsel has
+its own enable bit, which must be set.
+
+VIA C3
+------
+Superficially similar to Intel P6, but only PERFCTR1/EVNTSEL1
+are programmable. pmc_map[0] must be 1, if nractrs == 1.
+
+Bits 0xFFFFFE00 in the evntsel are reserved. There are no auxiliary
+control bits to set.
+
+Generic
+-------
+Only permits TSC sampling, with tsc_on == 1 and nractrs == nrictrs == 0
+in the control data.
+
+Intel P4
+--------
+For each counter, its evntsel[] value is mapped onto its CCCR
+control register, and its p4.escr[] value is mapped onto its
+associated ESCR control register.
+
+The ESCR register number is computed from the hardware counter
+number (from pmc_map[]) and the ESCR SELECT field in the CCCR,
+and is cached in p4_escr_map[].
+
+pmc_map[] contains the value to pass to RDPMC when reading the
+counter. It is strongly recommended to set bit 31 (fast rdpmc).
+
+In each evntsel/CCCR value:
+- the OVF, OVF_PMI_T1 and hardware-reserved bits (0xB80007FF)
+  are reserved and must not be set
+- bit 11 (EXTENDED_CASCADE) is only permitted on P4 models >= 2,
+  and for counters 12 and 15-17
+- bits 16 and 17 (ACTIVE_THREAD) must both be set on non-HT processors
+- at least one of bits 12 (ENABLE), 30 (CASCADE), or 11 (EXTENDED_CASCADE)
+  must be set
+- bit 26 (OVF_PMI_T0) must be clear for a-mode counters, and set
+  for i-mode counters; if bit 25 (FORCE_OVF) also is set, then
+  the corresponding ireset[] value must be exactly -1
+
+In each p4.escr[] value:
+- bit 31 is reserved and must not be set
+- the CPL_T1 field (bits 0 and 1) must be zero except on HT processors
+  when global-mode counters are used
+- IQ_ESCR0 and IQ_ESCR1 can only be used on P4 models <= 2
+
+PEBS is not supported, but the replay tagging bits in PEBS_ENABLE
+and PEBS_MATRIX_VERT may be used.
+
+If p4.pebs_enable is zero, then p4.pebs_matrix_vert must also be zero.
+
+If p4.pebs_enable is non-zero:
+- only bits 24, 10, 9, 2, 1, and 0 may be set; note that in contrast
+  to Intel's documentation, bit 25 (ENABLE_PEBS_MY_THR) is not needed
+  and must not be set
+- bit 24 (UOP_TAG) must be set
+- at least one of bits 10, 9, 2, 1, or 0 must be set
+- in p4.pebs_matrix_vert, all bits except 1 and 0 must be clear,
+  and at least one of bits 1 and 0 must be set
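The PEBS_ENABLE and PEBS_MATRIX_VERT constraints above translate into a short validation routine. The sketch below follows the listed rules literally; the `sketch_` names and the -1/0 return convention are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PEBS_UOP_TAG	(1u << 24)
#define SKETCH_PEBS_TAG_BITS	((1u << 10) | (1u << 9) | 0x7u)
#define SKETCH_PEBS_ALLOWED	(SKETCH_PEBS_UOP_TAG | SKETCH_PEBS_TAG_BITS)

/* Returns 0 if the pair of values is acceptable, else -1. */
static int sketch_p4_check_pebs(uint32_t pebs_enable,
				uint32_t pebs_matrix_vert)
{
	if (pebs_enable == 0)
		return pebs_matrix_vert == 0 ? 0 : -1;
	if (pebs_enable & ~SKETCH_PEBS_ALLOWED)	/* includes bit 25 */
		return -1;
	if (!(pebs_enable & SKETCH_PEBS_UOP_TAG))
		return -1;
	if (!(pebs_enable & SKETCH_PEBS_TAG_BITS))
		return -1;
	if (pebs_matrix_vert & ~0x3u)		/* only bits 1 and 0 allowed */
		return -1;
	if (!(pebs_matrix_vert & 0x3u))		/* at least one must be set */
		return -1;
	return 0;
}
```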
+
+Implementation Notes
+====================
+
+Caching
+-------
+Each 'struct perfctr_cpu_state' contains two cache-related fields:
+- 'id': a unique identifier for the control data contents
+- 'isuspend_cpu': the identity of the CPU on which a state containing
+  interrupt-mode counters was last suspended
+
+To this the driver adds a per-CPU cache, recording:
+- the 'id' of the control data currently in that CPU
+- the current contents of each control register
+
+When perfctr_cpu_update_control() has validated the new control data,
+it also updates the id field.
+
+The driver's internal 'write_control' function, called from the
+perfctr_cpu_resume() API function, first checks if the state's id
+matches that of the CPU's cache, and if so, returns. Otherwise
+it checks each control register in the state and updates those
+that do not match the cache. Finally, it writes the state's id
+to the cache. Tests on various x86 processor types have shown that
+MSR writes are very expensive: the purpose of these cache checks
+is to avoid MSR writes whenever possible.
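The cache checks performed by 'write_control' can be sketched as below. This is an illustration only: the structures, the register count, and the `sketch_wrmsr()` stand-in (which merely counts simulated writes) are hypothetical, and it assumes that an id of 0 never identifies valid control data.

```c
#include <stdint.h>

#define SKETCH_NREGS 4	/* hypothetical number of control registers */

/* Hypothetical per-CPU cache. */
struct sketch_cpu_cache {
	unsigned int id;		/* id of the control data now in the CPU */
	uint32_t regs[SKETCH_NREGS];	/* current control register contents */
};

/* Hypothetical slice of the state: its id and control register images. */
struct sketch_state {
	unsigned int id;
	uint32_t regs[SKETCH_NREGS];
};

static unsigned int sketch_nr_writes;	/* counts simulated MSR writes */

/* Stand-in for an expensive MSR write. */
static void sketch_wrmsr(unsigned int reg, uint32_t val)
{
	(void)reg;
	(void)val;
	++sketch_nr_writes;
}

static void sketch_write_control(struct sketch_cpu_cache *cache,
				 const struct sketch_state *state)
{
	unsigned int i;

	if (cache->id == state->id)	/* same control data: nothing to do */
		return;
	for (i = 0; i < SKETCH_NREGS; ++i) {
		if (cache->regs[i] != state->regs[i]) {
			cache->regs[i] = state->regs[i];
			sketch_wrmsr(i, state->regs[i]);
		}
	}
	cache->id = state->id;		/* finally, record the new id */
}
```

Switching between two states that differ in a single register then costs one write, and resuming the same state again costs none.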
+
+Unlike accumulation-mode counters, interrupt-mode counters must be
+physically stopped when suspended, primarily to avoid overflow
+interrupts in contexts not expecting them, and secondarily to avoid
+increments to the counters themselves (see below).
+
+When suspending interrupt-mode counters, the driver:
+- records the CPU identity in the state's 'isuspend_cpu' field
+- stops each interrupt-mode counter by disabling its control register
+- lets the cache and state id values remain the same
+
+Later, when resuming interrupt-mode counters, the driver:
+- if the state and cache id values match:
+  * the cache id is cleared, to force a reload of the control
+    registers stopped at suspend (see below)
+  * if the state's "suspend" CPU identity matches the current CPU,
+    the counter registers are still valid, and the procedure returns
+- if the procedure did not return above, it then loops over each
+  interrupt-mode counter:
+  * the counter's control register is physically disabled, unless
+    the cache indicates that it already is disabled; this is necessary
+    to prevent premature events and overflow interrupts if the CPU's
+    registers previously belonged to some other state
+  * then the counter register itself is restored
+After this interrupt-mode specific resume code is complete, the
+driver continues by calling 'write_control' as described above.
+The state and cache ids will not match, forcing write_control to
+reload the disabled interrupt-mode control registers.
+
+Call-site Backpatching
+----------------------
+The x86 family of processors is quite diverse in how their
+performance counters work and are accessed. There are three
+main designs (P5, P6, and P4) with several variations.
+To handle this the processor type detection and initialisation
+code sets up a number of function pointers to point to the
+correct procedures for the actual CPU type.
+
+Calls via function pointers are more expensive than direct calls,
+so the driver actually performs direct calls to wrappers that
+backpatch the original call sites to instead call the actual
+CPU-specific functions in the future.
+
+Unsynchronised code backpatching in SMP systems doesn't work
+on Intel P6 processors due to an erratum, so the driver performs
+a "finalise backpatching" step after the CPU-specific function
+pointers have been set up. This step invokes the API procedures
+on a temporary state object, set up to force every backpatchable
+call site to be invoked and adjusted.
+
+Several low-level API procedures are called in the context-switch
+path by the per-process perfctrs kernel extension, which motivates
+the efforts to reduce runtime overheads as much as possible.
+
+Overflow Interrupts
+-------------------
+The x86 hardware enables overflow interrupts via the local
+APIC's LVTPC entry, which is only present in P6/K7/K8/P4.
+
+The low-level driver supports overflow interrupts as follows:
+- It reserves a local APIC vector, 0xee, as LOCAL_PERFCTR_VECTOR.
+- It adds a local APIC exception handler to entry.S, which
+  invokes the driver's smp_perfctr_interrupt() procedure.
+- It adds code to i8259.c to bind the LOCAL_PERFCTR_VECTOR
+  interrupt gate to the exception handler in entry.S.
+- During processor type detection, it records whether the
+  processor supports the local APIC, and sets up function pointers
+  for the suspend and resume operations on interrupt-mode counters.
+- When the low-level driver is activated, it enables overflow
+  interrupts by writing LOCAL_PERFCTR_VECTOR to each CPU's APIC_LVTPC.
+- Overflow interrupts now end up in smp_perfctr_interrupt(), which
+  ACKs the interrupt and invokes the interrupt handler installed
+  by the high-level service/driver.
+- When the low-level driver is deactivated, it disables overflow
+  interrupts by masking APIC_LVTPC in each CPU. It then releases
+  the local APIC back to the NMI watchdog.
+
+At compile-time, the low-level driver indicates overflow interrupt
+support by enabling CONFIG_PERFCTR_INTERRUPT_SUPPORT. If the feature
+is also available at runtime, it sets the PERFCTR_FEATURE_PCINT flag
+in the perfctr_info object.
diff -puN /dev/null Documentation/perfctr/overview.txt
--- /dev/null	Thu Apr 11 07:25:15 2002
+++ 25-akpm/Documentation/perfctr/overview.txt	Mon Aug 16 15:57:33 2004
@@ -0,0 +1,129 @@
+$Id: perfctr-documentation-update.patch,v 1.1 2004/07/12 05:41:57 akpm Exp $
+
+AN OVERVIEW OF PERFCTR
+======================
+The perfctr package adds support to the Linux kernel for using
+the performance-monitoring counters found in many processors.
+
+Perfctr is internally organised in three layers:
+
+- The low-level drivers, one for each supported architecture.
+  Currently there are two, one for 32- and 64-bit x86 processors,
+  and one for 32-bit PowerPC processors.
+
+  low-level-api.txt documents the model of the performance counters
+  used in this package, and the internal API to the low-level drivers.
+
+  low-level-{x86,ppc32}.txt provide documentation specific for those
+  architectures and their low-level drivers.
+
+- The high-level services.
+  There is currently one, a kernel extension adding support for
+  virtualised per-process performance counters.
+  See virtual.txt for documentation on this kernel extension.
+
+  [There used to be a second high-level service, a simple driver
+  to control and access all performance counters in all processors.
+  This driver is currently removed, pending an acceptable new API.]
+
+- The top-level, which performs initialisation and implements
+  common procedures and system calls.
+
+Rationale
+---------
+The perfctr package solves three problems:
+
+- Hardware invariably restricts programming of the performance
+  counter registers to kernel-level code, and sometimes also
+  restricts reading the counters to kernel-level code.
+
+  Perfctr adds APIs allowing user-space code to access the counters.
+  In the case of the per-process counters kernel extension,
+  even non-privileged processes are allowed access.
+
+- Hardware often limits the precision of the hardware counters,
+  making them unsuitable for storing total event counts.
+
+  The counts are instead maintained as 64-bit values in software,
+  with the hardware counters used to derive increments over given
+  time periods.
+
+- In a non-modified kernel, the thread state does not include the
+  performance monitoring counters, and the context switch code
+  does not save and restore them. In this situation the counters
+  are system-wide, making them unreliable and inaccurate when used
+  for monitoring specific processes or specific segments of code.
+
+  The per-process counters kernel extension treats the counter state as
+  part of the thread state, solving the reliability and accuracy problems.
+
+Non-goals
+---------
+Providing high-level interfaces that abstract and hide the
+underlying hardware is a non-goal. Such abstractions can
+and should be implemented in user-space, for several reasons:
+
+- The complexity and variability of the hardware means that
+  any abstraction would be inaccurate. There would be both
+  loss of functionality, and presence of functionality which
+  isn't supportable on any given processor. User-space tools
+  and libraries can implement this, on top of the processor-
+  specific interfaces provided by the kernel.
+
+- The implementation of such an abstraction would be large
+  and complex. (Consider ESCR register assignment on P4.)
+  Performing complex actions in user-space simplifies the
+  kernel, allowing it to concentrate on validating control
+  data, managing processes, and driving the hardware.
+  (Cf. the role of compilers.)
+
+- The abstraction is purely a user-convenience thing. The
+  kernel-level components have no need for it.
+
+Common System Calls
+===================
+This section lists the system calls that are not tied to
+a specific high-level service/driver.
+
+Querying CPU and Driver Information
+-----------------------------------
+int err = sys_perfctr_info(struct perfctr_info *info,
+			   struct perfctr_cpu_mask *cpus,
+			   struct perfctr_cpu_mask *forbidden);
+
+This operation retrieves information from the kernel about
+the processors in the system.
+
+If non-NULL, '*info' will be updated with information about the
+capabilities of the processor and the low-level driver.
+
+If non-NULL, '*cpus' will be updated with a bitmask listing the
+set of processors in the system. The size of this bitmask is not
+statically known, so the protocol is:
+
+1. User-space initialises cpus->nrwords to the number of elements
+   allocated for cpus->mask[].
+2. The kernel reads cpus->nrwords, and then writes the required
+   number of words to cpus->nrwords.
+3. If the required number of words is greater than the original
+   value of cpus->nrwords, then an EOVERFLOW error is signalled.
+4. Otherwise, the kernel converts its internal cpumask_t value
+   to the external format and writes that to cpus->mask[].
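+
+The protocol can be exercised with a small stand-alone sketch. This is
+not the kernel's code: mock_get_cpus() is a hypothetical stand-in for
+the kernel side (it returns a negative errno value instead of setting
+errno), the struct layout is assumed from the description above, and
+EOVERFLOW is assumed to be signalled when the caller's array is too
+small for the mask. User-space first probes with nrwords set to zero,
+then retries with an array of the size that was written back:
+
+	#include <errno.h>
+	#include <stdio.h>
+	#include <stdlib.h>
+
+	/* Assumed layout: a word count followed by 32-bit mask words. */
+	struct perfctr_cpu_mask {
+		unsigned int nrwords;
+		unsigned int mask[];
+	};
+
+	/* Hypothetical kernel side: a system with 34 processors. */
+	static int mock_get_cpus(struct perfctr_cpu_mask *cpus)
+	{
+		static const unsigned int kmask[] = { 0xffffffffu, 0x3u };
+		unsigned int have = cpus->nrwords;	/* caller's size */
+		unsigned int i;
+
+		cpus->nrwords = 2;		/* write required size */
+		if (have < 2)
+			return -EOVERFLOW;	/* array too small */
+		for (i = 0; i < 2; i++)
+			cpus->mask[i] = kmask[i];
+		return 0;
+	}
+
+	int main(void)
+	{
+		struct perfctr_cpu_mask probe;
+		struct perfctr_cpu_mask *cpus;
+		unsigned int i;
+
+		/* First call: no room for mask words, learn the size. */
+		probe.nrwords = 0;
+		if (mock_get_cpus(&probe) != -EOVERFLOW)
+			return 1;
+
+		/* Second call: allocate exactly what was asked for. */
+		cpus = malloc(sizeof(*cpus) +
+			      probe.nrwords * sizeof(cpus->mask[0]));
+		if (!cpus)
+			return 1;
+		cpus->nrwords = probe.nrwords;
+		if (mock_get_cpus(cpus) != 0)
+			return 1;
+		for (i = 0; i < cpus->nrwords; i++)
+			printf("mask[%u] = 0x%08x\n", i, cpus->mask[i]);
+		free(cpus);
+		return 0;
+	}
+
+Because the required size is written back even on failure, at most
+two calls are needed.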
+
+If non-NULL, '*forbidden' will be updated with a bitmask listing
+the set of processors in the system on which users must not try
+to use performance counters. This is currently only relevant for
+hyper-threaded Pentium 4/Xeon systems. The protocol is the same
+as for '*cpus'.
+
+Notes:
+- The internal representation of a cpumask_t is as an array of
+  unsigned long. This representation is unsuitable for user-space,
+  because it is not binary-compatible between 32- and 64-bit
+  variants of a big-endian processor. The 'struct perfctr_cpu_mask'
+  type uses an array of unsigned 32-bit integers.
+- The protocol for retrieving a 'struct perfctr_cpu_mask' was
+  designed to allow user-space to quickly determine the correct
+  size of the 'mask[]' array. Other system calls use weaker protocols,
+  which force user-space to guess increasingly large values in a
+  loop until an acceptable value is found.
diff -puN /dev/null Documentation/perfctr/virtual.txt
--- /dev/null	Thu Apr 11 07:25:15 2002
+++ 25-akpm/Documentation/perfctr/virtual.txt	Mon Aug 16 15:57:33 2004
@@ -0,0 +1,355 @@
+$Id: perfctr-documentation-update.patch,v 1.1 2004/07/12 05:41:57 akpm Exp $
+
+VIRTUAL PER-PROCESS PERFORMANCE COUNTERS
+========================================
+This document describes the virtualised per-process performance
+counters kernel extension. See "General Model" in low-level-api.txt
+for the model of the processor's performance counters.
+
+Contents
+========
+- Summary
+- Design & Implementation Notes
+  * State
+  * Thread Management Hooks
+  * Synchronisation Rules
+  * The Pseudo File System
+- API For User-Space
+  * Opening/Creating the State
+  * Updating the Control
+  * Unlinking the State
+  * Reading the State
+  * Resuming After Handling Overflow Signal
+  * Reading the Counter Values
+- Limitations / TODO List
+
+Summary
+=======
+The virtualised per-process performance counters facility
+(virtual perfctrs) is a kernel extension which extends the
+thread state to record perfctr settings and values, and augments
+the context-switch code to save perfctr values at suspends and
+restore them at resumes. This "virtualises" the performance
+counters in much the same way as the kernel already virtualises
+general-purpose and floating-point registers.
+
+Virtual perfctrs also adds an API allowing non-privileged
+user-space processes to set up and access their perfctrs.
+
+As this facility is primarily intended to support developers
+of user-space code, both virtualisation and allowing access
+from non-privileged code are essential features.
+
+Design & Implementation Notes
+=============================
+
+State
+-----
+The state of a thread's perfctrs is packaged up in an object of
+type 'struct vperfctr'. It consists of CPU-dependent state, a
+sampling timer, and some auxiliary administrative data. This is
+an independent object, with its own lifetime and access rules.
+
+The state object is attached to the thread via a pointer in its
+thread_struct. While attached, the object records the identity
+of its owner thread: this is used for user-space API accesses
+from threads other than the owner.
+
+The state is separate from the thread_struct for several reasons:
+- It's potentially large, hence it's allocated only when needed.
+- It can outlive its owner thread. The state can be opened as
+  a pseudo file: as long as that file is live, so is the object.
+- It can be mapped, via mmap() on the pseudo file's descriptor.
+  To facilitate this, a full page is allocated and reserved.
+
+Thread Management Hooks
+-----------------------
+Virtual perfctrs hooks into several thread management events:
+
+- exit_thread(): Calls perfctr_exit_thread() to stop the counters
+  and detach the thread's vperfctr object.
+
+- copy_thread(): Calls perfctr_copy_thread() to initialise
+  the child's vperfctr pointer. Currently the settings are
+  not inherited from parent to child, so the pointer is set
+  to NULL in the child's thread_struct.
+
+- switch_to():
+  * Calls perfctr_suspend_thread() on the previous thread, to
+    suspend its counters.
+  * Calls perfctr_resume_thread() on the next thread, to resume
+    its counters. Also resets the sampling timer (see below).
+
+- update_process_times(): Calls perfctr_sample_thread(), which
+  decrements the sampling timer and samples the counters if the
+  timer reaches zero.
+
+  Sampling is normally only done at switch_to(), but if too much
+  time passes before the next switch_to(), a hardware counter may
+  increment by more than its range (usually 2^32). If this occurs,
+  the difference from its start value will be incorrect, causing
+  its updated sum to also be incorrect. The sampling timer is used
+  to prevent this problem, which has been observed on SMP machines,
+  and on high clock frequency UP machines.
+
+- set_cpus_allowed(): Calls perfctr_set_cpus_allowed() to detect
+  attempts to migrate the thread to a "forbidden" CPU, in which
+  case a flag in the vperfctr object is set. perfctr_resume_thread()
+  checks this flag, and if set, marks the counters as stopped and
+  sends a SIGILL to the thread.
+
+  The notion of forbidden CPUs is a workaround for a design flaw
+  in hyper-threaded Pentium 4s and Xeons. See low-level-x86.txt
+  for details.
+
+To reduce overheads, these hooks are implemented as inline functions
+that check if the thread is using perfctrs before calling the code
+that implements the behaviour. The hooks also reduce to no-ops if
+CONFIG_PERFCTR_VIRTUAL is disabled.
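+
+The arithmetic behind the sampling timer can be illustrated with a
+stand-alone sketch. It assumes a 32-bit hardware counter and uses
+hypothetical names (struct pmc, sample()); it is not the driver's code.
+A single wrap-around between two samples is absorbed by the unsigned
+subtraction, but a second wrap would be silently lost, which is what
+periodic sampling is meant to prevent:
+
+	#include <stdint.h>
+	#include <stdio.h>
+
+	struct pmc {
+		uint32_t start;	/* hardware value at the last sample */
+		uint64_t sum;	/* 64-bit software total */
+	};
+
+	/* Fold the events since the last sample into the sum. */
+	static void sample(struct pmc *p, uint32_t now)
+	{
+		p->sum += (uint32_t)(now - p->start);	/* modulo 2^32 */
+		p->start = now;
+	}
+
+	int main(void)
+	{
+		struct pmc p = { 0xfffffff0u, 0 };
+
+		sample(&p, 0x00000010u);	/* counter wrapped once */
+		printf("%llu\n", (unsigned long long)p.sum); /* 32 */
+		return 0;
+	}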
+
+Synchronisation Rules
+---------------------
+There are four types of accesses to a thread's perfctr state:
+
+1. Thread management events (see above) done by the thread itself.
+   Suspend, resume, and sample are lock-less.
+
+2. API operations done by the thread itself.
+   These are lock-less, except when an individual operation
+   has specific synchronisation needs. For instance, preemption
+   is often disabled to prevent accesses due to context switches.
+
+3. API operations done by a different thread ("monitor thread").
+   The owner thread must be suspended for the duration of the operation.
+   This is ensured by requiring that the monitor thread is ptrace()ing
+   the owner thread, and that the owner thread is in TASK_STOPPED state.
+
+4. set_cpus_allowed().
+   The kernel does not lock the target during set_cpus_allowed(),
+   so it can execute concurrently with the owner thread or with
+   some monitor thread. In particular, the state may be deallocated.
+
+   To solve this problem, both perfctr_set_cpus_allowed() and the
+   operations that can change the owner thread's perfctr pointer
+   (creat, unlink, exit) perform a task_lock() on the owner thread
+   before accessing the perfctr pointer.
+
+   When concurrent set_cpus_allowed() isn't a problem (because the
+   architecture doesn't have a notion of forbidden CPUs), atomicity
+   of updates to the thread's perfctr pointer is ensured by disabling
+   preemption.
+
+The Pseudo File System
+----------------------
+The perfctr state is accessed from user-space via a file descriptor.
+
+The main reason for this is to enable mmap() on the file descriptor,
+which gives read-only access to the state.
+
+The file descriptor is a handle to the perfctr state object. This
+allows a very simple implementation of the user-space 'perfex'
+program, which runs another program with given perfctr settings
+and reports their final values. Without this handle, monitoring
+applications like perfex would have to be implemented like debuggers
+in order to catch the target thread's exit and retrieve the counter
+values before the exit completes and the state disappears.
+
+The file for a perfctr state object belongs to the vperfctrs pseudo
+file system. Files in this file system support only a few operations:
+- mmap()
+- release() decrements the perfctr object's reference count and
+  deallocates the object when no references remain
+- the listing of a thread's open file descriptors identifies
+  perfctr state file descriptors as belonging to "vperfctrfs"
+The implementation is based on the code for pipefs.
+
+In previous versions of the perfctr package, the file descriptors
+for perfctr state objects also supported the API's ioctl() method.
+
+API For User-Space
+==================
+
+Opening/Creating the State
+--------------------------
+int fd = sys_vperfctr_open(int tid, int creat);
+
+'tid' must be the id of a thread, or 0 which is interpreted as an
+alias for the current thread.
+
+This operation returns an open file descriptor which is a handle
+on the thread's perfctr state object.
+
+If 'creat' is non-zero and the object did not exist, then it is
+created and attached to the thread. The newly created state object
+is inactive, with all control fields disabled and all counters
+having the value zero. If 'creat' is non-zero and the object
+already existed, then an EEXIST error is signalled.
+
+If 'tid' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Notes:
+- The access rule in the non-self case is the same as for the
+  ptrace() system call. It ensures that no other thread, including
+  the target thread itself, can access or change the target thread's
+  perfctr state during the operation.
+- An open file descriptor for a perfctr state object counts as a
+  reference to that object; even if detached from its thread the
+  object will not be deallocated until the last reference is gone.
+- The file descriptor can be passed to mmap(), for low-overhead
+  counter sampling. See "Reading the Counter Values" for details.
+- The file descriptor can be passed to another thread. Accesses
+  from threads other than the owner are permitted as long as they
+  possess the file descriptor and use ptrace() for synchronisation.
+
+Updating the Control
+--------------------
+int err = sys_vperfctr_control(int fd, const struct vperfctr_control *control);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+The perfctr object must still be attached to its owner thread.
+
+This operation stops and samples any currently running counters in
+the thread, and then updates the control settings. If the resulting
+state has any enabled counters, then the counters are restarted.
+
+Before restarting, the counter sums are reset to zero. However,
+if a counter's bit is set in the control object's 'preserve'
+bitmask field, then that counter's sum is not reset. The TSC's
+sum is only reset if the TSC is disabled in the new state.
+
+If any of the programmable counters are enabled, then the thread's
+CPU affinity mask is adjusted to exclude the set of forbidden CPUs.
+
+If the control data activates any interrupt-mode counters, then
+a signal (specified by the 'si_signo' control field) will be sent
+to the owner thread after an overflow interrupt. The documentation
+for sys_vperfctr_iresume() describes this mechanism.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+The perfctr state object denoted by 'fd' must still be attached
+to its owner thread.
+
+Notes:
+- It is strongly recommended to memset() the vperfctr_control object
+  to all-bits-zero before setting the fields of interest.
+- Stopping the counters is done by invoking the control operation
+  with a control object that activates neither the TSC nor any PMCs.
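+
+A minimal sketch of both notes, using stand-in definitions of the
+control object. Only 'si_signo' and 'preserve' are named in this
+document; the 'cpu_control' member and its 'tsc_on' and 'nractrs'
+fields are assumptions made for illustration, and the real layout is
+architecture-specific:
+
+	#include <stdio.h>
+	#include <string.h>
+
+	/* Stand-in types; the real ones come from the perfctr headers. */
+	struct perfctr_cpu_control {
+		unsigned int tsc_on;
+		unsigned int nractrs;
+	};
+	struct vperfctr_control {
+		int si_signo;
+		unsigned int preserve;
+		struct perfctr_cpu_control cpu_control;
+	};
+
+	int main(void)
+	{
+		struct vperfctr_control start, stop;
+
+		/* Clear everything, then set the fields of interest. */
+		memset(&start, 0, sizeof(start));
+		start.cpu_control.tsc_on = 1;
+
+		/* An all-zero object enables nothing: a "stop" request. */
+		memset(&stop, 0, sizeof(stop));
+
+		printf("start: tsc_on=%u nractrs=%u\n",
+		       start.cpu_control.tsc_on, start.cpu_control.nractrs);
+		printf("stop:  tsc_on=%u nractrs=%u\n",
+		       stop.cpu_control.tsc_on, stop.cpu_control.nractrs);
+		return 0;
+	}
+
+Either object would then be passed to sys_vperfctr_control().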
+
+Unlinking the State
+-------------------
+int err = sys_vperfctr_unlink(int fd);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+
+This operation stops and samples the thread's counters, and then
+detaches the perfctr state object from the thread. If the object
+had already been detached, then no action is performed.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Reading the State
+-----------------
+int err = sys_vperfctr_read(int fd, struct perfctr_sum_ctrs *sum,
+			    struct vperfctr_control *control);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+
+This operation copies data from the perfctr state object to
+user-space. If 'sum' is non-NULL, then the counter sums are
+written to it. If 'control' is non-NULL, then the control data
+is written to it.
+
+If the perfctr state object is attached to the current thread,
+then the counters are sampled and updated first.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Notes:
+- An alternative and faster way to retrieve the counter sums is described
+  in "Reading the Counter Values". This system call can be used if the
+  hardware does not permit user-space reads of the counters.
+
+Resuming After Handling Overflow Signal
+---------------------------------------
+int err = sys_vperfctr_iresume(int fd);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+The perfctr object must still be attached to its owner thread.
+
+When an interrupt-mode counter has overflowed, the counters
+are sampled and suspended (TSC remains active). Then a signal,
+as specified by the 'si_signo' control field, is sent to the
+owner thread: the associated 'struct siginfo' has 'si_code'
+equal to 'SI_PMC_OVF', and 'si_pmc_ovf_mask' equal to the set
+of overflowed counters.
+
+The counters are suspended to avoid generating new performance
+counter events during the execution of the signal handler, but
+the previous settings are saved. Calling sys_vperfctr_iresume()
+restores the previous settings and resumes the counters. Doing
+this is optional.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Reading the Counter Values
+--------------------------
+The value of a counter is computed from three components:
+
+	value = sum + (now - start);
+
+Two of these (sum and start) reside in the kernel's state object,
+and the third (now) is the contents of the hardware counter.
+To perform this computation in user-space requires access to
+the state object. This is achieved by passing the file descriptor
+from sys_vperfctr_open() to mmap():
+
+	volatile const struct vperfctr_state *kstate;
+	kstate = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
+
+Reading the three components is a non-atomic operation. If the
+thread is suspended and resumed during the operation, the values
+will not be consistent and the wrong result will be computed.
+To detect this situation, user-space should check the kernel
+state's TSC start value, which changes at each resume, before and
+after the operation, and retry the operation in case of a mismatch.
+
+The algorithm for retrieving the value of counter 'i' is:
+
+	tsc0 = kstate->cpu_state.tsc_start;
+	for(;;) {
+		rdpmcl(kstate->cpu_state.pmc[i].map, now);
+		start = kstate->cpu_state.pmc[i].start;
+		sum = kstate->cpu_state.pmc[i].sum;
+		tsc1 = kstate->cpu_state.tsc_start;
+		if (likely(tsc1 == tsc0))
+			break;
+		tsc0 = tsc1;
+	}
+	return sum + (now - start);
+
+The algorithm for retrieving the value of the TSC is similar,
+as is the algorithm for retrieving the values of all counters.
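+
+The all-counters variant can be sketched as a stand-alone simulation.
+Nothing here touches real hardware: read_hw() is a hypothetical
+stand-in for rdpmcl(), the state layout is simplified, and the first
+read_hw() call fakes a context switch (a suspend, then a resume with
+reprogrammed counters and a new TSC start value) so that the retry
+path is taken:
+
+	#include <stdint.h>
+	#include <stdio.h>
+
+	#define NR 2
+
+	struct cpu_state {
+		uint32_t tsc_start;
+		struct { uint32_t start; uint64_t sum; } pmc[NR];
+	};
+
+	static struct cpu_state kstate =
+		{ 100, { { 1000, 5 }, { 2000, 7 } } };
+	static uint32_t hw[NR] = { 1010, 2020 };  /* fake registers */
+
+	/* Read fake counter i; the first call also fakes a switch. */
+	static uint32_t read_hw(int i)
+	{
+		static int switched;
+		uint32_t now = hw[i];
+		int j;
+
+		if (!switched) {
+			switched = 1;
+			for (j = 0; j < NR; j++) {
+				/* suspend: fold the delta into the sum */
+				kstate.pmc[j].sum +=
+					hw[j] - kstate.pmc[j].start;
+				/* resume: counter restarts from zero */
+				hw[j] = kstate.pmc[j].start = 0;
+			}
+			kstate.tsc_start = 200;
+		}
+		return now;
+	}
+
+	static void read_all(uint64_t value[NR])
+	{
+		uint32_t tsc0 = kstate.tsc_start, tsc1, now;
+		int i;
+
+		for (;;) {
+			for (i = 0; i < NR; i++) {
+				now = read_hw(i);
+				value[i] = kstate.pmc[i].sum +
+				    (uint32_t)(now - kstate.pmc[i].start);
+			}
+			tsc1 = kstate.tsc_start;
+			if (tsc1 == tsc0)
+				break;		/* consistent snapshot */
+			tsc0 = tsc1;		/* state changed: retry */
+		}
+	}
+
+	int main(void)
+	{
+		uint64_t value[NR];
+
+		read_all(value);
+		printf("%llu %llu\n", (unsigned long long)value[0],
+		       (unsigned long long)value[1]);	/* 15 27 */
+		return 0;
+	}
+
+The first pass mixes a pre-switch hardware value with post-switch
+state and computes 1025 for counter 0; the TSC start mismatch causes
+that pass to be discarded. Real code would read the state through the
+volatile mapping shown earlier.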
+
+Notes:
+- Since the state's TSC time-stamps are used, the algorithm requires
+  that user-space enables TSC sampling.
+- The algorithm requires that the hardware allows user-space reads
+  of the counter registers. If this property isn't statically known
+  for the architecture, user-space should retrieve the kernel's
+  'struct perfctr_info' object and check that the PERFCTR_FEATURE_RDPMC
+  flag is set.
+
+Limitations / TODO List
+=======================
+- Perfctr settings are not inherited from parent to child at fork().
+  The issue is not fork() but propagating final counts from children
+  to parents, and allowing user-space to distinguish "self" counts
+  from "children" counts.
+  An implementation of this feature is being planned.
+- Buffering of overflow samples is not implemented. So far, not a
+  single user has requested it.
_