From: Ingo Molnar <mingo@elte.hu>

This is the latest version of the scheduler cache-hot-auto-tune patch.

The first problem was that detection time scaled with O(N^2), which is 
unacceptable on larger SMP and NUMA systems. To solve this:

- I've added a 'domain distance' function, which is used to cache 
  measurement results. Each distance is only measured once. This means 
  that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT 
  distances 0 and 1, and on SMP distance 0 is measured. The code walks 
  the domain tree to determine the distance, so it automatically follows 
  whatever hierarchy an architecture sets up. This cuts down on the boot 
  time significantly and removes the O(N^2) limit. The only assumption 
  is that migration costs can be expressed as a function of domain
  distance - this covers the overwhelming majority of existing systems, 
  and is a good guess even for more assymetric systems.

  [ People hacking systems that have assymetries that break this 
    assumption (e.g. different CPU speeds) should experiment a bit with 
    the cpu_distance() function. Adding a ->migration_distance factor to 
    the domain structure would be one possible solution - but lets first 
    see the problem systems, if they exist at all. Lets not overdesign. ]

Another problem was that only a single cache-size was used for measuring 
the cost of migration, and most architectures didnt set that variable 
up. Furthermore, a single cache-size does not fit NUMA hierarchies with 
L3 caches and does not fit HT setups, where different CPUs will often 
have different 'effective cache sizes'. To solve this problem:

- Instead of relying on a single cache-size provided by the platform and
  sticking to it, the code now auto-detects the 'effective migration 
  cost' between two measured CPUs, via iterating through a wide range of
  cachesizes. The code searches for the maximum migration cost, which
  occurs when the working set of the test-workload falls just below the
  'effective cache size'. I.e. real-life optimized search is done for
  the maximum migration cost, between two real CPUs.

  This, amongst other things, has the positive effect hat if e.g. two
  CPUs share a L2/L3 cache, a different (and accurate) migration cost
  will be found than between two CPUs on the same system that dont share
  any caches.

(The reliable measurement of migration costs is tricky - see the source
for details.)

Furthermore i've added various boot-time options to override/tune
migration behavior.

Firstly, there's a blanket override for autodetection:

	migration_cost=1000,2000,3000

will override the depth 0/1/2 values with 1msec/2msec/3msec values.

Secondly, there's a global factor that can be used to increase (or
decrease) the autodetected values:

	migration_factor=120

will increase the autodetected values by 20%. This option is useful to
tune things in a workload-dependent way - e.g. if a workload is
cache-insensitive then CPU utilization can be maximized by specifying
migration_factor=0.

I've tested the autodetection code quite extensively on x86, on 3
P3/Xeon/2MB, and the autodetected values look pretty good:

Dual Celeron (128K L2 cache):

 ---------------------
 migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
 ---------------------
           [00]    [01]
 [00]:     -     1.7(1)
 [01]:   1.7(1)    -
 ---------------------
 cacheflush times [2]: 0.0 (0) 1.7 (1784008)
 ---------------------

Here the slow memory subsystem dominates system performance, and even
though caches are small, the migration cost is 1.7 msecs.

Dual HT P4 (512K L2 cache):

 ---------------------
 migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
 ---------------------
           [00]    [01]    [02]    [03]
 [00]:     -     0.4(1)  0.0(0)  0.4(1)
 [01]:   0.4(1)    -     0.4(1)  0.0(0)
 [02]:   0.0(0)  0.4(1)    -     0.4(1)
 [03]:   0.4(1)  0.0(0)  0.4(1)    -
 ---------------------
 cacheflush times [2]: 0.0 (33900) 0.4 (448514)
 ---------------------

Here it can be seen that there is no migration cost between two HT 
siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory 
system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.

8-way P3/Xeon [2MB L2 cache]:

 ---------------------
 migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
 ---------------------
           [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
 [00]:     -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [01]:  19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [02]:  19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [03]:  19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [04]:  19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1)
 [05]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1)
 [06]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1)
 [07]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -
 ---------------------
 cacheflush times [2]: 0.0 (0) 19.2 (19281756)
 ---------------------

This one has huge caches and a relatively slow memory subsystem - so the 
migration cost is 19 msecs.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 dev/null                              |    0 
 arch/i386/kernel/smpboot.c            |    1 
 arch/ia64/kernel/setup.c              |    6 
 arch/ppc/xmon/xmon.c                  |    2 
 include/asm-i386/topology.h           |    1 
 include/asm-ia64/topology.h           |    2 
 include/asm-mips/mach-ip27/topology.h |    1 
 include/asm-ppc64/topology.h          |    1 
 include/asm-x86_64/topology.h         |    1 
 include/linux/sched.h                 |   11 
 include/linux/topology.h              |    2 
 kernel/sched.c                        |  482 ++++++++++++++++++++++++++++++++++
 12 files changed, 500 insertions(+), 10 deletions(-)

diff -puN arch/i386/kernel/smpboot.c~scheduler-cache-hot-autodetect arch/i386/kernel/smpboot.c
--- devel/arch/i386/kernel/smpboot.c~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/arch/i386/kernel/smpboot.c	2005-09-07 20:11:02.000000000 -0700
@@ -1061,6 +1061,7 @@ static void smp_tune_scheduling (void)
 			cachesize = 16; /* Pentiums, 2x8kB cache */
 			bandwidth = 100;
 		}
+		max_cache_size = cachesize * 1024;
 	}
 }
 
diff -L arch/ia64/kernel/domain.c -puN /dev/null /dev/null
diff -puN arch/ia64/kernel/setup.c~scheduler-cache-hot-autodetect arch/ia64/kernel/setup.c
--- devel/arch/ia64/kernel/setup.c~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/arch/ia64/kernel/setup.c	2005-09-07 20:11:02.000000000 -0700
@@ -647,6 +647,7 @@ static void
 get_max_cacheline_size (void)
 {
 	unsigned long line_size, max = 1;
+	unsigned int cache_size = 0;
 	u64 l, levels, unique_caches;
         pal_cache_config_info_t cci;
         s64 status;
@@ -676,6 +677,8 @@ get_max_cacheline_size (void)
 		line_size = 1 << cci.pcci_line_size;
 		if (line_size > max)
 			max = line_size;
+		if (cache_size < cci.pcci_cache_size)
+			cache_size = cci.pcci_cache_size;
 		if (!cci.pcci_unified) {
 			status = ia64_pal_cache_config_info(l,
 						    /* cache_type (instruction)= */ 1,
@@ -692,6 +695,9 @@ get_max_cacheline_size (void)
 			ia64_i_cache_stride_shift = cci.pcci_stride;
 	}
   out:
+#ifdef CONFIG_SMP
+	max_cache_size = max(max_cache_size, cache_size);
+#endif
 	if (max > ia64_max_cacheline_size)
 		ia64_max_cacheline_size = max;
 }
diff -puN arch/ppc/xmon/xmon.c~scheduler-cache-hot-autodetect arch/ppc/xmon/xmon.c
--- devel/arch/ppc/xmon/xmon.c~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/arch/ppc/xmon/xmon.c	2005-09-07 20:11:02.000000000 -0700
@@ -99,7 +99,7 @@ static void remove_bpts(void);
 static void insert_bpts(void);
 static struct bpt *at_breakpoint(unsigned pc);
 static void bpt_cmds(void);
-static void cacheflush(void);
+void cacheflush(void);
 #ifdef CONFIG_SMP
 static void cpu_cmd(void);
 #endif /* CONFIG_SMP */
diff -puN include/asm-i386/topology.h~scheduler-cache-hot-autodetect include/asm-i386/topology.h
--- devel/include/asm-i386/topology.h~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/include/asm-i386/topology.h	2005-09-07 20:11:02.000000000 -0700
@@ -72,7 +72,6 @@ static inline int node_to_first_cpu(int 
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 1,			\
diff -puN include/asm-ia64/topology.h~scheduler-cache-hot-autodetect include/asm-ia64/topology.h
--- devel/include/asm-ia64/topology.h~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/include/asm-ia64/topology.h	2005-09-07 20:11:02.000000000 -0700
@@ -55,7 +55,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.per_cpu_gain		= 100,			\
 	.cache_nice_tries	= 2,			\
 	.busy_idx		= 2,			\
@@ -81,7 +80,6 @@ void build_cpu_to_node_map(void);
 	.max_interval		= 8*(min(num_online_cpus(), 32)), \
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 2,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 2,			\
diff -puN include/asm-mips/mach-ip27/topology.h~scheduler-cache-hot-autodetect include/asm-mips/mach-ip27/topology.h
--- devel/include/asm-mips/mach-ip27/topology.h~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/include/asm-mips/mach-ip27/topology.h	2005-09-07 20:11:02.000000000 -0700
@@ -24,7 +24,6 @@ extern unsigned char __node_distances[MA
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
diff -puN include/asm-ppc64/topology.h~scheduler-cache-hot-autodetect include/asm-ppc64/topology.h
--- devel/include/asm-ppc64/topology.h~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/include/asm-ppc64/topology.h	2005-09-07 20:11:02.000000000 -0700
@@ -47,7 +47,6 @@ static inline int node_to_first_cpu(int 
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
diff -puN include/asm-x86_64/topology.h~scheduler-cache-hot-autodetect include/asm-x86_64/topology.h
--- devel/include/asm-x86_64/topology.h~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/include/asm-x86_64/topology.h	2005-09-07 20:11:02.000000000 -0700
@@ -37,7 +37,6 @@ extern int __node_distance(int, int);
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 2,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 2,			\
diff -puN include/linux/sched.h~scheduler-cache-hot-autodetect include/linux/sched.h
--- devel/include/linux/sched.h~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/include/linux/sched.h	2005-09-07 20:11:02.000000000 -0700
@@ -566,7 +566,16 @@ struct sched_domain {
 
 extern void partition_sched_domains(cpumask_t *partition1,
 				    cpumask_t *partition2);
-#endif /* CONFIG_SMP */
+
+/*
+ * Maximum cache size the migration-costs auto-tuning code will
+ * search from:
+ */
+extern unsigned int max_cache_size;
+
+extern void calibrate_migration_costs(void);
+
+#endif	/* CONFIG_SMP */
 
 #ifdef CONFIG_DEBUG_PREEMPT
 # define MAX_PREEMPT_TRACE 25
diff -puN include/linux/topology.h~scheduler-cache-hot-autodetect include/linux/topology.h
--- devel/include/linux/topology.h~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/include/linux/topology.h	2005-09-07 20:11:02.000000000 -0700
@@ -86,7 +86,6 @@
 	.max_interval		= 2,			\
 	.busy_factor		= 8,			\
 	.imbalance_pct		= 110,			\
-	.cache_hot_time		= 0,			\
 	.cache_nice_tries	= 0,			\
 	.per_cpu_gain		= 25,			\
 	.busy_idx		= 0,			\
@@ -117,7 +116,6 @@
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.busy_idx		= 2,			\
diff -puN kernel/sched.c~scheduler-cache-hot-autodetect kernel/sched.c
--- devel/kernel/sched.c~scheduler-cache-hot-autodetect	2005-09-07 20:11:02.000000000 -0700
+++ devel-akpm/kernel/sched.c	2005-09-07 20:11:02.000000000 -0700
@@ -33,6 +33,7 @@
 #include <linux/notifier.h>
 #include <linux/profile.h>
 #include <linux/suspend.h>
+#include <linux/vmalloc.h>
 #include <linux/blkdev.h>
 #include <linux/delay.h>
 #include <linux/smp.h>
@@ -4932,7 +4933,484 @@ static void init_sched_build_groups(stru
 
 #define SD_NODES_PER_DOMAIN 16
 
+/*
+ * Self-tuning task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the cost of manipulating buffers of varying
+ * sizes. For a given buffer-size here are the steps that are taken:
+ *
+ * 1) the source CPU reads+dirties a shared buffer
+ * 2) the target CPU reads+dirties the same shared buffer
+ *
+ * We measure how long they take, in the following 4 scenarios:
+ *
+ *  - source: CPU1, target: CPU2 | cost1
+ *  - source: CPU2, target: CPU1 | cost2
+ *  - source: CPU1, target: CPU1 | cost3
+ *  - source: CPU2, target: CPU2 | cost4
+ *
+ * We then calculate the cost3+cost4-cost1-cost2 difference - this is
+ * the cost of migration.
+ *
+ * We then start off from a small buffer-size and iterate up to larger
+ * buffer sizes, in 5% steps - measuring each buffer-size separately, and
+ * doing a maximum search for the cost. (The maximum cost for a migration
+ * normally occurs when the working set size is around the effective cache
+ * size.)
+ */
+#define SEARCH_SCOPE		2
+#define MIN_CACHE_SIZE		(64*1024U)
+#define DEFAULT_CACHE_SIZE	(5*1024*1024U)
+#define ITERATIONS		2
+#define SIZE_THRESH		130
+#define COST_THRESH		130
+
+/*
+ * The migration cost is a function of 'domain distance'. Domain
+ * distance is the number of steps a CPU has to iterate down its
+ * domain tree to share a domain with the other CPU. The farther
+ * two CPUs are from each other, the larger the distance gets.
+ *
+ * Note that we use the distance only to cache measurement results,
+ * the distance value is not used numerically otherwise. When two
+ * CPUs have the same distance it is assumed that the migration
+ * cost is the same. (this is a simplification but quite practical)
+ */
+#define MAX_DOMAIN_DISTANCE 32
+
+static __initdata unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
+		{ [ 0 ... MAX_DOMAIN_DISTANCE-1 ] = -1LL };
+
+/*
+ * Allow override of migration cost - in units of microseconds.
+ * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost
+ * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs:
+ */
+static int __init migration_cost_setup(char *str)
+{
+	int ints[MAX_DOMAIN_DISTANCE+1], i;
+
+	str = get_options(str, ARRAY_SIZE(ints), ints);
+
+	printk("#ints: %d\n", ints[0]);
+	for (i = 1; i <= ints[0]; i++) {
+		migration_cost[i-1] = (unsigned long long)ints[i]*1000;
+		printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]);
+	}
+	return 1;
+}
+
+__setup ("migration_cost=", migration_cost_setup);
+
+/*
+ * Global multiplier (divisor) for migration-cutoff values,
+ * in percentiles. E.g. use a value of 150 to get 1.5 times
+ * longer cache-hot cutoff times.
+ *
+ * (We scale it from 100 to 128 to long long handling easier.)
+ */
+
+#define MIGRATION_FACTOR_SCALE 128
+
+static __initdata unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
+
+static int __init setup_migration_factor(char *str)
+{
+	get_option(&str, &migration_factor);
+	migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100;
+	return 1;
+}
+
+__setup("migration_factor=", setup_migration_factor);
+
+/*
+ * Estimated distance of two CPUs, measured via the number of domains
+ * we have to pass for the two CPUs to be in the same span:
+ */
+__devinit static unsigned long domain_distance(int cpu1, int cpu2)
+{
+	unsigned long distance = 0;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu1, sd) {
+		WARN_ON(!cpu_isset(cpu1, sd->span));
+		if (cpu_isset(cpu2, sd->span))
+			return distance;
+		distance++;
+	}
+	if (distance >= MAX_DOMAIN_DISTANCE) {
+		WARN_ON(1);
+		distance = MAX_DOMAIN_DISTANCE-1;
+	}
+
+	return distance;
+}
+
+static __initdata unsigned int migration_debug = 1;
+
+static int __init setup_migration_debug(char *str)
+{
+	get_option(&str, &migration_debug);
+	return 1;
+}
+
+__setup("migration_debug=", setup_migration_debug);
+
+/*
+ * Maximum cache-size that the scheduler should try to measure.
+ * Architectures with larger caches should tune this up during
+ * bootup. Gets used in the domain-setup code (i.e. during SMP
+ * bootup).
+ */
+__initdata unsigned int max_cache_size;
+
+static int __init setup_max_cache_size(char *str)
+{
+	get_option(&str, &max_cache_size);
+	return 1;
+}
+
+__setup("max_cache_size=", setup_max_cache_size);
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void touch_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long), chunk1 = size/3,
+			chunk2 = 2*size/3;
+	unsigned long *cache = __cache;
+	int i;
+
+	for (i = 0; i < size/6; i += 8) {
+		switch (i % 6) {
+			case 0: cache[i]++;
+			case 1: cache[size-1-i]++;
+			case 2: cache[chunk1-i]++;
+			case 3: cache[chunk1+i]++;
+			case 4: cache[chunk2-i]++;
+			case 5: cache[chunk2+i]++;
+		}
+	}
+}
+
+/*
+ * Measure the cache-cost of one task migration. Returns in units of nsec.
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+					     int source, int target)
+{
+	cpumask_t mask, saved_mask;
+	unsigned long long t0, t1, t2, t3, cost;
+
+	saved_mask = current->cpus_allowed;
+
+	/*
+	 * Flush source caches to RAM and invalidate them:
+	 */
+	sched_cacheflush();
+
+	/*
+	 * Migrate to the source CPU:
+	 */
+	mask = cpumask_of_cpu(source);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != source);
+
+	/*
+	 * Dirty the working set:
+	 */
+	t0 = sched_clock();
+	touch_cache(cache, size);
+	t1 = sched_clock();
+
+	/*
+	 * Migrate to the target CPU, dirty the L2 cache and access
+	 * the shared buffer. (which represents the working set
+	 * of a migrated task.)
+	 */
+	mask = cpumask_of_cpu(target);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != target);
+
+	t2 = sched_clock();
+	touch_cache(cache, size);
+	t3 = sched_clock();
+
+	cost = t1-t0 + t3-t2;
+
+	if (migration_debug >= 2)
+		printk("[%d->%d]: %8Ld %8Ld %8Ld => %10Ld.\n",
+			source, target, t1-t0, t1-t0, t3-t2, cost);
+	/*
+	 * Flush target caches to RAM and invalidate them:
+	 */
+	sched_cacheflush();
+
+	set_cpus_allowed(current, saved_mask);
+
+	return cost;
+}
+
+/*
+ * Measure a series of task migrations and return the average
+ * result. Since this code runs early during bootup the system
+ * is 'undisturbed' and the average latency makes sense.
+ *
+ * The algorithm in essence auto-detects the relevant cache-size,
+ * so it will properly detect different cachesizes for different
+ * cache-hierarchies, depending on how the CPUs are connected.
+ *
+ * Architectures can prime the upper limit of the search range via
+ * max_cache_size, otherwise the search range defaults to 20MB...64K.
+ */
+__init static unsigned long long
+measure_cost(int cpu1, int cpu2, void *cache, unsigned int size)
+{
+	unsigned long long cost1, cost2;
+	int i;
+
+	/*
+	 * Measure the migration cost of 'size' bytes, over an
+	 * average of 10 runs:
+	 *
+	 * (We perturb the cache size by a small (0..4k)
+	 *  value to compensate size/alignment related artifacts.
+	 *  We also subtract the cost of the operation done on
+	 *  the same CPU.)
+	 */
+	cost1 = 0;
+
+	/*
+	 * dry run, to make sure we start off cache-cold on cpu1,
+	 * and to get any vmalloc pagefaults in advance:
+	 */
+	measure_one(cache, size, cpu1, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu1, cpu2);
+
+	measure_one(cache, size, cpu2, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu2, cpu1);
+
+	/*
+	 * (We measure the non-migrating [cached] cost on both
+	 *  cpu1 and cpu2, to handle CPUs with different speeds)
+	 */
+	cost2 = 0;
+
+	measure_one(cache, size, cpu1, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu1, cpu1);
+
+	measure_one(cache, size, cpu2, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu2, cpu2);
+
+	/*
+	 * Get the per-iteration migration cost:
+	 */
+	do_div(cost1, 2*ITERATIONS);
+	do_div(cost2, 2*ITERATIONS);
+
+	return cost1 - cost2;
+}
+
+__devinit static unsigned long long measure_migration_cost(int cpu1, int cpu2)
+{
+	unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0;
+	unsigned int max_size, size, size_found = 0;
+	long long cost = 0, prev_cost;
+	void *cache;
+
+	/*
+	 * Search from max_cache_size*5 down to 64K - the real relevant
+	 * cachesize has to lie somewhere inbetween.
+	 */
+	if (max_cache_size) {
+		max_size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE);
+		size = max(max_cache_size / SEARCH_SCOPE, MIN_CACHE_SIZE);
+	} else {
+		/*
+		 * Since we have no estimation about the relevant
+		 * search range
+		 */
+		max_size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE;
+		size = MIN_CACHE_SIZE;
+	}
+
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+
+	/*
+	 * Allocate the working set:
+	 */
+	cache = vmalloc(max_size);
+	if (!cache) {
+		printk("could not vmalloc %d bytes for cache!\n", 2*max_size);
+		return 1000000; // return 1 msec on very small boxen
+	}
+
+	while (size <= max_size) {
+		prev_cost = cost;
+		cost = measure_cost(cpu1, cpu2, cache, size);
+
+		/*
+		 * Update the max:
+		 */
+		if (cost > 0) {
+			if (max_cost < cost) {
+				max_cost = cost;
+				size_found = size;
+			}
+		}
+		/*
+		 * Calculate average fluctuation, we use this to prevent
+		 * noise from triggering an early break out of the loop:
+		 */
+		fluct = abs(cost - prev_cost);
+		avg_fluct = (avg_fluct + fluct)/2;
+
+		if (migration_debug)
+			printk("-> [%d][%d][%7d] %3ld.%ld [%3ld.%ld] (%ld): (%8Ld %8Ld)\n",
+				cpu1, cpu2, size,
+				(long)cost / 1000000,
+				((long)cost / 100000) % 10,
+				(long)max_cost / 1000000,
+				((long)max_cost / 100000) % 10,
+				domain_distance(cpu1, cpu2),
+				cost, avg_fluct);
+
+		/*
+		 * If we iterated at least 20% past the previous maximum,
+		 * and the cost has dropped by more than 20% already,
+		 * (taking fluctuations into account) then we assume to
+		 * have found the maximum and break out of the loop early:
+		 */
+		if (size_found && (size*100 > size_found*SIZE_THRESH))
+			if (cost+avg_fluct <= 0 ||
+				max_cost*100 > (cost+avg_fluct)*COST_THRESH) {
+
+				if (migration_debug)
+					printk("-> found max.\n");
+				break;
+			}
+		/*
+		 * Increase the cachesize in 5% steps:
+		 */
+		size = size * 20 / 19;
+	}
+
+	if (migration_debug)
+		printk("[%d][%d] working set size found: %d, cost: %Ld\n",
+			cpu1, cpu2, size_found, max_cost);
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 *
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 */
+	return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE;
+}
+
+void __devinit calibrate_migration_costs(void)
+{
+	int cpu1 = -1, cpu2 = -1, cpu;
+	struct sched_domain *sd;
+	unsigned long distance, max_distance = 0;
+	unsigned long long cost;
+	unsigned long j0, j1;
+
+	j0 = jiffies;
+
+	/*
+	 * First pass - calculate the cacheflush times:
+	 */
+	for_each_online_cpu(cpu1) {
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2)
+				continue;
+			distance = domain_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			/*
+			 * Do we have the result cached already?
+			 */
+			if (migration_cost[distance] != -1LL)
+				cost = migration_cost[distance];
+			else {
+				cost = measure_migration_cost(cpu1, cpu2);
+				migration_cost[distance] = cost;
+			}
+		}
+	}
+	/*
+	 * Second pass - update the sched domain hierarchy with
+	 * the new cache-hot-time estimations:
+	 */
+	for_each_online_cpu(cpu) {
+		distance = 0;
+		for_each_domain(cpu, sd) {
+			sd->cache_hot_time = migration_cost[distance];
+			distance++;
+		}
+	}
+	/*
+	 * Print the matrix:
+	 */
+	printk("---------------------\n");
+	printk("| migration cost matrix (max_cache_size: %d, cpu: %d MHz):\n",
+			max_cache_size,
+#ifdef CONFIG_X86
+			cpu_khz/1000
+#else
+			-1
+#endif
+		);
+	printk("---------------------\n");
+	printk("      ");
+	for_each_online_cpu(cpu1)
+		printk("    [%02d]", cpu1);
+	printk("\n");
+	for_each_online_cpu(cpu1) {
+		printk("[%02d]: ", cpu1);
+		for_each_online_cpu(cpu2) {
+			if (cpu1 == cpu2) {
+				printk("    -   ");
+				continue;
+			}
+			distance = domain_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			cost = migration_cost[distance];
+			printk(" %2ld.%ld(%ld)", (long)cost / 1000000,
+				((long)cost / 100000) % 10, distance);
+		}
+		printk("\n");
+	}
+	printk("--------------------------------\n");
+	printk("| cacheflush times [%ld]:", max_distance+1);
+	for (distance = 0; distance <= max_distance; distance++) {
+		cost = migration_cost[distance];
+		printk(" %ld.%ld (%Ld)", (long)cost / 1000000,
+			((long)cost / 100000) % 10, cost);
+	}
+	printk("\n");
+	j1 = jiffies;
+	printk("| calibration delay: %ld seconds\n", (j1-j0)/HZ);
+	printk("--------------------------------\n");
+}
+
 #ifdef CONFIG_NUMA
+
 /**
  * find_next_best_node - find the next node to include in a sched_domain
  * @node: node whose sched_domain we're building
@@ -5298,6 +5776,10 @@ next_sg:
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs();
 }
 /*
  * Set up scheduler domains and groups.  Callers must hold the hotplug lock.
_