2012年10月29日星期一

Puzzle 1: 经理与工程师

在 Puzzle Toad 发现一些有趣的问题,基本上都很烧脑,记录一下。


有n个人,每个人不是经理就是工程师,每个人都知道其他人的职业。可以询问第i个人第j 个人是工程师还是经理。工程师总是说实话,但经理会说谎。问:
  1. 如果有超过一半的人为工程师,有什么策略可以保证在询问次数最多n-1的情况下,找到 一个工程师?
  2. 如果经理占一半人数,是否可以找出一个工程师?如果能,怎么找?
  3. 如果发现一个工程师,则可以通过他识别其他人的职业。是否有一种方法识别出所有人 的职业,但只需更少的询问次数?
















































Part 1:
A correct, optimal, and elegant solution was submitted by Chris Peikert, Grant Wang, and Abhi Shelat of MIT. A number of others submitted partial solutions.

Here's an n-1 query solution to part 1. Maintain three sets of people: UNSEEN, STACK, and DISCARD. Initialize the process by picking one arbitrarily to be the STACK, everything else is UNSEEN. Repeat the following step until UNSEEN is empty:

Pick an UNSEEN element x, remove it from UNSEEN. Ask the top of the STACK y about x. If y says "manager" pop y off the stack and DISCARD both x and y. If it says "engineer" add x to the top of the STACK.

After all elements have been processed in this way (n-1 comparisons), the top of the stack must be an engineer.

Why does this work? First observe that whenever we discard a pair, at least one of them is a manager. So among the rest of them (STACK and UNSEEN) a majority must still be engineers. So at the end, when UNSEEN is empty, there must be an engineer in the stack, therefore the top of the stack must be an engineer.

This can be improved to n-2 simply by stopping one earlier. When there's one UNSEEN left, if the stack is empty, that UNSEEN one is an engineer. Otherwise, the top of the stack must be an engineer.
If is n is even and n>=4, as a first step we can just throw out one person, and appy this algorithm to the remaining n-1 obtaining n-3 comparisons. This gives the optimal algorithm.

This is optimal. The proof appears in the solution of homework assignment 7 of Steven Rudich's course 15-251 taught at CMU in the spring semester of 2002. See Solution 7.

Part 2:
 If half or more of the people are managers, then the problem cannot be solved. The managers can ensure this simply by always lying. Now there's way to separate the two sets of people. Each one simply claims the others are Managers.

Part 3:
I don't know any better solution than to simply using the solution to Part 1 to identify everybody.

2012年9月7日星期五

Learning C with gdb



Coming from a background in higher-level languages like Ruby, Scheme, or Haskell, learning C can be challenging. In addition to having to wrestle with C's lower-level features like manual memory management and pointers, you have to make do without a REPL. Once you get used to exploratory programming in a REPL, having to deal with the write-compile-run loop is a bit of a bummer.
It occurred to me recently that I could use gdb as a pseudo-REPL for C. I've been experimenting with using gdb as a tool for learning C, rather than merely debugging C, and it's a lot of fun.
My goal in this post is to show you that gdb is a great tool for learning C. I'll introduce you to a few of my favorite gdb commands, and then I'll demonstrate how you can use gdb to understand a notoriously tricky part of C: the difference between arrays and pointers.

An introduction to gdb

Start by creating the following little C program, minimal.c:
int main()
{
    int i = 1337;
    return 0;
}
Note that the program does nothing and has not a single printf statement.1 Behold the brave new world of learning C with gdb!
Compile it with the -g flag so that gdb has debug information to work with, and then feed it to gdb:
$ gcc -g minimal.c -o minimal
$ gdb minimal
You should now find yourself at a rather stark gdb prompt. I promised you a REPL, so here goes:
(gdb) print 1 + 2
$1 = 3
Amazing! print is a built-in gdb command that prints the evaluation of a C expression. If you're unsure of what a gdb command does, try running help name-of-the-command at the gdb prompt.
Here's a somewhat more interesting example:
(gbd) print (int) 2147483648
$2 = -2147483648
I'm going to ignore why 2147483648 == -2147483648; the point is that even arithmetic can be tricky in C, and gdb understands C arithmetic.
Let's now set a breakpoint in the main function and start the program:
(gdb) break main
(gdb) run
The program is now paused on line 3, just before i gets initialized. Interestingly, even though i hasn't been initialized yet, we can still look at its value using the print commnd:
(gdb) print i
$3 = 32767
In C, the value of an uninitialized local variable is undefined, so gdb might print something different for you!
We can execute the current line with the next command:
(gdb) next
(gdb) print i
$4 = 1337

Examining memory with x

Variables in C label contiguous chunks of memory. A variable's chunk is characterized by two numbers:
  1. The numerical address of the first byte in the chunk.
  2. The size of the chunk, measured in bytes. The size of a variable's chunk is determined by the variable's type.
One of the distinctive features of C is that you have direct access to a variable's chunk of memory. The & operator computes a variable's address, and the sizeof operator computes a variable's size in memory.
You can play around with both concepts in gdb:
(gdb) print &i
$5 = (int *) 0x7fff5fbff584
(gdb) print sizeof(i)
$6 = 4
In words, this says that i's chunk of memory starts at address 0x7fff5fbff5b4 and takes up four bytes of memory.
I mentioned above that a variable's size in memory is determined by its type, and indeed, thesizeof operator can operate directly on types:
(gdb) print sizeof(int)
$7 = 4
(gdb) print sizeof(double)
$8 = 8
This means that, on my machine at least, int variables take up four bytes of space anddouble variables take up eight.
Gdb comes with a powerful tool for directly examing memory: the x command. The xcommand examines memory, starting at a particular address. It comes with a number of formatting commands that provide precise control over how many bytes you'd like to examine and how you'd like to print them; when in doubt, try running help x at the gdb prompt.
The & operator computes a variable's address, so that means we can feed &i to x and thereby take a look at the raw bytes underlying i's value:
(gdb) x/4xb &i
0x7fff5fbff584: 0x39    0x05    0x00    0x00
The flags indicate that I want to examine 4 values, formatted as hex numerals, one byte at a time. I've chosen to examine four bytes because i's size in memory is four bytes; the printout shows i's raw byte-by-byte representation in memory.
One subtlety to bear in mind with raw byte-by-byte examinations is that on Intel machines, bytes are stored in "little-endian" order: unlike human notation, the least significant bytes of a number come first in memory.
One way to clarify the issue would be to give i a more interesting value and then re-examine its chunk of memory:
(gdb) set var i = 0x12345678
(gdb) x/4xb &i
0x7fff5fbff584: 0x78    0x56    0x34    0x12

Examining types with ptype

The ptype command might be my favorite command. It tells you the type of a C expression:
(gdb) ptype i
type = int
(gdb) ptype &i
type = int *
(gdb) ptype main
type = int (void)
Types in C can get complex but ptype allows you to explore them interactively.

Pointers and arrays

Arrays are a surprisingly subtle concept in C. The plan for this section is to write a simple program and then poke it in gdb until arrays start to make sense.
Code up the following arrays.c program:
int main()
{
    int a[] = {1,2,3};
    return 0;
}
Compile it with the -g flag, run it in gdb, then next over the initialization line:
$ gcc -g arrays.c -o arrays
$ gdb arrays
(gdb) break main
(gdb) run
(gdb) next
At this point you should be able to print the contents of a and examine its type:
(gdb) print a
$1 = {1, 2, 3}
(gdb) ptype a
type = int [3]
Now that our program is set up correctly in gdb, the first thing we should do is use x to see what a looks like under the hood:
(gdb) x/12xb &a
0x7fff5fbff56c: 0x01  0x00  0x00  0x00  0x02  0x00  0x00  0x00
0x7fff5fbff574: 0x03  0x00  0x00  0x00
This means that a's chunk of memory starts at address 0x7fff5fbff5dc. The first four bytes store a[0], the next four store a[1], and the final four store a[2]. Indeed, you can check that sizeof knows that a's size in memory is twelve bytes:
(gdb) print sizeof(a)
$2 = 12
At this point, arrays seem to be quite array-like. They have their own array-like types and store their members in a contiguous chunk of memory. However, in certain situations, arrays act a lot like pointers! For instance, we can do pointer arithmetic on a:
(gdb) print a + 1
$3 = (int *) 0x7fff5fbff570
In words, this says that a + 1 is a pointer to an int and holds the address0x7fff5fbff570. At this point you should be reflexively passing pointers to the xcommand, so let's see what happens:
(gdb) x/4xb a + 1
0x7fff5fbff570: 0x02  0x00  0x00  0x00
Note that 0x7fff5fbff570 is four more than 0x7fff5fbff56c, the address of a's first byte in memory. Given that int values take up four bytes, this means that a + 1 points to a[1].
In fact, array indexing in C is syntactic sugar for pointer arithmetic: a[i] is equivalent to *(a + i). You can try this in gdb:
(gdb) print a[0]
$4 = 1
(gdb) print *(a + 0)
$5 = 1
(gdb) print a[1]
$6 = 2
(gdb) print *(a + 1)
$7 = 2
(gdb) print a[2]
$8 = 3
(gdb) print *(a + 2)
$9 = 3
We've seen that in some situations a acts like an array and in others it acts like a pointer to its first element. What's going on?
The answer is that when an array name is used in a C expression, it "decays" to a pointer to the array's first element. There are only two exceptions to this rule: when the array name is passed to sizeof and when the array name is passed to the & operator.2
The fact that a doesn't decay to a pointer when passed to the & operator brings up an interesting question: is there a difference between the pointer that a decays to and &a?
Numerically, they both represent the same address:
(gdb) x/4xb a
0x7fff5fbff56c: 0x01  0x00  0x00  0x00
(gdb) x/4xb &a
0x7fff5fbff56c: 0x01  0x00  0x00  0x00
However, their types are different. We've already seen that the decayed value of a is a pointer to a's first element; this must have type int *. As for the type of &a, we can ask gdb directly:
(gdb) ptype &a
type = int (*)[3]
In words, &a is a pointer to an array of three integers. This makes sense: a doesn't decay when passed to &, and a has type int [3].
You can observe the distinction between a's decayed value and &a by checking how they behave with respect to pointer arithmetic:
(gdb) print a + 1
$10 = (int *) 0x7fff5fbff570
(gdb) print &a + 1
$11 = (int (*)[3]) 0x7fff5fbff578
Note that adding 1 to a adds four to a's address, whereas adding 1 to &a adds twelve!
The pointer that a actually decays to is &a[0]:
(gdb) print &a[0]
$11 = (int *) 0x7fff5fbff56c

Conclusion

Hopefully I've convinced you that gdb a neat exploratory environment for learning C. You can print the evaluation of expressions, examine raw bytes in memory, and tinker with the type system using ptype.
If you'd like to experiment further with using gdb to learn C, I have a few suggestions:
  1. Use gdb to work through the Ksplice pointer challenge.
  2. Investigate how structs are stored in memory. How do they compare to arrays?
  3. Use gdb's disassemble command to learn assembly programming! A particularly fun exercise is to investigate how the function call stack works.
  4. Check out gdb's "tui" mode, which provides a grahical ncurses layer on top of regular gdb. On OS X, you'll likely need to install gdb from source.

2012年9月1日星期六

Knuth-shuffle洗牌算法

要写出一个真正随机(我意思是均匀分布)的洗牌程序并不容易,即使强如Jeff Atwood也是 会写错的。Jeff写错的版本是这样的:

//error!
for(int i=0;i<cards.Length;i++){
    int n = rand.Next(cards.Length);
    Swap(ref cards[i],ref cards[n]);
}

问题在于我们并不能轻易地看出这段程序生成的排列不是均匀分布的(容我想想,我现在看 得出么…)。

实际上,是有个O(n)的洗牌算法 Fisher-Yates,又称为Knuth-shuffle(Knuth在TAOCP介绍过 而得名):

//To shuffle an array a of n elements (indices 0..n-1):
for i from n − 1 downto 1 do
       j <- random integer with 0 ≤ j ≤ i
       exchange a[j] and a[i]

这个算法就地洗牌,但它需要预先知道洗牌数组的长度。它的正确性(也就是生成的排列是 否均匀分布)不难看出。

Knuth-shuffle还有另一种实现:

//To initialize an array a of n elements to a randomly shuffled copy of
//source, both 0-based: 
a[0] <- source[0]
for i from 1 to n − 1 do
    j <- random integer with 0 ≤ j ≤ i
    a[i] <- a[j]
    a[j] <- source[i]

这个实现的正确性可以这么考虑:对于n-1次random调用生成的n!次随机序列,总会对应不 同的排列,所以n!次排列是均匀分布的。

这个实现有一个坏处,它需要额外的数组来存储生成的排列。但它也有一个大大的好处,它 可以不用预先知道数组的长度:

//To initialize an empty array a to a randomly shuffled copy of source whose
//length is not know:
while source.moreDataAvailable
    j <- random integer with 0 ≤ j ≤ a.length
    if j = a.length
        a.append(source.next)
    else
        a.append(a[j])
        a[j] <- source.next

Knuth-shuffle做些修改还可以有别的应用。比如下面这个问题:

  1. 如何从未知长度的序列中均匀地随机选择一个数?要求最多只遍历一遍。
  2. 如何从未知长度的序列中均匀地随机选择k个数?要求最多只遍历一遍。

.

问题(1)参考程序:

import random

def random_element(seq):
    """Return an element chosen at random from `seq`."""
    it = None
    for n, elem in enumerate(seq):
        if random.randint(0, n) == 0:
            it = elem
    return it

问题(2)参考程序:

import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v) # add first n items in random order
            else:
                #at a decreasing rate,replace random items
                results[r] = v 

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results

2012年8月30日星期四

enum的小问题

这个博客上看到这样一个问题:

//问题1
//foo函数输出什么?
enum MY_ENUM{
  MY_OK = 0,
  MY_NOT_OK,
}

void foo()
{
   int i = -1;
   enum My_ENUM my_enum = MY_OK;

   if( i < my_enum) printf("I am OK!\n");
   else printf("I am NOT OK!\n");

}

如果对上面的函数做小小的修改,又是什么结果?

//问题2
//foo函数输出什么?
enum MY_ENUM{
  MY_OK = 0,
  MY_NOT_OK,
}

void foo()
{
   int i = -1;
   enum My_ENUM my_enum = MY_OK;

   if( i < MY_OK) printf("I am OK!\n");
   else printf("I am NOT OK!\n");

}
FOR THE MARGIN. HAVE BETTER MATHOD?

文中给出的答案是问题1输出"I am NOT OK!",而问题2输出"I am OK!".多多少少让我有点琢 磨不透,为什么会这样呢?

实际上,正如文中提到的,gcc确实是这样的结果。但如果同样的程序在msvc去运行,就会发 现问题1和问题2都是输出"I am OK!". 这是跟编译器实现有关的问题。在gcc中,当enum常数 (所谓的enum常数就是enum类型中标记的那些值,如程序中的MY_OK)没有负数时,gcc将 enum类型变量(注意与enum常数的区别,如程序中的my_enum)存储为无符号int。当enum常 数有负数时,则仍然乖乖地存储为int。而enum常数是一直都存储为int。msvc则是不管什么 情况,都将enum常数和enum类型存储为int。在这个实现上,我觉得gcc真是吃饱了撑的,你 觉得呢?

综上所述,这是个不好的问题。姑且一笑而过。

2012年8月26日星期日

java对象模型

原文地址:http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

Update (December 18th, 2008): I've posted here an experimental library that implements Sizeof for Java.

One thing about Java that has always bothered me, given my C/C++ roots, is the lack of a way to figure out how much memory is used by an object. C++ features the sizeof operator, that lets you query the size of primitive types and also the size of objects of a given class. This operator in C and C++ is useful for pointer arithmetic, copying memory around, and IO, for example.

Java doesn't have a corresponding operator. In reality, Java doesn't need one. Size of primitive types in Java is defined in the language specification, whereas in C and C++ it depends on the platform. Java has its own IO infrastructure built around serialization. And both pointer arithmetic and bulk memory copy don't apply because Java doesn't have pointers.

But every Java developer at some point wondered how much memory is used by a Java object. The answer, it turns out, is not so simple.

The first distinction to be made is between shallow size and deep size. The shallow size of an object is the space occupied by the object alone, not taking into account size of other objects that it references. The deep size, on the other hand, takes into account the shallow size of the object, plus the deep size of each object referenced by this object, recursively. Most of the times you will be interested on knowing the deep size of an object, but, in order to know that, you need to know how to calculate the shallow size first, which is what I'm going to talk about here.

One complication is that runtime in memory structure of Java objects is not enforced by the virtual machine specification, which means that virtual machine providers can implement them as they please. The consequence is that you can write a class, and instances of that class in one VM can occupy a different amount of memory than instances of that same class when run in another VM. Most of the world, including myself, uses the Sun HotSpot virtual machine though, which simplifies things a lot. The remainder of the discussion will focus on the 32 bit Sun JVM. I will lay down a few 'rules that will help explain how the JVM organizes the objects' layout in memory.

Memory layout of classes that have no instance attributes

In the Sun JVM, every object (except arrays) has a 2 words header. The first word contains the object's identity hash code plus some flags like lock state and age, and the second word contains a reference to the object's class. Also, any object is aligned to an 8 bytes granularity. This is the first rule or objects memory layout:

Rule 1: every object is aligned to an 8 bytes granularity.

Now we know that if we call new Object(), we will be using 8 bytes of the heap for the two header words and nothing else, since the Objectclass doesn't have any fields.

Memory layout of classes that extend Object

After the 8 bytes of header, the class attributes follow. Attributes are always aligned in memory to their size. For instance, ints are aligned to a 4 byte granularity, and longs are aligned to an 8 byte granularity. There is a performance reason to do it this way: usually the cost to read a 4 bytes word from memory into a 4 bytes register of the processor is much cheaper if the word is aligned to a 4 bytes granularity.

In order to save some memory, the Sun VM doesn't lay out object's attributes in the same order they are declared. Instead, the attributes are organized in memory in the following order:
  1. doubles and longs
  2. ints and floats
  3. shorts and chars
  4. booleans and bytes
  5. references

This scheme allows for a good optimization of memory usage. For example, imagine you declared the following class:
class MyClass {
    byte a;
    int c;
    boolean d;
    long e;
    Object f;        
}

If the JVM didn't reorder the attributes, the object memory layout would be like this:
[HEADER:  8 bytes]  8
[a:       1 byte ]  9
[padding: 3 bytes] 12
[c:       4 bytes] 16
[d:       1 byte ] 17
[padding: 7 bytes] 24
[e:       8 bytes] 32
[f:       4 bytes] 36
[padding: 4 bytes] 40

Notice that 14 bytes would have been wasted with padding and the object would use 40 bytes of memory. By reordering the objects using the rules above, the in memory structure of the object becomes:
[HEADER:  8 bytes]  8
[e:       8 bytes] 16
[c:       4 bytes] 20
[a:       1 byte ] 21
[d:       1 byte ] 22
[padding: 2 bytes] 24
[f:       4 bytes] 28
[padding: 4 bytes] 32

This time, only 6 bytes are used for padding and the object uses only 32 bytes of memory.

So here is rule 2 of object memory layout:

Rule 2: class attributes are ordered like this: first longs and doubles; then ints and floats; then chars and shorts; then bytes and booleans, and last the references. The attributes are aligned to their own granularity.

Now we know how to calculate the memory used by any instance of a class that extends Object directly. One practical example is the java.lang.Boolean class. Here is its memory layout:
[HEADER:  8 bytes]  8 
[value:   1 byte ]  9
[padding: 7 bytes] 16

An instance of the Boolean class takes 16 bytes of memory! Surprised? (Notice the padding at the end to align the object size to an 8 bytes granularity.)

Memory layout of subclasses of other classes

The next three rules are followed by the JVM to organize the the fields of classes that have superclasses. Rule 3 of object memory layout is the following:

Rule 3: Fields that belong to different classes of the hierarchy are NEVER mixed up together. Fields of the superclass come first, obeying rule 2, followed by the fields of the subclass.

Here is an example:
class A {
   long a;
   int b;
   int c;
}

class B extends A {
   long d;
}

An instance of B looks like this in memory:
[HEADER:  8 bytes]  8
[a:       8 bytes] 16
[b:       4 bytes] 20
[c:       4 bytes] 24
[d:       8 bytes] 32

The next rule is used when the fields of the superclass don't fit in a 4 bytes granularity. Here is what it says:

Rule 4: Between the last field of the superclass and the first field of the subclass there must be padding to align to a 4 bytes boundary.

Here is an example:
class A {
   byte a;
}

class B {
   byte b;
}
[HEADER:  8 bytes]  8
[a:       1 byte ]  9
[padding: 3 bytes] 12
[b:       1 byte ] 13
[padding: 3 bytes] 16

Notice the 3 bytes padding after field a to align b to a 4 bytes granularity. That space is lost and cannot be used by fields of class B.

The final rule is applied to save some space when the first field of the subclass is a long or double and the parent class doesn't end in an 8 bytes boundary.

Rule 5: When the first field of a subclass is a double or long and the superclass doesn't align to an 8 bytes boundary, JVM will break rule 2 and try to put an int, then shorts, then bytes, and then references at the beginning of the space reserved to the subclass until it fills the gap.

Here is an example:
class A {
  byte a;
}

class B {
  long b;
  short c;  
  byte d;
}

Here is the memory layout:
[HEADER:  8 bytes]  8
[a:       1 byte ]  9
[padding: 3 bytes] 12
[c:       2 bytes] 14
[d:       1 byte ] 15
[padding: 1 byte ] 16
[b:       8 bytes] 24

At byte 12, which is where class A 'ends', the JVM broke rule 2 and stuck a short and a byte before a long, to save 3 out of 4 bytes that would otherwise have been wasted.

Memory layout of arrays

Arrays have an extra header field that contain the value of the 'length' variable. The array elements follow, and the arrays, as any regular objects, are also aligned to an 8 bytes boundary.

Here is the layout of a byte array with 3 elements:
[HEADER:  12 bytes] 12
[[0]:      1 byte ] 13
[[1]:      1 byte ] 14
[[2]:      1 byte ] 15
[padding:  1 byte ] 16

And here is the layout of a long array with 3 elements:
[HEADER:  12 bytes] 12
[padding:  4 bytes] 16
[[0]:      8 bytes] 24
[[1]:      8 bytes] 32
[[2]:      8 bytes] 40

Memory layout of inner classes

Non-static inner classes have an extra 'hidden' field that holds a reference to the outer class. This field is a regular reference and it follows the rule of the in memory layout of references. Inner classes, for this reason, have an extra 4 bytes cost.

Final thoughts

We have learned how to calculate the shallow size of any Java object in the 32 bit Sun JVM. Knowing how memory is structured can help you understand how much memory is used by instances of your classes.