Deep Water: printf float in int type

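Source: Internet · Editor: 程序博客网 · Time: 2024/06/02 23:09

Reposted from: http://www.xiesiyi.com/posts/deep-water-printf-float-in-int-type.html

This article grew out of a conversation with a friend about a problem. He then dug into it and wrote this article. It is very well written, so I am reposting it here for the record.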


Abstract

For a programmer, as a user of the printf interface in the C language, he or she should make sure that the format specifiers match the types of the arguments, or the result is undefined. In fact, from the perspective of the implementor of this interface, the result of mismatched types can be explained and predicted.

Table of Contents

  • Background
  • Solving
  • Conclusions
  • Take-away Tips

Background

Last week, a friend of mine showed me an Obj-C code snippet, and we wanted to figure out exactly what it prints.

    float price = 1.5;
    NSLog(@"%f\n", price);
    NSLog(@"%d\n", (int)price);
    NSLog(@"%d\n", price);
    NSLog(@"%d\n", (int *)&price);
    NSLog(@"%d\n", *(int *)&price);

The output from the Xcode IDE, running on an iPhone 6s simulator, is as follows:

    // first run
    1.500000        // %f <- price
    1               // %d <- (int)price
    -209625088      // %d <- price
    1431444172      // %d <- (int *)&price
    1069547520      // %d <- *(int *)&price

    // second run
    1.500000        // %f <- price
    1               // %d <- (int)price
    1518649344      // %d <- price
    1430264524      // %d <- (int *)&price
    1069547520      // %d <- *(int *)&price

    // third run
    1.500000        // %f <- price
    1               // %d <- (int)price
    -1056837632     // %d <- price
    1457400524      // %d <- (int *)&price
    1069547520      // %d <- *(int *)&price

Note that across the three runs, some numbers in the output stay the same while others change every time. In particular, printing price, a float, with %d and no type cast looks indeterminate at first sight.

This Obj-C code snippet seems easy enough for a quick answer. To verify my first thought, I quickly translated it into C to look at it at a lower level, assuming NSLog is a macro wrapping C printf (Obj-C is a superset of C anyway ~). Here is the C code snippet:

    //
    // sample.c
    //
    #include <stdlib.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        float a = 1.5;
        // plain header text: a bare "%s" here would consume a missing argument
        printf("specifier casting input\n");
        printf("%%f             %f\n",          a);
        printf("%%d (int)       %d\n",     (int)a);
        printf("%%d             %d\n",          a);   // mismatch on purpose
        printf("%%d (int *)     %d\n",  (int *)&a);   // mismatch on purpose
        printf("%%d *(int *)    %d\n", *(int *)&a);
        return 0;
    }

The output is as follows. Note the line that prints with %d and no cast; it still looks unpredictable:

    // compile sample.c
    $ gcc -g -o sample sample.c
    // run for once
    $ ./sample
    specifier casting input
    %f             1.500000
    %d (int)       1
    %d             2147483630
    %d (int *)     -493699412
    %d *(int *)    1095237632

Solving

The output astonished and puzzled me, both instantly and for the rest of the week. One question kept haunting me: is it true that the output of printf is undefined?

To study this further and more thoroughly, I decided to check a concise snippet focusing on printf("%d\n", price). In my opinion, this is the key to demystifying what happens under the hood.

Before we check the code, it is necessary to make clear the environment in which the programs run: x86-64 Linux and gcc, on a Debian virtual guest.

    // programming environment
    $ uname -a
    Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
    $ gcc --version
    gcc (Debian 4.9.2-10) 4.9.2
    Copyright (C) 2014 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The first C snippet shows the sizes of the different data types.

    //
    // size.c
    //
    #include <stdlib.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        printf("sizeof(int) is %zu\n", sizeof(int));
        printf("sizeof(long) is %zu\n", sizeof(long));
        printf("sizeof(float) is %zu\n", sizeof(float));
        printf("sizeof(double) is %zu\n", sizeof(double));
        return 0;
    }

    // compile size.c
    $ gcc -g -o size size.c
    // and run
    $ ./size
    sizeof(int) is 4
    sizeof(long) is 8
    sizeof(float) is 4
    sizeof(double) is 8

The concise C code demonstrating printf("%d\n", price) and its output are shown below:

    //
    // printf_int.c
    //
    #include <stdlib.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        float price = 1.5;
        printf("%d\n", price);
        return 0;
    }

    // compile printf_int
    $ gcc -g -o printf_int printf_int.c
    // and run for three times
    $ ./printf_int
    -1044129512
    $ ./printf_int
    559638616
    $ ./printf_int
    1742869704

No surprise at all: the output is weird and it changes on every run.

I know the usual gotcha: the format specifiers in the printf format string must match the types of the variable argument list, or the behavior is undefined. Now the output makes me think long and hard about why and how this undefined behavior arises.

The typical memory layout of a C program is composed of:

  1. text segment : machine instructions
  2. data segment : initialized data
  3. bss segment : uninitialized data
  4. heap : dynamically allocated memory
  5. stack : memory area for storing variables during function calls

Usually, on Linux running on an x86 Intel CPU, the stack starts at a high memory address and grows toward lower addresses. The heap starts at the top of the bss segment and grows upward, toward the bottom of the stack.

When a function is called, arguments passed from the caller are stored on the stack if needed (pushed onto the stack). They are pushed in the reverse of their order in the source code; that is, the last argument in the source code is pushed first. This is true for calling conventions that pass arguments on the stack. For x86-64 programs on Linux, some registers are also involved in passing the arguments.

Up to six integer or pointer arguments are loaded directly into their respective registers without being pushed onto the stack. The registers and the arguments are specified by the CPU ABI, which is well introduced in this article written by Eli Bendersky. I borrow the illustrating image here:

The image illustrates a call to a function myfunc declared like this:

    long myfunc(long a, long b, long c, long d,
                long e, long f, long g, long h)

Fig. x64 frame nonleaf (illustrating) (copied from Eli)

Under the x86-64 calling convention, the first integer or pointer argument is loaded into register %rdi and the second into %rsi. printf is special in one more way: it is a variadic function, so the compiler applies the default argument promotions to its variable arguments. Thus the type float is promoted to type double. In the statement printf("%d\n", price);, the first argument is the format string and the second argument price (the float 1.5) is promoted to double. (See more about type promotion.)

If that were the whole story, the output would be the content of the promoted price interpreted as type int. Unfortunately, it is not. The binary representation of double price = 1.5 is

    // 64-bit binary representation of double 1.5
    // as hex
    0x3FF8000000000000
    // as binary
    00111111 11111000 00000000 00000000
    00000000 00000000 00000000 00000000

If you verify this yourself, neither the upper 32 bits nor the lower 32 bits match the %d output when interpreted as an int.

Things get more complicated. Thanks to Eli Bendersky again: the article referenced above also points out that floating-point arguments are passed in xmm registers, while only integer and pointer arguments go into the six general-purpose registers. This gives a clue to examine xmm0, the first xmm register. To examine registers, I have to use the powerful debugging tool gdb.

Tip

  1. In order to debug with gdb, the executable should be compiled with the -g option of gcc.
  2. The gdb command n is short for next. It executes the source line that gdb displayed most recently.
    // compile printf_int.c with option -g
    $ gcc -g -o printf_int printf_int.c
    // run with gdb
    $ gdb printf_int
    (gdb) start
    Temporary breakpoint 1 at 0x400515: file d.c, line 5.
    Temporary breakpoint 1, main (argc=1, argv=0x7fffffffdf78) at d.c:5
    5       float a = 1.5;
    (gdb) display $esi
    1: $esi = -8328
    (gdb) n
    6       printf("%d\n", a);
    1: $esi = -8328
    (gdb) n
    -8328
    7       return 0;
    1: $esi = 2147483642
    (gdb)

The line -8328 printed right after the second n is the output of the printf function. Pay close attention to the display line just before it: $esi = -8328, the int value held in the lower 32 bits (%esi) of the %rsi register.

What a coincidence! Or is it deterministic? Yes, it is deterministic and, in that sense, defined. I will explain soon.

Recall that the %rsi register holds the second integer or pointer (address) argument, but of which function? Here it is the main function! Look closely at the breakpoint line:

Temporary breakpoint 1, main (argc=1, argv=0x7fffffffdf78) at printf_int.c:5

The argument argv is the second argument and its content is a 64-bit address, 0x7fffffffdf78. It is a pointer, so its content is held in register %rsi.

I use Python to manipulate the numbers. Take the lower 32 bits of this address and convert them into a signed integer:

    # lower 32 bits of the address, in hex
    ffffdf78
    # binary magnitude of its two's-complement value
    0b10000010001000
    # interpreted as a signed 32-bit integer
    -8328   // -0b10000010001000

Amazing!!!

Now we can explain these two statements.

    float price = 1.5;
    printf("%d\n", price);

When we print it with a %d format specifier, the following happens:

  1. The compiler sets up the arguments. There are two. The first is a pointer to the format string, so it (an address) is loaded into the %rdi register. The second is price, of type float (but promoted to double), so it (1.5) is loaded into the %xmm0 register and the content of the %rsi register remains unchanged. Here is the black magic.
  2. When printf is called, it parses the format string to determine the type of each value to print. The %d specifier is the first one encountered, so printf expects an integer (as %d indicates). It then fetches the value from the %rsi register and prints the content as an integer.
  3. The %rsi register is not altered for this call, so the output is not determined by printf("%d\n", price); itself. It is determined by whatever last changed the %rsi register.

In general, we can summarize:

  1. Type promotion is applied first. Type char is promoted to int, type float is promoted to double, and so on.
  2. Up to six integer or pointer arguments are loaded into the designated registers (%rdi %rsi %rdx %rcx %r8 %r9), floating-point (single or double) arguments are loaded into xmm registers, and the others are pushed onto the stack frame.
  3. Arguments of double type are loaded into the %xmm registers, which are designed for holding floating-point numbers.
  4. Before printf is called, the arguments are stored according to their declared types (after type promotion if needed). During execution, each value is fetched according to the format specifier string.

Let's verify these conclusions. Here we add a function sum which expects two int arguments. Calling sum with two integers has a side effect: the %rsi register is loaded with the second argument and remains intact through the following call to printf. That printf then takes the integer value for the %d specifier from the %rsi register set by the previous sum. We can expect the output of printf to be determined by, and to vary with, the second int argument of the preceding sum call.

    //
    // printf_int_sum.c
    //
    #include <stdlib.h>
    #include <stdio.h>

    void sum(int a, int b);  // declaring the function prototype

    int main(int argc, char **argv) {
        float price = 1.5;
        sum(2, 5);
        printf("%%d price: %d\n", price);
        return 0;
    }

    // function implementation
    void sum(int a, int b) {
        // actually we can do nothing...
        int c = a + b;
    }

Here is the output

    // sum(2, 5)
    // compile
    $ gcc -g -o printf_int_sum printf_int_sum.c
    // run for three times
    $ ./printf_int_sum
    %d price: 5
    $ ./printf_int_sum
    %d price: 5
    $ ./printf_int_sum
    %d price: 5

    // after we change to sum(2, 7)
    // compile
    $ gcc -g -o printf_int_sum printf_int_sum.c
    // run
    $ ./printf_int_sum
    %d price: 7
    $ ./printf_int_sum
    %d price: 7
    $ ./printf_int_sum
    %d price: 7

Observing the above outputs: when we call sum(2, 5); price is printed as 5; when we call sum(2, 7); price is printed as 7. The behavior is what we expect: the second integer or pointer argument of sum determines the output of price. It is defined! Hooray!

So far, we have figured out how to predict the output of printf. One mystery remains: the result of the sample.c program changes on every run. According to the conclusions, the output should be determined by the second integer or pointer argument of the main function, so I add a line in the file printf_int_argv.c to print argv, which is a pointer. This program, printf_int_argv, is run both directly on a real machine and under gdb.

    //
    // printf_int_argv.c
    //
    #include <stdlib.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        float price = 1.5;
        printf("%%d price: %d\n", price);
        printf("%%p argv: %p\n", argv);
        return 0;
    }

The output from running on a real machine is shown as follows:

    // run on real machine, with default system settings
    $ ./printf_int_argv
    %d price: 1520235144
    %p argv: 0x7ffd5a9cf288
    $ ./printf_int_argv
    %d price: -1531767544
    %p argv: 0x7ffea4b31508
    $ ./printf_int_argv
    %d price: 1047310888
    %p argv: 0x7ffe3e6cb228

The output from running with gdb is shown as follows:

    // run with gdb
    $ gdb printf_int_argv
    (gdb) run
    Starting program: ./printf_int_argv
    %d price: -6536
    %p argv: 0x7fffffffdf68
    [Inferior 1 (process 22351) exited normally]
    (gdb) run
    Starting program: ./printf_int_argv
    %d price: -6536
    %p argv: 0x7fffffffdf68
    [Inferior 1 (process 22355) exited normally]
    (gdb) run
    %d price: -6536
    %p argv: 0x7fffffffdf68
    [Inferior 1 (process 22356) exited normally]
    (gdb)

    // another run with gdb to show content of %esi register
    (gdb) start
    Temporary breakpoint 1, main (argc=1, argv=0x7fffffffe678) at printf_int_argv.c:5
    5       float price = 1.5;
    (gdb) n
    6       printf("%%d price: %d\n", price);
    (gdb) display $esi
    1: $esi = -6536
    (gdb) n
    %d price: -6536
    // and so on

The outputs differ in these two environments. The output from gdb is well explained by the previous conclusions: it is determined by, and varies with, the value of argv. In the second gdb session above, the content of %esi is exactly what is printed for price. The difference arises because argv stays unchanged when debugging with gdb while it varies on the real machine.

Why and how does argv change? I googled for c why argv address changes every time and found a useful link, Environment variable's address is changing?. I picked up some key concepts:

    ASLR
    /proc/sys/kernel/randomize_va_space

Continuing to search with these concepts, I get these from ASLR @wikipedia:

Address space layout randomization (ASLR) is a computer security technique involved in protection from buffer overflow attacks. In order to prevent an attacker from reliably jumping to, for example, a particular exploited function in memory, ASLR randomly arranges the address space positions of key data areas of a process, including the base of the executable and the positions of the stack, heap and libraries.

As far as I know, the argv list sits just above the stack, so its address should also change on every run. I finally decided to disable ASLR and run printf_int_argv again. The fresh output follows the note below.

Warning

You need root permission to disable or enable ASLR. Here is a step-by-step guide from the link:

The following values are supported:

0 – No randomization. Everything is static.
1 – Conservative randomization. Shared libraries, stack, mmap(), VDSO and heap are randomized.
2 – Full randomization. In addition to elements listed in the previous point, memory managed through brk() is also randomized.

So, to disable it, run

    echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

and to enable it again, run

    echo 2 | sudo tee /proc/sys/kernel/randomize_va_space

    // when ASLR is disabled
    $ ./printf_int_argv
    %d price: -6504
    %p argv: 0x7fffffffe698
    $ ./printf_int_argv
    %d price: -6504
    %p argv: 0x7fffffffe698
    $ ./printf_int_argv
    %d price: -6504
    %p argv: 0x7fffffffe698

Hooray! Hooray! argv stays unchanged when ASLR is disabled, so the output of printf, which results from interpreting argv as an integer, stays the same on every run now.

Warning

You should enable ASLR after this experiment. Do NOT forget it.

Conclusions

For a programmer, as a user of the printf interface in the C language, he or she should make sure that the format specifiers match the types of the arguments, or the result is undefined. In fact, from the perspective of the implementor of this interface, the result of mismatched types can be explained as follows.

Function calls are modeled as stack frames and the arguments passed from the caller are stored according to their types (possibly after type checking and type promotion). On an x86-64 CPU, the first six integer or pointer arguments are loaded into their respective registers (%rdi %rsi %rdx %rcx %r8 %r9), floating-point (single or double) arguments are loaded into xmm registers, and the others are pushed onto the stack frame.

Specifically, when printf is called, the first %d in the format string indicates that the content of register %rsi should be fetched. When a float variable price is printed with %d, as in printf("%d\n", price), %rsi is not set for this call since the second argument (price) is NOT of integer or pointer type; %rsi keeps the value set by the last function call with a second argument of that proper type.

If printf is the first function called in an int main(int argc, char **argv) program, the content of register %rsi is the second argument of main, the value of argv, which is a pointer (holding an address). When ASLR is enabled (on a Linux system), the value of argv changes randomly to enhance security, and so does the content of %rsi. That is why the output of printf("%d\n", price) changes on every run.

Take-away Tips

  1. Always use the right format specifier for the printf function, or you will get unexpected results.
  2. In fact, the output produced by a mismatched format specifier can be explained if you examine the registers and the function call stack. Once you understand this magic, you should always refer to the previous tip ~.

simple qrcode chrome extension

Polishing the Simple Qrcode

More to do

The simple-qrcode work is so simple that some effort is needed to polish it. Here I list the issues and will polish the code in the future. Of course, you could have a try yourself.

  1. Updating the QR code image synchronously should be avoided because it blocks the UI.

    To do: use Ajax to request the QR code asynchronously. Adding a loading indicator is more user-friendly.

  2. A new request for the QR code is sent on every user click even if the page has not been refreshed. This wastes traffic and causes undesirable UI updates.

    To do: cache a certain number of requested QR images for some time.

  3. The request for the QR code is not sent until the badge (extension icon) is clicked, even if the page has been loaded for a long time.

    To do: pre-fetch the QR code image as soon as possible. DOMContentLoaded should be a better trigger event.

  4. Under poor network conditions, the HTTP request for the QR code may fail or be interrupted.

    To do: use a local js library to generate the QR code image.

  5. The server returning the QR code may be unresponsive. Multiple servers could provide a more reliable service.

    To do: provide an options page for users to customize their own QR code servers and/or embed a list of servers in the extension.
