I submitted an entry to the 20th International Obfuscated C Code Contest (IOCCC) in January 2012.
It didn't win anything, and the way it abused the rules has
been patched, which means I can't try to improve or resubmit
it. So I'm publishing it here so I don't feel like
all that effort was for nothing.
Swine is not your average
quine - it
is effectively a "source-level virus", propagating itself via
stdio.h, and with a payload that causes if()
statements to very occasionally be inverted.
I've had code that more-or-less does this since around
mid-2000, and have always wanted to get it into shape for an IOCCC.
Now the 21st IOCCC has been announced, and the preliminary rules have
been updated (emphasis added):
| 21) Your program must not modify the content of the original
| prog.c C source file. If you need to modify the entry, copy
| prog.c to another filename in the same directory and then
| modify that file. Your entry must not create or modify
| files above the current directory with the exception of
| of /tmp the /var/tmp directories.
So at least I can be reasonably sure that the judges did pay some
attention to my entry. :)
You can get the as-submitted swine.c code,
remarks/README and build
instructions (which are just 'gcc -Wall -std=c99 -pedantic
-o swine swine.c'). The markdown-formatted remarks are also
included inline below. I might also put this onto github, including
the scripts used to generate swine.c.
This code is licensed under the
Creative
Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.
SWINE
Not only did I write this code, but I can read your mind. You are
thinking, "Oh great, yet another boring self-replicating code. I see
it's based on the venerable 1990/scjones, with a few extra tricks
thrown in, like jumbling up the code string array. yawn"
WRONG
Although I started with 1990/scjones as a base and there is some
resemblance, this entry does much more than just spew itself onto
stdout.
But first, a very important warning:
DO NOT RUN THIS CODE AS A SUPER-USER
This program may look like a quine, but it's actually more like
swine. Running it with root privileges will effectively
"infect" the C compiler on your system.
On the surface
For each file given on the command line, the program will ensure that
the file ends with a copy of the program's source code. If the file
doesn't exist, then it will be created. The program's source will be
appended to the file if necessary, ie. it checks if the end of the
file is correct, and if not then appends to it (as opposed to blindly
always appending, or blindly always rewriting/overwriting the final
bytes of the file).
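The "append only if necessary" behaviour can be sketched in plain C as follows. This is not the code from swine.c: ensure_suffix() is a hypothetical helper, written for readability rather than obfuscation, and it assumes the text to append fits in memory.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not taken from swine.c.
   Returns 1 if text was appended, 0 if the file already ended with it,
   and -1 on error. */
int ensure_suffix(const char *path, const char *text)
{
    size_t len = strlen(text);
    FILE *f = fopen(path, "a+b");      /* creates the file if it is missing */
    int already = 0;
    int rc = 0;

    if (!f)
        return -1;

    /* Compare the last len bytes of the file against text. */
    if (fseek(f, 0L, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0 && (unsigned long)size >= len) {
            char *tail = malloc(len + 1);
            if (tail && fseek(f, -(long)len, SEEK_END) == 0 &&
                fread(tail, 1, len, f) == len &&
                memcmp(tail, text, len) == 0)
                already = 1;
            free(tail);
        }
    }

    if (!already) {
        /* A positioning call is required between reading and writing;
           in append mode the write lands at the end of the file anyway. */
        fseek(f, 0L, SEEK_END);
        rc = (fwrite(text, 1, len, f) == len) ? 1 : -1;
    }
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

Calling it twice on the same file with the same text appends once and then does nothing, which is the property described above.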
SPOILER WARNING
Stop reading now if you want to try to figure it out for yourself.
What it really does
In addition to the files listed on the command line, the program will
similarly process /usr/include/stdio.h (or wherever your system's
stdio.h is located). If it has write permission to this file, then
it will ensure that an alternate set of code is always at the end of
stdio.h . This code is itself self-replicating, and by way of
payload includes a #define of the if keyword that occasionally
causes its result to be inverted.
Thus, if you are stupid/crazy enough to run this program as root, or
as any user that has write permission to the compiler's stdio.h ,
then your system will effectively be "infected" by this "source-level
virus". From this point, any other program compiled on the system (by
any user) will also be infected, and if those compiled programs are
subsequently run on other similar systems, then they too will be
infected.
Seeing it in action without nailing your system
The easiest way is to copy /usr/include/stdio.h (or wherever it is
on your system) into the current directory, and then add -I. to the
compiler flags when compiling swine.c. The resulting program will
then mess with ./stdio.h , rather than /usr/include/stdio.h .
You can then compile some other test program which uses stdio.h, and
give it -I. as well, so as to use the infected ./stdio.h . If
./stdio.h is "cleaned" by re-copying it from /usr/include , then
running the other program will nicely reinfect it.
Abuse of the rules
By now it should come as no surprise that this entry is aiming for the
"worst abuse of the rules" award, or maybe "most deceptive C code".
Clearly, code that is designed to propagate itself in perpetuity (ie.
a virus by any other name) is outside the scope of the spirit of the
rules. However, it is not outside the letter of the rules - #5 is
very clear:
The build file, the source and the resulting executable should be
treated as read-only files.
Since no other files are mentioned, these must be the only files that
are considered read-only. This means that all system files, including
those used by the compiler (such as stdio.h ), are fair game for
writing. Thus, this entry is not invalidated merely because it tries
to adulterate your system files. There is also no rule which states
that malicious code is not allowed.
Probably the best plug for this hole in future years would be to add
"system files" to the list in rule #5.
Obfuscations and how it works
The top-level obfuscation is the usual mess of #define 's, along with
terse/misleading identifiers, hideous expressions, excessively long
lines and no code indentation at all. (Actually I wish these last two
didn't have to be the case, and the original version wasn't, but it
was necessary for the size limit - the self-replicating nature of the
code effectively halves the limit.)
A sloppy reader of the code will just assume that the string version
of the program in the a[] array contains the same lines as the main
program, only jumbled up somehow. Slightly more astute readers will
see that there are more lines in a[] than in the rest of the
program, but may just assume that they are red herrings thrown in as
distractions. Or they may notice the strange string of apparent line
noise, and wonder what that's all about.
There are three main, subtle obfuscations that the program hinges on.
Two are simple "bugs" that are reasonably common, but difficult to
catch visually (or to put it another way, will easily pass the "glance
test").
Obfuscation 1: Finding stdio.h
The first is the contents of the prgnam string. This is set to the
value of __FILE__ as evaluated inside stdio.h . This is how the
program is able to write to /usr/include/stdio.h without ever having
to mention /usr/include or anything else so blindingly suspicious.
The only mention of stdio.h at all is in the exceedingly common and
very innocent #include <stdio.h> . (Indeed, I haven't tested but I
wouldn't be surprised if the code was portable to Windows, where
prgnam may end up being something like "C:\Program Files\Some
Stuff\Include\stdio.h ".) The only very slightly conspicuous aspect
is that this #include appears a little lower than it ordinarily does
(ie. the first line).
In order to do this, the code defines a macro for fflush() like so:
#define fflush(f) b; c prgnam[] = __FILE__; int fflush(f)
When the fflush() function is later declared inside stdio.h , the
declaration is changed from looking something like this:
extern int fflush(FILE *f);
to this (newlines added for readability):
extern int b;
c prgnam[] = __FILE__;
int fflush(FILE *f);
This is a great little trick for injecting code into places it's not
supposed to be, and it's used again later.
(Technically speaking, the virus will be implanted into whichever
system file defines fflush() , which is usually stdio.h , but could
be any file #include d therein. In any case, #include <stdio.h>
will still pick up the implanted virus.)
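The same trick can be tried in a single file without touching any system header. In the sketch below, frob(), unused_decl, captured and declared_in() are hypothetical names, and a #line directive stands in for "being inside another header" by changing what __FILE__ reports at that point.

```c
#include <string.h>

/* Rewrites the next declaration of frob() that the preprocessor sees.
   __FILE__ is expanded where the macro is *used*, not where it is defined. */
#define frob(x) unused_decl; const char captured[] = __FILE__; int frob(x)

/* Pretend that the following declaration lives in a system header. */
#line 1 "pretend_header.h"
extern int frob(int n);

/* Back in "our" file: stop rewriting any further mentions of frob(). */
#line 1 "demo.c"
#undef frob

/* Returns nonzero if the captured file name equals name. */
int declared_in(const char *name)
{
    return strcmp(captured, name) == 0;
}
```

After preprocessing, the single declaration has become three: a dummy extern int, the captured[] array holding the header's name, and the original prototype.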
Obfuscation 2: Using stdio.h
The second main obfuscation is being able to use the contents of
prgnam . This is done by setting argv[0] (referred to as *v ) to
point instead to prgnam . This seems innocent enough - the program
is just ensuring that argv[0] (which shows up in ps output, etc)
is what it should be: prgnam , presumably the program's name. So
this is a misleading identifier name.
This is followed by a subtle (intentional) bug in the subsequent loop
over argv , so that it starts from argv[0] instead of argv[1] .
The loop pointer p starts at v++ , ie. v = argv , when it should
actually start from ++v , ie. v = argv + 1 . Hidden inside the mess
of the for loop, it's easy to miss that:
for (q = (s = p = v++) + ac; p < q; s = p++)
should actually be:
for (q = (s = p = ++v) + ac; p < q; s = p++)
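To see the difference in isolation, the two loop headers can be wrapped in small functions that report the first string each loop visits. The function names are made up for this sketch; only the for headers are taken from the text above.

```c
#include <stddef.h>

/* Loop header as it appears in the program: v++ yields the old value of v,
   so the first string visited is v[0]. */
const char *first_visited_postinc(int ac, char **v)
{
    char **p, **q, **s;
    const char *first = NULL;
    for (q = (s = p = v++) + ac; p < q; s = p++) {
        first = *p;
        break;
    }
    (void)s;
    return first;
}

/* The "expected" header: ++v yields the new value, so the loop starts at
   what was originally v[1]. */
const char *first_visited_preinc(int ac, char **v)
{
    char **p, **q, **s;
    const char *first = NULL;
    for (q = (s = p = ++v) + ac; p < q; s = p++) {
        first = *p;
        break;
    }
    (void)s;
    return first;
}
```

With an argv-style array, the first function returns the program-name slot and the second returns the first real argument.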
Obfuscation 3: Special treatment of stdio.h
The third obfuscation is the way that one set of lines from a[] is
used for the first file (stdio.h ), and another set for the rest
(from the command line).
There are three functions that define mappings into the array a[] .
l() is a linear mapping, g() uses a list encoded into one of the
unused strings in a[] , and x() uses a simple 5-bit Fibonacci
linear feedback shift register (LFSR) to get a pseudo-random cycle
through the available 31 lines.
x() is used for the lines of swine.c itself, so that when they are
output unquoted as the program code, they appear in the correct
(unjumbled) order. l() is used to output the quoted string lines,
thus preserving their (jumbled) order. The lines specified by the
coded string used by g() are the lines that are output to stdio.h ,
and this is used for both the unquoted program code and the quoted
strings, so that the lines used by the stdio.h quine are unjumbled
and only require l() to self-output.
For easy access, the array r[] has pointers to these three functions
in order: l() , g() , x() . In the main argv loop inside
main() , p points to the current filename, and s points to the
previous one, since the loop increment is s = p++ . Thus, (p-s) is
always 1. The macro da() , where the functions in r[] are
referenced, uses array indices 1+p-s (=2) to access x() , and
1-(p-s) (=0) to access l() . However, there is another subtle
intentional bug: s is incorrectly initialised as s = p = v++ .
Thus, the first time through the loop (which is the stdio.h case),
p == s , so p-s is 0, and so r[1±(p-s)] == r[1] == g() is
used in both cases - which is exactly what is required.
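The index arithmetic can be sketched with three stand-in functions that simply report their own names. The names l_map, g_map, x_map, code_mapping and string_mapping are hypothetical; only the order of r[] and the 1+(p-s) / 1-(p-s) indexing come from the description above.

```c
/* Stand-ins for the three mappings; each just reports its own name. */
static const char *l_map(void) { return "l"; }
static const char *g_map(void) { return "g"; }
static const char *x_map(void) { return "x"; }

/* Same order as r[] in the text: l(), g(), x(). */
static const char *(*const r[])(void) = { l_map, g_map, x_map };

/* delta is p - s: 0 on the first pass (stdio.h), 1 on every later pass. */
const char *code_mapping(int delta)   { return r[1 + delta](); }
const char *string_mapping(int delta) { return r[1 - delta](); }
```

With delta 0 both calls land on g(), and with delta 1 they split into x() for the code and l() for the quoted strings.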
Doing things this way allows the string version of swine.c to be
nicely jumbled inside a[] , which helps to mask the fact that there
are more strings than source lines. (Having a[] appear after the
main source also helps with this.) In turn, this also helps to hide
the somewhat-less-than-innocent stdio.h lines in amongst the
swine.c lines. It also means that the g() string encoding can
refer to any line, allowing the bulk of the self-replicating code from
swine.c to be cheaply recycled into stdio.h .
More details
The LFSR used by x() uses 5 bits, with taps in bits 3 and 5 (so that
it is a maximal LFSR). The initial value of 23 is carefully chosen so
that the value of 31 (which indexes to the null string at the end of
a[] ) occurs immediately after all of the swine.c lines.
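A maximal 5-bit Fibonacci LFSR of this general kind might look like the sketch below. The bit numbering and shift direction here are assumptions, so this is not claimed to reproduce the exact order used by x() or the position of 31 in the cycle; it only shows that such a register walks through all 31 non-zero states before repeating.

```c
/* One step of a 5-bit Fibonacci LFSR. The feedback bit is bit 0 XOR bit 2,
   shifted in at the top (an assumed reading of "taps 3 and 5").
   The state is expected to be in the range 1..31. */
unsigned lfsr5_step(unsigned s)
{
    unsigned bit = (s ^ (s >> 2)) & 1u;
    return (s >> 1) | (bit << 4);
}

/* Number of steps until the state first returns to seed. */
unsigned lfsr5_period(unsigned seed)
{
    unsigned s = seed;
    unsigned n = 0;
    do {
        s = lfsr5_step(s);
        n++;
    } while (s != seed);
    return n;
}
```

Starting from any non-zero seed, lfsr5_period() returns 31, which is what "maximal" means for a 5-bit register.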
The encoding used by g() is very simple, and is shifted by 1
(modulo 32, the number of strings) so that the encoded string's
trailing '\0' maps to 31 (again, so as to end the list of strings).
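One possible reading of "shifted by 1, modulo 32" is sketched below. The function name g_decode and the exact formula are assumptions; the only property taken from the text is that the terminating '\0' must decode to 31.

```c
/* Hypothetical decoder: subtract 1 modulo 32 (written as +31 to stay
   unsigned), so that the string terminator '\0' decodes to 31. */
unsigned g_decode(char c)
{
    return ((unsigned char)c + 31u) % 32u;
}
```

Under this reading, a character with value 1 selects line 0, value 2 selects line 1, and so on, with the terminator doubling as the end-of-list marker.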
The da() macro is used for three things: counting the length of the
potential output, so as to know how far back to seek from the end of
the file (the length is different for stdio.h vs the rest, so it
cannot be hard-coded or computed once); comparing the contents of the
file against the strings; and actually doing the output. This is done
by passing in different function names, which eventually work their
way back to the pt() , ck() and ct() macros.
Because there are no trigraphs, backslash encoded non-printable
characters, or other stupendities, the quoting macro e() can be quite
simple.
stdio.h : hijacking main()
The stdio.h code features a macro to hijack any main() that is
later defined by a user program which #includes <stdio.h> . It works
in a similar way to the fflush() trick described earlier. The
#define will turn this:
int main(int argc, char *argv[]) {
...
}
into this (newlines added for readability):
int main_(int argc, char *argv[]);
int main(int ac, char *v[]) {
char *z = w;
qu(p=s=&z);
return main_(ac, v);
}
int main_(int argc, char *argv[]) {
...
}
(Where w has previously been set to __FILE__ .)
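The shape of this macro can be sketched on a stand-in function. The sketch below hijacks a hypothetical app_main() instead of main() so that it can be linked into a test harness, and hook() stands in for the self-replicating payload; neither name appears in swine.c.

```c
static int hook_calls;                   /* how often the injected code ran */
static void hook(void) { hook_calls++; }

/* Inside its own expansion, app_main is not expanded again, so the wrapper
   keeps the original name while the user's body is renamed app_main_. */
#define app_main(a, b)                                                  \
    app_main_(a, b);                                                    \
    int app_main(int ac, char **v) { hook(); return app_main_(ac, v); } \
    int app_main_(a, b)

/* What the unsuspecting programmer writes: */
int app_main(int argc, char *argv[])
{
    (void)argv;
    return argc * 10;
}

#undef app_main   /* otherwise later calls would be macro-expanded too */
```

The two macro arguments are the two parameter declarations, which is why the trick depends on the function being written with exactly two parameters.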
Thus, the main() that the compiler sees is the one that contains our
code (that outputs itself to stdio.h if possible), while the
user-specified main() is now called main_() . Even in a debugger,
the extra frame and introduction of an apparently munged "main_ "
function is unlikely to attract the attention of the programmer -
after all, compilers munge and rearrange things all the time.
stdio.h: redefining if()
The stdio.h code also includes an "interesting" macro for the if()
keyword. This purports to be for debugging purposes,
passing a stringified version of the if condition to the d()
("debug"?) function. In fact, d() is a simple linear congruential
generator (LCG) pseudo-random number generator (PRNG) which returns a
pseudo-random unsigned long. If this random number is a multiple of
the address of the d() function, then the effect of the entire
expression is to invert (by way of an XOR) the result of the
conditional expression that has been passed to the if() statement.
Thus, the overall effect is that approximately one if() conditional out of
every 10 million (on x86_64 Linux; this will vary on other architectures) will
inexplicably and randomly fail, in the sense that if the condition is true, the
else block will be executed, and if it is false, then the if block will be
executed.
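A rough sketch of such a macro follows. It is not the macro from swine.c: the LCG constants are arbitrary, and a variable named flip_divisor (an assumption made so the behaviour can be tested) takes the place of the address of d().

```c
static unsigned long lcg_state = 12345UL;
static unsigned long flip_divisor = 10000019UL; /* stand-in for the address of d() */

/* Looks like a debug hook, but ignores its argument and returns the next
   value of a linear congruential generator (constants are arbitrary). */
static unsigned long d(const char *cond_text)
{
    (void)cond_text;
    lcg_state = lcg_state * 1103515245UL + 12345UL;
    return lcg_state;
}

/* XOR the truth value of the condition with "PRNG output is a multiple of
   the divisor". The if inside the expansion is not expanded again. */
#define if(c) if ((!!(c)) ^ (d(#c) % flip_divisor == 0))

int is_positive(int x)
{
    if (x > 0)
        return 1;
    return 0;
}

#undef if   /* keep the rest of the translation unit sane */
```

With a large divisor the inversion is rare; setting flip_divisor to 1 makes every test invert, which is a convenient way to confirm that the XOR is wired up as intended.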
One interesting side effect of this is that the if keyword cannot be
trusted by the stdio.h code (or the swine.c code, because it
shares lines with the stdio.h code and in case it is run on an
already infected system). Instead, the y() macro implements an
"else-less if" by instead doing while(condition) { ... break;} .
Other code that attempts to test the "trustworthiness" of if() also
has to rely on similar tricks, or the ?: ternary operator.
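The while/break idiom can be shown in a few lines. WHEN() is a hypothetical spelling used for this sketch and is not the y() macro itself.

```c
/* An "else-less if" that never mentions the if keyword. */
#define WHEN(cond) while (cond)

int clamp_to_zero(int x)
{
    WHEN (x < 0) {
        x = 0;
        break;      /* leave after a single pass, like an if body */
    }
    return x;
}
```

Because the body always ends in break, the loop runs at most once, so it behaves like an if without an else and is immune to any redefinition of if.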
Bugbears
Needs an ASCII character set. But what doesn't these days?
There are more space-saving #define 's (eg. "rt " for "return ")
than I'd like, but they were necessary to make the program fit within
the limit.
There's no way to properly initialise the PRNG used by the if()
macro. Time-based calls would be best, but would require sucking in
other headers (like time.h ) from inside stdio.h , which might raise
suspicions. Similarly for getpid() . Hashing the contents of argv
and environ is probably the best that could be done, and even that's
not very good. Any/all of these could be easily done from inside the
hijacked main() , though.
The stdio.h code leaves behind a fair amount of namespace pollution,
both preprocessor macros and non-static variables. With more space,
the macros could be #undef 'd, m could be moved inside d() as a
static variable, and the remaining global identifiers could have
munged/innocent/misleading names.
Finally, the main() hijacking macro won't work for recursive code,
ie. code that calls main() again. This is because the call to
main() will also be macro expanded by the preprocessor, and the
expanded code won't make sense within a code block. Thankfully not
much does this (not much real code, anyway).