
Table of contents
1. What you should know already
2. What you should learn now
2.1 AT&T-style 80386 assembly language
2.1.1 Source and destination operand ordering
2.1.2 Instruction naming
2.1.3 Register naming
2.1.4 Constants and addressing modes
2.1.4.1 Constants
2.1.4.2 Indirect addressing modes
2.1.4.2.1 Immediate indirect
2.1.4.2.2 Register indirect
2.1.4.2.3 Base register plus offset indirect
2.1.4.2.4 Index register times width plus offset indirect
2.1.4.2.5 Base register plus index register times width plus offset indirect
2.2 UNIX policies that affect debugging
2.2.1 Advanced 80386 registers not available
2.2.2 Self-modifying code generally not permitted
2.2.3 

Symset is a tool that allows you to create your own debugging information for
FreeBSD/i386 executables.  Doing so makes it easier to reverse engineer an
application with gdb (the FreeBSD debugger).  By attaching debugging
information to an executable you can give names to functions and variables,
and even associate reverse engineered source code with a section of assembly.
This document describes the basics of the a.out symbol table (the format in
which debugging information is stored) and how to use symset to create
debugging information for reverse engineering a target application with
GDB.

1. What you should know already

In this tutorial I assume that you already have the following.

    1. Knowledge of the Intel 80386 instruction set.
    2. Basic knowledge of UNIX and your way around it.
    3. Access to an x86 machine running FreeBSD.
    4. Basic knowledge of debugging terminology (breakpoints, examining
       data and code).

2. What you should learn now

I recognize that many readers have assembly language backgrounds in DOS and
Windows, but little or no experience with assembly under UNIX.  In this
section I introduce you to some of the hurdles you may encounter in your
transition.

2.1 AT&T-style 80386 assembly language

The AT&T assembly language format used in BSD-based UNIX operating systems
(of which FreeBSD is a member), and consequently in the assembler listings
generated by gdb, is quite different from the Intel format that you likely
have learned.  If you are not familiar with the AT&T format, you should read
this section.

2.1.1 Source and destination operand ordering

Among the most noticeable differences between the AT&T and Intel formats is the
way they refer to source and destination operands within an instruction.
Under the Intel format an instruction's source and destination operands appear
on the right and left of the comma which separates them, respectively.
Under the AT&T format, these roles are reversed: source operands appear on the
left and destination operands appear on the right.  For an example, look at the
following set of instructions and how they are represented differently under
the two formats.

    +---------------+-------------------+
    | Intel         | AT&T              |
    +---------------+-------------------+
    | PUSH EBP      | pushl %ebp        |
    | MOV  EBP, ESP | movl  %esp, %ebp  |
    | SUB  ESP, 48  | subl  $0x48, %esp |
    +---------------+-------------------+
    Table 1.  AT&T format data assignment direction.

Other differences aside, you will notice that the AT&T format makes assignments
from left to right, and that modifying instructions modify their right-most
arguments.  (Constants in the Intel column are hexadecimal written without a
prefix, so the Intel "48" and the AT&T "0x48" are the same value.)  The primary
reason for the difference in direction is the VAX assembly format, for which
the AT&T style was originally invented.  (The Motorola 68000 and its
descendants were heavily influenced by the VAX, and their assembly language
format moves in this direction as well!)

2.1.2 Instruction naming

As you probably noticed in the example in Table 1, the AT&T format uses
slightly different names for 80386 instructions than the Intel format.  They
differ in keeping with VAX and Motorola traditions where instruction names
include a suffix which describes the size of the data they modify.  Under the
Intel format, these data size directives are normally described using the
'BYTE PTR', 'WORD PTR', and 'DWORD PTR' prefix phrases (if at all).  Table 2
illustrates an example.

    +------------------------------+------------------------+
    | Intel                        | AT&T                   |
    +------------------------------+------------------------+
    | MOVZX  EAX, BYTE PTR [ESI+5] | movzbl 0x5(%esi), %eax |
    | SUB    EAX, 30               | subl   $0x30, %eax     |
    | DEC    WORD PTR [EBX]        | decw   (%ebx)          |
    | INC    CX                    | incw   %cx             |
    | CMP    AL, 5                 | cmpb   $0x5, %al       |
    +------------------------------+------------------------+
    Table 2.  Data typing in AT&T format instruction names

Instruction suffixes are "b" for byte size operations (8 bits), "w" for
word size operations (16 bits), and "l" for double-word operations (32 bits).
As you may have noticed in the 'movzbl' (Move with zero-extend) instruction,
more than one suffix letter is used when an instruction's source and
destination operands differ in size.  The first suffix letter describes
the source operand while the second letter describes the destination.
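
The suffix rule can be restated as a small lookup.  The following is only a
sketch in Python; the names SUFFIX_BITS and suffix_sizes are my own, made up
for illustration, and it covers just the three suffix letters described above.

```python
# Operand sizes, in bits, named by the AT&T instruction suffix letters.
SUFFIX_BITS = {"b": 8, "w": 16, "l": 32}


def suffix_sizes(suffix):
    """Return the operand sizes (in bits) named by a suffix.

    A one-letter suffix names a single size.  A two-letter suffix names
    the source size first and the destination size second.
    """
    if len(suffix) not in (1, 2):
        raise ValueError("expected one or two suffix letters: %r" % suffix)
    return tuple(SUFFIX_BITS[letter] for letter in suffix)


print(suffix_sizes("l"))   # the "l" of subl: one 32-bit size
print(suffix_sizes("bl"))  # the "bl" of movzbl: 8-bit source, 32-bit destination
```

Note that the caller has to know where the mnemonic ends and the suffix
begins; the sketch does not try to split "movzbl" by itself.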

(In the remainder of this section I leave out explicit 'BYTE PTR', 'WORD PTR'
and 'DWORD PTR' prefixes from all Intel format examples to save space, unless
they are absolutely necessary.  Most assemblers and debuggers follow this
convention as well because they can determine the proper sizing of an
instruction merely by looking at its operands.  These Intel size prefixes are
only really necessary for instructions that would have otherwise ambiguous
sizings, such as the first example in Table 2).

2.1.3 Register naming

All CPU register names in the AT&T format are prefixed with the percent
("%") character.  (This differentiates them from labeled memory addresses
of the same name).

2.1.4 Constants and addressing modes

The AT&T assembly format also differs significantly from the Intel format
in the way that it represents indirect addressing modes (that is, ways of
reading from or writing to memory) and the way in which it represents constants.

2.1.4.1 Constants

Constants under the AT&T format are written according to the same rules
which govern C:  All constants in hexadecimal are prefixed with the characters
"0x" (or "0X"); all constants in octal are prefixed with a zero; and all
constants in decimal appear as-is, without any prefix.  Constants can also
be written in binary, in which case they are prefixed with the characters
"0b".

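To make the prefix rules concrete, here is a rough Python sketch that reads a
constant the way the rules above describe.  The function name parse_constant
is hypothetical, and the sketch ignores signs and the "$" prefix entirely.

```python
def parse_constant(text):
    """Convert a constant written with C-style prefixes to an integer."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)   # hexadecimal
    if lowered.startswith("0b"):
        return int(lowered[2:], 2)    # binary
    if len(lowered) > 1 and lowered.startswith("0"):
        return int(lowered[1:], 8)    # octal: a leading zero and nothing else
    return int(lowered, 10)           # decimal: no prefix at all


# The same three digits mean four different values depending on the prefix.
for text in ("0x100", "0100", "100", "0b100"):
    print(text, "=", parse_constant(text))
```

The octal case is the one most likely to surprise you: a stray leading zero
silently changes the value of a constant.
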
If a constant is used as an immediate value operand inside an instruction
(which is the most common place a constant is used) a special prefix of "$"
is necessary.  The dollar sign prefix differentiates the constant from an
immediate indirect address, which is explained in the section that follows
("2.1.4.2.1 Immediate indirect addressing mode").


2.1.4.2 Indirect addressing modes

Recall that the 80386 offers the programmer a choice of one of five indirect
addressing modes when writing an instruction: they are "immediate indirect",
"register indirect", "base register + offset indirect", "index register *
width + offset indirect", and "base register + index register * width + offset
indirect".  Table 3 illustrates an example instruction from each of these
categories in both formats.

    +-------------+----------------------------+-----------------------------+
    | Mode        | Intel                      | AT&T                        |
    +-------------+----------------------------+-----------------------------+
    | Immediate   | MOV EAX, [0100]            | movl           0x0100, %eax |
    | Register    | MOV EAX, [ESI]             | movl           (%esi), %eax |
    | Reg + Off   | MOV EAX, [EBP-8]           | movl         -8(%ebp), %eax |
    | R*W + Off   | MOV EAX, [EBX*4 + 0100]    | movl   0x100(,%ebx,4), %eax |
    | B + R*W + O | MOV EAX, [EDX + EBX*4 + 8] | movl 0x8(%edx,%ebx,4), %eax |
    +-------------+----------------------------+-----------------------------+
    Table 3.  The five 80386 indirect addressing modes and their syntax.

All AT&T format indirect addressing modes follow the general form
"OFFSET(BASE, INDEX, WIDTH)".  OFFSET, if present, must be a constant
integer.  BASE and INDEX, if either is present, must be registers.  WIDTH, if
present, applies to the register named in INDEX, and must be the constant
1, 2, 4, or 8.  If WIDTH is not specified, a default of '1' is assumed.

The above paragraph may look intimidating, but it simply states a rule that you
can use to create or comprehend any AT&T format indirect addressing mode you
encounter.  Under the Intel format, this syntax is equivalent to
"[INDEX*WIDTH + BASE + OFFSET]"; if any of these parameters doesn't apply
to a particular instruction, it is simply left out of the form.
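
The arithmetic behind the general form can be written out in a few lines.
This is a sketch only: effective_address is a name I made up, and the
register values below are hypothetical numbers picked for the example.

```python
def effective_address(offset=0, base=0, index=0, width=1):
    """Compute OFFSET + BASE + INDEX*WIDTH as a 32-bit address.

    base and index are the values currently held in the BASE and INDEX
    registers; pass 0 (the default) for a register that is left out.
    """
    if width not in (1, 2, 4, 8):
        raise ValueError("WIDTH must be 1, 2, 4, or 8")
    # Address arithmetic wraps around at 32 bits on the 80386.
    return (offset + base + index * width) & 0xFFFFFFFF


# Hypothetical register contents, chosen only for illustration.
edx, ebx, ebp = 0x1000, 3, 0xBFBFF000

# movl 0x8(%edx,%ebx,4), %eax  reads from this address:
print(hex(effective_address(offset=0x8, base=edx, index=ebx, width=4)))

# movl -8(%ebp), %eax  reads from this address:
print(hex(effective_address(offset=-8, base=ebp)))
```

A parameter that is missing from the operand contributes nothing to the sum,
which is why leaving it out of the written form is harmless.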

2.1.4.2.1 Immediate indirect addressing mode

Under the AT&T format, all immediate indirect addresses are written simply as
an OFFSET with a missing BASE, INDEX, and WIDTH parameter.  Since all three of
these parameters reside inside a parenthetical expression under the AT&T
format, the resulting empty parenthetical expression itself is left out.  This
leaves an instruction with a remarkably simple appearance: the immediate
indirect address is written by itself with no special prefix or suffix
characters, and constitutes the entire operand!

Thus, the instruction "MOV EAX, DWORD PTR [0100]" (Intel format) is written
as "movl 0x0100, %eax" under the AT&T format.

(Recall from the discussion about constants that the immediate constant form
of this instruction, "MOV EAX, 100", would be written as "movl $0x100, %eax".
The dollar sign signifies that the 0x100 is an immediate constant, rather than
an immediate address).
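
Because the two forms differ by a single character, it may help to see the
difference acted out.  The Python sketch below models only these two operand
forms; the function movl_to_eax and the memory contents are hypothetical, and
the sketch accepts only constants written with the "0x" prefix.

```python
def movl_to_eax(operand, memory):
    """Return the value that 'movl OPERAND, %eax' would place in %eax.

    Only two operand forms are modeled: "$CONST" (an immediate constant)
    and bare "CONST" (an immediate indirect address).
    """
    if operand.startswith("$"):
        # Immediate constant: the value itself is loaded.
        return int(operand[1:], 16)
    # Immediate address: the value stored at that address is loaded.
    return memory[int(operand, 16)]


# Hypothetical memory: the double-word at address 0x100 holds 0xdeadbeef.
memory = {0x100: 0xDEADBEEF}

print(hex(movl_to_eax("$0x100", memory)))  # the constant 0x100
print(hex(movl_to_eax("0x100", memory)))   # the contents of address 0x100
```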

2.1.4.2.2 Register indirect

A pure register indirect addressing mode instruction, such as "MOV EAX, [ESI]",
is written as the general form with only a BASE parameter: "movl (%esi), %eax".

2.1.4.2.3 Base register plus offset indirect

A register-plus-offset indirect addressing mode instruction, such as
"MOV EAX, [EBP-8]", is written as the general form with a BASE and OFFSET
parameter (but no INDEX or WIDTH): "movl -8(%ebp), %eax".

2.1.4.2.4 Index register times width plus offset indirect

An index-register-times-width-plus-offset indirect addressing mode instruction,
such as "MOV EAX, [EBX*4 + 0100]", is written as the general form with an
INDEX, WIDTH, and OFFSET parameter (but no BASE): "movl 0x100(,%ebx,4), %eax".

2.1.4.2.5 Base register plus index register times width plus offset indirect

A base-register-plus-index-register-times-width-plus-offset indirect addressing
mode instruction, such as "MOV EAX, [EDX + EBX*4 + 8]", is written as the
general form, with all parameters in place.  Namely, EDX as the BASE register,
EBX as the INDEX register, 4 as the WIDTH, and 8 as the OFFSET:
"movl 0x8(%edx, %ebx, 4), %eax".

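All five modes come out of the one general form by leaving parameters out.
The Python sketch below builds the AT&T operand text from whichever parts are
present; att_operand is a hypothetical helper written for this tutorial, not
part of any assembler or of GDB.

```python
def att_operand(offset=None, base=None, index=None, width=None):
    """Build an AT&T memory operand of the form OFFSET(BASE,INDEX,WIDTH)."""
    if offset is None and base is None and index is None:
        raise ValueError("an operand needs an OFFSET, a BASE, or an INDEX")
    # Non-negative offsets are shown in hexadecimal, negative ones in decimal.
    if offset is None:
        text = ""
    elif offset >= 0:
        text = hex(offset)
    else:
        text = str(offset)
    if base is None and index is None:
        return text  # immediate indirect: the empty parentheses are dropped
    inner = "" if base is None else "%" + base
    if index is not None:
        inner += ",%" + index
        if width is not None:
            inner += "," + str(width)
    return text + "(" + inner + ")"


print(att_operand(offset=0x100))                           # immediate
print(att_operand(base="esi"))                             # register
print(att_operand(offset=-8, base="ebp"))                  # base + offset
print(att_operand(offset=0x100, index="ebx", width=4))     # index*width + offset
print(att_operand(offset=0x8, base="edx", index="ebx", width=4))
```

Note how the comma before the INDEX register stays in place even when BASE is
missing; that leading comma is what tells you the register is an index.
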
2.2 UNIX policies that affect debugging

UNIX is a monolithic multi-tasking operating system and was designed as such
from the ground up.  Some of UNIX's policies affect the way 

3. GDB, the GNU Debugger

GDB is the one and only debugger available for FreeBSD, and the one for which
symset was created.  This section describes how to launch GDB, load a target,
execute it, generate assembly listings, insert breakpoints, and view and
modify both data and code.

GDB, like many UNIX applications, is a command-line debugger.  Configured
out-of-the-box, it is not quite as friendly as SoftICE or IDA, two major
debugging and disassembly tools for DOS.  However, it has several powerful
features that are unmatched in any other debugger.  If you find debugging via
command-line too difficult, there are graphical front-ends available for GDB.
However, I have never tried them myself, nor do I know which versions are the
most recent.

3.1 How to launch gdb

To launch GDB, simply type its name on the command line: "gdb".  You must then
load your target with the "file" command: "file <target>".  If you wish to save
keystrokes and time, you may specify the target on the command line when
starting GDB: "gdb <target>".

3.2 How to set program arguments

If your target requires command-line arguments then you must set them with the
"set args" command before running the target.

(Note to those who know a lot about arguments: In the strictest UNIX tradition
the target's executable name is automatically provided by GDB to the target as
the zeroth argument, "argv[0]".  The arguments you provide with the "set args"
command will appear as "argv[1]", "argv[2]", and so on).
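
The numbering can be pictured with a short Python sketch.  It is only an
approximation of what the target sees: target_argv is a name I made up, the
target name and arguments are hypothetical, and shlex.split merely imitates
the word splitting a shell would do on the "set args" string.

```python
import shlex


def target_argv(target, args):
    """Return the argument vector a target would see, argv[0] first."""
    # argv[0] is the target's own name; the "set args" words follow it.
    return [target] + shlex.split(args)


# As if you had typed:  set args -v "my file"
argv = target_argv("./target", '-v "my file"')
for position, value in enumerate(argv):
    print("argv[%d] = %s" % (position, value))
```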

3.3 How to run a target

To run a target, type "run".  

3.4 How to view assembly listings


3.5 How to place breakpoints
3.6 How to view and manipulate registers and data.
4. Symset
4.1 Symbol table basics
4.1.1 Symbol types
4.1.2 Symbol table entry format
4.2 GDB's data types
4.3 Marking and naming functions
4.4 Marking and naming global variables
4.5 Marking and naming function parameters

5.0 About the Author

Jeremy Cooper is a contributor to the NetBSD operating system, where he is the
port maintainer of the NetBSD/sun3x architecture.  He uses reverse engineering
techniques to verify compiler output and to fix BIOS bugs in Sun machines.