Assembly Languages
COMS W4995-02
Prof. Stephen A. Edwards
Fall 2002
Columbia University
Department of Computer Science

Assembly Languages
One step up from machine language
Originally a more user-friendly way to program
Now mostly a compiler target
Model of computation: stored program computer

Assembly Language Model
                 ...
                 add r1,r2
                 sub r2,r3
                 cmp r3,r4
    PC →         bne I1          ALU ↔ Registers ↔ Memory
                 sub r4,1
            I1:
                 jmp I3
                 ...
Assembly Language Instructions
Built from two pieces:

    add        R1, R3, 3
    Opcode     Operands

Opcode: what to do with the data
Operands: where to get the data

Types of Opcodes
Arithmetic, logical
• add, sub, mult
• and, or
• cmp
Memory load/store
• ld, st
Control transfer
• jmp
• bne
Complex
• movs

Operands
Each operand taken from a particular addressing mode:
Examples:

    Register       add r1, r2, r3
    Immediate      add r1, r2, 10
    Indirect       mov r1, (r2)
    Offset         mov r1, 10(r3)
    PC Relative    beq 100

Reflect processor data pathways
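
The addressing modes listed under Operands can be read as different ways of computing an operand value from a register number and a constant. The following is a minimal sketch for a made-up toy machine; the `Machine` struct, its sizes, and the function name `fetch_operand` are assumptions for illustration and do not correspond to any processor in these slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine: sizes and names are assumptions. */
enum { NUM_REGS = 8, MEM_WORDS = 64 };

typedef struct {
    int32_t reg[NUM_REGS];
    int32_t mem[MEM_WORDS];
} Machine;

typedef enum { MODE_REGISTER, MODE_IMMEDIATE, MODE_INDIRECT, MODE_OFFSET } Mode;

/* Read one word of memory, checking that the address is in range. */
static int32_t load(const Machine *m, int32_t addr)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    return m->mem[addr];
}

/* Fetch a source operand. r is a register number, k is a constant
   encoded in the instruction (unused by some modes). */
int32_t fetch_operand(const Machine *m, Mode mode, int r, int32_t k)
{
    assert(r >= 0 && r < NUM_REGS);
    switch (mode) {
    case MODE_REGISTER:  return m->reg[r];               /* add r1, r2, r3 */
    case MODE_IMMEDIATE: return k;                       /* add r1, r2, 10 */
    case MODE_INDIRECT:  return load(m, m->reg[r]);      /* mov r1, (r2)   */
    case MODE_OFFSET:    return load(m, m->reg[r] + k);  /* mov r1, 10(r3) */
    }
    return 0;
}
```

PC-relative mode is left out: it computes a branch target (PC plus a constant) rather than a data operand.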
Types of Assembly Languages
Assembly language closely tied to processor architecture
At least four main types:
CISC: Complex Instruction-Set Computer
RISC: Reduced Instruction-Set Computer
DSP: Digital Signal Processor
VLIW: Very Long Instruction Word

CISC Assembly Language
Developed when people wrote assembly language
Complicated, often specialized instructions with many effects
Examples from x86 architecture
• String move
• Procedure enter, leave
Many, complicated addressing modes
So complicated, often executed by a little program (microcode)
Examples: Intel x86, 68000, PDP-11

RISC Assembly Language
Response to growing use of compilers
Easier-to-target, uniform instruction sets
“Make the most common operations as fast as possible”
Load-store architecture:
• Arithmetic only performed on registers
• Memory load/store instructions for memory-register transfers
Designed to be pipelined
Examples: SPARC, MIPS, HP-PA, PowerPC
DSP Assembly Language
Digital signal processors designed specifically for signal processing algorithms
Lots of regular arithmetic on vectors
Often written by hand
Irregular architectures to save power, area
Substantial instruction-level parallelism
Examples: TI 320, Motorola 56000, Analog Devices

VLIW Assembly Language
Response to growing desire for instruction-level parallelism
Using more transistors cheaper than running them faster
Many parallel ALUs
Objective: keep them all busy all the time
Heavily pipelined
More regular instruction set
Very difficult to program by hand
Looks like parallel RISC instructions
Examples: Itanium, TI 320C6000

Example: Euclid’s Algorithm

    int gcd(int m, int n)
    {
      int r;
      while ((r = m % n) != 0) {
        m = n;
        n = r;
      }
      return n;
    }
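
Compilers commonly lay out a `while` loop with the test at the bottom and enter it by jumping to the test. As a rough sketch of that shape, here is the same algorithm rewritten in C with explicit labels and `goto`. The function name `gcd_lowered` and the label names are illustrative, and the `assert` is an added precondition rather than part of the original function.

```c
#include <assert.h>

/* Euclid's algorithm with the loop test placed at the bottom and
   entered by a jump, mirroring a typical compiled layout. */
int gcd_lowered(int m, int n)
{
    int r;
    assert(n != 0);      /* m % n is undefined when n is 0 */
    goto test;           /* enter the loop at its test */
body:
    m = n;
    n = r;
test:
    r = m % n;           /* remainder decides whether to loop again */
    if (r != 0)
        goto body;
    return n;
}
```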
i386 Programmer’s Model
32-bit registers (bits 31 to 0):

    eax, ebx, ecx, edx    Mostly General-Purpose Registers
    esi                   Source index
    edi                   Destination index
    ebp                   Base pointer
    esp                   Stack pointer
    eflags                Status word
    eip                   Instruction Pointer

16-bit segment registers (bits 15 to 0):

    cs    Code segment
    ds    Data segment
    ss    Stack segment
    es    Extra segment
    fs    Data segment
    gs    Data segment

Euclid on the i386

        .file "euclid.c"          # Boilerplate
        .version "01.01"
    gcc2_compiled.:
        .text                     # Executable
        .align 4                  # Start on 16-byte boundary
        .globl gcd                # Make “gcd” linker-visible
        .type gcd,@function
    gcd:
        pushl %ebp
        movl %esp,%ebp
        pushl %ebx
        movl 8(%ebp),%eax
        movl 12(%ebp),%ecx

Stack Before Call

                 n          8(%esp)
                 m          4(%esp)
    %esp →       R.A.       0(%esp)

Stack After Entry

                 n          12(%ebp)
                 m          8(%ebp)
                 R.A.       4(%ebp)
    %ebp →       old ebp    0(%ebp)
    %esp →       old ebx    -4(%ebp)
Euclid on the i386

        jmp .L6                   # Jump to local label .L6
        .p2align 4,,7             # Skip ≤ 7 bytes to a multiple of 16
    .L4:
        movl %ecx,%eax            # m = n
        movl %ebx,%ecx            # n = r
    .L6:
        cltd                      # Sign-extend eax to edx:eax
        idivl %ecx                # Compute edx:eax / ecx
        movl %edx,%ebx
        testl %edx,%edx           # AND of edx and edx
        jne .L4                   # Branch if edx was ≠ 0
        movl %ecx,%eax            # Return n
        movl -4(%ebp),%ebx
        leave                     # Move ebp to esp, pop ebp
        ret                       # Pop return address and branch

SPARC Programmer’s Model
32-bit registers (bits 31 to 0):

    r0                    Always 0
    r1 ... r7             Global Registers
    r8/o0 ... r15/o7      Output Registers (r14/o6 is the Stack Pointer)
    r16/l0 ... r23/l7     Local Registers
    r24/i0 ... r31/i7     Input Registers (r30/i6 is the Frame Pointer,
                          r31/i7 is the Return Address)

    PSW    Status Word
    PC     Program Counter
    nPC    Next PC
SPARC Register Windows
The output registers of the calling procedure become the inputs to the called procedure
The global registers remain unchanged
The local registers are not visible across procedures

    Window 1             Window 2             Window 3
    r8/o0 ... r15/o7
    r16/l0 ... r23/l7
    r24/i0 ... r31/i7  =  r8/o0 ... r15/o7
                          r16/l0 ... r23/l7
                          r24/i0 ... r31/i7  =  r8/o0 ... r15/o7
                                                r16/l0 ... r23/l7
                                                r24/i0 ... r31/i7

Euclid on the SPARC

        .file "euclid.c"          # Boilerplate
    gcc2_compiled.:
        .global .rem              # Make .rem linker-visible
        .section ".text"          # Executable code
        .align 4
        .global gcd               # Make gcd linker-visible
        .type gcd, #function
        .proc 04
    gcd:
        save %sp, -112, %sp       # Next window, move SP
        mov %i0, %o1              # Move m into o1
        b .LL3                    # Unconditional branch
        mov %i1, %i0              # Move n into i0
    .LL5:
        mov %o0, %i0              # n = r
    .LL3:
        mov %o1, %o0              # Compute the remainder of
        call .rem, 0              # m / n, result in o0
        mov %i0, %o1
        cmp %o0, 0
        bne .LL5
        mov %i0, %o1              # m = n (always executed)
        ret                       # Return (actually jmp i7 + 8)
        restore                   # Restore previous window
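
The register-window overlap described above can be modelled as one circular array of registers with a moving window pointer. This is only a sketch under stated assumptions: the window count `NWINDOWS = 8` is picked for illustration, the names `RegFile`, `reg`, `save_window` and `restore_window` are hypothetical, and window overflow/underflow handling and the hard-wired zero in r0 are not modelled.

```c
#include <assert.h>

enum { NWINDOWS = 8 };   /* assumption: the number of windows varies by chip */

typedef struct {
    int globals[8];               /* r0..r7, shared by all procedures */
    int windowed[NWINDOWS * 16];  /* each window adds 16 fresh registers */
    int cwp;                      /* current window pointer */
} RegFile;

/* Map an architectural register number (0..31) to its storage. */
int *reg(RegFile *f, int r)
{
    assert(r >= 0 && r < 32);
    if (r < 8)
        return &f->globals[r];
    /* outs at offset 0..7, locals at 8..15, ins at 16..23, so the ins of
       window w-1 land on the outs of window w. */
    return &f->windowed[(f->cwp * 16 + (r - 8)) % (NWINDOWS * 16)];
}

/* Model of "save": move to the next window on procedure entry. */
void save_window(RegFile *f)    { f->cwp = (f->cwp + NWINDOWS - 1) % NWINDOWS; }

/* Model of "restore": return to the caller's window. */
void restore_window(RegFile *f) { f->cwp = (f->cwp + 1) % NWINDOWS; }
```

After `save_window`, a value the caller wrote to r8/o0 is readable by the callee as r24/i0, while the callee's r16/l0 refers to different storage than the caller's.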
Digital Signal Processor Apps.
Low-cost embedded systems
• Modems, cellular telephones, disk drives, printers
High-throughput applications
• Halftoning, base stations, 3-D sonar, tomography
PC based multimedia
• Compression/decompression of audio, graphics, video

Embedded Processor Requirements
Inexpensive with small area and volume
Deterministic interrupt service routine latency
Low power: ≈50 mW (TMS320C54x uses 0.36 µA/MIPS)

Conventional DSP Architecture
Harvard architecture
• Separate data memory/bus and program memory/bus
• Three reads and one or two writes per instruction cycle
Deterministic interrupt service routine latency
Multiply-accumulate in single instruction cycle
Special addressing modes supported in hardware
• Modulo addressing for circular buffers for FIR filters
• Bit-reversed addressing for fast Fourier transforms
Instructions to keep the pipeline (3-4 stages) full
• Zero-overhead looping (one pipeline flush to set up)
• Delayed branches
Conventional DSPs

                     Fixed-Point              Floating-Point
    Cost/Unit        $5–$79                   $5–$381
    Architecture     Accumulator              load-store
    Registers        2–4 data, 8 address      8–16 data, 8–16 address
    Data Words       16 or 24 bit             32 bit
    Chip Memory      2–64K data+program       8–64K data+program
    Address Space    16–128K data             16M–4G data
                     16–64K program           16M–4G program
    Compilers        Bad C                    Better C, C++
    Examples         TI TMS320C5x             TI TMS320C3x
                     Motorola 56000           Analog Devices SHARC

Conventional DSPs
Market share: 95% fixed-point, 5% floating-point
Each processor comes in dozens of configurations
• Data and program memory size
• Peripherals: A/D, D/A, serial, parallel ports, timers
Drawbacks
• No byte addressing (needed for image and video)
• Limited on-chip memory
• Limited addressable memory on most fixed-point DSPs
• Non-standard C extensions to support fixed-point data

Example
Finite Impulse Response filter (FIR)
Can be used for lowpass, highpass, bandpass, etc.
Basic DSP operation
For each sample, computes

    $$y_n = \sum_{i=0}^{k} a_i x_{n+i}$$

where
$a_0, \ldots, a_k$ are filter coefficients,
$x_n$ is the nth input sample, $y_n$ is the nth output sample.
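
The FIR sum above translates directly into a loop with one multiply-accumulate per coefficient. The following is a minimal sketch in portable C; the function name `fir_sample` and the use of `int` samples with a `long` accumulator are assumptions for illustration, not how a fixed-point DSP represents its data.

```c
#include <assert.h>
#include <stddef.h>

/* Compute one output sample y_n = sum over i = 0..k of a[i] * x[n + i].
   The caller must supply at least n + k + 1 input samples in x. */
long fir_sample(const int *a, size_t k, const int *x, size_t n)
{
    assert(a != NULL && x != NULL);
    long acc = 0;
    for (size_t i = 0; i <= k; i++)
        acc += (long)a[i] * x[n + i];   /* one multiply-accumulate per tap */
    return acc;
}
```

With coefficients {1, 2, 3} and input {1, 0, 2, 4}, `fir_sample(a, 2, x, 0)` gives 1·1 + 2·0 + 3·2 = 7.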
56000 Programmer’s Model

    Source Registers      x1 x0, y1 y0
                          (x1, y1 in bits 47–24; x0, y0 in bits 23–0)
    Accumulators          a2 a1 a0, b2 b1 b0
                          (a2, b2 in bits 55–48; a1, b1 in bits 47–24;
                          a0, b0 in bits 23–0)
    Address Registers     r0 ... r7, n0 ... n7, m0 ... m7 (bits 15–0 each)
    Program Counter       bits 15–0
    Status Register       bits 15–0
    Loop Address          bits 15–0
    Loop Count            bits 15–0
    PC Stack              15 ... 0
    SR Stack              15 ... 0
    Stack pointer

56001 Memory Spaces
Three memory regions, each 64K:
• 24-bit Program memory
• 24-bit X data memory
• 24-bit Y data memory
Idea: enable simultaneous access of program, sample, and coefficient memory
Three on-chip memory spaces can be used this way
One off-chip memory pathway connected to all three memory spaces
Only one off-chip access per cycle maximum

56001 Address Generation
Addresses come from pointer register r0 ... r7
Offset registers n0 ... n7 can be added to pointer
Modifier registers cause the address to wrap around
Zero modifier causes reverse-carry arithmetic

    Address      Notation     Next value of r0
    r0           (r0)         r0
    r0 + n0      (r0+n0)      r0
    r0           (r0)+        (r0 + 1) mod m0
    r0 - 1       -(r0)        r0 - 1 mod m0
    r0           (r0)-        (r0 - 1) mod m0
    r0           (r0)+n0      (r0 + n0) mod m0
    r0           (r0)-n0      (r0 - n0) mod m0
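
The wrap-around rows of the address generation table can be sketched in C as pointer updates taken modulo the buffer length. This is a simplified model, not the 56001's actual address arithmetic: it assumes the circular buffer starts at address 0 and takes the buffer length directly as `modulus`, ignoring how the hardware encodes the modifier in m0. The names `AddrReg`, `post_increment` and `post_decrement` are hypothetical.

```c
#include <assert.h>

/* Simplified pointer register with a wrap-around length. */
typedef struct {
    unsigned r;        /* pointer register, e.g. r0 */
    unsigned modulus;  /* circular buffer length (assumption: base is 0) */
} AddrReg;

/* Model of "(r0)+": use the current address, then step forward with wrap. */
unsigned post_increment(AddrReg *p)
{
    assert(p->modulus > 0 && p->r < p->modulus);
    unsigned addr = p->r;
    p->r = (p->r + 1) % p->modulus;
    return addr;
}

/* Model of "(r0)-": use the current address, then step back with wrap. */
unsigned post_decrement(AddrReg *p)
{
    assert(p->modulus > 0 && p->r < p->modulus);
    unsigned addr = p->r;
    p->r = (p->r + p->modulus - 1) % p->modulus;
    return addr;
}
```

Stepping a pointer this way lets a FIR loop walk a fixed-size window of recent samples without ever testing for the end of the buffer in software.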
FIR Filter in 56001

    n       equ 20                # Define symbolic constants
    start   equ $40
    samples equ $0
    coeffs  equ $0
    input   equ $ffe0             # Memory-mapped I/O
    output  equ $ffe1

            org p:start           # Locate in prog. memory
            move #samples, r0     # Pointers to samples
            move #coeffs, r4      #   and coefficients
            move #n-1, m0         # Prepare circular buffer
            move m0, m4

FIR Filter in 56001

            movep y:input, x:(r0)         # Load sample into memory

            # Clear accumulator A
            # Load a sample into x0
            # Load a coefficient
            clr a   x:(r0)+, x0   y:(r4)+, y0

            rep #n-1                      # Repeat next instruction n-1 times

            # a = x0 × y0
            # Next sample
            # Next coefficient
            mac x0,y0,a   x:(r0)+, x0   y:(r4)+, y0

            macr x0,y0,a   (r0)-
            movep a, y:output             # Write output sample

TI TMS320C6000 VLIW DSP
Eight instruction units dispatched by one very long instruction word
Designed for DSP applications
Orthogonal instruction set
Big, uniform register file (16 32-bit registers)
Better compiler target than 56001
Deeply pipelined (up to 15 levels)
Complicated, but more regular, datapath
Pipelining on the C6
One instruction issued per clock cycle
Very deep pipeline
• 4 fetch cycles
• 2 decode cycles
• 1-10 execute cycles
Branch in pipeline disables interrupts
Conditional instructions avoid branch-induced stalls
No hardware to protect against hazards
• Assembler or compiler’s responsibility

FIR in One ’C6 Assembly Instruction

    FIRLOOP:
            LDH .D1 *A1++, A2         ; Fetch next sample
    ||      LDH .D2 *B1++, B2         ; Fetch next coeff.
    || [B0] SUB .L2 B0, 1, B0         ; Decrement count
    || [B0] B .S2 FIRLOOP             ; Branch if non-zero
    ||      MPY .M1X A2, B2, A3       ; Sample × Coeff.
    ||      ADD .L1 A4, A3, A4        ; Accumulate result

LDH: load a halfword (16 bits)
.D1: do this on unit D1
.M1X: use the cross path
[B0]: predicated instruction (only if B0 non-zero)
||: run these instructions in parallel

Peripherals
Often the whole point of the system
Memory-mapped I/O
• Magical memory locations that make something happen or change on their own
Typical meanings:
• Configuration (write)
• Status (read)
• Address/Data (access more peripheral state)
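
In C, memory-mapped peripheral registers are usually accessed through `volatile` objects so the compiler does not cache or drop reads of locations that change on their own. The following is a minimal sketch with an invented peripheral: the `Port` layout, the `STATUS_READY` bit, and the function `port_try_read` are assumptions for illustration. On real hardware the `Port` pointer would refer to a fixed address defined by the chip; here it is passed as a parameter so the sketch also runs on an ordinary host.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical peripheral: one status word and one data word. */
typedef struct {
    volatile uint16_t status;  /* read: bit 0 is set when data is available */
    volatile uint16_t data;    /* read: the next input sample */
} Port;

enum { STATUS_READY = 1 };     /* assumed meaning of status bit 0 */

/* Poll the status register once. Returns 1 and stores the sample if one
   is available, otherwise returns 0 and leaves *sample untouched. */
int port_try_read(Port *port, uint16_t *sample)
{
    assert(port != 0 && sample != 0);
    if ((port->status & STATUS_READY) == 0)
        return 0;
    *sample = port->data;
    return 1;
}
```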