RISC microprocessors

CPU photo from Wood be Nice

ARM microprocessors have been making the headlines.  I thought I would dig up my undergraduate paper on Reduced Instruction Set Computing (RISC) microprocessors.  This was written in 1995 with the resources of a small library.

As I was reading through, it struck me that only the ARM and PowerPC are still around today.  PowerPC is in IBM's line of AS/400 servers.  These servers can be found in most if not all financial institutions.

1a) Novix Forth

Unable to locate

 

1b) Mips R3000 [1]

The MIPS (Microprocessor without Interlocked Pipeline Stages) R3000 is a 32 bit RISC processor rated at 20 VAX MIPS (20 MHz).  The integer register set comprises thirty two 32 bit general purpose registers plus two 32 bit registers used for the results of multiply and divide operations.  Only 30 of the registers are truly general purpose: r0 is hardwired to zero and r31 is a link register for some instructions.  There are no register windows; MIPS contends that the chip area is not justifiable and that windows can be replaced by a good compiler.  In addition, a large register file makes it difficult to increase the clock rate.

The Memory Management Unit (MMU) and cache controller are implemented on chip.  There are separate caches for data and instructions.  The actual caches are implemented off chip using standard SRAM chips.  Floating point operations are executed on a separate chip, the Floating Point Accelerator.  Data ordering, little or big endian, is determined at power up by a control bit.  This cannot be dynamically changed.

In other processors, a hardware interlock forces the pipeline to stall if an instruction attempts to use data being loaded by the immediately preceding instruction.   The MIPS design does not include such hardware interlocks for loads, thus the name.  The compiler is required to arrange instructions so that one instruction will not attempt to use data loaded by the immediately preceding instruction.  If the compiler cannot do this, it must place a no-op in the delay slot (the instruction following the load).  MIPS compilers are able to fill load delay slots with a useful instruction an average of 90% of the time, and branch delay slots 50% of the time.  This simplifies the design of the pipeline.
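The load delay slot rule can be illustrated with a toy scheduler.  This is a minimal sketch and not the MIPS compiler's actual algorithm: the instruction tuples and the name `fill_load_delay_slots` are made up for illustration.  It only inserts a no-op when the instruction right after a load reads the loaded register, whereas a real compiler would first try to move a useful independent instruction into the slot.

```python
# Toy model: each instruction is (opcode, destination register, source registers).
NOP = ("nop", None, ())

def fill_load_delay_slots(program):
    """Insert a no-op after any load whose result is used by the next instruction."""
    scheduled = []
    for i, instr in enumerate(program):
        scheduled.append(instr)
        op, dest, _ = instr
        if op == "lw" and i + 1 < len(program):
            _, _, next_sources = program[i + 1]
            if dest in next_sources:
                scheduled.append(NOP)
    return scheduled

program = [
    ("lw", "r2", ("r4",)),        # r2 <- memory[r4]
    ("add", "r3", ("r2", "r5")),  # reads r2 immediately, so the slot needs a filler
    ("lw", "r6", ("r4",)),        # r6 <- memory[r4]
    ("sub", "r7", ("r3", "r5")),  # does not read r6, so no filler is needed
]
result = fill_load_delay_slots(program)
```

Only the first load gets a no-op, because the second load is already followed by an instruction that does not depend on it.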

 

1c) Motorola 88000 [1] [8]

The MC88100 is a 1.5 micron HCMOS RISC processor with a peak performance of 17 MIPS at 20 MHz.  All instructions are 32 bits long.  The companion chip is the MC88200 Cache Memory Management Unit (CMMU).  It has a cache 16 Kbytes in size.  Two MC88200s are needed to implement the data and instruction caches in a typical system.  The overall virtual address space is 4 Gbytes for the user and 4 Gbytes for the supervisor, separately for instructions and data, for a total of 16 Gbytes.  The main memory physical address space is limited to 4 Gbytes.

The 88100 features five independent pipelined execution units: the data unit, the instruction unit, the integer unit, the floating point adder and the floating point multiplier.  The integer unit has a three stage pipeline, whereas the adder has five stages and the multiplier six.  There are 32 general purpose registers, 64 IU general control registers and 64 FPU control registers.  Register r0 is hardwired to zero and r1 contains the subroutine return pointer.  The Scoreboard register is of particular interest.  Each of its 32 bits is associated with one of the 32 general purpose registers.  It is used for hardware synchronisation of the allocation and utilisation of the registers.
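The scoreboard idea can be sketched as one busy bit per register.  This is an illustrative model only, not the 88100's actual logic; the class name `Scoreboard` and its methods are made up for the example.

```python
class Scoreboard:
    """One busy bit per general purpose register, held in a 32-bit integer."""

    def __init__(self):
        self.bits = 0

    def mark_busy(self, reg):
        # r0 is hardwired to zero, so it never waits on a result
        if reg != 0:
            self.bits |= 1 << reg

    def mark_ready(self, reg):
        self.bits &= ~(1 << reg)

    def can_issue(self, sources, dest):
        """An instruction may issue only if none of its registers are pending."""
        needed = 0
        for reg in list(sources) + [dest]:
            needed |= 1 << reg
        return (self.bits & needed) == 0

sb = Scoreboard()
sb.mark_busy(5)                                      # a slow operation will write r5
blocked = sb.can_issue(sources=[5, 6], dest=7)       # reads r5, so it must wait
independent = sb.can_issue(sources=[2, 3], dest=4)   # touches no pending register
sb.mark_ready(5)                                     # the result has arrived
after = sb.can_issue(sources=[5, 6], dest=7)
```

The instruction that reads r5 is held back until the bit clears, while the independent instruction can go ahead.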

 

1d) IBM 801 [7]

Started in 1975, this was the first RISC type system, developed as an experimental computer.  The IBM 801 was implemented on ECL MSI chips.  Its architecture and I/O organization enable the CPU to execute an instruction at almost every cycle.  The system features separate data and instruction caches.  With a 66 ns cycle time, it returns a performance of 10 MIPS.
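As a quick check on these figures, the sketch below relates the 66 ns and 10 MIPS numbers.  It assumes the 66 ns is the machine cycle time; the function names are made up for illustration.

```python
def peak_mips(cycle_time_ns):
    """Peak MIPS if one instruction completes every cycle."""
    instructions_per_second = 1e9 / cycle_time_ns
    return instructions_per_second / 1e6

def implied_cpi(cycle_time_ns, measured_mips):
    """Average cycles per instruction implied by a measured MIPS figure."""
    return peak_mips(cycle_time_ns) / measured_mips

peak = peak_mips(66)        # a little over 15 MIPS
cpi = implied_cpi(66, 10)   # roughly 1.5 cycles per instruction
```

Under that assumption the peak is about 15 MIPS, so 10 MIPS corresponds to an average of roughly 1.5 cycles per instruction.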

The 801 is a 32 bit machine with 32 bit addresses.  There are a total of thirty two CPU registers, but without register windowing.  This system is designed for use with high level languages and a sophisticated compiler.  The 801 features a simple instruction set which can be hardwired.

All instructions are aligned on full word boundaries in memory.  There  are a total of 120 instructions.  Multiplication is supported by a multiply step instruction.  The same applies for division.
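A multiply step instruction lets software build a full multiply out of repeated single steps.  The sketch below uses the generic shift-and-add technique, one step per multiplier bit; it is not the 801's exact multiply step semantics, and the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def multiply_step(accumulator, multiplicand, multiplier):
    """One step: add the multiplicand if the low multiplier bit is set, then shift."""
    if multiplier & 1:
        accumulator = (accumulator + multiplicand) & MASK32
    return accumulator, (multiplicand << 1) & MASK32, multiplier >> 1

def multiply(a, b, width=32):
    """Unsigned multiply built from `width` steps, result truncated to 32 bits."""
    accumulator, multiplicand, multiplier = 0, a & MASK32, b & MASK32
    for _ in range(width):
        accumulator, multiplicand, multiplier = multiply_step(
            accumulator, multiplicand, multiplier
        )
    return accumulator

product = multiply(1995, 801)
```

Division can be built from a divide step in the same way, using shift and subtract.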

 

1e) Acorn ARM [4] [5]

The ARM, or Acorn RISC Machine, was designed by Acorn in 1983 to replace the 6502 found in the BBC Micro.  ARM uses a three stage pipelined load/store architecture to achieve a performance of 3 MIPS.  There is no on chip cache or floating point unit.  It is implemented in double metal 3 micron CMOS on a 7 mm square die containing 25 000 transistors.

The ARM has twenty full 32 bit registers, a 32 bit data bus and a 26 bit address bus, giving 64Mbytes of address space. Only sixteen registers are normally available to the programmer.  During interrupts the extra registers become available to the processor to simulate a DMA channel without needing to save any of the user’s registers.   Register 15 contains the program counter; it also holds the status flags as there is no status register.
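Because the address bus is 26 bits wide and instructions are word aligned, the program counter leaves room in register 15 for the flags.  The sketch below shows one way to pack and unpack such a register.  The bit positions follow the layout of the 26 bit ARM processors as I understand it (flags N, Z, C, V, I, F in bits 31 to 26, the program counter in bits 25 to 2, and the mode in bits 1 to 0); they are not given in the text above, and the function names are made up for illustration.

```python
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28, "I": 27, "F": 26}
PC_MASK = 0x03FFFFFC    # bits 25..2: word aligned program counter
MODE_MASK = 0x00000003  # bits 1..0: processor mode

def pack_r15(pc, flags=(), mode=0):
    """Combine program counter, flags and mode into one 32-bit value."""
    if pc & ~PC_MASK:
        raise ValueError("pc must be word aligned and fit in 26 bits")
    value = pc | (mode & MODE_MASK)
    for name in flags:
        value |= 1 << FLAG_BITS[name]
    return value

def unpack_r15(value):
    """Split a 32-bit value back into program counter, flag set and mode."""
    flags = {name for name, bit in FLAG_BITS.items() if value & (1 << bit)}
    return value & PC_MASK, flags, value & MODE_MASK

r15 = pack_r15(0x8000, flags=("Z", "C"), mode=3)
pc, flags, mode = unpack_r15(r15)
address_space_bytes = 1 << 26   # 64 Mbytes
```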

All instructions are 32 bits (aligned on word boundaries), divided into several fields, and can be fetched in one clock cycle.  The instructions are decoded directly by hard wired logic.  There are 44 basic instruction codes, with no multiply or divide instructions.  Most of the instructions can be executed in one clock cycle, except for the load and store multiple register instructions.

The original ARM has evolved into the ARM6, a macro cell with 32 bit addressing.  One example is the ARM610 ASIC design.  In addition to the ARM6 CPU core, it contains a 4 Kbyte instruction and data cache, a write buffer and an MMU optimized for object oriented operating systems.  The ARM610 can be found in the Newton.

1f) Pyramid [8]

The Pyramid 90x is a 32 bit universal computing system manufactured by Pyramid Technology Corp, implemented in Schottky TTL MSI on three boards, with an 8 MHz clock.  It was the first commercial RISC type system when it was announced in 1983.  It has 528 32 bit registers.  Of these, 16 are global and seen by all procedures.  Each procedure sees a total of 64 registers.  It can support up to 15 levels of nested procedures without accessing memory.  Directly supporting floating point operations, the Pyramid has six data types.  There are over 100 instructions of different lengths: 32 bits, 64 bits and 96 bits.  The Pyramid can be regarded more as a reduced CISC.  The CPU is organized as a three stage instruction pipeline.  There are separate data (32 Kbytes) and instruction (4 Kbytes) caches.  A virtual memory of 4 Gbytes is supported, with a fixed page size of 2 Kbytes.  The physical memory is 8 Mbytes.  The control unit is microprogrammed, with an auxiliary MC68000-based system for system support and diagnostics.
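The register counts above fit together if adjacent windows overlap.  The arithmetic below is a sketch under one assumption that is not stated in the text: that 16 of a procedure's registers are shared with the procedure it calls, for passing parameters.

```python
TOTAL_REGISTERS = 528
GLOBAL_REGISTERS = 16
VISIBLE_PER_PROCEDURE = 64
OVERLAP = 16  # assumed: registers shared between caller and callee

windowed = TOTAL_REGISTERS - GLOBAL_REGISTERS          # registers in the window bank
per_window = VISIBLE_PER_PROCEDURE - GLOBAL_REGISTERS  # non-global registers one procedure sees
new_per_call = per_window - OVERLAP                    # fresh registers consumed by each call
windows = windowed // new_per_call                     # windows that fit in the bank
nesting_levels = windows - 1                           # calls below the first procedure
```

With that overlap the bank holds 16 windows, which matches the 15 levels of nesting quoted above.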

 

1g) Inmos Transputer [6] [7]

Inmos envisaged that the future of computing belonged to interconnected microprocessors.  The Transputer is designed to communicate with other Transputers in a parallel processing network so that as many of them as desired can be used with minimum overhead.  Connecting conventional microprocessors together to perform in such an environment is a tedious chore, complicated by the need for a parallel bus to be securely shared and the terrifying problems of synchronization.  The Transputer avoids these problems by communicating over 4 high speed serial links at 10 Mbits/s each.  The necessary synchronization is built into the instruction set.

The user is not expected to program in assembly language, but in a specially developed high level language, Occam.

The Transputer is a 32 bit machine, with 32 bit addresses, giving 4 Gbytes of memory.  The address space is linear and it does not support virtual memory.  The architecture does not differentiate between on chip and off chip memory.  There is no on chip cache or floating point unit, but there are 4 Kbytes of main memory in on chip SRAM.

The Transputer has six CPU registers: a workspace pointer, an instruction pointer, an operand register and a three register evaluation stack.  Different concurrent tasks have their own workspace using these registers.  Task switching is a simple matter of changing a pointer.  There are 111 instructions with only one format, which is one byte long.  Since the Transputer is a 32 bit machine, four instructions can be fetched simultaneously.  The hardware is fully concurrent.  Floating point operations are supported by a run-time software package.
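Two of these points can be sketched in a few lines: a three register evaluation stack, and a 32 bit fetch that delivers four one byte instructions.  This is an illustrative model only.  The class and function names are made up, the register names A, B and C are used for convenience, and the low-byte-first order in `split_word` is an assumption.

```python
class EvaluationStack:
    """Three register stack: A is the top, then B, then C."""

    def __init__(self):
        self.a = self.b = self.c = 0

    def push(self, value):
        # The old contents of C fall off the bottom
        self.a, self.b, self.c = value, self.a, self.b

    def pop(self):
        value = self.a
        self.a, self.b = self.b, self.c
        return value

    def add(self):
        # Combine the top two entries and leave the sum in A
        right = self.pop()
        left = self.pop()
        self.push(left + right)

def split_word(word):
    """Split a 32-bit fetch into four one byte instructions, low byte first."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]

stack = EvaluationStack()
stack.push(2)
stack.push(3)
stack.add()
instruction_bytes = split_word(0x44332211)
```

Because operands live on this tiny stack, instructions do not need register fields, which helps keep them to a single byte.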

Implemented in 2 micron CMOS with 250 000 transistors, its power consumption is 1 W at 20 MHz, with a 10 MIPS throughput.

 

1h) Sun Sparc [1] [3] [8]

SPARC, or Scalable Processor ARChitecture, is an architecture specification.  It is not connected to any specific hardware realization.  The detailed implementation is up to the semiconductor companies.  Scalability is seen by its creators as a wide spectrum of possible price/performance implementations.

 

The processor is subdivided into two basic units: the Integer Unit (IU) and the Floating Point Unit (FPU).  The IU performs basic processing and integer arithmetic, while the FPU does floating point calculations concurrently with the IU.  Each procedure running on the SPARC can use a total of thirty two 32 bit registers.  These registers are organised in the form of windows looking into a circular bank of registers, providing fast context switching.  The number of registers can be increased, i.e. “scaled”, but the software must be capable of such deep nesting.

In version 7 of the standard, there are no integer multiply or divide instructions, because of the difficulty of implementing them in a gate array.  In addition, there was no support for software emulation.  These instructions can be found in version 8.

Primarily due to cost concerns, the early versions of SPARC are implemented on a combined instruction and data bus architecture.  In the interest of performance, the Fujitsu Embedded SPARC Processor shows the start of the trend towards separate buses.  The latest definition, version 9, upgrades SPARC to 64 bits, with graphics instructions supported by two dedicated execution units.  The SPARC, in its various forms, is the most successful RISC processor.

 

1i) Intel 860 [1]

The i860 microprocessor, also known as the N10, is a significant technological achievement.  It is the first single chip microprocessor to integrate relatively large instruction (4 Kbytes) and data (8 Kbytes) caches, a TLB based memory management unit, a pipelined RISC integer unit, high performance pipelined floating point units, and a 3-D graphics unit.  The “N” in “N10” presumably stands for “numeric”, and there is a heavy emphasis on floating point performance.

The i860 contains two semi-independent processing units, each with its own set of thirty-two 32 bit registers.  One of the units handles integer operations, and the other handles floating point and graphics instructions.  By dividing floating point operations into two types, scalar and pipelined, the floating point processing unit can be split into two.  Scalar instructions are traditional instructions, whereas pipelined instructions are designed for vector calculations.  As a result, the i860 is capable of impressive peak performance.  Under the right circumstances, the integer core and the two floating point units can each generate a new result every cycle.  Thus, at 40 MHz, the i860 can execute bursts at 40 native integer MIPS and 80 single-precision MFLOPS.  In addition, the chip has a special graphics hardware unit, with instructions that speed up hidden surface elimination and smooth shading algorithms for 3-D graphics.

 

1j) Intel 80960 [1]

Intel’s 80960 is a RISC-inspired, top of the line 32 bit embedded controller with a radically new architecture.  It is a RISC style processor with a load/store architecture and a large register set, but with complex multipart addressing modes.  On chip there is a large collection of support features: floating point operations, power on self test, on chip debug, trace and breakpoint circuitry, to name a few.  This system is aimed at the high end controller market.  But there is no facility for on chip EPROM or ROM program memory, timers, counters, or serial ports.  The only peripheral device on chip is an interrupt controller.

There is an instruction cache of 1 Kbyte in size, using the least recently used algorithm.  There isn’t a data cache, but there is a small amount (1.5 Kbytes) of RAM for interrupt vectors, supervisor use and some register caching.  It uses register scoreboarding to improve performance.  Scoreboarding is a control mechanism by which a heavily pipelined processor keeps track of what resources are busy at any given instant.  It is useful when relatively slow operations in an instruction stream are followed by instructions that use different processor resources and don’t require the preceding instructions’ results.  In such cases, the later instructions may be safely initiated even though the prior ones are not yet completed.

The processor is at most RISC like, with 184 instruction mnemonics, 12 data types and 7 addressing modes.  Compare this with SPARC’s 99 instructions, 6 data types and 2 addressing modes. The 960CA performance at 33 MHz is 66 native MIPS.  It contains 600,000 transistors with a chip size 385 x 575 mils.  The 486 has approximately 1.2 million transistors.

 

1k) PowerPC [2]

In 1991, Apple, IBM and Motorola formed an alliance whose goal was to create a new hardware and software standard for personal computing.  The result is the PowerPC architecture.  After the introduction of various implementations (601, 603, 604), in October 1994, the high performance 64 bit PowerPC 620 was released.

The 620, a single chip RISC processor, is implemented on a Harvard style architecture, i.e. separate code and data paths.  The data path is 128 bits wide, so it fetches two longwords (64 bits each) of data during every bus access.  There is a precoder on the code bus between the code cache and the bus interface unit, shortening the pipelines to five stages.  The code and data caches are each 32 Kbytes in size.  Each cache has its own memory management unit and functions independently.  At the heart of the 620, there are six independent execution units: a load/store unit, a branch unit, an FPU, and three integer units.  This enables up to four instructions to be fetched and dispatched at each clock cycle.  Results of operations are stored in renameable buffers.  These buffers make possible speculative execution of instructions based on branch prediction.  By controlling a mode bit, the 620 can execute little or big endian code on the fly.  The 620 is clocked at 133 MHz and runs at 3.3 V, with a die size of 331 mm².
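The difference between little and big endian ordering, and the bus width arithmetic, can be shown with Python's standard `struct` module.  This is only an illustration of byte ordering in general, not of how the 620 implements its mode bit.

```python
import struct

value = 0x12345678
big_endian = struct.pack(">I", value)      # most significant byte first
little_endian = struct.pack("<I", value)   # least significant byte first

# Reading bytes with the wrong byte order gives a different number
misread = struct.unpack("<I", big_endian)[0]

# A 128 bit data path moves two 64 bit longwords per bus access
bytes_per_access = 128 // 8
longwords_per_access = 128 // 64
```

The same four bytes read as 0x12345678 in one ordering and 0x78563412 in the other, which is why code and data must agree on the mode in use.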

 

1l) TFP microprocessor [9]

The TFP (Tremendous Floating Point) microprocessor is a superscalar implementation of the Mips Technologies architecture.  This floating point, computation-oriented processor can dispatch up to four instructions each clock cycle to two floating-point execution units, two memory load/store units, and two integer execution units.  The integer function units consist of two integer arithmetic logic units, one shifter and one multiply-divide unit.  The ALUs and shifter operate in one cycle.  The multiply-divide unit is not pipelined and has a latency of four cycles for a 32 bit multiply and six cycles for a 64 bit multiply.  The latency for division varies from 21 to 73 cycles depending on the number of significant digits in the result.  In each cycle, up to two integer operations are initiated.

The FPU contains two execution data paths, each capable of double-precision fused multiply-adds, simple multiplies, adds, divides, square roots and conversions.  Compares and moves take one cycle.  Adds, multiplies and fused multiply-adds take four cycles and are fully pipelined.  In comparison, divides and square roots are not pipelined.

The  split-level cache structure reduces cache misses by directing integer data references to a 16 Kbytes on chip cache, while channeling floating point  data references to a 4 Mbytes off chip cache.

 

1m) ECL microprocessor [10]

This microprocessor was developed to investigate VLSI ECL circuit techniques, custom CAD tools, high performance chip interfaces and advanced packaging techniques for high power microprocessors.  Current ECL microprocessors are built using a multiple gate array strategy, which results in significant interchip communication delays.  These gate arrays typically have a 30 W heat dissipation limit, which limits integration and speed.  As there is a limited gate selection available in a gate array macro library, there is also a significant penalty in terms of the number of gates in series required to implement a given function.

The ECL RISC processor is a custom design implemented in 1 um bipolar technology.  The die contains 468 000 transistors and 206 000 resistors.  The chip contains the CPU and on chip parity checked instruction and data caches of 2 Kbytes each.  The caches are implemented in bipolar RAM.

The architecture of the ECL RISC uses a subset of the R6000 architecture.  It is kept simple at the expense of cycle time and performance.  If there is a cache miss, the five stage pipeline is stalled while the required instruction is fetched from main memory.  An off chip differential clock drives an on chip phase locked loop to generate a multiple, from one to eight times, of the clock.  This clock is for on chip use only.

The processor works with a three stage pipelined second level board cache containing both instructions and data.  By organizing the external cache RAMs into four banks, operation of the external interface at 300 MHz with 10 ns RAM is possible.
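The benefit of splitting the cache RAMs into banks can be checked with some simple arithmetic.  The sketch below takes the clock as 300 MHz, the figure in the title of reference [10], and assumes an idealized case where consecutive accesses always rotate across the banks; the function names are made up for illustration.

```python
def interleaved_cycle_ns(ram_cycle_ns, banks):
    """Effective time per access when consecutive accesses rotate across banks."""
    return ram_cycle_ns / banks

def max_interface_mhz(ram_cycle_ns, banks):
    """Highest interface rate the interleaved RAMs could sustain, in MHz."""
    return 1000.0 / interleaved_cycle_ns(ram_cycle_ns, banks)

clock_period_ns = 1000.0 / 300               # about 3.33 ns at 300 MHz
effective_ns = interleaved_cycle_ns(10, 4)   # 2.5 ns with four banks of 10 ns RAM
keeps_up = effective_ns <= clock_period_ns
single_bank_mhz = max_interface_mhz(10, 1)
four_bank_mhz = max_interface_mhz(10, 4)
```

A single bank of 10 ns RAM would cap the interface at 100 MHz, while four banks leave headroom above 300 MHz in this idealized model.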

 

References

1.    “A guide to RISC microprocessors” edited by Michael Slater, Academic Press.

2.    “PowerPC 620 soars” by Tom Thompson and Bob Ryan, Byte, November 1994.

3.    “SPARC Strikes back” by Peter Wayner, Byte, November 1994.

4.    “The Acorn RISC Machine” by Dick Pountain, Byte, January 1986.

5.    “A call to ARM” by Dick Pountain, Byte, November 1992.

6.    “The Transputer and its Special Language Occam” by Dick Pountain, Byte, August 1984.

7.    “RISC Architecture” by Daniel Tabak, Research Studies Press, 1987.

8.    “RISC Systems” by Daniel Tabak, Research Studies Press, 1990.

9.    “Designing the TFP Microprocessor” by Peter Yan-Tek Hsu, IEEE Micro, April 1994.

10.  “Designing, Packaging, and Testing a 300 MHz 115W ECL Microprocessor” by Norman P Jouppi, Patrick Boyle and John S Fitch, IEEE Micro, April 1994.

 

Google Now and battery life

Google Now is an all-seeing guide in your phone.  It knows where you are and where you will be going.  It pops up the bus schedule if you are near a bus stop, or the train schedule if you are near a train station.

Google tells you last night's scores for your favourite teams.  It also keeps track of your favourite stocks and the weather.  A better equivalent is iOS’s Siri.  I say better because I feel it is much more tightly integrated.  Google remembered my birthday.

Google Now wishes Happy Birthday

All this is accomplished by massive polling of the WiFi networks in the area.  With Google Maps, the location can be identified and context sensitive information displayed.  Does this sound familiar?  Yes, another vendor that provides location data via WiFi is Skyhook.  This means the battery life is very short.  In my first week, I lost about 4 hours off my 11 hour battery life.  I could not do much and nearly ran dry a few times.  In the end I gave up and turned off Google Now.

After a few weeks of poking around, I found that the big culprit for the battery drain was Location Tracking.  Since I like the cards, I decided to run Google Now without Location Tracking.  Surprisingly, the battery life was acceptable after the first two days.  People on Samsung phones have reported acceptable battery life.

So it was with much dread that I decided to try Google Now with Location Tracking.  Let it learn about my travels and habits.  Let it learn until it decides it has enough and goes easy on the polling.  That started on 2 Nov 12.

Let's see what the future brings.

 

27 Oct, No Google Now or location tracking.  1.5% of battery per hour.

Google Now without location tracking.
27 Oct, Google Now without location tracking

 

4 Nov, I went to sleep with 100% of battery life, WiFi and GPS turned on.  Below is a screenshot first thing in the morning.  Note blue lines on the Awake bar.

5 Nov 12, overnight

 

6 Nov. Location settings were not automatically updating (the option was not checked).  I have to reset the start of the baseline period to today.

10 Nov. The battery graphs are all screwy as I did not have the chance to fully charge the phone before sleeping. From what I can see, the Awake bar is full of CPU activity, even when the device is stationary. I have tried to keep both WiFi and GPS on.

12 Nov

12 Nov Overnight

25 Nov

Starting from about 85% overnight to the next morning at 70% over 7.5 hours, or about 2% per hour.

Discharge from 100% 25 Nov

26 Nov

Overnight discharge from 100%, remaining battery was 80%, or 2.6%/hr.

26 Nov 12 Overnight discharge from 100%

 

In summary, with Google Now turned on, power consumption rises by roughly a third to two thirds, draining the battery at ~2-2.5% per hour compared with ~1.5% per hour without it.
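For anyone repeating the measurements, the drain rate is just the drop in percentage divided by the hours elapsed.  The small sketch below uses the 25 Nov readings (85% to 70% over 7.5 hours) and the summary rates; the function names are made up, and the time-to-empty figures assume the drain stays constant, which a phone in real use will not do.

```python
def drain_rate(start_percent, end_percent, hours):
    """Average battery drain in percentage points per hour."""
    return (start_percent - end_percent) / hours

def hours_to_empty(rate_percent_per_hour, start_percent=100):
    """Hours until empty if the drain rate stayed constant."""
    return start_percent / rate_percent_per_hour

overnight_25_nov = drain_rate(85, 70, 7.5)
baseline_hours = hours_to_empty(1.5)          # without Google Now
with_google_now_hours = hours_to_empty(2.5)   # with Google Now, upper estimate
```

Computed this way, the 25 Nov readings give 2.0% per hour, and an idle phone would last about 67 hours without Google Now against about 40 hours with it.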

Update: 15 Apr 13

Google Now has been upgraded to pull information from Gmail to present more information.  The screenshot below shows an alert about flight departure times.