ARM Processor (September 2005)

ARM CoreSight Technology

ARM CoreSight technology is designed to meet the wide range of needs of embedded developers and silicon manufacturers, providing wide system visibility with minimal overhead and thus reducing processor cost.

ARM CoreSight technology provides a complete debug and trace solution for the entire system-on-chip (SoC). It makes both single ARM core and complex multi-core SoCs easy to debug, and thus speeds development of more reliable, higher-performance ARM Powered products.

By providing system-wide visibility through the smallest port, CoreSight technology provides the highest standard of debug and trace capabilities and can be leveraged for all cores and complex peripherals. CoreSight technology builds on ARM's current Embedded Trace Macrocell™ (ETM) products, which are widely licensed and supported by ARM RealView® development tools and all other leading tool vendors.

Key Benefits

• Higher visibility of complete system operation through fewer pins

• Direct access by the debugger to system memory, for visibility without affecting CPU operation and for faster code download

• Ability to debug any of multiple cores, even if other cores are in sleep mode or powered down

• Standard solution across all silicon vendors for widest tools support

• Re-usable for single ARM core, multi-core or core and DSP systems

• Enables faster time-to-market with greater reliability and higher performance products

• Supports the highest frequency processors, including ARM's new Cortex cores

• Builds on proven ARM ETM technology

Devices implementing CoreSight technology can comply at four independent levels:

CoreSight Debug

o Debug Access Port

o Embedded Cross Trigger

CoreSight ETM

o Embedded Trace Macrocells

CoreSight Multi-source Trace

o Trace Funnel

o Embedded Trace Buffer

o Trace Port Interface Unit

o AHB Trace Macrocell

o Instrumentation Trace

CoreSight Single Wire

o Single Wire Debug

o Single Wire Viewer

Memory Management


Introduction

RISC OS machines work with two different types of memory: logical and physical. Logical memory is memory as seen by the OS and the programmer. Your application begins at &8000 and continues until &xxxxx.


The physical memory is the actual memory in the machine.

Under RISC OS, memory is broken into pages. Older machines have a page size of 8, 16, or 32K (depending on installed memory), and newer machines have a fixed 4K page. If you were to examine the pages in your application workspace, you would most likely see that the pages are seemingly random, not in order. The pages relate to physical memory, combined to provide you with xxxx bytes of logical memory. The memory controller is constantly shuffling memory around so that each task that comes into operation 'believes' it is loaded at &8000. Write a little application to count how many Wimp polls occur every second and you'll begin to appreciate how much is going on in the background.
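The mapping just described can be sketched in a few lines of Python. This is purely illustrative - the page size, the table contents, and the function names are all invented for the example, not taken from RISC OS:

```python
# A minimal sketch of logical-to-physical page mapping, as the memory
# controller might arrange it.  The mapping itself is invented.
PAGE_SIZE = 4 * 1024  # newer machines use a fixed 4K page

# Logical page number -> physical page number.  Deliberately out of
# order, as you would see when examining application workspace.
page_table = {0: 37, 1: 5, 2: 90, 3: 12}

def logical_to_physical(addr):
    """Translate a logical address to a physical address via the page table."""
    page, offset = divmod(addr, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset

print(hex(logical_to_physical(0x1234)))  # logical page 1 -> physical page 5: 0x5234
```

Each task sees a tidy run of logical pages starting at its base, while the physical pages behind them can live anywhere.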

 

MEMC: Older systems

In ARM 2, ARM 250, and ARM 3 machines, memory is controlled by the MEMC (Memory Controller). This unit can cope with an address space of 64Mb, but in reality can only access 4Mb of physical memory. The 64Mb space is split into three sections:

0Mb - 32Mb : Logical RAM

32Mb - 48Mb : Physical RAM

48Mb - 64Mb : System ROMs and I/O
Parts of the system ROMs and I/O are mapped over each other, so reading from this area gives you code from ROM, while writing to it updates devices such as the VIDC (video/sound).
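The 64Mb split above is easy to capture as a small classifier. A sketch for illustration only - the region names come from the text, the function is not real RISC OS code:

```python
# Classify an address within the MEMC's 64Mb address space, following
# the three-section split described in the text.
MB = 1024 * 1024

def memc_region(addr):
    """Return which section of the MEMC address map an address falls in."""
    if 0 <= addr < 32 * MB:
        return "logical RAM"
    if addr < 48 * MB:
        return "physical RAM"
    if addr < 64 * MB:
        # reads here return ROM code; writes hit devices such as the VIDC
        return "ROM and I/O"
    raise ValueError("outside the 64Mb address space")

print(memc_region(40 * MB))  # -> physical RAM
```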

It is possible to fit up to 16Mb of memory to an older machine, but you will need a matched MEMC for each 4Mb. People have reported that simply fitting two MEMCs (to give 8Mb) is either hairy or unreliable, or both. In practice, the hardware to do this properly only really existed for the A540 machine, where each 4Mb was a slot-in memory card with an on-board MEMC.

The MEMC is capable of restricting access to pages of memory in certain ways: complete access, no access, no access in USR mode, or read-only access. Older versions of RISC OS only implemented this loosely, so while you needed to be in SVC mode to access hardware directly, you could quite easily trample over memory used by other applications.

 

MMU: Newer systems

The newer systems, with an ARM6 or later processor, have an MMU built into the processor. This consists of the translation look-aside buffer (TLB), access control logic, and translation table walk logic. The MMU supports memory accesses based upon 1Mb sections or 4K pages. The MMU also provides support for up to 16 'domains', areas of memory with specific access rights.
The TLB caches 64 translated entries. If the TLB holds an entry for the virtual address, the access control logic determines whether access is permitted. If it is, the MMU outputs the appropriate physical address; otherwise it signals the processor to abort.
If the TLB misses (it doesn't contain an entry for the virtual address), the walk logic will retrieve the translation information from the (full) translation table in physical memory. If the MMU is disabled, the virtual address is output directly as the physical address.
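That translation path can be modelled in miniature. This is a toy sketch, not ARM hardware behaviour: the table contents are invented, and real ARM access control (domains, section/page permissions) is reduced to a single user-access bit:

```python
# Toy model of the MMU path: consult the TLB, walk the translation
# table on a miss, and abort when the (simplified) permission denies
# user-mode access.
PAGE = 4 * 1024

# Full translation table in 'physical memory': vpage -> (ppage, user_ok)
translation_table = {0: (100, True), 1: (7, True), 2: (55, False)}
tlb = {}  # cache of recent translations (the real TLB holds 64 entries)

def translate(vaddr, user_mode=True):
    vpage, offset = divmod(vaddr, PAGE)
    if vpage not in tlb:                # TLB miss: table walk fetches the entry
        tlb[vpage] = translation_table[vpage]
    ppage, user_ok = tlb[vpage]
    if user_mode and not user_ok:       # access control check
        raise PermissionError("abort")  # MMU signals the processor to abort
    return ppage * PAGE + offset

print(hex(translate(PAGE + 4)))  # vpage 1 -> ppage 7
```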

It gets a lot more complicated; suffice it to say that more access rights are possible, you can specify memory to be bufferable and/or cacheable (or not), and the page size is fixed at 4K. A normal RiscPC offers two banks of RAM, and is capable of addressing up to 256Mb of RAM in fairly standard PC-style SIMMs, plus up to 2Mb of VRAM double-ported with the VIDC, plus hardware/ROM addressing.

On the RiscPC, the maximum address space of an application is 28Mb. This is not a restriction of the MMU but of the 26-bit processor mode used by RISC OS. A 32-bit processor mode could, in theory, allocate the entire 256Mb to a single task. All current versions of RISC OS are 26-bit.

 

System limitations

Consider a RiscPC with an ARM610 processor. The cache is 4K. The bus speed is 16MHz (note, only slightly faster than the A5000!), and the hardware does not support burst-mode memory accesses. Upon a context switch (i.e. making an application 'active') you need to remap its memory to begin at &8000 and flush the cache.
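The two costs of that context switch can be made concrete with a sketch. Everything here is invented for illustration (names, structures), but it mirrors the text: remap the incoming task's pages so it appears at &8000, then flush the cache, whose lines referred to the old mapping:

```python
# Illustrative model of a RISC OS-style context switch: remap pages so
# the task 'believes' it is at &8000, then flush the cache.
APP_BASE = 0x8000
PAGE = 4 * 1024

def context_switch(task_pages, page_table, cache):
    """Map task_pages starting at APP_BASE and flush the cache.
    Returns the number of pages remapped."""
    page_table.clear()
    base_page = APP_BASE // PAGE
    for i, phys_page in enumerate(task_pages):
        page_table[base_page + i] = phys_page  # task sees itself at &8000
    cache.clear()  # flushed: cached lines belonged to the old mapping
    return len(task_pages)
```

On a real ARM610 the flush is what hurts: the 4K cache must be refilled from a slow 16MHz bus every time a task is made active.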

The Pipeline


A conventional processor executes instructions one at a time, just as you expect it to when you write your code. Each execution can be broken down into three parts, which anybody who has learned this stuff at college will have burned into their memory: fetch, decode, execute.

In English...


  1. Fetch
    Retrieve the instruction from memory. Don't get all techie - whether the instruction comes from system memory or the processor cache is irrelevant, the instruction is not loaded 'into' the processor until it is specifically requested. The cache simply serves to speed things up. By loading chunks of system memory into the cache, the processor can satisfy many more of its instruction fetches by pulling instructions from the cache. This is necessary because processors are very fast (StrongARMs, 200MHz+; Pentiums up to GHz!) and system memory is not (33, 66, or 133MHz). To see the effect the cache has on your processor, use *Cache Off.
     

  2. Decode

Figure out what the instruction is, and what is supposed to be done.

  3. Execute

Perform the requested operation.

Each of these operations is performed in step with the electronic 'heartbeat', the clock. Example clock rates for several microprocessors used in Acorn products:



Machine                    Processor   Clock speed

BBC microcomputer          6502        2MHz
Acorn A310-A3000           ARM 2       8MHz
Acorn A5000                ARM 3       25MHz
Acorn A5000/I              ARM 3       30MHz
RiscPC600                  ARM610      33MHz
RiscPC700                  ARM710      40MHz
Early PC co-processor      486SXL-40   33MHz (not 40!)
RiscPC (StrongARM)         SA110       202MHz - 278MHz+

As the PC world shows, processors are running into GHz speeds (1,000,000,000 ticks/sec), which necessitates much in the way of speed tweaks (huge amounts of cache, an extremely optimized pipeline) because there is no way the rest of the system can keep up. Indeed, the rest of the system is likely to be operating at a quarter of the speed of the processor. The RiscPC is designed to work, I believe, at 33MHz. That is why people thought the StrongARM wouldn't give much of a speed boost. However, the small size of ARM programs, coupled with a rather large cache, made the StrongARM a viable proposition in the RiscPC. It bottlenecked horribly, but other factors meant that this wasn't so visible to the end user, so the result was a system much faster than the ARM710. More recently, the Kinetic StrongARM processor card attempts to alleviate these bottlenecks by installing a big wedge of memory directly on the processor card and using that. It even goes so far as to install the entirety of RISC OS into that memory, so you aren't kept waiting for the ROMs (which are slower even than RAM).

There is an obvious solution. Since these three stages (fetch, decode, execute) are fairly independent, would it not be possible to:

fetch instruction #3

decode instruction #2

execute instruction #1
...then, on the next clock tick...
fetch instruction #4

decode instruction #3

execute instruction #2
...tick...
fetch instruction #5

decode instruction #4

execute instruction #3
In practice, the answer is yes. And this is exactly what a pipeline is. Simply by doing this, you have just made your processor three times faster!
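The tick-by-tick overlap above can be simulated in a few lines. A sketch, not hardware: at each tick, instruction i is being fetched while i-1 is decoded and i-2 executed:

```python
# Simulate a simple three-stage fetch/decode/execute pipeline.  Once
# the pipeline fills, one instruction completes on every tick.
def pipeline_trace(n_instructions, stages=3):
    """Per clock tick, report which instruction occupies each stage
    as a tuple (fetch, decode, execute); None means the stage is empty."""
    trace = []
    for tick in range(n_instructions + stages - 1):
        row = tuple(tick - s if 0 <= tick - s < n_instructions else None
                    for s in range(stages))
        trace.append(row)
    return trace

for row in pipeline_trace(5):
    print(row)   # e.g. tick 2 prints (2, 1, 0): fetch #3, decode #2, execute #1
```

Five instructions finish in seven ticks instead of fifteen, which is where the roughly threefold speed-up comes from.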

Now, it isn't a perfect solution.



  • When it comes to a branch, the pipeline is flushed, as the instructions fetched after the branch are not required. This is why it is preferable to use conditional execution rather than branching.

  • Next, you have to keep in mind that the program counter is ahead of the instruction currently being executed. So if you see an error at 'x', the real error is quite possibly at 'x-8' (or 'x-12' on a StrongARM).
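The wind-back rule above is trivial arithmetic, shown here as a sketch. The offsets (8 for a classic three-stage ARM, 12 for a StrongARM) are taken from the text rather than derived:

```python
# Wind a reported PC back to the instruction that actually faulted.
# The default offset of 8 bytes (two 4-byte instructions) matches the
# classic three-stage ARM pipeline; pass 12 for a StrongARM.
def faulting_address(reported_pc, pc_offset=8):
    return reported_pc - pc_offset

print(hex(faulting_address(0x8010)))      # -> 0x8008
print(hex(faulting_address(0x8010, 12)))  # StrongARM -> 0x8004
```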



RISC vs CISC


In the early days of computing, you had a lump of silicon which performed a number of instructions. As time progressed, more and more facilities were required, so more and more instructions were added. However, according to the 80/20 rule, 20% of the available instructions are likely to be used 80% of the time, with some instructions used only rarely. Some of these instructions are very complex, so creating them in silicon is an arduous task. Instead, the processor designer uses microcode. To illustrate this, consider a modern CISC processor (such as a Pentium or 68000-series processor). The core, the base level, is a fast RISC processor. On top of that is an interpreter which 'sees' the CISC instructions and breaks them down into simpler RISC instructions.

Already, a pretty clear picture emerges: if the processor core is a simple RISC unit, why don't we use that directly? The answer lies more in politics than design. Acorn, however, saw this and, not being constrained by the need to remain compatible with earlier technologies, decided to implement their own RISC processor.

Up until now, we've not really considered the real differences between RISC and CISC, so...

A Complex Instruction Set Computer (CISC) provides a large and powerful range of instructions, at the cost of being less flexible to implement. For example, the 8086 microprocessor family has these instructions:

JA Jump if Above

JAE Jump if Above or Equal

JB Jump if Below

...


JPO Jump if Parity Odd

JS Jump if Sign

JZ Jump if Zero
There are 32 jump instructions in the 8086, and the 80386 adds more. I've not read a spec sheet for the Pentium-class processors, but I suspect it (and MMX) would give me a heart attack!

By contrast, the Reduced Instruction Set Computer (RISC) concept is to identify the sub-components of instructions and use those. As these are much simpler, they can be implemented directly in silicon, so they run at the maximum possible speed. Nothing is 'translated'. There are only two jump instructions in the ARM processor - Branch and Branch with Link. The "if equal, if carry set, if zero" type of selection is handled by condition codes, so for example:

BLNV Branch with Link NeVer (useful!)

BLEQ Branch with Link if EQual

and so on. The BL part is the instruction, and the following part is the condition. This is made more powerful by the fact that conditional execution can be applied to most instructions! This has the benefit that you can test something, then only do the next few commands if the criteria of the test matched. No branching off, you simply add conditional flags to the instructions you require to be conditional:
SWI "OS_DoSomethingOrOther" ; call the SWI

MVNVS R0, #0 ; If failed, set R0 to -1

MOVVC R0, #0 ; Else set R0 to 0
Or, for the 80486:
INT $...whatever...  ; call the interrupt

CMP AX, 0            ; did it return zero?

JE failed            ; if so, it failed, jump to fail code

MOV DX, 0            ; else set DX to 0

return:

RET                  ; and return


failed:

MOV DX, 0FFFFH       ; failed - set DX to -1

JMP return
The odd flow in that example is designed to allow the fastest non-branching throughput in the 'did not fail' case, at the expense of two branches in the 'failed' case.
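The ARM version can be modelled in Python to show what predication buys you. This is a sketch of the idea, not of real ARM semantics: both moves are 'executed' every time, but each one's effect depends on the overflow (V) flag, so no branch is ever taken:

```python
# Model the ARM fragment above: MVNVS R0,#0 / MOVVC R0,#0.  Each
# 'instruction' runs unconditionally, but its effect is predicated on
# the V flag, so the instruction stream never branches.
def result_with_predication(v_flag):
    r0 = None
    # MVNVS R0, #0 -- takes effect only if V is set (the SWI failed)
    if v_flag:
        r0 = -1
    # MOVVC R0, #0 -- takes effect only if V is clear
    if not v_flag:
        r0 = 0
    return r0

print(result_with_predication(True))   # -> -1
print(result_with_predication(False))  # -> 0
```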

Most modern CISC processors, such as the Pentium, use a fast RISC core with an interpreter sitting between the core and the instruction stream. So when you are running Windows 95 on a PC, it is not that much different to trying to get Windows 95 running on a software PC emulator. Just imagine the power hidden inside the Pentium...

Another benefit of RISC is that it contains a large number of registers, most of which can be used as general purpose registers.

This is not to say that CISC processors cannot have a large number of registers - some do. However, for its style of use, a typical RISC processor requires more registers to give it additional flexibility. Gone are the days when you had two general-purpose registers and an 'accumulator'.

One thing RISC does offer, though, is register independence. As you have seen above, the ARM register set defines at minimum R15 as the program counter and R14 as the link register (although, after saving the contents of R14, you can use this register as you wish). R0 to R13 can be used in any way you choose, although the Operating System defines R13 as a stack pointer. You can, if you don't require a stack, use R13 for your own purposes. APCS applies firmer rules and assigns more functions to registers (such as Stack Limit). However, none of these - with the exception of R15, and sometimes R14 - is a constraint applied by the processor. You do not need to worry about saving your accumulator in long instruction sequences; you simply make good use of the available registers.

The 8086 offers you fourteen registers, but with caveats: the first four (A, B, C, and D) are data registers (a.k.a. scratch-pad registers). They are 16-bit, accessed as two 8-bit halves, so register A is really AH (A, high-order byte) and AL (A, low-order byte). These can be used as general-purpose registers, but they also have dedicated functions - Accumulator, Base, Count, and Data. The next four registers are segment registers for Code, Data, Extra, and Stack. Then come the five offset registers: the Instruction Pointer (PC), SP and BP for the stack, then SI and DI for indexing data. Finally, the flags register holds the processor state. As you can see, most of the registers are tied up with the bizarre memory addressing scheme used by the 8086. So only four general-purpose registers are available, and even they are not as flexible as ARM registers.
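The AH/AL aliasing described above is just byte packing, and a two-line sketch makes it concrete:

```python
# AX is AH in the high byte and AL in the low byte; this models that
# aliasing on the 8086's 16-bit data registers.
def make_ax(ah, al):
    return ((ah & 0xFF) << 8) | (al & 0xFF)

def split_ax(ax):
    return ax >> 8, ax & 0xFF   # (AH, AL)

print(hex(make_ax(0x12, 0x34)))  # -> 0x1234
print(tuple(hex(b) for b in split_ax(0xABCD)))  # -> ('0xab', '0xcd')
```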

The ARM processor differs again in that it has a reduced number of instruction classes (Data Processing, Branching, Multiplying, Data Transfer, and Software Interrupts).

A final example of minimal registers is the 6502 processor, which offers you:

  Accumulator - for results of arithmetic instructions
  X register  - first general-purpose register
  Y register  - second general-purpose register
  PC          - Program Counter
  SP          - Stack Pointer, offset into page one (at &01xx)
  PSR         - Processor Status Register - the flags

While it might seem like utter madness to have only two general-purpose registers, the 6502 was a very popular processor in the '80s, and many famous computers were built around it. For the Europeans: consider the Acorn BBC Micro, Master, and Electron. For the Americans: consider the Apple II and the Commodore PET. The Oric uses a 6502, and the C64 uses a variant of it. (In case you were wondering, the Speccy uses the other popular processor - the ever bizarre and freaky Z80.)

So if entire systems could be created with a 6502, imagine the flexibility of the ARM processor. It has been said that the 6502 is the bridge between CISC design and RISC. Acorn chose the 6502 for their original machines, such as the Atom and the System units, and went from there to design their own processor - the ARM.

  To summarize the above, the advantages of a RISC processor are:



  • Quicker time-to-market. A smaller processor will have fewer instructions, and the design will be less complicated, so it may be produced more rapidly.

  • Smaller 'die size' - a RISC processor requires fewer transistors than a comparable CISC processor, which leads to a smaller silicon area and, in turn again, less heat dissipation. Most of the heat in a RiscPC fitted with an ARM710 is actually generated by the 80486 in the slot beside it (and that's when it is supposed to be in 'standby').

  • Related to all of the above, it is a much lower-power chip. ARM designs processors in static form, so that the processor clock can be stopped completely rather than simply slowed down. The Solo computer (designed for use in third-world countries) is a system that will run from a 12V battery, charging from a solar panel.
     

  • Internally, a RISC processor has a number of hardwired instructions.
    This was also true of the early CISC processors, but these days a typical CISC processor has a core which executes microcode corresponding to the instructions passed into the processor. Ironically, this core tends to be RISC.

  • A RISC processor's simplicity does not necessarily refer to a simple instruction set. Consider an instruction such as LDMEQFD R13!, {R0-R12, PC}^ - it loads thirteen registers from memory, the stack is adjusted accordingly, the '^' pushes the processor flags into R15 as well as the return address, and it is conditionally executed. This allows a tidy 'exit from routine' to be performed in a single instruction. The RISC concept, however, does not state that all the instructions are simple. If that were true, the ARM would not have a MUL, as you can do the exact same thing with looped ADDing. No, the RISC concept means the silicon is simple. It is a simple processor to implement.

RISC vs ARM

You shouldn't really call it "RISC vs CISC" but "ARM vs CISC". For example, conditional execution of (almost) any instruction isn't a typical feature of RISC processors, but can only(?) be found on ARMs. Furthermore, quite a few people claim that the ARM isn't really a RISC processor, as it doesn't provide only a simple instruction set; you'll hardly find any CISC processor which provides a single instruction as powerful as LDREQ R0, [R1, R2, LSR #16]!
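To see why that one instruction is so powerful, here it is unpacked into Python. A sketch only - 'memory' is a dict standing in for RAM, and the flags are reduced to a single Z bit:

```python
# What LDREQ R0, [R1, R2, LSR #16] amounts to: only if the Z flag is
# set (the EQ condition), shift R2 right 16 bits in the barrel shifter,
# add it to R1, and load R0 from the resulting address.
def ldreq(regs, memory, z_flag):
    if not z_flag:               # condition fails: the instruction is a no-op
        return
    addr = regs["R1"] + (regs["R2"] >> 16)   # barrel shifter: R2, LSR #16
    regs["R0"] = memory[addr]

regs = {"R0": 0, "R1": 100, "R2": 5 << 16}
ldreq(regs, {105: 42}, z_flag=True)
print(regs["R0"])   # -> 42
```

A condition test, a shift, an address calculation, and a load - all in one instruction, and all in one cycle's worth of silicon.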


Today it is wrong to claim that CISC processors execute complex instructions more slowly; modern processors can execute most complex instructions in one cycle. They may need very long pipelines to do so (twenty stages or so in the deepest designs), but nonetheless they can. And complex instructions provide a big potential for optimization: if you have an instruction which took 10 cycles on the old model, and the new model executes it in 5 cycles, you end up with a speed increase of 100% (without a higher clock frequency). On the other hand, ARM processors executed most instructions in a single cycle right from the start, and thus don't have this optimization potential (except for the MUL instruction).

The argument that RISC processors provide more registers than CISC processors isn't right either. Just take a look at the (good old) 68000: it has about the same number of registers as the ARM. And the fact that 80x86-compatible processors don't provide more registers is just a matter of compatibility. But the argument isn't completely wrong: RISC processors are much simpler than CISC processors and thus take up much less space, leaving room for additional functionality like more registers. On the other hand, a RISC processor with only three or so registers would be a pain to program; that is, RISC processors simply need more registers than CISC processors for the same job.

And the argument that RISC processors have pipelining whereas CISCs don't is plainly wrong: the ARM2 didn't have one, whereas the Pentium does...

Today, the advantages of RISC over CISC are these:



  • RISC processors are much simpler to build, which again results in the following advantages:

    • easier to build, i.e. you can use already existing production facilities

    • much less expensive; just compare the price of an XScale with that of a Pentium III at 1 GHz...

    • less power consumption, which again gives two advantages:

      • much longer use of battery driven devices

      • no need for cooling of the device

  • RISC processors are much simpler to program, which helps not only the assembler programmer but the compiler designer, too. You'll hardly find any compiler which uses all the functions of a Pentium III optimally.

  • And then there are the benefits of the ARM processors:

  • Conditional execution of most instructions, which is a very powerful thing, especially with large pipelines, since the whole pipeline has to be refilled every time a branch is taken; that's why CISC processors make a huge effort at branch prediction.
     

  • The ability to shift a register operand as part of another instruction, which means that shifts take up no time at all (the 68000 took one cycle per bit to shift)


  • The conditional setting of flags, i.e. ADD versus ADDS, which becomes extremely powerful together with the conditional execution of instructions
 

KEY APPLICATIONS
ARM and Bluetooth Wireless Technology
The Bluetooth specification is controlled and issued by the SIG (Special Interest Group), which has approximately 2500 members at the time of writing, including ARM, which is an Associate Member.
ARM Architecture in Bluetooth Applications

ARM has a leading position as the 'CPU of choice' for Bluetooth applications, as shown by the number of IP vendors and silicon vendors that target the ARM architecture.


ARM aims to:

• Encourage and assist all Bluetooth IP vendors to target ARM

• Enable all Bluetooth SoC designers to design in ARM technology

• Bring leading Bluetooth IP to the ARM partnership


ARM's Bluetooth activity provides a focal point for third parties wanting to work with ARM or for partners and OEMs wishing to access Bluetooth IP.
3D Graphics Acceleration
The anticipated growth of 3D graphics in a wide variety of consumer products from mobile phones to set top boxes has resulted in a market requirement for a complete 3D graphics rendering sub-system suitable for integration in embedded ARM core-based SoC devices.
The launch of 3G mobile networks and the growth of Java™-enabled mobile devices with large colour displays are together expected to lead to dramatic growth in wireless gaming over the next few years. Industry analysts predict that the number of wireless gamers around the world will grow to between 53 million and 360 million by 2006.
The ARM range of 3D hardware acceleration solutions has been designed to meet this market requirement and support rich multimedia applications on a wide variety of portable and consumer products. The family currently features two products: the ARM MBX R-S™ and MBX HR-S™ for integration with all ARM processor families.
The ARM 3D graphics acceleration technology is based around the PowerVR® MBX graphics processor from Imagination Technologies, a low-power and efficient implementation of the PowerVR Series 3 architecture. Combined with ARM’s industry leading embedded RISC processor cores, MBX enables complex 3D, 2D and video graphics to be accessed on mobile and consumer platforms.

VoIP
VoIP (Voice over Internet Protocol) is the ability to packetize voice and send it over the Internet infrastructure. With significant cost and feature benefits over traditional telephony, VoIP is gaining momentum in both the residential and enterprise markets. At the end of 2004, IDC estimated the US had over 1M residential VoIP subscribers; this number is expected to reach 6 to 7M by 2006.

Support and processing for VoIP falls into a wide spectrum of product types, from line cards in infrastructure devices to desktop phones. ARM's full range of processor cores is ideal for meeting the wide performance requirements of this market. For higher-end VoIP products, including voice infrastructure gateways, there are a number of chipsets combining an ARM core with digital signal processing engines or a DSP processor to enable multiple channels of voice. For low-cost phones and terminal adapters, the combination of ARM cores with built-in DSP extensions and partner software utilizing this digital signal processing capability enables low-cost, low-power VoIP implementations.


Hard Disk Drives

Hard disk drives may well be the ultimate real-time control system. Managing the combination of high rotation speeds, extreme actuator precision, turbulence caused by fast disk speeds, and external effects such as shock demands high-performance, computationally intensive embedded processing. The market also requires power efficiency, small die size, and good debug capability. ARM is now widely accepted as the architecture of choice for this demanding market. Through years of working with HDD partners, ARM has refined its cores and developed leading-edge real-time debug solutions to meet the needs of HDD designers. ARM cores can be found in over 30% of all shipments, with ARM-based designs shipping, or in development, at all major OEMs.


Printers

ARM is proving to be an excellent choice for printer applications. In the laser market, reduced ASIC integration risk is balanced with high performance requirements; in the inkjet market, lower costs are achieved while still boosting image quality and throughput.




Conclusion

Basically, the ARM architecture has a simple, powerful, yet compact instruction set which is easy to compile to. Furthermore, most ARM implementations use (almost) fully associative caches and 3- to 5-stage pipelines, and have a narrow and relatively slow external bus (without L2 caches). They all support powering down the parts that aren't doing any work. Finally, the newest ARMs use the latest process technology to decrease supply voltage rather than to crank up clock speed.

The low power consumption comes from having approximately 1/25th of the number of gates of a Pentium. The high performance comes from it being designed better than the Pentium. With a RISC design you can make certain simplifications that speed things up - for example, you can implement the instruction decode using hardwired logic rather than microcode.

As far as RISC goes, the ARM has some wrinkles of its own that add to its performance. The ability to place a condition on any instruction, and to determine whether an instruction can or cannot affect the processor flags, means that you can often avoid the branches which result in pipeline stalls and other slowdowns (processors that don't have this ability need loads of power-hungry extra logic to try to compensate for branch stalls). The barrel shifter allows much more flexibility than ALU shifting and makes ARM instructions capable of doing a lot more than you might first think. Basically, the ARM is a better design than the Pentium.



References
www.arm.com

http://en.wikipedia.org/



Dept. of Computer Science Model Engineering College

