Processor Types




SA1100 variant


This is a version of the SA110 designed primarily for portable applications. I mention it here as I am reliably informed that the SA1100 is the processor inside the 'faster' Panasonic satellite digibox. It contains the StrongARM core, MMU, cache, PCMCIA, general I/O controller (including two serial ports), and a colour/greyscale LCD controller. It runs at 133MHz or 200MHz and it consumes less than half a watt of power.

 

 


Thumb


The Thumb instruction set is a reworking of the ARM set, with a few things omitted. Thumb instructions are 16 bits wide (instead of the usual 32 bits), which allows for greater code density in places where memory is restricted. Most Thumb instructions can only address the first eight registers, and conditional execution is only available for branches. Also, Thumb cannot do a number of things required for low-level processor exceptions, so the Thumb instruction set will always come alongside the full ARM instruction set. Exceptions and the like can be handled in ARM code, with Thumb used for the more regular code.

 

 


Other versions


These versions are afforded less coverage mainly because I neither own nor have access to any of them.
While my site started as a way to learn to program the ARM under RISC OS, the future is in embedded devices using these new systems, rather than the old 26 bit mode required by RISC OS...
...and so, these processors are something I would like to detail, in time.

M variants


This is an extension of the version three design (ARM 6 and ARM 7) that provides the long 64 bit multiply instructions.
These instructions became a standard part of the instruction set in ARM architecture version 4 (StrongARM, etc).
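As a sketch of what these long multiplies look like (the register choice here is arbitrary):

UMULL R0, R1, R2, R3    ; R1:R0 = R2 * R3, unsigned 64 bit result
SMLAL R0, R1, R2, R3    ; R1:R0 = R1:R0 + (R2 * R3), signed multiply-accumulate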

 

T variants


These processors include the Thumb instruction set (and, hence, no 26 bit mode).

 

E variants


These processors include a number of additional instructions which provide improved performance in typical DSP applications. The 'E' stands for "Enhanced DSP".

 

 


The future


The future is here. Newer ARM processors exist, but they are 32 bit only devices.
This means, basically, that RISC OS won't run on them until all of RISC OS is modified to be 32 bit safe. As long as BASIC is patched, a reasonable software base will exist. However, all C programs will need to be recompiled, all relocatable modules will need to be altered, and pretty much all assembler code will need to be repaired. In cases where source isn't available (ie, anything written by Computer Concepts), it will be a tedious slog.
It is truly one of the situations that could make or break the platform.

I feel that, as long as a basic C compiler/linker is made FREELY available, we should go for it. It need not be a 'good' compiler, as long as it is a drop-in replacement for Norcroft CC version 4 or 5. Why? Because RISC OS depends upon enthusiasts to create software, not big corporations. Without inexpensive, reasonable tools they might decide that converting their software is too much bother, and leave RISC OS to code for another platform.

I, personally, would happily download a freebie compiler/linker and convert much of my own code. It isn't plain sailing for us - think of all of the library code that needs to be checked. It will be difficult enough to obtain a 32 bit machine to check the code works correctly, never mind all the other pitfalls. Asking us for a grand to support the platform is only going to turn us away in droves. Heck, I'm still using ARM 2 and ARM 3 systems. Some of us smaller coders won't be able to afford such a radical upgrade. And that will be VERY BAD for the platform. Look how many people use the FREE user-created Internet suite in preference to commercial alternatives. Look at all of the support code available on Arcade BBS. Much of that will probably go, yes. But would a platform trying to re-establish itself really want to say goodbye to the rest?
I don't claim my code is wonderful, but if only one person besides myself makes good use of it - then it has been worth it.

 

See the section on 32 bit operation, later in this document, to learn more.



 




The Stack




 

The 6502 microprocessor featured a hardware stack, located at &1xx in memory and extending for 256 bytes. It also featured addressing modes which operated more quickly on page zero (&0xx).

Both of these are inflexible, and not in keeping with the RISC concept.
The ARM processor provides instructions for manipulating the stack (LDM and STM). The actual location where your stack lays its hat is entirely up to you and the rules of good programming.

For example:

MOV   R13, #&8000
STMFD R13!, {R0-R12, R14}

would work, but is likely to scribble your registers over something important. So typically you would set R13 to the end of your workspace, and stack backwards from there.

These are conventions used in RISC OS. You can replace R13 with any register except R15 (and R14, if you need it to hold the return address). As R14 and R15 have defined purposes, the next register down is R13, so that is used as the stack pointer.


Likewise, in RISC OS, the stacks are fully descending (FD), which means the stack grows downwards in memory and the stack pointer points to the last item pushed; a push is STMFD (a synonym for STMDB) and a pull is LDMFD (a synonym for LDMIA).

You can, quite easily, shirk convention and stack using whatever register you like (R0-R13 and R14 if you don't need it) and also you can set up any kind of stack you like, growing up, growing down, pointer to next free or last used... But be aware that when RISC OS provides you with stack information (if you are writing a module, APCS assembler, BASIC assembler, or being a transient utility, for example) it will pass the address in R13 and expect you to be using a fully descending stack. So while you can use whatever type of stack/location that suits you, it is suggested you follow the OS style. It makes life easier.
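To make the 'fully descending' naming concrete, here is a minimal sketch (the register choice is arbitrary) showing that the stack-view and the direction-view mnemonics are the same instructions:

STMFD R13!, {R0-R3}   ; push - identical to STMDB R13!, {R0-R3}
                      ; R13 is decremented by 16, then R0-R3 are stored
LDMFD R13!, {R0-R3}   ; pull - identical to LDMIA R13!, {R0-R3}
                      ; R0-R3 are loaded from R13 upwards, then 16 is added back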

If you are not sure what a stack is, exactly, then consider it a temporary dumping area. When you start your program, you will want to put R14 somewhere so you know where to branch to in order to exit. Likewise, every time you BL, you will want to put R14 someplace if you plan to call another BL.
To make this clearer:

; ...entry, R14 points to exit location

BL   one
BL   two
MOV  PC, R14          ; exit

.one
; R14 points to instruction after 'BL one'
...do stuff...
MOV  PC, R14          ; return

.two
; R14 points to instruction after 'BL two'
...do stuff...
BL   three
MOV  PC, R14          ; return

.three
; R14 points to instruction after 'BL three'
B    four             ; no return

.four
; Not a BL, so R14 unchanged
MOV  PC, R14          ; returns from .three because R14 not changed

Take a moment to work through that code. It is fairly simple, and it is fairly obvious that something needs to be done with R14, otherwise you won't be able to exit. Now, a viable answer is to shift R14 into some other register. So now consider that the "...do stuff..." parts use ALL of the remaining registers.


Now what? Well, what we need is a controlled way to dump R14 into memory until we come to need it.
That's what a stack is.

That code again:

; ...entry, R14 points to exit location, we assume R13 is set up
STMFD R13!, {R14}
BL    one
BL    two
LDMFD R13!, {PC}      ; exit

.one
; R14 points to instruction after 'BL one'
STMFD R13!, {R14}
...do stuff...
LDMFD R13!, {PC}      ; return

.two
; R14 points to instruction after 'BL two'
STMFD R13!, {R14}
...do stuff...
BL    three
LDMFD R13!, {PC}      ; return

.three
; R14 points to instruction after 'BL three'
STMFD R13!, {R14}
B     four            ; no return

.four
; Not a BL, so neither R14 nor the stack has changed
LDMFD R13!, {PC}      ; returns from .three - pulls the R14 stacked on entry to .three

A quick note, you can write:

STMFD R13!, {R14}
...do stuff...
LDMFD R13!, {R14}
MOV   PC, R14

but the STM/LDM does NOT keep track of which stored values belong in which registers, so you can store R14, and reload it directly into PC thus disposing of the need to do a MOV afterwards.

The caveat is that the registers are saved in ascending order...

STMFD R13!, {R7, R0, R2, R1, R9, R3, R14}

will save R0, R1, R2, R3, R7, R9, and R14 (in that order). So code like:

STMFD R13!, {R0, R1}
LDMFD R13!, {R1, R0}

to swap two registers will not work.
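If you do need to swap two registers, a sketch of two common idioms (register choice arbitrary):

; using a spare register
MOV R2, R0
MOV R0, R1
MOV R1, R2

; or, with no spare register, the EOR trick
EOR R0, R0, R1
EOR R1, R0, R1        ; R1 now holds the original R0
EOR R0, R0, R1        ; R0 now holds the original R1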

 

Please refer to the document describing LDM and STM for details on how to use a stack.






Memory Management




 

Introduction


The RISC OS machines work with two different types of memory - logical and physical.
The logical memory is the memory as seen by the OS, and the programmer. Your application begins at &8000 and continues until &xxxxx.
The physical memory is the actual memory in the machine.

Under RISC OS, memory is broken into pages. Older machines have a page size of 8/16/32K (depending on installed memory), and newer machines have a fixed 4K page. If you were to examine the pages in your application workspace, you would most likely see that they are seemingly random, not in order. Those pages of physical memory are combined to provide you with xxxx bytes of logical memory. The memory controller is constantly shuffling memory around so that each task that comes into operation 'believes' it is loaded at &8000. Write a little application to count how many Wimp polls occur every second, like the sketch below, and you'll begin to appreciate how much is going on in the background.
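A rough illustration (a minimal sketch only - the task name is arbitrary, error handling is omitted, and it assumes it is run from a TaskWindow so the PRINT output is visible):

REM count roughly how many times we are polled in one second
DIM block% 256
SYS "Wimp_Initialise", 310, &4B534154, "PollCount" TO ,task%
polls% = 0
start% = TIME
REPEAT
  SYS "Wimp_Poll", 0, block% TO reason%
  polls% += 1
UNTIL TIME - start% >= 100
PRINT "Roughly ";polls%;" polls per second"
SYS "Wimp_CloseDown", task%, &4B534154
END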

 

MEMC : Older systems


In ARM 2, ARM 250, and ARM 3 machines, the memory is controlled by the MEMC (MEMory Controller). This unit can cope with an address space of 64Mb, but in reality can only access 4Mb of physical memory. The 64Mb space is split into three sections:

0Mb - 32Mb : Logical RAM

32Mb - 48Mb : Physical RAM

48Mb - 64Mb : System ROMs and I/O

Parts of the system ROMs and I/O are mapped over each other, so reading from that area gives you code from ROM, while writing to it updates things like the VIDC (video/sound).

It is possible to fit up to 16Mb of memory to an older machine, but you will need a matched MEMC for each 4Mb. People have reported that simply fitting two MEMCs (to give 8Mb) is either hairy or unreliable, or both. In practice, the hardware to do this properly only really existed for the A540 machine, where each 4Mb was a slot-in memory card with an on-board MEMC. Other solutions for, say, the A5000 and the A410, are elaborate bodges. Look at http://www.castle.org.uk/castle/upg25.htm for an example of what is required to fit 8Mb into an A5000!

The MEMC is capable of restricting access to pages of memory in certain ways, either complete access, no access, no access in USR mode, or read-only access. Older versions of RISC OS only implemented this loosely, so you need to be in SVC mode to access hardware directly but you could quite easily trample over memory used by other applications.

 

MMU : Newer systems


The newer systems, with an ARM6 or later processor, have an MMU built into the processor. This consists of the translation look-aside buffer (TLB), access control logic, and translation table walk logic. The MMU supports memory accesses based upon 1Mb sections or 4K pages. The MMU also provides support for up to 16 'domains', areas of memory with specific access rights.
The TLB caches recently used translations (32 entries on the ARM610; later parts such as the ARM710 hold 64). If the TLB contains an entry for the virtual address, the access control logic determines whether access is permitted. If it is, the MMU outputs the appropriate physical address; otherwise it signals the processor to abort.
If the TLB misses (it doesn't contain an entry for the virtual address), the walk logic will retrieve the translation information from the (full) translation table in physical memory.
If the MMU is disabled, the virtual address is output directly as the physical address.

It gets a lot more complicated, suffice to say that more access rights are possible and you can specify memory to be bufferable and/or cacheable (or not), and the page size is fixed to 4K. A normal RiscPC offers two banks of RAM, and is capable of addressing up to 256Mb of RAM in fairly standard PC-style SIMMs, plus up to 2Mb of VRAM double-ported with the VIDC, plus hardware/ROM addressing.

On the RiscPC, the maximum address space of an application is 28Mb. This is not a restriction of the MMU but a restriction of the 26-bit processor mode used by RISC OS. A 32-bit processor mode could, in theory, allocate the entire 256Mb to a single task.
All current versions of RISC OS are 26-bit.

 

System limitations


Consider a RiscPC with an ARM610 processor.
The cache is 4K.
The bus speed is 16MHz (note, only slightly faster than the A5000!), and the hardware does not support burst-mode for memory accesses.
Upon a context switch (ie, making an application 'active') you need to remap its memory to begin at &8000 and flush the cache.
I'll leave you to do the maths. :-)

 



Memory schemes and multitasking




 

Introduction


This is a reference, designed to help you understand the various types of memory handling and multitasking that exist.

 

Memory is a resource that needs careful management. It is expensive (£/Mb is much higher for memory than for conventional harddisc storage). A good system will offer flexible facilities trading off speed for functionality.


You need memory because it is fast. It is rarely as fast as the processor these days, but it is much faster than harddiscs. We need fast, and we need big, so we can hold the large programs and large amounts of data that seem to be around. It boggles the mind that a commercial mainframe did accounts and stuff with a mere 4K of memory.

Typically, there will be three or four, possibly five, kinds of storage in the computer.



  1. Level 1 cache
    This is inside the processor, usually operating at the core speed of the processor. It is between 4K and 32K usually.
     

  2. Level 2 cache
    If the difference between the processor speed and system memory is quite large, you will often have a level 2 cache. This is mounted on the motherboard, and typically runs at a speed roughly halfway between the processor speed and the speed of the system memory.
    It is usually between 64K and 512K. RISC OS machines do not have Level 2 cache.
     

  3. Level 3 cache
    If your processor is running at some silly speed (such as 1GHz) and your system memory is running at a tenth of that, you might like a chunk (say a Mb or two) of cache between level 2 and system memory, so that you can further improve speed.
    Each layer of cache is getting slower, until we reach...
     

  4. System memory
    Your DRAM, SRAM, SIMMs, DIMMs, or whatever you have fitted. Speeds range from 2MHz in the old home computers, to around 133MHz in a typical PC compatible. Older PCs use 33MHz or 66MHz buses.
    The ARM2/250 machines have an 8MHz bus, the ARM3 machines (A5000,...) have a 12MHz bus, the RiscPC has a 16MHz bus. In these cases, only the ARM2 is clocked at the same speed as the bus. The ARM3 is clocked at 25 or 30MHz, the ARM610 at 33MHz, the ARM710 at 40MHz and the StrongARM at a variety of speeds up to 280-ish MHz.
     

  5. Harddisc
    Slow, huge, cheap.

 

Basic monoprogramming


This is where all of the memory is just available, and you run one application at a time. The kernel/OS/BIOS (whatever) sits in one place, either in RAM or ROM and it is mapped into the address map.

Consider:

.----------------.     .----------------.
|   OS in ROM    |     | Device drivers |
|                |     |     in ROM     |
|----------------|     |----------------|
|                |     |                |
|      Your      |     |      Your      |
|  application   |     |  application   |
|                |     |----------------|
|----------------|     |                |
|System workspace|     |   OS in RAM    |
'----------------'     '----------------'

The first example is similar to the layout of the BBC microcomputer. The second is not that different to a basic MS-DOS system: the OS is loaded low in memory, the BIOS is mapped in at the top, and the application sits in the middle.

To be honest, the first example is used a lot under RISC OS as well. It is exactly what a standard application is supposed to believe. The OS uses page zero (&0000 - &7FFF) for internal housekeeping, it (your app) begins at &8000, and the hardware/OS sit way up in the ether at &3800000.
Memory management under RISC OS is more complex, but this is how a typical application will see things.

When the memory is organised in this way, only one application can be running. When the user enters a command, if it is an application then that application is copied from disc into memory, then it is executed. When the application is done with, the operating system reappears, waiting for you to give it something else to do.

 

Basic multiprogramming


Here, we are running several applications. While they are not running concurrently (to do so would be impossible, a processor can only do one thing at a time), the amount of time given to an application is tiny, so the system is spending a lot of time faffing around hopping from one application to the next, all giving you the illusion that n applications are all happily running together on your computer.

Memory is typically handled as non-contiguous blocks. On an ARM machine, pages are brought together to fake a contiguous chunk of memory beginning at &8000. Anybody who has tried an address translation in their allocated memory will know that it is near impossible to get an actual physical memory address out of the OS; the best you can do is ask which physical page numbers make up your slot.


The following program demonstrates this:

END = &10000 : REM Constrain slot to 32K

DIM willow% 16

SYS "Wimp_SlotSize", -1, -1 TO slot%
SYS "OS_ReadMemMapInfo" TO page%

PRINT "Using "+STR$(slot% / page%)+" pages, each page being "+STR$(page%)+" bytes."
PRINT "Pages used: ";

more% = slot% / page%
FOR loop% = 0 TO (more% - 1)
  willow%!0 = 0
  willow%!4 = &8000 + (loop% * page%)
  willow%!8 = 0
  willow%!12= -1
  SYS "OS_FindMemMapEntries", willow%
  IF loop% > 0 THEN PRINT ", ";
  PRINT STR$(willow%!0);
NEXT

PRINT
END


This outputs something similar to:

Using 8 pages, each page being 4096 bytes.

Pages used: 2555, 2340, 2683, 2682, 2681, 2680, 2679, 2678

 

RISC OS keeps every application loaded in memory. The current application is then 'paged in' by remapping the memory pointers in the page tables; consequently, other tasks are mapped out.



Windows/Unix systems load applications into memory, supported by a system called 'virtual memory' which dumps unused pages to disc in order to free system memory for applications that need it. I am not sure how Windows organises its memory, if it does it in a style similar to RISC OS (ie, remap to start from a specific address) or if each application is just told 'you are here'.
Virtual memory is useful, as you can fit a 32Mb program into 16Mb of memory if you are careful how you load it, and swap out old parts for new parts as necessary.

Some systems use a lazy-paging form of memory. In this case, only the first page of memory is filled by the application when execution starts. As more of the application is executed, the operating system fills in the parts as required.


By contrast, under RISC OS an application must be loaded in its entirety before it can run. Consider loading, well, practically anything off of floppy disc. It takes time.

 

Virtual memory


When you no longer have actual physical memory, you may have virtual memory: a set of memory locations that don't exist, but which the operating system tries real hard to convince you do. And in the centre of the ring is the MMU (Memory Management Unit - inspired name, no?) keeping control.
[note: you need an MMU anyway when your memory is broken into remappable pages, this just seemed like a good time to introduce it!]

When the processor is instructed to jump to &8000 to begin executing an application, it passes the address &8000 to the MMU. This translates the address into the correct real address and outputs this on the address lines, say &12FC00. The processor is not aware of this, the application is not aware of this, the computer user is not aware of this.
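For the sake of illustration, with 4K pages the translation amounts to splitting the address into a page number and an offset, then swapping the page number for a physical frame number found in the page table. A rough sketch of the arithmetic (the frame number here is invented):

REM 4K pages: bits 12 upwards select the page, bits 0-11 are the offset
addr%   = &8000
vpage%  = addr% >> 12            : REM virtual page number (8 in this case)
offset% = addr% AND &FFF         : REM offset within the page (0 here)
frame%  = &12F                   : REM physical frame looked up in the page table (invented)
phys%   = (frame% << 12) + offset%
PRINT ~phys%                     : REM prints 12F000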

So we can take this one stage further by mapping onwards into memory that does not exist at all. In this case, the MMU will hiccup and say "Oi! You! No!" and the operating system will be called in a panic (correctly known as a "page fault"). The operating system will be calm and collected and think, "Ah, virtual memory". A little-used page of real memory will be shoved out to disc, then the page that the MMU was trying to find will be loaded in its place. The memory map will be updated accordingly, then control will be handed back to the user application at the exact point the page fault occurred. The application, unaware of all this palaver, performs that instruction again, only this time the MMU will (happily?) output the correct address to the memory system, and all will continue.

 

Page tables and the MMU


The page table exists to map each page into an address. This allows the operating system to keep track of which memory is pretending to be which. However it is more complex. Some pages cannot be remapped, some pages are doubly mapped, some are not to be touched in user mode code, some aren't to be touched at all. Some are read only. Some just don't exist. All of this must be kept track of.

So the MMU takes an address, looks it up in the page table, and spits out the correct address.

Let's do some maths. We'll assume a 4K page size (a la RISC OS on a RiscPC). A 32 bit address space has a million pages (1048576, to be precise). With one million pages, you'll need one million entries. In the ARM MMU, each entry takes 7 words, so we are looking at some 28 megabytes just to index our memory.
It gets better. Every single memory reference will be passed through the MMU. So we'll want it to operate in nanoseconds. Faster, if possible.
In reality, it is somewhat easier, as most typical machines don't have enough memory to fill the entire addressing space; indeed many are unlikely to get close for technical reasons (the RiscPC can have 258Mb maximum RAM, or 514Mb with Kinetic - the extra 2Mb is the VRAM). Even so, the page tables will get large.
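To spell out that arithmetic (using the seven-words-per-entry figure quoted above):

pages% = 1 << 20               : REM 4Gb / 4K  = 1048576 pages
entry% = 7 * 4                 : REM 7 words   = 28 bytes per MMU entry
PRINT pages% * entry%          : REM 29360128 bytes - roughly 28Mb of table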

So there are three options:



  • Have a huge array of fast registers in the MMU. Costly. Very.

  • Hold the page tables in main memory. Slow. Very.

  • Compromise. Cache the active pages in the MMU, and store the rest on disc.

An example. A RiscPC: 64Mb of RAM, 2Mb of VRAM, 4Mb of ROM and hardware I/O (double mapped). That's 73400320 bytes, or 17920 pages. It would take 71680 bytes just to store one address per page. But an address on its own isn't much use. Seven words comprise an entry in the ARM's MMU, so our 17920 pages would require 501760 bytes in order to fully index the memory.
You just can't store that lot in the MMU. So you'll store a snippet, say 16K worth, and keep the rest in RAM.

 

The TLB


The Translation Lookaside Buffer is a way to make paging even more responsive. Typically, a program will make heavy use of a few pages and barely touch the rest. Even if you plan to byte read the entire memory map, you will be making four thousand hits in one page before going to the next.
A solution to this is to fit a little bit in the MMU that can map virtual addresses to their physical counterparts without traversing the page table. This is the TLB. It lives within the MMU and contains details of a small number of pages (usually between four and sixty four - the ARM610 MMU TLB has thirty two entries).
Now, when we have a page lookup, we first pass our virtual address to the TLB which will check all of the addresses stored, and the protection level. If a match is found, the TLB will spit out the physical address and the page table isn't touched.
If a miss is encountered, the TLB will evict one of its entries and load in the page information looked up from the page table. The TLB then knows the newly requested page and can quickly satisfy the next memory access, as chances are that access will fall within the page just requested.

So far we have figured on the hardware doing all of this, as in the ARM processor. Some RISC processors (such as the Alpha and the MIPS) will pass the TLB miss problem to the operating system. This may allow the OS to use some intelligence to pre-load certain pages into the TLB.

 

Page size


Users of a RISC OS 3.5 system running on an ARM610 with two or more large (say, 20Mb) applications running will know the value of a 4K page: it's bloody slow. To be fair, this isn't the fault of the hardware, but more the Wimp doing stuff the kernel should do (as happens in RISC OS 3.7) and doing it slower!

Like with harddisc LFAUs, what you need is a sensible trade-off between page granularity and page size. You could reduce the wastage in memory by making pages small, say 256 bytes, but then you would need a lot of memory to store the page table, and a bigger page table is slower to scan through. Or you could have 64K pages, which make the page table small, but can waste huge amounts of memory.


To consider, a 32K program would require eight 4K pages, or sixty four 512 byte pages. If your system remaps memory when shuffling pages around, it is quicker to move a smaller number of large pages than a larger number of small pages.

The MEMC in older RISC OS machines had a fixed page table. So the size of page depended upon how much memory was utilised.



MEMORY    PAGE SIZE
0.5Mb     8K
1Mb       8K
2Mb       16K
4Mb       32K
3Mb wasn't a valid option, and 4Mb is the limit. You can increase this by fitting a slave MEMC, in which case you are looking at 2 lots of 4Mb (invisible to the OS/user).


In a RiscPC, the MMU accesses a number of 4K pages. The limits are due, I suspect, to the system bus or memory system, not the MMU itself.

Most commercial systems use page sizes in the order 512 bytes to 64K.


The later ARM processors (ARM6 onwards) and the Intel Pentium both use page sizes of 4K.

 

Page replacement algorithms


When a page fault occurs, the operating system has to pick a page to dump, to allow the required page to be loaded. There are several ways that this may be achieved. None of them is perfect; they are all compromises.

Not Recently Used
This requires two bits to be reserved in the page table: a 'modified' bit (set when the page is written) and a 'referenced' bit (set when the page is read or written). Upon each access, the paging hardware (and it must be done in hardware, for speed) will set the bits as necessary. Then, on a fixed interval - either when idling or upon a clock interrupt - the operating system will clear the referenced bits. This allows you to track the recent page accesses, so when flushing out a page you can spot those that have not recently been referenced or modified; NRU removes one of those at random. While it is not the best way of deciding which page to remove, it is simple and gives reasonably good results.

First-In First-Out
It is hoped you are familiar with the concept of FIFO, from buffering and the like. If you are not, consider the lame analogy of the hose pipe, in which the first water in will be the first water to come out the other end. It is rarely used; I'll leave the whys and wherefores as an exercise for the bemused reader. :-)

Second Chance
A simple modification to the FIFO arrangement is to look at the referenced bit: if it is zero then we know the page is not in current use and it can be thrown out. If the bit is set, then the page is shifted to the end of the page list as if it were a new arrival, and the page search continues.
What we are doing here is looking for a page unused since the last period (clock tick). If by some miracle ALL the pages are current and active, then Second Chance will revert to FIFO.

Clock
Although Second Chance is good, all that page shuffling is inefficient, so the pages are instead kept in a circular list (ie, a clock). If the page being examined is in use, we move on and look at the next page. With no concept of the start and end of the list, we just keep going until we come to a usable page.

Least Recently Used
LRU is possible, but it isn't cheap. You maintain a list of all the pages, sorted from the most recently used at the front of the list to the least recently used at the back. When you need a page, you take the last entry and use it. Because of speed, this is only really possible in hardware, as the list should be updated on each memory access.

Not Frequently Used
In an attempt to simulate LRU in software, the OS scans the available pages on each clock tick and increments a counter (held in memory, one for each page) for each page whose referenced bit is set.
Unfortunately, it doesn't forget. So code heavily used then no longer necessary (such as a rendering core) will keep a high count for quite a while, whereas code that is not called often but should be all the more responsive, such as redraw code, will have a lower count and thus stand the possibility of being kicked out, even though the higher-rated renderer is no longer needed.
But this can be fixed, and the fix emulates LRU quite well. It is called aging. Just before the count is incremented, it is shifted one bit to the right, so after a number of shifts the count will decay to zero unless the bit keeps being added. Here you might be wondering how adding a bit can work if you've just shifted a bit off. The answer is simple: the bit is added at the leftmost, ie most significant, position.
To make this clearer...

Once upon a time : 0 0 1 0 1 1
Clock tick       : 0 0 0 1 0 1
Clock tick       : 0 0 0 0 1 0
Memory accessed  : 1 0 0 0 0 1
Clock tick       : 0 1 0 0 0 0
Memory accessed  : 1 0 1 0 0 0
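In ARM code, the per-page update described above might look something like this minimal sketch (the register usage is invented for the example - R0 holds the age counter, R1 is non-zero if the page was referenced this tick):

MOV   R0, R0, LSR #1        ; age the counter: shift it right by one
CMP   R1, #0                ; was the page referenced since the last tick?
ORRNE R0, R0, #&80000000    ; if so, set the most significant bit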

 

Multitasking


There is no such thing as true multitasking (despite what they may claim in the advocacy newsgroups). To multitask properly, you need a processor per process, with all the relevant bits so processes are not kept waiting. Effectively, a separate computer for each task.

However, it is possible to provide the illusion of running several things at once. In the old days, things happened in the background under interrupt control. Keyboards were scanned, clocks were updated. As computers became more powerful, more stuff happened in the background. Hugo Fiennes wrote a soundtracker player that runs on interrupts, so works in the background. You set it going, it carries on independent of your code.

So people began to think of the ability to apply this to applications. After all, most of an application's time is spent waiting for user input. In fact, the application may easily do sweet sod all for almost 100% of the time - measured by an event counter in Emily's polling loop, I type ~1 character a second, while the RiscPC polls a few hundred times a second. That was measured in a multitasking application, using polling speed as a yardstick; imagine if we were to record loops in a single-tasking program. So the idea was arrived at: we can load several programs into memory, provide them with some standard facilities and messaging systems, and then let each run for a predefined duration. When the duration is up, we pass control to the next program. When that has used its time, we go to the next program, and so on.
As a brief aside, I wish to point out Schrödinger's cat. A rather cute little moggy, but an extremely important one. It is physically impossible to measure system polling speed in software, and pretty difficult to measure it in hardware. You see, the very act of performing your measurement will affect the results. And you cannot easily 'account' for the time taken to make your measurements because measuring yourself is subject to the same artefacts as when measuring other things. You can only say 'to hell with it', and have your program report your polling rate as being 379 polls/sec, knowing that your measuring code may be eating around 20% of the available time, and use the figures in a relative form rather than trying to state "My computer achieves 379 polls every second". While there is no untruth in that, your computer might do 450 if you weren't so busy watching! You simply can't be JAFO.
...and you need to go to school/college and get bored rigid to find out what relevance any of this has to your cat. Mine is sitting on my monitor, asleep, blissfully unaware of all these heavy scientific concepts. She's probably got the right idea...

 

Co-operative multitasking


One such way of multitasking is relatively clean and simple. The application, once control has passed to it, has full control for as long as it needs. When it has finished, control is explicitly passed back to the operating system.
This is the multitasking scheme used in RISC OS.

 

Pre-emptive multitasking


Seen as the cure to all the world's ills by many advocates who have seen Linux (not Windows!), this works differently. Your application is given a timeslice. You can process whatever you want in your timeslice. When your timeslice is up, control is wrested away and given to another process. You have no say in the matter, peon.

 

 



I don't wish to get into an advocacy war here. My personal preference is co-operative, however I don't feel that either is the answer. Rather, a hybrid using both technologies could make for a clean system. The major drawback of CMT is that if an application misbehaves and goes into a never-ending loop, control never comes back; the application needs to be forcibly killed off.
Niall Douglas wrote a pre-emption system for RISC OS applications. Surprisingly, you didn't really notice anything much until an application entered some heavy processing (say, ChangeFSI) at which point life carried right on as normal while the task which would have stalled the machine for a while chugged away in the background.




32 bit operation




 

A lot of this information is taken from the ARM assembler manual. I didn't have a 32 bit processor at the time, so trusted the documentation...
As it happens, the documentation erroneously stated that UMULL and UMLAL could only be performed in 32 bit mode. That is incorrect: if your processor can do it (ie, StrongARM), it will work in 32 bit OR 26 bit mode...

 

The ARM2 and ARM3 have a 32 bit data bus and a 26 bit address bus. On later versions of the ARM, both the data bus and the address bus are a full 32 bits wide.


This explains how a "32 bit processor" can be referred to as 26 bit. The data width and instruction/word size is 32 bit, and always has been, but the address bus is only 24 bit.
Oh, whoops, I said 26 bit, didn't I?
:-) Well, as PC is always word aligned, the lower two bits will always be zero in an address, so on the ARM2/ARM3 processor these bits hold the processor mode setting. The width of PC is, effectively, 26 bit even though only 24 bits are actually used.

This was not a problem on the older machines. 4Mb of memory was the norm, some people upgraded to 8Mb, and 16Mb was the theoretical limit.


However a RiscPC with a 26 bit program counter would not have been possible, as 26 bits only allows you to address %11111111111111111111111100 (or 67108860 bytes, or 64Mb). The RiscPC allows for 258Mb of memory to be installed.
This, incidentally, explains the 28Mb size limit for application tasks; the system is expected to be compatible with the older RISC OS API.

The majority of the assembler site has been written regarding 26 bit mode of operation, which is compatible with the versions of RISC OS currently available (ie, RISC OS 2 to RISC OS 4); though some parts cover 32 bit modes (one example briefly runs in SVC32!), and I have noted parts of the examples that are 32 bit unfriendly.

Those with a RiscPC, Mico, RiscStation, A7000 etc have the ability to run a fully 32 bit operating system; indeed ARMLinux is such an operating system. RISC OS is not, because RISC OS needs, for the moment, to remain compatible with existing versions. It is the old dichotomy. It is wonderful to have a nice shiny new fully 32 bit version of RISC OS, but not so good when you realise a lot of your must-have software won't so much as load!
RISC OS isn't totally 26 bit - some of the handlers need to work in 32 bit mode. However, a full conversion is limited by money (ie, who's going to pay for RISC OS to be fully converted, and who's going to pay for new development tools to rebuild their code? - PD software is strong on RISC OS) and also by necessity (ie, lots of people use Impression but CC is no longer with us; it is quite likely Impression won't work on an updated RISC OS, so people will not see a need to upgrade if their desired software won't work).

 

Why is this even an issue?


Newer ARM processors will not support 26 bit operation. Several hybrids were made (ARM6, ARM7, StrongARM), but the time has come to draw the line. You can either add the complexity of a 26/32 bit system, or you can go 32 bit only and have a simpler, smaller processor.
Either we go with the flow, or get left behind... So really, this is an issue, and we don't have a choice.

 

32 bit architecture


The ARM architecture changed significantly with the introduction of the ARM6 series. Below, we shall describe the differences in behaviour between 26 bit and 32 bit operation.

In the ARM 6, the program counter was extended to a full 32 bits. As a result:



  • The PSR had to be separated from the PC into its own register, the CPSR (Current Program Status Register).
     

  • The PSR can no longer be saved with the PC when changing processor modes;
    instead, each privileged mode now has an extra register - the SPSR (Saved Program Status Register) - to hold the previous mode's PSR.
     

  • Instructions have been added to use these new status registers.

A further change was the addition of extra privileged processor modes, allowed by the PSR now having a full 32 bits to use. These modes are used to handle Undefined instruction and Abort exceptions. Consequently:

  • Undefined instructions, aborts, and supervisor code no longer have to share the same mode. This has removed restrictions on Supervisor mode programs which existed on earlier ARMs.

  • The availability of these features in the ARM6 series (and other later compatible chips) is set by one of several on-chip control registers. One of three processor configurations can be selected:

    • 26 bit program and data space. This configuration forces ARM to operate with a 26 bit address space. In this configuration only the four 26 bit modes are available (refer to the Processor modes description); it is impossible to select a 32 bit mode.
      This configuration is set at reset on all current ARM6 and 7 series processors.
       

    • 26 bit program space and 32 bit data space. This is the same as the 26 bit program and data space configuration, except that address exceptions are disabled to allow data transfer operations to access the full 32 bit address space.
       

    • 32 bit program and data space. This configuration extends the address space to 32 bits, and introduces major changes to the programmer's model. In this configuration you can select any of the 26 bit and the 32 bit processor modes (see Processor modes below).

 

When configured for a 32 bit program and data space, the ARM6 and ARM7 series support ten overlapping processor modes of operation:



  • User mode: the normal program execution state
    or User26 mode: a 26 bit version
     

  • FIQ mode: designed to support a data transfer or channel process
    or FIQ26 mode: a 26 bit version
     

  • IRQ mode: used for general purpose interrupt handling
    or IRQ26 mode: a 26 bit version
     

  • SVC mode: a protected mode for the operating system
    or SVC26 mode: a 26 bit version
     

  • Abort mode (abbreviated to ABT mode): entered after a data or instruction prefetch abort
     

  • Undefined mode (abbreviated to UND mode): entered when an undefined instruction is executed.

When in a 26 bit processor mode, the programmer's model reverts to that of earlier 26 bit ARM processors. The behaviour is the same as that of the ARM2aS macrocell with the following alterations:

  • Address exceptions are only generated by ARM when it is configured for 26 bit program and data space.
In other configurations the OS may still simulate the behaviour of an address exception, using external logic such as a memory management unit to generate an abort if the 64Mbyte range is exceeded, and converting that abort into an 'address exception trap' for the application.
     

  • The new instructions to transfer data between general registers and the program status registers remain operative. The new instructions can be used by the operating system to return to a 32 bit mode after calling a binary containing code written for a 26 bit ARM.
     

  • When in a 32 bit program and data space configuration, all exceptions (including Undefined Instruction and Software Interrupt) return the processor to a 32 bit mode, so the operating system must be modified to handle them.
     

  • If the processor attempts to write to a location between &0 and &1F inclusive (i.e. the exception vectors), hardware prevents the write operation and generates a data abort. This allows the operating system to intercept all changes to the exception vectors and redirect the vector to some veneer code. The veneer code should place the processor in a 26 bit mode before calling the 26 bit exception handler.

In all other respects, when operating in a 26 bit mode the ARM behaves like a 26 bit ARM. The relevant bits of the CPSR appear to be incorporated back into R15 to form the PC/PSR, with the I and F bits in bits 27 and 26. The instruction set behaves like that of the ARM2aS macrocell, with the addition of the MRS and MSR instructions.
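For reference, a minimal sketch of those two instructions (the value written here is arbitrary, and the MSR is only legal in a privileged mode):

MRS   R0, CPSR              ; copy the CPSR into R0
ORR   R0, R0, #&C0          ; example: set the I and F bits (disable interrupts)
MSR   CPSR_c, R0            ; write the control field back
MRS   R1, SPSR              ; read the saved PSR of the current privileged mode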

 

The registers available on the ARM 6 (and later), across the 26 bit and 32 bit modes, are:


Across the ten modes (User26, SVC26, IRQ26, FIQ26, User, SVC, IRQ, ABT, UND, FIQ):

R0 - R7   : one set of registers, shared by every mode.
R8 - R12  : shared by all modes except FIQ26 and FIQ, which have their own R8_fiq - R12_fiq.
R13, R14  : banked per mode - R13/R14 (User26 and User), R13_svc/R14_svc (SVC26 and SVC),
            R13_irq/R14_irq (IRQ26 and IRQ), R13_fiq/R14_fiq (FIQ26 and FIQ),
            R13_abt/R14_abt (ABT), R13_und/R14_und (UND).
R15       : PC and PSR combined in the 26 bit modes; PC only in the 32 bit modes.
CPSR      : the current program status register (in the 26 bit modes its flags appear
            folded back into R15, as described above).
SPSR      : SPSR_svc, SPSR_irq, SPSR_abt, SPSR_und and SPSR_fiq - one for each privileged
            32 bit mode; User mode has no SPSR.

In short, the 32 bit differences are:


  • The PC is a full 32 bits wide, and used singularly as a Program Counter.
     

  • The PSR is contained within its own register, the CPSR.
     

  • Each privileged mode has a private SPSR register in which to save the CPSR.
     

  • There are two new privileged modes, each of which has private copies of R13 and R14.

 

The CPSR and SPSR registers


The allocation of the bits within the CPSR (and the SPSR registers to which it is saved) is:

Bit  : 31 30 29 28  ...  7  6  ...  4   3   2   1   0
Flag :  N  Z  C  V       I  F      M4  M3  M2  M1  M0

M4 M3 M2 M1 M0
 0  0  0  0  0   User26 mode
 0  0  0  0  1   FIQ26 mode
 0  0  0  1  0   IRQ26 mode
 0  0  0  1  1   SVC26 mode
 1  0  0  0  0   User mode
 1  0  0  0  1   FIQ mode
 1  0  0  1  0   IRQ mode
 1  0  0  1  1   SVC mode
 1  0  1  1  1   ABT mode
 1  1  0  1  1   UND mode

Please refer to the (26 bit) PSR for information on the N, Z, C, V flags and the I and F interrupt flags.

 

So what does it mean in practice?


Most ARM code will work correctly. The only things that will not work are operations which fiddle with R15 to set the processor status. Unfortunately, this isn't as easy to fix as it sounds.
I examined a 9K program (a MODE 7 teletext frame viewer, written in C) for potential problems, basically looking for:

  • A MOVS with R15 as the destination.

  • Any LDMFD suffixed with the '^' character and loading R15.

About 64 instructions fell into one of these categories.
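To show what those two idioms look like, and what the 32 bit safe forms are when the caller does not need the flags preserved (a sketch only - whether flag preservation is required depends on the API in question, as discussed later):

; 26 bit idioms - restore the PSR bits held in R14 / in the stacked PC
MOVS  PC, R14
LDMFD R13!, {R0-R12, PC}^

; 32 bit safe equivalents - return without touching the status bits
MOV   PC, R14
LDMFD R13!, {R0-R12, PC}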

There are likely to be few ways of making the conversion process automatic. Basically...



  • How will the system know what is data and what is code?
    Actually, a clever rules-based program should be able to make a fairly good guess, but is a "fairly good guess" good enough?

  • There is NO simple instruction replacement. An automatic system probably could patch in the required instructions and jiggle the code around, but this could cause unexpected side effects, like an ADR directive no longer being in range.

  • It is incredibly hacky. Surely, much better to recompile, or to repair the source code.

 

It is NOT easy. Such a small change, but with such far-reaching consequences.

 

In comp.sys.acorn.programmer, Stewart Brodie answered my query with a hint that may be useful to people intending to work with 32 bit code:



> How is it possible, if 32 bit code uses MSR/MRS to transfer status and
> register, and older ARMs don't have those instructions?
> Are we into "black magic" code for this?

You take advantage of the fact that the encodings for MSR and MRS act as NOPs
on ARM2 and ARM3 ;-) With some careful arrangement, you can write fairly
tight code.
To refer back to earlier postings, an example of when MOVS pc, lr in a
32-bit mode is useful (entered in SVC or IRQ mode, IRQs disabled):

ADR   r14, CallBackRegs
TEQ   PC, PC
LDREQ r0, [r14, #16*4]      ; The CPSR
MSREQ SPSR_cxsf, r0         ; put into SPSR_svc/SPSR_irq ready for MOVS
LDMIA r14, {r0-r14}^        ; Restore user registers
NOP
LDR   r14, [r14, #15*4]     ; The pc
MOVS  pc, r14               ; Back we go (32-bit safe - SPSR set up)

(CallBackRegs contains user mode registers: R0-R15, plus the CPSR if in a
32-bit mode)

 

 

Download a 32 bit code scanner (12K)



 

 

Where is the example?


In the logical place, in the document describing the processor status register...

 

What about old stuff for which we don't have sources?


There are two options...

The first option is a one-time conversion. We can use an intelligent disassembler (such as D.Ruck's !ARMalyser) to provide us with a source of the software, with the 32 bit unsafe parts identified. I used this method to cobble together a 32 bit version of one of my modules.


For fairly short things, this will be okay. For large projects... I shudder to think! One thing to be especially aware of is that some older software uses tricks like popping flags into 'unused' bits of addresses. A good example here is software that uses bits 0-27 as an address and bits 28-31 as flags...

1 << 28 = 268435456

What this means, in essence, is that the software will work fine on all older machines - including the majority of RiscPCs for which 256Mb was the limit of installable memory.
If, though, we run this on a 512Mb Iyonix (which is no longer out of the realms of possibility), as soon as it is loaded to an address over 256Mb ... bit 28 will be set!
The code will need to be examined to ensure such things don't occur, and if they do, it'll need to be worked around.
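A sketch of the sort of trick being described (the flag bit and registers here are invented for the example):

ORR   R0, R0, #1<<28          ; pack a software flag into bit 28 of a pointer
TST   R0, #1<<28              ; test the flag later on
BIC   R1, R0, #&F0000000      ; recover the 'address' - and silently lose
                              ; any real address bits above 256Mb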
As far as I'm aware, while APCS-R requires flags to be saved, I've yet to see my C compiler generate code that depends upon the saving of flags across function calls. The typical example is:

Note that the N, Z, C and V flags from lr at the instant of entry must be reinstated; it is not sufficient merely to preserve the PSR across the call. Consider a function ProcA which tail continues to ProcB as follows:

CMPS  a1, #0
MOVLT a2, #255
MOVGE a2, #0
B     ProcB

If ProcB merely preserves the flags it sees on entry, rather than restoring those from lr, the wrong flags may be set when ProcB returns direct to ProcA's caller.

While it has not been my experience that the C compiler generates such code, humans can. And much worse. This, too, must be taken into account. And all those ORRing values into R14 to directly twiddle the processor flags (on return)...

The other method is to fake the old environment. All we need do is load up a few old modules, poke our application at 'troublesome' points, and force everything to be in an area of memory that we consider 'safe'. Then we let our program loose with the same sort of critical care that you'd give a hungry cat in a room full of budgies... This, more or less, is what Aemulor does.
But, at a cost.

From the "Inside Aemulor" article on the Foundation RISC User (issue 11; January 2003) CD-ROM, we encounter a very important point:



From RISC OS's perspective, the Aemulor RMA is a normal dynamic area, but Aemulor remaps the memory at an address below 64Mb so that it becomes addressable within the 26-bit environment. Because this emulated RMA is visible to all applications, native 32-bit applications are also restricted to a maximum size of 28Mb each (as per RISC OS 4) whilst Aemulor is running. It is hoped that this limitation can be removed with a later version.

Or, as they say: There's no such thing as a free lunch.

Having said that, the use of Aemulor is essential for all those must-have programs that either cannot sensibly be modernised, or are unlikely to be modernised.
I have heard that somebody is 32bitting Impression Publisher. Well, you know, I heard once that somebody was porting Mozilla to RISC OS. Who knows, maybe I'm wrong... :-)

 

 


What API changes have there been?


The "Technical information on 26/320bit RISC OS binary interfaces" (v0.2) states:

Many existing APIs do not actually require flag preservation, such as service call entries. In this case, simply changing MOVS PC... to MOV PC... and LDM {}^ to LDM {} is sufficient to achieve 32-bit compatibility.

This is possibly worse than useless as it doesn't specify exactly which APIs need it and which don't. Is it safe to assume that everything not otherwise described is safe?

The best thing to do is get hold of that document and browse through it. Please do not simply assume that things will work if you don't save flags.
Generally, this is the case, but unless you have a RISC OS 3.10 machine to test it on...

 

 




Copyright © 2004 Richard Murray