Permitted aids:
One A4 sheet of handwritten notes. This sheet may not be a photocopy.
Instructions:
Calculators are not permitted. Annotated numeric expressions that can be trivially reduced to the correct answer with a calculator will be accepted for full marks. You are expected to be able to do arithmetic on powers of 2.
Answer as many questions as you can in the time allowed. There are 113 marks on this paper: a reasonable target is 100 marks.
Write the answers to questions in Section A on the examination paper. If you need additional space or wish to change your answer, you may use your answer book. Make sure to indicate clearly when answers to Section A questions appear in your answer book. A large “SAB” (See Answer Book) over wrong answers will suffice.
When you have finished, hand in Section A of the examination paper, your answer book and your A4 sheet of notes.
Section A
Short Answer Questions
The questions in this section can be answered with no more than two concise sentences, a short calculation or a simple diagram. Sometimes a single word will suffice. The marks allocated to each question give some indication of the length of answer required. Questions marked with an * continue on from the previous question: read all the questions in the group before answering.
Question 1
Except where otherwise indicated, use the following operating system and processor characteristics in all questions.
Your operating system uses 8 kbyte pages. The machine you are using has a 4-way set associative 32 kbyte unified L1 cache and a 64 entry fully associative TLB. Cache lines contain 32 bytes. Integer registers are 32 bits wide. Physical addresses are also 32 bits. It supports virtual addresses of 46 bits. 1 Gbyte of main memory is installed.
Marks
(a)
Give one advantage of a direct mapped cache.
Simple, fast hardware
Smaller circuitry (essentially same as ‘simple’)
Only one comparator (essentially same as ‘simple’)
1
(b)
What is the main disadvantage of a direct mapped cache?
Your program is a text processor for large documents: in an initial check, it scans the document looking for illegal characters. For an 8 Mbyte document, what would you expect the L1 cache hit rate to be during the initial check? (You are expected to do a calculation and give an approximate numeric answer!)
So L1 cache hit rate = 31/32 (sufficient for 4 marks) = 97%
Note that this is independent of the data size! Question (i) deals with the same issue.
4
(g)
*
Your program manipulates large arrays of data. In order to consistent good performance, you should avoid one thing. What is it? (Be precise – a numeric answer relevant to the processor described above and an explanation is required here.)
General answer: Matrices with rows containing 2k+m bytes.
where k is number of bits to address a set ( = 8 here)
and
m is number of bits to address a byte within a line ( = 5 here)
So rows of 213 bytes will have a high probability of generating conflicts.
Alternative: Arrays of any structure containing 213 bytes may exhibit the same problem.
4
(h)
*
What is the alternative to a unified cache? What advantages does it provide?
Main reason: Greater bandwidth to the caches; ability to access both caches simultaneously
Secondary reasons:
Instruction cache can be simpler (so faster and uses fewer transistors)
Ability to store pre-decoded instructions in I-cache
3
(i)
*
In addition to data and tags, a cache will have additional bits associated with each entry. List these bits and add a short phrase describing the purpose of each bit (or set of bits). (In all cases, make your answers concise: simply list any differences from a preceding answer.) (i) A set-associative write-back cache
Valid – cache data is valid
Dirty – cache data is modified and needs to be written back
LRU – LRU algorithm to identify ‘old’ data or best candidate for replacement
32 processes are currently running. If the OS permitted each process to use the maximum possible address space, how many page table entries are required.
(i) Conventional page tables
Draw a diagram showing how the bits of a virtual address are used to generate a 32-bit physical address.
Diagram from lecture notes annotated with correct number of bits for this processor needed for full marks.
Low order bits of VA: offset within page, 13 bits copied directly to physical address (PA)
High order 33 bits of VA address a PTE, from which high order 19 bits of PA are extracted.
2
(l)
“A program which simply copies a large block of data from one memory location to another exhibits little locality of reference, therefore its performance is not improved by the presence of a cache.” Comment on this statement. Is it strictly true, mostly true or not true at all? Explain your answer. Assume you are running programs on the processor described at the beginning of this section.
Not true: the best thing for the copying program to do is to work sequentially through memory. This means that it will exploit spatial locality: once one word from a line is loaded, all remaining words in the same cache line will lead to cache hits.
4
(m)
You are advising a team of programmers writing a large scientific simulation program. The team mainly consists of CS graduates who skipped any study of computer architecture in their degrees. Performance is critical.
List some simple things that you would advise them to do when writing code for this system. Provide a one sentence explanation for each point of advice.
(1 mark for each valid piece of advice, 1 for explaining it and 1 for adding a number that makes the advice specific to the processor described earlier.)
Put frequently accessed data in small blocks (ensure spatial locality) – to benefit from long caches line – specifically, pack data into 32-byte chunks wherever possible
Do not make matrices with rows that are 2x bytes long – to reduce cache conflicts – specifically avoid x = 13 for this processor
Pack as much data into pages as possible – to reduce page thrashing – a page is 213 bytes on this processor
Do not make random jumps through memory when accessing large structures – spatial locality at all levels – packing into 32 byte lines or 8k byte pages best for this processor
Small ‘tight’ program loops will use the I-cache efficiently – critical sizes are 32 bytes or 8 instructions and 32kbytes (L1 cache size)
Max 12
(v)
This question deliberately left blank.
No marks even if you do know the answer.
Predictably, several readers of Douglas Adams wrote 42 here