CIS 451 Week 11
Synchronization
- To effectively use multiple threads, you typically need some mechanism for mutual exclusion: guaranteeing that a
segment of code runs while no other thread is in a corresponding sensitive area.
- For example: Producer thread adds data to a shared array, consumer removes data from that array. The sections
in each thread that access the array must be mutually exclusive
- Producer:
  p1: ++last;  p2: buffer[last] = newItem;
- Consumer:
  c1: itemToConsume = buffer[last];  c2: --last;
- What happens if the code gets executed in the order c1, p1, p2, c2?
- The conventional solution is to make sure the consumer can’t execute its “critical section” while the producer is executing
its critical section (see the mutex sketch below).
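- A minimal sketch (not from the notes; assumes POSIX threads and omits empty/full checks) of how a mutex makes the two critical sections mutually exclusive:

    #include <pthread.h>

    #define CAPACITY 64

    static int buffer[CAPACITY];
    static int last = -1;   /* index of the most recently produced item */
    static pthread_mutex_t buffer_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Producer critical section: p1 and p2 run with no consumer code
       interleaved between them. */
    void produce(int newItem) {
        pthread_mutex_lock(&buffer_lock);
        ++last;                     /* p1 */
        buffer[last] = newItem;     /* p2 */
        pthread_mutex_unlock(&buffer_lock);
    }

    /* Consumer critical section: c1 and c2 likewise run as a unit, so the
       c1, p1, p2, c2 interleaving above cannot happen. */
    int consume(void) {
        pthread_mutex_lock(&buffer_lock);
        int itemToConsume = buffer[last];   /* c1 */
        --last;                             /* c2 */
        pthread_mutex_unlock(&buffer_lock);
        return itemToConsume;
    }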
- In a single-processor system an atomic testAndSet instruction can provide the basis for mutual exclusion:
  top: tas r1, addr, 1; bnez r1, top
- This guards against threads/processes being interleaved.
- This needs to be a special instruction built into the hardware, even in a single-processor CPU.
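- A sketch of how such a primitive is typically wrapped into a spin lock, using C11’s atomic_flag (which compilers lower to the hardware’s test-and-set or equivalent); the names are illustrative:

    #include <stdatomic.h>

    static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

    /* atomic_flag_test_and_set returns the *old* value: 0 (false) means the
       flag was clear and we just set it, i.e., we now hold the lock. */
    void acquire(void) {
        while (atomic_flag_test_and_set(&lock_flag))
            ;   /* someone else holds the lock: keep retrying */
    }

    void release(void) {
        atomic_flag_clear(&lock_flag);
    }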
- In a multi-core / multiprocessor system the two threads may be running at the same time on different cores.
- Typical algorithms still applicable
- Test and set
- atomic exchange
- fetch-and-increment
- However, these require an atomic mem read / mem write combination. This can be difficult to do efficiently.
- More common approach is not requiring atomicity, but detecting when a read/write pair gets interrupted.
- MIPS and RISC-V use load-reserved and store-conditional:
  - load-reserved loads data from the requested address, then creates a reservation on that memory address
    (i.e., stores the address somewhere “safe”).
  - store-conditional verifies that the reservation is still in place; it returns 0 on success and non-zero on failure.
- To be specific, an intervening store to that address (or an invalidation message on the bus) clears the reservation and
causes the subsequent store conditional to fail.
  - A context switch also invalidates the reservation.
- “Atomic” exchange of x4 and the data located at addr1:
  try: mov x3 <-- x4
       lr x2 <-- addr1
       sc x3, addr1     # stores the value of x3 *and* x3 gets 0 for success, non-zero for failure
       bnez x3, try     # loop if the sc fails
       mov x4 <-- x2
- Register value and memory value should be atomically swapped.
- If another memory access to addr1 intervenes, then the attempt fails and we try again.
- lr places addr1 in a special register called the reserve register.
  - When the cache sees an invalidate message come in and the address in that message matches the one in the
    reserve register, the subsequent sc must fail.
  - Note that the bus access protocol ensures that all processors will agree on which sc is first (and which others must fail).
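- For reference, the same retry structure is what you get from a C11 compare-and-swap loop, which a RISC-V compiler typically lowers to an lr/sc loop; a sketch (illustrative, not from the notes):

    #include <stdatomic.h>

    /* Atomically swap newval with the word at *addr and return the old value.
       The weak compare-exchange may fail spuriously (like a failed sc), and on
       failure it reloads `old`, so the loop mirrors the lr/sc/bnez sequence.
       (atomic_exchange(addr, newval) would do the same thing in one call.) */
    int atomic_swap(_Atomic int *addr, int newval) {
        int old = atomic_load(addr);                            /* like lr            */
        while (!atomic_compare_exchange_weak(addr, &old, newval))
            ;                                                   /* like bnez x3, try  */
        return old;                                             /* like mov x4 <-- x2 */
    }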
li x2 <-- #1
lockit: EXCH x2, 0(addr_of_lock)
bnez x2, lockit
- Kind of like “test and set”
- In other words:
  - Set x2 equal to 1.
  - Do an atomic swap with the memory address of the lock.
  - If x2 comes back with a 0 in it, then the lock wasn’t held previously and the process now has the lock.
  - Otherwise, the lock is in use; try again.
- Problem with doing this in a multiprocessor system?
- The sc will cause coherency messages.
- The spinning will cause a lot of coherency messages.
- How to improve?
- Do a simple read first. (Don’t try to acquire the lock until you know it’s free).
- The simple read won’t cause traffic.
- Eventually, the write that frees the lock will cause a coherence message to be sent
which will update the cached copy and allow your code to progress to the acquisition stage.
lockit: ld x2 <-- 0(addr_of_lock)   # spin on a plain read until the lock looks free
        bnez x2, lockit
        li x2 <-- #1
        EXCH x2, 0(addr_of_lock)    # now try to acquire the lock atomically
        bnez x2, lockit             # non-zero: someone else grabbed it; go back to spinning
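- A C sketch of this “test and test-and-set” pattern, assuming a C11 atomic int where 0 = free and 1 = held (illustrative, not from the notes):

    #include <stdatomic.h>

    static _Atomic int the_lock = 0;   /* 0 = free, 1 = held */

    void acquire(void) {
        for (;;) {
            /* Spin on a plain read first: it hits in the local cache and
               generates no coherence traffic while the lock is held. */
            while (atomic_load_explicit(&the_lock, memory_order_relaxed) != 0)
                ;
            /* Lock looks free: now do the atomic exchange (the EXCH above).
               Getting 0 back means we acquired it; 1 means we lost the race. */
            if (atomic_exchange(&the_lock, 1) == 0)
                return;
        }
    }

    void release(void) {
        atomic_store(&the_lock, 0);
    }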
Consistency
- Coherence ensures that multiple processors see a consistent view of memory.
- But it does not say how consistent, i.e., when a written value must become visible to other processors.
- Sounds like a simple question; but, it is very difficult.
P1: A = 0;                P2: B = 0;
    ...                       ...
    A = 1;                    B = 1;
L1: if (B == 0) ...       L2: if (A == 0) ...
- If writes are serialized and take effect immediately, then it should be impossible for both if statements to run
(i.e., for both A and B to be 0).
- What if the write invalidate messages are delayed? (Think “cross in the mail”)
- Should this be allowed?
- Under what circumstances?
- Sequential consistency:
- Require that the system behave as if the two codes were “shuffled together”
- This requirement would rule out the situation above.
- Can be implemented by requiring that stores block until all invalidations are complete.
- Can also be implemented by delaying next memory access until the previous one is complete.
- Effective, but can cause delays in busy or big systems (where response would be slow).
- Writes can take 100+ cycles to complete.
- Establish ownership (50+)
- Send invalidate (10)
- Time to receive acknowledgement of invalidate (80+)
- Assume programs are synchronized.
- That is, there is a pair of synchronization operations between a write on one processor and a read on another.
(For example, a mutual exclusion lock.)
- If there isn’t, there is a data race.
- This is a reasonable assumption because most programs are synchronized.
- Mutex, semaphores, locks, critical sections — all that good stuff from 452.
- It is difficult to reason about programs that aren’t — even if the underlying hardware has sequential
consistency.
- Common approach is to use a relaxed consistency model that can use synchronization to behave as if the
system had sequential consistency.
- Many processors use “release consistency”, which means that consistency is only guaranteed among special
operations that acquire and release locks on shared data.
- RISC-V also has a FENCE instruction that ensures that all previous memory operations have completed before later ones
are performed.
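- A sketch of the A/B example using C11 atomics (illustrative, not from the notes): with the default sequentially consistent ordering the compiler inserts the needed fences (roughly RISC-V FENCE instructions or aq/rl annotations), and the “both read 0” outcome is forbidden; with relaxed ordering it would be allowed.

    #include <stdatomic.h>

    static _Atomic int A = 0, B = 0;

    void p1(void) {
        atomic_store(&A, 1);            /* seq_cst store */
        if (atomic_load(&B) == 0) {     /* seq_cst load  */
            /* ... runs only if P2 has not yet stored B */
        }
    }

    void p2(void) {
        atomic_store(&B, 1);
        if (atomic_load(&A) == 0) {
            /* ... under seq_cst, at most one of the two ifs can see a 0 */
        }
    }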
Cross-cutting issues (Section 5.7)
- Compiler optimization
- Currently programs must usually be synchronized, because otherwise it is difficult for compilers to optimize code.
- If the synchronization is done behind the scenes by the consistency model, then compilers don’t know which
reads and writes are shared and which aren’t (see the sketch below). This affects
  - reordering
  - register allocation
- Whether/how compilers can leverage more precise consistency models is an open research question.
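- A sketch of the register-allocation issue (illustrative): if a flag is a plain int, the compiler may legally load it into a register once and spin forever; declaring it _Atomic (or otherwise telling the compiler the access is shared) forces it to re-read memory and restricts reordering.

    #include <stdatomic.h>

    int plain_flag = 0;             /* compiler assumes no other thread touches it */
    _Atomic int shared_flag = 0;    /* identified as shared */

    void wait_plain(void) {
        /* The load of plain_flag may be hoisted out of the loop
           (register allocated), turning this into an infinite loop. */
        while (plain_flag == 0)
            ;
    }

    void wait_shared(void) {
        /* Each iteration must re-read memory, and the access cannot be
           freely reordered with other atomic operations. */
        while (atomic_load(&shared_flag) == 0)
            ;
    }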
RAID
- Disks have long latency (even SSDs, currently)
- One solution is to read data in parallel by striping data across disks.
- What is the problem?
- One failure ruins all of the data.
- The probability that some disk fails grows rapidly with the number of disks: the array’s MTTF is roughly one disk’s MTTF divided by the number of disks.
- Solution: RAID
  - Redundant Array of Inexpensive/Independent Disks
- Different levels
- RAID 0: Striping. Gives performance but risks failures
- RAID 1: Mirroring.
- RAID 1/0: Striped mirroring. Performance, but uses 2X space or more.
- RAID 2-4: Not typically used.
- RAID 5: Striping with parity.
- RAID 5:
- Simple explanation: One parity disk.
- problem?
- Parity disk becomes a bottleneck
- Solution?
- Rotate the parity block around so it’s not always on the same disk.
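- A sketch of the parity computation (illustrative block size; the rotation scheme shown is one common choice and varies by implementation):

    #include <stddef.h>

    #define BLOCK_SIZE 4096   /* illustrative */

    /* Parity block for one stripe = XOR of the stripe's n data blocks.
       Rebuilding a lost data block is the same operation: XOR the surviving
       data blocks with the parity block. */
    void compute_parity(unsigned char *const data[], size_t n,
                        unsigned char parity[BLOCK_SIZE])
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            unsigned char p = 0;
            for (size_t d = 0; d < n; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }

    /* One common way to rotate the parity block across disks: */
    size_t parity_disk(size_t stripe, size_t num_disks)
    {
        return (num_disks - 1) - (stripe % num_disks);
    }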
Warehouse-Scale Computing
- Nicholas Carr Quote from 1st page of chapter
- An alternative to SMPs is clusters: groups of independent networked machines.
- Clusters work well for highly parallelizable problems that require little synchronization.
- Often cheaper than SMPs, because they are built from commodity hardware.
- The natural extension of a cluster is a Warehouse Scale Computer
- 50,000 - 100,000 servers in a big building.
- Think Google, Amazon, etc.
- Slightly different focus
- Clusters focus on thread level parallelism (parts of a single problem)
- Often using frameworks such as MPI (message passing) or OpenMP (shared memory) for coordination
- Warehouse scale computers focus on request level parallelism
- Think about serving web requests
- These requests tend to be very independent and require very little synchronization.
- Other differences with clusters:
- Clusters tend to be homogeneous (multiple copies of the same HW — like EOS)
- Warehouses tend to be heterogeneous. Either
- to provide different levels of service to different customers, or
- because they can (and it can be cheaper)
- Differences with SMP:
- SMP designers don’t tend to worry about operational cost. Cost is dominated by the cost of the machine itself.
- WSC costs are dominated by operational cost (electricity, cooling)
- With a WSC, small savings in operational costs add up quickly
- With a WSC, location counts. Examples:
- Need population center nearby for staff
- Nearby source of inexpensive power
- Need good Internet connectivity
- Singapore and Australia are close, but Internet bandwidth is lacking
- Low risk of environmental problems (earthquake, tornado, etc.)
- Cool weather is a plus.
- Things like local laws and taxes matter
- A WSC must be cost effective at low utilization
- WSCs are often way over-provisioned to handle bursts.
- “Spare” capacity can’t consume too much power when not in use.
- WSCs need a plan to handle failures
- Appropriate redundancy
- Ability to recover
- Job scheduling
- With 50,000+ machines, performance varies greatly between machines — even supposedly homogeneous ones.
- Scheduling is dynamic, watching when jobs finish and assigning new tasks accordingly.
- Not unusual to re-submit a job if it takes too long in case the current CPU/machine is slow because of some
sort of failure.
- Take result of first job to finish.
- Lots of redundancy for performance reasons.
- e.g., data is often replicated for performance reasons (e.g., to be close to where it’s needed)
- Consistency models very relaxed where possible
- Sometimes it is not necessary for all replicas to agree precisely.
- “Eventual consistency” for video sharing
- Typical configuration:
- 10s of machines in a single rack.
- Switch at the top of the rack.
- Groups of racks called “arrays” with a switch for the array
- Entire warehouse comprises many arrays.
- Arrays may even have their own hierarchy.
- This means, of course, that administrators need to take care to place related applications in the same rack to get
best communication time.
- This hierarchy “collapses” the differences in latency between RAM and disk
- Huge difference in latency between RAM and disk within a machine.
- Very small difference at the top of the hierarchy.
- See Figure 6.6
Efficiency and Cost (Section 6.4)
- Infrastructure costs for power distribution and cooling are the majority of the construction costs of a WSC.
- Actual power usage inside a Google Warehouse in 2012:
- 42% to power processors
- 12% to power DRAM
- 14% for disks
- 5% for networking
- 15% for cooling overhead (not sure why term “overhead” used.)
- 8% for power overhead
- 4% miscellaneous
- PUE (Power utilization effectiveness) used to measure efficiency.
- (Total facility power ) / (IT equipment power)
- Sample of 19 data centers showed PUE of 1.33 to 3.03.
- AC ranged from .3 to 1.4 times IT power
- Google improved from 1.22 to 1.12 from 2008 through 2017
- Does not account for performance — smallest PUE is not necessarily cheapest $/operation
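- Worked example (illustrative numbers, not from the text): a facility drawing 1.5 MW in total to run 1.2 MW of IT equipment has PUE = 1.5 / 1.2 = 1.25; at Google’s 2017 figure of 1.12, the same 1.2 MW of IT load would need only about 1.34 MW total.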
- Effects of latency
- Increasing latency had a much larger effect on total time to completion.
- Cutting response time by 30% led to a 70% decrease in total interaction time
- Another study showed a 200ms delay at server produced 500ms increase in time to next click.
- Revenue dropped linearly with the increase in delay
- User satisfaction also dropped linearly
- Effects lingered
- After 200ms delay, usage was down 0.1% four weeks after latency returned to normal
- 400ms delay produced lingering 0.2% decrease
- This results in the loss of a lot of money
- How to set latency targets in response?
- Can’t just pick an average. Need a target of the form “99% of users see latency below x.”
- The very slow ones are lost revenue. At some point, faster than noticeable doesn’t help, so you don’t want the
very fast ones to “give permission” for slow ones.
- These goals are called SLOs (Service Level Objectives).
Section 6.5
- Economies of scale
- $4.6 per GB per year for WSC vs $26/GB for a data center
- 7.1x reduction in administrative costs (1000 servers/admin vs 140)
- 7.3x reduction in networking costs ($13/Mbit/month vs $95)
- Can also purchase equipment in larger volumes.
- Warehouses tend to serve several customers.
- Even when a warehouse is built for one primary client, excess capacity is leased out to the public. Why?
  - The public tends to have different peaks, so resources can be used more efficiently.
- In 2007 many power supplies were only 60% to 80% efficient
Section 6.7 — Example Google Warehouse
- They avoid typical A/C whenever possible
- Hot air forced up and into heat exchanger with cool water
- Water cooled on the roof using evaporation, when possible.
- Finland’s warehouse uses cold water from Gulf of Finland
- Some traditional A/C capacity is present when necessary.
- Designed to operate at 80°F, rather than the 65 to 70°F of a
traditional data center
- Allows for cooling towers to work in most cases (as opposed to traditional A/C)
- Power delivered as AC to each rack, then 48V DC down the rack to the
individual servers.
- Each warehouse has diesel generators for power failures
- But, they take 10s of seconds to spin up.
- UPS on each rack to fill the gap
- Contracts for continuous diesel delivery for extended outages.