World Library  
Flag as Inappropriate
Email this Article

Duff's device

Article Id: WHEBN0000456753
Reproduction Date:

Title: Duff's device  
Author: World Heritage Encyclopedia
Language: English
Subject: Tom Duff, Loop-switch sequence, C programming language, Rc, Articles with example C++ code
Collection: Articles with Example C Code, C (Programming Language), C Programming Language, Computer Folklore
Publisher: World Heritage Encyclopedia

Duff's device

In computer science, Duff's device is an optimized implementation of a serial copy that uses a technique widely applied in assembly language for loop unwinding. Its discovery is credited to Tom Duff in November 1983, who at the time was working for Lucasfilm. It is perhaps the most dramatic use of case label fall-through in the C programming language to date.[1] Duff does not claim credit for discovering the concept of loop unrolling, just this particular expression of it in C.

When used in combination with various optimizations performed by modern compilers, in some cases using Duff's device does not provide expected performance improvements.[2] Thus, it may be advisable to perform some benchmarking of different implementations on the target computer architecture.


  • Background 1
  • Original version 2
  • Mechanism 3
  • Simplified explanation 4
  • Performance 5
  • Stroustrup's version 6
  • See also 7
  • Notes 8
  • References 9
  • Further reading 10
  • External links 11


Loop unrolling revolves around lowering the number of branches made, by batching them together. To handle cases where the number of iterations is not divisible by the unrolled-loop increments, a common technique is to jump directly into the middle of the unrolled loop for copying the remainder.[3]

Duff was looking for a similar optimization for his case, and succeeded in doing so in C, unrolling a loop into a loop which assigns (up to) eight values on each iteration.[4]

Original version

A straightforward code to copy items from an array to a memory-mapped output register might look like this:

do {                          /* count > 0 assumed */
    *to = *from++;            /* "to" pointer is NOT incremented, see explanation below */
} while(--count > 0);

The code above does not perform a memory-to-memory copy, in which *to++ would be used.

While optimizing this, Duff realized that an unrolled version of this loop could be implemented by interlacing the structures of a fallthrough switch and a loop.[3][1]

send(to, from, count)
register short *to, *from;
register count;
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);

Duff's device can similarly be applied with any other size for the unrolled loop, not just eight as in the example above.


Based on an algorithm used widely by programmers coding in assembly for minimizing the number of tests and branches during a copy, Duff's device appears out of place when implemented in C. The device is valid C by virtue of two attributes in C:

  1. Relaxed specification of the switch statement in the language's definition. At the time of the device's invention this was the first edition of The C Programming Language which requires only that the controlled statement of the switch be a syntactically valid (compound) statement within which case labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a break statement, the flow of control will fall through from a statement controlled by one case label to that controlled by the next, this means that the code specifies a succession of count copies from sequential source addresses to the memory-mapped output port.
  2. The ability to jump into the middle of a loop in C.

As documented in the comment appearing in Duff's un-optimized version, the code assumes that count is strictly positive.

Simplified explanation

A functionally equivalent version
with switch and do disentangled
send(to, from, count)
register short *to, *from;
register count;
    register n = (count + 7) / 8;
    switch (count % 8) {
        case 0: *to = *from++;
        case 7: *to = *from++;
        case 6: *to = *from++;
        case 5: *to = *from++;
        case 4: *to = *from++;
        case 3: *to = *from++;
        case 2: *to = *from++;
        case 1: *to = *from++;
    while (--n > 0) {
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;

The basic idea of Duff's device is that the CPU time can be saved when running a loop by reducing the number of loop tests. For example, in the case of a loop with only a single instruction in the block code, the loop test will be performed for every iteration of the loop, that is every time the instruction is executed. If, instead, eight copies of the same instruction are placed in the loop, then the test will be performed only every eight iterations, and this will gain time by avoiding seven tests (this method is called loop unwinding). The problem is that for this to work the total number of iterations must be a multiple of eight.[3]

Duff's device provides a solution by first performing the "extra" number of iterations, after which a multiple of eight remains (in fact, the remainder of the integral division by eight), followed by iterating as many times as necessary the groups of eight similar instructions. To obtain the number of "extra" iterations, the code calculates the value of the total number of iterations modulo eight and then applies the following trick: according to that value, the processor will jump to a case statement placed in such a way that it is followed by exactly the number of iterations you need. Once this is done, everything is straightforward – the code continues by doing iterations of groups of eight instructions, this has become possible since the remaining number of iterations is a multiple of eight.[3]

An unusual feature of the code is that, to be able to do that jump, case keywords are required both inside and outside the loop. This is unusual because the contents of a case statement are traditionally thought of as a block of code nested inside the case statement, and a reader would typically expect it to end before the next case statement. According to the specifications of C language, this is not necessary; indeed, case statements can appear anywhere inside the switch code block, and at any depth; the processor will simply jump to the next statement, wherever it may be.


Many compilers will optimize the switch into a jump table just as would be done in an assembly implementation. C's default fall-through in case statements has long been one of its most controversial features; Duff observed that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."[1]

The primary increase in speed versus a simple, straightforward loop comes from loop unwinding, which reduces the number of branches performed (which are computationally expensive due to the need to flush - and hence stall - the pipeline). The switch/case statement is used to handle the remainder of the data not evenly divisible by the number of operations unrolled (in this example, 8 byte moves are unrolled, so the switch/case handles an extra 1–7 bytes automatically).

This automatic handling of the remainder may not be the best solution on all systems and compilers – in some cases two loops may actually be faster (one loop, unrolled, to do the main copy, and a second loop to handle the remainder). The problem appears to come down to the ability of the compiler to correctly optimize the device; it may also interfere with pipelining and branch prediction on some architectures.[5] When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, there was an improvement in performance and a noticeable reduction in size of the executable.[2] Therefore, when considering using this code, it may be worth running a few benchmarks to verify that it actually is the fastest code on the target architecture, at the target optimization level, with the target compiler.

For the purpose of memory-to-memory copies (which was not the original use of Duff's device, although it can be modified to serve this purpose as described in section below), the standard C library provides function memcpy; it will not perform worse than a memory-to-memory copy version of this code, and may contain architecture-specific optimizations that will make it significantly faster.[6][7]

Stroustrup's version

The original Device was designed to copy to a memory-mapped I/O port, pointed to by the variable to. To copy one memory location to another, one solution would be to auto-increment to in every assignment statement; that is, change every *to = *from++; to *to++ = *from++;. This modification of the Device appears in a "what does this code do?" exercise in Bjarne Stroustrup's book The C++ Programming Language.[8]

See also

  • Coroutine – Duff's device can be used to implement coroutines in C/C++


  1. ^ The code fragment uses K&R C version of the C programming language.


  1. ^ a b Duff's device from FOLDOC
  2. ^ a b Ted Tso (2000-08-22). "Re: [PATCH] Re: Move of input drivers, some word needed from you".  
  3. ^ a b c d Ralf Holly (2005-08-01). "A Reusable Duff Device".  
  4. ^ Tom Duff (1988-08-29). "Subject: Re: Explanation, please!".  
  5. ^ James Ralston's USENIX 2003 Journal
  6. ^ Wall, Mike (2002-03-19). "Using Block Prefetch for Optimized Memory Performance" (PDF).  
  7. ^ Fog, Agner (2012-02-29). "Optimizing subroutines in assembly language" (PDF).  
  8. ^   Exercise 15 in Section 6.6 (p.141)

This article is based on material taken from Duff's Device at the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.

Further reading


External links

  • Description and original mail by Duff at Lysator
  • Simon Tatham's coroutines in C utilizes the same switch/case trick
  • Adam Dunkels' Protothreads - Lightweight, Stackless Threads in C also uses nested switch/case statements (see also The lightest lightweight threads, Protothreads)
  • Pigeon's device is a related technique: intertwined switch/case and if/else statements
This article was sourced from Creative Commons Attribution-ShareAlike License; additional terms may apply. World Heritage Encyclopedia content is assembled from numerous content providers, Open Access Publishing, and in compliance with The Fair Access to Science and Technology Research Act (FASTR), Wikimedia Foundation, Inc., Public Library of Science, The Encyclopedia of Life, Open Book Publishers (OBP), PubMed, U.S. National Library of Medicine, National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health (NIH), U.S. Department of Health & Human Services, and, which sources content from all federal, state, local, tribal, and territorial government publication portals (.gov, .mil, .edu). Funding for and content contributors is made possible from the U.S. Congress, E-Government Act of 2002.
Crowd sourced content that is contributed to World Heritage Encyclopedia is peer reviewed and edited by our editorial staff to ensure quality scholarly research articles.
By using this site, you agree to the Terms of Use and Privacy Policy. World Heritage Encyclopedia™ is a registered trademark of the World Public Library Association, a non-profit organization.

Copyright © World Library Foundation. All rights reserved. eBooks from Project Gutenberg are sponsored by the World Library Foundation,
a 501c(4) Member's Support Non-Profit Organization, and is NOT affiliated with any governmental agency or department.