Mill Computing, Inc. Forums The Mill Tools Applications Application Walkthrough

  • Author
    Posts
  • mermerico
    Moderator
    Post count: 10
    #341 |

    Now that we know a little bit about how many parts of the Mill work, I think it might be useful to create a walkthrough of a sample application. What I am imagining is a simple C program and a sample compilation in x86-like assembly and Mill-like assembly (just like in the talks, this can be psuedocode assembly). The walkthrough would then show how the two machines would execute the program and approximately how much time it would take. I think this would be useful in integrating all that we have learned so far from the talks into a simple demonstration of why the Mill is a better architecture.

    • This topic was modified 10 years, 3 months ago by  mermerico.
  • Ivan Godard
    Keymaster
    Post count: 689

    Have you a suggested app or fragment?

  • Joe Taber
    Participant
    Post count: 25

    GCD
    Factorial
    Fibonacci
    Anagrams
    FizzBuzz
    Find the median of 3 (4,5?) numbers
    Matrix or Vector products.
    Exponentiation
    CRC-32
    Run-length encoding
    Any of the sorting algorithms, or maybe just the merge operation (I feel like it might be interesting but short in mill asm).
    CSV manipluation
    Towers of Hanoi (recursion)
    Zig-zag matrix

    Basically almost everything on Rosetta Code will be interesting, and have sample code in C and assembly. Of course some of these may be too complicated or require os-specific console input/output.

    • Ivan Godard
      Keymaster
      Post count: 689

      A good list. We do have to pick a Mill member, but for now assume one big enough (unlimited slots and belt) for anything; the actual slot and belt requirement is an in interesting result in its own right.

      I’ll take the first: GCD.

      /* code based on Rosetta C++ example:
         int gcd(int u, int v) {
         return (v != 0)?gcd(v, u%v):u;
         } */
      F("gcd");            // u in b0, v in b1
         neqs(b1, 0), rems(b0, b1);
         retnfl(b0, b2);
         nop(4);           // wait for rem
         call1("gcd", b3, b0);
         retn(b0);
      

      This needs a 3-long belt, one flow slot and two exu slots (suitably populated); 8 cycles, excluding the nested call body.

      /* code based on Rosetta C++ example:
         int
         gcd_iter(int u, int v) {
           int t;
           while (v) {
             t = u; 
             u = v; 
             v = t % v;
           }
           return u < 0 ? -u : u; /* abs(u) */
         } */
      F("gcd_iter");       // u in b0, v in b1
         L("loop");
            neqs(b1, 0), rems(b0, b1);
            brfl(b0, "xit");
            nop(4);        // wait for rem
            conform(b3, b0);
            br("loop");
         L("xit");
            lsss(b0, 0), negs(b0);
            pick(b0, b1, b2);
            retn(b0);

      This needs a 3-long belt, two exu slots, onr flow slot and a pick slot; the loop body is 8 cycles, plus 3 cycles for the wrap-up.

      In both I have used speculation to launch the rems operation before it is known to be needed; without speculation the cycle counts would be 8 not 7.

      The code does not use phasing (NYF). With phasing the count drops to 7 cycles for the first, while the second gets a 7 cycle loop and a one cycle wrap-up.

      Your turn 🙂

      • This reply was modified 10 years, 3 months ago by  staff. Reason: formatting fix
    • Ivan Godard
      Keymaster
      Post count: 689

      Bummer: I see the forum has squeezed out all the extra blanks in my carefully-formatted code. I’ll take it up with forum admin.

      • Joe Taber
        Participant
        Post count: 25

        Too bad there is no preview option before posting. The code button in the edit bar surrounds selected text with backticks (same key as ~), but I don’t know if it works on multiple lines. Lets see how bad wordpress butchers this:

        (this is a test)
        (on three)
        (separate lines)

        I find it interesting you mentioned that we have to pick a mill member. Is there no way to write directly in the intermediate language that gets translated by the Specializer? Or is that what you just did and the IL code is just written for a hypothetical mill member with unlimited slots and belt?

        Is there a reference for all the instructions?

        It seems like it would be easier to tag an instruction and refer to its outputs by tag (and index if multiple outputs) instead of by belt number. The tags should be trivially converted to belt numbers during translation, but be much easier to program in, especially for hand-coded assembly like this.

        • Ivan Godard
          Keymaster
          Post count: 689

          Yep, I hadn’t noticed the “code” button 🙁

          The code was hand written; as you see, Mill asm is distinctly *not* humane.

          The ability to write relAsm (which is what we have called specializer input) was initially supposed to exist, but then dropped when we realized that there was no need and great cost, if it could be done at all – writing graph structures is no friendlier than absAsm (what the assembler accepts).

          Dave has long advocated a form of absAsm that at least calculated the belt for you, and even has a personal tool that sort-of does that that he used for some benchmarks. While there are issues that so far there are no solutions for, it looks like it could be done and would make asm easier. It remains a someday project, awaiting a volunteer and progress on higher priority projects.

          Unfortunately the entire operation set contains a few NYFs, so we can’t publish it yet. However, it should be possible to write at least a few of the samples using only the known operations, or pseudo-ops standing in for ops that you don’t know the mnemonic for yet. The format I used on the slides omits belt numbers in favor of a comment indicating what was being referenced. Thus the first gcd would be:

          /* code based on Rosetta C++ example:
             int gcd(int u, int v) {
             return (v != 0)?gcd(v, u%v):u;
             } */
          F("gcd"); 
             neqs({v}, 0), rems({u}, {v});
             retnfl({neqs}, {u});
             nop(4);           // wait for rem
             call1("gcd", {v}, {rems});
             retn({call});
          

          Would that help?

  • Will_Edwards
    Moderator
    Post count: 98

    One interesting micro-problem that would help me understand the Mill is computing the Euclidean distance between two 3D points.

    The source code might look like this:

    d = sqrt(sqr(a.x - b.x) + sqr(a.y - b.y) + sqr(a.z - b.z))

    Or it might be:

    d = sqrt(sqr(a[0] - b[0]) + sqr(a[1] - b[1]) + sqr(a[2] - b[2]);

    Which are naturally equivalent if the fields in the point struct used in the first form are adjacent.

    The parallelism of the subtractions and squaring is obvious, and easy as vector or as separate parallel operations.

    • If vectorised, can you load a non-power-of-two length vector (perhaps it puts a power of two length vector on the belt, with None in the last slot?)
    • If vectorised, how do you then sum the values in the vector together?
    • And if done as separate operations, do you need two sequential add operations to add them together?
    • This reply was modified 10 years, 2 months ago by  Will_Edwards. Reason: clarifications
    • Ivan Godard
      Keymaster
      Post count: 689

      I don’t know this kind of application and your description is a bit cryptic, so this is a guess.

      The two loads, subtract, and square are routine vector ops. By “sum” I guess you mean an add-reduction; if so, NYF. A scalar sqrt uses a hardware sqrt-approximation and a software in-line algorithm, typically Newton-Rapheson. The whole can be pipelines, NYF.

You must be logged in to reply to this topic.