6 comments

  • kreelman 1 hour ago
    This is very much worth watching. It is a tour de force.

    Laurie does an amazing job of reimagining Google's strange job-optimisation technique (originally for jobs running against hard disk storage) that uses two CPUs to do the same job. The technique simply takes the result of whichever one finishes first, discarding the slower job's result... It seems expensive in resources, but it works and lets high-priority tasks run optimally.
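
    Roughly the pattern, as I understand it (my own loose C++ sketch, not Laurie's or Google's actual code): run the same lookup on two workers and keep whichever answer lands first.

      // Hedged work: run the same job twice, keep the first result (sketch only).
      #include <atomic>
      #include <cstdio>
      #include <thread>

      static std::atomic<bool> claimed{false};  // first finisher claims the slot
      static std::atomic<bool> ready{false};    // result is safe to read
      static long result = 0;

      long slow_lookup() { return 42; }         // stand-in for the real job

      void worker() {
          long r = slow_lookup();
          bool expected = false;
          if (claimed.compare_exchange_strong(expected, true)) {
              result = r;                                   // only the winner writes
              ready.store(true, std::memory_order_release);
          }                                                 // the loser's result is discarded
      }

      int main() {
          std::thread a(worker), b(worker);
          while (!ready.load(std::memory_order_acquire)) {} // spin until the faster copy lands
          std::printf("first result: %ld\n", result);
          a.join();
          b.join();
      }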

    Laurie reimagines this process, but for RAM!! In doing this she needs to deal with cores, RAM channels and other relatively undocumented CPU memory-management features.

    She was even able to work out various undocumented CPU/RAM settings by using her tool to spot where timing differences exposed them.

    She's turned "Tailslayer" into a lib now, available on GitHub: https://github.com/LaurieWired/tailslayer

    You can see her having so much fun, doing cool victory dances as she works out ways of getting around each of the issues that she finds.

    The experimentation, explanation and graphing of results are fantastic. Amazing stuff. Perhaps someone will use this somewhere?

    As mentioned in the YT comments, the work done here is probably a Master's degree's worth of work, experimentation and documentation.

    Go Laurie!

    • ufocia 44 minutes ago
      I like the video, but this is hardly groundbreaking. You send out two or more messengers hoping at least one of them will get there on time.
      • npunt 3 minutes ago
        and dropbox was just rsync
  • foltik 59 minutes ago
    Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.
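
    For anyone who wants to poke at this themselves, the core of such a probe is tiny (a rough sketch, not Laurie's actual tool; it reports raw TSC ticks, so convert to ns with your TSC frequency):

      // Minimal DRAM read-latency probe (sketch): flush one cache line, time a
      // single load of it, repeat, and look for periodic spikes in the samples.
      #include <cstdint>
      #include <cstdio>
      #include <vector>
      #include <x86intrin.h>   // _mm_clflush, _mm_mfence, __rdtscp

      int main() {
          alignas(64) static volatile uint64_t line[8] = {1};
          std::vector<uint64_t> samples(1000000);
          unsigned aux;
          for (auto &s : samples) {
              _mm_clflush((const void *)line);  // force the next load out to DRAM
              _mm_mfence();
              uint64_t t0 = __rdtscp(&aux);
              (void)line[0];                    // the timed DRAM read
              uint64_t t1 = __rdtscp(&aux);
              s = t1 - t0;                      // raw TSC ticks, not ns
          }
          for (size_t i = 0; i < samples.size(); i += 100000)
              std::printf("sample %zu: %llu ticks\n", i, (unsigned long long)samples[i]);
      }

    The volatile read keeps the compiler from dropping the load, and the clflush pushes the line back out to DRAM every iteration so each sample is a genuinely cold read.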

    The hedging technique is quite a cool demo too, but I’m not sure it’s practical.

    At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.

    I understand the premise is “data larger than cache” given the clflush, but even then you’re paying 2x in memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.
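
    Quick back-of-envelope on the numbers above (my own arithmetic, assuming reads are issued back to back at the ~70ns pace):

      // Expected average saving per read if hedging removed every refresh spike.
      #include <cstdio>

      int main() {
          double base_ns   = 70.0;     // typical read (measured above)
          double spike_ns  = 330.0;    // spiked read
          double period_ns = 15000.0;  // roughly one spike every 15us
          double reads_per_period = period_ns / base_ns;               // ~214 reads
          double avg_saving = (spike_ns - base_ns) / reads_per_period;
          std::printf("average saving per read: ~%.1f ns\n", avg_saving);  // ~1.2ns
      }

    So even if hedging hid every spike perfectly, it buys roughly a nanosecond per read on average, paid for with double the bandwidth and footprint.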

    HFT especially is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It’s just far better to work with what you can fit in cache, and to shrink what doesn’t fit as much as possible.

  • mzajc 1 hour ago
  • boznz 40 minutes ago
    Should say DRAM; SRAM does not have this.