Cell-based architecture for resilient payment systems

(americanexpress.io)

62 points | by birdculture 3 days ago

11 comments

  • physix 1 hour ago
    Nobody uses Amex for payments, so the system isn't ever under high load.

    Just kidding!

    I find the idea quite good, and have to assume that the number of payment failures they experience due to partitions/outages isn't very high and that the post-payment reconciliation and reclamation process gives them the liberty to rank availability a bit higher than correctness.

    One thing that looked a bit shaky was the interplay between the global transaction router's state of knowing which cells can handle a particular payment and the asynchronous distribution of the "failover data", which I presume it needs to know to route correctly. To me that seems to create a window where it might route to the wrong cell due to an outdated routing state.

    It also doesn't go into the HA setup of the global transaction router itself.

    But still, I kind of like the design.

    • mixdup 57 minutes ago
      >To me that seems to create a window where it might route to the wrong cell due to an outdated routing state.

      But if the router sends to the wrong cell, the cell will either send it back to be rerouted or it will fail and the router will try again (or report back the failure so upstream can try again, I assume).
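
The reroute-on-wrong-cell behavior described above can be sketched roughly as follows. This is a minimal illustration, not how Amex's router works: `Cell`, `Router`, `WrongCell`, and `lookup_owner` are hypothetical names, and the "authoritative lookup" is simulated by scanning the cells in memory.

```python
from dataclasses import dataclass, field


class WrongCell(Exception):
    """Raised by a cell asked to handle an account it does not own."""


@dataclass
class Cell:
    name: str
    accounts: set = field(default_factory=set)

    def handle(self, account: str, amount: int) -> str:
        # A cell rejects work for accounts it does not own instead of guessing.
        if account not in self.accounts:
            raise WrongCell(account)
        return f"{self.name} authorized {amount} for {account}"


@dataclass
class Router:
    cells: dict   # cell name -> Cell
    routes: dict  # account -> cell name; assumed to be possibly stale

    def lookup_owner(self, account: str):
        # Stand-in for refreshing routing state from an authoritative source.
        for cell in self.cells.values():
            if account in cell.accounts:
                return cell.name
        return None

    def route(self, account: str, amount: int, max_attempts: int = 2) -> str:
        for _ in range(max_attempts):
            cell = self.cells.get(self.routes.get(account))
            if cell is not None:
                try:
                    return cell.handle(account, amount)
                except WrongCell:
                    pass  # stale entry: fall through and refresh
            owner = self.lookup_owner(account)
            if owner is None:
                break
            self.routes[account] = owner
        # Surface the failure so the upstream caller can retry.
        raise RuntimeError(f"no cell could handle {account}")
```

With a stale entry pointing an account at the wrong cell, the first attempt is rejected, the route is refreshed, and the second attempt lands on the owning cell. The stale window costs one extra hop rather than a wrong authorization, provided cells reliably reject work they do not own.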

  • nightshift1 20 minutes ago
    All I can see is a giant single point of failure called the Global Transaction Router.
    • otterley 1 minute ago
      GLBs aren’t SPOFs. They are typically deployed around the world redundantly, often using Anycast IPs or using DNS geographic and failover records. Think AWS Global Accelerator and Route 53 as an example.
  • neerajsi 2 hours ago
    I wonder how they ensure durability. Is it possible that a cell going down would roll back a payment after it has occurred? Or do they depend on a non-cell database?
    • subtlejellyfish 1 hour ago
      I would assume nothing related to a given transaction crosses the cell boundary.

      We use a cellular architecture to help constrain the blast radius of a modular monolith. Each one of our customers lives in exactly 1 cell. Any kind of cross-customer BI/reporting happens through a data warehouse.
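
One way to pin each customer to exactly one cell is a stable hash of the customer ID. This is only a sketch under assumptions: the cell names and the `cell_for` helper are hypothetical, and the comment above does not say how the assignment is actually made.

```python
import hashlib

# Hypothetical cell names, for illustration only.
CELLS = ["cell-0", "cell-1", "cell-2"]


def cell_for(customer_id: str, cells=CELLS) -> str:
    """Map a customer to exactly one cell, deterministically."""
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and so would not be stable across restarts.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]
```

Plain modulo placement moves many customers whenever the number of cells changes, so an explicit customer-to-cell lookup table (or consistent hashing) may be the better fit if cells are added or drained often.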

  • stevefan1999 43 minutes ago
    Backing up would be hell
    • simmonmt 27 minutes ago
      Maybe? If you assume a cell can just disappear at a moment's notice, then I'm guessing you don't even try backing it up. Whatever goes into and out of the cell (request logs and results) gets backed up, and no doubt that's more complicated than a monolithic system, but it may not be so bad assuming the replay systems and global transaction router do their thing?
  • jeremycarter 2 hours ago
    As Reddit already pointed out, this is nothing novel.
    • christophilus 30 minutes ago
      “They reinvented Erlang OTP.” - Reddit
  • badlibrarian 1 hour ago
    Ah yes, the financial services company that runs a travel agency, allows me to book my hotel and rental car weeks in advance, registers a hold for incidentals for both the hotel and car when I check in, then blocks the card when I try to buy dinner that night in that same hotel due to fraud detection.

    Last week it required me to take pictures of my face from multiple angles to regain membership privileges. I suspect this may be part Palantir data collection and part Peter Thiel dating service.

  • kev009 2 hours ago
    These things are always a clusterfsck compared to the mainframe deployments.
    • vb-8448 35 minutes ago
      Ahahha so true man!

      Some CICS regions, a DB2 and a couple of VSAMs and that's it.

  • llmslave 1 hour ago
    American Express tech is some of the worst in the world among big companies. All of the value in the company is just in the branding. They put some work into the mobile app and the website, but other than that, it's a facade.
    • mcintyre1994 1 hour ago
      A few years ago someone kept signing up for loads of bank accounts/credit cards in my name, with my address. I’m not sure what the point of it was. But while everyone else happily sent cards and stacks of welcome paperwork to me, Amex were the only one that contacted me and told me they’d detected something weird in the signup. They gave me some helpful advice to resolve that situation too.
    • tracerbulletx 3 minutes ago
      What are you basing that statement on? It has not been my personal experience.
    • jmpman 1 hour ago
      Having worked at Amex and other huge banks, let me assure you that there's much worse than Amex. Amex's Fraud analytics team was good. Risk was good. Ben's team is good.
  • great_wubwub 33 minutes ago
    Makes me a little nervous that a web page about resilience is failing to connect.
  • toast0 3 days ago
    They run their payment systems on ps3??? Somebody bought into the marketing a bit much.
  • rekttrader 2 hours ago
    So you’re telling me these cells operate independently like distributed Ethereum nodes and L2s… got it.