9 comments

  • maxloh 1 hour ago
    I find SingleFile [0] a much more robust version of this.

    It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.

    They also offer a CLI powered by Puppeteer. [1]

    [0]: https://github.com/gildas-lormeau/singlefile

    [1]: https://github.com/gildas-lormeau/single-file-cli
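
    A minimal sketch of the base64 packing described above, not SingleFile's actual code. The helper names `to_data_uri` and `inline_image` are hypothetical, and the string replacement is deliberately naive (a real tool would parse the HTML):

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URI usable in src/href attributes."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_image(html: str, src: str, data: bytes, mime_type: str) -> str:
    """Swap one quoted src reference for its data: URI (naive string replace)."""
    return html.replace(f'src="{src}"', f'src="{to_data_uri(data, mime_type)}"')
```

    Base64 grows each asset by roughly a third, which is the price of getting one self-contained file.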

    • tamnd 57 minutes ago
      It seems this repo only saves one web page?

      What I'm implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.
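
      Whole-site mirroring generally comes down to following same-host links breadth-first until no new pages turn up. The sketch below illustrates that idea only and is not taken from the project. `crawl` and `same_site_links` are hypothetical names, and `fetch` is an assumed callable that returns a page's HTML (in practice it would be backed by a headless browser):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url, html):
    """Return absolute, fragment-free links that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).hostname
    found = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).hostname == host and absolute not in found:
            found.append(absolute)
    return found


def crawl(start_url, fetch):
    """Breadth-first walk over same-host pages; returns URLs in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in same_site_links(url, fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

      The `seen` set is what keeps the crawl from looping when pages link back to each other.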

      • sdevonoes 38 minutes ago
        Worst example website ever. Use another
        • sermah 13 minutes ago
          Um. Whose website are you on right now?
    • HelloUsername 26 minutes ago
      What's the difference from File -> Save As in any desktop web browser?
      • nmstoker 19 minutes ago
        That's for a single page, this handles the whole site. Also the browser Save As options often work poorly.
    • tamnd 56 minutes ago
      And thanks for the link. Let me implement this single HTML feature, it looks nice to have!
  • gregwebs 46 minutes ago
    This seems like it could put a lot of load on a site. Are there settings to control how fast it clones, or to skip images/videos? Is there a way to get only a subset of a website?
    • tamnd 42 minutes ago
      Could you help create a new issue for that? I will do it later. It is already 1:00 AM my time, but I am happy that anyone is interested in it. : )
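
      For the issue, the three requests above (crawl speed, skipping media, fetching a subset) could look roughly like this. It is a sketch under assumptions, not the tool's actual settings: `Throttle`, `should_fetch`, `SKIP_EXTENSIONS`, and the path-prefix rule are all hypothetical.

```python
import posixpath
import time
from urllib.parse import urlparse

# Hypothetical skip list: common image and video file extensions.
SKIP_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".mp4", ".webm"}


def should_fetch(url, host, path_prefix="/"):
    """Allow only URLs on `host`, under `path_prefix`, that are not media files."""
    parts = urlparse(url)
    if parts.hostname != host:
        return False
    path = parts.path or "/"
    if not path.startswith(path_prefix):
        return False
    extension = posixpath.splitext(path)[1].lower()
    return extension not in SKIP_EXTENSIONS


class Throttle:
    """Enforces a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

      Calling `throttle.wait()` before each request and checking `should_fetch(...)` before queueing a URL would cover all three asks in one place.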
  • wolttam 39 minutes ago
    One use I'd have for this is company wikis that you want to give folks easy offline access to (maybe the wiki has documentation that's useful at sites that don't have cellular coverage).

    Cool!

    It would be especially cool to have a version that didn't require the separate serving process - even though it's nifty you can package up a whole site as a single binary.

    Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?

    • tamnd 37 minutes ago
      Submitting this to Hacker News was the right call! Thanks for your idea. I will consider implementing that :)

      Also, in my mind, I already have a script/program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.

  • daviding 13 minutes ago
    Nice idea! fwiw, false positives and all, but the Windows 11 default Windows Security doesn't like it: `leakless.exe: Operation did not complete successfully because the file contains a virus or potentially unwanted software.`
  • sanqui 43 minutes ago
    Cool concept. I would like to see this combined with mitmproxy for archive grade fidelity. You could be saving exactly the data served and at the same time a representation by a modern (contemporary) browser, with all JS having run. This combination would be my perfect replacement for the WARC format.
    • tamnd 39 minutes ago
      I'm working on WARC too, using the format from Common Crawl!

      By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli

      • sanqui 34 minutes ago
        That's neat! In my opinion, the WARC format is quite tricky and underspecified, especially since HTTP/2 introduced new semantics. It encodes too much in-band and requires rewriting the server data. A mitmproxy capture is higher fidelity and supports capturing modern features such as WebSockets. I think if we could wrap Kage's crawler interactions with it and store its capture (the intercepted traffic), we could make a potentially nice new archival format.
        • tamnd 26 minutes ago
          I tried to follow well-known formats first, such as WARC and ZIM from Kiwix, so we could benefit from existing tooling support.

          For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!

          • sanqui 19 minutes ago
            I'm a fan of compatibility with established formats!

            Sounds awesome. There is a lot of untapped potential with respect to efficiently archiving and indexing websites. I saw the impressive things Marginalia Search is doing in this area (the blog is great when it gets technical). There are also a lot of very complete archives of websites out there that are not being indexed at all, and I would love to make them available for researchers. In any case, I'm interested in your project!

    • Dhavidh 38 minutes ago
      Sounds interesting.
  • lolpython 15 minutes ago
    This is cool. I could see myself using this to download the articles behind the first couple of pages of Hacker News, for viewing on a flight or a long-distance train ride with spotty internet.
  • dimiprasakis 20 minutes ago
    Neat project, I like the idea. One thing from a quick read: you launch Chrome with --no-sandbox. Is there a good reason for that? Security wise it's probably not a good idea. If there is no reason, I'd suggest leaving the sandbox on!

    In any case, cool stuff :)

  • rahimnathwani 42 minutes ago
    So this is like using wget --mirror except that it works on pages that require javascript, right?
    • tamnd 41 minutes ago
      Yeah, it is. For example, openai.com is rendered with Next.js, so I will try to mirror it tomorrow.
  • grahamstanes17 29 minutes ago
    nice