Question

Future proof HTML format for saving a web page

I usually save web pages on my computer for offline viewing, and I prefer saving them in HTML format because I want my saved files to preserve the original layout of a page and not just the text.

However, simply saving a page in HTML format is not very handy, since you get the HTML file and a separate folder with all its companion resources. It's much better to have just one file, so for some time now I have settled on the Save Page WE extension, which saves a webpage as a single HTML file.

My concern is how future proof this single HTML file format is. Is it a safe choice for saving a web page, or might it become obsolete in a couple of years, with all my saved files becoming inaccessible?

Would you say that the MHT format is perhaps more future proof?

Thank you.

(If it helps, I could upload the file of a page saved through "Save Page WE" so you can check out the particular format in detail.)

Comments
- Collapse -
Clarification Request
Do you have Word installed?

If so, save to a Word document if you prefer it over PDF, just be sure it's set to include everything on the page.

- Collapse -
I used that for a time as well.

But the doc still reached out to the web for content. So for archival work? I'm going with PDFs.

- Collapse -
That's what I use too, PDF

MHT files were great, but unless Internet Explorer continues, or someone creates an MHT reader, I think it's becoming obsolete. Does Edge read MHT files?

- Collapse -
Not a browser I use.

We don't even test sites on Edge. Go figure.

- Collapse -
Chrome and Opera do

Chrome and Opera do read MHT files natively, and I think they are unlikely to stop reading them, since it's a rather simple format.
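To illustrate why MHT counts as a simple format: an MHT (MHTML) file is an ordinary MIME multipart message, the same kind of container used for email, so it can be opened without any browser. Below is a minimal sketch using Python's standard email module. The function name list_mht_parts and the tiny SAMPLE archive are made up for illustration; a real saved page would have many more parts.

```python
import email

def list_mht_parts(raw):
    """Return (content type, Content-Location) for each part of an MHT archive.

    Hypothetical helper for illustration: it only lists the parts,
    it does not decode or render them.
    """
    message = email.message_from_bytes(raw)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the container itself, keep only the leaf parts
        parts.append((part.get_content_type(), part.get("Content-Location")))
    return parts

# A tiny hand-written archive (assumed sample data, not a real saved page).
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"; type="text/html"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Location: http://example.com/\r\n"
    b"\r\n"
    b'<html><body><img src="http://example.com/a.png"></body></html>\r\n'
    b"--BOUNDARY\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: http://example.com/a.png\r\n"
    b"\r\n"
    b"iVBORw0KGgo=\r\n"
    b"--BOUNDARY--\r\n"
)

for content_type, location in list_mht_parts(SAMPLE):
    print(content_type, location)
```

For a real file you would pass in the bytes read from disk, for example open("page.mht", "rb").read(), where page.mht is a placeholder name. Being able to list the parts shows the data stays readable; it does not guarantee that a future browser will render the page the same way.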

Post was last edited on July 7, 2018 11:37 AM PDT

- Collapse -
It may read them

But with most sites not using pure HTML and having dynamic real-time frames and more, it doesn't seem like a good solution if you need to have a copy of what the page was showing at that date. Think "legal challenges."

- Collapse -
...

Fair enough. You mean that a saved page in HTML might be displayed correctly today but not in a couple of years, right?
I think I'm OK with that; if the file can be read, the text and basic formatting will still be there.

In any case, the only way of having an exact copy of what the page is showing is by taking a screenshot, but this method has other major disadvantages.

Post was last edited on July 7, 2018 12:27 PM PDT

- Collapse -
You got it.

What if the text was sourced from the web?

- Collapse -
Sure

Sure, you can always just copy the text and paste it into a Word document (or Notepad for maximum security), but the point for me always was to save a copy that retains the basic formatting of the page as well.

Post was last edited on July 7, 2018 12:42 PM PDT

- Collapse -
Or what I suggested.

So many web sites today include content from other sites or pull information from a database. I can note this, but it's your choice to make and live with.

Here, most of the time it's about legal claims, so PDF captures it without needing an internet connection or relying on content that is dynamically sourced.

Me? Just an older programmer that has worked with developers, lawyers and more.

- Collapse -
Answer
Have to disagree. Why?

So few web pages are just HTML that this would be very iffy today. Maybe over a decade ago?

If you just need what's on screen, just print it to PDF files, since that seems to be the best bet for future archival copies.

- Collapse -
?

You mean that MHT or even ordinary HTML files might become inaccessible in the future? Isn't that a little far off?

- Collapse -
I noted why I felt this is a dead end.

Not all pages are just HTML and many include content from the web that would vanish over time.

This is why I felt it failed your criteria for archival work.

Far off? Nope. Today. It will happen today.

- Collapse -
Fair enough.

Fair enough. But even if someone has saved HTML files of pages which do include content that would vanish over time, does this mean that those files could become completely inaccessible and therefore useless?

- Collapse -
No, that's why there's an html folder

to put all that content into, so it will always appear the same. When you save it, the internal links are adjusted to access the information from that html folder.

- Collapse -
So if I understand correctly

So if I understand correctly, according to that, HTML files should indeed be future proof, right?

(As I mentioned in the first post, I always refer to single HTML files, if this makes any difference at all).

- Collapse -
Re: html

Firefox lets me choose "html only" or "complete" if I save it.

- If I choose the first one, I get one file that references a lot of scripts and other things, like pictures, on the net. Scripts and pictures are not inside an HTML file, as you might know. If those don't exist any more in the future, the page is unlikely to show or work as it does now.

- If I choose the second one, I get a file and a folder. The folder contains all external data. I didn't check how it handles iframes (a rather special kind of external data that shows a webpage inside a webpage; kind of old fashioned, but still used here and there).
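One rough way to see which case a saved file falls into is to scan it for resources that still point at the network. The sketch below uses Python's standard html.parser module. The names ExternalRefFinder and find_external_refs and the sample markup are made up for illustration. It only looks at src and href attributes, so it will miss things like CSS url(...) references, srcset attributes, and anything a script fetches at run time.

```python
from html.parser import HTMLParser

class ExternalRefFinder(HTMLParser):
    """Collect tags whose src/href still points at a remote address."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            return  # ordinary hyperlinks are navigation, not page resources
        for name, value in attrs:
            if name in ("src", "href") and value:
                if value.startswith(("http://", "https://", "//")):
                    self.external.append((tag, value))

def find_external_refs(html_text):
    """Return a list of (tag, url) pairs for resources loaded from the network."""
    finder = ExternalRefFinder()
    finder.feed(html_text)
    return finder.external

# Assumed sample markup: one local file, one embedded image, two remote resources.
SAMPLE = """<html><head>
<link rel="stylesheet" href="page_files/style.css">
<script src="https://cdn.example.com/lib.js"></script>
</head><body>
<img src="data:image/png;base64,iVBORw0KGgo=">
<img src="https://example.com/photo.jpg">
<a href="https://example.com/next">next</a>
</body></html>"""

for tag, url in find_external_refs(SAMPLE):
    print(tag, url)
```

An empty result suggests the file carries its own images, styles and scripts; anything listed is something that could vanish later.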

With a PDF you get a kind of snapshot that you can view now and in the future.

- Collapse -
Single HTML file
- Collapse -
Re: html

That's not standard HTML, even if Firefox shows it correctly without that add-on, so I wouldn't call it an HTML file.
Extensions come and go. Better check how Chrome, Internet Explorer and Edge show it. Also try it on another PC, after copying only that file to a USB stick.

- Collapse -
Chrome reads them natively and so do Firefox and Opera.

Chrome reads them natively and so do Firefox and Opera. Will soon check if they also work on a Mac with Safari.

As for the extensions, they sure come and go, but here the particular extension is only needed in order to save a page, not to access it.

- Collapse -
Mac

Just checked and the single HTML files can also be accessed on a Mac with Safari. Only some images don't show up.

- Collapse -
Stuff that can't be encapsulated or captured?

So while the "complete" option does that, if your web page of interest calls out to the web to get something from a database, which is very common today, your saved HTML and scripts will not function in the future, and may show different results over time.

This is why, for archival and legal work, I'm back to PDF.

Post was last edited on July 8, 2018 6:54 AM PDT

- Collapse -
I agree.

Printing the page to pdf is fast and easy, and it results in one file per webpage, whatever the length, so the result is quite manageable.
For example, this thread - as it is now - prints as a 5 page pdf in Firefox.

Post was last edited on July 7, 2018 1:23 PM PDT

- Collapse -
What method do you use

What method do you use in order to save a page to PDF?

- Collapse -
Re: pdf

I didn't say "save to pdf". I said "print to pdf". That's done by choosing File>Print in the browser, and then choosing a pdf-printer. Windows 10 has a native one built in. In the past I used cutepdf or pdf995 (both free downloads); but I see that foxit and soda pdf-readers install their own. Choose what you like most.
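For anyone who wants to do this for many pages without clicking through File>Print each time, Chrome and Chromium also have a headless mode with a --print-to-pdf option. This is a different route from the PDF printers mentioned above, and the sketch below is only an outline under assumptions: the helper names are made up, and the path to the browser executable differs per system, so it is passed in rather than guessed.

```python
import subprocess

def build_print_to_pdf_command(browser_path, url, output_pdf):
    """Build the argument list for a headless Chrome/Chromium print-to-PDF run.

    browser_path is an assumption supplied by the caller, e.g. the full
    path to chrome.exe or the name of a chromium binary on the PATH.
    """
    return [browser_path, "--headless", f"--print-to-pdf={output_pdf}", url]

def print_to_pdf(browser_path, url, output_pdf):
    """Run the browser once and raise an error if it exits unsuccessfully."""
    command = build_print_to_pdf_command(browser_path, url, output_pdf)
    subprocess.run(command, check=True)
```

Calling print_to_pdf("chrome", "https://example.com/", "page.pdf") would write page.pdf in the current folder, assuming a browser named chrome can be found. Pages that need a login or load content only after scrolling may come out differently than they do with File>Print in a normal browser window.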

CNET Forums