Which of the 3 standard compression algorithms on Unix (gz, xz, or bz2) is best for long term data archival at their highest compression?
I have a lot of tar and disk image backups, as well as raw photos, that I want to squeeze onto a hard drive for long term offline archival, but I want to make the most of the drive's capacity, so I want to compress them at the highest ratio supported by standard tools. I've zeroed out the free space in my disk images so I can save the entire image while only having it take up as much space as there are actual files on them, and raw images in my experience can have their size reduced by a third or even half with max compression (and I would assume it's lossless, since file level compression can regenerate the original file in its entirety?).
I've heard horror stories of compressed files being made completely unextractable by a single corrupted bit, but I don't know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best chance of recovery if something does go wrong.
I also want the files to be extractable with just the Linux/Unix standard binutils since this is my disaster recovery plan and I want to be able to work with it through a Linux live image without installing any extra packages when my server dies, hence I'm only looking at gz, xz, or bz2.
So out of the three, which is generally considered more stable and corruption resistant when the compression ratio is turned all the way up? Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting? Additionally, should I be generating separate checksum files for the original data or do the compressed formats include checksumming themselves?
US ready to completely replace Russian gas and oil product supplies to Europe - Energy Minister
US Energy Secretary Christopher Wright stated that America is ready to replace all Russian gas and oil products flowing to Europe. He spent six days in Europe, assuring leaders of the US's readiness to meet their needs. Pavlo Bashynskyi (UNN)
Their bridges with Russia can be rebuilt.
But even if that's a bridge too far, they could embrace Chinese solar and batteries and electrification. A large up-front investment for the ability to generate their own power and cut fossil fuel dependency. China would happily work with them even if they still refuse rapprochement with Russia.
Instead they're just idly waiting for the US to complete their colonization.
Are you saying that Russia is not occupying parts of Ukraine? That the war would not end if Russia retracted their armies and left Ukraine alone?
Honestly, I don't even know why I'm trying to reason with a troll...
No, I'm saying that history didn't start in February 2022. Here's a perspective from an actual adult with a fully developed brain.
It's adorable that you consider what you're doing here to be reasoning.
- YouTube
Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world. www.youtube.com
Thank you for this tip, I'll try to watch it when I get the time.
So you agree that Russia can end this war by pulling back their armies from Ukraine territory. That would be a first step for Russia to start mending the bridges that they burned with Europe.
So you agree that Russia can end this war by pulling back their armies from Ukraine territory.
Ukraine could end the war by accepting Russia's demands. Or the US and Europe could stop arming and funding Ukraine, that would also end the war. That would be a first step to mending the bridges they burned with Russia by starting a proxy war.
Why would Russia surrender when they are winning? Makes no sense. Losers don't get to dictate terms to the winner in a war.
The terms for peace are well known. They had already been agreed to by both sides during the Istanbul negotiations in 2022 before the British and the Americans went to tell their proxy to abandon negotiations and fight to the last Ukrainian instead.
Veo 3 | Transform Your Ideas Into AI-Generated Videos
Create stunning, high-quality videos from simple text prompts or images in minutes with Veo 3's advanced AI video generation platform. veo-3.net
Tobacco companies aid vape shops in push to repeal Denver flavored tobacco ban, outraising law’s backers
A campaign group seeking to overturn Denver’s ban on flavored tobacco sales in the November election has far outraised supporters of the prohibition, campaign finance records show.
The opponents of the ban, a coalition of Denver vape store owners organized as a group called “Citizen Power!,” raised $410,000 through the end of August, according to campaign finance reports filed this month. The campaign group supporting the ban, “Denver Kids vs. Big Tobacco,” raised about $245,000.
In December, the Denver City Council near-unanimously approved a ban on sales of most flavored tobacco and nicotine products after public health and children’s advocates argued the products could lure young people into a life of addiction.
The council approved the ban, which applies to any sales within city limits, despite heavy lobbying from tobacco companies and vape stores. Mayor Mike Johnston signed it.
The president of the campaign group opposing the ban said that instead of creating a new prohibition for adults, Denver should better enforce its earlier ban on sales of the products to children. Elliott Wenzler (The Denver Post)
Echoes of Gaza: A Soundscape of Resistance and Memory
"Echoes of Gaza" is a protest song but also a piece of literature. The lyrics are metaphorically charged:
"History writes with a broken hand" summarizes distortion of accounts.
"Phantom voice, a dream you shed" symbolizes lost humanity.
"Ghosts in the wars you weave" judges global complicity in silence.
Explore "Echoes of Gaza," a unique track by Ali Taha Alnobani that uses AI to create a soundscape of resistance and memory. www.thebirdali.com
Nearly 2 million evacuated as deadly Typhoon Ragasa slams into southern China, after killing at least 17 in Taiwan
Nearly 2 million people in southern China were evacuated as a powerful typhoon hurtled into one of the world’s most densely populated coasts, having already unleashed deadly flooding in Taiwan.
Typhoon Ragasa, which a few days ago was the strongest storm on earth so far this year, brought finance hub Hong Kong and swathes of southern China to a standstill on Wednesday, after barreling through remote islands in the Philippines and mountainous regions of Taiwan.
https://edition.cnn.com/2025/09/23/asia/typhoon-ragasa-hong-kong-southern-china-impact-intl-hnk
Lawmakers and activists call for action after AP reveals US tech role in China's surveillance state
Lawmakers and activists across the political spectrum called on American tech firms to stop selling surveillance equipment to Chinese police and for Congress to examine the issue after The Associated Press reported that U.S. technology had played a far greater role than previously known in enabling human rights abuses by Beijing.
Republican Sen. Josh Hawley of Missouri told AP he wanted to summon tech companies before Congress to address how their technology exports were used. Hawley, a longtime critic of U.S. technology companies, bemoaned Silicon Valley’s general lack of cooperation with Congress on that and similar inquiries.
“I think eventually we’re going to have to subpoena these people,” Hawley said.
In a post on the social media site X this month, Hawley vowed that “Big Tech must cut ties with the CCP - or face my committee,” referring to the ruling Chinese Communist Party. Hawley sits on several Senate panels that might have jurisdiction to examine technology issues.
An AP investigation published this month revealed that U.S. technology companies to a large degree designed and built China’s surveillance state. Firms including IBM, Dell, and Cisco sold billions in technology to Chinese police and government agencies, despite repeated warnings that such tools were being used to quash dissent, persecute religious sects and target minorities. Companies named in AP’s reporting said they complied with all export control laws.
Yang Caiying, who told AP for its investigation about how her family was targeted by Chinese surveillance using American technology because of their activism in rural Jiangsu, said she was “shocked by the pivotal role that major U.S. tech companies have played” in her family’s ordeal. Yang is now collecting signatures for petitions urging Washington to bar U.S. firms from selling to Chinese police, both online and on the street.
Other lawmakers from both parties urged Congress to beef up export laws to prevent more American technology from being used to fuel human rights abuses abroad.
“China has been utilizing partnerships with U.S. tech companies to build malignant ‘smart cities’ that are used for mass surveillance and human rights abuses against millions of innocent Chinese people,” said Rep. John Moolenaar, a Michigan Republican who chairs the House Select Committee on the Chinese Communist Party. The panel is charged with examining the strategic global competition between the U.S. and China.
“As executives at Nvidia and other American tech companies chase business in China, they cannot deny that their technology will be used to commit atrocities, strengthen China, and weaken America,” Moolenaar said.
Moolenaar called for American companies to work with Congress to write new laws that restrict the export of technologies that enable oppression, and to work harder to keep their products from being smuggled into China.
Footage of deadly ICE shooting in Chicago suburb challenges official narrative
Police records and witness accounts from a Chicago suburb where a man was fatally shot by a federal immigration enforcement agent earlier this month complicate the picture of the event presented by the U.S. Department of Homeland Security, which said the agent fired his weapon after the man drove his vehicle toward agents.
Silverio Villegas Gonzalez, 38, was pulled over and eventually shot by a U.S. Immigration and Customs Enforcement agent in Franklin Park, Illinois on September 12, just after dropping off his two children at Passow Elementary School and Small World Learning Center, a daycare located blocks away from the incident.
Bodycam footage, which Reuters obtained on Tuesday, captures an interview with the truck driver, named in police records as Josue Hernandez-Rodriguez.
“He was trying to escape from them,” Hernandez-Rodriguez said.
In multiple statements, DHS has said the agent, who has not been identified, responded with lethal force because he was "fearing for his own life." But in bodycam footage, the agent, in a bullet-resistant police vest and torn jeans, described his injuries as “nothing major.”
The Fundraising-Industrial Complex Is Eating American Politics
In 2004, political campaigns spent 9 cents of every dollar raised on fundraising operations. By 2024, that number had reached 30 cents. American political campaigns are raising more and more money less and less efficiently. I’ve analyzed data from FEC disbursement records, using an algorithm I developed to classify expenditures by spending category. It reveals that campaigns are now spending 38 cents of every dollar raised just to raise more money—a fourfold increase from the 9 cents spent in 2004. In raw terms, campaigns burned through $3 billion on fundraising operations in 2024 alone.
This represents a fundamental shift in how political money flows through our democracy. Twenty years ago, fundraising operations were a necessary but modest expense, like renting office space or printing yard signs. Today, it has metastasized into the primary activity of most campaigns. In 2022, 31% of total expenditures were for fundraising expenses. This came close to exceeding the 33% of total expenditures going towards advertising. If current trends hold in 2026, it’s likely that fundraising costs will for the first time exceed what is spent on advertising, thus becoming the biggest spending category.
New data reveals campaigns burn about a third of donations raised just asking for more donations. Adam Bonica (On Data and Democracy)
Two dead in ‘horrific’ Ukrainian strike on southern Russian city – officials (VIDEOS)
At least two people have been killed and seven injured in a drone attack on Novorossiysk, local authorities have said. RT
Portland threatens to evict Ice from Oregon facility over permit violations
The city office that oversees land use and zoning notified the owner of the building that Ice leases on 18 September that the federal agency had violated a conditional use permit approved in 2011. The permit limits the number of detainees Ice can hold at the facility each day to fewer than 15, and the duration for which they can be held, to less than 12 hours. The permit also bars the agency from “housing” anyone overnight.
But Ice data the city obtained via a Freedom of Information Act request, included in the official notice, shows 25 instances since January in which Ice held a person for more than 12 hours. On 26 January alone, agents held 16 people – listed as citizens of Venezuela, El Salvador, Honduras, Mexico and other countries – for over 27 hours before transferring them, according to public records obtained by Street Roots and the Guardian.
The city’s notice also said the building was illegally altered when exterior windows were boarded up without proper approval. An Ice spokesperson did not respond to a question asking when the wood was installed, but photos and video taken at protests and posted on social media show the boarded up windows first appeared around 16 June.
Ice is illegally detaining immigrants in a Portland field office, mayor says
Records obtained by Street Roots and the Guardian back claim from Oregon city’s mayor that Ice violated a land use permit to detain people overnight. Guardian staff reporter (The Guardian)
FACTBOX: What is known about liberation of Kirovsk in DPR
According to the ministry’s information, Russian forces are currently mopping up Kirovsk in the direction of Krasny Liman, clearing it of the remaining forces of the Ukrainian army’s 63rd mechanized brigade. TASS
Georgia’s Medicaid Work Requirement Program Spent Twice as Much on Administrative Costs as on Health Care, GAO Says
Most of the tax dollars used to launch and implement the nation’s only Medicaid work requirement program have gone toward paying administrative costs rather than covering health care for Georgians, according to a new report by the Government Accountability Office, the nonpartisan agency that monitors federal programs and spending.
The government report examined administrative expenses for Georgia Pathways to Coverage, the state’s experiment with work requirements. It follows previous reporting by The Current and ProPublica showing that the program has cost federal and state taxpayers more than $86.9 million while enrolling a tiny fraction of those eligible for free health care.
The GAO analysis, which does not include all the Pathways administrative expenses detailed by the news outlets, shows that as of April the Georgia program had spent $54.2 million on administrative costs since 2021, compared to $26.1 million spent on health care costs. Nearly 90% of administrative expenditures came from the federal budget, the report concluded, meaning that Georgia’s experiment is being funded by taxpayers around the country. Federal spending will likely increase given that the Centers for Medicare and Medicaid Services has approved $6 million more in administrative costs not reflected in this report because it was published before the state submitted invoices.
Republican lawmakers cite Georgia’s Pathways to Coverage as a national model for federal Medicaid work requirements that are set to take effect in 2027. A new report shows the program has spent at least $54 million on administrative costs alone. ProPublica
Extremists are using Discord to radicalize American youth, officials warned this year
Discord used by extremists to recruit US youth, officials warned
Law enforcement agencies warned earlier this year that young people were being radicalized in Discord servers, according to documents obtained by NBC News. Kevin Collier (NBC News)
Video for 1st Amendment Win: Jimmy Kimmel is Back!
He cries a lot for Kirk, I think he might actually be a good guy? I'm not sure I would have, but he's been under a huge amount of stress the last week. Idk
- YouTube
Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world. www.youtube.com
AI coding hype overblown, Bain shrugs
Tried by two-thirds of firms, ignored by most devs, and productivity barely moved. Dan Robinson (The Register)
Seasonal retail hiring to fall to lowest level since 2009, signaling trouble for holidays, report says
Seasonal hiring, an indicator of how strong or weak the holiday shopping season is expected to be, is projected to fall to the lowest level since the 2009 recession. Gabrielle Fonrouge (CNBC)
Perchance Ai Chat won’t work
Youth Is No Substitute for Politics
Italy - Poland: 2025 World Championship semifinal, schedule, head-to-head record and live TV coverage
Italy and Poland face each other in the semifinal of the 2025 Men's Volleyball World Championship. Here are the schedule, the times, and where to watch it on TV and streaming.
Italvolley once again meets its historic rival, Poland, in the semifinal of the 2025 Men's Volleyball World Championship. At stake are a place in the final in Pasay City (Philippines) and the chance to defend the world title won three years ago.
READ THE ARTICLE: Italy - Poland: 2025 World Championship semifinal, schedule, head-to-head record and live TV coverage
NBA on Prime Video: 11-year global deal. The streaming season starts on October 25
The NBA officially lands on Prime Video. Starting in October 2025, the world's most important basketball league joins the sports lineup of Amazon's streaming giant, at no extra cost for Prime subscribers. The 11-year deal is one of the largest global sports-rights operations and promises to revolutionize the viewing experience for Italian fans, thanks to dual commentary in Italian and English.
DETAILS OF THE DEAL: NBA on Prime Video: 11-year global deal. The streaming season starts on October 25
NBA on Prime Video: 11-year deal for live games
Prime Video will broadcast the NBA for 11 years. Exclusive games from October 25: Regular Season, Playoffs and Finals. Find out times and details. Redazione (Atom Heart Magazine)
Memori - Memory Engine for AI
Open-Source Memory Engine for LLMs, AI Agents & Multi-Agent Systems. Enhance your AI applications with intelligent memory capabilities. memori.gibsonai.com
Italy sends navy ship to help Gaza aid flotilla after drone attack
ATHENS/ROME - An international aid flotilla trying to deliver aid to Gaza said on Wednesday it was attacked overnight by drones in international waters off Greece, prompting Italy to send a navy ship to come to its assistance. ST
Japan city passes ordinance to cap smartphone use at 2 hours per day
The assembly of a central Japan city on Monday passed an ordinance that recommends all residents limit their use of smartphones, video game consoles and other digital devices to two hours a day outside of work and school, though there will be no pena… KYODO NEWS (Japan Wire by KYODO NEWS)
Donald Trump is failing to stop China’s rise as a manufacturing superpower
It's not already a manufacturing superpower? Think that boat mighta sailed already...
Jesse Watters Makes Fox Co-Hosts Cringe With Revenge Plan Against U.N.
Now it's the time to call the FCC and demand that Brendan Carr make Watters and Fox learn the hard way and pull their broadcast license which is not a threat.
Fox’s Jesse Watters has a deranged idea for how to respond to the stopped escalator during Trump’s visit to the United Nations. The New Republic
SGH
in reply to HiddenLayer555
Honestly, given that they should be purely compressing data, I would suppose that none of the formats you mentioned has ECC recovery or built-in checksums (but I might be very mistaken on this). I think I've only seen this in WinRAR, but you could also try other GUI tools like 7zip and check their features for anything that looks like what you need; if the formats support ECC, then surely 7zip will offer you this option.
I just wanted to point out, no matter what someone else might say: if you split your data across multiple compressed files, the chances of bit rot destroying your entire library are much lower, i.e. try to make it so that only small chunks of your data are lost in case something catastrophic happens.
However, if one of your filesystem-relevant bits rots, you may be in for a much longer recovery session.
tromboneflatsteel
in reply to HiddenLayer555
~~Error correction and compression are usually at odds.~~
Error correction usually relies on redundant data to identify what was corrupted, and it helps if the error-correction process is run more frequently. So storing the drive away offline works against correction, and the added redundancy will reduce the space gains. You can look into different error correction software or techniques, e.g. RAID. I recommend following the 3-2-1 data backup rule. Even if you can't do all the steps, doing the ones you can helps.
Side note: optionally, investigate which storage brand/medium/grade you want. Some are more resistant than others for long term vs. short term storage. Also, even unused storage will degrade over time, whether through the physical components wearing, the magnetic charge weakening, or the electric charge representing your data leaking. So again, offline all the time isn't the best; run it a couple of times a year, if not more, to ensure errors don't accumulate.
Sadly I won't give specifics because I haven't tried your use case and I am not familiar, but hopefully the keywords help.
waigl
in reply to tromboneflatsteel
Not really. If your data compresses well, you can easily compress it by 60-70%, then add Reed-Solomon forward error correction blocks at like 20% redundancy, and you'd still be up overall.
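To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 100 GB starting size is purely an assumed figure for illustration, and `archived_size` is a made-up helper name; the 60-70% saving and 20% redundancy come from the comment above.

```python
def archived_size(original_gb: float, compression_saving: float, fec_redundancy: float) -> float:
    """Size on disk after compressing and then adding parity on top.

    compression_saving: fraction removed by compression (0.60 means 60% smaller).
    fec_redundancy: parity added as a fraction of the compressed size.
    """
    compressed = original_gb * (1 - compression_saving)
    return compressed * (1 + fec_redundancy)


ORIGINAL_GB = 100.0  # assumed example size, not from the thread

for saving in (0.60, 0.70):
    total = archived_size(ORIGINAL_GB, saving, 0.20)
    print(f"{saving:.0%} saving + 20% parity: {total:.0f} GB stored for {ORIGINAL_GB:.0f} GB of data")
```

Under those assumptions the archive plus parity still takes well under half of the original space.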
anotherspinelessdem
in reply to HiddenLayer555
Olap
in reply to HiddenLayer555
YaBoyMax
in reply to HiddenLayer555
AFAIK none of those formats include any mechanism for error correction. You'd likely need to use a separate program like zfec to generate the extra parity data. Bzip2 and Zstandard are somewhat resistant to errors since they encode in blocks, but in the event of bit rot the entire affected block may still be unrecoverable.
Alternatively, if you're especially concerned with robustness then it may be more advisable to simply maintain multiple copies across different drives or even to create an off-site backup. Parity bits are helpful but they won't do you much good if your hard drive crashes or your house catches fire.
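On the detection half of the question: none of the three repairs anything, but each carries its own integrity check. gzip stores a CRC-32 of the uncompressed data, bzip2 stores a CRC-32 per block, and xz stores a per-block check that defaults to CRC-64. A minimal sketch using Python's standard `gzip`, `bz2` and `lzma` modules, which write the same formats as the command-line tools, shows a single flipped bit being caught. The sample payload and the helper names are illustrative only, and the modules' default compression levels are used since the level does not change how the check works.

```python
import bz2
import gzip
import lzma


def flip_bit(blob: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of blob with one bit inverted."""
    damaged = bytearray(blob)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


def detects_corruption(decompress, blob: bytes) -> bool:
    """True if decompressing blob fails instead of silently returning data."""
    try:
        decompress(blob)
    except Exception:  # the three modules raise different error types
        return True
    return False


# Illustrative payload: moderately compressible text.
payload = b"".join(f"record {i}: value {i * i % 9973}\n".encode() for i in range(5000))

codecs = {
    "gz": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(payload)
    assert decompress(packed) == payload  # the clean round trip works
    damaged = flip_bit(packed, len(packed) // 2)  # one bit, mid-stream
    results[name] = detects_corruption(decompress, damaged)
    print(f"{name}: {len(packed)} bytes compressed, corruption detected: {results[name]}")
```

A check value only tells you that something is wrong; a flip in a header field the check does not cover can also slip through, which is one reason to keep a separate hash of the whole file as well.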
blackbrook
in reply to YaBoyMax
Ŝan
in reply to HiddenLayer555
YaBoyMax
in reply to Ŝan
Ironfist79
in reply to YaBoyMax
wewbull
in reply to YaBoyMax
TrickDacy
in reply to YaBoyMax
Lemmchen
in reply to Ŝan
Ŝan
in reply to Lemmchen
Built-in compression for tar was added by GNU; Solaris didn't get it until later, and IIRC it supported only gz and bzip2, not xz. AIX didn't get bzip2 until 2008(-ish?).
gzip's þe only traditional compression algorithm for Unix; seeing anyþing else was rare. Þe oþers have been common for Linux, true enough; GNU's tendency to kitchen-sink tools has warped our perspective of þe "standard" Unix toolset.
GenderNeutralBro
in reply to HiddenLayer555
Generally speaking, xz provides higher compression.
None of these are well optimized for images. Depending on your image format, you might be better off leaving those files alone or converting them to a more modern format like JPEG-XL. Supposedly JPEG-XL can further compress JPEG files with no additional loss of quality, and it also has an efficient lossless mode.
As far as I know, no common compression algorithms feature built-in error correction, nor does tar. This is something you can do with external tools, instead.
For validation, you can save a hash of the compressed output. md5 is a bad hashing algorithm but it's still generally fine (and widely used) for this purpose. SHA256 is much more robust if you are worried about dedicated malicious forgery, and not just random corruption.
Usually, you'd just put hash files alongside your archive files with appropriate names, so you can manually check them later. Note that this will not provide you with information about which parts of the archive are corrupt, only that it is corrupt.
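As a rough sketch of that sidecar approach, the following Python uses only the standard library. The function names and the `.sha256` suffix are choices made for this example, not a fixed convention; the line written to the sidecar follows the `<hash>  <filename>` layout that `sha256sum -c` reads, so the same file should also be checkable from a live image without Python.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(archive: Path) -> Path:
    """Write '<archive>.sha256' next to the archive and return its path."""
    sidecar = archive.with_name(archive.name + ".sha256")
    sidecar.write_text(f"{sha256_of(archive)}  {archive.name}\n")
    return sidecar


def verify_sidecar(archive: Path) -> bool:
    """Re-hash the archive and compare against the stored value."""
    sidecar = archive.with_name(archive.name + ".sha256")
    expected = sidecar.read_text().split()[0]
    return sha256_of(archive) == expected


if __name__ == "__main__":
    # Demo on a throwaway file; the file name is hypothetical.
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "photos.tar.xz"
        archive.write_bytes(b"stand-in for real archive bytes")
        write_sidecar(archive)
        print("intact:", verify_sidecar(archive))
```

Hashing the compressed file rather than the original data means verification does not require decompressing anything first.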
For error correction, consider par2. Same idea: you give it a file, and it creates a secondary file that can be used alongside the original for error correction later.
That is a key advantage of this method. Adding a hash file or par file does not change the basic archive, so you don't need any special tools to work with it.
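To illustrate the general idea of a separate recovery file, here is a deliberately simplified Python sketch. It is not how par2 works internally: par2 uses Reed-Solomon codes and can repair damage in several places, while this toy uses a single XOR parity block and can rebuild only one lost chunk whose position is already known. All names and the sample bytes are made up for the example.

```python
def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equally sized blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def split_padded(data: bytes, count: int) -> list[bytes]:
    """Split data into `count` equal chunks, zero-padding the last one."""
    size = -(-len(data) // count)  # ceiling division
    return [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(count)]


data = b"compressed archive bytes would go here"
chunks = split_padded(data, 4)
parity = xor_blocks(chunks)  # this is the extra "recovery" data

# Pretend chunk 2 became unreadable; rebuild it from the others plus parity.
lost = 2
survivors = [chunk for i, chunk in enumerate(chunks) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
print("recovered:", rebuilt == chunks[lost])
```

The archive itself is untouched in this scheme, as with par2: the parity lives beside it and is only needed when something has gone wrong.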
You should also consider your file system and media. Some file systems offer built-in error correction. And some media types are less susceptible to corruption than others, either due to physical durability or to baked-in error correction.
just_another_person
in reply to HiddenLayer555
Compression formats are just as susceptible to bitrot as any other file. The filesystem is where you want to start if you're discussing archival purposes. All of the modern filesystems will support error correction, so using BTRFS or ZFS with proper configuration is what you're looking for to prevent files from getting corrupted.
That being said, if you store something on a medium and then don't use said medium (lock it in a safe or whatever), then the chance you'll end up with corrupted files approaches 0%. Bitrot and general file corruption happen as the bits on a disk are shifted around, so by not using that disk, the likelihood this will happen is nearly 0.
TerHu
in reply to just_another_person
now this is just my guess, but i’d think that zfs with frequent automatic checks and such will keep your data safer than an unplugged hdd
Blue_Morpho
in reply to just_another_person
Bitrot happens even when sitting around. Magnetic domains flip. SSD cells leak electrons.
Reading and rewriting with an ECC system is the only way to prevent bit rot. It's particularly critical for SSDs.
bacon_pdp
in reply to HiddenLayer555
You forgot lzip
nongnu.org/lzip/lzip.html
Which is the best for that use case
Lzip - LZMA lossless data compressor
www.nongnu.org
MangoPenguin
in reply to HiddenLayer555
DasFaultier
in reply to MangoPenguin
DasFaultier
in reply to HiddenLayer555
You're asking the right questions, and there have been some great answers on here already.
I work at the crossover between IT and digital preservation in a large GLAM institution, so I'd like to offer my perspective. Sorry if there are any peculiarities in my comment, English is my 2nd language.
First of all (and as you've correctly realized), compression is an antipattern in DigiPres and adds risk that you should only accept if you know what you're doing. Some formats do offer integrity information (MKV/FFV1 for video comes to mind, or the BagIt archival information package structure), including formats that use lossless compression, and these should be preferred.
You might want to check this to find a suitable format here: en.wikipedia.org/wiki/List_of_… -> Containers and compression
Depending on your file formats, it might not even be beneficial to use a compressed container, e.g. if you're archiving photos/videos that already exist in compressed formats (JPEG/JFIF, h.264, ...).
You can make your data more resilient by choosing appropriate formats not only for the compressed container but also for the payload itself. Find significant properties of your data and pick formats accordingly, not the other way round. Convert before archival if necessary (the term is normalization).
You might also want to consider reducing the risk of losing the entirety of your archive by compressing each file individually. Bit rot is a real threat, and you probably want to limit the impact of flipped bits. Error rates for spinning HDDs are well studied and understood, and even relatively small archives tend to be within the size range for bit flips. I can't seem to find the sources just now, but iirc, it was something like 1 bit in 1.5TB for disks at write time.
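A minimal sketch of per-file compression, assuming Python's standard `lzma` module (which writes ordinary `.xz` files that the `xz` tool can read). The function name and directory layout are hypothetical, the default preset is used, and the destination is assumed to lie outside the source tree.

```python
import lzma
import shutil
from pathlib import Path


def compress_each(source_dir: Path, dest_dir: Path) -> list[Path]:
    """Write one .xz file per input file, mirroring the directory layout.

    A damaged output file then costs only that one file, not the archive.
    dest_dir is assumed not to be inside source_dir.
    """
    written = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        dst = dest_dir / src.relative_to(source_dir)
        dst = dst.with_name(dst.name + ".xz")
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Stream the copy so large files are never held in memory at once.
        with src.open("rb") as fin, lzma.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(dst)
    return written
```

The trade-off is a somewhat worse overall ratio than one big solid archive, since redundancy shared between files can no longer be exploited.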
Also, there's only so much you can do against bit rot on the format side, so consider using a filesystem that allows you to run regular scrubs, and then actually run them; ZFS or Btrfs come to mind. If you use a more "traditional" filesystem like ext4, you could at least add checksum files for all of your archival data that you can then use as a baseline for more manual checks, but these won't help you repair damaged payload files. You can also create BagIt bags for your archive contents, because bags come with fixity mechanisms included. See RFC 8493 (datatracker.ietf.org/doc/html/…). There are even libraries and software that help you verify the integrity of bags, so that may be helpful.
The disk hardware itself is a risk as well; having your disk lying around for prolonged periods of time might have an adverse effect on bearings etc. You don't have to keep it running every day, but regular scrubs might help to detect early signs of hardware degradation. Enable SMART if possible. Don't save on disk quality. If at all possible, purchase two disks (different make & model) to store the information.
DigiPres is first and foremost a game of risk reduction and an organizational process, even if we tend to prioritize the technical aspects of it. Keep that in mind at all times.
And finally, I want to leave you with some reading material on DigiPres and personal archiving in general.
* langzeitarchivierung.de/Webs/n… (in German)
* meindigitalesarchiv.de/ (in German)
* digitalpreservation.gov/person… (by the Library of Congress, who are extremely competent in DigiPres)
I've probably forgotten a few things (it's late...), but if you have any further questions, feel free to ask.
EDIT: I answered to a similar thread a few months ago, see sh.itjust.works/comment/139223…
RFC 8493: The BagIt File Packaging Format (V1.0)
IETF Datatracker
RiverRabbits
in reply to DasFaultier
The perspective on how your field approaches the topic is especially valuable for laypeople trying to get a feel for the important aspects!
(and I figure, given the username, that you speak German haha)
DasFaultier
in reply to RiverRabbits
I'll stick with English anyway so that it's understood in the English thread.
ENGLISH:
Yeah, you're right, I wasn't particularly on-topic there. 😁 I tried to address your underlying assumptions as well as the actual file format question, and it kinda derailed from there.
Sooo, file format... I think you're restricting yourself too much if you just use the formats that are included in binutils. Also, you have conflicting goals there: it's compression (make the most of your storage) vs. resilience (have a format that is stable in the long term). Someone here recommended lzip, which is definitely a right answer for good compression ratio. The Wikipedia article I linked features a table that compares compressed archive formats, so that might be a good starting point to find resilient formats. Look out for formats with at least Integrity Check and possibly Recovery Record, as these seem to be more important than compression ratio. When you have settled on a format, run some tests to find the best compression algorithm for your material. You might also want to measure throughput/time while you're at it to find variants that offer a reasonable compromise between compression and performance. If you're so inclined, try to read a few format specs to find suitable candidates.
You're generally looking for formats that:
* are in widespread use
* are specified/standardized publicly
* are of a low complexity
* don't have features like DRM/Encryption/anti-copy
* are self-documenting
* are robust
* don't have external dependencies (e.g. for other file formats)
* are free of any restrictive licensing/patents
* can be validated.
You might want to read up on more technical infos on how an actual archive handles these challenges at slubarchiv.slub-dresden.de/tec… and the PDF files with specifications linked there (all in German).
Technical standards for the delivery of digital documents
slubarchiv.slub-dresden.de
Ferk
in reply to DasFaultier
Just note that @RiverRabbits@lemmy.blahaj.zone wasn't the one who opened the thread, that's why they said they didn't ask the question (I get the feeling there might have been some confusion here 😛 ).
Still, very informative comment.
RiverRabbits
in reply to Ferk
But the way my German is phrased here, and how the replier interpreted it, would read as super passive aggressive (think "I didn't ask that question but thanks"), and for that I apologize 😭 I just meant I'm not the OP 😌
DasFaultier
in reply to Ferk
borZ0 the t1r3D b3aR
in reply to DasFaultier
IanTwenty
in reply to HiddenLayer555
TrickDacy
in reply to HiddenLayer555