Disappearing repositories

**Frogfall** · Oct 15th 2023

Disappearing repositories -- taking an infrastructure perspective on the long-term availability of research data

Quote

Abstract
Currently, there is limited research investigating the phenomenon of research data repositories being shut down, and the impact this has on the long-term availability of data. This paper takes an infrastructure perspective on the preservation of research data by using a registry to identify 191 research data repositories that have been closed and presenting information on the shutdown process. The results show that 6.2 % of research data repositories indexed in the registry were shut down. The risks resulting in repository shutdown are varied. The median age of a repository when shutting down is 12 years. Strategies to prevent data loss at the infrastructure level are pursued to varying extent. 44 % of the repositories in the sample migrated data to another repository, and 12 % maintain limited access to their data collection. However, both strategies are not permanent solutions. Finally, the general lack of information on repository shutdown events as well as the effect on the findability of data and the permanence of the scholarly record are discussed.

Disappearing repositories -- taking an infrastructure perspective on the long-term availability of research data

Currently, there is limited research investigating the phenomenon of research data repositories being shut down, and the impact this has on the long-term…

arxiv.org

**orsova** · Oct 15th 2023

I shared some related thoughts re: the archiving of LENR materials (both research and event related) here:

Post

RE: THE ISCMNS AT ICCF-25, REORGANISATION PLANS, SOCIAL MEDIA,

[…]

There are a number of open source, cloud based software packages that, whilst free, require hosting. As a consequence, they are mostly deployed by institutions and limited to the memberships thereof. Thus, whilst free and open source, they’re not necessarily available publicly for an unaffiliated end user. Indeed, those I talk about below are mostly meant to be deployed by institutions, rather than made open to the public.

The society could host some of these software packages and make them…

orsova

Aug 31st 2023

**JedRothwell** · Oct 16th 2023

On problem with data depositories is the cost of data storage. Granted, disk storage is millions of times cheaper than it used to be, but it still adds up. Especially when you store images and video data. The cost includes media (HDD or SSD) and -- if the data is going to be available online -- electricity. Data corruption and losses to accidents are also problems. I expect that neglect is the biggest problem. I have failed to save many cold fusion papers because no one preserved them.

The problems of data storage capacity, cost, and longevity might be solved with DNA storage. It is a little difficult to imagine how this might be made online. Perhaps you could request the data and have it transferred to an SSD and put online some hours later. Anyway, several groups are working on this. Some examples:

DNA Storage - Microsoft Research

The DNA Storage project enables molecular-level data storage into DNA molecules by leveraging biotechnology advances to develop archival storage.

www.microsoft.com

DNA29 - Tohoku University, Sendai

**orsova** · Oct 16th 2023

Quote from JedRothwell

On problem with data depositories is the cost of data storage. Granted, disk storage is millions of times cheaper than it used to be, but it still adds up. Especially when you store images and video data. The cost includes media (HDD or SSD) and -- if the data is going to be available online -- electricity. Data corruption and losses to accidents are also problems. I expect that neglect is the biggest problem. I have failed to save many cold fusion papers because no one preserved them.

The problems of data storage capacity, cost, and longevity might be solved with DNA storage. It is a little difficult to imagine how this might be made online. Perhaps you could request the data and have it transferred to an SSD and put online some hours later. Anyway, several groups are working on this. Some examples:

https://www.microsoft.com/en-u…arch/project/dna-storage/

https://dna29.org/

Display More

Object storage is relatively cheap. Backblaze and Vultr are both $6 per month per tb. Certainly cheaper than keeping everything on a VPS, at least.

Quote from JedRothwell

I have failed to save many cold fusion papers because no one preserved them.

At least part of the beauty of a solution like DSpace is that users can upload their own data. Everybody is free to add their own materials at their own pace. And the admin layer can be a number of people.

**JedRothwell** · Oct 16th 2023

Quote from orsova

Object storage is relatively cheap. Backblaze and Vultr are both $6 per month per tb. Certainly cheaper than keeping everything on a VPS, at least.

That seems expensive to me. That's $72/year. Google One Storage Premium is $100 per year for 2 TB. $50/year.

You can store data offline at home for far less than that.

The cheapest offline storage is mag tape, but I do not trust it. Over the long term I doubt it is stable.

Let us assume DNA storage will remain pretty much serial (not random access) and it will be slow to read and put online. In that case, I predict that vast amounts of data will be stored on DNA, but data that people seldom access will be left offline. For example, the proceedings from a conference in 1975; the dataset from an experiment done decades ago; or back issues of a defunct newspaper. When you request data, it will be automatically transferred from DNA to something like an online SSD. That might take an hour. Perhaps 10 minutes? They will charge $1/TB, and it will remain on line for 1 month, where anyone can access it. In other words, rather than charging $6 per month for 1 TB of online storage, they will charge one particular user $1 per month, but the data goes away after a month. Others can get it for free for that month. This would be similar to me uploading a paper to my storage at LENR-CANR.org, which costs me a little, and costs my readers almost nothing.

The index to the offline DNA stored data will be available online. Perhaps some abstracts or summaries will be available.

Some defunct newspapers are available online. I imagine this costs a lot of money to maintain. This website charges for access:

NewspaperArchive 1700s - 2023 | NewspaperArchive

Search through 16,077 historic newspaper archives to do genealogy and family history. Find obituaries, marriage and birth announcements, and other local and…

newspaperarchive.com

**JedRothwell** · Oct 16th 2023

Quote from JedRothwell

Let us assume DNA storage will remain pretty much serial (not random access) and it will be slow to read and put online.

It is lighting fast to copy multiple times, and literally dirt cheap. You could copy all the data in the world, and store hundreds of copies in data centers around the world. Or 100,000 copies at one data center, to allow parallel access to the files. You would have 100,000 itty-bitty read devices to respond to requests for data to be transferred to online storage.

The data would never be lost even in a catastrophe. Physically shipping copies to multiple locations would take days, but each package would weigh only 1 kg, so not much money. People going to remote locations, such as antarctica or Mars, would take their own 1 kg copy of all the data in the world. Eccentric people like me might purchase copies of public data, such as everything on Wikipedia and YouTube, all census data, all back issues of newspapers, and every public domain book, movie and television program ever broadcast.

The entire human genome is replicated in about an hour. That's 3.1 gigabasepairs. That is slow reproduction compared to a hard disk, but you can make copies of copies and end up with billions of copies in a short time. I assume a large database will be split into many segments which can all be replicated at one time, in parallel processes. I gather researchers are developing such processes. I do not know the details.

**orsova** · Oct 16th 2023

Quote from JedRothwell

That seems expensive to me. That's $72/year. Google One Storage Premium is $100 per year for 2 TB. $50/year.

You can store data offline at home for far less than that.

That's the cheapest object storage I'm aware of. Linode is $20 per tb/m, for example. Wasabi is $6.99. DigitalOcean is $20.

Google Drive may be cheaper, but it's not apples to apples. It's a consumer product. I wouldn't relish trying to run a large website out of Google Drive. The proper comp would be Google Cloud object storage, which is more like $20 per tb/m.

**Pete** · Oct 16th 2023

Quote from orsova

That's the cheapest object storage I'm aware of. Linode is $20 per tb/m, for example. Wasabi is $6.99. DigitalOcean is $20.

Google Drive may be cheaper, but it's not apples to apples. It's a consumer product. I wouldn't relish trying to run a large website out of Google Drive. The proper comp would be Google Cloud object storage, which is more like $20 per tb/m.

Amazon S3 is the cheapest I'm aware of with S3 Glacier deep archive being the cheapest. A lot of static websites use S3 with LightSail, also very cheap compared to a VPS.

**Frogfall** · Mar 5th 2024

More than 2 million research papers have disappeared from the Internet

An analysis of DOIs suggests that digital preservation is not keeping up with burgeoning scholarly knowledge.

www.nature.com

**Frogfall** · Mar 5th 2024

The full paper is well worth reading.

Digital Scholarly Journals Are Poorly Preserved: A Study of 7 Million Articles

Introduction: Digital preservation underpins the persistence of scholarly links and citations through the digital object identifier (DOI) system. We do not…

www.iastatedigitalpress.com

**Frogfall** · Apr 15th 2024

I've recently been playing around with IA Scholar - which has been quietly indexing and capturing papers and articles, from various sources, since 2020. Searches also include some material already on the Wayback Machine.

It might be a saviour if you are looking for material from journals that have already disappeared.

Internet Archive Scholar

Disappearing repositories

Amazon S3 is the cheapest I'm aware of with S3 Glacier deep archive being the cheapest. A lot of static websites use S3 with LightSail, also very cheap compared to a VPS.

The perpetual “is LENR even real” argument thread.

EMDrive: Newton's Laws can be "bypassed"?

Generator Tarasenko based on the model of the planet Earth

LK99 -- A new room temperature superconductor?

The Playground

Water evaporation using light

RIP Tom Passel, LENR Researcher and Supporter.

FreeL Tech's Vacuum Capacitor - Utmost Importance!

Following the webinar by Klimov-Zatelepin on March 27, 2024

AI and LENR - The Bots Get Busy.

Share

Amazon S3 is the cheapest I'm aware of with S3 Glacier deep archive being the cheapest. A lot of static websites use S3 with LightSail, also very cheap compared to a VPS.

Subscribe to our newsletter

Share