Disappearing repositories

  • Disappearing repositories -- taking an infrastructure perspective on the long-term availability of research data


    Abstract

    Currently, there is limited research investigating the phenomenon of research data repositories being shut down, and the impact this has on the long-term availability of data. This paper takes an infrastructure perspective on the preservation of research data by using a registry to identify 191 research data repositories that have been closed and presenting information on the shutdown process. The results show that 6.2 % of research data repositories indexed in the registry were shut down. The risks resulting in repository shutdown are varied. The median age of a repository when shutting down is 12 years. Strategies to prevent data loss at the infrastructure level are pursued to varying extent. 44 % of the repositories in the sample migrated data to another repository, and 12 % maintain limited access to their data collection. However, both strategies are not permanent solutions. Finally, the general lack of information on repository shutdown events as well as the effect on the findability of data and the permanence of the scholarly record are discussed.

    (Preprint available on arxiv.org.)

    "The most misleading assumptions are the ones you don't even know you're making" - Douglas Adams

  • I shared some related thoughts re: the archiving of LENR materials (both research and event related) here:



  • One problem with data repositories is the cost of data storage. Granted, disk storage is millions of times cheaper than it used to be, but it still adds up, especially when you store images and video. The cost includes the media (HDD or SSD) and, if the data is to be available online, electricity. Data corruption and accidental loss are also problems. But I expect neglect is the biggest problem: I have failed to save many cold fusion papers because no one preserved them.


    The problems of data storage capacity, cost, and longevity might be solved with DNA storage. It is a little difficult to imagine how this might be put online. Perhaps you could request the data and have it transferred to an SSD and put online some hours later. Anyway, several groups are working on this. Some examples:


    DNA Storage - Microsoft Research
    The DNA Storage project enables molecular-level data storage into DNA molecules by leveraging biotechnology advances to develop archival storage.
    www.microsoft.com


    DNA29 - Tohoku University, Sendai

  • Object storage is relatively cheap. Backblaze and Vultr are both $6 per TB per month. Certainly cheaper than keeping everything on a VPS, at least.


    I have failed to save many cold fusion papers because no one preserved them.

    At least part of the beauty of a solution like DSpace is that users can upload their own data. Everybody is free to add their own materials at their own pace. And the admin layer can be a number of people.


  • Object storage is relatively cheap. Backblaze and Vultr are both $6 per TB per month. Certainly cheaper than keeping everything on a VPS, at least.

    That seems expensive to me. That's $72 per TB per year. Google One Storage Premium is $100 per year for 2 TB, which works out to $50 per TB per year.


    You can store data offline at home for far less than that.


    The cheapest offline storage is mag tape, but I do not trust it. Over the long term I doubt it is stable.
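    To put the prices quoted so far on a common footing, here is a quick sketch that converts each into dollars per TB per year. The figures are the ones mentioned in these posts; treat them as snapshots, not current rate cards.

```python
# Normalize the storage prices quoted above to $/TB/year.
# These figures come from this thread and may be out of date.
monthly_per_tb = {
    "Backblaze B2": 6.00,   # $6 per TB per month
    "Vultr": 6.00,
}

yearly_per_tb = {name: 12 * price for name, price in monthly_per_tb.items()}

# Google One Premium is $100/year flat for 2 TB: $50/TB/year if fully used.
yearly_per_tb["Google One (2 TB plan)"] = 100 / 2

for name, cost in sorted(yearly_per_tb.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} ${cost:6.2f} per TB per year")
```

    Note that this only covers storage at rest; egress and API request fees differ widely between these services and can dominate the bill for a busy site.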


    Let us assume DNA storage will remain pretty much serial (not random access) and will be slow to read and bring online. In that case, I predict that vast amounts of data will be stored on DNA, but data that people seldom access will be left offline. For example: the proceedings from a conference in 1975; the dataset from an experiment done decades ago; or back issues of a defunct newspaper. When you request data, it will be automatically transferred from DNA to something like an online SSD. That might take an hour. Perhaps 10 minutes? They will charge $1/TB, and the data will remain online for one month, where anyone can access it. In other words, rather than charging $6 per month for 1 TB of online storage, they will charge one particular user $1, but the data goes away after a month. Others can get it for free during that month. This would be similar to me uploading a paper to my storage at LENR-CANR.org, which costs me a little, and costs my readers almost nothing.
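    The request-and-stage model described above can be sketched in code. All of the numbers (the $1 staging fee, the one-hour read delay, the one-month public window) are the post's illustrative figures, not real prices, and the dataset name is a made-up placeholder.

```python
# Sketch of the tiered model proposed above: data rests offline on DNA,
# and a paid request stages it to online SSD for a fixed public window.
# The fee, delay, and window are the post's illustrative figures.
from dataclasses import dataclass
from datetime import datetime, timedelta

STAGE_FEE_USD = 1.00                 # charged to the requesting user, per TB
STAGING_DELAY = timedelta(hours=1)   # DNA read + transfer to online SSD
ONLINE_WINDOW = timedelta(days=30)   # staged copy stays public this long

@dataclass
class StagedDataset:
    name: str
    requested_at: datetime

    @property
    def available_at(self) -> datetime:
        return self.requested_at + STAGING_DELAY

    @property
    def expires_at(self) -> datetime:
        return self.available_at + ONLINE_WINDOW

    def is_online(self, now: datetime) -> bool:
        """True while the staged copy is publicly readable (free to others)."""
        return self.available_at <= now < self.expires_at

req = StagedDataset("conference-proceedings-1975", datetime(2024, 1, 1, 12, 0))
print(req.available_at)   # ready about an hour after the request
print(req.expires_at)     # then free to everyone for a month
```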


    The index to the offline DNA stored data will be available online. Perhaps some abstracts or summaries will be available.


    Some defunct newspapers are available online. I imagine this costs a lot of money to maintain. This website charges for access:


    NewspaperArchive 1700s - 2023 | NewspaperArchive
    Search through 16,077 historic newspaper archives to do genealogy and family history. Find obituaries, marriage and birth announcements, and other local and…
    newspaperarchive.com

  • Let us assume DNA storage will remain pretty much serial (not random access) and will be slow to read and bring online.

    It is lightning fast to copy multiple times, and dirt cheap. You could copy all the data in the world, and store hundreds of copies in data centers around the world. Or 100,000 copies at one data center, to allow parallel access to the files. You would have 100,000 itty-bitty read devices to respond to requests for data to be transferred to online storage.


    The data would never be lost, even in a catastrophe. Physically shipping copies to multiple locations would take days, but each package would weigh only 1 kg, so it would not cost much. People going to remote locations, such as Antarctica or Mars, would take their own 1 kg copy of all the data in the world. Eccentric people like me might purchase copies of public data, such as everything on Wikipedia and YouTube, all census data, all back issues of newspapers, and every public domain book, movie, and television program ever broadcast.


    The entire human genome is replicated in about an hour. That's 3.1 gigabasepairs. That is slow reproduction compared to a hard disk, but you can make copies of copies and end up with billions of copies in a short time. I assume a large database will be split into many segments which can all be replicated at one time, in parallel processes. I gather researchers are developing such processes. I do not know the details.
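    The copies-of-copies arithmetic is worth spelling out: if every existing copy can itself be replicated in one cycle (about an hour for a genome-sized segment, per the figure above), the number of copies doubles each cycle, so even billions of copies are only a few dozen cycles away. A minimal sketch:

```python
# Exponential replication: 1 copy becomes 2, 2 become 4, and so on.
# One cycle is roughly an hour for a genome-sized segment (per the post).
import math

def cycles_to_reach(target_copies: int) -> int:
    """Doubling cycles needed to grow from 1 copy to at least target_copies."""
    return math.ceil(math.log2(target_copies))

print(cycles_to_reach(100_000))        # 17 cycles -> roughly 17 hours
print(cycles_to_reach(1_000_000_000))  # 30 cycles
```

    So the 100,000 in-data-center copies imagined above are about 17 doubling cycles away from a single master copy: well under a day at one hour per cycle.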

  • That seems expensive to me. That's $72 per TB per year. Google One Storage Premium is $100 per year for 2 TB, which works out to $50 per TB per year.


    You can store data offline at home for far less than that.

    That's the cheapest object storage I'm aware of. Linode is $20 per TB per month, for example. Wasabi is $6.99. DigitalOcean is $20.


    Google Drive may be cheaper, but it's not apples to apples. It's a consumer product. I wouldn't relish trying to run a large website out of Google Drive. The proper comp would be Google Cloud object storage, which is more like $20 per TB per month.


  • That's the cheapest object storage I'm aware of. Linode is $20 per TB per month, for example. Wasabi is $6.99. DigitalOcean is $20.

    Google Drive may be cheaper, but it's not apples to apples. It's a consumer product. I wouldn't relish trying to run a large website out of Google Drive. The proper comp would be Google Cloud object storage, which is more like $20 per TB per month.

    Amazon S3 offers some of the cheapest tiers I'm aware of, with S3 Glacier Deep Archive being the cheapest. A lot of static websites use S3 with Lightsail, also very cheap compared to a VPS.
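    For a rough sense of what the archive tiers cost, here is a sketch using approximate us-east-1 list prices; these are assumptions that drift over time, so check the current AWS pricing pages before relying on them. The catch with Deep Archive is latency: objects must be restored before they can be read.

```python
# Ballpark cost of storing 1 TB for a year in a few S3 storage classes.
# Prices are approximate us-east-1 figures (assumptions, not a rate card);
# the access latencies reflect each tier's documented restore behavior.
GB_PER_TB = 1000  # S3 pricing is quoted per GB-month

tiers = {
    # name: ($ per GB-month, time before objects are readable)
    "S3 Standard": (0.023, "immediate"),
    "S3 Glacier Flexible Retrieval": (0.0036, "minutes to hours (restore)"),
    "S3 Glacier Deep Archive": (0.00099, "12-48 hours (restore)"),
}

for name, (price, latency) in tiers.items():
    yearly = price * GB_PER_TB * 12
    print(f"{name:30s} ~${yearly:7.2f}/TB/year   access: {latency}")
```

    At roughly $12/TB/year, Deep Archive undercuts every hot-storage price in this thread by a wide margin, but it only suits data you can wait half a day to touch.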

  • I've recently been playing around with IA Scholar, which has been quietly indexing and capturing papers and articles from various sources since 2020. Searches also include some material already on the Wayback Machine.


    It might be a saviour if you are looking for material from journals that have already disappeared.


    Internet Archive Scholar

    "The most misleading assumptions are the ones you don't even know you're making" - Douglas Adams

