Disappearing repositories

  • Disappearing repositories -- taking an infrastructure perspective on the long-term availability of research data


    Abstract

    Currently, there is limited research investigating the phenomenon of research data repositories being shut down, and the impact this has on the long-term availability of data. This paper takes an infrastructure perspective on the preservation of research data by using a registry to identify 191 research data repositories that have been closed and presenting information on the shutdown process. The results show that 6.2 % of research data repositories indexed in the registry were shut down. The risks resulting in repository shutdown are varied. The median age of a repository when shutting down is 12 years. Strategies to prevent data loss at the infrastructure level are pursued to varying extent. 44 % of the repositories in the sample migrated data to another repository, and 12 % maintain limited access to their data collection. However, both strategies are not permanent solutions. Finally, the general lack of information on repository shutdown events as well as the effect on the findability of data and the permanence of the scholarly record are discussed.

    Disappearing repositories -- taking an infrastructure perspective on the long-term availability of research data
    Currently, there is limited research investigating the phenomenon of research data repositories being shut down, and the impact this has on the long-term…
    arxiv.org

    "The most misleading assumptions are the ones you don't even know you're making" - Douglas Adams

  • I shared some related thoughts re: the archiving of LENR materials (both research and event related) here:



One problem with data repositories is the cost of data storage. Granted, disk storage is millions of times cheaper than it used to be, but it still adds up, especially when you store images and video data. The cost includes media (HDD or SSD) and -- if the data is going to be available online -- electricity. Data corruption and accidental losses are also problems. I expect that neglect is the biggest problem. I have failed to save many cold fusion papers because no one preserved them.


The problems of data storage capacity, cost, and longevity might be solved with DNA storage. It is a little difficult to imagine how this might be put online. Perhaps you could request the data and have it transferred to an SSD and put online some hours later. Anyway, several groups are working on this. Some examples:


    DNA Storage - Microsoft Research
    The DNA Storage project enables molecular-level data storage into DNA molecules by leveraging biotechnology advances to develop archival storage.
    www.microsoft.com


    DNA29 - Tohoku University, Sendai

Object storage is relatively cheap. Backblaze and Vultr are both $6 per TB per month. Certainly cheaper than keeping everything on a VPS, at least.


    I have failed to save many cold fusion papers because no one preserved them.

    At least part of the beauty of a solution like DSpace is that users can upload their own data. Everybody is free to add their own materials at their own pace. And the admin layer can be a number of people.


Object storage is relatively cheap. Backblaze and Vultr are both $6 per TB per month. Certainly cheaper than keeping everything on a VPS, at least.

That seems expensive to me. That's $72 per TB per year. Google One Storage Premium is $100 per year for 2 TB, which works out to $50 per TB per year.


    You can store data offline at home for far less than that.


    The cheapest offline storage is mag tape, but I do not trust it. Over the long term I doubt it is stable.


Let us assume DNA storage will remain pretty much serial (not random access) and it will be slow to read and put online. In that case, I predict that vast amounts of data will be stored on DNA, but data that people seldom access will be left offline. For example, the proceedings from a conference in 1975; the dataset from an experiment done decades ago; or back issues of a defunct newspaper. When you request data, it will be automatically transferred from DNA to something like an online SSD. That might take an hour. Perhaps 10 minutes? They will charge $1/TB, and it will remain online for 1 month, where anyone can access it. In other words, rather than charging $6 per month for 1 TB of online storage, they will charge one particular user $1 per month, but the data goes away after a month. Others can get it for free for that month. This would be similar to me uploading a paper to my storage at LENR-CANR.org, which costs me a little, and costs my readers almost nothing.


    The index to the offline DNA stored data will be available online. Perhaps some abstracts or summaries will be available.


    Some defunct newspapers are available online. I imagine this costs a lot of money to maintain. This website charges for access:


    NewspaperArchive 1700s - 2023 | NewspaperArchive
    Search through 16,077 historic newspaper archives to do genealogy and family history. Find obituaries, marriage and birth announcements, and other local and…
    newspaperarchive.com

  • Let us assume DNA storage will remain pretty much serial (not random access) and it will be slow to read and put online.

It is lightning fast to copy multiple times, and literally dirt cheap. You could copy all the data in the world, and store hundreds of copies in data centers around the world. Or 100,000 copies at one data center, to allow parallel access to the files. You would have 100,000 itty-bitty read devices to respond to requests for data to be transferred to online storage.


The data would never be lost even in a catastrophe. Physically shipping copies to multiple locations would take days, but each package would weigh only 1 kg, so shipping would not cost much. People going to remote locations, such as Antarctica or Mars, would take their own 1 kg copy of all the data in the world. Eccentric people like me might purchase copies of public data, such as everything on Wikipedia and YouTube, all census data, all back issues of newspapers, and every public domain book, movie and television program ever broadcast.


    The entire human genome is replicated in about an hour. That's 3.1 gigabasepairs. That is slow reproduction compared to a hard disk, but you can make copies of copies and end up with billions of copies in a short time. I assume a large database will be split into many segments which can all be replicated at one time, in parallel processes. I gather researchers are developing such processes. I do not know the details.
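The copy-of-copies point can be made concrete with a toy calculation. This sketch assumes, hypothetically, one replication cycle per hour with every existing copy replicated in parallel each cycle; real DNA amplification protocols differ in the details.

```python
# Toy model of copy-of-copies DNA replication: if every existing copy
# can be replicated in parallel, the copy count doubles each cycle.
# (Assumption: one replication cycle ~= 1 hour, full parallelism.)
def copies_after(hours: int, start: int = 1) -> int:
    """Number of copies after doubling once per hour."""
    return start * 2 ** hours

# Starting from a single copy, you pass a billion copies in ~30 hours:
print(copies_after(30))  # 1073741824
```

So even though one replication cycle is slow compared to a hard disk, the exponential growth gets you to billions of copies in roughly a day.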

  • That seems expensive to me. That's $72/year. Google One Storage Premium is $100 per year for 2 TB. $50/year.


    You can store data offline at home for far less than that.

That's the cheapest object storage I'm aware of. Linode is $20 per TB per month, for example. Wasabi is $6.99. DigitalOcean is $20.


Google Drive may be cheaper, but it's not apples to apples. It's a consumer product. I wouldn't relish trying to run a large website out of Google Drive. The proper comp would be Google Cloud object storage, which is more like $20 per TB per month.


That's the cheapest object storage I'm aware of. Linode is $20 per TB per month, for example. Wasabi is $6.99. DigitalOcean is $20.


Google Drive may be cheaper, but it's not apples to apples. It's a consumer product. I wouldn't relish trying to run a large website out of Google Drive. The proper comp would be Google Cloud object storage, which is more like $20 per TB per month.

Amazon S3 has some of the cheapest tiers I'm aware of, with S3 Glacier Deep Archive being the cheapest. A lot of static websites use S3 with Lightsail, which is also very cheap compared to a VPS.
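To make the back-and-forth over prices easier to compare, here is a quick sketch that normalizes the figures quoted in this thread to dollars per TB per year. The prices are as stated by posters, not verified current rates, and the Google One entry is a consumer plan rather than true object storage.

```python
# Prices quoted in this thread, in USD per TB per month.
# (As stated by posters; treat as illustrative, not current rates.)
MONTHLY_PER_TB = {
    "Backblaze B2": 6.00,
    "Vultr": 6.00,
    "Wasabi": 6.99,
    "Linode": 20.00,
    "DigitalOcean": 20.00,
    "Google Cloud Storage": 20.00,
    "Google One ($100/yr for 2 TB)": 100.00 / 12 / 2,  # consumer plan
}

def yearly_per_tb(monthly: float) -> float:
    """Convert a per-TB monthly price to a per-TB yearly price."""
    return monthly * 12

for name, price in sorted(MONTHLY_PER_TB.items(), key=lambda kv: kv[1]):
    print(f"{name:30s} ${yearly_per_tb(price):7.2f} per TB-year")
```

Normalized this way, the $6/month providers come out to $72 per TB-year, and the 2 TB Google One plan to $50 per TB-year, matching the arithmetic earlier in the thread.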

  • I've recently been playing around with IA Scholar - which has been quietly indexing and capturing papers and articles, from various sources, since 2020. Searches also include some material already on the Wayback Machine.


    It might be a saviour if you are looking for material from journals that have already disappeared.


    Internet Archive Scholar

    "The most misleading assumptions are the ones you don't even know you're making" - Douglas Adams

  • You want government to control the publishing of science?

    How did that work with COVID?

    The government did not control the publishing of science papers relating to COVID. They were published by academic journals, hospitals, public and private organizations, and governments all over the world.


    It worked extremely well. An effective vaccine was designed and delivered in record time. It saved more than a million lives in the U.S. alone. It would have saved a lot more if the anti-vaccine people had not opposed it.


The government does have a lot of control over research funding, especially big-ticket research such as plasma fusion. But it does not control publishing in these areas. Private academic publishers such as Elsevier do. They have too much power in my opinion, and their profits are too large for the internet era. Academic researchers are upset by this. That is why free, online, open-access publications are increasingly popular.

  • Regarding this article:


    When Online Content Disappears

    38% of webpages that existed in 2013 are no longer accessible a decade later

    I am not happy about that. But let us have a sense of perspective. And a look at history. How much information was lost before the internet was established? How many 11-year-old journals, newspapers, newsletters, novels, or corporate annual reports? I expect it was more than 38%. Go back 100 years and it is probably ~99%.
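The 38% decade figure can be turned into a rough annual attrition rate. This back-of-the-envelope sketch assumes a constant yearly loss rate, which is an oversimplification, but it shows the century-scale extrapolation is consistent with the ~99% guess above.

```python
# If 38% of 2013 webpages were gone ten years later, and pages vanish
# at a constant annual rate r (a simplifying assumption), then
# (1 - r) ** 10 = 1 - 0.38, so r = 1 - 0.62 ** (1/10).
lost_10yr = 0.38
r = 1 - (1 - lost_10yr) ** (1 / 10)
print(f"implied annual loss rate: {r:.1%}")  # roughly 4.7% per year

# At that rate, after a century only about 1% of pages would survive,
# in line with the ~99% loss guessed above for century-old material:
surviving_100yr = (1 - r) ** 100
print(f"surviving after 100 years: {surviving_100yr:.1%}")
```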


Information was ephemeral before computers and the internet were invented. It has become somewhat less ephemeral. But there is so much of it that we are bound to lose a lot! Furthermore, to be honest, most of what we lose is unimportant. Nobody wants to read a corporate annual report from a company that went out of business decades ago. The New York Times converted every issue to digital format. Subscribers can access any newspaper from any day from 1851 to the present. Most of the news on most of these days is unimportant. You wouldn't bother reading it. Historians, of course, want to read it. I have occasionally used it to learn about an obscure person or event. But if we lost 99% of it, it would probably not be missed.


    You also have to consider the expense. Keeping old web sites around costs money. When the website owner dies, who is going to pay for it? Not the ISP. Not Google, or Facebook, or you or me. New storage technology may eventually make it so cheap we might as well preserve old data. Such as Project Silica:


    Project Silica
    Project Silica is developing the first-ever storage technology designed and built from the ground up for the cloud, using femtosecond lasers to store data.
    www.microsoft.com


    DNA storage may ultimately allow us to keep fantastic amounts of data for hundreds of thousands of years. It may be so cheap we might as well keep it all. You could own a copy of all the data in the world encoded in DNA for a few pennies. Until data storage becomes as cheap as that, we are going to lose a lot of data.

  • But let us have a sense of perspective.

    Yeah - I posted the article, but I thought there were a lot of problems with it. As you say, it is unreasonable to expect someone to pay to host a webpage in perpetuity. And many, many webpages were never meant to have a long life anyway.


    For instance, the report is including in its statistics pages from local government websites - and treating a page disappearance as some sort of tragic event. However, a local government website needs to be continually updated with current accurate information - and that means a continual “weeding out” of pages containing out-of-date information. Often that would mean deleting a page, and diverting people to a new page, not just tweaking the information on the old page.


In fact the Pew Research data also completely misses the fact that many pages “evolve” with time. It seems to assume that a web page is a fixed thing once it is uploaded to a server, and doesn’t change unless completely deleted (or moved). Unless you have a series of historical snapshots (like on the Wayback Machine), you never know if the page you are seeing now is the same as it was last week - but the Pew research would simply treat the page as “still accessible”.

    "The most misleading assumptions are the ones you don't even know you're making" - Douglas Adams
