Web archives as research infrastructure for digital societies: the case study of Arquivo.pt
Publication date: 14.11.2022
Archeion, 2022, 123, pp. 46 - 85
https://doi.org/10.4467/26581264ARC.22.012.16665
Humans are the dominant species on Earth. Our advantage comes from our unique capacity to organise at large scale to reach common goals. In digital societies, organising requires communicating information, and these days most of it is published exclusively online. The problem is that online information disappears quickly, often within a few months. Humanity's dependence on online information is strong but still recent, and the consequences of losing the historical perspective over online data are yet to be seen. Web archives are digital preservation systems that collect, store and provide access to historical web data. Scientific researchers already use web archives, but to serve digital societies they should also be used by the wider public. Arquivo.pt is a public web archive, started in 2007, that enables search and access to historical information preserved from the Web since the 1990s. This article presents Arquivo.pt as a case study of a research infrastructure developed to serve wider communities at national and international levels. It shares the main lessons learned so that other web archiving initiatives may arise and develop at a faster pace, and it describes the existing tools and activities that enable exploration of historical web-archived collections. Finally, it presents challenges related to creating web archives and proposes actions to address them.
Article type: Original article
The Foundation for Science and Technology, Portugal
Article status: Open
Licence: CC BY-NC-ND
Publication languages: English