Leveraging HPC to extend research potential in the humanities
- Principal Investigators:
- Umut Bașaran, Florian Barth, George Dogaru, Prof. Dr. Philipp Wieder
- HPC Platform used:
- NHR@Göttingen
- Project ID:
- nhr_textplus_dh_llama
- Date published:
- Researchers:
- Umut Bașaran, Florian Barth, George Dogaru, Prof. Dr. Philipp Wieder
- Introduction:
- Within Text+, the NFDI consortium dedicated to building and providing infrastructure for the field of digital humanities, high performance computing (HPC) is gaining ground quickly. With the arrival of large language models (LLMs), the motivation for providing HPC infrastructure increased decisively. As a consequence, the first HPC service was established, easing the way for developments in several other areas where access to HPC is needed for enabling solutions otherwise not feasible. Examples of HPC use in Text+ are the Text+ LLM service and the NLP tool MONAPipe.
- Body:
-
Within Text+, the NFDI consortium dedicated to building and providing infrastructure for the field of digital humanities, high performance computing (HPC) is gaining ground quickly. With the arrival of large language models (LLMs), the motivation for providing HPC infrastructure increased decisively. As a consequence, the first HPC service was established, easing the way for developments in several other areas where access to HPC is needed for enabling solutions otherwise not feasible.
The Text+ LLM service [2] is an AI chatbot which offers entitled users access to multiple LLMs, offers helpful features tailored to the Text+ community, and focuses on data privacy. The available LLMs are hosted on the HPC infrastructure provided via the KISSKI [3] project, which also offers other LLM-based chatbots tailored to other communities [4]. The Text+ LLM service unifies access to multiple high quality LLMs, reducing the need for users to access different services. It does so while focusing on data privacy by ensuring that for all models except those from OpenAI, data entered by users is not shared with anyone. When users delete the chat history, no user data remains on the servers. It allows giving instructions and context information by using a custom prompt, which can be set on a model basis. Furthermore, it allows uploading own files and subsequently asking questions about the contents thereof. This way, information not contained in the LLM becomes available and can be queried via the chat, making this feature immensely useful. The chatbot can also indicate which uploaded sources were used for answering the queries. In the first version, plaintext and PDF files can be uploaded. Support for further file formats will be added, as will other features, among them the possibility of asking questions based on a specific public data source such as Wikidata, thus enhancing the versatility of the service and its suitability for research purposes.
For use cases that go beyond the already rich possibilities available in the LLM service, there is another option of accessing the LLMs hosted on the KISSKI infrastructure. HPC users can gain access to the underlying API upon request and process data according to their goals.
Furthermore, several other efforts are taking place involving the current or future use of HPC. Worth highlighting – not least because of its high level of maturity – is MONAPipe [5] – a tool for NLP tasks that goes well beyond lemmatization and named entity recognition, which makes use of spaCy and other NLP libraries, to which it offers a unified access, while also allowing the integration of own components.
While the core of MONAPipe is light and easy to install, using it effectively can be challenging on a local machine, because of the great size of the models required for actually running the tasks and of the high hardware requirements. For this reason, MONAPipe, which already has a REST-API, is being adapted so it can run on HPC and be queried via its API. It will then be easily possible to have only the lightweight core of the application run locally or in a JupyterHub and have the real computing take place on adequate HPC resources, thus making the use of the tool effortless and effectively enabling research with it.[1] text-plus.org
[2] https://text-plus.org/daten-dienste/llm_service
[3] https://kisski.gwdg.de
[4] Decker, Hosseini, Meisel: “The New GWDG LLM Service”, GWDG Nachrichten 03/2024, https://gwdg.de/about-us/gwdg-news/2024/GN_03-2024_www.pdf
[5] Barth et al.: "MONAPipe: Modular Natural Language Processing Pipeline for Digital Humanities", DOI: 10.5281/zenodo.8424925
[6] spacy.io - Institute / Institutes:
- Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
- Affiliation:
- Georg-August-Universität Göttingen
- Image:
-