During its first year, ESCAPE paved the way for an open-access Data Lake infrastructure, the ESCAPE Data Infrastructure for Open Science (DIOS). This will allow large national research data centres to work together and build a single, robust service that stores and distributes data, provides seamless access to it, and can scale up to multi-exabyte needs.
With FAIR data management services at its basis, the Data Lake will help users worldwide manage large volumes of data efficiently, making the data accessible to distributed communities while optimising the cost of storage.
How do we put astronomy, particle physics and astroparticle physics data to use in the data systems of science initiatives, namely the ESFRI (European Strategy Forum on Research Infrastructures) projects? This was the main focus of the ESCAPE DIOS team during its first year: the team shaped the architecture and structured its functional elements on the basis of the data management requirements of the ESCAPE ESFRIs.
A Data Lake pilot is in place, composed of a range of storage services provided by the ESCAPE partners and orchestrated in such a way that the ESFRIs and the scientists see it as a single service. The first datasets from several ESFRIs have already arrived: astrophysics, cosmology and particle physics experiments have injected real data into the current pilot Data Lake. Data access methods and tools are being explored to give scientists the flexibility to process the data on a variety of resources, from sites accessible through Grid interfaces to Cloud resources and High Performance Computers (HPC).
The data at ESCAPE DIOS is being organised, orchestrated and catalogued, with clear policies for data replication and deletion, along with a set of application programming interfaces (APIs). These APIs will allow the end-user to manage and access the data.
Figure – The building blocks of the ESCAPE DIOS infrastructure and the connection with compute services
Finally, given the heterogeneous and distributed nature of the system, Authentication, Authorization and Identity management (AAI) plays a crucial role in the architecture. The system must meet the ESFRI policies and allow open access outside the data embargo periods. At the same time, the AAI mechanisms need to scale to exascale data management, reducing overhead while ensuring an adequate level of security.
Different components are being used to build ESCAPE DIOS and are made available to the ESFRIs as stand-alone services:
The development of ESCAPE DIOS builds as much as possible on existing technologies developed in the context of other initiatives, notably European projects, such as:
To bring the results closer to market, ESCAPE DIOS relies on open protocols such as HTTP/WebDAV, which will enable commercial storage providers to be integrated. Once the ESCAPE DIOS computing interface and scalability have been demonstrated, the platform will also integrate commercial resources.
The implementation of ESCAPE DIOS has already started, and a Data Lake pilot is being deployed with the goal of demonstrating the model with a small-scale, functional system that integrates the technologies identified in the preparation phase. The pilot should be assessed in early 2021 to show that data can be organised and distributed across sites, and to demonstrate how a storage orchestration service at the level of a few terabytes can be used across different storage technologies.
The prototype phase will focus on deploying a full-scale system, allowing functional tests and stress tests on all the capabilities needed by the ESFRIs for FAIR data management. More partner sites will be integrated into ESCAPE DIOS, consolidating functionalities and scaling up performance.
These sites will include large data centres with experience in long-term bit preservation on archival media, such as INFN-CNAF, SURF-SARA, IN2P3-CC, CERN and PIC. During this phase, ESCAPE DIOS will be ready to meet the real functionality, performance and usability needs of the ESFRI projects, with sizeable storage resources of up to 100 TB.
ESCAPE DIOS provides a data lake that is flexible and robust in terms of storage, security, safety and transfer, together with the basic orchestration machinery needed to combine the technology with high-quality data from different communities and thereby enable the exploration of new areas in science. It will serve international user communities while being connected to the European Open Science Cloud (EOSC).