SC2003
HPC Challenge Proposal

Title:
Global Data-Intensive Grid Collaboration

Contact person:
Rajkumar Buyya, University of Melbourne (raj@cs.mu.oz.au)


Author(s):
Kim Branson, Walter and Eliza Hall Institute for Medical Research
Rajkumar Buyya, University of Melbourne
Susumu Date, Osaka University
Benjamin Khoo, IBM (Global Services)
Baden Hughes, University of Melbourne
Rafael Moreno-Vozmediano, Complutense University of Madrid
Jon Smillie, Australian National University
Srikumar Venugopal, University of Melbourne
Jia Yu, University of Melbourne
Lyle Winton, University of Melbourne

Problem statement:
In a diverse range of disciplines, researchers are constrained by the inability to efficiently process large-scale, distributed data-sets in an integrated, context-sensitive fashion. Additionally, multidisciplinary and collaborative international projects are becoming more common, necessitating solutions that enable geographically distributed research teams to collaboratively collect, store, analyse, visualise, and interpret data. Within the disciplines considered in this proposal (high-energy physics, natural language processing, portfolio analysis, molecular docking, and neuroscience), there exists a requirement for a unified yet flexible infrastructure for on-demand sharing, management, and analysis of distributed data-sets.

Emerging Grid technologies have been proposed as a potential solution, as they enable efficient sharing and aggregation of heterogeneous, geographically distributed, and dynamically available resources. However, in the context of distributed, data-intensive, multi-component applications, significant challenges remain at both the application and infrastructure levels. At the application level, workflow specification, discovery and subsequent collation of multi-modal components, compatibility across a variety of data formats, algorithm adaptation, and formal representation and composition are all problematic areas. At the Grid infrastructure level, advancement is needed in tools, programming frameworks, scheduling, resource management, computational economy, quality of service, security, and interoperability.

Approach:
We have assembled applications, resources, and technologies from both tightly and loosely coordinated groups and institutions around the world to demonstrate both HPC Challenges: Most Data-Intensive and Most Geographically Distributed Applications. The only requirement for membership in this collaboration is the ability to provide access to Grid resources.

We plan to demonstrate various applications, from natural language processing (University of Melbourne) and particle physics (Belle collaboration: KEK-Japan and School of Physics@Melbourne) to portfolio analysis (Complutense University of Madrid). For each application area, we have developed a data catalogue directory and a mechanism for accessing remote databases (e.g., protein data banks), along with a brokering system (the Gridbus Data-Grid scheduler). The broker performs discovery and online extraction of data-sets from the closest data sources and then farms out analysis jobs to optimal resources. The broker evaluates whether to move the application code to a resource where the data is available, move the data to a resource where the application is available, or move both to a suitable third computing resource.
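The broker's three-way placement decision can be sketched as a simple cost comparison: for each candidate resource, estimate staging time for code and data plus compute time, and pick the cheapest. This is an illustrative heuristic only — the resource names, rates, and cost model below are hypothetical, not the actual Gridbus broker implementation.

```python
# Sketch of the broker's placement choice: move code to data, move data
# to code, or move both to a third resource. All figures are hypothetical.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    cpu_rate: float   # relative compute speed (work units/sec)
    bandwidth: float  # MB/s to/from this resource

def transfer_time(size_mb: float, src: Resource, dst: Resource) -> float:
    """Transfer time bounded by the slower endpoint's bandwidth."""
    if src.name == dst.name:
        return 0.0
    return size_mb / min(src.bandwidth, dst.bandwidth)

def plan_job(code_mb, data_mb, work_units, code_site, data_site, candidates):
    """Return (strategy, resource name, est. time) minimising completion time."""
    best = None
    for r in candidates:
        # total cost = stage code + stage data + compute
        cost = (transfer_time(code_mb, code_site, r)
                + transfer_time(data_mb, data_site, r)
                + work_units / r.cpu_rate)
        if r.name == data_site.name:
            strategy = "move code to data"
        elif r.name == code_site.name:
            strategy = "move data to code"
        else:
            strategy = "move both to third resource"
        if best is None or cost < best[2]:
            best = (strategy, r.name, cost)
    return best

if __name__ == "__main__":
    kek = Resource("kek", cpu_rate=5.0, bandwidth=10.0)    # data lives here
    melb = Resource("melb", cpu_rate=20.0, bandwidth=5.0)  # code lives here
    anl = Resource("anl", cpu_rate=100.0, bandwidth=50.0)  # fast third site
    print(plan_job(code_mb=10, data_mb=2000, work_units=1000,
                   code_site=melb, data_site=kek,
                   candidates=[kek, melb, anl]))
```

With a large data-set and a slow wide-area link, shipping the small application code to the data site usually wins, which is the intuition the broker exploits.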

Applications will make use of real-world distributed data-sets. For example, the high-energy physics demonstration will utilise datasets generated by the Belle experiment's particle generator/detector based at the KEK Laboratory in Japan.
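The data catalogue and "closest data source" discovery mentioned above can be sketched as a replica lookup: a logical file name maps to several physical copies, and the broker picks the replica reachable at the highest bandwidth. The catalogue entries, URLs, and bandwidth figures below are hypothetical examples, not the real Belle catalogue.

```python
# Sketch of a replica catalogue: map a logical name to physical replicas
# and choose the "closest" by measured bandwidth. All entries hypothetical.
catalogue = {
    "belle/run42/events.dat": [
        "gsiftp://kek.jp/data/run42/events.dat",
        "gsiftp://physics.unimelb.edu.au/belle/run42/events.dat",
    ],
}

# Hypothetical measured bandwidth (MB/s) from the compute site to each host.
bandwidth_to = {
    "kek.jp": 10.0,
    "physics.unimelb.edu.au": 40.0,
}

def host_of(url: str) -> str:
    # gsiftp://host/path -> host
    return url.split("//", 1)[1].split("/", 1)[0]

def closest_replica(logical_name: str) -> str:
    """Pick the replica reachable at the highest bandwidth."""
    replicas = catalogue[logical_name]
    return max(replicas, key=lambda u: bandwidth_to[host_of(u)])

print(closest_replica("belle/run42/events.dat"))
# -> gsiftp://physics.unimelb.edu.au/belle/run42/events.dat
```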

Description of the visual presentation:
Analysis will be launched, monitored, managed, and steered from a web portal, including the initiation of data farming and the scheduling of event parameter studies. We will demonstrate visual parametric application-creation tools and a portal that provides a visual overview of the state of the testbed along with the execution status of the various applications. We will make use of G-monitor to launch, manage, and steer the execution of applications. From the portal, we will specify analysis requirements, and the Gridbus data grid broker will then launch applications on resources. The results of the analysis will be visualised and their scientific significance demonstrated.

Description of the required/used resources:
As we will be setting up our own "Global Data-Intensive Grid" testbed in collaboration with partners and collaborators around the world (e.g., IBM-BelleGrid, ApGrid, PRAGMA-Grid, Australian-Grid, Spain-IRIS-Grid, N*Grid-Korea, ThaiGrid, SingaporeGrid, IGrid-India, ItalianGrid, UK-Grid, NRC-CanadaGrid, BrazilianGrid, DutchGrid, SunGrid, JapanGrid, ANL, and UCSD), we only need access to the Internet from the conference site.

Computational resources ranging from pocket PCs to vector supercomputers, and data resources ranging from scientific instruments (e.g., the KEK Belle accelerator/detector) to biological databases (PDBs), from many countries will be utilised in the demonstration.

Unlike last year's HPC Challenge Global Grid demonstration, which focused on compute-intensive applications, this demonstration focuses on distributed data-intensive Grid infrastructure and applications. In addition, data will be captured from scientific instruments and analysed "online" and "on-demand".