UvA-DARE (Digital

Abstract: Data-intensive workflow applications are characterized by continuously growing volumes of data to be processed, the complexity of the tasks in the pipeline, and the infrastructure capacity required for computation and storage. Computing, storage, and networking infrastructure technologies have made tremendous progress in recent years. We review the emerging trends in data-intensive workflow applications, in particular the potential challenges and opportunities enabled by the decentralized application paradigm.


I. INTRODUCTION
Nowadays, many applications, e.g. industrial automation using IoT [1], early warning for disasters [2], or tackling large-scale societal challenges [3], involve large volumes of data, which are often highly distributed and provided by different sources [4]. They require high performance, e.g. for decision making [5] or for real-time simulation [6]. In those distributed applications, computational tasks, such as accessing and integrating data from a source, processing data using machine learning, or generating new data products, are often modelled as workflows and executed via a workflow management system (WFMS), e.g. Pegasus [7] and Galaxy [8]. Workflow systems have become an important facilitator for automating scientific experiments and business processes on distributed infrastructures [9].
High-quality infrastructures play a crucial role in distributed workflow applications. In many cases the infrastructure requires not only supercomputers from computing centers, but also general-purpose virtual machines in cloud data centers, which can be flexibly configured and defined based on the application requirements, e.g. customized for data geo-locations or for dynamic changes at run time. As applications become more distributed, in both data sources and infrastructure, multiple providers are inevitably involved. The classical centralized model for managing data and infrastructures faces challenges in scalability, quality guarantees, and a sustainable business model for engaging those different providers.
The decentralized paradigm was discussed by researchers in the context of workflow management in the early 2000s, for data sharing, e.g. using P2P networks [10], and for coordinating processes, e.g. service chronograph [11]. Recently, blockchains and decentralized applications (DApps) have attracted considerable research interest across a broad spectrum of applications [12]. We can also clearly see changes in workflow systems enabled by those decentralized application paradigms, for instance managing crowd-sensing data from large numbers of engaged citizens [13], workflow provenance using immutable ledgers [14], and infrastructure automation using smart contracts [15]. Thus, it is time to consider whether we need a decentralized workflow management paradigm, and what the challenges and the research agenda are. In this short paper, we first review the state of the art, and then identify the research opportunities and challenges, using examples from the medical domain.

II. DECENTRALIZED WORKFLOW MANAGEMENT
There is currently no agreed definition of decentralized workflow management systems. In this section, we first review the state of the art from the aspects of decentralized systems, Cloud and software-defined infrastructure, and workflow management, and then identify the key differences that decentralized data management may have compared to the classical approaches.

A. State of the art
Workflow management has been studied in the past years in both industry, e.g. business process modelling [16], and academia, e.g. e-Science [11]. A workflow for a data-intensive application often involves data, the software tools or components that implement each step in the workflow, and the infrastructure where workflow steps are executed and communicate. Compared with the conventional centralized architecture in workflow management, decentralized architectures have evolved considerably; technically, the use of multiple groups of server components, P2P networks, and blockchain has marked a major shift.
Research on decentralized workflow management started in the 1990s, as shown in figure 1. Early on, Schill et al. [17] implemented the CodAlf approach, using DCE/DC++ communication with a Petri-net-based workflow description language, towards distributed workflow management. The authors explained its decentralization as enabling a dynamic mapping of tasks onto execution instances, so that workflows are forwarded and synchronized in a completely decentralized way. The authors in [18] then discussed the information infrastructure for business collaboration and introduced a multi-agent system to enhance the flexibility of workflow execution in multi-enterprise collaboration via a service-oriented computing environment. Mittash et al. used a few groups of server components, instead of one server, to enable effective decentralized control of up to thousands of distributed workflows [19]. Javadi et al. [20] utilized the flexibility of the Object Modeling System (OMS) architecture to implement decentralized service orchestration.
Decentralized workflow approaches found in the literature mainly use a specific P2P network to enhance data sharing [21], or use federated engines or controllers to coordinate execution [13]. Most of this work appears to be independent of blockchain-based DApps. For example, both [22] and [23] applied a P2P network as an important part of their proposed decentralized workflow systems. SwinDeW [22] uses a P2P network for workflow management, but makes no explicit assumption about asset management or an infrastructure marketplace. The work in [23] presented a decentralized grid workflow management framework to support collaborative virtual enterprises through a P2P overlay network, ranging from grid middleware and P2P communication layers to agent and application layers.
In recent years, thanks to its decentralized nature as well as its immutability and traceability features, blockchain has been applied in healthcare workflows [13] and, in the context of the General Data Protection Regulation (GDPR), in cross-organizational workflow management [24]. Recent advances in Cloud computing, the Internet of Things (IoT), Artificial Intelligence, and big data greatly accelerate innovation in the digitisation of business applications towards the Next Generation Internet (NGI) [25] and of scientific research [26]. Blockchain technologies have demonstrated great potential for realizing trustworthiness (via immutable ledgers and consensus among peers) and fault tolerance (no single point of failure among decentralized nodes) in business applications, and have become a basis for developing Decentralized Applications (DApps) [27]. However, current blockchain technologies suffer from the high storage cost of ledgers, low collaboration efficiency among distributed peers during consensus, and insecure off-chain data sources for blockchain transactions [28]. Cloud environments provide not only elastic capacity but also customisable connectivity, often called virtual infrastructure, over a large-scale network [29].
The resilience of the Cloud has enabled significant advances in software-defined storage, networking, infrastructure, and other technologies, which promotes the emergence of heterogeneous programmable infrastructures across different Clouds and devices at the network edge (often called Edge or Fog) [30]. However, the rich programmability of infrastructure, in particular the advances in new hardware accelerators, still cannot be effectively included in the development and operations (DevOps) of Decentralized Applications.
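The immutability and traceability properties attributed to blockchain above can be illustrated with a toy hash-chained provenance log. This is a minimal single-process sketch, not a blockchain: there are no peers and no consensus, and the names `ProvenanceLedger`, `record_hash`, and the task labels are hypothetical and do not come from any of the cited systems.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry's predecessor


def record_hash(body: dict) -> str:
    """Hash a record deterministically (sorted keys give a stable encoding)."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class ProvenanceLedger:
    """Toy append-only log: each entry stores the hash of its predecessor."""

    def __init__(self):
        self.entries = []

    def append(self, task: str, output_digest: str) -> dict:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"task": task, "output": output_digest, "prev": prev}
        entry = dict(body, hash=record_hash(body))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check that the chain links are intact."""
        prev = GENESIS
        for entry in self.entries:
            body = {key: entry[key] for key in ("task", "output", "prev")}
            if entry["prev"] != prev or entry["hash"] != record_hash(body):
                return False
            prev = entry["hash"]
        return True


# Hypothetical two-step workflow run, followed by a tampering attempt.
ledger = ProvenanceLedger()
ledger.append("annotate", "digest-a")
ledger.append("train", "digest-b")
ok_before = ledger.verify()
ledger.entries[0]["output"] = "tampered"
ok_after = ledger.verify()
```

Changing any recorded field invalidates the stored hash of that entry, which is what makes later modification of the workflow history detectable.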

B. Decentralization aspects
Based on the overview of the state of the art, we can see that the decentralization of a workflow can occur at different levels. In this paper, we roughly summarize such decentralization in the following aspects, as shown in table I:
1) Decentralized management of workflow assets such as data, software components, and workflow descriptions. P2P-based sharing, blockchain-based marketplaces, and traceable evolution of the assets are typical paradigms. In such a decentralized ecosystem, workflow assets are shared and reused in a transparent and traceable way in multi-stakeholder, collaborative environments.
2) Decentralized workflow execution management, e.g. coordinating the runtime execution of workflow tasks in the decentralized infrastructure, for instance by transferring data among tasks using a P2P or information-centric networking paradigm [31], or by orchestrating workflow execution via a federation of engines (provided and operated by different organizations).
3) Decentralized infrastructure, e.g. provided by multiple providers via a decentralized market. During workflow development and execution, infrastructure quality is crucial for optimising the workflow logic and scheduling. Decentralized technologies like smart contracts show great potential for automating infrastructure service level agreements (SLAs) and establishing trustworthiness [15].
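The SLA automation mentioned under the third aspect can be sketched as a simple settlement rule. The code below is plain Python that only mimics the kind of logic a smart contract might encode; it is not contract code for any real blockchain platform, and the `SLA` fields, the `settle` function, and all thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLA:
    provider: str
    max_latency_ms: float    # agreed upper bound on observed task latency
    min_availability: float  # agreed fraction of successful probes, in [0, 1]
    penalty: float           # amount refunded to the customer on violation


def settle(sla: SLA, latencies_ms: list, probes_ok: int, probes_total: int) -> dict:
    """Compare monitoring data with the agreed terms and compute the refund."""
    if probes_total <= 0 or not latencies_ms:
        raise ValueError("no monitoring data to settle against")
    availability = probes_ok / probes_total
    worst_latency = max(latencies_ms)
    violated = (
        worst_latency > sla.max_latency_ms
        or availability < sla.min_availability
    )
    return {
        "provider": sla.provider,
        "violated": violated,
        "refund": sla.penalty if violated else 0.0,
    }


# Hypothetical agreement and two monitoring windows.
sla = SLA("provider-a", max_latency_ms=200.0, min_availability=0.99, penalty=10.0)
within_terms = settle(sla, [120.0, 180.0], probes_ok=100, probes_total=100)
breached = settle(sla, [120.0, 250.0], probes_ok=100, probes_total=100)
```

In a decentralized setting the open question is who supplies the monitoring data; the sketch simply takes it as trusted input.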

III. RESEARCH OPPORTUNITIES AND CHALLENGES
The decentralized paradigm brings several advantages to workflows, e.g. an open and transparent market enhanced by blockchain for sharing assets, a fault-tolerant decentralized network topology for sharing data and for workflow control, and trustworthy SLA management based on smart contracts. However, we have also seen the critical limits of current blockchain systems, including performance bottlenecks and scalability challenges. Several research opportunities and challenges can be identified. We discuss them using the medical workflow example from a recent EU project, CLARIFY (http://www.clarify-project.eu/), in which pathology data from different hospitals and individuals will be shared and processed to enable doctors to effectively diagnose diseases like cancer. A decentralized data fabric is proposed to enable distributed users to collaboratively process data, extract knowledge, and make decisions in HPC and Cloud environments. Analyzing the research problem in the context of the data management life cycle, we highlight a number of challenges:
1) Designing effective collaboration among distributed peers, including the consensus and incentive models for participants, and task coordination across players. For instance, when a medical workflow involves images from different hospitals and algorithms from different contributors, how can the workflow system engage them with an effective incentive model for sharing assets and receiving credit?
2) Building community standards and an interoperable reference model. Future decentralized ecosystems will host different workflows from many communities, e.g. one hospital may provide images and data for multiple workflows, and a cancer researcher may have to develop new knowledge based on different workflows, e.g. for learning patterns from pathology images and for discovering side effects of treatments. Interoperability among workflows is therefore crucial.
3) Workflow development and operation. For a complex decentralized workflow, it is expensive to restart the entire workflow, in particular during production. A solution for continuous testing, integration, deployment, monitoring, and adaptation is crucial for workflow management. We can clearly see such needs when a pipeline is deployed across several hospitals to automate the annotation, processing, and learning for pathology images. New learning components or smart contracts for accessing images might be updated; however, it is not possible to interrupt operations in the hospitals.
4) Optimising workflow performance when the complexity of the underlying network is very high, in particular given the high uncertainty about the performance of parts of the system. The high privacy sensitivity of medical data imposes additional constraints on workflow design and execution, which makes optimization for time-critical scenarios, e.g. real-time decision making, difficult.
5) Keeping data, performance, and service quality consistent across participants. Blockchains can already provide immutable transaction information for tracing changes to data and services; however, the actual content of the data (e.g. medical images) or the quality of a service (e.g. a deep learning service) still has to be managed off chain, due to its size or the limits of current technology. Decentralized workflows therefore also face the challenge of keeping on-chain and off-chain data consistent.
There are other challenges in energy efficiency and storage; we will discuss them extensively in another paper. These challenges open new research opportunities for the workflow community.
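The on-chain/off-chain consistency issue raised in the fifth challenge can be illustrated with a content-hash check. The following is a minimal sketch under strong assumptions: two in-memory dictionaries stand in for the ledger state and for the off-chain store, and the names `register`, `is_consistent`, and the asset identifier are hypothetical rather than part of CLARIFY or any cited system.

```python
import hashlib

on_chain = {}   # stand-in for ledger state: asset id -> SHA-256 digest
off_chain = {}  # stand-in for local storage: asset id -> raw bytes


def digest(content: bytes) -> str:
    """Return the hex SHA-256 digest used as the on-chain anchor."""
    return hashlib.sha256(content).hexdigest()


def register(asset_id: str, content: bytes) -> None:
    """Store the content off chain and anchor only its digest on chain."""
    off_chain[asset_id] = content
    on_chain[asset_id] = digest(content)


def is_consistent(asset_id: str) -> bool:
    """Check that the off-chain content still matches the anchored digest."""
    content = off_chain.get(asset_id)
    anchored = on_chain.get(asset_id)
    if content is None or anchored is None:
        return False
    return digest(content) == anchored


# Hypothetical pathology image, later modified outside the ledger.
register("image-001", b"original pixel data")
consistent_before = is_consistent("image-001")
off_chain["image-001"] = b"modified pixel data"
consistent_after = is_consistent("image-001")
```

A check of this kind detects divergence but does not repair it; deciding which copy is authoritative remains part of the open challenge.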

IV. SUMMARY
In this short paper, we reviewed the emerging trends in data-intensive workflow applications, in particular the potential challenges and opportunities enabled by the decentralized application paradigm. We have seen different practices in developing and using decentralized workflow management; however, applying it in real applications remains challenging. We identified a number of opportunities and challenges based on the use case of a recently funded project, CLARIFY. We did not present any concrete decentralized workflow solution in this paper. Building on this analysis of the research opportunities, developing a decentralized workflow management system for medical image processing will be our future work.