Project

Conceptualization and objectives

DOGO4ML proposes a holistic end-to-end framework to develop, operate and govern ML-based software systems (MLSS) and their data. This framework revolves around the DevDataOps lifecycle, which unifies two software lifecycles: a DevOps lifecycle and a DataOps lifecycle. The DevOps cycle aims to transform the requirements of an MLSS into deployed code (Dev) and to get feedback as soon as possible from the end-users (Ops), which can be used to evolve the requirements (including those that apply to the ML models). The DataOps cycle supports the data management and analysis processes that characterize MLSS. The DataOps processes are interrelated with those in the Dev phase of the DevOps cycle, since they produce the ML models (created through several iterations of the DataOps lifecycle) to be embedded into the ML software components of the MLSS. Further, the DataOps cycle aims to get feedback from the data analysts to continuously improve the data management and analysis processes. A detailed explanation of both cycles follows.


The DevOps software cycle

DevOps is a software development and delivery process that produces software from its conceptualization onwards, incorporating the feedback provided by monitors once the software system runs in an operational environment. This feedback is then used to maintain and evolve the system. The specificity of MLSS requires continuous, context-aware delivery and feedback to adjust and refine their embedded ML components.

In the Dev phase, a typical requirements engineering process starts from the goals of the MLSS. This process applies both at the system level and at the ML component level. At the system level, requirements are elicited (from the stakeholders as well as from the feedback gathered in the Ops phase when the system is in operation, see below), including quality requirements specific to MLSS (e.g., trustworthiness, safety, evolvability and ethics), with the support of a requirement patterns catalogue to be built during the project, as we have already explored for the agile development of non-ML-based software systems [35]. At the level of the ML components, requirements also include quality requirements of ML models (e.g., data representativeness, model accuracy, training cost and prediction latency, model appropriateness). These requirements are key to identifying and processing the relevant data for ML model construction, validation and operation, and they are the input for the two main processes during the Dev phase: (i) the data engineering (DE) aspects (i.e., the data management and analysis processes needed to create, maintain and evolve the models embedded in the ML components), and (ii) the software cycles required to design, develop, test, deploy and operate the whole MLSS (including the integrated ML components).
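To illustrate how such quality requirements could be made explicit and machine-checkable, the following minimal Python sketch represents them as structured objects; the attribute names, metrics and thresholds are our own illustrative assumptions, not artifacts of the project.

    from dataclasses import dataclass

    # Hypothetical representation of an ML quality requirement; field names
    # and thresholds are illustrative, not project artifacts.
    @dataclass
    class MLQualityRequirement:
        name: str         # e.g., "model accuracy"
        metric: str       # indicator to be monitored
        threshold: float  # acceptance level agreed with stakeholders
        phase: str        # lifecycle phase where it is assessed

    requirements = [
        MLQualityRequirement("model accuracy", "f1_score", 0.90, "DataOps"),
        MLQualityRequirement("prediction latency", "p95_latency_ms", 50.0, "Ops"),
        MLQualityRequirement("data representativeness", "class_coverage", 0.95, "DataOps"),
    ]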
Continuing with the Dev phase, agile practices will be adopted to continuously deliver high-quality MLSS. As we have experienced with self-adaptive systems, reference architectures and best practices (e.g., iterative integration of ML models into the system), driven by the MLSS quality requirements, enable rapid MLSS deployment in small iterations. MLSS implementation: (i) embeds the ML models (provided by the data management and analysis processes in DataOps) into ML components, and (ii) integrates them into the whole MLSS, facilitating their continuous and rapid integration. Continuous and automated integration testing of those systems will reconcile the particularities of both types of components, ML and non-ML (e.g., in terms of uncertainty in the functional validation), as sketched below. Once validated, the MLSS will be deployed in its operational environment. In the general case, such deployment will be contextual, recognizing the importance of several context factors. For instance, in a system with high decisional capabilities such as a smart vehicle, the final ML model may depend on the type of user and their driving experience and preferences, or on the primary use of the vehicle (long-distance trips vs. urban mobility).
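As one illustration of how functional validation can accommodate ML uncertainty, the minimal pytest-style sketch below tests an ML component against a statistical threshold instead of exact expected outputs; the stub component, the validation data and the 0.90 threshold are hypothetical.

    # Integration-test sketch: under ML uncertainty, exact-output assertions
    # are replaced by threshold assertions over a held-out validation set.
    class StubMLComponent:
        """Stand-in for an ML component wrapping a learned model."""
        def predict(self, x):
            return x > 0.5  # trivial decision rule, for illustration only

    def evaluate_accuracy(component, validation_set):
        correct = sum(1 for x, y in validation_set if component.predict(x) == y)
        return correct / len(validation_set)

    def test_ml_component_meets_quality_requirement():
        validation_set = [(0.9, True), (0.1, False), (0.7, True), (0.2, False)]
        accuracy = evaluate_accuracy(StubMLComponent(), validation_set)
        # The threshold would come from the quality requirements elicited in Dev.
        assert accuracy >= 0.90, f"accuracy {accuracy:.2f} below required 0.90"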
During the Ops phase, the MLSS in production interacts with both the user and the environment. Considering again the smart vehicle example, the MLSS will receive input from the user (e.g., through voice) and a continuous stream of sensed data (e.g., data from the car itself, or from the surrounding environment, like people crossing the street). Through these interactions and inputs, the ML models deployed inside the system are able to make classifications (e.g., that object is a dog), predictions (e.g., there is an increased risk of accident), or recommendations (e.g., brake 20 meters earlier at traffic lights to extend the life of the car's brakes).
In addition, while the system is in execution, it generates runtime data, mainly in the form of measurements of the system behaviour (through monitors) and log files that contain the sequence of time-stamped interactions. This data is gathered by a module able to analyze it and assess a set of high-level indicators that may refer to MLSS quality requirements (e.g., runtime efficiency, trustworthiness of the system) or to other, more general aspects (e.g., users' ethical behaviour). The data gathering process relies on the development of an infrastructure to monitor key data related to those indicators as well as context-specific feedback data. In this respect, we plan to adapt our previous results on self-adaptive systems monitoring to the area of MLSS [49]. The gathered data is then prepared for visualization in the form of a dashboard, which we call the DevOps dashboard. This dashboard is an essential element of the DevOps lifecycle: it monitors high-level indicators and quality requirements, generates the feedback needed to influence the Dev phase, and closes the continuous DevOps loop. Feedback enables the evolution of the MLSS (including the ML components, by evolving the quality requirements of the ML models). The approach then starts over, and the Dev phase uses the feedback to evolve the MLSS. Notably, data gathered in operation may thus require revisiting the DataOps lifecycle in a subsequent Dev phase.
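A minimal sketch of this kind of Ops-phase monitoring follows, aggregating time-stamped log records into one high-level indicator for the DevOps dashboard; the record fields, the latency budget and the indicator definition are illustrative assumptions.

    # Sketch of Ops-phase monitoring: aggregating time-stamped log records
    # into a high-level indicator for the DevOps dashboard. Record fields,
    # the latency budget and the indicator itself are illustrative.
    log_records = [
        {"timestamp": "2024-05-01T10:00:00", "event": "prediction", "latency_ms": 42},
        {"timestamp": "2024-05-01T10:00:01", "event": "prediction", "latency_ms": 55},
        {"timestamp": "2024-05-01T10:00:02", "event": "prediction", "latency_ms": 38},
    ]

    def runtime_efficiency(records, budget_ms=50):
        """Share of predictions served within the latency budget."""
        latencies = [r["latency_ms"] for r in records if r["event"] == "prediction"]
        return sum(1 for l in latencies if l <= budget_ms) / len(latencies)

    print(runtime_efficiency(log_records))  # value fed to the DevOps dashboard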


The DataOps software cycle

DataOps defines the lifecycle of the data management and analysis processes, the characteristic DE-related aspects of MLSS. These processes are interdependent with the DevOps activities undertaken in the Dev phase, and it is thus one of our objectives to identify and operationalize such dependencies. Data management processes are responsible for ingesting, storing, processing and preparing data according to the gathered requirements. Data ready to be consumed is then the main driver enabling the data analysis processes. The complexity and iterative nature of these processes require their own software cycle, specific to data-related aspects. During the data phase, the data and analysis backbones are developed. The former is a system devoted to ingesting external data sources and to storing, processing and serving data in the form of data views (i.e., datasets generated from the wealth of ingested data, ready to be consumed). Data views are consumed by the analysis backbone, where the data analysis processes take place. The data backbone must be common to the whole organization, while each software project defines its own data analysis processes.
Data engineers are responsible for the creation, maintenance and evolution of the backbone that manages the organization's data assets. Each project then decides, during the requirements engineering process conducted in the DevOps Dev phase, the specific subset of data assets it requires. Importantly, project-specific data analysis processes must be aligned with the data backbone and, typically, a set of views is defined to serve them. However, nothing prevents the reuse of a view in several analytical processes of the same or different projects. Therefore, a relevant aspect of data backbone management is deciding on the right number of views to serve the data required by all the projects while, at the same time, minimizing maintenance effort through reuse and sharing, as the sketch below illustrates.
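To make the notion of a data view concrete, here is a minimal, self-contained sketch using SQLite; the source tables, columns and the join are hypothetical stand-ins for ingested data sources.

    # Sketch of a data view in the data backbone: a dataset derived from
    # ingested sources, ready to be consumed (and reused) by analysis
    # processes. All names and contents are illustrative.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE trips(vehicle_id INT, distance_km REAL, urban INT);
        CREATE TABLE drivers(vehicle_id INT, experience_years INT);
        INSERT INTO trips VALUES (1, 12.5, 1), (2, 340.0, 0);
        INSERT INTO drivers VALUES (1, 3), (2, 15);
    """)

    # One view can serve several analytical processes of the same or
    # different projects, reducing the backbone's maintenance effort.
    conn.execute("""
        CREATE VIEW urban_mobility AS
        SELECT t.vehicle_id, t.distance_km, d.experience_years
        FROM trips t JOIN drivers d ON t.vehicle_id = d.vehicle_id
        WHERE t.urban = 1
    """)
    print(conn.execute("SELECT * FROM urban_mobility").fetchall())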
Next, within the data phase, the data analysts conduct the data analysis, which includes data discovery (i.e., finding the relevant data assets in the data backbone and requesting the needed data views), feature engineering, data preparation and the model learning processes. The data phase thus spans both the data and analysis backbones, the latter being responsible for learning the models that will eventually be deployed in the Dev phase. Some works frame the data analysis processes in a lifecycle of their own (e.g., under the concept of MLOps). However, many authors argue that the complete data lifecycle (management and analysis) should be jointly governed under a single unified view. In DOGO4ML we follow this approach, and the tasks identified by MLOps are subsumed in our DataOps lifecycle.
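As a toy illustration of these analysis steps on top of a served data view, the following scikit-learn sketch prepares features and learns a model; the view rows, features and label are invented for the example.

    # Sketch of data preparation and model learning over rows obtained from
    # a data view. Features, labels and sizes are illustrative only.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X = [[12.5, 3], [340.0, 15], [8.0, 1], [250.0, 10]]  # distance_km, experience
    y = [1, 0, 1, 0]                                     # 1 = urban usage profile

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_train, y_train)
    print("validation accuracy:", model.score(X_test, y_test))
    # The learned model is what DataOps hands over to the Dev phase for
    # embedding into an ML component.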
The complexity of the data management and analysis processes requires dedicated data and model governance, embedded in a data governance subsystem that produces and gathers the metadata required to automate, trace, monitor and assess specific requirements for the data management and analysis backbones. The quality requirements of ML models elicited during the Dev phase must be monitored during the DataOps Ops (OpsDO) phase and visualized through the DataOps dashboard.
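The kind of metadata such a governance subsystem might gather can be sketched as a simple lineage record tracing a model back to the data views and sources behind it; all field names and values below are illustrative.

    # Sketch of a governance/lineage record linking a learned model to the
    # data views and sources that produced it. Fields are hypothetical.
    import json
    from datetime import datetime, timezone

    lineage_record = {
        "model_id": "urban-profile-v3",
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "data_views": ["urban_mobility"],         # views consumed for training
        "source_datasets": ["trips", "drivers"],  # ingested sources behind them
        "quality_indicators": {"accuracy": 0.92, "data_bias_score": 0.05},
    }
    print(json.dumps(lineage_record, indent=2))  # e.g., kept in a metadata store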
In the OpsDO phase, we distinguish two types of quality requirements of ML models: (i) indicators related to the learned models (generated during the data analysis processes, such as model appropriateness or training time), and (ii) indicators related to the data (generated during the data management processes, such as quantified data bias or query time when accessing the data views). The former indicators are interpreted and validated by domain experts assessing the quality of the current model, while the latter are interpreted and analyzed by the data and software engineers. This feedback is key to closing the loop with the data cycle. For example, the feedback obtained from monitoring the analysis backbone (e.g., an ML model's poor accuracy) may require considering features from another data view, learning new models or even ingesting a new external data source (e.g., deciding to invest in buying an external dataset) that, once ingested, stored and processed, will generate new data views or complement existing ones with new attributes, which in turn may yield new model features. In real projects, this loop usually requires many iterations that may overall last months before a model meeting the project requirements is learned. In parallel, the feedback obtained from monitoring the data backbone may result in decisions such as optimizing identified bottlenecks (e.g., within a database or in a data flow), adding new servers to the infrastructure or even changing the Cloud provider where the data backbone is hosted.
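A minimal sketch of how such OpsDO feedback could be operationalized follows: indicators are checked against thresholds derived from the quality requirements, and violations map to candidate feedback actions. The indicator names, thresholds and actions are our own illustrative choices.

    # Sketch of closing the DataOps loop: OpsDO indicators are checked
    # against thresholds from the quality requirements; violations suggest
    # feedback actions. All names, thresholds and actions are illustrative.
    import operator

    FEEDBACK_RULES = [
        # (indicator, comparison meaning "violated", threshold, suggested action)
        ("model_accuracy", operator.lt, 0.90,
         "revisit features or consider another data view"),
        ("data_bias_score", operator.gt, 0.10,
         "audit sources; possibly ingest a new external dataset"),
        ("view_query_time_ms", operator.gt, 200.0,
         "optimize the bottleneck or scale the infrastructure"),
    ]

    def feedback_actions(indicators):
        return [action for name, violated, threshold, action in FEEDBACK_RULES
                if violated(indicators[name], threshold)]

    print(feedback_actions({"model_accuracy": 0.85,
                            "data_bias_score": 0.04,
                            "view_query_time_ms": 320.0}))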
Finally, the resulting learned models are embedded into ML components and integrated into the MLSS during the Dev phase.


The holistic software cycle

While the DevOps and DataOps cycles raise significant challenges by themselves, the emerging grand challenge is their combination into an overarching cycle that smoothly integrates their different process elements (activities, roles, etc.) into a unique holistic process. We have already made a first approximation to this problem in the context of trustworthy autonomous systems. Overall, we envisage three major determinants.
Inter-dependency. Both lifecycles generate a number of inter-dependencies which, due to the iterative nature of the problem, are not easy to identify, formalize and generalize so as to guarantee adaptability to different scenarios.
Context-awareness. We do not aim at defining a universal holistic MLSS lifecycle. Instead, we recognize that different organizations, projects and teams respond to different context characteristics (e.g., data quality, available human skills, problem size), and that the MLSS lifecycle needs to be flexible enough to apply to all of them.
Systematization. To assist software and data engineers in customizing the lifecycle according to context, we propose a systematic, tool-supported, knowledge-based approach that assists them in: (i) defining parameterized process fragments (possibly inter-dependent with others) that describe activities that may take part in the holistic cycle; (ii) selecting the most appropriate process fragments in a particular context, respecting their inter-dependencies; and (iii) combining them into the holistic process.
Given these determinants, the project will use situational method engineering (SME) as the conceptual framework for defining MLSS lifecycles. In SME, we can define a library of process fragments ("chunks") classified according to context criteria. We will use our knowledge of context ontologies [8] to define the relevant context criteria in the scope of MLSS. SME supports the composition of such chunks (although the current state of the art does not handle the problem of inter-dependencies), as we have done in previous work (e.g., in the field of software evolution). Tool support will take the form of a web application that establishes a conversation with the engineers to carry out the context criteria elicitation, chunk selection and final composition, as sketched below.
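The following minimal sketch conveys the idea of a chunk library classified by context criteria together with a naive selection step; the criteria, chunks and matching rule are hypothetical simplifications, and it deliberately ignores the open problem of composing inter-dependent chunks.

    # Sketch of a chunk library in the spirit of SME: process fragments
    # tagged with context criteria, plus a naive selection step. Criteria,
    # chunks and the matching rule are illustrative simplifications.
    from dataclasses import dataclass, field

    @dataclass
    class ProcessChunk:
        name: str
        context: dict                                   # criteria under which it applies
        depends_on: list = field(default_factory=list)  # inter-dependencies

    LIBRARY = [
        ProcessChunk("lightweight data validation", {"data_quality": "high"}),
        ProcessChunk("extensive data profiling", {"data_quality": "low"}),
        ProcessChunk("automated model retraining", {"ml_skills": "high"},
                     depends_on=["extensive data profiling"]),
    ]

    def select_chunks(project_context):
        """Pick chunks whose context criteria all match the project context."""
        return [c for c in LIBRARY
                if all(project_context.get(k) == v for k, v in c.context.items())]

    print([c.name for c in
           select_chunks({"data_quality": "low", "ml_skills": "high"})])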


Project objectives

Based on this vision, we break down the goal of the project into the following general objectives:

[GO1] Specify, design and implement a holistic and configurable end-to-end lifecycle for MLSS aligning software engineering (SE) and data engineering (DE) development and operational processes.
[GO2] Specify, design and implement the data-driven Dev phase for MLSS considering quality requirements and architectural aspects.
[GO3] Specify, design and implement the Ops phase increasing users' trust in MLSS by transparently monitoring quality requirements in near real-time.
[GO4] Specify, design, implement and govern the data management and analysis processes for MLSS in the form of a DataOps lifecycle.