Data is one of the most strategic assets in any organization. It forms the base of digital transformation, letting organizations venture into new realms of capability such as predictive analytics, machine learning, and robotics. Despite all the hype and trends, many uses of data remain unexplored. McKinsey's 2015 global banking annual review, aptly titled "The Fight for the Customer," says that approximately 60% of banks have never quantified the potential value to be gained from investments in data migration tools and capabilities. With a tidal wave of new technologies washing up on the shore and organizations spending millions of dollars on data transformation, most of the solutions developed were ad hoc and therefore difficult to replicate across business units. A lack of vision, and of metrics based on project results for making decisions and building strategies, makes the transformation a painful, nagging process.

Challenges with Waterfall Model

The business needs analytics-based insights quickly, and an ever-changing marketplace means changing requirements. Data transformation therefore requires a development methodology robust enough to deliver fast results against shifting requirements. The traditional waterfall model is the earliest method of system development. Although it has been used extensively for decades, it comes with its own set of problems, being too rigid and unrealistic when it comes to meeting business requirements quickly and efficiently. The challenges are:

  • No clear big-picture understanding of the deliverables with respect to the outcome of the project.
  • The business is involved only at the beginning (requirements gathering) and towards the end of the project (user acceptance testing), giving it less visibility into the product being developed.
  • The team has fewer opportunities to gather acceptance feedback. Feedback may arrive too late, and incorporating it at a later stage can be expensive.
  • In the big data world, the requirements could be irrelevant by the end of the project, or even before it is complete. The model is not flexible enough for the changing requirements that often occur in the real world.
  • Given the uncertainty of customer requirements, estimating time and costs with any degree of accuracy is often extremely difficult.
  • Waterfall comes with the implicit assumption that designs can feasibly be translated into real products. However, the team can run into roadblocks once it starts to implement the design: designs that look feasible on paper can turn out to be expensive or difficult to build in the real world.

Defining Agile Data Transformation

Agile in big data projects offers a business-driven approach to digital transformation. With this approach, organizations create a master list of possible business use cases for advanced analytics, and data teams work through the list to identify data sources, architecture, quality, and governance. Combining the two yields cross-functional teams from business and IT that can design and build minimum viable products quickly, obtain customer acceptance, and enhance the products in quick iterations. With short sprint cycles, organizations get impactful, actionable information rather than struggling with misinterpreted requirements and low business adoption. Even so, agile data transformation can come as a culture shock to organizations.

Objections to Agile Big Data

  • How do you define a minimum viable product? - An MVP is the smallest deliverable that the business can evaluate and provide feedback on to the data team. An MVP need not be the ultimate visualization dashboard - it could be something interim, such as data formatted for analysis, a model assessment, a review of evaluation results, or a deployment to production.
  • How do you handle ingestion from multiple data sources? - Data preparation can take as much as 60% of the entire time. It is tedious and must be done meticulously: when the input data is wrong, the model's predictions can never be right. Multiple data sources can be treated as a single layer of data input. If the data from a source is large, that source could be its own MVP; smaller sources can be grouped together under one MVP.
  • How do you develop reports when the data model is not complete? - When changes are inevitable, the data model will be subject to change as well. Agile data management makes it easier to adapt to these changes.
  • What would you recommend for handling these changes? - It is best to invest in DevOps and in automated functional and regression testing to build a stable system that evolves with constant change.
  • Is co-location a requirement? - While it is ideal for everyone to be in the same conference room, this need not be achieved physically. Technologies ranging from video conferencing to a simple dial-in call can put the entire team in the same virtual room and produce the same effective result.
  • How can you complete an iteration in two weeks? - Agile recommends two- to three-week iterations, but for a complex data store or model, iterations can be extended to four or five weeks.
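The multi-source ingestion answer above can be sketched as a thin validation layer that sits between the sources and the model. The source names, fields, and rules below are hypothetical examples, not part of any real pipeline:

```python
# A minimal sketch of per-source data validation before model input.
# Sources, field names, and rules here are illustrative assumptions.

def validate_records(records, required_fields):
    """Split records into clean rows and rejects with a reason."""
    clean, rejects = [], []
    for row in records:
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            rejects.append((row, f"missing fields: {missing}"))
        else:
            clean.append(row)
    return clean, rejects

# Two hypothetical small sources grouped under one MVP's input layer.
crm_rows = [{"customer_id": "C1", "sale": 120.0},
            {"customer_id": None, "sale": 80.0}]
web_rows = [{"customer_id": "C2", "sale": 55.5}]

clean, rejects = validate_records(crm_rows + web_rows,
                                  required_fields=["customer_id", "sale"])
print(len(clean), len(rejects))  # 2 clean rows, 1 rejected
```

Because each source passes through the same checks, a large source can become its own MVP simply by running the layer on that source alone.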


The Cross Industry Standard Process for Data Mining (CRISP-DM), in practice since 1997, is a data mining process model and a robust, proven source of guidance. The framework is tool-agnostic and application/industry neutral. The main advantage of CRISP-DM is that it focuses on the business problem as well as the data analysis. We can bring agile and Scrum together with CRISP-DM to make sense of data transformation.

Business understanding (the project objectives and requirements from a business perspective) and data understanding (exploring the data and its quality), the first phases of CRISP-DM, form the basis of the product and sprint backlogs. The heart of data transformation - the data preparation phase (construction of the final data set from raw data) and the modeling phase (selection and application of the model) - along with interim deployments, could comprise the sprints. The evaluation phase of CRISP-DM could become the story or epic review within the agile framework. The whole process repeats until the model is perfected and the business accepts the results/product.
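The phase-to-sprint mapping above can be sketched as a loop: each sprint runs preparation, modeling, and evaluation, and the loop exits when the review accepts the result. The phase functions and the acceptance rule below are illustrative placeholders, not a real model:

```python
# A sketch of CRISP-DM phases driven as agile iterations.
# prepare_data, build_model, and the acceptance threshold are assumptions.

def prepare_data(raw):
    """Data preparation: drop missing values from the raw set."""
    return [x for x in raw if x is not None]

def build_model(data):
    """Modeling: a trivial stand-in model (the mean of the data)."""
    return sum(data) / len(data)

def evaluate(model, target=5.0):
    """Evaluation, i.e. the story/epic review: accept if close to target."""
    return abs(model - target) < 1.0

raw = [2.0, None, 6.0, 7.0]
for sprint in range(1, 4):            # repeat until the business accepts
    data = prepare_data(raw)
    model = build_model(data)
    if evaluate(model):
        print(f"accepted in sprint {sprint}")
        break
```

In a real project, each pass would also refine the backlog from the review's feedback before the next sprint begins.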


CRISP-DM and Agile for Data Transformation

Sprint Zero

As agile often comes as a culture shock to organizations comfortable with traditional waterfall, sprint zero can help ease the transition. Usually defined as "a project before the project" and scorned by agile purists, it need not be a bad thing. Here are a few items that could be part of sprint zero for the skeptics:

  • Meet with key stakeholders to get the vision and value for the data transformation.
  • Identify the team and roles - product owner, scrum master, data scientist, business analyst, SME, lead, and data developers.
  • Identify the sprint length that is suitable for the organization and the project.
  • Set up an initial product backlog.
  • Set up Development and QA environments.
  • Set up the baseline architecture, continuous integration architecture, and framework for deployments.
  • Set up project infrastructure and conventions, schedule project activities, and track and report progress.
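Part of standing up the QA environment above is the automated regression testing recommended earlier: capture a golden input/output pair from an accepted sprint and re-run it on every change. The `transform` step and its golden data below are hypothetical:

```python
# A minimal sketch of an automated regression check for a pipeline step.
# transform() and the golden data are illustrative assumptions.

def transform(rows):
    """Hypothetical step: normalize names and drop zero-value sales."""
    return [{"name": r["name"].strip().lower(), "sale": r["sale"]}
            for r in rows if r["sale"] > 0]

def test_transform_regression():
    # Golden input/output captured from a previously accepted sprint.
    golden_in = [{"name": " Alice ", "sale": 10.0},
                 {"name": "Bob", "sale": 0.0}]
    golden_out = [{"name": "alice", "sale": 10.0}]
    assert transform(golden_in) == golden_out

test_transform_regression()
print("regression check passed")
```

Wiring such checks into the continuous integration framework means every sprint's changes are verified against what the business already accepted.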

Sample Sprint Zero

Let us take an example product vision - Increase Productivity of the Sales Team - for an organization that wants to capitalize on data transformation. The organization is looking to increase average sale size, reduce churn rate, and grow its customer base. The user stories could be:

User Stories

Here is a sample release plan for the user stories:

Release Plan