Decision-makers at the Inter-American Development Bank were looking to create an effective way to mitigate risks in their infrastructure investment portfolio. To achieve this, the bank partnered with Data Society to create an innovative, machine-learning-based model that would enable decision-makers to proactively identify projects with significant risk factors and take preemptive actions to save time and budget and to secure project success.
Historically, conventional statistical models have afforded bankers risk-related insights, but couldn’t predict the success and failure of infrastructure projects overall. Studies of infrastructure projects throughout the world find that 9 out of 10 experienced cost overruns, which vary by sector and average between 20% and 45% of baseline costs (Flyvbjerg, 2007). The Inter-American Development Bank (IDB) wanted to test its hypothesis that subtle factors can impact the delivery and budget of an infrastructure effort.
To make the analysis easy for a non-technical audience to digest, Data Society developed an easy-to-use application that lets users browse IDB contracts and investigate those that present the most risk.
Open-source data enabled Data Society to most efficiently scale the modeling framework. The Open Contracting Partnership - a consortium of hundreds of stakeholders across government, business, and civil societies with support from The World Bank - developed the Open Contracting Data Standard (OCDS), an internationally accepted set of guidelines for contracting data, to provide a common format for the publication of data about the ~USD 9.5 trillion in annual government contracts awarded globally.
Data Society leveraged the OCDS’s standard data model, which reflects the way governments structure and award contracts. The Item Classification Scheme (ICS) within the OCDS includes various fields that describe the specific items included in a procurement. This classification assigns each item a unique ID, which gives contract risk analysis a consistent system for categorizing items. The application allows users to browse IDB contracts and investigate ones that are flagged as high risk, as well as explore the patterns in the raw data.
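To illustrate how item classification supports this kind of categorization, here is a minimal Python sketch that counts the items in one release by classification ID. It assumes items sit under `tender.items` and carry a `classification` object with `scheme` and `id` fields, as in the published OCDS release schema. The release itself, the `EXAMPLE` scheme name, the codes, and the `classification_counts` helper are all invented for illustration and are not taken from IDB data.

```python
import json
from collections import Counter

# Hypothetical release: the ocid, scheme name, and codes are invented.
release_json = """
{
  "ocid": "ocds-example-0001",
  "tender": {
    "items": [
      {"id": "1", "description": "Road resurfacing",
       "classification": {"scheme": "EXAMPLE", "id": "A-100", "description": "Road works"}},
      {"id": "2", "description": "Bridge inspection",
       "classification": {"scheme": "EXAMPLE", "id": "B-200", "description": "Bridge works"}},
      {"id": "3", "description": "Road signage",
       "classification": {"scheme": "EXAMPLE", "id": "A-100", "description": "Road works"}}
    ]
  }
}
"""


def classification_counts(release):
    """Count tender items per (scheme, id) classification pair."""
    items = release.get("tender", {}).get("items", [])
    return Counter(
        (item["classification"]["scheme"], item["classification"]["id"])
        for item in items
        if "classification" in item  # skip items published without a classification
    )


release = json.loads(release_json)
counts = classification_counts(release)
print(counts)
```

Because every publisher following the standard uses the same field layout, a helper like this could in principle run unchanged across data from different countries.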
When first examining IDB’s infrastructure project contracts, Data Society recognized a class imbalance that had to be corrected before the solution could produce accurate forecasts. Roughly 80-85% of the contracts in the data set were unmodified and only 15-20% were modified. As a result, early models achieved high accuracy simply by predicting, in most cases, that a contract would not be modified. Data Society therefore trained the model further on the few contracts that did have modifications. The team used the ROSE (Random Over-Sampling Examples) and SMOTE (Synthetic Minority Over-sampling Technique) oversampling techniques to raise the share of modified contracts to ~30-40%. Data Society then used the re-balanced data to develop the final model.
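The rebalancing step can be sketched in a few lines of Python. This is a simplified, standard-library illustration of the SMOTE idea only (ROSE is not shown), not Data Society's actual implementation. The contract counts (850 unmodified, 150 modified), the two numeric features, the 35% target share, and the helper names `samples_needed` and `smote` are all assumptions chosen to match the proportions described above.

```python
import math
import random


def samples_needed(n_minority, n_majority, target_share):
    """Synthetic minority rows needed so the minority class reaches target_share."""
    total = n_minority + n_majority
    return math.ceil((target_share * total - n_minority) / (1 - target_share))


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random position on the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical toy data: 850 unmodified and 150 modified contracts,
# each described by two made-up numeric features.
rng = random.Random(42)
unmodified = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(850)]
modified = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(150)]

# A model that always predicts "unmodified" already scores this accuracy.
baseline_accuracy = len(unmodified) / (len(unmodified) + len(modified))

n_new = samples_needed(len(modified), len(unmodified), target_share=0.35)
synthetic = smote(modified, n_new)
balanced_modified = modified + synthetic
new_share = len(balanced_modified) / (len(balanced_modified) + len(unmodified))
print(f"baseline accuracy: {baseline_accuracy:.2f}, modified share: {new_share:.2f}")
```

The baseline figure shows why raw accuracy is misleading on data like this: a model that never flags a modification still looks strong, which is the behavior the oversampling is meant to counter.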
After testing numerous machine learning techniques, Data Society selected a gradient boosting model optimized via cross-validation. The algorithm builds an ensemble of weak learners sequentially, each one correcting the errors of those before it, to obtain a strong learner. Data Society then applied cross-validation to minimize the effects of randomness. This gave analysts and IDB stakeholders a more accurate picture of how well the model captured the real-world phenomena driving the outcomes of infrastructure projects, which was especially useful given the imbalanced data sets on which the model was built.
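As a rough illustration of the two ideas in this paragraph, the following standard-library Python sketch boosts decision stumps on the log-loss gradient and scores the result with stratified k-fold cross-validation. It is a teaching sketch under stated assumptions, not the model Data Society built: the toy data, the 20 rounds, the 0.3 learning rate, the 4 folds, and the choice of recall on the modified class as the metric are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Weak learner: the one-feature threshold split that best fits the residuals."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), 0, float("inf"), mean, mean)  # fallback: no split
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:]


def fit_boosting(X, y, rounds=20, lr=0.3):
    """Build stumps sequentially, each fit to the current log-loss residuals."""
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1 - p0))  # start from the class prior as log-odds
    scores, stumps = [f0] * len(X), []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        scores = [s + lr * (lm if row[j] <= t else rm) for row, s in zip(X, scores)]
    return f0, lr, stumps


def predict(model, row):
    f0, lr, stumps = model
    score = f0 + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)
    return 1 if score > 0 else 0


def stratified_folds(y, k, rng):
    """Deal each class round-robin into k folds so every fold keeps the class mix."""
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validate(X, y, k=4, seed=0):
    """Return recall on the positive (modified) class for each held-out fold."""
    recalls = []
    for held_out in stratified_folds(y, k, random.Random(seed)):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit_boosting([X[i] for i in train], [y[i] for i in train])
        positives = [i for i in held_out if y[i] == 1]
        recalls.append(sum(predict(model, X[i]) for i in positives) / len(positives))
    return recalls


# Hypothetical toy data: 84 unmodified (0) and 36 modified (1) contracts.
rng = random.Random(7)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(84)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(36)]
y = [0] * 84 + [1] * 36

fold_recalls = cross_validate(X, y)
print("recall per fold:", [round(r, 2) for r in fold_recalls])
```

Reporting a score per fold, rather than a single number from one train/test split, is what reduces the influence of a lucky or unlucky split. Stratifying the folds keeps the scarce modified contracts represented in every one.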
Data Society created a scalable machine-learning-driven modeling framework to identify projects with higher-than-average risk levels. What made the approach scalable was the use of the Open Contracting Data Standard (OCDS) when pulling the data.
Data Society created a comprehensive database and data schema to gather the data necessary to operate at a larger scale. For the IDB and its client countries, adherence to a data standard is an imperative policy and data governance prerequisite. Machine learning applied to risk management and contract structuring is most useful when it starts from a comprehensive data set.
To deploy at scale, a standardized pipeline was designed for the extract, transform, load (ETL) process, which proceeds in four steps:
Ultimately, by applying this consistent data governance framework and reporting format and standardizing processes across the 26 countries it serves, the IDB can leverage machine learning to automatically flag infrastructure projects that are at risk of not meeting expectations.