
The Role of Data in Credit Modelling

  • Writer: Nathan Porteous
  • Jun 24
  • 4 min read

Updated: Jun 25

The hottest topic in credit risk modelling is often which algorithm to use. Logistic regression and linear regression are valued for their simplicity and transparency, while more sophisticated machine learning models, like random forests and gradient boosting machines, hold more predictive power but are harder to interpret. Yet while the choice of algorithm matters, it's not where the biggest improvements in performance come from.

 

The Real Progress Is in the Data

The biggest improvements in model performance come from the data itself: specifically, how we use it and how we bring different sources together to reflect real-world behaviour.

Lenders now have access to far more than traditional application and credit bureau data. Open banking, transactional patterns, behavioural trends, internal Customer Relationship Management (CRM) systems and even unstructured notes from customer conversations all hold valuable signals. The next challenge is not collecting more data; it’s integrating what we already have.

The data scientists making the biggest impact aren’t just tuning models. They’re connecting messy datasets, aligning definitions, resolving inconsistencies, and engineering features that reflect how customers actually behave. Often, that means aggregating transactions into meaningful summaries, building variables that represent business logic, or cleaning up records so that models don’t learn from noise. These steps can drive far greater gains in performance than tweaking an algorithm ever will.
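As a concrete illustration of the kind of aggregation described above, here is a minimal sketch of rolling raw transactions up into per-customer summary features. The field names (customer_id, amount) are illustrative assumptions, not any specific system's schema.

```python
# Aggregate a flat list of transactions into model-ready summary
# features per customer. Field names here are illustrative.
from collections import defaultdict

def summarise_transactions(transactions):
    """Roll up raw transactions into per-customer features."""
    by_customer = defaultdict(list)
    for txn in transactions:
        by_customer[txn["customer_id"]].append(txn)

    features = {}
    for cust, txns in by_customer.items():
        amounts = [t["amount"] for t in txns]
        credits = [a for a in amounts if a > 0]
        debits = [-a for a in amounts if a < 0]
        features[cust] = {
            "txn_count": len(txns),
            "total_inflow": sum(credits),
            "total_outflow": sum(debits),
            "avg_txn": sum(amounts) / len(amounts),
        }
    return features

txns = [
    {"customer_id": "A1", "amount": 1500.0},  # salary credit
    {"customer_id": "A1", "amount": -300.0},  # rent debit
    {"customer_id": "A1", "amount": -45.0},   # grocery debit
]
print(summarise_transactions(txns))
```

In practice these summaries would be computed over rolling windows (e.g. last 3, 6, 12 months) and joined back to the application record, but the principle is the same: the model sees meaningful behaviour, not raw noise.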

 

The Importance of Collaboration

One of the most underrated aspects of credit modelling is communication. The best models aren’t built in isolation; they’re made by talking to underwriters, front line teams, and the people who use them day to day. These conversations help clarify what variables mean in practice, how they’re used in decisions, and what really drives credit risk.

 

Without that input, it’s easy to build a model that’s technically sound but misaligned with the business. It might rely on unstable or unnecessary data, miss important signals, or use variables in ways that don’t make sense to decision makers. Good data scientists help bridge that gap. They make sure features are grounded in real world understanding, and that models genuinely support business needs.

 

Start with the Data, Not the Model

It’s easy to get caught up chasing marginal gains through model tuning or trying out the latest algorithm. But often, those changes move the needle by just a few percentage points.

In contrast, fixing your data by cleaning it, enriching it, and engineering better features can deliver dramatic improvements in model performance, often yielding a far greater gain in ROC AUC than tuning model hyperparameters alone.
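For readers less familiar with the metric mentioned above, ROC AUC has a simple interpretation: the probability that a randomly chosen defaulter is scored higher than a randomly chosen non-defaulter (ties count as half). A from-scratch sketch, with toy labels and scores:

```python
# ROC AUC computed directly from its pairwise-ranking definition:
# the share of (positive, negative) pairs the model ranks correctly.
def roc_auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]          # 1 = defaulted
scores = [0.9, 0.6, 0.7, 0.3, 0.2]
print(roc_auc(labels, scores))    # 5 of 6 pairs ranked correctly
```

A few percentage points on this scale can sound small, which is exactly why a data fix that moves it substantially matters more than yet another round of hyperparameter tuning.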

 

What Lenders Should Do Next

Don’t Just Rely on Credit Bureaus

Traditional bureau data is very useful but limited: it often provides a slightly delayed view of a customer and can miss key segments (e.g. thin-file or new borrowers). Open banking data, on the other hand, gives a real-time view of income, spending habits, and account activity. Lenders should prioritise integrating open banking data into their models and engineering powerful features, like income volatility and discretionary spending patterns, to improve model performance.
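An income-volatility feature of the kind mentioned above can be as simple as the coefficient of variation of monthly income credits. A minimal sketch, assuming income has already been extracted from open banking transactions (the data shape is an assumption, not any provider's schema):

```python
# Income volatility as the coefficient of variation of monthly income:
# 0 means perfectly stable income; higher values mean more variability.
from statistics import mean, pstdev

def income_volatility(monthly_income):
    """Std dev of monthly income relative to its mean."""
    avg = mean(monthly_income)
    if avg == 0:
        return float("inf")   # no income observed at all
    return pstdev(monthly_income) / avg

stable = [2500, 2500, 2500, 2500]     # salaried customer
variable = [3200, 900, 2100, 4500]    # gig-economy income
print(income_volatility(stable), income_volatility(variable))
```

The same pattern extends to discretionary spending: classify transactions into essential vs discretionary categories, then track the discretionary share over time.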

 

Capture First-Party Behavioural Data

Application forms, call centre notes, CRM data, and product usage logs all provide signals that bureau files can’t. For example, missed direct debits on other products, frequency of service interactions, or how quickly applicants respond to document requests can all be predictive of future risk. These sources are often underused, but hold real value when cleaned and engineered properly.
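The examples above translate naturally into engineered features. A hypothetical sketch over a CRM-style event log, where the event names and fields are illustrative assumptions:

```python
# Derive simple first-party risk signals from a customer's event log.
# Event types here (missed_direct_debit, docs_requested, ...) are
# illustrative, not a real CRM schema.
from datetime import datetime

def behavioural_features(events):
    missed_dd = sum(1 for e in events if e["type"] == "missed_direct_debit")
    contacts = sum(1 for e in events if e["type"] == "service_contact")

    # Days taken to return requested documents (None if never returned).
    requested = next((e for e in events if e["type"] == "docs_requested"), None)
    returned = next((e for e in events if e["type"] == "docs_returned"), None)
    days_to_docs = None
    if requested and returned:
        days_to_docs = (returned["date"] - requested["date"]).days

    return {
        "missed_direct_debits": missed_dd,
        "service_contacts": contacts,
        "days_to_return_docs": days_to_docs,
    }

events = [
    {"type": "docs_requested", "date": datetime(2024, 3, 1)},
    {"type": "docs_returned", "date": datetime(2024, 3, 8)},
    {"type": "missed_direct_debit", "date": datetime(2024, 4, 2)},
]
print(behavioural_features(events))
```

None of this requires new data collection; it only requires treating operational records as model inputs rather than exhaust.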

 

Use LLMs and AI for Conversational Data

Call notes, emails, and chat logs often contain early warning signs (e.g. complaints and income issues), but they’re hard to use at scale. LLMs can now scan and categorise these conversations automatically, flagging risk-relevant events like financial hardship mentions, broken arrangements, or repeated contact. This allows lenders to turn unstructured comms into structured risk signals that models can learn from, bringing previously untapped data into the credit decision process.
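The pattern described above boils down to two pieces: a constrained prompt that forces the model to pick from a fixed set of risk categories, and a parser that validates the reply before it becomes a model input. A sketch, where `call_llm` is a hypothetical stand-in for whichever model API you use (the category list is also illustrative):

```python
# Turn a free-text call note into a structured risk label via an LLM.
# `call_llm` is a hypothetical placeholder for your model API; only the
# prompt construction and response validation are shown concretely.
import json

CATEGORIES = ["financial_hardship", "broken_arrangement", "complaint", "none"]

def build_prompt(note):
    return (
        "Classify this customer call note into exactly one of "
        f"{CATEGORIES}. Reply as JSON: {{\"category\": \"...\"}}.\n\n"
        f"Note: {note}"
    )

def parse_response(raw):
    """Validate the model's JSON reply; fall back to 'none' if malformed."""
    try:
        category = json.loads(raw).get("category")
    except json.JSONDecodeError:
        return "none"
    return category if category in CATEGORIES else "none"

# raw = call_llm(build_prompt("Customer says they lost their job..."))
# Demonstrating the validation step on a model-style reply:
print(parse_response('{"category": "financial_hardship"}'))
```

The defensive parsing matters: a label the model invents, or a malformed reply, should never silently flow into a credit model as a feature.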

 

Strengthen Cross-Team Communication

Involve underwriters, product managers, and credit policy owners early. For example, if a model is using “number of accounts opened in the last 3 months,” ask: is that a proxy for fraud, opportunism, or normal customer onboarding? This context shapes whether a feature should be included, excluded, or transformed. The best models reflect not just what’s predictive, but what’s usable in decisions. The best way to achieve this is to ensure that model owners and data scientists have a clear understanding of what each feature is telling us, through collaboration with teams such as data engineering, operations, and product owners.

 

Prioritise Data Monitoring

As alternative data grows, so does the risk of drift. Lenders need granular, ongoing monitoring of their data inputs: for example, tracking changes in open banking connection success rates, or distribution shifts in derived income features via the population stability index (PSI).
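The PSI mentioned above compares a feature's current distribution against its development baseline across pre-defined bins. A minimal sketch:

```python
# Population stability index over binned feature distributions:
# PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%).
import math

def psi(expected_pct, actual_pct, eps=1e-6):
    total = 0.0
    for e, a in zip(expected_pct, actual_pct):
        e, a = max(e, eps), max(a, eps)   # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # bin shares at model build
current = [0.30, 0.27, 0.23, 0.20]    # bin shares this month
print(round(psi(baseline, current), 4))
```

A common rule of thumb (a convention, not a regulation) treats PSI below 0.1 as stable, 0.1 to 0.25 as worth investigating, and above 0.25 as a significant shift that may warrant feature review or redevelopment.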

 

Build for Explainability

Even complex, high-feature-count models need to be explainable. Use tools like SHAP early in the development process to spot misleading variables or features that can’t be easily communicated to the business or the regulator.
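For intuition on what SHAP-style tools compute: each feature gets a Shapley value, its average contribution to one prediction across all possible feature coalitions, with "missing" features set to a baseline. Libraries like shap compute these efficiently for large models; as a hedged illustration, here is an exact brute-force calculation for a tiny scoring function (feasible only for a handful of features):

```python
# Exact Shapley values for one prediction, by enumerating all feature
# coalitions. "Absent" features take their baseline value. This is the
# quantity SHAP libraries approximate efficiently at scale.
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    n = len(x)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Standard Shapley coalition weight: |S|! (n-|S|-1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi += weight * (predict(with_i) - predict(without_i))
        values.append(phi)
    return values

# Toy linear score: for linear models, Shapley values reduce to
# coefficient * (value - baseline), so the output is easy to sanity-check.
score = lambda f: 0.4 * f[0] + 0.2 * f[1] - 0.1 * f[2]
print(shapley_values(score, x=[2.0, 5.0, 1.0], baseline=[0.0, 0.0, 0.0]))
```

Running this kind of attribution early surfaces features whose contributions no underwriter could defend, before they are baked into a production scorecard.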

 

Final Thoughts

Predicting credit risk is at the heart of any lending business. And while models continue to evolve, the most meaningful improvements now come from getting the data right. That means pulling in the right sources and creating features that reflect real behaviour, not just what’s easy to model. It also means working closely with the people who use these models every day to make sure the outputs are trusted, explainable, and aligned with how lending decisions are actually made.

 

The most successful lenders of the future won’t necessarily be the ones with the most advanced algorithms; they’ll be the ones who know how to turn messy, fragmented data into structured, actionable insight.

 