Quantitative Finance · Data Science · Machine Learning

Machine Learning Allocation Project

This project is the machine-learning layer of the portfolio backtesting and wealth-tracking stack: it is used to test signals and compare their impact on portfolio performance. The repository structures a full pipeline, from market-data preparation to the production of test metrics, made reproducible through dedicated scripts and a run_all.py orchestrator. The approach combines classic baselines (Equal Weight and Markowitz Minimum Variance) with supervised models (Random Forest and Logistic Regression) that predict the direction of daily returns; the predicted probabilities are then converted into portfolio weights.

Python · Pandas · NumPy · scikit-learn · Matplotlib · Seaborn

What this project demonstrates

Ability to turn a quantitative topic into a usable ML pipeline: feature engineering, variable selection, multi-model training, comparison against financial baselines, and clear reporting of results.

My role

I contributed to project structuring, to the integration of strategy-comparison modules, and to the design of an end-to-end flow covering data preparation, training, backtesting, and reporting.

Context

The objective was to go beyond an exploratory notebook and produce a rigorous, collaborative working base with modular code, dedicated scripts, exported metrics, and directly usable comparison charts.

Objective

Assess the contribution of supervised models to portfolio allocation in a structured way while preserving a robust financial reference through Equal Weight and Markowitz baselines.

Deep dive

Technical reading of the project

The project follows a full quant and machine-learning workflow: price ingestion, technical-feature construction, directional model training by asset, portfolio backtest, and systematic comparison against financial baselines.

  • Data preparation and exploratory analysis through normalized prices, return distributions, correlations, and rolling volatility.
  • Financial baselines: Equal Weight Buy and Hold and Markowitz Minimum Variance.
  • Supervised modeling by asset through Random Forest and Logistic Regression.
  • ML strategy backtests on the test period with portfolio weights derived from probabilities.
  • Export of metrics and figures for comparative reading.

Gallery

Key screens and visualizations

Machine-learning pipeline used to compare portfolio strategies

Summary view of the ML pipeline and the equity curves of the compared strategies.

Random Forest compared with baselines on the test period

Visual comparison of the Random Forest model against the financial baselines.

Architecture

Technical organization

Data layer

Modules such as src/data/load_data.py and src/data/preprocess.py load prices, compute returns, and create a chronological train/test split.
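As an illustration of what this layer does, here is a minimal sketch of the return computation and the chronological split. The function names, the 80/20 split fraction, and the synthetic prices are assumptions made for the example; they are not taken from the repository.

```python
import numpy as np
import pandas as pd


def compute_daily_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Simple daily returns, one column per ticker; the first (undefined) row is dropped."""
    return prices.sort_index().pct_change(fill_method=None).dropna(how="all")


def chronological_split(frame: pd.DataFrame, train_fraction: float = 0.8):
    """Split on time order: the earliest rows go to train, the remainder to test."""
    cut = int(len(frame) * train_fraction)
    return frame.iloc[:cut], frame.iloc[cut:]


# Synthetic prices for illustration only (hypothetical tickers, not project data).
dates = pd.bdate_range("2022-01-03", periods=10)
prices = pd.DataFrame(
    {"AAA": np.linspace(100.0, 109.0, 10), "BBB": np.linspace(50.0, 45.5, 10)},
    index=dates,
)
returns = compute_daily_returns(prices)
train, test = chronological_split(returns, train_fraction=0.8)
```

No shuffling takes place, so every test observation is dated after every training observation.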

Feature engineering

Technical indicators are built in src/features/technical_indicators.py and ANOVA selection is applied through src/features/feature_selection.py to keep the most relevant variables.
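The page does not list which indicators the project computes, so the sketch below uses a few common ones (one-day return, momentum, rolling volatility, moving-average ratio) purely as an assumption. The function name build_features, the 5-day window, and the target_up column are hypothetical. The point it illustrates is that features use information available at day t while the target is the direction of the next day's return.

```python
import numpy as np
import pandas as pd


def build_features(close: pd.Series, window: int = 5) -> pd.DataFrame:
    """Illustrative indicators for one asset, plus a next-day direction target."""
    ret = close.pct_change(fill_method=None)
    features = pd.DataFrame(
        {
            "ret_1d": ret,
            "momentum": close / close.shift(window) - 1.0,
            "volatility": ret.rolling(window).std(),
            "ma_ratio": close / close.rolling(window).mean() - 1.0,
        }
    )
    # Target: 1 if the NEXT day's return is positive, else 0.
    next_ret = ret.shift(-1)
    features["target_up"] = (next_ret > 0).astype(int)
    # Drop warm-up rows and the last row, whose next-day return is unknown.
    return features[next_ret.notna()].dropna()


# Synthetic random-walk prices for illustration only.
rng = np.random.default_rng(0)
close = pd.Series(
    100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 30))),
    index=pd.bdate_range("2022-01-03", periods=30),
)
dataset = build_features(close)
```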

Models and strategies

Financial baselines live in src/baselines while ML models such as Random Forest and Logistic Regression are implemented with the backtest logic in the model layer.
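For reference, the two baselines can be sketched in a few lines. This is a textbook formulation, not necessarily the one in src/baselines: the minimum-variance weights below use the unconstrained closed form w = C^-1 1 / (1' C^-1 1), which can produce negative (short) weights; a long-only version would need a constrained optimizer. Function names and the example covariance matrix are assumptions.

```python
import numpy as np


def equal_weight(n_assets: int) -> np.ndarray:
    """Equal Weight baseline: 1/N in every asset."""
    return np.full(n_assets, 1.0 / n_assets)


def min_variance_weights(cov: np.ndarray) -> np.ndarray:
    """Unconstrained minimum-variance weights that sum to 1 (shorts allowed)."""
    ones = np.ones(cov.shape[0])
    raw = np.linalg.solve(cov, ones)  # solves C x = 1 without forming the inverse
    return raw / raw.sum()


# Two uncorrelated assets with variances 0.04 and 0.01 (illustrative values).
cov = np.array([[0.04, 0.0], [0.0, 0.01]])
w_eq = equal_weight(2)
w_mv = min_variance_weights(cov)
```

With uncorrelated assets the weights are inversely proportional to variance, so the less volatile asset receives 80% of the portfolio in this example.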

Orchestration and reporting

Scripts such as scripts/run_*.py and run_all.py execute the full pipeline, produce metric tables, and generate equity-curve comparisons.
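One plausible shape for such an orchestrator is a loop that runs each stage script as a subprocess and stops at the first failure. The script names are the ones listed on this page; their location under scripts/, their execution order, and the helper names build_commands and run_all are assumptions for this sketch.

```python
import subprocess
import sys
from pathlib import Path

# Stage scripts in an assumed execution order.
STAGES = [
    "run_prepare.py",
    "run_baselines.py",
    "run_random_forest.py",
    "run_logistic_regression.py",
]


def build_commands(scripts_dir: str = "scripts") -> list[list[str]]:
    """One command per stage, using the current Python interpreter."""
    return [[sys.executable, str(Path(scripts_dir) / name)] for name in STAGES]


def run_all(scripts_dir: str = "scripts") -> None:
    """Run every stage in order."""
    for command in build_commands(scripts_dir):
        # check=True raises CalledProcessError, stopping at the first failing stage.
        subprocess.run(command, check=True)
```

Calling run_all() from the repository root would then execute the four stages in sequence.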

Pipeline

Data flow

  1. Load price series and compute daily returns.
  2. Create technical features and save the consolidated dataset.
  3. Split train and test chronologically and train the models asset by asset.
  4. Project signals or probabilities into portfolio weights over the test period.
  5. Compute equity curves and cross-strategy comparison metrics.
  6. Export results to reports/tables and reports/figures for decision-oriented analysis.
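The equity-curve step can be sketched as follows, given a table of daily weights and a table of daily returns. The one-day lag between the weight decision and the return it earns is an assumption of this sketch (it avoids using same-day information), as are the function name and the synthetic numbers.

```python
import numpy as np
import pandas as pd


def equity_curve(weights: pd.DataFrame, returns: pd.DataFrame, initial: float = 1.0) -> pd.Series:
    """Cumulative portfolio value; weights set on day t earn the returns of day t+1."""
    held = weights.shift(1).fillna(0.0)  # no position on the first day
    portfolio_returns = (held * returns).sum(axis=1)
    return initial * (1.0 + portfolio_returns).cumprod()


# Illustrative data: two hypothetical tickers over three business days.
dates = pd.bdate_range("2022-01-03", periods=3)
returns = pd.DataFrame({"AAA": [0.01, 0.02, -0.01], "BBB": [0.00, 0.01, 0.03]}, index=dates)
weights = pd.DataFrame(0.5, index=dates, columns=returns.columns)
equity = equity_curve(weights, returns)
```

The resulting series can be written to disk with pandas to_csv and plotted with Matplotlib for the exports in the last step.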

Technical choices

Structuring decisions

Systematic comparison with baselines

Every ML strategy is evaluated against Equal Weight and Markowitz to keep a stable financial benchmark rather than relying only on a classification score.
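To make the idea of homogeneous portfolio-level metrics concrete, here is a small sketch that applies the same metric function to each strategy's daily returns. The metric set (annualized return, annualized volatility, Sharpe ratio, maximum drawdown), the 252-day convention, and the zero risk-free rate are assumptions; the page does not state which metrics the project exports.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed annualization convention


def performance_metrics(daily_returns: pd.Series) -> dict:
    """Portfolio-level metrics computed the same way for every strategy."""
    equity = (1.0 + daily_returns).cumprod()
    years = len(daily_returns) / TRADING_DAYS
    drawdown = equity / equity.cummax() - 1.0
    return {
        "annual_return": equity.iloc[-1] ** (1.0 / years) - 1.0,
        "annual_volatility": daily_returns.std() * np.sqrt(TRADING_DAYS),
        # Risk-free rate assumed to be zero.
        "sharpe": daily_returns.mean() / daily_returns.std() * np.sqrt(TRADING_DAYS),
        "max_drawdown": drawdown.min(),
    }


# Made-up return series, for illustration only.
strategies = {
    "equal_weight": pd.Series([0.10, -0.50, 0.20]),
    "ml_strategy": pd.Series([0.01, 0.02, 0.01]),
}
table = pd.DataFrame({name: performance_metrics(r) for name, r in strategies.items()}).T
```

Each row of table is one strategy and each column one metric, which is the layout a cross-strategy comparison needs.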

Explicit temporal split

The train/test split follows market chronology in order to limit information leakage in time-series data.

ANOVA selection plus GridSearchCV

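The Random Forest model combines ANOVA feature selection with GridSearchCV hyperparameter tuning, so that the retained features and the model settings come from a documented search rather than from arbitrary manual choices.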
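A minimal sketch of this combination in scikit-learn chains SelectKBest (with the ANOVA F-test, f_classif) and RandomForestClassifier in a Pipeline, then tunes both with GridSearchCV. The parameter grid, the use of TimeSeriesSplit for cross-validation, and the synthetic data are assumptions made for the example; the project's actual grid and CV scheme are not described on this page.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    [
        ("anova", SelectKBest(score_func=f_classif)),  # ANOVA F-test selection
        ("forest", RandomForestClassifier(random_state=0)),
    ]
)
param_grid = {
    "anova__k": [2, 4],
    "forest__n_estimators": [25, 50],
    "forest__max_depth": [3, None],
}
# TimeSeriesSplit keeps each validation fold after its training fold (assumed choice).
search = GridSearchCV(pipeline, param_grid, cv=TimeSeriesSplit(n_splits=3), scoring="accuracy")

# Synthetic data: only the first of six features carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
search.fit(X, y)

selected = search.best_estimator_.named_steps["anova"].get_support()
proba_up = search.predict_proba(X)[:, 1]  # probability of the "up" class
```

Placing the selector inside the pipeline means it is refit on each training fold, so the feature scores never see the validation data.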

Asset-level modeling, then portfolio aggregation

Predictions are generated at the ticker level and then aggregated through a weighting logic, which makes the model closer to a real allocation use case.
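The page does not specify the weighting logic, so the following is only one possible mapping from per-ticker probabilities to weights: long-only, proportional to the probability edge above 0.5, with an equal-weight fallback when no asset clears the threshold. The function name, the threshold, and the fallback rule are all assumptions.

```python
import numpy as np
import pandas as pd


def probabilities_to_weights(proba_up: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Long-only weights proportional to each asset's probability edge above `threshold`."""
    edge = (proba_up - threshold).clip(lower=0.0)
    total = edge.sum(axis=1)
    # Rows with a zero total become NaN here and are handled by the fallback below.
    weights = edge.div(total.where(total > 0), axis=0)
    # Days with no asset above the threshold fall back to equal weight.
    return weights.fillna(1.0 / proba_up.shape[1])


# Illustrative probabilities for three hypothetical tickers over two days.
proba_up = pd.DataFrame(
    {"AAA": [0.7, 0.4], "BBB": [0.6, 0.3], "CCC": [0.4, 0.45]},
    index=pd.bdate_range("2022-01-03", periods=2),
)
weights = probabilities_to_weights(proba_up)
```

On the first day the weights are 2/3, 1/3, and 0; on the second day no asset exceeds 0.5, so each receives 1/3.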

Reliability

Quality and controls

  • Dedicated scripts for each stage improve reproducibility and make debugging more targeted.
  • Metrics and test visualizations are exported in versionable files.
  • The modular structure across data, features, models, baselines, and scripts is compatible with collaborative work.

Limitations

Points to watch

  • Performance remains dependent on the historical window and the market regime observed.
  • Real market frictions are simplified, especially transaction costs, liquidity constraints, and slippage.
  • Validation can still be strengthened through more robust walk-forward protocols and stress testing.

Roadmap

Next steps

  • Add walk-forward validation and stricter out-of-sample robustness protocols.
  • Extend the model library with approaches such as gradient boosting under the same comparison framework.
  • Connect ML pipeline outputs automatically to the wealth-tracking modules.
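As a rough idea of what the planned walk-forward validation could look like, the sketch below retrains a model on an expanding window and scores it on the block of observations that follows. It uses scikit-learn's TimeSeriesSplit with Logistic Regression on synthetic data; the function name, the number of splits, and the accuracy score are assumptions, not a description of existing project code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit


def walk_forward_accuracy(X: np.ndarray, y: np.ndarray, n_splits: int = 4) -> list:
    """Refit on each expanding training window and score on the block that follows it."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return scores


# Synthetic data for illustration: the first feature drives the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(250, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=250) > 0).astype(int)
scores = walk_forward_accuracy(X, y)
```

Reporting the per-fold scores, rather than a single test score, shows how stable the model is across successive periods.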

Challenges

Main project challenges

Build a clean temporal train/test pipeline to limit information leakage in a time-series setting.

Compare fundamentally different approaches, from static allocations to supervised models, under homogeneous metrics.

Maintain a readable architecture despite the multiplication of components such as features, models, baselines, and orchestration scripts.

Outcomes and learnings

What I take away

Reproducible execution chain through run_prepare.py, run_baselines.py, run_random_forest.py, run_logistic_regression.py, and run_all.py.

Systematic production of test reports such as metrics_test_* and comparative figures such as equity_*_vs_baselines_test.png.

Integration of Random Forest with ANOVA feature selection and hyperparameter search through GridSearchCV.

Addition of a Logistic Regression supervised benchmark compared with classical financial strategies.

Other projects

Continue exploring

Personal Finance · Python · PyQt6

Wealth Tracking Application

PyQt6 + SQLite desktop application to centralize multi-asset accounts, rebuild weekly history, and analyze portfolio performance.


Quantitative Finance · Analysis · Python

Portfolio Backtesting and Optimization

Python environment to backtest strategies, compare risk and return metrics, and analyze portfolio behavior.

Discuss

I can detail technical choices and outcomes during an interview.

If this project is relevant for you, I can detail the initial need, data structure, assumptions, challenges encountered, and analysis limitations.