ETF comparison and analysis utility

This Python script compares ETF holdings using their CSV holdings files. It loads multiple files, normalizes company names, and merges similar holdings using fuzzy matching, so that the same company listed under slightly different spellings is treated as a single holding, e.g. "Company Co. AG" and "COMPANY CO AG". This allows more accurate comparisons across ETFs where names or spellings may differ.
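The matching idea can be sketched with a condensed version of the normalization plus the stdlib's difflib standing in for rapidfuzz (the script's actual suffix list is longer):

```python
import re
from difflib import SequenceMatcher  # stdlib stand-in; the script uses rapidfuzz

def normalize(name: str) -> str:
    # Condensed form of the script's normalize_company_name: uppercase,
    # strip punctuation and a few common corporate suffixes, collapse spaces.
    name = re.sub(r"[.,&()\-]", " ", name.upper())
    for suffix in ("PLC", "INC", "CORP", "NV", "AG", "LTD"):
        name = re.sub(rf"\b{suffix}\b", "", name)
    return re.sub(r"\s+", " ", name).strip()

a = normalize("Franco-Nevada Corp")
b = normalize("FRANCO NEVADA")
print(a == b)  # True: exact match once punctuation and suffixes are stripped

# Spellings that still differ after normalization are caught by fuzzy matching;
# rapidfuzz's fuzz.ratio plays this role in the script, with a threshold of 92.
ratio = SequenceMatcher(None, "MICROSTRATEGY", "MICRO STRATEGY").ratio() * 100
print(ratio > 92)  # True
```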

The tool can analyze several aspects of a portfolio set: company crossover between ETFs, percentage overlap (similarity) between funds, and aggregated regional or sector exposure. Optional portfolio weightings can be applied to simulate custom ETF allocations when combining multiple funds into a portfolio.

Any ETF holdings file can be passed to this script, provided the following four columns are included and spelled exactly as follows: Name, Weight (%), Country and Sector.

Additionally, if a line of the form Fund ASX Code,<ticker> appears before the column headings, the ticker will be used as the fund name in printed output. Otherwise the filename will be used.
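The expected file layout can be sketched as follows (the preamble line and figures are illustrative, not taken from a real fund file):

```
Fund ASX Code,BGBL
Name,Weight (%),Country,Sector
NVIDIA CORP,5.2,United States,Information Technology
APPLE INC,4.8,United States,Information Technology
ROCHE HOLDING AG,0.4,Switzerland,Healthcare
```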

The script is licensed under the MIT license.

Features

A full portfolio assessment may look like:
etf-util.py --files BGBL.csv EXUS.csv ROYL.csv CRYP.csv --weightings 0.6 0.3 0.05 0.05 <feature>

All features are calculated using the provided --weightings for each ETF. If no weightings are provided, each ETF defaults to a weighting of 1.0 (100%).
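The weighting is a simple scaling of each holding's in-fund weight; with hypothetical numbers:

```python
# A holding at 3.5% inside an ETF that makes up 60% of the portfolio
etf_allocation = 0.6        # from --weightings
holding_weight_pct = 3.5    # the "Weight (%)" column in the ETF's CSV
portfolio_weight_pct = holding_weight_pct * etf_allocation
print(round(portfolio_weight_pct, 4))  # 2.1 -- the figure shown in the output tables
```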

--holdings

Outputs each holding's portfolio-weighted exposure per ETF, plus a portfolio-wide sum, sorted by largest total exposure first.

Company Crossover (Portfolio-Weighted):

Fund                              BGBL    CRYP    EXUS    ROYL     Sum
Canonical Holding Name
NVIDIA                          3.1523  0.0000  0.0000  0.0000  3.1523
APPLE                           2.8131  0.0000  0.0000  0.0000  2.8131
ALPHABET                        2.4582  0.0000  0.0000  0.0000  2.4582
MICROSOFT                       2.1615  0.0000  0.0000  0.0000  2.1615
AMAZON COM                      1.4705  0.0000  0.0000  0.0000  1.4705
BROADCOM                        1.0967  0.0000  0.0000  0.0000  1.0967
ASML                            0.3945  0.0000  0.6998  0.0000  1.0944
META PLATFORMS                  1.0483  0.0000  0.0000  0.0000  1.0483
TESLA                           0.8640  0.0000  0.0000  0.0000  0.8640
WHEATON PRECIOUS METALS         0.0618  0.0000  0.0864  0.6297  0.7779
ROCHE                           0.2517  0.0000  0.4357  0.0000  0.6875
FRANCO NEVADA                   0.0358  0.0000  0.0573  0.5478  0.6409
ELI LILLY                       0.6371  0.0000  0.0000  0.0000  0.6371
NOVARTIS                        0.2324  0.0000  0.3994  0.0000  0.6318
ASTRAZENECA                     0.2251  0.0000  0.3905  0.0000  0.6156
... etc

--regions

Outputs the regions and attributed weightings per ETF and as a portfolio total, derived from each holding's listed country.

Fund          BGBL    CRYP     EXUS    ROYL    Total
Region
US/Canada  45.3048  3.8502   4.0512  3.3958  56.6020
Europe      6.8574  0.0090  17.1186  0.6704  24.6554
Asia        4.9717  0.3002   8.1567  0.2695  13.6981
Other       2.6301  0.3193   0.6409  0.6429   4.2333
Australia   0.1689  0.5213   0.0326  0.0214   0.7442

--sectors

Outputs the sectors and attributed weightings per ETF and as a portfolio total.

Fund                       BGBL    CRYP    EXUS    ROYL    Total
Sector
Information Technology  15.5286  3.2416  2.7900  0.5458  22.1061
Financials               9.3376  1.5391  7.3774  0.0000  18.2541
Industrials              7.3206  0.0000  5.8359  0.0000  13.1566
Healthcare               5.8694  0.0040  2.9246  1.1472   9.9452
Consumer Discretionary   5.7128  0.2038  2.5529  0.0000   8.4696
Communication Services   5.4025  0.0087  1.1940  0.4075   7.0127
Materials                2.0176  0.0000  1.9193  1.9326   5.8695
Consumer Staples         3.5323  0.0000  2.1953  0.0000   5.7276
Energy                   2.4176  0.0000  1.4667  0.9325   4.8169
Utilities                1.6583  0.0000  1.1180  0.0000   2.7763
Real Estate              1.0413  0.0000  0.5094  0.0272   1.5780

--similarity

Outputs a similarity matrix, comparing each ETF against every other provided ETF.

BGBL vs CRYP: 0.6835% overlap
BGBL vs EXUS: 17.3505% overlap
BGBL vs ROYL: 0.1637% overlap
CRYP vs EXUS: 0.1004% overlap
CRYP vs ROYL: 0.0020% overlap
EXUS vs ROYL: 0.2184% overlap
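The overlap figure is the sum, over all holdings in either fund, of the smaller of the two portfolio-weighted exposures. A minimal sketch with made-up numbers:

```python
# Portfolio-weighted exposures per canonical holding name (hypothetical values)
fund_a = {"NVIDIA": 3.15, "ASML": 0.39, "ROCHE": 0.25}
fund_b = {"ASML": 0.70, "ROCHE": 0.44, "NESTLE": 0.30}

# For each holding, only the exposure held by BOTH funds counts as overlap
overlap = sum(
    min(fund_a.get(name, 0.0), fund_b.get(name, 0.0))
    for name in set(fund_a) | set(fund_b)
)
print(f"{overlap:.4f}% overlap")  # 0.6400% overlap (0.39 from ASML + 0.25 from ROCHE)
```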

--frequency

Outputs individual holdings that appear more than once amongst the provided ETFs. The list is sorted by largest total weighting first.

                          Name                  Funds  Total Weighted (%)
               ASML HOLDING NV             BGBL, EXUS              1.0944
  WHEATON PRECIOUS METALS CORP       BGBL, EXUS, ROYL              0.7779
              ROCHE HOLDING AG             BGBL, EXUS              0.6875
            FRANCO-NEVADA CORP       BGBL, EXUS, ROYL              0.6409
                   NOVARTIS AG             BGBL, EXUS              0.6318
               ASTRAZENECA PLC             BGBL, EXUS              0.6156
             HSBC HOLDINGS PLC             BGBL, EXUS              0.6050
                     NESTLE SA             BGBL, EXUS              0.5268
             TOYOTA MOTOR CORP             BGBL, EXUS              0.5183
                      VISA INC             BGBL, CRYP              0.4853
          ROYAL BANK OF CANADA             BGBL, EXUS              0.4698
                     SHELL PLC             BGBL, EXUS              0.4571
             MICROSTRATEGY INC             BGBL, CRYP              0.4454
     CIRCLE INTERNET GROUP INC             BGBL, CRYP              0.4352
                    SIEMENS AG             BGBL, EXUS              0.4219
MITSUBISHI UFJ FINANCIAL GROUP             BGBL, EXUS              0.4150
                        SAP SE             BGBL, EXUS              0.4023
           COINBASE GLOBAL INC             BGBL, CRYP              0.3949
                MASTERCARD INC             BGBL, CRYP              0.3929
      UNIVERSAL MUSIC GROUP NV       BGBL, EXUS, ROYL              0.3798
            BANCO SANTANDER SA             BGBL, EXUS              0.3593
            ... etc
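The same aggregation can be sketched with plain dicts (the script itself uses a pandas groupby); all names and weights below are hypothetical:

```python
from collections import defaultdict

# (canonical name, fund, portfolio-weighted %) rows, as produced by loading
rows = [
    ("ASML", "BGBL", 0.39), ("ASML", "EXUS", 0.70),
    ("NVIDIA", "BGBL", 3.15),
    ("ROCHE", "BGBL", 0.25), ("ROCHE", "EXUS", 0.44),
]

funds = defaultdict(set)
totals = defaultdict(float)
for name, fund, weight in rows:
    funds[name].add(fund)
    totals[name] += weight

# Keep only names held by more than one fund, largest total weighting first
shared = sorted(
    (name for name in funds if len(funds[name]) > 1),
    key=lambda n: totals[n],
    reverse=True,
)
print(shared)  # ['ASML', 'ROCHE'] -- NVIDIA appears in only one fund
```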

Script

#!/usr/bin/env python
import sys
import argparse
import pandas as pd
import re
from rapidfuzz import fuzz


def classify_region(country):
    if country in ["United States", "USA", "Canada"]:
        return "US/Canada"
    elif country in [
        "Europe",  # SPDW lists "Europe" as a catch-all
        "United Kingdom",
        "Germany",
        "France",
        "Switzerland",
        "Netherlands",
        "Spain",
        "Italy",
        "Sweden",
        "Denmark",
        "Finland",
        "Norway",
        "Belgium",
        "Austria",
        "Ireland",
        "Portugal",
    ]:
        return "Europe"
    elif country == "Australia":
        return "Australia"
    elif country in [
        "China",
        "Japan",
        "South Korea",
        "Taiwan",
        "Singapore",
        "Hong Kong",
        "India",
        "Thailand",
        "Malaysia",
        "Indonesia",
        "Philippines",
    ]:
        return "Asia"
    else:
        return "Other"


def normalize_company_name(name: str) -> str:
    """Uppercase, strip punctuation and common corporate suffixes for matching."""
    name = name.upper()

    name = re.sub(r"[.,&()\-]", " ", name)

    name = re.sub(r"\bCLASS\s+[A-Z]\b", "", name)
    name = re.sub(r"\bSHS?\b", "", name)
    name = re.sub(r"\bADR\b", "", name)
    name = re.sub(r"\bGDR\b", "", name)

    suffixes = [
        "LTD",
        "LIMITED",
        "PLC",
        "INC",
        "CORP",
        "CORPORATION",
        "NV",
        "AG",
        "SA",
        "SPA",
        "BV",
        "AB",
        "ASA",
        "OYJ",
        "CO",
        "COMPANY",
        "GROUP",
        "HOLDINGS",
        "HOLDING",
    ]

    for s in suffixes:
        name = re.sub(rf"\b{s}\b", "", name)

    name = re.sub(r"\s+", " ", name).strip()

    return name


def build_fuzzy_name_map(names, threshold=92):
    """
    Build a mapping of similar company names to a canonical name.
    """
    names = list(names)
    canonical = {}

    for i, a in enumerate(names):
        if a in canonical:
            continue

        canonical[a] = a

        for b in names[i + 1 :]:
            if b in canonical:
                continue

            if fuzz.ratio(a, b) > threshold:
                canonical[b] = a

    return canonical


def find_header_row(path, required_columns):
    # Find the header line index dynamically ("utf-8-sig" strips any BOM)
    with open(path, "r", encoding="utf-8-sig") as f:
        for i, line in enumerate(f):
            cols = [c.strip().strip('"') for c in line.split(",")]
            if all(
                any(req.lower() == col.lower() for col in cols)
                for req in required_columns
            ):
                return i
    raise ValueError(f"Could not find required columns in file: {path}")


def load_holdings(filepath):
    required_cols = ["Name", "Weight (%)", "Country", "Sector"]

    header_row = find_header_row(filepath, required_cols)

    df = pd.read_csv(filepath, skiprows=header_row)
    df.columns = [col.strip() for col in df.columns]

    df = df[["Name", "Country", "Sector", "Weight (%)"]].copy()
    df["Weight (%)"] = pd.to_numeric(df["Weight (%)"], errors="coerce")
    df = df.dropna(subset=["Name"])

    # Extract fund name
    fund_name = None
    with open(filepath, "r", encoding="utf-8-sig") as f:
        for line in f:
            if line.startswith("Fund ASX Code"):
                fund_name = line.split(",", 1)[1].strip()
                break

    if not fund_name:
        fund_name = filepath

    return fund_name, df


def load_all_holdings(files, weightings):
    """
    Load all ETF holdings and apply consistent name normalization + fuzzy merging.
    Returns a unified dataframe.
    """

    all_holdings = []

    for filepath, w in zip(files, weightings):
        fund_name, df = load_holdings(filepath)

        df["Weight (%)"] *= w
        df["Fund"] = fund_name
        df["Normalized"] = df["Name"].apply(normalize_company_name)

        all_holdings.append(df)

    combined = pd.concat(all_holdings, ignore_index=True)

    # Build fuzzy canonical mapping across ALL ETFs
    name_map = build_fuzzy_name_map(combined["Normalized"].unique())

    combined["Canonical Holding Name"] = combined["Normalized"].map(name_map)

    return combined


##################
# USER FUNCTIONS #
##################


def run_holdings(files, weightings):

    combined = load_all_holdings(files, weightings)

    pivot = (
        combined.groupby(["Canonical Holding Name", "Fund"])["Weight (%)"]
        .sum()
        .unstack(fill_value=0)
    )

    if pivot.empty:
        print("\nNo holdings found.\n")
        return
    # Add sum column
    pivot["Sum"] = pivot.sum(axis=1)

    # Sort by total exposure
    pivot = pivot.sort_values(by="Sum", ascending=False)

    print("\nCompany Crossover (Portfolio-Weighted):\n")
    print(pivot.round(4).to_string())


def run_frequency(files, weightings):

    combined = load_all_holdings(files, weightings)

    grouped = combined.groupby("Canonical Holding Name")

    results = []

    for name, group in grouped:
        if group["Fund"].nunique() > 1:
            results.append(
                {
                    "Name": group["Name"].mode().iloc[0],
                    "Funds": ", ".join(sorted(group["Fund"].unique())),
                    "Total Weighted (%)": group["Weight (%)"].sum(),
                }
            )

    freq_df = pd.DataFrame(results)

    if freq_df.empty:
        print("\nNo stocks found in more than one ETF.\n")
        return

    freq_df = freq_df.sort_values(by="Total Weighted (%)", ascending=False)

    print("\nStocks Appearing in More Than One ETF:\n")
    print(freq_df.round(4).to_string(index=False))


def run_similarity(files, weightings):

    combined = load_all_holdings(files, weightings)

    pivot = (
        combined.groupby(["Canonical Holding Name", "Fund"])["Weight (%)"]
        .sum()
        .unstack(fill_value=0)
    )

    funds = list(pivot.columns)

    print("\nETF Similarity Matrix (% overlap):\n")

    for i in range(len(funds)):
        for j in range(i + 1, len(funds)):
            a = pivot[funds[i]]
            b = pivot[funds[j]]

            overlap = pd.concat([a, b], axis=1).min(axis=1).sum()

            print(f"{funds[i]} vs {funds[j]}: {overlap:.4f}% overlap")


def run_regions(files, weightings):

    combined = load_all_holdings(files, weightings)

    combined["Region"] = combined["Country"].apply(classify_region)

    region_df = (
        combined.groupby(["Region", "Fund"])["Weight (%)"].sum().unstack(fill_value=0)
    )

    region_df["Total"] = region_df.sum(axis=1)
    region_df = region_df.sort_values("Total", ascending=False)

    print("\nRegional Exposure (Portfolio-Weighted):\n")
    print(region_df.round(4).to_string())


def run_sectors(files, weightings):

    combined = load_all_holdings(files, weightings)

    sector_df = (
        combined.groupby(["Sector", "Fund"])["Weight (%)"].sum().unstack(fill_value=0)
    )

    sector_df["Total"] = sector_df.sum(axis=1)
    sector_df = sector_df.sort_values("Total", ascending=False)

    print("\nSector Exposure (Portfolio-Weighted):\n")
    print(sector_df.round(4).to_string())


def main():
    parser = argparse.ArgumentParser(description="ETF Comparison Tool")

    parser.add_argument(
        "--files", nargs="+", required=True, help="ETF holdings CSV files"
    )

    parser.add_argument(
        "--weightings", nargs="+", type=float, help="Portfolio weightings for each ETF"
    )

    parser.add_argument("--similarity", action="store_true")
    parser.add_argument("--frequency", action="store_true")
    parser.add_argument("--regions", action="store_true")
    parser.add_argument("--sectors", action="store_true")
    parser.add_argument("--holdings", action="store_true")

    args = parser.parse_args()

    if not any(
        [
            args.regions,
            args.frequency,
            args.similarity,
            args.sectors,
            args.holdings,
        ]
    ):
        print("Please specify --holdings, --frequency, --similarity, --regions, or --sectors")
        sys.exit(1)

    files = args.files

    if args.weightings:
        if len(args.weightings) != len(files):
            print("Number of weightings must match number of files")
            sys.exit(1)
        weightings = args.weightings
    else:
        weightings = [1.0] * len(files)

    if args.similarity:
        run_similarity(files, weightings)

    if args.regions:
        run_regions(files, weightings)

    if args.sectors:
        run_sectors(files, weightings)

    if args.holdings:
        run_holdings(files, weightings)

    if args.frequency:
        run_frequency(files, weightings)


if __name__ == "__main__":
    main()