ETF comparison and analysis utility
This python script helps to compare ETF holdings using their CSV holdings files.
It loads multiple files, normalizes company names, and merges similar holdings
using fuzzy matching so that the same company listed with slightly different
spellings is treated as a single holding. This allows more accurate comparisons
across ETFs where names or spellings may differ. E.g, Company Co. AG and
COMPANY CO AG.
The tool can analyze several aspects of a portfolio set: company crossover between ETFs, percentage overlap (similarity) between funds, and aggregated regional or sector exposure. Optional portfolio weightings can be applied to simulate custom ETF allocations when combining multiple funds into a porfolio.
Any ETF holding can be passed to this script, provided that the following four
columns are included and spelled as follows: Name, Weight (%), Country and
Sector.
Additionally, a line before the column headings with: Fund ASX Code,<ticker>,
will be sourced for print output. Otherwise the filename will be used.
The script is licensed: MIT
Features
A full portfolio assment may look like:
etf-util.py --files BGBL.csv EXUS.csv ROYL.csv CRYP.csv --weightings 0.6 0.3 0.05 0.05 <feature>
All features calculate using the provided --weightings for each ETF. If no
weighting is provided, a default of 100% per holding is attributed.
--holdings
Company Crossover (Portfolio-Weighted):
Fund BGBL CRYP EXUS ROYL Sum
Canonical Holding Name
NVIDIA 3.1523 0.0000 0.0000 0.0000 3.1523
APPLE 2.8131 0.0000 0.0000 0.0000 2.8131
ALPHABET 2.4582 0.0000 0.0000 0.0000 2.4582
MICROSOFT 2.1615 0.0000 0.0000 0.0000 2.1615
AMAZON COM 1.4705 0.0000 0.0000 0.0000 1.4705
BROADCOM 1.0967 0.0000 0.0000 0.0000 1.0967
ASML 0.3945 0.0000 0.6998 0.0000 1.0944
META PLATFORMS 1.0483 0.0000 0.0000 0.0000 1.0483
TESLA 0.8640 0.0000 0.0000 0.0000 0.8640
WHEATON PRECIOUS METALS 0.0618 0.0000 0.0864 0.6297 0.7779
ROCHE 0.2517 0.0000 0.4357 0.0000 0.6875
FRANCO NEVADA 0.0358 0.0000 0.0573 0.5478 0.6409
ELI LILLY 0.6371 0.0000 0.0000 0.0000 0.6371
NOVARTIS 0.2324 0.0000 0.3994 0.0000 0.6318
ASTRAZENECA 0.2251 0.0000 0.3905 0.0000 0.6156
... etc
--regions
Fund BGBL CRYP EXUS ROYL Total
Region
US/Canada 45.3048 3.8502 4.0512 3.3958 56.6020
Europe 6.8574 0.0090 17.1186 0.6704 24.6554
Asia 4.9717 0.3002 8.1567 0.2695 13.6981
Other 2.6301 0.3193 0.6409 0.6429 4.2333
Australia 0.1689 0.5213 0.0326 0.0214 0.7442
--sectors
Outputs the sectors and attributed weightings per ETF and as a portfolio total.
Fund BGBL CRYP EXUS ROYL Total
Sector
Information Technology 15.5286 3.2416 2.7900 0.5458 22.1061
Financials 9.3376 1.5391 7.3774 0.0000 18.2541
Industrials 7.3206 0.0000 5.8359 0.0000 13.1566
Healthcare 5.8694 0.0040 2.9246 1.1472 9.9452
Consumer Discretionary 5.7128 0.2038 2.5529 0.0000 8.4696
Communication Services 5.4025 0.0087 1.1940 0.4075 7.0127
Materials 2.0176 0.0000 1.9193 1.9326 5.8695
Consumer Staples 3.5323 0.0000 2.1953 0.0000 5.7276
Energy 2.4176 0.0000 1.4667 0.9325 4.8169
Utilities 1.6583 0.0000 1.1180 0.0000 2.7763
Real Estate 1.0413 0.0000 0.5094 0.0272 1.5780
--similarity
Outputs a similiarity matrix, comparing each ETF against every other provided ETF.
BGBL vs CRYP: 0.6835% overlap
BGBL vs EXUS: 17.3505% overlap
BGBL vs ROYL: 0.1637% overlap
CRYP vs EXUS: 0.1004% overlap
CRYP vs ROYL: 0.0020% overlap
EXUS vs ROYL: 0.2184% overlap
--frequency
Outputs individual holdings that appear more than once amongst the provided ETFs. The list is sorted by largest total weighting first.
Name Funds Total Weighted (%)
ASML HOLDING NV BGBL, EXUS 1.0944
WHEATON PRECIOUS METALS CORP BGBL, EXUS, ROYL 0.7779
ROCHE HOLDING AG BGBL, EXUS 0.6875
FRANCO-NEVADA CORP BGBL, EXUS, ROYL 0.6409
NOVARTIS AG BGBL, EXUS 0.6318
ASTRAZENECA PLC BGBL, EXUS 0.6156
HSBC HOLDINGS PLC BGBL, EXUS 0.6050
NESTLE SA BGBL, EXUS 0.5268
TOYOTA MOTOR CORP BGBL, EXUS 0.5183
VISA INC BGBL, CRYP 0.4853
ROYAL BANK OF CANADA BGBL, EXUS 0.4698
SHELL PLC BGBL, EXUS 0.4571
MICROSTRATEGY INC BGBL, CRYP 0.4454
CIRCLE INTERNET GROUP INC BGBL, CRYP 0.4352
SIEMENS AG BGBL, EXUS 0.4219
MITSUBISHI UFJ FINANCIAL GROUP BGBL, EXUS 0.4150
SAP SE BGBL, EXUS 0.4023
COINBASE GLOBAL INC BGBL, CRYP 0.3949
MASTERCARD INC BGBL, CRYP 0.3929
UNIVERSAL MUSIC GROUP NV BGBL, EXUS, ROYL 0.3798
BANCO SANTANDER SA BGBL, EXUS 0.3593
... etc
Script
#!/usr/bin/env python
import sys
import argparse
import pandas as pd
import re
from rapidfuzz import fuzz
def classify_region(country):
if country in ["United States", "USA", "Canada"]:
return "US/Canada"
elif country in [
"Europe", # SPDW lists "Europe" as a catch-all
"United Kingdom",
"Germany",
"France",
"Switzerland",
"Netherlands",
"Spain",
"Italy",
"Sweden",
"Denmark",
"Finland",
"Norway",
"Belgium",
"Austria",
"Ireland",
"Portugal",
]:
return "Europe"
elif country == "Australia":
return "Australia"
elif country in [
"China",
"Japan",
"South Korea",
"Taiwan",
"Singapore",
"Hong Kong",
"India",
"Thailand",
"Malaysia",
"Indonesia",
"Philippines",
]:
return "Asia"
else:
return "Other"
def normalize_company_name(name: str) -> str:
name = name.upper()
name = re.sub(r"[.,&()\-]", " ", name)
name = re.sub(r"\bCLASS\s+[A-Z]\b", "", name)
name = re.sub(r"\bSHS?\b", "", name)
name = re.sub(r"\bADR\b", "", name)
name = re.sub(r"\bGDR\b", "", name)
suffixes = [
"LTD",
"LIMITED",
"PLC",
"INC",
"CORP",
"CORPORATION",
"NV",
"AG",
"SA",
"SPA",
"BV",
"AB",
"ASA",
"OYJ",
"CO",
"COMPANY",
"GROUP",
"HOLDINGS",
"HOLDING",
]
for s in suffixes:
name = re.sub(rf"\b{s}\b", "", name)
name = re.sub(r"\s+", " ", name).strip()
return name
def build_fuzzy_name_map(names, threshold=92):
"""
Build a mapping of similar company names to a canonical name.
"""
names = list(names)
canonical = {}
for i, a in enumerate(names):
if a in canonical:
continue
canonical[a] = a
for b in names[i + 1 :]:
if b in canonical:
continue
if fuzz.ratio(a, b) > threshold:
canonical[b] = a
return canonical
def find_header_row(path, required_columns):
# Find the header line index dynamically
with open(path, "r", encoding="utf-8") as f:
for i, line in enumerate(f):
cols = [c.strip().strip('"') for c in line.split(",")]
if all(
any(req.lower() == col.lower() for col in cols)
for req in required_columns
):
return i
raise ValueError(f"Could not find required columns in file: {path}")
def load_holdings(filepath):
required_cols = ["Name", "Weight (%)", "Country", "Sector"]
header_row = find_header_row(filepath, required_cols)
df = pd.read_csv(filepath, skiprows=header_row)
df.columns = [col.strip() for col in df.columns]
df = df[["Name", "Country", "Sector", "Weight (%)"]].copy()
df["Weight (%)"] = pd.to_numeric(df["Weight (%)"], errors="coerce")
df = df.dropna(subset=["Name"])
# Extract fund name
fund_name = None
with open(filepath, "r", encoding="utf-8-sig") as f:
for line in f:
if line.startswith("Fund ASX Code"):
fund_name = line.split(",", 1)[1].strip()
break
if not fund_name:
fund_name = filepath
return fund_name, df
def load_all_holdings(files, weightings):
"""
Load all ETF holdings and apply consistent name normalization + fuzzy merging.
Returns a unified dataframe.
"""
all_holdings = []
for filepath, w in zip(files, weightings):
fund_name, df = load_holdings(filepath)
df["Weight (%)"] *= w
df["Fund"] = fund_name
df["Normalized"] = df["Name"].apply(normalize_company_name)
all_holdings.append(df)
combined = pd.concat(all_holdings, ignore_index=True)
# Build fuzzy canonical mapping across ALL ETFs
name_map = build_fuzzy_name_map(combined["Normalized"].unique())
combined["Canonical Holding Name"] = combined["Normalized"].map(name_map)
return combined
##################
# USER FUNCTIONS #
def run_holdings(files, weightings):
combined = load_all_holdings(files, weightings)
pivot = (
combined.groupby(["Canonical Holding Name", "Fund"])["Weight (%)"]
.sum()
.unstack(fill_value=0)
)
if pivot.empty:
print("\nNo overlapping holdings found.\n")
return
# Add sum column
pivot["Sum"] = pivot.sum(axis=1)
# Sort by total exposure
pivot = pivot.sort_values(by="Sum", ascending=False)
print("\nCompany Crossover (Portfolio-Weighted):\n")
print(pivot.round(4).to_string())
def run_frequency(files, weightings):
combined = load_all_holdings(files, weightings)
grouped = combined.groupby("Canonical Holding Name")
results = []
for name, group in grouped:
if group["Fund"].nunique() > 1:
results.append(
{
"Name": group["Name"].mode().iloc[0],
"Funds": ", ".join(sorted(group["Fund"].unique())),
"Total Weighted (%)": group["Weight (%)"].sum(),
}
)
freq_df = pd.DataFrame(results)
if freq_df.empty:
print("\nNo stocks found in more than one ETF.\n")
return
freq_df = freq_df.sort_values(by="Total Weighted (%)", ascending=False)
print("\nStocks Appearing in More Than One ETF:\n")
print(freq_df.round(4).to_string(index=False))
def run_similarity(files, weightings):
combined = load_all_holdings(files, weightings)
pivot = (
combined.groupby(["Canonical Holding Name", "Fund"])["Weight (%)"]
.sum()
.unstack(fill_value=0)
)
funds = list(pivot.columns)
print("\nETF Similarity Matrix (% overlap):\n")
for i in range(len(funds)):
for j in range(i + 1, len(funds)):
a = pivot[funds[i]]
b = pivot[funds[j]]
overlap = pd.concat([a, b], axis=1).min(axis=1).sum()
print(f"{funds[i]} vs {funds[j]}: {overlap:.4f}% overlap")
def run_regions(files, weightings):
combined = load_all_holdings(files, weightings)
combined["Region"] = combined["Country"].apply(classify_region)
region_df = (
combined.groupby(["Region", "Fund"])["Weight (%)"].sum().unstack(fill_value=0)
)
region_df["Total"] = region_df.sum(axis=1)
region_df = region_df.sort_values("Total", ascending=False)
print("\nRegional Exposure (Portfolio-Weighted):\n")
print(region_df.round(4).to_string())
def run_sectors(files, weightings):
combined = load_all_holdings(files, weightings)
sector_df = (
combined.groupby(["Sector", "Fund"])["Weight (%)"].sum().unstack(fill_value=0)
)
sector_df["Total"] = sector_df.sum(axis=1)
sector_df = sector_df.sort_values("Total", ascending=False)
print("\nSector Exposure (Portfolio-Weighted):\n")
print(sector_df.round(4).to_string())
def main():
parser = argparse.ArgumentParser(description="ETF Comparison Tool")
parser.add_argument(
"--files", nargs="+", required=True, help="ETF holdings CSV files"
)
parser.add_argument(
"--weightings", nargs="+", type=float, help="Portfolio weightings for each ETF"
)
parser.add_argument("--similarity", action="store_true")
parser.add_argument("--frequency", action="store_true")
parser.add_argument("--regions", action="store_true")
parser.add_argument("--sectors", action="store_true")
parser.add_argument("--holdings", action="store_true")
args = parser.parse_args()
if not any(
[
args.regions,
args.frequency,
args.similarity,
args.sectors,
args.holdings,
]
):
print("Please specify --holdings, --frequency, --similarity, --regions, or --sectors")
sys.exit(1)
files = args.files
if args.weightings:
if len(args.weightings) != len(files):
print("Number of weightings must match number of files")
sys.exit(1)
weightings = args.weightings
else:
weightings = [1.0] * len(files)
if args.similarity:
run_similarity(files, weightings)
if args.regions:
run_regions(files, weightings)
if args.sectors:
run_sectors(files, weightings)
if args.holdings:
run_holdings(files, weightings)
if args.frequency:
run_frequency(files, weightings)
if __name__ == "__main__":
main()