HouseStockWatcher Is Dead — Here’s How to Pull the Latest House Trades Yourself

If you typed “housestockwatcher latest trades june 2026” into Google this week and landed on a dead page, you’re not imagining things. The old HouseStockWatcher S3 bucket that half the internet built scrapers against now returns a flat 403 AccessDenied. I checked this morning:

$ curl -si "https://house-stock-watcher-data.s3-us-west-2.amazonaws.com/data/all_transactions.json"
HTTP/1.1 403 Forbidden
<Error><Code>AccessDenied</Code>...

That endpoint fed dashboards, Discord bots, and more than a few backtests. When it went dark, a lot of “latest House trades” tools quietly started serving stale data without telling anyone. The thing worth knowing: HouseStockWatcher was always just a friendly wrapper around a government source that’s still up, still free, and still updated daily. You can pull the same data yourself in about 30 lines of Python with no API key. Let me show you exactly where it lives and how to read it.

Where the House trade data actually comes from

Every House member files financial disclosures with the Clerk of the House under the STOCK Act. The one you care about for trades is the Periodic Transaction Report (PTR): the form a representative must file within 30 days of learning of a trade, and no later than 45 days after the trade itself. HouseStockWatcher scraped these, parsed the PDFs, and republished them as tidy JSON. The scraping target never moved:

https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2026FD.ZIP

That ZIP is the yearly index. It contains a tab-delimited .txt (plus an XML file carrying the same records) listing every disclosure filed in 2026: name, district, filing date, filing type, and a document ID. I pulled it just now and it’s 46 KB, HTTP 200, no auth:

$ curl -s -o 2026FD.ZIP -w "%{http_code} %{size_download}\n" \
    -H "User-Agent: Mozilla/5.0" \
    "https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2026FD.ZIP"
200 46043

One gotcha up front: send a real User-Agent. Hit that host with the default python-urllib string and you’ll sometimes get throttled. A browser UA string sails through.

Reading the index in 30 lines

The index columns look like this once unzipped:

Prefix  Last       First   Suffix  FilingType  StateDst  Year  FilingDate  DocID
        Suozzi     Thomas          P           NY03      2026  6/9/2026    20034747

The column that matters is FilingType. A value of P means Periodic Transaction Report — an actual trade. Everything else (C, X, D, A) is an annual report, amendment, or candidate filing with no fresh transactions. Here’s the full pull, sorted newest-first:

#!/usr/bin/env python3
import csv, io, zipfile, urllib.request
from datetime import datetime

YEAR = 2026
UA = "Mozilla/5.0 (research script; contact you@example.com)"  # placeholder: use your own contact address
INDEX = f"https://disclosures-clerk.house.gov/public_disc/financial-pdfs/{YEAR}FD.ZIP"

def get(url):
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=60) as r:
        return r.read()

zf = zipfile.ZipFile(io.BytesIO(get(INDEX)))
txt = zf.read(f"{YEAR}FD.txt").decode("utf-8", "replace")
rows = list(csv.DictReader(io.StringIO(txt), delimiter="\t"))

# FilingType "P" = Periodic Transaction Report (the actual trades)
ptrs = [r for r in rows if r["FilingType"] == "P"]

def filed(r):
    # Unparseable dates sort last in the newest-first ordering
    try:
        return datetime.strptime(r["FilingDate"], "%m/%d/%Y")
    except ValueError:
        return datetime.min

ptrs.sort(key=filed, reverse=True)
print(f"{len(ptrs)} House PTRs filed in {YEAR}\n")
for r in ptrs[:10]:
    name = f"{r['First']} {r['Last']}".strip()
    print(f"{r['FilingDate']:>10}  {name:24.24} {r['StateDst']:5} {r['DocID']}")

Run it and you get the genuinely latest filings. This is the output I got on June 22, 2026 — note the most recent entries are only days old:

262 House PTRs filed in 2026

 6/19/2026  Jared Moskowitz          FL23  20034749
 6/19/2026  Scott H. Peters          CA50  20034784
 6/18/2026  Thomas H. Kean           NJ07  20034783
 6/17/2026  Steve Cohen              TN09  20034796
 6/17/2026  Matthew Robert Van Epps  TN07  20034807
 6/16/2026  Richard W. Allen         GA12  20034740
 6/16/2026  Jonathan Jackson         IL01  20034688
 6/12/2026  Nicholas Begich          AK00  20020055
 6/12/2026  Julie Johnson            TX32  20034706
 6/12/2026  David J. Taylor          OH02  20034780

That’s 262 trade reports for the year so far, 37 of them filed in June alone. No scraper farm, no paid tier, and no rate limit worth mentioning as long as you send a real User-Agent.
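The per-month figure is easy to recompute from the same index rows. Here is a minimal sketch, assuming rows shaped like the csv.DictReader output above (with FilingType and FilingDate keys). The helper name ptrs_per_month is mine, and the sample rows are made up for illustration, not real index data:

```python
from collections import Counter
from datetime import datetime

def ptrs_per_month(rows):
    """Count PTR filings (FilingType == "P") per (year, month) of FilingDate."""
    counts = Counter()
    for r in rows:
        if r.get("FilingType") != "P":
            continue
        try:
            d = datetime.strptime(r["FilingDate"], "%m/%d/%Y")
        except (KeyError, ValueError):
            continue  # skip rows with a missing or malformed date
        counts[(d.year, d.month)] += 1
    return counts

# Illustrative rows only, not real index data
sample = [
    {"FilingType": "P", "FilingDate": "6/9/2026"},
    {"FilingType": "P", "FilingDate": "6/19/2026"},
    {"FilingType": "C", "FilingDate": "6/10/2026"},
    {"FilingType": "P", "FilingDate": "5/30/2026"},
]
print(ptrs_per_month(sample)[(2026, 6)])
```

Against the real index you would pass in the rows list from the script above and read off whichever month you care about.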

From DocID to the actual trades

The index tells you who filed and when, not what they traded. For that you fetch the PTR itself. The URL is mechanical — just slot the DocID into this pattern:

https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/{YEAR}/{DocID}.pdf

So Scott Peters’ June 19 filing is at .../ptr-pdfs/2026/20034784.pdf. All three I spot-checked returned HTTP 200. Building the link in code is one function:

def ptr_pdf_url(r, year=2026):
    return f"https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/{year}/{r['DocID']}.pdf"

Here’s where it gets interesting, and where most “I’ll just parse the PDF” projects fall apart. There are two completely different kinds of PTR PDF, and you can tell them apart from the DocID alone:

  • 8-digit DocID starting with 2 (e.g. 20034784) — an e-filed report. It has a real text layer, built from Type0/CID fonts with a ToUnicode map. A proper PDF library can read it.
  • 7-digit DocID (e.g. 9116142) — a scanned paper form. It’s just images. I checked one: 41 embedded image objects, zero fonts. You need OCR, full stop.

That split is the single most useful thing to know before you write a parser. In the 2026 data, 235 of 262 PTRs are e-filed and 27 are scans. A quick classifier:

def is_efiled(doc_id):
    # e-filed PDFs have a text layer; 7-digit scans need OCR
    return len(doc_id) == 8 and doc_id.startswith("2")

For the e-filed ones, don’t hand-roll the PDF decoding. I tried a naive regex pass over the content streams to prove a point and got back zero characters — the text is hidden behind compressed object streams and CID font maps. Use a library that handles ToUnicode CMaps properly:

import io, urllib.request
import pdfplumber

def read_efiled_ptr(url):
    # UA is the User-Agent string defined in the index script above
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=60) as resp:
        raw = resp.read()
    # pdfplumber.open accepts a file-like object, so no temp file is needed
    with pdfplumber.open(io.BytesIO(raw)) as pdf:
        # PTRs are tabular: asset, ticker, type (P/S), date, amount range
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    print(row)

For the scanned 7-digit minority, route them to Tesseract or a hosted OCR call instead of pretending extract_text() will work. If you skip that branch, your pipeline silently drops every paper filer — and some of the more active traders still file on paper.
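One way to make sure neither branch gets skipped is to route every row explicitly. The sketch below is an assumption about how you might wire it, not a finished pipeline: route_ptrs is a hypothetical helper, and the two handlers are stand-in lambdas where a real pdfplumber parser and a real OCR call would go.

```python
def is_efiled(doc_id):
    # Same rule as the classifier earlier in the post:
    # 8 digits starting with "2" means the PDF has a text layer
    return len(doc_id) == 8 and doc_id.startswith("2")

def route_ptrs(rows, parse_text, parse_scan):
    """Send each PTR row to the text parser or the OCR parser by DocID shape."""
    results = {"efiled": [], "scanned": []}
    for r in rows:
        doc_id = r["DocID"].strip()
        if is_efiled(doc_id):
            results["efiled"].append(parse_text(r))
        else:
            results["scanned"].append(parse_scan(r))
    return results

# Stand-in handlers for illustration; swap in real parsing functions
demo = route_ptrs(
    [{"DocID": "20034784"}, {"DocID": "9116142"}],
    parse_text=lambda r: ("pdfplumber", r["DocID"]),
    parse_scan=lambda r: ("ocr", r["DocID"]),
)
print(demo)
```

Passing the handlers in as arguments keeps the routing logic testable without downloading a single PDF, and the length of each result list gives you the e-filed versus scanned split for free.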

The no-code option if you just want to look

Not everyone wants to babysit a PDF parser. If you only need to eyeball recent activity, Capitol Trades and Quiver Quantitative both keep clean, current front-ends over the same Clerk data, with tickers already matched and amounts normalized. They’re great for browsing. The catch is you don’t control the refresh cadence or the export format, and the free tiers cap how much history you can pull. For anything programmatic — alerts, backtests, joining against price data — going straight to the Clerk source is faster and never breaks when a third party changes their terms.

If you’re wiring this into a broader research stack, two related teardowns on this site pair well with it: reverse-engineering SEC EDGAR’s full-text search API for corporate filings, and tracking pre-IPO valuations with a free API. Same philosophy: skip the paid aggregator, read the primary source.

A couple of things that’ll bite you

The FilingDate is when the report hit the Clerk, not when the trade happened. The actual transaction date lives inside the PDF and is often weeks earlier, since members have up to 45 days after a trade to report it. If you’re building a “follow the trades” signal, sort on the in-PDF transaction date, not the index date, or you’ll think you have fresh information that’s actually a month stale.
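To make that gap visible, it helps to compute it per trade. This is a rough sketch, assuming you have already extracted a transaction date from the PDF and that both dates use the same M/D/YYYY format as the index (an assumption worth checking against your own parsed rows). The function name, dict keys, and tickers are hypothetical:

```python
from datetime import datetime

DATE_FMT = "%m/%d/%Y"  # assumed format for both dates

def disclosure_lag_days(transaction_date, filing_date):
    """Days between the trade itself and the filing reaching the Clerk."""
    traded = datetime.strptime(transaction_date, DATE_FMT)
    filed = datetime.strptime(filing_date, DATE_FMT)
    return (filed - traded).days

# Made-up trades for illustration
trades = [
    {"ticker": "AAA", "transaction_date": "5/20/2026", "filing_date": "6/19/2026"},
    {"ticker": "BBB", "transaction_date": "6/1/2026", "filing_date": "6/12/2026"},
]

# Sort on when the trade happened, newest first, not on when it was filed
trades.sort(
    key=lambda t: datetime.strptime(t["transaction_date"], DATE_FMT),
    reverse=True,
)
for t in trades:
    lag = disclosure_lag_days(t["transaction_date"], t["filing_date"])
    print(t["ticker"], t["transaction_date"], f"{lag} days to disclosure")
```

In this example the filing with the later index date is actually the older trade, which is exactly the trap described above.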

Also, amounts are ranges, never exact. The form reports buckets like $1,001 – $15,000. Don’t store a single number; store the low and high bounds and decide later how to weight them.
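Storing both bounds takes only a few lines. A minimal sketch, assuming the bucket arrives as text like the example above; parse_amount_range is a hypothetical helper, and the open-ended "Over $..." form is my assumption about how the top bucket is written, so verify it against real filings:

```python
import re

def parse_amount_range(text):
    """Return (low, high) in whole dollars; high is None for an open-ended bucket."""
    amounts = [int(m.replace(",", "")) for m in re.findall(r"\$([\d,]+)", text)]
    if len(amounts) == 2:
        low, high = amounts
        return (low, high)
    # Assumed form for the top bucket, e.g. "Over $50,000,000"
    if len(amounts) == 1 and text.strip().lower().startswith("over"):
        return (amounts[0], None)
    raise ValueError(f"unrecognized amount bucket: {text!r}")

print(parse_amount_range("$1,001 – $15,000"))
print(parse_amount_range("$15,001 - $50,000"))
```

Matching on the dollar amounts rather than the separator means the parser does not care whether the PDF text layer produced a hyphen or an en dash between the two numbers.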

If you’d rather read about the mechanics of congressional trading and the STOCK Act before building, the backstory is covered well in a few trade books, and a basic Python data toolkit goes a long way here. A copy of Python for Data Analysis by Wes McKinney (the pandas author) is the one reference I keep open when I’m reshaping messy filing data into something joinable. Full disclosure: that’s an Amazon affiliate link.

The whole thing — index pull, PTR classification, PDF link building — is maybe 40 lines and zero dependencies beyond pdfplumber for the parse step. The data’s public, it’s yours, and unlike that dead S3 bucket, the Clerk’s office isn’t going anywhere.


