Max L

Reverse-Engineering SEC EDGAR’s Full-Text Search API (efts.sec.gov)

Written by Max L in Finance & Trading

Last updated: July 14, 2026 · Originally published: June 7, 2026

The official SEC EDGAR full-text search box at efts.sec.gov is great if you’re a human clicking around. It’s useless if you want to pull 200 filings that mention “going concern” into a script. So I opened the network tab, watched what the search page actually calls, and rebuilt the request myself.

The page is a thin React front end. Every search fires a GET to https://efts.sec.gov/LATEST/search-index and gets back raw Elasticsearch JSON. No API key, no signup, no OAuth dance. Here’s the exact request that powers it, and the gotchas that cost me an afternoon.

The endpoint and its real parameters

The base URL is https://efts.sec.gov/LATEST/search-index. The path casing matters — /LATEST/ is uppercase and a lowercase /latest/ 404s. These are the query parameters that actually do something:

q: the search term. Wrap a phrase in URL-encoded double quotes (%22climate+risk%22) for an exact match; otherwise it tokenizes into an OR search.
forms: comma-separated filing types such as 10-K, 8-K, SC 13D. Leave it off to search everything.
startdt and enddt: date bounds in YYYY-MM-DD. Both are required if you want a window.
from: pagination offset. Each response returns up to 100 documents, so from=100 is page two and from=200 is page three.
ciks: restrict to a specific company by its zero-padded CIK number.

A complete request looks like this:

curl -s \
  -A "your-app you@example.com" \
  "https://efts.sec.gov/LATEST/search-index?q=%22machine+learning%22&forms=8-K&startdt=2026-01-01&enddt=2026-06-01"

The User-Agent header is not optional. SEC’s fair-access policy rejects requests with a generic or empty agent — you’ll get a 403. Put your app name and a contact email in there. I learned this the hard way after my first ten curls returned nothing but an HTML block page.
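The same request can be assembled in Python before handing it to whatever HTTP client you use. This is a minimal sketch rather than anything the search page itself does: build_search_url is a hypothetical helper name, and it assumes the parameters behave as described above. The standard library's urlencode turns the double quotes into %22 and the space into +, which is the exact-phrase form.

```python
from urllib.parse import urlencode

EFTS = "https://efts.sec.gov/LATEST/search-index"

def build_search_url(q, forms=None, startdt=None, enddt=None, offset=0, ciks=None):
    """Assemble a full-text search URL (hypothetical helper, not an SEC API)."""
    params = {"q": q}
    if forms:
        params["forms"] = forms
    if startdt and enddt:  # a date window needs both bounds
        params["startdt"] = startdt
        params["enddt"] = enddt
    if ciks:
        params["ciks"] = ciks
    if offset:
        params["from"] = offset
    # safe="," keeps the comma in a multi-form list like "10-K,8-K" unescaped
    return f"{EFTS}?{urlencode(params, safe=',')}"

url = build_search_url('"machine learning"', forms="8-K",
                       startdt="2026-01-01", enddt="2026-06-01")
print(url)
```

Whichever client sends it, the descriptive User-Agent header still has to go along with the request.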

What comes back

The response is the Elasticsearch result envelope, untouched. The shape you care about:

{
  "took": 305,
  "hits": {
    "total": { "value": 662, "relation": "eq" },
    "hits": [
      {
        "_id": "0001193125-26-032000:ionq-ex99_2.htm",
        "_source": {
          "ciks": ["0001824920"],
          "display_names": ["IonQ, Inc.  (IONQ)  (CIK 0001824920)"],
          "root_forms": ["8-K"],
          "form": "8-K",
          "file_date": "2026-01-30",
          "adsh": "0001193125-26-032000",
          "file_type": "EX-99.2",
          "sics": ["7373"],
          "biz_states": ["MD"]
        }
      }
    ]
  }
}

Two fields unlock everything else. The _id is {accession}:{filename} — split on the colon and you can build a direct link to the document. The adsh is the accession number with dashes, which is what you feed into the rest of EDGAR’s data endpoints.

To turn a hit into a clickable filing URL, strip the dashes from the accession number for the folder path:

def filing_url(hit):
    adsh, fname = hit["_id"].split(":", 1)
    cik = int(hit["_source"]["ciks"][0])  # drops leading zeros
    folder = adsh.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{fname}"
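
If you want the filing's landing page rather than one specific exhibit, the adsh field can be used the same way. The sketch below assumes the usual EDGAR Archives layout, where the index page sits in the same folder and is named after the dashed accession number; filing_index_url and sample_hit are illustrative names, with the sample values taken from the response shown earlier.

```python
def filing_index_url(hit):
    """Link to the filing's index page (assumes the standard Archives layout)."""
    adsh = hit["_source"]["adsh"]          # accession number, with dashes
    cik = int(hit["_source"]["ciks"][0])   # int() drops the leading zeros
    folder = adsh.replace("-", "")         # folder name is the undashed form
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{adsh}-index.htm"

sample_hit = {
    "_id": "0001193125-26-032000:ionq-ex99_2.htm",
    "_source": {"ciks": ["0001824920"], "adsh": "0001193125-26-032000"},
}
print(filing_index_url(sample_hit))
```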

Every field in the response, decoded

The partial _source above is enough to build links, but if you’re parsing filings programmatically you’ll hit fields the docs never explain. Here’s the full envelope from a real forms=8-K query, with the parts most people skip:

{
  "took": 4771,            // ES query time in ms — handy for spotting slow filters
  "timed_out": false,      // true means partial results; retry the request
  "_shards": { "total": 50, "successful": 50, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 150, "relation": "eq" },  // "eq" = exact; "gte" = capped count
    "max_score": 19.15,
    "hits": [ /* up to 100 documents, see below */ ]
  },
  "aggregations": {
    "form_filter":       { "buckets": [ { "key": "8-K", "doc_count": 150 } ] },
    "entity_filter":     { "buckets": [ /* top filers */ ] },
    "sic_filter":        { "buckets": [ /* industry codes */ ] },
    "biz_states_filter": { "buckets": [ /* HQ states */ ] }
  }
}

Two things here matter and aren’t obvious. First, hits.total.relation: when it reads "eq" the count is exact, but on broad queries it flips to "gte" and the value caps out — don’t treat it as a precise total past that point. Second, the aggregations block is a free faceted-search index. You can read form_filter, entity_filter, sic_filter, and biz_states_filter to build a filings dashboard without a single extra request — the counts come back on every query whether you asked for them or not.
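
Both points are easy to wrap in small helpers. This is a sketch under the assumption that the envelope looks like the one above; total_hits and facet_counts are names I made up for illustration, and the sample dict is trimmed to the keys they read.

```python
def total_hits(response):
    """Return (count, is_exact) from the hits.total block."""
    total = response["hits"]["total"]
    return total["value"], total["relation"] == "eq"

def facet_counts(response, facet):
    """Map bucket key -> doc_count for one aggregation, e.g. 'form_filter'."""
    buckets = response.get("aggregations", {}).get(facet, {}).get("buckets", [])
    return {b["key"]: b["doc_count"] for b in buckets}

sample = {
    "hits": {"total": {"value": 150, "relation": "eq"}, "hits": []},
    "aggregations": {
        "form_filter": {"buckets": [{"key": "8-K", "doc_count": 150}]},
    },
}

count, exact = total_hits(sample)
print(count, "exact" if exact else "lower bound")
print(facet_counts(sample, "form_filter"))
```

Returning the exactness flag alongside the count makes it harder to treat a capped value as the true total by accident.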

Now the part the search traffic actually wants — every field inside a hit’s _source:

"_source": {
  "ciks":          ["0001498148"],          // zero-padded CIK(s); int() to drop zeros
  "display_names": ["Artificial Intelligence Technology Solutions Inc.  (AITX)  (CIK 0001498148)"],
  "form":          "8-K",                    // exact form type
  "root_forms":    ["8-K"],                  // base type (8-K/A rolls up to 8-K)
  "file_date":     "2026-06-09",             // when it was filed (YYYY-MM-DD)
  "period_ending": "2026-06-09",             // reporting period end, not the filing date
  "adsh":          "0001062993-26-003112",   // accession number — the join key for EDGAR
  "file_type":     "EX-99.1",                // the specific exhibit/document type
  "file_description": "EXHIBIT 99.1",
  "sequence":      "2",                       // position of this doc within the filing
  "items":         ["2.02", "8.01", "9.01"], // 8-K item numbers — what the filing reports
  "sics":          ["7372"],                 // SIC industry code
  "biz_states":    ["MI"],                   // principal office state
  "biz_locations": ["Ferndale, MI"],
  "inc_states":    ["NV"],                   // state of incorporation
  "file_num":      ["000-55079"],
  "film_num":      ["261074480"],
  "xsl":           null
}

| Field | What it’s actually for |
| --- | --- |
| `adsh` | The accession number. This is the join key: feed it to `data.sec.gov` submission and XBRL endpoints to pull the rest of the filing. |
| `ciks` | Zero-padded company IDs. Wrap in `int()` for the Archives path; keep the padding for `data.sec.gov/submissions/CIK##########.json`. |
| `items` | 8-K item codes. This is the fast filter for event-driven work: `2.02` is earnings, `5.02` is an exec change, `1.01` is a material agreement. |
| `file_date` vs `period_ending` | Filing date vs the period the filing covers. For “what was disclosed today” you want `file_date`; for fundamentals you want `period_ending`. |
| `root_forms` | Use this, not `form`, when you want amendments grouped with originals (8-K/A under 8-K). |
| `display_names` | Pre-formatted “Name (TICKER) (CIK …)” string. Regex the ticker out instead of a second lookup. |
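
A few of those rows translate directly into helpers. The sketch below assumes display_names always ends in "(CIK ...)" with an optional ticker group just before it, as in the samples above; the regex and the function names are mine, not anything EDGAR defines, and entities without a ticker simply yield an empty list.

```python
import re

# Matches "(TICKER)  (CIK 0001234567)" at the end of a display name.
TICKER_RE = re.compile(r"\(([^()]+)\)\s+\(CIK \d+\)\s*$")

def tickers_from_display_name(name):
    """Pull the ticker(s) out of a display_names entry, or [] if none."""
    match = TICKER_RE.search(name)
    if not match:
        return []
    return [t.strip() for t in match.group(1).split(",")]

def has_item(hit, code):
    """True if an 8-K hit reports the given item code, e.g. '2.02'."""
    return code in (hit["_source"].get("items") or [])

def submissions_url(cik):
    """data.sec.gov submissions URL; this one needs the 10-digit padding."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

hit = {"_source": {
    "ciks": ["0001498148"],
    "display_names": ["Artificial Intelligence Technology Solutions Inc.  (AITX)  (CIK 0001498148)"],
    "items": ["2.02", "8.01", "9.01"],
}}
src = hit["_source"]
print(tickers_from_display_name(src["display_names"][0]))
print(has_item(hit, "2.02"), has_item(hit, "5.02"))
print(submissions_url(src["ciks"][0]))
```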

The pagination ceiling is worth restating in response terms: each request returns at most 100 documents in hits.hits, and you advance with from. The hits.total.value tells you how many to expect, so the loop is “while from < total, bump from by your page size.” The scraper below does exactly that.

A real scraper that paginates

Pagination is the one thing that trips people up. Each request returns up to 100 documents in hits.hits; there's no size parameter the backend honors past that, so you walk the result set with from. Step by 100, watch hits.total.value for when to stop, and you'll pull a full query cleanly. Here's a small client that does it and respects SEC's rate limits:

import time
import requests

EFTS = "https://efts.sec.gov/LATEST/search-index"
HEADERS = {"User-Agent": "orthogonal-research max@orthogonal.info"}

def search_all(q, forms=None, startdt=None, enddt=None, max_results=1000):
    """Collect up to max_results hits, walking the result set 100 at a time."""
    results = []
    offset = 0
    while offset < max_results:
        params = {"q": q, "from": offset}
        if forms:
            params["forms"] = forms
        if startdt:
            params["startdt"] = startdt
        if enddt:
            params["enddt"] = enddt

        r = requests.get(EFTS, params=params, headers=HEADERS, timeout=15)
        r.raise_for_status()
        body = r.json()["hits"]
        hits = body["hits"]
        if not hits:
            break
        results.extend(hits)
        offset += 100  # page size is 100
        if offset >= body["total"]["value"]:
            break  # reached the reported total, skip the extra request
        time.sleep(0.15)  # stay under ~10 req/sec
    return results[:max_results]

filings = search_all('"going concern"', forms="10-K",
                     startdt="2026-01-01", enddt="2026-06-01")
for f in filings:
    src = f["_source"]
    print(src["file_date"], src["form"], src["display_names"][0])

The time.sleep(0.15) keeps you under SEC’s documented limit of 10 requests per second. Go faster and you’ll get temporary IP blocks that last about ten minutes. There’s no X-RateLimit header to watch — the only signal is a sudden 403, so it’s better to throttle up front than to detect and back off.
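
A fixed sleep is the simplest throttle, but it ignores how long each request took. If you'd rather pace on elapsed time, something like the class below would do it. This is only a sketch: Throttle is a hypothetical helper, the 0.15 second gap is carried over from the scraper above, and the clock and sleep arguments exist purely so the behavior can be checked without real waiting.

```python
import time

class Throttle:
    """Enforce a minimum gap between calls (a sketch, not an SEC-provided tool)."""

    def __init__(self, min_interval=0.15, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block just long enough to keep calls min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle()
for _ in range(3):
    throttle.wait()  # call this right before each requests.get(...)
```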

The gotchas that cost me time

Phrase vs token search. A bare q=climate+risk matches documents containing “climate” OR “risk” anywhere. That returned 40x more noise than I expected. The quoted form q=%22climate+risk%22 is the exact phrase, and it’s what you almost always want.

The 10,000 result ceiling. Elasticsearch caps deep pagination. Once from passes 10,000 the endpoint errors out. If a query has more hits than that, narrow it with a tighter date range and stitch the windows together — there’s no scroll cursor exposed.
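
Splitting a range into windows is mechanical. Here is a minimal sketch, assuming startdt and enddt are both inclusive (which is why each window ends the day before the next one starts); date_windows is an illustrative name and the 30-day default is arbitrary. Shrink it until each window stays under the ceiling.

```python
from datetime import date, timedelta

def date_windows(startdt, enddt, days=30):
    """Yield (startdt, enddt) string pairs covering the range without overlap."""
    start = date.fromisoformat(startdt)
    end = date.fromisoformat(enddt)
    step = timedelta(days=days)
    one_day = timedelta(days=1)
    while start <= end:
        stop = min(start + step - one_day, end)  # clamp the last window
        yield start.isoformat(), stop.isoformat()
        start = stop + one_day

for window_start, window_end in date_windows("2026-01-01", "2026-03-15"):
    print(window_start, window_end)
```

Each pair can be passed straight through as the startdt and enddt parameters of a paginated query, and the per-window results concatenated.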

Full-text only covers 2001 onward. The full-text index starts in 2001. Older filings exist in EDGAR but won’t show up here. For anything pre-2001 you’re back to the structured submissions API.

It indexes exhibits, not just the main doc. A single 8-K can return several hits — one per attached exhibit. Dedupe on the accession number (adsh) if you only want one row per filing.
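
The dedupe itself is a few lines. This sketch keeps whichever hit arrives first for each accession number and drops the rest; one_per_filing is a made-up name, and the sample hits are trimmed to the single field it reads.

```python
def one_per_filing(hits):
    """Keep the first hit seen for each accession number (adsh)."""
    seen = set()
    unique = []
    for hit in hits:
        adsh = hit["_source"]["adsh"]
        if adsh not in seen:
            seen.add(adsh)
            unique.append(hit)
    return unique

hits = [
    {"_id": "0001062993-26-003112:ex99-1.htm", "_source": {"adsh": "0001062993-26-003112"}},
    {"_id": "0001062993-26-003112:form8k.htm", "_source": {"adsh": "0001062993-26-003112"}},
    {"_id": "0001193125-26-032000:ionq-ex99_2.htm", "_source": {"adsh": "0001193125-26-032000"}},
]
print(len(one_per_filing(hits)))
```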

Where this fits

I use this as the front door for a few projects: a script that flags new 8-K filings mentioning specific risk language, and an insider-buying alerter that cross-references full-text hits against Form 4 data. The full-text endpoint finds the filings; the structured EDGAR APIs pull the details. Pair it with the congressional trade tracker approach and you’ve got a decent picture of who’s filing what.

If you want to go deeper on parsing the filings you find, two books earned their shelf space for me. Python for Data Analysis by Wes McKinney is the reference I keep open when I’m reshaping messy filing data with pandas. And for the finance side of reading what’s actually in these documents, Financial Statement Analysis and Security Valuation is dense but it’s the one I reach for. Full disclosure: those are affiliate links — they don’t change the price, and I only link books I actually own.

The whole thing is one undocumented GET request returning clean JSON. No key, no cost. The SEC quietly shipped one of the better free financial data APIs and never put a docs page on it.

A quick plug: I run Alpha Signal, a free Telegram channel where I post market structure and data-driven trade ideas built on exactly this kind of public-filing intelligence. Worth a look if SEC data is your thing.
