Content Discovery

Find every hidden path, parameter, vhost, and file the app didn’t link to. Done well, this turns up a large share of the findings before any “real” testing starts.

T=https://target.com
WL=/usr/share/wordlists/seclists

i. Where it lives

Standard surfaces to brute-force:

Path enumeration (directories + files)
Parameter mining (GET, POST body, JSON keys, headers)
Vhost discovery (Host header tricks)
JS endpoint extraction (URLs and APIs referenced in JS bundles)
Subdomain enumeration (see Recon (external))
Archive endpoints (Wayback Machine, gau, waybackurls)

ii. Path fuzzing

ffuf is the default choice, fast and flexible:

ffuf -u "$T/FUZZ" -w "$WL/Discovery/Web-Content/raft-medium-directories.txt" \
  -c -t 100 -mc all -fc 404 -recursion -recursion-depth 2 -o ffuf.json -of json

Filter noise by size or words when a wildcard 200 hides real hits:

ffuf -u "$T/FUZZ" -w "$WL/Discovery/Web-Content/raft-medium-directories.txt" \
  -c -t 100 -fs 1234 -fw 567
## -fs filter-size, -fw filter-words, -fl filter-lines, -fr filter-regex
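One way to pick the value for -fs is to run a short probe first and see which response size dominates. The sketch below assumes the probe results have been flattened to "status size url" lines; the function name dominant_size and the file name probe.json are illustrative, not part of ffuf.

```shell
## Print the response size that occurs most often in "status size url" lines.
## A wildcard page usually accounts for almost every hit, so its size wins.
dominant_size() {
  awk '{ count[$2]++ }
       END {
         for (s in count) if (count[s] > max) { max = count[s]; best = s }
         print best
       }'
}

## Hypothetical usage, assuming a probe run saved with -o probe.json -of json
## and jq installed:
##   jq -r '.results[] | "\(.status) \(.length) \(.url)"' probe.json | dominant_size
```

If two sizes are close to tied, the wildcard page probably embeds the requested path, and -fw or -fl tends to filter it more reliably than -fs.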

Extension fuzzing (different extensions per app stack):

ffuf -u "$T/FUZZ.EXT" \
  -w "$WL/Discovery/Web-Content/raft-medium-words.txt:FUZZ" \
  -w ext.txt:EXT \
  -c -t 100 -fc 404
## ext.txt: php, asp, aspx, jsp, jspx, do, action, html, json, xml, txt, bak, old, swp

feroxbuster, recursive by default, great as a second-pass:

feroxbuster -u "$T" -w "$WL/Discovery/Web-Content/raft-medium-directories.txt" \
  -t 50 -x php,html,txt,json --depth 3 -o ferox.txt

Wordlist tiers (start small, escalate if needed):

quickhits.txt (~2.5k lines, high-signal paths, fast)
common.txt (~4.7k lines, ~1 min)
raft-medium-directories.txt (30k lines)
raft-large-directories.txt (~62k)
directory-list-2.3-medium.txt (220k, dirbuster classic)
SecLists Discovery/Web-Content/big.txt (~20k, a different source, useful as a cross-check)
Line counts drift between SecLists releases; confirm with wc -l.

iii. Vhost discovery

Same IP, different hostname = different app. Brute the Host header:

ffuf -u "$T" -H "Host: FUZZ.target.com" \
  -w "$WL/Discovery/DNS/subdomains-top1million-110000.txt" \
  -mc all -fs <baseline_size>
## Baseline size = the body size returned for a known-bad subdomain. Filter it out.
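The baseline can be measured rather than guessed. This is a minimal sketch, assuming curl is available; the function name vhost_baseline and the generated nx-... label are made up for illustration.

```shell
## Print the body size returned for a hostname that should not exist.
## $1 = URL to request, $2 = parent domain (both supplied by the caller).
vhost_baseline() {
  ## Process ID plus timestamp makes a label unlikely to be a real vhost.
  bogus="nx-$$-$(date +%s).$2"
  curl -s -o /dev/null -H "Host: $bogus" -w '%{size_download}' "$1"
}

## Hypothetical usage:
##   ffuf -u "$T" -H "Host: FUZZ.target.com" -w subs.txt -mc all \
##     -fs "$(vhost_baseline "$T" target.com)"
```

If the server echoes the Host header into the page, the size changes with the length of the name; in that case filtering by words or lines is usually the safer choice.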

gobuster vhost mode as alternative:

gobuster vhost -u "$T" -w "$WL/Discovery/DNS/subdomains-top1million-110000.txt" --append-domain

iv. Parameter mining

Apps almost always accept more parameters than they document. arjun finds them by reflection and behavior diffing:

arjun -u "$T/api/search" -m GET
arjun -u "$T/api/search" -m POST                          ## form-encoded body
arjun -u "$T/api/search" -m JSON -w /path/to/params.txt   ## JSON body, custom wordlist
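The behavior-diffing idea can be sketched by hand: send a control request with a parameter nobody would implement, send the candidate, and compare. This is a rough illustration and not how arjun works internally. It compares body size only and assumes the URL has no query string yet; the names param_changes_response and zzcontrol are hypothetical.

```shell
## Succeed (exit 0) if adding ?<param>=<canary> changes the response size
## compared with a control parameter that should mean nothing to the app.
## $1 = URL without a query string, $2 = candidate parameter name.
param_changes_response() {
  url=$1
  param=$2
  canary="zq$$"
  base=$(curl -s "$url?zzcontrol=$canary" | wc -c | tr -d ' ')
  probe=$(curl -s "$url?$param=$canary" | wc -c | tr -d ' ')
  [ "$base" -ne "$probe" ]
}

## Hypothetical usage:
##   for p in debug admin test id; do
##     param_changes_response "$T/api/search" "$p" && echo "interesting: $p"
##   done
```

Pages that reflect the full query string will differ on every request, so treat a hit as a lead to check by hand, not a confirmed parameter.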

x8, faster Rust alternative:

x8 -u "$T/search" -w "$WL/Discovery/Web-Content/burp-parameter-names.txt"

Param Miner (Burp extension) for unkeyed headers and cache poisoning, see WEB11 Web Cache Poisoning.

v. JS endpoint extraction

Modern apps live in JavaScript. Every endpoint they call is in there somewhere.

## gau (getallurls) and waybackurls combined
echo target.com | gau --threads 10 | tee gau.txt
echo target.com | waybackurls | tee -a gau.txt
## Just keep JS / API patterns:
grep -E '\.(js|json)(\?|$)|/api/|/v[0-9]/' gau.txt | sort -u > endpoints.txt

LinkFinder runs a regex over JS files to pull out URLs and paths:

linkfinder -i "$T/static/app.js" -o cli
## or across every JS file found on a domain (-o takes a file name, or cli for stdout):
linkfinder -i "$T" -d -o links.html
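The core of regex-based extraction fits in one pipeline. The sketch below is much cruder than LinkFinder's own pattern: it only catches quoted strings that start with a slash. The function name extract_paths is illustrative.

```shell
## Read JavaScript on stdin, print unique quoted strings that look like
## absolute paths (single- or double-quoted, starting with "/").
extract_paths() {
  grep -oE "[\"']/[A-Za-z0-9_./-]+[\"']" | tr -d "\"'" | sort -u
}

## Hypothetical usage:
##   curl -s "$T/static/app.js" | extract_paths
```

Template literals and paths assembled by string concatenation slip past this, which is one reason to run a real crawler alongside it.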

katana, modern crawler that handles SPA routing and JS:

katana -u "$T" -jc -kf all -o katana.txt
## -jc = JS crawling, -kf = known files (sitemap.xml, robots.txt etc)

gospider, fast Go crawler with form/cookie tracking:

gospider -s "$T" -d 3 -c 10 --robots --sitemap --js -o gospider/

hakrawler, focused on URL extraction:

echo "$T" | hakrawler -d 3

vi. Well-known files always to check

Hit these every time, they’re free findings:

for path in robots.txt sitemap.xml security.txt .well-known/security.txt humans.txt crossdomain.xml clientaccesspolicy.xml; do
  curl -sI "$T/$path" -o /dev/null -w "%{http_code} $path\n"
done

Dev / source exposure (the big ones):

## Git repo exposure:
curl -s "$T/.git/HEAD"         ## body starting "ref: refs/heads/..." = repo likely extractable
curl -s "$T/.git/config"       ## a [core] section confirms it; may also show remote URLs
## If exposed:
git-dumper "$T/.git/" ./loot/
## Other source files:
for f in '.env' '.env.local' '.env.production' '.env.development' '.env.example' \
         'docker-compose.yml' 'Dockerfile' 'docker-compose.override.yml' \
         '.DS_Store' '.htaccess' '.htpasswd' 'web.config' 'config.php.bak' \
         'wp-config.php.bak' 'composer.json' 'composer.lock' 'package.json' \
         'package-lock.json' '.npmrc' 'pom.xml' 'build.gradle' \
         'phpinfo.php' 'info.php' 'test.php' 'config.json'; do
  code=$(curl -sI "$T/$f" -o /dev/null -w "%{http_code}")
  [ "$code" = "200" ] && echo "OPEN: $T/$f"
done
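A 200 on its own proves little, because many apps answer every path with a 200 error page. For .git/HEAD the body can be checked instead. This is a minimal sketch; the function name looks_like_git_head is made up for illustration.

```shell
## Succeed if the first line on stdin looks like a real .git/HEAD file:
## either a symbolic ref ("ref: refs/heads/main") or a detached 40-hex SHA-1.
looks_like_git_head() {
  head -n 1 | grep -Eq '^(ref: refs/[A-Za-z0-9._/-]+|[0-9a-f]{40})[[:space:]]*$'
}

## Hypothetical usage:
##   curl -s "$T/.git/HEAD" | looks_like_git_head && echo "OPEN: $T/.git/"
```

The same approach works for the other files in the loop: look for KEY=value lines in .env, or a JSON object in package.json, before counting the hit.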

Backup file convention sweep (one filename, many extensions):

ffuf -u "$T/FUZZ" -w backup-ext-permutations.txt
## permutations: index.php.bak, index.bak.php, index.php~, index.php.swp, index.php.old, index.php.orig
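The permutation list can be generated from known filenames instead of being written by hand. A small sketch follows; the function name backup_permutations is hypothetical, and the variants are the ones listed above.

```shell
## Print common backup-name variants for each filename given as an argument.
backup_permutations() {
  for f in "$@"; do
    ## Suffix-style variants: index.php.bak, index.php~, ...
    printf '%s\n' "$f.bak" "$f~" "$f.swp" "$f.old" "$f.orig"
    ## Infix variant (index.bak.php), only when the name has an extension.
    case "$f" in
      *.*) printf '%s\n' "${f%.*}.bak.${f##*.}" ;;
    esac
  done
}

## Hypothetical usage:
##   backup_permutations index.php config.php > backup-ext-permutations.txt
```

Feeding it filenames already confirmed by path fuzzing keeps the list short and the hit rate high.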

.DS_Store exposure leaks directory contents:

ds_store_exposed.py "$T/.DS_Store"

vii. Common framework paths

Knowing the stack narrows the wordlist. Identify first with wappalyzer-cli, whatweb, httpx tech-detect, or just look at the response headers and HTML:

httpx -u "$T" -tech-detect -title -web-server -status-code -ip

By stack:

Java / Spring:

/actuator         /actuator/env       /actuator/heapdump   /actuator/mappings
/v2/api-docs      /swagger-ui.html    /console

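On Spring Boot < 1.5, actuator endpoints are unauthenticated by default and often expose env vars and heap dumps. 1.x serves them at the root (/env, /heapdump, /mappings); 2.x moves them under /actuator.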

Node.js / Express:

/api              /admin              /.env               /server.js
/node_modules

Django:

/admin            /api                /__debug__          /static/

DEBUG = True left on (the startproject default) leaks stack traces with settings and local variables, sometimes including creds.

Flask:

/console          /admin              /api                /static

/console with debug=True is the Werkzeug debugger: an interactive Python REPL, PIN-protected in current versions.

Tomcat:

/manager/html     /manager/text       /host-manager/html  /examples/

Default creds tomcat/tomcat, admin/admin, tomcat/s3cret.

Apache / .htaccess apps:

/server-status    /server-info        /cgi-bin/           /icons/

/server-status is normally restricted to localhost or internal IPs; when reachable, it leaks recent requests, including URLs and client IPs.

WordPress:

/wp-admin/        /wp-login.php       /wp-json/wp/v2/users    /wp-content/
/wp-config.php.bak

wp-json/wp/v2/users enumerates usernames on default installs.

Joomla / Drupal / etc: run cms-specific scanners (wpscan, joomscan, droopescan).

viii. Tricks worth knowing

Method spraying

Some endpoints behave differently per HTTP method:

for m in GET POST PUT DELETE PATCH OPTIONS HEAD TRACE; do
  echo "=== $m ==="
  ## -m 10: curl -X HEAD waits for a body that never arrives, so cap each request
  curl -s -m 10 -X "$m" "$T/api/users" -o /dev/null -w "%{http_code} %{size_download}\n"
done
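With more than a handful of endpoints, it helps to print only the methods that deviate from GET. The sketch below assumes one "METHOD CODE SIZE" line per request, which is slightly different from the loop's output; the function name methods_differing_from_get is illustrative.

```shell
## Read "METHOD CODE SIZE" lines on stdin and print every method whose
## status code differs from the GET line, in input order.
methods_differing_from_get() {
  awk '$1 == "GET" { base = $2 }
       { code[$1] = $2; order[++n] = $1 }
       END {
         for (i = 1; i <= n; i++)
           if (order[i] != "GET" && code[order[i]] != base)
             print order[i], code[order[i]]
       }'
}

## Hypothetical usage:
##   for m in GET POST PUT DELETE PATCH OPTIONS; do
##     printf '%s ' "$m"
##     curl -s -m 10 -X "$m" "$T/api/users" -o /dev/null -w '%{http_code} %{size_download}\n'
##   done | methods_differing_from_get
```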

Trailing slash and case sensitivity

## Same endpoint, different behavior:
curl -sI "$T/admin"      ## may 302 -> /admin/
curl -sI "$T/admin/"     ## the actual app
curl -sI "$T/Admin"      ## case-sensitivity on Linux backends
curl -sI "$T/ADMIN"

403 bypass attempts

When you find an interesting endpoint that returns 403, try nomore403 or bypass-403:

nomore403 -u "$T/admin"
## Header tricks:
curl -H "X-Original-URL: /admin" "$T/"
curl -H "X-Rewrite-URL: /admin" "$T/"
curl -H "X-Forwarded-For: 127.0.0.1" "$T/admin"
curl -H "X-Forwarded-Host: localhost" "$T/admin"
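## Path tricks (--path-as-is stops curl from collapsing /./ and /../ before sending):
curl --path-as-is "$T/admin/."
curl --path-as-is "$T/admin/.."
curl "$T/admin%20"
curl "$T/admin%09"
curl "$T/admin/?"
curl "$T/admin/%2e"
curl "$T//admin"
curl --path-as-is "$T/./admin"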
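The same path tricks can be generated for any blocked path and sprayed in a loop. This is a sketch of the variants listed above and nothing more exhaustive; the function name bypass_variants is hypothetical.

```shell
## Print path-mangling variants for a blocked path such as /admin.
bypass_variants() {
  p=${1#/}   ## drop the leading slash so it can be re-added consistently
  printf '%s\n' "/$p/." "/$p/.." "/$p%20" "/$p%09" "/$p/?" "/$p/%2e" "//$p" "/./$p"
}

## Hypothetical usage (--path-as-is keeps curl from normalising dot segments):
##   bypass_variants /admin | while read -r v; do
##     printf '%s ' "$v"
##     curl -s --path-as-is -o /dev/null -w '%{http_code}\n' "$T$v"
##   done
```

Anything that returns something other than the original 403 is worth opening in a browser or proxy.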

Same wordlist, multiple modes

Web content lists also work for parameters and headers. Try them across all three modes when stuck.

ix. Archive sources

The Wayback Machine often lists older endpoints that are no longer linked in production but are still served:

echo "target.com" | gau --providers wayback,otx,commoncrawl --threads 10 | tee gau.txt
echo "target.com" | waybackurls > wayback.txt
## Filter interesting:
grep -E '\.(env|bak|sql|tar|zip|gz)(\?|$)' gau.txt
grep -E '/admin|/api|/internal|/debug|/test' gau.txt
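Archived URLs also make a good target-specific wordlist once query strings and duplicates are stripped. A minimal sketch, assuming one absolute URL per line; the function name archive_paths and the output file name are made up for illustration.

```shell
## Read URLs on stdin, print unique paths without scheme, host, leading slash,
## query string, or fragment, ready to use as a wordlist for "$T/FUZZ".
archive_paths() {
  sed -e 's|^[a-z]*://[^/]*/*||' -e 's|[?#].*||' | grep -v '^$' | sort -u
}

## Hypothetical usage:
##   archive_paths < gau.txt > archive-wordlist.txt
##   ffuf -u "$T/FUZZ" -w archive-wordlist.txt -mc all -fc 404
```

Replaying these against the live host shows which of the archived endpoints still respond.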

URLscan.io and Shodan also archive HTTP responses, sometimes useful for finding old admin panels.

x. GitHub recon

The target’s own repos and developer accounts often leak paths and creds:

trufflehog github --org=target-inc --only-verified
## Search GitHub code search directly with dorks:
##   "target.com" filename:.env
##   "target.com" password
##   "target.com" extension:pem
## Or from Google:
##   site:github.com "target.com" api_key

xi. References

PayloadsAllTheThings - Directory and File Bruteforcing
SecLists - Discovery wordlists
Assetnote wordlists - high-quality per-tech wordlists
HackTricks - Pentesting Web
PortSwigger - Information disclosure

xii. Where it leads

Each new path is a new vuln class to check:

API endpoints → WEB02 Auth & Session then injection files
File uploads → WEB13 File Upload & LFI
Admin panels → default creds first, then WEB02 Auth & Session
Source exposure (.git, .env) → read every secret, validate keys per provider in 00 Cloud MOC
Debug endpoints → often direct RCE (Spring actuator/jolokia, Flask debug console, Rails console)
