Content Discovery
Find every hidden path, parameter, vhost, and file the app didn’t link to. Done well, this finds 80% of the bugs before any “real” testing.
T=https://target.com
WL=/usr/share/wordlists/seclists
i. Where it lives
Standard surfaces to brute-force:
- Path enumeration (directories + files)
- Parameter mining (GET, POST body, JSON keys, headers)
- Vhost discovery (Host header tricks)
- JS endpoint extraction (URLs and APIs referenced in JS bundles)
- Subdomain enumeration (see Recon (external))
- Archive endpoints (Wayback Machine, gau, hakrawler)
ii. Path fuzzing
ffuf is the default choice, being the fastest and most flexible:
ffuf -u "$T/FUZZ" -w "$WL/Discovery/Web-Content/raft-medium-directories.txt" \
-c -t 100 -mc all -fc 404 -recursion -recursion-depth 2 -o ffuf.json -of json
Filter noise by size or words when a wildcard 200 hides real hits:
ffuf -u "$T/FUZZ" -w "$WL/Discovery/Web-Content/raft-medium-directories.txt" \
-c -t 100 -fs 1234 -fw 567
## -fs filter-size, -fw filter-words, -fl filter-lines, -fr filter-regex
Extension fuzzing (different extensions per app stack):
ffuf -u "$T/FUZZ.EXT" \
-w "$WL/Discovery/Web-Content/raft-medium-words.txt:FUZZ" \
-w ext.txt:EXT \
-c -t 100 -fc 404
## ext.txt: php, asp, aspx, jsp, jspx, do, action, html, json, xml, txt, bak, old, swp
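The ext.txt file used above is not shipped with SecLists, so it has to be created by hand. A minimal sketch, assuming the extension list from the comment and one entry per line without a leading dot (the FUZZ.EXT URL template already supplies the dot):

```shell
# Write one extension per line, no leading dot (the URL template adds it).
printf '%s\n' php asp aspx jsp jspx do action html json xml txt bak old swp > ext.txt

# Sanity check: print how many extensions were written.
wc -l < ext.txt
```

Trimming the list to the detected stack (for example only php, bak, old on a PHP app) keeps the request count down, since ffuf tries every word with every extension.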
feroxbuster, recursive by default, great as a second-pass:
feroxbuster -u "$T" -w "$WL/Discovery/Web-Content/raft-medium-directories.txt" \
-t 50 -x php,html,txt,json --depth 3 -o ferox.txt
Wordlist tiers (start small, escalate if needed):
- quickhits.txt (60 lines, instant)
- common.txt (4600 lines, ~1 min)
- raft-medium-directories.txt (30k lines)
- raft-large-directories.txt (170k)
- directory-list-2.3-medium.txt (220k, dirbuster classic)
- SecLists Discovery/Web-Content/big.txt for the kitchen sink
iii. Vhost discovery
Same IP, different hostname = different app. Brute the Host header:
ffuf -u "$T" -H "Host: FUZZ.target.com" \
-w "$WL/Discovery/DNS/subdomains-top1million-110000.txt" \
-mc all -fs <baseline_size>
## Baseline size = the body size returned for a known-bad subdomain. Filter it out.
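One way to obtain the baseline size is to request a handful of random hostnames that almost certainly do not exist and take the most common body size. The sketch below assumes two hypothetical helpers (most_common_size, probe_bad_vhosts) and an arbitrary "nx-<hex>" label scheme; none of these come from ffuf itself.

```shell
T=${T:-https://target.com}

# Print the most frequent value among the sizes read from stdin
# (ties go to the smaller size).
most_common_size() {
  sort -n | uniq -c | sort -k1,1nr -k2,2n | awk 'NR == 1 { print $2 }'
}

# Request "$T" with five random Host labels and print each body size.
# Not called here; it needs network access to the target.
probe_bad_vhosts() {
  domain=$1
  for i in 1 2 3 4 5; do
    label="nx-$(od -An -N6 -tx1 /dev/urandom | tr -d ' \n')"
    curl -sk -o /dev/null -H "Host: $label.$domain" -w '%{size_download}\n' "$T"
  done
}

# Usage:
#   baseline=$(probe_bad_vhosts target.com | most_common_size)
#   ffuf -u "$T" -H "Host: FUZZ.target.com" -w <wordlist> -mc all -fs "$baseline"
```

If the five sizes all differ (for example because the hostname is reflected in the body), a single -fs value will not work and filtering by words or lines (-fw, -fl) is likely the better fit.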
gobuster vhost mode as alternative:
gobuster vhost -u "$T" -w "$WL/Discovery/DNS/subdomains-top1million-110000.txt" --append-domain
iv. Parameter mining
Apps almost always accept more parameters than they document. arjun finds them by reflection and behavior diffing:
arjun -u "$T/api/search" -m GET
arjun -u "$T/api/search" -m POST --headers "Content-Type: application/json"
arjun -u "$T/api/search" -m JSON -w /path/to/params.txt
x8, faster Rust alternative:
x8 -u "$T/search" -w "$WL/Discovery/Web-Content/burp-parameter-names.txt"
Param Miner (Burp extension) covers unkeyed headers and cache poisoning; see WEB11 Web Cache Poisoning.
v. JS endpoint extraction
Modern apps live in JavaScript. Every endpoint they call is in there somewhere.
## gau (getallurls) and waybackurls combined
echo target.com | gau --threads 10 | tee gau.txt
echo target.com | waybackurls | tee -a gau.txt
## Just keep JS / API patterns:
grep -E '\.(js|json)(\?|$)|/api/|/v[0-9]/' gau.txt | sort -u > endpoints.txt
LinkFinder regex for URLs inside JS files:
linkfinder -i "$T/static/app.js" -o cli
## or across a whole domain (-o takes either "cli" or an HTML output filename):
linkfinder -i "$T" -d -o links.html
katana, modern crawler that handles SPA routing and JS:
katana -u "$T" -jc -kf all -o katana.txt
## -jc = JS crawling, -kf = known files (sitemap.xml, robots.txt etc)
gospider, fast Go crawler with form/cookie tracking:
gospider -s "$T" -d 3 -c 10 --robots --sitemap --js -o gospider/
hakrawler, focused on URL extraction:
echo "$T" | hakrawler -d 3
vi. Well-known files always to check
Hit these every time, they’re free findings:
for path in robots.txt sitemap.xml security.txt .well-known/security.txt humans.txt crossdomain.xml clientaccesspolicy.xml; do
curl -sI "$T/$path" -o /dev/null -w "%{http_code} $path\n"
done
Dev / source exposure (the big ones):
## Git repo exposure:
curl -s "$T/.git/HEAD" ## 200 with a body like "ref: refs/heads/main" = full repo extractable
curl -sI "$T/.git/config"
## If exposed:
git-dumper "$T/.git/" ./loot/
## Other source files:
for f in '.env' '.env.local' '.env.production' '.env.development' '.env.example' \
'docker-compose.yml' 'Dockerfile' 'docker-compose.override.yml' \
'.DS_Store' '.htaccess' '.htpasswd' 'web.config' 'config.php.bak' \
'wp-config.php.bak' 'composer.json' 'composer.lock' 'package.json' \
'package-lock.json' '.npmrc' 'pom.xml' 'build.gradle' \
'phpinfo.php' 'info.php' 'test.php' 'config.json'; do
code=$(curl -sI "$T/$f" -o /dev/null -w "%{http_code}")
[ "$code" = "200" ] && echo "OPEN: $T/$f"
done
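A 200 status on its own can be a soft 404 or an SPA catch-all page, so it helps to look at the body before calling a file exposed. A minimal sketch, assuming two hypothetical helpers (is_git_head, is_dotenv) that read a response body on stdin; both are heuristics, not definitive checks:

```shell
# True if stdin looks like a real .git/HEAD: a symbolic ref or a bare commit hash.
is_git_head() {
  grep -Eq '^(ref: refs/[^[:space:]]+|[0-9a-f]{40})[[:space:]]*$'
}

# True if stdin has at least one KEY=value line, as a dotenv file would.
is_dotenv() {
  grep -Eq '^[[:space:]]*(export[[:space:]]+)?[A-Za-z_][A-Za-z0-9_]*='
}

# Usage against the target (not run here):
#   curl -s "$T/.git/HEAD" | is_git_head && echo "REAL: $T/.git/HEAD"
#   curl -s "$T/.env"      | is_dotenv   && echo "REAL: $T/.env"
```

The dotenv check can still match an HTML page that happens to contain a bare `name=value` line, so anything it flags is worth opening by hand.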
Backup file convention sweep (one filename, many extensions):
ffuf -u "$T/FUZZ" -w backup-ext-permutations.txt
## permutations: index.php.bak, index.bak.php, index.php~, index.php.swp, index.php.old, index.php.orig
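The backup-ext-permutations.txt wordlist has to be generated per target. A minimal sketch, assuming a hypothetical helper backup_perms that emits the six permutations named in the comment for each filename passed to it; the filenames index.php and config.php are only examples:

```shell
# Print the backup-name permutations for each filename given.
backup_perms() {
  for name in "$@"; do
    # Suffix-style backups: index.php.bak, index.php~, index.php.swp, ...
    printf '%s\n' "$name.bak" "$name~" "$name.swp" "$name.old" "$name.orig"
    # Infix-style backup (index.bak.php) only makes sense if there is an extension.
    case $name in
      *.*) printf '%s\n' "${name%.*}.bak.${name##*.}" ;;
    esac
  done
}

backup_perms index.php config.php > backup-ext-permutations.txt
```

Feeding it filenames already discovered during path fuzzing tends to be more productive than guessing, since backups usually sit next to the file they were copied from.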
.DS_Store exposure leaks directory contents:
ds_store_exposed.py "$T/.DS_Store"
vii. Common framework paths
Knowing the stack narrows the wordlist. Identify first with wappalyzer-cli, whatweb, httpx tech-detect, or just look at the response headers and HTML:
httpx -u "$T" -tech-detect -title -web-server -status-code -ip
By stack:
Java / Spring:
/actuator /actuator/env /actuator/heapdump /actuator/mappings
/v2/api-docs /swagger-ui.html /console
On Spring Boot < 1.5, the actuator endpoints often expose env vars and heap dumps with no auth.
Node.js / Express:
/api /admin /.env /server.js
/node_modules
Django:
/admin /api /__debug__ /static/
DEBUG = True (what startproject generates) left on in production leaks stack traces with config and creds.
Flask:
/console /admin /api /static
/console with debug=True = direct Python REPL.
Tomcat:
/manager/html /manager/text /host-manager/html /examples/
Default creds tomcat/tomcat, admin/admin, tomcat/s3cret.
Apache / .htaccess apps:
/server-status /server-info /cgi-bin/ /icons/
/server-status from internal IP leaks every recent request.
WordPress:
/wp-admin/ /wp-login.php /wp-json/wp/v2/users /wp-content/
/wp-config.php.bak
wp-json/wp/v2/users enumerates usernames on default installs.
Joomla / Drupal / etc: run cms-specific scanners (wpscan, joomscan, droopescan).
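The per-stack lists above can be kept as a small lookup so that only the relevant paths are requested once the stack is known. A minimal sketch, assuming a hypothetical helper stack_paths; the stack labels (spring, express, and so on) are this sketch's own names, and the paths are exactly the ones listed above:

```shell
# Print the paths for one stack, one per line; unknown stacks return non-zero.
stack_paths() {
  case $1 in
    spring)    set -- /actuator /actuator/env /actuator/heapdump /actuator/mappings \
                      /v2/api-docs /swagger-ui.html /console ;;
    express)   set -- /api /admin /.env /server.js /node_modules ;;
    django)    set -- /admin /api /__debug__ /static/ ;;
    flask)     set -- /console /admin /api /static ;;
    tomcat)    set -- /manager/html /manager/text /host-manager/html /examples/ ;;
    apache)    set -- /server-status /server-info /cgi-bin/ /icons/ ;;
    wordpress) set -- /wp-admin/ /wp-login.php /wp-json/wp/v2/users /wp-content/ \
                      /wp-config.php.bak ;;
    *) echo "unknown stack: $1" >&2; return 1 ;;
  esac
  printf '%s\n' "$@"
}

# Usage (not run here):
#   stack_paths spring | while read -r p; do
#     curl -sk -o /dev/null -w "%{http_code} $p\n" "$T$p"
#   done
```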
viii. Tricks worth knowing
Method spraying
Some endpoints behave differently per HTTP method:
for m in GET POST PUT DELETE PATCH OPTIONS HEAD TRACE; do
echo "=== $m ==="
curl -sX "$m" "$T/api/users" -o /dev/null -w "%{http_code} %{size_download}\n"
done
Trailing slash and case sensitivity
## Same endpoint, different behavior:
curl -sI "$T/admin" ## may 302 -> /admin/
curl -sI "$T/admin/" ## the actual app
curl -sI "$T/Admin" ## case-sensitivity on Linux backends
curl -sI "$T/ADMIN"
403 bypass attempts
When you find an interesting endpoint that returns 403, try nomore403 or bypass-403:
nomore403 -u "$T/admin"
## Header tricks:
curl -H "X-Original-URL: /admin" "$T/"
curl -H "X-Rewrite-URL: /admin" "$T/"
curl -H "X-Forwarded-For: 127.0.0.1" "$T/admin"
curl -H "X-Forwarded-Host: localhost" "$T/admin"
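## Path tricks (--path-as-is stops curl from collapsing dot segments before sending):
curl --path-as-is "$T/admin/."
curl --path-as-is "$T/admin/.."
curl "$T/admin%20"
curl "$T/admin%09"
curl "$T/admin/?"
curl --path-as-is "$T/admin/%2e"
curl "$T//admin"
curl --path-as-is "$T/./admin"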
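The path tricks above generalize to any blocked path. A minimal sketch, assuming a hypothetical helper bypass_paths that takes a path with a leading slash and prints the same eight variants:

```shell
# Print the path-trick variants for a path such as /admin.
bypass_paths() {
  p=$1
  printf '%s\n' \
    "$p/." "$p/.." "$p%20" "$p%09" "$p/?" "$p/%2e" \
    "/$p" "/.$p"
}

# Usage (not run here): --path-as-is keeps curl from normalizing dot segments.
#   bypass_paths /admin | while read -r v; do
#     curl -sk --path-as-is -o /dev/null -w "%{http_code} %{size_download} $v\n" "$T$v"
#   done
```

Comparing both status code and body size against the original 403 response makes it easier to spot a variant that reaches the backend rather than a differently worded block page.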
Same wordlist, multiple modes
Web content lists also work for parameters and headers. Try them across all three modes when stuck.
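As an illustration of the three modes, the sketch below prints one ffuf command each for path, GET parameter name, and header name fuzzing against the same endpoint with the same list. The helper name three_modes, the /search endpoint, and the placeholder values 1 and 127.0.0.1 are assumptions; the commands are printed rather than executed so they can be reviewed first.

```shell
T=${T:-https://target.com}
WL=${WL:-/usr/share/wordlists/seclists}
LIST="$WL/Discovery/Web-Content/raft-medium-words.txt"

# Print one ffuf command per mode for the given endpoint.
three_modes() {
  ep=$1
  # Mode 1: words as child paths
  printf 'ffuf -u "%s/FUZZ" -w "%s" -mc all -fc 404\n' "$T$ep" "$LIST"
  # Mode 2: words as GET parameter names
  printf 'ffuf -u "%s?FUZZ=1" -w "%s" -mc all\n' "$T$ep" "$LIST"
  # Mode 3: words as request header names
  printf 'ffuf -u "%s" -H "FUZZ: 127.0.0.1" -w "%s" -mc all\n' "$T$ep" "$LIST"
}

three_modes /search
```

In parameter and header mode nearly every word returns the same response, so a size or word filter (-fs, -fw) set to the unmodified response is usually needed before the output is readable.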
ix. Archive sources
The Wayback Machine often shows older endpoints that are no longer linked in production but are still alive:
echo "target.com" | gau --providers wayback,otx,commoncrawl --threads 10 | tee gau.txt
echo "target.com" | waybackurls > wayback.txt
## Filter interesting:
grep -E '\.(env|bak|sql|tar|zip|gz)(\?|$)' gau.txt
grep -E '/admin|/api|/internal|/debug|/test' gau.txt
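Archive output tends to contain the same path many times with different query strings. A minimal sketch, assuming a hypothetical helper uniq_paths, that collapses the list to unique paths so each one is probed once; the httpx call in the usage comment is one possible way to check what still answers:

```shell
# Strip query strings and fragments, then de-duplicate.
uniq_paths() {
  sed 's/[?#].*$//' | sort -u
}

# Usage (not run here): merge both archives, then probe what is still alive.
#   cat gau.txt wayback.txt | uniq_paths > archive-paths.txt
#   httpx -l archive-paths.txt -status-code -mc 200,401,403 -silent
```

Keep the original gau.txt as well, because the stripped query strings are a ready-made list of parameter names for the parameter mining step.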
URLscan.io and Shodan also archive HTTP responses, sometimes useful for finding old admin panels.
x. GitHub recon
The target’s own repos and developer accounts often leak paths and creds:
trufflehog github --org=target-inc --only-verified
## Search GitHub directly with dorks:
## "target.com" filename:.env
## "target.com" password
## "target.com" extension:pem
## site:github.com target.com api_key
xi. References
- PayloadsAllTheThings - Directory and File Bruteforcing
- SecLists - Discovery wordlists
- Assetnote wordlists - high-quality per-tech wordlists
- HackTricks - Pentesting Web
- PortSwigger - Information disclosure
xii. Where it leads
Each new path is a new vuln class to check:
- API endpoints → WEB02 Auth & Session then injection files
- File uploads → WEB13 File Upload & LFI
- Admin panels → default creds first, then WEB02 Auth & Session
- Source exposure (.git, .env) → read every secret, validate keys per provider in 00 Cloud MOC
- Debug endpoints → often direct RCE (Spring actuator/jolokia, Flask debug console, Rails console)