scrape tooling: live capture triage + master-server WS decoder + PlayFab REST scraper

Built and unit-tested ahead of a live playtest window:
- reverse/capture_hosts.py: pcap -> DNS/SNI/endpoints in order; extracts PlayFab TitleId,
  flags hologryph master-server region + config CDN.
- reverse/ws_scrape.py: TCP reassembly + RFC-6455 framing for the cleartext
  ws://<region>.hologryph.com/gameclient/ stream; decodes JSON/BSON/MessagePack; auto-labels
  ServerDto, CompartmentDefinitionDto, ResearchNodeJsonDto, OperationResult, etc. No MITM needed.
- reverse/playfab_scrape.py: LoginWithSteam (or captured EntityToken) -> Catalog/SearchItems
  (+ Inventory/TitleData); prices resolved to item names. Read-only.
- docs/SCRAPE_RUNBOOK.md: turnkey steps for when servers are online.
DownloadPizza
2026-06-12 10:06:48 +02:00
parent 5946e0910b
commit 3df0797acc
4 changed files with 653 additions and 0 deletions

docs/SCRAPE_RUNBOOK.md Normal file

@@ -0,0 +1,67 @@
# Live-scrape runbook — when a playtest is online
Everything below is read-only and runs outside the game process (no BattlEye interaction).
Tooling is built and unit-tested; the only thing that needs a live backend is the data itself.
## 0. Capture (once servers are up)
1. `ipconfig /flushdns` (so hostnames show as clean DNS queries, incl. the PlayFab TitleId).
2. Start a packet capture on the game NIC (Wireshark, or `pktmon`/`dumpcap`). Save as `.pcapng`.
- Master-server traffic is **cleartext `ws://` on port 80** — Wireshark reads it directly,
**no MITM/cert needed**.
- PlayFab is HTTPS/443 — to read its bodies you need your MITM (cert already installed) on 443,
or use the REST scraper (step 3) instead.
3. Launch SAND, **click through past the "no servers"/welcome dialog and let it log in**, then open
the screens whose data you want (walker editor → compartment defs; research tree; store → prices).
Keep capturing through it. Stop the capture.
## 1. Triage the capture → get the TitleId + confirm the master server
```bash
venv/bin/python reverse/capture_hosts.py <capture.pcapng>
```
Prints DNS/SNI/endpoints in order and a **BACKENDS DETECTED** block:
- `PlayFab host=<id>.playfabapi.com ** TitleId = <ID> **` ← the one constant the REST scraper needs
- `Master server host=<region>.hologryph.com (ws://80 cleartext)`
- `Config CDN host=sandconfigstorage…`
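The TitleId in the block above is simply the leftmost label of the PlayFab API hostname. A minimal sketch of that extraction, assuming hostnames have already been pulled from DNS/SNI records (the function name `title_id_from_host` is hypothetical and not necessarily what `capture_hosts.py` uses):

```python
import re
from typing import Optional

# PlayFab API hosts have the form <TitleId>.playfabapi.com (a trailing dot may
# appear in raw DNS names). The character class is an assumption about TitleIds.
PLAYFAB_HOST = re.compile(r"^([0-9a-z]+)\.playfabapi\.com\.?$", re.IGNORECASE)


def title_id_from_host(host: str) -> Optional[str]:
    """Return the upper-cased TitleId for a PlayFab API hostname, else None."""
    match = PLAYFAB_HOST.match(host.strip())
    return match.group(1).upper() if match else None
```

Any other hostname, including other `playfab` domains, returns `None` rather than a guess.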
## 2. Master server (compartments + research tree + server list) — cleartext, no auth replay
```bash
venv/bin/python reverse/ws_scrape.py <capture.pcapng> --out extracted/master_ws.json
```
Reassembles the port-80 WebSocket stream to `*.hologryph.com/gameclient/`, parses RFC-6455 frames, and
decodes each message (tries JSON → BSON → MessagePack; the game's `IDataSerializer` is most likely JSON).
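For orientation, RFC-6455 framing on an already-reassembled byte stream can be sketched as below. This is an illustrative sketch, not the code in `ws_scrape.py`: the names `iter_ws_frames` and `decode_payload` are hypothetical, fragmented messages (continuation frames) are not merged, and only the JSON attempt is shown because BSON and MessagePack need third-party libraries.

```python
import json
import struct
from typing import Any, Iterator, Tuple


def iter_ws_frames(buf: bytes) -> Iterator[Tuple[bool, int, bytes]]:
    """Yield (fin, opcode, payload) for each complete RFC 6455 frame in buf."""
    pos = 0
    while pos + 2 <= len(buf):
        b0, b1 = buf[pos], buf[pos + 1]
        fin, opcode = bool(b0 & 0x80), b0 & 0x0F
        masked, length = bool(b1 & 0x80), b1 & 0x7F
        cur = pos + 2
        if length == 126:  # 16-bit extended payload length
            if cur + 2 > len(buf):
                return
            (length,) = struct.unpack_from("!H", buf, cur)
            cur += 2
        elif length == 127:  # 64-bit extended payload length
            if cur + 8 > len(buf):
                return
            (length,) = struct.unpack_from("!Q", buf, cur)
            cur += 8
        key = b""
        if masked:  # client-to-server frames carry a 4-byte masking key
            if cur + 4 > len(buf):
                return
            key = buf[cur:cur + 4]
            cur += 4
        if cur + length > len(buf):
            return  # truncated frame: stop instead of yielding partial data
        payload = buf[cur:cur + length]
        if masked:
            payload = bytes(b ^ key[i % 4] for i, b in enumerate(payload))
        yield fin, opcode, payload
        pos = cur + length


def decode_payload(payload: bytes) -> Any:
    """Try JSON first; return the raw bytes unchanged if that fails."""
    try:
        return json.loads(payload.decode("utf-8"))
    except ValueError:  # covers UnicodeDecodeError and JSONDecodeError
        return payload
```

Client-to-server frames are masked and server-to-client frames are not, so the same parser covers both directions of the stream.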
Messages are auto-tagged when their shape matches a known DTO:
`ServerDto`, `RegionInfo`, **`CompartmentDefinitionDto`** (HP/Weight/Properties/prices),
**`ResearchNodeJsonDto`** (connections via `RequiredNodesIds`/`DependentNodesIds`, costs via
`ResearchPrice`), `ItemDto`/`ShopItemDto`/`PriceDto`, `OperationResult`, `IClientEvent`.
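Shape-based tagging of this kind can be sketched as a subset test on the keys of each decoded object. The signatures below are assumptions built from the field names mentioned above; the exact key spellings and the real matching rules in `ws_scrape.py` may differ, and `DTO_SIGNATURES` and `tag_message` are hypothetical names.

```python
from typing import Any, Optional

# Hypothetical signatures: a message is tagged with the first DTO whose keys
# are all present. Key spellings are assumed from the runbook text.
DTO_SIGNATURES = {
    "ResearchNodeJsonDto": {"RequiredNodesIds", "DependentNodesIds", "ResearchPrice"},
    "CompartmentDefinitionDto": {"HP", "Weight", "Properties"},
}


def tag_message(msg: Any) -> Optional[str]:
    """Return a DTO label for a decoded message, or None if nothing matches."""
    if isinstance(msg, list) and msg:
        msg = msg[0]  # a list of DTOs is tagged by its first element
    if not isinstance(msg, dict):
        return None
    keys = set(msg)
    for name, required in DTO_SIGNATURES.items():
        if required <= keys:  # every required key is present
            return name
    return None
```

Extra keys are tolerated, so a DTO that gains fields in a later build still matches.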
If it finds no WS stream, the capture probably did not span the master-server connection: re-capture
through the login, or override the defaults with `--port`/`--host`.
> On the first run, eyeball one frame to confirm the encoding (JSON vs BSON). The decoder already
> handles both; this is just a sanity check.
## 3. PlayFab prices / catalog / inventory
Either read them from the MITM'd 443 capture, **or** pull them directly (cleaner, gets the *full*
catalog, more than the client requests):
```bash
# with a Steam auth ticket (captured, or minted via Steamworks GetAuthSessionTicket):
venv/bin/python reverse/playfab_scrape.py --title-id <ID> --steam-ticket <hex> --catalog --inventory
# or skip login with an EntityToken lifted from your MITM capture:
venv/bin/python reverse/playfab_scrape.py --title-id <ID> --entity-token <tok> --catalog
```
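The two invocations above correspond roughly to the following PlayFab REST requests. This is a sketch under assumptions: the helper names are hypothetical, the request bodies show only the fields believed to be needed, and nothing here is taken from `playfab_scrape.py` itself. The builders only construct requests; `post` is defined but never called here.

```python
import json
import urllib.request
from typing import Dict, Optional, Tuple

Request = Tuple[str, Dict[str, str], bytes]  # (url, headers, body)


def login_with_steam_request(title_id: str, steam_ticket_hex: str) -> Request:
    """Build a Client/LoginWithSteam call; CreateAccount stays False (read-only)."""
    url = f"https://{title_id}.playfabapi.com/Client/LoginWithSteam"
    body = {"TitleId": title_id, "SteamTicket": steam_ticket_hex, "CreateAccount": False}
    return url, {"Content-Type": "application/json"}, json.dumps(body).encode()


def search_items_request(title_id: str, entity_token: str, count: int = 50,
                         continuation_token: Optional[str] = None) -> Request:
    """Build a Catalog/SearchItems call authenticated with an EntityToken."""
    url = f"https://{title_id}.playfabapi.com/Catalog/SearchItems"
    body = {"Count": count}
    if continuation_token:
        body["ContinuationToken"] = continuation_token  # next page of results
    headers = {"Content-Type": "application/json", "X-EntityToken": entity_token}
    return url, headers, json.dumps(body).encode()


def post(request: Request) -> dict:
    """Send a built request and return the parsed JSON response."""
    url, headers, body = request
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Keeping request construction separate from sending makes it possible to inspect exactly what would go over the wire before touching a live backend.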
`--catalog` → `extracted/playfab_catalog.json`: every item with `PriceOptions` (→ currency-item +
amount, names resolved via `extracted/item_names.json`) and `DisplayProperties` (check here for any
catalog-authored base stats). `--inventory` → wallet + items + transaction history. `--titledata` →
`Client/GetTitleData` config blobs. Read-only endpoints only: no write/purchase calls.
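Resolving prices to names amounts to walking each item's `PriceOptions` and looking up the currency item's id. A minimal sketch, assuming the catalog items use the `PriceOptions.Prices[].Amounts[]` layout with `ItemId`/`Amount` fields, and assuming `extracted/item_names.json` is a flat id-to-name mapping (its real layout is not shown here; `resolve_prices` is a hypothetical name):

```python
from typing import Dict, List, Tuple


def resolve_prices(item: dict, names: Dict[str, str]) -> List[List[Tuple[str, int]]]:
    """Return one list of (currency name, amount) pairs per price option."""
    options = []
    prices = (item.get("PriceOptions") or {}).get("Prices") or []
    for price in prices:
        amounts = price.get("Amounts") or []
        # Fall back to the raw ItemId when no name is known for it.
        options.append([(names.get(a["ItemId"], a["ItemId"]), a["Amount"]) for a in amounts])
    return options
```

Unknown ids are passed through rather than dropped, so gaps in the name table stay visible in the output.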
## Notes / unknowns to confirm live
- **WS payload encoding** (JSON vs BSON): decoder handles both; confirm on first capture.
- **Steam ticket reuse**: tickets are short-lived/single-use — if `--steam-ticket` fails, lift an
`EntityToken` from the MITM capture and use `--entity-token` instead.
- **Damage**: still server-computed; check `DisplayProperties` (catalog) and
`CompartmentDefinitionDto.Properties` (master server) for any base values — don't assume present.