
data set description:

There are two datasets: one from ClickHouse and one from nginx-cache. The ClickHouse dataset has more data and includes colo information, but it is missing the following fields:

  1. TTL
  2. zone plan
  3. cacheKey (objID)
  4. originReponseChunked
  5. range request

Because of the size of the ClickHouse dataset, it is sampled by object id, at rates from 1% to 40%.
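
The notes above do not say how the sampling by object id is implemented. One common way to do it, shown here only as a minimal sketch, is to hash each object id and keep the object when the hash falls below the sampling rate. The function name `keep_object` and the choice of MD5 as the bucketing hash are assumptions made for illustration, not the method used to produce this dataset.

```python
import hashlib


def keep_object(obj_id: str, rate: float) -> bool:
    """Return True if obj_id belongs to the sample at the given rate (0.0 to 1.0).

    Hypothetical helper: the dataset notes do not describe the real sampler.
    """
    digest = hashlib.md5(obj_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the object when its bucket falls inside the sampled fraction.
    return bucket < rate * 2**64


# Every request for the same object gets the same decision, so a sampled
# object keeps its complete request history.
sampled = [oid for oid in ("obj-a", "obj-b", "obj-c") if keep_object(oid, 0.40)]
```

Because the decision depends only on the object id, a sample taken at 1% is a subset of a sample taken at 40%, which keeps samples at different rates comparable.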

the ClickHouse data format

timestamp 
client country (anonymized)
coloId
obj_id (md5 of host+path+query) 
obj_id2 (md5 of host+path)
responseSize 
content type 
file extension 
n level (the number of '/' in path)
ttl (augmented from crawling; not always accurate, and still missing for 10% of the data)
age
cache status
method (get/purge)
zone id (md5 hash)
client request host (md5 hash)
edgehost (md5 hash)
referhost (md5 hash)
has query string (bool)
n param in query 
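
Several of the fields above are derived from the request URL. The following is a minimal sketch of how they could be computed. The helper name `derive_fields`, the output key names, and the exact concatenation fed to MD5 (host, path, and query joined with no separator, UTF-8 encoded) are assumptions; the notes only say "md5 of host+path+query" and "md5 of host+path".

```python
import hashlib
import posixpath
from urllib.parse import parse_qsl, urlsplit


def _md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def derive_fields(url: str) -> dict:
    """Compute URL-derived fields for one request (hypothetical helper)."""
    parts = urlsplit(url)
    host, path, query = parts.netloc, parts.path, parts.query
    return {
        # Assumed: plain concatenation, without the '?' separator.
        "obj_id": _md5_hex(host + path + query),
        "obj_id2": _md5_hex(host + path),
        "file_extension": posixpath.splitext(path)[1].lstrip("."),
        # n level: the number of '/' characters in the path.
        "n_level": path.count("/"),
        "has_query_string": bool(query),
        "n_param_in_query": len(parse_qsl(query, keep_blank_values=True)),
    }


fields = derive_fields("https://example.com/a/b/c.jpg?x=1&y=2")
```

Under these assumptions, obj_id2 groups all query-string variants of the same host and path, while obj_id distinguishes them.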

the nginx data format

client country
hot object (bool)
timestamp
obj_id
response body size
http method
content type
file extension
n level
expire ttl
age
cache status
zone id (md5 hash) 
zone plan 
chunked response
has query
n param in query
cache control
hostname (sha1 hash)
http range