Privacy Implications of Windows 10 Telemetry, Part 4: Processing of Raw Traffic Dumps

Posted on Jan 28, 2018

The Windows 10 telemetry traffic collection experiment is over. We collected 55,945,178,210 bytes of data, recorded continuously over 346 days, from 2017-02-15 to 2018-01-27.

Let’s discuss the tools and scripts needed to extract useful statistics from these raw traffic dumps. We will rely on tcpdump's built-in packet filtering, and on the standard UNIX tools grep, sed and awk to process the text extracted from these packets.

Wireshark

Wireshark is very useful on Windows for analyzing small traffic dumps, and it has command-line tools as well. Its downside is memory hunger: it is almost impossible to analyze large (> 8 GB) traffic dumps even on a powerful workstation with 64 GB of RAM.

Nevertheless, Wireshark has a useful tool to merge .pcap files, so let’s install it on the server:

# pkg install wireshark

Preprocessing

If some .pcap files end with incomplete/truncated packets (due to server power loss, unscheduled reboots, etc.), they can be repaired by rewriting them with tcpdump (tcpdump -r <input_file> -w <output_file>):

# tcpdump -r traffic.pcap.1 -w traffic-01.pcap
# tcpdump -r traffic.pcap.2 -w traffic-02.pcap
...
# tcpdump -r traffic.pcap.12 -w traffic-12.pcap
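
With many capture files, the same repair can be scripted instead of typed out. The following is a minimal sketch, assuming the rotated files are named traffic.pcap.N as above; padded_name is a hypothetical helper introduced here for illustration, not part of tcpdump:

```shell
# Hypothetical helper: map traffic.pcap.N to the zero-padded name
# traffic-NN.pcap, so that the repaired files sort in capture order.
padded_name() {
    printf 'traffic-%02d.pcap\n' "${1##*.}"
}

# Rewrite every rotated capture file. When the glob matches nothing, the
# pattern stays literal, fails the -e test, and the loop body is skipped.
for f in traffic.pcap.*; do
    [ -e "$f" ] || continue
    tcpdump -r "$f" -w "$(padded_name "$f")"
done
```

The zero padding matters only for readability and sort order; tcpdump itself does not care about the file names.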

Then the .pcap files can be merged with the mergecap tool from the Wireshark toolset. Unlike other Wireshark tools, mergecap is not memory-hungry:

# mergecap -w all.pcap traffic-01.pcap traffic-02.pcap traffic-03.pcap \
       traffic-04.pcap traffic-05.pcap traffic-06.pcap traffic-07.pcap \
       traffic-08.pcap traffic-09.pcap traffic-10.pcap traffic-11.pcap \
       traffic-12.pcap

Cleaning up:

# rm traffic.pcap.*
# rm traffic-*.pcap

Counting inbound and outbound traffic

The following commands can be used to split inbound and outbound traffic:

# tcpdump -r all.pcap -w inbound_only.pcap \
  '(dst net 172.21.97) and \
   (not(dst host 172.21.97.1)) and (not(dst host 172.21.97.255))'
# tcpdump -r all.pcap -w outbound_only.pcap \
  '(src net 172.21.97) and \
   (not(src host 172.21.97.1)) and (not(src host 172.21.97.255))'

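.pcap files produced by tcpdump start with a 24-byte file header, so the sizes of the resulting files inbound_only.pcap and outbound_only.pcap (less 24 bytes) can be used to estimate the percentages of inbound and outbound traffic. Each stored packet also carries a 16-byte record header, so these figures are estimates rather than exact on-the-wire byte counts.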
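
The percentage estimate can be computed directly from the two file sizes. This is a minimal sketch; traffic_share is a hypothetical helper name, and the only correction it applies is the 24-byte file header mentioned above:

```shell
# Hypothetical helper: print inbound/outbound shares from two .pcap file
# sizes ($1 = inbound size, $2 = outbound size, both in bytes), after
# subtracting the 24-byte file header from each.
traffic_share() {
    awk -v in_size="$1" -v out_size="$2" 'BEGIN {
        in_bytes = in_size - 24
        out_bytes = out_size - 24
        total = in_bytes + out_bytes
        if (total <= 0) exit 1   # nothing captured, avoid division by zero
        printf("inbound %.1f%% outbound %.1f%%\n",
               100 * in_bytes / total, 100 * out_bytes / total)
    }'
}

# Usage with the files produced above (skipped when they are absent).
if [ -f inbound_only.pcap ] && [ -f outbound_only.pcap ]; then
    traffic_share "$(wc -c < inbound_only.pcap)" "$(wc -c < outbound_only.pcap)"
fi
```

For example, traffic_share 1024 3024 prints "inbound 25.0% outbound 75.0%".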

Splitting traffic by protocol type

.pcap files can be split further by protocol type (dns, http, https, everything else).

Splitting Windows 10 outbound telemetry traffic by protocol type (port numbers are stored in network byte order, so tcp[2] · 256 + tcp[3] is the destination TCP port number and udp[2] · 256 + udp[3] is the destination UDP port number; for example, 443 = 1 · 256 + 0xBB):

# tcpdump -r outbound_only.pcap -w outbound_https.pcap \
    '((tcp[2] == 1)and(tcp[3] == 0xBB))'
# tcpdump -r outbound_only.pcap -w outbound_http.pcap \
    '((tcp[2] == 0)and(tcp[3] == 80))'
# tcpdump -r outbound_only.pcap -w outbound_dns.pcap \
    '((udp[2] == 0)and(udp[3] == 53))'
# tcpdump -r outbound_only.pcap -w outbound_else.pcap \
    'not( ((tcp[2] == 1)and(tcp[3] == 0xBB)) or \
          ((tcp[2] == 0)and(tcp[3] == 80)) or \
          ((udp[2] == 0)and(udp[3] == 53)) )'

Splitting Windows 10 inbound telemetry traffic by protocol type (tcp[0] · 256 + tcp[1] is the source TCP port number, udp[0] · 256 + udp[1] is the source UDP port number):

# tcpdump -r inbound_only.pcap -w inbound_https.pcap \
    '((tcp[0] == 1)and(tcp[1] == 0xBB))'
# tcpdump -r inbound_only.pcap -w inbound_http.pcap \
    '((tcp[0] == 0)and(tcp[1] == 80))'
# tcpdump -r inbound_only.pcap -w inbound_dns.pcap \
    '((udp[0] == 0)and(udp[1] == 53))'
# tcpdump -r inbound_only.pcap -w inbound_else.pcap \
    'not( ((tcp[0] == 1)and(tcp[1] == 0xBB)) or \
          ((tcp[0] == 0)and(tcp[1] == 80)) or \
          ((udp[0] == 0)and(udp[1] == 53)) )'

The sizes of the resulting files (less 24 bytes) can be used to estimate the percentages of the various protocol types in the recorded traffic.
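
The byte comparisons in the filters above follow from splitting a 16-bit port number into its high and low bytes. The sketch below generates such a filter term for any port; port_filter is a hypothetical helper name. The pcap filter syntax also accepts the shorter forms `tcp dst port 443` and `tcp[2:2] == 443`, which should select the same packets.

```shell
# Hypothetical helper: build the byte-wise tcpdump filter term for a port.
# $1 = protocol (tcp or udp)
# $2 = offset of the port field in the header (0 = source, 2 = destination)
# $3 = port number
port_filter() {
    printf '((%s[%d] == %d)and(%s[%d] == %d))\n' \
        "$1" "$2" $(( $3 / 256 )) "$1" $(( $2 + 1 )) $(( $3 % 256 ))
}

port_filter tcp 2 443   # destination port 443 (https)
port_filter udp 0 53    # source port 53 (dns)
```

The first call prints ((tcp[2] == 1)and(tcp[3] == 187)), where 187 is 0xBB; the second prints ((udp[0] == 0)and(udp[1] == 53)).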

Timeline derivation

Besides recording and splitting traffic, tcpdump can print summary information for each packet:

14:46:57.786911 IP 172.21.97.193.61764 > 172.21.97.1.53: 5306+ A? clientconfig.passport.net. (43)
14:46:57.944420 IP 172.21.97.1.53 > 172.21.97.193.61764: 5306 4/0/0 CNAME auth.msa.akadns.net., CNAME auth.gfx.ms.edgekey.net., CNAME e8318.g.akamaiedge.net., A 172.227.131.16 (156)
14:46:57.966532 IP 172.21.97.193.49431 > 172.227.131.16.80: Flags [S], seq 2228456890, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
14:46:58.053248 IP 172.227.131.16.80 > 172.21.97.193.49431: Flags [S.], seq 1358959154, ack 2228456891, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
14:46:58.053374 IP 172.21.97.193.49431 > 172.227.131.16.80: Flags [.], ack 1, win 513, length 0
14:46:58.053566 IP 172.21.97.193.49431 > 172.227.131.16.80: Flags [P.], seq 1:325, ack 1, win 513, length 324: HTTP: GET /ppcrlconfig600.bin HTTP/1.0

But this output is overly verbose. Instead, we need a brief timeline like this:

DNSQ 5306 clientconfig.passport.net.
DNSR 5306 172.227.131.16
HTTP 172.227.131.16

The brief timeline should contain DNS requests, DNS responses, and the initial (SYN) packets of outbound TCP connections for the http and https protocols.

The brief timeline can be derived from the raw traffic dump using the following script:

#!/usr/local/bin/bash
tcpdump -r all.pcap -w dns_and_outbound_http_https_synonly.pcap \
    '(port 53) or ((tcp[2] == 1)and(tcp[3] == 0xBB)and((tcp[13] & 0x12) == 0x02)) or ((tcp[2] == 0)and(tcp[3] == 80)and((tcp[13] & 0x12) == 0x02))'
tcpdump -n -r dns_and_outbound_http_https_synonly.pcap | \
    grep -Eio '(> ([0-9]+\.){4}53: [0-9]+\+ A\? [0-9A-Za-z\.\-]+ )|(([0-9]+\.){4}53 > ([0-9]+\.){4}[0-9]+: [0-9]+ .+A ([0-9]+\.){3}[0-9]+)|(> ([0-9]+\.){4}443: )|(> ([0-9]+\.){4}80: )' | \
    sed -E -e 's|> ([0-9]+\.){4}53: ([0-9]+)\+ A\? ([0-9A-Za-z\.\-]+) |DNSQ \2 \3|g' \
           -e 's|([0-9]+\.){4}53 > ([0-9]+\.){4}[0-9]+: ([0-9]+) .+A (([0-9]+\.){3}[0-9]+)|DNSR \3 \4|g' \
           -e 's|> (([0-9]+\.){3}[0-9]+)\.443: |HTTPS \1|g' \
           -e 's|> (([0-9]+\.){3}[0-9]+)\.80: |HTTP \1|g' > \
    timeline_dnsq_dnsr_http_https.txt

A few comments about this script:

preliminary filtering is necessary to discard everything except DNS packets and the initial SYN packets of http/https TCP connections
the -n option keeps tcpdump from slowing down by resolving every IP address via reverse DNS lookup
the grep and sed pipeline selects the relevant packets and rewrites them in a compact form, as shown in the example above
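
The SYN-only part of the filter, (tcp[13] & 0x12) == 0x02, tests the TCP flags byte: SYN is bit 0x02 and ACK is bit 0x10, so the mask 0x12 keeps both bits and the comparison accepts only packets with SYN set and ACK clear. A small sketch of the same arithmetic, with is_initial_syn as a hypothetical helper name:

```shell
# Hypothetical helper: succeed when a TCP flags byte (tcp[13]) has SYN set
# and ACK clear, mirroring the filter term (tcp[13] & 0x12) == 0x02.
is_initial_syn() {
    [ $(( $1 & 0x12 )) -eq 2 ]
}

# 0x02 = SYN, 0x12 = SYN+ACK, 0x10 = ACK, 0x18 = PSH+ACK
for flags in 0x02 0x12 0x10 0x18; do
    if is_initial_syn "$flags"; then
        echo "$flags: initial SYN, kept"
    else
        echo "$flags: dropped"
    fi
done
```

Only 0x02 is kept. The SYN+ACK reply of the server (0x12) is dropped, which is why each connection contributes exactly one timeline record.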

Timeline postprocessing

DNS queries and responses can be folded for further simplification of the traffic timeline:

# cat timeline_dnsq_dnsr_http_https.txt | \
     awk '{if($1=="DNSQ"){dns[$2]=$3}else if($1=="DNSR"){d=dns[$2];if(d!=""){printf("DNS %s %s\n",d,$3);}}else{print($0);}}' > \
     timeline_dns_http_https.txt

DNS clientconfig.passport.net. 172.227.131.16
HTTP 172.227.131.16

Finally, DNS information can be combined with information about http/https requests:

# cat timeline_dns_http_https.txt | \
     awk '{if($1=="DNS"){dns[$3]=$2;}else{d=dns[$2];if(d!=""){printf("%s %s\n",$1,d);}}}' > \
     timeline_http_https.txt

HTTP clientconfig.passport.net.
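
When the intermediate file timeline_dns_http_https.txt is not needed, the two awk passes above can be combined. This is a sketch of one possible single-pass variant rather than the script used for the reports; resolve_timeline is a hypothetical name. It reads the raw DNSQ/DNSR/HTTP/HTTPS records and prints each connection with the domain that most recently resolved to its IP address:

```shell
# Hypothetical helper: read DNSQ/DNSR/HTTP/HTTPS records on stdin and print
# each connection with the domain name that last resolved to its IP address.
resolve_timeline() {
    awk '
        $1 == "DNSQ" { query[$2] = $3; next }         # query id -> domain
        $1 == "DNSR" {                                # response: id and IP
            if ($2 in query) domain[$3] = query[$2]   # IP -> domain
            next
        }
        ($2 in domain) { printf("%s %s\n", $1, domain[$2]) }
    '
}

resolve_timeline <<'EOF'
DNSQ 5306 clientconfig.passport.net.
DNSR 5306 172.227.131.16
HTTP 172.227.131.16
EOF
```

For the sample records this prints "HTTP clientconfig.passport.net.". As in the two-pass version, connections to addresses that never appeared in a DNS response are silently dropped.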

Report generation

Final reports can be generated using this script:

#!/usr/local/bin/bash
cat timeline_dns_http_https.txt | grep -F 'DNS ' | cut -d\  -f2 | sort | uniq -c | \
  awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_domains.txt
cat timeline_dns_http_https.txt | grep -F 'DNS ' | cut -d\  -f3 | sort | uniq -c | \
  awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_ips.txt
UNIQUE_IPS=$(cat top_ips.txt | wc -l)
UNIQUE_DOMAINS=$(cat top_domains.txt | wc -l)
echo "Unique IPs / domains:       $UNIQUE_IPS / $UNIQUE_DOMAINS"

cat timeline_dns_http_https.txt | grep -F 'HTTPS ' | cut -d\  -f2 | sort | uniq -c | \
  awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_https_ips.txt
cat timeline_http_https.txt | grep -F 'HTTPS ' | cut -d\  -f2 | sort | uniq -c | \
  awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_https_domains.txt
UNIQUE_HTTPS_IPS=$(cat top_https_ips.txt | wc -l)
UNIQUE_HTTPS_DOMAINS=$(cat top_https_domains.txt | wc -l)
echo "Unique https IPs / domains: $UNIQUE_HTTPS_IPS / $UNIQUE_HTTPS_DOMAINS"

cat timeline_dns_http_https.txt | grep -F 'HTTP ' | cut -d\  -f2 | sort | uniq -c | \
  awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_http_ips.txt
cat timeline_http_https.txt | grep -F 'HTTP ' | cut -d\  -f2 | sort | uniq -c | \
  awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_http_domains.txt
UNIQUE_HTTP_IPS=$(cat top_http_ips.txt | wc -l)
UNIQUE_HTTP_DOMAINS=$(cat top_http_domains.txt | wc -l)
echo "Unique http IPs / domains:  $UNIQUE_HTTP_IPS / $UNIQUE_HTTP_DOMAINS"
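
The six counting pipelines in this script differ only in the record type and the field number, so they can be factored into one function. A minimal sketch, with top_counts as a hypothetical helper name; it sorts with `sort -rn` so that the ordering is numeric and does not depend on the padding or the collation locale:

```shell
# Hypothetical helper: count occurrences of one field among records of a
# given type and list them, most frequent first.
# $1 = record type (DNS, HTTP or HTTPS), $2 = field number, stdin = timeline
top_counts() {
    grep -F "$1 " | cut -d' ' -f"$2" | sort | uniq -c | \
        awk '{printf("%8d %s\n",$1,$2)}' | sort -rn
}

# Usage with the timeline file produced above (skipped when it is absent).
if [ -f timeline_dns_http_https.txt ]; then
    top_counts DNS 2 < timeline_dns_http_https.txt > top_domains.txt
    top_counts DNS 3 < timeline_dns_http_https.txt > top_ips.txt
fi
```

The trailing space in the grep pattern is what keeps 'HTTP ' from also matching HTTPS records.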

Traffic breakdown by AS (autonomous systems)

Assuming we have an IP-to-AS database in the file ip2as.txt in the following format:

IP2AS <ip_address_1> <as_number_1>
IP2AS <ip_address_2> <as_number_2>
...
IP2AS <ip_address_N> <as_number_N>

we can also generate a traffic breakdown report by AS number. This applies to both incoming and outgoing traffic.

First of all, we need to prepare the timelines, collating adjacent records that share the same IP address:

# tcpdump -q -e -n -r outbound_only.pcap '(port 53) or ((tcp[2] == 1)and(tcp[3] == 0xBB)) or ((tcp[2] == 0)and(tcp[3] == 80))' | \
    grep -Eio 'IPv4, length [0-9]+: ([0-9]+\.){4}[0-9]+ > ([0-9]+\.){4}[0-9]+' | \
    sed -E -e 's|IPv4, length ([0-9]+): ([0-9]+\.){4}[0-9]+ > (([0-9]+\.){3}[0-9]+)\.[0-9]+|OUT \3 \1|g' | \
    grep -vF '172.21.97.1' | \
    awk 'BEGIN{a="";c=0;}{if(a!=$2){if(a!=""){printf("OUT %s %d\n",a,c);};a=$2;c=$3;}else{c=c+$3;}}END{if(a!=""){printf("OUT %s %d\n",a,c)}}' > \
    traffic_outbound.txt
# tcpdump -q -e -n -r inbound_only.pcap '(port 53) or ((tcp[0] == 1)and(tcp[1] == 0xBB)) or ((tcp[0] == 0)and(tcp[1] == 80))' | \
    grep -Eio 'IPv4, length [0-9]+: ([0-9]+\.){4}[0-9]+ > ([0-9]+\.){4}[0-9]+' | \
    sed -E -e 's|IPv4, length ([0-9]+): (([0-9]+\.){3}[0-9]+)\.[0-9]+ > ([0-9]+\.){4}[0-9]+|IN \2 \1|g' | \
    grep -vF '172.21.97.1' | \
    awk 'BEGIN{a="";c=0;}{if(a!=$2){if(a!=""){printf("IN %s %d\n",a,c);};a=$2;c=$3;}else{c=c+$3;}}END{if(a!=""){printf("IN %s %d\n",a,c)}}' > \
    traffic_inbound.txt

Timelines will look like this (IN or OUT, IP address, and byte count):

IN 88.221.113.75 461
IN 172.227.139.113 66

OUT 88.221.113.75 397
OUT 23.40.1.157 66
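
Note that the collation merges only adjacent records: a later burst to the same address starts a new record, which preserves the order of events. The one-line awk program used above can be written out as follows. This is an expanded sketch of the same logic, with collate_adjacent as a hypothetical name and illustrative packet sizes in the sample input:

```shell
# Hypothetical helper: sum the byte counts of consecutive records that share
# the same IP address. stdin lines have the form "<IN|OUT> <ip> <bytes>".
collate_adjacent() {
    awk '
        $2 != ip {                     # address changed: flush previous run
            if (ip != "") printf("%s %s %d\n", dir, ip, bytes)
            dir = $1; ip = $2; bytes = 0
        }
        { bytes += $3 }                # accumulate the current run
        END { if (ip != "") printf("%s %s %d\n", dir, ip, bytes) }
    '
}

collate_adjacent <<'EOF'
OUT 88.221.113.75 66
OUT 88.221.113.75 331
OUT 23.40.1.157 66
OUT 88.221.113.75 54
EOF
```

The four sample packets collapse into three records: 397 bytes to 88.221.113.75, 66 bytes to 23.40.1.157, and a separate 54-byte record for the later packet to 88.221.113.75.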

Windows 10 telemetry traffic breakdown reports by AS number can be generated as follows:

# cat ip2as.txt traffic_outbound.txt | \
    awk '{if($1=="IP2AS"){ip2as[$2]=$3};if($1=="OUT"){as=ip2as[$2];if(as!=""){printf("%s %s %d\n",$1,as,$3)}}}' | \
    sort | awk 'BEGIN{a="";c=0;}{if(a!=$2){if(a!=""){printf("OUT %s %d\n",a,c);};a=$2;c=$3;}else{c=c+$3;}}END{if(a!=""){printf("OUT %s %d\n",a,c)}}' | \
    awk '{printf("%12d %s\n",$3,$2);}' | sort -r > top_as_out.txt
# cat ip2as.txt traffic_inbound.txt | \
    awk '{if($1=="IP2AS"){ip2as[$2]=$3};if($1=="IN"){as=ip2as[$2];if(as!=""){printf("%s %s %d\n",$1,as,$3)}}}' | \
    sort | awk 'BEGIN{a="";c=0;}{if(a!=$2){if(a!=""){printf("IN %s %d\n",a,c);};a=$2;c=$3;}else{c=c+$3;}}END{if(a!=""){printf("IN %s %d\n",a,c)}}' | \
    awk '{printf("%12d %s\n",$3,$2);}' | sort -r > top_as_in.txt
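
The same per-AS totals can also be accumulated in an awk array, which avoids the intermediate sort and the adjacent-record collation. The following is a sketch of that alternative, not the script used for the reports; as_breakdown is a hypothetical name, and the addresses and AS numbers in the sample input are placeholders taken from the ranges reserved for documentation:

```shell
# Hypothetical helper: sum bytes per AS.
# $1 = direction tag (IN or OUT)
# stdin = IP2AS records followed by traffic records
as_breakdown() {
    awk -v dir="$1" '
        $1 == "IP2AS" { as_of[$2] = $3; next }                # IP -> AS
        $1 == dir && ($2 in as_of) { total[as_of[$2]] += $3 }
        END { for (a in total) printf("%12d %s\n", total[a], a) }
    ' | sort -rn
}

as_breakdown OUT <<'EOF'
IP2AS 192.0.2.1 AS64500
IP2AS 192.0.2.2 AS64500
IP2AS 198.51.100.7 AS64501
OUT 192.0.2.1 397
OUT 198.51.100.7 1000
OUT 192.0.2.2 66
OUT 203.0.113.9 500
EOF
```

The sample yields 1000 bytes for AS64501 and 463 bytes for AS64500; the record for 203.0.113.9 is dropped because it has no IP2AS entry. With the real files the call would be `cat ip2as.txt traffic_outbound.txt | as_breakdown OUT > top_as_out.txt`.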