Privacy Implications of Windows 10 Telemetry, Part 4: Processing of Raw Traffic Dumps
The Windows 10 telemetry traffic collection experiment is over. We collected 55,945,178,210 bytes of data, recorded continuously over 346 days, from 2017-02-15 to 2018-01-27.
Let’s discuss the tools and scripts we need to extract useful statistics from these raw traffic dumps. We will rely on the built-in packet filtering of tcpdump, and also on the standard grep, sed and awk UNIX tools for processing the text information extracted from these packets.
Wireshark
Wireshark is very useful on Windows for analyzing small traffic dumps, and it has command-line tools as well. Its downside is memory consumption: it is almost impossible to analyze large (> 8 GB) traffic dumps, even on a powerful workstation with 64 GB of RAM.
Nevertheless, Wireshark has a useful tool to merge .pcap files, so let’s install it on the server:
# pkg install wireshark
Preprocessing
If there are .pcap files with incomplete/truncated packets at the end (due to server power loss, unscheduled reboots, etc.), they can be repaired by reading and rewriting them with tcpdump (tcpdump -r <input_file> -w <output_file>):
# tcpdump -r traffic.pcap.1 -w traffic-01.pcap
# tcpdump -r traffic.pcap.2 -w traffic-02.pcap
...
# tcpdump -r traffic.pcap.12 -w traffic-12.pcap
Then multiple .pcap files can be merged with the mergecap tool from the Wireshark toolset. Unlike other Wireshark tools, mergecap is not memory hungry:
# mergecap -w all.pcap traffic-01.pcap traffic-02.pcap traffic-03.pcap \
traffic-04.pcap traffic-05.pcap traffic-06.pcap traffic-07.pcap \
traffic-08.pcap traffic-09.pcap traffic-10.pcap traffic-11.pcap \
traffic-12.pcap
Cleaning up:
# rm traffic.pcap.*
# rm traffic-*.pcap
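The repair and merge steps above can also be scripted as a loop. The following is a minimal sketch, assuming the input files are named traffic.pcap.1 to traffic.pcap.12 as in the commands above; fixed_name and repair_all are hypothetical helper names, not part of any tool.

```shell
#!/bin/sh
# fixed_name N: output file name for input traffic.pcap.N,
# zero-padded to two digits as in the commands above.
fixed_name() {
    printf 'traffic-%02d.pcap\n' "$1"
}

# repair_all COUNT: rewrite traffic.pcap.1 .. traffic.pcap.COUNT with
# "tcpdump -r ... -w ...", the same repair step as shown above.
repair_all() {
    i=1
    while [ "$i" -le "$1" ]; do
        tcpdump -r "traffic.pcap.$i" -w "$(fixed_name "$i")"
        i=$((i + 1))
    done
}

# Usage (commented out, since it needs the real capture files):
#   repair_all 12
#   mergecap -w all.pcap traffic-??.pcap
```

By default mergecap orders packets by timestamp, so passing the files through a shell glob instead of listing them one by one should give the same merged file.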
Counting inbound and outbound traffic
The following commands can be used to split inbound and outbound traffic:
# tcpdump -r all.pcap -w inbound_only.pcap \
'(dst net 172.21.97) and \
(not(dst host 172.21.97.1)) and (not(dst host 172.21.97.255))'
# tcpdump -r all.pcap -w outbound_only.pcap \
'(src net 172.21.97) and \
(not(src host 172.21.97.1)) and (not(src host 172.21.97.255))'
.pcap files produced by tcpdump have a 24-byte header, so the sizes of the resulting files inbound_only.pcap and outbound_only.pcap (less 24 bytes) can be used to estimate the percentages of inbound and outbound traffic.
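The size-based estimate described above can be sketched as a small shell helper. The function names payload_bytes and traffic_share are hypothetical. Note that the sizes still include the 16-byte record header that the pcap format stores in front of every packet, so the result is an approximation.

```shell
#!/bin/sh
# payload_bytes FILE: size of a .pcap file minus the 24-byte file header.
payload_bytes() {
    size=$(wc -c < "$1")
    echo $((size - 24))
}

# traffic_share FILE_A FILE_B: percentage of FILE_A in the combined size
# of both files (file headers excluded), with one decimal place.
traffic_share() {
    a=$(payload_bytes "$1")
    b=$(payload_bytes "$2")
    awk -v a="$a" -v b="$b" 'BEGIN {
        if (a + b > 0) printf("%.1f\n", 100 * a / (a + b))
        else print "n/a"
    }'
}

# Usage (commented out, since it needs the real capture files):
#   traffic_share inbound_only.pcap outbound_only.pcap
#   traffic_share outbound_only.pcap inbound_only.pcap
```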
Splitting traffic by protocol type
.pcap files can be split further by protocol type (dns, http, https, everything else).
Splitting Windows 10 outbound telemetry traffic by protocol type (tcp[2] · 256 + tcp[3] is the destination TCP port number, udp[2] · 256 + udp[3] is the destination UDP port number; ports are stored in network byte order, so 443 = 1 · 256 + 0xBB):
# tcpdump -r outbound_only.pcap -w outbound_https.pcap \
'((tcp[2] == 1)and(tcp[3] == 0xBB))'
# tcpdump -r outbound_only.pcap -w outbound_http.pcap \
'((tcp[2] == 0)and(tcp[3] == 80))'
# tcpdump -r outbound_only.pcap -w outbound_dns.pcap \
'((udp[2] == 0)and(udp[3] == 53))'
# tcpdump -r outbound_only.pcap -w outbound_else.pcap \
'not( ((tcp[2] == 1)and(tcp[3] == 0xBB)) or \
((tcp[2] == 0)and(tcp[3] == 80)) or \
((udp[2] == 0)and(udp[3] == 53)) )'
Splitting Windows 10 inbound telemetry traffic by protocol type (tcp[0] · 256 + tcp[1] is the source TCP port number, udp[0] · 256 + udp[1] is the source UDP port number):
# tcpdump -r inbound_only.pcap -w inbound_https.pcap \
'((tcp[0] == 1)and(tcp[1] == 0xBB))'
# tcpdump -r inbound_only.pcap -w inbound_http.pcap \
'((tcp[0] == 0)and(tcp[1] == 80))'
# tcpdump -r inbound_only.pcap -w inbound_dns.pcap \
'((udp[0] == 0)and(udp[1] == 53))'
# tcpdump -r inbound_only.pcap -w inbound_else.pcap \
'not( ((tcp[0] == 1)and(tcp[1] == 0xBB)) or \
((tcp[0] == 0)and(tcp[1] == 80)) or \
((udp[0] == 0)and(udp[1] == 53)) )'
Sizes of resulting files (less 24 bytes) can be used to estimate percentages of various protocol types in the recorded traffic data.
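The byte values used in the filters above follow from the way a 16-bit port number is stored in the TCP and UDP headers: high byte first, low byte second. A minimal sketch of that arithmetic follows; port_bytes and port_from_bytes are hypothetical helper names.

```shell
#!/bin/sh
# port_bytes PORT: print the high and low byte of a 16-bit port number.
# The high byte is what the filters compare against tcp[2] (or tcp[0]),
# the low byte is what they compare against tcp[3] (or tcp[1]).
port_bytes() {
    printf '%d %d\n' $(( $1 / 256 )) $(( $1 % 256 ))
}

# port_from_bytes HIGH LOW: the inverse operation, HIGH * 256 + LOW.
port_from_bytes() {
    echo $(( $1 * 256 + $2 ))
}

# Usage:
#   port_bytes 443          prints "1 187" (187 is 0xBB)
#   port_from_bytes 0 53    prints "53"
```

The tcpdump filter language also accepts two-byte reads and port primitives, so for IPv4 traffic an expression such as tcp[2:2] == 443 or tcp dst port 443 should select the same packets as the byte-wise comparison.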
Timeline derivation
Besides recording and splitting traffic, tcpdump can print summary information for each packet:
14:46:57.786911 IP 172.21.97.193.61764 > 172.21.97.1.53: 5306+ A? clientconfig.passport.net. (43)
14:46:57.944420 IP 172.21.97.1.53 > 172.21.97.193.61764: 5306 4/0/0 CNAME auth.msa.akadns.net., CNAME auth.gfx.ms.edgekey.net., CNAME e8318.g.akamaiedge.net., A 172.227.131.16 (156)
14:46:57.966532 IP 172.21.97.193.49431 > 172.227.131.16.80: Flags [S], seq 2228456890, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
14:46:58.053248 IP 172.227.131.16.80 > 172.21.97.193.49431: Flags [S.], seq 1358959154, ack 2228456891, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
14:46:58.053374 IP 172.21.97.193.49431 > 172.227.131.16.80: Flags [.], ack 1, win 513, length 0
14:46:58.053566 IP 172.21.97.193.49431 > 172.227.131.16.80: Flags [P.], seq 1:325, ack 1, win 513, length 324: HTTP: GET /ppcrlconfig600.bin HTTP/1.0
But this information is overly verbose. Instead, we need a brief timeline like this:
DNSQ 5306 clientconfig.passport.net.
DNSR 5306 172.227.131.16
HTTP 172.227.131.16
Brief timeline records should contain DNS requests, DNS responses, and initial (SYN) packets of outbound TCP connections for http and https protocols.
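An initial SYN packet can be recognized by the TCP flags byte, which sits at offset 13 of the TCP header: the SYN bit (0x02) is set and the ACK bit (0x10) is clear. A minimal sketch of that test in shell arithmetic follows; is_initial_syn is a hypothetical helper name.

```shell
#!/bin/sh
# is_initial_syn FLAGS: succeed if the TCP flags byte has SYN (0x02) set
# and ACK (0x10) clear, i.e. (FLAGS & 0x12) == 0x02.
is_initial_syn() {
    [ $(( $1 & 0x12 )) -eq 2 ]
}

# Usage:
#   is_initial_syn 0x02 && echo "SYN: first packet of a connection"
#   is_initial_syn 0x12 || echo "SYN+ACK: reply from the server"
#   is_initial_syn 0x18 || echo "PSH+ACK: data packet"
```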
The brief timeline can be derived from the raw traffic dump using the following script:
#!/usr/local/bin/bash
tcpdump -r all.pcap -w dns_and_outbound_http_https_synonly.pcap \
'(port 53) or ((tcp[2] == 1)and(tcp[3] == 0xBB)and((tcp[13] & 0x12) == 0x02)) or ((tcp[2] == 0)and(tcp[3] == 80)and((tcp[13] & 0x12) == 0x02))'
tcpdump -n -r dns_and_outbound_http_https_synonly.pcap | \
grep -Eio '(> ([0-9]+\.){4}53: [0-9]+\+ A\? [0-9A-Za-z\.\-]+ )|(([0-9]+\.){4}53 > ([0-9]+\.){4}[0-9]+: [0-9]+ .+A ([0-9]+\.){3}[0-9]+)|(> ([0-9]+\.){4}443: )|(> ([0-9]+\.){4}80: )' | \
sed -E -e 's|> ([0-9]+\.){4}53: ([0-9]+)\+ A\? ([0-9A-Za-z\.\-]+) |DNSQ \2 \3|g' \
-e 's|([0-9]+\.){4}53 > ([0-9]+\.){4}[0-9]+: ([0-9]+) .+A (([0-9]+\.){3}[0-9]+)|DNSR \3 \4|g' \
-e 's|> (([0-9]+\.){3}[0-9]+)\.443: |HTTPS \1|g' \
-e 's|> (([0-9]+\.){3}[0-9]+)\.80: |HTTP \1|g' > \
timeline_dnsq_dnsr_http_https.txt
A few comments about this script:
- preliminary filtering is necessary to throw away all non-DNS packets and all TCP packets that are not http/https SYN packets
- the -n option prevents tcpdump from slowing down by resolving every IP address via reverse DNS lookup
- the pipeline of grep and sed selects information about the relevant packets and rewrites it in a compact way, as demonstrated by the example above
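The text-rewriting stage can be tried without a capture file by feeding it a few lines of tcpdump output. The sketch below wraps the same grep and sed expressions in a function and runs them on three of the sample lines shown earlier; to_timeline and sample_lines are hypothetical names.

```shell
#!/bin/sh
# to_timeline: read "tcpdump -n" text lines on stdin and print
# DNSQ / DNSR / HTTP / HTTPS timeline records.
to_timeline() {
    grep -Eio '(> ([0-9]+\.){4}53: [0-9]+\+ A\? [0-9A-Za-z\.\-]+ )|(([0-9]+\.){4}53 > ([0-9]+\.){4}[0-9]+: [0-9]+ .+A ([0-9]+\.){3}[0-9]+)|(> ([0-9]+\.){4}443: )|(> ([0-9]+\.){4}80: )' |
    sed -E \
        -e 's|> ([0-9]+\.){4}53: ([0-9]+)\+ A\? ([0-9A-Za-z\.\-]+) |DNSQ \2 \3|g' \
        -e 's|([0-9]+\.){4}53 > ([0-9]+\.){4}[0-9]+: ([0-9]+) .+A (([0-9]+\.){3}[0-9]+)|DNSR \3 \4|g' \
        -e 's|> (([0-9]+\.){3}[0-9]+)\.443: |HTTPS \1|g' \
        -e 's|> (([0-9]+\.){3}[0-9]+)\.80: |HTTP \1|g'
}

# sample_lines: a DNS query, its response, and an outbound SYN to port 80.
sample_lines() {
    cat <<'EOF'
14:46:57.786911 IP 172.21.97.193.61764 > 172.21.97.1.53: 5306+ A? clientconfig.passport.net. (43)
14:46:57.944420 IP 172.21.97.1.53 > 172.21.97.193.61764: 5306 4/0/0 CNAME auth.msa.akadns.net., CNAME auth.gfx.ms.edgekey.net., CNAME e8318.g.akamaiedge.net., A 172.227.131.16 (156)
14:46:57.966532 IP 172.21.97.193.49431 > 172.227.131.16.80: Flags [S], seq 2228456890, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
EOF
}

sample_lines | to_timeline
```

Run on these three lines, the function should print the DNSQ, DNSR and HTTP records of the brief timeline example. Lines that were not pre-filtered (for example later packets of the same TCP connection) would also produce HTTP records, which is why the tcpdump filter stage comes first.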
Timeline postprocessing
DNS queries and responses can be folded for further simplification of the traffic timeline:
# cat timeline_dnsq_dnsr_http_https.txt | \
awk '{if($1=="DNSQ"){dns[$2]=$3}else if($1=="DNSR"){d=dns[$2];if(d!=""){printf("DNS %s %s\n",d,$3);}}else{print($0);}}' > \
timeline_dns_http_https.txt
DNS clientconfig.passport.net. 172.227.131.16
HTTP 172.227.131.16
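The folding one-liner above is dense, so here is the same logic as a commented sketch. The function name fold_dns and the array name pending are hypothetical. Like the one-liner, it keys queries by DNS query ID, so a reused ID overwrites the earlier query.

```shell
#!/bin/sh
# fold_dns: pair each DNSR record with the DNSQ record that has the same
# query ID and print "DNS <domain> <ip>"; other records pass through.
fold_dns() {
    awk '
        # Remember the queried domain under its query ID.
        $1 == "DNSQ" { pending[$2] = $3; next }
        # Emit a folded record only if the matching query was seen.
        $1 == "DNSR" {
            if ($2 in pending)
                printf("DNS %s %s\n", pending[$2], $3)
            next
        }
        # HTTP and HTTPS records are copied unchanged.
        { print }
    '
}

# Usage:
#   fold_dns < timeline_dnsq_dnsr_http_https.txt > timeline_dns_http_https.txt
```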
Finally, DNS information can be combined with information about http/https requests:
# cat timeline_dns_http_https.txt | \
awk '{if($1=="DNS"){dns[$3]=$2;}else{d=dns[$2];if(d!=""){printf("%s %s\n",$1,d);}}}' > \
timeline_http_https.txt
HTTP clientconfig.passport.net.
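The second awk step can be written out in the same way. This is a sketch with hypothetical names (resolve_hosts, name). As in the one-liner, a connection to an IP address that has not appeared in an earlier DNS record is dropped, and a later DNS record for the same IP address replaces the earlier domain.

```shell
#!/bin/sh
# resolve_hosts: replace the IP address in HTTP / HTTPS records with the
# domain name most recently resolved to that address.
resolve_hosts() {
    awk '
        # Remember the domain name for each resolved IP address.
        $1 == "DNS" { name[$3] = $2; next }
        # Print connections only if the address is known from DNS.
        ($2 in name) { printf("%s %s\n", $1, name[$2]) }
    '
}

# Usage:
#   resolve_hosts < timeline_dns_http_https.txt > timeline_http_https.txt
```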
Report generation
Final reports can be generated using this script:
#!/usr/local/bin/bash
# DNS lookups: counts per domain name and per resolved IP address
cat timeline_dns_http_https.txt | grep -F 'DNS ' | cut -d' ' -f2 | sort | uniq -c | \
awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_domains.txt
cat timeline_dns_http_https.txt | grep -F 'DNS ' | cut -d' ' -f3 | sort | uniq -c | \
awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_ips.txt
UNIQUE_IPS=$(cat top_ips.txt | wc -l)
UNIQUE_DOMAINS=$(cat top_domains.txt | wc -l)
echo "Unique IPs / domains: $UNIQUE_IPS / $UNIQUE_DOMAINS"
# https connections: counts per destination IP address and per domain name
cat timeline_dns_http_https.txt | grep -F 'HTTPS ' | cut -d' ' -f2 | sort | uniq -c | \
awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_https_ips.txt
cat timeline_http_https.txt | grep -F 'HTTPS ' | cut -d' ' -f2 | sort | uniq -c | \
awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_https_domains.txt
UNIQUE_HTTPS_IPS=$(cat top_https_ips.txt | wc -l)
UNIQUE_HTTPS_DOMAINS=$(cat top_https_domains.txt | wc -l)
echo "Unique https IPs / domains: $UNIQUE_HTTPS_IPS / $UNIQUE_HTTPS_DOMAINS"
# http connections: counts per destination IP address and per domain name
cat timeline_dns_http_https.txt | grep -F 'HTTP ' | cut -d' ' -f2 | sort | uniq -c | \
awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_http_ips.txt
cat timeline_http_https.txt | grep -F 'HTTP ' | cut -d' ' -f2 | sort | uniq -c | \
awk '{printf("%8d %s\n",$1,$2)}' | sort -r > top_http_domains.txt
UNIQUE_HTTP_IPS=$(cat top_http_ips.txt | wc -l)
UNIQUE_HTTP_DOMAINS=$(cat top_http_domains.txt | wc -l)
echo "Unique http IPs / domains: $UNIQUE_HTTP_IPS / $UNIQUE_HTTP_DOMAINS"
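The script above repeats one counting pipeline six times with a different record type and field. A minimal sketch of that pipeline as a reusable function follows; top_field is a hypothetical name. It sorts with sort -rn rather than sort -r, an assumption made here so that the ordering does not depend on the padding or on the locale.

```shell
#!/bin/sh
# top_field TAG FIELD: read timeline records on stdin, keep the records
# of type TAG, and print how often each value of field FIELD occurs,
# most frequent first.
top_field() {
    grep -F "$1 " | cut -d' ' -f"$2" | sort | uniq -c |
    awk '{printf("%8d %s\n",$1,$2)}' | sort -rn
}

# Usage:
#   top_field DNS 2   < timeline_dns_http_https.txt > top_domains.txt
#   top_field DNS 3   < timeline_dns_http_https.txt > top_ips.txt
#   top_field HTTPS 2 < timeline_http_https.txt     > top_https_domains.txt
```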
Traffic breakdown by AS (autonomous systems)
Assuming we have an IP-to-AS database in the ip2as.txt file in the following format:
IP2AS <ip_address_1> <as_number_1>
IP2AS <ip_address_2> <as_number_2>
...
IP2AS <ip_address_N> <as_number_N>
we can also generate a traffic breakdown report by AS number. This applies to both incoming and outgoing traffic.
First of all, we need to prepare timelines, collating adjacent records for matching IP addresses:
# tcpdump -q -e -n -r outbound_only.pcap '(port 53) or ((tcp[2] == 1)and(tcp[3] == 0xBB)) or ((tcp[2] == 0)and(tcp[3] == 80))' | \
grep -Eio 'IPv4, length [0-9]+: ([0-9]+\.){4}[0-9]+ > ([0-9]+\.){4}[0-9]+' | \
sed -E -e 's|IPv4, length ([0-9]+): ([0-9]+\.){4}[0-9]+ > (([0-9]+\.){3}[0-9]+)\.[0-9]+|OUT \3 \1|g' | \
grep -vF '172.21.97.1' | \
awk 'BEGIN{a="";c=0;}{if(a!=$2){if(a!=""){printf("OUT %s %d\n",a,c);};a=$2;c=$3;}else{c=c+$3;}}END{if(a!=""){printf("OUT %s %d\n",a,c)}}' > \
traffic_outbound.txt
# tcpdump -q -e -n -r inbound_only.pcap '(port 53) or ((tcp[0] == 1)and(tcp[1] == 0xBB)) or ((tcp[0] == 0)and(tcp[1] == 80))' | \
grep -Eio 'IPv4, length [0-9]+: ([0-9]+\.){4}[0-9]+ > ([0-9]+\.){4}[0-9]+' | \
sed -E -e 's|IPv4, length ([0-9]+): (([0-9]+\.){3}[0-9]+)\.[0-9]+ > ([0-9]+\.){4}[0-9]+|IN \2 \1|g' | \
grep -vF '172.21.97.1' | \
awk 'BEGIN{a="";c=0;}{if(a!=$2){if(a!=""){printf("IN %s %d\n",a,c);};a=$2;c=$3;}else{c=c+$3;}}END{if(a!=""){printf("IN %s %d\n",a,c)}}' > \
traffic_inbound.txt
Timelines will look like this (IN or OUT, IP address, and byte count):
IN 88.221.113.75 461
IN 172.227.139.113 66
OUT 88.221.113.75 397
OUT 23.40.1.157 66
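The collation step at the end of both pipelines sums the byte counts of consecutive records that share an IP address. It can be sketched as a standalone function and tried on a few hand-written records; collate_adjacent is a hypothetical name, and the addresses and byte counts in the usage example are made up.

```shell
#!/bin/sh
# collate_adjacent: read "<TAG> <ip> <bytes>" records on stdin and merge
# runs of consecutive records with the same IP address into one record
# whose byte count is the sum of the run.
collate_adjacent() {
    awk '
        # A new IP address ends the current run: print it and start over.
        $2 != ip {
            if (ip != "") printf("%s %s %d\n", tag, ip, sum)
            tag = $1; ip = $2; sum = 0
        }
        { sum += $3 }
        END { if (ip != "") printf("%s %s %d\n", tag, ip, sum) }
    '
}

# Usage:
#   printf 'OUT 192.0.2.1 300\nOUT 192.0.2.1 97\nOUT 192.0.2.2 66\n' | collate_adjacent
```

Only adjacent records are merged, so an address that reappears later in the capture starts a new record. This keeps the output a timeline rather than a per-address total.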
Windows 10 telemetry traffic breakdown reports by AS number can be generated as follows:
# cat ip2as.txt traffic_outbound.txt | \
awk '{if($1=="IP2AS"){ip2as[$2]=$3};if($1=="OUT"){as=ip2as[$2];if(as!=""){printf("%s %s %d\n",$1,as,$3)}}}' | \
sort | awk 'BEGIN{a="";c=0;}{if(a!=$2){if(a!=""){printf("OUT %s %d\n",a,c);};a=$2;c=$3;}else{c=c+$3;}}END{if(a!=""){printf("OUT %s %d\n",a,c)}}' | \
awk '{printf("%12d %s\n",$3,$2);}' | sort -r > top_as_out.txt
# cat ip2as.txt traffic_inbound.txt | \
awk '{if($1=="IP2AS"){ip2as[$2]=$3};if($1=="IN"){as=ip2as[$2];if(as!=""){printf("%s %s %d\n",$1,as,$3)}}}' | \
sort | awk 'BEGIN{a="";c=0;}{if(a!=$2){if(a!=""){printf("IN %s %d\n",a,c);};a=$2;c=$3;}else{c=c+$3;}}END{if(a!=""){printf("IN %s %d\n",a,c)}}' | \
awk '{printf("%12d %s\n",$3,$2);}' | sort -r > top_as_in.txt
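As a variation on the commands above, the per-AS totals can be accumulated in an awk associative array, which avoids the intermediate sort and adjacent-record collation. This is a sketch of a different technique, not the pipeline used for the reports; as_totals is a hypothetical name, and it assumes the ip2as.txt format described earlier.

```shell
#!/bin/sh
# as_totals TAG IP2AS_FILE TRAFFIC_FILE: sum the byte counts of all TAG
# records (IN or OUT) per AS number and print "<bytes> <as>" lines,
# largest total first. Addresses missing from the database are skipped.
as_totals() {
    awk -v tag="$1" '
        # First file: build the IP-to-AS lookup table.
        $1 == "IP2AS" { asn[$2] = $3; next }
        # Second file: add the bytes of each record to the total of its AS.
        $1 == tag && ($2 in asn) { total[asn[$2]] += $3 }
        END { for (a in total) printf("%12d %s\n", total[a], a) }
    ' "$2" "$3" | sort -rn
}

# Usage:
#   as_totals OUT ip2as.txt traffic_outbound.txt > top_as_out.txt
#   as_totals IN  ip2as.txt traffic_inbound.txt  > top_as_in.txt
```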