Request for feedback on crypto privacy protections of geolocation data

Chris Peterson-12
I'm looking for some feedback on crypto privacy protections for a
geolocation research project I'm working on with the Mozilla Services
team. If you have general questions or suggestions about the project,
I'm happy to answer them, but I'd like to focus this thread on crypto.

Our team is prototyping a crowd-sourced version of Google's Street View
cars to correlate Wi-Fi access points and cell towers to GPS positions.
Our primary motivation is to provide non-proprietary location services
for Firefox OS devices. We would also like to publish this location data
for researchers or other projects that might have novel uses for it.

Google's Location Service prevents people from tracking individual
access points by requiring requests to include at least 2-3 access
points that Google knows are near each other. This "proves" the
requester is near the access points.

Below is a sketch of a scheme that I think will allow us to publish a
database of access point locations while still requiring knowledge of
two neighboring access points.

Unlike Google's Location Service, our server does not store MAC
addresses or SSIDs. We identify access points by hash IDs, specifically
SHA1(MAC+SSID). To query the location of an access point in the
database, you must know both its MAC address and current SSID.

Our private database maps access point hash IDs to locations (and other
metadata). Assuming:

     H1 = Hash(AP1.MAC + AP1.SSID)
     H2 = Hash(AP2.MAC + AP2.SSID)

Our private database's schema looks something like:

     Hash(AP1.MAC + AP1.SSID) ==> AP1.latitude, AP1.longitude, ...
     Hash(AP2.MAC + AP2.SSID) ==> AP2.latitude, AP2.longitude, ...
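
As a rough illustration of the private schema above, here is a minimal Python sketch. It assumes "+" means plain concatenation and that the MAC is lowercased colon-separated text; the function names, the sample MAC/SSID, and the coordinates are all made up for the example.

```python
import hashlib


def ap_hash_id(mac: str, ssid: str) -> str:
    """Return the hex hash ID for one access point."""
    # Assumption: "+" is string concatenation and the MAC is lowercased
    # colon-separated text; the real normalization is not specified here.
    material = mac.lower().encode("utf-8") + ssid.encode("utf-8")
    return hashlib.sha1(material).hexdigest()


# Private database: hash ID -> (latitude, longitude). Values are made up.
private_db = {
    ap_hash_id("45:67:89:ab:cd:ef", "ExampleNet"): (37.7749000, -122.4194000),
}


def private_lookup(mac: str, ssid: str):
    """Return the stored position, or None if the MAC/SSID pair is unknown."""
    return private_db.get(ap_hash_id(mac, ssid))
```

A lookup only succeeds when both the MAC and the current SSID match; changing the SSID yields a different hash ID.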

Our published database would include two tables. The first table would
map a random row id to metadata about an anonymous access point:

     Random1 ==> AP1.latitude, AP1.longitude, ...
     Random2 ==> AP2.latitude, AP2.longitude, ...

The second table's primary key would be a hash of hashes. It would map a
hash of two neighboring access points' hash IDs to a row id of the first
table. Something like:

     Hash(H1 + H2) ==> Random1
     Hash(H2 + H1) ==> Random2

Someone querying the published database would need to know the MAC
addresses and current SSIDs of two neighboring access points to look up
either's location.
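
The two published tables could be sketched roughly as follows. This is only an illustration of the idea, not our implementation: it uses SHA-256 as Hash(), takes the neighbor pairs as given, and the names publish() and lookup() are hypothetical.

```python
import hashlib
import secrets


def h(data: bytes) -> bytes:
    """Hash(): SHA-256 is an assumption for this sketch."""
    return hashlib.sha256(data).digest()


def publish(private_db, neighbors):
    """Build the two published tables.

    private_db: {hash_id (bytes): (lat, lon)}
    neighbors:  iterable of (hash_id_a, hash_id_b) pairs known to be nearby
    """
    row_of = {}      # hash ID -> random row id (never published)
    locations = {}   # table 1: random row id -> (lat, lon)
    for hash_id, position in private_db.items():
        row_id = secrets.token_hex(8)
        row_of[hash_id] = row_id
        locations[row_id] = position

    pair_index = {}  # table 2: Hash(Ha + Hb) -> row id of the first AP
    for a, b in neighbors:
        pair_index[h(a + b)] = row_of[a]
        pair_index[h(b + a)] = row_of[b]
    return locations, pair_index


def lookup(locations, pair_index, h_first, h_second):
    """Return the position of the first AP, given a neighboring second AP."""
    row_id = pair_index.get(h(h_first + h_second))
    return locations.get(row_id) if row_id is not None else None
```

The order of concatenation selects which access point's row is returned, so each neighboring pair contributes two keys to the second table.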

btw, should we use SHA-2 instead of SHA-1? In 2009, NIST recommended
that "Federal agencies should stop using SHA-1 for applications that
require collision resistance as soon as practical, and must use the
SHA-2 family of hash functions for these applications after 2010."

Other layers of privacy protection include filtering out ad-hoc Wi-Fi
networks; MAC addresses with vendor prefixes from mobile device
manufacturers (e.g. Apple and HTC); SSIDs commonly associated with mobile
devices (e.g. "XXX's iPhone" and Google's "_nomap" opt-out); and APs
reported in multiple locations.
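
A stumbler-side SSID filter along these lines might look like the sketch below. The pattern list is illustrative only and would need real curation; should_report() is a hypothetical name. The "_nomap" check follows Google's opt-out convention of a suffix on the SSID.

```python
import re

# Assumption: illustrative patterns only, not a curated list.
MOBILE_SSID_PATTERNS = [
    re.compile(r"['\u2019]s iPhone$", re.IGNORECASE),  # e.g. "XXX's iPhone"
    re.compile(r"^AndroidAP", re.IGNORECASE),          # common hotspot default
]


def should_report(ssid: str, is_adhoc: bool) -> bool:
    """Return True if the stumbler may report this network."""
    if is_adhoc:
        return False                 # ad-hoc networks are skipped
    if ssid.endswith("_nomap"):
        return False                 # owner has opted out
    return not any(p.search(ssid) for p in MOBILE_SSID_PATTERNS)
```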


thanks,
chris
_______________________________________________
dev-security mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-security

Re: Request for feedback on crypto privacy protections of geolocation data

R. Jason Cronk
I haven't done a full analysis, but I do have a few questions.


On 9/9/2013 5:58 PM, Chris Peterson wrote:
> Our private database maps access point hash IDs to locations (and
> other metadata). Assuming:
>
>     H1 = Hash(AP1.MAC + AP1.SSID)
>     H2 = Hash(AP2.MAC + AP2.SSID)

I assume + means concatenate. I might suggest XORing the values. SSID
names are usually human readable, not meant to be secure and thus follow
predictable patterns. I also hope you're not using the patterned MAC
notation but rather the 48 bit address space representation.


>
> Our private database's schema looks something like:
>
>     Hash(AP1.MAC + AP1.SSID) ==> AP1.latitude, AP1.longitude, ...
>     Hash(AP2.MAC + AP2.SSID) ==> AP2.latitude, AP2.longitude, ...

Is the data aged? What happens if I move? Does this give Mozilla the
ability to historically track me if I move my device? Is that a problem?
(I'm not saying it is, just an observation).
You mention below about filtering APs in multiple locations but clearly
they can move as people relocate.
What is the granularity of the lat/long?

>
> Our published database would include two tables. The first table would
> map a random row id to metadata about an anonymous access point:
>
>     Random1 ==> AP1.latitude, AP1.longitude, ...
>     Random2 ==> AP2.latitude, AP2.longitude, ...

I would be hesitant to use the word anonymous here. Lat/long is easily
combined with other publicly available databases that could identify
individual addresses and thus individuals. Again, it comes down to
granularity of the data.

>
> The second table's primary key would be a hash of hashes. It would map
> a hash of two neighboring access points' hash IDs to a row id of the
> first table. Something like:
>
>     Hash(H1 + H2) ==> Random1
>     Hash(H2 + H1) ==> Random2
>
> Someone querying the published database would need to know the MAC
> addresses and current SSIDs of two neighboring access points to look
> up either's location.

When you say published, do you mean that the entire DB is published for
use by "researchers" or that it just has a publicly exposed API that
responds to queries?
I'm assuming if AP3 through AP10 were all also in the vicinity that
Hash(H1+Hx) ==> Random1 where x is in {2,..,10}, correct?
If so, will whichever value Hy is the prefix in the concatenation
correspond to APy's Random id?



>
> btw, should we use SHA-2 instead of SHA-1? In 2009, NIST recommended
> that "Federal agencies should stop using SHA-1 for applications that
> require collision resistance as soon as practical, and must use the
> SHA-2 family of hash functions for these applications after 2010."

Yes


*R. Jason Cronk, Esq., CIPP/US*
/Privacy Engineering Consultant/, *Enterprivacy Consulting Group*
<enterprivacy.com>

  * phone: (828) 4RJCESQ
  * twitter: @privacymaverick.com
  * blog: http://blog.privacymaverick.com


Re: Request for feedback on crypto privacy protections of geolocation data

Brian Smith-19
In reply to this post by Chris Peterson-12
On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson <[hidden email]> wrote:
> Google's Location Service prevents people from tracking individual access
> points by requiring requests to include at least 2-3 access points that
> Google knows are near each other. This "proves" the requester is near the
> access points.

I assume that "prevents people from tracking individual access points"
means the following: Some people have a personal access point on them
(e.g. in their phone). If somebody knows the SSID and MAC of this
personal access point, then they could track this person's location by
polling the database for that (SSID, MAC) pair. Google tries to limit
this type of abuse as much as practical while still providing a
location service based on such crowdsourced data.

> Unlike Google's Location Service, our server does not store MAC addresses or
> SSIDs. We identify access points by hash IDs, specifically SHA1(MAC+SSID).
> To query the location of an access point in the database, you must know both
> its MAC address and current SSID.

MAC addresses are 48 bits. SSIDs are often guessable or predictable.
Therefore, using the H(MAC+SSID) instead of just the plain MAC+SSID is
not buying you much in terms of privacy, IMO. Basically, if you are
really trying to use this as a privacy mechanism then you should store
the MAC+SSID according to best practices for storing passwords. For
example, use PBKDF2 with a large number of iterations. Regardless of
whether you use SHA1, SHA2, PBKDF2, or something else, I will still
call whatever function you use H(x). But, I am not sure that switching
to PBKDF2 even buys you much improved privacy protection.
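
For concreteness, a password-style derivation of the AP identifier could look like this sketch using Python's hashlib.pbkdf2_hmac. The salt value, iteration count, and function name are assumptions for illustration. Note that lookups require every party to derive the same value, so the salt has to be a fixed, shared one rather than a per-record random salt.

```python
import hashlib

# Assumptions for this sketch: a fixed, public salt and an arbitrary
# iteration count. Neither value comes from the project.
GLOBAL_SALT = b"example-fixed-salt"
ITERATIONS = 100_000


def slow_ap_id(mac_bytes: bytes, ssid: str) -> bytes:
    """Derive a deliberately expensive identifier from MAC + SSID."""
    material = mac_bytes + ssid.encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", material, GLOBAL_SALT, ITERATIONS)
```

The extra iterations only multiply the cost of each guess; they do not enlarge the space of plausible MAC+SSID values.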

>     H1 = Hash(AP1.MAC + AP1.SSID)
>     H2 = Hash(AP2.MAC + AP2.SSID)
>
> Our private database's schema looks something like:
>
>     Hash(AP1.MAC + AP1.SSID) ==> AP1.latitude, AP1.longitude, ...
>     Hash(AP2.MAC + AP2.SSID) ==> AP2.latitude, AP2.longitude, ...
>
> Our published database would include two tables. The first table would map a
> random row id to metadata about an anonymous access point:
>
>     Random1 ==> AP1.latitude, AP1.longitude, ...
>     Random2 ==> AP2.latitude, AP2.longitude, ...
>
> The second table's primary key would be a hash of hashes. It would map a
> hash of two neighboring access points' hash IDs to a row id of the first
> table. Something like:
>
>     Hash(H1 + H2) ==> Random1
>     Hash(H2 + H1) ==> Random2
>
> Someone querying the published database would need to know the MAC addresses
> and current SSIDs of two neighboring access points to look up either's
> location.

If you know the MAC+SSID of person X's personal access point and the
MAC+SSID of person Y's personal access point, then you can use this
database to ask the question "are person X and person Y in the same
location?" This seems bad. I see that you attempt to address this
below.

> btw, should we use SHA-2 instead of SHA-1?

There is no reason to use SHA-1 when you have SHA-2 available.
However, as I indicated above, it isn't clear it is a good idea to be
using any plain hash function as H(x).

> Other layers of privacy protection include filtering out ad-hoc Wi-Fi
> networks; MAC addresses with vendor prefixes from mobile device manufacturers
> (e.g. Apple and HTC); SSIDs commonly associated with mobile devices (e.g.
> "XXX's iPhone" and Google's "_nomap" opt-out); and APs reported in multiple
> locations.

I think that these things are much more important than the protection
offered by H(x). My concern is that if you store the data on the
server as H(x) then you will not be able to do the above filtering on
the server unless H(x) is ineffective. That seems bad, because the
server will be much easier to update to improve the filtering than the
clients will be, AFAICT. Also, you will not be able to measure the
effectiveness of the privacy protections on the server, which is also
very bad.

Therefore, I'd suggest that you avoid using any protection at all, and
just use x instead of H(x) until we are very confident there is no way
we can further improve the filtering.

Cheers,
Brian Smith
--
Mozilla Networking/Crypto/Security (Necko/NSS/PSM), NSA plant

Re: Request for feedback on crypto privacy protections of geolocation data

Eric Rescorla
In reply to this post by Chris Peterson-12
Chris,

I have some basic and perhaps stupid questions.

1. How do I bootstrap? I turn on my device and want to get the coordinates of the APs I see. That requires a lat/long for neighbors. What now?

2. As asked previously, will the DB be published or queryable?

3. What is the lat/long resolution? How is it measured?

Thanks
Ekr


Re: Request for feedback on crypto privacy protections of geolocation data

Hanno Schlichting-5
In reply to this post by Brian Smith-19
On 09.09.2013, at 18:13 , Brian Smith <[hidden email]> wrote:

> On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson <[hidden email]> wrote:
>> Google's Location Service prevents people from tracking individual access
>> points by requiring requests to include at least 2-3 access points that
>> Google knows are near each other. This "proves" the requester is near the
>> access points.
>
> I assume that "prevents people from tracking individual access points"
> means the following: Some people have a personal access point on them
> (e.g. in their phone). If somebody knows the SSID and MAC of this
> personal access point, then they could track this person's location by
> polling the database for that (SSID, MAC) pair. Google tries to limit
> this type of abuse as much as practical while still providing a
> location service based on such crowdsourced data.

Yes :) Though there's one crucial difference between Google and us: We would like to make as much of this data public as possible, while Google will always just provide a service without access to the underlying data.

>> Unlike Google's Location Service, our server does not store MAC addresses or
>> SSIDs. We identify access points by hash IDs, specifically SHA1(MAC+SSID).
>> To query the location of an access point in the database, you must know both
>> its MAC address and current SSID.
>
> MAC addresses are 48 bits. SSIDs are often guessable or predictable.
> Therefore, using the H(MAC+SSID) instead of just the plain MAC+SSID is
> not buying you much in terms of privacy, IMO. Basically, if you are
> really trying to use this as a privacy mechanism then you should store
> the MAC+SSID according to best practices for storing passwords. For
> example, use PBKDF2 with a large number of iterations. Regardless of
> whether you use SHA1, SHA2, PBKDF2, or something else, I will still
> call whatever function you use H(x). But, I am not sure that switching
> to PBKDF2 even buys you much improved privacy protection.

We were hoping to achieve two things by using SHA-1:

- Make it possible for the end user to change their unique value (they cannot change the MAC address, but they can change the SSID). This allows them to "invalidate" historical records in the database.

- Make it harder for spammers to "guess" actual unique keys and flood our service. MAC addresses have a vendor prefix, which makes it rather easy to generate lots of valid MAC addresses. Taking the SSID into account makes it harder to generate valid keys. Unfortunately, the SSID itself is considered private data in European countries, so you aren't allowed to store it without the user's consent. That's why Google and everyone else have stopped storing them and only use MAC addresses now.

The SHA-1 scheme might be ineffective in doing this.

>>    H1 = Hash(AP1.MAC + AP1.SSID)
>>    H2 = Hash(AP2.MAC + AP2.SSID)
>>
>> Our private database's schema looks something like:
>>
>>    Hash(AP1.MAC + AP1.SSID) ==> AP1.latitude, AP1.longitude, ...
>>    Hash(AP2.MAC + AP2.SSID) ==> AP2.latitude, AP2.longitude, ...
>>
>> Our published database would include two tables. The first table would map a
>> random row id to metadata about an anonymous access point:
>>
>>    Random1 ==> AP1.latitude, AP1.longitude, ...
>>    Random2 ==> AP2.latitude, AP2.longitude, ...
>>
>> The second table's primary key would be a hash of hashes. It would map a
>> hash of two neighboring access points' hash IDs to a row id of the first
>> table. Something like:
>>
>>    Hash(H1 + H2) ==> Random1
>>    Hash(H2 + H1) ==> Random2
>>
>> Someone querying the published database would need to know the MAC addresses
>> and current SSIDs of two neighboring access points to look up either's
>> location.
>
> If  you know the MAC+SSID of person X's personal access point and the
> MAC+SSID of person Y's personal access point, then you can use this
> database to ask the question "are person X and person Y in the same
> location?" This seems bad. I see that you attempt to address this
> below.

On the service level, we can prevent this by adding extra thresholds, such as filtering out "moving" APs and only reporting APs that have been seen in the same location a number of times over a minimum time period.

But this doesn't help us when publishing the underlying data.
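
The service-level thresholds mentioned above could be sketched like this. All threshold values and the is_stationary() name are hypothetical, and averaging longitudes is a simplification that breaks near the antimeridian.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long points."""
    r = 6_371_000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def is_stationary(observations, min_count=5, min_span_s=7 * 86400, max_spread_m=200.0):
    """observations: list of (unix_time, lat, lon) for one AP.

    Thresholds are made-up defaults, not values used by the service.
    """
    if len(observations) < min_count:
        return False                      # not seen often enough
    times = [t for t, _, _ in observations]
    if max(times) - min(times) < min_span_s:
        return False                      # not seen over a long enough period
    lat0 = sum(lat for _, lat, _ in observations) / len(observations)
    lon0 = sum(lon for _, _, lon in observations) / len(observations)
    # Reject the AP if any sighting is far from the average position.
    return all(
        haversine_m(lat0, lon0, lat, lon) <= max_spread_m
        for _, lat, lon in observations
    )
```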

>> btw, should we use SHA-2 instead of SHA-1?
>
> There is no reason to use SHA-1 when you have SHA-2 available.
> However, as I indicated above, it isn't clear it is a good idea to be
> using any plain hash function as H(x).
>
>> Other layers of privacy protection include filtering out ad-hoc Wi-Fi
>> networks; MAC addresses with vendor prefixes from mobile device manufacturers
>> (e.g. Apple and HTC); SSIDs commonly associated with mobile devices (e.g.
>> "XXX's iPhone" and Google's "_nomap" opt-out); and APs reported in multiple
>> locations.
>
> I think that these things are much more important than the protection
> offered by H(x). My concern is that if you store the data on the
> server as H(x) then you will not be able to do the above filtering on
> the server unless H(x) is ineffective. That seems bad, because the
> server will be much easier to update to improve the filtering than the
> clients will be, AFAICT. Also, you will not be able to measure the
> effectiveness of the privacy protections on the server, which is also
> very bad.
>
> Therefore, I'd suggest that you avoid using any protection at all, and
> just use x instead of H(x) until we are very confident there is no way
> we can further improve the filtering.

This sounds like good advice, and I'm starting to lean in this direction.

But this only helps us on the "we provide a service" side. It's still unclear to me if and how we could share any of this data as database dumps.

Hanno

Re: Request for feedback on crypto privacy protections of geolocation data

Hanno Schlichting-5
In reply to this post by Eric Rescorla
On 09.09.2013, at 18:41 , Eric Rescorla <[hidden email]> wrote:
> 1. How do I bootstrap? I turn on my device and want to get the coordinates of the aps I see. That requires a lat long for neighbors. What now?

We build the database by having people use a stumbler application to send us observations. The stumbler app uses the mobile phone's GPS sensor to know its location. It reports all cell towers and Wi-Fi APs it sees in a certain location to us. We crunch some data, then we make a search API available over this data. Later someone else asks us what their location is, based on the cell towers or APs they see.

> 2. As asked previously will the db be published or query able?

It will definitely be queryable, but with a lot of restrictions to enhance privacy. We would like to publish it or as much of it as possible, but it's unclear how to do that, when a lot of the individual records are considered personally identifiable information.

> 3. What is the lat/long resolution? How is it measured?

The resolution differs, but is generally "as precise as it gets". GPS sensors often have 5 meter precision, and Google aims for 1 meter resolution for indoor locations based on Wi-Fi access points. Internally we currently store things with centimeter precision and timestamps in milliseconds, so definitely all on the far side of "extremely detailed / private".

Hanno

Re: Request for feedback on crypto privacy protections of geolocation data

Brian Smith-19
In reply to this post by Hanno Schlichting-5
On Mon, Sep 9, 2013 at 7:15 PM, Hanno Schlichting
<[hidden email]> wrote:
> On 09.09.2013, at 18:13 , Brian Smith <[hidden email]> wrote:
>> On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson <[hidden email]> wrote:

> [T]here's one crucial difference between Google and us: We would
> like to make as much of this data public as possible, while Google will always
> just provide a service without access to the underlying data.

> We were hoping to achieve two things by using SHA-1:
>
> - Make it possible for the end user to change their unique value (they cannot change the MAC address, but they can change the SSID). This allows them to "invalidate" historical records in the database.

There is friction in changing SSIDs as it affects every device that
would connect to that network. There will also probably not be much
awareness among users of when/why/how to do this or what effect it
will have. So, I think this is an aspect that sounds great in theory,
but in practice will nearly never be used.

> - Make it harder for spammers to "guess" actual unique keys and flood our service. MAC addresses have a vendor prefix, which makes it rather easy to generate lots of valid MAC addresses. Taking the SSID into account makes it harder to generate valid keys. Unfortunately, the SSID itself is considered private data in European countries, so you aren't allowed to store it without the user's consent. That's why Google and everyone else have stopped storing them and only use MAC addresses now.
>
> The SHA-1 scheme might be ineffective in doing this.

If x is private data then SHA1(x), SHA2(x), PBKDF2(x), and even
AES256(x, key) with a key known to you are all private data too.

>> Therefore, I'd suggest that you avoid using any protection at all, and
>> just use x instead of H(x) until we are very confident there is no way
>> we can further improve the filtering.
>
> This sounds like good advice and I'm starting to lean into this direction.
>
> But this only helps us on the "we provide a service" side. It's still unclear to me if and how we could share any of this data as database dumps.

If you wanted to publish this data, and the data was stored in its raw
state, then you could always apply whatever mapping (SHA2, PBKDF2,
AES256 with random and thrown-away key, etc.) right before you share
the data.
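
As a sketch of "apply the mapping right before you share", the export step below replaces each (MAC, SSID) with a keyed tag under a random key that is discarded afterwards. It substitutes HMAC-SHA256 for the AES256 option because HMAC is available in the Python standard library; the function name and record layout are hypothetical.

```python
import hashlib
import hmac
import secrets


def pseudonymize_for_export(records):
    """records: iterable of (mac_bytes, ssid, lat, lon) from the raw store.

    Returns a list of (tag, lat, lon). The key lives only inside this call,
    so tags from two separate exports cannot be linked via the key.
    """
    key = secrets.token_bytes(32)  # random key, thrown away on return
    exported = []
    for mac_bytes, ssid, lat, lon in records:
        material = mac_bytes + ssid.encode("utf-8")
        tag = hmac.new(key, material, hashlib.sha256).hexdigest()
        exported.append((tag, lat, lon))
    return exported
```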

Even if you use AES256 with a random, thrown-away key, the data will
be subject to reverse engineering. For example, one could correlate a
subset of the data with a separate database of known
(MAC,SSID,Location) triples, and/or attempt "traffic analysis" to see
relationships in how (MAC,SSID) pairs interact with each other with
respect to location. You have probably heard of the Netflix Prize
privacy issues [1]; this is a very similar problem to the Netflix
prize. Therefore, while it may be important to obscure the data before
giving it to researchers, we should still consider the obscured data
to be highly-sensitive confidential user data.

[1] http://en.wikipedia.org/wiki/Netflix_Prize#Privacy_concerns

Cheers,
Brian
--
Mozilla Networking/Crypto/Security (Necko/NSS/PSM)

Re: Request for feedback on crypto privacy protections of geolocation data

Chris Peterson-12
In reply to this post by Chris Peterson-12
On 9/9/13 6:41 PM, Eric Rescorla wrote:
> 1. How do I bootstrap? I turn on my device and want to get the coordinates of the aps I see. That requires a lat long for neighbors. What now?

The device would scan for nearby APs and send the hash of each AP's MAC
and SSID to our location server. Our server would not need to worry
about the hash of hashes pairs because that would only be used for
published data. The server would return an estimated latitude,
longitude, and accuracy (radius in meters) of the device among the APs.

The simple approach for predicting the device's position is
trilateration using a weighted average of the nearby APs' positions. A
more robust approach is a grid-based approach where the server divides
the world into squares and knows which APs have been seen from which
squares.
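
The simple approach could be sketched as below. Strictly speaking this computes a weighted centroid rather than a true trilateration, and the choice of weights (for example, something derived from signal strength) is an assumption; estimate_position() is a hypothetical name.

```python
def estimate_position(aps):
    """aps: list of (lat, lon, weight) for the APs the device can see.

    Returns the weighted average (lat, lon). Averaging raw degrees is only
    reasonable for APs that are close together and away from the antimeridian.
    """
    total = sum(weight for _, _, weight in aps)
    if total <= 0:
        raise ValueError("need at least one AP with positive weight")
    lat = sum(lat * weight for lat, _, weight in aps) / total
    lon = sum(lon * weight for _, lon, weight in aps) / total
    return lat, lon
```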

The device might be able to cache a portion of the geo data (e.g. part
of the current city) to allow offline geolocation.


> 2. As asked previously will the db be published or query able?

We are investigating both a web service API and a downloadable database.
We are collecting position data for both Wi-Fi access points and cell
towers. Depending on privacy protections, if we can't publish the whole
database to the world, we can publish just the cell tower data to the
world and possibly make the Wi-Fi data available only to trusted
researchers.


> 3. What is the lat/long resolution? How is it measured?

This depends on the GPS of the device used to collect the data, but our
database stores 7 decimal places (less than one meter resolution).
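
To put a rough number on that, the sketch below converts a count of decimal places into meters. It treats the Earth as a sphere with the equatorial circumference, so the figures are approximate; resolution_m() is a hypothetical helper.

```python
import math

EARTH_CIRCUMFERENCE_M = 40_075_017  # equatorial, approximate


def resolution_m(decimal_places: int, latitude_deg: float = 0.0):
    """Return (north-south, east-west) size in meters of the last digit."""
    meters_per_degree = EARTH_CIRCUMFERENCE_M / 360
    lat_res = meters_per_degree * 10 ** -decimal_places
    # East-west distance per degree shrinks with the cosine of the latitude.
    lon_res = lat_res * math.cos(math.radians(latitude_deg))
    return lat_res, lon_res
```

With 7 decimal places the last digit corresponds to roughly a centimeter, which is far finer than the accuracy of a phone's GPS fix.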



chris

Re: Request for feedback on crypto privacy protections of geolocation data

Chris Peterson-12
In reply to this post by Chris Peterson-12
On 9/9/13 4:25 PM, R. Jason Cronk wrote:

> On 9/9/2013 5:58 PM, Chris Peterson wrote:
>> Our private database maps access point hash IDs to locations (and
>> other metadata). Assuming:
>>
>>     H1 = Hash(AP1.MAC + AP1.SSID)
>>     H2 = Hash(AP2.MAC + AP2.SSID)
>
> I assume + means concatenate. I might suggest XORing the values. SSID
> names are usually human readable, not meant to be secure and thus follow
> predictable patterns. I also hope you're not using the patterned MAC
> notation but rather the 48 bit address space representation.

We currently use concatenation, but I see how XOR would make more sense.
We are using the SSID as a weak protection against someone "polluting"
our database results by submitting random MAC addresses. Our database
still might have their junk data, but real location requests shouldn't
hit them.

We are using the MAC string notation like "45:67:89:ab:cd:ef", but I see
that this format has predictable patterns, too. I will recommend we use
the 48-bit binary representation.
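
Converting the string notation to the 48-bit form is straightforward; a minimal sketch (mac_to_bytes() is a hypothetical name):

```python
def mac_to_bytes(mac: str) -> bytes:
    """Convert "45:67:89:ab:cd:ef" (or dash-separated) to 6 raw bytes."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
```

This also removes case and separator differences, so the same address always hashes to the same value.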


> What is the granularity of the lat/long?

This depends on the GPS of the device used to collect the data, but our
database stores 7 decimal places (less than one meter resolution).


>> Someone querying the published database would need to know the MAC
>> addresses and current SSIDs of two neighboring access points to look
>> up either's location.
>
> When you say published, do you mean that the entire DB is published for
> use by "researchers" or that it just has a publicly exposed API that
> responds to queries?

We are investigating both a web service API and a downloadable database.
We are collecting position data for both Wi-Fi access points and cell
towers. Depending on privacy protections, if we can't publish the whole
database to the world, we can publish just the cell tower data to the
world and possibly make the Wi-Fi data available only to trusted
researchers.


> I'm assuming if AP3 through AP10 were all also in the vicinity that
> Hash(H1+Hx) ==> Random1 where x is in {2,..,10}, correct?
> If so, will whichever value Hy is the prefix in the concatenation
> correspond to APy's Random id?

In the proposed scheme, yes. Since AP1 and AP2 have different (but
close) latitude and longitude positions, Hash(H1+H2) would fetch the
random row id for AP1's location and Hash(H2+H1) would fetch the row id
for AP2's location.


chris

Re: Request for feedback on crypto privacy protections of geolocation data

Chris Peterson-12
In reply to this post by Chris Peterson-12

On 9/9/13 6:13 PM, Brian Smith wrote:
> I assume that "prevents people from tracking individual access points"
> means the following: Some people have a personal access point on them
> (e.g. in their phone). If somebody knows the SSID and MAC of this
> personal access point, then they could track this person's location by
> polling the database for that (SSID, MAC) pair.

Tracking a person's movements by polling the database would not be
useful because we would probably update the database infrequently (days
or weeks). The location database would be generated offline from
analysis of many raw measurements submitted by the stumbler app.

The tracking scenario that might be viable is one where a tracker knows
someone's MAC address and current SSID, and that person moves to a
different city or state. In that case the database delay wouldn't matter
as much. The hash of hashes scheme tries to protect against that by
requiring two neighboring APs.


> MAC addresses are 48 bits. SSIDs are often guessable or predictable.
> Therefore, using the H(MAC+SSID) instead of just the plain MAC+SSID is
> not buying you much in terms of privacy, IMO. Basically, if you are
> really trying to use this as a privacy mechanism then you should store
> the MAC+SSID according to best practices for storing passwords. For
> example, use PBKDF2 with a large number of iterations. Regardless of
> whether you use SHA1, SHA2, PBKDF2, or something else, I will still
> call whatever function you use H(x). But, I am not sure that switching
> to PBKDF2 even buys you much improved privacy protection.

The primary motivation for hashing the MAC+SSID was to avoid uploading
the SSID (which is considered private data in some European countries)
while still using the SSID as a sort of weak protection against "database
pollution" from malicious stumblers reporting spoofed MAC addresses.
Even if our database were filled with junk MAC addresses, real clients
would probably not see the same combination of MAC and SSID in the real
world when they sent a geolocation request to the server.
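
For comparison, the two identifier derivations under discussion (plain SHA-1 versus a password-style KDF such as PBKDF2) might look roughly like the sketch below. The salt, the iteration count, and the function names are assumptions chosen for illustration. Note that a lookup key needs a fixed, shared salt, which is weaker than the per-record salts used for password storage.

```python
import hashlib

# Assumption: a fixed, public salt. A random per-record salt would make the
# value unusable as a lookup key, because every client must be able to
# recompute the same identifier from the MAC and SSID alone.
SALT = b"example-geolocation-salt"
ITERATIONS = 100_000  # illustrative value only

def ap_id_sha1(mac: str, ssid: str) -> str:
    """Plain SHA1(MAC + SSID) as hex."""
    return hashlib.sha1((mac + ssid).encode("utf-8")).hexdigest()

def ap_id_pbkdf2(mac: str, ssid: str) -> str:
    """PBKDF2-HMAC-SHA256 over MAC + SSID; each guess costs ITERATIONS HMAC rounds."""
    key = hashlib.pbkdf2_hmac(
        "sha256", (mac + ssid).encode("utf-8"), SALT, ITERATIONS
    )
    return key.hex()
```

The stretched variant only raises the per-guess cost by a constant factor; it does not change the fact that the input space (48-bit MAC plus a guessable SSID) is small.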


>> Other layers of privacy protection include filtering out ad-hoc Wi-Fi
>> networks; MAC addresses with vendor prefixes from mobile device manufacturers
>> (e.g. Apple and HTC); SSIDs commonly associated with mobile devices (e.g.
>> "XXX's iPhone" and Google's "_nomap" opt-out); and APs reported in multiple
>> locations.
>
> I think that these things are much more important than the protection
> offered by H(x). My concern is that if you store the data on the
> server as H(x) then you will not be able to do the above filtering on
> the server unless H(x) is ineffective. That seems bad, because the
> server will be much easier to update to improve the filtering than the
> clients will be, AFAICT. Also, you will not be able to measure the
> effectiveness of the privacy protections on the server, which is also
> very bad.

Very good points. We are currently filtering on the stumbler client
side. Today, the server just receives mystery hashes with latitude and
longitude.

Given just MAC addresses, the server could still filter out ad-hoc
networks; vendor prefixes for known mobile device manufacturers; and
unrecognized vendor prefixes (because some mobile devices supposedly
generate completely random MAC addresses).
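
A rough sketch of what such MAC-only filtering could look like follows. It assumes that ad-hoc (IBSS) networks typically advertise a randomly generated BSSID with the "locally administered" bit set in the first octet, and it uses made-up vendor prefixes; a real list would have to come from the IEEE registry. The names `MOBILE_VENDOR_OUIS`, `KNOWN_OUIS`, and `should_filter` are hypothetical.

```python
# Hypothetical vendor prefixes (OUIs), for illustration only.
MOBILE_VENDOR_OUIS = {"00:aa:bb"}
KNOWN_OUIS = {"00:aa:bb", "00:11:22"}

def parse_mac(mac: str) -> bytes:
    """Parse 'aa:bb:cc:dd:ee:ff' into six raw octets."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}")
    return octets

def should_filter(mac: str) -> bool:
    """Return True if the AP should be dropped based on its MAC address alone."""
    octets = parse_mac(mac)
    oui = ":".join(f"{b:02x}" for b in octets[:3])
    if octets[0] & 0x02:           # locally administered: likely ad-hoc or randomized
        return True
    if oui in MOBILE_VENDOR_OUIS:  # known mobile device manufacturer
        return True
    if oui not in KNOWN_OUIS:      # unrecognized vendor prefix
        return True
    return False
```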

We would still need to rely on the stumbler to filter SSIDs. We can't
upload SSIDs to the server because they are considered private data in
some European countries (though MAC addresses, which are more unique,
are apparently not considered private data, in a legal sense).

We've compiled a list of about 70 SSID prefixes and suffixes we've seen
from mobile devices (e.g. "Android*", "Verizon *", or "*'s iPhone"). Not
all of these mobile devices use ad-hoc MAC addresses.
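
The client-side SSID filter can be sketched with simple wildcard matching. Only the three patterns quoted above and Google's "_nomap" suffix appear here; the rest of the roughly 70-entry list is not reproduced, and the function name is an assumption.

```python
from fnmatch import fnmatchcase

# Patterns taken from the examples above, plus the "_nomap" opt-out suffix.
MOBILE_SSID_PATTERNS = ["Android*", "Verizon *", "*'s iPhone", "*_nomap"]

def is_mobile_or_opted_out(ssid: str) -> bool:
    """Return True if the SSID matches any mobile-device or opt-out pattern."""
    return any(fnmatchcase(ssid, pattern) for pattern in MOBILE_SSID_PATTERNS)
```

`fnmatchcase` is used rather than `fnmatch` so that matching stays case-sensitive on every platform.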

Trivia: over a couple years of my own Wi-Fi stumbling/wardriving in
three countries and six US states, I have recorded over 100K unique APs
and only eight used Google's "_nomap" SSID opt-out suffix!


chris

Re: Request for feedback on crypto privacy protections of geolocation data

Henri Sivonen-2
In reply to this post by Brian Smith-19
On Tue, Sep 10, 2013 at 4:13 AM, Brian Smith <[hidden email]> wrote:
> I assume by "prevents people from tracking individual access points"
> means the following: Some people have a personal access point on them
> (e.g. in their phone). If somebody knows the SSID and MAC of this
> personal access point, then they could track this person's location by
> polling the database for that (SSID, MAC) pair.

I put "_nomap" at the end of my portable SSID, since Google says they
filter out SSIDs ending in "_nomap". However, I don't expect all
people to do that.

 1) Android has a mechanism for detecting when it is connecting to a
portable AP provided by another Android device. Can we use the same or
a similar detection mechanism to detect portable APs and filter them
out?
 2) I think I read somewhere that Mozilla is trying to filter out
"_nomap" as well. If Mozilla's servers only see hashes and the client
is modifiable, how can the filtering be enforced?
 3) There are some APs that move but whose name does not end in
"_nomap" and those access points confuse Android. (Consider an AP on a
train and trying to look at where you are on the map when you are a
passenger on the train and Google has seen the train AP at a different
location.) Are there any plans for a crowdsourced mechanism for
blacklisting such APs?

--
Henri Sivonen
[hidden email]
http://hsivonen.iki.fi/

Re: Request for feedback on crypto privacy protections of geolocation data

ianG-2
In reply to this post by Chris Peterson-12
On 10/09/13 00:58 AM, Chris Peterson wrote:
> I'm looking for some feedback on crypto privacy protections for a
> geolocation research project I'm working on with the Mozilla Services
> team. If you have general questions or suggestions about the project,
> I'm happy to answer them, but I'd like to focus this thread on crypto.
>
> Our team is prototyping a crowd-sourced version of Google's Street View
> cars to correlate Wi-Fi access points and cell towers to GPS positions.
> Our primary motivation is to provide non-proprietary location services
> for Firefox OS devices.


If I read this correctly, you want your client devices to figure out
where they are, right?

If that is the case, why not flip it around.  Instead of trying to
interpolate the existing data that is broadcast out there, why not write
a protocol to broadcast the direct location from the wireless access point?

A lot of these routers run Linux, and this is a place where people would
be interested in running a new service.

A wireless router that broadcasts its geolocation is not a privacy
issue.  There is no reason why it can't be turned on by default.

But anything else requires a horrible mishmash of approaches.  To obtain
what?  Something the wireless can easily tell you directly.



iang

Re: Request for feedback on crypto privacy protections of geolocation data

Gervase Markham
In reply to this post by Chris Peterson-12
On 10/09/13 06:05, Chris Peterson wrote:
> The device would scan for nearby APs and send the hash of each AP's MAC
> and SSID to our location server. Our server would not need to worry
> about the hash of hashes pairs because that would only be used for
> published data. The server would return an estimated latitude,
> longitude, and accuracy (radius in meters) of the device among the APs.

BTW, how does the service figure out the lat/long of an AP? Do we do
anything at all with signal strengths? Could we?

Gerv


Re: Request for feedback on crypto privacy protections of geolocation data

Gervase Markham
In reply to this post by Hanno Schlichting-5
On 10/09/13 04:14, Brian Smith wrote:
> There is friction in changing SSIDs as it affects every device that
> would connect to that network. There will also probably not be much
> awareness among users of when/why/how to do this or what effect it
> will have. So, I think this is an aspect that sounds great in
> theory, but in practice will nearly never be used.

When I moved house, I changed my SSID from "99FooStreet" to
"88BarAvenue". I name the SSID like this so people know whose network it
is. Perhaps I'm unusual, but I'm sure I'm not unique.

> Even if you use AES256 with a random, thrown-away key, the data will
> be subject to reverse engineering. For example, one could correlate
> a subset of the data with a separate database of known
> (MAC,SSID,Location) triples, and/or attempt "traffic analysis" to
> see relationships in how (MAC,SSID) pairs interact with each other
> with respect to location. You have probably heard of the Netflix
> Prize privacy issues [1]; this is a very similar problem to the
> Netflix prize.

Can you explain how?

Say I have:

<HASH1> => LAT1, LONG1
<HASH2> => LAT2, LONG2
from the published database, where the two LAT/LONGs are nearby.

If I guess some possible SSIDs, I could work out some possible MAC
addresses for AP 1 and AP 2. I could even validate them and find they
are correct by submitting them to the web service and seeing if it
returned a location (let's say one is "linksys" and the other is
"BTHomeHub"). And the service gives me back... the location I already
know. Ta da. Where's the privacy issue?

Gerv

Re: Request for feedback on crypto privacy protections of geolocation data

Gervase Markham
In reply to this post by Brian Smith-19
On 10/09/13 08:04, Henri Sivonen wrote:
>  1) Android has a mechanism for detecting when it is connecting to a
> portable AP provided by another Android device. Can we use the same or
> a similar detection mechanism to detect portable APs and filter them
> out?

I suspect actually connecting to the APs, as opposed to passively
sniffing, might be on the project's big list of NoNos... But if we
could, I agree we could find more useful data.

> location.) Are there any plans for a crowdsourced mechanism for
> blacklisting such APs?

Not sure about crowdsourcing, but I believe they plan to use over-time
algorithms for blocking regularly-moving APs.

Gerv


Re: Request for feedback on crypto privacy protections of geolocation data

Gervase Markham
In reply to this post by Chris Peterson-12
On 10/09/13 10:48, ianG wrote:
> If that is the case, why not flip it around.  Instead of trying to
> interpolate the existing data that is broadcast out there, why not write
> a protocol to broadcast the direct location from the wireless access point?

Because only a tiny, tiny fraction of devices would run it, and for most
of those, the user wouldn't have correctly set the device's location
anyway, and for some of them, they'd have set it and then moved.

This is a "boil the sea" approach to the problem.

Gerv


Re: Request for feedback on crypto privacy protections of geolocation data

Gervase Markham
In reply to this post by Chris Peterson-12
On 10/09/13 02:13, Brian Smith wrote:

> On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson <[hidden email]> wrote:
>> Google's Location Service prevents people from tracking individual access
>> points by requiring requests to include at least 2-3 access points that
>> Google knows are near each other. This "proves" the requester is near the
>> access points.
>
> I assume by "prevents people from tracking individual access points"
> means the following: Some people have a personal access point on them
> (e.g. in their phone). If somebody knows the SSID and MAC of this
> personal access point, then they could track this person's location by
> polling the database for that (SSID, MAC) pair. Google tries to limit
> this type of abuse as much as practical while still providing a
> location service based on such crowdsourced data.

Actually, it means something more like "prevents people from figuring
out where their ex-partner moved to". The database update frequency is
not sufficient to worry about real-time tracking of mobile phones.

> MAC addresses are 48 bits. SSIDs are often guessable or predictable.
> Therefore, using the H(MAC+SSID) instead of just the plain MAC+SSID is
> not buying you much in terms of privacy, IMO.

It is, because raw SSIDs are personal information, and being unable to
separate the two pieces reliably means that neither is retrievable.

> Basically, if you are
> really trying to use this as a privacy mechanism then you should store
> the MAC+SSID according to best practices for storing passwords. For
> example, use PBKDF2 with a large number of iterations.

What's the threat model here?

If I hash MAC+SSID, someone could say "OK, if the SSID happened to be
'linksys', then the MAC would be '12:34:56:79:90'", but how does that
help them at all if they don't _know_ that the SSID is "linksys"?

> If  you know the MAC+SSID of person X's personal access point and the
> MAC+SSID of person Y's personal access point, then you can use this
> database to ask the question "are person X and person Y in the same
> location?" This seems bad.

Can you envisage a scenario where one might know this information, but
not the shared location? (Note that mobile access points are fairly well
excluded by the protections Chris outlines.)

> I think that these things are much more important than the protection
> offered by H(x). My concern is that if you store the data on the
> server as H(x) then you will not be able to do the above filtering on
> the server unless H(x) is ineffective.

I believe the plan is to have a database of raw findings, then a
processed database used by the web service, and a published database
which may have even more data reduction.

Chris P: can we get permission to store the raw SSID in the
_unpublished_ database?

Gerv


Re: Request for feedback on crypto privacy protections of geolocation data

Gervase Markham
In reply to this post by Chris Peterson-12
On 10/09/13 00:25, R. Jason Cronk wrote:
> Is the data aged?

Not AFAIAA.

> What happens if I move?

The raw database notes that you are now being detected in a new
location. What happens then is up for debate. I'd argue that if your
position was fixed for N months before, and it seems fixed again now, we
should assume you have moved house and keep the point in the DB. APs
which seem to move a lot, or move regularly, should be excluded.
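
As a rough illustration of that heuristic (not the project's algorithm), the sketch below counts how often an AP's observed position jumps beyond a threshold. The 500 m radius, the one-move allowance, and the function names are assumptions. For brevity it works on time-ordered positions only and ignores how long the AP stayed put, so the "fixed for N months" part is left out.

```python
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def classify(observations, radius_m=500, max_moves=1):
    """observations: time-ordered list of (lat, lon).

    Returns 'fixed' (never moved), 'moved' (relocated at most max_moves
    times, e.g. the owner moved house), or 'mobile' (should be excluded).
    """
    if not observations:
        raise ValueError("need at least one observation")
    moves = 0
    anchor = observations[0]
    for pos in observations[1:]:
        if haversine_m(anchor, pos) > radius_m:
            moves += 1
            anchor = pos  # treat the new position as the current home
    if moves == 0:
        return "fixed"
    return "moved" if moves <= max_moves else "mobile"
```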

> Does this give Mozilla the
> ability to historically track me if I move my device?

Yes; this is why publishing the full raw stumbled data sets is sadly
not going to be possible.

>> Our published database would include two tables. The first table would
>> map a random row id to metadata about an anonymous access point:
>>
>>     Random1 ==> AP1.latitude, AP1.longitude, ...
>>     Random2 ==> AP2.latitude, AP2.longitude, ...
>
> I would be hesitant to use the word anonymous here. Latlong is easily
> combined with other publicly available databases that could identify
> individual addresses and thus individuals. Again, it comes down to
> granularity of the data.

I'm not sure what threat you are seeing. Can you elaborate? This is just
a list of latlongs which have a wireless access point. How can this
information assist in identifying individuals or their locations?

Gerv


Re: Request for feedback on crypto privacy protections of geolocation data

Gervase Markham
In reply to this post by Chris Peterson-12
On 09/09/13 22:58, Chris Peterson wrote:
> Google's Location Service prevents people from tracking individual
> access points by requiring requests to include at least 2-3 access
> points that Google knows are near each other. This "proves" the
> requester is near the access points.

Related question: it would be great if there were some way to lift this
restriction, at least for the web service if not for the database, while
preserving the necessary privacy protections. My family's house, which
is in a rural area, has a single access point; I want my phone to know
where it is immediately when I'm there. Not everywhere has lots of
access points.

One thought I had was to allow submission of the MCC/MNC (mobile network
IDs) as proof that you were nearby.

> Unlike Google's Location Service, our server does not store MAC
> addresses or SSIDs. We identify access points by hash IDs, specifically
> SHA1(MAC+SSID). To query the location of an access point in the
> database, you must know both its MAC address and current SSID.

I think that this is an excellent idea, for the reasons you articulate
later in the thread.

Gerv


Re: Request for feedback on crypto privacy protections of geolocation data

Hanno Schlichting-5
In reply to this post by Gervase Markham
On 10.09.2013, at 03:39 , Gervase Markham <[hidden email]> wrote:
> BTW, how does the service figure out the lat/long of an AP? Do we do
> anything at all with signal strengths? Could we?

This is a bit off-topic for the security discussion.

I suggest starting a new thread on dev-geolocation, if you want to know more about the technical details. The short answer is: Yes, but it's a lot more complicated than that :)

Cheers :)
Hanno