Anonymous metrics collection from Firefox

Benjamin Smedberg
A project has been under way for some time to collect
metrics from Firefox installations in an "on by default" manner. This is
different from off-by-default telemetry. I became aware of this project
recently when I was asked to review some implementation code, and I have
some concerns about our privacy stance in this feature. Because the bugs
are getting a bit out of hand, I wanted to move the discussion to the
proper newsgroup.

For background, the feature page (not strictly a feature page) is here:
https://wiki.mozilla.org/MetricsDataPing

Note that this page contains data from several different authors and
isn't a coherent proposal page any more. See the wiki history for
context if necessary.

The tracking bug is https://bugzilla.mozilla.org/show_bug.cgi?id=718066 
from which several other bugs (core implementation, preference UI) are
available.

I understand that this opt-out data collection is vastly superior to
telemetry in terms of collecting a representative sample and controlling
for bias. But it's not clear to me why that makes it "ok" from a privacy
perspective, compared with telemetry, to make this opt-out instead of
opt-in. Just from my personal experience, I would be surprised by any
data submitted by Firefox to Mozilla which was not part of regular
Firefox functionality (app update seems pretty straightforward,
extension update also, crash submission is opt-in). It seems that if
this data submission contains any information which is potentially
personally identifying, then it would be a "surprise". As already
identified in the bug, there are so many different ways in which data
can be potentially identifying:

* unique sets of themes (theme collection was removed)
* unique sets of addons (addon collection is still proposed)
* the unique IDs used to keep track of particular installations can
potentially track data back to users (note that the UUID proposal has
changed somewhat due to privacy concerns, but that there is still a
local ID -> remote data mapping)

A fair bit of the proposal is focused on how we would be protecting and
anonymizing the data. But if we're not actually collecting personally
identifiable data, why couldn't we make the entire server system public
and queryable? It seems that any system that requires server-side
anonymization to meet user privacy expectations is an unexpected privacy
risk. Might it also open up our users to potential tracking via court
order (search warrants) from both U.S. courts and whatever countries we
put data centers in?

It seems as if we are saying that since we already collect most of this
data via various product features, that makes it ok to also collect this
data in a central place and attach an ID to it. Or, that because we
*need* this data in order to make the product better, it's ok to collect
it. This makes me intensely uncomfortable. At this point I think we'd be
better off either collecting only the data which cannot be used to track
individual installs, or not implementing this feature at all.

Note that while Ben Bucksch has also brought up legal concerns about
whether German or European law forbids this kind of data collection, I'm
not particularly interested in that portion of the discussion because very
few of us in the project are legal experts who can have an informed
opinion. So please let's avoid ratholing on those legal issues instead
of the basic privacy issue.

--BDS

_______________________________________________
dev-planning mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-planning

Re: Anonymous metrics collection from Firefox

Dao-6
I was just going to post this to bug 718066, now commenting here instead:

(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #54)

> (In reply to Dão Gottwald [:dao] from comment #52)
> > I'd consider add-ons problematic, partly because the IDs alone can let you
> > track down a person, partly because the use of some add-ons could be illegal
> > in some countries. I also second Ben's view that IP addresses + GUIDs need
> > to be considered personally identifiable information. You say you don't
> > store IP addresses, but this just brings us back to good intentions vs.
> > systems that inherently protect privacy by just not sending out problematic
> > data.
>
> Based on your feedback, we removed persona and theme IDs from the list of
> data submitted.  We also implemented the honoring of the setting that an
> add-on developer can put into the manifest to prevent submitting the add-on
> ID to Mozilla services.  That preference was originally set up as part of
> the services.addons.mozilla.org features that support the Add-on manager.

There's no direct link between the use of an add-on being illegal in
some country and the developer setting that pref. In general, I wouldn't
count on people setting that pref.

> > The client has the list of installed add-ons, knows about crashes and could
> > be told what to consider "slow". Providing it with a list of add-ons that
> > generally tend to be problematic would probably cover 99.9+%. It's unclear
> > why this requires fine-grained data from hundreds of millions of users.
>
> That presumes that we can know with accuracy what add-ons tend to be
> problematic for most of our users.  If we don't collect data from the
> general usage base, the best we could ever hope to know is what AMO hosted
> add-ons cause problems on our own specific test machines and what add-ons
> people have told us cause problems for them.

No, there's also telemetry, which I think we haven't fully utilized yet.
I don't see how some user selection bias would hinder linking add-ons
with performance and stability problems.

(In reply to Blake Cutler from comment #57)
> 2) I didn't say Mozilla is going to die. I implied it's headed toward
> irrelevance. Let's look at the numbers:
> * Webkit's market share is already 10 points higher than Gecko's.
> * Gecko is losing .5% market share per month and has no meaningful presence
> on mobile devices.
> * Webkit is gaining over 1% market share per month and dominates mobile
> browsing.
> * Mobile browsing is rapidly overtaking desktop browsing (gaining nearly 1%
> share per month)

It's unclear how the proposed Metrics Data Ping would change this. See
again the questions I asked in
<https://bugzilla.mozilla.org/show_bug.cgi?id=718066#c35>.

Re: Anonymous metrics collection from Firefox

Ben Bucksch
In reply to this post by Benjamin Smedberg
On 06.02.2012 19:56, Benjamin Smedberg wrote:

> A project has been under way for some time to collect
> metrics from Firefox installations in an "on by default" manner. This
> is different from off-by-default telemetry. I became aware of this
> project recently when I was asked to review some implementation code,
> and I have some concerns about our privacy stance in this feature.
> Because the bugs are getting a bit out of hand, I wanted to move the
> discussion to the proper newsgroup.
>
> For background, the feature page (not strictly a feature page) is
> here: https://wiki.mozilla.org/MetricsDataPing
>
> Note that this page contains data from several different authors and
> isn't a coherent proposal page any more. See the wiki history for
> context if necessary.
>
> The tracking bug is
> https://bugzilla.mozilla.org/show_bug.cgi?id=718066 from which several
> other bugs (core implementation, preference UI) are available.
>
> I understand that this opt-out data collection is vastly superior to
> telemetry in terms of collecting a representative sample and
> controlling for bias. But it's not clear to me why that makes it "ok"
> from a privacy perspective, compared with telemetry, to make this
> opt-out instead of opt-in. Just from my personal experience, I would
> be surprised by any data submitted by Firefox to Mozilla which was not
> part of regular Firefox functionality (app update seems pretty
> straightforward, extension update also, crash submission is opt-in).
> It seems that if this data submission contains any information which
> is potentially personally identifying, then it would be a "surprise".
> As already identified in the bug, there are so many different ways in
> which data can be potentially identifying:
>
> * unique sets of themes (theme collection was removed)
> * unique sets of addons (addon collection is still proposed)
> * the unique IDs used to keep track of particular installations can
> potentially track data back to users (note that the UUID proposal has
> changed somewhat due to privacy concerns, but that there is still a
> local ID -> remote data mapping)

Thanks, Benjamin.

A few additions:

  * Fingerprinting: The data we submit under the current proposal from
    the Metrics group is highly fingerprintable. For example, it has not
    only the list of addons (which in many cases will already be unique
    in its combination, or even pinpoint company association with custom
    addons), but also install date of each addon.
  * UUID: The "document UUID" proposal (actually simply a submission ID)
    sends the previous submission ID as well, which allows the server to
    trivially connect them together and still have a server-side UUID.
    The submission ID may have some advantages in some cases, but it
    doesn't remove the ability to track individual users.
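
To illustrate the UUID point, here is a minimal sketch of how a server could chain submissions together when each one also carries the previous submission ID. The field names (`id`, `previousId`) and the sample records are hypothetical and are not taken from the actual Metrics Data Ping format.

```javascript
// Hypothetical submission records: each carries its own ID and the
// ID of the previous submission from the same installation.
function linkSubmissions(submissions) {
  // Index submissions by the ID they claim to follow.
  const byPrevious = new Map();
  for (const s of submissions) {
    if (s.previousId !== null) byPrevious.set(s.previousId, s);
  }
  // Every submission without a previous ID starts a new chain.
  const chains = [];
  for (const s of submissions) {
    if (s.previousId !== null) continue;
    const chain = [s.id];
    let next = byPrevious.get(s.id);
    while (next) {
      chain.push(next.id);
      next = byPrevious.get(next.id);
    }
    chains.push(chain);
  }
  return chains;
}

const chains = linkSubmissions([
  { id: "a1", previousId: null },
  { id: "b1", previousId: null },
  { id: "a2", previousId: "a1" },
  { id: "a3", previousId: "a2" },
  { id: "b2", previousId: "b1" },
]);
console.log(chains); // two chains, one per installation
```

Each chain then behaves like a server-side installation ID, even though no single stable identifier was ever sent.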


On fingerprinting: I doubt that we really critically need all of that
data to answer the most pressing questions. More data can always be nice
and justified somehow, but it's not necessarily critical.

On the UUID: I also think that there are solutions that do not track
individual users. I proposed one that even makes it possible to see when
users stopped using Firefox. See
https://wiki.mozilla.org/MetricsDataPing#Anonymous_alternative

---

Another, additional way to limit the privacy impact is to take only a
representative sample. Instead of collecting the data from all
200,000,000 users, we pick only a random (!) sample of 10,000, i.e. 1 in
20,000. Concretely: if (!pref.userSet()) pref.set(Math.random() * 20000 < 1).
If the pref is true, submit; otherwise, no data is collected. Given that
the sample is random, it's guaranteed to be statistically representative.
It makes a huge difference whether you collect data from 200,000,000
people or just 10,000.
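
The sampling rule can be sketched as follows. This is only an illustration: the preference name `metrics.enabled` is made up, and a plain `Map` stands in for the real preference store. The comparison is written so that roughly 1 profile in 20,000 is selected, which matches 10,000 out of 200,000,000.

```javascript
// Decide once per profile whether it belongs to the sample.
// `random` is injectable so the logic can be checked deterministically.
function decideSampling(prefs, random = Math.random) {
  // Respect an explicit user choice or an earlier stored decision.
  if (prefs.has("metrics.enabled")) return prefs.get("metrics.enabled");
  // 10,000 out of 200,000,000 users is a 1-in-20,000 chance.
  const selected = random() * 20000 < 1;
  prefs.set("metrics.enabled", selected);
  return selected;
}

const prefs = new Map();
const first = decideSampling(prefs);
const second = decideSampling(prefs); // reuses the stored decision
console.log(first === second); // true
```

Storing the decision matters: re-rolling on every start would eventually select almost every profile.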

Again, you can find arguments why it's better to get a lot more data,
but when you consider the user interest of privacy, I think that's a
fair balance of needs.

---

I would like to add that this feature has a serious potential of
actively decreasing Firefox market share. Firefox is strongest in Europe,
where it still has the largest market share, as far as I know. The
reason why people here in Europe use Firefox is mostly philosophical,
including privacy. It is not so much pure technical merit that wins
users; that is only the second priority. Now, if users get the
idea that Firefox is not dramatically and fundamentally different from,
say, Google Chrome, then they will see no reason to be loyal to Firefox
and will switch to Chrome.

This project will make very bad news, that is almost certain. The
Telemetry question already gave a bad impression.

This project has a very real risk of actively decreasing the market
share that it is trying to preserve.

----

There are other ways to get the needed data without offending users. I
propose to 1) remove the UUID and use the algorithm I proposed, which
still allows gathering the critically needed data, but without tracking
users, 2) remove any data which has a high likelihood of being unique
when fingerprinting, and 3) reduce the collected sample to a random
sample of 10,000.

If all 3 are done, I would have a good conscience that this is a good
balance between the need for data for product decisions and users'
interest in privacy, and I'd even be OK with an opt-out. But only if the
tracking of individual users is removed and the sample is limited to 10,000.

Ben


Re: Anonymous metrics collection from Firefox

Ben Bucksch
On 06.02.2012 20:47, Ben Bucksch wrote:
> On the UUID: I also think that there are solutions that do not track
> individual users. I proposed one that even makes it possible to see
> when users stopped using Firefox. See
> https://wiki.mozilla.org/MetricsDataPing#Anonymous_alternative

Sorry for posting this again, but Usenet lives a lot longer than Wikis
or bugzilla, so here's a copy for future reference.


  Anonymous alternative

The following is an alternative approach, proposed by Ben Bucksch:

For simplicity, I will take the number of crashes (e.g. in the last week
or overall) as the data point that you want to gather. The data itself is
anonymous and cannot (apart from fingerprinting, more on that later)
identify a single user.


    Avoiding UUID

You wanted to know which profiles are not used anymore (dormant,
retention problem) and which characteristics they have. This is
inherently difficult without tracking individual users (installations),
but it is possible with the following algorithm:

The client submits:

  * Date of last submission - e.g. 2012-01-18
  * Current date (from client perspective) - only date, not time - e.g.
    2012-01-20
  * Age of profile (Firefox installation) in days - e.g. 500
  * (Last submitted age is implied or explicit - e.g. 498 )
  * Number of crashes - e.g. 15
  * Number of crashes submitted last time - e.g. 10

Then, on the server, you write that information in a database, as such:

Date of submission | Age of installation | Crash count | Number of users
2012-01-20         | 500                 | 15          | 100000

Any additional user who submits the same combination "age 500, crash
count 15" today increases the "number of users" column by 1; the new
value is 100001. You also look up the row for that user's last
submission, namely

2012-01-18         | 498                 | 10          | 20000

and decrease its number of users by 1; the new value is 19999.

If the user later that day decides that there are too many crashes and
switches to Chrome, he will now be stranded on the row

2012-01-20         | 500                 | 15          | 5000

while other users who have continued to use FF have been subtracted
after a while. So, you can say with certainty that there were 5000 users
who used Firefox the last time on 2012-01-20, after having used Firefox
for 500 days, and they had 15 crashes (per day/week/total, whatever you
submit) when they stopped using Firefox.

That is exactly the information you are so desperately seeking. There,
you have it. Without tracking any individual user: it's completely
anonymous.
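
The counter scheme described in this section can be sketched as follows. It is a minimal in-memory illustration rather than a proposed implementation: the `Map` stands in for the server database, the field names of the ping object are hypothetical, and the seed counts are the example values from the text.

```javascript
// In-memory stand-in for the server-side table: one counter per
// (submission date, installation age, crash count) combination.
const table = new Map();

function key(date, age, crashes) {
  return `${date}|${age}|${crashes}`;
}

function adjust(date, age, crashes, delta) {
  const k = key(date, age, crashes);
  table.set(k, (table.get(k) || 0) + delta);
}

// Handle one ping: add the user to the current row and remove them
// from the row they occupied after their previous submission.
function recordPing(ping) {
  adjust(ping.date, ping.age, ping.crashes, +1);
  if (ping.lastDate !== null) {
    adjust(ping.lastDate, ping.lastAge, ping.lastCrashes, -1);
  }
}

// Seed the two rows with the example counts from the text.
table.set(key("2012-01-20", 500, 15), 100000);
table.set(key("2012-01-18", 498, 10), 20000);

recordPing({
  date: "2012-01-20", age: 500, crashes: 15,
  lastDate: "2012-01-18", lastAge: 498, lastCrashes: 10,
});

console.log(table.get(key("2012-01-20", 500, 15))); // 100001
console.log(table.get(key("2012-01-18", 498, 10))); // 19999
```

Rows whose counters stop decreasing correspond to users who never submitted again, which is the retention signal the scheme is after.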


    Avoiding Fingerprinting

Now, what about all the other information that you need: startup times,
addons, etc.? If we just add all that information to the same table and
row, it would allow fingerprinting. But that is not necessary. You
merely make one table per atomic piece of information, i.e.

Table A
Date of submission | Age of installation | Crash count | Number of users
Table B
Date of submission | Age of installation | Startup time | Number of users

or of course whatever other database schema you want, as long as each
value is separate. That takes care of the fingerprinting.

At least on the server side, not on the submission side. I would have to
trust you, and anything between you and me. It would be possible to
separate the calls and submit each value separately, but I think that
would be overdoing it.
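
As a rough sketch of the "one table per value" idea, the server can split each incoming submission into independent counters, so that no stored row combines two metrics. The metric names (`crashes`, `startupMs`) and the sample values are hypothetical.

```javascript
// One counter table per metric, so no stored row combines two metrics.
const tables = { crashes: new Map(), startupMs: new Map() };

function recordMetrics(ping) {
  for (const metric of Object.keys(tables)) {
    if (!(metric in ping)) continue;
    // Each row is keyed only by date, age, and this single metric value.
    const rowKey = `${ping.date}|${ping.age}|${ping[metric]}`;
    const t = tables[metric];
    t.set(rowKey, (t.get(rowKey) || 0) + 1);
  }
}

recordMetrics({ date: "2012-01-20", age: 500, crashes: 15, startupMs: 3000 });
recordMetrics({ date: "2012-01-20", age: 500, crashes: 15, startupMs: 4500 });

console.log(tables.crashes.get("2012-01-20|500|15")); // 2
console.log(tables.startupMs.size); // 2
```

As the text notes, this only protects the stored data; the submission itself still carries all values together.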



Re: Anonymous metrics collection from Firefox

beltzner
In reply to this post by Benjamin Smedberg
The wiki page is pretty clear about goals for the feature ("ability to
measure adoption, retention, stability, performance, and aggregated
search counts by engine") as well as requirements for success. What
it's lacking, other than in terms of caveats and warnings throughout
the documentation, is a statement of the privacy principles against
which those requirements must be evaluated.

Recently Ben Adida posted on the Mozilla Privacy Blog
(http://blog.mozilla.com/privacy/2012/01/13/mozilla-to-offer-new-user-centric-services-in-2012/)
outlining a series of design guidelines to use when designing new
features, and committing Mozilla to a basic policy of "no surprises,
real choices, sensible settings, limited data, and user control." I
think that the Data Safety Team he outlines in that post should
evaluate the proposal (once it reaches a final stage, see below!)
using those guidelines and make a judgement on whether or not it
meets the plain-language policy as stated.

The other thing the wiki page is lacking is an understanding of who is
running the project aside from the "metrics team." A clear project
owner should be identified, I think, so that we can better know what's
in plan, in flux, etc. Once there's a final proposal about what's to
happen, it can be judged and evaluated from a privacy perspective.

Our shared goal should be to try and design a system by which we
accomplish the laudable goals and requirements of the metrics team
(plainly: better understanding our product, its users, and how it's
being used) in a way that meets our high standards for data
sovereignty and privacy. We must build a better mousetrap. I suggest
people look to the Crash Stats efforts to this end, as they have long
avoided privacy-invasive actions (at non-trivial cost) while still
mining the available data to gain significant understanding of our
broad user base's experience with the browser.

Finally, and my own personal $0.02 on the issue: I think there are
ways of pre-cleaning data so that you get the benefit of aggregate
data collection (double-blinding, binning and grouping, etc) and the
easiest way to figure those ways out is to begin with the question:
what is the end state we're trying to get to? No data should be
collected without understanding exactly how that data will be
presented to its consumers; that way you can be sure to only collect
the minimum amount of data required to answer the question.
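
As one possible illustration of binning as a pre-cleaning step, the client could map an exact value to a coarse bucket before anything is sent. The bucket boundaries below are arbitrary and chosen only for the example.

```javascript
// Map an exact crash count to a coarse bucket label on the client,
// so the exact number never leaves the machine.
// The bucket boundaries are illustrative only.
function binCrashCount(count) {
  if (count === 0) return "0";
  if (count <= 5) return "1-5";
  if (count <= 20) return "6-20";
  return "21+";
}

console.log(binCrashCount(15)); // "6-20"
```

Choosing the buckets requires knowing how the data will be presented, which is the point made above about starting from the end state.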

cheers,
mike

ps: let's remember that we're all on the same team here, and all want
what's best for Firefox and its users; think carefully before writing,
and always assume the best of your colleagues and community members
when participating in this discussion!

Re: Anonymous metrics collection from Firefox

Ben Bucksch
In reply to this post by Benjamin Smedberg
Blake Cutler wrote on
https://bugzilla.mozilla.org/show_bug.cgi?id=718066#c57 :
 > The short answer is that correlation is not causation.

How do you intend to get causation, rather than just correlation, from
the data delivered by your proposal? I think that's impossible, but maybe
I'm missing something. (If so, maybe I can improve my proposal.)

E.g. you may see that all users of a certain addon stop using Firefox.
But maybe that's just a custom internal addon that a company created,
and the CEO decided to switch to MSIE, because he played golf with
somebody. The cause is a bribe, not a technical one.

Also see the case of a government agency recommending Google Chrome that
you mentioned yourself. The agency *stated* the reason in its
announcement, and there was only one: Chrome's sandbox, which is better
than that of competitors and leads to an inherently more secure browser.
So, users switch to Chrome as a result of that recommendation (or as one
of the reasons). You will never get that cause from metrics.

---

I think: If you want the cause, just 1) listen to people when they
scream at you, and 2) ask them with surveys (small random set, free-form
answers, not multiple choice), that's the only way. Mozilla has been a
bit too stubborn recently, and more metrics data is not going to turn
the ship around. Listening to users is.

Re: Anonymous metrics collection from Firefox

Justin Lebar-3
Daniel Einspanjer wrote in comment 26
(https://bugzilla.mozilla.org/show_bug.cgi?id=718066#c26):

> I just need to make one small clarification. I am happy to have more people looking
> at the problem and challenge, and I would love to see a mechanism that provides a
> feasible alternative to the current ID-centric solution.  The only thing I can honestly
> promise is to collaborate on the thinking of, and consider such a solution if
> presented, and if it meets the stated needs of the project

I think this is the wrong way of looking at this discussion.

The question must be not "is there a better way", but rather "is this
way acceptable"?  We need to be careful not to take this project as a
fait accompli.

Yeah, it sucks that we can't tell why people stop using Firefox.  But
our principles are more important than that.

To that end, the discussion shouldn't center on why these metrics are
important or difficult to obtain another way.  The discussion is about
whether we can at once collect the proposed metrics and stay true to
our values.  If we can't, then we can't collect the data, no matter
how important it may be.

If the current proposal is in violation of our values, it's up to the
metrics team (and whoever wants to help) to come up with an
alternative.  It is explicitly *not* up to those of us opposing the
current proposal to propose an alternative.

I think bsmedberg laid out a good case for why the proposal is
troubling.  I'm curious to hear the metrics team respond to his
points, again *without* referencing the critical need for the data.

-Justin

On Mon, Feb 6, 2012 at 3:30 PM, Ben Bucksch <[hidden email]> wrote:

> Blake Cutler wrote on
> https://bugzilla.mozilla.org/show_bug.cgi?id=718066#c57 :
>> The short answer is that correlation is not causation.
>
> How do you intend to get causation, rather than just correlation, from the
> data delivered by your proposal? I think that's impossible, but maybe I'm
> missing something. (If so, maybe I can improve my proposal.)
>
> E.g. you may see that all users of a certain addon stop using Firefox. But
> maybe that's just a custom internal addon that a company created, and the
> CEO decided to switch to MSIE, because he played golf with somebody. The
> cause is a bribe, not a technical one.
>
> Also see the case of a government agency recommending Google Chrome that you
> mentioned yourself. The agency *stated* the reason in its announcement, and
> there was only one: Chrome's sandbox, which is better than that of
> competitors and leads to an inherently more secure browser. So, users
> switch to Chrome as a result of that recommendation (or as one of the
> reasons). You will never get that cause from metrics.
>
> ---
>
> I think: If you want the cause, just 1) listen to people when they scream at
> you, and 2) ask them with surveys (small random set, free-form answers, not
> multiple choice), that's the only way. Mozilla has been a bit too stubborn
> recently, and more metrics data is not going to turn the ship around.
> Listening to users is.
>

Re: Anonymous metrics collection from Firefox

blake.cutler
In reply to this post by Ben Bucksch
We can better determine causation because we can build models that account for multiple product and usage dimensions at once.

i.e.  retention = # crashes + startup speed + sync use + # add-ons + ...
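
To make the shape of such a model concrete, here is a small sketch that combines several dimensions into one retention estimate. The logistic form and every weight below are assumptions made for illustration; the post does not say which model the metrics team would use, and real weights would have to be fitted to collected data.

```javascript
// Hypothetical weights; a real model would be fitted to collected data.
const weights = {
  intercept: 2.0,
  crashes: -0.1,
  startupSeconds: -0.2,
  usesSync: 0.5,
  addons: 0.05,
};

function retentionProbability(profile) {
  const score =
    weights.intercept +
    weights.crashes * profile.crashes +
    weights.startupSeconds * profile.startupSeconds +
    weights.usesSync * (profile.usesSync ? 1 : 0) +
    weights.addons * profile.addons;
  // The logistic function maps the score to a value between 0 and 1.
  return 1 / (1 + Math.exp(-score));
}

const stable = retentionProbability({ crashes: 0, startupSeconds: 2, usesSync: true, addons: 4 });
const crashy = retentionProbability({ crashes: 30, startupSeconds: 10, usesSync: false, addons: 4 });
console.log(stable > crashy); // true
```

A fitted model of this kind still describes association between the inputs and retention; it accounts for several factors at once, which is the property claimed here.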

Collecting data this way is not sufficient to turn Firefox growth around, but I believe it is necessary. For the first time, Mozilla will have concrete answers to important, long-standing questions. Answers that Mozilla's competitors already have.

It's not about gathering more metrics. It's about collecting and analyzing metrics correctly.

That's not to say there isn't room for improvement. I like your ideas on sampling, for example.

Re: Anonymous metrics collection from Firefox

Joshua Cranmer-2
In reply to this post by Benjamin Smedberg
On 2/6/2012 12:56 PM, Benjamin Smedberg wrote:
> I understand that this opt-out data collection is vastly superior to
> telemetry in terms of collecting a representative sample and
> controlling for bias.

This part troubles me a bit. I do realize that opt-in data collection
does have a bias, but do we have any reason to expect that any data we
would collect opt-out would be affected by this bias to a degree that it
would change decision making processes?

Opt-out is a scary thing, especially when Mozilla has a brand reputation
for user privacy. As a bit of a data junkie myself, I can definitely see
the temptation to want data just to answer questions. But first, I think
there needs to be a clear rule that any data collected on an opt-out
basis must satisfy these guidelines:

1. The data needs to be useful in answering a specific question.
2. This question needs to be identified as one whose answer matters: the
answer needs to be crucial to some active policy discussion (like "do we
drop support for X feature?")
3. The data cannot be reliably collected or estimated from other means.
It shouldn't be a case of "we suspect that this is most likely the case,
but we need confirmation first"; it needs to be "we have no idea".
4. Collecting opt-in would create serious bias that cannot be overcome.

Reading the page and skimming the bug has started to give me the
impression that the data being collected is oriented more toward "just in
case" or "we want hard numbers to back up what we know", which is
definitely not the kind of collection we want to be encouraging.

Re: Anonymous metrics collection from Firefox

Daniel Einspanjer
In reply to this post by Benjamin Smedberg
On Feb 6, 1:56 pm, Benjamin Smedberg <[hidden email]> wrote:

> I understand that this opt-out data collection is vastly superior to
> telemetry in terms of collecting a representative sample and controlling
> for bias. But it's not clear to me why that makes it "ok" from a privacy
> perspective, compared with telemetry, to make this opt-out instead of
> opt-in. Just from my personal experience, I would be surprised by any
> data submitted by Firefox to Mozilla which was not part of regular
> Firefox functionality (app update seems pretty straightforward,
> extension update also, crash submission is opt-in). It seems that if
> this data submission contains any information which is potentially
> personally identifying, then it would be a "surprise". As already
> identified in the bug, there are so many different ways in which data
> can be potentially identifying:
>
> * unique sets of themes (theme collection was removed)
> * unique sets of addons (addon collection is still proposed)
> * the unique IDs used to keep track of particular installations can
> potentially track data back to users (note that the UUID proposal has
> changed somewhat due to privacy concerns, but that there is still a
> local ID -> remote data mapping)

It is an unfortunate fact that even in the other data available to us
today, there are occasional ways in which a user can modify their
system or browser such that some private information is leaked out.
One of the best examples I can give of that is the ability to change
variables that are used in the update or blocklist checks.  There are
requests to those systems that have an e-mail address in the place of
the product name ("Firefox").  There are systems that have a changeset
or bug number or username in the channel or distribution name.
Obviously these are rare cases, but we have seen them.  That is why we
instituted early aggregation of the data before it goes into our data
warehouse, so we can filter those identifying long-tail requests out.
I would still qualify it as a surprise to some unsuspecting developer
though.
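
The kind of early aggregation described here might look roughly like the sketch below, which counts values and folds anything rarer than a threshold into a single "other" bucket. The function name, the threshold, and the sample values are hypothetical; the actual pipeline is not described in this thread.

```javascript
// Count occurrences of each value and keep only those seen at least
// `minCount` times; everything rarer is folded into "other".
function aggregateDroppingLongTail(values, minCount) {
  const counts = new Map();
  for (const v of values) counts.set(v, (counts.get(v) || 0) + 1);
  const result = new Map();
  for (const [value, count] of counts) {
    const bucket = count >= minCount ? value : "other";
    result.set(bucket, (result.get(bucket) || 0) + count);
  }
  return result;
}

// A product-name field where one request carries an e-mail address.
const products = ["Firefox", "Firefox", "Firefox", "someone@example.com", "Firefox"];
const aggregated = aggregateDroppingLongTail(products, 2);
console.log([...aggregated]); // [["Firefox", 4], ["other", 1]]
```

Note that this protects only what reaches the warehouse; the raw request still arrives at the server before aggregation.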

That is actually one of the things that I hope could be improved by
this system.  Unlike AUS or Blocklist, this proposal has a user facing
component that can allow a user to easily see the data being sent in.
It provides an actual value to the user to let them look at statistics
about their browser and compare them to aggregates from other
installations.  If a developer went in to about:metrics and saw their
username in the channel field, they could take immediate action.  They
could delete the data from our data warehouse, and they could change
the config of their profile so it isn't there anymore.  On our end, we
would continue to do what we have always done which is to attempt to
aggregate that data and drop long tail values which we have no value
in seeing anyway.

> A fair bit of the proposal is focused on how we would be protecting and
> anonymizing the data. But if we're not actually collecting personally
> identifiable data, why couldn't we make the entire server system public
> and queryable? It seems that any system that requires server-side
> anonymization to meet user privacy expectations is an unexpected privacy
> risk. Might it also open up our users to potential tracking via court
> order (search warrants) from both U.S. courts and whatever countries we
> put data centers in?

It was critical for us when we proposed this system to have data
collection that was focused on the browser installation rather than
any attempt to learn anything about an individual person.  If there
was any reasonable way we could get the information without using
TCP/IP and having an IP address, I would jump on trying to use that.
Since we don't have that, we have made sure that part of the proposal
was a commitment not to store the IP address with the data and we have
taken several extra steps with how we propose the data is stored and
used so that if another party were to have access to the data, it
would not be of any interest because it would have only information in
it about browser metrics and not PII.


Re: Anonymous metrics collection from Firefox

blake.cutler
In reply to this post by Joshua Cranmer-2
On Monday, February 6, 2012 1:18:56 PM UTC-8, Joshua Cranmer wrote:
> Reading the page and skimming the bug has started to give me the
> impression that the data being collected is oriented more toward "just in
> case" or "we want hard numbers to back up what we know", which is
> definitely not the kind of collection we want to be encouraging.

I understand where you're coming from, but this data isn't being collected "just in case." We need this data to 1) calculate Firefox's retention rate and 2) identify factors that drive retention.

It's hard to overstate how important these questions are right now. Firefox is rapidly losing market share everywhere in the world, Europe included.

Re: Anonymous metrics collection from Firefox

Nicholas Nethercote
In reply to this post by Ben Bucksch
On Mon, Feb 6, 2012 at 11:47 AM, Ben Bucksch
<[hidden email]> wrote:
>
> This project will make very bad news, that is almost certain. The Telemetry
> question already gave a bad impression.

Can you give more details about this?  I haven't heard anything about it.

Nick

Re: Anonymous metrics collection from Firefox

Ben Bucksch
In reply to this post by Ben Bucksch
On 07.02.2012 00:15, Nicholas Nethercote wrote:
>> The Telemetry question already gave a bad impression.
> Can you give more details about this?  I haven't heard anything about it.

FYI: It's the question that comes up at the top of the browser window
when you start Firefox the second time (with a new profile). It asks you
whether you want to submit performance data etc.

It makes a bad impression on *me*, because Mozilla wants to collect data
from me. Other companies have abused that "anonymous" so badly that any
such question for me now is suspicious. I think that many users feel the
same. (Obviously, not asking is even worse.)

As for hard numbers about Telemetry for other people and in general, I
can't speak about that. Somebody else would need to give that information.

Re: Anonymous metrics collection from Firefox

David E. Ross-3
In reply to this post by Benjamin Smedberg
On 2/6/12 10:56 AM, Benjamin Smedberg wrote [in part]:
>
> Note that while Ben Bucksch has also brought up legal concerns about
> whether German or European law forbids this kind of data collection, I'm
> not particularly interested in that portion of the discussion because very
> few of us in the project are legal experts who can have an informed
> opinion. So please let's avoid ratholing on those legal issues instead
> of the basic privacy issue.

I think you have this backwards.

An enterprise the size of Mozilla must surely have attorneys on staff or
retainer.  You should find out if what is proposed is legal before
expending any efforts to implement it.  Besides Germany, there might be
other nations with laws impacting on this concept.

Furthermore, where such laws do not exist, Mozilla needs to have a firm
policy on how the organization would respond to a warrant or subpoena
for the data.  That policy must be in place before the data collection
begins and should address not only a government's request for the data
but also a request resulting from a civil lawsuit.

--

David E. Ross
<http://www.rossde.com/>.

Anyone who thinks government owns a monopoly on inefficient, obstructive
bureaucracy has obviously never worked for a large corporation.
© 1997 by David E. Ross

Re: Anonymous metrics collection from Firefox

beltzner
In reply to this post by Nicholas Nethercote
On Mon, Feb 6, 2012 at 6:15 PM, Nicholas Nethercote
<[hidden email]> wrote:
> On Mon, Feb 6, 2012 at 11:47 AM, Ben Bucksch
> <[hidden email]> wrote:
>>
>> This project will make very bad news, that is almost certain. The Telemetry
>> question already gave a bad impression.
>
> Can you give more details about this?  I haven't heard anything about it.

I'm not sure that it's really germane to the discussion at hand - I
don't think our choices here should be governed significantly by our
fear of bad press, or our belief that the issue will not garner
significant public notice at all. We should be making our choices
based on:

 - an actual need for the information (how will we use it to better
the product?)
 - our ability to design a feature that meets the stated goals while
still meeting our strict stance on privacy

cheers,
mike

Re: Anonymous metrics collection from Firefox

Justin Wood (Callek)-2
In reply to this post by blake.cutler
[hidden email] wrote:
> On Monday, February 6, 2012 1:18:56 PM UTC-8, Joshua Cranmer wrote:
>> Reading the page and skimming the bug has started to lead me to the
>> impression that the data being collected is more oriented in "just in
>> case" or "we want hard numbers to back up what we know", which is
>> definitely not the kind of collection we want to be encouraging.
>
> I understand where you're coming from, but this data isn't being collected "just in case." We need this data to 1) calculate Firefox's retention rate and 2) identify factors that drive retention.

We need to make sure the data metrics issues properly reflect our
privacy policies/plans, and are not answered with just a "we need this",
as has been said elsewhere.

> It's hard to overstate how important these questions are right now. Firefox is rapidly losing market share everywhere in the world, Europe included.

Using this logic, SeaMonkey should gather all data about all users we
possibly can, because we have been losing market share heavily ever
since we became SeaMonkey from "the Mozilla Suite".

From where I sit, the largest fault of our market share is the fact
that Google has heavy brand awareness, and is doing LOTS of expensive
advertising campaigns, and well-done in most cases. So "Google Chrome"
is interesting to the ignorant-of-computer users.

Also Microsoft is (Finally) developing a Sane IE, which means less
reason for people to install a different web browser on Windows.

Lastly Apple has a lead on Mobile in general, and we can't even offer a
Firefox for mobile, and instead we are stuck with doing a Firefox Home
to share bookmarks, while the default webkit-based browser[s] are
pulling ahead there, given the iPhone/iPad proliferation.

Now I admit my observations are not based on concrete data I can cite
right now, but are based on sporadic news research I have done, as well
as hours of TV and Internet use over the past years.

--
~Justin Wood (Callek)

Re: Anonymous metrics collection from Firefox

Henri Sivonen
In reply to this post by Benjamin Smedberg
On Mon, Feb 6, 2012 at 8:56 PM, Benjamin Smedberg <[hidden email]> wrote:
> I became aware of this project recently when
> I was asked to review some implementation code, and I have some concerns
> about our privacy stance in this feature.
...
> I understand that this opt-out data collection is vastly superior to
> telemetry in terms of collecting a representative sample and controlling for
> bias. But it's not clear to me why that makes it "ok" from a privacy
> perspective, compared with telemetry, to make this opt-out instead of
> opt-in.

Thanks for posting this.

This reminds me of Sync and Fennec Native. First, Sync was very
carefully designed to have privacy characteristics that suit Mozilla's
stated privacy principles and those characteristics were bragged about
(which is good since the characteristics were special in the
industry). Then (as I understand it) another team than the one who had
designed the feature suggested that Fennec Native write (part of) the
Sync data to storage that could get synced to Google without the same
privacy characteristics and suggested that crypto characteristics
could be weakened in the name of ease of use (without even
demonstrating that losing the crypto would have been the key to making
the setup flow better).

Now Telemetry has been very carefully designed to have privacy
characteristics that suit Mozilla's stated privacy principles and
those characteristics have been bragged about. And then another team
comes along, treats that design as a bug, and wants to send a per-user ID
to enable longitudinal study. If doing what this metrics feature
suggests to be done was OK, surely Telemetry would already have UUIDs
and support for "longitudinal study".

It bothers me that this scenario repeats. While in general discussing
various ideas is good, having this scenario repeat makes it look like
Mozilla's privacy principles are constantly on the verge of getting
overturned instead of being something that users can trust over the long
term. (Fortunately, the Fennec Native situation turned out OK. Fennec
Native now has its own data store and the crypto flow is what it used
to be.)

As for the Germany/EU aspect: (Note the rest of this paragraph says
nothing about law. I'm not trying to play a lawyer here.) Even if
sending an UUID had no real privacy impact, sending an UUID would be
bad publicity in Europe. The usage share of Firefox is in the decline.
Europe in general and Germany in particular is a place where the usage
share of Firefox is high. It seems like a bad idea to hurt that market
share in order to study metrics related to it.

--
Henri Sivonen
[hidden email]
http://hsivonen.iki.fi/

Re: Anonymous metrics collection from Firefox

Gervase Markham
In reply to this post by Daniel Einspanjer
On 06/02/12 22:16, Daniel E wrote:
> It is an unfortunate fact that even in the other data available to us
> today, there are occasional ways in which a user can modify their
> system or browser such that some private information is leaked out.
> One of the best examples I can give of that is the ability to change
> variables that are used in the update or blocklist checks.  There are
> requests to those systems that have an e-mail address in the place of
> the product name ("Firefox").  There are systems that have a changeset
> or bug number or username in the channel or distribution name.

I have no reason to doubt you that this happens, but there is a big
difference between designing your system to request particular data, and
accidentally receiving some of it because a user misconfigures their
browser.

If I have a web "contact me" form, and someone pastes their entire
medical history into it and hits Submit, I probably want to delete the
data - but I don't have to engineer my data handling process for content
coming from that form so that it's robust for handling medical data!

> That is actually one of the things that I hope could be improved by
> this system.  Unlike AUS or Blocklist, this proposal has a user facing
> component that can allow a user to easily see the data being sent in.
> It provides an actual value to the user to let them look at statistics
> about their browser and compare them to aggregates from other
> installations.  If a developer went in to about:metrics and saw their
> username in the channel field, they could take immediate action.  They
> could delete the data from our data warehouse, and they could change
> the config of their profile so it isn't there anymore.  On our end, we
> would continue to do what we have always done which is to attempt to
> aggregate that data and drop long tail values which we have no value
> in seeing anyway.

These sound like excellent ideas, but they don't seem to have a bearing
on the question of opt-in or the question of a unique identifier.

> It was critical for us when we proposed this system to have data
> collection that was focused on the browser installation rather than
> any attempt to learn anything about an individual person.

I'm not sure that's a distinction we can make. I am the only user of my
browser, and I'm sure that's true of lots of other people too. What can
you tell about me from my list of installed add-ons? I won't give you
the full list, but I suspect you could tell:

- I do web development of RESTful services using JSON
- I work for Mozilla
- I care about my privacy

Gerv

Re: Anonymous metrics collection from Firefox

Daniel Einspanjer
On Feb 6, 8:02 pm, "David E. Ross" <[hidden email]> wrote:

> An enterprise the size of Mozilla must surely have attorneys on staff or
> retainer.  You should find out if what is proposed is legal before
> expending any efforts to implement it.  Besides Germany, there might be
> other nations with laws impacting on this concept.
>
> Furthermore, where such laws do not exist, Mozilla needs to have a firm
> policy on how the organization would respond to a warrant or subpoena
> for the data.  That policy must be in place before the data collection
> begins and should address not only a government's request for the data
> but also a request resulting from a civil lawsuit.
>

We do have a legal team, and we also engaged outside legal counsel
specifically on the question of European and German law for this
project.  We have asked the legal and privacy teams to share the
results of their reviews.




On Feb 7, 1:28 am, "Justin Wood (Callek)" <[hidden email]> wrote:
> Using this logic, SeaMonkey should gather all data about all users we
> possibly can, because we have been losing market share heavily ever
> since we became SeaMonkey from "the Mozilla Suite".
>

Please reconsider the phrase "should gather all data about all users
we possibly can".  This project is not about gathering all data
possible.  It has a very specific list of the minimal data
required to answer the questions that were determined to be
necessary.  There has been a lot of information shared about
what those questions are and the justifications for most of the data
points on other mediums such as the bugs and the wiki.  I am happy to
continue to work toward sharing justifications and considerations for
any of the data listed.  It is right for Mozilla and the community to
ask for those explanations.  It is difficult to maintain a productive
discussion where everyone has a clear picture of the facts when using
exaggerated phrases though.

> From where I sit, the largest fault of our market share is the fact
> that Google has heavy brand awareness, and is doing LOTS of expensive
> advertising campaigns, and well-done in most cases. So "Google Chrome"
> is interesting to the ignorant-of-computer users.
>
> Also Microsoft is (Finally) developing a Sane IE, which means less
> reason for people to install a different web browser on Windows.
>

Both of these are great concerns that tie in to this project.  These
changes in the market are significant changes that primarily deal with
a large class of mainstream users that are under-represented in our
current understanding.  These other companies are focusing a lot of
attention on understanding how the browser is used by mainstream
users.  We are striving to improve our own understanding.

We don't want to just do things the same way as others though.  We
have tried to develop a project that can analyze usage without
collecting personally identifying information.  We have worked with
the privacy and legal teams to propose policies to mitigate the
unavoidable PII such as ensuring that IP addresses are never tied to
the data and that we don't leave any easy way to associate identifying
information such as an e-mail address or name with the data.  We have
also put into the project a set of goals around giving the users
visibility, functionality, and control of the data generated by their
browser.




On Feb 7, 3:25 am, Henri Sivonen <[hidden email]> wrote:
> ...
> Now Telemetry has been very carefully designed to have privacy
> characteristics that suit Mozilla's stated privacy principles and
> those characteristics have been bragged about. And then another team
> comes along, treats that design as a bug, and wants to send a per-user ID
> to enable longitudinal study. If doing what this metrics feature
> suggests to be done was OK, surely Telemetry would already have UUIDs
> and support for "longitudinal study".

We definitely spent a lot of time looking at Telemetry and working
with that team.  The data that Telemetry collects and the purpose that
it exists for is different though.  Telemetry was designed to enable
developers to understand the performance characteristics of individual
features or code paths "in the wild".  It does not require retention
or the same sort of longitudinal data that MDP proposes to meet those
requirements.  Putting those characteristics into Telemetry would be
doing the very thing that several people have spoken out against,
adding data to a system that is not directly needed by that system.

There is a significant value in judiciously partitioning data by
purpose.  It enables better policy governing the data.  It allows
finer control over what data is collected and how it is reviewed.  It
allows walls to be put up to prevent associations from being made
where the organization does not wish them to be made (for instance
tying usage data directly to crash reports).


> As for the Germany/EU aspect: (Note the rest of this paragraph says
> nothing about law. I'm not trying to play a lawyer here.) Even if
> sending an UUID had no real privacy impact, sending an UUID would be
> bad publicity in Europe. The usage share of Firefox is in the decline.
> Europe in general and Germany in particular is a place where the usage
> share of Firefox is high. It seems like a bad idea to hurt that market
> share in order to study metrics related to it.

I just want to clarify precisely what is being discussed when we say
"sending an UUID".  MDP is generating cumulative data on the client
and submitting that data as a document.  That document is given a new
UUID and the client retains that document ID.  Every time a new
submission is made, it will have a new document identifier.  It is
even possible for the identifier to not be part of the URL (which is
sent using SSL).  If the user wishes to delete the usage data for
their installation, the browser submits a delete request with the last
submitted ID.  When a new document is generated on another day and
submitted, the client also sends the old document ID to be deleted so
that there are not two copies of the data on the server.  This allows
us to look at retention.  If a document is older than N days, we know
that there have been no further submissions from that installation.
This implementation does still require policy and trust.  It requires
that we not record IP addresses with the data set.  It requires that
we do not longitudinally track location.  There might be further ways
we can make it easier to follow those policies.
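
To make the document ID scheme described above easier to follow, here is
a minimal sketch of it in Python.  Everything in it is an assumption made
for illustration: the names MetricsStore, Client, put, delete and lapsed
are hypothetical, the store is an in-memory dict rather than the real
server, and the client is assumed to be the side that generates the new
UUID.  It is not the MDP implementation.

```python
import uuid
from datetime import datetime, timedelta


class MetricsStore:
    """Hypothetical in-memory stand-in for the server-side document store."""

    def __init__(self):
        # document ID -> (time received, payload); no IP address is kept
        self.docs = {}

    def put(self, doc_id, payload, received_at, replaces=None):
        # A new submission names the previous document so it can be removed,
        # leaving only one document per installation on the server.
        if replaces is not None:
            self.docs.pop(replaces, None)
        self.docs[doc_id] = (received_at, payload)

    def delete(self, doc_id):
        # User-initiated deletion of the installation's data.
        self.docs.pop(doc_id, None)

    def lapsed(self, now, n_days):
        # Documents older than N days: no further submission has replaced them.
        cutoff = now - timedelta(days=n_days)
        return [d for d, (received, _) in self.docs.items() if received < cutoff]


class Client:
    """Hypothetical browser-side half: remembers only the last document ID."""

    def __init__(self, store):
        self.store = store
        self.last_id = None

    def submit(self, payload, now):
        new_id = str(uuid.uuid4())  # fresh identifier for every submission
        self.store.put(new_id, payload, now, replaces=self.last_id)
        self.last_id = new_id
        return new_id

    def delete_my_data(self):
        if self.last_id is not None:
            self.store.delete(self.last_id)
            self.last_id = None
```

In this sketch an ID that leaks today stops resolving to anything after
the next submission, and retention falls out of lapsed() without any
stable per-installation identifier being stored.  As noted above, it
still depends on policy: nothing in the code itself prevents a server
from logging IP addresses alongside the documents.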



On Feb 7, 6:19 am, Gervase Markham <[hidden email]> wrote:

> On 06/02/12 22:16, Daniel E wrote:
>
> > It is an unfortunate fact that even in the other data available to us
> > today, there are occasional ways in which a user can modify their
> > system or browser such that some private information is leaked out.
> > One of the best examples I can give of that is the ability to change
> > variables that are used in the update or blocklist checks.  There are
> > requests to those systems that have an e-mail address in the place of
> > the product name ("Firefox").  There are systems that have a changeset
> > or bug number or username in the channel or distribution name.
>
> I have no reason to doubt you that this happens, but there is a big
> difference between designing your system to request particular data, and
> accidentally receiving some of it because a user mis-configures their
> browser.
>
> If I have a web "contact me" form, and someone pastes their entire
> medical history into it and hits Submit, I probably want to delete the
> data - but I don't have to engineer my data handling process for content
> coming from that form so that it's robust for handling medical data!
>

We need the legitimate data that is expected to be in those
variables.  We are designing the system to be able to use that data.
We do not want to be burdened by illegitimate data that is available
as the result of a mistake on the part of a developer or user, so we
have made sure that the system has checks and features to restrict and
eliminate that data easily.
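
As a rough illustration of the kind of check meant here, the following
Python sketch aggregates a field such as the channel name and folds rare
("long tail") values into a single bucket, so that a username or e-mail
address mistakenly placed in that field never appears in the aggregate.
The function name aggregate_field, the min_count threshold and the
"(other)" label are all hypothetical; the actual checks in the proposal
may work quite differently.

```python
from collections import Counter


def aggregate_field(values, min_count):
    """Count values, replacing any seen fewer than min_count times by one bucket."""
    counts = Counter(values)
    # Keep only values common enough to be legitimate channel/product names.
    kept = {value: n for value, n in counts.items() if n >= min_count}
    # Rare values are counted but their content is discarded.
    dropped = sum(n for n in counts.values() if n < min_count)
    if dropped:
        kept["(other)"] = dropped
    return kept
```

For example, five "release" values, three "beta" values and one stray
developer-specific string, aggregated with min_count=2, would report the
two real channels plus one "(other)" entry, without the stray string.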


> > It was critical for us when we proposed this system to have data
> > collection that was focused on the browser installation rather than
> > any attempt to learn anything about an individual person.
>
> I'm not sure that's a distinction we can make. I am the only user of my
> browser, and I'm sure that's true of lots of other people too. What can
> you tell about me from my list of installed add-ons? I won't give you
> the full list, but I suspect you could tell:
>
> - I do web development of RESTful services using JSON
> - I work for Mozilla
> - I care about my privacy

I believe that it is important to consider even the worst cases, but
please keep in mind that this is not a normal case.  The system is
designed such that it would have no way of telling that Gerv is a web
developer who works for Mozilla and cares about privacy.  There are
specific policies and features put in place to prevent the system from
ever being able to associate those conclusions with a person.  We
don't keep IP addresses with the data to prevent the possibility of
using that IP address to identify the person using an installation.
We use a document identifier so that even if one document ID were ever
leaked or shared by you (say via an e-mail), the ID would change at
the next submission so we would not be able to use that ID to look up
the data from your installation next month and see if you still care
about privacy.



Re: Anonymous metrics collection from Firefox

beltzner
In reply to this post by blake.cutler
On Mon, Feb 6, 2012 at 4:04 PM,  <[hidden email]> wrote:
> Collecting data this way is not sufficient to turn Firefox growth around, but I believe it is necessary. For the first time, Mozilla will have concrete answers to important, long-standing questions. Answers that Mozilla's competitors already have.

That's a laudable and excellent goal; the wiki page should specify
exactly what those questions are, and how the data will be used to
answer them, *before* any action is taken to collect the data. If the
questions are indeed important and long-standing, it shouldn't be hard
to generate that list!

> It's not about gathering more metrics. It's about collecting and analyzing metrics correctly.

I'm very comforted to hear that, as it implies that things are being
thought of in terms of "what do we need to know? what is the minimal
amount of data that can be collected to answer those questions?" which
is the right way to go about things.

I don't think anyone here is questioning the motives of the metrics
team, or indeed the benefit of being able to answer those questions.
Instead, we're trying to review the proposal so that we can do this
better than any other company, and prove that the questions can be
answered in a way that is sensitive to individual privacy and data
collection processes.

cheers,
mike