The author defines a formal protection model named k-anonymity to address the re-identification problem, points out three possible attacks against k-anonymity, and provides a set of policies to thwart those attacks. The author also analyzes existing work in the statistical and security communities, showing that none of it provides an effective solution. The paper is neatly organized, which enables readers to grasp the main idea more easily. The introduction explains the re-identification problem and the basic idea behind k-anonymity, which is to increase the number of candidates for linking. The main body then introduces the k-anonymity model and its accompanying policies, with each section beginning with a summary paragraph. Sufficient examples accompany the definitions, lemma, and policies, making the paper more readable. One of the strengths of the paper is that it identifies a problem and provides novel solutions. However, does k-anonymity work in practice? The paper would be more convincing if the author provided experimental results as evidence. In addition, the accompanying policies are heuristic, and there may exist other kinds of attacks that the three policies cannot thwart. These problems should be addressed in future work. Another possibility is extending k-anonymity to more complex data models.
=============================================================================
What are the contributions of the paper?
The paper tries to develop an approach that allows data to be revealed that is relevant enough for statistical purposes yet insufficient to violate privacy. It guarantees a user-chosen level of anonymity in the revealed data. The paper aims to provide a model for understanding, evaluating, and constructing computational systems that control inferences from the revealed data.

What is the quality of the presentation?
The general structure of the paper is excellent. The flow of ideas is clear and logical.

What are the strengths of the paper?
Good quality of presentation and style (although there are some informalities such as "Let me ..."). An innovative approach that might lead to a breakthrough in the area. The major assumption, namely the possibility of accurately identifying quasi-identifiers, is explicitly stated, although the phrase "it is believed that these attributes can be easily identified by the data holder" is strange unless we assume that some general external guidelines on this will be developed. An analysis of possible attacks against k-anonymity is given.

What are its weaknesses?
The first part of the paper (introduction/motivation plus background) is too long for a non-introductory paper in the field. Some notions, such as those from the relational database area, are redundant. The formal definitions look cumbersome and redundant as well, while intuition is sometimes lacking. The conclusion is extremely brief, and no future directions of work are proposed, as if the author did not consider the topic promising enough. No attention is paid to maintaining the k-anonymity invariant while the information in the database is modified. The approach may drop information needed for statistical purposes while anonymizing the data. The source of the information may also implicitly reveal additional information; for example, a database of patients at a specific hospital gives away a lot of locality information even without providing identifiers.

What is some possible future work?
Consider the situation in which different entities release information with different levels of k-anonymity.
Consider how the k-anonymity invariant can be maintained efficiently while new records are added to, or old records removed from, the database. Develop a framework or guidelines for implementing the idea in practice (the idea makes sense only if it is established as strict guidelines for developing data-collection systems). Some ideas that came to mind while reading the paper: a one-time linkage of sensitive information is quite possible and very hard to guard against, and because personal information does not change often, even linkage at a rate of once every few years can do a great deal of harm to personal privacy.
=============================================================================
What are the contributions of the paper?
The paper has the following main contributions:
1. It draws attention to potential privacy violations caused by data holders releasing person-specific data.
2. It presents a formal protection model named k-anonymity that can prevent re-identification of released data.
3. It introduces a set of accompanying policies that can prevent re-identification attacks on data released under the k-anonymity model.
------------------------------------------------------------------------------------------
What is the quality of the presentation?
The quality of the presentation is good for the following reasons:
1. The structure of the paper is well organized, presenting the topics in a logical sequence: problem investigation -> solutions -> solution evaluation -> improvements.
2. The concepts and model are clearly and precisely presented using formal definitions, tables, and examples, which makes them easier to understand.
------------------------------------------------------------------------------------------
What are the strengths of the paper?
1. It is a fine paper for both technical and non-technical audiences, i.e., it uses both high-level material (such as concrete real-life examples) and technical material (such as formal definitions and symbols) to present the problems and solutions.
2. The paper presents the concept and use of the k-anonymity model not only clearly but also precisely by discussing the following:
- assumptions and constraints
- exceptional cases
- further improvements
------------------------------------------------------------------------------------------
What are its weaknesses?
It would be more convincing if the paper demonstrated the benefits of the k-anonymity model through direct comparisons with other solutions in areas such as attack vulnerability and differences in data structure. It would be more precise if the paper used formal mathematical proofs to convince technical audiences that re-identification is indeed infeasible when the k-anonymity model is used under general conditions. Such proofs could include cost and complexity analyses of possible re-identification attacks.
------------------------------------------------------------------------------------------
What is some possible future work?
1. Other types of possible attacks against the k-anonymity model could be examined, and solutions to prevent these attacks could be investigated.
2. An algorithm needs to be developed to implement the k-anonymity model, and its performance (such as running cost and complexity) needs to be evaluated when it does the following:
- filtering data as the data is generated
- comparing data when the number of occurrences is checked
- sorting data to prevent possible re-identification attacks
=============================================================================
Contribution: Private and confidential information can sometimes be inadvertently disclosed when the holder of data from different sources can draw inferences by joining the data from all sources. The author's model attempts to create circumstances in which such inferences cannot be drawn, because each released record maps ambiguously to at least k individuals.

Quality: I liked her style of presentation. The author's writing is clear, concise, and intuitive.

Strengths: The author presents a very interesting and difficult-to-solve problem in the area of personal privacy. She presents not only the strengths of her model but also acknowledges the possible attacks that can be made against it, and she discusses possible defences against those attacks.

Weaknesses: The use of the first-person singular gives the paper a "folksy" tone; the paper does not sound as professional as it could. Although she is the sole author, it is unlikely that she did the work without any other assistance. The formula in Definition 2 on page 7 needs further clarification; it is a trifle difficult to understand.

Future Work: Using the k-anonymity model on real data, as opposed to the toy examples in the paper, would be the next step in testing whether the author's model works in practice.
=============================================================================
Summary/Contributions:
- The main contribution is the k-anonymity requirement, a way of ensuring that released person-specific data cannot be associated with an individual. The focus is to protect the identity of the people who are the subjects of the data.
- Four possible attacks are presented.

Strengths & Quality of Presentation:
- Identifies a very real problem and offers a solution.
- Inferences are an easily overlooked attack vector; this paper increases awareness.
- Running examples make it easier to follow, even for non-database people.

Weaknesses/Future Possible Work:
- Very weak security guarantee/argument: "this paper seeks to primarily protect against known attacks", i.e., the four attacks the author was able to think of. Future work must strengthen this argument and, if possible, provide some sort of guarantee.
- The assumption. Without some assumption, this problem is likely intractable. However, imagine the optimal case, where the data holder has complete knowledge of all the external data available at the time of release. She releases the data, and two weeks later an external source is released that compromises her release. Even in the optimal case under this assumption, there is little hope of the desired outcome. Future work must relax and remove assumptions if this is to be practical. This may be a "show-stopper", since the data holder has no control over external information.
- The paper assumes a proper quasi-identifier is identified; how difficult is this problem, and is it a reasonable assumption? I think that determining the quasi-identifier poses a significant challenge and may not even be possible. This assumption is necessary for the paper to proceed, but the approach is not practical until this problem is solved.
I suppose the data holder could always take the complete set of attributes as the quasi-identifier.
- The formal definition of quasi-identifier is obscured by the use of poorly defined functions f_c and f_g; its meaning only becomes clear when considered together with the example.
- It is fine to consider only the case of protecting someone's identity, but it would have been nice to know how else this could be applied.
- The idea of "adding noise to the data while still maintaining some statistical invariant" was dismissed with only one small reason. Could it be used in conjunction with k-anonymity? In cases where k-anonymity is not achievable (say only (k-2)-anonymity is possible), could noise be used to reach k-anonymity?
- The "Lemma" on page 9 is merely a restatement of Definition 3, and it is not used to prove anything later in the paper, so it should be omitted. Also, Examples 3 and 4 are highly redundant.
=============================================================================
In this paper the author does an excellent job of introducing the idea of k-anonymity, not only defining it precisely but clarifying it with table-based examples that require no prior background in the various types of "joins" found in database query languages. However, I believe the concept of quasi-identifiers was not well explained, and the importance of identifying them precisely was not done justice. Specifically, the author assumes that all quasi-identifiers will be known to the person disclosing the information, which I believe is a very unrealistic assumption. Lastly, I think the attacks against k-anonymity described in the paper are perhaps the tip of the iceberg. Various other attacks on k-anonymity are presented in the literature; a good example is the paper ["l-diversity: Privacy beyond k-anonymity", Machanavajjhala, Gehrke, Kifer], which uses the homogeneity attack and the background knowledge attack to illustrate that k-anonymity does not imply privacy.
=============================================================================
The author developed an information-suppression technique operating on quasi-identifiers. The paper addresses the issue of releasing massive amounts of person-specific data while maintaining individual privacy; making the subjects of the data anonymous is the goal. k-anonymity, the protection model devised by the author, reaches this goal while retaining the usefulness of the data. The author explains her idea in fluent English, but the exposition sometimes lacks clarity. The first question that comes to mind is how the algorithm works in terms of modifying the sequences of values corresponding to the quasi-identifier attributes so that k-anonymity is achieved. What is the trade-off between modifying the data (in order to satisfy k-anonymity) and the usefulness of the data? I personally feel that properly recognizing quasi-identifiers in a reasonable amount of time is of great importance, yet the paper elaborates little on this aspect. Could this be possible future work, or has someone already come up with a suggestion? One neat thing about the paper is that the author foresaw possible attacks against k-anonymity and provided corresponding solutions.
=============================================================================
What are the contributions of the paper?
- An overview of other models of anonymity.
- An overview of possible attacks on anonymity from released data.
- A proposed new model to provide k-anonymity when releasing data.
- Proposed possible attacks on the model and how they could be addressed.

What is the quality of the presentation?
- Very clear, simple, and right to the point.

What are the strengths of the paper?
- As the points above mention: very simple to understand, right to the point.

What are its weaknesses?
I found a slight problem with the lack of statistics shown; it would be good to have some results backing up the idea before the conclusion. Also, regarding the first example stating that 87% of the population can be identified by linking medical data and voter data: it seems like a very astonishing result at first, but given that ZIP code, birth date, and sex were all in the list, the result does not seem too alarming.

What is some possible future work?
- More empirical results.
- Test how much the level of anonymity is lowered when different quasi-identifiers are released in real-world data.
- In general, implement the idea, test it on more real-life data, and publish those results.
=============================================================================
Q: What are the contributions of the paper?
A: The contribution of this paper is that it proposes the k-anonymity model, which can be used by private data holders to mitigate re-identification attacks on released information without undermining the usefulness of the released data. In summary, altering the released information so that it maps to many possible people can effectively thwart this kind of attack: "The greater the number of candidates provided, the more ambiguous the linking, and therefore, the more anonymous the data."

Q: What is the quality of the presentation?
A: The paper has a well-organized overall structure. It first analyzes current attacks on released information, then proposes and elaborates the k-anonymity model, and finally ends with some examples of potential attacks on the proposed k-anonymity model and a proposed solution to each of them.

Q: What are the strengths of the paper?
A: The paper's analysis of current potential re-identification attacks on released information is strong. Also, the presentation of the k-anonymity model uses mathematical notation that is formal and precise. Finally, the examples used to illustrate concepts are very effective.

Q: What are its weaknesses?
A: The paper fails to describe the steps for creating a k-anonymous table from a private data table that is not k-anonymous.

Q: What is some possible future work?
A: More work could be put into research on an efficient procedure for creating a k-anonymous table from a non-k-anonymous private table.
=============================================================================
* What are the contributions of the paper?
It gives a formal method (k-anonymity) that protects privacy against inference from linking with known external sources. This method can guard against re-identifying individuals if the data holder follows the policies (mentioned in the attack examples) when releasing the data.

* What is the quality of the paper?
The paper explains the definition of k-anonymity thoroughly and accurately. It uses enough examples to describe k-anonymity, which clarifies the quality and accuracy of this new method. Also, it gives examples of different existing attacks on the method and gives a solution to each of them.

* What are the strengths of the paper?
The structure of this paper is well formed. It gives brief descriptions of existing work (statistical databases and multi-level databases) and lists its weaknesses, and then presents the k-anonymity method. In this way, it highlights the importance and success of the method.

* What are its weaknesses?
In the abstract, the paper mentions that there is "a set of accompanying policies for deployment", but I did not find them; I think the author should list them in a separate section. A weakness of the method is its requirement of k occurrences of the same sequence of data in the table; in many situations the data are not naturally duplicated. The paper does not mention any future work.

* What is some possible future work?
1) As the author says in the paper, "The greater the number of candidates provided, the more ambiguous the linking and therefore, the more anonymous the data." Can we find another approach based on k-anonymity that does not require a huge number of candidates yet still gives the same quality of privacy protection? Maybe it could be combined with multi-level databases.
2) The whole approach rests on the assumption that the data holder can accurately identify the quasi-identifier. Can we improve k-anonymity under the weaker assumption that the holder knows only part of the quasi-identifier?
=============================================================================
What are the contributions of the paper?
Person-specific data is becoming more and more valuable to industry and academic research, and privacy protection when releasing such data is an important concern. In the author's 1998 paper [1], she proposed a model named k-anonymity and associated policies to ensure that the information corresponding to one person cannot be distinguished from that of at least k-1 other individuals in the same release. In this paper, the author discusses several potential attacks against k-anonymity and new policies to defeat such attacks.

What is the quality of the presentation?
I think the quality of the presentation is OK. The whole paper is easy to understand, the problem is well motivated, and the examples are especially good at demonstrating the problems. However, I feel this is not a serious paper, in that its contribution is not so clear and the conclusion is too short to cover its main points.

What are the strengths of the paper?
I really like Section 4, which discusses several potential attacks against k-anonymity and possible solutions.

What are its weaknesses?
The paper is not self-contained, in that it says little about the actual algorithms and policies that should be used to generate released data satisfying the k-anonymity property. The author comments that "this paper significantly amends and substantially expands the earlier paper...", but to me the only obvious extension is the discussion of possible attacks against k-anonymity, which by itself may or may not constitute a sufficient contribution for a new paper. At least for the first three sections, I see no fundamental difference from the author's 1998 paper. I am also curious why the author's 1998 paper ended up unpublished.

What is some possible future work?
The degree of protection offered by k-anonymity really depends on the correct selection of quasi-identifiers, which in turn depends on the data holder's ability to identify the attributes that could leak personal identity information and on the information the receiver of the released data already has.
Rules or policies for identifying these quasi-identifiers may be worth future research. k-anonymity works by making all information that could uniquely identify a person fuzzy, while the techniques used in statistical databases usually add noise to the information while maintaining statistical invariants. In general, the former approach is strong against known attacks while the latter is better against unknown attacks. It might be possible to combine the two approaches in some way, to provide better protection against both known and unknown attacks with less data distortion.

Reference:
[1] P. Samarati and L. Sweeney, "Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression," submitted to IEEE Security and Privacy, 1998, unpublished.
=============================================================================
What are the contributions of the paper?
It provides a model for protecting privacy that improves on contemporary methods. We no longer need to insert noisy data or change the structure of the database to improve the level of data-privacy protection. The paper also shows a way to use the original data framework to improve protection.

What is the quality of the presentation?
It is good. The presentation of the model includes definitions, explanations, and examples, all of which make it easy to understand the meaning and the author's intention.

What are the strengths of the paper?
Since there is only a model and no practical results, the strengths of the paper are limited. However, the concept, in my view, is good.

What are its weaknesses?
1. It takes too long to get to the main topic.
2. There is no mathematical proof of the feasibility and efficiency of the model.
3. There is no consideration of how to put the model to use.

What is some possible future work?
The most important thing is to improve the model so that it is more suitable for protecting privacy.
=============================================================================
This paper introduces a method of releasing a table of person-specific anonymized data in such a way that it is not possible to determine which entries correspond to a specific person. Studies have shown that 87% of the US population can likely be identified given only a table containing just ZIP code, birth date, and gender, a set of data which may appear to grant a sufficient amount of anonymization. Tables released using the author's scheme guarantee that any given entry cannot be distinguished from at least k-1 others in the table, thus allowing control over the degree to which data can be linked to a specific person. The author provides several examples of releases of information that were thought to be sufficiently anonymized but in fact can be linked using other publicly available information. In one specific case, the author combined a voter list with medical data released by a hospital to re-identify the entry corresponding to the Governor of Massachusetts. Other examples include releases of the same government document censored by different departments. These examples clearly illustrate the motivation for researching the proper anonymization of data. A summary of previous work in the field, and summaries of potential attacks against k-anonymity and their solutions, are also given.
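As a concrete reading of that guarantee, the following minimal sketch (with invented rows and an assumed quasi-identifier of ZIP code, birth year, and sex; none of it taken from the paper) checks whether a released table meets the k-anonymity requirement:

```python
from collections import Counter

def satisfies_k_anonymity(rows, quasi_identifier, k):
    """True iff every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical released rows; the quasi-identifier is assumed to be (zip, birth_year, sex).
release = [
    {"zip": "0213*", "birth_year": "196*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0213*", "birth_year": "196*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0214*", "birth_year": "197*", "sex": "M", "diagnosis": "flu"},
    {"zip": "0214*", "birth_year": "197*", "sex": "M", "diagnosis": "obesity"},
]
print(satisfies_k_anonymity(release, ("zip", "birth_year", "sex"), k=2))  # True
```

Any quasi-identifier group whose count fell below k would have to be generalized or suppressed further before release.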
While k-anonymity provides a nice theoretical model, a real-world implementation is made difficult by an assumption about what external data is available. To implement k-anonymity effectively, one must be aware of which portions of the private data are publicly available and could be used to identify subjects in the anonymized table. In practice this is extremely difficult and requires information-release policies (which are not addressed in the paper) to be in effect. In addition, information released using k-anonymity may be compromised if other identifying factors are made public in the future. Furthermore, k-anonymity cannot account for data that can be inferred from other sources. For example, if medical diagnoses were released under 2-anonymity, an entry might be linked to exactly two identifiable persons; if the diagnosis were obesity, this could possibly be used to link the data to a specific person based on physical characteristics. Future work in the area should address these problems.
=============================================================================
The paper entitled "k-Anonymity: A Model For Protecting Privacy" by Latanya Sweeney is an intriguing and well-organized report. It discusses a model for protecting the privacy of person-specific, field-structured data obtained from data holders such as hospitals and banks. Sweeney presents examples that show how anyone with access to public information lists, such as voter lists, can link to sensitive and private data, including medical records. This leaves the reader more interested in the problem the paper explores, since it appears to be a real-life issue that should be solved. The introduction is well presented, clear, and easy to understand. The example of finding out private information about the former Governor of Massachusetts catches the reader's attention and encourages him or her to read on with interest. Sweeney has researched the problem well, citing much past and current work on it. Sweeney also touches on the related area of multi-level databases, where the primary technique used to control the flow of sensitive information is suppression. This strengthens the case for why a new method is needed in this area of research, since suppression can reduce the quality of the data, perhaps "rendering the data practically useless" for research purposes. The author also argues that computer security alone is not sufficient for this particular privacy problem, since we must be aware of which values constitute a possible leak of information. This again strengthens the case for this type of privacy-protection model. The paper is filled with definitions followed by helpful examples, which make it easy to understand and follow. The author also structures the paper so that the actual definition of k-anonymity appears about three-quarters of the way in, which allows the reader to fully absorb the background information before reaching the main topic; k-anonymity is thus well explained and easy to understand at that point. In my opinion, once the attacks related to k-anonymity were described, the paper weakened in substance. I was left with many questions and also felt that the examples of the attacks could be better.
I was left with the following questions/comments:
- I am a little concerned about why the values of certain attributes in the table examples were changed to more general values (such as from "male" or "female" to "human"). Does this not degrade the information, leading to the same objection raised against suppressing data?
- The examples use simple tables. Larger and more complicated tables lead to more complex solutions. Is this model scalable?
- The model still leaves some vulnerability in that a set of k individuals may be pointed out. This means that a small k is not good. What would a "normal"-sized k be, and is it good enough?
- It seems there is still a lot more work needed in this research area.
=============================================================================
What are the contributions of the paper?
Many large databases contain confidential data that can be of great use to humanity. Naturally, privacy is important; even more so in the context of medical data, where privacy laws make doctors legally liable for breaches of confidentiality. Such doctors would thus be reluctant to release useful information into the public domain without the express consent of patients, and obtaining individual consent would certainly be time-consuming. This paper presents a means to provide large amounts of useful information in a way that ensures the information can be released without violating privacy. Specifically, information is presented in such a way that any identifiable attribute, or set of attributes that can be cross-linked with existing public data for identification (called a quasi-identifier), matches at least k entries in the data set. k is a parameter that controls the degree to which the data is private: one is assured at most a 1-in-k chance of a correct random guess when cross-referencing a database of personal names against the presented information.

What is the quality of the presentation?
Figure 3 is poorly placed. Since all the figures use the same example to illustrate different aspects, the fact that this table appears in Section 3 confuses the reader about its significance, as it blends in with the other sections; it is not until the bottom of the page, three paragraphs later, that Figure 3 is referenced. Also, in Figure 3 the construction of GT1 and GT2 from PT is not explained; the changes (erasing race and replacing it with "person", and especially wiping the last digit of the ZIP code) could have been illustrated with bold type over the modified components. Additionally, the data in the race column (white, black, asian, person) is not uniform: some races are identified by a colour and others by an anthropological label. Figures 2, 3, and 4 all represent the same sort of data, namely tuples in a database, yet they all have different formatting: Figure 2 has a bottom/right shadow, inter-column and inter-row lines, and a very lightly shaded header; Figure 3 has no inter-row or inter-column lines and an unshaded header; Figure 4 has inter-row and inter-column lines with darkly shaded headers. It would have been wise to use the same standard for all of these similar tables. Figure 1 illustrates the union of two sets, but the non-intersection text is light and harder to read; it would have been nicer to bold the intersection instead of fading the non-intersection. The horizontal rule at the top of page 14 is likely not needed.

What are the strengths of the paper?
The paper identifies that there is already a problem with the way confidential data is released into the public domain: anyone could take a list of ZIP/birthday/gender information and cross-link it with the public voter registry to resolve 87% of Americans. Since medical information is already released with this information, the paper exposes an existing major leak in privacy, giving urgency to the problem it discusses. It models a real-world setting in which appropriate privacy laws constrain work that is good for humanity. The model of having each row correspond to a single person, with some collection of attributes forming a means of identifying them and the remaining attributes carrying the released content, is useful to implement. Often some of the data collected does not influence the intended data mining; for instance, birth year, i.e. age, may be useful to scientists, but birthday, i.e. zodiac sign, is useful only to astrologists. Such data can then be removed, increasing the number of people who share similar characteristics. The solution the paper presents is therefore useful; moreover, not all columns need to be removed, merely that some entries must have their specificity reduced to ensure they have more matches.

What are its weaknesses?
- It does not state explicitly that some data can be reduced to a "wild card" (although it does illustrate this in Figure 3, where an entire column is reduced to a base wild card). For instance, by setting the race of four entries to "person", a match on race alone is guaranteed at least 4-anonymity, because the "person" entries match both categories. By taking only k results, removing all the identifiers, and setting them all to wild cards, we already obtain a lower bound of k-anonymity.
- The solution it suggests for ensuring data is sufficiently duplicated is to remove information from some columns, such as removing race information for some participants or removing ZIP-code information. This may remove useful data from the analysis instead of ensuring that the sample size for each match is naturally greater than or equal to k.
- When a dataset is released into the public domain, the choice of which quasi-identifier columns to reduce is fixed, and all future updates or releases are constrained by that original release; a poorly planned release imposes lifelong constraints, yet no suggestions are given on how such columns should be chosen.
- It does not mention the following weakness: a set of quasi-identifier values may match at least k records in the data table, but if all k records have the same associated data, i.e. for medical history they all have the same symptoms and diseases, then someone who knows a person is on the list can match that person's partial record, find k people who all have the same disease, and conclude that the person has the disease, which is an information leak.
- No algorithm is presented to create a k-anonymous table from a data set and a list of quasi-identifiers.

What is some possible future work?
- Perhaps there is a method to map the data isomorphically into another dataset that retains all statistical information but whose inverse mapping is computationally hard to compute. That way, privacy is guaranteed by mathematically hard problems instead of by removing certain aspects of the data.
It would also allow updating without worrying about cross-linking with the old set, although the difference between the two transformed sets may leak data. Such a mapping might involve probability, where a partial unit of one disease is spread over a variety of people such that in the end everything weights and sums to the same result, yet an observer cannot determine what specifically each person has even if they were to link names.
- Each value has levels of specificity; for instance, an address could be specified by country, province, city, street, and so on. Perhaps an algorithm could look at the number of people who match a quasi-identifier at one level and, if that count falls below the threshold k, bump the values up to the next level of generality. There could be nodes in a tree that branch into more specific values at lower levels, and each tuple set to a higher node must be considered a candidate for any of the nodes beneath it. An algorithm that sets tuple data to the appropriate level can thus ensure that a minimal reduction in specificity occurs when setting the levels of the quasi-identifiers in each tuple.
- Perhaps there is a way to provide the information through a mechanism that permits people to analyse parts of it without compromising privacy. Such a system would interpret individual requests and limit the output to what would not compromise privacy with respect to the other requests already made. It might allow people to run analysis methods over the entire data set while the system returns only results rather than raw data, with the results guaranteed to protect privacy. In this way, the person who wishes to examine the data provides an algorithm to run over the data, and the data server executes it after ensuring it is not a data leak.
- A method to revoke data that has been provided, or to change the methodology of release, in case of poorly designed release patterns that lack the usefulness required of the full system.
=============================================================================
This paper presents the k-anonymity protection model, which alters the released information so that it maps to many possible people (at least k) in order to thwart linking attacks. The author also explores attacks related to this method and provides ways in which these attacks can be thwarted. The presentation of the paper is not very good: the author spends half of the paper on background and related research before getting to the "real stuff", and the presentation is sometimes unclear; for example, the symbol PT appears before its definition, which may confuse readers. No possible future work is mentioned in the paper, and I think that in order to show the model is really good, the author should do more comparison between this model and other known methods. The model presented in the paper can do a good job of preserving privacy: the larger k is, the better privacy is preserved. The author has done a good job exploring related attacks and providing ways in which they can be thwarted; these efforts help make the model more applicable in practice. However, if k is too small, little privacy is preserved. This does not mean that a larger k always makes the model better: too much anonymity may make the data useless.
In fact, even when k is very small, the usability of the data can be badly damaged by this model, because the data that is most useful may be involved in the quasi-identifier and thus may not be released intact under this model. A survey could be done as future work to find out what most people consider a proper k, and to measure the usability of this model.
=============================================================================
The main contribution of this paper is a model for protecting personal information when data holders release data. More precisely, the model makes the information about an individual contained in the released data indistinguishable from that of at least k-1 other individuals who also have data in the release. Another interesting contribution of the paper is the set of attacks that can be performed against k-anonymity (the unsorted matching attack, the complementary release attack, and the temporal attack). In general, the paper is very well presented, but the author should be careful when using abbreviations; for example, on page 8 the term PT is used without mentioning what it means. I would also suggest that the author be more detailed when defining a "quasi-identifier", because this definition is used intensively in the rest of the paper. Presenting several examples was also helpful, improving the understanding of the paper. The main strength of this paper is that it presents a comprehensive study of privacy protection in data releases; if k-anonymity is used in a real-world application, the paper can also serve as a reference for preventing certain attacks. In my opinion, a weakness of this paper is that it should consider a real-world (and large) database and show how to use k-anonymity to prevent data disclosure. That way, the author could show how to create a search space of size k, so that it is hard (or even infeasible) to find a person from his or her data. Finally, some future work that I would consider is how to efficiently identify quasi-identifiers in a huge database. In small tables this can be done easily, just by inspecting the table; however, when the database is quite big, one would need methods and tools to perform this efficiently and still guarantee that the chosen set of quasi-identifiers will not lead to data disclosure. Another interesting line of future work would be to determine how big k must be so that it becomes hard (or even infeasible) to find someone from the released data. From this, two questions arise (and should be answered in future work): Would it be necessary to adopt safety margins? Would it be useful to consider levels of anonymity?
=============================================================================
Due to the exponential growth in the number and variety of data collections containing person-specific information, and the demand to ultimately release this data for research purposes, there is a strong need to protect the privacy of the individuals involved. This paper presents a very simple yet effective model to address this problem without compromising the usefulness of the data itself. The author proceeds very systematically, first identifying the issue and providing a concrete example of how an individual can be re-identified from supposedly anonymous data by linking it with easily available external data sources. She then establishes the scope of and need for the work by laying out similar work being done in other areas and briefly pointing out its shortcomings.
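To make that linking step concrete, the following small sketch (with invented records and attribute names, not taken from the paper) joins a de-identified release with a public list on an assumed quasi-identifier of ZIP code, date of birth, and sex:

```python
# Hypothetical linking sketch: join a "de-identified" release with a public
# list on the quasi-identifier (ZIP, date of birth, sex); a unique match
# re-identifies the released record.

medical_release = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-02-14", "sex": "F", "diagnosis": "asthma"},
]
voter_list = [
    {"name": "J. Doe", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "A. Roe", "zip": "02141", "dob": "1970-10-02", "sex": "F"},
]
QI = ("zip", "dob", "sex")

def link(release, voters, qi=QI):
    """Return (voter name, released row) pairs whose quasi-identifier values match."""
    by_qi = {}
    for v in voters:
        by_qi.setdefault(tuple(v[a] for a in qi), []).append(v["name"])
    return [(name, r) for r in release
            for name in by_qi.get(tuple(r[a] for a in qi), [])]

print(link(medical_release, voter_list))
# -> the hypertension row is matched uniquely to "J. Doe"
```

A unique match such as this is exactly the re-identification the model is meant to prevent; under k-anonymity the join would return at least k candidate names for every released row.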
Next the author presents the actual k-anonymity model, providing the definitions needed to comprehend the work along the way. Finally, three possible attacks on the model and ways of defending against them are discussed. The model is very intuitive and easy to understand. The paper proceeds in more or less plain English without much complex jargon, providing examples at each step for further clarification. At times, however, the author makes deliberate attempts to keep the discussion as simple as possible, which leaves an impression of incompleteness in the reader's mind. While listing the risks and difficulties on the data holder's part, the author refers to contracts and policies that can provide complementary protection but does not give any example of what they might look like. The examples provided are very simple and limited to protecting a person's identity. Furthermore, the author makes extensive assumptions that make the solution work only for known attacks; in addition, identifying the quasi-identifier itself in real-world data can pose serious challenges. Future extensions to this work could include making the model more robust with fewer assumptions on the part of the data holder. For example, k-anonymity works only under the assumption that the data holder can predict with reasonable accuracy which other external data sources exist and which of their attributes could potentially be linked with the released data to re-identify sensitive information. What if this assumption did not hold? Another way to improve the model might be to identify more attacks that can be launched against it and then incorporate measures to prevent them.
=============================================================================
What are the contributions of the paper?
The paper contributes a model for protecting privacy when sharing and exchanging personal data between agents and organizations. It also contributes a framework for working on algorithms and systems that release information without revealing properties of the entities that are to be protected.

What is the quality of the presentation?
The presentation is clear and well organized.

What are the strengths of the paper?
The k-anonymity model is straightforward and effective. The paper identifies some good attacks on the k-anonymity model and also suggests appropriate solutions to those attacks.

What are its weaknesses?
The paper claims that the k-anonymity model prevents individual re-identification while the data remain practically useful; however, the usefulness of the data really depends on the context, so in a specific situation the k-anonymity model might destroy the usefulness of the data. The k-anonymity model is not proven totally secure, and there may still exist other possible attacks on it.

What is some possible future work?
We might want to know the practical trade-off between the k-anonymity model and the usefulness of data that adheres to it. The k-anonymity model prevents only data-linking attacks; we might need to develop a different model, or one based on k-anonymity, to prevent more kinds of attacks. We might also want to know whether the k-anonymity model can be applied in computer security in general to protect personal privacy.
=============================================================================
* What are the contributions of the paper?
This paper addresses the problem of releasing person-specific data for scientific research without compromising the privacy of the individuals who are the subjects of the released data. It provides a formal protection model, k-anonymity, in which the information of a subject cannot be distinguished from that of at least k-1 other subjects in a release. It also deals with some known attacks on anonymizing systems, specifically against the introduced k-anonymity model.

* What is the quality of the presentation?
The paper is informative and explores the background of the problem with real-life examples. Figures are added to provide visual information, but Figure 1, "Linking to re-identify data", is unnecessary. Also, a few rudimentary definitions, such as that of attributes, could have been avoided.

* What are the strengths of the paper?
It provides a formal way to strike a balance between data release and privacy concerns, keeping the released data from being useless while providing a user-defined level of anonymity.

* What are its weaknesses?
Implementation techniques for k-anonymity are missing, and only a few known attacks are considered. With increasing values of k, the technique might become worse than the older results the author refers to. A practical example of a k-anonymity system could have strengthened the claim.

* What is some possible future work?
A practical implementation of the system is necessary to strengthen the claim, as is automatic identification of quasi-identifiers.
=============================================================================
What are the contributions of the paper?
The paper's primary contribution is the k-anonymity method for anonymizing sensitive, publicly released data. It provides a formal description of the method and examples of its use. It also enumerates several possible attacks that could be used to infer sensitive data from k-anonymous releases and provides additional procedures to follow to ensure that these attacks cannot be used. In addition, the possible attacks identified could be useful as a basis for evaluating other protection models.

What is the quality of the presentation?
The presentation of the information in the paper is of high quality. The motivation, the background, and the method itself are all explained clearly. In particular, the method is described both in precise mathematical notation, useful for theoretical analysis, and in ordinary language. Concrete examples are used to good effect throughout the paper, making it all the easier to understand the definition and applications of the model.

What are the strengths of the paper?
The method provided by the paper is simple but effective, and can allow the release of data that is reasonably anonymous but still useful for analysis. It is presented clearly and convincingly. The use of examples, for motivation and demonstration, is particularly effective for explaining the model.

What are its weaknesses?
While the k-anonymity method is theoretically sound, a few assumptions made in the paper may reduce its effectiveness in practical situations. In particular, the quasi-identifier, that is, all the attributes in the private information that could be used for linking with external data, must be identified by the data holder, which may not always be possible or practical.

What is some possible future work?
Since the anonymity guarantees provided by the model rely on identification of the quasi-identifier, future work could explore methods for making this task easier and more practical for data holders.
Similarly, since the method presented protects against known attacks, more work could be done to identify potential avenues of attack. Also, research could be carried out on real data released using this model to determine whether it is effective in practice, by searching for unanticipated linkage attacks.
=============================================================================
The paper proposes a formal protection model, k-anonymity, for constructing and evaluating systems in which private data is protected. The framework provides a guarantee on the anonymity of the data when the stated assumption is satisfied. The paper also addresses realistic attacks that such a model cannot protect against. The quality of the presentation is clear and concise, except that the table naming gets a little confusing in Section 4. The assumption of identifying the quasi-identifier can be a little too strong in practice, as each data source may not have enough knowledge about all other available data to accurately identify all quasi-identifiers. As mentioned in the paper, the model still exhibits a certain degree of vulnerability. As future work, it would be interesting to extend the current model to capture characteristics of the attacks mentioned, so that better guarantees can be provided on the anonymity of private data.
=============================================================================
The author proposes a k-anonymity model to protect a person from being re-identified when a data holder, such as a hospital or insurance company, releases data for legitimate purposes. First, the author defines the quasi-identifier, the set of attributes available to attackers, and requires that every sequence of quasi-identifier values in the released data appear at least k times. The author briefly discusses previous work, then provides a formal framework for constructing systems under the k-anonymity model, and finally describes different attack scenarios against k-anonymity. The presentation seems to me to be of high quality. One of the weaknesses I can see is that the author addresses only tabular privacy protection.
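As a rough sketch of how a data holder might actually reach that k-occurrence requirement, the following illustrative code generalizes quasi-identifier values along simple hand-written hierarchies until every combination occurs at least k times (the hierarchies, attribute names, and rows are assumptions for illustration, not the paper's algorithm):

```python
from collections import Counter

# Illustrative only: coarsen quasi-identifier values level by level until
# every combination of quasi-identifier values occurs at least k times.

def generalize_zip(z, level):
    """Level 0 keeps the full ZIP; each further level stars out one more trailing digit."""
    return z if level == 0 else z[: max(len(z) - level, 0)] + "*" * min(level, len(z))

def generalize_year(y, level):
    """Level 0 keeps the year, level 1 keeps the decade, level 2 suppresses it."""
    return y if level == 0 else (y[:3] + "*" if level == 1 else "****")

HIERARCHY = {"zip": generalize_zip, "birth_year": generalize_year}

def k_anonymize(rows, qi, k, max_level=4):
    """Apply the same generalization level to every hierarchy attribute and
    return the first view that satisfies the k-occurrence requirement."""
    for level in range(max_level + 1):
        view = [{a: (HIERARCHY[a](r[a], level) if a in HIERARCHY else r[a]) for a in r}
                for r in rows]
        counts = Counter(tuple(v[a] for a in qi) for v in view)
        if all(c >= k for c in counts.values()):
            return view, level
    return None, None  # not reachable with these hierarchies

rows = [
    {"zip": "02138", "birth_year": "1961", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": "1964", "sex": "F", "diagnosis": "flu"},
    {"zip": "02141", "birth_year": "1972", "sex": "M", "diagnosis": "flu"},
    {"zip": "02142", "birth_year": "1975", "sex": "M", "diagnosis": "obesity"},
]
view, level = k_anonymize(rows, ("zip", "birth_year", "sex"), k=2)
print(level)  # 1: ZIPs become "0213*"/"0214*", years "196*"/"197*", giving 2-anonymity
```

A more careful implementation would generalize each attribute independently and search for the view that loses the least specificity, rather than applying one uniform level, in line with the minimal-distortion concern raised in the reviews above.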