GeoJitter:
NinaFiore1Emailcarolyn.b.fiore.mil@army.mil
SebastianNeumann1Emailsebastian.a.neumann.mil@army.mil
GeoffreyMoores2Emailgeoffrey.a.moores.mil@army.mil
1A
A
A
Department of Mathematical SciencesUnited States Military AcademyWest PointUSA 2Department of Electrical Engineering and Computer ScienceMilitary AcademyWest PointUnited States, USA
3† B.S. in Mathematical Sciences and Computer ScienceUnited States Military Academy2025West PointUSA
Nina Fiore1, Sebastian Neumann12†, Geoffrey Moores2
1Department of Mathematical Sciences, United States Military Academy, West Point, USA
2Department of Electrical Engineering and Computer Science, United States Military Academy, West Point, USA
†B.S. in Mathematical Sciences and Computer Science, United States Military Academy, West Point, USA, 2025
Corresponding Author: Nina Fiore
Emails: carolyn.b.fiore.mil@army.mil, sebastian.a.neumann.mil@army.mil, geoffrey.a.moores.mil@army.mil
ABSTRACT
This paper introduces the Python package GeoJitter, which provides region-aware randomization for spatial location of nodes. This package aids in preventing reidentification of private or proprietary information or spatial display of simulated, predicted, or uncertain information. The library is compatible with Networkx and standard shapefiles and will provide tiling solutions if region data is unknown or unavailable. We compare this technique to Census Enumeration Area, Radius, K-Nearest Neighbor (KNN), Degree Preserving, Stochastic Block Model (SBM), and Markov Chain Monte Carlo (MCMC) randomization.
Keywords:
Network randomization
privacy
GIS
anonymization
1. INTRODUCTION
Network visualization is essential for collaboration, communication of research, education, and more. Frequently, datasets include either overly precise location information that poses privacy risks or incomplete geographic data that hinders meaningful analysis. Existing anonymization techniques often compromise structural network properties or struggle in regions with atypical population density. We introduce an algorithm and companion Python package, GeoJitter, that addresses these challenges by enabling privacy-preserving transformations and uncertainty-aware modeling for spatial networks. The name, GeoJitter, denotes the use of random perturbations (“jittering”) to a node’s geospatial location, preserving the structural integrity of the original network while either obfuscating sensitive coordinates or generating plausible spatial configurations from uncertain inputs.
This package is optimized for cases in which nodes must be assigned a precise geospatial location within user parameters. To date there is not a tool that easily generates randomized location data for existing nodes using GIS and shapefiles. Whether researchers possess real-world location data that must be obscured for privacy or proprietary reasons or wish to generate this data to test a model or run simulations, GeoJitter provides the utility of controlled spatial randomization that is often impractical with existing tools.
In this paper, we show that GeoJitter provides location shifts within user parameters, allowing the researcher to specify parameters to meet anonymity requirements for their specific network data. We identify two main use cases for the tool: (a) Location data exists but must be obscured (jittered) to protect private or proprietary information; and (b) Location data is imprecise or missing and visualization or simulation would benefit from distributing the nodes across their regions. Base Networkx plotting tools do not maintain spatial relationships between nodes and regions, distorting the physical distances within the network.
We compare our algorithm to current best practices in terms of privacy metrics, spatial obfuscation methods and network obfuscation methods. Differential privacy and k-anonymity are current privacy metrics in research and industry. Differential privacy (DP) is big data focused and less applicable to this technique; GeoJitter does not provide DP guarantees, but DP algorithms can be run in conjunction with GeoJitter without issue as desired. k-anonymity will be discussed in the background section; it traditionally uses a radius-based randomization, but GeoJitter’s methods guarantee the researcher k-anonymity within their specifications. For spatial obfuscation we compare GeoJitter to Census Enumeration Area, Radius, and K-Nearest Neighbor (KNN) randomization. For network rewiring methods we consider Degree Preserving, Stochastic Block Model (SBM), and Markov Chain Monte Carlo (MCMC) randomization.
The remainder of this paper is structured as follows: first, we provide an overview of relevant research and techniques. We then address the primary algorithms and functions of our package and the major use cases. The discussion section addresses comparison of our techniques with similar available alternatives, technical aspects of the algorithms, limitations and future extensions, with emphasis on ethical considerations relating to privacy and anonymity.
2. BACKGROUND
2.1. K-ANONYMITY AND DIFFERENTIAL PRIVACY
Privacy-preserving data publishing is an expansive body of work consisting of domain-specific tools across many fields. Broadly, these tools and studies seek to provide rigorous methods for selectively protecting personal information from datasets while maximally communicating relevant characteristics. A common assumption is that smoothing or coarsening data increases information security, but these techniques do not completely prevent data recovery. Brownstein et al. investigate how publicly released, low-resolution disease incidence maps—intended to protect individual privacy—can still be reverse-engineered to approximate the original patient-level geographic data [1]. The authors introduce an unsupervised classification technique that infers likely point sources of disease cases by analyzing visual patterns in smoothed maps. Despite the aggregation and smoothing (techniques meant to anonymize data), the method from Brownstein et al. can estimate the original geographic coordinates of cases with surprising accuracy—from a presentation quality map, 99.8% of addresses were within 70 meters and from a publication quality map, all addresses were within 14 meters of the predicted location [1]. Studying human mobility data collected from phone tower, de Montjoye et al. achieve similar results, finding that 95% of the 1.5 million person dataset could be uniquely identified using only four geolocated points [2]. We find two main foci in the research for creating robust privacy protection: k-anonymity and DP.
In her foundational work on k-anonymity, Sweeney demonstrated that seemingly innocuous attributes to uniquely identify a large fraction of a dataset’s individuals when combined [3]. If any individual record within the data is indistinguishable from k-1 others in the data set, then the data set is considered k-anonymous. Because location data is a highly effective quasi-identifier, even coarsened information such as home ZIP code or neighborhood can drastically increase reidentification risk. The uniqueness of human movement patterns or residential areas, especially in sparsely populated regions, makes location a potent means of re-identification, particularly when aggregated to other (often publicly available) data.
Furthermore, aggregation increases privacy concerns. In a field study, Majeed and Lee note that despite the efforts of many disciplines to protect individual privacy, aggregation of datasets increases the risk of re-identification [4]. They also note the heightened risk of re-identification due to the cycle whereby more detailed data informs the techniques used to extrapolate private data, particularly identity theft, user profiling, and social engineering.
DP is a technique intended to guarantee probabilistic privacy within large datasets by ensuring that the output of computations on a dataset do not materially change if any one record is removed or included, and that risk of harm to the individual is not increased by inclusion in a particular database [5]. Unlike k-anonymity, which addresses data protection at the dataset level, DP addresses it at the computational output level, ensuring that algorithmic results are consistent across similar databases. As such, DP is not an appropriate benchmark for the functions of the GeoJitter library. However, if deemed necessary for a given use case, DP algorithms could be used in the pipeline to ensure this privacy guarantee for other relevant attributes within the data.
2.2. SPATIAL OBFUSCATION TECHNIQUES
There does not yet exist a definitive method to guarantee spatial anonymization, but some methods do exist to mitigate re-identification risk. In their proceedings on spatial anonymization, the United Nations Statistical Commission considers census regions as coarsened location surrogates. Census regions vary in resolution depending on geographic region and data source but are generally considered sufficient to serve as a basis for anonymization [6]. Their proposed method involves aggregating multiple adjacent districts into larger composite areas and reporting only the coarsened geographic information, thereby increasing the uncertainty surrounding any individual’s precise location. To satisfy the k-anonymity criterion, regions are combined to create a new region that contains at least k individuals.
Nevertheless, this method exhibits significant limitations in regions with nonuniform population densities. In sparsely populated rural areas, achieving k-anonymity may require the combination of numerous EAs over a large geographic expanse, producing regions so diffuse that the released data loses much of its value for localized spatial analysis.
Radius-based anonymization methods have been proposed to address these challenges, wherein the radius for anonymization is dynamically adjusted according to local population density: smaller radii are employed in densely populated urban centers and larger in sparsely populated regions [7]. Although such adaptive methods improve the balance between privacy preservation and data utility, they introduce additional complexities. In rural contexts, the radii required to satisfy k-anonymity may extend across tens or hundreds of kilometers, resulting once again in overly coarse spatial representations. Furthermore, the implementation of variable-radius approaches often necessitates access to auxiliary demographic information to appropriately calibrate thresholds. This may itself pose privacy risks if improperly managed.
The KNN algorithm can be used for spatial anonymization by designating a random location within a region determined by a node’s
k neighborhood, but this is substantially radius-based anonymization with radii determined by
k neighborhood. Avraam et al. in their 2021 paper used a similar technique, standardizing continuous variables with z-score transformations to find clusters for each node and its
k-1 nearest neighbors, plotting the cluster’s centroid as the anonymized location before reversing the transformation [
7]. Linderman et al. explored KNN connections to generate sparser connections while maintaining a giant component and ~
log log
edges with comparable connectivity, which is a non-spatial randomization technique that may yield privacy benefits in combination with other methods [
8].
2.3. NETWORK OBFUSCATION TECHNIQUES
Network randomization techniques are often used to mitigate the burden of data collection. Randomization and creation of ensembles enable analysis of various attributes without requiring modeling tens of existing networks. The cost of these randomizations is loss of granularity or changes to original attributes.
Degree Preserving Randomization, commonly accomplished by random edge swaps [9],. provides a degree of anonymity for nodes because they cannot be traced by their relationships with their neighbors. The simple elegance of the algorithm and its ability to maintain degree distribution make it an extremely useful tool for analysis and simulation; however, it does not in isolation provide k-anonymity, differential privacy, or another privacy guarantee. Other common rewiring algorithms work similarly, either discarding much of the original structure or leaving enough to risk reidentification.
Stochastic Block Modeling (SBM) randomization provides a statistically sound and extremely generalizable approach to network analysis, nesting different levels of network representation to perform different analyses while preserving the overall community structure [10]. An SBM model of a given network can produce a randomized graph that shares community structure, including intra- and inter-region edge probabilities, with the original. This prevents loss of community relationship information, scales well to networks of all sizes, and is primarily used for community detection. While the nested models of a hierarchical SBM do provide anonymity to individual nodes, the overall structure is reduced to some level of core-periphery representation, which can be detrimental to analyses that do not relate to community structure.
In 1986 Frank and Strauss used Markov Chain Monte Carlo (MCMC) estimation with Markov graphs to simulate networks with dependence between edges and conditional independence properties [11]. In 2002, Snijders proposed algorithmic improvements applying the concept to the exponential random graph model, increasing applicability of the technique by enabling convergence on graphs with multimodal distributions on sufficient statistics [12]. This can be either simulation, the generation of a MCMC network with given parameters, or estimation, determining the best fit parameters for observed network data. For our purposes, we can assume that to protect data from reidentification one might first estimate the parameters of the real data and then simulate a representative network. As Snijders notes, MCMC algorithms do not always produce satisfactory results, which limit applicability, but also require transformations from the original to a ‘very different graph’ [12].
3. METHODS
The minimum user input to GeoJitter is a Networkx graph. Shapefiles for desired regions increase specificity of output but require the network to have an attribute identifying the region name of each node. If no region files are specified, the algorithm creates a configurable tiling of rectangular regions to jitter the spatial locations.
3.1. ALGORITHM
The Jitter algorithm accepts the network and applicable shapefiles as well as identification of either the existing geospatial coordinates or the nodes’ region attribute. It creates an empty graph, jitters the node locations using the Point Generator algorithm before appending the edge list from the original network. The Jitter algorithm is expected to run in
time for regular region structures, or
in the worst case, where
is the number of nodes and
is the maximum number of sides to a region’s polygonal approximation. When node attributes do not include region designation, time is increased by a factor of
, the number of regions (or tiles if applicable).
where
is an Nx graph with attributes for location and/or region name,
(optional) is a set of shapefiles,
(optional) identifies and formats node coordinate attribute, and
identifies node region attribute.
Initialize
as an empty Nx graph
For both shape file and tile region designation, the Jitter algorithm calls our Point Generator algorithm, which defaults to the Dartboard method to designate the new geospatial position for each node. This method identifies a node’s region and the corresponding bounding box, then selects a point at random within the bounding box. To check for coordinates in region requires solving the Point-In-Polygon problem, for which we use the method outlined by Sutherland et al [13]. For a tiling solution, this point will always be in the desired region; when using shapefiles the algorithm checks if the solution is inside the region. If true, the solution is returned as the new node location attribute; if not, it reselects.
where
is the respective shapefile/ tile,
is the user-selected distribution for random selection, and
identifies when the algorithm switches methods from Dartboard to Triangulate.
of the region’s shapefile or tile
Initialize
using the Triangle library [
14]
Initialize tile
using percentage of total region area
random choice of
tiles using
independent, random numbers
Else do
If the algorithm is set for regions and fails to converge in the specified maximum iterations (by default, 50), it switches to the Triangulation method. Using the Triangle library [14], the shape file region is transformed into a polygon and subdivided into triangle tiles. Then, a random triangle is selected; using the area formula a random point inside that triangle is selected to be the node’s jittered location. This guarantees an acceptable solution in one iteration but is computationally more intensive than the Dartboard Algorithm and slower by a factor of s, where s is the number of sides in the polygonal representation of the region. This uses equations outlined in Osada [15] when the distribution is set to uniform. Currently, in the non-uniform case, failure of the Dartboard method raises an error. Application across subcomponent triangles of the region mesh will not result in the appropriate distribution over the region. We hope to address this functionality in future updates.
To set up the architecture for tiling, the algorithm intakes user-specified arguments for the m x n desired solution or defaults to a 10 x 10 grid. The bounding box of this grid is formed by the minimum and maximum latitude and longitude, which is then subdivided as specified (see Fig. 1 for an example of a 10 x 10 tiling solution on the generated Boston toy network). Each node is then jittered within its assigned tile. This ensures only one perturbation per node when selecting random points within the rectangular tiles.
4. DISCUSSION
4.1. CUSTOMIZATIONS
Regional shapefiles are often available at multiple levels of granularity from as fine as census blocks or neighborhoods to nations or global shapefiles that display continents or coastlines. However, areas with uneven population density will often maintain unevenly sized regional designation areas for ease of administration; in the United States, Alaska and California demonstrate this characteristic. Sparsely populated regions can be over ten times the size of the small regions of dense urban areas. Additionally, shape file regions tend to be predicated on human population density, which may not be germane to the user’s network. Here, the user has a choice. If unsatisfied with the applicable regions, they can use the tiling regions to jitter the network. The input parameters of the function allow specification of a usefully sized grid of tiles for the given situation.
The package defaults to uniform joint probability of latitude/longitude selection but allows specification of a joint probability surface within the regions. To an extent, the selection of region shapes influences location probability; the regions are conditional on population, which is reflected in node density and the possible jitter distance. Nodes in highly populated urban areas will tend to have small regions with high population and will have shorter jitter distances, which does reflect the probability for location in dense vs. sparse areas. However, if the user wishes to apply a specific distribution (e.g. a joint normal distribution), they may adjust the probabilities accordingly.
4.2. USE CASES
GeoJitter enables the user to adjust its parameters based on the specific requirements of their project. The established spatial or network randomization techniques do not provide the same utility: a flexible location randomization technique that improves network visualizations. The limited scope of the tool also makes it easier to fine-tune results. Common network randomization methods disrupt or mask the significant structure of the original network. While this does support privacy, visualizations of these representations may fail to demonstrate key results from analysis.
We anticipate utility to come from protecting information privacy from reidentification and from working with uncertainty within data. In the first scenario, our package enables privacy-preserving transformations that retain the structure of the underlying network while obscuring sensitive location data. In the second, users may only have access to simulated, approximate, or uncertain location data, often in the form of bounded regions or probabilistic estimates, and our package provides tools for constructing and working with spatial graphs in the presence of location uncertainty. We anticipate two use cases for our package: too much location data, and not enough location data.
Precise geographic coordinates can compromise the anonymity of study participants; including visualization randomization in the study design reduces privacy risks such as those analyzed by Institutional Review Boards and ethics committees. Even in the absence of explicit identifiers such as names, quasi-identifiers (i.e. height, weight, gender, or ethnicity) can be cross-referenced with external datasets to enable re-identification. These linkage attacks are particularly damaging with the combination of multiple, individually innocuous, datasets – increasingly accessible by publication or scraping [16].
In scenarios such as disease transmission modeling, military logistics, and search and rescue operations, the exact locations of key nodes are often unknown. Instead, analysts work with zones of probable location, relying on partial or indirect evidence to infer spatial distributions. When considering the dependencies and conditional probabilities incurred within a network, traditional spatial graph analysis can be unreliable or misleading. Our library provides tools to formally incorporate spatial uncertainty, allowing users to define uncertainty regions associated with each node. This enables the network analysis to reflect the imprecise nature of the underlying data without discarding relational structure.
In the first use case, the network contains precise location data attributes where clear communication of results does not necessitate, or where there is a compelling need to obscure, the location due to privacy or proprietary concerns. After determining an appropriate set of regions, GeoJitter is used to coarsen the shared information. Figure 3 shows an example of a synthetic toy network generated with only region names and the GeoJitter location output. This example uses Boston neighborhood shapefiles published by the City of Boston [17].
In the second use case, the network contains region identifier attributes without precise location (e.g. ZIP code). If the network is synthetic or uses predicted location, regions are assigned upon creation either randomly or based on known probability distributions, often for simulation. Standard plotting tools will overlay all such nodes in a region directly in the center of that region, creating the appearance of a single node with the connectivity of the entire region’s node, as if coarse-graining the entire region to one representative node. In cases where this is undesirable, GeoJitter can be applied to the network and the geospatial perturbation will clarify the internodal relationships during visualization. In Fig. 3 above, 2 node pairs (2–3 and 7–8) share regions but the Point Generator algorithm provides random spacing according to the set distribution.
4.3. LIMITATIONS
Because the original network structure (degree distribution, motifs, communities, spanning trees, shortest paths, etc.) is unchanged, it is difficult to rigorously quantify the changes in a meaningful way beyond a geographical change in edge lengths. Based on aggregational reidentification risk, this continuity of structure means that researchers must be cautious in determining how to protect nonspatial network attributes.
4.4. FUTURE EXTENSIONS
To allow more flexibility in networks where spatial distances are unevenly distributed, we would like to add functions that would allow the user to specify an input tuple array indicating the desired
tiling for each region. This would enable application of region-specific parameters for appropriately sized tilings across the network. In addition to specifying the number of tiles desired within the network’s bounding box, we would like to allow the user to specify the size of the tiles (e.g. 5km x 5km or similar). We are also developing methods that agglomerate regions based on the number of nodes within regions, thereby adjusting for network node density in addition to the regional population density metrics. Currently, the tiling option only supports rectangular tiles. It is possible to support any regular tiling but current exploration has not indicated utility in adding other tiling options.
Though the distribution argument currently supports use for a number of known statistical distributions, we have built in an argument to support extension to allow the user to define a 3-dimensional surface representation of location probability. This would enable users to fully specify any distribution of choice. We plan to incorporate full probability surfaces as user defined distributions in future implementations, providing maximum control over generated networks
5.3. ETHICS APPROVAL
Not applicable.
A
A
Data Availability
Code and examples are available at [https://github.com/SebastianNeumann2003/GeoJitter/](https:/github.com/SebastianNeumann2003/GeoJitter) .
A
Author Contribution
N.F. was responsible for conceptualization, formal analysis, supervision, and wrote the main manuscript text. N.F. and G.M. designed the methodology and conducted formal analysis and validation. S.N. prepared figures 1-3 and S.N. and G.M. worked jointly on the software. All authors reviewed the manuscript.
A
Acknowledgement
Unstructured finite element meshes were created with Triangle version 1.6 (downloaded from http://www.cs.cmu.edu/~quake/triangle.html).
6. REFERENCES
1.Brownstein JS, Cassa CA, Kohane IS, Mandl KD. An unsupervised classification method for inferring original case locations from low-resolution disease maps. Int J Health Geogr. 2006;5(1):56. https://doi.org/10.1186/1476-072X-5-56/.
2.de Montjoye YA, Hidalgo C, Verleysen M, Blondel V. Unique in the Crowd: The privacy bounds of human mobility. Sci Rep. 2013;3:1376. https://doi.org/10.1038/srep01376/.
3.Sweeney L. k-anonymity: A model for protecting privacy. Int J Uncertain Fuzziness Knowledge-Based Syst. 2002;10(5):557–70.
4.Lee S, Majeed A. (2021) Anonymization techniques for privacy preserving data publishing: A comprehensive survey. IEEE Access. https://ieeexplore.ieee.org/document/9298747/
5.Dwork C, Roth A. (2014) The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (Aug 2014), 211–407. https://doi.org/10.1561/0400000042/
6.United Nations Statistical Commission. (2021) Spatial anonymization. In Statistical Commission: Fifty-second session. https://unstats.un.org/unsd/statcom/52nd-session/documents/BG-3l-Spatial_Anonymization-E.pdf/
7.Avraam D, Butters O, Wilson R, Burton T, Nicolaides C, Jones E, et al. Privacy preserving data visualizations. EPJ Data Sci. 2021;10:2. https://doi.org/10.1140/epjds/s13688-020-00257-4/.
8.Jaffe A, Kluger Y, Linderman GC, Mishne G, Steinerberger S. (2020) Randomized near-neighbor graphs, giant components and applications in data science. J Appl Probab. 2020;57(2):458–476. https://doi.org/10.1017/jpr.2020.21/. Epub 2020 Jul 16. PMID: 32913373; PMCID: PMC7480951.
9.Maslov S, Sneppen K. (2002) Specificity and Stability in Topology of Protein Networks. Science. 2002;296(5569):910-3. https://doi.org/10.1126/science.1065103/. PMID: 11988575.
10.Peixoto T. (2014) Hierarchical Block Structures and High-Resolution Model Selection in Large Networks. Phys. Rev. X 4, 011047–24 March, 2014. https://doi.org/10.1103/PhysRevX.4.011047/
11.Frank O, Strauss D. (1986) Markov graphs. J. Am. Statist. Ass. 1986;81:832–842.
12.Snijders T. (2002) Markov Chain Monte Carlo Estimation of Exponential Random Graph Models. Journal of Social Structure. 2002 Jun.
13.Sutherland IE, Sproull RF, Schumacker RA. A characterization of ten hidden-surface algorithms. ACM-CSUR. 1974;6(1):1–55. https://doi.org/10.1145/356625.356626/. March 1974.
14.Shewchuk JR. Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator. In: Lin MC, Manocha D, editors. Applied Computational Geometry Towards Geometric Engineering. WACG 1996. Lecture Notes in Computer Science. Volume 1148. Berlin, Heidelberg: Springer; 1996. https://doi.org/10.1007/BFb0014497/.
15.Osada R, Funkhouser T, Chazelle B, Dobkin D. (2002) Shape Distributions. ACM Transactions on Graphics 21(4):807–832, October 2002.
16.Vidanage A, Ranbaduge T, Christen P, Randall S. A privacy attack on multiple dynamic match-key based privacy-preserving record linkage. Int J Popul Data Sci. 2020;5(1):1345. https://ijpds.org/article/view/1345/.
A
18.Boston Maps. (2025) Boston Neighborhood Boundaries. Public record. https://data.boston.gov/dataset/bpda-neighborhood-boundaries/