Clustering Skills from Urban Planning & Data Analytics Job Postings
Contexts
Urban planning roles increasingly demand data analytics expertise. To help me grasp this intersection for myself (since I'm interested in positions in this field), I analyzed 50 recent (Jan–Apr 2025) urban planning job postings with data analytics components. The goal was to extract the technical and soft skills mentioned in these postings, then use clustering to reveal distinct “skill ecosystems” – groups of skills that tend to co-occur in job requirements. By understanding these skill clusters, we can discuss how easy or difficult it might be to transition from one skill group to another.
Methods
First, I leveraged large language model (LLM) capabilities (GPT-4o) to parse job descriptions for skill mentions. LLMs offer a powerful approach to skill extraction from text, outperforming manual keyword matching by recognizing diverse skill phrases in context. After extraction, I converted the data into a structured skill matrix and applied Principal Component Analysis (PCA) for dimensionality reduction. Reducing high-dimensional skill data with PCA is a common step to improve clustering performance and interpretability. Finally, I applied clustering (K-Means) to identify skill groups. This approach follows recent trends in NLP-driven labor market analysis, where clustering helps uncover common patterns and skill gaps across roles.
Below, I'll go through the analysis process step-by-step, including the output of the code. You can follow along with me using the Jupyter notebook code file here.
Data and Skill Extraction
First, we load the dataset of 50 job postings. The data comes from job posts on LinkedIn (I picked it from Prof. Liang's liked posts). Each posting contains fields like job_title, job_description, and requirements. We then use GPT-4o via OpenAI’s API to extract skills mentioned in the text. We instruct GPT-4o to return a structured JSON with two lists: technical_skills (e.g. programming languages, software, domain-specific tools) and soft_skills (e.g. communication, teamwork, problem-solving).
Running the code shows that we have 50 postings with expected columns:
Total job postings: 50
Columns: ['job_id', 'job_title', 'job_description', 'work_type', 'employer', 'job_location', 'salary', 'requirements']
Next, we prepare the GPT-4 extraction. We’ll iterate through each job’s text and prompt GPT-4o to identify skills. In practice, the code will call the GPT-4o API for each job posting (50 calls). Here's an example output for one posting:
Job Title: Data Manager
Technical skills: ['data processes management', 'health data governance', 'big data projects']
Soft skills: ['effective communication', 'partnership collaboration']
Explanation: In this Data Manager posting, GPT-4o identified technical skills such as data processes management, health data governance, and experience with big data projects. These reflect a focus on overseeing complex data workflows, particularly in health-related contexts, and likely stem from job description phrases like “coordinate data-related processes” or “lead governance initiatives for health data systems.” For soft skills, the model extracted effective communication and partnership collaboration, likely inferred from language emphasizing stakeholder coordination or interdepartmental teamwork. This result demonstrates how the model can detect both domain-specific expertise and interpersonal competencies embedded in job language.
We can count the number of unique items for all technical skills and soft skills:
Total unique technical skills: 544
Total unique soft skills: 301
Example technical skills: ['statistical research and analysis', 'TensorFlow', 'ArcGIS']
Example soft skills: ['Manage time efficiently', 'commitment to equity', 'driving innovation']
After extracting skills from all 50 job postings, we compiled a total of 544 unique technical skills and 301 unique soft skills. These ranged widely across the urban planning and data analytics spectrum. Technical skills included tools and methods such as statistical research and analysis, ArcGIS, and machine learning frameworks like TensorFlow. Soft skills were equally diverse, with examples like manage time efficiently, commitment to equity, and driving innovation.
Vectorizing Skills for Analysis
With structured skill data in hand, we create a binary matrix to enable numerical analysis. Each row will represent a job posting, and each column a unique skill. A cell is 1 if the skill is mentioned in that job posting, otherwise 0. This way, each job is encoded as a “skill vector” indicating which skills it requires.
After constructing the matrix, we obtained a shape of (50, 841) – 50 job postings by 841 distinct skills. This confirms that job descriptions are highly varied, each containing a unique combination of both technical and soft skills. As expected, the resulting matrix is very sparse, with most jobs having just a handful of “1” s indicating presence of specific skills. For instance, a GIS-focused role might include 1’s for ["GIS", "ArcGIS", "Spatial analysis", "Project management", "Communication"], and 0’s elsewhere.
Given the high dimensionality, we applied Principal Component Analysis (PCA) to reduce the skill space to two dimensions for visualization. However, the first two principal components together explain only 10.6% of the total variance, which is relatively low. This suggests that skill usage is quite diverse and does not strongly cluster in simple linear patterns. Still, the 2D projection can help reveal broad structural trends and groupings among skills, even if it doesn’t capture the full complexity of the dataset.
Clustering Similar Skills
Now we cluster the skills based on their occurrence vectors (we cluster skills, not jobs, to find groups of skills that appear together in postings). We choose K-Means clustering for its simplicity. To decide on a suitable number of clusters K, we can use the elbow method or silhouette score:
Examining the elbow curve, we observe a sharp drop in within-cluster sum of squares (WCSS) up to K=4, after which the marginal improvement begins to flatten. This suggests that the most meaningful gains in cohesion occur in the range of 3 to 5 clusters. While we did not calculate silhouette scores in this case, the elbow clearly indicates diminishing returns beyond K=5. We ultimately select K=5 as a reasonable compromise—offering enough differentiation between skill groups to enable interpretation, without over-fragmenting the space.
Now we run K-Means with 5 clusters and inspect the clusters. We can visualize these clusters in the 2D PCA space:
Figure: 2D PCA scatter plot of skills, colored by cluster. Each point represents a unique skill (unlabeled for clarity). Skills that often appear together in job postings are grouped in the same color.
From this visualization, we see distinct groupings of skills. Now, let's interpret each cluster in plain language.
Outcomes
Interpreting the Skill Clusters
The clustering results reveal five major skill groups with distinct thematic focuses:
Cluster 0 – Comprehensive Tech & Soft Skill Bundle: This is the most expansive cluster, covering a wide mix of skills. From GIS tools (e.g., ArcMap, ArcGIS Experience Builder) and programming (Pandas, SQL) to cloud platforms (AWS, Azure) and professional competencies like communication, project management, and problem-solving. It also includes values-based soft skills like commitment to equity and collaborative leadership, suggesting this cluster represents well-rounded roles that require both technical breadth and interpersonal awareness—often seen in data-heavy planning, sustainability, or public policy positions.
Cluster 1 – Modern Data Science & LLM Tools: This cluster includes contemporary data science platforms and programming libraries (e.g., Python, R, Hugging Face, LangChain, QLoRA), as well as soft skills like critical thinking, cross-team leadership, and adaptability. The presence of LLM-based applications, prompt engineering, and transformer architectures indicates these roles are situated at the cutting edge of urban AI development and natural language technologies. This group reflects innovation-oriented roles in civic AI, smart city research, and experimental urban tech environments.
Cluster 2 – Entry-to-Mid GIS & Communication Roles: This group includes more traditional planning and technical fieldwork skills such as ArcGIS Pro, AutoCAD, georeferencing, and field data collection, combined with interpersonal strengths like public speaking, listening skills, and a team-oriented mindset. These skills often align with municipal or consulting roles that balance spatial analysis with stakeholder-facing communication.
Cluster 3 – Standalone Programming Core: This smaller cluster features foundational technical skills like Python, R, and SQL, alongside soft skills such as communication, collaboration, and project management. Spatially, it sits apart from the other clusters in the PCA plot, which may indicate that these core programming competencies are widely used and flexible. That’s serving as building blocks in a range of job types rather than defining a specific role. Its isolation could reflect versatility more than thematic coherence.
Cluster 4 – Specialized GIS & Infrastructure Toolkits: This compact and focused cluster revolves around applied geospatial and infrastructure tools, including ArcGIS Enterprise, LiDAR, GNSS receivers, AutoCAD, and municipal utility systems. Supporting skills like training delivery, field mapping, and civil engineering concepts suggest a specialization in public works, utilities management, or engineering-driven GIS roles. Its tight formation in the PCA plot underscores its niche and specific use case.
While Cluster 0 and Cluster 1 contain the most diverse and abundant skills, Clusters 3 and 4 are more compact and specialized. This variation in size and distribution reflects a core-periphery structure within the urban planning and analytics job market. Where some skill sets are foundational and broadly expected, while others signify more advanced or domain-specific expertise. GIS-related skills exemplify this dynamic, appearing across several clusters but serving different functions. From general analysis and mapping (Cluster 2) to advanced cloud-integrated systems (Cluster 0), to infrastructure-focused deployments (Cluster 4).
Transitioning Between Skill Clusters
Understanding how skills group together only part of the picture is. What’s equally important is identifying how professionals might move between clusters as their careers evolve. The PCA plot reveals a core-periphery structure: some skill sets (like core programming or communication) appear foundational and transferable, while others are more niche and may require targeted upskilling to access.
Cluster 3 (standalone programming core) serves as a common entry point or foundational layer. Skills like Python, R, and SQL are widely applicable and show up across other clusters. This makes transitions from Cluster 3 to either Cluster 0 (comprehensive tech & soft skill) or Cluster 1 (modern data science & LLM) relatively smooth, especially for professionals who start by learning core tools and gradually add domain-specific methods.
Moving from Cluster 3 to Cluster 1 might involve expanding knowledge in machine learning, LLM-based applications, and data pipelines. While also becoming familiar with collaborative tools and innovation practices, that’s key traits in urban AI and research-heavy teams.
Cluster 2 (GIS & communication) connects strongly to Cluster 0, as many GIS-related skills appear in both. Professionals with field-based or traditional GIS experience can enhance their prospects by acquiring programmatic GIS workflows (e.g., Python scripting for ArcGIS) or data visualization tools (e.g., Tableau), which are prevalent in Cluster 0.
Transitions into Cluster 4 (specialized GIS & infrastructure) likely require more focused retraining. While there is some GIS overlap, Cluster 4 skills are highly applied. Tools like LiDAR, GNSS, and municipal utility systems demand field-specific experience and technical certifications. This cluster may represent roles with more infrastructure, engineering, or public works orientation.
Moving from Cluster 1 to 4 is rare but conceptually interesting. While these roles are far apart in both content and PCA space, they can converge in smart infrastructure planning (e.g., applying LLMs or AI models to physical asset monitoring or urban sensor data). However, this path would require considerable interdisciplinary bridging.
In short, transitions are easiest between overlapping or thematically adjacent clusters, such as from Cluster 3 into Clusters 0 or 1. The farther apart clusters are in the PCA space, the greater the likely skill gap, and the more strategic learning or cross-domain experience is needed to make that leap.
It’s also worth noting that proximity in the 2D PCA plot doesn’t always reflect ease of transition. For instance, while Cluster 3 (programming core) appears visually closer to Cluster 1 (data science & LLMs), the conceptual leap is steeper, requiring familiarity with advanced AI models, frameworks, and experimental methods. In contrast, moving into Cluster 0 may involve a wider toolset but generally lower conceptual barriers. This highlights how visual closeness and skill barrier are not always aligned, especially when working with dimensionality-reduced representations.
Reflections
This project started from a personal need: I’m a planning graduate student curious about where data skills meet urban practice, and what that means for the job market I’m entering.
What I found wasn’t just a list of tools, but a map of distinct skill ecosystems. Each shaped by different assumptions, technologies, and planning roles. Using GPT-4o to extract skills helped me move beyond surface-level buzzwords, while clustering and PCA revealed deeper structures: how skills travel together, which ones are more niche, and where the barriers really are.
This process made me reflect on AI in a different way, not just as a tool for automation or prediction, but as a lens for making sense of labor markets and professional identity. It also highlighted something planners bring to AI work that’s often-missing contextual thinking. While engineers may optimize for precision, planners are trained to ask why and for whom. That shift changes what you notice, and what you care to cluster.
I hope this exploration helps other students and early-career professionals in urban tech not only map what’s out there but also chart a path through it.
Reference
[1] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
[2] Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.
[3] Kandpal, N., Koh, P. W., Bai, Y., & Liang, P. (2023). Large Language Models (LLMs) as Skill Extractors: Evaluating Performance on Labor Market Data. arXiv preprint arXiv:2305.16408. https://arxiv.org/abs/2305.16408
[4] Loukissas, Y. A. (2019). All Data Are Local: Thinking Critically in a Data-Driven Society. MIT Press.
[5] Ratti, C., & Claudel, M. (2021). The City of Tomorrow: Sensors, Networks, Hackers, and the Future of Urban Life. Yale University Press.
[6] Ruder, S. (2019). Neural Transfer Learning for Natural Language Processing. Doctoral dissertation, National University of Ireland.
[7] Zhou, R., Yu, X., & Wu, Z. (2023). Visualizing the Skill Landscape of AI Jobs: A Network-Based Approach Using NLP. ACM Transactions on Intelligent Systems and Technology, 14(2), Article 17.