Attract High-Signal Contributors
The goal wasn’t attention — it was finding NLP engineers capable of improving model performance and dataset quality.
Yachay is an open-source ML initiative built on large-scale natural language datasets sourced from news, social platforms, developer ecosystems, and legal records. It pairs data engineering with applied NLP tooling, including a geolocation detection model, released as open infrastructure for the community.
✦ The challenge wasn’t the model; it was attracting the right contributors.
Competing in an ecosystem where thousands of ML repos launch weekly and most never gain meaningful technical adoption.
Turn passive discovery (stars, reads, forks) into active engineering participation.
Positioned Yachay through a technical narrative focused on geolocation NLP and large-scale dataset engineering to drive high-quality early exposure.
Optimized repository structure, metadata, and keywords so the project surfaces in GitHub search for NLP, geolocation, and dataset-related queries.
Engaged Reddit and Discord communities and partnered with TripleTen coding bootcamp students to bring in early contributors with applied ML interest.
Featured: Bellingcat Hackathon, Hacker News coverage, and live deployment on Hugging Face.
The project reached a critical threshold: enough visibility to attract the right technical audience, and enough signal to start self-sustaining contributions.
More importantly, it created a filtered funnel of NLP developers who engaged directly with the dataset, tooling, and model layers.
Yachay achieved its core objective: identifying and attracting ML engineers capable of improving the system.
After validation through community engagement and early partnerships, the project transitioned beyond its initial open-source phase, while core models and datasets remain publicly accessible.


