AI

The "Vietnamese portrait" data set attracts the international AI community

Bùi Đăng MinhMonday, June 22, 202625 min read
The "Vietnamese portrait" data set attracts the international AI community

The data set helps AI understand Vietnamese people better

FPT and NVIDIA have just announced Nemotron-Personas-Vietnam, a Vietnamese data set built to serve research, training and development of AI systems.

Notably, after only 4 days of appearing on Hugging Face, the world's largest open source AI data and model sharing platform, this dataset quickly entered the Top 15 trending datasets globally.

Hugging Face's rankings reflect the level of interest from the community through the number of downloads, favorites and related interactions.

According to the development team, Nemotron-Personas-Vietnam is not a complete AI model but a background data set.

The "Vietnamese portrait" data set attracts the international AI community - 1
Profiles in Nemotron-Personas-Vietnam are described by many characteristics such as occupation, skills, education level, interests and living area (Photo: FPT).

In other words, this is a source of raw materials for researchers and businesses to use in the process of building, training or evaluating Vietnamese AI models.

The special feature of the data set lies in the use of "personas", which are character profiles that simulate many different Vietnamese groups in society.

Profiles are built from AI-generated synthetic data, containing no personal information of real people.

Each profile is described through many characteristics such as occupation, education level, skills, interests, age, gender, marital status, living area or career goals.

The above approach helps recreate a relatively diverse picture of Vietnamese users in many different contexts.

900,000 portraits in an open dataset

The publicly released version of Nemotron-Personas-Vietnam includes 100,000 records, corresponding to about 900,000 Vietnamese personas with a total capacity of 118 million tokens.

In the field of AI, tokens can be understood as small linguistic units that the model uses to read and process text.

The scale of hundreds of millions of tokens shows that this is a relatively large data source, enough to support many AI research and development activities.

The "Vietnamese portrait" data set attracts the international AI community - 2
The publicly released version of Nemotron-Personas-Vietnam includes 100,000 records, equivalent to about 900,000 Vietnamese personas (Photo: FPT).

The data set covers large localities such as Hanoi, Ho Chi Minh City, Hai Phong, Da Nang, Can Tho and Dong Nai according to Vietnam's new administrative boundaries after the 2025 arrangement.

Thanks to being described in many different information dimensions, developers can easily filter and create specific user groups to serve each specific purpose.

For example, a business can build data specifically for groups of students, young workers or users in each locality.

Nemotron-Personas-Vietnam is currently released in open form on Hugging Face. This means that researchers, startups, businesses or the programming community can access and use for both commercial and non-commercial purposes when complying with the source attribution conditions.

According to FPT, publishing the data set will help expand resources for the domestic AI ecosystem, especially in the context that many small businesses and research groups still have difficulty accessing high-quality data sets in Vietnamese.

Promoting sovereign AI for Vietnam

In recent years, the concept of “sovereign AI” has received increasing attention from many countries. This concept emphasizes building AI systems capable of reflecting each country's unique language, culture, regulations, and development needs rather than relying entirely on global models.

FPT believes that most popular AI models today are trained mainly on English data and Western contexts.

Therefore, when applied to Vietnam, these systems sometimes do not fully understand the differences in language, region, culture or communication style of domestic users.

As a result, responses may not be natural or appropriate to the local context.

Mr. Ngo Xuan Bach, Director of AI Products, FPT Smart Cloud and Director of Quantum AI & Cyber ​​Security Institute, FPT Corporation, said that sovereign AI needs to be built from a data foundation that truly reflects the language, culture and economy of each country.

The "Vietnamese portrait" data set attracts the international AI community - 3
Mr. Ngo Xuan Bach commented that sovereign AI needs to be built from a data platform that truly reflects the language, culture and economy of each country (Photo: FPT).

According to Mr. Bach, Nemotron-Personas-Vietnam is a step to help the AI ​​development community in Vietnam access the necessary resources to build solutions specifically for Vietnamese people.

Within the framework of the project, NVIDIA contributes the Nemotron-Personas method framework, the NVIDIA NeMo Data Designer synthetic data library, and tools to support large-scale data generation.

Meanwhile, FPT takes on the role of providing local knowledge, data validation, computing infrastructure and AI research capacity.

The "Vietnamese portrait" data set attracts the international AI community - 4
FPT AI Factory's new AI infrastructure uses the NVIDIA HGX B300 platform with 8 NVIDIA Blackwell Ultra GPUs (Photo: FPT).

Around the world, NVIDIA has also developed similar data sets for many countries and regions such as the US, Japan, India, Singapore, Brazil or France.

Vietnam's appearance in this ecosystem shows the growing demand for local data sources to serve AI development.

At the same time, this is considered a step forward to build AI systems that understand Vietnamese people better, better serve domestic needs and create a foundation to expand into regional markets in the future.

Nguồn / Original source: Dân trí