National data standardization to support AI development


By 2030, Vietnam aims to digitize and standardize 100% of priority databases and integrate them into the National Integrated Database to serve AI development. The effort is intended to help Vietnam master Vietnamese-language large language models and core AI algorithms in a way that reflects the country’s cultural and linguistic characteristics.

The Ministry of Science and Technology is drafting a Prime Ministerial Decision to establish a data catalog for AI development in essential sectors. The draft aims to create a cohesive, open, and safe AI data ecosystem and to complete large-scale data stores and data lakes in essential sectors.

The data catalog is built on core screening criteria: alignment with the principles of national AI development, relevance to the public interest and essential sectors, feasibility of deployment in Vietnam, capacity for standardization and de-identification, compliance with data laws and personal data protection, a clearly identified governing authority, and the ability to be updated periodically. The catalog is structured as two detailed appendices that outline investment and exploitation roadmaps.
Appendix I lists the data catalogs in essential sectors that serve AI development and is regarded as the nation’s “digital data resource map.” Representative groups include data in Vietnamese and minority languages; national knowledge, laws, and governance; data from key fields such as health, education, agriculture, transportation, resources and environment, and economy and markets; and data on infrastructure and security, such as maps and geospatial data, telecommunications and digital infrastructure, and safety, security, and risk management. The goal is to clearly identify the core state-managed data resources that must be standardized to enable connection and sharing within the AI ecosystem.

Appendix II focuses on high-value data in five major groups. Group 1 is data for Vietnamese-language large language models, focusing on large-scale text corpora, news, scholarship, and multilingual speech, to master domestic AI technology and safeguard cultural sovereignty in the digital space. Group 2 is data for testing and evaluating AI systems, providing standardized benchmarks, including exams, legal scenarios, and dialogue scripts, to measure capability and accuracy before deployment. Group 3 is data for computer vision, focusing on image and video data from traffic cameras, urban environments, advanced medical imaging, agriculture, and satellite and remote sensing. Group 4 is data in specialized fields, prioritizing structured data and statistical tables in health, education, finance, energy, and the environment. Group 5 is data for safe and reliable AI, building datasets to train filters for misinformation, harmful content, fraud, and attack scenarios to ensure cybersecurity.
