Wikimedia, DataStax, and Jina AI launch semantic search for non-profit AI developers

Posted by Admin September 17, 2024 Tech

Today Wikimedia Deutschland announced the launch of a semantic search concept in collaboration with search experts from DataStax and Berlin’s Jina AI.

The concept makes Wikidata’s openly licensed data available in an easier-to-use format for AI application developers. This simplifies the process of developing open-source, non-profit AI applications and contributes to a more reliable information ecosystem.

As an open knowledge graph with over 112 million human- and machine-readable entries, Wikidata is a valuable source of data for developers and society. With continuous contributions from over 12,000 active editors, Wikidata’s data is diverse and well maintained.

The need for access to large amounts of high-quality data has significantly increased in the last decade.

Generative AI, in particular, requires vast amounts of training data, which is often scraped from the internet.

However, this scraping requires staff and time that are primarily available to large commercial organisations. This leads to a closed ecosystem for data utilisation, which is contrary to the ideals of open source.

Wikidata aims to help open up this closed system by transforming its crowd-sourced, validated entries into an easy-to-access data source for open-source AI application development.

Once Wikidata is integrated into more open-source machine learning workflows, the quality of the information ecosystem can be improved: generative AI mistakes can be reduced, and the output from LLMs could become more reliable.

In the long run, the wider public could benefit from having more reliable alternatives to commercial generative AI providers, built on Wikidata’s data.

Wikimedia Deutschland is a non-profit organisation with over 111,000 members and 180 employees committed to promoting freely available knowledge in the digital space. It is the largest country representative of the international Wikimedia Movement, develops free software and the free Wikidata database, and is involved in political and educational activities to promote free access to knowledge and data.

According to Dr Jonathan Fraine, Head of Software Development at Wikimedia Deutschland:

“We’re focused on helping developers who share our values. However, many developers find accessing Wikidata challenging, and our current methods don’t support the data volume required for some of the most recent generative AI development needs.”

Now, with the support of DataStax and Jina AI, Wikidata’s data will be transformed and made more convenient for AI developers as semantic vectors in a vector database. DataStax provides the vector database while Jina AI provides the open-source embedding model for vectorising the text data.

Vector embeddings are numerical representations of words or themes. They are created to turn those words and their semantic meanings into a form that computers can understand and use. When large language models process text, they use these embeddings as part of their method for creating responses.

Vector embeddings are also used for operations like search: to respond to a user’s request with relevant data, you turn the request into a vector embedding and then search your own data sets for similar embeddings. Once the search is done, you can pass the matching results to the LLM, which uses them to generate a response for the user.
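To make the search step concrete, here is a minimal sketch in Python. It is an illustration only, not the project's implementation: the three-dimensional vectors are hand-written stand-ins for the output of a real embedding model, and the names ITEM_VECTORS and semantic_search are hypothetical. Similarity is measured with cosine similarity, one common choice for comparing embeddings.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
# A real embedding model produces vectors with hundreds of dimensions.
ITEM_VECTORS = {
    "Berlin is the capital of Germany": [0.9, 0.1, 0.0],
    "The Spree is a river in Germany": [0.6, 0.1, 0.5],
    "Marie Curie was a physicist": [0.0, 0.9, 0.1],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, item_vectors, top_k=2):
    """Return the top_k texts whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in item_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:top_k]]


# Assume this vector is the embedding of "Which city is Germany's capital?"
query = [0.8, 0.2, 0.1]
results = semantic_search(query, ITEM_VECTORS)
print(results)
```

With these toy vectors, the Berlin entry ranks first because its vector points in nearly the same direction as the query. A vector database performs the same kind of comparison, but over millions of entries and with indexing to keep it fast.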

Dom Couldwell, Head of Field Engineering, EMEA, at DataStax, explained:

“Embeddings improve responses and make them more relevant to the user. However, many developers currently have to create their own embedding data, which can be costly when they have a lot of data to use.

“For this project, providing vector embeddings of Wikimedia data will improve the quality of responses that get generated. It can also be accessed so you get more up-to-date information: rather than relying on old data that was used for training, you can get the latest version that is based on the most recent updates to Wikimedia.

“There is no AI without data, and this provides a higher quality source for developers to use.”

The vectorisation will enable direct semantic analysis and could help detect vandalism in the knowledge graph. It also simplifies the process of using Wikidata in RAG (retrieval-augmented generation) applications in the future. RAG can reduce AI mistakes by including current, verified facts in the results. Wikimedia Deutschland started creating the concept in December 2023.
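As a rough illustration of the RAG idea, the sketch below shows how retrieved facts might be placed in front of a user's question before the combined text is sent to a language model. The function name build_rag_prompt, the prompt wording, and the example fact are all assumptions made for this example; the article does not describe how the prototype assembles its prompts.

```python
def build_rag_prompt(question, retrieved_facts):
    """Combine retrieved facts and the user's question into one prompt string."""
    context = "\n".join(f"- {fact}" for fact in retrieved_facts)
    return (
        "Answer the question using only the facts below.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}\n"
    )


# Hypothetical result of a semantic search over a knowledge base.
facts = ["Berlin is the capital of Germany"]
prompt = build_rag_prompt("What is the capital of Germany?", facts)
print(prompt)
```

Because the facts are fetched at query time, the model can draw on the current state of the data source rather than only on what it saw during training.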

Lydia Pintscher, the Portfolio Lead Product Manager of Wikidata, is convinced that better access to the data volume of Wikidata can be a game-changer for open-source generative AI communities:

“By providing high-quality data, we support the communities with their work and realisation of new ideas that are not profit-driven but have the intention to serve humanity with valid information.”

The first beta tests of a prototype are planned for 2025.

Lead image: Wikimedia.