u/Beneficial_Ebb_1210

Self-Maintaining Knowledge Graphs. Stupid or the Future of RDM?

Hi,

I am a rookie to the ontology and KG space. After a long time in the AI startup world, I recently started a PhD in AI-assisted RDM.

In industry, I have worked quite a bit on AI-maintained expert systems built for agentic workflow software, and I spent a long and painful time on large-scale AI-driven datarization and surrogate modeling in the WTG industry.

Full disclaimer. I am aware that I am quite wet behind the ears in the KG/ontology field; thus, some of my ideas might sound fantastic to me but ridiculous to someone who has tripped over many of the stones in that space already.

I am looking for a reality check from some !experienced! people here.

Here goes: I am investigating agentically maintained and updated temporal ecosystem KGs.

What that means (to me) is that whenever we want to describe an ecosystem (e.g. the compound material manufacturing science output of a particular institute with hundreds of researchers), we choose artifacts from that ecosystem that help us derive a model that's informed enough to answer the questions we might have.

So, e.g., the ecosystem we aim to model in our KG might be meant to answer questions such as: "Who, at what department, has made a software package that is meant for task X? When did they do it? Are they still at the institute? Is the package maintained during this quarter? How was it funded?" (Before you worry about the task X part: we are currently working on taxonomic task ontologies to derive machine-readable scopes and JTBD from process descriptions in papers and docs.)

This could just be one of many questions. (The types of questions and information the KG should cover are driven by strategic institute goals, such as reducing redundancies and discovering abandoned projects or synergies, and are based on needs and knowledge bottlenecks in a specific domain.)

So what we need to describe are ontologies around people, articles, data, software, organizations, grants, etc., and their connecting properties.
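To make that concrete, here's a toy sketch of the kind of scope I mean, in plain Python rather than OWL/RDF. Every class and property name here is invented for illustration; in practice these would map onto existing ontologies (e.g. FOAF for people, schema.org's SoftwareSourceCode):

```python
from dataclasses import dataclass, field

# Hypothetical core classes for the ecosystem; real ones would be
# ontology classes with URIs, not Python dataclasses.
@dataclass
class Person:
    orcid: str
    name: str
    department: str
    active: bool = True  # still at the institute?

@dataclass
class Grant:
    grant_id: str
    funder: str

@dataclass
class SoftwarePackage:
    repo_url: str
    task: str                                         # machine-readable task/JTBD scope
    maintainers: list = field(default_factory=list)   # -> Person
    funded_by: list = field(default_factory=list)     # -> Grant
    last_active_quarter: str = ""                     # e.g. "2025-Q2"

# One of the competency questions from above: "who made a package for
# task X, and is it maintained during this quarter?"
def packages_for_task(packages, task, quarter):
    return [p for p in packages
            if p.task == task and p.last_active_quarter == quarter]
```

The point is just that every competency question should be answerable by walking classes and connecting properties; if it isn't, the scope is missing an artifact or a relation.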

My “currently naive” goal is to see how far we can drive AI(LLM)-orchestrated “living” KGs tied to the information systems we have at the institute using the following steps.

  1. Dummy-describe the ecosystem's artifacts and their relationships, as needed to answer the sets of questions aligned with the needs of the people who will use it.
  2. Map the outcome to existing ontologies as well as possible, bridging fuzzy connections between ontologies (something I already see as an almost philosophical, Goliath task).
  3. Once we have a “good enough” ontology, we engineer logical constraints (e.g. SHACL).
  4. Then I will define the information endpoints that will act as information wells to instantiate classes from the ontology (e.g., paper, software, and data repositories inside the institute, with all possible properties).
  5. Inside the KG pipeline, transformer-orchestrated agents harvest from those endpoints at defined intervals or via webhooks, instantiate classes inside the KG, and decide whether each candidate is new, an iteration/version jump of an existing instance, redundant, etc.
  6. The goal is to basically have a self-versioning KG that functions on a small, well-defined scope and acts as a continuous time capsule/active status harvester for our domain.
  7. People ontologies are informed by HR software and registries, papers by our in-house pub API, software and data by our on-premise repositories, and so on, but the ontology stays fixed and enforced. Updates to the ontology are a conscious and informed decision.
    ---
    (All this is extremely dumbed down, of course; I am aware of the work concerning the ontological description and the nuances of the pipeline. Most of my time is currently devoted to prototyping and researching inside these problem spaces.)
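For the skeptics who want something concrete: the "decide what is new vs. a version jump vs. redundant" part of step 5 can be sketched in a few lines of plain Python. This is a deliberately naive sketch with invented field names; the real agents would emit RDF and run the SHACL checks from step 3 before anything is committed to the KG:

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Stable content hash over a harvested record's properties."""
    return hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()

def decide(record: dict, kg_state: dict) -> str:
    """Classify a harvested record against the current KG state.

    kg_state maps a stable identifier (e.g. DOI, repo URL) to the
    fingerprint of the last accepted version of that instance.
    """
    key = record["id"]
    fp = fingerprint(record)
    if key not in kg_state:
        kg_state[key] = fp
        return "new"            # instantiate a fresh class instance
    if kg_state[key] == fp:
        return "redundant"      # nothing changed, skip
    kg_state[key] = fp
    return "version-jump"       # attach as a new version, keep history
```

The hard part, of course, is not this loop but everything around it: entity resolution across sources, deciding which property changes actually constitute a new version, and keeping the temporal history queryable.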

The goal of all of this is to alleviate the current pains of increasing redundant development and research efforts and allow for faster connection of people with synergetic output, automatic reporting, or human language querying the KG.

I don't want you to solve this for me. I'll do that myself as far as possible. :D I am just here to get some…

"Man, you haven't even scratched the surface of all the problems involved in this”

… comments.

I definitely have the skills to tackle all this. However, a few ontology veterans at conferences and some younger non-AI researchers inside the RDM field have told me that this is naive thinking; they have occasionally even laughed at the concept when I explained it. But the thing is, I have seen similar things work in small, well-defined scopes, and a working prototype based on only a few classes has given me at least a modest POC.

The biggest problems I see coming towards me currently are:

- Data is very noisy (or, at the other extreme, information is simply missing), and the way people currently dump their research output, without docs or metadata, etc., is a nightmare.
- Bad info sources result in garbage graphs.
- There can be multiple sources of truth with different truths, all of which might be incorrect or outdated.
- Some ontologies can be difficult to bridge.
- Definition and distinction tasks can enter the realm of philosophical debate.
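For the multiple-sources-of-truth problem, my current thinking is provenance plus a per-property trust/recency rule, rather than anointing one "true" source. A toy sketch (the source names and trust weights are entirely invented; a real version would keep full provenance as reified statements or named graphs, not discard it):

```python
from datetime import date

# Hypothetical trust ranking per source: HR is authoritative for
# employment, the in-house pub API for papers, scraping is last resort.
TRUST = {"hr_system": 3, "pub_api": 2, "web_scrape": 1}

def resolve(claims):
    """Pick one value per property: highest trust wins, ties go to
    the most recent claim.

    claims: list of (property, value, source, as_of_date) tuples.
    """
    best = {}
    for prop, value, source, as_of in claims:
        rank = (TRUST.get(source, 0), as_of)
        if prop not in best or rank > best[prop][0]:
            best[prop] = (rank, value)
    return {prop: value for prop, (_, value) in best.items()}
```

This obviously doesn't solve the "all sources might be wrong" case; it just makes the disagreement explicit and the resolution rule auditable instead of implicit.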

I have heard everything from...

"This already exists and is a well-proven concept", or "And what is the use of this?", to "This is world-ontology nonsense."

I know this is a massive post, and I don't think I have covered 1% of my mental workbench, but I would be grateful for some diverse perspectives, ideas about problems I don't see, or pointers to fellow researchers or resources that can inform my research. I am currently in the "don't you see why this is the way" phase, while I often hear, "Don't you see why it's not?"
