How To Parse Text To Yaml With Local LLM
Yesterday I had a brilliant idea: why not parse the wiki of my favorite tabletop roleplaying game into YAML via an LLM? I had tried the same with BeautifulSoup a couple of years ago, but the page is very inconsistent, which makes it quite difficult to parse using traditional methods. Some example pages:
- https://dsa.ulisses-regelwiki.de/Kul_Auelfen.html
- https://dsa.ulisses-regelwiki.de/erw_zauber_sf.html?erw_zaubersf=Alchimieanalytiker
- https://dsa.ulisses-regelwiki.de/KSF_Alter_Adersin.html
However, my attempts with a local Mistral model (the one you get with `ollama pull mistral`) were not very successful: it first insisted on writing more than just the YAML, and later it had trouble with more complex pages like https://dsa.ulisses-regelwiki.de/zauber.html?zauber=Abvenenum. So I thought I had to give it some examples in the system prompt, but while one example helped a little, once I included more, it sometimes started to just return one of the examples I had given it via the system prompt.
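For reference, the kind of call involved is roughly this (a minimal sketch against ollama's local REST API; the system prompt and page text here are just stand-ins for what I actually use):

```python
import requests

SYSTEM_PROMPT = """You convert wiki pages of a tabletop RPG into YAML.
Return ONLY valid YAML, no explanations.
Bold labels become keys; the text that follows each label becomes the value."""

def page_to_yaml(page_text: str, model: str = "mistral") -> str:
    # ollama's local REST endpoint; stream=False returns a single JSON object
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "system": SYSTEM_PROMPT,
            "prompt": page_text,
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```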
To give some idea: the bold text should become the keys in the YAML structure, and the text that follows each label becomes the value. Sometimes values need a bit of extra parsing, like separating page numbers from book names - I would give examples for all of that.
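Roughly the kind of output I'm after, e.g. for a spell page (the keys shown here are just illustrative - the real ones come from whatever bold labels the page has):

```yaml
# illustrative target structure - actual keys come from the bold labels on the page
name: Abvenenum
probe: ...
wirkung: ...
publikation:
  buch: ...
  seite: ...
```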
Any idea what model to use for that or how to improve results?
Did you give the raw html as input to the model?
I assume the context window might be too small for smaller open-source models. I tried it with chatgpt5 just for testing and it did pretty well: https://chatgpt.com/share/68a36599-2f5c-8003-9214-fdd693b53b72
Maybe instead of the raw HTML you could convert it to Markdown first via pandoc to reduce the token count, or split a page into multiple sections and then combine the resulting partial YAML files.
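Something like this could do the conversion (a sketch that shells out to pandoc, which needs to be installed separately):

```python
import subprocess

def html_to_markdown(html: str) -> str:
    # pandoc reads HTML from stdin and writes plain Markdown to stdout,
    # which is usually far fewer tokens than the raw HTML
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=markdown_strict", "--wrap=none"],
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```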
But tbh an LLM seems like the wrong tool for this task... A simpler way would be to write a userjs (or Python) script for this job, since the input is a lot of well-structured data - LLMs are bad at handling a lot of data at once, and scripts are good at wrangling well-structured data, no matter how much there is.
I tried feeding the HTML, which didn't work at all, and then just the raw text of the tag with id main (so the text in the white area, but no HTML tags). It didn't feel like the task was too difficult in the sense that it never produced good results; rather, it too often deviated from the task, rambling about other stuff or not sticking to the pattern once more than one pattern was introduced.
Could you elaborate on how userjs might help? I hadn't heard of it before, and a quick Google search didn't make it immediately obvious. As I hinted before, I tried a Python script with BeautifulSoup, but because the page is inconsistent, my results were debatable.
Yeah that sounds like it can't keep a large enough context. Maybe try a beefier model.
I just suggested userjs because it runs directly in the browser and can use the browser's DOM parser. Also, a userjs script could inject a button that downloads the YAML. Idk if that's desired.
The page doesn't seem too complex, as you said - you just have to find the tags with the bold text and then grab the text that follows each one. A simple loop-based parser will do, something like the sketch below.
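Untested sketch with BeautifulSoup - the id `main` is the one you mentioned; whether the labels are `b` or `strong` tags, and whether the value sits next to the label or in a following paragraph, may need adjusting for the real markup:

```python
import requests
import yaml
from bs4 import BeautifulSoup, Tag

def parse_page(url: str) -> dict:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    main = soup.find(id="main")  # the white content area you mentioned
    if main is None:
        return {}

    data = {}
    for label in main.find_all(["b", "strong"]):
        key = label.get_text(strip=True).rstrip(":")
        # collect everything between this bold label and the next bold tag;
        # if the value lives in a following paragraph instead, walk the
        # parent's siblings here instead of the label's
        parts = []
        for sibling in label.next_siblings:
            if isinstance(sibling, Tag) and sibling.name in ("b", "strong"):
                break
            text = (
                sibling.get_text(" ", strip=True)
                if isinstance(sibling, Tag)
                else str(sibling).strip()
            )
            if text:
                parts.append(text)
        data[key] = " ".join(parts)
    return data

if __name__ == "__main__":
    page = parse_page("https://dsa.ulisses-regelwiki.de/zauber.html?zauber=Abvenenum")
    print(yaml.safe_dump(page, allow_unicode=True, sort_keys=False))
```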