Hi everyone, I’m a fresh grad (Data Science/AI background) building a solo project—an AI research assistant for technical PDFs.
Since I don't have a mentor, I'm struggling to know whether my approach is right or if I'm just "in my own head" 😞. I'm also intentionally avoiding AI-assisted coding (Copilot/Cursor) for this project so I can master the fundamentals of RAG/LLM/AI pipelines.
For the MVP, I have: PDF parsing -> chunking -> LLM reasoning -> output of paper insights, methodology, etc.
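To make the shape of that concrete, here's a minimal skeleton of the pipeline. Every function body is a placeholder stub (the real versions would call a parser like MinerU and an LLM); the chunking is a simple sliding window, which is just one of several strategies:

```python
# Skeleton of the MVP pipeline: parse -> chunk -> reason.
# All bodies are stubs; only the shape of the data flow is real.

def parse_pdf(path: str) -> str:
    # Placeholder: a real version would call Docling/MinerU/PyMuPDF
    # and return extracted text/markdown.
    return f"text extracted from {path}"

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Fixed-size sliding-window chunking with overlap.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def summarize(chunks: list[str]) -> str:
    # Placeholder for the LLM reasoning step (insights/methodology).
    return f"insights from {len(chunks)} chunks"

def run_pipeline(path: str) -> str:
    return summarize(chunk(parse_pdf(path)))
```

Nothing fancy, but having the stubs wired end to end lets me swap parsers without touching the rest.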
My current bottleneck: PDF Parsing. I’ve spent a week testing different parsers (Docling, MinerU, PyMuPDF). My current approach is:
- Select 3-5 diverse papers (tables, math, multi-column).
- Run each paper through the parsers.
- Manually evaluate/compare the outputs, plus use an LLM-as-a-judge to score formatting retention -> log everything to MLflow.
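The benchmark loop I run looks roughly like this (a sketch, not my exact code: `judge_score` is a stand-in for the LLM-as-a-judge call, and the MLflow calls are marked as comments so the sketch stays self-contained):

```python
# Sketch of the parser benchmark: run each parser on each paper,
# time it, score formatting retention, collect results.
import time

def judge_score(markdown: str) -> float:
    # Stand-in: the real version prompts an LLM to rate how well
    # tables/equations/layout survived, on a 0-1 scale.
    return min(1.0, len(markdown) / 1000)

def benchmark(parsers: dict, papers: list[str]) -> dict:
    results = {}
    for name, parse_fn in parsers.items():
        # Real harness: wrap this in mlflow.start_run(run_name=name)
        for paper in papers:
            t0 = time.perf_counter()
            text = parse_fn(paper)
            elapsed = time.perf_counter() - t0
            score = judge_score(text)
            # Real harness: mlflow.log_metric(f"{paper}_score", score)
            # and mlflow.log_metric(f"{paper}_seconds", elapsed)
            results[(name, paper)] = {"score": score, "seconds": elapsed}
    return results
```

One run per parser keeps the MLflow UI comparison clean (score vs. latency side by side).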
Results:
- PyMuPDF -> the worst quality (can't parse equations/images), but the fastest.
- Docling -> better parsing than PyMuPDF (but still can't handle images); slower than PyMuPDF.
- MinerU -> best parsing quality overall, but very slow (can take 20 min for long papers).
I'm leaning toward MinerU since it's the best, but it's so slow to run on my local Mac 😞. Any solutions for this? Or free GPUs online?
My Questions for Seniors:
- Is this too much? Should I be evaluating every single component (parsing, chunking, retrieval) this deeply, or should I just pick the "most popular" tool and move on?
- How do you time-box? I feel like I could spend more than a week on parsing alone. How do you decide when a component is "good enough" for a solo project?
- The solo trap: How do you validate your architectural decisions when you don't have a senior dev to review your code?
I want this to be a solid portfolio project, but I'm worried I'm spending too much time on details, and I'm not sure I'm approaching a GenAI project the right way. Any advice on how to manage the workflow?
Thank you guys!!!!