Been a while, but I finally got back round to this. I've found Ollama to be a good project for running models locally, and you can use the Continue plugin in VS Code for a Copilot-like experience. My MacBook Pro M2 16GB is probably too constrained to go above 7B models.
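For anyone else trying this, here's a minimal sketch of hitting Ollama's local REST API from Python; the model name is just whatever you've already pulled (e.g. `ollama pull mistral`):

```python
import requests

# Minimal call to a locally running Ollama server (default port 11434).
# Assumes you've already pulled a 7B-class model that fits in 16GB.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",  # placeholder: any model you've pulled
        "prompt": "Explain what an SRE does in one sentence.",
        "stream": False,     # return one JSON blob instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```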
I have a pair of Nvidia Tesla P40s I can use for transfer learning, but I'm unsure of a good dataset to use or which tools to go with. I was hoping I could add the corpus to a tool like PrivateGPT or AnythingLLM and avoid getting lost in LangChain and all its sub-tools and terminology, since I'm not a data scientist, just an engineer (Dev/SRE).
I would like to figure out an architecture and process to collate the corpus, produce the new model, and push it to Hugging Face for consumers as open source.
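The last step is the easy part, at least. Assuming the result is a transformers checkpoint, publishing it looks roughly like this (the repo id is a placeholder, and you need to have run `huggingface-cli login` first):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint from a local directory (placeholder path).
model = AutoModelForCausalLM.from_pretrained("./finetuned-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("./finetuned-checkpoint")

# Publish both model weights and tokenizer to the Hugging Face Hub.
model.push_to_hub("your-org/your-model")
tokenizer.push_to_hub("your-org/your-model")
```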
Yes, you're going to need more than 16GB of GPU memory to run the larger models.
I'm excited by some of the new hardware coming online from Nvidia and in particular Groq's chip architecture.
One solution to your problem is to spin up rented hardware through one of the various "cloud" providers to fine-tune a pre-existing model.
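As a rough idea of what that looks like, here's a minimal Trainer sketch that would run on a single rented GPU. The model and dataset names are purely illustrative, not recommendations:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# Illustrative choices: a tiny model and a tiny slice of a public dataset.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM (next-token) training labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```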
Either way you slice it, there is going to be a price tag on the compute. I would prefer to do it locally myself, but that is a challenge in training any model, fine-tuning or otherwise.
There is a lot of power in embeddings and retrieval processes as well.
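That route can be much cheaper than fine-tuning. A toy retrieval sketch with sentence-transformers, embedding documents once and matching a query by cosine similarity (the model choice is just an example):

```python
from sentence_transformers import SentenceTransformer, util

# Example embedding model; small enough to run on CPU.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Ollama runs quantized models locally.",
    "Tesla P40s have 24GB of VRAM each.",
    "RAG pairs a retriever with a generator.",
]
doc_emb = model.encode(docs, convert_to_tensor=True)

# Embed the query and pick the most similar document.
query_emb = model.encode("How much memory does a P40 have?",
                         convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]
print(docs[int(scores.argmax())])
```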
You can pick a model on Hugging Face and train it via the Train option on the model page.
There is even a "no-code" option called AutoTrain.
Just make sure you have well-structured datasets for the particular model you are planning on using; the expected format is usually described in the data card.
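A quick way to eyeball a dataset's structure before training on it (the dataset name here is just an example):

```python
from datasets import load_dataset

# Inspect column names, types, and one record before committing to training.
ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(ds.features)  # column names and types
print(ds[0])        # one record, so you can check the formatting
```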
Anasse Bari, David Nagel, et al. have structured data which may be useful.
I will reach out and see if their datasets are open source, or if they will allow open access to their Solr server for queries.
This is my first time learning about Solr and Tika, so I'm not sure about the backend integration necessary yet.
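From what I've read so far, querying a Solr core is just HTTP against its standard select handler; something like this, though the host and core name are made up since I don't know their actual setup:

```python
import requests

# Hypothetical Solr core -- host and core name are placeholders.
SOLR_URL = "http://solr.example.org:8983/solr/papers/select"

resp = requests.get(SOLR_URL, params={
    "q": "*:*",      # Lucene query syntax; match everything here
    "rows": 10,      # number of results to return
    "wt": "json",    # response format
}, timeout=30)

# Standard Solr JSON response: matching documents under response.docs.
for doc in resp.json()["response"]["docs"]:
    print(doc.get("title"))
```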
If I understand correctly, I believe they are working on making a chatbot that uses RAG and perhaps is fine-tuned on the data already?
Not entirely sure tbh.
Most of it is collected via Jed's Online Library using BeautifulSoup, SerpAPI, and the arXiv API, with pandas to put the JSON data into tables, exactly as Dave Nagel's team did.
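As a rough idea of that step, here's a minimal arXiv-to-DataFrame sketch; the query and the fields I keep are just examples, not their actual pipeline:

```python
import feedparser
import pandas as pd
import requests

# Pull a handful of arXiv records (the API returns Atom XML)
# and flatten them into a table. Query is illustrative only.
resp = requests.get(
    "http://export.arxiv.org/api/query",
    params={"search_query": "all:electrolysis", "max_results": 5},
    timeout=30,
)
feed = feedparser.parse(resp.text)

rows = [{
    "title": entry.title,
    "published": entry.published,
    "link": entry.link,
} for entry in feed.entries]

df = pd.DataFrame(rows)
print(df)
```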
I also integrated it with LangChain using Python, which made the job a lot easier because I could use AutoGPT to carry out the tasks of collecting and reviewing data in parallel, and I would review the output for RLHF to ensure the data was free of errors.
This did end up costing me about $100 in total on all the API calls, plus countless hours. 😅
It was a lot of fun though, and I would have liked to use some fine-tuned models before ICCF25, but as I said above, it gets expensive to train and deploy them.
If you are trying to avoid as much code as possible while still having access to a local LLM, you could look at LM Studio and the AutoTrain method from Hugging Face.
Hopefully that was helpful to you. I have some datasets I have played around with, both structured and unstructured, on my Hugging Face if you're interested.
🍻