The ecosystem is the moat
I build AI agents for enterprise clients, and the OpenAI API is still where most projects start. Not because it's the best at everything, but because the ecosystem around it is massive.
The function calling implementation is clean. Structured outputs with JSON mode make it predictable enough for production. The assistants API gave us a decent starting point for stateful agents, though we ended up building custom state management on top.
Batch API is a lifesaver for our eval pipelines. We run thousands of test cases through models nightly and the 50% discount on batch makes it actually affordable.
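For anyone who hasn't used it: a batch job is just a JSONL file of requests that you upload and submit. A minimal sketch of how a nightly eval payload gets assembled (the model snapshot, prompts, and file name here are placeholders for your own setup):

```python
import json

# Each JSONL line is one request; custom_id lets you match results
# back to your test cases when the batch completes.
def batch_line(custom_id: str, prompt: str, model: str = "gpt-4o-2024-08-06"):
    # Pin a dated snapshot so nightly runs don't shift under you
    # when the alias model name gets silently updated.
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

cases = ["Summarize RFC 2616 in one line.", "What does HTTP 418 mean?"]
with open("evals.jsonl", "w") as f:
    for i, prompt in enumerate(cases):
        f.write(json.dumps(batch_line(f"case-{i}", prompt)) + "\n")

# Then upload and submit (requires an API key and the openai client):
#   file = client.files.create(file=open("evals.jsonl", "rb"), purpose="batch")
#   client.batches.create(input_file_id=file.id,
#                         endpoint="/v1/chat/completions",
#                         completion_window="24h")
```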
Realtime API opened up interesting possibilities for voice agents we're building. Not perfect but workable.
The frustrations: model behavior changes between versions even within the same model name. We've had prompts that worked perfectly on one snapshot break on the next. That's dangerous in production. Also, rate limits and pricing tiers are confusing. I've read the docs multiple times and still get surprised by limits.
If you're building AI products commercially, you'll probably end up here eventually. The developer community, tooling, and documentation are hard to beat. Just plan for the quirks.
Nothing else comes close for visual quality
I use Midjourney for all the visual assets in my DevRel content. Blog post headers, social media graphics, presentation slides. The aesthetic quality is just in a different league compared to DALL-E or Stable Diffusion.
V6 was a big step up. Text rendering actually works now (mostly). The photorealism is insane when you want it, and the artistic styles are incredibly varied. I can go from watercolor illustration to cinematic photography with a few words.
The style reference feature is a game changer for brand consistency. I have a few reference images that match our brand aesthetic and Midjourney keeps new generations in the same visual family.
The Discord workflow is... fine. I actually don't mind it anymore but I know a lot of people hate it. The web editor is getting better but Discord is still faster for me.
For developers specifically: it's great for generating UI mockups and app screenshots for landing pages. I described a dashboard interface and got something that looked like a real product screenshot. Way better than using generic stock photos.
Pricing is fair at $10/month for basic. I use the $30 plan because I burn through generations fast.
Makes building AI features in Next.js actually pleasant
Tried building AI chat into our product from scratch first. Streaming, tool calling, managing conversation state. It was a mess. Then I found the Vercel AI SDK and rewrote it in a weekend.
The useChat hook alone is worth it. Streaming responses, optimistic updates, error handling. All the frontend plumbing that takes days to build yourself. It just works with Next.js out of the box.
Provider support is good. We started with OpenAI, switched to Claude for better responses, and the code change was literally swapping one import and updating the model name. That flexibility matters when you're iterating on which model works best.
The structured output with Zod schemas is great for building features that need predictable AI responses. We use it for a classification feature and the type safety all the way through is really nice.
Complaints: documentation assumes you know a lot already. I spent time reading source code to understand some behaviors. Also, when things break it can be hard to tell if it's the SDK, the provider, or your own code. Error messages could be more specific.
If you're building AI features in a Next.js app, this should be your starting point.
Useful but the abstraction layer adds real complexity
We evaluated LangChain for building our RAG pipeline at work. Used it for about 3 months before making a decision.
The good parts: it gives you building blocks for AI workflows that would take weeks to build from scratch. Chaining prompts, managing memory, connecting to vector stores. The integrations ecosystem is huge. We connected it to Pinecone and OpenAI in maybe an afternoon.
But the abstractions got in our way more than they helped. Debugging is painful because there are so many layers between your code and the actual LLM call. When something goes wrong you're digging through a chain of callbacks trying to figure out where it broke.
The API changes frequently too. We upgraded from 0.1 to 0.2 and half our chains broke because they renamed things and changed how memory works. That's frustrating when you're running in production.
For prototyping and demos it's great. For production systems where you need to understand every piece, I'd honestly recommend writing your own thin wrapper around the LLM APIs. Less magic, more control.
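For concreteness, here's the kind of thin wrapper I mean: one class that owns retries, backoff, and defaults, with nothing between you and the provider call. The provider call is injected (`call_fn`) so this sketch runs without network access; in real code it would be your openai or anthropic SDK call.

```python
import time

class LLMClient:
    """Thin wrapper: one place for retries and defaults, no framework layers."""

    def __init__(self, call_fn, retries: int = 3, backoff: float = 0.0):
        self.call_fn = call_fn      # your actual provider SDK call goes here
        self.retries = retries
        self.backoff = backoff

    def complete(self, prompt: str) -> str:
        last_err = None
        for attempt in range(self.retries):
            try:
                return self.call_fn(prompt)
            except Exception as err:  # narrow to provider errors in real code
                last_err = err
                time.sleep(self.backoff * (2 ** attempt))
        raise RuntimeError(f"LLM call failed after {self.retries} attempts") from last_err

# Stub provider that fails twice, then succeeds:
attempts = {"n": 0}
def flaky(prompt):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("transient")
    return f"echo: {prompt}"

client = LLMClient(flaky, retries=3)
print(client.complete("hi"))  # prints "echo: hi" after two retried failures
```

When something breaks in a setup like this, the stack trace is your own three functions deep, not a chain of framework callbacks.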
LangGraph is more interesting to me. The stateful agent approach makes more sense architecturally. But the core LangChain library tries to do too much.
Cursor's agent mode turned a 2-week project into 3 days
I already reviewed Cursor as a tool but the agent mode deserves its own review because it's a different beast entirely. Regular Cursor is great for autocomplete and inline edits. Agent mode is like having a junior developer who can actually follow instructions.
Last month I needed to add Stripe billing to our SaaS app. I described the requirements in Composer, told the agent which files were relevant, and let it work. It created the webhook handlers, the pricing page components, the subscription management API, and even the database migrations. Took about 2 hours of me reviewing and nudging it, versus the 2 weeks it would have taken a contractor.
The multi-file editing is where it really clicks. The agent understands your project structure, knows which files need to change together, and makes coordinated edits. When I asked it to rename a database column, it updated the schema, the migration, the API handlers, the TypeScript types, and the frontend components. All in one pass.
The terminal integration is clutch. It can run your tests, see the failures, and fix the code automatically. I set it up on a test suite with 15 failing tests and it fixed 13 of them without me touching anything.
Biggest win is consistency. Because it sees your whole codebase, the code it writes matches your style, your patterns, your naming conventions. It doesn't write code that looks like a Stack Overflow copy-paste.
Worth every cent of the $20/month. If you're building anything in production and not using Cursor's agent mode, you're honestly just working harder than you need to.
Good enough for corporate, frustrating for real dev work
I use Copilot at work because that's what IT approved and my feelings are mixed. For straightforward enterprise dev tasks like writing CRUD endpoints, unit tests for existing code, and basic refactoring, it's fine. The GitHub integration is smooth and having it right in VS Code is convenient.
The PR summary feature actually saves my team time during code reviews. It generates decent descriptions and highlights the key changes. For a team of 8 doing 20+ PRs a week, that adds up.
But here's my problem. Copilot plays it safe to a fault. Ask it anything remotely related to security, infrastructure, or system design and it hedges so much the answer is useless. I asked it to help write iptables rules and it basically told me to consult a professional. I am the professional. That's why I'm asking.
The chat feature is noticeably worse than ChatGPT or Claude for complex coding questions. It feels like it's running a smaller model or has heavy safety filters. Half the time I end up opening Claude in another tab to get a real answer.
The workspace agent feature where it can run commands and edit files is decent but not as smooth as Cursor's implementation. It works for simple tasks but gets confused on multi-file refactors.
For corporate environments where you need something IT-approved with enterprise compliance, it's the default choice. But if you have freedom to pick your own tools, there are better options for serious development work.
Powerful but the abstraction tax is real
I've been using LangChain since the early days and my feelings are complicated. On one hand, it's the most complete framework for building LLM applications. Chains, agents, RAG pipelines, tool use, memory, it's all there. The ecosystem is huge and if you need to integrate with some random service, there's probably a LangChain integration for it.
On the other hand, the abstraction layers have gotten deep. Really deep. When something breaks in a chain, the stack trace is 40 lines of internal LangChain code before you get to anything useful. Debugging is painful. I spent two hours last week tracking down a bug that turned out to be a type mismatch three abstraction layers deep.
The LCEL (LangChain Expression Language) was supposed to simplify things but honestly it added another layer of confusion. Now I have to think about whether to use the old Chain interface, LCEL pipes, or the newer LangGraph for any given task. The documentation covers all three but doesn't give clear guidance on when to use what.
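For readers who haven't seen LCEL: the pipe syntax composes steps left to right, `prompt | model | parser`. A toy sketch of that composition pattern (these are illustrative classes, not LangChain's actual Runnable API, and the "model" is a stub):

```python
class Runnable:
    def __or__(self, other):
        # `a | b` builds a pipeline that feeds a's output into b.
        return Pipe(self, other)

    def invoke(self, value):
        raise NotImplementedError

class Pipe(Runnable):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def invoke(self, value):
        return self.right.invoke(self.left.invoke(value))

class Fn(Runnable):
    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

prompt = Fn(lambda q: f"Q: {q}\nA:")
model = Fn(lambda p: p + " 42")                    # stand-in for the LLM call
parser = Fn(lambda out: out.split("A:")[-1].strip())

chain = prompt | model | parser
print(chain.invoke("What is 6 * 7?"))  # prints "42"
```

The pattern itself is elegant; the confusion comes from it coexisting with the older Chain classes and LangGraph, all solving overlapping problems.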
For research prototyping, it's still my default because the iteration speed is unmatched. Need to swap out a model, change the retriever, add a reranker? A few lines of code. But for production systems, I've started moving to LlamaIndex or just writing custom code. The overhead of LangChain's abstractions isn't worth it when you need predictable behavior and clean error handling.
It's not bad software. It's ambitious software that's trying to do too much. The team ships fast but sometimes I wish they'd slow down and consolidate.
Cool concept but rough around the edges
I wanted to love Khoj. The idea of a self-hosted AI assistant that indexes your personal data and gives you answers grounded in your own files is exactly what I need. And sometimes it works great. I pointed it at my Obsidian vault with 3 years of dev notes and it can find stuff I'd forgotten about.
The search is genuinely good when it works. Asked it about a Next.js deployment issue I solved 8 months ago and it pulled up the exact note with my fix. That's the dream.
But here's the thing. The setup experience is rough. Getting it to properly index everything took multiple attempts. The Obsidian plugin crashed twice. The web UI feels like a side project. And the chat interface sometimes ignores the context from your files and just gives you generic ChatGPT-style answers.
The self-hosted aspect means you need to manage your own embeddings, which adds a decent amount of complexity if you want good search quality. I ended up spending a whole weekend configuring the embedding model and chunking strategy to get acceptable results.
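The "chunking strategy" knob, for anyone who hasn't tuned one: you're trading chunk size against overlap. A minimal sliding-window chunker showing the two parameters I ended up adjusting (this is the general pattern, not Khoj's internal implementation):

```python
def chunk(text: str, size: int = 200, overlap: int = 50):
    """Split text into fixed-size windows; `overlap` keeps shared context
    at boundaries so an answer isn't cut in half between two chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

note = "x" * 500
pieces = chunk(note, size=200, overlap=50)
print(len(pieces))  # 500 chars at step 150 -> 3 chunks, each sharing 50 chars
```

Bigger chunks keep more context per embedding but blur retrieval; more overlap costs storage and indexing time. Getting that balance right was most of my weekend.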
I think in 6 months this could be amazing. The core idea is right and the team is shipping updates fast. But right now, if you're not willing to tinker and troubleshoot, you'll probably get frustrated. It's a power user tool that's pretending to be a consumer product.
Gemini 2.5 Pro is seriously underrated for code
Everyone sleeps on Gemini and I don't get it. Since 2.5 Pro dropped, it's become my second most-used model after Claude. The 1M token context window is not a gimmick. I've fed it entire repos with 200+ files and it keeps context better than anything else at that scale.
For ML engineering work, the Google ecosystem integration is huge. It plays nice with Vertex AI, BigQuery, and Colab in ways that feel native. I can go from a Gemini chat about a training pipeline straight to executing it in Colab without copy-pasting boilerplate.
The multimodal capabilities are strong. I've been using it to analyze training curves from screenshots, debug UI issues from photos, and even interpret architecture diagrams. It's the best vision model for technical content in my experience.
The thinking mode in 2.5 Pro is a significant jump from 2.0. It handles complex debugging sessions where you need to trace through 5-6 interconnected functions without losing the thread. The chain-of-thought is transparent and you can see where its reasoning is heading.
Why not 5 stars? Two things. First, it's still weaker than Claude on pure creative writing and nuanced communication. The tone can feel clinical. Second, Google's API pricing and rate limits change so frequently that I've had production pipelines break twice because of surprise quota changes. Hard to trust for critical infrastructure when the platform keeps shifting under you.
Quietly the best agent monitoring tool nobody talks about
I've been evaluating observability tools for AI agents as part of my research on agent reliability, and Vane surprised me. It's an agent monitoring and debugging platform that focuses on the things that actually matter: tracing agent decisions, identifying failure patterns, and showing you where your agent's reasoning went off the rails.
The trace visualization is the standout feature. You get a tree view of every decision your agent made, what context it had, what tools it called, and what the intermediate results were. When an agent fails, you can pinpoint exactly which step went wrong. I've used LangSmith and Helicone for similar things and Vane's UI is more intuitive for debugging.
The anomaly detection is neat. It learns your agent's typical behavior patterns and flags when something looks off. We caught a regression in one of our research agents where it started looping on the same tool call. Vane flagged it before we noticed.
Pricing is reasonable for what you get. The free tier is generous enough for personal projects and the team plan isn't going to break the bank.
The main gap is integrations. Right now it works best with LangChain and LlamaIndex. If you're using a custom agent framework, the SDK is there but you'll be writing more instrumentation code. Also, the alerting system is pretty basic. Slack notifications only, no PagerDuty or custom webhooks yet.
Replaced three internal tools with Worktrunk
My team was using a patchwork of Notion, Linear, and a custom dashboard to track our AI agent fleet. Worktrunk replaced all three. It's basically a workspace manager built specifically for teams running multiple AI agents and workflows.
The best feature is the unified timeline. You can see what every agent did, when, what it cost, and whether it succeeded or failed. We're running about 15 different automated workflows and before Worktrunk I had no idea which ones were actually worth the compute cost. Now I can see that our content generation pipeline costs $12/day but our code review agent only costs $0.80 and catches way more bugs.
The collaboration features are solid too. Team members can annotate agent runs, flag issues, and share workflows. It's not trying to be Slack or anything, it's just the right amount of team features for managing AI workloads.
I took off one star because the pricing tiers are a bit aggressive for small teams. The free tier is pretty limited and you hit the paid wall fast once you start adding more than 5 agents. Also the mobile app is basically useless right now. Just shows a read-only feed.
But if you're a startup running multiple AI agents and you need visibility into what's happening, this is the best option I've found. Way better than trying to build your own dashboards.
Still the Swiss Army knife of AI assistants
Look, there's a reason ChatGPT has a billion users. It does everything reasonably well. Need to brainstorm product names? Good. Need to debug Python code? Solid. Need to draft an investor email? Fine. It's the generalist that covers 80% of use cases without breaking a sweat.
The plugin ecosystem and GPTs marketplace give it a huge advantage. My team uses a custom GPT I built for our standup summaries that pulls from Linear and Slack. It saves us 15 minutes every morning. You can't do that with Claude or Gemini.
The voice mode is legitimately great for rubber duck debugging. I'll talk through a problem while driving and ChatGPT actually keeps up with the conversation. The multimodal capabilities are strong too. I've taken photos of whiteboard sketches and had it convert them into structured specs.
I dropped it to 4 stars because GPT-4o's coding has gotten worse since the beginning of the year. It used to nail complex coding tasks on the first try. Now it hallucinates APIs that don't exist more often than it should. The latest model updates feel like they optimized for speed over accuracy.
Also the pricing is confusing. Plus, Team, Enterprise, API, different rate limits, different features. Just give me one plan that does everything.
Still my recommendation for anyone who wants one AI tool that does it all. It's not the best at any single thing anymore, but it's good enough at everything to be indispensable.
Three months in and I can't go back to regular VS Code
I was skeptical about Cursor for the longest time. Another AI code editor? Sure. But a client basically forced me to try it and now I'm a convert. The tab completion alone saves me probably 30 minutes a day on boilerplate. It doesn't just complete the current line, it anticipates the next 3-4 lines based on what you're building.
The Cmd+K inline editing is where it really shines for freelance work. I can select a function, type 'add error handling and retry logic' and it rewrites the function in place. For the kind of repetitive full-stack work I do across client projects, this is enormous.
The codebase indexing means it actually understands your project. I asked it to write a new API endpoint and it automatically matched the patterns from my existing routes, used the right ORM methods, and even imported from the correct paths. No other tool does this as well.
Composer mode is great for larger refactors. I described a migration from REST to tRPC for a Next.js app and it generated about 80% of the changes correctly across multiple files. Still needed cleanup but it cut a 2-day task to half a day.
The $20/month Pro plan is worth every penny if you're writing code professionally. I make that back in the first hour of saved time each month. Only thing I wish is that it handled monorepos better. With large workspaces it sometimes gets confused about which package's types to use.
This is what agent sandboxing should look like
As someone who manages infrastructure for a team of 30 engineers, Agent Safehouse is exactly what I've been waiting for. We've been rolling our own Docker-based sandboxes for AI agents and it was a nightmare to maintain. Safehouse gives you isolated execution environments with proper network policies, resource limits, and audit logging out of the box.
The setup took maybe 20 minutes. You define your sandbox profiles in YAML, specify what the agent can and can't access, and that's it. We're running it in production with four different AI coding agents and haven't had a single breakout or resource abuse incident since deploying it.
What really sold me was the network isolation. You can give an agent internet access but restrict it to specific domains and ports. So our coding agent can hit npm and GitHub but can't phone home to random IPs. The real-time monitoring dashboard shows exactly what each sandboxed agent is doing, which makes our security team happy.
The filesystem snapshotting is killer too. If an agent messes something up, you can roll back to any checkpoint. We've caught a few cases where agents tried to rm -rf directories they shouldn't have touched.
Honestly can't think of major downsides. The documentation could be more detailed for the advanced networking configs, but the Discord community is super responsive. This should be standard infrastructure for anyone running AI agents in production.
Cut my fine-tuning time by 70% and I'm not exaggerating
I fine-tune models almost every week for our startup's domain-specific tasks and Unsloth has completely changed my workflow. Before Unsloth, fine-tuning a 7B model on our dataset took about 6 hours on a single A100. With Unsloth, same model, same data, 1 hour 45 minutes. The memory savings are insane too.
The QLoRA implementation is best-in-class. I can fine-tune Llama 3.1 70B on a single 48GB GPU that would normally need at least two A100s. For a startup that can't afford to rent massive GPU clusters, this is a game changer.
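The 70B-on-48GB claim checks out on a napkin. Rough memory math, ignoring activations and quantization metadata, so these are back-of-envelope numbers rather than Unsloth's exact footprint:

```python
params = 70e9
bytes_per_param_4bit = 0.5            # 4 bits = half a byte
base_weights_gb = params * bytes_per_param_4bit / 1e9

# QLoRA trains only small adapter matrices, so optimizer state
# (which dominates full fine-tuning) stays tiny.
lora_fraction = 0.01                  # ~1% trainable params, a typical ballpark
optimizer_bytes_per_param = 8         # Adam keeps two fp32 moments
optimizer_gb = params * lora_fraction * optimizer_bytes_per_param / 1e9

print(f"base weights: ~{base_weights_gb:.0f} GB")   # ~35 GB
print(f"optimizer state: ~{optimizer_gb:.1f} GB")   # ~5.6 GB
# ~40 GB before activations: tight but plausible on a 48 GB card,
# versus ~140 GB for fp16 weights alone (hence "at least two A100s" without it).
```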
The API is clean and Pythonic. If you've used Hugging Face's trainer before, you'll feel right at home. They've got good defaults for learning rate, batch size, and gradient accumulation so you don't have to babysit the training run. I literally just point it at my dataset, pick a base model, and let it rip.
The 4-bit quantization quality is surprisingly good. I was skeptical at first but our evals show less than 2% quality drop compared to full precision fine-tuning. For production use cases, that trade-off is absolutely worth it.
Only complaint is that the model export options are a bit limited. Getting your fine-tuned model into GGUF format for Ollama requires a few extra steps that could be streamlined. But that's a minor thing. If you're doing any kind of model fine-tuning and you're not using Unsloth, you're literally wasting money and time.
Solid AI code editor, best in class for the workflow
Cursor took the VS Code experience and bolted real AI into it. Tab completions are genuinely useful, not just autocomplete on steroids. The inline editing with Cmd+K is where it shines. Codebase context awareness means it actually understands what you are working on. Gets confused sometimes on larger monorepos and the AI can hallucinate file paths. Worth the subscription if you write code daily.
Best reasoning model on the market right now
Been using Claude daily for about 6 months now, mostly for coding and writing tasks. The reasoning capability on Opus is genuinely impressive. It catches edge cases I wouldn't think of and explains its thinking clearly. Sonnet is my go-to for everyday stuff because it's fast and still pretty sharp. The API is clean and well documented. Context window is massive, which matters when you're working with large codebases.

Only real gripes: rate limits can be annoying during heavy usage, and the pricing on Opus adds up quick if you're not careful. Also wish they had better image generation; that's still a weak spot compared to OpenAI. But for pure reasoning and code, nothing else comes close right now.
Impressive demo, mixed results in practice
I wanted Devin to work so badly. An autonomous AI software engineer that can take a GitHub issue and ship a PR? Sign me up. And to be fair, the demos are jaw-dropping. Watching it navigate a codebase, write code, run tests, and debug errors autonomously is wild.
In practice, the results are more nuanced. For well-defined, isolated tasks with clear specs, it's genuinely useful. I gave it a task to add pagination to an API endpoint and it nailed it. Good code, proper tests, clean PR. That saved me maybe 2 hours.
But for anything ambiguous or architecturally complex, it struggles. I asked it to refactor our authentication system from JWT to session-based and it went down a rabbit hole for 45 minutes, wrote code that didn't compile, and then spent another 30 minutes trying to fix its own bugs. I could have done it myself in less time.
The autonomous nature is both a feature and a bug. When it works, you feel like you have a superpower. When it doesn't, you've wasted an hour and still have to do the work yourself. There's no middle ground. You're either saved or you're frustrated.
The billing model is rough too. You're paying for compute time whether Devin succeeds or fails. I've had sessions where it burned through $15 of credits spinning its wheels on something I could have specified better.
I'll keep using it for specific, well-scoped tasks where the specs are crystal clear. But it's not replacing a developer anytime soon. It's more like a really expensive intern who's great at following instructions but can't think independently yet.
Finally a research agent that doesn't hallucinate every other sentence
I've been using DeerFlow 2.0 for about two weeks now and it's become my go-to for deep research tasks. The multi-agent workflow is genuinely impressive. You give it a topic and it breaks down the research into sub-tasks, assigns different agents to each part, and then synthesizes everything into a coherent report.
The best part is the human-in-the-loop design. It doesn't just run off and generate 10 pages of nonsense. It checks in with you at key decision points so you can steer the research. I used it to do competitive analysis on three different vector database solutions and the output was actually usable. Like, I sent it to my team without editing.
The code execution sandbox is clutch too. It can run Python to verify claims, generate charts, and pull live data. Not just regurgitating training data.
My main gripe is speed. The multi-agent orchestration adds real latency. A simple research task that should take 2 minutes ends up taking 8-10 because of all the back-and-forth between agents. Also the UI could use work. It's functional but feels like a prototype.
Still, for serious research where accuracy matters more than speed, this is the best open-source option I've found. The fact that it's built on LangGraph means you can customize the workflow if you're willing to get your hands dirty.
The backbone of my entire local AI setup
Ollama is one of those tools that just works and keeps getting better. I run local models for everything: coding assistance, document Q&A, even a personal chatbot for brainstorming. Ollama makes running all of this trivial.
The model library is massive now. Llama 3.1, Mistral, Phi, DeepSeek, Qwen, whatever you want. One command to pull, one command to run. The API is OpenAI-compatible so every tool that works with GPT-4 works with your local models too. I've got it hooked up to Continue in VS Code, a RAG pipeline with LlamaIndex, and a custom Telegram bot.
Performance on Apple Silicon is excellent. I run Llama 3.1 8B on my M2 MacBook Pro and get about 40 tokens/sec. Totally usable for interactive coding. The 70B quantized model runs too, just slower. For my desktop with a 3090, the CUDA support is solid.
The Modelfile system is underrated. You can create custom model configs with specific system prompts, temperature settings, and context windows. I've got like 8 different Modelfiles for different tasks.
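For example, a Modelfile for a terse code-review persona looks like this (model choice and settings are just my illustration, not a recommended config):

```
FROM llama3.1:8b
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM "You are a terse code reviewer. Point out bugs and risky patterns first."
```

Then `ollama create reviewer -f Modelfile` builds it and `ollama run reviewer` uses it, so each task gets its own preconfigured model name.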
Taking off a star because multi-modal support is still catching up. Vision models work but the experience isn't as polished as the text-only stuff. Also, I wish there was a built-in way to expose Ollama over the network securely. Right now I'm proxying through nginx with auth, which works but it'd be nice to have native auth.
The best reasoning model for anything that requires actual thinking
I've been benchmarking LLMs for my research and Claude consistently outperforms on tasks that require genuine reasoning rather than pattern matching. Multi-step math, code debugging, analyzing complex papers, anything where you need the model to actually think through a problem.
The extended thinking feature with Claude is a game changer for research work. I give it a complex ML paper and ask it to identify methodological flaws, and it actually walks through the math, checks assumptions, and finds real issues. GPT-4 tends to give surface-level critiques. Claude goes deep.
For coding, it's become my primary assistant. The context window is huge, so I can dump an entire codebase into a conversation and it keeps track of everything. I refactored a 2000-line Python module with Claude and it maintained consistency across the whole thing. It understands architectural patterns, not just syntax.
The system prompt adherence is noticeably better than competitors. When I set up specific output formats or constraints, Claude follows them consistently. With other models I'm constantly fighting against the model wanting to do its own thing.
The artifacts feature in the web UI is underrated for producing structured outputs. I use it to generate LaTeX tables, comparison matrices, and formatted research notes.
The only reason anyone wouldn't pick Claude is if they need real-time web access built in, which it doesn't have natively. But for pure reasoning and coding tasks, nothing else comes close right now.
Genuinely love this one
It works perfectly for me. I don't really have much else to say.
Highly recommend
Good value for the features you get.
Exceeded expectations
Helped me save a ton of time on repetitive tasks.
A keeper
After trying several alternatives, I settled on the OpenAI API. The balance of features and usability is right. Updates seem frequent, which is always a good sign.
Really solid tool
This fills a gap I didn't know I had in my toolkit. The AI capabilities are genuinely useful, not just marketing speak. Recommended it to my team already.
Average experience
OpenRouter does the job but I wish it had more customization options.
Average experience
Does the job but I wish it had more customization options.
It's okay
It's okay. Does what it promises but nothing more.
Great experience so far
Better than I expected honestly.
Does the job well
I was skeptical of Runway at first but gave it a shot anyway. Turns out the hype is mostly deserved. There are some rough edges but nothing that breaks the experience.
Exceeded expectations
OpenAI Whisper does exactly what it says. No complaints here.
Good stuff
After trying several alternatives, I settled on this one. The balance of features and usability is right. Updates seem frequent which is always a good sign.
Worth checking out
SkillGuard does exactly what it says. No complaints here.
Good stuff
Started using this about a month ago for my daily workflow. It's become essential for how I work now. The integrations with other apps I use made setup painless.
It's okay
It's okay. Does what it promises but nothing more.
Does the job well
Started using this about a month ago for my daily workflow. It's become essential for how I work now. The integrations with other apps I use made setup painless.
Not bad
Pi is average overall. Works fine for simple stuff.
A keeper
Was skeptical at first but gave it a shot anyway. Turns out the hype is mostly deserved. There are some rough edges but nothing that breaks the experience.
Exceeded expectations
Been using Poe for a few weeks now. Does the job well.
Exceeded expectations
Solid tool. Not perfect but gets most things right.
Had issues
Had issues getting it to work consistently. Support was slow.
Really solid tool
Pretty good overall. The interface is clean.
Impressed with this
I've tested a lot of similar tools and this one stands out. The learning curve is minimal and you can start getting results almost immediately. Customer support responded within hours when I had a question.
Exceeded expectations
Pretty good overall. The interface is clean.
Really solid tool
Helped me save a ton of time on repetitive tasks.
Worth checking out
Started using this about a month ago for my daily workflow. It's become essential for how I work now. The integrations with other apps I use made setup painless.
Room to grow
Not bad but not amazing either. Room for improvement.
Room to grow
Together AI is average overall. Works fine for simple stuff.
Decent option
Mixed feelings. Some features are great, others need work.