198 points by akrulino 45 days ago | 119 comments

kamma4434 44 days ago [-]

Nice to see Asterisk on the home page of HN. It’s been a while…

Even if the focus is now on hosted telephony, my experience is that everywhere you can hear the default nusic-on-hold

louky 44 days ago [-]

I don't mind music-on-hold as much except when they break with a human voice at intervals saying inane things such as "your call is important to us. Thank you for holding" Which makes me think the call has been answered. If there's not actual call queue information, just play music so I can ignore it and continue working without that NOP interrupt.

ale42 44 days ago [-]

I guess the comment was just about how much Asterisk deployments are out there because you can often hear its default music-on-hold? But maybe I'm wrong.

jonway 44 days ago [-]

thats funny, i interpreted this as "Hosted Asterisk People Sell a Default Experience" (and probably charge for custom messages) so its cool to see something different and marketable

WD-42 44 days ago [-]

What is the application of this that makes anything better for anyone? All I can think of is more spammers, scammers, horrible customer support lines.

bronco21016 44 days ago [-]

My local dealership adopted one of these for their service department. Prior to the AI assistant, you would call and immediately be placed on hold because the same people serving customers at the desk were answering the phone. The AI picks up right away and can schedule an oil change. It's fantastic and I'd love to see more of this. Of course it also has the ability to escalate to a human for the things it doesn't have capabilities for.

For narrow use cases like this I personally don't mind these tools.

WD-42 44 days ago [-]

If they have the ability to hook an llm up to a system that can schedule an oil change why can’t they provide a form on their website to do the same thing and save everyone the hassle?

Rebelgecko 44 days ago [-]

Just ask your LLM to call the dealership. The only downside is spoken word is a bit slow for computers. Maybe we can even work out a protocol where the LLM voices talk faster and faster until they can't hear tokens clearly

WD-42 44 days ago [-]

At that point we’ll have to convert the voices into a form more amenable to machine to machine communication. Perhaps a system based on high and low signals.

Seriously what is the point of all this.

ffsm8 44 days ago [-]

Pretty sure rebelgecko was using [Sarcasm] in order to increase his [Satire] enjoyment

satvikpendem 42 days ago [-]

Some people just like to call over navigate to a site and fill out a form. Sometimes just speaking is easier.

PoachedEggs 44 days ago [-]

Both can be true. They might have those web forms, while also having enough customers that prefer phone calls to justify it.

mh- 44 days ago [-]

Not everyone wants to use a form on their website?

You and I certainly do, but a ton of people prefer calling on the phone.

brunoarueira 44 days ago [-]

In Brazil, multiple companies are offering a call and WhatsApp, both through automated messages with menus and in the end escalate to humans.

bronco21016 44 days ago [-]

They do, via the manufacturer's app. It works fine as well.

Situational context matters though, sometimes you get in the vehicle and get the alert. Just say "Hey Siri, call dealership" and away you go hands free. No messing with apps.

quaheezle 44 days ago [-]

They do offer the ability to schedule an oil change via the website and yet some people still prefer to call. User preference and multi-channel servicing options are nice to support

WD-42 44 days ago [-]

They do? How do you know?

Antoniocl 44 days ago [-]

Unsure about whether the specific dealership in question supports online booking, but there existing consumers whose preference is for a phone call over a web-based experience is definitely the case, at least in the US.

For example, even with the (digital-only) SAAS company I work at, we have a non-trivial amount of customers who with strong preferences to talk on the phone, ex to provide their credit card number, rather than enter it in the product. This is likely more pronounced if your product serves less tech-savvy niches.

That said, a strong preference for human call > website use doesn't necessarily imply even a weak preference for AI call > website use (likely customer-dependent, but I'd be surprised if the number with that preference was exactly 0)

44 days ago [-]

keeganpoppen 44 days ago [-]

how about the plainly obvious fact that every call tree system first spends 1-8 minutes going through all the things that you can actually do on the website instead of calling: do you really think they would bother with that if people aren’t calling about stuff that is easily done on the website? sure, we all agree that it is partly designed to get people to hang up in disgust and give up, but that is an obviously insufficient explanation compared to the simpler and more comprehensive explanation that people simply do, as a matter of fact, prefer to use the phone despite it being clearly less useful for easily-computer-able tasks.

tguvot 44 days ago [-]

>sure, we all agree that it is partly designed to get people to hang up in disgust and give up

actually, as someone who works in this area - no, it's not. it designed to help people to do things and metrics of success are closely monitored

keeganpoppen 39 days ago [-]

fair enough. i totally believe that, and for the record i threw that bit in as an olive branch to the parent commenter… in retrospect, i shouldn’t have even included that rhetorical sludge. major chesterton’s fence area, that.

HeatrayEnjoyer 44 days ago [-]

It is often both.

tguvot 44 days ago [-]

not in my experience and it will be more expensive to company

HeatrayEnjoyer 44 days ago [-]

Have you never worked in customer service?

krashidov 44 days ago [-]

because a lot of people still prefer voice communication over navigating the internet and filling out a form

dmos62 44 days ago [-]

Good customer support lines? Is there a reason why it can't provide good support. I often use chatgpt's voice function.

WD-42 44 days ago [-]

How? Businesses will use this to justify removing what few actual human support staff they have left. Nobody, and I mean it, nobody calls customer support because they want to talk to a computer. It’s the last resort for problems that usually can’t be accomplished via the already existing technical flows available via computer.

dmos62 44 days ago [-]

That's not true. I recently called to make an appointment. I don't care if it's an AI. I would actually prefer it, because I wouldn't feel bad about taking a long time to pick the best time. Don't you think you're being a bit dogmatic about this?

f33d5173 44 days ago [-]

I have to feel that an online booking system is substantially lower tech than an ai voice assitant chatbot, and makes it even easier to ruminate as you pick the time that works for you.

dmos62 44 days ago [-]

In my case, the business did have human support assistants, but didn't do reservations via the phone. I had to switch to the web app for that, which was annoying (I was driving?). I guess doing user identification over the phone and scheduling the appointment are time-consuming for the human assistant, while these are some of the few things an app can do well. I presume the logic is to preserve human assistants for actually complicated or dynamic assistance, for the sake of cost-efficiency. A voice llm can bring down the cost of these basic-but-time-consuming interactions.

b112 44 days ago [-]

Beyond true.

I wonder what Amazon's goals are, as an example. Currently, at least on the .ca website, there is no way to even get to chat to fix problems. All their spider text of help options, now always lead back to the return page.

So it's call them (which you can only find the number via Google.)

I suspect they're so disfunctional, that they don't understand why the massive uptick in calls, so then they slap AI in via phone too.

And so now that's slow and AI drivel. I guess soon I'll just have to do chargebacks!? Eg, if a package is missing or whatever.

Antoniocl 44 days ago [-]

Interesting, I regularly use chat-based support on amazon.ca to speak with (what I presume is) a real human after none of the control flow paths adequately resolve my issue. I've always found the support quick to reply and very helpful.

Granted, it's been 1-2 weeks since I had an issue, so it may have changed since then, or it could be only released to a subset of users.

awad 44 days ago [-]

Amazon is generally good at 1) resolving an issue in your favor and 2) getting you to a human if needed but gosh does it feel like I've taken a different path to do so every single time I'ever needed support.

b112 44 days ago [-]

I wonder if I'm stuck on the (A)wseome/(B)ad side of A/B testing.

TZubiri 44 days ago [-]

The expectation of customer support lines is that customers want to speak to humans. It isn't just the fact that these are semantics that aren't written anywhere and are open to change, because by using a human-like voice agent on a customer support line, you are pretending that that is a human, which is a scam or fraud.

If you really believe that the support can be good, then use a robotic text to speech, don't pretend it's a human. And make it clear to users that they are talking to a human, phone is a protocol that has the semantic that you speak to a human. Use something else.

The bottom line is that you have clients that registered under the belief that they could call a phone number and speak to a human, businesses are performing a short-term switcheroo at the expense of their clients, it's a scam.

znpy 44 days ago [-]

> The expectation of customer support lines is that customers want to speak to humans.

Not really. The expectation is to be able to express their need in a natural language, maybe because their issue is not covered by a fixed-form web form (pun not intended).

So yeah AI might be a good fit in that scenario.

TZubiri 44 days ago [-]

It's a protocol and network that has backwards compatibility with 19th century telegram wire networks that later were voice lines for a full century.

If that isn't the channel to speak to a human, nothing is. You can speak to a bot with an app or whatever.

At least make it sound robotty instead of pretending to be a human.

Kim_Bruning 44 days ago [-]

This might be good for Back-office hands-free tool access, for employees who are on the road. (they shouldn't be looking at the screen, and they might be limited to voice calling due to coverage issues besides). Aka: really weird terminal.

anhner 44 days ago [-]

> What is the application of this?

spammers, scammers and horrible customer support lines.

NedF 44 days ago [-]

[dead]

asdfsfds 44 days ago [-]

[flagged]

mcny 44 days ago [-]

It is funny you mention Microsoft s customer support because it is a publicly known issue at this point that I'd you are a Microsoft employee or a v dash, the first level of support you talk to is basically something you have to overcome to get any help at all.

krater23 45 days ago [-]

Please don't. I had a talk with a shitty AI bot on a Fedex line. It's absolute crap. Just give me a 'Type 1 for x, type 2 for y'. Then I don't need to guess what are the possibilities.

EvanAnderson 44 days ago [-]

Voice-controlled phone systems are hugely rage-inducing for me. I am often in loud setting with background chatter. Muting my audio and using a touchtone keypad is so much more accurate and easy than having to find a quiet place and worrying that somebody is going to say something that the voice response system detects.

ssl-3 44 days ago [-]

I hate those, too. Especially when others are around.

The interface is so inconsistent between different implementations that they're always terribly awkward to navigate at best, and completely infuriating at worst. I don't like presenting the image of an progressively-angrier man who is standing around and speaking incongruous short phrases that are clearly directed towards nobody at all.

But I've found that many of them still accept DTMF. Just mash a button instead of utter a response, and a more-traditional IVR tree shows up with a spoken list of enumerated options. Things get a lot better after that.

Like pushing buttons at the gas pump to try to silence the ad-roll, it's pretty low-cost to try.

9x39 44 days ago [-]

One problem is once you’re in deep building a phone IVR workflow beyond X or Y (yes, these are intentional), callers don’t care about some deep and featured input menu. They just mash 0 or pick a random option and demand a human finish the job and transfer them - understandably.

When you’re committed to phone intent complexity (hell), the AI assisted options are sort of less bad since you don’t have to explain the menu to callers, they just make demands.

tartoran 44 days ago [-]

What if the goal is to keep gaslighting you until you give up your demands?

9x39 44 days ago [-]

Most voice agents for large companies are a calculated game to deter customers from expensive humans as we know, but not always.

Sort of like how Jira can be a streamlined tool or a prison of 50-step workflows, it's all up to the designer.

8note 44 days ago [-]

you bought something from the wrong company, and you arent gonna get helped by phone, bot, or person

russdill 44 days ago [-]

The problem here is that if it's something a voice assistant can solve, I can solve it from my account. I'm calling because I need to speak to an actual human.

hectormalot 44 days ago [-]

Im in this business, and used to think the same. It turns out this is a minority of callers. Some examples:

- a client were working does advertising in TV commercials, and a few percent of their calls is people trying to cancel their TV subscriptions, even though they are in healthcare - in the troubleshooting flow for a client with a physical product, 40% of calls are resolved after the “did you try turning it off and on again” step. - a health insurance client has 25% of call volume for something that is available self-service (and very visible as well), yet people still call. - a client in the travel space gets a lot of calls about: “does my accommodation include X”, and employees just use their public website to answer those questions. (I.e., it’s clearly available for self-service)

One of the things we tend to prioritize in the initial conversation is to determine in which segment you fall and route accordingly.

mcny 44 days ago [-]

(reposting because something ate your newlines, I've added comments in line)

Im in this business, and used to think the same. It turns out this is a minority of callers. Some examples:

- a client were working does advertising in TV commercials, and a few percent of their calls is people trying to cancel their TV subscriptions, even though they are in healthcare

I guess these are probably desperate people who are trying to get to someone, anyone. In my opinion, the best thing people can do is get a really good credit card and do a charge back for things like this.

- in the troubleshooting flow for a client with a physical product, 40% of calls are resolved after the “did you try turning it off and on again” step.

I bought a Chinese wifi mesh router and it literally finds a time between two am and five am and reboots itself every night, by default. You can turn this behavior off but it was interesting that it does this by default.

- a health insurance client has 25% of call volume for something that is available self-service (and very visible as well), yet people still call.

In my defense, I've been on the other side of this. I try to avoid calling but whenever I use self service, it feels like ny settings never stick and always switch back to what they want the next billing cycle. If I have to waste time each month, you have to waste time each month.

- a client in the travel space gets a lot of calls about: “does my accommodation include X”, and employees just use their public website to answer those questions. (I.e., it’s clearly available for self-service)

These public websites are regularly out of date. Someone who is actually on site confirm that yes, they have non smoking rooms or ice machines that aren't broken is valuable.

One of the things we tend to prioritize in the initial conversation is to determine in which segment you fall and route accordingly.

hectormalot 44 days ago [-]

Thx, forgot to double enter.

cyberax 44 days ago [-]

Well, the future is here: https://www.youtube.com/watch?v=HbDnxzrbxn4

Tikrong 44 days ago [-]

Looked at a repo and it seems this project was heavily AI generated. Would be interesting to hear from the author of the project how it was built.

ghurtado 44 days ago [-]

> Looked at a repo and it seems this project was heavily AI generated

That's the first thing that I noticed too.

It's gotten to the point that my body subconsciously rejects bullet lists and headings that start with emojis.

scsh 44 days ago [-]

Independent of any feelings about the use of AI, I find that specific style of formatting to be extremely visually distracting and tend to click away because of that. Emoji, imo, are very information dense per character so it makes the rest of the info hard to parse.

Tikrong 44 days ago [-]

When I see this style first thing that comes to mind is that the whole text was generated and maybe not even read by the author. And I am not yet confident to trust AI in writing the docs without supervision. So I have troubles in trusting the whole thing I see because of that.

scsh 44 days ago [-]

That's also a totally fair take!

Tikrong 44 days ago [-]

Yep, and the number of commits is insane. In early commits there are cursor rules and reports about task complexity made by AI. So it's really interesting, how was the project created.

fennecbutt 44 days ago [-]

Yes, developers on my teams have started submitting PRs exactly like this and the content of the PR is similar. Still need to figure out how to put a stop to it.

hkjarral 44 days ago [-]

[dead]

RockRobotRock 44 days ago [-]

Long shot, if anyone here is an Asterisk wizard. I would like to correlate CDRs to voicemail recording locations. I am building an integrated dashboard for call recordings, and want voicemails to be included, but that's been surprisingly difficult.

parrotplatform 44 days ago [-]

So, there is no straight forward way that I can think of. But, what I would do is set a channel variable and log it somewhere:

exten => s,n,Set(VM_UNIQUEID=${UNIQUEID}) exten => s,n,VoiceMail(${EXTEN}@default)

If you are using AGI or ARI, you can log it somewhere useful so you can correlate later.

If you are using a more vanilla configuration I’d say use the voicemail metadata .txt file that will be in the same folder as the recording to get info to find the CDR. It has things like callerid, origmailbox, origdate (or maybe it’s origtime), and duration. origmailbox should match the CDR destination and the orig time should also match. Haven’t done this specifically. But, I’m hoping I’m pointing you in the right direction.

I work with Freeswitch almost exclusively these days. But, my first experience with VoIP was Asterisk and a huge Perl AGI file keeping everyone talking to each other. Those were good time!

RockRobotRock 43 days ago [-]

You are awesome!

wild_egg 44 days ago [-]

The baseline configurations all note <2s and <3s times. I haven't tried any voice AI stuff yet but a 3s latency waiting on a reply seems rage inducing if you're actually trying to accomplish something.

Is that really where SOTA is right now?

dnackoul 44 days ago [-]

I've generally observed latency of 500ms to 1s with modern LLM-based voice agents making real calls. That's good enough to have real conversations.

I attended VAPI Con earlier this year, and a lot of the discussion centered on how interruptions and turn detection are the next frontier in making voice agents smoother conversationalists. Knowing when to speak is a hard problem even for humans, but when you listen to a lot of voice agent calls, the friction point right now tends to be either interrupting too often or waiting too long to respond.

The major players are clearly working on this. Deepgram announced a new SOTA (Flux) for turn detection at the conference. Feels like an area where we'll see even more progress in the next year.

hogrug 44 days ago [-]

I think interruptions had better be the top priority. I find text LLMs rage inducing with their BS verbiage that takes multiple prompts to reduce, and they still break promises like one sentence by dropping punctuation. I can't imagine a world where I have to listen to one of these things.

gessha 44 days ago [-]

I wonder if it’s possible to do the apple trick of hiding latency using animations. The audio equivalent can be the chime that Siri does after receiving a request.

echelon 44 days ago [-]

Sesame was the fastest model for a bit. Not sure what that team is doing anymore, they kind of went radio silent.

https://app.sesame.com/

duckkg5 44 days ago [-]

Absolutely not.

500-1000ms is borderline acceptable.

Sub-300ms is closer to SOTA.

2000ms or more means people will hang up.

fragmede 44 days ago [-]

play "Just a second, one moment please <sounds of typing>".wave as soon as input goes quiet.

ChatGPT app has a audio version of the spinner icon when you ask it a question and it needs a second before answering.

phantasmish 44 days ago [-]

I haaaaate the fake typing noises.

emil-lp 44 days ago [-]

Alternate between that and

    play "ehh".wav

matt-p 44 days ago [-]

160ms is essentially optimal and you can get down to about 200ms AFAIK.

abdullahkhalids 44 days ago [-]

With what system?

russdill 44 days ago [-]

Been experimenting with having a local Home Assistant agent include a qwen 0.5B model to provide a quick response to indicate that the agent is "thinking" about the request. It seems to work ok for the use case, but it feels like it'd get really repetitive for a 2 way conversation. Another way to handle this would be to have the small model provide the first 3-5 words of a (non-commital) response and feed that in as part of the prompt to the larger model.

daneel_w 44 days ago [-]

From my personal experience building a few AI IVR demos with Asterisk in early 2025, testing STT/TTS/inference products from a handful of different vendors, a reliable maximum latency of 2-3 seconds sounds like a definite improvement. Just a year ago I saw times from 3 to 8 seconds even on short inputs rendering short outputs. One half of this is of course over-committed resources. But clearly the executional performance of these models is improving.

numpad0 44 days ago [-]

Very randomly and personally I appear to have experimented with that few months ago, based on a Japanese advent calendar project[1] - the code is all over the place and only works with Japanese speeches, but the gist is as follows. Also in [2].

The trick is to NOT wait for the LLM to finish talking, but:

  1 at end of user VAD, call LLM, stream response into a buffer(simple enough)   
  2 chunk the response at [commas, periods, newlines], and queue sentence-oid texts  
  3 pipe queued sentence-oid fragments into a fast classical TTS and queue audio snippets   
  4 play queued sentence-oid-audio-snippets, maintaining correspondence of consumed audio and text 
  5 at user VAD, stop and clear everything everywhere, undoing queued unplayed voice, nuking unplayed text from chat log 
  6 at end of user VAD, feed the amended transcripts that are canonical to user's ears to step 1
  7 (make sure to parallelize it all)

This flow (hypothetically)allow such interactions as:

  user: "what's the date today"
  sys:  "[today][is thursday], [decem"
  user: "sorry yesterday"
  sys:  "[...uh,][wednesday?][Usually?]"

  1: https://developers.cyberagent.co.jp/blog/archives/44592/
  2: https://gist.github.com/numpad0/18ae612675688eeccd3af5eabcfdf686

coderintherye 44 days ago [-]

Microsoft Foundry's realtime voice API (which itself is wrapping AI models from the major players) has response times in the milliseconds.

wellthisisgreat 44 days ago [-]

No, there are models with sub-second latency for sure

mohsen1 44 days ago [-]

Just try Gemini Live on your phone. That's state of the art

44 days ago [-]

bharrison 44 days ago [-]

Perhaps you didnt read that these are "production-ready golden baselines validated for enterprise deployment."

How does their golden nature not dissuade these concerns for you?

matt3210 44 days ago [-]

I (and most people I ask) definitely don’t answer calls from robots or watch videos with robot voice. I’m not sure what value the customer gets here.

hkjarral 44 days ago [-]

[dead]

aftbit 44 days ago [-]

This opens up new possibilities for interactive phone services. Retro-futuristic for sure.

looneysquash 44 days ago [-]

That seems like bad news for Allison. Though I know she already had some TTS voices available, so many not.

kamma4434 44 days ago [-]

She seems to be doing well

https://m.youtube.com/watch?v=wairnc-2Hyo

eugene3306 44 days ago [-]

I've created Asterisk Codex Skill, but turns out there is ten seconds timeout for scripts

nextworddev 45 days ago [-]

Can I connect this to Twilio

kwindla 44 days ago [-]

One easy way to build voice agents and connect them to Twilio is the Pipecat open source framework. Pipecat supports a wide variety of network transports, including the Twilio MediaStream WebSocket protocol so you don't have to bounce through a SIP server. Here's a getting started doc.[1]

(If you do need SIP, this Asterisk project looks really great.)

Pipecat has 90 or so integrations with all the models/services people use for voice AI these days. NVIDIA, AWS, all the foundation labs, all the voice AI labs, most of the video AI labs, and lots of other people use/contribute to Pipecat. And there's lots of interesting stuff in the ecosystem, like the open source, open data, open training code Smart Turn audio turn detection model [2], and the Pipecat Flows state machine library [3].

[1] - https://docs.pipecat.ai/guides/telephony/twilio-websockets [2] - https://github.com/pipecat-ai/pipecat-flows/ [3] - https://github.com/pipecat-ai/smart-turn

Disclaimer: I spend a lot of my time working on Pipecat. Also writing about both voice AI in general and Pipecat in particular. For example: https://voiceaiandvoiceagents.com/

ldenoue 44 days ago [-]

The problem with PipeCat and LiveKit (the 2 major stacks for building voice ai) is the deployment at scale.

That’s why I created a stack entirely in Cloudflare workers and durable objects in JavaScript.

Providers like AssemblyAI and Deepgram now integrate VAD in their realtime API so our voice AI only need networking (no CPU anymore).

nextworddev 44 days ago [-]

let me get this straight, you are storing convo threads / context in DOs?

e.g. Deepgram (STT) via websocket -> DO -> LLM API -> TTS?

ldenoue 42 days ago [-]

Yes DO let you handle long lived websocket connections. I think this is unique to Cloudflare. AWS or Google Cloud don't seem to offer these things (statefulness basically).

Same with TTS: some like Deepgram and ElevenLabs let you stream the LLM text (or chunks per sentence) over their websocket API, making your Voice AI bot really really low latency.

nextworddev 44 days ago [-]

This is good stuff.

In your opinion, how close is Pipecat + OSS to replacing proprietary infra from Vapi, Retell, Sierra, etc?

kwindla 44 days ago [-]

It depends on what you mean by replacing.

The integrated developer experience is much better on Vapi, etc.

The goal of the Pipecat project is to provide state of the art building blocks if you want to control every part of the multimodal, realtime agent processing flow and tech stack. There are thousands of companies with Pipecat voice agents deployed at scale in production, including some of the world's largest e-commerce, financial services, and healthtech companies. The Smart Turn model benchmarks better than any of the proprietary turn detection models. Companies like Modal have great info about how to build agents with sub-second voice-to-voice latency.[1] Most of the next-generation video avatar companies are building on Pipecat.[2] NVIDIA built the ACE Controller robot operating system on Pipecat.[3]

[1] https://modal.com/blog/low-latency-voice-bot - [2] https://lemonslice.com/ = [3] https://github.com/NVIDIA/ace-controller/

nextworddev 44 days ago [-]

Is there a simple, serverless version of deploying Pipecat stack, without: - me having to self host on my infra

I just want to provide: - business logic - tools - configuration metadata (e.g. which voice to use)

I don't like Vapi due to 1) extensive GUI driven experience, 2) cost

ldenoue 42 days ago [-]

Check out something like LayerCode (Cloudflare based).

Or PipeCat Cloud / LiveKit cloud (I think they charge 1 cent per minute?)

ldenoue 44 days ago [-]

I developed a stack on Cloudflare workers where latency is super low and it is cheap to run at scale thanks to Cloudflare pricing.

Runs at around 50 cents per hour using AssemblyAI or Deepgram as the STT, Gemini Flash as LLM and InWorld.ai as the TTS (for me it’s on par with ElevenLabs and super fast)

picardo 44 days ago [-]

Is AssemblyAI or Deepgram compatible with OpenAI Realtime API, esp. around voice activity detection and turn taking? How do you implement those?

ldenoue 42 days ago [-]

I am not using speech to speech APIs like OpenAI, but it would be easy to swap the STT + LLM + TTS to using Realtime (or Gemini Live API for that matter).

OpenAI realtime voices are really bad though, so you can also configure your session to accept AUDIO and output TEXT, and then use any TTS provider (like ElevenLabs or InWord.ai, my favorite for cost) so generate the audio.

pugio 44 days ago [-]

Do you have anything written up about how you're doing this? Curious to learn more...

ldenoue 42 days ago [-]

I don't but I should open source this code. I was trying to sell to OEM though, that's why. Are you interested in licensing it?

VladVladikoff 44 days ago [-]

Technically yes, twilio has sip trunks.

44 days ago [-]

johnebgd 45 days ago [-]

I welcome the spam calls from our asterisk overlords.

haroldp 44 days ago [-]

I was more thinking I could add it to my Asterisk server to honey-pot the spam callers into an infinite time waster cycle.

Daviey 44 days ago [-]

"Hello, this is Lenny" - well known Asterisk configuration from 20 years ago.

haroldp 43 days ago [-]

And, “They have been carried away by monkeys!”

43 days ago [-]

VladVladikoff 44 days ago [-]

I’m honestly surprised it hasn’t been more prevalent yet. I still get call centre type spam calls where you can hear all the background noise of the rest of the call centre.

userbinator 44 days ago [-]

Is the background noise real, or is it also AI-generated to make you think that it's a human?

tartoran 44 days ago [-]

The background noise is a recording for sure, no AI needed, just a background noise audiofile in a loop would do.

VladVladikoff 44 days ago [-]

Why though? It adds nothing positive, it only makes me sure it is a scam call.

the_af 44 days ago [-]

I assume it's to make it seem like an actual call center rather than a scam. I recently got two phone scam attempts (credit card related) that sounded exactly like this.

ldenoue 44 days ago [-]

I built a voice AI stack and background noise can be really helpful to a restaurant AI for example. Italian background music or cafe background is part of the brand. It’s not meant to make the caller believe this is not a bot but only to make the AI call on brand.

grim_io 44 days ago [-]

You can call it what ever you like, but to me this is deceptive.

Where is the difference between this and Indian support staff pretending to be in your vicinity by telling you about the local weather? Your version is arguably even worse because it can plausibly fool people more competently.

ldenoue 42 days ago [-]

It doesn't have to be. You can configure your bot to great the user. E.g. "Aleksandra is not available at the moment, but I'm her AI assistant to help you book a table. How may I help you?"

So you're telling the caller that it is an AI, and yet you can have a pleasant background audio experience.

SoftTalker 44 days ago [-]

you actually answer unknown callers?

Loughla 44 days ago [-]

Yes. I own a business.

mcny 44 days ago [-]

Also, it only takes one legitimate collect call from a jail from a loved one and now I'm all in favor of reform in our jail system.

No, it does not cost over thirty dollars to allow someone accused to call their loved ones. We pay taxes. I want my government to use the taxes and provide these calls for free.

the_af 44 days ago [-]

Yes. Sometimes it's a legit call. Not often, though.

Example of legit calls: the pizza delivery guy decided to call my phone instead of ringing the bell, for whatever reason.

mcny 44 days ago [-]

I worked door dash for a couple of days and there were multiple people who wrote in all caps to not ring the door bell. Why? I have no idea.

SoftTalker 44 days ago [-]

Probably because it make their dogs go nuts.

quesera 44 days ago [-]

Sleeping children or shift workers, too.

44 days ago [-]

Loading comments...

kamma4434 44 days ago [-]

Nice to see Asterisk on the home page of HN. It’s been a while…

Even if the focus is now on hosted telephony, my experience is that everywhere you can hear the default nusic-on-hold

louky 44 days ago [-]

ale42 44 days ago [-]

I guess the comment was just about how much Asterisk deployments are out there because you can often hear its default music-on-hold? But maybe I'm wrong.

jonway 44 days ago [-]

thats funny, i interpreted this as "Hosted Asterisk People Sell a Default Experience" (and probably charge for custom messages) so its cool to see something different and marketable

WD-42 44 days ago [-]

What is the application of this that makes anything better for anyone? All I can think of is more spammers, scammers, horrible customer support lines.

bronco21016 44 days ago [-]

For narrow use cases like this I personally don't mind these tools.

WD-42 44 days ago [-]

If they have the ability to hook an llm up to a system that can schedule an oil change why can’t they provide a form on their website to do the same thing and save everyone the hassle?

Rebelgecko 44 days ago [-]

WD-42 44 days ago [-]

At that point we’ll have to convert the voices into a form more amenable to machine to machine communication. Perhaps a system based on high and low signals.

Seriously what is the point of all this.

ffsm8 44 days ago [-]

Pretty sure rebelgecko was using [Sarcasm] in order to increase his [Satire] enjoyment

satvikpendem 42 days ago [-]

Some people just like to call over navigate to a site and fill out a form. Sometimes just speaking is easier.

PoachedEggs 44 days ago [-]

Both can be true. They might have those web forms, while also having enough customers that prefer phone calls to justify it.

mh- 44 days ago [-]

Not everyone wants to use a form on their website?

You and I certainly do, but a ton of people prefer calling on the phone.

brunoarueira 44 days ago [-]

In Brazil, multiple companies are offering a call and WhatsApp, both through automated messages with menus and in the end escalate to humans.

bronco21016 44 days ago [-]

They do, via the manufacturer's app. It works fine as well.

Situational context matters though, sometimes you get in the vehicle and get the alert. Just say "Hey Siri, call dealership" and away you go hands free. No messing with apps.

quaheezle 44 days ago [-]

They do offer the ability to schedule an oil change via the website and yet some people still prefer to call. User preference and multi-channel servicing options are nice to support

WD-42 44 days ago [-]

They do? How do you know?

Antoniocl 44 days ago [-]

44 days ago [-]

keeganpoppen 44 days ago [-]

tguvot 44 days ago [-]

>sure, we all agree that it is partly designed to get people to hang up in disgust and give up

actually, as someone who works in this area - no, it's not. it designed to help people to do things and metrics of success are closely monitored

keeganpoppen 39 days ago [-]

HeatrayEnjoyer 44 days ago [-]

It is often both.

tguvot 44 days ago [-]

not in my experience and it will be more expensive to company

HeatrayEnjoyer 44 days ago [-]

Have you never worked in customer service?

krashidov 44 days ago [-]

because a lot of people still prefer voice communication over navigating the internet and filling out a form

dmos62 44 days ago [-]

Good customer support lines? Is there a reason why it can't provide good support. I often use chatgpt's voice function.

WD-42 44 days ago [-]

dmos62 44 days ago [-]

f33d5173 44 days ago [-]

I have to feel that an online booking system is substantially lower tech than an ai voice assitant chatbot, and makes it even easier to ruminate as you pick the time that works for you.

dmos62 44 days ago [-]

b112 44 days ago [-]

Beyond true.

So it's call them (which you can only find the number via Google.)

I suspect they're so disfunctional, that they don't understand why the massive uptick in calls, so then they slap AI in via phone too.

And so now that's slow and AI drivel. I guess soon I'll just have to do chargebacks!? Eg, if a package is missing or whatever.

Antoniocl 44 days ago [-]

Granted, it's been 1-2 weeks since I had an issue, so it may have changed since then, or it could be only released to a subset of users.

awad 44 days ago [-]

b112 44 days ago [-]

I wonder if I'm stuck on the (A)wseome/(B)ad side of A/B testing.

TZubiri 44 days ago [-]

znpy 44 days ago [-]

> The expectation of customer support lines is that customers want to speak to humans.

Not really. The expectation is to be able to express their need in a natural language, maybe because their issue is not covered by a fixed-form web form (pun not intended).

So yeah AI might be a good fit in that scenario.

TZubiri 44 days ago [-]

It's a protocol and network that has backwards compatibility with 19th century telegram wire networks that later were voice lines for a full century.

If that isn't the channel to speak to a human, nothing is. You can speak to a bot with an app or whatever.

At least make it sound robotty instead of pretending to be a human.

Kim_Bruning 44 days ago [-]

anhner 44 days ago [-]

> What is the application of this?

spammers, scammers and horrible customer support lines.

NedF 44 days ago [-]

[dead]

asdfsfds 44 days ago [-]

[flagged]

mcny 44 days ago [-]

krater23 45 days ago [-]

Please don't. I had a talk with a shitty AI bot on a Fedex line. It's absolute crap. Just give me a 'Type 1 for x, type 2 for y'. Then I don't need to guess what are the possibilities.

EvanAnderson 44 days ago [-]

ssl-3 44 days ago [-]

I hate those, too. Especially when others are around.

Like pushing buttons at the gas pump to try to silence the ad-roll, it's pretty low-cost to try.

9x39 44 days ago [-]

When you’re committed to phone intent complexity (hell), the AI assisted options are sort of less bad since you don’t have to explain the menu to callers, they just make demands.

tartoran 44 days ago [-]

What if the goal is to keep gaslighting you until you give up your demands?

9x39 44 days ago [-]

Most voice agents for large companies are a calculated game to deter customers from expensive humans as we know, but not always.

Sort of like how Jira can be a streamlined tool or a prison of 50-step workflows, it's all up to the designer.

8note 44 days ago [-]

you bought something from the wrong company, and you arent gonna get helped by phone, bot, or person

russdill 44 days ago [-]

The problem here is that if it's something a voice assistant can solve, I can solve it from my account. I'm calling because I need to speak to an actual human.

hectormalot 44 days ago [-]

Im in this business, and used to think the same. It turns out this is a minority of callers. Some examples:

One of the things we tend to prioritize in the initial conversation is to determine in which segment you fall and route accordingly.

mcny 44 days ago [-]

(reposting because something ate your newlines, I've added comments in line)

Im in this business, and used to think the same. It turns out this is a minority of callers. Some examples:

- a client were working does advertising in TV commercials, and a few percent of their calls is people trying to cancel their TV subscriptions, even though they are in healthcare

- in the troubleshooting flow for a client with a physical product, 40% of calls are resolved after the “did you try turning it off and on again” step.

- a health insurance client has 25% of call volume for something that is available self-service (and very visible as well), yet people still call.

These public websites are regularly out of date. Someone who is actually on site confirm that yes, they have non smoking rooms or ice machines that aren't broken is valuable.

One of the things we tend to prioritize in the initial conversation is to determine in which segment you fall and route accordingly.

hectormalot 44 days ago [-]

Thx, forgot to double enter.

cyberax 44 days ago [-]

Well, the future is here: https://www.youtube.com/watch?v=HbDnxzrbxn4

Tikrong 44 days ago [-]

Looked at a repo and it seems this project was heavily AI generated. Would be interesting to hear from the author of the project how it was built.

ghurtado 44 days ago [-]

> Looked at a repo and it seems this project was heavily AI generated

That's the first thing that I noticed too.

It's gotten to the point that my body subconsciously rejects bullet lists and headings that start with emojis.

scsh 44 days ago [-]

Tikrong 44 days ago [-]

scsh 44 days ago [-]

That's also a totally fair take!

Tikrong 44 days ago [-]

Yep, and the number of commits is insane. In early commits there are cursor rules and reports about task complexity made by AI. So it's really interesting, how was the project created.

fennecbutt 44 days ago [-]

Yes, developers on my teams have started submitting PRs exactly like this and the content of the PR is similar. Still need to figure out how to put a stop to it.

hkjarral 44 days ago [-]

[dead]

RockRobotRock 44 days ago [-]

parrotplatform 44 days ago [-]

So, there is no straight forward way that I can think of. But, what I would do is set a channel variable and log it somewhere:

exten => s,n,Set(VM_UNIQUEID=${UNIQUEID}) exten => s,n,VoiceMail(${EXTEN}@default)

If you are using AGI or ARI, you can log it somewhere useful so you can correlate later.

I work with Freeswitch almost exclusively these days. But, my first experience with VoIP was Asterisk and a huge Perl AGI file keeping everyone talking to each other. Those were good time!

RockRobotRock 43 days ago [-]

You are awesome!

wild_egg 44 days ago [-]

Is that really where SOTA is right now?

dnackoul 44 days ago [-]

I've generally observed latency of 500ms to 1s with modern LLM-based voice agents making real calls. That's good enough to have real conversations.

The major players are clearly working on this. Deepgram announced a new SOTA (Flux) for turn detection at the conference. Feels like an area where we'll see even more progress in the next year.

hogrug 44 days ago [-]

gessha 44 days ago [-]

I wonder if it’s possible to do the apple trick of hiding latency using animations. The audio equivalent can be the chime that Siri does after receiving a request.

echelon 44 days ago [-]

Sesame was the fastest model for a bit. Not sure what that team is doing anymore, they kind of went radio silent.

https://app.sesame.com/

duckkg5 44 days ago [-]

Absolutely not.

500-1000ms is borderline acceptable.

Sub-300ms is closer to SOTA.

2000ms or more means people will hang up.

fragmede 44 days ago [-]

play "Just a second, one moment please <sounds of typing>".wave as soon as input goes quiet.

ChatGPT app has a audio version of the spinner icon when you ask it a question and it needs a second before answering.

phantasmish 44 days ago [-]

I haaaaate the fake typing noises.

emil-lp 44 days ago [-]

Alternate between that and

    play "ehh".wav

matt-p 44 days ago [-]

160ms is essentially optimal and you can get down to about 200ms AFAIK.

abdullahkhalids 44 days ago [-]

With what system?

russdill 44 days ago [-]

daneel_w 44 days ago [-]

numpad0 44 days ago [-]

The trick is to NOT wait for the LLM to finish talking, but:

  1 at end of user VAD, call LLM, stream response into a buffer(simple enough)   
  2 chunk the response at [commas, periods, newlines], and queue sentence-oid texts  
  3 pipe queued sentence-oid fragments into a fast classical TTS and queue audio snippets   
  4 play queued sentence-oid-audio-snippets, maintaining correspondence of consumed audio and text 
  5 at user VAD, stop and clear everything everywhere, undoing queued unplayed voice, nuking unplayed text from chat log 
  6 at end of user VAD, feed the amended transcripts that are canonical to user's ears to step 1
  7 (make sure to parallelize it all)

This flow (hypothetically)allow such interactions as:

  user: "what's the date today"
  sys:  "[today][is thursday], [decem"
  user: "sorry yesterday"
  sys:  "[...uh,][wednesday?][Usually?]"

  1: https://developers.cyberagent.co.jp/blog/archives/44592/
  2: https://gist.github.com/numpad0/18ae612675688eeccd3af5eabcfdf686

coderintherye 44 days ago [-]

Microsoft Foundry's realtime voice API (which itself is wrapping AI models from the major players) has response times in the milliseconds.

wellthisisgreat 44 days ago [-]

No, there are models with sub-second latency for sure

mohsen1 44 days ago [-]

Just try Gemini Live on your phone. That's state of the art

44 days ago [-]

bharrison 44 days ago [-]

Perhaps you didnt read that these are "production-ready golden baselines validated for enterprise deployment."

How does their golden nature not dissuade these concerns for you?

matt3210 44 days ago [-]

I (and most people I ask) definitely don’t answer calls from robots or watch videos with robot voice. I’m not sure what value the customer gets here.

hkjarral 44 days ago [-]

[dead]

aftbit 44 days ago [-]

This opens up new possibilities for interactive phone services. Retro-futuristic for sure.

looneysquash 44 days ago [-]

That seems like bad news for Allison. Though I know she already had some TTS voices available, so many not.

kamma4434 44 days ago [-]

She seems to be doing well

https://m.youtube.com/watch?v=wairnc-2Hyo

eugene3306 44 days ago [-]

I've created Asterisk Codex Skill, but turns out there is ten seconds timeout for scripts

nextworddev 45 days ago [-]

Can I connect this to Twilio

kwindla 44 days ago [-]

(If you do need SIP, this Asterisk project looks really great.)

[1] - https://docs.pipecat.ai/guides/telephony/twilio-websockets [2] - https://github.com/pipecat-ai/pipecat-flows/ [3] - https://github.com/pipecat-ai/smart-turn

Disclaimer: I spend a lot of my time working on Pipecat. Also writing about both voice AI in general and Pipecat in particular. For example: https://voiceaiandvoiceagents.com/

ldenoue 44 days ago [-]

The problem with PipeCat and LiveKit (the 2 major stacks for building voice ai) is the deployment at scale.

That’s why I created a stack entirely in Cloudflare workers and durable objects in JavaScript.

Providers like AssemblyAI and Deepgram now integrate VAD in their realtime API so our voice AI only need networking (no CPU anymore).

nextworddev 44 days ago [-]

let me get this straight, you are storing convo threads / context in DOs?

e.g. Deepgram (STT) via websocket -> DO -> LLM API -> TTS?

ldenoue 42 days ago [-]

Yes DO let you handle long lived websocket connections. I think this is unique to Cloudflare. AWS or Google Cloud don't seem to offer these things (statefulness basically).

Same with TTS: some like Deepgram and ElevenLabs let you stream the LLM text (or chunks per sentence) over their websocket API, making your Voice AI bot really really low latency.

nextworddev 44 days ago [-]

This is good stuff.

In your opinion, how close is Pipecat + OSS to replacing proprietary infra from Vapi, Retell, Sierra, etc?

kwindla 44 days ago [-]

It depends on what you mean by replacing.

The integrated developer experience is much better on Vapi, etc.

[1] https://modal.com/blog/low-latency-voice-bot - [2] https://lemonslice.com/ = [3] https://github.com/NVIDIA/ace-controller/

nextworddev 44 days ago [-]

Is there a simple, serverless version of deploying Pipecat stack, without: - me having to self host on my infra

I just want to provide: - business logic - tools - configuration metadata (e.g. which voice to use)

I don't like Vapi due to 1) extensive GUI driven experience, 2) cost

ldenoue 42 days ago [-]

Check out something like LayerCode (Cloudflare based).

Or PipeCat Cloud / LiveKit cloud (I think they charge 1 cent per minute?)

ldenoue 44 days ago [-]

I developed a stack on Cloudflare workers where latency is super low and it is cheap to run at scale thanks to Cloudflare pricing.

Runs at around 50 cents per hour using AssemblyAI or Deepgram as the STT, Gemini Flash as LLM and InWorld.ai as the TTS (for me it’s on par with ElevenLabs and super fast)

picardo 44 days ago [-]

Is AssemblyAI or Deepgram compatible with OpenAI Realtime API, esp. around voice activity detection and turn taking? How do you implement those?

ldenoue 42 days ago [-]

I am not using speech to speech APIs like OpenAI, but it would be easy to swap the STT + LLM + TTS to using Realtime (or Gemini Live API for that matter).

pugio 44 days ago [-]

Do you have anything written up about how you're doing this? Curious to learn more...

ldenoue 42 days ago [-]

I don't but I should open source this code. I was trying to sell to OEM though, that's why. Are you interested in licensing it?

VladVladikoff 44 days ago [-]

Technically yes, twilio has sip trunks.

44 days ago [-]

johnebgd 45 days ago [-]

I welcome the spam calls from our asterisk overlords.

haroldp 44 days ago [-]

I was more thinking I could add it to my Asterisk server to honey-pot the spam callers into an infinite time waster cycle.

Daviey 44 days ago [-]

"Hello, this is Lenny" - well known Asterisk configuration from 20 years ago.

haroldp 43 days ago [-]

And, “They have been carried away by monkeys!”

43 days ago [-]

VladVladikoff 44 days ago [-]

I’m honestly surprised it hasn’t been more prevalent yet. I still get call centre type spam calls where you can hear all the background noise of the rest of the call centre.

userbinator 44 days ago [-]

Is the background noise real, or is it also AI-generated to make you think that it's a human?

tartoran 44 days ago [-]

The background noise is a recording for sure, no AI needed, just a background noise audiofile in a loop would do.

VladVladikoff 44 days ago [-]

Why though? It adds nothing positive, it only makes me sure it is a scam call.

the_af 44 days ago [-]

I assume it's to make it seem like an actual call center rather than a scam. I recently got two phone scam attempts (credit card related) that sounded exactly like this.

ldenoue 44 days ago [-]

grim_io 44 days ago [-]

You can call it what ever you like, but to me this is deceptive.

ldenoue 42 days ago [-]

It doesn't have to be. You can configure your bot to great the user. E.g. "Aleksandra is not available at the moment, but I'm her AI assistant to help you book a table. How may I help you?"

So you're telling the caller that it is an AI, and yet you can have a pleasant background audio experience.

SoftTalker 44 days ago [-]

you actually answer unknown callers?

Loughla 44 days ago [-]

Yes. I own a business.

mcny 44 days ago [-]

Also, it only takes one legitimate collect call from a jail from a loved one and now I'm all in favor of reform in our jail system.

No, it does not cost over thirty dollars to allow someone accused to call their loved ones. We pay taxes. I want my government to use the taxes and provide these calls for free.

the_af 44 days ago [-]

Yes. Sometimes it's a legit call. Not often, though.

Example of legit calls: the pizza delivery guy decided to call my phone instead of ringing the bell, for whatever reason.

mcny 44 days ago [-]

I worked door dash for a couple of days and there were multiple people who wrote in all caps to not ring the door bell. Why? I have no idea.

SoftTalker 44 days ago [-]

Probably because it make their dogs go nuts.

quesera 44 days ago [-]

Sleeping children or shift workers, too.

44 days ago [-]