AI data best practices

When you add your own data, you will sometimes find that the AI's answer is not based on the data you wanted.

This typically happens when several data entries are related to each other and highly similar.

For example: one data entry about register, one about login, and one about reset password.

Sometimes the AI's scoring picks the "login" data when the actual chat is about "cannot login".

In that case, "reset password" is the data you actually expect to have the highest score.
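
To make the scoring problem easier to picture, here is a minimal sketch that ranks three data entries against a chat message. It is only a toy: it scores by shared words (cosine similarity over word counts), which is an assumption for illustration and not how the real AI scoring is implemented. The sample texts and the names `tokenize`, `cosine`, and `rank` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical data entries, written only for this illustration.
data = {
    "register": "How to register a new account with your email.",
    "login": "How to login to your account with your email and password.",
    "reset password": "How to reset your password when you forget it.",
}


def rank(chat: str) -> list[tuple[str, float]]:
    """Return (data name, score) pairs sorted from highest to lowest score."""
    query = tokenize(chat)
    scores = {name: cosine(query, tokenize(text)) for name, text in data.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank("I cannot login to my account")
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these sample texts, "login" ranks first and "reset password" ranks last, because the chat shares the words "login" and "account" with the login data and almost nothing with the reset password data.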

Let's dig deeper.

Short vs long data

Using the previous case, why would I need to separate them into 3 data entries?

I can merge them and use 1 data entry instead!

Yes, this is one of the correct approaches.

Longer data covers a broader range.

It preserves the overall topic, so you should not worry too much.

Still, if you have other data on a similar topic, the problem will remain.

This solution only works if you combine all related content into one data entry.

But this raises another issue: your cost will possibly increase.

So the second solution is to cut it into super specific, short data.

Only around 200-500 characters each.

The data must be super specific and must not overlap with other data.

For example: "register", "login", "reset password".

These share a common topic: account.

Therefore, you need to rephrase each data entry so that it is very specific to "register", "login", or "reset password".
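
If you want to check your short data before uploading it, a small script can help. The sketch below tests the 200-500 character range and measures how many words two data entries share (Jaccard overlap). The function names and the 0.5 threshold are assumptions chosen for illustration, so tune them to your own data.

```python
import re

MIN_CHARS, MAX_CHARS = 200, 500  # range suggested above


def words(text: str) -> set[str]:
    """Lowercased set of words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def length_ok(text: str) -> bool:
    """True when the data length is within the suggested range."""
    return MIN_CHARS <= len(text) <= MAX_CHARS


def overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets: 0.0 = nothing shared, 1.0 = same words."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def overlapping_pairs(entries: dict[str, str], threshold: float = 0.5) -> list[tuple[str, str]]:
    """Names of data pairs whose overlap reaches the (assumed) threshold."""
    names = sorted(entries)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if overlap(entries[a], entries[b]) >= threshold
    ]
```

Any pair returned by `overlapping_pairs` is a candidate for rephrasing, so that each data entry only talks about its own topic.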

Long data:
  + keeps the context
  + easy to do comparison and get related data
  + easier to maintain
  - less precise
  - costs more
  - slower response

Short data:
  - separates the context
  - cannot do comparison, and related data is hard to get
  - harder to maintain
  + more precise
  + costs less
  + faster response

When short data is good:

  • Sensitive content where correctness is important.
  • Scenarios where the answer needs to be very specific.
  • Focused chats that only need a very specific part of the information.

Example: register, login, reset password

When long data is good:

  • Your data needs to stay related to each other even though the subdata/sections are different.
  • Topics whose parts require a lot of context from each other to be accurate.
  • Comparison or global data where the subdata is meaningless without the rest (like "What are the differences between X and Y?" or all API parameter data).

Example: pricing, API documentation, step-by-step data

Add description & category

After cutting the data short, you can also add some additional information and categorize it.

With this, your data gains additional context and the AI will understand it better.

Inside the data, you can also add some keywords related to the data.

For example: instead of "our lite plan", use "our lite plan includes API and webhook support".

Do avoid ambiguity inside the data.

Avoid "all plans have basic feature"; instead use "all plans include API and webhook support".
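
One way to keep description, category, and keywords consistent is to write each data entry from a small template. The sketch below is only an illustration: the `DataEntry` class and its field names are hypothetical and do not come from any Fonnte API, and the description text is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataEntry:
    """Hypothetical container for one short data entry."""
    title: str
    category: str
    description: str
    content: str
    keywords: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Build the final text: context lines first, then the content."""
        lines = [
            f"Title: {self.title}",
            f"Category: {self.category}",
            f"Description: {self.description}",
        ]
        if self.keywords:
            lines.append("Keywords: " + ", ".join(self.keywords))
        lines.append(self.content)
        return "\n".join(lines)


entry = DataEntry(
    title="Lite plan",
    category="pricing",
    description="What the lite plan includes.",  # made-up description
    content="Our lite plan includes API and webhook support.",
    keywords=["lite plan", "API", "webhook"],
)
text = entry.render()
print(text)
```

The rendered text can then be pasted in as the data, so every entry carries its category and keywords in the same place.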

Add example

You can also add more examples so the AI can model its answer on them.

For example, in the documentation, Fonnte offers some examples in PHP.

These can be included as examples.

Another thing you can do is add incoming message examples.

So you can add something like:
question example : "how to order?", "how much is ai quota price?"
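
To see why question examples can help, here is a rough sketch that appends them to a data entry and counts the words the entry shares with an incoming chat. Word counting is a simplification assumed for illustration, not the real AI scoring. The content text and the function names are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased set of words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def with_question_examples(content: str, questions: list[str]) -> str:
    """Append example incoming messages to a data entry."""
    quoted = ", ".join(f'"{q}"' for q in questions)
    return f"{content}\nquestion example : {quoted}"


def shared_words(chat: str, data_text: str) -> int:
    """Number of distinct words the chat and the data have in common."""
    return len(words(chat) & words(data_text))


# Hypothetical content, written only for this illustration.
content = "Orders are placed from the order page."
questions = ["how to order?", "how much is ai quota price?"]
enriched = with_question_examples(content, questions)

chat = "how do i order ai quota"
before = shared_words(chat, content)
after = shared_words(chat, enriched)
print(before, after)
```

In this sample, the enriched data shares more words with the chat than the original content does, which is the effect the question examples are meant to have.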

Conclusion

We cannot ask the AI to always correctly use specific data from our collection of data.

AI works using the power of statistics, which means it's all about probability.

What we can do is increase the probability by using the known best practices above, and let the statistics work as they are.
