The Gen AI evaluation service lets you evaluate
your large language models (LLMs) across several metrics with your own criteria.
You can provide inference-time inputs, LLM responses and additional
parameters, and the Gen AI evaluation service returns metrics specific to the
evaluation task.
Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory
computed metrics, such as rouge, bleu, and tool function-call metrics.
PointwiseMetric and PairwiseMetric are generic model-based metrics that
you can customize with your own criteria.
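For example, the following is a minimal sketch of a custom pointwise metric using the Vertex AI SDK for Python. The metric name custom_text_quality, the rating criteria, and the dataset columns are illustrative assumptions for this sketch, not part of the API.

import pandas as pd

import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric

vertexai.init(project="your-project-id", location="us-central1")

# Define a generic pointwise metric with your own criteria. The template is
# rendered from the dataset columns it references ({prompt} and {response}).
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",  # illustrative metric name
    metric_prompt_template=(
        "Rate the RESPONSE to the PROMPT from 1 to 5 for clarity and accuracy, "
        "and explain the rating.\n\nPROMPT: {prompt}\nRESPONSE: {response}"
    ),
)

eval_dataset = pd.DataFrame(
    {
        "prompt": ["Explain what a rainbow is."],
        "response": ["A rainbow appears when sunlight is refracted and reflected by raindrops."],
    }
)

result = EvalTask(dataset=eval_dataset, metrics=[custom_text_quality]).evaluate()
print(result.summary_metrics)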
Because the evaluation service takes prediction results directly from models as input,
it can perform both inference and subsequent evaluation on any model supported by
Vertex AI.
The following are limitations of the evaluation service:
The evaluation service might have a propagation delay on your first call.
Most model-based metrics consume
gemini-2.0-flash quota
because the Gen AI evaluation service leverages gemini-2.0-flash as the underlying
judge model to compute these model-based metrics.
Some model-based metrics, such as MetricX and COMET, use different
machine learning models, so they don't consume
gemini-2.0-flash quota.
Note: MetricX and COMET will not be charged during preview. At GA, the pricing will be the same as for all other pointwise model-based metrics.
exact_match_input
Optional: ExactMatchInput Input to assess if the prediction matches the reference exactly.
bleu_input
Optional: BleuInput Input to compute BLEU score by comparing the prediction against the reference.
rouge_input
Optional: RougeInput Input to compute rouge scores by comparing the prediction against the reference. Different rouge scores are supported by rouge_type.
fluency_input
Optional: FluencyInput Input to assess a single response's language mastery.
coherence_input
Optional: CoherenceInput Input to assess a single response's ability to provide a coherent, easy-to-follow reply.
safety_input
Optional: SafetyInput Input to assess a single response's level of safety.
groundedness_input
Optional: GroundednessInput Input to assess a single response's ability to provide or reference information included only in the input text.
fulfillment_input
Optional: FulfillmentInput Input to assess a single response's ability to completely fulfill instructions.
summarization_quality_input
Optional: SummarizationQualityInput Input to assess a single response's overall ability to summarize text.
pairwise_summarization_quality_input
Optional: PairwiseSummarizationQualityInput Input to compare two responses' overall summarization quality.
summarization_helpfulness_input
Optional: SummarizationHelpfulnessInput Input to assess a single response's ability to provide a summarization that contains the details necessary to substitute for the original text.
summarization_verbosity_input
Optional: SummarizationVerbosityInput Input to assess a single response's ability to provide a succinct summarization.
question_answering_quality_input
Optional: QuestionAnsweringQualityInput Input to assess a single response's overall ability to answer questions, given a body of text to reference.
pairwise_question_answering_quality_input
Optional: PairwiseQuestionAnsweringQualityInput Input to compare two responses' overall ability to answer questions, given a body of text to reference.
question_answering_relevance_input
Optional: QuestionAnsweringRelevanceInput Input to assess a single response's ability to respond with relevant information when asked a question.
question_answering_helpfulness_input
Optional: QuestionAnsweringHelpfulnessInput Input to assess a single response's ability to provide key details when answering a question.
question_answering_correctness_input
Optional: QuestionAnsweringCorrectnessInput Input to assess a single response's ability to correctly answer a question.
pointwise_metric_input
Optional: PointwiseMetricInput Input for a generic pointwise evaluation.
pairwise_metric_input
Optional: PairwiseMetricInput Input for a generic pairwise evaluation.
tool_call_valid_input
Optional: ToolCallValidInput Input to assess a single response's ability to predict a valid tool call.
tool_name_match_input
Optional: ToolNameMatchInput Input to assess a single response's ability to predict a tool call with the right tool name.
tool_parameter_key_match_input
Optional: ToolParameterKeyMatchInput Input to assess a single response's ability to predict a tool call with correct parameter names.
tool_parameter_kv_match_input
Optional: ToolParameterKvMatchInput Input to assess a single response's ability to predict a tool call with correct parameter names and values.
comet_input
Optional: CometInput Input to evaluate using COMET.
metricx_input
Optional: MetricxInput Input to evaluate using MetricX.
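Each EvaluateInstances request sets exactly one of the inputs above (the Go samples later on this page populate them through a oneof wrapper). As an illustrative sketch, a request body using exact_match_input might look like the following, assuming ExactMatchInput follows the same metric_spec and instances shape as BleuInput and RougeInput; the prediction and reference strings are placeholders:

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": "Paris",
        "reference": "Paris"
      }
    ]
  }
}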
metric_spec
Optional: RougeSpec Metric spec, defining the metric's behavior.
metric_spec.rouge_type
Optional: string Acceptable values: - rouge[1-9] (for example, rouge1): compute rouge scores based on the overlap of n-grams between the prediction and the reference. - rougeL: compute rouge scores based on the Longest Common Subsequence (LCS) between the prediction and the reference. - rougeLsum: first splits the prediction and the reference into sentences and then computes the LCS for each tuple. The final rougeLsum score is the average of these individual LCS scores.
metric_spec.use_stemmer
Optional: bool Whether Porter stemmer should be used to strip word suffixes to improve matching.
metric_spec.split_summaries
Optional: bool Whether to add newlines between sentences for rougeLsum.
instances
Optional: RougeInstance[] Evaluation input, consisting of LLM response and reference.
instances.prediction
Optional: string LLM response.
instances.reference
Optional: string Golden LLM response for reference.
pairwise_choice
PairwiseChoice: Enum with the following possible values: - BASELINE: The baseline prediction is better. - CANDIDATE: The candidate prediction is better. - TIE: The baseline and candidate predictions are tied.
explanation
string: Justification for the pairwise_choice assignment.
metric_spec
Required: PointwiseMetricSpec Metric spec, defining the metric's behavior.
metric_spec.metric_prompt_template
Required: string A prompt template defining the metric. It is rendered by the key-value pairs in instance.json_instance.
instance
Required: PointwiseMetricInstance Evaluation input, consisting of json_instance.
instance.json_instance
Optional: string The key-value pairs in JSON format. For example, {"key_1": "value_1", "key_2": "value_2"}. It is used to render metric_spec.metric_prompt_template.
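To make the rendering relationship concrete, here is an illustrative pointwise_metric_input payload; the template wording and the question/answer keys are assumptions chosen for this example, not required names:

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": "Score from 1 to 5 how completely the ANSWER addresses the QUESTION.\nQUESTION: {question}\nANSWER: {answer}"
    },
    "instance": {
      "json_instance": "{\"question\": \"What causes tides?\", \"answer\": \"Tides are mainly caused by the Moon's gravitational pull.\"}"
    }
  }
}

Each {key} placeholder in the template is filled from the matching key in json_instance.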
metric_spec
Required: PairwiseMetricSpec Metric spec, defining the metric's behavior.
metric_spec.metric_prompt_template
Required: string A prompt template defining the metric. It is rendered by the key-value pairs in instance.json_instance.
instance
Required: PairwiseMetricInstance Evaluation input, consisting of json_instance.
instance.json_instance
Optional: string The key-value pairs in JSON format. For example, {"key_1": "value_1", "key_2": "value_2"}. It is used to render metric_spec.metric_prompt_template.
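The pairwise variant follows the same pattern. In this illustrative sketch, the baseline_response and candidate_response keys are arbitrary names chosen to match the template, not fixed field names:

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": "Given the PROMPT, decide whether the BASELINE or the CANDIDATE response is better.\nPROMPT: {prompt}\nBASELINE: {baseline_response}\nCANDIDATE: {candidate_response}"
    },
    "instance": {
      "json_instance": "{\"prompt\": \"Define photosynthesis.\", \"baseline_response\": \"Plants make food from light.\", \"candidate_response\": \"Photosynthesis is the process by which plants convert light energy into chemical energy.\"}"
    }
  }
}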
metric_spec
Optional: ToolCallValidSpec Metric spec, defining the metric's behavior.
instance
Optional: ToolCallValidInstance Evaluation input, consisting of LLM response and reference.
instance.prediction
Optional: string Candidate model LLM response, which is a JSON serialized string that contains content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. For example:
{
  "content": "",
  "tool_calls": [
    {
      "name": "book_tickets",
      "arguments": {
        "movie": "Mission Impossible Dead Reckoning Part 1",
        "theater": "Regal Edwards 14",
        "location": "Mountain View CA",
        "showtime": "7:30",
        "date": "2024-03-30",
        "num_tix": "2"
      }
    }
  ]
}
instance.reference
Optional: string Golden model output in the same format as prediction.
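Because prediction and reference are JSON-serialized strings rather than structured fields, it can be convenient to build them programmatically. A minimal Python sketch, reusing the book_tickets example above:

import json

# Tool call from the example above.
tool_call = {
    "name": "book_tickets",
    "arguments": {
        "movie": "Mission Impossible Dead Reckoning Part 1",
        "theater": "Regal Edwards 14",
        "location": "Mountain View CA",
        "showtime": "7:30",
        "date": "2024-03-30",
        "num_tix": "2",
    },
}

# Serialize the content/tool_calls structure into the string the metric expects.
prediction = json.dumps({"content": "", "tool_calls": [tool_call]})
reference = json.dumps({"content": "", "tool_calls": [tool_call]})

instance = {"prediction": prediction, "reference": reference}
print(instance)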
metric_spec
Optional: ToolNameMatchSpec Metric spec, defining the metric's behavior.
instance
Optional: ToolNameMatchInstance Evaluation input, consisting of LLM response and reference.
instance.prediction
Optional: string Candidate model LLM response, which is a JSON serialized string that contains content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls.
instance.reference
Optional: string Golden model output in the same format as prediction.
metric_spec
Optional: ToolParameterKeyMatchSpec Metric spec, defining the metric's behavior.
instance
Optional: ToolParameterKeyMatchInstance Evaluation input, consisting of LLM response and reference.
instance.prediction
Optional: string Candidate model LLM response, which is a JSON serialized string that contains content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls.
instance.reference
Optional: string Golden model output in the same format as prediction.
metric_spec
Optional: ToolParameterKVMatchSpec Metric spec, defining the metric's behavior.
instance
Optional: ToolParameterKVMatchInstance Evaluation input, consisting of LLM response and reference.
instance.prediction
Optional: string Candidate model LLM response, which is a JSON serialized string that contains content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls.
instance.reference
Optional: string Golden model output in the same format as prediction.
metric_spec
Optional: CometSpec Metric spec, defining the metric's behavior.
metric_spec.version
Optional: string COMET_22_SRC_REF: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
metric_spec.source_language
Optional: string Source language in BCP-47 format. For example, "es".
metric_spec.target_language
Optional: string Target language in BCP-47 format. For example, "es".
instance
Optional: CometInstance Evaluation input, consisting of LLM response and reference. The exact fields used for evaluation are dependent on the COMET version.
instance.prediction
Optional: string Candidate model LLM response. This is the output of the LLM which is being evaluated.
instance.source
Optional: string Source text. This is in the original language that the prediction was translated from.
instance.reference
Optional: string Ground truth used to compare against the prediction. This is in the same language as the prediction.
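As an illustrative example of these fields, a comet_input payload for an English-to-Spanish translation might look like the following; the sentences are placeholders:

{
  "comet_input": {
    "metric_spec": {
      "version": "COMET_22_SRC_REF",
      "source_language": "en",
      "target_language": "es"
    },
    "instance": {
      "prediction": "Me gusta aprender idiomas nuevos.",
      "source": "I like learning new languages.",
      "reference": "Me encanta aprender idiomas nuevos."
    }
  }
}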
metric_spec
Optional: MetricxSpec Metric spec, defining the metric's behavior.
metric_spec.version
Optional: string One of the following: - METRICX_24_REF: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing with the provided reference text input. - METRICX_24_SRC: MetricX 24 for translation and source. It evaluates the translation (prediction) by Quality Estimation (QE), without a reference text input. - METRICX_24_SRC_REF: MetricX 24 for translation, source and reference. It evaluates the translation (prediction) using all three inputs.
metric_spec.source_language
Optional: string Source language in BCP-47 format. For example, "es".
metric_spec.target_language
Optional: string Target language in BCP-47 format. For example, "es".
instance
Optional: MetricxInstance Evaluation input, consisting of LLM response and reference. The exact fields used for evaluation are dependent on the MetricX version.
instance.prediction
Optional: string Candidate model LLM response. This is the output of the LLM which is being evaluated.
instance.source
Optional: string Source text which is in the original language that the prediction was translated from.
instance.reference
Optional: string Ground truth used to compare against the prediction. It is in the same language as the prediction.
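For comparison, a metricx_input payload using the reference-free METRICX_24_SRC version only needs the prediction and the source; again, the sentences are placeholders:

{
  "metricx_input": {
    "metric_spec": {
      "version": "METRICX_24_SRC",
      "source_language": "en",
      "target_language": "es"
    },
    "instance": {
      "prediction": "Me gusta aprender idiomas nuevos.",
      "source": "I like learning new languages."
    }
  }
}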
The following example demonstrates how to call the Gen AI evaluation service API to evaluate
the output of an LLM using a variety of pointwise model-based metrics: summarization quality,
groundedness, verbosity, and instruction following.
import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count: 2
# summarization_quality/mean: 3.5
# summarization_quality/std: 2.1213203435596424
# ...
import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster sustainable urban living,
a major city has revealed ambitious plans for an extensive overhaul of its public transportation system.
The project aims not only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options. City officials anticipate
that this strategic investment will enhance accessibility for residents and visitors alike, ushering in
a new era of efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.
	return nil
}
Evaluate an output: pairwise summarization quality
The following example demonstrates how to call the Gen AI evaluation service API to evaluate
the output of an LLM using a pairwise summarization quality comparison.
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.
Save the request body in a file named request.json, and execute the following command:
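The original request body and command are not reproduced above, so the following is a minimal, unofficial sketch. It uses the pairwise_summarization_quality_input fields documented earlier on this page; the uppercase values are placeholders you replace, and the exact field casing accepted by the REST API may differ.

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "context": "CONTEXT",
      "instruction": "INSTRUCTION",
      "baseline_prediction": "BASELINE_RESPONSE",
      "prediction": "CANDIDATE_RESPONSE"
    }
  }
}

Assuming the standard Vertex AI REST pattern for the v1beta1 evaluateInstances method, the command would look roughly like this:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" \
  -d @request.json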
import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...
import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster sustainable urban living,
a major city has revealed ambitious plans for an extensive overhaul of its public transportation system.
The project aims not only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options. City officials anticipate
that this strategic investment will enhance accessibility for residents and visitors alike, ushering in
a new era of efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...
	return nil
}
The following example calls the Gen AI evaluation service API to get the ROUGE score
of a prediction that was generated from a number of inputs. The ROUGE inputs use
metric_spec, which determines the metric's behavior.
ROUGE_TYPE: The calculation used to determine the rouge score. See metric_spec.rouge_type for acceptable values.
USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
SPLIT_SUMMARIES: Determines if new lines are added between rougeLsum sentences. For acceptable values, see metric_spec.split_summaries.
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.
Save the request body in a file named request.json, and execute the following command:
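The original request body and command are not reproduced above; the following is a hedged sketch based on the rouge_input fields documented earlier on this page. ROUGE_TYPE, USE_STEMMER, and SPLIT_SUMMARIES are the placeholders described above, and PREDICTION and REFERENCE stand in for your own strings.

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    },
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ]
  }
}

Then send the request to the evaluateInstances endpoint, assuming the standard Vertex AI REST pattern:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" \
  -d @request.json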
import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located in
            Australia, is a vast and diverse ecosystem. However, it faces serious
            threats from climate change, ocean acidification, and coral bleaching,
            endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
            Queensland, Australia, is the world's largest. It's a complex ecosystem
            supporting diverse marine life, including endangered species. However,
            climate change, ocean acidification, and coral bleaching are serious
            threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
            coast of Australia, is a vast and diverse ecosystem with thousands of
            reefs and islands. It is home to a multitude of marine life, including
            endangered species, but faces serious threats from climate change,
            ocean acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count: 3
# rouge_1/mean: 0.7191161666666667
# rouge_1/std: 0.06765143922270488
# rouge_2/mean: 0.5441118566666666
# ...
# Metrics Table:
#
# response reference ... rouge_l/score rouge_l_sum/score
# 0 The Great Barrier Reef, the world's ... \n The Great Barrier Reef, the ... ... 0.577320 0.639175
# 1 The Great Barrier Reef, a vast coral... \n The Great Barrier Reef, the ... ... 0.552381 0.666667
# 2 The Great Barrier Reef, the world's ... \n The Great Barrier Reef, the ... ... 0.774775 0.774775
import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]
	return nil
}