Get the last segment of a URL

pdessus · March 21, 2024, 7:58am

Dear all,
My project is to fetch some web pages, then to parse their URLs to get only their last segment (i.e., the page names).
I’ve stumbled upon this regular expression but the result is the other way around (page names are removed from the URL):
$File_Name = $URL.replace("/([^/]+)(?=[^/]*/?$)/, ''");
I got the trick from here: Benchmark: Get last part from URL (Regex vs Split vs Substring) - MeasureThat.net

Could you please help me?
best
ph

mwra · March 21, 2024, 11:29am

The referenced page is a bit confusing as its test URL contains no year fragment. So, I assume you are starting with an example like:

https://example.com?=123

and you wish to retrieve the ‘123’. Or, more specifically, as this is using a computer, everything after the ?= in a source URL. Assumption: a ?= sequence should, I think, occur only once.

So, we want to split our URL and retrieve the second part, i.e. ‘123’ for the test above. This should work, but doesn’t:

$MyString = $URL.split("\?=").at(-1);

Note the need to escape the ? as .split() uses a regular expression pattern and ? is a regex special operator character. \? means treat this as a question mark, not a special operator.

Why does this not work? It seems the // in the URL is misread as a comment and causes the code to return a wrong result.

[Edit: it turns out the culprit is .at() and the presence of // in a list item: the same holds whether using .at(0) or [0] list item addressing. Regardless, the problem is now known to the developer.]

We can fix that by first removing the // sequence from the URL. This is done only on the copy being evaluated, so the source $URL value is untouched:

$MyString = $URL.replace("//","").split("\?=").at(-1);

Success. The flow is essentially:

read $URL value (https://example.com?=123)
replace '//' with nothing (https:example.com?=123)
split into list based on '?=' substring   (list [https:example.com;123])
get last item in list (123)

One last wrinkle. Perhaps the query marker is just ? and not ?=, or indeed, could we split on either of the two? Yes, regex to the rescue. In the split pattern we change from "\?=" to "\?=*". The * in regex means “the preceding character occurs zero or more times”, so the pattern matches a ? followed by any number of = characters, including none. Tada!

$MyString = $URL.replace("//","").split("\?=*").at(-1);
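For anyone who wants to check the pattern outside Tinderbox, here is a minimal Python sketch of the same flow. It is only an illustration, not action code: the function name after_query_marker is made up for this example, and it assumes Python's re.split() treats the "\?=*" pattern the same way as .split() does above.

```python
import re


def after_query_marker(url: str) -> str:
    """Return the text after the last '?' or '?=' marker in url."""
    # Mirror the action code: remove the '//' sequence first.
    cleaned = url.replace("//", "")
    # Split on a '?' followed by zero or more '=' characters.
    parts = re.split(r"\?=*", cleaned)
    # The last list item is whatever follows the final marker.
    return parts[-1]


print(after_query_marker("https://example.com?=123"))  # 123
print(after_query_marker("https://example.com?123"))   # 123
```

Note that if the URL has no ? at all, the list has a single item, so the whole (de-slashed) URL comes back rather than an empty string.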

Test TBX doc (tested using v9.7.2): url-split.tbx (102.2 KB)

pdessus · March 21, 2024, 12:16pm

Dear Mark,
My question wasn’t so clear: my goal was to grab the latest segment after the latest slash (a.k.a, the page name).
It wasn’t so difficult to adapt the code:
$File_Name = $URL.replace("//","").split("/=*").at(-1);
this works like a charm!
thanks again,
P.S.: I will report this use case in the forum when completed.
ph

mwra · March 21, 2024, 2:58pm

Glad you’ve got a solution. Aside, it is easiest with this sort of problem—because written language can be ambiguous—to give deliberate before and after examples, i.e. the source string and the part(s) to be saved as discrete values. So in my test case above:

Source: https://example.com?=123

Result: 123

URLs can be quite variable so knowing the exact target can help fellow users here not only solve the test case but to do so in a manner that spots and avoids edge cases (URL variants) when in actual use.

For instance, returning to the thread above for an example, you note you actually wanted to “grab the latest segment after the latest slash (a.k.a, the page name)” so my test was insufficient as the URL, after the protocol part, had no slashes. So perhaps we need a test source URL like:

https://example.com/foobar/index.html?page=123&ref=34

but even then, do we want to capture /index.html?page=123&ref=34, index.html, or something like list[index.html;123;34]? So for the example just above:

var:string vString;
var:list vList1 = [];
var:list vList2 = [];
// get a URL without double slash
vString = $URL.replace("//","");
// split on the query marker: item #0 is the URL, item #1 (or #-1) is the query args
vList1 = vString.split("\?=*");
// get last part of URL, i.e. the text after the last slash
$MyString = vList1[0].split("/").at(-1);

$MyString now has a value of ‘index.html’.

Or at the end, if we split the variable vString on slashes:

// get last part of URL including args
$MyString = vString.split("/").at(-1);

$MyString now has a value of ‘index.html?page=123&ref=34’.
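The two variants above can be compared side by side in a short Python sketch. This is an illustration under assumptions, not Tinderbox code: last_segment and its keep_query flag are hypothetical names for this example, and the '//' removal is kept only to mirror the action code.

```python
import re


def last_segment(url: str, keep_query: bool = False) -> str:
    """Return the text after the last slash, with or without query args."""
    # Mirror the action code: remove the '//' sequence first.
    cleaned = url.replace("//", "")
    if not keep_query:
        # Keep only the part before the '?' (or '?=') query marker.
        cleaned = re.split(r"\?=*", cleaned)[0]
    # The last item after splitting on '/' is the page name.
    return cleaned.split("/")[-1]


url = "https://example.com/foobar/index.html?page=123&ref=34"
print(last_segment(url))                   # index.html
print(last_segment(url, keep_query=True))  # index.html?page=123&ref=34
```

One edge case worth adding to any test set: a URL that ends in a slash, such as https://example.com/foobar/, gives an empty string here, because nothing follows the last slash.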

Computers are in that sense dumb: you have to ask the right question and in a form that does not allow for a different interpretation of your intent. Within action code, the parts using regular expressions (‘regex’ pattern matching) are similarly, and frustratingly, narrow in their understanding. It’s a constant reminder of the human mind’s ability to resolve, by context, ambiguities which completely confuse a computer.

The previous example file with some new tests:
url-split1.tbx (109.6 KB)

HTH

eastgate · March 22, 2024, 7:07pm

The mishandling of lists which have elements that contain “//” is corrected in backstage release b672.