Get selected tag from html file

Problem :

I have the source of a page and I need to get all <script> tags from this file. Order is important. I need both external and inline scripts. The <script> tags themselves must be included in the output. I'm looking for a Linux console tool.

I tried searching but couldn't find anything, so I ended up using jQuery to obtain this information and pasted it into a file. That output has some strange encoding, though, so I need to parse the file the traditional way.

Example:
Input:

<html>
  <head>
    <script src="script1.js"></script>
    <script src="script2.js"></script>
    <script>alert('hello');</script>
  </head>
  <body>
    <div id="main">...</div>
    <script src="footer.js">
  </body>
</html>

Output:

<script src="script1.js"></script>
<script src="script2.js"></script>
<script>alert('hello');</script>
<script src="footer.js">

Second example: output only the src attributes.

script1.js
script2.js
inline script 
footer.js

Solution :

You can use grep for that and its only-matching parameter (-o), e.g.:

$ grep -o "<[^>]*>" <(curl -s http://example.com/)

This will print all the HTML tags in the order they appear in the document.

To include only <script> tags, try (replace index.html with your file):

$ grep -Eo "<script.*(</script>|>)" index.html

To get just the file names (from the src attribute), pipe the result through another grep that extracts the double-quoted values, e.g.:

$ grep -Eo "<script.*(</script>|>)" index.html | grep -o '"[^"]*"' | tr -d '"'

The commands above won't cope with many variations of HTML code, such as tags that span several lines or attributes in single quotes. Using regex to parse HTML is in general not advised, so for more complex cases you should use an appropriate tool (a parser in the language of your preference, or check out these shell tools).
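
As one possible "language of your preference" option, here is a minimal sketch using Python's standard-library html.parser. The names ScriptCollector, list_scripts and SAMPLE are made up for this illustration, and the "inline script" placeholder simply mirrors the question's second example. It records each <script> start tag in document order and tolerates the unclosed footer.js tag.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect <script> start tags in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []  # raw start tags, e.g. '<script src="a.js">'
        self.srcs = []  # src value, or "inline script" when there is none

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            # get_starttag_text() returns the source text of the current start tag.
            self.tags.append(self.get_starttag_text())
            self.srcs.append(dict(attrs).get("src") or "inline script")


def list_scripts(html):
    """Return (start_tags, srcs) for every <script> found in the html string."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.srcs


# Sample input taken from the question, including the unclosed last tag.
SAMPLE = """<html>
  <head>
    <script src="script1.js"></script>
    <script src="script2.js"></script>
    <script>alert('hello');</script>
  </head>
  <body>
    <div id="main">...</div>
    <script src="footer.js">
  </body>
</html>"""

tags, srcs = list_scripts(SAMPLE)
print("\n".join(srcs))
```

Run on the sample, this should print script1.js, script2.js, inline script and footer.js, one per line. To process a real file, read its contents and pass the string to list_scripts. Note that tags holds only the opening tags; capturing the inline script bodies as well would need a handle_data override.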

I know you've accepted an answer already, but I also want to add that you can look into XPath.

It's designed specifically for querying XML-style data.

In your case, the XPath expression for this would be:

//script
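
To try the expression without installing anything, here is a rough sketch using Python's standard-library xml.etree.ElementTree, which supports a limited XPath subset. It assumes well-formed markup, so the XHTML sample below is a closed-tag variant of the question's input (an assumption for this example). The original input with its unclosed <script> would be rejected by this parser, and real-world HTML generally needs a lenient parser such as the third-party lxml package instead.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the question's sample (assumption: every tag is closed).
XHTML = """<html>
  <head>
    <script src="script1.js"></script>
    <script>alert('hello');</script>
  </head>
  <body>
    <script src="footer.js"></script>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# ".//script" is ElementTree's spelling of the XPath "//script":
# every <script> element below the root, in document order.
script_sources = [
    element.get("src", "inline script")
    for element in root.findall(".//script")
]
print("\n".join(script_sources))
```

For this sample the list contains script1.js, inline script and footer.js, in that order.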

Here is another example of someone using XPath to parse HTML.
